- <?xml version="1.0" encoding="iso-8859-1"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
- <html xml:lang="en" lang="en">
- <head>
- <title>zez.org: about code</title>
- <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
- </head>
-
- <body>
-
- <table width="100%" cellspacing="0" cellpadding="0" border="0">
- <tbody>
- <tr>
- <td align="center"><h3>Found at:
- http://publish.ez.no/article/articleprint/11/</h3>
- </td>
- </tr>
- </tbody>
- </table>
-
- <table width="100%" cellspacing="0" cellpadding="0" border="0">
- <tbody>
- <tr>
- <td><h1>Regular Expressions explained</h1>
- </td>
- </tr>
- </tbody>
- </table>
- <hr noshade="noshade" size="4" />
- <br />
-
-
- <table width="100%" cellspacing="0" cellpadding="0" border="0">
- <tbody>
- <tr>
- <td><p class="byline">Author: <a class="byline"
- href="/article/author/view/26">Jan Borsodi</a></p>
- </td>
- <td align="right"><p class="byline">Publishing date: 30.10.2000
- 18:02</p>
- </td>
- </tr>
- </tbody>
- </table>
-
- <p>This article gives you an introduction to the world of <i>regular
- expressions</i>. I'll start by explaining what regular expressions are and
- introducing their syntax, then show some examples of varying complexity, and
- finish with a list of tools which use <i>regular expressions</i>.</p>
-
- <p><span style="color: #FF0000">[Note: Puppy has a regular expression
- evaluation and learning tool in the Utilities menu]</span></p>
-
- <h2>Concept</h2>
- A <i>regular expression</i> is a text pattern consisting of a combination of
- alphanumeric characters and special characters known as metacharacters. A
- close relative is the <i>wildcard expression</i>, which is often used in
- file management. The pattern is matched against text strings, and a match
- either succeeds or fails. Note that a match can succeed even if not every
- part of the pattern took part in it; this is explained later in the
- article.<br />
- <br />
- You'll find that <i>regular expressions</i> are used in three different ways:
- plain text matching, search and replace, and splitting. Splitting is basically
- the reverse of matching, i.e. you keep everything the <i>regular
- expression</i> did not match.<br />
- <br />
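<p>As a rough illustration of these three uses, here is a minimal sketch in Python (an assumption; the article itself does not use Python) with the standard <i>re</i> module. The sample text and the patterns are made up for the example.</p>

```python
import re

# Made-up sample text for illustration.
text = "one cow, two cows, three cowboys"

# 1. Plain matching: does the pattern occur anywhere in the text?
found = re.search(r"cow", text) is not None

# 2. Search and replace: substitute every occurrence of the pattern.
replaced = re.sub(r"cow", "vache", text)

# 3. Splitting: keep everything the pattern did NOT match.
parts = re.split(r", ", text)

print(found)
print(replaced)
print(parts)
```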
- <i>Regular expressions</i> are often simply called regexps or REs, but for
- consistency I'll refer to them by their full name.<br />
- <br />
- Due to their versatility, <i>regular expressions</i> are widely used in
- text processing and parsing. UNIX users are probably familiar with them
- through programs such as <i>grep</i>, <i>sed</i>, <i>awk</i> and
- <i>ed</i>. Text editors such as <i>(X)Emacs</i> and <i>vi</i> also use them
- heavily. Probably the best-known use of <i>regular expressions</i> is in
- the programming language Perl; you'll find that Perl sports the most advanced
- <i>regular expression</i> implementation to this day.<br />
-
-
- <h2>Usage</h2>
- Now you're probably wondering why you should bother to learn <i>regular
- expressions</i>. If you're a casual computer user, the benefits of using
- them are fairly small. However, if you're a developer or a system
- administrator, you'll find that knowing <i>regular expressions</i> will
- make your (professional) life much easier.<br />
- <br />
- Developers can use them to parse text files, fix up code and work other wonders.
- System administrators can use them to search through logs, automate boring
- tasks and sniff network traffic for unauthorized activity.<br />
- <br />
- Actually I would go so far as to say it's a crime for a System Administrator
- not to have <b>any</b> knowledge of <i>regular expressions</i>.<br />
-
-
- <h2>Quantifiers</h2>
- Before I start explaining the syntax you might want to jump to the Utilities
- section at the end to learn which programs you can use to test the examples
- in this article.<br />
- <br />
- The contents of an expression are, as explained earlier, a combination of
- alphanumeric characters and metacharacters. An alphanumeric character is
- either a letter from the alphabet<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>abc</pre>
- </td>
- </tr>
- </tbody>
- </table>
- or a number<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>123</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Actually, in the world of regular expressions any character which is not a
- metacharacter matches itself (such characters are often called literal
- characters), but most of the time you're concerned with the alphanumeric
- ones. A very special character is the backslash <b>\</b>: it turns a
- metacharacter into a literal character, and can turn an alphanumeric
- character into a sort of metacharacter or sequence. The metacharacters are:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>\ | ( ) [ { ^ $ * + ? .</pre>
- </td>
- </tr>
- </tbody>
- </table>
- With that said, normal characters aren't too interesting, so let's jump to
- our very first metacharacters.<br />
- <br />
- The punctuation mark, or dot, <b>.</b> needs explaining first since it often
- leads to confusion. This character does not, as many might think, match the
- punctuation in a line; it is instead a special metacharacter which matches
- any character. Using it where you wanted to find the period at the end of a
- line or the decimal point in a floating point number will lead to strange
- results. As explained above, you need to backslashify it to get the literal
- meaning. For instance, the expression<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>1.23</pre>
- </td>
- </tr>
- </tbody>
- </table>
- will match the number 1.23 in a text as you might have guessed, but it will
- also match these next lines<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>1x23
- 1 23
- 1-23</pre>
- </td>
- </tr>
- </tbody>
- </table>
- To make the expression <b>only</b> match the floating point number we change
- it to<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>1\.23</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Remember this, it's very important. Now with that said we can get the show
- going.<br />
- <br />
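<p>The difference between the dot and its backslashified form can be checked with a short sketch. It assumes Python and its standard <i>re</i> module, which the article does not itself use; the test lines are the ones listed above.</p>

```python
import re

lines = ["1.23", "1x23", "1 23", "1-23"]

# An unescaped dot matches any character, so every line is a hit.
unescaped_hits = [s for s in lines if re.search(r"1.23", s)]

# A backslashified dot only matches a literal dot.
escaped_hits = [s for s in lines if re.search(r"1\.23", s)]

print(unescaped_hits)
print(escaped_hits)
```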
- Two heavily recurring metacharacters are<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>* and +</pre>
- </td>
- </tr>
- </tbody>
- </table>
- They are called quantifiers and tell the engine to look for several
- occurrences of a character. A quantifier always follows the character it
- applies to. The <b>*</b> character matches zero or more occurrences of the
- character in a row; the <b>+</b> character is similar but matches one or
- more.<br />
- <br />
- So if you decided to find words which have the character <i>c</i> in them,
- you might be tempted to write:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>c*</pre>
- </td>
- </tr>
- </tbody>
- </table>
- What might come as a surprise is that you will find an enormous number
- of matches; even words with no c in them will match. How so, you ask? The
- answer is simple. Recall that the <b>*</b> character matches <b>zero</b> or
- more occurrences, and that is exactly what was matched: zero occurrences.<br />
- You see, in <i>regular expressions</i> it is possible to match what
- is called <b>the empty string</b>, which is simply a string of zero length.
- This empty string can be found in all texts; for instance the
- word:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>go</pre>
- </td>
- </tr>
- </tbody>
- </table>
- contains three empty strings. They are contained at the position right before
- the <b>g</b>, in between the <b>g</b> and the <b>o</b> and after the
- <b>o</b>. And an empty string contains exactly <b>one</b> empty string. At
- first this might seem like a really silly thing to do but you'll learn later
- on how this is used in more complex expressions.<br />
- <br />
- So with this knowledge we might want to change our expression to:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>c+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- and voila we get only words with c in them.<br />
- <br />
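<p>To see the empty string at work, here is a small sketch, assuming Python's standard <i>re</i> module (not something the article uses). The word list is made up; the word <b>go</b> is the one from the text above.</p>

```python
import re

words = ["cow", "dog", "chicken", "go"]

# c* also succeeds on words without a c, because it may match zero characters.
star_hits = [w for w in words if re.search(r"c*", w)]

# c+ needs at least one c.
plus_hits = [w for w in words if re.search(r"c+", w)]

# Each position in "go" (before g, between g and o, after o)
# yields one empty match for c*.
empty_matches = re.findall(r"c*", "go")

print(star_hits)
print(plus_hits)
print(len(empty_matches))
```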
- The next metacharacter you'll learn is:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>?</pre>
- </td>
- </tr>
- </tbody>
- </table>
- This simply tells the engine to either match the character or not (zero or
- one). For instance the expression:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cows?</pre>
- </td>
- </tr>
- </tbody>
- </table>
- will match any of these lines:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow
- cows</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- These three metacharacters are simply a specialized scenario for the more
- generalized quantifier<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>{n,m}</pre>
- </td>
- </tr>
- </tbody>
- </table>
- The <b>n</b> and <b>m</b> are respectively the minimum and maximum number of
- repetitions. For instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>{1,5}</pre>
- </td>
- </tr>
- </tbody>
- </table>
- means match at least one and at most five occurrences. You can also leave out
- m to allow an unlimited number of repetitions:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>{1,}</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which matches one or more occurrences. This is exactly what the <b>+</b>
- character does. So now you see the connection: <b>*</b> is equal to
- <b>{0,}</b>, <b>+</b> is equal to <b>{1,}</b> and <b>?</b> is equal to
- <b>{0,1}</b>.<br />
- The last thing you can do with the quantifier is to also leave out the comma,<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>{5}</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which means to match 5 characters, no more no less.<br />
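<p>The equivalences between the quantifiers can be checked mechanically. The sketch below assumes Python's standard <i>re</i> module; the helper <i>matches_fully</i> and the sample strings are made up for the example.</p>

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["", "a", "aa", "aaaaa", "aaaaaa"]

# Each shorthand quantifier paired with its {n,m} spelling.
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
equivalent = all(
    matches_fully(short, s) == matches_fully(braced, s)
    for short, braced in pairs
    for s in samples
)

one_to_five = [s for s in samples if matches_fully(r"a{1,5}", s)]
exactly_five = [s for s in samples if matches_fully(r"a{5}", s)]
cow_words = [w for w in ["cow", "cows", "cowss"] if matches_fully(r"cows?", w)]

print(equivalent)
print(one_to_five)
print(exactly_five)
print(cow_words)
```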
-
-
- <h2>Assertions</h2>
- The next type of metacharacters are assertions, these will match if a given
- assertion is true. The first pair of assertions are<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>^ and $</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which match the beginning of the line and the end of the line. Note that some
- <i>regular expression</i> implementations allow you to change their behavior
- so that they instead match the beginning and the end of the whole text.
- These assertions always match a zero-length string; in other words
- they match a position. For instance, if you wrote this expression:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>^The</pre>
- </td>
- </tr>
- </tbody>
- </table>
- it would match any line which began with the word <b>The</b>.<br />
- <br />
- The next assertions match at the beginning and end of a word. In GNU
- <i>grep</i> and <i>sed</i> they are written with a leading backslash (Perl
- instead uses <b>\b</b> for both positions):<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>\&lt; and \&gt;</pre>
- </td>
- </tr>
- </tbody>
- </table>
- they come in handy when you want to match a word precisely, for instance:<br
- />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow</pre>
- </td>
- </tr>
- </tbody>
- </table>
- would match any of the following words<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow
- coward
- cowage
- cowboy
- cowl</pre>
- </td>
- </tr>
- </tbody>
- </table>
- A small change to the expression, wrapping it in the two word assertions:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>\&lt;cow\&gt;</pre>
- </td>
- </tr>
- </tbody>
- </table>
- and you'll only match the word <b>cow</b> in the text.<br />
- <br />
- One last thing to be said is that all literal characters are in fact
- assertions themselves; the difference between them and the ones above is that
- literal ones have a size. So for cleanliness' sake we only use the word
- assertion for those that are zero-width.<br />
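<p>Here is a minimal sketch of the assertions in action, assuming Python's standard <i>re</i> module. Python has no separate start-of-word and end-of-word assertions, so the sketch uses <b>\b</b>, which matches at either word boundary. The sample text is made up.</p>

```python
import re

text = "The cow met a coward and a cowboy near the cowl."

# Without assertions, "cow" is also found inside longer words.
loose = re.findall(r"cow", text)

# \b is a zero-width word-boundary assertion; it consumes no characters.
whole = re.findall(r"\bcow\b", text)

# ^ anchors the match to the start of the line.
candidates = ["The end", "At The end"]
starts_with_the = [line for line in candidates if re.search(r"^The", line)]

print(len(loose))
print(whole)
print(starts_with_the)
```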
-
-
- <h2>Groups and alternation</h2>
- One thing you might have noticed when we explained quantifiers is that they
- only worked on the single character to their left. Since this limits our
- expressions quite a bit, I'll explain other uses for quantifiers. Quantifiers
- can also be applied to metacharacters. Using them on assertions is silly,
- since assertions are zero-width and matching one, two, three or more of them
- doesn't do any good. However, the grouping and sequence metacharacters are
- perfect for being quantified. Let's start with grouping.<br />
- <br />
- You can form groups, or subexpressions as they are frequently called, by
- using the begin and end parenthesis characters:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>( and )</pre>
- </td>
- </tr>
- </tbody>
- </table>
- The <b>(</b> starts the subexpression and the <b>)</b> ends it. It is also
- possible to have one or more subexpressions inside a subexpression. The
- subexpression will match if its contents match. So mixing this with
- quantifiers and assertions you can do:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>( ?ho)+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which matches all of the following lines<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>ho
- ho ho
- ho ho ho
- hohoho</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Another use for subexpressions is to extract a portion of the match if
- it matches; this is often used in conjunction with sequences, which are
- discussed later.<br />
- <br />
- You can also use the result of a subexpression for what is called a back
- reference. A back reference is written as a backslashified digit, and only a
- single non-zero digit, which leaves you with nine back references.<br />
- The back reference matches whatever the corresponding subexpression actually
- matched (except that <b>\0</b> matches a null character). To find
- the number of a subexpression, count the left parentheses from the left.<br
- />
- <br />
- The uses for back references are somewhat limited, especially since you only
- have nine of them, but on some rare occasions you might need one. Note that some
- <i>regular expression</i> implementations accept multi-digit numbers as long
- as they don't start with a 0.<br />
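<p>Subexpressions and back references can be tried out with a short sketch, assuming Python's standard <i>re</i> module. The doubled-word pattern and its sample sentence are made up for the example; <b>\w</b> and <b>\b</b> are Python shorthands for a word character and a word boundary.</p>

```python
import re

# A quantified subexpression: the whole group ( ?ho) repeats one or more times.
ho = re.compile(r"( ?ho)+")
lines = ["ho", "ho ho", "ho ho ho", "hohoho"]
all_match = all(ho.fullmatch(line) is not None for line in lines)

# \1 must repeat exactly what the first subexpression matched,
# so this pattern finds a word that is written twice in a row.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(all_match)
print(doubled.group(0))
print(doubled.group(1))
```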
- <br />
- Next is alternation, which allows you to match one of many words. The
- alternation character is<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>|</pre>
- </td>
- </tr>
- </tbody>
- </table>
- For instance, the expression:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>Bill|Linus|Steve|Larry</pre>
- </td>
- </tr>
- </tbody>
- </table>
- would match either Bill, Linus, Steve or Larry, and mixing this with
- subexpressions and quantifiers we can do:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow(ard|age|boy|l)?</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which, matched against a whole word, accepts any of the following words but no others<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow
- coward
- cowage
- cowboy
- cowl</pre>
- </td>
- </tr>
- </tbody>
- </table>
- I mentioned earlier in the article that not all of the expression must match
- for the match to be successful, this can happen when you're using
- subexpressions together with alternations. For instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>((Donald|Dolly) Duck)|(Scrooge McDuck)</pre>
- </td>
- </tr>
- </tbody>
- </table>
- As you see, only the left or the right top-level subexpression will match, not
- both. This is sometimes handy when you want to try a complex pattern in one
- subexpression and, if it fails, fall back to another one.<br />
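<p>The two alternation examples can be inspected with a minimal sketch, assuming Python's standard <i>re</i> module. In Python, a subexpression that took no part in the match is reported as <i>None</i>.</p>

```python
import re

# Optional alternation inside a subexpression.
cow = re.compile(r"cow(ard|age|boy|l)?")
candidates = ["cow", "coward", "cowage", "cowboy", "cowl", "cows"]
accepted = [w for w in candidates if cow.fullmatch(w)]

# Only one side of the top-level alternation takes part in a match.
ducks = re.compile(r"((Donald|Dolly) Duck)|(Scrooge McDuck)")
left = ducks.fullmatch("Dolly Duck").groups()
right = ducks.fullmatch("Scrooge McDuck").groups()

print(accepted)
print(left)
print(right)
```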
-
-
- <h2>Sequences</h2>
- Last we have sequences, which define sets of characters that can match.
- Sometimes you don't want to match a word directly but rather something that
- resembles one. The sequence characters are<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[ and ]</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Any character put inside the sequence brackets is treated as a literal
- character, even a metacharacter. The only special characters are the <b>-</b>,
- which denotes character ranges, and the <b>^</b>, which negates the
- sequence when it comes first. A sequence is somewhat similar to alternation
- in that only one of the items listed will match. For instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[a-z]</pre>
- </td>
- </tr>
- </tbody>
- </table>
- will match any lowercase letter of the English alphabet (a to z).
- Another common sequence is<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[a-zA-Z0-9]</pre>
- </td>
- </tr>
- </tbody>
- </table>
- which matches any lowercase or capital letter of the English alphabet as well
- as any digit. Sequences are often mixed with quantifiers and assertions to
- produce more elaborate searches. For instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>\&lt;[a-zA-Z]+\&gt;</pre>
- </td>
- </tr>
- </tbody>
- </table>
- matches all whole words. This will match<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>cow
- Linus
- regular
- expression</pre>
- </td>
- </tr>
- </tbody>
- </table>
- but will not match<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>200
- x-files
- C++</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Now what if you wanted to find anything but words, the expression<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[^a-zA-Z0-9]+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- would find any run of characters containing neither letters of the English
- alphabet nor digits.<br />
- <br />
- Some implementations of <i>regular expressions</i> allow you to use
- shorthand versions for commonly used sequences. They are:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>\d, a digit [0-9]
- \D, a non-digit [^0-9]
- \w, a word character (alphanumeric or underscore) [a-zA-Z0-9_]
- \W, a non-word character [^a-zA-Z0-9_]
- \s, a whitespace [ \t\n\r\f]
- \S, a non-whitespace [^ \t\n\r\f]</pre>
- </td>
- </tr>
- </tbody>
- </table>
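<p>A short sketch of sequences, negated sequences and one shorthand follows, assuming Python's standard <i>re</i> module. The sample text is made up; splitting on the negated sequence shows the "everything it did not match" use mentioned at the start of the article.</p>

```python
import re

text = "x-files, C++ and 200 cows"

# Runs of letters only.
letters = re.findall(r"[a-zA-Z]+", text)

# Splitting on everything that is NOT a letter or digit.
pieces = re.split(r"[^a-zA-Z0-9]+", text)

# For this ASCII-only text the shorthand \d behaves like [0-9].
digits = re.findall(r"\d+", text)
same_as_range = digits == re.findall(r"[0-9]+", text)

print(letters)
print(pieces)
print(digits, same_as_range)
```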
-
- <h2>Wildcards</h2>
- For people who have some experience with wildcards I'll give a brief
- explanation of how to convert them to <i>regular expressions</i>. After
- reading this article you have probably noticed the similarities with wildcards.
- For instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>*.jpg</pre>
- </td>
- </tr>
- </tbody>
- </table>
- matches any text which ends with .jpg. You can also specify brackets with
- characters, for instance<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>*.[ch]pp</pre>
- </td>
- </tr>
- </tbody>
- </table>
- matches any text which ends in .cpp or .hpp. Altogether very similar to
- regular expressions.<br />
- <br />
-
-
- <h2>Converting the * operator</h2>
- The * in wildcards means match zero or more of anything. As we learned, in
- regular expressions we do this with the dot and the * quantifier.
- This gives<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>.*</pre>
- </td>
- </tr>
- </tbody>
- </table>
- Also remember to backslashify any dots that appear in the wildcard
- expression.<br />
- <br />
-
-
- <h2>Converting the ? operator</h2>
- The ? means match exactly one character, whatever it is; this is
- exactly what the dot does.<br />
- <br />
-
-
- <h2>Converting the [] operator</h2>
- Square brackets can be used untouched since they have the same meaning
- in wildcards and regular expressions.<br />
- <br />
- This leaves us with:<br />
- Replace any * characters with .*<br />
- Replace any ? characters with .<br />
- Leave square brackets as they are.<br />
- Replace any characters which are metacharacters with a backslashified
- variant.<br />
- <br />
-
-
- <h2>Examples</h2>
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>*.jpg</pre>
- </td>
- </tr>
- </tbody>
- </table>
- would be converted to<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>.*\.jpg</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>ez*.[ch]pp</pre>
- </td>
- </tr>
- </tbody>
- </table>
- would be converted to<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>ez.*\.[ch]pp</pre>
- </td>
- </tr>
- </tbody>
- </table>
- or alternatively<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>ez.*\.(cpp|hpp)</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
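<p>The conversion rules above can be sketched as a small function. This assumes Python and its standard <i>re</i> module; the name <i>wildcard_to_regex</i> is made up for the example. It copies bracket contents as they are, so wildcard-specific bracket syntax is not translated.</p>

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Convert a simple wildcard pattern to a regular expression string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            out.append(".*")            # zero or more of anything
        elif ch == "?":
            out.append(".")             # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                out.append(re.escape(ch))       # unmatched bracket: literal
            else:
                out.append(pattern[i:end + 1])  # leave brackets as they are
                i = end
        else:
            out.append(re.escape(ch))   # backslashify metacharacters
        i += 1
    return "".join(out)

print(wildcard_to_regex("*.jpg"))
print(wildcard_to_regex("ez*.[ch]pp"))
```

The result can be passed to <i>re.fullmatch</i> to test a whole file name against the converted pattern.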
-
-
- <h2>Examples</h2>
- To really get to know <i>regular expressions</i> I've left some commonly used
- expressions on this page. Study them, experiment and try to understand
- exactly what they are doing.<br />
- <br />
- Email validity, matches plain lowercase email addresses such as
- user@host.com. It is a rough check and does not cover everything a valid
- address may contain.<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- Email validity #2, matches email addresses with a name in front, for instance
- "Joe Doe &lt;user@host.com&gt;"<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>("?[a-zA-Z]+"?[ \t]*)+&lt;[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+&gt;</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- Protocol validity, matches web based protocols such as http://, ftp:// or
- https://<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>[a-z]+://</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- C/C++ includes, matches valid include statements in C/C++ files.<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>^#include[ \t]+[&lt;"][^&gt;"]+["&gt;]</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- C++ end of line comments<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>//.+$</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- C/C++ span line comments, it has one flaw, can you spot it?<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>/\*[^*]*\*/</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- Floating point numbers, matches simple floating point numbers of the kind 1.2
- and 0.5<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>-?[0-9]+\.[0-9]+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- Hexadecimal numbers, matches C/C++ style hex numbers, 0xcafebabe<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>0x[0-9a-fA-F]+</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
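<p>One way to experiment with these expressions is to run them against a few test strings. The sketch below assumes Python's standard <i>re</i> module; the constant names, the helper <i>is_whole_match</i> and the test strings are made up for the example.</p>

```python
import re

EMAIL = r"[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+"
FLOAT = r"-?[0-9]+\.[0-9]+"
HEX = r"0x[0-9a-fA-F]+"
INCLUDE = r'^#include[ \t]+[<"][^>"]+[">]'

def is_whole_match(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_whole_match(EMAIL, "user@host.com"))   # accepted
print(is_whole_match(EMAIL, "user@host"))       # rejected: no dot after the host
print(is_whole_match(FLOAT, "-0.5"))            # accepted
print(is_whole_match(FLOAT, "12"))              # rejected: no decimal part
print(is_whole_match(HEX, "0xcafebabe"))        # accepted
print(re.search(INCLUDE, '#include "foo.h"') is not None)
```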
-
-
- <h2>Utilities</h2>
- There are several utilities which use regular expressions. Here is a
- list of them with short descriptions:<br />
- <br />
-
-
- <h2>grep</h2>
- Grep searches named input files for lines containing a match to the given
- pattern. It can also be used to find files which contain a specific pattern,
- for instance:<br />
- <br clear="all" />
-
-
- <p></p>
-
- <table width="100%" cellspacing="0" cellpadding="4" border="0">
- <tbody>
- <tr>
- <td bgcolor="#f0f0f0"><pre>grep -E "cow|vache" * &gt;/dev/null &amp;&amp; echo "Found a cow"</pre>
- </td>
- </tr>
- </tbody>
- </table>
- <br />
- This utility is rather common on Linux distributions, but if you don't
- have it you can grab a version from <a
- href="http://www.gnu.org/gnulist/production/grep.html">the GNU page</a><br />
- <br />
- A small tip is to enable extended regular expressions with the -E option;
- otherwise many of the metacharacters explained in this article won't work.<br />
- <br />
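<p>If you don't have grep at hand, its basic behaviour of keeping the lines that contain a match can be imitated in a few lines. This is only a rough sketch assuming Python's standard <i>re</i> module; the function name <i>grep</i> and the sample lines are made up, and it works on a list of strings rather than on files.</p>

```python
import re
from typing import Iterable, List

def grep(pattern: str, lines: Iterable[str]) -> List[str]:
    """Return the lines that contain a match for the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

sample = ["la vache qui rit", "a brown cow", "no animals here"]
hits = grep(r"cow|vache", sample)

if hits:
    print("Found a cow")
```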
-
-
- <h2>sed</h2>
- Sed is a stream editor. A stream editor is used to perform basic text
- transformations on an input stream.<br />
- <br />
- This utility is rather common on Linux distributions, but if you don't
- have it you can grab a version from <a
- href="http://www.gnu.org/gnulist/production/sed.html">the GNU page</a><br />
- <br />
-
-
- <h2>gawk</h2>
- Gawk is the GNU Project's implementation of the AWK programming language.
- It conforms to the definition of the language in the POSIX 1003.2 Command
- Language And Utilities Standard.<br />
- <br />
- This utility is rather common on Linux distributions, but if you don't
- have it you can grab a version from <a href="http://www.gnu.org/">the GNU
- page</a><br />
- <br />
- <span style="color: #FF0000; background-color: #FFFFFF">[document edited
- here]</span><br style="color: #FF0000; background-color: #FFFFFF" />
- <i>Regular expression</i> related links:<br />
- <br />
- <a href="http://www.plover.com/~mjd/perl/NPC/index.html">Regular Expressions
- and NP-Completeness</a><br />
- <a href="http://www.cs.rochester.edu/u/leblanc/csc173/fa/re.html">Equivalence
- of Regular Expressions and Finite Automata</a><br />
- <a href="http://virtual.park.uga.edu/humcomp/perl/regex2a.html">Perl Regular
- Expression Tutorial</a> <br clear="all" />
- </body>
- </html>
-