1. Regular Expressions

What, you ask, are regular expressions? Considered in one sense, they are no more than the wildcard characters '*' (match zero or more characters) and '?' (match one character) that are used by many operating systems for the specification of filenames. However, they are vastly more powerful and flexible than these two simple examples.

Regular expressions consist of a combination of literal text characters and regular expression "metacharacters". Here is the full complement of regular expression metacharacters recognized by EGREP:

    Symbol    Name
    --------------------------------------------------
    \         Escape
    "         Quotation
    .         Any
    ^         Line begin (or character class negation)
    $         Line end
    [         Character class begin
    -         Character class range separator
    ]         Character class end
    (         Group begin
    )         Group end
    ?         Option
    *         Closure
    +         Positive closure
    {         Iteration begin
    ,         Iteration parameter separator
    }         Iteration end
    |         Alternation

and the following are examples of their usage:

    Expression    Matches                      Example
    -------------------------------------------------------
    c             Any literal character        a
    \c            Character c literally        \*
    "s"           String s literally           "**"
    .             Any character but newline    a.b
    ^             Beginning of line            ^abc
    $             End of line                  abc$
    [s]           Any character in s           [abc]
    [^s]          Any character not in s       [^abc]
    r1r2          r1 followed by r2            ab
    r?            Zero or one r's              a?
    r*            Zero or more r's             a*
    r+            One or more r's              a+
    r{m,n}        m to n occurrences of r      a{2,5}
    r1|r2         r1 or r2                     a|b
    (r)           r                            (a|b)

Still confused? Let's discuss each of the above metacharacters in detail.

1.1. Literal Characters

At its simplest level, a regular expression consists of nothing but literal characters. Unless a character is one of the regular expression metacharacters shown above, it represents itself as a member of the ASCII character set when used in a regular expression.

1.2. Escape Characters

There will be times when you want to include a regular expression metacharacter in a regular expression as a literal character. Also, you may want to include non-printable characters.
This is where escape characters are essential. EGREP recognizes the following escape characters:

    Expression    Matches              Hexadecimal Equivalent
    ------------------------------------------------------
    \n            newline              0x0A
    \t            horizontal tab       0x09
    \b            backspace            0x08
    \r            carriage return      0x0D
    \f            formfeed             0x0C
    \ddd          octal digit          0x00 to 0x7F
    \xhhh         hexadecimal digit    0x00 to 0x7F
    \c            anything else        0x20 to 0x7E

where:

1. Each 'd' of an octal digit is a printable digit from the character set "01234567". There can be between 1 and 3 digits.

2. Each 'h' of a hexadecimal digit is a printable digit from the character set "0123456789ABCDEFabcdef". The leading 'x' must be in lower case. There can be between 1 and 3 digits.

3. 'c' is any printable value not included in the set of escape characters shown above, including '\' itself (i.e., "\\").

NOTE: Octal and hexadecimal escape characters are not supported by UNIX EGREP.

In general, specifying characters in regular expressions as octal or hexadecimal digits is implementation dependent. For example, the same regular expression would not produce the same results on a system using the ASCII character set as on one using the EBCDIC character set.

byHeart Software's implementation of EGREP uses the ASCII character set. This means that you can specify any character between 0x00 and 0x7F in a regular expression. There are, however, two special cases to be considered: the "newline" character 0x0A and the NULL character 0x00.

EGREP will not match any pattern containing a newline character unless that character is the last one in the regular expression. The "line end" metacharacter ('$') (see Section 1.6, "Line End") is more commonly used in such an application.

EGREP will accept the NULL character when it is specified in a regular expression. However, it uses this character internally to represent the end of each text line it reads in for processing.
If the line contains a NULL character, EGREP will consider it to be the end of the line and ignore the remaining characters when searching for pattern matches.

1.3. Literal Strings

Occasionally, you may want to specify a string of metacharacters in a regular expression. One example of this would be to search for "*****". Since it is inconvenient to express this as "\*\*\*\*\*", EGREP allows you to specify literal strings of characters (either literal or metacharacters) by enclosing them in double quote marks. For example:

    "*** This is a literal string ***"

WARNING: The above example will work only if the regular expression is taken from a file using the '-f' command-line switch. The double quote character ('"') is used by the MS-DOS command-line processor to delimit quoted arguments (see Section 2.5, "Entering Regular Expressions"). If you want to use it as a regular expression metacharacter in a command-line argument, you must use its escape character equivalent, as in:

    "The use of \"*+?$\" as literals is possible"

Escape characters are also recognized inside literal strings, so that the double quote metacharacter can be expressed as a literal character. For example:

    "The '\"' symbol is a metacharacter."

NOTE: Literal strings are not supported by UNIX EGREP.

1.4. Any Character

The "any" metacharacter (a period) can be used to match any character except "newline". For example:

    ab.cd

will match "abccd", "ab cd", "ab.cd", and so on.

1.5. Line Begin

If the "line begin" metacharacter ('^') is the FIRST character in a regular expression, a string in a line will be recognized as matching the rest of the regular expression only if that string occurs at the beginning of the line. If the metacharacter occurs anywhere else in the regular expression outside of a character class (see Section 1.7, "Character Classes"), it is treated as a literal character. Thus:

    ^abc

will match "abcdef" but not "aabcdef", and

    ab^c

will match "ab^c", "xyzab^c" and so on.

1.6. Line End

Analogous to the "line begin" metacharacter, the "line end" metacharacter ('$') is only recognized as such when it is the LAST character in a regular expression. Thus:

    abc$

will match "xyzabc" but not "abcdef", and

    a$bc

will match "a$bc", "xyza$bc" and so on.

1.7. Character Classes

This is where the expressive power of regular expressions becomes apparent. Whereas the "any" metacharacter will match any literal character except "newline", a character class can be used to specify any character in (or not in) a class of characters. Here are some examples:

    [abc]       matches either 'a', 'b' or 'c'
    [^abc]      matches any character but 'a', 'b' or 'c'
    [a^bc]      matches 'a', 'b', 'c' or '^'
    [a-yD-M]    matches any character in the range of 'a' to 'y' or 'D' to 'M'
    [^a-r]      matches any character not in the range of 'a' to 'r'
    []ab]       matches either 'a', 'b' or ']'
    [-ab]       matches either 'a', 'b' or '-'
    [ab-]       matches either 'a', 'b' or '-'
    [^-ab]      matches any character but 'a', 'b' or '-'
    [\tb]       matches either 'b' or a horizontal tab
    [b\014]     matches either 'b' or 0x0C (octal 014, formfeed)
    [*+$]       matches either '*', '+' or '$'

NOTE: UNIX EGREP does not support escape characters inside character classes.

Within a character class, only the metacharacters '^' (character class negation), '-' (character class range) and '\' (escape character) are recognized.

When the "character class negation" ('^') metacharacter appears as the FIRST character after the "character class begin" ('[') metacharacter, it indicates that the character class is to be negated. In other words, the class matches any single character NOT in the remainder of the character class string. If the "character class negation" metacharacter appears anywhere else in the character class string, it is treated as a literal character.

It is often convenient to specify a range of contiguous characters from the character set with the "character range" metacharacter ('-').
For example, "abcdefghijklmn" can be more clearly and easily stated as:

    "a-n"

If the "character range" metacharacter is the FIRST or LAST character in the character class string, or immediately follows a "character class negation" metacharacter, it is taken to be a literal character.

Since character ranges are not the same for different character sets (for example, "0-z" in ASCII is many more characters than it is in EBCDIC), EGREP will issue an "implementation dependent" warning message for any character range where the low and high characters are not both uppercase alphabetic, lowercase alphabetic or digit characters.

Escape characters have their usual meaning inside character classes.

Finally, a "character class end" metacharacter (']') appearing immediately after a "character class begin" metacharacter is taken to be a literal character within the character class string. It is illegal to specify an "empty" character class (i.e., "[]").

1.8. Grouping of Regular Expressions

So far we have considered only single characters in regular expressions, where each character follows the preceding one in the expression. In the parlance of computer scientists and linguists, we say that each of these characters is "concatenated" with the preceding regular expression.

There will be times, however, when we will want to specify not single characters but regular expressions as part of a regular expression. Stepping ahead a bit, the "closure" metacharacter ('*') allows you to specify zero or more occurrences of the immediately preceding regular expression. As an example:

    ab*

will match "a", "ab", "abb", "abbb" and so on. Suppose however that we only want to match strings such as "aab", "aabab", "aababab" and so on. To do this, we use:

    a(ab)*

where the string inside the "group begin" ('(') and "group end" (')') metacharacters is taken as a regular expression.

Grouping is recursive.
That is, we can specify regular expressions within regular expressions by means of nested groupings. Thus, as an example:

    a(b(cd)*)*

is legal, and will match "a", "ab", "abcdbbcd" and so on.

1.9. Alternation

Suppose we want to specify that our regular expression should match either "ab" or "cd". If these were single characters, we could use a character class. However, for regular expressions we must use the "alternation" metacharacter ('|'), such as in:

    ab|cd

which will match either "ab" or "cd".

Note carefully that the two regular expressions shown above did not have to be grouped. The "alternation" metacharacter has a "lower precedence" than that of concatenated literal characters. See Section 2, "Metacharacter Precedence", for a full explanation.

1.10. Optional Expressions

The "option" metacharacter ('?') specifies zero or one occurrences of an immediately preceding regular expression. Thus:

    a(bc)?d?

matches "a", "abc", "ad" and "abcd" only.

1.11. Repeated Expressions

The "closure" metacharacter ('*') specifies zero or more occurrences of an immediately preceding regular expression. Thus:

    ab*

matches "a", "ab", "abbb" and so on, while

    a(bc)*

matches "a", "abc", "abcbcbcbc" and so forth. Similarly,

    [a-m]*

matches "" (no character), "c", "cmdgijal" and in general any string that contains only letters between "a" and "m".

Closely related is the "positive closure" metacharacter ('+'), which specifies one or more occurrences of an immediately preceding regular expression. As an example:

    a(bc)+

matches "abc", "abcbc", "abcbcbcbc" and so forth.

Finally, there is the "iteration construct". Suppose you want to search for these strings only: "abab", "ababab" and "abababab". You could use the regular expression:

    abab|ababab|abababab

However, a simpler way to write this would be:

    (ab){2,4}

which simply specifies that the regular expression will match from two to four occurrences of the regular expression "ab".
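
The equivalence of the two forms can be checked with a short sketch. It assumes Python's "re" module as a stand-in for EGREP; the two are different implementations that happen to agree on this small subset of syntax. The names LONG_FORM, SHORT_FORM and whole_match are illustrative only and are not part of EGREP.

```python
import re

# Assumption: Python's re module stands in for EGREP here. The two agree on
# this subset of syntax, but they are different implementations.
LONG_FORM = re.compile(r"abab|ababab|abababab")
SHORT_FORM = re.compile(r"(ab){2,4}")


def whole_match(pattern, text):
    """Return True if the compiled pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


# Build "", "ab", "abab", ... up to six repetitions of "ab".
candidates = ["ab" * count for count in range(0, 7)]

# Both forms must accept and reject exactly the same candidates.
for text in candidates:
    assert whole_match(LONG_FORM, text) == whole_match(SHORT_FORM, text)

accepted = [text for text in candidates if whole_match(SHORT_FORM, text)]
print(accepted)
```

Whole-string matching is used so that "ab" repeated five times is rejected; a line-oriented search would find "abab" inside it and report a match.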
The general format of the iteration construct is:

    r{m,n}

where 'r' is a regular expression, 'm' is the least number of occurrences of 'r' that will be matched, and 'n' is the greatest number. The value of 'm' must be between zero and 254, while the value of 'n' must be between one and 255. Further, 'm' must always be less than 'n'.

A word of warning, however. UNIX EGREP does not support the iteration construct, and for a good reason: EGREP must consume inordinate amounts of memory in building its internal state machine tables to represent the iteration construct. Do not be surprised if you see "ERROR: out of memory" for even moderate values of 'm' and 'n'.

2. Metacharacter Precedence

2.1. Simple Arithmetic Precedences

You have seen metacharacter precedence before. In simple arithmetic, the multiplication and division metacharacters '*' and '/' have a higher precedence than the addition and subtraction metacharacters '+' and '-'. Finally, grouping has the highest precedence of all.

Do you see the relationship to regular expressions? We even have concatenation of digits to form "numbers".

Let's look at an example. When calculating the result of the arithmetic expression:

    ( 7 + 3 ) * ( 8 + 12 / ( 4 - 2 ) ) - 3

we first look for the most deeply nested group, in this case "( 4 - 2 )". We solve this and get:

    ( 7 + 3 ) * ( 8 + 12 / 2 ) - 3

We again look for groupings, and find two at the same level of nesting. We solve the first one to get:

    10 * ( 8 + 12 / 2 ) - 3

For the remaining group, division takes precedence over addition, and so we get:

    10 * ( 8 + 6 ) - 3

and then

    10 * 14 - 3

We no longer have any groups to consider, but multiplication takes precedence over subtraction, and so we get:

    140 - 3

and finally

    137

as our answer.

2.2. Regular Expressions

Regular expressions are written in exactly the same manner, using the precedences of the metacharacters recognized by EGREP.
These are shown in context as follows, grouped in DECREASING order of precedence:

    Metachar    Function
    --------------------------------------------------------
    (r)         Grouped regular expression 'r'
    --------------------------------------------------------
    c           Literal character 'c'
    \c          Escape character 'c'
    "s"         Literal string 's'
    .           Any character
    [s]         Character class
    --------------------------------------------------------
    r?          Optional regular expression 'r'
    r*          Closure of regular expression 'r'
    r+          Positive closure of regular expression 'r'
    r{m,n}      Iteration of regular expression 'r'
    --------------------------------------------------------
    r1r2        Regular expression 'r1' followed by 'r2'
    --------------------------------------------------------
    r1|r2       Alternation of regular expressions 'r1' and 'r2'
    --------------------------------------------------------
    ^           Beginning of line
    $           End of line
    --------------------------------------------------------

Remember, you perform the operations in order of decreasing precedence. As an example, consider the following regular expression:

    (ab|cd)?(ef)*

Remembering that each literal character can be considered a regular expression 'r', this expression would be considered by EGREP in the following manner (where 'rn' is a regular expression):

    (r1r2|r3r4)?(r5r6)*
    (r7|r8)?(r9)*
    r10?r9*
    r11r12
    r13

with EGREP then matching such strings as "abefef", "efefef", "cdef" and "cd", but not "abc", "abcd", or "abcdef".
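
The precedence rules can be illustrated with a short sketch. It assumes Python's "re" module as a stand-in for EGREP (the two share these precedence rules for this subset of syntax, but are different implementations) and tests whole strings rather than searching within a line. The helper name whole_match is illustrative only.

```python
import re


# Assumption: Python's re module is used as a stand-in for EGREP.
def whole_match(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# Concatenation binds tighter than alternation: ab|cd means (ab)|(cd).
assert whole_match(r"ab|cd", "ab") and whole_match(r"ab|cd", "cd")
assert not whole_match(r"ab|cd", "abd")

# Grouping overrides this: a(b|c)d means 'a', then 'b' or 'c', then 'd'.
assert whole_match(r"a(b|c)d", "abd") and whole_match(r"a(b|c)d", "acd")

# Closure binds tighter than concatenation: ab* means a(b*), not (ab)*.
assert whole_match(r"ab*", "abbb") and not whole_match(r"ab*", "abab")

# The worked example from the text.
EXAMPLE = r"(ab|cd)?(ef)*"
accepted = [t for t in ["abefef", "efefef", "cdef"] if whole_match(EXAMPLE, t)]
rejected = [t for t in ["abc", "abcd", "abcdef"] if not whole_match(EXAMPLE, t)]
print(accepted)
print(rejected)
```

Whole-string matching is the design choice that makes the rejected cases visible: because every part of the example expression is optional, a line-oriented search would report a match in any line at all.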