Text File | 1994-09-19 | 22KB | 535 lines
Regular Expressions
What, you ask, are regular expressions? Considered in one
sense, they are no more than the wildcard characters '*'
(match zero or more characters) and '?' (match one
character) that are used by many operating systems for the
specification of filenames. However, they are vastly more
powerful and flexible than these two simple examples.
Regular expressions consist of a combination of literal text
characters and regular expression "metacharacters". Here is
the full complement of regular expression metacharacters
recognized by EGREP:
Symbol   Name
--------------------
\        Escape
"        Quotation
.        Any
^        Line begin (or character class negation)
$        Line end
[        Character class begin
-        Character class range separator
]        Character class end
(        Group begin
)        Group end
?        Option
*        Closure
+        Positive closure
{        Iteration begin
,        Iteration parameter separator
}        Iteration end
|        Alternation
and the following are examples of their usage:
Expression   Matches                      Example
-------------------------------------------------------
c            Any literal character        a
\c           Character c literally        \*
"s"          String s literally           "**"
.            Any character but newline    a.b
^            Beginning of line            ^abc
$            End of line                  abc$
[s]          Any character in s           [abc]
[^s]         Any character not in s       [^abc]
r1r2         r1 followed by r2            ab
r?           Zero or one r's              a?
r*           Zero or more r's             a*
r+           One or more r's              a+
r{m,n}       m to n occurrences of r      a{2,5}
r1|r2        r1 or r2                     a|b
(r)          r                            (a|b)
Still confused? Let's discuss each of the above
metacharacters in detail.
1.1. Literal Characters
At its simplest level, a regular expression consists of
nothing but literal characters. Unless a character is one
of the regular expression metacharacters shown above, it
represents itself as a member of the ASCII character set
when used in a regular expression.
1.2. Escape Characters
There will be times when you want to include a regular
expression metacharacter in a regular expression as a
literal character. Also, you may want to include non-
printable characters. This is where escape characters are
essential.
EGREP recognizes the following escape characters:
Expression   Matches             Hexadecimal Equivalent
------------------------------------------------------
\n           newline             0x0A
\t           horizontal tab      0x09
\b           backspace           0x08
\r           carriage return     0x0D
\f           formfeed            0x0C
\ddd         octal digit         0x00 to 0x7F
\xhhh        hexadecimal digit   0x00 to 0x7F
\c           anything else       0x20 to 0x7E
where:
1. Each 'd' of an octal digit is a printable digit
from the character set "01234567". There can be
between 1 and 3 digits.
2. Each 'h' of a hexadecimal digit is a printable digit
from the character set "0123456789ABCDEFabcdef". The
leading 'x' must be in lower case. There can be
between 1 and 3 digits.
3. 'c' is any printable value not included in the set of
escape characters shown above, including '\' itself
(i.e. - "\\").
NOTE: Octal and hexadecimal escape characters are not
supported by UNIX EGREP.
In general, specifying characters in regular expressions as
octal or hexadecimal digits is implementation dependent.
Using EGREP would not produce the same results on a system
using the ASCII character set and another using the EBCDIC
character set, for example.
byHeart Software's implementation of EGREP uses the ASCII
character set. This means that you can specify any
character between 0x00 and 0x7F in a regular expression.
There are however two special cases to be considered: the
"newline" character 0x0A and the NULL character 0x00. EGREP
will not match any pattern containing a newline character
unless that character is the last one in the regular
expression. The "line end" metacharacter ('$') (see Section
1.6, "Line End") is more commonly used in such an
application.
EGREP will accept the NULL character when it is specified in
a regular expression. However, it uses this character
internally to represent the end of each text line it reads
in for processing. If the line contains a NULL character,
EGREP will consider it to be the end of the line and ignore
the remaining characters when searching for pattern matches.
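As a rough illustration in Python (not EGREP itself), the
escape values in the table can be checked against their
hexadecimal equivalents, and the NULL character behaviour can
be imitated by cutting a line at the first 0x00. The helper
name visible_part is hypothetical, and the truncation only
mimics what is described above.

```python
import re

# (character, expected code) pairs taken from the escape table.
# Python's own string escapes are used; '\101' and '\x41' both mean 'A'.
ESCAPES = [
    ("\n", 0x0A),    # newline
    ("\t", 0x09),    # horizontal tab
    ("\b", 0x08),    # backspace
    ("\r", 0x0D),    # carriage return
    ("\f", 0x0C),    # formfeed
    ("\101", 0x41),  # octal escape
    ("\x41", 0x41),  # hexadecimal escape
]
escapes_ok = all(ord(ch) == code for ch, code in ESCAPES)


def visible_part(line: str) -> str:
    """Return the text before the first NULL character (hypothetical helper)."""
    return line.split("\x00", 1)[0]


line = "abc\x00def"
# Text before the NULL character can still be found ...
found_before = re.search("abc", visible_part(line)) is not None
# ... but text after it is ignored, as the paragraph above describes.
found_after = re.search("def", visible_part(line)) is not None
```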
1.3. Literal Strings
Occasionally, you may want to specify a string of
metacharacters in a regular expression. One example of this
would be to search for "*****". Since it is inconvenient to
express this as "\*\*\*\*\*", EGREP allows you to specify
literal strings of characters (either literal or
metacharacters) by enclosing them in double quote marks.
For example:
"*** This is a literal string ***"
WARNING: The above example will work only if the regular
expression is taken from a file using the '-f' command-line
switch. The double quote character ('"') is used by the MS-
DOS command-line processor to delimit quoted arguments (see
Section 2.5, "Entering Regular Expressions"). If you want
to use it as a regular expression metacharacter in a
command-line argument, you must use its escape character
equivalent, as in:
"The use of \"*+?$\" as literals is possible"
Escape characters are also recognized inside literal
strings, so that the double quote metacharacter can be
expressed as a literal character. For example:
"The '\"' symbol is a metacharacter."
NOTE: Literal strings are not supported by UNIX EGREP.
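Python's re module has no quotation metacharacter, but its
re.escape function plays a similar role. The sketch below is
only an analogy to the literal strings described above, not a
description of how EGREP handles them.

```python
import re

literal = "*** This is a literal string ***"

# re.escape backslash-escapes every metacharacter, so the whole
# string is matched literally, much like a quoted literal string.
pattern = re.escape(literal)
text = "header: *** This is a literal string *** :footer"
match = re.search(pattern, text)

# Five literal asterisks, without writing \*\*\*\*\* by hand.
stars = re.search(re.escape("*****"), "a ***** b")

# Left unescaped, the same text is rejected as a regular expression.
try:
    re.compile("*****")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
```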
1.4. Any Character
The "any" metacharacter (a period) can be used to match any
character except "newline". For example:
ab.cd
will match "abccd", "ab cd", "ab.cd", and so on.
1.5. Line Begin
If the "line begin" metacharacter ('^') is the FIRST
character in a regular expression, a string in a line will
be recognized as matching the rest of the regular expression
only if that string occurs at the beginning of the line.
If the metacharacter occurs anywhere else in the regular
expression outside of a character class (see Section 1.7,
"Character Classes"), it is treated as a literal character.
Thus:
^abc
will match "abcdef" but not "aabcdef", and
ab^c
will match "ab^c", "xyzab^c" and so on.
1.6. Line End
Analogous to the "line begin" metacharacter, the "line end"
metacharacter ('$') is only recognized as such when it is
the LAST character in a regular expression. Thus:
abc$
will match "xyzabc" but not "abcdef", and
a$bc
will match "a$bc", "xyza$bc" and so on.
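The examples from Sections 1.4 to 1.6 can be tried in Python
as a rough check. One difference is assumed here: Python's re
module treats '^' and '$' as anchors wherever they appear, so
the literal uses shown above have to be escaped. The helper
name line_matches is hypothetical.

```python
import re


def line_matches(pattern: str, line: str) -> bool:
    """True if the pattern is found anywhere in the line (grep-style search)."""
    return re.search(pattern, line) is not None


# "Any" metacharacter: the period stands for any single character.
any_char = [line_matches(r"ab.cd", s) for s in ("abccd", "ab cd", "ab.cd")]

# Line begin: matches only at the start of the line.
begin = (line_matches(r"^abc", "abcdef"), line_matches(r"^abc", "aabcdef"))

# Line end: matches only at the end of the line.
end = (line_matches(r"abc$", "xyzabc"), line_matches(r"abc$", "abcdef"))

# In Python the literal '^' and '$' must be escaped.
literal_caret = line_matches(r"ab\^c", "xyzab^c")
literal_dollar = line_matches(r"a\$bc", "xyza$bc")
```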
1.7. Character Classes
This is where the expressive power of regular expressions
becomes apparent. Whereas the "any" metacharacter will
match any literal character except "newline", a character
class can be used to specify any character in (or not in) a
class of characters. Here are some examples:
[abc]      matches either 'a', 'b' or 'c'
[^abc]     matches any character but 'a', 'b' or 'c'
[a^bc]     matches 'a', 'b', 'c' or '^'
[a-yD-M]   matches any character in the range of 'a' to
           'y' or 'D' to 'M'
[^a-r]     matches any character not in the range of
           'a' to 'r'
[]ab]      matches either 'a', 'b' or ']'
[-ab]      matches either 'a', 'b' or '-'
[ab-]      matches either 'a', 'b' or '-'
[^-ab]     matches any character but 'a', 'b' or '-'
[\tb]      matches either 'b' or a horizontal tab
[b\014]    matches either 'b' or 0x0C
[*+$]      matches either '*', '+' or '$'
NOTE: UNIX EGREP does not support escape characters
inside character classes.
Within a character class, only the metacharacters '^'
(character class negation), '-' (character class range) and
'\' (escape character) are recognized.
When the "character class negation" ('^') metacharacter
appears as the FIRST character after the "character class
begin" ('[') metacharacter, it indicates that the character
class is to be negated. In other words, it indicates that
the regular expression character it represents should
consist of any of the characters NOT in the remainder of the
character class string.
If the "character class negation" metacharacter appears
anywhere else in the character class string, it is treated
as a literal character.
It is often convenient to specify a range of contiguous
characters from the character set with the "character range"
metacharacter ('-'). For example,
"abcdefghijklmn"
can be more clearly and easily stated as:
"a-n"
If the "character range" metacharacter is the FIRST or LAST
character in the character class string, or immediately
follows a "character class negation" metacharacter, it is
taken to be a literal character.
Since character ranges are not the same for different
character sets (for example, "0-z" in ASCII is many more
characters than it is in EBCDIC), EGREP will issue an
"implementation dependent" warning message for any character
range where the low and high characters are not both
uppercase alphabetic, lowercase alphabetic or digit
characters.
Escape characters have their usual meaning inside character
classes.
Finally, a "character class end" metacharacter (']')
appearing immediately after a "character class begin"
metacharacter is taken to be a literal character within the
character class string. It is illegal to specify an "empty"
character class (i.e. - "[]").
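The character class rules above can be explored with a small
Python sketch. Python's re module happens to follow the same
conventions for these particular classes, but that is a
property of Python, not a statement about EGREP. The helper
name chars_matching and the candidate string are hypothetical.

```python
import re


def chars_matching(char_class: str, candidates: str) -> str:
    """Return the candidate characters accepted by the character class."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))


# A handful of test characters, including tab, 0x0B and 0x0C.
candidates = "abcd^]-*+$D\t\x0b\x0c"

CLASSES = [
    r"[abc]",     # plain class
    r"[a^bc]",    # '^' is literal when it is not first
    r"[]ab]",     # ']' is literal when it is first
    r"[-ab]",     # '-' is literal when it is first ...
    r"[ab-]",     # ... or last
    r"[\tb]",     # escape character inside a class
    r"[b\014]",   # octal 014 is 0x0C (formfeed)
    r"[*+$]",     # other metacharacters are literal inside a class
    r"[a-yD-M]",  # two ranges
]
accepted = {cls: chars_matching(cls, candidates) for cls in CLASSES}

# A negated class accepts exactly the characters its positive form rejects.
negated = chars_matching(r"[^abc]", candidates)
```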
1.8. Grouping of Regular Expressions
So far we have considered only single characters in regular
expressions, where each character follows the preceding one
in the expression. In the parlance of computer scientists
and linguists, we say that each of these characters is
"concatenated" with the preceding regular expression.
There will be times, however, when we will want to specify
not single characters but regular expressions as part of a
regular expression. Stepping ahead a bit, the "closure"
metacharacter ('*') allows you to specify zero or more
occurrences of the immediately preceding regular expression.
As an example:
ab*
will match "a", "ab", "abb", "abbb" and so on.
Suppose however that we only want to match strings such as
"aab", "aabab", "aababab" and so on. To do this, we use:
a(ab)*
where the string inside the "group begin" ('(') and "group
end" (')') metacharacters is taken as a regular expression.
Grouping is recursive. That is, we can specify regular
expressions within regular expressions by means of nested
groupings. Thus, as an example:
a(b(cd)*)*
is legal, and will match "a", "ab", "abcdbbcd" and so on.
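The grouping examples above can be checked with Python's re
module, used here as a stand-in for EGREP. The helper name
whole is hypothetical; it tests whether the entire string
matches, which is how the examples in this section read.

```python
import re


def whole(pattern: str, s: str) -> bool:
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


# Without grouping, '*' applies to the single character 'b'.
ungrouped = [s for s in ("a", "ab", "abb", "aab") if whole(r"ab*", s)]

# With grouping, '*' applies to the whole expression "ab".
grouped = [s for s in ("a", "aab", "aabab", "ab") if whole(r"a(ab)*", s)]

# Nested groups: zero or more 'b', each followed by zero or more "cd".
nested = [s for s in ("a", "ab", "abcdbbcd", "acd") if whole(r"a(b(cd)*)*", s)]
```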
1.9. Alternation
Suppose we want to specify that our regular expression
should match either "ab" or "cd". If these were single
characters, we could use a character class. However, for
regular expressions we must use the "alternation"
metacharacter ('|'), such as in:
ab|cd
which will match either "ab" or "cd".
Note carefully that the two regular expressions shown above
did not have to be grouped. The "alternation" metacharacter
has a "lower precedence" than that of concatenated literal
characters. See Section 2, "Metacharacter Precedence", for
a full explanation.
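A short Python sketch of the two points made above: for single
characters a character class and an alternation accept the
same input, and "ab|cd" behaves the same with or without
explicit groups. Python's re module is used as a stand-in, and
the helper name whole is hypothetical.

```python
import re


def whole(pattern: str, s: str) -> bool:
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


words = ("ab", "cd", "a", "abcd", "ad")

# No grouping is needed: concatenation is applied before alternation.
ungrouped = [w for w in words if whole(r"ab|cd", w)]
grouped = [w for w in words if whole(r"(ab)|(cd)", w)]

# For single characters, a class and an alternation are interchangeable.
by_class = [c for c in "abcd" if whole(r"[ac]", c)]
by_alternation = [c for c in "abcd" if whole(r"a|c", c)]
```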
1.10. Optional Expressions
The "option" metacharacter ('?') specifies zero or one
occurrences of an immediately preceding regular expression.
Thus:
a(bc)?d?
matches "a", "abc", "ad" and "abcd" only.
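The word "only" can be checked by brute force. The sketch
below, using Python's re module as a stand-in for EGREP, tries
every string of one to four characters over the letters
"abcd" and keeps those that match in full.

```python
import re
from itertools import product

alphabet = "abcd"

# Every string of length 1 to 4 over the alphabet, filtered by a full match.
matches = sorted(
    "".join(chars)
    for length in range(1, 5)
    for chars in product(alphabet, repeat=length)
    if re.fullmatch(r"a(bc)?d?", "".join(chars))
)
```

Under these assumptions the list holds exactly the four
strings named above.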
1.11. Repeated Expressions
The "closure" metacharacter ('*') specifies zero or more
occurrences of an immediately preceding regular expression.
Thus:
ab*
matches "a", "ab", "abbb" and so on, while
a(bc)*
matches "a", "abc", "abcbcbcbc" and so forth. Similarly,
[a-m]*
matches "" (no character), "c", "cmdgijal" and in general
any string that contains only letters between "a" and "m".
Closely related is the "positive closure" metacharacter
('+'), which specifies one or more occurrences of an
immediately preceding regular expression. As an example:
a(bc)+
matches "abc", "abcbc", "abcbcbcbc" and so forth.
Finally, there is the "iteration construct". Suppose you
want to search for these strings only: "abab", "ababab" and
"abababab". You could use the regular expression:
abab|ababab|abababab
However, a simpler way to write this would be:
(ab){2,4}
which simply specifies that the regular expression will
match from two to four occurrences of the regular expression
"ab". The general format of the iteration construct is:
r{m,n}
where 'r' is a regular expression, 'm' is the least number
of occurrences of 'r' that will be matched, and 'n' is the
greatest number. The value of 'm' must be between zero and
254, while the value of 'n' must be between one and 255.
Further, 'm' must always be less than 'n'.
A word of warning, however. UNIX EGREP does not support the
iteration construct, and for a good reason: EGREP must
consume inordinate amounts of memory in building its
internal state machine tables to represent the iteration
construct. Do not be surprised if you see "ERROR: out of
memory" for even moderate values of 'm' and 'n'.
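One way to picture the cost of the iteration construct is to
rewrite r{m,n} without braces, as 'm' required copies of 'r'
followed by 'n' minus 'm' optional copies. The sketch below
does this in Python. The function expand_iteration is
hypothetical, and nothing here claims that EGREP builds its
tables this way; it only shows how quickly the equivalent
expression grows.

```python
import re


def expand_iteration(r: str, m: int, n: int) -> str:
    """Rewrite r{m,n} as m required copies of r plus n - m optional copies."""
    if not 0 <= m < n:
        raise ValueError("expected 0 <= m < n")
    unit = f"(?:{r})"  # non-capturing group around r
    return unit * m + (unit + "?") * (n - m)


def whole(pattern: str, s: str) -> bool:
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


expanded = expand_iteration("ab", 2, 4)

# Both forms accept the same repeat counts of "ab".
counts_braces = [k for k in range(7) if whole(r"(ab){2,4}", "ab" * k)]
counts_expanded = [k for k in range(7) if whole(expanded, "ab" * k)]

# Closure allows zero occurrences; positive closure needs at least one.
closure = [s for s in ("a", "abc", "abcbcbcbc") if whole(r"a(bc)*", s)]
positive = [s for s in ("a", "abc", "abcbc") if whole(r"a(bc)+", s)]

# The expanded form grows with n, which hints at the memory cost.
large = expand_iteration("ab", 200, 255)
```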
2. Metacharacter Precedence
2.1. Simple Arithmetic Precedences
You have seen metacharacter precedence before. In simple
arithmetic, the multiplication and division metacharacters
'*' and '/' have a higher precedence than the addition and
subtraction metacharacters '+' and '-'. Finally, grouping
has the highest precedence of all.
Do you see the relationship to regular expressions? We even
have concatenation of digits to form "numbers".
Let's look at an example. When calculating the result of
the arithmetic expression:
( 7 + 3 ) * ( 8 + 12 / ( 4 - 2 ) ) - 3
we first look for the most deeply nested group, in this case
"( 4 - 2 )". We solve this and get:
( 7 + 3 ) * ( 8 + 12 / 2 ) - 3
We again look for groupings, and find two at the same level
of nesting. We solve the first one to get:
10 * ( 8 + 12 / 2 ) - 3
For the remaining group, division takes precedence over
addition, and so we get:
10 * ( 8 + 6 ) - 3
and then
10 * 14 - 3
We no longer have any groups to consider, but multiplication
takes precedence over subtraction, and so we get:
140 - 3
and finally
137
as our answer.
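The reduction above can be replayed in Python, whose
arithmetic follows the same precedence rules. Each
intermediate expression is evaluated on its own, and all of
them should give the same value. The list name steps is
hypothetical, and eval is used only on these fixed strings.

```python
# Each line of the worked example, written as a Python expression.
steps = [
    "(7 + 3) * (8 + 12 / (4 - 2)) - 3",
    "(7 + 3) * (8 + 12 / 2) - 3",
    "10 * (8 + 12 / 2) - 3",
    "10 * (8 + 6) - 3",
    "10 * 14 - 3",
    "140 - 3",
    "137",
]

# Evaluating every step shows that each rewrite preserved the value.
values = [eval(step) for step in steps]
all_equal = all(value == 137 for value in values)
```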
2.2. Regular Expressions
Regular expressions are written in exactly the same manner,
using the precedences of the metacharacters recognized by
EGREP. These are shown in context as follows, grouped in
DECREASING order of precedence:
Metachar   Function
--------------------------------------------------------
(r)        Grouped regular expression 'r'
--------------------------------------------------------
c          Literal character 'c'
\c         Escape character 'c'
"s"        Literal string 's'
.          Any character
[s]        Character class
--------------------------------------------------------
r?         Optional regular expression 'r'
r*         Closure of regular expression 'r'
r+         Positive closure of regular expression 'r'
r{m,n}     Iteration of regular expression 'r'
--------------------------------------------------------
r1r2       Regular expression 'r1' followed by 'r2'
--------------------------------------------------------
r1|r2      Alternation of regular expressions 'r1'
           and 'r2'
--------------------------------------------------------
^          Beginning of line
$          End of line
--------------------------------------------------------
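Two rows of the table can be checked with a small Python
sketch: closure is applied before concatenation, and
concatenation before alternation, unless grouping says
otherwise. Python's re module is used as a stand-in for
EGREP, and the helper name whole is hypothetical.

```python
import re


def whole(pattern: str, s: str) -> bool:
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


# Closure is applied before concatenation: ab* means a(b*), not (ab)*.
closure_first = (whole(r"ab*", "abb"), whole(r"ab*", "abab"))
grouped_closure = (whole(r"(ab)*", "abb"), whole(r"(ab)*", "abab"))

# Concatenation is applied before alternation: ab|cd means (ab)|(cd).
concat_first = (whole(r"ab|cd", "ab"), whole(r"ab|cd", "acd"))
# Grouping has the highest precedence and changes the meaning.
grouped_alt = (whole(r"a(b|c)d", "ab"), whole(r"a(b|c)d", "acd"))
```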
Remember, you perform the operations in order of decreasing
precedence. As an example, consider the following regular
expression:
(ab|cd)?(ef)*
Remembering that each literal character can be considered a
regular expression 'r', this expression would be considered
by EGREP in the following manner (where 'rn' is a regular
expression):
(r1r2|r3r4)?(r5r6)*
(r7|r8)?(r9)*
r10?r9*
r11r12
r13
with EGREP then matching such whole strings as "abefef",
"efefef", "cdef" and "cd", but not "cdcd", "abc", "abcd" or
"abcdef".