        1.  Regular Expressions

            What, you  ask, are  regular expressions?  Considered in one
            sense, they  are no  more than  the wildcard  characters '*'
            (match  zero   or  more   characters)  and  '?'  (match  one
            character) that  are used  by many operating systems for the
            specification of  filenames.   However, they are vastly more
            powerful and flexible than these two simple examples.
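
            As a rough illustration of that relationship, the sketch
            below puts a filename wildcard next to a regular expression
            that accepts the same names.  It is written in Python, which
            is an assumption of this example and not part of EGREP; the
            syntax of Python's "re" module is close to, but not
            identical with, the syntax described in this manual.

```python
import fnmatch
import re

names = ["report.txt", "notes.txt", "report.doc", "a.txt"]

# Wildcard form: '*' stands for zero or more characters.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regular expression form: '.' is "any character", '*' is closure,
# and the literal period has to be escaped.
by_regex = [n for n in names if re.fullmatch(r".*\.txt", n)]

# Wildcard '?' means exactly one character; the regular expression
# equivalent is a single '.'.
one_char_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "?.txt")]
one_char_regex = [n for n in names if re.fullmatch(r".\.txt", n)]

print(by_wildcard, by_regex)
print(one_char_wildcard, one_char_regex)
```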

            Regular expressions consist of a combination of literal text
            characters and regular expression "metacharacters".  Here is
            the full  complement of  regular  expression  metacharacters
            recognized by EGREP:

                Symbol          Name
                --------------------

                \               Escape
                "               Quotation
                .               Any
                ^               Line begin (or character class negation)
                $               Line end
                [               Character class begin
                -               Character class range separator
                ]               Character class end
                (               Group begin
                )               Group end
                ?               Option
                *               Closure
                +               Positive closure
                {               Iteration begin
                ,               Iteration parameter separator
                }               Iteration end
                |               Alternation


            and the following are examples of their usage:

                Expression      Matches                         Example
                -------------------------------------------------------

                c               Any literal character           a
                \c              Character c literally           \*
                "s"             String s literally              "**"
                .               Any character but newline       a.b
                ^               Beginning of line               ^abc
                $               End of line                     abc$
                [s]             Any character in s              [abc]
                [^s]            Any character not in s          [^abc]
                r1r2            r1 followed by r2               ab
                r?              Zero or one r's                 a?
                r*              Zero or more r's                a*
                r+              One or more r's                 a+
                r{m,n}          m to n occurrences of r         a{2,5}
                r1|r2           r1 or r2                        a|b
                (r)             r                               (a|b)


            Still  confused?     Let's   discuss  each   of  the   above
            metacharacters in detail.


        1.1.  Literal Characters

            At its  simplest level,  a regular  expression  consists  of
            nothing but  literal characters.   Unless a character is one
            of the  regular expression  metacharacters shown  above,  it
            represents itself  as a  member of  the ASCII  character set
            when used in a regular expression.


        1.2.  Escape Characters

            There will  be times  when you  want to  include  a  regular
            expression  metacharacter  in  a  regular  expression  as  a
            literal character.   Also,  you may  want  to  include  non-
            printable characters.   This  is where escape characters are
            essential.

            EGREP recognizes the following escape characters:

                Expression      Matches         Hexadecimal Equivalent
                ------------------------------------------------------

                \n              newline                 0x0A
                \t              horizontal tab          0x09
                \b              backspace               0x08
                \r              carriage return         0x0D
                \f              formfeed                0x0C
                \ddd            octal digit             0x00 to 0x7F
                \xhhh           hexadecimal digit       0x00 to 0x7F
                \c              anything else           0x20 to 0x7E

                where:

                1. Each 'd' of an octal digit is a printable digit
                   from the character set "01234567".  There can be
                   between 1 and 3 digits.

                2. Each 'h' of a hexadecimal digit is a printable digit
                   from the character set "0123456789ABCDEFabcdef".  The
                   leading 'x' must be in lower case.  There can be
                   between 1 and 3 digits.

                3. 'c' is any printable value not included in the set of
                   escape characters shown above, including '\' itself
                   (i.e. - "\\").

                NOTE: Octal and hexadecimal escape characters are not
                      supported by UNIX EGREP.


            In general,  specifying characters in regular expressions as
            octal or  hexadecimal digits  is  implementation  dependent.
            Using EGREP  would not  produce the same results on a system
            using the  ASCII character  set and another using the EBCDIC
            character set, for example.

            byHeart Software's  implementation of  EGREP uses  the ASCII
            character  set.    This  means  that  you  can  specify  any
            character between 0x00 and 0x7F in a regular expression.
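
            The escape forms can be tried out with a short Python
            sketch.  Python is used here only for illustration and its
            rules differ in detail from EGREP: for instance, it expects
            exactly two hexadecimal digits after "\x", and it gives "\b"
            a different meaning outside character classes, so that
            escape is not shown.

```python
import re

# An escaped metacharacter is matched literally.
star = re.search(r"\*", "2*3")
backslash = re.search(r"\\", "C:\\DOS")

# Non-printable characters can be named by escape sequence.
tab = re.search(r"a\tb", "a\tb")

# Octal 101 and hexadecimal 41 both denote ASCII 'A' (0x41).
octal_a = re.fullmatch(r"\101", "A")
hex_a = re.fullmatch(r"\x41", "A")

print(star.group(), backslash.group(), tab is not None)
print(octal_a is not None, hex_a is not None)
```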

            There are  however two  special cases  to be considered: the
            "newline" character 0x0A and the NULL character 0x00.  EGREP
            will not  match any  pattern containing  a newline character
            unless that  character  is  the  last  one  in  the  regular
            expression. The  "line end" metacharacter ('$') (see Section
            1.6,  "Line   End")  is   more  commonly  used  in  such  an
            application.

            EGREP will accept the NULL character when it is specified in
            a  regular  expression.  However,  it  uses  this  character
            internally to  represent the  end of each text line it reads
            in for  processing. If  the line  contains a NULL character,
            EGREP will  consider it to be the end of the line and ignore
            the remaining characters when searching for pattern matches.
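
            The effect of a NULL character inside a line can be pictured
            with the following Python sketch.  The helper name
            "visible_part" is hypothetical, and the code only imitates
            the behaviour described above; it is not EGREP's own code.

```python
import re

def visible_part(line: str) -> str:
    """Return the portion of a line before the first NUL (0x00), if any."""
    return line.split("\x00", 1)[0]

line = "first half\x00second half"
seen = visible_part(line)

# Text before the NUL can be found; text after it is never examined.
found_before = re.search("first", seen) is not None
found_after = re.search("second", seen) is not None

print(repr(seen), found_before, found_after)
```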


        1.3.  Literal Strings

            Occasionally,  you   may  want   to  specify   a  string  of
            metacharacters in a regular expression.  One example of this
            would be to search for "*****".  Since it is inconvenient to
            express this  as "\*\*\*\*\*",  EGREP allows  you to specify
            literal   strings   of   characters   (either   literal   or
            metacharacters) by  enclosing them  in double  quote  marks.
            For example:



                "*** This is a literal string ***"

            WARNING: The  above example  will work  only if  the regular
            expression is  taken from a file using the '-f' command-line
            switch.  The double quote character ('"') is used by the MS-
            DOS command-line  processor to delimit quoted arguments (see
            Section 2.5,  "Entering Regular  Expressions").  If you want
            to use  it  as  a  regular  expression  metacharacter  in  a
            command-line argument,  you must  use its  escape  character
            equivalent, as in:

                "The use of \"*+?$\" as literals is possible"

            Escape  characters   are  also   recognized  inside  literal
            strings, so  that the  double  quote  metacharacter  can  be
            expressed as a literal character.  For example:

                "The '\"' symbol is a metacharacter."

            NOTE: Literal strings are not supported by UNIX EGREP.
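
            Python's "re" module, like UNIX EGREP, has no double quote
            construct.  As a sketch of the nearest equivalent (an
            assumption of this example, not an EGREP feature), its
            re.escape() function adds the backslashes for you:

```python
import re

text = "*** This is a literal string ***"

# re.escape() backslash-escapes every metacharacter in the string,
# which is the job the double quotes do in this EGREP.
pattern = re.escape(text)
hit = re.search(pattern, "note: *** This is a literal string *** end")

# Without escaping, the leading '*' has nothing to repeat, and
# Python rejects the pattern.
try:
    re.compile(text)
    rejected = False
except re.error:
    rejected = True

print(hit.group(), rejected)
```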


        1.4.  Any Character

            The "any"  metacharacter (a period) can be used to match any
            character except "newline".  For example:

                ab.cd

            will match "abccd", "ab cd", "ab.cd", and so on.
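
            A minimal Python sketch of the same expression follows.
            Python is assumed here purely for illustration; its "."
            also excludes the newline character by default.

```python
import re

pattern = re.compile(r"ab.cd")
candidates = ["abccd", "ab cd", "ab.cd", "abcd", "ab\ncd"]

# "abcd" is too short, and "ab\ncd" has a newline where '.' must match.
matches = [s for s in candidates if pattern.fullmatch(s)]

print(matches)
```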


        1.5.  Line Begin

            If  the  "line  begin"  metacharacter  ('^')  is  the  FIRST
            character in  a regular  expression, a string in a line will
            be recognized as matching the rest of the regular expression
            only if that string occurs at the beginning of the line.

            If the  metacharacter occurs  anywhere else  in the  regular
            expression outside  of a  character class  (see Section 1.7,
            "Character Classes"),  it is treated as a literal character.
            Thus:

                ^abc

            will match "abcdef" but not "aabcdef", and

                ab^c

            will match "ab^c", "xyzab^c" and so on.
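
            The two cases can be sketched in Python as follows.  Note
            one difference, which is a property of Python and not of
            EGREP: Python keeps the special meaning of '^' wherever it
            appears, so the literal case has to be written with an
            escape character.

```python
import re

# '^' as the first character anchors the match to the line start.
anchored = [s for s in ["abcdef", "aabcdef"] if re.search(r"^abc", s)]

# In Python a literal '^' in the middle of an expression is escaped.
literal = [s for s in ["ab^c", "xyzab^c", "abc"] if re.search(r"ab\^c", s)]

print(anchored, literal)
```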


        1.6.  Line End

            Analogous to  the "line begin" metacharacter, the "line end"
            metacharacter ('$')  is only  recognized as  such when it is
            the LAST character in a regular expression.  Thus:

                abc$

            will match "xyzabc" but not "abcdef", and

                a$bc

            will match "a$bc", "xyza$bc" and so on.
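
            Here is the corresponding Python sketch for '$'.  As with
            '^', Python (an assumption of this example) treats '$' as
            special in any position, so the literal case is escaped.

```python
import re

# '$' as the last character anchors the match to the line end.
ended = [s for s in ["xyzabc", "abcdef"] if re.search(r"abc$", s)]

# In Python a literal '$' in the middle of an expression is escaped.
literal = [s for s in ["a$bc", "xyza$bc", "abc"] if re.search(r"a\$bc", s)]

print(ended, literal)
```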


        1.7.  Character Classes

            This is  where the  expressive power  of regular expressions
            becomes apparent.   Whereas  the  "any"  metacharacter  will
            match any  literal character  except "newline",  a character
            class can  be used to specify any character in (or not in) a
            class of characters.  Here are some examples:

                [abc]       matches either 'a', 'b' or 'c'
                [^abc]      matches any character but 'a', 'b' or 'c'
                [a^bc]      matches 'a', 'b', 'c' or '^'
                [a-yD-M]    matches any character in the range of 'a' to
                            'y' or 'D' to 'M'
                [^a-r]      matches any character not in the range of
                            'a' to 'r'
                []ab]       matches either 'a', 'b' or ']'
                [-ab]       matches either 'a', 'b' or '-'
                [ab-]       matches either 'a', 'b' or '-'
                [^-ab]      matches any character but 'a', 'b' or '-'
                [\tb]       matches either 'b' or a horizontal tab
                [b\014]     matches either 'b' or 0x0C
                [*+$]       matches either '*', '+' or '$'

                NOTE: UNIX  EGREP does  not  support  escape  characters
                inside character classes.
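
            Several of these classes can be checked with a small Python
            sketch.  The helper "members" and the candidate string are
            hypothetical names chosen for this example; Python happens
            to accept the same class syntax for the cases shown, but it
            is not EGREP and may differ elsewhere.

```python
import re

def members(char_class: str, candidates: str) -> str:
    """Return, in order, the candidate characters that the class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

# a, b, c, D, ^, ], -, *, $, horizontal tab, formfeed (octal 014)
candidates = "abcD^]-*$\t\x0c"

classes = ["[abc]", "[^abc]", "[a^bc]", "[a-yD-M]", "[]ab]",
           "[-ab]", "[ab-]", r"[\tb]", r"[b\014]", "[*+$]"]

results = {cc: members(cc, candidates) for cc in classes}

for cc in classes:
    print(cc, repr(results[cc]))
```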


            Within  a  character  class,  only  the  metacharacters  '^'
            (character class  negation), '-' (character class range) and
            '\' (escape character) are recognized.

            When the "character class negation" ('^') metacharacter
            appears as the FIRST character after the "character class
            begin" ('[') metacharacter, it indicates that the character
            class is to be negated.  In other words, the class matches
            any single character that is NOT in the remainder of the
            character class string.

            If the  "character  class  negation"  metacharacter  appears
            anywhere else  in the  character class string, it is treated
            as a literal character.

            It is often convenient to specify a range of contiguous
            characters from the character set with the "character range"
            metacharacter ('-').  For example, the character class

                [abcdefghijklmn]

            can be more clearly and easily stated as:

                [a-n]

            If the  "character range" metacharacter is the FIRST or LAST
            character in  the character  class  string,  or  immediately
            follows a  "character class  negation" metacharacter,  it is
            taken to be a literal character.

            Since character ranges are not the same for different
            character sets (for example, "0-z" in ASCII is many more
            characters than it is in EBCDIC), EGREP will issue an
            "implementation dependent" warning message for any character
            range where the low and high characters are not both
            uppercase alphabetic, both lowercase alphabetic or both
            digit characters.
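
            The rule behind that warning can be sketched in Python.  The
            function name "range_is_portable" is hypothetical and the
            code is only one reading of the rule stated above, not
            EGREP's actual test.

```python
import string

def range_is_portable(low: str, high: str) -> bool:
    """True if both ends are uppercase letters, both are lowercase
    letters, or both are digits; otherwise a warning would be due."""
    if len(low) != 1 or len(high) != 1:
        raise ValueError("range ends must be single characters")
    groups = (string.ascii_uppercase, string.ascii_lowercase, string.digits)
    return any(low in g and high in g for g in groups)

# Number of characters that "0-z" spans in ASCII.
ascii_span = ord("z") - ord("0") + 1

print(range_is_portable("a", "n"), range_is_portable("0", "z"), ascii_span)
```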

            Escape characters  have their usual meaning inside character
            classes.

            Finally,  a   "character  class   end"  metacharacter  (']')
            appearing  immediately   after  a  "character  class  begin"
            metacharacter is  taken to be a literal character within the
            character class string.  It is illegal to specify an "empty"
            character class (i.e. - "[]").


        1.8.  Grouping of Regular Expressions

            So far  we have considered only single characters in regular
            expressions, where  each character follows the preceding one
            in the  expression.   In the parlance of computer scientists
            and linguists,  we say  that each  of  these  characters  is
            "concatenated" with the preceding regular expression.

            There will  be times,  however, when we will want to specify
            not single  characters but  regular expressions as part of a
            regular expression.   Stepping  ahead a  bit, the  "closure"
            metacharacter ('*')  allows you  to  specify  zero  or  more
            occurrences of the immediately preceding regular expression.
            As an example:

                ab*

            will match "a", "ab", "abb", "abbb" and so on.

            Suppose however  that we  only want to match strings such as
            "aab", "aabab", "aababab" and so on.  To do this, we use:

                a(ab)*

            where the  string inside  the "group begin" ('(') and "group
            end" (')') metacharacters is taken as a regular expression.

            Grouping is  recursive.   That is,  we can  specify  regular
            expressions within  regular expressions  by means  of nested
            groupings.  Thus, as an example:

                a(b(cd)*)*

            is legal, and will match "a", "ab", "abcdbbcd" and so on.
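
            The effect of grouping can be sketched in Python (assumed
            here for illustration only).  Each list keeps the candidate
            strings that the expression matches in full.

```python
import re

# Closure applies to the single character 'b'.
flat = [s for s in ["a", "ab", "abb", "aab"]
        if re.fullmatch(r"ab*", s)]

# Closure applies to the whole group "ab".
grouped = [s for s in ["a", "aab", "aabab", "aababab", "abb"]
           if re.fullmatch(r"a(ab)*", s)]

# Nested groups: zero or more "b" units, each followed by zero or
# more "cd" units.
nested = [s for s in ["a", "ab", "abcdbbcd", "acd"]
          if re.fullmatch(r"a(b(cd)*)*", s)]

print(flat, grouped, nested)
```

            Note that a closure also allows zero occurrences, so the
            grouped expression accepts a lone "a" as well.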


        1.9.  Alternation

            Suppose we  want to  specify  that  our  regular  expression
            should match  either "ab"  or "cd".   If  these were  single
            characters, we  could use  a character  class.  However, for
            regular  expressions   we   must   use   the   "alternation"
            metacharacter ('|'), such as in:

                ab|cd

            which will match either "ab" or "cd".

            Note carefully that the two regular expressions shown above
            did not have to be grouped.  The "alternation" metacharacter
            has a "lower precedence" than that of concatenated literal
            characters.  See Section 2, "Metacharacter Precedence", for
            a full explanation.
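
            The difference that precedence makes can be sketched in
            Python, which is assumed here for illustration and follows
            the same rule for alternation and concatenation.

```python
import re

candidates = ["ab", "cd", "abd", "acd"]

# Concatenation binds tighter than '|', so this is "ab" or "cd".
plain = [s for s in candidates if re.fullmatch(r"ab|cd", s)]

# The same expression with the implied groups written out.
explicit = [s for s in candidates if re.fullmatch(r"(ab)|(cd)", s)]

# Grouping the alternation instead changes the meaning entirely.
inner = [s for s in candidates if re.fullmatch(r"a(b|c)d", s)]

print(plain, explicit, inner)
```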


        1.10. Optional Expressions

            The "option"  metacharacter  ('?')  specifies  zero  or  one
            occurrences of  an immediately preceding regular expression.
            Thus:

                a(bc)?d?

            matches "a", "abc", "ad" and "abcd" only.
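
            The word "only" can be checked by brute force.  The Python
            sketch below (an illustration, not part of EGREP) tries
            every string of up to four letters drawn from "abcd" and
            keeps those that the expression matches in full.

```python
import itertools
import re

pattern = re.compile(r"a(bc)?d?")

accepted = []
for length in range(0, 5):
    for letters in itertools.product("abcd", repeat=length):
        candidate = "".join(letters)
        if pattern.fullmatch(candidate):
            accepted.append(candidate)

print(accepted)
```

            Four letters are enough because the expression cannot match
            anything longer than "abcd".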


        1.11. Repeated Expressions

            The "closure"  metacharacter ('*')  specifies zero  or  more
            occurrences of  an immediately preceding regular expression.
            Thus:

                ab*

            matches "a", "ab", "abbb" and so on, while

                a(bc)*

            matches "a", "abc", "abcbcbcbc" and so forth. Similarly,

                [a-m]*

            matches ""  (no character),  "c", "cmdgijal"  and in general
            any string that contains only letters between "a" and "m".

            Closely related  is  the  "positive  closure"  metacharacter
            ('+'),  which  specifies  one  or  more  occurrences  of  an
            immediately preceding regular expression.  As an example:

                a(bc)+

            matches "abc", "abcbc", "abcbcbcbc" and so forth.
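
            The two closures can be compared side by side in a Python
            sketch (Python being an assumption of this example):

```python
import re

# Closure: zero or more occurrences of the group.
star = [s for s in ["a", "abc", "abcbcbcbc", "ab"]
        if re.fullmatch(r"a(bc)*", s)]

# Positive closure: at least one occurrence is required.
plus = [s for s in ["a", "abc", "abcbc", "abcbcbcbc"]
        if re.fullmatch(r"a(bc)+", s)]

# Closure of a character class, including the empty string.
letters = [s for s in ["", "c", "cmdgijal", "cat"]
           if re.fullmatch(r"[a-m]*", s)]

print(star, plus, letters)
```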

            Finally, there is the "iteration construct".  Suppose you
            want to search for these strings only: "abab", "ababab" and
            "abababab".  You could use the regular expression:


                abab|ababab|abababab

            However, a simpler way to write this would be:

                (ab){2,4}

            which simply  specifies that  the  regular  expression  will
            match from two to four occurrences of the regular expression
            "ab". The general format of the iteration construct is:

                r{m,n}

            where 'r'  is a  regular expression, 'm' is the least number
            of occurrences  of 'r'  that will be matched, and 'n' is the
            greatest number.   The value of 'm' must be between zero and
            254, while  the value  of 'n'  must be  between one and 255.
            Further, 'm' must always be less than 'n'.
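
            As a sketch, the Python code below checks that the short
            and long forms accept the same strings, and restates the
            limits on 'm' and 'n' as a function.  The name
            "check_bounds" is hypothetical, and Python's own iteration
            construct does not impose these particular limits.

```python
import re

def check_bounds(m: int, n: int) -> bool:
    """Apply the limits stated above: 0 <= m <= 254, 1 <= n <= 255, m < n."""
    return 0 <= m <= 254 and 1 <= n <= 255 and m < n

short_form = re.compile(r"(ab){2,4}")
long_form = re.compile(r"abab|ababab|abababab")

# "", "ab", "abab", ... up to six repetitions of "ab".
candidates = ["ab" * k for k in range(0, 7)]
by_short = [s for s in candidates if short_form.fullmatch(s)]
by_long = [s for s in candidates if long_form.fullmatch(s)]

print(by_short == by_long, by_short)
print(check_bounds(2, 4), check_bounds(4, 2), check_bounds(0, 256))
```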

            A word of warning, however.  UNIX EGREP does not support the
            iteration construct,  and for  a  good  reason:  EGREP  must
            consume  inordinate   amounts  of  memory  in  building  its
            internal state  machine tables  to represent  the  iteration
            construct.   Do not  be surprised  if you see "ERROR: out of
            memory" for even moderate values of 'm' and 'n'.
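
            One way to get a feel for the cost is to write an iteration
            out longhand.  The Python sketch below is an assumption
            made for illustration and says nothing about how EGREP
            really builds its tables; the function "expand_iteration"
            is hypothetical.  It rewrites r{m,n} as 'm' required copies
            of 'r' followed by 'n' minus 'm' optional copies, and shows
            how quickly the expression grows.

```python
import re

def expand_iteration(r: str, m: int, n: int) -> str:
    """Rewrite r{m,n} as m copies of (r) followed by n - m copies of (r)?."""
    return f"({r})" * m + f"({r})?" * (n - m)

expanded = expand_iteration("ab", 2, 4)

# The longhand form accepts the same strings as the iteration.
same = all(
    bool(re.fullmatch(expanded, "ab" * k))
    == bool(re.fullmatch(r"(ab){2,4}", "ab" * k))
    for k in range(0, 7)
)

# Length of the longhand expression for growing values of 'n'.
sizes = {n: len(expand_iteration("ab", 1, n)) for n in (4, 16, 64, 255)}

print(expanded, same, sizes)
```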


        2.  Metacharacter Precedence

        2.1.  Simple Arithmetic Precedences

            You have seen metacharacter precedence before.  In simple
            arithmetic, the multiplication and division metacharacters
            '*' and '/' have a higher precedence than the addition and
            subtraction metacharacters '+' and '-'.  Finally, grouping
            has the highest precedence of all.

            Do you see the relationship to regular expressions?  We even
            have concatenation of digits to form "numbers".

            Let's look  at an  example.   When calculating the result of
            the arithmetic expression:

                ( 7 + 3 ) * ( 8 + 12 / ( 4 - 2 ) ) - 3

            we first look for the most deeply nested group, in this case
            "( 4 - 2 )".  We solve this and get:

                ( 7 + 3 ) * ( 8 + 12 / 2 ) - 3

            We again  look for groupings, and find two at the same level
            of nesting.  We solve the first one to get:

                10 * ( 8 + 12 / 2 ) - 3

            For the  remaining group,  division  takes  precedence  over
            addition, and so we get:

                10 * ( 8 + 6 ) - 3

            and then

                10 * 14 - 3

            We no longer have any groups to consider, but multiplication
            takes precedence over subtraction, and so we get:

                140 - 3

            and finally

                137

            as our answer.
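
            The same reduction can be replayed in Python, which follows
            the usual arithmetic precedences.  Python's integer division
            operator '//' stands in for '/' here (a choice made for this
            sketch) so that every step stays a whole number.

```python
# Each entry is one step of the reduction shown above.
steps = [
    (7 + 3) * (8 + 12 // (4 - 2)) - 3,
    (7 + 3) * (8 + 12 // 2) - 3,
    10 * (8 + 12 // 2) - 3,
    10 * (8 + 6) - 3,
    10 * 14 - 3,
    140 - 3,
    137,
]

# Dropping the groups changes the order of evaluation and the result.
ungrouped = 7 + 3 * 8 + 12 // 4 - 2 - 3

print(steps, ungrouped)
```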

        2.2.  Regular Expressions

            Regular expressions  are written in exactly the same manner,
            using the  precedences of  the metacharacters  recognized by
            EGREP.   These are  shown in  context as follows, grouped in
            DECREASING order of precedence:

                Metachar      Function
                -----------------------

                --------------------------------------------------------
                (r)           Grouped regular expression 'r'
                --------------------------------------------------------
                c             Literal character 'c'
                \c            Escape character 'c'
                "s"           Literal string 's'
                .             Any character
                [s]           Character class
                --------------------------------------------------------
                r?            Optional regular expression 'r'
                r*            Closure of regular expression 'r'
                r+            Positive closure of regular expression 'r'
                r{m,n}        Iteration of regular expression 'r'
                --------------------------------------------------------
                r1r2          Regular expression 'r1' followed by 'r2'
                --------------------------------------------------------
                r1|r2         Alternation of regular expressions 'r1'
                              and 'r2'
                --------------------------------------------------------
                ^             Beginning of line
                $             End of line
                --------------------------------------------------------


            Remember, you  perform the operations in order of decreasing
            precedence. As  an example,  consider the  following regular
            expression:

                (ab|cd)?(ef)*

            Remembering that  each literal character can be considered a
            regular expression  'r', this expression would be considered
            by EGREP  in the  following manner  (where 'rn' is a regular
            expression):

                (r1r2|r3r4)?(r5r6)*

                (r7|r8)?(r9)*


                r10?r9*

                r11r12

                r13

            with EGREP then matching such strings as "abefef", "efefef",
            "cdef" and "cd", but not "abc", "abcd", or "abcdef".