-
- Syntax of Regular Expressions
- =============================
-
- Regular expressions have a syntax in which a few characters are special
- constructs and the rest are "ordinary". An ordinary character is a simple
- regular expression which matches that character and nothing else. The
- special characters are `$', `^', `.', `*', `+', `?', `[', `]' and `\'; no
- new special characters will be defined. Any other character appearing in a
- regular expression is ordinary, unless a `\' precedes it.
-
- For example, `f' is not a special character, so it is ordinary, and
- therefore `f' is a regular expression that matches the string `f' and no
- other string. (It does not match the string `ff'.) Likewise, `o' is a
- regular expression that matches only `o'.
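
The point about ordinary characters can be checked with a small sketch. It uses Python's `re` module as a stand-in, not the Emacs matcher, and `matches_whole` is a hypothetical helper introduced only for this illustration.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None

# An ordinary character matches itself and nothing else.
for candidate in ["f", "ff", "o"]:
    print(candidate, matches_whole("f", candidate))
```

Here only the string `f' is accepted by the pattern `f'; `ff' and `o' are rejected.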
-
- Any two regular expressions A and B can be concatenated. The result is a
- regular expression which matches a string if A matches some amount of the
- beginning of that string and B matches the rest of the string.
-
- As a simple example, we can concatenate the regular expressions `f'
- and `o' to get the regular expression `fo', which matches only
- the string `fo'. Still trivial. To do something nontrivial, you
- need to use one of the special characters. Here is a list of them.
-
- `. (Period)'
- is a special character that matches any single character except a
- newline. Using concatenation, we can make regular expressions like
- `a.b', which matches any three-character string that begins with `a',
- ends with `b', and has no newline in the middle.
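
As a rough illustration of the period, here is a sketch using Python's `re` module, whose `.' also excludes newline by default. This is an assumption-laden stand-in for the Emacs matcher, and `matches_whole` is a hypothetical helper.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None

# `a.b' needs exactly one non-newline character between `a' and `b'.
for candidate in ["axb", "a b", "ab", "axxb", "a\nb"]:
    print(repr(candidate), matches_whole("a.b", candidate))
```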
-
- `*'
- is not a construct by itself; it is a suffix, which means the
- preceding regular expression is to be repeated as many times as
- possible. In `fo*', the `*' applies to the `o', so `fo*' matches one
- `f' followed by any number of `o's. The case of zero `o's is allowed:
- `fo*' does match `f'.
-
- `*' always applies to the smallest possible preceding expression.
- Thus, `fo*' has a repeating `o', not a repeating `fo'.
-
- The matcher processes a `*' construct by matching, immediately, as
- many repetitions as can be found. Then it continues with the rest of
- the pattern. If that fails, backtracking occurs, discarding some of
- the matches of the `*'-modified construct in case that makes it
- possible to match the rest of the pattern. For example, matching
- `ca*ar' against the string `caaar', the `a*' first tries to match all
- three `a's; but the rest of the pattern is `ar' and there is only `r'
- left to match, so this try fails. The next alternative is for `a*' to
- match only two `a's. With this choice, the rest of the regexp matches
- successfully.
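
The backtracking described above can be observed with a short sketch. It uses Python's `re` module rather than the Emacs matcher, and Python writes grouping as `( ... )', so the group here is added only to record what the starred `a' ended up matching.

```python
import re

# `fo*': one `f' followed by any number of `o's, including none.
for text in ["f", "fo", "fooo"]:
    print(text, re.fullmatch(r"fo*", text) is not None)

# `ca*ar' against `caaar': the starred `a' first takes all three `a's,
# then gives one back so that the trailing `ar' can match.
m = re.fullmatch(r"c(a*)ar", "caaar")
kept = m.group(1)
print(kept)
```

After the match succeeds, `kept` holds the two `a's that the starred construct retained.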
-
- `+'
- is a suffix character similar to `*' except that it requires that the
- preceding expression be matched at least once. So, for example,
- `ca+r' will match the strings `car' and `caaaar' but not the string
- `cr', whereas `ca*r' would match all three strings.
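
A minimal sketch comparing `+' with `*', using Python's `re` module as a stand-in for the Emacs matcher. The name `results` is introduced only for this example.

```python
import re

strings = ["car", "caaaar", "cr"]

# For each pattern, record which of the three strings it matches in full.
results = {
    pattern: [s for s in strings if re.fullmatch(pattern, s)]
    for pattern in (r"ca+r", r"ca*r")
}
print(results)
```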
-
- `?'
- is a suffix character similar to `*' except that it can match the
- preceding expression either once or not at all. For example,
- `ca?r' will match `car' or `cr'; nothing else.
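
The same kind of check works for `?'. This is a sketch in Python's `re` module, not the Emacs matcher, and `accepted` is a name made up for the example.

```python
import re

# `ca?r' allows the `a' once or not at all.
accepted = [s for s in ["cr", "car", "caar"] if re.fullmatch(r"ca?r", s)]
print(accepted)
```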
-
- `[ ... ]'
- `[' begins a "character set", which is terminated by a `]'. In the
- simplest case, the characters between the two form the set. Thus,
- `[ad]' matches either one `a' or one `d', and `[ad]*' matches any
- string composed of just `a's and `d's (including the empty string),
- from which it follows that `c[ad]*r' matches `cr', `car', `cdr',
- `caddaar', etc.
-
- Character ranges can also be included in a character set, by writing
- two characters with a `-' between them. Thus, `[a-z]' matches any
- lower-case letter. Ranges may be intermixed freely with individual
- characters, as in `[a-z$%.]', which matches any lower-case letter or
- `$', `%' or period.
-
- Note that the usual special characters are not special any more inside
- a character set. A completely different set of special characters
- exists inside character sets: `]', `-' and `^'.
-
- To include a `]' in a character set, you must make it the first
- character. For example, `[]a]' matches `]' or `a'. To include a `-',
- write `---', which is a range containing only `-'. To include `^',
- make it other than the first character in the set.
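
The character-set rules above can be tried out in Python's `re` module, which happens to follow the same conventions for the cases shown here. It is still a stand-in for the Emacs matcher, and `in_set` is a hypothetical helper.

```python
import re

def in_set(char_set: str, text: str) -> bool:
    """Hypothetical helper: True if `char_set` matches all of `text`."""
    return re.fullmatch(char_set, text) is not None

# A set followed by `*' matches any run of its members.
cxr = [s for s in ["cr", "car", "cdr", "caddaar", "cabr"]
       if in_set(r"c[ad]*r", s)]
print(cxr)

# Ranges mixed with single characters; `$' and `.' are ordinary in a set.
print([c for c in "q$%.Q7" if in_set(r"[a-z$%.]", c)])

# `]' placed first is a member of the set; `^' placed later is ordinary.
print(in_set(r"[]a]", "]"), in_set(r"[a^]", "^"))
```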
-
- `[^ ... ]'
- `[^' begins a "complement character set", which matches any character
- except the ones specified. Thus, `[^a-z0-9A-Z]' matches all
- characters except letters and digits.
-
- `^' is not special in a character set unless it is the first
- character. The character following the `^' is treated as if it
- were first (`-' and `]' are not special there).
-
- Note that a complement character set can match a newline, unless
- newline is mentioned as one of the characters not to match.
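
A small sketch of a complement set, again using Python's `re` module as an assumed stand-in for the Emacs matcher. The name `non_alnum` is introduced for the example.

```python
import re

non_alnum = re.compile(r"[^a-z0-9A-Z]")

# Letters and digits are excluded; everything else matches, newline included.
for ch in ["a", "Z", "5", "!", " ", "\n"]:
    print(repr(ch), non_alnum.fullmatch(ch) is not None)
```

Note that the newline is accepted, unlike with `.', because the set does not list it.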
-
- `^'
- is a special character that matches the empty string, but only if at
- the beginning of a line in the text being matched. Otherwise it fails
- to match anything. Thus, `^foo' matches a `foo' which occurs
- at the beginning of a line.
-
- `$'
- is similar to `^' but matches only at the end of a line. Thus,
- `xx*$' matches a string of one `x' or more at the end of a line.
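
The line anchors can be sketched in Python's `re` module. One assumption to note: Python anchors `^' and `$' at line boundaries only when `re.MULTILINE` is given, so the flag is used here to approximate the behavior described above.

```python
import re

text = "foo one\nbar foo\nfoo xxx"

# Offsets where `foo' begins a line; the `foo' inside line two is skipped.
line_starts = [m.start() for m in re.finditer(r"^foo", text, re.MULTILINE)]
print(line_starts)

# A run of one or more `x's that ends a line.
trailing_x = re.search(r"xx*$", text, re.MULTILINE)
print(trailing_x.group())
```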
-
- `\'
- has two functions: it quotes the special characters (including
- `\'), and it introduces additional special constructs.
-
- Because `\' quotes special characters, `\$' is a regular expression
- which matches only `$', and `\[' is a regular expression which matches
- only `[', and so on.
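
Quoting with `\' can be sketched as follows in Python's `re` module, used here as a stand-in for the Emacs matcher. `re.escape` is a Python convenience that quotes special characters for you; it is not part of the syntax described in this node.

```python
import re

# A quoted special character matches only itself.
dollar = re.fullmatch(r"\$", "$")
bracket = re.fullmatch(r"\[", "[")
print(dollar is not None, bracket is not None)

# Quoting the period turns `a.b' into a pattern for the literal text.
literal = re.escape("a.b")
print(re.fullmatch(literal, "a.b") is not None)
print(re.fullmatch(literal, "axb") is not None)
```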
-
- Note: for historical compatibility, special characters are treated as
- ordinary ones if they are in contexts where their special meanings make no
- sense. For example, `*foo' treats `*' as ordinary since there is no
- preceding expression on which the `*' can act. It is poor practice to
- depend on this behavior; better to quote the special character anyway,
- regardless of where it appears.
-
- For the most part, `\' followed by any character matches only
- that character. However, there are several exceptions: characters
- which, when preceded by `\', are special constructs. Such
- characters are always ordinary when encountered on their own. Here
- is a table of `\' constructs.
-
- `\|'
- specifies an alternative. Two regular expressions A and B with `\|'
- in between form an expression that matches anything that either A or B
- will match.
-
- Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
-
- `\|' applies to the largest possible surrounding expressions. Only a
- surrounding `\( ... \)' grouping can limit the grouping power of `\|'.
-
- Full backtracking capability exists to handle multiple uses of `\|'.
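
A sketch of alternation and its scope. It uses Python's `re` module, where the operator is written `|' and grouping is written `( ... )' instead of `\|' and `\( ... \)'; the behavior shown is otherwise meant to mirror the description above.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if `pattern` matches all of `text`."""
    return re.fullmatch(pattern, text) is not None

# The alternative spans the whole of `foo' and the whole of `bar'.
print([s for s in ["foo", "bar", "fobar", "foobar"]
       if matches_whole(r"foo|bar", s)])

# A group narrows the alternative to the single characters `o' and `b'.
print([s for s in ["fooar", "fobar", "foo"]
       if matches_whole(r"fo(o|b)ar", s)])
```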
-
- `\( ... \)'
- is a grouping construct that serves three purposes:
-
- 1. To enclose a set of `\|' alternatives for other operations.
- Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
-
- 2. To enclose a complicated expression for the postfix `*' to
- operate on. Thus, `ba\(na\)*' matches `bananana', etc., with any
- (zero or more) number of `na' strings.
-
- 3. To mark a matched substring for future reference.
-
-
- This last application is not a consequence of the idea of a
- parenthetical grouping; it is a separate feature which happens to be
- assigned as a second meaning to the same `\( ... \)' construct
- because there is no conflict in practice between the two meanings.
- Here is an explanation of this feature:
-
- `\DIGIT'
- After the end of a `\( ... \)' construct, the matcher remembers the
- beginning and end of the text matched by that construct. Then, later
- on in the regular expression, you can use `\' followed by DIGIT to
- mean "match the same text that the DIGITth `\( ... \)' construct
- matched."
-
- The strings matching the first nine `\( ... \)' constructs appearing
- in a regular expression are assigned numbers 1 through 9 in the order that the
- open-parentheses appear in the regular expression. `\1' through
- `\9' may be used to refer to the text matched by the corresponding
- `\( ... \)' construct.
-
- For example, `\(.*\)\1' matches any newline-free string that is
- composed of two identical halves. The `\(.*\)' matches the first
- half, which may be anything, but the `\1' that follows must match
- the same exact text.
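
The three uses of grouping can be sketched together. The example uses Python's `re` module as a stand-in: Python writes groups as `( ... )' rather than `\( ... \)', while the back-reference `\1' is spelled the same way.

```python
import re

# 1. Grouping a set of alternatives.
print([s for s in ["foox", "barx", "foo"] if re.fullmatch(r"(foo|bar)x", s)])

# 2. Grouping a longer expression for the postfix `*'.
print([s for s in ["ba", "bana", "bananana", "banan"]
       if re.fullmatch(r"ba(na)*", s)])

# 3. Referring back to what the first group matched.
doubled = re.fullmatch(r"(.*)\1", "abcabc")
first_half = doubled.group(1)
print(first_half)
print(re.fullmatch(r"(.*)\1", "abcabd") is not None)
```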
-
- `\`'
- matches the empty string, provided it is at the beginning
- of the buffer.
-
- `\''
- matches the empty string, provided it is at the end of
- the buffer.
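
Python's `re` module has no backquote or apostrophe constructs, but its `\A' and `\Z' play a comparable role if the whole subject string is treated as the buffer. The following sketch rests on that assumption and contrasts them with the line anchor `^'.

```python
import re

buffer_text = "foo\nbar\nfoo"

# `^' (with re.MULTILINE) anchors at every line start; `\A' only at offset 0.
at_line_starts = [m.start()
                  for m in re.finditer(r"^foo", buffer_text, re.MULTILINE)]
at_buffer_start = [m.start()
                   for m in re.finditer(r"\Afoo", buffer_text, re.MULTILINE)]
print(at_line_starts, at_buffer_start)

# `\Z' anchors only at the very end of the text.
at_buffer_end = re.search(r"foo\Z", buffer_text, re.MULTILINE)
print(at_buffer_end.start())
```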
-
- `\b'
- matches the empty string, provided it is at the beginning or end of a
- word. Thus, `\bfoo\b' matches any occurrence of `foo' as a separate
- word. `\bballs?\b' matches `ball' or `balls' as a separate word.
-
- `\B'
- matches the empty string, provided it is not at the beginning or
- end of a word.
-
- `\<'
- matches the empty string, provided it is at the beginning of a word.
-
- `\>'
- matches the empty string, provided it is at the end of a word.
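
Word boundaries can be sketched in Python's `re` module, which supports `\b' and `\B' but has no `\<' or `\>'. Python's idea of a word character is fixed by the module rather than by an editor syntax table, so this is only an approximation of the Emacs behavior.

```python
import re

# `\bfoo\b' finds `foo' only as a separate word.
print(re.search(r"\bfoo\b", "a foo b") is not None)
print(re.search(r"\bfoo\b", "foobar") is not None)

# `\bballs?\b' accepts `ball' or `balls' as a word, not `ballsy'.
words = [w for w in ["ball", "balls", "ballsy"]
         if re.search(r"\bballs?\b", w)]
print(words)

# `\B' requires that there be no word boundary at that point.
print(re.search(r"\Bfoo\B", "xfooy") is not None)
print(re.search(r"\Bfoo\B", "foo") is not None)
```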
-
- `\w'
- matches any word-constituent character. The editor syntax table
- determines which characters these are.
-
- `\W'
- matches any character that is not a word-constituent.
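
For comparison, here is how `\w' and `\W' behave in Python's `re` module. The set of word characters there is built in (letters, digits and underscore), so this sketch only approximates a matcher that consults a syntax table.

```python
import re

# Runs of word-constituent characters.
word_runs = re.findall(r"\w+", "foo-bar baz_1")
print(word_runs)

# Single characters that are not word constituents.
separators = re.findall(r"\W", "a-b c")
print(separators)
```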
-
- `\sCODE'
- matches any character whose syntax is CODE. CODE is a character which
- represents a syntax code: thus, `w' for word constituent, `-' for
- whitespace, `(' for open-parenthesis, etc. *Note Syntax::.
-
- `\SCODE'
- matches any character whose syntax is not CODE.
-
- Here is a complicated regexp, used by Emacs to recognize the end of a
- sentence together with any whitespace that follows. It is given in Lisp
- syntax to enable you to distinguish the spaces from the tab characters. In
- Lisp syntax, the string constant begins and ends with a double-quote.
- `\"' stands for a double-quote as part of the regexp, `\\' for a
- backslash as part of the regexp, `\t' for a tab and `\n' for a
- newline.
-
- "[.?!][]\"')]*\\($\\|\t\\| \\)[ \t\n]*"
-
- This contains four parts in succession: a character set matching period,
- `?' or `!'; a character set matching close-brackets,
- quotes or parentheses, repeated any number of times; an alternative in
- backslash-parentheses that matches end-of-line, a tab or two spaces; and a
- character set matching whitespace characters, repeated any number of times.
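
The four parts can be exercised with a translation into Python's `re` syntax. This is a sketch under stated assumptions, not the regexp as Emacs uses it: the grouping and alternation are rewritten as `( ... )' and `|', `re.MULTILINE` is added so that `$' matches at end of line, and the names `SENTENCE_END` and `sample` are made up for the example.

```python
import re

# Punctuation, optional closers, then end-of-line, a tab or two spaces,
# then any further whitespace.
SENTENCE_END = re.compile(r"""[.?!][]"')]*($|\t|  )[ \t\n]*""", re.MULTILINE)

sample = 'He said "Stop!"  Then left.\nNext'
ends = [m.group(0) for m in SENTENCE_END.finditer(sample)]
print(ends)

# A period followed by a single space is not treated as a sentence end.
print(SENTENCE_END.search("Mr. Smith") is not None)
```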
-
-
-