#include <nihcl/Regex.h>
Class Regex provides a pattern matching and searching facility for character strings. Patterns are described by regular expressions written using the same notation that the GNU Emacs text editor uses. In fact, class Regex is just a C++ interface to the actual GNU Emacs regular expression routines. This is handy, because you can try out a regular expression in the Emacs editor before coding it in your program.
Note: this section is adapted from the GNU Emacs Reference Manual.
Regular expressions have a syntax in which a few characters are special constructs and the rest are ordinary. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are `$', `^', `.', `*', `+', `?', `[', `]' and `\'; no new special characters will be defined. Any other character appearing in a regular expression is ordinary, unless a `\' precedes it.
For example, `f' is not a special character, so it is ordinary, and therefore `f' is a regular expression that matches the string `f' and no other string. (It does not match the string `ff'.) Likewise, `o' is a regular expression that matches only `o'.
Any two regular expressions a and b can be concatenated. The result is a regular expression which matches a string if a matches some amount of the beginning of that string and b matches the rest of the string.
As a simple example, we can concatenate the regular expressions `f' and `o' to get the regular expression `fo', which matches only the string `fo'. Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.
`*' always applies to the smallest possible preceding expression. Thus, `fo*' has a repeating `o', not a repeating `fo'.
The matcher processes a `*' construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the `*'-modified construct in case that makes it possible to match the rest of the pattern. For example, matching `ca*ar' against the string `caaar', the `a*' first tries to match all three `a's; but the rest of the pattern is `ar' and there is only `r' left to match, so this try fails. The next alternative is for `a*' to match only two `a's. With this choice, the rest of the regexp matches successfully.
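The greedy-then-backtrack behaviour described above can be sketched with a toy matcher. This is an illustration only, not the GNU Emacs routine: `toyMatch`, `matchHere`, and `atomMatches` are hypothetical names, and the sketch handles only ordinary characters, `.', and a postfix `*', and requires the whole string to match.

```cpp
#include <cstddef>
#include <string>

// Toy illustration, not GNU regex code: one pattern atom against one character.
// '.' is taken to match any character except newline.
bool atomMatches(char atom, char c) {
    return atom == '.' ? c != '\n' : atom == c;
}

// Match pat[p..] against text[t..]; the whole remaining text must be consumed.
bool matchHere(const std::string& pat, std::size_t p,
               const std::string& text, std::size_t t) {
    if (p == pat.size()) return t == text.size();
    const char atom = pat[p];
    if (p + 1 < pat.size() && pat[p + 1] == '*') {
        // Greedy step: take as many repetitions as can be found.
        std::size_t n = 0;
        while (t + n < text.size() && atomMatches(atom, text[t + n])) ++n;
        // Backtracking step: give back one repetition at a time
        // until the rest of the pattern matches or none are left.
        for (;;) {
            if (matchHere(pat, p + 2, text, t + n)) return true;
            if (n == 0) return false;
            --n;
        }
    }
    return t < text.size() && atomMatches(atom, text[t]) &&
           matchHere(pat, p + 1, text, t + 1);
}

bool toyMatch(const std::string& pat, const std::string& text) {
    return matchHere(pat, 0, text, 0);
}
```

With `ca*ar' and `caaar', the loop first tries three `a's, fails on the remaining `ar', then succeeds with two.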
Character ranges can also be included in a character set, by writing two characters with a `-' between them. Thus, `[a-z]' matches any lower-case letter. Ranges may be intermixed freely with individual characters, as in `[a-z$%.]', which matches any lower case letter or `$', `%' or period.
Note that the usual special characters are not special any more inside a character set. A completely different set of special characters exists inside character sets: `]', `-' and `^'.
To include a `]' in a character set, you must make it the first character. For example, `[]a]' matches `]' or `a'. To include a `-', write `---', which is a range containing only `-'. To include `^', make it other than the first character in the set.
`^' is not special in a character set unless it is the first character. The character following the `^' is treated as if it were first (`-' and `]' are not special there).
Note that a complement character set can match a newline, unless newline is mentioned as one of the characters not to match.
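As a quick check of the character-set rules, the sketch below uses std::regex from the C++ standard library as a stand-in for Regex; this is an assumption made so the example is self-contained, and `matchesSet` is a hypothetical helper. Only sets that the default std::regex grammar is known to treat the same way are used; forms such as `[]a]' are left out because that grammar handles a leading `]' differently.

```cpp
#include <regex>
#include <string>

// Stand-in helper (not part of NIHCL): true if `text` as a whole
// matches the character-set pattern `set`.
bool matchesSet(const std::string& set, const std::string& text) {
    return std::regex_match(text, std::regex(set));
}
```

Under these assumptions, `matchesSet("[a-z$%.]", "$")` is true, and `matchesSet("[^a-z]", "\n")` is also true because a complement set matches a newline unless the newline is listed in it.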
Because `\' quotes special characters, `\$' is a regular expression which matches only `$', and `\[' is a regular expression which matches only `[', and so on.
Note: for historical compatibility, special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, `*foo' treats `*' as ordinary since there is no preceding expression on which the `*' can act. It is poor practice to depend on this behavior; better to quote the special character anyway, regardless of where it appears.
For the most part, `\' followed by any character matches only that character. However, there are several exceptions: characters which, when preceded by `\', are special constructs. Such characters are always ordinary when encountered on their own. Here is a table of `\' constructs.
Thus, `foo\|bar' matches either `foo' or `bar' but no other string.
`\|' applies to the largest possible surrounding expressions. Only a surrounding `\( ... \)' grouping can limit the grouping power of `\|'.
Full backtracking capability exists to handle multiple uses of `\|'.
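The scope of an alternative can be tried out with the following sketch. It assumes std::regex with its default grammar as a stand-in, where alternation is written `|' and grouping `( ... )' rather than `\|' and `\( ... \)'; `fullMatch` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Stand-in helper (not part of NIHCL): whole-string match using the
// default std::regex grammar, in which `|` and `( )` need no backslash.
bool fullMatch(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

Here "foo|bar" corresponds to `foo\|bar' and accepts only `foo' or `bar', while "fo(o|b)ar" corresponds to `fo\(o\|b\)ar', where the grouping limits the alternative to the single character between `fo' and `ar'.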
This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature which happens to be assigned as a second meaning to the same `\( ... \)' construct because there is no conflict in practice between the two meanings. Here is an explanation of this feature:
The strings matching the first nine `\( ... \)' constructs appearing in a regular expression are assigned numbers 1 through 9 in order that the open-parentheses appear in the regular expression. `\1' through `\9' may be used to refer to the text matched by the corresponding `\( ... \)' construct.
For example, `\(.*\)\1' matches any newline-free string that is composed of two identical halves. The `\(.*\)' matches the first half, which may be anything, but the `\1' that follows must match the same exact text.
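The back-reference example can be exercised with a short sketch. It assumes std::regex with the `std::regex::basic` grammar as a stand-in, since that grammar also writes groups as `\( ... \)' and back-references as `\1'; `isDoubled` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Stand-in sketch (not the NIHCL Regex class): true if `text` consists of
// two identical halves, using the pattern \(.*\)\1 from the text above.
bool isDoubled(const std::string& text) {
    static const std::regex doubled("\\(.*\\)\\1", std::regex::basic);
    return std::regex_match(text, doubled);
}
```

For instance, `isDoubled("abcabc")` holds because the group matches `abc' and `\1' must then match the same text again, whereas `isDoubled("abcab")` does not.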
Here is a complicated regexp, used by Emacs to recognize the end of a sentence together with any whitespace that follows. It is given in C++ syntax to enable you to distinguish the spaces from the tab characters. In C++ syntax, the string constant begins and ends with a double-quote. `\"' stands for a double-quote as part of the regexp, `\\' for a backslash as part of the regexp, `\t' for a tab and `\n' for a newline.
"[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"
This contains four parts in succession: a character set matching period, `?' or `!'; a character set matching close-brackets, quotes or parentheses, repeated any number of times; an alternative in backslash-parentheses that matches end-of-line, a tab or two spaces; and a character set matching whitespace characters, repeated any number of times.
Regex(const char* cs, unsigned bufsize = DEFAULT_BUFSIZE)
Regex(const String& cs, unsigned bufsize = DEFAULT_BUFSIZE)
Regex(const SubString& cs, unsigned bufsize = DEFAULT_BUFSIZE)
Constructs a Regex object for the regular expression described by cs with bufsize bytes allocated to hold the compiled form of the regular expression. A NIHCL_BADREGEX exception is raised if the regular expression is invalid. If bufsize is not specified, it defaults to DEFAULT_BUFSIZE bytes (currently 64). The buffer size is only an estimate: if more space is required for the compiled regular expression, the buffer size is automatically increased. For example:
Regex r = "ab*c";
constructs an instance of Regex named r for the pattern "ab*c".
Regular expressions frequently contain the character `\' (backslash), which has a special meaning when used in a C-style character string, so each backslash must be doubled to quote it when written in a C or C++ program.
Regex r = "\\(ab*c\\)";
is the same as typing the regular expression \(ab*c\) to the Emacs editor. A good practice is to try a complex regular expression out first in the Emacs editor, then when it is working, copy it into your program and use an editor to replace all occurrences of \ by \\.
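If a C++11 or later compiler is available (an assumption; the library itself predates that standard), a raw string literal is another way to avoid doubling each backslash by hand. The sketch below only compares the two spellings as plain strings; `escapedPattern` and `rawPattern` are hypothetical names.

```cpp
#include <string>

// The pattern \(ab*c\) written with doubled backslashes.
std::string escapedPattern() { return "\\(ab*c\\)"; }

// The same pattern as a C++11 raw string literal: the text between
// R"( and )" is taken verbatim, so no backslash needs doubling.
std::string rawPattern() { return R"(\(ab*c\))"; }
```

Both functions return the same eight characters, so either spelling could be passed wherever a const char* pattern is expected.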
Since class Regex compiles the regular expression whenever an instance is constructed, if you declare a Regex as a variable local to a function, it will be compiled each time the function is called. To avoid this, use static variables.
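The compile-once idiom can be sketched as follows. std::regex from the C++ standard library is used as a stand-in so that the example is self-contained; with this library the local declaration would presumably read `static Regex r = "ab*c";` instead. The function name `containsAbStarC` is hypothetical.

```cpp
#include <regex>
#include <string>

// Stand-in sketch using std::regex rather than the NIHCL Regex class.
bool containsAbStarC(const std::string& text) {
    // A function-local static is constructed, and so compiled,
    // only the first time control passes through this declaration.
    static const std::regex pattern("ab*c");
    return std::regex_search(text, pattern);
}
```

Repeated calls reuse the already compiled pattern, which is the saving the paragraph above recommends.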
Regex(unsigned bufsize = DEFAULT_BUFSIZE)
Constructs a Regex object with bufsize bytes allocated to hold the compiled regular expression.
Regex(const Regex&)
Constructs a Regex initialized from the specified Regex.
bool match(const String& s, int pos = 0)
Returns YES if this regular expression matches the String s beginning at s[pos].
int search(const String& s, int startpos = 0)
Searches String s beginning at s[startpos] through the end of the string looking for a match to this regular expression. Returns the starting position of the match if one is found; otherwise, returns -1.
int search(const String& s, int startpos, int range)
Searches String s beginning at s[startpos] looking for a match to this regular expression. At most abs(range) matches are attempted. Searches backward from startpos if range < 0. Returns the starting position of the match if one is found; otherwise, returns -1. For example, here is how to search all of a String s backwards for the first occurrence of the pattern ab*c:
Regex r = "ab*c";
String s;
//...
if (r.search(s,s.length()-1,-s.length()+1) != -1) //...
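The effect of startpos and range can be modelled with a small sketch. It is not the library's implementation: it uses std::regex as a stand-in and assumes, as in the GNU re_search routine, that range says how far the starting position may move away from startpos, so that the call above can reach index 0. `searchRange` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical model of search(s, startpos, range), built on std::regex.
// Tries a match anchored at startpos, then at each position up to
// |range| steps away, moving backward when range is negative.
int searchRange(const std::string& s, const std::regex& re,
                int startpos, int range) {
    const int step = (range < 0) ? -1 : 1;
    const int tries = (range < 0) ? -range : range;
    int pos = startpos;
    for (int i = 0; i <= tries; ++i, pos += step) {
        if (pos < 0 || pos > static_cast<int>(s.size())) break;
        // match_continuous anchors the attempt at exactly s[pos].
        if (std::regex_search(s.begin() + pos, s.end(), re,
                              std::regex_constants::match_continuous))
            return pos;
    }
    return -1;  // no match at any tried position
}
```

For the string `xabcxxac' and the pattern `ab*c', a backward search from the last index returns 6 (the match `ac'), while a forward search from index 0 returns 1 (the match `abc').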
Range operator[](unsigned i) const
Returns a Range object describing the substring matched by the ith \( ... \) group in the most recent call to match() or search(). A Range object describing the substring matched by the entire regular expression is accessible as the 0th group. For example, to replace the first occurrence of the pattern ab*c in a String s with xxx:
Regex r = "ab*c";
if (r.search(s) != -1) s(r[0]) = "xxx";
To replace just the b's in this pattern with xxx:
Regex r = "a\\(b*\\)c";
if (r.search(s) != -1) s(r[1]) = "xxx";
Access is limited to the first 9 \( ... \) groups.
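For readers without the library at hand, the group-replacement idea can be approximated with the standard library. The sketch below is a stand-in and not the NIHCL API: std::smatch supplies the position and length that a Range would describe, and `replaceGroup` is a hypothetical helper.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Stand-in sketch: replace the text matched by group `group` of the first
// match of `re` in `s`; group 0 is the entire match. Returns `s` unchanged
// if there is no match or the group did not take part in it.
std::string replaceGroup(std::string s, const std::regex& re,
                         std::size_t group, const std::string& with) {
    std::smatch m;
    if (std::regex_search(s, m, re) && group < m.size() && m[group].matched) {
        s.replace(static_cast<std::size_t>(m.position(group)),
                  static_cast<std::size_t>(m.length(group)), with);
    }
    return s;
}
```

Called with the pattern "ab*c" and group 0 on `zzabbbczz', this yields `zzxxxzz'; with "a\\(b*\\)c" compiled under `std::regex::basic` and group 1, it yields `zzaxxxczz'.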
unsigned groups() const
Returns the number of \( ... \) groups matched by the most recent call to match() or search() for this Regex. The entire regular expression is considered group 0, so groups() will be at least one for a successful match() or search().
operator const char*() const
Converts this Regex to a pointer to a C-style (i.e. null-terminated) character string.
virtual void toAscii()
virtual void toLower()
virtual void toUpper()
Converts the regular expression of this Regex to ASCII, lower case, or upper case, respectively, and re-compiles it.
void operator=(const char* cs)
void operator=(const String& s)
void operator=(const SubString& ss)
void operator=(const Regex& r)
Assigns the specified regular expression to this Regex and compiles it. An NIHCL_BADREGEX exception is raised if the regular expression is invalid.
virtual void deepenShallowCopy()
Regex.
virtual void dumpOn(ostream& strm = cerr) const
Prints this Regex's regular expression, fastmap, and group registers on strm.
virtual void scanFrom(istream& strm)
Reads a regular expression from strm into this Regex and compiles it. An NIHCL_BADREGEX exception is raised if the regular expression is invalid.
virtual void storer(OIOofd& fd) const
virtual void storer(OIOout& strm) const
Stores this Regex on fd or strm.
String& operator&=(const String&)
String& operator&=(const SubString&)
String& operator&=(const char*)
shouldNotImplement().
NIHCL_BADREGEX