re_exec(str)
char *str;
re_subs(src, dst)
char *src;
char *dst;
void re_fail(msg, op)
char *msg;
char op;
void re_modw(str)
char *str;
These functions implement ed(1)-style partial regular expressions and supporting facilities.
Re_comp compiles a pattern string into an internal form (a deterministic finite-state automaton) to be executed by re_exec for pattern matching. Re_comp returns 0 if the pattern is compiled successfully, otherwise it returns an error message string. If re_comp is called with a 0 or a null string, it returns without changing the currently compiled regular expression.
Re_comp supports the same limited set of regular expressions found in ed and the Berkeley regex(3) routines:
[1] char Matches itself, unless it is a special
character (meta-character): . \ [ ] * + ^ $
[2] . Matches any character.
[3] \ Matches the character following it, except
when followed by (, ), <, > or a digit 1 to 9
(see [7], [8] and [9]). It is used as an escape character for all
other meta-characters, and itself. When used
in a set ([4]), it is treated as an ordinary
character.
[4] [set] Matches one of the characters in the set.
If the first character in the set is ^,
it matches a character NOT in the set. A
shorthand S-E is used to specify a set of
characters S up to E, inclusive. The special
characters ] and - have no special
meaning if they appear as the first chars
in the set.
examples:        match:
[a-z]            any lowercase alpha
[^]-]            any char except ] and -
[^A-Z]           any char except uppercase alpha
[a-zA-Z0-9]      any alphanumeric
[5] * Any regular expression form [1] to [4], followed by
closure char (*) matches zero or more matches of
that form.
[6] + Same as [5], except it matches one or more.
[7] A regular expression in the form [1] to [10], enclosed
as \(form\) matches what form matches. The enclosure
creates a set of tags, used for [8] and for
pattern substitution in re_subs. The tagged forms are numbered
starting from 1.
[8] A \ followed by a digit 1 to 9 matches whatever a
previously tagged regular expression ([7]) matched.
[9] \< Matches the beginning of a word,
that is, an empty string followed by a
letter, digit, or _ and not preceded by
a letter, digit, or _ .
\> Matches the end of a word,
that is, an empty string preceded
by a letter, digit, or _ , and not
followed by a letter, digit, or _ .
[10] A composite regular expression
xy where x and y
are in the form of [1] to [10] matches the longest
match of x followed by a match for y.
[11] ^ $ A regular expression starting with a ^ character
and/or ending with a $ character restricts the
pattern matching to the beginning of the line
and/or the end of the line [anchors]. Elsewhere in the
pattern, ^ and $ are treated as ordinary characters.
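The set rules in [4] can be illustrated with a small sketch. It is not taken from the library: set_expand is a hypothetical helper that expands the text following the opening [ into a 256-entry membership table, applying the rules as stated above (leading ^ negates, S-E is an inclusive range, ] is literal when it comes first, - is literal when it is first or last, and \ is an ordinary character).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the library.
 * p points just past the opening '['. On success the table holds 1 for
 * every member character and the return value points just past the
 * closing ']'. Returns NULL when the closing ']' is missing. */
const char *set_expand(const char *p, unsigned char table[256])
{
    int negate = 0, first = 1, c;

    memset(table, 0, 256);
    if (*p == '^') {                    /* leading ^ inverts the set */
        negate = 1;
        p++;
    }
    while (*p != '\0' && (first || *p != ']')) {
        unsigned char lo = (unsigned char)*p++;
        unsigned char hi = lo;

        /* S-E range: '-' is special only between two set members */
        if (*p == '-' && p[1] != '\0' && p[1] != ']') {
            hi = (unsigned char)p[1];
            p += 2;
        }
        for (c = lo; c <= hi; c++)
            table[c] = 1;
        first = 0;
    }
    if (*p != ']')
        return NULL;                    /* compare the "Missing ]" diagnostic */
    if (negate)
        for (c = 0; c < 256; c++)
            table[c] = !table[c];
    return p + 1;
}
```

Calling set_expand("^]-]", table) leaves every entry set except those for ] and -, which corresponds to the second row of the examples in [4].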
Re_exec executes the internal form produced by re_comp and searches the argument string for the regular expression described by the internal form. Re_exec returns 1 if the last regular expression pattern is matched within the string, 0 if no match is found. In case of an internal error (corrupted internal form), re_exec calls the user-supplied re_fail and returns 0.
The strings passed to both re_comp and re_exec may have trailing or embedded newline characters. The strings must be terminated by nulls.
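To make the search and the longest-match rule of [5] and [10] concrete, here is a minimal sketch in the style of the compact backtracking matcher published by Kernighan and Pike. It is not how re_exec works (re_exec runs the compiled internal form), and it handles only forms [1], [2], [5] and [11]: ordinary characters, ., * and the ^ and $ anchors, with no escapes, sets or tags. The names match_sketch, match_here and match_star are assumptions made for this illustration.

```c
#include <stddef.h>

static int match_here(const char *re, const char *text);

/* c* : consume the longest run of c first, then back off one
 * character at a time until the rest of the pattern matches. */
static int match_star(int c, const char *re, const char *text)
{
    const char *t = text;

    while (*t != '\0' && (c == '.' || *t == c))
        t++;
    for (;;) {
        if (match_here(re, t))
            return 1;
        if (t == text)
            return 0;
        t--;
    }
}

/* Match re at the beginning of text. */
static int match_here(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return match_star(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')   /* $ anchors only at the end */
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1);
    return 0;
}

/* Return 1 if re matches anywhere in text, 0 otherwise. */
int match_sketch(const char *re, const char *text)
{
    if (re[0] == '^')                    /* ^ anchors only at the start */
        return match_here(re + 1, text);
    do {
        if (match_here(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

With this sketch, match_sketch("foo*.*", "fobar") returns 1 and match_sketch("^fo*$", "fooox") returns 0. A $ that is not the last pattern character is compared as an ordinary character, in line with [11].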
Re_subs does ed-style pattern substitution, after a successful match is found by re_exec. The source string parameter to re_subs is copied to the destination string with the following interpretation:
[1] & Substitute the entire matched string in the destination.
[2] \n Substitute the substring matched by a tagged subpattern
numbered n, where n is between 1 and 9, inclusive.
[3] \char Treat the next character literally,
unless the character is a digit ([2]).
If the copy operation with the substitutions is successful, re_subs returns 1. If the source string is corrupted, or the last call to re_exec failed, it returns 0.
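The three substitution rules can be sketched as follows. This is an illustration, not the library's code: struct match is a hypothetical stand-in for whatever state a successful re_exec records, and subs_sketch is an assumed name. Only \1 to \9 are treated as tag references, as described above, and the sketch does no bounds checking on the destination.

```c
#include <stddef.h>
#include <string.h>

#define MAXTAG 10

/* Hypothetical match record: start[0]/end[0] bound the whole match,
 * start[n]/end[n] bound tagged subpattern n (NULL when unset). */
struct match {
    const char *start[MAXTAG];
    const char *end[MAXTAG];
};

/* Copy src to dst, expanding &, \1..\9 and \char.
 * Returns 1 on success, 0 when no match has been recorded.
 * The caller must supply a large enough dst. */
int subs_sketch(const struct match *m, const char *src, char *dst)
{
    if (m->start[0] == NULL || m->end[0] == NULL)
        return 0;

    while (*src != '\0') {
        int tag = -1;                    /* -1 means "copy c literally" */
        char c = *src++;

        if (c == '&') {
            tag = 0;                     /* rule [1]: whole match */
        } else if (c == '\\' && *src != '\0') {
            c = *src++;                  /* rule [3]: next char is literal... */
            if (c >= '1' && c <= '9')
                tag = c - '0';           /* ...unless it is a digit, rule [2] */
        }
        if (tag < 0) {
            *dst++ = c;
        } else if (m->start[tag] != NULL && m->end[tag] != NULL) {
            size_t n = (size_t)(m->end[tag] - m->start[tag]);
            memcpy(dst, m->start[tag], n);
            dst += n;
        }                                /* an unset tag expands to nothing */
    }
    *dst = '\0';
    return 1;
}
```

For a match of \(foo\)[1-3]\(bar\) against foo2bar, the source string <&> \2-\1 \& would expand to <foo2bar> bar-foo & under these rules.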
Re_modw is used to add new characters into an internal table to change re_exec's understanding of what a word should look like, when matching with \< and \> constructs. If the string parameter is 0 or a null string, the table is reset back to the default, which contains A-Z a-z 0-9 _ .
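A word table of this kind, together with the \< and \> tests from [9], might look like the sketch below. The layout of the real internal table is not described here, so everything in the block (word_tab, modw_sketch, is_word_char, at_word_start, at_word_end) is a hypothetical illustration of the documented behavior only.

```c
#include <stddef.h>
#include <string.h>

static unsigned char word_tab[256];     /* 1 = character belongs to a word */
static int word_tab_ready = 0;

/* Restore the documented default: A-Z a-z 0-9 _ */
static void word_reset(void)
{
    static const char defaults[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789_";
    const char *p;

    memset(word_tab, 0, sizeof word_tab);
    for (p = defaults; *p != '\0'; p++)
        word_tab[(unsigned char)*p] = 1;
    word_tab_ready = 1;
}

/* Add the characters in s to the table; NULL or "" resets it. */
void modw_sketch(const char *s)
{
    if (s == NULL || *s == '\0') {
        word_reset();
        return;
    }
    if (!word_tab_ready)
        word_reset();
    while (*s != '\0')
        word_tab[(unsigned char)*s++] = 1;
}

int is_word_char(unsigned char c)
{
    if (!word_tab_ready)
        word_reset();
    return word_tab[c];
}

/* \< : word character at p, none immediately before it */
int at_word_start(const char *line, const char *p)
{
    int prev = (p > line) ? is_word_char((unsigned char)p[-1]) : 0;
    return !prev && is_word_char((unsigned char)*p);
}

/* \> : word character immediately before p, none at p */
int at_word_end(const char *line, const char *p)
{
    int prev = (p > line) ? is_word_char((unsigned char)p[-1]) : 0;
    return prev && !is_word_char((unsigned char)*p);
}
```

With the default table, the string x-ray contains two words. After modw_sketch("-") the hyphen counts as a word character and the whole of x-ray is treated as a single word.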
Re_fail is a user-supplied routine to handle internal errors. Re_exec calls re_fail with an error message string, and the opcode character that caused the error. The default re_fail routine simply prints the message and the opcode character to stderr and invokes exit(2).
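A program that prefers not to exit on an internal error could supply its own handler along the following lines. This is a sketch under assumptions: it is written with an ANSI prototype, whereas the declaration above uses the older parameter style, so the definition should follow whichever style your copy of the library is compiled with. The counter re_fail_count is a hypothetical addition.

```c
#include <stdio.h>

int re_fail_count = 0;      /* hypothetical counter, not part of the library */

/* Report the error and return, so that re_exec can return 0
 * to its caller instead of terminating the program. */
void re_fail(char *msg, char op)
{
    re_fail_count++;
    fprintf(stderr, "regex: %s (opcode 0%o)\n",
            msg, (unsigned int)(unsigned char)op);
}
```

A caller can then compare re_fail_count before and after a call to re_exec to tell an internal error apart from an ordinary failed match, since both make re_exec return 0.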
foo*.*
    dfaform: CHR f CHR o CLO CHR o END CLO ANY END END
    matches: fo foo fooo foobar fobar foxx ...

fo[ob]a[rz]
    dfaform: CHR f CHR o CCL 2 o b CHR a CCL 2 r z END
    matches: fobar fooar fobaz fooaz

foo\\+
    dfaform: CHR f CHR o CHR o CHR \ CLO CHR \ END END
    matches: foo\ foo\\ foo\\\ ...

\(foo\)[1-3]\1 (same as foo[1-3]foo, but takes less internal space)
    dfaform: BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END
    matches: foo1foo foo2foo foo3foo

\(fo.*\)-\1
    dfaform: BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
    matches: foo-foo fo-fo fob-fob foobar-foobar ...
No previous regular expression, Empty closure, Illegal closure, Cyclical reference, Undetermined reference, Unmatched \(, Missing ], Null pattern inside \(\), Null pattern inside \<\>, Too many \(\) pairs, Unmatched \).
Software tools                  Kernighan & Plauger
Software tools in Pascal        Kernighan & Plauger
Grep sources [rsx-11 C dist]    David Conroy
Ed - text editor                Unix Programmer's Manual
Advanced editing on Unix        B. W. Kernighan
RegExp sources                  Henry Spencer
The re_comp and re_exec routines perform almost as well as their licensed counterparts, sometimes better. In very few instances, they are about 10% to 15% slower.