regex

Section: C Library Functions (3)
Updated: local

NAME

re_comp, re_exec, re_subs, re_modw, re_fail - regular expression handling

ORIGIN

Dept. of Computer Science
York University

SYNOPSIS

char *re_comp(pat)
char *pat;

re_exec(str)
char *str;

re_subs(src, dst)
char *src;
char *dst;

void re_fail(msg, op)
char *msg;
char op;

void re_modw(str)
char *str;

DESCRIPTION

These functions implement ed(1)-style partial regular expressions and supporting facilities.

Re_comp compiles a pattern string into an internal form (a deterministic finite-state automaton) to be executed by re_exec for pattern matching. Re_comp returns 0 if the pattern is compiled successfully, otherwise it returns an error message string. If re_comp is called with a 0 or a null string, it returns without changing the currently compiled regular expression.

Re_comp supports the same limited set of regular expressions found in ed and Berkeley regex(3) routines:

[1] char Matches itself, unless it is a special
character (meta-character): . \ [ ] * + ^ $

[2] . Matches any character.

[3] \ Matches the character following it, except
when followed by a digit 1 to 9, (, ), < or > (see [7], [8] and [9]). It is used as an escape character for all other meta-characters, and itself. When used in a set ([4]), it is treated as an ordinary character.

[4] [set] Matches one of the characters in the set.
If the first character in the set is ^, it matches a character NOT in the set. A shorthand S-E is used to specify a set of characters S up to E, inclusive. The special characters ] and - have no special meaning if they appear as the first chars in the set.

        examples:       match:
        [a-z]           any lowercase alpha
        [^]-]           any char except ] and -
        [^A-Z]          any char except uppercase alpha
        [a-zA-Z0-9]     any alphanumeric

[5] * Any regular expression form [1] to [4], followed by
closure char (*) matches zero or more matches of that form.

[6] + Same as [5], except it matches one or more.

[7] A regular expression in the form [1] to [10], enclosed
as \(form\) matches what form matches. The enclosure creates a set of tags, used for [8] and for pattern substitution in re_subs. The tagged forms are numbered starting from 1.

[8] A \ followed by a digit 1 to 9 matches whatever a
previously tagged regular expression ([7]) matched.

[9] \< Matches the beginning of a word,
that is, an empty string followed by a letter, digit, or _ and not preceded by a letter, digit, or _ .
\> Matches the end of a word,
that is, an empty string preceded by a letter, digit, or _ , and not followed by a letter, digit, or _ .

[10] A composite regular expression
xy where x and y are in the form of [1] to [10] matches the longest match of x followed by a match for y.

[11] ^ $ A regular expression starting with a ^ character
and/or ending with a $ character restricts the pattern matching to the beginning of the line and/or the end of the line [anchors]. Elsewhere in the pattern, ^ and $ are treated as ordinary characters.
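
The set rules in [4] (leading ^ for negation, S-E ranges, and ] or - taken literally when they come first) can be illustrated with a small stand-alone checker. This is only a sketch: set_matches is a hypothetical helper, it takes the text between the brackets rather than a compiled pattern, and it is not how re_comp stores a set internally.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the library.
 * `set` is the text between [ and ], e.g. "a-z" or "^]-".
 * Returns 1 if c is matched by the set, 0 otherwise. */
int set_matches(const char *set, char c)
{
    unsigned char ch = (unsigned char)c;
    int negate = 0;
    int found = 0;
    size_t i = 0;

    /* A leading ^ inverts the set. */
    if (set[0] == '^') {
        negate = 1;
        i = 1;
    }

    while (set[i] != '\0') {
        unsigned char lo = (unsigned char)set[i];

        /* S-E is a range only when an end character follows the '-'.
         * A '-' at the very end, or a ']' or '-' that opens the set,
         * therefore falls through to the literal branch. */
        if (set[i + 1] == '-' && set[i + 2] != '\0') {
            unsigned char hi = (unsigned char)set[i + 2];
            if (ch >= lo && ch <= hi)
                found = 1;
            i += 3;
        } else {
            if (ch == lo)
                found = 1;
            i += 1;
        }
    }
    return negate ? !found : found;
}
```

With this sketch, set_matches("^]-", 'x') is 1 while set_matches("^]-", ']') is 0, matching the [^]-] row in the table above.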

Re_exec executes the internal form produced by re_comp and searches the argument string for the regular expression described by the internal form. Re_exec returns 1 if the last regular expression pattern is matched within the string, 0 if no match is found. In case of an internal error (corrupted internal form), re_exec calls the user-supplied re_fail and returns 0.

The strings passed to both re_comp and re_exec may have trailing or embedded newline characters. The strings must be terminated by nulls.

Re_subs does ed-style pattern substitution after a successful match is found by re_exec. The source string parameter to re_subs is copied to the destination string with the following interpretation:

[1] & Substitute the entire matched string in the destination.

[2] \n Substitute the substring matched by a tagged subpattern
numbered n, where n is between 1 and 9, inclusive.

[3] \char Treat the next character literally,
unless the character is a digit ([2]).

If the copy operation with the substitutions is successful, re_subs returns 1. If the source string is corrupted, or the last call to re_exec failed, it returns 0.
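
The three substitution rules can be sketched as a stand-alone function. This is an illustration under stated assumptions, not the library's re_subs: subs_sketch, the start/end arrays (index 0 for the whole match, 1 to 9 for tags) and the dstsize bound are all hypothetical, and a trailing lone backslash is simply copied.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of ed-style substitution.
 * start[0]/end[0] delimit the whole match, start[n]/end[n] tag n.
 * Returns 1 on success, 0 if no match is recorded, a referenced tag
 * is unset, or dst (of dstsize bytes) is too small. */
int subs_sketch(const char *src, char *dst, size_t dstsize,
                const char *const start[], const char *const end[])
{
    size_t out = 0;

    if (dstsize == 0 || start[0] == NULL || end[0] == NULL)
        return 0;

    while (*src != '\0') {
        int tag = -1;        /* -1 means "copy one literal character" */
        char lit = *src;

        if (*src == '&') {                       /* rule [1] */
            tag = 0;
            src++;
        } else if (*src == '\\' && src[1] >= '1' && src[1] <= '9') {
            tag = src[1] - '0';                  /* rule [2] */
            src += 2;
        } else if (*src == '\\' && src[1] != '\0') {
            lit = src[1];                        /* rule [3] */
            src += 2;
        } else {
            src++;                               /* ordinary character */
        }

        if (tag >= 0) {
            size_t n;
            if (start[tag] == NULL || end[tag] == NULL)
                return 0;
            n = (size_t)(end[tag] - start[tag]);
            if (n >= dstsize - out)              /* keep room for the NUL */
                return 0;
            memcpy(dst + out, start[tag], n);
            out += n;
        } else {
            if (out + 1 >= dstsize)
                return 0;
            dst[out++] = lit;
        }
    }
    dst[out] = '\0';
    return 1;
}
```

For a match of "foo-bar" with tag 1 on "foo" and tag 2 on "bar", the source string \2-\1 yields bar-foo, and \& yields a literal &.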

Re_modw adds new characters to an internal table, changing re_exec's understanding of what a word looks like when matching with the \< and \> constructs. If the string parameter is 0 or a null string, the table is reset to the default, which contains A-Z a-z 0-9 _ .
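
How a word table of this kind interacts with \< and \> can be sketched as follows. The names word_tab, modw_sketch, at_word_start and at_word_end are hypothetical, and the 256-entry byte table is an assumption about layout, not the library's actual storage.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical word table: nonzero means "word character".
 * It starts empty here; call modw_sketch(NULL) once to load defaults. */
static unsigned char word_tab[256];

/* Mirrors the re_modw description: a null or empty string resets the
 * table to A-Z a-z 0-9 _, anything else adds characters to it. */
void modw_sketch(const char *str)
{
    if (str == NULL || *str == '\0') {
        const char *d = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "abcdefghijklmnopqrstuvwxyz"
                        "0123456789_";
        memset(word_tab, 0, sizeof word_tab);
        for (; *d != '\0'; d++)
            word_tab[(unsigned char)*d] = 1;
        return;
    }
    for (; *str != '\0'; str++)
        word_tab[(unsigned char)*str] = 1;
}

/* \< : p points at a word character not preceded by one. */
int at_word_start(const char *line, const char *p)
{
    int prev_is_word = (p > line) && word_tab[(unsigned char)p[-1]];
    return *p != '\0' && word_tab[(unsigned char)*p] && !prev_is_word;
}

/* \> : p follows a word character and does not point at one. */
int at_word_end(const char *line, const char *p)
{
    int prev_is_word = (p > line) && word_tab[(unsigned char)p[-1]];
    return prev_is_word && !word_tab[(unsigned char)*p];
}
```

After modw_sketch("-"), the string "re-exec" counts as a single word, so the position after the hyphen no longer qualifies as a word start.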

Re_fail is a user-supplied routine to handle internal errors. Re_exec calls re_fail with an error message string and the opcode character that caused the error. The default re_fail routine simply prints the message and the opcode character to stderr and invokes exit(2).
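
A program that should not exit on an internal error can supply its own re_fail. The following is a minimal sketch with the signature taken from SYNOPSIS; the fail_seen and fail_msg variables are hypothetical, and it relies on the documented behaviour that re_exec returns 0 once re_fail returns.

```c
#include <stdio.h>

/* Hypothetical state the caller can inspect after re_exec returns 0. */
int fail_seen = 0;
char fail_msg[128];

/* User-supplied replacement for the default re_fail: record the
 * message and opcode instead of printing them and exiting. */
void re_fail(char *msg, char op)
{
    fail_seen = 1;
    snprintf(fail_msg, sizeof fail_msg, "%s [opcode %d]",
             msg, (int)(unsigned char)op);
}
```

The caller would clear fail_seen before each re_exec call and check it whenever re_exec returns 0, to tell "no match" apart from a corrupted internal form.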

EXAMPLES

In the examples below, the dfaform describes the internal form after the pattern is compiled. For additional details, refer to the sources.

foo*.*
     dfaform:  CHR f CHR o CLO CHR o END CLO ANY END END
     matches:  fo foo fooo foobar fobar foxx ...

fo[ob]a[rz]
     dfaform:  CHR f CHR o CCL 2 o b CHR a CCL 2 r z END
     matches:  fobar fooar fobaz fooaz

foo\\+
     dfaform:  CHR f CHR o CHR o CHR \ CLO CHR \ END END
     matches:  foo\ foo\\ foo\\\  ...

\(foo\)[1-3]\1 (same as foo[1-3]foo, but takes less internal space)
     dfaform:  BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END
     matches:  foo1foo foo2foo foo3foo

\(fo.*\)-\1
     dfaform:  BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
     matches:  foo-foo fo-fo fob-fob foobar-foobar ...

DIAGNOSTICS

Re_comp returns one of the following strings if an error occurs:

No previous regular expression,
Empty closure,
Illegal closure,
Cyclical reference,
Undetermined reference,
Unmatched \(,
Missing ],
Null pattern inside \(\),
Null pattern inside \<\>,
Too many \(\) pairs,
Unmatched \).

REFERENCES

Software tools                Kernighan & Plauger
Software tools in Pascal      Kernighan & Plauger
Grep sources [rsx-11 C dist]  David Conroy
Ed - text editor              Unix Programmer's Manual
Advanced editing on Unix      B. W. Kernighan
RegExp sources                Henry Spencer

HISTORY AND NOTES

These routines are derived from various implementations found in the Software Tools books and from David Conroy's grep. They are NOT derived from licensed/restricted software. For more interesting/academic/complicated implementations, see Henry Spencer's regexp routines (V8) or the GNU Emacs pattern-matching module.

The re_comp and re_exec routines perform almost as well as their licensed counterparts, sometimes better. In very few instances, they are about 10% to 15% slower.

AUTHOR

Ozan S. Yigit (oz)
usenet: utzoo!yetti!oz
bitnet: oz@yusol || oz@yuyetti

BUGS

These routines are Public Domain. You can get them in source.
The internal storage for the dfa form is not checked for overflows. Currently, it is 1024 bytes.
Others, no doubt.
