Text File | 1989-08-06 | 25KB | 585 lines
regex.library - v1.0
An Amiga Shared Library of the
GNU Regular Expression Package
Ported by Edwin Hoogerbeets 24/07/89
This collection of files may be copied and distributed under the GNU Public
Licence. See the comment at the top of regex.c for details.
Adapted from Elib by Jim Mackraz, mklib by Edwin Hoogerbeets, and the
GNU regular expression package by the Free Software Foundation.
A General View of How it is Used:
A regular expression is a concise method of describing a pattern of
characters in a string. By use of special wildcards, almost any pattern
can be described. A regular expression pattern can be used for searching
strings in such programs as editors or other string handling programs.
A regular expression pattern must first be compiled into a form more
easily understood by the matching routines. The compiled form is stored
in a buffer structure called `struct re_pattern_buffer.' The buffer must
first be initialized to allocate memory or resources. The pattern is
compiled into this buffer. Strings can then be matched against the
compiled regular expression as many times as desired. When the matching
is done, the buffer is terminated, and the program can exit.
There are two parts to the source: the linkable libraries and the Amiga
shared library routines. The linkable libraries contain the
non-re-entrant routines and the glue code that allows access to the
shared library routines. The shared library contains routines that
compile and match regular expressions.
To use the library, copy regex.library to your libs: directory and simply
execute a program that uses the library, such as tinygrep.
GNU Regular Expressions:
The following table details the various special characters understood in
each of the grep and egrep style regular expressions:
(grep)    (egrep)   (explanation)

.         .         matches any single character except newline
\?        ?         postfix operator; preceding item is optional
*         *         postfix operator; preceding item 0 or more times
\+        +         postfix operator; preceding item 1 or more times
\|        |         infix operator; matches either argument
^         ^         matches the empty string at the beginning of a line
$         $         matches the empty string at the end of a line
\<        \<        matches the empty string at the beginning of a word
\>        \>        matches the empty string at the end of a word
[chars]   [chars]   match any character in the given class; if the
                    first character after [ is ^, match any character
                    not in the given class; a range of characters may
                    be specified by <first>-<last>; for example, \W
                    (below) is equivalent to the class [^A-Za-z0-9]
\( \)     ( )       parentheses are used for grouping and to override
                    operator precedence
\<1-9>    \<1-9>    \<n> matches a repeat of the text matched earlier
                    in the regexp by the subexpression inside the
                    nth opening parenthesis
\         \         any special character may be preceded by a backslash
                    to match it literally
Operator precedence is (highest to lowest) ?, *, and +, concatenation,
and finally |. All other constructs are syntactically identical to
normal characters.
Writing a C Program That Uses Regular Expressions:
To write a program that uses the library, include the header file regex.h
at the top of your source. This declares the data structures and function
return types for you.
You must do an OpenLibrary() call on regex.library and assign the pointer
obtained to the external variable RegexBase. The pointer RegexBase is
then used to find functions within regex.library, and thus RegexBase must
be valid before using any of these library routines. A RegexBase variable
is already provided in regex.lib. When linking, give the -lregex flag to
include regex.lib (the linkable library code).
To use the routines, first declare a struct re_pattern_buffer variable
and call re_initialize_buffer() with a pointer to this buffer. (Specific
details of the regex functions are listed below.)
Then, determine a regular expression you wish to compile, perhaps from
user input. Call the function re_compile_pattern() with a pointer to your
buffer and the string you wish to compile. Now the buffer will contain
the compiled regular expression ready for matching.
Next, you can search for your pattern in any given text by calling
re_search() with the compiled buffer and the string you wish to search
on. This will locate the regular expression anywhere in the string you
passed to it, within the bounds specified.
If you are looking for a match at one exact position, however, re_match()
is the function you want. It returns the length of the match when the
regular expression matches the string starting at the character specified,
and -1 when it does not.
When you are done with the buffer, you must call re_terminate_buffer()
to reclaim all memory and resources used by the library.
Two programs, tester.c and tinygrep.c, are included in the distribution
as simple examples of programming with the library.
Tester allows you to enter grep style regular expressions and match them
against a string.
Tinygrep is a small implementation of the popular grep program that uses
the regex library to search for patterns in text files. However, it is
not overall as fast as GNU grep, or even Manx grep. This is because these
other programs handle their slowest part (input) much better. To make
tinygrep faster, the regular expression searching could be performed
directly on the input buffer.
Assembler Support:
If you are writing in assembler instead of C, the registers expected for
function parameters are listed along with the function descriptions
below.
The sequence of calls to the functions in regex.library described for C
still applies. However, instead of using the glue code to call the library,
you should call the regex.library functions directly following this
example:
        ; assembler example of calling re_terminate_buffer()
        ;
        ; define the library offsets
        include 'regex.i'

        ; setup arguments in appropriate registers here
        ; d0 is where the buffer pointer parameter should go
        move.l  bufp,d0

        ; get the address of the library and jump to the appropriate point
        move.l  _RegexBase,a6
        jsr     _LVOre_terminate_buffer(a6)

        ; d0 should now contain the result
To use different functions, replace the re_terminate_buffer part of the
jsr line with the function name you wish to call. The _LVO with the
function name is expanded to a number which is the offset from register
a6 where the address of the function you are calling can be found.
Functions in regex.library:
This is a more detailed description of each of the functions and
variables offered by the regex package. These functions are available
from C by linking with the regex.lib.
Regex offers the following entry points:
D0                         D0   D1
char *re_initialize_buffer(bufp,table)
struct re_pattern_buffer *bufp;
char *table;
This function is used to initialize a pattern buffer `bufp' that is
used to compile regular expressions. Declare a variable of type
`struct re_pattern_buffer' on the stack or dynamically
allocate room for it, and pass a pointer to the new memory to
re_initialize_buffer(). The fields of the buffer are filled in for
you.
The `table' parameter is a pointer to a translation table used to
equate characters during matching. When a character is matched, it is
used as an index into this table to find the resulting character. One
use for this might be to translate all vowels to the character @, so
that @ can be used in a regular expression to match any vowel. If the
table parameter is NULL, no translation is performed on the
characters, and each character is matched literally. (See the
__Upcase table below for another example)
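The vowel example can be sketched in C as follows. This only builds the table; the helper name build_vowel_table and the table size of 256 entries (one per 8-bit character) are assumptions made for illustration and are not part of regex.library.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 256  /* assumed: one entry per possible 8-bit character */

/* Fill `table' so that every vowel maps to '@' and every other
   character maps to itself.  Hypothetical helper, not in the library. */
void build_vowel_table(char table[TABLE_SIZE])
{
    const char *vowels = "aeiouAEIOU";
    int i;

    assert(table != NULL);

    /* start with the identity mapping: each character equals itself */
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = (char)i;

    /* then fold every vowel onto the single character '@' */
    for (i = 0; vowels[i] != '\0'; i++)
        table[(unsigned char)vowels[i]] = '@';
}
```

A table filled in this way would be passed as the second argument to re_initialize_buffer(). A pattern such as b@t should then match "bat", "bet" and "bit" alike, since both the vowel in the text and the @ in the pattern translate to the same character.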
If re_initialize_buffer succeeds, a NULL pointer is returned. If an
error occurs, a pointer to one of the following fixed strings is
returned:

"No buffer"         - you passed a NULL pointer, not a pointer
                      to a regular expression buffer
"Memory exhausted"  - Not enough memory in the system to
                      initialize the buffer
D0                       D0
LONG re_terminate_buffer(bufp)
struct re_pattern_buffer *bufp;
This function must be called to free the memory and resources
allocated during the initialize routine. It is not fatal if this
routine is not called before your program exits, but the memory
will not be returned to the system. (for which you will get
royally flamed on the nets, believe me! 8-)
A value of 1 is returned for a successful termination, and 0 for the
error condition. An error (zero) means you passed a NULL pointer to
the function.
D0                       D0       D1    A0    A1
char *re_compile_pattern(pattern, size, bufp, ob)
char *pattern;
long size, ob;
struct re_pattern_buffer *bufp;
This function compiles a regular expression `pattern' with length
`size' into the properly initialized buffer `bufp.'
Different syntaxes for regular expressions exist. The syntax you
would like is specified in the `ob' parameter. The ob parameter can
be one of the following defined flags:
(In general, the presence of one of the flags below indicates that
the character referenced should be treated as a wildcard. If the flag
is absent, then the character is not treated as a wildcard.)
RE_NO_BK_PARENS
Treat parentheses as the grouping wildcard. To specify a literal
parenthesis the pattern \( or \) is needed. If this flag is left
out, \( and \) are the grouping wildcards and ( and ) match the
literal parentheses.
RE_NO_BK_VBAR
Treat the vertical bar as the "or"-operator, and \| as a literal
vertical bar. If this flag is left out, the syntax is reversed.
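RE_BK_PLUS_QM
Treat \+ and \? as the postfix operators, and the plain plus and
question mark characters as literals. If this flag is left out, plain
+ and ? are the operators and \+ and \? match the literal characters.
(Unlike the two flags above, setting this one makes the backslashed
forms special.)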
RE_TIGHT_VBAR
Bind the vertical bar tighter than the ^ and $ operators. This
means that the vertical bar takes precedence over the ^ and $ in a
single expression.
RE_NEWLINE_OR
Treat the newline character `\n' as an "or"-operator. This might
be useful in a program such as fgrep.
RE_CONTEXT_INDEP_OPS
Treat certain wildcard characters as wildcards only in certain
contexts. Specifically, this applies to:
^ - only special at the beginning of a line, or after ( or |
$ - only special at the end of a line, or before ) or |
*, +, ? - only special when not after the beginning of a line,
(, or |
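Since the predefined syntaxes are described as combinations of these flags, the flags can presumably be OR-ed together into a single `ob' value. The sketch below shows the idea only. The EX_ names and their bit values are placeholders invented for this example; the real names are the RE_ flags above and the real values are whatever regex.h defines.

```c
#include <assert.h>

/* Placeholder bit values for illustration only.  The real flags are
   the RE_ names defined in regex.h, and their values may differ. */
#define EX_NO_BK_PARENS      (1L << 0)
#define EX_NO_BK_VBAR        (1L << 1)
#define EX_CONTEXT_INDEP_OPS (1L << 2)

/* Build a syntax word in which ( ) and | act as operators without
   backslashes, optionally adding the context-dependent behaviour. */
long make_bare_syntax(int context_dependent)
{
    long ob = EX_NO_BK_PARENS | EX_NO_BK_VBAR;

    if (context_dependent)
        ob |= EX_CONTEXT_INDEP_OPS;
    return ob;
}

/* Return nonzero if every bit of `flag' is set in `ob'. */
int syntax_has(long ob, long flag)
{
    assert(flag != 0);
    return (ob & flag) == flag;
}
```

With the real header, the equivalent value would be written directly as, for instance, RE_NO_BK_PARENS | RE_NO_BK_VBAR and passed as the last argument of re_compile_pattern().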
Some programs have a combination of the above flags as their default.
The following flags give the syntax of some well-known Unix
utilities in terms of the above flags:
RE_SYNTAX_AWK - emulate awk regular expressions
RE_SYNTAX_EGREP - emulate egrep regular expressions
RE_SYNTAX_GREP - emulate grep regular expressions
RE_SYNTAX_EMACS - emulate emacs-like regular expressions
If re_compile_pattern() is successful in compiling the given regular
expression, a NULL pointer is returned. If an error condition occurs,
a pointer to one of the following fixed strings is returned.

"Invalid regular expression"          - eg: "\(ab\)*123\" has an
                                        invalid trailing '\'
"Unmatched \("                        - eg: "\(ab*123" has no
                                        closing "\)"
"Unmatched \)"                        - eg: "ab\)*123" has no
                                        opening "\("
"Premature end of regular expression" - eg: "foo[1-9" has no ']'
"Nesting too deep"                    - you have too many levels
                                        of groupings: "\( \)"
"Regular expression too big"          - the regular expression
                                        needed more than 64K to
                                        store -- Try using a
                                        shorter one!
"Memory exhausted"                    - Close some windows!
D0                      D0
LONG re_compile_fastmap(bufp)
struct re_pattern_buffer *bufp;
If the initial part of a pattern does not match the string starting
at a certain position, the whole expression will not match the string
starting at that position.
On this basis, it is possible to compute which characters can
possibly be found at the start of the pattern. If a string does not
start with one of these characters, it cannot match the pattern.
This collection of possible starting characters is called a
fastmap.
Fastmaps make pattern searching much faster by reducing the number of
failed full matches.
This function takes a compiled pattern in buffer `bufp' and computes
a fastmap for it, which is stored in the `fastmap' field of the
buffer. The fastmap is then used in the re_search() function while
searching a string for a regular expression. If this function is
not called before a re_search(), then re_search() will call it
for you.
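To make the fastmap idea concrete, here is a small sketch for the simplest possible case: a "pattern" that is just a list of non-empty literal alternatives. The library derives its fastmap from the compiled pattern, which is not shown here; the function names, the 256-entry map and the full_tries counter are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAP_SIZE 256  /* assumed: one flag per possible 8-bit character */

/* Mark the first character of each (non-empty) alternative as a
   possible starting character. */
void build_fastmap(const char *const *alts, size_t nalts,
                   char fastmap[MAP_SIZE])
{
    size_t i;

    assert(alts != NULL && fastmap != NULL);
    memset(fastmap, 0, MAP_SIZE);
    for (i = 0; i < nalts; i++)
        if (alts[i][0] != '\0')
            fastmap[(unsigned char)alts[i][0]] = 1;
}

/* Return the index of the first position in `text' where one of the
   alternatives occurs, or -1.  *full_tries counts the positions at
   which a full comparison was attempted at all. */
long search_alts(const char *text, const char *const *alts, size_t nalts,
                 const char fastmap[MAP_SIZE], long *full_tries)
{
    size_t len = strlen(text), pos, i;

    *full_tries = 0;
    for (pos = 0; pos < len; pos++) {
        if (!fastmap[(unsigned char)text[pos]])
            continue;               /* no alternative can start here */
        (*full_tries)++;
        for (i = 0; i < nalts; i++) {
            size_t n = strlen(alts[i]);
            if (n <= len - pos && strncmp(text + pos, alts[i], n) == 0)
                return (long)pos;
        }
    }
    return -1;
}
```

Searching "a hot dog" for "cat" or "dog" this way attempts a full comparison at only one position, the `d', instead of at every character.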
D0             D0     D1      A0    A1        D2     D3
LONG re_search(pbufp, string, size, startpos, range, regs)
struct re_pattern_buffer *pbufp;
char *string;
long size, startpos, range;
struct re_registers *regs;
This function searches the string `string' of size `size' for the
regular expression previously compiled to the buffer `pbufp.' The
`startpos' parameter is the index into the string to start searching.
If the search is unsuccessful at startpos, it is tried at startpos+1
and so forth. The `range' parameter tells how far from the start
position to go before failing. It is up to the caller to make sure
that range is not so large as to take the starting position outside
of the input string. If the range parameter is negative, then the
search will proceed from startpos to startpos-1 and so forth until
-range positions have been checked.
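The way `startpos' and `range' interact can be sketched with a toy matcher. Below, match_at() is a stand-in that only matches a literal string at one position (it plays the part of re_match()), and search_range() walks the starting position forwards or backwards. Both names are hypothetical, and whether the last position startpos+range is itself tried is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for an anchored matcher: returns the match length if the
   literal `lit' occurs in `text' exactly at `pos', otherwise -1. */
static long match_at(const char *lit, const char *text, long size, long pos)
{
    long n = (long)strlen(lit);

    if (pos < 0 || pos + n > size)
        return -1;
    return memcmp(text + pos, lit, (size_t)n) == 0 ? n : -1;
}

/* Try `startpos' first, then step one position at a time toward
   startpos+range: forwards if range > 0, backwards if range < 0.
   Returns the position of the first success, or -1. */
long search_range(const char *lit, const char *text, long size,
                  long startpos, long range)
{
    assert(lit != NULL && text != NULL);

    for (;;) {
        if (match_at(lit, text, size, startpos) >= 0)
            return startpos;
        if (range == 0)
            return -1;              /* every allowed position was tried */
        if (range > 0) { range--; startpos++; }
        else           { range++; startpos--; }
    }
}
```

With the text "abcabc", searching for "bc" from position 0 with a range of 5 finds the occurrence at 1, while searching from position 5 with a range of -5 finds the one at 4.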
The `regs' parameter is a place to store information about exactly
what was matched if the search is successful, including
subexpressions. A subexpression is any part of a regular expression
bounded by parentheses. The `start' field of a re_registers structure
is an array of character pointers to the beginning of each
subexpression matched. The `end' field is an array of character
pointers to the character just past the end of each subexpression.
For example,
regs->start[0] to regs->end[0] is the entire expression matched
regs->start[1] to regs->end[1] is the subexpression contained
in the first \( \) grouping if there is one
regs->start[2] to regs->end[2] is the subexpression contained
in the second \( \) grouping if there is one
and so on. If a NULL pointer is passed as the `regs' parameter,
no information on matching is stored.
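Given a start and end pointer pair as described above, pulling the matched text out into its own string might look like the sketch below. The helper copy_span is hypothetical and works on any pair of pointers into the same string; it does not touch the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text from `start' up to but not including `end' into
   `out', truncating if it does not fit, and NUL-terminate it.
   Returns the number of characters copied. */
size_t copy_span(const char *start, const char *end,
                 char *out, size_t outsize)
{
    size_t len;

    assert(start != NULL && end != NULL && end >= start);
    assert(out != NULL && outsize > 0);

    len = (size_t)(end - start);
    if (len > outsize - 1)
        len = outsize - 1;          /* leave room for the terminator */
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

After a successful search, a call such as copy_span(regs->start[1], regs->end[1], buf, sizeof buf) would extract the text of the first grouping, provided that grouping took part in the match. How a grouping that did not take part is reported is not stated here, so check for that before copying.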
There is a maximum of NREGS groupings available. If you really need
more, you can change the definition of NREGS in regex.h and recompile
the library.
The return value is the position of the start of the part of the string
that matches the regular expression. If there is no match, a -1 is
returned. If there was some internal error, a -2 is returned.
The function re_search() depends on re_search_2() below to do
its grunt work.
D0               D0     D1       A0     A1       D2     D3
LONG re_search_2(pbufp, string1, size1, string2, size2, startpos,
                 D4     D5    D6
                 range, regs, mstop)
struct re_pattern_buffer *pbufp;
char *string1, *string2;
long size1, size2;
long startpos;
register long range;
struct re_registers *regs;
long mstop;
This function works the same as re_search, with the exception that it
takes different arguments. The regular expression in the buffer
`pbufp' is searched for in the concatenation of `string1' and
`string2.' The parameters `size1' and `size2' are the lengths of
string1 and string2 respectively. The `startpos' is the starting
position of the search and the `range' is how many characters
further to try the search, just as in re_search. The `regs' parameter
is a pointer to a re_registers structure which is space for storing
information about what exactly was matched.
The return value is the position of the start of the part of the string
that matches the regular expression. If there is no match, a -1 is
returned. If there was some internal error, a -2 is returned.
See the description of the re_search() function for more details.
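Treating two strings as one can be pictured with the sketch below, which indexes into the "virtual" concatenation without copying either piece. This is handy when the text lives in two separate chunks, as in an editor that keeps a gap in the middle of its buffer. The helper names are hypothetical and this is not how the library is claimed to work internally.

```c
#include <assert.h>
#include <stddef.h>

/* Return the character at index `i' of the virtual string formed by
   string1 followed by string2. */
char concat_char_at(const char *string1, long size1,
                    const char *string2, long size2, long i)
{
    assert(i >= 0 && i < size1 + size2);
    return (i < size1) ? string1[i] : string2[i - size1];
}

/* Return 1 if the literal `lit' of length `litlen' occurs at position
   `pos' of the virtual string, even across the boundary; else 0. */
int literal_at(const char *lit, long litlen,
               const char *string1, long size1,
               const char *string2, long size2, long pos)
{
    long i;

    if (pos < 0 || pos + litlen > size1 + size2)
        return 0;
    for (i = 0; i < litlen; i++)
        if (concat_char_at(string1, size1, string2, size2, pos + i) != lit[i])
            return 0;
    return 1;
}
```

With string1 "hel" and string2 "lo world", the literal "llo" is found at position 2 even though it straddles the boundary between the two strings.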
D0            D0     D1      A0    A1   D2
LONG re_match(pbufp, string, size, pos, regs)
struct re_pattern_buffer *pbufp;
char *string;
long size, pos;
struct re_registers *regs;
This function matches the compiled regular expression in `pbufp'
against `string,' which is of length `size.' The `pos' parameter is
the position in the string to start the matching. The `regs'
parameter points to space to store information about the part of the
string that matched the regular expression. See the description of
the re_search() function for more details of the `regs' parameter.
The return value is the length of the string that matches the regular
expression. If there is no match, a -1 is returned. If there was some
internal error, a -2 is returned.
The difference between re_search() and re_match() is that re_search()
finds the regular expression anywhere in a certain range of a string
by looking at different starting positions, while re_match() only
looks at the starting position specified.
D0              D0     D1       A0     A1       D2     D3   D4    D5
LONG re_match_2(pbufp, string1, size1, string2, size2, pos, regs, mstop)
struct re_pattern_buffer *pbufp;
unsigned char *string1, *string2;
long size1, size2;
long pos;
struct re_registers *regs;
long mstop;
This function is much like re_match(), except that two strings are
specified as parameters. The function matches the compiled regular
expression in `pbufp' against the concatenation of `string1' and
`string2,' which are of length `size1' and `size2' respectively. The
`pos' parameter is the position in the string to start the matching.
The `regs' parameter points to space to store information about the
part of the string that matched the regular expression. See the
description of the re_search() function for more details of the
`regs' parameter.
The return value is the length of the string that matches the regular
expression. If there is no match, a -1 is returned. If there was some
internal error, a -2 is returned.
Functions in regex.lib:
The following entry points are for compatibility with the BSD Unix
regular expression package. The BSD regular expression package does not
fiddle with such piddly re-entrant ideas as user buffers, and thus a
static buffer is used for you when compiling regular expressions.
If you are writing your program in assembler, you will have to link
with the aregex.lib as well as regex.lib to access these functions.
This is because these routines are written in C, and parameters must be
put on the stack. The glue code in aregex.lib does this for you. For
assembler programs, the entry points for these functions are the
function names without a leading underscore character. (ie. re_comp and
re_exec, instead of _re_comp and _re_exec)
D0
char *re_BSD_initialize()
This function initializes the internal buffer. This function should
be placed at the beginning of any program using the BSD entry points.
void re_BSD_terminate()
This function frees the system resources used by the initialize
routine. This function should be placed at the end of any program
using the BSD entry points.
D0             D0
char *re_comp( s )
char *s;
Compile the pattern in the string `s' for use in subsequent matchings.
If the internal buffer has not been properly initialized, this
function will detect the condition and call re_BSD_initialize()
for you. This means it is not critical to call the initialize
routine, but it is a good idea anyway.
If the string s is a NULL pointer, the previous regular expression
will be used.
If the compilation is successful, a NULL pointer is returned.
Otherwise, a pointer to one of the fixed strings returned by
re_compile_pattern() is returned. (see the description of
re_compile_pattern() above for details.) As well, re_comp() may
return a pointer to the following string:
"No previous regular expression" - re_BSD_initialize was never
called
D0            D0
LONG re_exec( s )
char *s;
Use the last compiled pattern to match against the string `s.'
Like re_search(), this function returns a -1 for no match, a -2 for
internal error, and the position of the beginning of the matched
string for a successful matching.
Variables in regex.lib:
The following variables are also provided in the linkable library for
programming convenience:
struct RegexBase *RegexBase
Assign the results of an OpenLibrary() on regex.library to this
variable. It is used to find the jump table in memory so that
the shared library routines can be executed.
char __Upcase[]
This is a pre-defined translation table for use in a call to
re_initialize_buffer(). It is a translation table that turns
all lower case letters into upper case letters, effectively
making the regular expression case insensitive while matching.
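Why an upper-casing table gives case-insensitive matching can be seen in the sketch below: two characters count as equal whenever they translate to the same table entry. The table built here with toupper() is only assumed to behave like __Upcase; the helper names are hypothetical, and the actual contents of __Upcase are defined in lib2.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define TABLE_SIZE 256  /* assumed: one entry per possible 8-bit character */

/* Build a table mapping lower case letters to upper case and every
   other character to itself. */
void build_upcase_table(char table[TABLE_SIZE])
{
    int i;

    assert(table != NULL);
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = (char)toupper(i);
}

/* Compare the first `n' characters of `a' and `b' after translating
   each one through `table'.  Returns 1 if they are all equal. */
int equal_translated(const char *a, const char *b, size_t n,
                     const char table[TABLE_SIZE])
{
    size_t i;

    for (i = 0; i < n; i++)
        if (table[(unsigned char)a[i]] != table[(unsigned char)b[i]])
            return 0;
    return 1;
}
```

Under this table "Amiga" and "aMIGA" compare equal, while "Amiga" and "Atari" do not.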
Still To Do:
- providing a Modula II, Lattice, PDC and/or Draco linkable support
library
Not having Modula II or Lattice, these are difficult for me to do right
now... However, if you do do any of these, I would be eager to hear
from you!
I suspect the Lattice support would simply consist of a header file
of #pragmas, but I have little idea how that would work.
Files:
alink.asm      - assembler glue code source for aregex.lib
aregex.lib     - interface between assembler and regex.library
interface.asm  - interface between assembler and C within
                 regex.library
lib1.c         - BSD style entry points to regex.library
lib2.c         - default uppercase translation table
library.c      - main shared library routines of regex.library
library.h      - header for library.c
link.asm       - C glue code source for regex.lib
makefile       - makefile for Manx
malloc.c       - support routines for regex.library
ReadMe         - this file
regex.c        - regular expression code in regex.library
regex.h        - C header file for anything to do with regex
regex.i        - assembler header file for anything to do with regex
regex.lib      - interface code between C and regex.library
regex.library  - Amiga shared library
rtag.asm       - ROM tag code for regex.library
startup.asm    - modified small model startup code for regex.library
tester         - test program
tester.c       - source to the above
tinygrep       - small, almost-useful test program
tinygrep.c     - source to the above
Please redirect any comments, criticisms or vivacious vixens:
Edwin Hoogerbeets
Usenet: ehoogerbeets@rose.waterloo.edu (school account until Aug '89)
or edwin@watcsc.waterloo.edu (permanent account)
or w-edwinh@microsoft.uucp (Sept '89 to Dec '89)
CIS: 72647,3675 (funds-dependent permanent 8-)
Remember, pillows don't hit people. People do.