Text File | 1989-08-06 | 25KB | 585 lines

regex.library - v1.0

An Amiga Shared Library of the GNU Regular Expression Package

Ported by Edwin Hoogerbeets 24/07/89

This collection of files may be copied and distributed under the GNU
General Public License. See the comment at the top of regex.c for details.

Adapted from Elib by Jim Mackraz, mklib by Edwin Hoogerbeets, and the GNU
regular expression package by the Free Software Foundation.


A General View of How it is Used:

A regular expression is a concise method of describing a pattern of
characters in a string. By use of special wildcards, almost any pattern can
be described. A regular expression pattern can be used for searching
strings in such programs as editors or other string handling programs.

A regular expression pattern must first be compiled into a form more easily
understood by the matching routines. The compiled form is stored in a
buffer structure called `struct re_pattern_buffer.' The buffer must first
be initialized to allocate memory or resources. The pattern is then
compiled into this buffer. Strings can then be matched against the compiled
regular expression as many times as desired. When the matching is done, the
buffer is terminated, and the program can exit.

There are two parts to the source: the linkable libraries and the Amiga
shared library routines. The linkable libraries contain the non-re-entrant
routines and the glue code that allows access to the shared library
routines. The shared library contains the routines that compile and match
regular expressions.

To use the library, copy regex.library to your libs: directory and simply
execute a program that uses the library, such as tinygrep.


GNU Regular Expressions:

The following table details the various special characters understood in
each of the grep and egrep style regular expressions:

(grep)   (egrep)  (explanation)

.        .        matches any single character except newline
\?       ?        postfix operator; preceding item is optional
*        *        postfix operator; preceding item 0 or more times
\+       +        postfix operator; preceding item 1 or more times
\|       |        infix operator; matches either argument
^        ^        matches the empty string at the beginning of a line
$        $        matches the empty string at the end of a line
\<       \<       matches the empty string at the beginning of a word
\>       \>       matches the empty string at the end of a word
[chars]  [chars]  match any character in the given class; if the first
                  character after [ is ^, match any character not in the
                  given class; a range of characters may be specified by
                  <first>-<last>; for example, \W (below) is equivalent
                  to the class [^A-Za-z0-9]
\( \)    ( )      parentheses are used for grouping and to override
                  operator precedence
\<1-9>   \<1-9>   \<n> matches a repeat of the text matched earlier in the
                  regexp by the subexpression inside the nth opening
                  parenthesis
\        \        any special character may be preceded by a backslash to
                  match it literally

Operator precedence is (highest to lowest) ?, *, and +, concatenation, and
finally |. All other constructs are syntactically identical to normal
characters.


Writing a C Program That Uses Regular Expressions:

To write a program that uses the library, include the header file regex.h
at the top of your source. This declares the data structures and function
return types for you.

You must do an OpenLibrary() call on regex.library and assign the pointer
obtained to the external variable RegexBase. The pointer RegexBase is then
used to find functions within regex.library, and thus RegexBase must be
valid before using any of these library routines. A RegexBase variable is
already provided in regex.lib. When linking, give the -lregex flag to
include regex.lib (the linkable library code).

To use the routines, first declare a struct re_pattern_buffer variable and
call re_initialize_buffer() with a pointer to this buffer. (Specific
details of the regex functions are listed below.)
Then, determine a regular expression you wish to compile, perhaps from user
input. Call the function re_compile_pattern() with a pointer to your buffer
and the string you wish to compile. Now the buffer will contain the
compiled regular expression ready for matching.

Next, you can search for your pattern in any given text by calling
re_search() with the compiled buffer and the string you wish to search.
This will locate the regular expression anywhere in the string you passed
to it, within the bounds specified. If you are looking for a match at one
exact position, however, re_match() is the function you want. It returns
the length of the match when the regular expression matches the string
starting at the character specified, and -1 when it does not.

When you are done with the buffer, you must call re_terminate_buffer() to
reclaim all memory and resources used by the library.

Two programs, tester.c and tinygrep.c, are included in the distribution as
simple examples of programming with the library. Tester allows you to enter
grep style regular expressions and match them against a string. Tinygrep is
a small implementation of the popular grep program that uses the regex
library to search for patterns in text files. However, it is not as fast
overall as GNU grep, or even Manx grep. This is because these other
programs handle their slowest part (input) much better. To make tinygrep
faster, the regular expression searching could be performed directly on the
input buffer.


Assembler Support:

If you are writing in assembler instead of C, the registers expected for
function parameters are listed along with the function descriptions below.
The sequence of calls to the functions in regex.library described for C
still applies. However, instead of using the glue code to call the library,
you should call the regex.library functions directly following this
example:

    ; assembler example of calling re_terminate_buffer()
    ;
    ; define the library offsets
            include 'regex.i'

    ; setup arguments in appropriate registers here
    ; d0 is where the buffer pointer parameter should go
            move.l  bufp,d0

    ; get the address of the library and jump to the appropriate point
            move.l  _RegexBase,a6
            jsr     _LVOre_terminate_buffer(a6)

    ; d0 should now contain the result

To use different functions, replace the re_terminate_buffer part of the jsr
line with the function name you wish to call. The _LVO with the function
name is expanded to a number which is the offset from register a6 where the
address of the function you are calling can be found.


Functions in regex.library:

This is a more detailed description of each of the functions and variables
offered by the regex package. These functions are available from C by
linking with the regex.lib. Regex offers the following entry points:

 D0                        D0    D1
char *re_initialize_buffer(bufp, table)
struct re_pattern_buffer *bufp;
char *table;

This function is used to initialize a pattern buffer `bufp' that is used to
compile regular expressions. Declare a variable of type
`struct re_pattern_buffer' on the stack or dynamically allocate room for
it, and pass a pointer to the new memory to re_initialize_buffer(). The
fields of the buffer are filled in for you.

The `table' parameter is a pointer to a translation table used to equate
characters during matching. When a character is matched, it is used as an
index into this table to find the resulting character. One use for this
might be to translate all vowels to the character @, so that @ can be used
in a regular expression to match any vowel. If the table parameter is NULL,
no translation is performed on the characters, and each character is
matched literally.
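
As a rough illustration of the translation table idea, the C sketch below
builds a table for the vowel example just described. It is not part of the
distribution: the names vowel_table() and chars_equal() are hypothetical,
and the table size of 256 entries (one per possible character value) is an
assumption. The comparison routine only approximates what a translating
matcher might do internally.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 256  /* assumption: one entry per possible char value */

static char vowel_table_storage[TABLE_SIZE];

/* Build (or rebuild) a table in which every vowel maps to '@' and every
   other character maps to itself. Hypothetical helper, not a library
   routine. */
const char *vowel_table(void)
{
    const char *vowels = "aeiouAEIOU";
    int i;

    for (i = 0; i < TABLE_SIZE; i++)
        vowel_table_storage[i] = (char)i;
    for (i = 0; vowels[i] != '\0'; i++)
        vowel_table_storage[(unsigned char)vowels[i]] = '@';

    /* '@' must map to itself so it can stand for "any vowel" in a pattern */
    assert(vowel_table_storage[(unsigned char)'@'] == '@');
    return vowel_table_storage;
}

/* Compare a pattern character and a text character after looking both up
   in the table. A NULL table means a plain literal comparison. */
int chars_equal(const char *table, char a, char b)
{
    if (table == NULL)
        return a == b;
    return table[(unsigned char)a] == table[(unsigned char)b];
}
```

A table built this way would be passed as the `table' argument of
re_initialize_buffer() in place of NULL.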

(See the __Upcase table below for another example.)

If re_initialize_buffer() succeeds, a NULL pointer is returned. If an error
occurs, a pointer to one of the following fixed strings is returned:

    "No buffer"
        - you passed a NULL pointer, not a pointer to a regular
          expression buffer
    "Memory exhausted"
        - not enough memory in the system to initialize the buffer

 D0                      D0
LONG re_terminate_buffer(bufp)
struct re_pattern_buffer *bufp;

This function must be called to free the memory and resources allocated
during the initialize routine. It is not fatal if this routine is not
called before your program exits, but the memory will not all be returned
to the system. (for which you will get royally flamed on the nets, believe
me! 8-)

A value of 1 is returned for a successful termination, and 0 for the error
condition. An error (zero) means you passed a NULL pointer to the function.

 D0                      D0       D1    A0    A1
char *re_compile_pattern(pattern, size, bufp, ob)
char *pattern;
long size, ob;
struct re_pattern_buffer *bufp;

This function compiles a regular expression `pattern' with length `size'
into the properly initialized buffer `bufp.' Different syntaxes for regular
expressions exist. The syntax you would like is specified in the `ob'
parameter. The ob parameter can be one of the following defined flags:

(In general, the presence of one of the flags below indicates that the
character referenced should be treated as a wildcard. If the flag is
absent, then the character is not treated as a wildcard.)

    RE_NO_BK_PARENS
        Treat parentheses as the grouping wildcard. To specify a literal
        parenthesis the pattern \( or \) is needed. If this flag is left
        out, \( and \) are the grouping wildcards and ( and ) match the
        literal parentheses.
    RE_NO_BK_VBAR
        Treat the vertical bar as the "or"-operator, and \| as a literal
        vertical bar. If this flag is left out, the syntax is reversed.
    RE_BK_PLUS_QM
        Treat \+ and \? as the wildcards, and the plain plus and question
        mark characters as literals. If this flag is left out, the syntax
        is reversed. (Note that this flag works the opposite way round
        from the two above: its presence selects the backslashed forms.)
    RE_TIGHT_VBAR
        Bind the vertical bar tighter than the ^ and $ operators. This
        means that the vertical bar takes precedence over the ^ and $ in
        a single expression.
    RE_NEWLINE_OR
        Treat the newline character `\n' as an "or"-operator. This might
        be useful in a program such as fgrep.
    RE_CONTEXT_INDEP_OPS
        Treat certain wildcard characters as wildcards only in certain
        contexts. Specifically, this applies to:
            ^       - only special at the beginning of a line, or after
                      ( or |
            $       - only special at the end of a line, or before ) or |
            *, +, ? - only special when not after the beginning of a
                      line, (, or |

Some programs have a combination of the above flags as their default. The
following flags give the syntax of some well-known Unix utilities in terms
of the above flags:

    RE_SYNTAX_AWK    - emulate awk regular expressions
    RE_SYNTAX_EGREP  - emulate egrep regular expressions
    RE_SYNTAX_GREP   - emulate grep regular expressions
    RE_SYNTAX_EMACS  - emulate emacs-like regular expressions

If re_compile_pattern() is successful in compiling the given regular
expression, a NULL pointer is returned. If an error condition occurs, a
pointer to one of the following fixed strings is returned:

    "Invalid regular expression"
        - eg: "\(ab\)*123\" has an invalid trailing '\'
    "Unmatched \("
        - eg: "\(ab*123" has no closing "\)"
    "Unmatched \)"
        - eg: "ab\)*123" has no opening "\("
    "Premature end of regular expression"
        - eg: "foo[1-9" has no ']'
    "Nesting too deep"
        - you have too many levels of nested \( \) groupings
    "Regular expression too big"
        - the regular expression needed more than 64K to store -- try
          using a shorter one!
    "Memory exhausted"
        - Close some windows!

 D0                     D0
LONG re_compile_fastmap(bufp)
struct re_pattern_buffer *bufp;

If the initial part of a pattern does not match the string starting at a
certain position, the whole expression will not match the string starting
at that position. On this basis, it is possible to compute which characters
can possibly be found at the start of the pattern. If a string does not
start with one of these characters, it cannot match the pattern. This
collection of possible starting characters is called a fastmap. Fastmaps
make pattern searching much faster by reducing the number of failed full
matches.

This function takes a compiled pattern in buffer `bufp' and computes a
fastmap for it, which is stored in the `fastmap' field of the buffer. The
fastmap is then used in the re_search() function while searching a string
for a regular expression. If this function is not called before a
re_search(), then re_search() will call it for you.

 D0            D0     D1      A0    A1        D2     D3
LONG re_search(pbufp, string, size, startpos, range, regs)
struct re_pattern_buffer *pbufp;
char *string;
long size, startpos, range;
struct re_registers *regs;

This function searches the string `string' of size `size' for the regular
expression previously compiled to the buffer `pbufp.' The `startpos'
parameter is the index into the string at which to start searching. If the
search is unsuccessful at startpos, it is tried at startpos+1 and so forth.
The `range' parameter tells how far from the start position to go before
failing. It is up to the caller to make sure that range is not so large as
to take the starting position outside of the input string. If the range
parameter is negative, then the search will proceed from startpos to
startpos-1 and so forth until -range positions have been checked.

The `regs' parameter is a place to store information about exactly what was
matched if the search is successful, including subexpressions. A
subexpression is any part of a regular expression bounded by parentheses.
The `start' field of a re_registers structure is an array of character
pointers to the beginning of each subexpression matched. The `end' field is
an array of character pointers to the character just past the end of each
subexpression.
For example,

    regs->start[0] to regs->end[0] is the entire expression matched
    regs->start[1] to regs->end[1] is the subexpression contained in the
                                   first \( \) grouping if there is one
    regs->start[2] to regs->end[2] is the subexpression contained in the
                                   second \( \) grouping if there is one

and so on. If a NULL pointer is passed as the `regs' parameter, no
information on matching is stored. There is a maximum of NREGS groupings
available. If you really need more, you can change the definition of NREGS
in regex.h and recompile the library.

The return value is the position of the start of the string that matches
the regular expression. If there is no match, a -1 is returned. If there
was some internal error, a -2 is returned.

The function re_search() depends on re_search_2() below to do its grunt
work.

 D0              D0     D1       A0     A1       D2     D3
LONG re_search_2(pbufp, string1, size1, string2, size2, startpos,
                 D4     D5    D6
                 range, regs, mstop)
struct re_pattern_buffer *pbufp;
char *string1, *string2;
long size1, size2;
long startpos;
register long range;
struct re_registers *regs;
long mstop;

This function works the same as re_search(), with the exception that it
takes different arguments. The regular expression in the buffer `pbufp' is
searched for in the concatenation of `string1' and `string2.' The
parameters `size1' and `size2' are the lengths of string1 and string2
respectively. The `startpos' is the starting position of the search and the
`range' is how many characters further to try the search, just as in
re_search(). The `regs' parameter is a pointer to a re_registers structure
which is space for storing information about what exactly was matched.

The return value is the position of the start of the string that matches
the regular expression. If there is no match, a -1 is returned. If there
was some internal error, a -2 is returned. See the description of the
re_search() function for more details.
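
To make "the concatenation of string1 and string2" concrete, here is a
small C sketch of how a position can be resolved against two separate
pieces of text without copying them into one buffer. It is only an
illustration under stated assumptions: concat_char_at() and
find_literal_2() are hypothetical names, not library routines, and a plain
literal string stands in for a compiled regular expression.

```c
#include <assert.h>

/* Return the character at index `pos' of the virtual concatenation of
   string1 (size1 bytes) and string2 (size2 bytes). */
char concat_char_at(const char *string1, long size1,
                    const char *string2, long size2, long pos)
{
    assert(pos >= 0 && pos < size1 + size2);
    if (pos < size1)
        return string1[pos];
    return string2[pos - size1];
}

/* Find the first occurrence of the literal `pat' (patlen bytes) in the
   concatenation. Returns its position counted from the start of string1,
   or -1 if there is none. A match may straddle the seam between the two
   strings. */
long find_literal_2(const char *pat, long patlen,
                    const char *string1, long size1,
                    const char *string2, long size2)
{
    long total = size1 + size2;
    long pos, i;

    for (pos = 0; pos + patlen <= total; pos++) {
        for (i = 0; i < patlen; i++) {
            if (concat_char_at(string1, size1, string2, size2, pos + i)
                    != pat[i])
                break;
        }
        if (i == patlen)
            return pos;
    }
    return -1;
}
```

With string1 = "hello" and string2 = " world", the literal "lo wo" is found
at position 3 even though it begins in one string and ends in the other.
One plausible use for this arrangement (an assumption, not something the
distribution states) is a text editor that keeps its text in two pieces.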

 D0           D0     D1      A0    A1   D2
LONG re_match(pbufp, string, size, pos, regs)
struct re_pattern_buffer *pbufp;
char *string;
long size, pos;
struct re_registers *regs;

This function matches the compiled regular expression in `pbufp' against
`string,' which is of length `size.' The `pos' parameter is the position in
the string to start the matching. The `regs' parameter points to space to
store information about the part of the string that matched the regular
expression. See the description of the re_search() function for more
details of the `regs' parameter.

The return value is the length of the string that matches the regular
expression. If there is no match, a -1 is returned. If there was some
internal error, a -2 is returned.

The difference between re_search() and re_match() is that re_search() finds
the regular expression anywhere in a certain range of a string by looking
at different starting positions, while re_match() only looks at the
starting position specified.

 D0             D0     D1       A0     A1       D2     D3   D4    D5
LONG re_match_2(pbufp, string1, size1, string2, size2, pos, regs, mstop)
struct re_pattern_buffer *pbufp;
unsigned char *string1, *string2;
long size1, size2;
long pos;
struct re_registers *regs;
long mstop;

This function is much like re_match(), except that two strings are
specified as parameters. The function matches the compiled regular
expression in `pbufp' against the concatenation of `string1' and
`string2,' which are of length `size1' and `size2' respectively. The `pos'
parameter is the position in the string to start the matching. The `regs'
parameter points to space to store information about the part of the string
that matched the regular expression. See the description of the re_search()
function for more details of the `regs' parameter.

The return value is the length of the string that matches the regular
expression. If there is no match, a -1 is returned. If there was some
internal error, a -2 is returned.
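
The relationship between searching and matching described above can be
sketched in a few lines of C. This is not the library's code: a literal
string stands in for a compiled pattern, the names literal_match() and
literal_search() are hypothetical, and the loop assumes that the start
position itself plus up to |range| further positions are tried. The return
values follow the conventions documented above (a length for matching, a
position for searching, -1 for no match).

```c
#include <assert.h>
#include <string.h>

/* Stand-in for re_match(): succeed only if `pat' occurs in `string'
   exactly at `pos'. Returns the length matched, or -1 for no match. */
long literal_match(const char *pat, const char *string, long size, long pos)
{
    long len;

    assert(pat != NULL && string != NULL);
    len = (long)strlen(pat);
    if (pos < 0 || pos + len > size)
        return -1;
    return memcmp(string + pos, pat, (size_t)len) == 0 ? len : -1;
}

/* Stand-in for re_search(): try a match at startpos, then at successive
   positions. A positive range moves forward through the string, a
   negative range moves backward. Returns the position of the match,
   or -1 if every attempt fails. */
long literal_search(const char *pat, const char *string, long size,
                    long startpos, long range)
{
    long step = (range < 0) ? -1 : 1;
    long tries = (range < 0) ? -range : range;
    long pos = startpos;
    long i;

    for (i = 0; i <= tries; i++, pos += step) {
        if (literal_match(pat, string, size, pos) >= 0)
            return pos;
    }
    return -1;
}
```

Searching "abcabc" for "abc" from position 1 with a range of 5 finds the
second occurrence at position 3; starting at position 5 with a range of -5
walks backward and finds the same occurrence.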


Functions in regex.lib:

The following entry points are for compatibility with the BSD Unix regular
expression package. The BSD regular expression package does not fiddle with
such piddly re-entrant ideas as user buffers, and thus a static buffer is
used for you when compiling regular expressions.

If you are writing your program in assembler, you will have to link with
the aregex.lib as well as regex.lib to access these functions. This is
because these routines are written in C, and parameters must be put on the
stack. The glue code in aregex.lib does this for you. For assembler
programs, the entry points for these functions are the function names
without a leading underscore character. (ie. re_comp and re_exec, instead
of _re_comp and _re_exec)

 D0
char *re_BSD_initialize()

This function initializes the internal buffer. This function should be
placed at the beginning of any program using the BSD entry points.

void re_BSD_terminate()

This function frees the system resources used by the initialize routine.
This function should be placed at the end of any program using the BSD
entry points.

 D0           D0
char *re_comp(s)
char *s;

Compile the pattern in the string `s' for use in subsequent matchings. If
the internal buffer has not been properly initialized, this function will
detect the condition and call re_BSD_initialize() for you. This means it is
not critical to call the initialize routine, but it is a good idea anyway.
If the string s is a NULL pointer, the previous regular expression will be
used.

If the compilation is successful, a NULL pointer is returned. Otherwise, a
pointer to one of the fixed strings returned by re_compile_pattern() is
returned. (See the description of re_compile_pattern() above for details.)
As well, re_comp() may return a pointer to the following string:

    "No previous regular expression"
        - re_BSD_initialize was never called

 D0          D0
LONG re_exec(s)
char *s;

Use the last compiled pattern to match against the string `s.' Like
re_search(), this function returns a -1 for no match, a -2 for internal
error, and the position of the beginning of the matched string for a
successful matching.


Variables in regex.lib:

The following variables are also provided in the linkable library for
programming convenience:

struct RegexBase *RegexBase

Assign the results of an OpenLibrary() on regex.library to this variable.
It is used to find the jump table in memory so that the shared library
routines can be executed.

char __Upcase[]

This is a pre-defined translation table for use in a call to
re_initialize_buffer(). It is a translation table that turns all lower case
letters into upper case letters, effectively making the regular expression
case insensitive while matching.


Still To Do:

    - providing a Modula II, Lattice, PDC and/or Draco linkable support
      library

Not having Modula II or Lattice, these are difficult for me to do right
now... However, if you do do any of these, I would be eager to hear from
you! I suspect the Lattice support would simply consist of a header file of
#pragmas, but I have little idea how that would work.


Files:

    alink.asm     - assembler glue code source for aregex.lib
    aregex.lib    - interface between assembler and regex.library
    interface.asm - interface between assembler and C within regex.library
    lib1.c        - BSD style entry points to regex.library
    lib2.c        - default uppercase translation table
    library.c     - main shared library routines of regex.library
    library.h     - header for library.c
    link.asm      - C glue code source for regex.lib
    makefile      - makefile for Manx
    malloc.c      - support routines for regex.library
    ReadMe        - this file
    regex.c       - regular expression code in regex.library
    regex.h       - C header file for anything to do with regex
    regex.i       - assembler header file for anything to do with regex
    regex.lib     - interface code between C and regex.library
    regex.library - Amiga shared library
    rtag.asm      - ROM tag code for regex.library
    startup.asm   - modified small model startup code for regex.library
    tester        - test program
    tester.c      - source to the above
    tinygrep      - small, almost-useful test program
    tinygrep.c    - source to the above

Please redirect any comments, criticisms or vivacious vixens:

Edwin Hoogerbeets

Usenet: ehoogerbeets@rose.waterloo.edu (school account until Aug '89)
     or edwin@watcsc.waterloo.edu (permanent account)
     or w-edwinh@microsoft.uucp (Sept '89 to Dec '89)
CIS:    72647,3675 (funds-dependent permanent 8-)

Remember, pillows don't hit people. People do.