home *** CD-ROM | disk | FTP | other *** search
- From: johnk@wrq.com (John Kercheval)
- Newsgroups: comp.sources.misc
- Subject: v17i069: regex - REGEX Globber (Wild Card Matching), Part01/01
- Message-ID: <1991Mar25.220130.22238@sparky.IMD.Sterling.COM>
- Date: 25 Mar 91 22:01:30 GMT
- Approved: kent@sparky.imd.sterling.com
- X-Checksum-Snefru: eef2860b 4b30a023 e30080d9 612cca42
-
- Submitted-by: John Kercheval <johnk@wrq.com>
- Posting-number: Volume 17, Issue 69
- Archive-name: regex/part01
-
- Here is the shar archive of V1.10 of REGEX Globber.
- This is a *IX wildcard globber I butchered, hacked and cajoled together
- after seeing and hearing about and becoming disgusted with several similar
- routines which had one or more of the following attributes: slow, buggy,
- required large levels of recursion on matches, required grotesque levels
- of recursion on failing matches using '*', full of caveats about usability
- or copyrights.
-
- These routines are fairly well tested and reasonably fast. I have made
- an effort to fail on all bad patterns and to quickly determine failing '*'
- patterns. This parser will also do quite a bit of the '*' matching via
- quick linear loops versus the standard blind recursive descent.
-
- This parser has been submitted to profilers at various stages of development
- and has come through quite well. If the last millisecond is important to
- you then some time can be shaved by using stack allocated variables in
- place of many of the pointer follows (which may be done fairly often) found
- in regex_match and regex_match_after_star (ie *p, *t).
-
- No attempt is made to provide general [pat,pat] comparisons. The specific
- subcases supplied by these routines is [pat,text] which is sufficient
- for the large majority of cases (should you care).
-
- Since regex_match may return one of three different values depending upon
- the pattern and text I have made a simple shell for convenience (match()).
- Also included is an is_pattern routine to quickly check a potential pattern
- for regex special characters. I even placed this all in a header file for
- you lazy folks!
-
- Having said all that, here is my own reinvention of the wheel. Please
- enjoy it's use and I hope it is of some help to those with need ....
-
- jbk
- ----
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then feed it
- # into a shell via "sh file" or similar. To overwrite existing files,
- # type "sh file -c".
- # The tool that generated this appeared in the comp.sources.unix newsgroup;
- # send mail to comp-sources-unix@uunet.uu.net if you want that tool.
- # Contents: match.! match.c match.doc match.h matchmak matchtst.bat
- # readme.doc
- # Wrapped by kent@sparky on Mon Mar 25 14:33:28 1991
- PATH=/bin:/usr/bin:/usr/ucb ; export PATH
- echo If this archive is complete, you will see the following message:
- echo ' "shar: End of archive 1 (of 1)."'
- if test -f 'match.!' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'match.!'\"
- else
- echo shar: Extracting \"'match.!'\" \(950 characters\)
- sed "s/^X//" >'match.!' <<'END_OF_FILE'
- X
- X..............................................................................
- X.. ..
- X. REGEX Globber (Wild Card Matching) .
- X. .
- X. A *IX SH style pattern matcher written in C .
- X. V1.10 Dedicated to the Public Domain .
- X. .
- X. March 12, 1991 .
- X. J. Kercheval .
- X. [72450,3702] -- johnk@wrq.com .
- X.. ..
- X..............................................................................
- X
- END_OF_FILE
- if test 950 -ne `wc -c <'match.!'`; then
- echo shar: \"'match.!'\" unpacked with wrong size!
- fi
- # end of 'match.!'
- fi
- if test -f 'match.c' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'match.c'\"
- else
- echo shar: Extracting \"'match.c'\" \(16844 characters\)
- sed "s/^X//" >'match.c' <<'END_OF_FILE'
- X/*
- X EPSHeader
- X
- X File: match.c
- X Author: J. Kercheval
- X Created: Sat, 01/05/1991 22:21:49
- X*/
- X/*
- X EPSRevision History
- X
- X J. Kercheval Wed, 02/20/1991 22:29:01 Released to Public Domain
- X J. Kercheval Fri, 02/22/1991 15:29:01 fix '\' bugs (two :( of them)
- X J. Kercheval Sun, 03/10/1991 19:31:29 add error return to matche()
- X J. Kercheval Sun, 03/10/1991 20:11:11 add is_valid_pattern code
- X J. Kercheval Sun, 03/10/1991 20:37:11 beef up main()
- X J. Kercheval Tue, 03/12/1991 22:25:10 Released as V1.1 to Public Domain
- X*/
- X
- X/*
- X Wildcard Pattern Matching
- X*/
- X
- X
- X#include "match.h"
- X
- Xint matche_after_star (register char *pattern, register char *text);
- Xint fast_match_after_star (register char *pattern, register char *text);
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Return TRUE if PATTERN has any special wildcard characters
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN is_pattern (char *p)
- X{
- X while ( *p ) {
- X switch ( *p++ ) {
- X case '?':
- X case '*':
- X case '[':
- X case '\\':
- X return TRUE;
- X }
- X }
- X return FALSE;
- X}
- X
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Return TRUE if PATTERN has is a well formed regular expression according
- X* to the above syntax
- X*
- X* error_type is a return code based on the type of pattern error. Zero is
- X* returned in error_type if the pattern is a valid one. error_type return
- X* values are as follows:
- X*
- X* PATTERN_VALID - pattern is well formed
- X* PATTERN_ESC - pattern has invalid escape ('\' at end of pattern)
- X* PATTERN_RANGE - [..] construct has a no end range in a '-' pair (ie [a-])
- X* PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
- X* PATTERN_EMPTY - [..] construct is empty (ie [])
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN is_valid_pattern (char *p, int *error_type)
- X{
- X
- X /* init error_type */
- X *error_type = PATTERN_VALID;
- X
- X /* loop through pattern to EOS */
- X while( *p ) {
- X
- X /* determine pattern type */
- X switch( *p ) {
- X
- X /* check literal escape, it cannot be at end of pattern */
- X case '\\':
- X if( !*++p ) {
- X *error_type = PATTERN_ESC;
- X return FALSE;
- X }
- X p++;
- X break;
- X
- X /* the [..] construct must be well formed */
- X case '[':
- X p++;
- X
- X /* if the next character is ']' then bad pattern */
- X if ( *p == ']' ) {
- X *error_type = PATTERN_EMPTY;
- X return FALSE;
- X }
- X
- X /* if end of pattern here then bad pattern */
- X if ( !*p ) {
- X *error_type = PATTERN_CLOSE;
- X return FALSE;
- X }
- X
- X /* loop to end of [..] construct */
- X while( *p != ']' ) {
- X
- X /* check for literal escape */
- X if( *p == '\\' ) {
- X p++;
- X
- X /* if end of pattern here then bad pattern */
- X if ( !*p++ ) {
- X *error_type = PATTERN_ESC;
- X return FALSE;
- X }
- X }
- X else
- X p++;
- X
- X /* if end of pattern here then bad pattern */
- X if ( !*p ) {
- X *error_type = PATTERN_CLOSE;
- X return FALSE;
- X }
- X
- X /* if this a range */
- X if( *p == '-' ) {
- X
- X /* we must have an end of range */
- X if ( !*++p || *p == ']' ) {
- X *error_type = PATTERN_RANGE;
- X return FALSE;
- X }
- X else {
- X
- X /* check for literal escape */
- X if( *p == '\\' )
- X p++;
- X
- X /* if end of pattern here then bad pattern */
- X if ( !*p++ ) {
- X *error_type = PATTERN_ESC;
- X return FALSE;
- X }
- X }
- X }
- X }
- X break;
- X
- X /* all other characters are valid pattern elements */
- X case '*':
- X case '?':
- X default:
- X p++; /* "normal" character */
- X break;
- X }
- X }
- X
- X return TRUE;
- X}
- X
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Match the pattern PATTERN against the string TEXT;
- X*
- X* returns MATCH_VALID if pattern matches, or an errorcode as follows
- X* otherwise:
- X*
- X* MATCH_PATTERN - bad pattern
- X* MATCH_LITERAL - match failure on literal mismatch
- X* MATCH_RANGE - match failure on [..] construct
- X* MATCH_ABORT - premature end of text string
- X* MATCH_END - premature end of pattern string
- X* MATCH_VALID - valid match
- X*
- X*
- X* A match means the entire string TEXT is used up in matching.
- X*
- X* In the pattern string:
- X* `*' matches any sequence of characters (zero or more)
- X* `?' matches any character
- X* [SET] matches any character in the specified set,
- X* [!SET] or [^SET] matches any character not in the specified set.
- X*
- X* A set is composed of characters or ranges; a range looks like
- X* character hyphen character (as in 0-9 or A-Z). [0-9a-zA-Z_] is the
- X* minimal set of characters allowed in the [..] pattern construct.
- X* Other characters are allowed (ie. 8 bit characters) if your system
- X* will support them.
- X*
- X* To suppress the special syntactic significance of any of `[]*?!^-\',
- X* and match the character exactly, precede it with a `\'.
- X*
- X----------------------------------------------------------------------------*/
- X
- Xint matche ( register char *p, register char *t )
- X{
- X register char range_start, range_end; /* start and end in range */
- X
- X BOOLEAN invert; /* is this [..] or [!..] */
- X BOOLEAN member_match; /* have I matched the [..] construct? */
- X BOOLEAN loop; /* should I terminate? */
- X
- X for ( ; *p; p++, t++ ) {
- X
- X /* if this is the end of the text then this is the end of the match */
- X if (!*t) {
- X return ( *p == '*' && *++p == '\0' ) ? MATCH_VALID : MATCH_ABORT;
- X }
- X
- X /* determine and react to pattern type */
- X switch ( *p ) {
- X
- X /* single any character match */
- X case '?':
- X break;
- X
- X /* multiple any character match */
- X case '*':
- X return matche_after_star (p, t);
- X
- X /* [..] construct, single member/exclusion character match */
- X case '[': {
- X
- X /* move to beginning of range */
- X p++;
- X
- X /* check if this is a member match or exclusion match */
- X invert = FALSE;
- X if ( *p == '!' || *p == '^') {
- X invert = TRUE;
- X p++;
- X }
- X
- X /* if closing bracket here or at range start then we have a
- X malformed pattern */
- X if ( *p == ']' ) {
- X return MATCH_PATTERN;
- X }
- X
- X member_match = FALSE;
- X loop = TRUE;
- X
- X while ( loop ) {
- X
- X /* if end of construct then loop is done */
- X if (*p == ']') {
- X loop = FALSE;
- X continue;
- X }
- X
- X /* matching a '!', '^', '-', '\' or a ']' */
- X if ( *p == '\\' ) {
- X range_start = range_end = *++p;
- X }
- X else {
- X range_start = range_end = *p;
- X }
- X
- X /* if end of pattern then bad pattern (Missing ']') */
- X if (!*p)
- X return MATCH_PATTERN;
- X
- X /* check for range bar */
- X if (*++p == '-') {
- X
- X /* get the range end */
- X range_end = *++p;
- X
- X /* if end of pattern or construct then bad pattern */
- X if (range_end == '\0' || range_end == ']')
- X return MATCH_PATTERN;
- X
- X /* special character range end */
- X if (range_end == '\\') {
- X range_end = *++p;
- X
- X /* if end of text then we have a bad pattern */
- X if (!range_end)
- X return MATCH_PATTERN;
- X }
- X
- X /* move just beyond this range */
- X p++;
- X }
- X
- X /* if the text character is in range then match found.
- X make sure the range letters have the proper
- X relationship to one another before comparison */
- X if ( range_start < range_end ) {
- X if (*t >= range_start && *t <= range_end) {
- X member_match = TRUE;
- X loop = FALSE;
- X }
- X }
- X else {
- X if (*t >= range_end && *t <= range_start) {
- X member_match = TRUE;
- X loop = FALSE;
- X }
- X }
- X }
- X
- X /* if there was a match in an exclusion set then no match */
- X /* if there was no match in a member set then no match */
- X if ((invert && member_match) ||
- X !(invert || member_match))
- X return MATCH_RANGE;
- X
- X /* if this is not an exclusion then skip the rest of the [...]
- X construct that already matched. */
- X if (member_match) {
- X while (*p != ']') {
- X
- X /* bad pattern (Missing ']') */
- X if (!*p)
- X return MATCH_PATTERN;
- X
- X /* skip exact match */
- X if (*p == '\\') {
- X p++;
- X
- X /* if end of text then we have a bad pattern */
- X if (!*p)
- X return MATCH_PATTERN;
- X }
- X
- X /* move to next pattern char */
- X p++;
- X }
- X }
- X
- X break;
- X }
- X
- X /* next character is quoted and must match exactly */
- X case '\\':
- X
- X /* move pattern pointer to quoted char and fall through */
- X p++;
- X
- X /* if end of text then we have a bad pattern */
- X if (!*p)
- X return MATCH_PATTERN;
- X
- X /* must match this character exactly */
- X default:
- X if (*p != *t)
- X return MATCH_LITERAL;
- X }
- X }
- X
- X /* if end of text not reached then the pattern fails */
- X if ( *t )
- X return MATCH_END;
- X else
- X return MATCH_VALID;
- X}
- X
- X
- X/*----------------------------------------------------------------------------
- X*
- X* recursively call matche() with final segment of PATTERN and of TEXT.
- X*
- X----------------------------------------------------------------------------*/
- X
- Xint matche_after_star (register char *p, register char *t)
- X{
- X register int match = 0;
- X register nextp;
- X
- X /* pass over existing ? and * in pattern */
- X while ( *p == '?' || *p == '*' ) {
- X
- X /* take one char for each ? and + */
- X if ( *p == '?' ) {
- X
- X /* if end of text then no match */
- X if ( !*t++ ) {
- X return MATCH_ABORT;
- X }
- X }
- X
- X /* move to next char in pattern */
- X p++;
- X }
- X
- X /* if end of pattern we have matched regardless of text left */
- X if ( !*p ) {
- X return MATCH_VALID;
- X }
- X
- X /* get the next character to match which must be a literal or '[' */
- X nextp = *p;
- X if ( nextp == '\\' ) {
- X nextp = p[1];
- X
- X /* if end of text then we have a bad pattern */
- X if (!nextp)
- X return MATCH_PATTERN;
- X }
- X
- X /* Continue until we run out of text or definite result seen */
- X do {
- X
- X /* a precondition for matching is that the next character
- X in the pattern match the next character in the text or that
- X the next pattern char is the beginning of a range. Increment
- X text pointer as we go here */
- X if ( nextp == *t || nextp == '[' ) {
- X match = matche(p, t);
- X }
- X
- X /* if the end of text is reached then no match */
- X if ( !*t++ ) match = MATCH_ABORT;
- X
- X } while ( match != MATCH_VALID &&
- X match != MATCH_ABORT &&
- X match != MATCH_PATTERN);
- X
- X /* return result */
- X return match;
- X}
- X
- X
- X/*----------------------------------------------------------------------------
- X*
- X* match() is a shell to matche() to return only BOOLEAN values.
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN match( char *p, char *t )
- X{
- X int error_type;
- X error_type = matche(p,t);
- X return (error_type == MATCH_VALID ) ? TRUE : FALSE;
- X}
- X
- X
- X#ifdef TEST
- X
- X /*
- X * This test main expects as first arg the pattern and as second arg
- X * the match string. Output is yaeh or nay on match. If nay on
- X * match then the error code is parsed and written.
- X */
- X
- X #include <stdio.h>
- X
- X int main(int argc, char *argv[])
- X {
- X int error;
- X int is_valid_error;
- X
- X if (argc != 3) {
- X printf("Usage: MATCH Pattern Text\n");
- X }
- X else {
- X printf("Pattern: %s\n", argv[1]);
- X printf("Text : %s\n", argv[2]);
- X
- X if (!is_pattern(argv[1])) {
- X printf(" First Argument Is Not A Pattern\n");
- X }
- X else {
- X error = matche(argv[1],argv[2]);
- X is_valid_pattern(argv[1],&is_valid_error);
- X
- X switch ( error ) {
- X case MATCH_VALID:
- X printf(" Match Successful");
- X if (is_valid_error != PATTERN_VALID)
- X printf(" -- is_valid_pattern() is complaining\n");
- X else
- X printf("\n");
- X break;
- X case MATCH_LITERAL:
- X printf(" Match Failed on Literal\n");
- X break;
- X case MATCH_RANGE:
- X printf(" Match Failed on [..]\n");
- X break;
- X case MATCH_ABORT:
- X printf(" Match Failed on Early Text Termination\n");
- X break;
- X case MATCH_END:
- X printf(" Match Failed on Early Pattern Termination\n");
- X break;
- X case MATCH_PATTERN:
- X switch ( is_valid_error ) {
- X case PATTERN_VALID:
- X printf(" Internal Disagreement On Pattern\n");
- X break;
- X case PATTERN_ESC:
- X printf(" Literal Escape at End of Pattern\n");
- X break;
- X case PATTERN_RANGE:
- X printf(" No End of Range in [..] Construct\n");
- X break;
- X case PATTERN_CLOSE:
- X printf(" [..] Construct is Open\n");
- X break;
- X case PATTERN_EMPTY:
- X printf(" [..] Construct is Empty\n");
- X break;
- X default:
- X printf(" Internal Error in is_valid_pattern()\n");
- X }
- X break;
- X default:
- X printf(" Internal Error in matche()\n");
- X break;
- X }
- X }
- X
- X }
- X return(0);
- X }
- X
- X#endif
- END_OF_FILE
- if test 16844 -ne `wc -c <'match.c'`; then
- echo shar: \"'match.c'\" unpacked with wrong size!
- fi
- # end of 'match.c'
- fi
- if test -f 'match.doc' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'match.doc'\"
- else
- echo shar: Extracting \"'match.doc'\" \(5288 characters\)
- sed "s/^X//" >'match.doc' <<'END_OF_FILE'
- X
- X REGEX Globber (Wild Card Matching)
- X
- X A *IX SH style pattern matcher written in C
- X V1.10 Dedicated to the Public Domain
- X
- X March 12, 1991
- X J. Kercheval
- X [72450,3702] -- johnk@wrq.com
- X
- X
- X
- X
- X*IX SH style Regular Expressions
- X================================
- X
- XThe *IX command SH is a working shell similar in feel to the MSDOS
- Xshell COMMAND.COM. In point of fact much of what we see in our
- Xfamiliar DOS PROMPT was gleaned from the early UNIX shells available
- Xfor many of machines the people involved in the computing arena had
- Xat the time of the development of DOS and it's much maligned
- Xprecursor CP/M (although the UNIX shells were and are much more
- Xflexible and powerful then those on the current flock of micro
- Xmachines). The designers of DOS and CP/M did some fairly strange
- Xthings with their command processor and OS. One of those things was
- Xto only selectively adopt the regular expressions allowed within the
- X*IX shells. Only '?' and '*' were allowed in filenames and even with
- Xthese the '*' was allowed only at the end of a pattern and in fact
- Xwhen used to specify the filename the '*' did not apply to extension.
- XThis gave rise to the all too common expression "*.*".
- X
- XREGEX Globber is a SH pattern matcher. This allows such
- Xspecifications as *75.zip or * (equivelant to *.* in DOS lingo).
- XExpressions such as [a-e]*t would fit the name "apple.crt" or
- X"catspaw.bat" or "elegant". This allows considerably wider
- Xflexibility in file specification, general parsing or any other
- Xcircumstance in which this type of pattern matching is wanted.
- X
- XA match would mean that the entire string TEXT is used up in matching
- Xthe PATTERN and conversely the matched TEXT uses up the entire
- XPATTERN.
- X
- XIn the specified pattern string:
- X `*' matches any sequence of characters (zero or more)
- X `?' matches any character
- X `\' suppresses syntactic significance of a special character
- X [SET] matches any character in the specified set,
- X [!SET] or [^SET] matches any character not in the specified set.
- X
- XA set is composed of characters or ranges; a range looks like
- X'character hyphen character' (as in 0-9 or A-Z). [0-9a-zA-Z_] is the
- Xminimal set of characters allowed in the [..] pattern construct.
- XOther characters are allowed (ie. 8 bit characters) if your system
- Xwill support them (it almost certainly will).
- X
- XTo suppress the special syntactic significance of any of `[]*?!^-\',
- Xand match the character exactly, precede it with a `\'.
- X
- XTo view several examples of good and bad patterns and text see the
- Xoutput of MATCHTST.BAT
- X
- X
- X
- XMATCH() and MATCHE()
- X====================
- X
- XThe match module as written has two parsing routines, one is matche()
- Xand the other is match(). Since match() is a call to matche() which
- Xsimply has its output mapped to a BOOLEAN value (ie TRUE if pattern
- Xmatches or FALSE otherwise), I will concentrate my explanations here
- Xon matche().
- X
- XThe purpose of matche() is to match a pattern against a string of
- Xtext (usually a file name or specification). The match routine has
- Xextensive pattern validity checking built into it as part of the
- Xparser and allows for a robust pattern match.
- X
- XThe parser gives an error code on return of type int. The error code
- Xwill be one of the the following defined values (defined in match.h):
- X
- X MATCH_PATTERN - bad pattern or misformed pattern
- X MATCH_LITERAL - match failed on character match (standard
- X character)
- X MATCH_RANGE - match failure on character range ([..] construct)
- X MATCH_ABORT - premature end of text string (pattern longer
- X than text string)
- X MATCH_END - premature end of pattern string (text longer
- X than pattern called for)
- X MATCH_VALID - valid match using pattern
- X
- XThe functions are declared as follows:
- X
- X BOOLEAN match (char *pattern, char *text);
- X
- X int matche(register char *pattern, register char *text);
- X
- X
- X
- XIS_VALID_PATTERN() and IS_PATTERN()
- X===================================
- X
- XThere are two routines for determining properties of a pattern
- Xstring. The first, is_pattern(), is designed simply to determine if
- Xsome character exists within the text which is consistent with a SH
- Xregular expression (this function returns TRUE if so and FALSE if
- Xnot). The second, is_valid_pattern() is designed to check the
- Xvalidity of a given pattern string (TRUE return if valid, FALSE if
- Xnot). By 'validity', I mean well formed or syntactically correct.
- X
- XIn addition, is_valid_pattern() has as one of it's parameters a
- Xreturn code for determining the type of error found in the pattern if
- Xone exists. The error codes are as follows and defined in match.h:
- X
- X PATTERN_VALID - pattern is well formed
- X PATTERN_ESC - pattern has invalid literal escape ('\' at end of
- X pattern)
- X PATTERN_RANGE - [..] construct has a no end range in a '-' pair
- X (ie [a-])
- X PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
- X PATTERN_EMPTY - [..] construct is empty (ie [])
- X
- XThe functions are declared as follows:
- X
- X BOOLEAN is_valid_pattern (char *pattern, int *error_type);
- X
- X BOOLEAN is_pattern (char *pattern);
- END_OF_FILE
- if test 5288 -ne `wc -c <'match.doc'`; then
- echo shar: \"'match.doc'\" unpacked with wrong size!
- fi
- # end of 'match.doc'
- fi
- if test -f 'match.h' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'match.h'\"
- else
- echo shar: Extracting \"'match.h'\" \(3963 characters\)
- sed "s/^X//" >'match.h' <<'END_OF_FILE'
- X/*
- X EPSHeader
- X
- X File: match.h
- X Author: J. Kercheval
- X Created: Sat, 01/05/1991 22:27:18
- X*/
- X/*
- X EPSRevision History
- X
- X J. Kercheval Wed, 02/20/1991 22:28:37 Released to Public Domain
- X J. Kercheval Sun, 03/10/1991 18:02:56 add is_valid_pattern
- X J. Kercheval Sun, 03/10/1991 18:25:48 add error_type in is_valid_pattern
- X J. Kercheval Sun, 03/10/1991 18:47:47 error return from matche()
- X J. Kercheval Tue, 03/12/1991 22:24:49 Released as V1.1 to Public Domain
- X*/
- X
- X/*
- X Wildcard Pattern Matching
- X*/
- X
- X#ifndef BOOLEAN
- X# define BOOLEAN int
- X# define TRUE 1
- X# define FALSE 0
- X#endif
- X
- X/* match defines */
- X#define MATCH_PATTERN 6 /* bad pattern */
- X#define MATCH_LITERAL 5 /* match failure on literal match */
- X#define MATCH_RANGE 4 /* match failure on [..] construct */
- X#define MATCH_ABORT 3 /* premature end of text string */
- X#define MATCH_END 2 /* premature end of pattern string */
- X#define MATCH_VALID 1 /* valid match */
- X
- X/* pattern defines */
- X#define PATTERN_VALID 0 /* valid pattern */
- X#define PATTERN_ESC -1 /* literal escape at end of pattern */
- X#define PATTERN_RANGE -2 /* malformed range in [..] construct */
- X#define PATTERN_CLOSE -3 /* no end bracket in [..] construct */
- X#define PATTERN_EMPTY -4 /* [..] contstruct is empty */
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Match the pattern PATTERN against the string TEXT;
- X*
- X* match() returns TRUE if pattern matches, FALSE otherwise.
- X* matche() returns MATCH_VALID if pattern matches, or an errorcode
- X* as follows otherwise:
- X*
- X* MATCH_PATTERN - bad pattern
- X* MATCH_LITERAL - match failure on literal mismatch
- X* MATCH_RANGE - match failure on [..] construct
- X* MATCH_ABORT - premature end of text string
- X* MATCH_END - premature end of pattern string
- X* MATCH_VALID - valid match
- X*
- X*
- X* A match means the entire string TEXT is used up in matching.
- X*
- X* In the pattern string:
- X* `*' matches any sequence of characters (zero or more)
- X* `?' matches any character
- X* [SET] matches any character in the specified set,
- X* [!SET] or [^SET] matches any character not in the specified set.
- X*
- X* A set is composed of characters or ranges; a range looks like
- X* character hyphen character (as in 0-9 or A-Z). [0-9a-zA-Z_] is the
- X* minimal set of characters allowed in the [..] pattern construct.
- X* Other characters are allowed (ie. 8 bit characters) if your system
- X* will support them.
- X*
- X* To suppress the special syntactic significance of any of `[]*?!^-\',
- X* and match the character exactly, precede it with a `\'.
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN match (char *pattern, char *text);
- X
- Xint matche(register char *pattern, register char *text);
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Return TRUE if PATTERN has any special wildcard characters
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN is_pattern (char *pattern);
- X
- X/*----------------------------------------------------------------------------
- X*
- X* Return TRUE if PATTERN has is a well formed regular expression according
- X* to the above syntax
- X*
- X* error_type is a return code based on the type of pattern error. Zero is
- X* returned in error_type if the pattern is a valid one. error_type return
- X* values are as follows:
- X*
- X* PATTERN_VALID - pattern is well formed
- X* PATTERN_ESC - pattern has invalid escape ('\' at end of pattern)
- X* PATTERN_RANGE - [..] construct has a no end range in a '-' pair (ie [a-])
- X* PATTERN_CLOSE - [..] construct has no end bracket (ie [abc-g )
- X* PATTERN_EMPTY - [..] construct is empty (ie [])
- X*
- X----------------------------------------------------------------------------*/
- X
- XBOOLEAN is_valid_pattern (char *pattern, int *error_type);
- END_OF_FILE
- if test 3963 -ne `wc -c <'match.h'`; then
- echo shar: \"'match.h'\" unpacked with wrong size!
- fi
- # end of 'match.h'
- fi
- if test -f 'matchmak' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'matchmak'\"
- else
- echo shar: Extracting \"'matchmak'\" \(396 characters\)
- sed "s/^X//" >'matchmak' <<'END_OF_FILE'
- X#
- X#
- X# Makefile for match.c
- X#
- X# Created 01-20-91 JBK
- X# Last Modified 02-13-91 JBK
- X#
- X#
- X
- XCC = cl
- X
- X#
- X# This is FLAGS for optimized version
- X#FLAGS = /c /AL /G2 /Ox /W4
- X
- X#
- X# This is FLAGS for optimized version with main
- XFLAGS = /D TEST /AL /G2 /Ox /W4
- X
- X#
- X# This is FLAGS for debugging versions with main
- X#FLAGS = /D TEST /AL /G2 /Od /W4 /Zi /qc
- X
- X
- Xmatch.exe: match.c match.h
- X $(CC) $(FLAGS) match.c
- END_OF_FILE
- if test 396 -ne `wc -c <'matchmak'`; then
- echo shar: \"'matchmak'\" unpacked with wrong size!
- fi
- # end of 'matchmak'
- fi
- if test -f 'matchtst.bat' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'matchtst.bat'\"
- else
- echo shar: Extracting \"'matchtst.bat'\" \(3776 characters\)
- sed "s/^X//" >'matchtst.bat' <<'END_OF_FILE'
- X@echo off
- X
- Xecho .
- Xecho Beginning MATCHTST
- Xecho .
- Xecho Creating file TEST.OUT
- Xecho .
- X
- XREM The following tests should match
- X
- Xmatch test? testy > test.out
- Xmatch test* test >> test.out
- Xmatch tes*t test >> test.out
- Xmatch *test test >> test.out
- Xmatch t*s*t test >> test.out
- Xmatch t*s*t tesseract >> test.out
- Xmatch t?s? test >> test.out
- Xmatch ?s*t psyot >> test.out
- Xmatch [a-z]s*t asset >> test.out
- Xmatch s[!gh]t set >> test.out
- Xmatch t[a-ce]st test >> test.out
- Xmatch tea[ea-c]up teacup >> test.out
- Xmatch [a-fh-z]* jack >> test.out
- Xmatch \i\** i*hello >> test.out
- Xmatch [\[-\]] [ >> test.out
- Xmatch [a-z\\] \ >> test.out
- Xmatch [a-z%_] b >> test.out
- Xmatch [\]] ] >> test.out
- Xmatch \i?* itch >> test.out
- Xmatch \i?* it >> test.out
- Xmatch ?*?*?t test >> test.out
- Xmatch ?*?*?*?* test >> test.out
- Xmatch *\]*\**\?*\[ ]this*is?atest[ >> test.out
- Xmatch [a-\\]* at >> test.out
- Xmatch [a-d\\-/] c >> test.out
- Xmatch *t?l*his bright-land-high-and-his >> test.out
- X
- X
- XREM The following tests should fail
- X
- Xmatch test test >> test.out
- Xmatch \ test >> test.out
- Xmatch tes\ test >> test.out
- Xmatch t*s*t texxeract >> test.out
- Xmatch t?st tst >> test.out
- Xmatch test? test >> test.out
- Xmatch s[!e]t set >> test.out
- Xmatch [] ] >> test.out
- Xmatch [ [ >> test.out
- Xmatch [\[-\] [ >> test.out
- Xmatch [a atest >> test.out
- Xmatch [a- atest >> test.out
- Xmatch [a-z atest >> test.out
- Xmatch [a-]* atest >> test.out
- Xmatch [a-fh-z jack >> test.out
- Xmatch [a-fh-z\] jack >> test.out
- Xmatch [a-fh-z] jack >> test.out
- Xmatch ?*?*?t*? test >> test.out
- Xmatch *????? test >> test.out
- Xmatch *\ test >> test.out
- Xmatch [a-e\ atest >> test.out
- Xmatch [a-\ atest >> test.out
- Xmatch [a-bd-\ atest >> test.out
- Xmatch *?*?t? test >> test.out
- Xmatch t? t >> test.out
- Xmatch ??*t step >> test.out
- Xmatch [a-e]*[!t eel >> test.out
- Xmatch [\ test >> test.out
- Xmatch \t[est test >> test.out
- Xmatch ?*[] hello >> test.out
- X
- Xecho MATCHTST Complete
- Xecho .
- END_OF_FILE
- if test 3776 -ne `wc -c <'matchtst.bat'`; then
- echo shar: \"'matchtst.bat'\" unpacked with wrong size!
- fi
- # end of 'matchtst.bat'
- fi
- if test -f 'readme.doc' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'readme.doc'\"
- else
- echo shar: Extracting \"'readme.doc'\" \(6018 characters\)
- sed "s/^X//" >'readme.doc' <<'END_OF_FILE'
- X03-12-91
- X
- XThis is V1.1 of REGEX Globber.
- X
- X
- X03-12-91
- X
- XI have made a few changes to the match module which do several
- Xthings. The first change is an increase in bad pattern detection
- Xduring a match. It was possible, in some very unlikely cases, to
- Xcook up a pattern which should result in an early bad match but which
- Xwould actually cause problems for the parser. In particular, the
- Xsubcase where the literal escape '\' within an open [..] construct at
- Xthe end of a pattern would end up with incorrect results. I
- Xproceeded to create some of these patterns, added them to my test
- Xbattery and dove straight in.
- X
- XIn the interim I came across a posting to CompuServe (SMATCH by Stan
- XAderman) which attempted to create a completely non-recursive
- Ximplementation of match (I am not sure this is possible without
- Xexplicitly creating your own stack or it's equivelant, like a binary
- Xtree :-{ ). While the code could not correctly handle multiple '*'
- Xcharacters in the pattern, there was a few interesting ideas in the
- Xposting. On some occasions, running match over and over would be
- Xcounter productive, especially and in particular when you have a bad
- Xpattern. I have added a fast routine, is_valid_pattern(), to
- Xdetermine if the current pattern is well formed which should address
- Xthis situation.
- X
- XOne other idea which I unceremoniously lifted from SMATCH was (in
- Xhindsight a pretty obvious feature) the return of a meaningful error
- Xcode from both the pattern validity routine and from match() (which I
- Xrenamed to matche()).
- X
- XI also took some time to experiment with some ways to cut some time
- Xoff the routine. Since this is a SH pattern matcher whose intent is
- Xprimarily for shell functions, the changes could not be algorithmic
- Xchanges which relied on speedup over large input. The differences in
- Xexecution time were not very significant, but I did manage to gain
- Xapproximately 5%-10% speedup when I removed the literal escape ('\')
- Xparsing and pattern error checking. For those of you who want to use
- Xthis for filename wildcard usage, I would recommend doing this since
- Xyou should use is_valid_pattern and is_pattern before going out and
- Xfinding filenames and the dos path delimiter defaults to the
- Xcharacter used for the literal escape ('\') anyway (Note: I will be
- Xsoon be releasing a *IX style file parser in the FINDFILE, FINDNEXT
- Xflavor soon to a Public Domain archive near you :-) ).
- X
- XI also briefly toyed with adding a non-SH regex character '+' to this
- Xmodule but removed it again. It was a performance hit of a few
- Xpercent and would be mostly unused in any event. For those
- Xinterested in such a feature, the changes are truly minimal. The
- Xrequired extra work is:
- X
- X 1) One case statement each in is_pattern() and is_valid_pattern()
- X 2) One case statement in matche()
- X 3) One addition to a while conditional in matche_after_star()
- X 4) One addition to an if conditional in matche_after_star()
- X
- XHint: The case statements are all "case '+'" and the conditionals
- X have "|| *p == '+' " added to them.
- X
- XI have also included a file (MATCH.DOC) which describes matches use and
- Xbackground as well as a little about regular expressions.
- X
- X jbk
- X
- X02-24-91
- X
- XThis is V1.01 of REGEX Globber.
- X
- X
- X02-22-91 Seattle, WA
- X
- XHmm. Choke. (Foot in mouth). After griping about buggy routines and
- Xliterally seconds after posting this code the first time, I received
- Xa wonderful new test evaluation tool which allows you to perform
- Xcoverage analysis during testing. Sure enough I found that about
- X25% of the paths in the program were never traversed in my current
- Xtest battery. After swallowing my (overly large) pride and coming
- Xup with a test battery which covered the entire path of the program
- XI found a couple of minor logic bugs involving literal escapes (\)
- Xwithin other patterns (ie [..] and * sequences). I have repackaged
- Xthese routines and included also the makefile I use and the test
- Xbattery I use to make things a bit easier.
- X
- X jbk
- X
- X02-20-91 Seattle, WA
- X
- XHere is a *IX wildcard globber I butchered, hacked and cajoled together
- Xafter seeing and hearing about and becoming disgusted with several similar
- Xroutines which had one or more of the following attributes: slow, buggy,
- Xrequired large levels of recursion on matches, required grotesque levels
- Xof recursion on failing matches using '*', full of caveats about usability
- Xor copyrights.
- X
- XI submit this without copyright and with the clear understanding that
- Xthis code may be used by anyone, for any reason, with any modifications
- Xand without any guarantees, warrantee or statements of usability of any
- Xsort.
- X
- XHaving gotten those cow chips out of the way, these routines are fairly
- Xwell tested and reasonably fast. I have made an effort to fail on all
- Xbad patterns and to quickly determine failing '*' patterns. This parser
- Xwill also do quite a bit of the '*' matching via quick linear loops versus
- Xthe standard blind recursive descent.
- X
- XThis parser has been submitted to profilers at various stages of development
- Xand has come through quite well. If the last millisecond is important to
- Xyou then some time can be shaved by using stack allocated variables in
- Xplace of many of the pointer follows (which may be done fairly often) found
- Xin regex_match and regex_match_after_star (ie *p, *t).
- X
- XNo attempt is made to provide general [pat,pat] comparisons. The specific
- Xsubcases supplied by these routines is [pat,text] which is sufficient
- Xfor the large majority of cases (should you care).
- X
- XSince regex_match may return one of three different values depending upon
- Xthe pattern and text I have made a simple shell for convenience (match()).
- XAlso included is an is_pattern routine to quickly check a potential pattern
- Xfor regex special characters. I even placed this all in a header file for
- Xyou lazy folks!
- X
- XHaving said all that, here is my own reinvention of the wheel. Please
- Xenjoy it's use and I hope it is of some help to those with need ....
- X
- X
- X jbk
- X
- END_OF_FILE
- if test 6018 -ne `wc -c <'readme.doc'`; then
- echo shar: \"'readme.doc'\" unpacked with wrong size!
- fi
- # end of 'readme.doc'
- fi
- echo shar: End of archive 1 \(of 1\).
- cp /dev/null ark1isdone
- MISSING=""
- for I in 1 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have the archive.
- rm -f ark[1-9]isdone
- else
- echo You still must unpack the following archives:
- echo " " ${MISSING}
- fi
- exit 0
- exit 0 # Just in case...
- --
- Kent Landfield INTERNET: kent@sparky.IMD.Sterling.COM
- Sterling Software, IMD UUCP: uunet!sparky!kent
- Phone: (402) 291-8300 FAX: (402) 291-4362
- Please send comp.sources.misc-related mail to kent@uunet.uu.net.
-