═══ 1. Introduction ═══

The "RxRegExp.DLL" code allows any rexx program to make fast and extensive use of regular expression searches and replaces. The original regular expression "engine" (which was not modified) was written by "Henry Spencer" (henry@zoo.toronto.edu). I can find no version number on this source, although the doco is dated "5 Sept 1996".

I have provided a sample rexx program which should help you get started.

The DLL was written by "Dennis Bareis" (db0@anz.com); the latest code can be obtained from my web page at "http://www.ozemail.com.au/~dbareis".

═══ 2. Regular Expressions ═══

The following description is based on the documentation written by "Henry Spencer" (henry@zoo.toronto.edu).

A regular expression is a sequence of characters which describes what we are searching for.

A regular expression is zero or more branches separated by "|". It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by "*", "+", or "?" where:

 - An atom followed by "*" matches a sequence of 0 or more matches of the atom.
 - An atom followed by "+" matches a sequence of 1 or more matches of the atom.
 - An atom followed by "?" matches a match of the atom, or the null string.

An atom is:

 - A regular expression in parentheses (matching a match for the regular expression).
 - A range (see below).
 - "." which matches any single character.
 - "^" which matches the empty string at the start of a line.
 - "$" which matches the empty string at the end of a line.
 - "\" followed by a single character (a backslash followed by a character) which matches that character. This is needed to search for "$" etc!
 - A single character with no other significance, which matches that character.

A range is a sequence of characters enclosed in "[]". It normally matches any single character from the sequence.
If the sequence begins with "^", it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by "-" then this is shorthand for the full list of ASCII characters between them (e.g. "[0-9]" matches any decimal digit). To include a literal "]" in the sequence, make it the first character (following a possible "^"). To include a literal "-", make it the first or last character.

AMBIGUITY

If a regular expression could match two different parts of the input string, it will match the one which begins earliest. If both begin in the same place but match different lengths, or match the same length in different ways, life gets messier, as follows:

 - In general, the possibilities in a list of branches are considered in left-to-right order, the possibilities for "*", "+", and "?" are considered longest-first, nested constructs are considered from the outermost in, and concatenated constructs are considered leftmost-first. The match that will be chosen is the one that uses the earliest possibility in the first choice that has to be made. If there is more than one choice, the next will be made in the same manner (earliest possibility) subject to the decision on the first choice. And so forth.

   For example, "(ab|a)b*c" could match "abc" in one of two ways. The first choice is between "ab" and "a"; since "ab" is earlier, and does lead to a successful overall match, it is chosen. Since the "b" is already spoken for, the "b*" must match its last possibility (the empty string) since it must respect the earlier choice.

 - In the particular case where the regular expression does not use "|" and does not apply "*", "+", or "?" to parenthesized subexpressions, the net effect is that the longest possible match will be chosen. So "ab*", presented with "xabbbby", will match "abbbb". Note that if "ab*" is tried against "xabyabbbz", it will match "ab" just after "x", due to the begins-earliest rule.
   In effect, the decision on where to start the match is the first choice to be made, hence subsequent choices must respect it even if this leads them to less-preferred alternatives.

═══ 3. Commands ═══

The following functions are available:

 - RegExpVersion
 - RegExpCompile
 - RegExpMatch
 - RegExpReplace

Note that you can't trust the return codes from the rexx add/query/drop functions, so if you wish to detect (instead of dying) that the DLL is unavailable you should check my sample rexx program, which contains a subroutine that does this.

═══ 3.1. RegExpVersion ═══

This routine allows you to determine the version and author information of the DLL you are using.

The routine takes a single parameter as follows:

 1. Name of Variable to update with version Information.

The information returned (separated by a single space):

 1. Version Number such as "98.104".
 2. My Web page URL (where you can get a later version if available).
 3. My email Address.
 4. My Name.

Please see my sample rexx program for an example of this call in use.

Returns

This routine returns "OK" if the call succeeds, otherwise it returns text which describes the reason for the failure.

═══ 3.2. RegExpCompile ═══

To make use of a regular expression it must first be compiled. This routine will either compile a regular expression or release any memory associated with the last compiled regular expression.

The routine takes a single parameter as follows:

 1. The regular expression to be compiled. There is one exception: if you pass exactly "ReClose" then this is taken to be a command to close off (release) the previously compiled regular expression.

The compiled expression (plus some other bits) is held in memory until released. Rexx programs running in different sessions will not interfere with each other. I am not sure whether there can be any problems within a session, but I suspect not.

Please see my sample rexx program for an example of this call in use.
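
The compile and release steps described above can be sketched roughly as follows. This is only an illustration, not part of the supplied sample program: it assumes that one compiled expression can be reused for several "RegExpMatch" calls (the text implies this but does not state it outright), and the expression "[0-9]+", the "Lines." stem and the "Pairs" variable are made-up names for the example.

```rexx
/* Sketch: compile once, match several strings, then release the memory */
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch',   'RxRegExp', 'RegExpMatch';

/*--- Compile the expression and stop if the DLL reports a problem ---*/
CompileRc = RegExpCompile('[0-9]+');
if CompileRc <> 'OK' then
do
   say 'Compile failed: ' || CompileRc;
   exit 1;
end;

/*--- Assumption: the compiled expression survives repeated matches ---*/
Lines.1 = 'abc123';
Lines.2 = 'no digits here';
Lines.0 = 2;
do i = 1 to Lines.0
   MatchRc = RegExpMatch(Lines.i, 'Pairs');
   if MatchRc <> 'OK' then
      say 'Match failed: ' || MatchRc;
   else if Pairs = '' then
      say '"' || Lines.i || '": no match';
   else
      say '"' || Lines.i || '": match pairs = ' || Pairs;
end;

/*--- Release the memory held for the compiled expression ---*/
call RegExpCompile 'ReClose';
```

Passing "ReClose" at the end mirrors the close step in the sample program; if the compile fails there is nothing to release, so the sketch simply exits.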
Returns

This routine returns "OK" if the call succeeds, otherwise it returns text which describes the reason for the failure.

═══ 3.3. RegExpMatch ═══

This routine will attempt to find a match for the regular expression which was previously compiled.

The routine takes 2 parameters as follows:

 1. The string to be searched.
 2. Name of Variable to update with match details. If the variable is blank then there was no match, otherwise a set of one or more "match pairs" is returned.

Match Pairs

In a set of match pairs the first describes the location of the overall regular expression match, while any others (if they exist) describe the location of any matches for expressions that occurred between round brackets in the regular expression.

Each match pair describes the starting position (1st byte is 1) and the length of the match. If you used the regular expression "(A)(B)" on the string "ABCDEF" then "1 2 1 1 2 1" would be returned: the overall match starts at 1 and is 2 bytes long, "(A)" starts at 1 and is 1 byte long, and "(B)" starts at 2 and is 1 byte long.

Please see my sample rexx program for an example of this call in use.

Returns

This routine returns "OK" if the call succeeds, otherwise it returns text which describes the reason for the failure.

═══ 3.4. RegExpReplace ═══

This routine will build a modified string based on the match information from a previous regular expression match.

The routine takes 2 parameters as follows:

 1. The replace specification.
 2. Name of Variable which will contain the modified string.

Replace Specification

The first thing you should understand is that we are not performing any sort of normal search and replace here; we are taking a specification and building a new string. If you wanted to perform a search and replace type operation you would need to use round brackets and ensure that the text before and after the match is also described. Note that doing this can vary the match location of the string you are trying to match!

The replace specification is basically a string where any occurrence of "&" or "\0" is replaced with the overall matched characters. "\1" to "\9" are replaced with the appropriate match information for each round bracketed expression ("\1" is the first one).

For example, you may wish to replace "SET" with "set". You could use the regular expression "SET" to search for the string with "RegExpMatch", however this would not be useful unless you manually wrote code to do the replacement (not hard). To use this "replace" routine you would instead search with something like "(^.*)SET(.*$)" and then your replace specification could be "\1set\2".

Please see my sample rexx program for an example of this call in use.

Returns

This routine returns "OK" if the call succeeds, otherwise it returns text which describes the reason for the failure.

═══ 4. Rexx Example ═══

/* Small Test program for my REXX Regular Expression Code */

/*--- Load up functions we will use -----------------------------------------*/
call RxFuncAdd 'RegExpVersion', 'RxRegExp', 'RegExpVersion';
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch', 'RxRegExp', 'RegExpMatch';
call RxFuncAdd 'RegExpReplace', 'RxRegExp', 'RegExpReplace';

/*--- Make sure Regular Expression DLL is available for use -----------------*/
say 'RxRegExp.DLL';
say '~~~~~~~~~~~~';
if RegExpOk() = 'N' then
do
   say "Regular Expressions can't be used (DLL probably unavailable)!";
   exit(GetLineNumber());
end;
say "Available";

/*--- Display Version Info --------------------------------------------------*/
say '';
say 'RegExpVersion';
say '~~~~~~~~~~~~~';
VerRc = RegExpVersion('ReVersion');
if VerRc <> 'OK' then
do
   say ' COULD NOT GET RXREGEXP.DLL VERSION INFO'
   say ' ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
   say ' ' || VerRc;
   exit(GetLineNumber());
end;
parse var ReVersion DllVersion WebPage EmailAddress AuthorName;
say 'Version = ' || DllVersion;
say 'Author = ' || AuthorName;
say 'Email = ' || EmailAddress;
say 'Web Page = ' || WebPage;

/*--- Define/Compile regular expression -------------------------------------*/
say ''
say 'RegExpCompile';
say '~~~~~~~~~~~~~';
Re = arg(1);
if Re = '' then
   Re = "1[12]";
CompileRc = RegExpCompile(Re);
say CompileRc;
if CompileRc <> 'OK' then
   exit(GetLineNumber());

/*--- Look for a match ------------------------------------------------------*/
LookIn = "AAA112BBB";
say ''
say 'RegExpMatch';
say '~~~~~~~~~~~';
say 'Looking for ' || Re || ' in "' || LookIn || '"';
Answer = '?';
MatchRc = RegExpMatch(LookIn, "Answer");
if MatchRc <> 'OK' then
do
   /*--- An error occurred --------------------------------------------------*/
   say ' ERROR'
   say ' ~~~~~'
   say ' ' || MatchRc;
   exit(GetLineNumber());
end
else
do
   /*--- Did we have any matches? -------------------------------------------*/
   if Answer = '' then
   do
      say ' * No Matches';
      exit(GetLineNumber());
   end;

   /*--- List all matches and submatches ------------------------------------*/
   say ' * MatchList = ' || Answer;

   /*--- Extract the overall match information ------------------------------*/
   parse var Answer MatchStart MatchLength SubMatch;
   say ' * Match starts at posn ' || MatchStart || ' and is ' || MatchLength || ' bytes long.';

   /*--- Extract the submatch information -----------------------------------*/
   Index = 0;
   do forever
      /*--- Extract info ---------------------------------------------------*/
      parse var SubMatch SubMatchStart SubMatchLength SubMatch;

      /*--- Make sure we have the information ------------------------------*/
      if SubMatchLength = '' then
         leave;

      /*--- Report the information -----------------------------------------*/
      Index = Index + 1;
      say ' * Match (' || Index || ') starts at posn ' || SubMatchStart || ' and is ' || SubMatchLength || ' bytes long.';
   end;
end;

/*--- Make change -----------------------------------------------------------*/
say ''
say 'RegExpReplace';
say '~~~~~~~~~~~~~';
Before = "xxx\0yyy"
say 'Replacing: ' || Before
Answer = '?';
MatchRc = RegExpReplace(Before, "Answer");
say 'ReplaceRc = ' || MatchRc;
say 'Answer = ' || Answer;

/*--- Close Regular expression ----------------------------------------------*/
say '';
say 'RegExpCompile(ReClose)';
say '~~~~~~~~~~~~~~~~~~~~~~';
say RegExpCompile("ReClose");
exit(0);

/*===========================================================================*/
RegExpOk:
/*                                                                           */
/* None of the rexx 'Rx' functions (add/drop/query) work correctly. The      */
/* return code can't be trusted.                                             */
/*                                                                           */
/* This routine will return 'Y' if the DLL can be accessed. It calls one     */
/* of the known functions. We have already registered it so if we fail then  */
/* the registration failed.                                                  */
/*                                                                           */
/* Note that as this code is within a subroutine the trap handler we set up  */
/* does not override any set up in the calling code once this routine        */
/* returns.                                                                  */
/*===========================================================================*/
   /*--- Set up trap handler and execute the command ------------------------*/
   signal on SYNTAX name RegExpNotOk;
   interpret "DummyReRc = RegExpVersion('ReVersion')";

   /*--- We did not die so the function must be available -------------------*/
   return('Y');

   /*--- We must have died so the function is unavailable -------------------*/
   RegExpNotOk:
   return('N');

/*===========================================================================*/
GetLineNumber:
/*===========================================================================*/
   return( SIGL );
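
The RegExpReplace description mentions that a plain search for "SET" only becomes a replacement if you write the replacement code yourself. A minimal sketch of that manual step is shown below, using only built-in rexx string functions. The "Pairs" value is hard-coded here as a hypothetical "RegExpMatch" result (start 8, length 3) so the fragment runs without the DLL; the variable names are illustrative and not taken from the DLL.

```rexx
/* Sketch: replace the overall match by hand, using only the match pair */
LookIn = 'Please SET the value';
Pairs  = '8 3';     /* hypothetical RegExpMatch result for "SET" */

/*--- First pair is the overall match: start position and length ---*/
parse var Pairs MatchStart MatchLength .;

/*--- Split the string into before / matched / after pieces ---*/
Matched    = substr(LookIn, MatchStart, MatchLength);
TextBefore = left(LookIn, MatchStart - 1);
TextAfter  = substr(LookIn, MatchStart + MatchLength);

/*--- Build the new string around the replacement text ---*/
NewText = TextBefore || 'set' || TextAfter;

say 'Matched = ' || Matched;
say 'Result  = ' || NewText;
```

With the values above this displays "SET" as the matched text and "Please set the value" as the result. In real use "Pairs" would be the variable filled in by "RegExpMatch", and you would check it for blank (no match) before parsing it.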