RxRegExp.INF (OS/2 Help File, 1998-04-14)
═══ 1. Introduction ═══
The "RxRegExp.DLL" code allows any rexx program to make fast, extensive use of
regular expression searches and replaces.
The original regular expression "engine" (which was not modified) was written by
"Henry Spencer" (henry@zoo.toronto.edu). There is no version number on this
source that I can find, although the documentation is dated "5 Sept 1996".
I have provided a sample rexx program which should help you get started.
The DLL was written by "Dennis Bareis" (db0@anz.com). The latest code can be
obtained from my web page at "http://www.ozemail.com.au/~dbareis".
═══ 2. Regular Expressions ═══
The following description is based on the documentation written by
"Henry Spencer" (henry@zoo.toronto.edu).
A regular expression is a sequence of characters which describes what we are
searching for.
A regular expression is zero or more branches separated by "|". It matches
anything that matches one of the branches.
A branch is zero or more pieces, concatenated. It matches a match for the
first, followed by a match for the second, etc.
A piece is an atom possibly followed by "*", "+", or "?" where:
An atom followed by "*" matches a sequence of 0 or more matches of
the atom.
An atom followed by "+" matches a sequence of 1 or more matches of
the atom.
An atom followed by "?" matches a match of the atom, or the null
string.
An atom is one of:
A regular expression in parentheses (matching a match for the regular
expression).
A range (see below).
"." which matches any single character.
"^" which matches the empty string at the start of a line.
"$" which matches the empty string at the end of a line.
"\" followed by a single character, which matches that character.
This is needed to search for "$" etc!
A single character with no other significance, which matches that character.
A range is a sequence of characters enclosed in "[]". It normally matches any
single character from the sequence. If the sequence begins with "^", it
matches any single character not from the rest of the sequence. If two
characters in the sequence are separated by "-" then this is shorthand for the
full list of ASCII characters between them (e.g. "[0-9]" matches any decimal
digit). To include a literal "]" in the sequence, make it the first character
(following a possible "^"). To include a literal "-", make it the first or
last character.
AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest. If both begin in the same place
but match different lengths, or match the same length in different ways, life
gets messier, as follows:
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for "*", "+", and "?" are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered
leftmost-first. The match that will be chosen is the one that uses
the earliest possibility in the first choice that has to be made. If
there is more than one choice, the next will be made in the same
manner (earliest possibility) subject to the decision on the first
choice. And so forth.
For example, "(ab|a)b*c" could match "abc" in one of two ways. The
first choice is between "ab" and "a"; since "ab" is earlier, and does
lead to a successful overall match, it is chosen. Since the "b" is
already spoken for, the "b*" must match its last possibility (the
empty string) since it must respect the earlier choice.
In the particular case where the regular expression does not use "|"
and does not apply "*", "+", or "?" to parenthesized subexpressions,
the net effect is that the longest possible match will be chosen. So
"ab*", presented with "xabbbby", will match "abbbb". Note that if
"ab*" is tried against "xabyabbbz", it will match "ab" just after
"x", due to the begins-earliest rule. In effect, the decision on
where to start the match is the first choice to be made, hence
subsequent choices must respect it even if this leads them to
less-preferred alternatives.
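The rules above can be tried out directly. The following is a minimal sketch which assumes "RxRegExp.DLL" is installed and uses only the "RegExpCompile" and "RegExpMatch" calls described in the "Commands" section. The "ShowMatch" subroutine is a helper made up for this illustration and is not part of the DLL. Going by the rules above, the three calls should report "abbbb", "ab" and "abc" respectively.

```rexx
/* Sketch: try the AMBIGUITY examples (needs RxRegExp.DLL) */
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch',   'RxRegExp', 'RegExpMatch';

call ShowMatch "ab*",       "xabbbby";     /* longest-first rule     */
call ShowMatch "ab*",       "xabyabbbz";   /* begins-earliest rule   */
call ShowMatch "(ab|a)b*c", "abc";         /* leftmost branch first  */
exit(0);

/*--- ShowMatch: hypothetical helper, not part of the DLL -------------------*/
ShowMatch:
   parse arg Re, LookIn;

   /*--- Compile the expression ---------------------------------------------*/
   Rc = RegExpCompile(Re);
   if Rc <> 'OK' then
   do
      say 'Compile of "' || Re || '" failed: ' || Rc;
      return;
   end;

   /*--- Match and report the overall match text ----------------------------*/
   Pairs = '';
   Rc = RegExpMatch(LookIn, "Pairs");
   if Rc <> 'OK' then
      say 'Match failed: ' || Rc;
   else if Pairs = '' then
      say '"' || Re || '" did not match "' || LookIn || '"';
   else
   do
      /*--- First pair is the overall match (start, length) -----------------*/
      parse var Pairs Start Len .;
      say '"' || Re || '" in "' || LookIn || '" matched "' || substr(LookIn, Start, Len) || '"';
   end;

   /*--- Release the compiled expression ------------------------------------*/
   call RegExpCompile "ReClose";
   return;
```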
═══ 3. Commands ═══
The following functions are available:
RegExpVersion
RegExpCompile
RegExpMatch
RegExpReplace
Note that you can't trust the return codes from the rexx function add/query/drop
calls. If you wish to detect that the DLL is unavailable (instead of dying),
see the "RegExpOk" subroutine in my sample rexx program.
═══ 3.1. RegExpVersion ═══
This routine will allow you to determine the version and author information of
the DLL you are using.
The routine takes a single parameter as follows:
1. Name of Variable to update with version Information.
The information returned consists of the following fields, each separated by a
single space:
1. Version Number such as "98.104".
2. My Web page URL (where you can get a later version if available).
3. My email Address.
4. My Name.
Please see my sample rexx program for an example of this call in use.
Returns
This routine returns "OK" if the call succeeds, otherwise it returns text which
describes the reason for the failure.
═══ 3.2. RegExpCompile ═══
To make use of a regular expression it must first be compiled. This routine
will either compile a regular expression or release any memory associated with
the last compiled regular expression.
The routine takes a single parameter as follows:
1. The regular expression to be compiled. There is one exception: if
you pass exactly "ReClose" then this is taken to be a command to close
off (release) the previously compiled regular expression.
The compiled expression (plus some other bits) is held in memory until released.
Rexx programs running in different sessions will not interfere with each
other. I am not sure whether there can be any problems within a session, but I
suspect not.
Please see my sample rexx program for an example of this call in use.
Returns
This routine returns "OK" if the call succeeds, otherwise it returns text which
describes the reason for the failure.
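Since the compiled expression stays in memory until released, one natural pattern is to compile once, match against several strings and then pass "ReClose". The following is only a sketch of that idea: it assumes "RxRegExp.DLL" is available, it uses "RegExpMatch" (described in the next section) and the test strings are made up for illustration.

```rexx
/* Sketch: compile once, match several strings, release (needs RxRegExp.DLL) */
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch',   'RxRegExp', 'RegExpMatch';

/*--- Made-up test data -----------------------------------------------------*/
Text.0 = 3;
Text.1 = "order 42";
Text.2 = "no digits here";
Text.3 = "7 days";

/*--- Compile the expression once -------------------------------------------*/
Rc = RegExpCompile("[0-9]+");
if Rc <> 'OK' then
do
   say 'Compile failed: ' || Rc;
   exit(1);
end;

/*--- Reuse the compiled expression for every string ------------------------*/
do Index = 1 to Text.0
   Pairs = '';
   Rc = RegExpMatch(Text.Index, "Pairs");
   if Rc <> 'OK' then
      say 'Match failed: ' || Rc;
   else if Pairs = '' then
      say '"' || Text.Index || '": no match';
   else
      say '"' || Text.Index || '": match pairs = ' || Pairs;
end;

/*--- Release the memory held for the compiled expression -------------------*/
say RegExpCompile("ReClose");
exit(0);
```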
═══ 3.3. RegExpMatch ═══
This routine will attempt to find a match for the previously compiled regular
expression in the string you supply.
The routine takes 2 parameters as follows:
1. The string to be searched.
2. Name of Variable to update with match details. If the variable is
blank after the call then there was no match, otherwise it holds a set
of one or more "match pairs".
Match Pairs
In a set of match pairs the first describes the location of the overall
match while any others (if they exist) describe the location of the matches
for subexpressions that were enclosed in round brackets in the regular
expression.
Each match pair describes the starting position (1st byte is 1) and the length
of the match.
If you used the regular expression "(A)(B)" on the string "ABCDEF" then
"1 2 1 1 2 1" would be returned.
Please see my sample rexx program for an example of this call in use.
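The match pair string is easy to pick apart with "parse". The following sketch needs no DLL: it takes the "1 2 1 1 2 1" result quoted above for "(A)(B)" and uses the standard "substr" function to recover the matched text. The variable names are illustrative only.

```rexx
/* Sketch: decode the match pairs returned for "(A)(B)" against "ABCDEF" */
LookIn = "ABCDEF";
Pairs  = "1 2 1 1 2 1";                 /* value quoted in the text above */

Count = 0;
do while Pairs <> ''
   /*--- Peel off one "start length" pair and keep the rest ----------------*/
   parse var Pairs Start Len Pairs;
   Count      = Count + 1;
   Text.Count = substr(LookIn, Start, Len);
end;

/*--- First pair is the overall match, the rest are bracketed submatches ----*/
say 'Overall match = "' || Text.1 || '"';               /* AB */
do Index = 2 to Count
   say 'Submatch (' || (Index - 1) || ') = "' || Text.Index || '"';
end;
```

Here the overall match is "AB" (position 1, length 2), the first bracketed expression matched "A" (position 1, length 1) and the second matched "B" (position 2, length 1).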
Returns
This routine returns "OK" if the call succeeds, otherwise it returns text which
describes the reason for the failure.
═══ 3.4. RegExpReplace ═══
This routine will modify a string you supply based on the match information
from a previous regular expression.
The routine takes 2 parameters as follows:
1. The replace specification.
2. Name of Variable which will contain the modified string.
Replace Specification
The first thing you should understand is that we are not performing any sort
of normal search and replace here; we are taking a specification and building
a new string. If you wanted to perform a search and replace type operation
you would need to use round brackets and ensure that the text before and after
the match was also described. Note that doing this can vary the match
location of the string you are trying to match!
The replace specification is basically a string where any occurrence of "&" or
"\0" is replaced with the overall matched characters. "\1" to "\9" are
replaced with the appropriate match information for each round bracketed
expression ("\1" is the first one).
For example, you may wish to replace "SET" with "set". You could search for
the string with "RegExpMatch" using the regular expression "SET", but that
would not be useful here unless you wrote your own code to do the replacement
(not hard). To use this "replace" routine you would instead search with
something like "(^.*)SET(.*$)", and your replace specification could then be
"\1set\2".
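The "SET" example might look something like this in practice. This is a sketch only: it assumes "RxRegExp.DLL" is available, the input line is made up for illustration, and the "Die" subroutine is a hypothetical helper which simply reports the failure text and exits.

```rexx
/* Sketch: replace "SET" with "set" using RegExpReplace (needs RxRegExp.DLL) */
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch',   'RxRegExp', 'RegExpMatch';
call RxFuncAdd 'RegExpReplace', 'RxRegExp', 'RegExpReplace';

Line = "SET COMSPEC=CMD.EXE";               /* made-up input line */

/*--- Brackets capture the text before (\1) and after (\2) "SET" ------------*/
Rc = RegExpCompile("(^.*)SET(.*$)");
if Rc <> 'OK' then call Die Rc;

/*--- The replace works from the details of this match ----------------------*/
Pairs = '';
Rc = RegExpMatch(Line, "Pairs");
if Rc <> 'OK' then call Die Rc;

if Pairs = '' then
   say 'No match, line left unchanged: ' || Line;
else
do
   /*--- Build the new string from the replace specification ----------------*/
   NewLine = '';
   Rc = RegExpReplace("\1set\2", "NewLine");
   if Rc <> 'OK' then call Die Rc;
   say 'Before: ' || Line;
   say 'After : ' || NewLine;
end;

call RegExpCompile "ReClose";
exit(0);

/*--- Die: hypothetical helper, not part of the DLL --------------------------*/
Die:
   say 'Regular expression call failed: ' || arg(1);
   exit(1);
```

Because ".*" is tried longest-first, a line containing "SET" more than once would have only the last occurrence changed by this expression.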
Please see my sample rexx program for an example of this call in use.
Returns
This routine returns "OK" if the call succeeds, otherwise it returns text which
describes the reason for the failure.
═══ 4. Rexx Example ═══
/* Small Test program for my REXX Regular Expression Code */
/*--- Load up functions we will use -----------------------------------------*/
call RxFuncAdd 'RegExpVersion', 'RxRegExp', 'RegExpVersion';
call RxFuncAdd 'RegExpCompile', 'RxRegExp', 'RegExpCompile';
call RxFuncAdd 'RegExpMatch', 'RxRegExp', 'RegExpMatch';
call RxFuncAdd 'RegExpReplace', 'RxRegExp', 'RegExpReplace';
/*--- Make sure Regular Expression DLL is available for use -----------------*/
say 'RxRegExp.DLL';
say '~~~~~~~~~~~~';
if RegExpOk() = 'N' then
do
   say "Regular Expressions can't be used (DLL probably unavailable)!";
   exit(GetLineNumber());
end;
say "Available";
/*--- Display Version Info --------------------------------------------------*/
say '';
say 'RegExpVersion';
say '~~~~~~~~~~~~~';
VerRc = RegExpVersion('ReVersion');
if VerRc <> 'OK' then
do
   say ' COULD NOT GET RXREGEXP.DLL VERSION INFO'
   say ' ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
   say ' ' || VerRc;
   exit(GetLineNumber());
end;
parse var ReVersion DllVersion WebPage EmailAddress AuthorName;
say 'Version = ' || DllVersion;
say 'Author = ' || AuthorName;
say 'Email = ' || EmailAddress;
say 'Web Page = ' || WebPage;
/*--- Define/Compile regular expression -------------------------------------*/
say ''
say 'RegExpCompile';
say '~~~~~~~~~~~~~';
Re = arg(1);
if Re = '' then
   Re = "1[12]";
CompileRc = RegExpCompile(Re);
say CompileRc;
if CompileRc <> 'OK' then
   exit(GetLineNumber());
/*--- Look for a match ------------------------------------------------------*/
LookIn = "AAA112BBB";
say ''
say 'RegExpMatch';
say '~~~~~~~~~~~';
say 'Looking for ' || Re || ' in "' || LookIn || '"';
Answer = '?';
MatchRc = RegExpMatch(LookIn, "Answer");
if MatchRc <> 'OK' then
do
   /*--- An error occurred --------------------------------------------------*/
   say ' ERROR'
   say ' ~~~~~'
   say ' ' || MatchRc;
   exit(GetLineNumber());
end
else
do
   /*--- Did we have any matches? -------------------------------------------*/
   if Answer = '' then
   do
      say ' * No Matches';
      exit(GetLineNumber());
   end;
   /*--- List all matches and submatches ------------------------------------*/
   say ' * MatchList = ' || Answer;
   /*--- Extract the overall match information ------------------------------*/
   parse var Answer MatchStart MatchLength SubMatch;
   say ' * Match starts at posn ' || MatchStart || ' and is ' || MatchLength || ' bytes long.';
   /*--- Extract the submatch information -----------------------------------*/
   Index = 0;
   do forever
       /*--- Extract info ---------------------------------------------------*/
       parse var SubMatch SubMatchStart SubMatchLength SubMatch;
       /*--- Make sure we have the information ------------------------------*/
       if SubMatchLength = '' then
          leave;
       /*--- Report the information -----------------------------------------*/
       Index = Index + 1;
       say ' * Match (' || Index || ') starts at posn ' || SubMatchStart || ' and is ' || SubMatchLength || ' bytes long.';
   end;
end;
/*--- Make change -----------------------------------------------------------*/
say ''
say 'RegExpReplace';
say '~~~~~~~~~~~~~';
Before = "xxx\0yyy"
say 'Replacing: ' || Before
Answer = '?';
MatchRc = RegExpReplace(Before, "Answer");
say 'ReplaceRc = ' || MatchRc;
say 'Answer = ' || Answer;
/*--- Close Regular expression ----------------------------------------------*/
say '';
say 'RegExpCompile(ReClose)';
say '~~~~~~~~~~~~~~~~~~~~~~';
say RegExpCompile("ReClose");
exit(0);
/*===========================================================================*/
RegExpOk:
/* */
/* None of the rexx 'Rx' functions (add/drop/query) work correctly. The */
/* return code can't be trusted. */
/* */
/* This routine will return 'Y' if the DLL can be accessed. It calls one */
/* of the known functions. We have already registered it so if we fail then */
/* the registration failed. */
/* */
/* Note that as this code is within a subroutine the trap handler we set up */
/* does not override any set up in the calling code once this routine */
/* returns. */
/*===========================================================================*/
   /*--- Set up trap handler and execute the command ------------------------*/
   signal on SYNTAX name RegExpNotOk;
   interpret "DummyReRc = RegExpVersion('ReVersion')";
   /*--- We did not die so the function must be available -------------------*/
   return('Y');
   /*--- We must have died so the function is unavailable -------------------*/
RegExpNotOk:
   return('N');
/*===========================================================================*/
GetLineNumber:
/*===========================================================================*/
return( SIGL );