CHP.EXE
Changes Those Substrings from a Set of Files
That Are Matched by a Given Regular Expression Pattern
by
Robert A. Magnuson
DMB, DCRT, NIH
Bethesda, MD 20892
Mar 1988
Revised Apr 1988
CHP (for Change via Pattern) is a tool used to search DOS files
for those lines that contain a match for a specified pattern.
Matches are changed according to the replacement specification.
All input lines--changed or not--are written to stdout.
[This document is intended to be read from the screen.
It contains some characters which will probably not print
correctly on a printer.]
A DOS command line that invokes CHP contains a number of
arguments. The syntax is summarized by the following:
┌────────────────┐ ┌─────┐
│ options: │ ┌─ <pat> ─┴ <r> ─┤ ┌─────────────┐
CHP ─┴─ / ─┬ <ltr> ┬──┴──┤ ├──┴─┬ <file> ┬──┴─
└──────┘ └─── <patfile> ──┘ └───────┘
Option Definitions:
─────────────────── │ <r> replaces the match,
/c <r> is capitalized like the match│ is required unless /d.
/d delete match (there is no <r>) │
/e no more option args even if │In <r>:
next arg begins with / │\0 is the entire match,
/f pattern, replacement in <patfile> │\δ is the δth group where δ is 1-9,
/v verify change │\u upcases next \δ where δ is 0-9,
/x compare case exact │\l lowercases next \δ,
──────────────────────────────────────┤\i initialcaps next \δ,
Pattern, replacement can contain hexa-│\α is α, where α is any other byte,
decimal byte representations "\xhh". │anything else is itself.
First there may be optional arguments specifying various CHP
options. Then there is the required pattern which is used to
match the lines taken from the files. Next, there is the usually
required replacement specification. Then come the filenames
which can be wildcarded. If no filenames appear, CHP gets its
input from stdin.
The arguments can optionally be enclosed in double
quotes. The enclosing quotes are stripped off and not
seen by CHP. Should you need to have a double quote
within an argument, the argument must be double quoted
and the internal double quote must be escaped by
preceding it with a backslash. This treatment of the
double quotes is done by the argc/argv mechanism of the C
compiler. This mechanism does not allow for null
arguments. What looks like a quoted nullstring--two
consecutive doublequotes--is treated like whitespace
between arguments. [CHP is implemented in Borland
TurboC.]
CHP exits with ERRORLEVEL set to one if some matched substrings
were changed, to zero if no changes occurred, and to two for
syntax problems.
OPTION ARGUMENTS:
CHP options are specified by the presence or absence of various
option letters in option arguments. Any option argument must
begin with a slash and must appear in front of the other kinds of
arguments. Any number of legal option letters can appear in an
option argument. Thus, you can have multiple option arguments,
perhaps each with a single option letter (and each beginning with
a slash), or just one option argument containing all of the
option letters desired. CURRENTLY, ALL LEGAL CHP OPTION LETTERS
ARE lower case.
For the sake of readability, the above syntax diagram
shows only the case where all option letters appear in
one option argument.
A CHP syntax error occurs when illegal option letters appear, and
when required arguments are missing. The <p> argument is
required. The <r> argument is required unless the /d option is
present.
When a syntax error occurs CHP prints a boxed syntax diagram
containing terse instructions on how to use CHP. This mechanism
can be deliberately tripped in order to get on-screen help. The
suggested way is to invoke CHP with no arguments--thus causing
the no-<p> syntax error.
The /d option asks that the matches be deleted, which means
replacing the matches with nullstrings. If the compiler used in
writing CHP permitted null command-line arguments, deletion would
be done by making the replacement null. When the /d option is
taken, the first argument beyond the pattern (i.e., the one that
would be <r>)--if any--is taken to be the first filename.
The /v option permits interactive verification as to whether
each or any of the found matches should be changed. In that
case, CHP writes the line number and the line's contents on the
screen with the current match highlighted. Then, prompted by CHP
whether to make this change, the user has six options: change
it, don't change it, change it and all following matches, change
neither this one nor any following matches, quit the program
immediately without finishing, or get help on what choices there
are.
In matching the pattern, case (upper or lower) is ignored unless
the /x option is taken.
The /e option permits the pattern to start with a
slash--otherwise the pattern would look like another option
argument.
The /c option capitalizes <r> like the match. More fully, /c
makes <r>'s first constant letter the same case as the match's
first letter. A constant byte in <r> is one appearing directly,
i.e., not part of the yield of \0 or \1, etc. If the match has
no letters, no case change is made in <r>.
Usually the pattern and replacement appear directly in the DOS
command line. However, the DOS command line maximum length is
severely restricted. Should there not be enough room for the
pattern and replacement, they can be placed in a file (as the
first and second lines). The /f option is then used to signify
that the <p> parameter is not the pattern, but rather the name of
a file containing both the pattern and replacement.
The doublequoting convention that is carried out by the
argc-argv treatment of the command line arguments does
not apply to the pattern or replacement contained in a
file (when the /f option is in effect). Doublequotes,
spaces, etc., are taken literally.
REGULAR-EXPRESSION PATTERN-MATCHING:
CHP selects the substrings in the lines by means of pattern
matching. The <p> is a pattern (or names the file that contains
the pattern and replacement) that is matched against each file
line. Case (upper or lower) is ignored unless the /x option has
been taken.
A pattern is a string of characters. We distinguish two kinds of
characters: normal and meta. There are exactly nine
metacharacters:
. * + ? ^ $ [ ] \
The remaining characters are normal. Metacharacters sometimes
combine with normal characters as we will see. Otherwise, a
normal character simply matches itself. The metacharacters
behave as follows:
A matches
───── ────────────────────────────
. any single byte
* 0 or more of the preceding
+ 1 or more of the preceding
? 0 or 1 of the preceding
[...] any 1 of the enclosed bytes
[^...] any byte not enclosed
^ the beginning of the <cmp>
$ the end of the <cmp>
\α α, where α is a metabyte
α\!ß α or ß
\(α\) α (grouped for precedence)
\δ the δth group where δ is 1-9
\b beginning/end of a word
\< beginning of a word
\> end of a word
\w a word byte: [a-zA-Z0-9]
\W a nonword byte: [^a-zA-Z0-9]
Note that the '\' metacharacter is used as an escape, i.e., to
quote a metacharacter. Thus to match, e.g., a period (as a
normal character) you must use '\.' If you leave out the
backslash, the period alone will have its metacharacter meaning.
In the above explanation of the '*', '+' and '?' metacharacters,
'preceding' means 'the shortest possible preceding'. Thus, 'ab+'
matches 'ab', 'abb', etc., but not 'abab'.
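The binding rule can be tried out with a short Python sketch. Python's re module is only a stand-in here (CHP itself is built on REGEX.C), and the helper name whole_match is hypothetical. Note that Python writes a group as (ab) where CHP writes \(ab\).

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """True when the pattern matches the entire text.

    Case is ignored, as CHP does when the /x option is not taken.
    """
    return re.fullmatch(pattern, text, flags=re.IGNORECASE) is not None

# '+' applies only to the 'b' that immediately precedes it.
print(whole_match(r"ab+", "ab"))
print(whole_match(r"ab+", "abb"))
print(whole_match(r"ab+", "abab"))

# Grouping widens what '+' applies to.
print(whole_match(r"(ab)+", "abab"))
```

The first two and the last call report a match; the third does not, because the '+' never reaches back past the 'b'.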
The square bracket metacharacters specify any one of the enclosed
characters--known as a character class. The minus sign has a
special meaning as a range in a character class. '[a-g]' can be
used in place of '[abcdefg]'. When appearing first in a
character class, a circumflex indicates that the match is with
any character not in the character class. Thus, '[^0-9]' matches
any non decimal-digit. Most metacharacters lose their special
status in a character class, and should not be escaped. If a
right square bracket is to be in a character class, it must
follow immediately the beginning left square bracket. If a minus
sign is to be in a character class, it must appear as '---',
i.e., a range containing only itself. Since the square brackets
do not nest, a left square bracket can easily be included in a
character class. E.g., '[][]' matches a right or a left square
bracket.
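A few of these character-class rules can be checked with the Python sketch below. It assumes that Python's re module treats these particular classes the same way CHP does, which holds for the cases shown but is not guaranteed in general. The helper name in_class is hypothetical.

```python
import re

def in_class(char_class: str, byte: str) -> bool:
    """True when the single character `byte` is matched by `char_class`."""
    return re.fullmatch(char_class, byte) is not None

# A range stands for every character between its end points.
print(in_class("[a-g]", "d"), in_class("[a-g]", "h"))

# A leading circumflex negates the class.
print(in_class("[^0-9]", "x"), in_class("[^0-9]", "7"))

# A right bracket placed first is taken literally, so '[][]' holds
# both square brackets.
print(in_class("[][]", "]"), in_class("[][]", "["), in_class("[][]", "a"))
```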
Some pattern match examples follow:
A matches
────────────────── ───────────────────────
zyx zyx
f.x fax, fix and fxx
f\.x f.x
f[aix]x only fax, fix and fxx
\[[a-z]+\] [hello] and [world]
\(suf\!pre\)fix suffix or prefix
ba\(na\)* banananana
[A-P]: A:, C:, H:, etc
\([cd]:\)?\w abc, c:zyx, d:cat, etc
\(abra\)\(cad\1\)* abracadabracadabra
Due to a bug in PC DOS I have changed the alternative
(i.e., the "or") from '\|' to '\!'. The vertical, '|',
is DOS's piping symbol. Although doublequoting is
supposed to protect any redirection symbols in the
interior from being acted upon, under certain
circumstances DOS performs the redirection even though
it is doublequoted.
Please note that CHP's pattern matching is done via REGEX.C from
Free Software Foundation, Inc.
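If you want to experiment with CHP-style patterns in another tool, the main notational differences are the backslashed parentheses and the '\!' alternative. The Python sketch below rewrites those into Python's re notation. It is a rough aid under stated assumptions, not a description of how CHP works: the function name chp_to_python is hypothetical, '\<' and '\>' are not handled, and Python's \w also accepts the underscore, which CHP's \w does not.

```python
import re

def chp_to_python(pattern: str) -> str:
    """Rewrite a CHP-style pattern into Python `re` notation (rough sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "()":
                out.append(nxt)        # \( and \) group in CHP
            elif nxt == "!":
                out.append("|")        # \! is CHP's "or"
            else:
                out.append(ch + nxt)   # \. \w \b \1 etc. pass through
            i += 2
        else:
            # Bare ( ) | { } are normal characters in CHP but special in Python.
            out.append("\\" + ch if ch in "()|{}" else ch)
            i += 1
    return "".join(out)

def chp_matches(pattern: str, text: str) -> bool:
    """Whole-string match, ignoring case as CHP does without /x."""
    return re.fullmatch(chp_to_python(pattern), text, re.IGNORECASE) is not None

print(chp_to_python(r"\(suf\!pre\)fix"))
print(chp_matches(r"\(abra\)\(cad\1\)*", "abracadabracadabra"))
```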
THE REPLACEMENT STRING:
The replacement string, <r> in the syntax diagram, uses only one
metacharacter--namely the backslash. The entire match is
represented by \0. The first group (a submatch) is represented
by \1, the second by \2, through \9 for the ninth. To represent
a backslash as itself, you escape it by preceding it with a
backslash. A \l yields nothing but has the side effect of
lowercasing the yield of the next backslash-digit pair.
Similarly, a \u uppercases, and a \i initialcaps. What about a
backslash not followed by a digit, 'l', 'u', 'i', or a backslash?
It disappears. Other characters in the replacement string
represent themselves.
The \i upcases the first byte of each substring of
letters, and lowercases the remaining letters in the
substring. A nonletter causes the next substring of
letters to be initialcapped.
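The replacement rules can be summarized in a short Python sketch. This is my reading of the description above, not CHP's source code. The names expand, initialcap and CASE_OPS are hypothetical, the match object comes from Python's re module, and the \xhh notation is not handled.

```python
import re

def initialcap(text: str) -> str:
    """Upcase the first letter of each run of letters, lowercase the rest."""
    return re.sub(r"[A-Za-z]+", lambda m: m.group(0).capitalize(), text)

CASE_OPS = {"u": str.upper, "l": str.lower, "i": initialcap}

def expand(repl: str, match: re.Match) -> str:
    """Expand a CHP-style replacement string against a Python match."""
    out = []
    pending = None                  # case change waiting for the next \digit
    i = 0
    while i < len(repl):
        ch = repl[i]
        if ch != "\\":
            out.append(ch)          # anything else is itself
            i += 1
            continue
        nxt = repl[i + 1:i + 2]     # empty if the backslash is the last byte
        i += 2
        if nxt and nxt in "0123456789":
            piece = match.group(int(nxt)) or ""   # \0 is the entire match
            if pending is not None:
                piece, pending = pending(piece), None
            out.append(piece)
        elif nxt in CASE_OPS:
            pending = CASE_OPS[nxt]  # yields nothing by itself
        else:
            out.append(nxt)          # the backslash disappears
    return "".join(out)

m = re.search(r"(\w+) (\w+)", "MAGNUSON ROBERT")
print(expand(r"\i\2 \i\1", m))
```

With the two-word match shown, the replacement \i\2 \i\1 produces the names in forward order with initial capitals.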
HEXADECIMAL REPRESENTATION IN PATTERN, REPLACEMENT:
Sometimes it is convenient or necessary to represent bytes in a
coded fashion. You may need a smiley face, for example. You can
keyboard this character directly in two ways: (1) by entering a
control-A, or (2) by holding down the ALT key, typing a "1" on
the numeric keyboard, then releasing the ALT key. But when you
need to document this character on your printer, the smiley face
does not print at all! Worse yet, if you need to enter a tab on
the DOS command line, DOS may translate it into spaces (up to the
next tab stop). CHP allows characters to be entered in a
hexadecimal format, either as
\xhh
or as
\xh
where the h's are hexadecimal digits. Both the "x" and the A
through F can be upper/lower case (or mixed).
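As an illustration of the notation, the Python sketch below turns each backslash-x form into the byte it stands for. The names HEX_ESCAPE and decode_hex are hypothetical. One assumption is mine alone: when two hexadecimal digits follow the "x", both are taken, so a one-digit form followed by a literal hex digit would be read as a two-digit form. The text above does not say how CHP resolves that case.

```python
import re

# Backslash, x or X, then one or two hexadecimal digits of either case.
HEX_ESCAPE = re.compile(r"\\[xX]([0-9A-Fa-f]{1,2})")

def decode_hex(arg: str) -> str:
    """Replace each hexadecimal form with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), arg)

# A tab written in a form that DOS will not expand into spaces.
print(repr(decode_hex(r"a\x09b")))

# The smiley face (byte value 1), in the one-digit form.
print(repr(decode_hex(r"\X1")))
```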
EXAMPLES:
To copy the file ALPHA.TXT to the new file JUNK while changing
all occurrences of 'cat' to 'tiger', do a
chp cat tiger alpha.txt>junk
The receiving file is well named because each occurrence of 'cat',
irrespective of case, and regardless of surrounding material,
will be changed. The verify option will help a little, i.e.,
chp/v cat tiger alpha.txt>junk
but I suspect that the Quit option will be taken after you start
to find so many occurrences of 'cat' in other contexts.
To copy the file ALPHA.TXT to the new file JUNK while changing
all occurrences of the word 'cat' (and not 'cats', 'cathode', or
'indicate') to 'tiger', do a
chp \bcat\b tiger alpha.txt>junk
Note the '\b''s (i.e., wordbreaks) surrounding 'cat'.
But wait. What about the word 'cats'? Shouldn't it be changed
to 'tigers'? Try
chp \bcat\(s?\)\b tiger\1 alpha.txt>junk
Here the pattern may be paraphrased as a word that begins with
'cat' and ends with zero or one 's'. Note that the zero or one
's' has been grouped in the funny backslashed parentheses. This
is to remember it in \1 which is used in the replacement string.
Thus the word 'cats' does become 'tigers'.
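For comparison, here is the same change sketched in Python. The re module is a stand-in for CHP, and the function name cats_to_tigers is hypothetical. Python writes the group as (s?) rather than \(s?\), and the IGNORECASE flag plays the part of CHP's default case handling.

```python
import re

def cats_to_tigers(text: str) -> str:
    """Change the words 'cat' and 'cats' to 'tiger' and 'tigers'."""
    # (s?) remembers the optional 's' so the replacement can reuse it as \1.
    return re.sub(r"\bcat(s?)\b", r"tiger\1", text, flags=re.IGNORECASE)

print(cats_to_tigers("the cat and cats indicate a cathode"))
```

The words 'indicate' and 'cathode' are left alone because the wordbreaks do not fall around their 'cat'.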
Some occurrences of the word 'cat' (or 'cats') may begin a
sentence, and others may be within a sentence. I.e., at the
beginning of a sentence, we have 'Cat', whereas 'cat' occurs
within the sentence. Correspondingly, we would like 'Cat' to
become 'Tiger', and 'cat' 'tiger'. That is what the /c CHP
option is for. E.g., if you invoke CHP with a
chp/c \bcat\b tiger alpha.txt
which has the /c option,
Cat, cat, burning bright,
is changed to
Tiger, tiger, burning bright,
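The effect of /c can be imitated in Python as sketched below. This covers only the simple case of a replacement made up of constant bytes (no \0 or \1), and the names capitalize_like and change_with_c are hypothetical, not anything taken from CHP.

```python
import re

def capitalize_like(replacement: str, matched: str) -> str:
    """Give the replacement's first letter the case of the match's first letter."""
    first = next((c for c in matched if c.isalpha()), None)
    if first is None:
        return replacement          # match has no letters: no case change
    for i, c in enumerate(replacement):
        if c.isalpha():
            cased = c.upper() if first.isupper() else c.lower()
            return replacement[:i] + cased + replacement[i + 1:]
    return replacement

def change_with_c(pattern: str, replacement: str, line: str) -> str:
    """Change every match, adjusting the replacement's case per match."""
    return re.sub(pattern,
                  lambda m: capitalize_like(replacement, m.group(0)),
                  line, flags=re.IGNORECASE)

print(change_with_c(r"\bcat\b", "tiger", "Cat, cat, burning bright,"))
```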
To change doublequotes into singlequotes with verification, do a
chp/v "\"" ' alpha.txt >junk
The pattern seen by CHP in this case is simply one doublequote.
The C compiler's argc-argv handler sees a doublequoted escaped
doublequote. The outer doublequotes are stripped, and the inner
escaped doublequote becomes a doublequote. The resultant
doublequote is then given to CHP as the second command-line
argument. [The first argument is '/v'.]
Just for fun suppose you want to look at a file with all its
vowels removed. Try
chp/d [aeiou] chp.doc
There is no replacement string here. The /d option indicates
deletion of the matched substrings. The pattern consists of a
character class containing the five vowels. Because there is no
redirection of stdout, the output will come pouring out on the
screen just below the above CHP command line. You might want to
have BREAK ON (a DOS command) before you do this so that you can
terminate CHP with a ^C.
In a similar vein, how about removing all words containing
'th'? To make it even more mystifying, we should also remove the
character following each of these words (usually a space) so that
there will be no obvious gaps. This can be done to this document
with a
chp/dx \w*th\w*\W chp.doc
That pattern can be paraphrased as zero or more word constituents
followed by 'th' (exact case because of the /x option) followed
by zero or more word constituents followed by one nonword
constituent. Then, if you'd like to see the words that were
removed, you can use the companion FP.EXE (Find Pattern) with a
fp/hx \w*th\w*\W chp.doc
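The same deletion can be sketched in Python. The character classes are spelled out because CHP's \w is [a-zA-Z0-9] while Python's \w also accepts the underscore. The names TH_WORD, drop_th_words and list_th_words are hypothetical, and list_th_words only loosely imitates looking at the removed words. It is not a description of FP.

```python
import re

WORD = "[a-zA-Z0-9]"       # CHP's \w as defined in the table of metacharacters
NONWORD = "[^a-zA-Z0-9]"   # CHP's \W

# No IGNORECASE flag: this mirrors the exact-case /x option.
TH_WORD = re.compile(f"{WORD}*th{WORD}*{NONWORD}")

def drop_th_words(text: str) -> str:
    """Delete each word containing 'th' plus the one byte that follows it."""
    return TH_WORD.sub("", text)

def list_th_words(text: str) -> list:
    """Return the substrings that drop_th_words would delete."""
    return TH_WORD.findall(text)

print(drop_th_words("both of them sat down"))
print(list_th_words("both of them sat down"))
```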
How about removing the definite and indefinite articles, 'the',
'a' and 'an', from a file? Try, e.g.,
chp/d \b\(the\!an?\)\b\W chp.doc>junk
Perhaps you have a text file, JUNK1, wherein some lines have
leading spaces, and some words are separated by multiple spaces.
The leading spaces are to be deleted, and each internal
multiple-space sequence is to be changed into a single space.
These two changes cannot be done with a single invocation of CHP.
However, the task can be done with a single DOS command line in
which a CHP invocation is piped to another whose output is
redirected to the resultant file, JUNK2. The following DOS
command line
chp/d "^ +" junk1 | chp "  +" " " > junk2
does the job. The first CHP specifies deletion of one or more
leading spaces, the input file is JUNK1, and the output is piped.
The second CHP's <p> specifies two or more spaces, the <r> is a
single space, and the output is redirected to JUNK2. Because it
has no input file, the second CHP gets its input from (the piped)
stdin.
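The two-stage pipe corresponds to two substitutions applied one after the other, as in this Python sketch. The function name tidy_spaces is hypothetical and Python's re module stands in for CHP. Two passes are needed for the same reason as above: each pass has a single replacement, and the two changes call for different ones.

```python
import re

def tidy_spaces(line: str) -> str:
    """Delete leading spaces, then squeeze runs of spaces to one."""
    line = re.sub(r"^ +", "", line)       # first CHP: /d with one or more leading spaces
    return re.sub(r" {2,}", " ", line)    # second CHP: two or more spaces become one

print(tidy_spaces("   one  two   three"))
```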
Sometimes you need to exchange the order of pairs of words so
that the second precedes the first. This might occur in a data
file in which names are in lastname/firstname order and you want
them reversed. Assuming the words are separated by a single
space, the following pattern matches a pair of words putting the
first in \1 and the second in \2.
\(\w+\) \(\w+\)
Now, to copy the file NAMES.LST to the new file JUNK
interchanging the word pair at the beginning of each line, do a
chp/v "^\(\w+\) \(\w+\)" "\2 \1" names.lst>junk
The /v option is useful since it allows you to look at each match
and at what is done to it. If you don't like the first few, you
can quit the operation. If you like it, you can tell CHP to do
the rest of the file without further verification (by responding
with a ^R).
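Without the interactive verification, the interchange itself looks like this in Python. It is a sketch with a hypothetical function name, using Python's (...) where CHP uses \(...\).

```python
import re

def swap_first_two_words(line: str) -> str:
    """Interchange the word pair at the beginning of the line."""
    # Group 1 is the first word, group 2 the second; the result reverses them.
    return re.sub(r"^(\w+) (\w+)", r"\2 \1", line)

print(swap_first_two_words("MAGNUSON ROBERT 496-6256"))
```

A line that does not begin with two words separated by a single space is passed through unchanged.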
Interchanging last/first names reminds me of the all-caps NIH
Telephone Directory. It can be spruced up via CHP so
that the names are initial-capped and are in forward order.
The NIH Telephone Directory is available in machine
readable format under WYLBUR on the NIH IBM 370. The
file can be downloaded to a PC. Several LANs have copies
of it on the server's hard disk. If you have access to a
copy, it is a good source file on which to practice with
patterns. When first downloaded from the 370 all letters
are in uppercase. Each line starts with an individual's
name, followed by spaces, followed by telephone number,
organization, building, and room number.
Each name is ordered as: last, first(s), initial(s), and JR, SR,
II or III where applicable. After each part is a single space.
Some last names are in two words, the first of which might be MC,
MAC, O, D, DI, DEL, VON, etc. The name words need to be
initialcapped (with the other letters lowercased). This applies
to JR and SR but not to II or III. The single letter prefixes
need to have the space that follows changed to an apostrophe.
The MC's and the MAC's need to be closed up with the second part
of the last name. The last name (with any prefix) needs to be
moved over beyond the first names and initials but before the JR,
SR, II or III.
Here is a "listing" of TELEPHON.BAT that uses CHP to copy a file
in the format of the NIH Telephone Directory into the format
discussed above.
---------- TELEPHON.BAT
[1]:change O BRIEN to O'BRIEN, etc
[2]chp "^\([a-z]\) \([a-z]\)" \1'\2 %1 >%2_1
[3]:move JR/SR (initialcapped), II/III (no case change) over out of way
[4]chp \b\(\([js]r\W\)\!\(iii?\W\)\)\(\W+\) \4\i\2\3 %2_1 >%2_2
[5]:initialcap all name words
[6]chp ^\(\w+\W\)+\(\W+\) \i\1\2 %2_2 >%2_3
[7]:Close up MCs and MACs
[8]chp ^\(mac\!mc\)\(\W\)\(\(\w+\W\)+\) \1\3\2 %2_3 >%2_4
[9]:move last name to end, recognizing prefixes
[10]echo ^\(\(\(d[iu]\!de[ls]?\!l[aeo]\!van\(\Wder?\)?
\!von\)\W\)?\)\([-'a-z]+\W\)\(\(\w+\W\)*\) >_tmpat_
[11]echo \6\1\5 >>_tmpat_
[12]chp/f _tmpat_ %2_4 >%2_5
[13]:move JR/SR/II/III back (from being moved over earlier)
[14]chp \(\W\)\(\W+\)\(\([js]r\!iii?\)\W\) \1\3\2 %2_5 >%2
[15]:TELEPHON.BAT converts the nih telephone directory
There are six transformations. The initial file, SAMPLE.UC in
the example carried out below, is changed first to SAMPLE_1, then
to SAMPLE_2, etc., finally emerging as SAMPLE. The steps carried
out are:
1) Put the apostrophes in names like O BRIEN.
2) Move II's, III's, JR's and SR's over out of
the way. In the process JR's and SR's are
initialcapped, II's and III's are not.
3) Initial cap the name words.
4) Close up the MC's and the MAC's.
5) Move last names to end, recognizing prefixes.
6) Move Jr/Sr/II/III's back.
I have prepared a file, SAMPLE.UC, in the format of the NIH
Telephone Directory. I used my name plus modifications in it
(including a fake doctorate) to show the various steps. Here is
the SAMPLE.UC:
---------- SAMPLE.UC
D MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT SR 496-6256 CR DMB 12A 4021
O MAGNUSON ROB JR DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021
I invoked TELEPHON.BAT with a
TELEPHON SAMPLE.UC SAMPLE
which asks that SAMPLE.UC be transformed via the five
intermediate files (which, in this case, are of the form
SAMPLE_n), to the final SAMPLE file. Appearing below in
succession are the intermediate files, SAMPLE_1 through SAMPLE_5,
and the final file, SAMPLE.
---------- SAMPLE_1: Apostrophes inserted:
D'MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT SR 496-6256 CR DMB 12A 4021
O'MAGNUSON ROB JR DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021
---------- SAMPLE_2: Move over II/III's and
(initialcapped) JR/SR's.
D'MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT Sr 496-6256 CR DMB 12A 4021
O'MAGNUSON ROB Jr DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021
---------- SAMPLE_3: Initialcap name words:
D'Magnuson Robert Andre II 496-6256 CR DMB 12A 4021
Le Magnuson Bob I 496-6256 CR DMB 12A 4021
Mac Magnuson Robert 496-6256 CR DMB 12A 4021
Magnuson Robert A 496-6256 CR DMB 12A 4021
Mc Magnuson Robert Sr 496-6256 CR DMB 12A 4021
O'Magnuson Rob Jr DR 496-6256 CR DMB 12A 4021
Van Der Magnuson R A 496-6256 CR DMB 12A 4021
Von Magnuson R Andre III 496-6256 CR DMB 12A 4021
---------- SAMPLE_4: Close up the MC's and the MAC's.
D'Magnuson Robert Andre II 496-6256 CR DMB 12A 4021
Le Magnuson Bob I 496-6256 CR DMB 12A 4021
MacMagnuson Robert 496-6256 CR DMB 12A 4021
Magnuson Robert A 496-6256 CR DMB 12A 4021
McMagnuson Robert Sr 496-6256 CR DMB 12A 4021
O'Magnuson Rob Jr DR 496-6256 CR DMB 12A 4021
Van Der Magnuson R A 496-6256 CR DMB 12A 4021
Von Magnuson R Andre III 496-6256 CR DMB 12A 4021
---------- SAMPLE_5: Move last name, recognizing prefixes.
Robert Andre D'Magnuson II 496-6256 CR DMB 12A 4021
Bob I Le Magnuson 496-6256 CR DMB 12A 4021
Robert MacMagnuson 496-6256 CR DMB 12A 4021
Robert A Magnuson 496-6256 CR DMB 12A 4021
Robert McMagnuson Sr 496-6256 CR DMB 12A 4021
Rob O'Magnuson Jr DR 496-6256 CR DMB 12A 4021
R A Van Der Magnuson 496-6256 CR DMB 12A 4021
R Andre Von Magnuson III 496-6256 CR DMB 12A 4021
---------- SAMPLE: Move Jr/Sr/II/III's back.
Robert Andre D'Magnuson II 496-6256 CR DMB 12A 4021
Bob I Le Magnuson 496-6256 CR DMB 12A 4021
Robert MacMagnuson 496-6256 CR DMB 12A 4021
Robert A Magnuson 496-6256 CR DMB 12A 4021
Robert McMagnuson Sr 496-6256 CR DMB 12A 4021
Rob O'Magnuson Jr DR 496-6256 CR DMB 12A 4021
R A Van Der Magnuson 496-6256 CR DMB 12A 4021
R Andre Von Magnuson III 496-6256 CR DMB 12A 4021
Notes on TELEPHON.BAT:
Line [2] looks for a letter at the beginning of the line, a
space, then another letter. The result is the first letter, an
apostrophe, then the second letter. CHP's input file is %1, the
output is %2_1. In the example, these files are SAMPLE.UC, and
SAMPLE_1.
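The step in line [2] can be tried on its own with the Python sketch below. The function name add_apostrophe is hypothetical and Python's re module stands in for CHP. Only a single-letter first word qualifies, because the pattern wants a space right after the first letter.

```python
import re

def add_apostrophe(line: str) -> str:
    """Change a leading single-letter prefix such as 'O ' into "O'"."""
    return re.sub(r"^([a-z]) ([a-z])", r"\1'\2", line, flags=re.IGNORECASE)

print(add_apostrophe("O MAGNUSON ROB JR"))
print(add_apostrophe("MC MAGNUSON ROBERT SR"))
```

The first line gains its apostrophe; the second is untouched because MC is two letters long.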
In line [4] CHP looks for a word boundary, JR or SR followed by a
nonword constituent, or II or III followed by a nonword
constituent, then one or more nonword constituents. The result
is the one or more nonword constituents, an initialcapped copy of
the JR or SR and its nonword constituent, then the II or III and
its nonword constituent. Although the result has both the JR/SR
and the II/III, one of them is always null. CHP reads SAMPLE_1,
and DOS writes to SAMPLE_2 (substituting SAMPLE for %2 in the
batch file).
In line [6] CHP looks for a sequence of one or more words
anchored to the beginning of the line, each word of which is
followed by exactly one nonword constituent. Following the words
the pattern wants a string of at least one nonword constituent.
The result is an initialcapped copy of the words followed by the
nonword constituent(s). Here we went from SAMPLE_2 to SAMPLE_3.
In line [8] the pattern looks for MAC or MC anchored to the
beginning of the line, exactly one nonword constituent, then a
sequence of words each of which ends with exactly one nonword
constituent. The result is the MAC/MC, the sequence of words,
then the space (i.e., the nonword constituent). Thus, the MAC/MC
is closed up with the rest of the last name, and the removed
space is placed after the rest of the name (to preserve the
original length of the whole name). The input file is SAMPLE_3,
the output, SAMPLE_4.
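The closing-up in line [8] can be sketched in Python as follows. The function name close_up is hypothetical, and the sample line is made up in the style of SAMPLE_3 with three spaces after the name. The point to watch is that the removed space reappears after the name, so the line keeps its length.

```python
import re

def close_up(line: str) -> str:
    """Join a leading Mac/Mc to the rest of the last name."""
    # \1 = Mac or Mc, \2 = the byte after it, \3 = the words that follow,
    # each with its trailing nonword byte.
    return re.sub(r"^(mac|mc)(\W)((\w+\W)+)", r"\1\3\2", line,
                  flags=re.IGNORECASE)

before = "Mac Magnuson Robert   496-6256"
after = close_up(before)
print(after)
print(len(before) == len(after))
```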
In line [12] CHP is invoked with the /f option whereby the
pattern and the result are read from a file--named _TMPAT_ in
this case. The ECHO in line [10] writes the pattern to _TMPAT_,
and the ECHO in line [11] appends the result parameter to
_TMPAT_. The task here is to put the first name(s) and initials
ahead of the last name. Usually the last name is the first word
on the line, but the last-name prefixes have to be taken into
account. An inspection of the current NIH telephone directory
revealed the following prefixes: Di, Du, De, Del, Des, La, Le,
Lo, Van, Van De, Van Der, and Von. (I hope I didn't miss any.)
The pattern specifies that zero or one of these (with its
trailing blank) occurs at the beginning of the line, a word of
one or more letters, apostrophes or hyphens (i.e., the last
name), then a sequence of words (the first names/initials) each
with its trailing blank. The result is to be the first
names/initials, the possibly null prefix, then the last name.
SAMPLE_4 is copied with the changes to SAMPLE_5.
Finally, in line [14], the JR/SR/II/III's are moved back creating
the final file, SAMPLE.