Text File | 1988-04-03 | 28KB | 630 lines

CHP.EXE Changes Those Substrings from a Set of Files That Are Matched by a Given Regular Expression Pattern

by Robert A. Magnuson
DMB, DCRT, NIH
Bethesda, MD 20892

Mar 1988
Revised Apr 1988

CHP (for Change via Pattern) is a tool used to search DOS files for those lines that contain a match for a specified pattern. Matches are changed according to the replacement specification. All input lines--changed or not--are written to stdout.

[This document is intended to be read from the screen. It contains some characters which will probably not print correctly on a printer.]

A DOS command line that invokes CHP contains a number of arguments. The syntax is summarized by the following:

     ┌────────────────┐            ┌──────┐
     │    options:    │  ┌─ <pat> ─┴ <r> ─┤  ┌─────────────┐
CHP ─┴─ / ─┬ <ltr> ┬──┴──┤                ├──┴─┬ <file> ┬──┴─
           └───────┘     └─── <patfile> ──┘    └────────┘

Option Definitions:
───────────────────                   │<r> replaces the match,
/c <r> is capitalized like the match  │    is required unless /d.
/d delete match (there is no <r>)     │
/e no more option args even if        │In <r>:
   next arg begins with /             │\0 is the entire match,
/f pattern, replacement in <patfile>  │\δ is the δth group where δ is 1-9,
/v verify change                      │\u upcases next \δ where δ is 0-9,
/x compare case exact                 │\l lowercases next \δ,
──────────────────────────────────────┤\i initialcaps next \δ,
Pattern, replacement can contain hexa-│\α is α, where α is any other byte,
decimal byte representations "\xhh".  │anything else is itself.

First there may be optional arguments specifying various CHP options. Then there is the required pattern which is used to match the lines taken from the files. Next, there is the usually required replacement specification. Then come the filenames which can be wildcarded. If no filenames appear, CHP gets its input from stdin.

The arguments can optionally be enclosed in double quotes. The enclosing quotes are stripped off and not seen by CHP.
Should you need to have a double quote within an argument, the argument must be double quoted and the internal double quote must be escaped by preceding it with a backslash. This treatment of the double quotes is done by the argc/argv mechanism of the C compiler. This mechanism does not allow for null arguments. What looks like a quoted nullstring--two consecutive doublequotes--is treated like whitespace between arguments. [CHP is implemented in Borland TurboC.]

CHP exits with ERRORLEVEL set to one if some matched substrings were changed, to zero if no changes occurred, and to two for syntax problems.

OPTION ARGUMENTS:

CHP options are specified by the presence or absence of various option letters in option arguments. Any option argument must begin with a slash and must appear in front of the other kinds of arguments. Any number of legal option letters can appear in an option argument. Thus, you can have multiple option arguments, perhaps each with a single option letter (and each beginning with a slash), or just one option argument containing all of the option letters desired. CURRENTLY, ALL LEGAL CHP OPTION LETTERS ARE lower case. For the sake of readability, the above syntax diagram shows only the case where all option letters appear in one option argument.

A CHP syntax error occurs when illegal option letters appear, and when required arguments are missing. The <pat> argument is required. The <r> argument is required unless the /d option is present.

When a syntax error occurs CHP prints a boxed syntax diagram containing terse instructions on how to use CHP. This mechanism can be deliberately tripped in order to get on-screen help. The suggested way is to invoke CHP with no arguments--thus causing the no-<pat> syntax error.

The /d option asks that the matches be deleted, which means replacing the matches with nullstrings. If the compiler used in writing CHP permitted null command-line arguments, deletion would be done by making the replacement null.
When the /d option is taken, the first argument beyond the pattern (i.e., the one that would be <r>)--if any--is taken to be the first filename.

The /v option permits interactive verification as to whether each or any of the found matches should be changed. In that case, CHP writes the line number and the line's contents on the screen with the current match highlighted. Then, prompted by CHP whether to make this change, the user has six options: change it, don't change it, change it and all following matches, change neither this one nor any following matches, quit the program immediately without finishing, or get help on what choices there are.

In matching the pattern, case (upper or lower) is ignored unless the /x option is taken.

The /e option permits the pattern to start with a slash--otherwise the pattern would look like another option argument.

The /c option capitalizes <r> like the match. More fully, /c makes <r>'s first constant letter the same case as the match's first letter. A constant byte in <r> is one appearing directly, i.e., not part of the yield of \0 or \1, etc. If the match has no letters, no case change is made in <r>.

Usually the pattern and replacement appear directly in the DOS command line. However, the DOS command line maximum length is severely restricted. Should there not be enough room for the pattern and replacement, they can be placed in a file (as the first and second lines). The /f option is then used to signify that the <pat> parameter is not the pattern, but rather the name of a file containing both the pattern and replacement.

The doublequoting convention that is carried out by the argc-argv treatment of the command line arguments does not apply to the pattern or replacement contained in a file (when the /f option is in effect). Doublequotes, spaces, etc., are taken literally.

REGULAR-EXPRESSION PATTERN-MATCHING:

CHP selects the substrings in the lines by means of pattern matching.
The <pat> is a pattern (or names the file that contains the pattern and replacement) that is matched against each file line. Case (upper or lower) is ignored unless the /x option has been taken.

A pattern is a string of characters. We distinguish two kinds of characters: normal and meta. There are exactly nine metacharacters:

    . * + ? ^ $ [ ] \

The remaining characters are normal. Metacharacters sometimes combine with normal characters as we will see. Otherwise, a normal character simply matches itself. The metacharacters behave as follows:

    A        matches
    ─────    ────────────────────────────
    .        any single byte
    *        0 or more of the preceding
    +        1 or more of the preceding
    ?        0 or 1 of the preceding
    [...]    any 1 of the enclosed bytes
    [^...]   any byte not enclosed
    ^        the beginning of the <cmp>
    $        the end of the <cmp>
    \α       α, where α is a metabyte
    α\!ß     α or ß
    \(α\)    α (grouped for precedence)
    \δ       the δth group where δ is 1-9
    \b       beginning/end of a word
    \<       beginning of a word
    \>       end of a word
    \w       a word byte: [a-zA-Z0-9]
    \W       a nonword byte: [^a-zA-Z0-9]

Note that the '\' metacharacter is used as an escape, i.e., to quote a metacharacter. Thus to match, e.g., a period (as a normal character) you must use '\.' If you leave out the backslash, the period alone will have its metacharacter meaning.

In the above explanation of the '*', '+' and '?' metacharacters, 'preceding' means 'the shortest possible preceding'. Thus, 'ab+' matches 'ab', 'abb', etc., but not 'abab'.

The square bracket metacharacters specify any one of the enclosed characters--known as a character class. The minus sign has a special meaning as a range in a character class. '[a-g]' can be used in place of '[abcdefg]'. When appearing first in a character class, a circumflex indicates that the match is with any character not in the character class. Thus, '[^0-9]' matches any character that is not a decimal digit. Most metacharacters lose their special status in a character class, and should not be escaped.
If a right square bracket is to be in a character class, it must follow immediately the beginning left square bracket. If a minus sign is to be in a character class, it must appear as '---', i.e., a range containing only itself. Since the square brackets do not nest, a left square bracket can easily be included in a character class. E.g., '[][]' matches a right or a left square bracket.

Some pattern match examples follow:

    A                    matches
    ──────────────────   ───────────────────────
    zyx                  zyx
    f.x                  fax, fix and fxx
    f\.x                 f.x
    f[aix]x              only fax, fix and fxx
    \[[a-z]+\]           [hello] and [world]
    \(suf\!pre\)fix      suffix or prefix
    ba\(na\)*            banananana
    [A-P]:               A:, C:, H:, etc
    \([cd]:\)?\w         abc, c:zyx, d:cat, etc
    \(abra\)\(cad\1\)*   abracadabracadabra

Due to a bug in PC DOS I have changed the alternative (i.e., the "or") from '\|' to '\!'. The vertical, '|', is DOS's piping symbol. Although doublequoting is supposed to protect any redirection symbols in the interior from being acted upon, under certain circumstances DOS performs the redirection even though it is doublequoted.

Please note that CHP's pattern matching is done via REGEX.C from Free Software Foundation, Inc.

THE REPLACEMENT STRING:

The replacement string, <r> in the syntax diagram, uses only one metacharacter--namely the backslash. The entire match is represented by \0. The first group (a submatch) is represented by \1, the second by \2, through \9 for the ninth. To represent a backslash as itself, you escape it by preceding it with a backslash. A \l yields nothing but has the side effect of lowercasing the yield of the next backslash-digit pair. Similarly, a \u uppercases, and a \i initialcaps. What about a backslash not followed by a digit, 'l', 'u', 'i', or a backslash? It disappears. Other characters in the replacement string represent themselves.

The \i upcases the first byte of each substring of letters, and lowercases the remaining letters in the substring. A nonletter causes the next substring of letters to be initialcapped.
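
The replacement rules above can be sketched in Python. This is only an illustration of the rules as described, not CHP's actual TurboC code: the names initialcaps and expand are hypothetical, Python's re module stands in for REGEX.C, the \xhh hexadecimal form is not handled, and it assumes that a backslash before any other byte is dropped while the byte itself is kept.

```python
import re

def initialcaps(text):
    """Upcase the first letter of each run of letters, lowercase the rest."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: m.group(0)[0].upper() + m.group(0)[1:].lower(),
                  text)

# Case operators that wait for the next backslash-digit pair.
CASE_OPS = {"u": str.upper, "l": str.lower, "i": initialcaps}

def expand(repl, match):
    """Expand a CHP-style replacement string against a Python match object."""
    out = []
    pending = None              # case operator waiting for the next \digit
    i = 0
    while i < len(repl):
        ch = repl[i]
        if ch != "\\":
            out.append(ch)      # ordinary bytes represent themselves
            i += 1
            continue
        if i + 1 == len(repl):
            break               # assumption: a trailing lone backslash disappears
        nxt = repl[i + 1]
        i += 2
        if nxt in "0123456789":
            n = int(nxt)
            # \0 is the whole match; a missing or unmatched group yields nothing
            piece = (match.group(n) if n <= match.re.groups else "") or ""
            if pending is not None:
                piece = pending(piece)
                pending = None
            out.append(piece)
        elif nxt in CASE_OPS:
            pending = CASE_OPS[nxt]
        else:
            out.append(nxt)     # \\ gives a backslash; otherwise the backslash is dropped
    return "".join(out)

m = re.search(r"(\w+) (\w+)", "MAGNUSON ROBERT")
print(expand(r"\i\2 \u\1", m))   # Robert MAGNUSON
```

Keeping the case operator in a "pending" slot mirrors the description that \l, \u and \i yield nothing themselves and only affect the next backslash-digit pair.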

HEXADECIMAL REPRESENTATION IN PATTERN, REPLACEMENT

Sometimes it is convenient or necessary to represent bytes in a coded fashion. You may need a smiley face, for example. You can keyboard this character directly in two ways: (1) by entering a control-A, or (2) by holding down the ALT key, typing a "1" on the numeric keyboard, then releasing the ALT key. But when you need to document this character on your printer, the smiley face does not print at all! Worse yet, if you need to enter a tab on the DOS command line, DOS may translate it into spaces (up to the next tab stop).

CHP allows characters to be entered in a hexadecimal format, either as \xhh or as \xh where the h's are hexadecimal digits. Both the "x" and the A through F can be upper/lower case (or mixed).

EXAMPLES:

To copy the file ALPHA.TXT to the new file JUNK while changing all occurrences of 'cat' to 'tiger', do a

    chp cat tiger alpha.txt>junk

The receiving file is well named because each occurrence of 'cat', irrespective of case, and regardless of surrounding material, will be changed. The verify option will help a little, i.e.,

    chp/v cat tiger alpha.txt>junk

but I suspect that the Quit option will be taken after you start to find so many occurrences of 'cat' in other contexts.

To copy the file ALPHA.TXT to the new file JUNK while changing all occurrences of the word 'cat' (and not 'cats', 'cathode', or 'indicate') to 'tiger', do a

    chp \bcat\b tiger alpha.txt>junk

Note the '\b''s (i.e., wordbreaks) surrounding 'cat'.

But wait. What about the word 'cats'? Shouldn't it be changed to 'tigers'? Try

    chp \bcat\(s?\)\b tiger\1 alpha.txt>junk

Here the pattern may be paraphrased as a word that begins with 'cat' and ends with zero or one 's'. Note that the zero or one 's' has been grouped in the funny backslashed parentheses. This is to remember it in \1 which is used in the replacement string. Thus the word 'cats' does become 'tigers'.
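
For readers more used to a modern regex library, the last command can be approximated in Python. This is a rough equivalent rather than a description of CHP itself: the function name cat_to_tiger is hypothetical, Python writes groups with plain parentheses instead of \( and \), and Python's \b also counts the underscore as a word character, which CHP's \w does not. IGNORECASE stands in for CHP's default of ignoring case when /x is absent.

```python
import re

def cat_to_tiger(line):
    """Approximate: chp \\bcat\\(s?\\)\\b tiger\\1"""
    # Group 1 remembers the optional 's' so that it can be put back after 'tiger'.
    return re.sub(r"\bcat(s?)\b", r"tiger\1", line, flags=re.IGNORECASE)

print(cat_to_tiger("The cat saw two cats near a cathode."))
# The tiger saw two tigers near a cathode.
```

Without the /c option the replacement is not recapitalized, so 'Cat' becomes 'tiger' here.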

Some occurrences of the word 'cat' (or 'cats') may begin a sentence, and others may be within a sentence. I.e., at the beginning of a sentence, we have 'Cat', whereas 'cat' occurs within the sentence. Correspondingly, we would like 'Cat' to become 'Tiger', and 'cat' 'tiger'. That is what the /c CHP option is for. E.g., if you invoke CHP with a

    chp/c \bcat\b tiger alpha.txt

which has the /c option,

    Cat, cat, burning bright,

is changed to

    Tiger, tiger, burning bright,

To change doublequotes into singlequotes with verification, do a

    chp/v "\"" ' alpha.txt >junk

The pattern seen by CHP in this case is simply one doublequote. The C compiler's argc-argv handler sees a doublequoted escaped doublequote. The outer doublequotes are stripped, and the inner escaped doublequote becomes a doublequote. The resultant doublequote is then given to CHP as the second command-line argument. [The first argument is '/v'.]

Just for fun suppose you want to look at a file with all its vowels removed. Try

    chp/d [aeiou] chp.doc

There is no replacement string here. The /d option indicates deletion of the matched substrings. The pattern consists of a character class containing the five vowels. Because there is no redirection of stdout, the output will come pouring out on the screen just below the above CHP command line. You might want to have BREAK ON (a DOS command) before you do this so that you can terminate CHP with a ^C.

Along a similar vein, how about removing all words containing 'th'? To make it even more mystifying, we should also remove the character following each of these words (usually a space) so that there will be no obvious gaps. This can be done to this document with a

    chp/dx \w*th\w*\W chp.doc

That pattern can be paraphrased as zero or more word constituents followed by 'th' (exact case because of the /x option) followed by zero or more word constituents followed by one nonword constituent.
Then, if you'd like to see the words that were removed, you can use the companion FP.EXE (Find Pattern) with a

    fp/hx \w*th\w*\W chp.doc

How about removing the definite and indefinite articles, 'the', 'a' and 'an', from a file? Try, e.g.,

    chp/d \b\(the\!an?\)\b\W chp.doc>junk

Perhaps you have a text file, JUNK1, wherein some lines have leading spaces, and some words are separated by multiple spaces. The leading spaces are to be deleted, and each internal multiple-space sequence is to be changed into a single space. These two changes cannot be done with a single invocation of CHP. However, the task can be done with a single DOS command line in which a CHP invocation is piped to another whose output is redirected to the resultant file, JUNK2. The following DOS command line

    chp/d "^ +" junk1 | chp "  +" " " > junk2

does the job. The first CHP specifies deletion of one or more leading spaces, the input file is JUNK1, and the output is piped. The second CHP's <pat> specifies two or more spaces, the <r> is a single space, and the output is redirected to JUNK2. Because it has no input file, the second CHP gets its input from (the piped) stdin.

Sometimes you need to exchange the order of pairs of words so that the second precedes the first. This might occur in a data file in which names are in lastname/firstname order and you want them reversed. Assuming the words are separated by a single space, the following pattern matches a pair of words putting the first in \1 and the second in \2.

    \(\w+\) \(\w+\)

Now, to copy the file NAMES.LST to the new file JUNK interchanging the word pair at the beginning of each line, do a

    chp/v "^\(\w+\) \(\w+\)" "\2 \1" names.lst>junk

The /v option is useful since it allows you to look at each match and at what is done to it. If you don't like the first few, you can quit the operation. If you like it, you can tell CHP to do the rest of the file without further verification (by responding with a ^R).
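
The space-squeezing pipeline and the word-pair swap described above can be approximated in Python as follows. This is a sketch, not CHP's own code: tidy_spaces and swap_leading_pair are hypothetical names, the two piped CHP invocations become two successive substitutions on the same line, and [A-Za-z0-9] is spelled out because that is how this document defines \w.

```python
import re

def tidy_spaces(line):
    """Approximate: chp/d "^ +" junk1 | chp "  +" " " """
    line = re.sub(r"^ +", "", line)     # first CHP: delete leading spaces
    return re.sub(r"  +", " ", line)    # second CHP: two or more spaces become one

def swap_leading_pair(line):
    """Approximate: chp "^\\(\\w+\\) \\(\\w+\\)" "\\2 \\1" """
    # Group 1 is the first word, group 2 the second; emit them in reverse order.
    return re.sub(r"^([A-Za-z0-9]+) ([A-Za-z0-9]+)", r"\2 \1", line)

print(tidy_spaces("   one   two  three"))             # one two three
print(swap_leading_pair("Magnuson Robert 496-6256"))  # Robert Magnuson 496-6256
```

The two substitutions are kept separate for the same reason the document gives: a single pattern/replacement pair cannot both delete the leading spaces and squeeze the interior ones.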

Interchanging last/first names reminds me of the all-caps NIH Telephone Directory. It can be spruced up via CHP so that the names are initial-capped and are in forward order.

The NIH Telephone Directory is available in machine readable format under WYLBUR on the NIH IBM 370. The file can be downloaded to a PC. Several LANs have copies of it on the server's hard disk. If you have access to a copy, it is a good source file on which to practice with patterns.

When first downloaded from the 370 all letters are in uppercase. Each line starts with an individual's name, followed by spaces, followed by telephone number, organization, building, and room number. Each name is ordered as: last, first(s), initial(s), and JR, SR, II or III where applicable. After each part is a single space. Some last names are in two words, the first of which might be MC, MAC, O, D, DI, DEL, VON, etc.

The name words need to be initialcapped (with the other letters lowercased). This applies to JR and SR but not to II or III. The single letter prefixes need to have the space that follows changed to an apostrophe. The MC's and the MAC's need to be closed up with the second part of the last name. The last name (with any prefix) needs to be moved over beyond the first names and initials but before the JR, SR, II or III.

Here is a "listing" of TELEPHON.BAT that uses CHP to copy a file in the format of the NIH Telephone Directory into the format discussed above.

---------- TELEPHON.BAT
[1]:change O BRIEN to O'BRIEN, etc
[2]chp "^\([a-z]\) \([a-z]\)" \1'\2 %1 >%2_1
[3]:move JR/SR (initialcapped), II/III (no case change) over out of way
[4]chp \b\(\([js]r\W\)\!\(iii?\W\)\)\(\W+\) \4\i\2\3 %2_1 >%2_2
[5]:initialcap all name words
[6]chp ^\(\w+\W\)+\(\W+\) \i\1\2 %2_2 >%2_3
[7]:Close up MCs and MACs
[8]chp ^\(mac\!mc\)\(\W\)\(\(\w+\W\)+\) \1\3\2 %2_3 >%2_4
[9]:move last name to end, recognizing prefixes
[10]echo ^\(\(\(d[iu]\!de[ls]?\!l[aeo]\!van\(\Wder?\)?\!von\)\W\)?\)\([-'a-z]+\W\)\(\(\w+\W\)*\) >_tmpat_
[11]echo \6\1\5 >>_tmpat_
[12]chp/f _tmpat_ %2_4 >%2_5
[13]:move JR/SR/II/III back (from being moved over earlier)
[14]chp \(\W\)\(\W+\)\(\([js]r\!iii?\)\W\) \1\3\2 %2_5 >%2
[15]:TELEPHON.BAT converts the nih telephone directory

There are six transformations. The initial file, SAMPLE.UC in the example carried out below, is changed first to SAMPLE_1, then to SAMPLE_2, etc., finally emerging as SAMPLE. The steps carried out are:

1) Put the apostrophes in names like O BRIEN.
2) Move II's, III's, JR's and SR's over out of the way. In the process JR's and SR's are initialcapped, II's and III's are not.
3) Initial cap the name words.
4) Close up the MC's and the MAC's.
5) Move last names to end, recognizing prefixes.
6) Move Jr/Sr/II/III's back.

I have prepared a file, SAMPLE.UC, in the format of the NIH Telephone Directory. I used my name plus modifications in it (including a fake doctorate) to show the various steps. Here is the SAMPLE.UC:

---------- SAMPLE.UC
D MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT SR 496-6256 CR DMB 12A 4021
O MAGNUSON ROB JR DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021

I invoked TELEPHON.BAT with a

    TELEPHON SAMPLE.UC SAMPLE

which asks that SAMPLE.UC be transformed via the five intermediate files (which, in this case, are of the form SAMPLE_n), to the final SAMPLE file. Appearing below in succession are the intermediate files, SAMPLE_1 through SAMPLE_5, and the final file, SAMPLE.

---------- SAMPLE_1: Apostrophes inserted:
D'MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT SR 496-6256 CR DMB 12A 4021
O'MAGNUSON ROB JR DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021

---------- SAMPLE_2: Move over II/III's and (initialcapped) JR/SR's.
D'MAGNUSON ROBERT ANDRE II 496-6256 CR DMB 12A 4021
LE MAGNUSON BOB I 496-6256 CR DMB 12A 4021
MAC MAGNUSON ROBERT 496-6256 CR DMB 12A 4021
MAGNUSON ROBERT A 496-6256 CR DMB 12A 4021
MC MAGNUSON ROBERT Sr 496-6256 CR DMB 12A 4021
O'MAGNUSON ROB Jr DR 496-6256 CR DMB 12A 4021
VAN DER MAGNUSON R A 496-6256 CR DMB 12A 4021
VON MAGNUSON R ANDRE III 496-6256 CR DMB 12A 4021

---------- SAMPLE_3: Initialcap name words:
D'Magnuson Robert Andre II 496-6256 CR DMB 12A 4021
Le Magnuson Bob I 496-6256 CR DMB 12A 4021
Mac Magnuson Robert 496-6256 CR DMB 12A 4021
Magnuson Robert A 496-6256 CR DMB 12A 4021
Mc Magnuson Robert Sr 496-6256 CR DMB 12A 4021
O'Magnuson Rob Jr DR 496-6256 CR DMB 12A 4021
Van Der Magnuson R A 496-6256 CR DMB 12A 4021
Von Magnuson R Andre III 496-6256 CR DMB 12A 4021

---------- SAMPLE_4: Close up the MC's and the MAC's.
D'Magnuson Robert Andre II 496-6256 CR DMB 12A 4021
Le Magnuson Bob I 496-6256 CR DMB 12A 4021
MacMagnuson Robert 496-6256 CR DMB 12A 4021
Magnuson Robert A 496-6256 CR DMB 12A 4021
McMagnuson Robert Sr 496-6256 CR DMB 12A 4021
O'Magnuson Rob Jr DR 496-6256 CR DMB 12A 4021
Van Der Magnuson R A 496-6256 CR DMB 12A 4021
Von Magnuson R Andre III 496-6256 CR DMB 12A 4021

---------- SAMPLE_5: Move last name, recognizing prefixes.
Robert Andre D'Magnuson II 496-6256 CR DMB 12A 4021
Bob I Le Magnuson 496-6256 CR DMB 12A 4021
Robert MacMagnuson 496-6256 CR DMB 12A 4021
Robert A Magnuson 496-6256 CR DMB 12A 4021
Robert McMagnuson Sr 496-6256 CR DMB 12A 4021
Rob O'Magnuson Jr DR 496-6256 CR DMB 12A 4021
R A Van Der Magnuson 496-6256 CR DMB 12A 4021
R Andre Von Magnuson III 496-6256 CR DMB 12A 4021

---------- SAMPLE: Move Jr/Sr/II/III's back.
Robert Andre D'Magnuson II 496-6256 CR DMB 12A 4021
Bob I Le Magnuson 496-6256 CR DMB 12A 4021
Robert MacMagnuson 496-6256 CR DMB 12A 4021
Robert A Magnuson 496-6256 CR DMB 12A 4021
Robert McMagnuson Sr 496-6256 CR DMB 12A 4021
Rob O'Magnuson Jr DR 496-6256 CR DMB 12A 4021
R A Van Der Magnuson 496-6256 CR DMB 12A 4021
R Andre Von Magnuson III 496-6256 CR DMB 12A 4021

Notes on TELEPHON.BAT:

Line [2] looks for a letter at the beginning of the line, a space, then another letter. The result is the first letter, an apostrophe, then the second letter. CHP's input file is %1, the output is %2_1. In the example, these files are SAMPLE.UC, and SAMPLE_1.

In line [4] CHP looks for a word boundary, JR or SR followed by a nonword constituent, or II or III followed by a nonword constituent, then one or more nonword constituents. The result is the one or more nonword constituents, an initialcapped copy of the JR or SR and its nonword constituent, then the II or III and its nonword constituent. Although the result has both the JR/SR and the II/III, one of them is always null. CHP reads SAMPLE_1, and DOS writes to SAMPLE_2 (substituting SAMPLE for %2 in the batch file).

In line [6] CHP looks for a sequence of one or more words anchored to the beginning of the line, each word of which is followed by exactly one nonword constituent. Following the words the pattern wants a string of at least one nonword constituent. The result is an initialcapped copy of the words followed by the nonword constituent(s). Here we went from SAMPLE_2 to SAMPLE_3.

In line [8] the pattern looks for MAC or MC anchored to the beginning of the line, exactly one nonword constituent, then a sequence of words each of which ends with exactly one nonword constituent. The result is the MAC/MC, the sequence of words, then the space (i.e., the nonword constituent). Thus, the MAC/MC is closed up with the rest of the last name, and the removed space is placed after the rest of the name (to preserve the original length of the whole name). The input file is SAMPLE_3, the output, SAMPLE_4.

In line [12] CHP is invoked with the /f option whereby the pattern and the result are read from a file--named _TMPAT_ in this case. The ECHO in line [10] writes the pattern to _TMPAT_, and the ECHO in line [11] appends the result parameter to _TMPAT_. The task here is to put the first name(s) and initials ahead of the last name. Usually the last name is the first word on the line, but the last-name prefixes have to be taken into account. An inspection of the current NIH telephone directory revealed the following prefixes: Di, Du, De, Del, Des, La, Le, Lo, Van, Van De, Van Der, and Von. (I hope I didn't miss any.) The pattern specifies that zero or one of these (with its trailing blank) occurs at the beginning of the line, a word of one or more letters, apostrophes or hyphens (i.e., the last name), then a sequence of words (the first names/initials) each with its trailing blank. The result is to be the first names/initials, the possibly null prefix, then the last name. SAMPLE_4 is copied with the changes to SAMPLE_5.

Finally, in line [14], the JR/SR/II/III's are moved back creating the final file, SAMPLE.
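
Two of the simpler steps, lines [2] and [8] of TELEPHON.BAT, can be approximated in Python to show how the groups are reordered. This is a sketch under assumptions rather than a port of the batch file: add_apostrophe and close_up_mac are hypothetical names, Python's re module replaces REGEX.C (so \w also accepts the underscore), and the test line assumes the name is separated from the telephone number by more than one space.

```python
import re

def add_apostrophe(line):
    """Approximate line [2]: O BRIEN becomes O'BRIEN."""
    # A single leading letter, one space, then another letter.
    return re.sub(r"^([a-z]) ([a-z])", r"\1'\2", line, flags=re.IGNORECASE)

def close_up_mac(line):
    """Approximate line [8]: close up MC/MAC with the rest of the last name."""
    # Group 1: MAC or MC; group 2: the separator; group 3: the following words.
    # Emitting \1\3\2 moves the separator behind the words, so the line
    # keeps its original length.
    return re.sub(r"^(mac|mc)(\W)((?:\w+\W)+)", r"\1\3\2", line,
                  flags=re.IGNORECASE)

print(add_apostrophe("O MAGNUSON ROB JR"))           # O'MAGNUSON ROB JR
print(close_up_mac("Mc Magnuson Robert  496-6256"))  # McMagnuson Robert   496-6256
```

A name such as MAGNUSON is left alone by both functions: it does not start with a single letter followed by a space, and its leading 'Ma' is not followed by 'c' plus a nonword constituent.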