- ═══════════════════════════════════════
-
- 5. PATTERNS IN BYTE SEQUENCES
-
- ═══════════════════════════════════════
-
-
- Topic 4 showed how byte distributions help us to
- analyze the content of a file. The length of Topic 4 may obscure
- one simple fact: a byte survey and its analysis take very little
- time. The survey itself might require one to four seconds;
- reviewing it might involve another thirty seconds.
-
- In Topic 5 we consider sequences of bytes. We want to
- identify patterns related to our objectives of:
-
- » extracting searchable content;
- » recognizing record separations;
- » recognizing field separations; and
- » recognizing formatting aids.
-
-
- ═══════════════════════════════════════
- 5.1 Heads and tails...
- first impressions of a file
- ═══════════════════════════════════════
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage head file_name [ line_count ] [/a][/t] > text
-
- Displays in printable format the first line_count lines
- within a file; the default is 10 lines. This clone of
- the Unix HEAD and TAIL utilities provides a quick check on
- the likely contents of a file. If the "/a" option is used,
- accented characters are treated as printable text. If
- "/t" is specified, the display is of the TAIL of the
- file, the LAST line_count lines.
-
- input: Normally an ASCII text file.
-
- output: The specified number of lines is either displayed on the
- screen or sent to a file. Each non-printable character is
- replaced by an ^ symbol. If any line length exceeds 120
- characters, a warning is issued. If any line length
- exceeds 1024 or the file includes null bytes, the program
- advises that the target file is not ASCII text.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- The HEAD program can be used to get a first impression
- of the beginning and of the end of most files, although it is best
- for ASCII text. Try for example:
-
- HEAD SVP_TXT
-
- The result appears directly on the screen; alternatively, it may be
- redirected into a new file. With the command "HEAD SVP_TXT", ten
- lines are shown. Next try
-
- HEAD SVP_TXT 4
- and then
- HEAD SVP_TXT 20
-
- You are shown 4 lines and the next time 20 lines of text. For line
- counts greater than 23, be ready to use CTL-S to stop and restart
- movement across the screen.
-
- Adding the argument "/t" switches heads to tails.
-
- HEAD SVP_TXT /T
-
- causes the last 10 lines of SVP_TXT to be shown on the screen.
-
- HEAD SVP_TXT 25 /T > TEMP
-
- or
- HEAD SVP_TXT /T 25 > TEMP
-
- places the last 25 lines in a file called TEMP. Note the file name
- must come first; the order of arguments after that does not matter.
-
- Incidentally, there is no restriction on the number of
- lines. I tried HEAD SVP_TXT /T 4000 and found it worked!
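-
- The behaviour described above can be sketched in a few lines of
- Python. This is an illustration only, not the HEAD.C source: the
- function name head, its parameters, and the choice of code page
- 437 bytes hex 80 to A5 as "accented" are all assumptions. Tabs
- are treated as non-printable here, which may differ from HEAD.

```python
def head(data: bytes, line_count: int = 10, tail: bool = False,
         accents: bool = False) -> list[str]:
    """Return the first (or last) line_count lines in printable form."""
    if line_count <= 0:
        return []
    # Drop a trailing CTL-Z, then split on CR, LF or CR-LF.
    lines = data.rstrip(b"\x1a").splitlines()
    chosen = lines[-line_count:] if tail else lines[:line_count]
    shown = []
    for line in chosen:
        chars = []
        for byte in line:
            if 0x20 <= byte <= 0x7E:
                chars.append(chr(byte))          # plain printable ASCII
            elif accents and 0x80 <= byte <= 0xA5:
                # Assumed accent range, decoded as PC code page 437.
                chars.append(bytes([byte]).decode("cp437"))
            else:
                chars.append("^")                # everything else
        shown.append("".join(chars))
    return shown


sample = b"H\x93tel de Ville\r\nsecond line\r\nthird line\r\n\x1a"
print(head(sample, 1))                 # accents off
print(head(sample, 1, accents=True))   # accents on
print(head(sample, 2, tail=True))      # last two lines
```

- The sketch reads the whole file into memory, which is fine for
- a quick look but is not how a DOS program of the period would
- have handled a large file.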
-
-
- ≡≡≡≡->> QUESTION:
- Input the DOS command "COPY HEAD.C HEAD2.C". Then
- revise HEAD2.C so that no file is named, and standard
- input is the source of data. Compile the result and
- experiment with it. The arguments are simpler, and
- their order doesn't matter. What are the dangers of
- using HEAD2.C in a DOS environment?
- <<-≡≡≡≡
-
- ≡≡≡≡->> QUESTION:
- Make another copy of HEAD.C and call it TAIL.C. Edit
- it so that the resulting program needs no "/t" argument
- and always shows the end of a file. Experiment.
- <<-≡≡≡≡
-
-
- Occasionally you might have text containing legitimate
- accented characters. To demonstrate the "/a" (accents) argument:
-
- HEAD SVP_TXT 150 /A > TEMP
- then
- HEAD TEMP /T
- then
- HEAD TEMP /T /A
-
- What's really happening here? You are taking the top 150 lines of
- a file, storing it in a separate temporary file, then displaying
- the last 10 lines of the temporary file (that is, lines 141 to 150
- of the original file) on the screen. This is a way of showing an
- intermediate part of a file... not as fast as CPB (copy bytes), but
- convenient. When you try the last two commands above, do you
- notice the difference between the two displays? HEAD TEMP /T
- includes a word that looks like H^tel; when accents are requested
- in HEAD TEMP /T /A, the same word comes out as Hôtel.
-
-
- ≡≡≡≡->> QUESTION:
- The experiment fails if you build the temporary file
- without the /A argument (HEAD SVP_TXT 150 > TEMP). Why
- does it fail?
- <<-≡≡≡≡
-
- On a non-text file, HEAD may either show a lot of caret
- ("^") characters, or conclude that a HEAD display is meaningless.
- That information is worth the few seconds used to input the command
- and see the result.
-
-
- ═════════════════════════
- 5.2 Non-DOS files
- ═════════════════════════
-
- Suppose you display the head of a file and find it
- looks like this:
-
-
- Fourscore and seven years ago,
- our forefathers brought forth
- upon
- this continent
- a new nation,
- conceived in liberty,
- and dedicated
- to the proposition
- that all men are created equal.
-
-
- This sample is not 80 characters wide, but you get the idea. Each
- new line starts where the last one left off, and lines wrap around
- onto the next line when the right margin is reached. This effect
- is common when UNIX files are brought into a DOS environment. DOS
- needs a carriage return to match each linefeed character.
-
- Here's a simple solution:
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: dosify file_name[s]
-
- Replaces a UNIX-style file with a copy in which each
- line feed is preceded by one carriage return, and the
- file ends with one CTL-Z byte. Use this program on a
- file in which the MORE command produces a skewed listing
- that fails to go back to the left margin for new lines.
-
- input: Any printable ASCII file[s].
-
- output: The same file, with the same name, with DOS conventional
- line ends and end of file.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- You can dosify a clutch of files in one command:
-
- DOSIFY GETTYSBU.RG LINCOLN DOUGLAS WHATEVER
-
- and the display begins to make sense:
-
- Fourscore and seven years ago,
- our forefathers brought forth
- upon this continent
- a new nation,
- conceived in liberty,
- and dedicated to the proposition
- that all men are created equal.
-
- DOSIFY appears to change files in place. In reality it makes a
- copy, and if successful, destroys the original and changes the name
- of the copy to match the original.
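-
- A minimal Python sketch of the same idea follows. It is not the
- DOSIFY source; the names dosify and dosify_file and the ".tmp"
- suffix for the working copy are hypothetical. The sketch first
- normalizes any existing CR-LF pairs so that a file which is
- already in DOS form is not given doubled carriage returns.

```python
import os


def dosify(data: bytes) -> bytes:
    """Return data with CR-LF line ends and exactly one final CTL-Z."""
    body = data.rstrip(b"\x1a")                    # drop any old CTL-Z
    body = body.replace(b"\r\n", b"\n")            # normalize to bare LF
    body = body.replace(b"\n", b"\r\n")            # one CR before each LF
    return body + b"\x1a"


def dosify_file(path: str) -> None:
    """Convert a file 'in place' by way of a temporary copy."""
    with open(path, "rb") as source:
        converted = dosify(source.read())
    temp_path = path + ".tmp"                      # hypothetical name
    with open(temp_path, "wb") as copy:
        copy.write(converted)
    os.replace(temp_path, path)                    # copy takes the old name


print(dosify(b"Fourscore and seven years ago,\nour forefathers\n"))
```

- Writing the copy first and renaming it only on success means a
- failure part way through leaves the original untouched.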
-
-
- ≡≡≡≡->> QUESTION:
- Using A_BYTES on a non-DOS file, how would you
- calculate in advance the number of bytes that it will
- contain after it is dosified?
- <<-≡≡≡≡
-
-
- ═════════════════════════════════════
- 5.3 Displaying printable data
- ═════════════════════════════════════
-
- Our immediate objective is to get first impressions of
- the content of a file. F_PRINT is a filter to show only printable
- characters within a file. Unlike HEAD, it can start instantly at
- any point within a file. An accent argument extends the range to
- include accented (high-bit-set) characters, but not graphics.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage f_print file_name [/a][/w] [ from_byte to_byte ] > subset
-
- Reduces a file to printable characters only. If the /w
- option is specified, strings of printable characters that
- are unlikely to be words are filtered out as well, and each
- new burst of accepted text is placed on a new line. /a
- causes accented characters to be accepted as printable.
-
- input: Any file whatsoever, or any part of a file.
-
- output: Printable subset.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- The command
-
- F_PRINT SVP_TXT /A 121500 121700
-
- causes the following display. Note the accented byte in the two
- repetitions of Luçon.
-
- CENT <P10>D<P8>EPAUL<R>i.s.C.M.
- @TEXT60 = <MI>Addressed:<D> Monsieur Le Soudier, Priest of the
- Mission of Luçon, in Luçon
- @HEAD4 = 464. - TO N.
- @TEXT4 = Saint-Lazare, Sunday, July 29, 1640
- @TEXT7
-
- For some fun, try F_PRINT on an executable EXE file,
- first without the /W argument, then with /W. For example,
-
- F_PRINT F_PRINT.EXE
- and
- F_PRINT /W F_PRINT.EXE
-
- The second listing is much shorter and far more intelligible.
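-
- The basic filter (without /W) can be sketched in Python as shown
- below. This is not the F_PRINT.C source. The function name, the
- inclusive to_byte, the decision to keep line feeds, and the use of
- code page 437 bytes hex 80 to A5 as the accent range are
- assumptions. The /W word filter is left out because its rules are
- not described here.

```python
from typing import Optional


def f_print(data: bytes, accents: bool = False,
            from_byte: int = 0, to_byte: Optional[int] = None) -> str:
    """Reduce a byte range to its printable characters."""
    # to_byte is assumed to be inclusive, as in the commands above.
    end = len(data) if to_byte is None else to_byte + 1
    kept = []
    for byte in data[from_byte:end]:
        if 0x20 <= byte <= 0x7E or byte == 0x0A:
            kept.append(chr(byte))                  # ASCII text, line feed
        elif accents and 0x80 <= byte <= 0xA5:
            kept.append(bytes([byte]).decode("cp437"))
        # every other byte is silently dropped
    return "".join(kept)


raw = b"\x00\x01Mission of Lu\x87on\x02, in Lu\x87on\r\n\xff"
print(f_print(raw))
print(f_print(raw, accents=True))
```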
-
-
- ≡≡≡≡->> QUESTION:
- In what ways might you amend source code in F_PRINT.C
- to get other useful effects? Hint: Try variations in
- the function check_store.
- <<-≡≡≡≡
-
-
- ═══════════════════════════════
- 5.4 Detailed data dumps
- ═══════════════════════════════
-
- Let's move beyond first impressions to methods of
- displaying exactly what byte sequences occur in a file or part of
- a file.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage dump file_name [/a] [ from_byte [ to_byte ] ] > report
-
- Lists the contents of a specified portion of any file,
- reporting 16 bytes per line. "/a" causes accented high bit
- characters to be printed.
-
- input: Any file whatsoever.
-
- output: Printable ASCII report, listing offset, then 16 bytes in
- hexadecimal format, with printable ASCII on the right;
- periods substitute for non-printable bytes.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- DUMP permits commands such as:
-
- DUMP SVP_TXT 0 800 > TOP_END
-
- This preliminary test of the file can be tried on any portion of
- the file. Moreover, if your target has 367 or fewer bytes, you can
- send the output directly to the screen without worrying about CTL-S
- stop and go control, as in:
-
- DUMP SVP_TXT 100000 100366
-
- DUMP restricts the printable character set to the 95
- byte patterns ranging from hex 20 (space) up to hex 7E. This
- restriction makes it much easier to recognize ordinary text; it is
- not surrounded by a jumble of happy faces and graphic characters.
- (Try the DOS 5.0 MORE command on any EXE executable file and see
- what you get!) Other characters are in the strict sense
- printable... carriage returns, line feeds and tabs. For accented
- characters using PC compatible extended ASCII, add the accents
- argument "/a":
-
- DUMP SVP_TXT /A 11100 11200
-
- Note the accented French word "écus" in the result.
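-
- For comparison, here is a short Python sketch that produces a
- report in the same spirit: a decimal offset, 16 bytes in hex, then
- the printable text with periods for everything else. It is not
- the DUMP source; the exact spacing of the report and the accent
- range (code page 437, hex 80 to A5) are assumptions.

```python
def dump(data: bytes, from_byte: int = 0, to_byte=None,
         accents: bool = False) -> list[str]:
    """List a byte range, 16 bytes per row, as hex plus text."""
    # to_byte is inclusive: 100000 to 100366 covers 367 bytes.
    end = len(data) if to_byte is None else min(to_byte + 1, len(data))
    rows = []
    for start in range(from_byte, end, 16):
        chunk = data[start:min(start + 16, end)]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        chars = []
        for byte in chunk:
            if 0x20 <= byte <= 0x7E:
                chars.append(chr(byte))
            elif accents and 0x80 <= byte <= 0xA5:
                chars.append(bytes([byte]).decode("cp437"))
            else:
                chars.append(".")            # non-printable byte
        # 47 = 16 two-digit values plus 15 separating spaces
        rows.append(f"{start}: {hex_part:<47}  {''.join(chars)}")
    return rows


for row in dump(b"ye<^>9<D> in\r\nLu\x87on?<^>10<D> I f", accents=True):
    print(row)
```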
-
-
- ═══════════════════════════════════════════
- 5.5 Convenient display of fragments
- ═══════════════════════════════════════════
-
- Suppose we want to check out other high-bit-set bytes
- found in the file SVP_TXT. Here is the list created by A_BYTES:
-
- é [82] 43 0.0% 11144 12314 13915 14831 18658 23503
- 23800 26370
- â [83] 1 0.0% 207322
- à [85] 1 0.0% 116508
- ç [87] 7 0.0% 95180 109218 121610 121620 129909
- 175862 181966
- è [8A] 4 0.0% 130386 130571 161305 232659
- î [8C] 4 0.0% 65079 93876 95582 138200
- ô [93] 10 0.0% 8834 16736 28121 28656 97731 134953
- 163316 170678
-
- One way to display a byte at a known location with its
- context is to issue a DUMP command that straddles its location.
- For example, to view the ç with cedilla at offset 95180:
-
- DUMP SVP_TXT /A 95100 95300
-
- would do the job. But DUMP gives too much detail for this purpose.
- The key lines in the screen display are:
-
- 95164: 79 65 3c 5e 3e 39 3c 44 3e 20 69 6e 0d 0a 4c 75
- ye<^>9<D> in Lu
- 95180: 87 6f 6e 3f 3c 5e 3e 31 30 3c 44 3e 20 49 20 66
- çon?<^>10<D> I f
- 95196: 69 6e 64 20 69 74 20 64 69 66 66 69 63 75 6c 74
- ind it difficult
-
- A more convenient program is FRAGMENT:
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage fragment input_file offset > stdout
-
- Display a five line fragment of a file in printable form
- with two lines of context on either side of the selected
- offset. Useful to get a quick view of contents at a
- selected location in a file. Use CPB and/or DUMP for an
- alternate method, less convenient, but with more detail.
-
- input: Most useful for printable ASCII files.
-
- output: Five double spaced lines in which non-printing characters
- are shown as blank with a ^ in the blank line below. The
- character at the exact offset is marked by a | in the blank
- line below.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- To display context of the first ç at offset 95180,
- simply input:
- FRAGMENT SVP_TXT 95180
-
- Five lines are displayed; the third line starts this way:
-
- Luçon?<^>10<D> I find it difficult to
-   |
-
- Notice how the ç is highlighted by the vertical bar | immediately
- underneath.
-
- Try showing the î at byte 138200 with this command:
-
- FRAGMENT SVP_TXT 138200
-
- You are shown the context around:
-
- M. Benoît<^>2<D> does not return
-        |
-
- and again a vertical bar underneath draws attention to the î.
-
- FRAGMENT works for any ASCII file, particularly where
- line lengths are under 80 characters. (Unix users: Many terminals
- are unpredictable when they attempt to display bytes with the high
- bit set. The source code contains notes indicating where to make
- necessary changes.) In the SVP_TXT example, the locations shown
- above all check out as valid accented characters within French
- names. Further along, we will find that by using the program
- A_PATTRN we can verify that all bytes with high bit set in our
- sample SVP_TXT are valid.
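-
- The marking scheme can be sketched in Python. This simplified
- version shows only the one line that contains the offset, not the
- five lines FRAGMENT displays, and it is not taken from the
- FRAGMENT source. The function name and the accent range (code
- page 437, hex 80 to A5) are assumptions.

```python
def fragment_line(data: bytes, offset: int) -> tuple[str, str]:
    """Return the line holding offset, plus a marker line beneath it."""
    start = data.rfind(b"\n", 0, offset) + 1      # byte after previous LF
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    line = data[start:end].rstrip(b"\r")
    text, marks = [], []
    for position, byte in enumerate(line, start):
        printable = 0x20 <= byte <= 0x7E or 0x80 <= byte <= 0xA5
        text.append(bytes([byte]).decode("cp437") if printable else " ")
        if position == offset:
            marks.append("|")                     # the selected byte
        else:
            marks.append(" " if printable else "^")
    return "".join(text), "".join(marks).rstrip()


sample = b"in\r\nLu\x87on?<^>10<D> I find it difficult to\r\nnext"
shown, marker = fragment_line(sample, 6)
print(shown)
print(marker)
```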
-
-
- ══════════════════════════════════════════════
- 5.6 Viewing patterns throughout a file
- ══════════════════════════════════════════════
-
- The techniques thus far display context at specific
- points within a file... the beginning, the end, or near certain
- offsets. More is needed. We want to be able to:
-
- » ensure that patterns are consistent across all the
- data;
-
- » identify every set of codes and signals that may help
- us toward our objectives of interpreting record and
- field separators, searchable content, etc.
-
- At the end of the preceding topic, we concluded that
- our sample file, SVP_TXT, is extended ASCII text (normal text plus
- accented letters), and that the usage of certain characters needs
- to be checked out: @ = ^ | < >.
-
- The program A_PATTRN can be used to isolate every
- occurrence within a file of a single character or of a string of up
- to 16 characters.
-
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: a_pattrn file_name key [ /x ] [ bytes_before ] > report
- "/x" = include hex, show only 16 bytes instead of 40
- List every occurrence of a key character or string in a file.
- Show 3 (or "bytes_before", range 0 to 15) bytes prior to the
- key each time. Normally show a total of 40 bytes each time
- the key is found; if the "/x" argument is set, show only 16
- bytes, but in hex and ASCII both. The key may be from 1 to
- 16 characters. Within the key, any non-printing characters,
- characters which may confuse DOS (> or < or |), linefeeds,
- blanks, backslash, etc. must be shown in hex form... a
- backslash and 2 hex digits. Examples:
- a_pattrn herfile \8E > herfile.8e
- a_pattrn yourfile * 7 > yourfile.ast
- a_pattrn myfile Mother
- a_pattrn hisfile \94\05ke\ff 0 > 5char.pat
-
- input: Any file whatsoever.
-
- output: One line for each occurrence of the target byte(s) in the
- file. Sort the result to make patterns show up more clearly.
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- DOS assigns meaning to certain characters (such as
- space, | < >, etc.), so if you have any problem using the A_PATTRN
- command, switch to the hex format for the search key (the letter or
- sequence of letters on which you wish to search). For example,
-
- A_PATTRN SVP_TXT > > SVP3E
- fails, but
- A_PATTRN SVP_TXT \3E > SVP.3E
- works.
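-
- The hex form of a key is easy to decode. The Python sketch below
- shows one way to turn a key written with backslash-and-two-hex-
- digit escapes into raw bytes; parse_key is a hypothetical name and
- the code is not taken from the A_PATTRN source.

```python
import re


def parse_key(text: str) -> bytes:
    """Turn a key such as r'\\3C\\5E' (backslash escapes) into bytes."""
    key = re.sub(rb"\\([0-9A-Fa-f]{2})",
                 lambda match: bytes([int(match.group(1), 16)]),
                 text.encode("cp437"))
    if not 1 <= len(key) <= 16:
        raise ValueError("key must be 1 to 16 bytes long")
    return key


print(parse_key(r"\3C\5E"))         # the two bytes < and ^
print(parse_key("Mother"))          # plain text passes through
print(parse_key(r"\94\05ke\ff"))    # five bytes, mixed hex and text
```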
-
- The key benefit of A_PATTRN is that it selects the same
- byte or string and places it in the same position on each line.
- Patterns begin to emerge at once. Here are the first few lines
- produced by:
-
- A_PATTRN SVP_TXT \3C\5E > RESULT
-
- 00000641:    men<^>2<D> want to communicate..in writi
- 00001138:    ot]<^>3<D> concern the hospital, you..ca
- 00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
- 00001497:    t's<^>5<D> estates, I believe, although
- 00001678:    eu,<^>6<D> and to do so as soon as possi
- 00002197:    ne,<^>2<D> which I see from your..letter
- 00003152:    is;<^>3<D> who put together his council
- 00003412:    us?<^>4<D> Indeed, there is no good or..
- 00003747:    ey,<^>5<D> who on another..occasion spok
- 00005086:    eva<^>6<D> and,..following his example,
-
- \3C\5E appears as <^ starting at byte 17 in each line. It is a
- simple matter to perform a sort:
-
- SORT /+17 < RESULT > RESULT.SRT
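-
- Outside DOS, the same column-based sort can be sketched in Python.
- The function name is hypothetical. Case is folded to upper case
- because DOS SORT ignores case, and columns count from 1 as they do
- in SORT.

```python
def sort_from_column(lines, column, reverse=False):
    """Sort lines on the text that starts at a 1-based column."""
    return sorted(lines,
                  key=lambda line: line[column - 1:].upper(),
                  reverse=reverse)


report = [
    "00001138:    ot]<^>3<D> concern the hospital",
    "00000641:    men<^>2<D> want to communicate",
    "00002197:    ne,<^>2<D> which I see from your",
]
for line in sort_from_column(report, 17):
    print(line)
```

- Lines with equal keys keep their original order, so the offsets
- stay in ascending sequence within each group.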
-
- The hexadecimal output from A_PATTRN (when argument /x
- is used) looks just like the output of DUMP. Here we have
- shortened the lines a bit.
-
- 00000641: 6d 65 6e 3c 5e 3e 32 3c 44 3e 20... men<^>2<D> want
- 00001138: 6f 74 5d 3c 5e 3e 33 3c 44 3e 20... ot]<^>3<D> conce
- 00001467: 75 66 2c 3c 5e 3e 34 3c 44 3e 20... uf,<^>4<D> on Ma
- 00001497: 74 27 73 3c 5e 3e 35 3c 44 3e 20... t's<^>5<D> estat
- 00001678: 65 75 2c 3c 5e 3e 36 3c 44 3e 20... eu,<^>6<D> and t
- 00002197: 6e 65 2c 3c 5e 3e 32 3c 44 3e 20... ne,<^>2<D> which
- 00003152: 69 73 3b 3c 5e 3e 33 3c 44 3e 20... is;<^>3<D> who p
- 00003412: 75 73 3f 3c 5e 3e 34 3c 44 3e 20... us?<^>4<D> Indee
- 00003747: 65 79 2c 3c 5e 3e 35 3c 44 3e 20... ey,<^>5<D> who o
- 00005086: 65 76 61 3c 5e 3e 36 3c 44 3e 20... eva<^>6<D> and,.
-
- The hex result can also be sorted (SORT /+20). When dealing with
- fully printable files, the hex rendition of each byte is not
- particularly useful. The one piece of information the hex version
- provides is that the ".." pattern within the ASCII is usually a
- carriage return - line feed combination (hex 0d 0a).
-
- Whichever output is selected, we discover that the two
- bytes "<^" in the file SVP_TXT are in every case followed by ">",
- a one or two digit number, then "<D>". The lowest numbers, 1 and
- 2, are most frequent. The frequency falls off steadily so that the
- highest, "<^>27<D>" occurs only once.
-
- Looking at the patterns around the single character hex
- 5E ("^") alone reveals two other combinations: "<B^>#<D>" and
- "<I^>#<D>". The three basic patterns <^>, <B^> and <I^> account
- for all occurrences of the caret character "^".
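-
- The listing format is simple enough to imitate. The Python sketch
- below is not the A_PATTRN source; it is a minimal illustration,
- assuming an 8 digit offset, a colon and four blanks, so that the
- key lands in column 17 when three bytes of context come first.
- Missing context at the very start of the file is padded with
- periods. The /x hex option is omitted.

```python
def a_pattrn(data: bytes, key: bytes, bytes_before: int = 3,
             width: int = 40) -> list[str]:
    """List every occurrence of key, aligned in the same column."""
    report = []
    position = data.find(key)
    while position != -1:
        start = max(position - bytes_before, 0)
        window = data[start:position - bytes_before + width]
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "."
                       for b in window)
        # Pad with periods when fewer than bytes_before bytes exist.
        pad = "." * (bytes_before - (position - start))
        report.append(f"{position:08d}:    {pad}{text}")
        position = data.find(key, position + 1)
    return report


sample = (b"@HEAD1 = SAINT VINCENT DE PAUL\r\n"
          b"@HEAD2 = CORRESPONDENCE\r\n"
          b"@HEAD4 = 417. - TO SAINT LOUISE DE MA")
for line in a_pattrn(sample, b"@"):
    print(line)
```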
-
- ≡≡≡≡->> QUESTION:
- DOS 5.0 has a "FIND" command which can also be used to
- list every line in which a character sequence appears.
- Compare the respective advantages of FIND and A_PATTRN.
- <<-≡≡≡≡
-
-
- ═════════════════════════════════════════
- 5.7 The power of sorting patterns
- ═════════════════════════════════════════
-
- Suppose we look for patterns around the single "at
- sign" (@) character:
-
- A_PATTRN SVP_TXT @ > AT_SIGN.SVP
-
- The result contains 935 lines which start out as follows:
-
- 00000000:    ...@HEAD1 = SAINT VINCENT DE PAUL..@HEAD
- 00000032:    L..@HEAD2 = CORRESPONDENCE..@HEAD4 = 417
- 00000057:    E..@HEAD4 = 417. - TO SAINT LOUISE DE MA
- 00000121:    S..@TEXT4 = Paris, January 11, 1640..@TE
- 00000155:    0..@TEXT7 = Mademoiselle,..@TEXT6 = I re
- 00000179:    ,..@TEXT6 = I received three letters fro
- 00000605:    ...@TEXT6 = Seeing that those Gentlemen<
- 00001613:    ...@TEXT6 = You would do well to send fo
- 00001799:    ...@TEXT6 = People are praying to God fo
- 00001941:    ...@HEAD4 = 418. - TO LOUIS ABELLY,<B^>1
-
- We may have tripped upon the record separators and field separators
- that we are looking for. Notice particularly the pattern @HEAD4 =
- which is followed by a number. We have several options at this
- point. One is to lengthen the key and re-run the pattern analysis:
-
- A_PATTRN SVP_TXT @HEAD > AT_HEAD.SVP
- and
- A_PATTRN SVP_TXT @TEXT > AT_TEXT.SVP
-
- Alternatively, since the earlier listing AT_SIGN.SVP has
- only 51,425 bytes, we can sort it beginning at column 17:
-
- SORT /+17 < AT_SIGN.SVP > AT_SIGN.SRT
-
- As we view the sorted result, patterns become very clear. Here is
- part of the analysis that I reported after a few more tests with
- the A_PATTRN program. At this point, analysis was still tentative,
- but it provided a good basis for discussion with the database
- provider.
-
-
- Analysis of SVP_TXT
- ASCII text with Printer's Codes (Headers)
-
- February 12, 1992
-
- The following are tentative interpretations of the
- Printer's Codes embedded in SVP_TXT. Corrections would be
- welcome.
-
- @HEAD1 database heading, 1 occurrence at beginning
- @HEAD2 database subheading, 1 occurrence "
- @HEAD4 sequence number, 1 per letter
- @TEXT31 dateline
- @TEXT4 dateline, letter from s.v.p.
- @TEXT41 dateline, letter to s.v.p. from other person
- @TEXT5 signature line, s.v.p.
- @TEXT51 signature line, other person
- @TEXT6 paragraph start, letter from s.v.p.
- @TEXT60
- @TEXT600
- @TEXT61 paragraph start, letter from other person
- @TEXT611 address line to s.v.p.
- @TEXT7 salutation from s.v.p.
- @TEXT71 salutation to s.v.p.
- <169> beginning quote (7X)
- <170> end quote (7X)
- <197> dash (23X)
- <B^>1<D> superscript footnote ref in heading (11X)
- <D> terminator for other < > symbols (594X)
- <I^>#<D> superscript footnote ref in heading (59X)
- <M> emphasis -- bold, highlight, italics? (6X)
- <MI> emphasis -- bold, etc.? (134X)
- <P10>, <P7>, <P7M>, <P8>, <P8MI>, <P9>
- pica measures for font size (212X)
- <R> ?? 18 of 21X <R>i.s.C.M. in signature
- <^>#<D> superscript footnote ref in text (409X)
- <|> blank position holder, not to end line (28X)
-
-
- ═══════════════════════════════
- 5.8 Sorting large files
- ═══════════════════════════════
-
- As files get larger, the DOS SORT slows down. Sorting
- the 935 lines (51,425 bytes) in AT_SIGN.SVP took 30 seconds on a 12
- megahertz AT clone. Files that approach the 64k byte limit of SORT
- cannot be sorted in a single pass.
-
- A description follows for SORT2, a device to get around
- the 64k limit. It's not elegant, but it works!
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage - sort2 [/r] [/+n] from_file to_file key[s]
-
- Sorts large ASCII text files using the memory-bound DOS SORT
- routine in multiple passes. /r signifies reverse order.
- /+n specifies a starting column, 1-999. A key is 1 to 3
- characters, used as a dividing point. The program separates
- the input file into a series of temporary files, depending on
- the byte(s) at the starting column. For n dividing points,
- the program makes n+1 temporary files, and reports the size
- of each. If all are under 60k characters, they are sorted
- and placed together in the output file. If a run fails, add
- another dividing point mid-way in the range that fails (that
- is, the file that is too big), and try again. NOTE: The DOS
- SORT starts column count at 1, converts all lower to upper
- case!
-
- input: Line oriented printable ASCII.
-
- output: Same file, sorted.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
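-
- The divide-and-sort idea can be sketched in Python. This is an
- illustration of the technique, not the SORT2 source: lists stand
- in for the temporary files, the 60,000 byte default mirrors the
- 60k figure above, and the function and parameter names are
- hypothetical. The /r option is omitted.

```python
import bisect


def sort2(lines, keys, column=1, limit=60_000):
    """Sort lines bucket by bucket, split at the given dividing keys."""
    points = sorted(key.upper() for key in keys)

    def field(line):
        # DOS SORT counts columns from 1 and ignores case.
        return line[column - 1:].upper()

    # n dividing points give n + 1 buckets (the "temporary files").
    buckets = [[] for _ in range(len(points) + 1)]
    for line in lines:
        buckets[bisect.bisect_right(points, field(line))].append(line)

    result = []
    for number, bucket in enumerate(buckets):
        size = sum(len(line) + 2 for line in bucket)   # +2 for CR-LF
        if size >= limit:
            raise ValueError(f"bucket {number} has {size} bytes; add a key")
        result.extend(sorted(bucket, key=field))
    return result


words = ["pear", "Apple", "mango", "banana", "Zebra", "kiwi"]
print(sort2(words, ["m"]))
```

- Because every line in one bucket sorts before every line in the
- next, joining the sorted buckets end to end gives the same order
- as one large sort.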
-
-
- ≡≡≡≡->> QUESTION:
- Try out SORT2, get a feel for how it works. Can you
- come up with ways to make it easier to use or more
- powerful? Or do you have your own super sort that you
- are willing to publish under copyleft rules?
- <<-≡≡≡≡
-
- Another way to speed up sorts is to throw away portions
- of the target file that are not essential for the purpose you have
- in mind when sorting. For example, the program A_PATTRN produces
- 8 byte offsets followed by a colon and white space, and up to 40
- bytes of information. Are they all necessary?
-
- The program COLRM removes the same columns from every
- line of ASCII text in a file:
-
-
-
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage - colrm from_col to_col < printable_ascii > revised_ascii
-
- Removes the specified range of columns from each line of an
- ASCII file. This is a clone of the Unix "colrm" utility.
-
- input: A printable ASCII file with less than 512 characters per
- line. Columns number from 1 upward.
-
- output: The same number of lines, but with one segment of columns
- removed from each line.
-
- writeup: MIR TUTORIAL ONE, topic 5
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Since ASCII text is the only accepted input, we are
- safe in a DOS environment to use standard input and output. There
- is no confusion over line feeds and CTL-Z characters. An added
- benefit is that we can pipe the output of successive runs of COLRM.
-
- Recall the earlier example
-
- A_PATTRN SVP_TXT \3C\5E > RESULT
-
- which produced output that started off like this:
-
- 00000641:    men<^>2<D> want to communicate..in writi
- 00001138:    ot]<^>3<D> concern the hospital, you..ca
- 00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
- 00001497:    t's<^>5<D> estates, I believe, although
- 00001678:    eu,<^>6<D> and to do so as soon as possi
- 00002197:    ne,<^>2<D> which I see from your..letter
- 00003152:    is;<^>3<D> who put together his council
- 00003412:    us?<^>4<D> Indeed, there is no good or..
- 00003747:    ey,<^>5<D> who on another..occasion spok
- 00005086:    eva<^>6<D> and,..following his example,
-
- Our primary interest is in the patterns of the form
- <^>#<D> and <^>##<D>. We could remove the first 16 columns by:
-
- COLRM 1 16 < RESULT > TEMP
-
- and those that follow the 8 characters of interest by
-
- COLRM 9 99 < TEMP > RESULT2
-
- Notice that you can use any large number that will reach to the end
- of all lines. Alternatively, you can do the two steps in one:
-
- COLRM 1 16 < RESULT | COLRM 9 99 > RESULT2
-
- RESULT2 has only 4,090 bytes, in contrast to the 22,086 in RESULT.
- The new file starts off like this:
-
- <^>2<D>
- <^>3<D>
- <^>4<D>
- <^>5<D>
- <^>6<D>
- <^>2<D>
- <^>3<D>
- <^>4<D>
- <^>5<D>
- <^>6<D>
-
- Eighty per cent reduction in a file size pays off when sorting.
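-
- Column removal amounts to two slices per line. The Python sketch
- below shows the two COLRM steps above applied to one line of
- RESULT; the function is an illustration, not the COLRM source.

```python
def colrm(line: str, from_col: int, to_col: int) -> str:
    """Remove columns from_col through to_col (1-based, inclusive)."""
    return line[:from_col - 1] + line[to_col:]


row = "00000641:    men<^>2<D> want to communicate..in writi"
step1 = colrm(row, 1, 16)      # drop offset, blanks and 3 context bytes
step2 = colrm(step1, 9, 99)    # keep only the first 8 columns
print(repr(step1))
print(repr(step2))
```

- A to_col beyond the end of the line is harmless, which is why any
- large number works for the second step.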
-
- A_OCCUR is useful in analyzing sorted files that
- contain many repetitions.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage - a_occur [ min_freq ] [ /n ] < ascii_text > report
- /n = non-sequenced data is okay
-
- Count the frequency of occurrence of identical lines.
- If a minimum frequency is specified, lines occurring
- fewer times are dropped entirely from the result.
-
- Input: ASCII text, which must be in sorted order UNLESS the
- flag "/n" is included.
-
- Output: A reduced copy of the file with each line shown only
- once. Each line begins with a frequency count, padded
- out to six characters with blanks.
-
- Writeup: MIR TUTORIAL ONE, topic five.
- See also the related programs A_OCCUR2 and A_OCCUR3.
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Here is the top end of the output when we input the
- command
- A_OCCUR < RESULT2 > FREQ
-
-
- 10 <^>10<D>
- 8 <^>11<D>
- 6 <^>12<D>
- 5 <^>13<D>
- 5 <^>14<D>
- 3 <^>15<D>
- 3 <^>16<D>
- 3 <^>17<D>
- 2 <^>18<D>
- 2 <^>19<D>
-
- Frequency is the first element. For example, the pattern <^>11<D>
- occurs 8 times, <^>12<D> occurs 6 times. It was the regularly
- declining frequency of the numbers that first suggested to me that
- these tags indicate footnote numbers within the test file SVP_TXT.
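-
- Counting runs of identical lines in sorted input takes very
- little code. The Python sketch below is not the A_OCCUR source:
- the left-justified six character count and the handling of
- unsorted input (count everything, then sort) are assumptions.

```python
from collections import Counter
from itertools import groupby


def a_occur(lines, min_freq=1, sequenced=True):
    """Collapse identical lines into 'count + line' report rows."""
    if sequenced:
        # Sorted input: identical lines are adjacent, so count runs.
        counts = [(line, sum(1 for _ in run))
                  for line, run in groupby(lines)]
    else:
        # Unsorted input: count everything, then sort the result.
        counts = sorted(Counter(lines).items())
    return [f"{count:<6}{line}"
            for line, count in counts if count >= min_freq]


tags = ["<^>10<D>"] * 3 + ["<^>11<D>"] * 2 + ["<^>12<D>"]
for row in a_occur(tags):
    print(row)
```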
-
- To finish this topic, we mention two simple utility
- programs that are related to A_OCCUR.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage - a_occur2 [ min_frequency [ filename_under_min ] ]
- < merged a_occur files > combined
-
- A utility to calculate cumulative frequency of
- merged A_OCCUR outputs. If a minimum frequency is
- specified, then all lower frequency items are either
- suppressed or sent to a file named in the next argument.
-
- Input: ASCII text, in which each line starts with a number
- (a frequency count) followed by blanks, then sorted text
- starting in the seventh column.
-
- Output: A copy of the same file in which multiple identical lines
- are shown only once, preceded by the combined frequency
- count.
-
- Writeup: MIR TUTORIAL ONE, topic five.
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
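-
- Combining several frequency reports means adding the counts of
- matching lines. A minimal Python sketch, assuming the count
- occupies the first six characters and the text starts in the
- seventh column as described above; the function name and the
- two-list return value are hypothetical, not the A_OCCUR2 design.

```python
def a_occur2(lines, min_frequency=1):
    """Add up counts of identical text across merged A_OCCUR rows."""
    totals = {}
    for line in lines:
        count, text = int(line[:6]), line[6:]
        totals[text] = totals.get(text, 0) + count
    kept, under = [], []
    for text in sorted(totals):
        row = f"{totals[text]:<6}{text}"
        # Rows below the minimum go to a separate list, standing in
        # for the optional "under minimum" file.
        (kept if totals[text] >= min_frequency else under).append(row)
    return kept, under


merged = ["3     alpha", "2     beta", "4     alpha"]
print(a_occur2(merged, min_frequency=3))
```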
-
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- Usage a_occur3 < occur_file > expanded_file
-
- Reverse an A_OCCUR file by removing the initial count,
- then outputting each line the number of times indicated
- by the count. Useful if editing an A_OCCUR file, then
- reconstituting it.
-
- input: ASCII file with each line containing a count, blank
- padded to the sixth character, then the line content.
-
- output: Same content, but with leading six characters removed
- and content repeated for "count" lines.
-
- Writeup: MIR TUTORIAL ONE, topic five.
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
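-
- The reverse step is shorter still. A Python sketch, again
- assuming a six character count field and not drawn from the
- A_OCCUR3 source:

```python
def a_occur3(lines):
    """Expand 'count + line' rows back into repeated lines."""
    expanded = []
    for line in lines:
        count, text = int(line[:6]), line[6:]
        expanded.extend([text] * count)
    return expanded


print(a_occur3(["2     <^>18<D>", "1     <^>27<D>"]))
```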
-
-
- * * * * *
-
-
- We have identified several methods of viewing portions
- of a computer file. Each is an aid in analyzing file content. The
- most powerful aid is the A_PATTRN program. Its output may be
- sorted so that the context of any character or sequence of up to 16
- characters may be examined.
-
- Interpreting the results becomes easier as you acquire
- experience with various kinds of data. The next few topics offer
- additional tools and pointers for analysis.