Text File | 1992-06-29 | 21.9 KB | 492 lines

═══════════════════════════════════════
6. WORKED EXAMPLES... VARIATIONS IN ASCII TEXT
═══════════════════════════════════════

The next three topics will be richer to the extent that you and
other readers provide samples that can be used to explain the
various forms in which data is held in real files. The variety is
staggering... in working with between 200 and 300 large databases
in the late 1980s, I found that only a few formats and sets of
rules were replicated entirely across databases. Many more
databases had unique patterns or combinations of rules. But a word
of encouragement... analysis really does get easier along the way.

════════════════════════════════
6.1 Other analysis tools
════════════════════════════════

Here is an assortment of programs useful with ASCII text files.
Source code is included on the diskettes.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage:   lines file_name[s]

         Provides a quick count of the number of lines in each of
         one or more files.

input:   Any file[s], but most useful if ASCII text.
output:  A one line report on the screen of the number of lines
         in each input file.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

The DOS DIR command is one measure of size of a file. The LINES
command is another. It is quick. Try it on some of the source
files. For example,

     LINES A_PATTRN.C A_BYTES.C LINES.C DOSIFY.C

yielded this answer on the screen:

      344 lines in file a_pattrn.c
      335 lines in file a_bytes.c
      132 lines in file lines.c
      154 lines in file dosify.c
      965 lines TOTAL

The actual count may differ when you try it; that would be on
account of later revisions in your copy of each of these files.

LINES is particularly useful in preparation for a SORT of a file.

≡≡≡≡->> QUESTION: The programs that use lists of file names as
inputs would be improved by a function to expand out wild cards in
file names (each ?
to be replaced by a single character, each * to be replaced by
zero or one or several characters). Try your hand at it and share
the result. <<-≡≡≡≡

Another program provides an analysis of line lengths:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage:   a_len [interval] file_name[s]

         Analyze the distribution of line lengths up to 1024 bytes
         within any file. The reporting interval (an integer from
         1 to 100) is a count of the lengths that will be grouped
         together. For example, an interval of 10 means that
         frequencies of length 0, length 1-10, length 11-20, etc.
         are shown in the report. The default interval is 10. If
         the first file name starts with numeric digits, specify
         the interval first!

input:   Any ASCII file[s].
output:  file_name.len which reports the frequency of line lengths
         occurring in the file. Lengths exclude carriage returns
         and line feeds.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Here is the result of

     A_LEN SVP_TXT

       1 -  10:      63     1.5%
      11 -  20:     154     3.7%
      21 -  30:     160     3.9%
      31 -  40:     186     4.5%
      41 -  50:     185     4.5%
      51 -  60:     832    20.2%
      61 -  70:    2538    61.6%

Over 80 per cent of lines are between 51 and 70 bytes long. None
are longer. That's a very strong indication that the file is
printable text. Non-displayable files are much more likely to have
random distances between line feed characters. (For a more
detailed list, try the command A_LEN 1 SVP_TXT.)

A_LEN also tells us whether line-oriented utility programs are
likely to work. Some line editors behave badly when data has long
lines. Versions of UNIX "vi", for example, commonly choke on lines
over 256 bytes in length.

LINE_NUM has a variety of uses.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage:   line_num [ starting_line_no ] < stdin > stdout

         Assign a sequence number to each line in a file, starting
         either at zero or at a user-specified sequence number.

input:   Any printable ASCII file.
output:  One line for each line of input. A sequence number is
         left justified, followed by a tab, then the input line
         exactly as received. Empty lines are counted, but left
         empty.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Try it with the DOS command FIND. For example,

     LINE_NUM < A_BYTES.C | FIND "Usage"

The result on the screen is:

     62      void process(), Usage_(), report(), non_exist() ;
     88      Usage_();
     114     Usage_()

In other words, the term "Usage" occurs on lines 62, 88 and 114.

═════════════════════════════════
6.2 ASCII markup patterns
═════════════════════════════════

The simplest form of text is called WYSIWYG (pronounced
wizzy-wig). It stands for "What you see is what you get." A
document or communication consists of content and form. Depending
on how the author has arranged the content of a WYSIWYG file, you
as a reader can get all of the content, but may miss much of the
intended form.

Does form matter? On even the simplest one page memo, yes! Lack of
form imposes obstacles to understanding. We look to form to
highlight answers to key questions: Who is this from? When was it
prepared? What was uppermost in the author's mind? For whom is the
message intended? What response is desired or intended? What is
being stressed? What parts are subordinate explanation? How does
one communicate a response to the author?

Visualize a memo that simply runs everything together as a stream
of words in a single paragraph. The answers may be there. But
reading time goes way up. The message is therefore less likely to
be read in full, and less likely to be understood correctly if it
is read.

The longer and the more complex the communication, the more form
matters. Form guides the reader through the structure of the
document. For example:

» The level of a heading is indicated by location on a page, type
  size, case, font selection, proximity to other material, use of
  graphic enhancement, associated numbering, and even color.
  An unnumbered heading, centred and alone on a page, in bold
  upper case italics, carries a message quite distinct from
  exactly the same words preceded by "6.5.3", left justified,
  primarily lower case, and with following text continuing on the
  same line in the same font.

» Structure is needed to sort out the cross referencing of complex
  material: Are footnotes at the bottom of the page, at the end of
  the chapter, or in their own section among the support sections
  at the end of the document?

» A table of contents is intended to clarify structure.

» A preface underlines purpose.

» An index helps the user to find material of particular interest.

You have just read a series of indented points. They are parallel
and separated. The form is more readily understood than the same
content run together into one paragraph.

A computer file is essentially a stream of bytes. To indicate
form, there has to be either:

» a definition of the structure separate from the file; or

» signals embedded within the file content.

The simplest external definition is that a "line" contains a
certain number of bytes (often 80). Alternately, the external
definition may itself be a lengthy file.

Internal signals or codes must be distinguishable from the
content. These signals may involve characters unused in the text
(nulls, non-printing characters); they may be sequences of
printable characters that are guaranteed not to appear as part of
the content.

Whatever their format, internal structure and formatting codes
should also be consistent across the entire file. (But don't count
on it. I was given a small database in which the sequence "<I>"
had two or more quite distinct meanings. It turned out the
database provider knew of the flaw and, instead of correcting it,
resorted to physically pasting up copy for photo reproduction.
Inconsistency is costly.)
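
One way to act on that warning is to take inventory of the
embedded codes before relying on them. What follows is only a
minimal sketch in C, not one of the MIR programs. It assumes that
codes are short sequences in angle brackets, like the "<I>" just
mentioned; the name tally_tags and the two size limits are
hypothetical choices made for illustration. A tally cannot show
that one code carries two meanings, but it does list the codes
actually present so that each can be checked against samples.

```c
#include <string.h>

#define MAX_TAGS    64   /* assumed limit on distinct codes tracked */
#define MAX_TAG_LEN 16   /* assumed longest code, brackets included */

struct tag_count {
    char name[MAX_TAG_LEN + 1];   /* the code as found, e.g. "<I>" */
    long count;                   /* number of occurrences */
};

/* Tally every "<...>" sequence of at most MAX_TAG_LEN bytes in text.
   A '<' that is followed by another '<' before any '>' is treated as
   ordinary content. Returns the number of distinct codes stored in
   table, which must have room for MAX_TAGS entries. */
int tally_tags(const char *text, struct tag_count *table)
{
    int n = 0;
    const char *p = text;

    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p + 1, '>');
        size_t len;
        int i;

        if (end == NULL)
            break;                       /* no '>' left: nothing more to find */
        len = (size_t)(end - p) + 1;     /* length including both brackets */
        if (len > MAX_TAG_LEN || memchr(p + 1, '<', len - 2) != NULL) {
            p++;                         /* not a code: step past this '<' */
            continue;
        }
        for (i = 0; i < n; i++)          /* look for a code already seen */
            if (strncmp(table[i].name, p, len) == 0 &&
                table[i].name[len] == '\0')
                break;
        if (i == n) {                    /* first sighting of this code */
            if (n == MAX_TAGS) {         /* table full: ignore new codes */
                p = end + 1;
                continue;
            }
            memcpy(table[n].name, p, len);
            table[n].name[len] = '\0';
            table[n].count = 0;
            n++;
        }
        table[i].count++;
        p = end + 1;
    }
    return n;
}
```

Call it with a buffer holding the text and an array of MAX_TAGS
entries, then print each name and count. Codes that turn up only
once or twice in a large file are good candidates for a closer
look.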
══════════════════════════════════
6.3 Standard Generalized Markup Language (SGML)
══════════════════════════════════

The word "rigor" is used in mathematics to convey notions of
completeness, logic and disciplined consistency. The Standard
Generalized Markup Language is an attempt to ensure rigor in the
transfer of documents.

Standards are paradoxical... at one and the same time an
imposition and a convenience. In the case of SGML, the imposition
is in having to learn and methodically apply methods of declaring
structure and of representing codes within content. In exchange
for the pain, the gain is clarity in what is intended by the
author (or subsequent editor) in documents that adhere to the
standard. As the standard takes hold, we indexers profit from
increasingly greater consistency across databases. In the long
run, we will spend much less time deciphering "orphan" database
structures and markup methods.

In SGML, the elements of form and structure are declared in a
distinct document which can be transmitted with the database(s).
Tagging is a separate task. Wide variability is allowed in the
symbols used for tagging content, but consistency is enforced in
the methods of tagging. One gets the impression that the early
writers on SGML have inadvertently set more of a standard than
they intended; one sees their particular choices of symbols being
picked up as if they were part of the standard. Example:

     "the figures shown in <tableref>Table 7.2</tableref>..."

where "<" starts an opening tag and "</" a closing tag. For our
purposes, the more consistency the better! The less time spent
setting up for alternative tag sets, the quicker we can get data
into a form that permits automated indexing.

One of the beauties of SGML is its breadth. WYSIWYG can be viewed
as untagged SGML; one needs simply the declaration of an
unstructured byte stream to be the structure. We can write
software to parse and manipulate this simplest form of text.
The software may be expanded on an as-needed basis whenever we
wish to add structure and formatting. In that sense, the form of
text we use for automated indexing may be considered a form of
SGML.

If we want more speed in the inversion and indexing process, we
can adopt a specific selection of tags to be recognized by our
parser. The cost is that we must convert data received from others
to our tag types. If the incoming data is true SGML, then our
preprocessing software can be a simple table replacement algorithm
(easy stuff to write).

Tutorial THREE deals with automated indexing. The primary version
of the software is kept simple for instructional purposes. But
there is nothing stopping us from building more sophisticated
parsers based on more elaborate SGML declarations. Let's do it
together, basing the expansions on real world needs that you
encounter.

═════════════════════════════════════════
6.4 Free versus hierarchical text
═════════════════════════════════════════

Text data includes virtually anything that can be typed on a
computer keyboard.

The most familiar is free text. Natural language is our normal way
of communicating with one another. Words and phrases are grouped
according to grammatical rules to form sentences and paragraphs.
Words and phrases give each other context, so that communication
creates a picture or impression in the mind of the person
receiving it. The text is free in the sense that from a computer
standpoint no divisions or rules are implied. There is no
computer-based definition or limit to the length of a word or line
or paragraph. This paragraph, considered alone and by itself, is
an example of free text.

In lengthy communication, free text is sub-divided. The divisions
may be as subtle as an extra line feed to mark the end of
paragraphs. The more complex the document, the more likely it is
divided into sections, chapters, or articles. Subdivisions are
meant to be an aid to understanding.
Our comprehension of a book is improved if we can associate what
we are reading with a chapter name and book title.

Hierarchical data is a term that covers this type. Each paragraph
is connected with a hierarchy of headings (book, chapter, section,
sub-section). Hierarchical data is also found in newspaper
articles, business reports, dictionary entries, and encyclopedia
entries. Each has one or more levels of headings.

A variation on hierarchical data is text that is cross referenced.
Cross references are an internal form of subject index. They are
created manually and inserted within the text. Because human
judgment is involved, their creation is both expensive and
subjective. Where a record touches on more than one subject,
multiple cross reference paths may branch out from the one record.
Heavily studied databases such as religious scriptures are often
cross referenced.

At some point, a threshold is passed whereby the rules are no
longer (only) implied in the natural language, but are expressed
in computer terms as well. The file is no longer treated as a
simple byte stream. A hierarchical model of some sort is imposed
on the data to distinguish among the various levels and categories
of information.

In Tutorial TWO we will see how the power of a hierarchical model
can be wedded to the simplicity of a byte stream. It comes down to
a simple trick of doing away with the assumption that records have
to be numbered consecutively. More to follow!
════════════════════════════════════════
6.5 Fielded variable length text
════════════════════════════════════════

Consider the following:

     Part Number:                   CL4-097-B
     Description:                   VALVE SPRING ASSEMBLY
     Quantity on hand, location 1:  57
     Quantity on hand, location 2:  0
     Quantity on hand, location 3:  16,212
     Quantity on hand, location 4:  8,004
     Usage, this month:             4,211
     Usage, same month last year:   3,654
     Usage, current year to date:   31,032
     Usage, last year to date:      23,996
     Economic order quantity:       1,000
     Cost per unit:                 $ 12.975
     Permitted substitute part #:   CL4-997-X
     Permitted substitute part #:   DY6-000-P

The above is an inventory record. Each line is a "field", an
element of data which describes one attribute of the item under
discussion. Fielded data is very common in business records. Here
is a variation on the same data:

     <pn>CL4-097-B\0<ds>VALVE SPRING ASSEMBLY\0<q1>57\0
     <q3>16212\0<q4>8004\0<um>4211\0<ul>3654\0<uy>31032
     <up>23996\0<eo>1000\0<co>12.975\0<sp>CL4-997-X\0<s
     p>DY6-000-P\0

In the second version, the name or title is implied for each field
by a mnemonic tag. The length of fields is variable in every
sense... from one field to the next, and in the same field from
one record to the next. The data size for one field may be as
short as a single character. Since tags are used, fields may be
dropped entirely when they are empty of data. Some fields may even
be repeated (as in the substitute part field, which occurs twice).
End of data within each field above is indicated by a marker, in
this case "\0".

Here is a variation that reduces the storage space for this
particular record from 163 to 143 bytes. End of a field is
signalled by the tag which begins the following field. An end of
record tag has to be added so that the length of the last field is
defined.
     <pn>CL4-097-B<ds>VALVE SPRING ASSEMBLY<q1>57<q3>16
     212<q4>8004<um>4211<ul>3654<uy>31032<up>23996<eo>1
     000<co>12.975<sp>CL4-997-X<sp>DY6-000-P<fn>

≡≡≡≡->> QUESTION: Under what conditions does removal of the end of
field marker sacrifice information or introduce ambiguity? <<-≡≡≡≡

Why have fielded records been included in a topic on ASCII text?
Precisely because each field contains printable ASCII characters.
Analysis of ASCII fielded records is much simpler than analysis of
their binary counterparts.

Variable length ASCII fielded records are recognized by their
frequent repetition of the tags. The A_PATTRN program can be used
on the first character of the tag to pull out all occurrences.

Here's another example of fielded variable length data:

     000
     001 Historic documents
     002 United States
     003 Civil War
     004 Gettysburg Address
     005 November 19, 1863
     006 Lincoln, Abraham
     006 President Abraham Lincoln
     007 Fourscore and seven years ago our forefathers brought
     007 forth upon this continent a new nation, conceived in liberty
     007 and dedicated to the proposition that all men are created ...

The example contains four levels of heading. 006 may be either an
author or a historic person field; we don't know unless we view
other records or are given access to the list of field
definitions. Field 007 goes on at length. After field 007 there
may be other data such as cross references. The next record is
identified by "000" at the left margin.

The first three columns are, by virtue of position, field
identifiers or tags. The fourth column is blank simply to enhance
readability.

This simple format is a first step from WYSIWYG toward SGML.
Precisely because it is simple, I often use it as an intermediate
step in setting up for indexing. (The earlier FindIt product used
a more complex variation in which the fourth column contained
codes. Hindsight shows that the simpler version is more powerful.)

In fixed length ASCII fielded records, a pattern noticeably
recurs...
possibly blocks of white space, or fields containing dates (MAY 22
92), that stand out readily and shift the same number of columns
left or right at regular vertical intervals.

══════════════════════════════════════════════
6.6 Independent versus continuous data
══════════════════════════════════════════════

Records occur in some physical order within a computer file. The
King in "Alice in Wonderland" prescribed a simple order for
things: "Begin at the beginning, and go on till you come to the
end: then stop." In a document such as a book or a business
report, there is usually a distinguishable beginning and end. In a
database of library books and periodicals, the physical order may
be date of acquisition... the later the date of receipt, the
further the record is toward the end of the sequence of records.
In an inventory file, either part number or date of addition may
govern the physical order.

Try the command

     FIND "HEAD4" SVP_TXT

It turns out that this data set is ordered according to sequential
numbering of the various pieces of correspondence. The result of
the above command has 106 lines, starting as follows:

     @HEAD4 = 417. - TO SAINT LOUISE DE MARILLAC,<B^>1<D> IN ANGERS
     @HEAD4 = 418. - TO LOUIS ABELLY,<B^>1<D> VICAR GENERAL OF BAYONNE
     @HEAD4 = 419. - TO SAINT LOUISE, IN ANGERS
     @HEAD4 = 420. - TO SAINT LOUISE, IN ANGERS
     @HEAD4 = 421. - TO SAINT LOUISE, IN ANGERS
     @HEAD4 = 422. - TO SAINT LOUISE, IN ANGERS
     @HEAD4 = 423. - TO LOUIS LEBRETON,<B^>1<D> IN ROME
     @HEAD4 = 424. - TO JACQUES THOLARD,<B^>1<D> IN ANNECY

Being able to detect sequence within one field helps in the data
analysis. This is because patterns show up more quickly when there
is a strictly ordered field within the data. An ordered field
helps to determine sequence when the total data set has to be
pieced together from a number of files. And if the data has been
garbled or truncated, order facilitates repair.

Ah, yes, Virginia, there is damaged data out there... a lot of it.
There is nothing like indexing a set of data to find all kinds of
errors in it. Every garbled spelling and extraneous piece of
garbage shows up in the list of indexed terms.

If records are truly independent and in no sequence, there is no
need for the retrieval software to display nearby records. But if
there is continuity in the data, the person searching through the
data will like the ability to see the records that occur before or
after the records found by a search.

                          * * * * *

In this topic, we have looked at tools for analyzing ASCII text
files. We found that formatting and markup of ASCII text is
extremely varied. Indexing will become simpler and less time
consuming as standards become widely accepted. Physical storage
presupposes a flat data model; hierarchical models can be inferred
by segregating data into fields. Continuous data is easier to work
with in index preparation than truly independent non-sequential
files.
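
As a closing illustration of segregating data into fields, here is
a minimal sketch in C of splitting the third form of the inventory
record in section 6.5, where each field ends at the tag that
begins the next one and "<fn>" ends the record. It is not one of
the MIR programs. The names split_record and struct field and the
size limits are hypothetical. The sketch assumes that every tag is
exactly two characters between angle brackets, that the record has
been joined into one unbroken string (the 50 column wrapping shown
earlier is removed), and that field content never contains
anything that looks like a tag.

```c
#include <string.h>

#define TAG_LEN    4    /* "<pn>": two characters plus the brackets */
#define MAX_FIELDS 32   /* assumed limit on fields per record */
#define MAX_VALUE  80   /* assumed limit on bytes kept per field */

struct field {
    char tag[3];                  /* mnemonic without brackets, e.g. "pn" */
    char value[MAX_VALUE + 1];    /* field content, NUL terminated */
};

/* Return 1 if p points at a sequence of the form "<xy>". */
static int is_tag(const char *p)
{
    return p[0] == '<' && p[1] != '\0' && p[2] != '\0' && p[3] == '>';
}

/* Split one record into fields. Each field runs from the end of its
   own tag to the start of the next tag; "<fn>" ends the record.
   Returns the number of fields stored in out, or -1 if the record
   does not start with a tag, has more than MAX_FIELDS fields, or
   runs out before "<fn>" is seen. */
int split_record(const char *rec, struct field *out)
{
    int n = 0;
    const char *p = rec;

    while (is_tag(p)) {
        const char *start, *q;
        size_t len;

        if (strncmp(p, "<fn>", TAG_LEN) == 0)
            return n;                     /* end of record tag reached */
        if (n == MAX_FIELDS)
            return -1;
        out[n].tag[0] = p[1];
        out[n].tag[1] = p[2];
        out[n].tag[2] = '\0';
        start = q = p + TAG_LEN;
        while (*q != '\0' && !is_tag(q))  /* scan forward to the next tag */
            q++;
        len = (size_t)(q - start);
        if (len > MAX_VALUE)
            len = MAX_VALUE;              /* truncate oversized content */
        memcpy(out[n].value, start, len);
        out[n].value[len] = '\0';
        n++;
        p = q;
    }
    return -1;                            /* data ended without "<fn>" */
}
```

Repeated fields such as the two "sp" entries simply come back as
separate elements in the order found. The third assumption is the
fragile one: if content can contain a tag-like sequence, this
splitter will cut the field there.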