Text File | 1992-06-29 | 22KB | 492 lines
═══════════════════════════════════════
6. WORKED EXAMPLES...
VARIATIONS IN ASCII TEXT
═══════════════════════════════════════
The next three topics will be richer to the extent that
you and other readers provide samples that can be used to explain
the various forms in which data is held in real files. The variety
is staggering... in working with between 200 and 300 large
databases in the late 1980s, I found that only a few formats and
sets of rules were replicated entirely across databases. Many more
databases had unique patterns or combinations of rules. But a word
of encouragement... analysis really does get easier along the way.
════════════════════════════════
6.1 Other analysis tools
════════════════════════════════
Here is an assortment of programs useful with ASCII
text files. Source code is included on the diskettes.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: lines file_name[s]
Provides a quick count of the number of lines in each of
one or more files.
input: Any file[s], but most useful if ASCII text.
output: A one line report on the screen of the number of lines in
each input file.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
The DOS DIR command gives one measure of the size of a file.
The LINES command is another. It is quick. Try it on some of the
source files. For example,
LINES A_PATTRN.C A_BYTES.C LINES.C DOSIFY.C
yielded this answer on the screen:
344 lines in file a_pattrn.c
335 lines in file a_bytes.c
132 lines in file lines.c
154 lines in file dosify.c
965 lines TOTAL
The actual count may differ when you try it; that would be on
account of later revisions in your copy of each of these files.
LINES is particularly useful in preparation for a SORT
of a file.
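For readers who want to see the idea behind LINES without opening
the C source, here is a minimal sketch in Python. It is not the
diskette program; the names count_lines and report are my own, and
it assumes that a "line" means one line feed byte, so a final line
with no line feed is not counted.

```python
def count_lines(path):
    """Count line feed bytes in one file, read as binary so any file works."""
    count = 0
    with open(path, "rb") as handle:
        # Read in blocks so that very large files do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            count += chunk.count(b"\n")
    return count


def report(paths):
    """Return report lines in the style of the screen output shown above."""
    lines = []
    total = 0
    for path in paths:
        n = count_lines(path)
        total += n
        lines.append(f"{n:7d} lines in file {path}")
    lines.append(f"{total:7d} lines TOTAL")
    return lines
```

Printing each string returned by report(["a_pattrn.c", "lines.c"])
gives a listing similar in layout to the one above.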
≡≡≡≡->> QUESTION:
The programs that use lists of file names as inputs
would be improved by a function to expand out wild
cards in file names (each ? to be replaced by a single
character, each * to be replaced by zero or one or
several characters). Try your hand at it and share the
result.
<<-≡≡≡≡
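As a starting point for that exercise, here is one possible sketch
in Python. It follows only the two rules stated in the question (?
is exactly one character, * is zero or more characters) and does
not try to reproduce every quirk of DOS wild cards. The function
names wild_match and expand are my own, and folding names to lower
case before comparing is an assumption made to mimic DOS.

```python
import os


def wild_match(pattern, name):
    """Return True if name matches pattern (? = one char, * = zero or more)."""
    p = n = 0
    star = -1   # position in pattern of the most recent *
    mark = 0    # position in name where that * began matching
    while n < len(name):
        if p < len(pattern) and pattern[p] == "*":
            star, mark = p, n
            p += 1
        elif p < len(pattern) and (pattern[p] == "?" or pattern[p] == name[n]):
            p += 1
            n += 1
        elif star != -1:
            # Mismatch: let the last * absorb one more character and retry.
            p = star + 1
            mark += 1
            n = mark
        else:
            return False
    # Any * left over at the end of the pattern may match nothing.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


def expand(pattern, directory="."):
    """List the names in directory that match pattern, ignoring case."""
    wanted = pattern.lower()
    return [name for name in sorted(os.listdir(directory))
            if wild_match(wanted, name.lower())]
```

A program such as LINES could then call expand on each command
line argument and process every name that comes back.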
Another program provides an analysis of line lengths:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: a_len [interval] file_name[s]
Analyze the distribution of line lengths up to 1024 bytes
within any file. The reporting interval (an integer from
1 to 100) is a count of the lengths that will be grouped
together. For example, an interval of 10 means that
frequencies of length 0, length 1-10, length 11-20, etc.
are shown in the report. The default interval is 10. If
the first file name starts with numeric digits, give the
interval explicitly so the name is not mistaken for it.
input: Any ASCII file[s].
output: file_name.len which reports the frequency of line lengths
occurring in the file. Lengths exclude carriage returns
and line feeds.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Here is the result of
A_LEN SVP_TXT
1 - 10: 63 1.5%
11 - 20: 154 3.7%
21 - 30: 160 3.9%
31 - 40: 186 4.5%
41 - 50: 185 4.5%
51 - 60: 832 20.2%
61 - 70: 2538 61.6%
Over 80 per cent of lines are between 51 and 70 bytes
long. None are longer. That's a very strong indication that the
file is printable text. Non-displayable files are much more likely
to have random distances between line feed characters. (For a more
detailed list, try the command A_LEN 1 SVP_TXT.)
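The grouping rule can be sketched in a few lines of Python. This
is an illustration only, not the A_LEN source: the function names
are hypothetical, the 1024 byte limit is not reproduced, and the
report layout merely resembles the one above. Length 0 gets its
own group; every other length falls into a group of "interval"
consecutive lengths.

```python
def length_histogram(data, interval=10):
    """Count line lengths in a byte string, grouped by interval.

    Group 0 holds empty lines; group k holds lengths
    (k-1)*interval+1 through k*interval.
    """
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                       # nothing follows the last line feed
    counts = {}
    for line in lines:
        length = len(line.rstrip(b"\r"))  # exclude carriage returns
        group = 0 if length == 0 else (length - 1) // interval + 1
        counts[group] = counts.get(group, 0) + 1
    return counts


def format_report(counts, interval=10):
    """Turn the grouped counts into report lines with percentages."""
    total = sum(counts.values())
    rows = []
    for group in sorted(counts):
        if group == 0:
            label = "0"
        else:
            label = f"{(group - 1) * interval + 1} - {group * interval}"
        share = 100.0 * counts[group] / total
        rows.append(f"{label:>11}: {counts[group]:6d} {share:5.1f}%")
    return rows
```

Calling length_histogram with an interval of 1 gives the detailed
list, one group per distinct length.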
A_LEN also tells us whether line-oriented utility
programs are likely to work. Some line editors behave badly when
data has long lines. It is common for versions of UNIX "vi", for
example, to choke on lines over 256 bytes in length.
LINE_NUM has a variety of uses.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: line_num [ starting_line_no ] < stdin > stdout
Assign a sequence number to each line in a file, starting
either at zero or at a user-specified sequence number.
input: Any printable ASCII file.
output: One line for each line of input. A sequence number is
left justified, followed by a tab, then the input line
exactly as received. Empty lines are counted, but left
empty.
writeup: MIR TUTORIAL ONE, topic 6
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Try it with the DOS command FIND. For example,
LINE_NUM < A_BYTES.C | FIND "Usage"
The result on the screen is:
62 void process(), Usage_(), report(), non_exist() ;
88 Usage_();
114 Usage_()
In other words, the term "Usage" occurs on lines 62, 88 and 114.
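The same pairing of numbering and searching can be sketched in
Python. This is not the LINE_NUM source. The names number_lines
and find are hypothetical, and I have read "empty lines are
counted, but left empty" to mean that an empty line uses up a
sequence number but prints nothing.

```python
def number_lines(lines, start=0):
    """Yield each line prefixed by a sequence number and a tab."""
    for offset, line in enumerate(lines):
        body = line.rstrip("\r\n")
        if body == "":
            yield ""                       # counted, but left empty
        else:
            yield f"{start + offset}\t{body}"


def find(numbered_lines, text):
    """Keep only the numbered lines that contain text (like DOS FIND)."""
    return [line for line in numbered_lines if text in line]
```

With a file opened as "source", find(number_lines(source), "Usage")
returns the matching lines, each still carrying its number.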
═════════════════════════════════
6.2 ASCII markup patterns
═════════════════════════════════
The simplest form of text is called WYSIWYG (pronounced
wizzy-wig). It stands for "What you see is what you get." A
document or communication consists of content and form. Depending
on how the author has arranged the content of a WYSIWYG file, you
as a reader can get all of the content, but may miss much of the
intended form.
Does form matter? On even the simplest one page memo,
yes! Lack of form imposes obstacles to understanding. We look to
form to highlight answers to key questions: Who is this from?
When was it prepared? What was uppermost in the author's mind?
For whom is the message intended? What response is desired or
intended? What is being stressed? What parts are subordinate
explanation? How does one communicate a response to the author?
Visualize a memo that simply runs everything together as a stream
of words in a single paragraph. The answers may be there. But
reading time goes way up. The message is therefore less likely to
be read in full, and less likely to be understood correctly if it
is read.
The longer and the more complex the communication, the
more form matters. Form guides the reader through the structure of
the document. For example:
» The level of a heading is indicated by location on a
page, type size, case, font selection, proximity to
other material, use of graphic enhancement, associated
numbering, and even color. An unnumbered heading,
centred and alone on a page, in bold upper case
italics, carries a message quite distinct from exactly
the same words preceded by "6.5.3", left justified,
primarily lower case, and with following text
continuing on the same line in the same font.
» Structure is needed to sort out the cross referencing
of complex material: Are footnotes at the bottom of
the page, at the end of the chapter, or in their own
section among the support sections at the end of the
document?
» A table of contents is intended to clarify structure.
» A preface underlines purpose.
» An index helps the user to find material of particular
interest.
You have just read a series of indented points. They
are parallel and separated. The form is more readily understood
than the same content run together into one paragraph.
A computer file is essentially a stream of bytes. To
indicate form, there has to be either:
» a definition of the structure separate from the file;
or
» signals embedded within the file content.
The simplest external definition is that a "line"
contains a certain number of bytes (often 80). Alternately, the
external definition may itself be a lengthy file.
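To make the simplest external definition concrete, here is a small
Python sketch. The only structural knowledge is the line width,
and it lives outside the data as a parameter. The function name
fixed_lines is hypothetical.

```python
def fixed_lines(data, width=80):
    """Cut a byte stream into lines of `width` bytes (the last may be short)."""
    return [data[i:i + width] for i in range(0, len(data), width)]
```

Nothing in the byte stream itself marks where a line ends. If the
width is lost or wrong, every line after the first comes out
misaligned.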
Internal signals or codes must be distinguishable from
the content. These signals may involve characters unused in the
text (nulls, non-printing characters); they may be sequences of
printable characters that are guaranteed not to appear as part of
the content. Whatever their format, internal structure and
formatting codes should also be consistent across the entire file.
(But don't count on it. I was given a small database in which the
sequence "<I>" had two or more quite distinct meanings. It turned
out the database provider knew of the flaw and, instead of
correcting it, resorted to physically pasting up copy for photo
reproduction. Inconsistency is costly.)
══════════════════════════════════
6.3 Standard Generalized
Markup Language (SGML)
══════════════════════════════════
The word "rigor" is used in mathematics to convey
notions of completeness, logic and disciplined consistency. The
Standard Generalized Markup Language is an attempt to ensure rigor
in the transfer of documents. Standards are paradoxical... at one
and the same time an imposition and a convenience. In the case of
SGML, the imposition is in having to learn and methodically apply
methods of declaring structure and of representing codes within
content. In exchange for the pain, the gain is clarity in what is
intended by the author (or subsequent editor) in documents that
adhere to the standard. As the standard takes hold, we indexers
profit from increasing consistency across databases. In the long
run, we will spend much less time deciphering "orphan" database
structures and markup methods.
In SGML, the elements of form and structure are
declared in a distinct document which can be transmitted with the
database(s). Tagging is a separate task. Wide variability is
allowed in the symbols used for tagging content, but consistency is
enforced in the methods of tagging. One gets the impression that
the early writers on SGML have inadvertently set more of a standard
than they intended; one sees their particular choices of symbols
being picked up as if they were part of the standard. Example:
"the figures shown in <tableref>Table 7.2</tableref>..." where "<"
starts an opening tag and "</" a closing tag. For our purposes,
the more consistency the better! The less time spent setting up
for alternative tag sets, the quicker we can get data into a form
that permits automated indexing.
One of the beauties of SGML is its breadth. WYSIWYG
can be viewed as untagged SGML; one needs simply the declaration of
an unstructured byte stream to be the structure. We can write
software to parse and manipulate this simplest form of text. The
software may be expanded on an as-needed basis whenever we wish to
add structure and formatting. In that sense, the form of text we
use for automated indexing may be considered a form of SGML. If we
want more speed in the inversion and indexing process, we can adopt
a specific selection of tags to be recognized by our parser. The
cost is that we must convert data received from others to our tag
types. If the incoming data is true SGML, then our preprocessing
software can be a simple table replacement algorithm (easy stuff to
write).
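A table replacement step of that kind might look like the Python
sketch below. The tag pairs in TAG_TABLE are invented for the
illustration ("<xr>" is not a tag defined anywhere in this
tutorial), and real SGML input would also need its declaration
consulted; this handles only one-for-one tag substitution.

```python
import re

# Hypothetical table: incoming tag on the left, our own tag on the right.
TAG_TABLE = {
    "<tableref>": "<xr>",
    "</tableref>": "</xr>",
}


def retag(text, table=TAG_TABLE):
    """Replace every tag found in the table with its counterpart."""
    if not table:
        return text
    # Longest tags first, so a short tag never pre-empts a longer one.
    ordered = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tag) for tag in ordered))
    return pattern.sub(lambda match: table[match.group(0)], text)
```

Supporting a new data supplier then means supplying a new table,
not writing new code.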
Tutorial THREE deals with automated indexing. The
primary version of the software is kept simple for instructional
purposes. But there is nothing stopping us from building more
sophisticated parsers based on more elaborate SGML declarations.
Let's do it together, basing the expansions on real world needs
that you encounter.
═════════════════════════════════════════
6.4 Free versus hierarchical text
═════════════════════════════════════════
Text data includes virtually anything that can be typed
on a computer keyboard. The most familiar is free text. Natural
language is our normal way of communicating with one another.
Words and phrases are grouped according to grammatical rules to
form sentences and paragraphs. Words and phrases give each other
context, so that communication creates a picture or impression in
the mind of the person receiving it. The text is free in the sense
that from a computer standpoint no divisions or rules are implied.
There is no computer-based definition or limit to the length of a
word or line or paragraph. This paragraph, considered alone and by
itself, is an example of free text.
In lengthy communication, free text is sub-divided.
The divisions may be as subtle as an extra line feed to mark the
end of paragraphs. The more complex the document, the more likely
it is divided into sections, chapters, or articles. Subdivisions
are meant to be an aid to understanding. Our comprehension of a
book is improved if we can associate what we are reading with a
chapter name and book title. Hierarchical data is a term that
covers this type. Each paragraph is connected with a hierarchy of
headings (book, chapter, section, sub-section). Hierarchical data
is also found in newspaper articles, business reports, dictionary
entries, and encyclopedia entries. Each has one or more levels of
headings.
A variation on hierarchical data is text that is cross
referenced. Cross references are an internal form of subject
index. They are created manually and inserted within the text.
Because human judgment is involved, their creation is both
expensive and subjective. Where a record touches on more
than one subject, multiple cross reference paths may branch out
from the one record. Heavily studied databases such as religious
scriptures are often cross referenced.
At some point, a threshold is passed whereby the rules
are no longer (only) implied in the natural language, but are
expressed in computer terms as well. The file is no longer treated
as a simple byte stream. A hierarchical model of some sort is
imposed on the data to distinguish among the various levels and
categories of information. In Tutorial TWO we will see how the
power of a hierarchical model can be wedded to the simplicity of
a byte stream. It comes down to a simple trick of doing away with
the assumption that records have to be numbered consecutively.
More to follow!
════════════════════════════════════════
6.5 Fielded variable length text
════════════════════════════════════════
Consider the following:
Part Number: CL4-097-B
Description: VALVE SPRING ASSEMBLY
Quantity on hand, location 1: 57
Quantity on hand, location 2: 0
Quantity on hand, location 3: 16,212
Quantity on hand, location 4: 8,004
Usage, this month: 4,211
Usage, same month last year: 3,654
Usage, current year to date: 31,032
Usage, last year to date: 23,996
Economic order quantity: 1,000
Cost per unit: $ 12.975
Permitted substitute part #: CL4-997-X
Permitted substitute part #: DY6-000-P
The above is an inventory record. Each line is a
"field", an element of data which describes one attribute of the
item under discussion. Fielded data is very common in business
records. Here is a variation on the same data:
<pn>CL4-097-B\0<ds>VALVE SPRING ASSEMBLY\0<q1>57\0
<q3>16212\0<q4>8004\0<um>4211\0<ul>3654\0<uy>31032
\0<up>23996\0<eo>1000\0<co>12.975\0<sp>CL4-997-X\0
<sp>DY6-000-P\0
In the second version, the name or title is implied for
each field by a mnemonic tag. The length of fields is variable in
every sense... from one field to the next, and in the same field
from one record to the next. The data size for one field may be as
short as a single character. Since tags are used, fields may be
dropped entirely when they are empty of data. Some fields may even
be repeated (as in the substitute part field, which occurs twice).
End of data within each field above is indicated by a
marker, in this case "\0". Here is a variation that reduces the
storage space for this particular record from 165 to 143 bytes
(counting each "\0" marker as the two bytes shown).
End of a field is signalled by the tag which begins the following
field. An end of record tag has to be added so that the length of
the last field is defined.
<pn>CL4-097-B<ds>VALVE SPRING ASSEMBLY<q1>57<q3>16
212<q4>8004<um>4211<ul>3654<uy>31032<up>23996<eo>1
000<co>12.975<sp>CL4-997-X<sp>DY6-000-P<fn>
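Here is a minimal Python sketch of how a program might split this
last form back into fields. It rests on assumptions that are mine,
not part of the record definition: tags are always two lower case
letters or digits between "<" and ">", the character "<" never
occurs inside field data, and the line breaks in the printed
example are only page wrapping. The names RECORD and parse_record
are hypothetical.

```python
import re

# The record shown above, with the page-wrap line breaks removed.
RECORD = ("<pn>CL4-097-B<ds>VALVE SPRING ASSEMBLY<q1>57<q3>16"
          "212<q4>8004<um>4211<ul>3654<uy>31032<up>23996<eo>1"
          "000<co>12.975<sp>CL4-997-X<sp>DY6-000-P<fn>")

# A tag, then everything up to the start of the next tag.
FIELD = re.compile(r"<([a-z0-9]{2})>([^<]*)")


def parse_record(record):
    """Return (tag, value) pairs in order, keeping repeated tags."""
    fields = []
    for tag, value in FIELD.findall(record):
        if tag == "fn":          # end of record tag carries no data
            break
        fields.append((tag, value))
    return fields
```

A list of pairs is used instead of a dictionary so that the
repeated "sp" field survives with both of its values.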
≡≡≡≡->> QUESTION:
Under what conditions does removal of the end of field
marker sacrifice information or introduce ambiguity?
<<-≡≡≡≡
Why have fielded records been included in a topic on
ASCII text? Precisely because each field contains printable ASCII
characters. Analysis of ASCII fielded records is much simpler than
analysis of their binary counterparts. Variable length ASCII
fielded records are recognized by their frequent repetition of the
tags. The A_PATTRN program can be used on the first character of
the tag to pull out all occurrences.
Here's another example of fielded variable length data:
000
001 Historic documents
002 United States
003 Civil War
004 Gettysburg Address
005 November 19, 1863
006 Lincoln, Abraham
006 President Abraham Lincoln
007 Fourscore and seven years ago our forefathers brought
007 forth upon this continent a new nation, conceived in liberty
007 and dedicated to the proposition that all men are created
...
The example contains four levels of heading. 006 may
be either an author or a historic person field; we don't know
unless we view other records or are given access to the list of
field definitions. Field 007 goes on at length. After field 007
there may be other data such as cross references. The next record
is identified by "000" at the left margin. The first three columns
are, by virtue of position, field identifiers or tags. The fourth
column is blank simply to enhance readability. This simple format
is a first step from WYSIWYG toward SGML. Precisely because it is
simple, I often use it as an intermediate step in setting up for
indexing. (The earlier FindIt product used a more complex
variation in which the fourth column contained codes. Hindsight
shows that the simpler version is more powerful.)
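Reading this format takes very little code, which is part of its
appeal. The Python sketch below is an illustration and not the
software of the later tutorials; the function names are
hypothetical. It assumes the tag is exactly three digits in
columns one to three, that column four is a blank, and that "000"
always opens a new record. Lines that do not begin with three
digits are skipped.

```python
def parse_tagged_lines(lines):
    """Group 'NNN text' lines into records; each record is a list of pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        tag, text = line[:3], line[4:]
        if len(tag) != 3 or not tag.isdigit():
            continue                 # not a tagged line; skipped in this sketch
        if tag == "000":
            records.append([])       # "000" at the left margin opens a record
        elif records:
            records[-1].append((tag, text))
    return records


def by_tag(record):
    """Collect the text lines of one record under their tags."""
    grouped = {}
    for tag, text in record:
        grouped.setdefault(tag, []).append(text)
    return grouped
```

Note that the format alone cannot say whether a repeated tag is a
second value (as with 006) or a continuation line (as with 007);
that has to come from the list of field definitions.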
In fixed length ASCII fielded records, a pattern noticeably
recurs... possibly blocks of white space, or fields containing
dates (MAY 22 92), that stand out readily and shift the same number
of columns left or right at regular vertical intervals.
══════════════════════════════════════════════
6.6 Independent versus continuous data
══════════════════════════════════════════════
Records occur in some physical order within a computer
file. The King of Hearts in "Alice in Wonderland" prescribed a
simple order for things: "Begin at the beginning, and go on till
you come to the end: then stop." In a document such as a book
or a business report, there is usually a distinguishable beginning
and end. In a database of library books and periodicals, the
physical order may be date of acquisition... the later the date of
receipt, the further the record is toward the end of the sequence
of records. In an inventory file, either part number or date of
addition may govern the physical order.
Try the command
FIND "HEAD4" SVP_TXT
It turns out that this data set is ordered according to sequential
numbering of the various pieces of correspondence. The result of
the above command has 106 lines, starting as follows:
@HEAD4 = 417. - TO SAINT LOUISE DE MARILLAC,<B^>1<D> IN ANGERS
@HEAD4 = 418. - TO LOUIS ABELLY,<B^>1<D> VICAR GENERAL OF BAYONNE
@HEAD4 = 419. - TO SAINT LOUISE, IN ANGERS
@HEAD4 = 420. - TO SAINT LOUISE, IN ANGERS
@HEAD4 = 421. - TO SAINT LOUISE, IN ANGERS
@HEAD4 = 422. - TO SAINT LOUISE, IN ANGERS
@HEAD4 = 423. - TO LOUIS LEBRETON,<B^>1<D> IN ROME
@HEAD4 = 424. - TO JACQUES THOLARD,<B^>1<D> IN ANNECY
Being able to detect sequence within one field helps in
the data analysis. This is because patterns show up more quickly
when there is a strictly ordered field within the data. An ordered
field helps to determine sequence when the total data set has to be
pieced together from a number of files. And if the data has been
garbled or truncated, order facilitates repair. Ah, yes, Virginia,
there is damaged data out there... a lot of it. There is nothing
like indexing a set of data to find all kinds of errors in it.
Every garbled spelling and extraneous piece of garbage shows up in
the list of indexed terms.
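Checking an ordered field for gaps is easy to automate. The Python
sketch below assumes heading lines shaped like the @HEAD4 lines
shown above (tag, equals sign, a number, a period); the function
names are hypothetical. It reports each place where the numbering
fails to advance by exactly one, which is where truncated or
misplaced data is most likely to be hiding.

```python
import re

# "@HEAD4 = 417. - TO ..." : capture the sequence number.
HEAD = re.compile(r"^@HEAD4 = (\d+)\.")


def heading_numbers(lines):
    """Extract the sequence numbers from the @HEAD4 lines, in file order."""
    numbers = []
    for line in lines:
        match = HEAD.match(line)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def sequence_breaks(numbers):
    """Return (previous, current) pairs where the step is not exactly one."""
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b != a + 1]
```

An empty list from sequence_breaks means the field is strictly
consecutive across the lines examined.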
If records are truly independent and in no sequence,
there is no need for the retrieval software to display nearby
records. But if there is continuity in the data, the person
searching through the data will like the ability to see the records
that occur before or after the records found by a search.
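In code, showing the neighbours of a hit is a matter of slicing
the record sequence around the hit position. The Python sketch
below is only an illustration; the name with_neighbours and its
parameters are hypothetical, and it assumes the records are held
in their physical order.

```python
def with_neighbours(records, hit, before=1, after=1):
    """Return the record at index hit plus the records around it."""
    first = max(0, hit - before)               # do not run off the front
    last = min(len(records), hit + after + 1)  # or past the end
    return records[first:last]
```

For truly independent records the same call with before=0 and
after=0 returns just the hit.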
* * * * *
In this topic, we have looked at tools for analyzing
ASCII text files. We found that formatting and markup of ASCII
text is extremely varied. Indexing will become simpler and less
time consuming as standards become widely accepted. Physical
storage presupposes a flat data model; hierarchical models can be
inferred by segregating data into fields. Continuous data is
easier to work with in index preparation than truly independent
non-sequential files.