Text File | 1992-06-29 | 16KB | 345 lines
═════════════════════════════════
8. WORKED EXAMPLES...
BINARY DATA
═════════════════════════════════
≡≡≡≡->> QUESTION:
If you have binary data that you can't decipher, and if
you are able to give permission for a sample to be
included in this topic as a worked example, then here's
an opportunity! The first few samples received that I
feel would be useful teaching tools for other readers
will be built in, and the sender will receive a copy of
the expanded topic.
<<-≡≡≡≡
════════════════════════════════════
8.1 The preprocessing option
════════════════════════════════════
Recall that the eight bits in a byte can occur in 256
combinations. Most of the low end control characters and the 128
values with the high bit set are not printable. In a DOS file,
allowance may be made for accented characters and, in some cases,
for mathematical and Greek symbols. In topic 4 we looked at
reasons why the non-printable characters might occur in a file...
packed numbers, binary markup, compression substitutions, etc. To
this list we may add data that is truly binary... numeric values
occupying one, two or four bytes, or so-called "real" numbers with
digits before and after the decimal point. Alternatively, files may
contain data that are not intended for display on ASCII screens at
all... audio, graphics, animation, etc.
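A byte survey of this kind can be sketched in a few lines of
Python. This is only an illustration, not the survey program used
elsewhere in the tutorial; the function name byte_survey and the
three class names are our own. Note that carriage return, line
feed and tab are counted as control characters here even though
they are normal in text files.

```python
from collections import Counter

def byte_survey(data: bytes) -> dict:
    """Count bytes in three broad classes: control, printable, high bit set."""
    survey = {"control": 0, "printable": 0, "high_bit": 0}
    for value, n in Counter(data).items():
        if value >= 0x80:
            survey["high_bit"] += n      # the 128 values with the high bit set
        elif value < 0x20 or value == 0x7F:
            survey["control"] += n       # low-end control characters and DEL
        else:
            survey["printable"] += n     # ordinary printable ASCII
    return survey

sample = b"Total\x00\x02\x32\x80 due\r\n"
print(byte_survey(sample))
```

Any non-zero count in the "high_bit" class, or control characters
other than the usual line endings and tabs, suggests the file
needs a closer look before indexing.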
It would be technically possible to index some types of
binary data directly, provided the software were written to take
into account every form of data which might be presented for
indexing. The problem with specialized software is that disaster
lurks one thoughtless command away. Sooner or later somebody is
going to feed in data with peculiarities not foreseen by the
programmer. In computer parlance, the "results are unpredictable."
Specialized software requires specialized data.
For most purposes, it is preferable to use a single
general indexing program. Indexing at its core is a process of
inverting a huge matrix. That's fairly complex and has to be made
as efficient as possible. (We go into all the details in Tutorial
THREE.) We don't want to load that software down with complex
parsing. It's better to do any specialized work separately. Hence
we add a step called "preprocessing" to bring data to a standard
ASCII format. The following steps are applied, first to a sample,
later to the full set of data:
» Analyze data
» Select or create preprocessing tools
» Preprocess
» Verify preprocessed data is to standard
» Invert
» Validate that indexes fulfil specifications
» Set parameters, select display software
» Finalize data
» Test indexing and retrieval result
We raise the subject of preprocessing now to show that
all binary data has to be transformed prior to indexing. (This is
independent of how the data is made available to the searcher, in
its original form or preprocessed or otherwise modified.) Packed
numbers must be unpacked, numeric fields changed to their ASCII
equivalent, blocking data used to extract ASCII, markup used to
identify records and their parts, compressed data uncompressed,
etc.
═══════════════════════════
8.2 File signatures
═══════════════════════════
Okay, you have some data to index and a byte survey
shows that it contains non-printing characters. Where to from
here?
All data is created through the use of some form of
software... Optical Character Recognition, word processing, report
generators, transaction processing, and many other forms. Data
that is not printable text often is identified internally so that
its parent software will recognize the file if an attempt is made
to process it. The signature is often within the first four bytes
of a file. If not, the first 128 bytes usually contain important
clues. Recall that to display the first 128 bytes, the command
DUMP filename 0 128
puts the data on the screen in Hexadecimal and ASCII format.
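If DUMP is not at hand, a similar display can be approximated as
follows. This is a rough sketch, not the DUMP program itself; the
layout (offset, hex bytes, then text on the same line) and the
treatment of the second argument as an exclusive end offset are
assumptions.

```python
def dump(data: bytes, start: int = 0, end: int = 128, width: int = 16) -> list:
    """Return display lines showing bytes start..end-1 in hex and ASCII."""
    lines = []
    pad = width * 3 - 1                  # width of a full row of hex pairs
    for offset in range(start, min(end, len(data)), width):
        chunk = data[offset:min(offset + width, end)]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as dots in the text column.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:6d}: {hex_part.ljust(pad)}  {text_part}")
    return lines

for line in dump(b"\xffWPC\x00\x01 sample header bytes"):
    print(line)
```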
Examples:
» WordPerfect files are identified in the first four
bytes by a hex value \FF followed by the letters "WPC".
» Microsoft Word contains many null bytes among the first
128. Two items stand out, a style sheet name and a
printer name, such as NORMAL.STY and NECP6.
» DOS executable files have the signature "MZ" in the
first two bytes. (But don't try to index their
content!)
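The two fixed signatures above lend themselves to a simple lookup.
The sketch below is illustrative only; the table name SIGNATURES
and the function identify are hypothetical, and the list holds
just the two signatures named in the text. Microsoft Word is left
out because the clues described for it are not a fixed byte
pattern.

```python
# (leading bytes, description) pairs taken from the examples in the text.
SIGNATURES = [
    (b"\xffWPC", "WordPerfect document"),
    (b"MZ", "DOS executable"),
]

def identify(header: bytes) -> str:
    """Return a description if the header starts with a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(b"\xffWPC\x00\x00"))
print(identify(b"MZ\x90\x00"))
print(identify(b"Plain text file"))
```

As more signatures are collected, they can be appended to the
list without changing the function.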
≡≡≡≡->> QUESTION:
Another useful item you might contribute... the first
128 bytes of various file types, to extend the list of
file "signatures". Use CPB to make these small samples
and identify each sample by the software used to create
it.
<<-≡≡≡≡
════════════════════════════════════════════
8.3 Converting word processing files
════════════════════════════════════════════
Many word processors have an option to output ASCII
files. This option may be used to simplify preprocessing.
The normal sequence to store a Microsoft Word file is
Escape / Transfer / Save (three keystrokes \ESC T S). If a file
name has already been specified, its name is presented for
confirmation. To save the file in normal word processing mode,
with its underlining, bold characters, etc., simply press return.
To convert the file to ASCII, use the same \ESC T S sequence. You
might modify the file name (for example, MYFILE.DOC to MYFILE.ASC).
Then press TAB or the right arrow. The cursor moves to the right,
past the word "formatted" and highlights Yes in a Yes / No choice.
Press N for No. The program checks with you: "Enter Y to confirm
loss of formatting." Press Y and the ASCII version is saved. If
you do a byte survey on the result, you should find no binary data.
There may be some work left to put the file into a standard format
for indexing, but at least you are working with an ASCII file with
no undefined codes.
WordPerfect's sequence is not intuitively obvious. But
it produces a clean result. It's CTL-F5 1 1. You are asked to
supply a file name. Once that's in place, hit return and the ASCII
version is saved. One of the nice features of this conversion is
that all tabs are replaced by blanks, and you get a WYSIWYG file
that preserves all indenting and centering. Note: My copy of
WordPerfect 5.1 includes a CONVERT.EXE utility dated 02-08-91. The
result is NOT the same. CONVERT lapses occasionally on indents and
pads the end of the ASCII output with many unnecessary hex \1A
bytes. (Incidentally, I use WordPerfect to write source code and
to clean it up after debugging. I set margins to five characters
left and right, and tabs to four. It's an easy habit to use the
CTL-F5 1 1 sequence to save the file when ready to compile or tuck
away for future use.)
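If you do end up with output padded by \1A bytes, trimming them is
a one-line job. A minimal sketch, assuming the padding appears
only at the very end of the file (the function name is our own):

```python
def strip_eof_padding(data: bytes) -> bytes:
    """Remove any run of hex 1A bytes from the end of the data."""
    return data.rstrip(b"\x1a")

padded = b"last line of text\r\n" + b"\x1a" * 20
print(strip_eof_padding(padded))
```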
≡≡≡≡->> QUESTION:
Sequences for other word processors, please!
<<-≡≡≡≡
═════════════════════════════════════
8.4 Binary deblocking lengths
═════════════════════════════════════
Binary deblocking may be recognized by three features.
First, there is a low proportion of binary characters in a file that
is mostly printable. Second, the binary characters occur in small
bursts at slightly varying distances. This is in contrast to the
pattern with packed numbers (considered in the preceding topic) in
which the non-printing characters recurred at specific points
within fixed length records. The third feature is that the binary
portions translate into lengths corresponding more or less to the
distance between the current and the next burst.
This third feature is not readily apparent. Display
the data in hexadecimal and ASCII on either side of a burst.
DUMP filename 19200 19400
Within the result, look at the binary bytes which in this case are
\03\9e (decimal equivalent = 3 X 256 + 9 X 16 + 14 X 1 = 926,
provided bytes are in sequence of high to low order).
19264: 65 6e 64 20 6f 66 20 70 72 65 76 69 6f 75 73 20
end of previous
19280: 62 6c 6f 63 6b 2e 03 9e 42 65 67 69 6e 6e 69 6e
block...Beginnin
19296: 67 20 6f 66 20 74 68 65 20 6e 65 78 74 20 62 6c
g of the next bl
To test whether these are binary blocking data, do a
dump 926 bytes further on and see if there are binary bytes either
there or immediately adjacent. Variations occur depending on
whether the blocking data include or exclude their own length, in
this case two bytes.
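The hop-and-check test can be automated. The sketch below assumes
two-byte lengths stored high byte first, as in the sample; the
function name follow_blocks and its parameters are hypothetical.
It simply follows the chain of lengths and reports where it lands.

```python
def follow_blocks(data: bytes, first: int, includes_own_length: bool,
                  limit: int = 5) -> list:
    """Hop from one 2-byte length field to the next.

    Returns (offset, length) pairs for each length field visited.
    """
    visited = []
    pos = first
    while len(visited) < limit and pos + 2 <= len(data):
        length = int.from_bytes(data[pos:pos + 2], "big")
        if length == 0:
            break                        # a zero length would never advance
        visited.append((pos, length))
        # Skip the length field itself unless it is already counted.
        pos += length if includes_own_length else length + 2
    return visited

# Cooked data: three blocks whose lengths exclude the 2-byte field.
blocks = [b"A" * 926, b"B" * 40, b"C" * 7]
data = b"".join(len(b).to_bytes(2, "big") + b for b in blocks)
print(follow_blocks(data, 0, includes_own_length=False))
```

If the guess about including or excluding the length field is
wrong, the second hop lands in the middle of text and the
"length" found there is implausible. Try both settings and keep
the one that produces a consistent chain.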
You may run into cases of extended block lengths,
possibly four bytes, followed by local record or sub-block lengths
which are usually two bytes each. The signal for this is four or
six bytes at or near the top of the file. It is found in some
library MARC records. In the next topic we introduce a program to
deblock data in this form.
Incidentally, the above sample is cooked data. It's
not real. Here's a little data cooker:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage: hex_bin ascii_input binary_output
Create a file with any combination of printable and binary
characters. Used to create test files.
input: An edited ASCII file in which printable characters are as
desired in output, and binary characters are represented by
a backslash and two hex values (example, \08 to represent
a backspace or \5C to represent a backslash).
output: The same file with binary characters replacing all \xx
values.
writeup: MIR TUTORIAL ONE, topic 8
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
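For readers who want to cook data without the program, the same
transformation can be sketched in Python. This is not the source
of hex_bin, only an approximation of the behavior described
above. One assumption: a backslash that is not followed by two
hex digits is passed through unchanged.

```python
import re

def hex_bin(text: bytes) -> bytes:
    """Replace each backslash plus two hex digits with the single byte it names."""
    return re.sub(rb"\\([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  text)

cooked = hex_bin(rb"block.\03\9eBeginning of the next block")
print(cooked)
```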
═══════════════════════════════════════════════
8.5 Binary data in fixed length records
═══════════════════════════════════════════════
Binary data in fixed length records can be interpreted
within a reasonable time period only if you have the record layout
and/or examples of printouts that match data on hand and/or the
software normally used to display the data. The more ASCII data
sprinkled through the record, the easier it is to orient oneself
within it.
Start by trying to match large integers... over 255 and
if possible over 65535, that is more than one and preferably more
than two bytes in length. Select a high integer value in the
display or printout. Work out its hex equivalent. Example: The
number 144,000 appears in the printout.
144,000 / 65536 = 2 (hex \02), remainder = 12,928
12,928 / 256 = 50 (hex \32), remainder = 128
128 / 1 = 128 (hex \80).
Then look in the matching record for four bytes. Depending on the
byte order of the machine on which the file was created, the bytes
\00\02\32\80 or the reverse set \80\32\02\00 should be somewhere
within the record.
For a fixed length layout, you now have identified exactly which
bytes match the corresponding data. This work is painstaking, but
it enables you to define the preprocessing task very precisely.
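The search for a known value can be sketched as follows. The
function name find_integer and the sample record are made up for
illustration; the code simply builds the value in both byte
orders and reports every offset at which either form occurs.

```python
def find_integer(record: bytes, value: int, size: int = 4) -> dict:
    """Locate a known integer in a record, trying both byte orders."""
    hits = {}
    for order in ("big", "little"):      # high-to-low, then low-to-high
        needle = value.to_bytes(size, order)
        positions = []
        pos = record.find(needle)
        while pos != -1:
            positions.append(pos)
            pos = record.find(needle, pos + 1)
        hits[order] = positions
    return hits

# Hypothetical record with 144,000 stored low byte first at offset 6.
record = b"ACME  " + bytes([0x80, 0x32, 0x02, 0x00]) + b"  WIDGET"
print(find_integer(record, 144000))
```

A hit in only one byte order, at the same offset in several
records, is strong evidence that the field has been found.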
≡≡≡≡->> QUESTION:
Write a little routine to convert decimal values to
hexadecimal. (UNIX has a utility BC which does exactly
this task.)
<<-≡≡≡≡
Dates are often expressed as binary values as well.
There are a variety of techniques. Typically a base is selected.
Two bytes can hold elapsed days since the base date. More
frequently, the sixteen bits are divided: seven are reserved for
years since the base year, four for the month and five for the
day of the month. Once
date bytes have been identified and matched to their corresponding
display, the coding technique can be identified by inspection.
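Both schemes are easy to try out. The sketch below makes two
assumptions that the text does not fix: the base year is 1980,
and the fields run year, month, day from high bits to low (the
convention used in DOS directory entries). Other software may
differ, so treat the function names and defaults as placeholders.

```python
from datetime import date, timedelta

def unpack_date(word: int, base_year: int = 1980) -> tuple:
    """Split a 16-bit value into (year, month, day): 7, 4 and 5 bits."""
    year = base_year + (word >> 9)       # top seven bits
    month = (word >> 5) & 0x0F           # middle four bits
    day = word & 0x1F                    # low five bits
    return year, month, day

def days_since(days: int, base: date = date(1980, 1, 1)) -> date:
    """Interpret a value as elapsed days since the base date."""
    return base + timedelta(days=days)

print(unpack_date(0x18DD))
print(days_since(366))
```

If a candidate pair of bytes yields a month above 12 or a day of
zero under one scheme, try the other scheme or the reverse byte
order.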
═══════════════════════════
8.6 Compressed data
═══════════════════════════
A variety of techniques are used for compression...
pattern substitution, variable length encoding, Huffman code,
suppression of repeated characters, differencing, etc. Unless
information is available on how the data has been compressed, the
indexer is faced with a daunting task. Until you have built up
experience in other areas of preparation and indexing, you might
choose to avoid working with compressed data.
Pattern substitution is one method that can sometimes
be deciphered within a reasonable amount of time. Repetitive
strings are replaced in the text with one or two byte binary
values. These binary bytes must be distinguishable from text. If
there are no accented characters (or if accented characters are
preceded by a reserved character to flag them), then the high bit
set characters can be used either for 128 single byte replacements
or as the lead byte for 32768 two byte replacements. In order to
decompress rapidly, the display program relies on a body of text
which it reads into RAM on startup. Some programs store the
decompressed equivalents in a tree format. It's easier to work
with those that employ a linear table, that is, the decompressed
values listed one after another, usually with a supporting list of
offsets indicating the beginning of each term.
Pattern substitution compressed data is recognized by:
» high counts of non-printable characters;
» words and word fragments interspersed frequently with
binary values;
» availability of display of the uncompressed form that
matches the data being analyzed;
» existence of a decompression table and possibly a
vector of offsets pointing to starting points within
the table.
Once the compression technique is understood and the decompression
table is available, it is possible to create software to decompress
the data as a first step in preprocessing. Because of the variety
of techniques in use, it is nearly impossible to write software to
cover all cases. At a minimum, expect to spend some time adapting
software; in the worst case, be prepared to write from scratch.
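To make the linear table idea concrete, here is a toy
decompressor. Everything about the layout is assumed for the sake
of illustration: single-byte codes only, code \80 pointing to the
first table entry, and a list of offsets marking where each entry
begins. Real products will differ in every one of these details.

```python
def decompress(data: bytes, table: bytes, offsets: list) -> bytes:
    """Expand high-bit codes using a linear table and a list of offsets."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                # ordinary text passes through
        else:
            index = b - 0x80             # assumed: code 0x80 is entry 0
            start = offsets[index]
            # An entry ends where the next begins, or at the table's end.
            end = offsets[index + 1] if index + 1 < len(offsets) else len(table)
            out += table[start:end]
    return bytes(out)

# Toy table holding three entries: "the ", "ing" and "tion".
table = b"the ingtion"
offsets = [0, 4, 7]
print(decompress(b"\x80index\x81 func\x82", table, offsets))
```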
≡≡≡≡->> QUESTION:
The source code I have on hand is too specific to have
any teaching value. Do you have any C code that could
serve the purpose? Alternatively, do you have a sample
and a decompression table for which I could write the
algorithm?
<<-≡≡≡≡
* * * * *
Of all data formats, binary data presents the greatest
challenge to the indexer. We have looked at reasons to preprocess
binary data to a standard ASCII format. Where possible, we use the
signature of a binary file to identify the parent software and use
that software to reduce the data to an ASCII alternative. Binary
blocking information and binary data within fixed length records
can be fairly readily transformed to ASCII. Compressed data is
more difficult, but some forms can be preprocessed within a
reasonable time frame.