1992-06-29
════════════════════════════════════════════════
1. INTRODUCTION TO MIR TUTORIAL ONE
════════════════════════════════════════════════
════════════════════════════
1.1 Project overview
════════════════════════════
The Mass Indexing and Retrieval (MIR) project deals
with the technical side of enabling people to find information
within large quantities of data. Output from the project takes the
form of five sets of printed tutorials, plus related software and
source code under these headings:
ONE Database Analysis
TWO Secrets of Data Preparation
THREE Keys to Automated Indexing
FOUR Search Engines and Information Retrieval
FIVE Related Topics and Applications
The tutorials are addressed to Directors of Information
Services, custom software providers, information publishers,
government information distributors, educators, trainers, and
programmers. The software is distributed under "copyleft" rules of
the Free Software Foundation. Improvements are invited and will be
shared in a final volume and on an accompanying CD-ROM.
You may wish to print the five introductory topics
together with Tutorial ONE and keep them in a three-ring binder.
For best formatting, use the WordPerfect 5.1 version of the files
provided on diskettes. Printed copies are also available from
Marpex Inc. for a nominal cost; see the files ORDRINFO and
ORDRFORM.
═════════════════════════════════
1.2 Tutorial ONE overview
═════════════════════════════════
The purpose of MIR Tutorial ONE is to enable you to
analyze computerized data from an indexing perspective.
The first topic, source code guidelines, explains the
perspectives that have been built into the software that is
provided with the tutorials. People who wish to improve on the
technology are shown how to share their insights and C language
source code.
Methods of data gathering affect the cost, the quality
and the complexity of the task of indexing. An index adds value to
data, so we pay attention to some marketing considerations.
Data analysis has to do with recognizing various forms
in which data is accumulated, and detecting the inconsistencies
(common in large sets of data) that make indexing more challenging.
Data format offers possibilities and imposes limitations that will
face searchers who wish to extract information. How might the data
be structured in a way that better suits the needs of searchers?
The reader is provided with a variety of software tools for this
critical data analysis function.
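One simple way to spot inconsistencies is to survey line lengths.
The following is a minimal sketch in C, not one of the MIR tools:
the names line_stats and survey_lines are hypothetical, and the
input is assumed to be newline-terminated text already in memory.
If the minimum and maximum lengths agree, the data may consist of
fixed-length records; a wide spread suggests free text or damage.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of line lengths found in a buffer (hypothetical type). */
struct line_stats {
    size_t lines;    /* number of lines seen            */
    size_t min_len;  /* shortest line, excluding '\n'   */
    size_t max_len;  /* longest line, excluding '\n'    */
};

/* Fold one line length into the running statistics. */
static void note_line(struct line_stats *s, size_t n)
{
    if (s->lines == 0 || n < s->min_len)
        s->min_len = n;
    if (s->lines == 0 || n > s->max_len)
        s->max_len = n;
    s->lines++;
}

/*
 * Measure every line in buf. Lines end at '\n'; a carriage return,
 * if present, is counted as part of the line. A final line with no
 * terminating '\n' is counted as well.
 */
struct line_stats survey_lines(const char *buf, size_t len)
{
    struct line_stats s = {0, 0, 0};
    size_t i, cur = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            note_line(&s, cur);
            cur = 0;
        } else {
            cur++;
        }
    }
    if (cur > 0)
        note_line(&s, cur);
    return s;
}
```

A caller would read the file into a buffer, call survey_lines, and
compare min_len with max_len before choosing further analysis steps.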
The ability to identify patterns in byte sequences
quickly is critical to keeping indexing costs low. We examine a
series of software tools for this purpose.
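As an illustration of pattern finding in byte sequences, here is a
small sketch in C. It is not taken from the MIR software; the
function names byte_histogram and count_pattern are hypothetical.
The first tallies how often each byte value occurs, and the second
counts occurrences of a given byte pattern, such as a record
separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tally how often each of the 256 possible byte values occurs in buf. */
void byte_histogram(const unsigned char *buf, size_t len,
                    unsigned long counts[256])
{
    size_t i;

    assert(buf != NULL || len == 0);
    memset(counts, 0, 256 * sizeof counts[0]);
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
}

/*
 * Count occurrences of the patlen-byte pattern pat within buf.
 * Overlapping matches are counted separately. This is a plain
 * brute-force scan, adequate for a sketch but slow on large data.
 */
size_t count_pattern(const unsigned char *buf, size_t len,
                     const unsigned char *pat, size_t patlen)
{
    size_t i, hits = 0;

    if (patlen == 0 || patlen > len)
        return 0;
    for (i = 0; i + patlen <= len; i++) {
        if (memcmp(buf + i, pat, patlen) == 0)
            hits++;
    }
    return hits;
}
```

A histogram dominated by printable ASCII values points to text,
while a flat spread across all 256 values suggests binary or
compressed data. Equal counts of carriage returns and line feeds
hint at paired line terminators.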
Worked examples of the analysis stage are provided.
These topics are at a "nuts and bolts" level: use such and such
a program, here is the input, here is the output, and here is what
the results mean. The sequence runs from simplest to most complex:
simple ASCII text, ASCII with markup, fielded text, fixed-length
records, the addition of packed numbers, then various forms of
binary data.
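To make "packed numbers" concrete, the sketch below decodes one
common form, packed decimal. This is an assumption: the text does
not say which packing the tutorial covers. In this form each byte
holds two decimal digits, except the last, whose low half holds a
sign code (hexadecimal B or D for negative, any other value from A
to F for positive). The function name unpack_decimal is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode a packed decimal field of len bytes into *value.
 * Returns 0 on success, -1 if the field is empty, too long for
 * this sketch (more than 5 bytes, i.e. 9 digits), or contains an
 * invalid digit or sign code.
 */
int unpack_decimal(const unsigned char *field, size_t len, long *value)
{
    long result = 0;
    unsigned sign = 0;
    size_t i;

    assert(value != NULL);
    if (field == NULL || len == 0 || len > 5)
        return -1;

    for (i = 0; i < len; i++) {
        unsigned hi = field[i] >> 4;     /* upper four bits */
        unsigned lo = field[i] & 0x0F;   /* lower four bits */

        if (hi > 9)
            return -1;
        result = result * 10 + (long)hi;

        if (i + 1 < len) {
            /* Not the last byte: low half is another digit. */
            if (lo > 9)
                return -1;
            result = result * 10 + (long)lo;
        } else {
            /* Last byte: low half is the sign code. */
            sign = lo;
        }
    }

    if (sign < 0x0A)
        return -1;
    if (sign == 0x0B || sign == 0x0D)
        result = -result;
    *value = result;
    return 0;
}
```

For example, the three bytes 12 34 5C (hexadecimal) decode to
12345, and the two bytes 12 3D decode to -123.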
Data deblocking is explained at this stage since it may
be required in order to finish analysis of the data.
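A minimal sketch of deblocking in C follows. It assumes the
simplest layout, which may differ from the cases the tutorial
treats: each physical block has a fixed size and holds as many
whole fixed-length records as fit, with any leftover bytes used as
padding. The function name deblock and its parameters are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the logical records out of a blocked buffer.
 *
 * in, in_len   blocked input data
 * block_size   size of each physical block in bytes
 * rec_len      size of each logical record in bytes
 * out, out_cap output buffer and its capacity in bytes
 *
 * Padding at the end of each block is skipped, and a trailing
 * partial block is ignored. Returns the number of records copied.
 */
size_t deblock(const unsigned char *in, size_t in_len,
               size_t block_size, size_t rec_len,
               unsigned char *out, size_t out_cap)
{
    size_t per_block, offset, r, copied = 0;

    assert(in != NULL && out != NULL);
    if (rec_len == 0 || block_size < rec_len)
        return 0;
    per_block = block_size / rec_len;

    for (offset = 0; offset + block_size <= in_len; offset += block_size) {
        for (r = 0; r < per_block; r++) {
            /* Stop early if the next record would not fit in out. */
            if ((copied + 1) * rec_len > out_cap)
                return copied;
            memcpy(out + copied * rec_len,
                   in + offset + r * rec_len, rec_len);
            copied++;
        }
    }
    return copied;
}
```

With 8-byte blocks and 3-byte records, the input
"AAABBB--CCCDDD--" yields the four records AAA, BBB, CCC and DDD,
and the two padding bytes at the end of each block are dropped.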
At the end of Tutorial ONE, the participant has
detailed exposure to the techniques of data analysis, and is able
to use a selection of analysis tools (source code provided) to
recognize and interpret a wide range of data types.