home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Hall of Fame
/
HallofFameCDROM.cdr
/
cdrom
/
ata
/
sfql-ovr.txt
< prev
next >
Wrap
Text File
|
1991-05-11
|
42KB
|
924 lines
-------------------
ATA 89-9C MONOGRAPH
-------------------
------------------------------------------------------------------
CD-ROM Disc Interchangeability standards:
Beyond iso 9660
with the Structured Full-Text Query Language (SFQL)
Neil R. Shapiro,
Elias Diamantopoulos,
and Paul Cotton
(A typeset version of this document can be obtained from Scilab
Inc., 1921 Hexam Rd, Schenectady, NY 12309. (518) 393-1526; Fax
at same number via autoswitch)
------------------------------------------------------------------
ABSTRACT
Today, data on CD-ROM can only be accessed using the retrieval
software that comes with the disc. This often means that end-users
must work with several user interfaces. The Air Transport
Association (ATA) has addressed this issue by developing a software
interface standard which allows a single retrieval application to
work with discs from multiple sources. A key component of this
strategy is the Structured Full-Text Query Language (SFQL). This
paper provides an overview of the SFQL language, and the ATA efforts
to develop and test it.
------------------------------------------------------------------
LIST OF ACCOMPANYING FIGURES
Figure 1. SFQL Conceptual Model. (OVRFIG1.PCX)
Figure 2. Client/Server architecture for SFQL (OVRFIG2.PCX)
Figure 3. Client/Server components and the need for
standardization. (OVRFIG3.PCX)
Figure 4. SFQL server design based on existing engine. (OVRFIG4.PCX)
Figure 5. Terminology Comparison between SQL and SFQL (OVRFIG5.PCX)
Figure 6. Illustration of SFQLCA with child substructures (VLR and
FSR) (OVRFIG6.PCX)
------------------------------------------------------------------
INTRODUCTION
------------
The reliability of commercial aircraft is something we dare not
take for granted--it requires top quality maintenance. For this
reason, the commercial aerospace industry devotes significant
resources to producing high quality, up-to-date maintenance
documentation. The industry thus monitors new technology
closely in order to improve the traditional publishing methods.
It should come as no surprise then, that both the airlines and
the aerospace manufacturers have been actively testing CD-ROM
technology as a way to improve the efficiency of technical
publishing.
After three years of field trials, CD-ROM has proven to be a key
technology in the aerospace industry. Foremost among the list
of achievements is the improved accessibility of data provided
by electronic access to documents on CD-ROM. Full-text
retrieval, which provides complete ad hoc access, and hypertext
technology, have begun to replace tedious manual processes for
information inquiry on the maintenance hanger floor as well as
in shop management operations of airlines.
CD-ROM has also changed the economics of document production
--making management of information more cost effective for both
aerospace manufacturers and airlines. For example, periodic
revisions to an engine shop manual may include as many as 2500
pages and can cost as much as $160 to ship. After receiving the
revision, users may spend as many as 6 hours collating the
revised pages into the manuals. With CD-ROM, a completely
revised set of manuals can be sent to users at a lower cost than
that of the paper revisions alone.
But experience has also shown that the state-of-the-art of
CD-ROMs are not advanced enough to support the aerospace
industry's needs. Airlines have heterogeneous fleets of
airframes and engines, and thus support documentation comes from
many different suppliers. The suppliers, to meet their own
production needs, select different vendors to produce the
CD-ROMs. And the CD-ROMs that come from different vendors are
incompatible.
The incompatibilities are driven by software. Each disc comes
with its own retrieval software, and only that software can be
used with the disc. A CD-ROM produced with Retrieval System A
cannot be accessed using Retrieval System B, or vice versa. In
other words, CD-ROMs are Software Dependent.
Thus, to the airline end-user, state-of-the-art CD-ROMs mean
multiple user interfaces for working with multiple discs--even
when the discs contain standardized information, such as manuals
produced according to their industry documentation standard, ATA
Specification 100. And multiple user interfaces, i.e., multiple
software applications, opens the door for system
incompatibilities over platforms, operating systems, windowing
environments, and input/output devices such as monitors,
printers, or specialized hardware.
CD-ROM standards are the answer to this problem. Yet, despite
landmark efforts which have standardized CD-ROMs to the file
level (ISO 9660), there are no CD-ROM standards in place that
will allow a single system to serve as the retrieval software
for discs from different vendors.
The technical reason for this incompatibility is the lack of
file content standards for CD-ROM. But file content standards
would not be sufficient to produce software independent CD-ROMs.
There is more than data on the disc, there is metadata--data
about data, in the form of indexes for full-text retrieval or
columnar retrieval (relational) technology, and hypertext links,
electronic cross references.
Moreover, standards for metadata are not likely to appear--not
only because they could not cover the breadth of requirements
for CD-ROM retrieval systems, but because the proprietary nature
of metadata is the basis of competition for system performance
and features. (However, it should be noted that a program to
provide a standard format for full-text databases is underway at
the U.S. Air Force Electronic Systems Division, Hanscom Air
Force Base, MA.)
One solution to this is to build retrieval systems using a
client/server architecture to separate the end-user interface
from the underlying metadata access routines. The server
software would be distributed with each disc and would
understand the vendor-dependent format of the metadata. The
server would accept commands via a standard interprocess request
language and would return data in standardized formats to any
other software program acting as its client. Thus, any vendor's
end-user retrieval software (the client) could operate with any
other vendor's disc--disc interchangeability.
ATA STANDARDS FOR CD-ROM
------------------------
Given the importance of the CD-ROMs to publication of
maintenance documentation, the Air Transport Association (ATA),
the standards governor of the airlines, has taken the initiative
to define standards to provide software independence of CD-ROMs
(Shapiro and Bowers, 1989).
The Structured Full-Text Query Language (SFQL), is a key
component of this strategy. SFQL is designed to serve as the
standard interprocess request language between the client and
server. SFQL provides both a standard method of data request
(e.g., a query), and a standard method of data return.
As its name suggests, SFQL is based on the ANSI and ISO
Structured Query Language (SQL), the industry standard query
language for relational databases. The standard SQL would not
be sufficient to serve as the client/server request language for
the ATA CD-ROMs: SQL is designed primarily for relational
database access; the ATA requires the features of a full-text
(fully indexed) database on the CD-ROM. SFQL is thus an attempt
to extend SQL to full-text databases.
Starting with the existing standard, SQL, as the basis for a
full-text retrieval language provides leverage in the design and
use of the new standard. Foremost, it provides a potential
bridge between full-text and relational databases. Although
initially the ATA is looking to full-text databases only, it
will be possible in the future to provide a combined
full-text/relational server which could provide the client with
transparent use of the two database types. In addition, the
common language reduces the learning curve for those familiar
with SQL, and thus promotes acceptance of the standard.
In order to base the new language on SQL, however, an analogy is
needed between relational and full-text databases so that common
areas can be mapped to SQL. This is not as difficult as it
seems: full-text systems have evolved to handle structured data,
just as relational systems have evolved to deal with free text.
Consider a full-text database, which is comprised of a set of
documents. When we perform a search, the results are also a set
of documents. Likewise, a table in a relational database
contains a set of rows. When we perform a query against a
table, the result is a set of rows. The key difference between
a document in a full text database and a row in a relational
database, then, is that the relational database row is rigidly
structured: the row is divided up into fields (called columns).
Documents in full-text databases are not really unstructured, of
course. There are a number of logical elements in the data,
which may be "tagged" using a mark-up language. Most full-text
databases permit searches to be restricted in domain to one or
more of these tags--much in the same way a relational database
permits searches on specific columns.
When the tagged fields of the document represent unique,
nonoverlapping attributes, it is easy to see the analogy between
the full-text and the relational models. The analogy,
illustrated in Figure 1, is used to map SQL to full-text
databases.
This analogy is, of course, based on a class of full-text data
where the structure can be easily compared to the table-based
relational model, since tag-fields are nonhierarchical. While,
in fact, most full-text systems support only nonhierarchical
tags, many documents are better described by a hierarchical
model (e.g., ATA Maintenance/Engine Shop Manuals). We shall see
later how hierarchical relationships can be represented in an
SFQL database.
ARCHITECTURE
------------
SFQL is a key part of the ATA client/server architecture for
CD-ROM database interchangeability, as shown in Figure 2. SFQL
is the language used by the end-user software (the client) to
query and request data from one or more retrieval engines
(servers). SFQL also specifies a standard data return
structure, the SFQLCA, which is a wrapper to return variable
length data in multiple formats, so that a query may return both
text and graphics.
With this approach, the same end-user software can be used with
discs from different vendors. The user simply inserts a new
disc, and behind the scenes, the end-user software connects with
the retrieval engine supplied on the new disc. Whichever disc
is loaded, when the user enters a search request (for example,
by filling in a form), the request is translated into a series
of SFQL statements, which are sent to the retrieval engine. The
retrieval engine, in turn, packages all responses in the
standard SFQLCA return structure.
Note that SFQL itself is not sufficient to provide disc
interchangeability. A number of other standards must come into
play to orchestrate the interaction between client and server
(see Figure 3). For example, an end-to-end communications
protocol is required to ensure error free transfer of SFQL
requests and data return. Since a number of applicable
standards exist, depending on platform, the ATA plans to adopt
the prevalent standard for each specific platform. Data format
standards are also required: SFQL only defines the envelope in
which the server returns data--the contents must still be
interpretable by the client.
Thus, the ATA standards cover Layers 1 to 3, which will provide
disc interchangeability. But only Layer 2, SFQL Language and
Return Data Structures, represents a significant new development
that might provide benefits to the entire CD-ROM industry.
PROOF OF CONCEPT
----------------
To test whether the ATA standards could provide software
independence for CD-ROMs, a proof of concept experiment was
undertaken in December of 1989. GE (a manufacturer of jet
engines) and Aerospatiale (an airframe manufacturer) each set
out to develop a complete full-text retrieval system in
compliance with the draft standard. The plan was that each
company would independently develop prototype systems consisting
of a client and a server which communicate using SFQL. After
the prototypes were developed, an interchange test would be
performed. If software independence was achieved,
Aerospatiale's client should work with GE's server as well as
the Aerospatiale server; Likewise, GE's client should work with
both companies' servers.
Rather than building complete full-text systems from scratch, GE
and Aerospatiale chose to build servers using commercial
full-text systems. The idea was to encapsulate the existing
system into a server using an interprocess interface based on
SFQL. The server would translate incoming SFQL requests to the
proprietary system's interface, and conversely, would format
outgoing data into SFQL defined return structures. Figure 4
illustrates this design.
GE built its server using KnowledgeSet's (Monterey, CA) KRS
retrieval engine while Aerospatiale used Fulcrum Technologies'
(Ottawa, Canada) Ful/Text engine. Both servers were developed
for Microsoft Windows (MS-Windows) using Microsoft C and the
MS-Windows Software Development Kit, as well as libraries from
Fulcrum and KnowledgeSet. The servers proved to be good test
cases for the experiment, since they differed significantly in
their design assumptions.
GE and Aerospatiale each developed their own client applications
using C and the window-based graphical user interface provided
by MS-Windows. GE developed two different types of client: (1)
an Interactive SFQL (ISFQL) front-end which allowed dynamic
testing of SFQL queries simply by highlighting one or more SFQL
statements in the editor window; and (2) a general purpose
query-by-forms application. The latter application was designed
to demonstrate a form-based application that was not tied to a
particular database design: it did not have field names bound to
the form. Field names were supplied at run-time to the client
dialog by the server, and selected by the user via point and
click techniques.
Aerospatiale also developed a query-by-forms application,
although with a markedly different user interface. It was
designed under a more rigid set of assumptions in order to
provide the type of straightforward, easy-to-use interface that
would be placed in an airline maintenance bay.
In addition to the existing SFQL, the prototype development
required writing an end-to-end communication protocol, not
included as part of SFQL, to provide the underlying interprocess
communications. The specification that resulted, based on
Microsoft's DDE, eventually became the ATA proposed standard for
the PC's DOS/Windows environment (ATA 89-9C.WINCOMM-R2-1990).
The prototypes were successfully demonstrated at the February
1990 AIA/ATA meeting in Washington, D.C. Each system was first
shown performing queries and retrieval from its "native" server.
This demonstrated that the systems were similar to the CD-ROM
applications familiar to the ATA: there was nothing new
there--everyone had used proprietary retrieval systems.
The magic came with each application's derivative of a "New
Disc" menu item. What made this option different, of course,
was that either GE's or Aerospatiale's disc could be inserted
and used with the application, without exiting, rebooting,
reconfiguring, or restarting. The menu item, transparent to the
user, disconnected from the current server, allowed the user to
change discs, and then established a new connection with the
server on the new disc. To the user, this operation
demonstrated disc interchangeability.
PROTOTYPE RESULTS
-----------------
The prototype effort was the first known demonstration of
database interchangeability among full-text engines. The effort
validated the client/server model for disc interchangeability,
and the selection of SQL as the basis for the client/server
language.
The prototypes also uncovered some problems with the original
SFQL specification. The most major problem was the initial
assumption that clients using locator information returned by
SFQL should, be allowed to access the disc directly for data
access. The design was supposed to optimize performance of
applications by not requiring large data transfers to be made
through the client/server interface. However, the design was in
direct conflict with many existing systems which store the data
internally in proprietary formats according to their own
optimization goals (e.g., space or performance). More
importantly, the approach violated one of the initial design
premises: standardize only the interface between layers and give
the implementors as much flexibility as possible within either
layer (server or client).
SFQL2 DEVELOPMENT
-----------------
The ATA has addressed this and other language issues by forming
a joint ATA/Vendor SFQL working group. Vendors were invited to
participate in a review of the proposed standard in January,
1989, and in October, 1990, at ATA headquarters in Washington,
DC. At the October meeting, the SFQL working group of
volunteers was formed to draft a second version of the language
(89-9C.SFQL2-R1-1990).
Vendors represented on the committee included Context
Corporation, Electronic Data Systems (EDS), Fulcrum
Technologies, IBM, KnowledgeSet, Maxwell Data Management, and
TMS. The ATA was represented by American Airlines, British
Airways, and four aerospace industry representatives (airline
suppliers): Aerospatiale, Boeing, Douglas, and GE.
The group completed the draft specification in meetings spanning
over 12 days, using a JAD-based methodology Joint Application
Design is a method invented by IBM in which users and developers
work as a team to develop a set of system specifications.
Although not traditionally applied to this type of effort, JAD
turned out to be an effective means for integrating the
knowledge of this diverse group.
Foremost among the changes, the SFQL model is now entirely black
box. Both search and data access now work through the
SFQL-based interprocess communications. This makes it easier
for vendors to comply with the standard, leaves room for
innovations which improve performance or functionality, and can
provide greater upward compatibility as the underlying search
engine technology evolves.
The SFQL2 specification, along with several other related draft
ATA retrieval standards, are now undergoing further validation
testing. The validation testing will assert that SFQL, combined
with a number of other key standards (e.g., the logical
partitioning of documents using SGML based mark-up), provide an
effective, software independent solution for technical
documentation delivery on CD-ROM.
In the next sections, we will provide a look at the SFQL
language itself, including a discussion of SQL extensions to
support hierarchical tags.
THE SFQL LANGUAGE
-----------------
The draft SFQL standard (ATA 89-9C.SFQL2-R1-1990) was written by
a group of volunteers representing the Aerospace Industry
Association, the Air Transport Association, and interested
vendors. The specification was based on ANSI SQL (ANSI
X3.135-1986). Since the primary focus of the ATA standards is
CD-ROM based retrieval, not all of the features of the ANSI
specification were included. Specifically, features relating to
dynamic database updates, dynamic view definition, transaction
processing, and database definition were deferred as low
priority items in a data publishing paradigm. Other features
such as SQL subselects and subqueries were deferred because the
cost/benefit ratio of complexity to usefulness was too high.
Finally, some features such as joins were excluded because we
did not yet fully understand all of the implications of the
operation in the full-text environment.
Thus, the nucleus of SFQL is a subset of the 1986 ANSI SQL
specification. The SFQL Working Group then extended this base
language to include features for full-text query and data
manipulation which are not part of SQL. All new language
constructs were added under the restriction that they were used
in a manner consistent with SQL. Further, as extensions were
designed, the working group referred to newer drafts of the ANSI
and ISO SQL standards, to avoid reinventing features which had
been specified by ANSI or ISO but were not yet approved. When a
new feature was needed, the potential for conflict with new SQL
features was evaluated and factored into the design.
DEFINITIONS
-----------
The SFQL specification introduces some new nomenclature to help
users map document-based concepts into the table-based metaphor
used in SQL. The mapping between the two sets of terminology is
given in Figure 5.
An SFQL database consists of one or more collections, analogous
to SQL tables. Each collection is a set of document-records
(SQL rows), where a document-record may contain all or part of
the information we normally associate with a paper document.
Document-records are divided into fields; since these fields,
analogous to the columns of an SQL table, are typically derived
from a marked ("tagged") area of a document, they are formally
referred to as tag-fields. Tag-fields may contain the entire
source document, a specific portion of it such as the title or
abstract, or may contain a single item (such as a part number).
Tag-fields are created by marking up documents with a standard
markup language, such as SGML, or via vendor-specific methods
LANGUAGE SYNTAX
---------------
SFQL is very simple. It consists of three general statement
types: SELECT, CURSOR, and SET. The SELECT statement is used to
formulate all queries. CURSOR is a family of four statements
(DECLARE, OPEN, FETCH, and CLOSE) which request return of data
one record (or portion of a record) at a time. SET is used to
configure options in the SFQL server. These constructs are
familiar to anyone who has used an SQL-based relational
database, since they are used in the same way as in SQL. (SET
is found in most SQL databases; however, it is not part of ANSI
X3.135-1986.)
THE SELECT STATEMENT
--------------------
The heart of the SFQL data manipulation language is the SELECT
statement, used to express queries. It consists of three basic
clauses:
SELECT return_list
FROM collection [UNION collection]
WHERE predicate ;
The SELECT clause is used to specify a list of tag-fields to be
returned from matching document-records in the collection(s)
indicated in the FROM clause.
The WHERE clause is used to specify a predicate which indicates
the criteria for including document-records in the result set.
The result of the query will be a new, temporary collection,
called the search-results, containing all the document-records
that matched the criteria in predicate, but where each
document-record consists only of the data in the tag-fields
specified in the return_list.
The structure and purpose of the SFQL SELECT statement presented
above is basically the same as that of SQL. One important
difference, however, is in the FROM clause. The SQL FROM clause
allows a comma-separated list of tables (collections). This is
the basis of the SQL join operation, which as mentioned earlier,
is not currently included in SFQL. However, multiple
collections may be searched by means of the UNION operator in
the FROM clause, which was adopted from the draft specification
of SQL3. Unlike a join, the UNION operation performs a simple
set-based union of the collections prior to the search.
PREDICATE EXTENSIONS
--------------------
The most significant extensions to the SQL SELECT involve the
addition of predicates to permit full-text searches such as
domain-restricted search and proximity search.
Domain-restricted search is used to find a word or phrase in a
particular tag-field of each document-record. This improves the
precision of a search, compared to a word or phrase search
conducted across an entire document. For example, one could
search for all document-records containing the name "Smith" in
the authors tag-field. The SFQL query for this would look like:
SELECT authors
FROM Publications
WHERE authors CONTAINS 'Smith' ;
Proximity search is used to locate a phrase or sentence where
the exact wording is not known. It enables the user to search
for a word within a specified number of units of another word,
where units consist of characters, words, sentences, or
paragraphs. For example;
SELECT DOCUMENT
FROM Publications
WHERE DOCUMENT CONTAINS 'turbine' WITHIN 3 WORDS OF
'power generation' ;
This query would return the entire text of any document-record
in the collection Publications where the word "turbine" is three
words or fewer before or after the phrase "power generation".
The keyword DOCUMENT is a special reserved name which represents
the entire document-record as a single tag-field.
Of course, SFQL also includes the standard SQL relational
predicates such as "=", ">", and "<", which can be combined with
proximity or domain-restricted search using traditional boolean
operations (AND, OR, NOT). For example, to find the titles of
all articles published in Byte after 1989 concerning "SGML":
SELECT titles
FROM Articles
WHERE date > DATE(1989-12-31)
AND source = 'Byte'
AND DOCUMENT CONTAINS 'SGML' ;
Note that in this query, the source tag-field in the
document-record would have to contain the text "Byte", exactly
as specified, since relational operators are used in SFQL
exactly as they are in SQL. This means that the comparisons are
made without the benefit of the text processing used in
proximity and domain searching.
CURSOR OPERATIONS
-----------------
An SFQL SELECT statement is typically embedded in a cursor to
control the flow of return data. A SELECT which is not embedded
in a cursor can only return a single document-record. This is
sometimes useful, however, to perform return results of
functions such as HITS() or SYSTEMINFO().
A cursor is simply a pointer used to direct access to rows in
the collection that results from a search. The cursor is
defined via a DECLARE statement, which associates it with a
SELECT statement (a query). The basic syntax is:
DECLARE cursor_name
CURSOR FOR select_statement;
where cursor_name can be subsequently used to access the results
of the select_statement. Note that this operation only stores
the definition of the cursor; the actual search is not performed
until the cursor is opened as:
OPEN cursor_name;
Once the search is conducted, the data in the search-results
need to be retrieved. The FETCH statement is used for this.
The basic syntax is:
FETCH cursor_name ;
The fetch brings back the data in the tag-fields specified in
the SELECT statement for the current document-record. The
cursor starts at the top of the search-results, pointing to the
first document-record, and then moves down document-record by
document-record with each successive fetch. Thus, the first
time a fetch is issued the first document-record is retrieved;
the second fetch retrieves the second document-record, etc.
When all the document-records have been retrieved, a FETCH
statement returns an exception condition.
The SFQL cursor operations just described are identical to that
of SQL. A few language extensions were made, however, to
accommodate differences in the quantity of data expected in the
two types of databases.
In SQL-based databases, a row is typically short in length (a
maximum record length is usually enforced by the database
system). In contrast, for full-text systems, document-records
are typically very large. Thus, the user needs to control which
fields within a document-record to return, and must be able to
return portions of the data within the fields. In SQL, control
over this is provided entirely within the SELECT clause. In
other words, fetches based on the cursor:
DECLARE cursor1
CURSOR FOR
SELECT author, title
FROM abstracts;
return only the author and title of each row (which could
contain other fields such as journal, date, and abstract). But,
in SQL, after opening the cursor, the application cannot change
the list of return fields without redefining the cursor and
reexecuting the query. Thus, in SQL, a declared cursor would
typically list all fields of interest in the SELECT, and the
application would discard any fields that were not needed on a
particular fetch.
In SFQL databases, where document-records are expected to be
very long, this restriction would place an unnecessary burden on
client/server communications. Therefore, in SFQL, the cursor
FETCH operation, used to retrieve data document-record by
document-record, was extended to allow the return_list to be
changed on a document-record by document-record basis. The
mechanism for this enhancement was to add a FIELDS clause to the
FETCH statement. The FIELDS clause is specified in the FETCH
statement as:
FETCH cursor_name
FIELDS return_list ;
The FIELDS clause allows the FETCH to return a subset of the
items in the SELECT clause of the cursor declaration. Because
the FIELDS clause may be included with each FETCH statement, the
list of items returned may be changed for every document-record
in the search-results. For example,
FETCH cursor1
FIELDS author;
for the above query would only return the tag-field author for
the current document-record.
To illustrate how this feature fits into the development of an
application, consider the following scenario. When the user
issues a query, the system conducts the search and returns
summary data identifying which documents matched the search
criteria. For example, this summary information might consist
of the document title and author. The user then selects the
documents of interest directly from this summary information.
As the user makes further choices from the summary, the client
requests more information about the matching document using the
FIELDS clause of the FETCH, without the need to reissue the
query and without affecting the original search-results, which
comprise the summary list.
THE "HORIZONTAL" CURSOR
-----------------------
Using the FIELDS clause to return partial document records by
tag-fields helps in limiting the amount of information returned
as a result of a query. However, in many full-text databases a
tag-field itself might be very large--for example, in a
collection of famous novels, the tag-field DOCUMENT could return
the entire text of War And Peace. Further, unlike most
relational database (SQL) queries, the criteria which qualified
this record for the search-results is not the entire tag-field,
but a relatively small portion of it. To return data from
within a tag-field relative to the areas that matched the
predicate, the SFQL function HIT_TEXT() is used in the FIELDS
clause of the fetch. The general form of HIT_TEXT is:
HIT_TEXT (tag_field_spec,fetch_orientation, maxlength,
hit_justification, tag_list)
where fetch_orientation indicates the direction of movement of
the horizontal cursor, either PREVIOUS, NEXT, ABSOLUTE n, or
RELATIVE n--the same terms used in the standard FETCH "vertical"
cursor. The maxlength argument is used to specify the amount of
text to return. To control the position of the matched text in
the return buffer, hit_justification is used to specify the
return of text either before, after, or centered about the
matched text. In any case, the actual matched text will be
marked in the document using special, configurable character
codes called match-codes. The last argument, tag_list, is a
blank-separated list of tags to remove (only the tag itself is
removed, not the tagged text) before returning the text. The
keyword ALL can be used in the tag_list to filter all tags out
of the return text.
For example, to return 1000 characters of text centered around
the first hit in the current document-record, with all tags
removed, the following FETCH would be sent to the server:
FETCH cursor1
FIELDS HIT_TEXT(DOCUMENT, NEXT, 1000, CENTERED, 'ALL') ;
To retrieve the same amount of text around the next hit, this
FETCH statement would be reissued.
After examining the text surrounding a hit and deciding the
document is relevant, users often want to scroll through the
adjacent information. The DATA function provides this
capability in SFQL. The format of the DATA function is:
DATA (tag_field_spec, fetch_orientation, maxlength, tag_list)
which are interpreted as they are for HIT_TEXT above.
For example, to return the next block of 1000 characters of text
following our HIT_TEXT example above:
FETCH cursor1
FIELDS DATA(DOCUMENT, NEXT,1000,'ALL') ;
Likewise, to return the last 1000 characters in the DOCUMENT:
FETCH cursor1
FIELDS DATA(DOCUMENT, LAST,1000,'ALL') ;
The DATA function can also work independently of the HIT_TEXT
function. This function can be used to request blocks of data
from a tag-field on user demand. This is useful for tag-fields
containing lengthy text or large amounts of graphics data.
Thus:
FETCH cursor1
FIELDS DATA(illustration, FIRST, 1000, 'ALL') ;
returns the first 500 characters of the tag-field abstract, and:
FETCH cursor1
FIELDS DATA(illustration,
ABSOLUTE 3, 500, 'ALL') ;
returns characters the characters in positions 1001 - 1500 of
the tag-field.
THE SET STATEMENT
-----------------
The SET statement is included in SFQL to give the user control
over server operations. In full-text systems, a query can
result in a very large number of document-records, which can tie
up system resources needlessly both in terms of memory and
processing time. For example, to eliminate the possibility of
an excessive number of document-records returned by a query, the
following statement can be issued:
SET MAX_SEARCH_RECORDS(100);
which limits the number of document records to 100.
Similarly, the functions SUSPEND and RESUME are used by the
client to interrupt (or resume) an operation at the server that
is taking too much time. To suspend a particular search, the
client could issue the SFQL statement:
SET SUSPEND(cursorname) ;
where cursorname represents the name of the cursor-based search
that is ongoing.
There are a number of other SET functions in SFQL, which control
important aspects of the server other than resource allocation.
For example, the MATCH_METHOD function determines whether or not
case is significant during queries, or whether fuzzy matching
(e.g., verb stemming) is in effect.
SFQL RETURN DATA STRUCTURES
---------------------------
The SFQL grammar provides a means to express queries and request
data from the server. A standard data return structure is also
required, so that the client can interpret the data returned by
the server.
SFQL specifies return of data and status information in a
structure called the SFQL Communications Area (SFQLCA). The
SFQLCA has two parts, a header that contains status information,
and a message area (SFQLMSG) which is used to return data or
extended error messages.
The SFQLCA header is modeled after several popular SQL
implementations. (Although the concept of a communications
areas is part of the SQL standard, the precise definition of an
SQLCA is not.) Among the more important items of the SFQLCA is
the SFQLCODE, which provides an integer return code indicating
the status of the operation. (SQLCODE is the only part of SQL's
SQLCA which is standard.)
The server responds to every SFQL statement sent by the client
with an SFQLCA. For example, the return SFQLCA may simply
contain a "No Errors" status code following a "CLOSE cursor"
statement, or it may contain an error message ranging from
statement parse errors to resource limitation errors.
When data is returned in the SFQLMSG area, for example, in
response to a FETCH, it is formatted in a special multi-field
variable length return structure called a Variable Length Record
(VLR). The VLR permits the return of data from tag-fields of
various types and sizes, from integers to large text or binary
objects. Each data object is placed in a special container
structure called a Field Subrecord (FSR). The FSR header
identifies the source of the data object, and its type.
A variable number of FSRs may be placed in a VLR. Figure 6
illustrates this in the context of the entire data structure
hierarchy--from SFQLCA to FSR. As the figure shows, there is
one FSR returned for each of the items in the return_list of the
FIELDS clause (or in the return_list of the SELECT if there is
no FIELDS clause in the FETCH.)
For example the statement:
FETCH cursor1
FIELDS title, author, abstract ;
would return an SFQLCA message with status information and a VLR
containing three FSRs to hold the data associated with the
title, author, and abstract return items.
CONCLUSIONS
-----------
The architecture described here is an evolutionary step for
CD-ROM technology. It permits users to grow beyond a single
major CD-ROM database application towards a complete CD-ROM
information system.
The SFQL language also represents an evolutionary step for both
full-text systems and relational databases. SFQL provides
full-text systems the features of SQL, a powerful and
well-established query language standard for working with
structured data. SFQL provides relational database systems
with the necessary language extensions to allow databases to
exploit full-text capabilities.
But most importantly, SFQL is the first step in forming
permanent bonds between these important technologies. Although
SFQL is not a complete superset of SQL, the groundwork has been
laid to bridge the two technologies. The underlying
model--relational, full-text, or both--should be transparent to
the client application during retrieval. Then we will have
truly achieved disc interchangeability.
BIBLIOGRAPHY
------------
ANSI X3.135-1986. American National Standard for Information
Systems--Database Language--SQL. American National Standards
Institute, New York, NY.
ATA, 1990. ATA Specification 100, Appendix 1: Digital Data
Standards, Published by the Air Transport Association, 1709 New
York Avenue Northwest, Washington D.C. 20006.
ATA 89-9C.COMMWIN-R2-1990. CD-ROM Interchangeability
Standard--Client/Server Communication Protocol: Microsoft
Windows. Air Transport Association, Washington, DC.
ATA 89-9C.SFQL2-R1-1990. CD-ROM Interchangeability
Standard--SFQL. Air Transport Association, Washington, DC.
Shapiro, N.R. and Bowers, F.J. (1989). Standardizing the
delivery of technical documentation on CD-ROM's. CD Data
Report, pp. 21-28.
ABOUT THE AUTHORS
-----------------
Neil R. Shapiro, Ph.D., is an independent consultant
specializing in cooperative applications, user interface
technology, and optical publishing. As the original author of
SFQL, he has represented GE Aircraft Engines in the ATA
standards effort for the last two years. He is a member of the
IEEE Computer Society and the ACM. Dr. Shapiro may be reached
at Scilab Inc., 1921 Hexam Rd., Schenectady, NY, 12309. TEL
518/393-1526, FAX: 518/393-1526 (automatic routing on carrier
detect, or press Extension 11).
Elias Diamantopoulos, M.S., has been an engineer at GE's
Ordnance Division since 1986, where he develops expert system
based applications with hypermedia-based support. He is
currently working to define a paperless maintenance system for
the Navy. Mr. Diamantopoulos implemented GE's SFQL server
prototype as part of a Master's Thesis for the University of
Massachusetts. He is a member of IEEE.
Paul Cotton is Technical Product Manager at Fulcrum
Technologies, Inc, a supplier of full-text search and retrieval
software. Mr. Cotton was one of the key contributors in the
SFQL Working Group. He was a co-founder of Officesmiths, Inc.
and principal designer of The Officesmith, an integrated text
database and office automation package. Mr. Cotton has M.Math
and B.Math degrees in Computer Science from the University of
Waterloo. He can be reached at 560 Rochester Street, Ottawa
Ontario Canada K1S 5K2 TEL: 613/238-1761 or EMAIL:
uunet!fultech!pbc.
-----------------------------------------------------------------
Copyright 1991, Scilab Inc..
Permission to copy/download/upload, without fee, all or part of the
material is granted provided that (1) This copyright notice is included
in its entirety; (2) the copies are not made or distributed for direct
commercial advantage; and (3) Scilab Inc. (1921 Hexam Rd, Schenectady,
NY 12309-6510) is notified of the distribution details (e.g., the event
at which it was distributed or the recipient names).