-
- The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS
- *****This file should be named locet10.txt or locet10.zip******
-
- Corrected EDITIONS of our etexts get a new NUMBER, locet11.txt
- VERSIONS based on separate sources get new LETTER, locet10a.txt
-
- This choice was made by popular demand for information on etexts
- as they are being moved away from Plain Vanilla ASCII and toward
- markup and graphical representation, as opposed to what you will
- see in this file, which is a fairly good example of PVASCII.
-
- We are deeply indebted to the Library of Congress for preparing,
- posting, and freely distributing this as Plain Vanilla ASCII.
-
- Thanks to James Daly for his assistance in determining the work is
- not under copyright protection, on 2/27/93 as we were stymied as
- to how to release this work as soon as possible by 2/28/93.
-
- This edition has many paragraphs reformatted to eliminate widows
- and orphans, a few typos corrected, and one adjective changed to
- a noun. Otherwise every word appears as it was in the original,
- which can be obtained if you:
-
- ftp seq1.loc.gov
- login: anonymous
- password: name@machine
- cd /pub/Library.of.Congress/research.guides/amer.memory
- ls -al (to see filenames)
- get filename.ext
- quit
-
- WARNING: this machine was not functioning on weekends.
- Information about Project Gutenberg (one page)
-
- We produce about two million dollars for each hour we work. The
- fifty hours is one conservative estimate for how long it takes us
- to get any etext selected, entered, proofread, edited, copyright
- searched and analyzed, the copyright letters written, etc. This
- projected audience is one hundred million readers. If our value
- per text is nominally estimated at one dollar, then we produce 2
- million dollars per hour; this year we will have to do four text
- files per month: thus upping our productivity from one million.
- The Goal of Project Gutenberg is to Give Away One Trillion Etext
- Files by December 31, 2001. [10,000 x 100,000,000=Trillion]
- This is ten thousand titles each to one hundred million readers,
- which is 10% of the expected number of computer users by the end
- of the year 2001.
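-
- The arithmetic behind these figures can be checked with a short
- sketch. The constants come from the paragraph above; the variable
- names are illustrative only, and the implied total of computer
- users is simply derived from the stated 10% share.
-
```python
# Figures taken from the paragraph above; names are illustrative.
READERS = 100_000_000        # projected audience per etext
VALUE_PER_READER_USD = 1     # nominal value of one text to one reader
HOURS_PER_TEXT = 50          # conservative production estimate
TITLES_GOAL = 10_000         # titles targeted by the end of 2001

# Value of one etext across the whole projected audience.
value_per_text = READERS * VALUE_PER_READER_USD

# Value produced per hour of work on that etext.
value_per_hour = value_per_text // HOURS_PER_TEXT

# Total files given away if every reader receives every title.
files_given_away = TITLES_GOAL * READERS

# If 100 million readers are 10% of all computer users, the implied total is:
implied_users = READERS * 10

print(value_per_hour, files_given_away, implied_users)
```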
-
- We need your donations more than ever!
-
- All donations should be made to "Project Gutenberg/IBC", and are
- tax deductible to the extent allowable by law ("IBC" is Illinois
- Benedictine College). (Subscriptions to our paper newsletter go
- to IBC, too)
-
- For these and other matters, please mail to:
-
- David Turner, Project Gutenberg
- Illinois Benedictine College
- 5700 College Road
- Lisle, IL 60532-0900
-
- Email requests to:
- Internet: chipmonk@eagle.ibc.edu (David Turner)
- Compuserve: >INTERNET: chipmonk@eagle.ibc.edu (David Turner)
- Attmail: internet!chipmonk@eagle.ibc.edu (David Turner)
- MCImail: (David Turner)
- ADDRESS TYPE: MCI / EMS: INTERNET / MBX:chipmonk@eagle.ibc.edu
-
- When all other email fails try our Michael S. Hart, Executive
- Director:
- hart@vmd.cso.uiuc.edu (internet) hart@uiucvmd (bitnet)
-
- We would prefer to send you this information by email
- (Internet, Bitnet, Compuserve, ATTMAIL or MCImail).
-
- ******
- If you have an FTP program (or emulator), please:
-
- FTP directly to the Project Gutenberg archives:
- ftp mrcnext.cso.uiuc.edu
- login: anonymous
- password: your@login
- cd etext/etext91
- or cd etext92
- or cd etext93 [for new books] [now also in cd etext/etext93]
- or cd etext/articles [get suggest gut for more information]
- dir [to see files]
- get or mget [to get files. . .set bin for zip files]
- GET 0INDEX.GUT
- for a list of books
- and
- GET NEW GUT for general information
- and
- MGET GUT* for newsletters.
-
- **Information prepared by the Project Gutenberg legal advisor**
- (Three Pages)
-
- ****START**THE SMALL PRINT!**FOR PUBLIC DOMAIN ETEXTS**START****
-
- Why is this "Small Print!" statement here? You know: lawyers.
- They tell us you might sue us if there is something wrong with
- your copy of this etext, even if you got it for free from
- someone other than us, and even if what's wrong is not our
- fault. So, among other things, this "Small Print!" statement
- disclaims most of our liability to you. It also tells you how
- you can distribute copies of this etext if you want to.
-
- *BEFORE!* YOU USE OR READ THIS ETEXT
-
- By using or reading any part of this PROJECT GUTENBERG-tm etext,
- you indicate that you understand, agree to and accept this
- "Small Print!" statement. If you do not, you can receive a
- refund of the money (if any) you paid for this etext by sending
- a request within 30 days of receiving it to the person you got
- it from. If you received this etext on a physical medium (such
- as a disk), you must return it with your request.
-
- ABOUT PROJECT GUTENBERG-TM ETEXTS
-
- This PROJECT GUTENBERG-tm etext, like most PROJECT GUTENBERG-tm
- etexts, is a "public domain" work distributed by Professor
- Michael S. Hart through the Project Gutenberg Association (the
- "Project"). Among other things, this means that no one owns a
- United States copyright on or for this work, so the Project (and
- you!) can copy and distribute it in the United States without
- permission and without paying copyright royalties. Special
- rules, set forth below, apply if you wish to copy and distribute
- this etext under the Project's "PROJECT GUTENBERG" trademark.
-
- To create these etexts, the Project expends considerable efforts
- to identify, transcribe and proofread public domain works.
- Despite these efforts, the Project's etexts and any medium they
- may be on may contain "Defects". Among other things, Defects
- may take the form of incomplete, inaccurate or corrupt data,
- transcription errors, a copyright or other intellectual property
- infringement, a defective or damaged disk or other etext medium,
- a computer virus, or computer codes that damage or cannot be
- read by your equipment.
-
- DISCLAIMER
-
- But for the "Right of Replacement or Refund" described below,
- [1] the Project (and any other party you may receive this etext
- from as a PROJECT GUTENBERG-tm etext) disclaims all liability to
- you for damages, costs and expenses, including legal fees, and
- [2] YOU HAVE NO REMEDIES FOR NEGLIGENCE OR UNDER STRICT LIABILI-
- TY, OR FOR BREACH OF WARRANTY OR CONTRACT, INCLUDING BUT NOT
- LIMITED TO INDIRECT, CONSEQUENTIAL, PUNITIVE OR INCIDENTAL
- DAMAGES, EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH
- DAMAGES.
-
- If you discover a Defect in this etext within 90 days of
- receiving it, you can receive a refund of the money (if any) you
- paid for it by sending an explanatory note within that time to
- the person you received it from. If you received it on a
- physical medium, you must return it with your note, and such
- person may choose to alternatively give you a replacement copy.
- If you received it electronically, such person may choose to
- alternatively give you a second opportunity to receive it elec-
- tronically.
-
- THIS ETEXT IS OTHERWISE PROVIDED TO YOU "AS-IS". NO OTHER
- WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, ARE MADE TO YOU AS
- TO THE ETEXT OR ANY MEDIUM IT MAY BE ON, INCLUDING BUT NOT
- LIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
- PARTICULAR PURPOSE.
-
- Some states do not allow disclaimers of implied warranties or
- the exclusion or limitation of consequential damages, so the
- above disclaimers and exclusions may not apply to you, and you
- may have other legal rights.
-
- INDEMNITY
-
- You will indemnify and hold the Project, its directors,
- officers, members and agents harmless from all liability, cost
- and expense, including legal fees, that arise from any
- distribution of this etext for which you are responsible, and
- from [1] any alteration, modification or addition to the etext
- for which you are responsible, or [2] any Defect.
-
- DISTRIBUTION UNDER "PROJECT GUTENBERG-tm"
-
- You may distribute copies of this etext electronically, or by
- disk, book or any other medium if you either delete this "Small
- Print!" and all other references to Project Gutenberg, or:
-
- [1] Only give exact copies of it. Among other things, this re-
- quires that you do not remove, alter or modify the etext or
- this "small print!" statement. You may however, if you
- wish, distribute this etext in machine readable binary,
- compressed, mark-up, or proprietary form, including any
- form resulting from conversion by word processing or hyper-
- text software, but only so long as *EITHER*:
-
- [*] The etext, when displayed, is clearly readable. We
- consider an etext *not* clearly readable if it
- contains characters other than those intended by the
- author of the work, although tilde (~), asterisk (*)
- and underline (_) characters may be used to convey
- punctuation intended by the author, and additional
- characters may be used to indicate hypertext links.
-
- [*] The etext may be readily converted by the reader at no
- expense into plain ASCII, EBCDIC or equivalent form by
- the program that displays the etext (as is the case,
- for instance, with most word processors).
-
- [*] You provide, or agree to also provide on request at no
- additional cost, fee or expense, a copy of the etext
- in its original plain ASCII form (or in EBCDIC or
- other equivalent proprietary form).
-
- [2] Honor the etext refund and replacement provisions of this
- "Small Print!" statement.
-
- [3] Pay a trademark license fee of 20% (twenty percent) of the
- net profits you derive from distributing this etext under
- the trademark, determined in accordance with generally
- accepted accounting practices. The license fee:
-
- [*] Is required only if you derive such profits. In
- distributing under our trademark, you incur no
- obligation to charge money or earn profits for your
- distribution.
-
- [*] Shall be paid to "Project Gutenberg Association /
- Illinois Benedictine College" (or to such other person
- as the Project Gutenberg Association may direct)
- within the 60 days following each date you prepare (or
- were legally required to prepare) your year-end tax
- return with respect to your income for that year.
-
- WHAT IF YOU *WANT* TO SEND MONEY EVEN IF YOU DON'T HAVE TO?
-
- The Project gratefully accepts contributions in money, time,
- scanning machines, OCR software, public domain etexts, royalty
- free copyright licenses, and every other sort of contribution
- you can think of. Money should be paid to "Project Gutenberg
- Association / Illinois Benedictine College".
-
- WRITE TO US! We can be reached at:
-
- Internet: hart@vmd.cso.uiuc.edu
- Bitnet: hart@uiucvmd
- CompuServe: >internet:hart@vmd.cso.uiuc.edu
- Attmail: internet!vmd.cso.uiuc.edu!Hart
-
- or
-
- Project Gutenberg
- Illinois Benedictine College
- 5700 College Road
- Lisle, IL 60532
-
-
- Drafted by CHARLES B. KRAMER, Attorney
- CompuServe: 72600,2026
- Internet: 72600.2026@compuserve.com
- Tel: (212) 254-5093
- *END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.08.29.92*END*
-
-
- The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS
-
-
-
-
- WORKSHOP ON ELECTRONIC TEXTS
-
- PROCEEDINGS
-
-
-
- Edited by James Daly
-
-
-
-
-
-
-
- 9-10 June 1992
-
-
- Library of Congress
- Washington, D.C.
-
-
-
- Supported by a Grant from the David and Lucile Packard Foundation
-
-
- *** *** *** ****** *** *** ***
-
-
- TABLE OF CONTENTS
-
-
- Acknowledgements
-
- Introduction
-
- Proceedings
- Welcome
- Prosser Gifford and Carl Fleischhauer
-
- Session I. Content in a New Form: Who Will Use It and What Will They Do?
- James Daly (Moderator)
- Avra Michelson, Overview
- Susan H. Veccia, User Evaluation
- Joanne Freeman, Beyond the Scholar
- Discussion
-
- Session II. Show and Tell
- Jacqueline Hess (Moderator)
- Elli Mylonas, Perseus Project
- Discussion
- Eric M. Calaluca, Patrologia Latina Database
- Carl Fleischhauer and Ricky Erway, American Memory
- Discussion
- Dorothy Twohig, The Papers of George Washington
- Discussion
- Maria L. Lebron, The Online Journal of Current Clinical Trials
- Discussion
- Lynne K. Personius, Cornell mathematics books
- Discussion
-
- Session III. Distribution, Networks, and Networking:
- Options for Dissemination
- Robert G. Zich (Moderator)
- Clifford A. Lynch
- Discussion
- Howard Besser
- Discussion
- Ronald L. Larsen
- Edwin B. Brownrigg
- Discussion
-
- Session IV. Image Capture, Text Capture, Overview of Text and
- Image Storage Formats
- William L. Hooton (Moderator)
- A) Principal Methods for Image Capture of Text:
- direct scanning, use of microform
- Anne R. Kenney
- Pamela Q.J. Andre
- Judith A. Zidar
- Donald J. Waters
- Discussion
- B) Special Problems: bound volumes, conservation,
- reproducing printed halftones
- George Thoma
- Carl Fleischhauer
- Discussion
- C) Image Standards and Implications for Preservation
- Jean Baronas
- Patricia Battin
- Discussion
- D) Text Conversion: OCR vs. rekeying, standards of accuracy
- and use of imperfect texts, service bureaus
- Michael Lesk
- Ricky Erway
- Judith A. Zidar
- Discussion
-
- Session V. Approaches to Preparing Electronic Texts
- Susan Hockey (Moderator)
- Stuart Weibel
- Discussion
- C.M. Sperberg-McQueen
- Discussion
- Eric M. Calaluca
- Discussion
-
- Session VI. Copyright Issues
- Marybeth Peters
-
- Session VII. Conclusion
- Prosser Gifford (Moderator)
- General discussion
-
- Appendix I: Program
-
- Appendix II: Abstracts
-
- Appendix III: Directory of Participants
-
-
- *** *** *** ****** *** *** ***
-
-
- Acknowledgements
-
- I would like to thank Carl Fleischhauer and Prosser Gifford for the
- opportunity to learn about areas of human activity unknown to me a scant
- ten months ago, and the David and Lucile Packard Foundation for
- supporting that opportunity. The help given by others is acknowledged on
- a separate page.
-
- 19 October 1992
-
-
- *** *** *** ****** *** *** ***
-
-
- INTRODUCTION
-
- The Workshop on Electronic Texts (1) drew together representatives of
- various projects and interest groups to compare ideas, beliefs,
- experiences, and, in particular, methods of placing and presenting
- historical textual materials in computerized form. Most attendees gained
- much in insight and outlook from the event. But the assembly did not
- form a new nation, or, to put it another way, the diversity of projects
- and interests was too great to draw the representatives into a cohesive,
- action-oriented body.(2)
-
- Everyone attending the Workshop shared an interest in preserving and
- providing access to historical texts. But within this broad field the
- attendees represented a variety of formal, informal, figurative, and
- literal groups, with many individuals belonging to more than one. These
- groups may be defined roughly according to the following topics or
- activities:
-
- * Imaging
- * Searchable coded texts
- * National and international computer networks
- * CD-ROM production and dissemination
- * Methods and technology for converting older paper materials into
- electronic form
- * Study of the use of digital materials by scholars and others
-
- This summary is arranged thematically and does not follow the actual
- sequence of presentations.
-
- NOTES:
- (1) In this document, the phrase electronic text is used to mean
- any computerized reproduction or version of a document, book,
- article, or manuscript (including images), and not merely a machine-
- readable or machine-searchable text.
-
- (2) The Workshop was held at the Library of Congress on 9-10 June
- 1992, with funding from the David and Lucile Packard Foundation.
- The document that follows represents a summary of the presentations
- made at the Workshop and was compiled by James DALY. This
- introduction was written by DALY and Carl FLEISCHHAUER.
-
-
- PRESERVATION AND IMAGING
-
- Preservation, as that term is used by archivists,(3) was most explicitly
- discussed in the context of imaging. Anne KENNEY and Lynne PERSONIUS
- explained how the concept of a faithful copy and the user-friendliness of
- the traditional book have guided their project at Cornell University.(4)
- Although interested in computerized dissemination, participants in the
- Cornell project are creating digital image sets of older books in the
- public domain as a source for a fresh paper facsimile or, in a future
- phase, microfilm. The books returned to the library shelves are
- high-quality and useful replacements on acid-free paper that should last
- a long time. To date, the Cornell project has placed little or no
- emphasis on creating searchable texts; one would not be surprised to find
- that the project participants view such texts as new editions, and thus
- not as faithful reproductions.
-
- In her talk on preservation, Patricia BATTIN struck an ecumenical and
- flexible note as she endorsed the creation and dissemination of a variety
- of types of digital copies. Do not be too narrow in defining what counts
- as a preservation element, BATTIN counseled; for the present, at least,
- digital copies made with preservation in mind cannot be as narrowly
- standardized as, say, microfilm copies with the same objective. Setting
- standards precipitously can inhibit creativity, but delay can result in
- chaos, she advised.
-
- In part, BATTIN's position reflected the unsettled nature of image-format
- standards, and attendees could hear echoes of this unsettledness in the
- comments of various speakers. For example, Jean BARONAS reviewed the
- status of several formal standards moving through committees of experts;
- and Clifford LYNCH encouraged the use of a new guideline for transmitting
- document images on Internet. Testimony from participants in the National
- Agricultural Library's (NAL) Text Digitization Program and LC's American
- Memory project highlighted some of the challenges to the actual creation
- or interchange of images, including difficulties in converting
- preservation microfilm to digital form. Donald WATERS reported on the
- progress of a master plan for a project at Yale University to convert
- books on microfilm to digital image sets, Project Open Book (POB).
-
- The Workshop offered rather less of an imaging practicum than planned,
- but "how-to" hints emerge at various points, for example, throughout
- KENNEY's presentation and in the discussion of arcana such as
- thresholding and dithering offered by George THOMA and FLEISCHHAUER.
-
- NOTES:
- (3) Although there is a sense in which any reproductions of
- historical materials preserve the human record, specialists in the
- field have developed particular guidelines for the creation of
- acceptable preservation copies.
-
- (4) Titles and affiliations of presenters are given at the
- beginning of their respective talks and in the Directory of
- Participants (Appendix III).
-
-
- THE MACHINE-READABLE TEXT: MARKUP AND USE
-
- The sections of the Workshop that dealt with machine-readable text tended
- to be more concerned with access and use than with preservation, at least
- in the narrow technical sense. Michael SPERBERG-McQUEEN made a forceful
- presentation on the Text Encoding Initiative's (TEI) implementation of
- the Standard Generalized Markup Language (SGML). His ideas were echoed
- by Susan HOCKEY, Elli MYLONAS, and Stuart WEIBEL. While the
- presentations made by the TEI advocates contained no practicum, their
- discussion focused on the value of the finished product, what the
- European Community calls reusability, but what may also be termed
- durability. They argued that marking up--that is, coding--a text in a
- well-conceived way will permit it to be moved from one computer
- environment to another, as well as to be used by various users. Two
- kinds of markup were distinguished: 1) procedural markup, which
- describes the features of a text (e.g., dots on a page), and 2)
- descriptive markup, which describes the structure or elements of a
- document (e.g., chapters, paragraphs, and front matter).
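-
- The distinction can be illustrated with a minimal sketch. The
- element names, the sample text, and both rendering functions are
- hypothetical; they are not TEI tags or any tool shown at the
- Workshop. The text is labeled once by structure (descriptive
- markup), and each environment applies its own presentation rules.
-
```python
# Descriptive markup: each piece of text is labeled by what it IS,
# not by how it should look. Element names here are illustrative.
document = [
    ("title", "Workshop on Electronic Texts"),
    ("paragraph", "Markup makes a text portable."),
]


def render_plain(doc):
    """Render for a plain-ASCII display: titles are shown in capitals."""
    lines = []
    for element, text in doc:
        if element == "title":
            lines.append(text.upper())
        else:
            lines.append(text)
    return "\n".join(lines)


def render_tagged(doc):
    """Render as SGML-style tags wrapped around each element's text."""
    return "\n".join(f"<{element}>{text}</{element}>" for element, text in doc)


print(render_plain(document))
print(render_tagged(document))
```
-
- Because the presentation decisions live in the renderers rather
- than in the text, the same encoded document can move between
- environments unchanged, which is the reusability the TEI advocates
- described.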
-
- The TEI proponents emphasized the importance of texts to scholarship.
- They explained how heavily coded (and thus analyzed and annotated) texts
- can underlie research, play a role in scholarly communication, and
- facilitate classroom teaching. SPERBERG-McQUEEN reminded listeners that
- a written or printed item (e.g., a particular edition of a book) is
- merely a representation of the abstraction we call a text. To concern
- ourselves with faithfully reproducing a printed instance of the text,
- SPERBERG-McQUEEN argued, is to concern ourselves with the representation
- of a representation ("images as simulacra for the text"). The TEI proponents'
- interest in images tends to focus on corollary materials for use in teaching,
- for example, photographs of the Acropolis to accompany a Greek text.
-
- By the end of the Workshop, SPERBERG-McQUEEN confessed to having been
- converted to a limited extent to the view that electronic images
- constitute a promising alternative to microfilming; indeed, an
- alternative probably superior to microfilming. But he was not convinced
- that electronic images constitute a serious attempt to represent text in
- electronic form. HOCKEY and MYLONAS also conceded that their experience
- at the Pierce Symposium the previous week at Georgetown University and
- the present conference at the Library of Congress had compelled them to
- reevaluate their perspective on the usefulness of text as images.
- Attendees could see that the text and image advocates were in
- constructive tension, so to say.
-
- Three non-TEI presentations described approaches to preparing
- machine-readable text that are less rigorous and thus less expensive. In
- the case of the Papers of George Washington, Dorothy TWOHIG explained
- that the digital version will provide a not-quite-perfect rendering of
- the transcribed text--some 135,000 documents--available for research
- during the decades while the perfect or print version is completed.
- Members of the American Memory team and the staff of NAL's Text
- Digitization Program (see below) also outlined a middle ground concerning
- searchable texts. In the case of American Memory, contractors produce
- texts with about 99-percent accuracy that serve as "browse" or
- "reference" versions of written or printed originals. End users who need
- faithful copies or perfect renditions must refer to accompanying sets of
- digital facsimile images or consult copies of the originals in a nearby
- library or archive. American Memory staff argued that the high cost of
- producing 100-percent accurate copies would prevent LC from offering
- access to large parts of its collections.
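-
- To give a feel for what 99-percent accuracy means in practice, the
- sketch below estimates the number of wrong characters on a page.
- It assumes the figure is measured per character and that a page
- holds 2,000 characters; neither assumption comes from the
- Workshop, and the function name is hypothetical.
-
```python
def expected_errors(accuracy_percent, characters):
    """Expected number of wrong characters at a given character accuracy."""
    # Work from the whole-number error percentage to limit float noise.
    return characters * (100 - accuracy_percent) / 100


CHARS_PER_PAGE = 2_000  # assumption for illustration, not a Workshop figure

errors_at_99 = expected_errors(99, CHARS_PER_PAGE)
errors_at_100 = expected_errors(100, CHARS_PER_PAGE)

print(errors_at_99, errors_at_100)
```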
-
-
- THE MACHINE-READABLE TEXT: METHODS OF CONVERSION
-
- Although the Workshop did not include a systematic examination of the
- methods for converting texts from paper (or from facsimile images) into
- machine-readable form, nevertheless, various speakers touched upon this
- matter. For example, WEIBEL reported that OCLC has experimented with a
- merging of multiple optical character recognition systems that will
- reduce errors from an unacceptable rate of 5 characters out of every
- 1,000 to an unacceptable rate of 2 characters out of every 1,000.
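-
- The summary does not say how OCLC merged the recognition systems.
- One common way to combine several OCR outputs is a character-level
- majority vote, sketched below as an illustration only. It assumes
- the outputs are already aligned and of equal length, which real
- OCR output generally is not; the function names and sample strings
- are hypothetical.
-
```python
from collections import Counter


def merge_by_vote(readings):
    """Majority vote, character by character, over equal-length readings."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this sketch assumes aligned, equal-length readings")
    merged = []
    for chars in zip(*readings):
        # Pick the most frequent character at this position; on a tie the
        # character from the earliest reading in the list wins.
        merged.append(Counter(chars).most_common(1)[0][0])
    return "".join(merged)


def errors_per_thousand(text, truth):
    """Character errors per 1,000 characters, against a known-good text."""
    wrong = sum(a != b for a, b in zip(text, truth))
    return 1000 * wrong / len(truth)


truth = "electronic text"
readings = ["electronic text", "e1ectronic text", "electronlc text"]
merged = merge_by_vote(readings)

print(merged, errors_per_thousand(merged, truth))
```
-
- Voting helps only where the engines make different mistakes; an
- error shared by most engines survives the merge.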
-
- Pamela ANDRE presented an overview of NAL's Text Digitization Program and
- Judith ZIDAR discussed the technical details. ZIDAR explained how NAL
- purchased hardware and software capable of performing optical character
- recognition (OCR) and text conversion and used its own staff to convert
- texts. The process, ZIDAR said, required extensive editing and project
- staff found themselves considering alternatives, including rekeying
- and/or creating abstracts or summaries of texts. NAL reckoned costs at
- $7 per page. By way of contrast, Ricky ERWAY explained that American
- Memory had decided from the start to contract out conversion to external
- service bureaus. The criteria used to select these contractors were cost
- and quality of results, as opposed to methods of conversion. ERWAY noted
- that historical documents or books often do not lend themselves to OCR.
- Bound materials represent a special problem. In her experience, quality
- control--inspecting incoming materials, counting errors in samples--posed
- the most time-consuming aspect of contracting out conversion. ERWAY
- reckoned American Memory's costs at $4 per page, but cautioned that fewer
- cost-elements had been included than in NAL's figure.
-
-
- OPTIONS FOR DISSEMINATION
-
- The topic of dissemination proper emerged at various points during the
- Workshop. At the session devoted to national and international computer
- networks, LYNCH, Howard BESSER, Ronald LARSEN, and Edwin BROWNRIGG
- highlighted the virtues of Internet today and of the network that will
- evolve from Internet. Listeners could discern in these narratives a
- vision of an information democracy in which millions of citizens freely
- find and use what they need. LYNCH noted that a lack of standards
- inhibits disseminating multimedia on the network, a topic also discussed
- by BESSER. LARSEN addressed the issues of network scalability and
- modularity and commented upon the difficulty of anticipating the effects
- of growth in orders of magnitude. BROWNRIGG talked about the ability of
- packet radio to provide certain links in a network without the need for
- wiring. However, the presenters also called attention to the
- shortcomings and incongruities of present-day computer networks. For
- example: 1) Network use is growing dramatically, but much network
- traffic consists of personal communication (E-mail). 2) Large bodies of
- information are available, but a user's ability to search across their
- entirety is limited. 3) There are significant resources for science and
- technology, but few network sources provide content in the humanities.
- 4) Machine-readable texts are commonplace, but the capability of the
- system to deal with images (let alone other media formats) lags behind.
- A glimpse of a multimedia future for networks, however, was provided by
- Maria LEBRON in her overview of the Online Journal of Current Clinical
- Trials (OJCCT), and the process of scholarly publishing on-line.
-
- The contrasting form of the CD-ROM disk was never systematically
- analyzed, but attendees could glean an impression from several of the
- show-and-tell presentations. The Perseus and American Memory examples
- demonstrated recently published disks, while the descriptions of the
- IBYCUS version of the Papers of George Washington and Chadwyck-Healey's
- Patrologia Latina Database (PLD) told of disks to come. According to
- Eric CALALUCA, PLD's principal focus has been on converting Jacques-Paul
- Migne's definitive collection of Latin texts to machine-readable form.
- Although everyone could share the network advocates' enthusiasm for an
- on-line future, the possibility of rolling up one's sleeves for a session
- with a CD-ROM containing both textual materials and a powerful retrieval
- engine made the disk seem an appealing vessel indeed. The overall
- discussion suggested that the transition from CD-ROM to on-line networked
- access may prove far slower and more difficult than has been anticipated.
-
-
- WHO ARE THE USERS AND WHAT DO THEY DO?
-
- Although concerned with the technicalities of production, the Workshop
- never lost sight of the purposes and uses of electronic versions of
- textual materials. As noted above, those interested in imaging discussed
- the problematical matter of digital preservation, while the TEI proponents
- described how machine-readable texts can be used in research. This latter
- topic received thorough treatment in the paper read by Avra MICHELSON.
- She placed the phenomenon of electronic texts within the context of
- broader trends in information technology and scholarly communication.
-
- Among other things, MICHELSON described on-line conferences that
- represent a vigorous and important intellectual forum for certain
- disciplines. Internet now carries more than 700 conferences, with about
- 80 percent of these devoted to topics in the social sciences and the
- humanities. Other scholars use on-line networks for "distance learning."
- Meanwhile, there has been a tremendous growth in end-user computing;
- professors today are less likely than their predecessors to ask the
- campus computer center to process their data. Electronic texts are one
- key to these sophisticated applications, MICHELSON reported, and more and
- more scholars in the humanities now work in an on-line environment.
- Toward the end of the Workshop, Michael LESK presented a corollary to
- MICHELSON's talk, reporting the results of an experiment that compared
- the work of one group of chemistry students using traditional printed
- texts and two groups using electronic sources. The experiment
- demonstrated that in the event one does not know what to read, one needs
- the electronic systems; the electronic systems hold no advantage at the
- moment if one knows what to read, but neither do they impose a penalty.
-
- DALY provided an anecdotal account of the revolutionizing impact of the
- new technology on his previous methods of research in the field of classics.
- His account, by extrapolation, served to illustrate in part the arguments
- made by MICHELSON concerning the positive effects of the sudden and radical
- transformation being wrought in the ways scholars work.
-
- Susan VECCIA and Joanne FREEMAN delineated the use of electronic
- materials outside the university. The most interesting aspect of their
- use, FREEMAN said, could be seen as a paradox: teachers in elementary
- and secondary schools requested access to primary source materials but,
- at the same time, found that "primariness" itself made these materials
- difficult for their students to use.
-
-
- OTHER TOPICS
-
- Marybeth PETERS reviewed copyright law in the United States and offered
- advice during a lively discussion of this subject. But uncertainty
- remains concerning the price of copyright in a digital medium, because a
- solution remains to be worked out concerning management and synthesis of
- copyrighted and out-of-copyright pieces of a database.
-
- As moderator of the final session of the Workshop, Prosser GIFFORD directed
- discussion to future courses of action and the potential role of LC in
- advancing them. Among the recommendations that emerged were the following:
-
- * Workshop participants should 1) begin to think about working
- with image material, but structure and digitize it in such a
- way that at a later stage it can be interpreted into text, and
- 2) find a common way to build text and images together so that
- they can be used jointly at some stage in the future, with
- appropriate network support, because that is how users will want
- to access these materials. The Library might encourage attempts
- to bring together people who are working on texts and images.
-
- * A network version of American Memory should be developed or
- consideration should be given to making the data in it
- available to people interested in doing network multimedia.
- Given the current dearth of digital data that is appealing and
- unencumbered by extremely complex rights problems, developing a
- network version of American Memory could do much to help make
- network multimedia a reality.
-
- * Concerning the thorny issue of electronic deposit, LC should
- initiate a catalytic process in terms of distributed
- responsibility, that is, bring together the distributed
- organizations and set up a study group to look at all the
- issues related to electronic deposit and see where we as a
- nation should move. For example, LC might attempt to persuade
- one major library in each state to deal with its state
- equivalent publisher, which might produce a cooperative project
- that would be equitably distributed around the country, and one
- in which LC would be dealing with a minimal number of publishers
- and minimal copyright problems. LC must also deal with the
- concept of on-line publishing, determining, among other things,
- how serials such as OJCCT might be deposited for copyright.
-
- * Since a number of projects are planning to carry out
- preservation by creating digital images that will end up in
- on-line or near-line storage at some institution, LC might play
- a helpful role, at least in the near term, by accelerating efforts
- to catalog that information into the Research Libraries Information
- Network (RLIN) and then into OCLC, so that it would be accessible.
- This would reduce the possibility of multiple institutions digitizing
- the same work.
-
-
- CONCLUSION
-
- The Workshop was valuable because it brought together partisans from
- various groups and provided an occasion to compare goals and methods.
- The more committed partisans frequently communicate with others in their
- groups, but less often across group boundaries. The Workshop was also
- valuable to attendees--including those involved with American Memory--who
- came less committed to particular approaches or concepts. These
- attendees learned a great deal, and plan to select and employ elements of
- imaging, text-coding, and networked distribution that suit their
- respective projects and purposes.
-
- Still, reality rears its ugly head: no breakthrough has been achieved.
- On the imaging side, one confronts a proliferation of competing
- data-interchange standards and a lack of consensus on the role of digital
- facsimiles in preservation. In the realm of machine-readable texts, one
- encounters a reasonably mature standard but methodological difficulties
- and high costs. These latter problems, of course, represent a special
- impediment to the desire, as it is sometimes expressed in the popular
- press, "to put the [contents of the] Library of Congress on line." In
- the words of one participant, there was "no solution to the economic
- problems--the projects that are out there are surviving, but it is going
- to be a lot of work to transform the information industry, and so far the
- investment to do that is not forthcoming" (LESK, per litteras).
-
-
- *** *** *** ****** *** *** ***
-
-
- PROCEEDINGS
-
-
- WELCOME
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- GIFFORD * Origin of Workshop in current Librarian's desire to make LC's
- collections more widely available * Desiderata arising from the prospect
- of greater interconnectedness *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- After welcoming participants on behalf of the Library of Congress,
- American Memory (AM), and the National Demonstration Lab, Prosser
- GIFFORD, director for scholarly programs, Library of Congress, located
- the origin of the Workshop on Electronic Texts in a conversation he had
- had considerably more than a year ago with Carl FLEISCHHAUER concerning
- some of the issues faced by AM. On the assumption that numerous other
- people were asking the same questions, the decision was made to bring
- together as many of these people as possible to ask the same questions
- together. In a deeper sense, GIFFORD said, the origin of the Workshop
- lay in the desire of the current Librarian of Congress, James H.
- Billington, to make the collections of the Library, especially those
- offering unique or unusual testimony on aspects of the American
- experience, available to a much wider circle of users than those few
- people who can come to Washington to use them. This meant that the
- emphasis of AM, from the outset, has been on archival collections of the
- basic material, and on making these collections themselves available,
- rather than selected or heavily edited products.
-
- From AM's emphasis followed the questions with which the Workshop began:
- who will use these materials, and in what form will they wish to use
- them? But an even larger issue deserving mention, in GIFFORD's view, was
- the phenomenal growth in Internet connectivity. He expressed the hope
- that the prospect of greater interconnectedness than ever before would
- lead to: 1) much more cooperative and mutually supportive endeavors; 2)
- development of systems of shared and distributed responsibilities to
- avoid duplication and to ensure accuracy and preservation of unique
- materials; and 3) agreement on the necessary standards and development of
- the appropriate directories and indices to make navigation
- straightforward among the varied resources that are, and increasingly
- will be, available. In this connection, GIFFORD requested that
- participants reflect from the outset upon the sorts of outcomes they
- thought the Workshop might have. Did those present constitute a group
- with sufficient common interests to propose a next step or next steps,
- and if so, what might those be? They would return to these questions the
- following afternoon.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- FLEISCHHAUER * Core of Workshop concerns preparation and production of
- materials * Special challenge in conversion of textual materials *
- Quality versus quantity * Do the several groups represented share common
- interests? *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress,
- emphasized that he would attempt to represent the people who perform some
- of the work of converting or preparing materials and that the core of
- the Workshop had to do with preparation and production. FLEISCHHAUER
- then drew a distinction between the long term, when many things would be
- available and connected in the ways that GIFFORD described, and the short
- term, in which AM not only has wrestled with the issue of what is the
- best course to pursue but also has faced a variety of technical
- challenges.
-
- FLEISCHHAUER remarked AM's endeavors to deal with a wide range of library
- formats, such as motion picture collections, sound-recording collections,
- and pictorial collections of various sorts, especially collections of
- photographs. In the course of these efforts, AM kept coming back to
- textual materials--manuscripts or rare printed matter, bound materials,
- etc. Text posed the greatest conversion challenge of all. Thus, the
- genesis of the Workshop, which reflects the problems faced by AM. These
- problems include physical problems. For example, those in the library
- and archive business deal with collections made up of fragile and rare
- manuscript items, bound materials, especially the notoriously brittle
- bound materials of the late nineteenth century. These are precious
- cultural artifacts, however, as well as interesting sources of
- information, and LC desires to retain and conserve them. AM needs to
- handle things without damaging them. Guillotining a book to run it
- through a sheet feeder must be avoided at all costs.
-
- Beyond physical problems, issues pertaining to quality arose. For
- example, the desire to provide users with a searchable text is affected
- by the question of an acceptable level of accuracy. One hundred percent
- accuracy is tremendously expensive. On the other hand, the output of
- optical character recognition (OCR) can be tremendously inaccurate.
- Although AM has attempted to find a middle ground, uncertainty persists
- as to whether or not it has discovered the right solution.
-
- Questions of quality arose concerning images as well. FLEISCHHAUER
- contrasted the extremely high level of quality of the digital images in
- the Cornell Xerox Project with AM's efforts to provide a browse-quality
- or access-quality image, as opposed to an archival or preservation image.
- FLEISCHHAUER therefore welcomed the opportunity to compare notes.
-
- FLEISCHHAUER observed in passing that conversations he had had about
- networks have begun to signal that for various forms of media a
- determination may be made that there is a browse-quality item, or a
- distribution-and-access-quality item that may coexist in some systems
- with a higher quality archival item that would be inconvenient to send
- through the network because of its size. FLEISCHHAUER referred, of
- course, to images more than to searchable text.
-
- As AM considered those questions, several conceptual issues arose: ought
- AM occasionally to reproduce materials entirely through an image set, at
- other times, entirely through a text set, and in some cases, a mix?
- There probably would be times when the historical authenticity of an
- artifact would require that its image be used. An image might be
- desirable as a recourse for users if one could not provide 100-percent
- accurate text. Again, AM wondered, as a practical matter, if a
- distinction could be drawn between unique items and rare printed matter
- that might exist in multiple collections--that is, in ten or fifteen
- libraries. In such
- cases, the need for perfect reproduction would be less than for unique
- items. Implicit in his remarks, FLEISCHHAUER conceded, was the admission
- that AM has been tilting strongly towards quantity and drawing back a
- little from perfect quality. That is, it seemed to AM that society would
- be better served if more things were distributed by LC--even if they were
- not quite perfect--than if fewer things, perfectly represented, were
- distributed. This was stated as a proposition to be tested, with
- responses to be gathered from users.
-
- In thinking about issues related to reproduction of materials and seeing
- other people engaged in parallel activities, AM deemed it useful to
- convene a conference. Hence, the Workshop. FLEISCHHAUER thereupon
- surveyed the several groups represented: 1) the world of images (image
- users and image makers); 2) the world of text and scholarship and, within
- this group, those concerned with language--FLEISCHHAUER confessed to finding
- delightful irony in the fact that some of the most advanced thinkers on
- computerized texts are those dealing with ancient Greek and Roman materials;
- 3) the network world; and 4) the general world of library science, which
- includes people interested in preservation and cataloging.
-
- FLEISCHHAUER concluded his remarks with special thanks to the David and
- Lucile Packard Foundation for its support of the meeting, the American
- Memory group, the Office for Scholarly Programs, the National
- Demonstration Lab, and the Office of Special Events. He expressed the
- hope that David Woodley Packard might be able to attend, noting that
- Packard's work and the work of the foundation had sponsored a number of
- projects in the text area.
-
- ******
-
- SESSION I. CONTENT IN A NEW FORM: WHO WILL USE IT AND WHAT WILL THEY DO?
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DALY * Acknowledgements * A new Latin authors disk * Effects of the new
- technology on previous methods of research *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Serving as moderator, James DALY acknowledged the generosity of all the
- presenters for giving of their time, counsel, and patience in planning
- the Workshop, as well as of members of the American Memory project and
- other Library of Congress staff, and the David and Lucile Packard
- Foundation and its executive director, Colburn S. Wilbur.
-
- DALY then recounted his visit in March to the Center for Electronic Texts
- in the Humanities (CETH) and the Department of Classics at Rutgers
- University, where an old friend, Lowell Edmunds, introduced him to the
- department's IBYCUS scholarly personal computer, and, in particular, the
- new Latin CD-ROM, containing, among other things, almost all classical
- Latin literary texts through A.D. 200. Packard Humanities Institute
- (PHI), Los Altos, California, released this disk late in 1991, with a
- nominal triennial licensing fee.
-
- Playing with the disk for an hour or so at Rutgers brought home to DALY
- at once the revolutionizing impact of the new technology on his previous
- methods of research. Had this disk been available two or three years
- earlier, DALY contended, when he was engaged in preparing a commentary on
- Book 10 of Virgil's Aeneid for Cambridge University Press, he would not
- have required a forty-eight-square-foot table on which to spread the
- numerous, most frequently consulted items, including some ten or twelve
- concordances to key Latin authors, an almost equal number of lexica to
- authors who lacked concordances, and where either lexica or concordances
- were lacking, numerous editions of authors antedating and postdating Virgil.
-
- Nor, when checking each of the average six to seven words contained in
- the Virgilian hexameter for its usage elsewhere in Virgil's works or
- other Latin authors, would DALY have had to maintain the laborious
- mechanical process of flipping through these concordances, lexica, and
- editions each time. Nor would he have had to frequent as often the
- Milton S. Eisenhower Library at the Johns Hopkins University to consult
- the Thesaurus Linguae Latinae. Instead of devoting countless hours, or
- the bulk of his research time, to gathering data concerning Virgil's use
- of words, DALY--now freed by PHI's Latin authors disk from the
- tyrannical, yet in some ways paradoxically happy scholarly drudgery--
- would have been able to devote that same bulk of time to analyzing and
- interpreting Virgilian verbal usage.
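- 
- The lookup that DALY described performing by hand can be sketched in a
- few lines of code. The following is an illustration only, not the PHI
- disk's software: the function name build_concordance and the sample
- lines (the opening of the Aeneid) are assumptions chosen for the example.
- The sketch records, for each word, every line in which it occurs.
- 
```python
import re
from collections import defaultdict


def build_concordance(lines):
    """Map each lowercased word to the (line_number, line) pairs containing it."""
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # set() ensures a line is listed once per word, even if the word repeats.
        for word in set(re.findall(r"[a-z]+", line.lower())):
            index[word].append((number, line))
    return index


# Sample data only: the first three lines of the Aeneid.
SAMPLE = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Italiam fato profugus Laviniaque venit",
    "litora, multum ille et terris iactatus et alto",
]

if __name__ == "__main__":
    concordance = build_concordance(SAMPLE)
    # Print every line in which "et" occurs, with its line number.
    for number, line in concordance["et"]:
        print(number, line)
```
- 
- This simple tokenizer treats "virumque" as a single word; a tool meant
- for real Latin scholarship would need to handle enclitics and inflected
- forms, which this sketch does not attempt.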
-
- Citing Theodore Brunner, Gregory Crane, Elli MYLONAS, and Avra MICHELSON,
- DALY argued that this reversal in his style of work, made possible by the
- new technology, would perhaps have resulted in better, more productive
- research. Indeed, even in the course of his browsing the Latin authors
- disk at Rutgers, its powerful search, retrieval, and highlighting
- capabilities suggested to him several new avenues of research into
- Virgil's use of sound effects. This anecdotal account, DALY maintained,
- may serve to illustrate in part the sudden and radical transformation
- being wrought in the ways scholars work.
-
- ******
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- MICHELSON * Elements related to scholarship and technology * Electronic
- texts within the context of broader trends within information technology
- and scholarly communication * Evaluation of the prospects for the use of
- electronic texts * Relationship of electronic texts to processes of
- scholarly communication in humanities research * New exchange formats
- created by scholars * Projects initiated to increase scholarly access to
- converted text * Trend toward making electronic resources available
- through research and education networks * Changes taking place in
- scholarly communication among humanities scholars * Network-mediated
- scholarship transforming traditional scholarly practices * Key
- information technology trends affecting the conduct of scholarly
- communication over the next decade * The trend toward end-user computing
- * The trend toward greater connectivity * Effects of these trends * Key
- transformations taking place * Summary of principal arguments *
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Avra MICHELSON, Archival Research and Evaluation Staff, National Archives
- and Records Administration (NARA), argued that establishing who will use
- electronic texts and what they will use them for involves a consideration
- of both information technology and scholarship trends. This
- consideration includes several elements related to scholarship and
- technology: 1) the key trends in information technology that are most
- relevant to scholarship; 2) the key trends in the use of currently
- available technology by scholars in the nonscientific community; and 3)
- the relationship between these two very distinct but interrelated trends.
- The investment in understanding this relationship being made by
- information providers, technologists, and public policy developers, as
- well as by scholars themselves, seems to be pervasive and growing,
- MICHELSON contended. She drew on collaborative work with Jeff Rothenberg
- on the scholarly use of technology.
-
- MICHELSON sought to place the phenomenon of electronic texts within the
- context of broader trends within information technology and scholarly
- communication. She argued that electronic texts are of most use to
- researchers to the extent that the researchers' working context (i.e.,
- their relevant bibliographic sources, collegial feedback, analytic tools,
- notes, drafts, etc.), along with their field's primary and secondary
- sources, also is accessible in electronic form and can be integrated in
- ways that are unique to the on-line environment.
-
- Evaluation of the prospects for the use of electronic texts includes two
- elements: 1) an examination of the ways in which researchers currently
- are using electronic texts along with other electronic resources, and 2)
- an analysis of key information technology trends that are affecting the
- long-term conduct of scholarly communication. MICHELSON limited her
- discussion of the use of electronic texts to the practices of humanists
- and noted that the scientific community was outside the panel's purview.
-
- MICHELSON examined the nature of the current relationship of electronic
- texts in particular, and electronic resources in general, to what she
- maintained were, essentially, five processes of scholarly communication
- in humanities research. Researchers 1) identify sources, 2) communicate
- with their colleagues, 3) interpret and analyze data, 4) disseminate
- their research findings, and 5) prepare curricula to instruct the next
- generation of scholars and students. This examination would produce a
- clearer understanding of the synergy among these five processes that
- fuels the tendency of the use of electronic resources for one process to
- stimulate their use for other processes of scholarly communication.
-
- For the first process of scholarly communication, the identification of
- sources, MICHELSON remarked the opportunity scholars now enjoy to
- supplement traditional word-of-mouth searches for sources among their
- colleagues with new forms of electronic searching. So, for example,
- instead of having to visit the library, researchers are able to explore
- descriptions of holdings in their offices. Furthermore, if their own
- institutions' holdings prove insufficient, scholars can access more than
- 200 major American library catalogues over Internet, including the
- universities of California, Michigan, Pennsylvania, and Wisconsin.
- Direct access to the bibliographic databases offers intellectual
- empowerment to scholars by presenting a comprehensive means of browsing
- through libraries from their homes and offices at their convenience.
-
- The second process of scholarly communication involves exchange among
- scholars. Beyond the most common methods of communication, scholars are
- using E-mail and a variety of new electronic communications formats
- derived from it for further academic interchange. E-mail exchanges are
- growing at an astonishing rate, reportedly 15 percent a month. They
- currently constitute approximately half the traffic on research and
- education networks. Moreover, the global spread of E-mail has been so
- rapid that it is now possible for American scholars to use it to
- communicate with colleagues in close to 140 other countries.
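- 
- As a rough check on what the reported rate implies, the arithmetic can
- be worked through as follows. The sketch assumes that the 15 percent
- monthly figure compounds steadily, which is an assumption made here for
- illustration and not a claim from MICHELSON's talk; the function names
- are likewise hypothetical.
- 
```python
import math

MONTHLY_GROWTH = 0.15  # reported rate; steady compounding is assumed here


def growth_factor(months, rate=MONTHLY_GROWTH):
    """Return the multiple by which traffic grows after `months` months."""
    return (1.0 + rate) ** months


def doubling_time_months(rate=MONTHLY_GROWTH):
    """Return the number of months needed for traffic to double."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"growth over 12 months: x{growth_factor(12):.2f}")
    print(f"doubling time: {doubling_time_months():.1f} months")
```
- 
- Under that assumption, traffic would grow a little more than fivefold in
- a year and would double roughly every five months.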
-
- Other new exchange formats created by scholars and operating on Internet
- include more than 700 conferences, with about 80 percent of these devoted
- to topics in the social sciences and humanities. The rate of growth of
- these scholarly electronic conferences also is astonishing. From 1990 to
- 1991, 200 new conferences were identified on Internet. From October 1991
- to June 1992, an additional 150 conferences in the social sciences and
- humanities were added to this directory of listings. Scholars have
- established conferences in virtually every field, within every different
- discipline. For example, there are currently close to 600 active social
- science and humanities conferences on topics such as art and
- architecture, ethnomusicology, folklore, Japanese culture, medical
- education, and gifted and talented education. The appeal to scholars of
- communicating through these conferences is that, unlike any other medium,
- electronic conferences today provide a forum for global communication
- with peers at the front end of the research process.
-
- Interpretation and analysis of sources constitutes the third process of
- scholarly communication that MICHELSON discussed in terms of texts and
- textual resources. The methods used to analyze sources fall somewhere on
- a continuum from quantitative analysis to qualitative analysis.
- Typically, evidence is culled and evaluated using methods drawn from both
- ends of this continuum. At one end, quantitative analysis involves the
- use of mathematical processes such as a count of frequencies and
- distributions of occurrences or, on a higher level, regression analysis.
- At the other end of the continuum, qualitative analysis typically
- involves nonmathematical processes oriented toward language
- interpretation or the building of theory. Aspects of this work involve
- the processing--either manual or computational--of large and sometimes
- massive amounts of textual sources, although the use of nontextual
- sources as evidence, such as photographs, sound recordings, film footage,
- and artifacts, is significant as well.
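- 
- The quantitative end of this continuum, a count of frequencies and
- distributions of occurrences, can be sketched briefly. The example below
- is a minimal illustration under stated assumptions: the function names
- and the sample sentence are hypothetical, and the tokenizer handles only
- unaccented Latin letters.
- 
```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def relative_distribution(counts):
    """Convert raw counts into each word's share of all tokens."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    sample = "the text and the image and the network"  # sample data only
    counts = word_frequencies(sample)
    shares = relative_distribution(counts)
    # List words from most to least frequent with their share of the text.
    for word, n in counts.most_common():
        print(f"{word:8s} {n:2d} {shares[word]:.3f}")
```
- 
- Counts of this kind are only the raw material; interpreting them remains
- the qualitative work described at the other end of the continuum.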
-
- Scholars have discovered that many of the methods of interpretation and
- analysis that are related to both quantitative and qualitative methods
- are processes that can be performed by computers. For example, computers
- can count. They can count brush strokes used in a Rembrandt painting or
- perform regression analysis for understanding cause and effect. By means
- of advanced technologies, computers can recognize patterns, analyze text,
- and model concepts. Furthermore, computers can complete these processes
- faster with more sources and with greater precision than scholars who
- must rely on manual interpretation of data. But if scholars are to use
- computers for these processes, source materials must be in a form
- amenable to computer-assisted analysis. For this reason many scholars,
- once they have identified the sources that are key to their research, are
- converting them to machine-readable form. Thus, a representative example
- of the numerous textual conversion projects organized by scholars around
- the world in recent years to support computational text analysis is the
- TLG, the Thesaurus Linguae Graecae. This project is devoted to
- converting the extant ancient texts of classical Greece. (Editor's note:
- according to the TLG Newsletter of May 1992, TLG was in use in thirty-two
- different countries. This figure updates MICHELSON's previous count by one.)
-
- The scholars performing these conversions have been asked to recognize
- that the electronic sources they are converting for one use possess value
- for other research purposes as well. As a result, during the past few
- years, humanities scholars have initiated a number of projects to
- increase scholarly access to converted text. So, for example, the Text
- Encoding Initiative (TEI), about which more is said later in the program,
- was established as an effort by scholars to determine standard elements
- and methods for encoding machine-readable text for electronic exchange.
- In a second effort to facilitate the sharing of converted text, scholars
- have created a new institution, the Center for Electronic Texts in the
- Humanities (CETH). The center estimates that there are 8,000 series of
- source texts in the humanities that have been converted to
- machine-readable form worldwide. CETH is undertaking an international
- search for converted text in the humanities, compiling it into an
- electronic library, and preparing bibliographic descriptions of the
- sources for the Research Libraries Information Network's (RLIN)
- machine-readable data file. The library profession has begun to initiate
- large conversion projects as well, such as American Memory.
-
- While scholars have been making converted text available to one another,
- typically on disk or on CD-ROM, the clear trend is toward making these
- resources available through research and education networks. Thus, the
- American and French Research on the Treasury of the French Language
- (ARTFL) and the Dante Project are already available on Internet.
- MICHELSON summarized this section on interpretation and analysis by
- noting that: 1) increasing numbers of humanities scholars in the library
- community are recognizing the importance to the advancement of
- scholarship of retrospective conversion of source materials in the arts
- and humanities; and 2) there is a growing realization that making the
- sources available on research and education networks maximizes their
- usefulness for the analysis performed by humanities scholars.
-
- The fourth process of scholarly communication is dissemination of
- research findings, that is, publication. Scholars are using existing
- research and education networks to engineer a new type of publication:
- scholar-controlled journals that are electronically produced and
- disseminated. Although such journals are still emerging as a
- communication format, their number has grown, from approximately twelve
- to thirty-six during the past year (July 1991 to June 1992). Most of
- these electronic scholarly journals are devoted to topics in the
- humanities. As with network conferences, scholarly enthusiasm for these
- electronic journals stems from the medium's unique ability to advance
- scholarship in a way that no other medium can do by supporting global
- feedback and interchange, practically in real time, early in the research
- process. Beyond scholarly journals, MICHELSON remarked the delivery of
- commercial full-text products, such as articles in professional journals,
- newsletters, magazines, wire services, and reference sources. These are
- being delivered via on-line local library catalogues, especially through
- CD-ROMs. Furthermore, according to MICHELSON, there is general optimism
- that the copyright and fees issues impeding the delivery of full text on
- existing research and education networks soon will be resolved.
-
- The final process of scholarly communication is curriculum development
- and instruction, and this involves the use of computer information
- technologies in two areas. The first is the development of
- computer-oriented instructional tools, which includes simulations,
- multimedia applications, and computer tools that are used to assist in
- the analysis of sources in the classroom, etc. The Perseus Project, a
- database that provides a multimedia curriculum on classical Greek
- civilization, is a good example of the way in which entire curricula are
- being recast using information technologies. It is anticipated that the
- current difficulty in exchanging computer-based instructional software
- electronically, which in turn makes it difficult for one scholar
- to build upon the work of others, will be resolved before too long.
- Stand-alone curricular applications that involve electronic text will be
- sharable through networks, reinforcing their significance as intellectual
- products as well as instructional tools.
-
- The second aspect of electronic learning involves the use of research and
- education networks for distance education programs. Such programs
- interactively link teachers with students in geographically scattered
- locations and rely on the availability of electronic instructional
- resources. Distance education programs are gaining wide appeal among
- state departments of education because of their demonstrated capacity to
- bring advanced specialized course work and an array of experts to many
- classrooms. A recent report found that at least 32 states operated at
- least one statewide network for education in 1991, with networks under
- development in many of the remaining states.
-
- MICHELSON summarized this section by noting two striking changes taking
- place in scholarly communication among humanities scholars. First is the
- extent to which electronic text in particular, and electronic resources
- in general, are being infused into each of the five processes described
- above. As mentioned earlier, there is a certain synergy at work here.
- The use of electronic resources for one process tends to stimulate their
- use for other processes, because the chief course of movement is toward a
- comprehensive on-line working context for humanities scholars that
- includes on-line availability of key bibliographies, scholarly feedback,
- sources, analytical tools, and publications. MICHELSON noted further
- that the movement toward a comprehensive on-line working context for
- humanities scholars is not new. In fact, it has been underway for more
- than forty years in the humanities, since Father Roberto Busa began
- developing an electronic concordance of the works of Saint Thomas Aquinas
- in 1949. What we are witnessing today, MICHELSON contended, is not the
- beginning of this on-line transition but, for at least some humanities
- scholars, the turning point in the transition from a print to an
- electronic working context. Coinciding with the on-line transition, the
- second striking change is the extent to which research and education
- networks are becoming the new medium of scholarly communication. The
- existing Internet and the pending National Research and Education Network
- (NREN) represent the new meeting ground where scholars are going for
- bibliographic information, scholarly dialogue and feedback, the most
- current publications in their field, and high-level educational
- offerings. Traditional scholarly practices are undergoing tremendous
- transformations as a result of the emergence and growing prominence of
- what is called network-mediated scholarship.
-
- MICHELSON next turned to the second element of the framework she proposed
- at the outset of her talk for evaluating the prospects for electronic
- text, namely the key information technology trends affecting the conduct
- of scholarly communication over the next decade: 1) end-user computing
- and 2) connectivity.
-
- End-user computing means that the person touching the keyboard, or
- performing computations, is the same as the person who initiates or
- consumes the computation. The emergence of personal computers, along
- with a host of other forces, such as ubiquitous computing, advances in
- interface design, and the on-line transition, is prompting the consumers
- of computation to do their own computing, and is thus rendering obsolete
- the traditional distinction between end users and ultimate users.
-
- The trend toward end-user computing is significant to consideration of
- the prospects for electronic texts because it means that researchers are
- becoming more adept at doing their own computations and, thus, more
- competent in the use of electronic media. As researchers bypass
- programmer intermediaries, computation is becoming central to their
- thought process. This direct involvement in computing is changing the
- researcher's perspective on the nature of research itself, that is, the
- kinds of questions that can be posed, the analytical methodologies that
- can be used, the types and amount of sources that are appropriate for
- analyses, and the form in which findings are presented. The trend toward
- end-user computing means that, increasingly, electronic media and
- computation are being infused into all processes of humanities
- scholarship, inspiring remarkable transformations in scholarly
- communication.
-
- The trend toward greater connectivity suggests that researchers are using
- computation increasingly in network environments. Connectivity is
- important to scholarship because it erases the distance that separates
- students from teachers and scholars from their colleagues, while allowing
- users to access remote databases, share information in many different
- media, connect to their working context wherever they are, and
- collaborate in all phases of research.
-
- The combination of the trend toward end-user computing and the trend
- toward connectivity suggests that the scholarly use of electronic
- resources, already evident among some researchers, will soon become an
- established feature of scholarship. The effects of these trends, along
- with ongoing changes in scholarly practices, point to a future in which
- humanities researchers will use computation and electronic communication
- to help them formulate ideas, access sources, perform research,
- collaborate with colleagues, seek peer review, publish and disseminate
- results, and engage in many other professional and educational activities.
-
- In summary, MICHELSON emphasized four points: 1) A portion of humanities
- scholars already consider electronic texts the preferred format for
- analysis and dissemination. 2) Scholars are using these electronic
- texts, in conjunction with other electronic resources, in all the
- processes of scholarly communication. 3) The humanities scholars'
- working context is in the process of changing from print technology to
- electronic technology, in many ways mirroring transformations that have
- occurred or are occurring within the scientific community. 4) These
- changes are occurring in conjunction with the development of a new
- communication medium: research and education networks that are
- characterized by their capacity to advance scholarship in a wholly unique
- way.
-
- MICHELSON also reiterated her three principal arguments: 1) Electronic
- texts are best understood in terms of the relationship to other
- electronic resources and the growing prominence of network-mediated
- scholarship. 2) The prospects for electronic texts lie in their capacity
- to be integrated into the on-line network of electronic resources that
- comprise the new working context for scholars. 3) Retrospective conversion
- of portions of the scholarly record should be a key strategy as information
- providers respond to changes in scholarly communication practices.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- VECCIA * AM's evaluation project and public users of electronic resources
- * AM and its design * Site selection and evaluating the Macintosh
- implementation of AM * Characteristics of the six public libraries
- selected * Characteristics of AM's users in these libraries * Principal
- ways AM is being used *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Susan VECCIA, team leader, and Joanne FREEMAN, associate coordinator,
- American Memory, Library of Congress, gave a joint presentation. First,
- by way of introduction, VECCIA explained her and FREEMAN's roles in
- American Memory (AM). Serving principally as an observer, VECCIA has
- assisted with the evaluation project of AM, placing AM collections in a
- variety of different sites around the country and helping to organize and
- implement that project. FREEMAN has been an associate coordinator of AM
- and has been involved principally with the interpretative materials,
- preparing some of the electronic exhibits and printed historical
- information that accompanies AM and that is requested by users. VECCIA
- and FREEMAN shared anecdotal observations concerning AM with public users
- of electronic resources. Notwithstanding a fairly structured evaluation
- in progress, both VECCIA and FREEMAN chose not to report on specifics in
- terms of numbers, etc., because they felt it was too early in the
- evaluation project to do so.
-
- AM is an electronic archive of primary source materials from the Library
- of Congress, selected collections representing a variety of formats--
- photographs, graphic arts, recorded sound, motion pictures, broadsides,
- and soon, pamphlets and books. In terms of the design of this system,
- the interpretative exhibits have been kept separate from the primary
- resources, with good reason. Accompanying this collection are printed
- documentation and user guides, as well as guides that FREEMAN prepared for
- teachers so that they may begin using the content of the system at once.
-
- VECCIA described the evaluation project before talking about the public
- users of AM, limiting her remarks to public libraries, because FREEMAN
- would talk more specifically about schools from kindergarten to twelfth
- grade (K-12). Having started in spring 1991, the evaluation currently
- involves testing of the Macintosh implementation of AM. Since the
- primary goal of this evaluation is to determine the most appropriate
- audience or audiences for AM, very different sites were selected. This
- makes evaluation difficult because of the varying degrees of technology
- literacy among the sites. AM is situated in forty-four locations, of
- which six are public libraries and sixteen are schools. Represented
- among the schools are elementary, junior high, and high schools.
- District offices also are involved in the evaluation, which will
- conclude in summer 1993.
-
- VECCIA focused the remainder of her talk on the six public libraries, one
- of which doubles as a state library. They represent a range of
- geographic areas and a range of demographic characteristics. For
- example, three are located in urban settings, two in rural settings, and
- one in a suburban setting. A range of technical expertise is to be found
- among these facilities as well. For example, one is an "Apple library of
- the future," while two others are rural one-room libraries--in one, AM
- sits at the front desk next to a tractor manual.
-
- All public libraries have been extremely enthusiastic, supportive, and
- appreciative of the work that AM has been doing. VECCIA characterized
- various users: Most users in public libraries describe themselves as
- general readers; of the students who use AM in the public libraries,
- those in fourth grade and above seem most interested. Public libraries
- in rural sites tend to attract retired people, who have been highly
- receptive to AM. Users tend to fall into two additional categories:
- people interested in the content and historical connotations of these
- primary resources, and those fascinated by the technology. The format
- receiving the most comments has been motion pictures. The adult users in
- public libraries are more comfortable with IBM computers, whereas young
- people seem comfortable with either IBM or Macintosh, although most of
- them seem to come from a Macintosh background. This same tendency is
- found in the schools.
-
- What kinds of things do users do with AM? In a public library there are
- two main goals or ways that AM is being used: as an individual learning
- tool, and as a leisure activity. Adult learning was one area that VECCIA
- would highlight as a possible application for a tool such as AM. She
- described a patron of a rural public library who comes in every day on
- his lunch hour and literally reads AM, methodically going through the
- collection image by image. At the end of his hour he makes an electronic
- bookmark, puts it in his pocket, and returns to work. The next day he
- comes in and resumes where he left off. Interestingly, this man had
- never been in the library before he used AM. In another small, rural
- library, the coordinator reports that AM is a popular activity for some
- of the older, retired people in the community, who ordinarily would not
- use "those things"--computers. Another example of adult learning in
- public libraries is book groups, one of which, in particular, is using AM
- as part of its reading on industrialization, integration, and urbanization
- in the early 1900s.
-
- One library reports that a family is using AM to help educate their
- children. In another instance, individuals from a local museum came in
- to use AM to prepare an exhibit on toys of the past. These two examples
- emphasize the mission of the public library as a cultural institution,
- reaching out to people who do not have the same resources available to
- those who live in a metropolitan area or have access to a major library.
- One rural library reports that junior high school students in large
- numbers came in one afternoon to use AM for entertainment. A number of
- public libraries reported great interest among postcard collectors in the
- Detroit collection, which was essentially a collection of images used on
- postcards around the turn of the century. Train buffs are similarly
- interested because that was a time of great interest in railroading.
- People, it was found, relate to things that they know of firsthand. For
- example, in both rural public libraries where AM was made available,
- observers reported that the older people with personal remembrances of
- the turn of the century were gravitating to the Detroit collection.
- These examples served to underscore MICHELSON's observation re the
- integration of electronic tools and ideas--that people learn best when
- the material relates to something they know.
-
- VECCIA made the final point that in many cases AM serves as a
- public-relations tool for the public libraries that are testing it. In
- one case, AM is being used as a vehicle to secure additional funding for
- the library. In another case, AM has served as an inspiration to the
- staff of a major local public library in the South to think about ways to
- make its own collection of photographs more accessible to the public.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- FREEMAN * AM and archival electronic resources in a school environment *
- Questions concerning context * Questions concerning the electronic format
- itself * Computer anxiety * Access and availability of the system *
- Hardware * Strengths gained through the use of archival resources in
- schools *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Reiterating an observation made by VECCIA, that AM is an archival
- resource made up of primary materials with very little interpretation,
- FREEMAN stated that the project has attempted to bridge the gap between
- these bare primary materials and a school environment, and in that cause
- has created guided introductions to AM collections. Loud demand from the
- educational community, chiefly from teachers working with the upper
- grades of elementary school through high school, greeted the announcement
- that AM would be tested around the country.
-
- FREEMAN reported not only on what was learned about AM in a school
- environment, but also on several universal questions that were raised
- concerning archival electronic resources in schools. She discussed
- several strengths of this type of material in a school environment as
- opposed to a highly structured resource that offers a limited number of
- paths to follow.
-
- FREEMAN first raised several questions about using AM in a school
- environment. There is often some difficulty in developing a sense of
- what the system contains. Many students sit down at a computer resource
- and assume that, because AM comes from the Library of Congress, all of
- American history is now at their fingertips. As a result of that sort of
- mistaken judgment, some students are known to conclude that AM contains
- nothing of use to them when they look for one or two things and do not
- find them. It is difficult to discover that middle ground where one has
- a sense of what the system contains. Some students grope toward the idea
- of an archive, a new idea to them, since they have not previously
- experienced what it means to have access to a vast body of somewhat
- random information.
-
- Other questions raised by FREEMAN concerned the electronic format itself.
- For instance, in a school environment it is often difficult both for
- teachers and students to gain a sense of what it is they are viewing.
- They understand that it is a visual image, but they do not necessarily
- know that it is a postcard from the turn of the century, a panoramic
- photograph, or even machine-readable text of an eighteenth-century
- broadside, a twentieth-century printed book, or a nineteenth-century
- diary. That distinction is often difficult for people in a school
- environment to grasp. Because of that, it occasionally becomes difficult
- to draw conclusions from what one is viewing.
-
- FREEMAN also noted the obvious fear of the computer, which constitutes a
- difficulty in using an electronic resource. Though students in general
- did not suffer from this anxiety, several older students feared that they
- were computer-illiterate, an assumption that became self-fulfilling when
- they searched for something but failed to find it. FREEMAN said she
- believed that some teachers also fear computer resources, because they
- believe they lack complete control. FREEMAN related the example of
- teachers shooing away students because it was not their time to use the
- system. This was a case in which the situation had to be extremely
- structured so that the teachers would not feel that they had lost their
- grasp on what the system contained.
-
- A final question raised by FREEMAN concerned access and availability of
- the system. She noted the occasional existence of a gap in communication
- between school librarians and teachers. Often AM sits in a school
- library and the librarian is the person responsible for monitoring the
- system. Teachers do not always take into their world new library
- resources about which the librarian is excited. Indeed, at the sites
- where AM had been used most effectively within a library, the librarian
- was required to go to specific teachers and instruct them in its use. As
- a result, several AM sites will have in-service sessions over a summer,
- in the hope that perhaps, with a more individualized link, teachers will
- be more likely to use the resource.
-
- A related issue in the school context concerned the number of
- workstations available at any one location. Centralization of equipment
- at the district level, with teachers invited to download things and walk
- away with them, proved unsuccessful because the hours these offices were
- open were also school hours.
-
- Another issue was hardware. As VECCIA observed, a range of sites exists,
- some technologically advanced and others essentially acquiring their
- first computer for the primary purpose of using it in conjunction with
- AM's testing. Users at technologically sophisticated sites want even
- more sophisticated hardware, so that they can perform even more
- sophisticated tasks with the materials in AM. But once they acquire a
- newer piece of hardware, they must learn how to use that also; at an
- unsophisticated site it takes an extremely long time simply to become
- accustomed to the computer, not to mention the program offered with the
- computer. All of these small issues raise one large question, namely,
- are systems like AM truly rewarding in a school environment, or do they
- simply act as innovative toys that do little more than spark interest?
-
- FREEMAN contended that the evaluation project has revealed several strengths
- that were gained through the use of archival resources in schools, including:
-
- * Psychic rewards from using AM as a vast, rich database, with
- teachers assigning various projects to students--oral presentations,
- written reports, a documentary, a turn-of-the-century newspaper--
- projects that start with the materials in AM but are completed using
- other resources; AM thus is used as a research tool in conjunction
- with other electronic resources, as well as with books and items in
- the library where the system is set up.
-
- * Students are acquiring computer literacy in a humanities context.
-
- * This sort of system is overcoming the isolation between disciplines
- that often exists in schools. For example, many English teachers are
- requiring their students to write papers on historical topics
- represented in AM. Numerous teachers have reported that their
- students are learning critical thinking skills using the system.
-
- * On a broader level, AM is introducing primary materials, not only
- to students but also to teachers, in an environment where often
- simply none exist--an exciting thing for the students because it
- helps them learn to conduct research, to interpret, and to draw
- their own conclusions. In learning to conduct research and what it
- means, students are motivated to seek knowledge. That relates to
- another positive outcome--a high level of personal involvement of
- students with the materials in this system and greater motivation to
- conduct their own research and draw their own conclusions.
-
- * Perhaps the most ironic strength of these kinds of archival
- electronic resources is that many of the teachers AM interviewed
- were desperate, it is no exaggeration to say, not only for primary
- materials but for unstructured primary materials. These would, they
- thought, foster personally motivated research, exploration, and
- excitement in their students. Indeed, these materials have done
- just that. Ironically, however, this lack of structure produces
- some of the confusion to which the newness of these kinds of
- resources may also contribute. The key to effective use of archival
- products in a school environment is a clear, effective introduction
- to the system and to what it contains.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Nothing known, quantitatively, about the number of
- humanities scholars who must see the original versus those who would
- settle for an edited transcript, or about the ways in which humanities
- scholars are using information technology * Firm conclusions concerning
- the manner and extent of the use of supporting materials in print
- provided by AM to await completion of evaluative study * A listener's
- reflections on additional applications of electronic texts * Role of
- electronic resources in teaching elementary research skills to students *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the discussion that followed the presentations by MICHELSON,
- VECCIA, and FREEMAN, additional points emerged.
-
- LESK asked if MICHELSON could give any quantitative estimate of the
- number of humanities scholars who must see or want to see the original,
- or the best possible version of the material, versus those who typically
- would settle for an edited transcript. While unable to provide a figure,
- she offered her impressions as an archivist who has done some reference
- work and has discussed this issue with other archivists who perform
- reference, that those who use archives and those who use primary sources
- for what would be considered very high-level scholarly research, as
- opposed to, say, undergraduate papers, were few in number, especially
- given the public interest in using primary sources to conduct
- genealogical or avocational research and the kind of professional
- research done by people in private industry or the federal government.
- More important in MICHELSON's view was that, quantitatively, nothing is
- known about the ways in which, for example, humanities scholars are using
- information technology. No studies exist to offer guidance in creating
- strategies. The most recent study was conducted in 1985 by the American
- Council of Learned Societies (ACLS), and what it showed was that 50
- percent of humanities scholars at that time were using computers. That
- constitutes the extent of our knowledge.
-
- Concerning AM's strategy for orienting people toward the scope of
- electronic resources, FREEMAN could offer no hard conclusions at this
- point, because she and her colleagues were still waiting to see,
- particularly in the schools, what has been made of their efforts. Within
- the system, however, AM has provided what are called electronic exhibits--
- such as introductions to time periods and materials--and these are
- intended to offer a student user a sense of what a broadside is and what
- it might tell her or him. But FREEMAN conceded that the project staff
- would have to talk with students next year, after teachers have had a
- summer to use the materials, and attempt to discover what the students
- were learning from the materials. In addition, FREEMAN described
- supporting materials in print provided by AM at the request of local
- teachers during a meeting held at LC. These included time lines,
- bibliographies, and other materials that could be reproduced on a
- photocopier in a classroom. Teachers could walk away with and use these,
- and in this way gain a better understanding of the contents. But again,
- reaching firm conclusions concerning the manner and extent of their use
- would have to wait until next year.
-
- As to the changes she saw occurring at the National Archives and Records
- Administration (NARA) as a result of the increasing emphasis on
- technology in scholarly research, MICHELSON stated that NARA at this
- point was absorbing the report by her and Jeff Rothenberg addressing
- strategies for the archival profession in general, although not for the
- National Archives specifically. NARA is just beginning to establish its
- role and what it can do. In terms of changes and initiatives that NARA
- can take, no clear response could be given at this time.
-
- GREENFIELD remarked on two trends mentioned in the session. Reflecting on
- DALY's opening comments on how he could have used a Latin collection of
- text in an electronic form, he said that at first he thought most scholars
- would be unwilling to do that. But as he thought of that in terms of the
- original meaning of research--that is, having already mastered these texts,
- researching them for critical and comparative purposes--for the first time,
- the electronic format made a lot of sense. GREENFIELD could envision
- growing numbers of scholars learning the new technologies for that very
- aspect of their scholarship and for convenience's sake.
-
- Listening to VECCIA and FREEMAN, GREENFIELD thought of an additional
- application of electronic texts. He realized that AM could be used as a
- guide to lead someone to original sources. Students cannot be expected
- to have mastered these sources, things they have never known about
- before. Thus, AM is leading them, in theory, to a vast body of
- information and giving them a superficial overview of it, enabling them
- to select parts of it. GREENFIELD asked if any evidence exists that this
- resource will indeed teach the new user, the K-12 students, how to do
- research. Scholars already know how to do research and are applying
- these new tools. But he wondered why students would go beyond picking
- out things that were most exciting to them.
-
- FREEMAN conceded the correctness of GREENFIELD's observation as applied
- to a school environment. The risk is that a student would sit down at a
- system, play with it, find some things of interest, and then walk away.
- But in the relatively controlled situation of a school library, much will
- depend on the instructions a teacher or a librarian gives a student. She
- viewed the situation not as one of fine-tuning research skills but of
- involving students at a personal level in understanding and researching
- things. Given the guidance one can receive at school, it then becomes
- possible to teach elementary research skills to students, which in fact
- one particular librarian said she was teaching her fifth graders.
- FREEMAN concluded that introducing the idea of following one's own path
- of inquiry, which is essentially what research entails, involves more
- than teaching specific skills. To these comments VECCIA added the
- observation that the individual teacher and the use of a creative
- resource, rather than AM itself, seemed to make the key difference.
- Some schools and some teachers are making excellent use of the nature
- of critical thinking and teaching skills, she said.
-
- Concurring with these remarks, DALY closed the session with the thought that
- the more that producers produced for teachers and for scholars to use with
- their students, the more successful their electronic products would prove.
-
- ******
-
- SESSION II. SHOW AND TELL
-
- Jacqueline HESS, director, National Demonstration Laboratory, served as
- moderator of the "show-and-tell" session. She noted that a
- question-and-answer period would follow each presentation.
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- MYLONAS * Overview and content of Perseus * Perseus' primary materials
- exist in a system-independent, archival form * A concession * Textual
- aspects of Perseus * Tools to use with the Greek text * Prepared indices
- and full-text searches in Perseus * English-Greek word search leads to
- close study of words and concepts * Navigating Perseus by tracing down
- indices * Using the iconography to perform research *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Elli MYLONAS, managing editor, Perseus Project, Harvard University, first
- gave an overview of Perseus, a large, collaborative effort based at
- Harvard University but with contributors and collaborators located at
- numerous universities and colleges in the United States (e.g., Bowdoin,
- Maryland, Pomona, Chicago, Virginia). Funded primarily by the
- Annenberg/CPB Project, with additional funding from Apple, Harvard, and
- the Packard Humanities Institute, among others, Perseus is a multimedia,
- hypertextual database for teaching and research on classical Greek
- civilization, which was released in February 1992 in version 1.0 and
- distributed by Yale University Press.
-
- Consisting entirely of primary materials, Perseus includes ancient Greek
- texts and translations of those texts; catalog entries--that is, museum
- catalog entries, not library catalog entries--on vases, sites, coins,
- sculpture, and archaeological objects; maps; and a dictionary, among
- other sources. The objects for which catalog entries exist are
- accompanied by thousands of color images, which
- constitute a major feature of the database. Perseus contains
- approximately 30 megabytes of text, an amount that will double in
- subsequent versions. In addition to these primary materials, the Perseus
- Project has been building tools for using them, making access and
- navigation easier, the goal being to build part of the electronic
- environment discussed earlier in the morning in which students or
- scholars can work with their sources.
-
- The demonstration of Perseus will show only a fraction of the real work
- that has gone into it, because the project had to face the dilemma of
- what to enter when putting something into machine-readable form: should
- one aim for very high quality or make concessions in order to get the
- material in? Since Perseus decided to opt for very high quality, all of
- its primary materials exist in a system-independent--insofar as it is
- possible to be system-independent--archival form. Deciding what that
- archival form would be and attaining it required much work and thought.
- For example, all the texts are marked up in SGML, which will be made
- compatible with the guidelines of the Text Encoding Initiative (TEI) when
- they are issued.
-
- Drawings are PostScript files, not meeting international standards, but
- at least designed to go across platforms. Images, or rather the real
- archival forms, consist of the best available slides, which are being
- digitized. Much of the catalog material exists in database form--a form
- that the average user could use, manipulate, and display on a personal
- computer, but only at great cost. Thus, this is where the concession
- comes in: All of this rich, well-marked-up information is stripped of
- much of its content; the images are converted into bit-maps and the text
- into small formatted chunks. All this information can then be imported
- into HyperCard and run on a mid-range Macintosh, which is what Perseus
- users have. This fact has made it possible for Perseus to attain wide
- use fairly rapidly. Without those archival forms the HyperCard version
- being demonstrated could not be made easily, and the project could not
- have the potential to move to other forms and machines and software as
- they appear, none of which information is in Perseus on the CD.
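-
- The concession described above, in which richly marked-up archival text
- is reduced to small plain chunks for delivery, can be sketched roughly
- as follows. This is only an illustration in Python and not the Perseus
- conversion process: the sample markup, its tag names, and the helper
- names strip_markup and chunk_lines are all hypothetical.
-
```python
import re

# Hypothetical sample of marked-up archival text. The tag names and the
# placeholder content are illustrative only, not taken from Perseus.
ARCHIVAL = (
    '<div type="speech"><speaker>SPEAKER A</speaker>'
    '<l n="1">first line of text</l>'
    '<l n="2">second line of text</l>'
    '<l n="3">third line of text</l></div>'
)


def strip_markup(marked_up: str) -> list:
    """Return the text of each <l> element, discarding every tag."""
    return re.findall(r"<l\b[^>]*>(.*?)</l>", marked_up)


def chunk_lines(lines: list, size: int) -> list:
    """Group lines into display chunks of at most `size` lines each."""
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


chunks = chunk_lines(strip_markup(ARCHIVAL), size=2)
```
-
- In this sketch the speaker name and the line numbers present in the
- source do not survive into the chunks. That is the kind of loss the
- delivery format accepts, while the archival form remains the place from
- which other formats could later be generated.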
-
- Of the numerous multimedia aspects of Perseus, MYLONAS focused on the
- textual. Part of what makes Perseus such a pleasure to use, MYLONAS
- said, is this effort at seamless integration and the ability to move
- around both visual and textual material. Perseus also made the decision
- not to attempt to interpret its material any more than one interprets by
- selecting. But, MYLONAS emphasized, Perseus is not courseware: No
- syllabus exists. There is no effort to define how one teaches a topic
- using Perseus, although the project may eventually collect papers by
- people who have used it to teach. Rather, Perseus aims to provide
- primary material in a kind of electronic library, an electronic sandbox,
- so to say, in which students and scholars who are working on this
- material can explore by themselves. With that, MYLONAS demonstrated
- Perseus, beginning with the Perseus gateway, the first thing one sees
- upon opening Perseus--an effort in part to solve the contextualizing
- problem--which tells the user what the system contains.
-
- MYLONAS demonstrated only a very small portion, beginning with primary
- texts and running off the CD-ROM. Having selected Aeschylus' Prometheus
- Bound, which was viewable in Greek and English pretty much in the same
- segments together, MYLONAS demonstrated tools to use with the Greek text,
- something not possible with a book: looking up the dictionary entry form
- of an unfamiliar word in Greek after subjecting it to Perseus'
- morphological analysis for all the texts. After finding out about a
- word, a user may then decide to see if it is used anywhere else in Greek.
- Because vast amounts of indexing support all of the primary material, one
- can find out where else all forms of a particular Greek word appear--
- often not a trivial matter because Greek is highly inflected. Further,
- since the story of Prometheus has to do with the origins of sacrifice, a
- user may wish to study and explore sacrifice in Greek literature; by
- typing sacrifice into a small window, a user goes to the English-Greek
- word list--something one cannot do without the computer (Perseus has
- indexed the definitions of its dictionary)--where the string sacrifice
- appears in the definitions of sixty-five words. One may then find out
- where any of those words is used in the work(s) of a particular author.
- The English definitions are not lemmatized.
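-
- The English-Greek lookup described above amounts to searching the
- dictionary's definitions for a plain string. A minimal sketch in Python
- follows, assuming a toy dictionary of transliterated headwords; the
- entries, the glosses, and the function name english_to_greek are
- illustrative assumptions and are not Perseus data or code.
-
```python
# Toy dictionary: transliterated headwords with English definitions.
# These entries are illustrative placeholders, not Perseus data.
DICTIONARY = {
    "thuo": "to offer by burning, to sacrifice",
    "thusia": "an offering, a sacrifice",
    "hiereion": "a victim, an animal for sacrifice",
    "hieros": "sacred, holy; sacrificial",
    "bomos": "a raised platform, an altar",
}


def english_to_greek(term, dictionary):
    """Return headwords whose definition contains `term` as a plain string.

    No lemmatization is applied to the English side, so only definitions
    containing the exact string match.
    """
    needle = term.lower()
    return sorted(
        head for head, gloss in dictionary.items() if needle in gloss.lower()
    )


matches = english_to_greek("sacrifice", DICTIONARY)
```
-
- Because the English side is matched as a bare string, the toy entry
- glossed "sacrificial" is not returned for the query sacrifice. This is
- the practical consequence of definitions that are not lemmatized.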
-
- All of the indices driving this kind of usage were originally devised for
- speed, MYLONAS observed; in other words, all that kind of information--
- all forms of all words, where they exist, the dictionary form they belong
- to--were collected into databases, which will expedite searching. Then
- it was discovered that one can do things searching in these databases
- that could not be done searching in the full texts. Thus, although there
- are full-text searches in Perseus, much of the work is done behind the
- scenes, using prepared indices. Re the indexing that is done behind the
- scenes, MYLONAS pointed out that without the SGML forms of the text, it
- could not be done effectively. Much of this indexing is based on the
- structures that are made explicit by the SGML tagging.
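-
- The kind of prepared index described above can be pictured with a short
- Python sketch. It is an illustration only and not Perseus' actual data
- model: the names FORM_TO_LEMMA, TOKENS, DEFINITIONS, and both helper
- functions are hypothetical, the Greek forms are transliterated samples,
- and the citations are placeholders rather than real passages.
-
```python
from collections import defaultdict

# Hypothetical sample data: transliterated forms mapped to dictionary forms.
FORM_TO_LEMMA = {
    "thuo": "thuo",
    "ethusa": "thuo",
    "thusia": "thusia",
    "thusias": "thusia",
}

# (inflected form, citation) pairs such as tagged texts might yield.
# The citations are placeholders, not real references.
TOKENS = [
    ("ethusa", "text-A 12"),
    ("thusias", "text-A 40"),
    ("thuo", "text-B 7"),
]

# English definitions, searched as plain strings (not lemmatized).
DEFINITIONS = {
    "thuo": "to offer by burning, to sacrifice",
    "thusia": "an offering, a sacrifice",
    "logos": "word, speech, account",
}


def build_occurrence_index(tokens, form_to_lemma):
    """Map each dictionary form to every place any of its forms occurs."""
    index = defaultdict(list)
    for form, citation in tokens:
        lemma = form_to_lemma.get(form)
        if lemma is not None:
            index[lemma].append(citation)
    return dict(index)


def lemmas_defined_with(english, definitions):
    """Return dictionary forms whose definition contains the given string."""
    needle = english.lower()
    return sorted(l for l, d in definitions.items() if needle in d.lower())


OCCURRENCES = build_occurrence_index(TOKENS, FORM_TO_LEMMA)

if __name__ == "__main__":
    # English word -> Greek dictionary forms -> where any form occurs.
    for lemma in lemmas_defined_with("sacrifice", DEFINITIONS):
        print(lemma, OCCURRENCES.get(lemma, []))
```
-
- Because the work is done once, when the index is built, a later lookup
- never has to scan the full texts, which is the speed advantage MYLONAS
- described.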
-
- It was found that one of the things many of Perseus' non-Greek-reading
- users do is start from the dictionary and then move into the close study
- of words and concepts via this kind of English-Greek word search, by which
- means they might select a concept. This exercise has been assigned to
- students in core courses at Harvard--to study a concept by looking for the
- English word in the dictionary, finding the Greek words, and then finding
- the words in the Greek but, of course, reading across in the English.
- That tells them a great deal about what a translation means as well.
-
- Should one also wish to see images that have to do with sacrifice, that
- person would go to the object key word search, which allows one to
- perform a similar kind of index retrieval on the database of
- archaeological objects. Without words, pictures are useless; Perseus has
- not reached the point where it can do much with images that are not
- cataloged. Thus, although it is possible in Perseus with text and images
- to navigate by knowing where one wants to end up--for example, a
- red-figure vase from the Boston Museum of Fine Arts--one can perform this
- kind of navigation very easily by tracing down indices. MYLONAS
- illustrated several generic scenes of sacrifice on vases. The features
- demonstrated derived from Perseus 1.0; version 2.0 will implement even
- better means of retrieval.
-
- MYLONAS closed by looking at one of the pictures and noting again that
- one can do a great deal of research using the iconography as well as the
- texts. For instance, students in a core course at Harvard this year were
- highly interested in Greek concepts of foreigners and representations of
- non-Greeks. So they performed a great deal of research, both with texts
- (e.g., Herodotus) and with iconography on vases and coins, on how the
- Greeks portrayed non-Greeks. At the same time, art historians who study
- iconography were also interested, and were able to use this material.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Indexing and searchability of all English words in Perseus *
- Several features of Perseus 1.0 * Several levels of customization
- possible * Perseus used for general education * Perseus' effects on
- education * Contextual information in Perseus * Main challenge and
- emphasis of Perseus *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Several points emerged in the discussion that followed MYLONAS's presentation.
-
- Although MYLONAS had not demonstrated Perseus' ability to cross-search
- documents, she confirmed that all English words in Perseus are indexed
- and can be searched. So, for example, sacrifice could have been searched
- in all texts, the historical essay, and all the catalogue entries with
- their descriptions--in short, in all of Perseus.
-
- Boolean logic is not in Perseus 1.0 but will be added to the next
- version, although an effort is being made not to restrict Perseus to a
- database in which one just performs searching, Boolean or otherwise. It
- is possible to move laterally through the documents by selecting a word
- one is interested in and selecting an area of information one is
- interested in and trying to look that word up in that area.
-
- Since Perseus was developed in HyperCard, several levels of customization
- are possible. Simple authoring tools exist that allow one to create
- annotated paths through the information, which are useful for note-taking
- and for guided tours for teaching purposes and for expository writing.
- With a little more ingenuity it is possible to begin to add or substitute
- material in Perseus.
-
- Perseus has not been used so much for classics education as for general
- education, where it seemed to have an impact on the students in the core
- course at Harvard (a general required course that students must take in
- certain areas). Students were able to use primary material much more.
-
- The Perseus Project has an evaluation team at the University of Maryland
- that has been documenting Perseus' effects on education. Perseus is very
- popular, and anecdotal evidence indicates that it is having an effect at
- places other than Harvard, for example, test sites at Ball State
- University, Drury College, and numerous small places where opportunities
- to use vast amounts of primary data may not exist. One documented effect
- is that archaeological, anthropological, and philological research is
- being done by the same person instead of by three different people.
-
- The contextual information in Perseus includes an overview essay, a
- fairly linear historical essay on the fifth century B.C. that provides
- links into the primary material (e.g., Herodotus, Thucydides, and
- Plutarch), via small gray underscoring (on the screen) of linked
- passages. These are handmade links into other material.
-
- To different extents, most of the production work was done at Harvard,
- where the people and the equipment are located. Much of the
- collaborative activity involved data collection and structuring, because
- the main challenge and the emphasis of Perseus is the gathering of
- primary material, that is, building a useful environment for studying
- classical Greece, collecting data, and making it useful.
- Systems-building is definitely not the main concern. Thus, much of the
- work has involved writing essays, collecting information, rewriting it,
- and tagging it. That can be done off site. The creative link for the
- overview essay as well as for both systems and data was collaborative,
- and was forged via E-mail and paper mail with professors at Pomona and
- Bowdoin.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- CALALUCA * PLD's principal focus and contribution to scholarship *
- Various questions preparatory to beginning the project * Basis for
- project * Basic rule in converting PLD * Concerning the images in PLD *
- Running PLD under a variety of retrieval softwares * Encoding the
- database a hard-fought issue * Various features demonstrated * Importance
- of user documentation * Limitations of the CD-ROM version *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Eric CALALUCA, vice president, Chadwyck-Healey, Inc., demonstrated a
- software interpretation of the Patrologia Latina Database (PLD). PLD's
- principal focus from the beginning of the project about three-and-a-half
- years ago was on converting Migne's Latin series, and in the end,
- CALALUCA suggested, conversion of the text will be the major contribution
- to scholarship. CALALUCA stressed that, as possibly the only private
- publishing organization at the Workshop, Chadwyck-Healey had sought no
- federal funds or national foundation support before embarking upon the
- project, but instead had relied upon a great deal of homework and
- marketing to accomplish the task of conversion.
-
- Ever since the possibilities of computer-searching have emerged, scholars
- in the field of late ancient and early medieval studies (philosophers,
- theologians, classicists, and those studying the history of natural law
- and the history of the legal development of Western civilization) have
- been longing for a fully searchable version of Western literature, for
- example, all the texts of Augustine and Bernard of Clairvaux and
- Boethius, not to mention all the secondary and tertiary authors.
-
- Various questions arose, CALALUCA said. Should one convert Migne?
- Should the database be encoded? Is it necessary to do that? How should
- it be delivered? What about CD-ROM? Since this is a transitional
- medium, why even bother to create software to run on a CD-ROM? Since
- everybody knows people will be networking information, why go to the
- trouble--which is far greater with CD-ROM than with the production of
- magnetic data? Finally, how does one make the data available? Can many
- of the hurdles to using electronic information that some publishers have
- imposed upon databases be eliminated?
-
- The PLD project was based on the principle that computer-searching of
- texts is most effective when it is done with a large database. Because
- PLD represented a collection that serves so many disciplines across so
- many periods, it was irresistible.
-
- The basic rule in converting PLD was to do no harm, to avoid the sins of
- intrusion in such a database: no introduction of newer editions, no
- on-the-spot changes, no eradicating of all possible falsehoods from an
- edition. Thus, PLD is not the final act in electronic publishing for
- this discipline, but simply the beginning. The conversion of PLD has
- evoked numerous unanticipated questions: How will information be used?
- What about networking? Can the rights of a database be protected?
- Should one protect the rights of a database? How can it be made
- available?
-
- Those converting PLD also tried to avoid the sins of omission, that is,
- excluding portions of the collections or whole sections. What about the
- images? PLD is full of images; some are extremely pious
- nineteenth-century representations of the Fathers, while others contain
- highly interesting elements. The goal was to cover all the text of Migne
- (including notes, in Greek and in Hebrew, the latter of which, in
- particular, causes problems in creating a search structure), all the
- indices, and even the images, which are being scanned in separately
- searchable files.
-
- Several North American institutions that have placed acquisition requests
- for the PLD database have requested it in magnetic form without software,
- which means they are already running it without software, without
- anything demonstrated at the Workshop.
-
- What cannot practically be done is to go back and reconvert and re-encode
- data, a time-consuming and extremely costly enterprise. CALALUCA sees
- PLD as a database that can, and should, be run under a variety of
- retrieval softwares. This will permit the widest possible searches.
- Consequently, the need to produce a CD-ROM of PLD, as well as to develop
- software that could handle some 1.3 gigabytes of heavily encoded text,
- developed out of conversations with collection development and reference
- librarians who wanted software both compassionate enough for the
- pedestrian and capable of incorporating the most detailed
- lexicographical studies that a user desires to conduct. In the end, the
- encoding and conversion of the data will prove the most enduring
- testament to the value of the project.
-
- The encoding of the database was also a hard-fought issue: Did the
- database need to be encoded? Were there normative structures for encoding
- humanist texts? Should it be SGML? What about the TEI--will it last,
- will it prove useful? CALALUCA expressed some minor doubts as to whether
- a data bank can be fully TEI-conformant. Every effort can be made, but
- in the end to be TEI-conformant means to accept the need to make some
- firm encoding decisions that can, indeed, be disputed. The TEI points
- the publisher in a proper direction but does not presume to make all the
- decisions for him or her. Essentially, the goal of encoding was to
- eliminate, as much as possible, the hindrances to information-networking,
- so that if an institution acquires a database, everybody associated with
- the institution can have access to it.
-
- CALALUCA demonstrated a portion of Volume 160, because it had the most
- anomalies in it. The software was created by Electronic Book
- Technologies of Providence, RI, and is called Dynatext. The software
- works only with SGML-coded data.
-
- Viewing a table of contents on the screen, the audience saw how Dynatext
- treats each element as a book and attempts to simplify movement through a
- volume. Familiarity with the Patrologia in print (i.e., the text, its
- source, and the editions) will make the machine-readable versions highly
- useful. (Software with a Windows application was sought for PLD,
- CALALUCA said, because this was the main trend for scholarly use.)
-
- CALALUCA also demonstrated how a user can perform a variety of searches
- and quickly move to any part of a volume; the look-up screen provides
- some basic, simple word-searching.
-
- CALALUCA argued that one of the major difficulties is not the software.
- Rather, in creating a product that will be used by scholars representing
- a broad spectrum of computer sophistication, user documentation proves
- to be the most important service one can provide.
-
- CALALUCA next illustrated a truncated search under mysterium within ten
- words of virtus and how one would be able to find its contents throughout
- the entire database. He said that the exciting thing about PLD is that
- many of the applications in the retrieval software being written for it
- will exceed the capabilities of the software employed now for the CD-ROM
- version. The CD-ROM faces genuine limitations, in terms of speed and
- comprehensiveness, in the creation of a retrieval software to run it.
- CALALUCA said he hoped that individual scholars will download the data,
- if they wish, to their personal computers, and have ready access to
- important texts on a constant basis, which they will be able to use in
- their research and from which they might even be able to publish.
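-
- The truncated proximity search mentioned above (mysterium, truncated,
- within ten words of virtus) can be sketched as follows. This is a
- minimal illustration in Python and says nothing about how Dynatext
- works internally; the function names and the Latin sample string are
- invented for the purpose and are not taken from Migne.
-
```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def proximity_hits(tokens, stem, other, window=10):
    """Find words starting with `stem` within `window` words of `other`.

    Returns (stem_position, other_position) pairs of token indexes.
    """
    stem_pos = [i for i, t in enumerate(tokens) if t.startswith(stem)]
    other_pos = [i for i, t in enumerate(tokens) if t == other]
    return [(s, o) for s in stem_pos for o in other_pos if abs(s - o) <= window]


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = "de mysterio fidei et de virtute; nam virtus et mysteria sunt"
    print(proximity_hits(tokenize(sample), "mysteri", "virtus"))
```
-
- Truncating to the stem "mysteri" lets one query catch several inflected
- forms at once, which matters in Latin for the same reason MYLONAS gave
- for Greek.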
-
- (CALALUCA explained that the blue numbers represented Migne's column numbers,
- which are the standard scholarly references. Pulling up a note, he stated
- that these texts were heavily edited and the image files would appear simply
- as a note as well, so that one could quickly access an image.)
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- FLEISCHHAUER/ERWAY * Several problems with which AM is still wrestling *
- Various search and retrieval capabilities * Illustration of automatic
- stemming and a truncated search * AM's attempt to find ways to connect
- cataloging to the texts * AM's gravitation towards SGML * Striking a
- balance between quantity and quality * How AM furnishes users recourse to
- images * Conducting a search in a full-text environment * Macintosh and
- IBM prototypes of AM * Multimedia aspects of AM *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- A demonstration of American Memory by its coordinator, Carl FLEISCHHAUER,
- and Ricky ERWAY, associate coordinator, Library of Congress, concluded
- the morning session. Beginning with a collection of broadsides from the
- Continental Congress and the Constitutional Convention, the only text
- collection in a presentable form at the time of the Workshop, FLEISCHHAUER
- highlighted several of the problems with which AM is still wrestling.
- (In its final form, the disk will contain two collections, not only the
- broadsides but also the full text with illustrations of a set of
- approximately 300 African-American pamphlets from the period 1870 to 1910.)
-
- As FREEMAN had explained earlier, AM has attempted to use a small amount
- of interpretation to introduce collections. In the present case, the
- contractor, a company named Quick Source, in Silver Spring, Md., used
- software called Toolbook and put together a modestly interactive
- introduction to the collection. Like the two preceding speakers,
- FLEISCHHAUER argued that the real asset was the underlying collection.
-
- FLEISCHHAUER proceeded to describe various search and retrieval
- capabilities while ERWAY worked the computer. In this particular package
- the "go to" pull-down allowed the user in effect to jump out of Toolbook,
- where the interactive program was located, and enter the third-party
- software used by AM for this text collection, which is called Personal
- Librarian. This was the Windows version of Personal Librarian, a
- software application put together by a company in Rockville, Md.
-
- Since the broadsides came from the Revolutionary War period, a search was
- conducted using the words British or war, with the default operator reset
- as or. FLEISCHHAUER demonstrated both automatic stemming (which finds
- other forms of the same root) and a truncated search. One of Personal
- Librarian's strongest features, the relevance ranking, was represented by
- a chart that indicated how often words being sought appeared in
- documents, with the one receiving the most "hits" obtaining the highest
- score. The "hit list" that is supplied takes the relevance ranking into
- account, making the first hit, in effect, the one the software has
- selected as the most relevant example.
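-
- The hit-count ranking described above can be sketched in Python. The
- source does not give Personal Librarian's actual scoring formula, so
- this simply counts hits per document, as the description suggests. The
- function name and the three sample documents are hypothetical, and
- prefix matching is used as a stand-in for truncation.
-
```python
import re


def rank_documents(docs, prefixes):
    """Score each document by how many of its words start with any prefix.

    Joining the prefixes with "or" means a match on either one counts.
    Documents with no hits are left out of the hit list.
    """
    scores = {}
    for name, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for w in words if any(w.startswith(p) for p in prefixes))
        if hits:
            scores[name] = hits
    # Highest hit count first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented sample documents, for illustration only.
DOCS = {
    "broadside-1": "The British troops and the British fleet prolong the war.",
    "broadside-2": "A resolution concerning the war.",
    "broadside-3": "On the appointment of delegates.",
}

if __name__ == "__main__":
    for name, score in rank_documents(DOCS, ["british", "war"]):
        print(name, score)
```
-
- A bare prefix such as "war" would also match unrelated words such as
- "warrant", which is one reason true stemming is usually preferred to
- simple truncation.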
-
- While in the text of one of the broadside documents, FLEISCHHAUER
- remarked on AM's attempt to find ways to connect cataloging to the texts,
- which it does in different ways in different manifestations. In the case
- shown, the cataloging was pasted on: AM took MARC records that were
- written as on-line records right into one of the Library's mainframe
- retrieval programs, pulled them out, and handed them off to the contractor,
- who massaged them somewhat to display them in the manner shown. One of
- AM's questions is, Does the cataloguing normally performed in the mainframe
- work in this context, or ought AM to think through adjustments?
-
- FLEISCHHAUER made the additional point that, as far as the text goes, AM
- has gravitated towards SGML (he pointed to the boldface in the upper part
- of the screen). Although extremely limited in its ability to translate
- or interpret SGML, Personal Librarian will furnish both bold and italics
- on screen, a fairly easy thing to do but one of the ways in which
- SGML is useful.
-
- Striking a balance between quantity and quality has been a major concern
- of AM, with accuracy being one of the areas in which project staff have
- felt that less than 100-percent accuracy was acceptable.
- FLEISCHHAUER cited the example of the standard of the rekeying industry,
- namely 99.95 percent; as one service bureau informed him, to go from
- 99.95 to 100 percent would double the cost.
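-
- What these figures mean in practice can be worked out with a little
- arithmetic, sketched here in Python. Only the 99.95 percent standard
- and the doubling of cost come from the discussion above; the page
- length, unit cost, and budget are assumptions chosen purely for
- illustration.
-
```python
def expected_errors(characters, accuracy_percent):
    """Expected number of wrong characters at a given accuracy level."""
    return characters * (100.0 - accuracy_percent) / 100.0


def pages_affordable(budget, cost_per_page):
    """Whole pages that fit within a fixed budget."""
    return int(budget // cost_per_page)


# Assumed figures, for illustration only (not from the Workshop).
CHARS_PER_PAGE = 2000   # assumption: characters on a printed page
COST_PER_PAGE = 1.00    # assumption: arbitrary unit cost at 99.95 percent
BUDGET = 10000.00       # assumption: arbitrary fixed budget

if __name__ == "__main__":
    print("errors per page:", round(expected_errors(CHARS_PER_PAGE, 99.95), 2))
    print("pages at 99.95 percent:", pages_affordable(BUDGET, COST_PER_PAGE))
    # Doubling the cost, as the service bureau suggested, halves the pages.
    print("pages at 100 percent:", pages_affordable(BUDGET, 2 * COST_PER_PAGE))
```
-
- Under these assumptions the industry standard leaves about one wrong
- character per page, and insisting on perfection halves the amount of
- material a fixed budget can convert.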
-
- FLEISCHHAUER next demonstrated how AM furnishes users recourse to images,
- and at the same time recalled LESK's pointed question concerning the
- number of people who would look at those images and the number who would
- work only with the text. If the implication of LESK's question was
- sound, FLEISCHHAUER said, it raised the stakes for text accuracy and
- reduced the value of the strategy for images.
-
- Contending that preservation is always a bugaboo, FLEISCHHAUER
- demonstrated several images derived from a scan of a preservation
- microfilm that AM had made. He awarded a grade of C at best, perhaps a
- C minus or a C plus, for how well it worked out. Indeed, the matter of
- learning if other people had better ideas about scanning in general, and,
- in particular, scanning from microfilm, was one of the factors that drove
- AM to attempt to think through the agenda for the Workshop. Skew, for
- example, was one of the issues that AM in its ignorance had not reckoned
- would prove so difficult.
-
- Further, the handling of images of the sort shown, in a desktop computer
- environment, involved a considerable amount of zooming and scrolling.
- Ultimately, AM staff feel that perhaps the paper copy that is printed out
- might be the most useful one, but they remain uncertain as to how much
- on-screen reading users will do.
-
- Returning to the text, FLEISCHHAUER asked viewers to imagine a person who
- might be conducting a search in a full-text environment. With this
- scenario, he proceeded to illustrate other features of Personal Librarian
- that he considered helpful; for example, it provides the ability to
- notice words as one reads. Clicking the "include" button on the bottom
- of the search window pops the words that have been highlighted into the
- search. Thus, a user can refine the search as he or she reads,
- re-executing the search and continuing to find things in the quest for
- materials. This software not only contains relevance ranking, Boolean
- operators, and truncation, it also permits one to perform word algebra,
- so to say, where one puts two or three words in parentheses and links
- them with one Boolean operator and then a couple of words in another set
- of parentheses and asks for things within so many words of others.
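-
- The "word algebra" described above, in which two parenthesized groups
- of alternatives are tested for nearness, can be sketched in Python.
- This is one plausible reading of the feature rather than Personal
- Librarian's own query syntax; the function names and sample sentences
- are hypothetical.
-
```python
import re


def group_positions(tokens, group):
    """Token indexes of any word belonging to an OR-group of alternatives."""
    return [i for i, t in enumerate(tokens) if t in group]


def groups_within(text, group_a, group_b, window):
    """True if a word of group_a lies within `window` words of group_b."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_a = group_positions(tokens, group_a)
    pos_b = group_positions(tokens, group_b)
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)


if __name__ == "__main__":
    # Reads as: (congress OR convention) within 5 words of (resolved OR ordered)
    places = {"congress", "convention"}
    actions = {"resolved", "ordered"}
    print(groups_within("In Congress, June 1776: Resolved, that", places, actions, 5))
```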
-
- Until they became acquainted recently with some of the work being done in
- classics, the AM staff had not realized that a large number of the
- projects that involve electronic texts were being done by people with a
- profound interest in language and linguistics. Their search strategies
- and thinking are oriented to those fields, as is shown in particular by
- the Perseus example. As amateur historians, the AM staff were thinking
- more of searching for concepts and ideas than for particular words.
- Obviously, FLEISCHHAUER conceded, searching for concepts and ideas and
- searching for words may be two rather closely related things.
-
- While displaying several images, FLEISCHHAUER observed that the Macintosh
- prototype built by AM contains a greater diversity of formats. Echoing a
- previous speaker, he said that it was easier to stitch things together in
- the Macintosh, though it tended to be a little more anemic in search and
- retrieval. AM, therefore, increasingly has been investigating
- sophisticated retrieval engines in the IBM format.
-
- FLEISCHHAUER demonstrated several additional examples of the prototype
- interfaces: One was AM's metaphor for the network future, in which a
- kind of reading-room graphic suggests how one would be able to go around
- to different materials. AM contains a large number of photographs in
- analog video form worked up from a videodisc, which enable users to make
- copies to print or incorporate in digital documents. A frame-grabber is
- built into the system, making it possible to bring an image into a window
- and digitize or print it out.
-
- FLEISCHHAUER next demonstrated sound recording, which included texts.
- Recycled from a previous project, the collection included sixty 78-rpm
- phonograph records of political speeches that were made during and
- immediately after World War I. These constituted approximately three
- hours of audio, as AM has digitized it, which occupy 150 megabytes on a
- CD. Thus, they are considerably compressed. From the catalogue card,
- FLEISCHHAUER proceeded to a transcript of a speech with the audio
- available and with highlighted text following it as it played.
- A photograph has been added and a transcription made.
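-
- The degree of compression implied by these figures can be estimated
- with a short calculation, sketched here in Python. The three hours and
- 150 megabytes come from the account above. The baseline is an
- assumption: the source does not say how the audio was sampled, so
- standard compact-disc audio (44.1 kHz, 16-bit, stereo) is used for
- comparison, and megabytes are taken as decimal.
-
```python
def bytes_per_second(total_bytes, seconds):
    """Average storage rate of a recording."""
    return total_bytes / seconds


HOURS = 3                       # from the text: about three hours of audio
DISC_BYTES = 150 * 10**6        # from the text: 150 megabytes (assumed decimal)
CD_AUDIO_RATE = 44100 * 2 * 2   # assumption: 44.1 kHz, 2 bytes/sample, stereo

stored_rate = bytes_per_second(DISC_BYTES, HOURS * 3600)
ratio = CD_AUDIO_RATE / stored_rate

if __name__ == "__main__":
    print("stored bytes per second:", round(stored_rate))
    print("reduction versus CD audio:", round(ratio, 1))
```
-
- On these assumptions the speeches occupy roughly a twelfth or a
- thirteenth of the space uncompressed compact-disc audio would need.
- Against a monaural baseline, which is arguably fairer for 78-rpm
- records, the figure would be about half that.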
-
- Considerable value has been added beyond what the Library of Congress
- normally would do in cataloguing a sound recording, which raises several
- questions for AM concerning where to draw lines about how much value it can
- afford to add and at what point, perhaps, this becomes more than AM could
- reasonably do or reasonably wish to do. FLEISCHHAUER also demonstrated
- a motion picture. As FREEMAN had reported earlier, the motion picture
- materials have proved the most popular, not surprisingly. This says more
- about the medium, he thought, than about AM's presentation of it.
-
- Because AM's goal was to bring together things that could be used by
- historians or by people who were curious about history,
- turn-of-the-century footage seemed to represent the most appropriate
- collections from the Library of Congress in motion pictures. These were
- the very first films made by Thomas Edison's company and some others at
- that time. The particular example illustrated was a Biograph film,
- brought in with a frame-grabber into a window. A single videodisc
- contains about fifty titles and pieces of film from that period, all of
- New York City. Taken together, AM believes, they provide an interesting
- documentary resource.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Using the frame-grabber in AM * Volume of material processed
- and to be processed * Purpose of AM within LC * Cataloguing and the
- nature of AM's material * SGML coding and the question of quality versus
- quantity *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the question-and-answer period that followed FLEISCHHAUER's
- presentation, several clarifications were made.
-
- AM is bringing in motion pictures from a videodisc. The frame-grabber
- devices create a window on a computer screen, which permits users to
- digitize a single frame of the movie or one of the photographs. It
- produces a crude, rough-and-ready image that high school students can
- incorporate into papers, and that has worked very nicely in this way.
-
- Commenting on FLEISCHHAUER's assertion that AM was looking more at
- searching ideas than words, MYLONAS argued that without words an idea
- does not exist. FLEISCHHAUER conceded that he ought to have articulated
- his point more clearly. MYLONAS stated that they were in fact both
- talking about the same thing. By searching for words and by forcing
- people to focus on the word, the Perseus Project felt that they would get
- them to the idea. The way one reviews results is tailored more to one
- kind of user than another.
-
- Concerning the total volume of material that has been processed in this
- way, AM at this point has in retrievable form seven or eight collections,
- all of them photographic. In the Macintosh environment, for example,
- there probably are 35,000-40,000 photographs. The sound recordings
- number sixty items. The broadsides number about 300 items. There are
- 500 political cartoons in the form of drawings. The motion pictures, as
- individual items, number sixty to seventy.
-
- AM also has a manuscript collection, the life history portion of one of
- the federal project series, which will contain 2,900 individual
- documents, all first-person narratives. AM has in process about 350
- African-American pamphlets, or about 12,000 printed pages for the period
- 1870-1910. Also in the works are some 4,000 panoramic photographs. AM
- has recycled a fair amount of the work done by LC's Prints and
- Photographs Division during the Library's optical disk pilot project in
- the 1980s. For example, a special division of LC has tooled up and
- thought through all the ramifications of electronic presentation of
- photographs. Indeed, they are wheeling them out in great barrel loads.
-
- The purpose of AM within the Library, it is hoped, is to catalyze several
- of the other special collection divisions which have no particular
- experience with, and in some cases mixed feelings about, an activity such
- as AM. Moreover, in many cases the divisions may be characterized as
- lacking experience not only in "electronifying" things but also in
- automated cataloguing. MARC cataloguing as practiced in the United States is
- heavily weighted toward the description of monograph and serial
- materials, but is much thinner when one enters the world of manuscripts
- and things that are held in the Library's music collection and other
- units.
-
- In response to a comment by LESK, that AM's material is very
- heavily photographic, and is so primarily because individual records have
- been made for each photograph, FLEISCHHAUER observed that an item-level
- catalog record exists, for example, for each photograph in the Detroit
- Publishing collection of 25,000 pictures. In the case of the Federal
- Writers Project, for which nearly 3,000 documents exist, representing
- information from twenty-six different states, AM with the assistance of
- Karen STUART of the Manuscript Division will attempt to find some way not
- only to have a collection-level record but perhaps a MARC record for each
- state, which will then serve as an umbrella for the 100-200 documents
- that come under it. But that drama remains to be enacted. The AM staff
- is conservative and clings to cataloguing, though of course visitors tout
- artificial intelligence and neural networks in a manner that suggests that
- perhaps one need not have cataloguing or that much of it could be put aside.
-
- The matter of SGML coding, FLEISCHHAUER conceded, returned the discussion
- to the earlier treated question of quality versus quantity in the Library
- of Congress. Of course, text conversion can be done with 100-percent
- accuracy, but it means that when one's holdings are as vast as LC's only
- a tiny amount will be exposed, whereas permitting lower levels of
- accuracy can lead to exposing or sharing larger amounts, but with the
- quality correspondingly impaired.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- TWOHIG * A contrary experience concerning electronic options * Volume of
- material in the Washington papers and a suggestion of David Packard *
- Implications of Packard's suggestion * Transcribing the documents for the
- CD-ROM * Accuracy of transcriptions * The CD-ROM edition of the Founding
- Fathers documents *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Finding encouragement in a comment of MICHELSON's from the morning
- session--that numerous people in the humanities were choosing electronic
- options to do their work--Dorothy TWOHIG, editor, The Papers of George
- Washington, opened her illustrated talk by noting that her experience
- with literary scholars and numerous people in editing was contrary to
- MICHELSON's. TWOHIG emphasized literary scholars' complete ignorance of
- the technological options available to them or their reluctance or, in
- some cases, their downright hostility toward these options.
-
- After providing an overview of the five Founding Fathers projects
- (Jefferson at Princeton, Franklin at Yale, John Adams at the
- Massachusetts Historical Society, and Madison down the hall from her at
- the University of Virginia), TWOHIG observed that the Washington papers,
- like all of the projects, include both sides of the Washington
- correspondence and deal with some 135,000 documents to be published with
- extensive annotation in eighty to eighty-five volumes, a project that
- will not be completed until well into the next century. Thus, it was
- with considerable enthusiasm several years ago that the Washington Papers
- Project (WPP) greeted David Packard's suggestion that the papers of the
- Founding Fathers could be published easily and inexpensively, and to the
- great benefit of American scholarship, via CD-ROM.
-
- In pragmatic terms, funding from the Packard Foundation would expedite
- the transcription of thousands of documents waiting to be put on disk in
- the WPP offices. Further, since the costs of collecting, editing, and
- converting the Founding Fathers documents into letterpress editions were
- running into the millions of dollars, and the considerable staffs
- involved in all of these projects were devoting their careers to
- producing the work, the Packard Foundation's suggestion had a
- revolutionary aspect: Transcriptions of the entire corpus of the
- Founding Fathers papers would be available on CD-ROM to public and
- college libraries, even high schools, for an annual license fee of
- $100-$150, a fraction of the cost of producing a limited university
- press run of 1,000 copies of each volume of the published papers at
- $45-$150 per printed volume. Given the current budget crunch in
- educational systems
- and the corresponding constraints on librarians in smaller institutions
- who wish to add these volumes to their collections, producing the
- documents on CD-ROM would likely open a greatly expanded audience for the
- papers. TWOHIG stressed, however, that development of the Founding
- Fathers CD-ROM is still in its infancy. Serious software problems remain
- to be resolved before the material can be put into readable form.
-
- Funding from the Packard Foundation resulted in a major push to
- transcribe the 75,000 or so documents of the Washington papers remaining
- to be transcribed onto computer disks. Slides illustrated several of the
- problems encountered, for example, the present inability of CD-ROM to
- indicate the cross-outs (deleted material) in eighteenth century
- documents. TWOHIG next described documents from various periods in the
- eighteenth century that have been transcribed in chronological order and
- delivered to the Packard offices in California, where they are converted
- to the CD-ROM, a process that is expected to take five years to
- complete (that is, reckoning from David Packard's suggestion made several
- years ago, until about July 1994). TWOHIG found an encouraging
- indication of the project's benefits in the ongoing use made by scholars
- of the search functions of the CD-ROM, particularly in reducing the time
- spent in manually turning the pages of the Washington papers.
-
- TWOHIG next furnished details concerning the accuracy of transcriptions.
- For instance, the insertion of thousands of documents on the CD-ROM
- currently does not permit each document to be verified against the
- original manuscript several times as in the case of documents that appear
- in the published edition. However, the transcriptions receive a cursory
- check for obvious typos, the misspellings of proper names, and other
- errors from the WPP CD-ROM editor. Eventually, all documents that appear
- in the electronic version will be checked by project editors. Although
- this process has met with opposition from some of the editors on the
- grounds that imperfect work may leave their offices, the advantages in
- making this material available as a research tool outweigh fears about the
- misspelling of proper names and other relatively minor editorial matters.
-
- Completion of all five Founding Fathers projects (i.e., retrievability
- and searchability of all of the documents by proper names, alternate
- spellings, or varieties of subjects) will provide one of the richest
- sources of this size for the history of the United States in the latter
- part of the eighteenth century. Further, publication on CD-ROM will
- allow editors to include even minutiae, such as laundry lists, not
- included in the printed volumes.
-
- It seems possible that the extensive annotation provided in the printed
- volumes eventually will be added to the CD-ROM edition, pending
- negotiations with the publishers of the papers. At the moment, the
- Founding Fathers CD-ROM is accessible only on the IBYCUS, a computer
- developed out of the Thesaurus Linguae Graecae project and designed for
- the use of classical scholars. There are perhaps 400 IBYCUS computers in
- the country, most of which are in university classics departments.
- Ultimately, it is anticipated that the CD-ROM edition of the Founding
- Fathers documents will run on any IBM-compatible or Macintosh computer
- with a CD-ROM drive. Numerous changes in the software will also occur
- before the project is completed. (Editor's note: an IBYCUS was
- unavailable to demonstrate the CD-ROM.)
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Several additional features of WPP clarified *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Discussion following TWOHIG's presentation served to clarify several
- additional features, including (1) that the project's primary
- intellectual product consists in the electronic transcription of the
- material; (2) that the text transmitted to the CD-ROM people is not
- marked up; (3) that cataloging and subject-indexing of the material
- remain to be worked out (though at this point material can be retrieved
- by name); and (4) that because all the searching is done in the hardware,
- the IBYCUS is designed to read a CD-ROM which contains only sequential
- text files. Technically, it then becomes very easy to read the material
- off and put it on another device.
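-
- A rough illustration of the point about sequential text files: if the
- disc holds nothing but plain text read from start to finish, a search
- amounts to a linear scan, and the same files can be read on any other
- device. The Python sketch below is only a stand-in under that assumption,
- not the IBYCUS design (which does its searching in hardware); the
- function name and the sample lines are invented for illustration.
-
```python
import io


def scan_sequential(stream, term):
    """Yield (line_number, line) for each line that contains term.

    The stream is read once, front to back, which is all a disc of
    sequential text files requires. Matching ignores case.
    """
    needle = term.lower()
    for line_number, line in enumerate(stream, start=1):
        if needle in line.lower():
            yield line_number, line.rstrip("\n")


# Invented sample text standing in for one sequential file on the disc.
sample = io.StringIO(
    "Letter A: instructions for the spring planting\n"
    "Letter B: an account of the harvest\n"
    "Letter C: the harvest and the mill\n"
)

hits = list(scan_sequential(sample, "Harvest"))
for line_number, line in hits:
    print(line_number, line)
```
-
- Nothing in the scan depends on the storage medium, which is the sense in
- which such material is easy to read off and move elsewhere.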
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- LEBRON * Overview of the history of the joint project between AAAS and
- OCLC * Several practices the on-line environment shares with traditional
- publishing on hard copy * Several technical and behavioral barriers to
- electronic publishing * How AAAS and OCLC arrived at the subject of
- clinical trials * Advantages of the electronic format and other features
- of OJCCT * An illustrated tour of the journal *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Maria LEBRON, managing editor, The Online Journal of Current Clinical
- Trials (OJCCT), presented an illustrated overview of the history of the
- joint project between the American Association for the Advancement of
- Science (AAAS) and the Online Computer Library Center, Inc. (OCLC). The
- joint venture between AAAS and OCLC owes its beginning to a
- reorganization launched by the new chief executive officer at OCLC about
- three years ago and combines the strengths of these two disparate
- organizations. In short, OJCCT represents the process of scholarly
- publishing on line.
-
- LEBRON next discussed several practices the on-line environment shares
- with traditional publishing on hard copy--for example, peer review of
- manuscripts--that are highly important in the academic world. LEBRON
- noted in particular the implications of citation counts for tenure
- committees and grants committees. In the traditional hard-copy
- environment, citation counts are readily demonstrable, whereas the
- on-line environment represents an ethereal medium to most academics.
-
- LEBRON remarked on several technical and behavioral barriers to electronic
- publishing, for instance, the problems in transmission created by special
- characters or by complex graphics and halftones. In addition, she noted
- economic limitations such as the storage costs of maintaining back issues
- and market or audience education.
-
- Manuscripts cannot be uploaded to OJCCT, LEBRON explained, because it is
- not a bulletin board or E-mail, forms of electronic transmission of
- information that have created an ambience clouding people's understanding
- of what the journal is attempting to do. OJCCT, which publishes
- peer-reviewed medical articles dealing with the subject of clinical
- trials, includes text, tabular material, and graphics, although at this
- time it can transmit only line illustrations.
-
- Next, LEBRON described how AAAS and OCLC arrived at the subject of
- clinical trials: 1) It is a highly statistical discipline; 2) it does
- not require halftones but can satisfy the needs of its audience with line
- illustrations and graphic material; and 3) there is a need for the speedy
- dissemination of high-quality research results. Clinical trials are
- research activities that involve the administration of a test treatment
- to some experimental unit in order to test its usefulness before it is
- made available to the general population. LEBRON proceeded to give
- additional information on OJCCT concerning its editor-in-chief, editorial
- board, editorial content, and the types of articles it publishes
- (including peer-reviewed research reports and reviews), as well as
- features shared by other traditional hard-copy journals.
-
- Among the advantages of the electronic format are faster dissemination of
- information, including raw data, and the absence of space constraints
- because pages do not exist. (This latter fact creates an interesting
- situation when it comes to citations.) Nor are there any issues. AAAS's
- capacity to download materials directly from the journal to a
- subscriber's printer, hard drive, or floppy disk helps ensure highly
- accurate transcription. Other features of OJCCT include on-screen alerts
- that allow linkage of subsequently published documents to the original
- documents; on-line searching by subject, author, title, etc.; indexing of
- every single word that appears in an article; viewing access to an
- article by component (abstract, full text, or graphs); numbered
- paragraphs to replace page counts; publication in Science, every thirty
- days, of an index to all articles published in the journal;
- typeset-quality screens; and Hypertext links that enable subscribers to
- bring up Medline abstracts directly without leaving the journal.
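-
- Two of the features listed above, numbered paragraphs in place of page
- counts and indexing of every word, can be sketched together. The Python
- fragment below is an illustrative assumption only and says nothing about
- how OCLC actually built the journal's software; the function names and
- the three-paragraph sample article are invented.
-
```python
import re
from collections import defaultdict


def number_paragraphs(text):
    """Split text on blank lines and number the paragraphs from 1."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {n: p for n, p in enumerate(paragraphs, start=1)}


def build_word_index(numbered):
    """Map every lowercased word to the set of paragraph numbers using it."""
    index = defaultdict(set)
    for n, paragraph in numbered.items():
        for word in re.findall(r"[a-z0-9]+", paragraph.lower()):
            index[word].add(n)
    return index


# Invented sample article; a citation can point to "par. 2" rather than a page.
article = """The trial enrolled adult patients.

Patients received the test treatment or a placebo.

Results favored the test treatment."""

numbered = number_paragraphs(article)
index = build_word_index(numbered)
print(sorted(index["treatment"]))
```
-
- Because the paragraph number travels with the text, a citation stays
- stable however the article is displayed or printed.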
-
- After detailing the two primary ways to gain access to the journal,
- through the OCLC network and CompuServe if one desires graphics or through
- the Internet if just an ASCII file is desired, LEBRON illustrated the
- speedy editorial process and the coding of the document using SGML tags
- after it has been accepted for publication. She also gave an illustrated
- tour of the journal, its search-and-retrieval capabilities in particular,
- but also including problems associated with scanning in illustrations,
- and the importance of on-screen alerts to the medical profession re
- retractions or corrections, or more frequently, editorials, letters to
- the editors, or follow-up reports. She closed by inviting the audience
- to join AAAS on 1 July, when OJCCT was scheduled to go on-line.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Additional features of OJCCT *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- In the lengthy discussion that followed LEBRON's presentation, these
- points emerged:
-
- * The SGML text can be tailored as users wish.
-
- * All these articles have a fairly simple document definition.
-
- * Document-type definitions (DTDs) were developed and given to OJCCT
- for coding.
-
- * No articles will be removed from the journal. (Because there are
- no back issues, there are no lost issues either. Once a subscriber
- logs onto the journal he or she has access not only to the currently
- published materials, but retrospectively to everything that has been
- published in it. Thus the table of contents grows bigger. The date
- of publication serves to distinguish between currently published
- materials and older materials.)
-
- * The pricing system for the journal resembles that for most medical
- journals: for 1992, $95 for a year, plus telecommunications charges
- (there are no connect time charges); for 1993, $110 for the
- entire year for single users, though the journal can be put on a
- local area network (LAN). However, only one person can access the
- journal at a time. Site licenses may come in the future.
-
- * AAAS is working closely with colleagues at OCLC to display
- mathematical equations on screen.
-
- * Without compromising any steps in the editorial process, the
- technology has reduced the time lag between when a manuscript is
- originally submitted and the time it is accepted; the review process
- does not differ greatly from the standard six-to-eight weeks
- employed by many of the hard-copy journals. The process still
- depends on people.
-
- * As far as a preservation copy is concerned, articles will be
- maintained on the computer permanently and subscribers, as part of
- their subscription, will receive a microfiche-quality archival copy
- of everything published during that year; in addition, reprints can
- be purchased in much the same way as in a hard-copy environment.
- Hard copies are prepared but are not the primary medium for the
- dissemination of the information.
-
- * Because OJCCT is not yet on line, it is difficult to know how many
- people would simply browse through the journal on the screen as
- opposed to downloading the whole thing and printing it out; a mix of
- both types of users likely will result.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- PERSONIUS * Developments in technology over the past decade * The CLASS
- Project * Advantages for technology and for the CLASS Project *
- Developing a network application an underlying assumption of the project
- * Details of the scanning process * Print-on-demand copies of books *
- Future plans include development of a browsing tool *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Lynne PERSONIUS, assistant director, Cornell Information Technologies for
- Scholarly Information Services, Cornell University, first commented on
- the tremendous impact that developments in technology over the past ten
- years--networking, in particular--have had on the way information is
- handled, and how, in her own case, these developments have counterbalanced
- Cornell's relative geographical isolation. Other significant technologies
- include scanners, which are much more sophisticated than they were ten years
- ago; mass storage and the dramatic savings that result from it in terms of
- both space and money relative to twenty or thirty years ago; new and
- improved printing technologies, which have greatly affected the distribution
- of information; and, of course, digital technologies, whose applicability to
- library preservation remains at issue.
-
- Given that context, PERSONIUS described the College Library Access and
- Storage System (CLASS) Project, a library preservation project,
- primarily, and what has been accomplished. Directly funded by the
- Commission on Preservation and Access and by the Xerox Corporation, which
- has provided a significant amount of hardware, the CLASS Project has been
- working with a development team at Xerox to develop a software
- application tailored to library preservation requirements. Within
- Cornell, participants in the project have been working jointly with both
- library and information technologies. The focus of the project has been
- on reformatting and saving books that are in brittle condition.
- PERSONIUS showed Workshop participants a brittle book, and described how
- such books were the result of developments in papermaking around the
- beginning of the Industrial Revolution. The papermaking process was
- changed so that a significant amount of acid was introduced into the
- actual paper itself, which deteriorates as it sits on library shelves.
-
- One of the advantages for technology and for the CLASS Project is that
- the information in brittle books is mostly out of copyright and thus
- offers an opportunity to work with material that requires library
- preservation, and to create and work on an infrastructure to save the
- material. Acknowledging the familiarity of those working in preservation
- with this information, PERSONIUS noted that several things are being
- done: the primary preservation technology used today is photocopying of
- brittle material. Saving the intellectual content of the material is the
- main goal. With microfilm copy, the intellectual content is preserved on
- the assumption that in the future the image can be reformatted in any
- other way that then exists.
-
- An underlying assumption of the CLASS Project from the beginning was
- that it would develop a network application. Project staff scan books
- at a workstation located in the library, near the brittle material.
- An image-server filing system is located at a distance from that
- workstation, and a printer is located in another building. All of the
- materials digitized and stored on the image-filing system are cataloged
- in the on-line catalogue. In fact, a record for each of these electronic
- books is stored in the RLIN database so that a record exists of what is
- in the digital library through standard catalogue procedures. In the
- future, researchers working from their own workstations in their offices,
- or their networks, will have access--wherever they might be--through a
- request server being built into the new digital library. A second
- assumption is that the preferred means of finding the material will be by
- looking through a catalogue. PERSONIUS described the scanning process,
- which uses a prototype scanner being developed by Xerox and which scans a
- very high resolution image at great speed. Another significant feature,
- because this is a preservation application, is the placing of the pages
- that fall apart one by one on the platen. Ordinarily, a scanner could
- be used with some sort of a document feeder, but because of this
- application that is not feasible. Further, because CLASS is a
- preservation application, after the paper replacement is made there, a
- very careful quality control check is performed. An original book is
- compared to the printed copy and verification is made, before proceeding,
- that all of the image, all of the information, has been captured. Then,
- a new library book is produced: The printed images are rebound by a
- commercial binder and a new book is returned to the shelf.
- Significantly, the books returned to the library shelves are beautiful
- and useful replacements on acid-free paper that should last a long time,
- in effect, the equivalent of preservation photocopies. Thus, the project
- has a library of digital books. In essence, CLASS is scanning and
- storing books as 600 dot-per-inch bit-mapped images, compressed using
- Group 4 CCITT (i.e., the French acronym for International Consultative
- Committee for Telegraph and Telephone) compression. They are stored as
- TIFF files on an optical filing system that is composed of a database
- used for searching and locating the books and an optical jukebox that
- stores 64 twelve-inch platters. A very-high-resolution printed copy of
- these books at 600 dots per inch is created, using a Xerox DocuTech
- printer to make the paper replacements on acid-free paper.
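-
- To give a sense of scale for 600 dot-per-inch bit-mapped images, the
- arithmetic can be worked through in a few lines of Python. The page
- dimensions and the compression ratio below are assumptions chosen purely
- for illustration; the presentation gives neither figure, and Group 4
- compression in practice varies widely with page content.
-
```python
def uncompressed_bytes(width_in, height_in, dpi=600, bits_per_pixel=1):
    """Size in bytes of an uncompressed bitmap of the given dimensions."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed values, not taken from the CLASS Project.
PAGE_WIDTH_IN = 6.0      # hypothetical page width in inches
PAGE_HEIGHT_IN = 9.0     # hypothetical page height in inches
ASSUMED_RATIO = 15       # hypothetical Group 4 compression ratio

raw = uncompressed_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN)
compressed = raw // ASSUMED_RATIO

print(f"uncompressed page: {raw:,} bytes")
print(f"compressed page at {ASSUMED_RATIO}:1: {compressed:,} bytes")
```
-
- Under these assumptions a single page occupies about 2.4 megabytes
- before compression, which suggests why compression and high-capacity
- optical storage figure so prominently in the design.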
-
- PERSONIUS maintained that the CLASS Project presents an opportunity to
- introduce people to books as digital images by using a paper medium.
- Books are returned to the shelves while people are also given the ability
- to print on demand--to make their own copies of books. (PERSONIUS
- distributed copies of an engineering journal published by engineering
- students at Cornell around 1900 as an example of what a print-on-demand
- copy of material might be like. This very cheap copy would be available
- to people to use for their own research purposes and would bridge the gap
- between an electronic work and the paper that readers like to have.)
- PERSONIUS then attempted to illustrate a very early prototype of
- networked access to this digital library. Xerox Corporation has
- developed a prototype of a view station that can send images across the
- network to be viewed.
-
- The particular library brought down for demonstration contained two
- mathematics books. CLASS has begun, and will spend the next year,
- developing an application that allows people at workstations to browse
- the books. Thus, CLASS is developing a browsing tool, on the assumption
- that users do not want to read an entire book from a workstation, but
- would prefer to be able to look through and decide if they would like to
- have a printed copy of it.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Re retrieval software * "Digital file copyright" * Scanning
- rate during production * Autosegmentation * Criteria employed in
- selecting books for scanning * Compression and decompression of images *
- OCR not precluded *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the question-and-answer period that followed her presentation,
- PERSONIUS made these additional points:
-
- * Re retrieval software, Cornell is developing a Unix-based server
- as well as clients for the server that support multiple platforms
- (Macintosh, IBM and Sun workstations), in the hope that people from
- any of those platforms will retrieve books; a further operating
- assumption is that standard interfaces will be used as much as
- possible, where standards can be put in place, because CLASS
- considers this retrieval software a library application and would
- like to be able to look at material not only at Cornell but at other
- institutions.
-
- * The phrase "digital file copyright by Cornell University" was
- added at the advice of Cornell's legal staff with the caveat that it
- probably would not hold up in court. Cornell does not want people
- to copy its books and sell them but would like to keep them
- available for use in a library environment for library purposes.
-
- * In production the scanner can scan about 300 pages per hour,
- capturing 600 dots per inch.
-
- * The Xerox software has filters to scan halftone material and avoid
- the moire patterns that occur when halftone material is scanned.
- Xerox has been working on hardware and software that would enable
- the scanner itself to recognize this situation and deal with it
- appropriately--a kind of autosegmentation that would enable the
- scanner to handle halftone material as well as text on a single page.
-
- * The books subjected to the elaborate process described above were
- selected because CLASS is a preservation project, with the first 500
- books selected coming from Cornell's mathematics collection, because
- they were still being heavily used and because, although they were
- in need of preservation, the mathematics library and the mathematics
- faculty were uncomfortable having them microfilmed. (They wanted a
- printed copy.) Thus, these books became a logical choice for this
- project. Other books were chosen by the project's selection committees
- for experiments with the technology, as well as to meet a demand or need.
-
- * Images will be decompressed before they are sent over the line; at
- this time they are compressed and sent to the image filing system
- and then sent to the printer as compressed images; they are returned
- to the workstation as compressed 600-dpi images and the workstation
- decompresses and scales them for display--an inefficient way to
- access the material though it works quite well for printing and
- other purposes.
-
- * CLASS is also decompressing on Macintosh and IBM, a slow process
- right now. Eventually, compression and decompression will take
- place on an image conversion server. Trade-offs will be made, based
- on future performance testing, concerning where the file is
- compressed and what resolution image is sent.
-
- * OCR has not been precluded; images are being stored that have been
- scanned at a high resolution, which presumably would suit them well
- to an OCR process. Because the material being scanned is about 100
- years old and was printed with less-than-ideal technologies, very
- early and preliminary tests have not produced good results. But the
- project is capturing an image that is of sufficient resolution to be
- subjected to OCR in the future. Moreover, the system architecture
- and the system plan have a logical place to store an OCR image if it
- has been captured. But that is not being done now.
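-
- The step in which the workstation scales a decompressed 600-dpi image
- for display can be sketched as block averaging: each square block of
- black-and-white pixels becomes one gray pixel. This Python fragment is a
- generic illustration under that assumption, not the Xerox or Cornell
- software; the function name and the tiny test bitmap are invented. A
- factor of 8 would take 600 dpi down to 75 dpi, an assumed screen
- resolution.
-
```python
def downscale_to_gray(bitmap, factor):
    """Reduce a bilevel bitmap by an integer factor using block averaging.

    bitmap is a list of rows of 0 (white) and 1 (black). The result is a
    list of rows of gray values from 0 (black) to 255 (white). Rows and
    columns that do not fill a whole block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor
    block_area = factor * factor
    result = []
    for by in range(height):
        row = []
        for bx in range(width):
            black = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(255 - (255 * black) // block_area)
        result.append(row)
    return result


# Invented 4x4 bitmap, reduced by a factor of 2 to a 2x2 gray image.
tiny = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
gray = downscale_to_gray(tiny, 2)
print(gray)
```
-
- Doing this work on every workstation is the cost the presentation
- describes as inefficient; moving it to a conversion server would change
- where the loop runs, not what it computes.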
-
- ******
-
- SESSION III. DISTRIBUTION, NETWORKS, AND NETWORKING: OPTIONS FOR
- DISSEMINATION
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ZICH * Issues pertaining to CD-ROMs * Options for publishing in CD-ROM *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Robert ZICH, special assistant to the associate librarian for special
- projects, Library of Congress, and moderator of this session, first noted
- the blessed but somewhat awkward circumstance of having four very
- distinguished people representing networks and networking or at least
- leaning in that direction, while lacking anyone to speak from the
- strongest possible background in CD-ROMs. ZICH expressed the hope that
- members of the audience would join the discussion. He stressed the
- subtitle of this particular session, "Options for Dissemination," and,
- concerning CD-ROMs, the importance of determining when it would be wise
- to consider dissemination in CD-ROM versus networks. A shopping list of
- issues pertaining to CD-ROMs included: the grounds for selecting
- commercial publishers, and in-house publication where possible versus
- nonprofit or government publication. A similar list for networks
- included: determining when one should consider dissemination through a
- network, identifying the mechanisms or entities that exist to place items
- on networks, identifying the pool of existing networks, determining how a
- producer would choose between networks, and identifying the elements of
- a business arrangement in a network.
-
- Options for publishing in CD-ROM: an outside publisher versus
- self-publication. If an outside publisher is used, it can be nonprofit,
- such as the Government Printing Office (GPO) or the National Technical
- Information Service (NTIS), in the case of government. The pros and cons
- associated with employing an outside publisher are obvious. Among the
- pros, there is no trouble getting accepted. One pays the bill and, in
- effect, goes one's way. Among the cons, when one pays an outside
- publisher to perform the work, that publisher will perform the work it is
- obliged to do, but perhaps without the production expertise and skill in
- marketing and dissemination that some would seek. There is the body of
- commercial publishers that do possess that kind of expertise in
- distribution and marketing but that obviously are selective. In
- self-publication, one exercises full control, but then one must handle
- matters such as distribution and marketing. Such are some of the options
- for publishing in the case of CD-ROM.
-
- In the case of technical and design issues, which are also important,
- there are many matters which many at the Workshop already knew a good
- deal about: retrieval system requirements and costs, what to do about
- images, the various capabilities and platforms, the trade-offs between
- cost and performance, concerns about local-area networkability,
- interoperability, etc.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- LYNCH * Creating networked information is different from using networks
- as an access or dissemination vehicle * Networked multimedia on a large
- scale does not yet work * Typical CD-ROM publication model a two-edged
- sword * Publishing information on a CD-ROM in the present world of
- immature standards * Contrast between CD-ROM and network pricing *
- Examples demonstrated earlier in the day as a set of insular information
- gems * Paramount need to link databases * Layering to become increasingly
- necessary * Project NEEDS and the issues of information reuse and active
- versus passive use * X-Windows as a way of differentiating between
- network access and networked information * Barriers to the distribution
- of networked multimedia information * Need for good, real-time delivery
- protocols * The question of presentation integrity in client-server
- computing in the academic world * Recommendations for producing multimedia
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Clifford LYNCH, director, Library Automation, University of California,
- opened his talk with the general observation that networked information
- constituted a difficult and elusive topic because it is something just
- starting to develop and not yet fully understood. LYNCH contended that
- creating genuinely networked information was different from using
- networks as an access or dissemination vehicle and was more sophisticated
- and more subtle. He invited the members of the audience to extrapolate,
- from what they heard about the preceding demonstration projects, to what
- sort of a world of electronic information--scholarly, archival,
- cultural, etc.--they wished to end up with ten or fifteen years from now.
- LYNCH suggested that to extrapolate directly from these projects would
- produce unpleasant results.
-
- Putting the issue of CD-ROM in perspective before getting into
- generalities on networked information, LYNCH observed that those engaged
- in multimedia today who wish to ship a product, so to say, probably do
- not have much choice except to use CD-ROM: networked multimedia on a
- large scale basically does not yet work because the technology does not
- exist. For example, anybody who has tried moving images around over the
- Internet knows that this is an exciting touch-and-go process, a
- fascinating and fertile area for experimentation, research, and
- development, but not something that one can become deeply enthusiastic
- about committing to production systems at this time.
-
- This situation will change, LYNCH said. He differentiated CD-ROM from
- the practices that have been followed up to now in distributing data on
- CD-ROM. For LYNCH the problem with CD-ROM is not its portability or its
- slowness but the two-edged sword of having the retrieval application and
- the user interface inextricably bound up with the data, which is the
- typical CD-ROM publication model. It is not a case of publishing data
- but of distributing a typically stand-alone, typically closed system,
- all--software, user interface, and data--on a little disk. Hence, all
- the between-disk navigational issues as well as the impossibility in most
- cases of integrating data on one disk with that on another. Most CD-ROM
- retrieval software does not network very gracefully at present. However,
- in the present world of immature standards and lack of understanding of
- what network information is or what the ground rules are for creating or
- using it, publishing information on a CD-ROM does add value in a very
- real sense.
-
- LYNCH drew a contrast between CD-ROM and network pricing and in doing so
- highlighted something bizarre in information pricing. A large
- institution such as the University of California has vendors who will
- offer to sell information on CD-ROM for a price per year in four digits,
- but for the same data (e.g., an abstracting and indexing database) on
- magnetic tape, regardless of how many people may use it concurrently,
- will quote a price in six digits.
-
- What is packaged with the CD-ROM in one sense adds value--a complete
- access system, not just raw, unrefined information--although it is not
- generally perceived that way. This is because the access software,
- although it adds value, is viewed by some people, particularly in the
- university environment where there is a very heavy commitment to
- networking, as being developed in the wrong direction.
-
- Given that context, LYNCH described the examples demonstrated as a set of
- insular information gems--Perseus, for example, offers nicely linked
- information, but would be very difficult to integrate with other
- databases, that is, to link seamlessly with source files from other
- sources. It resembles an island, and in this respect is
- similar to numerous stand-alone projects that are based on videodiscs,
- that is, on the single-workstation concept.
-
- As scholarship evolves in a network environment, the paramount need will
- be to link databases. We must link personal databases to public
- databases, to group databases, in fairly seamless ways--which is
- extremely difficult in the environments under discussion with copies of
- databases proliferating all over the place.
-
- The notion of layering also struck LYNCH as lurking in several of the
- projects demonstrated. Several databases in a sense constitute
- information archives without a significant amount of navigation built in.
- Educators, critics, and others will want a layered structure--one that
- defines or links paths through the layers to allow users to reach
- specific points. In LYNCH's view, layering will become increasingly
- necessary, and not just within a single resource but across resources
- (e.g., tracing mythology and cultural themes across several classics
- databases as well as a database of Renaissance culture). This ability to
- organize resources, to build things out of multiple other things on the
- network or select pieces of it, represented for LYNCH one of the key
- aspects of network information.
-
- Contending that information reuse constituted another significant issue,
- LYNCH commended to the audience's attention Project NEEDS (i.e., National
- Engineering Education Delivery System). This project's objective is to
- produce a database of engineering courseware as well as the components
- that can be used to develop new courseware. In a number of the existing
- applications, LYNCH said, the issue of reuse (how much one can take apart
- and reuse in other applications) was not being well considered. He also
- raised the issue of active versus passive use, one aspect of which is
- how much information will be manipulated locally by users. Most people,
- he argued, may do a little browsing and then will wish to print. LYNCH
- was uncertain how these resources would be used by the vast majority of
- users in the network environment.
-
- LYNCH next said a few words about X-Windows as a way of differentiating
- between network access and networked information. A number of the
- applications demonstrated at the Workshop could be rewritten to use X
- across the network, so that one could run them from any X-capable
- device--a workstation, an X terminal--and transact with a database across the
- network. Although this opens up access a little, assuming one has enough
- network to handle it, it does not provide an interface to develop a
- program that conveniently integrates information from multiple databases.
- X is a viewing technology that has limits. In a real sense, it is just a
- graphical version of remote log-in across the network. X-type applications
- represent only one step in the progression towards real access.
-
- LYNCH next discussed barriers to the distribution of networked multimedia
- information. The heart of the problem is a lack of standards to provide
- the ability for computers to talk to each other, retrieve information,
- and shuffle it around fairly casually. At the moment, little progress is
- being made on standards for networked information; for example, present
- standards do not cover images, digital voice, and digital video. A
- useful tool kit of exchange formats for basic texts is only now being
- assembled. The synchronization of content streams (i.e., synchronizing a
- voice track to a video track, establishing temporal relations between
- different components in a multimedia object) constitutes another issue
- for networked multimedia that is just beginning to receive attention.
-
- Underlying network protocols also need some work; good, real-time
- delivery protocols on the Internet do not yet exist. In LYNCH's view,
- highly important in this context is the notion of networked digital
- object IDs, the ability of one object on the network to point to another
- object (or component thereof) on the network. Serious bandwidth issues
- also exist. LYNCH was uncertain if billion-bit-per-second networks would
- prove sufficient if numerous people ran video in parallel.
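-
- LYNCH's doubt about gigabit networks can be made concrete with rough
- arithmetic. The sketch below is illustrative only: the two per-viewer
- video rates are assumptions chosen for the example, not figures from the
- talk, and protocol overhead and competing traffic are ignored.
-

```python
GIGABIT_LINK = 1_000_000_000  # a "billion-bit-per-second" network

# Hypothetical per-viewer rates, chosen only for illustration.
COMPRESSED_STREAM = 1_500_000               # about 1.5 Mbit/s compressed video
UNCOMPRESSED_STREAM = 640 * 480 * 24 * 30   # 640x480 pixels, 24-bit, 30 frames/s


def concurrent_streams(link_bits_per_second, stream_bits_per_second):
    """Whole streams that fit on the link, ignoring overhead and other traffic."""
    return link_bits_per_second // stream_bits_per_second


print("compressed viewers:  ",
      concurrent_streams(GIGABIT_LINK, COMPRESSED_STREAM))
print("uncompressed viewers:",
      concurrent_streams(GIGABIT_LINK, UNCOMPRESSED_STREAM))
```

-
- Under these assumptions a gigabit link carries several hundred compressed
- streams but only a handful of uncompressed ones, which is the kind of gap
- LYNCH's uncertainty points to.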
-
- LYNCH concluded by offering an issue for database creators to consider,
- as well as several comments about what might constitute good trial
- multimedia experiments. In a networked information world the database
- builder or service builder (publisher) does not exercise the same
- extensive control over the integrity of the presentation; strange
- programs "munge" with one's data before the user sees it. Serious
- thought must be given to what guarantees integrity of presentation. Part
- of that is related to where one draws the boundaries around a networked
- information service. This question of presentation integrity in
- client-server computing has not been stressed enough in the academic
- world, LYNCH argued, though commercial service providers deal with it
- regularly.
-
- Concerning multimedia, LYNCH observed that good multimedia at the moment
- is hideously expensive to produce. He recommended producing multimedia
- with either very high sale value, or multimedia with a very long life
- span, or multimedia that will have a very broad usage base and whose
- costs therefore can be amortized among large numbers of users. In this
- connection, historical and humanistically oriented material may be a good
- place to start, because it tends to have a longer life span than much of
- the scientific material, as well as a wider user base. LYNCH noted, for
- example, that American Memory fits many of the criteria outlined. He
- noted the extensive discussion about bringing the Internet or the
- National Research and Education Network (NREN) into the K-12 environment
- as a way of helping the American educational system.
-
- LYNCH closed by noting that the kinds of applications demonstrated struck
- him as excellent justifications of broad-scale networking for K-12, but
- that at this time no "killer" application exists to mobilize the K-12
- community to obtain connectivity.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Dearth of genuinely interesting applications on the network
- a slow-changing situation * The issue of the integrity of presentation in
- a networked environment * Several reasons why CD-ROM software does not
- network *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the discussion period that followed LYNCH's presentation, several
- additional points were made.
-
- LYNCH reiterated even more strongly his contention that, historically,
- once one goes outside high-end science and the group of those who need
- access to supercomputers, there is a great dearth of genuinely
- interesting applications on the network. He saw this situation changing
- slowly, with some of the scientific databases and scholarly discussion
- groups and electronic journals coming on as well as with the availability
- of Wide Area Information Servers (WAIS) and some of the databases that
- are being mounted there. However, many of those things do not seem to
- have piqued great popular interest. For instance, most high school
- students of LYNCH's acquaintance would not qualify as devotees of serious
- molecular biology.
-
- Concerning the issue of the integrity of presentation, LYNCH believed
- that a couple of information providers have laid down the law at least on
- certain things. For example, his recollection was that the National
- Library of Medicine feels strongly that one needs to employ the
- identifier field if he or she is to mount a database commercially. The
- problem with a real networked environment is that one does not know who
- is reformatting and reprocessing one's data when one enters a
- client-server mode. It becomes anybody's guess, for example, if the network
- uses a Z39.50 server, or what clients are doing with one's data. A data
- provider can say that his contract will only permit clients to have
- access to his data after he vets them and their presentation and makes
- certain it suits him. But LYNCH held out little expectation that the
- network marketplace would evolve in that way, because it required too
- much prior negotiation.
-
- CD-ROM software does not network for a variety of reasons, LYNCH said.
- He speculated that CD-ROM publishers are not eager to have their products
- really hook into wide area networks, because they fear it will make their
- data suppliers nervous. Moreover, until relatively recently, one had to
- be rather adroit to run a full TCP/IP stack plus applications on a
- PC-size machine, whereas nowadays it is becoming easier as PCs grow
- bigger and faster. LYNCH also speculated that software providers had not
- heard from their customers until the last year or so, or had not heard
- from enough of their customers.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- BESSER * Implications of disseminating images on the network; planning
- the distribution of multimedia documents poses two critical
- implementation problems * Layered approach represents the way to deal
- with users' capabilities * Problems in platform design; file size and its
- implications for networking * Transmission of megabyte size images
- impractical * Compression and decompression at the user's end * Promising
- trends for compression * A disadvantage of using X-Windows * A project at
- the Smithsonian that mounts images on several networks *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Howard BESSER, School of Library and Information Science, University of
- Pittsburgh, spoke primarily about multimedia, focusing on images and the
- broad implications of disseminating them on the network. He argued that
- planning the distribution of multimedia documents posed two critical
- implementation problems, which he framed in the form of two questions:
- 1) What platform will one use and what hardware and software will users
- have for viewing of the material? and 2) How can one deliver a
- sufficiently robust set of information in an accessible format in a
- reasonable amount of time? Depending on whether network or CD-ROM is the
- medium used, this question raises different issues of storage,
- compression, and transmission.
-
- Concerning the design of platforms (e.g., sound, gray scale, simple
- color, etc.) and the various capabilities users may have, BESSER
- maintained that a layered approach was the way to deal with users'
- capabilities. A result would be that users with less powerful
- workstations would simply have less functionality. He urged members of
- the audience to advocate standards and accompanying software that handle
- layered functionality across a wide variety of platforms.
-
- BESSER also addressed problems in platform design, namely, deciding how
- large a machine to design for situations when the largest number of users
- have the lowest level of the machine, and one desires higher
- functionality. BESSER then proceeded to the question of file size and
- its implications for networking. He discussed still images in the main.
- For example, a digital color image that fills the screen of a standard
- mega-pel workstation (Sun or Next) will require one megabyte of storage
- for an eight-bit image or three megabytes of storage for a true color or
- twenty-four-bit image. Lossless compression algorithms (that is,
- computational procedures in which no data is lost in the process of
- compressing [and decompressing] an image--the exact bit-representation is
- maintained) might bring storage down to a third of a megabyte per image,
- but not much further than that. The question of size makes it difficult
- to fit an appropriately sized set of these images on a single disk or to
- transmit them quickly enough on a network.
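-
- The storage figures above can be checked with a few lines of arithmetic.
- The sketch below is illustrative only: it assumes that a "mega-pel" screen
- is exactly one million pixels and that a megabyte is one million bytes, and
- it uses Python's zlib module (a general-purpose lossless compressor, not
- one of the image algorithms BESSER discussed) to show what "lossless" means.
-

```python
import zlib

# Assumptions (not from the talk): 1 mega-pel = 1,000,000 pixels and
# 1 megabyte = 1,000,000 bytes.
PIXELS_PER_SCREEN = 1_000_000
BYTES_PER_MEGABYTE = 1_000_000


def image_megabytes(bits_per_pixel, pixels=PIXELS_PER_SCREEN):
    """Return the uncompressed size of an image in megabytes."""
    return pixels * bits_per_pixel / 8 / BYTES_PER_MEGABYTE


eight_bit_mb = image_megabytes(8)     # eight-bit image
true_color_mb = image_megabytes(24)   # twenty-four-bit image

# Lossless round trip: decompressing returns exactly the original bytes.
# The sample data is a made-up repetitive pattern, so it compresses far
# better than a real photograph would.
sample = bytes(range(256)) * 64
packed = zlib.compress(sample)
restored = zlib.decompress(packed)

print(f"8-bit image:  {eight_bit_mb:.1f} MB")
print(f"24-bit image: {true_color_mb:.1f} MB")
print("round trip exact:", restored == sample)
```

-
- How far a real image shrinks depends on its content; the one-third figure
- quoted above is BESSER's estimate, not something this sketch demonstrates.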
-
- With these full screen mega-pel images that constitute a third of a
- megabyte, one gets 1,000-3,000 full-screen images on a one-gigabyte disk;
- a standard CD-ROM represents approximately 60 percent of that. Storing
- images the size of a PC screen (just 8 bit color) increases storage
- capacity to 4,000-12,000 images per gigabyte; 60 percent of that gives
- one the size of a CD-ROM, which in turn creates a major problem. One
- cannot have full-screen, full-color images with lossless compression; one
- must compress them with some loss or use a lower resolution. For
- megabyte-size images,
- anything slower than a T-1 speed is impractical. For example, on a
- fifty-six-kilobaud line, it takes three minutes to transfer a
- one-megabyte file, if it is not compressed; and this speed assumes ideal
- circumstances (no other user contending for network bandwidth). Thus,
- questions of disk access, remote display, and current telephone
- connection speed make transmission of megabyte-size images impractical.
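-
- The disk-capacity and transfer-time figures in this paragraph follow from
- simple division. The sketch below is a rough check under stated
- assumptions, none of which come from the talk: a gigabyte is taken as one
- billion bytes, "56 kilobaud" as 56,000 bits per second, and a T-1 line as
- 1,544,000 bits per second.
-

```python
# Assumptions (not from the talk): decimal units, a CD-ROM at 60 percent
# of a gigabyte, and the line rates named below.
GIGABYTE = 1_000_000_000
CD_ROM = GIGABYTE * 60 // 100
MEGABYTE = 1_000_000


def images_per_disk(disk_bytes, image_bytes):
    """Whole images that fit on a disk of the given size."""
    return disk_bytes // image_bytes


def transfer_seconds(file_bytes, bits_per_second):
    """Ideal transfer time: no protocol overhead, no competing traffic."""
    return file_bytes * 8 / bits_per_second


uncompressed = MEGABYTE       # eight-bit mega-pel image
compressed = MEGABYTE // 3    # about a third of a megabyte

for name, disk in (("1 GB disk", GIGABYTE), ("CD-ROM", CD_ROM)):
    low = images_per_disk(disk, uncompressed)
    high = images_per_disk(disk, compressed)
    print(f"{name}: {low}-{high} images")

for name, rate in (("56 kbit/s", 56_000), ("T-1", 1_544_000)):
    print(f"{name}: {transfer_seconds(MEGABYTE, rate):.0f} s per megabyte")
```

-
- On these assumptions the ideal time on the slower line comes out a little
- under two and a half minutes, so the three minutes quoted above presumably
- allows for overhead.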
-
- BESSER then discussed ways to deal with these large images, for example,
- compression and decompression at the user's end. In this connection, the
- issues of how much one is willing to lose in the compression process and
- what image quality one needs in the first place are unknown. But what is
- known is that compression beyond what lossless methods achieve entails some
- loss of data. BESSER urged that
- more studies be conducted on image quality in different situations, for
- example, what kind of images are needed for what kind of disciplines, and
- what kind of image quality is needed for a browsing tool, an intermediate
- viewing tool, and archiving.
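-
- The trade-off between size and loss can be seen in a very crude lossy
- scheme. The sketch below is an illustration only, not a method BESSER
- described: it keeps the top four bits of each eight-bit gray level, halving
- the storage per pixel, and then measures how far the restored values drift
- from the originals.
-

```python
def quantize(samples, keep_bits):
    """Keep only the top `keep_bits` of each 8-bit sample."""
    shift = 8 - keep_bits
    return [s >> shift for s in samples]


def restore(codes, keep_bits):
    """Map each code back to the middle of the range it stood for."""
    shift = 8 - keep_bits
    midpoint = (1 << shift) >> 1
    return [(c << shift) + midpoint for c in codes]


samples = list(range(256))      # every possible 8-bit gray level
codes = quantize(samples, 4)    # 4 bits per pixel instead of 8
approx = restore(codes, 4)

max_error = max(abs(a - b) for a, b in zip(samples, approx))
print("distinct levels kept:", len(set(approx)))
print("largest error in gray levels:", max_error)
print("exact round trip:", approx == samples)
```

-
- Whether an error of that size matters is exactly the question BESSER wanted
- studied: it may be acceptable in a browsing tool and unacceptable in an
- archive.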
-
- BESSER noted two promising trends for compression: from a technical
- perspective, algorithms that use what is called subjective redundancy
- employ principles from visual psycho-physics to identify and remove
- information from the image that the human eye cannot perceive; from an
- interchange and interoperability perspective, the JPEG (i.e., Joint
- Photographic Experts Group, an ISO standard) compression algorithms also
- offer promise. These issues of compression and decompression, BESSER
- argued, resembled those raised earlier concerning the design of different
- platforms. Gauging the capabilities of potential users constitutes a
- primary goal. BESSER advocated layering or separating the images from
- the applications that retrieve and display them, to avoid tying them to
- particular software.
-
- BESSER detailed several lessons learned from his work at Berkeley with
- Imagequery, especially the advantages and disadvantages of using
- X-Windows. In the latter category, for example, retrieval is tied
- directly to one's data, an intolerable situation in the long run on a
- networked system. Finally, BESSER described a project of Jim Wallace at
- the Smithsonian Institution, who is mounting images in an extremely
- rudimentary way on the CompuServe and GEnie networks and is preparing to
- mount them on America Online. Although the average user takes over
- thirty minutes to download these images (assuming a fairly fast modem),
- nevertheless, images have been downloaded 25,000 times.
-
- BESSER concluded his talk with several comments on the business
- arrangement between the Smithsonian and CompuServe. He contended that not
- enough is known concerning the value of images.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Creating digitized photographic collections nearly
- impossible except with large organizations like museums * Need for study
- to determine quality of images users will tolerate *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the brief exchange between LESK and BESSER that followed, several
- clarifications emerged.
-
- LESK argued that the photographers were far ahead of BESSER: It is
- almost impossible to create such digitized photographic collections
- except with large organizations like museums, because all the
- photographic agencies have been going crazy about this and will not sign
- licensing agreements on any sort of reasonable terms. LESK had heard
- that National Geographic, for example, had tried to buy the right to use
- some image in some kind of educational production for $100 per image, but
- the photographers will not touch it. They want accounting and payment
- for each use, which cannot be accomplished within the system. BESSER
- responded that a consortium of photographers, headed by a former National
- Geographic photographer, had started assembling its own collection of
- electronic reproductions of images, with the money going back to the
- cooperative.
-
- LESK contended that BESSER was unnecessarily pessimistic about multimedia
- images, because people are accustomed to low-quality images, particularly
- from video. BESSER urged the launching of a study to determine what
- users would tolerate, what they would feel comfortable with, and what
- absolutely is the highest quality they would ever need. Conceding that
- he had adopted a dire tone in order to arouse people about the issue,
- BESSER closed on a sanguine note by saying that he would not be in this
- business if he did not think that things could be accomplished.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- LARSEN * Issues of scalability and modularity * Geometric growth of the
- Internet and the role played by layering * Basic functions sustaining
- this growth * A library's roles and functions in a network environment *
- Effects of implementation of the Z39.50 protocol for information
- retrieval on the library system * The trade-off between volumes of data
- and its potential usage * A snapshot of current trends *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Ronald LARSEN, associate director for information technology, University
- of Maryland at College Park, first addressed the issues of scalability
- and modularity. He noted the difficulty of anticipating the effects of
- orders-of-magnitude growth, reflecting on the twenty years of experience
- with the Arpanet and Internet. Recalling the day's demonstrations of
- CD-ROM and optical disk material, he went on to ask if the field has yet
- learned how to scale new systems to enable delivery and dissemination
- across large-scale networks.
-
- LARSEN focused on the geometric growth of the Internet from its inception
- circa 1969 to the present, and the adjustments required to respond to
- that rapid growth. To illustrate the issue of scalability, LARSEN
- considered computer networks as including three generic components:
- computers, network communication nodes, and communication media. Each
- component scales (e.g., computers range from PCs to supercomputers;
- network nodes scale from interface cards in a PC through sophisticated
- routers and gateways; and communication media range from 2,400-baud
- dial-up facilities through 4.5-Mbps backbone links, and eventually to
- multigigabit-per-second communication lines), and architecturally, the
- components are organized to scale hierarchically from local area networks
- to international-scale networks. Such growth is made possible by
- building layers of communication protocols, as BESSER pointed out.
- By layering both physically and logically, a sense of scalability is
- maintained from local area networks in offices, across campuses, through
- bridges, routers, campus backbones, fiber-optic links, etc., up into
- regional networks and ultimately into national and international
- networks.
-
- LARSEN then illustrated the geometric growth over a two-year period--
- through September 1991--of the number of networks that comprise the
- Internet. This growth has been sustained largely by the availability of
- three basic functions: electronic mail, file transfer (ftp), and remote
- log-on (telnet). LARSEN also reviewed the growth in the kind of traffic
- that occurs on the network. Network traffic reflects the joint contributions
- of a larger population of users and increasing use per user. Today one sees
- serious applications involving moving images across the network--a rarity
- ten years ago. LARSEN recalled and concurred with BESSER's main point
- that the interesting problems occur at the application level.
-
- LARSEN then illustrated a model of a library's roles and functions in a
- network environment. He noted, in particular, the placement of on-line
- catalogues onto the network and patrons obtaining access to the library
- increasingly through local networks, campus networks, and the Internet.
- LARSEN supported LYNCH's earlier suggestion that we need to address
- fundamental questions of networked information in order to build
- environments that scale in the information sense as well as in the
- physical sense.
-
- LARSEN supported the role of the library system as the access point into
- the nation's electronic collections. Implementation of the Z39.50
- protocol for information retrieval would make such access practical and
- feasible. For example, this would enable patrons in Maryland to search
- California libraries, or other libraries around the world that are
- conformant with Z39.50 in a manner that is familiar to University of
- Maryland patrons. This client-server model also supports moving beyond
- secondary content into primary content. (The notion of how one links
- from secondary content to primary content, LARSEN said, represents a
- fundamental problem that requires rigorous thought.) After noting
- numerous network experiments in accessing full-text materials, including
- projects supporting the ordering of materials across the network, LARSEN
- revisited the issue of transmitting high-density, high-resolution color
- images across the network and the large amounts of bandwidth they
- require. He went on to address the bandwidth and synchronization
- problems inherent in sending full-motion video across the network.
-
- LARSEN illustrated the trade-off between volumes of data in bytes or
- orders of magnitude and the potential usage of that data. He discussed
- transmission rates (particularly, the time it takes to move various forms
- of information), and what one could do with a network supporting
- multigigabit-per-second transmission. At the moment, the network
- environment includes a composite of data-transmission requirements,
- volumes and forms, going from steady to bursty (high-volume) and from
- very slow to very fast. This aggregate must be considered in the design,
- construction, and operation of multigigabit networks.
-
- LARSEN's objective is to use the networks and library systems now being
- constructed to increase access to resources wherever they exist, and
- thus, to evolve toward an on-line electronic virtual library.
-
- LARSEN concluded by offering a snapshot of current trends: continuing
- geometric growth in network capacity and number of users; slower
- development of applications; and glacial development and adoption of
- standards. The challenge is to design and develop each new application
- system with network access and scalability in mind.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- BROWNRIGG * Access to the Internet cannot be taken for granted * Packet
- radio and the development of MELVYL in 1980-81 in the Division of Library
- Automation at the University of California * Design criteria for packet
- radio * A demonstration project in San Diego and future plans * Spread
- spectrum * Frequencies at which the radios will run and plans to
- reimplement the WAIS server software in the public domain * Need for an
- infrastructure of radios that do not move around *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Edwin BROWNRIGG, executive director, Memex Research Institute, first
- polled the audience in order to seek out regular users of the Internet as
- well as those planning to use it some time in the future. With nearly
- everybody in the room falling into one category or the other, BROWNRIGG
- made a point about access, namely that numerous individuals, especially those
- who use the Internet every day, take for granted their access to it, the
- speeds with which they are connected, and how well it all works.
- However, as BROWNRIGG discovered between 1987 and 1989 in Australia,
- if one wants access to the Internet but cannot afford it or has some
- physical boundary that prevents her or him from gaining access, it can
- be extremely frustrating. He suggested that because of economics and
- physical barriers we were beginning to create a world of haves and have-nots
- in the process of scholarly communication, even in the United States.
-
- BROWNRIGG detailed the development of MELVYL in academic year 1980-81 in
- the Division of Library Automation at the University of California, in
- order to underscore the issue of access to the system, which at the
- outset was extremely limited. In short, the project needed to build a
- network, which at that time entailed use of satellite technology, that is,
- putting earth stations on campus and also acquiring some terrestrial links
- from the State of California's microwave system. The installation of
- satellite links, however, did not solve the problem (which actually
- formed part of a larger problem involving politics and financial resources).
- For while the project team could get a signal onto a campus, it had no means
- of distributing the signal throughout the campus. The solution involved
- adopting a recent development in wireless communication called packet radio,
- which combined the basic notion of packet-switching with radio. The project
- used this technology to get the signal from a point on campus where it
- came down, an earth station for example, into the libraries, because it
- found that wiring the libraries, especially the older marble buildings,
- would cost $2,000-$5,000 per terminal.
-
- BROWNRIGG noted that, ten years ago, the project had neither the public
- policy nor the technology that would have allowed it to use packet radio
- in any meaningful way. Since then much had changed. He proceeded to
- detail research and development of the technology, how it is being
- deployed in California, and what direction he thought it would take.
- The design criteria are to produce a high-speed, one-time, low-cost,
- high-quality, secure, license-free device (packet radio) that one can
- plug in and play today, forget about it, and have access to the Internet.
- By high speed, BROWNRIGG meant 1 megabyte and 1.5 megabytes. Those units
- have been built, he continued, and are in the process of being
- type-certified by an independent underwriting laboratory so that they can
- be type-licensed by the Federal Communications Commission. As is the
- case with citizens band, one will be able to purchase a unit and not have
- to worry about applying for a license.
-
- The basic idea, BROWNRIGG elaborated, is to take high-speed radio data
- transmission and create a backbone network that at certain strategic
- points in the network will "gateway" into a medium-speed packet radio
- (i.e., one that runs at 38.4 kilobytes), so that perhaps by 1994-1995
- people like those in the audience could, for the price of a VCR, purchase
- a medium-speed radio for the office or home, have full network connectivity
- to the Internet, and partake of all its services, with no need for an FCC
- license and no regular bill from the local common carrier. BROWNRIGG
- presented several details of a demonstration project currently taking
- place in San Diego and described plans, pending funding, to install a
- full-bore network in the San Francisco area. This network will have 600
- nodes running at backbone speeds, and 100 of these nodes will be libraries,
- which in turn will be the gateway ports to the 38.4 kilobyte radios that
- will give coverage for the neighborhoods surrounding the libraries.
-
- BROWNRIGG next explained Part 15.247, a new rule within Title 47 of the
- Code of Federal Regulations enacted by the FCC in 1985. This rule
- challenged the industry, which has only now risen to the occasion, to
- build a radio that would run at no more than one watt of output power and
- use a fairly exotic method of modulating the radio wave called spread
- spectrum. Spread spectrum in fact permits the building of networks so
- that numerous data communications can occur simultaneously, without
- interfering with each other, within the same wide radio channel.
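-
- How several transmissions can share one wide channel without interfering
- can be shown in miniature. The sketch below is an assumption-laden
- illustration: the talk does not say which spread-spectrum method the
- radios use, so direct-sequence spreading with two orthogonal four-chip
- codes is chosen here, and the two senders are assumed to be perfectly
- synchronized on a noise-free channel.
-

```python
# Two orthogonal spreading codes (their chip-by-chip products sum to zero).
CODE_A = [1, 1, -1, -1]
CODE_B = [1, -1, 1, -1]


def spread(bits, code):
    """Replace each data bit by the code (bit 1) or its negation (bit 0)."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * chip for chip in code)
    return chips


def despread(signal, code):
    """Correlate each code-length slice with the code to recover the bits."""
    size = len(code)
    bits = []
    for start in range(0, len(signal), size):
        window = signal[start:start + size]
        correlation = sum(s * c for s, c in zip(window, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]

# Both senders transmit at once; the channel simply adds their chips.
channel = [a + b for a, b in zip(spread(bits_a, CODE_A),
                                 spread(bits_b, CODE_B))]

print("receiver A hears:", despread(channel, CODE_A))
print("receiver B hears:", despread(channel, CODE_B))
```

-
- Each receiver recovers its own sender's bits because the other sender's
- contribution cancels in the correlation. Real radios must also cope with
- noise, timing offsets, and unequal signal strengths, none of which this
- sketch models.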
-
- BROWNRIGG explained that the frequencies at which the radios would run
- are very short wave signals. They are well above standard microwave and
- radar. With a radio wave that small, one watt becomes a tremendous punch
- per bit and thus makes transmission at reasonable speed possible. In
- order to minimize the potential for congestion, the project is
- undertaking to reimplement software which has been available in the
- networking business and is taken for granted now, for example, TCP/IP,
- routing algorithms, bridges, and gateways. In addition, the project
- plans to take the WAIS server software in the public domain and
- reimplement it so that one can have a WAIS server on a Mac instead of a
- Unix machine. The Memex Research Institute believes that libraries, in
- particular, will want to use the WAIS servers with packet radio. This
- project, which has a team of about twelve people, will run through 1993
- and will include the 100 libraries already mentioned as well as other
- professionals such as those in the medical profession, engineering, and
- law. Thus, the need is to create an infrastructure of radios that do not
- move around, which, BROWNRIGG hopes, will solve a problem not only for
- libraries but for individuals who, by and large today, do not have access
- to the Internet from their homes and offices.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Project operating frequencies *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During a brief discussion period, which also concluded the day's
- proceedings, BROWNRIGG stated that the project was operating in four
- frequencies. The slow speed is operating at 435 megahertz, and it would
- later go up to 920 megahertz. With the high-speed frequency, the
- one-megabyte radios will run at 2.4 gigahertz, and the 1.5-megabyte radios
- will run at 5.7 gigahertz.
- At 5.7, rain can be a factor, but it would have to be tropical rain,
- unlike what falls in most parts of the United States.
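-
- To give a sense of how short these radio waves are, the wavelength is the
- speed of light divided by the frequency. The sketch below is a rough
- illustration that takes all four figures as frequencies (435 and 920
- megahertz, 2.4 and 5.7 gigahertz); reading the last two as gigahertz is an
- assumption based on the context.
-

```python
SPEED_OF_LIGHT = 299_792_458  # metres per second


def wavelength_m(frequency_hz):
    """Free-space wavelength in metres for a given frequency in hertz."""
    return SPEED_OF_LIGHT / frequency_hz


frequencies_hz = {
    "435 MHz": 435e6,
    "920 MHz": 920e6,
    "2.4 GHz": 2.4e9,   # assumed reading of "2.4"
    "5.7 GHz": 5.7e9,   # assumed reading of "5.7"
}

for label, hertz in frequencies_hz.items():
    print(f"{label}: {wavelength_m(hertz) * 100:.1f} cm")
```

-
- The wavelengths shrink from tens of centimetres at the slow-speed
- frequencies to a few centimetres at the highest one.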
-
- ******
-
- SESSION IV. IMAGE CAPTURE, TEXT CAPTURE, OVERVIEW OF TEXT AND
- IMAGE STORAGE FORMATS
-
- William HOOTON, vice president of operations, I-NET, moderated this session.
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- KENNEY * Factors influencing development of CXP * Advantages of using
- digital technology versus photocopy and microfilm * A primary goal of
- CXP; publishing challenges * Characteristics of copies printed * Quality
- of samples achieved in image capture * Several factors to be considered
- in choosing scanning * Emphasis of CXP on timely and cost-effective
- production of black-and-white printed facsimiles * Results of producing
- microfilm from digital files * Advantages of creating microfilm * Details
- concerning production * Costs * Role of digital technology in library
- preservation *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Anne KENNEY, associate director, Department of Preservation and
- Conservation, Cornell University, opened her talk by observing that the
- Cornell Xerox Project (CXP) has been guided by the assumption that the
- ability to produce printed facsimiles or to replace paper with paper
- would be important, at least for the present generation of users and
- equipment. She described three factors that influenced development of
- the project: 1) Because the project has emphasized the preservation of
- deteriorating brittle books, the quality of what was produced had to be
- sufficiently high to return a paper replacement to the shelf. 2) CXP was
- interested only in a system that was cost-effective, which meant that it
- had to be cost-competitive with the processes currently available,
- principally photocopy and microfilm. 3) CXP was interested only in new or
- currently available product hardware and software.
-
- KENNEY described the advantages that using digital technology offers over
- both photocopy and microfilm: 1) The potential exists to create a higher
- quality reproduction of a deteriorating original than conventional
- light-lens technology. 2) Because a digital image is an encoded
- representation, it can be reproduced again and again with no resulting
- loss of quality, as opposed to the situation with light-lens processes,
- in which there is a discernible difference between a second and a
- subsequent generation of an image. 3) A digital image can be manipulated
- in a number of ways to improve image capture; for example, Xerox has
- developed a windowing application that enables one to capture a page
- containing both text and illustrations in a manner that optimizes the
- reproduction of both. (With light-lens technology, one must choose which
- to optimize, text or the illustration; in preservation microfilming, the
- current practice is to shoot an illustrated page twice, once to highlight
- the text and the second time to provide the best capture for the
- illustration.) 4) A digital image can also be edited: density levels
- can be adjusted to remove underlining and stains and to increase the
- legibility of faint documents. 5) On-screen inspection can take place
- at the time of initial setup and adjustments made prior to scanning,
- factors that substantially reduce the number of retakes required in
- quality control.
-
- A primary goal of CXP has been to evaluate the paper output printed on
- the Xerox DocuTech, a high-speed printer that produces 600-dpi pages from
- scanned images at a rate of 135 pages a minute. KENNEY recounted several
- publishing challenges in producing faithful and legible reproductions of
- the originals, challenges that the 600-dpi copy for the most part met
- successfully. For example, many of the deteriorating volumes in the
- project were heavily illustrated with fine line drawings or halftones or
- came in languages such as Japanese, in which the buildup of characters
- composed of varying strokes is difficult to reproduce at lower
- resolutions; a surprising number of them came with annotations and
- mathematical formulas, which it was critical to be able to duplicate
- exactly.
-
- KENNEY noted that 1) the copies are being printed on paper that meets the
- ANSI standards for performance, 2) the DocuTech printer meets the machine
- and toner requirements for proper adhesion of print to page, as described
- by the National Archives, and thus 3) the paper product is considered to
- be the archival equivalent of preservation photocopy.
-
- KENNEY then discussed several samples of the quality achieved in the
- project that had been distributed in a handout, for example, a copy of a
- print-on-demand version of the 1911 Reed lecture on the steam turbine,
- which contains halftones, line drawings, and illustrations embedded in
- text; the first four loose pages in the volume compared the capture
- capabilities of scanning to photocopy for a standard test target, the
- IEEE standard 167A 1987 test chart. In all instances scanning proved
- superior to photocopy, though only slightly more so in one.
-
- Conceding the simplistic nature of her comparison of scanning quality
- with photocopy, KENNEY described it as one representation of the kinds of
- settings that could be used with scanning capabilities on the equipment
- CXP uses. KENNEY also pointed out that CXP investigated the quality
- achieved with binary scanning only, and noted the great promise in gray
- scale and color scanning, whose advantages and disadvantages need to be
- examined. She argued further that scanning resolutions and file formats
- can represent a complex trade-off between the time it takes to capture
- material, file size, fidelity to the original, on-screen display,
- printing, and equipment availability. All these factors must be taken
- into consideration.
-
- CXP placed primary emphasis on the production in a timely and
- cost-effective manner of printed facsimiles that consisted largely of
- black-and-white text. With binary scanning, large files may be
- compressed efficiently and in a lossless manner (i.e., no data is lost in
- the process of compressing [and decompressing] an image--the exact
- bit-representation is maintained) using Group 4 CCITT (i.e., the French
- acronym for International Consultative Committee for Telegraph and
- Telephone) compression. CXP was getting compression ratios of about
- forty to one. Gray-scale compression, which primarily uses JPEG, is much
- less economical and can represent a lossy compression (i.e., not
- lossless), so that as one compresses and decompresses, the illustration
- is subtly changed. While binary files produce a high-quality printed
- version, it appears 1) that other combinations of spatial resolution with
- gray and/or color hold great promise as well, and 2) that gray scale can
- represent a tremendous advantage for on-screen viewing. The quality
- associated with binary and gray scale also depends on the equipment used.
- For instance, binary scanning produces a much better copy on a binary
- printer.
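
To give a sense of what a forty-to-one ratio means in practice, the
following is a rough sketch of the arithmetic for one binary page. The
600-dpi resolution and the forty-to-one ratio come from the talk; the
6 x 9 inch page size is an assumption chosen only for illustration, and
the function name is hypothetical.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel page image.

    Ignores any per-row padding or file header a real format would add.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8


# Assumed page size (not from the talk): 6 x 9 inches.
raw_bytes = bitonal_page_bytes(6, 9, 600)

# "About forty to one" is the ratio reported for Group 4 compression.
compressed_bytes = raw_bytes / 40

print(f"uncompressed: {raw_bytes / 1e6:.2f} MB")
print(f"compressed:   {compressed_bytes / 1e3:.1f} KB")
```

Actual ratios vary from page to page, since Group 4 compresses clean
text far better than dense halftones.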
-
- Among CXP's findings concerning the production of microfilm from digital
- files, KENNEY reported that the digital files for the same Reed lecture
- were used to produce sample film using an electron beam recorder. The
- resulting film was faithful to the image capture of the digital files,
- and while CXP felt that the text and image pages represented in the Reed
- lecture were superior to those of the light-lens film, the resolution
- readings for the 600 dpi were not as high as standard microfilming.
- KENNEY argued that the standards defined for light-lens technology are
- not totally transferable to a digital environment. Moreover, they are
- based on a definition of quality for a preservation copy. Although making
- this case will prove to be a long, uphill struggle, CXP plans to continue
- to investigate the issue over the course of the next year.
-
- KENNEY concluded this portion of her talk with a discussion of the
- advantages of creating film: it can serve as a primary backup and as a
- preservation master to the digital file; it could then become the print
- or production master and service copies could be paper, film, optical
- disks, magnetic media, or on-screen display.
-
- Finally, KENNEY presented details regarding production:
-
- * Development and testing of a moderately-high resolution production
- scanning workstation represented a third goal of CXP; to date, 1,000
- volumes have been scanned, or about 300,000 images.
-
- * The resulting digital files are stored and used to produce
- hard-copy replacements for the originals and additional prints on
- demand; although the initial costs are high, scanning technology
- offers an affordable means for reformatting brittle material.
-
- * A technician in production mode can scan 300 pages per hour when
- performing single-sheet scanning, which is a necessity when working
- with truly brittle paper; this figure is expected to increase
- significantly with subsequent iterations of the software from Xerox;
- a three-month time-and-cost study of scanning found that the average
- 300-page book would take about an hour and forty minutes to scan
- (this figure included the time for setup, which involves keying in
- primary bibliographic data, going into quality control mode to
- define page size, establishing front-to-back registration, and
- scanning sample pages to identify a default range of settings for
- the entire book--functions not dissimilar to those performed by
- filmers or those preparing a book for photocopy).
-
- * The final step in the scanning process involved rescans, which
- happily were few and far between, representing well under 1 percent
- of the total pages scanned.
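
The throughput figures above can be related with a little arithmetic.
This is only a sketch: it assumes that pure scanning in the time study
ran at the quoted 300 pages per hour, which the talk does not state, so
the "overhead" it derives is an inference and not a reported number.
The function names are hypothetical.

```python
PAGES_PER_HOUR = 300   # single-sheet scanning rate quoted in the talk
BOOK_PAGES = 300       # the "average 300-page book"
TOTAL_MINUTES = 100    # "about an hour and forty minutes", setup included


def implied_overhead_minutes(pages, pages_per_hour, total_minutes):
    """Minutes not accounted for by scanning at the quoted rate."""
    scanning_minutes = pages / pages_per_hour * 60
    return total_minutes - scanning_minutes


def effective_pages_per_hour(pages, total_minutes):
    """Pages per hour once setup and other handling are included."""
    return pages * 60 / total_minutes


print(implied_overhead_minutes(BOOK_PAGES, PAGES_PER_HOUR, TOTAL_MINUTES))
print(effective_pages_per_hour(BOOK_PAGES, TOTAL_MINUTES))
```

Under that assumption, setup and related handling account for a large
share of the time per book, which is why the effective rate falls well
below the raw scanning rate.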
-
- In addition to technician time, CXP costed out equipment (amortized over
- four years), the cost of storing and refreshing the digital files every
- four years, and the cost of printing and binding (in book cloth) a
- paper reproduction. The total amounted to a little under $65 per single
- 300-page volume, with 30 percent overhead included--a figure competitive
- with the prices currently charged by photocopy vendors.
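
As a worked example of the cost figure, the sketch below breaks the $65
total down per page and backs out the direct cost. The talk does not say
whether the 30 percent overhead was a markup on direct costs or a share
of the total, so both readings are shown; $65 is treated as an upper
bound because the reported total was "a little under" that. All names
are hypothetical.

```python
TOTAL_PER_VOLUME = 65.00  # upper bound; the talk says "a little under $65"
PAGES = 300
OVERHEAD = 0.30


def per_page(total, pages):
    """Cost per page of the finished volume."""
    return total / pages


def direct_cost_if_markup(total, overhead):
    """Direct cost if overhead was added on top of direct costs."""
    return total / (1 + overhead)


def direct_cost_if_share(total, overhead):
    """Direct cost if overhead is a fraction of the final total."""
    return total * (1 - overhead)


print(f"per page:         ${per_page(TOTAL_PER_VOLUME, PAGES):.3f}")
print(f"direct (markup):  ${direct_cost_if_markup(TOTAL_PER_VOLUME, OVERHEAD):.2f}")
print(f"direct (share):   ${direct_cost_if_share(TOTAL_PER_VOLUME, OVERHEAD):.2f}")
```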
-
- Of course, with scanning, in addition to the paper facsimile, one is left
- with a digital file from which subsequent copies of the book can be
- produced for a fraction of the cost of photocopy, with readers afforded
- choices in the form of these copies.
-
- KENNEY concluded that digital technology offers an electronic means for a
- library preservation effort to pay for itself. If a brittle-book program
- included the means of disseminating reprints of books that are in demand
- by libraries and researchers alike, the initial investment in capture
- could be recovered and used to preserve additional but less popular
- books. She disclosed that an economic model for a self-sustaining
- program could be developed for CXP's report to the Commission on
- Preservation and Access (CPA).
-
- KENNEY stressed that the focus of CXP has been on obtaining high quality
- in a production environment. The use of digital technology is viewed as
- an affordable alternative to other reformatting options.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ANDRE * Overview and history of NATDP * Various agricultural CD-ROM
- products created inhouse and by service bureaus * Pilot project on
- Internet transmission * Additional products in progress *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Pamela ANDRE, associate director for automation, National Agricultural
- Text Digitizing Program (NATDP), National Agricultural Library (NAL),
- presented an overview of NATDP, which has been underway at NAL the last
- four years, before Judith ZIDAR discussed the technical details. ANDRE
- defined agricultural information as a broad range of material going from
- basic and applied research in the hard sciences to the one-page pamphlets
- that are distributed by the cooperative state extension services on such
- things as how to grow blueberries.
-
- NATDP began in late 1986 with a meeting of representatives from the
- land-grant library community to deal with the issue of electronic
- information. NAL and forty-five of these libraries banded together to
- establish this project--to evaluate the technology for converting what
- were then source documents in paper form into electronic form, to provide
- access to that digital information, and then to distribute it.
- Distributing that material to the community--the university community as
- well as the extension service community, potentially down to the county
- level--constituted the group's chief concern.
-
- Since January 1988 (when the microcomputer-based scanning system was
- installed at NAL), NATDP has done a variety of things, concerning which
- ZIDAR would provide further details. For example, the first technology
- considered in the project's discussion phase was digital videodisc, which
- indicates how long ago it was conceived.
-
- Over the four years of this project, four separate CD-ROM products on
- four different agricultural topics were created, two at a
- scanning-and-OCR station installed at NAL, and two by service bureaus.
- Thus, NATDP has gained comparative information on the relative costs of
- the two approaches. Each of these products contained the full ASCII text
- as well as page images of the material, or between 4,000 and 6,000 pages
- of material on these disks. The four topics were aquaculture; food,
- agriculture, and science (i.e., international agriculture and research);
- acid rain; and Agent Orange, which was the final product distributed
- (approximately eighteen months before the Workshop).
-
- The third phase of NATDP focused on delivery mechanisms other than
- CD-ROM. At the suggestion of Clifford LYNCH, who was a technical
- consultant to the project at this point, NATDP became involved with the
- Internet and initiated a project with the help of North Carolina State
- University, in which fourteen of the land-grant university libraries are
- transmitting digital images over the Internet in response to interlibrary
- loan requests--a topic for another meeting. At this point, the pilot
- project had been completed for about a year and the final report would be
- available shortly after the Workshop. In the meantime, the project's
- success had led to its extension. (ANDRE noted that one of the first
- things done under the program title was to select a retrieval package to
- use with subsequent products; Windows Personal Librarian was the package
- of choice after a lengthy evaluation.)
-
- Three additional products had been planned and were in progress:
-
- 1) An arrangement with the American Society of Agronomy--a
- professional society that has published the Agronomy Journal since
- about 1908--to scan and create bit-mapped images of its journal.
- ASA granted permission first to put this material in electronic
- form and then to distribute it, to hold it at NAL, and to use these
- electronic images as a mechanism to deliver documents or print out
- material for patrons, among other uses. Effectively, NAL has the
- right to use this material in support of its program.
- (Significantly, this arrangement offers a potential cooperative
- model for working with other professional societies in agriculture
- to try to do the same thing--put the journals of particular interest
- to agriculture research into electronic form.)
-
- 2) An extension of the earlier product on aquaculture.
-
- 3) The George Washington Carver Papers--a joint project with
- Tuskegee University to scan and convert from microfilm some 3,500
- images of Carver's papers, letters, and drawings.
-
- It was anticipated that all of these products would appear no more than
- six months after the Workshop.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ZIDAR * (A separate arena for scanning) * Steps in creating a database *
- Image capture, with and without performing OCR * Keying in tracking data
- * Scanning, with electronic and manual tracking * Adjustments during
- scanning process * Scanning resolutions * Compression * De-skewing and
- filtering * Image capture from microform: the papers and letters of
- George Washington Carver * Equipment used for a scanning system *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program
- (NATDP), National Agricultural Library (NAL), illustrated the technical
- details of NATDP, including her primary responsibility, scanning and
- creating databases on a topic and putting them on CD-ROM.
-
- (ZIDAR remarked on a separate arena from the CD-ROM projects, although
- the processing of the material is nearly identical: NATDP is also
- scanning material and loading it on a NeXT microcomputer, which in turn
- is linked to NAL's integrated library system. Thus, searches in NAL's
- bibliographic database will enable people to pull up actual page images
- and text for any documents that have been entered.)
-
- In accordance with the session's topic, ZIDAR focused her illustrated
- talk on image capture, offering a primer on the three main steps in the
- process: 1) assemble the printed publications; 2) design the database
- (database design occurs in the process of preparing the material for
- scanning; this step entails reviewing and organizing the material,
- defining the contents--what will constitute a record, what kinds of
- fields will be captured in terms of author, title, etc.); 3) perform a
- certain amount of markup on the paper publications. NAL performs this
- task record by record, preparing work sheets or some other sort of
- tracking material and designing descriptors and other enhancements to be
- added to the data that will not be captured from the printed publication.
- Part of this process also involves determining NATDP's file and directory
- structure: NATDP attempts to avoid putting more than approximately 100
- images in a directory, because placing more than that on a CD-ROM would
- reduce the access speed.
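
The directory rule described above can be sketched as a simple bucketing
step. The limit of roughly 100 images per directory is from the talk;
the directory and file naming scheme below is invented for illustration
and is not NATDP's.

```python
def assign_directories(image_names, max_per_dir=100):
    """Group image names into numbered directories of at most max_per_dir."""
    layout = {}
    for index, name in enumerate(image_names):
        dir_name = f"DIR{index // max_per_dir:04d}"  # hypothetical naming
        layout.setdefault(dir_name, []).append(name)
    return layout


# Hypothetical file names for a 250-page batch.
names = [f"PAGE{n:05d}.TIF" for n in range(1, 251)]
layout = assign_directories(names)

for dir_name, files in layout.items():
    print(dir_name, len(files), files[0], files[-1])
```

Keeping images in scan order within each directory means a record's
pages stay adjacent, which suits a tracking system that works one record
at a time.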
-
- This up-front process takes approximately two weeks for a
- 6,000-7,000-page database. The next step is to capture the page images.
- How long this process takes is determined by the decision whether or not
- to perform OCR. Not performing OCR speeds the process, whereas text
- capture requires greater care because of the quality of the image: it
- has to be straighter and allowance must be made for text on a page, not
- just for the capture of photographs.
-
- NATDP keys in tracking data, that is, a standard bibliographic record
- including the title of the book and the title of the chapter, which will
- later either become the access information or will be attached to the
- front of a full-text record so that it is searchable.
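
One way to picture "attached to the front of a full-text record" is the
sketch below, which prepends labeled tracking fields to the OCR text so
that a plain text search reaches both. The field names, record layout,
and sample titles are made up for illustration; the talk does not
describe NATDP's actual record format.

```python
def build_full_text_record(tracking, ocr_text):
    """Prepend labeled bibliographic fields to the OCR text of one record."""
    header = "\n".join(
        f"{field.upper()}: {value}" for field, value in tracking.items()
    )
    return header + "\n\n" + ocr_text


def matches(record, term):
    """Case-insensitive substring search over the whole record."""
    return term.lower() in record.lower()


# Made-up sample data.
tracking = {"book_title": "Sample Handbook", "chapter_title": "Pond Design"}
record = build_full_text_record(tracking, "Ponds should be sited on level ground.")

print(record)
print(matches(record, "pond design"))
```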
-
- Images can be scanned from a bound or unbound publication; NATDP scans
- chiefly from bound publications, because often they are the only
- copies and the publications are returned to the shelves. NATDP
- usually scans one record at a time, because its database tracking system
- tracks the document in that way and does not require further logical
- separating of the images. After performing optical character
- recognition, NATDP moves the images off the hard disk and maintains a
- volume sheet. Though the system tracks electronically, all the
- processing steps are also tracked manually with a log sheet.
-
- ZIDAR next illustrated the kinds of adjustments that one can make when
- scanning from paper and microfilm, for example, redoing images that need
- special handling, setting for dithering or gray scale, and adjusting
- brightness for a single page or for the whole book at one time.
-
- NATDP is scanning at 300 dots per inch, a standard scanning resolution.
- Though adequate for capturing text that is all of a standard size, 300
- dpi is unsuitable for any kind of photographic material or for very small
- text. Many scanners allow for different image formats, TIFF, of course,
- being a de facto standard. But if one intends to exchange images with
- other people, the ability to scan other image formats, even if they are
- less common, becomes highly desirable.
-
- CCITT Group 4 is the standard compression for normal black-and-white
- images, JPEG for gray scale or color. ZIDAR recommended 1) using the
- standard compressions, particularly if one attempts to make material
- available and to allow users to download images and reuse them from
- CD-ROMs; and 2) maintaining the ability to output an uncompressed image,
- because in image exchange uncompressed images are more likely to be able
- to cross platforms.
-
- ZIDAR emphasized the importance of de-skewing and filtering as
- requirements on NATDP's upgraded system. For instance, scanning bound
- books, particularly books published by the federal government whose pages
- are skewed, and trying to scan them straight if OCR is to be performed,
- is extremely time-consuming. The same holds for filtering of
- poor-quality or older materials.
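
The talk does not say how NATDP's upgraded system de-skews. As an
illustration only, the sketch below uses one common idea, a projection
profile: try several small slopes, undo each one, and keep the slope
that packs the black pixels into the fewest rows. The page is a
synthetic set of (row, column) pixels, slopes are expressed as whole
pixels of drift per 100 columns, and every name is hypothetical.

```python
from collections import Counter


def shift_for(slope_pct, col):
    """Vertical drift in whole pixels at a column, for a slope in percent."""
    return (slope_pct * col + 50) // 100


def profile_score(pixels, slope_pct):
    """Sum of squared row counts after undoing a trial slope."""
    rows = Counter(row - shift_for(slope_pct, col) for row, col in pixels)
    return sum(count * count for count in rows.values())


def estimate_skew(pixels, candidates):
    """Return the trial slope whose corrected row profile is sharpest."""
    return max(candidates, key=lambda s: profile_score(pixels, s))


def deskew(pixels, slope_pct):
    """Shift every pixel back by the drift implied by the slope."""
    return {(row - shift_for(slope_pct, col), col) for row, col in pixels}


# Synthetic page: three baselines drifting 10 pixels per 100 columns.
TRUE_SLOPE = 10
page = {
    (base + shift_for(TRUE_SLOPE, col), col)
    for base in (10, 25, 40)
    for col in range(80)
}

found = estimate_skew(page, range(-20, 21, 5))
straight = deskew(page, found)
print(found, sorted({row for row, _ in straight}))
```

Real pages need a finer search and a true rotation instead of a column
shift, but the scoring idea is the same, and it suggests why straight
text lines matter for OCR.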
-
- ZIDAR described image capture from microform, using as an example three
- reels from a sixty-seven-reel set of the papers and letters of George
- Washington Carver that had been produced by Tuskegee University. These
- resulted in approximately 3,500 images, which NATDP had had scanned by
- its service contractor, Science Applications International Corporation
- (SAIC). NATDP also created bibliographic records for access. (NATDP did
- not have such specialized equipment as a microfilm scanner.)
-
- Unfortunately, the process of scanning from microfilm was not an
- unqualified success, ZIDAR reported: because microfilm frame sizes vary,
- occasionally some frames were missed, which without spending much time
- and money could not be recaptured.
-
- OCR could not be performed from the scanned images of the frames. Because
- of bleeding in the text, OCR produced output so poor that it could not
- even be edited. NATDP tested for negative versus positive images,
- landscape versus portrait orientation, and single- versus dual-page
- microfilm, none of which seemed to affect the quality of the image; nor
- could OCR be performed on any of them.
-
- In selecting the microfilm they would use, therefore, NATDP had other
- factors in mind. ZIDAR noted two factors that influenced the quality of
- the images: 1) the inherent quality of the original and 2) the amount of
- size reduction on the pages.
-
- The Carver papers were selected because they are informative and visually
- interesting, treat a single subject, and are valuable in their own right.
- The images were scanned and divided into logical records by SAIC, then
- delivered, and loaded onto NATDP's system, where bibliographic
- information taken directly from the images was added. Scanning was
- completed in summer 1991 and by the end of summer 1992 the disk was
- scheduled to be published.
-
- Problems encountered during processing included the following: Because
- the microfilm scanning had to be done in a batch, adjustment for
- individual page variations was not possible. The frame size varied on
- account of the nature of the material, and therefore some of the frames
- were missed while others were just partial frames. The only way to go
- back and capture this material was to print out the page with the
- microfilm reader from the missing frame and then scan it in from the
- page, which was extremely time-consuming. The quality of the images
- scanned from the printout of the microfilm compared unfavorably with that
- of the original images captured directly from the microfilm. The
- inability to perform OCR also was a major disappointment. At the time,
- computer output microfilm was unavailable to test.
-
- The equipment used for a scanning system was the last topic addressed by
- ZIDAR. The type of equipment that one would purchase for a scanning
- system included: a microcomputer, at least a 386, but preferably a 486;
- a large hard disk, 380 megabyte at minimum; a multi-tasking operating
- system that allows one to run some things in batch in the background
- while scanning or doing text editing, for example, Unix or OS/2 and,
- theoretically, Windows; a high-speed scanner and scanning software that
- allows one to make the various adjustments mentioned earlier; a
- high-resolution monitor (150 dpi); OCR software and hardware to perform
- text recognition; an optical disk subsystem on which to archive all the
- images as the processing is done; file management and tracking software.
-
- ZIDAR opined that the software one purchases was more important than the
- hardware and might also cost more than the hardware, but it was likely to
- prove critical to the success or failure of one's system. In addition to
- a stand-alone scanning workstation for image capture, then, text capture
- requires one or two editing stations networked to this scanning station
- to perform editing. Editing the text takes two or three times as long as
- capturing the images.
-
- Finally, ZIDAR stressed the importance of buying an open system that allows
- for more than one vendor, complies with standards, and can be upgraded.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- WATERS * Yale University Library's master plan to convert microfilm to
- digital imagery (POB) * The place of electronic tools in the library of
- the future * The uses of images and an image library * Primary input from
- preservation microfilm * Features distinguishing POB from CXP and key
- hypotheses guiding POB * Use of vendor selection process to facilitate
- organizational work * Criteria for selecting vendor * Finalists and
- results of process for Yale * Key factor distinguishing vendors *
- Components, design principles, and some estimated costs of POB * Role of
- preservation materials in developing imaging market * Factors affecting
- quality and cost * Factors affecting the usability of complex documents
- in image form *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Donald WATERS, head of the Systems Office, Yale University Library,
- reported on the progress of a master plan for a project at Yale to
- convert microfilm to digital imagery, Project Open Book (POB). Stating
- that POB was in an advanced stage of planning, WATERS detailed, in
- particular, the process of selecting a vendor partner and several key
- issues under discussion as Yale prepares to move into the project itself.
- He commented first on the vision that serves as the context of POB and
- then described its purpose and scope.
-
- WATERS sees the library of the future not necessarily as an electronic
- library but as a place that generates, preserves, and improves for its
- clients ready access to both intellectual and physical recorded
- knowledge. Electronic tools must find a place in the library in the
- context of this vision. Several roles for electronic tools include
- serving as: indirect sources of electronic knowledge or as "finding"
- aids (the on-line catalogues, the article-level indices, registers for
- documents and archives); direct sources of recorded knowledge; full-text
- images; and various kinds of compound sources of recorded knowledge (the
- so-called compound documents of Hypertext, mixed text and image,
- mixed-text image format, and multimedia).
-
- POB is looking particularly at images and an image library, the uses to
- which images will be put (e.g., storage, printing, browsing, and then use
- as input for other processes), OCR as a subsequent process to image
- capture, or creating an image library, and also possibly generating
- microfilm.
-
- While input will come from a variety of sources, POB is considering
- especially input from preservation microfilm. A possible outcome is that
- the film and paper which provide the input for the image library
- eventually may go off into remote storage, and that the image library may
- be the primary access tool.
-
- The purpose and scope of POB focus on imaging. Though related to CXP,
- POB has two features which distinguish it: 1) scale--conversion of
- 10,000 volumes into digital image form; and 2) source--conversion from
- microfilm. Given these features, several key working hypotheses guide
- POB, including: 1) Since POB is using microfilm, it is not concerned with
- the image library as a preservation medium. 2) Digital imagery can improve
- access to recorded knowledge through printing and network distribution at
- a modest cost increment over microfilm. 3) Capturing and storing documents
- in a digital image form is necessary to further improvements in access.
- (POB distinguishes between the imaging, digitizing process and OCR,
- which at this stage it does not plan to perform.)
-
- Currently in its first or organizational phase, POB found that it could
- use a vendor selection process to facilitate a good deal of the
- organizational work (e.g., creating a project team and advisory board,
- confirming the validity of the plan, establishing the cost of the project
- and a budget, selecting the materials to convert, and then raising the
- necessary funds).
-
- POB developed numerous selection criteria, including: a firm committed
- to image-document management, the ability to serve as systems integrator
- in a large-scale project over several years, interest in developing the
- requisite software as a standard rather than a custom product, and a
- willingness to invest substantial resources in the project itself.
-
- Two vendors, DEC and Xerox, were selected as finalists in October 1991,
- and with the support of the Commission on Preservation and Access, each
- was commissioned to generate a detailed requirements analysis for the
- project and then to submit a formal proposal for the completion of the
- project, which included a budget and costs. The terms were that POB would
- pay the loser. The results for Yale of involving a vendor included:
- broad involvement of Yale staff across the board at a relatively low
- cost, which may have long-term significance in carrying out the project
- (twenty-five to thirty university people are engaged in POB); better
- understanding of the factors that affect corporate response to markets
- for imaging products; a competitive proposal; and a more sophisticated
- view of the imaging markets.
-
- The most important factor that distinguished the vendors under
- consideration was their identification with the customer. The size and
- internal complexity of the company also was an important factor. POB was
- looking at large companies that had substantial resources. In the end,
- the process generated for Yale two competitive proposals, with Xerox's
- the clear winner. WATERS then described the components of the proposal,
- the design principles, and some of the costs estimated for the process.
-
- Components are essentially four: a conversion subsystem, a
- network-accessible storage subsystem for 10,000 books (and POB expects
- 200 to 600 dpi storage), browsing stations distributed on the campus
- network, and network access to the image printers.
-
- Among the design principles, POB wanted conversion at the highest
- possible resolution. Assuming TIFF files with Group 4 compression,
- TCP/IP, and an Ethernet network on campus, POB wanted a
- client-server approach with image documents distributed to the
- workstations and made accessible through native workstation interfaces
- such as Windows. POB also insisted on a phased approach to
- implementation: 1) a stand-alone, single-user, low-cost entry into the
- business with a workstation focused on conversion and allowing POB to
- explore user access; 2) movement into a higher-volume conversion with
- network-accessible storage and multiple access stations; and 3) a
- high-volume conversion, full-capacity storage, and multiple browsing
- stations distributed throughout the campus.
-
- The costs proposed for start-up assumed the existence of the Yale network
- and its two DocuTech image printers. Other start-up costs are estimated
- at $1 million over the three phases. At the end of the project, the annual
- operating costs estimated primarily for the software and hardware proposed
- come to about $60,000, but these exclude costs for labor needed in the
- conversion process, network and printer usage, and facilities management.
-
- Finally, the selection process produced for Yale a more sophisticated
- view of the imaging markets: the management of complex documents in
- image form is not a preservation problem, not a library problem, but a
- general problem in a broad, general industry. Preservation materials are
- useful for developing that market because of the qualities of the
- material. For example, much of it is out of copyright. The resolution
- of key issues such as the quality of scanning and image browsing also
- will affect development of that market.
-
- The technology is readily available but changing rapidly. In this
- context of rapid change, several factors affect quality and cost, to
- which POB intends to pay particular attention, for example, the various
- levels of resolution that can be achieved. POB believes it can bring
- resolution up to 600 dpi, but an interpolation process from 400 to 600 is
- more likely. Variation in the quality of microfilm will prove to be a
- highly important factor. POB may reexamine the standards used to film in
- the first place by looking at this process as a follow-on to microfilming.
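
To make "interpolation from 400 to 600" concrete, here is a minimal
sketch of the simplest resampling method, nearest-neighbor lookup, on a
tiny bitmap. The talk does not say which interpolation method POB or its
vendor would use; this one is chosen only because it is short, and the
function name is hypothetical.

```python
def upscale_nearest(image, src_dpi=400, dst_dpi=600):
    """Resample a row-major bitmap from src_dpi to dst_dpi.

    Each output pixel copies the source pixel it falls on (floor mapping).
    """
    src_h, src_w = len(image), len(image[0])
    dst_h = src_h * dst_dpi // src_dpi
    dst_w = src_w * dst_dpi // src_dpi
    return [
        [image[y * src_dpi // dst_dpi][x * src_dpi // dst_dpi]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


small = [[1, 0],
         [0, 1]]
for row in upscale_nearest(small):
    print(row)
```

Whatever the method, interpolation produces more pixels but no new
detail: a 600-dpi file made this way holds only what the 400-dpi scan
captured.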
-
- Other important factors include: the techniques available to the
- operator for handling material, the ways of integrating quality control
- into the digitizing work flow, and a work flow that includes indexing and
- storage. POB's requirement was to be able to deal with quality control
- at the point of scanning. Thus, thanks to Xerox, POB anticipates having
- a mechanism which will allow it not only to scan in batch form, but to
- review the material as it goes through the scanner and control quality
- from the outset.
-
- The standards for measuring quality and costs depend greatly on the uses
- of the material, including subsequent OCR, storage, printing, and
- browsing. But especially at issue for POB is the facility for browsing.
- This facility, WATERS said, is perhaps the weakest aspect of imaging
- technology and the most in need of development.
-
- A variety of factors affect the usability of complex documents in image
- form, among them: 1) the ability of the system to handle the full range
- of document types, not just monographs but serials, multi-part
- monographs, and manuscripts; 2) the location of the database of record
- for bibliographic information about the image document, which POB wants
- to enter once and in the most useful place, the on-line catalog; 3) a
- document identifier for referencing the bibliographic information in one
- place and the images in another; 4) the technique for making the basic
- internal structure of the document accessible to the reader; and finally,
- 5) the physical presentation on the CRT of those documents. POB is ready
- to complete this phase now. One last decision involves deciding which
- material to scan.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * TIFF files constitute de facto standard * NARA's experience
- with image conversion software and text conversion * RFC 1314 *
- Considerable flux concerning available hardware and software solutions *
- NAL through-put rate during scanning * Window management questions *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- In the question-and-answer period that followed WATERS's presentation,
- the following points emerged:
-
- * ZIDAR's statement about using TIFF files as a standard meant de
- facto standard. This is what most people use and typically exchange
- with other groups, across platforms, or even occasionally across
- display software.
-
- * HOLMES commented on the unsuccessful experience of NARA in
- attempting to run image-conversion software or to exchange between
- applications: What are supposedly TIFF files go into other software
- that is supposed to be able to accept TIFF but cannot recognize the
- format and cannot deal with it, and thus renders the exchange
- useless. Re text conversion, he noted the different recognition
- rates obtained by substituting the make and model of scanners in
- NARA's recent test of an "intelligent" character-recognition product
- for a new company. In the selection of hardware and software,
- HOLMES argued, software no longer constitutes the overriding factor
- it did until about a year ago; rather it is perhaps important to
- look at both now.
-
- * Danny Cohen and Alan Katz of the University of Southern California
- Information Sciences Institute began circulating about a month ago an
- Internet RFC (RFC 1314) that proposes a standard TIFF interchange
- format for Internet distribution of monochrome bit-mapped images,
- which LYNCH said he believed would be used as a de facto standard.
-
- * FLEISCHHAUER's impression from hearing these reports and thinking
- about AM's experience was that there is considerable flux concerning
- available hardware and software solutions. HOOTON agreed and
- commented at the same time on ZIDAR's statement that the equipment
- employed affects the results produced. One cannot draw a complete
- conclusion by saying it is difficult or impossible to perform OCR
- from scanning microfilm, for example, with that device, that set of
- parameters, and system requirements, because numerous other people
- are accomplishing just that, using other components, perhaps.
- HOOTON opined that both the hardware and the software were highly
- important. Most of the problems discussed today have been solved in
- numerous different ways by other people. Though it is good to be
- cognizant of various experiences, this is not to say that it will
- always be thus.
-
- * At NAL, the through-put rate of the scanning process for paper,
- page by page, performing OCR, ranges from 300 to 600 pages per day;
- not performing OCR is considerably faster, although how much faster
- is not known. This is for scanning from bound books, which is much
- slower.
-
- * WATERS commented on window management questions: DEC proposed an
- X-Windows solution which was problematical for two reasons. One was
- POB's requirement to be able to manipulate images on the workstation
- and bring them down to the workstation itself and the other was
- network usage.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- THOMA * Illustration of deficiencies in scanning and storage process *
- Image quality in this process * Different costs entailed by better image
- quality * Techniques for overcoming various deficiencies: fixed
- thresholding, dynamic thresholding, dithering, image merge * Page edge
- effects *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- George THOMA, chief, Communications Engineering Branch, National Library
- of Medicine (NLM), illustrated several of the deficiencies discussed by
- the previous speakers. He introduced the topic of special problems by
- noting the advantages of electronic imaging. For example, it is regenerable
- because it is a coded file, and real-time quality control is possible with
- electronic capture, whereas in photographic capture it is not.
-
- One of the difficulties discussed in the scanning and storage process was
- image quality which, without belaboring the obvious, means different
- things for maps, medical X-rays, or broadcast television. In the case of
- documents, THOMA said, image quality boils down to legibility of the
- textual parts, and fidelity in the case of gray or color photo print-type
- material. Legibility boils down to scan density, the standard in most
- cases being 300 dpi. Increasing the resolution with scanners that
- perform 600 or 1200 dpi, however, comes at a cost.
-
- Better image quality entails at least four different kinds of costs: 1)
- equipment costs, because the CCD (i.e., charge-coupled device) with a
- greater number of elements costs more; 2) time costs that translate to
- the actual capture costs, because manual labor is involved (the time is
- also dependent on the fact that more data has to be moved around in the
- machine in the scanning or network devices that perform the scanning as
- well as the storage); 3) media costs, because at high resolutions larger
- files have to be stored; and 4) transmission costs, because there is just
- more data to be transmitted.
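-
- The media and transmission costs can be made concrete with a little
- arithmetic. The following is a minimal sketch, not a figure from THOMA's
- talk: it assumes an 8.5" by 11" page scanned at one bit per pixel with
- no compression, and the function name raw_bilevel_bytes is hypothetical.
-

```python
import math


def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan of one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px / 8)


# Assumed page size: 8.5 x 11 inches (not stated in the talk).
sizes = {dpi: raw_bilevel_bytes(8.5, 11, dpi) for dpi in (300, 600, 1200)}

for dpi, size in sizes.items():
    ratio = size / sizes[300]
    print(f"{dpi:>5} dpi: {size:>11,} bytes ({ratio:.0f}x the 300 dpi file)")
```

-
- Because the pixel count grows with the square of the scan density,
- doubling the resolution quadruples the raw data to be stored and moved.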
-
- But while resolution takes care of the issue of legibility in image
- quality, other deficiencies have to do with contrast and elements on the
- page scanned or the image that needed to be removed or clarified. Thus,
- THOMA proceeded to illustrate various deficiencies, how they are
- manifested, and several techniques to overcome them.
-
- Fixed thresholding was the first technique described, suitable for
- black-and-white text, when the contrast does not vary over the page. One
- can have many different threshold levels in scanning devices. Thus,
- THOMA offered an example of extremely poor contrast, which resulted from
- the fact that the stock was a heavy red. This is the sort of image that
- when microfilmed fails to provide any legibility whatsoever. Fixed
- thresholding is the way to change the black-to-red contrast to the
- desired black-to-white contrast.
-
- Other examples included material that had been browned or yellowed by
- age. This was also a case of contrast deficiency, and correction was
- done by fixed thresholding. A final example boils down to the same
- thing, slight variability, but it is not significant. Fixed thresholding
- solves this problem as well. The microfilm equivalent is certainly legible,
- but it comes with dark areas. Though THOMA did not have a slide of the
- microfilm in this case, he did show the reproduced electronic image.
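-
- Fixed thresholding as described above can be sketched in a few lines.
- This is an illustration under stated assumptions, not THOMA's
- implementation: the page is taken to be a grid of gray values from 0
- (black) to 255 (white), the sample values are invented, and the name
- fixed_threshold is hypothetical.
-

```python
def fixed_threshold(gray, threshold):
    """Return a bilevel image: 1 (black) where a pixel is darker than
    the single page-wide threshold, otherwise 0 (white)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Invented low-contrast scan: ink near 40 on a dull background near 90.
page = [
    [90, 88, 41, 92],
    [91, 39, 40, 89],
    [90, 90, 42, 91],
]

bilevel = fixed_threshold(page, threshold=65)
for row in bilevel:
    print(row)
```

-
- Any threshold between the ink level and the background level separates
- the two cleanly, which is why the technique suits pages whose contrast
- is poor but uniform.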
-
- When one has variable contrast over a page or the lighting over the page
- area varies, especially in the case where a bound volume has light
- shining on it, the image must be processed by a dynamic thresholding
- scheme. One scheme, dynamic averaging, allows the threshold level not to
- be fixed but to be recomputed for every pixel from the neighboring
- characteristics. The neighbors of a pixel determine where the threshold
- should be set for that pixel.
-
- THOMA showed an example of a page that had been degraded in a variety
- of ways, including a burn mark, coffee stains, and a yellow
- marker. Application of a fixed-thresholding scheme, THOMA argued, might
- take care of several deficiencies on the page but not all of them.
- Performing the calculation for a dynamic threshold setting, however,
- removes most of the deficiencies so that at least the text is legible.
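-
- One way to recompute the threshold for every pixel from its neighbors
- is to compare each pixel with the mean of a small surrounding window.
- The sketch below assumes that particular scheme (a local mean minus a
- fixed offset); the talk does not specify the calculation, and the
- window radius, offset, sample values, and function name are all
- hypothetical.
-

```python
def local_mean_threshold(gray, radius=1, offset=15):
    """Bilevel image where a pixel is black (1) if it is darker than the
    mean of its neighborhood by more than `offset`, else white (0)."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Neighborhood is clipped at the page borders.
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = sum(window) / len(window) - offset
            out_row.append(1 if gray[y][x] < threshold else 0)
        out.append(out_row)
    return out


# Invented page whose lighting brightens from left to right; ink sits at
# columns 2 and 5 and is always 50 levels darker than its surroundings.
row = [60, 80, 50, 120, 140, 110, 180, 200]
page = [row[:] for _ in range(3)]

dynamic = local_mean_threshold(page)
fixed = [1 if pixel < 128 else 0 for pixel in row]

print("dynamic:", dynamic[0])
print("fixed:  ", fixed)
```

-
- In this sample no single threshold works: the ink on the bright side
- (110) is lighter than the bare background on the dark side (60), so a
- fixed setting either blackens the shadowed area or drops the ink.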
-
- Another problem is representing a gray level with black-and-white pixels
- by a process known as dithering or electronic screening. But dithering
- does not provide good image quality for pure black-and-white textual
- material. THOMA illustrated this point with examples. Although its
- suitability for photoprint is the reason for electronic screening or
- dithering, it cannot be used for every compound image. In the document
- that was distributed by CXP, THOMA noticed that the dithered image of the
- IEEE test chart evinced some deterioration in the text. He presented an
- extreme example of deterioration in the text in which compounded
- documents had to be set right by other techniques. The technique
- illustrated by the present example was an image merge in which the page
- is scanned twice and the settings go from fixed threshold to the
- dithering matrix; the resulting images are merged to give the best
- results with each technique.
-
- THOMA illustrated how dithering is also used in nonphotographic or
- nonprint materials with an example of a grayish page from a medical text,
- which was reproduced to show all of the gray that appeared in the
- original. Dithering provided a reproduction of all the gray in the
- original of another example from the same text.
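-
- The dithering and image-merge techniques can be sketched together.
- The code below is an illustration only: it assumes ordered dithering
- with a 2 x 2 Bayer matrix (the talk does not name the matrix used) and
- assumes that a mask marking the photographic region is available for
- the merge. All names and sample values are hypothetical.
-

```python
BAYER_2X2 = [[0, 2], [3, 1]]


def ordered_dither(gray):
    """Bilevel image using a position-dependent threshold from the matrix."""
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, pixel in enumerate(row):
            threshold = (BAYER_2X2[y % 2][x % 2] + 0.5) * 255 / 4
            out_row.append(1 if pixel < threshold else 0)
        out.append(out_row)
    return out


def fixed_threshold(gray, threshold=128):
    """Bilevel image using one threshold for the whole page."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def merge(text_pass, photo_pass, photo_mask):
    """Take dithered pixels where the mask is set, thresholded ones elsewhere."""
    return [
        [p if m else t for t, p, m in zip(t_row, p_row, m_row)]
        for t_row, p_row, m_row in zip(text_pass, photo_pass, photo_mask)
    ]


# Invented compound page: faint ink (100) on white at left, mid-gray
# photographic area (128) at right.
page = [
    [100, 255, 128, 128],
    [255, 100, 128, 128],
]
photo_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

text_pass = fixed_threshold(page)
photo_pass = ordered_dither(page)
merged = merge(text_pass, photo_pass, photo_mask)

print("dither only:", photo_pass)
print("merged:     ", merged)
```

-
- In this sample the dither-only pass loses both ink pixels in the text
- area, while the merged result keeps the thresholded text and renders
- the gray area as a half-black pattern.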
-
- THOMA finally illustrated the problem of bordering, or page-edge,
- effects. Books and bound volumes that are placed on a photocopy machine
- or a scanner produce page-edge effects that are undesirable for two
- reasons: 1) the aesthetics of the image; after all, if the image is to
- be preserved, one does not necessarily want to keep all of its
- deficiencies; 2) compression (with the bordering problem THOMA
- illustrated, the compression ratio deteriorated tremendously). One way
- to eliminate this more serious problem is to have the operator at the
- point of scanning window the part of the image that is desirable and
- automatically turn all of the pixels outside that window to white.
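-
- The windowing step amounts to keeping the pixels inside an
- operator-chosen rectangle and forcing everything else to white. A
- minimal sketch follows, assuming a bilevel image in which 1 is black
- and 0 is white; the function name and the coordinate convention (top
- and left inclusive, bottom and right exclusive) are hypothetical.
-

```python
def window_to_white(bilevel, top, left, bottom, right):
    """Keep pixels with top <= row < bottom and left <= col < right;
    set every pixel outside that window to 0 (white)."""
    return [
        [
            pixel if top <= y < bottom and left <= x < right else 0
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(bilevel)
    ]


# Invented scan with a black border produced by the page edge.
scan = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
]

clean = window_to_white(scan, top=1, left=1, bottom=3, right=4)
for row in clean:
    print(row)
```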
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- FLEISCHHAUER * AM's experience with scanning bound materials * Dithering
- *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress,
- reported AM's experience with scanning bound materials, which he likened
- to the problems involved in using photocopying machines. Very few
- devices in the industry offer book-edge scanning, let alone book cradles.
- The problem may be unsolvable, FLEISCHHAUER said, because a large enough
- market does not exist for a preservation-quality scanner. AM is using a
- Kurzweil scanner, which is a book-edge scanner now sold by Xerox.
-
- Devoting the remainder of his brief presentation to dithering,
- FLEISCHHAUER related AM's experience with a contractor who was using
- unsophisticated equipment and software to reduce moire patterns from
- printed halftones. AM took the same image and used the dithering
- algorithm that forms part of the same Kurzweil Xerox scanner; it
- disguised moire patterns much more effectively.
-
- FLEISCHHAUER also observed that dithering produces a binary file which is
- useful for numerous purposes, for example, printing it on a laser printer
- without having to "re-halftone" it. But it tends to defeat efficient
- compression, because the very thing that dithers to reduce moire patterns
- also tends to work against compression schemes. AM thought the
- difference in image quality was worth it.
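-
- The tension between dithering and compression can be illustrated by
- counting runs of same-colored pixels. The sketch assumes a compression
- scheme that benefits from long runs, which the talk does not specify;
- the sample rows and the name count_runs are hypothetical.
-

```python
from itertools import groupby


def count_runs(row):
    """Number of maximal runs of identical pixel values in one scan line."""
    return sum(1 for _ in groupby(row))


# A uniform gray area thresholded to all white versus the same area
# dithered into an alternating pattern (invented 16-pixel scan lines).
thresholded_row = [0] * 16
dithered_row = [0, 1] * 8

print("thresholded runs:", count_runs(thresholded_row))
print("dithered runs:   ", count_runs(dithered_row))
```

-
- The more runs a scan line contains, the less a run-based coder can
- save, which is consistent with the trade-off AM accepted.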
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Relative use as a criterion for POB's selection of books to
- be converted into digital form *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the discussion period, WATERS noted that one of the criteria for
- selecting books among the 10,000 to be converted into digital image form
- would be how much relative use they would receive--a subject still
- requiring evaluation. The challenge will be to understand whether
- coherent bodies of material will increase usage or whether POB should
- seek material that is being used, scan that, and make it more accessible.
- POB might decide to digitize materials that are already heavily used, in
- order to make them more accessible and decrease wear on them. Another
- approach would be to provide a large body of intellectually coherent
- material that may be used more in digital form than it is currently used
- in microfilm. POB would seek material that was out of copyright.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- BARONAS * Origin and scope of AIIM * Types of documents produced in
- AIIM's standards program * Domain of AIIM's standardization work * AIIM's
- structure * TC 171 and MS23 * Electronic image management standards *
- Categories of EIM standardization where AIIM standards are being
- developed *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Jean BARONAS, senior manager, Department of Standards and Technology,
- Association for Information and Image Management (AIIM), described the
- not-for-profit association and the national and international programs
- for standardization in which AIIM is active.
-
- Accredited for twenty-five years as the nation's standards development
- organization for document image management, AIIM began life in a library
- community developing microfilm standards. Today the association
- maintains both its library and business-image management standardization
- activities--and has moved into electronic image-management
- standardization (EIM).
-
- BARONAS defined the program's scope. AIIM deals with: 1) the
- terminology of standards and of the technology it uses; 2) methods of
- measurement for the systems, as well as quality; 3) methodologies for
- users to evaluate and measure quality; 4) the features of apparatus used
- to manage and edit images; and 5) the procedures used to manage images.
-
- BARONAS noted that three types of documents are produced in the AIIM
- standards program: the first two, accredited by the American National
- Standards Institute (ANSI), are standards and standard recommended
- practices. Recommended practices differ from standards in that they
- contain more tutorial information. A technical report is not an ANSI
- standard. Because AIIM's policies and procedures for developing
- standards are approved by ANSI, its standards are labeled ANSI/AIIM,
- followed by the number and title of the standard.
-
- BARONAS then illustrated the domain of AIIM's standardization work. For
- example, AIIM is the administrator of the U.S. Technical Advisory Group
- (TAG) to the International Standards Organization's (ISO) technical
- committee, TC 171 Micrographics and Optical Memories for Document and
- Image Recording, Storage, and Use. AIIM officially works through ANSI in
- the international standardization process.
-
- BARONAS described AIIM's structure, including its board of directors, its
- standards board of twelve individuals active in the image-management
- industry, its strategic planning and legal admissibility task forces, and
- its National Standards Council, which is comprised of the members of a
- number of organizations who vote on every AIIM standard before it is
- published. BARONAS pointed out that AIIM's liaisons deal with numerous
- other standards developers, including the optical disk community, office
- and publishing systems, image-codes-and-character set committees, and the
- National Information Standards Organization (NISO).
-
- BARONAS illustrated the procedures of TC 171, which covers all aspects of
- image management. When AIIM's national program has conceptualized a new
- project, it is usually submitted to the international level, so that the
- member countries of TC 171 can simultaneously work on the development of
- the standard or the technical report. BARONAS also illustrated a classic
- microfilm standard, MS23, which deals with numerous imaging concepts that
- apply to electronic imaging. Originally developed in the 1970s, revised
- in the 1980s, and revised again in 1991, this standard is scheduled for
- another revision. MS23 is an active standard whereby users may propose
- new density ranges and new methods of evaluating film images in the
- standard's revision.
-
- BARONAS detailed several electronic image-management standards, for
- instance, ANSI/AIIM MS44, a quality-control guideline for scanning 8.5"
- by 11" black-and-white office documents. This standard is used with the
- IEEE fax image--a continuous tone photographic image with gray scales,
- text, and several continuous tone pictures--and AIIM test target number
- 2, a representative document used in office document management.
-
- BARONAS next outlined the four categories of EIM standardization in which
- AIIM standards are being developed: transfer and retrieval, evaluation,
- optical disc and document scanning applications, and design and
- conversion of documents. She detailed several of the main projects of
- each: 1) in the category of image transfer and retrieval, a bi-level
- image transfer format, ANSI/AIIM MS53, which is a proposed standard that
- describes a file header for image transfer between unlike systems when
- the images are compressed using G3 and G4 compression; 2) the category of
- image evaluation, which includes the AIIM-proposed TR26 tutorial on image
- resolution (this technical report will treat the differences and
- similarities between classical or photographic and electronic imaging);
- 3) design and conversion, which includes a proposed technical report
- called "Forms Design Optimization for EIM" (this report considers how
- general-purpose business forms can be best designed so that scanning is
- optimized; reprographic characteristics such as type, rules, background,
- tint, and color will likewise be treated in the technical report); 4)
- disk and document scanning applications, which includes projects a) on
- planning platters and disk management, b) on generating an application
- profile for EIM when images are stored and distributed on CD-ROM, and c)
- on evaluating SCSI2, and how a common command set can be generated for
- SCSI2 so that document scanners are more easily integrated. (ANSI/AIIM
- MS53 will also apply to compressed images.)
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- BATTIN * The implications of standards for preservation * A major
- obstacle to successful cooperation * A hindrance to access in the digital
- environment * Standards a double-edged sword for those concerned with the
- preservation of the human record * Near-term prognosis for reliable
- archival standards * Preservation concerns for electronic media * Need
- for reconceptualizing our preservation principles * Standards in the real
- world and the politics of reproduction * Need to redefine the concept of
- archival and to begin to think in terms of life cycles * Cooperation and
- the La Guardia Eight * Concerns generated by discussions on the problems
- of preserving text and image * General principles to be adopted in a
- world without standards *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Patricia BATTIN, president, the Commission on Preservation and Access
- (CPA), addressed the implications of standards for preservation. She
- listed several areas where the library profession and the analog world of
- the printed book had made enormous contributions over the past hundred
- years--for example, in bibliographic formats, binding standards, and, most
- important, in determining what constitutes longevity or archival quality.
-
- Although standards have lightened the preservation burden through the
- development of national and international collaborative programs,
- nevertheless, a pervasive mistrust of other people's standards remains a
- major obstacle to successful cooperation, BATTIN said.
-
- The zeal to achieve perfection, regardless of the cost, has hindered
- rather than facilitated access in some instances, and in the digital
- environment, where no real standards exist, has brought an ironically
- just reward.
-
- BATTIN argued that standards are a double-edged sword for those concerned
- with the preservation of the human record, that is, the provision of
- access to recorded knowledge in a multitude of media as far into the
- future as possible. Standards are essential to facilitate
- interconnectivity and access, but, BATTIN said, as LYNCH pointed out
- yesterday, if set too soon they can hinder creativity, expansion of
- capability, and the broadening of access. The characteristics of
- standards for digital imagery differ radically from those for analog
- imagery. And the nature of digital technology implies continuing
- volatility and change. To reiterate, precipitous standard-setting can
- inhibit creativity, but delayed standard-setting results in chaos.
-
- Since in BATTIN's opinion the near-term prognosis for reliable archival
- standards, as defined by librarians in the analog world, is poor, two
- alternatives remain: standing pat with the old technology, or
- reconceptualizing.
-
- Preservation concerns for electronic media fall into two general domains.
- One is the continuing assurance of access to knowledge originally
- generated, stored, disseminated, and used in electronic form. This
- domain contains several subdivisions, including 1) the closed,
- proprietary systems discussed the previous day, bundled information such
- as electronic journals and government agency records, and electronically
- produced or captured raw data; and 2) the application of digital
- technologies to the reformatting of materials originally published on a
- deteriorating analog medium such as acid paper or videotape.
-
- The preservation of electronic media requires a reconceptualizing of our
- preservation principles during a volatile, standardless transition which
- may last far longer than any of us envision today. BATTIN urged the
- necessity of shifting focus from assessing, measuring, and setting
- standards for the permanence of the medium to the concept of managing
- continuing access to information stored on a variety of media and
- requiring a variety of ever-changing hardware and software for access--a
- fundamental shift for the library profession.
-
- BATTIN offered a primer on how to move forward with reasonable confidence
- in a world without standards. Her comments fell roughly into two sections:
- 1) standards in the real world and 2) the politics of reproduction.
-
- In regard to real-world standards, BATTIN argued the need to redefine the
- concept of archive and to begin to think in terms of life cycles. In
- the past, the naive assumption that paper would last forever produced a
- cavalier attitude toward life cycles. The transient nature of the
- electronic media has compelled people to recognize and accept upfront the
- concept of life cycles in place of permanency.
-
- Digital standards have to be developed and set in a cooperative context
- to ensure efficient exchange of information. Moreover, during this
- transition period, greater flexibility concerning how concepts such as
- backup copies and archival copies in the CXP are defined is necessary,
- or the opportunity to move forward will be lost.
-
- In terms of cooperation, particularly in the university setting, BATTIN
- also argued the need to avoid going off in a hundred different
- directions. The CPA has catalyzed a small group of universities called
- the La Guardia Eight--because La Guardia Airport is where meetings take
- place--Harvard, Yale, Cornell, Princeton, Penn State, Tennessee,
- Stanford, and USC, to develop a digital preservation consortium to look
- at all these issues and develop de facto standards as we move along,
- instead of waiting for something that is officially blessed. Continuing
- to apply analog values and definitions of standards to the digital
- environment, BATTIN said, will effectively lead to forfeiture of the
- benefits of digital technology to research and scholarship.
-
- Under the second rubric, the politics of reproduction, BATTIN reiterated
- an oft-made argument concerning the electronic library, namely, that it
- is more difficult to transform than to create, and nowhere is that belief
- expressed more dramatically than in the conversion of brittle books to
- new media. Preserving information published in electronic media involves
- making sure the information remains accessible and that digital
- information is not lost through reproduction. In the analog world of
- photocopies and microfilm, the issue of fidelity to the original becomes
- paramount, as do issues of "Whose fidelity?" and "Whose original?"
-
- BATTIN elaborated these arguments with a few examples from a recent study
- conducted by the CPA on the problems of preserving text and image.
- Discussions with scholars, librarians, and curators in a variety of
- disciplines dependent on text and image generated a variety of concerns,
- for example: 1) Copy what is, not what the technology is capable of.
- This is very important for the history of ideas. Scholars wish to know
- what the author saw and worked from. And make available at the
- workstation the opportunity to erase all the defects and enhance the
- presentation. 2) The fidelity of reproduction--what is good enough, what
- can we afford, and the difference it makes--issues of subjective versus
- objective resolution. 3) The differences between primary and secondary
- users. Restricting the definition of primary user to the one in whose
- discipline the material has been published runs one headlong into the
- reality that these printed books have had a host of other users from a
- host of other disciplines, who not only were looking for very different
- things, but who also shared values very different from those of the
- primary user. 4) The relationship of the standard of reproduction to new
- capabilities of scholarship--the browsing standard versus an archival
- standard. How good must the archival standard be? Can a distinction be
- drawn between potential users in setting standards for reproduction?
- Archival storage, use copies, browsing copies--ought an attempt to set
- standards even be made? 5) Finally, costs. How much are we prepared to
- pay to capture absolute fidelity? What are the trade-offs between vastly
- enhanced access, degrees of fidelity, and costs?
-
- These standards, BATTIN concluded, serve to complicate further the
- reproduction process, and add to the long list of technical standards
- that are necessary to ensure widespread access. Ways to articulate and
- analyze the costs that are attached to the different levels of standards
- must be found.
-
- Given the chaos concerning standards, which promises to linger for the
- foreseeable future, BATTIN urged adoption of the following general
- principles:
-
- * Strive to understand the changing information requirements of
- scholarly disciplines as more and more technology is integrated into
- the process of research and scholarly communication in order to meet
- future scholarly needs, not to build for the past. Capture
- deteriorating information at the highest affordable resolution, even
- though the dissemination and display technologies will lag.
-
- * Develop cooperative mechanisms to foster agreement on protocols
- for document structure and other interchange mechanisms necessary
- for widespread dissemination and use before official standards are
- set.
-
- * Accept that, in a transition period, de facto standards will have
- to be developed.
-
- * Capture information in a way that keeps all options open and
- provides for total convertibility: OCR, scanning of microfilm,
- producing microfilm from scanned documents, etc.
-
- * Work closely with the generators of information and the builders
- of networks and databases to ensure that continuing accessibility is
- a primary concern from the beginning.
-
- * Piggyback on standards under development for the broad market, and
- avoid library-specific standards; work with the vendors, in order to
- take advantage of that which is being standardized for the rest of
- the world.
-
- * Concentrate efforts on managing permanence in the digital world,
- rather than perfecting the longevity of a particular medium.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Additional comments on TIFF *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the brief discussion period that followed BATTIN's presentation,
- BARONAS explained that TIFF was not developed in collaboration with or
- under the auspices of AIIM. TIFF is a company product, not a standard,
- is owned by two corporations, and is always changing. BARONAS also
- observed that ANSI/AIIM MS53, a bi-level image file transfer format that
- allows unlike systems to exchange images, is compatible with TIFF as well
- as with DEC's architecture and IBM's MODCA/IOCA.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- HOOTON * Several questions to be considered in discussing text conversion
- *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- HOOTON introduced the final topic, text conversion, by noting that it is
- becoming an increasingly important part of the imaging business. Many
- people now realize that having more and more character data as part of
- their imaging system enhances that system. Re the issue of
- OCR versus rekeying, HOOTON posed several questions: How does one get
- text into computer-readable form? Does one use automated processes?
- Does one attempt to eliminate the use of operators where possible?
- Standards for accuracy, he said, are extremely important: it makes a
- major difference in cost and time whether one sets as a standard 98.5
- percent acceptance or 99.5 percent. He mentioned outsourcing as a
- possibility for converting text. Finally, what one does with the image
- to prepare it for the recognition process is also important, he said,
- because such preparation changes how recognition is viewed, as well as
- facilitates recognition itself.
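-
- The practical gap between the two accuracy standards is easier to see
- as errors per page. The sketch below rests on assumptions not made in
- the talk: that the percentages are per-character accuracy rates and
- that a page holds about 2,000 characters. The function name is
- hypothetical.
-

```python
def expected_errors(chars_per_page, accuracy_percent):
    """Expected number of wrong characters on a page at a given accuracy."""
    return chars_per_page * (100 - accuracy_percent) / 100


CHARS_PER_PAGE = 2000  # assumed page length, for illustration only

for accuracy in (98.5, 99.5):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy}% accuracy: about {errors:.0f} errors per page")
```

-
- Under these assumptions the stricter standard leaves a third as many
- errors per page, which suggests why the choice bears so heavily on
- cost and time.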
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- LESK * Roles of participants in CORE * Data flow * The scanning process *
- The image interface * Results of experiments involving the use of
- electronic resources and traditional paper copies * Testing the issue of
- serendipity * Conclusions *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Michael LESK, executive director, Computer Science Research, Bell
- Communications Research, Inc. (Bellcore), discussed the Chemical Online
- Retrieval Experiment (CORE), a cooperative project involving Cornell
- University, OCLC, Bellcore, and the American Chemical Society (ACS).
-
- LESK spoke on 1) how the scanning was performed, including the unusual
- feature of page segmentation, and 2) the use made of the text and the
- image in experiments.
-
- Working with the chemistry journals (because ACS has been saving its
- typesetting tapes since the mid-1970s and thus has a significant back-run
- of the most important chemistry journals in the United States), CORE is
- attempting to create an automated chemical library. Approximately a
- quarter of the pages by square inch are made up of images of
- quasi-pictorial material; dealing with the graphic components of the
- pages is extremely important. LESK described the roles of participants
- in CORE: 1) ACS provides copyright permission, journals on paper,
- journals on microfilm, and some of the definitions of the files; 2) at
- Bellcore, LESK chiefly performs the data preparation, while Dennis Egan
- performs experiments on the users of chemical abstracts, and supplies the
- indexing and numerous magnetic tapes; 3) Cornell provides the site of the
- experiment; 4) OCLC develops retrieval software and other user interfaces.
- Various manufacturers and publishers have furnished other help.
-
- Concerning data flow, Bellcore receives microfilm and paper from ACS; the
- microfilm is scanned by outside vendors, while the paper is scanned
- in-house on an Improvision scanner, twenty pages per minute at 300 dpi,
- which provides sufficient quality for all practical uses. LESK would
- prefer to have more gray level, because one of the ACS journals prints on
- some colored pages, which creates a problem.
-
- Bellcore performs all this scanning, creates a page-image file, and also
- selects from the pages the graphics, to mix with the text file (which is
- discussed later in the Workshop). The user is always searching the ASCII
- file, but she or he may see a display based on the ASCII or a display
- based on the images.
-
- LESK illustrated how the program performs page analysis, and the image
- interface. (The user types several words, is presented with a list--
- usually of the titles of articles contained in an issue--that derives
- from the ASCII, clicks on an icon and receives an image that mirrors an
- ACS page.) LESK also illustrated an alternative interface based on the
- ASCII text, the so-called SuperBook interface from Bellcore.
-
- LESK next presented the results of an experiment conducted by Dennis Egan
- and involving thirty-six students at Cornell, one third of them
- undergraduate chemistry majors, one third senior undergraduate chemistry
- majors, and one third graduate chemistry students. A third of them
- received the paper journals, the traditional paper copies and chemical
- abstracts on paper. A third received image displays of the pictures of
- the pages, and a third received the text display with pop-up graphics.
-
- The students were given several questions made up by some chemistry
- professors. The questions fell into five classes, ranging from very easy
- to very difficult, and included questions designed to simulate browsing
- as well as a traditional information retrieval-type task.
-
- LESK furnished the following results. In the straightforward question
- search--the question being, what is the phosphorus-oxygen bond distance
- in hydroxy phosphate?--the students were told that they could take
- fifteen minutes and, then, if they wished, give up. The students with
- paper took more than fifteen minutes on average, and yet most of them
- gave up. The students with either electronic format, text or image,
- received good scores in reasonable time, hardly ever had to give up, and
- usually found the right answer.
-
- In the browsing study, the students were given a list of eight topics,
- told to imagine that an issue of the Journal of the American Chemical
- Society had just appeared on their desks, and were also told to flip
- through it and to find topics mentioned in the issue. The average scores
- were about the same. (The students were told to answer yes or no about
- whether or not particular topics appeared.) The errors, however, were
- quite different. The students with paper rarely said that something
- appeared when it had not. But they often failed to find something
- actually mentioned in the issue. The computer people found numerous
- things, but they also frequently said that a topic was mentioned when it
- was not. (The reason, of course, was that they were performing word
- searches. They were finding that words were mentioned and they were
- concluding that they had accomplished their task.)
-
- This question also contained a trick to test the issue of serendipity.
- The students were given another list of eight topics and instructed,
- without taking a second look at the journal, to recall how many of this
- new list of eight topics were in this particular issue. This was an
- attempt to see if they performed better at remembering what they were not
- looking for. They all performed about the same, paper or electronics,
- about 62 percent accurate. In short, LESK said, people were not very
- good when it came to serendipity, but they were no worse at it with
- computers than they were with paper.
-
- (LESK gave a parenthetical illustration of the learning curve of students
- who used SuperBook.)
-
- The students using the electronic systems started off worse than the ones
- using print, but by the third of the three sessions in the series had
- caught up to print. As one might expect, electronics provide a much
- better means of finding what one wants to read; reading speeds, once the
- object of the search has been found, are about the same.
-
- Almost none of the students could perform the hard task--the analogous
- transformation. (It would require the expertise of organic chemists to
- complete.) But an interesting result was that the students using the text
- search performed terribly, while those using the image system did best.
- That the text search system is driven by text offers the explanation.
- Everything is focused on the text; to see the pictures, one must press
- on an icon. Many students found the right article containing the answer
- to the question, but they did not click on the icon to bring up the right
- figure and see it. They did not know that they had found the right place,
- and thus got it wrong.
-
- The short answer demonstrated by this experiment was that in the event
- one does not know what to read, one needs the electronic systems; the
- electronic systems hold no advantage at the moment if one knows what to
- read, but neither do they impose a penalty.
-
- LESK concluded by commenting that, on one hand, the image system was easy
- to use. On the other hand, the text display system, which represented
- twenty man-years of work in programming and polishing, was not winning,
- because the text was not being read, just searched. The much easier
- system is highly competitive as well as remarkably effective for the
- actual chemists.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ERWAY * Most challenging aspect of working on AM * Assumptions guiding
- AM's approach * Testing different types of service bureaus * AM's
- requirement for 99.95 percent accuracy * Requirements for text-coding *
- Additional factors influencing AM's approach to coding * Results of AM's
- experience with rekeying * Other problems in dealing with service bureaus
- * Quality control the most time-consuming aspect of contracting out
- conversion * Long-term outlook uncertain *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- To Ricky ERWAY, associate coordinator, American Memory, Library of
- Congress, the constant variety of conversion projects taking place
- simultaneously represented perhaps the most challenging aspect of working
- on AM. Thus, the challenge was not to find a solution for text
- conversion but a tool kit of solutions to apply to LC's varied
- collections that need to be converted. ERWAY limited her remarks to the
- process of converting text to machine-readable form, and the variety of
- LC's text collections, for example, bound volumes, microfilm, and
- handwritten manuscripts.
-
- Two assumptions have guided AM's approach, ERWAY said: 1) A desire not
- to perform the conversion in-house. Because of the variety of formats and
- types of texts, to capitalize the equipment and have the talents and
- skills to operate them at LC would be extremely expensive. Further, the
- natural inclination to upgrade to newer and better equipment each year
- made it reasonable for AM to focus on what it did best and seek external
- conversion services. Using service bureaus also allowed AM to have
- several types of operations take place at the same time. 2) AM was not a
- technology project, but an effort to improve access to library
- collections. Hence, whether text was converted using OCR or rekeying
- mattered little to AM. What mattered were cost and accuracy of results.
-
- AM considered different types of service bureaus and selected three to
- perform several small tests in order to acquire a sense of the field.
- The sample collections with which they worked included handwritten
- correspondence, typewritten manuscripts from the 1940s, and
- eighteenth-century printed broadsides on microfilm. On none of these
- samples was OCR performed; they were all rekeyed. AM had several special
- requirements for the three service bureaus it had engaged. For instance,
- any errors in the original text were to be retained. Working from bound
- volumes or anything that could not be sheet-fed also constituted a factor
- eliminating companies that would have performed OCR.
-
- AM requires 99.95 percent accuracy, which, though it sounds high, often
- means one or two errors per page. The initial batch of test samples
- contained several handwritten materials for which AM did not require
- text-coding. The results, ERWAY reported, were in all cases fairly
- comparable: for the most part, all three service bureaus achieved 99.95
- percent accuracy. AM was satisfied with the work but surprised at the cost.
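-
- The arithmetic behind "one or two errors per page" can be sketched as
- follows. This is only an illustration: it assumes an average page of
- 2,700 characters (the figure ERWAY gives for AM's pages in the
- discussion) and treats accuracy as a per-character rate. The function
- name is hypothetical.
-
```python
def expected_errors_per_page(accuracy_percent, chars_per_page=2700):
    """Expected number of wrong characters on one page.

    chars_per_page defaults to 2,700, an assumed average page size.
    """
    error_rate = 1.0 - accuracy_percent / 100.0
    return error_rate * chars_per_page


# Compare the accuracy standards mentioned by HOOTON and ERWAY.
for accuracy in (98.5, 99.5, 99.95, 99.99):
    errors = expected_errors_per_page(accuracy)
    print(f"{accuracy} percent accuracy: about {errors:.2f} errors per page")
```
-
- Under these assumptions, 99.95 percent accuracy works out to about 1.35
- errors per page, while 98.5 percent would allow about 40.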
-
- As AM began converting whole collections, it retained the requirement for
- 99.95 percent accuracy and added requirements for text-coding. AM needed
- to begin performing work more than three years ago before LC requirements
- for SGML applications had been established. Since AM's goal was simply
- to retain any of the intellectual content represented by the formatting
- of the document (which would be lost if one performed a straight ASCII
- conversion), AM used "SGML-like" codes. These codes resembled SGML tags
- but were used without the benefit of document-type definitions. AM found
- that many service bureaus were not yet SGML-proficient.
-
- Additional factors influencing the approach AM took with respect to
- coding included: 1) the inability of any known microcomputer-based
- user-retrieval software to take advantage of SGML coding; and 2) the
- multiple inconsistencies in format of the older documents, which
- confirmed AM in its desire not to attempt to force the different formats
- to conform to a single document-type definition (DTD) and thus create the
- need for a separate DTD for each document.
-
- The five text collections that AM has converted or is in the process of
- converting include a collection of eighteenth-century broadsides, a
- collection of pamphlets, two typescript document collections, and a
- collection of 150 books.
-
- ERWAY next reviewed the results of AM's experience with rekeying, noting
- again that because the bulk of AM's materials are historical, the quality
- of the text often does not lend itself to OCR. While non-English
- speakers are less likely to guess or elaborate or correct typos in the
- original text, they are also less able to infer what we would; they also
- are nearly incapable of converting handwritten text. Another
- disadvantage of working with overseas keyers is that they are much less
- likely to telephone with questions, especially on the coding, with the
- result that they develop their own rules as they encounter new
- situations.
-
- Government contracting procedures and time frames posed a major challenge
- to performing the conversion. Many service bureaus are not accustomed to
- retaining the image, even if they perform OCR. Thus, questions of image
- format and storage media were somewhat novel to many of them. ERWAY also
- remarked other problems in dealing with service bureaus, for example,
- their inability to perform text conversion from the kind of microfilm
- that LC uses for preservation purposes.
-
- But quality control, in ERWAY's experience, was the most time-consuming
- aspect of contracting out conversion. AM has been attempting to perform
- a 10-percent quality review, looking at either every tenth document or
- every tenth page to make certain that the service bureaus are maintaining
- 99.95 percent accuracy. But even if they are complying with the
- requirement for accuracy, finding errors produces a desire to correct
- them and, in turn, to clean up the whole collection, which defeats the
- purpose to some extent. Even a double entry requires a
- character-by-character comparison to the original to meet the accuracy
- requirement. LC is not accustomed to publishing imperfect texts, which
- makes attempting to deal with the industry standard an emotionally
- fraught issue for AM. As was mentioned in the previous day's discussion,
- going from 99.95 to 99.99 percent accuracy usually doubles costs and
- means a third keying or another complete run-through of the text.
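-
- A 10-percent review of the kind described above might be sketched as
- follows. This is a minimal illustration, not AM's actual procedure: it
- assumes that a verified transcription of each sampled page is available
- for comparison, and the function names are hypothetical. The comparison
- uses Python's standard difflib module to align the two texts.
-
```python
from difflib import SequenceMatcher


def char_accuracy(reference, keyed):
    """Share of the reference characters reproduced in the keyed text."""
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, keyed, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def review_sample(pages, step=10):
    """Estimate accuracy from every step-th page.

    pages is a list of (reference_text, keyed_text) pairs; the result is
    the character-weighted accuracy over the sampled pages only.
    """
    sample = pages[::step]
    total = sum(len(reference) for reference, _ in sample)
    if total == 0:
        return 1.0
    correct = sum(char_accuracy(reference, keyed) * len(reference)
                  for reference, keyed in sample)
    return correct / total


def meets_requirement(pages, required=0.9995, step=10):
    """True if the sampled accuracy reaches the required level."""
    return review_sample(pages, step) >= required
```
-
- Because only a tenth of the pages are examined, the result is an
- estimate; errors on unsampled pages go unseen.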
-
- Although AM has learned much from its experiences with various collections
- and various service bureaus, ERWAY concluded pessimistically that no
- breakthrough has been achieved. Incremental improvements have occurred
- in some of the OCR technology, some of the processes, and some of the
- standards acceptances, which, though they may lead to somewhat lower costs,
- do not offer much encouragement to many people who are anxiously awaiting
- the day that the entire contents of LC are available on-line.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ZIDAR * Several answers to why one attempts to perform full-text
- conversion * Per page cost of performing OCR * Typical problems
- encountered during editing * Editing poor copy OCR vs. rekeying *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program
- (NATDP), National Agricultural Library (NAL), offered several answers to
- the question of why one attempts to perform full-text conversion: 1)
- Text in an image can be read by a human but not by a computer, so of
- course it is not searchable and there is not much one can do with it. 2)
- Some material simply requires word-level access. For instance, the legal
- profession insists on full-text access to its material; with taxonomic or
- geographic material, which entails numerous names, one virtually requires
- word-level access. 3) Full text permits rapid browsing and searching,
- something that cannot be achieved in an image with today's technology.
- 4) Text stored as ASCII and delivered in ASCII is standardized and highly
- portable. 5) People just want full-text searching, even those who do not
- know how to do it. NAL, for the most part, is performing OCR at an
- actual cost per average-size page of approximately $7. NAL scans the
- page to create the electronic image and passes it through the OCR device.
-
- ZIDAR next rehearsed several typical problems encountered during editing.
- Praising the celerity of her student workers, ZIDAR observed that editing
- requires approximately five to ten minutes per page, assuming that there
- are no large tables to audit. Confusion among the three characters I, 1,
- and l, constitutes perhaps the most common problem encountered. Zeroes
- and O's also are frequently confused. Double M's create a particular
- problem, even on clean pages. They are so wide in most fonts that they
- touch, and the system simply cannot tell where one letter ends and the
- other begins. Complex page formats occasionally fail to columnate
- properly, which entails rescanning as though one were working with a
- single column, entering the ASCII, and decolumnating for better
- searching. With proportionally spaced text, OCR can have difficulty
- distinguishing the spaces between letters from the spaces between words,
- and therefore will merge text or break up words where it should not.
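-
- One way to make searching tolerant of the character confusions ZIDAR
- describes (I, 1, and l; zero and O) is to fold each group of confusable
- characters into a single representative before comparing. The sketch
- below is an illustration only, not NAL's practice; the folding table
- and function names are hypothetical, and it does nothing about touching
- letters or spacing errors.
-
```python
# Hypothetical folding table: every member of a confusable group is
# mapped to one representative character.
CONFUSABLE = str.maketrans({"I": "1", "l": "1", "O": "0"})


def fold(text):
    """Replace OCR-confusable characters with their group representative."""
    return text.translate(CONFUSABLE)


def loose_match(query, ocr_text):
    """True if the query occurs in the OCR text once both are folded."""
    return fold(query) in fold(ocr_text)


# The OCR output below has 1 for I, l for 1, and O for 0.
print(loose_match("Illinois 1910", "1llinois l9lO"))
```
-
- The price of this tolerance is that folding can also produce false
- matches, so it suits searching unedited text rather than correcting it.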
-
- ZIDAR said that it can often take longer to edit a poor-copy OCR than to
- key it from scratch. NAL has also experimented with partial editing of
- text, whereby project workers go into and clean up the format, removing
- stray characters but not running a spell-check. NAL corrects typos in
- the title and authors' names, which provides a foothold for searching and
- browsing. Even extremely poor-quality OCR (e.g., 60-percent accuracy)
- can still be searched, because numerous words are correct, while the
- important words are probably repeated often enough that they are likely
- to be found correct somewhere. Librarians, however, cannot tolerate this
- situation, though end users seem more willing to use this text for
- searching, provided that NAL indicates that it is unedited. ZIDAR
- concluded that rekeying of text may be the best route to take, in spite
- of numerous problems with quality control and cost.
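-
- Why repetition rescues poor OCR for searching can be illustrated with a
- short calculation. The sketch assumes, purely for illustration, that
- the 60-percent figure applies to whole words and that errors in
- separate occurrences are independent; neither assumption comes from
- ZIDAR's remarks, and the function name is hypothetical.
-
```python
def chance_of_one_clean_occurrence(word_accuracy, occurrences):
    """Probability that at least one occurrence of a word is recognized
    correctly, assuming independent errors."""
    return 1.0 - (1.0 - word_accuracy) ** occurrences


for count in (1, 2, 3, 5):
    chance = chance_of_one_clean_occurrence(0.60, count)
    print(f"{count} occurrence(s): {chance:.3f}")
```
-
- Under these assumptions a word that appears three times has roughly a
- 94 percent chance of being found at least once.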
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Modifying an image before performing OCR * NAL's costs per
- page * AM's costs per page and experience with Federal Prison Industries *
- Elements comprising NATDP's costs per page * OCR and structured markup *
- Distinction between the structure of a document and its representation
- when put on the screen or printed *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- HOOTON prefaced the lengthy discussion that followed with several
- comments about modifying an image before one reaches the point of
- performing OCR. For example, in regard to an application containing a
- significant amount of redundant data, such as form-type data, numerous
- companies today are working on various kinds of form renewal, prior to
- going through a recognition process, by using dropout colors. Thus,
- acquiring access to form design or using electronic means is worth
- considering. HOOTON also noted that conversion usually makes or breaks
- one's imaging system. It is extremely important, extremely costly in
- terms of either capital investment or service, and determines the quality
- of the remainder of one's system, because it determines the character of
- the raw material used by the system.
-
- Concerning the four projects undertaken by NAL, two inside and two
- performed by outside contractors, ZIDAR revealed that an in-house service
- bureau executed the first at a cost between $8 and $10 per page for
- everything, including building of the database. The project undertaken
- by the Consultative Group on International Agricultural Research (CGIAR)
- cost approximately $10 per page for the conversion, plus some expenses
- for the software and building of the database. The Acid Rain Project--a
- two-disk set produced by the University of Vermont, consisting of
- Canadian publications on acid rain--cost $6.70 per page for everything,
- including keying of the text, which was double keyed, scanning of the
- images, and building of the database. The in-house project offered
- considerable convenience and greater control of the process. On
- the other hand, the service bureaus know their job and perform it
- expeditiously, because they have more people.
-
- As a useful comparison, ERWAY revealed AM's costs as follows: $0.75 to
- $0.85 per thousand characters, with an average page
- containing 2,700 characters. Requirements for coding and imaging
- increase the costs. Thus, conversion of the text, including the coding,
- costs approximately $3 per page. (This figure does not include the
- imaging and database-building included in the NAL costs.) AM also
- enjoyed a happy experience with Federal Prison Industries, which
- precluded the necessity of going through the request-for-proposal process
- to award a contract, because it is another government agency. The
- prisoners performed AM's rekeying just as well as other service bureaus
- and proved handy as well. AM shipped them the books, which they would
- photocopy on a book-edge scanner. They would perform the markup on
- photocopies, return the books as soon as they were done with them,
- perform the keying, and return the material to AM on WORM disks.
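-
- The per-page figures follow from the per-character rates by simple
- arithmetic, sketched here. The calculation covers only the base keying
- rate; the additional cost of coding that brings the total to
- approximately $3 per page is not itemized in ERWAY's remarks, so it is
- not modeled. The function name is hypothetical.
-
```python
def keying_cost_per_page(rate_per_thousand_chars, chars_per_page=2700):
    """Base keying cost in dollars for one page of the given size."""
    return rate_per_thousand_chars * chars_per_page / 1000.0


low = keying_cost_per_page(0.75)
high = keying_cost_per_page(0.85)
print(f"Base keying cost per page: between ${low:.3f} and ${high:.3f}")
```
-
- At 2,700 characters per page the base rate comes to a little over $2
- per page, which leaves somewhat less than $1 of the $3 figure for
- coding.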
-
- ZIDAR detailed the elements that constitute the previously noted cost of
- approximately $7 per page. Most significant are the editing, correction
- of errors, and spell-checking, which, though they may sound easy to
- perform, in fact require a great deal of time. Reformatting text also
- takes a while, but a significant portion of NAL's expenses is for
- equipment, which was extremely expensive when purchased because it was
- one of the few systems on the market. The costs of equipment are being
- amortized over five years but are still quite high, nearly $2,000 per
- month.
-
- HOCKEY raised a general question concerning OCR and the amount of editing
- required (substantial in her experience) to generate the kind of
- structured markup necessary for manipulating the text on the computer or
- loading it into any retrieval system. She wondered if the speakers could
- extend the previous question about the cost-benefit of adding or inserting
- structured markup. ERWAY noted that several OCR systems retain italics,
- bolding, and other spatial formatting. While the material may not be in
- the format desired, these systems possess the ability to remove the
- original materials quickly from the hands of the people performing the
- conversion, as well as to retain that information so that users can work
- with it. HOCKEY rejoined that the current thinking on markup is that one
- should not say that something is italic or bold so much as why it is that
- way. To be sure, one needs to know that something was italicized, but
- how can one get from one to the other? One can map from the structure to
- the typographic representation.
-
- FLEISCHHAUER suggested that, given the 100 million items the Library
- holds, it may not be possible for LC to do more than report that a thing
- was in italics as opposed to why it was in italics, although that may be
- desirable in some contexts. Promising to talk a bit during the afternoon
- session about several experiments OCLC performed on automatic recognition
- of document elements, and which they hoped to extend, WEIBEL said that in
- fact one can recognize the major elements of a document with a fairly
- high degree of reliability, at least as good as OCR. STEVENS drew a
- useful distinction between standard, generalized markup (i.e., defining
- for a document-type definition the structure of the document), and what
- he termed a style sheet, which had to do with italics, bolding, and other
- forms of emphasis. Thus, two different components are at work, one being
- the structure of the document itself (its logic), and the other being its
- representation when it is put on the screen or printed.
-
- ******
-
- SESSION V. APPROACHES TO PREPARING ELECTRONIC TEXTS
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- HOCKEY * Text in ASCII and the representation of electronic text versus
- an image * The need to look at ways of using markup to assist retrieval *
- The need for an encoding format that will be reusable and multifunctional
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Susan HOCKEY, director, Center for Electronic Texts in the Humanities
- (CETH), Rutgers and Princeton Universities, announced that one talk
- (WEIBEL's) was moved into this session from the morning and that David
- Packard was unable to attend. The session would attempt to focus more on
- what one can do with a text in ASCII and the representation of electronic
- text rather than just an image, what one can do with a computer that
- cannot be done with a book or an image. It would be argued that one can
- do much more than just read a text, and from that starting point one can
- use markup and methods of preparing the text to take full advantage of
- the capability of the computer. That would lead to a discussion of what
- the European Community calls REUSABILITY, what may better be termed
- DURABILITY, that is, how to prepare or make a text that will last a long
- time and that can be used for as many applications as possible, which
- would lead to issues of improving intellectual access.
-
- HOCKEY urged the need to look at ways of using markup to facilitate
- retrieval, not just for referencing or to help locate an item that is
- retrieved, but also to put markup tags in a text to help retrieve the
- thing sought, either with linguistic tagging or with interpretation.
- HOCKEY also argued that little advancement had occurred in
- the software tools currently available for retrieving and searching text.
- She pressed the desideratum of going beyond Boolean searches and performing
- more sophisticated searching, which the insertion of more markup in the text
- would facilitate. Thinking about electronic texts as opposed to images means
- considering material that will never appear in print form, or print will not
- be its primary form, that is, material which only appears in electronic form.
- HOCKEY alluded to the history and the need for markup and tagging and
- electronic text, which was developed through the use of computers in the
- humanities; as MICHELSON had observed, Father Busa had started in 1949
- to prepare the first-ever text on the computer.
-
- HOCKEY remarked several large projects, particularly in Europe, for the
- compilation of dictionaries, language studies, and language analysis, in
- which people have built up archives of text and have begun to recognize
- the need for an encoding format that will be reusable and multifunctional,
- that can be used not just to print the text, which may be assumed to be a
- byproduct of what one wants to do, but to structure it inside the computer
- so that it can be searched, built into a Hypertext system, etc.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- WEIBEL * OCLC's approach to preparing electronic text: retroconversion,
- keying of texts, more automated ways of developing data * Project ADAPT
- and the CORE Project * Intelligent character recognition does not exist *
- Advantages of SGML * Data should be free of procedural markup;
- descriptive markup strongly advocated * OCLC's interface illustrated *
- Storage requirements and costs for putting a lot of information on line *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Stuart WEIBEL, senior research scientist, Online Computer Library Center,
- Inc. (OCLC), described OCLC's approach to preparing electronic text. He
- argued that the electronic world into which we are moving must
- accommodate not only the future but the past as well, and to some degree
- even the present. Thus, starting out at one end with retroconversion and
- keying of texts, one would like to move toward much more automated ways
- of developing data.
-
- For example, Project ADAPT had to do with automatically converting
- document images into a structured document database with OCR text as
- indexing and also a little bit of automatic formatting and tagging of
- that text. The CORE project, hosted by Cornell University, Bellcore,
- OCLC, the American Chemical Society, and Chemical Abstracts, constitutes
- WEIBEL's principal concern at the moment. This project is an example of
- converting text for which one already has a machine-readable version into
- a format more suitable for electronic delivery and database searching.
- (Since Michael LESK had previously described CORE, WEIBEL would say
- little concerning it.) Borrowing a chemical phrase, de novo synthesis,
- WEIBEL cited the Online Journal of Current Clinical Trials as an example
- of de novo electronic publishing, that is, a form in which the primary
- form of the information is electronic.
-
- Project ADAPT, then, which OCLC completed a couple of years ago and in
- fact is about to resume, is a model in which one takes page images either
- on paper or microfilm and converts them automatically to a searchable
- electronic database, either on-line or local. The operating assumption
- is that accepting some blemishes in the data, especially for
- retroconversion of materials, will make it possible to accomplish more.
- Not enough money is available to support perfect conversion.
-
- WEIBEL related several steps taken to perform image preprocessing
- (processing on the image before performing optical character
- recognition), as well as image postprocessing. He denied the existence
- of intelligent character recognition and asserted that what is wanted is
- page recognition, which is a long way off. OCLC has experimented with
- merging of multiple optical character recognition systems that will
- reduce errors from an unacceptable rate of 5 characters out of every
- 1,000 to an unacceptable rate of 2 characters out of every 1,000, but it
- is not good enough. It will never be perfect.
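-
- WEIBEL did not describe how OCLC merges the output of several OCR
- systems. One common approach, shown here only as an assumption about
- what such merging could look like, is a majority vote at each character
- position. The sketch further assumes that the outputs have already been
- aligned to the same length, which in practice is the hard part; the
- function name is hypothetical.
-
```python
from collections import Counter


def vote(outputs):
    """Merge aligned OCR outputs by majority vote at each position.

    On a tie, the character from the earliest-listed output wins.
    """
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("outputs must be aligned to the same length")
    merged = []
    for column in zip(*outputs):
        winner, _count = Counter(column).most_common(1)[0]
        merged.append(winner)
    return "".join(merged)


# Each engine makes a different mistake; the vote recovers the word.
print(vote(["c0mmon", "common", "comm0n"]))
```
-
- Voting helps only where the engines err in different places; a mistake
- shared by a majority of them survives the merge, which is consistent
- with WEIBEL's point that the result will never be perfect.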
-
- Concerning the CORE Project, WEIBEL observed that Bellcore is taking the
- typography files, extracting the page images, and converting those
- typography files to SGML markup. LESK hands that data off to OCLC, which
- builds that data into a Newton database, the same system that underlies
- the on-line system in virtually all of the reference products at OCLC.
- The long-term goal is to make the systems interoperable so that not just
- Bellcore's system and OCLC's system can access this data, but other
- systems can as well, and the key to that is the Z39.50 common command
- language and the full-text extension. Z39.50 is fine for MARC records,
- but is not sufficient for full text (that is, to make full texts
- interoperable).
-
- WEIBEL next outlined the critical role of SGML for a variety of purposes,
- for example, as noted by HOCKEY, in the world of extremely large
- databases, using highly structured data to perform field searches.
- WEIBEL argued that by building the structure of the data in (i.e., the
- structure of the data originally on a printed page), it becomes easy to
- look at a journal article even if one cannot read the characters and know
- where the title or author is, or what the sections of that document would be.
- OCLC wants to make that structure explicit in the database, because it will
- be important for retrieval purposes.
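-
- The value of explicit structure for retrieval can be illustrated with a
- field search. The records and field names below are invented for the
- example and are not the tags used in the CORE data; the function names
- are likewise hypothetical.
-
```python
# Invented sample records with an explicit structure.
ARTICLES = [
    {"title": "Bond distances in phosphates",
     "author": "R. Jones",
     "body": "Our values agree with those reported earlier by Smith."},
    {"title": "A survey of zeolite catalysts",
     "author": "T. Smith",
     "body": "Catalyst structures are reviewed."},
]


def field_search(records, field, term):
    """Return records whose named field contains the term."""
    needle = term.lower()
    return [rec for rec in records if needle in rec.get(field, "").lower()]


def anywhere_search(records, term):
    """Return records containing the term in any field."""
    needle = term.lower()
    return [rec for rec in records
            if any(needle in value.lower() for value in rec.values())]


print(len(anywhere_search(ARTICLES, "smith")))
print([rec["title"] for rec in field_search(ARTICLES, "author", "smith")])
```
-
- Searching the whole text for a name returns every article that mentions
- it, whereas the field search returns only the article written by that
- author.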
-
- The second big advantage of SGML is that it gives one the ability to
- build structure into the database that can be used for display purposes
- without contaminating the data with instructions about how to format
- things. The distinction lies between procedural markup, which tells one
- where to put dots on the page, and descriptive markup, which describes
- the elements of a document.
-
- WEIBEL believes that there should be no procedural markup in the data at
- all, that the data should be completely unsullied by information about
- italics or boldness. That should be left up to the display device,
- whether that display device is a page printer or a screen display device.
- By keeping one's database free of that kind of contamination, one can
- make decisions down the road, for example, reorganize the data in ways
- that are not cramped by built-in notions of what should be italic and
- what should be bold. WEIBEL strongly advocated descriptive markup. As
- an example, he illustrated the index structure in the CORE data. With
- subsequent illustrated examples of markup, WEIBEL acknowledged the common
- complaint that SGML is hard to read in its native form, although markup
- decreases considerably once one gets into the body. Without the markup,
- however, one would not have the structure in the data. One can pass
- markup through a LaTeX processor and convert it relatively easily to a
- printed version of the document.
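-
- The separation WEIBEL advocated can be sketched in a few lines. This is
- an illustration under stated assumptions, not the CORE markup: XML parsed
- with Python's standard library stands in for SGML, and the tag names
- (article, title, author, para, term) and both style tables are
- hypothetical. The data says only what each element is; each display
- device supplies its own formatting decisions.
-
```python
import xml.etree.ElementTree as ET

# Descriptive markup: the tags name the elements and say nothing about
# italics, boldness, or position on the page. All names are hypothetical.
SOURCE = """<article>
<title>A Sample Article</title>
<author>An Author</author>
<para>Descriptive markup names a <term>registry number</term> as such.</para>
</article>"""

# Formatting decisions live with the display device, not in the data.
# Each entry maps an element name to (text before, text after).
SCREEN_STYLE = {
    "title": ("== ", " =="),
    "author": ("by ", ""),
    "term": ("*", "*"),
}
LATEX_STYLE = {
    "title": ("\\section*{", "}"),
    "author": ("\\textit{", "}"),
    "term": ("\\textbf{", "}"),
}


def render(element, style):
    """Recursively render one element using a device's style table."""
    before, after = style.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)


def render_document(xml_text, style):
    """Render each top-level element of the document on its own line."""
    root = ET.fromstring(xml_text)
    return "\n".join(render(child, style).strip() for child in root)


print(render_document(SOURCE, SCREEN_STYLE))
print(render_document(SOURCE, LATEX_STYLE))
```
-
- Changing how a term is displayed means editing one style table; the
- stored document is never touched.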
-
- WEIBEL next illustrated an extremely cluttered screen dump of OCLC's
- system, in order to show as much as possible the inherent capability on
- the screen. (He noted parenthetically that he had become a supporter of
- X-Windows as a result of the progress of the CORE Project.) WEIBEL also
- illustrated the two major parts of the interface: 1) a control box that
- allows one to generate lists of items, which resembles a small table of
- contents based on key words one wishes to search, and 2) a document
- viewer, which is a separate process in and of itself. He demonstrated
- how to follow links through the electronic database simply by selecting
- the appropriate button and bringing them up. He also noted problems that
- remain to be accommodated in the interface (e.g., as pointed out by LESK,
- what happens when users do not click on the icon for the figure).
-
- Given the constraints of time, WEIBEL omitted a large number of ancillary
- items in order to say a few words concerning storage requirements and
- what will be required to put a lot of things on line. Since it is
- extremely expensive to reconvert all of this data, especially if it is
- just in paper form (and even if it is in electronic form in typesetting
- tapes), he advocated building journals electronically from the start. In
- that case, if one only has text graphics and indexing (which is all that
- one needs with de novo electronic publishing, because there is no need to
- go back and look at bit-maps of pages), one can get 10,000 journals of
- full text, or almost 6 million pages per year. These pages can be put in
- approximately 135 gigabytes of storage, which is not all that much,
- WEIBEL said. For twenty years, something less than three terabytes would
- be required. WEIBEL calculated the costs of storing this information as
- follows: If a gigabyte costs approximately $1,000, then a terabyte costs
- approximately $1 million to buy in terms of hardware. One also needs a
- building to put it in and a staff like OCLC to handle that information.
- So, to support a terabyte, multiply by five, which gives $5 million per
- year for a supported terabyte of data.
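-
- WEIBEL's arithmetic can be retraced as follows. The sketch assumes
- decimal units (1 terabyte = 1,000 gigabytes) and rounds "almost 6 million
- pages" up to 6 million; the per-page figure is derived here and was not
- stated in the talk.
-
```python
# Figures reported in the talk.
PAGES_PER_YEAR = 6_000_000   # "almost 6 million pages per year" (rounded up)
GB_PER_YEAR = 135            # "approximately 135 gigabytes of storage"
YEARS = 20
DOLLARS_PER_GB = 1_000       # "a gigabyte costs approximately $1,000"
SUPPORT_MULTIPLIER = 5       # building and staff: "multiply by five"

# Assumption: decimal units.
GB_PER_TB = 1_000
KB_PER_GB = 1_000_000

# Derived quantities.
kb_per_page = GB_PER_YEAR * KB_PER_GB / PAGES_PER_YEAR
tb_for_twenty_years = GB_PER_YEAR * YEARS / GB_PER_TB
hardware_dollars_per_tb = DOLLARS_PER_GB * GB_PER_TB
supported_dollars_per_tb = hardware_dollars_per_tb * SUPPORT_MULTIPLIER

print(f"{kb_per_page:.1f} KB per page")                          # 22.5
print(f"{tb_for_twenty_years:.1f} TB for {YEARS} years")         # 2.7
print(f"${hardware_dollars_per_tb:,} per terabyte of hardware")  # $1,000,000
print(f"${supported_dollars_per_tb:,} per supported terabyte")   # $5,000,000
```
-
- The 2.7 terabytes for twenty years matches the "something less than three
- terabytes" quoted above.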
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Tapes saved by ACS are the typography files originally
- supporting publication of the journal * Cost of building tagged text into
- the database *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the question-and-answer period that followed WEIBEL's
- presentation, these clarifications emerged. The tapes saved by the
- American Chemical Society are the typography files that originally
- supported the publication of the journal. Although they are not tagged
- in SGML, they are tagged in very fine detail. Every single sentence is
- marked, all the registry numbers, all the publications issues, dates, and
- volumes. No cost figures on tagging material on a per-megabyte basis
- were available. Because ACS's typesetting system runs from tagged text,
- there is no extra cost per article. It was unknown what it costs ACS to
- keyboard the tagged text rather than just keyboard the text in the
- cheapest process. In other words, since one intends to publish things
- and will need to build tagged text into a typography system in any case,
- if one does that in such a way that it can drive not only typography but
- an electronic system (which is what ACS intends to do--move to SGML
- publishing), the marginal cost is zero. The marginal cost represents the
- cost of building tagged text into the database, which is small.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- SPERBERG-McQUEEN * Distinction between texts and computers * Implications
- of recognizing that all representation is encoding * Dealing with
- complicated representations of text entails the need for a grammar of
- documents * Variety of forms of formal grammars * Text as a bit-mapped
- image does not represent a serious attempt to represent text in
- electronic form * SGML, the TEI, document-type declarations, and the
- reusability and longevity of data * TEI conformance explicitly allows
- extension or modification of the TEI tag set * Administrative background
- of the TEI * Several design goals for the TEI tag set * An absolutely
- fixed requirement of the TEI Guidelines * Challenges the TEI has
- attempted to face * Good texts not beyond economic feasibility * The
- issue of reproducibility or processability * The issue of images as
- simulacra for the text redux * One's model of text determines what one's
- software can do with a text and has economic consequences *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor,
- Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew
- a distinction between texts and computers: Texts are abstract cultural
- and linguistic objects while computers are complicated physical devices,
- he said. Abstract objects cannot be placed inside physical devices; with
- computers one can only represent text and act upon those representations.
-
- The recognition that all representation is encoding, SPERBERG-McQUEEN
- argued, leads to the recognition of two things: 1) The topic description
- for this session is slightly misleading, because there can be no discussion
- of pros and cons of text-coding unless what one means is pros and cons of
- working with text with computers. 2) No text can be represented in a
- computer without some sort of encoding; images are one way of encoding text,
- ASCII is another, SGML yet another. There is no encoding without some
- information loss, that is, there is no perfect reproduction of a text that
- allows one to do away with the original. Thus, the question becomes,
- What is the most useful representation of text for a serious work?
- This depends on what kind of serious work one is talking about.
-
- The projects demonstrated the previous day all involved highly complex
- information and fairly complex manipulation of the textual material.
- In order to use that complicated information, one has to calculate it
- slowly or manually and store the result. It needs to be stored, therefore,
- as part of one's representation of the text. Thus, one needs to store the
- structure in the text. To deal with complicated representations of text,
- one needs somehow to control the complexity of the representation of a text;
- that means one needs a way of finding out whether a document and an
- electronic representation of a document is legal or not; and that
- means one needs a grammar of documents.
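-
- What a "grammar of documents" buys can be shown with a toy example. This
- is a sketch under stated assumptions, not an SGML document-type
- declaration: XML parsed with Python's standard library stands in for
- SGML, the element names are hypothetical, and each rule is simply a
- regular expression over the names of an element's children.
-
```python
import re
import xml.etree.ElementTree as ET

# A toy document grammar. For each element, a regular expression over the
# space-terminated names of its children. All names are hypothetical.
GRAMMAR = {
    "article": r"title (author )+(section )+",
    "section": r"heading (para )+",
    "title": r"",
    "author": r"",
    "heading": r"",
    "para": r"",
}


def violations(element, grammar=GRAMMAR):
    """Return a message for every place the document breaks the grammar."""
    if element.tag not in grammar:
        return [f"unknown element <{element.tag}>"]
    children = "".join(child.tag + " " for child in element)
    problems = []
    if not re.fullmatch(grammar[element.tag], children):
        found = children.strip() or "(nothing)"
        problems.append(f"<{element.tag}> may not contain: {found}")
    for child in element:
        problems.extend(violations(child, grammar))
    return problems


def is_legal(xml_text):
    """A document is legal when it breaks no rule of the grammar."""
    return not violations(ET.fromstring(xml_text))


GOOD = ("<article><title>T</title><author>A</author>"
        "<section><heading>H</heading><para>P</para></section></article>")
BAD = ("<article><author>A</author><title>T</title>"
       "<section><para>P</para></section></article>")

print(is_legal(GOOD))
for message in violations(ET.fromstring(BAD)):
    print(message)
```
-
- The second document is rejected twice: its title follows its author, and
- its section lacks a heading. Software that relies on the grammar can
- assume neither situation ever arises in a legal document.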
-
- SPERBERG-McQUEEN discussed the variety of forms of formal grammars,
- implicit and explicit, as applied to text, and their capabilities. He
- argued that these grammars correspond to different models of text that
- different developers have. For example, one implicit model of the text
- is that there is no internal structure, but just one thing after another,
- a few characters and then perhaps a start-title command, and then a few
- more characters and an end-title command. SPERBERG-McQUEEN also
- distinguished several kinds of text that have a sort of hierarchical
- structure that is not very well defined, which, typically, corresponds
- to grammars that are not very well defined, as well as hierarchies that
- are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely
- complicated things such as SGML, which handle strictly hierarchical data
- very nicely.
-
- SPERBERG-McQUEEN conceded that one other model not illustrated on his two
- displays was the model of text as a bit-mapped image, an image of a page,
- and confessed to having been converted to a limited extent by the
- Workshop to the view that electronic images constitute a promising,
- probably superior alternative to microfilming. But he was not convinced
- that electronic images represent a serious attempt to represent text in
- electronic form. Many of their problems stem from the fact that they are
- not direct attempts to represent the text but attempts to represent the
- page, thus making them representations of representations.
-
- In this situation of increasingly complicated textual information and the
- need to control that complexity in a useful way (which begs the question
- of the need for good textual grammars), one has the introduction of SGML.
- With SGML, one can develop specific document-type declarations
- for specific text types or, as with the TEI, attempt to generate
- general document-type declarations that can handle all sorts of text.
- The TEI is an attempt to develop formats for text representation that
- will ensure the kind of reusability and longevity of data discussed earlier.
- It offers a way to stay alive in the state of permanent technological
- revolution.
-
- It has been a continuing challenge in the TEI to create document grammars
- that do some work in controlling the complexity of the textual object but
- also allow one to represent the real text that one will find.
- Fundamental to the notion of the TEI is that TEI conformance allows one
- the ability to extend or modify the TEI tag set so that it fits the text
- that one is attempting to represent.
-
- SPERBERG-McQUEEN next outlined the administrative background of the TEI.
- The TEI is an international project to develop and disseminate guidelines
- for the encoding and interchange of machine-readable text. It is
- sponsored by the Association for Computers in the Humanities, the
- Association for Computational Linguistics, and the Association for
- Literary and Linguistic Computing. Representatives of numerous other
- professional societies sit on its advisory board. The TEI has a number
- of affiliated projects that have provided assistance by testing drafts of
- the guidelines.
-
- Among the design goals for the TEI tag set, the scheme first of all must
- meet the needs of research, because the TEI came out of the research
- community, which did not feel adequately served by existing tag sets.
- The tag set must be extensive as well as compatible with existing and
- emerging standards. In 1990, version 1.0 of the Guidelines was released
- (SPERBERG-McQUEEN illustrated their contents).
-
- SPERBERG-McQUEEN noted that one problem besetting electronic text has
- been the lack of adequate internal or external documentation for many
- existing electronic texts. The TEI guidelines as currently formulated
- contain few fixed requirements, but one of them is this: There must
- always be a document header, an in-file SGML tag that provides
- 1) a bibliographic description of the electronic object one is talking
- about (that is, who included it, when, what for, and under which title);
- and 2) the copy text from which it was derived, if any. If there was
- no copy text or if the copy text is unknown, then one states as much.
- Version 2.0 of the Guidelines was scheduled to be completed in fall 1992
- and a revised third version is to be presented to the TEI advisory board
- for its endorsement this coming winter. The TEI itself exists to provide
- a markup language, not a marked-up text.
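-
- The header requirement lends itself to a mechanical check. The sketch
- below is an assumption-laden illustration: XML stands in for SGML, and
- the element and attribute names (text, header, description, source,
- status) are hypothetical and are NOT the TEI's actual tag names. It
- mirrors the rule as stated: a bibliographic description is mandatory,
- and a missing or unknown copy text is acceptable only if declared.
-
```python
import xml.etree.ElementTree as ET


def header_problems(xml_text):
    """List what is missing from a document's in-file header.

    Element and attribute names are hypothetical, not the TEI's own.
    """
    root = ET.fromstring(xml_text)
    header = root.find("header")
    if header is None:
        return ["no document header"]
    problems = []
    description = header.find("description")
    if description is None or not (description.text or "").strip():
        problems.append("no bibliographic description of the electronic text")
    source = header.find("source")
    # An unknown or absent copy text is acceptable, but it must be stated.
    stated = source is not None and (
        (source.text or "").strip() or source.get("status")
    )
    if not stated:
        problems.append("copy text neither described nor declared unknown")
    return problems


DOCUMENTED = (
    "<text><header>"
    "<description>Encoded by a hypothetical project in 1992</description>"
    '<source status="unknown"/>'
    "</header><body>Some text</body></text>"
)
UNDOCUMENTED = "<text><body>Some text</body></text>"

print(header_problems(DOCUMENTED))
print(header_problems(UNDOCUMENTED))
```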
-
- Among the challenges the TEI has attempted to face is the need for a
- markup language that will work for existing projects, that is, handle the
- level of markup that people are using now to tag only chapter, section,
- and paragraph divisions and not much else. At the same time, such a
- language also will be able to scale up gracefully to handle the highly
- detailed markup which many people foresee as the future destination of
- much electronic text, and which is not the future destination but the
- present home of numerous electronic texts in specialized areas.
-
- SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as
- unable to support the kind of applications that draw people who have
- never been in the public library regularly before, and make them come
- back. He advocated more interesting text and more intelligent text.
- Asserting that it is not beyond economic feasibility to have good texts,
- SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags
- contains tags that one is expected to enter every time the relevant
- textual feature occurs. It contains all the tags that people need now,
- and it is not expected that everyone will tag things in the same way.
-
- The question of how people will tag the text is in large part a function
- of their reaction to what SPERBERG-McQUEEN termed the issue of
- reproducibility. What one needs to be able to reproduce are the things
- one wants to work with. Perhaps a more useful concept than that of
- reproducibility or recoverability is that of processability, that is,
- what can one get from an electronic text without reading it again
- in the original. He illustrated this contention with a page from
- Jan Comenius's bilingual Introduction to Latin.
-
- SPERBERG-McQUEEN returned at length to the issue of images as simulacra
- for the text, in order to reiterate his belief that in the long run more
- than images of pages of particular editions of the text are needed,
- because just as second-generation photocopies and second-generation
- microfilm degenerate, so second-generation representations tend to
- degenerate, and one tends to overstress some relatively trivial aspects
- of the text such as its layout on the page, which is not always
- significant, despite what the text critics might say, and slight other
- pieces of information such as the very important lexical ties between the
- English and Latin versions of Comenius's bilingual text, for example.
- Moreover, in many crucial respects it is easy to fool oneself concerning
- what a scanned image of the text will accomplish. For example, in order
- to study the transmission of texts, information concerning the text
- carrier is necessary, which scanned images simply do not always handle.
- Further, even the high-quality materials being produced at Cornell lose
- much of the information that one would need if studying those books as
- physical objects. It is a choice that has been made. It is an arguably
- justifiable choice, but one does not know what color those pen strokes in
- the margin are or whether there was a stain on the page, because it has
- been filtered out. One does not know whether there were rips in the page
- because they do not show up, and on a couple of the marginal marks one
- loses half of the mark because the pen is very light and the scanner
- failed to pick it up, and so what is clearly a checkmark in the margin of
- the original becomes a little scoop in the margin of the facsimile.
- These are standard problems for facsimile editions, not new to
- electronics but also true of light-lens photography. They are remarked
- here because it is important not to fool ourselves: even if we produce a
- very nice image of this page with good contrast, we are not replacing the
- manuscript any more than microfilm has replaced the manuscript.
-
- The TEI comes from the research community, where its first allegiance
- lies, but it is not just an academic exercise. It has relevance far
- beyond those who spend all of their time studying text, because one's
- model of text determines what one's software can do with a text. Good
- models lead to good software. Bad models lead to bad software. That has
- economic consequences, and it is these economic consequences that have
- led the European Community to help support the TEI, and that will lead,
- SPERBERG-McQUEEN hoped, some software vendors to realize that if they
- provide software with a better model of the text they can make a killing.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Implications of different DTDs and tag sets * ODA versus SGML *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- During the discussion that followed, several additional points were made.
- Neither AAP (i.e., Association of American Publishers) nor CALS (i.e.,
- Computer-aided Acquisition and Logistics Support) has a document-type
- definition for ancient Greek drama, although the TEI will be able to
- handle that. Given this state of affairs and assuming that the
- technical-journal producers and the commercial vendors decide to use the
- other two types, then an institution like the Library of Congress, which
- might receive all of their publications, would have to be able to handle
- three different types of document definitions and tag sets and be able to
- distinguish among them.
-
- Office Document Architecture (ODA) has some advantages that flow from its
- tight focus on office documents and clear directions for implementation.
- Much of the ODA standard is easier to read and clearer at first reading
- than the SGML standard, which is extremely general. But that tight focus
- means that if one wants to use graphics in TIFF with ODA, one is stuck,
- because ODA defines its own graphics formats and TIFF is not among them,
- whereas SGML says the world is not waiting for this work group to create
- another graphics format.
- What is needed is an ability to use whatever graphics format one wants.
-
- The TEI provides a socket that allows one to connect the SGML document to
- the graphics. The notation that the graphics are in is clearly a choice
- that one needs to make based on her or his environment, and that is one
- advantage. SGML is less megalomaniacal in attempting to define formats
- for all kinds of information, though more megalomaniacal in attempting to
- cover all sorts of documents. The other advantage is that the model of
- text represented by SGML is simply an order of magnitude richer and more
- flexible than the model of text offered by ODA. Both offer hierarchical
- structures, but SGML recognizes that the hierarchical model of the text
- that one is looking at may not have been in the minds of the designers,
- whereas ODA does not.
-
- ODA is not really aiming for the kind of document that the TEI wants to
- encompass. The TEI can handle the kind of material ODA has, as well as a
- significantly broader range of material. ODA seems to be very much
- focused on office documents, which is what it started out being called--
- office document architecture.
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- CALALUCA * Text-encoding from a publisher's perspective *
- Responsibilities of a publisher * Reproduction of Migne's Latin series
- whole and complete with SGML tags based on perceived need and expected
- use * Particular decisions arising from the general decision to produce
- and publish PLD *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- The final speaker in this session, Eric CALALUCA, vice president,
- Chadwyck-Healey, Inc., spoke from the perspective of a publisher on
- text-encoding, rather than as one qualified to discuss methods of
- encoding data, and observed that the presenters sitting in the room,
- whether they had chosen to or not, were acting as publishers: making
- choices, gathering data, gathering information, and making assessments.
- CALALUCA offered the hard-won conviction that in publishing very large
- text files (such as PLD), one cannot avoid making personal judgments of
- appropriateness and structure.
-
- In CALALUCA's view, encoding decisions stem from prior judgments. Two
- notions have become axioms for him in the consideration of future sources
- for electronic publication: 1) electronic text publishing is as personal
- as any other kind of publishing, and questions of if and how to encode
- the data are simply a consequence of that prior decision; 2) all
- personal decisions are open to criticism, which is unavoidable.
-
- CALALUCA rehearsed his role as a publisher or, better, as an intermediary
- between what is viewed as a sound idea and the people who would make use
- of it. Finding the specialist to advise in this process is the core of
- that function. The publisher must monitor and hug the fine line between
- giving users what they want and suggesting what they might need. One
- responsibility of a publisher is to represent the desires of scholars and
- research librarians as opposed to bullheadedly forcing them into areas
- they would not choose to enter.
-
- CALALUCA likened the questions being raised today about data structure
- and standards to the decisions faced by the Abbe Migne himself during
- production of the Patrologia series in the mid-nineteenth century.
- Chadwyck-Healey's decision to reproduce Migne's Latin series whole and
- complete with SGML tags was also based upon a perceived need and an
- expected use. In the same way that Migne's work came to be far more than
- a simple handbook for clerics, PLD is already far more than a database
- for theologians. It is a bedrock source for the study of Western
- civilization, CALALUCA asserted.
-
- In regard to the decision to produce and publish PLD, the editorial board
- offered direct judgments on the question of appropriateness of these
- texts for conversion, their encoding and their distribution, and
- concluded that the best possible project was one that avoided overt
- intrusions or exclusions in so important a resource. Thus, the general
- decision to transmit the original collection as clearly as possible with
- the widest possible avenues for use led to other decisions: 1) To encode
- the data or not, SGML or not, TEI or not. Again, the expected user
- community asserted the need for normative tagging structures of important
- humanities texts, and the TEI seemed the most appropriate structure for
- that purpose. Research librarians, who are trained to view the larger
- impact of electronic text sources on 80 or 90 or 100 doctoral
- disciplines, loudly approved the decision to include tagging. They see
- what is coming better than the specialist who is completely focused on
- one edition of Ambrose's De Anima, and they also understand that the
- potential uses exceed present expectations. 2) What will be tagged and
- what will not. Once again, the board realized that one must tag the
- obvious. But in no way should one attempt to identify through encoding
- schemes every single discrete area of a text that might someday be
- searched. That was another decision. Searching by a column number, an
- author, a word, a volume, permitting combination searches, and tagging
- notations seemed logical choices as core elements. 3) How does one make
- the data available? Tying it to a CD-ROM edition creates limitations,
- but a magnetic tape file that is very large, is accompanied by the
- encoding specifications, and that allows one to make local modifications
- also allows one to incorporate any changes one may desire within the
- bounds of private research, though exporting tag files from a CD-ROM
- could serve just as well. Since no one on the board could possibly
- anticipate each and every way in which a scholar might choose to mine
- this data bank, it was decided to satisfy the basics and make some
- provisions for what might come. 4) Not to encode the database would rob
- it of the interchangeability and portability these important texts should
- accommodate. For CALALUCA, the extensive options presented by full-text
- searching require care in text selection and strongly support encoding of
- data to facilitate the widest possible search strategies. Better
- software can always be created, but summoning the resources, the people,
- and the energy to reconvert the text is another matter.
-
- PLD is being encoded, captured, and distributed, because to
- Chadwyck-Healey and the board it offers the widest possible array of
- future research applications that can be seen today. CALALUCA concluded
- by urging the encoding of all important text sources in whatever way
- seems most appropriate and durable at the time, without blanching at the
- thought that one's work may require emendation in the future. (Thus,
- Chadwyck-Healey produced a very large humanities text database before the
- final release of the TEI Guidelines.)
-
- ******
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- DISCUSSION * Creating texts with markup advocated * Trends in encoding *
- The TEI and the issue of interchangeability of standards * A
- misconception concerning the TEI * Implications for an institution like
- LC in the event that a multiplicity of DTDs develops * Producing images
- as a first step towards possible conversion to full text through
- character recognition * The AAP tag sets as a common starting point and
- the need for caution *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- HOCKEY prefaced the discussion that followed with several comments in
- favor of creating texts with markup and on trends in encoding. In the
- future, when many more texts are available for on-line searching, real
- problems in finding what is wanted will develop, if one is faced with
- millions of words of data. It therefore becomes important to consider
- putting markup in texts to help searchers home in on the actual things
- they wish to retrieve. Various approaches to refining retrieval methods
- toward this end include building on a computer version of a dictionary
- and letting the computer look up words in it to obtain more information
- about the semantic structure or semantic field of a word, its grammatical
- structure, and syntactic structure.
-
- HOCKEY commented on the present keen interest in the encoding world
- in creating: 1) machine-readable versions of dictionaries that can be
- initially tagged in SGML, which gives a structure to the dictionary entry;
- these entries can then be converted into a more rigid or otherwise
- different database structure inside the computer, which can be treated as
- a dynamic tool for searching mechanisms; 2) large bodies of text to study
- the language. In order to incorporate more sophisticated mechanisms,
- more about how words behave needs to be known, which can be learned in
- part from information in dictionaries. However, the last ten years have
- seen much interest in studying the structure of printed dictionaries
- converted into computer-readable form. The information one derives about
- many words from those is only partial, one or two definitions of the
- common or the usual meaning of a word, and then numerous definitions of
- unusual usages. If the computer is using a dictionary to help retrieve
- words in a text, it needs much more information about the common usages,
- because those are the ones that occur over and over again. Hence the
- current interest in developing large bodies of text in computer-readable
- form in order to study the language. Several projects are engaged in
- compiling, for example, 100 million words. HOCKEY described one with
- which she was associated briefly at Oxford University involving
- compilation of 100 million words of British English: about 10 percent of
- that will contain detailed linguistic tagging encoded in SGML; it will
- have word class taggings, with words identified as nouns, verbs,
- adjectives, or other parts of speech. This tagging can then be used by
- programs which will begin to learn a bit more about the structure of the
- language and can then go on to tag more text.
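-
- The bootstrapping idea HOCKEY described can be sketched in miniature.
- This is an illustration only, not the Oxford project's method: the sample
- words and tag names are invented, and the "learning" is a bare lookup of
- each word's most frequent tag, whereas a real word-class tagger would
- also use context. Words the program has never seen are flagged rather
- than guessed, so that a person can tag them and enlarge the sample.
-
```python
from collections import Counter, defaultdict

# A tiny hand-tagged sample. Words and tag names are invented.
TAGGED_SAMPLE = [
    ("the", "DET"), ("library", "NOUN"), ("stores", "VERB"), ("texts", "NOUN"),
    ("the", "DET"), ("scanner", "NOUN"), ("reads", "VERB"),
    ("the", "DET"), ("page", "NOUN"),
]


def train(tagged_words):
    """Learn, for each word, the tag it carries most often in the sample."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, unknown="UNKNOWN"):
    """Tag new text with the learned model, flagging unseen words."""
    return [(word, model.get(word.lower(), unknown)) for word in words]


model = train(TAGGED_SAMPLE)
print(tag("The scanner stores the page".split(), model))
print(tag(["microfilm"], model))
```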
-
- HOCKEY said that the more that is tagged accurately, the more one can
- refine the tagging process and thus the bigger body of text one can build
- up with linguistic tagging incorporated into it. Hence, the more tagging
- or annotation there is in the text, the more one may begin to learn about
- language and the more it will help accomplish more intelligent OCR. She
- recommended the development of software tools that will help one begin to
- understand more about a text, which can then be applied to scanning
- images of that text in that format and to using more intelligence to help
- one interpret or understand the text.
-
- HOCKEY posited the need to think about common methods of text-encoding
- for a long time to come, because building these large bodies of text is
- extremely expensive and will only be done once.
-
- In the more general discussion on approaches to encoding that followed,
- these points were made:
-
- BESSER identified the underlying problem with standards that all have to
- struggle with in adopting a standard, namely, the tension between a very
- highly defined standard that is very interchangeable but does not work
- for everyone because something is lacking, and a standard that is less
- defined, more open, more adaptable, but less interchangeable. Contending
- that the way in which people use SGML is not sufficiently defined, BESSER
- wondered 1) if people resist the TEI because they think it is too defined
- in certain things they do not fit into, and 2) how progress with
- interchangeability can be made without frightening people away.
-
- SPERBERG-McQUEEN replied that the published drafts of the TEI had met
- with surprisingly little objection on the grounds that they do not allow
- one to handle X or Y or Z. Particular concerns of the affiliated
- projects have led, in practice, to discussions of how extensions are to
- be made; the primary concern of any project has to be how it can be
- represented locally, thus making interchange secondary. The TEI has
- received much criticism based on the notion that everything in it is
- required or even recommended, which, as it happens, is a misconception
- from the beginning, because none of it is required and very little is
- actually actively recommended for all cases, except that one document
- one's source.
-
- SPERBERG-McQUEEN agreed with BESSER about this trade-off: all the
- projects in a set of twenty TEI-conformant projects will not necessarily
- tag the material in the same way. One result of the TEI will be that the
- easiest problems will be solved--those dealing with the external form of
- the information; but the problem that is hardest in interchange is that
- one is not encoding what another wants, and vice versa. Thus, after
- the adoption of a common notation, the differences in the underlying
- conceptions of what is interesting about texts become more visible.
- The success of a standard like the TEI will lie in the ability of
- the recipient of interchanged texts to use some of what it contains
- and to add the information that was not encoded that one wants, in a
- layered way, so that texts can be gradually enriched and one does not
- have to put in everything all at once. Hence, having a well-behaved
- markup scheme is important.
-
- STEVENS followed up on the paradoxical analogy that BESSER alluded to in
- the example of the MARC records, namely, the formats that are the same
- except that they are different. STEVENS drew a parallel between
- document-type definitions and MARC records for books and serials and maps,
- where one has a tagging structure and there is a text-interchange.
- STEVENS opined that the producers of the information will set the terms
- for the standard (i.e., develop document-type definitions for the users
- of their products), creating a situation that will be problematical for
- an institution like the Library of Congress, which will have to deal with
- the DTDs in the event that a multiplicity of them develops. Thus,
- numerous people are seeking a standard but cannot find the tag set that
- will be acceptable to them and their clients. SPERBERG-McQUEEN agreed
- with this view, and said that the situation was in a way worse: attempting
- to unify arbitrary DTDs resembled attempting to unify a MARC record with a
- bibliographic record done according to the Prussian instructions.
- According to STEVENS, this situation occurred very early in the process.
-
- WATERS recalled from early discussions on Project Open Book the concern
- of many people that merely by producing images, POB was not really
- enhancing intellectual access to the material. Nevertheless, not wishing
- to overemphasize the opposition between imaging and full text, WATERS
- stated that POB views getting the images as a first step toward possibly
- converting to full text through character recognition, if the technology
- is appropriate. WATERS also emphasized that encoding is involved even
- with a set of images.
-
- SPERBERG-McQUEEN agreed with WATERS that one can create an SGML document
- consisting wholly of images. At first sight, organizing graphic images
- with an SGML document may not seem to offer great advantages, but the
- advantages of the scheme WATERS described would be precisely that
- ability to move into something that is more of a multimedia document:
- a combination of transcribed text and page images. WEIBEL concurred in
- this judgment, offering evidence from Project ADAPT, where a page is
- divided into text elements and graphic elements, and in fact the text
- elements are organized by columns and lines. These lines may be used as
- the basis for distributing documents in a network environment. As one
- develops software intelligent enough to recognize what those elements
- are, it makes sense to apply SGML initially to an image that may, in
- fact, ultimately become more and more text, either through OCR or edited
- OCR or even just through keying. For WATERS, the labor of composing the
- document and saying this set of documents or this set of images belongs
- to this document constitutes a significant investment.
-
- WEIBEL also made the point that the AAP tag sets, while not excessively
- prescriptive, offer a common starting point; they do not define the
- structure of the documents, though. They have some recommendations about
- DTDs one could use as examples, but they do just suggest tag sets. For
- example, the CORE project attempts to use the AAP markup as much as
- possible, but there are clearly areas where structure must be added.
- That in no way contradicts the use of AAP tag sets.
-
- SPERBERG-McQUEEN noted that the TEI prepared a long working paper early
- on about the AAP tag set and what it lacked that the TEI thought it
- needed, and a fairly long critique of the naming conventions, which has
- led to a very different style of naming in the TEI. He stressed the
- importance of the opposition between prescriptive markup, the kind that a
- publisher or anybody can do when producing documents de novo, and
- descriptive markup, in which one has to take what the text carrier
- provides. In these particular tag sets it is easy to overemphasize this
- opposition, because the AAP tag set is extremely flexible. Even if one
- just used the DTDs, they allow almost anything to appear almost anywhere.
-
- ******
-
- SESSION VI. COPYRIGHT ISSUES
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- PETERS * Several cautions concerning copyright in an electronic
- environment * Review of copyright law in the United States * The notion
- of the public good and the desirability of incentives to promote it *
- What copyright protects * Works not protected by copyright * The rights
- of copyright holders * Publishers' concerns in today's electronic
- environment * Compulsory licenses * The price of copyright in a digital
- medium and the need for cooperation * Additional clarifications * Rough
- justice oftentimes the outcome in numerous copyright matters * Copyright
- in an electronic society * Copyright law always only sets up the
- boundaries; anything can be changed by contract *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- Marybeth PETERS, policy planning adviser to the Register of Copyrights,
- Library of Congress, made several general comments and then opened the
- floor to discussion of subjects of interest to the audience.
-
- Having attended several sessions in an effort to gain a sense of what
- people did and where copyright would affect their lives, PETERS expressed
- the following cautions:
-
- * If one takes and converts materials and puts them in new forms,
- then, from a copyright point of view, one is creating something and
- will receive some rights.
-
- * However, if what one is converting already exists, a question
- immediately arises about the status of the materials in question.
-
- * Putting something in the public domain in the United States offers
- some freedom from anxiety, but distributing it throughout the world
- on a network is another matter, even if one has put it in the public
- domain in the United States. Re foreign laws, very frequently a
- work can be in the public domain in the United States but protected
- in other countries. Thus, one must consider all of the places a
- work may reach, lest one unwittingly face a suit for copyright
- infringement, or at least a letter demanding discussion of what
- one is doing.
-
- PETERS reviewed copyright law in the United States. The U.S.
- Constitution effectively states that Congress has the power to enact
- copyright laws for two purposes: 1) to encourage the creation and
- dissemination of intellectual works for the good of society as a whole;
- and, significantly, 2) to give creators and those who package and
- disseminate materials the economic rewards that are due them.
-
- Congress strives to strike a balance, which at times can become an
- emotional issue. The United States has never accepted the notion of the
- natural right of an author so much as it has accepted the notion of the
- public good and the desirability of incentives to promote it. This state
- of affairs, however, has created strains on the international level and
- is the reason for several of the differences in the laws that we have.
- Today the United States protects almost every kind of work that can be
- called an expression of an author. The standard for gaining copyright
- protection is simply originality. This is a low standard and means that
- a work is not copied from something else and that it shows a certain
- minimal amount of authorship. One can also acquire copyright protection
- for making a new version of preexisting material, provided it manifests
- some spark of creativity.
-
- However, copyright does not protect ideas, methods, systems--only the way
- that one expresses those things. Nor does copyright protect anything
- that is mechanical, anything that does not involve choice, or criteria
- concerning whether or not one should do a thing. For example, the
- results of a process called declicking, in which one mechanically removes
- impure sounds from old recordings, are not copyrightable. On the other
- hand, the choice to record a song digitally and to increase the sound of
- violins or to bring up the tympani yields results of conversion
- that are copyrightable. Moreover, if a work is protected by copyright in
- the United States, one generally needs the permission of the copyright
- owner to convert it. Normally, who will own the new--that is,
- converted--material is a matter of contract. In the absence of a contract, the
- person who creates the new material is the author and owner. But people
- do not generally think about the copyright implications until after the
- fact. PETERS stressed the need when dealing with copyrighted works to
- think about copyright in advance. One's bargaining power is much greater
- up front than it is down the road.
-
- PETERS next discussed works not protected by copyright. For example, any
- work done by a federal employee as part of his or her official duties is
- in the public domain in the United States. The issue is not wholly free
- of doubt concerning whether or not the work is in the public domain
- outside the United States. Other materials in the public domain include:
- any works published more than seventy-five years ago, and any work
- published in the United States more than twenty-eight years ago, whose
- copyright was not renewed. In talking about the new technology and
- putting material in a digital form to send all over the world, PETERS
- cautioned, one must keep in mind that while the rights may not be an
- issue in the United States, they may be in different parts of the world,
- where most countries previously employed a copyright term of the life of
- the author plus fifty years.
-
- PETERS next reviewed the economics of copyright holding. Simply,
- economic rights are the rights to control the reproduction of a work in
- any form. They belong to the author, or in the case of a work made for
- hire, the employer. The second right, which is critical to conversion,
- is the right to change a work. The right to make new versions is perhaps
- one of the most significant rights of authors, particularly in an
- electronic world. The third right is the right to publish the work and
- the right to disseminate it, something that everyone who deals in an
- electronic medium needs to know. The basic rule is that if a copy is sold,
- all rights of distribution are extinguished with the sale of that copy.
- The key is that it must be sold. A number of companies overcome this
- obstacle by leasing or renting their product. These companies argue that
- if the material is rented or leased and not sold, they control the uses
- of a work. The fourth right, and one very important in a digital world,
- is a right of public performance, which means the right to show the work
- sequentially. For example, copyright owners control the showing of a
- CD-ROM product in a public place such as a public library. The reverse
- side of public performance is something called the right of public
- display. Moral rights also exist, which at the federal level apply only
- to very limited visual works of art, but in theory may apply under
- contract and other principles. Moral rights may include the right of an
- author to have his or her name on a work (the right of attribution) and
- the right to object to distortion or mutilation (the right of integrity).
-
- The way copyright law is worded gives much latitude to activities such as
- preservation; to use of material for scholarly and research purposes when
- the user does not make multiple copies; and to the generation of
- facsimile copies of unpublished works by libraries for themselves and
- other libraries. But the law does not allow anyone to become the
- distributor of the product for the entire world. In today's electronic
- environment, publishers are extremely concerned that the entire world is
- networked and can obtain the information desired from a single copy in a
- single library. Hence, if there is to be only one sale, which publishers
- may choose to live with, they will obtain their money in other ways, for
- example, from access and use. Hence, the development of site licenses
- and other kinds of agreements to cover what publishers believe they
- should be compensated for. Any solution that the United States takes
- today has to consider the international arena.
-
- Noting that the United States is a member of the Berne Convention and
- subscribes to its provisions, PETERS described the permissions process.
- She also defined compulsory licenses. A compulsory license, of which the
- United States has had a few, builds into the law the right to use a work
- subject to certain terms and conditions. In the international arena,
- however, the ability to use compulsory licenses is extremely limited.
- Thus, clearinghouses and other collectives constitute one option that has
- succeeded in providing for use of a work. Often overlooked when one
- begins to use copyrighted material and put products together is how
- expensive the permissions process and its management are. According to
- PETERS, the price of copyright in a digital medium, whatever solution is
- worked out, will include managing and assembling the database. She
- strongly recommended that publishers and librarians or people with
- various backgrounds cooperate to work out administratively feasible
- systems, in order to produce better results.
-
- In the lengthy question-and-answer period that followed PETERS's
- presentation, the following points emerged:
-
- * The Copyright Office maintains that anything mechanical and
- totally exhaustive probably is not protected. In the event that
- what an individual did in developing potentially copyrightable
- material is not understood, the Copyright Office will ask about the
- creative choices the applicant made or chose not to make. As a
- practical matter, if one believes she or he has made enough of those
- choices, that person has a right to assert a copyright and someone
- else must assert that the work is not copyrightable. The more
- mechanical, the more automatic, a thing is, the less likely it is to
- be copyrightable.
-
- * Nearly all photographs are deemed to be copyrightable, but no one
- worries about them much, because everyone is free to take the same
- image. Thus, a photographic copyright represents what is called a
- "thin" copyright. The photograph itself must be duplicated, in
- order for copyright to be violated.
-
- * The Copyright Office takes the position that X-rays are not
- copyrightable because they are mechanical. It can be argued
- whether or not image enhancement in scanning can be protected. One
- must exercise care with material created with public funds and
- generally in the public domain. An article written by a federal
- employee, if written as part of official duties, is not
- copyrightable. However, control over a scientific article written
- by a National Institutes of Health grantee (i.e., someone who
- receives money from the U.S. government) depends on NIH policy. If
- the government agency has no policy (and that policy can be
- contained in its regulations, the contract, or the grant), the
- author retains copyright. If a provision of the contract, grant, or
- regulation states that there will be no copyright, then it does not
- exist. When a work is created, copyright automatically comes into
- existence unless something exists that says it does not.
-
- * An enhanced electronic copy of a print copy of an older reference
- work in the public domain that does not contain copyrightable new
- material is a purely mechanical rendition of the original work, and
- is not copyrightable.
-
- * Usually, when a work enters the public domain, nothing can remove
- it. For example, Congress recently passed into law the concept of
- automatic renewal, which means that copyright on any work published
- between 1964 and 1978 does not have to be renewed in order to
- receive a seventy-five-year term. But any work not renewed before
- 1964 is in the public domain.
-
- * Concerning whether or not the United States keeps track of when
- authors die, nothing was ever done, nor is anything being done at
- the moment by the Copyright Office.
-
- * Software that drives a mechanical process is itself copyrightable.
- If one changes platforms, the software itself has a copyright. The
- World Intellectual Property Organization will hold a symposium 28
- March through 2 April 1993, at Harvard University, on digital
- technology, and will study this entire issue. If one purchases a
- computer software package, such as MacPaint, and creates something
- new, one receives protection only for that which has been added.
-
- PETERS added that often in copyright matters, rough justice is the
- outcome, for example, in collective licensing, ASCAP (i.e., American
- Society of Composers, Authors, and Publishers), and BMI (i.e., Broadcast
- Music, Inc.), where it may seem that the big guys receive more than their
- due. Of course, people ought not to copy a creative product without
- paying for it; there should be some compensation. But the truth of the
- world, and it is not a great truth, is that the big guy gets played on
- the radio more frequently than the little guy, who has to do much more
- until he becomes a big guy. That is true of every author, every
- composer, everyone, and, unfortunately, is part of life.
-
- Copyright always originates with the author, except in cases of works
- made for hire. (Most software falls into this category.) When an author
- sends his article to a journal, he has not relinquished copyright, though
- he retains the right to relinquish it. The author receives absolutely
- everything. The less prominent the author, the more leverage the
- publisher will have in contract negotiations. In order to transfer the
- rights, the author must sign an agreement giving them away.
-
- In an electronic society, it is important to be able to license a writer
- and work out deals. With regard to use of a work, it usually is much
- easier when a publisher holds the rights. In an electronic era, a real
- problem arises when one is digitizing and making information available.
- PETERS referred again to electronic licensing clearinghouses. Copyright
- ought to remain with the author, but as one moves forward globally in the
- electronic arena, a middleman who can handle the various rights becomes
- increasingly necessary.
-
- The notion of copyright law is that copyright resides with the individual, but
- in an on-line environment, where a work can be adapted and tinkered with
- by many individuals, there is concern. If changes are authorized and
- there is no agreement to the contrary, the person who changes a work owns
- the changes. To put it another way, the person who acquires permission
- to change a work technically will become the author and the owner, unless
- some agreement to the contrary has been made. It is typical for the
- original publisher to try to control all of the versions and all of the
- uses. Copyright law always only sets up the boundaries. Anything can be
- changed by contract.
-
- ******
-
- SESSION VII. CONCLUSION
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- GENERAL DISCUSSION * Two questions for discussion * Different emphases in
- the Workshop * Bringing the text and image partisans together *
- Desiderata in planning the long-term development of something * Questions
- surrounding the issue of electronic deposit * Discussion of electronic
- deposit as an allusion to the issue of standards * Need for a directory
- of preservation projects in digital form and for access to their
- digitized files * CETH's catalogue of machine-readable texts in the
- humanities * What constitutes a publication in the electronic world? *
- Need for LC to deal with the concept of on-line publishing * LC's Network
- Development Office exploring the limits of MARC as a standard in terms
- of handling electronic information * Magnitude of the problem and the
- need for distributed responsibility in order to maintain and store
- electronic information * Workshop participants to be viewed as a starting
- point * Development of a network version of AM urged * A step toward AM's
- construction of some sort of apparatus for network access * A delicate
- and agonizing policy question for LC * Re the issue of electronic
- deposit, LC urged to initiate a catalytic process in terms of distributed
- responsibility * Suggestions for cooperative ventures * Commercial
- publishers' fears * Strategic questions for getting the image and text
- people to think through long-term cooperation * Clarification of the
- driving force behind both the Perseus and the Cornell Xerox projects *
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- In his role as moderator of the concluding session, GIFFORD raised two
- questions he believed would benefit from discussion: 1) Are there enough
- commonalities among those of us that have been here for two days so that
- we can see courses of action that should be taken in the future? And, if
- so, what are they and who might take them? 2) Partly derivative from
- that, but obviously very dangerous to LC as host, do you see a role for
- the Library of Congress in all this? Of course, the Library of Congress
- holds a rather special status in a number of these matters, because it is
- not perceived as a player with an economic stake in them, but are there
- roles that LC can play that can help advance us toward where we are heading?
-
- Describing himself as an uninformed observer of the technicalities of the
- last two days, GIFFORD detected three different emphases in the Workshop:
- 1) people who are very deeply committed to text; 2) people who are almost
- passionate about images; and 3) a few people who are very committed to
- what happens to the networks, in other words, to the new networking
- dimension: the accessibility, the processability, and the portability of
- all this across the networks. How do we pull those three together?
-
- Adding a question that reflected HOCKEY's comment that this was the
- fourth workshop she had attended in the previous thirty days, FLEISCHHAUER
- wondered to what extent this meeting had reinvented the wheel, or if it
- had contributed anything in the way of bringing together a different group
- of people from those who normally appear on the workshop circuit.
-
- HOCKEY confessed to being struck at this meeting and the one the
- Electronic Peirce Consortium organized the previous week that this was a
- coming together of people working on texts and not images. Attempting to
- bring the two together is something we ought to be thinking about for the
- future: How one can think about working with image material to begin
- with, but structuring it and digitizing it in such a way that at a later
- stage it can be interpreted into text, and find a common way of building
- text and images together so that they can be used jointly in the future,
- with the network support to begin there because that is how people will
- want to access it.
-
- In planning the long-term development of something, which is what is
- being done in electronic text, HOCKEY stressed the importance not only
- of discussing the technical aspects of how one does it but particularly
- of thinking about what the people who use the stuff will want to do.
- But conversely, there are numerous things that people start to do with
- electronic text or material that nobody ever thought of in the beginning.
-
- LESK, in response to the question concerning the role of the Library of
- Congress, remarked on the often-suggested desideratum of having electronic
- deposit: Since everything is now computer-typeset, an entire decade of
- material that was machine-readable exists, but the publishers frequently
- did not save it; has LC taken any action to have its copyright deposit
- operation start collecting these machine-readable versions? In the
- absence of PETERS, GIFFORD replied that the question was being
- actively considered but that that was only one dimension of the problem.
- Another dimension is the whole question of the integrity of the original
- electronic document. It becomes highly important in science to prove
- authorship. How will that be done?
-
- ERWAY explained that, under the old policy, to make a claim for a
- copyright for works that were published in electronic form, including
- software, one had to submit a paper copy of the first and last twenty
- pages of code--something that represented the work but did not include
- the entire work itself and had little value to anyone. As a temporary
- measure, LC has claimed the right to demand electronic versions of
- electronic publications. This measure entails a proactive role for the
- Library to say that it wants a particular electronic version. Publishers
- then have perhaps a year to submit it. But the real problem for LC is
- what to do with all this material in all these different formats. Will
- the Library mount it? How will it give people access to it? How does LC
- keep track of the appropriate computers, software, and media? The situation
- is so hard to control, ERWAY said, that it makes sense for each publishing
- house to maintain its own archive. But LC cannot enforce that either.
-
- GIFFORD acknowledged LESK's suggestion that establishing a priority
- offered the solution, albeit a fairly complicated one. But who maintains
- that register?, he asked. GRABER noted that LC does attempt to collect
- the Macintosh version and the IBM-compatible version of software. It does
- not collect other versions. But while true for software, BYRUM observed,
- this reply does not speak to materials, that is, all the materials that
- were published that were on somebody's microcomputer or driver tapes
- at a publishing office across the country. LC does well to acquire
- specific machine-readable products selectively that were intended to be
- machine-readable. Materials that were in machine-readable form at one time,
- BYRUM said, would be beyond LC's capability at the moment, insofar as
- attempting to acquire, organize, and preserve them is concerned--and
- preservation would be the most important consideration. In this
- connection, GIFFORD reiterated the need to work out some sense of
- distributive responsibility for a number of these issues, which
- inevitably will require significant cooperation and discussion.
- Nobody can do it all.
-
- LESK suggested that some publishers may look with favor on LC beginning
- to serve as a depository of tapes in an electronic manuscript standard.
- Publishers may view this as a service that they did not have to perform
- and they might send in tapes. However, SPERBERG-McQUEEN countered,
- although publishers have had equivalent services available to them for a
- long time, the electronic text archive has never turned tapes away or been
- flooded with them and is forever sending feedback to the depositor.
- Some publishers do send in tapes.
-
- ANDRE viewed this discussion as an allusion to the issue of standards.
- She recommended that the AAP standard and the TEI, which has already been
- somewhat harmonized internationally and which also shares several
- compatibilities with the AAP, be harmonized to ensure sufficient
- compatibility in the software. She drew the line at saying LC ought to
- be the locus or forum for such harmonization.
-
- Taking the group in a slightly different direction, but one where at
- least in the near term LC might play a helpful role, LYNCH remarked on the
- plans of a number of projects to carry out preservation by creating
- digital images that will end up in on-line or near-line storage at some
- institution. Presumably, LC will link this material somehow to its
- on-line catalog in most cases. Thus, it is in a digital form. LYNCH had
- the impression that many of these institutions would be willing to make
- those files accessible to other people outside the institution, provided
- that there is no copyright problem. This desideratum will require
- propagating the knowledge that those digitized files exist, so that they
- can end up in other on-line catalogs. Although uncertain about the
- mechanism for achieving this result, LYNCH said that it warranted
- scrutiny because it seemed to be connected to some of the basic issues of
- cataloging and distribution of records. It would be foolish, given the
- amount of work that all of us have to do and our meager resources, to
- discover multiple institutions digitizing the same work. Re microforms,
- LYNCH said, we are in pretty good shape.
-
- BATTIN called this a big problem and noted that the Cornell people (who
- had already departed) were working on it. At issue from the beginning
- was to learn how to catalog that information into RLIN and then into
- OCLC, so that it would be accessible. That issue remains to be resolved.
- LYNCH rejoined that putting it into OCLC or RLIN was helpful insofar as
- somebody who is thinking of performing preservation activity on that work
- could learn about it. It is not necessarily helpful for institutions to
- make that available. BATTIN opined that the idea was that it not only be
- for preservation purposes but for the convenience of people looking for
- this material. She endorsed LYNCH's dictum that duplication of this
- effort was to be avoided by every means.
-
- HOCKEY informed the Workshop about one major current activity of CETH,
- namely a catalogue of machine-readable texts in the humanities. Held on
- RLIN at present, the catalogue has concentrated on ASCII as opposed
- to digitized images of text. She is exploring ways to improve the
- catalogue and make it more widely available, and welcomed suggestions
- about these concerns. CETH owns the records, which are not just
- restricted to RLIN, and can distribute them however it wishes.
-
- Taking up LESK's earlier question, BATTIN inquired whether LC, since it
- is accepting electronic files and designing a mechanism for dealing with
- that rather than putting books on shelves, would become responsible for
- the National Copyright Depository of Electronic Materials. Of course
- that could not be accomplished overnight, but it would be something LC
- could plan for. GIFFORD acknowledged that much thought was being devoted
- to that set of problems and returned the discussion to the issue raised
- by LYNCH--whether or not putting the kind of records that both BATTIN and
- HOCKEY have been talking about in RLIN is not a satisfactory solution.
- It seemed to him that RLIN answered LYNCH's original point concerning
- some kind of directory for these kinds of materials. In a situation
- where somebody is attempting to decide whether or not to scan this or
- film that or to learn whether or not someone has already done so, LYNCH
- suggested, RLIN is helpful, but it is not helpful in the case of a local,
- on-line catalogue. Further, one would like to have her or his system be
- aware that that exists in digital form, so that one can present it to a
- patron, even though one did not digitize it, if it is out of copyright.
- The only way to make those linkages would be to perform a tremendous
- amount of real-time look-up, which would be awkward at best, or
- periodically to yank the whole file from RLIN and match it against one's
- own stuff, which is a nuisance.
-
- But where, ERWAY inquired, does one stop including things that are
- available over the Internet, for instance, in one's local catalogue?
- It almost seems that that is LC's means to acquire access to them.
- That represents LC's new form of library loan. Perhaps LC's new on-line
- catalogue is an amalgamation of all these catalogues on line. LYNCH
- conceded that perhaps that was true in the very long term, but was not
- applicable to scanning in the short term. In his view, the totals cited
- by Yale, 10,000 books over perhaps a four-year period, and 1,000-1,500
- books from Cornell, were not big numbers, while searching all over
- creation for relatively rare occurrences will prove to be less efficient.
- As GIFFORD wondered if this would not be a separable file on RLIN and
- could be requested from them, BATTIN interjected that it was easily
- accessible to an institution. SEVERTSON pointed out that that file, cum
- enhancements, was available with reference information on CD-ROM, which
- makes it a little more available.
-
- In HOCKEY's view, the real question facing the Workshop is what to put in
- this catalogue, because that raises the question of what constitutes a
- publication in the electronic world. (WEIBEL interjected that Erik Jul
- in OCLC's Office of Research is also wrestling with this particular
- problem, while GIFFORD thought it sounded fairly generic.) HOCKEY
- contended that a majority of texts in the humanities are in the hands
- of either a small number of large research institutions or individuals
- and are not generally available for anyone else to access at all.
- She wondered if these texts ought to be catalogued.
-
- After argument proceeded back and forth for several minutes over why
- cataloguing might be a necessary service, LEBRON suggested that this
- issue involved the responsibility of a publisher. The fact that someone
- has created something electronically and keeps it under his or her
- control does not constitute publication. Publication implies
- dissemination. While it would be important for a scholar to let other
- people know that this creation exists, in many respects this is no
- different from an unpublished manuscript. That is what is being accessed
- there, except that now one is looking at it not in hard copy but in the
- electronic environment.
-
- LEBRON expressed puzzlement at the variety of ways electronic publishing
- has been viewed. Much of what has been discussed throughout these two
- days has concerned CD-ROM publishing, whereas in the on-line environment
- that she confronts, the constraints and challenges are very different.
- Sooner or later LC will have to deal with the concept of on-line
- publishing. Taking up the comment ERWAY made earlier about storing
- copies, LEBRON gave her own journal as an example. How, she asked, would
- she deposit OJCCT for copyright, given that the journal will exist in the
- mainframe at OCLC and people will be able to access it? Here the
- situation is different, ownership versus access, and is something that
- arises with publication in the on-line environment, faster than is
- sometimes realized. Lacking clear answers to all of these questions
- herself, LEBRON did not anticipate that LC would be able to take a role
- in helping to define some of them for quite a while.
-
- GREENFIELD observed that LC's Network Development Office is attempting,
- among other things, to explore the limits of MARC as a standard in terms
- of handling electronic information. GREENFIELD also noted that Rebecca
- GUENTHER from that office gave a paper to the American Society for
- Information Science (ASIS) summarizing several of the discussion papers
- that were coming out of the Network Development Office. GREENFIELD said
- he understood that that office had a list-server soliciting just the kind
- of feedback received today concerning the difficulties of identifying and
- cataloguing electronic information. GREENFIELD hoped that everybody
- would be aware of that and somehow contribute to that conversation.
-
- Noting two of LC's roles, first, to act as a repository of record for
- material that is copyrighted in this country, and second, to make
- materials it holds available in some limited form to a clientele that
- goes beyond Congress, BESSER suggested that it was incumbent on LC to
- extend those responsibilities to all the things being published in
- electronic form. This would mean eventually accepting electronic
- formats. LC could require that at some point they be in a certain
- limited set of formats, and then develop mechanisms for allowing people
- to access those in the same way that other things are accessed. This
- does not imply that they are on the network and available to everyone.
- LC does that with most of its bibliographic records, BESSER said, which
- end up migrating to the utility (e.g., OCLC) or somewhere else. But just
- as most of LC's books are available in some form through interlibrary
- loan or some other mechanism, so in the same way electronic formats ought
- to be available to others in some format, though with some copyright
- considerations. BESSER was not suggesting that these mechanisms be
- established tomorrow, only that they seemed to fall within LC's purview,
- and that there should be long-range plans to establish them.
-
- Acknowledging that those from LC in the room agreed with BESSER
- concerning the need to confront difficult questions, GIFFORD underscored
- the magnitude of the problem of what to keep and what to select. GIFFORD
- noted that LC currently receives some 31,000 items per day, not counting
- electronic materials, and argued for much more distributed responsibility
- in order to maintain and store electronic information.
-
- BESSER responded that the assembled group could be viewed as a starting
- point, whose initial operating premise could be helping to move in this
- direction and defining how LC could do so, for example, in areas of
- standardization or distribution of responsibility.
-
- FLEISCHHAUER added that AM was fully engaged, wrestling with some of the
- questions that pertain to the conversion of older historical materials,
- which would be one thing that the Library of Congress might do. Several
- points mentioned by BESSER and several others on this question have a
- much greater impact on those who are concerned with cataloguing and the
- networking of bibliographic information, as well as preservation itself.
-
- Speaking directly to AM, which he considered a largely uncopyrighted
- database, LYNCH urged development of a network version of AM, or
- consideration of making the data in it available to people interested in
- doing network multimedia. On account of the current great shortage of
- digital data that is both appealing and unencumbered by complex rights
- problems, this course of action could have a significant effect on making
- network multimedia a reality.
-
- In this connection, FLEISCHHAUER reported on a fragmentary prototype in
- LC's Office of Information Technology Services that attempts to associate
- digital images of photographs with cataloguing information in ways that
- work within a local area network--a step, so to say, toward AM's
- construction of some sort of apparatus for access. Further, AM has
- attempted to use standard data forms in order to help make that
- distinction between the access tools and the underlying data, and thus
- believes that the database is networkable.
-
- A delicate and agonizing policy question for LC, however, which comes
- back to resources and unfortunately has an impact on this, is to find
- some appropriate, honorable, and legal cost-recovery possibilities. A
- certain skittishness concerning cost-recovery has made people unsure
- exactly what to do. AM would be highly receptive to discussing further
- LYNCH's offer to test or demonstrate its database in a network
- environment, FLEISCHHAUER said.
-
- Returning the discussion to what she viewed as the vital issue of
- electronic deposit, BATTIN recommended that LC initiate a catalytic
- process in terms of distributed responsibility, that is, bring together
- the distributed organizations and set up a study group to look at all
- these issues and see where we as a nation should move. The broader
- issues of how we deal with the management of electronic information will
- not disappear, but only grow worse.
-
- LESK took up this theme and suggested that LC attempt to persuade one
- major library in each state to deal with its state equivalent publisher,
- which might produce a cooperative project that would be equitably
- distributed around the country, and one in which LC would be dealing with
- a minimal number of publishers and minimal copyright problems.
-
- GRABER remarked on the recent development in the scientific community of
- a willingness to use SGML and either deposit or interchange in a fairly
- standardized format. He wondered if a similar movement was taking place
- in the humanities. Although the National Library of Medicine found only
- a few publishers to cooperate in a like venture two or three years ago, a
- new effort might generate a much larger number willing to cooperate.
-
- KIMBALL recounted his unit's (Machine-Readable Collections Reading Room)
- troubles with the commercial publishers of electronic media in acquiring
- materials for LC's collections, in particular the publishers' fear that
- they would not be able to cover their costs and would lose control of
- their products, that LC would give them away or sell them and make
- profits from them. He doubted that the publishing industry was prepared
- to move into this area at the moment, given its resistance to allowing LC
- to use its machine-readable materials as the Library would like.
-
- The copyright law now addresses compact disk as a medium, and LC can
- request one copy of that, or two copies if it is the only version, and
- can request copies of software, but that fails to address magazines or
- books or anything like that which is in machine-readable form.
-
- GIFFORD acknowledged the thorny nature of this issue, which he illustrated
- with the example of the cumbersome process involved in putting a copy of a
- scientific database on a LAN in LC's science reading room. He also
- acknowledged that LC needs help and could enlist the energies and talents
- of Workshop participants in thinking through a number of these problems.
-
- GIFFORD returned the discussion to getting the image and text people to
- think through together where they want to go in the long term. MYLONAS
- conceded that her experience at the Pierce Symposium the previous week at
- Georgetown University and this week at LC had forced her to reevaluate
- her perspective on the usefulness of text as images. MYLONAS framed the
- issues in a series of questions: How do we acquire machine-readable
- text? Do we take pictures of it and perform OCR on it later? Is it
- important to obtain very high-quality images and text, etc.?
- FLEISCHHAUER agreed with MYLONAS's framing of strategic questions, adding
- that a large institution such as LC probably has to do all of those
- things at different times. Thus, the trick is to exercise judgment. The
- Workshop had added to his and AM's considerations in making those
- judgments. Concerning future meetings or discussions, MYLONAS suggested
- that screening priorities would be helpful.
-
- WEIBEL opined that the diversity reflected in this group was a sign both
- of the health and of the immaturity of the field, and more time would
- have to pass before we convince one another concerning standards.
-
- An exchange between MYLONAS and BATTIN clarified the point that the
- driving force behind both the Perseus and the Cornell Xerox projects was
- the preservation of knowledge for the future, not simply for particular
- research use. In the case of Perseus, MYLONAS said, the assumption was
- that the texts would not be entered again into electronically readable
- form. SPERBERG-McQUEEN added that a scanned image would not serve as an
- archival copy for purposes of preservation in the case of, say, the Bill
- of Rights, in the sense that the scanned images are effectively the
- archival copies for the Cornell mathematics books.
-
-
- *** *** *** ****** *** *** ***
-
-
- Appendix I: PROGRAM
-
-
-
- WORKSHOP
- ON
- ELECTRONIC
- TEXTS
-
-
-
- 9-10 June 1992
-
- Library of Congress
- Washington, D.C.
-
-
-
- Supported by a Grant from the David and Lucile Packard Foundation
-
-
- Tuesday, 9 June 1992
-
- NATIONAL DEMONSTRATION LAB, ATRIUM, LIBRARY MADISON
-
- 8:30 AM Coffee and Danish, registration
-
- 9:00 AM Welcome
-
- Prosser Gifford, Director for Scholarly Programs, and Carl
- Fleischhauer, Coordinator, American Memory, Library of
- Congress
-
- 9:15 AM Session I. Content in a New Form: Who Will Use It and What
- Will They Do?
-
- Broad description of the range of electronic information.
- Characterization of who uses it and how it is or may be used.
- In addition to a look at scholarly uses, this session will
- include a presentation on use by students (K-12 and college)
- and the general public.
-
- Moderator: James Daly
- Avra Michelson, Archival Research and Evaluation Staff,
- National Archives and Records Administration (Overview)
- Susan H. Veccia, Team Leader, American Memory, User Evaluation,
- and
- Joanne Freeman, Associate Coordinator, American Memory, Library
- of Congress (Beyond the scholar)
-
- 10:30-
- 11:00 AM Break
-
- 11:00 AM Session II. Show and Tell.
-
- Each presentation to consist of a fifteen-minute
- statement/show; group discussion will follow lunch.
-
- Moderator: Jacqueline Hess, Director, National Demonstration
- Lab
-
- 1. A classics project, stressing texts and text retrieval
- more than multimedia: Perseus Project, Harvard
- University
- Elli Mylonas, Managing Editor
-
- 2. Other humanities projects employing the emerging norms of
- the Text Encoding Initiative (TEI): Chadwyck-Healey's
- The English Poetry Full Text Database and/or Patrologia
- Latina Database
- Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc.
-
- 3. American Memory
- Carl Fleischhauer, Coordinator, and
- Ricky Erway, Associate Coordinator, Library of Congress
-
- 4. Founding Fathers example from Packard Humanities
- Institute: The Papers of George Washington, University
- of Virginia
- Dorothy Twohig, Managing Editor, and/or
- David Woodley Packard
-
- 5. An electronic medical journal offering graphics and
- full-text searchability: The Online Journal of Current
- Clinical Trials, American Association for the Advancement
- of Science
- Maria L. Lebron, Managing Editor
-
- 6. A project that offers facsimile images of pages but omits
- searchable text: Cornell math books
- Lynne K. Personius, Assistant Director, Cornell
- Information Technologies for Scholarly Information
- Sources, Cornell University
-
- 12:30 PM Lunch (Dining Room A, Library Madison 620. Exhibits
- available.)
-
- 1:30 PM Session II. Show and Tell (Cont'd.).
-
- 3:00-
- 3:30 PM Break
-
- 3:30-
- 5:30 PM Session III. Distribution, Networks, and Networking: Options
- for Dissemination.
-
- Published disks: University presses and public-sector
- publishers, private-sector publishers
- Computer networks
-
- Moderator: Robert G. Zich, Special Assistant to the Associate
- Librarian for Special Projects, Library of Congress
- Clifford A. Lynch, Director, Library Automation, University of
- California
- Howard Besser, School of Library and Information Science,
- University of Pittsburgh
- Ronald L. Larsen, Associate Director of Libraries for
- Information Technology, University of Maryland at College
- Park
- Edwin B. Brownrigg, Executive Director, Memex Research
- Institute
-
- 6:30 PM Reception (Montpelier Room, Library Madison 619.)
-
- ******
-
- Wednesday, 10 June 1992
-
- DINING ROOM A, LIBRARY MADISON 620
-
- 8:30 AM Coffee and Danish
-
- 9:00 AM Session IV. Image Capture, Text Capture, Overview of Text and
- Image Storage Formats.
-
- Moderator: William L. Hooton, Vice President of Operations,
- I-NET
-
- A) Principal Methods for Image Capture of Text:
- Direct scanning
- Use of microform
-
- Anne R. Kenney, Assistant Director, Department of Preservation
- and Conservation, Cornell University
- Pamela Q.J. Andre, Associate Director, Automation, and
- Judith A. Zidar, Coordinator, National Agricultural Text
- Digitizing Program (NATDP), National Agricultural Library
- (NAL)
- Donald J. Waters, Head, Systems Office, Yale University Library
-
- B) Special Problems:
- Bound volumes
- Conservation
- Reproducing printed halftones
-
- Carl Fleischhauer, Coordinator, American Memory, Library of
- Congress
- George Thoma, Chief, Communications Engineering Branch,
- National Library of Medicine (NLM)
-
- 10:30-
- 11:00 AM Break
-
- 11:00 AM Session IV. Image Capture, Text Capture, Overview of Text and
- Image Storage Formats (Cont'd.).
-
- C) Image Standards and Implications for Preservation
-
- Jean Baronas, Senior Manager, Department of Standards and
- Technology, Association for Information and Image Management
- (AIIM)
- Patricia Battin, President, The Commission on Preservation and
- Access (CPA)
-
- D) Text Conversion:
- OCR vs. rekeying
- Standards of accuracy and use of imperfect texts
- Service bureaus
-
- Stuart Weibel, Senior Research Specialist, Online Computer
- Library Center, Inc. (OCLC)
- Michael Lesk, Executive Director, Computer Science Research,
- Bellcore
- Ricky Erway, Associate Coordinator, American Memory, Library of
- Congress
- Pamela Q.J. Andre, Associate Director, Automation, and
- Judith A. Zidar, Coordinator, National Agricultural Text
- Digitizing Program (NATDP), National Agricultural Library
- (NAL)
-
- 12:30-
- 1:30 PM Lunch
-
- 1:30 PM Session V. Approaches to Preparing Electronic Texts.
-
- Discussion of approaches to structuring text for the computer;
- pros and cons of text coding, description of methods in
- practice, and comparison of text-coding methods.
-
- Moderator: Susan Hockey, Director, Center for Electronic Texts
- in the Humanities (CETH), Rutgers and Princeton Universities
- David Woodley Packard
- C.M. Sperberg-McQueen, Editor, Text Encoding Initiative (TEI),
- University of Illinois-Chicago
- Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc.
-
- 3:30-
- 4:00 PM Break
-
- 4:00 PM Session VI. Copyright Issues.
-
- Marybeth Peters, Policy Planning Adviser to the Register of
- Copyrights, Library of Congress
-
- 5:00 PM Session VII. Conclusion.
-
- General discussion.
- What topics were omitted or given short shrift that anyone
- would like to talk about now?
- Is there a "group" here? What should the group do next, if
- anything? What should the Library of Congress do next, if
- anything?
- Moderator: Prosser Gifford, Director for Scholarly Programs,
- Library of Congress
-
- 6:00 PM Adjourn
-
-
- *** *** *** ****** *** *** ***
-
-
- Appendix II: ABSTRACTS
-
-
- SESSION I
-
- Avra MICHELSON Forecasting the Use of Electronic Texts by
- Social Sciences and Humanities Scholars
-
- This presentation explores the ways in which electronic texts are likely
- to be used by the non-scientific scholarly community. Many of the
- remarks are drawn from a report the speaker coauthored with Jeff
- Rothenberg, a computer scientist at The RAND Corporation.
-
- The speaker assesses 1) current scholarly use of information technology
- and 2) the key trends in information technology most relevant to the
- research process, in order to predict how social sciences and humanities
- scholars are apt to use electronic texts. In introducing the topic,
- current use of electronic texts is explored broadly within the context of
- scholarly communication. From the perspective of scholarly
- communication, the work of humanities and social sciences scholars
- involves five processes: 1) identification of sources, 2) communication
- with colleagues, 3) interpretation and analysis of data, 4) dissemination
- of research findings, and 5) curriculum development and instruction. The
- extent to which computation currently permeates aspects of scholarly
- communication represents a viable indicator of the prospects for
- electronic texts.
-
- The discussion of current practice is balanced by an analysis of key
- trends in the scholarly use of information technology. These include the
- trends toward end-user computing and connectivity, which provide a
- framework for forecasting the use of electronic texts through this
- millennium. The presentation concludes with a summary of the ways in
- which the nonscientific scholarly community can be expected to use
- electronic texts, and the implications of that use for information
- providers.
-
- Susan VECCIA and Joanne FREEMAN Electronic Archives for the Public:
- Use of American Memory in Public and
- School Libraries
-
- This joint discussion focuses on nonscholarly applications of electronic
- library materials, specifically addressing use of the Library of Congress
- American Memory (AM) program in a small number of public and school
- libraries throughout the United States. AM consists of selected Library
- of Congress primary archival materials, stored on optical media
- (CD-ROM/videodisc), and presented with little or no editing. Many
- collections are accompanied by electronic introductions and user's guides
- offering background information and historical context. Collections
- represent a variety of formats including photographs, graphic arts,
- motion pictures, recorded sound, music, broadsides and manuscripts,
- books, and pamphlets.
-
- In 1991, the Library of Congress began a nationwide evaluation of AM in
- different types of institutions. Test sites include public libraries,
- elementary and secondary school libraries, college and university
- libraries, state libraries, and special libraries. Susan VECCIA and
- Joanne FREEMAN will discuss their observations on the use of AM by the
- nonscholarly community, using evidence gleaned from this ongoing
- evaluation effort.
-
- VECCIA will comment on the overall goals of the evaluation project, and
- the types of public and school libraries included in this study. Her
- comments on nonscholarly use of AM will focus on the public library as a
- cultural and community institution, often bridging the gap between formal
- and informal education. FREEMAN will discuss the use of AM in school
- libraries. Use by students and teachers has revealed some broad
- questions about the use of electronic resources, as well as definite
- benefits gained by the "nonscholar." Topics will include the problem of
- grasping content and context in an electronic environment, the stumbling
- blocks created by "new" technologies, and the unique skills and interests
- awakened through use of electronic resources.
-
- SESSION II
-
- Elli MYLONAS The Perseus Project: Interactive Sources and
- Studies in Classical Greece
-
- The Perseus Project (5) has just released Perseus 1.0, the first publicly
- available version of its hypertextual database of multimedia materials on
- classical Greece. Perseus is designed to be used by a wide audience,
- composed of readers at the student and scholar levels. As such, it must
- be able to locate information using different strategies, and it must
- contain enough detail to serve the different needs of its users. In
- addition, it must be delivered so that it is affordable to its target
- audience. [These problems and the solutions we chose are described in
- Mylonas, "An Interface to Classical Greek Civilization," JASIS 43:2,
- March 1992.]
-
- In order to achieve its objective, the project staff decided to make a
- conscious separation between selecting and converting textual, database,
- and image data on the one hand, and putting it into a delivery system on
- the other. That way, it is possible to create the electronic data
- without thinking about the restrictions of the delivery system. We have
- made a great effort to choose system-independent formats for our data,
- and to put as much thought and work as possible into structuring it so
- that the translation from paper to electronic form will enhance the value
- of the data. [A discussion of these solutions as of two years ago is in
- Elli Mylonas, Gregory Crane, Kenneth Morrell, and D. Neel Smith, "The
- Perseus Project: Data in the Electronic Age," in Accessing Antiquity:
- The Computerization of Classical Databases, J. Solomon and T. Worthen
- (eds.), University of Arizona Press, in press.]
-
- Much of the work on Perseus is focused on collecting and converting the
- data on which the project is based. At the same time, it is necessary to
- provide means of access to the information, in order to make it usable,
- and then to investigate how it is used. As we learn more about what
- students and scholars from different backgrounds do with Perseus, we can
- adjust our data collection, and also modify the system to accommodate
- them. In creating a delivery system for general use, we have tried to
- avoid favoring any one type of use by allowing multiple forms of access
- to and navigation through the system.
-
- The way text is handled exemplifies some of these principles. All text
- in Perseus is tagged using SGML, following the guidelines of the Text
- Encoding Initiative (TEI). This markup is used to index the text, and
- process it so that it can be imported into HyperCard. No SGML markup
- remains in the text that reaches the user, because currently it would be
- too expensive to create a system that acts on SGML in real time.
- However, the regularity provided by SGML is essential for verifying the
- content of the texts, and greatly speeds all the processing performed on
- them. The fact that the texts exist in SGML ensures that they will be
- relatively easy to port to different hardware and software, and so will
- outlast the current delivery platform. Finally, the SGML markup
- incorporates existing canonical reference systems (chapter, verse, line,
- etc.); indexing and navigation are based on these features. This ensures
- that the same canonical reference will always resolve to the same point
- within a text, and that all versions of our texts, regardless of delivery
- platform (even paper printouts) will function the same way.
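-
- The role of canonical references described above can be illustrated with
- a minimal Python sketch. It is not the Perseus implementation: the sample
- markup, the element names (text, div, l), and the function names are
- assumptions made for illustration, and well-formed XML stands in for SGML
- so that the standard library parser can read it. The sketch builds an
- index keyed on canonical references and separately produces tag-free text
- of the kind a delivery system would display.
-
```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; well-formed XML is used as a stand-in for SGML.
SAMPLE = """
<text work="Example">
  <div n="1">
    <l n="1">First line of book one.</l>
    <l n="2">Second line of book one.</l>
  </div>
  <div n="2">
    <l n="1">First line of book two.</l>
  </div>
</text>
"""


def build_index(markup):
    """Map canonical references such as 'Example 1.2' to the line text."""
    root = ET.fromstring(markup)
    work = root.get("work")
    index = {}
    for div in root.iter("div"):
        for line in div.iter("l"):
            ref = f"{work} {div.get('n')}.{line.get('n')}"
            index[ref] = line.text
    return index


def strip_markup(markup):
    """Return the lines with all tags removed, as a delivery system might."""
    root = ET.fromstring(markup)
    return [line.text for line in root.iter("l")]


INDEX = build_index(SAMPLE)
DISPLAY_TEXT = strip_markup(SAMPLE)
```
-
- Because the index is derived from the markup rather than from any display
- format, the same reference resolves to the same line however the text is
- later delivered.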
-
- In order to provide tools for users, the text is processed by a
- morphological analyzer, and the results are stored in a database.
- Together with the index, the Greek-English Lexicon, and the index of all
- the English words in the definitions of the lexicon, the morphological
- analyses comprise a set of linguistic tools that allow users of all
- levels to work with the textual information, and to accomplish different
- tasks. For example, students who read no Greek may explore a concept as
- it appears in Greek texts by using the English-Greek index, and then
- looking up works in the texts and translations, or scholars may do
- detailed morphological studies of word use by using the morphological
- analyses of the texts. Because these tools were not designed for any one
- use, the same tools and the same data can be used by both students and
- scholars.
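-
- The index of English words in the lexicon definitions can be pictured as
- an inverted index. The Python sketch below is an illustration under
- stated assumptions, not the Perseus code: the three transliterated
- entries are a toy lexicon invented for the example, and the function
- name is hypothetical. It maps each English word in a definition back to
- the Greek headwords whose definitions contain it.
-
```python
import re
from collections import defaultdict

# Toy lexicon for illustration only: transliterated headword -> definition.
TOY_LEXICON = {
    "logos": "word, speech, reason",
    "mythos": "word, speech, tale, story",
    "polis": "city, state",
}


def build_english_index(lexicon):
    """Invert the lexicon: English word -> sorted list of Greek headwords."""
    index = defaultdict(set)
    for headword, definition in lexicon.items():
        for english in re.findall(r"[a-z]+", definition.lower()):
            index[english].add(headword)
    return {english: sorted(heads) for english, heads in index.items()}


ENGLISH_INDEX = build_english_index(TOY_LEXICON)
```
-
- A reader with no Greek could start from an English word such as "word",
- obtain the matching headwords, and then look those up in the texts and
- translations.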
-
- NOTES:
- (5) Perseus is based at Harvard University, with collaborators at
- several other universities. The project has been funded primarily
- by the Annenberg/CPB Project, as well as by Harvard University,
- Apple Computer, and others. It is published by Yale University
- Press. Perseus runs on Macintosh computers, under the HyperCard
- program.
-
- Eric CALALUCA
-
- Chadwyck-Healey embarked last year on two distinct yet related full-text
- humanities database projects.
-
- The English Poetry Full-Text Database and the Patrologia Latina Database
- represent new approaches to linguistic research resources. The size and
- complexity of the projects present problems for electronic publishers,
- but surmountable ones if they remain abreast of the latest possibilities
- in data capture and retrieval software techniques.
-
- The issues that had to be addressed before the projects began were
- legion:
-
- 1. Editorial selection (or exclusion) of materials in each
- database
-
- 2. Should a normative encoding structure be incorporated into the
- databases?
- A. If one is selected, should it be SGML?
- B. If SGML, then the TEI?
-
- 3. Deliver as CD-ROM, magnetic tape, or both?
-
- 4. Can one produce retrieval software advanced enough for the
- postdoctoral linguist, yet accessible enough for unattended
- general use? Should one try?
-
- 5. Re fair and liberal networking policies, what are the risks to
- an electronic publisher?
-
- 6. How does the emergence of national and international education
- networks affect the use and viability of research projects
- requiring high investment? Do the new European Community
- directives concerning database protection necessitate two
- distinct publishing projects, one for North America and one for
- overseas?
-
- From new notions of "scholarly fair use" to the future of optical media,
- virtually every issue related to electronic publishing was aired. The
- result is two projects which have been constructed to provide quality
- research resources with the fewest encumbrances to use by teachers and
- private scholars.
-
- Dorothy TWOHIG
-
- In spring 1988 the editors of the papers of George Washington, John
- Adams, Thomas Jefferson, James Madison, and Benjamin Franklin were
- approached by classics scholar David Packard on behalf of the Packard
- Humanities Institute with a proposal to produce a CD-ROM edition of the
- complete papers of each of the Founding Fathers. This electronic edition
- will supplement the published volumes, making the documents widely
- available to students and researchers at reasonable cost. We estimate
- that our CD-ROM edition of Washington's Papers will be substantially
- completed within the next two years and ready for publication. Within
- the next ten years or so, similar CD-ROM editions of the Franklin, Adams,
- Jefferson, and Madison papers also will be available. At the Library of
- Congress's session on technology, I would like to discuss not only the
- experience of the Washington Papers in producing the CD-ROM edition, but
- the impact technology has had on these major editorial projects.
- Already, we are editing our volumes with an eye to the material that will
- be readily available in the CD-ROM edition. The completed electronic
- edition will provide immense possibilities for the searching of documents
- for information in a way never possible before. The kind of technical
- innovations that are currently available and on the drawing board will
- soon revolutionize historical research and the production of historical
- documents. Unfortunately, much of this new technology is not being used
- in the planning stages of historical projects, simply because many
- historians are aware only in the vaguest way of its existence. At least
- two major new historical editing projects are considering microfilm
- editions, simply because they are not aware of the possibilities of
- electronic alternatives and the advantages of the new technology in terms
- of flexibility and research potential compared to microfilm. In fact,
- too many of us in history and literature are still at the stage of
- struggling with our PCs. There are many historical editorial projects in
- progress presently, and an equal number of literary projects. While the
- two fields have somewhat different approaches to textual editing, there
- are ways in which electronic technology can be of service to both.
-
- Since few of the editors involved in the Founding Fathers CD-ROM editions
- are technical experts in any sense, I hope to point out in my discussion
- of our experience how many of these electronic innovations can be used
- successfully by scholars who are novices in the world of new technology.
- One of the major concerns of the sponsors of the multitude of new
- scholarly editions is the limited audience reached by the published
- volumes. Most of these editions are being published in small quantities
- and the publishers' price for them puts them out of the reach not only of
- individual scholars but of most public libraries and all but the largest
- educational institutions. However, little attention is being given to
- ways in which technology can bypass conventional publication to make
- historical and literary documents more widely available.
-
- What attracted us most to the CD-ROM edition of The Papers of George
- Washington was the fact that David Packard's aim was to make a complete
- edition of all of the 135,000 documents we have collected available in an
- inexpensive format that would be placed in public libraries, small
- colleges, and even high schools. This would provide an audience far
- beyond our present 1,000-copy, $45 published edition. Since the CD-ROM
- edition will carry none of the explanatory annotation that appears in the
- published volumes, we also feel that the use of the CD-ROM will lead many
- researchers to seek out the published volumes.
-
- In addition to ignorance of new technical advances, I have found that too
- many editors--and historians and literary scholars--are resistant and
- even hostile to suggestions that electronic technology may enhance their
- work. I intend to discuss some of the arguments traditionalists are
- advancing to resist technology, ranging from distrust of the speed with
- which it changes (we are already wondering what is out there that is
- better than CD-ROM) to suspicion of the technical language used to
- describe electronic developments.
-
- Maria LEBRON
-
- The Online Journal of Current Clinical Trials, a joint venture of the
- American Association for the Advancement of Science (AAAS) and the Online
- Computer Library Center, Inc. (OCLC), is the first peer-reviewed journal
- to provide full text, tabular material, and line illustrations on line.
- This presentation will discuss the genesis and start-up period of the
- journal. Topics of discussion will include historical overview,
- day-to-day management of the editorial peer review, and manuscript
- tagging and publication. A demonstration of the journal and its features
- will accompany the presentation.
-
- Lynne PERSONIUS
-
- Cornell University Library, Cornell Information Technologies, and Xerox
- Corporation, with the support of the Commission on Preservation and
- Access, and Sun Microsystems, Inc., have been collaborating in a project
- to test a prototype system for recording brittle books as digital images
- and producing, on demand, high-quality archival paper replacements. The
- project goes beyond that, however, to investigate some of the issues
- surrounding scanning, storing, retrieving, and providing access to
- digital images in a network environment.
-
- The Joint Study in Digital Preservation began in January 1990. Xerox
- provided the College Library Access and Storage System (CLASS) software,
- a prototype 600-dots-per-inch (dpi) scanner, and the hardware necessary
- to support network printing on the DocuTech printer housed in Cornell's
- Computing and Communications Center (CCC).
-
- The Cornell staff using the hardware and software became an integral part
- of the development and testing process for enhancements to the CLASS
- software system. The collaborative nature of this relationship is
- resulting in a system that is specifically tailored to the preservation
- application.
-
- A digital library of 1,000 volumes (or approximately 300,000 images) has
- been created and is stored on an optical jukebox that resides in CCC.
- The library includes a collection of select mathematics monographs that
- provides mathematics faculty with an opportunity to use the electronic
- library. The remaining volumes were chosen for the library to test the
- various capabilities of the scanning system.
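-
- To give a rough sense of scale for a library of about 300,000 page
- images scanned at 600 dpi, the sketch below estimates uncompressed
- storage for one-bit-per-pixel scans. The 6-by-9-inch page size and the
- 15:1 compression ratio are illustrative assumptions, not figures from
- the project.
-
```python
# Rough storage estimate for bitonal (1 bit per pixel) page scans.
# Page dimensions and compression ratio are assumptions for illustration.

PAGE_WIDTH_IN = 6.0             # assumed page width in inches
PAGE_HEIGHT_IN = 9.0            # assumed page height in inches
ASSUMED_COMPRESSION_RATIO = 15  # hypothetical; real ratios vary by page


def bitonal_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size in bytes of a 1-bit-per-pixel scan."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8


if __name__ == "__main__":
    page = bitonal_page_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, 600)
    total = page * 300_000  # approximate image count given in the text
    print(f"per page, uncompressed: {page / 1e6:.2f} MB")
    print(f"library, uncompressed:  {total / 1e9:.0f} GB")
    print(f"library, compressed:    {total / ASSUMED_COMPRESSION_RATIO / 1e9:.1f} GB")
```
-
- Under these assumptions each page is about 2.4 MB before compression,
- which suggests why compression and jukebox storage matter at this scale.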
-
- One project objective is to provide users of the Cornell library and the
- library staff with the ability to request facsimiles of digitized images
- or to retrieve the actual electronic image for browsing. A prototype
- viewing workstation has been created by Xerox, with input into the design
- by a committee of Cornell librarians and computer professionals. This
- will allow us to experiment with patron access to the images that make up
- the digital library. The viewing station provides search, retrieval, and
- (ultimately) printing functions with enhancements to facilitate
- navigation through multiple documents.
-
- Cornell currently is working to extend access to the digital library to
- readers using workstations from their offices. This year is devoted to
- the development of a network-resident image conversion and delivery
- server, and client software that will support readers who use Apple
- Macintosh computers, IBM Windows platforms, and Sun workstations.
- Equipment for this development was provided by Sun Microsystems with
- support from the Commission on Preservation and Access.
-
- During the show-and-tell session of the Workshop on Electronic Texts, a
- prototype view station will be demonstrated. In addition, a display of
- original library books that have been digitized will be available for
- review with associated printed copies for comparison. The fifteen-minute
- overview of the project will include a slide presentation that
- constitutes a "tour" of the preservation digitizing process.
-
- The final network-connected version of the viewing station will provide
- library users with another mechanism for accessing the digital library,
- and will also provide the capability of viewing images directly. This
- will not require special software, although a powerful computer with good
- graphics will be needed.
-
- The Joint Study in Digital Preservation has generated a great deal of
- interest in the library community. Unfortunately, or perhaps
- fortunately, this project serves to raise a vast number of other issues
- surrounding the use of digital technology for the preservation and use of
- deteriorating library materials, which subsequent projects will need to
- examine. Much work remains.
-
- SESSION III
-
- Howard BESSER Networking Multimedia Databases
-
- What do we have to consider in building and distributing databases of
- visual materials in a multi-user environment? This presentation examines
- a variety of concerns that need to be addressed before a multimedia
- database can be set up in a networked environment.
-
- In the past it has not been feasible to implement databases of visual
- materials in shared-user environments because of technological barriers.
- Each of the two basic models for multi-user multimedia databases has
- posed its own problem. The analog multimedia storage model (represented
- by Project Athena's parallel analog and digital networks) has required an
- incredibly complex (and expensive) infrastructure. The economies of
- scale that make multi-user setups cheaper per user served do not operate
- in an environment that requires a computer workstation, videodisc player,
- and two display devices for each user.
-
- The digital multimedia storage model has required vast amounts of storage
- space (as much as one gigabyte per thirty still images). In the past the
- cost of such a large amount of storage space made this model a
- prohibitive choice as well. But plunging storage costs are finally
- making this second alternative viable.
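-
- The figure of one gigabyte per thirty still images can be turned into
- a per-image budget. The sketch below assumes a decimal gigabyte and
- uncompressed 24-bit color (three bytes per pixel); neither assumption
- is stated in the text.
-
```python
import math

GIGABYTE = 10**9  # decimal gigabyte; an assumption


def bytes_per_image(total_bytes: int, image_count: int) -> float:
    """Average storage available to each image."""
    return total_bytes / image_count


def square_side_pixels(image_bytes: float, bytes_per_pixel: int = 3) -> int:
    """Side length of the largest square uncompressed image that fits."""
    return math.isqrt(int(image_bytes // bytes_per_pixel))


if __name__ == "__main__":
    per_image = bytes_per_image(GIGABYTE, 30)
    print(f"{per_image / 1e6:.1f} MB per image")
    print(f"about {square_side_pixels(per_image)} pixels on a side at 24-bit color")
```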
-
- If storage no longer poses such an impediment, what do we need to
- consider in building digitally stored multi-user databases of visual
- materials? This presentation will examine the networking and
- telecommunication constraints that must be overcome before such databases
- can become commonplace and useful to a large number of people.
-
- The key problem is the vast size of multimedia documents, and how this
- affects not only storage but telecommunications transmission time.
- Anything slower than T-1 speed is impractical for files of 1 megabyte or
- larger (which is likely to be small for a multimedia document). For
- instance, even on a 56 Kb line it would take three minutes to transfer a
- 1-megabyte file. And these figures assume ideal circumstances, and do
- not take into consideration other users contending for network bandwidth,
- disk access time, or the time needed for remote display. Current common
- telephone transmission rates would be completely impractical; few users
- would be willing to wait the hour necessary to transmit a single image at
- 2400 baud.
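-
- The transfer times quoted above can be checked with simple arithmetic.
- The sketch below assumes that 1 megabyte is 2**20 bytes, that a
- 2400-baud modem carries 2,400 bits per second, and that there is no
- protocol overhead; these are idealizations, as the text itself notes.
-
```python
# Idealized transfer time: file size in bits divided by the line rate.
# Ignores protocol overhead, contention, and disk or display time.

LINE_RATES_BPS = {
    "2400 bps modem": 2_400,
    "56 kbps line": 56_000,
    "T-1 (1.544 Mbps)": 1_544_000,
}


def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Return the ideal time in seconds to send size_bytes at rate_bps."""
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    one_megabyte = 2**20  # assumed definition of a megabyte
    for name, rate in LINE_RATES_BPS.items():
        secs = transfer_seconds(one_megabyte, rate)
        print(f"{name:>18}: {secs:8.1f} s ({secs / 60:5.1f} min)")
```
-
- Under these assumptions a 1-megabyte file needs roughly 58 minutes at
- 2400 bps, about two and a half minutes at 56 kbps, and a few seconds at
- T-1 speed, in line with the orders of magnitude cited above.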
-
- This necessitates compression, which itself raises a number of other
- issues. In order to decrease file sizes significantly, we must employ
- lossy compression algorithms. But how much quality can we afford to
- lose? To date there has been only one significant study done of
- image-quality needs for a particular user group, and this study did not
- look at loss resulting from compression. Only after identifying
- image-quality needs can we begin to address storage and network bandwidth
- needs.
-
- Experience with X-Windows-based applications (such as Imagequery, the
- University of California at Berkeley image database) demonstrates the
- utility of a client-server topology, but also points to the limitation of
- current software for a distributed environment. For example,
- applications like Imagequery can incorporate compression, but current X
- implementations do not permit decompression at the end user's
- workstation. Such decompression at the host computer alleviates storage
- capacity problems while doing nothing to address problems of
- telecommunications bandwidth.
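-
- The difference between decompressing at the host and at the end user's
- workstation can be sketched by counting the bytes that cross the
- network in each case. This is only an illustration: it uses Python's
- lossless zlib module as a stand-in for whatever image compression a
- real system would use, and the function name is hypothetical.
-
```python
import zlib


def bytes_on_wire(image: bytes, decompress_at: str) -> int:
    """Return how many bytes cross the network for one image request.

    decompress_at is "host" (the server expands the image before sending)
    or "client" (the server sends the compressed form as stored).
    """
    stored = zlib.compress(image)  # what sits on the server's disk
    if decompress_at == "host":
        return len(zlib.decompress(stored))  # full-size image is sent
    if decompress_at == "client":
        return len(stored)  # compressed image is sent
    raise ValueError("decompress_at must be 'host' or 'client'")


if __name__ == "__main__":
    # A synthetic, highly repetitive "image" used purely for illustration.
    fake_image = (bytes([0]) * 900 + bytes([255]) * 100) * 1000
    print("stored on disk:     ", len(zlib.compress(fake_image)), "bytes")
    print("sent, host expands: ", bytes_on_wire(fake_image, "host"), "bytes")
    print("sent, client expands:", bytes_on_wire(fake_image, "client"), "bytes")
```
-
- In both cases the disk holds the compressed form, but only client-side
- decompression reduces what must travel over the network.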
-
- We need to examine the effects on network through-put of moving
- multimedia documents around on a network. We need to examine various
- topologies that will help us avoid bottlenecks around servers and
- gateways. Experience with applications such as these raises still broader
- questions. How closely is the multimedia document tied to the software
- for viewing it? Can it be accessed and viewed from other applications?
- Experience with the MARC format (and more recently with the Z39.50
- protocols) shows how useful it can be to store documents in a form in
- which they can be accessed by a variety of application software.
-
- Finally, from an intellectual-access standpoint, we need to address the
- issue of providing access to these multimedia documents in
- interdisciplinary environments. We need to examine terminology and
- indexing strategies that will allow us to provide access to this material
- in a cross-disciplinary way.
-
- Ronald LARSEN Directions in High-Performance Networking for
- Libraries
-
- The pace at which computing technology has advanced over the past forty
- years shows no sign of abating. Roughly speaking, each five-year period
- has yielded an order-of-magnitude improvement in price and performance of
- computing equipment. No fundamental hurdles are likely to prevent this
- pace from continuing for at least the next decade. It is only in the
- past five years, though, that computing has become ubiquitous in
- libraries, affecting all staff and patrons, directly or indirectly.
-
- During these same five years, communications rates on the Internet, the
- principal academic computing network, have grown from 56 kbps to 1.5
- Mbps, and the NSFNet backbone is now running 45 Mbps. Over the next five
- years, communication rates on the backbone are expected to exceed 1 Gbps.
- Both the population of network users and the volume of network traffic
- have continued to grow geometrically, at rates approaching 15
- percent per month. This flood of capacity and use, likened by some to
- "drinking from a firehose," creates immense opportunities and challenges
- for libraries. Libraries must anticipate the future implications of this
- technology, participate in its development, and deploy it to ensure
- access to the world's information resources.
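-
- To see what growth of 15 percent per month implies when compounded,
- the following sketch computes the equivalent annual growth factor and
- the doubling time. It assumes the rate holds steady, which is a
- simplification.
-
```python
import math


def annual_factor(monthly_rate: float) -> float:
    """Growth factor over twelve months of compounding."""
    return (1 + monthly_rate) ** 12


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed to double at a constant monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_rate)


if __name__ == "__main__":
    rate = 0.15  # the "approaching 15 percent per month" figure
    print(f"annual growth factor: {annual_factor(rate):.2f}x")
    print(f"doubling time: {doubling_time_months(rate):.1f} months")
```
-
- At a steady 15 percent per month, traffic would double in about five
- months and grow more than fivefold in a year.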
-
- The infrastructure for the information age is being put in place.
- Libraries face strategic decisions about their role in the development,
- deployment, and use of this infrastructure. The emerging infrastructure
- is much more than computers and communication lines. It is more than the
- ability to compute at a remote site, send electronic mail to a peer
- across the country, or move a file from one library to another. The next
- five years will witness substantial development of the information
- infrastructure of the network.
-
- In order to provide appropriate leadership, library professionals must
- have a fundamental understanding of and appreciation for computer
- networking, from local area networks to the National Research and
- Education Network (NREN). This presentation addresses these
- fundamentals, and how they relate to libraries today and in the near
- future.
-
- Edwin BROWNRIGG Electronic Library Visions and Realities
-
- The electronic library has been a vision desired by many--and rejected by
- some--since Vannevar Bush coined the term memex to describe an automated,
- intelligent, personal information system. Variations on this vision have
- included Ted Nelson's Xanadu, Alan Kay's Dynabook, and Lancaster's
- "paperless library," with the most recent incarnation being the
- "Knowledge Navigator" described by John Sculley of Apple. But the reality
- of library service has been less visionary and the leap to the electronic
- library has eluded universities, publishers, and information technology
- firms.
-
- The Memex Research Institute (MemRI), an independent, nonprofit research
- and development organization, has created an Electronic Library Program
- of shared research and development in order to make the collective vision
- more concrete. The program is working toward the creation of large,
- indexed, publicly available electronic image collections of published
- documents in academic, special, and public libraries. This strategic
- plan is the result of the first stage of the program, which has been an
- investigation of the information technologies available to support such
- an effort, the economic parameters of electronic service compared to
- traditional library operations, and the business and political factors
- affecting the shift from print distribution to electronic networked
- access.
-
- The strategic plan envisions a combination of publicly searchable access
- databases, image (and text) document collections stored on network "file
- servers," local and remote network access, and an intellectual property
- management-control system. This combination of technology and
- information content is defined in this plan as an E-library or E-library
- collection. Some participating sponsors are already developing projects
- based on MemRI's recommended directions.
-
- The E-library strategy projected in this plan is a visionary one that can
- enable major changes and improvements in academic, public, and special
- library service. This vision is, though, one that can be realized with
- today's technology. At the same time, it will challenge the political
- and social structure within which libraries operate: in academic
- libraries, the traditional emphasis on local collections, extending to
- accreditation issues; in public libraries, the potential of electronic
- branch and central libraries fully available to the public; and for
- special libraries, new opportunities for shared collections and networks.
-
- The environment in which this strategic plan has been developed is, at
- the moment, dominated by a sense of library limits. The continued
- expansion and rapid growth of local academic library collections is now
- clearly at an end. Corporate libraries, and even law libraries, are
- faced with operating within a difficult economic climate, as well as with
- very active competition from commercial information sources. For
- example, public libraries may be seen as a desirable but not critical
- municipal service in a time when the budgets of safety and health
- agencies are being cut back.
-
- Further, libraries in general have a very high labor-to-cost ratio in
- their budgets, and labor costs are still increasing, notwithstanding
- automation investments. It is difficult for libraries to obtain capital,
- startup, or seed funding for innovative activities, and those
- technology-intensive initiatives that offer the potential of decreased
- labor costs can provoke the opposition of library staff.
-
- However, libraries have achieved some considerable successes in the past
- two decades by improving both their service and their credibility within
- their organizations--and these positive changes have been accomplished
- mostly with judicious use of information technologies. The advances in
- computing and information technology have been well-chronicled: the
- continuing precipitous drop in computing costs, the growth of the
- Internet and private networks, and the explosive increase in publicly
- available information databases.
-
- For example, OCLC has become one of the largest computer network
- organizations in the world by creating a cooperative cataloging network
- of more than 6,000 libraries worldwide. On-line public access catalogs
- now serve millions of users on more than 50,000 dedicated terminals in
- the United States alone. The University of California MELVYL on-line
- catalog system has now expanded into an index database reference service
- and supports more than six million searches a year. And libraries have
- become the largest group of customers of CD-ROM publishing technology;
- more than 30,000 optical media publications such as those offered by
- InfoTrac and Silver Platter are subscribed to by U.S. libraries.
-
- This march of technology continues and in the next decade will result in
- further innovations that are extremely difficult to predict. What is
- clear is that libraries can now go beyond automation of their order files
- and catalogs to automation of their collections themselves--and it is
- possible to circumvent the fiscal limitations that appear to obtain
- today.
-
- This Electronic Library Strategic Plan recommends a paradigm shift in
- library service, and demonstrates the steps necessary to provide improved
- library services with limited capacities and operating investments.
-
- SESSION IV-A
-
- Anne KENNEY
-
- The Cornell/Xerox Joint Study in Digital Preservation resulted in the
- recording of 1,000 brittle books as 600-dpi digital images and the
- production, on demand, of high-quality and archivally sound paper
- replacements. The project, which was supported by the Commission on
- Preservation and Access, also investigated some of the issues surrounding
- scanning, storing, retrieving, and providing access to digital images in
- a network environment.
-
- Anne Kenney will focus on some of the issues surrounding direct scanning
- as identified in the Cornell Xerox Project. Among those to be discussed
- are: image versus text capture; indexing and access; image-capture
- capabilities; a comparison to photocopy and microfilm; production and
- cost analysis; storage formats, protocols, and standards; and the use of
- this scanning technology for preservation purposes.
-
- The 600-dpi digital images produced in the Cornell Xerox Project proved
- highly acceptable for creating paper replacements of deteriorating
- originals. The 1,000 scanned volumes provided an array of image-capture
- challenges that are common to nineteenth-century printing techniques and
- embrittled material, and that defy the use of text-conversion processes.
- These challenges include diminished contrast between text and background,
- fragile and deteriorated pages, uneven printing, elaborate type faces,
- faint and bold text adjacency, handwritten text and annotations, non-Roman
- languages, and a proliferation of illustrated material embedded in text.
- The latter category included high-frequency and low-frequency halftones,
- continuous tone photographs, intricate mathematical drawings, maps,
- etchings, reverse-polarity drawings, and engravings.
-
- The Xerox prototype scanning system provided a number of important
- features for capturing this diverse material. Technicians used multiple
- threshold settings, filters, line art and halftone definitions,
- autosegmentation, windowing, and software-editing programs to optimize
- image capture. At the same time, this project focused on production.
- The goal was to make scanning as affordable and acceptable as
- photocopying and microfilming for preservation reformatting. A
- time-and-cost study conducted during the last three months of this
- project confirmed the economic viability of digital scanning, and these
- findings will be discussed here.
-
- From the outset, the Cornell Xerox Project was predicated on the use of
- nonproprietary standards and the use of common protocols when standards
- did not exist. Digital files were created as TIFF images which were
- compressed prior to storage using Group 4 CCITT compression. The Xerox
- software is MS-DOS-based and utilizes off-the-shelf programs such as
- Microsoft Windows and Wang Image Wizard. The digital library is designed
- to be hardware-independent and to provide interchangeability with other
- institutions through network connections. Access to the digital files
- themselves is two-tiered: Bibliographic records for the computer files
- are created in RLIN and Cornell's local system and access into the actual
- digital images comprising a book is provided through a document control
- structure and a networked image file-server, both of which will be
- described.
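-
- The two-tiered access described above might be modeled, very roughly,
- as a catalog lookup followed by a page lookup in a per-book control
- structure. The sketch below is not the project's actual design: the
- class names, identifier scheme, file names, and book title are all
- hypothetical.
-
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BibliographicRecord:
    """Tier 1: a catalog entry describing the digitized book."""
    record_id: str  # hypothetical identifier shared by both tiers
    title: str


@dataclass
class DocumentControl:
    """Tier 2: the ordered page-image files that make up one book."""
    record_id: str
    page_files: List[str] = field(default_factory=list)

    def page(self, sequence: int) -> str:
        """Return the file name for a 1-based page sequence number."""
        if not 1 <= sequence <= len(self.page_files):
            raise IndexError(f"no page {sequence} in {self.record_id}")
        return self.page_files[sequence - 1]


def locate_page(catalog: List[BibliographicRecord],
                structures: Dict[str, DocumentControl],
                title_word: str, sequence: int) -> str:
    """Search the catalog by title word, then resolve a page via tier 2."""
    for record in catalog:
        if title_word.lower() in record.title.lower():
            return structures[record.record_id].page(sequence)
    raise LookupError(f"no record matching {title_word!r}")


if __name__ == "__main__":
    catalog = [BibliographicRecord("bk0001", "A Treatise on Plane Geometry")]
    structures = {
        "bk0001": DocumentControl(
            "bk0001", ["bk0001/0001.tif", "bk0001/0002.tif"]
        )
    }
    print(locate_page(catalog, structures, "geometry", 2))
```
-
- Keeping the bibliographic record separate from the page-level
- structure lets the catalog entry live in a system such as RLIN while
- the images themselves stay on a file server.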
-
- The presentation will conclude with a discussion of some of the issues
- surrounding the use of this technology as a preservation tool (storage,
- refreshing, backup).
-
- Pamela ANDRE and Judith ZIDAR
-
- The National Agricultural Library (NAL) has had extensive experience with
- raster scanning of printed materials. Since 1987, the Library has
- participated in the National Agricultural Text Digitizing Project (NATDP),
- a cooperative effort between NAL and forty-five land grant university
- libraries. An overview of the project will be presented, giving its
- history and NAL's strategy for the future.
-
- An in-depth discussion of NATDP will follow, including a description of
- the scanning process, from the gathering of the printed materials to the
- archiving of the electronic pages. The type of equipment required for a
- stand-alone scanning workstation and the importance of file management
- software will be discussed. Issues concerning the images themselves will
- be addressed briefly, such as image format; black and white versus color;
- gray scale versus dithering; and resolution.
-
- Also described will be a study currently in progress by NAL to evaluate
- the usefulness of converting microfilm to electronic images in order to
- improve access. With the cooperation of Tuskegee University, NAL has
- selected three reels of microfilm from a collection of sixty-seven reels
- containing the papers, letters, and drawings of George Washington Carver.
- The three reels were converted into 3,500 electronic images using a
- specialized microfilm scanner. The selection, filming, and indexing of
- this material will be discussed.
-
- Donald WATERS
-
- Project Open Book, the Yale University Library's effort to convert
- 10,000 books from microfilm to digital imagery, is currently in an advanced
- state of planning and organization. The Yale Library has selected a
- major vendor to serve as a partner in the project and as systems
- integrator. In its proposal, the successful vendor helped isolate areas
- of risk and uncertainty as well as key issues to be addressed during the
- life of the project. The Yale Library is now poised to decide what
- material it will convert to digital image form and to seek funding,
- initially for the first phase and then for the entire project.
-
- The proposal that Yale accepted for the implementation of Project Open
- Book will provide at the end of three phases a conversion subsystem,
- browsing stations distributed on the campus network within the Yale
- Library, a subsystem for storing 10,000 books at 200 and 600 dots per
- inch, and network access to the image printers. Pricing for the system
- implementation assumes the existence of Yale's campus ethernet network
- and its high-speed image printers, and includes other requisite hardware
- and software, as well as system integration services. Proposed operating
- costs include hardware and software maintenance, but do not include
- estimates for the facilities management of the storage devices and image
- servers.
-
- Yale selected its vendor partner in a formal process, partly funded by
- the Commission on Preservation and Access. Following a request for
- proposal, the Yale Library selected two vendors as finalists to work with
- Yale staff to generate a detailed analysis of requirements for Project
- Open Book. Each vendor used the results of the requirements analysis to
- generate and submit a formal proposal for the entire project. This
- competitive process not only enabled the Yale Library to select its
- primary vendor partner but also revealed much about the state of the
- imaging industry, about the varying corporate commitments to the markets
- for imaging technology, and about the varying organizational dynamics
- through which major companies are responding to and seeking to develop
- these markets.
-
- Project Open Book is focused specifically on the conversion of images
- from microfilm to digital form. The technology for scanning microfilm is
- readily available but is changing rapidly. In its project requirements,
- the Yale Library emphasized features of the technology that affect the
- technical quality of digital image production and the costs of creating
- and storing the image library: What levels of digital resolution can be
- achieved by scanning microfilm? How does variation in the quality of
- microfilm, particularly in film produced to preservation standards,
- affect the quality of the digital images? What technologies can an
- operator effectively and economically apply when scanning film to
- separate two-up images and to control for and correct image
- imperfections? How can quality control best be integrated into
- digitizing work flow that includes document indexing and storage?
-
- The actual and expected uses of digital images--storage, browsing,
- printing, and OCR--help determine the standards for measuring their
- quality. Browsing is especially important, but the facilities available
- for readers to browse image documents are perhaps the weakest aspect of
- imaging technology and most in need of development. As it defined its
- requirements, the Yale Library concentrated on some fundamental aspects
- of usability for image documents: Does the system have sufficient
- flexibility to handle the full range of document types, including
- monographs, multi-part and multivolume sets, and serials, as well as
- manuscript collections? What conventions are necessary to identify a
- document uniquely for storage and retrieval? Where is the database of
- record for storing bibliographic information about the image document?
- How are basic internal structures of documents, such as pagination, made
- accessible to the reader? How are the image documents physically
- presented on the screen to the reader?
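-
- One of these questions, how pagination is made accessible to the
- reader, can be illustrated with a small sketch that maps printed page
- labels to image sequence numbers. It assumes a book whose front matter
- is numbered in lowercase roman numerals and whose body is numbered in
- arabic numerals; the function names are hypothetical and real books
- are often less regular.
-
```python
from typing import List


def to_roman(n: int) -> str:
    """Lowercase roman numeral for a positive integer."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_labels(front_matter: int, body: int) -> List[str]:
    """Printed labels in image order: roman front matter, then arabic body."""
    labels = [to_roman(i) for i in range(1, front_matter + 1)]
    labels += [str(i) for i in range(1, body + 1)]
    return labels


def image_for_label(labels: List[str], label: str) -> int:
    """Return the 1-based image sequence number showing the printed label."""
    return labels.index(label) + 1


if __name__ == "__main__":
    labels = page_labels(front_matter=4, body=10)
    print(labels)
    print("printed page 7 is image", image_for_label(labels, "7"))
```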
-
- The Yale Library designed Project Open Book on the assumption that
- microfilm is more than adequate as a medium for preserving the content of
- deteriorated library materials. As planning in the project has advanced,
- it is increasingly clear that the challenge of digital image technology
- and the key to the success of efforts like Project Open Book is to
- provide a means of both preserving and improving access to those
- deteriorated materials.
-
- SESSION IV-B
-
- George THOMA
-
- In the use of electronic imaging for document preservation, there are
- several issues to consider, such as: ensuring adequate image quality,
- maintaining substantial conversion rates (through-put), providing unique
- identification for automated access and retrieval, and accommodating
- bound volumes and fragile material.
-
- To maintain high image quality, image processing functions are required
- to correct the deficiencies in the scanned image. Some commercially
- available systems include these functions, while some do not. The
- scanned raw image must be processed to correct contrast deficiencies--
- both poor overall contrast resulting from light print and/or dark
- background, and variable contrast resulting from stains and
- bleed-through. Furthermore, the scan density must be adequate to allow
- legibility of print and sufficient fidelity in the pseudo-halftoned gray
- material. Borders or page-edge effects must be removed for both
- compactibility and aesthetics. Page skew must be corrected for aesthetic
- reasons and to enable accurate character recognition if desired.
- Compound images consisting of both two-toned text and gray-scale
- illustrations must be processed appropriately to retain the quality of
- each.
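-
- A few of the corrections listed above can be sketched on a tiny
- gray-scale image held as rows of 0-255 values. This is a simplified
- illustration, not a description of any commercial system: it shows a
- linear contrast stretch, a fixed-cutoff conversion to two tones, and
- removal of an all-dark border, and it omits skew correction entirely.
-
```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 gray values, 0 = black


def stretch_contrast(img: Gray) -> Gray:
    """Linearly rescale so the darkest value becomes 0 and the lightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def threshold(img: Gray, cutoff: int = 128) -> Gray:
    """Convert to two tones: 0 (ink) below the cutoff, 255 (paper) otherwise."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]


def crop_dark_border(img: Gray) -> Gray:
    """Trim outer rows and columns that contain nothing but ink."""
    rows = [r for r in range(len(img)) if any(p != 0 for p in img[r])]
    cols = [c for c in range(len(img[0])) if any(row[c] != 0 for row in img)]
    if not rows or not cols:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


if __name__ == "__main__":
    # Low-contrast page: dark border (90), paper (120), one ink pixel (100).
    raw = [
        [90, 90, 90, 90, 90],
        [90, 120, 100, 120, 90],
        [90, 120, 120, 120, 90],
        [90, 90, 90, 90, 90],
    ]
    cleaned = crop_dark_border(threshold(stretch_contrast(raw)))
    for row in cleaned:
        print(row)
```
-
- Without the contrast stretch, every pixel in this sample falls below
- the cutoff and the page comes out entirely black, which is why
- contrast correction precedes the two-tone conversion here.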
-
- SESSION IV-C
-
- Jean BARONAS
-
- Standards publications being developed by scientists, engineers, and
- business managers in Association for Information and Image Management
- (AIIM) standards committees can be applied to electronic image management
- (EIM) processes including: document (image) transfer, retrieval and
- evaluation; optical disk and document scanning; and document design and
- conversion. When combined with EIM system planning and operations,
- standards can assist in generating image databases that are
- interchangeable among a variety of systems. The applications of
- different approaches for image-tagging, indexing, compression, and
- transfer often cause uncertainty concerning EIM system compatibility,
- calibration, performance, and upward compatibility, until standard
- implementation parameters are established. The AIIM standards that are
- being developed for these applications can be used to decrease the
- uncertainty, successfully integrate imaging processes, and promote "open
- systems." AIIM is an accredited American National Standards Institute
- (ANSI) standards developer with more than twenty committees comprising
- 300 volunteers representing users, vendors, and manufacturers. The
- standards publications that are developed in these committees have
- national acceptance and provide the basis for international harmonization
- in the development of new International Organization for Standardization
- (ISO) standards.
-
- This presentation describes the development of AIIM's EIM standards and a
- new effort at AIIM, a database on standards projects in a wide framework
- of imaging industries including capture, recording, processing,
- duplication, distribution, display, evaluation, and preservation. The
- AIIM Imagery Database will cover imaging standards being developed by
- many organizations in many different countries. It will contain
- standards publications' dates, origins, related national and
- international projects, status, key words, and abstracts. The ANSI Image
- Technology Standards Board requested that such a database be established,
- as did the ISO/International Electrotechnical Commission Joint Task Force
- on Imagery. AIIM will take on the leadership role for the database and
- coordinate its development with several standards developers.
-
- Patricia BATTIN
-
- Characteristics of standards for digital imagery:
-
- * Nature of digital technology implies continuing volatility.
-
- * Precipitous standard-setting not possible and probably not
- desirable.
-
- * Standards are a complex issue involving the medium, the
- hardware, the software, and the technical capacity for
- reproductive fidelity and clarity.
-
- * The prognosis for reliable archival standards (as defined by
- librarians) in the foreseeable future is poor.
-
- Significant potential and attractiveness of digital technology as a
- preservation medium and access mechanism.
-
- Productive use of digital imagery for preservation requires a
- reconceptualizing of preservation principles in a volatile,
- standardless world.
-
- Concept of managing continuing access in the digital environment
- rather than focusing on the permanence of the medium and long-term
- archival standards developed for the analog world.
-
- Transition period: How long and what to do?
-
- * Redefine "archival."
-
- * Remove the burden of "archival copy" from paper artifacts.
-
- * Use digital technology for storage, develop management
- strategies for refreshing medium, hardware and software.
-
- * Create acid-free paper copies for transition period backup
- until we develop reliable procedures for ensuring continuing
- access to digital files.
-
- SESSION IV-D
-
- Stuart WEIBEL The Role of SGML Markup in the CORE Project (6)
-
- The emergence of high-speed telecommunications networks as a basic
- feature of the scholarly workplace is driving the demand for electronic
- document delivery. Three distinct categories of electronic
- publishing/republishing are necessary to support access demands in this
- emerging environment:
-
- 1.) Conversion of paper or microfilm archives to electronic format
- 2.) Conversion of electronic files to formats tailored to
- electronic retrieval and display
- 3.) Primary electronic publishing (materials for which the
- electronic version is the primary format)
-
- OCLC has experimental or product development activities in each of these
- areas. Among the challenges that lie ahead is the integration of these
- three types of information stores in coherent distributed systems.
-
- The CORE (Chemistry Online Retrieval Experiment) Project is a model for
- the conversion of large text and graphics collections for which
- electronic typesetting files are available (category 2). The American
- Chemical Society has made available computer typography files dating from
- 1980 for its twenty journals. This collection of some 250 journal-years
- is being converted to an electronic format that will be accessible
- through several end-user applications.
-
- The use of Standard Generalized Markup Language (SGML) offers the means
- to capture the structural richness of the original articles in a way that
- will support a variety of retrieval, navigation, and display options
- necessary to work effectively with very large text databases.
-
- An SGML document consists of text that is marked up with descriptive tags
- that specify the function of a given element within the document. As a
- formal language construct, an SGML document can be parsed against a
- document-type definition (DTD) that unambiguously defines what elements
- are allowed and where in the document they can (or must) occur. This
- formalized map of article structure allows the user interface design to
- be uncoupled from the underlying database system, an important step
- toward interoperability. Demonstration of this separability is a part of
- the CORE project, wherein user interface designs born of very different
- philosophies will access the same database.
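-
- The idea of parsing a document against a DTD can be illustrated with a
- minimal sketch in Python. It is not SGML and not the CORE project's
- software: the Element class, the TOY_DTD table, the element names, and
- the validate function are all hypothetical, and each content model is
- written as a regular expression over child element names rather than in
- real DTD syntax. The sketch only shows how a formal definition can state
- which elements are allowed and where they can (or must) occur.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Element:
    """One tagged element: a name plus child elements (text is ignored)."""
    name: str
    children: list = field(default_factory=list)


# Hypothetical, drastically simplified stand-in for a DTD. Each element
# name maps to a regular expression over the names of its children, where
# every child name is followed by a single space.
TOY_DTD = {
    "article":  r"title (author )+(abstract )?(section )+",
    "section":  r"heading (para |figure )+",
    "abstract": r"(para )+",
    "title":    r"",
    "author":   r"",
    "heading":  r"",
    "para":     r"",
    "figure":   r"",
}


def validate(elem, dtd=TOY_DTD, path=""):
    """Return a list of error strings; an empty list means the tree conforms."""
    here = f"{path}/{elem.name}"
    if elem.name not in dtd:
        return [f"{here}: element not declared"]
    errors = []
    child_names = "".join(child.name + " " for child in elem.children)
    if re.fullmatch(dtd[elem.name], child_names) is None:
        errors.append(f"{here}: children [{child_names.strip()}] not allowed")
    for child in elem.children:
        errors.extend(validate(child, dtd, here))
    return errors


# An article that follows the toy definition.
good = Element("article", [
    Element("title"),
    Element("author"),
    Element("section", [Element("heading"), Element("para")]),
])

# An article with no title and a section with no heading.
bad = Element("article", [
    Element("author"),
    Element("section", [Element("para")]),
])

if __name__ == "__main__":
    print(validate(good))
    for message in validate(bad):
        print(message)
```

- Because the check depends only on the element tree and the definition,
- any number of display or retrieval programs could rely on the same
- guarantee about article structure without knowing how the articles are
- stored, which is the kind of separation the paragraph above describes.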
-
- NOTES:
- (6) The CORE project is a collaboration among Cornell University's
- Mann Library, Bell Communications Research (Bellcore), the American
- Chemical Society (ACS), the Chemical Abstracts Service (CAS), and
- OCLC.
-
- Michael LESK The CORE Electronic Chemistry Library
-
- A major on-line file of chemical journal literature complete with
- graphics is being developed to test the usability of fully electronic
- access to documents, as a joint project of Cornell University, the
- American Chemical Society, the Chemical Abstracts Service, OCLC, and
- Bellcore (with additional support from Sun Microsystems, Springer-Verlag,
- Digital Equipment Corporation, Sony Corporation of America, and Apple
- Computers). Our file contains the American Chemical Society's on-line
- journals, supplemented with the graphics from the paper publication and
- the indexing of the articles from Chemical Abstracts. Documents are
- available in both image and text format, and several different interfaces
- can be used. Our goals are (1) to assess the effectiveness and
- acceptability of electronic access to primary journals as compared with
- paper, and (2) to identify the most desirable functions of the user
- interface to an electronic system of journals, including in particular a
- comparison of page-image display with ASCII display interfaces. Early
- experiments with chemistry students on a variety of tasks suggest that
- searching tasks are completed much faster with any electronic system than
- with paper, but that, for reading, all versions of the articles are
- roughly equivalent.
-
- Pamela ANDRE and Judith ZIDAR
-
- Text conversion is far more expensive and time-consuming than image
- capture alone. NAL's experience with optical character recognition (OCR)
- will be related and compared with the experience of having text rekeyed.
- What factors affect OCR accuracy? How accurate does full text have to be
- in order to be useful? How do different users react to imperfect text?
- These are questions that will be explored. For many, a service bureau
- may be a better solution than performing the work in-house; this will also
- be discussed.
-
- SESSION VI
-
- Marybeth PETERS
-
- Copyright law protects creative works. Protection granted by the law to
- authors and disseminators of works includes the right to do or authorize
- the following: reproduce the work, prepare derivative works, distribute
- the work to the public, and publicly perform or display the work. In
- addition, copyright owners of sound recordings and computer programs have
- the right to control rental of their works. These rights are not
- unlimited; there are a number of exceptions and limitations.
-
- An electronic environment places strains on the copyright system.
- Copyright owners want to control uses of their work and be paid for any
- use; the public wants quick and easy access at little or no cost. The
- marketplace is working in this area. Contracts, guidelines on electronic
- use, and collective licensing are in use and being refined.
-
- Issues concerning the ability to change works without detection are more
- difficult to deal with. Questions concerning the integrity of the work
- and the status of the changed version under the copyright law are to be
- addressed. These are public policy issues which require informed
- dialogue.
-
-
- *** *** *** ****** *** *** ***
-
-
- Appendix III: DIRECTORY OF PARTICIPANTS
-
-
- PRESENTERS:
-
- Pamela Q.J. Andre
- Associate Director, Automation
- National Agricultural Library
- 10301 Baltimore Boulevard
- Beltsville, MD 20705-2351
- Phone: (301) 504-6813
- Fax: (301) 504-7473
- E-mail: INTERNET: PANDRE@ASRR.ARSUSDA.GOV
-
- Jean Baronas, Senior Manager
- Department of Standards and Technology
- Association for Information and Image Management (AIIM)
- 1100 Wayne Avenue, Suite 1100
- Silver Spring, MD 20910
- Phone: (301) 587-8202
- Fax: (301) 587-2711
-
- Patricia Battin, President
- The Commission on Preservation and Access
- 1400 16th Street, N.W.
- Suite 740
- Washington, DC 20036-2217
- Phone: (202) 939-3400
- Fax: (202) 939-3407
- E-mail: CPA@GWUVM.BITNET
-
- Howard Besser
- Centre Canadien d'Architecture
- (Canadian Center for Architecture)
- 1920, rue Baile
- Montreal, Quebec H3H 2S6
- CANADA
- Phone: (514) 939-7001
- Fax: (514) 939-7020
- E-mail: howard@lis.pitt.edu
-
- Edwin B. Brownrigg, Executive Director
- Memex Research Institute
- 422 Bonita Avenue
- Roseville, CA 95678
- Phone: (916) 784-2298
- Fax: (916) 786-7559
- E-mail: BITNET: MEMEX@CALSTATE.2
-
- Eric M. Calaluca, Vice President
- Chadwyck-Healey, Inc.
- 1101 King Street
- Alexandria, VA 22314
- Phone: (800) 752-0515
- Fax: (703) 683-7589
-
- James Daly
- 4015 Deepwood Road
- Baltimore, MD 21218-1404
- Phone: (410) 235-0763
-
- Ricky Erway, Associate Coordinator
- American Memory
- Library of Congress
- Phone: (202) 707-6233
- Fax: (202) 707-3764
-
- Carl Fleischhauer, Coordinator
- American Memory
- Library of Congress
- Phone: (202) 707-6233
- Fax: (202) 707-3764
-
- Joanne Freeman
- 2000 Jefferson Park Avenue, No. 7
- Charlottesville, VA 22903
-
- Prosser Gifford
- Director for Scholarly Programs
- Library of Congress
- Phone: (202) 707-1517
- Fax: (202) 707-9898
- E-mail: pgif@seq1.loc.gov
-
- Jacqueline Hess, Director
- National Demonstration Laboratory
- for Interactive Information Technologies
- Library of Congress
- Phone: (202) 707-4157
- Fax: (202) 707-2829
-
- Susan Hockey, Director
- Center for Electronic Texts in the Humanities (CETH)
- Alexander Library
- Rutgers University
- 169 College Avenue
- New Brunswick, NJ 08903
- Phone: (908) 932-1384
- Fax: (908) 932-1386
- E-mail: hockey@zodiac.rutgers.edu
-
- William L. Hooton, Vice President
- Business & Technical Development
- Imaging & Information Systems Group
- I-NET
- 6430 Rockledge Drive, Suite 400
- Bethesda, MD 20817
- Phone: (301) 564-6750
- Fax: (513) 564-6867
-
- Anne R. Kenney, Associate Director
- Department of Preservation and Conservation
- 701 Olin Library
- Cornell University
- Ithaca, NY 14853
- Phone: (607) 255-6875
- Fax: (607) 255-9346
- E-mail: LYDY@CORNELLA.BITNET
-
- Ronald L. Larsen
- Associate Director for Information Technology
- University of Maryland at College Park
- Room B0224, McKeldin Library
- College Park, MD 20742-7011
- Phone: (301) 405-9194
- Fax: (301) 314-9865
- E-mail: rlarsen@libr.umd.edu
-
- Maria L. Lebron, Managing Editor
- The Online Journal of Current Clinical Trials
- 1333 H Street, N.W.
- Washington, DC 20005
- Phone: (202) 326-6735
- Fax: (202) 842-2868
- E-mail: PUBSAAAS@GWUVM.BITNET
-
- Michael Lesk, Executive Director
- Computer Science Research
- Bell Communications Research, Inc.
- Rm 2A-385
- 445 South Street
- Morristown, NJ 07960-1910
- Phone: (201) 829-4070
- Fax: (201) 829-5981
- E-mail: lesk@bellcore.com (Internet) or bellcore!lesk (uucp)
-
- Clifford A. Lynch
- Director, Library Automation
- University of California,
- Office of the President
- 300 Lakeside Drive, 8th Floor
- Oakland, CA 94612-3350
- Phone: (510) 987-0522
- Fax: (510) 839-3573
- E-mail: calur@uccmvsa
-
- Avra Michelson
- National Archives and Records Administration
- NSZ Rm. 14N
- 7th & Pennsylvania, N.W.
- Washington, D.C. 20408
- Phone: (202) 501-5544
- Fax: (202) 501-5533
- E-mail: tmi@cu.nih.gov
-
- Elli Mylonas, Managing Editor
- Perseus Project
- Department of the Classics
- Harvard University
- 319 Boylston Hall
- Cambridge, MA 02138
- Phone: (617) 495-9025, (617) 495-0456 (direct)
- Fax: (617) 496-8886
- E-mail: Elli@IKAROS.Harvard.EDU or elli@wjh12.harvard.edu
-
- David Woodley Packard
- Packard Humanities Institute
- 300 Second Street, Suite 201
- Los Altos, CA 94002
- Phone: (415) 948-0150 (PHI)
- Fax: (415) 948-5793
-
- Lynne K. Personius, Assistant Director
- Cornell Information Technologies for
- Scholarly Information Sources
- 502 Olin Library
- Cornell University
- Ithaca, NY 14853
- Phone: (607) 255-3393
- Fax: (607) 255-9346
- E-mail: JRN@CORNELLC.BITNET
-
- Marybeth Peters
- Policy Planning Adviser to the
- Register of Copyrights
- Library of Congress
- Office LM 403
- Phone: (202) 707-8350
- Fax: (202) 707-8366
-
- C. Michael Sperberg-McQueen
- Editor, Text Encoding Initiative
- Computer Center (M/C 135)
- University of Illinois at Chicago
- Box 6998
- Chicago, IL 60680
- Phone: (312) 413-0317
- Fax: (312) 996-6834
- E-mail: u35395@uicvm.cc.uic.edu or u35395@uicvm.bitnet
-
- George R. Thoma, Chief
- Communications Engineering Branch
- National Library of Medicine
- 8600 Rockville Pike
- Bethesda, MD 20894
- Phone: (301) 496-4496
- Fax: (301) 402-0341
- E-mail: thoma@lhc.nlm.nih.gov
-
- Dorothy Twohig, Editor
- The Papers of George Washington
- 504 Alderman Library
- University of Virginia
- Charlottesville, VA 22903-2498
- Phone: (804) 924-0523
- Fax: (804) 924-4337
-
- Susan H. Veccia, Team Leader
- American Memory, User Evaluation
- Library of Congress
- American Memory Evaluation Project
- Phone: (202) 707-9104
- Fax: (202) 707-3764
- E-mail: svec@seq1.loc.gov
-
- Donald J. Waters, Head
- Systems Office
- Yale University Library
- New Haven, CT 06520
- Phone: (203) 432-4889
- Fax: (203) 432-7231
- E-mail: DWATERS@YALEVM.BITNET or DWATERS@YALEVM.YCC.YALE.EDU
-
- Stuart Weibel, Senior Research Scientist
- OCLC
- 6565 Frantz Road
- Dublin, OH 43017
- Phone: (614) 764-6081
- Fax: (614) 764-2344
- E-mail: INTERNET: Stu@rsch.oclc.org
-
- Robert G. Zich
- Special Assistant to the Associate Librarian
- for Special Projects
- Library of Congress
- Phone: (202) 707-6233
- Fax: (202) 707-3764
- E-mail: rzic@seq1.loc.gov
-
- Judith A. Zidar, Coordinator
- National Agricultural Text Digitizing Program
- Information Systems Division
- National Agricultural Library
- 10301 Baltimore Boulevard
- Beltsville, MD 20705-2351
- Phone: (301) 504-6813 or 504-5853
- Fax: (301) 504-7473
- E-mail: INTERNET: JZIDAR@ASRR.ARSUSDA.GOV
-
-
- OBSERVERS:
-
- Helen Aguera, Program Officer
- Division of Research
- Room 318
- National Endowment for the Humanities
- 1100 Pennsylvania Avenue, N.W.
- Washington, D.C. 20506
- Phone: (202) 786-0358
- Fax: (202) 786-0243
-
- M. Ellyn Blanton, Deputy Director
- National Demonstration Laboratory
- for Interactive Information Technologies
- Library of Congress
- Phone: (202) 707-4157
- Fax: (202) 707-2829
-
- Charles M. Dollar
- National Archives and Records Administration
- NSZ Rm. 14N
- 7th & Pennsylvania, N.W.
- Washington, DC 20408
- Phone: (202) 501-5532
- Fax: (202) 501-5512
-
- Jeffrey Field, Deputy to the Director
- Division of Preservation and Access
- Room 802
- National Endowment for the Humanities
- 1100 Pennsylvania Avenue, N.W.
- Washington, DC 20506
- Phone: (202) 786-0570
- Fax: (202) 786-0243
-
- Lorrin Garson
- American Chemical Society
- Research and Development Department
- 1155 16th Street, N.W.
- Washington, D.C. 20036
- Phone: (202) 872-4541
- Fax:
- E-mail: INTERNET: LRG96@ACS.ORG
-
- William M. Holmes, Jr.
- National Archives and Records Administration
- NSZ Rm. 14N
- 7th & Pennsylvania, N.W.
- Washington, DC 20408
- Phone: (202) 501-5540
- Fax: (202) 501-5512
- E-mail: WHOLMES@AMERICAN.EDU
-
- Sperling Martin
- Information Resource Management
- 20030 Doolittle Street
- Gaithersburg, MD 20879
- Phone: (301) 924-1803
-
- Michael Neuman, Director
- The Center for Text and Technology
- Academic Computing Center
- 238 Reiss Science Building
- Georgetown University
- Washington, DC 20057
- Phone: (202) 687-6096
- Fax: (202) 687-6003
- E-mail: neuman@guvax.bitnet, neuman@guvax.georgetown.edu
-
- Barbara Paulson, Program Officer
- Division of Preservation and Access
- Room 802
- National Endowment for the Humanities
- 1100 Pennsylvania Avenue, N.W.
- Washington, DC 20506
- Phone: (202) 786-0577
- Fax: (202) 786-0243
-
- Allen H. Renear
- Senior Academic Planning Analyst
- Brown University Computing and Information Services
- 115 Waterman Street
- Campus Box 1885
- Providence, R.I. 02912
- Phone: (401) 863-7312
- Fax: (401) 863-7329
- E-mail: BITNET: Allen@BROWNVM or
- INTERNET: Allen@brownvm.brown.edu
-
- Susan M. Severtson, President
- Chadwyck-Healey, Inc.
- 1101 King Street
- Alexandria, VA 22314
- Phone: (800) 752-0515
- Fax: (703) 683-7589
-
- Frank Withrow
- U.S. Department of Education
- 555 New Jersey Avenue, N.W.
- Washington, DC 20208-5644
- Phone: (202) 219-2200
- Fax: (202) 219-2106
-
-
- (LC STAFF)
-
- Linda L. Arret
- Machine-Readable Collections Reading Room LJ 132
- (202) 707-1490
-
- John D. Byrum, Jr.
- Descriptive Cataloging Division LM 540
- (202) 707-5194
-
- Mary Jane Cavallo
- Science and Technology Division LA 5210
- (202) 707-1219
-
- Susan Thea David
- Congressional Research Service LM 226
- (202) 707-7169
-
- Robert Dierker
- Senior Adviser for Multimedia Activities LM 608
- (202) 707-6151
-
- William W. Ellis
- Associate Librarian for Science and Technology LM 611
- (202) 707-6928
-
- Ronald Gephart
- Manuscript Division LM 102
- (202) 707-5097
-
- James Graber
- Information Technology Services LM G51
- (202) 707-9628
-
- Rich Greenfield
- American Memory LM 603
- (202) 707-6233
-
- Rebecca Guenther
- Network Development LM 639
- (202) 707-5092
-
- Kenneth E. Harris
- Preservation LM G21
- (202) 707-5213
-
- Staley Hitchcock
- Manuscript Division LM 102
- (202) 707-5383
-
- Bohdan Kantor
- Office of Special Projects LM 612
- (202) 707-0180
-
- John W. Kimball, Jr.
- Machine-Readable Collections Reading Room LJ 132
- (202) 707-6560
-
- Basil Manns
- Information Technology Services LM G51
- (202) 707-8345
-
- Sally Hart McCallum
- Network Development LM 639
- (202) 707-6237
-
- Dana J. Pratt
- Publishing Office LM 602
- (202) 707-6027
-
- Jane Riefenhauser
- American Memory LM 603
- (202) 707-6233
-
- William Z. Schenck
- Collections Development LM 650
- (202) 707-7706
-
- Chandru J. Shahani
- Preservation Research and Testing Office (R&T) LM G38
- (202) 707-5607
-
- William J. Sittig
- Collections Development LM 650
- (202) 707-7050
-
- Paul Smith
- Manuscript Division LM 102
- (202) 707-5097
-
- James L. Stevens
- Information Technology Services LM G51
- (202) 707-9688
-
- Karen Stuart
- Manuscript Division LM 130
- (202) 707-5389
-
- Tamara Swora
- Preservation Microfilming Office LM G05
- (202) 707-6293
-
- Sarah Thomas
- Collections Cataloging LM 642
- (202) 707-5333
-
-
- END
- *************************************************************
-
- Note: This file has been edited for use on computer networks. This
- editing required the removal of diacritics, underlining, and fonts such
- as italics and bold.
-
- kde 11/92
-
- [A few of the italics (when used for emphasis) were replaced by CAPS mh]
-
- *End of The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS
-
-