<?xml version="1.0"?> <?xml-stylesheet href="http://202.120.182.10/xml/test20.xsl" type="text/xsl"?> <!-- DOCTYPE tei.2 SYSTEM "xteilite.dtd" [ <!ENTITY nil ''> <!ENTITY null ''> <!ENTITY perc "%"> ] --> <!-- XML Compatible Version --> <!-- 1997-11-27 : PB : first XML release --> <tei.2 TEIform="tei.2"> <teiHeader type="text" status="new" TEIform="teiHeader"> <fileDesc TEIform="fileDesc"> <titleStmt TEIform="titleStmt"><title TEIform="title">Text Encoding for Information Interchange: An Introduction to the Text Encoding Initiative.</title> </titleStmt> <publicationStmt TEIform="publicationStmt"><p TEIform="p">For presentation at the LEC, London, Nov 1995.</p> </publicationStmt> <sourceDesc default="NO" TEIform="sourceDesc"><p TEIform="p">Revised and expanded version of article published as <title TEIform="title">The Wider Relevance of the Text Encoding Initiative </title> in OII Spectrum, Nov 1994. </p> </sourceDesc></fileDesc> </teiHeader> <text TEIform="text"> <front TEIform="front"> <titlePage TEIform="titlePage"> <docTitle TEIform="docTitle"> <titlePart type="main" TEIform="titlePart">Text Encoding for Information Interchange : An Introduction to the Text Encoding Initiative </titlePart></docTitle> <docAuthor TEIform="docAuthor">Lou Burnard</docAuthor> <!-- DOCIMPRINT>Document No: TEI ED 99</DOCIMPRINT --> <docDate TEIform="docDate">July 1995</docDate> </titlePage> <!-- divgen type='toc' --> <!-- ==================================cut to save space=========== <DIV type="abstract"> <HEAD>Abstract</HEAD> <P>The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of textual materials in electronic form, in such a way as to enable researchers in any discipline to interchange and re-use resources, independently of software, hardware, and application area. 
The first full version of the TEI Guidelines was published in May 1994, after six years of development in Europe and the US, with major funding from the American National Endowment for the Humanities and from the EU. </P> <P>This publication takes the form of a substantial reference manual, documenting a modular and extensible SGML document type definition (DTD), capable of describing all kinds of texts, of all times and in all languages. Modern computer-aided research crosses many political, linguistic, temporal, and disciplinary boundaries; the TEI scheme reflects this fact. As far as possible, the Guidelines eschew controversy; where consensus has not been established, only very general recommendations are made. The object is to help the researcher make his or her position explicit, not to dictate what that position should be. </P> <P>Consequently, the TEI <q>standard</q> offers neither a single all-embracing encoding scheme, nor an unstructured collection of tag sets. Rather it offers an extensible framework containing a common core of features, a choice of frameworks, and a wide variety of optional additions for specific application areas. </P> <P>The value of this approach has been demonstrated by the diversity of fields in which the Guidelines have been applied, ranging from traditional humanities disciplines such as textual editing and critical editions, to linguistic engineering applications such as spoken and written language corpora. The Guidelines include detailed provision for the encoding of meta-textual cataloguing information; they also include an extensive set of recommendations for the representation of arbitrary interpretive structures and for the encoding of complex hypertextual links and alignments, such as those needed for multilingual corpora, and multimedia applications. 
</P></DIV> ==================================end of cut to save space=========== --> </front> <body TEIform="body"> <div1 type="foo" org="uniform" sample="complete" part="N" TEIform="div1"><head n="1" TEIform="head">Standardization and the TEI </head> <p TEIform="p">Standards come into being for a variety of reasons, and in a variety of ways, not always entirely explicable. They may be entirely market-defined; for example by manufacturers' attempts singly or as a group to control market share, or by consumers' desires to simplify purchase decisions. Standards also result from pressure applied by well-intentioned groups of experts, or as a consequence of legislation in the public interest. And finally, standards come about as the expression of some emergent consensus within some large community. This last method is the most likely to last, but the most difficult to achieve. </p> <p TEIform="p">In creating such a consensus, there is an inevitable tension between the need to transform what is simply tried and tested into something normative and binding on the one hand, and the reluctance to straitjacket or constrain unanticipated development on the other. This is particularly true of the research and development arena, which depends for its survival on innovation, and thus the ability to provide answers to as yet unformulated questions, while at the same time being as concerned as any other community to codify existing practice. The research community is populated by experts, who need maximal flexibility and who distrust constraint, but also by novices, who need access to that accumulated expertise in a consistent and codified form, if only in order to rebel against it and thus become experts in their turn. 
</p> <p TEIform="p">Standardization of the way in which information is stored and represented (rather than processed) is the key to a number of closely related problems, all of central concern to users of modern Information Technology, be they academic or commercial. For creators of language resources in particular, it addresses the difficulty of ensuring that information is <emph TEIform="emph">reusable</emph>; the difficulty of ensuring that information represented in different ways can be seamlessly <emph TEIform="emph">integrated;</emph> and the difficulty of facilitating loss-free information <emph TEIform="emph">interchange</emph> between the widest choice of different platforms, different application systems and different languages. </p> <p TEIform="p">By standardizing at the level of text representation, we can hope to retain the flexibility needed to develop new applications, while ensuring that old ones continue to function. By attempting a theory-neutral standardization, at the level where consensus exists, we avoid the need to reinvent the wheel, without requiring that everyone drive a particular brand of bicycle. </p> <p TEIform="p">In this spirit, the TEI <title TEIform="title">Guidelines</title> which form the topic of this paper, aim to provide not a set of normative rules for particular applications, but rather a modular and extensible framework, within which particular application-specific norms can be defined. The development of such TEI-aware norms is already underway in a number of contexts, most significantly for the present audience, within EAGLES and related EU projects such as Multext, but also in a wide variety of corpus building, scholarly editing and digital library projects. Such projects have in common the need to customize and make less generic the framework defined by the TEI, retaining as they do so the capacity for interchange for which it was developed. 
The general principles, and many of the specific mechanisms, underlying this approach are of clear relevance to all large scale users of information technology. </p> <p TEIform="p">This paper <note place="unspecified" anchored="yes" TEIform="note">An earlier version of which was published as <title TEIform="title">The Wider Relevance of the Text Encoding Initiative</title> in OII <title TEIform="title">Spectrum</title>, Nov 1994.</note> describes the origins and organization of the TEI scheme, including some technical details of how it may be customized for multiple application areas, and an overview of its coverage. </p></div1> <div1 org="uniform" sample="complete" part="N" TEIform="div1"><head n="2" TEIform="head">What is the TEI?</head> <p TEIform="p">The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of textual materials in electronic form. The project was sponsored and organized by three leading professional associations in the field: the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC) and the Association for Computing and the Humanities (ACH). It has been funded throughout its five years of activities on both sides of the Atlantic: primarily by the US National Endowment for the Humanities and by the European Union 3rd framework Programme for Linguistic Research and Engineering, but also with grants from the Mellon Foundation and from the Canadian Social Sciences and Humanities Research Council. Of equal significance has been the donation of time and expertise by the many members of the wider research community who have served on the TEI's Working Committees and Working Groups. </p> <p TEIform="p">As its title suggests, the TEI is strongly interested in text. 
But this interest is by no means confined to the use of electronic text as a stage in the production of paper documents, and the word <mentioned TEIform="mentioned">text</mentioned> should not be read too literally. The TEI is equally concerned with both textual and non-textual resources in electronic form, whether as constituents of a research database or components of non-paper publications. </p> <p TEIform="p">Like the publishing industry, the research community has long realized that its stock in hand is not words on the page, but information, independent of any particular physical realization. As technology emerges which is genuinely adequate to the task of integrating text, graphics and audio into a seamless information-bearing vehicle, so the importance of that integrated vision becomes more apparent. By providing a description of information which is independent of realization or media, the TEI scheme, like other SGML-based approaches, enormously facilitates the construction and exploitation of multimedia technology. </p> <p TEIform="p">In the same way, the texts with which language researchers are concerned are likely to be very heterogeneous. In the construction of language corpora such as the British National Corpus <note place="unspecified" anchored="yes" TEIform="note">See <code>http://info.ox.ac.uk/bnc</code> for details of this 100 million word TEI-conformant corpus of modern British English.</note>, material as diverse as newspapers, books, office memoranda, playscripts, publicity brochures, letters and diaries, transcribed lectures and interviews, TV and radio broadcasts, and unscripted conversations is integrated into a single body of material. Research needs require that this integration be carried out with minimal loss of information, and at the same time with minimal complexity: in any case, the resulting <soCalled TEIform="soCalled">text</soCalled> is far removed from the conventional notion of a printed work. 
</p> <p TEIform="p">Electronic texts are most obviously different from printed ones in that the former contain <term TEIform="term">markup</term> or encoding, which makes explicit various features of the text, so that they can be efficiently processed. Printed texts adopt a variety of similarly-motivated conventions (use of typeface, organization of the carrier medium etc), but these are not so readily processable as the tags of a formal markup scheme. </p> <p TEIform="p">The goals of the TEI project initially had a dual focus: being concerned with both <emph TEIform="emph">what</emph> textual features should be encoded (i.e. made explicit) in an electronic text, and <emph TEIform="emph">how</emph> that encoding should be represented for loss-free, platform-independent, interchange. <!-- The approach taken was a two stage one: <list type="gloss"> <label>firstly</label> <item> the identification of those distinctions concerning which there is common agreement; </item> <label>secondly</label> <item> the creation of a uniform encoding system within which those distinctions can be expressed for interchange.</item> </list> --> </p> <p TEIform="p">Early on in the project, the Standard Generalized Markup Language (SGML; ISO 8879) was chosen as the most appropriate vehicle for the Guidelines, initially on the purely pragmatic grounds that to create a comparably expressive and versatile formal language would be a major research project in itself. In the event, despite some frequently rehearsed inelegancies, SGML has proved entirely adequate to the needs of researchers, and after five years, is still increasing its domination of the software industry, with new product announcements coming every year. The TEI was thus able to focus its efforts on the expression, using SGML, of the set of textual features indicated as its first goal above. 
</p> <p TEIform="p">The prime deliverable of the TEI project is a very large number (over 400) of textual feature definitions, expressed as SGML elements and attributes, with associated documentation and examples. These elements are grouped into <term TEIform="term">tag sets</term> of various kinds, as further discussed below, and together constitute a modular scheme which can be configured to provide hardware-, software-, and application- independent support for the encoding of all kinds of text in all languages and of all times. </p> <p TEIform="p">The TEI tag sets are necessarily based on, but not limited by, existing encoding practices; they are designed to be both comprehensive and extensible. They are collectively documented in a substantial reference manual, the <title TEIform="title">Guidelines for Text Encoding for Interchange</title>, which appeared in May 1994 after five years of extensive development work. <!-- A first draft of this publication appeared in November 1990. Between 1992 and 1994, chapters of a revised draft were circulated electronically. A fully revised and completed version, known internally as P3, was first published in May 1994. A revised and corrected edition is tentatively planned for first quarter 1996. --> This 1400 page manual is published both in paper and electronic hypertext form and is also available over the Internet, in a variety of formats. <note place="unspecified" anchored="yes" TEIform="note"><title TEIform="title">Guidelines for the encoding and interchange of machine-readable texts</title> edited by C.M.Sperberg-McQueen and Lou Burnard (Chicago and Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1994). 
For details of current availability and locations, see the official TEI Web page at <code>http://www-tei.uic.edu/orgs/tei</code>.</note> </p></div1> <div1 org="uniform" sample="complete" part="N" TEIform="div1"><head n="3" TEIform="head">Organization of the TEI scheme </head> <p TEIform="p">As an SGML application, the TEI scheme necessarily requires the existence of some kind of <term TEIform="term">document type definition</term> (DTD). Current approaches to dtd design may be caricatured as falling into one of three camps, depending on their answer to the question <q direct="unspecified" TEIform="q">How many DTDs does the world need?</q>. </p> <p TEIform="p">For many of the first users of SGML, the appropriate answer was <q direct="unspecified" TEIform="q">One</q>: the whole purpose of the exercise being to define a template against which all texts could be checked rigorously and consistently. This approach, which might be characterized by the phrase <q direct="unspecified" TEIform="q">we know what's best for you</q>, has an obvious place in applications such as technical documentation, but is equally obviously inappropriate where the object of the exercise is to describe texts produced before the blessings of structured document design were revealed to the world. </p> <p TEIform="p">At the opposite extreme are those whose answer would be <q direct="unspecified" TEIform="q">none</q>, for whom no DTD can ever be adequate to the full complexity of the texts to be described: this attitude might be caricatured as <q direct="unspecified" TEIform="q">No-one will ever understand my problem</q>. Again, it is not impossible to imagine applications for which a DTD consisting only of elements with the content model ANY would be entirely appropriate (the first electronic edition of the <title TEIform="title">Oxford English Dictionary</title> provides one obvious example), although its usefulness in the general case is less clear. 
</p> <p TEIform="p">Perhaps most numerous are those who shrug their shoulders and say <q direct="unspecified" TEIform="q">as many as it takes</q>: the world will always need new DTDs, one per document in the boundary case. In the name of pragmatism, this attitude risks crowding the fledgeling possibility of information interchange out of the nest entirely; nevertheless, its popularity reminds us that sometimes the document must drive the DTD, rather than the reverse. </p> <p TEIform="p">The approach taken by the TEI attempts to combine the virtues of all three of these approaches. It defines not one, but many possible DTDs, which may be tailored to the needs of a particular application in a way difficult or impossible with most other general purpose DTDs so far developed. The user of the TEI scheme is offered the opportunity of building a DTD which matches his or her requirements, but is constrained to do so in a way that facilitates interchange. </p> <p TEIform="p">We refer to this somewhat jocularly as the Chicago Pizza model. All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings. Using SGML syntax this might be summarized as follows: <eg TEIform="eg"><![CDATA[
<!ENTITY % base    "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)">
<!ELEMENT pizza - - (%base;, (cheese & tomato), (%topping;)* )>
]]></eg> In the same way, the user of the TEI scheme constructs a view of the TEI DTD by combining the core tag sets (which are always present), exactly one <soCalled TEIform="soCalled">base</soCalled> tag set and his or her own selection of <soCalled TEIform="soCalled">additional</soCalled> tag sets or toppings. 
</p> <p TEIform="p">We use the term <term TEIform="term">tag set</term> to denote simply a collection of definitions for SGML elements and their attributes. These tag sets are the basic organizing principles of the TEI scheme, and are divided into four groups: <list type="gloss" TEIform="list"> <label TEIform="label">core </label><item TEIform="item">tag sets defining elements likely to be needed by all documents, and therefore available by default in all cases. </item> <label TEIform="label">base </label><item TEIform="item">tag sets defining particular classes of document whose gross structure may vary; in general, only one base tag set is appropriate for a given document. </item> <label TEIform="label">additional </label><item TEIform="item">tag sets defining sets of elements which may be found in any class of document but which are typically associated with some specialized application or detailed subject area. </item> <label TEIform="label">auxiliary </label><item TEIform="item">tag sets comprising elements with highly specialized roles, typically for description of some part of the encoding scheme, and which make up a DTD independent of the main one. </item> </list>In general, elements appear in only one tag set, though the current model allows for the redefinition of elements within different base tag sets. Elements may not be defined in more than one additional tag set. </p> <p TEIform="p">This modularization is achieved by the use of parameter entities in the TEI DTD, which is further discussed below. 
To illustrate the basic mechanism we present here the start of a minimal TEI-conformant document in which the base tag set for prose has been selected together with the additional tag set for linking: <eg TEIform="eg"><![CDATA[<!DOCTYPE tei.2 [ <!ENTITY % TEI.prose "INCLUDE"> <!ENTITY % TEI.linking "INCLUDE"> ]> <tei.2> <!-- content of document here --> </tei.2> ]]></eg> Because this selection of tag sets is effected explicitly by declarations within the DTD subset, as shown above, any recipient of the document can tell which TEI tag sets are required to process it. Any deviations or modifications of the TEI definitions (for example, the renaming of elements, or the addition of new ones) may be made in a similar declarative manner. Once a given view of the TEI dtd has been defined in this way, it can be fixed or <soCalled TEIform="soCalled">compiled</soCalled> to preclude further modification and also to remove the complexity necessarily introduced by the extensive use of indirection in the TEI dtd. </p></div1> <div1 org="uniform" sample="complete" part="N" TEIform="div1"><head n="4" TEIform="head">The TEI core </head> <p TEIform="p">Two core tag sets are available to all TEI documents without formality. The first defines a large number of elements which may appear in almost any kind of document, whatever kind of base tag set is in use. The second defines the <term TEIform="term">header</term>, providing something analogous to an electronic title page for the electronic text. 
</p> <div2 type="splot" org="uniform" sample="complete" part="N" TEIform="div2"><head n="5" TEIform="head">Elements available to all bases</head> <p TEIform="p">The core tag set common to all TEI documents provides means of encoding with a reasonable degree of sophistication such textual features as typographically highlighted or quoted phrases, (optionally distinguishing highlighting used for emphasis, technical terms, foreign words, titles etc); quoted phrases, (optionally distinguishing amongst direct speech, quotation, glosses, cited phrases etc.); <q direct="unspecified" TEIform="q">data-like</q> phrases such as names, numbers and measures, dates and times, etc.; lists of all kinds; basic editorial changes (e.g. correction of apparent errors; regularization and normalization; additions, deletions and omissions); simple links and cross references, providing basic hypertextual features; facilities for annotation, indexing, bibliographic citations and referencing systems. <!-- == deleted to save space ====================================== the following textual features: <list> <item>typographically highlighted phrases, (optionally distinguishing amongst highlighting for emphasis, technical terms, foreign words, titles etc.) </item> <item>quoted phrases, optionally distinguishing amongst direct speech, quotation, glosses, cited phrases etc. </item> <item>names, numbers and measures, dates and times, and similar <q>data-like</q> phrases. </item> <item>lists of all kinds </item> <item>basic editorial changes (e.g. correction of apparent errors; regularization and normalization; additions, deletions and omissions) </item> <item>simple links and cross references, providing basic hypertextual features. 
</item> <item>pre-existing or generated annotation and indexing </item> <item>bibliographic citations, adequate for most commonly used bibliographic packages, in either a free or a tightly structured format </item> <item>simple or complex referencing systems, not necessarily dependent on the existing SGML structure. </item> </list> == end of deleted to save space ====================================== --> There are few documents which do not exhibit some of these features; and none of these features is particularly restricted to any one kind of document. In some cases, an additional tagset is also available, providing more specialized elements for those wishing to encode aspects of these features in greater detail (for example, for verse and drama, and for names), but the elements defined in this core are believed to be adequate for most applications most of the time. </p></div2> <div2 id="hdr" org="uniform" sample="complete" part="N" TEIform="div2"><head n="6" TEIform="head">The header</head> <p TEIform="p">The TEI scheme attaches particular importance to the provision of documentary or bibliographic information about electronic texts. Such information is essential for any satisfactory interchange of texts coming from multiple sources, or for which long term uses are envisaged. <!-- As with software, leaving the documentation of an electronic text to the last moment is a recipe for disaster all too commonly followed. --> </p> <p TEIform="p">The TEI header is one of the few mandatory elements in a TEI document. 
It has four major divisions which together provide a detailed syntax for the documentation of: <list type="simple" TEIform="list"> <item TEIform="item">the electronic document itself and the sources from which it was derived; </item> <item TEIform="item">the encoding system which has been applied; </item> <item TEIform="item">descriptive information categorizing the document and its subject matter; </item> <item TEIform="item">its revision history.</item> </list></p> <p TEIform="p">The first of these, the <term TEIform="term">file description</term>, contains traditional bibliographic material, detailing title, intellectual responsibility and publication or distribution information relating to an electronic text, which can readily be translated into a conventional catalogue record for use by the growing number of forward-thinking academic and public libraries now coming to terms with their new role as curators of non-print electronic materials. </p> <p TEIform="p">Several commentators, noticing how the day to day information processing of all sectors of the economy now takes place in electronic form only, have expressed concern at the difficulties faced by librarians and archivists in handling these new forms of historical records. Others, trying to come to terms with the wealth of information in <q direct="unspecified" TEIform="q">cyberspace</q>, have lamented the absence of any effective cataloguing standards for networked resources and other forms of electronic publication. For creators of language corpora, the provision of such meta-descriptive information is essential, since without it analysis of the full complexity of language use is all but impossible. The TEI Header represents a major contribution to overcoming all these problems. </p> <p TEIform="p">Many electronic texts are essentially derivative works, created either by keying or scanning previously existing print materials, combining or modifying previously existing electronic materials, or both. 
The <term TEIform="term">source description</term> part of the TEI header allows an encoder to specify the source or sources from which a text has been derived, using traditional bibliographic concepts. The pedigree of a TEI-conformant text can thus be specified, in the same way as a conventional book will generally document its publishing history. A detailed formal description of changes made in producing a text can be recorded as a distinct <term TEIform="term">revision history</term>; this is particularly useful for highly dynamic texts. </p> <p TEIform="p">As noted above, the TEI is not a fixed encoding scheme, but offers a variety of options appropriate to different situations. Consequently, the <term TEIform="term">encoding description</term> within a TEI Header is of particular importance to users of an electronic document. It provides, in structured or unstructured form, vital information about editorial conventions or policies, design decisions and even the selection of tags actually used within the document. </p> <p TEIform="p">The <term TEIform="term">profile description</term> is used to group together a wide range of additional descriptive information ranging from specifications of the languages used within it, the situation or social context in which it was produced, its topics or classification, to demographic or social characteristics of its authors or participants. No-one is likely to need all of these categories of information, but <!-- the working groups involved in defining the header agreed that --> all of them are likely to be essential to some users. </p> <!-- <P>At one extreme, an encoder may provide only a bibliographic identification of the text. At the other, encoders wishing to ensure that their texts can be used for the widest range of applications, will want to provide a level of detailed documentation approximating to the kind most often supplied in the form of a manual. 
Most texts will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme. </P> --> <p TEIform="p">A collection of TEI headers can also be regarded as a distinct document, and an auxiliary DTD is provided to support interchange of headers alone, for example between libraries or archives. </p></div2></div1> <div1 id="bas" type="frrpo" org="uniform" sample="complete" part="N" TEIform="div1"><head n="7" TEIform="head">The TEI base tag sets</head> <p TEIform="p">To construct a view of the TEI DTD, the user must always choose one base tag set. Six of these are currently defined, for documents which are predominantly one of prose, verse, drama, transcribed speech, dictionaries, or terminological databases. Another two are provided for use with texts which combine these basic tag sets. <!-- <list> <item>prose </item> <item>verse </item> <item>drama </item> <item>transcribed speech </item> <item>letters and memoranda </item> <item>dictionary entries </item> <item>terminological entries </item> </list> --> </p> <p TEIform="p">The choice of a base tag set determines the basic structure of all the documents with which it is to be used, reflecting the fact that subelements likely to appear within a dictionary (for example) will be entirely different in kind from those likely to appear within a letter or a novel, and even more so from those likely to be found in a transcription of spoken language. To cater for this variety, the constituents of all divisions of a TEI <gi TEI="yes" TEIform="gi">text</gi> element are not defined explicitly, but in terms of <term TEIform="term">parameter entities</term>. 
The mechanism used is to provide definitions like the following within the DTD, one of which the user must over-ride by supplying an appropriate declaration in the DTD subset: <eg TEIform="eg"><![CDATA[
<!ENTITY % TEI.prose "IGNORE">
<!ENTITY % TEI.dictionary "IGNORE">
]]></eg> The body of the main DTD contains a series of alternative definitions, each enclosed within an SGML <term TEIform="term">marked section</term> controlled by the parameter entity named after the base which it defines, as in this simplified example: <eg TEIform="eg"><![CDATA[
<![ %TEI.prose; [
<!-- This definition is in force when the prose base is selected -->
<!-- Its effect is to define component as either paragraph or list -->
<!ENTITY % component "p|list" >
]&null;]>
<![ %TEI.dictionary; [
<!-- This definition is in force when the dictionary base is selected -->
<!-- Its effect is to define component as entry alone -->
<!ENTITY % component "entry" >
]&null;]>
<!-- This definition is always in force -->
<!-- Its effect is to define component.seq as one or more of -->
<!-- whatever definition of component is currently in force -->
<!ENTITY % component.seq "(%component;)+">
]]></eg> Within the body of the DTD, elements are defined using these parameter entities only, for example: <eg TEIform="eg"><![CDATA[
<!ELEMENT div - - ((%component.seq;)+)>
]]></eg> To select a base tag set a declaration such as the following should be supplied within the DTD subset for the document: <eg TEIform="eg"><![CDATA[
<!ENTITY % TEI.prose "INCLUDE">
]]></eg> This will over-ride the declaration within the TEI DTD itself, because it is read first: when an entity is declared more than once in SGML, the first declaration takes precedence, and the DTD subset is processed before the external DTD. If no base is declared, the DTD will not compile. </p> <p TEIform="p">The value of the parameter entity called <ident>component.seq</ident> will thus differ in different bases. In this way it is possible for the divisions of a text using the drama base (for example) to consist of speeches and stage directions, while those of a text using the dictionary base will consist of lexical entries. 
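To make the substitution concrete, the following sketch traces by hand how the simplified declarations above resolve under two alternative choices of base. The expansions given in the comments are worked out here for illustration only; they are not the output of any parser, and the two halves are alternatives which would never appear together in one DTD: <eg TEIform="eg"><![CDATA[
<!-- Alternative 1: dictionary base selected in the DTD subset -->
<!ENTITY % TEI.dictionary "INCLUDE">
<!--   component      resolves to   entry        -->
<!--   component.seq  resolves to   (entry)+     -->
<!--   so the declaration for div is read as:    -->
<!ELEMENT div - - (((entry)+)+)>

<!-- Alternative 2: prose base selected instead -->
<!ENTITY % TEI.prose "INCLUDE">
<!--   component      resolves to   p|list       -->
<!--   component.seq  resolves to   (p|list)+    -->
<!--   so the declaration for div is read as:    -->
<!ELEMENT div - - (((p|list)+)+)>
]]></eg> The declaration of <ident>div</ident> itself is identical in both cases; only the entity values reaching it differ. 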
</p> <div2 org="uniform" sample="complete" part="N" TEIform="div2"><head n="8" TEIform="head">Textual Divisions</head> <p TEIform="p">Although the actual components may differ, textual components may be grouped into higher level <soCalled TEIform="soCalled">division</soCalled>s in almost any kind of text. These higher level units may be called variously <soCalled TEIform="soCalled">chapters</soCalled>, <soCalled TEIform="soCalled">sections</soCalled>, <soCalled TEIform="soCalled">subdivisions</soCalled>, <soCalled TEIform="soCalled">acts</soCalled> or <soCalled TEIform="soCalled">parts</soCalled> but all seem to behave in more or less the same way: they are incomplete in themselves, and nested hierarchically. In the TEI scheme all such objects are therefore regarded as the same kind of element, called here a <term TEIform="term">division</term>. <!-- ; though a distinction is made between divisions whose hierarchic position is regarded as inseparable from their semantics (these are encoded as <gi>div1</gi>, <gi>div2</gi> etc. down to <gi>div7</gi> elements) and those for which their position in the document tree is regarded as of lesser importance (these are known as <SOCALLED>vanilla </SOCALLED> <gi>div</gi>s). Numbered and unnumbered division elements may not be mixed in the same <gi>front</gi>, <gi>body</gi>, or <gi>back</gi> element.--> </p> <p TEIform="p">A <ident>type</ident> attribute may be used to distinguish amongst divisions in some respect other than their hierarchic position: the values for this attribute (as for several others in the TEI scheme) are not standardized, precisely because no consensus exists, or is likely to exist, as to a generic typology. A set of legal values should however be defined for a given application, either in the TEI Header or by a user-defined modification. 
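For illustration only, a project might distinguish its divisions as in the following sketch, which uses the numbered division elements <gi TEI="yes" TEIform="gi">div1</gi> and <gi TEI="yes" TEIform="gi">div2</gi>. The values <q direct="unspecified" TEIform="q">chapter</q> and <q direct="unspecified" TEIform="q">section</q> are hypothetical and would need to be documented by the project itself: <eg TEIform="eg"><![CDATA[
<div1 type="chapter" n="1">
  <head>The first chapter</head>
  <div2 type="section" n="1.1">
    <head>Its first section</head>
    <p>The text of the section begins here.</p>
  </div2>
</div1>
]]></eg> The nesting records the hierarchic position of each division, while the <ident>type</ident> values record what the project chooses to call it.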
</p> <p TEIform="p">In the normal case, the components of all divisions in a particular base are homogeneous --- they all use the same value for <ident>component.seq</ident>. However, the scheme also allows for two kinds of heterogeneity. If the <term TEIform="term">general</term> base is selected, together with two or more other bases, then different divisions of a text may have different constituents, though each division must itself be homogeneous. A <term TEIform="term">mixed</term> base is also defined, in which components from any selection of bases may be combined promiscuously across division boundaries. </p> <p TEIform="p">This approach applies equally to the encoding of smaller units: rather than attempt to enumerate all the different analytic units which particular disciplines might find necessary, the TEI proposes two generic segmentation elements: one (<gi TEI="yes" TEIform="gi">s</gi>) for simple end-to-end segmentation, such as that commonly used in language corpora, roughly corresponding to the notion of orthographic sentence; the other (<gi TEI="yes" TEIform="gi">seg</gi>) for segments which can potentially self-nest. In either case, a <ident>type</ident> attribute may be used to distinguish different kinds of segment. </p></div2> <div2 org="uniform" sample="complete" part="N" TEIform="div2"><head n="9" TEIform="head">The TEI Class System and Modification Mechanisms</head> <p TEIform="p">Textual features, and hence the elements which encode them, may be categorized or classified in a number of ways. The TEI scheme identifies two kinds of classification scheme: <term TEIform="term">attribute classes</term> and <term TEIform="term">model classes</term>; both are used for broadly similar purposes. </p> <p TEIform="p">Members of an attribute class share the same set of attributes. 
For example, all elements which represent links or associations between one element and another do so using a common set of attributes, defined by the <ident>pointer</ident> attribute class. <!-- All elements are members of at least one attribute class, the class <q>global</q>, which is further discussed below (section <ptr TARGET="glob"/>). --> </p> <p TEIform="p">Members of a model class share the same structural properties: that is, they may appear at the same position within the SGML document structure. For example, the class <term TEIform="term">divtop</term> contains all elements (headings, epigraphs etc.) which can appear at the start of a textual division; all elements used to mark editorial corrections or omissions are members of the class <term TEIform="term">edit</term>; elements marking bibliographic citations etc. are all members of the class <term TEIform="term">bibl</term> and so on. </p> <p TEIform="p">Elements may of course be members of more than one class. Classes may have super- and sub-classes, and properties (notably associated attributes) may be inherited. Classes are defined in the TEI dtd by means of parameter entities, and used extensively for DTD maintenance, documentation, and extension.</p> <p TEIform="p">The TEI scheme supports three kinds of user modification: new elements may be added into existing classes, and existing elements renamed or undefined. These operations are carried out in a controlled manner, using the class system and without any need for extensive revision of the TEI DTD itself.</p> <p TEIform="p">The process of adding a new element to a class may be illustrated as follows. Consider the model class <term TEIform="term">divtop</term> mentioned above. 
Simplifying somewhat, this element class is defined as follows: <eg TEIform="eg"><![CDATA[ <!ENTITY % x.divtop ""> <!ENTITY % m.divtop "%x.divtop head | byline | epigraph"> ]]></eg> To add a new element (say, <gi TEI="yes" TEIform="gi">keywords</gi>) to this class, enabling it to appear anywhere in the content model that other members of the class do, all that is needed is to re-define the <soCalled TEIform="soCalled">x-entity</soCalled> within the document type subset: <eg TEIform="eg"><![CDATA[ <!ENTITY % x.divtop "keywords |"> ]]></eg> Note the trailing vertical bar, which is required. As it happens, the element <gi TEI="yes" TEIform="gi">keywords</gi> is already defined in the TEI scheme (within the header); if it were not, an element declaration would also be necessary.</p> <p TEIform="p">Parameter entities are also used to effect the two other kinds of modification mentioned above: the ability to undefine elements, and the ability to rename them. </p> <p TEIform="p">Within the main TEI dtd, each element definition and its associated attribute list specification is enclosed by a marked section with the same name as the element, the default value for which is "INCLUDE". Thus, to undefine the element <gi TEI="yes" TEIform="gi">mentioned</gi>, all that is needed is a declaration like the following in the DTD subset: <eg TEIform="eg"><![CDATA[ <!ENTITY % mentioned "IGNORE"> ]]></eg> </p> <p TEIform="p">A similar declaration may be used to rename any element; for example, to rename <gi TEI="yes" TEIform="gi">p</gi> as <gi TEI="yes" TEIform="gi">para</gi>: <eg TEIform="eg"><![CDATA[ <!ENTITY % n.p "para"> ]]></eg> This works because all references to the <gi TEI="yes" TEIform="gi">p</gi> element throughout the TEI dtd are made indirectly, using the <ident>n.p</ident> entity. Furthermore, the original name for an element is recoverable by an SGML application, because it forms the value of a global attribute <ident>TEIform</ident>, whose default value is declared as FIXED. 
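Taken together, the three kinds of modification might appear in a single DTD subset, as in the following sketch. It simply collects the declarations already shown above; the comments are explanatory, and the particular combination is illustrative rather than recommended: <eg TEIform="eg"><![CDATA[
<!-- Add keywords to the divtop class (note the trailing bar) -->
<!ENTITY % x.divtop "keywords |">
<!-- Undefine the mentioned element -->
<!ENTITY % mentioned "IGNORE">
<!-- Rename p as para -->
<!ENTITY % n.p "para">
]]></eg> A document using this subset would tag its paragraphs as <gi TEI="yes" TEIform="gi">para</gi>, could open a division with <gi TEI="yes" TEIform="gi">keywords</gi>, and could not use <gi TEI="yes" TEIform="gi">mentioned</gi> at all.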
</p> <p TEIform="p">All user-defined modifications of this kind are regarded as forming an additional tag set, which is embedded within the DTD in the same way as any other tag set, i.e. by enabling the <term TEIform="term">TEI.extensions</term> parameter entities. In this way a TEI document can make explicit the extent and nature of any modification required in the base TEI scheme for its processing. An auxiliary tag set is also provided for the documentation of additional SGML elements in a way compatible with that used for the rest of the scheme. </p></div2> <div2 id="glob" org="uniform" sample="complete" part="N" TEIform="div2"><head n="10" TEIform="head">The global attributes</head> <p TEIform="p">One particularly important class is the <term TEIform="term">global</term> attribute class. By default the following attributes are members of this class and may therefore be supplied for all elements in the TEI scheme: <list type="gloss" TEIform="list"> <label TEIform="label">id</label> <item TEIform="item">provides an SGML identifier for an element</item> <label TEIform="label">n</label> <item TEIform="item">provides a possibly non-unique name or number for an element</item> <label TEIform="label">lang</label> <item TEIform="item">specifies the language and hence the writing system used for an element</item> <label TEIform="label">rend</label> <item TEIform="item">provides information about the rendering of an element where this is not otherwise specified</item> </list> </p> <p TEIform="p">This list may be extended: for example, selecting the additional tag set for analysis will add analytic attributes to the above list. The <ident>id</ident> and <ident>n</ident> attributes allow for the identification of any element occurrence within a TEI-conformant text. 
Elements carrying an <ident>id</ident> attribute value may be the object of a link or cross-reference, or any of the other re-structuring mechanisms proposed by the TEI for circumventing the rigidly hierarchic structure of a simple SGML DTD. The fact that the requirement for such links is usually unpredictable is one reason for making this attribute global. </p> <p TEIform="p">Values on <ident>id</ident> attributes must be unique (their declared value is ID). Values on the <ident>n</ident> attribute however need not be; they may be used to carry a TEI canonical reference. A method for defining the structure of such canonical reference schemes is also provided, so that documents using it can be processed automatically. </p> <p TEIform="p">The <ident>lang</ident> attribute indicates the language, and hence the writing system, applicable to the element's content, thus providing explicit support for polyglot or multiscript texts. If no value is given, that of the element's direct parent is assumed. (A number of TEI attributes have this characteristic, which is catered for by a TEI-defined keyword). The value of this attribute identifies a special purpose <gi TEI="yes" TEIform="gi">language</gi> element which documents the language in use, optionally associating it with an external entity in which a formal <term TEIform="term">writing system declaration</term> (WSD) may be given. </p> <p TEIform="p">A WSD defines a language/writing system pair (for example, <q direct="unspecified" TEIform="q">Koine Greek, using TLG Beta Code</q>) and is formally defined by an auxiliary DTD which allows each character to be systematically defined and documented, in terms of existing international or other standards, public or private entity sets, ad hoc transliteration schemes or explicit definitions, as well as combinations of all four. 
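The following sketch, which is illustrative rather than drawn from any real encoding, shows the <ident>id</ident>, <ident>n</ident> and <ident>lang</ident> attributes working together. The identifiers, the reference values, the placement of the <gi TEI="yes" TEIform="gi">language</gi> elements within a <gi TEI="yes" TEIform="gi">langUsage</gi> element in the header, and the sample Greek in Beta Code are all assumptions made for the example: <eg TEIform="eg"><![CDATA[
<!-- In the header: one language element for each value used on lang -->
<langUsage>
  <language id="eng">English</language>
  <language id="grc">Koine Greek, using TLG Beta Code</language>
</langUsage>

<!-- In the text: lang is inherited from div1 unless given again -->
<div1 id="ch1" n="1" lang="eng">
  <p id="ch1.p1" n="1.1">The gospel opens with the words
    <q lang="grc">e)n a)rxh=| h)=n o( lo/gos</q>.</p>
  <p n="1.2">For the quotation, see <ptr target="ch1.p1"/>.</p>
</div1>
]]></eg> Here the <ident>id</ident> value makes the first paragraph available as the target of a pointer, while the <ident>n</ident> values carry a reference scheme that need not be unique across the document.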
</p> <p TEIform="p">Finally, the global <ident>rend</ident> attribute may be used to give information about the physical presentation of the text in the source, where this is not otherwise given. A default rendition may be specified for all elements of a given type. No specific set of values is defined for this attribute in the current draft, though it is probable that some suitable set of DSSSL primitives will be proposed in a later version. </p> <p TEIform="p">It should be stressed that the <ident>rend</ident> attribute is <emph TEIform="emph">not</emph> intended for use as a means of specifying the desired formatting of an element, except insofar as this may be determined by a desire to mimic the approximate appearance of the original text. Like other SGML applications, the TEI scheme attempts to provide elements for the encoding of those textual features deemed essential to a productive use of the encoded text; however, unlike most other SGML applications, the TEI scheme recognizes that for some, it is precisely the appearance of a text which is the object of research. </p></div2></div1> <div1 id="adds" org="uniform" sample="complete" part="N" TEIform="div1"><head n="11" TEIform="head">The TEI additional tag sets</head> <p TEIform="p">Ten additional tag sets are defined by the current TEI proposals. These include tag sets for special application areas such as the orthographic transcription of speech, the detailed physical description of manuscript or print material, and the recording of an <soCalled TEIform="soCalled">electronic variorum</soCalled> modelled on the traditional critical apparatus. A tag set is defined for the detailed documentation of contextual information needed by language corpora, as well as for the detailed encoding of names and dates; abstractions such as networks, graphs or trees; mathematical formulae and tables etc. 
</p> <p TEIform="p">In addition to these application-specific additional tag sets, some more general purpose additional tag sets are defined for <list type="simple" TEIform="list"> <item TEIform="item">linking and alignment </item> <item TEIform="item">analysis and interpretation </item> <item TEIform="item">feature structure analysis </item> </list></p> <p TEIform="p">The tag set for linking and alignment extends the set of linking and pointing elements already defined in the TEI core to provide facilities for linking to arbitrary locations or spans of texts, whether or not these are in the current document, and whether or not the target is an SGML document. Mechanisms are included for recording the alignment or correspondence of parts of a text, for example in multilingual corpora, or for marking the alignment of audio or video with a transcription of it. As such, this tag set provides a usefully large subset of the facilities offered by the HyTime standard, but with a considerably simpler and more efficient interface. <note place="unspecified" anchored="yes" TEIform="note">Witness the fact that, as of May 1995, support for the TEI <term TEIform="term">extended pointer mechanism</term> has already been implemented in Softquad's Panorama Pro, and Electronic Book Technology's DynaText --- the two market leaders amongst commercial SGML browsing software.</note> </p> <p TEIform="p">As noted above, a generic segmentation element is defined for the identification of textual spans appropriate to any analytic scheme. An out-of-line generic <gi TEI="yes" TEIform="gi">interp</gi> element may be used to link arbitrary text segments (which may be nested or discontinuous) with any user-defined set of attribute/value pair interpretations. Specific tags are also defined for the most common requirements of linguistic analysis such as identification and typing of morphemes, words, phrases, and sentences. 
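A minimal sketch of this out-of-line approach follows. It assumes that the link is made by the analytic attribute <ident>ana</ident> on each segment; the sentence, the identifier and the <ident>type</ident> and <ident>value</ident> strings are hypothetical: <eg TEIform="eg"><![CDATA[
<p>She wrote that she was
  <seg type="phrase" ana="exile1">far from home</seg> and
  <seg type="phrase" ana="exile1">could not return</seg>.</p>

<!-- Stored apart from the text: one interpretation, two segments -->
<interp id="exile1" type="theme" value="exile"/>
]]></eg> Two discontinuous segments share a single interpretation, and further interpretations can be added without disturbing the transcription itself.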
</p> <p TEIform="p">A specialized tagset is also provided for the encoding of abstract interpretations of a text, either in parallel with it or embedded within it. This is based on the <term TEIform="term">feature structure</term> notation employed in theoretical linguistics, but has applications beyond linguistic theory. <note place="unspecified" anchored="yes" TEIform="note">An introduction to this tag set is provided by D. T. Langendoen and G. F. Simons ``A rationale for the TEI recommendations for feature-structure markup'' in <title TEIform="title">Computers and the Humanities</title> (forthcoming, 1995); for an extended discussion of an application of the feature structure scheme to the problems of encoding historical source materials, see D. I. Greenstein and L. Burnard ``Speaking with one voice'' (ibid.).</note> </p> <p TEIform="p">Using this mechanism, encoders can define arbitrarily complex bundles or sets of features identified in a text, according to their own methodological bias. They may thus embed a whole range of interpretations of a text, linguistic, literary, or thematic, within a text in a controlled manner. The syntax defined by the Guidelines not only formalizes the way in which such features are encoded, but also provides for a detailed specification of legal feature/value pair combinations and rules determining, for example, the implication of under-specified or defaulted features. This is known as a <term TEIform="term">feature system declaration</term> and is defined by an auxiliary tag set. </p> <p TEIform="p">An additional tag set is also provided for the encoding of degrees of uncertainty or ambiguity in the encoding of a text. These particular tag sets exhibit in a particularly noticeable form one of the chief strengths of the TEI approach to encoding: it provides the encoder with a well-defined set of tools which can be used to make explicit his or her reading of a text. 
No claim to absolute authority is made by any encoder, nor ever should be; the TEI scheme merely allows encoders to <q direct="unspecified" TEIform="q">come clean</q> about what they have perceived in a text, to whatever degree of detail seems appropriate. </p> <p TEIform="p">A user of the TEI scheme may combine as many or as few additional tag sets as suit his or her needs. The existence of tag sets for particular application areas in the current draft reflects, to some extent, accidents of history: no claim to systematic or encyclopaedic coverage is implied. Indeed, it is confidently expected that new tag sets will be added, and that their definition will form an important part of the continued work of this and successor projects. </p></div1> <div1 org="uniform" sample="complete" part="N" TEIform="div1"><head n="12" TEIform="head">From General to Specific</head> <p TEIform="p">The TEI Guidelines have taken more than five years to reach their present state, the first at which they can be said to be reasonably complete. In retrospect, it is doubtless true that they could have been created much more quickly with less involvement from the research community, or a clearer statement from it of a set of particular goals. But that statement would have inevitably limited the scope of the resulting scheme, providing exactly the kind of strait-jacket which we wished to avoid. Moreover, by prioritizing any one research agenda however well-articulated, we would have effectively disenfranchised and alienated all others. A little like the early Church fathers then, the TEI chose to provide as broad and as catholic a means of salvation as possible. 
</p> <p TEIform="p">At the same time, the TEI scheme applies rigorously the principle <q direct="unspecified" TEIform="q">essentia non sunt multiplicanda praeter necessitatem</q><note place="unspecified" anchored="yes" TEIform="note">Generally attributed to William of Occam (1300-1349), this recommendation is known as <term TEIform="term">Occam's Razor</term>; it may be translated as <q direct="unspecified" TEIform="q">Essences should not be unnecessarily multiplied</q> and refers properly to the distinction made by the Scholastics between <q direct="unspecified" TEIform="q">essence</q> --- those properties of an entity which define its type and <q direct="unspecified" TEIform="q">accidents</q> --- those properties specific only to one instance of an entity</note>. Rather than defining discrete elements for different kinds of list (bulleted, glossary, enumerated etc.), the TEI scheme defines a single <gi TEI="yes" TEIform="gi">list</gi> element which bears a <ident>type</ident> attribute to distinguish amongst these various kinds. In the same way, all kinds of links between document elements, whatever their semantics, are encoded using the same tags. To cover the indefinite number of elements potentially needed to handle all kinds of analysis and interpretation, a small number of generic tags are proposed which (in the case of the feature structure tag set referred to above) are sufficiently abstract and general to cater for almost any kind of interpretative judgment. </p> <p TEIform="p">At the same time, there remain many situations in which the TEI's desire to exclude no-one has led to a multiplication of distinctions at first sight rather bewildering. It seems, to say the least, unlikely that anyone will ever encode a document using every possible element defined by the union of every TEI tag set, though such a monster DTD is indeed possible. 
</p> <!-- deleted to save space ======================================== <P>Even in a relatively small area such as the definition of text classification schemes, the TEI proposes three parallel (and mutually incompatible) methods. In the matter of hypertextual addressing the TEI syntax permits of 14 different <q>location methods</q>. Names of persons, places, and organizations may be left unmarked, tagged simply as referring strings, or analyzed into subcomponents specific to them. Bibliographic citations may be presented as simple prose, or as assemblages of specific elements, either highly structured or loosely assembled. The Guidelines are even, seemingly, unable to make up their mind whether to organize text into numbered or unnumbered divisions! </P> <P>It is probable that many people confronted by the 1400 pages of the current printed version are likely to derive less comfort from knowing that somewhere in it exists precisely the general-purpose solution they need than they would from a demonstration of the application of that general mechanism to the specific problem currently facing them.</P> ===end deleted to save space ======================================== --> <p TEIform="p"> As published, the Guidelines constitute a substantial document unsuitable for casual browsing, even in electronic form. The TEI therefore plans to make available a number of smaller introductory tutorials focused on particular application areas. Two such have already appeared: one dealing with terminological systems, <note place="unspecified" anchored="yes" TEIform="note">Melby, Alan et al <title TEIform="title">Terminology Interchange Format (TIF): a tutorial</title> (Vienna, Infoterm, 1993) </note> and the other on encoding of manuscript transcriptions <note place="unspecified" anchored="yes" TEIform="note">Robinson, Peter <title TEIform="title">Encoding of Primary Sources Using SGML</title>, Oxford, Office for Humanities Communication, 1994</note>. 
</p> <p TEIform="p">A third tutorial has also recently been completed, documenting a special pedagogically-motivated subset of some 200 elements, selected from the whole TEI scheme (not just the core). Known as TEI Lite, this DTD has already been used in two electronic publishing projects and is in use at electronic text repositories at the Universities of Oxford, Virginia and Michigan, and elsewhere. <note place="unspecified" anchored="yes" TEIform="note">At the time of writing, the document defining this scheme is only available in electronic form, as (Sperberg-McQueen, C.M. and Lou Burnard <title TEIform="title">TEI Lite: An Introduction to the TEI encoding scheme</title> (Chicago and Oxford, May 1995)) from the URLs <code>http://www-tei.uic.edu/orgs/tei</code> or <code>http://info.ox.ac.uk/~archive/teilite</code></note> </p> <p TEIform="p">The real proof of the effectiveness of the TEI design will come only with its wide-spread adoption, tailored to the particular needs of individual projects. As far as can be judged from the long list of early implementors, such evidence will soon be forthcoming. </p></div1> <div1 org="uniform" sample="complete" part="N" TEIform="div1"><head n="13" TEIform="head">Conclusions</head> <p TEIform="p">This article has focussed chiefly on the complexity and generality of the TEI scheme, with a view to demonstrating its intellectual adequacy and its potential as a model for many SGML applications. </p> <p TEIform="p">It has also attempted to demonstrate how a simple modular scheme can be implemented in such a way as to maximize the <q direct="unspecified" TEIform="q">interchange space</q> within which information interchange takes place. </p> <p TEIform="p">The origins of the TEI scheme in the academic world mean that it has been designed with the widest possible set of applications in mind. Optimizing it for particular sets of users will be a new challenge. </p> </div1> </body> </text> </tei.2>