NetNews Usenet Archive 1993 #1

Text File | 1993-01-11 | 8.1 KB | 189 lines

Path: sparky!uunet!ralvm13.VNET.IBM.COM
From: drmacro@ralvm13.VNET.IBM.COM
Message-ID: <19930110.211516.255@almaden.ibm.com>
Date: Sat, 9 Jan 93 10:25:48 EST
Newsgroups: comp.text.sgml
Subject: Re: Separation of text and markup
Disclaimer: This posting represents the poster's views, not those of IBM
News-Software: UReply 3.1
References: <C0BsDK.5G2@undergrad.math.waterloo.edu> <19930105.004@erik.naggum.no>
 <1993Jan8.194217.9341@linus.mitre.org> <19930108.007@erik.naggum.no>
Lines: 176

In <19930108.007@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes:

>I've thought a little more about this, and maybe someone out there can use
>an idea I had. It goes like this:
>
>Assume that we have text files that do not contain any markup, i.e. so that
>we won't need to parse their contents in any way if we refer to and extract
>text from them, and, as a corollary, element contents are not split across
>files (entities).
>
>Assume that we have a document type which does not have both elements and
>data in element contents.
>
>Assume that we have an entity manager that can extract portions of an
>entity given an entity, a starting point and a length, however specified.
>
>For each element that contains PCDATA in the DTD, define three attributes
>
>    contents  ENTITY   #CONREF
>    start     NUMBERS  #IMPLIED
>    length    NUMBERS  #IMPLIED
>
>(with the understanding that _start_ and _length_ are REQUIRED if
>_contents_ is specified).
>
>The start and length are both NUMBERS because we might have a record-based
>storage system that must address a given character by record and position
>within a record, as well as one that can address characters individually.
>
>Then, the element structure can be kept in a "regular" SGML document, which
>in the document type declaration subset identifies the entities to which
>references are made in the start-tags in this document, and the application
>can request the element contents from the entity manager while receiving
>the element structure from the parser.
>
>I know of no entity manager that allows this flexibility, but that doesn't
>mean none exist, or can't exist. (I also imagine that this kind of data
>extraction service will be required by an entity manager to be used with
>HyTime engines.)

You can get the same functionality by using one level of indirection
and standard HyTime addressing elements. Simply define a location ladder
that ends with a data location element that specifies the location
and extent of the data to grab (in these examples I've simplified
as allowed by HyTime for brevity):

 <!ENTITY % conref.att
    "ContentSource  IDREF  #CONREF  -- source of element content --
     HyNames        CDATA  #FIXED 'ContentSource linkend'
     HyTime         NAME   #FIXED clink"  -- conceptually a clink --
 >

 <!ELEMENT ContentRefSpec - - (nameloc, dataloc+) >
 <!ATTLIST ContentRefSpec
     HyTime  NAME  #FIXED hybrid
 >

 <!ELEMENT (nameloc | DataLocSource)
           O O (nmlist)  -- nmquery omitted for simplicity -- >
 <!ATTLIST (nameloc | DataLocSource)
     HyTime  NAME  #FIXED nameloc
     ID      ID    #REQUIRED
 >
 <!ELEMENT nmlist O O (#PCDATA)  -- lextype(NAMES) -->

 <!ELEMENT dataloc O O (dimspec*)  -- Will define data locations -->
 <!ATTLIST dataloc
     HyTime  NAME    #FIXED dataloc
     id      ID      #REQUIRED
     locsrc  IDREFS  #REQUIRED  -- no logical default --
     -- other attributes omitted for simplicity --
 >
 <!ELEMENT dimspec  O O (marklist, marklist) >
 <!ELEMENT marklist - O (#PCDATA)  -- Lextype(snzi*) -->

 <!ELEMENT MyData - - (%mydata.el;)*  -- typical element -->
 <!ATTLIST MyData
     %conref.att;
 >

I've defined the omission indicators such that for manual typing, you
only have to specify a few of the tags to create a reference:

 <!ENTITY thedata SYSTEM "data.file" CDATA >

 <DataLocSource id=mydata nametype=entity>thedata</>

 <nameloc id=dataref1>dimspec1
 <dataloc id=dimspec1 locsrc=mydata>
 <marklist>10 100
 <MyData ContentSource=dataref1>

 <nameloc id=dataref2>dimspec2
 <dataloc id=dimspec2 locsrc=mydata>
 <marklist>1 1
 <marklist>10 -1
 <MyData ContentSource=dataref2>

Interestingly enough, Erik's notation could
be transformed into what I've defined above on the fly by an application.
It would not itself be HyTime conforming or directly processible
by a HyTime system, but it could be transformed into "virtual SGML"
dynamically and then passed to a HyTime system.

Note also that the above method allows the use of multiple data locations
that can be aggregated into a single value when the location chain is
resolved.

You can also define some application-specific simplifications. For
example, since I have namelocs and datalocs contained within a common
element, I could define the processing semantic for my application such
that the nameloc element is implied by the existence of the dataloc
elements within the ContentRefSpec element. The application defines the
nameloc ID, perhaps by taking the nameloc ID value from an attribute of
the ContentRefSpec:

 <!ELEMENT ContentRefSpec - - (nameloc, dataloc+) >
 <!ATTLIST ContentRefSpec
     HyTime         NAME    #FIXED hybrid
     id             NAME    #REQUIRED  -- defines nameloc ID --
     DataContainer  ENTITY  #REQUIRED  -- defines locsrc value --
 >
 <!ELEMENT nameloc O O (nmlist)  -- nmquery omitted for simplicity -- >
 <!ATTLIST nameloc
     HyTime  NAME  #FIXED nameloc
     ID      ID    #IMPLIED  -- From ContentRefSpec --
 >
 <!ELEMENT dataloc O O (marklist) >
 <!ATTLIST dataloc
     HyTime  NAME    #FIXED dataloc
     ID      ID      #IMPLIED  -- Defined by processor --
     locsrc  IDREFS  #IMPLIED  -- From ContentRefSpec --
 >

The ID= attribute on ContentRefSpec is not of type ID, so its value will
not be part of the SGML ID name space, but we can define an application
semantic that uses the ContentRefSpec ID= value to provide the now implied
nameloc ID. Because ID= is required on ContentRefSpec, it is effectively
required for nameloc, thus meeting the requirement of ISO 10744.
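
To make the idea of resolving a location chain into a single value a bit
more concrete, here is a minimal Python sketch of what an application or
entity manager might do once it has the data entity's text and a list of
marker pairs. Nothing here comes from a real HyTime engine: the names
resolve_dataloc and resolve_content are hypothetical, the marker convention
(1-based first marker; positive second marker is a count, negative second
marker is the last position counted back from the end) is my assumption
and should be checked against ISO 10744, and aggregating by plain
concatenation is likewise assumed.

```python
from typing import List, Tuple


def resolve_dataloc(data: str, first: int, second: int) -> str:
    """Return the span of `data` selected by one marker pair.

    Assumed convention (hypothetical, not taken from the post):
    `first` is the 1-based position of the first character; a positive
    `second` is a character count; a negative `second` is the position
    of the last character counted back from the end (-1 = final one).
    """
    if first < 1 or second == 0:
        raise ValueError("unsupported marker pair in this sketch")
    start = first - 1  # convert to a 0-based index
    if second > 0:
        end = start + second
    else:
        end = len(data) + second + 1
    if not (0 <= start < end <= len(data)):
        raise ValueError("marker pair falls outside the data")
    return data[start:end]


def resolve_content(data: str, datalocs: List[Tuple[int, int]]) -> str:
    """Aggregate several data locations into one value.

    Concatenation in document order is an assumption of this sketch.
    """
    return "".join(resolve_dataloc(data, a, b) for a, b in datalocs)


# Hypothetical stand-in for the contents of the "thedata" entity.
entity_text = "abcdefghij"
print(resolve_content(entity_text, [(3, 4), (10, -1)]))  # prints "cdefj"
```

Keeping the per-location step separate from the aggregation step mirrors
the structure above: each dataloc yields one span, and the referencing
element sees only the combined result.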

If we further define the Dataloc ID to be implied by the containment
structure, we reduce the actual required markup to:

 <ContentRefSpec id=dataref1 datacontainer=thedata>
 <marklist>1 1
 </ContentRefSpec>

Similarly, the location source specification is simplified by the ENTITY
attribute DataContainer, which is used to create a virtual nameloc for
the ID-to-entity mapping needed by dataloc.

The nameloc, nmlist, and dataloc elements are all omissible and the
required attributes are now implied unambiguously. The ContentRefSpec
clearly identifies the application-specific semantic of this particular
hyperlink and the details of the HyTime stuff are hidden (or at least
hideable).

This does require that the HyTime elements not be part of the general
content of your application, because it would be a violation to allow
nameloc with implied ID outside a context that guarantees specification
of an ID by some mechanism.

Eliot Kimber                     Internet: drmacro@ralvm13.vnet.ibm.com
Dept E14/B500                    IBMMAIL:  USIB2DK9@IBMMAIL
Network Programs Information Development    Phone: 1-919-543-7091
IBM Corporation
Research Triangle Park, NC 27709