- Path: sparky!uunet!ralvm13.VNET.IBM.COM
- From: drmacro@ralvm13.VNET.IBM.COM
- Message-ID: <19930110.211516.255@almaden.ibm.com>
- Date: Sat, 9 Jan 93 10:25:48 EST
- Newsgroups: comp.text.sgml
- Subject: Re: Separation of text and markup
- Disclaimer: This posting represents the poster's views, not those of IBM
- News-Software: UReply 3.1
- References: <C0BsDK.5G2@undergrad.math.waterloo.edu> <19930105.004@erik.naggum.no> <1993Jan8.194217.9341@linus.mitre.org>
- <19930108.007@erik.naggum.no>
- Lines: 176
-
- In <19930108.007@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes:
- >I've thought a little more about this, and maybe someone out there can use
- >an idea I had. It goes like this:
- >
- >Assume that we have text files that do not contain any markup, i.e. so that
- >we won't need to parse their contents in any way if we refer to and extract
- >text from them, and, as a corollary, element contents are not split across
- >files (entities).
- >
- >Assume that we have a document type which does not have both elements and
- >data in element contents.
- >
- >Assume that we have an entity manager that can extract portions of an
- >entity given an entity, a starting point and a length, however specified.
- >
- >For each element that contains PCDATA in the DTD, define three attributes
- >
- > contents ENTITY #CONREF
- > start NUMBERS #IMPLIED
- > length NUMBERS #IMPLIED
- >
- >(with the understanding that _start_ and _length_ are REQUIRED if
- >_contents_ is specified).
- >
- >The start and length are both NUMBERS because we might have a record-based
- >storage system that must address a given character by record and position
- >within a record, as well as one that can address characters individually.
- >
- >Then, the element structure can be kept in a "regular" SGML document, which
- >in the document type declaration subset identifies the entities to which
- >references are made in the start-tags in this document, and the application
- >can request the element contents from the entity manager while receiving
- >the element structure from the parser.
- >
- >I know of no entity manager that allows this flexibility, but that doesn't
- >mean none exist, or can't exist. (I also imagine that this kind of data
- >extraction service will be required by an entity manager to be used with
- >HyTime engines.)
-
- You can get the same functionality by using one level of indirection
- and standard HyTime addressing elements. Simply define a location
- ladder that ends with a data location element that specifies the
- location and extent of the data to grab (in these examples I've
- simplified as allowed by HyTime for brevity):
-
- <!-- First define common attributes to do content-reference: -->
- <!ENTITY % conref.att
- "ContentSource IDREF #CONREF -- source of element content --
- HyNames CDATA #FIXED 'ContentSource linkend'
- HyTime NAME #FIXED clink" -- conceptually a clink --
- >
-
- <!-- Now define element to contain HyTime elements, for neatness: -->
- <!ELEMENT ContentRefSpec - - (nameloc, dataloc+) >
- <!ATTLIST ContentRefSpec HyTime NAME #FIXED hybrid>
-
- <!-- Now define nameloc and dataloc elements. These are standard
- HyTime. DatalocSource defines source of data for dataloc. -->
- <!ELEMENT (nameloc | DataLocSource)
- O O (nmlist) -- nmquery omitted for simplicity --
- >
- <!ATTLIST (nameloc | DataLocSource)
- HyTime NAME #FIXED nameloc
- ID ID #REQUIRED
- >
- <!ELEMENT nmlist O O (#PCDATA) -- lextype(NAMES) -->
- <!-- Attlist omitted for simplicity -->
- <!-- Now define Dataloc element, again, HyTime: -->
-
- <!ELEMENT dataloc O O (dimspec*) -- Will define data locations -->
- <!-- Content model reduced to 'dimspec' for simplicity -->
- <!ATTLIST dataloc HyTime NAME #FIXED dataloc
- id ID #REQUIRED
- locsrc IDREFS #REQUIRED -- no logical default --
- -- other attributes omitted for simplicity --
- >
-
- <!ELEMENT dimspec O O (marklist, marklist) >
- <!ELEMENT marklist - O (#PCDATA) -- Lextype(snzi*) -->
- <!-- Content model reduced to '#PCDATA' for simplicity -->
-
- <!-- Now define our application-specific elements. All will
- use the common content reference attributes: -->
- <!ELEMENT MyData - - (%mydata.el;)* -- typical element -->
- <!ATTLIST MyData
- %conref.att;
- >
-
- I've defined the omission indicators such that for manual
- typing, you only have to specify a few of the tags to
- create a reference:
-
- <!-- In the DTD, define the entity that contains the data: -->
-
- <!ENTITY thedata SYSTEM "data.file" CDATA >
-
- <!-- Now use DataLocSource to create the entity reference. This
- will be referred to from DataLoc with the locsrc= attribute: -->
- <DataLocSource id=mydata nametype=entity>thedata</>
-
- <!-- stream-oriented (one dimension) reference: -->
- <!-- Note that ContentRefSpec is implied by nameloc. -->
- <nameloc id=dataref1>dimspec1
- <dataloc id=dimspec1 locsrc=mydata>
- <marklist>10 100
- <MyData ContentSource=dataref1>
-
- <!-- line-oriented (two dimensions) reference: -->
- <nameloc id=dataref2>dimspec2
- <dataloc id=dimspec2 locsrc=mydata>
- <marklist>1 1<!-- First record -->
- <marklist>10 -1<!-- 10th through last character -->
- <MyData ContentSource=dataref2>
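-
- To make the two references above concrete, here is a rough Python
- sketch of how a processor might resolve them. It is only an
- illustration, not part of HyTime: the function names are made up,
- and it assumes each marklist is a pair of 1-based markers in which
- a positive second marker is an extent and a negative second marker
- is the last position counted back from the end (which is how the
- comments in the examples read).

```python
def resolve_dim(first, second, size):
    """Turn one marker pair into a 0-based, half-open (start, stop) range.

    Assumed semantics (hypothetical helper, not a HyTime API):
      first  > 0  position counted from the start (1-based)
      first  < 0  position counted back from the end (-1 is the last)
      second > 0  extent, i.e. number of quanta to take
      second < 0  last position, counted back from the end
    """
    if first == 0 or second == 0:
        raise ValueError("markers must be nonzero")
    start = first - 1 if first > 0 else size + first
    stop = start + second if second > 0 else size + second + 1
    if not (0 <= start < stop <= size):
        raise ValueError("marker pair falls outside the data")
    return start, stop


def extract_stream(text, marks):
    """Stream-oriented (one dimension): marks addresses characters."""
    start, stop = resolve_dim(marks[0], marks[1], len(text))
    return text[start:stop]


def extract_records(records, record_marks, char_marks):
    """Line-oriented (two dimensions): pick records, then characters in each."""
    r0, r1 = resolve_dim(record_marks[0], record_marks[1], len(records))
    selected = []
    for record in records[r0:r1]:
        c0, c1 = resolve_dim(char_marks[0], char_marks[1], len(record))
        selected.append(record[c0:c1])
    return selected
```

- With this reading, extract_stream(text, (10, 100)) corresponds to
- dataref1 and extract_records(records, (1, 1), (10, -1)) to dataref2.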
-
- Interestingly enough, Erik's notation could be transformed
- into what I've defined above on the fly by an application.
- It would not itself be HyTime conforming or directly
- processible by a HyTime system, but it could be transformed
- into "virtual SGML" dynamically and then passed to a HyTime
- system. Note also that the above method allows the
- use of multiple data locations that can be aggregated
- into a single value when the location chain is resolved.
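-
- As a sketch of that on-the-fly transformation, the following Python
- turns one of Erik's contents/start/length triples into the markup
- used above. Everything in it is an assumption for illustration: the
- function name, the generated ID scheme (src<n>, dimspec<n>,
- dataref<n>), and in particular the pairing of the i-th start number
- with the i-th length number to form one marklist per dimension.

```python
def erik_to_hytime(n, entity, start, length):
    """Build a location ladder from an entity name plus start/length NUMBERS.

    n is a counter used to invent unique IDs (hypothetical scheme).
    start and length are sequences of integers of equal length; each
    (start[i], length[i]) pair becomes one marklist (an assumption).
    Returns the ID to put in ContentSource and the generated markup.
    """
    if len(start) != len(length):
        raise ValueError("start and length need the same number of values")
    src, dim, ref = f"src{n}", f"dimspec{n}", f"dataref{n}"
    lines = [
        f"<DataLocSource id={src} nametype=entity>{entity}</>",
        f"<nameloc id={ref}>{dim}",
        f"<dataloc id={dim} locsrc={src}>",
    ]
    # One marklist per addressed dimension.
    lines += [f"<marklist>{s} {l}" for s, l in zip(start, length)]
    return ref, "\n".join(lines)


if __name__ == "__main__":
    ref, markup = erik_to_hytime(1, "thedata", (10,), (100,))
    print(markup)
    print(f"<MyData ContentSource={ref}>")
```

- The generated text never has to exist in a file; the application
- can hand it straight to the HyTime system.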
-
- You can also define some application-specific simplifications.
- For example, since I have namelocs and datalocs contained within
- a common element, I could define the processing semantic for
- my application such that the nameloc element is implied by
- the existence of the dataloc elements within the ContentRefSpec
- element. The application defines the nameloc ID, perhaps by taking
- the nameloc ID value from an attribute of the ContentRefSpec:
-
- <!ELEMENT ContentRefSpec - - (nameloc, dataloc+) >
- <!ATTLIST ContentRefSpec HyTime NAME #FIXED hybrid
- id NAME #REQUIRED -- defines nameloc ID --
- DataContainer ENTITY #REQUIRED -- defines locsrc value --
- >
- <!ELEMENT nameloc O O (nmlist) -- nmquery omitted for simplicity -- >
- <!ATTLIST nameloc HyTime NAME #FIXED nameloc
- ID ID #IMPLIED -- From ContentRefSpec --
- >
- <!ELEMENT dataloc O O (marklist) >
- <!ATTLIST dataloc HyTime NAME #FIXED dataloc
- ID ID #IMPLIED -- Defined by processor --
- locsrc IDREFS #IMPLIED -- From ContentRefSpec --
- >
-
- The ID= attribute on ContentRefSpec is not of type ID, so
- its value will not be part of the SGML ID name space, but we
- can define an application semantic that uses the ContentRefSpec
- ID= value to provide the now implied nameloc ID. Because
- ID= is required on ContentRefSpec, it is effectively required
- for nameloc, thus meeting the requirement of ISO 10744. If
- we further define the Dataloc ID to be implied by the containment
- structure, we reduce the actual required markup to:
-
- <ContentRefSpec id=dataref1 datacontainer=thedata>
- <marklist>1 1
- </ContentRefSpec>
-
- Similarly, the location source specification is simplified by the
- ENTITY attribute DataContainer, which is used to create a virtual
- nameloc for the ID-to-entity mapping needed by dataloc. The nameloc,
- nmlist, and dataloc elements are all omissible and the required
- attributes are now implied unambiguously. The ContentRefSpec clearly
- identifies the application-specific semantic of this particular
- hyperlink and the details of the HyTime stuff are hidden (or at least
- hideable). This does require that the HyTime elements not be part of the
- general content of your application, because it would be a violation
- to allow nameloc with implied ID outside a context that guarantees
- specification of an ID by some mechanism.
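-
- Because the ContentRefSpec ID= value is a NAME rather than an ID,
- the parser will not check it for uniqueness, so the application has
- to. A minimal Python sketch of that bookkeeping follows. The class
- and function names, the ".src"/".loc" suffixes for the derived IDs,
- and the dictionary layout are all invented for illustration; it
- also assumes names are folded to upper case, as under the reference
- concrete syntax, while the entity name is left untouched.

```python
class VirtualIds:
    """IDs the application synthesizes, checked against the parser's IDs."""

    def __init__(self, parser_ids=()):
        # Start from the IDs the parser already knows about.
        self._seen = {name.upper() for name in parser_ids}

    def claim(self, name):
        """Register a new ID, rejecting any clash with an existing one."""
        key = name.upper()
        if key in self._seen:
            raise ValueError(f"duplicate ID: {name}")
        self._seen.add(key)
        return key


def expand_content_ref_spec(spec_id, data_container, marklists, ids):
    """Expand the reduced ContentRefSpec markup into its implied elements."""
    nameloc_id = ids.claim(spec_id)            # nameloc ID from ContentRefSpec
    source_id = ids.claim(f"{spec_id}.src")    # virtual ID-to-entity nameloc
    dataloc_id = ids.claim(f"{spec_id}.loc")   # dataloc ID implied by containment
    return {
        "source": {"id": source_id, "nametype": "entity",
                   "names": [data_container]},
        "nameloc": {"id": nameloc_id, "names": [dataloc_id]},
        "dataloc": {"id": dataloc_id, "locsrc": source_id,
                    "marklists": list(marklists)},
    }
```

- For the reduced markup shown earlier, the call would be
- expand_content_ref_spec("dataref1", "thedata", [(1, 1)], ids); a
- second ContentRefSpec reusing the same ID= value raises an error
- instead of silently producing two namelocs with one ID.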
-
- Eliot Kimber Internet: drmacro@ralvm13.vnet.ibm.com
- Dept E14/B500 IBMMAIL: USIB2DK9@IBMMAIL
- Network Programs Information Development Phone: 1-919-543-7091
- IBM Corporation
- Research Triangle Park, NC 27709
-