NetNews Usenet Archive 1993 #1

Newsgroups: comp.text.sgml
Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!linus!linus.mitre.org!thelonius!john
From: john@thelonius.mitre.org (John D. Burger)
Subject: Separation of text and markup
Message-ID: <1993Jan8.194217.9341@linus.mitre.org>
Sender: john@thelonius (John D. Burger)
Nntp-Posting-Host: thelonius.mitre.org
Organization: Artificial Intelligence Center, MITRE Corporation
References: <C0BsDK.5G2@undergrad.math.waterloo.edu> <19930105.004@erik.naggum.no>
Date: Fri, 8 Jan 1993 19:42:17 GMT
Lines: 26

Erik Naggum (erik@naggum.no) writes:

> If you wish to separate the markup from the text completely, you
> will need to establish a means to identify the start and end of all
> elements in the separated text. This is certainly _possible_, but
> you wouldn't want to do it, because you would end up with more
> things to take care of than writing an SGML-aware editor from
> scratch.

We're developing a text understanding system that marks up the source
with semantic information in SGML (e.g. <VIOLENCE victim="...">).
What we will eventually have is a SUITE of such processors, from
simple (but fast) to sophisticated (but slow). Different users may
run different processors on the same source document, and may
eventually wish to merge or compare the results.

What this seems to require is exactly what you describe, namely,
keeping track of where every element (or at least every tag) is, in
terms of the original document. Since processors might treat white
space differently (e.g. two blank lines vs. indentation for
paragraphs), it even seems that we have to index the SGML down to the
individual characters, not just words or other tokens.

Are there any tools we can take advantage of that already do this?

John Burger
john@mitre.org
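
To make the character-level indexing idea concrete, here is a minimal sketch in Python of keeping tags apart from the text as (offset, tag) pairs, where each offset is a character position in the untagged source. Everything in it is an assumption made for illustration, not an existing tool: the names `separate` and `merge` are hypothetical, the `victim="mayor"` value is invented, and the tag pattern is a simplification that does not handle a `>` inside an attribute value or SGML features such as tag minimization.

```python
import re

# Simplistic tag pattern (an assumption): no '>' inside attribute values.
TAG = re.compile(r"<[^>]+>")


def separate(marked_up):
    """Split marked-up text into plain text plus a list of (offset, tag).

    Each offset is a character position in the plain text, so tags from
    different processors can refer to the same source document.
    """
    plain = []
    tags = []
    pos = 0      # read position in marked_up
    offset = 0   # current length of the plain text
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        plain.append(chunk)
        offset += len(chunk)
        tags.append((offset, m.group()))
        pos = m.end()
    plain.append(marked_up[pos:])
    return "".join(plain), tags


def merge(plain, *tag_lists):
    """Re-insert tags from one or more processors into the plain text."""
    # Stable sort by offset keeps each list's own order at equal offsets.
    all_tags = sorted((t for tl in tag_lists for t in tl), key=lambda t: t[0])
    out = []
    pos = 0
    for offset, tag in all_tags:
        out.append(plain[pos:offset])
        out.append(tag)
        pos = offset
    out.append(plain[pos:])
    return "".join(out)


source = 'Rebels <VIOLENCE victim="mayor">shot</VIOLENCE> the mayor.'
plain, tags = separate(source)
print(plain)
print(tags)
print(merge(plain, tags, [(0, "<NP>"), (6, "</NP>")]))
```

Note that `merge` only interleaves tags by offset; it makes no attempt to check that elements from different processors nest properly, which a real merging tool would have to deal with.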