- Newsgroups: comp.text.sgml
- Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!linus!linus.mitre.org!thelonius!john
- From: john@thelonius.mitre.org (John D. Burger)
- Subject: Separation of text and markup
- Message-ID: <1993Jan8.194217.9341@linus.mitre.org>
- Sender: john@thelonius (John D. Burger)
- Nntp-Posting-Host: thelonius.mitre.org
- Organization: Artificial Intelligence Center, MITRE Corporation
- References: <C0BsDK.5G2@undergrad.math.waterloo.edu> <19930105.004@erik.naggum.no>
- Date: Fri, 8 Jan 1993 19:42:17 GMT
- Lines: 26
-
- Erik Naggum (erik@naggum.no) writes:
-
- If you wish to separate the markup from the text completely, you will need to
- establish a means to identify the start and end of all elements in the
- separated text. This is certainly _possible_, but you wouldn't want to do it,
- because you would end up with more things to take care of than writing an
- SGML-aware editor from scratch.
-
- We're developing a text understanding system that marks up the source with
- semantic information in SGML (e.g. <VIOLENCE victim="...">). What we will
- eventually have is a SUITE of such processors, from simple (but fast) to
- sophisticated (but slow). Different users may run different processors on the
- same source document, and may eventually wish to merge or compare the results.
-
- What this seems to require is exactly what you describe: keeping track of
- where every element (or at least every tag) falls in the original, untagged
- document.
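
To make the idea concrete, here is a minimal sketch of separating text from markup while recording where each element sits in the tag-free text. It is only an illustration: the regular expression handles simple start and end tags with quoted attributes, not real SGML (no entities, no tag minimization, no DTD). The names `Span` and `separate`, and the attribute value "mayor" in the usage lines, are made up for the example.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    name: str    # element name, e.g. "VIOLENCE"
    attrs: dict  # attribute name -> value
    start: int   # offset of the first character in the tag-free text
    end: int     # offset one past the last character


# Simplified tag syntax (an assumption, not full SGML).
TAG = re.compile(r'<(/?)([A-Za-z][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*>')
ATTR = re.compile(r'([\w.-]+)="([^"]*)"')


def separate(marked_up):
    """Return (plain_text, spans) for a string with simple nested tags."""
    parts, spans, open_stack = [], [], []
    pos = 0     # current position in the marked-up input
    offset = 0  # current position in the tag-free output
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        parts.append(chunk)
        offset += len(chunk)
        pos = m.end()
        closing, name, raw_attrs = m.groups()
        if not closing:
            # Remember where this element starts in the plain text.
            open_stack.append((name, dict(ATTR.findall(raw_attrs)), offset))
        else:
            if not open_stack or open_stack[-1][0] != name:
                raise ValueError("unbalanced end tag: " + name)
            open_name, attrs, start = open_stack.pop()
            spans.append(Span(open_name, attrs, start, offset))
    parts.append(marked_up[pos:])
    if open_stack:
        raise ValueError("unclosed element: " + open_stack[-1][0])
    return "".join(parts), spans


# Usage with a hypothetical tagged sentence.
text, spans = separate(
    'Rebels <VIOLENCE victim="mayor">shot the mayor</VIOLENCE> today.')
```

Each processor's output could then be stored as a list of such spans against one shared copy of the source, which is what would make merging or comparing results possible.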
-
- Since processors might treat white space differently (e.g. two blank lines vs.
- indentation between paragraphs), it even seems that we have to index the markup
- by individual character position, not just by word or other token.
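
One way to reconcile offsets when two versions of the text differ only in white space is to align them on their non-space characters. The sketch below assumes exactly that (the processor changed nothing but white space) and raises an error otherwise; `nonspace_indices` and `map_span` are hypothetical helper names, not part of any existing tool.

```python
from bisect import bisect_left


def nonspace_indices(text):
    """Offsets of every non-white-space character in text."""
    return [i for i, ch in enumerate(text) if not ch.isspace()]


def map_span(original, processed, start, end):
    """Map the half-open span [start, end) in processed to offsets in original.

    Assumes the two strings are identical once white space is ignored.
    """
    orig_idx = nonspace_indices(original)
    proc_idx = nonspace_indices(processed)
    if [original[i] for i in orig_idx] != [processed[i] for i in proc_idx]:
        raise ValueError("texts differ in more than white space")
    first = bisect_left(proc_idx, start)   # rank of first non-space char in span
    last = bisect_left(proc_idx, end) - 1  # rank of last non-space char in span
    if first > last:
        raise ValueError("span contains no non-space characters")
    return orig_idx[first], orig_idx[last] + 1


# Usage: the same two paragraphs, separated by a blank line in one version
# and by indentation in the other.
original = "First para.\n\nSecond para here.\n"
processed = "First para.\n    Second para here.\n"
p_start = processed.index("Second")
o_start, o_end = map_span(original, processed, p_start, p_start + len("Second para"))
```

Spans that begin or end on white space are trimmed to the nearest non-space characters, which is a design choice of this sketch rather than a requirement.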
-
- Are there any tools we can take advantage of that already do this?
-
- John Burger
- john@mitre.org
-