- Newsgroups: comp.text.sgml
- Path: sparky!uunet!mcsun!sunic!aun.uninett.no!nuug!ifi.uio.no!enag
- From: Erik Naggum <SGML@ifi.uio.no>
- Message-ID: <19930111.002@erik.naggum.no>
- Date: 11 Jan 1993 02:49:54 +0100
- Subject: Normalization -- a software problem?
- Lines: 91
-
- During the discussion about data bases and data querying, it looks like
- it's been taken for granted that an SGML document must be "normalized" to
- be used by an application, such as one feeding a data base.
-
- I don't understand this. Maybe I need to have my view of SGML processing
- shaken, but I've always thought that you want to read and process an SGML
- document only with an SGML parser because whatever you use to process it
- will have to be an SGML parser if it is to do things right.
-
- To illustrate: in the very beginning of my contact with SGML, I tried
- writing my own "office document application" which was supposed to use
- SGML, but of course it only understood a very limited input language,
- syntactically SGML-like, but semantically far from SGML, and I've seen and
- heard about lots of software that makes basically the same mistake I did.
- I assume that the intention with a "normalized" SGML document is to make it
- easier for this kind of half-hearted effort to work with SGML documents.
-
- If I'm right, then we have a problem. SGML has very powerful abstraction
- mechanisms that programmers want to access. SGML is also a notation to
- make those abstractions representable in character strings (or files).
- When you want to access the abstract level, the element structure, you can
- only get at it if you parse the character string, and most programmers are
- not likely to want to do this, as it implies a lot of tedium, and some very
- intricate details that are hard to get right without imposing application-
- specific conventions that only serve to make life harder for both users
- and programmers, not to mention the documents that have to live with them.
-
- The solution to the problem is to put a powerful SGML parser between the
- input file and the application. This sounds obvious, and it is. The next
- problem is that people don't do this, and that they have good reasons for
- not doing it: it's not lack of availability of SGML software, but lack of
- availability of libraries with a well-defined, clean interface that allows
- the program to open any number of parsing instances on any number of
- entities (files or character strings) in parallel, access and navigate
- through the element structure, and deal with both the character string
- representation and the element structure at the same time. I'm talking
- about the utilities that are all over the place to help us work with text.
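
To make the kind of interface described above a little more concrete, here is a minimal Python sketch. It is an illustration under heavy assumptions, not an SGML parser: it borrows the standard library's html.parser.HTMLParser as a stand-in tokenizer, which knows nothing about DTDs, short references, data tags, or markup minimization. The names StructureParser, path, and events are hypothetical. What it shows is the shape of the thing: several independent parsing instances open at once, each exposing the element structure together with the position in the character string.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Stand-in for a library-style parser (NOT a real SGML parser).

    Each instance keeps its own state, so any number can be open at once.
    For every non-blank run of character data it records the element path
    and the (line, column) in the source where that run starts.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.path = []    # currently open elements, outermost first
        self.events = []  # (element path, (line, column), text)

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # ignore end tags that match nothing.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.events.append((tuple(self.path), self.getpos(), data))


if __name__ == "__main__":
    # Two entities parsed "in parallel": input is fed in interleaved pieces.
    a, b = StructureParser(), StructureParser()
    a.feed("<memo><to>Ann</to>")
    b.feed("<report><heading>Q1 results</heading>")
    a.feed("<para>Hi</para></memo>")
    b.feed("</report>")
    a.close()
    b.close()
    for path, pos, text in a.events + b.events:
        print("/".join(path), pos, repr(text))
```

The application here never touches a tag as text; it sees only paths, positions, and data. A real component would have to do the same while honoring the DTD and the concrete syntax, which is exactly the part this sketch leaves out.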
-
- Where's the "grep" for SGML documents? Where are the regular expression
- matching routines that are aware of the position of a match in the element
- structure? Where are the concrete syntax converters? Where's the software
- to provide us with abstraction from the character set mess out there?
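
As a hedged sketch of what a structure-aware "grep" might look like, the Python below reports each regular expression match together with the element path it occurs in and its line number. The same caveat applies as for any quick tool of this kind: it leans on the standard library's html.parser.HTMLParser as a stand-in, which is not an SGML parser and ignores the DTD, minimization, and short references. The function name element_grep and its within parameter are hypothetical, chosen for illustration.

```python
import re
from html.parser import HTMLParser


class _PathTracker(HTMLParser):
    """Tracks open elements and collects regex hits in character data."""

    def __init__(self, regex):
        super().__init__(convert_charrefs=True)
        self.regex = regex
        self.path = []  # currently open elements, outermost first
        self.hits = []  # (element path, line number, matched text)

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element, if there is one.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        start_line, _ = self.getpos()  # line where this data run begins
        where = "/" + "/".join(self.path)
        for m in self.regex.finditer(data):
            # Add the newlines that precede the match within this run.
            # (Approximate if a character reference expands to a newline.)
            line = start_line + data.count("\n", 0, m.start())
            self.hits.append((where, line, m.group(0)))


def element_grep(pattern, text, within=None):
    """Return (path, line, match) for each match of pattern in the data.

    If within is given, keep only matches inside an element of that name.
    """
    tracker = _PathTracker(re.compile(pattern))
    tracker.feed(text)
    tracker.close()
    if within is None:
        return tracker.hits
    return [h for h in tracker.hits if within in h[0].split("/")]
```

Called as element_grep(r"[Nn]ormalization", text, within="para"), it would return only the matches that fall inside a para element, each with its path and line. Matches never span markup, since the pattern is applied to one run of character data at a time; whether that is the right behavior is a design question a real tool would have to answer.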
-
- My question boils down to this: If we'd had all or some of these utilities,
- a good library-type parser that would allow us to read and write SGML
- documents, while the programs saw the element structure, etc., why would
- anyone need normalization to make life "easier" for application programs?
- And would they gripe about short references, data tags, minimization, etc?
-
- The way I see it, a conceptually clean layering between the character
- string representation and the element structure needs to be established. I
- see some very good arguments about SGML documents on the basis of the
- element structure, but they seem to have problems with separating it from
- the character string representation, and the attendant features of SGML.
- I'm not competent to discuss many of the suggestions I've seen, but I'm
- alarmed at the preoccupation with the character string representation, with
- comments and tag minimization, etc.
-
- Admittedly, I haven't seen a lot of SGML software, but those that I've seen
- and heard of have one thing in common: they want to run the show and do
- everything themselves. I, on the other hand, want components from which I
- can build applications where the SGML parser is a transparent part of the
- I/O subsystem, and the entity manager takes care of all my file system
- access needs. Maybe this exists, but I'm unhappy with the focus on the
- user's view, with SGML sensitive editors and more or less support for SGML
- at various levels.
-
- It seems to me that we have a long way to go, and it's not a matter of
- technology, but of approach and design philosophy.
- Not that this is unique to the SGML community, of course; it's the rampant
- user-only orientation and the trend towards a primacy of visual results
- that I'd like to see reversed so people could get complicated information
- processing tasks done.
-
- Maybe I've seen too little of the market and of what people do with SGML,
- and I have reason to believe that what I've seen may not be completely
- representative, but for most of the "information technology" industry, it's
- a sorry sight. Don't get me wrong: I appreciate better visual results, but
- not at the expense of software technology making a 10-year leap backwards
- with respect to the things that we want to _do_ with the information.
-
- To sum it up: do people need normalization because they only have SGML
- parsers tied up in programs that won't let them use components of it?
-
- Best regards,
- </Erik>
- --
- Erik Naggum ISO 8879 SGML +47 295 0313
- Oslo, Norway ISO 10744 HyTime
- <erik@naggum.no> ISO 9899 C Memento, terrigena
- <SGML@ifi.uio.no> ISO 10646 UCS Memento, vita brevis
-