NetNews Usenet Archive 1993 #1

Text File | 1993-01-10 | 5.5 KB | 100 lines

Newsgroups: comp.text.sgml
Path: sparky!uunet!mcsun!sunic!aun.uninett.no!nuug!ifi.uio.no!enag
From: Erik Naggum <SGML@ifi.uio.no>
Message-ID: <19930111.002@erik.naggum.no>
Date: 11 Jan 1993 02:49:54 +0100
Subject: Normalization -- a software problem?
Lines: 91

During the discussion about data bases and data querying, it looks like it's been taken for granted that an SGML document must be "normalized" to be used by an application, such as one feeding a data base.

I don't understand this. Maybe I need to have my view of SGML processing shaken, but I've always thought that you want to read and process an SGML document only with an SGML parser, because whatever you use to process it will have to be an SGML parser if it is to do things right.

To illustrate: in the very beginning of my contact with SGML, I tried writing my own "office document application" which was supposed to use SGML, but of course it only understood a very limited input language, syntactically SGML-like, but semantically far from SGML, and I've seen and heard about lots of software that makes basically the same mistake I did.

I assume that the intention with a "normalized" SGML document is to make it easier for this kind of half-hearted effort to work with SGML documents. If I'm right, then we have a problem.

SGML has very powerful abstraction mechanisms that programmers want to access. SGML is also a notation to make those abstractions representable in character strings (or files). When you want to access the abstract level, the element structure, you can only get at it if you parse the character string, and most programmers are not likely to want to do this, as it implies a lot of tedium, and some very intricate details that are hard to get right without imposing application-specific conventions that only serve to make life harder for both users and programmers, not to mention the documents that have to live with them.

The solution to the problem is to put a powerful SGML parser between the input file and the application. This sounds obvious, and it is. The next problem is that people don't do this, and that they have good reasons for not doing it: it's not lack of availability of SGML software, but lack of availability of libraries with a well-defined, clean interface that allows the program to open any number of parsing instances on any number of entities (files or character strings) in parallel, access and navigate through the element structure, and deal with both the character string representation and the element structure at the same time.

I'm talking about the utilities that are all over the place to help us work with text. Where's the "grep" for SGML documents? Where are the regular expression matching routines that are aware of the position of a match in the element structure? Where are the concrete syntax converters? Where's the software to provide us with abstraction from the character set mess out there?

My question boils down to this: If we had all or some of these utilities, and a good library-type parser that would allow us to read and write SGML documents while the programs saw the element structure, why would anyone need normalization to make life "easier" for application programs? And would they gripe about short references, data tags, minimization, etc.?

The way I see it, a conceptually clean layering between the character string representation and the element structure needs to be established. I see some very good arguments about SGML documents on the basis of the element structure, but they seem to have problems with separating it from the character string representation, and the attendant features of SGML. I'm not competent to discuss many of the suggestions I've seen, but I'm alarmed at the preoccupation with the character string representation, with comments and tag minimization, etc.

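To make the layering concrete, here is a minimal sketch in Python of what a structure-aware "grep" might look like once a parser sits between the file and the application. Everything in it is an assumption for illustration: the event model (start, data, end), the name sgml_grep, and the hard-coded event list standing in for the output of a real SGML parser. The point is only that the tool works on the element structure and never sees tags, minimization, or short references.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event model, assumed for this sketch only:
# ("start", element name), ("data", character data), ("end", element name).
Event = Tuple[str, str]


def sgml_grep(events: Iterable[Event], pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (element path, matched text) for each regex match in character data."""
    regex = re.compile(pattern)
    stack: List[str] = []  # names of the currently open elements, outermost first
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif kind == "data":
            for match in regex.finditer(value):
                yield "/".join(stack), match.group(0)


# Stand-in for what a real parser would deliver. How the source text was
# marked up (omitted tags, short references, ...) is invisible at this level.
events = [
    ("start", "memo"),
    ("start", "to"), ("data", "Erik"), ("end", "to"),
    ("start", "body"),
    ("start", "p"), ("data", "Parse first, then search."), ("end", "p"),
    ("end", "body"),
    ("end", "memo"),
]

matches = list(sgml_grep(events, r"[Pp]arse"))
for path, text in matches:
    print(path, text)
```

The match is reported together with its position in the element structure (here, the path of open elements), which is exactly the information a character-string search cannot give.
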
Admittedly, I haven't seen a lot of SGML software, but the programs that I've seen and heard of have one thing in common: they want to run the show and do everything themselves. I, on the other hand, want components from which I can build applications where the SGML parser is a transparent part of the I/O subsystem, and the entity manager takes care of all my file system access needs. Maybe this exists, but I'm unhappy with the focus on the user's view, with SGML-sensitive editors and more or less support for SGML at various levels.

It seems to me that we have a long way to go, and it's not a matter of technology, but of approach and design philosophy. Not that this is unique to the SGML community, of course; it's the rampant user-only orientation and the trend towards a primacy of visual results that I'd like to see reversed so people could get complicated information processing tasks done.

Maybe I've seen too little of the market and of what people do with SGML, and I have reason to believe that what I've seen may not be completely representative, but for most of the "information technology" industry, it's a sorry sight. Don't get me wrong: I appreciate better visual results, but not at the expense of software technology making a 10-year leap backwards with respect to the things that we want to _do_ with the information.

To sum it up: do people need normalization because they only have SGML parsers tied up in programs that won't let them use components of it?

Best regards,
</Erik>
--
Erik Naggum             ISO  8879 SGML       +47 295 0313
Oslo, Norway            ISO 10744 HyTime
<erik@naggum.no>        ISO  9899 C          Memento, terrigena
<SGML@ifi.uio.no>       ISO 10646 UCS        Memento, vita brevis