WWW specification: understanding the Web (*)

Haim Kilov
Bellcore, MRE-2F049
445 South Street
Morristown, NJ 07960
haim@cc.bellcore.com

WWW has been described in terms of navigation, i.e., nodes, links, 
and keywords. This description is based on currently existing 
technology. However, users still complain about getting lost -- and 
properly so! There exists a need to understand the Web in service-
oriented terms, i.e., in terms of intellectual contents of WWW 
information. Only in this manner will it be possible to provide a 
better roadmap -- in fact, a concept map -- for WWW.

This paper will show how information modeling concepts (rooted in 
programming methodology concepts) may be used to understand 
Web documents.


Abstraction and precision

As the Web is a very good example of an open system, it is possible 
to use the reference model for Open Distributed Processing [ODP 2] 
as a framework for understanding. This framework is based on 
information semantics rather than on syntactic details and describes 
important concepts essential for formulating, understanding, and 
implementing a specification.

Information -- intellectual contents of documents -- has to be 
specified in an abstract manner. Abstraction (suppression of 
irrelevant details to enhance understanding [ODP 2]) is essential 
because humans cannot deal with large amounts of "unstructured" 
information. Therefore, to understand documents, higher-level 
precisely-defined constructs, such as composition, dependency, 
reference, and so on, have to be used instead of links [Kilov 94].

Existing approaches to understanding (hypertext) documents have 
too often been based on existing products. Document users were 
less than happy with these approaches. There have already been 
requests and papers on the need for a better framework to deal with 
documents ("unstructured data"), some using SQL as an example 
[Eddison 94]. Indeed, SQL provides modeling facilities on a higher 
level than links and keyword search, but these facilities are still 
quite rudimentary with respect to information semantics and are 
based more on technology (relational DBMSs) than on the information 
viewpoint. The model for and the documentation of a system need 
not be the same as the model for and the documentation of the 
enterprise (or an application -- a task --) for which the system has 
been created [Raven 94]. Although certain object concepts are being 
incorporated in SQL, there still are problems in reconciling the 
existing value-based SQL framework with the identity-based object-
oriented one. These issues are outside the scope of document modeling, 
but the general need to better understand and model documents is 
well exemplified by efforts of this kind.

A document should be understood -- and specified! -- using concepts 
that describe its information contents rather than using concepts 
that describe existing software tools. More generally, understanding 
an enterprise and an application is possible only in terms of concepts 
that describe appropriate business rules [Kilov, Ross 94; ODP 2] 
rather than in terms of existing systems technology. [As an aside, it 
has been acknowledged for a while that programming concepts are 
independent of existing programming languages and software tools 
[Hehner 93, Dijkstra 76].]

Fortunately, programming methodology concepts -- the ones used 
to create and understand specifications -- apply equally well 
to describing an information model! This approach may and 
should be used to better understand and specify both traditional 
and non-traditional (e.g., Web) documents. There is no need to 
invent (from scratch) new models for representing and 
communicating information in documents because generic 
concepts from existing information models are perfectly reusable. 
This permits, in particular, reuse of the same constructs (generic 
relationships) to understand and specify both an enterprise and a 
document that describes the enterprise. Obviously, such an approach 
makes creating and documenting a specification substantially 
easier.

Observe that the specifications are usually declarative, i.e., they 
define the "what" rather than the "how". In other words, they define 
the invariants, i.e., the predicates that have to be true for the entire 
lifetime of the set of objects, as well as the pre- and postconditions 
for each operation, i.e., the predicates that have to be true for the 
operation to occur and, correspondingly, the predicates that have 
to be true immediately after the occurrence of an operation [ODP 
2]. These specifications are simple because they do not have to 
describe any algorithmic details. Business rules may well be 
defined in this manner [Kilov, Ross 94; ODP 2; Swatman 93; 
Ainsworth 94; Hayes 93; Trader 94]. This approach reuses concepts 
well-known in programming methodology [Dijkstra 76, Hehner 93].
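
A minimal sketch in Python may make the distinction concrete. It is 
an illustration only: the class DocumentCollection and the operation 
add_fragment are hypothetical names, not taken from [ODP 2] or from 
any other cited source. The invariant and the pre- and postconditions 
are stated as predicates, and nothing is said about how a collection 
is stored or searched.

```python
class DocumentCollection:
    """Hypothetical example: a collection of named document fragments."""

    def __init__(self):
        self.fragments = {}  # fragment name -> fragment text

    def invariant(self):
        # Has to be true for the entire lifetime of the collection:
        # every fragment has a non-empty name and a non-empty text.
        return all(name and text for name, text in self.fragments.items())

    def add_fragment(self, name, text):
        # Precondition: has to be true for the operation to occur.
        assert name and text and name not in self.fragments, "precondition violated"
        before = set(self.fragments)
        self.fragments[name] = text
        # Postcondition: has to be true immediately after the operation.
        assert set(self.fragments) == before | {name}, "postcondition violated"
        assert self.invariant(), "invariant violated"


collection = DocumentCollection()
collection.add_fragment("abstract", "WWW has been described in terms of navigation.")
```

The assertions play the role of the specification; the single 
assignment between them is the only "how" in the sketch.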

A specification has, first and foremost, to be understandable. To 
create such a specification, i.e., to understand information better, 
cooperation is needed between subject matter experts and modelers. 
This is well-known in information modeling. Only a formal 
specification provides complete understanding because an 
"ordinary" natural language specification may be redundant or 
inconsistent due to the properties of a natural language. However, 
the subject matter expert (SME) does not have to read and understand 
the formal notation used: formal specifications, which are very helpful 
for the modelers [Ainsworth 94, Kilov 93], may well be translated into 
stylized English for the SMEs. Indeed, essential specifications in 
[Kilov, Ross 94] are formulated both in Object Z and in stylized English, 
i.e., translated from Object Z. The English specifications are presented 
to, and can be readily understood by, a careful SME. At least some 
English specifications in [ODP 2, Trader 94] have also been translated 
from more formal ones.


"Finding an object": concept maps created by document users

A document user browses and understands the document using 
its implicit or explicit model provided by the document author. 
Ideally, a document model (concept map) should be shown explicitly, 
by presenting appropriate information specifications to the users. 
Good prefaces to good textbooks widely use constructs like 
composition, reference, and dependency for this purpose (for 
example, [Hehner 93] uses reference associations and compositions 
in a quick tour of his book). The Web provides an excellent opportunity 
for document authors to specify document models explicitly. 
However, there exist bad examples as well: when a Web document 
is presented only as an ordered composition of its pieces to be 
retrieved one after the other, this specification does not add anything 
to the intellectual contents of the document, and probably just 
relates to the document's logical layout.

To understand a document better, its user may want to make notes 
-- i.e., to create appropriate documents -- himself: "Roland Barthes 
suggested... that when you read you should be writing yourself, 
transposing the information from print to your notebook" [Ulmer 
88]. To do that, the document user has to find concepts and 
relationships between them in the original text (a student often 
implements that by highlighting appropriate fragments of the textbook 
in different ways). In the same manner, an enterprise modeler, 
together with the SME, has to find enterprise objects and 
relationships between them -- i.e., business rules that are already 
there! -- to create an information model. Obviously, different users 
of a document will find different interrelated concepts of interest, 
in the same manner as different customers of an enterprise will 
find different interrelated objects of interest. In many cases, the 
document model provided by its author is a very good framework 
for understanding its concepts. This model (concept map) should 
be accessible to the document user together with the document 
itself. However, the framework provided by the author may have to 
be appropriately extended (or restricted) by the document user. 
Understanding a document by internalizing the concepts of that 
document is made possible by finding fragments of interest and 
relationships between them and by writing about (i.e., specifying, 
in a formal or informal manner) these fragments and relationships 
(see, e.g., [Ulmer 88, Wegner 94]). Obviously, information modeling 
experience suggests good ways to formulate these specifications. 
Future users of a document may understand it better by reusing 
not only the specification of the document provided by its author, 
but also specifications of different document viewpoints provided 
by different earlier users of the document.

Thus, user-creatable roadmaps -- in fact, concept maps -- become 
essential in order to better understand a document. Obviously, 
these considerations are perfectly applicable to collections of 
documents. The Web interfaces provide convenient tools to represent 
such concept maps, but these tools are quite restricted: they usually 
are applied by document authors rather than users and hardly 
permit users to mark up document fragments not anticipated by 
the authors (a fragment may be defined as the -- existing -- unit by 
which information is picked up (Tim Berners-Lee)); they usually do 
not distinguish between different types of relationships between 
these fragments; they usually represent binary relationships; and 
they do not provide a way to find the relationships in which a 
particular document fragment participates. The latter shortcoming 
is especially important: the knowledge of relationships that refer to 
a given document fragment substantially improves understanding 
both of the fragment and of these relationships.

Again, information modeling provides important reusable concepts 
[Kilov, Ross 94] -- including a library of generic relationship types 
-- applicable for creating document concept maps. Application-
specific relationships used in these concept maps are created by 
instantiating these generic relationships for particular documents 
and their fragments. It is possible to use a graphical or a linear 
representation of these relationships, so that existing tools used to 
navigate the Web may also be used to represent document concept 
maps, i.e., to understand the documents. In this manner, it will be 
possible to distinguish between different types of relationships 
between documents; to represent both binary and non-binary 
relationships; and to provide a way to find the relationships in 
which a particular document participates.

It is important to notice that a linear representation of a relationship 
does not have explicit links: a link is a property of a particular 
(graphical) representation of a relationship rather than of the 
relationship itself. Let us look at two examples of such a linear 
representation, for Bellcore legal disclaimers [Kilov 94-1] and for 
the organization of the family of ODP standards:

DisclaimerTypes: Exhaustive Subtyping (Disclaimer, {Technical 
Analysis Disclaimer, Technical Audit Disclaimer, General 
Disclaimer}), or

ODP Refinement: Ordinary Reference (Descriptive Model [ODP 2], 
Prescriptive Model [ODP 3]).

This consideration alone shows that links are not necessary for 
understanding a relationship and therefore a hypertext or a Web 
document. The analogy with goto's or pointers in traditional 
programming is obvious [Kilov 94].
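
The two linear representations above can be held as plain data, as 
the following Python sketch suggests. The record layout (name, 
generic type, participants) and the helper relationships_of are 
assumptions made for illustration; they are not a notation defined 
in [Kilov, Ross 94]. For brevity the set braces around the subtypes 
are flattened into one ordered tuple of participants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    name: str            # application-specific relationship name
    generic_type: str    # generic relationship type being instantiated
    participants: tuple  # participating entities; may be more than two

    def linear(self):
        # Linear representation: no links, only names.
        return f"{self.name}: {self.generic_type} ({', '.join(self.participants)})"


relationships = [
    Relationship("DisclaimerTypes", "Exhaustive Subtyping",
                 ("Disclaimer", "Technical Analysis Disclaimer",
                  "Technical Audit Disclaimer", "General Disclaimer")),
    Relationship("ODP Refinement", "Ordinary Reference",
                 ("Descriptive Model [ODP 2]", "Prescriptive Model [ODP 3]")),
]


def relationships_of(entity, rels):
    """Find every relationship in which the given entity participates."""
    return [r for r in rels if entity in r.participants]


for r in relationships_of("Disclaimer", relationships):
    print(r.linear())
```

Nothing in the sketch is a pointer from one entity to another: a 
relationship is a typed record of its participants, and the question 
"in which relationships does this fragment participate?" is answered 
by a simple search over such records.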

Document users may want to acquaint themselves with examples 
of document concept maps before creating their own maps. Examples 
of this kind may be provided on the Web. Obviously, reading is 
easier than writing, and for quite a few document users it will be 
sufficient to choose among several existing concept maps for the 
same collection of documents, rather than to create their own maps. 
In the same manner, but to a larger extent, an implementor of a 
specification construct in most cases picks and chooses a particular 
element of an implementation library as a refinement of this 
construct rather than inventing his own refinement [Kilov, Ross 94; 
Welsh 94].

Several kinds of documents have a well-established information 
model. Examples of such documents are well-known: databases, 
spreadsheets, program code, error messages, etc. Fragments of these 
"structured" documents and relationships between them are quite 
rigid; a document user often cannot change them or specify new 
fragments (although in relational databases specification of new 
relationships between existing fragments is possible). In fact, software 
development may well be considered as a document-based process 
[Welsh 93, Welsh 94] which refers to both traditional documents 
(e.g., requirements) and structured ones (e.g., databases or program 
code), with the need to specify, clearly and explicitly, relevant 
fragments of these documents and relationships (e.g., reference, 
composition, dependency, and so on) between these fragments. The 
same information modeling concepts may be used to better 
understand a collection of documents of any kind.


Collective behavior

A document component cannot be understood and used in isolation: 
it is referred to in the specifications of structural and behavioral 
properties of one or, more often, several, documents. Moreover, a 
document is also not isolated: a traditional document has explicit 
and implicit references to other documents; and, as we see in the 
Web, there are quite a few explicit links between documents (but 
not between their components?). These references and links 
implement different types of relationships between documents. The 
concepts used in understanding and modeling a single document, 
such as precise and abstract specifications of relationships between 
document fragments, are perfectly (re)usable for understanding and 
modeling a collection of documents. In particular, a new composite 
document may be created by reusing fragments of different existing 
documents and organizing them in an appropriate way. The Web 
provides an excellent framework for doing so: before creating 
information, a Web user is encouraged to find and reorganize it.

The concept of specifications referring to collections of objects and, 
in particular, to collective behavior of these objects, is reasonably 
well-known. It is encountered both in information modeling [Kilov, 
Ross 94] and in several standards related to object reference models 
and open systems [ODP 2, GRM 94, OODBTG 91, FM 94]. Obviously, 
collective behavior is essential to understand any enterprise, 
including a collection of interrelated documents.


Viewpoints

The Web, like any other large information system, presents a 
challenge of specifying and reconciling several different (and 
possibly time-varying) viewpoints referring to the "same" information. 
In particular, the information and technology viewpoints should be 
clearly distinguished. For a document, be it a traditional or a Web 
document, it means distinguishing between the document's 
intellectual contents and its (logical and physical) layout. On the 
other hand, several different information viewpoints may emphasize 
different information characteristics of the "same" enterprise. As 
mentioned earlier, for a document it means distinguishing among 
different relationships between different document fragments. These 
fragments and relationships are different because different users 
are interested in different aspects of a document or a document 
collection. In particular, a document reader can create -- and 
precisely specify -- new fragments and new relationships between 
fragments of an existing document (without changing its text!). A 
good example is quoted by [Welsh 94]: in the WEB system for literate 
programming, it is possible to be interested either in the narrative 
describing the development steps, or in the consolidated copy of 
the Pascal program -- the result of this development.
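
One way to let a reader add fragments and relationships without 
changing the text is to keep them outside the document, as offsets 
into it. The Python sketch below assumes this approach; the class 
ConceptMap, its methods, and the sample text are hypothetical and 
serve only to illustrate two readers holding different viewpoints 
on the same unchanged text.

```python
TEXT = "A specification has to be understandable. Links are not essential."


class ConceptMap:
    """A reader's viewpoint: fragments and relationships kept outside the text."""

    def __init__(self, text):
        self.text = text
        self.fragments = {}      # fragment name -> (start, end) offsets
        self.relationships = []  # (generic relationship type, fragment names)

    def mark(self, name, phrase):
        start = self.text.index(phrase)  # raises ValueError if the phrase is absent
        self.fragments[name] = (start, start + len(phrase))

    def relate(self, generic_type, *names):
        # Only fragments that have already been marked may participate.
        assert all(n in self.fragments for n in names)
        self.relationships.append((generic_type, names))

    def show(self, name):
        start, end = self.fragments[name]
        return self.text[start:end]


reader_a = ConceptMap(TEXT)
reader_a.mark("claim", "Links are not essential")
reader_a.mark("goal", "understandable")
reader_a.relate("Dependency", "claim", "goal")

reader_b = ConceptMap(TEXT)
reader_b.mark("subject", "A specification")
```

Both maps refer to the same text, and neither had to be anticipated 
by the author of that text.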

Let us consider another example -- the international standard 
describing the Open Distributed Processing Trading Function 
[Trader 94]. This standard describes the means to advertise services 
and to match service offers with service requests. The reader of this 
document may be interested either in a general overview, or only 
in aspects related to the service-oriented viewpoints, or only in 
aspects related to the implementation-related viewpoints, or else in 
aspects related to formal specifications, and so on. For each of 
these viewpoints, only certain fragments of [Trader 94], together 
with certain, viewpoint-specific, relationships between these 
fragments, will be of interest. There is no need for all of these 
fragments and relationships to have been perceived as such and 
therefore to have been marked up by the authors of [Trader 94]. 
Notice that a good traditional index for a traditional document may 
provide a reasonable approximation of several concept maps 
representing several different viewpoints for a document.

Obviously, creating a particular viewpoint merges the activities of a 
traditional document reader, writer, and editor, together with the 
activities of an information modeler. Providing an explicit 
specification of a document viewpoint does not differ from providing 
an explicit specification of any enterprise or application (as P.Wegner 
noted, there exists a strong analogy between document engineering 
and software engineering [Wegner 94]).

As noted earlier, a collection of documents need not be fixed: it 
may evolve reflecting the evolution of the enterprise described by 
these documents. Documents describing software development are 
a well-known example [Welsh 93, Welsh 94]. In these documents, 
the reference associations between specification fragments and their 
refinements should be preserved during "maintenance", i.e., updates. 
The invariant for a reference association states that the properties 
of the maintained entity should correspond to the properties of its 
reference entity. In this example, the refinement (code document) 
is a maintained entity with respect to its reference entity -- its 
specification, so that properties of the refinement should correspond 
to the properties of the specification. This invariant should be satisfied 
all the time, and not just when the refinement is created. In more 
technical terms, the reference association between the specification 
and its refinement is an ordinary reference rather than a reference 
for create. A more detailed description of reference associations 
with many examples can be found in [Kilov, Ross 94].
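
The difference between the two kinds of reference can be sketched 
in Python as follows. The sketch is a simplified reading of the 
paragraph above, not the definition given in [Kilov, Ross 94]: the 
classes are hypothetical, and "correspondence" is assumed, for 
illustration only, to mean that both entities carry the same set of 
property names.

```python
class Entity:
    def __init__(self, name, properties):
        self.name = name
        self.properties = set(properties)


def corresponds(maintained, reference):
    # Assumed correspondence rule, for illustration only: both entities
    # carry the same set of property names.
    return maintained.properties == reference.properties


class ReferenceForCreate:
    """Correspondence is required only when the association is created."""

    def __init__(self, maintained, reference):
        if not corresponds(maintained, reference):
            raise ValueError("properties do not correspond at creation")
        self.maintained = maintained
        self.reference = reference

    def invariant(self):
        return True  # later updates of either entity are not constrained


class OrdinaryReference:
    """Correspondence is an invariant: it has to hold all the time."""

    def __init__(self, maintained, reference):
        self.maintained = maintained
        self.reference = reference
        if not self.invariant():
            raise ValueError("properties do not correspond")

    def invariant(self):
        return corresponds(self.maintained, self.reference)


spec = Entity("specification", {"open", "close"})
code = Entity("code document", {"open", "close"})
association = OrdinaryReference(maintained=code, reference=spec)

spec.properties.add("rename")    # the specification is updated
print(association.invariant())   # False: the code has to be updated as well
code.properties.add("rename")
print(association.invariant())   # True again
```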


Fewer restrictions on documents

The approach of using information concepts rather than computer 
implementations permits creating semantically rich information 
models of documents, without artificial restrictions imposed by 
currently existing products. The same problem (and solution) exists 
in information modeling for traditional enterprises. Some of these 
unnecessary restrictions deal with the following:

-- a document is not always a tree. Although the Web does not 
impose a tree-like document structure, too many documents (both 
traditional and hypertext ones!) are still presented as trees;
-- composition is encountered more often than subtyping. Dogmatic 
adherence to the "traditional" object-oriented approach used in 
programming languages emphasizes subtyping hierarchies and 
inheritance and de-emphasizes compositions, thus leading to 
inappropriate information models;
-- relationships are often non-binary. Composition is a typical 
example [Kilov, Ross 94]: "A composite type corresponds to one or 
more component types, and a composite instance corresponds to 
zero or more instances of each component type. There exists at 
least one resultant property of a composite instance dependent 
upon the properties of its component instances. There exists also at 
least one emergent property of a composite instance independent 
of the properties of its component instances. The sets of application-
specific types for the composite and its components should not be 
equal.";
-- the same component may belong to several different composites 
(i.e., a composition is not hierarchical). More generally, the same 
document fragment may belong to several different relationships, 
in the same manner as in any other enterprise the same (business) 
object may belong to several different relationships. This becomes 
visible in a traditional document when it is marked up by its readers: 
the same fragment may be marked up using highlighters of different 
colors, where a highlighter of a particular color is used to specify 
(the semantics of) a collection of fragments related in a particular 
way.
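
The last two points can be illustrated by a small Python sketch. It 
is an assumption-laden illustration rather than the composition 
specification of [Kilov, Ross 94]: the class Composite, the property 
names, and the sample fragments are all hypothetical. The composite 
relates one composite type to two component types (sections and 
figures), so the relationship is not binary, and the same fragment 
is a component of two different composites.

```python
class Composite:
    def __init__(self, title, **components):
        # Emergent property: independent of the properties of the components.
        self.title = title
        # Component type name -> list of component instances (possibly empty).
        self.components = components

    def length(self):
        # Resultant property: depends on the properties of the components.
        return sum(len(c) for group in self.components.values() for c in group)


def composites_of(fragment, composites):
    """Titles of all composites in which the fragment is a component."""
    return [c.title for c in composites
            if any(fragment in group for group in c.components.values())]


abstract = "Links are not essential."
figure = "[concept map]"

paper = Composite("Paper", sections=[abstract, "Body text."], figures=[figure])
digest = Composite("Digest", sections=[abstract], figures=[])

print(composites_of(abstract, [paper, digest]))
```

A tree-shaped document model could not express this: the abstract 
would need two parents.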

Relationships in a document or collection of documents explicitly 
specify different ways of browsing this document or this collection 
-- it is well-known that even traditional documents are not read 
cover to cover. Obviously, the document author cannot and should 
not anticipate all potential relationships (and even all potential 
document fragments representing concepts) of interest to the future 
readers of his document, and therefore the reader of a document 
should be able to specify ("highlight") these concept maps. In 
traditional documents, these implicit specifications have been 
implicitly tolerated (almost any used textbook from a university 
library may be used as an example). In Web documents, these 
(semantic) relationship specifications should become precise and 
explicit, be defined using concepts rather than links (compare 
[Murray 93, Kilov 94]), and be explicitly promoted.


Names

Naming within different contexts (including different names for the 
same thing) should be considered very carefully in dealing with 
the huge number of Web documents. In particular, a name should 
denote a document (or its component) rather than a place where (a 
pointer to) the document happens to be stored. Using names to 
uniquely identify particular documents within WWW is quite a 
non-trivial task because the Web is a huge open system. [ODP 2] 
clearly states that naming in an open system is possible only relative 
to a given naming context. A naming context is defined as a relation 
between a set of names and a set of entities, whereby the set of 
names belongs to a single name space. An identifier is defined as 
an unambiguous name of an entity in such a context. [For an 
almost trivial example, consider such "unreal" names as section 
numbers in a document: in the context of the subsequent version 
of the document, the same sections may well be denoted by different 
names.] Establishing object equality (and therefore document 
equality) by comparing their properties (e.g., behaviors) is possible 
only at some abstraction level whereby details irrelevant for this 
level are suppressed.
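
The definitions quoted from [ODP 2] can be sketched in Python as a 
relation, i.e., a set of (name, entity) pairs. The class 
NamingContext, the section numbers, and the entity names below are 
hypothetical and chosen only to mirror the section-number example; 
the sketch is not an implementation of the standard.

```python
class NamingContext:
    """A relation between a set of names and a set of entities."""

    def __init__(self, pairs):
        self.pairs = set(pairs)  # {(name, entity), ...}

    def entities(self, name):
        return {e for n, e in self.pairs if n == name}

    def names(self, entity):
        return {n for n, e in self.pairs if e == entity}

    def is_identifier(self, name):
        # An identifier is an unambiguous name: in this context
        # it denotes exactly one entity.
        return len(self.entities(name)) == 1


# Section numbers in two versions of the same (hypothetical) document.
version_1 = NamingContext({("2", "Abstraction and precision"),
                           ("3", "Viewpoints")})
version_2 = NamingContext({("2", "Abstraction and precision"),
                           ("3", "Collective behavior"),
                           ("4", "Viewpoints")})

# A context in which one name denotes two different entities.
mirrors = NamingContext({("paper.txt", "document A"),
                         ("paper.txt", "document B")})

print(version_1.names("Viewpoints"), version_2.names("Viewpoints"))
print(mirrors.is_identifier("paper.txt"))
```

The name "3" is an identifier in each version, but it denotes 
different sections in the two contexts, so it is meaningful only 
relative to a given context.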

There have been various proposals for establishing document identity 
within the Web, based, e.g., on reusing the idea of establishing a 
book identity -- something like an ISBN for every Web document. 
This may be quite difficult to organize, and may lead to the existence 
of many names for the same document with respect at least to 
documents that do not have a well-defined "owner". The number of 
such documents may be quite large due to the current proliferation 
of essentially the same document from various Web sources. 
Nevertheless, this idea is a good first step as it abstracts away 
document storage considerations.

The concept of "keyword search" is rather close to the concept of 
naming. Indeed, a keyword is supposed to denote a concept that is 
searched for by the Web user. However, it is rather well-known that 
keyword search is often inadequate, as the following quote from 
comp.infosystems.www perfectly shows:

">Is there something wrong with just using WAIS?

No, not at all, unless you don't know exactly what you're looking 
for, exactly what words to search on, and there aren't too many 
documents that match those words, and you can figure out which 
sources to search, and there are sources that cover the subject, and 
there aren't too many sources that cover the subject... and a few 
other reasons.

...most search either return nothing or too much, but rarely the 
desired answer.

Nick Arnett
Multimedia Computing Corp. (strategic consulting)
Campbell, California"


Standards

Several international standards for information management and 
open systems (such as [GRM 94, ODP 2]) are applicable to the 
specification of documents and document collections. In fact, these 
standards have influenced and have been influenced by information 
modeling considerations, and have been successfully used for 
modeling complex enterprises. Specifying enterprises and 
documents that describe these enterprises by means of the same 
concepts and using the same standards leads to improved 
understanding and consistent and readable documents. Obviously, 
reuse of these standards to specify Web documents is very helpful 
to both specifiers and users of these documents. In particular, it 
should be noticed that these standards promote the use of abstract 
and precise (formal) specifications to better understand and specify 
information systems in general. Web documents in particular will 
be understood substantially better in this manner.


Acknowledgment

Thanks go to Mark Buckley for helpful comments.


References

[Ainsworth 94] M.Ainsworth, A.H.Cruickshank, P.J.L.Wallis, 
L.J.Groves. Viewpoint specification and Z. Information and Software 
Technology, Vol. 36 (1994), No. 1, pp. 43-51.
[Dijkstra 76] E. W. Dijkstra, A Discipline of Programming. Englewood 
Cliffs, NJ: Prentice Hall, 1976.
[Eddison 94] P.Eddison. Adopting the SQL paradigm: text-retrieval 
solves data access problems. IMC Journal, Vol. 30 (1994), No. 2, pp. 
11-13.
[FM 94] ANSI ASC X3H7. Object model features matrix. Document 
number X3H7-93-007v7. April 1994.
[GRM 94] ISO/IEC JTC1/SC21/WG4, Information Technology - Open 
Systems Interconnection - Management Information Services - 
Structure of Management Information - Part 7: General Relationship 
Model. CD ISO/IEC 10165-7 N 8454. March 30, 1994.
[Hayes 93] I.Hayes. Specification case studies. Second Edition. 
Prentice-Hall, 1993.
[Hehner 93] E.C.R.Hehner. A practical theory of programming. 
Springer-Verlag, 1993.
[Kilov 93] H. Kilov, Information Modeling and Object Z: Specifying 
Generic Reusable Associations, in Proceedings of NGITS-93 (Next 
Generation Information Technology and Systems, Haifa, Israel, June 
28-30, 1993), ed. O. Etzion and A. Segev, pp. 182-91.
[Kilov, Ross 94] H.Kilov, J.Ross. Information modeling: an object-
oriented approach. Prentice-Hall, 1994.
[Kilov 94-1] H.Kilov. Information modeling: a path to document 
analysis, in Proceedings of Electronic Document Delivery Conference 
(EDD'94), pp. 267-280.
[Kilov 94] H.Kilov. On understanding hypertext: are links essential? 
ACM Software Engineering Notes, Vol. 19, No. 1 (January 1994), 
p.30.
[Murray 93] P.Murray. Tyrannical links, loose associations, and 
other difficulties of Hypertext. ACM SIGLINK Newsletter, Vol. 2, No. 1 
(March 1993), pp. 10-12.
[ODP 2] ISO/IEC JTC1/SC21/WG7, Basic Reference Model for Open 
Distributed Processing - Part 2: Descriptive Model. (CD 10746-2, 
February 1994).
[ODP 3] ISO/IEC JTC1/SC21/WG7. Basic Reference Model for Open 
Distributed Processing - Part 3: Prescriptive Model. (ISO/IEC 
JTC1/SC21/WG7 N 7525, December 1992).
[OODBTG 91] Object Data Management Reference Model. (ANSI 
Accredited Standards Committee. X3, Information Processing 
Systems.) Document Number OODB 89-01R8. 17 September 1991. 
(Also in: Computer Standards & Interfaces, Vol. 15 (1993), pp. 124-142.)
[Raven 94] M.E.Raven and R.Thompson. Can principles of object-
oriented system documentation be applied to user documentation? 
* (The journal of computer documentation), Vol. 18 (1994), No. 1, pp. 
15-19.
[Swatman 93] P.Swatman. Increasing formality in the specification of 
high-quality information systems in a commercial context. Department 
of Computer Science, Curtin University of Technology, Australia, 
1993.
[Trader 94] ISO/IEC JTC1/SC21/WG7, Information Technology - 
Open Distributed Processing - ODP Trading Function. ISO/IEC 
JTC1/SC21/WG7 N 897, 1994-02-04.
[Ulmer 88] Gregory L.Ulmer. Handbook for a theory hobby. Visible 
Language, Vol. 22, No. 4 (1988), pp. 399-423.
[Wegner 94] P.Wegner. Course on computer literacy for non-majors. 
Brown University (Providence, RI), Department of Computer Science, 
CS-94-21, Draft, May 1, 1994.
[Welsh 93] Jim Welsh and Jun Han. Software documents: concepts 
and tools. Technical Report No. 93-23, Software Verification Research 
Centre, The University of Queensland, Australia, 1993.
[Welsh 94] J.Welsh. Software is history! In: A Classical Mind (Essays 
in Honour of C.A.R.Hoare), ed. by A.W.Roscoe. Prentice-Hall, 1994, 
pp. 419-429.

(*) Copyright 1994, Bell Communications Research, Inc. (Bellcore). 
Permission to use, copy, modify and distribute this material for any 
lawful purpose and without fee is hereby granted, provided that the 
above copyright notice and this permission notice appear in all 
copies, and that the name of Bellcore not be used in advertising or 
publicity pertaining to this material without the specific, prior written 
permission of an authorized representative of Bellcore. BELLCORE 
MAKES NO REPRESENTATIONS OR WARRANTY, EXPRESS OR 
IMPLIED, ABOUT THE ACCURACY, SUFFICIENCY, OR SUITABILITY 
OF THIS MATERIAL FOR ANY PURPOSE. IT IS PROVIDED "AS IS", 
WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. Bellcore 
expressly disclaims any liability for any damage or injury incurred 
by any person arising out of the sufficiency, accuracy, or utility of 
any information contained herein. Any use of this material is at 
the sole risk of the user.