Extending WWW to support
Platform Independent Virtual Reality
David Raggett, Hewlett Packard Laboratories
(email: dsr@hplb.hpl.hp.com)
Abstract
This is a proposal to allow VR environments to be incorporated into the
World Wide Web, thereby allowing users to "walk" around and push through
doors to follow hyperlinks to other parts of the Web. VRML is proposed as a
logical markup format for non-proprietary platform independent VR. The
format describes VR environments as compositions of logical elements.
Additional details are specified using a universal resource naming scheme
supporting retrieval of shared resources over the network. The paper
closes with ideas for how to extend this to support virtual presence
teleconferencing.
Introduction
This paper describes preliminary ideas for extending the World Wide Web to
incorporate virtual reality (VR). By the end of this decade, the continuing
advances in price/performance will allow affordable desktop systems to run
highly realistic virtual reality models. VR will become an increasingly
important medium, and the time is now ripe to develop the mechanisms for
people to share VR models on a global basis. The author invites help in
building a proof of concept demo and can be contacted at the email address
given above.
VR systems at the low end of the price range show a 3D view into the VR
environment together with a means of moving around and interacting with that
environment. At the minimum you could use the cursor keys for moving forward
and backwards, and turning left and right. Other keys would allow you to
pick things up and put them down. A mouse improves the ease of control, but
the "realism" is primarily determined by the latency of the feedback loop
from control to changes in the display. Joysticks and SpaceBalls improve
control, but cannot compete with the total immersion offered by head mounted
displays (HMDs). High end systems use magnetic tracking of the user's head
and limbs, together with devices like 3D mice and datagloves to yet further
improve the illusion.
Sound can be just as important to the illusion as the visual simulation:
The sound of a clock gets louder as you approach it. An aeroplane roars
overhead crossing from one horizon to the next. High end systems allow for
tracking of multiple moving sources of sound. Distancing is the technique
where you get to see and hear more detail as you approach an object. The VR
environment can include objects with complex behavior, just like their
physical analogues in the real world, e.g. drawers in an office desk,
telephones, calculators, and cars. The simulation of behavior is frequently
more demanding computationally than updating the visual and aural displays.
The virtual environment may impose the same restrictions as in the real
world, e.g. gravity and restricting motion to walking, climbing up/down
stairs, and picking up or putting down objects. Alternatively, users can
adopt superpowers and fly through the air with ease, or even through walls!
When using a simple interface, e.g. a mouse, it may be easier to learn if
the range of actions at any time is limited to a small set of possibilities,
e.g. moving forwards towards a staircase causes you to climb the stairs. A
separate action is unnecessary, as the VR environment builds in assumptions
about how people move around. Avatars are used to represent the user in the
VR environment. Typically these are simple disembodied hands, which allow
you to grab objects. This avoids the problems in working out the positions
of the user's limbs and cuts down on the computational load.
Platform Independent VR
Is it possible to define an interchange format for VR environments which can
be visualized on a broad range of platforms from PCs to high-end
workstations?
At first sight there is little relationship between the capabilities of
systems at either extreme. In practice, many VR environments are composed from
common elements, e.g. rooms have floors, walls, ceilings, doors, windows,
tables and chairs. Outdoors, there are buildings, roads, cars, lawns, and
trees etc. Perhaps we can draw upon experience with document conversion and
the Standard Generalized Markup Language (SGML) [ref. 4]
and specify VR environments at a logical level, leaving browsers to fill in
the details according to the capabilities of each platform.
The basic idea is to compose VR environments from a limited set of
logical elements, e.g. chair, door, and floor. The dimensions of some of
these elements can be taken by default. Others, like the dimensions of a
room, require lists of points, e.g. to specify the polygon defining the
floor plan. Additional parameters give the color and texture of surfaces. A
picture frame hanging on a wall can be specified in terms of a bitmapped
image.
These elements can be described at a richer level of detail by reference
to external models. The basic chair element would have a subclassification,
e.g. office chair, which references a detailed 3D model, perhaps in the DXF
format. Keeping such details in separate files has several advantages:
- Simplifies the high level VR markup format
This makes it easier to create and revise VR environments than with a flat
representation.
- Models can be cached for reuse in other VR environments
Keeping the definition separate from the environment makes it easy to create
models in terms of existing elements, and saves resources.
- Allows for sharing models over the net
Directory services can be used to locate where to retrieve the model from.
In this way, a vast collection of models can be shared across the net.
- Alternative models can be provided according to each browser's
capabilities.
Authors can model objects at different levels of detail according to the
capabilities of low, mid and high end machines. The appropriate choice can
be made when querying the directory service, e.g. by including machine
capabilities in the request. This kind of negotiation is already in place as
part of the World Wide Web's HTTP protocol [ref. 3].
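The selection step might be sketched as follows in Python; the variant table, file names, and capability scores are invented for illustration, and real negotiation would ride on HTTP's existing mechanisms:

```python
# Sketch: pick the richest model variant a client can handle.
# Variant names and capability scores are illustrative assumptions.

VARIANTS = {
    "chair": [
        ("chair-simple.dxf", 1),  # low-end machines
        ("chair-mid.dxf", 2),     # mid-range machines
        ("chair-full.dxf", 3),    # high-end workstations
    ],
}

def select_model(name, client_capability):
    """Return the most detailed variant whose cost fits the client."""
    best = None
    for url, cost in VARIANTS[name]:  # listed in increasing detail
        if cost <= client_capability:
            best = url
    return best
```

A low-end browser thus receives the simple wire frame while a workstation receives the full model, from the same request.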
Limiting VR environments to compositions of known elements would be
overly restrictive. To avoid this, it is necessary to provide a means of
specifying novel objects, including their appearance and behavior. The high
level VR markup format should therefore be dynamically extendable. The
built-in definitions are merely a short cut to avoid the need to repeat
definitions for common objects.
Universal Resource Locators (URLs)
The World Wide Web uses a common naming scheme to represent hypermedia links
and links to shared resources. It is possible to represent nearly any file
or service with a URL [ref. 2].
The first part always identifies the method of access (or protocol). The
next part generally names an Internet host and is followed by path
information for the resource in question. The syntax varies according to the
access method given at the start. Here are some examples:
- http://info.cern.ch/hypertext/WWW/TheProject.html
This is the CERN home page for the World Wide Web project. The prefix
"http" implies that this resource should be obtained using the hypertext
transfer protocol (HTTP).
- http://cui_www.unige.ch/w3catalog
The searchable catalog of WWW resources at CUI, in Geneva. Updated daily.
- news:comp.infosystems.www
The Usenet newsgroup "comp.infosystems.www".
This is accessed via the NNTP protocol.
- ftp://ftp.ifi.uio.no/pub/SGML
This names an anonymous FTP server: ftp.ifi.uio.no which
includes a collection of information relating to the Standard Generalized
Markup Language - SGML.
Application to VR
The URL notation can be used in a VR markup language for:
- Referencing wire frame models, image tiles and other resources
For example, a 3D model of a vehicle or an office chair. Resources may be
defined intensionally, and generated by the server in response to the user's
request.
- Hypermedia links to other parts of the Web.
Major museums could provide educational VR models on particular topics.
Hypermedia links would allow students to easily move from one museum to
another by "walking" through links between the different sites.
One drawback of URLs is that they generally depend on particular servers.
Work is in progress to provide widespread support for lifetime identifiers
that are location independent. This will make it possible to provide
automated directory services akin to X.500 for locating the nearest copy of
a resource.
MIME: Multipurpose Internet Mail Extensions
MIME describes a set of mechanisms for specifying and describing the format
of Internet message bodies. It is designed to allow multiple objects to be
sent in a single message, to support the use of multiple fonts plus
non-textual material such as images and audio fragments. Although it was
conceived for use with email messages, MIME has a much wider applicability.
The hypertext transfer protocol HTTP uses MIME for request and response
message formats. This allows servers to use a standard notation for
describing document contents, e.g. image/gif for GIF images and text/html
for hypertext documents in the HTML format. When a client receives a MIME
message the content type is used to invoke the appropriate viewer. The
bindings are specified in the mailcap configuration file. This makes it
easy to add local support for a new format without changes to your mailer or
web browser. You simply install the viewer for the new format and then add
the binding into your mailcap file.
The author anticipates the development of a public domain viewer for a
new MIME content type: video/vrml. A platform independent VR markup language
would allow people to freely exchange VR models either as email messages or
as linked nodes in the World Wide Web.
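A binding for the proposed content type would then be a single line in that configuration file. Purely as an illustration (the viewer command name "vrmlview" is a placeholder, not an existing program):

```
video/vrml; vrmlview %s
```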
A sketch of the proposed VR markup language (VRML)
A major distinction appears to be between indoor and outdoor scenes. Indoors, the
scene is constructed from a set of interconnected rooms. Outdoors, you have
a landscape of plains, hills and valleys upon which you can place buildings,
roads, fields, lakes and forests etc. The following sketch is in no way
comprehensive, but should give a flavour of how VRML would model VR
environments. Much work remains to turn this vision into a practical
reality.
Indoor scenes
The starting point is to specify the outlines of the rooms. Architects'
drawings describe each building as a set of floors, each of which is
described as a set of interconnected rooms. The plan shows the position of
windows, doors and staircases. Annotations define whether a door opens
inwards or outwards, and whether a staircase goes up or down. VRML directly
reflects this hierarchical decomposition with separate markup elements for
buildings, floors, rooms, doors and staircases etc. Each element can be
given a unique identifier. The markup for adjoining rooms uses this
identifier to name interconnecting doors. Rooms are made up from floors,
walls and ceilings. Additional attributes define the appearance, e.g. the
color of the walls and ceiling, the kind of plaster coving used to join
walls to the ceiling, and the style of windows. The range of elements and
their permitted attributes are defined by a formal specification analogous
to the SGML document type definition.
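Purely as an illustration of this hierarchical markup, a room might look as follows; every element and attribute name here is invented for the sketch, not a settled syntax:

```sgml
<!-- hypothetical VRML markup: all names are illustrative only -->
<room id="office4" walls="cream" ceiling="white" coving="plain">
  <floor>0,0  500,0  500,400  0,400</floor>  <!-- floor plan polygon -->
  <door to="corridor1" wall="north" opens="inward">
  <window wall="south" style="sash">
</room>
```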
Rooms have fittings: carpets, paintings, book cases, kitchen units,
tables and chairs etc. A painting is described by reference to an image
stored separately (like inlined images in HTML). The browser retrieves this
image and then applies a parallax transformation to position the painting at
the designated location on the wall. Wallpaper can be modelled as a tiling,
where each point on the wall maps to a point in an image tile for the
wallpaper. This kind of texture mapping is computationally expensive, and low
power systems may choose to employ a uniform shading instead. Views through
windows to the outside can be approximated by mapping the line of sight to a
point on an image acting as a back cloth, and effectively at infinity.
Kitchen units, tables and chairs etc. are described by reference to external
models. A simple hierarchical naming scheme can be used to substitute a
simpler model when the more detailed one would overload a low power browser.
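Such a naming scheme might be resolved by walking up the hierarchy until a model the browser can handle is found. A sketch, in which the names and the set of locally renderable models are assumptions:

```python
# Sketch: hierarchical model names degrade gracefully by falling back
# to simpler ancestors.  Names and the available set are illustrative.

AVAILABLE = {"chair", "chair/office"}  # models this browser can render

def resolve(name):
    """Walk up the name hierarchy to the richest available model."""
    parts = name.split("/")
    while parts:
        candidate = "/".join(parts)
        if candidate in AVAILABLE:
            return candidate
        parts.pop()  # drop the most specific component and retry
    return None
```

A low-power browser asked for "chair/office/swivel" would thus settle for its "chair/office" model.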
Hypermedia links can be represented in a variety of ways. The simple
approach used in HTML documents for depicting links is almost certainly
inadequate. A door metaphor makes good sense when transferring to another VR
model or to a different location in the current model. If the link is to an
HTML document, then an obvious metaphor is opening a book (by tapping on it
with your virtual hand?). Similarly a radio or audio system makes sense for
listening to an audio link, and a television for viewing an MPEG movie.
Outdoor scenes
A simple way of modelling the ground into plains, hills and valleys is to
attach a rubber sheet to a set of vertical pins of varying lengths and
placed at irregular locations: z_i = f(x_i, y_i). The sheet is single valued for
any x and y, where x and y are orthogonal axes in the horizontal plane.
Smooth terrain can be described by interpolating gradients specified at
selected points. The process is only applied within polygons for which all
vertices have explicit gradients. This makes it possible to restrict
smoothing to selected regions as needed.
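As a minimal sketch of the rubber-sheet surface, inverse-distance weighting over the pins gives a single-valued z = f(x, y); the choice of interpolant is an assumption, since the text leaves the method open, and the gradient smoothing step is omitted:

```python
# Sketch: ground height interpolated from pins of varying length at
# irregular (x, y) locations.  Inverse-distance weighting is one
# simple interpolant; the proposal does not prescribe a method.

PINS = [(0, 0, 10.0), (100, 0, 20.0), (0, 100, 30.0)]  # (x, y, z)

def height(x, y):
    """Single-valued surface height z = f(x, y)."""
    num = den = 0.0
    for px, py, pz in PINS:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0.0:
            return pz             # query point sits exactly on a pin
        w = 1.0 / d2              # nearer pins dominate
        num += w * pz
        den += w
    return num / den
```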
The next step is to add scenery onto the underlying ground surface:
- Texture wrapping - mapping an aerial photograph onto the ground
surface.
This works well if the end-user is flying across a landscape at a sufficient
height that parallax effects can be neglected for surface detail like trees
and buildings. Realism can be further enhanced by including an atmospheric
haze that obscures distant details.
- Plants - these come in two categories: point-like objects such as
individual trees and area-like objects such as forests, fields, weed
patches, lawns and flower beds.
A tree can be placed at a given (x, y) coordinate and scaled to a given
height. A range of tree types can be used, e.g. deciduous (summer/fall), and
coniferous. The actual appearance of each type of tree is specified in a
separate model, so VRML only needs the class name and a means of specifying
the model's parameters (in many cases defaults will suffice). Extended
objects like forests can be rendered by repeating an image tile or generated
as a fractal texture, using attributes to reference external definitions for
the image tile or texture.
- Water - streams, rivers and water falls; ponds, lakes and the sea.
The latter involves attributes for describing the nature of the beach: muddy
estuary, sandy, rocky and cliffs.
- Borders - fences, hedges, walls etc. which are fundamentally line-like
objects
- Roads - number of lanes, types of junctions, details for signs, traffic
lights etc.
Each road can be described in terms of a sequence of points along its center
and its width. Features like road lights and crash barriers can be generated
by default according to the attributes describing the kind of road. Road
junctions could be specified in detail, but it seems possible to generate
much of this locally on the basis of the nature of the junction and the end
points of the roads it connects: freeway-exit, clover-leaf junction, 4-way
stop, round-about etc. In general VRML should avoid specifying detail where
this can be inferred by the browsing tool. This reduces the load on the
network and allows browsers to show the scene in the detail appropriate to
the power of each platform. Successive generations of kit can add more and
more detail leading to progressively more realistic scenes without changes
to the original VRML documents.
- Buildings - houses, skyscrapers, factories, filling stations, barns,
silos, etc.
Most buildings can be specified using constructive geometry, i.e. as a set
of intersecting parts each of which is defined by a rectangular base and
some kind of roof. This approach describes buildings in a compact style and
makes it feasible for VRML to deal with a rich variety of building types.
The texture of walls and roofs, as well as the style of windows and doors
can be defined by reference to external models.
- Vehicles, and other moving objects
A scene could consist of a number of parked vehicles plus a number of
vehicles moving along the road. Predetermined trajectories are rather
unexciting. A more interesting approach is to let the behavior of the set of
vehicles emerge from simple rules governing the motion of each vehicle. This
could also apply to pedestrians moving on a side-walk. The rules would be
defined in scripts associated with the model and not part of VRML itself.
The opportunities for several users to meet up in a shared VR scene are
discussed in the next section.
- Distant scenery, e.g. a mountain range on the horizon
This is effectively at infinity and can be represented as a back cloth hung
in a cylinder around the viewer. It could be implemented using bitmap
images (e.g. in GIF or JPEG formats). One issue is how to make the
appearance change according to the weather/time of day.
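The emergent vehicle motion suggested above can be reduced to a single local rule per vehicle. In this sketch the rule and all constants are invented for illustration, and in the proposal such scripts would live outside VRML itself:

```python
# Sketch: one-lane traffic where overall behavior emerges from one
# local rule -- brake when too close to the car in front, otherwise
# drift toward a target speed.  No trajectory is scripted in advance.

def step(positions, speeds, target=30.0, gap=20.0, dt=1.0):
    """Advance vehicles listed back-to-front along the lane."""
    new_speeds = []
    for i, v in enumerate(speeds):
        if i + 1 < len(positions) and positions[i + 1] - positions[i] < gap:
            v = max(0.0, v - 5.0)   # too close to the leader: brake
        elif v < target:
            v = v + 2.0             # open road: speed up
        new_speeds.append(v)
    positions = [p + v * dt for p, v in zip(positions, new_speeds)]
    return positions, new_speeds
```

Running the rule over many vehicles produces queueing and spacing patterns that were never specified explicitly, which is the point of the emergent approach.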
Outdoor scenes wouldn't be complete without a range of different weather
types! Objects should gradually lose their color and contrast as their
distance increases. Haze is useful for washing out details as the browser
can then ignore objects beyond a certain distance. The opacity of the haze
will vary according to the weather and time of day. Fractal techniques can
be used to synthesize cloud formations. The color of the sky should vary as
a function of the angle from the sun and the angle above the horizon. For
VRML, the weather would be characterized as a set of predetermined weather
types.
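The washing out of color and contrast with distance can be sketched as blending toward the haze color under an exponential attenuation; the attenuation law and the opacity constant are assumptions:

```python
import math

# Sketch: fade an object's color toward the haze color with distance.
# Exponential attenuation and the opacity value are assumptions; the
# opacity would be set by the scene's weather type and time of day.

def hazed(color, haze, distance, opacity=0.01):
    """Blend an RGB color toward the haze color."""
    t = math.exp(-opacity * distance)   # 1.0 nearby, tends to 0 far off
    return tuple(t * c + (1.0 - t) * h for c, h in zip(color, haze))
```

Beyond the distance where t becomes negligible the browser can simply skip drawing the object altogether.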
The illusion will be more complete if you can see progressively more detail
the closer you get. Unfortunately, it is impractical to explicitly specify
VR models in arbitrary detail. Another approach is to let individual models
reference more detailed models in a chain of progressively finer detail,
e.g. a model that defines a lawn as a green texture can reference a model
that specifies how to draw individual blades of grass. The latter is only
needed when the user zooms in on the lawn. The browser then runs the more
detailed model to generate a forest of grass blades.
Actions and Scripts
Simple primitive actions are part of the VRML model, for example the ability
of the user to change position/orientation and to pick up/put down or
"press" objects. Other behavior is the responsibility of the various
objects and lies outside the scope of VRML. Thus a virtual calculator would
allow users to press keys and carry out calculations just like the real
thing. This rich behavior is specified as part of the model for the
calculator object class along with details of its appearance. A scripting
language is needed for this, but it will be independent of VRML, and indeed
there could be a variety of different languages. The format negotiation
mechanism in HTTP seems appropriate to this, as it would allow browsers to
indicate which representations are supported when sending requests to
servers.
Achieving Realism
Another issue is how to provide realism without excessive computational
demands. To date the computer graphics community has focussed on
mathematical models for realism, e.g. ray tracing with detailed models for
how objects scatter or transmit light. An alternative approach could draw
upon artistic metaphors for rendering scenes. Paintings are not like
photographs, and artists don't try to capture all details, rather they aim
to distill the essentials with a much smaller number of brush strokes. This
is akin to symbolic representations of scenes. We may be able to apply this
to VR. As an example consider the difficulty in modelling the folds of cloth
on your shirt as you move your arm around. Modelling this computationally is
going to be very expensive; perhaps a few rules could be used to draw in folds
when you fold your arms.
Virtual Presence Teleconferencing
The price performance of computer systems currently doubles about every 15
months. This has happened for the last five years and industry pundits see
no end in sight. It therefore makes sense to consider approaches which today
are impractical, but will soon come within reach.
A world without people would be a dull place indeed! The markup language
described above allows us to define shared models of VR environments, so
the next step is to work out how to allow people to meet in these
environments. This comes down to two parts:
- The protocols needed to ensure that each user sees an up to date
view of all the other people in the same virtual location, whether this is a
room or somewhere outdoors.
- A way of visualising people in the virtual environment; this in turn
raises the question of how to sense each user - their expressions, speech and
movements.
For people to communicate effectively, the latency for synchronizing models
must be on the order of 100 milliseconds or less. You can get by with longer delays,
but it gets increasingly difficult. Adopting a formal system for turn taking
helps, but you lose the ability for non-verbal communication. In meetings,
it is common to exchange glances with a colleague to see how he or she is
reacting to what is being said. The rapid feedback involved in such
exchanges calls for high resolution views of people's faces together with
very low latency.
A powerful technique will be to use video cameras to build real-time 3D
models of people's faces. As the skull shape is fixed, the changes are
limited to the orientation of the skull and the relative position of the
jaw. The fine details in facial expressions can be captured by wrapping
video images onto the 3D model. This approach greatly reduces the bandwidth
needed to project lifelike figures into the VR environment. The views of the
back of the head and the ears etc. are essentially unchanging and can be
filled in from earlier shots, or if necessary synthesized from scratch to
match visible cues.
In theory, the approach needs a smaller bandwidth than conventional
video images, as head movements can be compressed into a simple change of
coordinates. Further gains in bandwidth could be achieved at a cost in
accuracy by characterizing facial gestures in terms of a composition of
"identikit" stereotypes, e.g. shots of mouths which are open or closed,
smiling or frowning. The face is then built up by blending the static model
of the user's face and jaw with the stereotypes for the mouth, cheeks, eyes,
and forehead.
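The blending itself could be as simple as an alpha mix of a stereotype patch into the static face model. In this sketch, faces are flat grayscale pixel lists purely for illustration; all names and values are assumptions:

```python
# Sketch: blend an "identikit" stereotype patch (e.g. a smiling mouth)
# into the static model of the user's face.  Flat grayscale pixel
# lists stand in for real image data; the alpha weight is illustrative.

def blend_patch(face, patch, offset, alpha=0.7):
    """Mix patch into face starting at offset; alpha weights the patch."""
    out = list(face)
    for i, p in enumerate(patch):
        out[offset + i] = alpha * p + (1.0 - alpha) * out[offset + i]
    return out
```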
Although head mounted displays offer total immersion, they also make it
difficult to sense the user's facial expressions. They are also uncomfortable
to wear. Virtual presence teleconferencing is therefore more likely to
use conventional displays together with video cameras mounted around the
user's workspace. Lightweight headsets are likely to be used in preference
to stereo or quadraphonic loudspeaker systems, as they offer greater
auditory realism as well as avoiding trouble when sound spills over into
neighboring work areas.
The cameras also offer the opportunity for hands free control of the
user's position in the VR environment. Tracking of hands and fingers could
be used for gesture control without the need for 3D mice or spaceballs etc.
Another idea is to take cues from head movements, e.g. moving your head from
side to side could be exaggerated in the VR environment to allow users to
look from side to side without needing to look away from the display being
used to visualize that environment.
Where Next?
For workstations running the X11 windowing system, the PEX library for 3D
graphics is now available on most platforms. This makes it practical to
start developing proof of concept platform independent VR. The proposed
VRML interchange format could be used within the World Wide Web or for
email messages. All that users would need to do is download a public domain
VRML browser and add it to their mailcap file. The author is interested in
getting in touch with people willing to collaborate in turning this vision
into a reality.
References
1. "Hypertext Markup Language (HTML)", Tim Berners-Lee, January 1993.
URL=ftp://info.cern.ch/pub/www/doc/html-spec.ps
or http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
2. "Uniform Resource Locators", Tim Berners-Lee, January 1992.
URL=ftp://info.cern.ch/pub/www/doc/url7a.ps
or http://info.cern.ch/hypertext/WWW/Addressing/Addressing.html
3. "Protocol for the Retrieval and Manipulation of Textual and Hypermedia Information", Tim Berners-Lee, 1993.
URL=ftp://info.cern.ch/pub/www/doc/http-spec.ps
or http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTTP2.html
4. "The SGML Handbook", Charles F. Goldfarb, pub. 1990 by the Clarendon Press, Oxford.