Jumping off the page

In the start of a series on using your Acorn to create Web pages, David Matthewman explains what HTML is and how to write it.

In the beginning, so they say, was the word. Now, a word is more useful than the individual letters that spell it, but leaves something to be desired for meaningful communication. Take the word 'Polish'; it can represent an action, a block of wax or even a nationality. To pin down its meaning further - to put it into context - it must be surrounded by other words. In short, soon after the beginning, was the sentence.

If a word is seen as a point, and a sentence as a line, then clearly a paragraph extends this into two dimensions as an area. Then what? Paragraphs follow one another in sequence and there is no way of writing a sentence that literally 'jumps off' the page. Even a book is just a collection of pages; adding another 'dimension' to text requires rather more work.

At Acorn User we sometimes include a jargon box to avoid the messy solution of defining a term each time it appears, which can break up the flow of a sentence, and we also refer back to articles in previous issues where they are relevant. These ways of extending plain text are not new, and publishers have been using them for centuries without needing any new and frightening-sounding words to refer to them.

However, in 1965 Ted Nelson gave a name to the practice: hypertext. Nelson's definition of hypertext went beyond traditional practice; it related documents and parts of documents to a web of others. The aim was a project called Xanadu, a system linking all the world's literature, which was well ahead of what technology could practically achieve then.

Rumblings under Geneva

There the matter might have rested, except that in the late 1980s a group of high-energy physicists at CERN in Geneva were looking for a viable way to spread information rapidly among the high-energy physics community. One of their researchers, Tim Berners-Lee, wrote a proposal called HyperText and CERN which contained the basic principles which were to develop into the World Wide Web. These were:

A multi-platform approach, allowing the information to be accessed from as many different platforms and in as many different ways as possible.
A flexible approach to document type and information protocols: the Web shouldn't itself restrict the type of information that can be transmitted using it.
Universal access: any user should be able to access any information.

Over the next few years, development was largely confined to CERN, but in early 1993 Marc Andreessen at NCSA led a team that developed Mosaic, which ran on X-Windows. This was one of the first graphical Web browsers - to follow a link, you pointed at a word and clicked.

In 1994 two major players in the WWW arena came into being. Because the Web was developing so quickly, it was obvious that some advisory body was needed, to draw up a set of standards for the evolution of the Web. This authority - eventually called the World Wide Web Consortium or W3C for short - is a consortium of universities and private industries, which oversees the development of the Web.

The other player (which actually appeared before W3C) was the company now known as Netscape Communications. Netscape, co-founded by Marc Andreessen, developed a commercial Web browser - variously called Mozilla, Navigator and Atlas - that came to dominate the market. At one point it was estimated that over 90 per cent of Web browsing was performed by a version of Netscape's browser. This figure has dropped recently, but Netscape still has around 75 per cent of the market.

The shifting HTML standard

As Netscape has not written a browser for the Acorn platform, you may be wondering whether this is relevant. It is, because it had a profound influence on the development of HTML, the language of the Web. HyperText Markup Language, to give it its full title, should adhere to a standard defined by the Internet Working Group of the Internet Engineering Task Force (IETF), a standard principally developed by W3C.

This ensures that everyone knows exactly what they mean when they talk about HTML - in theory. In practice, although the work of W3C is very worthy, it is also very slow and cautious. This means that despite the fact that the HTML 2.0 specification is currently implemented by just about every browser, its current status at IETF is 'proposed standard'.

Netscape Communications, noting that its customers wanted to do things like turn the background turquoise, and bored of waiting for the official channels to agree on a way to do this, invented its own. As Netscape owned so much of the browser market, most people writing HTML took advantage of the extra features and, sure as chicken follows egg, so did most other browsers.

The net result of this is that there are several HTML 'standards'. Most Web authors write HTML 2.0 with varying amounts of Netscape extensions to the standard, and it's HTML like this that I'm going to be covering in this series.

What is HTML?

The way that Web pages are sent over the Internet involves two main computer programs: a server and a browser. The server sits on the machine that holds the Web pages and sends them to anyone who asks; we won't worry about this side of things for the moment. The browser runs on the user's machine, asks for the pages, reads them and displays them. You don't actually need to be connected to the Internet at all: most browsers will look at Web pages stored locally on a hard drive, so if you're reading this without an Internet account then don't despair; you can play too.

The browser is the key to the whole process. When you write HTML, what you are writing are instructions to the browser about how the page should be displayed. However, these are not the sort of instructions that you'll be used to if you do a lot of DTP work - point size, font, frame information - because this sort of information is not readily portable between different types of browser.

To see this more clearly, consider two extreme types of browser. The first takes the HTML and speaks it. Such a browser would be very useful to blind people and, indeed, browsers like this exist. The second takes the HTML and, with a flawless knowledge of the rules of typography and of the particular 'house style' of a magazine, lays out pages from it. Browsers of the second type don't exist, but it's useful to pretend that they do.

What is the speaking browser to make of the setting of a particular point size? Very little: I know no one who speaks in Times Roman (although I know plenty who might as well be talking Zapf Dingbats). Such information is of no more use to the page layout browser, which will override it with its own settings.

The instructions contained within HTML are designed to put the text into a particular context, either by marking it as having a certain emphasis or by defining its relation to some other text. Tell a speaking browser that a phrase is italicised, and it will ignore the instruction; tell it that the phrase is emphasised and it will know to stress it as it speaks it.

OK, I'll come clean. There's nothing actually to stop you using all sorts of tricks in your HTML code to try to define the layout and appearance of your Web pages. But - and it's a big but - you must remember that not every browser will treat the information in the same way or even use it at all. When starting out, it's safer to mark your pages up strictly for context rather than for appearance.

Marking up text

If you take a block of plain text and send it to a browser, you will notice a number of things at once. The first is that it will display it. Formally, there are a number of items that should be present in a document to make it valid HTML, but very few browsers will actually sulk if they aren't there.

The second point is that any extraneous spaces, line feeds and carriage returns in the text are ignored. The browser will simply skip over them and display the text in one block with uniform spacing between the words. There are circumstances when you can make the browser care about extra spaces and so on, but we'll cover those later.
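As a rough sketch of this behaviour (the sentence is made up for illustration, and it assumes the browser is treating the file as HTML), the oddly spaced text below should be displayed as a single, evenly spaced line:

```html
This    sentence   has
   been spaced


out rather    oddly.
```

However the source is laid out, the reader should see 'This sentence has been spaced out rather oddly.' with one space between each pair of words.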

Marking up the text to turn it into HTML involves inserting tags. Tags are instructions contained within angle brackets '<>' and are case-insensitive. They usually come in pairs - a 'turn on' and 'turn off' tag - with the second of the two being preceded by a slash '/'. For example, to emphasise a word or phrase, enclose it within the <em></em> tag pair, as in:

This point is <em>very</em> important.

I'll talk about the other kind of tag - the self-contained one - in the next issue.
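To tie tag pairs back to the earlier point about marking up for context rather than appearance, here is a small sketch. The sentence is invented, and the <i> (italic) tag is not otherwise covered in this article; it appears only for contrast. A graphical browser would probably show both lines in much the same way, but only the second tells a speaking browser anything it can use:

```html
<!-- Appearance only: asks for italic type and says nothing about meaning -->
Do <i>not</i> switch off while saving.

<!-- Context: marks the word as emphasised; each browser decides how to show or speak it -->
Do <em>not</em> switch off while saving.
```

The lines enclosed in <!-- and --> are comments, which browsers do not display.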

An HTML skeleton

Now that you've seen how tags work - it's simple really - you can create your own bona fide HTML file from a block of text. First, you have to tell the browser that the whole document is HTML by enclosing it within the <html></html> tag pair. Next, you have to separate the body of the document (which contains the text) from the header, which contains other information including the title.

This is all done with appropriate tag pairs, to produce a document that looks like this:

<html>
<head>
<title>Title here</title>
</head>
<body>
Text of document here
</body>
</html>

Looking at the above, you can see that tags can be nested. For instance, the title is within the header of the document, and both the body and header are within the outer <html> tags.

What I have written above is a perfectly valid HTML file (well, almost, but don't worry about it), but there's a little more work that needs doing before it will display in a sensible manner. As I have already observed, Web browsers ignore all line feeds and carriage returns and will run paragraphs together. To stop them doing this, you need to mark paragraphs using the <p> tag. Strictly speaking, <p> marks the start of a paragraph and </p> the end, but you can safely ignore the end tag and just use <p> to mark paragraph starts.
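Putting the skeleton and the paragraph tag together gives something like the sketch below. The title and body text are placeholders invented for illustration, and the end-of-paragraph tags are left out, as the text above allows:

```html
<html>
<head>
<title>Title here</title>
</head>
<body>
<p>This is the first paragraph. Line breaks in the
source file make no difference to how it is displayed.
<p>This is the second paragraph, started by another paragraph tag.
</body>
</html>
```

A browser should show this as two separate paragraphs, with the first one run together into a single block of text.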

Anchor tags and attributes

So far, we can mark a document as valid HTML, and flag off all the paragraphs. However, one of the key features of HTML is the ability to link words and phrases to other documents. This is accomplished with the Anchor tag <a>.

Placing <a> and </a> round some text marks it as an anchor - a link to another document. Astute readers will notice that we haven't actually specified where the file links to. We do this by using tag attributes.

Most tags in HTML can have attributes. They are specified by putting sets of:

variable="value"

pairs within the opening tag. In this case the relevant variable is 'href' and the value is the name of the file being linked to. To link to a file mail.html you would therefore write something like:

You can also <a href="mail.html">mail me</a>.

When this is displayed in a browser, the words 'mail me' will be highlighted in some way - often by colouring them blue - and (in a WIMP environment) the browser can be told to load the file mail.html by clicking on the words.
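For a fuller picture, here is a minimal sketch of a complete page containing that link. Only the anchor line comes from the example above; the title and the welcome sentence are invented, and it assumes a file called mail.html sits in the same directory as the page:

```html
<html>
<head>
<title>My first page</title>
</head>
<body>
<p>Welcome to my first Web page.
<p>You can also <a href="mail.html">mail me</a>.
</body>
</html>
```

Because the href value is just a file name, the browser looks for mail.html alongside the current document, so the pair of files can be tried out from a hard drive without an Internet connection.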

Usually anchor tags are a little more complex than this, but that's a topic for the next in the series. I've given you quite enough to think about for one month, and if you're feeling adventurous you can always look at the files on the cover disc.

Where next from here?

Starting from the word, we went one-dimensional with sentences, two-dimensional with paragraphs and finally three-dimensional with hypertext. That's about as far as I'm going to stretch the metaphor; I'm not covering time travel next issue. Instead, I'll show you how to set out across the Internet, and also how to incorporate pictures into your text.

Until then you should spend your time being sure that you understand opening and closing tag pairs, the difference between the head and the body of an HTML document and what a tag attribute is. And repeat ten times before you go to sleep: 'I will write my HTML to be meaningful to speaking browsers.'

A summary of the software to help you write HTML on the Acorn platform is available.