


— 2 —
HTTP: How To Speak On the Web


The global Internet, a "network of networks," is possible for one reason: the adherence to standard communications protocols. These protocols define how the computers and applications running on the network communicate with one another. They allow computers running on different hardware platforms and different operating systems to share information. These protocols include the Simple Mail Transfer Protocol (SMTP) and Post Office Protocol (POP) for e-mail, the File Transfer Protocol (FTP) for file transfer, and the Network News Transfer Protocol (NNTP) for reading and posting to Internet newsgroups.

The World Wide Web (Web for short) uses a protocol named the Hypertext Transfer Protocol (HTTP) to transfer hypermedia documents and other resources between server and client computers. These hypermedia documents are often referred to as Web pages and can contain links to other Web pages (hence the term hypertext). Figure 2.1 shows an example of a hypermedia document. The screen shot illustrates many of the features of the Web: hyperlinks (the underlined text) which can transport the user to another page; embedded data objects (such as the Under Construction picture); and embedded applications (the LED banner is actually an application transferred from the server and executed on the client machine).

Figure 2.1. An example of a Web document.

This chapter discusses the HTTP protocol in enough depth that Visual Basic programmers should be able to proficiently write applications that run on both HTTP server and client machines. Although some of the material may seem to be beyond the scope of a typical Visual Basic programming book, the concepts are necessary to correctly communicate on the Web.

It is important to understand the structure and operation of HTTP messages. Applications that serve as user agents (retrieving information from HTTP servers) must generate the proper request messages. This is necessary not only to ensure that the proper information is presented to the user, but also to limit the load on the network. Likewise, when developing server-side applications, you need to make sure the proper status, header, and entity information is returned to the client. This chapter, though not exhaustive on the subject, will give you an understanding of the proper use of HTTP messaging so you can effectively write both types of applications.

Introduction to HTTP


The HTTP protocol defines how client and server applications communicate in order to transfer hypertext documents and other resources located on the network. The protocol does not attempt to define what types of resources are transferred. The data may be text, sound, full-motion video, even applications to be executed on the client machine. Although the protocol is commonly used for communicating on a TCP/IP network (such as the Internet), it can be used with any network topology.


Even though the HTTP protocol has been in use since 1990, it is still evolving. As of this writing, the protocol is in Internet-Draft form. Internet-Drafts are working documents of a group known as the Internet Engineering Task Force (IETF). The Internet-Draft for the HTTP protocol is authored by the HTTP Working Group, which includes Tim Berners-Lee (who originally proposed the hypertext protocol in 1989), Roy Fielding, and Henrik Frystyk Nielsen. The protocol is currently being referred to as HTTP/1.0. The Internet-Draft was last published on February 19, 1996 and will expire on August 19, 1996. You should refer to this document if you want in-depth coverage of the HTTP specification. We will rely heavily on it throughout this chapter. A link to this Internet-Draft and a list of other HTTP-related Internet-Drafts can be found at http://www.ietf.cnri.reston.va.us/ids.by.wg/http.html.

According to the minutes of a recent meeting of the HTTP Working Group, the next release of the protocol, HTTP/1.1, is slated to be available in August, 1996. A protocol called HTTP/NG ("next generation") is slated to be released as a Proposed Standard in December, 1996.

The HTTP/1.0 Internet-Draft defines the HTTP protocol as "an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks...."

Many of these terms and concepts will be defined and discussed in the remainder of this chapter. Table 2.1 gives some brief definitions of the common terms used when discussing the HTTP protocol and Web communications.

Table 2.1. Web and protocol terminology.

Term          Definition

Connection    A "virtual circuit" connecting two programs and allowing these programs to communicate with one another.
Message       A structured sequence of characters transmitted using an open connection.
Request       A message from a client application to a server application, typically for retrieving a resource from an HTTP server.
Response      A message from a server application to a client application, typically containing a hypermedia file that was requested by the client application.
Resource      An object of data residing on the network or a service available on the network, which can be identified by a unique address.
Entity        Defined in the HTTP/1.0 Internet-Draft as "A particular representation or rendition of a data resource, or reply from a service resource, that may be enclosed within a request or response message." An entity is essentially the data from a file or service that is "wrapped" within an HTTP message.
User Agent    The application that initiates a request message. Typically Web browsers, spiders, agents, or other client-side tools.
Server        The application that listens on a network and responds to HTTP request messages.



HTTP as a Client/Server Protocol


As you can probably guess from Table 2.1, the HTTP protocol defines a client/server model. The difference between HTTP and typical client/server protocols, however, is that either role can be played by either computer involved in the conversation. The role a given computer plays during the conversation depends on the resource being accessed and possibly the HTML contained by the resource.

Another very important difference is that HTTP is a stateless protocol. There is no requirement for a user to go through a logon process, which is typical of most client/server systems. In fact, the majority of HTTP data transfers are completely anonymous: beyond the machine address of the client, the server has no knowledge of who is retrieving data from it. In addition to the lack of user information, the HTTP protocol provides no mechanism for tracking how long a client may actually use information it has retrieved or for knowing what a client may have done before requesting a resource from the server.

The typical HTTP conversation contains the following steps, which are illustrated in Figure 2.2.

  1. A client opens a connection with a server. Recall that any computer can act as a client, even if the computer is running a server application.

  2. The client sends a request to the server. This request consists of a request method, a resource or service address, and possibly other header fields and body content. These concepts are discussed in the following sections.

  3. The server returns to the client a status line, possible header information, and (usually) an entity section.

  4. The server closes the connection.

Figure 2.2. Illustrating the request-response nature of HTTP.
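The four steps of the conversation can be sketched with Python's standard socket module. This is only an illustration under stated assumptions: the one-shot toy server, the loopback address, the canned response, and the names serve_once, fetch, and demo are all invented for the example and are not part of the protocol.

```python
import socket
import threading

def serve_once(listener):
    """Toy server (assumption): accept one connection, answer one request, close."""
    conn, _addr = listener.accept()
    with conn:
        conn.settimeout(5)
        data = b""
        # Read until the blank line that ends the request headers.
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(1024)
            if not chunk:
                break
            data += chunk
        # Step 3: status line, one header, blank line, entity body.
        conn.sendall(b"HTTP/1.0 200 OK\r\n"
                     b"Content-Type: text/html\r\n"
                     b"\r\n"
                     b"<html>hello</html>")
    # Step 4: leaving the with-block closes the connection.

def fetch(host, port, path):
    """Client side: open a connection, send a request, read until close."""
    # Step 1: the client opens a connection with the server.
    with socket.create_connection((host, port), timeout=5) as sock:
        # Step 2: the client sends its request.
        sock.sendall("GET {} HTTP/1.0\r\n\r\n".format(path).encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(1024)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)

def demo():
    """Run the toy server and the client against each other on loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()
    try:
        return fetch("127.0.0.1", port, "/")
    finally:
        server.join()
        listener.close()

if __name__ == "__main__":
    print(demo().decode("latin-1"))
```

Note that the client has no length information to rely on in this sketch; it simply reads until the server closes the connection, which mirrors step 4.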

There is always the possibility on the global Internet that a connection can fail at any time. The HTTP protocol provides that both the client and server applications must be prepared for such a situation. The loss of connection can occur due to user interaction, a communication time-out, or an application failure. A loss of connection is considered to terminate the client's current request. This means that, regardless of the state of the request when the termination occurs, the client must restart the entire process to properly attempt to access the resource again.

In addition to this drawback, the HTTP protocol allows for only a single resource to be transferred during a connection. This means that if a hypertext page has embedded references to other resources (such as images), the client must retrieve each resource individually through separate connections. For example, to construct the Web page of Figure 2.1, the Web browser had to make three connections: one to retrieve the HTML file, another to retrieve the embedded Java applet, and a third to retrieve the Under Construction picture. This shortcoming of HTTP has often been blamed for the slow response time of the Web.

Addressing on the Web


As mentioned in the preceding section, HTTP is used to transfer data objects from a server machine to a client. In order to make this transfer happen, the two applications involved in the conversation must recognize a common addressing mechanism. This addressing mechanism must uniquely identify every data object available not only on the server application machine, but also on the entire network the applications are using for communication. The addressing scheme must also be familiar to programmers and publishers on the Web because addresses must be used both to gather and to publish information or services on the Web.

The Web uses a form of address known as a Uniform Resource Identifier (URI) to identify data objects on servers. The URI for an object is independent of which protocol is used to access the data. An object's URI also provides no real clue as to what type of data is being identified. However, most URIs include a filename extension (similar to the DOS filename extension), which can be used by the client application as a clue to how the object should be presented to the user. For example, a Web site dedicated to gardening may have a file named roses.htm, which is most likely an HTML document, and perhaps a file named roses.gif, which is probably a picture of roses.

The more common form of a URI is known as a Uniform Resource Locator (URL). This is typically what is specified when you see a list of Web pages. A URL is a URI that contains protocol information specifying how the data object should be retrieved from the server. The difference is subtle but important. Here are a few examples that should clear up any confusion you may have:

URI: //myserver.com/user1/default.htm

URL: http://myserver.com/user1/default.htm

URL: ftp://ftp.myserver.com/demos/demo.zip

The first two examples illustrate the crucial difference between a URI and a URL. The URL informs the machine at address myserver.com to retrieve a file named default.htm from its /user1 directory and return it to the client using the HTTP protocol. The URI merely defines the location of the file, whereas the URL specifies how it should be retrieved.

Similar to DOS, the URI can contain either absolute or relative addressing. For instance, if you are viewing (or creating) a Web-based document with a URI of //myserver.com/user1/default.htm, you can use the following addressing mechanisms within the document:

Another similarity to DOS in specifying a URI is that the URI cannot contain any spaces and must encode certain reserved characters. If whitespace is required within a URI, you must encode the space as the string "%20". So, a directory named "My Documents" if used in a URI would appear as "My%20Documents". Similarly, there are several reserved characters that have special meanings within a URI. These are outlined in Table 2.2. These reserved characters, if actually meant to appear within a URI, must be encoded using the ISO Latin-1 character set.

Table 2.2. Reserved characters in URIs.

Character            Usage in URIs

% (percent)          Identifies encoded characters.
/ (forward slash)    Used to separate path and filenames.
# (pound)            Used to separate the URI of an object from a placeholder within the object. Often used to mark off sections of a document into different portions.
? (question mark)    Identifies a query to a data object. Text after the question mark is the search term(s) to be applied to the data object.
*, !, ^, |, ~        Reserved for special circumstances.
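The encoding rule can be tried out with a short Python sketch. It uses the standard urllib.parse module; the helper name encode_path and the sample paths are assumptions made for illustration, and the ISO Latin-1 setting is chosen to match the character set named above.

```python
from urllib.parse import quote, unquote

def encode_path(path):
    """Percent-encode a URI path, leaving the "/" separators untouched."""
    return quote(path, safe="/", encoding="latin-1")

# A space becomes %20, as in the "My Documents" example.
print(encode_path("/user1/My Documents/default.htm"))
# /user1/My%20Documents/default.htm

# Reserved characters that are meant literally are encoded as well.
print(encode_path("/faq/50% off#1.htm"))
# /faq/50%25%20off%231.htm

# Decoding reverses the process.
print(unquote("My%20Documents"))
# My Documents
```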



The HTTP URL


The HTTP URL has a specific format which identifies data resources available on a network. The format is:

http://<host>[:<port>]/[<path>][?<search_text>]

The http portion is used to indicate that the resource is to be retrieved using the HTTP protocol. The <host> is the Internet host name or IP address for the machine where the resource resides. The port, which is a numeric value, is an optional parameter necessary if the server is not listening on TCP port 80 (which is the value assumed if <port> is not specified). The <path> portion is either an absolute path or relative path locating the resource within the server's file structure. If the <path> is not specified, the server should respond with a default HTML file. This method is typically used to access the site's home page. You can specify the <path>, however, if you know the exact URL for the resource you're interested in. The default file's location is typically set up in the server software's setup or configuration program. If the resource can be searched, the ?<search_text> portion can be provided to instruct the server on how the resource should be searched. This item is both server and resource specific. Later chapters will address these searchable resources in-depth.
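As a rough illustration of the format, the following Python sketch splits an HTTP URL into the parts just described and applies the default port. It relies on the standard urllib.parse.urlsplit function; the helper name split_http_url, the dictionary keys, and the sample URLs are assumptions made for the example.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # assumed when <port> is omitted

def split_http_url(url):
    """Return the host, port, path, and search text of an HTTP URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("not an HTTP URL: " + url)
    return {
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else DEFAULT_HTTP_PORT,
        # An empty path asks the server for its default document.
        "path": parts.path or "/",
        "search_text": parts.query,
    }

print(split_http_url("http://myserver.com:8080/user1/default.htm?roses"))
print(split_http_url("http://myserver.com"))
```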

HTTP Messages


As most programmers familiar with Windows programming are aware, applications can communicate with one another only when there exists an agreed-upon language. When computer applications converse, they use messages that must conform to predefined rules and formats.

Similarly, all conversations that take place using the HTTP protocol use a message-based system. The message format is nearly identical for both client and server sides of the conversation. This section discusses the Backus-Naur Form (a method of documenting rules and syntax) and the general format of HTTP messages. The following sections discuss the portions of HTTP messages that are of particular importance to programmers designing Web-based applications.

Using Backus-Naur Form


The HTTP/1.0 Internet-Draft document makes heavy use of a notation known as the Backus-Naur Form (BNF) to specify how the HTTP protocol operates. This section illustrates the Backus-Naur Form in enough detail to help you understand the notation. The BNF is compact and easy to read. Most programmers have no problem understanding the format, but some introduction to it may still be necessary.

Basic Notation

The basic notation for defining a rule using the BNF is

name = definition

The name of a rule is simply the name itself (no enclosing < or > characters). The = sign is used to separate the rule from its definition. Whitespace (spaces, tabs, new lines, and so on) has no meaning within a rule except that indentation indicates that a definition spans multiple lines. The < and > signs can be used within a definition to help separate element names. There are certain basic rules that appear in uppercase characters (SP to indicate a space, DIGIT to indicate a numeric character, and so forth).

To represent a literal character within a rule, place quotation marks around it:

"literal"

Use of quotation marks is reserved for marking literal characters. The text is generally not case-sensitive (unless otherwise noted).

To denote an either/or rule, use

rule1 | rule2

The pipe character denotes that either element can be used. For example, 0 | 1 states that either 0 or 1 is acceptable.

Parentheses are used to group elements that are considered as a single element:

(rule1 rule2)

For example, (x (and | or) y) allows for (x and y) or (x or y) to be accepted.

To indicate a repeating rule or element, use

*rule

The asterisk character (*) indicates repetition of an element. The notation is also expressed as <n>*<m>element. This indicates that at least <n> and at most <m> repetitions of the element are permitted. The default for <n> is 0; the default for <m> is infinity. For example, 1*element indicates that at least one of element must appear. The rule 1*5element indicates that at least one but at most five of element can appear.

The square brackets ([]) are used to indicate elements that are optional:

[rule]

This is identical to the notation used in the Visual Basic documentation to indicate optional parameters in function calls.

To indicate a specific number of repetitions of an element or rule, use

N rule

This indicates that the rule must appear exactly N times. The notation is identical to <n>*<n>rule defined above. For example, 2DIGIT indicates a two-digit number.

To indicate a list of elements, the notation

#rule

is used. This notation is similar to *rule defined earlier, except that the list elements are separated by commas. A more complete form is <n>#<m>element, where <n> indicates the minimum number of list items acceptable and <m> indicates the maximum number of list items. The defaults are 0 and infinity.

To separate rules from comments, use a semicolon character, as in

; comment

The comment starts with the semicolon and continues to the end of the current line.
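One way to get a feel for the notation is to translate a few rules into regular expressions. The Python sketch below does this by hand for a handful of sample rules; the RULES table, the matches helper, and the sample rules themselves are assumptions made for illustration, and the comma separator used for the #rule example follows the list notation of the HTTP specification.

```python
import re

DIGIT = "[0-9]"

# Hand-written regular expressions for a few sample BNF rules.
RULES = {
    '"0" | "1"': "0|1",                     # either/or
    "1*DIGIT": DIGIT + "+",                 # at least one repetition
    "1*5DIGIT": DIGIT + "{1,5}",            # between one and five repetitions
    "2DIGIT": DIGIT + "{2}",                # exactly two repetitions
    '"a" ["b"]': "ab?",                     # optional element
    "1#DIGIT": DIGIT + r"(\s*,\s*" + DIGIT + ")*",  # comma-separated list
}

def matches(rule, text):
    """Return True if the whole text satisfies the named sample rule."""
    return re.fullmatch(RULES[rule], text) is not None

print(matches("1*DIGIT", "1996"))     # True
print(matches("2DIGIT", "7"))         # False
print(matches("1#DIGIT", "1, 2,3"))   # True
```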

Basic Rules

Some of the basic rules you'll encounter in this chapter are listed in the following lines:

OCTET = <any 8-bit sequence of data>
CHAR = <any ASCII character>
ALPHA = <any character in the range "A"…"Z" or "a"…"z">
DIGIT = <any digit "0"…"9">
CR = <the ASCII carriage return (Chr$(13))>
LF = <the ASCII line feed (Chr$(10))>
CRLF = CR LF
SP = <the ASCII space (Chr$(32))>
TEXT = <used for describing values that won't be parsed by the applications>

The Format of an HTTP Message


As mentioned previously, the client request message and the server response message use a similar format. In fact, the formats are practically identical. This section introduces the Backus-Naur Form for the two messages.



Remember, the role of client (requester) and server (responder) can belong to either machine involved in a conversation at any given time during the conversation. However, during a given connection the roles will not change.

HTTP messages can take either a full request/response or a simple request/response message format. The simple request/response format is used by HTTP/1.0 clients and servers to communicate with clients and servers using previous versions (specifically HTTP/0.9) of the HTTP protocol. When a client makes a simple request, the server must use the simple response format. Simple requests can also be used by the client application in instances where the available HTTP headers and content negotiation would be merely unnecessary overhead.

The syntax of the full request is illustrated in the following lines:

<Method> SP <URI> SP <HTTP-Version> CRLF
    *( <General-Header>
    | <Request-Header>
    | <Entity-Header> )
CRLF
[<Entity-Body>]

The format specifies that the first line of the full request contains three required elements: a method, a URI, and the HTTP version. The elements are separated by a space, and the line is terminated with a carriage return and line feed. The methods include GET, HEAD, POST, and others, discussed in the next few sections. The <HTTP-Version> element is defined by the following BNF rule: "HTTP/" 1*DIGIT "." 1*DIGIT. The current version would have an <HTTP-Version> element of "HTTP/1.0".

The header fields are optional fields, which will be discussed shortly. Header fields are defined by the following rule:

HTTP-Header = <Header-Field-Name> ":" [<Value>] CRLF

There can be any number of headers, and they can appear in any order. By convention, General-Header fields are sent first, then Request-Header or Response-Header fields, then Entity-Header fields. A header can span multiple lines as long as each continuation line begins with whitespace. A request can also include an entity body if it is separated from the headers and initial request line by a CRLF on a line by itself. For example, the request message includes an entity portion if the request is a POST from an HTML form.
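The full-request layout can be assembled mechanically, as the following Python sketch suggests. The function name build_full_request, the sample URI, and the User-Agent value are assumptions made for the example; only the line layout comes from the syntax above.

```python
CRLF = "\r\n"

def build_full_request(method, uri, headers=(), body=b""):
    """Assemble a full request: request line, headers, blank line, body."""
    lines = ["{} {} HTTP/1.0".format(method, uri)]
    for name, value in headers:
        lines.append("{}: {}".format(name, value))
    # Each line ends with CRLF; one extra CRLF marks the end of the headers.
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("latin-1") + body

message = build_full_request(
    "GET",
    "/user1/default.htm",
    headers=[("User-Agent", "Example/0.1")],  # hypothetical product token
)
print(message)
```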

The full response message looks like this:

<HTTP-Version> SP <Status-Code> SP <Reason-Phrase> CRLF
    *( <General-Header>
    | <Response-Header>
    | <Entity-Header> )
CRLF
[<Entity-Body>]

As I have stated, the response is very similar to the request. The first line of the format and the use of <Response-Header> are the only differences. The first line of the response consists of the <HTTP-Version> element, a status code, and a reason phrase. The status codes and reason phrases defined by the protocol will be discussed later in this chapter.

A simple request looks like the following line:

"GET" SP <URI> CRLF

The simple response (which must be returned if the server receives a simple request) contains only the entity body (the HTML document itself, for example). The server is not permitted to return any header fields and cannot identify the media type of the data being returned.
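A client has to cope with both response formats. The Python sketch below shows one possible way to take a raw response apart; the function name parse_response, the returned tuple layout, and the sample messages are assumptions made for the example, and error handling for malformed status lines is left out.

```python
def parse_response(raw):
    """Split raw response bytes into (status_line, headers, body)."""
    if not raw.startswith(b"HTTP/"):
        # Simple response: no status line or headers, only the entity body.
        return None, [], raw
    # The first empty line separates the headers from the entity body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)
    headers = []
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and headers:
            # Continuation line: fold it into the previous header's value.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            name, _, value = line.partition(":")
            headers.append((name, value.strip()))
    return (version, int(code), reason), headers, body

full = (b"HTTP/1.0 200 OK\r\n"
        b"Server: Example/0.1\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html></html>")
print(parse_response(full))
print(parse_response(b"<html></html>"))
```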

General Message Header Fields

The <General-Header> fields are common to both request and response messages. They are optional headers that apply only to the individual messages themselves and not to the applications, machines, or users involved or the data being transferred.

Date

The Date header specifies the date and time the message was transferred. The value must be a valid HTTP date. There are three generally accepted formats. See the HTTP/1.0 Internet-Draft document for the full definition of the HTTP date formats. The following is an example of the header field in the preferred format:

Date: Mon, 18 Mar 1996 10:05:00 GMT

The Date header should always be sent in full response messages in order to allow clients to properly cache data. Clients should send the Date field in request messages when sending requests that include an entity body (such as PUT and POST requests).
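A Date value in the preferred format can be produced with Python's standard library, as in this small sketch. The helper name http_date is an assumption made for the example; calendar.timegm and email.utils.formatdate are standard functions.

```python
import calendar
from email.utils import formatdate

def http_date(year, month, day, hour, minute, second):
    """Format a GMT date and time in the preferred HTTP date format."""
    timestamp = calendar.timegm((year, month, day, hour, minute, second))
    return formatdate(timestamp, usegmt=True)

print("Date: " + http_date(1996, 3, 18, 10, 5, 0))
# Date: Mon, 18 Mar 1996 10:05:00 GMT
```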

Pragma

The Pragma header is used for implementation-specific directives that may be of value to either the requester or the responder.

Forwarded

Used by proxy machines when the message travels between origin and destination through other machines, the Forwarded header is used mainly to trace the route of an HTTP message through proxy servers. It is of limited use to Visual Basic programmers except in debugging problems involving a firewall (which is a machine that, for security reasons, sits between the global Internet and an organization's local area network).

Message-ID

The Message-ID field is used to attempt to uniquely identify a message, but not the contents of the message. The Message-ID is intended to be valid for a longer time period than the message itself. The value typically consists of a string which is unique at the originating machine followed by the @ character and the fully-qualified domain name of the originating machine. Many methods are available for generating the ID, but the following is an example:

Message-ID: <9603181005123@myserver.com>


The Forwarded and Message-ID header fields have appeared in some texts covering HTTP but do not appear in the current version of the Internet-Draft for HTTP/1.0.


HTTP Request Messages


The HTTP Request message is the mechanism used to retrieve a data resource from a server. In order to maintain backward compatibility with the previous version of the HTTP protocol, the HTTP/1.0 protocol provides for both a full request (for HTTP/1.0) and a simple request (for HTTP/0.9) style of message. If an HTTP/1.0 server receives a simple request message, it must respond with an HTTP/0.9-compatible simple response message. An HTTP/1.0 client, for its part, should always generate a full request message.

Request Methods


As mentioned in the previous section, the syntax of the full-request message includes an element named <Method>, which has the following rule:

Method = "GET" | "HEAD" | "POST" | <extension-method>

The <Method> element indicates what operation should be performed on the data resource specified by the <URI> element. The acceptable methods for a given resource can change at any time. If a method is not allowed for a resource, the client receives notification of this in the <Status-Code> and <Reason-Phrase> elements of the response message.

The following sections describe the named methods. The <extension-method> element allows for extensions to the HTTP/1.0 protocol. Both client and server must recognize these extended methods or the server will likely return a <Status-Code> of 501 (not implemented).

The GET Method

As perhaps the easiest method to understand, the GET method merely instructs the server to return to the client the resource indicated by the <URI> element of the request message. If the <URI> points to a server application, the server returns the data output by the application, not the application itself.

Also, a Request-Header field named If-Modified-Since creates a conditional GET request. If the resource has been modified since the time value specified in the header, the resource is returned. If it has not been modified since that time, the server responds with a status code of 304 (not modified) and with no entity in the response message. This header field is used to perform client-side caching and reduce network load.

The HEAD Method

The HEAD method is nearly identical to the GET method. The very important difference, however, is that the server must return only HTTP header information related to the resource. The resource (entity) itself must never be returned in response to a HEAD request.

The HEAD method allows spiders and agents operating on the Web to retrieve only necessary header information about a particular resource. This can be useful when checking the validity of hypertext links or checking a resource to see whether it has been modified since a particular date.

The POST Method

The POST method is used when sending entity information to a server. For instance, POST is used when filling out an HTML form on a Web page. The Submit button on the form typically performs a POST request and appends the form's field values to the request message as the <Entity-Body> element.

The POST method is usually performed on some type of application resource as opposed to a document resource. A successful POST request does not require the server to return an <Entity-Body> element in the response message. In some cases, the action may not produce a resource that can be identified by a URI. The server should then indicate a <Status-Code> of 200 (okay) or 204 (no content), depending on whether the response includes an entity describing the result. If the action does create a new resource on the server, the <Status-Code> should be 201 (created), and the response should contain an <Entity-Body> that describes the status of the request and refers to the new resource.

An entity header field called Content-Length is required on all POST messages. If it is invalid or missing, the server returns a <Status-Code> of 400 (bad request).
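The following Python sketch shows how a form submission might be turned into a POST request with a correct Content-Length. The function name build_form_post, the URI /cgi-bin/guestbook, and the field names are hypothetical; the Content-Type value is the usual encoding for HTML forms and is an assumption here, since the text above does not name it.

```python
from urllib.parse import urlencode

def build_form_post(uri, fields):
    """Build a POST request whose entity body carries the form fields."""
    body = urlencode(fields).encode("ascii")
    head = (
        "POST {} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "Content-Length: {}\r\n"  # number of bytes in the entity body
        "\r\n"
    ).format(uri, len(body))
    return head.encode("ascii") + body

message = build_form_post(
    "/cgi-bin/guestbook",                           # hypothetical resource
    [("name", "Ann Lee"), ("flower", "roses")],     # hypothetical fields
)
print(message.decode("ascii"))
```

Computing the length from the encoded body, rather than from the original field values, keeps the header accurate after spaces and reserved characters have been encoded.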

Request Message Header Fields


The full-request message can contain any number of header fields that can be used to qualify the request or to provide information about the client making the request. The syntax for the request header is

Request-Header = Authorization | From | If-Modified-Since | Referer | User-Agent

Additional field names can be added only if all applications involved in a conversation recognize them as request header fields. Otherwise, unrecognized fields are considered Entity-Headers.

Authorization

The Authorization request-header field is used by user agents that wish to present some sort of credentials to the server. The format of the field is

Authorization = "Authorization:" <credentials>

More on authentication appears in the last section of this chapter.

From

The From request-header field is sent by a user agent that wishes to provide the e-mail address of the person who is at the helm. The address should be a valid mailbox and should be sent only with the user's express knowledge and permission. This field should always be used by Web robots and crawlers to provide the e-mail address of the person who started the robot. The format of the field is as follows:

From = "From:" <mailbox>

If-Modified-Since

As mentioned in the section titled "The GET Method", the If-Modified-Since header field is used to produce a conditional GET request. The field uses this format:

If-Modified-Since = "If-Modified-Since:" <HTTP-date>

The resource is returned to the client only if the resource has been modified since the date specified in the <HTTP-date> element. If the <HTTP-date> element specifies an invalid date or if the date is later than the server's current date, the server essentially ignores the header field and returns the resource as though it is responding to a normal GET request.
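On the server side, the decision just described could look something like the following Python sketch. The function name conditional_get_status and its parameters are assumptions made for the example, and the parser used here is aimed at the preferred date format rather than all three accepted formats.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_get_status(if_modified_since, last_modified, now):
    """Return 200 if the resource should be sent, 304 if it is unchanged."""
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # invalid date: behave like a normal GET
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    if since > now:
        return 200  # date is later than the server's clock: ignore the header
    return 200 if last_modified > since else 304

now = datetime(1996, 3, 20, tzinfo=timezone.utc)
modified = datetime(1996, 3, 18, 10, 5, 0, tzinfo=timezone.utc)
print(conditional_get_status("Sun, 17 Mar 1996 00:00:00 GMT", modified, now))
print(conditional_get_status("Tue, 19 Mar 1996 00:00:00 GMT", modified, now))
```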

Referer

The Referer header field specifies the URI of the resource from which the request message's <URI> element was obtained. This field must be sent only if the <URI> field has actually been obtained from a source that has an address. If a user has generated the <URI> element value (by typing in the address or selecting from a bookmark list, for example), this field must not be sent. It uses this format:

Referer = "Referer:" <referer-URI>

User-Agent

User-Agent contains information about the user agent that generated the request message. This request-header field is useful to the server in logging server activity and also for creating responses that are specific for the given user agent. The field is not required but should be sent as a courtesy to the server, using this format:

User-Agent = "User-Agent:" 1*( <product> | <comment>)

The convention for the <product> element is to list the information in order of significance. Typically, this field's values include the product name of the user agent, the product version, and sometimes the operating system the user agent is running under.

HTTP Response Messages


The HTTP response message is really where the bulk of the information transmitted on the Web is contained. After the server receives a request message, the server processes it and determines what should be returned to the client.

The Status Line of HTTP Responses


The first line of the response message, as previously shown, includes the <HTTP-version>, <Status-Code>, and <Reason-Phrase> elements. The <HTTP-version> element has been discussed in previous sections of this chapter. It is identical in this usage to the previous usages.

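The <Status-Code> element consists of a three-digit integer code. This code is meant to be used by the client application to determine the status of the response message. The first digit of the code indicates the category into which the response message falls:

  1xx: Informational. Not used in HTTP/1.0, but reserved for future use.

  2xx: Success. The action was successfully received, understood, and accepted.

  3xx: Redirection. Further action must be taken in order to complete the request.

  4xx: Client error. The request contains bad syntax or cannot be fulfilled.

  5xx: Server error. The server failed to fulfill an apparently valid request.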

The <Reason-Phrase> element is a textual message intended for the user. It attempts to explain the <Status-Code> in language meaningful to a human reader. The client application is not required to display the <Reason-Phrase> element, but typical user agents do anyway.

Table 2.3 lists the possible <Status-Code> values and typical corresponding <Reason-Phrase> values.

Table 2.3. Response message status codes and reason phrases.

Status Code    Reason Phrase

200            OK
201            Created
202            Accepted
204            No Content
301            Moved Permanently
302            Moved Temporarily
304            Not Modified
400            Bad Request
401            Unauthorized
403            Forbidden
404            Not Found
500            Internal Server Error
501            Not Implemented
502            Bad Gateway
503            Service Unavailable
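A client usually needs only the first digit of the status code to decide how to proceed. The Python sketch below shows one way to classify a code; the names STATUS_CLASSES and status_class are assumptions made for the example, and the category names follow the HTTP/1.0 specification.

```python
# Category of a response, keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Return the category name for a three-digit status code."""
    return STATUS_CLASSES.get(code // 100, "Unknown")

for code in (200, 304, 404, 501):
    print(code, status_class(code))
```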



Response Message Header Fields


Just as the request message can send header fields that qualify or provide additional information about the request being submitted, the HTTP protocol provides the opportunity for the response message to send the client additional information about the response. If new header fields are added, they must either accompany a change in the HTTP protocol version, or all parties in a conversation must recognize them as response header fields.

The syntax of the <Response-Header> element is

Response-Header = Location | Server | WWW-Authenticate

Location

The Location header field is defined as

Location = "Location:" <absoluteURI>

It specifies the absolute address to which the client should be redirected. It is used when the <Status-Code> is of the 3xx variety (indicating redirection), such as when a resource has moved to a new location. The <Entity-Body> element of the response message typically includes a short note explaining the redirection and offering a hyperlink to the URI specified in the Location field's value.
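A user agent might handle redirection along the following lines. This Python sketch assumes the response headers have already been parsed into a list of (name, value) pairs. Both that representation and the function name redirect_target are assumptions made for the example.

```python
def redirect_target(status_code, headers):
    """Return the Location URI of a 3xx response, or None if there is none.

    headers is assumed to be a list of (name, value) string pairs.
    """
    if 300 <= status_code <= 399:
        for name, value in headers:
            # Header field names are not case-sensitive.
            if name.lower() == "location":
                return value.strip()
    return None
```

When the function returns a URI, the user agent can issue a new request for it. A real application should also limit how many redirections it will follow in a row.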

Server

The Server header field is the response message's equivalent to the request message's User-Agent header field. It provides information to the client about the server application and version, using the format:

Server = "Server:" 1*(<product> | <comment>)

As with the User-Agent header field, the Server field's values are listed in order of their significance in identifying the server software.

WWW-Authenticate

If the <Status-Code> in the response message is 401 (Unauthorized), the WWW-Authenticate header field must be included in the message. Such a response issues an authentication challenge to the user agent, which must then provide authentication information to the server. The WWW-Authenticate header uses the format:

WWW-Authenticate = "WWW-Authenticate:" 1#<challenge>

The issue of authentication is taken up in the final section of this chapter, "Authentication Through HTTP."
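As a rough illustration of what a user agent does with this header, the Python sketch below pulls the scheme name and realm out of a single challenge. It assumes the challenge has the form scheme realm="..." used by the Basic scheme. It does not handle several challenges in one header, extra parameters, or escaped quotes. The name parse_challenge is hypothetical.

```python
import re

# Matches a challenge such as: Basic realm="WallyWorld"
# (an assumption about the challenge layout; see the lead-in).
_CHALLENGE = re.compile(r'^\s*([A-Za-z0-9_.-]+)\s+realm="([^"]*)"')


def parse_challenge(value):
    """Return (scheme, realm) from a WWW-Authenticate field value."""
    match = _CHALLENGE.match(value)
    if match is None:
        raise ValueError("unrecognized challenge: %r" % value)
    return match.group(1), match.group(2)
```

The realm string is what a browser would display when it prompts the user for a user-ID and password.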

The Entity Portion of HTTP Messages


The bulk of response messages and a few types of request messages include an <Entity-Body> element and can also include an <Entity-Header> element. In a response message, the <Entity-Body> portion is the actual data resource that is being transferred. For instance, if the request message GET http://myserver.com/home.html HTTP/1.0 is received by a server, the entity portion of the response message will be the file home.html located in the Web server's root directory.

In a request message, the <Entity-Body> element is used for POSTing information to a Web server. This is used typically for submitting information entered into an HTML form by a human user or for queries being performed by some automated user agent application.

The remainder of this section discusses the <Entity-Header> element.

Entity Header Fields


The <Entity-Header> element's fields provide additional information about the <Entity-Body> or, if the <Entity-Body> is not present (as in the case of HEAD requests), about the resource requested. These headers are optional. If <Entity-Header> fields are sent, they generally follow the <Response-Header> fields.

The following line shows the format for the <Entity-Header> element:

Entity-Header = Allow | Content-Encoding | Content-Length | Content-Type | 
                Expires | Last-Modified | <extension-header>

The <extension-header> element provides for the addition of new <Entity-Header> element fields without changing the entire protocol. However, if a user agent does not recognize an <Entity-Header> element field, that field is (and should be) ignored.

Allow

The Allow field is used to indicate which methods are supported by the resource requested in the request message. The format of the header field is

Allow = "Allow:" 1#method

An example is Allow: GET, HEAD. This header field is used to inform the user agent of which methods are valid for the resource being requested. However, it does not specify which methods are implemented by the server. Also, the use of this field does not prevent the user agent from attempting to perform other methods upon the resource. Nevertheless, it is good practice to follow the advice of this header field.
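A user agent that wants to follow the advice of the Allow field could check it with something like the following Python sketch. The function names are made up for the example. Method names are compared exactly as written because HTTP method names are case-sensitive.

```python
def parse_allow(value):
    """Split an Allow field value such as "GET, HEAD" into a list of methods."""
    return [method.strip() for method in value.split(",") if method.strip()]


def method_allowed(method, allow_value):
    """Report whether the Allow field value lists the given method."""
    return method in parse_allow(allow_value)
```

For the value "GET, HEAD", method_allowed("POST", ...) is false, so a well-behaved user agent would not attempt a POST on that resource.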

Content-Type, Content-Encoding, and Content-Length

These header fields indicate the type of resource being returned, how it is encoded, and the size of the <Entity-Body> being returned. The following lines show the respective formats for these headers:

Content-Type = "Content-Type:" <media-type>
Content-Encoding = "Content-Encoding:" <content-coding>
Content-Length = "Content-Length:" 1*DIGIT

The Content-Type header indicates the media type used for the <Entity-Body> element. It essentially documents the format of the entity being transferred. The typical HTML document is sent with a Content-Type of text/html. If the request method is HEAD, the Content-Type represents the media type that would have been returned if the request had been a GET.

The Content-Type field can be used by the user agent to determine how to present the resource to the user. Web browsers use content types when setting up helper applications, which are external applications used to display specific file types (also known as media types). For instance, a Content-Type of audio/basic represents a sound file that the typical Web browser has to present to the user through an external application.

The Content-Encoding field is a modifier to the Content-Type header and indicates whether and how a resource has been encoded before transmission. This is used when a resource has been compressed, for example. The user agent must use the encoding information in order to decode the data received before presenting it to the human user.

The Content-Length field indicates the size of the <Entity-Body> element. It is used in both request and response messages and is, in fact, required in request messages that include an <Entity-Body> (POST requests, for example). The size is the number of octets sent to the recipient. If the message is a response to a HEAD request, the Content-Length field indicates the size that would have been returned by a response to a GET request on the same resource.
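The relationship between these headers and the entity body can be sketched in Python as follows. The first function builds a simple response on the server side. The second reads the entity body back on the client side, using Content-Length to decide how many octets belong to it. Both function names are invented for this illustration, and the sketch ignores Content-Encoding and all other headers.

```python
def build_response(body, content_type="text/html", status_code=200, reason="OK"):
    """Build a minimal HTTP/1.0 response around body (a bytes object)."""
    head = [
        "HTTP/1.0 %d %s" % (status_code, reason),
        "Content-Type: %s" % content_type,
        # Content-Length counts octets, so measure the bytes, not the text.
        "Content-Length: %d" % len(body),
    ]
    # A blank line separates the headers from the entity body.
    return ("\r\n".join(head) + "\r\n\r\n").encode("ascii") + body


def entity_body(raw):
    """Extract the entity body from a raw response, honoring Content-Length."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip().decode("ascii"))
    # Without a Content-Length, everything after the blank line is the body.
    return rest if length is None else rest[:length]
```

Note that the length is taken from the encoded bytes. A string containing non-ASCII characters may occupy more octets than it has characters.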

Expires

As the name of this header field indicates, the Expires entity header indicates the date after which the entity should be considered expired or invalid. This can be returned by applications that are generating real-time data, for example, to indicate the date (and time) after which the entity should be considered as old news. Caches should not retain the entity after the date specified.

The presence of an Expires header does not mean that the resource will change or no longer exist after the value specified, but simply that it will be "stale" (to use the wording from the Internet-Draft). If the value specified is 0 or is an invalid HTTP date, the resource should be considered as immediately expired and should not be cached in any way.

The format for the Expires header is

Expires = "Expires:" <HTTP-date>
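A cache or user agent might apply these rules roughly as shown in the Python sketch below. It relies on the standard library's email.utils.parsedate_to_datetime function to read the date, and treats any value that cannot be parsed (including 0) as already expired. The function name is_expired and the optional now parameter are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_expired(expires_value, now=None):
    """Report whether an Expires field value lies in the past."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        expires = parsedate_to_datetime(expires_value.strip())
    except (TypeError, ValueError):
        # A value of 0 or an invalid date means "already expired".
        return True
    if expires.tzinfo is None:
        # Assume GMT when the parser returns a date without a time zone.
        expires = expires.replace(tzinfo=timezone.utc)
    return now > expires
```

Passing now explicitly makes the function easy to test. In normal use it is omitted and the current time is used.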

Last-Modified

The Last-Modified header field indicates the date that the server believes the resource was last changed. The value is interpreted differently depending upon the nature of the resource being transferred. This header can be used to determine whether a new copy of a resource should be retrieved or if the user should be notified that the resource has been changed. For a file resource, it is most likely to be the date that the file was last saved. For a database resource, this field can be used to indicate the date a record was last updated. The possibilities are endless.

The format for the Last-Modified header is

Last-Modified = "Last-Modified:" <HTTP-date>

When you write server-side applications, it is important to think about how this field should be used if the resource is time-sensitive. If you're writing a user agent application, take care when attempting to interpret this field's value.

Authentication Through HTTP


The HTTP/1.0 protocol provides a simple method for user authentication. This can be used to provide subscription-based services or to limit access to data and resources to specific individuals. The mechanism is a challenge-response cycle in which the server issues a challenge to the user agent and the user agent responds with the proper authentication information.

When a user agent attempts to access a resource that is in a protected space, the server responds with a message with a <Status-Code> of 401 (Unauthorized) and a WWW-Authenticate response header field. The WWW-Authenticate header will contain a <realm> element, which can be displayed to the user in order to obtain a user-ID and password.

After the user agent determines the user-ID and password to be used, it issues another request message to the server. This time the request includes an Authorization request header field. The value used for the field is the scheme name Basic followed by a base64 encoded string. Before encoding, the string consists of the user-ID and password separated by a colon. Borrowing the example from the HTTP/1.0 Internet-Draft, if the user-ID is "Aladdin" and the password is "open sesame", the Authorization header would be the following string:

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==



Information on base64 encoding can be found in the Internet Request-For-Comments (RFC) 1521 (this is included on the CD-ROM).
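The encoding step is easy to reproduce. The Python sketch below builds the Authorization header from a user-ID and password and also shows how to reverse the encoding. The function names are hypothetical, and the choice of Latin-1 for converting the text to octets is an assumption made for the example.

```python
import base64


def basic_authorization(user_id, password):
    """Build an Authorization header line for the Basic scheme."""
    # Join the user-ID and password with a colon, then base64 encode.
    credentials = ("%s:%s" % (user_id, password)).encode("latin-1")
    token = base64.b64encode(credentials).decode("ascii")
    return "Authorization: Basic " + token


def decode_basic_token(token):
    """Recover (user_id, password) from the base64 portion of the header."""
    credentials = base64.b64decode(token).decode("latin-1")
    # Split at the first colon only, since the password may contain colons.
    user_id, _, password = credentials.partition(":")
    return user_id, password
```

Calling basic_authorization("Aladdin", "open sesame") reproduces the header shown above. The fact that decode_basic_token needs no key shows that base64 is an encoding, not encryption.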

If the server determines that either the user-ID or password sent with the request is invalid, it responds with a <Status-Code> of 403 (Forbidden). The user agent can then prompt the user to re-enter the credentials. However, this cycle should not be allowed to repeat indefinitely.

This HTTP authentication method is a clear text method. The user-ID and password are base64 encoded, but they are not encrypted, and anyone who intercepts the message can easily decode them. Users should therefore not use sensitive passwords with this authentication method. If any programs you write will allow access to sites requiring authentication, you may wish to warn users that the passwords they enter will be sent unencrypted.

Summary


This chapter probably made for dry reading. However, it will serve as a valuable reference throughout the remainder of this book. Most of the applications we'll delve into will use HTTP messaging in one form or another. In some of those chapters, you'll expand on some of the topics covered here. In others, you'll simply refer back to this chapter to refresh your memory of this fascinating topic.

If you'd like to delve deeper into the HTTP protocol, I encourage you to follow the work of the HTTP Working Group. Relevant links can be found in Appendix D, "Bibliography and Cool Web Resources."
