World Wide Web Network Traffic Patterns

Jeff Sedayao
Intel Corporation, 2250 Mission College Blvd, Santa Clara, CA 95052

Abstract

The World Wide Web (WWW) generates a significant and growing portion of traffic on the Internet. With the click of a mouse button, a person browsing the WWW can generate megabytes of multimedia network traffic. The Web's growth and potential network impact merit a study of its traffic patterns, problems, and possible changes. This paper attempts to characterize World Wide Web traffic patterns. First, the Web's HyperText Transfer Protocol (HTTP) is reviewed, with particular attention to latency factors. User access patterns and file size distribution are then described. Next, HTTP design issues are discussed, followed by a section on proposed revisions. Benefits and drawbacks of each proposal are covered. The paper ends with pointers toward more information on this area.

1.0 Introduction

The World Wide Web [1] has been called the "killer app" of the Internet. Whole new businesses are being created to make advertising and information available on the Web. World Wide Web traffic is growing at a rate of over 20% per month [2]. With such an incredible growth rate, it becomes critical for network engineers and technology managers to understand the impact of World Wide Web traffic on networks. In fact, some network administrators fear the World Wide Web precisely because of its traffic implications. This paper attempts to characterize the network impact of the Web. The first section reviews the HyperText Transfer Protocol (HTTP) [3], the protocol used by the World Wide Web, and also covers user access patterns. The next section examines the issues caused by HTTP and the use of the Web. The last major section of the paper describes current attempts to deal with those Web traffic issues.

2.0 World Wide Web Traffic Characteristics

To understand the World Wide Web's traffic characteristics, a brief understanding of the HyperText Transfer Protocol (HTTP), the protocol used by the WWW, is necessary. HTTP is designed to be a simple request/response protocol. A client opens a connection to the server, sends a request, gets a response, and then closes the connection. Two key elements are the Uniform Resource Locator (URL) [4] and the method. The URL is a construct that provides information on the network location of a document on the Internet, including what server it lives on and how to access it. The method describes a specific action for a server to take. The most common HTTP request (and the one we will focus on) uses the "GET" method. A client first sets up a connection (usually through TCP [5]) to the target Web server. Next, the client sends GET (the method) followed by a series of definitions of the data formats it will accept. The server processes the request and sends a "meta-description" of the document to the client, followed by the document itself. The document (also known as a page) can be a variety of things: a form, an image, text, or a video or audio clip. After the information is received, the connection is closed.

What happens at the packet level with a GET request? Padmanabhan and Mogul [6] describe this in their study of HTTP. The client opens the TCP connection, resulting in an exchange of packets between client and server; that is one round trip. The HTTP request is then sent and the document received; this is the second round trip.
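The sketch below illustrates this one-connection-per-request pattern in Python. The host name and path are placeholders, and real browsers implement far more of the protocol; the point is simply that every URL costs a connection setup plus a request/response exchange before the connection is torn down.

    import socket

    def http_get(host, path, port=80):
        """Fetch one URL over its own short-lived TCP connection."""
        # Round trip 1: TCP connection setup.
        sock = socket.create_connection((host, port))
        try:
            # Round trip 2: send the GET request, then read the response.
            request = "GET %s HTTP/1.0\r\nHost: %s\r\nAccept: */*\r\n\r\n" % (path, host)
            sock.sendall(request.encode("ascii"))
            response = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                response += chunk
            return response
        finally:
            # The connection is closed as soon as the document arrives.
            sock.close()

    # Hypothetical usage:
    # page = http_get("www.example.com", "/index.html")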
For each URL retrieved, there are two round trips between the client and server. WWW documents may also contain "inline images", which are images embedded within a document. Web browsers process these as follows: first they send a GET request for the document itself, and then they must send another HTTP GET request for each unique image. Thus a document with n unique inline images will see 2n + 2 packet round trips. This is only the HTTP-generated traffic. Another study of Web traffic [7] notes that Domain Name Service (DNS) [8] queries are also made on each HTTP request. A client needs to do a DNS name-to-address lookup on each URL to obtain the network address of the server it must access. The server, in turn, may look up the name associated with the client's network address and then look up that name to verify that it really maps back to the client's address. All three DNS requests can result in packet round trips between the client's network and the server's network. The following is a summary of the steps in a single HTTP request:

1. DNS name-to-address lookup (client).
2. Connection setup.
3. DNS address-to-name lookup (server).
4. DNS name-to-address lookup (server).
5. Send HTTP request and receive page.

Each of these can result in packet round trips. Some, like steps 1, 4, and 5, can result in more than one packet round trip.

What are typical WWW access patterns? We looked at records of WWW activity at Intel over 32 weeks. The most common type of document or object is the Graphics Interchange Format (GIF) [9] image. This makes sense because GIFs are used as inline images in Web documents. The mean size of a GIF file is about 17005 bytes, with a median size of 4513 bytes. GIF files also generated most of the WWW traffic into Intel, followed by MPEG files (a video format), which averaged 609146 bytes in size. This indicates a traffic distribution that is greatly skewed between many small objects (GIFs) and a number of very large objects (MPEG videos).

3.0 Traffic Issues

To discuss traffic issues, we need to define two terms: bandwidth and latency. Bandwidth is the amount of data that can be moved through a network link during a given time, usually described in bits per second. Latency is the time it takes data to travel from one given point in a network to another, usually described in seconds or milliseconds. The lower limit on network latency is governed by the speed of light: the latency between San Francisco and New York will never be less than 13.7 milliseconds, no matter how big the bandwidth of a connection between the two cities.

One of the things a first-time Web user notices is that when clicking on an anchor (indicating a hypertext link), there is no way to know how long the target page will take to load. The page may be short, or it may be long and filled with inline images, because Web page authors rarely indicate the size of the page an anchor points to. Many WWW users find it disturbing to wait unpredictable amounts of time for pages to load. From a network perspective, the result is also a bursty, unpredictable traffic mix. As described above, the traffic varies between small objects (mostly) and very large objects.

The HTTP protocol has a number of problems. It was designed to be as stateless as possible: GET requests are treated independently by an HTTP server. This makes HTTP servers easier to write.
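To make the cost of this one-connection-per-object design concrete, the following sketch estimates the minimum time to load a page from the round-trip counts above. The round-trip time and image count in the example are illustrative assumptions, not measurements from the Intel data.

    def min_page_load_time(num_images, rtt, dns_round_trips=3):
        """Rough lower bound on page load time for one-connection-per-GET HTTP.

        Each GET costs at least two round trips (connection setup plus
        request/response), and a page with n unique inline images needs
        n + 1 GETs, i.e. 2n + 2 round trips, plus any DNS round trips."""
        http_round_trips = 2 * (num_images + 1)
        return (http_round_trips + dns_round_trips) * rtt

    # Example: a page with 10 inline images over an assumed 70 ms round-trip
    # time needs at least 25 round trips, or about 1.75 seconds, before any
    # time to transfer the data itself is counted.
    print(min_page_load_time(num_images=10, rtt=0.070))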
Unfortunately, this stateless, one-connection-per-request design also brings in a number of inefficiencies and leaves HTTP performance dependent on network latency. For each HTTP request, a minimum of two round trips is necessary; with worst-case DNS lookups, up to five round-trip times could be needed. Spero's analysis of HTTP [10] concludes that these two Round-Trip Times (RTT) form the lower bound for the total transaction time of an HTTP request. No matter how big the network bandwidth between client and server, the network latency between them determines the minimum time to process a request.

Inline images make the problem even worse. For each page with inline images, a separate HTTP request must be made for the page and then for each unique image, so a single document could have anywhere from one to many HTTP requests associated with it. Since each request requires at least two round trips, the minimum time to load a document is (2n + 2) * RTT, where n is the number of unique images in the page.

A TCP feature called "slow-start" [11] causes inefficient use of available network bandwidth. Slow-start means that a new TCP connection first sends only a small amount of data; as acknowledgements of packets come in, the sending window gets bigger and bigger. Unfortunately, most HTTP connections are very short-lived. The most common object is a GIF file (median size 4513 bytes), so many TCP connections used for HTTP are terminated before they ever reach full throttle. While GIFs generate the most traffic, MPEG video clips generate the next most. This skewing of traffic between many small images and fewer but very large video clips presents a difficult challenge for network administrators: HTTP wants networks with both low latency and high bandwidth. Network bandwidth can be increased (often at very high cost), but there are absolute lower bounds on latency.

A final inefficiency involves TCP protocol states on the server. The TCP specification requires that a system that has closed a connection maintain connection information for four minutes [5]. The large number of short-lived connections could leave a server's connection table filled with connections in the "TIME-WAIT" state.

4.0 Proposed Solutions

The problems described above have been widely discussed, and a number of solutions have been proposed. This section discusses the pros and cons of several of them.

4.1 Be Careful and Smart with Caching

This solution [7] proposes that Web page authors, network administrators, and WWW browser authors do a number of things to minimize the number of TCP connections necessary and thus minimize transaction time. Web authors can keep inline images to a minimum and make pages that are cacheable. Network administrators can set up proxy caching servers [12], which get Web information on behalf of clients. They cache pages, so that if a page or URL has been requested before, a client can immediately get the cached copy rather than waiting to fetch it from the Internet. It is also a good idea to deploy caching DNS servers on or near the caching servers, which reduces the latencies involved in DNS lookups. Web browsers can cache pages and images as well; many browsers do just that, reducing the amount of network traffic and time needed to load a page. These ideas have a clear benefit: they can be applied immediately and effectively with existing software and protocols.
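A minimal sketch of the caching idea, whether in a browser or in a proxy, follows. The cache here is just an in-memory dictionary keyed by URL; a real cache would also honor expiration and cacheability rules, and the URL shown is a placeholder.

    import urllib.request

    # A trivial in-memory cache keyed by URL, standing in for a browser
    # or proxy cache.
    _cache = {}

    def cached_get(url):
        """Return a page from the cache if this URL has been seen before;
        otherwise fetch it once from the network and remember it."""
        if url in _cache:
            return _cache[url]                 # cache hit: no network round trips
        with urllib.request.urlopen(url) as response:
            body = response.read()             # cache miss: a full HTTP transaction
        _cache[url] = body
        return body

    # Hypothetical usage: the second call is served locally.
    # page = cached_get("http://www.example.com/index.html")
    # page = cached_get("http://www.example.com/index.html")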
There are a few drawbacks to consider, however. Caching does not work for dynamically changing data such as stock quotes (a very popular use of the WWW). Also, these strategies do not address the fundamental problems with HTTP. Items that are not cached will still be affected by network latencies and will experience the extra delay of checking the cache and going through the proxy server.

4.2 Parallel GETs

Another performance-enhancing tactic is for browsers to send out GETs for inline images without waiting for the initial GET to complete; the GETs are processed in parallel as data comes in. This is a big advantage over browsers that wait for each inline image to finish loading before requesting the next one. But like the previous idea, it does not deal with the fundamental problems of HTTP: each of the n + 1 requests needed for a document with n images still pays the two-round-trip cost of setting up a connection and exchanging the request, even if some of that cost now overlaps. In fact, parallel GETs can make matters worse for Web servers and proxy caching servers. Many server implementations spawn a process for each URL; instead of processes being created smoothly one after another, large batches of processes are created almost simultaneously. This process burstiness can place a heavy load on a server.

4.3 URNs

Uniform Resource Names (URNs) [13] are constructs under development by the Internet Engineering Task Force (IETF). They are unique, permanently assigned names for resources, and each URN maps to a number of URLs. A browser or proxy server could permanently cache a number of URNs and could potentially look for and access the URL with the smallest network latency. URNs are coming, and they will offer better caching performance. Still, for uncached URNs the HTTP protocol problems remain unsolved, and performing the URN-to-URL mapping will add latency to HTTP requests. Moreover, the infrastructure for resolving URNs is not yet available.

4.4 GETLIST and GETALL

Padmanabhan and Mogul [6] propose two new methods. GETLIST would get a list of URLs; GETALL would get all the URLs (images) inlined in a page. These two methods address the connection problem by transferring all the needed URLs over one connection. If some images were already cached, the GETLIST method would be used to get only the images needed. The methods would create longer-lived connections and reduce both the number of connections and the relative cost of setting each one up. Some drawbacks are that existing WWW servers and clients would have to be changed, and that adding these commands would make clients and servers more complex, since much more state would have to be managed.

4.5 HTTP Session Layer

Spero [14] proposes adding a session layer to HTTP in a new protocol called HTTP-NG (HTTP Next Generation). This session layer would divide a connection into different channels; control messages (HTTP requests and meta-information) would flow over a control channel. As with the previous suggestion, this proposal solves the connection problems by adding complexity and state to clients and servers. There would also be transition issues, although Spero proposes a solution using intermediary proxy servers. Other people are working on session-layer-based solutions [15].

4.6 MIME Multipart Documents

HTTP wraps data objects in the MIME [16] multimedia mail format. One suggestion for eliminating the multiple-connection problem is to transfer a document and all of its images as a single multipart MIME message, which could be done over one connection.
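The sketch below uses Python's standard email library to show what bundling a page and its images into one multipart body might look like. The function name, file names, and use of a Content-Location header are illustrative assumptions; the actual framing a Web server would use is not specified here.

    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText
    from email.mime.image import MIMEImage

    def build_multipart_page(html_text, images):
        """Bundle an HTML page and its inline images into a single
        multipart/mixed MIME body that one connection could carry."""
        bundle = MIMEMultipart("mixed")
        bundle.attach(MIMEText(html_text, "html"))
        for name, gif_bytes in images:
            part = MIMEImage(gif_bytes, _subtype="gif")    # base64-encoded, which inflates the size
            part.add_header("Content-Location", name)      # lets a client match parts to image URLs
            bundle.attach(part)
        return bundle.as_bytes()

    # Hypothetical usage: one body carrying a page and two GIFs.
    # body = build_multipart_page("<html>...</html>",
    #                             [("logo.gif", logo_bytes), ("icon.gif", icon_bytes)])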
This is an elegant solution, but like all of the proposals it has a few drawbacks. While WWW servers and browsers are already supposed to be capable of handling multipart documents, few of the popular browsers and servers actually are. Two other problems with this scheme need to be considered [17]: (1) MIME encoding will roughly double the number of bytes sent, and (2) by loading a complete document in one piece, there is no way to keep already cached images or documents from being reloaded (unlike the GETLIST/GETALL proposal).

5.0 Conclusion

HTTP and patterns of World Wide Web use create a number of challenges for network infrastructure. This paper has covered WWW traffic patterns, the issues they raise, and possible solutions to those problems. Work on these issues is ongoing, and one of the best places to monitor WWW traffic developments is on the Web itself. In particular, discussions on solving HTTP problems can be examined in the HTTP-WG mail archives (URL http://www.ics.uci.edu/pub/ietf/http/hypermail).

6.0 References

[1] T. Berners-Lee, R. Cailliau, A. Luotonen, H. Nielsen, and A. Secret. The World Wide Web. Communications of the ACM, 37(8):76-82, August 1994.

[2] Tony Rutkowski. Internet Traffic. URL ftp://ftp.isoc.gov/isoc/charts/traffic4.ppt, December 10, 1994.

[3] Tim Berners-Lee. Hypertext Transfer Protocol (HTTP). Internet Draft draft-ietf-iir-http-00.txt, IETF, November 1993. This is a working draft.

[4] T. Berners-Lee, L. Masinter, and M. McCahill. Uniform Resource Locators (URL). RFC 1738, December 1994.

[5] Jon B. Postel. Transmission Control Protocol. RFC 793, September 1981.

[6] Venkata N. Padmanabhan and Jeffrey C. Mogul. Improving HTTP Latency. Proceedings of the Second International World-Wide Web Conference, pages 995-1005, Chicago, October 1994.

[7] Jeff Sedayao. Mosaic Will Kill My Network! Proceedings of the Second International World-Wide Web Conference, pages 1029-1038, Chicago, October 1994.

[8] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034, November 1987.

[9] CompuServe, Incorporated. Graphics Interchange Format Standard. 1987.

[10] Simon E. Spero. Analysis of HTTP Performance Problems. URL http://elanor.oit.unc.edu/http-prob.html, July 1994.

[11] Van Jacobson. Congestion Avoidance and Control. Proceedings of the SIGCOMM '88 Symposium on Communications Architectures and Protocols, pages 314-329, Stanford, CA, August 1988.

[12] Kevin Altis and Ari Luotonen. World Wide Web Proxies. Proceedings of the First International World-Wide Web Conference, Geneva, April 1994.

[13] K. Sollins and L. Masinter. Functional Requirements for Uniform Resource Names. RFC 1737, December 1994.

[14] Simon E. Spero. Progress on HTTP-NG. URL http://www11.w3.org/hypertext/WWW/Protocols/HTTP-NG/http-ng-status.html.

[15] Dave Raggett. Minutes from the December 1994 San Jose IETF (HTTP BOF). URL http://www.ics.uci.edu/pub/ietf/http/minutes-SJ.txt.

[16] N. Borenstein and N. Freed. MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. RFC 1521, September 1993.

[17] Mitra. "Re: HTTP: T-T-T-Talking about MIME Generation". URL http://www.ics.uci.edu/pub/ietf/http/hypermail/.