


— 11 —
A Brief Introduction to Web Spiders and Agents


A spider, or robot as it is commonly called, is a program that automatically traverses Web space. By this, we mean that the program automatically requests a series of URLs, picks new URLs to traverse from the pages it receives, and continues traversing until some programmer-defined condition is met.

Think of these tools as incredibly fast, efficient Netscape users. Instead of manually navigating the Web, a robot can load and parse the received data and return the results to the user. For example, if you wanted to keep tabs on an online bookstore and be notified by e-mail whenever a new book by your favorite author is published, a robot could do that for you.

The benefits of spiders can be seen in their applications: searching for information, validating links, and reporting on the status of a set of pages are among the more popular uses.

As more people find more uses for the Web, more robots and spiders will appear. Right now their use has been largely limited to academic research, and the main languages for programming these tools have been C, C++, and Perl.

In this chapter, we will go over the ideas behind spider programming and the guidelines to follow; in the following chapters, we'll work through some Visual Basic programs that act as spiders.

First, let's define the different terms that fall loosely under the "agent" or "spider" category.

Terminology


As with everything related to the Internet, a lexicon has developed to describe client-side applications and server-side resources. Most of the early work on the Internet was performed at universities and military organizations. Both are notorious for clever naming schemes. This section defines some of the terms used in this and later chapters.

Web Ants—Multiple robots, usually running on different servers, that work cooperatively.

New terms for these robots are sure to pop up as they become more popular and easier to program, but for this chapter, we'll stick to robot. If you hear of, or invent, a new term, please e-mail it to me (haasch@execpc.com).

How Robots Work


The key to what makes robots work is the "Web" in the term World Wide Web. The word Web in this case refers to the interconnectedness of the URLs that exist.

Figure 11.1 is a somewhat unrealistic, but possible, spider's web. Each point where two threads of spider silk touch represents a Web page, and a single strand of silk running to the next intersection is a URL. The threads become the anchors between the pages.

Figure 11.1. The World Wide Web viewed as a "spider's web."

The one shortcoming of this analogy is that, unlike two intersections in a spider's web, where one leads to the other and vice versa (you can travel in either direction), URLs usually don't have anchors back to all of the URLs that anchor to them.

In Figure 11.2, arrows show the direction of each anchor. The arrow originates on the URL that contains the anchor and points to the URL that it is anchored to.

Figure 11.2. Web diagram with arrows showing direction of anchors.

A robot is a program that traverses these links automatically. Starting from an arbitrary number of root URLs, the program follows their anchors and repeats this process until an exit condition tells it to stop.

These exit conditions are usually the time elapsed during the traversal, the number of levels of recursion that have occurred, the number of pages loaded, or the discovery of a search match.
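To make this concrete, here is a minimal sketch of how a robot's main loop might test those conditions in Visual Basic. The constant names, the limits, and the ShouldStop function itself are assumptions made up for illustration; the main loop would simply call ShouldStop on each pass and quit when it returns True.

' Hypothetical exit-condition test for a robot's main loop. The constant names
' and limits are illustrative assumptions, not fixed rules.
Const MAX_SECONDS As Long = 300            ' stop after five minutes
Const MAX_DEPTH As Long = 3                ' stop after three levels of recursion
Const MAX_PAGES As Long = 100              ' stop after 100 pages loaded

Function ShouldStop(ByVal startTime As Single, ByVal depth As Long, _
                    ByVal pagesLoaded As Long, ByVal matchFound As Boolean) As Boolean
    ShouldStop = (Timer - startTime > MAX_SECONDS) _
              Or (depth > MAX_DEPTH) _
              Or (pagesLoaded >= MAX_PAGES) _
              Or matchFound
End Function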

Choosing the Root URL(s)


Let's consider each step of the process, beginning with choosing the root URL. Currently, there are no fewer than a dozen search engines on the Internet. Some, such as WebCrawler, function as robots that search for pages matching the text you submit to them. Others, such as Yahoo, are large, categorized databases of URLs and descriptions. These sites are visited millions of times a day and are general-purpose, broad-based tools for finding what you're looking for on the Internet.

A possible application using these mega search engines would be a client-side application that takes a user-entered keyword, submits that keyword to a group of search engines, takes their responses, and formats them uniformly. This would save the user from going to Yahoo, then to InfoSeek, then to WebCrawler, and so forth—all for the same keyword search.

Another application would start from a group of URLs pertaining to one topic, for instance, law. This group of URLs would need some external links to traverse. From there, the program would load the first user-defined set of URLs, search each for the keyword, and traverse any links until it met an exit condition.

These are just two search-type tasks that robots can perform. A few other robots that have been implemented are described later in the chapter.

Traversing the Links


A variety of algorithms describe this traversal process. Two of the most common are breadth-first and depth-first traversals.

With a URL or group of URLs to use as a starting point, let's look at traversal issues.

The fact that the World Wide Web can be viewed as nodes and links lends itself very well to graph theory and graph algorithms. The two methods of traversal we want to consider are depth-first and breadth-first searching.

Depth-First Searching

Given the group of nodes and directed anchors shown in Figure 11.3, a depth-first search would start with the root node A, then move to H, I, K, J, B, C, E, F, D, G. As you trace the path followed by this progression, you notice that it goes as deep as it can find nodes, then backs up a step, checks for any other child nodes, moves to them, and repeats for all paths. For this graph, and many others, there is no single correct depth-first search; another valid search would start again at A and then move through the following path, backing up as necessary: B, D, G, C, E, F, H, I, K, J.

Figure 11.3. Navigating a depth-first search.
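As a sketch of the idea (not the book's code), here is how a depth-first traversal of a small directed graph might look in Visual Basic. The adjacency list, node letters, and edges below are assumptions made up to stand in for the figure; each node is "processed" with a Debug.Print where a robot would load the page.

' Depth-first traversal sketch. The adjacency list below is a made-up stand-in
' for the graph in the figure; the node letters and edges are assumptions.
Dim dfsVisited As Object                   ' Scripting.Dictionary used as a "seen" set

Sub DepthFirstDemo()
    Dim graph As Object
    Set graph = CreateObject("Scripting.Dictionary")
    graph.Add "A", Split("B,H", ",")       ' A anchors to B and H
    graph.Add "B", Split("C,D", ",")
    graph.Add "H", Split("I,J", ",")
    ' ... remaining nodes of the figure would be added the same way

    Set dfsVisited = CreateObject("Scripting.Dictionary")
    VisitNode graph, "A"
End Sub

Sub VisitNode(graph As Object, ByVal node As String)
    If dfsVisited.Exists(node) Then Exit Sub
    dfsVisited.Add node, True
    Debug.Print node                       ' "process" the node; a robot would load the page

    Dim child As Variant
    If graph.Exists(node) Then
        For Each child In graph(node)
            VisitNode graph, CStr(child)   ' go as deep as possible before backing up
        Next child
    End If
End Sub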

Breadth-First Searching

Consider the graph in Figure 11.3: A is the root node, with B and H as its children; C, D, I, and J are its grandchildren; and E, F, G, and K are its great-grandchildren. A breadth-first search is done by first searching the root, then all of its children, then all of its grandchildren, and finally its great-grandchildren. A possible breadth-first search path would be A, B, H, C, D, I, J, E, F, G, K. As with the depth-first search, there is usually no one correct way of navigating the links in a breadth-first search; another possible path could be A, H, B, C, D, I, J, K, G, F, E.
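Here is a comparable breadth-first sketch. A Collection is used as a simple first-in, first-out queue; as before, the graph is a made-up stand-in for the figure. Swapping the queue for a stack (removing from the end instead of the front) would turn the same loop into a depth-first search.

' Breadth-first traversal sketch using a Collection as a simple FIFO queue.
' The graph here is a made-up stand-in for the figure, not the figure itself.
Sub BreadthFirstDemo()
    Dim graph As Object, seen As Object
    Dim queue As New Collection
    Dim node As String, child As Variant

    Set graph = CreateObject("Scripting.Dictionary")
    graph.Add "A", Split("B,H", ",")
    graph.Add "B", Split("C,D", ",")
    graph.Add "H", Split("I,J", ",")
    ' ... remaining nodes of the figure would be added the same way

    Set seen = CreateObject("Scripting.Dictionary")
    queue.Add "A"
    seen.Add "A", True

    Do While queue.Count > 0
        node = queue(1)                    ' take the oldest entry first
        queue.Remove 1
        Debug.Print node                   ' in a robot: request and parse this page

        If graph.Exists(node) Then
            For Each child In graph(node)
                If Not seen.Exists(CStr(child)) Then
                    seen.Add CStr(child), True
                    queue.Add CStr(child)  ' children line up behind pages already queued
                End If
            Next child
        End If
    Loop
End Sub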

In Chapter 14, "WebSearcher: A Simple Search Tool," you'll implement a Web search tool that uses a breadth-first searching method of traversing Web hyperlinks.

Traversal Design Considerations

It's easy to get caught up in the wonderful way a robot works: pulling down pages, searching, getting anchors, searching some more, pulling more pages, and on and on. One important consideration, though, is the Webmaster whose server you're hitting.

As you know, the ability to serve Web pages depends on three factors: the amount of bandwidth, the speed of the Web server, and the amount of data being transferred. So, when your robot fires off a few hundred requests per minute at one Web server, that server will definitely feel the effects.

Consider the diagram in Figure 11.4. A root page is indicated by the single uppermost block in the diagram. Having the robot follow the anchors in that root page would generate requests for its ten children. If the robot is programmed to recursively traverse its children's links, you're up to its 100 grandchildren. This cycle could continue as long as there are more anchors to traverse, but the performance impact if all of these pages reside on one Web server would be extreme, to say the least.

Figure 11.4. Illustration of link recursion.

An easy solution to this problem is to slow down the requests. Although there are no hard-and-fast numbers about how slow, try the robot against a site where you are the administrator, or know the administrator, and observe the results. Then slow it down by increasing the time between requests until there is no very noticeable degradation in server performance.
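One simple way to build in that delay is to pause between requests. The sketch below uses the Win32 Sleep call; the two-second interval is an arbitrary assumption to be tuned against a server you are allowed to test, and the loop body is only a stand-in for the actual request.

' Politeness-delay sketch. The two-second interval is an arbitrary assumption;
' tune it against a server whose administrator you know.
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Sub PoliteFetchAll(urls As Collection)
    Dim url As Variant
    For Each url In urls
        Debug.Print "Requesting "; url     ' stand-in for the actual HTTP request
        ' ... issue the request and parse the page here ...
        Sleep 2000                         ' pause before the next request
        DoEvents                           ' keep the user interface responsive
    Next url
End Sub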

Another feature that makes breadth-first searching attractive is that your robot tends to hit alternating servers during its operation, rather than concentrating on the depth within one server.

Potential for Problems


Spiders are a controversial topic among Web site administrators, mostly because unwary programmers have made spider programs that make too many requests in too short a time span. This is called rapid fire. At best it will slow the performance of the server, and the worst case scenario would be a Web server crash.

This, combined with the fact that more and more sites are adding forms for user feedback (which a robot has no need to look at, much less submit blank), CGI scripts with side effects (voting, submitting forms), and deep nesting and interlinking of pages, has brought about a movement to give site administrators the ability to limit or exclude robots from their sites.

The fact that robots can and have caused server slowdowns and crashes makes them a controversial tool. Some people love them for their information-searching capabilities; some people hate them because of the performance hits their Web servers take.

Due to this potential for problems, a standard for robot exclusion is under development. This document is summarized here.

Robot Exclusion


http://web.nexor.co.uk/mak/doc/robots/norobots.html

The evolving standard for robot exclusion is described in a document found at http://web.nexor.co.uk/mak/doc/robots/norobots.html. Currently, the standard is to place a file called robots.txt in the server's root directory so it is accessible by the URL http://www.yourwebsite.com/robots.txt if the server's host name is www.yourwebsite.com. The robot (if it conforms to this evolving standard) reads the file and checks to see whether it is excluded from or limited at the site.

The robots.txt file is structured as a series of records that indicate the HTTP user agent of the robot and any number of disallow fields that limit or exclude that user agent.

Entries are case sensitive, and comments can be added using the Bourne shell style # <comment> format.

Listing 11.1 is an example of a robots.txt file.

Listing 11.1. An example of a robot exclusion file.

User-agent: *
Disallow: /temp
Disallow: /bin
Disallow: /usr
User-agent: InfoSeek Robot
Disallow: /

In this example you see two records. In the first, the * indicates that the conditions apply to all robots, disallowing them from the /temp, /bin, and /usr directories. The second record disallows the InfoSeek robot from the entire Web site.

The number of record entries and disallow statements per record is up to the site administrator.

Compliance with this standard for exclusion is not mandatory, but unless the robot author wants his e-mail box filled with flames from angry Webmasters and site administrators, compliance is recommended.
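As a rough sketch of what honoring the file involves, the following fetches robots.txt and checks whether a given path is disallowed for a given user agent. It is deliberately simplified (Disallow values are treated as plain prefixes, multiple User-agent lines per record are not handled, and error handling is minimal), and the function and variable names are assumptions, not part of the standard.

' Rough robots.txt check. A simplified sketch, not a complete parser.
Function IsDisallowed(ByVal siteRoot As String, ByVal path As String, _
                      ByVal agentName As String) As Boolean
    Dim http As Object, lines As Variant, i As Long
    Dim lineText As String, applies As Boolean, ruleValue As String

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", siteRoot & "/robots.txt", False
    http.Send
    If http.Status <> 200 Then Exit Function         ' no robots.txt: nothing is disallowed

    lines = Split(Replace(http.responseText, vbCr, ""), vbLf)
    For i = LBound(lines) To UBound(lines)
        lineText = Trim$(lines(i))
        If InStr(lineText, "#") > 0 Then lineText = Trim$(Left$(lineText, InStr(lineText, "#") - 1))

        If LCase$(Left$(lineText, 11)) = "user-agent:" Then
            ruleValue = Trim$(Mid$(lineText, 12))
            applies = (ruleValue = "*") Or (InStr(1, agentName, ruleValue, vbTextCompare) > 0)
        ElseIf LCase$(Left$(lineText, 9)) = "disallow:" And applies Then
            ruleValue = Trim$(Mid$(lineText, 10))
            If Len(ruleValue) > 0 And ruleValue = Left$(path, Len(ruleValue)) Then
                IsDisallowed = True                  ' path falls under a Disallow prefix
                Exit Function
            End If
        End If
    Next i
End Function

A polite robot would call something like IsDisallowed("http://www.yourwebsite.com", "/temp/page.html", "MyRobot/1.0") before requesting a page and skip the page if the function returns True.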

Basic Algorithms for Search Robots


A popular application of spiders is searching for information. WebCrawler and InfoSeek are two search engines that present the user with results gathered by a spider-type program.

Algorithm for a Search Engine


A search engine starts off with a user query, usually a keyword or two. The server-side program takes the query and hands it to the robot. The robot starts with a set of starter pages and parses them for anchors and the keywords. It then traverses the anchors, looks for the keywords there, and repeats this process recursively until a maximum depth of anchors is reached, a time-out period expires, or a specified number of matches is met.

The titles of matching pages are returned to the user as hypertext links, often ranked by the number of occurrences of the query words.

Search engines such as WebCrawler actually assign a rank to each page they find for you. Most of these ranking systems assign a higher rank for a higher incidence of keywords found, so if your keyword is "HTML" and a document contains the string "HTML" 100 times, that document would be ranked higher than a different document containing the string "HTML" only 40 times.
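As a minimal sketch, this incidence-style ranking boils down to counting occurrences of the keyword in the page text; the function name below is an assumption.

' Count how many times a keyword appears in a page. A crude count of this kind
' is all that simple incidence-based ranking amounts to.
Function KeywordRank(ByVal pageText As String, ByVal keyword As String) As Long
    Dim pos As Long
    pos = InStr(1, pageText, keyword, vbTextCompare)
    Do While pos > 0
        KeywordRank = KeywordRank + 1
        pos = InStr(pos + Len(keyword), pageText, keyword, vbTextCompare)
    Loop
End Function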

A searching spider can be implemented with these steps (a sketch tying them together follows the list):

  1. With a database of root pages, search for the keyword. If the keyword is found, rank the page on the number of occurrences of that keyword, and add it to a temporary list. Check for time-out and number of matches requirements; keep going if necessary; otherwise, wrap the results in HTML and return the results to the user.

  2. Go to the anchors of the root pages; load their children; again search for the keyword. Again, if the keyword is found, add it and its rank to the temporary list. Check for time-out, search depth, and number of matches; repeat this step as many times as necessary, continually building the list of matches.

  3. After one of the conditions is met, the list of URLs, their titles, and their rankings according to number of keywords per page can be translated to HTML. After this page is built, it can be returned to the user as the search results.
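The following is a minimal sketch of how steps 1 through 3 fit together. Everything in it is an assumption for illustration: the helper names (FetchPage, CollectAnchors), the naive href scanner, and the limits used in the exit checks. It reuses the KeywordRank function from the ranking sketch above, and it leaves relative-URL handling, the robots.txt check, politeness delays, and the HTML formatting of the results out for brevity; Chapter 14 builds a working search tool.

' Sketch of steps 1 through 3. Helper names, limits, and the very naive anchor
' scanner are illustrative assumptions; KeywordRank is the earlier ranking sketch.
Function FetchPage(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False
    http.Send
    If http.Status = 200 Then FetchPage = http.responseText
End Function

Sub CollectAnchors(ByVal html As String, pending As Collection)
    ' Naive href scanner: only finds href="..." and does not resolve relative URLs.
    Dim pos As Long, quoteEnd As Long
    pos = InStr(1, html, "href=""", vbTextCompare)
    Do While pos > 0
        quoteEnd = InStr(pos + 6, html, """")
        If quoteEnd = 0 Then Exit Do
        pending.Add Mid$(html, pos + 6, quoteEnd - pos - 6)
        pos = InStr(quoteEnd, html, "href=""", vbTextCompare)
    Loop
End Sub

Sub SearchSpider(rootUrls As Collection, ByVal keyword As String)
    Dim pending As New Collection, results As New Collection
    Dim seen As Object, url As Variant
    Dim html As String, rank As Long, matches As Long, startTime As Single

    Set seen = CreateObject("Scripting.Dictionary")
    For Each url In rootUrls: pending.Add url: Next url
    startTime = Timer

    Do While pending.Count > 0
        url = pending(1)
        pending.Remove 1
        If Not seen.Exists(CStr(url)) Then
            seen.Add CStr(url), True
            html = FetchPage(CStr(url))                ' steps 1 and 2: load the page
            rank = KeywordRank(html, keyword)          ' rank by keyword occurrences
            If rank > 0 Then
                results.Add url & " (rank " & rank & ")"
                matches = matches + 1
            End If
            CollectAnchors html, pending               ' queue the children for later passes
        End If
        ' Exit checks: 25 matches or two minutes are arbitrary assumed limits.
        If matches >= 25 Or Timer - startTime > 120 Then Exit Do
    Loop
    ' Step 3: wrap the contents of results in HTML and return it to the user.
End Sub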



A few tips: First, don't forget to look for the robots.txt file and search for the User-agent field that matches your robot. With this, you know where and where not to aim your robot.

Another good idea, if your robot takes in a number of root nodes that access different servers, is to use a round-robin schedule to increase the time between successive hits to any one server. A hit every few seconds, or better yet, every few minutes, is much more polite to the owner of the server than making a few thousand requests in a minute or two.
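One way to sketch that round-robin idea is to bucket the pending URLs by host name and take one URL from each host in turn, so no single server sees back-to-back requests. The helper names and the simplified host parsing below are assumptions for illustration.

' Round-robin ordering sketch: interleave requests so consecutive hits go to
' different servers. The host-extraction helper is a simplified assumption.
Function GetHost(ByVal url As String) As String
    Dim rest As String, slashPos As Long
    rest = Mid$(url, InStr(url, "//") + 2)             ' assumes an absolute http:// URL
    slashPos = InStr(rest, "/")
    If slashPos = 0 Then GetHost = rest Else GetHost = Left$(rest, slashPos - 1)
End Function

Function RoundRobinOrder(urls As Collection) As Collection
    Dim byHost As Object, url As Variant, hostName As Variant
    Dim ordered As New Collection, remaining As Long

    Set byHost = CreateObject("Scripting.Dictionary")
    For Each url In urls                               ' bucket the URLs by host
        hostName = GetHost(CStr(url))
        If Not byHost.Exists(hostName) Then byHost.Add hostName, New Collection
        byHost(hostName).Add url
        remaining = remaining + 1
    Next url

    Do While remaining > 0                             ' take one URL per host per pass
        For Each hostName In byHost.Keys
            If byHost(hostName).Count > 0 Then
                ordered.Add byHost(hostName)(1)
                byHost(hostName).Remove 1
                remaining = remaining - 1
            End If
        Next hostName
    Loop
    Set RoundRobinOrder = ordered
End Function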


With this and a good parsing program, you can build a Visual Basic front end that enables the user to specify the root URLs, the number of levels to recurse, and the information the user seeks.

Behind this front end, you need to set up a system for requesting, parsing, and analyzing pages for content, as well as a system to determine whether the search, time-out, and depth conditions and restrictions have been met. Tack the results onto a textbox or a browser custom control, add a status bar indicating the progress of the search, and you've built a full-featured Web robot.

See Chapter 14 for a Visual Basic implementation of a simple Web searching tool.

Generating a Report on Invalid Links


A continual headache for Webmasters and site maintainers is the constant moving and deleting of Web resources. The interconnectedness of the World Wide Web encourages sites to link to other sites via hypertext references.

One problem with these external links is that when the page pointed to by an external link is moved, the person who linked to it is not notified. The result is a 404 Not Found error or a message similar to "Thanks for visiting, but this page has moved to http://www.newurl.com".

This is very similar to an old address book of people you occasionally call. If the people move and change their phone numbers, you get that familiar phone company voice telling you the number has changed or has been disconnected.

Currently, there is no method for a page that changes location or disappears to notify all linking pages of the change. The lack of this facility causes these all-too-often error and "We have moved here..." messages.

The idea of finding out what percentage of links result in this behavior is somewhat difficult to implement. Here is a basic plan of attack for designing this application (a sketch follows the steps):

  1. Choose a root URL that contains at least one anchor.

  2. Select the number of recursions you want the robot to perform.

  3. Retrieve the root URL.

  4. Check the information within the URL. If it returns error code 404 (Not Found), increment a variable that contains the number of bad URLs. Otherwise, increment a variable that contains the number of valid URLs. Also increment the number of recursions that have occurred.

  5. If the number of recursions is 0, then the percentage is either 0 or 100, depending on the validity of the root URL. In either case, report back to the user and exit.

  6. If the number of recursions is greater than zero, parse the URLs of the anchors within the previously retrieved page. Retrieve these URLs, collecting their anchors also. At this step, you also need to increment the counters that store the number of invalid and valid links, and the number of recursions that have taken place.

  7. Check the number of recursions. If the limit has been met, report to the user and then exit. Otherwise, retrieve the child URLs of the URLs that were retrieved in step 6.

  8. Continue the process until the number of recursions has been met.
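Here is a minimal sketch of the status check at the heart of steps 4 through 8. It checks a single batch of URLs and reports the percentage of bad links; the recursive gathering of child URLs would reuse the traversal and anchor-parsing sketches shown earlier, and the names below are assumptions.

' Status-check sketch for the link report: check one batch of URLs and compute
' the percentage of bad links. Names and structure are illustrative assumptions.
Sub ReportInvalidLinks(urls As Collection)
    Dim http As Object, url As Variant
    Dim goodCount As Long, badCount As Long

    On Error Resume Next                     ' an unreachable host raises a runtime error
    For Each url In urls
        Err.Clear
        Set http = CreateObject("MSXML2.XMLHTTP")
        http.Open "GET", CStr(url), False
        http.Send
        If Err.Number <> 0 Then
            badCount = badCount + 1          ' request failed outright
        ElseIf http.Status = 404 Then
            badCount = badCount + 1          ' 404 Not Found: a dead link
        Else
            goodCount = goodCount + 1        ' anything else counts as reachable here
        End If
    Next url
    On Error GoTo 0

    If goodCount + badCount > 0 Then
        Debug.Print "Invalid links: "; Format(badCount / (goodCount + badCount), "0.0%")
    End If
End Sub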

Again, with any spider, it is suggested that the standard for robot exclusion be honored and that servers be hit in a round-robin fashion, so as not to overload any single Web server.

Examples of Existing Web Robots


Although there are no known robots written in Visual Basic currently running regularly on the Web, they will pop up. Currently, most Web robots are being written in Perl, C, and C++.

WebCrawler


http://www.webcrawler.com

WebCrawler at http://www.webcrawler.com (see Figure 11.5) is one of the major search engines on the World Wide Web. It started in 1994 as a project at the University of Washington and is currently hosted by America Online. You can tell whether your server has been hit by this search engine by checking the Web server's log files for the HTTP user agent WebCrawler/2.0 libwww/3.0, run from a machine named spidey.webcrawler.com.

Figure 11.5. Opening screen for WebCrawler.

MOMSpider


http://www.ics.uci.edu/WebSoft/MOMspider/

Multi-Owner Maintenance spider (MOMSpider) is a project from Roy Fielding at the University of California-Irvine. The MOMSpider was written to deal with the maintenance of Web sites. MOMSpider and its accompanying documentation address the important issue of continually changing page locations and URLs as it relates to managing a Web site. In other words, by utilizing a program such as MOMSpider, you can help eliminate users seeing the File Not Found message when clicking a link on your Web page. Documentation and the MOMSpider program itself are available at http://www.ics.uci.edu/WebSoft/MOMspider/.

Summary


This chapter demonstrated that there are many uses for the Web other than just browsing. The ability to use a World Wide network full of information is a massive opportunity. Be it searching, validating links, or reporting on the status of a set of pages, robots can make these tasks easier for the user of the World Wide Web.

What's Next?


From here, you can go on to Chapter 16, "Link Checker: A Spider That Checks for Broken Links." There, we'll go through the design and construction of a link-verifying robot. Then, in Chapter 17 you'll build a client-side Web search tool.
