A spider, or robot as these programs are commonly called, is a program that automatically traverses Web space. By this, we mean that the program automatically requests a series of URLs, picks new URLs to traverse from the pages it receives, and continues traversing until some programmer-defined condition is met.
Consider these tools as incredibly fast, efficient Netscape users. Instead of manually navigating the Web, these robots load and parse the data they receive and return the results to the user. For example, if you want to keep tabs on an online bookstore and be notified by e-mail whenever a new book by your favorite author is published, a robot can do that for you.
The benefits of spiders can be seen in their applications: searching the Web for information, validating hypertext links, and monitoring a set of pages and reporting on their status are a few of the more popular uses.
As more people find more uses for the Web, more robots and spiders will appear. Right now their use has been limited mostly to academic research, and the main languages for programming these tools have been C, C++, and Perl.
In this chapter, we will go over the ideas behind spider programming, the guidelines to follow, and in the following chapters, we'll go through some Visual Basic programs that act as spiders.
First, let's define the different terms that fall loosely under the "agent" or "spider" category.
As with everything related to the Internet, a lexicon has developed to describe client-side applications and server-side resources. Most of the early work on the Internet was performed at universities and military organizations. Both are notorious for clever naming schemes. This section defines some of the terms used in this and later chapters.
Robot: The original term for a program that recursively traverses WWW pages, looking for information. More or less, all spiders, agents, and crawlers can be called robots.
Spiders: The media-friendly term for robot.
Agent: Another term for the same concept, although several derivatives of agent have appeared: intelligent agents, autonomous agents, and so forth.
Worms: A worm formally meant a program that would replicate and distribute itself across a network. To my knowledge, WWW worms are nothing like that; the original developers just chose a bad term to describe their robot.
Web Crawlers: There is a search engine called WebCrawler (http://www.webcrawler.com), but some people use the term Web crawler, or just crawler, to describe the actions of robots.
Web Ants: Multiple robots, usually running on different servers, that work cooperatively.
New terms for these robots are sure to pop up as they become more popular and easier to program, but for this chapter, we'll stick to robot. If you hear of, or invent a new term, please e-mail it to me (haasch@execpc.com).
The key to what makes robots work is the 'Web' in the term World Wide Web. The word Web in this case refers to the interconnectedness of the URLs that make it up.
Figure 11.1 shows a somewhat unrealistic, but possible, spider's web. Each point where two threads of spider silk touch represents a Web page, and a single strand of silk running to the next intersection is a URL; the threads act as anchors between the pages.
Figure 11.1. The World Wide Web viewed as a "spider's web."
The one shortcoming of this analogy is that, unlike two intersections in a spider's web, between which you can travel in either direction, URLs usually don't have anchors back to all the URLs that anchor to them.
In Figure 11.2, arrows show the direction of each anchor. The arrow originates on the URL that contains the anchor and points to the URL that it is anchored to.
Figure 11.2. Web diagram with arrows showing direction of anchors.
A robot is a program that traverses these links automatically. Taking an arbitrary number of root URLs, the program follows the anchors and repeats this process until it is told by an exit condition to stop.
These exit conditions are usually the time elapsed during the traversal, number of levels of recursion that have occurred, number of pages loaded, or the discovery of a search parameter.
Let's consider each step of the process, starting with choosing the root URL. Currently, no fewer than a dozen search engines exist on the Internet. Some, such as WebCrawler, function as robots that search for pages matching the text you submit to them. Others, such as Yahoo, are large, categorized databases of URLs with descriptions. These sites are visited millions of times a day and are general-purpose, broad-based tools for finding what you're looking for on the Internet.
A possible application using these mega search engines would be a client-side application that takes a user-entered keyword, submits that keyword to a group of search engines, takes their responses, and formats them uniformly. This would save the user from going to Yahoo, then to InfoSeek, then to WebCrawler, and so forth, all for the same keyword search.
Another application would start with a group of URLs pertaining to one topic, for instance, law. This group of URLs would need some external links to traverse. From there, the program would load the user-defined set of root URLs, search them for the keyword, and traverse any links until it met an exit condition.
These are just two search-type tasks that robots can perform. A few other robots that have been implemented are described later in the chapter.
A variety of algorithms describe this traversal process. Two of the most common are breadth-first and depth-first traversals.
With a URL or group of URLs to use as a starting point, let's look at traversal issues.
The fact that the World Wide Web can be viewed as nodes and links lends itself very well to graph theory and graphing algorithms. The two methods of traversal we want to consider are depth-first and breadth-first searching.
Given the group of nodes and directed anchors shown in Figure 11.3, a depth-first search would start with the root node A, then move to H, I, K, J, B, C, E, F, D, G. As you trace the path followed by this progression, you notice that it goes as deep as it can find nodes, then backs up a step, checks for any other child nodes, moves to them, and repeats this for all paths. For this graph, as for many graphs, there is no single correct depth-first search; another valid search would start again at A and then move through the following path, backing up as necessary: B, D, G, C, E, F, H, I, K, J.
Figure 11.3. Navigating a depth-first search.
Consider the graph in Figure 11.3: A is the root node, with B and H as its children; C, D, I, and J are its grandchildren; and E, F, G, and K are its great-grandchildren. A breadth-first search first visits the root, then all of its children, then all of its grandchildren, and finally its great-grandchildren. A possible breadth-first search path would be A, B, H, C, D, I, J, E, F, G, K. As with the depth-first search, there usually is no one correct way of navigating the links in a breadth-first search. Another possible path could be A, H, B, C, D, I, J, K, G, F, E.
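To make the breadth-first idea concrete, here is a minimal Visual Basic sketch of such a traversal loop, using a Collection as a first-in, first-out queue. ExtractAnchors is a hypothetical helper standing in for whatever code requests a page and pulls out its anchors; everything else is standard Visual Basic.

' A minimal sketch of a breadth-first traversal loop.
' ExtractAnchors() is a hypothetical helper that requests a page and
' returns a Collection of the URLs anchored in it.
Private Sub BreadthFirstSearch(ByVal rootURL As String, ByVal maxPages As Long)
    Dim queue As New Collection    ' URLs waiting to be visited (FIFO)
    Dim visited As New Collection  ' URLs already requested
    Dim currentURL As String
    Dim anchors As Collection
    Dim i As Long

    queue.Add rootURL

    Do While queue.Count > 0 And visited.Count < maxPages
        ' Take the oldest URL off the front of the queue.
        currentURL = queue(1)
        queue.Remove 1

        If Not AlreadyVisited(visited, currentURL) Then
            visited.Add currentURL

            ' Request the page and pull out its anchors.
            Set anchors = ExtractAnchors(currentURL)

            ' New anchors go to the back of the queue, so every page at
            ' one depth is visited before any page at the next depth.
            For i = 1 To anchors.Count
                queue.Add anchors(i)
            Next i
        End If
    Loop
End Sub

Private Function AlreadyVisited(ByVal visited As Collection, _
                                ByVal url As String) As Boolean
    Dim i As Long
    For i = 1 To visited.Count
        If StrComp(visited(i), url, vbTextCompare) = 0 Then
            AlreadyVisited = True
            Exit Function
        End If
    Next i
    AlreadyVisited = False
End Function

Because new anchors join the back of the queue, every page at one depth is visited before any page at the next depth. Changing that one line to insert new anchors at the front of the Collection (queue.Add anchors(i), , 1) would turn the same loop into a depth-first traversal.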
In Chapter 14, "WebSearcher: A Simple Search Tool," you'll implement a Web search tool that uses a breadth-first searching method of traversing Web hyperlinks.
It's easy to get caught up in the wonderful way a robot works: pulling down pages, searching, getting anchors, searching some more, pulling more pages, and on and on. One important thing to keep in mind is consideration for the Webmaster whose server you're hitting.
As you know, the ability to serve Web pages depends on three factors: amount of bandwidth, speed of the Web server, and the amount of data being transferred. So, when your robot rifles off a few hundred requests per minute at one Web server, the Web server will definitely feel the effects.
Consider the diagram in Figure 11.4. A root page is indicated by the single uppermost block in the diagram. Having the robot follow the anchors in that root page would generate requests for its ten children. If the robot is programmed to recursively traverse its children's links, you're up to its 100 grandchildren now. This cycle could continue as long as there are more anchors to traverse, but the performance impact if all of these pages reside on one Web server would be extreme, to say the least.
Figure 11.4. Illustration of link recursion.
An easy solution to this problem is to slow down the requests. Although there are no hard and fast numbers for how slow, try the robot against a site where you are the administrator, or know the administrator, and watch the results. Then slow the robot down by increasing the time between requests until there is no noticeable degradation in server performance.
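By way of illustration, one simple way to pace a Visual Basic robot is a wait routine called between requests; this is only a sketch, and the right delay is whatever your testing against a friendly server shows is tolerable.

' Pause between requests while letting the application stay responsive.
' Timer returns the number of seconds elapsed since midnight, so this
' simple version can end early if the clock rolls past midnight.
Private Sub PoliteWait(ByVal secondsBetweenRequests As Long)
    Dim finishTime As Double
    finishTime = Timer + secondsBetweenRequests

    Do While Timer < finishTime
        DoEvents   ' yield so the rest of the system keeps running
    Loop
End Sub

Calling PoliteWait before each request, with a delay of several seconds or more, keeps the robot from hammering any one server.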
Another feature that makes breadth-first searching attractive is that the robot tends to hit alternate servers as it runs, rather than concentrating its requests deep within one server.
Spiders are a controversial topic among Web site administrators, mostly because unwary programmers have made spider programs that make too many requests in too short a time span. This is called rapid fire. At best it will slow the performance of the server, and the worst case scenario would be a Web server crash.
This, combined with the fact that more and more sites are adding forms for user feedback (which a robot wouldn't need to look at, much less submit blank), CGI scripts with side effects (voting, submitting forms), and deep nesting and interlinking of pages, has brought about a movement to give site administrators the ability to limit or exclude robots from their sites.
The fact that robots can and have caused server slowdowns and crashes makes them a controversial tool. Some people love them for their information searching capabilities; some people hate them because of the performance hits' effects on their Web servers.
Due to this potential for problems, a standard for robot exclusion is under development. This document is summarized here.
http://web.nexor.co.uk/mak/doc/robots/norobots.html
As an evolving standard for robot exclusion, this document can be found at http://web.nexor.co.uk/mak/doc/robots/norobots.html. Currently, the standard is to place a file called robots.txt in the server's root directory so it is accessible by the URL http://www.yourwebsite.com/robots.txt if the server's host name is www.yourwebsite.com. The robot (if it conforms to this evolving standard) will read the file and check to see if it is excluded from or limited at the site.
The robots.txt file is structured as a series of records that indicate the HTTP user agent of the robot and any number of disallow fields that limit or exclude that user agent.
Entries are case sensitive, and comments can be added using the Bourne shell style # <comment> format.
Listing 12.1 is an example of a robots.txt file.
Listing 12.1. An example of a robot exclusion file.
User-agent: *
Disallow: /temp
Disallow: /bin
Disallow: /usr

User-agent: InfoSeek Robot
Disallow: /
In this example you see two records. In the first, the * indicates that the conditions that follow apply to all robots, and the record disallows them from the /temp, /bin, and /usr directories. The second record disallows the InfoSeek robot from the entire Web site.
The number of record entries and disallow statements per record is up to the site administrator.
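To show how a robot might honor these records, here is a minimal Visual Basic sketch of the check. It assumes the robots.txt body has already been fetched into a string with CRLF line breaks, and it simplifies the rules: a record applies if its User-agent value is * or appears in the robot's own user-agent string, and a path is off limits if it begins with any Disallow value from an applicable record.

' Returns True if the given path is disallowed for the given robot.
' robotsText holds the body of the robots.txt file already fetched
' from the server; urlPath is the path part of a URL, such as
' "/temp/index.html". Edge cases in the standard are simplified here.
Private Function IsDisallowed(ByVal robotsText As String, _
                              ByVal userAgent As String, _
                              ByVal urlPath As String) As Boolean
    Dim pos As Long, nextBreak As Long
    Dim lineText As String
    Dim recordApplies As Boolean
    Dim agentValue As String, pathValue As String

    IsDisallowed = False
    pos = 1

    Do While pos <= Len(robotsText)
        ' Pull the next line out of the file body.
        nextBreak = InStr(pos, robotsText, vbCrLf)
        If nextBreak = 0 Then nextBreak = Len(robotsText) + 1
        lineText = Trim$(Mid$(robotsText, pos, nextBreak - pos))
        pos = nextBreak + 2

        ' Strip a trailing # comment, if any.
        If InStr(lineText, "#") > 0 Then
            lineText = Trim$(Left$(lineText, InStr(lineText, "#") - 1))
        End If

        If LCase$(Left$(lineText, 11)) = "user-agent:" Then
            agentValue = Trim$(Mid$(lineText, 12))
            ' The record applies to us if it names our agent or uses *.
            recordApplies = (agentValue = "*") Or _
                (InStr(1, userAgent, agentValue, vbTextCompare) > 0)
        ElseIf LCase$(Left$(lineText, 9)) = "disallow:" And recordApplies Then
            pathValue = Trim$(Mid$(lineText, 10))
            ' An empty Disallow value excludes nothing; otherwise the
            ' path is off limits if it starts with the Disallow value.
            If pathValue <> "" Then
                If InStr(1, urlPath, pathValue) = 1 Then
                    IsDisallowed = True
                    Exit Function
                End If
            End If
        End If
    Loop
End Function

A robot would call this once per URL before requesting it. For the sample file in Listing 12.1, IsDisallowed(fileBody, "MyRobot/1.0", "/temp/index.html") would return True; fileBody and MyRobot/1.0 are just placeholder names for this example.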
Compliance with this standard for exclusion is not mandatory, but unless the robot author wants his e-mail box filled with flames from angry Webmasters and site administrators, compliance is recommended.
A popular use for spiders is searching for information. WebCrawler and InfoSeek are two search engines that return to the user results gathered by a spider-type program.
A search engine starts with a user query, usually a keyword or two. The server-side program takes the query and sends it to the robot. The robot starts with a set of starter pages, parses them for anchors and the keywords, then traverses the anchors, looks for the keywords there, and repeats this process recursively until a maximum depth of anchors is reached, a time-out period expires, or a specified number of matches is found.
The titles of the matching pages are returned to the user as hypertext links, often ranked by the occurrence of the query words.
Search engines, such as WebCrawler, actually assign a rank to each page they find for you. Most of these ranking systems assign a higher rank for a higher incidence of keywords found, so if your keyword is "HTML" and a document has the string "HTML" in it 100 times, that document would be ranked higher than a different document containing the string "HTML" only 40 times.
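As a simple illustration of that kind of ranking, the following Visual Basic sketch counts how many times a keyword appears in the text of a page; the count itself can serve as the page's score. (The case-insensitive comparison is an assumption; a real engine may treat case differently.)

' Count the occurrences of keyword in documentText and use the count
' as the document's score.
Private Function KeywordScore(ByVal documentText As String, _
                              ByVal keyword As String) As Long
    Dim pos As Long

    KeywordScore = 0
    If Len(keyword) = 0 Then Exit Function

    pos = InStr(1, documentText, keyword, vbTextCompare)
    Do While pos > 0
        KeywordScore = KeywordScore + 1
        pos = InStr(pos + Len(keyword), documentText, keyword, vbTextCompare)
    Loop
End Function

By this measure, a page for which KeywordScore returns 100 outranks one for which it returns 40, matching the example above.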
A searching spider can be implemented in a few basic steps: choose the root URLs, check each server's robots.txt file, request the pages and parse them for anchors and the keywords, traverse the anchors, and repeat until an exit condition is met.
A few tips: First, don't forget to look for the robots.txt file, and search it for the user-agent field that matches your robot. With this, you know where and where not to aim your robot.
Another good idea, if the robot takes in a number of root nodes that access different servers, is to use a round-robin type schedule to increase the time between successive hits to any one server. A hit every few seconds, or better yet, every few minutes, is much more polite to the owner of the server than making a few thousand requests in a minute or two.
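Here is one way such a schedule might look in Visual Basic. It is only a sketch, and the idea of keeping a separate Collection of pending URLs for each host is an assumption about how the robot organizes its work.

' Index of the server whose queue should be drawn from next.
Private nextServer As Long

' Each element of serverQueues is itself a Collection holding the
' pending URLs for one host. The function hands back the next URL,
' cycling through the hosts so no single server gets hit twice in a row.
Private Function NextURLRoundRobin(ByVal serverQueues As Collection) As String
    Dim attempts As Long
    Dim oneQueue As Collection

    NextURLRoundRobin = ""
    If serverQueues.Count = 0 Then Exit Function

    For attempts = 1 To serverQueues.Count
        nextServer = nextServer + 1
        If nextServer > serverQueues.Count Then nextServer = 1

        Set oneQueue = serverQueues(nextServer)
        If oneQueue.Count > 0 Then
            NextURLRoundRobin = oneQueue(1)   ' oldest URL for this host
            oneQueue.Remove 1
            Exit Function
        End If
    Next attempts
End Function

Combined with a pause like the one shown earlier, this spreads the requests around so that no single host sees them back to back.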
With this and a good parsing program, you can build a Visual Basic front end that enables the user to specify the root URLs, the depth of recursion, and the information being sought.
Behind this front end, you need to set up a system for requesting, parsing, and analyzing pages for content, as well as a system to determine whether the search, time-out, and depth conditions and restrictions have been met. Tack the results onto a text box or a browser custom control, add a status bar indicating the progress of the search, and you've built a full-featured Web robot.
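That condition-checking system can be as simple as one function the main loop calls before each request. The following Visual Basic sketch shows the idea; the parameter names are assumptions about what the front end collects and what counters the loop keeps.

' All of the names here are assumptions: startTime is the value of
' Timer when the search began, pagesLoaded and matchesFound are
' counters kept by the main loop, and the three limits come from the
' user interface. A depth limit could be tested the same way.
Private Function SearchShouldStop(ByVal startTime As Double, _
                                  ByVal pagesLoaded As Long, _
                                  ByVal matchesFound As Long, _
                                  ByVal maxSeconds As Long, _
                                  ByVal maxPages As Long, _
                                  ByVal maxMatches As Long) As Boolean
    SearchShouldStop = True

    If Timer - startTime > maxSeconds Then Exit Function   ' time-out reached
    If pagesLoaded >= maxPages Then Exit Function          ' page limit reached
    If matchesFound >= maxMatches Then Exit Function       ' enough matches found

    SearchShouldStop = False
End Function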
See Chapter 14 for a Visual Basic implementation of a simple Web searching tool.
A continual headache for Webmasters and site maintainers is the constant moving and deleting of Web resources. The interconnectedness of the World Wide Web encourages sites to link to other sites via hypertext references.
One problem with these external links is that when the page pointed to by an external link is moved, the person who linked to it is not notified, which leads to a 404 URL Not Found error or a message similar to "Thanks for visiting, but this page has moved to http://www.newurl.com".
This is very similar to an old address book of people you occasionally call. If the people move and change their phone numbers, you get that familiar phone company voice telling you the number has changed or has been disconnected.
Currently, there is no method for a page, when it changes location or disappears, to notify all pages that link to it. The lack of this facility causes these all-too-common error and "We have moved here..." messages.
Finding out what percentage of links behave this way is somewhat difficult to implement. A basic plan of attack for this application: start from a user-defined set of root pages, parse each page for anchors, request each anchored URL, record the response the server returns, and report any link that comes back with an error or a moved notice, along with the page that contains it.
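A minimal Visual Basic sketch of the checking pass itself might look like the following. Both helpers are hypothetical: ExtractAnchors returns a Collection of the URLs anchored in a page, and HttpStatusFor requests a URL and returns the numeric HTTP status code (200, 301, 404, and so on). Results go to the Immediate window only to keep the sketch short.

' Check every anchor on one page and report moved or broken links.
Private Sub CheckLinks(ByVal pageURL As String)
    Dim anchors As Collection
    Dim i As Long
    Dim statusCode As Long

    Set anchors = ExtractAnchors(pageURL)

    For i = 1 To anchors.Count
        statusCode = HttpStatusFor(anchors(i))

        If statusCode >= 400 Then
            Debug.Print pageURL & " has a broken link: " & anchors(i)
        ElseIf statusCode >= 300 Then
            Debug.Print pageURL & " links to a moved page: " & anchors(i)
        End If
    Next i
End Sub

A fuller version would accumulate these results into a report rather than printing them as it goes.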
Again, as with any spider, it is suggested that the standard for robot exclusion be honored and that servers be hit in a round-robin fashion, so as not to overload any single Web server.
Although there are no known robots written in Visual Basic currently running regularly on the Web, they will pop up. Currently, most Web robots are being written in Perl, C, and C++.
WebCrawler, at http://www.webcrawler.com (see Figure 11.5), is one of the major search engines on the World Wide Web. It started in 1994 as a project at the University of Washington and is currently hosted by America Online. You can tell whether your server has been hit by this search engine by checking the Web server's log files for an HTTP user agent of WebCrawler/2.0 libwww/3.0 run from a machine named spidey.webcrawler.com.
Figure 11.5. Opening screen for WebCrawler.
http://www.ics.uci.edu/WebSoft/MOMspider/
Multi-Owner Maintenance spider (MOMSpider) is a project from Roy Fielding at the University of California-Irvine. The MOMSpider was written to deal with the maintenance of Web sites. MOMSpider and its accompanying documentation address the important issue of continually changing page locations and URLs as it relates to managing a Web site. In other words, by utilizing a program such as MOMSpider, you can help eliminate users seeing the File Not Found message when clicking a link on your Web page. Documentation and the MOMSpider program itself are available at http://www.ics.uci.edu/WebSoft/MOMspider/.
This chapter demonstrated that there are many uses for the Web other than just browsing. The ability to use a World Wide network full of information is a massive opportunity. Be it searching, validating links, or reporting on the status of a set of pages, robots can make these tasks easier for the user of the World Wide Web.
From here, you can go on to Chapter 16, "Link Checker: A Spider That Checks for Broken Links." There, we'll go through the design and construction of a link-verifying robot. Then, in Chapter 17 you'll build a client-side Web search tool.