OS/2 Shareware BBS: 35 Internet

OS/2 Help File | 1996-09-20 | 7KB | 200 lines

═══ 1. SpiderMan ═══

SpiderMan 0.90

SpiderMan can retrieve Web pages from an HTTP (WWW) server. It can be configured to follow all hyperlinks on the page that lead to other pages on the same server. Images on the pages can be retrieved as well. All pages are stored on disk and can be viewed later using your web browser.

SpiderMan can make use of a proxy HTTP server, speeding up the whole procedure.

SpiderMan requires at least one HPFS partition!

Topics:
- The main window
- Common tasks
- Command line options
- For the techies
- Contacting the author

═══ 1.1. The main window ═══

On the main window you find the following elements:

- A drop down list where you enter the URL. The last 15 URLs are saved.
- "Start" and "Stop" buttons.
- A log window. Its contents are also stored in the log file.
- A status line. Its contents are:
    - the current URL
    - total number of data bytes retrieved
    - total number of data bytes of the current URL
    - number of bytes retrieved of the current URL
    - number of URLs retrieved
    - number of URLs tried
    - number of URLs queued for inspection

═══ 1.2. Common tasks ═══

Here's how to perform some common tasks with SpiderMan:

I want to download a complete web site.
    In the setup, enable "Follow links" and "Inline images". Disable "Don't climb up". Then enter the root URL of the site (e.g. "http://www.thesite.com/") and press "Start".

I want to download only part of a web site.
    In the setup, enable "Follow links", "Inline images" and "Don't climb up". Then enter the URL of the starting page (e.g. "http://www.thesite.com/some/path/start.html") and press "Start".

I want to download a single web page with images, but only if it has changed.
    In the setup, disable "Follow links". Enable "Inline images" and "Modified pages only". Then enter the URL of the page (e.g. "http://www.thesite.com/pageofinterest.html") and press "Start".

═══ 1.3. Command line options ═══

SpiderMan can be run in automated mode, i.e.
it takes one or more URLs as program parameters, downloads these pages according to the program options, and exits when finished.

The command line syntax is:

    SPIDER.EXE [<url> | @<listfile>]*

In other words, you can specify one or more URLs and one or more list files. Each line in a list file is interpreted as a URL. Empty lines and lines starting with ';' are ignored.

When finished, SpiderMan returns one of the following ERRORLEVEL values:

     0   Everything OK
     1   Invalid command line option
     2   Problem(s) with one of the list files
    10   Other error

═══ 1.4. For the techies ═══

Here's some technical information if you're interested:

- SpiderMan uses HTTP 1.0. HTTP 0.9 is not supported. If some web site is still using an HTTP 0.9 server, its contents may be just as outdated, so you might not miss anything.
- SpiderMan only follows HTTP links, not FTP or others.
- SpiderMan counts <IMG SRC=...> and <BODY BACKGROUND=...> as inline images.
- If the file name of a retrieved page isn't specified, it's stored as INDEX.HTML.
- The "Last-Modified" timestamp is stored in the file's EAs. The EA name is HTTP.LMODIFIED and is of type EAT_ASCII.
- Some characters in the URL are converted when building the path name of the file. However, no conversion to FAT (8.3) names is performed!
- If a page is redirected, the redirection is automatically followed, but only if the new location is on the same server!
- SpiderMan was developed and tested on OS/2 Warp 3.0 with TCP/IP 2.0 installed. It should also work with the following configurations:
    - Warp 3.0 with IAK
    - Warp 3.0 Connect with TCP/IP 3.0
    - Warp Server
    - Warp 4.0 (Merlin)

═══ 1.5. Contacting the author ═══

SpiderMan was developed by Michael Hohner. He can be reached electronically at:

    EMail:       miho@osn.de
    Fidonet:     2:2490/2520.17
    CompuServe:  100425,1754

═══ 2. Setup ═══

Options
    Specify all program options.

═══ 2.1. Servers ═══

Proxy
    Enter the host name of a proxy HTTP server.
    You may also specify a port number for the proxy server. Check "Enable" to actually use the proxy server. Contact your service provider to get this data.

    Note: Only enter the host name, not the URL (e.g. "proxy.isp.com", not "http://proxy.isp.com:123/")!

Email address
    Enter your EMail address. It is included in every request. Leave this field empty if you don't want your EMail address to be revealed.

═══ 2.2. Paths ═══

Path for retrieved data
    Path where retrieved pages and images are stored. This path and its subpaths are created automatically.

Log file
    Path and name of the log file.

═══ 2.3. Options ═══

Follow links
    If checked, hyperlinks in retrieved documents are followed. Otherwise, SpiderMan just retrieves one page. You can enter a set of extensions (separated by spaces, commas or semicolons) to retrieve. Links with other extensions are ignored. If you don't enter anything, all links are followed.

Inline images
    If checked, inline images are also retrieved.

Don't climb up
    If checked, hyperlinks that are hierarchically higher than the initial URL are not followed. Otherwise, all links to the same server are followed.

Retrieve modified pages only
    A document is only retrieved if it's newer than the local copy.

═══ 3. About ═══

═══ 4. File menu ═══

Exit
    Ends the program.

═══ 5. Help menu ═══

General help
    Provides general help.

Product information
    Displays name, version number, copyright information etc.
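
As an illustration of the command line syntax described in section 1.3 (URLs mixed with @<listfile> arguments), the following Python sketch shows one way such arguments could be expanded into a flat list of URLs. It is not SpiderMan's actual code: the function names read_list_file and expand_arguments are hypothetical, and stripping whitespace around each line is an assumption the help text does not confirm.

```python
from pathlib import Path


def read_list_file(path):
    """Return the URLs found in a list file, one URL per line.

    Empty lines and lines starting with ';' are skipped, as the help
    text describes. Stripping surrounding whitespace is an assumption.
    """
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment line
        urls.append(line)
    return urls


def expand_arguments(args):
    """Expand arguments of the form [<url> | @<listfile>]* into URLs."""
    urls = []
    for arg in args:
        if arg.startswith("@"):
            # '@name' refers to a list file; drop the '@' prefix.
            urls.extend(read_list_file(arg[1:]))
        else:
            urls.append(arg)
    return urls
```

With a hypothetical list file sites.lst, a call such as expand_arguments(["http://www.thesite.com/", "@sites.lst"]) would yield the first URL followed by every non-comment line of the file, mirroring an invocation like SPIDER.EXE http://www.thesite.com/ @sites.lst.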