Text File | 1997-02-24 | 6KB | 132 lines

The TRAVERSAL code from old versions of Lynx has been upgraded by David
Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again, can be
enabled via a command line switch (-traversal) instead of via a
compilation symbol that creates a separate Lynx executable as in those
previous versions, and can be used in conjunction with a -crawl switch
to make Lynx a front end for a Web Crawler.

Usage:
    lynx [-traversal] [-realm] [-crawl] ["startpage"]

Added switches are:

  -traversal  Follow all http links derived from startpage that are
              on the same server as startpage.  If startpage isn't
              specified then the traversal begins with the default
              startfile or WWW_HOME.

  -realm      Further restrict http links to ones in the same realm
              (having a matching base URI) as the startpage (e.g.,
              http://host/~user/ will restrict the traversal to that
              user's public html tree).

  -crawl      With [-traversal] outputs each unique hypertext page as
              an lnk########.dat file in the format specified below.
              With [-dump] outputs only the startpage, in the same
              format, to stdout.

Note on startpage:
    If a startpage is specified and contains any uppercase characters,
    on VMS it should be enclosed in double-quotes.  The code that
    extracts the access and host fields from startpage for comparisons
    with links (to ensure they are not on another server), and the
    comparisons with already traversed links, are case sensitive.
    Without double-quotes the startpage is converted to all lowercase
    on VMS, so it might be treated as a new link if encountered with
    uppercase letters.

Files created and/or used with the -traversal switch, based on
definitions in userdefs.h:

TRAVERSE_FILE (traverse.dat):
    Contains a list of all URLs that were traversed.  Note that if a
    URL appears in this file it will not be traversed again (important
    if runs are started and stopped).  Placing an entry in this file
    BEFORE the run will block traversal of that URL.  Unlike
    reject.dat, a final * has no effect (see below).
    Note that Lynx internal client-side image MAP URLs will be included
    in this file (e.g., LYNXIMGMAP:http://server/foo.html#map1), in
    addition to the "real" (external) http URLs.

TRAVERSE_FOUND_FILE (traverse2.dat):
    Contains a list of all URLs that were traversed, in the order
    encountered or re-encountered (but not re-traversed) during a
    traversal run, and the TITLEs of the documents (separated from the
    URLs by TABs).  A URL and TITLE may be present in this list many
    times.  To simplify the list, on VMS use:

        sort/nodups traverse2.dat;1 ;2

    Note that the URLs and TITLEs of the Lynx internal client-side
    image MAP pseudo-documents will not be included in this file,
    though "traversed", but only the http URLs and TITLEs derived from
    the MAP's AREA tag HREFs that were traversed.

TRAVERSE_REJECT_FILE (reject.dat):
    Contains a list of URLs that have been rejected from the
    traversal.  Once a URL has been entered in this list, it will not
    be traversed.  URLs that end in a * will cause rejection of all
    URLs that match up to the character before the *.  So for
    instance, to reject all htbin references on a site put this line
    in the reject.dat file BEFORE starting the run:

        http://www.site.wherever:8000/htbin*

TRAVERSE_ERRORS (traverse.errors):
    A list of links that could not be accessed or had an unknown
    status returned by the http server.  If the owner of the document
    containing the link is known via a LINK REV="made"
    HREF="mailto:foo" in it, and MAIL_SYSTEM_ERROR_LOGGING was set
    true in userdefs.h or lynx.cfg (not recommended!!!), a message
    about the problem will be mailed to the owner as well.

Files created during traversals if the -crawl switch is included with
the -traversal switch:

lnk########.dat
    Numbered output files containing the contents of traversed
    hypertext documents in text format.
    All hypertext links within the document have been stripped, and
    the URL and TITLE of the document are recorded as the first two
    lines, e.g., for the seqaxp.bio.caltech.edu home page the first
    two lines will be:

        THE_URL:http://seqaxp.bio.caltech.edu:8000/
        THE_TITLE:SAF Web server home page

    The VMSIndex software is being adapted to use this information to
    extract the corresponding URL and TITLE for use in indexing the
    lnk########.dat files, e.g.:

        $ build_index -
              /url=(text="THE_URL:") -
              /topic=(text="THE_TITLE:",EXCLUDE) -
              /output=INDEX_NAME -
              lnk*.dat

    A clever person should be able to figure out a way to index the
    lnk########.dat files on Unix as well.

    If you want the hypertext links in the document to be numbered,
    include the -number_links switch.  By default, this will cause the
    list of References (URLs for the numbered links) to be appended as
    well.  If you want numbered links but not the References list,
    include the -nolist switch as well.

    Note that any client-side image MAP pseudo-documents that were
    "traversed" will not have lnk########.dat output files created
    for them, but output files will be created for "real" documents
    that were traversed based on the HREFs of the MAP's AREA tags.

This functionality is still under development.  Feedback and
suggestions are welcome.
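
One possible starting point for indexing the lnk########.dat files on
Unix is to read the THE_URL: and THE_TITLE: lines back out of each
file.  The Python sketch below is not part of Lynx or VMSIndex.  The
function names read_header and build_index are hypothetical, and it
assumes the files can be decoded as Latin-1 and that the two header
lines appear exactly as described above.

```python
import glob
import os

URL_PREFIX = "THE_URL:"
TITLE_PREFIX = "THE_TITLE:"


def read_header(path):
    """Return (url, title) from the first two lines of a crawl output
    file, or None if the expected header lines are not present."""
    # Assumption: Latin-1 decoding; it accepts any byte sequence.
    with open(path, "r", encoding="latin-1") as f:
        first = f.readline().rstrip("\r\n")
        second = f.readline().rstrip("\r\n")
    if not first.startswith(URL_PREFIX):
        return None
    if not second.startswith(TITLE_PREFIX):
        return None
    return first[len(URL_PREFIX):], second[len(TITLE_PREFIX):]


def build_index(directory):
    """Map each lnk*.dat file name in `directory` to its (url, title)."""
    index = {}
    for path in sorted(glob.glob(os.path.join(directory, "lnk*.dat"))):
        header = read_header(path)
        if header is not None:  # skip files without the two header lines
            index[os.path.basename(path)] = header
    return index
```

Calling build_index(".") in the directory that holds the crawl output
returns a dictionary keyed by file name, which could then be handed to
whatever full-text indexer is available on the system.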