OS/2 Shareware BBS: 35 Internet

Text File | 1998-05-15 | 42KB | 1,037 lines

15 May 1998: The CheckLink addon for SRE-http, Version 1.02

Contact: Daniel Hellerstein (danielh@econ.ag.gov)

CheckLink: Create, display, traverse, and index a web-tree

Abstract: CheckLink is a multi-threaded, socket aware utility used to create, verify, traverse, and index a web-tree; where "web-tree" is defined as all URLs (in-line images, anchors, etc.) that are referenced in a root HTML document, and in all documents reachable from this root. CheckLink can be run as an SRE-http addon, or from an OS/2 command prompt.

-------------------
Contents:

  I. Introduction
     1.a. Web Tree? Does that make sense?
  II. Installation
     II.a. Using CheckLink without SRE-http.
  III. CheckLink parameters.
     III.a. A Note on How CHEKLINK displays results
     III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters
  IV. CheckLink Request Options
     IV.a. CHEKLINK options.
     IV.b. CHEKLNK2 options
     IV.c. CHEKINDX
        IV.c.ii. CHEKINDX options.
        IV.c.iii. CHEKINDX edit mode
  V. Notes
  VI. Disclaimer

-------------------
I. Introduction

CheckLink is a robot that is used to create, verify, traverse and index a web-tree. In other words, CheckLink will find and variously display all the URLs (such as anchors and in-line images) that appear in a set of HTML documents.

In particular, CheckLink will:

  ... given a "starter-URL" provided by a client:
     a) use TCP/IP socket calls to obtain the contents of the HTML document that this "starter-URL" points to
     b) find URLs referred to by this document (i.e. <A Href=...> elements contained within the document)
     c) verify the "existence" of all of these URLs
     d) recursively check each URL that maps to an HTML document

The recursive part simply means "go back to step a" for each and every "on-site" text/html document pointed to by a URL in this starter-URL (etc.). The net effect is that a "web-tree" is mapped, with the root of the web-tree being the starter-URL selected by the client, and with each element of the web-tree being a unique URL.
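Steps a) through d) can be sketched in a few lines of Python. This is only an illustration of the idea, not CheckLink's actual REXX code: the `fetch` callable, the `LinkFinder` class and the `map_web_tree` function are hypothetical names invented here, only four tag types are examined, and the sketch is single-threaded where CheckLink is multi-threaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkFinder(HTMLParser):
    """Collect href/src values from a few common tags (step b)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        wanted = {"a": "href", "area": "href", "img": "src", "frame": "src"}
        attr = wanted.get(tag)
        for name, value in attrs:
            if name == attr and value:
                self.links.append(value)


def map_web_tree(starter_url, fetch):
    """Return {url: info} for every URL reachable from starter_url.

    fetch(url) is a hypothetical callable (an assumption of this sketch)
    returning (mimetype, body), or None when the resource is not found.
    """
    site = urlsplit(starter_url).netloc
    tree = {}
    pending = [starter_url]
    while pending:
        url = pending.pop()
        if url in tree:
            continue  # each element of the web-tree is a unique URL
        result = fetch(url)  # steps a and c: retrieve and verify
        if result is None:
            tree[url] = {"ok": False, "links": []}
            continue
        mimetype, body = result
        tree[url] = {"ok": True, "links": []}
        on_site = urlsplit(url).netloc == site
        if mimetype == "text/html" and on_site:  # step d: on-site html only
            finder = LinkFinder()
            finder.feed(body)
            for href in finder.links:
                target = urljoin(url, href)  # resolve relative URLs
                tree[url]["links"].append(target)
                pending.append(target)
    return tree
```

Passing the `get` method of a dictionary of canned pages as `fetch` is enough to try the sketch without touching a network; off-site documents are verified but their own links are not followed.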
Typically, the bulk of these URLs lie on a single site, though off-site URLs can be checked to see if the resources they point to are still available.

CheckLink will maintain information on all links "contained in", or that "point to", the resources represented by the URLs that comprise the web-tree. With this information in hand, CheckLink makes it easy to traverse a web-tree, such traversal being a handy way to ascertain the devious ways the web-site (or the portion of the web-site spanned by the web-tree) is interconnected.

CheckLink is best run as an "addon" for the SRE-http web server (http://rpbcam.econ.ag.gov/srehttp). In particular, version 1.2M (or above) of SRE-http is required. Note that you can select any "starter-URL" desired -- it need NOT be on the site hosting CheckLink.

For those lacking SRE-http, you can use the components of CheckLink as a standalone program running from an OS/2 prompt, and as a CGI-BIN script. Although it's a cleaner product when run under SRE-http, the functionality is basically the same.

Lastly, CheckLink is multi-threaded. In addition to adding speed to web-traversals, the multi-threaded nature protects CheckLink against recalcitrant servers; servers that might stop or otherwise hang up a single-threaded link checker.

-------------------
1.a. Web Tree? Does that make sense?

Perhaps the use of the term "web-tree" is misleading -- it's more of a web-network, web-graph, or (dare we say it?) a web-web. The point is that a tree implies a bottom-to-top branching structure, with a clearly defined set of precedences. In contrast, a web site is defined by a network of links, with each node connecting to a wide variety of other nodes. Although most web-sites do have some sort of hierarchy (i.e.; there is usually one or several "home pages"), this is usually loosely defined, with lots of cross-cutting links.
Nevertheless, for reasons of brevity we will use the term "web-tree" in this documentation to refer to "the network of resources, as referred to by URLs, that may be reached from a single starting point". Although this single starting point (the "starter-URL") is really just a point of entry, one usually chooses a "starter-URL" that is somehow more fundamental -- say, a home page. Hence, this "starter-URL" is often referred to as the "root of the web-tree".

-------------------
II. Installation

CheckLink consists of 3 separate program files, one sample HTML document containing several forms, and this documentation. The files are:

   CHEKLINK.CMD --- creates the web-tree, and displays basic information on the web-tree
   CHEKLNK2.CMD --- examines and traverses a web-tree
   CHEKINDX.CMD --- creates a hierarchical index of a web-tree
   CHEKLINK.HTM --- an HTML document with several forms for invoking the above programs

Assuming that you have SRE-http installed as your web-server, installation of these components of CheckLink is straightforward:

   i) UNZIP CHEKLINK.ZIP to an empty temporary directory.
   ii) Copy CHEKLINK.CMD, CHEKLNK2.CMD, and CHEKINDX.CMD to your SRE-http "ADDON" directory (i.e.; D:\GOSERVE\ADDON)
   iii) Copy CHEKLINK.HTM to your GoServe data directory, or some other WWW accessible location (i.e.; D:\WWW)
   iv) Optional: Change parameters in CHEKLINK.CMD and CHEKLNK2.CMD.

The easiest way to use CheckLink is by pointing your browser at /CHEKLINK.HTM. CheckLink works with all browsers that understand tables, but the results look best with browsers that understand either multi-part documents or client pull (such as Netscape 2.01 and above).

-------------------
II.a. Using CheckLink without SRE-http.

If you are not an SRE-http user, you can run CheckLink as a standalone program -- just copy CHEKLINK.CMD to an appropriate directory (say, D:\INTERNET\CHEKLINK). When you are ready to run CheckLink, just CD to this directory, run CHEKLINK from an OS/2 command prompt, and follow the directions.
For example:
   D:>cd \internet\cheklink
   D:\INTERNET\CHEKLINK>cheklink

When run in standalone mode, the i/o interface is primitive, and the final output is HTML code -- it is meant to be viewed with a browser. Otherwise, the results are the same as when run as an SRE-http addon (it might even be a touch faster).

IMPORTANT NOTE: To use CheckLink as a standalone program, you MUST have REXXLIB.DLL. REXXLIB is $25 shareware (obtainable from http://www.quercus-sys.com/rexxlib.htm). It's a good bargain, but if this expenditure is problematic, please contact danielh@econ.ag.gov for alternatives.

If you want to use CheckLink to "examine and traverse the web-tree", or to "create an index of the web-tree", you should copy the CHEKLNK2.CMD and CHEKINDX.CMD files to your CGI-BIN scripts directory. The output from CHEKLINK will contain CGI-BIN calls to CHEKLNK2.

Thus ... To use CheckLink in a non-SRE-http environment, you will:

   a) Run CHEKLINK.CMD, from an OS/2 command prompt, to generate the index of a web-tree, and to produce several tables of results.
   b) Run CHEKLNK2.CMD and CHEKINDX.CMD as CGI-BIN scripts.
   c) Optionally, make a few small CGI-BIN modifications to CHEKLINK.HTM (see CHEKLINK.HTM for the details).

-------------------
III. CheckLink parameters.

Regardless of how you run CheckLink, you may wish to first adjust several performance-tuning and display-customization parameters. Most of these appear at the top of CHEKLINK.CMD, and there are a few in CHEKLNK2.CMD and CHEKINDX.CMD -- you should modify these files with your favorite text editor.

Note that to use any of the 3 CheckLink programs you do NOT need to set these parameters -- the default values work reasonably well. However, if you intend to make more than occasional use of CheckLink, we recommend setting the LINKFILE_DIR parameter in CHEKLINK.CMD, CHEKLNK2.CMD, and CHEKINDX.CMD.

-------------------
III.a.
A Note on How CHEKLINK displays results

Before further discussion, a note on how CHEKLINK displays results (when run as an SRE-http addon) is germane: CHEKLINK can return results in one long document, as a "two part" document, or in two separate documents.

In a "two part" document:
   The first part contains status information, and is sent to the client in pieces.
   The second part contains the results tables.

In a "long document" these parts are concatenated -- the final output contains both "status" and "results" information (and will be a bit more cluttered as a result).

Since CHEKLINK can take several minutes to process a thousand or so links, the production of "status" information is crucial. In fact, this status information is "sent in pieces" -- with some sort of output being sent to the client every few seconds. Not only does this help keep the client from giving up, it also prevents "server inactive" timeouts.

In fact, it's this "may take several minutes to finish" aspect of CHEKLINK that makes it very difficult to distribute a pure CGI-BIN version of CHEKLINK -- most CGI-BIN implementations do NOT allow for "sending results as they become available", and one cannot count on lengthy (i.e.; more than a few minutes) inactive-timeouts.

Although two-part documents are the more elegant solution, with certain browsers some very annoying "over refresh" behavior occurs (i.e.; every time you "back up" to the results, CHEKLINK is reinvoked). As a workaround, the "two document" strategy can be used, which will result in almost the same display as a two-part document (client pull is used to automatically replace the "status" document with the "results" document). The drawback is the requirement for semi-permanent storage of the results file on your server's disk -- you may need to monitor disk space if you allow CHEKLINK to be extensively used in two-document mode.

-------------------
III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters:

BACK_1 : <BODY> modifiers.
BACK_2
    BACK_1 and BACK_2 are used to set a BGCOLOR (or BACKGROUND) for the "two parts" of CheckLink's output. Note that if you are using CheckLink in single-part mode (i.e.; if you are using an older web browser, or if you set the USE_MULTI option to 0) BACK_2 is ignored.
    Examples:
        back_1='bgcolor="#668a78"'
        back_2='bgcolor="#8888dd" background="CL.GIF"'
    Note: BACK_1 (BACK_2) is ignored if USER_INTRO1A (USER_INTRO1B) is set to a non-null value.

CHEKLINK_HTM : URL pointing to CHEKLINK.HTM
    CHEKLINK_HTM should contain a URL (usually, a relative URL) that points to the CHEKLINK.HTM file shipped with CheckLink. This variable is only used to add a "generate another web-tree" option to the output file. Thus, neglecting to properly set CHEKLINK_HTM will have few deleterious effects.
    Example: CHEKLINK_HTM = '/CHEKLINK.HTM'

CHECK_ROBOT : Check (or suppress checking of) ROBOTS.TXT.
    If check_robot=1, then check the starter-URL site for a /robots.txt file, and use it to control the extent of the search. Proper netiquette dictates that when checking a stranger's site, you make sure you have set check_robot=1.
    Note: the contents of a ROBOTS.TXT file are added to a special "site-specific" EXCLUSION_LIST -- it only affects URLs on the starter-URL site.
    Example: check_robot=1

DOUBLE_CHECK:
    Since servers can be momentarily busy, it's often wise to "double check" busy servers.
    To do this, set DOUBLE_CHECK=1
    To NOT double check, set DOUBLE_CHECK=0
    This double checking will only look at servers that were "not available". It will be done after all links have been examined (thus giving the "not available" server a chance to become available). Lastly, the double check uses GET queries (instead of HEAD queries).

GET_QUERY:
    As part of mapping a web-tree, CheckLink will query servers for basic information on URLs. These queries are best done with HEAD requests.
    Unfortunately, there are a number of older servers that do not properly respond to HEAD requests. If you find that CheckLink is identifying many URLs as unavailable (even though your browser can get to them readily), it may be due to their host server's failure to recognize these HEAD requests. As a workaround, you can use short GET requests instead of HEAD requests. This method is engaged by setting get_query=1.
    Example: get_query=0
    Note: This get_query=1 method is not highly recommended -- it's slower, and somewhat "ruder" (connections are purposely broken, which tends to add garbage to the visited server's log file). Instead, we recommend setting DOUBLE_CHECK=1.

LINKFILE_DIR: directory to store "linkage" files in.
    Linkage files contain "link" information on all the URLs discovered during CheckLink's recursive mapping of a "web-tree". In particular, the LINKFILE option (see section IV) specifies a filename, which will then be stored in the LINKFILE_DIR. By default, LINKFILE_DIR will be your OS/2 TEMP drive.
    Example: LINKFILE_DIR='D:\GOSERVE\CHKLNKS'
    Note: in addition to storing LINKFILEs, the LINKFILE_DIR is also used to store "RESULTS" files.

MAXATONCE: maximum number of "query" threads.
    Specifies the maximum number of threads to use when checking for the existence (and mimetype) of a link (using HEAD requests). Increasing this number may speed up throughput, but it may subject the target server(s) to excessive loads.
    Example: maxatonce=6

MAXATONCE_GET: maximum number of "read" threads.
    Specifies the maximum number of threads to use when retrieving the contents of a URL (using GET requests). Increasing this number may speed up throughput, but it may subject the target server(s) to excessive loads.
    Example: maxatonce_get=2

MAXAGE: Kill a query if it's old.
    Specifies the number of seconds to wait on a query (a HEAD request). You may need to increase this time span if sites are far away or otherwise slow.
    However, increasing MAXAGE will increase the time that CheckLink waits on "hung" sites.
    Example: maxage=30

MAXAGE2: Kill a read if it's old.
    Specifies the number of seconds to wait on a read (a GET request). You may need to increase this time span if sites are far away or otherwise slow. However, increasing MAXAGE2 will increase the time that CheckLink waits on "hung" sites.
    Example: maxage2=60

ROW_COLOR1 : Used to set the <TR> in the results tables
ROW_COLOR2
ROW_COLOR1A
ROW_COLOR2A
    ROW_COLOR1 and ROW_COLOR2 set the odd and even rows (respectively) of tables used to display the results of checking IMG links.
    ROW_COLOR1A and ROW_COLOR2A set the odd and even rows (respectively) of tables used to display the results of checking Anchor links.
    Examples:
        row_color1='bgcolor="#bbcc66"'
        row_color2='bgcolor="#aaccdd"'
        row_color1a='bgcolor="#bbaa44"'
        row_color2a='bgcolor="#aaccdd"'

TD_INDENT: Used to indent a TYPE=2 (table) index.
    This string is used to indent each row of an "index table". You can try using characters (i.e.; ___ ), non-breaking spaces (i.e.; &nbsp; ) or empty columns (i.e.; <TD>&nbsp;</TD> ).
    Example: td_indent='<td bgcolor="#789966"> <font color="#789966">__</font></td>'
    Special Feature: If you have a 1_PIXEL.GIF in the /IMGS/ directory of your web tree, you can set td_indent equal to an integer value, which will cause the following to be used:
        <td> <IMG SRC="/IMGS/1_PIXEL.GIF" width=45> </td>
    where 45 is any number equal to td_indent * indent_levels.
    The above example would be used if
        > td_indent=15,
        > a given line is being displayed at a third level indentation
    Since td_indent*3 == 15*3 == 45, a 45 pixel "blank" spacer-image will be drawn.
    Note: 1_PIXEL.GIF should be a GIF file consisting of 1 transparent pixel.

TD_TITLE: Modifies the "title" field of a table index.
    TD_TITLE is used in the <TD> element of a "table" index.
    Example: td_title='valign="TOP" bgcolor="#a2a9a9" '

TD_DESCRIP
    TD_DESCRIP is used when writing descriptions.
    Example: td_descrip='valign="TOP"'

TR_MOD1 and TR_MOD2: Modifies rows of a table index.
    TR_MOD1 and TR_MOD2 modify <TR> elements of a table index. TR_MOD1 is used on odd rows, TR_MOD2 is used on even rows. Note that if TD_TITLE and TD_DESCRIP are used, TR_MODn may not have much impact.
    Examples:
        tr_mod1=' Bgcolor="#449922"'
        tr_mod2=''

USER_INTRO1A : Files containing "header" information.
USER_INTRO1B
    Fully qualified file names containing "header" information, for each part.
    If ='', then a generic header is used.
    If specified, the file MUST contain at least:
        <HTML><HEAD>.... </HEAD>
        <BODY ...>
        <h1>... </h1>
    Note: use of user_intro1a (user_intro1b) means that back_1 (back_2) is NOT used.
    Examples:
        user_intro1a=''
        user_intro1b='D:\GOSERVE\CHEK1.HDR'

-------------------
IV. CheckLink Request Options

Request options are specified when one of the CheckLink programs is requested; say, when you use CHEKLINK as the ACTION in an HTML FORM. The following briefly describes these options. For further details, we recommend perusing CHEKLINK.HTM.

-------------------
IV.a. CHEKLINK options.

The only required option is URL (defaults will be used for the other options when they are not specified).

Options:

BASEONLY :
    BASEONLY=0 : Read URLs relative to the root of the request
    BASEONLY=1 : Read URLs relative to the base of the request
    Example: if URL=/dogs/foo.htm; then
        baseonly=0 : /cats/bar.htm would be "recursively" read
        baseonly=1 : /cats/bar.htm would NOT be "recursively" read

DESCRIP: Create & save descriptions for "on-site" (and "in directory", if BASEONLY=1) documents.
    DESCRIP=0 -- do not create descriptions
    DESCRIP=1 -- create descriptions for text/html documents
    DESCRIP=2 -- create descriptions for text/html and text/plain documents
    DESCRIP=1 is fairly costless (it uses information that's already been read). DESCRIP=2 requires reading additional files.
    A maximum of 300 characters is retained (this can be modified by changing the DSCMAX parameter in CHEKLINK.CMD).
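The BASEONLY rule in the /dogs/foo.htm example above can be pictured with a small Python sketch. The function name `in_scope` is hypothetical, and the sketch assumes that "the base of the request" means the directory holding the starter-URL (subdirectories included); the documentation does not spell this out, so treat that reading as an assumption rather than as CheckLink's exact behavior.

```python
import posixpath
from urllib.parse import urlsplit


def in_scope(starter_url, candidate_url, baseonly):
    """Decide whether candidate_url would be recursively read."""
    start, cand = urlsplit(starter_url), urlsplit(candidate_url)
    if cand.netloc and cand.netloc != start.netloc:
        return False  # only on-site documents are read recursively
    if not baseonly:
        return True  # BASEONLY=0: anything under the site root
    # BASEONLY=1 (assumed meaning): stay within the starter-URL's directory
    base = posixpath.dirname(start.path).rstrip("/") + "/"
    return cand.path.startswith(base)
```

With a starter-URL of /dogs/foo.htm, `in_scope("/dogs/foo.htm", "/cats/bar.htm", 1)` is false while the same call with `baseonly=0` is true, which matches the example given for BASEONLY.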

EXCLUSION_LIST: Space delimited list of selectors to NOT query or read.
    *'s can be used as wildcards.
    Example: '!* *?* *MAPIMAGE/* CGI-*' (this is also the default)

LINKFILE : Name of a file to store "linkage" information.
    Linkage information pertains to each and every URL in the web-tree. Each of these URLs will be associated with a list of the text/html URLs (residing in the web-tree) that contain links pointing to this URL. In addition, each text/html URL (in the web-tree) is associated with a list of all its links (that point both on and off site).
    The LINKFILE is used to store these lists. More importantly, CHEKLNK2.CMD uses the LINKFILE to "examine and traverse" the web-tree.
    Notes:
      * The LINKFILE should be a file name, without path or extension information. A default extension of .STM is used, and the file is written to the LINKFILE_DIR directory.
      * If you do not want to retain this information, set LINKFILE=0
      * If you set LINKFILE (to a non-0 value), the output from CHEKLINK will contain links (one for each URL) to CHEKLNK2.

NAME: A descriptive name.
    You can enter a descriptive name for this "web-tree" -- it will be displayed at various points. If you do not specify a name, a default name will be constructed from the URL option (see below).
    Example: name=A+Sample+web_tree
    (note the URL encoding of spaces as + characters)

OUTTYPE: A space delimited list of tables to produce.
    The following values can be used in any combination:
        OK       ) Display successfully found links
        NOSITE   ) Display links to unreachable sites
        NOURL    ) Display links to missing resources
        OFFSITE  ) Display links to off-site URLs
        EXCLUDED ) Display links to excluded URLs (as specified in the EXCLUSION_LIST)
        ALL      ) Display all links
    Examples:
        OUTTYPE='ALL'
        OUTTYPE='OK NOURL '

RESULTS : A file containing the results of a prior call to CheckLink (primarily for internal use by CheckLink).
    Due to inappropriate refreshing by certain browsers, CheckLink can be instructed to save its results tables to a file (see the description of USE_MULTI). RESULTS points to one of these files -- when included, CheckLink will just return the RESULTS file.
    Example: results="CHKS0001.HTM"
    Note that these "results" files are stored in the LINKFILE_DIR directory.

SITEONLY:
    SITEONLY=0 : Query all URLs
    SITEONLY=1 : Query only URLs on the starter-URL's "own site"

URL: URL=fully qualified, or relative, URL
    This is the "starter-URL".
    Example: url="/samples/guide.htm"

USE_MULTI:
    USE_MULTI=0 : Return results in one long document.
    USE_MULTI=1 : Return results in a two-part document, with the second part replacing (overwriting) the first.
    USE_MULTI=2 : Return results in two separate documents, the second one being stored on the server's disk.
    Note that if an older browser (that does not support connection:maintain) is used, then USE_MULTI is set to 2.
    The primary reason for USE_MULTI=2 is to work around the "over-refreshing" bugs of certain browsers. Note that when USE_MULTI=2 is used, the RESULTS option is used internally by CHEKLINK to provide a link to the second document. This document, which will be assigned a random name, will be stored in the LINKFILE_DIR directory.

-------------------
IV.b. CHEKLNK2 options

CHEKLNK2 is used to examine and traverse a web-tree. Typically, you would not code a request to CHEKLNK2 -- you would use the links to CHEKLNK2 in the tables produced by CHEKLINK. In addition, CHEKLNK2 includes numerous links back into CHEKLNK2, links that utilize the options listed below. That is, CHEKLNK2 is somewhat of a self-contained program.

It is NOT expected that CHEKLNK2 will be explicitly used by most authors. Therefore, the following description will be rudimentary.

Note that CHEKLNK2 can be called as an SRE-http addon, or as a CGI-BIN script (but not as a standalone program).
Options:

LINKFILE -- Same definition as above: the linkage file (relative to the LINKFILE_DIR directory) that was created by a request to CHEKLINK.

ENTRYNUM -- Pointer to an entry in the LINKFILE. This entry corresponds to a unique URL; CHEKLNK2 will display links to and from this unique URL.
    Example: entrynum=12
    If entrynum=0, an alphabetized index of all text/html documents (in the web-tree) will be displayed.

ISIMG -- Select between image & anchor links.
    Setting isimg=1 means to use "image" links; otherwise, use "anchor" links. Note that the combination of ENTRYNUM and ISIMG dictates which URL will be examined.
    Example: entrynum=15&isimg=1

VIA -- Information on what location in the web-tree (which URL) was being examined prior to jumping here.

LIST -- Enable "traverse web tree mode". LIST can take the following values:
    LIST=0 (the default, used if LIST is not specified)
        Display a "synopsis" of the URL. This synopsis includes basic information (such as the size and mime type), and a list of the text/html documents (in the web-tree) that contain links to this URL (the entrynum URL). In addition, if this (the entrynum) URL is a text/html document, a table of all its links (images and anchors) will be displayed.
    LIST=1
        Display an (alphabetized) list of links to all text/html documents pointed to by links in the "entrynum URL" (more precisely, by the text/html document pointed to by the entrynum URL).
    LIST=2
        Similar to LIST=1, but display text/html documents that point TO the "entrynum URL" (LIST=2 is the reverse of LIST=1).
    LIST=3
        Display an alphabetized table of ALL URLs contained in the web-tree. In contrast, using LIST=0 and ENTRYNUM=0 will generate a list of "on-site, text/html documents".
    Example: LIST=1&entrynum=5

MIME -- A space delimited list of mimetypes, possibly containing wildcards.
    MIME is only used when LIST=3. When you specify MIME, then only URLs with a mimetype matching (one of) the elements of the MIME value will be used.
    Examples:
        LIST=3&MIME=text/plain
        LIST=3&MIME=image/*
        LIST=3&MIME=application/pdf+application/x-pdf
    (note use of + as a URL-encoded space)

Special Note: If you include an * in the LINKFILE value, CHEKLNK2 will produce a short list of currently available linkage files, and let you choose one to examine. The choice uses normal file matching rules. For example, /CHEKLNK2?linkfile='CHK*' may yield CHK01, CHKNOW, and CHK_C.

-------------------
IV.c. CHEKINDX

CHEKINDX is used to create a hierarchical index of your web-tree. By hierarchical index, we mean the sort of index we are all familiar with -- a highly indented list, with more "subsidiary" resources on more indented lines.

Basically, the notion is to use CHEKINDX to create a "web index" that you can post on your site (usually with suitable prettifications). Note that CHEKINDX uses nested "unordered lists" (<UL> constructs) to display the hierarchical index; hence the output should be viewable by all browsers.

As noted in section 1.a, the term web-tree is something of a misnomer, and construction of such a "hierarchical index" is not a cut and dried affair. That is, given the multiplicity of cross-cutting links, there is no single hierarchical representation of these "web-trees". Therefore, CHEKINDX uses a simple heuristic: given a specified "root-url" (which may, or may not, be the starter-URL), CHEKINDX will determine the position in the hierarchy as a function of distance to the root-url.

Basically, the following rules are used:

   Level 1 (starting closest to the left margin): The root-url. There is only one "level 1" row (it's the top row).
   Level 2 (second closest to the left margin): All URLs contained in the "root-URL" (that is, contained in the text/html document pointed to by the "root-URL").
   Level 3: All URLs contained in a level 2 URL.
   Level 4, 5, etc. are defined similarly.

Note that level 3 lists appear directly after the appropriate level 2 URL, and so forth.
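One way to read "position in the hierarchy as a function of distance to the root-url" is as shortest link distance. The Python sketch below takes that reading; it is not CHEKINDX's actual logic, and the names `index_levels` and `render_index`, as well as the `links` dictionary (URL mapped to the list of URLs it contains), are invented for this illustration. Each URL is listed once, which roughly corresponds to the strictest display setting.

```python
from collections import deque


def index_levels(links, root):
    """Map each reachable URL to its level; the root-url is level 1."""
    levels = {root: 1}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in levels:  # keep the first (closest) level found
                levels[target] = levels[url] + 1
                queue.append(target)
    return levels


def render_index(links, root):
    """Return the index as indented text lines, one listing per URL."""
    levels = index_levels(links, root)
    lines, seen = [], set()

    def walk(url):
        seen.add(url)
        lines.append("  " * (levels[url] - 1) + url)
        for target in links.get(url, []):
            # list a URL under the first parent that sits one level above it
            if target not in seen and levels[target] == levels[url] + 1:
                walk(target)

    walk(root)
    return lines
```

Calling `render_index` on a small dictionary of links returns the lines of an indented listing in which every child appears directly after its parent, one indentation step deeper.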
For example, an index might be laid out as:

   Root-URL
      2A
      2B
         3B.i
         3B.ii
         3B.iii
      2C
         3C.i
         3C.ii
            4.C.ii.x
            4.C.ii.xx
         3.C.iii

The above heuristic contains a key rule:
   * Once listed, a URL can never appear in a "higher level". That is, 3C.i can NOT list 2A.

This rule can be applied at various levels of stringency. For example, you could allow "ties" to be displayed multiple times, or you could only allow "one listing" per URL. This stringency, as well as the general scope of the listing, is controlled by the CHEKINDX options.

-------------------
IV.c.ii. CHEKINDX options.

Options:

CLEANUP : Used to remove "earlier, higher level references".
    CLEANUP=1 signals CHEKINDX to remove "higher level" references that preceded lower level references. In the example used in the description of MULTI (below), setting CLEANUP=1 with MULTI=1 would cause the earlier "level 5" reference to be removed from the index.
    CLEANUP has no effect when used with MULTI=0.
    When used with MULTI=1, then only the first (of several possible ties) is displayed. That is, MULTI=1 and CLEANUP=1 invokes a "use first occurrence of lowest level" rule.
    When used with MULTI=2, all ties are displayed -- "use all occurrences of the best level".
    Note that CLEANUP requires an extra iteration, hence requires more processing time.
    Example: CLEANUP=1
    By default, CLEANUP=0 (cleanup is not attempted).

DESCRIP: Write descriptions (if available).
    DESCRIP=1 : Write descriptions (under the title-link), if available
    DESCRIP=0 : Do not write descriptions

DROP: Space delimited list of (possibly wildcarded) selectors to drop.
    URLs with a selector portion that matches one of the items in DROP will not be displayed in the index. However, links within "dropped" selectors may be displayed! Thus, you should coordinate DROP with EXCLUDE.
    Examples:
        DROP=*SAMPLES/*FILELIST.HTM
        DROP=*/IND*.HTM+*/MAP*.HTM
    (note use of + as a URL-encoded space)
    By default, DROP='' (nothing is dropped)

EXCLUDE: Space delimited list of (possibly wildcarded) selectors to "not expand".
    URLs with a selector portion that matches one of the items in EXCLUDE will be included in the index, but will not be "expanded". That is, the "links" associated with an EXCLUDEd selector are not used. Contrast this with DROP, which drops display of the selector, but (possibly) retains URLs with links that appear within the document the selector refers to.
    Examples:
        EXCLUDE=FILELIST.HTM
        EXCLUDE=*/SITEMAP.HTM+*/INDICE.HTM
    (note use of + as a URL-encoded space)
    The primary use of EXCLUDE is to prevent some other "file listing" from being placed at a low level and "capturing" the bulk of the URLs. Such an occurrence may distort the true relationship between URLs.
    By default, EXCLUDE='' (nothing is excluded)

HEADER: Optional header to display at top of index.
    If not specified, the servername will be displayed.
    Example: Header='This+is+OUR+Site'

LINKFILE: As defined above (filename only, no path).
    LINKFILE is the only required parameter (note that the LINKFILE=* shortcut is NOT supported by CHEKINDX).

MIME: Space delimited list of (possibly wildcarded) mimetypes of URLs to include in the index.
    More precisely: the mimetype of the resource (that is pointed to by URLs in the web-tree) is compared to the list of mimetypes in the MIME option. If no match occurs, the URL is NOT included in the index.
    Examples:
        MIME=text/*
        MIME=image/jpeg+image/gif (note use of + for URL-encoding)
        MIME=application/pdf
    By default, MIME=text/html.

MULTI: Used to control the "stringency" of display.
    As mentioned above, it is likely that URLs will be referred to by several other "URLs" (that is, by HTML documents pointed to by several other URLs). To prevent infinite recursion, the basic rule is to:
        "never include a URL if it's already been included at a lower level"
    MULTI controls the other cases:
    MULTI=0 -- only one reference to a URL per index. Thus, if the first reference is at "level 5", and a "level 3" reference is found later, the "level 3" reference will NOT be displayed.
        Note that "level 2" references are ALWAYS displayed -- they are checked first.
    MULTI=1 -- if later references are strictly lower, then also display them. Thus, the level 3 reference mentioned above would be displayed (along with the level 5 reference).
    MULTI=2 -- Similar to MULTI=1, but ties are also displayed (thus, a second level 5 reference would be displayed if MULTI=2, but not if MULTI=1).
    Example: MULTI=2
    By default, MULTI=0

PIX. : Stem variable pointing to mime-type specific images.
    PIX. is a stem variable that points to small .GIF icons that will be displayed next to the title (or selector) of each entry in the index.
    The syntax is:
        PIX.0=number of entries
        PIX.n="mime/type selector"
            where n: 1 .. pix.0
                  mime/type can include * as a wildcard
                  selector is the selector of the icon to display
        PIX.!INCLUDE=text to include in IMG element
    For example:
        pix.0=3
        pix.1='text/plain /imgs/text.gif '
        pix.2='image/* /imgs/image.gif '
        pix.3='text/html '
        pix.!include=' height=18 width=18 ALT="*" align="center" '
    Note that when there is no "selector", no icon is drawn. Also, the first (of several possible) matches is used.

SITEONLY: Only include URLs on the "starter-URL's" site.
    If SITEONLY=1, then URLs that point off-site will not be included in the index.
    If SITEONLY=0, then all URLs may be included in the index.
    Note that "off-site" URLs will NEVER reference other links -- for purposes of the web-tree, they are all "leaves".
    By default, SITEONLY=1 (off-site URLs are excluded)

TYPE: Display type.
    There are three types of display:
    TYPE=1 : Use an Unordered List (<UL>)
    TYPE=2 : Use a table (<TABLE>)
    TYPE=3 : Return an "editable" document. You can use this to delete, change or move various records and fields (see section IV.c.iii for details).
    Note: if you select TYPE=2, you might want to play with the various TABLE_, TR_ and TD_ parameters in CHEKINDX.CMD.

URL: The "root-URL".
    Actually, it's the "root selector" -- you don't need to specify the http://a.b.c/ portion.
    CHEKINDX will use this "root-URL" as the "level 1" of the hierarchical index.
    Example: URL=/samples/index.htm
    By default, the "starter-URL" of the LINKFILE is used.

-------------------
IV.c.iii. CHEKINDX edit mode.

In many cases, the hierarchical index created by CheckLink can use editing. You may want to remove uninteresting links, change the indentation levels, modify uninformative descriptions, or even move index entries around. To facilitate such actions, you can invoke the "edit" mode of CHEKINDX (see the description of the TYPE option above).

In "edit" mode, an HTML form listing each entry, along with several options per entry, will be displayed. With this form you can:
   * Remove entries
   * Move entries
   * Change an entry's indentation level
   * Modify the "title" of the entry
   * Modify the "description" of the entry

After making these changes, you can then create a <UL> or <TABLE> index; or, you can re-edit the index (and make additional changes).

Notes:
   * These edits do NOT affect the "link file" (from which the index is first generated).
   * Edit mode is NOT available when CHEKINDX is run as a CGI-BIN script. It is only available if you are running this CheckLink package as an SRE-http addon.
   * You can re-edit several times, until you like what you see; and then you can finalize the index as a <UL> or a <TABLE>.
   * After finalizing, you should save the index to an HTML document (you might then further modify it with your favorite text editor).
   * A <UL> version of the index (reflecting current changes) is written on the bottom portion of the customization page.

-------------------
V. Notes:

* CHEKLINK looks for a few kinds of "image" links, and several kinds of "anchor" links:
     Image Links:
        <IMG src="xxx">
        <BODY background="XXX">
     Anchor Links:
        <A Href="XXX">
        <AREA Href="xxx">
        <FRAME src="XXX">
        <EMBED src="XXX">
        <LINK href="xxx">
        <APPLET code="xxx" codebase="http://x.x.x/yy" >
        <OBJECT codebase="xxx">
  Note that tags in comments (between <!-- and -->) are NOT processed.
  If there is some tag I've left out, please contact me (danielh@econ.ag.gov) if inclusion of such a capability would greatly enhance CheckLink!

* The major difference between IMG and ANCHOR links is that IMG links are never "read" (they are only queried). Should APPLET or OBJECT be treated as images?

* A possibility (given enough interest): A graphical web-mapper component for CheckLink.

* To display some of the run-time status information, you'll need PMPRINTF.EXE (http://www2.hursley.ibm.com/goserve).

* Sample speeds of CHEKLINK (on a Pentium 100 over a 16/4M Token Ring LAN based Intranet, with a T1 line to the outside world):
     1 GET per second (of html/text URLs, average size of 20k)
     8 HEADs per second (requests for basic information)

-------------------
VI. Disclaimer

Copyright 1997,1998 by Daniel Hellerstein. Permission to use this program for any purpose is hereby granted without fee, provided that the author's name not be used in advertising or publicity pertaining to distribution of the software without specific written prior permission.

This includes the right to subset and reuse the code, with proper attribution; and with the following understanding:

We, the authors of CheckLink and any potentially affiliated institutions, disclaim any and all liability for damages due to the use, misuse, or failure of the product or subsets of the product.

Furthermore you may also charge a reasonable re-distribution fee for CheckLink; with the understanding that this does not remove the work from the public domain and that the above proviso remains in effect.

THIS SOFTWARE PACKAGE IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY. THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE PACKAGE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL THE AUTHOR (Daniel Hellerstein) OR ANY PERSON OR INSTITUTION ASSOCIATED WITH THIS PRODUCT BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE PACKAGE.

SRE-http was developed on the personal time of Daniel Hellerstein, and is not supported, approved, or in any way an official product of my employer (USDA/ERS).