15 May 1998: The CheckLink addon for SRE-http, Version 1.02
Contact: Daniel Hellerstein (danielh@econ.ag.gov)
CheckLink: Create, display, traverse, and index a web-tree
Abstract: CheckLink is a multi-threaded, socket aware utility used to
create, verify, traverse, and index a web-tree; where
"web-tree" is defined as all URLs (in-line images, anchors,
etc.) that are referenced in a root HTML document, and in all
documents reachable from this root. CheckLink can be run as an
SRE-http addon, or from an OS/2 command prompt.
-------------------
Contents:
I. Introduction
I.a. Web Tree? Does that make sense?
II. Installation
II.a. Using CheckLink without SRE-http.
III. CheckLink parameters.
III.a. A Note on How CHEKLINK displays results
III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters
IV. CheckLink Request Options
IV.a. CHEKLINK options.
IV.b. CHEKLNK2 options
IV.c. CHEKINDX
IV.c.ii. CHEKINDX options.
IV.c.iii. CHEKINDX edit mode
V. Notes
VI. Disclaimer
-------------------
I. Introduction
CheckLink is a robot that is used to create, verify, traverse and index
a web-tree. In other words, CheckLink will find and variously display all
the URLs (such as anchors and in-line images) that appear in a set of HTML
documents. In particular, CheckLink will:
... given a "starter-URL" provided by a client:
a) use TCP/IP socket calls to obtain the contents of the html document
(that this "starter-URL" points to)
b) find URLs referred to by this document (e.g. <A Href=..> elements contained
within the document)
c) verify the "existence" of all of these URLs
d) recursively check each URL that maps to an html document
The recursive part simply means "go back to step a" for each
and every "on-site" text/html document pointed to by a URL in this
starter-URL (etc.).
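Steps a) through d) can be sketched as follows. This is only an illustration
in Python, not CheckLink's own code: the SITE dictionary is a hypothetical
stand-in for the socket calls, is_on_site() is an assumed rule, and a work
queue is used in place of literal recursion.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (mimetype, list of linked URLs).
# A real run would obtain these with TCP/IP socket calls instead.
SITE = {
    "/index.htm": ("text/html", ["/a.htm", "/logo.gif", "http://other.example/x.htm"]),
    "/a.htm": ("text/html", ["/index.htm", "/b.htm", "/missing.htm"]),
    "/b.htm": ("text/html", []),
    "/logo.gif": ("image/gif", []),
}

def is_on_site(url):
    # Assumption: relative selectors are on-site, absolute URLs are off-site.
    return url.startswith("/")

def map_web_tree(starter_url, site):
    """Return {url: status} for every URL reachable from starter_url."""
    status = {}
    queue = deque([starter_url])
    while queue:
        url = queue.popleft()
        if url in status:
            continue                      # each URL appears once in the web-tree
        if not is_on_site(url):
            status[url] = "off-site"      # may be verified, but is never read
            continue
        if url not in site:
            status[url] = "missing"
            continue
        mimetype, links = site[url]
        status[url] = "ok"
        if mimetype == "text/html":       # step d: only html documents are read
            queue.extend(links)
    return status

tree = map_web_tree("/index.htm", SITE)
```

Here tree holds six entries: four "ok" URLs, one "missing" and one "off-site".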
The net effect is that a "web-tree" is mapped, with the root of the
web-tree being the starter-URL selected by the client, and with
each element of the web-tree being a unique URL. Typically, the bulk
of these URLs lie on a single site; though off-site URLs can be checked
to see if the resources they point to are still available.
CheckLink will maintain information on all links "contained
in", or that "point to", the resources represented by the URLs that comprise
the web-tree. With this information in hand, CheckLink makes it easy to
traverse a web-tree, such traversal being a handy way to ascertain the
devious ways the web-site (or the portion of the web-site spanned by
the web-tree) is interconnected.
CheckLink is best run as an "addon" for the SRE-http web server
(http://rpbcam.econ.ag.gov/srehttp). In particular, version 1.2M (or
above) of SRE-http is required. However, you can select any
"starter-URL" desired -- it need NOT be on the site hosting CheckLink.
For those lacking SRE-http, you can use the components of CheckLink as a
standalone program running from an OS/2 prompt, and as a CGI-BIN script.
Although it's a cleaner product when run under SRE-http, the functionality
is basically the same.
Lastly, CheckLink is multi-threaded. In addition to adding
speed to web-traversals, the multi-threaded nature protects CheckLink against
recalcitrant servers; servers that might stop or otherwise hang-up a
single threaded link checker.
-------------------
I.a. Web Tree? Does that make sense?
Perhaps the use of the term "web-tree" is misleading -- it's more of a
web-network, web-graph, or (dare we say it?) a web-web. The point
is that a tree implies a bottom-to-top branching structure, with a
clearly defined set of precedences. In contrast, a web site is defined
by a network of links, with each node connecting to a wide variety
of other nodes. Although most web-sites do have some sort of hierarchy
(i.e.; there is usually one or several "home pages"), this is usually
loosely defined, with lots of cross-cutting links.
Nevertheless, for reasons of brevity we will use the term "web-tree"
in this documentation to refer to "the network of resources, as referred
to by URLs, that may be reached from a single starting point". Although
this single-starting point (the "starter-URL") is really just a point of
entry, one usually chooses a "starter-URL" that is somehow more
fundamental -- say, a home page. Hence, this "starter-URL" is often
referred to as the "root of the web-tree".
-------------------
II. Installation
CheckLink consists of 3 separate program files and one sample HTML FORM
(plus this documentation). The 4 files are:
CHEKLINK.CMD --- creates the web-tree, and displays basic information on
the web tree
CHEKLNK2.CMD --- examine and traverse a web-tree
CHEKINDX.CMD --- create a hierarchical index of a web-tree.
CHEKLINK.HTM --- an HTML document with several forms for invoking the
above programs.
Assuming that you have SRE-http installed as your web-server,
installation of these components of CheckLink is straightforward:
i) UNZIP CHEKLINK.ZIP to an empty temporary directory.
ii) Copy CHEKLINK.CMD, CHEKLNK2.CMD, and CHEKINDX.CMD to your SRE-http
"ADDON" directory (i.e.; D:\GOSERVE\ADDON)
iii) Copy CHEKLINK.HTM to your GoServe data directory, or some other
WWW accessible location (i.e.; D:\WWW)
iv) Optional: Change parameters in CHEKLINK.CMD and CHEKLNK2.CMD.
The easiest way to use CheckLink is by pointing your browser at /CHEKLINK.HTM.
CheckLink works with all browsers that understand tables, but the results
look best with browsers that understand either multi-part documents or
client pull (such as Netscape 2.01 and above).
-------------------
II.a. Using CheckLink without SRE-http.
If you are not an SRE-http user, you can run CheckLink as a standalone
program -- just copy CHEKLINK.CMD to an appropriate directory (say,
D:\INTERNET\CHEKLINK). When you are ready to run CheckLink, just CD to
this directory, run CHEKLINK from an OS/2 command prompt, and follow
the directions.
For example:
D:>cd \internet\cheklink
D:\INTERNET\CHEKLINK>cheklink
When run in standalone mode, the i/o interface is primitive, and the
final output is HTML code -- it is meant to be viewed with a browser.
Otherwise, the results are the same as when run as an SRE-http
addon (it might even be a touch faster).
IMPORTANT NOTE: To use CheckLink as a standalone program, you MUST have
REXXLIB.DLL. REXXLIB is $25 shareware (obtainable from
http://www.quercus-sys.com/rexxlib.htm). It's a good
bargain, but if this expenditure is problematic, please
contact danielh@econ.ag.gov for alternatives.
If you want to use CheckLink to "examine and traverse the web-tree",
or to "create an index of the web-tree", you should copy the
CHEKLNK2.CMD and CHEKINDX.CMD files to your CGI-BIN scripts
directory. The output from CHEKLINK will contain CGI-BIN
calls to CHEKLNK2. Thus ...
To use CheckLink in a non-SRE-http environment, you will
a) Run CHEKLINK.CMD, from an OS/2 command prompt, to generate the
index of a web-tree, and to produce several tables of results.
b) Run CHEKLNK2.CMD and CHEKINDX.CMD as CGI-BIN scripts
c) Optionally, make a few small CGI-BIN modifications to
CHEKLINK.HTM (see CHEKLINK.HTM for the details)
-------------------
III. CheckLink parameters.
Regardless of how you run CheckLink, you may wish to first adjust
several performance-tuning and display-customization parameters.
Most of these appear at the top of the CHEKLINK.CMD, and there are a
few in CHEKLNK2.CMD and CHEKINDX.CMD -- you should modify these files with
your favorite text editor.
Note that to use any of the 3 CheckLink programs you do NOT need to set
these parameters -- the default values work reasonably well.
However, if you intend to make more than occasional use of CheckLink,
we recommend setting the LINKFILE_DIR parameter in CHEKLINK.CMD,
CHEKLNK2.CMD, and CHEKINDX.CMD.
-------------------
III.a. A Note on How CHEKLINK displays results
Before further discussion, a note on how CHEKLINK displays results
(when run as an SRE-http addon) is germane:
CHEKLINK can return results in one long document, as a
"two part" document, or in two separate documents.
In a "two part" document:
The first part contains status information, and is sent to the
client in pieces.
The second part contains the results tables.
In a "long document" these parts are concatenated -- the final
output contains both "status" and "results" information (and will
be a bit more cluttered as a result)
Since CHEKLINK can take several minutes to process a thousand or so
links, the production of "status" information is crucial. In fact, this
status information is "sent in pieces" -- with some sort of output
being sent to the client every few seconds. Not only does this help
keep the client from giving up, it also prevents "server inactive"
timeouts.
In fact, it's this "may take several minutes to finish" aspect of
CHEKLINK that makes it very difficult to distribute a pure CGI-BIN
version of CHEKLINK -- most CGI-BIN implementations do NOT allow
for "sending results as they become available", and one cannot
count on lengthy (i.e.; more than a few minutes) inactive-timeouts.
Although two-part documents are the more elegant solution, with
certain browsers some very annoying "over refresh" behavior occurs
(i.e; every time you "back up" to the results, CHEKLINK is reinvoked).
As a work around, the "two document" strategy can be used, which will
result in almost the same display as a two-part document (client pull
is used to automatically replace the "status" document with the
"results" document). The drawback is the requirement for semi-permanent
storage of the results file on your server's disk -- you may need to
monitor disk space if you allow CHEKLINK to be extensively used in
two-document mode.
-------------------
III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters:
BACK_1 : <BODY> modifiers.
BACK_2
BACK_1 and BACK_2 are used to set a BGCOLOR (or BACKGROUND) for the
"two parts" of CheckLink's output. Note that if you are using CheckLink
in single-part mode (i.e.; if you are using an older web browser, or
if you set the use_multi option to 0) BACK_2 is ignored.
Examples:
back_1='bgcolor="#668a78"'
back_2='bgcolor="#8888dd" background="CL.GIF"'
Note: BACK_1 (BACK_2) is ignored if USER_INTRO1A (USER_INTRO1B) is set to a
non-null value.
CHEKLINK_HTM : URL pointing to CHEKLINK.HTM
CHEKLINK_HTM should contain a URL (usually, a relative URL) that
points to the CHEKLINK.HTM file shipped with CheckLink. This variable
is used to add a "generate another web-tree" option to the output file.
Thus, neglecting to properly set CHEKLINK_HTM will have few deleterious
effects.
Example: CHEKLINK_HTM = '/CHEKLINK.HTM'
CHECK_ROBOT : Enable (or suppress) checking of ROBOTS.TXT.
If check_robot=1, then check the starter-URL site for a /robots.txt file,
and use it to control the extent of the search.
Proper netiquette dictates that when checking a stranger's site,
you should set check_robot=1.
Note: the contents of a ROBOTS.TXT file are added to a special
"site-specific" EXCLUSION_LIST -- it only affects URLs on the
starter-URL site.
Example: check_robot=1
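As a rough illustration of how ROBOTS.TXT contents might become exclusion
patterns, here is a minimal Python sketch (not CheckLink's own code). The
function name, the choice to honor only the "*" agent, and the pattern form
(leading slash dropped, trailing * added) are all assumptions made for the
example.

```python
def robots_to_exclusions(robots_txt, agent="*"):
    """Collect Disallow paths for `agent` as wildcard selector patterns."""
    patterns = []
    applies = False        # does the current record apply to our agent?
    in_agent_run = False   # are we inside consecutive User-agent lines?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()        # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = (value == agent) or (in_agent_run and applies)
            in_agent_run = True
        else:
            in_agent_run = False
            if field == "disallow" and applies and value:
                # Assumption: selectors are matched without a leading slash.
                patterns.append(value.lstrip("/") + "*")
    return patterns

sample = (
    "User-agent: *\n"
    "Disallow: /cgi-bin/\n"
    "Disallow: /private/   # staff only\n"
    "\n"
    "User-agent: otherbot\n"
    "Disallow: /\n"
)
site_exclusions = robots_to_exclusions(sample)
```

With this sample, site_exclusions is ['cgi-bin/*', 'private/*']; the record
for "otherbot" is skipped.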
DOUBLE_CHECK:
Since servers can be momentarily busy, it's often wise to "double check"
busy servers. To do this, set DOUBLE_CHECK=1
To NOT double check, set DOUBLE_CHECK=0.
This double checking will only look at servers that were "not available".
It will be done after all links have been examined (thus giving the "not
available" server a chance to become available). Lastly, GET queries
are used (instead of HEAD queries).
GET_QUERY:
As part of mapping a web-tree, CheckLink will query servers for basic
information on URLs. These queries are best done with HEAD
requests.
Unfortunately, there are a number of older servers that
do not properly respond to HEAD requests. If you find that CheckLink
is identifying many URLs as unavailable (even though your browser
can get to them readily), it may be due to their host server's failure
to recognize these HEAD requests.
As a work around, you can use short GET requests instead of
HEAD requests. This method is engaged by setting:
get_query=1.
Example: get_query=0
Note: This get_query=1 method is not highly recommended -- it's slower,
and somewhat "ruder" (connections are purposely broken, which
tends to add garbage to the visited server's log file).
Instead, we recommend setting DOUBLE_CHECK=1
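The interplay of HEAD queries and the DOUBLE_CHECK pass can be sketched as
below. This is a hedged Python illustration, not CheckLink's own code:
check_links() and fake_query() are hypothetical names, and fake_query() stands
in for the real socket calls.

```python
def check_links(urls, query, double_check=True):
    """query(url, method) -> True if the server answered, False otherwise."""
    status = {url: query(url, "HEAD") for url in urls}
    if double_check:
        # The second pass starts only after every link has had its first check,
        # and it uses GET instead of HEAD.
        failed = [url for url, ok in status.items() if not ok]
        for url in failed:
            status[url] = query(url, "GET")
    return status

def fake_query(url, method):
    # Hypothetical stand-in: one older server answers GET but not HEAD,
    # and one server is really gone.
    if url == "http://old.example/page.htm":
        return method == "GET"
    return url != "http://gone.example/"

urls = ["/index.htm", "http://old.example/page.htm", "http://gone.example/"]
with_recheck = check_links(urls, fake_query, double_check=True)
without_recheck = check_links(urls, fake_query, double_check=False)
```

In this sketch the older server is reported as unavailable without the second
pass and as available with it, while the missing server fails both times.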
LINKFILE_DIR: directory to store "linkage" files in.
Linkage files contain "link" information on all the URLs discovered during
CheckLink's recursive mapping of a "web tree". In particular, the LINKFILE
option (see section IV) specifies a filename, which will then
be stored in the LINKFILE_DIR.
By default, LINKFILE_DIR will be your OS/2 TEMP drive.
Example: LINKFILE_DIR='D:\GOSERVE\CHKLNKS'
Note: in addition to storing LINKFILEs, the LINKFILE_DIR is also used
to store "RESULTS" files.
MAXATONCE: maximum number of "query" threads
Specifies the maximum number of threads to use when checking for the
existence (and mimetype) of a link (using HEAD requests).
Increasing this number may speed up throughput, but it may subject the
target server(s) to excessive loads.
Example: maxatonce=6
MAXATONCE_GET: maximum number of "read" threads.
Specifies the maximum number of threads to use when retrieving the
contents of a URL (using GET requests). Increasing this number may
speed up throughput, but it may subject the target server(s) to excessive
loads.
Example: maxatonce_get=2
MAXAGE: Kill a query if it's old
Specifies number of seconds to wait on a query (a HEAD request).
You may need to increase this time span if sites are far away or otherwise
slow. However, increasing MAXAGE will increase the time that
CheckLink waits on "hung" sites.
Example: maxage=30
MAXAGE2: Kill a read if it's old
Specifies number of seconds to wait on a read (a GET request).
You may need to increase this time span if sites are far away or otherwise
slow. However, increasing MAXAGE2 will increase the time that
CheckLink waits on "hung" sites.
Example: maxage2=60
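To illustrate how a thread limit and an age limit work together, here is a
small Python sketch (not CheckLink's own code). fake_head() is a hypothetical
stand-in for a HEAD request, the timings are shortened so the demonstration
runs quickly, and one deadline for the whole batch is used as a simplification
of a per-query age. Python cannot kill a thread, so the slow query is
abandoned instead of killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

MAXATONCE = 6    # worker threads, as in the maxatonce example
MAXAGE = 0.5     # seconds; shortened from the example's 30 for the demo

def fake_head(url):
    # Hypothetical stand-in for a HEAD request; one "hung" server is slow.
    time.sleep(2.0 if "slow" in url else 0.01)
    return "ok"

def query_all(urls, maxatonce=MAXATONCE, maxage=MAXAGE):
    pool = ThreadPoolExecutor(max_workers=maxatonce)
    futures = {pool.submit(fake_head, url): url for url in urls}
    done, not_done = wait(futures, timeout=maxage)
    results = {futures[f]: f.result() for f in done}
    for f in not_done:
        results[futures[f]] = "timed out"     # give up on old queries
    pool.shutdown(wait=False)                 # do not block on the hung query
    return results

results = query_all(["/a.htm", "/b.htm", "http://slow.example/"])
```

The two quick queries report "ok" and the slow one is marked "timed out"
without holding up the others.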
ROW_COLOR1 : Used to set the <TR> in the results tables
ROW_COLOR2
ROW_COLOR1A
ROW_COLOR2A
ROW_COLOR1 and ROW_COLOR2 set the odd and even rows (respectively)
of tables used to display the results of checking IMG links.
ROW_COLOR1A and ROW_COLOR2A set the odd and even rows (respectively)
of tables used to display the results of checking Anchor links.
Examples:
row_color1='bgcolor="#bbcc66"'
row_color2='bgcolor="#aaccdd"'
row_color1a='bgcolor="#bbaa44"'
row_color2a='bgcolor="#aaccdd"'
TD_INDENT: Used to Indent a Type=2 (table) index
This string is used to indent each row of an "index table".
You can try using characters (i.e.; ___ ), non-breaking spaces (i.e.; &nbsp; )
or empty columns (i.e.; <TD> </TD> )
Example:
td_indent='<td bgcolor="#789966"> <font color="#789966">__</font></td>'
Special Feature:
If you have a 1_PIXEL.GIF in the /IMGS/ directory of your web tree, you
can set td_indent equal to an integer value, which will cause the following
to be used:
<td> <IMG SRC="/IMGS/1_PIXEL.GIF" width=45> </td>
where 45 is any number equal to td_indent * indent_levels.
The above example would be used if
> td_indent=15,
> a given line is being displayed at a third level indentation
Since td_indent*3 == 15*3 == 45; a 45 pixel "blank" spacer-image will
be drawn
Note: 1_PIXEL.GIF should be a GIF file consisting of 1 transparent pixel.
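The spacer arithmetic can be written out as a short Python sketch (an
illustration only, not CheckLink's own code). indent_cell() is a hypothetical
name, and repeating a string-valued td_indent once per level is an assumption.

```python
def indent_cell(td_indent, level):
    """Return the indentation markup for a row at the given level."""
    if isinstance(td_indent, int):
        width = td_indent * level          # e.g. 15 * 3 = 45 pixels
        return '<td> <IMG SRC="/IMGS/1_PIXEL.GIF" width=%d> </td>' % width
    # Assumption: a string value is simply repeated once per level.
    return td_indent * level

cell = indent_cell(15, 3)
```

For td_indent=15 at the third level, cell is the 45 pixel spacer shown above.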
TD_TITLE : Modifies "title" field of a table index.
TD_TITLE is used in the <TD> element of a "table" index.
Example:
td_title='valign="TOP" bgcolor="#a2a9a9" '
TD_DESCRIP
TD_DESCRIP is used when writing descriptions.
Example:
td_descrip='valign="TOP"'
TR_MOD1 and TR_MOD2: Modifies rows of a table index
TR_MOD1 and TR_MOD2 modify <TR elements of a table index. TR_MOD1 is used
on odd rows, TR_MOD2 is used on even rows. Note that if TD_TITLE and TD_DESCRIP
are used, TR_MODn may not have much impact.
tr_mod1=' Bgcolor="#449922"'
tr_mod2=''
USER_INTRO1A : Files containing "header" information.
USER_INTRO1B
Fully qualified file names containing "header" information, for each part.
If ='', then a generic header is used
If specified, the file MUST contain at least:
<HTML><HEAD>.... </HEAD> <BODY ...> <h1>... </h1>
Note: use of user_intro1a (user_intro1b) means that back_1 (back_2) are
NOT used.
Examples:
user_intro1a=''
user_intro1b='D:\GOSERVE\CHEK1.HDR'
-------------------
IV. CheckLink Request Options
Request options are specified when one of the CheckLink programs
is requested; say, when you use CHEKLINK as the ACTION in an HTML FORM.
The following briefly describe these options.
For further details, we recommend perusing CHEKLINK.HTM.
-------------------
IV.a. CHEKLINK options.
The only required option is URL (defaults will be used for the other options
when they are not specified).
Options:
BASEONLY :
BASEONLY=0 : Read url's relative to the root of the request
BASEONLY=1 : Read url's relative to the base of the request
Example: if URL=/dogs/foo.htm; then
baseonly=0 : /cats/bar.htm would be "recursively" read
baseonly=1 : /cats/bar.htm would NOT be "recursively" read
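The BASEONLY test can be sketched in Python as follows (an illustration, not
CheckLink's own code). should_recurse() is a hypothetical name, and treating
subdirectories of the base as "in directory" is an assumption.

```python
def should_recurse(candidate, starter, baseonly):
    """Decide whether an on-site selector would be read recursively."""
    if not baseonly:
        return True                          # anything under the site root qualifies
    base = starter.rsplit("/", 1)[0] + "/"   # "/dogs/foo.htm" -> "/dogs/"
    # Assumption: subdirectories of the base also count as "in directory".
    return candidate.startswith(base)

read_with_0 = should_recurse("/cats/bar.htm", "/dogs/foo.htm", baseonly=0)
read_with_1 = should_recurse("/cats/bar.htm", "/dogs/foo.htm", baseonly=1)
```

This reproduces the example: /cats/bar.htm is read when baseonly=0 and
skipped when baseonly=1.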
DESCRIP:
Create & save descriptions for "on-site" (and "in directory",
if BASEONLY=1) documents.
DESCRIP=0 -- do not create descriptions
DESCRIP=1 -- create descriptions for text/html documents
DESCRIP=2 -- create descriptions for text/html and text/plain
documents
DESCRIP=1 is fairly costless (it uses information that's already
been read). DESCRIP=2 requires reading additional files.
A maximum of 300 characters is retained (this can be modified
by changing the DSCMAX parameter in CHEKLINK.CMD).
EXCLUSION_LIST:
Space delimited list of selectors to NOT query or read.
*'s can be used as wildcards.
Example: '!* *?* *MAPIMAGE/* CGI-*'
(this is also the default)
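A minimal Python sketch of this kind of wildcard matching follows (not
CheckLink's own code). It assumes that only * is a wildcard (so the ? in
*?* is a literal question mark), that a pattern must match the whole
selector, and that matching ignores case; none of these details is stated in
this document.

```python
import re

DEFAULT_EXCLUSIONS = "!* *?* *MAPIMAGE/* CGI-*"

def compile_pattern(pattern):
    # Only * is a wildcard; every other character (including ?) is literal.
    parts = (re.escape(piece) for piece in pattern.split("*"))
    return re.compile(".*".join(parts), re.IGNORECASE)

def is_excluded(selector, exclusion_list=DEFAULT_EXCLUSIONS):
    """True if the selector matches any pattern in the space delimited list."""
    return any(compile_pattern(p).fullmatch(selector)
               for p in exclusion_list.split())

checks = {s: is_excluded(s)
          for s in ("index.htm", "search?q=1", "cgi-bin/test", "maps/MAPIMAGE/usa")}
```

Under these assumptions only index.htm survives the default list; the other
three selectors are excluded.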
LINKFILE : Name of a file to store "linkage" information.
Linkage information pertains to each and every URL in the web-tree.
Each of these URLs will be associated with a list of web-tree
residing, text/html, URLs that contain links pointing to this URL.
In addition, each text/html URL (in the web tree) is associated with
a list of all its links (that point both on and off site).
The LINKFILE is used to store these lists. More importantly,
CHEKLNK2.CMD uses the LINKFILE to "examine and traverse" the
web tree.
Notes:
* The LINKFILE should be a file name, without path or extension
information. A default extension of .STM is used, and
the file is written to the LINKFILE_DIR directory.
* If you do not want to retain this information, set LINKFILE=0
* If you set LINKFILE (to a non-0 value), the output from
CHEKLINK will contain links (one for each URL) to CHEKLNK2.
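The two kinds of lists kept in a linkage file can be pictured with a small
Python sketch (an illustration only; the actual file format is not described
here). build_linkage() and the sample pages are hypothetical.

```python
from collections import defaultdict

def build_linkage(pages):
    """pages: {html_url: [urls it links to]} -> (links_from, links_to)."""
    links_from = {url: list(targets) for url, targets in pages.items()}
    links_to = defaultdict(list)
    for url, targets in pages.items():
        for target in targets:
            if url not in links_to[target]:    # list each referring page once
                links_to[target].append(url)
    return links_from, dict(links_to)

pages = {
    "/index.htm": ["/a.htm", "/logo.gif"],
    "/a.htm": ["/index.htm", "/logo.gif", "http://other.example/"],
}
links_from, links_to = build_linkage(pages)
```

links_from answers "what does this document link to?", while links_to
answers "which documents point at this URL?"; for example,
links_to["/logo.gif"] lists both pages.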
NAME: A descriptive name
You can enter a descriptive name for this "web-tree" -- it will be
displayed at various points. If you do not specify a name, a default
name will be constructed from the URL option (see below).
Example: name=A+Sample+web_tree
(note the URL encoding of spaces as + characters)
OUTTYPE:
A space delimited list of tables to produce.
The following values can be used in any combination:
OK ) Display successfully found links
NOSITE ) Display links to unreachable sites
NOURL ) Display links to missing resources
OFFSITE ) Display links to off-site URLs
EXCLUDED ) Display links to excluded URLs (as specified in the EXCLUSION_LIST)
ALL ) Display all links
Examples: OUTTYPE='ALL'
OUTTYPE='OK NOURL '
RESULTS : A file containing the results of a prior call to CheckLink
(primarily for internal use by CheckLink).
Due to inappropriate refreshing by certain browsers, CheckLink
can be instructed to save its results tables to a file (see
description of use_multi). RESULTS points to one of these
files -- when included, CheckLink will just return the RESULTS
file.
Example: results="CHKS0001.HTM"
Note that these "results" files are stored in the LINKFILE_DIR
directory.
SITEONLY:
SITEONLY=0 : Query all url's
SITEONLY=1 : Query url's on starter-URL's "own site"
URL: URL=fully qualified, or relative, URL
This is the "starter-URL"
Example: url="/samples/guide.htm"
USE_MULTI:
USE_MULTI=0 : Return results in one long document
USE_MULTI=1 : Return results in two-part document; with the second
part replacing (overwriting) the first.
USE_MULTI=2 : Return results in two separate documents, the second
one being stored on the server's disk.
Note that if an older browser (that does not support
connection:maintain) is used, then USE_MULTI is set to 2.
The primary reason for USE_MULTI=2 is to work around the "over-
refreshing" bugs of certain browsers.
Note that when USE_MULTI=2 is used, the RESULTS option is
used internally by CHEKLINK to provide a link to the second
document. This document, which will be assigned a random name,
will be stored in the LINKFILE_DIR directory.
-------------------
IV.b. CHEKLNK2 options
CHEKLNK2 is used to examine and traverse a web tree. Typically, you would not
code a request to CHEKLNK2 -- you'd use links to CHEKLNK2 in the table
produced by CHEKLINK. In addition, CHEKLNK2 includes numerous links
back into CHEKLNK2, links that utilize the options listed below.
That is, CHEKLNK2 is somewhat of a self-contained program.
It is NOT expected that CHEKLNK2 will be explicitly used
by most authors.
Therefore -- the following description will be rudimentary.
Note that CHEKLNK2 can be called as an SRE-http addon, or as a CGI-BIN
script (but not as a standalone program).
Options:
LINKFILE -- Same definition as above -- the linkage file (relative to the
LINKFILE_DIR directory) that was created by a request to CHEKLINK.
ENTRYNUM -- pointer to an entry in the LINKFILE -- this entry corresponds to a
unique URL; CHEKLNK2 will display links to and from this
unique URL.
Example:entrynum=12
If entrynum=0, an alphabetized index of all text/html documents
(in the web-tree) will be displayed.
ISIMG -- Select between image & anchor links. Setting isimg=1 means to
use "image" links; otherwise, use "anchor" links. Note that
the combination of ENTRYNUM and ISIMG dictates which URL will
be examined.
Example: entrynum=15&isimg=1
VIA -- Information on what location in the web-tree (which URL) was
being examined prior to jumping here.
LIST -- Enable "traverse web tree mode". LIST can take the following
values:
LIST=0 (the default; used if LIST is not specified).
Display a "synopsis" of the URL. This synopsis includes
basic information (such as the size and mime type),
and a list of URLs (in the web tree) that refer to
text/html documents that contain links to this URL (the
entrynum URL). In addition, if this (the entrynum) URL
is a text/html document, a table of all links (images and
anchors) will be displayed
LIST=1
Display an (alphabetized) list of links to all text/html
documents pointed to by links in the "entrynum URL" (more
precisely, by the text/html document pointed to by the
entrynum URL).
LIST=2
Similar to LIST=1, but display text/html documents that
point TO the "entrynum URL" (LIST=2 is the reverse
of LIST=1)
LIST=3
Display an alphabetized table of ALL urls contained in
web-tree. In contrast, using LIST=0 and ENTRYNUM=0 will
generate a list of "on-site, text/html documents".
Example: LIST=1&entrynum=5
MIME -- A space delimited list of mimetypes, possibly containing
wildcards.
MIME is only used when LIST=3. When you specify MIME, then
only URLs with a mimetype matching (one of) the elements of
the MIME value will be used.
Examples: LIST=3&MIME=text/plain
LIST=3&MIME=image/*
LIST=3&MIME=application/pdf+application/x-pdf
(note use of + as a url encoded space)
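How a MIME value is decoded and matched can be sketched in Python (an
illustration, not CheckLink's own code). select_by_mime() and the sample
entries are hypothetical, matching every entry when MIME is absent is an
assumption, and fnmatch also gives ? and [ ] special meaning, which
CheckLink may not.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qs

def select_by_mime(query_string, entries):
    """entries: {url: mimetype}; return URLs whose mimetype matches MIME."""
    options = parse_qs(query_string)                 # decodes + to a space
    patterns = options.get("MIME", ["*"])[0].lower().split()
    return sorted(url for url, mimetype in entries.items()
                  if any(fnmatchcase(mimetype.lower(), p) for p in patterns))

entries = {
    "/a.pdf": "application/pdf",
    "/b.gif": "image/gif",
    "/c.jpg": "image/jpeg",
    "/d.htm": "text/html",
}
images = select_by_mime("LIST=3&MIME=image/*", entries)
pdfs = select_by_mime("LIST=3&MIME=application/pdf+application/x-pdf", entries)
```

Here images is ['/b.gif', '/c.jpg'] and pdfs is ['/a.pdf'].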
Special Note:
If you include an * in the LINKFILE value, CHEKLNK2 will
produce a short list of currently available linkage files, and let you
choose one to examine. The choice uses normal file matching rules.
For example /CHEKLNK2?linkfile='CHK*' may yield CHK01, CHKNOW, and CHK_C.
-------------------
IV.c. CHEKINDX
CHEKINDX is used to create a hierarchical index of your web-tree. By
hierarchical index, we mean the sort of index we are all familiar with
-- a highly indented list, with more "subsidiary" resources on more indented
lines. Basically, the notion is to use CHEKINDX to create a "web index" that
you can post on your site (usually with suitable prettifications).
Note that CHEKINDX uses nested "unordered lists" (<UL> constructs) to display
the hierarchical index; hence the output should be viewable by all browsers.
As noted in section I.a, the term "web-tree" is something of a misnomer; and
construction of such a "hierarchical index" is not a cut and dried
affair. That is, given the multiplicity of cross-cutting links, there is
no single hierarchical representation of these "web-trees".
Therefore, CHEKINDX uses a simple heuristic: given a specified "root-url"
(which may, or may not, be the starter-URL), CHEKINDX will determine the
position in the hierarchy as a function of distance to the root-url.
Basically, the following rules are used:
Level 1 (starting closest to the left margin):
The root-url. There is only one "level 1" row (it's the top row).
Level 2: (second closest to the left margin):
All URLs contained in the "root-URL" (that is, contained in the text/html
document pointed to by the "root-URL").
Level 3:
All URLs contained in a level 2 URL.
Level 4, 5, etc are defined similarly. Note that level 3 lists appear directly
after the appropriate level 2 URL, and so forth.
For example:
   Root-URL
      2A
      2B
         3B.i
         3B.ii
         3B.iii
      2C
         3C.i
         3C.ii
            4C.ii.x
            4C.ii.xx
         3C.iii
The above heuristic contains a key rule:
* Once listed, a URL can never appear in a "higher level". That is,
3C.i can NOT list 2A.
This rule can be applied at various levels of stringency. For example, you
could allow "ties" to be displayed multiple times, or you could only allow
"one listing" per URL.
Controlling this stringency, as well as otherwise influencing the scope of the
listing, is controlled by the CHEKINDX options.
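One way to picture the strictest form of this heuristic (each URL listed
once, with level 2 entries reserved first) is the Python sketch below. It is
a guess at the general shape of the procedure, not CHEKINDX's actual code;
build_index() and the sample links are hypothetical.

```python
def build_index(links, root):
    """links: {url: [urls it contains]} -> list of (level, url) rows."""
    level2 = [u for u in dict.fromkeys(links.get(root, [])) if u != root]
    seen = {root, *level2}           # level 2 entries are reserved first
    rows = [(1, root)]

    def expand(url, level):
        for child in dict.fromkeys(links.get(url, [])):
            if child in seen:
                continue             # once listed, a URL is not listed again
            seen.add(child)
            rows.append((level, child))
            expand(child, level + 1)

    for url in level2:
        rows.append((2, url))
        expand(url, 3)               # level 3 rows follow their level 2 parent
    return rows

links = {
    "/": ["/a", "/b", "/c"],
    "/b": ["/b1", "/c", "/"],
    "/b1": ["/deep"],
    "/c": ["/c1", "/b1"],
}
rows = build_index(links, "/")
for level, url in rows:
    print("   " * (level - 1) + url)
```

Note that /c stays at level 2 even though /b links to it, and /b1 is listed
only under /b, where it was first found.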
-------------------
IV.c.ii. CHEKINDX options.
Options:
CLEANUP : Used to remove "earlier, higher level references"
CLEANUP=1 signals CHEKINDX to remove "higher level" references
that preceded lower level references. In the example used in the
description of MULTI (below), setting CLEANUP=1 with MULTI=1 would cause the
earlier "level 5" reference to be removed from the index.
CLEANUP has no effect when used with MULTI=0.
When used with MULTI=1, then only the first (of several
possible ties) is displayed. That is, MULTI=1 and CLEANUP=1
invokes a "use first occurrence of lowest level" rule.
When used with MULTI=2, all ties are displayed -- "use all
occurrences of the best level".
Note that CLEANUP requires an extra iteration, hence
requires more processing time.
Example: CLEANUP=1
By default, CLEANUP=0 (cleanup is not attempted).
DESCRIP: Write descriptions (if available)
DESCRIP=1 : Write descriptions (under the title-link), if available
DESCRIP=0 : Do not write descriptions
DROP: Space delimited list of (possibly wildcarded) selectors to drop
URL's with a selector portion that matches one of the
items in DROP will not be displayed in the index.
However, links within "dropped" selectors may be displayed!
Thus, you should coordinate DROP with EXCLUDE.
Examples: DROP=*SAMPLES/*FILELIST.HTM
DROP=*/IND*.HTM+*/MAP*.HTM
(note use of + as a URL-encoded space)
By default, DROP='' (nothing is dropped)
EXCLUDE: Space delimited list of (possibly wildcarded) selectors to
"not expand".
URL's with a selector portion that matches one of the
items in EXCLUDE will be included in the index, but will
not be "expanded". That is, the "links" associated with
an EXCLUDEd selector are not used. Contrast this with DROP,
which drops display of the selector, but (possibly) retains
URL's with links that appear within the document the selector refers to.
Examples: EXCLUDE=FILELIST.HTM
EXCLUDE=*/SITEMAP.HTM+*/INDICE.HTM
(note use of + as a URL-encoded space)
The primary use of EXCLUDE is to prevent some other "file listing"
from being placed at a low level and "capturing" the bulk of
the URLs. Such an occurrence may distort the true relationship
between URLs.
By default, EXCLUDE='' (nothing is excluded)
HEADER: Optional header to display at top of index.
If not specified, the servername will be displayed.
Example: Header='This+is+OUR+Site'
LINKFILE: As defined above (filename only, no path).
LINKFILE is the only required parameter (note that the
LINKFILE=* shortcut is NOT supported by CHEKINDX).
MIME: Space delimited list of mimetypes (possibly wildcarded) of
URLs to include in the index.
More precisely: The mimetype of the resource (that is pointed
to by URLs in the web-tree) is compared to the list of
mimetypes in the MIME option. If no match occurs, the
URL is NOT included in the index.
Examples: MIME=text/*
MIME=image/jpeg+image/gif
(note use of + for URL-encoding)
MIME=application/pdf
By default, MIME=text/html.
MULTI: Used to control the "stringency" of display.
As mentioned above, it is likely that URLs will be referred
to by several other "URLs" (that is, by html documents pointed
to by several other URLs). To prevent infinite recursion,
the basic rule is to:
"never include a URL if it's already been included at a lower level"
MULTI controls the other cases:
MULTI=0 -- only one reference to a URL per index. Thus,
if the first reference is at "level 5", and
a "level 3" reference is found later, the
"level 3" reference will NOT be displayed.
Note that "level 2" references are ALWAYS
displayed -- they are checked first.
MULTI=1 -- if later references are strictly lower, then
also display them. Thus, the level 3 reference
mentioned above would be displayed (along with
the level 5 reference).
MULTI=2 -- Similar to MULTI=1, but ties are also displayed
(thus, a second level 5 reference would be
displayed if MULTI=2, but not if MULTI=1)
Example: MULTI=2
By default, MULTI=0
PIX. : Stem variable pointing to mime-type specific images.
PIX. is a stem variable that points to small
.GIF icons that will be displayed next to the title
(or selector) of each entry in the index.
The syntax is:
PIX.0=number of entries
PIX.n="mime/type selector"
where
n: 1.. pix.0
mime/type can include * as a wildcard
selector is a selector
PIX.!INCLUDE=text to include in IMG element
For example:
pix.0=3
pix.1='text/plain /imgs/text.gif '
pix.2='image/* /imgs/image.gif '
pix.3='text/html '
pix.!include=' height=18 width=18 ALT="*" align="center" '
Note that when there is no "selector", no icon is
drawn. Also, the first (of several possible) matches is used.
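The "first match wins, no selector means no icon" rule can be sketched in
Python (an illustration only, not CHEKINDX's code). The PIX list mirrors the
pix.1 to pix.3 example above, and icon_for() is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Mirrors the pix.1 .. pix.3 example entries.
PIX = [
    "text/plain /imgs/text.gif ",
    "image/* /imgs/image.gif ",
    "text/html ",
]

def icon_for(mimetype, pix=PIX):
    """Return the icon selector for a mimetype, or None if no icon is drawn."""
    for entry in pix:
        fields = entry.split()
        if fields and fnmatchcase(mimetype, fields[0]):
            # First match wins; an entry without a selector draws no icon.
            return fields[1] if len(fields) > 1 else None
    return None

icons = {m: icon_for(m)
         for m in ("text/plain", "image/png", "text/html", "application/pdf")}
```

In this sketch text/plain and image/png get icons, while text/html (no
selector) and application/pdf (no match) get none.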
SITEONLY: Only include URLs on the "starter-URL's" site.
If SITEONLY=1, then URLs that point off-site will
not be included in the index.
If SITEONLY=0, then all URLs may be included in the index.
Note that "off-site" URLs will NEVER reference other links --
for purposes of the web-tree, they are all "leafs".
By default SITEONLY=1 (off-site URLs are excluded)
TYPE: Display type
There are three types of display:
TYPE=1 : Use an Unordered List (<UL>)
TYPE=2 : Use a table (<TABLE>)
TYPE=3 : Return an "editable" document. You can use this to delete,
change or move various records and fields (see section
IV.c.iii for details).
Note: if you select TYPE=2, you might want to play with the various TABLE_, TR_
and TD_ parameters in CHEKINDX.CMD.
URL: The "root-URL".
Actually, it's the "root selector" -- you don't need to specify the
http://a.b.c/ portion. CHEKINDX will use this "root-URL" as
the "level 1" of the hierarchical index.
Example: URL=/samples/index.htm
By default, the "starter-URL" of the LINKFILE is used.
-------------------
IV.c.iii. CHEKINDX edit mode.
In many cases, the hierarchical index created by CheckLink could use
editing. You may want to remove uninteresting links, change the
indentation levels, modify uninformative descriptions, or even move
index entries around. To facilitate such actions, you can invoke
the "edit" mode of CHEKINDX (see the description of the TYPE option above).
In "edit" mode, an HTML form listing each entry, along with several options
per entry, will be listed. With this form you can:
* Remove entries
* Move entries
* Change an entry's indentation level
* Modify the "title" of the entry
* Modify the "description" of the entry.
After making these changes, you can then create a <UL> or <TABLE> index;
or, you can re-edit the index (and make additional changes).
Notes:
* These edits do NOT affect the "link file" (from which the index is first
generated).
* Edit mode is NOT available when CHEKINDX is run as a cgi-bin script. It
is only available if you are running this CheckLink package as an SRE-http
addon.
* You can re-edit several times, until you like what you see; and then you
can finalize the index as a <UL> or a TABLE.
* After finalizing, you should save the index to an HTML document (you might
then further modify it with your favorite text editor).
* A <UL> version of the index (reflecting current changes) is written on the
bottom portion of the customization page.
-------------------
V. Notes:
* CHEKLINK looks for a few kinds of "image" links, and several kinds
of "anchor" links:
Image Links:
<IMG src="xxx">
<BODY background="XXX">
Anchor Links
<A Href="XXX">
<AREA Href="xxx">
<FRAME src="XXX">
<EMBED src="XXX">
<LINK href="xxx">
<APPLET code="xxx" codebase="http://x.x.x/yy" >
<OBJECT codebase="xxx">
Note that tags in comments (between <!-- and -->) are NOT processed.
If there is some tag I've left out, and support for it would greatly
enhance CheckLink, please contact me (danielh@econ.ag.gov).
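For readers who want to experiment, the tag list above can be approximated
with Python's standard html.parser module. This is a sketch, not CHEKLINK's
own parser: LinkFinder is a hypothetical class, and for APPLET only the code
attribute is collected (codebase is ignored) to keep the example short.

```python
from html.parser import HTMLParser

IMAGE_ATTRS = {"img": "src", "body": "background"}
ANCHOR_ATTRS = {"a": "href", "area": "href", "frame": "src", "embed": "src",
                "link": "href", "applet": "code", "object": "codebase"}

class LinkFinder(HTMLParser):
    """Collect image and anchor links; tags inside comments are never seen."""

    def __init__(self):
        super().__init__()
        self.images, self.anchors = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)      # tag and attribute names arrive lowercased
        for table, found in ((IMAGE_ATTRS, self.images),
                             (ANCHOR_ATTRS, self.anchors)):
            value = attrs.get(table.get(tag))
            if value:
                found.append(value)

finder = LinkFinder()
finder.feed('<BODY background="bg.gif"><A Href="a.htm">A</A>'
            '<!-- <A Href="hidden.htm">old</A> --><IMG src="p.gif"></BODY>')
```

After feeding this fragment, finder.images holds bg.gif and p.gif, and
finder.anchors holds only a.htm, since the commented-out anchor is skipped.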
? The major difference between IMG and ANCHOR links is that IMG links are
never "read" (they are only queried). Should APPLET or OBJECT be
treated as images?
* A possibility (given enough interest): A graphical web-mapper component
for CheckLink.
* To display some of the run-time status information, you'll need
PMPRINTF.EXE (http://www2.hursley.ibm.com/goserve).
* Sample speeds of CHEKLINK (on a Pentium 100 over a 16/4M Token Ring
LAN based Intranet, with a T1 line to the outside world):
1 GET per second (of text/html URLs, average size of 20k).
8 HEADs per second (requests for basic information)
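These sample rates allow a back-of-the-envelope run time estimate. The Python
sketch below assumes that queries and reads do not overlap and that the
sample rates apply to your own network, neither of which is guaranteed;
estimated_seconds() is a hypothetical helper.

```python
HEADS_PER_SECOND = 8   # sample rate quoted above
GETS_PER_SECOND = 1    # sample rate quoted above

def estimated_seconds(n_queried, n_read):
    """Rough estimate, assuming queries and reads do not overlap."""
    return n_queried / HEADS_PER_SECOND + n_read / GETS_PER_SECOND

seconds = estimated_seconds(n_queried=1000, n_read=100)
```

For 1000 queried links, of which 100 are html documents that must be read,
this gives 225 seconds (a bit under four minutes).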
-------------------
VI. Disclaimer
Copyright 1997,1998 by Daniel Hellerstein.
Permission to use this program for any purpose is hereby granted
without fee, provided that the author's name not be used in
advertising or publicity pertaining to distribution of the software
without specific written prior permission.
This includes the right to subset and reuse the code, with proper attribution;
and with the following understanding:
We, the authors of CheckLink and any potentially affiliated institutions,
disclaim any and all liability for damages due to the use, misuse, or
failure of the product or subsets of the product.
Furthermore you may also charge a reasonable re-distribution fee for
CheckLink; with the understanding that this does not remove the
work from the public domain and that the above proviso remains in effect.
THIS SOFTWARE PACKAGE IS PROVIDED "AS IS" WITHOUT EXPRESS
OR IMPLIED WARRANTY.
THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE PACKAGE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL THE AUTHOR (Daniel Hellerstein) OR ANY PERSON OR
INSTITUTION ASSOCIATED WITH THIS PRODUCT BE LIABLE FOR ANY
SPECIAL,INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
OF CONTRACT,NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR
IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE PACKAGE.
SRE-http was developed on the personal time of Daniel Hellerstein,
and is not supported, approved, or in any way an official product
of my employer (USDA/ERS).