OS/2 Shareware BBS: 35 Internet / 35-Internet.zip / sslurp20.zip / sslurp.HLP (.txt)
OS/2 Help File | 1999-03-14 | 13KB | 476 lines
═══ 1. Sslurp! ═══
Sslurp! 2.0
Sslurp! can retrieve Web pages from an HTTP (WWW) server. It can be configured
to follow all hyperlinks on the page that lead to other pages on the same
server. Images on the pages can be retrieved as well. All pages are stored on
disk and can be viewed later using your web browser.
Sslurp! contains a small non-caching, filtering proxy server that can be used
to view downloaded pages as well as for filtered WWW access.
Sslurp! can make use of a proxy HTTP server, speeding up the whole procedure.
Sslurp! requires at least one HPFS partition!
Topics:
The main window
Common tasks
The proxy server
Command line options
For the techies
Contacting the author
═══ 1.1. The main window ═══
On the main window you find the following elements:
A drop-down list where you enter the URL. The last 15 URLs are saved.
You can quickly enter a URL here by dragging a URL object from a WPS
folder to this entry field.
"Start", "Stop" and "Skip" buttons.
A list of processed and pending URLs. For processed URLs, a status
message is displayed. The list is cleared when starting with a new URL.
A status line. Its contents are:
- the current URL
- total number of data bytes retrieved
- total number of data bytes of the current URL
- number of bytes retrieved of the current URL
- number of URLs retrieved
- number of URLs tried
- number of pending URLs.
═══ 1.2. Common tasks ═══
Here's how to perform some common tasks with Sslurp!:
I want to download a complete web site.
In the setup, set "Links" to "same server, all types". Enable "Inline images"
on the "Options" page. Then enter the root URL of the site (e.g.
"http://www.thesite.com/"), then press "Start".
I want to download a subrange of a web site.
In the setup, set "Links" to "don't climb up, all types" and enable "Inline
images" on the "Options" page. Then enter the URL of the site (e.g.
"http://www.thesite.com/some/path/start.html"), then press "Start".
I want to download a single web page with images, but only if it's changed.
In the setup, set "Links" to "none". Enable "Inline images" and "Retrieve
modified items only". Then enter the URL of the page (e.g.
"http://www.thesite.com/pageofinterest.html"), then press "Start".
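The same three tasks can also be run in automated mode from the command line.
The mapping of setup settings to switches below follows the "Command line
options" section; the URLs are the placeholders from the examples above:

```
SSLURP.EXE -Ls -I+ http://www.thesite.com/
SSLURP.EXE -Ld -I+ http://www.thesite.com/some/path/start.html
SSLURP.EXE -L- -I+ -U+ http://www.thesite.com/pageofinterest.html
```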
═══ 1.3. Sslurp's proxy server ═══
Sslurp contains a simple non-caching filtering HTTP proxy server. It's
"non-caching" because it does not store items that are retrieved through it.
It's "filtering" because filter patterns can be defined that prevent unwanted
items from being downloaded. It's an "HTTP" proxy because it only handles the
HTTP protocol. It's "simple" because it's - um - simple.
The proxy works like this:
1. It accepts connections from HTTP clients, e.g. your web browser.
2. A new thread is created that handles the new connection. Sslurp does not
limit the number of threads. It's expected that the client program does
this (and web browsers usually do).
3. Sslurp checks if the requested URL matches one of the filter patterns.
When it does, an error reply is returned.
4. When the URL is not filtered Sslurp checks if the item is present in the
download area. When it is found on disk, its content is returned.
5. When the item is neither filtered nor present, Sslurp connects to the
destination server or to another proxy (depending on Sslurp's
configuration) and forwards the request and corresponding reply.
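Steps 3 to 5 above can be sketched as a single per-request decision. The
function and data structures here are illustrative, not Sslurp's actual code:

```python
# Minimal sketch of the proxy's per-request decision (filter -> local -> forward).
def handle_request(url, filter_patterns, download_area, forward):
    """Return a (status, body) pair for one proxied request.

    filter_patterns: callables returning True when a URL is filtered.
    download_area:   dict mapping URLs to locally stored content.
    forward:         callable performing the upstream request.
    """
    if any(matches(url) for matches in filter_patterns):   # step 3: filter check
        return ("403 Forbidden", b"")                      # error reply
    if url in download_area:                               # step 4: local copy?
        return ("200 OK", download_area[url])
    return forward(url)                                    # step 5: go upstream

# Example: one filtered URL, one local hit, one forwarded request.
area = {"http://host/page.html": b"<html>stored copy</html>"}
filters = [lambda u: "/ads/" in u]
fwd = lambda u: ("200 OK", b"fetched from server")

print(handle_request("http://host/ads/banner.gif", filters, area, fwd)[0])  # 403 Forbidden
print(handle_request("http://host/page.html", filters, area, fwd)[0])      # 200 OK
print(handle_request("http://host/new.html", filters, area, fwd)[1])       # b'fetched from server'
```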
Why use the proxy?
When viewing downloaded items directly with your web browser, it may have
problems finding some of the items. The reason is that links in HTML files
are often absolute instead of relative, e.g. "/dir/item" instead of
"../../dir/item". The browser resolves these links relative to the root
directory of your download drive instead of the server's sub-directory. Also,
some characters in URLs have to be converted because they are not legal in
file names. The browser can't perform these conversions; only Sslurp can.
The proxy solves these problems. When the browser requests an item through the
proxy, Sslurp knows where the item is stored and which conversions were
performed when the item was downloaded. Files are always found if present.
How to configure the proxy
The proxy listens on a certain port number for incoming connections. The
default port number is 3128. You can select a different port number with the
-N command line option.
Only one running instance of Sslurp can run the proxy server. This is usually
the first instance that was started.
To use the proxy server you have to configure your browser to make HTTP
requests through a proxy. Enter the host name of the computer running Sslurp
as the HTTP proxy server (this usually is the same computer, but may also be a
different one).
How to configure filtering
Sslurp reads the file "filter.lst" at startup. This file contains filter
patterns, one pattern per line. The file format is:
<Type><space><pattern>
<Type> specifies the pattern type and can be one of the following:
P <pattern> is a prefix pattern, i.e. any URL starting with the
specified string is filtered. Sslurp matches patterns after the
"http://" part, i.e. starting with the host name. For example,
"http://www.host.com/item" matches the prefix pattern
"www.host.com/".
S <pattern> is a suffix pattern, i.e. any URL ending with the
specified string is filtered. Sslurp matches patterns before
possible URL parameters. For example, "http://www.host.com/item"
matches the suffix pattern "/item" and
"http://www.host.com/item.cgi?param=x" matches the suffix pattern
"/item.cgi".
I <pattern> is a substring pattern, i.e. any URL containing
<pattern> is filtered.
<space> is exactly one space character.
<pattern> is a string. Patterns are case-sensitive.
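The three pattern types can be sketched as follows, assuming the matching
rules described above (prefix matched after "http://", suffix matched before
URL parameters, substring matched anywhere). This is illustrative, not
Sslurp's actual implementation:

```python
# Sketch of the P (prefix), S (suffix) and I (substring) filter pattern types.
def is_filtered(url, pattern_type, pattern):
    # Matching starts after the "http://" part, i.e. with the host name.
    rest = url[len("http://"):] if url.startswith("http://") else url
    if pattern_type == "P":                       # prefix pattern
        return rest.startswith(pattern)
    if pattern_type == "S":                       # suffix, before "?params"
        return rest.split("?", 1)[0].endswith(pattern)
    if pattern_type == "I":                       # substring anywhere
        return pattern in url
    raise ValueError("unknown pattern type: " + pattern_type)

print(is_filtered("http://www.host.com/item", "P", "www.host.com/"))          # True
print(is_filtered("http://www.host.com/item.cgi?param=x", "S", "/item.cgi"))  # True
print(is_filtered("http://www.host.com/item", "I", "banner"))                 # False
```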
═══ 1.4. Command line options ═══
Sslurp! can be run in automated mode, i.e. it takes one or more URLs as program
parameters, downloads these pages according to the program options, and exits
when finished.
The command line syntax is:
SSLURP.EXE [Options] [<url> | @<listfile>]*
In other words, you can specify
options,
one or more URLs, and
one or more list files. Each line in a list file is interpreted as a URL.
Empty lines and lines starting with ';' are ignored.
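A list-file parser following the rules above (each line a URL, empty lines
and ';' lines ignored) might look like this; the whitespace handling is an
assumption, and the function is illustrative, not Sslurp's code:

```python
# Sketch: parse a @listfile into a list of URLs.
def read_list_file(text):
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and ';' comments
            continue
        urls.append(line)
    return urls

sample = "; my download list\nhttp://www.thesite.com/\n\nhttp://other.site/page.html\n"
print(read_list_file(sample))
# ['http://www.thesite.com/', 'http://other.site/page.html']
```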
The following command line options are supported:
-T<dir> Retrieved items are stored in the given directory.
-L- No links are followed
-Ls Only links to the same server are followed
-Ld Only links that are not pointing upward are followed
-La All links are followed
-E["extensions"] Only links with one of the given file extensions are followed
-X["extensions"] Only links excluding the ones of the given file extensions
are followed
-I+ Inline images are downloaded
-I- Inline images are not downloaded
-Ia Inline images are downloaded, even those on different servers
-A+ Applets are downloaded
-A- Applets are not downloaded
-Aa Applets are downloaded, even those on different servers
-U+ Only items newer than local copies are downloaded
-U- All items are downloaded
-S<size> Restricts downloaded items to <size> bytes
-S- Downloads are not restricted by size
-D<number> Restricts followed links to <number> steps
-D- Downloads are not restricted by link depth
-P+ Uses the proxy server
-P- Does not use the proxy server
-O<file> Uses the specified file for logging
-N<number> Specifies port number for internal proxy
Note: Command line options override options given in the setup. For options
not given in the command line, the setup options are used. So if an option is
turned on in the setup, you must explicitly switch it off to deactivate it.
It's not sufficient to just omit the command line option! Stored options are
not modified by command line options.
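The override rule in the note behaves like a dictionary merge: stored setup
options act as defaults, and command-line options replace them for the current
run only. The option names below are hypothetical:

```python
# Sketch: setup options are defaults; command-line options win for this run.
setup_options = {"inline_images": True, "links": "same_server"}  # stored setup
cmdline_options = {"inline_images": False}                       # e.g. -I- given

effective = {**setup_options, **cmdline_options}  # command line overrides setup
print(effective["inline_images"])  # False: -I- was needed to turn it off
print(effective["links"])          # 'same_server' still comes from the setup
```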
When finished, Sslurp! returns one of the following ERRORLEVEL values:
0 Everything OK
1 Invalid command line option
2 Problem(s) with one of the list files
10 Other error
═══ 1.5. For the techies ═══
Here's some technical information if you're interested:
Sslurp! uses HTTP 1.0. HTTP 0.9 is not supported. If some web site is
still using an HTTP 0.9 server, its contents may be just as outdated, so
you might not miss anything. HTTP 1.1 server replies are recognized.
Sslurp! only follows HTTP links, not FTP or others.
Sslurp! regards <IMG SRC=...> and <BODY BACKGROUND=...> as inline images.
If the file name of a retrieved page isn't specified, it's stored as
INDEX.HTML.
The "Last-Modified" timestamp is stored in the file's EAs. The EA name is
HTTP.LMODIFIED and is of type EAT_ASCII.
The "Date" timestamp is stored in the file's EAs. The EA name is
HTTP.DATE and is of type EAT_ASCII.
The "Content-Type" is stored in the file's EAs. The EA name is HTTP.CTYPE
and is of type EAT_ASCII.
The "Expires" timestamp is stored in the file's EAs. The EA name is
HTTP.EXPIRES and is of type EAT_ASCII.
The URL of the retrieved item is stored in the file's .SUBJECT EA.
Some characters in the URL are converted when building the path name of
the file. However, no conversion to FAT (8.3) names is performed!
If a page is redirected, the redirection is automatically followed, but
only if the new location is on the same server!
Sslurp! has been developed on and tested with OS/2 Warp 4.0. It should
also work with the following configurations:
- Warp 3.0 with IAK
- Warp 3.0 with TCP/IP 2.0
- Warp 3.0 Connect (TCP/IP 3.0)
- Warp Server
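The URL-to-path mapping described above (host and path become directories
under the download area, a missing file name becomes INDEX.HTML, illegal
characters are converted) can be sketched as follows. Which characters are
converted, and to what, is an assumption here; Sslurp's actual rules may
differ, and since HPFS is required, no 8.3 conversion takes place:

```python
# Sketch: map a URL onto a file path in the download area.
def url_to_path(url, root="DOWNLOAD"):
    rest = url[len("http://"):]                  # host/dir/.../file
    if rest.endswith("/") or "/" not in rest:
        rest = rest.rstrip("/") + "/INDEX.HTML"  # default file name
    # Replace characters that are not legal in file names (assumed set).
    converted = "".join("_" if c in '?*:"<>|' else c for c in rest)
    return root + "/" + converted

print(url_to_path("http://www.host.com/dir/"))       # DOWNLOAD/www.host.com/dir/INDEX.HTML
print(url_to_path("http://www.host.com/a.cgi?x=1"))  # DOWNLOAD/www.host.com/a.cgi_x=1
```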
═══ 1.6. Contacting the author ═══
Sslurp! was developed by Michael Hohner. He can be reached electronically at:
EMail: miho@n-online.de
Fidonet: 2:2490/1050.17
═══ 2. File menu ═══
Exit
Ends the program.
═══ 3. Setup ═══
Options
Specify all program options.
Servers
Set up server-specific options, e.g. authentication.
═══ 3.1. Links ═══
none
No links are followed
all
All links (even those to other servers) are followed. Be very
careful with this option!
same server
Only links to items on the same server are followed.
don't climb up
Hyperlinks to items that are hierarchically higher than the initial
URL are not followed. Otherwise, all links to items on the same
server are followed.
Example:
If you started with http://some.site/dir1/index.html, and the
current page is http://some.site/dir1/more/levels/abc.html, a link
that points to http://some.site/otherdir/index.html wouldn't be
followed, but a link to http://some.site/dir1/x/index.html would.
all types
All types of links are followed, restricted only by the above
settings.
including
You can enter a set of extensions (separated by spaces, commas or
semicolons) of items to retrieve. Links to items with other
extensions are ignored.
Example: With "htm html", Sslurp! only follows links to other HTML
pages, but does not download other hyperlinked files.
excluding
Reverse of the above option. Only links to items not having one of
the given extensions are followed.
Max link depth
Limits the depth of links to follow to the specified number. A level
of "1" specifies the initial page.
Example:
If page A contains a link to B, and B contains a link to C, A would
be level 1, B would be level 2 and C would be level 3. A maximum
link depth of "2" would retrieve pages A and B, but not C.
Max size
Limits the size of items to download. If the server announces the
size and it's larger than the number specified, the item is skipped.
If the server doesn't announce the size, the item is truncated when
the maximum size is reached.
Retries
If set to >0, retries failed downloads up to the specified number of
times. If set to 0, every URL is only downloaded once.
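The "don't climb up" rule can be sketched using the example above: a link is
followed only when it lies at or below the directory of the initial URL. The
function names are illustrative only:

```python
# Sketch of the "don't climb up" decision.
def base_dir(url):
    return url.rsplit("/", 1)[0] + "/"  # strip the file name from the URL

def may_follow(link, initial_url):
    # Follow only links at or below the initial URL's directory.
    return link.startswith(base_dir(initial_url))

start = "http://some.site/dir1/index.html"
print(may_follow("http://some.site/otherdir/index.html", start))  # False
print(may_follow("http://some.site/dir1/x/index.html", start))    # True
```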
═══ 3.2. Options ═══
These settings influence which items will be downloaded and how it'll be done.
Inline images
If checked, inline images are also retrieved.
from other servers
If checked, inline images located on other servers are also
retrieved. Otherwise only images from the same server are
downloaded.
Java applets
If checked, Java applets are also retrieved.
from other servers
If checked, applets located on other servers are also retrieved.
Otherwise only applets from the same server are downloaded.
Retrieve modified items only
An item is only retrieved if it's newer than the local copy.
Strongly recommended!
═══ 3.3. General ═══
Proxy
Enter the host name of a proxy HTTP server. You may also specify a
port number for the proxy server. Check "Enable" to actually use the
server. Contact your service provider for this data.
Note: Enter only the host name, not a URL (e.g. "proxy.isp.com",
not "http://proxy.isp.com:1234/")!
User name
Enter your user ID here if your proxy server requires
authentication.
Password
Password for proxy authentication.
Email address
Enter your EMail address. It is included in every request. Don't
enter anything here if you don't want your EMail address to be
revealed.
═══ 3.4. Paths ═══
Path for retrieved data
Path where retrieved pages and images are stored. This path and
subpaths are created automatically.
═══ 3.5. Logging ═══
These options control logging.
Log file
Path and name of the log file
Additional information
Log additional (but somewhat optional) messages
Server replies
Log all lines in the server's reply
Debug messages
Log messages used for debugging purposes (turn on if requested).
═══ 3.6. Server list ═══
A list of base URLs is displayed.
Press New to add a new URL with settings.
Press Change to change the settings of the selected URL.
Press Delete to delete the selected URL.
═══ 3.7. Server ═══
Base URL
Set of URLs (this item and all items hierarchically below) to which
these settings apply. This usually specifies a directory on a
server.
Example:
If you enter "http://some.server/basedir/", these settings apply to
"http://some.server/basedir/page1.html", but not to
"http://some.server/otherdir/b.html".
User name
User name or user ID used for basic authorization.
Password
Password used for basic authorization.
═══ 4. Help menu ═══
General help
Provides general help
Product information
Displays name, version number, copyright information etc.
═══ 5. About ═══
This page intentionally left blank.