WWWOFFLE - World Wide Web Offline Explorer - Version 2.1
========================================================
The WWWOFFLE programs simplify World Wide Web browsing from computers that use
intermittent (dial-up) connections to the internet.
Description
-----------
The wwwoffled program is a simple proxy server with special features for use
with dial-up internet links. This means that it is possible to browse web pages
and read them without having to remain connected.
While Online
- Caching of pages that are viewed for review later.
- Conditional fetching to only get pages that have changed.
While Offline
- The ability to follow links and mark other pages for download.
- Browser or command line interface to select pages for downloading.
- Optional info on bottom of pages showing cached date and allowing refresh.
- Works with pages containing forms.
- Can be configured to use dial-on-demand for pages that are not cached.
Automated Download
- Downloading of specified pages non-interactively.
- Can automatically fetch inlined images in pages fetched this way.
- Can automatically fetch contents of all frames on pages fetched this way.
- Automatically follows links for pages that have been moved.
- Can monitor pages at regular intervals to fetch those that have changed.
- Makes backup copies of cached pages so server errors don't overwrite them.
Provides
- Caching of web pages (http), ftp sites and finger command.
- An introductory page with information and links to the built-in pages.
- Multiple indices of pages stored in cache for easy selection.
- Interactive or command line control of online/offline status.
- User selectable purging of pages from cache based on hostname.
- Interactive or command line option to fetch pages and links recursively.
- Interactive web page to allow editing of the configuration file.
General
- Can be used with one or more external proxies based on hostname.
- Automates proxy authentication for external proxies that require it.
- Configurable to still allow use on intranets while offline.
- Can be configured to block or not cache URLs based on file type or host.
- Can censor outgoing HTTP headers to maintain user privacy.
- All options controlled using a simple configuration file.
- Optional password control for management functions.
Configuring A Web Browser
-------------------------
To use the wwwoffle programs, your web browser must be set up to use wwwoffled
as a proxy. The proxy hostname will be 'localhost' (or the name of the host
that wwwoffled is running on), and the port number will be the one that is used
by wwwoffled (default 8080).
Netscape V1:
        In the Options->Preferences dialog window, enter localhost as the http
        and ftp proxies and 8080 as the port numbers.
Netscape V3:
        In the Options->Preferences dialog window under the Proxies tab select
        the Manual Proxy Configuration option and enter localhost as the http
        and ftp proxies and 8080 as the port numbers.
Mosaic V2.6, Lynx, Arena, Emacs-W3:
        Set the environment variables http_proxy and ftp_proxy to
        http://localhost:8080/
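As a rough illustration of the same proxy settings used from a script rather
than a browser, the following Python sketch builds a urllib opener that sends
http and ftp requests to the default proxy address. The helper name
make_proxy_opener is hypothetical and is not part of wwwoffle; the address
assumes the default port of 8080.

```python
import urllib.request

# Default wwwoffled proxy address from this README; adjust host/port as needed.
WWWOFFLE_PROXY_URL = "http://localhost:8080/"


def make_proxy_opener(proxy_url=WWWOFFLE_PROXY_URL):
    """Return (opener, proxies) routing http and ftp requests via the proxy.

    Hypothetical helper for illustration; not part of wwwoffle.
    """
    proxies = {"http": proxy_url, "ftp": proxy_url}
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), proxies


if __name__ == "__main__":
    opener, proxies = make_proxy_opener()
    print(proxies)
    # With wwwoffled running, opener.open("http://example.com/") would go
    # through the proxy; that call is left out so the sketch runs anywhere.
```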
You will also need to disable the caching that the web browser performs itself
between sessions to get the best out of the program.
Depending on which browser you use and which version, it is possible to request
pages to be refreshed while offline. This is done using the 'reload' or
'refresh' button or key on the browser. On many browsers there are two ways of
doing this; the one that forces the proxy to reload the page is the one that
will cause the page to be refreshed.
The latest browser compatibility information is available at:
http://www.gedanken.demon.co.uk/wwwoffle/version-2.0/browser.html
Welcome Page
------------
There is a welcome page at URL 'http://localhost:8080/' that gives a very brief
description of the program and has links to the index pages, interactive control
page and the wwwoffle internet home pages.
The most important places to get information about wwwoffle are the wwwoffle
homepage 'http://www.gedanken.demon.co.uk/wwwoffle/index.html', which has
information about wwwoffle in general, and (even better) the wwwoffle
version-2.0 user page
'http://www.gedanken.demon.co.uk/wwwoffle/Version-2.0/user.html', which has
more information about this version of wwwoffle.
Index Of Cached Files
---------------------
To get the index of cached files, use the URL 'http://localhost:8080/index/'.
There are sufficient links on each of the index pages to allow easy navigation.
The indexes provide several levels of information:
        A list of the requests in the outgoing directory.
        A list of the files fetched the last time that the program was online.
        A list of the files that are being monitored.
        A list of the most recently fetched files.
        A list of all hosts for each of the protocols (http, ftp etc.).
        A list of all of the files on a particular host.
These indexes can be sorted in a number of ways:
        No sorting.
        By time of last modification (update).
        By time of last access.
        By date of last update with markers for each day.
        Alphabetically.
        By file extension.
For each of the pages that are cached there are options to delete the page,
refresh it, select the interactive refresh page with the URL already filled in
or add the page to the list that is monitored regularly.
Interactive Refresh Page
------------------------
Pages can be specified by using whatever method the browser provides or,
alternatively, by using the interactive refresh page. This allows
the user to enter a URL and then fetch it if it is not currently cached or
refresh it if it is in the cache. There is also the option here to recursively
fetch the pages that are linked to by the page that is specified. This
recursive fetching can be limited to pages from the same host, narrowed down to
links in the same directory (or subdirectory) or widened to fetch pages from any
web server. This functionality is also provided in the 'wwwoffle' command line
program.
Monitoring Web-Pages
--------------------
Pages can be specified that are to be checked at regular intervals. This can
either be every time that wwwoffle is online or every few days (user
specifiable). This parameter applies to all such requests and works as follows
(with <n> as the interval): if the page has not been monitored in the last <n>
days then it is fetched the next time that you go online, or if the date it was
added to the list is a multiple of <n> then it is fetched next time. This means
that if you don't go online for a week then the first time you will get all of
the pages, but from then on it won't fetch all pages on the same day; they are
spread out according to the date they were added to the list.
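The rule above can be read in more than one way. The Python sketch below shows
one possible reading, in which "the date it was added is a multiple of <n>" is
taken to mean that the number of days since the page was added is a multiple of
<n>. The function name and the treatment of an interval of 0 as "every time
online" are assumptions made for illustration, not the wwwoffle code.

```python
from datetime import date


def should_fetch(today, added, last_monitored, interval_days):
    """Decide whether a monitored page is due when the system goes online.

    One reading of the rule in this README, not the actual wwwoffle code:
      - fetch if the page has not been monitored in the last <n> days, or
      - fetch if the number of days since it was added is a multiple of <n>.
    An interval of 0 is treated here as "every time online" (an assumption).
    """
    if interval_days <= 0:
        return True
    if (today - last_monitored).days >= interval_days:
        return True
    return (today - added).days % interval_days == 0


if __name__ == "__main__":
    added = date(1998, 3, 1)
    # Monitored yesterday and not on a multiple of the interval.
    print(should_fetch(date(1998, 3, 2), added, date(1998, 3, 1), 3))
    # Offline for more than a week, so the page is overdue.
    print(should_fetch(date(1998, 3, 10), added, date(1998, 3, 1), 3))
```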
Interactive Control Page
------------------------
The behaviour and mode of operation of the wwwoffle demon can be controlled from
an interactive control page at 'http://localhost:8080/control/'. This has a
number of buttons that change the mode of the proxy server. These provide the
same functionality as the 'wwwoffle' command line program. To provide security,
this page can be password protected. There is also the facility to delete pages
from the cache or from the spooled outgoing requests directory.
Interactive Configuration File Editing Page
-------------------------------------------
The interactive configuration file editing page allows the configuration file
wwwoffle.conf to be edited. This facility can be reached via the control page
'http://localhost:8080/control/'. Each section in the configuration file has a
separate dialog box that allows the contents of the section to be changed. The
comments from the configuration file are displayed in the page so that the
description of the possible values in the different sections can be consulted.
When the contents of the sections have been updated, the configuration file can
be re-read by selecting the link at the bottom of the page.
Deleting Requests
-----------------
If no password is used for the control pages then it is possible for anybody to
delete requests that are recorded. If a password is assigned then users that
know this password can delete any request (or cached file or other thing).
Individual users that do not know the password can delete pages that they have
requested, provided that they do it as soon as the "Will Get" page appears; the
"Cancel" button on that page has a once-only password that will delete that
request.
Backup Copies of Pages
----------------------
When a page is fetched while online, a remote server error would overwrite any
existing cached copy of the web page. In this case a backup copy of the page is
made, so that once the error message has been read while offline the backup
copy is placed back into the cache. This is automatic for all files that have
remote server errors (and that do not use external proxies); no user
intervention is required.
Spool Directory Layout
----------------------
In the spool directory there is a directory for each of the network protocols
that are handled. In this directory there is a directory for each hostname that
has been contacted and has pages cached. These directories have the name of the
host. In each of these directories, there is an entry for each of the pages
that are cached, generated using a hashing function to give a constant length.
The entry consists of two files, one prefixed with 'D' that contains the data
and one prefixed with 'U' that contains the URL.
The outgoing directory is a single directory that contains all of the pending
requests. The format is the same, with two files for each entry, but using an
'O' prefix instead of 'D' for the file containing the request, alongside the
file prefixed with 'U' that contains the URL.
The lasttime directory is a single directory that contains an entry for each of
the files that were fetched the last time that the program was online. Each
entry consists of two files, one prefixed with 'D' that is an empty placeholder
and one prefixed with 'U' that contains the URL.
The monitor directory is a single directory that contains all of the regularly
monitored requests. The format is the same as the outgoing directory, with two
files for each entry using the 'O' and 'U' prefixes.
If there is a symbolic link created pointing to one of the directories then all
references to the link will be replaced by references to the directory.
(e.g. If foo.com is a symbolic link to foo-mirror.co.uk then web page links to
http://foo.com/path will be replaced by links to http://foo-mirror.co.uk/path).
This means that local mirrors can be used where possible and sites with multiple
names can share a single directory.
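To make the layout above concrete, here is a small Python sketch that derives
the pair of file names for a cached URL. This README does not say which hashing
function is used, so MD5 in hexadecimal is a stand-in chosen only for
illustration; real cache file names will differ. The spool path
/var/spool/wwwoffle and the helper name spool_entry are also assumptions.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def spool_entry(spool_dir, url):
    """Return the (data, url) file paths for a cached page.

    The README only says the name comes from "a hashing function"; MD5 in
    hex is used here purely as a stand-in, so real file names will differ.
    """
    parts = urlsplit(url)
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # stand-in hash
    # <spool>/<protocol>/<hostname>/ as described in the text above.
    host_dir = PurePosixPath(spool_dir) / parts.scheme / parts.hostname
    return host_dir / ("D" + digest), host_dir / ("U" + digest)


if __name__ == "__main__":
    # The spool directory path is an assumed example location.
    data_file, url_file = spool_entry("/var/spool/wwwoffle", "http://foo.com/path")
    print(data_file)
    print(url_file)
```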
The Programs and Configuration File
-----------------------------------
There are two programs that make up this utility, with three distinct functions.
wwwoffle - A program to interact with and control the HTTP proxy demon.
wwwoffled - A demon process that acts as an HTTP proxy.
wwwoffles - A server that actually does the fetching of the web pages.
The wwwoffles function is combined with the wwwoffled function into the
wwwoffled program from version 1.1 onwards. This is to simplify the procedure
of starting servers, and allow for future improvements.
The configuration file, called wwwoffle.conf by default, contains all of the
parameters that are used to control the way the wwwoffled and wwwoffles
functions work.
WWWOFFLE - User control program
-------------------------------
The control program (wwwoffle) is used to control the action of the demon
program (wwwoffled), or to request pages that are not in the cache.
The demon program needs to know if the system is online or offline, when to
fetch the pages that have been previously requested and when to purge the cache
of old pages.
The first mode of operation is for controlling the demon process. These are the
functions that are also available on the interactive control page (except kill).
wwwoffle -online        Indicates to the demon that the system is online.
wwwoffle -autodial      Indicates to the demon that the system is in autodial
                        mode; this uses cached pages if they exist and uses
                        the network as a last resort, for dial-on-demand
                        systems.
wwwoffle -offline       Indicates to the demon that the system is offline.
wwwoffle -fetch         Commands the demon to fetch the pages that were
                        requested by browsers while the system was offline.
                        wwwoffle exits when the fetching is complete.
                        (This requires the demon to be told it is online).
wwwoffle -config        Causes the configuration file for the demon process to
                        be re-read. The config file can also be re-read by
                        sending a HUP signal to the wwwoffled process.
wwwoffle -purge         Commands the demon to purge from the cache the pages
                        that are older than the number of days specified in
                        the configuration file, using modification or access
                        time. Or, if a maximum size is specified, to delete
                        the oldest pages until the maximum size is not
                        exceeded.
wwwoffle -kill          Causes the demon to exit cleanly at a convenient point.
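The purge behaviour described for 'wwwoffle -purge' can be sketched in Python
as below. This is only an illustration of the two rules (age limit and maximum
size). Applying both in a single pass, the tuple layout and the function name
are assumptions; the actual demon reads its limits from the configuration
file.

```python
def select_for_purge(pages, max_age_days, max_size=None):
    """Return the names of pages a purge would delete.

    pages: list of (name, age_days, size) tuples, where age_days is measured
    from modification or access time, whichever the configuration uses.
    A sketch of the behaviour described for 'wwwoffle -purge'; combining the
    age and size rules in one pass is an assumption.
    """
    # Rule 1: anything older than the age limit goes.
    doomed = [name for name, age, _ in pages if age > max_age_days]
    kept = [p for p in pages if p[1] <= max_age_days]
    # Rule 2: if a maximum size is given, drop the oldest until it fits.
    if max_size is not None:
        kept.sort(key=lambda p: p[1], reverse=True)  # oldest first
        total = sum(size for _, _, size in kept)
        while kept and total > max_size:
            name, _, size = kept.pop(0)
            doomed.append(name)
            total -= size
    return doomed


if __name__ == "__main__":
    cache = [("a", 10, 100), ("b", 3, 200), ("c", 1, 300)]
    print(select_for_purge(cache, max_age_days=7, max_size=400))
```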
The second mode of operation is to specify URLs to get.
wwwoffle <URL> .. <URL> Specifies to the demon URLs that must be fetched.
                        If online then they are fetched immediately, else the
                        requests are stored for a later fetch.
wwwoffle <filename> ... The specified HTML file is read and all of the links
                        in it are used as if they had been specified on the
                        command line.
wwwoffle -F             Force the wwwoffle server to refresh the URL.
                        (Or fetch it if not cached.)
wwwoffle -i             Specifies that the URLs when fetched are to be parsed
                        for images and these are also to be fetched.
wwwoffle -f             Specifies that the URLs when fetched are to be parsed
                        for frames and these are also to be fetched.
wwwoffle -r[<depth>]    Specifies that the URL when fetched is to have the
                        links followed and these pages also fetched (to a
                        depth specified by the optional depth parameter,
                        default 1). Only links on the same server are to be
                        fetched.
wwwoffle -R[<depth>]    This is the same as the '-r' option except that all of
                        the links are to be followed, even those to other
                        servers.
wwwoffle -d[<depth>]    This is the same as the '-r' option except that links
                        are only followed if they are in the same directory or
                        a sub-directory.
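As a small illustration of how the fetch options above combine, the Python
sketch below assembles an argument list for the wwwoffle program without
running it. The helper name fetch_command and its parameters are hypothetical,
and attaching the depth directly to the flag (for example '-r2') is a reading
of the '-r[<depth>]' notation rather than something this README spells out.

```python
def fetch_command(url, depth=None, scope="server", images=False, frames=False):
    """Build an argument list for the 'wwwoffle' fetch options listed above.

    scope selects -r (same server), -d (same directory) or -R (any server);
    the depth is appended directly to the flag, following '-r[<depth>]'.
    Hypothetical helper: it only builds the list and does not run anything.
    """
    flags = {"server": "-r", "directory": "-d", "any": "-R"}
    argv = ["wwwoffle"]
    if images:
        argv.append("-i")
    if frames:
        argv.append("-f")
    if depth is not None:
        argv.append(flags[scope] + str(depth))
    argv.append(url)
    return argv


if __name__ == "__main__":
    # Fetch a page, its images, and links two levels deep on the same server.
    print(fetch_command("http://foo.com/", depth=2, images=True))
```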
The third mode of operation is to get a URL from the cache.
wwwoffle <URL>          Specifies the URL to get.
wwwoffle -o             Get the URL and output it on the standard output.
                        (Or request it if not already cached.)
The last mode of operation is to provide help in using the other modes.
wwwoffle -h             Gives help about the command line options.
With any of the first three modes of operation the wwwoffle server can be
specified in one of three different ways.
wwwoffle -c <config-file>
                        Can be used to specify the configuration file that
                        contains the port numbers, server hostname (the first
                        entry in the LocalHost section) and the password (if
                        required for the first mode of operation). If there is
                        a password then this is the only way to specify it.
wwwoffle -p <host>[:<port>]
                        Can be used to specify the hostname and port number
                        that the demon program listens to for control messages
                        (first mode) or proxy connections (second and third
                        modes).
WWWOFFLE_PROXY          An environment variable that can be used to specify
                        either the argument to the -c option (must be the full
                        pathname) or the argument to the -p option. (In this
                        case two ports can be specified, the first for the
                        proxy connection, the second for the control
                        connection e.g. 'localhost:8080:8081' or
                        'localhost:8080'.)
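The two forms of the WWWOFFLE_PROXY value can be told apart as in the Python
sketch below. It assumes that a value starting with '/' is a configuration
file pathname and that anything else is host[:proxy_port[:control_port]]. The
function name and the returned tuples are made up for illustration, and the
real program may validate its input differently.

```python
def parse_wwwoffle_proxy(value):
    """Interpret a WWWOFFLE_PROXY value as described in this README.

    Returns ("config", path) for a full pathname, otherwise
    ("server", host, proxy_port, control_port) with None for missing ports.
    A sketch only; the real program may validate its input differently.
    """
    if value.startswith("/"):
        return ("config", value)
    host, *ports = value.split(":")
    if len(ports) > 2:
        raise ValueError("expected host[:proxy_port[:control_port]]")
    proxy_port = int(ports[0]) if len(ports) >= 1 else None
    control_port = int(ports[1]) if len(ports) == 2 else None
    return ("server", host, proxy_port, control_port)


if __name__ == "__main__":
    print(parse_wwwoffle_proxy("localhost:8080:8081"))
    print(parse_wwwoffle_proxy("localhost:8080"))
```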
WWWOFFLED - Demon program
-------------------------
The demon program (wwwoffled) runs as an HTTP proxy and also accepts connections
from the control program (wwwoffle).
The demon program needs to maintain the current state of the system, online or
offline, as well as the other parameters in the configuration file.
As HTTP proxy requests come in, the program forks a copy of itself (the
wwwoffles function) to handle the requests. The server program can also be
forked in response to the wwwoffle program requesting pages to be fetched.
wwwoffled -c <config-file>  Starts the demon with the named configuration
                            file.
wwwoffled -d [level]        Starts the demon in debugging mode, i.e. it does
                            not detach from the terminal and uses standard
                            error for the log messages. The optional
                            numeric level (0 for none to 5 for all)
                            specifies the level of error messages for
                            standard error; if not specified then the
                            log-level from the config file is used.
wwwoffled -h                Gives help about the command line options.
There are a number of error and informational messages that are generated by the
program as it runs. By default (in the config file) these go to syslog. With
the -d flag the demon does not detach from the terminal and the messages are
also written to standard error.
By using the run-uid and run-gid options in the config file, it is possible to
change the user id and group id that the program runs as. This will require
that the program is started by root and that the specified user has read/write
access to the spool directory.
WWWOFFLES - Server program
--------------------------
The server (wwwoffles) starts by being forked from the demon (wwwoffled) in one
of four different modes.
Real - When the system is online and acting as a proxy for a browser.
        All requests for web pages are handled by forking a new server which
        will connect to the remote host and fetch the page. This page is then
        stored in the cache as well as being returned to the browser. If the
        page is already in the cache then the remote server is asked for a
        newer page if one exists, else the cached one is used.
SpoolOrReal - When the system is in autodial mode and we have not decided if we
        will go for Spool or Real mode. Select Spool mode if already cached and
        Real mode otherwise as a last resort.
Fetch - When the system is online and fetching pages that have been requested.
        All web page requests in the outgoing directory are fetched by the
        server connecting to the remote host to get the page. This page is then
        stored in the cache; there is no browser active. If the page has been
        moved then the link is followed and that one fetched.
Spool - When the system is offline and acting as a proxy for a browser.
        All requests for web pages are handled by forking a server that will
        either return a cached page or store the request. If the page is
        cached, it is returned to the browser, else a dummy page is returned
        (and stored in the cache), and the outgoing request is stored.
        If the cached page refers to a page that failed to be downloaded then
        it will be deleted from the cache.
Depending on the existence of files in the spool and other conditions, the mode
can be changed to one of several other modes.
RealNoCache - For requests for pages on the server machine or those specified
        not to be cached in the configuration file.
RealRefresh - Used by the refresh button on the index or the wwwoffle program
        to refetch a page while the system is online.
SpoolGet - Used when the page does not exist in the cache so a request needs to
        be stored for it in the outgoing directory.
SpoolWillGet - Used when the page is not in the cache but a request for it is
        in the outgoing directory already.
SpoolRefresh - Used when the refresh button on the index or the wwwoffle
        program is used; the existing spooled page (if there is one) is not
        overwritten, but a request is stored.
SpoolPragma - Used when the browser requests the cache to refresh the page
        using the 'Pragma: no-cache' header; the existing spooled page (if
        there is one) is not overwritten, but a request is stored.
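The way the online/offline/autodial status and the cache contents lead to a
mode can be summarised with the Python sketch below. It covers only a subset
of the modes named above (Real, Spool, SpoolGet and SpoolWillGet, with
SpoolOrReal resolved on the way) and is a reading of the descriptions in this
section, not the decision logic of wwwoffles itself. The function name and its
parameters are hypothetical.

```python
def choose_mode(status, cached, request_pending=False):
    """Pick a server mode name from the descriptions in this README.

    status is "online", "offline" or "autodial". Only a subset of the modes
    is modelled (no Fetch, refresh or no-cache variants); this is a sketch,
    not the decision logic of wwwoffles itself.
    """
    if status == "autodial":
        # SpoolOrReal: use the cache if possible, the network as last resort.
        status = "offline" if cached else "online"
    if status == "online":
        return "Real"
    if status != "offline":
        raise ValueError("unknown status: %r" % (status,))
    if cached:
        return "Spool"
    # Not cached while offline: a request is stored, or is already waiting.
    return "SpoolWillGet" if request_pending else "SpoolGet"


if __name__ == "__main__":
    for args in [("online", False), ("offline", True), ("offline", False),
                 ("offline", False, True), ("autodial", True)]:
        print(args, "->", choose_mode(*args))
```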
WWWOFFLE-TOOLS - Cache maintenance program
------------------------------------------
This is a quick hack program that I wrote to allow you to list the contents of
the cache or move files around in it.
All of the programs should be invoked from the spool directory.
wwwoffle-rm - Delete the URL that is specified on the command line.
              To delete all URLs from a host it is easier to use
              'rm -r http/foo' than use this.
wwwoffle-mv - To rename a host directory in the spool to another name.
              Because the URL is encoded in the filename just renaming the
              directory will not work. Instead of 'mv http/foo http/bar'
              use 'wwwoffle-mv http/foo http/bar'.
wwwoffle-ls - To list the files in the directory in the style of 'ls -l'.
              For example use 'wwwoffle-ls http/foo' to list the URLs cached
              in the directory http/foo.
These are basically hacks that I needed and should not be considered as fully
featured and fully debugged programs.
Author and Copyright
--------------------
The two programs wwwoffle and wwwoffled were written by Andrew M. Bishop in
1996,97,98 and are copyright Andrew M. Bishop 1996,97,98.
The program update-cache and the programs known as wwwoffle-tools were written
by Andrew M. Bishop in 1997,98 and are copyright Andrew M. Bishop 1997,98.
They can be freely distributed according to the terms of the GNU General Public
License (see the file `COPYING').
If you wish to submit bug reports or other comments about the programs then
email the author amb@gedanken.demon.co.uk and put wwwoffle in the subject line.
With Source Code Contributions From
- - - - - - - - - - - - - - - - - -
Yannick Versley <sa6z225@public.uni-hamburg.de>
Initial syslog code (much rewritten before inclusion).
Axel Rasmus Wienberg <2wienbe@informatik.uni-hamburg.de>
Code to run wwwoffled as a specified uid/gid.
Andreas Dietrich <quasi@baccus.franken.de>
Code to detach the program from the terminal like a *real* demon.
Ullrich von Bassewitz <uz@wuschel.ibb.schwaben.com>
Better handling of signals.
Optimisation of the file handling in the outgoing directory.
The log-level, max-servers and max-fetch-servers config options.
Tilman Bohn <tb@bohn.isdn.uni-heidelberg.de>
Autodial mode.
And Other Useful Contributions From
- - - - - - - - - - - - - - - - - -
Too many people to mention - (everybody that e-mailed me).
Suggestions and bug reports.