1997-12-19
WWWOFFLE - World Wide Web Offline Explorer - Version 2.0
========================================================
WHAT?
-----
The format of the cache that wwwoffle uses to store the web pages has changed in
version 2.x compared to the previous versions. If you have used wwwoffle
version 1.x then you *MUST* upgrade the existing cache before you can use the
new version of the program.
HOW?
----
*** READ ALL THIS SECTION BEFORE DOING ANYTHING ELSE ***
When you compile wwwoffle, another program called 'upgrade-cache' is also
compiled. You need to run this program to convert the cache from the old
format to the new one.

There are a number of options that you can take for this upgrade; the
following applies to all of them.

In each of the options the basics are the same: you run upgrade-cache with a
single argument, the name of the cache directory that is used (usually
/var/spool/wwwoffle). While the program runs it prints out informational and
warning messages; these may be useful.
Option 1 - Be reckless
Run 'upgrade-cache /var/spool/wwwoffle', watch the messages go flashing by and
hope that it works.
Option 2 - Be brave
With sh/bash run 'upgrade-cache /var/spool/wwwoffle > upgrade.log 2>&1'
or with csh/tcsh run 'upgrade-cache /var/spool/wwwoffle >& upgrade.log'
read the messages and check the warnings.
Option 3 - Be safe
Backup the cache first then follow option 2.
With GNU tar I suggest that you use the --atime-preserve option so that the
access times of the files in the cache are not modified by performing the
backup. The index and purge functions in wwwoffle use these times, so
preserving them is important.
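The backup-then-upgrade steps above can be sketched as follows (a sketch only,
assuming GNU tar and the usual /var/spool/wwwoffle cache path; the backup
filename is just an example):

```shell
# Sketch: adjust the paths if your cache lives elsewhere.

# 1. Back up the cache, preserving access times (index and purge need them):
tar --atime-preserve -cf /var/tmp/wwwoffle-cache-backup.tar /var/spool/wwwoffle

# 2. Run the upgrade, logging all messages (sh/bash redirection syntax):
upgrade-cache /var/spool/wwwoffle > upgrade.log 2>&1

# 3. Look through the log for warnings before deleting the backup:
grep -i warning upgrade.log
```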
When it finishes, the individual host-named directories in /var/spool/wwwoffle
are gone, moved into a new sub-directory called http. The outgoing directory
and this http directory are the only directories that should be left.
If there is a warning message then you should decide what needs doing. It
could be due to any of the following reasons:
That upgrade-cache was run by a user without write permissions.
That one or more files were changed while the program was running.
That there is a spare file in one of the host directories that needs deleting.
That there is a symbolic link that does not point anywhere.
If the upgrade-cache program crashes then that is a bug - tell me.
If you are left with many files or directories and the warnings are unclear then
this may be a bug - tell me.
If there are only a small number of spare files or directories, then just delete
them, you probably won't notice that they are missing.
WHY?
----
The existing scheme for naming the files in the cache had some problems; the
new one is better.
0) It was designed for my personal use, which did not involve storing many
web-pages or visiting any pages with unusual names.
You could say that the hacks that I implemented to get it working as I wrote
it were not well enough thought out. But at the time I wrote it I wanted to
get it working as soon as possible and did not write it with the future
growth in mind. The scheme as implemented has not caused any problems for me
personally.
1) It was possible for a web-page that has several possible arguments to be
stored incorrectly.
This is because for each page that has arguments a hash value is computed
from the arguments to provide a unique filename. The reason for this failing
is that I used a hash function that I made up on the spot, giving a 32-bit
hash value. This seemed to be sufficient for 4 billion sub-pages for each
host and path combination. As it turned out the hash function was not strong
enough and the number of distinct values it produced was much smaller.
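As an illustration of the problem (the toy function below is not wwwoffle's
actual hash, just an example of a weak "made up on the spot" 32-bit hash):

```python
def weak_hash(s: str) -> int:
    """Toy 32-bit hash: sum of character codes, truncated to 32 bits."""
    h = 0
    for c in s:
        h = (h + ord(c)) & 0xFFFFFFFF  # order-insensitive: collides easily
    return h

# Two different argument strings that are permutations of the same
# characters get the same hash, so they would share a cache filename:
a = "?user=foo&page=bar"
b = "?user=bar&page=foo"
assert a != b and weak_hash(a) == weak_hash(b)
```

A hash like this nominally has 4 billion possible values, but in practice
produces far fewer distinct ones, which is exactly the failure described above.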
2) There was no provision for any protocol other than http.
Very quickly the idea of doing ftp as well came to mind, but it could not be
implemented easily or cleanly with the existing system.
3) The outgoing directory was inefficient for large numbers of files.
An increasing sequence of numbers was used, resulting in slow access; this
was fixed in version 1.2x, but there could still be many requests for the
same URL in the directory. Now a unique name based on a hash is used so that
only one request for each page is stored.
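A minimal sketch of that idea (the hash function and filename prefix here are
hypothetical, not wwwoffle's actual choices): deriving the outgoing filename
from the URL means repeating a request reuses the same file instead of
queueing a duplicate.

```python
import hashlib

def request_filename(url: str) -> str:
    """Name an outgoing request file after a hash of its URL (sketch only)."""
    return "O" + hashlib.md5(url.encode()).hexdigest()

# The same URL always maps to the same filename, so a second request
# for it simply overwrites the first rather than adding another file:
assert request_filename("http://www.foo.com/") == request_filename("http://www.foo.com/")
```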
4) Bad characters and url-encoded URLs caused problems.
Some URLs that contained unusual characters, including URL-encoded sequences,
caused problems. The URLs http://www.foo.com/~bar and http://www.foo.com/%7Ebar
are the same URL but could be stored in different files.
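The fix can be sketched as follows (a sketch only, not wwwoffle's code):
decode %-escapes before computing the cache filename, so both spellings of a
URL map to the same file.

```python
from urllib.parse import unquote

def cache_key(url: str) -> str:
    """Normalise %-escapes so equivalent URL spellings share one key (sketch)."""
    return unquote(url)  # decodes %7E -> ~, etc.

assert cache_key("http://www.foo.com/~bar") == cache_key("http://www.foo.com/%7Ebar")
```

A real implementation has to be more careful (for example %2F must not be
decoded into a path-separating "/"), but the principle is the same.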
5) It is now a neater design with no special cases.
Previously only files with arguments needed hashing; now all of them use a
hash, which simplifies the logic. The format of the outgoing directory is the
same as that of the other directories.
6) There are more possibilities for future expansion.
It is now possible to consider adding more files to the cache to store extra
information about a URL, for example a password. It is obvious now that this
would be another file with the same hash value but a different prefix.