-
- Welcome to WWWGrab/2 v1.3
-
- ---------------------------------------------------------------------------
- <Czech>
- Český návod je v souboru WWWGrab.CZE. (The Czech manual is in the file WWWGrab.CZE.)
- </Czech>
-
- WWWGrab/2 is a utility for mirroring websites. It is a command-line
- program that reads its instructions from configuration files.
-
- WWWGrab/2 Requirements:
-
- * OS/2 Version 2.11 or greater. Merlin or OS/2 Warp Connect suggested
- for best performance.
- * One of the following TCP/IP packages for OS/2 (listed in order of
- preference):
- * IBM TCP/IP included in OS/2 Warp Merlin.
- * IBM TCP/IP 3.0 included in OS/2 Warp Connect.
- * IBM TCP/IP 2.0 Base Kit with CSD64092 or greater applied.
- * The Internet Access Kit from OS/2 Warp's Bonus pack.
-
- * A disk with long filename support (HPFS, ext2fs, etc.) is not
- required but is strongly recommended!
-
- To start WWWGrab/2 simply type the following at an OS/2 command
- prompt:
-
- WWWGRAB <config_file> [/i]
-
- WWWGrab/2 may also be called from command and REXX files, and from
- Program objects on the OS/2 desktop.
-
- WWWGrab/2 uses configuration files. A configuration file is a plain
- ASCII text file with commands and options that tell WWWGrab/2 what to do.
- Its format is described below. The easiest way to create your first
- configuration file is to copy an existing demonstration file and change it
- to suit your needs.
-
- The default configuration file, named "default.w3d", is automatically
- processed when WWWGrab/2 is executed. It should contain frequently
- used commands and options. If you do not want WWWGrab/2 to load the
- default configuration file, use the '/i' command-line option.
-
- These options and commands CANNOT be used in the default
- configuration file:
- URL ALL
- CHANGESITE SITELIST
- REALM DENY
- ALLOW EXCL
- ADD TOP
- INCLUDE META
-
- --------------------------------------------------------------------------
- Credits
-
- I want to express my thanks to everyone who voluntarily tested
- WWWGrab/2, reported errors, and offered constructive suggestions for
- improvement. Without their help WWWGrab/2 would not have been this
- successful.
-
- Special thanks go out to:
-
- * Tom Wheeler
- * Andreas Krattenmacher
- * Mike Nice
- * Stanislav Koci (St/\n)
- * Jochen Riemer
-
- Thanks also to HELLOWEEN, GAMMA RAY, Michael Kiske, MANOWAR, Alice
- Cooper, GREEN DAY, and all the other great musicians who provide me with
- music to listen to while I am programming.
-
- ---------------------------------------------------------------------------
- WWWGrab/2 Configuration File Format
-
- NOTES:
- Commands with a leading '$' are available only in the
- registered version of WWWGrab/2.
-
- !! All URLs must be in the full format, e.g. http://www.foo.com/ !!
- !! or http://127.0.0.1/this/is/localhost/ !!
- !! You may also specify a port value, e.g. http://www.foo.com:8080/ !!
- !! or http://127.0.0.1:8080/this/is/localhost/at_port_8080.html !!
-
- At the end of this list of commands and options is a quick-reference
- table which summarizes some information about the commands and options.
-
- URL <url-in-the-http-form>
- This command tells WWWGrab/2 a site you wish to mirror. The complete
- URL of the site is required! The URL command can be used more than
- once to mirror multiple sites or multiple directories on the same
- site. This is a basic command :-)
-
- Example:
- URL http://www.geocities.com/SiliconValley/Heights/7262/index.html
-
-
- LOCALPATH <path>
- WWWGrab/2 must have a place to store the files it downloads. This
- command tells WWWGrab/2 the path on your local machine under which
- the URL will be mirrored.
-
- Example:
- LOCALPATH F:\GRAB\IBM\
- Stores files mirrored under the F:\GRAB\IBM\ directory.
-
-
- $ EXTENSIONS <list of extensions>
- The EXTENSIONS command defines a list of file extension search
- strings which are to be downloaded. Extensions are separated by a
- space. HTM, HTML, SHTM, SHTML, JPG, GIF, WAV, AU, CLASS, and JAVA are
- automatically defined, so there is no need to specify them when
- using the EXTENSIONS command. You may alternatively use the ':'
- char as a 'NOT' operator to list extensions which you wish to
- ignore. You may not use "NOT extensions" together with "normal
- extensions". Only "NOT extensions" will be applied if both are
- used together. Be careful what you put here! Including EXE or
- ZIP extensions could use vast quantities of disk space if you start
- mirroring a large site such as hobbes or sunsite!
-
-
- Example:
- EXTENSIONS ZIP C
- Use ZIP and C extensions
-
- EXTENSIONS :ZIP :C
- All extensions except ZIP and C
-
- EXTENSIONS HTML JAVA :CLASS
- This mirrors ALL extensions except CLASS!
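The selection rules above can be sketched in Python. This is only an illustration, not WWWGrab/2's actual code: the function name `wanted` is hypothetical, and treating each entry as an exact, case-insensitive extension is an assumption (the text calls them "search strings" without defining the matching precisely).

```python
import os

# Extensions the text says are always defined.
DEFAULT_EXTENSIONS = {"HTM", "HTML", "SHTM", "SHTML", "JPG", "GIF",
                      "WAV", "AU", "CLASS", "JAVA"}


def wanted(filename, extensions):
    """Return True if `filename` passes an EXTENSIONS-style list."""
    ext = os.path.splitext(filename)[1].lstrip(".").upper()
    excluded = {e[1:].upper() for e in extensions if e.startswith(":")}
    if excluded:
        # Any ':' entry switches to "everything except" mode; plain
        # entries given alongside it are ignored.
        return ext not in excluded
    allowed = DEFAULT_EXTENSIONS | {e.upper() for e in extensions}
    return ext in allowed
```

For instance, `wanted("a.exe", ["HTML", "JAVA", ":CLASS"])` is true under this sketch, which mirrors the third example: once a NOT extension is present, the plain entries no longer restrict anything.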
-
-
- MAXDEEP <levels>
- MaxDeep defines how many subdirectory levels deep WWWGrab/2 will
- mirror. Pages more than <levels> subdirectories deep in the
- tree are ignored.
-
- Example:
- MAXDEEP 5
- Will get http://www.foo.com/1/2/3/4/5/file.html but not
- http://www.foo.com/1/2/3/4/5/6/file.html
-
- NOTE: The shareware version of WWWGrab/2 is limited to 5 levels.
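A small Python sketch of the depth check, assuming that depth is simply the number of directory components in the URL path (the function names are hypothetical and this is not taken from WWWGrab/2 itself):

```python
from urllib.parse import urlsplit


def depth(url):
    """Count the directory levels in the path of `url`."""
    path = urlsplit(url).path
    # Drop the leading empty piece and the final component
    # (the file name, or "" after a trailing slash).
    return len(path.split("/")[1:-1])


def within_maxdeep(url, maxdeep):
    """True if `url` is no deeper than `maxdeep` subdirectory levels."""
    return depth(url) <= maxdeep
```

With `maxdeep` set to 5, the first URL in the example above has depth 5 and is accepted, while the second has depth 6 and is rejected.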
-
-
- MAXTRIES <num>
- MaxTries tells WWWGrab/2 how many times it should try to get a file.
- WWWGrab/2 tries to grab all the files sequentially. If a file isn't
- successfully retrieved on the first attempt, it is ignored until the
- complete list has been processed. Then WWWGrab/2 retries files missed
- on the first attempt. This process is repeated until all the files
- are retrieved or MAXTRIES attempts have been made.
-
- Example:
- MAXTRIES 3
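The pass-by-pass retry behaviour described above might look like this in Python. It is a sketch only: `fetch` is a hypothetical callable that returns True when a download succeeds, and nothing here is taken from the program's source.

```python
def grab_all(urls, fetch, maxtries):
    """Fetch `urls` in passes; return the URLs that never succeeded."""
    pending = list(urls)
    for _ in range(maxtries):
        if not pending:
            break
        # One sequential pass over the list; files that fail are kept
        # and retried only after the whole pass has finished.
        pending = [url for url in pending if not fetch(url)]
    return pending
```

The point of this structure is that a failed file does not block the rest of the list: it is revisited only on the next pass.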
-
-
- $ DEFAULTNAME <name>
- Sometimes links point to a directory instead of a file. In this case,
- if the filename is not known the DefaultName is used for that
- directory. The default value for DefaultName is "index.html".
-
- Example:
- DEFAULTNAME Welcome.html
-
-
- ALL
- Normally, if WWWGrab/2 sees that a file already exists, it will send
- a conditional GET to the remote server. The file is only downloaded
- again if the version on the server is newer than the local file. If
- you want to update all the files regardless of their date and local
- existence, you should use the ALL option.
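In standard HTTP a conditional GET is usually made with the If-Modified-Since request header. The Python sketch below assumes WWWGrab/2 works that way, which the text does not state; the function name and parameters are hypothetical.

```python
from email.utils import formatdate


def conditional_headers(local_mtime, force_all=False):
    """Return extra request headers for one file.

    `local_mtime` is the local copy's modification time in seconds
    since the epoch, or None when no local copy exists.
    """
    if force_all or local_mtime is None:
        return {}  # unconditional GET: the file is always downloaded
    # HTTP dates are always expressed in GMT.
    return {"If-Modified-Since": formatdate(local_mtime, usegmt=True)}
```

Passing `force_all=True` corresponds to the ALL option: no condition is sent, so the server returns the file regardless of its date.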
-
-
- $ CHANGESITE <num sites>
- Normally, if WWWGrab/2 finds a link to another WWW server in an html
- file, the link is ignored. If you want to allow WWWGrab/2 to follow
- links to another server, use the CHANGESITE command. The default is
- 0, which means don't change sites. BE CAREFUL what you enter here!
- You may start mirroring the entire WWW!
-
- Example:
- CHANGESITE 2
-
-
- $ SITELIST <hostname>
- Normally, if WWWGrab/2 finds a link to another web site in an html
- file, the link is ignored. You can use the SITELIST command to
- specify allowed hosts. You may use the ':' character as a NOT
- operator. This command can be used more than once.
-
- Example:
- SITELIST www.xxx.yyy
- Allow connections to site www.xxx.yyy.
-
- SITELIST :www.xxx.yyy
- All websites except www.xxx.yyy.
-
- NOTE: This command overrides the CHANGESITE command!
-
-
- NOIMG
- Use this option if you don't want to grab image files.
-
-
- NOSND
- Use this option if you don't want to grab audio files.
-
-
- NOAPPLET
- Use this option if you don't want to grab applets.
-
-
- OHTML
- This option combines NOIMG, NOSND and NOAPPLET.
-
-
- PROXY <hostname>
- Use this command if you access the Internet via a Proxy server/cache.
- The <hostname> may be the full hostname (e.g. proxy.foo.com) or an IP
- address. If you're uncertain about this, consult your system
- administrator or internet service provider.
-
- Examples:
- PROXY www.proxy.server
-
- PROXY 123.456.789.10
-
- PPORT <proxy port>
- This command specifies the proxy port. The default value is 80. This
- value isn't used if no proxy is specified with the PROXY command.
-
- Example:
- PPORT 8080
-
-
- NICE [delay]
- This command sets the delay, in seconds, between requests so
- you don't hog all the resources of the system you're mirroring from.
- If you use this command without a value, WWWGrab/2 will delay 10
- seconds before requesting the next file. Warning: WWWGrab/2 can
- generate requests too fast for some servers. Setting the NICE
- parameter too low may generate too many requests for the server and
- crash the server. This is not nice :-). A low NICE setting is known
- to kill the following types of servers:
-
- All WWW servers that run under Microsoft Windows(TM)
- Old generation (HTTP/1.0) CERN servers on all platforms
-
- Low NICE values may also generate large amounts of network traffic
- and hog network resources. For safety, you should set the NICE
- value to at least five seconds. The longer, the better. Remember,
- this program is automated and can easily run for hours with no
- user interaction.
-
- Example:
- NICE 5
-
- NOTE: If you try to set a NICE value of 0 (zero), the value
- will be automatically changed to five seconds.
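The two special cases for NICE (no value, and a value of zero) can be summarized in a short Python sketch. The function name is hypothetical and values other than those two cases are passed through unchanged as an assumption.

```python
def effective_delay(value=None):
    """Return the delay in seconds that a NICE setting results in."""
    if value is None:
        return 10  # NICE given without a value
    if value == 0:
        return 5   # zero is replaced by five seconds
    return value
```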
-
- MAXDL <limit>
- This defines the maximum number of kilobytes WWWGrab/2 will
- download. When WWWGrab/2 is about to download a file, it checks the
- filesize. If downloading the file would exceed the limit specified
- in MAXDL, WWWGrab/2 will ignore the file.
-
- Example:
- MAXDL 3
- Download up to 3KB.
-
-
- $ REALM <host> <"Realm Name"> <encoded username and password>
- Defines a host protected by HTTP basic authentication, its realm name,
- and a base64-encoded username and password. REALM can be used more than
- once. The realm name is CaSe SeNsItIvE! If you don't know the realm
- name, you may use an empty string (i.e. "") or look it up in
- WWWGRAB.LOG. The host may be given as an IP address (1.22.33.44) or as
- a domain name (www.foo.com). The encoded credentials are generated by
- the makeauth program. You may use the INCLUDE command to include its
- output in the configuration file.
-
- Example:
- REALM www.secured.host "This is ReaLmName" LTot
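For orientation, HTTP basic authentication encodes the string "username:password" in base64. The Python sketch below assumes the third REALM parameter has that form; the exact output of makeauth is not documented here, and both function names are hypothetical.

```python
import base64


def encode_credentials(username, password):
    """Base64-encode "username:password" as HTTP basic auth does."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def realm_line(host, realm, username, password):
    """Format a REALM line in the layout shown in the example above."""
    token = encode_credentials(username, password)
    return f'REALM {host} "{realm}" {token}'
```

Remember that base64 is an encoding, not encryption: anyone who can read the configuration file can recover the password.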
-
-
-
- CHAM <number>
- Some servers (esp. Netscape) try to recognize the client name. If
- they don't know the client name, they don't send any data. You may
- use this option to "mask" the client name (like CHAMeleon). Numbers
- are:
- 0 - WWWGrab      (default)
- 1 - Mozilla      Netscape Browser
- 2 - WebExplorer  IBM WebExplorer/2
- 3 - WebCrawler   WebCrawler robot
- 4 - InfoSeek     InfoSeek robot
- 5 - Harvest      a web robot
- 6 - Mosaic       NCSA Mosaic
- 7 - Lynx         Lynx, text browser
- 8 - PRODIGY-WB   Prodigy browser
- 9 - Internet     Microsoft's web browser
-
-
- Example:
- CHAM 2
- Sends the server the WebExplorer client name.
-
-
- $ MASK <file mask>
- Use this command if you want to mirror only specified files. This
- command overrides EXTENSIONS. You MUST explicitly define every file
- mask if using this command, including the defaults in EXTENSIONS such
- as HTML, etc.! This command can be used more than once. The file mask
- can have wildcard characters (special characters for character
- substitution). The '?' char means ANY one char is legal. The '*' char
- means ZERO OR MORE cases of any character are legal. They may be
- located at any position in the string, and may be used more than
- once.
-
- Example:
- MASK *.jpg
- Will mirror all files with the .jpg extension
-
- MASK ?a*.html
- Will mirror all files beginning with any character,
- followed by 'a', having any number of characters following,
- and ending with .html, such as zaphod.html, 0a.html, etc.
-
- MASK *.jpg s?n.htm* do*s.large.i*x *.*.html.c*
- Will mirror one.jpg, two.jpg, sin.htm, son.htm, sun.html,
- dogs.large.idx, doorways.large.index, index.short.html.cz852,
- index.of.html.cz.html, try.decode.html.c, etc...
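The '?' and '*' wildcards behave like the ones in Python's standard fnmatch module, so the examples above can be checked with a short sketch. Two caveats: fnmatch also gives '[' and ']' a special meaning, which this text does not mention, and whether WWWGrab/2 compares case-sensitively is not stated (the sketch does). The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(filename, masks):
    """True if `filename` matches at least one wildcard mask."""
    return any(fnmatchcase(filename, mask) for mask in masks)


masks = "*.jpg s?n.htm* do*s.large.i*x *.*.html.c*".split()
for name in ["one.jpg", "sin.htm", "sun.html", "doorways.large.index",
             "index.of.html.cz.html", "readme.txt"]:
    print(name, matches_any(name, masks))
```

Every name in the loop except readme.txt matches one of the four masks.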
-
-
- I401
- If WWWGrab/2 sends a conditional GET to a protected page, and the
- page isn't modified, some servers return a 401 status code. You may
- use I401 to override this response and download the file.
-
-
- ADD <path>
- Add the specified path to the list of requested URLs. This command
- can be used more than once, and always applies to the first URL
- command.
-
- Example:
- URL http://www.xxx.yyy/path1/index.html
- URL http://foobar.com/
- ADD /path2/pic/index.html
- Mirrors: http://www.xxx.yyy/path1/index.html AND
- http://www.xxx.yyy/path2/pic/index.html AND
- http://foobar.com/
-
-
- $ ALLOW <URL-in-http-form>
- Explicitly specifies that a subtree is retrievable. This command
- can be used more than once.
-
- Example:
- ALLOW http://www.xxx.yyy/allow/this/path/
-
-
- $ DENY <URL-in-http-form>
- The URL provided, as well as all subtrees of the URL, are not
- processed. Use this setting to skip directory subtrees you do not
- want to mirror. It can be used more than once.
-
- Example:
- DENY http://www.xxx.yyy/deny/this/path/
- Do not download any files from the /deny/this/path/ tree.
-
- If you do not include the trailing slash
- (http://www.xxx.yyy/deny/this/path) then every file and subdirectory
- whose name begins with "path" is skipped. This includes "paths.html",
- "path1/news", etc.
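The trailing-slash behaviour is what a plain string-prefix comparison would produce. The Python sketch below assumes DENY is implemented that way, which matches the description but is not confirmed; the function name is hypothetical.

```python
def is_denied(url, deny_rules):
    """True if `url` starts with any of the DENY prefixes."""
    return any(url.startswith(rule) for rule in deny_rules)


with_slash = ["http://www.xxx.yyy/deny/this/path/"]
without_slash = ["http://www.xxx.yyy/deny/this/path"]
page = "http://www.xxx.yyy/deny/this/paths.html"

print(is_denied(page, with_slash))     # prefix ends in "/": no match
print(is_denied(page, without_slash))  # bare prefix also covers paths.html
```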
-
-
- EXCL <www-server>
- This command defines a WWW server to exclude from mirroring. This
- command is usable together with the CHANGESITE command. It can be
- used more than once.
-
- Example:
- EXCL www.yyy.zzz
- EXCL microsoft.is.lame.org BTW: try this URL :-)
-
-
- TOP <path>
- Defines the TOP of the path. WWWGrab/2 will ignore files in
- directories higher than this path. In other words, the path of the
- file must start with this string. This restriction is applied only
- to the FIRST defined host.
-
- Example:
- TOP /path/xxxx/
- Ignore files above /path/xxxx/, i.e. DON'T mirror /path/some.file
-
- INCLUDE <file>
- This command allows you to include another configuration file into
- the configuration file currently being processed. Nesting is allowed,
- to a maximum depth of 4 levels. This command is useful for including
- commands which are used in multiple configuration files. See also '@'
- files below.
-
- Example:
- INCLUDE realms.inc
- INCL urls.inc
-
-
- FAT
- This option turns on FAT compatibility. In this mode WWWGrab/2 stores
- all mirrored files in a single directory using the FAT 8.3 filename
- format. It automatically fixes links. This option is automatically
- turned on if the local path (LOCALPATH) is located on a FAT partition
- or on a partition without long filename support.
-
-
- $ REMOVE
- This option tells WWWGrab/2 to remove unused links from an HTML
- file. Links are not deleted, but only commented out.
-
-
- $ REPL <path>
- Specifies a path which replaces the LOCALPATH in rewritten links. For
- example, suppose you specify "REPL /mirrors" and the LOCALPATH is
- F:\OS2Httpd\HTML\GRAB\. A grabbed HTML document from www.foo.com
- contains the link "<A HREF="/some/pages/index.html"> link </a>". The
- linked file is stored as
- "F:\OS2Httpd\HTML\GRAB\www.foo.com\some\pages\index.html", and the
- link in the document is changed to:
- "/mirrors/www.foo.com/some/pages/index.html"
-
- Example:
- REPL /mirrors
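The mapping in the REPL description can be reproduced with two small Python helpers. They are a sketch built only from that one worked example (host-absolute links, backslash separators on disk); the function names are hypothetical and relative links are not handled.

```python
def local_filename(localpath, host, href):
    """Local file for a host-absolute link, using backslash separators."""
    return localpath.rstrip("\\") + "\\" + host + href.replace("/", "\\")


def rewritten_link(repl, host, href):
    """Link written back into the HTML document when REPL is set."""
    return repl.rstrip("/") + "/" + host + href


print(local_filename("F:\\OS2Httpd\\HTML\\GRAB\\", "www.foo.com",
                     "/some/pages/index.html"))
print(rewritten_link("/mirrors", "www.foo.com", "/some/pages/index.html"))
```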
-
-
-
- METAFILE <filename>
- This command specifies the file WWWGrab/2 uses for saving information
- about mirrored files. The default filename is META.DAT, which is
- stored in the LOCALPATH\%host% directory.
-
- Example:
- META data.met
-
- --------------------------------------------------------------------------
- Using '@' Files
-
- You may find yourself using the same parameters over and over again
- for some options. Rather than having to copy/paste text from one
- configuration file to another, and then update each file when you make
- changes to a parameter, you can store the parameters in an '@' file (say
- "at file") and reference the '@' file in each configuration file. For
- example, if you frequently use the MASK command, you may store it in
- the DEFAULT.W3G file, and it will be applied to ALL configuration files
- by default. But if you want to use two different MASKs for different
- configuration files, you must use '@' files. How? First create one
- file called (for example) MASKS1 with this text:
-
- *.HTML
- *.HTM
- *.?.JPEG
- *.0?.GIF
-
- Then create a second file named (for example) MASK2 with this
- content:
-
- *.SHTML
- *.SHTM
- *.JPEG
- *.GIF
- *.WAV
-
- Now, you may write in one configuration file named "example1.conf":
- ;
- ;; Mirrors only *.HTML, *.HTM, *.?.JPEG, *.0?.GIF files
- ;
- URL http://some.http.address.com/
- MASK @MASKS1 ; use contents of the MASKS1 file
-
- In the second configuration file "example2.conf" you can write:
- ;
- ;; Mirrors only *.HTML, *.HTM, *.SHTML, *.SHTM, *.?.JPEG, *.0?.GIF,
- ; *.JPEG, *.GIF, *.WAV
- ;
- URL http://some.http.address.com/
- MASK @MASKS1 ; use contents of the MASKS1 file
- MASK @MASK2 ; and add contents of the MASK2 file
-
- If you had used just MASK @MASK2, then only *.SHTML, *.SHTM, *.JPEG,
- *.GIF, and *.WAV files would be mirrored.
-
- You may use '@' files with these commands:
- URL, EXTENSIONS, ALLOW, EXCL, SITELIST, MASK, DENY, ADD
-
- The '@' file must contain only one parameter per line.
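How an '@' reference expands into parameters can be sketched in Python. The function name and the `base_dir` argument are hypothetical, and skipping blank lines is an assumption; the text only says that each line holds one parameter.

```python
from pathlib import Path


def expand_parameters(params, base_dir="."):
    """Replace each '@name' parameter with the lines of file `name`."""
    expanded = []
    for param in params:
        if param.startswith("@"):
            text = (Path(base_dir) / param[1:]).read_text()
            # One parameter per line; blank lines are skipped.
            expanded.extend(line.strip() for line in text.splitlines()
                            if line.strip())
        else:
            expanded.append(param)
    return expanded
```

Under this sketch, a line such as "MASK @MASKS1" is equivalent to writing one MASK entry for every line of the MASKS1 file.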
-
-
- --------------------------------------------------------------------------
- EXAMPLES
-
- Basic authorization example:
- URL http://www.sec1.host/secured/pages/index.html
- LOCALPATH \MyGrab\Secured
- MAXDEEP 5
- MAXTRIES 3
- REALM www.sec1.host "Realm 1" WAEFfgSDRGwer==
- REALM www.sec1.host "Realm 2" WQREGFbsdgiwheg
-
-
- The default configuration file example:
-
- ;; Definition of common extensions
- ;
- EXTENSIONS HTML HTM SHTML SHTM
- EXTENSIONS JPG JPEG GIF XBM
- EXTENSIONS WAV VOC AU
- EXTENSIONS JAVA CLASS
-
- ;
- ;; The default value for the MAXDEEP command
- ;
- MAXDEEP 5
-
- ;
- ;; The default value for the NICE command
- ;
- NICE 3
-
-
- Quick Reference Chart of Commands and Options
-
- COMMAND      SHORTCUT  '@'  DEFCFG  OVERRIDES   DEFVAL  REG  MULTIPLE
- ---------------------------------------------------------------------
- URL                    YES  NO                          NO   YES
- LOCALPATH    LOP       NO   YES                 [0]     NO   NO
- EXTENSIONS   EXT       YES  YES                 [1]     YES  YES
- MAXDEEP      MDP       NO   YES                 1       [2]  NO
- MAXTRIES     MTR       NO   YES                         NO   NO
- DEFAULTNAME  DEF       NO   YES                 [3]     YES  NO
- ALL                    NO   NO                          NO   NO
- CHANGESITE   CHSIT     NO   NO                  0       YES  NO
- SITELIST     SLIST     YES  NO      CHANGESITE          YES  YES
- NOIMG                  NO   YES                         NO   NO
- NOSND                  NO   YES                         NO   NO
- NOAPPLET     NOAP      NO   YES                         NO   NO
- OHTML                  NO   YES     [4]                 NO   NO
- PROXY                  NO   YES                         NO   NO
- PPORT                  NO   YES                 80      NO   NO
- NICE                   NO   YES                 10      NO   NO
- MAXDL                  NO   YES                         NO   NO
- REALM                  NO   NO                          YES  YES
- CHAM                   NO   YES                 0       NO   NO
- MASK                   YES  YES     EXTENSIONS          YES  YES
- I401                   NO   YES                         NO   NO
- ADD                    YES  NO                          NO   YES
- ALLOW                  YES  NO                          YES  YES
- DENY                   YES  NO                          YES  YES
- EXCL                   YES  NO                          NO   YES
- TOP                    NO   NO                          NO   NO
- INCLUDE      INCL      NO   NO                          NO   YES
- FAT                    NO   YES                         NO   NO
- REMOVE                 NO   YES                         YES  NO
- REPL                   NO   YES                         YES  NO
- METAFILE     META      NO   NO                          NO   NO
-
-
- [0] - \WWWGrab\Grab
- [1] - HTM, HTML, SHTM, SHTML, JPG, GIF, WAV, AU, CLASS, and JAVA.
- [2] - The shareware version of WWWGrab/2 is limited to five levels.
- [3] - The default value for the shareware version is "index.html".
- [4] - Combines NOIMG, NOSND, and NOAPPLET.
-
- ---------------------------------------------------------------------------
-
- Disclaimer etc.
-
- This program is COPYRIGHTED by J. Rubes.
-
- WWWGrab/2 is a shareware product. It is distributed through public
- access channels so that prospective buyers have the opportunity
- to evaluate the product before making a decision to buy.
-
- WWWGrab/2 may be used only for legal purposes. CHECK if you are
- allowed to mirror a site before doing so.
-
- USE AT YOUR OWN RISK
-
- This program is provided AS IS without any warranty, expressed or
- implied, including but not limited to fitness for a particular use. The
- user is responsible for the results of correct or incorrect usage of this
- software. WWWGrab/2 may not be used to provide commercial services without
- written permission of the author.
-
- ---------------------------------------------------------------------------
-
- If you like this program, please:
- Send me $10.00, the normal user fee for WWWGrab/2. You may send
- more :-)
-
- This registration fee is for INDIVIDUALS. A negotiated site license
- is required for businesses, governments and other institutions if
- WWWGrab/2 is to be used on more than one computer at that site. Contact
- the author for details on site license discounts.
-
-
- Upon registration you will receive (via email) a registered
- personalized copy of the most recent version of WWWGrab/2. This
- registration makes all subsequent versions available free of charge.
-
- See the REGISTER.ENG file for registration information.
-
- If you don't like this program:
- Delete it.
-
- ---------------------------------------------------------------------------
-
- Remember that software of this kind lives or dies by the response it gets.
-
- You may get the most recent version of WWWGrab/2 at:
- http://www.geocities.com/SiliconValley/Heights/7262/
-
-
- You may send comments, suggestions, bugs, etc. to:
- email:
- jirkar@geocities.com
- jirkar@hotmail.com
- Jiri_Rubes@slad.fido.cz
-
- FidoNet:
- Jiri Rubes 2:421/37
-
-
- My English is poor, I know ;-)
- If you see a big bug in this text, please email me and I'll change it.
-
- A special, BIG thanks goes to Tom Wheeler for checking the documentation.
-
-
-