Beginners should read the licence followed by the section on Starting to use analog.
This Readme describes analog 4.1. For the latest version of analog, see the analog home page. For examples of the output see
Analog is free software, but its usage, distribution and modification are covered by a licence. You must agree to the terms of the licence before using the program. In particular, it comes with no warranty.
This is a version of the Readme in one page. If you're reading it on line, you might prefer the version on several smaller pages. There is an index at the end of this document.
Now you can go to
If you log in to your ISP's machine from your home machine, you have two options. If you have the right permissions, you can run analog on your ISP's machine. Otherwise, you can download (e.g., ftp) the logfiles from their machine to yours, and then run analog on your machine.
Once you've downloaded the right version of analog for your computer from the analog home page (or a mirror site), you need to know how to set it up and run it. This is very easy, but the instructions are slightly different depending on which platform you're using.
If you can't manage to set up analog after reading the instructions, send a message to the analog-help mailing list.
    LOGFILE logfilename   # to set where your logfile lives
The logfile must be stored locally -- analog won't use FTP or HTTP to fetch it from the internet. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
One note: on other platforms, there is another way to give options, via command line arguments. You'll see these mentioned in this Readme from time to time, but the Mac doesn't have a command line, so ignore these.
If you want to compile your own version of analog (it's written in C), or just to read the source code, it's available from the analog home page. (It's the same source code for all versions.)
When you've downloaded analog, and either you or your browser has unzipped it, you will find in the analog folder a configuration file called analog.cfg and the analog executable itself, as well as the Readme, the Licence (which you must read and agree to before using analog) and a couple of other files. There is no setup.exe: analog is already ready to run without one.
(Some unzip programs are broken, and do not create folders when they should. If you don't have a folder called lang inside the analog folder, create one and put all the files called *.lng and *.tab into it.)
There are two ways of running analog. You can either run it from Windows (by single-clicking or double-clicking on its icon, depending on your setup), or you can run it from the DOS command prompt (under Start-Programs). If you run it from Windows, it will create a DOS window to run in. When it's finished, it will produce an output file called Report.html. The first time you run it, this may all happen almost instantly. For help in interpreting the output, see What the results mean.
    LOGFILE logfilename   # to set where your logfile lives
The logfile must be stored locally -- analog won't use FTP or HTTP to fetch it from the internet. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog.
In some ways, it's easier to run analog from the DOS command prompt, because you get to see any error or warning messages more easily. Also, if you run analog from the command prompt, there is another way to give options, via command line arguments, given on the command line after the program name. These are just shortcuts for configuration file commands. You can use the command line arguments if you run analog from a batch file too.
If you want to compile your own version of analog (it's written in C), or just to read the source code, it's available from the analog home page. (It's the same source code for all versions.)
If you're not using one of the platforms for which a precompiled version is available, you'll have to compile your own version from the source. But don't worry -- it's written in standard C throughout, so it will compile out of the box on most platforms. (The source code is the same for all platforms.)
First, you should look at the file anlghead.h, and see if there's anything you want to edit. In particular, you need to set the ANALOGDIR.
When you have done that, you need to compile the program. How to do that depends on which operating system you're using.
Type
    make
to compile the program. On most systems, that will be sufficient. If it fails to compile, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. It says in that file what to do. In particular, Solaris 2 (SunOS 5) users need to change the LIBS= line.
(Experts can pass some arguments in on the make command line instead of by editing anlghead.h: e.g.
    make DEFS='-DANALOGDIR=\"/usr/etc/apache/analog/\"'
This is useful if you have a script to compile analog.)
If you haven't got gcc, you will need to change the compiler - try acc or cc instead. If it still doesn't compile, try DEFS=-DNODNS to ignore the DNS lookup code.
There is a known problem with HP-UX 10 and some versions of gcc. If it complains about an error in the <sys/stat.h> library, you need to upgrade to gcc version 2.7.2.3 or later, or use HP's cc compiler. HP's compiler is not an ANSI C compiler by default, so you need to specify -Ae in the CFLAGS to tell the compiler to use ANSI C.
SunOS 4's cc and gcc don't have the necessary header files for ANSI C. If you have the ANSI C compiler acc, use that. Otherwise use the DEFS given in the Makefile.
SunOS 5 users need to change the LIBS= line in the Makefile. Also, this OS sometimes seems to have a broken strcmp() function. If you get an "illegal instruction" error when running analog, compile it with the -DNEED_STRCMP in the DEFS= line.
Compiling under OpenVMS. First edit anlghead.h as described above. Then type
    MMS
to compile analog.
Compiling under Acorn RiscOS. The Makefile is called Make.Risc, and you will have to rename it to Makefile before running make. Also you have to make directories called C, H and O, and move the source files into the appropriate directories: e.g., alias.c must be renamed C.alias. And you will find that there are some filenames in the header file anlghead.h that you want to change to fit into the RiscOS directory structure.
Compiling under OS/2. To compile analog for OS/2, you will need the EMX package. You should edit the Makefile to have OS=OS2 and LIBS=-lsocket. Then after editing anlghead.h and running Make, you need to run the command
    EMXBIND -b ANALOG
to generate the analog.exe executable.
Type
    analog
to run the program. (Or ./analog if for some reason . isn't in your $PATH.)
You can configure analog by putting commands in the configuration file, which is called analog.cfg by default. Two commands you will need straight away are
    LOGFILE logfilename       # to set where your logfile lives
    OUTFILE outputfile.html   # to send the output to a file instead of the screen
The logfile must be stored locally -- analog won't use FTP or HTTP to fetch it from the internet. There's a sample logfile supplied with the program.
There's a list of basic commands later in the Readme. Also there are a few to get you started in the configuration file already, but there are lots of others available. You can read about all the commands in the section on customising analog. For help in interpreting the output, see What the results mean.
There is one other way to give options to analog, via command line arguments, given on the command line after the program name. These are just shortcuts for configuration file commands.
The following section is a technical (i.e., dull but important) section on the
Then there's documentation on all the configuration commands in the following categories. Analog has over 200 configuration commands and over 40 command line options, so sometimes these sections turn into lists of commands. But here's where you find out everything you can do with analog.
Later there's an index of all the commands and topics, and also a quick reference containing the syntax of all the commands and examples.
    LOGFILE my_logfile
    OUTFILE output.html
where, of course, you should substitute the names of the files you want to use. The logfile must be stored locally -- analog won't use FTP or HTTP to fetch it from the internet, so you may have to fetch it yourself first. You can read several logfiles by giving several logfile commands, or by giving a comma-separated list, or by using wild cards in the logfile name. So, for example, if you use the commands
    LOGFILE new1.log,old*.log
    LOGFILE new2.log
analog will analyse the logfiles new1.log, new2.log, and all the old logfiles. Analog will recognise logfiles in several different formats. You can read more about this in the section on Choosing a logfile.
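To make the combination of several commands, comma-separated lists and wild cards concrete, here is a minimal Python sketch. It is not analog's own code (analog is written in C): the function name expand_logfiles and the idea of matching against a supplied list of filenames, rather than a real directory, are assumptions made purely for illustration.

```python
from fnmatch import fnmatchcase

def expand_logfiles(specs, available):
    """Expand LOGFILE-style arguments against a list of filenames.

    specs     -- one string per LOGFILE command, e.g. "new1.log,old*.log"
    available -- filenames to match against (a hypothetical stand-in
                 for a directory listing)
    """
    chosen = []
    for spec in specs:
        for pattern in spec.split(","):      # comma-separated, no spaces
            for name in available:
                if fnmatchcase(name, pattern) and name not in chosen:
                    chosen.append(name)
    return chosen

files = ["new1.log", "new2.log", "old1.log", "old2.log", "notes.txt"]
print(expand_logfiles(["new1.log,old*.log", "new2.log"], files))
```

The sketch keeps each file only once, so a file matched by two patterns is not listed twice; whether analog itself does the same is not stated here.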
    HOSTNAME "Spam Widgets Inc."
    HOSTURL http://www.spam-widgets.com/
If you have broken images in the output instead of graphs, you need to say in which directory on your server the images are stored. You do this by a command like
    IMAGEDIR /analog/images/
(The images are distributed with the program - you will have to move them to whichever directory you choose.)
    MONTHLY ON        # one line for each month
    WEEKLY ON         # one line for each week
    FULLDAILY ON      # one line for each day
    DAILY ON          # one line for each day of the week
    HOURLY ON         # one line for each hour of the day
    GENERAL ON        # the General Summary at the top
    REQUEST ON        # which files were requested
    FAILURE ON        # which files were not found
    DIRECTORY ON      # Directory Report
    HOST ON           # which computers requested files
    ORGANISATION ON   # which organisations they were from
    DOMAIN ON         # which countries they were in
    REFERRER ON       # where people followed links from
    FAILREF ON        # where people followed broken links from
    SEARCHQUERY ON    # the phrases and words they used...
    SEARCHWORD ON     # ...to find you from search engines
    BROWSER ON        # which browsers people were using
    OSREP ON          # and which operating systems
    FILETYPE ON       # types of file requested
    SIZE ON           # sizes of files requested
    STATUS ON         # number of each type of success and failure
The referrer and browser reports will only appear if your server records the necessary information. You can configure lots of other things about each report, such as how many rows are listed, which columns are included, and how the reports are sorted. For example, the command
    REQINCLUDE pages
tells analog only to list pages, rather than all files, in the request report. You can read a summary of all the reports and the commands which control them in the section on Analog's reports.
    LANGUAGE FRENCH
will give you the output in French. The available languages at the moment are ARMENIAN, BOSNIAN, CATALAN, SIMP-CHINESE (GB2312 encoding), TRAD-CHINESE (Big5 encoding), CZECH, DANISH, DUTCH, ENGLISH, US-ENGLISH, FINNISH, FRENCH, GERMAN, GREEK, ICELANDIC, ITALIAN, JAPANESE, KOREAN, NORWEGIAN (Bokmål), NYNORSK, POLISH, PORTUGUESE, BR-PORTUGUESE, RUSSIAN, SERBIAN, SLOVAK, SLOVENE, SPANISH, SWEDISH, TURKISH and UKRAINIAN. See the section on Configuring the output for how to download, or even translate, new languages. As new languages are translated, they will be added to the analog home page.
As I said, these are only a few of the commands available. To find out about all the commands, you'll have to read the remaining sections of the Readme, starting with a short section on the syntax of configuration commands.
    CONFIGFILE other.cfg
The commands in the other configuration file are read immediately, in order. The program then continues reading the command line or calling configuration file where it left off. Note that reading an alternative configuration file does not stop the default configuration file (usually analog.cfg) being read as well. To do that you have to specify -G as well as the +g command. Also, note that reading in several configuration files does not produce several reports, but a single report based on all the options.
In the Mac version, you can start up a program with a particular configuration file instead of the default one by dragging the configuration file onto the analog icon. The file must start with a #.
You can also specify any configuration command on the command line even if it doesn't have a command line abbreviation, by use of the +C command. (NB The C must be upper case.) For example, +C"UNCOMPRESS *.gz gzcat" will include that command.
    DAILY OFF                      # We don't want a Daily Summary
    FULLDAILY "ON"                 # We want a full Daily Report instead
    HOSTNAME (Spam Widgets Inc.)   # Spaces, so quotes or brackets needed
    LOGFILE logfile1.log,\
    logfile2.log                   # This line and the previous one are one command
Generally later commands override earlier ones if you can have only one of that thing (e.g., for the OUTFILE), or supplement them if you can have several (e.g., for the LOGFILE, because you can read several logfiles). Apart from that, the order of commands doesn't matter, except that LOGFORMAT and LOGTIMEOFFSET commands must come earlier in the same configuration file than the LOGFILE to which they refer.
    analog -settings [other options]
or include SETTINGS ON in the configuration commands. That will tell you what the values of all the variables will be, based on the defaults in anlghead.h and anlghea2.h, the configuration commands, and the command line options. If you're on Unix or Windows, remember that you can send the output to a file with
    analog -settings > file
    LOGFILE logfilename
or just to put the logfile name on the command line without any arguments, e.g., analog logfilename. A - sign or the word stdin is interpreted as standard input: this is useful on Unix systems for constructing pipes. All logfiles must be within your computer's file system (on disk, or at least mounted under Unix, or on a mapped drive under NT) -- analog won't use FTP or HTTP to fetch them from the internet. In the Mac version, you can also analyse a particular single logfile by dragging it onto the analog icon.
You can have several LOGFILE commands. You can include wildcards in the logfile name (but not necessarily in the directory name: this is system-dependent), and you can use a list of logfiles separated by commas (without spaces). So the following commands would tell analog to read logfile1, c:\logs\logfile2, and all files ending in .log:
    LOGFILE logfile1,*.log
    LOGFILE c:\logs\logfile2
Or if you were on a Mac, you might use something like
    LOGFILE "Hard Drive:Internet Applications:Analog:Logs:*"
The LOGFILE commands are cumulative, except that any logfiles on the command line or in user-specified configuration files override any in the default configuration file, and are themselves overridden by any in the mandatory configuration file. There is also the special command
    LOGFILE none
which erases the list of logfiles specified so far.
If your logfile is not in one of the standard formats, you will probably still be OK, because it is possible to tell analog about other formats using a LOGFORMAT command. This is explained in the next section. But most users don't ever need to know about this because they have logfiles in a standard format. So the best thing to do is just to try analysing your logfile and see if analog will understand it. If it does, you don't need to worry about LOGFORMATs.
If analog can't understand your logfile, it will warn you that it can't detect the format, or possibly that it found a lot of corrupt lines. There are basically four reasons why this might happen:
    LOGFILE log1,log2 http://www.%v.mydomain.com
would translate a filename /file.html with virtual host host1 in log1 or log2 to http://www.host1.mydomain.com/file.html. If you are using the second argument to the LOGFILE command, you will probably want to use the SUBDIR command as well.
If %v is included in the argument and the logfile line doesn't have a virtual host, that line will be marked as corrupt. If VHOSTLOWMEM 3 is specified, the %v's will not be translated and will just appear as %v in the output.
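The translation described above can be pictured with a short Python sketch. The function name translate and its arguments are hypothetical; it only illustrates the substitution of %v and the "corrupt line" case, and ignores the VHOSTLOWMEM setting.

```python
def translate(prefix, vhost, filename):
    """Return the translated name, or None if the line would be corrupt.

    prefix   -- second argument of the LOGFILE command
    vhost    -- virtual host recorded on the logfile line, or None
    filename -- filename recorded on the logfile line
    """
    if "%v" in prefix:
        if not vhost:
            return None              # %v wanted, but no virtual host logged
        prefix = prefix.replace("%v", vhost)
    return prefix + filename

print(translate("http://www.%v.mydomain.com", "host1", "/file.html"))
```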
    UNCOMPRESS *.gz,*.Z /usr/bin/gzcat
whereas on Windows NT, you might use
    UNCOMPRESS *.gz ("c:\Program Files\gzip\gzip" -cd)
This would be a suitable command to include in the default configuration file.
If analog determines when it starts to uncompress a logfile that that file isn't wanted for the analysis, two undesirable things can happen. Either the program might pause until the logfile is fully uncompressed, or there might be a "broken pipe" error reported. This is system dependent, and out of analog's control.
The common logfile format is written by most servers. Its lines look like
    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
(except all on one line). Some versions of Microsoft software have a buggy version of this with an extra quote mark before the HTTP like this:
    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ "HTTP/1.0" 200 1243
Analog will understand these, but (as with any two formats) it will reject lines if the format changes half way through.
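For readers who want to see the structure of a common format line spelled out, here is a minimal Python sketch that splits the two sample lines above into fields with a regular expression. It is an illustration only, not how analog parses lines; the names COMMON and parse_common are made up for the example. Unlike analog, it accepts the standard and the buggy variant mixed in one file.

```python
import re

# host, ident (ignored), user, [time], "method file HTTP/x.y", status, bytes.
# The optional quote before HTTP covers the buggy Microsoft variant.
COMMON = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) "?HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMMON.fullmatch(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["status"] = int(fields["status"])
    # servers write "-" when no bytes were sent
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

sample = ('jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] '
          '"GET /~sret1/ HTTP/1.0" 200 1243')
print(parse_common(sample))
```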
    [25/Dec/1998:17:45:35] http://www.site.com/ -> /~sret1/
and the browser (or agent) log looks like
    [25/Dec/1998:17:45:35] Mozilla/2.0 (X11; I; HP-UX A.09.05)
In the referrer log, the date can be omitted.
    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243 "http://www.site.com/" "Mozilla/2.0 (X11; I; HP-UX A.09.05)"
(except all on one line). If you are using the Apache server, you can generate this with the mod_log_config module, using the command
    LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
It is usually better to use the combined log than separate logs, because it stores more information in less space.
    192.64.25.41, -, 25/12/98, 17:45:35, W3SVC1, HOST1, 192.16.225.10, 2178, 303, 1243, 200, 0, GET, /~sret1/, -,
(except all on one line). However, the format is extremely badly designed, in that the date follows local conventions: in other words, in North America the above example would have the date 12/25/98 instead. Analog will diagnose which form the logfile is in if possible: but if both the day and the month are at most 12, there is no way to tell which format it is. In this case, it will advise you to use the command LOGFORMAT MICROSOFT-NA for North American date format, or LOGFORMAT MICROSOFT-INT for international date format. In some countries, the date will not be in either of these formats, in which case you need to write your own LOGFORMAT command.
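The ambiguity is easy to see in a few lines of Python. This is a sketch of the reasoning only, under the assumption that the dates have already been pulled out of the logfile as strings; the function name guess_date_style is invented for the example and says nothing about how analog performs its own diagnosis.

```python
def guess_date_style(dates):
    """Guess the date convention from strings such as "25/12/98".

    Returns "MICROSOFT-INT" (day first), "MICROSOFT-NA" (month first),
    or None when the dates seen so far cannot settle the question.
    """
    day_first = month_first = False
    for text in dates:
        first, second, _year = (int(part) for part in text.split("/"))
        if first > 12:
            day_first = True       # first number cannot be a month
        if second > 12:
            month_first = True     # second number cannot be a month
    if day_first and not month_first:
        return "MICROSOFT-INT"
    if month_first and not day_first:
        return "MICROSOFT-NA"
    return None                    # ambiguous (or contradictory)

print(guess_date_style(["05/06/98", "25/12/98"]))
print(guess_date_style(["05/06/98", "07/08/98"]))
```

A logfile covering only the first twelve days of a month gives no evidence either way, which is when the explicit LOGFORMAT command is needed.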
There are also various third-party extensions to the Microsoft format to include, for example, the browser and referrer. But they all do it in different ways, so analog can't automatically diagnose them, and again, you need to write a LOGFORMAT command for them.
    12/25/98 17:45:35 jay.bird.com host1 Server fred GET /~sret1/ http://www.site.com/ Mozilla/2.0 (X11; I; HP-UX A.09.05) 200 1243 2178
(except all on one line, and with the fields separated by tabs). It suffers from the same problem with ambiguous dates as the IIS logfile (above), so again you might have to use LOGFORMAT WEBSITE-NA or LOGFORMAT WEBSITE-INT, or even have to write your own LOGFORMAT command.
If analog finds that the header line is corrupt, it will usually tell you what was wrong with it. The most common problem is that you're not allowed the time without the date or vice versa -- in particular, having the date just at the top of the logfile is not sufficient; you must have it on each line. Microsoft servers produce extended logs with the date only at the top. But if the date changes during the logfile, the server doesn't then write a new date line. For this reason analog can't analyse such logfiles safely. There are some programs on the helper applications page to put the date on each line. If you already have such a logfile you might want to use one of these programs, but they have to assume that the date doesn't change during the logfile, so it would be safer to tell your server to log in a better format in future.
The extended log is described at http://www.w3.org/TR/WD-logfile.html. Its header line looks like
    #Fields: date time cs-uri
In the rest of the logfile, the fields can be separated by spaces or tabs. There is also Microsoft's attempt at the extended format -- unfortunately they didn't read the spec., so they didn't enclose the browser and referrer in quotes, they replaced spaces in the browser name with +'s, and they put the time taken to serve the request in milliseconds instead of seconds. And there is WebSTAR's attempt which is very nearly right except that they erroneously used the CS-HOST field as the client hostname instead of the server hostname. Analog will understand all of these versions.
Extended logs always record the time in GMT, so you will probably need to use a LOGTIMEOFFSET command to convert to your local timezone.
The WebSTAR format is described at http://www.starnine.com/webstar/docs/ws4manual.3f.html. It has a header line like
    !!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME
In the rest of the logfile, the fields are separated by tabs. The WebSTAR server also records the time in GMT, so again you will probably need to use a LOGTIMEOFFSET command to convert to your local timezone. Some other Mac servers also use the WebSTAR format, or something looking like it. Analog will understand these too.
Finally, the Netscape header line looks like
format=%Ses->client.ip% [%SYSDATE%] "%Req->reqpb.clf-request%" %Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
The basic command to specify a log format looks like
    LOGFORMAT format
-- we'll discuss what the formats can be in a minute. Or if you are using the Apache server, you will probably find it more convenient to use
    APACHELOGFORMAT format
instead.
The LOGFORMAT and APACHELOGFORMAT commands only apply to logfiles specified with a LOGFILE command later in the same configuration file. So you must put the LOGFORMAT above the LOGFILE to which it refers. This way, different logfiles can have different formats, like this:
    LOGFILE log0
    LOGFORMAT format1
    LOGFILE log1
    LOGFORMAT format2
    LOGFILE log2
    LOGFILE log3
In this example, log1 is in format1, log2 and log3 are in format2, and log0 isn't in either format -- analog will try and detect which format it's in.
    APACHELOGFORMAT (%h %l %u %t \"%r\" %s %b)
(The parentheses are needed because the argument contains spaces.) Analog understands all Apache log formats, with the exception that it won't parse Apache's "%...{format}t" construction for customised times: if you have this construction, you will have to use ordinary LOGFORMAT instead.
There are format words for all the built-in formats analog knows about. You might need one of these words if your logfile is in a standard format, but analog can't detect which format it's in for some reason; for example, maybe the first line is corrupt; or maybe analog can't tell whether you're using North American or international dates. So for example
    LOGFORMAT COMMON
will select common format; you can also have COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), WEBSITE-NA, WEBSITE-INT, MS-EXTENDED (Microsoft's attempt at extended format), WEBSTAR-EXTENDED (WebSTAR's version of extended format), MS-COMMON (a buggy version of common format in some versions of Microsoft software), NETSCAPE or WEBSTAR. All these formats were defined at the end of the previous section. You can also use the special word AUTO to return to automatic detection.
If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. (And even if it isn't in a standard format, if you're using the Apache web server, you will find APACHELOGFORMAT easier.)
The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows. Please note that these codes are case sensitive -- for example, %b is completely different from %B!
    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
(except all on one line) could be represented by the LOGFORMAT command
    LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b)
In other words, it's just the sample line but with the hostname replaced by %S, the username by %u etc. (The parentheses are needed because the argument contains spaces.) Or take another example: if you had lines which looked like
    Fri 25/12/98 5:45pm, /~sret1/, jay.bird.com, 200, 1243, http://www.site.com, Mozilla/2.0 (X11; I; HP-UX A.09.05)
(all on one line again), you could use the format
    LOGFORMAT (%j %d/%m/%y %h:%n%am, %r, %S, %c, %b, %f, %B)
    LOGFORMAT COMMON
    LOGFORMAT COMBINED
    LOGFILE log1
    LOGFORMAT (%j %d/%m/%y %h:%n%am, %r, %S, %c, %b, %f, %B)
    LOGFILE log2
    LOGFILE log3
log1 has lines in both common and combined format, whereas log2 and log3 have lines just in the format in the previous example.
If you specify several formats, analog tries to match each line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats.
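The "first format that matches wins" rule can be sketched in Python as follows. The regular expressions are rough stand-ins for the common and combined formats, written for this example; they are not analog's format strings, and the names FORMATS and classify are hypothetical.

```python
import re

# Rough regular expressions standing in for two log formats.
COMMON = r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-)'
COMBINED = COMMON + r' "[^"]*" "[^"]*"'

# Order matters: each line is tried against the formats from first to last.
FORMATS = [("COMBINED", re.compile(COMBINED)),
           ("COMMON", re.compile(COMMON))]

def classify(line, formats=FORMATS):
    """Return the name of the first format matching the whole line."""
    for name, regex in formats:
        if regex.fullmatch(line):
            return name
    return None                      # no format matched: a corrupt line

line = ('jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] '
        '"GET /~sret1/ HTTP/1.0" 200 1243')
print(classify(line))
```

If most lines are in combined format, listing it first means most lines succeed on the first attempt, which is the efficiency point made above.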
So let's go back to the first example:
    LOGFILE log0
    LOGFORMAT format1
    LOGFILE log1
    LOGFORMAT format2
    LOGFILE log2
    LOGFILE log3
Here log0 actually gets the default log format. If there are no DEFAULTLOGFORMAT commands, the default will be auto-detection. But if there are DEFAULTLOGFORMAT commands, even in another configuration file, that will be the format of log0.
The times you need to use the DEFAULTLOGFORMAT instead of the LOGFORMAT are if you want to change the format of logfiles which aren't given in a LOGFILE command -- for example, ones specified on the command line, or dragged onto the program icon on a Mac, or compiled in. It is also useful to use the DEFAULTLOGFORMAT if your logfiles are always in the same format, so that you don't have to worry about putting in enough LOGFORMATs in the right places.
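The way formats attach to later LOGFILE commands can be modelled in a few lines of Python. This is a sketch under an assumption inferred from the two examples above, namely that consecutive LOGFORMAT commands accumulate while a LOGFORMAT following a LOGFILE starts a fresh list; the function name assign_formats and the (keyword, argument) representation are made up for the illustration.

```python
def assign_formats(commands, default=("AUTO",)):
    """Work out which formats apply to each logfile.

    commands -- (keyword, argument) pairs in configuration-file order
    default  -- formats used by logfiles named before any LOGFORMAT
    """
    current = list(default)
    last_was_format = False
    result = {}
    for keyword, arg in commands:
        if keyword == "LOGFORMAT":
            if not last_was_format:
                current = []         # a new run of LOGFORMATs replaces the old
            current.append(arg)
            last_was_format = True
        elif keyword == "LOGFILE":
            result[arg] = tuple(current)
            last_was_format = False
    return result

example = [("LOGFILE", "log0"), ("LOGFORMAT", "format1"), ("LOGFILE", "log1"),
           ("LOGFORMAT", "format2"), ("LOGFILE", "log2"), ("LOGFILE", "log3")]
print(assign_formats(example))
```

Passing a different default plays the part of a DEFAULTLOGFORMAT command: it changes only the entry for log0.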
The "Unix time", %U, is always recorded in GMT. So you will probably need to use a LOGTIMEOFFSET command to convert to your local timezone. Also, it's just the integer part of the time, so if you have decimals you will have to use %U.%j.
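As an illustration of what such a conversion involves, here is a small Python sketch that takes a Unix time field with decimals, keeps the integer part, and shifts it from GMT to a local zone. The function name local_time is hypothetical, and expressing the offset in minutes is an assumption of this example, not a statement about the argument of LOGTIMEOFFSET.

```python
from datetime import datetime, timedelta, timezone

def local_time(unix_field, offset_minutes):
    """Convert a logged Unix time (GMT) to local wall-clock time.

    unix_field     -- text from the logfile, e.g. "914607935.75"
    offset_minutes -- local time minus GMT, in minutes (assumed unit)
    """
    seconds = int(unix_field.split(".")[0])    # keep only the integer part
    zone = timezone(timedelta(minutes=offset_minutes))
    return datetime.fromtimestamp(seconds, tz=zone)

print(local_time("914607935.75", 60).isoformat())
```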
The log formats which analog can handle are those which are known as instantaneously decipherable: in practice, this means that the character which terminates a string can never occur in the string. So for example, in common format, which looks like
    LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b)
if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem. There are a couple of other restrictions: if there is any date or time information, then the year, month, date, hour and minute must all be present: and the same information may not occur twice in the format (so you can't have both %m and %M, for example, because these both represent the month; make one of them a %j to have it ignored).
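The idea that a field ends at the first occurrence of the character that follows it can be shown with a toy matcher in Python. The template representation (a list of literal and field entries) and the name match_template are inventions for this sketch; it is not analog's LOGFORMAT parser and it assumes every field except the last is followed by a non-empty literal.

```python
def match_template(template, line):
    """Match a line against a list of ("lit", text) / ("field", name) items.

    Each field stops at the first occurrence of the first character of
    the literal that follows it. Returns a dict of fields, or None if the
    line does not fit (it would be counted as corrupt).
    """
    pos = 0
    found = {}
    for i, (kind, value) in enumerate(template):
        if kind == "lit":
            if not line.startswith(value, pos):
                return None
            pos += len(value)
        else:
            if i + 1 < len(template):
                stop = template[i + 1][1][0]   # terminating character
                end = line.find(stop, pos)
                if end == -1:
                    return None
            else:
                end = len(line)                # last field runs to the end
            found[value] = line[pos:end]
            pos = end
    return found if pos == len(line) else None

template = [("field", "host"), ("lit", " - "), ("field", "user"),
            ("lit", " ["), ("field", "rest")]
print(match_template(template, "jay.bird.com - fred [25/Dec/1998"))
print(match_template(template, "jay bird.com - fred [25/Dec/1998"))
```

In the second call the host stops at the first space, the literal " - " then fails to match, and the whole line is rejected, which is the behaviour described above.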
Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like
    http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
    LOGFORMAT (%f -> %*r)
A tip: sometimes it is more efficient to specify two or more adjacent fields to ignore with a single %j, as long as the whole group ends with a recognisable character. So common format is more efficiently specified as
    LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
-- in the date and time [25/Dec/1998:17:45:35 +0000], the seconds and the timezone can be ignored with a single %j, extending until the close-bracket.
Another tip: %j can also be used to ignore whole lines, rather than just fields analog doesn't use. For example, the extended log format ignores lines beginning with # by using
    LOGFORMAT #%j
and the Microsoft format ignores lines corresponding to FTP requests with
    LOGFORMAT (%*S, %*u, %m/%d/%y, %h:%n:%j, %j)
If those formats had not been used, the lines would have been incorrectly marked as corrupt.
    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%wHTTP%j" %c %b)
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b)

    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ "HTTP/1.0" 200 1243
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%w"HTTP%j" %c %b)
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b)

    jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243 "http://www.site.com/" "Mozilla/2.0 (X11; I; HP-UX A.09.05)"
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%wHTTP%j" %c %b "%f" "%B")
    LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b "%f" "%B")

    [25/Dec/1998:17:45:35] http://www.site.com/ -> /~sret1/
or
    http://www.site.com/ -> /~sret1/
    LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
    LOGFORMAT (%f -> %*r)

    [25/Dec/1998:17:45:35] Mozilla/2.0 (X11; I; HP-UX A.09.05)
    LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)

    192.64.25.41, -, 12/25/98, 17:45:35, W3SVC1, HOST1, 192.16.225.10, 2178, 303, 1243, 200, 0, GET, /~sret1/, -,
    LOGFORMAT (%S, %u, %m/%d/%y, %h:%n:%j, W3SVC%j, %j, %v, %T, %j, %b, %c, %j, %j, %r, %q,)
    LOGFORMAT (%*S, %*u, %m/%d/%y, %h:%n:%j, %j)

    192.64.25.41, -, 25/12/98, 17:45:35, W3SVC1, HOST1, 192.16.225.10, 2178, 303, 1243, 200, 0, GET, /~sret1/, -,
    LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC%j, %j, %v, %T, %j, %b, %c, %j, %j, %r, %q,)
    LOGFORMAT (%*S, %*u, %d/%m/%y, %h:%n:%j, %j)

    12/25/98 17:45:35 jay.bird.com host1 Server fred GET /~sret1/ http://www.site.com/ Mozilla/2.0 (X11; I; HP-UX A.09.05) 200 1243 2178
    LOGFORMAT (%m/%d/%y %h:%n:%j\t%S\t%v\t%j\t%u\t%j\t%r\t%f\t%j\t%B\t%c\t%b\t%T)

    25/12/98 17:45:35 jay.bird.com host1 Server fred GET /~sret1/ http://www.site.com/ Mozilla/2.0 (X11; I; HP-UX A.09.05) 200 1243 2178
    LOGFORMAT (%d/%m/%y %h:%n:%j\t%S\t%v\t%j\t%u\t%j\t%r\t%f\t%j\t%B\t%c\t%b\t%T)
    CASE INSENSITIVE
    CASE SENSITIVE
There are similar commands for usernames, if your logfile records these. By default, usernames are always case insensitive, but you can specify
    USERCASE SENSITIVE
to override this.
    DIRSUFFIX default.htm
(You can only have one DIRSUFFIX.) There are other built-in aliases for other items: for example, hostnames are converted to lower case at this point.
    FILEALIAS /football.html /soccer.html
    HOSTALIAS lion lion.statslab.cam.ac.uk
There is also the special command FILEALIAS none, which cancels any other file aliases which might have been specified.
The alias commands for the other items are called BROWALIAS, REFALIAS, USERALIAS and VHOSTALIAS. Only one alias is ever applied to any item. So after
    FILEALIAS /football.html /soccer.html
    FILEALIAS /soccer.html /brazil.html
the file /soccer.html would get translated to /brazil.html, but /football.html would only get translated to /soccer.html and would not see the second alias.
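The "only one alias" rule is the difference between a single lookup and a chain of lookups, as this small Python sketch shows. The function name alias_once is made up, and applying the first alias that matches is an assumption of the sketch (the example above never has two aliases matching the same name).

```python
def alias_once(name, aliases):
    """Apply at most one alias to a name.

    aliases -- (old, new) pairs in the order the commands were given
    """
    for old, new in aliases:
        if name == old:
            return new       # stop here: the result is not looked up again
    return name

aliases = [("/football.html", "/soccer.html"),
           ("/soccer.html", "/brazil.html")]
for name in ("/football.html", "/soccer.html", "/rugby.html"):
    print(name, "->", alias_once(name, aliases))
```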
You can also use wildcards (? and *) in alias commands. And on the right-hand side, you can use $1, $2 etc. to represent the parts of the original name matched by the *'s. (You can use $$ to get an actual $ on the right-hand side.) As a special abbreviation, if there is exactly one * on the left-hand side, then a * on the right-hand side can be used to represent $1. So, for example,
    FILEALIAS /*/football/* /soccer/
would translate /sport/football/rules.html to just /soccer/, but either of
    FILEALIAS /*/football/* /$1/soccer/$2
    # or
    FILEALIAS /sport/football/* /sport/soccer/*
would translate /sport/football/rules.html to /sport/soccer/rules.html.
Analog's *'s are un-greedy: if there are two possible ways of matching, the part of the expression on the left matches as little as possible. This is more often what you want. But it contrasts with Perl's regular expressions, for example. (Oh, two consecutive *'s are completely useless, but if you try it they are collapsed into one before counting the $1, $2, etc.)
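The wildcard rules above can be illustrated with a short Python sketch. This is not analog's own code (analog is written in C); the function names wildcard_to_regex and apply_alias are hypothetical, and Python's non-greedy regular expressions are used only as a stand-in for analog's un-greedy * matching.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a wildcard pattern into an anchored regex (illustrative only).

    '*' becomes a non-greedy capture group, '?' matches any one character,
    and runs of consecutive '*' are collapsed into a single '*'.
    """
    pattern = re.sub(r"\*+", "*", pattern)
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("(.*?)")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")

def apply_alias(name, lhs, rhs):
    """Return the aliased name, or name unchanged if lhs does not match."""
    match = wildcard_to_regex(lhs).match(name)
    if match is None:
        return name
    groups = match.groups()
    # Abbreviation: with exactly one '*' on the left, '*' on the right means $1.
    if len(groups) == 1:
        rhs = rhs.replace("*", "$1")

    def expand(m):
        token = m.group(1)
        if token == "$":
            return "$"  # $$ gives a literal dollar sign
        index = int(token) - 1
        # Leave references to non-existent wildcards untouched.
        return groups[index] if 0 <= index < len(groups) else m.group(0)

    return re.sub(r"\$(\$|\d+)", expand, rhs)

print(apply_alias("/sport/football/rules.html", "/*/football/*", "/$1/soccer/$2"))
```

Because each * matches as little as possible, the pattern /*/* applied to /a/b/c gives $1 = a and $2 = b/c in this sketch.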
The behaviour of FILEALIAS and REFALIAS can be slightly unintuitive if the file has search arguments.
A warning to Unix users: if you put an ALIAS command on the command line with +C, the shell may try and expand $1 etc., which is not what you want. To stop the shell doing this, put the command in single quotes instead of double quotes.
    TYPEOUTPUTALIAS .txt ".txt (Plain text files)"
would provide an explanation of that line in the file type report.
There can be some confusion between some ALIAS and OUTPUTALIAS commands. For example, what is the difference between HOSTALIAS and HOSTOUTPUTALIAS? In fact, there are several differences, resulting from the different times at which the aliases are processed. The HOSTALIAS applies to the host items, but the HOSTOUTPUTALIAS only applies to the lines in the host report. This means that the HOSTALIAS also affects the other reports which use the hosts, such as the domain report, whereas the HOSTOUTPUTALIAS only affects the host report. Also the HOSTOUTPUTALIAS applies separately to each line of the host report. This means that if two separate hosts translate to the same thing in a HOSTALIAS command, they will become one host ever after. But if one were to use the same HOSTOUTPUTALIAS commands, there would be two hosts, which would just happen to have the same name in one report.
In summary, HOSTALIAS would normally be used if a single host had two different names, so might otherwise appear to be two hosts, whereas HOSTOUTPUTALIAS would normally be used to annotate or clarify the host report.
The full list of output aliases is REQOUTPUTALIAS, REDIROUTPUTALIAS, FAILOUTPUTALIAS, TYPEOUTPUTALIAS, DIROUTPUTALIAS, HOSTOUTPUTALIAS, DOMOUTPUTALIAS, ORGOUTPUTALIAS, REFOUTPUTALIAS, REFSITEOUTPUTALIAS, REDIRREFOUTPUTALIAS, FAILREFOUTPUTALIAS, BROWOUTPUTALIAS, FULLBROWOUTPUTALIAS, OSOUTPUTALIAS, VHOSTOUTPUTALIAS, USEROUTPUTALIAS and FAILUSEROUTPUTALIAS.
There is one known bug with OUTPUTALIAS. The report is sorted before the OUTPUTALIAS is applied. This means that if the SORTBY for the report is set to ALPHABETICAL, then the report will not be sorted correctly.
You include regular expressions in an ALIAS command by prefixing the left-hand side of the alias with "REGEXP:". Or you can specify a case-insensitive match, like Unix egrep -i, by using "REGEXPI:". (It's automatically case-insensitive for many items, such as hostnames, or filenames if you have specified CASE INSENSITIVE.)
On the right-hand side of the alias you can use $1, $2 etc. to represent the first, second etc. bracketed expression on the left-hand side, counting in order of the left brackets. (Again, you can't put $1, $2 etc. on the command line unless you put them in single quotes.)
Regular expressions match if they match just part of the string. If you want them to have to match the whole of the string, you have to anchor them to the ends of the string with ^ and $.
For example,
    REQOUTPUTALIAS REGEXP:^(/~(.+?)/.*) "[$2] $1"
would translate
    /~sret1/backgammon/rules.html
to
    [sret1] /~sret1/backgammon/rules.html
in the Request Report. Or
    HOSTALIAS REGEXP:^([^.]*)$ $1.mycompany.com
would add .mycompany.com to all hostnames not containing a dot. (See the FAQ for a discussion about whether this is a good idea.)
Regular expressions are greedy: if there are two possible ways of matching, the part of the expression on the left matches as much as possible.
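As a rough illustration of the REGEXP: aliases above, here is a Python sketch using the standard re module. It is only an approximation: the function name apply_regexp_alias is hypothetical, Python's regular expression dialect may differ in details from the library analog uses, and the sketch assumes that a successful match replaces the whole name with the right-hand side.

```python
import re

def apply_regexp_alias(name, pattern, rhs, ignore_case=False):
    """Rewrite name as rhs if pattern matches any part of it (a sketch)."""
    flags = re.IGNORECASE if ignore_case else 0  # REGEXPI: behaviour
    # search() succeeds on a partial match; use ^ and $ to anchor.
    match = re.search(pattern, name, flags)
    if match is None:
        return name
    # $1, $2, ... refer to bracketed groups, counted by their left brackets.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))) or "", rhs)

print(apply_regexp_alias("/~sret1/backgammon/rules.html",
                         r"^(/~(.+?)/.*)", "[$2] $1"))
print(apply_regexp_alias("lion", r"^([^.]*)$", "$1.mycompany.com"))
```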
    HOSTEXCLUDE mycomputer.myisp.com
would exclude all requests by that computer from the statistics.
The rule for determining whether an item is included or excluded is as follows. All the INCLUDE and EXCLUDE commands for that item are considered one by one in order, and the item is included or excluded according to the last command it matched. Items which don't match any of the INCLUDE or EXCLUDE commands are included if the first command was an exclusion, and excluded if the first command was an inclusion. For example, the configuration
    FILEINCLUDE /~sret1/*
    FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
    FILEINCLUDE /~sret1/backgammon/*.gif
would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,
    FILEEXCLUDE /~sret1/*/img/*
would analyse all files, except for images in my various directories. (If you get confused with all the inclusions and exclusions, remember that you can always use SETTINGS ON to see what the options you have specified represent.) Note that inclusions and exclusions can contain any number of wildcards.
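The inclusion rule described above (the last matching command wins, and unmatched items take the opposite of the first command) can be sketched in Python as follows. The function is_included is hypothetical, and fnmatchcase from the standard library stands in for analog's own wildcard matching, which it only approximates.

```python
from fnmatch import fnmatchcase

def is_included(item, commands):
    """Decide whether item is included.

    commands is a list of ("INCLUDE" or "EXCLUDE", "pat1,pat2,...") pairs,
    in the order in which they appear in the configuration.
    """
    if not commands:
        return True
    # Items matching no command are included only if the first command
    # was an exclusion.
    included = commands[0][0] == "EXCLUDE"
    for action, patterns in commands:
        if any(fnmatchcase(item, p) for p in patterns.split(",")):
            # Later matches override earlier ones.
            included = action == "INCLUDE"
    return included

config = [
    ("INCLUDE", "/~sret1/*"),
    ("EXCLUDE", "/~sret1/backgammon/*,/~sret1/analog/*"),
    ("INCLUDE", "/~sret1/backgammon/*.gif"),
]
for path in ["/~sret1/index.html", "/~sret1/backgammon/rules.html",
             "/~sret1/backgammon/board.gif", "/~steve/index.html"]:
    print(path, is_included(path, config))
```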
The full list of these commands is HOSTINCLUDE and HOSTEXCLUDE; FILEINCLUDE and FILEEXCLUDE; BROWINCLUDE and BROWEXCLUDE; REFINCLUDE and REFEXCLUDE; USERINCLUDE and USEREXCLUDE; VHOSTINCLUDE and VHOSTEXCLUDE; and STATUSINCLUDE and STATUSEXCLUDE.
Because the inclusions and exclusions take place after the aliasing, the name you must use is the aliased name. (In the absence of OUTPUTALIAS commands, this is the name of the item in the output.)
Sometimes a line doesn't contain a particular sort of item, either because there is no field reserved for it on the line, or because the browser didn't send it for that request. You can include or exclude these lines by making a special blank entry in the INCLUDE or EXCLUDE command. For example,
    USERINCLUDE jim
    USERINCLUDE ""
would include lines from user jim and lines without any user specified.
The behaviour of REQINCLUDE and REFINCLUDE can be slightly unintuitive if the file has search arguments.
You can also use regular expressions for the inclusions and exclusions by prefixing the expression with "REGEXP:" or "REGEXPI:". I've already described this at length in the context of aliases, so you can look there for all the details.
    STATUSINCLUDE 200-206,304,500-
would mean only look at lines with status codes 200-206, 304 or 500-599.
(On the subject of status codes, analog by default counts code 304 (Not Modified) as a successful request, because it assumes that the cached version of the document is then presented to the user. Some people think it should be a redirected request though, and this is configurable with the command
    304ISSUCCESS OFF
Again, if you don't understand this, stick with the default.)
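The range syntax of STATUSINCLUDE can be illustrated with a small Python sketch. The function names are hypothetical, and the only rule taken from the text is that 500- means 500-599; other forms of open-ended range are not handled here.

```python
def parse_status_ranges(spec):
    """Parse a list such as "200-206,304,500-" into (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            # An open-ended range such as "500-" runs to 599.
            ranges.append((int(low), int(high) if high else 599))
        else:
            ranges.append((int(part), int(part)))
    return ranges

def status_matches(code, spec):
    """True if the status code falls in any of the listed ranges."""
    return any(low <= code <= high for low, high in parse_status_ranges(spec))

print(parse_status_ranges("200-206,304,500-"))
```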
    FROM 990701
    TO 000630
Alternatively, each of the components can be preceded by + or - to represent time relative to the time at which the program was invoked. In this case, the date can have more than 2 digits. This allows constructions like
    FROM -01-00+01         # from tomorrow last year
    TO   -00-0131          # to the end of last month (OK even if last month
                           # didn't have 31 days)
    FROM -00-00-112
    TO   -00-00-01         # statistics for the last 16 weeks
    FROM -00-00-00:-06+01  # statistics for the last 6 hours
There are command line abbreviations +F and +T for the FROM and TO commands; for example, +T-00-00-01:1800 looks at statistics until 6pm yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.
    REFREPEXCLUDE http://your.site.com/*
would exclude your internal referrers from the Referrer Report. However, it would not exclude them from the Failed Referrer Report, the Referring Site Report, etc. (you need to use FAILREFEXCLUDE, REFSITEEXCLUDE etc. for that); nor would it prevent other analysis of logfile lines with those referrers, as REFEXCLUDE would. Also REFREPEXCLUDE would include the referrers in the "not listed" line at the bottom of the report.
The full list of these commands is REQINCLUDE and REQEXCLUDE; REDIRINCLUDE and REDIREXCLUDE; FAILINCLUDE and FAILEXCLUDE; TYPEINCLUDE and TYPEEXCLUDE; DIRINCLUDE and DIREXCLUDE; HOSTREPINCLUDE and HOSTREPEXCLUDE; DOMINCLUDE and DOMEXCLUDE; ORGINCLUDE and ORGEXCLUDE; REFREPINCLUDE and REFREPEXCLUDE; REFSITEINCLUDE and REFSITEEXCLUDE; SEARCHQUERYINCLUDE and SEARCHQUERYEXCLUDE; SEARCHWORDINCLUDE and SEARCHWORDEXCLUDE; REDIRREFINCLUDE and REDIRREFEXCLUDE; FAILREFINCLUDE and FAILREFEXCLUDE; BROWSUMINCLUDE and BROWSUMEXCLUDE; FULLBROWINCLUDE and FULLBROWEXCLUDE; OSINCLUDE and OSEXCLUDE; VHOSTREPINCLUDE and VHOSTREPEXCLUDE; USERREPINCLUDE and USERREPEXCLUDE; and FAILUSERINCLUDE and FAILUSEREXCLUDE. The inclusion or exclusion applies to the unaliased name, if you are doing any output aliases.
You can also use the symbolic word pages in suitable INCLUDE and EXCLUDE commands; one very common command is
    REQINCLUDE pages
to include only pages in the request report.
    PAGEINCLUDE *.ps,*.ps.gz
    PAGEEXCLUDE /sret1.html
I.e., Postscript and gzipped Postscript are pages, but /sret1.html isn't. (If the file has search arguments, the PAGEINCLUDE and PAGEEXCLUDE are reckoned just on the part of the filename before the question mark.)
    LINKINCLUDE pages
would link to pages in the Request Report.
    /cgi-bin/script.pl?x=1&y=2
runs the /cgi-bin/script.pl program with arguments x=1 and y=2. (Sometimes the server records these arguments in a separate field in the logfile, but if so you can use the %q field in the LOGFORMAT command, and analog will translate the filename to the above format).
You can tell analog either to read or to ignore the arguments using the commands ARGSINCLUDE and ARGSEXCLUDE which we'll discuss in a minute. But by default, all arguments are read, and as this is usually what you want, you don't usually need those commands.
You don't always see the arguments in the reports, even if they're being read, because analog doesn't show them if there aren't enough of them. In order to see them, you have to set the corresponding ARGSFLOOR parameter low enough.
Also note that within a report, the search arguments are listed immediately under the file to which they refer. This temporarily interrupts the normal order of the files. It may be clearer if you turn the N column on.
The reason is that, for example, the command
    FILEINCLUDE /cgi-bin/script.pl
doesn't match the file /cgi-bin/script.pl?x=1&y=2. To match that, you would have to use something like
    FILEINCLUDE /cgi-bin/script.pl*
instead. Similarly
    FILEALIAS /cgi-bin/script.pl /script.pl
will change /cgi-bin/script.pl itself, but not /cgi-bin/script.pl?x=1&y=2. You might want to use something like
    FILEALIAS /cgi-bin/script.pl?* /script.pl?$1
as well. (However, PAGEINCLUDE and PAGEEXCLUDE always refer to the part of the filename before the question mark.)
    ARGSEXCLUDE /cgi-bin/script.pl
were given, analog would ignore the arguments to that file, and so read /cgi-bin/script.pl?x=1&y=2 as just /cgi-bin/script.pl. On the other hand, if
    ARGSINCLUDE /cgi-bin/script.pl
were specified, analog would read the arguments, and so treat /cgi-bin/script.pl?x=1&y=2 as a different file from /cgi-bin/script.pl. REFARGSINCLUDE and REFARGSEXCLUDE are the same for referrers.
Technical note: the check for whether the arguments should be included happens before the filename has been subject to either built-in or user-specified aliases. So you have to use the unaliased name, exactly as it occurs in the logfile. For example, ARGSINCLUDE /~sret1/script.pl won't match /%7Esret1/script.pl even though they are really the same file. It also means that you can't use "pages" in the ARGSINCLUDE or ARGSEXCLUDE command, because we don't know whether a file is a page until after it's been aliased.
    http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&q=carrot+cake
The search term is in the field q= so the appropriate SEARCHENGINE command is
    SEARCHENGINE http://www.altavista.com/cgi-bin/query q
or even better
    SEARCHENGINE http://*altavista.*/* q
to allow for all their mirror sites in different countries.
Sometimes a search engine has two or more possible fields for the search term. In that case you can list all of them separated by commas, like this:
SEARCHENGINE http://*webcrawler.*/* search,searchText
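To show what a SEARCHENGINE command has to achieve, here is a Python sketch that pulls the search term out of a referrer. It is not analog's implementation: the function search_term is hypothetical, fnmatchcase only approximates analog's wildcards, and the sketch assumes the URL pattern is compared with the part of the referrer before the question mark.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qs

def search_term(referrer, engines):
    """Return the search term in referrer, or None if no engine matches.

    engines is a list of (url_pattern, "field1,field2") pairs, mirroring
    the two arguments of the SEARCHENGINE command.
    """
    base, _, query_string = referrer.partition("?")
    query = parse_qs(query_string)  # also turns '+' into a space
    for pattern, fields in engines:
        if not fnmatchcase(base, pattern):
            continue
        for field in fields.split(","):
            if field in query:
                return query[field][0]
    return None

engines = [
    ("http://*altavista.*/*", "q"),
    ("http://*webcrawler.*/*", "search,searchText"),
]
print(search_term(
    "http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&q=carrot+cake",
    engines))
```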
I said previously that %7E in a URL is automatically converted to ~, etc. In fact this is only done to the ASCII-printable characters %20-%7E (because these are the only characters that are the same in every character set).
But in the Search Query Report and Search Word Report it is useful to be able to convert non-ASCII characters too, so that you can see the actual words people typed, rather than get the %nm codes in place of all accented letters. So in these reports analog also converts characters %A0-%FF (if you are using an ISO-8859-* character set) or %80-%FF (for other character sets, apart from ASCII).
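The two conversion ranges can be sketched in Python as below. The function convert_escapes is hypothetical, and for the extended range the sketch assumes ISO-8859-1, so that each byte %A0-%FF maps directly to the character with the same code.

```python
import re

def convert_escapes(text, search_report=False):
    """Convert %nm escapes in text, leaving other escapes untouched."""
    def decode(m):
        value = int(m.group(1), 16)
        if 0x20 <= value <= 0x7E:
            return chr(value)  # ASCII-printable: always converted
        if search_report and 0xA0 <= value <= 0xFF:
            return chr(value)  # ISO-8859-1 assumed for this range
        return m.group(0)      # anything else stays as %nm
    return re.sub(r"%([0-9A-Fa-f]{2})", decode, text)

print(convert_escapes("/%7Esret1/"))
print(convert_escapes("caf%E9"))
print(convert_escapes("caf%E9", search_report=True))
```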
However, there are reasons why you might not want this feature, and you can turn it off with the command
    SEARCHCHARCONVERT OFF
These reasons include:
There are 32 different reports which analog can produce, if your logfiles contain the necessary information. Each one has a short name, and a code letter or number, as follows:
    x  GENERAL       General Summary
    m  MONTHLY       Monthly Report
    W  WEEKLY        Weekly Report
    D  FULLDAILY     Daily Report
    d  DAILY         Daily Summary
    H  FULLHOURLY    Hourly Report
    h  HOURLY        Hourly Summary
    4  QUARTER       Quarter-Hour Report
    5  FIVE          Five-Minute Report
    S  HOST          Host Report
    Z  ORGANISATION  Organisation Report
    o  DOMAIN        Domain Report
    r  REQUEST       Request Report
    i  DIRECTORY     Directory Report
    t  FILETYPE      File Type Report
    z  SIZE          File Size Report
    P  PROCTIME      Processing Time Report
    E  REDIR         Redirection Report
    I  FAILURE       Failure Report
    f  REFERRER      Referrer Report
    s  REFSITE       Referring Site Report
    N  SEARCHQUERY   Search Query Report
    n  SEARCHWORD    Search Word Report
    k  REDIRREF      Redirected Referrer Report
    K  FAILREF       Failed Referrer Report
    B  FULLBROWSER   Browser Report
    b  BROWSER       Browser Summary
    p  OSREP         Operating System Report
    v  VHOST         Virtual Host Report
    u  USER          User Report
    J  FAILUSER      Failed User Report
    c  STATUS        Status Code Report
For details on what the various reports mean, and a summary of the commands which control them, see the section on Analog's reports.
    FIVE OFF
    REFSITE ON
or by using command line arguments like -5 and +s. You can also turn all reports except the General Summary on or off with the commands ALL ON and ALL OFF, or with the command line arguments +A and -A.
You can turn the "Go To" lines in the report off with the command
    GOTOS OFF
GOTOS ON turns them on again, and GOTOS FEW puts the "Go To" lines just at the top and bottom. GOTOS OFF can be abbreviated with the -X command line argument, and GOTOS ON with +X.
You can turn off the "Program started at" line at the top of the report, and the "Running Time" line at the bottom, with the command
    RUNTIME OFF
and turn them on again with RUNTIME ON.
The figures in parentheses in the General Summary are for the last seven days: either the seven days before the TO time, or if no TO time is given, the seven days before the time of the program start. The figures for the last seven days are normally included if some, but not all, of the requests fall in those seven days; but you can turn them off by means of the command
    LASTSEVEN OFF
Of course LASTSEVEN ON turns them on again.
You can change the order of the reports by means of the REPORTORDER command. You should list the code letters for all possible reports in the order you want them. Non-alphanumeric characters are ignored and so can be used as separators. For example,
REPORTORDER x-mdDhH45W-cPz-ritEI-SZo-sNnfKk-uJ-v-bBp
    OUTFILE stats.htm
or with a command line argument like +Ostats.htm. If you use the filename - or stdout, the output will go to standard output, which is normally the screen, but Unix users might like to redirect it to another file or even into a pipe. You can also use an absolute path name, like
    OUTFILE /usr/bin/httpd/htdocs/stats.html                       # Unix
    OUTFILE "Hard Disk:Server Apps:WebSTAR:Analog:Report.html"     # Mac
Sometimes it's convenient to include the date in the name of the OUTFILE. You can do this by including the following codes in the filename.
    %D  date of month
    %m  month name
    %M  month number
    %y  two-digit year
    %Y  four-digit year
    %H  hour
    %n  minute
    %w  day of week
So for example,
    OUTFILE stats%y%M.html
will produce filenames like stats9905.html. The date used is the TO date if one was specified, and otherwise the time of the start of the program.
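The way these codes expand can be sketched in Python. The function expand_outfile is hypothetical and covers only the numeric codes; it assumes that %D, %M, %H and %n are zero-padded to two digits (the stats9905.html example confirms this only for %M), and it leaves %m and %w alone because their exact spelling is not given here.

```python
import re
from datetime import datetime

def expand_outfile(template, when):
    """Expand the numeric date codes in an OUTFILE name (a sketch)."""
    numeric = {
        "D": f"{when.day:02d}",         # date of month (padding assumed)
        "M": f"{when.month:02d}",       # month number
        "y": f"{when.year % 100:02d}",  # two-digit year
        "Y": f"{when.year:04d}",        # four-digit year
        "H": f"{when.hour:02d}",        # hour (padding assumed)
        "n": f"{when.minute:02d}",      # minute (padding assumed)
    }
    # %m (month name) and %w (day of week) are deliberately not expanded.
    return re.sub(r"%([DMyYHn])", lambda m: numeric[m.group(1)], template)

print(expand_outfile("stats%y%M.html", datetime(1999, 5, 17)))  # stats9905.html
```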
As well as a command like
    OUTPUT PLAIN
you can also select PLAIN style with the command line argument +a, and HTML with the command line argument -a. You can also specify OUTPUT NONE for no output, if you are producing a cache file.
    LANGUAGE FRENCH
will give you the output in French. The available languages at the moment are ARMENIAN, BOSNIAN, CATALAN, SIMP-CHINESE (GB2312 encoding), TRAD-CHINESE (Big5 encoding), CZECH, DANISH, DUTCH, ENGLISH, US-ENGLISH, FINNISH, FRENCH, GERMAN, GREEK, ICELANDIC, ITALIAN, JAPANESE, KOREAN, NORWEGIAN (Bokmål), NYNORSK, POLISH, PORTUGUESE, BR-PORTUGUESE, RUSSIAN, SERBIAN, SLOVAK, SLOVENE, SPANISH, SWEDISH, TURKISH and UKRAINIAN. As new languages are translated, they will be added to the analog home page.
The other way is to use the LANGFILE command. This is useful if you want to download a new language from the analog home page, or if you want to translate one yourself, or even if you want to change some words or phrases or the way the dates and times are formatted in the output. The LANGFILE command tells analog in which file to find the various words and phrases for a new language. For example, the command
    LANGFILE lang/guarani.lng
    # or
    LANGFILE /usr/etc/httpd/analog/lang/guarani.lng
would read from that file. (Note that you have to include the directory name if the file isn't in the directory or folder which you're running analog from. In particular, it's not assumed to be in the same directory as the other language files.)
Some languages also have domains files available. These are normally selected automatically by the LANGUAGE command. But you can tell analog to use a different domains file with the DOMAINSFILE command. Also, some languages have translations of the form interface.
If you want to translate another language, I would be delighted! You'd be wise to contact me first to make sure that no-one else is already translating the same language. The English language file contains some brief instructions for translating new languages.
You have to be careful using this command. Because of daylight savings time in operation in different parts of the world at different times, analog cannot attempt to convert between different timezones. So it's your responsibility to set the right offset for different times of year. For example, if you were in Chicago, but your server was recording time in GMT, you would need to specify two different time offsets, one of minus five hours for summer and one of minus six hours for winter. You would need to split your logfiles in the right places and then run commands like
    LOGTIMEOFFSET -300
    LOGFILE summer*.log
    LOGTIMEOFFSET -360
    LOGFILE winter*.log
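The arithmetic behind these offsets is simple, and a Python sketch may make it concrete. The function apply_log_time_offset is hypothetical; it takes the offset in minutes, as in the Chicago example, where -300 is five hours and -360 is six.

```python
from datetime import datetime, timedelta

def apply_log_time_offset(logged, offset_minutes):
    """Shift a logged timestamp by a LOGTIMEOFFSET-style number of minutes."""
    return logged + timedelta(minutes=offset_minutes)

# A GMT timestamp from a winter logfile, converted to Chicago time.
print(apply_log_time_offset(datetime(1998, 12, 25, 17, 45, 35), -360))
# A GMT timestamp from a summer logfile.
print(apply_log_time_offset(datetime(1998, 7, 4, 12, 0, 0), -300))
```

Note that an offset can move a request into the previous or next day, which is why the logfiles have to be split at the right places.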
There is also a related command called TIMEOFFSET. This tells analog how much to offset the time of the computer on which it is running (rather than the computer running the server), to get your local time.
    IMAGEDIR img/     # within the same directory as the output
    IMAGEDIR /img/    # off the root directory of your server
There are three commands which affect the top line of the output. First, the LOGO command allows you to replace the analog logo with another image (for example, your organisation's logo). You can say
    LOGO picture.gif             # for this file
    LOGO /images/picture2.gif    # a different file
    LOGO none                    # for no logo
The logo is assumed to be inside the IMAGEDIR unless it starts with a slash, or contains ://.
Then there are commands HOSTNAME and HOSTURL which affect the name and link at the end of the title line. For example, I might specify
    HOSTNAME "Stephen Turner"
    HOSTURL http://www.statslab.cam.ac.uk/~sret1/
to generate the title "Web Server Statistics for Stephen Turner". Again, you can use none as the HOSTURL to specify no link. Analog will normally translate characters in the hostname to HTML if necessary. So to include literal HTML, such as accented characters, in the output you need to precede them by a backslash, like this:
    HOSTNAME "M\&uuml;ller & S\&ouml;hne"
    HEADERFILE none
to cancel a previously-specified header file.
    STYLESHEET /housestyle.css
    STYLESHEET none    # to cancel it
Hint: a common mistake in writing style sheets is to declare a font-family for the body, but then not put <pre> sections back into a monospaced font. This stops the columns lining up properly. Your style sheet should contain a line like the following:
PRE, TT, CODE, KBD, SAMP { font-family: monospace }
    SEPCHAR " "
    REPSEPCHAR none
    DECPOINT ,
to make "three thousand and a quarter" look like "3 000,25" in text and "3000,25" in the reports.
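The effect of these three settings on a number can be sketched in Python. The function format_number is hypothetical; it groups the digits before the decimal point in threes, which is what the "3 000,25" example suggests, and an empty separator stands for none.

```python
def format_number(value, sepchar=",", decpoint=".", decimals=2):
    """Format a non-negative number with a thousands separator."""
    whole, _, frac = f"{value:.{decimals}f}".partition(".")
    groups = []
    while len(whole) > 3:
        groups.insert(0, whole[-3:])  # peel off three digits at a time
        whole = whole[:-3]
    groups.insert(0, whole)
    text = sepchar.join(groups)
    return text + decpoint + frac if frac else text

print(format_number(3000.25, sepchar=" ", decpoint=","))  # SEPCHAR " "
print(format_number(3000.25, sepchar="", decpoint=","))   # REPSEPCHAR none
```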
Each time report can contain columns listing the requests, requests for pages, and bytes transferred at that time, using the following code letters.
    HOURCOLS Pb
tells analog to include the number of page requests and percentage of the bytes, in that order, as the columns for the Hourly Summary. The other COLS commands are MONTHCOLS, WEEKCOLS, DAYCOLS (Daily Summary), FULLDAYCOLS (Daily Report), FULLHOURCOLS (Hourly Report), QUARTERCOLS and FIVECOLS. There is also a TIMECOLS command, which specifies that all the time reports are to have the specified columns.
    FULLDAYGRAPH P
tells analog to plot the bar charts in the Daily Report by the number of page requests. This also controls how analog decides which is the busiest time period in the bottom line of the report. Using a lower case letter tells analog to plot the bar charts with ASCII characters instead of the normal red bars. (This produces shorter output, and it is how they appear anyway in PLAIN and ASCII output styles, or when viewed with a non-graphical browser.) So, for example,
    FULLDAYGRAPH b
would plot the Daily Report by bytes, without using the graphics. The other GRAPH commands are MONTHGRAPH, WEEKGRAPH, DAYGRAPH, HOURGRAPH, FULLHOURGRAPH, QUARTERGRAPH and FIVEGRAPH. There's also an ALLGRAPH command to set all of them simultaneously.
    BARSTYLE a
    BARSTYLE b
    BARSTYLE c
    BARSTYLE d
    BARSTYLE e
    BARSTYLE f
    BARSTYLE g
    BARSTYLE h
The default style is b.
    MONTHBACK ON    # Monthly Report backwards
    WEEKBACK OFF    # Weekly Report forwards
The other BACK commands are FULLDAYBACK, FULLHOURBACK, QUARTERBACK and FIVEBACK. It tends to be confusing to mix directions (and analog will warn you if you attempt it) so usually you want to use the ALLBACK command which will set all of them at once.
    QUARTERROWS 96    # only the last day's worth
    MONTHROWS 0       # 0 means no restriction: show all time
The other ROWS commands are WEEKROWS, FULLDAYROWS, FULLHOURROWS and FIVEROWS. Even if a ROWS command is given, the line at the bottom of the report will still show the busiest time period ever, not just the busiest one in that many rows.
    MARKCHAR =
tells analog to use the equals sign.
There is a parameter called MINGRAPHWIDTH which sets the minimum nominal size of the graphs. For example, if you set
    MINGRAPHWIDTH 10
then the graph will be allowed to be up to 10 characters wide, even if that would exceed the PAGEWIDTH.
There is one more command which affects the time reports. You can specify which day should be counted as the first day of the week. This affects the layout of the Daily Report, Daily Summary and Weekly Report. For example, our local student newspaper publishes a new edition on the web every Friday, so they like to specify WEEKBEGINSON FRIDAY for their reports.
In the next section, we'll look at commands relating to the non-time reports.
First, these reports have COLS commands, just like the time reports. (See the section on Time reports for how to use these commands.) In the non-time reports, three additional columns are available, namely d for date of last access, D for date and time of last access, and N for the number of the item in the list. So, for example,
    REQCOLS NRD
counts the files in the Request Report, listing the number of requests for each and the time when each was last requested. The full list of COLS commands for non-time reports is HOSTCOLS, ORGCOLS, DOMCOLS, REQCOLS, DIRCOLS, TYPECOLS, SIZECOLS, PROCTIMECOLS, REDIRCOLS, FAILCOLS, REFCOLS, REFSITECOLS, SEARCHQUERYCOLS, SEARCHWORDCOLS, REDIRREFCOLS, FAILREFCOLS, FULLBROWCOLS (Browser Report), BROWCOLS (Browser Summary), OSCOLS, VHOSTCOLS, USERCOLS, FAILUSERCOLS and STATUSCOLS. Not every column is allowed in every report, but if you specify an illegal one, analog will warn you about it.
    HOSTSORTBY ALPHABETICAL
will sort the Host Report alphabetically. The other SORTBY commands are ORGSORTBY, DOMSORTBY, REQSORTBY, DIRSORTBY, TYPESORTBY, REDIRSORTBY, FAILSORTBY, REFSORTBY, REFSITESORTBY, SEARCHQUERYSORTBY, SEARCHWORDSORTBY, REDIRREFSORTBY, FAILREFSORTBY, FULLBROWSORTBY, BROWSORTBY, OSSORTBY, VHOSTSORTBY, USERSORTBY, FAILUSERSORTBY and STATUSSORTBY. Again, not every sort method is possible in every report, but you'll be warned if you choose an illegal one.
There is one known bug concerned with SORTBY ALPHABETICAL. The report is sorted before any OUTPUTALIAS is applied. This means that if an OUTPUTALIAS has been specified for the report, then the report will not be sorted correctly.
    DOMFLOOR 1000r       # all domains with at least 1000 requests
    DOMFLOOR 1000p       # at least 1000 requests for pages
    DOMFLOOR 1000000b    # at least 1,000,000 bytes transferred
    DOMFLOOR 1Mb         # at least 1 megabyte
    DOMFLOOR 0.5%r       # 0.5% of the requests (ditto %p and %b)
    DOMFLOOR 0.5:r       # 0.5% of the maximum number of requests
                         # for any domain (ditto :p and :b)
    DOMFLOOR 970701d     # last access since 1st July 1997
    DOMFLOOR -00-01-00d  # last access in last month (see
                         # documentation on FROM and TO commands)
    DOMFLOOR -100r       # domains with top 100 number of requests
                         # (ditto -100p, -100b, -100d)
The other FLOOR commands are HOSTFLOOR, ORGFLOOR, REQFLOOR, DIRFLOOR, TYPEFLOOR, REDIRFLOOR, FAILFLOOR, REFFLOOR, REFSITEFLOOR, SEARCHQUERYFLOOR, SEARCHWORDFLOOR, REDIRREFFLOOR, FAILREFFLOOR, FULLBROWFLOOR, BROWFLOOR, OSFLOOR, VHOSTFLOOR, USERFLOOR, FAILUSERFLOOR and STATUSFLOOR. Once again, not every floor method is legal for every report, but you'll be warned if you try and choose an illegal one.
There's one other command which affects the links in the Request Report. The command BASEURL prepends an additional string to the URLs in the target of the link. For example, after the command
    BASEURL http://www.statslab.cam.ac.uk
/~sret1/ will be linked to http://www.statslab.cam.ac.uk/~sret1/, not just to /~sret1/. This is very useful if you want to display the statistics on a different server from the server they refer to. If you want the file to be listed as http://www.statslab.cam.ac.uk/~sret1/, rather than just to be linked to that address, you need to use the second argument to the LOGFILE command instead.
In the next section, we'll look at commands for generating hierarchical reports, which are closely related to the commands in this section.
First, you need to be able to control what gets listed in the reports. For this you need to use the SUB family of commands. So, for example, the command SUBDIR /~sret1/* would ensure that the Directory Report would not only contain an entry for the sum of my files, but also one for each of my subdirectories, something like this:
    29,111: /~sret1/
    10,234:   /~sret1/analog/
     5,179:   /~sret1/backgammon/
    11,908: /~steve/
You can have more than one * in the command. For example
    SUBDOMAIN *.*
would list the whole Domain Report two levels deep.
If you specify a SUB command, all the intermediate levels are included automatically. So, for example, after
    SUBDOMAIN statslab.cam.ac.uk
cam.ac.uk and ac.uk will be included in the Domain Report too, and after *.*.ac.uk, *.ac.uk will be included.
Here are examples of the other four SUB commands:
    SUBTYPE *.gz                      # in the File Type Report
    SUBBROW */*                       # e.g. Mozilla/4 in the Browser Summary
    SUBBROW Mozilla/*.*               # add minor version numbers for Mozilla
    REFDIR http://search.yahoo.com/*  # Referring Site Report
    SUBORG *.aol.com                  # Organisation Report
    SUBORG *.*.com                    # Break down all .com's
The SUBDOMAIN command (but none of the others) can include a second argument describing the subdomain. For example
    SUBDOMAIN cam.ac.uk 'University of Cambridge'
Then that subdomain will be listed with its translation in the Domain Report. You can also have numerical subdomains: e.g.,
    SUBDOMAIN 131.111 'University of Cambridge'
If you sort the subdomains alphabetically, the numerical ones will also be sorted alphabetically, not numerically. I don't think this will cause any problems.
One other use for the SUBDIR command is if you have used the second argument to the LOGFILE command. Suppose you have translated files like /index.html into http://www.mycompany.com/index.html. Then the command
    SUBDIR http://*/*
would be appropriate to make the directory report look right.
A sub-item is listed in a hierarchical report only if it is above the sub-FLOOR, and it is included with a SUB command, and it is not excluded because of an INCLUDE or EXCLUDE command, and its immediate parent is listed. For example, specifying
    SUBDIR /*/*/
    SUBDIRFLOOR -3r
    SUBDIRSORTBY REQUESTS
would list the three subdirectories with most requests under each directory. SUBDIRFLOOR 1:r would have listed any subdirectory with at least 1% of the maximum number of requests of any top level directory.
The three file reports (Request Report, Redirection Report and Failure Report) and the three referrer reports (Referrer Report, Redirected Referrer Report and Failed Referrer Report) are not fully hierarchical, but they do list search arguments together under the file to which they refer (provided that the arguments have been read in: see the ARGSINCLUDE command). So they have similar sub-FLOOR and sub-SORTBY commands, namely REQARGSFLOOR, REDIRARGSFLOOR, FAILARGSFLOOR, REFARGSFLOOR, REDIRREFARGSFLOOR and FAILREFARGSFLOOR; and REQARGSSORTBY, REDIRARGSSORTBY, FAILARGSSORTBY, REFARGSSORTBY, REDIRREFARGSSORTBY and FAILREFARGSSORTBY. The same applies to the Operating System Report with its subdivisions of operating systems: it has SUBOSFLOOR and SUBOSSORTBY.
    DOMAINSFILE lang/mydomains.tab
Normally you don't need this command, because if there is a domains file in your language, it should be selected automatically. But the DOMAINSFILE command can be useful if you want to use a domains file in a new language, for example.
You should have got a domains file with the program, but if you've lost it, you can download one from http://www.analog.cx/ukdom.tab. It should contain on each line a domain code, followed by a number, followed by its location, like this:
    ad 2 Andorra
    ae 3 United Arab Emirates
    [...]
It does not need to be in alphabetical order, though humans may prefer it that way. Subdomains do not go in the domains file: you can list them in the Domain Report using the SUBDOMAIN command.
There are some problems with this. A few countries have organisations at both levels 2 and 3 (for example asaspace.at and univie.ac.at). In those cases I've favoured false negatives over false positives by using the bigger number. (Also there is a correction which will make most of them right again: the first component is always removed from a hostname of three or more components.) For other countries, I don't have enough information to tell what the level should be. I've just given those a 1. Do let me know if you have any more information, or corrections, for the numbers.
am Arm\énie
Only domains which occur in the domains file will get their own line in the Domain Report: the rest are probably spurious, and will be accumulated together as "unknown domains". If you have debugging turned on, you can see which domains were unknown.
Lines starting with a hash (#) in the domains file are considered to be comments.
    OUTPUT COMPUTER
This style is designed to be easy to read into spreadsheets, or post-process with graphics creation tools, for example.
Each line in the output is separated into fields by means of a special string. You can specify this string by means of the COMPSEP command; for example
COMPSEP ,
for CSV (comma separated value) format. Make sure not to use anything that might occur in the output: for example, a single or double space would not be suitable.
After that, there follows a field indicating the remaining columns in the report (using the letters RrPpBbDd as usual). In hierarchical reports (including the reports which can show search arguments) there is an additional column l at the beginning, indicating the level in the hierarchy.
Finally there are the numerical data for each column and then the name of the item. Times actually take up several fields: year, month, date, hour & minute, or as many of those as are necessary to identify the time.
So here is an example line from the Domain Report, showing the third-level domain cam.ac.uk with 43 requests and 3.516% of the bytes.
o lRb 3 43 3.516 cam.ac.uk
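To show how such a line might be taken apart by a post-processing script, here is a short Python sketch. It assumes the COMPSEP has been set to a comma, that the first field is a code identifying the report, and that each column letter corresponds to exactly one field. That last assumption means it does not handle the time columns, which as noted above can spread over several fields. The function name parse_computer_line is hypothetical.

```python
def parse_computer_line(line, sep):
    """Split one computer-readable output line into its parts."""
    fields = line.split(sep)
    # Assumption: field 0 is the report code, field 1 lists the column letters.
    report, columns = fields[0], fields[1]
    # One value per column letter (not valid for multi-field time columns).
    values = fields[2:2 + len(columns)]
    # Whatever remains is the item name.
    name = sep.join(fields[2 + len(columns):])
    return report, dict(zip(columns, values)), name


report, data, name = parse_computer_line("o,lRb,3,43,3.516,cam.ac.uk", ",")
```

Here data maps each column letter to its value as a string, so l is the level, R the requests and b the percentage of bytes.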
For most people, the cache file will not be needed: compressing the logfile using a standard compression utility such as gzip will be sufficient. Compressing a logfile is very efficient owing to the large number of repeated strings: I find about 12 times compression in practice. That in itself may solve your filespace problems, without needing to throw away any information.
The cache file is not the best format for post-processing the data or feeding it into a spreadsheet. For that you should use the computer readable output style.
If you are going to use the cache file feature, it is very important that you understand what is and what is not recorded. It is not possible to reconstruct everything of interest in the logfile from the cache file. The cache file does contain information about the total number of requests for each host and each file, but not about, for example, which files were read by which hosts. (To do so would take up as much disk space as the compressed logfile.) So you cannot later look at only one file and see which hosts read that file. Similarly, you cannot later restrict the files or hosts by date, using FROM and TO commands.
In summary, you should do all the inclusions and exclusions you want when you create the cache file. If you want different sets of inclusions and exclusions, you should create several cache files from the same logfile. You cannot later apply extra inclusions and exclusions accurately.
A couple of other minor points: the pattern of failed requests and redirected requests over time is not recorded in the cache file. So although the total number will still be correct, the number in the last 7 days can be under-reported subsequently. And times are only recorded to five-minute resolution.
CACHEOUTFILE none
to turn it off again. You will still get the regular output as well as the cache output, unless you request OUTPUT NONE. To avoid overwriting, you cannot set the CACHEOUTFILE to be a file which already exists. (Disclaimer: on some systems, race conditions may very occasionally thwart this check. Also on a few systems, making the file writeable but not readable will allow it to be overwritten). You can include the date in the name of the CACHEOUTFILE in the same way as described earlier for the OUTFILE.
You can read in a previously-made cache file with the CACHEFILE command, or with the +U command line option. As with the LOGFILE command, you can use commas and wild cards to read in several cache files, and read compressed cache files using the UNCOMPRESS mechanism. Note that if you don't want to read a logfile as well as the cache file, you will have to explicitly set the LOGFILE to none.
When analog reads in a cache file, it will respect inclusions and exclusions as far as it can, but it does not apply any more aliases to the items. (This is to avoid double-aliasing.) So you must do any aliases you want at the time you create the cache file. Similarly, it does not obey the LOGTIMEOFFSET variable, to avoid double-offsetting, so any offset you want must be applied at cache-creation time too.
Sometimes you don't want to record all the types of item in the cache file. You might want to forget about which hosts had accessed your web site, for example, and only remember how many times each file was requested. You can choose not to include one type of item in the cache file by setting its LOWMEM to 3; for example, specify
HOSTLOWMEM 3
to exclude hosts from the cache file. Because this is a serious step, analog will produce a warning if you do this. You can even set all six LOWMEMs to 3 if you just want to remember the pattern of requests over time, not even which files were requested.
I prefer to make a separate cache file from each logfile, in case something goes wrong with one of them, rather than a single cache file combining several logfiles, or a single cache file combining an old cache file and a logfile.
Unfortunately DNS lookups are typically very slow, because your computer has to ask across the network to find out the names of the hosts. For this reason, analog saves the addresses it has looked up in a file, so that you don't have to look them up again next time. (Even so, you may find the DNS lookups too slow to be usable.) The file is specified by a command like
DNSFILE dnsfile.txt
You will still need to use one of the commands in the next paragraph in order to actually use the file.
There are four possible levels of DNS activity. If you specify DNS NONE, no numerical addresses will be resolved. If you specify DNS READ, then analog will read the DNS file for old lookups, but no new lookups will take place. This mode is suitable if you are running analog while not connected to the internet. The third level is DNS WRITE. This reads the old file, looks up new addresses, and adds them to the file. (The first time you use DNS WRITE, you will get a missing-file warning as it tries to read the old file, but it will exist the next time.) The final level is DNS LOOKUP. This reads the old file and looks up new addresses, but doesn't add the new addresses to the file, so that they will not be remembered for next time. This is not normally a level that the user wants to specify, but analog will switch to this behaviour if DNS WRITE fails for some reason.
If you are using a HOSTEXCLUDE command, you need to exclude the numerical IP address if it can't be resolved, or the name if it can. In other words, exclude whatever the host is known as in the report.
DNSLOCKFILE filename
Of course you should make sure that all copies of analog use the same lock file, at least if they have the same DNS file! If analog crashes, it may not clear up the lock file, so in that case you may have to delete it yourself. (Disclaimer: on some systems, race conditions may occasionally thwart this mechanism, but this is very unlikely.)
Analog never deletes anything from the DNS file: this means that the DNS file will grow, and can become quite large. You should delete the top of it every so often.
There are two parameters which say how long to trust old lookups for. If you set
DNSGOODHOURS 672
for example, then successful lookups will be checked again after 672 hours (4 weeks). You can also set the DNSBADHOURS similarly, to check failed lookups again after a certain time.
Finally, there is a debugging command, DEBUG +D, to show all the DNS lookups that analog is making.
timestamp IP_address name
where the timestamp is the number of minutes since the beginning of 1970, GMT (i.e., "Unix time" divided by 60), and the name is just * if the address couldn't be resolved.
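If you want to inspect the DNS file with a script of your own, the format above is easy to decode. The following Python sketch is only an illustration: the function names parse_dns_line and is_stale are invented, the fields are assumed to be separated by whitespace, and the sample line is made up rather than taken from a real DNS file. The staleness test mimics the idea behind DNSGOODHOURS, but whether analog treats a lookup of exactly that age as stale is not something this sketch claims to know.

```python
from datetime import datetime, timedelta, timezone

# The timestamp counts minutes from the beginning of 1970, GMT.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_dns_line(line):
    """Return (lookup time, IP address, name or None) for one DNS-file line."""
    stamp, address, name = line.split()
    looked_up = EPOCH + timedelta(minutes=int(stamp))
    # A name of * means the address could not be resolved.
    return looked_up, address, (None if name == "*" else name)


def is_stale(looked_up, now, max_age_hours):
    """True if the lookup is older than max_age_hours (e.g. 672 for 4 weeks)."""
    return now - looked_up > timedelta(hours=max_age_hours)


# Made-up sample line: 15778080 minutes after the start of 1970.
looked_up, address, name = parse_dns_line("15778080 192.0.2.1 *")
```

In this sample, 15778080 minutes is 10957 days, which puts the lookup at the start of 1 January 2000, GMT.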
Recall what happens to an item when it has been read in. First it is aliased. Secondly, it is checked to see whether it is included or excluded. Then finally, if all the items are wanted, one request is added to its score.
Normally the name of the item is saved before the aliasing takes place. This avoids analog having to do the aliasing again next time the same item is encountered. But this can take up more memory than necessary. So there is a family of LOWMEM commands provided, which tell analog to record the name at a later stage, or even not at all. If you use these commands, analog will have to do a bit more work than normal, but it will use less memory. On most sites, the hosts take up most of the memory, so I'll use the HOSTLOWMEM command as an example.
The command
HOSTLOWMEM 0
represents the normal case, when the hostname is recorded before being aliased. If you specify
HOSTLOWMEM 1
instead, then the hostname is not recorded until after the aliasing. If you specify
HOSTLOWMEM 2
then the name is not recorded until after the inclusion and exclusion lookup has been done as well. And finally, if you give the command
HOSTLOWMEM 3
then the hostname is not saved at all, and the Host Report will not be constructed, even if you've asked for it. (The Domain Report can still be constructed though.) The analogous commands for the other items are FILELOWMEM, BROWLOWMEM, REFLOWMEM, USERLOWMEM and VHOSTLOWMEM.
First, remember the option we mentioned before, to list the current settings of all of analog's variables. To get this, just put -settings on the command line, or SETTINGS ON in one of your configuration files, along with your other commands. Then analog will produce the list of settings instead of running in the normal way.
DEBUG ON
you get all the debugging. (And DEBUG OFF turns it all off.) You can also get just certain categories of debugging. The categories are
DEBUG FS
would give you information about file opening and closing, and what was in each logfile, but none of the other sorts of debugging. Each line of debugging information is prepended with its code letter. You can also specify
DEBUG +CD
to add C and D category debugging to whatever you've already got, and
DEBUG -CD
to remove those two categories.
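The way successive DEBUG commands combine can be pictured as operations on a set of category letters. The Python sketch below models that reading of the text. It is not analog's own code, the function name apply_debug is invented, and the string used for "all categories" is only a placeholder containing the letters that appear in the examples here, not the real list.

```python
def apply_debug(current, arg, all_categories):
    """Return the set of debugging categories after the command 'DEBUG arg'."""
    if arg == "ON":
        return set(all_categories)       # everything on
    if arg == "OFF":
        return set()                     # everything off
    if arg.startswith("+"):
        return set(current) | set(arg[1:])   # add to what you've already got
    if arg.startswith("-"):
        return set(current) - set(arg[1:])   # remove those categories
    return set(arg)                      # select exactly these categories


# Placeholder only: just the letters used in the examples in this section.
ALL = "CDFS"

state = apply_debug(set(), "FS", ALL)    # exactly F and S
state = apply_debug(state, "+CD", ALL)   # F, S plus C and D
state = apply_debug(state, "-CD", ALL)   # back to F and S
```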
There is also a command line abbreviation for this command. Use +V (for ON), -V (for OFF), +VFS (to select exactly options FS), +V+FS (to add those options), and +V-FS (to remove them).
The C messages actually come on two lines. The first line gives the logfile line which was corrupt. The second line indicates where analog first noticed a problem. (This is usually, but not always, close to where the problem actually was!) In fact, each "line" of the message may spread over more than one line on your screen, and you have to be careful to take that into account when trying to find out where the logfile line was corrupt.
There is also a command line version of the WARNINGS command, looking like +q, -q, +q<options>, +q+<options> or +q-<options>.
PROGRESSFREQ 20000 # say
then analog will produce a little message after every 20,000 lines it reads from the logfile. This is useful to determine whether the program has really stopped or (as is more likely) is just being slow for some reason (such as using DNS lookups).
ERRFILE newfile
If you do this, analog will warn you that it's redirecting the messages, just so that you don't miss any. To change back to standard error, use
ERRFILE stderr
The ERRFILE command will erase any previous contents of that file. (So don't use the same ERRFILE command twice, or you may lose messages!)
ERRLINELENGTH 0
specifies an unlimited screen width.
Important: For security reasons, you must not attempt to run analog itself as a CGI program, or even leave it in the directory or folder with your web files or CGI programs. When the form interface runs analog for you, it checks that analog isn't given any dangerous options. Without this check, your system could be vulnerable to attack.
Please don't try and set up the form until analog has been set up and is running properly on its own. It just adds another level of complexity to troubleshoot. And unlike analog itself, the form interface will not run "out of the box". You have to read this section to find out how to set it up.
The form interface is suitable for ordinary users to use, but it needs to be set up by a system administrator or other expert. In order to set it up, you have to be running a web server. You need to know what CGI programs are, where they live on your server, and how to set up their permissions properly. You also need to know how to write HTML forms. I shall assume this level of background knowledge for the rest of this section. And you have to be running Perl 5.001 or later: see Technical details below for other system requirements. (Actually, if you're on Windows and don't have Perl, you can download an executable version of the form interface from the helper applications page.)
Warning: CGI programs can contain security loopholes which allow an unscrupulous user to harm your system. (If you don't know about this, you shouldn't be running CGI programs at all. Read and understand the World Wide Web Security FAQ and the CGI Security FAQ first.) I have tried to make this form interface safe, but I cannot guarantee it. Even the most carefully-designed CGI programs can accidentally have serious security bugs. And I take no responsibility if anything goes wrong: you use it at your own risk. (See the licence.) Furthermore, you should be aware that unless you take special measures like password protection or limiting anlgform.pl to specific hostnames, setting up the form interface implies making analog executable, and your logfiles analysable, by anyone on the internet. There are more notes on security design in this program towards the end of this section.
The form interface consists of two parts: a form (called anlgform.html) to choose the options, and a cgi program (called anlgform.pl) to pass them to the analog program. Both anlgform.html and anlgform.pl must be configured to your system before they will work at all. There are instructions at the top of both files explaining how to do this.
The form which is distributed with the program should only be regarded as an example form. You can find forms in languages other than English in the lang directory. Or you can write your own if you prefer. In fact you don't actually need the form at all: if you want just to create a link to the cgi program, with the arguments passed after a question mark in the URL in the usual way, then that's fine.
Logfile name: <input type=text name="LOGFILE">
or maybe something like
<select name=LOGFILE size=1>
<option value="/var/log/apache/fred"> Fred's logfile
<option value="/var/log/apache/jane"> Jane's logfile
</select>
There are a few commands which you can't specify on the form for security or performance reasons. The full list is *LOGFORMAT, LANGFILE, HEADERFILE, FOOTERFILE, UNCOMPRESS, OUTFILE, CACHEOUTFILE, ERRFILE, DNS and SETTINGS; and the person setting up the form can add more. There are also certain arguments you can't give to commands: the most important is that you can't include the wildcard * in the LOGFILE. See the security notes below for the reasons for these exclusions, and for some more commands you might want to add to the forbidden list.
Alias this file: <input type=text name="FILEALIAS1">
To this one: <input type=text name="FILEALIAS2">
You can only specify one such pair this way; so there's no way to specify several of the same ALIAS, for example.
Then there are FLOOR commands. To avoid users of the form having to know the syntax of these commands, you can if you want specify them in two halves, FLOORA and FLOORB, and they will be stuck together. For example, the form distributed with the program specifies
<br>Include all domains with at least
<input type=TEXT name="DOMFLOORA" maxlength=6 size=6>
<select name="DOMFLOORB">
<option value=r>requests
<option value=p>requests for pages
<option value=b selected>bytes
</select>
If DOMFLOORA contains 5% and DOMFLOORB contains r, then DOMFLOOR 5%r will be sent to the program. (Or DOMFLOORA=5 and DOMFLOORB=%r would work too, if you chose to present the form that way.)
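The sticking together of the two halves amounts to plain string concatenation. Here is a small Python sketch of the idea. The real form interface is a Perl program, and this is not a transcription of it. The function name join_floor is made up, and returning nothing when either half is missing is an assumption, since the text doesn't say what happens in that case.

```python
def join_floor(form, name):
    """Build e.g. 'DOMFLOOR 5%r' from the form fields DOMFLOORA and DOMFLOORB."""
    first = form.get(name + "A", "")
    second = form.get(name + "B", "")
    # Assumption: if either half is missing, no command is produced.
    if not first or not second:
        return None
    return name + " " + first + second


command = join_floor({"DOMFLOORA": "5%", "DOMFLOORB": "r"}, "DOMFLOOR")
same = join_floor({"DOMFLOORA": "5", "DOMFLOORB": "%r"}, "DOMFLOOR")
```

Both calls give the same command, which is why either way of dividing the form fields works.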
Secondly, you can specify other configuration files to be included at specific times. When analog is called by the CGI program, it first processes the default configuration file as usual. Then it processes any configuration file specified by an option with name cg. Then it processes all the other commands which the CGI program specifies. After that, it processes any configuration file specified by an option with name cm. Finally, it processes the mandatory configuration file as usual. (You may therefore want two copies of analog, one for form use and one for non-form use, with different configuration files compiled in.) Note that the commands in the default and mandatory configuration files will contribute to the configuration: some of them may even override options specified on the form. For example, if the default configuration file contains an INCLUDE command, this may cause INCLUDE and EXCLUDE commands specified on the form to behave unexpectedly.
There are a couple of commands which the form always sets. These may override what you have set elsewhere. First, it sets either DNS READ (if a DNSFILE is set on the form) or DNS NONE (otherwise). You can override this behaviour in the mandatory configuration file, but you are likely to run into timeout problems if you do. Secondly, it always sets WARNINGS FL, so that the less important warnings don't fill up your server's error log. You can override this by sending an explicit WARNINGS command from the form.
There is one small point about compressed logfiles. For security reasons, when using the form interface you need to specify the full pathname to the uncompression command in the UNCOMPRESS command in your configuration file.
First, you can run anlgform.pl from the (DOS or Unix) command line. This is good enough to debug most problems. You can specify options in pairs like this:
anlgform.pl qv=1 LOGFILE=/some/log REQINCLUDE=pages
If you include qv=1 in the argument list as above, you will see what anlgform.pl is trying to send to analog. If you don't include qv=1, anlgform.pl will try and run analog.
If it still doesn't work, check the following points:
First, you should think about who can run the form interface. Unless you take special measures like password protection or limiting anlgform.pl to specific hostnames, adding the form interface to your site implies making analog executable, and your logfiles analysable, by anyone on the internet. There are obvious concerns both about privacy and about the load on your system.
Certain commands are ignored by anlgform.pl and not passed to analog. The list of them can be found at the top of anlgform.pl. Here are the reasons for them. HEADERFILE and FOOTERFILE would place any file on your system within the output. The *LOGFORMAT commands would also allow any file to be read, because someone could designate each line to be a single filename and then just list the filenames. OUTFILE, CACHEOUTFILE and ERRFILE would allow people to write to your filespace; ERRFILE would also divert errors away from your error log. UNCOMPRESS would allow a user to execute any command. DNS is forbidden because setting it higher than READ would normally cause the process to time out.
None of the above should be deleted (unless you are really, really sure that it's completely impossible for anyone other than yourself to run anlgform.pl). There are two other commands which are forbidden by default but which you could consider removing from the forbidden list. SETTINGS is included because it will give away the locations of some files on your system. But it is useful for diagnostic purposes, and you could consider removing it temporarily if you have trouble setting up the form. The other command which is included is LANGFILE, although I consider it to be a lower risk. It is included because it is theoretically possible that another file could be exactly the right number of lines long to be accepted as a language file, and then parts of it would get into the output. But it would have to be exactly the right length first. If that's a risk you're prepared to take, you can remove LANGFILE from the list.
There are other commands which you might consider adding to the list. For example, it is theoretically possible (though rather unlikely) that another file on your system could conform sufficiently closely to one of the predefined log formats that analog could be persuaded to analyse it and so reveal some of its contents. If you're worried about this, or even if you want to force only one particular logfile to be analysed from the form, you can add the LOGFILE command to the list of forbidden commands. And you could add DOMAINSFILE for similar reasons.
You can of course add any command you like to the list. For example, a user can use any configuration file on your system unless you add all of CONFIGFILE, CM and CG. Or if you wanted to stop a user having control of which warnings were written to the error log, you could add WARNINGS.
The arguments to LOGFILE and CACHEFILE commands are checked for containing only certain allowed characters (specifically, letters, digits, /\.:_ space, and - between two {letter, digit, underscore}'s). This is because they could match an UNCOMPRESS command and thus be passed to the shell when the uncompress command is popen()'ed.
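To make the character rule concrete, here is one way it could be expressed as a regular expression in Python. This is a sketch of the rule as stated above, not the check that anlgform.pl actually performs (which is written in Perl). The function name filename_allowed is invented, "letters" is taken to mean the ASCII letters, and rejecting an empty argument is an assumption.

```python
import re

# Either an ordinary allowed character (letter, digit, / \ . : _ or space),
# or a hyphen with a letter, digit or underscore on both sides.
_ALLOWED = re.compile(
    r"(?:[A-Za-z0-9/\\.:_ ]|(?<=[A-Za-z0-9_])-(?=[A-Za-z0-9_]))+"
)


def filename_allowed(arg):
    """True if every character of arg is permitted by the rule above."""
    return _ALLOWED.fullmatch(arg) is not None


ok = filename_allowed("/var/log/apache/access-log")
bad_option = filename_allowed("-v")
bad_wildcard = filename_allowed("/var/log/*.log")
```

The restriction on hyphens means an argument can never begin with one, so it cannot be mistaken for a command line option.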
Apart from that, command names are checked for containing only letters and the digits 1 and 2; and the arguments to commands are checked for not containing control characters (actually characters 0-32 and 127-159; in particular newline characters are prohibited). The length of the commands isn't checked by anlgform.pl, but buffer overflow shouldn't be an issue as configuration commands are checked for length by analog.
By the way, the reason that I advise that analog itself shouldn't be used as a CGI program is that some servers, notably Microsoft IIS, allow users to pass command line arguments into a CGI program. And even if the program doesn't return the proper CGI headers, the output can be sent back to the user. This means that all the above checking of arguments is then thwarted. Of course, on servers on which you can't pass command line arguments to a CGI program, there are not the same security concerns, but then analog isn't very useful as a CGI program because if you can't pass any arguments, you can only get the default output.
On Windows, you have to associate the .pl extension with the Perl executable so that Perl scripts are executed by Perl.
anlgform.pl will understand the GET or POST methods of form submission. The HTML spec says that GET should be used when, as in this case, running the program has no side effects. However, section 15.1.3 of the HTTP spec says that POST should be used if some of the options being passed might be confidential. Also, very long URLs, formed by specifying lots of options, can cause trouble to some older servers. So anlgform.html uses the POST method by default. However, the GET method will also work. For example, you could make a normal link to anlgform.pl with options specified after a question mark in the usual GET way.
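If you generate such GET links from a script, the options need to be URL-encoded. The Python sketch below builds a link using the standard library's urlencode. The path /cgi-bin/anlgform.pl is hypothetical (use wherever the program lives on your server), the function name form_link is made up, and the option values are just the ones from the command line example earlier.

```python
from urllib.parse import urlencode


def form_link(script_url, options):
    """Return a GET-style link with the options after a question mark."""
    return script_url + "?" + urlencode(options)


# Hypothetical location of the CGI program on the server.
link = form_link(
    "/cgi-bin/anlgform.pl",
    {"LOGFILE": "/some/log", "REQINCLUDE": "pages"},
)
```

Note that urlencode escapes the slashes in the logfile name as %2F; the server's usual CGI decoding is assumed to turn them back.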
This section is fairly long, but it's worth reading carefully. If you understand the basics of how the web works, you will understand what your web statistics are really telling you.
So, what do you know about it? First, I make one request for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I also usually tell you which page referred me to your site, and the make and model of my browser. I do not tell you my username or my email address.
Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests, one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on your front page. Then so far I've made 11 requests to your server.
After that, I go and visit some of your other pages, making a new request for each page and graphic that I want. Finally, I follow a link out of your site. You never know about that at all. I just connect to the next site without telling you.
The other sort of cache is on a larger scale. I'm in the UK. Because the link across the Atlantic is sometimes very congested, we've set up a national cache. (Many individual ISPs also do the same thing.) I can set my browser to get your pages from the national cache instead of directly from you. If anyone else in the country has used the cache to look at your pages recently, the cache will have saved them, and will give them out to me without ever telling you about it. So hundreds of people could read your pages, even though you'd only sent them out once. Also, if the page I wanted wasn't already stored in the cache, the cache would ask for it from you on my behalf. This would mean that the request appeared to come from the cache, rather than from me. If several people did this, you would think that only one host was accessing the cache, rather than lots of different ones.
You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, a few browsers send incorrect referrers, telling you the last page that the user was on even if they weren't referred by that page. And some people use "anonymizers" which deliberately send false browsers and referrers.
I've presented a somewhat negative view here, emphasising what you can't find out. Web statistics are still informative: it's just important not to slip from "this page has received 30,000 requests" to "30,000 people have read this page." In some sense these problems are not really new to the web -- they are present just as much in print media too. For example, you only know how many magazines you've sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the web too, rather than making up spurious numbers.
5. Acknowledgements and further reading. Many other people have made these points too. While originally writing this section, I benefited from three earlier expositions: Interpreting WWW Statistics by Doug Linder; Making Sense of Web Usage Statistics by Dana Noonan; and Getting Real about Usage Statistics by Tim Stehle. Unfortunately none of these articles seems to be available on the web any more.
Another, extremely well-written document on these ideas is Measuring Web Site Usage: Log File Analysis by Susan Haigh and Janette Megarity. Being on a Canadian government site, it's available in both English and French. Or for an even more negative point of view, you could read Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.