The Kermit Project | Columbia University
612 West 115th Street, New York NY 10025 USA • kermit@columbia.edu
…since 1981

C-Kermit 9.0 Sitemap Script

Frank da Cruz
The Kermit Project
Columbia University
Last update: Tue Dec 7 16:54:21 2010
Download:   http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap
Requires:   C-Kermit 9.0 Alpha.03 or later.

The ksitemap script builds a sitemap.xml file for a website from a data file that you provide. The data file lists the files and (using Google Sitemap Image Extensions) the images you wish to include in your sitemap, along with their properties, so that search engines like Google, Yahoo, Bing, and Ask can index them better. Read about sitemaps here.

Totally data driven, ksitemap reads a file-list file (or “filelist” for short) containing the names and attributes of the pages and images to be included in the sitemap. The filelist file is kept in the web directory itself, but it need not be world readable.

Invocation

ksitemap is invoked with the pathname of the filelist as its first and only command-line argument, for example:

$ ksitemap /www/filelist        # absolute
$ ksitemap ~/web/filelist       # symbolic
$ ksitemap web/filelist         # relative
$ ksitemap ../web/filelist      # relative
If you give a directory name without a filename, 'filelist' will be used as the filename.

If you invoke ksitemap without a command-line argument then:

  • If the environment variable KSITEMAPDIR is defined, it will be used as the pathname of the website directory;
  • Otherwise, your current directory will be assumed as the website directory.
In both cases the filename defaults to 'filelist'. Thus, if you have the KSITEMAPDIR environment variable defined in your Unix profile (e.g. .bash_profile for the Bash shell), for example:

export KSITEMAPDIR=/net/w/0/htdocs/username/web/

and the name of the file-list file is 'filelist', then you can run ksitemap from any directory any time without any command-line argument.
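
The lookup rules above can be summarized in a short Python sketch. This is only an illustration of the documented behavior, not the Kermit code that ksitemap itself runs; the function name resolve_filelist and its parameters are hypothetical, and treating an empty KSITEMAPDIR as undefined is an assumption.

```python
import os

def resolve_filelist(arg=None, environ=None, cwd=None):
    """Return the filelist pathname implied by the rules described above."""
    environ = os.environ if environ is None else environ
    if arg:
        path = os.path.expanduser(arg)          # expand a leading ~
        # A directory name without a filename: use 'filelist' inside it.
        if os.path.isdir(path):
            path = os.path.join(path, "filelist")
        return path
    # No argument: KSITEMAPDIR if defined (assumed: and non-empty),
    # otherwise the current directory.
    base = environ.get("KSITEMAPDIR") or (cwd or os.getcwd())
    return os.path.join(base, "filelist")
```

For instance, with KSITEMAPDIR set as in the export line above and no argument given, the sketch returns /net/w/0/htdocs/username/web/filelist.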

To invoke for debugging and testing, do:

$ DEBUG=1 ksitemap args

This prints progress messages and writes the sitemap.xml file to a "tmp" directory.

The filelist file

The filelist file contains names of HTML and image files relative to the web directory. It can contain comment lines that begin with '#':

# This is a comment line

It can also contain blank lines, which are ignored. Nonblank, non-comment lines are in this format:

tag=value

An equal sign (=) separates the tag from the value. If you need to include an equal sign in the value itself, surround the value with ASCII doublequotes. Examples:

cap=View from the Empire State Building looking East
cap="A+B=C"
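
As a rough illustration of the line format, here is a minimal Python sketch that splits one filelist line into a tag and a value. It is not taken from ksitemap; the function name parse_line is hypothetical, and two details are assumptions: the split happens at the first equal sign, and leading whitespace is ignored before checking for '#'.

```python
def parse_line(line):
    """Return (tag, value), or None for blank and comment lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    # Assumed: the first '=' separates the tag from the value, so any
    # later '=' characters stay in the value.
    tag, sep, value = text.partition("=")
    if not sep:
        raise ValueError(f"not a tag=value line: {line!r}")
    # Drop one pair of surrounding ASCII doublequotes, if present.
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        value = value[1:-1]
    return tag, value
```

With this sketch, the line cap="A+B=C" yields the tag cap and the value A+B=C.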

The first few lines define parameters for the whole website:

Tag   Status    Value
home  Required  The URL of the website's home directory (with no filename part)
geo   Optional  The default geographical location for images, if any
lic   Optional  The default filename, if any, for a page containing copyright or license information for the site's original images

These items should come before any of the page-specific items that are described below. If you include a geo or lic tag before any url tag (see below), these will be used for any image for which you do not specify a geo or lic tag. In other words the ones in the top section are global and the ones in an img section are local to that image.

The "home" line's value is the URL of the website root directory, ending with a slash, for example:

home=http://kermit.columbia.edu/

This is used to form the full URLs of the files and images in the website. Example:

home=http://kermit.columbia.edu/
geo=New York City USA
lic=copyright.html

The remainder of the file contains lines for each file and image you want to include in your sitemap. For each page, the lines should appear in the following order:

Tag  Status    Value
url  Required  Name of an HTML file in the root directory or in a subdirectory
pri  Optional  Priority of the page, 0.0 to 1.0

For each URL, the page date is supplied automatically based on the modification date of the file, and the change frequency (daily, weekly, monthly, yearly) is supplied based on when the file was last modified.
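
To make the idea concrete, here is a Python sketch of deriving both values from a file's modification time. The cutoffs in guess_changefreq are purely illustrative assumptions, since this page does not say where ksitemap draws the lines between daily, weekly, monthly, and yearly; both function names are hypothetical as well.

```python
import os
from datetime import datetime

def guess_changefreq(age_days):
    """Map a file's age in days to a sitemap changefreq value."""
    # ASSUMPTION: these cutoffs are made up for illustration; the rules
    # ksitemap actually applies are not documented here.
    if age_days < 7:
        return "daily"
    if age_days < 30:
        return "weekly"
    if age_days < 365:
        return "monthly"
    return "yearly"

def describe_page(path, now=None):
    """Return (lastmod, changefreq) derived from the file's modification time."""
    modified = datetime.fromtimestamp(os.path.getmtime(path))
    now = now or datetime.now()
    return modified.strftime("%Y-%m-%d"), guess_changefreq((now - modified).days)
```

The date string uses the YYYY-MM-DD form that sitemap lastmod elements accept.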

If there are images on the page that you want to include in the sitemap:

Tag    Status    Value
img    Required  Name of an image file in the root directory or in a subdirectory
cap    Optional  A text caption for the image
title  Optional  A text title for the image
geo    Optional  The geographical location of this image only
lic    Optional  The filename or URL of the copyright or license page for this image only

Here's a brief example that has three files. For the first file (index.html), a priority is specified; for the others, the default priority is accepted. The second file is in a subdirectory. The third file has images. Comments, blank lines, and indentation are used for clarity, but they do not affect the result. There should be no spaces before or after the equal signs.

# ksitemap filelist for building sitemap.xml

home=http://kermit.columbia.edu/
geo=New York City USA
lic=copyright.html

url=index.html
pri=1.0

url=cudocs/ilosetup.html

url=cable.html
img=connectors-340.jpg
  cap=Male and Female RS-232 Connectors
  title=Serial Data Connectors
img=modemcable.jpg
  cap=Modem Cable Schematic
  geo=Bedford MA
img=nullmodem-480.jpg
  cap=Null Modem Cable Schematic
  lic=special.html

The resulting sitemap.xml looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
  <loc>http://kermit.columbia.edu/</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc>http://kermit.columbia.edu/cudocs/ilosetup.html</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>
<url>
  <loc>http://kermit.columbia.edu/cable.html</loc>
  <lastmod>2010-12-07</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
    <image:loc>http://kermit.columbia.edu/connectors-340.jpg</image:loc>
    <image:caption>Male and Female RS-232 Connectors</image:caption>
    <image:title>Serial Data Connectors</image:title>
    <image:geo_location>New York City USA</image:geo_location>
    <image:license>http://kermit.columbia.edu/copyright.html</image:license>
  </image:image>
  <image:image>
    <image:loc>http://kermit.columbia.edu/modemcable.jpg</image:loc>
    <image:caption>Modem Cable Schematic</image:caption>
    <image:geo_location>Bedford MA</image:geo_location>
    <image:license>http://kermit.columbia.edu/copyright.html</image:license>
  </image:image>
  <image:image>
    <image:loc>http://kermit.columbia.edu/nullmodem-480.jpg</image:loc>
    <image:caption>Null Modem Cable Schematic</image:caption>
    <image:geo_location>New York City USA</image:geo_location>
    <image:license>http://kermit.columbia.edu/special.html</image:license>
  </image:image>
  <priority>0.5</priority>
</url>
</urlset>
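
How the global geo and lic values flow into the image entries above can be sketched in Python. This is an illustration under stated assumptions, not ksitemap's own logic: the function name build_entries and the dictionary layout are hypothetical, the input is assumed to be well formed (home first, every img after a url), and names are joined onto home literally, so index.html becomes .../index.html whereas the sample output lists the bare home URL for that page.

```python
from urllib.parse import urljoin

def build_entries(lines):
    """Group filelist lines into page entries, filling in global image defaults."""
    site = {}            # global settings seen before the first url: home, geo, lic
    pages = []
    page = image = None
    for raw in lines:
        text = raw.strip()
        if not text or text.startswith("#"):
            continue     # blank and comment lines are ignored
        tag, _, value = text.partition("=")
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]          # drop surrounding doublequotes
        if tag == "url":
            # 0.5 is the priority the sample output shows when no pri is given
            page = {"loc": urljoin(site["home"], value), "pri": "0.5", "images": []}
            pages.append(page)
            image = None
        elif tag == "img":
            # each image starts with the global geo and lic, if any
            image = {"loc": urljoin(site["home"], value),
                     "geo": site.get("geo"), "lic": site.get("lic")}
            page["images"].append(image)
        elif page is None:
            site[tag] = value            # still in the global section
        elif image is not None:
            image[tag] = value           # cap, title, or a local geo/lic override
        else:
            page[tag] = value            # pri
    # Turn license filenames into full URLs; absolute URLs pass through unchanged.
    for page in pages:
        for image in page["images"]:
            if image["lic"]:
                image["lic"] = urljoin(site["home"], image["lic"])
    return pages
```

Fed the example filelist, the sketch gives modemcable.jpg the local location Bedford MA with the global copyright.html license, and gives nullmodem-480.jpg the global location with the local special.html license, matching the sample output.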