- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
- <html lang="en">
- <head>
- <title>
- Kermit sitemap script
- </title>
- <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
- <META http-equiv="Content-Style-Type" content="text/css">
- <LINK REL=STYLESHEET TYPE="text/css" HREF="kermit.css">
- <LINK REL="shortcut icon" href="favicon.ico" >
- <style type="text/css">
- ul li { padding-bottom:9;padding-right:64;line-height:12.5pt; }
- h2, h3 { font-family: sans-serif }
- h3 { margin-left:-8; border-top: 2px solid #999999 }
- ul.contents li { line-height:10pt; font-family:sans-serif; }
- .nu { text-decoration:none }
- tt,pre { font-size:10.5pt }
- dl.loose dd { padding:0 0 8 0 }
- blockquote.example { margin-top:6; margin-bottom:8 }
- blockquote.boxed { border:1px solid grey; padding:0 12 8 12 }
- body { color:black; background:white; margin:0; font-size:12pt }
-
-
- </style>
- </head>
-
- <body>
-
- <table cellpadding=0 cellspacing=0 width="100%"
- style="border:1px solid darkmagenta; background-color:white;">
-
- <tr style="background-image:url('lb3.jpg');">
- <td style="padding:8 8 8 8">
- <a style="text-decoration:none"
- href="http://www.columbia.edu"><img
- border=0
- alt="The Columbia Crown"
- title="The Columbia Crown (crown of King George II)"
- height=105
- src="crownico.gif"></a>
- <td align="left" style="padding-top:23">
- <tt style="font-size:24pt"><b>The Kermit Project</b></tt> |
- <span style="font-family:Ariel,times; font-size:18pt"><i>Columbia
- University</i></span>
- <br><span style="font-family:Ariel,times; font-size:14pt">
- 612 West 115th Street, New York NY 10025 USA •
- <a href="mailto:kermit@columbia.edu">kermit@columbia.edu</a>
- </span>
- <table width="100%">
- <tr>
- <td style="font-size:12pt; font-style:italic">…since
- <small>1981</small></div>
- </table>
-
- <tr>
- <td colspan=2 style="padding:0">
- <table class=menu cellpadding=0 cellspacing=0
- width="100%" style="border-top:1px solid darkmagenta">
- <tr>
- <td onClick="document.location.href='index.html';"
- title="Kermit Project Home Page"
- style="cursor:pointer"><a href="index.html">Home</a>
- <td onClick="document.location.href='k95.html';"
- title="Kermit 95 for Windows"
- style="cursor:pointer"><a href="k95.html">Kermit 95</a>
- <td class=this onClick="document.location.href='ckermit.html';"
- title="C-Kermit for Unix and VMS"
- style="cursor:pointer"><a href="ckermit.html">C-Kermit</a>
- <td onClick="document.location.href='ckscripts.html';"
- title="Kermit Script Language and Tutorial"
- style="cursor:pointer"><a href="ckscripts.html">Scripts</a>
- <td onClick="document.location.href='current.html';"
- title="Current Versions of Kermit Software"
- style="cursor:pointer"><a href="current.html">Current</a>
- <td onClick="document.location.href='whatsnew.html';"
- title="What's New"
- style="cursor:pointer"><a href="whatsnew.html">New</a>
- <td onClick="document.location.href='faq.html';"
- title="Frequently Asked Questions"
- style="cursor:pointer"><a href="faq.html">FAQ</a>
- <td onClick="document.location.href='support.html';"
- title="Kermit Software Support"
- style="border-right:0; cursor:pointer"><a href="support.html">Support</a>
- </table>
- </table>
-
- <div style="font-family:calibri,sans-serif,times">
- <div class=normalmargins style="padding-top:4">
-
- <h2>C-Kermit 9.0 Sitemap Script</h2>
-
- <blockquote style="margin-top:8">
- Frank da Cruz<br>
- The Kermit Project<br>
- Columbia University<br>
- <i>Last update:</i>
- Tue Jun 28 12:22:01 2011
- </blockquote>
-
-
-
-
- <blockquote class=boxed style="background:#eeeeee">
- <table>
- <tr>
- <td><b>Download</b>:
- <td><a href="http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap">http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap</a>
- <tr>
- <td><b>Requires</b>:
- <td><a href="ck90.html"><b>C-Kermit 9.0</b></a>.
- </table>
- </blockquote>
-
- <p>
- <div class=normalmargins>
-
- The <b>ksitemap</b> script builds a sitemap.xml file for a website based on
- a data file that you provide listing the files and (using <a
- href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=178636">Google
- Sitemap Image Extensions</a>) images you wish to include in your sitemap,
- along with their properties, so that search engines like Google, Yahoo,
- Bing, and Ask can index them better. Read about sitemaps <a
- href="http://sitemaps.org/protocol.php">here</a>.
-
- <p>
-
- Totally data driven, ksitemap reads a file-list file (or
- “filelist” for short) containing the names and attributes of the
- pages and images to be included in the sitemap. The filelist file is kept
- in the web directory itself, but it need not be world readable.
-
- <p>
-
- The ksitemap script should work under any Unix operating system (Linux, Mac
- OS X, NetBSD, Solaris, etc) that has C-Kermit 9.0 installed (but the <a
- href="http://kermit.columbia.edu/ftp/scripts/ckermit/ksitemap">top line</a>,
- which indicates the pathname of the C-Kermit executable, might need to be
- changed). In Unix the ksitemap script must, of course, also be given
- execute permission (chmod +x). Ksitemap has not yet been tested in
- VMS.
-
- <h3>Invocation</h3>
-
- Ksitemap is invoked with the pathname of the filelist as its first and only
- command-line argument, for example:
-
- <p>
- <blockquote>
- <table>
- <tr>
- <td style="padding-right:24">$ <u>ksitemap /www/filelist</u>
- <td><i>(absolute)</i>
- <tr>
- <td>$ <u>ksitemap ~/web/filelist</u>
- <td><i>(symbolic)</i>
- <tr>
- <td>$ <u>ksitemap web/filelist</u>
- <td><i>(relative)</i>
- <tr>
- <td>$ <u>ksitemap ../web/filelist</u>
- <td><i>(relative)</i>
- <tr>
- <td>$ <u>ksitemap /www/</u>
- <td><i>(absolute directory, no filename)</i>
- <tr>
- <td>$ <u>ksitemap</u>
- <td><i>(no argument, see just below)</i>
- </table>
- </blockquote>
-
- If you give a directory name without a filename, 'filelist' is used
- as the filename.
-
- <p>
-
- If you invoke ksitemap without a command-line argument then:
-
- <ul style="padding-bottom:0;margin-bottom:8">
-
- <li>If the environment variable KSITEMAPDIR is defined, it will be used
- as the pathname of the website directory;
-
- <li>Otherwise, your current directory will be assumed as the website
- directory.
-
- </ul>
-
- In both cases the filename defaults to 'filelist'. Thus, if you define
- the KSITEMAPDIR environment variable in your Unix profile (e.g.
- <tt>.bash_profile</tt> for the Bash shell), for example:
-
- <p>
- <blockquote>
- <tt>export KSITEMAPDIR=/net/w/0/htdocs/<i>username</i>/web/</tt>
- </blockquote>
- <p>
-
- and the name of the file-list file is 'filelist', then you can run ksitemap
- from any directory any time without any command-line argument.
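-
- <p>
-
- The defaulting rules above can be sketched in a few lines. This is only an
- illustration in Python, not the code ksitemap itself uses (ksitemap is a
- Kermit script); the function name <tt>resolve_filelist</tt> and its
- parameters are hypothetical.
-

```python
import os

def resolve_filelist(arg=None, env=None, cwd=None):
    """Return the filelist pathname using the defaults described above."""
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    if arg:
        path = os.path.expanduser(arg)
        # A directory name without a filename gets the default name 'filelist'.
        if path.endswith(os.sep) or os.path.isdir(path):
            return os.path.join(path, "filelist")
        return path
    # No argument: use KSITEMAPDIR if it is defined, else the current directory.
    base = env.get("KSITEMAPDIR") or cwd
    return os.path.join(base, "filelist")

print(resolve_filelist("/www/"))
print(resolve_filelist(None, env={"KSITEMAPDIR": "/net/w/web/"}, cwd="/tmp"))
```

-
- Passing <tt>env</tt> and <tt>cwd</tt> explicitly is just a convenience for
- trying the function without changing the real environment.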
-
- <p>
-
- To invoke for debugging and testing, do:
-
- <p>
- <blockquote>
- <tt>$ <u>DEBUG=1 ksitemap <i>args</i></u></tt>
- </blockquote>
- <p>
-
- This prints progress messages and writes the sitemap.xml file in a "tmp"
- directory.
-
- <h3>The filelist file</h3>
-
- The filelist file contains names of HTML and image files relative to the web
- directory. It can contain comment lines that begin with '<tt>#</tt>':
-
- <p>
- <blockquote>
- <tt># This is a comment line</tt>
- </blockquote>
- </p>
-
- And it can contain blank lines, which are ignored.
- Nonblank, non-comment lines are in this format:
-
- <p>
- <blockquote>
- <tt><i>tag</i>=<i>value</i></tt>
- </blockquote>
- <p>
-
- An <b>equal sign</b> (=) separates the tag from the value. You can include
- <b>whitespace</b> (blanks or tabs) before and after the equal sign; it is
- ignored. The following three lines have identical effect:
-
- <p>
- <blockquote>
- <pre>
- home=http://www.xyzcorp.com/
- home = http://www.xyzcorp.com/
- home= http://www.xyzcorp.com/
- </pre>
- </blockquote>
- <p>
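-
- As a rough illustration of these rules (comment lines, blank lines, and
- optional whitespace around the equal sign), here is a minimal Python sketch.
- It is not how ksitemap is implemented, and the name <tt>read_entries</tt> is
- hypothetical. It splits on the first equal sign only, so it does not handle
- quoted values or the two-value redirect form.
-

```python
def read_entries(lines):
    """Yield (tag, value) pairs, skipping blank lines and '#' comment lines."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' and discard whitespace around it.
        tag, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"no '=' in line: {raw!r}")
        yield tag.strip(), value.strip()

sample = [
    "# This is a comment line",
    "",
    "home=http://www.xyzcorp.com/",
    "home = http://www.xyzcorp.com/",
    "home=   http://www.xyzcorp.com/",
]
entries = list(read_entries(sample))
print(entries)
```

-
- All three <tt>home</tt> lines produce the same tag and value.
-
- <p>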
-
-
-
- If you need to include an equal sign in the value itself,
- surround the value with ASCII doublequotes. If you want the value itself
- to be enclosed in doublequotes, put three of them on each end (see the
- <a href="#programming">section on programming considerations</a> for an
- explanation). Examples:
-
- <p>
- <blockquote>
- <pre>
- cap=View from the Empire State Building looking East
- cap="A+B=C"
- cap="""Caption within doublequotes"""
- </pre>
- </blockquote>
- <p>
-
- The first few lines define parameters for the whole website:
-
- <p>
- <blockquote>
- <table class=compact>
- <tr>
- <th>Tag
- <th>Status
- <th>Value
- <tr>
- <td>encoding
- <td><i>Depends</i>
- <td style="padding-bottom:4">
-
- sitemap.xml files are encoded in <a
- href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a>. If your filelist file
- is encoded in some other character set (such as ISO-8859-1) for the purpose
- of including non-ASCII characters (such as accented letters or non-Roman
- letters), you must declare its encoding so ksitemap can convert the text to
- UTF-8. If your file-list file is ASCII, or it is already UTF-8, this item
- is
- <b>optional</b>. Otherwise this item is <b>required</b>, and it should come
- first, so ksitemap can convert all the lines in the file appropriately. The
- <i>value</i> is the <a
- href="http://www.iana.org/assignments/character-sets">MIME
- name of the character set</a> used in the file-list file. For a list of
- supported encodings, see <a href="csetnames.html">this page</a>.
-
- <tr>
- <td>home
- <td>Required
- <td>The URL of the website's home directory (with no filename part)
- <tr>
- <td>geo
- <td>Optional
- <td>The default geographical location for images, if any
- <tr>
- <td>lic
- <td>Optional
- <td style="padding-bottom:4">The default filename, if any, for a page
- containing copyright or license information for the site's original images
- <tr>
- <td>.<i>macroname</i>
- <td>Optional
- <td style="padding-bottom:4">Definition for macro with given name
- </table>
- </blockquote>
- <p>
-
- These items should come before any of the page-specific items that are
- described below. If you include a <b>geo</b> or <b>lic</b> tag before any
- <b>url</b> tag (see below), these will be used for any image for which you
- do not specify a <b>geo</b> or <b>lic</b> tag. In other words the ones in
- the top section are <i>global</i> and the ones in an img section are
- <i>local</i> to that image.
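-
- <p>
-
- To see why the <b>encoding</b> item matters, consider what the conversion
- involves. The following Python sketch (the helper name <tt>to_utf8</tt> is
- hypothetical, and ksitemap does this with Kermit's own character-set
- facilities, not with Python) decodes a line from the declared character set
- and re-encodes it as UTF-8.
-

```python
def to_utf8(raw_bytes, encoding="utf-8"):
    """Decode bytes from the declared encoding and re-encode them as UTF-8."""
    return raw_bytes.decode(encoding).encode("utf-8")

# "cap=Caf\u00e9" is "cap=Cafe" with an acute accent on the final e.
latin1_line = "cap=Caf\u00e9".encode("iso-8859-1")   # accented e is one byte, 0xE9
utf8_line = to_utf8(latin1_line, "iso-8859-1")        # accented e becomes 0xC3 0xA9

print(latin1_line)
print(utf8_line)
```

-
- A pure ASCII line has the same bytes in both encodings, which is why the
- declaration is optional for ASCII filelists.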
-
- <p>
-
- The "home" line's value is the URL of the website root
- directory, ending with slash, for example:
-
- <p>
- <blockquote>
- <tt>home=http://kermit.columbia.edu/</tt>
- </blockquote>
- <p>
-
- This is used to form the full URLs of the files and images in the website.
- Example:
-
- <p>
- <blockquote>
- <pre>
- home=http://kermit.columbia.edu/
- lic=copyright.html
- </pre>
- </blockquote>
- <p>
-
- This results in the URL of the license file being:
-
- <p>
- <blockquote>
- <pre>
- http://kermit.columbia.edu/copyright.html
- </pre>
- </blockquote>
- <p>
-
- <b>Macros</b> allow you to use variables in value strings. For example,
- given:
-
- <p>
- <blockquote>
- <pre>
- .year=2010
- </pre>
- </blockquote>
- <p>
-
- Then any occurrence of <q><tt>\m(year)</tt></q> in a value string is replaced by
- <q><tt>2010</tt></q>.
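-
- <p>
-
- The substitution can be pictured with a short Python sketch. It is an
- illustration only (ksitemap relies on Kermit's own macro evaluation), the
- name <tt>expand_macros</tt> is hypothetical, and leaving undefined macro
- names untouched is an assumption made here, not documented behavior.
-

```python
import re

# Matches \m(name) and captures the name.
MACRO_REF = re.compile(r"\\m\(([^)]+)\)")

def expand_macros(value, macros):
    """Replace each \\m(name) with its definition; other text is left alone."""
    def substitute(match):
        name = match.group(1)
        # Assumption: an undefined name is left exactly as written.
        return macros.get(name, match.group(0))
    return MACRO_REF.sub(substitute, value)

macros = {"year": "2010"}
print(expand_macros(r"cal-\m(year).html", macros))
```

-
- Only the <tt>\m(<i>name</i>)</tt> pattern is touched, so backslashes that
- appear in a value for other reasons pass through unchanged.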
-
- <p>
-
- The <b>remainder of the file list</b> contains lines for each file and image
- you want to include in your sitemap. For each page, the lines should appear
- in the following order:
-
- <p>
- <blockquote>
- <table class=compact>
- <tr>
- <th>Tag
- <th>Status
- <th>Value
- <tr>
- <td>url
- <td>Required
- <td>Name of an html file relative to the website's root directory.
- <tr>
- <td>pri
- <td>Optional
- <td>Priority of the page, 0.0 to 1.0
- </table>
- </blockquote>
- <p>
-
- For each URL, the <b>page date</b> is supplied automatically based on the
- modification date of the file and the <b>change frequency</b> (daily, weekly,
- monthly, yearly) is supplied based on when the file was last modified.
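-
- <p>
-
- One way to derive both values from a file's modification time is sketched
- below in Python. The age thresholds are invented purely for illustration;
- this page does not say which cutoffs ksitemap actually uses, and the function
- names are hypothetical.
-

```python
import os
from datetime import date

def changefreq_for_age(age_days):
    """Map a file's age in days to a change frequency (hypothetical cutoffs)."""
    if age_days < 7:
        return "daily"
    if age_days < 30:
        return "weekly"
    if age_days < 365:
        return "monthly"
    return "yearly"

def lastmod_and_changefreq(path, today=None):
    """Return (lastmod as YYYY-MM-DD, changefreq) for the file at path."""
    today = today or date.today()
    modified = date.fromtimestamp(os.path.getmtime(path))
    age_days = (today - modified).days
    return modified.isoformat(), changefreq_for_age(age_days)
```
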
-
- <p>
-
- For <b>redirects</b>, a URL entry can have two values; for example:
-
- <p>
- <blockquote>
- <pre>
- url=index.html=index-en.html
- </pre>
- </blockquote>
- <p>
-
- This means that the first filename is an HTTP Redirect to the second
- filename; that is, the first name is a pointer to a file having the second
- name. For example, suppose you have a website with calendars for different
- years: <tt>cal-2009.html</tt>, <tt>cal-2010.html</tt>,
- <tt>cal-2011.html</tt>, etc, and the calendar for the current year should
- always be available as simply <tt>cal.html</tt>. In that case your
- <tt>.htaccess</tt> file redirects the name <tt>cal.html</tt> to (say)
- <tt>cal-2011.html</tt> because you want the <tt>cal.html</tt> name to be
- indexed by Web crawlers even though no file exists with that name in your
- site. This way, each year you only have to change your <tt>.htaccess</tt>
- and you don't have to wait for the web crawlers to index a file that didn't
- exist before:
-
- <p>
- <blockquote>
- <pre>
- url=cal.html=cal-2011.html
- </pre>
- </blockquote>
- <p>
-
- If you have a lot of files using this naming convention, you can use a macro
- so the variable string can be defined (and changed) in just one place
- instead of lots of places:
-
- <p>
- <blockquote>
- <pre>
- .year=2011
- url=cal.html=cal-\m(year).html
- url=jan.html=jan-\m(year).html
- url=feb.html=feb-\m(year).html
- <i>etc...</i>
- </pre>
- </blockquote>
- <p>
-
-
-
-
-
- If there are <b>images</b> on the page that you want to include in the sitemap:
-
- <p>
- <blockquote>
- <table class=compact>
- <tr>
- <th>Tag
- <th>Status
- <th>Value
- <tr>
- <td>img
- <td>Required
- <td>Name of an image file in the root directory or in a subdirectory.
- <tr>
- <td>cap
- <td>Optional
- <td>A text caption for the image
- <tr>
- <td>title
- <td>Optional
- <td>A text title for the image
- <tr>
- <td>geo
- <td>Optional
- <td>The geographical location of this image only
- <tr>
- <td>lic
- <td>Optional
- <td style="padding-bottom:4">The URL of a license page for this image only
-
- </table>
- </blockquote>
- <p>
-
- Here's a brief example that has three files. For the first file
- (index.html), a priority is specified; for the others, the default priority
- is accepted. The second file is in a subdirectory. The third file has
- images. Comments, blank lines, and indentation are used for clarity, but
- they do not affect the result. Note that there may be, but need not
- be, whitespace around the equal sign.
-
-
- <p>
- <blockquote class=boxed>
- <pre>
- # ksitemap filelist for building sitemap.xml
-
- encoding = ISO-8859-1
- home=http://kermit.columbia.edu/
- geo=New York City USA
- lic=copyright.html
-
- url=index.html
- pri=1.0
-
- url=cudocs/ilosetup.html
-
- url=cable.html
- img=connectors-340.jpg
- cap=Male and Female RS-232 Connectors
- title=Serial Data Connectors
- img=modemcable.jpg
- cap=Modem Cable Schematic
- geo=Bedford MA
- img=nullmodem-480.jpg
- cap=Null Modem Cable Schematic
- lic=special.html
- geo=Batey Caño - Yamasá
- </pre>
- </blockquote>
- <p>
-
- The resulting sitemap.xml looks like this:
-
- <p>
- <blockquote class=boxed>
- <pre>
-
- <?xml version="1.0" encoding="UTF-8"?>
- <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
- xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
- <url>
- <loc>http://kermit.columbia.edu/</loc>
- <lastmod>2010-12-07</lastmod>
- <changefreq>daily</changefreq>
- <priority>1.0</priority>
- </url>
- <url>
- <loc>http://kermit.columbia.edu/cudocs/ilosetup.html</loc>
- <lastmod>2010-12-07</lastmod>
- <changefreq>daily</changefreq>
- <priority>0.5</priority>
- </url>
- <url>
- <loc>http://kermit.columbia.edu/cable.html</loc>
- <lastmod>2010-12-07</lastmod>
- <changefreq>daily</changefreq>
- <priority>0.5</priority>
- <image:image>
- <image:loc>http://kermit.columbia.edu/connectors-340.jpg</image:loc>
- <image:caption>Male and Female RS-232 Connectors</image:caption>
- <image:title>Serial Data Connectors</image:title>
- <image:geo_location>New York City USA</image:geo_location>
- <image:license>http://kermit.columbia.edu/copyright.html</image:license>
- </image:image>
- <image:image>
- <image:loc>http://kermit.columbia.edu/modemcable.jpg</image:loc>
- <image:caption>Modem Cable Schematic</image:caption>
- <image:geo_location>Bedford MA</image:geo_location>
- <image:license>http://kermit.columbia.edu/copyright.html</image:license>
- </image:image>
- <image:image>
- <image:loc>http://kermit.columbia.edu/nullmodem-480.jpg</image:loc>
- <image:caption>Null Modem Cable Schematic</image:caption>
- <image:geo_location>Batey Caño - Yamasá</image:geo_location>
- <image:license>http://kermit.columbia.edu/special.html</image:license>
- </image:image>
- </url>
- </urlset>
- </pre>
- </blockquote>
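-
- <p>
-
- The way the global <b>geo</b> and <b>lic</b> values fill in for images that
- lack their own can be sketched as follows. This is a Python illustration
- under stated assumptions, not ksitemap's code; <tt>image_xml</tt> and the
- dictionary layout are hypothetical.
-

```python
from xml.sax.saxutils import escape

def image_xml(home, image, defaults):
    """Render one image element; geo and lic fall back to the global defaults."""
    geo = image.get("geo", defaults.get("geo"))
    lic = image.get("lic", defaults.get("lic"))
    out = ["<image:image>",
           f"<image:loc>{escape(home + image['img'])}</image:loc>"]
    if "cap" in image:
        out.append(f"<image:caption>{escape(image['cap'])}</image:caption>")
    if "title" in image:
        out.append(f"<image:title>{escape(image['title'])}</image:title>")
    if geo:
        out.append(f"<image:geo_location>{escape(geo)}</image:geo_location>")
    if lic:
        # The license filename is relative to the home URL.
        out.append(f"<image:license>{escape(home + lic)}</image:license>")
    out.append("</image:image>")
    return "\n".join(out)

defaults = {"geo": "New York City USA", "lic": "copyright.html"}
image = {"img": "modemcable.jpg", "cap": "Modem Cable Schematic", "geo": "Bedford MA"}
print(image_xml("http://kermit.columbia.edu/", image, defaults))
```

-
- Here the image's own <b>geo</b> overrides the global one, while the license
- comes from the global <b>lic</b>, as in the second image of the example above.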
-
- <a name="programming"> </a>
- <h3>Programming considerations</h3>
-
- The key to parsing the filelist is Kermit's <tt>\fsplit()</tt> function, and
- in particular some new features added to it in <a href="ck90.html">C-Kermit
- 9.0</a>: a straightforward way of handling strings containing non-ASCII
- characters, and the "comma-separated values" list (CSV) feature described in
- <a href="csv.html">this page</a>. The statement:
-
- <blockquote>
- <pre>
- .\%9 := \fsplit(\m(line),&x,=,CSV) # Split line on '='
- </pre>
- </blockquote>
-
- splits a filelist line into two pieces, the tag and the value:
-
- <ul style="margin-bottom:0">
-
- <li><tt>\%9</tt> is a kind of all-purpose temporary local variable, a usually
- unused command-line or macro argument number 9, which in this case receives
- the number of items that were obtained by splitting (the <tt>\%1-9</tt>
- variables are local by definition, meaning if you use them in a macro,
- changing their values won't affect variables of the same name anywhere else).
-
- <li><tt>\fsplit()</tt> is a built-in function for splitting a string into
- pieces based on all sorts of breaking, including, and grouping criteria.
-
- <li>The first argument, <tt>\m(line)</tt>, is the variable holding the
- current line from the filelist.
-
- <li><tt>&x</tt> is the name of the array to put the result in.
-
- <li><tt>= </tt> is the break set, composed of one character in this case,
- the equal sign.
-
- <li><tt>CSV</tt> means it is a "comma-separated values" list, but since the
- break character is equal sign and not comma, it is really an "equal-sign
- separated list", but with the same rules as a CSV, such as:
-
- <ol style="padding-top:8">
-
- <li>All characters other than the break character itself are in the include
- set.
-
- <li>Except that the separator can, but need not be, surrounded by
- whitespace, in which case the whitespace characters are discarded (not
- included).
-
- <li>A field containing the separator character as data must be surrounded
- by doublequotes, which will be removed in the final result.
-
- <li>A field that contains doublequotes must be enclosed in doublequotes,
- and then all interior doublequotes must be doubled.
-
- </ol>
- </ul>
-
- The complete set of CSV rules is <a href="csv.html#rules">here</a>.
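-
- <p>
-
- For readers who do not have C-Kermit at hand, the effect of these rules can
- be approximated with Python's standard <tt>csv</tt> module, using the equal
- sign as the delimiter. This is a sketch for comparison only; it is not
- <tt>\fsplit()</tt>, and the two may differ in corner cases. The name
- <tt>split_line</tt> is hypothetical.
-

```python
import csv

def split_line(line):
    """Split one filelist line on '=' using CSV-style quoting rules."""
    reader = csv.reader([line], delimiter="=", skipinitialspace=True)
    fields = next(reader)
    # Discard whitespace left around the separator.
    return [field.strip() for field in fields]

for line in ['cap="A+B=C"',
             'cap="""Caption within doublequotes"""',
             "home = http://www.xyzcorp.com/",
             "url=index.html=index-en.html"]:
    print(split_line(line))
```

-
- The quoted value keeps its interior equal sign, the tripled doublequotes
- leave one pair around the caption, and the redirect line yields three
- fields.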
-
- <p>
-
- Another observation about <tt>\fsplit()</tt> is worth making. Its result
- goes into an array, and array elements in the Kermit language, just like
- <tt>\%a</tt> variables, are evaluated <i>recursively</i>. The array
- elements contain the literal pieces of the original string, but when you
- refer to an array element whose value contains any backslashes, the string
- is evaluated recursively, "all the way down". This is why the array element
- values are referenced through <tt>\fcontents()</tt>, which forces a simple
- "one-level-deep" evaluation.
-
- <p>
-
- A more serious problem was noted when adding the macro capability to
- ksitemap, namely that <tt>\fsplit()</tt> itself was stripping out backslash
- characters. This is appropriate behavior for some of its other uses
- (e.g. parsing <a href="ckermit80.html#x9">S-Expressions</a>), but is not
- appropriate for parsing external data, such as data lines read from files.
- This explains the "Quoting Hell" trick just before the <tt>\fsplit()</tt>
- invocation. This will be unnecessary (and, in fact, harmful) in the next
- build of C-Kermit after 9.0.299 Alpha.09, where in CSV and TSV invocations
- of <tt>\fsplit()</tt>, backslashes will be treated just as any other
- character.
-
- <p>
-
- Finally it should be noted that ksitemap takes pains to expand macros only
- after verifying that a line contains “<tt>\m(<i>xxx</i>)</tt>”
- (where <tt><i>xxx</i></tt> would be the name of the macro). It could very
- easily have simply evaluated each line without all the testing and checking,
- but then files that contained backslashes for other reasons would be
- wrecked.
-
-
- <h3>References</h3>
-
- <table cellpadding=0 cellspacing=0 width="100%">
- <tr>
- <td>
- <ul>
- <li><a href="http://sitemaps.org/protocol.php">Sitemap definition</a>
- (Sitemaps.org)
- <li><a
- href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=178636">Google Sitemap Image Extensions</a> (Google)
- <li><a href="http://unicode.org/faq/utf_bom.html#UTF8">UTF-8 FAQ</a>
- (Unicode Consortium)
- <li><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> (Wikipedia)
- <li> <a href="http://www.iana.org/assignments/character-sets">MIME Character
- Set Names</a> (IANA)
- </ul>
- <td>
- <ul style="padding-left:0">
- <li><a href="csetnames.html">Character Sets Supported in Kermit</a> (Kermit Project)
- <li><a href="utf8.html">UTF-8 Sampler</a> (Kermit Project)
- <li><a href="ckermit.html">C-Kermit</a> (Kermit Project)
- <li><a href="ck90.html">C-Kermit 9.0</a> (Kermit Project)
- <li><a href="csv.html">CSV Files</a> (Kermit Project)
- </ul>
- </table>
- <p>
- </div>
- <hr>
- <address style="padding:0 0 12 0">
- ksitemap / Kermit sitemap script / <a href="index.html">The Kermit Project</a>
- / Columbia University / December 2010
- </address>
- </div>
-
- </body>
- </html>
-