An automated URL checker

The next sample program is useful when you have to keep track of several documents on the World Wide Web and need to know when they have been modified. Loading each document in your web browser just to see that nothing has changed is time-consuming, especially if the documents are available only via slow links. In this case it helps to have an automated checking process running on your server during the night and preparing a list of changed documents. This task is accomplished by the following URL (Uniform Resource Locator) checker program. It reads a list of URLs of WWW documents, retrieves the modification date of each document and, based on the information retrieved in the previous run, classifies the documents as changed, unchanged or unchecked. The output of the program is formatted in HTML with links to the documents, so you can use the output document as your initial home page at browser startup.

Based on the previous sample program the main program of the URL checker is pretty short:



    /* URLCHECK.CMD - IBM REXX Sample Program               */
    Parse Arg URLList HTMLFile

    /* Load REXX Socket library if not already loaded       */
    If RxFuncQuery("SockLoadFuncs") Then
     Do
       Call RxFuncAdd "SockLoadFuncs","RXSOCK","SockLoadFuncs"
       Call SockLoadFuncs
     End

    /* check all URLs in the specified file for expiration  */
    URLS.0 = 0
    Changed = ""
    Unchanged = ""
    Commented = ""
    Call CheckURLs URLList
    Call WriteHTML HTMLFile, Changed, Unchanged, Commented
    Exit

The following variables are used in the main program:

'URLS' - this stem holds all lines from the input file, each consisting of a URL and, where known, its last modification date. As usual, URLS.0 indicates the number of elements in this stem variable.
'Changed' - this string variable contains the indices (into the URLS stem) of all changed documents, separated by blanks, such as "1 4 5 8".
'Unchanged' - this string variable contains the indices of all documents that have not changed since the last check.
'Commented' - this string variable contains the indices of all documents that were not checked during this run because they are commented out in the input list of URLs.
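
The way such an index list is built and read back can be tried in isolation. The following is only a minimal sketch with made-up sample data, not part of the URL checker itself:

```rexx
/* Sketch: collecting stem indices in a blank-separated string     */
/* (the entries below are made-up sample data)                     */
URLS.1 = "http://www.ibm.com"
URLS.2 = "#http://www2.hursley.ibm.com/rexx/"
URLS.3 = "http://www.myhost.mydomain/users/chris.html"
URLS.0 = 3

Changed = ""
Changed = Changed 1          /* blank concatenation appends " 1"   */
Changed = Changed 3          /* Changed is now " 1 3"              */

/* Words() and Word() ignore the leading blank                     */
Do I = 1 To Words(Changed)
  Idx = Word(Changed, I)
  Say "Changed document" Idx":" Word(URLS.Idx, 1)
End
```

Keeping only the indices in the lists means the URL text is stored once, in the URLS stem, and every list can be walked with the same Words/Word loop.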

The URLs to be checked are listed in the input file, one URL per line. If you want to exclude a URL temporarily from the check, you can comment it out by preceding it with a number sign "#". The function 'CheckURLs' reads this list, determines whether each document has changed and adds its index to one of the variables 'Changed', 'Unchanged' or 'Commented'. After all documents have been checked, the URL list is rewritten so that each line contains not only the URL but also the last modification date, which is used for comparison at the next run of the check program.
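
To illustrate the line format, here is a small sketch that takes apart three sample lines the same way the checker does: a leading "#" marks a commented-out entry, the first word is the URL, and the rest of the line, if present, is the stored date. The stem name 'Sample' and the messages are assumptions made for this illustration only:

```rexx
/* Sketch: how one line of the list file is taken apart            */
/* (sample lines only; the date is whatever text was stored)       */
Sample.1 = "http://www.ibm.com THU, 18 JUL 1996 17:41:10 GMT"
Sample.2 = "http://www.myhost.mydomain/users/chris.html"
Sample.3 = "#http://www2.hursley.ibm.com/rexx/"
Sample.0 = 3

Do I = 1 To Sample.0
  If Left(Sample.I, 1) = "#" Then
    Say "skipped:" Strip(Sample.I, "L", "#")
  Else
   Do
     /* first word is the URL, the rest (if any) is the date       */
     Parse Var Sample.I URL ModDate
     If Length(ModDate) = 0 Then
       Say "new:    " URL
     Else
       Say "known:  " URL "(" || ModDate || ")"
   End
End
```

A line without a date, such as the second one, is what a freshly written list file contains before the first run.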

Based on the information in the three index variables and the 'URLS' stem, the function 'WriteHTML' then writes an HTML file with links to all URLs from the input file, grouped by their status.

The following functions are reused from the previous sample and are not listed again:

Connect
SendCommand
Close
GetHeader
GetModificationDate

The check of all URLs is done in the function 'CheckURLs'. At the beginning it exposes the global index variables for the different URL states and the stem variable for the list of URLs, since they are used to pass this information on to the next function. It then reads the list of URLs from the specified file, retrieves the header and modification date for every active (not commented-out) document and compares the result with the date stored in the URL list file, where available. Depending on the result of this comparison, the index of the current URL is appended to the appropriate index list for later use, and for changed documents the modification date in the URLS stem variable is updated. A document with no stored date counts as changed; a document whose header cannot be read counts as unchanged.



    /********************************************************/
    /*                                                      */
    /* Procedure: CheckURLs                                 */
    /* Purpose:   Check the modification dates of all URLs  */
    /*            listed in the specified file. If the date */
    /*            has changed, update the list file with    */
    /*            the new date.                             */
    /* Arguments: URLFile - file containing URL list        */
    /* Returns:   nothing                                   */
    /*                                                      */
    /********************************************************/
    CheckURLs: Procedure Expose URLS. Changed Unchanged,
                                Commented
      Parse Arg URLFile

      Index = 0
      Do While Lines(URLFile)
        /* read line with URL and last modification date    */
        URLLine = LineIn(URLFile)

        /* remember line for later update of file           */
        Index = Index + 1
        URLS.0 = Index
        URLS.Index = URLLine

        /* if first character is not a "#" then process URL */
        If SubStr(URLLine, 1, 1) \= "#" Then
         Do
           /* retrieve header for specified URL             */
           Parse Var URLLine URL ModDate
           Header = GetHeader(URL)

           If Length(Header) \= 0 Then
            Do
              /* header could be read, find date            */
              DocDate = GetModificationDate(Header)

              If Length(ModDate) = 0 | ModDate \= DocDate Then
               Do
                 /* this URL has been changed, add to list  */
                 /* of changed URLs and update the date     */
                 Changed = Changed Index
                 URLS.Index = URL DocDate
               End
              Else
                /* add index to list of unchanged URLs     */
                Unchanged = Unchanged Index
            End
           Else
             /* add index to list of unchanged URLs        */
             Unchanged = Unchanged Index
         End
        Else
          /* add index to list of all commented out URLs   */
          Commented = Commented Index
      End

      /* close input stream, erase it and then rewrite it   */
      Call Stream URLFile, "C", "CLOSE"
      "@DEL" URLFile

      Do Index = 1 To URLS.0
        Call LineOut URLFile, URLS.Index
      End

      Call Stream URLFile, "C", "CLOSE"
      Return

After all documents have been checked, the result is formatted into an HTML file with links to the original documents. For details of HTML (HyperText Markup Language) see RFC 1866. The output file is created in the function 'WriteHTML'. It deletes any existing version of the output file, writes a simple header, formats the lists of changed, unchanged and commented-out documents, and finally closes the document with a simple trailer containing the current date and time:



    /********************************************************/
    /*                                                      */
    /* Procedure: WriteHTML                                 */
    /* Purpose:   Create a new HTML document with links to  */
    /*            the input URLs grouped by modification.   */
    /* Arguments: HTML - output filename                    */
    /*            Changed - list of changed URL indices     */
    /*            Unchanged - list of unchanged URL indices */
    /*            Commented - list of commented URL indices */
    /* Returns:   nothing                                   */
    /*                                                      */
    /********************************************************/
    WriteHTML: Procedure Expose URLS.
      Parse Arg HTML, Changed, Unchanged, Commented

      /* write new HTML document with links to URLs         */
      "@DEL" HTML "1>NUL 2>NUL"

      Call LineOut HTML, "<html><head>"
      Call LineOut HTML, "<title>My link list</title>"
      Call LineOut HTML, "</head><body>"

      Call LineOut HTML, "<h1>Changed documents</h1>"
      Call FormatURLList HTML, Changed

      Call LineOut HTML, "<h1>Unchanged documents</h1>"
      Call FormatURLList HTML, Unchanged

      Call LineOut HTML, "<h1>Commented documents</h1>"
      Call FormatURLList HTML, Commented

      Call LineOut HTML, "<p><i>Documents checked at",
                   Date() "on" Time() "</i>"
      Call LineOut HTML, "</body></html>"
      Return

The function 'FormatURLList' formats a single index list into the HTML output format with one URL per line. This version of the formatter simply creates a hyperlink to the document, using the URL itself as the link text, appends the last modification date if one is known, and precedes each entry with a line break. Another solution would be to format the URLs as an unordered list; see the HTML reference for more formatting options.



    /********************************************************/
    /*                                                      */
    /* Procedure: FormatURLList                             */
    /* Purpose:   Format a list of URL indices into a HTML  */
    /*            formatted list with links to the URLs.    */
    /* Arguments: HTML - output filename                    */
    /*            List - list of indices                    */
    /* Returns:   nothing                                   */
    /*                                                      */
    /********************************************************/
    FormatURLList: Procedure Expose URLS.
      Parse Arg HTML, List

      /* are there any indices in the list?                 */
      If Words(List) > 0 Then
       Do
        Do Index = 1 To Words(List)
          Idx = Word(List, Index)
          Parse Var URLS.Idx URL ModDate
          URL = Strip(URL, "L", "#")

          Call LineOut HTML, "<br><a href=""" || URL || """>"
          Call LineOut HTML, URL || "</a>"
          If Length(ModDate) > 0 Then
            Call LineOut HTML, ", last modified at" ModDate
        End
       End
      Else
        Call LineOut HTML, "<p><i>no documents in list</i><p>"
      Return
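
The unordered-list alternative mentioned above could look roughly like the following sketch. The procedure name 'FormatURLListUL', the sample entry and the output file name "test.htm" are assumptions for this illustration; the procedure takes the same arguments as 'FormatURLList' and is not part of the original sample:

```rexx
/* Sketch: format an index list as an HTML unordered list          */
/* (small driver with made-up data so the sketch runs on its own)  */
URLS.1 = "http://www.ibm.com THU, 18 JUL 1996 17:41:10 GMT"
URLS.0 = 1
Call FormatURLListUL "test.htm", "1"
Exit

FormatURLListUL: Procedure Expose URLS.
  Parse Arg HTML, List

  If Words(List) > 0 Then
   Do
     Call LineOut HTML, "<ul>"
     Do Index = 1 To Words(List)
       Idx = Word(List, Index)
       Parse Var URLS.Idx URL ModDate
       URL = Strip(URL, "L", "#")

       /* build one list item per URL                              */
       Item = "<li><a href=""" || URL || """>" || URL || "</a>"
       If Length(ModDate) > 0 Then
         Item = Item || ", last modified at" ModDate
       Call LineOut HTML, Item || "</li>"
     End
     Call LineOut HTML, "</ul>"
   End
  Else
    Call LineOut HTML, "<p><i>no documents in list</i></p>"
  Return
```

Because the interface is unchanged, 'WriteHTML' would only need to call the new procedure instead of 'FormatURLList' to switch layouts.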

This is a sample input file for the URL checker:



    http://www.ibm.com
    http://www.myhost.mydomain/users/chris.html
    #http://www2.hursley.ibm.com/rexx/

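After the checker has run at least once (so that the list file already contains the modification dates) and nothing has changed since, the resulting HTML file could look like this: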



    <html><head>
    <title>My link list</title>
    </head><body>
    <h1>Changed documents</h1>
    <p><i>no documents in list</i><p>
    <h1>Unchanged documents</h1>
    <br><a href="http://www.ibm.com">
    http://www.ibm.com</a>
    , last modified at THU, 18 JUL 1996 17:41:10 GMT

    <br><a href="http://www.myhost.mydomain/users/chris.html">
    http://www.myhost.mydomain/users/chris.html</a>
    , last modified at MONDAY, 22-JUL-96 19:51:25 GMT

    <h1>Commented documents</h1>
    <br><a href="http://www2.hursley.ibm.com/rexx/">
    http://www2.hursley.ibm.com/rexx/</a>
    <p><i>Documents checked at 24 Jul 1996 on 18:43:56 </i>
    </body></html>

Running the program every night or early morning on a server gives you a daily updated list of the documents you want to follow. By using the generated HTML file as the startup page in your web browser you can access the changed documents directly via their links.

The URL checker shown here can be improved in many ways, for example:

sort the documents by date in descending order
extract the title of each document and show it in the link list
mirror the documents on a local web server, ideally with all embedded graphics
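
As a starting point for the second idea, the following sketch extracts the title from the text of a document that has already been retrieved. 'GetTitle' is a hypothetical helper, not part of the sample, and the sketch assumes the whole document is held in a single string and has a plain title tag without attributes:

```rexx
/* Sketch: pull the title out of a document's text                 */
/* (GetTitle is a hypothetical helper; the document is made up)    */
Doc = "<HTML><HEAD><Title>REXX Tutorial</Title></HEAD></HTML>"
Say GetTitle(Doc)
Exit

GetTitle: Procedure
  Parse Arg Doc
  /* search an uppercase copy so that any tag spelling matches     */
  UDoc  = Translate(Doc)
  Start = Pos("<TITLE>", UDoc)
  Stop  = Pos("</TITLE>", UDoc)
  If Start = 0 | Stop <= Start Then
    Return ""                  /* no usable title found            */
  Start = Start + Length("<TITLE>")
  /* cut the title from the original text to keep its case         */
  Return Strip(SubStr(Doc, Start, Stop - Start))
```

The returned title could then replace the URL as the link text in the formatter, falling back to the URL when the result is empty.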

As you have seen, connecting to a WWW server is not difficult. If you use the command 'GET' instead of 'HEAD', the server sends you the whole document, preceded by the same header information that we have already used. Based on the information from this tutorial you could, for example, write a program that maintains a local shadow of a distant web server: you would retrieve a document, extract the links in it and follow them recursively. You surely have other ideas about what can be done with REXX on the Internet.
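
When working with 'GET', the first step is to separate the header from the document. The following is a minimal sketch, assuming the complete response is available in one string and that the header ends at the first empty line, as HTTP specifies. The response text is made up for illustration:

```rexx
/* Sketch: separating header and document in a GET response        */
/* (the response text below is made up for illustration)           */
CRLF = "0D0A"x
Response = "HTTP/1.0 200 OK" || CRLF ||,
           "Last-Modified: Thu, 18 Jul 1996 17:41:10 GMT" || CRLF ||,
           CRLF ||,
           "<html><body>Hello</body></html>"

/* the header ends at the first empty line (two line ends in a row) */
Separator = CRLF || CRLF
Parse Var Response Header (Separator) Body

Say "Header length:" Length(Header)
Say "Document:" Body
```

The header part can then be handled exactly like the result of a 'HEAD' command, while the body is the document itself, ready for link extraction.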
