LinuxWorld Magazine
Perl and XML

Perl gets a parser

The new XML::Parser module is the glue that provides a Perl interface to a C-based parser

Summary
Perl is a natural match for HTML. With the new parser module, XML::Parser, it's doubly handy for tackling XML. (1,500 words)

By Robert Richardson

An amusing moment in the XML standard -- quite possibly the only snippet of the standard anyone's likely to call amusing -- is the bit early in the introduction that proclaims:

It shall be easy to write programs which process XML documents.

Not that the standard doesn't live up to this aim, nor that it's not a worthwhile goal, but the sweet naivete of saying such a thing gives me an ear-to-ear grin every time I think about it. In computing, nothing simple stays simple.

Still, things that start out simple do sometimes manage to remain simpler than they might otherwise have been. HTML remains pleasantly straightforward, though it's no longer something you can fully explain over a single cup of coffee.

Perl, the scripting language so many of us have gotten so fond of, certainly started from Larry Wall's impulse to make it simple to read, mangle, and remangle text files.

HTML and Perl proved well matched in their desire to get most things done without a fuss. It shouldn't, therefore, surprise us that XML, which we might describe as "HTML freed from the chains of fixed-element names," is a format that we can tweak and mangle rather effectively with Perl.

As with so many potential uses of Perl, someone has already done a lot of the work, so our job is simply to

Understand what we need to do
Relate our needs to our shiny, new Perl module
Paste things together with a few lines of code as needed

That shiny new module is XML::Parser, a bit of glue originally written by Wall and later adopted by Clark Cooper that provides a Perl interface to James Clark's C-based expat parser. You can find XML::Parser at literally hundreds of mirror sites, but one to start with is ftp://ftp.epix.net/pub/languages/perl/CPAN.html.

What to do
So what XML tasks can you do with Perl? Well, at first glance, you can't "legally" do anything with Perl within the strict definition of XML. That's because Perl predates the general cattle stampede to Unicode character sets, while XML is already well inside the corral. XML wants wide (2-byte) characters -- Perl ain't got 'em.

But in the real world, an awful lot of XML is done in complete ignorance of Unicode concerns, largely because Unicode's designers thoughtfully allowed for a compact encoding of Unicode called UTF-8, in which every ASCII character still occupies a single byte. UTF-8 is thus a superset of our dear friend ASCII, so pure ASCII files can legitimately be XML files. Not every XML file will stick to the single-byte part of UTF-8 that we can deal with, but for those that do, we're in business.
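One quick way to tell whether a given file stays inside the ASCII range is to look for any byte with the high bit set. The following is only a sketch of that idea; the is_ascii name and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains only 7-bit ASCII bytes.
sub is_ascii {
    my ($text) = @_;
    return $text !~ /[^\x00-\x7F]/;
}

my $plain  = "<Title>MG Midget</Title>";
my $accent = "<Title>Caf\xC3\xA9</Title>";   # an accented e stored as two UTF-8 bytes

print is_ascii($plain)  ? "ASCII only\n" : "non-ASCII bytes present\n";
print is_ascii($accent) ? "ASCII only\n" : "non-ASCII bytes present\n";
```

A file that passes this test is both valid ASCII and valid UTF-8, so it can be handled with ordinary byte-oriented string operations.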

So, assuming XML source with 8-bit characters that Perl can handle, there are several sets of tasks we might tackle with Perl. Since XML allows us to have elements that relate to the logic of the information we're storing, we may have pages filled with tags never before seen on the Web. For our home page dedicated to classic MG coupes, we might have <RoofStyle> and <SteeringWheelStyle> (XML element names can't contain spaces, so the words run together).

Assuming we're getting this information from a database of MG shop manuals, we might want to generate XML pages based on information culled from the database. Since this is really just a specialized kind of report formatting, there's nothing new for us here -- spiffy reports is what Perl is all about.
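As a rough illustration of this kind of report formatting, the sketch below turns records into XML. It assumes each record has already been fetched from the database as a hash reference; the element names (Car, Model, RoofStyle), the sample rows, and both function names are invented for the example.

```perl
use strict;
use warnings;

# Escape the characters that have special meaning in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or we'd re-escape our own output
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn one record (a hash reference) into an XML fragment.
sub record_to_xml {
    my ($rec) = @_;
    my $xml = "<Car>\n";
    for my $field (qw(Model RoofStyle)) {
        $xml .= "  <$field>" . xml_escape($rec->{$field}) . "</$field>\n";
    }
    return $xml . "</Car>\n";
}

# Made-up sample rows standing in for database results.
my @rows = (
    { Model => 'MGB GT',   RoofStyle => 'Fixed steel' },
    { Model => 'MGA 1600', RoofStyle => 'Coupe & hardtop' },
);
print record_to_xml($_) for @rows;
```

The one step that's easy to forget is escaping: any ampersand or angle bracket in the data has to be converted, or the output won't be well formed.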

But there's also the strong possibility that we'll start entering free-form database content into our document-storage system using XML formatting. We'll want to make sure that our data is, at the very least, well formed (that is, we end-tag all the elements we begin and we meet a list of other nagging requirements). We may also want to ensure that certain aspects of business logic are enforced, so that we don't do lazy things like write summaries of books without including information about the author and the publisher. To enforce such rules, or merely to enforce that documents be well formed, we'll need to parse and interpret our XML files.
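Here is a sketch of both kinds of check, using XML::Parser (introduced earlier). It relies on the parser's parse method dying when the input isn't well formed. The check_summary name and the required Author and Publisher elements are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Returns a list of problems found in the XML text; an empty list means OK.
sub check_summary {
    my ($xml) = @_;
    my %seen;
    my $parser = XML::Parser->new(
        Handlers => { Start => sub { my ($expat, $element) = @_; $seen{$element}++ } },
    );

    # parse() dies on the first well-formedness error, so trap it with eval.
    eval { $parser->parse($xml); 1 } or return ("not well formed: $@");

    # Business rule (assumed): every summary needs an author and a publisher.
    return map { "missing <$_>" } grep { !$seen{$_} } qw(Author Publisher);
}

print "$_\n" for check_summary('<Book><Author>A. N. Other</Author></Book>');
print "$_\n" for check_summary('<Book><Author>A. N. Other</Book>');
```

The first call reports the missing publisher; the second fails earlier, because the Author element is never closed.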

Additionally, since XML is a great self-documenting format for exchanging data between systems, we may want to be able to package and send records to, or receive and unpack records from, other database systems or transaction applications. Perl's a great way to knock off this sort of task in a hurry.
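Unpacking incoming records might look something like the following sketch, which uses XML::Parser to turn each record into a Perl hash. The Batch, Record, Part, and Qty element names are invented, as is the unpack_records function; a real exchange format would dictate its own names.

```perl
use strict;
use warnings;
use XML::Parser;

# Unpack <Record> elements into a list of hashes, one key per child element.
sub unpack_records {
    my ($xml) = @_;
    my (@records, $current, $field);

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'Record') { $current = {} }
            elsif ($current)          { $field = $element; $current->{$field} = '' }
        },
        # Character data can arrive in several pieces, so append rather than assign.
        Char => sub {
            my ($expat, $text) = @_;
            $current->{$field} .= $text if $current && defined $field;
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'Record') { push @records, $current; undef $current }
            else                      { undef $field }
        },
    });
    $parser->parse($xml);
    return @records;
}

my @recs = unpack_records(
    '<Batch><Record><Part>Hub cap</Part><Qty>4</Qty></Record></Batch>'
);
print "$_->{Part}: $_->{Qty}\n" for @recs;
```

This sketch assumes flat records, with each field holding only text; nested fields would need a stack rather than a single $field variable.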

There's also one bit of Web-site production that we might want to knock off using Perl. At present, there's only one mainstream browser that natively displays XML. To step lightly over a potentially fraught issue, let's just say it's not necessarily the browser of choice for the open source community. So, for other browsers that haven't caught up to XML, we want to be able to quickly convert our XML source code into HTML.

In all of the tasks where the input is an XML file, we'll have to begin by parsing the file into its constituent elements. If we're converting an XML file to an HTML file, we'll want to convert each element into a corresponding HTML element as we go.

As a preliminary move, let's assume that the lookup table matching styles to elements has been placed into memory. Now we can approach the parsing of our original XML file simply by sucking up the source file and reading through it, looking for tags by employing Perl's powerful and rather cryptic regular expressions:

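$file = do { local $/; <> };   # slurp the whole file, not just one line
while ($file =~ /[^<]*<(\/)?([^>]+)>/g) {
    # $1 is "/" for an end tag; $2 holds the tag name and any attributes
    # identify and process the element
}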

If you're unfamiliar with regular expressions in Perl, that second line is going to look pretty frightful. All it's doing though, is finding and isolating for us any part of the file that appears within angle brackets, as in <sometag>, because whatever else that bracketed text might be, it's going to be a tag of some kind.

Of course, it doesn't have to be a conventional element tag, so you'll have to go through a process of elimination to weed out the <?xml ...?> declaration that heads up the page, <!DOCTYPE> declarations, comments, processing instructions, and CDATA blocks.
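That process of elimination can be sketched as a chain of pattern tests on the text found between the angle brackets. The classify_tag function and its category names are made up for this sketch, which is not Leventhal's script or any library's API.

```perl
use strict;
use warnings;

# Classify the text found between "<" and ">" (brackets already removed).
sub classify_tag {
    my ($tag) = @_;
    return 'comment'     if $tag =~ /^!--/;
    return 'cdata'       if $tag =~ /^!\[CDATA\[/;
    return 'doctype'     if $tag =~ /^!DOCTYPE/;
    return 'declaration' if $tag =~ /^\?xml\s/;
    return 'pi'          if $tag =~ /^\?/;      # any other processing instruction
    return 'end'         if $tag =~ m{^/};
    return 'start';
}

for my $tag ('?xml version="1.0"?', '!DOCTYPE Car', '!-- note --', 'Car', '/Car') {
    printf "%-22s %s\n", "<$tag>", classify_tag($tag);
}
```

Keep in mind that a simple bracket-matching regular expression will cut a comment or CDATA block short if its content contains a ">", which is one reason a real parser is the more robust choice.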

You don't have to set all this up on your own, though, because Michael Leventhal has a nice script that handles this triage of tags as part of a utility that checks to see if an XML file is well formed. (You can find it, along with a nice article explaining it in detail, at http://www.xml.com.) If you write your own function to handle each case as it comes up, you can focus on the just-plain-old elements, match the tag names to those in your table of styles, and insert appropriate HTML code into an output file.
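For the just-plain-old elements, matching tag names against a table of styles can be as simple as a hash lookup inside a substitution. This is a minimal sketch: the %html_for table, its element names, and the xml_to_html function are all assumptions, and tags that carry attributes are deliberately left untouched.

```perl
use strict;
use warnings;

# Lookup table from XML element names to HTML element names.
my %html_for = (Car => 'div', Model => 'h2', RoofStyle => 'p');

# Replace each plain start or end tag with its HTML counterpart.
# Tags not in the table are left as they are.
sub xml_to_html {
    my ($xml) = @_;
    $xml =~ s{<(/?)([A-Za-z][\w.-]*)>}{
        exists $html_for{$2} ? "<$1$html_for{$2}>" : "<$1$2>"
    }ge;
    return $xml;
}

print xml_to_html("<Car><Model>MGB GT</Model><RoofStyle>Fixed</RoofStyle></Car>"), "\n";
```

Because the same table drives both start and end tags, the output stays balanced as long as the input was.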

Enter the parser
Actually, you can save yourself all of the setup of finding tags and determining their purpose, because the recently introduced XML parser handles these activities in a more robust fashion.

All you have to do is write callback functions to handle the elements of particular interest to you. There are three or four functions you're likely to want to write:

Start and end functions run as you'd expect (when XML start and end tags are found)
A character handler function runs for almost all of the tags you're likely to worry about, calling you with each tag's nonmarkup content
A default function handles anything you haven't assigned a handler for (and since you can't register a handler for comment tags, this is handy to use on occasion)

The following is a typical, basic setup for using XML::Parser:

use XML::Parser;
my $file = shift;  # you'd probably want to error check the name
# Create the instance of the parser. The ErrorContext setting determines
# the number of lines on either side of the error that will be returned.
# In this case, we get two lines above and two below.
my $parser = XML::Parser->new(ErrorContext => 2);
# Give the parser the name of each callback function
$parser->setHandlers(Start => \&startHandler,
                     Char  => \&charHandler);
# Let the parser do the heavy lifting
$parser->parsefile($file);
sub startHandler
{
  my ($expat, $element, %attrs) = @_;
  # does startup stuff of our choosing
}
sub charHandler
{
  my ($expat, $text) = @_;
  # receives callback for each run of character data (nonmarkup content)
}
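
To show the other handlers at work, here is a sketch that converts XML to HTML using Start, End, Char, and Default handlers together. The %html_for table, its element names, and the convert function are assumptions for the example; elements missing from the table simply have their tags dropped.

```perl
use strict;
use warnings;
use XML::Parser;

# Element-to-HTML table; the names are assumptions for this example.
my %html_for = (Car => 'div', Model => 'h2', RoofStyle => 'p');

sub convert {
    my ($xml) = @_;
    my $html = '';
    my $parser = XML::Parser->new(Handlers => {
        Start => sub { my ($expat, $el) = @_; $html .= "<$html_for{$el}>"  if $html_for{$el} },
        End   => sub { my ($expat, $el) = @_; $html .= "</$html_for{$el}>" if $html_for{$el} },
        Char  => sub {
            my ($expat, $text) = @_;
            # The parser hands us decoded text, so re-escape it for HTML.
            $text =~ s/&/&amp;/g;
            $text =~ s/</&lt;/g;
            $html .= $text;
        },
        # Everything without its own handler (comments, for one) lands here.
        Default => sub { },
    });
    $parser->parse($xml);
    return $html;
}

print convert('<Car><!-- draft --><Model>MGB &amp; GT</Model></Car>'), "\n";
```

Here the Default handler does nothing, which quietly discards the comment; it could just as easily copy such material through to the output.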

So what are you going to do when your character handler gets called? Who knows -- it's likely to be something specific to your application of the moment. Perhaps you're looking for a specific string of numbers, but only within <SerialCode> tags. Maybe you need to make sure that the image of the brain CAT scan has a <PatientName> tag that matches the patient record (it being not a good idea to get your patients mixed up when you're planning brain surgery).
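The serial-number case might be sketched as follows. The character handler asks the parser which element it is currently inside and collects text only there. The element is spelled SerialCode because XML names can't contain spaces; that name, the find_serials function, and the all-digits rule are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect the text of every <SerialCode> element and keep the ones that are
# all digits.
sub find_serials {
    my ($xml) = @_;
    my (@serials, $buffer);
    my $parser = XML::Parser->new(Handlers => {
        Start => sub { my ($expat, $el) = @_; $buffer = '' if $el eq 'SerialCode' },
        Char  => sub {
            my ($expat, $text) = @_;
            # current_element names the innermost element still open.
            $buffer .= $text if $expat->current_element eq 'SerialCode';
        },
        End   => sub {
            my ($expat, $el) = @_;
            push @serials, $buffer if $el eq 'SerialCode' && $buffer =~ /^\d+$/;
        },
    });
    $parser->parse($xml);
    return @serials;
}

print "$_\n" for find_serials(
    '<Part><Name>12345</Name><SerialCode>98765</SerialCode></Part>'
);
```

The digits inside Name are ignored even though they'd match the pattern, because the handler only collects text while SerialCode is the open element.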

There's little doubt that most of us who work with the Web will be dealing with XML-formatted information in a big way in roughly the next year or so. Since we'll be feeling our way for a while -- the whole Web is a history of figuring out how to do things as we're doing them -- it makes sense to use tools that enable us to nail things down in a hurry.

Perl is an excellent tool for this, and with XML::Parser, you've got a module that does all the nit-picking work for you, so you can focus on managing application data rather than sorting out a menacing tangle of angle brackets.

About the author
Robert Richardson has used Perl for a number of Web-related projects. His articles have appeared in magazines such as Network Magazine, BYTE, Win32 Scripting Journal, and Home Office Computing. Contact him at www.smallofficetech.com.