It shall be easy to write programs which process XML documents. Not that the standard doesn't live up to this aim, nor that it's not a worthwhile goal, but the sweet naivete of saying such a thing gives me an ear-to-ear grin every time I think about it. In computing, nothing simple stays simple. Still, things that start out simple do sometimes manage to remain simpler than they might otherwise have been. HTML remains pleasantly straightforward, though it's no longer something you can fully explain over a single cup of coffee. Perl, the scripting language so many of us have gotten so fond of, certainly started from Larry Wall's impulse to make it simple to read, mangle, and remangle text files. HTML and Perl proved well matched in their desire to get most things done without a fuss. It shouldn't, therefore, surprise us that XML, which we might describe as "HTML freed from the chains of fixed-element names," is a format that we can tweak and mangle rather effectively with Perl. As with so many potential uses of Perl, someone has already done a lot of the work, so our job is simply to
That shiny new module is XML::Parser, a bit of glue originally written by Wall and later adopted by Clark Cooper that provides a Perl interface to James Clark's C-based expat parser. You can find XML::Parser at literally hundreds of mirror sites, but one to start with is ftp://ftp.epix.net/pub/languages/perl/CPAN.html.
What to do

But in the real world, an awful lot of XML is done in complete ignorance of Unicode concerns, largely because Unicode's designers thoughtfully allowed for a compact encoding of Unicode called UTF-8, in which each ASCII character occupies a single byte. UTF-8 is therefore a superset of our dear friend ASCII, so pure ASCII files can legitimately be XML files. Not all XML files will be in a UTF-8 encoding that we can deal with, but for those that are, we're in business.
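To make the ASCII/UTF-8 relationship concrete, here is a minimal sketch using Perl's core Encode module. The helper names `is_valid_utf8` and `is_pure_ascii` and the sample strings are assumptions made up for illustration; they are not part of the original article.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Returns 1 if the byte string decodes cleanly as UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Returns 1 if every byte falls in the 7-bit ASCII range, 0 otherwise.
sub is_pure_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical sample data
my $ascii  = '<model>MGB GT</model>';
my $latin1 = "<owner>Ren\xE9</owner>";                 # a lone 0xE9 byte (Latin-1)
my $utf8   = encode('UTF-8', "<owner>Ren\x{E9}</owner>");  # same text, UTF-8 bytes

for my $sample ([ascii => $ascii], [latin1 => $latin1], [utf8 => $utf8]) {
    my ($label, $bytes) = @$sample;
    printf "%-7s pure ASCII: %d  valid UTF-8: %d\n",
        $label, is_pure_ascii($bytes), is_valid_utf8($bytes);
}
```

Any string that passes the pure-ASCII test also passes the UTF-8 test, which is the property the paragraph above relies on. The Latin-1 sample fails both.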
So, assuming XML source with 8-bit characters that Perl can handle,
there are several sets of tasks we might tackle with Perl. Since XML
allows us to have elements that relate to the logic of the information
we're storing, we may have pages filled with tags never before seen on
the Web. For our home page dedicated to classic MG coupes, we might
have

Assuming we're getting this information from a database of MG shop manuals, we might want to generate XML pages based on information culled from the database. Since this is really just a specialized kind of report formatting, there's nothing new for us here -- spiffy reports is what Perl is all about.

But there's also the strong possibility that we'll start entering free-form database content into our document-storage system using XML formatting. We'll want to make sure that our data is, at the very least, well formed (that is, we end-tag all the elements we begin and we meet a list of other nagging requirements). We may also want to ensure that certain aspects of business logic are enforced, so that we don't do lazy things like write summaries of books without including information about the author and the publisher. To enforce such rules, or merely to enforce that documents be well formed, we'll need to parse and interpret our XML files.

Additionally, since XML is a great self-documenting format for exchanging data between systems, we may want to be able to package and send records to, or receive and unpack records from, other database systems or transaction applications. Perl's a great way to knock off this sort of task in a hurry.

There's also one bit of Web-site production that we might want to knock off using Perl. At present, there's only one mainstream browser that natively displays XML. To step lightly over a potentially fraught issue, let's just say it's not necessarily the browser of choice for the open source community. So, for other browsers that haven't caught up to XML, we want to be able to quickly convert our XML source code into HTML.

In all of the tasks where the input is an XML file, we'll have to begin by parsing the file into its constituent elements. If we're converting an XML file to an HTML file, we'll want to convert each element into a corresponding HTML element as we go.
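Going back to the first task mentioned above, generating XML from database records, the following is a rough sketch of how that report-style formatting might look. The element names (`coupe`, `model`, `year`, `note`), the helper names, and the sample row are all hypothetical; a real script would fetch its rows from the database rather than hard-code them.

```perl
use strict;
use warnings;

# Escape the characters that XML reserves in text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or we'd re-escape our own output
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Turn one record (a hash reference) into an XML element.
# Field names become child element names, in the order given.
sub record_to_xml {
    my ($name, $fields, $record) = @_;
    my $xml = "<$name>\n";
    for my $field (@$fields) {
        next unless defined $record->{$field};    # skip missing fields
        $xml .= "  <$field>" . xml_escape($record->{$field}) . "</$field>\n";
    }
    return $xml . "</$name>\n";
}

# Hypothetical row, standing in for a fetch from the shop-manual database
my %row = (model => 'MGB GT', year => 1967, note => 'Check points & plugs');
print record_to_xml('coupe', [qw(model year note)], \%row);
```

Because every start tag is paired with an end tag and reserved characters are escaped, output built this way should at least satisfy the basic "end-tag everything you begin" part of being well formed.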
As a preliminary move, let's assume that the lookup table matching styles to elements has been placed into memory. Now we can approach the parsing of our original XML file simply by sucking up the source file and reading through it, looking for tags by employing Perl's powerful and rather cryptic regular expressions:
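    $file = join '', <>;    # read the whole source file, not just its first line
    while ($file =~ /[^<]*<(\/)?([^>]+)>/g) {
        # identify and process the element; $1 is "/" for an end tag, $2 is the tag text
    }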
If you're unfamiliar with regular expressions in Perl, that second
line is going to look pretty frightful. All it's doing though, is
finding and isolating for us any part of the file that appears within
angle brackets, as in
Of course, it doesn't have to be a conventional element tag, so you'll
have to go through a process of elimination to weed out the
You don't have to set all this up on your own, though, because Michael Leventhal has a nice script that handles this triage of tags as part of a utility that checks to see if an XML file is well formed. (You can find it, along with a nice article explaining it in detail, at http://www.xml.com.) If you write your own function to handle each case as it comes up, you can focus on the just-plain-old elements, match the tag names to those in your table of styles, and insert appropriate HTML code into an output file.
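The process of elimination and table lookup described above might be sketched roughly as follows. This is not Leventhal's script; the `%style` table, the element names, and the `xml_to_html` function are assumptions made up for illustration, and the regular expression is the same naive kind used earlier, so comments containing ">" and CDATA sections are not handled.

```perl
use strict;
use warnings;

# Hypothetical lookup table: XML element name => HTML element name
my %style = (manual => 'div', title => 'h1', step => 'p');

# Replace each known element with its HTML counterpart. Declarations,
# comments, processing instructions, and unknown tags are dropped;
# the text between tags passes through unchanged.
sub xml_to_html {
    my ($xml) = @_;
    my ($html, $last) = ('', 0);
    while ($xml =~ /([^<]*)<(\/?)([^>]+)>/g) {
        my ($text, $slash, $tag) = ($1, $2, $3);
        $last = pos $xml;                     # remember where this tag ended
        $html .= $text;                       # text before the tag
        next if $tag =~ /^[?!]/;              # <?xml ...?>, <!DOCTYPE ...>, <!-- ... -->
        my ($name) = $tag =~ /^([^\s\/]+)/;   # element name, minus any attributes
        next unless defined $name && exists $style{$name};
        $html .= "<$slash$style{$name}>";
    }
    return $html . substr($xml, $last);       # text after the final tag
}

my $xml = qq{<?xml version="1.0"?>\n}
        . qq{<manual><title>MGB GT</title><step n="1">Drain the oil.</step></manual>\n};
print xml_to_html($xml);
```

Note that attributes are simply discarded and empty-element tags such as `<step/>` would come out as an unclosed start tag, so a sketch like this is only a starting point for the triage the article describes.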
Enter the parser

All you have to do is write callback functions to handle the elements of particular interest to you. There are three or four functions you're likely to want to write:
The following is a typical, basic setup for using XML::Parser:
    use XML::Parser;

    my $file = shift;    # you'd probably want to error check the name

    # Create the instance of the parser. The ErrorContext setting determines
    # the number of lines on either side of the error that will be returned.
    # In this case, we get two lines above and two below.
    my $parser = XML::Parser->new(ErrorContext => 2);

    # Give the parser the name of each callback function
    $parser->setHandlers(Start => \&startHandler,
                         Char  => \&charHandler);

    # Let the parser do the heavy lifting
    $parser->parsefile($file);

    sub startHandler {
        # called at each start tag; does element-start stuff of our choosing
    }

    sub charHandler {
        # called for each run of "normal" character data
    }
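As one illustration of what the callbacks might do, here is a small sketch that collects the text found inside one particular element and ignores everything else. It assumes the XML::Parser module is installed; the element name `partno`, the `part_numbers` function, and the sample document are hypothetical. It also adds an End handler alongside Start and Char, so the code can tell when it has left the element of interest.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect the text of every <partno> element (a hypothetical element name).
sub part_numbers {
    my ($xml) = @_;
    my @found;
    my $depth = 0;    # greater than zero while we are inside a <partno>

    my $parser = XML::Parser->new(
        ErrorContext => 2,
        Handlers     => {
            Start => sub {
                my ($expat, $element) = @_;
                if ($element eq 'partno') {
                    $depth++;
                    push @found, '';    # start a new, empty result
                }
            },
            End => sub {
                my ($expat, $element) = @_;
                $depth-- if $element eq 'partno';
            },
            Char => sub {
                my ($expat, $text) = @_;
                # Text can arrive in several pieces, so append rather than assign
                $found[-1] .= $text if $depth > 0;
            },
        },
    );

    $parser->parse($xml);    # dies if the document is not well formed
    return @found;
}

# Hypothetical sample document
my $doc = '<manual><step>Fit gasket <partno>12345</partno> dry.</step></manual>';
print "$_\n" for part_numbers($doc);
```

Since `parse` dies on input that is not well formed, wrapping the call in `eval { ... }` is one way to use the same parser as a well-formedness check.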
So what are you going to do when your character handler gets called?
Who knows -- it's likely to be something specific to your application
of the moment. Perhaps you're looking for a specific string of numbers,
but only within

There's little doubt that most of us who work with the Web will be dealing with XML-formatted information in a big way in roughly the next year or so. Since we'll be feeling our way for a while -- the whole Web is a history of figuring out how to do things as we're doing them -- it makes sense to use tools that enable us to nail things down in a hurry.
Perl is an excellent tool for this, and with XML::Parser, you've got a
module that does all the nit-picking work for you, so you can focus on
managing application data rather than sorting out a menacing tangle of
angle brackets.
(c) 1999-expo LinuxWorld, published by Web Publishing Inc.