home *** CD-ROM | disk | FTP | other *** search
- =head1 NAME
-
- XML::LibXML::Parser - Parsing XML Data with XML::LibXML
-
- =head1 SYNOPSIS
-
- $parser = XML::LibXML->new();
- $doc = $parser->parse_file( $xmlfilename );
- $doc = $parser->parse_fh( $io_fh );
- $doc = $parser->parse_string( $xmlstring);
- $doc = $parser->parse_html_file( $htmlfile );
- $doc = $parser->parse_html_fh( $io_fh );
- $doc = $parser->parse_html_string( $htmlstring );
- $doc = $parser->parse_sgml_file( $sgmlfile );
- $doc = $parser->parse_sgml_fh( $io_fh );
- $doc = $parser->parse_sgml_string( $sgmlstring );
- $fragment = $parser->parse_balanced_chunk( $wbxmlstring );
- $fragment = $parser->parse_xml_chunk( $wbxmlstring );
- $parser->process_xincludes( $doc );
- $parser->processXIncludes( $doc );
- $parser->parse_chunk($string, $terminate);
- $parser->start_push();
- $parser->push(@data);
- $doc = $parser->finish_push( $recover );
- $parser->validation(1);
- $parser->recover(1);
- $parser->expand_entities(0);
- $parser->keep_blanks(0);
- $parser->pedantic_parser(1)
- $parser->line_numbers(1)
- $parser->load_ext_dtd(1);
- $parser->complete_attributes(1);
- $parser->expand_xinclude(1);
- $parser->load_catalog( $catalog_file );
- $parser->base_uri( $your_base_uri );
- $parser->gdome_dom(1);
- $parser->match_callback($subref);
- $parser->open_callback($subref);
- $parser->read_callback($subref);
- $parser->close_callback($subref);
-
-
- =head1 DESCRIPTION
-
-
- =head1 SYNOPSIS
-
- use XML::LibXML;
- my $parser = XML::LibXML->new();
-
- my $doc = $parser->parse_string(<<'EOT');
- <some-xml/>
- EOT
- my $fdoc = $parser->parse_file( $xmlfile );
-
- my $fhdoc = $parser->parse_fh( $xmlstream );
-
- my $fragment = $parser->parse_xml_chunk( $xml_wb_chunk );
-
-
- =head1 PARSING
-
- A XML document is read into a datastructure such as a DOM tree by a piece of
- software, called parser. XML::LibXML currently provides four diffrent parser
- interfaces:
-
-
- =over 4
-
- =item * A DOM Pull-Parser
-
- =item * A DOM Push-Parser
-
- =item * A SAX Parser
-
- =item * A DOM based SAX Parser.
-
- =back
-
-
- =head2 Creating a Parser Instance
-
- XML::LibXML provides an OO interface to the libxml2 parser functions. Thus you
- have to create a parser instance before you can parse any XML data.
-
- =over 4
-
- =item B<new>
-
- $parser = XML::LibXML->new();
-
- There is nothing much to say about the constructor. It simply creates a new
- parser instance.
-
- Although libxml2 uses mainly global flags to alter the behaviour of the parser,
- each XML::LibXML parser instance has its own flags or callbacks and does not
- interfere with other instances.
-
-
-
- =back
-
-
- =head2 DOM Parser
-
- One of the common parser interfaces of XML::LibXML is the DOM parser. This
- parser reads XML data into a DOM like datastructure, so each tag can get
- accessed and transformed.
-
- XML::LibXML's DOM parser is not only capable to parse XML data, but also
- (strict) HTML and SGML files. There are three ways to parse documents - as a
- string, as a Perl filehandle, or as a filename. The return value from each is a
- XML::LibXML::Document object, which is a DOM object.
-
- All of the functions listed below will throw an exception if the document is
- invalid. To prevent this causing your program exiting, wrap the call in an
- eval{} block
-
- =over 4
-
- =item B<parse_file>
-
- $doc = $parser->parse_file( $xmlfilename );
-
- This function reads an absolute filename into the memory. It causes XML::LibXML
- to use libxml2's file parser instead of letting perl reading the file such as
- with parse_fh(). If you need to parse files directly, this function would be
- the faster choice, since this function is about 6-8 times faster then
- parse_fh().
-
-
- =item B<parse_fh>
-
- $doc = $parser->parse_fh( $io_fh );
-
- parse_fh() parses a IOREF or a subclass of IO::Handle.
-
- Because the data comes from an open handle, libxml2's parser does not know
- about the base URI of the document. To set the base URI one should use
- parse_fh() as follows:
-
- my $doc = $parser->parse_fh( $io_fh, $baseuri );
-
-
- =item B<parse_string>
-
- $doc = $parser->parse_string( $xmlstring);
-
- This function is similar to parse_fh(), but it parses a XML document that is
- available as a single string in memory. Again, you can pass an optional base
- URI to the function.
-
- my $doc = $parser->parse_stirng( $xmlstring, $baseuri );
-
-
- =item B<parse_html_file>
-
- $doc = $parser->parse_html_file( $htmlfile );
-
- Similar to parse_file() but parses HTML (strict) documents.
-
-
- =item B<parse_html_fh>
-
- $doc = $parser->parse_html_fh( $io_fh );
-
- Similar to parse_fh() but parses HTML (strict) streams.
-
-
- =item B<parse_html_string>
-
- $doc = $parser->parse_html_string( $htmlstring );
-
- Similar to parse_file() but parses HTML (strict) strings.
-
-
- =item B<parse_sgml_file>
-
- $doc = $parser->parse_sgml_file( $sgmlfile );
-
- Similar to parse_file() but parses SGML documents.
-
-
- =item B<parse_sgml_fh>
-
- $doc = $parser->parse_sgml_fh( $io_fh );
-
- Similar to parse_file() but parses SGML streams.
-
-
- =item B<parse_sgml_string>
-
- $doc = $parser->parse_sgml_string( $sgmlstring );
-
- Similar to parse_file() but parses SGML strings.
-
-
-
- =back
-
- Parsing HTML may causes problems, especially if the ampersand ('&') is used.
- This is a common problem if HTML code is parsed that contains links to
- CGI-scripts. Such links cause the parser to throw errors. In such cases libxml2
- still parses the entire document as there was no error, but the error causes
- XML::LibXML to stop the parsing process. However, the document is not lost.
- Such HTML documents should be parsed using the recover flag. By default
- recovering is deactivated.
-
- The functions described above are implemented to parse well formed documents.
- In some cases a program gets well balanced XML instead of well formed documents
- (e.g. a XML fragment from a Database). With XML::LibXML it is not required to
- wrap such fragments in the code, because XML::LibXML is capable even to parse
- well balanced XML fragments.
-
- =over 4
-
- =item B<parse_balanced_chunk>
-
- $fragment = $parser->parse_balanced_chunk( $wbxmlstring );
-
- This function parses a well balanced XML string into a
- XML::LibXML::DocumentFragment.
-
-
- =item B<parse_xml_chunk>
-
- $fragment = $parser->parse_xml_chunk( $wbxmlstring );
-
- This is the old name of parse_balanced_chunk(). Because it may causes confusion
- with the push parser interface, this function should be used anymore.
-
-
-
- =back
-
- By default XML::LibXML does not process XInclude tags within a XML Document
- (see options section below). XML::LibXML allows to post process a document to
- expand XInclude tags.
-
- =over 4
-
- =item B<process_xincludes>
-
- $parser->process_xincludes( $doc );
-
- After a document is parsed into a DOM structure, you may want to expand the
- documents XInclude tags. This function processes the given document structure
- and expands all XInclude tags (or throws an error) by using the flags and
- callbacks of the given parser instance.
-
- Note that the resulting Tree contains some extra nodes (of type
- XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully processing the
- document. These nodes indicate where data was included into the original tree.
- if the document is serialized, these extra nodes will not show up.
-
- Remember: A Document with processed XIncludes differs from the original
- document after serialization, because the original XInclude tags will not get
- restored!
-
- If the parser flag "expand_xincludes" is set to 1, you need not to post process
- the parsed document.
-
-
- =item B<processXIncludes>
-
- $parser->processXIncludes( $doc );
-
- This is an alias to process_xincludes, but through a JAVA like function name.
-
-
-
- =back
-
-
- =head2 Push Parser
-
- XML::LibXML provides a push parser interface. Rather than pulling the data from
- a given source the push parser waits for the data to be pushed into it.
-
- This allows one to parse large documents without waiting for the parser to
- finish. The interface is especially usefull if a program needs to preprocess
- the incoming pieces of XML (e.g. to detect document boundaries).
-
- While XML::LibXML parse_*() functions force the data to be a wellformed XML,
- the push parser will take any arbitrary string that contains some XML data. The
- only requirement is that all the pushed strings are together a wellformed
- document. With the push parser interface a programm can interrupt the parsing
- process as required, where the parse_*() functions give not enough flexibility.
-
- Different to the pull parser implemented in parse_fh() or parse_file(), the
- push parser is not able to find out about the documents end itself. Thus the
- calling program needs to indicate explicitly when the parsing is done.
-
- In XML::LibXML this is done by a single function:
-
- =over 4
-
- =item B<parse_chunk>
-
- $parser->parse_chunk($string, $terminate);
-
- parse_chunk() tries to parse a given chunk of data, which isn't nessecarily
- well balanced data. The function takes two parameters: The chunk of data as a
- string and optional a termination flag. If the termination flag is set to a
- true value (e.g. 1), the parsing will be stopped and the resulting document
- will be returned as the following exable describes:
-
- my $parser = XML::LibXML->new;
- for my $string ( "<", "foo", ' bar="hello worls"', "/>") {
- $parser->parse_chunk( $string );
- }
- my $doc = $parser->parse_chunk("", 1); # terminate the parsing
-
-
-
- =back
-
- Internally XML::LibXML provides three functions that controll the push parser
- process:
-
- =over 4
-
- =item B<start_push>
-
- $parser->start_push();
-
- Initializes the push parser.
-
-
- =item B<push>
-
- $parser->push(@data);
-
- This function pushs the data stored inside the array to libxml2's parse. Each
- entry in @data must be a normal scalar!
-
-
- =item B<finish_push>
-
- $doc = $parser->finish_push( $recover );
-
- This function returns the result of the parsing process. If this function is
- called without a parameter it will complain about non wellformed documents. If
- $restore is 1, the push parser can be used to restore broken or non well formed
- (XML) documents as the following example shows:
-
- eval {
- $parser->push( "<foo>", "bar" );
- $doc = $parser->finish_push(); # will report broken XML
- };
- if ( $@ ) {
- # ...
- }
-
- This can be anoing if the closing tag misses by accident. The following code
- will restore the document:
-
- eval {
- $parser->push( "<foo>", "bar" );
- $doc = $parser->finish_push(1); # will return the data parsed
- # unless an error happend
- };
-
- print $doc->toString(); # returns "<foo>bar</foo>"
-
- Of course finish_push() will return nothing if there was no data pushed to the
- parser before.
-
-
-
- =back
-
-
- =head2 DOM based SAX Parser
-
- XML::LibXML provides a DOM based SAX parser. The SAX parser is defined in
- XML::LibXML::SAX::Parser. As it is not a stream based parser, it parses
- documents into a DOM and traverses the DOM tree instead.
-
- The API of this parser is exactly the same as any other Perl SAX2 parser. See
- XML::SAX::Intro for details.
-
- Aside from the regular parsing methods, you can access the DOM tree traverser
- directly, using the generate() method:
-
- my $doc = build_yourself_a_document();
- my $saxparser = $XML::LibXML::SAX::Parser->new( ... );
- $parser->generate( $doc );
-
- This is useful for serializing DOM trees, for example that you might have done
- prior processing on, or that you have as a result of XSLT processing.
-
- WARNING
-
- This is NOT a streaming SAX parser. As I said above, this parser reads the
- entire document into a DOM and serialises it. Some people couldn't read that in
- the paragraph above so I've added this warning.
-
- If you want a streaming SAX parser look at the XML::LibXML::SAX man page
-
-
- =head1 SERIALIZATION
-
- XML::LibXML provides some functions to serialize nodes and documents. The
- serialization functions themself are described at the XML::LibXML::Node manpage
- or the XML::LibXML::Document manpage. XML::LibXML checks three global flags
- that alter the serialization process:
-
-
- =over 4
-
- =item * skipXMLDeclaration
-
- =item * skipDTD
-
- =item * setTagCompression
-
- =back
-
- of that three functions only setTagCompression is available for all
- serialization functions.
-
- Because XML::LibXML does these flags not itself, one has to define them locally
- as the following example shows:
-
- local $XML::LibXML::skipXMLDeclaration = 1;
- local $XML::LibXML::skipDTD = 1;
- local $XML::LibXML::setTagCompression = 1;
-
- If skipXMLDeclaration is defined and not '0', the XML declaration is omitted
- during serialization.
-
- If skipDTD is defined and not '0', an existing DTD would not be serialized with
- the document.
-
- If setTagCompression is defined and not '0' empty tags are displayed as open
- and closing tags ranther than the shortcut. For example the empty tag foo will
- be rendered as <foo></foo> rather than <foo/>.
-
-
- =head1 PARSER OPTIONS
-
- LibXML options are global (unfortunately this is a limitation of the underlying
- implementation, not this interface). They can either be set using
- $parser->option(...), or XML::LibXML->option(...), both are treated in the same
- manner. Note that even two parser processes will share some of the same
- options, so be careful out there!
-
- Every option returns the previous value, and can be called without parameters
- to get the current value.
-
- =over 4
-
- =item B<validation>
-
- $parser->validation(1);
-
- Turn validation on (or off). Defaults to off.
-
-
- =item B<recover>
-
- $parser->recover(1);
-
- Turn the parsers recover mode on (or off). Defaults to off.
-
- This allows to parse broken XML data into memory. This switch will only work
- with XML data rather than HTML data. Also the validation will be switched off
- automaticly.
-
- The recover mode helps to recover documents that are almost wellformed very
- efficiently. That is for example a document that forgets to close the document
- tag (or any other tag inside the document). The recover mode of XML::LibXML has
- problems though to restore documents that are more like well ballanced chunks.
-
- XML::LibXML will only parse until the first fatal error occours.
-
-
- =item B<expand_entities>
-
- $parser->expand_entities(0);
-
- Turn entity expansion on or off, enabled by default. If entity expansion is
- off, any external parsed entities in the document are left as entities.
- Probably not very useful for most purposes.
-
-
- =item B<keep_blanks>
-
- $parser->keep_blanks(0);
-
- Allows you to turn off XML::LibXML's default behaviour of maintaining
- whitespace in the document.
-
-
- =item B<pedantic_parser>
-
- $parser->pedantic_parser(1)
-
- You can make XML::LibXML more pedantic if you want to.
-
-
- =item B<line_numbers>
-
- $parser->line_numbers(1)
-
- If this option is activated XML::LibXML will store the line number of a node.
- This gives more information where a validation error occoured. It could be also
- used to find out about the position of a node after parsing (see also
- XML::LibXML::Node::line_number())
-
- By default line numbering is switched off (0).
-
-
- =item B<load_ext_dtd>
-
- $parser->load_ext_dtd(1);
-
- Load external DTD subsets while parsing.
-
-
- =item B<complete_attributes>
-
- $parser->complete_attributes(1);
-
- Complete the elements attributes lists with the ones defaulted from the DTDs.
- By default, this option is enabled.
-
-
- =item B<expand_xinclude>
-
- $parser->expand_xinclude(1);
-
- Expands XIinclude tags imidiatly while parsing the document. This flag assures
- that the parser callbacks are used while parsing the included document.
-
-
- =item B<load_catalog>
-
- $parser->load_catalog( $catalog_file );
-
- Will use $catalog_file as a catalog during all parsing processes. Using a
- catalog will significantly speed up parsing processes if many external
- ressources are loaded into the parsed documents (such as DTDs or XIncludes).
-
- Note that catalogs will not be available if an external entity handler was
- specified. At the current state it is not possible to make use of both types of
- resolving systems at the same time.
-
-
- =item B<base_uri>
-
- $parser->base_uri( $your_base_uri );
-
- In case of parsing strings or file handles, XML::LibXML doesn't know about the
- base uri of the document. To make relative references such as XIncludes work,
- one has to set a separate base URI, that is then used for the parsed documents.
-
-
- =item B<gdome_dom>
-
- $parser->gdome_dom(1);
-
- THIS FLAG IS EXPERIMENTAL!
-
- Although quite powerful XML:LibXML's DOM implementation is limited if one needs
- or wants full DOM level 2 or level 3 support. XML::GDOME is based on libxml2 as
- well but provides a rather complete DOM implementation by wrapping libgdome.
- This allows you to make use of XML::LibXML's full parser options and
- XML::GDOME's DOM implementation at the same time.
-
- To make use of this function, one has to install libgdome and configure
- XML::LibXML to use this library. For this you need to rebuild XML::LibXML!
-
-
-
- =back
-
-
- =head2 Input Callbacks
-
- If libxml2 has to load external documents during parsing, this may cause
- strange results, if the location is not a HTTP, FTP or relative location. To
- get around this limitation, one may add its own input handler, to open, read
- and close particular locations or URI classes.
-
- The input callbacks are used whenever LibXML has to get something other than
- external parsed entities from somewhere. The input callbacks in LibXML are
- stacked on top of the original input callbacks within the libxml library. This
- means that if you decide not to use your own callbacks (see match()), then you
- can revert to the default way of handling input. This allows, for example, to
- only handle certain URI schemes.
-
- Callbacks are only used on files, but not on strings or filehandles. This is
- because LibXML requires the match event to find out about which callback set is
- shall be used for the current input stream. LibXML can decide this only before
- the stream is open. For LibXML strings and filehandles are already opened
- streams.
-
- The following callbacks are defined:
-
- =over 4
-
- =item B<match_callback>
-
- $parser->match_callback($subref);
-
- If you want to handle the URI, simply return a true value from this callback.
-
-
- =item B<open_callback>
-
- $parser->open_callback($subref);
-
- Open something and return it to handle that resource.
-
-
- =item B<read_callback>
-
- $parser->read_callback($subref);
-
- Read a certain number of bytes from the resource. This callback is called even
- if the entire Document has already read.
-
-
- =item B<close_callback>
-
- $parser->close_callback($subref);
-
- Close the handle associated with the resource.
-
-
-
- =back
-
- The following example explains the concept a bit. It is a purely fictitious
- example that uses a MyScheme::Handler object that responds to methods similar
- to an IO::Handle.
-
- $parser->match_callback(\&match_uri);
-
- $parser->open_callback(\&open_uri);
-
- $parser->read_callback(\&read_uri);
-
- $parser->close_callback(\&close_uri);
-
- sub match_uri {
- my $uri = shift;
- return $uri =~ /^myscheme:/;
- }
-
- sub open_uri {
- my $uri = shift;
- return MyScheme::Handler->new($uri);
- }
-
- sub read_uri {
- my $handler = shift;
- my $length = shift;
- my $buffer;
- read($handler, $buffer, $length);
- return $buffer;
- }
-
- sub close_uri {
- my $handler = shift;
- close($handler);
- }
-
- A more realistic example can be found in the "example" directory.
-
- Since the parser requires all callbacks defined it is also possible to set all
- callbacks with a single call of callbacks(). This would implify the example
- code to:
-
- $parser->callbacks( \&match_uri, \&open_uri, \&read_uri, \&close_uri);
-
- All functions that are used to set the callbacks, can also be used to retrieve
- the callbacks from the parser.
-
- Optionaly it is possible to apply global callback on the XML::LibXML class
- level. This allows multiple parses to share the same callbacks. To set these
- global callbacks one can use the callback access functions directly on the
- class.
-
- XML::LibXML->callbacks( \&match_uri, \&open_uri, \&read_uri, \&close_uri);
-
- The previous code snippet will set the callbacks from the first example as
- global callbacks.
-
-
- =head1 ERROR REPORTING
-
- XML::LibXML throws exceptions during parseing, validation or XPath processing
- (and some other occations). These errors can be catched by using eval blocks.
- The error then will be stored in $@. Alternatively one can use the
- get_last_error() function of XML::LibXML. It will return the same string that
- is stored in $@. Using get_last_error() makes it still nessecary to eval the
- statement, since these function groups will die() on errors.
-
- Note, that the use of get_last_error() still requires eval blocks. XML::LibXML
- throws errors as they occour and does not wait if a user test for them. This is
- a very common misunderstanding in the use of XML::LibXML. If the eval is
- ommited, XML::LibXML will allways halt your script by "croaking" (see Carp man
- page for details).
-
- Also note that an increasing number throws errors if bad data is passed. If you
- cannot asure valid data passed to XML::LibXML you should eval these functions.
-
- get_last_error() can be called either by the class itself or by a parser
- instance:
-
- $errstring = XML::LibXML->get_last_error();
- $errstring = $parser->get_last_error();
-
- However, XML::LibXML exceptions are global. That means if get_last_error() is
- called on an parser instance, the last global error will be returned. This is
- not nessecarily the error caused by the parser instance itself.
-
- =head1 AUTHORS
-
- Matt Sergeant,
- Christian Glahn,
- =head1 VERSION
-
- 1.56
-
- =head1 COPYRIGHT
-
- 2001-2002, AxKit.com Ltd; 2001-2003 Christian Glahn, All rights reserved.
-
- =cut
-