home *** CD-ROM | disk | FTP | other *** search
- $Id: Readme,v 1.4 2003/10/08 10:15:40 harryf Exp $
- ++Introduction
- XML_HTMLSax is a SAX based XML parser for badly formed XML documents,
- such as HTML.
-
- The original code base was developed by Alexander Zhukov and published at
- http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission
- to modify the code and license for inclusion in PEAR.
-
- PEAR::XML_HTMLSax provides an API very similar to the native PHP Expat
- extension, allowing handlers using one to be easily adapted to the other.
- The key difference is HTMLSax will not break on badly formed XML, allowing
- it to be used for parsing HTML documents. Otherwise HTMLSax supports all
- the handlers available from Expat except namespace and external entity
- handlers. Provides methods for handling XML escapes as well as JSP/ASP
- opening and close tags.
-
- Version 2 has had it's internals completely overhauled to use a Lexer,
- delivering performance *approaching* that of the native XML extension,
- as well as a radically improved, modular design that makes adding
- further functionality easy.
-
- The public API has remained the same as older versions, except for the
- set_option() method, the available options having been renamed with additional
- options which allow HTMLSax to behave almost exactly like the native Expat
- extension, allowing contents of XML elements which contain linefeeds, tabs
- and XML entities to trigger additional calls to the registered data handler,
- as defined by XML_HTMLSax::set_data_handler().
-
- A big thanks to Jeff Moore (lead developer of WACT:
- http://wact.sourceforge.net) who's largely responsible for new design, as well
- input from members at Sitepoint's Advanced PHP forums:
- http://www.sitepointforums.com/showthread.php?threadid=121246.
-
- Thanks also to Marcus Baker (lead developer of SimpleTest:
- http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.
-
- ++Uses
- Some particular situations where XML_HTMLSax can be useful include;
- - Parsing XML documents (such as those online) where the source is
- out of your control and Expat is choking because it's badly formed.
- - Converting HTML to XHTML
- - Filtering form posts to allow a limited HTML subset (probably with
- help from PEAR::XML_SaxFilters)
- - Reading HTML based content from a database and converting to PDF (with
- help from a PDF generation library and probably PEAR::XML_SaxFilters as
- well)
- - Parsing ASP(.NET) and JSP pages.
- - Creating a PHP-GTK based web browser? A PHP CSS Parser exists:
- http://www.phpclasses.org/browse.html/package/1081.html
-
- ++Features
- - Won't "break" on badly formed XML. May in some instances get it "wrong"
- (see Limitations) but will continue parsing.
-
- - Provides an API similar to the native PHP XML Expat extension
-
- - Can be instructed to behave in more or less the same manner as Expat,
- when dealing with linefeeds, tabs and XML entities
-
- - In addition to handling basic XML elements attributes and data also
- capable of dealing with;
- - Processing instructions e.g. <?php ?> / <?xml ?> etc. Within PI's
- XML entities are not parsed (i.e. ignore < and > )
- - XML Escape markup such as <! >, <!-- --> and <![CDATA[ ]]>. Within
- this XML entities are not parsed (useful for JavaScript, for example)
- - JSP / ASP (JASP) marked up with <% %>. Note: You will need to
- deal with <%@ %> and <%= %> yourself. With JASP markup XML entities
- are not parsed
-
- ++Usage Notes
-
- - Performance-wise, it runs faster on PHP 4.3.0 thanks to strspn() and
- strcspn() supporting position arguments. For older PHP versions while
- loops are used to achieve the same effect, meaning a slightly higher
- overhead. Note also that setting XML options with XML_HTMLSax::set_option()
- also slows down the parser, the options being handled by "decorators"
- which perform some further formatting on XML events which have already
- been parsed.
-
- - By default, no parser options are set
-
- - Regarding the XML_OPTION_ENTITIES_PARSED, this uses the html_entity_decode()
- function which is only available in PHP 4.3.0+. To get round this, HTMLSax
- checks your PHP version and for the function name html_entity_decode. If not
- found, it defines a function which mirrors the behavior of the native PHP
- html_entity_decode().
- Both XML_OPTION_ENTITIES_PARSED and XML_OPTION_ENTITIES_UNPARSED can be used
- down to PHP version 4.0.5, due to the regular expression used to find entities.
-
- - For attributes which have just a name but no value e.g.
- <option value="bar" selected>
-
- HTMLSax will return a boolean true value for that attribute name, when
- calling the opening tag handler;
-
- function myOpenHandler($parser,$name,$attrs) {
- print_r ( $attrs );
- }
-
- This would produce;
-
- Array
- (
- [value] => bar
- [selected] => true
- )
-
- - JASP directives like <%@ and <%= will be not be regarded as special.
- I.e. you will get back the @ or % from the contents of the JASP block
- and have to deal with these yourself.
-
- ++ Limitations
- - XML_HTMLSax only supports use of PHP classes as callback handlers;
- there is no support for using PHP functions as handlers.
-
- - The only weird behaviour is for attributes which only have a left quote
- or apostrophe e.g. <tag foo="bar>Some Text</tag> will give you an
- attribute like $attrs['foo']="bar>Some Text"; This is a trade off against
- allowing XML entities like < and > to appear inside attributes.
-
- - Although the package name might suggest otherwise, XML_HTMLSax has no
- special knowledge of HTML (i.e. there is no understanding of whether
- a given HTML document is well formed or not, according to HTML's rules).
- XML_HTMLSax is simply intended as a SAX based parser that will not
- complain about structure of a document it is asked to parse. It even does
- a fair job of parsing http://static.php.net/www.php.net/images/php.gif ...
-
- - <script /> elements containing < or > characters; these will be treated as new
- elements triggering the listeners. Make sure that any JavaScript inside is marked
- either with an XML comment <!-- --> or a CDATA block <[CDATA[ ]]>
- </script>
- Alternatively define open / close handlers which watch for <script /> elements
- as a special case, so that any further events triggered within them are handled
- as part of the <script /> element.
-
- - If you change the handlers once parsing has started, you will need to re-set
- and parser options you have defined
-
- ++ Example Use
- Further examples are available in the examples directory of this package.
-
- <?php
- // Include HTMLSax
- require_once('XML/XML_HTMLSax.php');
-
- // Define a customer handler class
- class MyHandler {
- function MyHandler(){}
-
- // Opening tags
- function openHandler(& $parser,$name,$attrs) {
- echo ( 'Open Tag Handler: '.$name );
- echo ( 'Attrs:' );
- print_r($attrs);
- }
-
- // Closing tags
- function closeHandler(& $parser,$name) {
- echo ( 'Close Tag Handler: '.$name );
- }
-
- // Text node handler
- function dataHandler(& $parser,$data) {
- echo ( 'Data Handler: '.$data );
- }
-
- // XML escape handler (e.g. HTML comments)
- function escapeHandler(& $parser,$data) {
- echo ( 'Escape Handler: '.$data );
- }
-
- // Processing instruction handler
- function piHandler(& $parser,$target,$data) {
- echo ( 'PI Handler: '.$target.' - '.$data );
- }
-
- // JSP / ASP markup handler
- function jaspHandler(& $parser,$data) {
- echo ( 'Jasp Handler: '.$data );
- }
- }
-
- // Get some HTML document
- $doc = file_get_contents('http://www.php.net');
-
- // Instantiate the handler
- $handler=new MyHandler();
-
- // Instantiate the parser
- $parser=& new XML_HTMLSax();
-
- // Register the handler with the parser
- $parser->set_object($handler);
-
- // Set a parser option
- $parser->set_option('XML_OPTION_TRIM_DATA_NODES');
-
- // Set the callback handlers (MyHandler methods)
- $parser->set_element_handler('openHandler','closeHandler');
- $parser->set_data_handler('dataHandler');
- $parser->set_escape_handler('escapeHandler');
- $parser->set_pi_handler('piHandler');
- $parser->set_jasp_handler('jaspHandler');
-
- // Parse the document
- $parser->parse($doc);
- ?>