PHP5: Coming Soon to a Webserver Near You

XML Support

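I'd guess that the vast majority of PHP applications deal with the rendering of (X)HTML in some form. Strictly speaking, HTML is not XML (it's an application of SGML, the language from which XML itself was derived), but XHTML is, and the two are close relatives. Clearly, being able to deal with XML in general is important to PHP.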

One thing that became clear with PHP4 is that XML support was not up to scratch, as I mentioned at the start of this article. Only the SAX parser is distributed with PHP by default, which means applications like Krysalis can only be used on enlightened hosts, while raising the subject of the PHP DOM parser elicits many embarrassed coughs in PHP circles.

With PHP5 there's very good news on this front. As I mentioned before, part of PHP's XML "problem" results from the involvement of three third-party XML libraries, which significantly increased the workload for the PHP group. The decision for PHP5 was to throw out the Expat library and the Ginger Alliance XSLT library, and replace them (seamlessly) with the Gnome XML parser, libxml, which is already used for the DOM extension.

While this move may seem insane at first glance, given that PHP's DOM extension has historically been the weakest link, the problem does not lie with libxml itself. Instead, the issue lies with PHP's DOM binding to libxml, which has struggled through a change of maintainers and numerous other problems.

The Gnome libxml library is (according to its own benchmarks) one of the fastest XML parsers out there. As one of the younger implementations of XML-related standards, it has had the advantage of being able to learn from the mistakes of older XML parsers. What's more, it comes with support for almost everything you're likely to be interested in doing with XML today, providing:

  • SAX and DOM APIs,
  • validation against DTDs and XML Schema (the latter being important to web services),
  • XSLT, XPath, XPointer and XInclude support, and
  • parsers for HTML and DocBook XML

plus the possibility of a whole bunch of other stuff, like XMLSec, which you can read about on the libxml Website.
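To give a feel for one item on that list, here is a minimal XPath sketch. Note the assumptions: it uses the DOMDocument and DOMXPath classes from the DOM extension as it later shipped with PHP 5, not the snapshot API shown elsewhere in this article, and the sample catalog document is purely illustrative.

```php
<?php
// Illustrative document; not taken from the article.
$xml = '<catalog>'
     . '<book year="2001"><title>PHP Basics</title></book>'
     . '<book year="2003"><title>XML in Practice</title></book>'
     . '</catalog>';

// Assumption: the DOMDocument / DOMXPath classes of released PHP 5
$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Select the titles of books whose year attribute is greater than 2002
$titles = array();
foreach ( $xpath->query('/catalog/book[@year > 2002]/title') as $node ) {
    $titles[] = $node->nodeValue;
}

echo implode(', ', $titles)."\n";
```

The query is evaluated by libxml itself; the PHP side only has to walk the resulting node list.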

Overall, this is an important step forward for PHP, and may well turn it into one of the leading technologies for dealing with XML on the Web.

The SAX and XSLT extensions have already had their underlying libraries swapped out for the libxml equivalents. From a script's perspective, this should not affect any code using the XML-related functions, as the APIs have remained the same.
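To make that concrete, here is a minimal sketch of a SAX-style script built on the standard xml_* functions. The sample XML string and the handler names (startElement, endElement, characterData) are my own illustrative choices; the point is only that nothing in a script like this refers to the library underneath, so it should not care whether Expat or libxml is doing the parsing.

```php
<?php
// Illustrative input; not taken from the article.
$xml = '<colors><color>Red</color><color selected="selected">Blue</color></colors>';

$elements = array();   // element names, in the order they open
$text     = '';        // all character data, concatenated

// Called for every opening tag. Names arrive upper-cased because
// case folding is enabled by default.
function startElement($parser, $name, $attrs) {
    global $elements;
    $elements[] = $name;
}

// Called for every closing tag; nothing to do in this sketch.
function endElement($parser, $name) {
}

// Called for the text between tags.
function characterData($parser, $data) {
    global $text;
    $text .= $data;
}

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');

// Parse the whole string in one go (third argument marks the final chunk)
if ( !xml_parse($parser, $xml, true) ) {
    die( 'XML error: '.xml_error_string(xml_get_error_code($parser)) );
}

echo implode(', ', $elements)."\n";
echo $text."\n";
```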

In other news, the DOM XML extension is packing some new features, and for those who have "issues" with XML, there's a new extension that could make your life much easier (we’ll be looking at both of these in a moment).

Given that the PHP group have assigned four of their brightest minds to work on PHP's XML support, it's likely that many of the other XML technologies supported by libxml will find their way into PHP5 in the near future.

All in all, very good news.

My personal wish is that it will become policy to bundle all the XML-related extensions into the core distribution so they become widely supported by Web hosts everywhere. Of course, others may disagree.

New DOM Features

The DOM extension is still a work in progress, but the good news is that the core DOM-related functionality seems to be stabilizing. More work has also gone into the XPath and XSLT support it provides and, perhaps most interesting of all, it's now capable of parsing HTML documents without choking (more or less).

Here's a simple HTML document that contains some classic badly-formed XML but still valid HTML:

             
<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">              
<html>              
<head>              
<title> Example Page </title>              
</head>              
<body>              
<h2>Parsing HTML with DOM</h2>              
<p>This page has some examples of classic badly formed HTML.              
<br>              
<form>              
<select>              
<option>Red              
<option selected>Blue              
<option>Green              
</select>              
</form>              
</body>              
</html>              

Using the DOM extension, I can now happily parse it like so:

             
<?php
// Open the HTML document
$doc = html_doc_file('/www/sitepoint/php5/example.html');

// Get the <head /> element
$head = $doc->get_elements_by_tagname('head');
$head = $head[0];

// Get the <body /> element
$body = $doc->get_elements_by_tagname('body');
$body = $body[0];

// Extracts the contents of <title />
function getTitle($head) {
   $headers = $head->child_nodes();
   foreach ( $headers as $header ) {
       // node_name() is defined for every node type, so a stray
       // text node among the children cannot trigger a fatal error
       if ( $header->node_name() == 'title' )
           echo ( "<b>Page Title:</b> ".$header->get_content()."<br />\n" );
   }
}

// Parses the <body /> element
function parseBody($body) {
   $contents = $body->child_nodes();
   foreach ( $contents as $content ) {
       switch ( $content->node_name() ) {
           case 'h2':
              echo ( "<b>Header 2:</b> ".$content->get_content()."<br />\n" );
           break;
           case 'p':
              echo ( "<b>Paragraph:</b> ".$content->get_content()."<br />\n" );
           break;
           case 'form':
               $inputs = $content->child_nodes();
               foreach ( $inputs as $input ) {
                   if ( $input->node_name() == 'select' )
                       parseSelect($input);
               }
           break;
       }
   }
}

// Extract the contents of a <select />
function parseSelect($select) {
   echo ( "<b>Select:</b>\n<ul>\n" );
   $options = $select->child_nodes();
   foreach ( $options as $option ) {
       // Skip text and comment nodes, which have no attributes
       if ( $option->node_type() != XML_ELEMENT_NODE )
           continue;
       echo ( "<li>".$option->get_content() );
       if ( $option->has_attribute('selected') )
           // Escape the arrows so the output is itself valid markup
           echo ( " &lt;&lt;&lt; SELECTED" );
       echo ( "</li>\n" );
   }
   echo ( "</ul>\n" );
}

getTitle($head);
parseBody($body);
?>

Script: dom_html.php

The badly-formed tags and attributes, such as the unclosed <p> and <option> tags and the minimized "selected" attribute, cause no problems whatsoever.
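For comparison, here is a rough sketch of the same tolerance check written against the DOM extension as it later shipped with PHP 5. This is an assumption on my part relative to the snapshot used above: the DOMDocument class and its loadHTML(), getElementsByTagName() and hasAttribute() methods are not the API shown in dom_html.php, and the inline markup is a cut-down version of the example page.

```php
<?php
// A cut-down version of the example page, with the same unclosed tags.
$html = '<html><head><title> Example Page </title></head><body>'
      . '<h2>Parsing HTML with DOM</h2>'
      . '<form><select>'
      . '<option>Red<option selected>Blue<option>Green'
      . '</select></form></body></html>';

$doc = new DOMDocument();
// loadHTML() may emit warnings on sloppy markup; @ silences them in this sketch
@$doc->loadHTML($html);

// Collect the text of every <option> that carries a "selected" attribute
$options  = $doc->getElementsByTagName('option');
$selected = array();
foreach ( $options as $option ) {
    if ( $option->hasAttribute('selected') )
        $selected[] = trim($option->nodeValue);
}

echo 'Options found: '.$options->length."\n";
echo 'Selected: '.implode(', ', $selected)."\n";
```

Because getElementsByTagName() returns only elements, no guard against text nodes is needed here.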

Aside from helping PHP developers to "mine" other Websites (whether you should is a question I leave to your own personal ethics), this will no doubt prove important for "transforming" HTML into other formats, be it XHTML, or something else—like PDF.
