
PHP and XML: Parsing RSS 1.0


PHP's Take on XML

There are two widely used ways for programming languages to read XML documents: event-based APIs and Document Object Model (DOM) APIs. A DOM API reads the entire XML document into memory, where it can then be manipulated through a set of functions that provide access to an object-oriented model of the document (the DOM). DOM APIs are generally considered more powerful, but they suffer from one serious drawback: they are ill-suited to processing large XML documents, because building the model of such a document can take too much memory.

PHP's XML parser functions use an event-based API. In such models, the XML document is read from beginning to end, and an event is set off whenever a start tag, end tag, or block of character data is encountered. Each of these events causes a function of the programmer's choice to be called. Thus, reading an XML document with an event-based API like that of PHP is simply a matter of writing functions that react appropriately to the events that occur as PHP moves through the document.
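To make the idea of events concrete, here is a minimal sketch that simply records each event as the parser reaches it. The XML string and the $events array are made up for illustration, and the handlers are anonymous functions rather than the named functions used in the rest of this article.

```php
<?php
// Illustrative only: collect a log of the events the parser fires.
$events = [];

$parser = xml_parser_create();

// Start-tag and end-tag events
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);

// Character-data events (whitespace-only blocks are skipped here)
xml_set_character_data_handler(
    $parser,
    function ($parser, $text) use (&$events) {
        if (trim($text) !== '') {
            $events[] = 'text:' . trim($text);
        }
    }
);

// A tiny, made-up document passed in one piece, so the last argument is true
xml_parse($parser, '<item><title>Hello</title></item>', true);

print_r($events);
```

Note that element names reach the handlers in upper case (ITEM, TITLE), because the parser's case-folding option is switched on by default.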

Here's the basic code for setting up event-handling functions and parsing (reading in) an XML document in PHP:

// Create an XML parser  
$xml_parser = xml_parser_create();  
 
// Set the functions to handle opening and closing tags  
xml_set_element_handler($xml_parser, "startElement", "endElement");  
 
// Set the function to handle blocks of character data  
xml_set_character_data_handler($xml_parser, "characterData");  
 
// Open the XML file for reading  
$fp = fopen("http://www.sitepoint.com/rss.php","r")  
       or die("Error reading RSS data.");  
 
// Read the XML file 4KB at a time
while ($data = fread($fp, 4096)) {
    // Parse each 4KB chunk with the XML parser created above
    xml_parse($xml_parser, $data, feof($fp))
        // Handle errors in parsing
        or die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
}
 
// Close the XML file  
fclose($fp);  
 
// Free up memory used by the XML parser  
xml_parser_free($xml_parser);

Each line of the code above is commented to explain what it does, but let's look at the XML-related PHP functions it uses in a little more detail:

  • xml_parser_create() Creates an XML parser. Just as you must create a database connection in PHP if you want to interact with a database, you must create an XML parser to use when you want to read in an XML file. In the above example, a reference to the parser is stored in $xml_parser.
  • xml_set_element_handler(parser, startElementFunction, endElementFunction) This function specifies the functions that an XML parser should use to process the events generated by opening and closing tags. In this case, the parser is the one stored in our $xml_parser variable, while the functions are called startElement and endElement. These functions will be defined elsewhere in the PHP script (I'll give an example below).
  • xml_set_character_data_handler(parser, characterDataFunction) This function specifies the function that the XML parser should use to process character data appearing between tags in an XML document. Once again we use our $xml_parser variable in the example above. The function we choose to process character data is called characterData.
  • xml_parse(parser, data, endOfDocument) This function sends all or part of an XML document to the parser for it to process. The endOfDocument parameter should be set to true if the data marks the end of the XML document, or false if more of the document will follow in a subsequent call to xml_parse. This allows the parser to correctly catch unclosed tags at the end of the document and so forth. In our example, the parser is once again $xml_parser. The $data variable (up to 4KB in size) retrieved from the file with fread is passed as the data to be processed, while the feof function is used to determine whether PHP has reached the end of the XML file, thus providing the required endOfDocument parameter. If an error occurs while parsing the document, we print out the error message and the line of the file at which it occurred using xml_error_string, xml_get_error_code and xml_get_current_line_number, all of which are described in detail in the PHP manual if you're curious.
  • xml_parser_free(parser) Although all memory resources are freed at the end of a PHP script, you may wish to free up the memory used by the XML parser if your script will perform other potentially memory-intensive tasks after it parses the XML data. This function destroys the specified XML parser, thus freeing up resources and memory it may have allocated for parsing.
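To see why the endOfDocument parameter matters, here is a small sketch that feeds a document to the parser in pieces from an array instead of a file. The parseChunks function and the sample XML are hypothetical, made up for this illustration; only the xml_* calls are the ones described above.

```php
<?php
// Hypothetical helper: sends each chunk to a fresh parser and returns
// an error message, or null if the whole document parsed cleanly.
function parseChunks(array $chunks): ?string
{
    $parser = xml_parser_create();
    $last = count($chunks) - 1;

    foreach ($chunks as $i => $chunk) {
        // endOfDocument is true only for the final chunk
        if (!xml_parse($parser, $chunk, $i === $last)) {
            return sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser));
        }
    }

    return null;
}

// A complete document, split in the middle of a tag name
$complete = parseChunks(['<rss><chan', 'nel/></rss>']);

// The same document with its closing </rss> tag missing
$truncated = parseChunks(['<rss><chan', 'nel/>']);

var_dump($complete);
var_dump($truncated);
```

The first call should succeed even though a tag is split across two chunks, because the parser knows more data is coming. In the second call the missing closing tag can only be reported on the final chunk, when the parser is told that no more data will follow.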

There are a few additional functions that let you handle some of the more esoteric events that occur during XML parsing, but these are well documented in the PHP manual, so I'll leave them for you to read up on if your particular application requires them. For our purposes (reading an RSS file), we now have everything we need. All that's left is to write the three event-handling functions: startElement, endElement, and characterData.
