PHP and XML: Parsing RSS 1.0
Event Handlers for RSS Parsing
We have three functions to write. Each of these functions must take certain parameters. These parameters are dictated by PHP, since it is PHP's XML parser that will call them. Here are the parameters that you must define for each of these functions:
startElement($parser, $tagName, $attrs)
$parser will be passed a reference to the XML parser that is being used to parse the document. $tagName is the ALL-UPPERCASE (the PHP manual calls this 'case-folded') version of the name of the opening tag that triggered the event. $attrs is an associative array of the attributes that are present in the tag that triggered the event. For example, if the tag <body bgcolor="#FFFFFF"> triggered the event, then the value of $attrs['BGCOLOR'] would be "#FFFFFF". Note that, like the tag name, attribute names are case-folded (all uppercase).
endElement($parser, $tagName)
$parser will be passed a reference to the XML parser that is being used to parse the document. $tagName is the case-folded name of the closing tag that triggered the event.
characterData($parser, $data)
$parser will be passed a reference to the XML parser that is being used to parse the document. $data is a string of text appearing between XML tags in the document. The text between two tags will not necessarily trigger a single event: the parser may deliver it in several pieces. Blocks of text spread over multiple lines, for example, will typically cause one event per line, with each event being passed the $data for that line.
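To make the three signatures concrete, here is a minimal sketch that registers handlers with PHP's XML parser and records what each one receives. The $events array and the one-line sample document are assumptions made purely for illustration; they are not part of the RSS script built below.

```php
<?php
// Illustrative log of every event the parser reports (not part of the RSS script).
$events = [];

function startElement($parser, $tagName, $attrs) {
    global $events;
    $events[] = ["start", $tagName, $attrs];
}

function endElement($parser, $tagName) {
    global $events;
    $events[] = ["end", $tagName];
}

function characterData($parser, $data) {
    global $events;
    $events[] = ["text", $data];
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "characterData");

// Hypothetical sample document; the final "true" marks it as the last chunk.
xml_parse($parser, '<body bgcolor="#FFFFFF">Hello</body>', true);

// The first event carries the case-folded tag name and attribute names:
// ["start", "BODY", ["BGCOLOR" => "#FFFFFF"]]
print_r($events[0]);
```

Printing the whole $events array is a quick way to see the order in which the parser calls the handlers for any small document.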
With this in mind, the process of converting the XML data for SitePoint's RSS file into a viewable HTML document may seem fairly straightforward at first glance. If you stop and try to work out what the three event handling functions should do, however, you'll quickly realise that it's not quite as simple as it seems. For those of you who may be feeling lost at this stage, don't worry. Looking at definitions for these functions that will process SitePoint's RSS file (or indeed any site's RSS file) should help it all make sense.
The first complexity that may strike you is that the characterData function must react to text appearing between tags, but nothing is passed to the function to tell it which tags contain the text being processed. For this reason, most XML parsing scripts will need to define a set of global variables to track information received by one of the event-handling functions for use by the others.
In the case of our RSS file, all the information we need about the articles on SitePoint's cover page is contained in the <item> tags in the document. So the first global variable we'll define will be $insideitem, which we'll set to true when entering an <item> tag and false when exiting one. We'll also define four other variables, the purposes of which will become clear as we move along:
$insideitem = false;
$tag = "";
$title = "";
$description = "";
$link = "";
Let's begin with startElement. This function will be called by the XML parser whenever an opening tag is encountered. Since we're only really interested in what goes on between <item> tags, we'll first check if we are indeed inside an <item> tag:
function startElement($parser, $tagName, $attrs) {
    global $insideitem, $tag;
    if ($insideitem) {
Note the global statement at the start of the function, which indicates that this function will need access to the $insideitem and $tag global variables. Now, if $insideitem is true, we want to take note of the tag that is starting, so that we know what to do with the character data it contains when the parser calls characterData next. So we record the name of the tag ($tagName) in our global $tag variable:
        $tag = $tagName;
If, on the other hand, we're not inside an <item> tag, then the only opening tag that we could possibly be interested in would be an actual <item> tag, in which case we would set $insideitem to true to indicate that we were entering one of these tags:
    } elseif ($tagName == "ITEM") {
        $insideitem = true;
    }
}
Note that we are checking if $tagName is "ITEM", since tag names are case-folded to all uppercase.
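Case folding is the parser's default behaviour, and it can be switched off with xml_parser_set_option if you would rather compare against tag names as they are written in the document. The following is only a sketch of that alternative; the $names array and the sample markup are assumptions for illustration, and the rest of this article keeps case folding on.

```php
<?php
// Collects tag names exactly as the parser reports them (illustrative only).
$names = [];

$parser = xml_parser_create();
// 0 disables case folding; the default value is 1 (names are uppercased).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $tagName, $attrs) use (&$names) {
        $names[] = $tagName;
    },
    function ($parser, $tagName) {
        // Closing tags are not needed for this demonstration.
    }
);

// Hypothetical fragment standing in for part of an RSS file.
xml_parse($parser, '<channel><item><title>Example</title></item></channel>', true);

// With folding off, the names arrive in lowercase, as written:
// channel, item, title
echo implode(", ", $names), "\n";
```

With folding disabled, the comparison in startElement would have to be against "item" rather than "ITEM".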
That does it for opening tags. The next step in parsing our RSS document is handling the character data that appears between tags, and that's the job of our characterData function:
function characterData($parser, $data) {
    global $insideitem, $tag, $title, $description, $link;
This function requires access to all five of our global variables, as we'll see shortly. Now, as before, the only time we are interested in the character data in the XML file is when we are inside an <item> tag, so the first step again is to check if that is the case:
    if ($insideitem) {
There are three tags inside <item> that we are interested in: <title>, <description> and <link>. Since we want to display the title of each article above its description, and as a link to the URL specified in the <link> tag, we can't simply output the character data as it is encountered in the XML file. Instead, we need to collect all the data for each <item> tag and then print it all out at once. Our global $title, $description and $link variables are used for exactly this purpose. We use a switch statement to determine which tag we are dealing with and store the $data in the corresponding variable. Recall that the name of the current tag is stored in the global $tag variable.
        switch ($tag) {
            case "TITLE":
                $title .= $data;
                break;
            case "DESCRIPTION":
                $description .= $data;
                break;
            case "LINK":
                $link .= $data;
                break;
        }
    }
}
Note that we append (.=) the $data to the variable in question, rather than simply assigning it (=) because the contents of a single tag can be received as several consecutive characterData events.
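A small experiment shows why appending matters. This sketch collects every piece of character data the parser reports for a single tag; the $chunks array and the sample title are assumptions for illustration. How many pieces the text is split into depends on the parser, so the code only relies on their concatenation.

```php
<?php
// Each characterData event adds one entry here (illustrative only).
$chunks = [];

$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($parser, $data) use (&$chunks) {
    $chunks[] = $data;
});

// Hypothetical title containing an entity and a line break,
// both of which can cause the text to arrive in separate events.
xml_parse($parser, "<title>PHP &amp; XML:\nParsing RSS</title>", true);

// Joining the pieces is equivalent to using .= inside the handler.
$title = implode("", $chunks);

echo count($chunks), " event(s) delivered the text: ", $title, "\n";
```

Had the handler used plain assignment (=), only the last piece would have survived.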
Once the character data for a tag has been processed, the next event to occur will call our endElement function to indicate the closing tag. In this application, the only tag that will require action on our part following its closing is the <item> tag. When the </item> tag is encountered, we will have retrieved all the $title, $description and $link data for the item, and so we can then output it as HTML:
function endElement($parser, $tagName) {
    global $insideitem, $tag, $title, $description, $link;
    if ($tagName == "ITEM") {
        // Escape the link as well; ENT_QUOTES covers the single-quoted href attribute.
        printf("<p><b><a href='%s'>%s</a></b></p>",
            htmlspecialchars(trim($link), ENT_QUOTES), htmlspecialchars(trim($title)));
        printf("<p>%s</p>", htmlspecialchars(trim($description)));
Feel free to use echo statements if you're not used to the more convenient printf function I used above. In either case, once you've output the URL, title and description of the <item>, you can clear the global variables so that they're ready to receive the character data for the next <item> in the document:
        $title = "";
        $description = "";
        $link = "";
And then finally set $insideitem to false to indicate to our other functions that we are no longer inside an <item> tag.
        $insideitem = false;
    }
}
That's it! To see this script in action, click here. You can also see the complete source code (use the view source command in your browser if the source code isn't displayed as a text file).
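For reference, the pieces described above can be assembled roughly as follows. This is a sketch rather than the article's published source: the inline $rss sample is a hypothetical stand-in for SitePoint's RSS file, the link is escaped with htmlspecialchars as a precaution, and the output is captured in $html only so that it can be inspected before being printed.

```php
<?php
// Shared state for the three handlers.
$insideitem = false;
$tag = "";
$title = "";
$description = "";
$link = "";

function startElement($parser, $tagName, $attrs) {
    global $insideitem, $tag;
    if ($insideitem) {
        $tag = $tagName;              // remember which child of <item> we are in
    } elseif ($tagName == "ITEM") {
        $insideitem = true;
    }
}

function characterData($parser, $data) {
    global $insideitem, $tag, $title, $description, $link;
    if ($insideitem) {
        switch ($tag) {
            case "TITLE":
                $title .= $data;
                break;
            case "DESCRIPTION":
                $description .= $data;
                break;
            case "LINK":
                $link .= $data;
                break;
        }
    }
}

function endElement($parser, $tagName) {
    global $insideitem, $tag, $title, $description, $link;
    if ($tagName == "ITEM") {
        printf("<p><b><a href='%s'>%s</a></b></p>",
            htmlspecialchars(trim($link), ENT_QUOTES), htmlspecialchars(trim($title)));
        printf("<p>%s</p>", htmlspecialchars(trim($description)));
        $title = "";
        $description = "";
        $link = "";
        $insideitem = false;
    }
}

// Hypothetical RSS 1.0 sample with one channel and one item.
$rss = <<<'XML'
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.com/">
    <title>Example Channel</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed</description>
  </channel>
  <item rdf:about="http://example.com/a">
    <title>First &amp; foremost</title>
    <link>http://example.com/a</link>
    <description>A short summary.</description>
  </item>
</rdf:RDF>
XML;

$parser = xml_parser_create();
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "characterData");

ob_start();                           // capture what the handlers print
if (!xml_parse($parser, $rss, true)) {
    ob_end_clean();
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)));
}
$html = ob_get_clean();
echo $html, "\n";
```

Note that the channel's own <title>, <link> and <description> never reach the output, because $insideitem is still false when the parser reports them.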