Introduction to XML

Your First XML Document

Now that you know the boring stuff, you're equipped to write your own XML documents. And the good news is that, as with HTML, all you need to create them is a text editor! For viewing the XML in a nice format (including checking that it's well formed), Internet Explorer is a good choice.

Save your file with the extension .xml and you should be able to open it with Internet Explorer to view the document as a collapsible tree. Here's an XML document that demonstrates most of what we've seen:

<?xml version="1.0" ?>    
<!-- My first XML document -->    
<articles>    
 <article author="harryf" date="13 Oct 2002">    
   <title>XML is so easy!</title>    
   <body>XML really is nothing complicated</body>    
 </article>    
 <article author="harryf" date="13 Oct 2002">    
   <title>A Program Instruction</title>    
   <body>Here's a PI for PHP: <?php phpinfo(); ?></body>    
 </article>    
 <article author="harryf" date="13 Oct 2002">    
   <title>An Entity</title>    
   <body>Mathematics: x &lt; y &gt; z</body>    
 </article>    
</articles>

Told you it was easy!
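
If you'd rather not rely on a browser to check that a document is well formed, most languages can do the check in a few lines. The following is a minimal sketch in Python (picked here purely for illustration) using the Expat bindings from its standard library. The function name is_well_formed and the two sample strings are hypothetical, not part of any library.

```python
import xml.parsers.expat


def is_well_formed(xml_text):
    """Return True if xml_text parses without a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical samples: the second one never closes its <title> element.
GOOD = "<articles><article><title>XML is so easy!</title></article></articles>"
BAD = "<articles><article><title>Oops</article></articles>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Note that this only checks well-formedness (properly nested and closed tags and so on); it says nothing about whether the document follows a particular DTD or schema.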

If you really get into the job of editing XML, there are a few "professional tools" you may want to consider (most of which are either Open Source or have evaluation versions). These become particularly valuable when you start to work with complicated XML documents or some of the "advanced" XML technologies like XSLT and XML Schema.

Two well worth a look are:

  • Cooktop (http://www.xmlcooktop.com/) - An excellent Open Source editor providing support for plenty of additional XML technologies like XSLT and Web services.
  • XML Spy (http://www.xmlspy.com) - A commercial editor with support for most of the important XML technologies (XML Schema, XSLT, Web services etc.), with a nice display to help visualise XML documents.

More are discussed in the SitePoint Forums.

When generating XML from your applications (be they PHP, Java, C++, Python, C# etc.), be aware that sometimes (especially for simple documents) it's best simply to "hard code" the XML into your code, which you can echo() (or print(), System.out.println() etc.) directly to output. For more complicated tasks, you may want to consider a DOM parser, which we'll look at next...
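
As a rough illustration of the "hard code" approach, here is a minimal sketch in Python. The function name article_xml is hypothetical; escape and quoteattr come from Python's standard xml.sax.saxutils module and are used so that characters like < in the data can't break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def article_xml(author, title, body):
    """Build one <article> element as a string, escaping special characters."""
    return (
        f"<article author={quoteattr(author)}>\n"
        f"  <title>{escape(title)}</title>\n"
        f"  <body>{escape(body)}</body>\n"
        "</article>"
    )


# Print the XML declaration, then the hard-coded element, straight to output.
print('<?xml version="1.0" ?>')
print(article_xml("harryf", "An Entity", "Mathematics: x < y > z"))
```

The template itself is fixed in the code; only the data dropped into it varies, which is why this works well for simple documents and gets unwieldy for complicated ones.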

Parsing in the Night

HTML is only pleasant to look at when it comes into contact with a Web browser - otherwise it's just a boring ASCII text file. The same principle applies to XML, but the "target" application for XML doesn't have to be a Web browser. XML only "comes to life" when some application "reads" it.

When an application reads an XML document, it's described as having parsed the document. That means it searched through the document, found all the character data placed within the XML tags, and made it available in a form that's ready for us to use.

The subject of parsing is one that causes a lot of confusion for those getting started with XML. You may come across people who talk about things like SAX and DOM and wonder how musical instruments and cleaning fluids relate to XML. Again, the thing to remember is that XML is at heart very simple.

If you've had any experience with programming, ask yourself "How do I extract the data from this piece of XML?":

<tag>My element</tag>

In PHP you might use a regular expression like:

<?php
// Capture whatever sits between <tag> and </tag>.
$xml = "<tag>My element</tag>";
if (preg_match('/<tag>(.*?)<\/tag>/', $xml, $output)) {
    echo $output[1]; // the text inside the element
}
?>

This is fine for a single tag. But what if we throw in some more elements, plus some attributes, PIs, comments and character data? Do you really want to write a program to parse the XML document we wrote earlier?

<?xml version="1.0" ?>    
<!-- My first XML document -->    
<articles>    
 <article author="harryf" date="13 Oct 2002">    
   <title>XML is so easy!</title>    
   <body>XML really is nothing complicated</body>    
 </article>    
 <article author="harryf" date="13 Oct 2002">    
   <title>A Program Instruction</title>    
   <body>Here's a PI for PHP: <?php phpinfo(); ?></body>    
 </article>    
 <article author="harryf" date="13 Oct 2002">    
   <title>An Entity</title>    
   <body>Mathematics: x &lt; y &gt; z</body>    
 </article>    
</articles>

It's perfectly possible to do so if you're blessed with infinite time, but thankfully most programming languages and XML tools come with their own parsers to do this for you.

SAX and DOM

SAX and DOM are effectively two strategies for parsing XML (usually referred to as APIs - Application Programming Interfaces). So you know, SAX stands for "Simple API for XML", while DOM stands for "Document Object Model".

The SAX approach says "Give me a list of XML tags with information on what I should do with them. I'll read through the XML document from start to finish and every time I find a tag that was in your list, I'll do what you told me to." In other words, the SAX approach is event driven. A SAX parser will read an XML document sequentially from the beginning. Each XML tag it encounters is regarded as an event, and with every event it encounters, it will consult a "list" it has been provided (by you, the programmer) for this particular job, and take whatever action is necessary. A common example of SAX in action is in parsing an RSS news feed (see Kevin Yank's PHP and XML: Parsing RSS 1.0 article - there will be plenty of other examples in most programming languages used for work online).
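
To make the event-driven idea concrete, here is a minimal sketch using Python's standard xml.sax module (Python is used only for illustration). The class name TitleHandler and the shortened sample document are assumptions made for this example. The handler's methods play the part of the "list" described above: the parser calls them as it meets each start tag, run of character data and end tag.

```python
import xml.sax

# A shortened version of the articles document, as bytes.
XML = b"""<?xml version="1.0" ?>
<articles>
 <article author="harryf" date="13 Oct 2002">
   <title>XML is so easy!</title>
   <body>XML really is nothing complicated</body>
 </article>
 <article author="harryf" date="13 Oct 2002">
   <title>An Entity</title>
   <body>Mathematics: x &lt; y &gt; z</body>
 </article>
</articles>"""


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Event: an opening tag was read.
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Event: character data was read (possibly in several chunks).
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Event: a closing tag was read.
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleHandler()
xml.sax.parseString(XML, handler)
print(handler.titles)  # ['XML is so easy!', 'An Entity']
```

Notice that the handler never sees the whole document at once. It only remembers what it chooses to store (here, the titles), which is why this style copes well with large files.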

The DOM approach says "I'm going to load the entire XML document in one go, then make it available to you in a hierarchical form, building a tree from the document in which individual elements can be accessed directly, without having to re-read from the beginning." The DOM approach is object oriented, hence the name. If you have experience with object oriented programming, the DOM API will be instantly appealing. For those unacquainted with object orientation, it may take some getting used to. A loose analogy might be an online directory like dmoz. Dmoz is organised around a tree structure. It's possible to access parts of the tree directly using URLs: for example, http://dmoz.org/News/ gives you the News branch and http://dmoz.org/Science/ gives you Science. If dmoz were only a single page (a giant list organised under headings), you'd need to read through it from the top to find the section you're looking for.
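
For comparison, here is a minimal DOM sketch using Python's standard xml.dom.minidom module (again, Python is only an illustrative choice, and the shortened sample document is an assumption). The whole document is loaded into a tree first, and then the second article is picked out directly.

```python
from xml.dom import minidom

# A shortened version of the articles document.
XML = """<?xml version="1.0" ?>
<articles>
 <article author="harryf" date="13 Oct 2002">
   <title>XML is so easy!</title>
   <body>XML really is nothing complicated</body>
 </article>
 <article author="harryf" date="13 Oct 2002">
   <title>An Entity</title>
   <body>Mathematics: x &lt; y &gt; z</body>
 </article>
</articles>"""

# Load the entire document into an in-memory tree.
doc = minidom.parseString(XML)

# Jump straight to the second <article> without re-reading from the start.
articles = doc.getElementsByTagName("article")
second = articles[1]
title = second.getElementsByTagName("title")[0]

print(second.getAttribute("author"))  # harryf
print(title.firstChild.data)          # An Entity
```

Every element, attribute and piece of text is an object hanging off the tree, so you can visit them in any order you like once the document has been loaded.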

In general, the SAX approach is usually faster, arguably easier to use, and better suited to large documents, while the DOM approach provides you with a more powerful way to manipulate XML, and can be very useful in creating XML from within an application. But DOM loads the entire document into memory, and is therefore slower and suited only to small to medium-sized documents.
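
The point about creating XML with DOM can be sketched as follows, once more with Python's standard xml.dom.minidom module as an illustrative stand-in. The element names follow the articles document used earlier; the text "x < y" is a made-up value chosen to show that the DOM takes care of escaping when the tree is written out.

```python
from xml.dom import minidom

# Start with an empty document and add a root element.
doc = minidom.Document()
articles = doc.createElement("articles")
doc.appendChild(articles)

# Build one <article> with an attribute and a <title> child.
article = doc.createElement("article")
article.setAttribute("author", "harryf")
articles.appendChild(article)

title = doc.createElement("title")
title.appendChild(doc.createTextNode("x < y"))  # raw text, not yet escaped
article.appendChild(title)

# Serialise the tree; the < in the text is written as &lt;.
print(doc.documentElement.toxml())
```

Compared with hard-coding strings, this takes more lines, but the tree can be rearranged, extended or queried before anything is written to output.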

Neither SAX nor DOM is perfect, so it's worth mentioning that a third approach to XML parsing, called XOM, is in progress in an attempt to make the perfect API. It's only a few months old, so there's obviously a way to go yet.

To add to the confusion, a fair few parsers that implement the SAX or DOM APIs are available, such as James Clark's now famous Expat parser (a SAX parser) and the Gnome libxml parser (a DOM parser). Microsoft implement SAX and DOM, along with a number of other important XML technologies, within a toolkit known simply as MSXML, while Sun have XML encoding and decoding classes in the Java library.

Whichever XML parsing API you use, your programming language of choice will need to implement the API in some form (otherwise, prepare to write a lot of code, or pray someone has done it for you, as Luis Argerich has with PHP XML Classes). PHP provides you with SAX functions and DOM functions as extensions, which need to be added to your base PHP install.