About the Author

Tom Myer

Tom is the founder of Triple Dog Dare Media, an Austin, TX-based professional services consultancy that specializes in designing, building, and deploying ecommerce, database, and XML systems. He's spent the last 7 years working in various areas of XML development, including XML document analysis, DTD creation and validation, XML-based taxonomies, and XML-powered content and knowledge management systems.

A Really, Really, Really Good Introduction to XML

By Tom Myer

August 24th, 2005

In this chapter, we'll cover the basics of XML — essentially, most of the information you'll need to know to get a handle on this exciting technology. After we're done exploring some terminology and examples, we'll jump right in and start working with XML documents. Then, we'll spend some time starting the project we'll develop through the course of this book: building an XML-powered content management system.

This excerpt is taken from No Nonsense XML Web Development with PHP, SitePoint's new release, by Thomas Myer, which was designed to help you start using XML to build intelligent 'Future-Proof' PHP applications today.

The title contains over 350 pages of XML and PHP goodies. It walks you through the process of building a fully functional XML-based content management system with PHP. And all the code used in the book is available to customers in a downloadable archive.

To find out more about "No Nonsense XML Web Development with PHP", visit the book's information page, or review the contents of the entire publication. As always, you can download this excerpt as a PDF if you prefer.

Chapter 1. Introduction to XML

Who here has heard of XML? Okay, just about everybody. If ever there were a candidate for "Most Hyped Technology" during the late 90s and the current decade, it's XML (though Java would be a close contender for the title).

Whenever I talk about XML with developers, designers, technical writers, or other Web professionals, the most common question I'm asked is, "What's the big deal?" In this book, I'll explain exactly what the big deal is—how XML can be used to make your Web applications smarter, more versatile, and more powerful. I'll try to stay away from the grandstanding hoopla that has characterized much of the discussion of XML; instead, I'll give you the background and know-how you'll need to make XML a part of your professional skillset.

What is XML?

So, what is XML? Whenever a group of people asks this question, I always look at the individuals' body language. A significant portion of the group leans forward eagerly, wanting to learn more. The others either roll their eyes in anticipation of hype and half-formed theories, or cringe in fear of a long, dry history of markup languages. As a result, I've learned to keep my explanation brief.

The essence of XML is in its name: Extensible Markup Language.

Extensible

XML is extensible. It lets you define your own tags, the order in which they occur, and how they should be processed or displayed. Another way to think about extensibility is to consider that XML allows all of us to extend our notion of what a document is: it can be a file that lives on a file server, or it can be a transient piece of data that flows between two computer systems (as in the case of Web Services).

Markup

The most recognizable feature of XML is its tags, or elements (to be more accurate). In fact, the elements you'll create in XML will be very similar to the elements you've already been creating in your HTML documents. However, XML allows you to define your own set of tags.

Language

XML is a language that's very similar to HTML. It's much more flexible than HTML because it allows you to create your own custom tags. However, it's important to realize that XML is not just a language. XML is a meta-language: a language that allows us to create or define other languages. For example, with XML we can create other languages, such as RSS, MathML (a mathematical markup language), and even tools like XSLT. More on this later.

Why Do We Need XML?

Okay, we know what it is, but why do we need XML? We need it because HTML is specifically designed to describe documents for display in a Web browser, and not much else. It becomes cumbersome if you want to display documents in a mobile device or do anything that's even slightly complicated, such as translating the content from German to English. HTML's sole purpose is to allow anyone to quickly create Web documents that can be shared with other people. XML, on the other hand, isn't just suited to the Web—it can be used in a variety of different contexts, some of which may not have anything to do with humans interacting with content (for example, Web Services use XML to send requests and responses back and forth).

HTML rarely (if ever) provides information about how the document is structured or what it means. In layman's terms, HTML is a presentation language, whereas XML is a data-description language.

For example, if you were to go to any ecommerce Website and download a product listing, you'd probably get something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>ABC Products</title>
<meta http-equiv="Content-Type"
   content="text/html; charset=iso-8859-1" />
</head>
<body>
<h1>ABC Products</h1>
<h2>Product One</h2>
<p>Product One is an exciting new widget that will simplify your
 life.</p>
<p><b>Cost: $19.95</b></p>
<p><b>Shipping: $2.95</b></p>
<h2>Product Two</h2>

<h3>Product Three</h3>
<p><i>Cost: $24.95</i></p>
<p>This is such a terrific widget that you will most certainly
 want to buy one for your home and another one for your
 office!</p>

</body>
</html>

Take a good look at this—admittedly simple—code sample from a computer's perspective. A human can certainly read this document and make the necessary semantic leaps to understand it, but a computer couldn't.

Semantics and Other Jargon

You're going to be hearing a lot of talk about "semantics" and other linguistics terms in this chapter. It's unavoidable, so bear with me. Semantics is the study of meaning in language.

Humans are much better at semantics than computers, because humans are really good at deriving meaning. For example, if I asked you to list as many names for "female animals" as you could, you'd probably start with "lioness", "tigress", "ewe", "doe" and so on. If you were presented with a list of these names and asked to provide a category that contained them all, it's likely you'd say something like "female animals." Furthermore, if I asked you what a lioness was, you'd say, "female lion."

If I further asked you to list associated words, you might say "pride," "hunt," "savannah," "Africa," and the like. From there, you could make the leap to other wild cats, then to house cats and maybe even dogs (cats and dogs are both pets, after all). With very little effort, you'd be able to build a stunning semantic landscape, as it were.

Needless to say, computers are really bad at this game, which is a shame, as many computing tasks require semantic skill. That's why we need to give computers as much help as we can.

For example, a human can probably deduce that the <h2> tag in the above document has been used to tag a product name within a product listing. Furthermore, a human might be able to guess that the first paragraph after an <h2> holds the description, and that the next two paragraphs contain price and shipping information, in bold.

However, even a cursory glance at the rest of the document reveals some very human errors. For example, the last product name is encapsulated in <h3> tags, not <h2> tags. This last product listing also displays a price before the description, and the price is italicized instead of appearing in bold.

A computer program (and even some humans) that tried to decipher this document wouldn't be able to make the kinds of semantic leaps required to make sense of it. The computer would be able only to render the document to a browser with the styles associated with each tag. HTML is chiefly a set of instructions for rendering documents inside a Web browser; it's not a method of structuring documents to bring out their meaning.
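
To see how fragile that guesswork is, here's a minimal sketch of a program that assumes, as a human might, that every <h2> holds a product name. It uses Python's standard html.parser module rather than anything from this book, and the HeadingCollector class and the trimmed-down listing are made up for illustration.

```python
from html.parser import HTMLParser

# A trimmed copy of the HTML product listing shown earlier.
HTML_LISTING = """
<h1>ABC Products</h1>
<h2>Product One</h2>
<p>Product One is an exciting new widget.</p>
<h2>Product Two</h2>
<h3>Product Three</h3>
<p><i>Cost: $24.95</i></p>
"""


class HeadingCollector(HTMLParser):
    """Collects the text of every <h2>, guessing that <h2> means 'product name'."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self.in_h2:
            self.names.append(data.strip())


collector = HeadingCollector()
collector.feed(HTML_LISTING)
product_names = collector.names
print(product_names)  # Product Three never shows up: it was marked up as <h3>
```

The program isn't wrong, exactly. The markup simply never said which headings were product names, so the best any program can do is guess.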

If the above document were created in XML, it might look a little like this:

<?xml version="1.0"?>
<productListing title="ABC Products">
 <product>
   <name>Product One</name>
   <description>Product One is an exciting new widget that will
     simplify your life.</description>
   <cost>$19.95</cost>
   <shipping>$2.95</shipping>
 </product>
 <product>
   <name>Product Two</name>
   …
 </product>
 <product>
   <name>Product Three</name>
   <description>This is such a terrific widget that you will
     most certainly want to buy one for your home and another one
     for your office!</description>
   <cost>$24.95</cost>
   <shipping>$0.00</shipping>
 </product>
 …
</productListing>

Notice that this new document contains absolutely no information about display. What does a <product> tag look like in a browser? Beats me—we haven't defined that yet. Later on, we'll see how you can use technologies like CSS and XSLT to transform your XML into any format you like. Essentially, XML allows you to separate information from presentation—just one of its many powerful abilities.

When we concentrate on a document's structure, as we've done here, we are better able to ensure that our information is correct. In theory, we should be able to look at any XML document and understand instantly what's going on. In the example above, we know that a product listing contains products, and that each product has a name, a description, a price, and a shipping cost. You could say, rightly, that each XML document is self-describing, and is readable by both humans and software.
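
Here's a rough sketch of what "readable by software" can mean in practice. It uses Python's standard xml.etree.ElementTree module (my choice for illustration, not a tool this book relies on) to read a shortened copy of the listing; the descriptions are abbreviated and the variable names are invented.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the product listing from above.
PRODUCT_XML = """<?xml version="1.0"?>
<productListing title="ABC Products">
  <product>
    <name>Product One</name>
    <description>Product One is an exciting new widget that will
      simplify your life.</description>
    <cost>$19.95</cost>
    <shipping>$2.95</shipping>
  </product>
  <product>
    <name>Product Three</name>
    <description>This is such a terrific widget!</description>
    <cost>$24.95</cost>
    <shipping>$0.00</shipping>
  </product>
</productListing>"""

listing = ET.fromstring(PRODUCT_XML)

# Each <product> becomes a dictionary keyed by its child element names.
# " ".join(text.split()) collapses the line breaks inside long descriptions.
products = [
    {child.tag: " ".join(child.text.split()) for child in product}
    for product in listing.findall("product")
]

for product in products:
    print(product["name"], "costs", product["cost"])
```

Notice that the code never has to guess which piece of text is the name and which is the cost: the element names say so.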

Now, everyone makes mistakes, and XML programmers are no exception. Imagine that you start to share your XML documents with another developer or company, and, somewhere along the line, someone places a product's description after its price. Normally, this wouldn't be a big deal, but perhaps your Web application requires that the description appears after the product name every time.
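
Without a shared rule book, every application ends up hand-coding its own checks. The sketch below shows what such a check might look like in Python; the EXPECTED_ORDER list and the follows_product_rules function are hypothetical names, and the rule they encode is just the one assumed in the paragraph above.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: the children every <product> must contain, in this order.
EXPECTED_ORDER = ["name", "description", "cost", "shipping"]


def follows_product_rules(product):
    """Return True if the product's child elements appear in the expected order."""
    return [child.tag for child in product] == EXPECTED_ORDER


good_product = ET.fromstring(
    "<product><name>Product One</name><description>A widget.</description>"
    "<cost>$19.95</cost><shipping>$2.95</shipping></product>"
)

# Same data, but someone placed the description after the cost.
bad_product = ET.fromstring(
    "<product><name>Product One</name><cost>$19.95</cost>"
    "<description>A widget.</description><shipping>$2.95</shipping></product>"
)

print(follows_product_rules(good_product))
print(follows_product_rules(bad_product))
```

A check like this works, but it lives inside one program. Anyone else who produces or consumes the same documents would have to rewrite it and keep it in sync.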

To ensure that everyone plays by the rules, you need a DTD (a document type definition), or schema. Basically, a DTD provides instructions about the structure of your particular XML document. It's a lot like a rule book that states which tags are legal, and where. Once you have a DTD in place, anyone who creates product listings for your application will have to follow the rules. We'll get into DTDs a little later. For now, though, let's continue with the basics.

A Closer Look at the XML Example

From the casual observer's viewpoint, a given XML document, such as the one we saw in the previous section, appears to be no more than a bunch of tags and letters. But there's more to it than that!

A Structural Viewpoint

Let's consider our XML example from a structural standpoint. No, not the kind of structure we bring to a document by marking it up with XML tags; let's look at this example on a more granular level. I want to examine the contents of a typical XML file, character by character.

The simplest XML elements contain an opening tag, a closing tag, and some content. The opening tag begins with a left angle bracket (<), followed by an element name that contains letters and numbers (but no spaces), and finishes with a right angle bracket (>). In XML, content is usually parsed character data. It could consist of plain text, other XML elements, and more exotic things like XML entities, comments, and processing instructions (all of which we'll see later). Following the content is the closing tag, which exhibits the same spelling and capitalization as your opening tag, but with one tiny change: a / appears right before the element name.

Here are a few examples of valid XML elements:

<myElement>some content here</myElement>
<elements>
 <myelement>one</myelement>
 <myelement>two</myelement>
</elements>

Elements, Tags, or Nodes?

I'll refer to XML elements, XML tags, and XML nodes at different points in this book. What's the deal? Well, for the layman, these terms are interchangeable, but if you want to get technical (and who'd want to do that in a technical book?) each has a very precise meaning:

  • An element consists of an opening tag, its attributes, any content, and a closing tag.
  • A tag—either opening or closing—is used to mark the start or end of an element.
  • A node is a part of the hierarchical structure that makes up an XML document. "Node" is a generic term that applies to any type of XML document object, including elements, attributes, comments, processing instructions, and plain text.

If you're used to working with HTML, you've probably created many documents that are missing end tags, use different capitalization in opening and closing tags, and contain improperly nested tags.

You won't be able to get away with any of that in XML! In this language, the <myElement> tag is different from the <MYELEMENT> tag, and both are different from the <myELEMENT> tag. If your opening tag is <myELEMENT> and your closing tag is </Myelement>, your document won't be well-formed, let alone valid.
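
If you'd like to see a parser enforce this, here's a small sketch using Python's standard xml.etree.ElementTree module (an assumption on my part; any XML parser should behave the same way). The variable names are made up.

```python
import xml.etree.ElementTree as ET

# Three elements whose names differ only in capitalization.
doc = ET.fromstring("<root><myElement/><MYELEMENT/><myELEMENT/></root>")
distinct_tags = {child.tag for child in doc}
print(len(distinct_tags))  # the parser sees three different element names

# An opening and closing tag that differ only in capitalization.
try:
    ET.fromstring("<myELEMENT>some content</Myelement>")
    mismatch_rejected = False
except ET.ParseError as error:
    mismatch_rejected = True
    print("Parser refused the document:", error)
```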

If you use attributes on any elements, then attribute values must be single- or double-quoted. No longer can you get by with bare attribute values like you did in HTML! Let's see an example. The following is okay in HTML:

<h1 class=topHeader>

In XML, you'd have to put quotes (either single or double) around the attribute value, like this:

<h1 class="topHeader">

Also, if you nest your elements improperly (i.e. close an element before closing another element that is inside it), your document won't be well-formed, let alone valid. (I know I keep mentioning validity; we'll talk about it in detail soon!) For example, Web browsers don't generally complain about the following:

<b>Some text that is bolded, some that is <i>italicized</b></i>.

In XML, this improper nesting of elements would cause the program reading the document to raise an error.

As XML allows you to create any language you want, the inventors of XML had to institute a special rule, which happens to be closely related to the proper nesting rule. The rule states that each XML document must contain a single root element in which all the document's other elements are contained. As we'll see later, almost every single piece of XML development you'll do is facilitated by this one simple rule.

Attributes

Did you notice the <productListing> opening tag in our example? Inside the tag, following the element name, was the data title="ABC Products". This is called an attribute.

You can think of attributes as adjectives—they provide additional information about the element that may not make any sense as content. If you've worked with HTML, you're familiar with such attributes as the src (file source) on the <img> tag.

What information should be contained in an attribute? What should appear between the tags of an element? This is a subject of much debate, but don't worry, there really are no wrong answers here. Remember: you're the one defining your own language. Some developers (including me!) apply this rule of thumb: use attributes to store data that doesn't necessarily need to be displayed to a user of the information. Another common rule of thumb is to consider the length of the data. Potentially large data should be placed inside a tag; shorter data can be placed in an attribute. Typically, attributes are used to "embellish" the data contained within the tag.

Let's examine this issue a little more closely. Let's say that you wanted to create an XML document to keep track of your DVD collection. Here's a short snippet of the code you might use:

<dvdCollection>
 <dvd>
   <id>1</id>
   <title>Raiders of the Lost Ark</title>
   <release-year>1981</release-year>
   <director>Steven Spielberg</director>
   <actors>
     <actor>Harrison Ford</actor>
     <actor>Karen Allen</actor>
     <actor>John Rhys-Davies</actor>
   </actors>
 </dvd>
 ….
</dvdCollection>

It's unlikely that anyone who reads this document would need to know the ID of any of the DVDs in your collection. So, we could safely store the ID as an attribute of the <dvd> element instead, like this:

<dvd id="1">
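
From a program's point of view, the two kinds of data are simply read in different ways. Here's a minimal sketch in Python (standard xml.etree.ElementTree; the variable names are invented) that pulls the ID from the attribute and everything else from child elements.

```python
import xml.etree.ElementTree as ET

DVD_XML = """<dvdCollection>
  <dvd id="1">
    <title>Raiders of the Lost Ark</title>
    <release-year>1981</release-year>
    <director>Steven Spielberg</director>
  </dvd>
</dvdCollection>"""

collection = ET.fromstring(DVD_XML)
dvd = collection.find("dvd")

dvd_id = dvd.get("id")                      # attribute value: always a string
title = dvd.find("title").text              # element content
year = int(dvd.find("release-year").text)   # convert text to a number ourselves

print(f"DVD #{dvd_id}: {title} ({year})")
```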

In other parts of our DVD listing, the information seems a little bare. For instance, we're only displaying an actor's name between the <actor> tags—we could include much more information here. One way to do so is with the addition of attributes:

<actor type="superstar" gender="male" age="50">Harrison Ford</actor>

In this case, though, I'd probably revert to our rule of thumb—most users would probably want to know at least some of this information. So, let's convert some of these attributes to elements:

<actor type="superstar">
  <name>Harrison Ford</name>
  <gender>male</gender>
  <age>50</age>
</actor>

Beware of Redundant Data

From a completely different perspective, one could argue that you shouldn't have all this repetitive information in your XML file. For example, your collection's bound to include at least one other movie that stars Harrison Ford. It would be smarter, from an architectural point of view, to have a separate listing of actors with unique IDs to which you could link. We'll discuss these questions at length throughout this book.

Empty-Element Tags

Some XML elements are said to be empty—they contain no content whatsoever. Familiar examples are the img and br elements in HTML. In the case of img, for example, all the element's information is contained in its tag's attributes. The <br> tag, on the other hand, does not normally contain any attributes—it just signifies a line break.

Remember that in XML all opening tags must be matched by a closing tag. For empty elements, you can use a single empty-element tag to replace this:

<myEmptyElement></myEmptyElement>

with this:

<myEmptyElement/>

The / at the end of this tag basically tells the parser that the element starts and ends right here. It's an efficient shorthand method that you can use to mark up empty elements quickly.
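
As a quick sanity check, this sketch parses both forms with Python's standard xml.etree.ElementTree module and confirms that they produce the same thing: an element with no text and no children. The is_empty helper is a hypothetical name of my own.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<myEmptyElement></myEmptyElement>")
short_form = ET.fromstring("<myEmptyElement/>")


def is_empty(element):
    """An element is empty if it has no child elements and no text content."""
    return len(element) == 0 and not element.text


print(long_form.tag == short_form.tag)
print(is_empty(long_form), is_empty(short_form))
```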

The XML Declaration

The line right at the top of our example is called the XML declaration:

<?xml version="1.0"?>

It's not strictly necessary to include this line, but it's the best way to make sure that any device that reads the document will know that it's an XML document, and to which version of XML it conforms.

Entities

I mentioned entities earlier. An entity is a handy construct that, at its simplest, allows you to define special characters for insertion into your documents. If you've worked with HTML, you know that the &lt; entity inserts a literal < character into a document. You can't use the actual character because it would be treated as the start of a tag, so you replace it with the appropriate entity instead.

XML, true to its extensible nature, allows you to create your own entities. Let's say that your company's copyright notice has to go on every single document. Instead of typing this notice over and over again, you could declare an entity called copyright_notice that holds the proper text, then insert it into your XML documents with the entity reference &copyright_notice;. What a time-saver!
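
Here's a small sketch of both kinds of entity at work: the built-in &lt; and a custom copyright_notice entity declared at the top of the document. The wording of the notice and the <document>, <formula>, and <footer> element names are placeholders I've invented, and the parsing is done with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# The <!ENTITY ...> line declares the entity; &copyright_notice; refers to it.
ENTITY_XML = """<?xml version="1.0"?>
<!DOCTYPE document [
  <!ENTITY copyright_notice "Copyright 2005 Example Company. All rights reserved.">
]>
<document>
  <formula>1 &lt; 2</formula>
  <footer>&copyright_notice;</footer>
</document>"""

document = ET.fromstring(ENTITY_XML)

formula = document.find("formula").text   # &lt; becomes a literal <
footer = document.find("footer").text     # the reference is replaced by the notice

print(formula)
print(footer)
```

By the time your program sees the content, the parser has already swapped each entity reference for the text it stands for.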

We'll cover entities in more detail later on.

More than Structure…

XML documents are more than just a sequence of elements. If you take another, closer look at our product or DVD listing examples, you'll notice two things:

  • The documents are self-describing, as we've already discussed.
  • The documents are really a hierarchy of nested objects.

Let's elaborate on the first point very quickly. We've already said that most (if not all) XML documents are self-describing. This feature, combined with all that content encapsulated in opening and closing tags, takes all XML documents far past the realm of mere data and into the revered halls of information.

Data can comprise a string of characters or numbers, such as 5551238888. This string can represent anything from a laptop's serial number, to a pharmacy's prescription ID, to a phone number in the United States. But the only way to turn this data into information (and therefore make it useful) is to add context to it—once you have context, you can be sure about what the data represents. In short, <phone country="us">5551238888</phone> leaves no doubt that this seemingly arbitrary string of numbers is in fact a U.S. phone number.

When you take into account the second point—that an XML document is really a hierarchy of objects—all sorts of possibilities open up. Remember what we discussed before—that, in an XML document, one element contains all the others? Well, that root element becomes the root of our hierarchical tree. You can think of that tree as a family tree, with the root element having various children (in this case, product elements), and each of those having various children (name, description, and so on). In turn, each product element has various siblings (other product elements) and a parent (the root), as shown in Figure 1.1, "The logical structure of an XML document.".

Figure 1.1. The logical structure of an XML document.

Because what we have is a tree, we should be able to travel up and down it, and from side to side, with relative ease. From a programmatic stance, most of your work with XML will focus on properly creating and navigating XML structures.
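
Here's a rough idea of what that travel looks like in code. The sketch uses Python's standard xml.etree.ElementTree module and a cut-down product listing; because that module's elements don't keep a link to their parent, the parent_of dictionary (a name I've made up) is built by hand.

```python
import xml.etree.ElementTree as ET

LISTING_XML = """<productListing title="ABC Products">
  <product><name>Product One</name><cost>$19.95</cost></product>
  <product><name>Product Two</name><cost>$29.95</cost></product>
</productListing>"""

root = ET.fromstring(LISTING_XML)

# Map every element to its parent so that we can travel up the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

first_product = root[0]

# Down: the children of the first product.
children = [child.tag for child in first_product]

# Up: the parent of the first product.
parent = parent_of[first_product].tag

# Side to side: the other products that share the same parent.
siblings = [p for p in parent_of[first_product] if p is not first_product]
sibling_names = [sibling.find("name").text for sibling in siblings]

print(children, parent, sibling_names)
```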

There's one final point about hierarchical trees that you should note. Before, we talked about transforming data into information by adding context. Well, when we start building hierarchies of information that indicate natural relationships (known as taxonomies), we've just taken the first giant leap toward turning information into knowledge. That statement itself could spawn a whole other book, so we'll just have to leave it at that and move on!

Formatting Issues

Earlier in this chapter, I made a point about XML allowing you to separate information from presentation. I also mentioned that you could use other technologies, like CSS (Cascading Style Sheets) and XSLT (Extensible Stylesheet Language Transformations), to make the information display in different contexts.

Note

Notice that in XSLT, it's "stylesheet," but in CSS it's "style sheet"! For the sake of consistency, we'll call them all "style sheets" in this book.

In later chapters, I'll go into plenty of detail on both CSS and XSLT, but I wanted to make a brief point here. Because we've taken the time to create XML documents, our information is no longer locked up inside proprietary formats such as word processors or spreadsheets. Furthermore, it no longer has to be "re-created" every time you want to create alternate displays of that information: all you have to do is create a style sheet or transformation to make your XML presentable in a given medium.

For example, if you stored your information in a word processing program, it would contain all kinds of information about the way it should appear on the printed page—lots of bolding, font sizes, and tables. Unfortunately, if that document also had to be posted to the Web as an HTML document, someone would have to convert it (either manually or via software), clean it up, and test it. Then, if someone else made changes to the original document, those changes wouldn't cascade to the HTML version. If yet another person wanted to take the same information and use it in a slide presentation, they might run the risk of using outdated information from the HTML version. Even if they did get the right information into their presentation, you'd still need to track three locations in which your information lived. As you can see, it can get pretty messy!

Now, if the same information were stored in XML, you could create three different XSLT files to transform the XML into HTML, a slide presentation, and a printer-friendly file format such as PostScript. If you made changes to the XML file, the other files would also change automatically once you passed the XML file through the process. (This notion, by the way, is an essential component of single-sourcing—i.e. having a "single source" for any given information that's reused in another application.)
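
To give a flavor of single-sourcing without getting ahead of ourselves, here's a sketch in which two small Python functions stand in for XSLT style sheets. This is not XSLT, and the to_html and to_plain_text functions are hypothetical; the point is only that one XML source can feed several presentations.

```python
import xml.etree.ElementTree as ET
from html import escape

# The single source of truth.
SOURCE_XML = """<productListing title="ABC Products">
  <product><name>Product One</name><cost>$19.95</cost></product>
  <product><name>Product Two</name><cost>$29.95</cost></product>
</productListing>"""


def to_html(listing):
    """Render the listing as an HTML heading and bulleted list."""
    items = "".join(
        f"<li>{escape(p.findtext('name'))}: {escape(p.findtext('cost'))}</li>"
        for p in listing.findall("product")
    )
    return f"<h1>{escape(listing.get('title'))}</h1><ul>{items}</ul>"


def to_plain_text(listing):
    """Render the same listing as plain text, one product per line."""
    lines = [listing.get("title")]
    lines += [
        f"- {p.findtext('name')} ({p.findtext('cost')})"
        for p in listing.findall("product")
    ]
    return "\n".join(lines)


listing = ET.fromstring(SOURCE_XML)
html_view = to_html(listing)
text_view = to_plain_text(listing)

print(html_view)
print(text_view)
```

Change a price in SOURCE_XML, run both functions again, and both outputs pick up the change. Nothing has to be copied by hand.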

As you can see, separating information from presentation makes your XML documents reusable, and can save hassles and headaches in environments in which a lot of information needs to be stored, processed, handled, and exchanged.

Here's another example. This book will actually be stored as XML (in the DocBook schema). That means the publisher can generate sample PDFs for its Website, make print-ready files for the printer, and potentially create ebooks in the future. All formats will be generated from the same source, and all will be created using different style sheets to process the base XML files.

Well-Formedness and Validity

We've talked a little bit about XML, what it's used for, how it looks, how to conceptualize it, and how to transform it. One of the most powerful advantages of XML, of course, is that it allows you to define your own language.

However, this most powerful feature also exposes a great weakness of XML. If all of us start defining our own languages, we run the risk of being unable to understand anything anyone else says. Thus, the creators of XML had to set down some rules that would describe a "legal" XML document.

There are two levels of "legality" in XML:

  • Well-formedness
  • Validity

A well-formed XML document follows these rules (most of which we've already discussed):

  • An XML document must contain a single root element that contains all other elements.
  • All elements must be properly nested.
  • All elements must be closed either with a closing tag or with a "self-closing" empty-element tag (i.e. <tag/>).
  • All attribute values must be quoted.

A valid XML document is both well-formed and follows all the rules set down in that document's DTD (document type definition). A valid document, then, is nothing more than a well-formed document that adheres to its DTD.

The question then becomes, why have two levels of legality? A good question, indeed!

For the most part, you will only care that your documents are well formed. In fact, most XML parsers (software that reads your XML documents) are non-validating (i.e. they don't care if your documents are valid)—and that includes those found in Web browsers like Firefox and Internet Explorer. Well-formedness alone allows you to create ad hoc XML documents that can be generated, added to an application, and tested quickly.

For other applications that are more mission-critical, you'll want to use a DTD within your XML documents, then run those documents through a validating parser.

The bottom line? Well-formedness is mandatory, but validity is an extra, optional step.
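
The four well-formedness rules listed earlier are easy to test for yourself. In this sketch, is_well_formed is a hypothetical helper built on Python's standard xml.etree.ElementTree module, whose parser is non-validating: it checks well-formedness only, and never consults a DTD to decide whether a document is valid.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a non-validating parser accepts the document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


results = {
    # Rule 1: a single root element must contain everything else.
    "two root elements": is_well_formed("<a/><b/>"),
    # Rule 2: elements must be properly nested.
    "bad nesting": is_well_formed("<b>bold <i>italic</b></i>"),
    # Rule 3: every element must be closed.
    "unclosed element": is_well_formed("<list><item>one</list>"),
    # Rule 4: attribute values must be quoted.
    "unquoted attribute": is_well_formed("<h1 class=topHeader>Title</h1>"),
    # A document that follows all four rules.
    "well-formed": is_well_formed('<h1 class="topHeader">Title</h1>'),
}

for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```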

In the next section, we'll practice using both validating and non-validating parsers to get the hang of these tools.

Getting Your Hands Dirty

Okay, we've spent some time talking about XML and its potential, and examining some of the neater aspects of it. Now, it's time to do what I like best, and get our hands dirty as we actually work on some documents.

The first thing we want to do is to create an XML document. For our purposes, any XML document will do, but for the sake of continuity, let's use the product listing document we saw earlier in the chapter.

Here it is again, with a few more nodes added to it:

Example 1.1. myFirstXML.xml

<productListing title="ABC Products">
 <product>
   <name>Product One</name>
   <description>Product One is an exciting new widget that will
     simplify your life.</description>
   <cost>$19.95</cost>
   <shipping>$2.95</shipping>
 </product>
 <product>
   <name>Product Two</name>
   <description>Product Two is an exciting new widget that will
     make you jump up and down.</description>
   <cost>$29.95</cost>
   <shipping>$5.95</shipping>
 </product>
 <product>
   <name>Product Three</name>
   <description>Product Three is better than Product One and
     Product Two combined! It really is as good as we say it
     is--or your money back.</description>
   <cost>$39.95</cost>
   <shipping>$5.95</shipping>
 </product>
</productListing>

Save this XML markup into a file and name it myFirstXML.xml. In the next few sections, we'll be viewing the file in different browsers and experimenting with parsers.

Viewing Raw XML in Internet Explorer

If you have Internet Explorer 5 or higher installed on your machine, you can view your newly created XML file. As Figure 1.2, "Viewing an XML file in Internet Explorer." illustrates, Internet Explorer simply displays XML files as a series of indented nodes.

Figure 1.2. Viewing an XML file in Internet Explorer.

Notice the little minus signs next to some of the XML nodes? A minus sign in front of a node indicates that the node contains other nodes. If you click the minus sign, Internet Explorer will collapse all the child nodes belonging to that node, as shown in Figure 1.3, "Collapsing nodes displaying in Internet Explorer.".

Figure 1.3. Collapsing nodes displaying in Internet Explorer.

The little plus sign next to the first product node indicates that the node has children. Clicking on the plus sign will expand any nodes under that particular node. In this way, you can easily display the parts of the document on which you want to focus.

Now, open your XML document in any text editing tool and scroll down to the cost node of the second product. The line we're interested in should read:

Example 1.2. myFirstXML.xml (excerpt)

<cost>$29.95</cost>

Capitalize the "c" on the opening tag, so that the line reads like this:

<Cost>$29.95</cost>

Save your work and reload Internet Explorer. You should see an error message that looks like the one pictured in Figure 1.4, "Error message displaying in Internet Explorer.".

Figure 1.4. Error message displaying in Internet Explorer.

As you can see, Internet Explorer provides a rather verbose explanation of the error it ran into: the end tag, </cost>, does not match the start tag, <Cost>.

Furthermore, it provides a nice visual of the offending line, a little arrow pointing to the spot at which the parser thinks the problem arose.

<Cost>$29.95</cost>
--------------^

Even though the problem is really with the start tag, the arrow points to the end tag. That's because the parser reads the document from top to bottom: on its own, <Cost> is a perfectly legal start tag, so the error only becomes apparent when the parser reaches an end tag that doesn't match it. (Remember, Internet Explorer uses a non-validating parser by default, so well-formedness rules are all it checks.) You now have to backtrack to find out why that particular end tag caused such a problem. Once you get the hang of this debugging method, you'll find it a great help in tracking down problems.
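
Browsers aren't the only tools that report errors this way. As a sketch, here's the same mistake fed to the parser in Python's standard xml.etree.ElementTree module (my choice of tool, not one used in this book), which attaches a line and column position to the error it raises.

```python
import xml.etree.ElementTree as ET

# The capitalized start tag sits on line 3, next to its mismatched end tag.
BROKEN_XML = """<product>
  <name>Product Two</name>
  <Cost>$29.95</cost>
</product>"""

try:
    ET.fromstring(BROKEN_XML)
except ET.ParseError as error:
    # position is a (line, column) pair pointing at where the parser gave up.
    error_line, error_column = error.position
    error_message = str(error)
    print(f"Parse failed on line {error_line}: {error_message}")
else:
    error_line = error_column = error_message = None
```

As with Internet Explorer, the position tells you where the parser noticed the problem, which isn't necessarily where the mistake was made.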

Let's introduce a slightly more complex problem. Open your XML document in an editor once more, and fix the problem we introduced above. Then, go to the second-last line of the document (it should read </product>) and add a <product> tag in front of it. Save your work and reload your browser.

You should see an error message similar to the one shown in Figure 1.5, "Debugging a more complex error.".

Figure 1.5. Debugging a more complex error.

At first glance, this error message seems a bit more obscure than the previous one. For starters, this message seems to indicate a problem with the </productListing> end tag. However, look closely and what do you see? It says that the </productListing> end tag does not match the <product> start tag. That's exactly what's wrong! Someone introduced a <product> start tag and didn't close it properly.

I'm including this example because bad nesting is one of the most common errors introduced to XML documents. This kind of error can be subtle and hard to find, especially if you're doing a lot of editing, or if your document is complex or long.
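To see where a parser actually stops on a nesting error, here's a small sketch, again assuming Python's standard xml.etree.ElementTree module. The shortened product listing and the first_error helper are made up for illustration; they stand in for the document you've been editing.

```python
import xml.etree.ElementTree as ET

# A made-up, shortened product listing with the extra <product> start tag
# added in front of the closing </product> tag.
lines = [
    "<productListing>",
    " <product>",
    "  <name>Product One</name>",
    " <product></product>",
    "</productListing>",
]
document = "\n".join(lines)


def first_error(xml_text):
    """Return (line, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position
        return line, str(err)
    return None


error = first_error(document)
if error is not None:
    line, message = error
    # The parser stops at the end tag that fails to match, not at the
    # stray start tag that caused the trouble.
    print(f"line {line}: {lines[line - 1].strip()}")
    print(message)
```

The reported line is the one holding </productListing>, even though the stray <product> tag sits a line earlier. That's the same backtracking exercise you just went through in the browser.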

Viewing Raw XML in Firefox

You can also use Firefox (and other Mozilla browsers like Netscape 8) to view your XML files. Firefox is a popular open-source browser, and at the time this book went to print the latest version was 1.0.4. You can download a free copy from the Mozilla website.

Viewing raw XML in Firefox is basically the same as viewing it in Internet Explorer, as you can see from Figure 1.6, "Viewing raw XML in Firefox."

Firefox's built-in parser is non-validating, so you won't be able to use it to check for document validity. However, it's comforting to know that the good folks at the Mozilla Foundation are planning to add a validating parser in a future release of the browser.

Options for Using a Validating Parser

Okay, so both Internet Explorer and Firefox will check your XML for well-formedness, but you need to know for future reference how to check that an XML file is valid (i.e. conforms to a DTD). How do you do that?

Well, there are a few options, listed below.

Using an Online Validating Parser

There are various well-known online validating XML parsers. All you have to do is visit the appropriate page, upload your document, and the parser will validate it. Brown University's Scholarly Technology Group sponsors one of the most famous parsers:

http://www.stg.brown.edu/service/xmlvalid/

Figure 1.6. Viewing raw XML in Firefox.

Using a Local Validating Parser

Sometimes, it may be impractical to use a Website to validate your XML because of issues relating to connectivity, privacy, or security. In any of these cases, it's a good idea to download one of the freely available solutions.

  • If you're familiar with Perl, you can use any of the outstanding parser modules written for that language, all of which are available at CPAN.org.
  • If you're comfortable with C++ or Visual Basic, then give MSXML by Microsoft a try.
  • IBM offers a very good standalone validating parser called XML4J. Just download the package and install it by following the instructions provided. Be warned, however, that you will have to know something about working with Java tools and files before you can get this one installed successfully.

Using Dreamweaver

Dreamweaver isn't just a tool for creating Web pages; it's also an integrated development environment (IDE) that offers a suite of development tools to the interested Web developer.

One of Dreamweaver's more interesting capabilities is its built-in XML validator. This checks for well-formedness if the document has no DTD, and for well-formedness and validity if a DTD is specified. If you don't have a copy of Dreamweaver, you can get a trial version to play with.

To validate an XML document, choose File > Check Page in Dreamweaver, then select Validate as XML. Results of the validation will appear under the Results area, as illustrated in Figure 1.7, "Dreamweaver MX's validating XML parser."

Figure 1.7. Dreamweaver MX's validating XML parser.

What if I Can't Get a Validating Parser?

If you can't get your hands on a validating parser, don't panic. For most purposes, an online resource will do the job nicely. If you work in a company that has an established software development group, chances are that one of the XML-savvy developers has already set up a good validating parser.

What about the content management system we'll work on through the course of this book? Well, we won't need to validate our XML documents until we get close to the project's end, when we start to deal with Web Services, and need to figure out how to accept XML content from (and send content to) organizations in the world at large.

Starting Our CMS Project

Now that we've introduced XML and played around with some documents and parsers, it's time to start our project. Throughout this book, we'll spend time building an XML-powered Website. Specifically, we're going to build an XML-powered content management system. This project will help ground your skills as you obtain firsthand experience with practical XML development techniques, issues, and processes.

So... What's a Content Management System?

A content management system (henceforth referred to as a CMS) is a piece of server-side software that's used to create, publish, and maintain content easily and efficiently on a Website. It usually consists of the following components:

  • A data back-end (comprising XML or database tables) that contains all your articles, news stories, images, and other content.
  • A data display component—usually templates or other pages—onto which your articles, images, etc., are "painted" by the CMS for display to site visitors.
  • A data administration component. This usually comprises easy-to-use HTML forms that allow site administrators to create, edit, publish, and delete articles in some kind of secure workflow. The data administration portion of a CMS is usually the most complicated, and this is the section on which you'll likely spend most of your development time.

Over the past decade, CMSs have been created using a range of different scripting languages including Perl/CGI, ASP, TCL, JSP, Python, and PHP. Each of these languages has its own pros and cons, but we'll use PHP with XML to build our CMS.

Requirements Gathering

Before you build any kind of CMS, first you must gather information that defines the basic requirements for the project.

The goal of the CMS is to make things easier for those who need to develop and run the site. And making things easier means having to do more homework beforehand! Although you may groan at the thought of this kind of exercise, a set of well-defined requirements can make the project run a lot more smoothly.

What kind of requirements do we need to gather? Essentially, requirements fall into three major categories:

  • What kind of content will the CMS handle? How is each type of content broken down? (The more complete your understanding of this issue, the easier it'll be to create and manage your XML files.)
  • Who will be visiting the site, and what behaviors do these users expect to find? (For example, will they want to browse a hierarchical list of articles, search for articles by keyword, see links to related articles, or all three?)
  • What do the site administrators need to do? (For example, they may need to log in securely, create content, edit content, publish content, and delete content. If your CMS will provide different roles for administrative users—such as site administrators, editors, and writers—your system will become more complex.)

As you can see, we've barely scratched the surface, and already we've uncovered a number of issues that need addressing. Let's tackle them one at a time.

CMS Content and Metadata

If you're going to build a content management system, it's logical to expect that you're going to want to put content into it. However, it's not always that easy!

The most common failing I've seen, across the dozens of CMS engagements I've worked on, is that most of the companies that actually take the time to think about content only think about one thing: "articles!" I'm not exactly sure why that is, but I'd venture to guess that articles are what most folks are exposed to when they read newspapers, magazines, or Websites, so it's the first (and only) content type that comes to mind.

But if you're going to build a workable CMS, you'll have to think beyond "articles" and define your content types more clearly. There's a whole range of content types that need management: PDFs, images, news stories, multimedia presentations, user reviews of whitepapers/PDFs, and much, much more. In the world of XML, each of these different types of content is, naturally enough, called a document type.

The second most common failing I see is an inability to successfully convince site owners that content means more than just "articles." What's even harder is to convince them that you have to know as much as you possibly can about each content type if you're going to successfully build their CMS.

It's not good enough to know that you'll be serving PDF files, news stories, images, and so on. You also have to know how each of these content types will break out into its separate components, or metadata. Metadata means "data about data" and it is immensely useful to the CMS developer. Each article, for instance, will have various pieces of metadata, such as a headline, author name, and keywords, each of which the CMS needs to track.

The only way to understand a content type's metadata is to research it—in other words, ask yourself and others a whole lot of questions about that piece of content.

The final challenge, defining the various types of metadata, can be a mixed blessing. In my experience, once people grasp the importance of metadata, they race off in every direction and collect every single piece of metadata they can find about a given content type. Usually, we developers end up with random bits of information that aren't very useful and will never be used. For example, the client might start to track the date on which an article is first drafted. In most cases, this is unimportant information; the reader certainly doesn't care!

Obviously, it's important to look for the right kinds of metadata, like these:

Provenance Metadata

  • Who created the content? When? When was it first published? When should it automatically be removed from the site, or archived? How is this document uniquely identified in the system? Who holds the copyright to it?

Organizational/Administrative Metadata

  • If you're using category listings for your content, where will any individual piece of content live within that category system? What other content is it related to? Which keywords describe the content for indexing or search purposes (in other words, how do we find the content)? Who should have access to the content (the entire public, only site subscribers, or company staff)?

Physical/Structural Metadata

  • Is the content ASCII text, an XML snippet, or a binary file, like a PDF or image? If it's a file, where does it reside on the server? What is the file's MIME type?

Descriptive Metadata

  • If it's an article, what's the headline? Does the CMS view an article body as being separate from headings and paragraphs, or are all these items seen as one big lump of XML?
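One way to keep these four categories straight is to write them down as a single record. The following is only a sketch in Python: the ContentMetadata class, its field names, the describe_file helper, and the sample file path are all hypothetical, and the MIME type is guessed from the file name using Python's standard mimetypes module.

```python
import mimetypes
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentMetadata:
    # Provenance
    content_id: str
    created_by: str
    published_on: Optional[str] = None  # ISO date, e.g. "2004-01-20"
    # Organizational/administrative
    category: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    # Physical/structural
    file_path: Optional[str] = None
    mime_type: Optional[str] = None
    # Descriptive
    headline: Optional[str] = None


def describe_file(content_id, created_by, file_path):
    """Build a record for a binary file, guessing its MIME type from the name."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    return ContentMetadata(
        content_id=content_id,
        created_by=created_by,
        file_path=file_path,
        mime_type=mime_type,
    )


whitepaper = describe_file("42", "Tom Myer", "files/whitepaper.pdf")
print(whitepaper.mime_type)
```

Every field in a record like this is something the administration tools will have to manage, which is a good reason to keep the list short.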

Gathering metadata can be very tricky. Let's take a look at a seemingly trivial issue: handling metadata about authors of articles. At first glance, we could say that all of our articles should contain elements for author name and email address, and leave it at that. However, we may later decide that we want site visitors to search or browse articles by author. In this case, it would make more sense to have a centralized list of authors, each with his or her own unique ID. This would eliminate the possibility of our having Tom Myer and Thomas Myer as "separate" authors just because the name was entered differently in individual articles.

Having a separate author listing would also allow us to easily set bylines for each author, in case someone decided they wanted to publish pieces under a pen name. It would also allow us to track author information across content types. We'd know, for instance, if a particular author has penned articles, written reviews, or uploaded files. Of course, agreeing on this approach means that we need to do other work later on, such as building administrative interfaces for author listings.
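Here's a rough sketch of that centralized author list in Python. The IDs, the dictionary layout, the second article, and the helper names are assumptions made for illustration; in the CMS itself this information would live in XML.

```python
# Hypothetical central author list: one entry per person, keyed by a unique ID.
authors = {
    "a1": {"name": "Thomas Myer", "byline": "Tom Myer"},
}

# Articles store only the author's ID, never a typed-in name.
articles = [
    {"id": "123", "headline": "Creating an XML-powered CMS", "author_id": "a1"},
    {"id": "124", "headline": "A second article", "author_id": "a1"},
]


def byline_for(article):
    """Look up the byline to print for an article."""
    return authors[article["author_id"]]["byline"]


def articles_by(author_id):
    """Return the headlines of every article written by one author."""
    return [a["headline"] for a in articles if a["author_id"] == author_id]


print(byline_for(articles[0]))
print(articles_by("a1"))
```

Because each article carries an ID rather than a name, there's only one place where the name can be entered, and so only one spelling of it.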

Once you've figured out the metadata required for a given content type, you can move on to the next content type. Eventually, you'll have a clear picture of all the content types you want your site to support.

What's the point of all this activity? Well, just think of metadata as one of the three pillars of your XML-powered CMS. (The other two are site functionality and site design. In many ways, metadata affect both and, thus, the user's experience of your site.) Every piece of metadata could potentially drive some kind of site behavior, but each piece of metadata also must be managed by the administration tools you set up.

Site Behavior

Site behavior should always be based on (and driven by) metadata. For example, if you're collecting keywords for all of your articles, you should be able to build a keyword-driven search engine for your site. If you're not collecting keyword information and want a keyword-driven search engine, you'd better back up and figure out how to add that to your content types.

Typical site behavior for a CMS-powered Website includes browsing by content categories, browsing by author, searching on titles and keywords, dynamic news sidebars, and more. Additionally, many XML- and database-powered sites feature homepages that boast dynamically updated content, such as Top Ten Downloads, latest news headlines, and so on.

CMS Administration

Our CMS will need to have an administrative component for each content type. It will also have to administer pieces of information that have nothing to do with content types, such as which users are authorized to log in to the CMS, and the privileges each of them has.

It goes without saying that your administrative interface has to be secure; otherwise, anyone could click to your CMS and start deleting content, making unauthorized changes to existing content, or adding new content that you may not want to have on your site.

In cases in which more than one person or department is involved with publishing content via the CMS, you'll need to consider workflow. A workflow is simply a set of rules that allow you to define who does what, when, and how. For example, your workflow might stipulate that a user with writer privileges may create an article, but that only a production editor can approve that content for publication on the site.
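A workflow rule like this one boils down to a table of roles and the actions each role may perform. The sketch below is in Python, and everything in it is illustrative: the role names, the action names, and the status values are assumptions, not the design we'll settle on later in the book.

```python
# Hypothetical role names and actions; a real CMS would likely load these
# rules from configuration rather than hard-code them.
PERMISSIONS = {
    "writer": {"create", "edit"},
    "production_editor": {"create", "edit", "approve"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())


def approve(article, role):
    """Mark an article as live, but only for roles allowed to approve."""
    if not can(role, "approve"):
        raise PermissionError(f"{role} may not approve articles")
    article["status"] = "live"
    return article


draft = {"id": "123", "status": "in progress"}
print(can("writer", "approve"))
print(approve(draft, "production_editor")["status"])
```

Keeping the rules in one table means that a new role, or a new step in the workflow, is a change to data rather than to code.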

In many cases, CMS workflows emulate actual workflows that exist in publication and marketing departments. Because we're dealing with XML, we have a great opportunity to build a workflow system that's modular and flexible enough to take into account different requirements.

Defining Your Content Types

We want to publish articles and news stories on our site. We definitely want to keep track of authors and site administrators, and we also want to build a search engine. We will also need to keep a record of all the copy on each of our site's pages, as well as binary files such as images and PDFs. That's a lot of work! For now, let's just step through the process of defining an article.

You may be asking, "Why are we messing around with content types at all?" It does seem like a silly thing for a developer to be doing, but it's actually the most vital task in building an XML-powered site. Whenever I build an XML-powered application, I try to define the content types first, because I find that all the other elements cascade from there. Because we've already spent some time discussing the structure of XML documents, and gathering requirements for the documents that will reside in our system, let's jump right in and start to define our article content type.

Articles

The articles in our CMS will be the mainstay of our site. In addition to the article text, each of our articles will be endowed with the following pieces of metadata:

  • A unique identifier
  • A headline
  • A short description
  • An author
  • A keyword listing
  • A publication date, which records when an article went live
  • Its status

Our article content type requires a root element that contains all the others; we can use <article> as that element. This not only makes sense from a "keep it simple" standpoint, but it is semantically appropriate, too.

Furthermore, because we need to identify each article in our system uniquely with an ID of some sort, it makes sense to add an id attribute to the root element that will contain this value. A unique identifier will ensure that no mistakes occur when we try to edit, delete, or view an existing article.

Now, each of our articles will have an author, so we need to reserve a spot for that information. There are literally dozens of ways to do this, but we'll take the simplest approach for now:

<article id="123">
 <author>Tom Myer</author>
</article>
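If you'd rather generate this structure from code than type it by hand, here's a minimal sketch using Python's standard xml.etree.ElementTree module. That's an assumption on my part; we'll work with PHP later in the book.

```python
import xml.etree.ElementTree as ET

# The root element carries the unique identifier as an attribute.
article = ET.Element("article", {"id": "123"})

# The author is a simple child element containing text.
author = ET.SubElement(article, "author")
author.text = "Tom Myer"

print(ET.tostring(article, encoding="unicode"))
```

The output is the same document on a single line, without the indentation.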

Looking for the DTD?

In Chapter 3, DTDs for Consistency, we'll discuss document type definitions (DTDs)—the traditional means to structure the rules for an XML file—in detail. For now, I think it makes more sense to continue our discussion in the direction we've already chosen.

Our article will need a headline, a short description, a publication date, and some keywords. The <headline> is very simple—it can have its own element nested under the <article> element. Likewise, the <description> and <pubdate> elements will be nested under <article>.

The keyword listing can be handled in one of two ways. You could create, under <article>, a <keywords> element that itself contains numerous <keyword> items:

<article id="123">
 <author>Tom Myer</author>
 <headline>Creating an XML-powered CMS</headline>
 <description>This article will show you how to create an
   XML-powered content management system</description>
 <pubdate>2004-01-20</pubdate>
 <keywords>
   <keyword>XML</keyword>
   <keyword>CMS</keyword>
 </keywords>
</article>

This approach will satisfy the structure nuts out there, but it turns out to be too complicated for the way we will eventually use these keywords. It turns out that all you really need is to list your keywords in a single <keywords> element, separated by spaces:

<article id="123">
 <author>Tom Myer</author>
 <headline>Creating an XML-powered CMS</headline>
 <description>This article will show you how to create an
   XML-powered content management system</description>
 <pubdate>2004-01-20</pubdate>
 <keywords>XML CMS</keywords>
</article>

Since individual keywords won't really have any importance in our system, this way of storing them works just fine.
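To compare the two approaches from the consuming side, here's a short sketch that reads both forms with Python's standard xml.etree.ElementTree module. The variable names are made up, and the documents are trimmed down to the keyword listing.

```python
import xml.etree.ElementTree as ET

nested = ET.fromstring(
    "<article id='123'><keywords>"
    "<keyword>XML</keyword><keyword>CMS</keyword>"
    "</keywords></article>"
)
flat = ET.fromstring("<article id='123'><keywords>XML CMS</keywords></article>")

# Nested form: collect the text of each <keyword> child.
nested_keywords = [k.text for k in nested.findall("keywords/keyword")]

# Flat form: split the single text string on whitespace.
flat_keywords = flat.findtext("keywords", default="").split()

print(nested_keywords)
print(flat_keywords)
```

Both forms give the same list. The flat form does assume that no keyword contains a space.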

Let's take a look at our growing XML document:

<article id="123">
 <author>Tom Myer</author>
 <headline>Creating an XML-powered CMS</headline>
 <description>This article will show you how to create an
   XML-powered content management system</description>
 <pubdate>2004-01-20</pubdate>
 <keywords>XML CMS</keywords>
</article>

We also need to track status information on the article. Because we don't need very robust workflows in this application, we can keep our status list very short, to "in progress" and "live."

Any article that is "in progress" will not be displayed on the live Website. It's a piece of content that's being worked on internally. Any article that is "live" will be displayed.

The easiest way to keep track of this information is to add a <status> element to our document:

<status>in progress</status>

However, you probably already see that status is very similar to keyword listings in that it has the potential to belong to many different content types. As such, it makes sense to centralize this information. We'll address this issue later, but for now, we'll continue to store status information in each article.

Now, we have to do something about the article's body. As most of our content will be displayed in a Web browser, it makes sense to use as many tags as possible that a browser like IE or Firefox can already understand. So HTML will form the basis of our article body's code. But for the purposes of our article storage system, we want to treat all of the HTML tags and text that make up the document body as a simple text string, rather than having to handle every single HTML tag that could appear in the article body. The best way to do this is to use a CDATA section within our XML document. An XML parser ignores tags, comments, and other XML syntax within a CDATA section; it simply passes the content through as a text string, without trying to interpret it. Here's what this looks like:

<body><![CDATA[
   <h1>Creating an XML-powered CMS</h1>
   <p>Here is all of our paragraph information. . .</p>
 ]]></body>

Well, we're done with articles! They now look like this:

<article id="123">
 <author>Tom Myer</author>
 <headline>Creating an XML-powered CMS</headline>
 <description>This article will show you how to create an
   XML-powered content management system</description>
 <pubdate>2004-01-20</pubdate>
 <status>live</status>
 <keywords>XML CMS</keywords>
 <body><![CDATA[
   <h1>Creating an XML-powered CMS</h1>
   <p>In this article…</p>
 ]]></body>
</article>
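To confirm that a parser really does hand back the body as a plain text string, here's a quick sketch using Python's standard xml.etree.ElementTree module. The document is a trimmed-down version of our article, and the paragraph text is filler.

```python
import xml.etree.ElementTree as ET

article_xml = """<article id="123">
 <headline>Creating an XML-powered CMS</headline>
 <body><![CDATA[
   <h1>Creating an XML-powered CMS</h1>
   <p>Here is all of our paragraph information.</p>
 ]]></body>
</article>"""

article = ET.fromstring(article_xml)
body = article.find("body")

# The HTML inside the CDATA section arrives as plain text, not as child elements.
print(len(body))          # number of child elements under <body>
print(body.text.strip())
```

The <body> element has no child elements at all. The <h1> and <p> tags are just characters in its text, ready to be written into a page as they are.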

Gathering Requirements for Content Display

We now understand our article content type, which defines most of the content we'll display on the site. Now, let's talk about our requirements for displaying content.

  • The display side of our site will only display articles and other content that have a status of "live."
  • The search engine will retrieve content by keywords, titles, and descriptions, and only display those pieces that have a status of "live."
  • The Website will display a list of author names by which site visitors can browse content, but it will only display those authors who have live articles posted on the site.
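The three display requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory article list, the second author and article, and the function names are all made up, and a simple substring match stands in for a real search engine.

```python
# Hypothetical in-memory articles; in the CMS these would come from XML files.
articles = [
    {"author": "Tom Myer", "headline": "Creating an XML-powered CMS",
     "description": "How to create an XML-powered content management system",
     "keywords": "XML CMS", "status": "live"},
    {"author": "Jane Doe", "headline": "Drafting with XSLT",
     "description": "An unfinished piece", "keywords": "XML XSLT",
     "status": "in progress"},
]


def live(items):
    """Keep only content whose status is "live"."""
    return [a for a in items if a["status"] == "live"]


def search(items, term):
    """Match term against keywords, headline, and description of live content."""
    term = term.lower()
    fields = ("keywords", "headline", "description")
    return [a for a in live(items)
            if any(term in a[f].lower() for f in fields)]


def authors_with_live_articles(items):
    """Return the sorted names of authors with at least one live article."""
    return sorted({a["author"] for a in live(items)})


print([a["headline"] for a in search(articles, "xml")])
print(authors_with_live_articles(articles))
```

Notice that both the search and the author listing go through the same live() filter, so an "in progress" article can't leak onto the site through either route.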

Gathering Requirements for the Administrative Tool

Let's talk briefly about the administrative tool. Here are some of the project's administration requirements:

  • All CMS users must log into the administrative tool. All passwords set for administrators will be encrypted before they're stored.
  • Each content type will have its own page through which users may list, add, edit, and delete individual pieces of content.
  • The same is true for authors and administrators. If you view an author listing, the CMS will display all pieces of content authored by that person.
  • The CMS will provide an easy method to update status, keyword, and other details for each piece of content on the site. Administrators will be able to group this information by status or content type.
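For the password requirement, one common reading of "encrypted before they're stored" is a salted, one-way hash. The sketch below takes that interpretation, which is my assumption rather than the method used later in the book, and uses Python's standard hashlib, hmac, and os modules. The function names and the iteration count are illustrative.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation from the book


def hash_password(password, salt=None):
    """Return (salt, digest) for storage; the plain password is never stored."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Re-derive the digest from a login attempt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

With this approach, the CMS never needs to recover the original password; it only needs to check that a login attempt produces the same digest.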

Great—this is enough detail to get us started!

Summary

In this first chapter, we've discussed basic XML concepts, talked about the importance of the requirements gathering process, and performed an analysis to come up with content types and application requirements for our XML-powered CMS.

In the next chapter, we're going to delve deeper into XML, covering such topics as basic XSLT and XPath. We'll get our hands dirty with a little XSLT and start thinking about how we should display articles on our CMS-powered Website.
