A Really, Really, Really Good Introduction to XML
Chapter 3. DTDs for Consistency
So far, we've created some very simple XML documents and learned what they're made of. We've also walked through some very simple examples in which we've transformed XML into something else, be it text, HTML, or different XML. Now, it's time to learn how to make your XML documents consistent.
Consistency in XML
Ralph Waldo Emerson, the great American thinker and essayist, once said, "A foolish consistency is the hobgoblin of little minds." Well, foolish or not, in the world of XML, we like consistency. In fact, in many contexts, consistency can be a very beautiful thing.
Remember that XML allows you to create any kind of language you want. We've already seen some varying examples in this book: from a letter to mom, to articles and news stories. In many cases, as long as you follow the rules of well-formedness, just about anything goes in XML.
However, there will come a time when you need your XML document to follow some rules—to pass a validity test—and those times will require that your XML data be consistently formatted. For example, our CMS should not allow a piece of data that's supposed to be in the admin information file to show up in a content file. What we need is a way to enforce that kind of rule.
In XML, there are two ways to set up consistency rules: DTDs and XML Schema. A DTD (document type definition) is a tried and true (if not old-fashioned) way of achieving consistency. It has a peculiar, non-XML syntax that many XML newcomers find rather limiting, but which evokes a comfortable, hometown charm among the old-school XML programmers. XML Schema is newer, faster, better, and so on; it does a lot more, and is written like any other XML document, but many find it just as esoteric as DTDs.
Information on DTDs and XML Schema could fill thick volumes if we gave it a chance. Each of these technologies contains lots of hidden nooks and crannies crammed with rules, exceptions, notations, and side stories. But, remember why we're here: we must learn as much as we need to know, then apply that knowledge as we build an XML-powered Website.
Fun with Terminology
Speaking of side stories, did you know that DTD actually stands for two things? It stands not just for document type definition, but also document type declaration. The declaration consists of the lines of code that make up the definition. Since the distinction is a tenuous one, we'll just call them both "DTD" and move on!
This chapter will focus on DTDs, as you're still a beginner, and providing information on XML Schema would be overkill. However, I will take a few minutes to explain XML Schema at a high level, and provide some comparisons with DTDs.
Just a warning before we start this chapter: consistency in XML is probably the hardest aspect we've covered so far, because DTDs can be pretty esoteric things. However, I think you'll find it worth your while, since using a DTD will prevent many problems down the road.
What's the Big Deal About Consistency?
Okay, before we get started, let's ask a very obvious question: "Why, oh why, are we sitting here on a lovely Saturday afternoon talking about the importance of consistency in XML documents? Why aren't we out in the park with our loyal dog Rover, a picnic basket, and our wonderful significant other?"
Well, you've actually asked two questions there. I can't answer the second one, because I really don't want to get into your personal life right now. As for the first question, many possible answers spring to mind:
- There will be a pop quiz later, so you'd better know your stuff.
- Your boss told you to learn it.
- You need to share your XML document with another company/department/organization, and they expect your information in a certain format.
- Your application requires that the XML documents given to it pass certain tests.
Although answers 1 and 2 can loom large in one's life, answers 3 and 4 are more solid reasons to understand the importance of consistency in XML documents. Using a system to ensure consistency allows your XML documents to interact with all kinds of applications, contexts, and business systems—not just your own. In layman's terms, using a DTD with your XML documents makes them easier to share with the outside world.
DTDs
The way DTDs work is relatively simple. If you supply a DTD along with your XML file, then the XML parser will compare the content of the document with the rules that are set out in the DTD. If the document doesn't conform to the rules specified by the DTD, the parser raises an error and indicates where the processing failed.
DTDs are such strange creatures that the best way to describe them is to just jump right in and start writing them, so that's exactly what we're going to do. A DTD might look something like this:
<!DOCTYPE letter [
<!ELEMENT letter (to,from,message)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT message (#PCDATA)>
]>
Those of you who are paying attention should have noticed some remarkable similarities between this DTD and the Letter to Mother example that we worked on in Chapter 2, XML in Practice. In fact, if you look closely, each line of the DTD provides a clue as to how our letter should be structured.
The first line of the DTD, which begins with <!DOCTYPE, indicates that our document type is letter. Any document we create on the basis of this DTD must therefore have a letter as its root element, or the document won't be valid.
The rest of the DTD is devoted to explaining two things:
- The proper order of elements in the XML document.
- The proper content of elements in the XML document.
In the next few sections, I'll walk you through the most important parts of element declarations. Then, we'll work on attribute and entity declarations. Once we have all that under our belts, we'll get our hands dirty building some sample XML files with DTDs.
Element Declarations
Let's have a look at the next line of the DTD above: the one that comes after the DOCTYPE.
<!ELEMENT letter (to,from,message)>
This is called an element declaration. You can declare elements in any order you want, but they must all be declared in the DTD. To keep things simple, though, and to mirror the order in which elements appear in the actual XML file, I'd suggest that you do what we've done here: declare your root element first.
A DTD element declaration consists of a tag name and a definition in parentheses. These parentheses can contain rules for any of:
- Plain text
- A single child element
- A sequence of elements
In this case, we want the letter element to contain, in order, the elements to, from, and message. As you can see, the sequence of child elements is comma-delimited.
In fact, to be more precise, the sequence not only specifies the order in which the elements should appear, but also, how many of each element should appear. In this case, the element declaration specifies that one of each element must appear in the sequence. If our file contained two from elements, for example, it would be as invalid as if it listed the message element before to.
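To make the sequence rule concrete, here is a minimal sketch of a complete document that carries the letter DTD and satisfies it. The names and message text are placeholders of my own, not the content of the Chapter 2 example.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
<!ELEMENT letter (to,from,message)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT message (#PCDATA)>
]>
<!-- Valid: exactly one to, one from, and one message, in that order. -->
<!-- The element values below are illustrative placeholders. -->
<letter>
  <to>Mom</to>
  <from>Tom</from>
  <message>Happy Mother's Day!</message>
</letter>
```

Adding a second from element, or moving message above to, would leave the file well-formed but no longer valid.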
Naturally, there will come a time when you'll need to specify more than just one of each element. How will you do that? With a neat little system of notation, defined in Table 3.1, "XML Element Declaration Notation", which may remind you of UNIX regular expressions.
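Table 3.1. XML Element Declaration Notation
- a,b: a followed by b, in that order
- a|b: either a or b, but not both
- a?: a is optional (zero or one)
- a*: zero or more of a
- a+: one or more of a
- (a,b): parentheses group elements so that a modifier can apply to the whole group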
With this notation as a backdrop, you can get pretty creative:
- Require at least two instances of an element (here, at least two paras).
<!ELEMENT chapter (title,para,para+)>
- Apply element count modifiers to element groups (here, one or more titles, each followed by one or more paras).
<!ELEMENT chapter ((title,para+)+)>
- Allow an element to contain plain text, a child element, or both (here, title can contain text and subtitle elements in any combination). In XML, #PCDATA must come first in such a group, and the group must end with a *.
<!ELEMENT title (#PCDATA|subtitle)*>
- Require exactly three instances of an element (here, exactly three steps).
<!ELEMENT instruction (step,step,step)>
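As a quick illustration of the first pattern in the list above, here is a sketch of a chapter document with its DTD embedded. I've assumed that title and para hold plain text; the text itself is made up for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE chapter [
<!ELEMENT chapter (title,para,para+)>
<!-- Assumption: title and para contain only text. -->
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
]>
<chapter>
  <title>DTDs for Consistency</title>
  <para>First paragraph.</para>
  <para>Second paragraph, which satisfies the minimum of two.</para>
  <para>A third paragraph is allowed because of the + modifier.</para>
</chapter>
```

Remove all but one para, and the document stops being valid.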
Elements that Contain only Text
Let's keep looking at our original DTD. After the letter declaration, we see these three declarations:
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT message (#PCDATA)>
Here, we see #PCDATA used to define the contents of our elements. #PCDATA stands for parsed character data, and refers to anything other than XML elements. So whenever you see this notation in a DTD, you know that the element must contain only text.
Mixed Content
What if you want to have something like this in your XML document?
<paragraph>This is a paragraph in which items are <b>bolded</b>,
<i>italicized</i>, and even <u>underlined</u>. Some items are
even deemed <highpriority>high priority</highpriority>.
</paragraph>
You'd probably think that you needed to declare the paragraph element as containing a sequence of #PCDATA and other elements, like this:
<!ELEMENT paragraph (#PCDATA,b,i,u,highpriority)> <!-- wrong! -->
You might think that, but you'd be wrong! The proper way to declare that an element can contain mixed content is to list #PCDATA first, separate it and the element names using the | symbol, and add a * at the end of the element declaration:
<!ELEMENT paragraph (#PCDATA|b|i|u|highpriority)*> <!-- right! -->
This notation allows the paragraph element to contain any combination of plain text and b, i, u, and highpriority elements. Note that with mixed content like this, you have no control over the number or order of the elements that are used.
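Here is a sketch of the paragraph from the start of this section, wrapped up as a complete document with the mixed-content declaration in place. The declarations for b, i, u, and highpriority are my assumption (plain text only), since the text above doesn't define them.

```xml
<?xml version="1.0"?>
<!DOCTYPE paragraph [
<!ELEMENT paragraph (#PCDATA|b|i|u|highpriority)*>
<!-- Assumption: the inline elements hold only text. -->
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
<!ELEMENT u (#PCDATA)>
<!ELEMENT highpriority (#PCDATA)>
]>
<paragraph>This is a paragraph in which items are <b>bolded</b>,
<i>italicized</i>, and even <u>underlined</u>. Some items are
even deemed <highpriority>high priority</highpriority>.
</paragraph>
```

You could reorder the inline elements, repeat them, or drop them entirely, and the document would still be valid.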
Empty Elements
What about elements such as the hr and br, which in HTML contain no content at all? These are called empty elements, and are declared in a DTD as follows:
<!ELEMENT hr EMPTY>
<!ELEMENT br EMPTY>
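To see empty elements at work, consider this small sketch. The page and para elements are hypothetical containers I've invented so that hr and br have somewhere to live.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
<!-- Hypothetical containers for this example only. -->
<!ELEMENT page (para,hr,para)>
<!ELEMENT para (#PCDATA|br)*>
<!ELEMENT hr EMPTY>
<!ELEMENT br EMPTY>
]>
<page>
  <para>First line<br/>Second line</para>
  <hr/>
  <para>Closing paragraph</para>
</page>
```

Put any text or child element inside hr or br, and a validating parser will reject the document.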
So far, most of this makes good sense. Let's talk about attribute declarations next.
Attribute Declarations
Remember attributes? They're the extra bits of information that hang around inside the opening tags of XML elements. Fortunately, attributes can be controlled by DTDs, using what's called an attribute declaration.
An attribute declaration is structured differently than an element declaration. For one thing, we define it with <!ATTLIST instead of <!ELEMENT. Also, we must include in the declaration the name of the element that contains the attribute(s), followed by a list of the attributes and their possible values.
For example, let's say we had an XML element that contained a number of attributes:
<actor actorid="HF1234" gender="male" type="superstar">
Harrison Ford</actor>
The element and attribute declarations for that element might look like this:
<!ELEMENT actor (#PCDATA)>
<!ATTLIST actor
  actorid ID #REQUIRED
  gender (male|female) #REQUIRED
  type CDATA #IMPLIED>
The easiest attribute to understand is type—it contains CDATA, or character data. Basically, this attribute can contain any string of characters or numbers. Acceptable values for this attribute might be "superstar", "leading man", or even "dinosaur." As developers, we can't exert much control over what is placed in an attribute of type CDATA.
Do you see #IMPLIED right after CDATA? In DTD-speak, this means that the attribute is optional. Don't ask why they didn't use #OPTIONAL—this legacy has been passed down from the days of SGML, XML's more complex predecessor.
Let's take a look at the gender attribute's definition. This attribute is #REQUIRED, so a value for it has to be supplied with every actor element. Instead of allowing any arbitrary text, however, the DTD limits the values to either male or female.
If, in our document, an actor element fails to contain a gender attribute, or contains a gender attribute with values other than male or female, then our document would be deemed invalid.
Let's look at the most complex attribute value in our example, then we'll stop talking about attribute and element declarations. The actorid attribute has been designated an ID. In DTD-speak, an ID attribute must contain a unique value, which is handy for product codes, database keys, and other identifying factors.
In our example, we want the actorid attribute to uniquely identify each actor in the list. The ID type set for the actorid attribute ensures that our XML document is valid if and only if a unique actorid is assigned to each actor.
Some other rules that you need to follow for IDs include:
- ID values must start with a letter or underscore.
- There can only be one ID attribute assigned to an element.
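Here's a sketch that puts the actor declarations to work in a full document. The actors root element and the second actor entry are assumptions I've added so that the uniqueness rule has something to check.

```xml
<?xml version="1.0"?>
<!DOCTYPE actors [
<!-- Hypothetical root element that holds a list of actors. -->
<!ELEMENT actors (actor+)>
<!ELEMENT actor (#PCDATA)>
<!ATTLIST actor
  actorid ID #REQUIRED
  gender (male|female) #REQUIRED
  type CDATA #IMPLIED>
]>
<actors>
  <actor actorid="HF1234" gender="male" type="superstar">Harrison Ford</actor>
  <!-- type is #IMPLIED, so it can be left out. -->
  <actor actorid="CF5678" gender="female">Carrie Fisher</actor>
</actors>
```

If both actors shared the same actorid, or if an actorid began with a digit, the document would fail validation.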
Incidentally, if you want to declare an attribute that must contain a reference to a unique ID that is assigned to an element somewhere in the document, you can declare it with the IDREF attribute type. We won't have any use for this attribute type in this book, however.
Entity Declarations
Back in Chapter 1, Introduction to XML, we talked a little bit about entities. An entity is a piece of XML code that can be used (and reused) in a document with an entity reference. For example, the entity reference &lt; is used to represent the < character, an XML built-in entity.
XML supports a number of built-in entities (among them &lt;, &gt;, &quot;, and &amp;) that don't ever need to be declared inside a DTD. With entity declarations, you can define your own entities, something that I think you'll find very useful in your XML career.
There are different types of entities, including general, parameter, and external. Let's go over each very quickly.
General entities are basically used as substitutes for commonly-used segments of XML code. For example, here is an entity declaration that holds the copyright information for a company:
<!ENTITY copyright "© 2004 by Triple Dog Dare Media">
Now that we've declared this entity, we could use it in our documents like so:
<footer>&copyright;</footer>
When the parser sees &copyright;, an entity reference, it looks for its entity declaration and substitutes the text we've declared as the entity.
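Assembled into one file, the declaration and the reference might look like the following sketch. The footer element declaration and the explicit UTF-8 encoding are my additions, there to make the example complete.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE footer [
<!-- Assumption: footer holds only text. -->
<!ELEMENT footer (#PCDATA)>
<!ENTITY copyright "© 2004 by Triple Dog Dare Media">
]>
<!-- The parser replaces the reference with the declared text. -->
<footer>&copyright;</footer>
```

An application reading this document sees the footer text as "© 2004 by Triple Dog Dare Media", with no trace of the entity reference.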
There are a couple of restrictions on entity declarations:
- Circular references are not allowed. The following is a no-no:
<!ENTITY entity1 "&entity2; is a real pain to deal with!">
<!ENTITY entity2 "Or so &entity1; would like you to believe!">
- We can't reference a general entity anywhere but in the XML document proper. For entities that you can use in a DTD, you need parameter entities.
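Parameter entities are both defined and referenced within DTDs. They're generally used to keep DTDs organized and to reduce the typing required to write them. A parameter entity is declared with a % sign before its name, and is referenced as %name; rather than &name;. One caveat: a parameter entity reference can appear inside another declaration, as it does below, only in an external DTD file; in a DTD embedded in the document's DOCTYPE, such references are allowed only between declarations. Here's an example of a parameter entity, and its use in a DTD: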
<!ENTITY % acceptable "(#PCDATA|b|i|u|citation|dialog)*">
<!ELEMENT paragraph %acceptable;>
<!ELEMENT intro %acceptable;>
<!ELEMENT sidebar %acceptable;>
<!ELEMENT note %acceptable;>
What this says is that each of the elements paragraph, intro, sidebar, and note can contain regular text as well as b, i, u, citation, and dialog elements. Not only does the use of a parameter entity reduce typing, it also simplifies maintenance of the DTD. If, in the future, you wanted to add another element (sidebar) as an acceptable child of those elements, you'd only have to update the %acceptable; entity:
<!ENTITY % acceptable "(#PCDATA|b|i|u|citation|dialog|sidebar)*">
External entities point to external information that can be copied into your XML document at runtime. For example, you could include a stock ticker, inventory list, or other file, using an external entity.
<!ENTITY favquotes SYSTEM "http://www.example.com/favstocks.xml">
In this case, we're using the SYSTEM keyword to indicate that the entity is really a file that resides on a server. You'd use the entity in your XML documents as follows:
<section>
<heading>Current Favorite Stock Picks</heading>
&favquotes;
</section>
External DTDs
The DTD example we saw at the start of this chapter appeared within the DOCTYPE declaration at the top of the XML document. This is okay for experimentation purposes, but with many projects, you'll likely have dozens—or even hundreds—of files that must conform to the same DTD. In these cases, it's much smarter to put the DTD in a separate file, then reference it from your XML documents.
An external DTD is usually a file with a file extension of .dtd—for example, letter.dtd. This external DTD contains the same notational rules set forth for an internal DTD.
To reference this external DTD, you need to add two things to your XML document. First, edit the XML declaration to include the attribute standalone="no":
<?xml version="1.0" standalone="no"?>
This signals that the document depends on declarations held outside the file, in this case a separate DTD. (Parsers assume this value whenever an external DTD is referenced, so the attribute is technically optional, but stating it makes the dependency explicit.) You must then add a DOCTYPE declaration that points to the external DTD, like this:
<!DOCTYPE letter SYSTEM "letter.dtd">
This will search for the letter.dtd file in the same directory as the XML file. If the DTD lives on a Web server, you might point to that instead:
<!DOCTYPE letter SYSTEM
"http://www.example.com/xml/dtd/letter.dtd">
A 10,000-Foot View of XML Schema
The XML Schema standard fulfills the same requirements as DTDs: it allows you to control the structure and content of an XML document. But, if it serves the same purpose as DTDs, why would we use XML Schema?
Well, DTDs have a few disadvantages:
- DTD notation has little to do with XML syntax, and therefore cannot be parsed or validated the way an XML document can.
- All DTD declarations are global, so you can't define two different elements with the same name, even if they appear in different contexts.
- DTDs cannot strictly control the type of information a given element or attribute can contain.
XML Schema is written in XML, so it can be parsed by an XML parser. XML Schema allows you, through the use of XML namespaces, to define different elements with the same name. Finally, XML Schema provides very fine control over the kinds of data contained in an element or attribute.
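To give you a feel for the difference, here is a rough sketch of how the letter structure from the start of this chapter might be expressed in XML Schema. The sent attribute is hypothetical; I've added it only to show a typed value (a date), which is something a DTD can't enforce.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Same structure as the letter DTD: to, from, message, in order. -->
  <xs:element name="letter">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="message" type="xs:string"/>
      </xs:sequence>
      <!-- Hypothetical attribute: must be a valid date such as 2004-01-20. -->
      <xs:attribute name="sent" type="xs:date"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Notice that the schema is itself a well-formed XML document, and that it is considerably longer than the four-line DTD it replaces.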
Now, for some major drawbacks: if you thought that DTDs were esoteric, then you won't be pleased by the complexity introduced by XML Schema. Most of the criticism aimed at XML Schema is focused on its complexity and length. In fact, at first glance, a schema's verbosity will remind you of your motor-mouth friend who hogs the airspace at any gathering.
We won't get much of a chance to work with XML Schema in this book, but there are many fine books available on the subject.
Getting Our Hands Dirty
Okay, now you know a lot more about DTDs than you did before. If you're thinking that all this talk of consistency in XML seems fairly esoteric, you're not alone. But stick with me—we're about to embark on the practical examples that will illustrate exactly how these concepts fit into the overall XML picture.
Let's start out by creating a sample document and using a DTD to validate it. For this exercise, we'll be working with Macromedia Dreamweaver MX, as it includes a built-in XML validator.
Our First Case: A Corporate Memo
You work for Amalgamated International, LLC. The big boss comes into your office because he heard a rumor that you're an XML wizard. This is really great news, because he's just come back from a conference where he learned that XML is a terrific way to get your internal corporate memos under control.
He instructs you to figure out how to get all the corporate memos into XML, and yes, they do need to be validated, because they will be used later by an application that's capable of searching through the memos.
The first thing you do is you take a look at the dozens of corporate memos you and your colleagues have received in the past few months. After a day or two of close examination, a pattern emerges.
Just by looking at them, you can see that all memos have the following elements:
- Date
- Sender
- Recipient list
- Priority
- Subject line
- One or more paragraphs
- Signature block
- Preparer's initials
You're sure that there's more to it than that, so you decide to gather more information. When you talk to your department's administrative assistant, he fills in the rest of the picture:
- There is almost always some kind of departmental code assigned to the file. This code is not always printed on the physical memos, but is always used as part of the filename. These codes help designate the memo's department of origin (accounting, finance, marketing, etc.).
- There is almost always a blind copy list on each memo—in other words, a list of recipients who, though they received it, are not listed anywhere on the memo as having received it.
- Many memos also have an expiration date. At Amalgamated, if a given memo has no expiration date, the information on the memo is deemed good for 180 days. Most memos contain information with lifetimes of less than six months, so most employees never see this kind of information. Other memos, such as those concerning HR policies, may have expiration dates that are years away.
With this information in hand, you begin to create a DTD for XML-based memos.
Although your first impulse might be to run out and create a sample XML memo document, please resist that urge for now. There's nothing wrong with this approach—indeed, it does provide useful modeling techniques. However, right now, we want to work with DTDs, then apply what we know to the building of the XML document.
So, the first thing you need to do is declare a DOCTYPE. Because these memos are internal to the company, and there may be a need for a separate external memo DOCTYPE, you decide to use internalmemo as your root element name:
Example 3.1. internalmemo-standalone.xml (excerpt)
<?xml version="1.0"?>
<!DOCTYPE internalmemo [
Now, it's time to define your elements. The first element—the root element—is internalmemo. This element will contain all the other elements, which hold date, sender, recipient, subject line, and all other information. Because these represent a lot of elements, it would be useful to split your document into two logical partitions: header and body. The header will contain recipient, subject line, date, and other information. The body will contain the actual text of the memo.
Here is the element declaration for our root element:
Example 3.2. internalmemo-standalone.xml (excerpt)
<!ELEMENT internalmemo (header,body)>
In DTD syntax, the above declaration states that our internalmemo element must contain one header element and one body element. Next, we will indicate which elements these will contain.
Here's what the header will contain:
Example 3.3. internalmemo-standalone.xml (excerpt)
<!ELEMENT header (date,sender,recipients,blind-recipients?,
  subject)>
In DTD syntax, the above declaration states that the header element must contain single date, sender, and recipients elements, an optional blind-recipients element, and then a subject element.
Here is the body:
Example 3.4. internalmemo-standalone.xml (excerpt)
<!ELEMENT body (para+,sig)>
In DTD syntax, the above declaration states that the body element must contain one or more para elements, followed by a single sig element.
Most of the other elements will contain plain text, except the para elements, in which we will allow bold and italic text formatting.
Example 3.5. internalmemo-standalone.xml (excerpt)
<!ELEMENT date (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT recipients (#PCDATA)>
<!ELEMENT blind-recipients (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT sig (#PCDATA)>
<!ELEMENT para (#PCDATA|b|i)*>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
That was simple enough. However, when we glance at the requirements, we can see that we haven't even begun to handle priority levels, preparer's initials, expiration dates, and department of origin.
What's the best way to handle these pieces of information? We could certainly add them as elements in the header section of our memos, but that wouldn't make much sense. Those pieces of information are hardly ever displayed on a document; they are used only for administrative purposes.
In any case, we want to be able to control the data that document creators put in for values such as priority. It wouldn't make much sense for them to enter "alligator" or "Disney World" when our application is going to be looking for "low", "medium" and "high."
The best way to store these pieces of information is to add them as attributes to the root element. To do that, we need to add an attribute declaration to our DTD:
Example 3.6. internalmemo-standalone.xml (excerpt)
<!ATTLIST internalmemo
  priority (low|medium|high) #REQUIRED
  initials CDATA #REQUIRED
  expiredate CDATA #REQUIRED
  origin (marketing|accounting|finance|hq|sales|ops) #REQUIRED>
]>
So, what does a valid internal memo document look like? I'm glad you asked:
Example 3.7. internalmemo-standalone.xml
<?xml version="1.0"?>
<!DOCTYPE internalmemo [
<!ELEMENT internalmemo (header,body)>
<!ELEMENT header (date,sender,recipients,blind-recipients?,
  subject)>
<!ELEMENT body (para+,sig)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT recipients (#PCDATA)>
<!ELEMENT blind-recipients (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT sig (#PCDATA)>
<!ELEMENT para (#PCDATA|b|i)*>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
<!ATTLIST internalmemo
  priority (low|medium|high) #REQUIRED
  initials CDATA #REQUIRED
  expiredate CDATA #REQUIRED
  origin (marketing|accounting|finance|hq|sales|ops) #REQUIRED>
]>
<internalmemo priority="high" initials="hjd"
expiredate="01/01/2008" origin="marketing">
<header>
<date>01/05/2004</date>
<sender>Thomas Myer</sender>
<recipients>Marketing Department</recipients>
<subject>Sell more stuff</subject>
</header>
<body>
<para>This is a <i>simple</i> memo from the marketing
department: sell <b>more</b> stuff!</para>
<sig>Thomas Myer</sig>
</body>
</internalmemo>
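The memo above leaves out the optional blind-recipients element and has only one paragraph. As a further sketch, here is a second memo that exercises both of those parts of the DTD. The memo's content and attribute values are invented for illustration; the DTD is the same one we just built.

```xml
<?xml version="1.0"?>
<!DOCTYPE internalmemo [
<!ELEMENT internalmemo (header,body)>
<!ELEMENT header (date,sender,recipients,blind-recipients?,
  subject)>
<!ELEMENT body (para+,sig)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT recipients (#PCDATA)>
<!ELEMENT blind-recipients (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT sig (#PCDATA)>
<!ELEMENT para (#PCDATA|b|i)*>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
<!ATTLIST internalmemo
  priority (low|medium|high) #REQUIRED
  initials CDATA #REQUIRED
  expiredate CDATA #REQUIRED
  origin (marketing|accounting|finance|hq|sales|ops) #REQUIRED>
]>
<!-- Illustrative values only; not a real Amalgamated memo. -->
<internalmemo priority="low" initials="hjd"
  expiredate="06/30/2004" origin="hq">
  <header>
    <date>01/12/2004</date>
    <sender>Thomas Myer</sender>
    <recipients>All Staff</recipients>
    <!-- Optional element, allowed by the ? modifier. -->
    <blind-recipients>Human Resources</blind-recipients>
    <subject>Office closure</subject>
  </header>
  <body>
    <!-- para+ permits any number of paragraphs, as long as there is one. -->
    <para>The office will be <b>closed</b> next Monday.</para>
    <para>Please plan <i>accordingly</i>.</para>
    <sig>Thomas Myer</sig>
  </body>
</internalmemo>
```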
Validating Our First Case
Now that we have a DTD and XML document, it's time to validate. Fortunately, Macromedia Dreamweaver MX has a built-in validation tool that we can use during development (in "real life" we would use a built-in validator that's part of our application). If you don't already own Dreamweaver, you can get a trial copy.
All we have to do is open our XML document (which contains a DTD) in Dreamweaver, then choose File > Check Page > Validate as XML. The result should look a lot like Figure 3.1, "Validating our first case with Dreamweaver MX.".
Figure 3.1. Validating our first case with Dreamweaver MX.

Do you see how, under Results, it reads No errors or warnings found.? That's what you want to see. In Dreamweaver MX 2004, the results list for a valid document is simply empty, and the status bar beneath the list reads Complete.
What happens if some things are out of place? For instance, what if, as a priority, you wrote "Extremely Urgent"? What would happen then? In that case, you'd see an error message like the one in Figure 3.2, "Error resulting from a bad attribute value." below.
Figure 3.2. Error resulting from a bad attribute value.

Notice that Dreamweaver MX tells you where the problem lies (with a specific line number) and provides a description of the problem. In this case, the validator is saying that the value of the priority attribute in your XML document doesn't match any of the possibilities defined in the DTD.
What if you decided to put the <sender> tag before the <date> tag? The validator catches that too, as you can see in Figure 3.3, "Error resulting from a misplaced element.".
Figure 3.3. Error resulting from a misplaced element.

Again, the validator gives you a line number and a description that can lead you to resolve the problem. All you need to do is put the sender element back in the prescribed order, and the document will validate once more.
Second Case: Using an External DTD for Memos
Our first case was simple enough—an internal memo DTD and XML file. In that case, we embedded the DTD right into the file. This is a practical thing to do when you're only dealing with a small number of files for each DTD, but in Amalgamated's case, they'll be dealing with tens (if not hundreds) of thousands of memos.
There's no way that you want to have to maintain all those copies of the DTD separately. Instead, you want to have a single DTD that is included in all of your XML files. What you do is copy your DTD code out of your XML document and save it in a separate file called internalmemo.dtd. Don't copy the DOCTYPE line, or the last line that closes off the brackets!
When you're finished, your DTD file should look like this:
Example 3.8. internalmemo.dtd
<!ELEMENT internalmemo (header,body)>
<!ELEMENT header (date,sender,recipients,blind-recipients?,
  subject)>
<!ELEMENT body (para+,sig)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT recipients (#PCDATA)>
<!ELEMENT blind-recipients (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT sig (#PCDATA)>
<!ELEMENT para (#PCDATA|b|i)*>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
<!ATTLIST internalmemo
  priority (low|medium|high) #REQUIRED
  initials CDATA #REQUIRED
  expiredate CDATA #REQUIRED
  origin (marketing|accounting|finance|hq|sales|ops) #REQUIRED>
Next, place a link to that external DTD in your XML document, like this:
Example 3.9. internalmemo.xml (excerpt)
<!DOCTYPE internalmemo SYSTEM "internalmemo.dtd">
You also need to change your XML document declaration (the first line of our XML document) to look like this:
Example 3.10. internalmemo.xml (excerpt)
<?xml version="1.0" standalone="no"?>
If you've done everything right, your file should validate when you use Dreamweaver's built-in validator. You now have a reusable DTD that you can apply to other internal memos.
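Pieced together from the excerpts above and the memo in our first case, the complete internalmemo.xml might look like this sketch. It assumes that internalmemo.dtd sits in the same directory as the XML file.

```xml
<?xml version="1.0" standalone="no"?>
<!-- Assumes internalmemo.dtd is in the same directory as this file. -->
<!DOCTYPE internalmemo SYSTEM "internalmemo.dtd">
<internalmemo priority="high" initials="hjd"
  expiredate="01/01/2008" origin="marketing">
  <header>
    <date>01/05/2004</date>
    <sender>Thomas Myer</sender>
    <recipients>Marketing Department</recipients>
    <subject>Sell more stuff</subject>
  </header>
  <body>
    <para>This is a <i>simple</i> memo from the marketing
    department: sell <b>more</b> stuff!</para>
    <sig>Thomas Myer</sig>
  </body>
</internalmemo>
```

Every other memo can start with the same two lines, so a change to the DTD only has to be made in one place.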
Our CMS Project
In Chapter 2, XML in Practice, we added a few more content types to our CMS project. We now understand articles, news stories, binary files, and Web copy, and are well on our way to completing the requirements-gathering phase of the project—we can start coding soon!
However, and this is a big "however," we've also run into something of a problem. If you recall, we are tracking author, status, keyword, and other vital information in separate files. That is, each individual article, news story, binary file, and Web copy file keeps track of its own keywords, status, author, and dates.
For most of this information, which will rarely be used except in connection with the particular document, this isn't a problem, but author information is something of a special case. If we wanted to display all documents for a certain author, we would have to dig through all of our files to find all the matches. This isn't a big deal when our site is small, but the task grows more unmanageable with each passing day.
Never fear—I have a proposal that will solve this problem. In fact, the rest of this chapter will be devoted to tackling this issue. With any luck, it will also give you some insights into the ways in which you can analyze requirements and come up with more architecturally sound XML designs.
Reworking the Way we Track Author Information
Let's take a quick look at our article. I've reprinted what we came up with at the end of Chapter 1, Introduction to XML, below for easy reference:
<article id="123">
<author>Tom Myer</author>
<headline>Creating an XML-powered CMS</headline>
<description>This article will show you how to create an
XML-powered content management system</description>
<pubdate>2004-01-20</pubdate>
<status>live</status>
<keywords>XML CMS</keywords>
<body><![CDATA[
<h1>Creating an XML-powered CMS</h1>
<p>In this article…</p>
]]></body>
</article>
So far, it's been very convenient to track our author information using the author element. However, doing it this way presents two problems, one of which we've already mentioned: eventually, we will have hundreds of articles on the site, and it would put a lot of strain on our application to dig through each one in order to display a list of articles by author.
The other problem is a little less obvious. What happens if, in one article, my name is listed as "Tom Myer," and in another, it's "Thomas Myer"? Or if, in one article, someone misspells my name as "Tom Meyer" (this happens a lot). To our application, these three names are different, and articles will thus be listed under three different authors.
To solve this problem, we should create a separate author listing (authors.xml), then use an authorid to reference that information in our articles. Once we have this figured out, we can get rid of the author element in all the other content types, and replace it with an authorid element.
Handling our authors this way also allows us to track other information about authors, such as their email addresses, their bylines (in case they want to publish under pseudonyms), and other such information.
Here's a sample of what that code would look like:
Example 3.11. authors.xml
<authors>
<author id="1">
<name>Thomas Myer</name>
<byline>myerman</byline>
<email>tom@tripledogdaremedia.com</email>
</author>
</authors>
Instead of a separate author element, we would add an authorid element to our articles, like this:
<article id="123">
<authorid>1</authorid>
…
Now we've solved the problem of redundancy—in other words, we've centralized our author information instead of having it spread across many different files. All we need to do is use this author ID in our articles, news stories, and all other content we add to our CMS; this ID is used to look up the author and retrieve the information we need.
Assign DTDs to our Project Documents?
The big question remains: do we take the time and effort to create DTDs or schemas for each of our content types? The answer is, as with most things technical, "it depends."
To be completely honest, most articles, news stories, and such will be submitted to the site through our administrative tool. This tool will have the necessary forms that will restrict data entry to certain fields. In other words, our administrative tool will do most of the work of validating our content. You could, therefore, suggest that a DTD would be completely superfluous, and you'd be right.
However, I think it would be good practice to develop a DTD for our article content type—after all, this is one of the most important document types we have in our system, and it has to be done right.
Here's a first shot at our article DTD:
<!ELEMENT article (authorid,headline,description,pubdate,status,
  keywords,body)>
<!ATTLIST article
  id CDATA #REQUIRED>
<!ELEMENT authorid (#PCDATA)>
<!ELEMENT headline (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT pubdate (#PCDATA)>
<!ELEMENT status (#PCDATA)>
<!ELEMENT keywords (#PCDATA)>
<!ELEMENT body (#PCDATA)>
Although we have declared our body element to contain character data, our article bodies will indeed be formatted using HTML tags. Because this HTML content will be wrapped in a CDATA block, those tags will be ignored by any XML processor reading an article file. We can use a CDATA block to hold any kind of text, as the XML parser will ignore any XML syntax that might appear in it. We therefore don't need to worry about the intricacies of HTML markup in this DTD.
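Here is a sketch of how the article from earlier in this section might look with this DTD embedded and the authorid element in place. Embedding the DTD is just for demonstration; in practice, you would more likely keep it in an external file, as we did with the memos.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article [
<!ELEMENT article (authorid,headline,description,pubdate,status,
  keywords,body)>
<!ATTLIST article
  id CDATA #REQUIRED>
<!ELEMENT authorid (#PCDATA)>
<!ELEMENT headline (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT pubdate (#PCDATA)>
<!ELEMENT status (#PCDATA)>
<!ELEMENT keywords (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<article id="123">
  <authorid>1</authorid>
  <headline>Creating an XML-powered CMS</headline>
  <description>This article will show you how to create an
  XML-powered content management system</description>
  <pubdate>2004-01-20</pubdate>
  <status>live</status>
  <keywords>XML CMS</keywords>
  <!-- The CDATA block is treated as plain text, so the HTML tags
       inside it do not need declarations in the DTD. -->
  <body><![CDATA[
<h1>Creating an XML-powered CMS</h1>
<p>In this article…</p>
]]></body>
</article>
```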
If you asked ten XML folks whether they agreed with this approach, you'd get ten different opinions and alternative approaches. For now, we've created something that will work—and work quickly.
If you'd like more practice with DTDs, you can go back to Chapter 2, XML in Practice and look at the XML formats we created for our other content types, like Web copy and news items. Try writing DTDs for these as well. If you ever need to check the documents stored in your CMS for validity, you can use these DTDs to do it.
Summary
Wow! In three chapters we've covered basic XML, some XSLT and CSS, and, now, the basics of DTDs. Plus, we've nailed down most of the requirements for our CMS project. I think we're in pretty good shape to start looking more deeply at the rest of our project. Along the way, we'll pick up a few more XSLT and XML tricks.