Introduction to XML
The Rules of XML
We need to look next at the rules that govern XML documents. The rules can get a little tedious so if you're in a hurry, just have a quick glance through and refer back later. You'll find that, once you get into writing your own XML documents, most of these rules will be pretty obvious.
The XML standard itself is available at http://www.w3.org/TR/2000/REC-xml-20001006. To save you a long read, the key rules are explained below. Note that if an XML document obeys these rules, it is said to be well formed (the word "valid" has another meaning in XML, which we'll look at later).
These are the most important rules any XML document must obey:
1. XML Version Required
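All XML documents should begin with a statement that describes the version of the XML standard being used (strictly speaking, XML 1.0 makes this statement optional, but you should always include it; when it is present, it must be the very first thing in the document):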
<?xml version="1.0"?>
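The above is known as the XML declaration. It looks just like a processing instruction, although strictly speaking the XML standard treats it as a separate construct.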
2. Close your Tags!
Every XML tag must be properly closed. HTML is more relaxed here, allowing you to use tags like <img> and <br> without closing them. In XML these should be <br></br> or just <br /> if the tag contains no data.
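If you'd like to see this rule enforced, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The helper name is_well_formed is made up for this illustration; it simply reports whether the parser accepted the string.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An unclosed <br> is tolerated in HTML but not in XML.
unclosed = is_well_formed("<p>line one<br>line two</p>")

# Both ways of closing the tag are accepted.
closed_pair = is_well_formed("<p>line one<br></br>line two</p>")
self_closed = is_well_formed("<p>line one<br />line two</p>")

print(unclosed, closed_pair, self_closed)  # False True True
```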
3. XML Tags Must be Nested in the Correct Order
In HTML, a browser will allow you to have <i> <b> Hello World! </i> </b>. In XML this would have to be either <i> <b> Hello World! </b> </i> or <b> <i> Hello World! </i> </b>.
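As a rough illustration of the nesting rule, the Python sketch below feeds both versions to the standard library's xml.etree.ElementTree parser. The variable names are our own; the exact wording of the error message depends on the underlying parser, so it is printed rather than assumed.

```python
import xml.etree.ElementTree as ET

badly_nested = "<i><b>Hello World!</i></b>"
well_nested = "<i><b>Hello World!</b></i>"

try:
    ET.fromstring(badly_nested)
    nesting_error = None
except ET.ParseError as err:
    # The parser reports what went wrong and where.
    nesting_error = str(err)

print("badly nested:", nesting_error)

# Correctly nested tags parse into a small tree: <i> contains <b>.
root = ET.fromstring(well_nested)
inner = root[0]
print(root.tag, ">", inner.tag, ">", inner.text)  # i > b > Hello World!
```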
4. XML is Sensitive to UPPERCASE/lowercase
In XML <mytag /> is not the same as <MYTAG />! In HTML you can get away with this -- a browser will (generally) treat <BODY></body> as being the same thing.
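A quick way to convince yourself of this is the short Python sketch below, which again uses xml.etree.ElementTree from the standard library. It compares the two tag names and then tries the mixed-case pair that HTML would forgive.

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring("<mytag />")
upper = ET.fromstring("<MYTAG />")

# The parser keeps tag names exactly as written, so these differ.
same_name = lower.tag == upper.tag
print(same_name)  # False

# An opening <BODY> cannot be closed by </body> in XML.
try:
    ET.fromstring("<BODY></body>")
    mixed_case_accepted = True
except ET.ParseError:
    mixed_case_accepted = False

print(mixed_case_accepted)  # False
```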
5. And I Quote...
XML attribute values must have quotes around them (either single or double quotes will do). In HTML you can get away with <a href=mypage.html>It's a Link!</a>. In XML that has to be <a href="mypage.html">It's a Link!</a>.
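The Python sketch below, using the standard library's xml.etree.ElementTree, shows the unquoted link being rejected while quoted versions are accepted. The helper parse_or_none is a hypothetical name used only for this example.

```python
import xml.etree.ElementTree as ET


def parse_or_none(xml_text):
    """Return the parsed root element, or None if the text is not well formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


unquoted = parse_or_none("<a href=mypage.html>It's a Link!</a>")
double_quoted = parse_or_none('<a href="mypage.html">It\'s a Link!</a>')
single_quoted = parse_or_none("<a href='mypage.html'>It's a Link!</a>")

print(unquoted is None)              # True
print(double_quoted.attrib["href"])  # mypage.html
print(single_quoted.attrib["href"])  # mypage.html
```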
6. An XML Document Must have at Least One Element
At least one element, known as the root element, must exist for an XML document to be well formed (and there can be only one root; every other element must be nested inside it). This tag doesn't have to contain anything, though, so the example below is acceptable:
<?xml version="1.0"?>
<root />
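Here is a small Python sketch of this rule, assuming the standard library's xml.etree.ElementTree as the parser. The helper is_well_formed is our own name; it tries a document with no element at all, the minimal document above, and a document with two top-level elements.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


declaration = '<?xml version="1.0"?>\n'

no_root = is_well_formed(declaration)
empty_root = is_well_formed(declaration + "<root />")
two_roots = is_well_formed(declaration + "<root /><root />")

print(no_root, empty_root, two_roots)  # False True False
```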
7. Naming your Tags
The way you name your XML tags is governed by the following rules:
- tag names can contain letters, numbers, and other characters (e.g. <mytag3></mytag3> is fine)
- tag names cannot contain spaces (e.g. <my tag></my tag> is wrong)
- tag names cannot start with the letters xml (including UPPER or mIXeD case)
- tag names cannot start with a number or punctuation character other than an underscore (e.g. <3mytag></3mytag> and <.mytag></.mytag> are both wrong).
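The naming rules above can be sketched as a small Python check. This is a deliberate simplification: it only considers ASCII names (the XML standard allows many more letters), it treats the underscore as a permitted first character, and the function name is_acceptable_tag_name is hypothetical.

```python
import re

# Simplified, ASCII-only pattern: the first character is a letter or an
# underscore; the rest may be letters, digits, full stops, underscores
# or hyphens. Spaces are never allowed.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_acceptable_tag_name(name):
    """Apply the simplified naming rules to a candidate tag name."""
    if not NAME_PATTERN.fullmatch(name):
        return False
    # Names beginning with "xml", in any mix of case, are reserved.
    return not name.lower().startswith("xml")


for candidate in ["mytag3", "my tag", "XmLtag", "3mytag", ".mytag"]:
    print(candidate, is_acceptable_tag_name(candidate))
```

Only mytag3 passes; each of the other names breaks one of the rules listed.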
8. Special Characters
Within the data you place in a tag or attribute, certain characters must be replaced with entities to prevent them from being mixed up with XML tags and syntax. These characters are:
Character : Entity : Example
" : &quot; : <tag entity="Here is a quote &quot;" />
' : &apos; : <tag entity="Here is an apostrophe &apos;" />
< : &lt; : <tag>1 &lt; 2</tag>
> : &gt; : <tag>2 &gt; 1</tag>
& : &amp; : <tag>Kramer &amp; Kramer</tag>
In PHP, the function htmlspecialchars() will achieve this (pass the ENT_QUOTES flag to make sure single quotes are converted as well as double quotes).
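Most other languages offer something similar. As a rough sketch in Python, the standard library's xml.sax.saxutils.escape() replaces &, < and > by default, and accepts a dictionary of extra replacements for the two quote characters. The wrapper name escape_for_xml and the QUOTE_ENTITIES dictionary are our own additions.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > on its own; the quotes are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_for_xml(text):
    """Replace the five special characters with their XML entities."""
    return escape(text, QUOTE_ENTITIES)


print(escape_for_xml("Kramer & Kramer"))  # Kramer &amp; Kramer
print(escape_for_xml("1 < 2"))            # 1 &lt; 2
print(escape_for_xml('say "hi"'))         # say &quot;hi&quot;

# unescape() reverses the process when given the inverse mapping.
restored = unescape("it&apos;s 2 &gt; 1", {"&quot;": '"', "&apos;": "'"})
print(restored)  # it's 2 > 1
```

Note that the ampersand has to be replaced first, otherwise the & in entities such as &lt; would itself be escaped; escape() takes care of that ordering for you.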
9. New Lines and White Space
For new lines in XML, the XML standard supports carriage returns and linefeeds (i.e. \r\n, \r and \n, as in most programming languages, are acceptable). Having said that, XML processors are expected to 'normalize' these to \n during processing.
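You can watch this normalization happen with a few lines of Python. This sketch assumes the standard library's xml.etree.ElementTree, which relies on the expat parser; other processors should behave the same way, but that is not checked here.

```python
import xml.etree.ElementTree as ET

# One element containing all three styles of line ending.
raw = "<note>one\r\ntwo\rthree\nfour</note>"

text = ET.fromstring(raw).text

# repr() makes the line endings visible.
print(repr(raw))
print(repr(text))  # 'one\ntwo\nthree\nfour'
```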
Whitespace in XML means space characters, new lines (above), and tab characters. If a document has no DTD (see below), all whitespace must be preserved. If a DTD is provided with the XML document, whitespace in any element that contains nothing but whitespace or other elements can be removed when the document is processed; it's down to the DTD (or XML Schema) to specify which elements should have their whitespace preserved.
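To make the no-DTD case concrete, here is a short Python sketch using the standard library's xml.etree.ElementTree. In that library, text before an element's first child is exposed as .text and text following an element's closing tag as .tail; the document itself is made up for the example.

```python
import xml.etree.ElementTree as ET

# No DTD here, so every space, tab and new line is kept.
doc = "<root>\n\t<child> padded </child>\n</root>"

root = ET.fromstring(doc)
child = root[0]

print(repr(root.text))   # '\n\t'
print(repr(child.text))  # ' padded '
print(repr(child.tail))  # '\n'
```

If the indentation is only there for readability, it is up to your own code to strip it, for example with child.text.strip().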
In most cases you shouldn't need to worry about this, but in particular where XSLT is concerned, to generate output for humans to read, you may need to be careful. You can find out more online in What's the diff? and Controlling Whitespace.