Bulletproof HTML: 37 Steps to Perfect Markup
This article highlights and answers some of the most frequently asked questions about HTML. HTML is the foundation of the Web, and both developers and designers need to understand it.
1. What is HTML?
HTML, or Hypertext Markup Language, is a markup language that's primarily used for Web documents. Any document that's written in a markup language is interspersed with tags that indicate the meanings of certain passages. Since version 2.0, HTML has been an application of a more generic markup language: SGML (Standard Generalized Markup Language).
HTML defines a number of element types. An element type assigns some semantic meaning to its content. For example, the em element type gives its content emphasis over the surrounding text. An element is a concrete instance of an element type. An element usually consists of a start tag (<em>), some content, and an end tag (</em>).
This HTML stuff is really, <em>really</em> nifty!
HTML allows some end tags (and even a few start tags) to be omitted. Don't confuse tags with elements; a body element will be present even if the <body> and </body> tags are omitted. Certain element types must not have an end tag. One example is br, which signifies a line break.
Baa baa black sheep, have you any wool?<br>
Yes sir, yes sir, three bags full
A start tag can contain attributes, comprising an attribute name, an equal sign (=), and an attribute value. For example, we can use the lang attribute to specify the language of an element's content.
Jean-Claude often exclaimed <em lang="fr">bon sang</em> despite the fact that no one understood him.
Attribute values must be quoted in some instances, so it's good practice always to quote all attribute values. Some boolean attributes are allowed to be minimised in HTML, which means the name and the equal sign are omitted (e.g. selected instead of selected="selected"). Some attributes are required for some element types, e.g., the alt attribute in an img element.
<img src="/images/sitepoint.gif" alt="SitePoint">
Beginners often use phrases like "alt tag", but this is incorrect nomenclature; alt is an attribute, not a tag. Tags are surrounded by <...>.
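To make the distinction between tags and attributes concrete, here is a minimal sketch that feeds markup like the examples above to the tokenizer in Python's standard html.parser module and records what it reports. The TagLogger class and log_tags function are names invented for this illustration. The tokenizer only sees tags, so it says nothing about elements whose tags have been omitted.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records the tags and attributes it meets."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a minimised boolean
        # attribute such as "selected" arrives with the value None.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


def log_tags(markup):
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


events = log_tags('<em lang="fr">bon sang</em><br><option selected>')
for event in events:
    print(event)
```

The em element produces a start tag (carrying the lang attribute) and an end tag, br produces a start tag only, and the minimised selected attribute shows up as a name without a value.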
2. What are the different versions of HTML?
The first version of HTML (1989) didn't have a version number; it was just "HTML". The first "standardised" version of HTML, released by the Internet Engineering Task Force (IETF) in 1995, was called HTML 2.0.
Then the World Wide Web Consortium (W3C) was formed. It presented its first "standard" version in 1997: HTML 3.2. Its successor, HTML 4.0, came out in December 1997 (with a revised edition in April 1998), and was quickly replaced by HTML 4.01 in 1999. That is the latest and current version of HTML. The W3C has announced that it will not create further versions of HTML. HTML 4.01 is recommended for creating HTML documents.
However, the Web Hypertext Application Technology Working Group (WHATWG) are working on what is referred to as HTML5, hoping that it will eventually be accepted as a W3C recommendation.
3. What about XHTML?
A few months after HTML 4.01 became a final recommendation, W3C released XHTML 1.0. This was seen as the "next version of HTML," but that perception's not entirely correct. XHTML 1.0 is a "reformulation of HTML 4 as an application of XML 1.0", as the specification puts it. In other words, it's XML with a predefined set of element types and attributes (and semantics) that correspond to the element types and attributes of HTML 4.01. It even comes in the same three flavours as HTML.
Many designers and developers embraced XHTML, as it was seen as the way forward. Few understood the profound differences between XHTML and HTML, as they looked so similar. The reality is that the most commonly used browser, Internet Explorer, does not support XHTML in any way, shape or form. More modern browsers like Opera, Firefox and Safari do support XHTML, but their market shares are too low for their support to have any significant impact when it comes to publicly accessible web sites.
By adhering to a number of guidelines put forth in the famous Appendix C of the XHTML 1.0 specification, it is possible to serve an XHTML document as HTML. This approach allows HTML-only browsers to be able to "handle" the document, but it is, for all intents and purposes, nothing more than HTML. We cannot use any of the features of XHTML when serving it this way, because we are not really using XHTML at all -- we're only pretending to.
4. Is HTML case-sensitive?
No, but XHTML is. In XHTML, all element and attribute names must be in lowercase. Traditionally, HTML element names and tags were written in uppercase, but with the advent of XHTML, this convention has slowly given way to the XHTML standard of lowercase element names.
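As a small illustration of HTML's case-insensitivity, the sketch below uses Python's standard html.parser module, which converts element and attribute names to lowercase before reporting them. The NameCollector class and element_names function are hypothetical names used only here. An XML parser, as used for real XHTML, would treat IMG and img as different names.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Hypothetical helper: collects element names and attribute names."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased the tag and attribute names.
        self.names.append((tag, [name for name, _ in attrs]))


def element_names(markup):
    collector = NameCollector()
    collector.feed(markup)
    collector.close()
    return collector.names


upper = element_names('<IMG SRC="logo.gif" ALT="Logo">')
lower = element_names('<img src="logo.gif" alt="Logo">')
print(upper == lower)  # True
```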
5. What does the DOCTYPE declaration do?
The DOCTYPE declaration, which must precede any other markup in a document, usually looks something like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
It specifies the element type of the document's root element (HTML), a public identifier and a system identifier.
The public identifier (-//W3C//DTD HTML 4.01//EN) shows who has issued the document type definition, or DTD, (W3C); the name of the DTD (DTD HTML 4.01); and the language in which the DTD is written (EN, for English). Note that it doesn't say anything about the language of the web page itself; it is the language of the DTD that is specified here.
The system identifier (http://www.w3.org/TR/html4/strict.dtd) is the URI (uniform resource identifier, or "web address") for the actual DTD.
The DOCTYPE declaration tells a validator (a program that checks the syntactic validity of a web page) against which DTD it should test the page for compliance. Browsers didn't use to care about the DOCTYPE declaration, but modern browsers use it to decide whether the page is "modern" (and presumably expects the behaviour detailed in the W3C HTML documentation) or old-school (and expects the browser to render the page with all the bugs and quirks exhibited by older browsers). A document's DOCTYPE affects the rendering mode used by Internet Explorer, Opera, Firefox (and other Mozilla-based browsers), Safari, and most other modern web browsers. A complete DOCTYPE declaration -- including the system identifier -- tells the browser that this is a modern document. If the system identifier is missing, or if there is no DOCTYPE declaration at all, browsers assume that this is an old document and render it in "quirks mode".
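The parts of a DOCTYPE declaration can be picked apart mechanically. The following is a rough sketch, not a description of how any browser or validator works: the regular expression only covers declarations with a PUBLIC identifier like the one shown above, and guess_rendering_mode applies nothing more than the simplified rule stated in this section (real browsers consult longer lists of known identifiers). All function names are invented for the example.

```python
import re

DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)'
    r'(?:\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?)?\s*>',
    re.IGNORECASE,
)


def parse_doctype(markup):
    """Return (root, public_id, system_id), or None if there is no DOCTYPE."""
    match = DOCTYPE_RE.match(markup.lstrip())
    if match is None:
        return None
    return match.group("root"), match.group("public"), match.group("system")


def split_public_id(public_id):
    """Split "-//W3C//DTD HTML 4.01//EN" into issuer, DTD name, DTD language."""
    _, issuer, name, language = public_id.split("//")
    return issuer, name, language


def guess_rendering_mode(markup):
    """Simplified rule from the text: no system identifier means quirks mode."""
    doctype = parse_doctype(markup)
    if doctype is None or doctype[2] is None:
        return "quirks"
    return "standards"


STRICT = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
          '"http://www.w3.org/TR/html4/strict.dtd">')
root, public_id, system_id = parse_doctype(STRICT)
print(root, split_public_id(public_id), system_id)
print(guess_rendering_mode(STRICT))
print(guess_rendering_mode("<html><body>old page</body></html>"))
```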
6. What is a DTD?
A DTD, or document type definition, specifies the element types and attributes that we can use in our web page. It also defines the rules of how we can use these elements together -- it's the specification for our markup language. The DTD can also declare the character entities we can use; more about those later.
A validator will test a web page for compliance with the DTD specified in the DOCTYPE declaration either explicitly, via the system identifier, or implicitly, using the public identifier. Browsers use non-validating parsers and do not actually read the DTD. They have built-in knowledge about the various element types, and usually a hard-coded list of character entities as well.
For HTML 4.01, which is the latest and greatest version, there are three different DTDs: Strict, Transitional and Frameset.
7. What is the difference between Strict, Transitional and Frameset DTDs?
The differences between these DTDs include the element types and attributes they declare, and how they allow or require element types to nest.
- The HTML 4.01 Strict DTD emphasises the separation of content from presentation and behaviour. This is the DTD that the W3C recommends for all new documents.
- The HTML 4.01 Transitional DTD is meant to be used transitionally when converting an old-school (pre-HTML4) document into modern markup. It isn't intended to be used to create new documents. It contains 11 presentational element types and a plethora of presentational attributes that are deprecated in the Strict DTD. The Transitional DTD is also often necessary for pages that reside within a frameset, because it declares the TARGET attribute required for opening links in another frame.
- The HTML 4.01 Frameset DTD is used for frameset pages. Frames are deprecated by the W3C. For modern web sites, using server-side scripting technologies is usually regarded as a far better solution.
8. Which DOCTYPE should I use?
If we are creating a new web page, the W3C recommends that we use HTML 4.01 Strict.
If we are trying to convert older HTML 2.0 or HTML 3.2 documents to the modern world, we can use HTML 4.01 Transitional until we have managed to transfer all presentational issues to CSS, and all behavioural issues to JavaScript.
9. Why should I validate my markup?
Why should we spell-check our text before publishing it on the web? Because mistakes and errors can confuse readers and detract from the important information. The same can be said for markup. Invalid markup can confuse browsers, search engines and other user agents. The result can be improper rendering, dysfunctional pages, pages that remain unindexed by the search engines, program crashes or the end of the universe as we know it!
If our page doesn't display the way we intended, we should always validate our markup before we start looking for other problems (or asking for help on SitePoint). With invalid markup, there are no guarantees.
Use the HTML validator at W3C to check your pages' compliance. Don't forget to include a DOCTYPE declaration, so the validator knows against which standards it should check your document.
HTML Tidy is a free tool that can help us tidy sloppy markup, format it nicely and make it easier to read.
10. Why does HTML allow sloppy coding?
It doesn't, but it recommends that user agents handle and try to recover from markup errors.
It is sometimes alleged that HTML allows improperly nested elements like <b><i>foo</b></i>. That isn't true; the validator will complain about improperly nested tags because they don't constitute valid HTML. However, browsers will usually guess what the author meant, so the error may go undetected.
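The kind of nesting error a validator reports can be sketched with a simple stack of open elements. This is only an illustration and not a validator: it assumes every element has an explicit end tag, apart from a small, incomplete set of elements whose end tag is forbidden, and it knows nothing about the DTD. The NestingChecker class and improper_nesting function are hypothetical names.

```python
from html.parser import HTMLParser

# Partial list of element types whose end tag is forbidden (an assumption
# for this sketch; the DTD is the authoritative source).
NO_END_TAG = {"br", "img", "hr", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Hypothetical checker: flags end tags that close the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in NO_END_TAG:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
            return
        expected = self.open_elements[-1] if self.open_elements else None
        self.errors.append((tag, expected))
        # Recover roughly as a lenient browser might: drop the most
        # recent matching open element, if there is one.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index] == tag:
                del self.open_elements[index]
                break


def improper_nesting(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(improper_nesting("<b><i>foo</b></i>"))  # [('b', 'i')]
print(improper_nesting("<b><i>foo</i></b>"))  # []
```

Each error is a pair: the end tag that was found, and the element that was actually open at that point.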
Some dislike that HTML allows the omission of certain (but not all!) end tags. That's not a problem for browsers, because valid markup can always be parsed unambiguously. In the early years it was very common to omit certain end tags (e.g. </p> and </li>). Nowadays it's usually considered good practice to use explicit end tags for all elements except those, like br and img, for which it is forbidden.
11. Why does the validator complain about my <embed> tag?
embed has never been part of any HTML recommendation. It's a non-standard extension which, although supported by most browsers, is not part of HTML.
During the "browser wars" of the late 1990s, browser vendors like Microsoft and Netscape competed by adding lots of "cool" features to HTML to make it possible to style web pages. The problem with those additions was that they weren't standardised, and were mostly incompatible between browsers.
There are other elements that used to be quite common (marquee, anyone?), but which have never been included in an HTML recommendation. Don't use them if you can possibly avoid doing so.
A number of other attributes were very common in the 1990s, but have never been included in an official HTML recommendation. marginwidth is one example.
12. What does character encoding (charset) mean?
Computers can only deal with numbers. What we see on the screen as letters or images are transmitted over the Internet and around the various parts of your computer as numeric codes, which the computer sees as groups of binary digits (ones and zeros).
In order to make sense of these numbers, we need to define a minimum unit that's capable of conveying some sort of information. When we're dealing with text, this unit is called a character. This is a rather abstract concept. The character known as "uppercase A" has no defined visual appearance; it's more like "the idea of an A".
Next, we need to establish a set of such abstract characters that we will use. That's called a character set. A character set is the total set of abstract characters that we have at our disposal. For HTML, the standard character set is ISO 10646, which is virtually the same thing as Unicode. It is a set of tens of thousands of characters representing most of the written languages on the planet.
The visual appearance of a character is called a glyph. A certain set of glyphs is known as a font. The glyph for "uppercase A" will differ between fonts, but that doesn't change the underlying meaning of the abstract character.
Now, since computers only deal with numbers, we must have a way to represent each character with a numeric code. Each character in a character set has a code position, or code point. The code point is the numeric representation (index) of the character within the character set. Code points in Unicode are usually expressed in hexadecimal (e.g., 0x0041 for "uppercase A").
Finally, the encoding -- sometimes, unfortunately, also called a "character set" or "charset", though we'll stick with the correct term "character encoding" here -- is a mechanism for expressing those code points, usually with octets, which are groups of 8 binary digits (and thus are capable of representing numbers between 0 and 255, inclusive).
In the early days of computer communication, people used small character sets containing only the bare necessities for a specific language. The most well-known set is probably ASCII (ISO 646), which only contains 128 characters, 33 of which are unprintable "control codes". The ASCII character set has 128 code points numbered sequentially from 0 to 127. The encoding is a simple one-to-one mapping: the code point for "uppercase A" is 65 (0x41), which is encoded as 65 (1000001 in binary).
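The relationship between a character, its code point and its ASCII encoding can be checked in a few lines. This is a minimal sketch using only Python built-ins; the describe and fits_in_ascii helpers are names made up for the example.

```python
def describe(char):
    """Return the code point of a character in decimal, hex and binary."""
    code_point = ord(char)
    return code_point, hex(code_point), format(code_point, "b")


def fits_in_ascii(text):
    """True if every character has a code point in the ASCII range 0-127."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(describe("A"))               # (65, '0x41', '1000001')
print(list("A".encode("ascii")))   # [65]
print(fits_in_ascii("wool"), fits_in_ascii("\u00c4"))  # True False
```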
ASCII isn't very useful outside the English-speaking world, because it contains only the letters A-Z, digits 0-9, and some basic punctuation. The International Organization for Standardization (ISO) issued a set of standards called ISO 8859, which augmented the ASCII character set with characters that were needed for other languages. In the Western world, the most common set is ISO 8859-1, known as Latin-1. It contains characters needed to write most Western European languages. The ISO 8859 series are both character sets and character encodings. Each character set contains 256 characters, which can be encoded using 8 binary digits. They each use the ASCII character set as a subset, i.e., the first 128 code points are the same.
But even 256 characters was not enough to write some languages. Chinese, for instance, needs thousands of characters. Several mutually incompatible encodings for Chinese were devised, but there were still big problems for those who wanted to exchange information across linguistic and cultural barriers.
From this point, it would be easy to create a character encoding that used 16 or even 32 binary digits for each character. However, using a 32-bit encoding would result in most documents being four times larger than they needed to be.
The solution was a variable-length encoding called UTF-8. It uses between 8 and 32 bits (one to four octets) to encode each code point, and it can address the entire Unicode (or ISO 10646) character set. The first 128 code points are encoded in 8 bits, and are identical to the corresponding code points in ASCII. Most Western European languages can be encoded with single octets, sprinkled with the occasional 16-bit character for letters with diacritical marks (e.g., Ä).
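The variable length is easy to observe. The sketch below, which assumes nothing beyond Python's built-in UTF-8 codec, counts the bits used for four sample characters; the utf8_bits helper and the sample labels are invented for the illustration.

```python
def utf8_bits(char):
    """Number of bits UTF-8 uses to encode a single character."""
    return len(char.encode("utf-8")) * 8


samples = {
    "A": "\u0041",            # plain ASCII letter
    "A-umlaut": "\u00c4",     # letter with a diacritical mark
    "euro": "\u20ac",         # Euro sign
    "g-clef": "\U0001d11e",   # musical symbol with a code point above U+FFFF
}
sizes = {name: utf8_bits(char) for name, char in samples.items()}
print(sizes)  # {'A': 8, 'A-umlaut': 16, 'euro': 24, 'g-clef': 32}

# The ASCII range is encoded identically in ASCII and UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```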
How does this affect us as authors of web documents? If we use characters whose code points are outside the ASCII range, the encoding becomes really crucial. Specify the wrong encoding, and the page will be difficult -- or even impossible -- to read.
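Here is a minimal sketch of what "specify the wrong encoding" looks like in practice, using a Swedish word as sample text. It only uses Python's built-in codecs; the variable names are arbitrary.

```python
text = "sm\u00f6rg\u00e5s"  # "smörgås"

# Saved as UTF-8 but declared (and therefore decoded) as ISO 8859-1:
saved = text.encode("utf-8")
wrong = saved.decode("iso-8859-1")
print(wrong)  # smÃ¶rgÃ¥s

# The damage is reversible only if we know which mistake was made.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == text)  # True

# The opposite mistake: saved as ISO 8859-1 but decoded as UTF-8.
# The octets are not valid UTF-8, so they become replacement characters.
latin1_saved = text.encode("iso-8859-1")
lossy = latin1_saved.decode("utf-8", errors="replace")
print(lossy)
```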
So how do we go about specifying the encoding? The proper way to do it is to send this information in the Content-Type HTTP header:
Content-Type: text/html; charset=utf-8
The HTTP headers are sent by our web server, so we must tweak the server to change the encoding information. How we achieve that will depend on which web server we use. For Apache, it can be specified in the global configuration file (httpd.conf) or in local .htaccess files. But if we're using a shared host, we may not have sufficient privileges to tweak the configuration. In that case, we need a server-side scripting language to send our own HTTP header; here's an example for PHP:
header('Content-Type: text/html; charset=utf-8');
We can also specify the encoding using an HTTP equivalent in a META element:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
This meta element will be ignored if the real HTTP header contains encoding information. It can be useful anyway, though, because it will be used if a visitor saves our page to the hard drive and looks at it locally. In that situation, there's no web server to send HTTP headers, so the meta element will be used instead.
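The precedence described above can be sketched as follows. This is an illustration of the rule, not of any browser's actual algorithm. It borrows the header parser from Python's standard email package to extract the charset parameter; charset_from_content_type and effective_encoding are hypothetical names.

```python
from email.message import Message


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, if any."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()  # lowercased, or None if absent


def effective_encoding(http_header=None, meta_content=None):
    """The real HTTP header wins; the meta element is only a fallback."""
    for value in (http_header, meta_content):
        if value:
            charset = charset_from_content_type(value)
            if charset:
                return charset
    return None


# Header and meta disagree: the header is used.
print(effective_encoding("text/html; charset=ISO-8859-1",
                         "text/html; charset=utf-8"))  # iso-8859-1
# Page opened from disk, so there is no header: the meta element is used.
print(effective_encoding(None, "text/html; charset=utf-8"))  # utf-8
```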
There is no default encoding for HTML, so we should always make sure to specify it.
A common encoding under Microsoft Windows is Windows-1252. It's very similar to ISO 8859-1, but there are differences. In ISO 8859-1, the range of code points between 0x80 and 0x9F is reserved for control characters. In Windows-1252, that range is instead used for a number of useful characters that are missing from the ISO encoding (e.g., typographically correct quotation marks). This is not an encoding that I would recommend for use on the Web, since it's Windows-specific. However, it is the default encoding in many text editors under Windows.
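The difference between the two encodings can be seen by decoding the same octet both ways. A small sketch using Python's built-in cp1252 and iso-8859-1 codecs:

```python
import unicodedata

octet = bytes([0x93])

as_windows = octet.decode("cp1252")      # Windows-1252
as_latin1 = octet.decode("iso-8859-1")   # ISO 8859-1

print(unicodedata.name(as_windows))      # LEFT DOUBLE QUOTATION MARK
print(unicodedata.category(as_latin1))   # Cc (a control character)

# From 0xA0 upwards the two encodings agree.
upper_range = bytes(range(0xA0, 0x100))
same_upper = upper_range.decode("cp1252") == upper_range.decode("iso-8859-1")
print(same_upper)  # True
```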
13. What is a BOM?
The BOM, or byte order mark, is used for some encodings that use more than 8 bits to encode code points (e.g. UTF-8 and UTF-16). Computer processors (CPUs) can employ two different schemes for storing large integer numbers: "big-endian" and "little-endian". The BOM is the code point U+FEFF, written at the very beginning of the file, which tells the browser which scheme is being used. It occupies two octets (16 bits) in UTF-16 and three octets in UTF-8, where it serves only as a signature because UTF-8 has no byte-order variants.
Unfortunately, many older browsers cannot handle this information, so they display these bits as character data. If you see a couple of strange characters at the top of a page, the reason is probably that the browser doesn't handle the BOM (or that the encoding is specified incorrectly).
The only resolution to this problem is to avoid using the BOM. Editors that can save a document as UTF-8 will usually allow us to choose whether or not to include the BOM.
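If an editor has already written a BOM, it is easy to detect and remove. The sketch below uses the constants in Python's standard codecs module; strip_utf8_bom is a hypothetical helper name, and the sample text is arbitrary.

```python
import codecs


def strip_utf8_bom(data):
    """Return the octets without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Python's "utf-8-sig" codec writes a BOM, as some editors do.
with_bom = "hej".encode("utf-8-sig")
print(with_bom)                       # b'\xef\xbb\xbfhej'

# Software that treats the BOM as character data shows stray characters.
print(with_bom.decode("iso-8859-1"))  # ï»¿hej

print(strip_utf8_bom(with_bom))       # b'hej'

# In UTF-16 the BOM really does distinguish the two byte orders.
print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)  # b'\xfe\xff' b'\xff\xfe'
```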
14. What encoding should I declare?
It's very, very simple: we must specify the encoding that we used when saving our source file! If we save the file as ISO 8859-1, we must specify the encoding as iso-8859-1; if we save as UTF-8, we specify it as utf-8. The only problem here is that we may not always know what encoding our editor uses to save the file. Any editor worth its salt should give us the option to specify the encoding, though.
If we are writing in English, it doesn't matter all that much which encoding we choose, because we are mostly going to use characters that are encoded the same in most encodings. US-ASCII, ISO 8859-1, UTF-8 ... take your pick. For those of us who write code in other languages, the choice becomes more important. My native language -- Swedish -- uses three letters more than the English alphabet has to offer. Those are present in ISO 8859-1, so I can choose between that and UTF-8 encoding. Browser support for UTF-16 is poor, so it should be avoided on public web sites.
My recommendation is to use UTF-8 encoding wherever possible, without a BOM. It can natively represent any character in the Unicode character set.
Avoid Windows-1252 on public web pages, since it's a Windows-specific encoding. Use ISO 8859-1 instead (or ISO 8859-15, if you need the Euro sign).
15. How do I insert characters outside the encoding range?
What if we're using ISO 8859-1 encoding and wish to include a Euro sign in our content? There is no Euro sign in that character set, and hence no way to encode it, although it is present in ISO 10646 and can be used on a web page.
We have two choices: a named entity or a numeric reference.
The named entity for the Euro sign is &euro;. Entities start with an ampersand (&) and end with a semicolon (;). In some circumstances we can get away with omitting the semicolon, but it is definitely good practice to always put it in. Entity names are case-sensitive.
A numeric reference can be either decimal (&#8212;) or hexadecimal (&#x2014;), but it's generally safer to stick with decimal notation, because some old browsers can't handle the hexadecimal version. Note that the numeric value references the code point in ISO 10646; it has nothing to do with the encoding we've specified for our document.
References (in decimal) always work. Named entities may cause problems in older browsers, because some of them only support a subset of HTML entities.
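The Euro sign example can be reproduced with Python's standard html module and its built-in codecs. This is only a sketch of the idea; a real publishing tool may handle references differently.

```python
import html

EURO = "\u20ac"

# A named entity, a decimal reference and a hexadecimal reference
# all resolve to the same code point.
print(html.unescape("&euro;") == EURO)    # True
print(html.unescape("&#8364;") == EURO)   # True
print(html.unescape("&#x20AC;") == EURO)  # True

# ISO 8859-1 cannot encode the Euro sign directly ...
try:
    EURO.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# ... but a decimal numeric reference needs only ASCII characters.
as_reference = ("5 " + EURO).encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_reference)  # b'5 &#8364;'
```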
16. Why do I need to write &amp; instead of just &?
Certain characters have special meanings in HTML: < (less than), > (greater than), & (ampersand), " (quotation mark) and ' (apostrophe). In some circumstances, when we want to use these characters in normal text, we need to replace them with HTML entities.
The entities for the first four characters are as follows:
- &lt; (less than)
- &gt; (greater than)
- &amp; (ampersand)
- &quot; (quotation mark)
XML defines an entity for the apostrophe (&apos;), but HTML does not include this entity. An apostrophe can only be escaped using a numeric reference (&#39;).
Since the ampersand is used for these entities, it must nearly always be escaped, including occasions when it's used inside attribute values, such as the href attribute of links. Unfortunately, the ampersand is a very common argument separator in URIs, which means that it's quite common to encounter ampersands in URIs.
Most of the time in HTML, unescaped ampersands don't break anything (though XHTML is a different story). The error handling routines in browsers recover from the error and it all works. But if we should happen to have a query parameter whose name matches one of the predefined named entities in HTML ...
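To illustrate the hazard, here is a small sketch with a made-up URI whose query parameter happens to be named copy. It uses html.unescape from Python's standard library as a stand-in for a lenient parser. Browsers apply somewhat different rules, particularly inside attribute values, so treat this as a demonstration of the risk and not as a model of any particular browser.

```python
import html

# Hypothetical URI: the parameter name "copy" matches the entity &copy;
uri = "page.php?id=1&copy=2"

# Written into the markup unescaped, a lenient parser may read "&copy"
# as the copyright sign and corrupt the URI.
misread = html.unescape(uri)
print(misread)  # page.php?id=1©=2

# Escaping the ampersand removes the ambiguity.
escaped = html.escape(uri)
print(escaped)  # page.php?id=1&amp;copy=2
print(html.unescape(escaped) == uri)  # True
```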
17. How should heading elements be used?
HTML heading element types are h1, h2, h3, h4, h5 and h6. The number denotes the structural level of the heading, which means we should treat headings as we did in those outlines we had to learn in school (and promptly forgot about right after graduation).
The top-level heading on a page must be an h1. It should describe what the page is about. Most pages will have one h1 heading, but very complex documents that deal with several disparate topics may need more than one.
h2 headings will mark up the next structural level. Any sub-levels under that will be h3, and so on. We should never skip a heading level as we move downward through the hierarchy. An h4 should not follow an h2; there should be an h3 in between. (The validator will not complain about this, but it is good practice.)
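Since the validator won't flag skipped heading levels, a small script can. The following is a minimal sketch built on Python's standard html.parser module; the HeadingCollector class and skipped_levels function are names invented for the example, and the check covers only the "no skipping downward" rule described above.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects heading levels in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.levels.append(int(tag[1]))


def skipped_levels(markup):
    """Return (previous, current) pairs where a level was skipped."""
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    problems = []
    previous = 0  # so that a page starting below h1 is reported too
    for level in collector.levels:
        if level > previous + 1:
            problems.append((previous, level))
        previous = level
    return problems


good = "<h1>a</h1><h2>b</h2><h3>c</h3><h2>d</h2>"
bad = "<h1>a</h1><h2>b</h2><h4>c</h4><h2>d</h2>"
print(skipped_levels(good))  # []
print(skipped_levels(bad))   # [(2, 4)]
```

Moving back up the hierarchy (from h3 to h2, say) is fine; only downward jumps of more than one level are reported.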
It's important to mark up headings with the Hn element types. Assistive technologies such as screen readers can make use of a proper heading hierarchy to present an outline of the document. If we use <font size="7">...</font>, they cannot.
Tommy Olsson is a pragmatic evangelist for web standards and accessibility, who lives in the outback of central Sweden. Visit his blog at