The Definitive Guide to Web Character Encoding
Character encoding. You may have heard of it, but what is it, and why should you care? What can happen if you get it wrong? How do you know which one to use?
We'll look into the details in a minute, but for now let's just say that a character encoding is the way that letters, digits and other symbols are expressed as numeric values that a computer can understand.
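As a rough illustration of that idea, the Python sketch below prints the numeric byte values that a few encodings use for the same character. The helper name encoded_bytes is made up for this example; it is not part of any library.

```python
def encoded_bytes(char: str, encoding: str) -> list:
    """Return the byte values used to store `char` in `encoding`."""
    return list(char.encode(encoding))

# The code position of a character is just a number ...
print(ord("A"))                          # 65

# ... and an encoding decides which bytes represent it.
print(encoded_bytes("A", "ascii"))       # [65]
print(encoded_bytes("é", "iso-8859-1"))  # [233]
print(encoded_bytes("é", "utf-8"))       # [195, 169]
```

The same letter é is stored as one byte in ISO 8859-1 and as two bytes in UTF-8, which is why a reader of the file needs to know which encoding was used.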
A file -- an HTML document, for instance -- is saved with a particular character encoding. Information about the form of encoding that the file uses is sent to browsers and other user agents, so that they can interpret the bits and bytes properly. If the declared encoding doesn't match the encoding that has actually been used, browsers may render your precious web page as gobbledygook. And of course search engines can't make head nor tail of it, either.
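The mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the function name misread is hypothetical, and a real browser's error handling may differ from Python's "replace" strategy.

```python
def misread(text: str, actual: str, declared: str) -> str:
    """Encode `text` with the `actual` encoding, then decode the bytes
    as if they had been written in the `declared` encoding."""
    return text.encode(actual).decode(declared, errors="replace")

# Saved as UTF-8 but declared as ISO 8859-1: the two bytes of "é"
# are shown as two separate characters ("cafÃ©").
garbled = misread("café", "utf-8", "iso-8859-1")
print(len("café"), len(garbled))  # 4 5

# Saved as ISO 8859-1 but declared as UTF-8: the lone byte is not
# valid UTF-8, so it becomes the replacement character U+FFFD.
broken = misread("café", "iso-8859-1", "utf-8")
print(broken == "caf\ufffd")      # True
```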
What's the Difference?
Why does it matter which form of encoding we choose? What happens if we choose the "wrong" one?
The choice of character encoding affects the range of literal characters we can use in a web page. Regular Latin letters are rarely a problem, but some languages need more letters than others, and some languages need various diacritical marks above or below the letters. Then, of course, some languages don't use Latin letters at all. If we want proper -- as in typographically correct -- punctuation and special symbols, the choice of encoding also becomes more critical.
What if we need a character that cannot be represented with the encoding we've chosen? We have to resort to entities or numeric character references (NCRs). An entity reference is a symbolic name for a particular character, such as &copy; for the © symbol. It starts with an ampersand (&) and should end with a semicolon (;). An NCR references a character by its code position (see below). The NCR for the copyright symbol is &#169; (decimal) or &#xA9; (hexadecimal).
Entities and NCRs work just as well as literal characters, but they use more bytes, make the markup more difficult to read, and are prone to typing errors.
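To make the relationship between a literal character, its entity and its NCRs concrete, here is a small Python sketch. It uses the standard library's html.unescape; the helper to_ncr is an invented name for this example.

```python
import html

def to_ncr(char: str, hexadecimal: bool = False) -> str:
    """Build the numeric character reference for a single character."""
    code = ord(char)  # the character's code position
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"

print(to_ncr("©"))                    # &#169;
print(to_ncr("©", hexadecimal=True))  # &#xA9;

# All three spellings resolve to the same character.
print(html.unescape("&copy;") == html.unescape("&#169;") == html.unescape("&#xA9;"))  # True

# The literal character takes 2 bytes in UTF-8; the entity takes 6.
print(len("©".encode("utf-8")), len("&copy;".encode("utf-8")))  # 2 6
```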
What Affects the Choice?
A number of parameters should be taken into consideration before we choose a form of encoding, including:
- Which characters am I going to use?
- In which encodings can my editor save files?
- Which encodings are supported by the various components in my publishing chain?
- Which encodings are supported by visitors' browsers?
Let's consider each of these issues in turn.
Character Range
The first parameter we need to consider is the range of characters we're going to need. Obviously, a site that's written in a single language uses a more limited range of characters than a multilingual site -- especially one that mixes Latin letters with Cyrillic, Greek, Hebrew, Arabic, Chinese, and so on.
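If we want to use typographically correct quotation marks, dashes and other special punctuation, the "normal" single-byte encodings, such as US-ASCII and ISO 8859-1, fall short, because they have no code positions for those characters. This is also true if we need mathematical or other special symbols.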
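One way to check whether a given encoding covers the characters we plan to use is to try encoding a sample of the text. The Python sketch below does that; the function name unrepresentable and the sample string are assumptions made for this example.

```python
def unrepresentable(text: str, encoding: str) -> list:
    """Return the distinct characters in `text` that `encoding`
    cannot represent, in order of first appearance."""
    missing = []
    for char in text:
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            if char not in missing:
                missing.append(char)
    return missing

sample = "“Quoted” text – with a dash"

# Curly quotes and the en dash have no code position in ISO 8859-1 ...
print([f"U+{ord(c):04X}" for c in unrepresentable(sample, "iso-8859-1")])
# ['U+201C', 'U+201D', 'U+2013']

# ... but UTF-8 can represent all of them.
print(unrepresentable(sample, "utf-8"))  # []
```

Any character reported as missing would have to be written as an entity or NCR if that encoding were chosen.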
Text Editor Capabilities
Some authors prefer to use regular text editors like Notepad or Vim; others like a point-and-click WYSIWYG tool like Dreamweaver; some use a sophisticated content management system (CMS). Regardless of personal preference, our choice of editors affects our choice of encoding. Some editors can only save in one encoding, and they won't even tell you which one. Others can save in dozens of different encodings, but require you to know which one will suit your needs.
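If an editor doesn't say which encoding it used, one rough test is to see whether the saved bytes decode cleanly as a candidate encoding. The Python sketch below shows the idea. The name decodes_as is hypothetical, and the byte string stands in for the contents of a saved file. A successful decode is not proof of the encoding, since ISO 8859-1 accepts any sequence of bytes, but a failed decode rules a candidate out.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Stand-in for a file that an editor silently saved as ISO 8859-1.
saved = "naïve".encode("iso-8859-1")

print(decodes_as(saved, "ascii"))       # False
print(decodes_as(saved, "utf-8"))       # False
print(decodes_as(saved, "iso-8859-1"))  # True
```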
Other Components
A publishing chain consists of more than an editor. There's always a web server (HTTP server) at the far end of the chain, but there can be other components in between: databases, programming or scripting languages, frameworks, application servers, servlet engines and more.
Each of these components may affect your choice of encoding. Maybe the database can only store data in one particular encoding, or perhaps the scripting language you're using cannot handle certain encodings.
It's not possible to enumerate the capabilities of all the different editors, databases, and so on in this article, because there are simply too many of them. You need to look at the documentation for your components before choosing the encoding to use.
Browser Support
Some encodings -- like US-ASCII, the ISO 8859 series and UTF-8 -- are widely supported. Others are not. It is probably best to avoid the more esoteric encodings, especially on a site that's intended for an international audience.
Tommy Olsson is a pragmatic evangelist for web standards and accessibility, who lives in the outback of central Sweden. Visit his blog at