Rich Punctuation: How To Do It And Why You Should Bother
Accessing More Characters
Whichever method you adopt to access non-keyboard characters, remember that there are many more characters and punctuation marks available than the most-used ones we have discussed. There are dedicated symbols such as smiley faces, degree Celsius, fractions, Roman numerals, mathematical operators, primes, and the minus sign. For a full overview of available symbols, explore the General Punctuation block and other Latin language blocks from the Unicode Standard.
That said, the availability of some typographic punctuation marks and characters is limited to font faces with broad support for the Unicode Standard. Whenever using non-typical characters, the designer must always check whether the font faces used support the desired characters. Cascading Style Sheets (CSS) can specify stacks of font faces, thus ensuring broad support across platforms and devices. Font stacks should contain a generic fallback family as well as fonts from different operating systems.
But wait—that’s not all!
Delivering Web Documents in Unicode
Accessing non-keyboard characters and using key combinations at lightning-fast speed isn’t quite the end of the story, unfortunately. Your web document not only has to be encoded in Unicode, but also identified as a Unicode document.
By default, when no encoding is declared, most web browsers fall back to a legacy encoding such as ISO-8859-1 or Windows-1252 (both supersets of ASCII), the older standards of document encoding. This default must be overridden by specifying the encoding through an HTTP header in the response the server gives to the browser’s request for the document. Refer to your particular server’s documentation for full details on how to achieve this. There’s also an excellent article by Tommy Olsson that deals with web character encoding; in fact, it’s helpful to read that article as a companion to this one.
The UTF-8 encoding is a good choice for all Latin-based languages: it keeps files small, because ASCII characters take a single byte each, while still supporting the whole Unicode Standard. HTML files and all kinds of XML files, including Atom web feeds, sitemap indexes, and XHTML, can be encoded using UTF-8.
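The size claim can be checked with a small sketch. The Python below is an illustration only (the sample characters and the helper name `encoded_sizes` are assumptions made for this example, not part of the article); it compares how many bytes a few characters occupy in UTF-8 and in UTF-16.

```python
# Compare the encoded size of a few characters in UTF-8 and UTF-16.
# The sample characters are arbitrary choices for illustration.
samples = {
    "Latin letter a": "a",
    "e with acute accent": "\u00e9",
    "en dash": "\u2013",
    "curly apostrophe": "\u2019",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, char in samples.items():
    utf8_len, utf16_len = encoded_sizes(char)
    print(f"{name}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

Plain ASCII letters cost one byte in UTF-8, accented Latin letters two, and the punctuation marks discussed here three, so text that is mostly Latin letters stays compact even with rich punctuation sprinkled in.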
Apache servers need just a little more persuasion to serve web documents as UTF-8. To apply the change site-wide, the following can be added to a configuration file called .htaccess in the web-root directory (provided the server allows .htaccess overrides):
AddDefaultCharset UTF-8
AddCharset UTF-8 .atom .htm .html .xht .xhtml .xml
Files with any of the extensions in this space-separated list, which covers the standard formats commonly found on a web server, will be served using UTF-8 encoding. The same result can be achieved on a per-document basis with server-side code, as in this PHP snippet, which must run before any output is sent:
<?php header("Content-Type: text/html;charset=UTF-8"); ?>
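Take extra-special care to get these codes right! HTTP header names and charset labels are not case-sensitive, but a misspelled header name or charset label will not be recognized by the browser.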
Other server-side languages have very similar approaches to modifying HTTP headers. XML files should carry encoding information in the XML declaration (<?xml […] encoding="UTF-8"?>). Though it may not affect the actual document parsing, it’s good practice to always include encoding information in HTML files as well.
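As an illustration of how similar the approach is elsewhere, here is a minimal sketch in Python using the standard WSGI interface. The application name `app` and the sample page content are assumptions made for this example; only the `Content-Type` header value comes from the text above.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares UTF-8 in its HTTP header."""
    # Hypothetical sample page containing a curly apostrophe and an en dash.
    page = "<!DOCTYPE html><title>Demo</title><p>It\u2019s 5\u201310 pages long.</p>"
    body = page.encode("utf-8")  # the bytes must match the declared charset

    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the function can be passed to `wsgiref.simple_server.make_server("localhost", 8000, app)` from the Python standard library and served with `serve_forever()`. The important point is that the declared charset and the actual byte encoding of the body agree.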
The following HTML code snippet should appear as early as possible in the <head/> element, ideally as its very first child, so that the browser meets it before any non-ASCII content. Again, take care to spell the charset label correctly!
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
XHTML files use both the encoding specified in the XML declaration and the HTML method, for reasons of backwards compatibility. Both are only fallback mechanisms, used if the server fails to deliver the appropriate HTTP header, as described above.
Escaping Parsing Problems
As mentioned above, escaping characters is a method of writing out characters as a code sequence in the markup, and performs the same function in HTML and XML alike. When presented, the escaped characters will appear as the actual characters.
Including the actual character glyph itself in the web document, instead of escaping it, is a wiser approach for non-typical characters—just to be on the safe side. Moreover, with proper document encoding and delivery, the practice of escaping everything only causes the document to become significantly larger in size than it has to be.
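The size difference is easy to measure. The sketch below is illustrative only: the helper `escape_non_ascii` and the sample sentence are assumptions for this example. It writes every non-ASCII character as a numeric character reference and compares the result with the literal UTF-8 text.

```python
import html


def escape_non_ascii(text):
    """Replace each non-ASCII character with a numeric character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


# Hypothetical sample: curly apostrophe, en dash, and em dash.
literal = "It\u2019s 5\u201310 pages \u2014 roughly."
escaped = escape_non_ascii(literal)

print(escaped)
print("literal:", len(literal.encode("utf-8")), "bytes")
print("escaped:", len(escaped.encode("utf-8")), "bytes")

# A parser resolves the references back to the same characters.
assert html.unescape(escaped) == literal
```

Each of these punctuation marks takes three bytes as a literal UTF-8 character but seven bytes as a reference such as `&#8217;`, so escaping everything more than doubles the cost of every rich character.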
That said, there are four particular Unicode Standard characters which must always be escaped. These characters, shown in the table below, hold special meanings in the HTML and XML markup languages, and escaping them avoids potential parsing and rendering problems.
| Character | Symbol | Escape Sequence |
|---|---|---|
| less than | < | &lt; |
| greater than | > | &gt; |
| straight quote | " | &quot; |
| ampersand | & | &amp; |
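Most languages ship a helper that applies exactly these escapes. As a hedged example, the Python standard library offers `html.escape`; the sample string below is an assumption made for illustration.

```python
import html

# Hypothetical text containing all four special characters.
raw = 'Fish & chips: 3 < 5 > 2 and "quoted"'

# quote=True (the default) escapes quotation marks as well as &, < and >.
safe = html.escape(raw, quote=True)
print(safe)
# prints: Fish &amp; chips: 3 &lt; 5 &gt; 2 and &quot;quoted&quot;

# Typographic punctuation is left alone; only markup characters are escaped.
print(html.escape("\u201cRich\u201d punctuation"))
```

Note that with `quote=True` this particular helper also escapes the straight apostrophe (as `&#x27;`), which goes slightly beyond the four characters listed in the table.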
Negotiating Search Engines and Incompatible Devices
When working with any web document, authors have one crucial decision to make at almost every stage of development: favor the reader or the search engines?
As we’ve seen, a richer repertoire of punctuation marks will undoubtedly give visitors to your site a much better reading experience. But there’s one unfortunate downside—to some degree, it can give search engines a harder time understanding the non-typical punctuation in the document. We can live in certain hope that, in the not-too-distant future, the widespread uptake of rich punctuation will increase search engines’ understanding of a document instead of entailing a risk of decreasing it.
Unicode-incompatible devices and search engines may require an alternate version of your page, where everything is automatically mapped against ASCII/ANSI. This practice, however, is becoming more and more redundant as handheld devices and search engines smarten up. And web sites that offer syndication through Atom web feeds can easily work around the problem.
For example, when constructing the feed, the publishing tool should replace Unicode characters in the text with their closest keyboard equivalents. The same goes for mobile or hand-held versions of the document, as these devices tend not to fully support the Unicode Standard. So the hyphen (U+2010) would be replaced with the hyphen-minus (U+002D), the en dash (U+2013) with two hyphen-minuses (U+002D U+002D), the curly apostrophe (U+2019) with the apostrophe (U+0027), and so on. Once these characters are taken care of, the feed will contain only ASCII/ANSI letters and symbols.
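A publishing tool could implement this replacement with a simple lookup table. The Python sketch below is one possible approach, not a prescribed one: the function name `to_basic_punctuation` is hypothetical, the first three mappings are the ones named above, and the remaining entries are assumptions about what "and so on" might reasonably cover.

```python
# Translation table from rich punctuation to keyboard equivalents.
ASCII_FALLBACKS = str.maketrans({
    "\u2010": "-",    # hyphen -> hyphen-minus
    "\u2013": "--",   # en dash -> two hyphen-minuses
    "\u2019": "'",    # curly apostrophe -> apostrophe
    # The entries below are assumed extensions, not taken from the article.
    "\u2018": "'",    # left single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def to_basic_punctuation(text):
    """Return text with rich punctuation mapped to basic keyboard characters."""
    return text.translate(ASCII_FALLBACKS)


sample = "It\u2019s a 5\u201310 page read, \u201cmore or less\u201d\u2026"
print(to_basic_punctuation(sample))
# prints: It's a 5--10 page read, "more or less"...
```

Characters that are not in the table pass through unchanged, so a real tool would need either a fuller table or a final check that the output encodes cleanly as ASCII.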
Offer the web feed simultaneously with the web version, but with only basic punctuation. Then search engines will find the <link/> between the two published formats and treat them as the one document. As a side effect, the click-through rate from web feeds may also increase as readers click through to the web version of the document for a better reading experience.
Conclusion
Web designers the world over have run up against the issue of characters in online content—generally letting straight quotes hold sway, detracting from fine web design. As we’ve seen though, in this day and age there actually isn’t very much to prevent us from ensuring our typography enhances our sleek content presentation in the best way it can. It’s time for a turning point in the way designers present our text, now that we have a choice in the matter; in fact, you’ll notice that we at SitePoint have finally put this preaching into practice. Three weeks ago, we took the plunge and embraced the offerings of today’s technology for our articles—and we’d never go back!
The first problem designers run into when using rich punctuation is the limitations of the input method: the keyboard. But, as we’ve seen, there are several decent and quite simple ways to circumvent this limitation. The second problem that trips us up is the document delivery. It is necessary to carefully declare the encoding used in the document through minor server-side changes.
With these obstacles out of the way, there is no reason to hold back on the punctuation—so embrace Unicode and give your readers a richer typographic experience!