The Search Engine Marketing Kit - Chapter 1
Indexing: How Content is Analyzed
After the content of a Web page (or HTML representation of a non-HTML document) has been stored in the database, the indexer takes over, breaking down the page piece by piece, and creating a mathematical representation of it in the search engine's index.
The complexity of this process, the extreme variations between search engines, and the fact that this part of the process is a closely guarded secret make a comprehensive explanation impossible. (A search engine's algorithm must be kept secret to prevent optimizers from unfairly manipulating search results and, of course, to prevent competitors from "borrowing" useful ideas.) However, we can speak about the process in general terms that will apply to all crawling search engines.
What Indexing Means in Practice
When a search engine's indexer analyzes a document, it stores each word that occurs in the document as a hit in one of the indexes. The indexes may be sorted alphabetically, or they may be designed in a way that allows more commonly used words to be accessed more quickly.
The format of the index is very much like a table. Each row in the table records the word, the ID of the URL at which it appeared, its position within the document, and other information which will vary from one search engine to the next. This additional information may include such things as the structural element in which the word appeared (page title, heading, hyperlink etc.) and the formatting applied (bold, italic etc.).
Table 1.1 shows a hypothetical (and simplified) search engine index entry for an imaginary (and very boring) document. The page's title is "Hello, World!" The document itself contains the same words in a large heading, followed by the words "Greetings, everyone!" as the first paragraph of text.
Each index contains hits for different groups of words. The hypothetical index entry for the document will therefore be spread across multiple indexes. The only place in which the entire document might remain intact is in the repository of a search engine that retains a full cached copy of each page, as Google does.
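The row format described above can be sketched in a few lines of Python. This is only an illustration: the names Hit and index_document, the URL ID of 1, the element labels, and the choice to lowercase words are all assumptions made for the example, and a real search engine stores its hits in a far more compact form.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One hypothetical index row: where a word occurred."""
    url_id: int     # ID of the URL at which the word appeared
    position: int   # word offset within the document
    element: str    # structural element, e.g. "title" or "heading"


def index_document(index, url_id, parts):
    """Add every word of a document to the index.

    `parts` is a list of (element, text) pairs in document order.
    Punctuation is dropped and words are lowercased (an assumption).
    """
    position = 0
    for element, text in parts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].append(Hit(url_id, position, element))
            position += 1


index = defaultdict(list)
index_document(index, 1, [
    ("title", "Hello, World!"),
    ("heading", "Hello, World!"),
    ("paragraph", "Greetings, everyone!"),
])

# Print one line per word, with every hit recorded for it.
for word in sorted(index):
    print(word, index[word])
```

Each word maps to a list of hits, so a lookup starts from the word and arrives at the documents, never the other way around.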
Table 1.1. A Search Engine Index Entry

The first thing that you will notice is that punctuation is not stored in the index. That's because search engines ignore punctuation—if any of them considered it at all, it would be news to me. When we talk about SEO copywriting later, you'll see the importance of this fact.
For now, you need to understand that search engines don't look at Web pages; they look at indexes that match words to documents. When we talk about the vector space model and the ranking/retrieval process later in this chapter, you may need to refer back to this table to refresh your memory.
I should note one thing before we move on. Although search engines do apply a different "weight" to words that appear in more prominent positions (such as headings), they do not necessarily attempt to store those values in the index records. When we talk about algorithm changes and term weights in just a moment, you'll understand why.
Link (URL) Discovery and Indexing
URLs that are found within documents are fed back into the scheduler for crawling. Information about the source document may have an impact on the URL's priority within the crawling queue. For example, the URLs contained in links found on an important page (such as the Yahoo! homepage) may take precedence over links that have been found on lesser pages.
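One way to picture this prioritization is a priority queue keyed on the importance of the page where each URL was found. The sketch below is hypothetical: the CrawlQueue class, the source_importance score, and the example URLs are invented for illustration, and no search engine documents its scheduler in this form.

```python
import heapq
import itertools


class CrawlQueue:
    """Hypothetical scheduler: URLs found on more important pages come out first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker keeps discovery order

    def add(self, url, source_importance):
        """Queue a discovered URL; a higher source_importance means an earlier crawl."""
        if url in self._seen:
            return  # already queued once
        self._seen.add(url)
        # heapq is a min-heap, so the importance is negated.
        heapq.heappush(self._heap, (-source_importance, next(self._order), url))

    def pop(self):
        """Return the next URL to crawl."""
        return heapq.heappop(self._heap)[2]


queue = CrawlQueue()
queue.add("http://example.com/minor-find", source_importance=0.1)
queue.add("http://example.com/portal-find", source_importance=0.9)
next_url = queue.pop()  # the URL found on the more important page
```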
When a hyperlink is found within the document that's being indexed, the words in the hyperlink are recorded in the index as a hit, just like any other word, along with the fact that the words appeared in a hyperlink.
There are two ends to every link, though. The source document (the one being indexed) links to a target document, and some search engines also take this into account when indexing a page. The words contained within a hyperlink may also be indexed as hits for the target document.
In this manner, a URL that has never been crawled can still appear in search results, because the index still contains information about that URL. This definitely applies to Google, and may apply to other search engines as well. Only Google provides enough public information about its processes for us to be sure.
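A minimal sketch of this idea follows, assuming a toy index in which each anchor-text word is recorded once for the source page and once for the target page. The function names, the "link" and "anchor" labels, and the URLs are hypothetical.

```python
from collections import defaultdict

# word -> list of (url, kind) hits. The kind is "link" for the page that
# contains the hyperlink and "anchor" for the page it points to.
index = defaultdict(list)


def index_link(source_url, target_url, anchor_text):
    """Record the anchor-text words as hits for both ends of a link."""
    for word in anchor_text.lower().split():
        index[word].append((source_url, "link"))
        index[word].append((target_url, "anchor"))


def search(word):
    """Return every URL with a hit for the word, crawled or not."""
    return sorted({url for url, _ in index.get(word.lower(), [])})


# Only page-a has been crawled; page-b is known solely from the link.
index_link("http://example.com/page-a", "http://example.com/page-b", "blue widgets")
results = search("widgets")
```

Here page-b turns up in the results even though its own content was never fetched, because the index holds hits that were credited to it by the linking page.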
Even with Google, it's unclear whether such hits are stored in the main indexes, or in a separate index of link-related hits. There are indications, for example, that the ordering and proximity of words in anchor text is a factor in determining how much the link text affects a page's overall ranking for a given search query.
Results of the Indexing Phase
At the end of the indexing phase, the search engine is capable of returning the indexed URL in search results. If the search engine makes heavy use of link analysis in its ranking algorithm, the URL may not rank well for competitive search terms, but it is at least visible in the search results.
Note: We conclude our discussion of this phase with another example from Google. As mentioned in our examination of the crawling phase, Google's site:domain search shows all known URLs for a domain, including those which are not yet crawled and indexed. In order to find all of the indexed pages from a domain, a different approach is required. Most sites will have some signature text that appears on every page, such as a copyright notice.
By combining the site:domain search with this signature text, it's possible to get a true measure of a site's index saturation, or the number of pages from the site which have been indexed. For example, the query site:example.com copyright will return all of the indexed pages from example.com, assuming that the word 'copyright' appears on all pages.
Link Analysis
The content of documents is susceptible to manipulation or optimization, so there may be a large number of Web pages that appear, at first glance, to be relevant to given keywords. Indeed, there may be millions of pages that match the searcher's query to some degree. Conversely, some highly relevant pages may not be optimized. As a result, search engines can't simply rely on the content of documents as a means by which to assess them—to do so would prevent the engines from showing the best possible search results.
There are many ways in which search engines can derive information from the linking relationships between pages, and each search engine takes a different approach to doing so. In this brief summary, we'll see how links can imply topical relationships and help search engines find the Web pages that are most important to their index. To start off, let's look at a couple of very small Web graphs to understand how a search engine might interpret the linking relationships between Web pages. A Web graph is simply a diagram of the linking relationships between a group of pages.
Figure 1.2. Simple Web graph: one page "votes" for two others.

In Figure 1.2, we see three pages (A, B, and C). A links to both B and C. This implies that the content of B and C may relate to the topic discussed on page A. It also implies that the author of page A considers B and C useful—in effect, the author of A is "voting" for B and C.
Figure 1.3. Simple Web graph: two pages "vote" for one page.

In Figure 1.3, both A and B link to page C. This implies that pages A and B cover the same topic as page C, and that the authors of A and B are voting for page C. The implied topical relationship between A and B is as strong as, or stronger than, in the previous example, because two pages that are found to link to the same resource are likely related to the same topic.
Figure 1.4. Simple Web graph: two pages interlinked; a third links to both.

In Figure 1.4, we see a more complex relationship. Pages A and B link to each other, implying that they may address the same topic. The existence of links from C to both A and B validates this interlinking and provides a strong indication that pages A and B mention the same topic.
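The three Web graphs in Figures 1.2 to 1.4 can be written down as simple dictionaries that map each page to the pages it links to. The helper functions below are hypothetical and only illustrate the kinds of questions a search engine might ask of such a graph; they are not a description of any engine's actual method.

```python
# Each graph maps a page to the pages it links to (Figures 1.2 to 1.4).
figure_1_2 = {"A": ["B", "C"], "B": [], "C": []}
figure_1_3 = {"A": ["C"], "B": ["C"], "C": []}
figure_1_4 = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}


def votes(graph):
    """Count the inbound links ("votes") received by every page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def shared_targets(graph, page_x, page_y):
    """Pages that both page_x and page_y link to."""
    return set(graph[page_x]) & set(graph[page_y])


def interlinked(graph, page_x, page_y):
    """True when the two pages link to each other."""
    return page_y in graph[page_x] and page_x in graph[page_y]


print(votes(figure_1_3))                     # C collects both votes
print(shared_targets(figure_1_3, "A", "B"))  # A and B both point at C
print(interlinked(figure_1_4, "A", "B"))     # A and B link to each other
```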
Hubs and Authorities
The first really interesting attempt to harness the linking relationships between pages was the HITS (Hypertext-Induced Topic Selection) algorithm developed by Jon Kleinberg at Cornell University. Kleinberg's great revelation was that communities on the Web tend to cluster around specific hubs and authorities.
A hub is a page that contains links to many other pages on the same topic. A good example of a hub would be a page from Yahoo! or the Open Directory which contained a list of pages about a single topic. An authority is a page to which many other pages link. A page that's listed within the appropriate category on many directories, or is otherwise well-linked within the community of related pages, is considered an authority.
Though the concept is simple, its practical implementation involves many nuances. Many attempts have been made to improve on the basic idea of HITS and, no doubt, some of these ideas are in use within search engines today. One such idea advocates that more focused hub pages are those that link to specific pages, rather than the homepage of every site, thereby implying that more editorial effort went into creating the hub page.
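The basic hub and authority idea can be sketched as a short iterative calculation: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a simplified sketch of that iteration, not the refinements used in any production engine; the graph and page names are hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Basic hub/authority iteration over a {page: [linked pages]} graph."""
    hubs = {page: 1.0 for page in graph}
    auths = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link here.
        auths = {page: 0.0 for page in graph}
        for page, targets in graph.items():
            for target in targets:
                auths[target] += hubs[page]
        # Hub: sum of the authority scores of the pages linked to.
        hubs = {page: sum(auths[t] for t in graph[page]) for page in graph}
        # Normalize both score sets so the values stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hubs, auths


graph = {
    "directory": ["site1", "site2", "site3"],
    "fan_page": ["site1", "site2"],
    "site1": [],
    "site2": [],
    "site3": [],
}
hubs, auths = hits(graph)
top_hub = max(hubs, key=hubs.get)
top_authority = max(auths, key=auths.get)
```

In this toy graph the directory page emerges as the strongest hub, while site1 and site2, which are listed by both hubs, outscore site3 as authorities.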
Google's PageRank
Because of the dominant role Google plays in today's search engine landscape, and because of the incredible amount of insight it allows into its inner workings, the PageRank algorithm that this search engine uses has taken on an almost mythical status among SEO practitioners.
Many very detailed explanations of PageRank are available on the Web. The original paper written by Google founders Larry Page (the "Page" in PageRank) and Sergey Brin is called The PageRank Citation Ranking: Bringing Order to the Web, and is available in multiple formats online.
In addition to the many papers published by Stanford and Google researchers, numerous competing (and occasionally conflicting) accounts have been prepared by SEO consultants. I list many authoritative papers and provide an explanation of PageRank in Appendix A, Resources; for now, let's briefly discuss how PageRank works without getting bogged down in mathematical detail.
The concept of PageRank is very similar to the "wandering drunk" (random walk) model employed in many areas of computer science, and to the proverbial thousands of monkeys that, given enough time at their typewriters, eventually reproduce Shakespeare's Hamlet.
To understand this concept, let's consider a random Web surfer. We'll make him male, and call him Bob. To get Bob started, we'll sit him down with the browser open at a Web page that's selected at random from our index. If there are a million pages in our index, there's a one-in-a-million chance that Bob starts at any particular one of them.
Bob's job is to pick a random link on every page he visits, and continue on to wherever that link sends him. On each page Bob visits, including the first, there's a chance that he'll get bored and ask for a different random Web page. So, when he gets bored, we select another page completely at random, and the process starts again. In Page and Brin's paper, there is a 15% chance that Bob will get bored on any page.
If we let Bob surf the pages in our index in this way for a decade or so, he will eventually visit every page. Once Bob has viewed every page at least once, we can count the number of times he's visited each page in the index. The pages that Bob has visited the most times will be allotted the highest PageRank. To put this another way, the PageRank score of a given Web page is an estimate of the probability that a random surfer would find that page if they followed the process that Bob followed.
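Bob's surfing can be simulated directly. The sketch below follows the process as described, with a 15% chance of boredom on each page. It makes one assumption the text does not cover: when Bob lands on a page with no links, he also jumps to a random page. The graph, the function name, and the step count are hypothetical, and a real search engine computes these scores mathematically rather than by simulation.

```python
import random
from collections import Counter


def random_surfer(graph, steps=100_000, boredom=0.15, seed=0):
    """Estimate how often a random surfer visits each page."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    pages = list(graph)
    visits = Counter()
    current = rng.choice(pages)  # Bob starts on a random page
    for _ in range(steps):
        visits[current] += 1
        links = graph[current]
        if not links or rng.random() < boredom:
            current = rng.choice(pages)  # bored (or stuck): jump anywhere
        else:
            current = rng.choice(links)  # otherwise follow a random link
    # Convert visit counts to the fraction of all visits.
    return {page: visits[page] / steps for page in pages}


graph = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}
scores = random_surfer(graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The page that nothing links to is reached only when Bob gets bored, so it collects the fewest visits, while the page that every other page links to collects the most.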
Now, if Bob starts surfing from a page with a very high PageRank, we can assume that any page it links to will have a high probability of being found by Bob. As such, you might think that links from a page with a high PageRank would be the most valuable.
However, this is not the case. The more links there are on a page, the less likely it is that any particular link will be chosen by a random surfer. This leads us to PageRank Truth #1:
The value of any link from a Web page is decreased proportionally for every additional link on that page.
This is why links from a pure directory like the Open Directory may actually be more valuable than links from a Web portal that happens to include a directory. In a pure directory, nearly all the PageRank attributed to the homepage flows through to the category listings.
By comparison, of the vast number of links that appear on each page of the Yahoo! site, only a certain percentage link to directory pages. Over 200 links appear on the Yahoo! homepage, most of which lead away from the directory. Even the directory pages themselves display many listings and other links.
Likewise, links from a highly selective directory are likely to be worth more than links from a less selective directory of equal size, because there will be fewer links (or listings) on each of the category pages.
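The arithmetic behind these comparisons can be sketched as follows. The sketch assumes the commonly described formulation in which a page's score is shared equally among its outbound links, scaled by the 85% chance that the surfer follows a link rather than getting bored. The function name and the sample scores are hypothetical and are not real PageRank values.

```python
def value_per_link(page_score, outbound_links, follow_probability=0.85):
    """Share of a page's score passed through each of its links."""
    if outbound_links < 1:
        raise ValueError("a page needs at least one outbound link")
    return follow_probability * page_score / outbound_links


# Two category pages with the same (made-up) score but different
# numbers of listings.
selective_listing = value_per_link(page_score=0.004, outbound_links=10)
broad_listing = value_per_link(page_score=0.004, outbound_links=50)

# A high-scoring portal page with many links versus a modest page with few.
portal_link = value_per_link(page_score=0.010, outbound_links=200)
modest_link = value_per_link(page_score=0.002, outbound_links=10)
```

Under these assumptions, a link from the modest page with ten links passes on more value than a link from the portal page, even though the portal page's own score is five times higher.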
The same logic applies to all links. If you are interested in maximizing the PageRank of the pages on your site, simply looking for high PageRank pages may not be the best approach. Link placement (i.e. which page carries the link) matters much more than the average site owner realizes. We'll have much more to say on PageRank and linking strategies later in this kit. But for now, let's turn our attention to the final piece of the link analysis puzzle.
Topics and Communities
Search engines know that many hyperlinks exist solely for the purpose of boosting the perceived popularity of a site in order to improve its ranking in the search results. PageRank is susceptible to this sort of manipulation, as are older link analysis schemes based on link popularity (a simple count of the number of links to a particular page or site).
The cutting edge of link analysis, therefore, goes into a deeper exploration of the topical relationships between Web pages. The Teoma search engine, for example, sees the Web as a set of topical communities, and looks for the most relevant and authoritative pages within a topic. This was actually the basis of the HITS research, but HITS involved the manual selection of pages. Any sort of practical implementation of a topically-driven link analysis scheme requires some sort of automated method of performing "topic distillation" on a given Web page.
Appendix A, Resources contains many references that deal with topic distillation and link analysis, including a topic-sensitive variation on PageRank, and a scheme referred to as LocalRank. LocalRank involves the calculation of an internal PageRank score within the set of pages returned by a Web search, so that the results can be rearranged on the basis of topical authority.
Link analysis and topic distillation are fascinating topics. What they mean to SEO consultants will be explained in greater detail in the next chapter.