The Search Engine Marketing Kit - Chapter 1

How Queries Are Processed

Today's search engines handle many different types of search queries. Searches run the gamut from FedEx package tracking, phone numbers, and dictionary definition lookups, to the plain old text-based searches that started it all.

Before retrieving search results, the search engine needs to interpret the user's query. Such an interpretation necessitates the extraction of any special syntax or search options that the user has invoked, such as a site-specific search, reverse phone directory lookup, or other options. Assuming that a text-based search is required (and it's not something easier, like a FedEx number), the processes that take place will differ depending on the search engine being used.
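To make the interpretation step a little more concrete, here is a minimal sketch of pulling a site-specific option out of a query before the text search runs. The `site:` operator pattern, the function name `interpret_query`, and the shape of the returned dictionary are my own assumptions for illustration; each search engine has its own syntax and its own parser.

```python
import re

# Assumed operator syntax for illustration: "site:example.com" limits the search to one site.
SITE_OPERATOR = re.compile(r"\bsite:(\S+)", re.IGNORECASE)

def interpret_query(raw_query):
    """Separate special search options from the plain search terms."""
    options = {}
    match = SITE_OPERATOR.search(raw_query)
    if match:
        options["site"] = match.group(1).lower()
        # Remove the operator so it is not treated as a search term.
        raw_query = SITE_OPERATOR.sub(" ", raw_query)
    terms = raw_query.lower().split()
    return {"options": options, "terms": terms}

parsed = interpret_query("brake repair site:Example.com")
print(parsed)
```

Only once the options have been stripped out does the remaining list of terms go forward to the text-based search.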

All the major search engines make some attempt to interpret the searcher's intent as indicated by the search terms entered. Certain words, or types of words, may give clues that can help the search engine deliver the most satisfactory results. To give you a simple example, a person who searches for "brake repair" is probably looking for information. In this case, a mix of informational and how-to sites, along with a couple of nationwide chains, would be a good set of search results for most users. By comparison, someone who searches for "brake repair Chicago" almost certainly has one thing in mind: finding someone to fix their brakes. Once searchers have made their intent clear like this, it's much easier for the search engine to help them find what they're looking for.

What should be clear to you, though, is that different search terms are interpreted in different ways by different search engines. Though semantic analysis (the art of trying to determine the meaning of the words used in a search) is just coming into its own, already it plays a significant role in the search process. It's likely that at least some elements of the algorithms used by many search engines to determine search results and rankings are already modified to address the type of search query that's performed.

Search engines also have a set of so-called stop words: extremely common words like "a", "and", and "the". Although these words are supposed to be ignored by the search engine, they do have an influence on search results. A search for the words "watching and waiting" on almost any search engine will return different results than a search for "watching waiting", even though the search engines claim to ignore the word "and". It's likely that search engines replace the stop word with a wildcard, so that any word positioned between "watching" and "waiting" could pass for a matching phrase.
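The wildcard idea is a guess about search engine behavior, not documented fact, but it is easy to sketch. In the example below, the stop word list, the `*` wildcard marker, and the function names `to_pattern` and `phrase_matches` are all assumptions made for illustration.

```python
# Tiny illustrative stop word list; real lists are longer and engine-specific.
STOP_WORDS = {"a", "and", "the"}
WILDCARD = "*"

def to_pattern(query):
    """Replace each stop word in the query with a wildcard slot."""
    return [WILDCARD if word in STOP_WORDS else word for word in query.lower().split()]

def phrase_matches(pattern, text):
    """Return True if some run of consecutive words in text fits the pattern."""
    words = text.lower().split()
    size = len(pattern)
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if all(p == WILDCARD or p == w for p, w in zip(pattern, window)):
            return True
    return False

with_stop_word = to_pattern("watching and waiting")   # ['watching', '*', 'waiting']
without_stop_word = to_pattern("watching waiting")    # ['watching', 'waiting']

print(phrase_matches(with_stop_word, "watching or waiting"))     # True
print(phrase_matches(without_stop_word, "watching or waiting"))  # False
```

Under this assumption the two queries match different sets of phrases, which would explain why they return different results.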

We will discuss how people search, how different types of searches indicate different buying modes, and what it all means to your strategy, in the next chapter.

Information Retrieval (IR) Theory

The godfather of text-based information retrieval was Gerard Salton (1927–1995), a professor at Cornell University. Salton's group at Cornell developed the SMART information retrieval system. This system pioneered the vector space model that's now used in some form or other by all crawling search engines.

The vector space model is conceptually simple: take the contents of a set of documents, create an index of every word occurrence in every document, and combine this information to create a mathematical representation of each document within a multi-dimensional vector space. Once you've done that, all you need to do is create a vector that represents your search query, and present the documents that are closest to it in the vector space.

Okay... maybe it's not so simple after all!

If you think the vector space model sounds complicated, you're not alone. And I didn't even explain how it takes the dot product of the query and document vectors, divides it by the product of their magnitudes to get the cosine of the angle between them, and compares the cosines for the different document vectors in order to find the most relevant documents for the query!
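For readers who would rather see the comparison than read about it, here is a minimal sketch of cosine similarity on three made-up document vectors. The vectors, the document names, and the function name are invented for illustration; a real engine works with far larger, sparser vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)

# Each position holds the weight of one term; the values are illustrative only.
query = [1.0, 1.0, 0.0]
documents = {
    "doc_a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 0.0, 1.0],  # shares one term with the query
    "doc_c": [0.0, 0.0, 3.0],  # shares nothing with the query
}

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

Note that doc_a scores highest even though its weights are twice the query's: cosine similarity compares direction, not length.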

The good news is that an elaborate mathematical explanation of this concept is not necessarily important to search engine marketing. That said, there are a few things you need to understand about the field of information retrieval theory and the vector space model:

  • Words that appear more frequently within the collection of documents being searched are seen as less important in the retrieval of documents. Conversely, words that appear less frequently within the collection of documents are deemed more important in the retrieval of documents. So, if you search for defenestration policy guidelines, the less common word defenestration will have a greater influence on the retrieval process.
  • The proximity and ordering of words within a document is significant. If you searched for red monkey shoes, a document containing that exact phrase would be considered more relevant than a document that contained the words in a different sequence or in close proximity, either of which would, in turn, be seen as more relevant than a document that merely contained all three words.
  • Search engines make use of the structural and presentational elements of hypertext. Words that appear within key structural elements (page title, headings, hyperlinks, etc.), to which significant formatting has been applied (bold, italic, large type), or which appear near the top of the document, are given more weight than words that appear elsewhere within that document. In other words, occurrences of a given word can be weighted differently depending on where and how they appear in a document.
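The first of these points can be illustrated with inverse document frequency, one common textbook way of scoring rare words above common ones. The four-document collection below is made up, and the formula log(N / df) is just one standard variant; I am not claiming that any particular search engine uses exactly this calculation.

```python
import math

# A made-up collection of four tiny "documents".
collection = [
    "company policy guidelines for staff",
    "travel policy guidelines",
    "defenestration policy guidelines",
    "guidelines for new staff",
]

def inverse_document_frequency(term, documents):
    """log(N / df): the fewer documents contain the term, the higher the score."""
    containing = sum(1 for doc in documents if term in doc.split())
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing)

weights = {
    term: inverse_document_frequency(term, collection)
    for term in "defenestration policy guidelines".split()
}
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

In this collection, defenestration appears in one document, policy in three, and guidelines in all four, so defenestration gets the largest weight and guidelines gets none at all.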

The Vector Space Model in Action

Just in case you're interested in digging deeper into the vector space, I thought I'd take a moment to explore it in more detail here.

As we've already seen, the search engine is not looking at documents; it's using an inverted index that maps words to their specific appearances within indexed documents. This is important, so do whatever you have to do to lock this concept into your brain.
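If a picture helps to lock it in, here is a minimal sketch of an inverted index that records the position of every word in every document. The page names, the sample text, and the helper `contains_phrase` are assumptions for illustration; production indexes also store formatting, structure, and much more.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to {document id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def contains_phrase(index, doc_id, phrase):
    """Check for an exact phrase using only the stored positions."""
    words = phrase.lower().split()
    starts = index.get(words[0], {}).get(doc_id, [])
    return any(
        all(start + offset in index.get(word, {}).get(doc_id, [])
            for offset, word in enumerate(words))
        for start in starts
    )

pages = {
    "page1": "red monkey shoes",
    "page2": "shoes for a red monkey",
}
index = build_inverted_index(pages)

print(dict(index["monkey"]))                              # {'page1': [1], 'page2': [4]}
print(contains_phrase(index, "page1", "red monkey shoes"))  # True
print(contains_phrase(index, "page2", "red monkey shoes"))  # False
```

Because positions are stored, the engine can answer phrase and proximity questions without ever going back to the original documents.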

When you perform a search, the ranking/retrieval component of the search engine constructs a set of vectors for matching documents within the index, and a separate vector for the search query itself. Don't get hung up on the term vector—a vector is just a collection of variables that relate to a specific item.

Each occurrence of a word within a document is weighted differently depending on a number of factors, including where the word occurs (e.g. is it a heading or bold text?). If these considerations are factored into the indexing process, the search engine may save some time during the search, but it will also forfeit the flexibility to change elements' weights depending on the specific search query.

The weighted occurrences of terms within a document are combined with the overall frequency with which the word occurs in the total collection of documents, and within the document itself, to produce a set of "term weights" for the document. The collection of term weights for all the words in the query represents that document's vector.
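Here is one way the pieces could fit together, as a rough sketch only. The location multipliers in `FIELD_WEIGHTS`, the document frequencies, and the collection size are all invented, and the formula (weighted term frequency multiplied by log(N / df)) is a common textbook choice, not any search engine's published method.

```python
import math

# Hypothetical multipliers for where a word occurs; real engines do not publish theirs.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def term_weight(occurrences, docs_with_term, total_docs):
    """Weighted occurrences within the document, scaled by rarity in the collection."""
    weighted_frequency = sum(FIELD_WEIGHTS[field] for field in occurrences)
    rarity = math.log(total_docs / docs_with_term)
    return weighted_frequency * rarity

def document_vector(query_terms, doc_occurrences, doc_frequency, total_docs):
    """One term weight per query term, in query order."""
    return [
        term_weight(doc_occurrences.get(term, []), doc_frequency[term], total_docs)
        for term in query_terms
    ]

query_terms = ["red", "monkey", "shoes"]
# Where each query term occurs in one example document (made-up data).
doc_occurrences = {"red": ["title", "body"], "shoes": ["body"]}
# How many documents in a 1,000-document collection contain each term (made-up data).
doc_frequency = {"red": 500, "monkey": 50, "shoes": 200}

vector = document_vector(query_terms, doc_occurrences, doc_frequency, total_docs=1000)
print(vector)
```

The word monkey never appears in this document, so its weight is zero no matter how rare the word is across the collection.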

The query vector is an idealized set of term weights for the words in the query. To find the most relevant documents in its index, the search engine applies a little math that identifies the closest document matches in vector space. I won't even attempt to explain the math this involves—see Dr Garcia's Website for a detailed explanation.

Ranking and Retrieval Strategies

At some point, the search engine has to return results to the searcher. All the work that's led up to this point has given the search engine a certain understanding of the user's query, and a great deal of information about the contents of each page.

There's one misconception about search result delivery that I should clear up now. A typical SERP will include some message like, "Results 1-10 of 843,000." What that means is that there were a total of approximately 843,000 pages that might have been relevant to the query. Search engines don't really examine all 843,000 in order to deliver search results.

In reality, none of the major search engines delivers more than 1,000 total matches, presumably because users would get tired before they actually clicked through to the hundredth page of search results. And, because search engines don't have to deliver more than 1,000 matches, they can use a pre-selection process to winnow the 843,000 candidates down to a smaller number of returned results. A page has a chance of appearing in the search results only if it ranks among the top 1,000 candidates for at least one factor in the selection process.

The factors involved in that selection process are very dependent on the search query itself, and this is one of the instances in which less common words (such as defenestration) are likely to have a greater influence than commonly used words (like free or cheap).

Once the pre-selection is made, the search engine applies its ranking algorithm to the pages that made the cut, and presents a selection to the user as search results.
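The two-stage process can be sketched in a few lines. Everything here is assumed for illustration: the factor names, the scores, the equal weights, and a limit of 2 standing in for 1,000. I am also assuming that a page survives pre-selection if it is near the top for at least one factor.

```python
import heapq

def preselect(pages, factors, limit):
    """Keep every page that is in the top `limit` for at least one factor."""
    shortlisted = set()
    for factor in factors:
        top = heapq.nlargest(limit, pages, key=lambda name: pages[name][factor])
        shortlisted.update(top)
    return shortlisted

def rank(pages, shortlisted, weights):
    """Apply the full (here, very simple) ranking formula to the shortlist only."""
    def score(name):
        return sum(weights[factor] * pages[name][factor] for factor in weights)
    return sorted(shortlisted, key=score, reverse=True)

# Made-up per-factor scores for four candidate pages.
pages = {
    "a": {"content": 0.9, "links": 0.2},
    "b": {"content": 0.2, "links": 0.8},
    "c": {"content": 0.3, "links": 0.3},
    "d": {"content": 0.8, "links": 0.7},
}

shortlist = preselect(pages, factors=["content", "links"], limit=2)
results = rank(pages, shortlist, weights={"content": 0.5, "links": 0.5})
print(results)  # ['d', 'a', 'b']
```

Page c is never scored by the ranking step: it was not near the top for either factor, so it dropped out during pre-selection.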

Query-Dependent Ranking Strategies

The type of search query that's entered can affect the way in which the search engine approaches the problem of delivering results. Every search engine has its own unique algorithm, but the results will always come from some combination of on-page factors, including content, formatting and structure, and off-page factors, such as link analysis and topic distillation.

For shorter, more generic queries, the initial result set will be very large, and there's a very good chance that the vector space model based on page content will fail to deliver satisfactory results. With such queries, ranking factors based on link analysis must play a significant role in the pre-selection process, and in determining the final results.

For longer, more specific queries, the initial result set will be smaller, so content-based strategies such as the vector space model may be more appropriate. This doesn't mean that all search engines will treat these queries differently, but it's a definite possibility.
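As a purely hypothetical sketch of what query-dependent ranking could look like, the function below shifts weight toward link analysis for short queries and toward page content for longer ones. The two-term cutoff and the weight values are invented; no search engine has confirmed working this way.

```python
def blend_weights(query, short_query_max_terms=2):
    """Pick content and link weights based on how many terms the query has."""
    term_count = len(query.split())
    if term_count <= short_query_max_terms:
        # Short, generic query: lean on link analysis.
        return {"content": 0.4, "links": 0.6}
    # Longer, more specific query: lean on page content.
    return {"content": 0.7, "links": 0.3}

def combined_score(query, content_score, link_score):
    """Blend the two scores using the weights chosen for this query."""
    weights = blend_weights(query)
    return weights["content"] * content_score + weights["links"] * link_score

# The same page, with strong content and weak links, under two different queries.
print(combined_score("barber shop", content_score=0.9, link_score=0.1))
print(combined_score("barber shop quartet sheet music", content_score=0.9, link_score=0.1))
```

The same page scores lower for the short query than for the long one, because its weak link profile counts for more when the query is generic.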

Does the Topic Come into Play?

Although topical factors are definitely put to use by some search engines (notably Teoma), the extent of their impact is unknown. One of the difficulties search engines face in applying topic distillation and topical link analysis is that these algorithms need an idea of the query's topic in order to work.

For very long search queries, this is probably fairly easy to determine, but the vector space model already performs very well with long queries. In this case, a topical algorithm may be overkill, and may even lead to results that are less relevant to the specific query than they are to a related topic.

For very short queries, with which the vector space model needs help, it can be very difficult to determine an appropriate topic. My favorite example of this type of query is barber shop. This might be a place to get a haircut, but it's also the title of a popular comedy film series, and a form of a cappella music performed by four men wearing striped shirts.

Because the importance of topical algorithms is uncertain, much of our discussion of topical factors may be less than completely relevant today. However, search engine optimization is very much a long-term game, and the trend towards topic distillation and topical link analysis is too strong to ignore.

Call it future-proofing if you like, but throughout this kit I will encourage you to adopt topical strategies in your content, copywriting, and link strategies.

Other Considerations

In addition to the basic processes we've just described, which are responsible for creating search results and rankings, search engines have a number of parallel activities that are worth a quick look.

Nearly all search engines undertake some form of automated monitoring of search quality. Google and Yahoo! both use a tracking link that allows them to know which of the search listings has been clicked. If a highly popular search phrase does not generate an acceptable number of clicks for the top-ranked pages, the search engine may consider adjusting its algorithm to deliver more satisfactory search results. Note, though, that this is not the same as Direct Hit technology, in which user clicks directly influence search results.

The search quality team, in addition to seeking out areas for improvement in search results, is responsible for the review and removal of sites that attempt to "spam" or deceive the search engine. For the most part, search engines can't act on individual spam reports, but must instead identify the techniques being used by spammers. When a new technique is identified, the software engineers attempt to find an automated means of detecting and filtering that form of spam from the search results.

Judging from the countless panic-stricken site owners who post every day to online discussion forums, many folks assume that a "spam penalty" is the most common reason for a site or page to drop out of the search results. In fact, such penalties are extremely uncommon and, in most cases, the explanation is far more mundane.

Any Website operator whose pages suddenly disappear from a search engine will be upset, but the problem almost always lies on their side of the equation. As mentioned above, not all Websites are ready to answer when the spider calls, and this can cause the search engines to stop crawling for a time. Server and DNS errors are the most common reasons for pages to be removed from a search engine's index.

In addition to the automated processes that may remove pages, search engines must comply with copyright laws such as the Digital Millennium Copyright Act (DMCA), which applies in the US. DMCA notifications represent a significant burden for search engines and other online service providers.

Unlike Web hosting companies, search engines usually do not have the means to contact the site owner quickly in order to allow for an appeal, and pages may be removed on the grounds of copyright infringement without the site owner ever becoming aware of the action. When Google deletes a page from its index for this reason, a link may be displayed on some search results to indicate that the page has been removed.

All of the major search engines must deal with a high volume of search requests from users, as well as peak hour demands that can easily be five times higher than their quiet times. In order to deliver search results in a hurry, no matter what the traffic levels, today's search engines use load balancing techniques across multiple data centers located strategically around the world.

It's not easy to run a search engine, and that fact goes a long way toward explaining why there are so few of them in the market today.

What Search Engines Want

One of the keys to developing a long-term search engine strategy for any Website is an understanding of what search engines want. Repeat visitors drive revenue, so the major goal of any search engine is to keep its users happy. A few lessons from the past can help shed a little light on the future.

Freshness is critical. The once mighty AltaVista search engine (now a Yahoo! property) lost users to Google and the Inktomi-driven search portals for one major reason: it stopped crawling the Web aggressively, and failed to update the index on a regular basis. The lack of freshness in the AltaVista search results meant that many SERPs contained mostly broken links. Don't expect any search engine operating today to make the same mistake.
