
The Search Engine Marketing Kit - Chapter 1


Search Engine Marketing Defined

Throughout this kit, I'll use search engine marketing (SEM) to describe many different tasks. We'll talk about this concept a lot, so it will be helpful to have a working definition. For the purposes of these discussions, we'll define search engine marketing as follows:

Search engine marketing is any legal activity intended to bring traffic from a search portal to another Website.

The term search engine marketing, therefore, covers a lot of ground. Wherever people search the Web, whatever they search for, and wherever the search results come from—if you're trying to reach out to target visitors, you're undertaking search engine marketing. The goal of SEM is to increase the levels of high-quality, targeted traffic to a Website. In this kit, we'll focus on the two primary disciplines of SEM, which are:

  • Search Engine Optimization (SEO): The function of SEO is to improve a Website's position within the organic search results for specific search terms, and to increase the overall traffic the site garners from crawler-based search engines. This is accomplished through a combination of on-page content and off-page promotion (such as directory submissions).
  • Pay-Per-Click Advertising (PPC): PPC involves the management of keyword-targeted advertising campaigns through one or more PPC service providers, such as Google's AdWords, or Overture from Yahoo!. The advertiser's goal is to profitably increase the amount of targeted traffic that his or her Website receives from search portals.

In addition to these two major disciplines, there are other aspects of search engine marketing that we'll discuss to a lesser degree, including:

  • Contextual advertising, which is offered by many PPC service providers. Contextual advertising delivers targeted advertising based on the content of each individual Web page that carries an ad. Advertisers who have used PPC to target people searching on the term fishing can also have their ads distributed across a great many Websites on which fishing is discussed. This is a fast-growing market, and one that's sure to become a very significant part of SEM over time.
  • Directory submission, which involves the submission of Websites to general-purpose and vertical (topic-specific) directories, or vortals. We will discuss this mainly in the context of SEO, but many directories (both general-purpose and vertical) provide search-driven traffic to the Websites they list. Many operate on a paid advertising or PPC basis. As searchable business directories like Verizon's SuperPages and the already established Business.com grow, so too will this area of search engine marketing.

Search engine marketing is a fast-growing and rapidly changing field. Before we get too far ahead of ourselves, though, let's take a close look at where organic search results come from: the crawling search engines.

The Crawling Search Engines

In this discussion, we'll explore the major components of a crawling search engine and see how they work. The typical Web user assumes that when they search, the search engine actually goes out onto the Web to look around. In fact, the job of searching the Web is vastly more complex than that, requiring massive amounts of hardware, software, and bandwidth.

To give you an idea of just how much hardware it takes to run a large-scale, modern search engine, here's a staggering figure: Google runs what is believed to be the world's largest Linux server cluster, with over 10,000 servers at present, and more being added all the time (it was "only" 4,000 in June, 2000).

Searching a small collection of well-structured documents, such as scientific research papers, is difficult enough, but that task is relatively easy compared to searching the Web. The Web is massive and mobile, consisting of billions of documents in over 100 languages, many of which change or disappear on a daily basis. To make matters worse, there is very little consistency in terms of how information is organized and presented on the Web.

Major Tasks Handled by Search Engines

There are five major tasks that each crawling search engine must handle, and significant computing resources are dedicated to each. These tasks are:

  1. Finding Web pages and downloading their contents.
    The bulk of this task is handled by two components: the crawler and the scheduler. The crawler's job is to interact with Web servers to download Web pages and/or other content. The scheduler determines which URLs will be crawled, in what order, and by which crawler. Large crawling search engines are likely to have multiple types of crawlers and schedulers, each assigned to different tasks.
  2. Storing the contents of Web documents and extracting the textual content.
    The primary components at this stage are the database/repository and parser modules. The database/repository receives the content of each URL from the crawlers, then stores it. The parser modules analyze the stored documents to extract information about the text content and hyperlinks within. Depending on the search engine, there may be multiple parser modules to handle different types of files, including HTML, PDF, Flash, Microsoft Word, and so on.
  3. Analyzing and indexing the content of documents.
    This is handled by the document indexer. The text content is analyzed by the indexer and stored in a set of databases called indexes. For simplicity's sake, I'll refer to these indexes as simply "the index." Included in the indexing process is the preliminary analysis of hyperlinks within the documents, feeding URLs back into the scheduler and building a separate index of links. The main focus of this phase is the on-page content of Web documents.
  4. Link analysis, to uncover the relationships between Web pages.
    This is the work of the link analyzer component. All of the major crawling search engines analyze the linking relationships between documents to help them determine the most relevant results for a given search query. Each search engine handles this differently, but they all have the same basic goals in mind. There may be more than one type of link analyzer in use, depending on the search engine.
  5. Query processing and the ranking of Web pages to deliver search results.
    The query processor and ranking/retrieval module are responsible for this important task. The query processor must determine what type of search the user is conducting, including any specialized operations that the user has invoked. The ranking/retrieval module determines the ranking order of the matching documents, retrieves information about those documents, and returns the results for presentation to the user.
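
To make the division of labor more concrete, here is a minimal sketch of tasks 2, 3, and 5 in Python. It is an illustration only, not how any real search engine is built: the names PageParser, build_index, and search are hypothetical, the "repository" is just a dictionary of URL to HTML, and the ranking simply counts how often the query terms appear on each page. Link analysis (task 4) is left out, although the sketch does collect the outgoing links it would need.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Task 2: pull text and hyperlinks out of a stored HTML document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Crude tokenizer: lowercase and split on whitespace only.
        self.words.extend(data.lower().split())


def build_index(repository):
    """Task 3: map each word to the URLs that contain it, with counts."""
    index = defaultdict(dict)   # word -> {url: occurrences}
    link_index = {}             # url -> outgoing links (raw material for task 4)
    for url, html in repository.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word][url] = index[word].get(url, 0) + 1
        link_index[url] = parser.links
    return index, link_index


def search(index, query):
    """Task 5: return URLs containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matches &= set(index.get(term, {}))

    def score(url):
        return sum(index[term][url] for term in terms)

    return sorted(matches, key=lambda url: (-score(url), url))
```

Called with a two-page repository, search(index, "fishing") would list the page that mentions fishing most often first. A production engine replaces every piece of this with something far more sophisticated, but the flow from stored document to index to ranked results is the same.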

The Crawling Phase: How Spiders Work

As mentioned above, one of the largest jobs of a crawling search engine is to find Web documents, download them, and store them for further analysis. To simplify matters, we've combined the work of tasks 1 and 2 above into a single activity that we'll refer to as the crawling phase.

Every crawling search engine sets different priorities for this phase of the process, depending on its resources and business relationships, and on what it's trying to deliver to its users. All search engines, however, must tackle the same set of problems.

How Search Engines Find Documents

Every document on the Web is associated with a URL (Uniform Resource Locator). In this context, we will use the terms "document" and "URL" interchangeably. This is an oversimplification, as some URLs return different documents to the user depending on such factors as their location, browser type, and form input, but this terminology suits our purposes for now.

Because a single URL can return more than one document, finding every document on the Web would mean more than finding every URL on the Web. For this reason, search engines do not currently attempt to locate every possible unique document, although research is always underway in this area. Instead, crawling search engines focus their attention on unique URLs; although some dynamic sites may display different content at the same URL (via form inputs or other dynamic variables), search engines will see that URL as a single page.

The typical crawling search engine uses three main resources to build a list of URLs to crawl. Not all search engines use all of these:

  • Hyperlinks on existing Web pages: The bulk of the URLs found in the databases of most crawling search engines consists of links found on Web pages that the spider has already crawled. Finding a link to a document on one page implies that someone found that link important enough to add it to their page.
  • Submitted URLs: All the crawling search engines have some sort of process that allows users or Website owners to submit URLs to be crawled. In the past, all search engines offered a free manual submission process, but now, many accept only paid submissions. Google is a notable exception, with no apparent plans to stop accepting free submissions, although there is great doubt as to whether submitting actually does anything.
  • XML data feeds: Paid inclusion programs, such as the Yahoo! Site Match system, include trusted feed programs that allow sites to submit XML-based content summaries for crawling and inclusion. As the Semantic Web begins to emerge, and more sites begin to offer RSS (RDF Site Summary) news feed files, some search engines have begun to read these files in order to find fresh content.
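
Whatever the source, the URLs end up in one list of pages to crawl. The sketch below shows one simple way that list might be kept, assuming a hypothetical Frontier class that resolves relative links, drops #fragment suffixes, and ignores URLs it has already seen. The source labels ("link", "submitted", "feed") are our own invention for illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


class Frontier:
    """A de-duplicated, first-in first-out list of URLs waiting to be crawled."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url, source, base=None):
        """Queue a URL; return False if it was already known."""
        # Links found on a page may be relative to that page's own URL.
        absolute = urljoin(base, url) if base else url
        # A fragment points inside a page, so it does not name a new URL.
        absolute, _fragment = urldefrag(absolute)
        if absolute in self._seen:
            return False
        self._seen.add(absolute)
        self._queue.append((absolute, source))
        return True

    def next_url(self):
        """Return the next (url, source) pair, or None when the list is empty."""
        return self._queue.popleft() if self._queue else None
```

For example, a link to page2.html#top found on http://example.com/a/index.html and a later manual submission of http://example.com/a/page2.html would be treated as the same URL and queued only once.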

Search engines run multiple crawler programs, and each crawler program (or spider) receives instructions from the scheduler about which URL (or set of URLs) to fetch next. We will see how search engines manage the scheduling process shortly, but first, let's take a look at how the search engine's crawler program works.

The Robot Exclusion Protocol

The first search spiders developed a very bad reputation in a hurry. Web servers in 1993 and 1994 were not as powerful as they are today, and an aggressive spider could bring an underpowered Web server to a crashing halt, or use up the server's limited bandwidth, by fetching pages one after another.

Clearly, rules were needed to control this new type of automated user, and they have developed over time. Spiders are supposed to fetch no more than one document per minute (a rate that's probably much slower than necessary) from a given Web host, and they're expected to obey the Robot Exclusion Protocol.
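
A per-host delay like this is easy to picture in code. The following is a minimal sketch, assuming a hypothetical PolitenessTimer class and a 60-second default taken from the one-document-per-minute guideline above. Times are passed in as plain numbers of seconds so that the logic is easy to follow and test.

```python
class PolitenessTimer:
    """Track when each Web host was last fetched from."""

    def __init__(self, min_delay_seconds=60.0):
        self.min_delay = min_delay_seconds
        self._last_fetch = {}   # host -> time of the most recent fetch

    def seconds_until_allowed(self, host, now):
        """How long the spider should still wait before fetching from host."""
        last = self._last_fetch.get(host)
        if last is None:
            return 0.0          # never fetched from this host before
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, host, now):
        """Note that a document was just fetched from host."""
        self._last_fetch[host] = now
```

If a spider records a fetch from example.com at time 100, then asks again at time 130, it is told to wait another 30 seconds. Requests to other hosts are not held up.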

In a nutshell, the Robot Exclusion Protocol allows Website operators to place into the root directory of their Web server a text file named robots.txt that identifies any URLs to which search spiders are denied access. We'll address the format of this file later; the important point here is that spiders will first attempt to read the robots.txt file from a Website before they access any other resources.

When a spider is assigned to fetch a URL from a Website, it reads the robots.txt file to determine whether it's permitted to fetch that URL. Assuming that it's permitted access by robots.txt, the crawler will make a request to the Web server for that URL. If no robots.txt file is present, the spider will behave as if it has been granted permission to fetch any URL on the site.
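
Python's standard library happens to include a robots.txt parser, which makes this check easy to sketch. In the example below, load_rules and may_fetch are hypothetical helper names, "ExampleSpider" is a made-up user agent, and the robots.txt text is passed in as a string rather than downloaded. Passing None stands in for a site that has no robots.txt file.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text; None means the site has no such file."""
    parser = RobotFileParser()
    # An empty rule set denies nothing, so every URL is permitted.
    parser.parse((robots_txt or "").splitlines())
    return parser


def may_fetch(rules, user_agent, url):
    """True if the rules permit this user agent to request the URL."""
    return rules.can_fetch(user_agent, url)


example_robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = load_rules(example_robots_txt)
blocked = may_fetch(rules, "ExampleSpider", "http://example.com/private/report.html")
allowed = may_fetch(rules, "ExampleSpider", "http://example.com/index.html")
```

Here blocked is False and allowed is True, because only URLs under /private/ are denied.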

There are no specific rules about how often a spider must re-read robots.txt, and each search engine implements this differently, but it is considered poor behavior for a search engine to rely on a cached copy of the robots.txt file without confirming that it's still valid. In order to save resources, schedulers can assign the crawler program a set of URLs from the same site, to be fetched in sequence, before it moves on to another site. This allows the crawler to check robots.txt once and fetch multiple pages in a single session.
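
Grouping URLs by site is simple to illustrate. This sketch assumes a hypothetical batch_by_host function that a scheduler might use to turn a flat list of URLs into per-site batches, so that each batch needs only one robots.txt check.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def batch_by_host(urls):
    """Group URLs by host name, keeping their original order within each host."""
    batches = defaultdict(list)
    for url in urls:
        # Host names are case-insensitive, so normalize before grouping.
        host = urlsplit(url).netloc.lower()
        batches[host].append(url)
    return dict(batches)
```

Given two URLs on example.com and one on example.org, the function returns two batches, and the spider can fetch both example.com pages in a single session.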

What Happens in a Crawling Session?

For the sake of clarity, let's walk through a typical crawling session between a spider and a Website. In this particular scenario, we'll assume that everything works perfectly, so the spider doesn't have to deal with any unusual problems.

Let's say that the spider has a URL it would like to fetch from our Website, and that this URL has been fetched before. The scheduler will supply the spider with the URL, along with the date and time of the most recent version that has been fetched. It will also supply the date and time from the most recent version of robots.txt that has been fetched from this site.

The communication between a user agent (such as your Web browser or our hypothetical spider) and a Web server is conducted via HTTP (Hypertext Transfer Protocol). The user agent sends requests, the server sends responses, and this communication goes back and forth.
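
One plausible use for the date and time supplied by the scheduler is an HTTP conditional request: the spider asks for the document only if it has changed since the last fetch. The sketch below builds such a request as text and interprets the two most common replies. The function names and the "ExampleSpider/1.0" user agent are hypothetical, and we are assuming, not asserting, that a given search engine's spider works this way. The If-Modified-Since header and the 200 and 304 status codes are standard HTTP.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_get(path, host, user_agent, last_fetched):
    """Return the text of an HTTP/1.1 GET that asks only for newer content."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        # last_fetched must be a UTC datetime; HTTP dates are always in GMT.
        f"If-Modified-Since: {format_datetime(last_fetched, usegmt=True)}",
        "Connection: close",
        "",   # a blank line ends the request headers
        "",
    ]
    return "\r\n".join(lines)


def document_changed(status_code):
    """Interpret the server's reply to a conditional GET."""
    if status_code == 304:      # Not Modified: the stored copy is current
        return False
    if status_code == 200:      # OK: a newer document follows in the response
        return True
    raise ValueError(f"status {status_code} is not handled in this sketch")


request = build_conditional_get(
    "/index.html",
    "example.com",
    "ExampleSpider/1.0",
    datetime(2004, 6, 1, 12, 0, 0, tzinfo=timezone.utc),
)
```

The resulting request carries the header If-Modified-Since: Tue, 01 Jun 2004 12:00:00 GMT. A 304 reply lets the spider skip the download entirely, which saves bandwidth for both the search engine and the Website.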

Once the document has been downloaded from the Web server, the crawler's job is nearly done. It hands the document off to the database/repository module, and informs the scheduler that it has finished its task. The scheduler will respond with another task, and it's back to work for the spider.
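
The whole hand-off can be summarized in a few lines. In this sketch, run_spider is a hypothetical name, the scheduler is reduced to a simple queue of URLs, the repository is a dictionary, and the fetch function is passed in so that the loop can be tried without touching the network.

```python
from collections import deque


def run_spider(scheduler_queue, fetch, repository):
    """Fetch each assigned URL, store the document, and report completion."""
    completed = []
    while scheduler_queue:
        url = scheduler_queue.popleft()   # the scheduler hands over a task
        document = fetch(url)             # download from the Web server
        repository[url] = document        # hand off to the database/repository
        completed.append(url)             # tell the scheduler the task is done
    return completed
```

A real spider would also apply the robots.txt and politeness checks described earlier before each fetch, and would have to cope with errors, redirects, and timeouts.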
