The Search Engine Marketing Kit - Chapter 1
Practical Aspects of Crawling
If only things could always be as simple as our hypothetical session above! In reality, there are a tremendous number of practical problems that must be overcome in the day-to-day operations of a crawling search engine.
Dealing with DNS
The first problem that crawlers have to overcome lies in the domain name system that maps domain names to numeric addresses on the Internet. The name servers for each top-level domain, or TLD (e.g. .com or .net), keep records of the domain name server (DNS server) that handles the addressing for each second-level domain name (e.g. example.com).
Thousands of other name servers across the Internet look up DNS records through these TLD name servers and hold on to the answers for a period of time. When the DNS server for a domain name changes, this change is recorded by the domain name registrar, and given to the name servers for the TLD.
Unfortunately, this change is not reflected immediately in all name servers all over the world. In fact, it can easily take 48–72 hours for the change to propagate from one name server to the next, until the entire Internet is able to recognize the change.
A search engine spider, like any other user, must rely on the DNS in order to find the resources that it's been sent to fetch. Although the major search engines all have reasonably fast updates to their DNS records, when DNS servers are changed, it's possible that a spider will be sent out to fetch a page using the wrong DNS server address. When this happens, there are three possibilities:
- The DNS server from which the spider requests the site's Web server address no longer has a record of the domain name supplied. In this case, the spider will probably hand the URL back to the scheduler, to be tried again later.
- The DNS server does have a record for the domain name, and dutifully gives the spider an address for the wrong Web server. In this case, the spider may end up fetching the wrong page, or no page at all. It may also receive an error status code.
- Even though it's no longer the authoritative name server for the supplied domain name, the DNS server still provides the spider the correct address for the Web server. In this case, the spider will probably fetch the right page.
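The three possibilities above can be sketched in a few lines of Python. This is only an illustration of the decision the spider faces, not a description of how any particular search engine works; the names `Outcome` and `resolve_for_crawl`, and the idea of passing in the `resolver` as a function, are assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Outcome(Enum):
    RETRY_LATER = "hand the URL back to the scheduler"
    FETCH = "fetch from the address that was returned"


def resolve_for_crawl(
    host: str, resolver: Callable[[str], Optional[str]]
) -> Tuple[Outcome, Optional[str]]:
    """Ask a name server for a host's address and decide what the spider does next.

    `resolver` is a hypothetical stand-in for a DNS lookup: it returns an
    address string, or None when the name server has no record of the host.
    """
    address = resolver(host)
    if address is None:
        # First possibility: no record of the domain, so try again later.
        return Outcome.RETRY_LATER, None
    # Second and third possibilities: an address came back. The spider cannot
    # tell from the answer alone whether it is stale or still correct; it only
    # finds out (if at all) from what the Web server at that address returns.
    return Outcome.FETCH, address


# A name server holding an out-of-date set of records (made-up data).
old_records = {"example.com": "192.0.2.10"}
print(resolve_for_crawl("example.com", old_records.get))
print(resolve_for_crawl("moved.example", old_records.get))
```

Note that the second and third possibilities look identical to the spider at this stage, which is exactly why a wrong address can lead to the wrong page being fetched.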
It's also possible that a search engine could use a cached DNS record for the domain name, and go looking for the Web server without checking to ensure that the record is current. This used to be an occasional problem for Google, but probably will never be seen again. It certainly hasn't appeared to be a problem for any of the major search engines in some time.
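To see how a cached record can go stale, consider this minimal sketch of a DNS cache that re-checks a record once it reaches a certain age. The class name `DnsCache`, the `lookup` function it wraps, and the one-hour limit in the usage example are all assumptions for illustration; the search engines do not publish how their caches work.

```python
import time
from typing import Callable, Dict, Tuple


class DnsCache:
    """Remember addresses, but look them up again once they are too old."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup          # hypothetical function doing the real DNS query
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the example can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def address_for(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None:
            address, stored_at = cached
            if now - stored_at < self._max_age:
                # Still considered current: use the cached record.
                return address
        # No record, or the record is too old: ask again and remember the answer.
        address = self._lookup(host)
        self._entries[host] = (address, now)
        return address


# Usage with a made-up lookup function and an assumed one-hour limit.
cache = DnsCache(lambda host: "192.0.2.10", max_age_seconds=3600)
print(cache.address_for("example.com"))
```

A cache that skipped the age check would keep sending the spider to the old Web server for as long as the entry survived.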
We will discuss exactly how to move a Website from one server to another, from one hosting provider to another, and from one DNS server to another, in Chapter 3, Advanced SEO And Search Engine-Friendly Design. For now, the key point is that the mishandling of DNS can lead to problems for search engines, and this can, in turn, create major headaches for you.
Dealing with Servers
The next challenges that spiders have to handle are HTTP error messages, servers that simply cannot be found, and servers that fail to respond to HTTP requests. There are also many other server responses that must be handled with particular care in order to avoid problems.
Rather than provide a comprehensive listing of every problem that could ever eventuate, I'll simply list a few broad categories and note how search engines are likely to deal with them. We'll dig more deeply into server issues in Chapter 3, Advanced SEO And Search Engine-Friendly Design.
Where's That Server?
If a server can't be found, or fails to respond, it's likely a temporary condition. The crawler will inform the scheduler of the error, and move on. If the condition persists, the search engine might remove the URL in question from the index, and may even stop trying to crawl it. It usually takes a long-term problem, or a very unreliable server, to elicit such a drastic response, however. If a URL (or an entire domain) is removed because of server problems, a manual submission may be required in order to have the search engine crawl it again.
Where's That Page?
If a page does not exist at the requested URL, the server will return a 404 Not Found error. Sometimes, this means that a page has been permanently removed; sometimes, the page never existed in the first place; occasionally, pages that go missing reappear later. Search engines are usually quick to remove URLs that return 404 errors, although most of them will try to fetch the URL a couple more times before giving it up for dead. As with server issues, it may be necessary to resubmit pages that have been dropped for returning 404 errors. In Chapter 3, Advanced SEO And Search Engine-Friendly Design, we will discuss the right (and wrong) way to use custom 404 error pages.
Whoops, There Goes The Database!
Database errors are the bane of dynamic sites everywhere. Unless the code driving the site has robust error handling capabilities, most database errors will cause the Web server to return a 200 OK status code while delivering a page that contains nothing but an error message from the database. When this occurs, the error message will be stored by the spider as if it were the page's content. Resubmission of the page is not necessary, assuming the database issues have been corrected the next time the spider visits. Chapter 3, Advanced SEO And Search Engine-Friendly Design will include some recommendations on how best to manage database errors.
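As a rough illustration of what "robust error handling" might look like on the site's side, here is a minimal Python sketch. The function `render_page` and its `load_content` argument are hypothetical, and the choice of a 503 status code is an assumption made for the example rather than a recommendation from this chapter.

```python
from typing import Callable, Tuple


def render_page(load_content: Callable[[], str]) -> Tuple[int, str]:
    """Return an (HTTP status, body) pair for a database-driven page.

    `load_content` is a hypothetical function that queries the database and
    builds the page; it raises an exception when the database is unavailable.
    """
    try:
        return 200, load_content()
    except Exception:
        # Without this handler, the raw database error could be delivered
        # with a 200 OK status and stored by a spider as the page's content.
        # Assumption: 503 is used here to signal a temporary failure.
        return 503, "Service temporarily unavailable"


def broken_database() -> str:
    raise RuntimeError("database connection lost")


print(render_page(lambda: "<html>Widgets</html>"))
print(render_page(broken_database))
```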
Sorry, We Moved It ... Or Did We?
Redirection by the Web server can be a challenge for search engines. A server response of 301 Moved Permanently should cause the search engine to visit the new URL and adjust its database to reflect the change. Trickier for spiders is the 302 Found response code, which is used by many applications and scripts to redirect Web browsers. Search engines currently have varying responses to server-based redirects. In some cases, very bad things can happen if spiders are allowed to follow 302 redirects, as we'll see in Chapter 3, Advanced SEO And Search Engine-Friendly Design.
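The broad categories in this section can be summarized as a small decision function. This is a sketch under stated assumptions: the action names, the retry limits, and the function `next_action` itself are invented for illustration, and real search engines differ in how they respond, particularly to 302 redirects.

```python
from typing import Optional

MAX_SERVER_FAILURES = 5      # assumption: how long a "persistent" outage is tolerated
MAX_NOT_FOUND_RETRIES = 2    # assumption: "a couple more times" before giving up


def next_action(
    status: Optional[int], failures: int = 0, location: Optional[str] = None
) -> str:
    """Map a fetch result to what the crawler tells the scheduler.

    `status` is the HTTP status code, or None if the server could not be
    found or did not respond. `failures` counts earlier failed attempts.
    """
    if status is None:
        # Probably temporary; only a persistent problem gets the URL dropped.
        return "retry_later" if failures < MAX_SERVER_FAILURES else "drop_url"
    if status == 404:
        return "retry_later" if failures < MAX_NOT_FOUND_RETRIES else "drop_url"
    if status == 301 and location:
        # Permanent move: visit the new URL and update the database.
        return "recrawl_at:" + location
    if status == 302:
        # Temporary redirect: engines vary, so flag it rather than follow blindly.
        return "treat_with_caution"
    if status == 200:
        # Stored as content, even if the body is really a database error message.
        return "store_content"
    return "retry_later"


print(next_action(404, failures=2))
print(next_action(301, location="http://example.com/new-page"))
```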
Handling Dynamic Sites
One of the most difficult challenges faced by today's crawlers is the proliferation of dynamic or database-driven Websites. Depending on the way the site is configured, it's possible for a spider to get caught in an endless loop of pages that generate more pages, with a never-ending sequence of unique URLs that deliver the same (or slightly varied) content.
In order to avoid becoming caught in such spider traps, today's crawlers carefully examine URLs and avoid crawling any link that includes a session ID, the referring URL, or other variables that have nothing to do with the delivery of content. They also look for hints of duplicate content, including identical page titles, empty pages, and substantially similar content. Any of these gotchas can stop a spider from fully crawling a dynamic site. We will review crawler-friendly SEO strategies for dynamic sites in Chapter 3, Advanced SEO And Search Engine-Friendly Design.
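The two checks described above might look something like the following Python sketch. The parameter names in `SUSPECT_PARAMS` are common examples chosen for illustration, not a list used by any search engine, and the duplicate check only catches exact matches after whitespace and case are normalized; spotting "substantially similar" content takes more sophisticated techniques than are shown here.

```python
import hashlib
from typing import Optional, Set
from urllib.parse import parse_qsl, urlsplit

# Assumption: typical names for session and referrer variables.
SUSPECT_PARAMS = {"sid", "sessionid", "phpsessid", "ref", "referrer"}


def has_trap_parameter(url: str) -> bool:
    """True if the query string carries a variable unrelated to content delivery."""
    query = urlsplit(url).query
    return any(
        name.lower() in SUSPECT_PARAMS
        for name, _ in parse_qsl(query, keep_blank_values=True)
    )


def duplicate_hint(
    title: str, text: str, seen_titles: Set[str], seen_hashes: Set[str]
) -> Optional[str]:
    """Return a reason to suspect duplicate content, or None if the page looks new."""
    body = " ".join(text.lower().split())
    if not body:
        return "empty page"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return "identical content"
    if title in seen_titles:
        return "identical title"
    # Nothing suspicious: remember this page for later comparisons.
    seen_titles.add(title)
    seen_hashes.add(digest)
    return None


print(has_trap_parameter("http://example.com/catalog?item=7&PHPSESSID=abc123"))
print(has_trap_parameter("http://example.com/catalog?item=7"))
```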
Scheduling: How Search Engines Set Priorities
In addition to the challenges that must be overcome in crawling the Web, there are a great number of issues with which search engines must grapple in order to properly manage their crawling resources. As mentioned previously, each search engine's priorities are different.
Five years ago, the major competition between the search engines was to build the largest index of documents. News networks like CNN played up each succeeding announcement of what was described as the new "biggest search engine," which, no doubt, pleased many dot-com investors, even if some of the search engines played it a little fast and loose when it came to the numbers.
Today, the total index size is no longer seen as a key indicator of a search engine's quality. Nonetheless, any search engine must index a substantial portion of the Web in order to deliver relevant search results. Google currently has by far the largest index, which is especially evident to those searching for detailed technical information, as relevant pages may be buried deep within a site.
The scheduling of crawler activity must be guided by the search engine's individual priorities in four specific areas:
- Freshness: In order to deliver the best possible results, every search engine must index a great deal of new content. Without this, it would be impossible to return search results on current events. Most scheduling algorithms involve a list of important sites that should be checked regularly for new content. Indexing XML data feeds helps some search engines keep up with the growth of the Web.
- Depth vs. Breadth: A key strategic decision for any search engine involves how many sites to crawl (breadth) and how deeply to crawl into each site (depth). For most search engines, making the depth vs. breadth decision for a given site will depend upon the site's linking relationships with the rest of the Web: more popular sites are more likely to be crawled in depth, especially if some inbound links point to internal pages. A single link to a site is usually enough to get that site's homepage crawled.
- Submitted Pages: Search engines such as Google, which allow the manual submission of pages, must decide how to deal with those manually submitted pages, and how to handle repeat submissions of the same URL. Such pages might be extremely fresh or important, or they may be nothing more than spam.
- Paid Inclusion: Search engines that offer paid inclusion programs generally guarantee that they will revisit paid URLs every 24–72 hours.
In terms of priority, a search engine that offers a paid inclusion program must visit those paid URLs first. After listings for paid inclusion, most search engines will probably focus resources on any important URLs that help them maintain a fresh index. Only after these two critical groups of URLs are crawled will they pursue additional URLs. URLs submitted via a free submission page are probably the last on the list, especially if they have been submitted repeatedly.
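The ordering described above can be expressed as a simple sort. This sketch is a guess at the general shape of such a scheduler, not a description of any real one; the category names, the numeric ranks in `BASE_RANK`, and the function `crawl_order` are assumptions made for the example.

```python
from typing import List, Tuple

# Lower rank means crawled sooner. The categories follow the order given in
# the text; the labels themselves are invented for this example.
BASE_RANK = {"paid": 0, "fresh": 1, "discovered": 2, "submitted": 3}

Candidate = Tuple[str, str, int]  # (url, kind, times_submitted)


def crawl_order(candidates: List[Candidate]) -> List[str]:
    """Sort URLs by category, pushing repeatedly submitted URLs to the back."""

    def rank(candidate: Candidate) -> Tuple[int, int]:
        _, kind, times_submitted = candidate
        return BASE_RANK[kind], times_submitted

    return [url for url, _, _ in sorted(candidates, key=rank)]


queue = [
    ("http://example.com/free", "submitted", 3),
    ("http://example.com/links", "discovered", 0),
    ("http://example.com/news", "fresh", 0),
    ("http://example.com/paid", "paid", 0),
    ("http://example.com/free-once", "submitted", 1),
]
for url in crawl_order(queue):
    print(url)
```

In this example, the paid URL comes first and the URL that was submitted three times comes last.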
Parsing and Caching
Once the contents of a URL have been fetched, they are handed off to the database/repository and stored. Each URL is associated with a unique ID, which will be used throughout all the search engine's operations. Depending on the type of content, one of two things will happen next.
If the document is already in HTML format, it can be stored immediately, exactly as is. Additional metadata, such as the Last-Modified date and page title, may be stored along with the document. (Note that metadata should not be confused with <meta> tags. Metadata is "data about data." For search engines, the primary unit of data is the Web page, so anything that describes that Web page (other than its content) is metadata. This might include the page's title, URL, and other information such as the Website's directory description, which Yahoo! uses within its search results.)
This stored copy of the HTML code is used by some search engines to offer users a snapshot view of the page, or access to the cached version.
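A toy version of such a repository might look like this in Python. The class names, the in-memory dictionaries, and the metadata keys are all assumptions for illustration; a real search engine's storage is far larger and more elaborate.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StoredDocument:
    doc_id: int
    url: str
    html: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Repository:
    """Keep one HTML copy per URL, keyed by a unique numeric ID."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._docs: Dict[int, StoredDocument] = {}

    def store(self, url: str, html: str, **metadata: str) -> int:
        # A URL keeps the same ID every time it is fetched and stored again.
        doc_id = self._ids.setdefault(url, len(self._ids) + 1)
        self._docs[doc_id] = StoredDocument(doc_id, url, html, dict(metadata))
        return doc_id

    def cached_copy(self, url: str) -> Optional[str]:
        """Return the stored HTML, as used for a 'cached version' link."""
        doc_id = self._ids.get(url)
        return self._docs[doc_id].html if doc_id is not None else None


repo = Repository()
doc_id = repo.store(
    "http://example.com/",
    "<html><title>Home</title></html>",
    title="Home",
    last_modified="Mon, 01 Mar 2004 10:00:00 GMT",
)
print(doc_id, repo.cached_copy("http://example.com/"))
```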
For documents that are presented in formats other than HTML, such as Adobe's popular Acrobat (PDF) or Microsoft Word, further processing is needed. Typically, search engines that attempt to index these types of documents first translate the document into HTML format with a specialized parser.
Converting non-HTML documents to an HTML representation allows search engines to offer users access to the document's contents in HTML format (as Google does), and to conduct all further processing on the HTML version. When the document contains structural information, such as a Microsoft Word file that makes use of heading styles, search engines can make use of these elements within the HTML translation. Adobe's PDF is notably lacking in structural elements, so search engines must rely on type styles and size to determine the most significant text elements.
At this point, all that has been accomplished is to store an HTML version of the document. Most search engines will perform further parsing at this stage, to extract the text content of the page, and catalog the various elements (headings, links etc.) for analysis by the indexing and link analysis components. Some of them may leave all of this processing to the indexer.
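This further parsing step can be sketched with Python's standard `html.parser` module. The class `PageElements` and the choice of which elements to catalog are assumptions for the example; it makes no attempt to handle scripts, styles, or malformed markup the way a production parser would have to.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageElements(HTMLParser):
    """Collect the title, headings, link targets, and text of an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.headings: List[Tuple[str, str]] = []  # (tag, heading text)
        self.links: List[str] = []
        self.text: List[str] = []
        self._open: Optional[str] = None  # title or heading tag being captured
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title" or tag in HEADING_TAGS:
            self._open = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._open:
            captured = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = captured
            else:
                self.headings.append((tag, captured))
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)
        # Everything except the title counts as page text.
        if self._open != "title" and data.strip():
            self.text.append(data.strip())


page = PageElements()
page.feed(
    "<html><head><title>Blue Widgets</title></head><body>"
    '<h1>Widgets</h1><p>See our <a href="/catalog">catalog</a>.</p>'
    "</body></html>"
)
print(page.title, page.headings, page.links)
```

The cataloged elements are what the indexing and link analysis components would then work from.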
Results of the Crawling Phase
By the end of the crawling phase, the search engine knows that there was valid content at the URL, and it has added that content (possibly translated to HTML) to its database.
Even before a search engine crawls a page, it must "know" something about that page. It knows that the URL exists and, if the URL was found via links, the search engine may also have found within those links some text that tells it something about the URL.
Once a search engine knows that a URL exists, it's possible that this URL could appear in search results. In Google, a page that has not yet been crawled can appear as a supplemental search result, based on the keywords contained in hyperlinks pointing to that page. At this point, the page's title is not known, so the listing will display the page's URL in place of the title.
After the crawling phase is complete, the search engine knows the document's title, last-modified date, and size. Such pages can appear in Google's results as supplemental search results, based on keywords that appear in the page's title and incoming links, and the listing can now display the page's title rather than its URL.
Tip: The Google search engine provides an unusual amount of transparency around its process and results. It's possible, for example, to have Google return a list of all the URLs it has found within a particular site. The syntax for this search is site:example.com.
If some of the URLs listed for a site:domain search do not include page titles or page size information, this means that those URLs have not yet been crawled. If this condition persists, as happens often with dynamic sites, there may be issues with duplicate content, session IDs, empty pages, or other problems that have caused the spider to stop crawling the site. We will cover these issues in Chapter 3, Advanced SEO And Search Engine-Friendly Design.