How Search Engines Work

Web Crawling

The most popular search engine by far, Google, takes its name from the googol, the number written as a one followed by 100 zeros. The name is a reference to the vast amount of material the search engine can find with the click of a mouse. Google began its career as an academic tool and typically ran three or four spiders at a time. A spider is a software robot that compiles and tabulates lists of words as it crawls the Web. In those early days, each spider could keep 300 connections to web pages open at once. With four spiders running, Google could crawl more than 100 pages per second, which amounted to about 600 kilobytes of data every second.
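The figures above imply a few further numbers, worked out in the short sketch below. The per-connection time is only a rough average and assumes every connection is kept busy, which the text does not state.

```python
# Figures taken from the paragraph above.
SPIDERS = 4
CONNECTIONS_PER_SPIDER = 300
PAGES_PER_SECOND = 100
KILOBYTES_PER_SECOND = 600

# Total connections open at once across all spiders.
total_connections = SPIDERS * CONNECTIONS_PER_SPIDER

# Average amount of data fetched per page.
average_page_kb = KILOBYTES_PER_SECOND / PAGES_PER_SECOND

# Assumption: all connections are busy, so each one finishes a page
# roughly every (connections / pages per second) seconds.
seconds_per_page_per_connection = total_connections / PAGES_PER_SECOND

print(total_connections, average_page_kb, seconds_per_page_per_connection)
```

Under that assumption, no single connection needs to be fast. The throughput comes from keeping many slow downloads in flight at the same time.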

To sustain this pace, Google had to build a system that could feed data to the busy spiders. In the beginning, Google used a dedicated server to provide URLs to the spiders. Google also decided not to depend on an Internet service provider for the domain name server (DNS) that translates a server's name into a numeric address. Instead, it ran its own DNS to speed up the crawling process.
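The two ideas in this paragraph, a server that hands URLs to spiders and a local store of name lookups, can be sketched in a few lines. This is a minimal illustration and not Google's actual design: the class names UrlServer and CachingResolver are hypothetical, and the cache here is a plain dictionary placed in front of the standard library's socket.gethostbyname.

```python
import socket
from collections import deque
from typing import Callable, Deque, Dict, Optional, Set
from urllib.parse import urlsplit


class UrlServer:
    """Hypothetical URL server: hands out each URL once, oldest first."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._seen: Set[str] = set()

    def add(self, url: str) -> None:
        # Ignore URLs that were already queued so no page is crawled twice.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def next_url(self) -> Optional[str]:
        # Return None when there is nothing left to crawl.
        return self._pending.popleft() if self._pending else None


class CachingResolver:
    """Hypothetical local name cache: each host is looked up only once."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname) -> None:
        self._lookup = lookup
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of lookups that had to go past the cache

    def resolve(self, host: str) -> str:
        host = host.lower()
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._lookup(host)
        return self._cache[host]


def host_of(url: str) -> str:
    """Extract the host name a spider would need to resolve for this URL."""
    return urlsplit(url).hostname or ""
```

A spider loop would call next_url(), pass host_of(url) to resolve(), and then open a connection to the returned address. Because most links on a site point back to the same host, the cache removes the majority of lookups from the crawl.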

Location, Location, Location

As the Google spiders crawled an HTML page, they recorded the individual words on the page and noted where each one appeared. The spiders were designed to take note of words appearing in prominent locations such as titles and subtitles, since these words are the most likely to matter in a later user search. A Google spider is designed to ignore unimportant words such as the articles "a," "an," and "the." Other search engine spiders may employ different methods to sift through information.
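Recording each word together with its location can be sketched as follows. This is an illustrative simplification: the function name index_page, the two field labels, and the three-word stop list are assumptions made for the example, and a real spider would parse the HTML instead of receiving the title and body as separate strings.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: only the three articles named in the text are skipped.
STOP_WORDS = {"a", "an", "the"}


def index_page(title: str, body: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each word to the (field, position) pairs where it occurs."""
    postings: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for field, text in (("title", title), ("body", body)):
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # skipped words still count toward positions
            postings[word].append((field, position))
    return dict(postings)


index = index_page("The Web Crawler", "A crawler reads the web")
print(index["crawler"])
```

Keeping the field name alongside the position lets a later ranking step treat a match in the title as more significant than a match deep in the body.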

Each search engine's chosen approach is meant to make its spiders work faster, to give users a more useful search, or both. The search engine Lycos, for instance, gathers all the words in the titles, subheadings, and links, plus the 100 words used most frequently within the page and all the words found within the first 20 lines of the text.
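The selection rule described for Lycos can be written down directly. The sketch below follows the rule as stated in the text and is not taken from Lycos itself. The function name select_terms and its parameters are hypothetical, and it assumes the title, subheadings, and link texts have already been extracted from the page.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def words_in(text: str) -> List[str]:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_terms(
    title: str,
    headings: Iterable[str],
    link_texts: Iterable[str],
    body: str,
    top_n: int = 100,
    first_lines: int = 20,
) -> Set[str]:
    """Pick the words to index under the rule described above."""
    selected: Set[str] = set(words_in(title))

    # Every word in subheadings and links.
    for text in list(headings) + list(link_texts):
        selected.update(words_in(text))

    # The top_n most frequent words in the body.
    frequencies = Counter(words_in(body))
    selected.update(word for word, _count in frequencies.most_common(top_n))

    # Every word in the first few lines of the body.
    for line in body.splitlines()[:first_lines]:
        selected.update(words_in(line))

    return selected
```

With the default limits, a short page is indexed in full, while a long page contributes only its prominent and frequent words. That keeps the index small at the cost of missing rare words that appear late in the text.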

Meta Tags

AltaVista goes to the opposite end of the spectrum, leaving no word unindexed, including the articles and other words deemed negligible by the other popular search engines. Other systems pay close attention to a part of the web page that most users never see: the meta tags, which the page author uses to mark certain words and concepts as appropriate for indexing. The search engine can use these tags to tell apart the meanings of a word that has more than one. Some authors prefer that their pages not be indexed by search engine spiders, and this, too, can be specified within the meta tags placed at the top of a web page.
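Reading meta tags can be sketched with Python's standard html.parser module. The example relies on two widely used conventions that the text does not name explicitly: a "keywords" meta tag for the author's chosen terms, and a "robots" meta tag whose "noindex" value asks spiders to skip the page. The class MetaTagReader and the function read_meta are hypothetical names, and how much weight any particular engine gives these tags is not something this sketch claims to know.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class MetaTagReader(HTMLParser):
    """Collect name/content pairs from the meta tags of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        content = fields.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def read_meta(html: str) -> Tuple[List[str], bool]:
    """Return (author keywords, whether the page may be indexed)."""
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()

    raw_keywords = reader.meta.get("keywords", "")
    keywords = [k.strip().lower() for k in raw_keywords.split(",") if k.strip()]

    raw_robots = reader.meta.get("robots", "")
    directives = {d.strip().lower() for d in raw_robots.split(",")}
    may_index = "noindex" not in directives

    return keywords, may_index


page = (
    '<html><head><meta name="keywords" content="Jaguar, cars">'
    '<meta name="robots" content="noindex, nofollow"></head></html>'
)
print(read_meta(page))
```

In this sketch, a spider that honors the author's wishes would check the second value before adding any of the page's words to its index.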