An Internet search engine is a computer system that (1) locates and indexes web pages, and (2) processes queries from users who are searching for information on the web. The most common way people find information on the Internet is through a search engine (PEW Internet & American Life Project, 2004).
A search engine comprises three components: a web spider, a database, and one or more information retrieval algorithms. The web spider (also known as a "web crawler") searches the Internet for new web pages (Gordon and Pathak, 1999). It systematically follows hyperlinks found on known pages. If the spider comes upon a web page it has not previously encountered, it sends this page to the information retrieval algorithms for indexing and storage in the database (Kirsanov, 1997). The index enables the search engine to retrieve the URL of the web page from the database based on the query terms that users enter.
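The spider's behavior described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the method of any particular search engine: the function names (`crawl`, `fetch_links`, `index_page`) and the toy link graph are assumptions introduced here, and an in-memory dictionary stands in for real HTTP fetching.

```python
from collections import deque

def crawl(seed_urls, fetch_links, index_page):
    """Follow hyperlinks from known pages and hand each previously
    unseen page to the indexer exactly once (breadth-first order)."""
    seen = set(seed_urls)          # pages the spider already knows about
    frontier = deque(seed_urls)    # known pages not yet visited
    visit_order = []
    while frontier:
        url = frontier.popleft()
        index_page(url)            # send the page for indexing and storage
        visit_order.append(url)
        for link in fetch_links(url):
            if link not in seen:   # only new pages join the frontier
                seen.add(link)
                frontier.append(link)
    return visit_order

# Hypothetical link graph standing in for the web.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "a.example"],
    "c.example": ["d.example"],
    "d.example": [],
}

indexed = []
order = crawl(["a.example"], lambda u: TOY_WEB.get(u, []), indexed.append)
print(order)
```

The `seen` set is what keeps the spider from re-indexing a page when several pages link to it or when links form a cycle.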
The information retrieval algorithms build the index from the content of the web page (its words and phrases, and whether it contains images), as well as from whatever other clues the algorithm developer can exploit (e.g., the popularity of the page or the nature of the pages that hyperlink to it).
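One common way to realize such an index is an inverted index that maps each term to the pages containing it. The sketch below is illustrative only: the page data, the function names, and the choice of in-link count as the "popularity" clue are assumptions made for this example, not a description of any specific engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages maps url -> (text, outgoing links).
    Returns (term -> set of urls, url -> number of distinct pages linking to it)."""
    postings = defaultdict(set)
    inlinks = defaultdict(int)
    for url, (text, links) in pages.items():
        for term in tokenize(text):
            postings[term].add(url)
        for target in set(links):      # count each linking page once
            if target != url:          # ignore self-links
                inlinks[target] += 1
    return postings, inlinks

def lookup(postings, inlinks, term):
    """URLs containing the term, most-linked-to first (ties broken by URL)."""
    urls = postings.get(term.lower(), set())
    return sorted(urls, key=lambda u: (-inlinks.get(u, 0), u))

# Hypothetical pages: (content, hyperlinks found on the page).
PAGES = {
    "a.example": ("Influenza outbreak report", ["b.example"]),
    "b.example": ("Influenza vaccine news", ["c.example"]),
    "c.example": ("Vaccine schedule", ["b.example"]),
}

postings, inlinks = build_index(PAGES)
print(lookup(postings, inlinks, "influenza"))
```

Here the page content determines which URLs are candidates for a term, while the link structure only influences the order in which they are returned.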
Information retrieval algorithms are studied in a subfield of information science called "document retrieval," and there are many books on the topic (e.g., Pao, 1989; van Rijsbergen, 1986; Salton and McGill, 1983). Briefly, there are three basic approaches to document indexing and retrieval: Boolean, vector space, and probabilistic. These approaches are distinguished partly by their use of different retrieval strategies. Boolean systems retrieve the subset of documents whose indexing terms exactly match the query generated by the user (Salton, 1989). Many electronic library catalogs use Boolean retrieval algorithms. Vector space and probabilistic systems impose a partial ordering on the document collection, according to a document score. The vector space algorithms compute a heuristic score, typically the cosine of the angle between the query's and the document's index-term vectors (Salton, 1989). Probabilistic systems order documents by the probability of relevance, which they learn from a training set of documents that developers of the system or end users have judged relevant or not relevant. All Internet search engines use one of the latter two approaches, as does the GPHIN system that we discuss in the next section (its information retrieval algorithm is proprietary). GPHIN computes document scores (called "relevance scores") and uses them to select documents to disseminate to subscribers.
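The contrast between Boolean and vector space retrieval can be made concrete with a small sketch. It assumes raw term counts as vector weights and an AND interpretation of the Boolean query; real systems typically use more refined weighting, and the documents and function names below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "avian influenza outbreak in poultry",
    "d2": "influenza vaccine trial results",
    "d3": "poultry farm statistics",
}

def terms(text):
    return text.lower().split()

def boolean_and(query, docs):
    """Boolean retrieval: ids of documents containing every query term (unranked)."""
    wanted = set(terms(query))
    return sorted(d for d, text in docs.items() if wanted <= set(terms(text)))

def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Vector space retrieval: documents with a nonzero score, best first."""
    q = Counter(terms(query))
    scored = [(cosine(q, Counter(terms(text))), d) for d, text in docs.items()]
    return [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]

print(boolean_and("influenza poultry", DOCS))
print(rank("influenza poultry", DOCS))
```

For the query "influenza poultry", the Boolean match returns only the document containing both terms, whereas the cosine ranking also returns the partial matches, ordered by score.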
Table 26.1 displays the market share of the 10 most frequently used search engines, as measured by number of searches as of March 2005 (Nielsen//NetRatings MegaView Search, 2005). These systems differ both in the methods by which their web spiders search the Internet and in their information retrieval algorithms, which accounts for differences in the information they retrieve in response to the same user query. At present, Google®, Yahoo®, and MSN® are the most popular search engines (Sullivan, 2005).