Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.


How a Search Engine Works
Sara Davis
INFO Spring 2016
Erika Gutierrez
May 1, 2016

Search engines come in many forms, but they all follow three basic steps: crawling, indexing, and presenting ranked results. During the crawling step, the engine looks through as many files as it can access, searching for pages that it has not seen or that have changed since its last crawl. It routinely rechecks each page it has seen, usually every month or so (Sullivan 2002). The engine then stores these pages in its index. A page that has been crawled is not necessarily indexed yet. This usually happens when a page that the crawler has already seen and indexed is updated, but the old copy has not yet been replaced. The index is what the search engine draws on to return ranked results for a query. The results may be ranked by relevance, referencing, alphabetical order, how recent the page is, or any combination of these.

Crawling is done in many different ways by various developers to fit the needs or capabilities of a particular system. Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. Following is the process by which Web crawlers work:

1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process (Peshave 2005).

As a crawler works through one page, it collects many more links from that page. It then opens those links and searches the linked pages for still more links. An easy way to picture this is a family tree in which each page's links are the children of that page, and so on. Most crawlers can read as many as 300 pages simultaneously (Peshave 2005). Once a crawler reaches its limit, it puts all new links in a queue. A crawler does not spread across multiple computers like a virus or intelligent agent; it simply resides on one computer and requests access to pages much as a web browser does. The largest differences are the automation of the crawler and the speed at which it can analyze pages.

Because a crawler resides on one computer, it has obvious constraints. These include network bandwidth to download pages, memory to maintain private data structures in support of its algorithms, CPU to evaluate and select URLs, and disk storage to hold the text and links of fetched pages as well as other persistent data, such as the queue of new pages to analyze. These constraints are one reason meta search engines exist: they draw on the prebuilt indexes of other search engines. Meta search engines often use small or free indexes and are usually not as in-depth as crawler-based engines.

After pages have been crawled and exhausted of links to new pages, the engine indexes and sorts them based on their relationships to other pages, the topics they cover, or the links they have in common with other pages. This also happens in three steps:

1. Parsing: Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.

2. Indexing Documents into Barrels: After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordid by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordid's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.
3. Sorting: In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordid to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordid and docid. Then the sorter loads each basket into memory, sorts it and writes its contents into the short inverted barrel and the full inverted barrel. (Brin & Page 2012)

The first of these steps overlaps with the parsing step of crawling. However, this parsing is clearly different from the parsing done by the crawler. The indexing parser is looking for semantic errors, typos, or other strange anomalies that may cause the page to be sorted improperly. The crawler parses only to look for links. Both kinds of parsing generally happen at roughly the same time, but keeping the steps separate helps in understanding the processes behind a search engine.

This brings us to returning ranked results. To the user, this is seemingly the most important step, because it is the part of the engine that the user actually interacts with. While the other two steps make this one possible, a wonderful crawler and a beautifully sorted index mean little if the results are not ranked in a way that satisfies the user. A well-built ranking system such as Google's weighs a seemingly endless number of factors. These include relevance to the query, the number of hits for the words in the query, the location of those hits, the number of links to other pages, the number of other pages linking to this page, and similarity to previously browsed pages. The importance of these seemingly insignificant factors becomes much more noticeable when one uses a poorly ranked search engine: together they make a huge difference in returning good results.

All these steps come together to create a search engine. The engine is constantly crawling and indexing, waiting for a query so that it can return results from its database. Although the only parts the user interacts with are the query box and the results themselves, one should remember and appreciate the other, more mundane steps behind good results.
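To make the three steps concrete, the following is a minimal sketch in Python of crawling, indexing, and returning ranked results. It is an illustration only, not the method of any engine discussed above. The pages are a small hypothetical in-memory dictionary (PAGES) rather than real downloads, all function names are invented for this example, and the ranking is an assumption that uses just one of the factors listed above, the number of hits for the words in the query.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would
# download each page over the network instead of reading a dictionary.
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a> search engines crawl pages',
    "b.html": '<a href="c.html">C</a> crawlers follow links between pages',
    "c.html": '<a href="a.html">A</a> an index maps words to pages pages',
}

LINK_RE = re.compile(r'href="([^"]+)"')  # naive link extraction
TAG_RE = re.compile(r"<[^>]+>")          # naive tag stripping
WORD_RE = re.compile(r"[a-z]+")


def crawl(seed, pages):
    """Step 1: fetch a page, retrieve its links, and queue any unseen ones."""
    seen = {seed}
    queue = deque([seed])
    fetched = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # dead link: nothing to fetch
        fetched[url] = html
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_inverted_index(fetched):
    """Step 2: map each word to the pages it occurs on, with hit counts."""
    index = defaultdict(dict)
    for url, html in fetched.items():
        text = TAG_RE.sub(" ", html).lower()
        for word in WORD_RE.findall(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Step 3: rank pages by total query-word hits (ties broken by URL)."""
    scores = defaultdict(int)
    for word in WORD_RE.findall(query.lower()):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return sorted(scores, key=lambda u: (-scores[u], u))


fetched = crawl("a.html", PAGES)
index = build_inverted_index(fetched)
results = search(index, "pages")
```

Here the crawler starts from a single seed page and reaches the other two pages only by following links, the index maps each word to the pages that contain it, and the query is answered entirely from the index. A real ranking system would combine many more factors than a simple hit count.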

References

Brin, S., & Page, L. (2012). The anatomy of a large scale hypertextual web search engine. Computer Networks, 56(18).

Baboescu, F., Tullsen, D. M., Rosu, G., & Singh, S. (2005). A Tree Based Search Engine Architecture with Single Port Memories. ACM SIGARCH Computer Architecture News, 33(2).

Grehan, M. (2002). How Search Engines Work. Search Engine Marketing: The Essential Best Practice Guide. New York, NY: Incisive Interactive Marketing LLC. search engines work mike gr ehan.pdf

Sullivan, D. (2002). How Search Engines Work.

Peshave, M. (2005). How Search Engines Work and a Web Crawler Application.
