Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Similar documents
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Searching the Web for Information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

THE WEB SEARCH ENGINE

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Introduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction

A Survey on Web Information Retrieval Technologies

Ambiguity Resolution in Search Engine Using Natural Language Web Application.

PAGE RANK ON MAP- REDUCE PARADIGM

The anatomy of a large-scale hypertextual Web search engine

COMP5331: Knowledge Discovery and Data Mining

The PageRank Citation Ranking: Bringing Order to the Web

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Web Structure Mining using Link Analysis Algorithms

An Application of Personalized PageRank Vectors: Personalized Search Engine

Information Retrieval Issues on the World Wide Web

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Search Engines. Information Retrieval in Practice

WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Lecture 17 November 7

Title: Artificial Intelligence: an illustration of one approach.

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Review: Searching the Web [Arasu 2001]

Full-Text Indexing For Heritrix

Information Retrieval. Lecture 10 - Web crawling

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

An Adaptive Approach in Web Search Algorithm

Optimizing Search Engines using Click-through Data

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Information Retrieval and Web Search

Reading Time: A Method for Improving the Ranking Scores of Web Pages

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Search Engine Architecture. Hongning Wang

Information Retrieval Spring Web retrieval

A Hierarchical Web Page Crawler for Crawling the Internet Faster

Link Analysis in Web Information Retrieval

Effective Page Refresh Policies for Web Crawlers

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

A brief history of Google

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Searching the Web What is this Page Known for? Luis De Alba

Information Retrieval May 15. Web retrieval

Part I: Data Mining Foundations

Google Scale Data Management

Analytical survey of Web Page Rank Algorithm

E-Business s Page Ranking with Ant Colony Algorithm

A GEOGRAPHICAL LOCATION INFLUENCED PAGE RANKING TECHNIQUE FOR INFORMATION RETRIEVAL IN SEARCH ENGINE

Beyond PageRank: Machine Learning for Static Ranking

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Search Engine Overview

CS514: Intermediate Course in Computer Systems

Airo International Research Journal September, 2017 Volume XII, ISSN:

Recent Researches on Web Page Ranking

Weighted Page Content Rank for Ordering Web Search Result

Survey on Web Structure Mining

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining

[Banjare*, 4.(6): June, 2015] ISSN: (I2OR), Publication Impact Factor: (ISRA), Journal Impact Factor: 2.114

Estimating Page Importance based on Page Accessing Frequency

Search Engine Techniques: A Review

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

Searching the Web [Arasu 01]

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Personalizing PageRank Based on Domain Profiles

A STUDY ON THE EVOLUTION OF THE WEB

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Distributed Information Systems. Copyright 2010 Srdjan Komazec

Web Crawling As Nonlinear Dynamics

An Improved Computation of the PageRank Algorithm 1

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Ranking Techniques in Search Engines

Breadth-First Search Crawling Yields High-Quality Pages

Site Content Analyzer for Analysis of Web Contents and Keyword Density

Web Applications: Internet Search and Digital Preservation

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

The application of Randomized HITS algorithm in the fund trading network

Inverted Indexing Mechanism for Search Engine

CRAWLING THE CLIENT-SIDE HIDDEN WEB

Computer Engineering, University of Pune, Pune, Maharashtra, India 5. Sinhgad Academy of Engineering, University of Pune, Pune, Maharashtra, India

CS Search Engine Technology

Compact Encoding of the Web Graph Exploiting Various Power Laws

.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Dynamic Visualization of Hubs and Authorities during Web Search

Deep Web Crawling and Mining for Building Advanced Search Application

Local Methods for Estimating PageRank Values

Corso di Biblioteche Digitali

A Novel Architecture of Ontology-based Semantic Web Crawler

Corso di Biblioteche Digitali

COMP Page Rank

Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Transcription:

Anatomy of a search engine Design criteria of a search engine Architecture Data structures

Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection open at once Google can crawl over 100 web pages per second using four crawlers at peak speeds (roughly 600K per second of data) Each crawler maintains a its own DNS cache The crawler uses asynchronous IO and a number of queues Ref: http://www.elibrary.icrisat.org/google%20search/how%20google%20works.htm

Step2: Indexing the Web (1) Parsing Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ascii characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. Use flex to generate a lexical analyzer for maximum speed URL Server Crawler Crawler Crawler Store Server Indexer Repository Barrel s Sorter Indexer Indexer

Indexing Documents into Barrels Step2: Indexing the Web (2) After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordid by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. The words in the current document are translated into hit lists. The words are written into the forward barrels. For parallelization, indexer writes a log to a file, instead of sharing the lexicon Sorting Takes each of the forward barrels. Sorts it by wordid to produce an inverted barrel. Parallelize the sorting phase. Subdivides the barrels into baskets to load into main memory because the barrels don t fit into memory. Sorts baskets and writes its contents into the inverted barrel

Searching (pseudo code) 1. Parse the query 2. Convert words into wordids 3. Seek to the start of the doclist in the short barrel for every word 4. Scan through the doclists until there is a document that matches all the search terms 5. Compute the rank of that document for the query 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4 7. If we are not at the end of any doclist go to step 4 8. Sort the documents that have matched by rank and return the top k Figure 4. Google Query Evaluation

Searching The Ranking System Every hit-list includes position, font and capitalization information. Factor in hits from anchor text and the PageRank of the document. Ranking function so that no particular factor can have too much influence For a single word search o In order to rank a document, Google looks at that document s hit list for a single word query and computes an IR score combined with PageRank For a multi-word search o Hits occurring close together in a document are weighted higher than hits occurring far apart Use of Feedback Google has a user feedback mechanism because figuring out the right values for many parameters is very difficult. When the ranking function is modified, this mechanism gives developers some idea of how a change in the ranking function affects the search results

Page Rank Page Rank: Page Rank of any node shows: how many important or popular nodes vote for the given node. This is a collective intelligence of nodes in graph. For given vertex/node V, let INV i ' (predecessors), and let ' The score of vertex Where: S V i ' i ' V i be the set of vertices/nodes that point to it OUT be the set of vertices that vertex ' Vi ' can be defined as Page et al. (1998) : S Rank of vertexv i. S ' =rank of vertex ' V j V i ' 1 SV j ' N jin ' OUTV V, from which incoming link comes to word / vertex V. j N Count of number of words/vertex in word graph of sentences. damping factor ( 0.85 is used in Page et al. (1998) ). V i j ' V points to (successors). i ' i (1)

Google architecture URL server sends list of URLs to be fetched to crawlers StoreServer compresses and stores pages Indexer extracts words, their pos., size, capital. Anchors cont.links and their text Sorter generates inverted index Searcher uses Lexicon, II, and PR

Major data Structure Major Data Structures: Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures. BigFiles: BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. Repository: The repository contains the full HTML of every web page. Each page is compressed using zlib. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; Document Index: The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docid. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. Additionally, there is a file which is used to convert URLs into docids. It is a list of URL checksums with their corresponding docids and is sorted by checksum. In order to find the docid of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docid.

Major data Structure Lexicon: The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. Hit Lists: A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Forward Index: The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordid's. If a document contains words that fall into a particular barrel, the docid is recorded into the barrel, followed by a list of wordid's with hitlists which correspond to those words. Inverted Index: The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordid, the lexicon contains a pointer into the barrel that wordid falls into. It points to a doclist of docid's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.

Reference [Cho 98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998. [Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. [Chakrabarti 98] S.Chakrabarti, B.Dom, D.Gibson, J.Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998. [Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994. Sergey Brin and Lawrence Page; The Anatomy of a Large-Scale Hypertextual Web Search Engine; Computer Science Department, Stanford University, Stanford, CA 94305