The Anatomy of a Large-Scale Hypertextual Web Search Engine

Article by Larry Page and Sergey Brin, Computer Networks 30(1-7):107-117, 1998

1. Introduction
- The authors, Lawrence Page and Sergey Brin, started small at Stanford University during their graduate studies.
- The Google index grew to over 1 billion pages by June 2000; at the time, Google served an average of 18 million queries per day.
- The index grew to over 8 billion pages in 2004.
- Traditional keyword searches were not accurate enough; Google makes use of HTML document structure.

2. System Features
- PageRank: Bringing Order to the Web (the Web's citation graph at the time contained 518 million hyperlinks).
- Covered below: description of the PageRank calculation, intuitive justification, anchor text, and other features. (A demo was available at google.stanford.edu.)

Description of PageRank
- A = a given page
- T1 ... Tn = pages that point to page A (i.e., citations)
- d = damping factor between 0 and 1 (usually d = 0.85)
- C(A) = number of links going out of page A
- PR(A) = the PageRank of page A

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

NOTE: the PageRanks of all web pages sum to one.
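The PageRank formula above can be computed iteratively. The following is a minimal sketch on a hypothetical three-page toy graph (the function name and graph are illustrative; only the formula and d = 0.85 come from the slides):

```python
# Iterative PageRank following the slide's formula:
# PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}  # arbitrary initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(T)/C(T) over every page T that links to p
            incoming = sum(pr[t] / len(links[t])
                           for t in links if p in links[t])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Toy graph: A -> B, A -> C, B -> C, C -> A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Note that C, which is pointed to by both A and B, ends up with a higher rank than B, which is pointed to only by A; this illustrates the "many pages point to it" intuition from the next slide.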

Intuitive Justification
- Assume there is a "random surfer"; the probability that the random surfer visits a page is its PageRank.
- 1 - d is the probability, at each page, that the random surfer will get bored and jump to a random page.
- Variations of the formula exist.
- A page has high rank if many pages point to it, or if some of the pages that point to it themselves have high PageRank.

Anchor Text
- The text of a link is associated both with the page the link is on and with the page the link points to.
- Advantages:
  - Anchors often provide a more accurate description of a page than the page itself.
  - Anchors may exist for documents that cannot be indexed as text (e.g., images, programs, and databases).
  - Propagating anchor text helps provide better-quality results.
- Google indexed 24 million pages with over 259 million anchors.

Other Features
- Location information for all hits.
- Some visual presentation details are kept (e.g., the font size of words).
- The full raw HTML of pages is available in a repository.

3. Related Work
- Information retrieval research in the past focused largely on well-controlled collections such as scientific papers and articles (cf. Assignment 1, the Vector Space Model).
- Differences between the web and controlled collections: the web contains varying content types (HTML, DOC, PDF, etc.), and there is no editorial control over web content.

4. System Anatomy
- A high-level discussion of the architecture, with some in-depth description of the important data structures.
- The major components: crawling, indexing, and searching.
- Implemented in C/C++ for Linux/Solaris.

(Figure: Google architecture overview, showing the URL server, crawlers, store server, repository, URL resolver, anchors, links, indexer, barrels, lexicon, doc index, sorters, PageRank, and searcher.)

Major Data Structures

BigFiles
- Virtual files addressable by 64-bit integers.
- Support rudimentary compression options.

Repository
- Contains the full HTML of every web page.
- Compressed with zlib at about 3 to 1, compared to bzip's 4 to 1; zlib was preferred for speed.
- Each entry stores the docid, length, and URL.

Document Index
- Keeps information about each document (keyed by docid, in a fixed-width ISAM index).
- Each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics.

(Figure: repository data structure.)
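A repository entry as described above (docid, length, URL, zlib-compressed HTML) can be sketched as a length-prefixed record. The field widths and header layout here are illustrative assumptions, not the paper's exact on-disk format:

```python
import struct
import zlib

def pack_entry(docid, url, html):
    """Pack one repository entry: docid, lengths, URL, zlib-compressed HTML."""
    compressed = zlib.compress(html.encode())
    # Hypothetical header: 64-bit docid, 32-bit URL length, 32-bit body length
    header = struct.pack("<QII", docid, len(url), len(compressed))
    return header + url.encode() + compressed

def unpack_entry(buf):
    docid, url_len, comp_len = struct.unpack_from("<QII", buf)
    offset = struct.calcsize("<QII")
    url = buf[offset:offset + url_len].decode()
    body = buf[offset + url_len:offset + url_len + comp_len]
    return docid, url, zlib.decompress(body).decode()

entry = pack_entry(7, "http://example.com/", "<html>hello</html>")
```

Length-prefixed records like this let the document index point straight into the repository with a byte offset, matching the "pointer into the repository" mentioned above.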

Major Data Structures (cont'd)

Lexicon
- A repository of words: 14 million words, plus an extra file for rare words.
- Implemented with a hash table of pointers (word ids) into the barrels, which are sorted lists.

Hit Lists
- A hit list stores the occurrences of a particular word in a particular document.
- Types of hits: plain, fancy, and anchor (2 bytes per hit).

Forward Index
- Each barrel holds a range of word ids.
- Stores a docid and a list of word ids.

Inverted Index
- Similar to the Assignment 1 inverted index.
- Can be sorted by docid or by ranking of word occurrence.
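The 2-byte hits mentioned above are bit-packed. For a plain hit, the paper uses 1 bit for capitalization, 3 bits for font size, and 12 bits for word position; a minimal sketch of that packing:

```python
# Pack a "plain" hit into 2 bytes:
# 1 bit capitalization | 3 bits font size | 12 bits word position

def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 3, 42)   # fits in 16 bits
```

Packing hits this tightly is what keeps the index small enough for the storage numbers reported in Section 5.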

(Figure: the forward and inverted indexes and the lexicon.)

Crawling the Web
- Crawlers need to be reliable, fast, and robust; some authors don't want their pages to be crawled.
- Google (1998) had about 3-4 crawlers running; at peak speeds, the four crawlers processed about 100 web pages per second.
- Crawlers move through different states: DNS lookup, connecting to the host, sending the request, and receiving the response.
- The crawlers and the URL server were implemented in Python.

Indexing the Web
- Parsing: parsers need to handle errors very well (typos, formatting problems, etc.).
- Indexing: after parsing, documents are placed into the forward barrels; words are converted into word ids, and occurrences into hit lists.
- Sorting: the forward barrels are sorted by word id to produce the inverted index.

Searching
- The goal of searching: quality search results, efficiently.
- Ranking system: every hit list includes position, font, and capitalization information; each hit is one of several types, and each type has its own type-weight.
- Many parameters: type-weights and type-prox-weights (for phrasal queries).
- A user feedback mechanism helps tune the parameters.
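The sorting step above, turning forward barrels into an inverted index, can be sketched as follows. For readability this uses words instead of word ids, and the in-memory structures are illustrative, not the actual barrel format:

```python
# Invert a toy forward barrel (docid -> word -> hit positions)
# into posting lists (word -> [(docid, hit positions), ...]).

from collections import defaultdict

forward_barrel = {
    1: {"google": [0], "search": [1, 5]},
    2: {"search": [3]},
}

inverted_index = defaultdict(list)
for docid, words in sorted(forward_barrel.items()):   # sort by docid
    for word, hits in words.items():
        inverted_index[word].append((docid, hits))    # posting: (docid, hits)

postings = dict(inverted_index)
```

Because documents are processed in docid order, each posting list comes out sorted by docid, which is exactly what the doclist scan in query evaluation relies on.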

Google Query Evaluation
1. Parse the query.
2. Convert words into wordids.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.

5. Results and Performance

Storage Requirements
- Google (1998) compressed its repository to about 53 GB for 24 million pages.
- The additional storage used for indexes, temporary storage, the lexicon, etc. required about 55 GB.
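The core of steps 4, 5, and 8 is intersecting the doclists and ranking the matches. A minimal sketch, using an arbitrary stand-in rank function and omitting the short/full barrel distinction:

```python
# Find documents appearing in every term's doclist, then return the top-k
# by rank. `rank` is a hypothetical scoring function; real Google combined
# type-weights, proximity, and PageRank.

def evaluate(doclists, rank, k=10):
    """doclists: one sorted list of docids per query word."""
    matches = set(doclists[0])
    for dl in doclists[1:]:
        matches &= set(dl)            # a doc must appear in every doclist
    ranked = sorted(matches, key=rank, reverse=True)
    return ranked[:k]

# Three query terms; docs 2 and 5 match all of them.
top = evaluate([[1, 2, 5, 9], [2, 5, 7], [2, 4, 5]], rank=lambda d: -d, k=2)
```

In the real system the doclists are scanned in docid order rather than materialized as sets, which is why the posting lists are kept sorted.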

System Performance
- Crawling and indexing must be efficient; crawling took a week or more.
- To improve efficiency, the indexer and the crawlers need to be capable of simultaneous execution.
- Sorting needs to be done in parallel on several machines.

Search Performance
- Disk I/O requires most of the time (seek time, transfer time, network delay, etc.).
- Caching reduces the number of disk accesses; repeated-query times can improve from 2.13 s to 0.06 s.

6. Conclusion
- Google is designed to be a scalable search engine that provides high-quality search.
- It is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

High-Quality Search
- Heavy use of hypertextual information: the link structure and the link (anchor) text.
- Proximity and font information increase relevance.
- PageRank measures page quality; link text measures relevance.

Scalable Architecture
- Efficient in both space and time.
- Potential bottlenecks: CPU, memory capacity, disk seeks, disk throughput, disk capacity, and network I/O.
- The major data structures make efficient use of the available storage space (24 million pages crawled and indexed in under a week).
- Goal: build an index of 100 million pages in under a month.