Logistics. CSE Case Studies. Indexing & Retrieval in Google. Design of Alta Vista. Course Overview. Google System Anatomy

Similar documents
Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

CSE 454. Index Compression Alta Vista PageRank

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Searching the Web for Information

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

A Survey on Web Information Retrieval Technologies

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

THE WEB SEARCH ENGINE

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Ambiguity Resolution in Search Engine Using Natural Language Web Application.

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Introduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction

Full-Text Indexing For Heritrix

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Building an Inverted Index

Storage hierarchy. Textbook: chapters 11, 12, and 13

Information Retrieval II

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Some Practice Problems on Hardware, File Organization and Indexing

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Search Engines. Information Retrieval in Practice

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Data-Intensive Distributed Computing

Information Retrieval

Part I: Data Mining Foundations

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Chapter 2. Architecture of a Search Engine

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Distributed computing: index building and use

Indexing and Searching

Sorting. Overview. External sorting. Warm up: in memory sorting. Purpose. Overview. Sort benchmarks

Using Graphics Processors for High Performance IR Query Processing

Indexing and Searching

SEARCH ENGINE INSIDE OUT

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Information Retrieval Spring Web retrieval

Chapter 13: Query Processing

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Chapter 12: Query Processing

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Indexing and Searching

Information Retrieval

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Chapter 12: Query Processing. Chapter 12: Query Processing

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

PROJECT REPORT (Final Year Project ) Project Supervisor Mrs. Shikha Mehta

Information Retrieval

Structural Text Features. Structural Features

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

CSE 4/521 Introduction to Operating Systems. Lecture 23 File System Implementation II (Allocation Methods, Free-Space Management) Summer 2018

Information Retrieval and Organisation

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Information Retrieval May 15. Web retrieval

Representing Data Elements

Information Retrieval

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Chapter 3 - Memory Management

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Search Quality. Jan Pedersen 10 September 2007

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

1 o Semestre 2007/2008

V.2 Index Compression

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Advanced Database Systems

MG4J: Managing Gigabytes for Java. MG4J - intro 1

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval. Danushka Bollegala

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Building Search Applications

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

modern database systems lecture 4 : information retrieval

CSE 530A. B+ Trees. Washington University Fall 2013

Advance Indexing. Limock July 3, 2014

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

Information Retrieval. Lecture 10 - Web crawling

Query Evaluation Strategies

Chapter 9 Memory Management

Outlook. File-System Interface Allocation-Methods Free Space Management

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Transcription:

CSE 454 - Case Studies Indexing & Retrieval in Google Slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Design of Alta Vista Based on a talk by Mike Burrows Group Meetings Starting Tomorrow 10:30 11:00 2:00 2:30 3:00 3:30 Logistics 1 2 Info Extraction Datamining Course Overview P2P Case Studies: Nutch, Google, Altavista Information Retrieval Precision vs Recall Inverted Indicies Ecommerce Security Web Services Semantic Web Crawler Architecture Synchronization & Monitors Systems Foundation: Networking & Clusters 3 Google System Anatomy Create partially sorted forward Anchor text index added as well Each barrel holds part of the forward index ( i.e. for a range of words) Sorts by wordid to creates inverted index 4 Compresses Documents Major Data Structures Google File System Big Files virtual files spanning multiple file systems addressable by 64 bit integers handles allocation & deallocation of File Descriptions since the OS s is not enough supports rudimentary compression Major Data Structures (2) Repository Full HTML of every page Docs stored one after another Prefix: docid, length, URL Compressed: Tradeoff between Speed Compression ratio Choose zlib (3 to 1) Rather than bzip (4 to 1) Requires no other data structure to access it Robustness Ease of dev Can rebuild other structures 5 6 1

Major Data Structures (3) Document Index Info about each document Status Pointer to repository Document checksum + statistics If crawled, pointer to var width file with URL, title Else Pointer to URL-list Fixed width ISAM (index sequential access mode) index Ordered by docid Compact data structure Can fetch a record in 1 disk seek during search Major Data Structures (4) URL s - docid file List of URL checksums with their docids Sorted by checksums Used to convert URLs to docids Given a URL a binary search is performed Additions done with batch merge skip 7 8 Major Data Structures (5) Lexicon Can fit in memory currently 256 MB contains 14 million words 2 parts a list of words Separated by nulls a hash table of pointers Into hit list (occurrences) Major Data Structures (6) Hit Lists (Occurences) Includes position, font & capitalization Bulk of index size Tried 3 alternatives Simple: triple of ints too big Huffman: too slow Hand-optimized 2 bytes / hit plain fancy anchor cap 1 font 3 position 12 cap 1 = 7 type 4 position 8 Limited phrase capability cap 1 = 7 type 4 hash 4 pos 4 9 10 Major Data Structures (7) Hit Lists Continued forward barrels: total 43 GB docid wordid: 24 nhits: 8 wordid: 24 nhits: 8 null wordid docid wordid: 24 nhits: 8 wordid: 24 nhits: 8 wordid: 24 nhits: 8 null wordid hit hit hi hit hit hit hit hi hit hit hit hit hit hi hit Inverted Barrels: 41 GB Lexicon: 293 MB docid: 27 Nhits: 8hit hit hit hit wordid ndocs docid: 27 Nhits: 8hit hit hit wordid ndocs docid: 27 Nhits: 8hit hit hit hit wordid ndocs docid: 27 Nhits: 8hit hit 64 Barrels Holds range 11 Major Data Structures (8) Inverted Index Generated from FI by sorter Also 64 Barrels Lexicon has pointers: word -> barrel Points to a doclist: (docid, hitlist)+ Order of the docids is important By docid? -- easy to update By word-rank in doc? -- fast results, but slow to merge Soln: divide in two groups, then sort by docid short barrel (just title, anchor hits) full barrel (all hit lists) Sorting: in main memory; barrels done in parallel 12 2

Crawling the Web Fast distributed crawling system URLserver & Crawlers are implemented in python Each Crawler keeps about 300 connections open Peak rate = 100 pages, 600K per second Cache DNS lookup internally synchronized IO to handle events number of queues Robust & Carefully tested Parsing Must handle errors HTML typos KB of zeros in a middle of a TAG Non-ASCII characters HTML Tags nested hundreds deep Developed their own Parser involved a fair amount of work did not cause a bottleneck 13 14 Searching Algorithm 1. Parse the query 2. Convert word into wordids 3. Seek to the start of the doclist in the short barrel for every word 4. Scan through the doclists until there is a document that matches all of the search terms 5. Compute the rank of that document 6. If we re at the end of the short barrels start at the doclists of the full barrel, unless we have enough 7. If were not at the end of any doclist goto step 4 8. Sort the documents by rank return the top K (May jump here after 40k pages) The Ranking System The information Position, Font Size, Capitalization Anchor Text PageRank Hits Types title,anchor, URL etc.. small font, large font etc.. 15 16 The Ranking System (2) Each Hit type has it s own weight Count weights increase linearly with counts at first but quickly taper off - this is the IR score of the doc (IDF weighting??) IR combined w/ PageRank to give the final Rank For multi-word query A proximity score for every set of hits with a proximity type weight 10 grades of proximity Storage Requirements Using Compression on the repository About 55 GB for all the data used by the SE Most of the queries can be answered by just the short inverted index With better compression, a high quality SE can fit onto a 7GB drive of a new PC 17 18 3

Storage Statistics Total size of 147.8 GB Fetched Pages Compressed 53.5 GB Repository Short Inverted 4.1 GB Index Temporary 6.6 GB Anchor Data Document 9.7 GB Index Incl. Variable Width Data Links Database 3.9 GB Total Without 55.2 GB Repository Web Page Statistics Number of Web Pages Fetched Number of URLs Seen Number of Email Addresses Number of 404 s 24 million 76.5 million 1.7 million 1.6 million 8 B pages in 2005 System Performance It took 9 days to download 26 million pages 48.5 pages per second The Indexer & Crawler ran simultaneously The Indexer runs at 54 pages per second The sorters run in parallel using 4 machines, the whole process took 24 hours 19 20 Spam Keyword stuffing Meta tag stuffing Multiple titles Tiny fonts Invisible text <body bgcolor="ffffff"> <font color="#ffffff" size ="1">Your text here</font> Problem: takes up space. Size=1? Bottom? Doorway / jump pages Fast meta refresh Cloaking ~ Code swapping Domain spamming Pagerank spoofing 21 AltaVista: Inverted Files Map each word to list of locations where it occurs Words = null-terminated byte strings Locations = 64 bit unsigned ints Layer above gives interpretation for location URL Index into text specifying word number Slides adapted from talk by Mike Burrows 22 Documents A document is a region of location space Contiguous No overlap Densely allocated (first doc is location 1) All document structure encoded with words enddoc at last location of document begintitle, endtitle mark document title Document 1 Document 2 0 1 2 3 4 5 6 7 8...... Format of Inverted Files Words ordered lexicographically Each word followed by list of locations Common word prefixes are compressed Locations encoded as deltas Stored in as few bytes as possible 2 bytes is common Sneaky assembly code for operations on inverted files Pack deltas into aligned 64 bit word First byte contains continuation bits Table lookup on byte => no branch instructs, no mispredicts 35 parallelized instructions/ 64 bit word = 10 cycles/word Index ~ 10% of text size 23 24 4

Index Stream Readers (ISRs) Interface for Reading result of query Return ascending sequence of locations Implemented using lazy evaluation Methods loc(isr) return current location next(isr) advance to next location seek(isr, X) advance to next loc after X prev(isr) return previous location! Processing Simple Queries User searches for mp3 Open ISR on mp3 Uses hash table to avoid scanning entire file Next(), next(), next() returns locations containing the word 25 26 Combining ISRs And Compare locs on two streams Or Merges two or more ISRs Not Returns locations not in ISR (lazily) file ISR file ISR file ISR genesis fixx or ISR mp3 Genesis OR fixx Problem!!! and ISR (genesis OR fixx) AND mp3 ISR Constraint Solver Inputs: Set of ISRs: A, B,... Set of Constraints Constraint Types loc(a) loc(b) + K prev(a) loc(b) + K loc(a) prev(b) + K prev(a) prev(b) + K For example: phrase a b loc(a) loc(b), loc(b) loc(a) + 1 a a b a a b a b 27 28 Two words on one page Let E be ISR for word enddoc Constraints for conjunction a AND b prev(e) loc(a) loc(a) loc(e) prev(e) loc(b) loc(b) loc(e) What if prev(e) Undefined? Advanced Search Field query: a in Title of page Let BT, ET be ISRP of words begintitle, endtitle Constraints: loc(bt) loc(a) loc(a) loc(et) prev(et) loc(bt) b a e b a b e b prev(e) loc(b) loc(a) loc(e) 29 et a bt a et a bt et prev(et) loc(et) loc(a) loc(bt) 30 5

Solver Algorithm while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(a) loc(b) + K Execute: seek(b, loc(a) - K) To satisfy: prev(a) loc(b) + K Execute: seek(b, prev(a) - K) To satisfy: loc(a) prev(b) + K Execute: seek(b, loc(a) - K), next(b) To satisfy: prev(a) prev(b) + K Execute: seek(b, prev(a) - K) next(b) 31 Heuristic: Which choice advances a stream the furthest? Update Can t insert in the middle of an inverted file Must rewrite the entire file Naïve approach: need space for two copies Slow since file is huge Split data along two dimensions Buckets solve disk space problem Tiers alleviate small update problem 32 Buckets & Tiers Each word is hashed to a bucket Add new documents by adding a new tier Periodically merge tiers, bucket by bucket Delete documents by adding deleted word Expunge deletions when merging tier 0 older bigger Tier 1 Tier 2... newer smaller... Hash bucket(s) for word a Hash bucket(s) for word b Hash bucket(s) for word zebra Scaling How handle huge traffic? AltaVista Search ranked #16 10,674,000 unique visitors (Dec 99) Scale across N hosts 1. Ubiquitous index. Query one host 2. Split N ways. Query all, merge results 3. Ubiquitous index. Host handles subrange of locations. Query all, merge results 4. Hybrids 33 34 AltaVista Structure Front ends Alpha workstations Back ends 4-10 CPU Alpha servers 8GB RAM, 150GB disk Organized in groups of 4-10 machines Each with 1/N th of index Border routers FDDI switch FDDI switch Back end index servers Front end HTTP servers To alternate site 35 6