CS 347 Parallel and Distributed Data Processing

Similar documents
CS 347 Parallel and Distributed Data Processing

Distributed computing: index building and use

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed computing: index building and use

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Collection Building on the Web. Basic Algorithm

Information Retrieval. Lecture 10 - Web crawling

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval Spring Web retrieval

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Mining Web Data. Lijun Zhang

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Today s lecture. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Data Center Performance

Midterm Examination CSE 455 / CIS 555 Internet and Web Systems Spring 2009 Zachary Ives

CSE 5306 Distributed Systems

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

The MapReduce Abstraction

CS6200 Information Retreival. Crawling. June 10, 2015

Internet Services and Search Engines. Amin Vahdat CSE 123b May 2, 2006

Searching the Web What is this Page Known for? Luis De Alba

CSE 5306 Distributed Systems. Naming

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Distributed Computations MapReduce. adapted from Jeff Dean s slides

CS47300: Web Information Search and Management

Search Engines. Information Retrieval in Practice

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Information Retrieval May 15. Web retrieval

Part I: Data Mining Foundations

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS47300: Web Information Search and Management

Towards a Distributed Web Search Engine

Information Retrieval

CompSci 356: Computer Network Architectures Lecture 21: Overlay Networks Chap 9.4. Xiaowei Yang

Today CSCI Coda. Naming: Volumes. Coda GFS PAST. Instructor: Abhishek Chandra. Main Goals: Volume is a subtree in the naming space

Mining Web Data. Lijun Zhang

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Parallel Computing: MapReduce Jin, Hai

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013

A Survey on Web Information Retrieval Technologies

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

US Patent 6,658,423. William Pugh

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

416 Distributed Systems. March 23, 2018 CDNs

Introduction to Information Retrieval

CS 61C: Great Ideas in Computer Architecture. MapReduce

Information Networks. Hacettepe University Department of Information Management DOK 422: Information Networks

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS 572: Information Retrieval

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

CS November 2017

CSE 123b Communications Software

Today s class. CSE 123b Communications Software. Telnet. Network File System (NFS) Quick descriptions of some other sample applications

Assignment 5. Georgia Koloniari

Data-Intensive Computing with MapReduce

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4

Exam IST 441 Spring 2014

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Chapter 17: Parallel Databases

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

L22: SC Report, Map Reduce

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste

CSE 494 Project C. Garrett Wolf

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

Performance Analysis for Crawling

Data-Intensive Distributed Computing

Numerical Computing with Spark. Hossein Falaki

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

No Trade Secrets. Microsoft does not claim any trade secret rights in this documentation.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

EECS 395/495 Lecture 5: Web Crawlers. Doug Downey

CS November 2018

CompSci 516: Database Systems

How Does a Search Engine Work? Part 1

Transcription:

CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval

CS 347 Notes 12 2

CS 347 Notes 12 3

CS 347 Notes 12 4

CS 347 Notes 12 5

Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 6

Crawling Fetch the content of web pages Initalize Seed URLs Web Get the next URL Fetch page URLs to visit Visited URLs Extract URLs Web pages CS 347 Notes 12 7

Crawling Issues Scope and freshness Not enough space/time to crawl all pages Page importance, quality, and update frequency Site mirrors and (near) duplicate pages Dynamic content and crawler traps Load at visited web sites Rules in robots.txt Limit number of visits per day Limit depth of crawl CS 347 Notes 12 8

Crawling Issues Load at crawler Variance of fetch latency/bandwidth Parallelization and scalability Multiple agents Partitioning URL lists Communication between agents Recovering from agent failure CS 347 Notes 12 9

Crawl Partitioning Requirements Each URL assigned to a single agent Locally computable URL-to-agent mapping Balanced distribution of URLs across agents Contravariance CS 347 Notes 12 10

Contravariance Agent A Agent B Agent A Agent B Agent C url 1 url 3 url 5 url 2 url 4 url 6 url 1 url 2 url 3 url 4 url 5 url 6 CS 347 Notes 12 11

Contravariance Agent A Agent B Agent A Agent B Agent C url 1 url 3 url 5 url 2 url 4 url 6 url 1 url 2 url 3 url 4 url 5 url 6 Agent A Agent B Agent C url 1 url 2 url 5 url 3 url 4 url 6 CS 347 Notes 12 12

URL Assignment Consistent hashing Hash function: URL agent Each agent replicated k times Each replica mapped randomly on unit circle Mapping persistent across agent restarts Lookup: map URL on unit circle; find closest live replica CS 347 Notes 12 13

URL Assignment A url 6 B B A CS 347 Notes 12 14

URL Assignment A url 6 A C url 6 B C B B A B A Balancing Contravariance CS 347 Notes 12 15

Crawl Partitioning Ideas URL normalization E.g., relative to absolute URL Host-based partitioning Reduces communication between agents Small vs. large hosts Geographic distribution CS 347 Notes 12 16

Fault Tolerance Repartitioning Permanent failures Recovering list of URLs to visit Checkpoints Communication logs Transient failures Avoiding re-visiting URLs E.g., before fetch, check with near neighbor agents CS 347 Notes 12 17

Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 18

Indexing Build the term-document index Collection d 1 d 2 d 3 d 4 d 5 d 6 d n Lexicon t 1 t 2 t 3 t 4 t 5 t 6 Posting for t 1 t m CS 347 Notes 12 19

Architecture Reduce Web pages Inverted index files Intermediate runs Map Distributors Indexers Query servers CS 347 Notes 12 20

Indexing Issues Index partitioning Efficient query processing Query routing Result retrieval CS 347 Notes 12 21

Document Partitioning d 1 d 2 d 3 d 4 d 5 d 6 d n t 1 t 2 t 3 t 4 t 5 t 6 t m d 1 d 2 d 3 d 4 d 5 d 6 d n-2 d n-1 d n t 1 t 1 t 1 t 2 t 2 t 2 t 3 t 3 t 3 t 4 t 4 t 4 t 5 t 5 t 5 t 6 t 6 t 6 t m t m t m CS 347 Notes 12 22

Document Partitioning Split the collection of documents Advantages Easy to add new documents Load balanced High processing throughput Disadvantages Communication with all query servers CS 347 Notes 12 23

Term Partitioning d 1 d 2 d 3 d 4 d 5 d 6 d n t 1 t 2 d 1 d 2 d 3 d 4 d 5 d 6 d n t 3 t 1 t 2 d 1 d 2 d 3 d 4 d 5 d 6 d n t 3 t 4 t 5 t 4 t 5 t 6 t 6 t m d 1 d 2 d 3 d 4 d 5 d 6 d n t m-2 t m-1 t m CS 347 Notes 12 24

Term Partitioning Split the lexicon Advantages Reduced communication with query servers Disadvantages More processing before partitioning Adding new documents is hard Load balancing is hard Processing parallelism is limited by query length CS 347 Notes 12 25

Advanced Partitioning Topical partitioning using clustering Documents clustered by term-similarity Partitions made up of one or more clusters Usage-induced partitioning Queries extracted from logs Documents clustered by query-similarity Partitions made up of one or more clusters CS 347 Notes 12 26

Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 27

Ranking Feature Computation Parallel/distributed computation tasks Text/language processing Document classification/clustering Web graph analysis CS 347 Notes 12 28

Example: PageRank Link-based global (query-independent) importance metric Random surfer model Start at a random page With probability d, visit a new page following a random link on the current page With probability (1 d), restart at a random page PageRank score ~ expected fraction of time spent at a page CS 347 Notes 12 29

PageRank Formula p ( x ) = d Σ p ( y ) / out ( y ) + ( 1 d ) / n y x CS 347 Notes 12 30

PageRank Formula Out-degree of page y Probability of random restart at x p ( x ) = d Σ p ( y ) / out ( y ) + ( 1 d ) / n y x PageRank of x PageRank of y, where y links to x CS 347 Notes 12 31

PageRank Algorithm i = 0 p [i] ( x ) = ( 1 d ) / n repeat i += 1 p [i] ( x ) = ( 1 d ) / n for all y x p [i] ( x ) += d p [i 1] ( y ) / out ( y ) until p [i] p [i 1] < ε CS 347 Notes 12 32

PageRank Implementation Two vectors, current and next Initialize vectors Iterate over all pages y, distribute PageRank from current( y ) to next ( x ) for all links y x current = next, re-initialize next Go back to iteration over pages or stop CS 347 Notes 12 33

PageRank Distribution MapReduce for each iteration i Map Take < y, ( current ( y ), edges ( y ) ) > For each y x in edges ( y ) emit < x, current ( y ) / edges ( y ) > Also emit < y, edges ( y ) > Reduce Take < x, val > and < x, edges ( x ) > Sum ( d val ) into next ( x ), add ( 1 d ) / n Emit < x, ( next ( x ), edges ( x ) ) > CS 347 Notes 12 34

PageRank Distribution < y, ( current ( y ), edges ( y ) ) > Map < x, val > < x, val > Reduce < x, ( next ( x ), edges ( x ) ) > CS 347 Notes 12 35

Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 36

Query Processing Locate, retrieve, process, and serve query results Cache Inverted index files Query Results Query coordinator Query servers CS 347 Notes 12 37

Serving Architecture Multiple sites connected by WAN Site = coordinator + servers + cache Partitioning Parallel processing Distributed storage of data Based on index partitioning Replication Availability Throughput Response time CS 347 Notes 12 38

Serving Issues Routing the query To sites E.g., identical sites + routing by dynamic DNS lookup Within sites Merging the results Caching CS 347 Notes 12 39

Serving Issues Routing Merging Document partition All servers Results selected by servers; ranking by coordinator Term partition Servers containing query terms Selection and ranking by coordinator CS 347 Notes 12 40

Caching What to cache? Query answers Term postings CS 347 Notes 12 41

Caching Query terms repeated more frequently than whole queries What to cache? Query answers Faster response Term postings More hits CS 347 Notes 12 42

Caching Query terms repeated more frequently than whole queries What to cache? Query answers Faster response Term postings More hits Done as well in practice CS 347 Notes 12 43

Caching Policy Terms most frequent in queries High hit ratio Terms most frequent in documents Requires more cache space (longer postings) CS 347 Notes 12 44

Summary Crawling Partitioning: balancing and contravariance Consistent hashing Indexing Document, term, topical, and usage-induced partitioning Computing ranking features PageRank with MapReduce Serving queries Routing queries, merging results, and caching postings CS 347 Notes 12 45