CS 347 Parallel and Distributed Data Processing

Size: px

Start display at page:

Download "CS 347 Parallel and Distributed Data Processing"

Andrew Williams
6 years ago
Views:

1 CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval

2 CS 347 Notes 12 2

3 CS 347 Notes 12 3

4 CS 347 Notes 12 4

5 CS 347 Notes 12 5

6 Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 6

7 Crawling Fetch the content of web pages Initalize Seed URLs Web Get the next URL Fetch page URLs to visit Visited URLs Extract URLs Web pages CS 347 Notes 12 7

8 Crawling Issues Scope and freshness Not enough space/time to crawl all pages Page importance, quality, and update frequency Site mirrors and (near) duplicate pages Dynamic content and crawler traps Load at visited web sites Rules in robots.txt Limit number of visits per day Limit depth of crawl CS 347 Notes 12 8

9 Crawling Issues Load at crawler Variance of fetch latency/bandwidth Parallelization and scalability Multiple agents Partitioning URL lists Communication between agents Recovering from agent failure CS 347 Notes 12 9

10 Crawl Partitioning Requirements Each URL assigned to a single agent Locally computable URL-to-agent mapping Balanced distribution of URLs across agents Contravariance CS 347 Notes 12 10

11 Contravariance Agent A Agent B Agent A Agent B Agent C url 1 url 3 url 5 url 2 url 4 url 6 url 1 url 2 url 3 url 4 url 5 url 6 CS 347 Notes 12 11

12 Contravariance Agent A Agent B Agent A Agent B Agent C url 1 url 3 url 5 url 2 url 4 url 6 url 1 url 2 url 3 url 4 url 5 url 6 Agent A Agent B Agent C url 1 url 2 url 5 url 3 url 4 url 6 CS 347 Notes 12 12

13 URL Assignment Consistent hashing Hash function: URL agent Each agent replicated k times Each replica mapped randomly on unit circle Mapping persistent across agent restarts Lookup: map URL on unit circle; find closest live replica CS 347 Notes 12 13

14 URL Assignment A url 6 B B A CS 347 Notes 12 14

15 URL Assignment A url 6 A C url 6 B C B B A B A Balancing Contravariance CS 347 Notes 12 15

16 Crawl Partitioning Ideas URL normalization E.g., relative to absolute URL Host-based partitioning Reduces communication between agents Small vs. large hosts Geographic distribution CS 347 Notes 12 16

17 Fault Tolerance Repartitioning Permanent failures Recovering list of URLs to visit Checkpoints Communication logs Transient failures Avoiding re-visiting URLs E.g., before fetch, check with near neighbor agents CS 347 Notes 12 17

18 Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 18

19 Indexing Build the term-document index Collection d 1 d 2 d 3 d 4 d 5 d 6 d n Lexicon t 1 t 2 t 3 t 4 t 5 t 6 Posting for t 1 t m CS 347 Notes 12 19

20 Architecture Reduce Web pages Inverted index files Intermediate runs Map Distributors Indexers Query servers CS 347 Notes 12 20

21 Indexing Issues Index partitioning Efficient query processing Query routing Result retrieval CS 347 Notes 12 21

22 Document Partitioning d 1 d 2 d 3 d 4 d 5 d 6 d n t 1 t 2 t 3 t 4 t 5 t 6 t m d 1 d 2 d 3 d 4 d 5 d 6 d n-2 d n-1 d n t 1 t 1 t 1 t 2 t 2 t 2 t 3 t 3 t 3 t 4 t 4 t 4 t 5 t 5 t 5 t 6 t 6 t 6 t m t m t m CS 347 Notes 12 22

23 Document Partitioning Split the collection of documents Advantages Easy to add new documents Load balanced High processing throughput Disadvantages Communication with all query servers CS 347 Notes 12 23

24 Term Partitioning d 1 d 2 d 3 d 4 d 5 d 6 d n t 1 t 2 d 1 d 2 d 3 d 4 d 5 d 6 d n t 3 t 1 t 2 d 1 d 2 d 3 d 4 d 5 d 6 d n t 3 t 4 t 5 t 4 t 5 t 6 t 6 t m d 1 d 2 d 3 d 4 d 5 d 6 d n t m-2 t m-1 t m CS 347 Notes 12 24

25 Term Partitioning Split the lexicon Advantages Reduced communication with query servers Disadvantages More processing before partitioning Adding new documents is hard Load balancing is hard Processing parallelism is limited by query length CS 347 Notes 12 25

26 Advanced Partitioning Topical partitioning using clustering Documents clustered by term-similarity Partitions made up of one or more clusters Usage-induced partitioning Queries extracted from logs Documents clustered by query-similarity Partitions made up of one or more clusters CS 347 Notes 12 26

27 Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 27

28 Ranking Feature Computation Parallel/distributed computation tasks Text/language processing Document classification/clustering Web graph analysis CS 347 Notes 12 28

29 Example: PageRank Link-based global (query-independent) importance metric Random surfer model Start at a random page With probability d, visit a new page following a random link on the current page With probability (1 d), restart at a random page PageRank score ~ expected fraction of time spent at a page CS 347 Notes 12 29

30 PageRank Formula p ( x ) = d Σ p ( y ) / out ( y ) + ( 1 d ) / n y x CS 347 Notes 12 30

31 PageRank Formula Out-degree of page y Probability of random restart at x p ( x ) = d Σ p ( y ) / out ( y ) + ( 1 d ) / n y x PageRank of x PageRank of y, where y links to x CS 347 Notes 12 31

32 PageRank Algorithm i = 0 p [i] ( x ) = ( 1 d ) / n repeat i += 1 p [i] ( x ) = ( 1 d ) / n for all y x p [i] ( x ) += d p [i 1] ( y ) / out ( y ) until p [i] p [i 1] < ε CS 347 Notes 12 32

33 PageRank Implementation Two vectors, current and next Initialize vectors Iterate over all pages y, distribute PageRank from current( y ) to next ( x ) for all links y x current = next, re-initialize next Go back to iteration over pages or stop CS 347 Notes 12 33

34 PageRank Distribution MapReduce for each iteration i Map Take < y, ( current ( y ), edges ( y ) ) > For each y x in edges ( y ) emit < x, current ( y ) / edges ( y ) > Also emit < y, edges ( y ) > Reduce Take < x, val > and < x, edges ( x ) > Sum ( d val ) into next ( x ), add ( 1 d ) / n Emit < x, ( next ( x ), edges ( x ) ) > CS 347 Notes 12 34

35 PageRank Distribution < y, ( current ( y ), edges ( y ) ) > Map < x, val > < x, val > Reduce < x, ( next ( x ), edges ( x ) ) > CS 347 Notes 12 35

36 Web Search Engine Crawling Indexing Computing ranking features Serving queries CS 347 Notes 12 36

37 Query Processing Locate, retrieve, process, and serve query results Cache Inverted index files Query Results Query coordinator Query servers CS 347 Notes 12 37

38 Serving Architecture Multiple sites connected by WAN Site = coordinator + servers + cache Partitioning Parallel processing Distributed storage of data Based on index partitioning Replication Availability Throughput Response time CS 347 Notes 12 38

39 Serving Issues Routing the query To sites E.g., identical sites + routing by dynamic DNS lookup Within sites Merging the results Caching CS 347 Notes 12 39

40 Serving Issues Routing Merging Document partition All servers Results selected by servers; ranking by coordinator Term partition Servers containing query terms Selection and ranking by coordinator CS 347 Notes 12 40

41 Caching What to cache? Query answers Term postings CS 347 Notes 12 41

42 Caching Query terms repeated more frequently than whole queries What to cache? Query answers Faster response Term postings More hits CS 347 Notes 12 42

43 Caching Query terms repeated more frequently than whole queries What to cache? Query answers Faster response Term postings More hits Done as well in practice CS 347 Notes 12 43

44 Caching Policy Terms most frequent in queries High hit ratio Terms most frequent in documents Requires more cache space (longer postings) CS 347 Notes 12 44

45 Summary Crawling Partitioning: balancing and contravariance Consistent hashing Indexing Document, term, topical, and usage-induced partitioning Computing ranking features PageRank with MapReduce Serving queries Routing queries, merging results, and caching postings CS 347 Notes 12 45

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing