Distributed computing: index building and use

Distributed computing: Goals
- Distribute computation across several machines to:
  - do one computation faster (latency)
  - do more computations in a given time (throughput)
  - tolerate the failure of one or more machines

Distributing computations
- Ideas? Finding results for a query? Building the index?
- Goals:
  - keep all machines busy
  - be able to replace badly-behaved machines seamlessly!

Distributed Query Evaluation: Strategies
- Assign different queries to different machines
- Break up a multi-term query: assign different query terms to different machines. Good/bad consequences?
- Break up the lexicon: assign different index terms to different machines. Good/bad consequences?
- Break up the postings lists: assign different documents to different machines (sketched below). Good/bad consequences?
- Do these keep all machines busy? Can they seamlessly replace badly-behaved machines?
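The last strategy, partitioning documents across machines, can be made concrete with a minimal Python sketch, assuming a toy in-memory index per shard and a simple term-overlap score; the shard layout, the scoring, and the function names are illustrative assumptions, not any production design.

```python
# Minimal sketch of document-partitioned query evaluation: each shard indexes
# its own subset of documents, the query is fanned out to every shard, and the
# per-shard top-k results are merged.
import heapq

def search_shard(shard_index, query_terms, k):
    """Score documents in one shard; return up to k (score, doc_id) pairs."""
    scores = {}
    for term in query_terms:
        for doc_id in shard_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1   # toy scoring: term overlap
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def search(shards, query_terms, k):
    """Fan the query out to all shards and merge the partial top-k lists."""
    partial = []
    for shard_index in shards:                 # in practice: parallel RPCs
        partial.extend(search_shard(shard_index, query_terms, k))
    return heapq.nlargest(k, partial)

# Two shards, each holding postings for its own documents.
shards = [
    {"map": [1, 2], "reduce": [2]},            # shard 0: docs 1-2
    {"map": [5], "index": [5, 7]},             # shard 1: docs 5-7
]
print(search(shards, ["map", "reduce"], k=2))  # [(2, 2), (1, 5)]
```

Every shard sees every query (keeping all machines busy), and a badly-behaved shard's documents can be re-assigned without touching the others, which is one reason this layout is attractive.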

Example: Google query evaluation circa 2002
- Parallelize the computation: distribute documents randomly to pieces of the index
- Pool of machines for each piece; choose one. Why random?
- Scheduler machines assign tasks to the pools of machines and monitor performance

Google Query Evaluation: Details circa 2002
- Query entered -> DNS-based redirection to one of several geographically distributed clusters
  (load balancing, fault tolerance, and round-trip time)
- Within the cluster, the query is directed to one Google Web Server (GWS)
  (load balancing and fault tolerance)
- The GWS distributes the query to the pools of machines
  (load sharing)
- Within each pool, the query is directed to one machine
  (load balancing and fault tolerance)

Issues for distributed documents
- How many results to take from each pool to get m results overall?
- Throughput limits? Each machine does full query evaluation; is disk access the limiting constraint?
- Distributing the index by term instead may help

Distributing computations
- ✓ Finding results for a query
- Building the index?
- Goals:
  - keep all machines busy
  - be able to replace badly-behaved machines seamlessly!

Distributed Index Building
- Can easily assign different documents to different machines
- Efficient?

Google Index Building circa 2003: MapReduce framework
- A programming model and an implementation for large clusters, for processing and generating large data sets
- Google introduced it for index building and PageRank
- The Apache Hadoop project developed open-source software implementing the model
- Other applications:
  - database queries (a join is like multi-term query evaluation)
  - statistics on queries in a given time period

MapReduce Programming Model
- Input set: {(input key_i, value_i) : 0 <= i < input size}; the user chooses the value type, e.g. a whole document
- Output set: {(output key_i, value_i) : 0 <= i < output size}
- Map (written by the user): (input key, value) -> {(intermediate key_j, value_j) : 0 <= j < Map result size}
- The system groups all Map output pairs by intermediate key (the shuffle phase), gathers the values for each intermediate key, and supplies them to Reduce via an iterator
- Reduce (written by the user) processes the intermediate values: (intermediate key, list of values) -> (output key, value)
- (a minimal word-count sketch of this model follows)
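As a concrete illustration of these signatures, here is a minimal single-machine Python sketch of the model on the classic word-count example; the function names and the in-memory shuffle are assumptions of this sketch, not the actual framework.

```python
# Minimal sketch of the MapReduce programming model (word count), simulated in
# memory; the real system distributes map, shuffle, and reduce over a cluster.
from collections import defaultdict

def map_fn(doc_id, contents):
    """Map: (input key, value) -> list of (intermediate key, value) pairs."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """Reduce: (intermediate key, list of values) -> (output key, value)."""
    return (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group all values by intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    return [reduce_fn(key, values) for key, values in groups.items()]

docs = [("d1", "to be or not to be"), ("d2", "to index is to map")]
print(map_reduce(docs, map_fn, reduce_fn))
# e.g. [('to', 4), ('be', 2), ('or', 1), ('not', 1), ('index', 1), ('is', 1), ('map', 1)]
```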

MapReduce for building an inverted index
- Input pair: (docID, contents of doc)
- Map: produce (term, docID) for each term appearing in the document
- Input to Reduce: for each term, the pair (term, list of docIDs)
- Output of Reduce: (term, sorted list of docIDs containing that term) - the postings list, keyed by term
- (sketched below)

Matrix-vector multiplication
- i, j range over elements of the matrix A and the vector v
- q ranges over chunks of v and the corresponding strips of A
- p ranges over chunks within a strip of A
- Input: tuples (q, (p, chunk A_{p,q}, chunk v_q))
- Map each input tuple to the tuples (i, x_{i,q}) for each i in the range of chunk p, where
  x_{i,q} = Σ_{j in chunk q} A_{i,j} v_j, the sum running over all j covered by v_q and A_{p,q}
- The shuffle gives (i, list of x_{i,q} over all q)
- Reduce to: (i, Σ_q x_{i,q}) = (i, Σ_q Σ_{j in q} A_{i,j} v_j) = (i, (Av)_i)

Matrix-vector multiplication diagram
- [Diagram: A is divided into k strips, one per chunk of v; each strip is divided into l chunks; v is divided into chunks 1 through k.]

Diagram of computation distribution
- See Figure 2.3 (p. 27) in Mining of Massive Datasets by Rajaraman, Leskovec, and Ullman
- Originally appeared as Figure 1 in "MapReduce: Simplified Data Processing on Large Clusters" by J. Dean and S. Ghemawat, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113
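The inverted-index construction above maps directly onto the model; the sketch below reuses the toy map_reduce driver from the word-count sketch earlier (that reuse, and the use of set() to emit each term once per document, are assumptions of this illustration).

```python
# Minimal sketch: Map and Reduce functions for inverted-index construction,
# run through the toy in-memory map_reduce driver defined above.
def index_map(doc_id, contents):
    """Map: emit (term, docID) for every distinct term in the document."""
    return [(term, doc_id) for term in set(contents.split())]

def index_reduce(term, doc_ids):
    """Reduce: (term, list of docIDs) -> (term, sorted postings list)."""
    return (term, sorted(doc_ids))

docs = [(1, "map reduce shuffle"), (2, "map shuffle sort"), (3, "reduce")]
print(sorted(map_reduce(docs, index_map, index_reduce)))
# [('map', [1, 2]), ('reduce', [1, 3]), ('shuffle', [1, 2]), ('sort', [2])]
```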

MapReduce parallelism
- The map phase and the shuffle phase may overlap
- The shuffle phase and the reduce phase may overlap
- The map phase must finish before the reduce phase starts: reduce depends on all values associated with a given key

MapReduce Fault Tolerance
- Master fails => restart the whole computation
- Worker node fails:
  - the Master detects the failure
  - all Map tasks assigned to that worker must be redone: the output of completed Map tasks is on the failed worker's disk
  - for a failed Map worker, the Master reschedules each Map task and notifies the reducer workers of the change in input location
  - for a failed Reduce worker, the Master reschedules each Reduce task
  - rescheduling occurs as live workers become available
- (a minimal sketch of this rescheduling logic follows)

Hadoop Remarks
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing; it includes MapReduce
- http://hadoop.apache.org/index.html

Google built on large collections of inexpensive commodity PCs
- There are always some machines not functioning
- Solve the fault-tolerance problem in software, through redundancy and flexibility, NOT with special-purpose hardware
- Keep machines relative generalists: when a machine becomes free, assign it to any one of a set of tasks
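The asymmetry between Map and Reduce in these rules can be illustrated with a toy master; the class, its fields, and the treatment of completed Reduce output as safely stored in a shared file system follow the published MapReduce design but are assumptions of this sketch, not the lecture's framework.

```python
# Minimal sketch of the Master's rescheduling rule for a failed worker.
# Toy data structures only; not the real MapReduce master.
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    def __init__(self, map_tasks, reduce_tasks):
        # task id -> (state, worker that holds/held it)
        self.map_tasks = {t: (IDLE, None) for t in map_tasks}
        self.reduce_tasks = {t: (IDLE, None) for t in reduce_tasks}

    def on_worker_failure(self, worker):
        # Completed Map output lives on the failed worker's local disk, so both
        # completed and in-progress Map tasks on that worker must be redone.
        for task, (state, w) in self.map_tasks.items():
            if w == worker and state in (IN_PROGRESS, COMPLETED):
                self.map_tasks[task] = (IDLE, None)
                self.notify_reducers(task)        # map output location will change
        # Only in-progress Reduce tasks need rescheduling here; completed
        # Reduce output is assumed to be in the shared file system.
        for task, (state, w) in self.reduce_tasks.items():
            if w == worker and state == IN_PROGRESS:
                self.reduce_tasks[task] = (IDLE, None)

    def notify_reducers(self, map_task):
        # Placeholder: tell reduce workers to re-fetch this map task's output
        # from wherever it is rescheduled.
        pass
```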

June 2010 - New Google index building: Caffeine
- Daily crawl of several billion documents
- Before: rebuild the index over new + existing documents with a series of 100 MapReduces; each document spent 2-3 days being indexed
- After: each document is fed through Percolator for incremental update of the index
  - documents indexed 100 times faster (median)
  - average age of a document in a search result decreased by nearly 50%

Percolator
- Built on top of Bigtable distributed storage: tens of petabytes in the indexing system
- Provides random access, but requires extra resources compared to MapReduce
- Provides transaction semantics: the repository transformation is highly concurrent and requires some consistency guarantees for the data
- Observers do the tasks and write to the table; writing to the table creates work for other observers
- Around 50 Bigtable operations to process one document

Bigtable Overview
- Distributed database system: one big, sparse table, sorted by row key
- Rows are partitioned into tablets, each a contiguous key space
- Tablet servers execute the operations; a large number of tablet servers gives performance
- Fault tolerance: replication of the data and of the transaction log; another server takes over for a failed server

Percolator observers
- Users write observer code, which runs distributed across a collection of machines
- An observer registers a function and a set of columns with Percolator; Percolator invokes the function after data is written in one of those columns (in any row)
- Percolator must find the dirty cells; the search is distributed across machines
- Avoid more than one observer for a single column
- (a hedged sketch of this observer pattern follows)
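The observer mechanism can be pictured as a registration table keyed by column; the class and method names below are invented for illustration and are NOT Percolator's actual API, and the real system finds dirty cells by distributed scanning rather than invoking observers inline as this toy version does.

```python
# A hedged, single-process sketch of the observer idea only.
class ObserverRegistry:
    def __init__(self):
        self.observers = {}          # column -> function(table, row, column, value)

    def register(self, columns, fn):
        """Register fn to be invoked after a write to any of these columns."""
        for column in columns:
            self.observers[column] = fn   # at most one observer per column

    def write(self, table, row, column, value):
        """Write a cell, then trigger the observer registered for that column."""
        table.setdefault(row, {})[column] = value
        fn = self.observers.get(column)
        if fn is not None:
            fn(table, row, column, value)  # may itself write, creating more work

# Usage: a toy indexing chain where one observer's write triggers the next.
registry = ObserverRegistry()
table = {}

def parse_document(table, row, column, value):
    registry.write(table, row, "parsed", value.lower())

def index_document(table, row, column, value):
    registry.write(table, row, "indexed", sorted(set(value.split())))

registry.register(["raw"], parse_document)
registry.register(["parsed"], index_document)
registry.write(table, "doc1", "raw", "MapReduce versus Percolator")
print(table["doc1"]["indexed"])   # ['mapreduce', 'percolator', 'versus']
```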

Caffeine versus MapReduce
- Caffeine uses roughly twice as many resources to process the same crawl rate
- The new document collection is currently 3x larger than in previous systems; the only limit is available disk space
- Documents are indexed 100 times faster (median)
- If the number of newly-crawled documents is near the size of the index, MapReduce is better: random lookup versus streaming

Earlybird: Real-Time Search at Twitter
- By many Twitter researchers (2012)
- Designed for the properties of tweets:
  - handle a high rate of queries
  - handle a large number of updates in real time (flash crowds; updated info, e.g. number of retweets)
  - large number of concurrent reads and writes
  - time stamp is the dominant ranking signal

Elements
- Distributed server architecture: tweets hash-partitioned across servers
- New concurrency management
- Customized query processing
- Customized inverted index

Query processing
- Full Boolean query language
- Results returned most recent first
- Personalized signals in the relevance algorithm (not described), e.g. the user's local social graph
- The actual query algorithm isn't particularly interesting: it reuses existing Lucene query evaluation code

Inverted Index
- Dictionary: hash table on term ID; the term ID points to the tail of the postings list
- Postings lists are organized in segments:
  - each server has a small number of segments (12)
  - each segment holds a small number of tweets, 2^23
  - only one segment is active; in-active segments are read-only

Active segment index
- A posting is a 32-bit integer: 24 bits of doc ID and 8 bits of term position; each occurrence in a tweet is a new posting (see the packing sketch below)
- Postings list: a pre-allocated integer array, with dynamic allocation
- Traversing newest-first = iterating backwards:
  - can traverse backwards from any point while concurrently adding new postings
  - can binary search for a doc ID, eliminating the need for skip pointers

In-active segments
- An in-active segment replaces an active segment when it is filled
- One fixed-size integer array; the dictionary points to the different postings lists
- Arranged reverse-chronologically
- Compressed:
  - short postings lists: as before
  - long postings lists: use gaps and block-based compression

Earlybird performance
- Compared to the prior MySQL-based system: 1000 tweets per second indexing, 12,000 queries per second
- Earlybird memory:
  - full active index segment (16M tweets): 6.7 GB
  - full in-active index segment: ~55% of the above
- Queries per second: 5000 for a fully-loaded server (114M tweets)
- Tweets per second: 7000 in a stress test under heavy query load
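The 32-bit posting layout can be illustrated with a small packing/unpacking sketch; the exact field order within the word is an assumption of this illustration.

```python
# Minimal sketch of the 24-bit doc ID / 8-bit term position posting described above.
DOC_ID_BITS, POSITION_BITS = 24, 8

def pack_posting(doc_id: int, position: int) -> int:
    """Pack (doc ID, term position) into one 32-bit integer posting."""
    assert 0 <= doc_id < (1 << DOC_ID_BITS)
    assert 0 <= position < (1 << POSITION_BITS)
    return (doc_id << POSITION_BITS) | position

def unpack_posting(posting: int) -> tuple[int, int]:
    """Recover (doc ID, term position) from a packed posting."""
    return posting >> POSITION_BITS, posting & ((1 << POSITION_BITS) - 1)

p = pack_posting(doc_id=8_388_000, position=17)
print(unpack_posting(p))   # (8388000, 17)
```

Putting the doc ID in the high-order bits means packed postings compare in doc ID order, which is presumably what makes the binary search for a doc ID mentioned above possible.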