CS 61C: Great Ideas in Computer Architecture (Machine Structures), Fall 2010, Lecture #10: Request-Level Parallelism and Map-Reduce (9/20/10)

Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/fa10
Fall 2010, Lecture #10 (9/20/10)

Agenda
- Request- and Data-Level Parallelism
- Administrivia
- Technology Break
- Map-Reduce
- Examples

Request-Level Parallelism
- Hundreds or thousands of requests per second
  - Not your laptop or cell phone, but popular Internet services like Google search
- Such requests are largely independent
  - Little read-write (aka "producer-consumer") sharing
  - Mostly involve read-only databases
  - Rarely involve read-write data sharing or synchronization across requests
- Computation easily partitioned within a request and across different requests

Google Query-Serving Architecture

Anatomy of a Web Search
- Google "David A. Patterson"
- Direct request to closest Google datacenter
- Front-end load balancer directs request to one of many clusters within the datacenter
- Within cluster, select one of many Google Web Servers (GWS) to handle the request and compose the response pages
- GWS communicates with Index Servers to find documents that contain the search words "David", "Patterson"
- Return document list with associated relevance scores
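Because each search request only reads shared data, a front end can fan independent requests out across workers with no locking. A minimal single-machine sketch of that idea, assuming a toy in-memory index (the document set, query strings, and `handle_query` helper are all invented for illustration, not Google's API):

```python
# Hypothetical sketch of request-level parallelism: many independent,
# read-only requests served concurrently from one shared index.
from concurrent.futures import ThreadPoolExecutor

# Toy read-only "index": doc id -> document text (made-up data).
DOCS = {1: "great ideas in computer architecture",
        2: "map reduce on large clusters",
        3: "computer architecture lecture"}

def handle_query(query):
    # Each request only reads the shared index, so requests need
    # no synchronization and can run fully in parallel.
    words = query.lower().split()
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if all(w in text.split() for w in words))

queries = ["computer architecture", "map reduce", "lecture"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_query, queries))
print(results)  # [[1, 3], [2], [3]]
```

A real serving path adds load balancing across replicas of the index, but the key property is the same: no read-write sharing across requests.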

Anatomy of a Web Search (continued)
- In parallel:
  - Spell checker: "Did you mean David Paterson?"
  - Ad system: books by Patterson at Amazon.com
  - Images of David Patterson
- Use docids to access indexed documents
- Compose the page
  - Result document extracts (with keyword in context) ordered by relevance score
  - Sponsored links (along the top) and advertisements (along the sides)

Anatomy of a Web Search: Implementation Strategy
- Randomly distribute the entries
- Make many copies (aka replicas)
- Load balance requests across replicas
- Redundant copies of indices and documents
  - Break up hot spots, e.g., Justin Bieber
  - Increase opportunities for request-level parallelism
  - Make the system more tolerant of failures

Data-Parallel Divide and Conquer (Map-Reduce Processing)
- Map: slice data into shards, distribute these to workers, compute sub-problem solutions
    map(in_key, in_value) -> list(out_key, intermediate_value)
  - Processes an input key/value pair
  - Produces a set of intermediate pairs
- Reduce: collect and combine sub-problem solutions
    reduce(out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
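The two signatures above can be sketched as a tiny single-process driver. This is only an illustration of the abstraction, not Google's implementation; all names (`map_reduce`, `map_links`, the page data) are invented. The example mirrors the link-aggregation use case: map emits (target, source) pairs, reduce collects the sources for each target.

```python
# Minimal single-process sketch of the map/reduce abstraction.
# map_fn(in_key, in_value) -> list of (out_key, intermediate_value)
# reduce_fn(out_key, [values]) -> merged output for that key
from collections import defaultdict

def map_reduce(map_fn, reduce_fn, inputs):
    # Map phase: process each input record into intermediate pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # Reduce phase: combine all intermediate values for each key.
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

# Example: aggregate outgoing links by target page (made-up pages).
def map_links(page, links):
    return [(target, page) for target in links]

def reduce_links(target, sources):
    return sorted(sources)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
inverted = map_reduce(map_links, reduce_links, pages)
print(inverted)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```

The real system distributes the map and reduce calls across thousands of machines, but the programmer writes only the two functions.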
Google Uses Map-Reduce For, E.g.:
- Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document
- Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
- Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
- Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
- Fault-tolerant parallel execution of programs written in higher-level languages across a collection of input data

Map-Reduce Processing Time Line
- Master assigns map and reduce tasks to worker machines
- As soon as a map task finishes, its worker can be assigned a new map or reduce task
- Data shuffle begins as soon as a given map task finishes
- Reduce tasks begin as soon as all data shuffles finish



In the News: http://www.nytimes.com/2010/09/20/technology/20arm.html

Map-Reduce Processing Example: Count Word Occurrences

Pseudo code:

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));

Map-Reduce Processing: Execution
- Map: each map task emits intermediate (key, value) pairs
- Shuffle: each key hashes to a reduce task, e.g.:
  - key k1 hashes to Reduce Task 2
  - key k2 hashes to Reduce Task 1
  - key k3 hashes to Reduce Task 2
  - key k4 hashes to Reduce Task 1
  - key k5 hashes to Reduce Task 1
- Reduce/Combine
- Collect results
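The word-count pseudocode runs as-is once ported to a real language. A runnable Python sketch, with the shuffle step made explicit: each intermediate key is routed to hash(key) mod R, as in the "key k hashes to Reduce Task n" picture. The shard contents match the four-way word-index example; the two-task split and the simple deterministic hash are arbitrary choices for illustration.

```python
# Runnable word count with an explicit shuffle into reduce tasks.
from collections import defaultdict

def map_fn(doc_name, contents):
    return [(w, 1) for w in contents.split()]   # EmitIntermediate(w, 1)

def reduce_fn(word, counts):
    return sum(counts)                          # Emit(sum of counts)

shards = [("shard1", "that that is is"), ("shard2", "that that is not"),
          ("shard3", "is not is that"), ("shard4", "it it is")]

NUM_REDUCE_TASKS = 2
reduce_tasks = [defaultdict(list) for _ in range(NUM_REDUCE_TASKS)]

# Map + shuffle: route each pair to reduce task hash(key) % R.
# (A fixed letter-sum stands in for the hash so runs are repeatable.)
for name, text in shards:
    for word, count in map_fn(name, text):
        task = sum(map(ord, word)) % NUM_REDUCE_TASKS
        reduce_tasks[task][word].append(count)

# Reduce: each task combines the values for the keys it owns.
results = {}
for task in reduce_tasks:
    for word, counts in task.items():
        results[word] = reduce_fn(word, counts)
print(dict(sorted(results.items())))
# {'is': 6, 'it': 2, 'not': 2, 'that': 5}
```

Because the same hash is used by every map task, all counts for a given word land at the same reduce task, which is what lets each reduce run independently.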

Another Example: Word Index (How Often Does a Word Appear?)

Distribute:  "that that is is" | "that that is not" | "is not is that" | "it it is"
Map 1 ("that that is is"):   is 2, that 2
Map 2 ("that that is not"):  is 1, not 1, that 2
Map 3 ("is not is that"):    is 2, not 1, that 1
Map 4 ("it it is"):          is 1, it 2
Shuffle:  is 2,1,2,1 | it 2 | not 1,1 | that 2,2,1
Reduce 1: is 6; it 2
Reduce 2: not 2; that 5
Collect Results: is 6; it 2; not 2; that 5

Map-Reduce Failure Handling
- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion committed through master
- Master failure:
  - Could handle, but don't yet (master failure unlikely)
- Robust: lost 1600 of 1800 machines once, but finished fine

Map-Reduce Redundant Execution
- Slow workers significantly lengthen completion time
  - Other jobs consuming resources on machine
  - Bad disks with soft errors transfer data very slowly
  - Weird things: processor caches disabled (!!)
- Solution: near end of phase, spawn backup copies of tasks
  - Whichever one finishes first "wins"
- Effect: dramatically shortens job completion time

Map-Reduce Locality Optimization
- Master scheduling policy:
  - Asks GFS for locations of replicas of input file blocks
  - Map task inputs typically split into 64 MB chunks (== GFS block size)
  - Map tasks scheduled so a GFS input block replica is on the same machine or same rack
- Effect: thousands of machines read input at local disk speed
  - Without this, rack switches limit read rate

Summary
- Request-Level Parallelism
  - High request volume, each request largely independent of the others
  - Use replication for better request throughput and availability
- Map-Reduce Data Parallelism
  - Divide large data set into pieces for independent parallel processing
  - Combine and process intermediate results to obtain final result
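The backup-task trick from the Redundant Execution slide (spawn a duplicate of a straggling task near the end of a phase; whichever copy finishes first "wins") can be sketched on one machine with threads. The delays are invented to simulate a straggler; a real master would only launch backups for tasks still running late in a phase.

```python
# Hedged sketch of backup-task (redundant) execution: run two copies
# of the same task and take whichever finishes first.
import concurrent.futures as cf
import time

def task(delay, label):
    time.sleep(delay)      # stand-in for real work
    return label

with cf.ThreadPoolExecutor() as pool:
    # Original copy is a simulated straggler; backup copy is fast.
    copies = [pool.submit(task, 0.5, "original"),
              pool.submit(task, 0.05, "backup")]
    done, not_done = cf.wait(copies, return_when=cf.FIRST_COMPLETED)
    winner = done.pop().result()
    for f in not_done:
        f.cancel()         # the losing copy's result is discarded

print(winner)  # backup
```

The job's completion time is now governed by the faster of the two copies, which is why the slide reports a dramatic speedup when stragglers are common.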