CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University


CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE]
Shrideep Pallickara, Computer Science, Colorado State University

Frequently asked questions from the previous class survey (BitTorrent)
- What is the right chunk/piece size?
- How many trackers serve a torrent?
- Is each chunk downloaded from the same peer at a given time?
- How does a tracker know the location of chunks if the file is never downloaded by any leechers?
- Is the torrent file updated after every download?
- How can you prevent unauthorized users/subscribers from downloading content? What if someone maliciously modifies the file before the .torrent file is created?

Topics covered in this lecture
- MapReduce

MapReduce: Topics that we will cover
- Why MapReduce?
- What it is, and what it is not
- The core framework and the original Google paper
- Development of simple programs using Hadoop, the dominant MapReduce implementation

MapReduce
- A framework for processing data that resides on a large number of computers
- A very powerful framework: excellent for some problems, challenging or not applicable for other classes of problems

SLIDES CREATED BY: SHRIDEEP PALLICKARA

What is MapReduce?
- More a framework than a tool: you are required to fit (some folks shoehorn) your solution into the MapReduce framework
- MapReduce is not a feature, but rather a constraint

What does this constraint mean?
- It makes problem solving both easier and harder
- Clear boundaries for what you can and cannot do: you need to consider fewer options than you are used to
- But solving problems under constraints requires planning and a change in your thinking

But what does this get us? The tradeoff of being confined to the MapReduce framework?
- The ability to process data on a large number of computers
- More importantly, without having to worry about concurrency, scale, fault tolerance, and robustness

A challenge in writing MapReduce programs: Design!
- Good programmers can produce bad software due to poor design
- Good programmers can likewise produce bad MapReduce algorithms; only in this case your mistakes will be amplified
- Your job may be distributed over 100s or 1000s of machines and operate on a petabyte of data

MapReduce: Origins of the design
- Process crawled data and logs of web requests
- Several computations work on this raw data to compute derived data: inverted indices, representations of the graph structure of web documents, pages crawled per host, the most frequent queries in a day
- Most computations are conceptually straightforward, but the data is large
- Computations must be scalable: distributed across thousands of machines to complete in a reasonable amount of time

The complexity of managing distributed computations can obscure the simplicity of the original computation. Contributing factors:
1. How to parallelize the computation
2. How to distribute the data
3. How to handle failures

MapReduce was developed to cope with this complexity
- Express simple computations
- Hide the messy details of parallelization, data distribution, fault tolerance, and load balancing

MapReduce
- A programming model and an associated implementation for processing and generating large data sets

Programming model
- The computation takes a set of input key/value pairs and produces a set of output key/value pairs
- Express the computation as two functions: Map and Reduce

Map
- Takes an input pair
- Produces a set of intermediate key/value pairs

MapReduce library
- Groups all intermediate values associated with the same intermediate key
- Passes them to the Reduce function

Reduce function
- Accepts an intermediate key I and the set of values for that key
- Merges these values together to form a smaller set of values

Counting the number of occurrences of each word in a large collection of documents:

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1")

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The reduce function sums together all counts emitted for a particular word.

The MapReduce specification object contains
- The names of the input and output files
- Tuning parameters

Map and reduce functions have associated types drawn from different domains:
  map(k1, v1) -> list(k2, v2)
  reduce(k2, list(v2)) -> list(v2)
What is passed to and from the user-defined functions are strings; user code converts between strings and the appropriate types.
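The word-count map/reduce pseudocode above can be made concrete as a single-process simulation. This is only a sketch of the model, not the real distributed library: `map_fn`, `reduce_fn`, and `run_word_count` are hypothetical names, and the grouping loop stands in for the MapReduce library's shuffle step.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, "1") pair for every word in the document.
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, counts):
    # Sum together all counts emitted for a particular word.
    return str(sum(int(c) for c in counts))

def run_word_count(documents):
    # The library's role: run maps, group values by intermediate key, run reduces.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

docs = {"d1": "the quick fox", "d2": "the lazy dog the fox"}
print(run_word_count(docs))
# {'dog': '1', 'fox': '2', 'lazy': '1', 'quick': '1', 'the': '3'}
```

Note that, as on the slide, values cross the map/reduce boundary as strings and the user code converts them back with `int(...)`.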

Programs expressed as MapReduce computations

Distributed Grep
- Map: emit a line if it matches the specified pattern
- Reduce: just copy the intermediate data to the output

Term-Vector per Host
- Summarizes the important terms that occur in a set of documents as <word, frequency> pairs
- Map: emit a <hostname, term vector> pair for each input document
- Reduce: has all per-document term vectors for a given host; adds the term vectors together, discards infrequent terms, and emits a final <hostname, term vector> pair

IMPLEMENTATION OF THE RUNTIME
- Machines are commodity machines
- GFS is used to manage the data stored on the disks

Execution Overview: Part I
- Maps are distributed across multiple machines
- Automatic partitioning of the input data into M splits
- Splits are processed concurrently on different machines

Execution Overview: Part II
- The intermediate key space is partitioned into R pieces, e.g. hash(key) mod R
- User-specified parameters: the partitioning function and the number of partitions (R)
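The distributed-grep pattern above can be sketched in the same single-process style. The names `grep_map` and `grep_reduce`, the `"error"` pattern, and the sample log lines are all illustrative assumptions, not part of the original system.

```python
import re

PATTERN = re.compile(r"error")  # hypothetical pattern to grep for

def grep_map(filename, line):
    # Map: emit the line (keyed by file) if it matches the specified pattern.
    if PATTERN.search(line):
        yield (filename, line)

def grep_reduce(key, lines):
    # Reduce: identity function -- just copy intermediate data to the output.
    return lines

inputs = {"log1": ["ok start", "error: disk full"],
          "log2": ["error: timeout", "done"]}
matches = {}
for fname, lines in inputs.items():
    for line in lines:
        for key, value in grep_map(fname, line):
            matches.setdefault(key, []).append(value)
output = {key: grep_reduce(key, values) for key, values in matches.items()}
print(output)
# {'log1': ['error: disk full'], 'log2': ['error: timeout']}
```

The reduce being an identity function is typical for filtering-style jobs: all the work happens in the map phase.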

Execution Overview: Step I
- The MapReduce library in the user program splits the input files into M pieces (16-64 MB per piece)
- It then starts up copies of the program on a cluster of machines
[Figure: user program, Master, input splits 0-4, and output files 0-1]

Execution Overview: Step II
- One of the program copies is the Master; there are M map tasks and R reduce tasks to assign
- The Master picks idle workers and assigns each worker a map or reduce task

Execution Overview: Step III
- Workers that are assigned a map task read the contents of their input split
- They parse <key, value> pairs out of the input data and pass each pair to the user-defined Map function
- Intermediate <key, value> pairs produced by the Maps are buffered in memory

Execution Overview: Step IV
- Periodically, the buffered pairs are written to local disk, partitioned into regions by the partitioning function
- The locations of the buffered pairs on local disk are reported back to the Master
- The Master forwards these locations to the reduce workers

Execution Overview: Step V
- The Master notifies a Reduce worker about these locations
- The Reduce worker reads the buffered data from the local disks of the map workers
- After reading all the intermediate data, it sorts it by the intermediate key so that all occurrences of the same key are grouped together
- Many different keys may map to the same Reduce task
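The execution steps above can be traced in miniature with a single-process sketch: split into M pieces, buffer map output into R partitions via hash(key) mod R, then have each reduce task gather, sort, and group its partition. Everything here (`run_job`, the M and R values, the word-count map) is an illustrative stand-in for the real distributed runtime.

```python
from collections import defaultdict

M, R = 3, 2  # number of map splits and reduce partitions

def partition(key):
    # Step IV: the partitioning function, e.g. hash(key) mod R.
    return hash(key) % R

def word_count_map(split):
    for line in split:
        for word in line.split():
            yield (word, 1)

def run_job(lines):
    # Step I: split the input into M pieces.
    splits = [lines[i::M] for i in range(M)]
    # Steps III-IV: run each map task, buffering pairs into R partitions per map.
    buffered = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in word_count_map(split):
            buffered[m][partition(key)].append((key, value))
    # Steps V-VI: each reduce task reads its partition from every map's output,
    # sorts by intermediate key, groups, and reduces.
    output = {}
    for r in range(R):
        intermediate = sorted(pair for m in range(M) for pair in buffered[m][r])
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)
        for key, values in groups.items():
            output[key] = sum(values)
    return output

print(run_job(["a b a", "c a", "b"]))  # word -> total count
```

Note how many different keys land in the same reduce partition, and how sorting within a partition is what brings all occurrences of a key together.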

Execution Overview: Step VI
- Processing data at the Reduce worker: iterate over the sorted intermediate data
- For each unique key, pass the key and the set of intermediate values to the Reduce function
- The output of the Reduce function is appended to the output file for that reduce partition

Execution Overview: Step VII
- Waking up the user: after all Map and Reduce tasks have been completed, control returns to the user code

TASK GRANULARITY
- Subdivide the map phase into M pieces and the reduce phase into R pieces
- M, R >> the number of worker machines
- Each worker performs many different tasks: this improves dynamic load balancing and speeds up recovery during failures
- Practical bounds on how large M and R can be: the Master must make O(M + R) scheduling decisions and keep O(M x R) state in memory

Master Data Structures
- For each Map and Reduce task: its state {idle, in-progress, completed} and the identity of the worker machine
- For each completed Map task: the locations and sizes of the R intermediate file regions
- This information is pushed incrementally to in-progress Reduce tasks
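The master's bookkeeping described above can be sketched as a pair of data structures. This is a minimal illustration, not the real implementation: `TaskState`, `MasterState`, and the worker/region names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    # For each map and reduce task: its state and the worker machine's identity.
    state: str = "idle"       # one of: idle, in-progress, completed
    worker: str = ""

@dataclass
class MasterState:
    map_tasks: dict = field(default_factory=dict)     # map task id -> TaskState
    reduce_tasks: dict = field(default_factory=dict)  # reduce task id -> TaskState
    # For each completed map task: locations/sizes of its R intermediate regions.
    map_outputs: dict = field(default_factory=dict)   # map task id -> [(location, size)]

    def assign(self, kind, task_id, worker):
        tasks = self.map_tasks if kind == "map" else self.reduce_tasks
        tasks[task_id].state = "in-progress"
        tasks[task_id].worker = worker

    def complete_map(self, task_id, regions):
        self.map_tasks[task_id].state = "completed"
        # In the real system this is pushed incrementally to in-progress reduces.
        self.map_outputs[task_id] = regions

master = MasterState(map_tasks={0: TaskState()}, reduce_tasks={0: TaskState()})
master.assign("map", 0, "worker-7")
master.complete_map(0, [("worker-7:/local/part-0", 1024)])
print(master.map_tasks[0].state)  # completed
```

The O(M + R) scheduling cost shows up as one state transition per task; the O(M x R) memory cost is the `map_outputs` table, which holds R region entries for each of the M map tasks.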

The contents of this slide set are based on the following references
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
- Donald Miner and Adam Shook: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. 1st Edition. O'Reilly Media. ISBN: 978-1449327170. [Chapter 1]