PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS

PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS Great Ideas in ICT - June 16, 2016 Irene Finocchi (finocchi@di.uniroma1.it)

How big? The scale of things

Data deluge Every 2 days we create as much information (about 5 Exabytes) as we did from the dawn of man through 2003 (August 2010, Eric Schmidt, executive chairman @ Google) Most data is born digital

Where does data come from?

Giga, Tera, Peta, Exa 1 Brontobyte = 10^3 Yottabytes

How much information is that? GIGA 1 Gigabyte: broadcast quality movie 20 Gigabytes: audio collection of Beethoven's works 100 Gigabytes: library floor of books or academic journals on shelves TERA 1 Terabyte: all the X-ray films in a large technological hospital; 50,000 trees made into paper and printed 10 Terabytes: printed collection of the U.S. Library of Congress

How much information is that? PETA 1 Petabyte: 3 years of Earth Observing System data (2001) 2 Petabytes: all U.S. academic research libraries 200 Petabytes: all printed material EXA 5 Exabytes: all words ever spoken by human beings? ZETTA

Storage media Megabytes, Gigabytes, Terabytes, Petabytes (and beyond)

Packing data into DVDs 800 Exabytes = a stack of DVDs reaching from the Earth to the Moon and back 35 ZB = a stack of DVDs that would reach halfway to Mars

Infrastructures for big data

Single-node architecture (CPU, memory, disk): not suitable for big data

Storage trends 1990: a typical disk could store about 1370 MB (~1 GB), with a transfer speed of 4.4 MB/s, so a full disk could be read in about 5 minutes. Today: 1 TB disks are the norm, with a transfer speed of about 100 MB/s, so reading a full disk takes roughly 10,000 seconds, i.e., more than 2.5 hours! Solution: use many disks in parallel! #1! Issues: disk failures are very common; how do we combine data spread over many different disks?

Clusters of commodity machines Since we cannot mine tens to hundreds of Terabytes of data on a single server, a standard architecture has emerged: a cluster of commodity Linux nodes with Gigabit Ethernet interconnections. Scale out, not up! #2!

Real cluster architecture

Building blocks Source: Barroso and Hölzle (2009)

Cluster architecture Each rack contains 16-64 nodes. Sample node configuration: 8 x 2 GHz cores, 8 GB RAM, 4 disks (4 TB)

Parallelism

Parallelize the computation Divide and conquer: the work to do is partitioned into units w1, w2, w3; each unit is processed by a worker, producing partial results r1, r2, r3, which are combined into the final result.

Parallelization challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What's the common theme of all of these problems?

Common theme Parallelization problems arise from: Communication between workers (to exchange state) Access to shared resources (data) Thus, we need synchronization mechanisms

Source: Ricardo Guimarães Herrmann

Managing multiple workers Difficult because we don't know: the order in which workers run; when workers interrupt each other; when workers need to communicate partial results; the order in which workers access shared data. Thus, we need semaphores (lock, unlock), condition variables (wait, notify, broadcast), barriers. Still, lots of problems: deadlock, livelock, race conditions... Dining philosophers, sleeping barbers, cigarette smokers... Moral of the story: be careful!

Current tools Programming models: shared memory (pthreads), message passing (MPI). Design patterns: master-slaves, producer-consumer flows, shared work queues. [Diagrams: processes P1-P5 over a shared memory; P1-P5 exchanging messages; a master with slaves; producers and consumers around a work queue]

Source: Wikipedia (Flat Tire)

Where the rubber meets the road Concurrency/parallelism is difficult to reason about, and even more difficult at the scale of datacenters (and across datacenters), in the presence of failures, and in terms of multiple interacting services. Not to mention debugging. The reality: lots of one-off solutions and custom code; write your own dedicated library, then program with it; the burden is on the programmer to explicitly manage everything.

Big data systems #3! Have a runtime system that manages (most) low-level aspects: handles scheduling (assigns workers to map and reduce tasks); handles data distribution (moves processes to data); handles synchronization (gathers, sorts, and shuffles intermediate data); handles errors and faults (detects worker failures and restarts them).

MAPREDUCE A big data system for batch processing

Stable storage First-order problem: if nodes can fail, how can we store data persistently? Answer: Distributed File System Provides global file namespace Google GFS; Hadoop HDFS; Kosmix KFS Typical usage pattern Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common

Distributed file system Chunk servers: the file is split into contiguous chunks, typically 16-64 MB each; each chunk is replicated (usually 3x), trying to keep one replica on the same rack and the other two replicas in a different rack. Master node: stores metadata; might itself be replicated (a.k.a. Name Node in HDFS). Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.

Warm up: word count We have a large file of words, one word per line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs

Different scenarios Case 1: the entire file fits in memory: load the file into main memory and keep a hash table with <word, count> pairs. Case 2: the file is too large for memory, but all <word, count> pairs fit in memory: scan the file on disk, keeping the <word, count> hash table in main memory. Case 3: too many distinct words to fit in memory: external sort, then scan the file (all occurrences of the same word are consecutive, so one running counter suffices): sort datafile | uniq -c
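For concreteness, a minimal single-machine sketch of Cases 1 and 2, assuming one word per line as in the warm-up and that the <word, count> table fits in memory (class and path names are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Scan the file once, keeping a <word, count> hash table in main memory.
public class InMemoryWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(w -> counts.merge(w, 1L, Long::sum));  // one word per line
        }
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}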

Making things a little bit harder Now suppose we have a large corpus of documents, and we want to count the number of times each distinct word occurs in the corpus: words(docs/*) | sort | uniq -c where words takes a file and outputs the words in it, one per line. The above captures the essence of MapReduce; the great thing is that it is naturally parallelizable.

The origins (2004) Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Jeffrey Dean & Sanjay Ghemawat [OSDI 2004]

Map in Lisp

Reduce in Lisp

MapReduce in Lisp

MapReduce can refer to: the programming model, the execution framework (a.k.a. "runtime system"), or a specific implementation. Usage is usually clear from context.

Programming model

Data as <key,value> pairs Everything is built on top of <key,value> pairs; keys and values are user-defined and can be anything. Only two user-defined functions: Map: map(k1, v1) → list(k2, v2) #4! (given input data <k1, v1>, produce intermediate data v2 labeled with key k2). Reduce: reduce(k2, list(v2)) → list(v3), which preserves the key (given the list of values list(v2) associated with a key k2, return a list of values list(v3) associated with the same key).
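In Hadoop these two functions correspond to the map() and reduce() methods of the Mapper and Reducer classes. A sketch of the new-API signatures, with generic type parameters standing in for k1/v1/k2/v2/v3 (class names are illustrative; the concrete WordCount versions appear on later slides):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): called once per input record; each call
// may emit any number of intermediate pairs via context.write(k2, v2).
class GenericMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // ... context.write(k2, v2);
    }
}

// reduce(k2, list(v2)) -> list(v3): called once per intermediate key,
// with an Iterable over all values sharing that key; the key is preserved.
class GenericReducer<K2, V2, V3> extends Reducer<K2, V2, K2, V3> {
    @Override
    protected void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // ... context.write(key, v3);
    }
}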

Where is parallelism? All mappers run in parallel; all reducers run in parallel; different pairs are transparently distributed across the available machines (parallel disks!). map(k1, v1) → list(k2, v2). Shuffle: group values with the same key to be passed to a single reducer. reduce(k2, list(v2)) → list(v3).

What you don't need to care about The underlying runtime system: automatically parallelizes the computation across large-scale clusters of machines; handles machine failures (including disk failures); schedules inter-machine communication to make efficient use of the network and disks.

THE MapReduce example: WordCount
map(String key, String value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(String key, Iterator<int> values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  emit(key, result)

WordCount data flow

A programmer s perspective The beauty of MapReduce is that any programmer can understand it, and its power comes from being able to harness thousands of computers behind that simple interface. David Patterson

Success story #1: sorting Nov 2008: 1 TB, 1,000 computers, 68 seconds (previous record: 910 computers, 209 seconds) Nov 2008: 1 PB, 4,000 computers, 6 hours, 48,000 hard disks Sept 2011: 1 PB, 8,000 computers, 33 minutes Sept 2011: 10 PB, 8,000 computers, 6.5 hours

Success story #2: MapReduce + AWS Ability to rent computing by the hour, plus additional services (e.g., persistent storage). For instance, Amazon Web Services: rent your compute instances on Amazon's cloud; Amazon Elastic MapReduce lets you set up and run applications on your cluster in minutes, at very low prices. Useful AWS services: Elastic Compute Cloud (EC2); persistent storage (S3); Elastic MapReduce (run Hadoop on EC2).

Success story #2: MapReduce + AWS The New York Times needed to generate PDF files for 11,000,000 articles (every article from 1851 to 1980) in the form of images scanned from the original paper. Each article is composed of numerous TIFF images that are scaled and glued together; the code for generating the PDFs is quite straightforward.

Success story #2: MapReduce + AWS 4TB of scanned articles sent to Amazon S3 A cluster of EC2 machines configured to distribute the PDF generation via Hadoop (open source MapReduce) Using 100 EC2 instances, in 24 hours the New York Times was able to convert the 4TB of scanned articles into 1.5TB of PDF documents Embarrassingly parallel problem

MapReduce vs other systems 1. Relational Database Management Systems (RDBMS)

MapReduce vs other systems 2. Grid computing and HPC (high-performance computing) Expensive (parallel) architectures Very good for compute-intensive jobs Problematic handling large data volumes: network bandwidth is the bottleneck Explicitly distribute the work across clusters of machines and manage data flow 3. Volunteer computing SETI@home (Search for Extra Terrestrial Intelligence) Donate your CPU time (processing about 0.35MB of radio telescope data) Data moved to your computer

Runtime system

Data flow principles Input and final output stored on distributed file system Data locality: scheduler tries to schedule map tasks close to physical storage location of input data Bring code to the data, not data to the code! #5! Intermediate results stored on local FS of map and reduce workers

Distributed execution overview The user program forks a master and a set of workers. The master assigns map tasks (their number M depends on the input splits) and reduce tasks (their number R can be chosen by the programmer) to idle workers. Map workers read their input split (Split 0, Split 1, Split 2, ...), apply the map function, and write intermediate results to local disk; reduce workers remotely read and sort the intermediate data, apply the reduce function, and write the final output files (Output File 0, Output File 1, ...).

Splits Inputs to map tasks are created as contiguous splits of the input file. Default split size: 64 MB/128 MB (can be changed by setting the parameter mapred.max.split.size). Depending on the cluster size, more than one split may end up being processed by the same worker.
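In the new API the same knob can be set programmatically; a sketch (the 64 MB value and class name are just examples, and the slide's mapred.* property appears to be the older spelling of the same setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // Cap splits at 64 MB so that each map task reads at most 64 MB of input.
    static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "split-size-example");
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // roughly equivalent, via the configuration property named on the slide:
        // job.getConfiguration().setLong("mapred.max.split.size", 64L * 1024 * 1024);
        return job;
    }
}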

Partition function For the reduce phase, we need to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, e.g., hash(key) mod R. Sometimes it is useful to override it: e.g., hash(hostname(url)) mod R ensures that URLs from the same host end up in the same output file.
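A hedged sketch of that override in Hadoop (the class name, the crude hostname extraction, and the IntWritable value type are illustrative assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Send all URLs of the same host to the same reducer:
// partition = hash(hostname(url)) mod R
public class HostPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text url, IntWritable value, int numReduceTasks) {
        String host = url.toString().replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        int slash = host.indexOf('/');
        if (slash >= 0) {
            host = host.substring(0, slash);   // keep only the hostname part
        }
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // non-negative index
    }
}
// Enabled in the driver with: job.setPartitionerClass(HostPartitioner.class);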

Coordination Master data structures Task status: (idle, in-progress, completed) Idle tasks get scheduled as workers become available When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer Master pushes this info to reducers Master pings workers periodically to detect failures

Failures Map worker failure: map tasks completed or in-progress at the worker are reset to idle; reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure: only in-progress tasks are reset to idle. Master failure: the MapReduce job is aborted and the client is notified.

Pipelines of MapReduce jobs Output is often the input to another MapReduce job, so a computation becomes a sequence of rounds. The shuffle is expensive: minimize the number of rounds!
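A minimal sketch of such a pipeline as a Hadoop driver (the paths, job names, and the omitted mapper/reducer settings are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoRoundPipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Round 1: read the raw input, write to an intermediate directory on the DFS.
        Job round1 = Job.getInstance(conf, "round-1");
        round1.setJarByClass(TwoRoundPipeline.class);
        // round1.setMapperClass(...); round1.setReducerClass(...);
        FileInputFormat.addInputPath(round1, new Path(args[0]));
        FileOutputFormat.setOutputPath(round1, new Path("intermediate"));
        if (!round1.waitForCompletion(true)) System.exit(1);

        // Round 2: consume round 1's output; a full shuffle happens again here,
        // which is exactly why the number of rounds should be minimized.
        Job round2 = Job.getInstance(conf, "round-2");
        round2.setJarByClass(TwoRoundPipeline.class);
        // round2.setMapperClass(...); round2.setReducerClass(...);
        FileInputFormat.addInputPath(round2, new Path("intermediate"));
        FileOutputFormat.setOutputPath(round2, new Path(args[1]));
        System.exit(round2.waitForCompletion(true) ? 0 : 1);
    }
}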

Implementations

Implementations of MapReduce Google MapReduce: not available outside Google. Hadoop: an Apache project; open-source implementation in Java (you just need to copy the Java libs to each machine of your cluster); uses HDFS for stable storage. Download: http://lucene.apache.org/hadoop/ Aster Data: a cluster-optimized SQL database that also implements MapReduce.

The APIs The official JavaDoc of all Hadoop 2.7.2 classes is available at the following URL: http://hadoop.apache.org/docs/r2.7.2/api/overview-summary.html We will quickly review the interfaces, classes, and methods we need to run our first Hadoop program.

Jobs in Hadoop A Job represents a packaged Hadoop job for submission to the cluster. Need to specify: input and output paths; input and output formats; mapper and reducer classes (+ other optional stuff: combiner, partitioner); intermediate/final key/value classes; number of reducers (but not mappers, why?). *Note that there are two versions of the API.

Job creation and configuration example
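A sketch of what such a driver might look like for the WordCount classes shown on the later slides (class names, paths, and the reducer count are illustrative, not necessarily the original example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // input/output paths (formats default to TextInputFormat/TextOutputFormat)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // mapper and reducer classes (the MyMapper/MyReducer from the
        // WordCount slides, assumed nested in or visible to this class)
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // final key/value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // number of reducers; the number of mappers follows from the input splits
        job.setNumReduceTasks(4);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}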

Mapper and Reducer classes

Methods of the Mapper class

Inside a Mapper task
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

Methods of the Reducer class
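The Reducer has the same setup/reduce/cleanup structure; its driver loop looks roughly like this (a sketch mirroring the Mapper's run() above, iterating over distinct keys rather than individual pairs):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // one reduce() call per distinct intermediate key, with an Iterable
        // over all the values that share that key
        reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
}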

The Context(s)

Hello World: Word Count
private static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final static Text WORD = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            WORD.set(itr.nextToken());
            context.write(WORD, ONE);
        }
    }
}

Hello World: Word Count
private static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final static IntWritable SUM = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        Iterator<IntWritable> iter = values.iterator();
        int sum = 0;
        while (iter.hasNext()) {
            sum += iter.next().get();
        }
        SUM.set(sum);
        context.write(key, SUM);
    }
}

Good coding practices in MapReduce

A bad MapReduce program Problem: sum of n values a_1 ... a_n. Map: given <-, a_i>, emit <$, a_i>. Reduce: input <$; a_3, a_1, a_4, ...>; sum the n values received as input. Reducers use linear space! No parallelization! The same key is associated with every input value: sequential computation!

A better solution Map round 1: group the values into √n equally-sized groups; emit <$_⌈i/√n⌉, a_i>. Reduce round 1: input <$_j; a_k, a_h, ...>; sum the √n input values and emit <$_j, sum_j>. Round 2: sum the √n partial sums (use the previous algorithm). Reducers use sublinear (√n) space! √n sum computations in parallel!
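A hedged Hadoop sketch of round 1 (G plays the role of √n and would be tuned to the input size; the per-mapper round-robin counter keeps the groups roughly equally sized; round 2 would reuse the same reducer logic after re-keying the G partial sums under a single key; each class would live in its own file):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Round 1 mapper: assign each value to one of G groups.
class GroupSumMapper extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
    private static final long G = 1000;   // pick G close to sqrt(n)
    private long counter = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        long value = Long.parseLong(line.toString().trim());
        context.write(new LongWritable(counter++ % G), new LongWritable(value));
    }
}

// Round 1 reducer: each reducer sums only the ~n/G values of its own group,
// so it uses far less than linear space.
class GroupSumReducer extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    @Override
    protected void reduce(LongWritable group, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) sum += v.get();
        context.write(group, new LongWritable(sum));   // <group, partial sum>
    }
}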

A (theoretical) computational model For an input of size n: sublinear machine memory, O(n^(1-ε)) for some ε>0; sublinear number of machines, O(n^(1-ε)) for some ε>0; this implies total memory O(n^(2-2ε)), so data cannot be replicated too much; shuffle is expensive, so O(polylog n) rounds, striving for O(1); mappers/reducers run in polynomial time. Karloff, Suri & Vassilvitskii [ACM-SIAM SODA 2010]

The MapReduce class MRC^i MRC^i = MapReduce algorithms that satisfy the constraints on the previous slide and work in O(log^i n) rounds. MRC^0 = O(1) rounds.

Example: analysis of MapReduce sum Algorithm 1 does not fit in the MRC class (linear space). Algorithm 2 fits in the MRC^0 class (ε = ½, 2 rounds).

(Some) practical hints Minimize number of rounds Shuffle is expensive Reduce number of intermediate <key,value> pairs Network communication is expensive Use combiners and local aggregation wherever possible

What is a combiner? Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k (e.g., popular words in WordCount). We can save space and communication time by pre-aggregating at the mapper: combine(k1, list(v1)) → v2, usually the same as the reduce function. This works only if the reduce function is commutative and associative.
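In Hadoop a combiner is specified as a Reducer class run on each mapper's output; for WordCount the reducer itself qualifies, since integer addition is commutative and associative. A sketch, assuming the driver and MyReducer from the other slides:

// In the driver:
job.setCombinerClass(MyReducer.class);
// Each map task then emits one (word, partial count) pair per distinct
// word it saw, instead of one (word, 1) pair per occurrence.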

(Some) practical hints Minimize the number of rounds (shuffle is expensive). Reduce the number of intermediate <key,value> pairs (network communication is expensive): use combiners and local aggregation wherever possible. Take care of load balancing (by careful algorithm design): the curse of the last reducer means wall-clock time is proportional to the slowest task. Store your input in a few large files rather than in many small files, for best usage of the underlying DFS.

How many (Map and) Reduce tasks? M map tasks, R reduce tasks. M is the number of splits (one DFS chunk per map task). Rule of thumb: make R larger than the number of nodes in the cluster; this improves dynamic load balancing and speeds recovery from worker failure. Usually R is smaller than M, because the output is spread across R files.
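R is set explicitly in the driver, while M follows from the input splits; a sketch (the value 30 is just an example for a cluster of, say, 10 nodes):

// In the driver:
job.setNumReduceTasks(30);   // R > number of nodes: better load balancing and faster recovery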

Beyond MapReduce

References J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004 and CACM 2008. http://labs.google.com/papers/mapreduce.html T. White, "Hadoop: The Definitive Guide" (4th ed.), O'Reilly.

Acknowledgments Part of the slides have been adapted from course material associated with the following textbooks: "Mining of Massive Datasets" by Leskovec, Rajaraman & Ullman; "Data-Intensive Text Processing with MapReduce" by Jimmy Lin.

Source: Wikipedia (Japanese rock garden) Questions?