MapReduce: Simplified Data Processing on Large Clusters


MapReduce: Simplified Data Processing on Large Clusters
Amir H. Payberah, amir@sics.se
Amirkabir University of Technology (Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 / 50

What do we do when there is too much data to process?

Scale Up vs. Scale Out (1/2)
Scale up, or scale vertically: adding resources to a single node in a system.
Scale out, or scale horizontally: adding more nodes to a system.

Scale Up vs. Scale Out (2/2)
Scale up: more expensive than scaling out.
Scale out: more challenging for fault tolerance and software development.

Taxonomy of Parallel Architectures
DeWitt, D. and Gray, J. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6), 85-98, 1992.

MapReduce
A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters of commodity hardware.

Challenges
- How to distribute computation?
- How can we make it easy to write distributed programs?
- How to handle machine failures?

Idea
Issue: copying data over a network takes time.
Idea: bring computation close to the data, and store files multiple times for reliability.

Simplicity
Don't worry about parallelization, fault tolerance, data distribution, and load balancing: MapReduce takes care of these.
It hides system-level details from programmers. Simplicity!

MapReduce Definition
A programming model: to batch process large data sets (inspired by functional programming).
An execution framework: to run parallel algorithms on clusters of commodity hardware.

Programming Model

Warm-up Task (1/2)
We have a huge text document and want to count the number of times each distinct word appears in it.
Sample application: analyze web server logs to find popular URLs.

Warm-up Task (2/2)
The file is too large for memory, but all (word, count) pairs fit in memory.
words(doc.txt) | sort | uniq -c
where words takes a file and outputs the words in it, one per line.
This captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
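The same pipeline can be mimicked in plain Java. This is an illustrative sketch, not part of the original slides; the class and method names are mine:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountPipeline {
    // Equivalent of: words(doc.txt) | sort | uniq -c
    // split = words, TreeMap = sort, counting() = uniq -c
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     .collect(Collectors.groupingBy(w -> w, TreeMap::new,
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("Hello World Bye World"));
        // prints {Bye=1, Hello=1, World=2}
    }
}
```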

MapReduce Overview
words(doc.txt) | sort | uniq -c
- Sequentially read a lot of data.
- Map: extract something you care about.
- Group by key: sort and shuffle.
- Reduce: aggregate, summarize, filter or transform.
- Write the result.

MapReduce Dataflow
map function: processes data and generates a set of intermediate key/value pairs.
reduce function: merges all intermediate values associated with the same intermediate key.
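In Dean and Ghemawat's notation, map has type (k1, v1) -> list(k2, v2) and reduce has type (k2, list(v2)) -> list(v2). A minimal word-count sketch against these signatures (the names and helper methods here are illustrative, not the slides' Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapReduceTypes {
    // map: (k1, v1) -> list(k2, v2); here (offset, line) -> [(word, 1), ...]
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            out.add(Map.entry(w, 1));
        return out;
    }

    // reduce: (k2, list(v2)) -> list(v2); here (word, [1, 1, ...]) -> [count]
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return List.of(sum);
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "Hello World"));         // [Hello=1, World=1]
        System.out.println(reduce("World", List.of(1, 1))); // [2]
    }
}
```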

Example: Word Count
Consider doing a word count of the following file using MapReduce:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Example: Word Count - map
The map function reads in words one at a time and outputs (word, 1) for each parsed input word.
The map function output is:
(Hello, 1) (World, 1) (Bye, 1) (World, 1)
(Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

Example: Word Count - shuffle
The shuffle phase between the map and reduce phases creates a list of values associated with each key.
The reduce function input is:
(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))

Example: Word Count - reduce
The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.
The output of the reduce function is the output of the MapReduce job:
(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
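To make the three phases concrete, here is a small in-memory simulation of the job above. It is a sketch of the dataflow only, not the Hadoop API; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MiniMapReduce {
    static SortedMap<String, Integer> run(String[] lines) {
        // Map phase: emit (word, 1) for every word of every input record.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                intermediate.add(Map.entry(w, 1));

        // Shuffle phase: group the values by key, sorted by key.
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : intermediate)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the value list of each key.
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet())
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[]{
            "Hello World Bye World", "Hello Hadoop Goodbye Hadoop"}));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```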

Combiner Function (1/2)
In some cases, there is significant repetition in the intermediate keys produced by each map task, and the reduce function is commutative and associative.
Machine 1: (Hello, 1) (World, 1) (Bye, 1) (World, 1)
Machine 2: (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

Combiner Function (2/2)
Users can specify an optional combiner function to partially merge intermediate data before it is sent over the network to the reduce function.
Typically, the same code is used to implement both the combiner and the reduce function.
Machine 1: (Hello, 1) (World, 2) (Bye, 1)
Machine 2: (Hello, 1) (Hadoop, 2) (Goodbye, 1)
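Since word count's reduce is commutative and associative, the combiner can run the same summation locally on each machine's map output before anything crosses the network. An illustrative sketch (the names are mine, not from the slides):

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class CombinerSketch {
    // Run the reduce logic locally on one machine's map output.
    static SortedMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput)
            local.merge(e.getKey(), e.getValue(), Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        // Machine 1's map output: (Hello, 1) (World, 1) (Bye, 1) (World, 1)
        System.out.println(combine(List.of(
            Map.entry("Hello", 1), Map.entry("World", 1),
            Map.entry("Bye", 1), Map.entry("World", 1))));
        // prints {Bye=1, Hello=1, World=2} -- only 3 pairs cross the network
    }
}
```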

Example: Word Count - map

public static class MyMap extends Mapper<...> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Example: Word Count - reduce

public static class MyReduce extends Reducer<...> {
  public void reduce(Text key, Iterator<...> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    while (values.hasNext())
      sum += values.next().get();
    context.write(key, new IntWritable(sum));
  }
}

Example: Word Count - driver

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setMapperClass(MyMap.class);
  job.setCombinerClass(MyReduce.class);
  job.setReducerClass(MyReduce.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  job.waitForCompletion(true);
}

Example: Word Count - Compile and Run (1/2)

# start hdfs
> hadoop-daemon.sh start namenode
> hadoop-daemon.sh start datanode

# make the input folder in hdfs
> hdfs dfs -mkdir -p input

# copy input files from local filesystem into hdfs
> hdfs dfs -put file0 input/file0
> hdfs dfs -put file1 input/file1

> hdfs dfs -ls input/
input/file0
input/file1

> hdfs dfs -cat input/file0
Hello World Bye World
> hdfs dfs -cat input/file1
Hello Hadoop Goodbye Hadoop

Example: Word Count - Compile and Run (2/2)

> mkdir wordcount_classes
> javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes sics/WordCount.java
> jar -cvf wordcount.jar -C wordcount_classes/ .
> hadoop jar wordcount.jar sics.WordCount input output

> hdfs dfs -ls output
output/part-00000
> hdfs dfs -cat output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Execution Engine

MapReduce Execution (1/7)
The user program divides the input files into M splits; a typical split size is the size of an HDFS block (64 MB). It converts them to key/value pairs.
It then starts up many copies of the program on a cluster of machines.
J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51(1), 2008.

MapReduce Execution (2/7)
One copy of the program is the master, and the rest are workers.
The master assigns work to the workers: it picks idle workers and assigns each one a map task or a reduce task.

MapReduce Execution (3/7)
A map worker reads the contents of the corresponding input splits.
It parses key/value pairs out of the input data and passes each pair to the user-defined map function.
The intermediate key/value pairs produced by the map function are buffered in memory.

MapReduce Execution (4/7)
The buffered pairs are periodically written to local disk, partitioned into R regions (hash(key) mod R).
The locations of the buffered pairs on the local disk are passed back to the master, which forwards these locations to the reduce workers.
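The partition step above is a pure function of the key. A sketch of the hash(key) mod R rule (the sign-bit masking is my addition, to keep the region index non-negative for keys whose hash code is negative):

```java
public class PartitionSketch {
    // hash(key) mod R: decides which of the R reduce workers gets this key.
    // Masking the sign bit keeps the result in [0, R) for negative hash codes.
    static int partition(String key, int R) {
        return (key.hashCode() & Integer.MAX_VALUE) % R;
    }

    public static void main(String[] args) {
        int R = 4;
        for (String key : new String[]{"Hello", "World", "Bye", "Hadoop"})
            System.out.println(key + " -> region " + partition(key, R));
        // Every occurrence of the same word lands in the same region,
        // so a single reduce worker sees the complete value list for that key.
    }
}
```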

MapReduce Execution (5/7)
A reduce worker reads the buffered data from the local disks of the map workers.
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys.

MapReduce Execution (6/7)
The reduce worker iterates over the intermediate data. For each unique intermediate key, it passes the key and the corresponding set of intermediate values to the user-defined reduce function.
The output of the reduce function is appended to a final output file for this reduce partition.

MapReduce Execution (7/7)
When all map tasks and reduce tasks have been completed, the master wakes up the user program.

Hadoop MapReduce and HDFS

Fault Tolerance - Worker
Detect failure via periodic heartbeats.
Re-execute in-progress map and reduce tasks.
Re-execute completed map tasks: their output is stored on the local disk of the failed machine and is therefore inaccessible.
Completed reduce tasks do not need to be re-executed, since their output is stored in a global filesystem.

Fault Tolerance - Master
The master's state is periodically checkpointed: a new copy of the master starts from the last checkpointed state.

MapReduce Weaknesses and Solving Techniques

W1: Access to Input Data
Weakness 1: the entire input is scanned to perform the map-side processing, and map tasks are initiated on all input partitions, although accessing only a subset of the input data would be enough in certain cases.
Lack of selective access to data; high communication cost.

S1: Access to Input Data
Solution 1: efficient access to data.
Indexing data: Hadoop++, HAIL
Intentional data placement: CoHadoop
Data layout: Llama, Cheetah, RCFile, CIF

W2: Redundant Processing and Recomputation
Weakness 2: similar processing is performed by different jobs over the same data.
Jobs are processed independently: redundant processing.
There is no way to reuse the results produced by previous jobs, although future jobs may require those results: recompute everything.

S2: Redundant Processing and Recomputation
Solution 2:
Batch processing of jobs: MRShare
Result sharing and materialization: ReStore
Incremental processing: Incoop

W3: Lack of Early Termination
Weakness 3: map tasks must process the entire input data before any reduce task can start processing.
Some jobs may need only a sample of the data, or quick retrieval of approximate results.

S3: Lack of Early Termination
Solution 3:
Sampling: EARL
Sorting: RanKloud

W4: Lack of Iteration
Weakness 4: to implement iterative processing, MapReduce programmers need to write a sequence of MapReduce jobs and coordinate their execution.
Data must be reloaded and reprocessed in each iteration.

S4: Lack of Iteration
Solution 4:
Looping, caching, pipelining: Stratosphere, HaLoop, MapReduce Online, NOVA, Twister, CBP, Pregel, PrIter
Incremental processing: Stratosphere, REX, Differential dataflow

W5: Lack of Interactive and Real-Time Processing
Weakness 5: the various overheads needed to guarantee fault tolerance negatively impact performance.
Many applications require fast response times, interactive analysis, and online analytics.

S5: Lack of Interactive and Real-Time Processing
Solution 5:
Streaming, pipelining: Dremel, Impala, Hyracks, Tenzing
In-memory processing: PowerDrill, Spark/Shark, M3R
Pre-computation: BlinkDB

Summary
Programming model: map and reduce
Execution framework
Batch processing
Shared-nothing architecture

Questions?