Introduction to MapReduce (cont.)



Rafael Ferreira da Silva
rafsilva@isi.edu
http://rafaelsilva.com

USC INF 553 Foundations and Applications of Data Mining (Fall 2018)


MapReduce: Summary

MapReduce: Examples

To run the examples, start a PySpark notebook in Docker:

    $ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

- Use file: MapReduce-PySpark-Examples-2.ipynb
- Download tip.json from the Yelp dataset challenge: https://www.yelp.com/dataset/challenge
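As a taste of what the notebook covers, here is a minimal PySpark word count; the input path "input.txt" is hypothetical:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")

    # Split each line into words, emit (word, 1) pairs, and sum per key.
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    sc.stop()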

Combiners
- Sometimes a Reduce function is associative and commutative: values can be combined in any order, with the same result
- Can push some of what Reducers do to the Map task
- Reduces network traffic for data sent to the Reduce task
- Example: word count. Instead of producing many pairs (w, 1), (w, 1), ..., sum the n local occurrences and emit (w, n)
- Aggregation is still required at the Reduce task, for key-value pairs coming from multiple Map tasks
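A minimal sketch of this idea in plain Python (function names are illustrative, not a framework API):

    from collections import Counter

    def map_with_combiner(document):
        # Sum the n local occurrences of each word and emit (w, n)
        # instead of many (w, 1) pairs.
        return list(Counter(document.split()).items())

    def reduce_fn(word, partial_counts):
        # Aggregation is still needed across Map tasks.
        return (word, sum(partial_counts))

For example, map_with_combiner("to be or not to be") yields [("to", 2), ("be", 2), ("or", 1), ("not", 1)] instead of six (w, 1) pairs.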

Example: Build an Inverted Index

Input:
- tweet1, ("I love pancakes for breakfast")
- tweet2, ("I dislike pancakes")
- tweet3, ("What should I eat for breakfast?")
- tweet4, ("I love to eat")

Desired output:
- pancakes, (tweet1, tweet2)
- breakfast, (tweet1, tweet3)
- eat, (tweet3, tweet4)
- love, (tweet1, tweet4)

Map task: for each word, emit (word, tweetId) as an intermediate key-value pair
Reduce task: the Reduce function then just emits each key and the list of tweetIds associated with that key
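A sketch of the two functions in Python, with a tiny driver standing in for the shuffle (all names are illustrative):

    from collections import defaultdict

    def map_fn(tweet_id, text):
        # Emit (word, tweet_id) for each word in the tweet.
        for word in text.split():
            yield (word.strip("?,.!").lower(), tweet_id)

    def reduce_fn(word, tweet_ids):
        # Emit the word and the list of tweets containing it.
        return (word, sorted(set(tweet_ids)))

    # Simulated shuffle: group intermediate pairs by key.
    tweets = {"tweet1": "I love pancakes for breakfast",
              "tweet2": "I dislike pancakes"}
    groups = defaultdict(list)
    for tid, text in tweets.items():
        for word, t in map_fn(tid, text):
            groups[word].append(t)
    print([reduce_fn(w, ts) for w, ts in groups.items()])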

MapReduce Environment

The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group-by-key step (in practice this is the bottleneck)
- Handling machine failures
- Managing required inter-machine communication

Data Flow
- Input and final output are stored on a distributed file system (HDFS)
- The scheduler tries to schedule map tasks close to the physical storage location of their input data
- Intermediate results are stored on the local file system of the Map and Reduce workers
- Output is often the input to another MapReduce task

Coordination: Master
- The Master node takes care of coordination
- Task status: idle, in-progress, or completed
- Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the locations and sizes of its intermediate files, one for each reducer
- The master pushes this info to the reducers
- The master pings workers periodically to detect failures

Dealing with failures
- Map worker failure
  - Map tasks completed or in-progress at the worker are reset to idle and rescheduled
  - Reduce workers are notified when a task is rescheduled on another worker
- Reduce worker failure
  - Only in-progress tasks are reset to idle. Why? Completed reduce output is already stored in the distributed file system, whereas map output lives only on the failed worker's local disk
  - The reduce task is restarted
- Master failure
  - The MapReduce task is aborted and the client is notified

How many Map and Reduce tasks?
- M map tasks, R reduce tasks
- Rule of thumb: make M much larger than the number of nodes in the cluster
  - One DFS chunk per map task is common
  - Improves dynamic load balancing and speeds up recovery from worker failures
- Usually R is smaller than M, because the output is spread across R files
- Google example: often 200,000 map tasks and 5,000 reduce tasks on 2,000 machines

Task granularity and pipelining
- Fine-granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Allows shuffling to be pipelined with map execution: multiple mappers on the same machine, one computing while another shuffles its output
  - Better dynamic load balancing


Refinements: Backup Tasks
- Problem: slow workers significantly lengthen the job completion time
  - Other jobs running on the machine
  - Bad disks
  - Weird things
- Solution: near the end of a phase, spawn backup copies of the remaining tasks; whichever copy finishes first wins
- Effect: dramatically shortens job completion time

Refinements: Combiners
- Back to our word-count example: a combiner pre-aggregates the values of each key produced by a single mapper
- Much less data needs to be copied and shuffled!
- Works if the reduce function is commutative and associative

Refinement: Partition function
- Want to control how keys get partitioned
  - Inputs to map tasks are created by contiguous splits of the input file
  - For reduce, the system needs to ensure that records with the same intermediate key end up at the same worker
- The system uses a default partition function: hash(key) mod R
- Sometimes it is useful to override the hash function
  - E.g., have alphabetical or numeric key ranges go to different Reduce tasks (sorting)
  - E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
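A sketch of both partition functions in plain Python (in Hadoop this would be a custom Partitioner class; R and the function names are illustrative):

    from urllib.parse import urlparse

    R = 8  # number of reduce tasks (illustrative)

    def default_partition(key):
        # Default: spread keys roughly uniformly across the R reducers.
        return hash(key) % R

    def host_partition(url):
        # Custom: all URLs from the same host go to the same reducer,
        # so they end up in the same output file.
        return hash(urlparse(url).hostname) % R

    print(host_partition("https://example.com/a") ==
          host_partition("https://example.com/b"))  # True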

Problems Suited for MapReduce

General Characteristics of Good Problems for MapReduce
- The data set is truly big: terabytes, not tens of gigabytes
  - Hadoop/MapReduce is designed for terabyte/petabyte-scale computation
  - Most real-world problems process less than 100 GB of input
    - Microsoft, Yahoo: median job under 14 GB
    - Facebook: 90% of jobs under 100 GB

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

General Characteristics of Good Problems for MapReduce
- No need for fast response time
  - When submitting jobs, Hadoop latency can be a minute
  - Not well suited to problems that require faster response times: online purchases, transaction processing
- A good pre-computation engine
  - E.g., pre-compute related items for every item in inventory

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

General Characteristics of Good Problems for MapReduce
- Good for applications that work in batch mode
  - Jobs run without human interaction or intervention
  - Runs over the entire data set
  - Takes time to initiate and run; the shuffle step can be time-consuming
- Does not support real-time applications well: sensor data, real-time advertisements, etc.
- Does not provide good support for random access to datasets
  - Extensions: Hive, Dremel, Shark (AMPLab)

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

General Characteristics of Good Problems for MapReduce
- Best suited for data that can be expressed as key-value pairs without losing context or dependencies
- Graph data is harder to process using MapReduce
  - Implicit relationships: edges, sub-trees, child/parent relationships, weights, etc.
  - Graph algorithms may need information about the entire graph for each iteration
  - Hard to break into independent chunks for Map tasks
  - Alternatives: Google's Pregel, Apache Giraph

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

General Characteristics of Good Problems for MapReduce
- Other problems/data *not* suited for MapReduce
  - Tasks that need the results of intermediate steps to compute the results of the current step
  - Interdependencies among tasks (Map tasks must be independent)
  - Some machine learning algorithms: gradient-based learning, expectation maximization

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

General Characteristics of Good Problems for MapReduce
- Summary: good candidates for MapReduce
  - Jobs that have to process huge quantities of data and either summarize or transform the contents
  - Collected data has elements that can easily be captured with an identifier (key) and a corresponding value

Source: To Hadoop or Not to Hadoop, by Anand Krishnaswamy, 8/13/2013

Useful tools for Data Science

Python libraries for data science
https://activewizards.com/blog/top-20-python-libraries-for-data-science-in-2018/

Pandas
- Python library that provides high-level data structures and a vast variety of tools for analysis: https://pandas.pydata.org
- Can translate rather complex operations on data into one or two commands
- Contains many built-in methods for:
  - Grouping
  - Filtering
  - Combining data
  - Handling missing data
  - Time series
  - Hierarchical indexing
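A small taste of grouping, filtering, and missing-data handling (the DataFrame is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["LA", "LA", "SF", "SF"],
        "temp": [75, 80, 60, None],
    })

    print(df.groupby("city")["temp"].mean())   # grouping
    print(df[df["temp"] > 70])                 # filtering
    print(df.fillna(df["temp"].mean()))        # handling missing data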

Scikit-learn
- http://scikit-learn.org/stable/
- Python module based on NumPy and SciPy
- Provides algorithms for many standard machine learning and data mining tasks:
  - Classification
  - Regression
  - Clustering
  - Dimensionality reduction
  - Model selection
  - Preprocessing
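A minimal classification example on a dataset bundled with scikit-learn (the choice of model is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Fit a k-nearest-neighbors classifier and report test accuracy.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(clf.score(X_test, y_test))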

Elasticsearch
- https://www.elastic.co/products/elasticsearch
- Open-source, RESTful, distributed search and analytics engine
- Real-time data and real-time analytics
- Scalable, high availability, multi-tenant
- Full-text search
- Related visualization tools: Kibana (https://www.elastic.co/products/kibana), Grafana (https://grafana.com)
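Because the interface is RESTful, a query is just an HTTP request. A minimal sketch using Python's requests library against a local node; the index name "tweets" and field "text" are hypothetical:

    import requests

    # Full-text match query against a local Elasticsearch node.
    resp = requests.get(
        "http://localhost:9200/tweets/_search",
        json={"query": {"match": {"text": "pancakes"}}},
    )
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"])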

D3.js: Data-Driven Documents
- JavaScript library for manipulating documents based on data
- https://d3js.org