Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce



Outline
- MapReduce Programming Model
- MapReduce Examples
- Hadoop

Incredible Things That Happen Every Minute On The Internet

The Four V's of Big Data


Motivation: Large Scale Data Processing
- Want to process lots of data (>1 TB)
- Want to parallelize across hundreds/thousands of CPUs
- Want to make this easy

MapReduce
- A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
- More simply, MapReduce is a parallel programming model and an associated implementation.
Source: Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.

A Brief History
- MapReduce is a new use of an old idea in Computer Science
- Map: apply a function to every object in a list
  - Each object is independent
  - Order is unimportant
  - Maps can be done in parallel
  - The function produces a result
- Reduce: combine the results to produce a final result
- You may have seen this in a Lisp or functional programming course
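The functional idea behind the two names can be tried in a few lines of Python. This is only a sketch of the classic map and reduce (fold) operations, not of the distributed system; the list `numbers` and the squaring function are made-up examples.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply a function to every element; each call is independent
# of the others, so the calls could run in any order or in parallel.
squares = list(map(lambda x: x * x, numbers))

# Reduce: combine the mapped results into one final value,
# starting from the initial value 0.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```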

10 Years of Big Data Processing Technologies

Some MapReduce Terminology
- Job: a full program, i.e. an execution of a Mapper and Reducer across a data set
- Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP)
- Task Attempt: a particular instance of an attempt to execute a task on a machine

Terminology Example
- Running Word Count across 20 files is one job
- 20 files to be mapped imply 20 map tasks, plus some number of reduce tasks
- At least 20 map task attempts will be performed; more if a machine crashes, etc.

Task Attempts
- A particular task will be attempted at least once, possibly more times if it crashes
- If the same input causes crashes over and over, that input will eventually be abandoned
- Multiple attempts at one task may occur in parallel with speculative execution turned on
- Task ID from TaskInProgress is not a unique identifier

MapReduce Programming Model
- Process data using special map() and reduce() functions
- The map() function is called on every item in the input and emits a series of intermediate key/value pairs
- All values associated with a given key are grouped together
- The reduce() function is called on every unique key and its value list, and emits a value that is added to the output

map
- Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line)
- map() produces one or more intermediate values, along with an output key, from the input
map (in_key, in_value) -> (out_key, intermediate_value) list

reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() combines those intermediate values into one or more final values for that same output key
reduce (out_key, intermediate_value list) -> out_value list

reduce
reduce (out_key, intermediate_value list) -> out_value list
[Figure labels: initial, returned]

MapReduce Programming Model
Input: "How now Brown cow", "How does It work now"
Map (M): <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
MapReduce Framework: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
Reduce (R), Output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
More formally:
Map(k1,v1) --> list(k2,v2)
Reduce(k2, list(v2)) --> list(v2)
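The data flow on this slide can be imitated in a single process. The sketch below is an illustration under stated assumptions, not how a real framework is built: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, words are lowercased before counting (the slide's own casing is inconsistent), and the reduce function returns one value instead of a list for brevity.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce sequentially in one process."""
    groups = defaultdict(list)
    # Map phase: each (k1, v1) record yields zero or more (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Grouping step: collect every v2 under its k2.
            groups[k2].append(v2)
    # Reduce phase: one call per unique key, visited in sorted key order.
    return {k2: reduce_fn(k2, groups[k2]) for k2 in sorted(groups)}

def wc_map(filename, line):
    # Emit (word, 1) for every word; lowercasing is an assumption.
    for word in line.lower().split():
        yield word, 1

def wc_reduce(word, counts):
    # Sum all the 1s emitted for this word.
    return sum(counts)

lines = [("doc1", "How now Brown cow"), ("doc2", "How does It work now")]
counts = run_mapreduce(lines, wc_map, wc_reduce)
print(counts)
# {'brown': 1, 'cow': 1, 'does': 1, 'how': 2, 'it': 1, 'now': 2, 'work': 1}
```

Only `wc_map` and `wc_reduce` are specific to word counting; the driver stays the same for other problems.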

MapReduce Architecture
- Input key/value pairs are read from data stores 1 to n, each feeding a map task
- Each map task emits (key 1, values...), (key 2, values...), (key 3, values...)
- == Barrier ==: aggregates intermediate values by output key
- Each reduce task takes one key with its intermediate values (key 1, key 2, key 3) and produces the final values for that key

MapReduce: High Level
- A MapReduce job is submitted by a client computer
- Master node: JobTracker
- Slave nodes: each has a TaskTracker, which runs task instances

Nodes, Trackers, Tasks
- Master node runs a JobTracker instance, which accepts job requests from clients
- TaskTracker instances run on slave nodes
- TaskTracker forks a separate Java process for each task instance

MapReduce Workflow

MapReduce in One Picture
Source: Tom White, Hadoop: The Definitive Guide

MapReduce Runtime System
1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication

GFS: Underlying Storage System
- Goal: a global view; make huge files available in the face of node failures
- Master node (meta server): centralized, indexes all chunks on data servers
- Chunk server (data server):
  - File is split into contiguous chunks, typically 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Tries to keep replicas in different racks
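A little arithmetic makes the chunking and replication figures concrete. The sketch below assumes a 1000 MB file, 64 MB chunks, and 3x replication. `place_replicas` is a hypothetical round-robin policy written only to illustrate "replicas in different racks"; it is not GFS's actual placement algorithm.

```python
def chunk_count(file_size_bytes, chunk_size_bytes):
    """Number of contiguous chunks needed to hold the file (ceiling division)."""
    return (file_size_bytes + chunk_size_bytes - 1) // chunk_size_bytes

def place_replicas(chunk_index, racks, replication):
    """Hypothetical policy: put each replica of a chunk on a different rack.

    Replicas land on distinct racks as long as replication <= len(racks).
    """
    return [racks[(chunk_index + i) % len(racks)] for i in range(replication)]

MB = 1024 * 1024
chunks = chunk_count(1000 * MB, 64 * MB)   # 1000 / 64 = 15.625, rounded up
replicas = chunks * 3                      # total chunk copies at 3x replication

racks = ["rack-a", "rack-b", "rack-c", "rack-d"]
print(chunks, replicas)                    # 16 48
print(place_replicas(0, racks, 3))         # ['rack-a', 'rack-b', 'rack-c']
print(place_replicas(3, racks, 3))         # ['rack-d', 'rack-a', 'rack-b']
```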

GFS Architecture
[Figure: a client and the GFS master, with chunks C0, C1, C2, C3, and C5 replicated across Chunkserver 1, Chunkserver 2, ..., Chunkserver N]

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: the reduce phase can't start until the map phase is completely finished
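The two parallel phases and the barrier between them can be sketched with Python's standard thread pool. This is a structural illustration only: threads on one machine stand in for worker machines, and `map_task`, `reduce_task`, and the three input splits are made up.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(split):
    """Map one input split to a list of (word, 1) pairs."""
    return [(word, 1) for word in split.split()]

def reduce_task(item):
    """Sum the values collected for one key."""
    key, values = item
    return key, sum(values)

splits = ["a b a", "b c", "a c c"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map tasks run concurrently, one per input split.
    map_outputs = list(pool.map(map_task, splits))

    # Barrier: list() above has collected every map result,
    # so grouping only starts once the whole map phase is done.
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce tasks run concurrently, one per output key.
    result = dict(pool.map(reduce_task, sorted(groups.items())))

print(result)  # {'a': 3, 'b': 2, 'c': 3}
```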

Locality
- Master program divides up tasks based on location of data: tries to have map() tasks on the same machine as the physical file data, or at least the same rack
- map() task inputs are divided into 64 MB blocks: same size as Google File System chunks
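The preference order (same machine, then same rack, then anywhere) can be written as a small selection function. Everything here is an assumption made for illustration: `pick_worker`, the node names, and the `rack_of` mapping are hypothetical, and a real scheduler weighs more factors.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose an idle worker for a map task, preferring data locality."""
    # First choice: a worker that already stores a replica of the block.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker, "node-local"
    # Second choice: a worker in the same rack as some replica.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    # Last resort: any idle worker; the block must cross racks.
    if idle_workers:
        return idle_workers[0], "remote"
    return None, "none"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}

print(pick_worker({"n1"}, ["n3", "n1"], rack_of))  # ('n1', 'node-local')
print(pick_worker({"n1"}, ["n3", "n2"], rack_of))  # ('n2', 'rack-local')
print(pick_worker({"n1"}, ["n3"], rack_of))        # ('n3', 'remote')
```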

Fault Tolerance
- Master detects worker failures
  - Re-executes completed and in-progress map() tasks, because all of their output was stored locally on the failed worker
  - Re-executes in-progress reduce() tasks only: partially completed tasks are re-executed, while completed output is already stored in the global file system
- Master notices that particular input key/values cause crashes in map(), and skips those values on re-execution
  - Effect: can work around bugs in third-party libraries!
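Skipping crash-inducing records on re-execution can be sketched as a retry loop. This is a simplified, single-process illustration: `run_map_with_skipping` and `parse_int` are hypothetical names, a record is skipped after its first crash, and the `bad_records` set plays the role of the master's bookkeeping.

```python
def run_map_with_skipping(records, map_fn, max_attempts=4):
    """Re-execute a map task, skipping records that crashed earlier attempts."""
    bad_records = set()  # indices of records already seen to crash map_fn
    for attempt in range(1, max_attempts + 1):
        output = []  # partial output of a failed attempt is discarded
        try:
            for index, record in enumerate(records):
                if index in bad_records:
                    continue  # skip input known to crash map_fn
                try:
                    output.extend(map_fn(record))
                except Exception:
                    bad_records.add(index)  # remember the failing record
                    raise                   # this attempt has failed
            return output, sorted(bad_records), attempt
        except Exception:
            continue  # start a new attempt from the beginning
    raise RuntimeError("task abandoned after %d attempts" % max_attempts)

def parse_int(record):
    # Stand-in for buggy third-party code: raises on non-numeric input.
    return [("n", int(record))]

out, skipped, attempts = run_map_with_skipping(["1", "x", "3", "y", "5"], parse_int)
print(out)       # [('n', 1), ('n', 3), ('n', 5)]
print(skipped)   # [1, 3]
print(attempts)  # 3
```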

Optimizations
- No reduce can start until map is complete
  - A single slow disk controller can rate-limit the whole process
- Master redundantly executes slow-moving map tasks; uses results of the first copy to finish
- Combiner functions can run on the same machine as a mapper
  - Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
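The bandwidth saving from a combiner is easy to see on one mapper's output. In this sketch, `map_words` and `combine` are hypothetical helpers for the word-count case, where the combiner can simply reuse the reducer's summing logic.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Mini-reduce on the mapper's machine: sum counts per key locally."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

raw = map_words("to be or not to be")
combined = combine(raw)

print(len(raw), len(combined))  # 6 4
print(combined)                 # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Four pairs cross the network instead of six, and the reducer still computes the same totals because addition is associative and commutative.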


MapReduce Benefits
- Greatly reduces parallel programming complexity
- Reduces synchronization complexity
- Automatically partitions data
- Provides failure transparency
- Handles load balancing

Outline
- MapReduce Programming Model
- MapReduce Examples
- Hadoop

Typical Problems Solved by MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. Shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
The outline stays the same, but map and reduce change to fit the problem

MapReduce Examples
Word frequency
doc -> Map -> <word,1> <word,1> <word,1> -> Runtime System -> <word,1,1,1> -> Reduce -> <word,3>

Example: Count Word Occurrences
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");
reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));


MapReduce Examples
- Distributed grep
  - Map function emits <word, line_number> if word matches search criteria
  - Reduce function is the identity function
- URL access frequency
  - Map function processes web logs, emits <url, 1>
  - Reduce function sums values and emits <url, total>
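Both examples fit the same map/group/reduce skeleton; only the two functions change. The sketch below runs them in one process. The driver `run_job`, the substring match used as the search criterion, and the "<client> <url>" log format are all assumptions made for illustration.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Tiny single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

def make_grep_map(pattern):
    """Build a grep mapper that emits (word, line_number) for matching words."""
    def grep_map(record):
        line_number, line = record
        for word in line.split():
            if pattern in word:  # assumed criterion: substring match
                yield word, line_number
    return grep_map

def identity_reduce(key, values):
    """Grep reducer: pass the grouped values through unchanged."""
    return values

def url_map(log_line):
    """URL mapper: emit (url, 1); assumes lines look like '<client> <url>'."""
    client, url = log_line.split()
    yield url, 1

def sum_reduce(url, counts):
    """URL reducer: total number of accesses for this URL."""
    return sum(counts)

text = [(1, "map reduce"), (2, "hadoop mapreduce"), (3, "hdfs")]
matches = run_job(text, make_grep_map("map"), identity_reduce)

logs = ["10.0.0.1 /index.html", "10.0.0.2 /about.html", "10.0.0.1 /index.html"]
hits = run_job(logs, url_map, sum_reduce)

print(matches)  # {'map': [1], 'mapreduce': [2]}
print(hits)     # {'/about.html': 1, '/index.html': 2}
```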

Outline
- MapReduce Programming Model
- MapReduce Examples
- Hadoop

Hadoop
Open source MapReduce implementation
http://hadoop.apache.org/core/index.html
Google calls it    Hadoop equivalent
MapReduce          Hadoop
GFS                HDFS
Bigtable           HBase

Basic Features: HDFS
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware

HDFS Architecture

Hadoop Related Projects
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heat maps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner
- Avro: A data serialization system
- Cassandra: A scalable multi-master database with no single points of failure
- Chukwa: A data collection system for managing large distributed systems
- HBase: A scalable, distributed database that supports structured data storage for large tables (NoSQL)
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
- Mahout: A scalable machine learning and data mining library
- Pig: A high-level data-flow language and execution framework for parallel computation
- ZooKeeper: A high-performance coordination service for distributed applications

Two Ecosystems
Source: BDEC Committee, The BDEC Pathways to Convergence Report, 2017


Data Science

Data Science in Wikipedia