MapReduce: Simplified Data Processing on Large Clusters (Without the Agonizing Pain)
Presented by Aaron Nathan


The Problem
- Massive amounts of data: >100 TB (the internet)
- Needs simple processing
- Computers aren't perfect: slow, unreliable, misconfigured
- Requires complex (i.e. bug-prone) code

MapReduce to the Rescue!
- Common functional programming model
- Map step: map(in_key, in_value) -> list(out_key, intermediate_value)
  splits a problem into many smaller subproblems
- Reduce step: reduce(out_key, list(intermediate_value)) -> list(out_value)
  combines the outputs of the subproblems into the original problem's answer
- Each function invocation is independent: highly parallelizable
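
To make the two signatures concrete, here is a minimal sketch of the abstraction in Python (the names and types are illustrative, not the paper's C++ interface):

    from typing import Iterator, Tuple

    # map: one input record in, zero or more intermediate (key, value) pairs out.
    def map_fn(in_key: str, in_value: str) -> Iterator[Tuple[str, str]]:
        for word in in_value.split():
            yield (word, "1")

    # reduce: an intermediate key plus all of its values in, final values out.
    def reduce_fn(out_key: str, intermediate_values: Iterator[str]) -> Iterator[str]:
        yield str(sum(int(v) for v in intermediate_values))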

Algorithm Picture
[Figure: input DATA fans out to many parallel MAP tasks, each emitting intermediate pairs (K1:v, K2:v, K3:v, ...); an aggregator groups the pairs by key (K1:v,v,v  K2:v,v,v  K3:v,v,v); one REDUCE task per key group then produces its part of the answer.]
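
The single-process driver below is a toy sketch of that picture: it runs the map phase, groups intermediate pairs by key (the "aggregator" box), and runs one reduce per key group. It illustrates the data flow only, not the distributed implementation:

    from collections import defaultdict

    def local_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: each input record yields intermediate (key, value) pairs.
        intermediate = defaultdict(list)
        for in_key, in_value in inputs:
            for out_key, value in map_fn(in_key, in_value):
                intermediate[out_key].append(value)  # the "aggregator" step
        # Reduce phase: one reduce call per intermediate key.
        return {key: reduce_fn(key, values) for key, values in intermediate.items()}

    # Example: word count over two tiny "documents".
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    wc_map = lambda k, v: ((word, 1) for word in v.split())
    wc_reduce = lambda k, values: sum(values)
    print(local_mapreduce(docs, wc_map, wc_reduce))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}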

Some Example Code

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Some Example Applications
- Distributed grep
- URL access frequency counter
- Reverse web-link graph
- Term vector per host
- Distributed sort
- Inverted index (sketched below)
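
As one concrete example, the inverted index fits the model directly: map emits a (word, document_id) pair for each word, and reduce collects the document list for each word. A minimal sketch (illustrative names, not the paper's code):

    def index_map(doc_id, contents):
        # Emit one (word, doc_id) pair per distinct word in the document.
        for word in set(contents.split()):
            yield (word, doc_id)

    def index_reduce(word, doc_ids):
        # The posting list for this word: every document that contains it.
        yield sorted(doc_ids)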

The Implementation
- Google clusters: 100s to 1000s of dual-core x86 commodity machines
- Commodity networking (100 Mbps/1 Gbps)
- GFS
- Google's job scheduler
- Library linked into user programs in C++

Execution

The Master
- Maintains the state and identity of all workers
- Manages the locations of intermediate values
- Receives signals from map workers upon completion
- Broadcasts those signals to reduce workers as they work
- Can retask completed map workers as reduce workers

In Case of Failure
- The master periodically pings its workers
- On failure, it resets the state of the dead worker's assigned tasks
- This simple scheme proves resilient: it has survived 80 simultaneous machine failures!
- Master failure is unhandled
- Worker failure doesn't affect output: the output is identical whether or not a failure occurs
  - Each map task writes to its local disk only; if a mapper is lost, its input is just reprocessed
  - Non-deterministic map functions aren't guaranteed this property
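
A sketch of the ping-and-reset logic on the master (the timeout value and all names here are assumptions, not from the paper):

    import time
    from dataclasses import dataclass
    from typing import Optional

    PING_TIMEOUT_S = 60.0  # assumed timeout; the paper does not give a number

    @dataclass
    class Worker:
        last_ping: float
        alive: bool = True

    @dataclass
    class Task:
        kind: str                        # "map" or "reduce"
        state: str = "idle"              # "idle", "in_progress", or "completed"
        worker: Optional[Worker] = None

    def check_workers(workers, tasks):
        """Reset the tasks of unresponsive workers so they get rescheduled."""
        now = time.monotonic()
        for worker in workers:
            if worker.alive and now - worker.last_ping > PING_TIMEOUT_S:
                worker.alive = False
                for task in tasks:
                    # Completed map tasks are also re-run: their output sits on
                    # the dead worker's local disk and is no longer reachable.
                    rerun = task.state == "in_progress" or (
                        task.kind == "map" and task.state == "completed")
                    if task.worker is worker and rerun:
                        task.state, task.worker = "idle", None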

Preserving Bandwidth
- Machines sit in racks with small interconnects between them
- The scheduler uses location information from GFS
- It attempts to place a worker's tasks and their input slices on the same rack
- Usually results in LOCAL reads!

Backup Execution Tasks
- What if one machine is slow? It can delay the completion of the entire MapReduce operation!
- Answer: backup (redundant) executions; whichever copy finishes first completes the task (see the sketch below)
- Enabled only toward the end of processing
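
A toy sketch of the straggler fix: near the end of the job, schedule a redundant copy of every still-running task and let the first finisher win (the threshold and the assign callback are assumptions):

    BACKUP_THRESHOLD = 0.95  # assumed: start backups once ~95% of tasks are done

    def maybe_schedule_backups(tasks, idle_workers, assign):
        """assign(worker, task) is a hypothetical scheduler callback."""
        done = sum(t.state == "completed" for t in tasks)
        if done < BACKUP_THRESHOLD * len(tasks):
            return  # not near the end yet; backups would only waste resources
        for task in tasks:
            if task.state == "in_progress" and not getattr(task, "has_backup", False):
                if not idle_workers:
                    break
                assign(idle_workers.pop(), task)  # run a redundant copy
                task.has_backup = True            # whichever copy finishes first wins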

Partitioning
- M = number of map tasks (the number of input splits)
- R = number of reduce tasks (the number of intermediate key partitions)
- W = number of worker machines
- In general: M = sizeof(input) / 64 MB, and R = W * n for a small n
- Typical scenario: input size = 12 TB, so M = 200,000 (roughly 12 TB / 64 MB), with R = 5,000 and W = 2,000

Custom Partitioning
- Default: partition on the intermediate key, i.e. hash(intermediate_key) mod R
- What if the user has a priori knowledge about the key?
- Allow a user-defined hashing function, e.g. hash(Hostname(url_key)) to group all URLs from one host
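
Both partitioners are easy to sketch in Python (function names are mine; crc32 stands in for whatever hash the library actually uses):

    import zlib
    from urllib.parse import urlparse

    def default_partition(intermediate_key: str, R: int) -> int:
        # Default: hash the whole intermediate key.
        return zlib.crc32(intermediate_key.encode()) % R

    def host_partition(url_key: str, R: int) -> int:
        # Custom: hash only the hostname, so every URL from the same host
        # lands in the same reduce partition (and thus the same output file).
        return zlib.crc32(urlparse(url_key).netloc.encode()) % R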

The Combiner
- If the reduce function is associative and commutative:
  associative: (2 + 5) + 4 = 2 + (5 + 4) = 11
  commutative: (15 + x) + 2 = 2 + (15 + x)
- then repeated intermediate keys can be merged on the map worker
- Saves network bandwidth
- Essentially a local reduce task (sketched below)
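
A sketch of the idea for word count: run the reduce logic locally over one map task's output before anything crosses the network (toy code, not the library's combiner API):

    from collections import Counter

    def combine(map_output):
        """Merge repeated keys from ONE map task before network transfer."""
        combined = Counter()
        for word, count in map_output:
            combined[word] += count
        return list(combined.items())

    # "the cat the" emits [("the", 1), ("cat", 1), ("the", 1)]; after combining,
    # only [("the", 2), ("cat", 1)] is shipped to the reduce workers.
    print(combine([("the", 1), ("cat", 1), ("the", 1)]))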

I/O Abstractions
- How do the initial key/value pairs get to map? Define an input format
- Make sure splits occur at reasonable places; e.g. for text, each line is a key/value pair
- Input can come from GFS, Bigtable, or anywhere, really!
- Output works analogously

Skipping Bad Records
- What if a user makes a mistake in map/reduce that is only apparent on a few records?
- The worker sends a message to the master identifying the record
- After more than one worker fails on the same record, the master tells the others to skip it
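
A sketch of the master's side of that protocol (the data structures and names are mine; the threshold follows the slide):

    from collections import defaultdict

    SKIP_AFTER = 2  # skip a record once more than one worker has died on it

    failures_per_record = defaultdict(int)

    def report_crash(record_id):
        # A worker signals the offending record to the master before dying.
        failures_per_record[record_id] += 1

    def should_skip(record_id):
        # Workers ask the master whether to skip a record before processing it.
        return failures_per_record[record_id] >= SKIP_AFTER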

Removing Unnecessary Development Pain
- A local MapReduce implementation runs on the development machine
- The master serves an HTTP page with the status of the entire operation, including bad records
- A counter facility is provided: the master aggregates counts and displays them on its HTTP page
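
As user code might see it, the counter facility could look like this hypothetical Python rendering (the paper's actual interface is a C++ counter object):

    from collections import Counter

    counters = Counter()  # per-worker counts, periodically shipped to the master

    def map_fn(doc_id, contents):
        for word in contents.split():
            if word.isupper():
                counters["uppercase_words"] += 1  # user-defined counter
            yield (word, "1")

    # The master sums the counters reported by all workers and displays the
    # running totals on its HTTP status page.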

A look at the UI (in 1994)
http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0013.html

Performance Benchmarks: Sorting and Searching

Search (Grep)
- Scan through 10^10 100-byte records (1 TB)
- M = 15,000, R = 1
- Startup time: GFS localization, program propagation
- Peak scan rate > 30 GB/sec!

Sort
- 50 lines of code
- Map: extract key + text line; Reduce: identity
- M = 15,000, R = 4,000
- Partition on the initial bytes of the intermediate key
- Sorts in 891 seconds!

What about Backup Tasks? (Without them, the same sort takes 1,283 seconds, 44% longer.)

And wait, it's useful! (NB: August 2004)

Open Source Implementation
- Hadoop: http://hadoop.apache.org/core/
- Relies on HDFS
- All interfaces look almost exactly like the MapReduce paper's
- There is even a talk about it today! 4:15, B17, CS Colloquium: Mike Cafarella (UWash)

Active Disks for Large-Scale Data Processing

The Concept
- Use the drives' aggregate processing power
- Networked disks allow for higher throughput
- Why not move part of the application onto the disk device?
- Reduces data traffic and increases parallelism further

Shrinking Support Hardware

Example Applications
- Media database: find similar media data by fingerprint
- Real-time applications: collect data from multiple sensors quickly
- Data mining: point-of-sale (POS) analysis requiring ad hoc database queries

Approach
- Leverage the parallelism available in systems with many disks
- Operate with a small amount of state, processing data as it streams off the disk
- Execute relatively few instructions per byte of data

Results: Nearest-Neighbor Search
- Problem: determine the k items closest to a particular item in a database
- Comparisons are performed on the drive
- Each disk returns its own closest matches
- The server does the final merge (see the sketch below)
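
A sketch of that split between drive and server: each drive keeps only k candidates of state while records stream past, and the host merges the per-drive lists (illustrative Python, assuming a numeric distance function):

    import heapq

    def drive_top_k(records, query, k, distance):
        # Runs on the drive: tiny state (k candidates), one streaming pass.
        return heapq.nsmallest(k, records, key=lambda r: distance(r, query))

    def server_merge(per_drive_results, query, k, distance):
        # Runs on the host: merge each drive's candidates into the global top k.
        candidates = [r for drive in per_drive_results for r in drive]
        return heapq.nsmallest(k, candidates, key=lambda r: distance(r, query))

    # Example with scalar records and absolute difference as the distance:
    dist = lambda r, q: abs(r - q)
    drives = [[1, 9, 14], [2, 8, 20], [5, 6, 30]]
    parts = [drive_top_k(d, query=7, k=2, distance=dist) for d in drives]
    print(server_merge(parts, query=7, k=2, distance=dist))  # [8, 6]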

Media Mining Example
- Perform low-level image tasks on the disk!
- Edge detection is performed on the disk, and the result is sent to the server as an edge image
- The server does the higher-level processing

Why not just use a bunch of PCs?
- The performance increase is similar; in fact, the paper essentially used this setup to benchmark its results!
- Supposedly this could be cheaper, but the paper doesn't really give a good argument for it
- Possibly reduced bandwidth on the disk I/O channel, but who cares?

Some Questions
- What could a disk possibly do better than the host processor?
- What added cost is associated with this mediocre processor on the HDD?
- Are new dependencies introduced on hardware and software?
- Are there other (better) places to do this type of local parallel processing?
- Maybe in 2001 this made more sense?