MapReduce-style data processing


MapReduce-style data processing
Software Languages Team, University of Koblenz-Landau
Ralf Lämmel and Andrei Varanovich

Related meanings of MapReduce
- Functional programming with map & reduce
- An algorithmic skeleton for data parallelism
- Google's related programming model
- Related programming techniques for data technologies

Functional programming with map & reduce We use Haskell here for illustration; reduce is called foldr in Haskell.

The higher-order function map

map f l applies the function f to each element of the list l and produces the list of results.

> :t map
map :: (a -> b) -> [a] -> [b]   -- the type of map
> map ((+) 1) [1,2,3]           -- increment the numbers
[2,3,4]
> map ((*) 2) [1,2,3]           -- double the numbers
[2,4,6]
> map even [1,2,3]              -- test the numbers for being even
[False,True,False]

The higher-order function foldr

foldr f x l combines all elements of the list l with the binary operation f, starting from x.

> :t foldr
foldr :: (a -> b -> b) -> b -> [a] -> b   -- the type of foldr
> foldr (+) 0 [1,2,3]                     -- sum up the numbers
6
> foldr (&&) True [True,True,False]       -- conjoin the Booleans
False

Typical forms of reduction

sum     = foldr (+) 0
product = foldr (*) 1
and     = foldr (&&) True
or      = foldr (||) False

Another way to think of foldr

foldr f k replaces every (:) in a list with f and the final [] with k. For example, [1,2,3,4] = 1:(2:(3:(4:[]))), so sum [1,2,3,4] computes 1+(2+(3+(4+0))) = 10 and product [1,2,3,4] computes 1*(2*(3*(4*1))) = 24.

> let l1 = [1,2,3,4]
> sum l1
10
> product l1
24

sum, product :: Num a => [a] -> a
sum = foldr (+) 0
product = foldr (*) 1

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f k [] = k
foldr f k (x:xs) = f x (foldr f k xs)

Datatypes for companies

data Company    = Company Name [Department]
data Department = Department Name Manager [Department] [Employee]
data Employee   = Employee Name Address Salary

type Manager = Employee
type Name    = String
type Address = String
type Salary  = Float

[101implementation:haskell]

Company structure in Haskell

company = Company "meganalysis"
  [ Department "Research"
      (Employee "Craig" "Redmond" 123456)
      []
      [ Employee "Erik" "Utrecht" 12345
      , Employee "Ralf" "Koblenz" 1234 ]
  , Department "Development"
      (Employee "Ray" "Redmond" 234567)
      [ Department "Dev1"
          (Employee "Klaus" "Boston" 23456)
          [ Department "Dev1.1"
              (Employee "Karl" "Riga" 2345)
              []
              [ Employee "Joe" "Wifi City" 2344 ] ]
          [] ]
      [] ]

Cutting salaries in Haskell

cut :: Company -> Company
cut (Company n ds) = Company n (map dep ds)
  where
    dep :: Department -> Department
    dep (Department n m ds es) = Department n (emp m) (map dep ds) (map emp es)
      where
        emp :: Employee -> Employee
        emp (Employee n a s) = Employee n a (s / 2)

[101implementation:haskell]

Totaling salaries in Haskell

total :: Company -> Float
total (Company n ds) = sum (map dep ds)
  where
    dep :: Department -> Float
    dep (Department _ m ds es) = sum (emp m : map dep ds ++ map emp es)
      where
        emp :: Employee -> Float
        emp (Employee _ _ s) = s

[101implementation:haskell]

The basic idea of data parallelism

Think of a huge company (millions of employees). Too big to store on one disk! We assume a flat representation. Compute the total in parallel on many machines.

Datatypes for flat companies

type Company = Name  -- Name of company

type Department =
  ( Name        -- Name of department
  , Maybe Name  -- Name of ancestor department
  , Name        -- Name of associated company
  )

type Employee =
  ( Name     -- Name of employee
  , Name     -- Name of associated department
  , Name     -- Name of associated company
  , Address  -- Address of employee
  , Salary   -- Salary of employee
  , Bool     -- Manager?
  )

type Name = String
type Address = String
type Salary = Float

[101implementation:haskellFlat]

Flat companies

companies :: [Company]
companies = ["meganalysis"]

departments :: [Department]
departments =
  [ ("Research", Nothing, "meganalysis")
  , ("Development", Nothing, "meganalysis")
  , ("Dev1", Just "Development", "meganalysis")
  , ("Dev1.1", Just "Dev1", "meganalysis")
  ]

employees :: [Employee]
employees =
  [ ("Craig", "Research", "meganalysis", "Redmond", 123456, True)
  , ("Erik", "Research", "meganalysis", "Utrecht", 12345, False)
  , ("Ralf", "Research", "meganalysis", "Koblenz", 1234, False)
  , ("Ray", "Development", "meganalysis", "Redmond", 234567, True)
  , ("Klaus", "Dev1", "meganalysis", "Boston", 23456, True)
  , ("Karl", "Dev1.1", "meganalysis", "Riga", 2345, True)
  , ("Joe", "Dev1.1", "meganalysis", "Wifi City", 2344, False)
  ]

Total flat companies

-- Total all salaries (perhaps even of several companies)
total :: [Employee] -> Float
total = sum . map (\(_e, _d, _c, _a, s, _m) -> s)

[101implementation:haskellFlat]

Clusters of machines for parallel map-reduce! http://labs.google.com/papers/sawzall.html

Total flat companies on many machines

total :: [[Employee]] -> Float
total = sum . map (sum . map (\(_e, _d, _c, _a, s, _m) -> s))

[101implementation:haskellFlat]
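The per-machine version computes the same result as the flat one because sum is associative with unit 0: totaling per chunk and then totaling the chunk totals agrees with a single flat total. A self-contained sketch (the names totalFlat, totalChunked, and chunks are illustrative; chunking stands in for the distribution across machines):

```haskell
-- A minimal sketch of data-parallel totaling; the chunking below is
-- an assumption standing in for the distribution across machines.
type Employee = (String, String, String, String, Float, Bool)

salary :: Employee -> Float
salary (_e, _d, _c, _a, s, _m) = s

-- total on one machine
totalFlat :: [Employee] -> Float
totalFlat = sum . map salary

-- total per machine, then total the per-machine totals
totalChunked :: [[Employee]] -> Float
totalChunked = sum . map totalFlat

-- split a list into chunks of size n (one "machine" per chunk)
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = take n xs : chunks n (drop n xs)
```
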

Total flat companies

-- Total all salaries grouped by company and department
totalPerDepartment :: [Employee] -> Map (Name, Name) Float
totalPerDepartment = foldr insert empty
  where insert (_e, d, c, _a, s, _m) = insertWith (+) (c, d) s

How to think parallel here?

Output:
fromList
  [ (("meganalysis","Dev1"),23456.0)
  , (("meganalysis","Dev1.1"),4689.0)
  , (("meganalysis","Development"),234567.0)
  , (("meganalysis","Research"),137035.0)
  ]

[101implementation:haskellFlat]
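One way to "think parallel" here: run totalPerDepartment on each partition and merge the partial maps with unionWith (+), which adds up the totals of departments that occur in several partitions. A self-contained sketch (the partitioning and the name totalPerDepartmentParallel are illustrative):

```haskell
import Data.Map (Map, empty, insertWith, unionWith)

type Name = String
type Employee = (Name, Name, Name, String, Float, Bool)

-- per-partition totals, as on a single machine
totalPerDepartment :: [Employee] -> Map (Name, Name) Float
totalPerDepartment = foldr insert empty
  where insert (_e, d, c, _a, s, _m) = insertWith (+) (c, d) s

-- merge the partial results of many machines; unionWith (+) adds up
-- totals for departments that occur in several partitions
totalPerDepartmentParallel :: [[Employee]] -> Map (Name, Name) Float
totalPerDepartmentParallel =
  foldr (unionWith (+)) empty . map totalPerDepartment
```

The merge works because (+) is associative and commutative, so it does not matter how the employees are partitioned.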

Google's MapReduce programming model

[Figure: MapReduce execution overview — the user program forks a master and workers; the master assigns map tasks and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers read these remotely and write the output files.]

[http://labs.google.com/papers/mapreduce.html]

Large Scale Data Processing
- Process lots of data to produce other data.
- Use hundreds or thousands of CPUs.
- Automatic parallelization and distribution.

The MapReduce programming model

Input & output: sets of key/value pairs. The programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)
- Processes an input key/value pair.
- Produces a set of intermediate pairs.

reduce (out_key, list(intermediate_value)) -> list(out_value)
- Combines all intermediate values for a particular key.
- Produces a set of merged output values (usually just one).

The domains for intermediate and output values coincide.
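The model can be sketched as a higher-order function. This is only a sequential, in-memory illustration of the signatures above, not Google's distributed implementation; grouping the intermediate pairs by key stands in for the framework's shuffle phase:

```haskell
import qualified Data.Map as Map

-- A minimal, sequential model of the MapReduce abstraction:
-- mApp turns each input pair into intermediate pairs; rApp combines
-- all intermediate values per key.
mapReduce :: Ord k2
          => ((k1, v1) -> [(k2, v2)])  -- the user's map function
          -> (k2 -> [v2] -> [v3])      -- the user's reduce function
          -> [(k1, v1)]
          -> [(k2, [v3])]
mapReduce mApp rApp input =
  [ (k, rApp k vs) | (k, vs) <- Map.toList grouped ]
  where
    -- group intermediate values by key (the "shuffle")
    grouped = Map.fromListWith (flip (++))
                [ (k, [v]) | (k, v) <- concatMap mApp input ]
```
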

An example: counting the number of occurrences of each word in a collection of documents

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
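The same word count as a self-contained Haskell sketch (the document names and helper names are illustrative; grouping by word stands in for the shuffle):

```haskell
import qualified Data.Map as Map

-- the map step: emit (word, 1) for each word of a document
wcMap :: (String, String) -> [(String, Int)]  -- (document name, contents)
wcMap (_docName, contents) = [ (w, 1) | w <- words contents ]

-- the reduce step: sum all counts emitted for one word
wcReduce :: String -> [Int] -> Int
wcReduce _word counts = sum counts

wordCount :: [(String, String)] -> [(String, Int)]
wordCount docs =
  [ (w, wcReduce w cs) | (w, cs) <- Map.toList grouped ]
  where
    grouped = Map.fromListWith (++) [ (w, [c]) | (w, c) <- concatMap wcMap docs ]
```
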

Distribution: many map and reduce tasks [http://labs.google.com/papers/mapreduce.html]

Control of job execution
- Automatic division of a job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures & stragglers
The user focuses on the application, not on the complexities of distributed computation.

Fault tolerance
Cheap nodes fail, especially if you have many:
- Mean time between failures for 1 node = 3 years
- Mean time between failures for 1000 nodes = 1 day
Solution: build fault tolerance into the system.
- If a node crashes: re-launch its current tasks on other nodes and re-run any maps the node previously ran.
- If a task crashes: retry on another node. (OK for a map because it has no dependencies; OK for a reduce because map outputs are on disk.)

Network as a bottleneck
Limited bandwidth (especially for a commodity network).
Solution: push computation to the data.

Stragglers
If a task is going slowly (a straggler):
- Launch a second copy of the task on another node ("speculative execution").
- Take the output of whichever copy finishes first, and kill the other.
Surprisingly important in large clusters: stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc. A single straggler may noticeably slow down a job.

Apache Hadoop
"Apache Hadoop is an open-source software framework that supports data-intensive distributed applications [...]. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop platform is now commonly considered to consist of the Hadoop kernel, MapReduce and HDFS, as well as a number of related projects [...]. Hadoop is a top-level Apache project being built and used by a global community of contributors, [...] written in the Java programming language."
[http://en.wikipedia.org/wiki/Apache_Hadoop] 13 Sep 2012

Hadoop components
Distributed file system (HDFS):
- Single namespace for the entire cluster
- Replicates data 3x for fault tolerance
MapReduce framework:
- Executes user jobs specified as map and reduce functions
- Manages work distribution & fault tolerance

HDFS
- Files split into 128MB blocks
- Blocks replicated across several datanodes (usually 3)
- Single namenode stores metadata (file names, block locations, etc.)
- Optimized for large files, sequential reads
- Files are append-only
[Figure: a namenode holding the metadata of File1, whose blocks 1-4 are replicated across the datanodes]

A Hadoop Cluster
- 40 nodes/rack, 1000-4000 nodes in a cluster
- Aggregation switch and rack switches: 1 Gbps bandwidth within a rack, 8 Gbps out of a rack
- Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB)

Demo: 101implementation:hadoop
A 101companies implementation using Hadoop

public static class TotalMapper extends
    Mapper<Text, Employee, Text, DoubleWritable> {

  private static String name;

  protected void setup(Context context)
      throws IOException, InterruptedException {
    name = context.getConfiguration().get(Total.QUERIED_NAME);
  }

  protected void map(Text key, Employee value, Context context)
      throws IOException, InterruptedException {
    if (value.getCompany().toString().equals(name))
      context.write(value.getCompany(), value.getSalary());
  }
}

public static class TotalReducer extends
    Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  protected void reduce(Text key, Iterable<DoubleWritable> values,
      Context context) throws IOException, InterruptedException {
    double total = 0;
    for (DoubleWritable value : values) {
      total += value.get();
    }
    context.write(key, new DoubleWritable(total));
  }
}

Summary
You learned about...
- functional programming with map & reduce,
- related opportunities for parallelization,
- Google's MapReduce programming model, and
- the Hadoop implementation.

Resources
Google's MapReduce Programming Model -- Revisited
http://userpages.uni-koblenz.de/~laemmel/mapreduce/