PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS

Size: px
Start display at page:

Download "PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS"

Transcription

1 PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS Great Ideas in ICT - June 16, 2016 Irene Finocchi (finocchi@di.uniroma1.it)

2 Title keywords

3 How big? The scale of things

4 Data deluge Every 2 days we create as much information as we did from the dawn of man through 2003 (August 2010, Eric Schmidt - executive Google) Most data born digital 5 Exabytes

5 Where does data come from?

6 Giga, Tera, Peta, Exa 1 Brontobyte = 10 3 Yottabytes

7 How much information is that? GIGA 1 Gigabyte: broadcast quality movie 20 Gigabytes: audio collection of Beethoven s works 100 Gigabytes: library floor of books or academic journals on shelves

8 How much information is that? GIGA 1 Gigabyte: broadcast quality movie 20 Gigabytes: audio collection of Beethoven s works 100 Gigabytes: library floor of books or academic journals on shelves TERA 1 Terabyte: all the X- ray films in a large technological hospital; 50,000 trees made into paper and printed 10 Terabytes: printed collection of the U.S. Library of Congress

9 How much information is that? PETA 1 Petabyte: 3 years of Earth Observing System data (2001) 2 Petabytes: all U.S. academic research libraries 200 Petabytes: all printed material

10

11 How much information is that? PETA 1 Petabyte: 3 years of Earth Observing System data (2001) 2 Petabytes: all U.S. academic research libraries 200 Petabytes: all printed material EXA 5 Exabytes: all words ever spoken by human beings? ZETTA

12 Storage media Megabytes Terabytes Gigabytes Petabytes (and beyond)

13 Packing data into DVDs 800 Exabytes = stack of DVDs reaching from Earth to Moon and back 35 ZB = stack of DVDs would now reach halfway to Mars

14 Infrastructures for big data

15 Single-node architecture CPU Memory Not suitable for big data Disk

16 Storage trends 1990 Typical disk could store about 1370 MB ~ 1GB Transfer speed: 4.4 MB/s Could read a disk in about 5 minutes Today 1 TB disks are the norm Transfer speed 100 MB/s More than 2.5 hours to read a disk! Solution: using many disks in parallel! #1! Issues: Disk failures are very common How to combine data spread over many different disks

17 Clusters of commodity machines Since cannot mine tens to hundreds of Terabytes of data on a single server standard architecture emerging: Cluster of commodity Linux nodes Gigabit ethernet interconnections Scale out, not up! #2!

18 Real cluster architecture

19 Real cluster architecture

20 Building blocks Source: Barroso and Urs Hölzle (2009)

21 Cluster architecture q Each rack contains 10/64 nodes q Sample node configuration: 8 x 2GHz cores, 8 GB RAM, 4 disks (4 TB)

22 Parallelism

23 Parallelize the computation Divide and conquer Work to do w 1 w 2 w 3 Partition worker worker worker r 1 r 2 r 3 Result Combine

24 Parallelization challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What s the common theme of all of these problems?

25 Common theme Parallelization problems arise from: Communication between workers (to exchange state) Access to shared resources (data) Thus, we need synchronization mechanisms

26 Source: Ricardo Guimarães Herrmann

27 Managing multiple workers Difficult because we don s know: the order in which workers run when workers interrupt each other when workers need to communicate partial results the order in which workers access shared data Thus, we need semaphores (lock, unlock), conditional variables (wait, notify, broadcast), barriers Still, lots of problems: Deadlock, livelock, race conditions... Dining philosophers, sleeping barbers, cigarette smokers... Moral of the story: be careful!

28 Current tools Programming models Shared memory (pthreads) Message passing (MPI) Design patterns Master-slaves Producer-consumer flows Shared work queues Shared Memory P 1 P 2 P 3 P 4 P 5 Memory Message Passing P 1 P 2 P 3 P 4 P 5 master producer consumer work queue slaves producer consumer

29 Source: Wikipedia (Flat Tire)

30 Where the rubber meets the road Concurrency/parallelism difficult to reason about Even more difficult At the scale of datacenters (and across datacenters) In the presence of failures In terms of multiple interacting services Not to mention debugging The reality: Lots of one-off solutions, custom code Write you own dedicated library, then program with it Burden on the programmer to explicitly manage everything

31 Big data systems #3! Have a runtime system that manages (most) low-level aspects Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data Handles synchronization Gathers, sorts, and shuffles intermediate data Handles errors and faults Detects worker failures and restarts

32 MAPREDUCE A big data system for batch processing

33 Stable storage First-order problem: if nodes can fail, how can we store data persistently? Answer: Distributed File System Provides global file namespace Google GFS; Hadoop HDFS; Kosmix KFS Typical usage pattern Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common

34 Distributed file system Chunk servers File is split into contiguous chunks Typically each chunk is 16-64MB Each chunk replicated (usually 3x) Try to keep 1 replica on the same rack and other 2 replicas in a different rack Master node Stores metadata Might be replicated (a.k.a. Name Node in HDFS) Client library for file access Talk to master to find chunk servers Connect directly to chunk servers to access data

35 Warm up: word count We have a large file of words, one word per line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs

36 Different scenarios Case 1: Entire file fits in memory Load file into main memory Keep also a hash table with <word, count> pairs Case 2: File too large for mem, but all <word, count> pairs fit in mem Scan file on disk Keep <word, count> hash table in main memory Case 3: Too many distinct words to fit in memory External sort, then scan file (all occurrences of the same word are consecutive: one running counter suffices) sort datafile uniq c

37 Making things a little bit harder Now suppose we have a large corpus of documents Count the number of times each distinct word occurs in the corpus words(docs/*) sort uniq -c where words takes a file and outputs the words in it, one to a line The above captures the essence of MapReduce Great thing is it is naturally parallelizable

38 The origins (2004) Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Jeffrey Dean & Sanjay Ghemawat [OSDI 2004]

39 Map in Lisp

40 Reduce in Lisp

41 MapReduce in Lisp

42 MapReduce can refer to The programming model The execution framework (aka runtime system ) The specific implementation Usage usually clear from context

43 Programming model

44 Data as <key,value> pairs Everything built on top of <key,value> pairs Keys and values are user-defined: can be anything Only two user-defined functions: Map n map(k 1,v 1 ) list(k 2,v 2 ) #4! n given input data <k 1,v 1 >, produce intermediate data v 2 labeled with key k 2 Reduce n reduce(k 2, list(v 2 )) list(v 3 ) preserves key n given a list of values list(v 2 ) associated with a key k 2, return a list of values list(v 3 ) associated with the same key

45 Where is parallelism? All mappers run in parallel All reducers run in parallel Different pairs transparently distributed across available machines (parallel disks!) map(k 1,v 1 ) list(k 2,v 2 ) Shuffle: group values with the same key to be passed to a single reducer reduce(k 2, list(v 2 )) list(v 3 )

46 What you can avoid to care about The underlying runtime system: Automatically parallelizes the computation across large-scale clusters of machines Handles machine failures (incuding disk failures) Schedules inter-machine communication to make efficient use of the network and disks

47 THE MapReduce example: WordCount map(string key, String value): // key: document name; value: text of document for each word w in value: emit(w, 1) reduce(string key, Iterator<int> values): // key: a word; values: an iterator over counts result = 0 for each v in values: result += v emit(key, result)

48 WordCount data flow

49 A programmer s perspective The beauty of MapReduce is that any programmer can understand it, and its power comes from being able to harness thousands of computers behind that simple interface. David Patterson

50 Success story #1: sorting

51 Success story #1: sorting Nov 2008: 1TB, 1000 computers, 68 secs Previous record: 910 computers, 209 secs Nov 2008: 1PB, 4000 computers, 6 h, 48k harddisks Sept 2011: 1PB, 8000 computers, 33 m Sept 2011: 10PB, 8000 computers, 6 ½ h

52 Success story #2: MapReduce + AWS Ability to rent computing by the hour Additional services: e.g., persistent storage For instance, Amazon Web Services Rent your compute instances on Amazon s cloud Amazon Elastic MapReduce: set up and run applications on your cluster in minutes, at very low prices Useful AWS services: Elastic Compute Cloud: EC2 Persistent storage: S3 Elastic MapReduce: run Hadoop on EC2

53 Success story #2: MapReduce + AWS The New York Times needed to generate PDF files for 11,000,000 articles (every article from ) in the form of images scanned from the original paper Each article composed of numerous TIFF images scaled and glued together Code for generating PDF quite straightforward

54 Success story #2: MapReduce + AWS 4TB of scanned articles sent to Amazon S3 A cluster of EC2 machines configured to distribute the PDF generation via Hadoop (open source MapReduce) Using 100 EC2 instances, in 24 hours the New York Times was able to convert the 4TB of scanned articles into 1.5TB of PDF documents Embarrassingly parallel problem

55 MapReduce vs other systems 1. Relational Database Management Systems (RDBMS)

56 MapReduce vs other systems 2. Grid computing and HPC (high-performance computing) Expensive (parallel) architectures Very good for compute-intensive jobs Problematic handling large data volumes: network bandwidth is the bottleneck Explicitly distribute the work across clusters of machines and manage data flow 3. Volunteer computing (Search for Extra Terrestrial Intelligence) Donate your CPU time (processing about 0.35MB of radio telescope data) Data moved to your computer

57 Runtime system

58 Data flow principles Input and final output stored on distributed file system Data locality: scheduler tries to schedule map tasks close to physical storage location of input data Bring code to the data, not data to the code! #5! Intermediate results stored on local FS of map and reduce workers

59 Distributed execution overview # M of map tasks depends on splits User Program fork fork fork assign map Master assign reduce R = # of reduce tasks (can be chosen by the programmer) Input Data Split 0 Split 1 Split 2 read Worker Worker Worker local write remote read, sort Worker Worker write Output File 0 Output File 1

60 Splits Inputs to map tasks are created by contiguous splits of input file Default split size: 64MB/128MB (can be changed by setting parameter mapred.max.split.size) Depending on cluster size, more than one split may end up to be processed by the same worker

61 Distributed execution overview # M of map tasks depends on splits User Program fork fork fork assign map Master assign reduce R = # of reduce tasks (can be chosen by the programmer) Input Data Split 0 Split 1 Split 2 read Worker Worker Worker local write remote read, sort Worker Worker write Output File 0 Output File 1

62 Partition function For reduce functions, we need to ensure that records with the same intermediate key end up at the same worker System uses a default partition function: e.g., hash(key) mod R Sometimes useful to override E.g., hash(hostname(url)) mod R ensures URLs from a host end up in the same output file

63 Distributed execution overview # M of map tasks depends on splits User Program fork fork fork assign map Master assign reduce R = # of reduce tasks (can be chosen by the programmer) Input Data Split 0 Split 1 Split 2 read Worker Worker Worker local write remote read, sort Worker Worker write Output File 0 Output File 1

64 Coordination Master data structures Task status: (idle, in-progress, completed) Idle tasks get scheduled as workers become available When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer Master pushes this info to reducers Master pings workers periodically to detect failures

65 Failures Map worker failure Map tasks completed or in-progress at worker are reset to idle Reduce workers are notified when task is rescheduled on another worker Reduce worker failure Only in-progress tasks are reset to idle Master failure MapReduce task is aborted and client is notified

66 Distributed execution overview # M of map tasks depends on splits User Program fork fork fork assign map Master assign reduce R = # of reduce tasks (can be chosen by the programmer) Input Data Split 0 Split 1 Split 2 read Worker Worker Worker local write remote read, sort Worker Worker write Output File 0 Output File 1

67 Pipelines of MapReduce jobs Output is often input to another map reduce task Computation as a sequence of rounds: Shuffle is expensive: minimize rounds!

68 Implementations

69 Implementations of MapReduce Google MapReduce Not available outside Google Hadoop: an Apache project Open-source implementation in Java n Just need to copy java libs to each machine of your cluster Uses HDFS for stable storage Download: Aster Data Cluster-optimized SQL Database that also implements MapReduce

70 The APIs Official JavaDoc of all Hadoop classes is available at the following URL: overview-summary.html We will review quickly the interfaces, classes, and methods we need to run our first Hadoop program.

71 Jobs in Hadoop A job represents a packaged Hadoop job for submission to cluster Need to specify input and output paths Need to specify input and output formats Need to specify mapper and reducer classes (+ other optional stuff: combiner, partitioner) Need to specify intermediate/final key/value classes Need to specify number of reducers (but not mappers, why?) *Note that there are two versions of the AP

72 Job creation and configuration example

73 Mapper and Reducer classes

74 Methods of the Mapper class

75 Inside a Mapper task public void run(context context) throws IOException, InterruptedException { setup(context); while (context.nextkeyvalue()) { map(context.getcurrentkey(), context.getcurrentvalue(), context); } cleanup(context); }

76 Methods of the Reducer class

77 The Context(s)

78 Hello World : Word Count private static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable ONE = new IntWritable(1); private final static Text WORD = new Text(); public void map(longwritable key, Text value, Context context) throws IOException, InterruptedException { String line = ((Text) value).tostring(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasmoretokens()) { WORD.set(itr.nextToken()); context.write(word, ONE); } }

79 Hello World : Word Count private static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private final static IntWritable SUM = new IntWritable(); public void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { Iterator<IntWritable> iter = values.iterator(); int sum = 0; while (iter.hasnext()) { sum += iter.next().get(); } SUM.set(sum); context.write(key, SUM); }

80 Good coding practices in MapReduce

81 A bad MapReduce program Problem: sum of n values a 1 a n Map: given <-;a i > emit <$;a i > Reduce: Input <$; a 3, a 1, a 4, > Sum the n values received as input Reducers use linear space! No parallelization! Same key associated to each input value: sequential computation!

82 A better solution Map round 1: Group values into n equally-sized groups emit <$ i/ n ;a i > Reduce round 1: Input <$ j ; a k, a h, > Sum the n input values and emit <$ j ; sum j > Round 2: sum the n partial sums (use previous algorithm) Reducers use sublinear ( n) space! n sum computations in parallel!

83 A (theoretical) computational model For an input of size n: sublinear machine memory: O(n 1-ε ) for some ε>0 sublinear # of machines: O(n 1-ε ) for some ε>0 this implies total memory O(n 2-2ε ) for some ε>0 n cannot replicate data too much shuffle is expensive, so O(polylog n) rounds n strive for O(1) mappers/reducers run in polynomial time Karloff, Suri & Vassilvitskii [ACM SIAM SODA 2010]

84 MapReduce class MRC i class = MapReduce algorithms that satisfy the constraints in the previous slide work in O(log i n) rounds MRC 0 = O(1) rounds

85 Example: analysis of MapReduce sum Algorithm 1 does not fit in the MRC class Linear space Algorithm 2 fits in the MRC 0 class ε= ½ 2 rounds

86 (Some) practical hints Minimize number of rounds Shuffle is expensive Reduce number of intermediate <key,value> pairs Network communication is expensive Use combiners and local aggregation wherever possible

87 What is a combiner? Often a map task will produce many pairs of the form (k,v 1 ), (k, v 2 ), for the same key k E.g., popular words in WordCount Can save space and communication time by preaggregating at mapper combine(k 1, list(v 1 )) à v 2 usually same as reduce function Works only if reduce function is commutative and associative

88 (Some) practical hints Minimize number of rounds Shuffle is expensive Reduce number of intermediate <key,value> pairs Network communication is expensive Use combiners and local aggregation wherever possible Take care of load balancing (by careful algorithm design) The curse of the last reduces: wall clock time is proportional to the slowest task Store your input in a few large files rather than in many small files Best usage of the underlying DFS

89 How many (Map and) Reduce tasks? M map tasks, R reduce tasks M is the number of splits (one DFS chunk per map task ) Rule of thumb: Make R larger than the number of nodes in cluster Improves dynamic load balancing and speeds recovery from worker failure Usually R is smaller than M, because output is spread across R files

90 Beyond MapReduce

91

92 References J. Dean and S. Ghemawat MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004 and CACM T. White Hadoop: The Definitive Guide (4th ed)

93 Acknowledgments Part of the slides have been adapted from course material associated with the following textbooks: Mining of Massive Data Sets, by Leskovec, Rajaraman & Ullman Data-Intensive Text Processing with MapReduce, By Jimmy Lin

94 Source: Wikipedia (Japanese rock garden) Questions?

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Dis Commodity Clusters Web data sets can be ery large Tens to hundreds of terabytes

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce Based on slides by Juliana Freire Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Required Reading! Data-Intensive Text Processing with MapReduce,

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce Based on slides by Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec How much data? Google processes 20 PB a day (2008) Internet Archive Wayback Machine has 3 PB + 100

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec Single-node architecture CPU Memory Data Analysis, Machine Learning, Statistics Disk How much

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

CS-2510 COMPUTER OPERATING SYSTEMS

CS-2510 COMPUTER OPERATING SYSTEMS CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data) Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin

More information

Laarge-Scale Data Engineering

Laarge-Scale Data Engineering Laarge-Scale Data Engineering The MapReduce Framework & Hadoop Key premise: divide and conquer work partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 result combine Parallelisation challenges How

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce April 19, 2012 Jinoh Kim, Ph.D. Computer Science Department Lock Haven University of Pennsylvania Research Areas Datacenter Energy Management Exa-scale Computing Network Performance

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

Parallel Programming Concepts

Parallel Programming Concepts Parallel Programming Concepts MapReduce Frank Feinbube Source: MapReduce: Simplied Data Processing on Large Clusters; Dean et. Al. Examples for Parallel Programming Support 2 MapReduce 3 Programming model

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce Jacqueline Chame CS503 Spring 2014 Slides based on: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004. MapReduce: The Programming

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009 CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP Jimmy Lin and Nitin Madnani University of Maryland Wednesday, November 18, 2009 Three Pillars of Statistical NLP Algorithms

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve

More information

CS /5/18. Paul Krzyzanowski 1. Credit. Distributed Systems 18. MapReduce. Simplest environment for parallel processing. Background.

CS /5/18. Paul Krzyzanowski 1. Credit. Distributed Systems 18. MapReduce. Simplest environment for parallel processing. Background. Credit Much of this information is from Google: Google Code University [no longer supported] http://code.google.com/edu/parallel/mapreduce-tutorial.html Distributed Systems 18. : The programming model

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 16 J. Gamper 1/53 Advanced Data Management Technologies Unit 16 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 1: MapReduce Algorithm Design (4/4) January 16, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

CmpE 138 Spring 2011 Special Topics L2

CmpE 138 Spring 2011 Special Topics L2 CmpE 138 Spring 2011 Special Topics L2 Shivanshu Singh shivanshu.sjsu@gmail.com Map Reduce ElecBon process Map Reduce Typical single node architecture Applica'on CPU Memory Storage Map Reduce Applica'on

More information

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015 Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Map-Reduce. John Hughes

Map-Reduce. John Hughes Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc

More information

L22: SC Report, Map Reduce

L22: SC Report, Map Reduce L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

What is this course about?

What is this course about? Data-Intensive Information Processing Applications! Session #1 Introduction to MapReduce Jordan Boyd-Graber University of Maryland Thursday, February 3, 2011 This work is licensed under a Creative Commons

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters

Motivation. Map in Lisp (Scheme) Map/Reduce. MapReduce: Simplified Data Processing on Large Clusters Motivation MapReduce: Simplified Data Processing on Large Clusters These are slides from Dan Weld s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google,

More information

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

Databases 2 (VU) ( )

Databases 2 (VU) ( ) Databases 2 (VU) (707.030) Map-Reduce Denis Helic KMI, TU Graz Nov 4, 2013 Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment

More information

Map-Reduce (PFP Lecture 12) John Hughes

Map-Reduce (PFP Lecture 12) John Hughes Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days

More information

Map-Reduce and Related Systems

Map-Reduce and Related Systems Map-Reduce and Related Systems Acknowledgement The slides used in this chapter are adapted from the following sources: CS246 Mining Massive Data-sets, by Jure Leskovec, Stanford University, http://www.mmds.org

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Information Retrieval Processing with MapReduce

Information Retrieval Processing with MapReduce Information Retrieval Processing with MapReduce Based on Jimmy Lin s Tutorial at the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) This

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information