Tutorial Outline. Map/Reduce. Data Center as a computer [Patterson, cacm 2008] Acknowledgements

Map/Reduce
Sharma Chakravarthy
Information Technology Laboratory
Computer Science and Engineering Department
The University of Texas at Arlington, Arlington, TX
4/26/2015

Tutorial Outline
- Map/Reduce
- Hadoop

Acknowledgements
- These slides are put together from a variety of sources (both papers and slides/tutorials available on the web).
- Mostly I have tried to provide my perspective, emphasize aspects that are of interest to this course, and put forth a consolidated view of Map/Reduce.

Data Center as a computer [Patterson, CACM 2008]
- Claim: There are dramatic differences between developing software for millions to use as a service versus distributing software for millions to run on their PCs:
  - Availability, dependability
  - Bandwidth (with low latency) to service a large number of users
  - Innovation is fast, as the software is in their control!
- This has led to distributed data centers.

Data Center as a computer [Patterson, CACM 2008]
- What are the useful programming abstractions for such a large system?
- How do thousands of computers behave differently from a small system?
- What must you do differently to run an abstraction on thousands of computers?
- Google proposed a two-phase primitive:
  - Phase 1: maps a user-supplied function onto thousands of computers
  - Phase 2: reduces the returned values from all those instances into a set of results

Data Center as a computer [Patterson, CACM 2008]
- Map/Reduce was born
- Runs on heterogeneous computers (even different generations)
- Runs on heterogeneous OSs
- Scheduler is dynamic and accommodates the above (as compared to the batch schedulers of Grid computing)
- Failures are handled transparently!
- Hadoop sorted 1 TB in 209 seconds (2008)!
- Google regenerated its index using Map/Reduce
- Fairly easy to program, easy to understand!

Map/Reduce
- Is a programming paradigm
- Is a parallel programming paradigm
- Is derived from functional programming (specification vs. procedural programming; remember that SQL is non-procedural)
- A question that is often asked: Is there a difference between Map/Reduce and traditional parallel programming?

What is Map/Reduce?
- Data-parallel programming model for clusters of commodity machines
- Pioneered by Google
  - Processes 20 PB of data per day
- Popularized by the open-source Hadoop project
  - Used by Yahoo!, Facebook, Amazon, and others

What is Map/Reduce used for?
- At Google:
  - Index building for Google Search
  - Article clustering for Google News
  - Statistical machine translation
- At Yahoo!:
  - Index building for Yahoo! Search
  - Spam detection for Yahoo! Mail
- At Facebook:
  - Data mining
  - Ad optimization
  - Spam detection
- Note that most/all of these applications are very different from traditional DBMS applications:
  - Very large data sizes
  - One-time computation versus data storage and management
  - SQL is not a good vehicle/tool for doing this task
  - Defining a schema and loading this data into a DBMS is difficult

What is Map/Reduce used for (2)?
- Also used for:
  - Graph mining
  - PageRank calculation
  - Machine learning
  - Shortest path
- Problems are being ported to map/reduce on a daily basis
- Is a popular research area!
- Porting algorithms into map/reduce is not always straightforward!

Map/Reduce
- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean abstraction for programmers
- Borrows from functional programming
- Users implement an interface of two functions:
    map (in_key, in_value) -> list of (int_key, intermediate_value)
    reduce (int_key, list of intermediate_value) -> list of (out_key, value)
- in_key, int_key, and out_key need not be the same!

The Basics (5)
- Note that we cannot assume that every problem can be parallelized
- Some problems are easier to parallelize and some are inherently sequential
- So it is important to understand whether a problem can be parallelized and to what extent!
- Cryptographic hash computations, where each step depends on the result of the previous one, cannot be parallelized
- Parallelizing I/O is difficult (needs additional technology)

Map/Reduce
- Map/Reduce was developed within Google as a mechanism for processing large amounts of raw data, for example crawled documents or web request logs.
- This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time.
- This distribution implies parallel computing, since the same computations are performed on each CPU but with a different dataset.
- Map/Reduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

MapReduce
- Input: a set of key/value pairs (generated from input data)
- User supplies two functions:
    map(k, v) -> list of (k1, v1)
    reduce(k1, list(v1)) -> list of (k2, v2)
[Figure: input split -> Map(k, v) emits intermediate (k, v) pairs -> partition into shards -> sort and group the intermediate pairs by key -> Reduce(k, v[]) -> output]

High level view of Map and reduce
[Figure: six input pairs (k1, v1) ... (k6, v6) feed four map tasks, which emit (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8). The combiner of the second map task merges (c, 3) and (c, 6) into (c, 9). After partitioning, shuffle and sort aggregates values by key: a -> [1, 5], b -> [2, 7], c -> [2, 9, 8]. Three reduce tasks produce (r1, s1), (r2, s2), (r3, s3).]
- Partitioning is done in the mapper
- Shuffle is done by the network
- Sort is done in the reducer
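The high-level view above can be imitated in a few lines of single-process Python. This is only a sketch for building intuition, not how Hadoop or Google's library is implemented: the names run_mapreduce, identity_map and total are made up for this illustration, the four "map tasks" are plain lists holding the map outputs from the figure, and the shuffle is an in-memory dictionary.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, combine_fn=None):
    """Toy model of map -> (combine) -> shuffle/sort -> reduce."""
    shuffled = defaultdict(list)
    for split in splits:                       # one "map task" per split
        local = defaultdict(list)
        for key, value in split:
            for k, v in map_fn(key, value):
                local[k].append(v)
        for k, vs in local.items():            # optional per-task combine
            if combine_fn is not None:
                vs = [combine_fn(k, vs)]
            shuffled[k].extend(vs)             # shuffle: group values by key
    # sort: each key is handed to reduce in sorted key order
    return [(k, reduce_fn(k, shuffled[k])) for k in sorted(shuffled)]

def identity_map(key, value):
    yield key, value

def total(key, values):
    return sum(values)

# Map outputs taken from the figure, one inner list per map task.
MAP_OUTPUTS = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

if __name__ == "__main__":
    print(run_mapreduce(MAP_OUTPUTS, identity_map, total, combine_fn=total))
```

With a sum as both combiner and reducer, the result is the same whether or not the combiner runs; only the number of values crossing the shuffle changes.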

Distributed Execution Overview
[Figure: the user program (1) forks the master and the workers; the master (2) assigns map and reduce tasks; map-phase workers (3) read the input splits (Split 0 to Split 3) and (4) write intermediate files locally; reduce-phase workers (5) read them remotely and sort, then (6) write Output File 0 and Output File 1; (7) wakeup of the user program.]

Map/Reduce: Approach
- The MASTER:
  - initializes the data and splits it up according to the number of available WORKERS (the number of mappers can be chosen)
  - sends each WORKER its partition
  - receives the results from each WORKER (its location and some meta information)
- The WORKER:
  - receives the partition (actually its location) from the MASTER
  - performs processing on the partition
  - returns results to the MASTER

Map/Reduce
- Static load balancer: allocates processes to processors at run time while taking no account of current network load.
- Map, written by a user of the Map/Reduce library, takes an input pair and produces a set of intermediate key/value pairs.
- The Map/Reduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.
- Shuffle is an important, transparent step.
- The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.
- See the OSDI 2004 paper.

Data flow
- Input and final output are stored on a distributed file system (e.g., HDFS)
  - Scheduler tries to schedule map tasks close to the physical storage location of the input data
- Intermediate results are stored on the local FS of map and reduce workers
- Output is often input to another map/reduce task
  - Not all problems are solved with one round of map/reduce computation!

Coordination
- Master data structures
  - Task status: (idle, in progress, completed)
  - Idle tasks get scheduled as workers become available
  - When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  - Master pushes this info to reducers
- Master pings workers periodically to detect failures

Failures
- Map worker failure
  - Map tasks completed or in progress at the worker are reset to idle
  - Reduce workers are notified when a task is rescheduled on another worker
- Reduce worker failure
  - Only in-progress tasks are reset to idle
- Master failure
  - MapReduce task is aborted and the client is notified

Map/Reduce execution overview
- The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards.
- The input shards can be processed in parallel on different machines.
- Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).
- The number of partitions (R) and the partitioning function can be specified by the user.
- The illustration below shows the overall flow of a Map/Reduce operation. When the user program calls the Map/Reduce function, a sequence of actions occurs; the numbered labels in the illustration give their order.

Distributed Execution Overview
[Figure: same flow as in the earlier Distributed Execution Overview figure, with three input splits: (1) fork, (2) assign map/reduce, (3) read, (4) local write, (5) remote read and sort, (6) write to Output File 0 and Output File 1, (7) wakeup. Map-side workers are labelled mappers and reduce-side workers reducers.]
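The partitioning function hash(key) mod R mentioned above can be sketched as follows. The function names are hypothetical, and zlib.crc32 is only a stand-in chosen because it gives the same value in every process (Python's built-in hash() of a string does not); a real framework uses its own hash function.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """hash(key) mod R, with crc32 standing in for the hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Split one map task's output into R buckets, one per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

if __name__ == "__main__":
    pairs = [("a", 1), ("b", 2), ("a", 5), ("c", 2)]
    for r, bucket in enumerate(partition_map_output(pairs, 3)):
        print(r, bucket)
```

Because every map task applies the same function, all values for a given key end up at the same reducer no matter which map task produced them.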

Data distribution across nodes
[Figure: Input Files -> InputFormat -> InputSplits -> RecordReaders -> Mappers -> Intermediates. Source: redrawn from a slide by Cloudera, cc-licensed]

Mapping Lists
- A list of data elements is provided, one at a time, to a function called the mapper, which transforms each element individually into an output element.

[Figure: Mappers -> Intermediates -> Partitioners (combiners omitted here) -> Intermediates -> Reducers. The same hash function is used by all partitioners! Source: redrawn from a slide by Cloudera, cc-licensed]

Reducing Lists
- Reducing lets you aggregate values together.
- A reducer function receives an iterator of input values from an input list. It then combines these values together, returning an output value.

Shuffle and Sort in Hadoop
- Probably the most complex aspect of MapReduce!
- Map side
  - Map outputs are buffered in memory in a circular buffer
  - When the buffer reaches a threshold, its contents are spilled to disk
  - Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here
- Reduce side
  - First, map outputs are copied over to the reducer machine
  - Sort is a multi-pass merge of map outputs (happens in memory and on disk)
  - The final merge pass goes directly into the reducer

Partition and shuffle
- After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers.
- This process of moving map outputs to the reducers is known as shuffling.
- A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks.
- Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is their origin. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data.

Partition and shuffle
- The Partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result.
- Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
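The reduce-side "sort" described above is really a merge of map outputs that are already sorted by key. A minimal in-memory sketch of that idea, using only the Python standard library (the function name merge_and_group is made up, and real Hadoop does this in several passes across memory and disk):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_group(sorted_runs):
    """Merge key-sorted map outputs and yield (key, [values]) per key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]

if __name__ == "__main__":
    # Three map outputs, each already sorted by key on the map side.
    runs = [
        [("a", 1), ("b", 2)],
        [("a", 5), ("c", 2)],
        [("b", 7), ("c", 8)],
    ]
    for key, values in merge_and_group(runs):
        print(key, values)
```

The reducer then sees each key exactly once, in sorted key order, together with all of its values.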

Partition, shuffle, sort
[Two figure slides; the images are not part of the transcript]

Map tasks and mapper nodes
[Figure slide; the image is not part of the transcript]

An example
[Figure slide; the image is not part of the transcript]

Map and reduce on nodes with data
[Figure slide; the image is not part of the transcript]

A closer look
[Figure slide; the image is not part of the transcript]

Hadoop Workflow
[Figure: You <-> Hadoop Cluster]
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS

Input Files
- This is where the data for a Map/Reduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS.
- The format of these files is arbitrary; while line-based log files can be used, we could also use a binary format, multiline input records, or something else entirely.
- It is typical for these input files to be very large: tens of gigabytes or more.

Input Format
- How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
  - Selects the files or other objects that should be used for input
  - Defines the InputSplits that break a file into tasks
  - Provides a factory for RecordReader objects that read the file
- Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class.
- When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each.
- You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job.

InputFormats provided by Map/Reduce

InputFormat             | Description                                      | Key                                      | Value
TextInputFormat         | Default format; reads lines of text files        | The byte offset of the line              | The line contents
KeyValueInputFormat     | Parses lines into key, val pairs                 | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined                             | user-defined

(In the Hadoop API the second class is named KeyValueTextInputFormat.)

InputSplits
- An InputSplit describes a unit of work that comprises a single map task in a Map/Reduce program.
- A Map/Reduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks.
- Map tasks may involve reading a whole file; they often involve reading only part of a file.
- By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value.
- By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism.

InputSplits
- Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another.
- Of course, while log files can be processed in this piecewise fashion, some file formats are not amenable to chunked processing.
- By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.
- The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split.
- The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident.
- An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can.
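To make the chunking concrete, here is a small sketch of how a file could be cut into fixed-size splits. The function name compute_splits is made up for this illustration, and the 64 MB default simply mirrors the figure quoted above; Hadoop's own FileInputFormat applies additional rules (block locations, minimum and maximum split sizes) that are left out here.

```python
def compute_splits(file_size: int, split_size: int = 64 * 1024 * 1024):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

if __name__ == "__main__":
    MB = 1024 * 1024
    # A 200 MB file yields three full 64 MB splits and one 8 MB remainder.
    for offset, length in compute_splits(200 * MB):
        print(offset // MB, "MB +", length // MB, "MB")
```

Each (offset, length) pair would become one map task, which is why a single large file can keep many nodes busy at once.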

Record Reader
- The InputSplit has defined a slice of work, but does not describe how to access it.
- The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
- The RecordReader instance is defined by the InputFormat.
- The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file.
- The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.

Mapper
- The Mapper performs the interesting user-defined work of the first phase of the Map/Reduce program.
- Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the Reducers.
- A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit) that makes up part of the total job input.
- The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine.
- The map() method receives two parameters in addition to the key and the value: an OutputCollector and a Reporter.

Reduce
- A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work.
- For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key.
- The values associated with a key are returned by the iterator in an undefined order.
- The Reducer also receives as parameters OutputCollector and Reporter objects; they are used in the same manner as in the map() method.
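The record-reader idea described above (byte offset as key, line contents as value) can be sketched in Python. The function read_records is hypothetical and works on an in-memory byte string. The boundary rule used here is an assumption made for the sketch: a line belongs to the split in which it starts, so a reader that begins mid-line skips ahead and the previous reader finishes that line. Hadoop's own LineRecordReader achieves a similar effect but differs in detail.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) for the lines that start inside one split."""
    end = start + length
    pos = start
    if start > 0:
        # If the split starts mid-line, skip to the next line start;
        # the reader of the previous split finishes the partial line.
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1

if __name__ == "__main__":
    data = b"alpha\nbeta\ngamma\n"
    print(list(read_records(data, 0, 8)))   # split 1: bytes 0..7
    print(list(read_records(data, 8, 9)))   # split 2: bytes 8..16
```

In the usage above the first split ends in the middle of "beta", yet every line is produced exactly once: the first reader completes "beta" and the second starts at "gamma".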
Additional Map/Reduce functionality

Combiner
- The pipeline shown earlier omits a processing step which can be used for optimizing bandwidth usage by your Map/Reduce job.
- Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional.
- If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks.
- The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
- The Combiner is a "mini reduce" process which operates only on data generated by one machine.

An Application: Word Count
- A simple Map/Reduce program can be written to determine the count of words in a set of files. For example, if we had:
    foo.txt: Sweet, this is the foo file
    bar.txt: This is the bar file
- We would expect the output to be:
    sweet 1
    this 2
    is 2
    the 2
    foo 1
    bar 1
    file 2

Word Count example
- Input: Large number of text documents
- Task: Compute word count across all the documents
- Solution:
  - Mapper: For every word in a document, output (word (key), "1" (value))
  - Reducer: Sum all occurrences of words and output (word, total_count)

Word Count Solution
    // Pseudo-code for "word counting"
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int word_count = 0;
        for each v in values:
            word_count += ParseInt(v);
        Emit(key, AsString(word_count));
- No types, just strings
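The word-count pseudo-code can be tried out locally with a small Python sketch that uses the foo.txt and bar.txt contents from the example. It is a single-process imitation, not Hadoop code: the function names are made up, the grouping dictionary stands in for the shuffle, and the tokenization (lower-casing, letters and apostrophes only) is an assumption chosen so that "Sweet," counts as "sweet".

```python
import re
from collections import defaultdict

def map_word_count(doc_name, contents):
    """Mapper: emit (word, 1) for every word in the document."""
    for word in re.findall(r"[a-z']+", contents.lower()):
        yield word, 1

def reduce_word_count(word, counts):
    """Reducer: sum all the counts seen for one word."""
    return word, sum(counts)

def word_count(documents):
    grouped = defaultdict(list)               # stands in for shuffle and sort
    for name, contents in documents.items():
        for word, one in map_word_count(name, contents):
            grouped[word].append(one)
    return dict(reduce_word_count(w, c) for w, c in grouped.items())

if __name__ == "__main__":
    docs = {
        "foo.txt": "Sweet, this is the foo file",
        "bar.txt": "This is the bar file",
    }
    for word, count in word_count(docs).items():
        print(word, count)
```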

Word count optimization: Combiner
[Figure slide; the image is not part of the transcript]

Combiner
- Word count is a prime example where a Combiner is useful. The Word Count program emits a (word, 1) pair for every instance of every word it sees.
- So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer.
- By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer.
- Now each node only sends a single value to the reducer for each word, drastically reducing the total bandwidth required for the shuffle process and speeding up the job.

Combiner
- The best part is that we may not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well.
- The Combiner should be an instance of the Reducer interface. If your Reducer itself cannot be used directly as a Combiner because of commutativity or associativity, you might still be able to write a third class to use as a Combiner for your job.
- Commutative means A o B is the same as B o A
- Associative means (A o B) o C is the same as A o (B o C)
- Count is both commutative and associative (as are + and max); an operator such as subtraction is not associative, and concat is not commutative!

Word frequency example
- Used for decrypting, comparison of documents, etc.
- Input: Large number of text documents
- Task: Compute word frequency (ratio) across all the documents
  - Need to compute the total count as well as the individual counts
- A naive solution with the basic Map/Reduce model requires two Map/Reduces
  - MR1: count the number of all words in these documents (use combiners)
  - MR2: count the number of each word and divide it by the total count from MR1
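A standard illustration of a reduce function that cannot be reused as a Combiner is the arithmetic mean: the mean of per-node means is generally not the overall mean. The sketch below shows the "third class" idea with made-up function names: the combiner emits a (sum, count) partial per node, and the reducer merges the partials. The numbers are invented for the demonstration.

```python
def combine_mean(values):
    """Combiner: collapse one node's raw numbers into a (sum, count) partial."""
    return (sum(values), len(values))

def reduce_mean(partials):
    """Reducer: merge (sum, count) partials and compute the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

def mean(values):
    return sum(values) / len(values)

if __name__ == "__main__":
    node_a, node_b = [1, 2, 3], [10]
    print(mean([mean(node_a), mean(node_b)]))                        # mean of means
    print(reduce_mean([combine_mean(node_a), combine_mean(node_b)]))  # true mean
```

Merging (sum, count) pairs is itself commutative and associative, which is what makes the combined result independent of how the values were spread across nodes.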

Word frequency example
- Can we do better?
- Two nice features of Google's MapReduce implementation:
  - Ordering guarantee of reduce keys
  - Auxiliary functionality: EmitToAllReducers(k, v)
- A nice trick: to compute the total number of words in all documents
  - Every map task sends its total word count with key "" to ALL reducer splits
  - Key "" will be the first key processed by each reducer
  - Sum of its values = total number of words!
- Requires only 1 map and 1 reduce instead of 2 each in a chain

Word frequency solution: mapper with combiner
    map(String key, String value):
        // key: document name, value: document contents
        int word_count = 0;
        for each word w in value:
            EmitIntermediate(w, "1");
            word_count++;
        EmitIntermediateToAllReducers("", AsString(word_count));

    combine(String key, Iterator values):
        // Combiner for map output
        // key: a word, values: a list of counts
        int partial_word_count = 0;
        for each v in values:
            partial_word_count += ParseInt(v);
        Emit(key, AsString(partial_word_count));

Word frequency solution: reducer
    reduce(String key, Iterator values):
        // Actual reducer, key: a word
        // values: a list of counts
        if (is_first_key):
            assert("" == key);   // sanity check
            total_word_count_ = 0;
            for each v in values:
                total_word_count_ += ParseInt(v);
        else:
            assert("" != key);   // sanity check
            int word_count = 0;
            for each v in values:
                word_count += ParseInt(v);
            // the division is meant as a ratio, not an integer division
            Emit(key, AsString(word_count / total_word_count_));

Computing averages
- Average income in a city
  - From census or other data
- Joins two inputs (the SSN keys are missing from the transcript):
    SSTable 1: (SSN, {Personal Information})
      (SSN): (John Smith; Sunnyvale, CA)
      (SSN): (Jane Brown; Mountain View, CA)
      (SSN): (Tom Little; Mountain View, CA)
    SSTable 2: (SSN, {year, income})
      (SSN): (2007, $70000), (2006, $65000), (2005, $6000), ...
      (SSN): (2007, $72000), (2006, $70000), (2005, $6000), ...
      (SSN): (2007, $80000), (2006, $85000), (2005, $7500), ...
- Task: Compute the average income in each city in 2007
- Note: Both inputs are sorted by SSN (they do not have to be)
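The single-pass word-frequency trick can be simulated in one Python process. This is a sketch under several assumptions: word_frequencies is a made-up name, the list of dictionaries stands in for the reducer partitions, zlib.crc32 stands in for the partitioning hash, appending to every partition imitates EmitIntermediateToAllReducers, and iterating over sorted keys imitates the ordering guarantee (the empty string sorts before every word).

```python
import zlib
from collections import defaultdict

def word_frequencies(documents, num_reducers=2):
    # One dict per reducer partition: key -> list of values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for contents in documents:                    # one map task per document
        words = contents.lower().split()
        for w in words:
            r = zlib.crc32(w.encode("utf-8")) % num_reducers
            partitions[r][w].append(1)
        for part in partitions:                   # "emit to all reducers"
            part[""].append(len(words))
    freqs = {}
    for part in partitions:                       # one reduce task per partition
        total = None
        for key in sorted(part):                  # "" is processed first
            if key == "":
                total = sum(part[key])
            else:
                freqs[key] = sum(part[key]) / total
    return freqs

if __name__ == "__main__":
    print(word_frequencies(["a b a", "b c"]))
```

Every reducer learns the global total before it sees its first real word, so the ratio can be emitted in the same pass.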

Computing averages
- Mapper 1a: Input: SSN -> personal info; Output: (SSN, city)
- Mapper 1b: Input: SSN -> annual incomes; Output: (SSN, 2007 income)
- Reducer 1: Input: SSN -> {city, 2007 income}; Output: (SSN, [city, 2007 income])
- Mapper 2: Input: SSN -> [city, 2007 income]; Output: (city, 2007 income)
- Reducer 2: Input: city -> 2007 incomes; Output: (city, AVG(2007 incomes))

Average income in a joined solution
- The previous example showed a sorted input. But it does not have to be sorted.
- The input can be a single one, as in our project case!
- Mapper: Input: SSN -> personal info and incomes; Output: (city, 2007 income)
- Reducer: Input: city -> {2007 incomes}; Output: (city, AVG(2007 incomes))

Fault tolerance
- This fault tolerance underscores the need for program execution to be side-effect free.
- If Mappers and Reducers had individual identities and communicated with one another or the outside world, then restarting a task would require the other nodes to communicate with the new instances of the map and reduce tasks, and the re-executed tasks would need to reestablish their intermediate state.
- This process is notoriously complicated and error-prone in the general case.
- Map/Reduce simplifies this problem drastically by eliminating task identities and the ability for task partitions to communicate with one another.
- An individual task sees only its own direct inputs and knows only its own outputs, to make this failure and restart process clean and dependable.

MR tolerance and DBMS recovery
- This fault tolerance underscores the need for program execution to be side-effect free.
- This requirement is also needed/used in the recovery of a DBMS using logs. If a transaction were to communicate with the outside (i.e., beyond reading from and writing to disks, and with others), recovery becomes very complicated and may not even be feasible.
- DBMS recovery aims at restoring the DBMS to a consistent state so that aborted transactions can be re-executed from a consistent state.
- It also requires that each transaction leaves the DBMS in a consistent state if it completes!
- The ACID property (which is much stronger than what is used in MR) is guaranteed in a DBMS.
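Returning to the Computing averages slides, the two-stage plan (join on SSN, then re-key by city and average) can be sketched in plain Python. Everything here is illustrative: the function name is made up, the identifiers "id-1" to "id-3" are invented placeholders because the SSN values do not appear in the slides, and pairing the three income rows with the three people in listing order is an assumption.

```python
from collections import defaultdict

def average_income_by_city(people, incomes, year=2007):
    # Stage 1 (mappers 1a/1b + reducer 1): group both inputs by the join key.
    joined = defaultdict(dict)
    for ssn, (name, city) in people:
        joined[ssn]["city"] = city
    for ssn, yearly in incomes:
        for y, amount in yearly:
            if y == year:
                joined[ssn]["income"] = amount
    # Stage 2 (mapper 2 + reducer 2): re-key by city and average.
    by_city = defaultdict(list)
    for record in joined.values():
        if "city" in record and "income" in record:
            by_city[record["city"]].append(record["income"])
    return {city: sum(v) / len(v) for city, v in by_city.items()}

# Placeholder keys; names, cities and incomes are taken from the slide.
PEOPLE = [
    ("id-1", ("John Smith", "Sunnyvale, CA")),
    ("id-2", ("Jane Brown", "Mountain View, CA")),
    ("id-3", ("Tom Little", "Mountain View, CA")),
]
INCOMES = [
    ("id-1", [(2007, 70000), (2006, 65000)]),
    ("id-2", [(2007, 72000), (2006, 70000)]),
    ("id-3", [(2007, 80000), (2006, 85000)]),
]

if __name__ == "__main__":
    print(average_income_by_city(PEOPLE, INCOMES))
```

Neither input needs to be sorted for this to work, since grouping by key is done by the shuffle (here, by the dictionaries).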

Chaining jobs
- Not every problem can be solved with a Map/Reduce program, but fewer still are those which can be solved with a single Map/Reduce job.
- Many problems can be solved with Map/Reduce by writing several Map/Reduce steps which run in series to accomplish a goal:
    Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 ...
- You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job.
- Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc.

Chaining examples (4/26/2015, Soumyava Das)
- Suppose you want to compute:
  - Single Source Shortest Path
  - Page Rank
  - Graph substructures
- The above problems cannot be done in one iteration.
- This means several map/reduce pairs have to be chained to solve the problem!

Chaining jobs
- The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
- Many problems which at first seem impossible in Map/Reduce can be accomplished by dividing one job into two or more.
- Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs.
- Rather than submitting a JobConf to the JobClient's runJob() or submitJob() methods, org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; dependencies can be accommodated.

Graph Input
[Figure: directed graph with nodes 0 to 7]

Node Id | Adj List | Distance
0       | [1,2,3]  | 0
1       | [2,7]    | Null
2       | [5,6]    | Null
3       | [4]      | Null
4       | [5]      | Null
5       | [6]      | Null
6       | [1,7]    | Null
7       | []       | Null

- Distance indicates the distance of the node from the source
- Initially only the self distance is known, so all other distances are Null

Iteration 1: Mapper in action
- Inputs (split into partitions P1 and P2, one per mapper):

Node Id | Adj List | Distance
0       | [1,2,3]  | 0
1       | [2,7]    | Null
2       | [5,6]    | Null
3       | [4]      | Null
4       | [5]      | Null
5       | [6]      | Null
6       | [1,7]    | Null
7       | []       | Null

- Outputs (key, value): the adjacency lists 1 [2,7]; 2 [5,6]; 3 [4]; 4 [5]; 5 [6]; 6 [1,7], plus the distances sent to the neighbours of the source [distance entries not recoverable from the transcript]
- Vertex 0 does not emit its distance or adjacency list as it is the start node
- Vertex 7 also does not emit anything as its adjacency list is empty

Iteration 1: Combiner in action
- Combiner combines same keys in a mapper
- It is an in-mapper reducer
- Reducer inputs (after combining): 1 -> [2,7], 1; 2 -> [5,6], 1; 3 -> [4], 1; 4 -> [5]; 5 -> [6]; 6 -> [1,7]

Iteration 1: Reducer in action
- Reducer has the key as vertex id followed by values.
- If number of values < 2:
  - if the value has "[" (it is an adjacency list): not reached
  - else: reached with that distance
- If number of values >= 2: the values contain the distances from the source to this node and the adjacency list. Emit the minimum of the distance values and the adjacency list.
- Reducer input: 1 -> [2,7], 1; 2 -> [5,6], 1; 3 -> [4], 1; 4 -> [5]; 5 -> [6]; 6 -> [1,7]
- Reducer output: [table not recoverable from the transcript]

Iteration i (i = 2): Mapper in action
- Input (with mapper assignment): 1 [2,7] 1 (M1); 2 [5,6] 1 (M1); 3 [4] 1 (M1); 4 [5] (M2); 5 [6] (M2); 6 [1,7] (M2)
- Output: [table not recoverable from the transcript]

Iteration i (i = 2): Combiner in action
[Figure: individual map task outputs and the combiner output; most values are not recoverable from the transcript]

Iteration i (i = 2): Reducer in action
- Reducer input: 1 -> [2,7], 1; 2 -> 2, [5,6], 1; 3 -> [4], 1; 4 -> 2, [5]; 5 -> 2, [6]; 6 -> 2, [1,7]; 7 -> 2
- Reducer output: each node with its adjacency list and its minimum distance [distance column not recoverable from the transcript]
- Continue iterating now for i > 2

Takeaways
- Convergence: the algorithm will converge when the distances of nodes from the source do not change across iterations
- Need to check the convergence criterion periodically to stop iterating (additional processing)
- The algorithm keeps the distance and not the path (extra bookkeeping is needed for storing the path)

Summary
- Additional features such as pipes and streaming are available in Hadoop.
- If you are familiar with C++ or Java, it is not very difficult to understand the basic concept and use it.
- Of course, if you want to use advanced features, you need to learn them.
- Much easier than using a DBMS for some jobs where the data is in free format; will discuss more of this later!
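The iterative shortest-path computation walked through above can be sketched as a chain of map/reduce rounds in plain Python, using the graph from the Graph Input slide. This is a simplified illustration rather than the slides' exact scheme: the function names are made up, every edge is assumed to have weight 1, and every node (including the source and node 7) re-emits its own record each round so that the reducer can rebuild the state.

```python
from collections import defaultdict

def sssp_iteration(state):
    """One map/reduce round. state: node -> (adjacency list, distance or None)."""
    emitted = defaultdict(list)
    for node, (adj, dist) in state.items():          # map phase
        emitted[node].append(("adj", adj))           # pass the graph structure along
        if dist is not None:
            emitted[node].append(("dist", dist))
            for neighbour in adj:                    # candidate distance via this node
                emitted[neighbour].append(("dist", dist + 1))
    new_state = {}
    for node, values in emitted.items():             # reduce phase
        adj = next(v for tag, v in values if tag == "adj")
        dists = [v for tag, v in values if tag == "dist"]
        new_state[node] = (adj, min(dists) if dists else None)
    return new_state

def shortest_paths(graph, source):
    """Chain rounds until no distance changes (the convergence check)."""
    state = {n: (adj, 0 if n == source else None) for n, adj in graph.items()}
    rounds = 0
    while True:
        new_state = sssp_iteration(state)
        rounds += 1
        if new_state == state:
            return {n: d for n, (_, d) in state.items()}, rounds
        state = new_state

GRAPH = {0: [1, 2, 3], 1: [2, 7], 2: [5, 6], 3: [4], 4: [5], 5: [6], 6: [1, 7], 7: []}

if __name__ == "__main__":
    distances, rounds = shortest_paths(GRAPH, source=0)
    print(distances, "after", rounds, "rounds")
```

The last round changes nothing and exists only to detect convergence, which is the extra processing the takeaways mention. In a real chained job, each round would read the previous round's output path as its input.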

Questions!

Contact: Sharma Chakravarthy, Information Technology Laboratory, Computer Science and Engineering Department, The University of Texas at Arlington, Arlington, TX 76009. Email: sharma@cse.uta.edu


Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Parallel Programming Concepts

Parallel Programming Concepts Parallel Programming Concepts MapReduce Frank Feinbube Source: MapReduce: Simplied Data Processing on Large Clusters; Dean et. Al. Examples for Parallel Programming Support 2 MapReduce 3 Programming model

More information

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 16 J. Gamper 1/53 Advanced Data Management Technologies Unit 16 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University MapReduce & HyperDex Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University 1 Distributing Processing Mantra Scale out, not up. Assume failures are common. Move processing to the data. Process

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Part A: MapReduce. Introduction Model Implementation issues

Part A: MapReduce. Introduction Model Implementation issues Part A: Massive Parallelism li with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford cources CS246, CS345A and

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu

More information

CS MapReduce. Vitaly Shmatikov

CS MapReduce. Vitaly Shmatikov CS 5450 MapReduce Vitaly Shmatikov BackRub (Google), 1997 slide 2 NEC Earth Simulator, 2002 slide 3 Conventional HPC Machine ucompute nodes High-end processors Lots of RAM unetwork Specialized Very high

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015 Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no

More information

MapReduce Algorithm Design

MapReduce Algorithm Design MapReduce Algorithm Design Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul Big

More information

Large-Scale GPU programming

Large-Scale GPU programming Large-Scale GPU programming Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Assistant Adjunct Professor Computer and Information Science Dept. University

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

CS /5/18. Paul Krzyzanowski 1. Credit. Distributed Systems 18. MapReduce. Simplest environment for parallel processing. Background.

CS /5/18. Paul Krzyzanowski 1. Credit. Distributed Systems 18. MapReduce. Simplest environment for parallel processing. Background. Credit Much of this information is from Google: Google Code University [no longer supported] http://code.google.com/edu/parallel/mapreduce-tutorial.html Distributed Systems 18. : The programming model

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35 MapReduce Kiril Valev LMU valevk@cip.ifi.lmu.de 23.11.2013 Kiril Valev (LMU) MapReduce 23.11.2013 1 / 35 Agenda 1 MapReduce Motivation Definition Example Why MapReduce? Distributed Environment Fault Tolerance

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Database System Architectures Parallel DBs, MapReduce, ColumnStores Database System Architectures Parallel DBs, MapReduce, ColumnStores CMPSCI 445 Fall 2010 Some slides courtesy of Yanlei Diao, Christophe Bisciglia, Aaron Kimball, & Sierra Michels- Slettvet Motivation:

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data) Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Map-Reduce. John Hughes

Map-Reduce. John Hughes Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Hadoop MapReduce Framework

Hadoop MapReduce Framework Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information