MapReduce and Hadoop. The reference Big Data stack


Università degli Studi di Roma Tor Vergata
Dipartimento di Ingegneria Civile e Ingegneria Informatica

MapReduce and Hadoop
Corso di Sistemi e Architetture per Big Data, A.A. 2017/18
Valeria Cardellini (SABD 2017/18)

The reference Big Data stack
- High-level Interfaces
- Data Processing
- Data Storage
- Resource Management
- Support / Integration

The Big Data landscape
[figure: the Big Data landscape]

The open-source landscape for Big Data
[figure: the open-source landscape for Big Data]

Processing Big Data
Some frameworks and platforms to process Big Data:
- Relational DBMS: old fashioned, no longer applicable at this scale
- Batch processing: store and process data sets at massive scale (especially Volume + Variety); in this lesson, MapReduce and Hadoop
- Data stream processing: process fast data in real time, as the data is being generated and without storing it

MapReduce

Parallel programming: background
- Parallel programming: break processing into parts that can be executed concurrently on multiple processors
- Challenge: identify tasks that can run concurrently and/or groups of data that can be processed concurrently
- Not all problems can be parallelized!

The simplest environment for parallel programming:
- No dependency among data
- Data can be split into equal-size chunks, and each process can work on a chunk

Master/worker approach
- Master: initializes the array and splits it according to the number of workers; sends each worker its sub-array; receives the results from each worker
- Worker: receives a sub-array from the master, performs the processing, and sends the results back to the master

Single Program, Multiple Data (SPMD): a technique to achieve parallelism, and the most common style of parallel programming

Key idea behind MapReduce: divide and conquer
A feasible approach to tackling large-data problems:
- Partition a large problem into smaller sub-problems
- Execute independent sub-problems in parallel
- Combine the intermediate results from each individual worker

The workers can be:
- Threads in a processor core
- Cores in a multi-core processor
- Multiple processors in a machine
- Many machines in a cluster

The implementation details of divide and conquer are complex.

Divide and conquer: how?
- Decompose the original problem into smaller, parallel tasks
- Schedule tasks on workers distributed in a cluster, taking into account data locality and resource availability
- Ensure workers get the data they need
- Coordinate synchronization among workers
- Share partial results
- Handle failures

Key idea behind MapReduce: scale out, not up!
- For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
- The cost of super-computers is not linear
- Datacenter efficiency is a difficult problem to solve, despite recent improvements
- Processing data is quick; I/O is very slow

Sharing vs. shared nothing
- Sharing: manage a common/global state
- Shared nothing: independent entities, no common state
- Sharing is difficult: synchronization and deadlocks, finite bandwidth to access data from a SAN, complicated temporal dependencies (restarts)

MapReduce
- A programming model for processing huge data sets over thousands of servers
- Originally proposed by Google in 2004: "MapReduce: simplified data processing on large clusters"
- Based on a shared-nothing approach
- Also an associated implementation (framework) of the distributed system that runs the corresponding programs
- Some examples of applications at Google: Web indexing, reverse Web-link graph, distributed sort, Web access statistics

MapReduce: programmer view
- MapReduce hides system-level details; the key idea is to separate the what from the how
- MapReduce abstracts away the distributed part of the system; such details are handled by the framework
- Programmers get a simple API and don't have to worry about parallelization, data distribution, load balancing, or fault tolerance

Typical Big Data problem
- Iterate over a large number of records
- Extract something of interest from each record
- Shuffle and sort the intermediate results
- Aggregate the intermediate results
- Generate the final output
Key idea: provide a functional abstraction of the two operations, Map and Reduce

MapReduce: model
- Processing occurs in two phases: Map and Reduce
- Functional programming roots (e.g., Lisp)
- Map and Reduce are defined by the programmer
- Input and output: sets of key-value pairs
- Programmers specify two functions, map and reduce:
  map(k1, v1) -> [(k2, v2)]
  reduce(k2, [v2]) -> [(k3, v3)]
- (k, v) denotes a (key, value) pair; [...] denotes a list
- Keys do not have to be unique: different pairs can have the same key
- Normally the keys k1 of the input elements are not relevant

Map
- Executes a function on a set of key-value pairs (an input shard) to create a new list of key-value pairs:
  map(in_key, in_value) -> list(out_key, intermediate_value)
- Map calls are distributed across machines by automatically partitioning the input data into shards
- Parallelism is achieved because different keys can be processed by different machines
- The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function
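The two-phase model above can be simulated in a few lines of plain Python (an illustrative single-machine sketch, not Hadoop code; the function and variable names are ours):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input (k1, v1) pair
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle and sort: group all intermediate values by key k2
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: apply reduce_fn to each (k2, [v2]) group
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output
```

Note how the grouping step between the two phases is exactly the implicit "shuffle and sort" that the framework performs for the programmer.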

Reduce
- Combines the sets of values to create a new value:
  reduce(out_key, list(intermediate_value)) -> list(out_key, out_value)
- The key in the output is often identical to the key in the input
- Parallelism is achieved because reducers operating on different keys can be executed simultaneously

Your first MapReduce example (in Lisp)
Example: sum-of-squares (sum the squares of the numbers from 1 to n) in MapReduce fashion
- Map function: (map square [1, 2, 3, 4]) returns [1, 4, 9, 16]
- Reduce function: (reduce + [1, 4, 9, 16]) returns 30 (the sum of the squared elements)
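The same sum-of-squares example translates directly to Python's built-in map and functools.reduce (shown only to illustrate the functional-programming roots; MapReduce generalizes these operators to key-value pairs):

```python
from functools import reduce

# Map: square each element
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
# Reduce: fold the list into the sum of the squares
total = reduce(lambda acc, x: acc + x, squares)      # 30
```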

MapReduce program
A MapReduce program, referred to as a job, consists of:
- Code for Map and Reduce packaged together
- Configuration parameters (where the input lies, where the output should be stored)
- The input data set, stored on the underlying distributed file system (the input will not fit on a single computer's disk)
Each MapReduce job is divided by the system into smaller units called tasks:
- Map tasks, or mappers
- Reduce tasks, or reducers
All mappers need to finish before the reducers can begin. The output of a MapReduce job is also stored on the underlying distributed file system. A MapReduce program may consist of many rounds of different map and reduce functions.

MapReduce computation
1. Some number of Map tasks are each given one or more chunks of data from a distributed file system
2. These Map tasks turn their chunks into sequences of key-value pairs; how key-value pairs are produced from the input data is determined by the user's code for the Map function
3. The key-value pairs from each Map task are collected by a master controller and sorted by key
4. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task
5. The Reduce tasks work on one key at a time and combine all the values associated with that key in some way; the manner of combination is determined by the user's code for the Reduce function
6. Output key-value pairs from each reducer are written persistently back onto the distributed file system
7. The output ends up in r files, where r is the number of reducers; such output may be the input to a subsequent MapReduce phase

Where the magic happens
- Implicit between the map and reduce phases is a distributed group-by operation on intermediate keys, called shuffle and sort
- It transfers the mappers' output to the reducers, merging and sorting it; intermediate data arrive at every reducer sorted by key
- Intermediate keys are transient: they are not stored on the distributed file system, but spilled to the local disk of each machine in the cluster
[figure: input splits -> (k, v) pairs -> Map function -> intermediate (k', v') pairs -> Reduce function -> final (k'', v'') pairs]

MapReduce computation: the complete picture

A simplified view of MapReduce: example
- Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
- Reducers are applied to all intermediate values associated with the same intermediate key
- Between the map and reduce phases lies a barrier that involves a large distributed sort and group-by

Hello World in MapReduce: WordCount
Problem: count the number of occurrences of each word in a large collection of documents
- Input: a repository of documents; each document is an element
- Map: read a document and emit a sequence of key-value pairs where the keys are the words of the document and the values are all equal to 1: (w1, 1), (w2, 1), ..., (wn, 1)
- Shuffle and sort: group by key and generate pairs of the form (w1, [1, 1, ..., 1]), ..., (wn, [1, 1, ..., 1])
- Reduce: add up all the values and emit (w1, k), ..., (wn, l)
- Output: (w, m) pairs where w is a word that appears at least once among all the input documents and m is the total number of its occurrences among those documents

WordCount in practice
[figure: Map -> Shuffle -> Reduce]

WordCount: Map
Map emits each word in the document with an associated value equal to 1

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

WordCount: Reduce
Reduce adds up all the "1" values emitted for a given word

  Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

This is pseudo-code; for the complete code of the example see the MapReduce paper.

Example: WordLengthCount
Problem: count how many words of each length exist in a collection of documents
- Input: a repository of documents; each document is an element
- Map: read a document and emit a sequence of key-value pairs where the key is the length of a word and the value is the word itself: (i, w1), ..., (j, wn)
- Shuffle and sort: group by key and generate pairs of the form (1, [w1, ..., wk]), ..., (n, [wr, ..., ws])
- Reduce: count the number of words in each list and emit (1, l), ..., (p, m)
- Output: (l, n) pairs, where l is a length and n is the total number of words of length l in the input documents
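WordLengthCount can be sketched in plain Python to make the key choice explicit: the word length, not the word itself, becomes the intermediate key (an illustrative single-machine simulation, not Hadoop code):

```python
from collections import defaultdict

def word_length_count(documents):
    # Map: for each word, emit (length, word);
    # the dictionary plays the role of the shuffle's group-by-key
    groups = defaultdict(list)
    for doc in documents:
        for word in doc.split():
            groups[len(word)].append(word)
    # Reduce: emit (length, number of words with that length)
    return {length: len(words) for length, words in groups.items()}
```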

Example: matrix-vector multiplication
- Sparse matrix A = [a_ij] of size n x n; vector x = [x_j] of size n x 1
- Problem: compute the matrix-vector product y = Ax, where y_i = sum_{j=1..n} a_ij * x_j
- Used in the PageRank algorithm
Let us assume that x fits into the mapper's memory:
- Map: applied to ((i, j), a_ij), produces the key-value pair (i, a_ij * x_j)
- Reduce: receives (i, [a_i1 * x_1, ..., a_in * x_n]) and sums all the values of the list associated with a given key i, i.e., y_i = sum_{j=1..n} a_ij * x_j; the result is the pair (i, y_i)

What if vector x does not fit in the mapper's memory? Solution:
- Split x into horizontal stripes that fit in memory
- Split the matrix A accordingly into vertical stripes (the stripes of A do not need to fit in memory)
- Each mapper is assigned a matrix stripe together with the corresponding vector stripe
- The Map and Reduce functions are as before
- The best strategy is to partition A into square blocks, rather than stripes
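The in-memory case above can be sketched as follows (an illustrative simulation assuming x fits in memory; the function name is ours):

```python
from collections import defaultdict

def matvec_mapreduce(entries, x):
    # entries: ((i, j), a_ij) pairs for the nonzero cells of the sparse A
    # Map: emit (i, a_ij * x[j]); the dict groups the pairs by row key i
    groups = defaultdict(list)
    for (i, j), a_ij in entries:
        groups[i].append(a_ij * x[j])
    # Reduce: y_i is the sum of the partial products for row i
    return {i: sum(products) for i, products in groups.items()}
```

For a dense 2x2 example, A = [[1, 2], [3, 4]] and x = [5, 6] yield y = [17, 39].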

Combine
- When the Reduce function is associative and commutative, we can push some of what the reducers do down to the preceding Map tasks
- Associative and commutative: the values to be combined can be combined in any order, with the same result (e.g., the addition in the Reduce of WordCount)
- In this case we also apply a combiner on the Map side:
  combine(k2, [v2]) -> [(k3, v3)]
- In many cases the same function can be used for combining as for the final reduction
- Shuffle and sort is still necessary!
- Advantages: reduces the amount of intermediate data and the network traffic
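The effect of a combiner is easy to see in a WordCount sketch: the per-mapper pre-aggregation shrinks the number of intermediate pairs crossing the network (illustrative Python, not Hadoop code; function names are ours):

```python
from collections import Counter

def map_wordcount(document):
    # Plain map output: one (word, 1) pair per occurrence
    return [(w, 1) for w in document.split()]

def combine(pairs):
    # Local combiner: pre-aggregate a single mapper's output.
    # Valid here because addition is associative and commutative.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())
```

On the document "to be or not to be" the map emits six pairs, but only four distinct pairs leave the combiner.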

MapReduce: execution overview
Master-worker architecture:
- The master controller process coordinates the map and reduce tasks running the MapReduce job; it can rerun tasks on a failed compute node
- The worker processes execute the map and reduce tasks

Coping with failures
- Worst scenario: the compute node hosting the master fails; the entire MapReduce job must be restarted
- The compute node hosting a map worker fails: all the Map tasks that were assigned to this worker have to be redone, even if they had completed
- The compute node hosting a reduce worker fails: reschedule its tasks on another reduce worker

Apache Hadoop

What is Apache Hadoop?
- An open-source software framework for reliable, scalable, distributed data-intensive computing, originally developed by Yahoo!
- Goal: storage and processing of data sets at massive scale
- Infrastructure: clusters of commodity hardware
- Core components: HDFS (the Hadoop Distributed File System), Hadoop YARN, and Hadoop MapReduce
- Includes a number of related projects, among which Apache Pig, Apache Hive, and Apache HBase
- Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo! and many others

Hadoop runs on clusters
- Compute nodes are mounted on racks, typically 8-64 compute nodes per rack; there can be many racks of compute nodes
- The nodes on a single rack are connected by a network, typically gigabit Ethernet; racks are connected by another level of network or a switch
- The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication
- Compute nodes can fail! Solution: files are stored redundantly, and computations are divided into tasks

Hadoop runs on commodity hardware
Pros:
- You are not tied to expensive, proprietary offerings from a single vendor
- You can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster
- Scale out, not up
Cons:
- Significant failure rates; the solution is replication!

Commodity hardware
A typical specification of commodity hardware:
- Processor: multi-core Intel Xeon
- Memory: ECC RAM
- Storage: 4 x 1 TB HDD
- Network: 1-10 Gb Ethernet

Yahoo! has a huge installation of Hadoop:
- In 2014: more than 100,000 CPUs in over 40,000 computers running Hadoop, used to support research for Ad Systems and Web Search, and also for scaling tests to support the development of Apache Hadoop on larger clusters
- In 2017: 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments

Hadoop core
- HDFS (already examined)
- Hadoop YARN: cluster resource management
- Hadoop MapReduce: a system for writing applications that process large data sets in parallel on large clusters (over 1000 nodes) of commodity hardware in a reliable, fault-tolerant manner

Hadoop core: MapReduce
- A programming paradigm for expressing distributed computations over multiple servers
- The powerhouse behind most of today's big data processing
- Also used in other MPP environments and NoSQL databases (e.g., Vertica and MongoDB)
- Improving Hadoop usage: Pig and Hive improve programmability; HBase, Sqoop, and Flume improve data access

Hadoop YARN (we will study YARN later)
- A global ResourceManager (RM)
- A set of per-application ApplicationMasters (AMs)
- A set of NodeManagers (NMs)

YARN job execution

Data locality optimization
- Hadoop tries to run each map task on a data-local node (data locality optimization), so as not to use cluster bandwidth
- Otherwise it runs the task rack-local; off-rack is the last choice

MapReduce data flow: single Reduce task

MapReduce data flow: multiple Reduce tasks
When there are multiple reducers, the map tasks partition their output

Partitioner
- When there are multiple reducers, the map tasks partition their output: one partition for each reduce task
- Goal: determine which reducer will receive which intermediate keys and values
- The records for any given key are all in a single partition
- The default partitioner uses a hash function on the key to determine the bucket (i.e., the reducer)
- Partitioning can also be controlled by a user-defined function, which requires implementing a custom partitioner

MapReduce data flow: no Reduce task
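The default behavior can be sketched in a few lines of Python. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks in Java; here CRC32 stands in as a deterministic hash, purely for illustration:

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and map it onto one of the reducer buckets;
    # all records sharing a key land in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```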

Combiner
- Many MapReduce jobs are limited by the cluster bandwidth; how can we minimize the data transferred between map and reduce tasks? Use a Combine task
- A combiner is an optional, localized reducer that applies a user-provided method to combine mapper output
- It performs data aggregation on the intermediate data with the same key in a Map task's output, before transmitting the result to the Reduce task
- It takes each key-value pair from the Map task, processes it, and produces the output as key-value collection pairs
- The Reduce task is still needed to process records with the same key coming from different maps

Programming languages for Hadoop
Java is the default programming language. In Java you write a program with at least three parts:
1. A main method, which configures and launches the job: it sets the number of reducers, the mapper and reducer classes, the partitioner, and other Hadoop configurations
2. A Mapper class: takes (k, v) inputs and writes (k, v) outputs
3. A Reducer class: takes (k, Iterator[v]) inputs and writes (k, v) outputs

Programming languages for Hadoop
Hadoop Streaming
- An API to write the Map and Reduce functions in programming languages other than Java
- Uses Unix standard streams as the interface between Hadoop and the MapReduce program
- Allows using any language (e.g., Python) that can read standard input (stdin) and write to standard output (stdout) to write the MapReduce program
- Standard input is the data stream going into a program (by default, from the keyboard); standard output is the stream where a program writes its output data (by default, the text terminal)
- The reducer interface for streaming is different than in Java: instead of receiving reduce(k, Iterator[v]), your script is sent one line per value, including the key

WordCount on Hadoop
Let's analyze the WordCount code:
- The code in Java
- The same code in Python, using the Hadoop Streaming API to pass data between the Map and Reduce code via standard input and standard output (see Michael Noll's tutorial)
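A minimal pair of streaming-style functions for WordCount might look like this (a sketch in the spirit of Michael Noll's tutorial; in a real streaming job each function would read sys.stdin and print its lines, and Hadoop would sort the mapper output by key before the reducer sees it):

```python
def streaming_mapper(lines):
    # Emit one "word<TAB>1" line per word, as the mapper script would print
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    # Input arrives sorted by key, one value per line, key repeated each time
    current, count = None, 0
    for line in sorted_lines:
        word, value = line.rsplit("\t", 1)
        if word != current and current is not None:
            yield f"{current}\t{count}"   # key changed: flush the count
            count = 0
        current = word
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"       # flush the last key
```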

WordCount in Java

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

WordCount in Java (2)
The map method processes one line at a time, splits the line into tokens separated by whitespace via StringTokenizer, and emits a key-value pair <word, 1>

  public class WordCount {

    public static class TokenizerMapper
         extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

WordCount in Java (3)
The reduce method sums up the values, which are the occurrence counts for each key

    public static class IntSumReducer
         extends Reducer<Text, IntWritable, Text, IntWritable> {

      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

WordCount in Java (4)
The main method specifies various facets of the job. The output of each map is passed through the local combiner (here the same class as the Reducer) for local aggregation, after being sorted on the keys.

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

WordCount in Java (5)
Input files:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
Output file:
  Bye 1
  Goodbye 1
  Hadoop 2
  Hello 2
  World 2

WordCount in Java (6)
1st mapper output:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
After the local combiner:
  <Bye, 1> <Hello, 1> <World, 2>

WordCount in Java (7)
2nd mapper output:
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
After the local combiner:
  <Goodbye, 1> <Hadoop, 2> <Hello, 1>
Reducer output:
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

Hadoop installation
Some straightforward alternatives:
- Single-node installation
- Cloudera CDH, Hortonworks HDP, and MapR distributions:
  - Integrated Hadoop-based stacks containing all the components needed for production, tested and packaged to work together
  - Include at least Hadoop, YARN, HDFS (not in MapR), HBase, Hive, Pig, Sqoop, Flume, Zookeeper, Kafka, Spark
  - MapR uses its own distributed POSIX file system (MapR-FS) rather than HDFS

Hadoop installation: Cloudera CDH
- See the QuickStarts
- Single-node cluster: requires a previously installed hypervisor (VMware, KVM, or VirtualBox) or Docker
- QuickStarts are not for use in production

Hadoop installation: Hortonworks HDP
- Hortonworks Sandbox on a VM; requires either VirtualBox, VMware, or Docker
- Image file sizes: GB
- Includes Ambari, an open-source management platform for Hadoop
- Distributions include security frameworks (Sentry, Ranger, Knox, ...) to handle access control, security policies, ...

Hadoop configuration
Tuning Hadoop clusters for good performance is somewhat of a dark art. It involves both hardware and software systems (including the OS), e.g.:
- The optimal number of disks that maximizes I/O bandwidth
- BIOS parameters
- File system type (on Linux, EXT4 is better)
- Open file handles
- The optimal HDFS block size
- JVM settings for Java heap usage and garbage collection

There are also Hadoop-specific parameters that can be tuned for performance:
- The number of mappers
- The number of reducers
- Many other map-side and reduce-side tuning parameters

How many mappers
- The number of mappers is driven by the number of blocks in the input files; you can adjust the HDFS block size to adjust the number of maps
- A good level of parallelism is several mappers per node, and up to 300 mappers for very CPU-light map tasks
- Task setup takes a while, so it is best if the maps take at least a minute to execute

How many reducers
- The number of reducers can be user-defined (the default is 1), via Job.setNumReduceTasks(int)
- The right number of reducers seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)
  - 0.95: all of the reduces can launch immediately and start transferring map outputs as the maps finish
  - 1.75: the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing
- Can be set to zero if no reduction is desired (in that case there is no sorting of the map outputs before writing them to the output file)
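The reducer rule of thumb above is simple arithmetic; a tiny helper makes the two choices concrete (the function name and parameters are illustrative):

```python
def suggested_reducers(nodes, max_containers_per_node, factor=0.95):
    # factor = 0.95: one wave of reducers, all launching immediately
    # factor = 1.75: two waves, with better load balancing across nodes
    return int(factor * nodes * max_containers_per_node)
```

For example, a 10-node cluster with 8 containers per node suggests 76 reducers for a single wave, or 140 for two waves.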

Some tips for performance
- To handle massive I/O and save bandwidth: compress the input data, the map output data, and the reduce output data
- To address massive I/O in the partition and sort phases: each map task has a circular memory buffer to which it writes its output; when the buffer is full, its contents are written ("spilled") to disk. Avoid spilling records more than once, by adjusting the spill-record and sort-buffer settings
- To address massive network traffic caused by large map output: compress the map output, and implement a combiner to reduce the amount of data passing through the shuffle

Hadoop in the Cloud
Pros:
- Gain Cloud scalability and elasticity
- No need to manage and provision the infrastructure and the platform
Main challenges:
- Moving data to the Cloud: latency is not zero (because of the speed of light!); a minor issue is network bandwidth
- Data security and privacy

Amazon Elastic MapReduce (EMR)
- Distributes the computational work across a cluster of virtual servers running in the AWS cloud (EC2 instances)
- Not only Hadoop: also Hive, Pig, Spark, and Presto (usually not the latest released version)
- Input and output: Amazon S3, DynamoDB, Elastic File System (EFS)
- Access through the AWS Management Console or the command line

Create an EMR cluster
Steps:
1. Provide the cluster name
2. Select the EMR release and the applications to install
3. Select the instance type (including spot instances)
4. Select the number of instances
5. Select the EC2 key pair to connect securely to the cluster

EMR cluster details
You can still tune some parameters for performance:
- Some EC2 parameters (e.g., the heap size used by Hadoop and YARN)
- Hadoop parameters (e.g., the memory for the map and reduce JVMs)

Google Cloud Dataproc
- Distributes the computational work across a cluster of virtual servers running in the Google Cloud Platform
- Not only Hadoop (plus Hive and Pig): also Spark
- Input and output from other Google services: Cloud Storage, Bigtable
- Access through the REST API, the Cloud SDK, and the Cloud Dataproc UI
- Fine-grained pay-per-use (per second)

Create a Cloud Dataproc cluster

References
- Dean and Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI 2004.
- White, "Hadoop: The Definitive Guide", 4th edition, O'Reilly, 2015.
- Miner and Shook, "MapReduce Design Patterns", O'Reilly, 2012.
- Leskovec, Rajaraman, and Ullman, "Mining of Massive Datasets", chapter 2.


More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

2. MapReduce Programming Model

2. MapReduce Programming Model Introduction MapReduce was proposed by Google in a research paper: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc.

Outline Introduction Big Data Sources of Big Data Tools HDFS Installation Configuration Starting & Stopping Map Reduc. D. Praveen Kumar Junior Research Fellow Department of Computer Science & Engineering Indian Institute of Technology (Indian School of Mines) Dhanbad, Jharkhand, India Head of IT & ITES, Skill Subsist Impels

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

NewSQL Databases. The reference Big Data stack

NewSQL Databases. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica NewSQL Databases Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak

Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak Cloud Programming on Java EE Platforms mgr inż. Piotr Nowak dsh distributed shell commands execution -c concurrent --show-machine-names -M --group cluster -g cluster /etc/dsh/groups/cluster needs passwordless

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

1. Introduction to MapReduce

1. Introduction to MapReduce Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Large-Scale GPU programming

Large-Scale GPU programming Large-Scale GPU programming Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Assistant Adjunct Professor Computer and Information Science Dept. University

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

September 2013 Alberto Abelló & Oscar Romero 1

September 2013 Alberto Abelló & Oscar Romero 1 duce-i duce-i September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate several use cases of duce 2. Describe what the duce environment is 3. Explain 6 benefits of using duce 4.

More information

CS-2510 COMPUTER OPERATING SYSTEMS

CS-2510 COMPUTER OPERATING SYSTEMS CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2017/18 Unit 16 J. Gamper 1/53 Advanced Data Management Technologies Unit 16 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data Fall 2018 6. Massive Parallel Processing (MapReduce) Let's begin with a field experiment 2 400+ Pokemons, 10 different 3 How many of each??????????? 4 400 distributed to many volunteers

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

L22: SC Report, Map Reduce

L22: SC Report, Map Reduce L22: SC Report, Map Reduce November 23, 2010 Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance Google version = Map Reduce; Hadoop = Open source

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2017 by Leonidas Fegaras Map-Reduce Fundamentals Based on: J. Simeon: Introduction to MapReduce P. Michiardi: Tutorial on MapReduce

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (2/2) January 12, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015 Distributed Systems 18. MapReduce Paul Krzyzanowski Rutgers University Fall 2015 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Credit Much of this information is from Google: Google Code University [no

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 9 MapReduce Prof. Li Jiang 2014/11/19 1 What is MapReduce Origin from Google, [OSDI 04] A simple programming model Functional model For large-scale

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Search Engines and Time Series Databases

Search Engines and Time Series Databases Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Search Engines and Time Series Databases Corso di Sistemi e Architetture per Big Data A.A. 2017/18

More information

FAQs. Topics. This Material is Built Based on, Analytics Process Model. 8/22/2018 Week 1-B Sangmi Lee Pallickara

FAQs. Topics. This Material is Built Based on, Analytics Process Model. 8/22/2018 Week 1-B Sangmi Lee Pallickara CS435 Introduction to Big Data Week 1-B W1.B.0 CS435 Introduction to Big Data No Cell-phones in the class. W1.B.1 FAQs PA0 has been posted If you need to use a laptop, please sit in the back row. August

More information