Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme

Size: px
Start display at page:

Download "Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme"

Transcription

1 Big Data Analytics 4. Map Reduce I Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego Drumond, ISMLL 1 / 32

2 Outline 1. Introduction 2. Parallel programming paradigms 1 / 32

3 1. Introduction Outline 1. Introduction 2. Parallel programming paradigms 1 / 32

4 1. Introduction Overview Part III Machine Learning Algorithms Part II Large Scale Computational Models Part I Distributed Database Distributed File System 1 / 32

5 1. Introduction Why do we need a Computational Model? Our data is nicely stored in a distributed infrastructure 2 / 32

6 1. Introduction Why do we need a Computational Model? Our data is nicely stored in a distributed infrastructure We have a number of computers at our disposal 2 / 32

7 1. Introduction Why do we need a Computational Model? Our data is nicely stored in a distributed infrastructure We have a number of computers at our disposal We want our analytics software to take advantage of all this computing power 2 / 32

8 1. Introduction Why do we need a Computational Model? Our data is nicely stored in a distributed infrastructure We have a number of computers at our disposal We want our analytics software to take advantage of all this computing power When programming we want to focus on understanding our data and not our infrastructure 2 / 32

9 1. Introduction Shared Memory Infrastructure Processor Processor Processor Memory 3 / 32

10 1. Introduction Distributed Infrastructure Network Processor Processor Processor Processor Memory Memory Memory Memory 4 / 32

11 2. Parallel programming paradigms Outline 1. Introduction 2. Parallel programming paradigms 5 / 32

12 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T 5 / 32

13 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Ideally: the more processors the faster a task is executed 5 / 32

14 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Ideally: the more processors the faster a task is executed Reality: synchronisation and communication costs 5 / 32

15 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Ideally: the more processors the faster a task is executed Reality: synchronisation and communication costs Speedup s(t, p) of a task T by using p processors: 5 / 32

16 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Ideally: the more processors the faster a task is executed Reality: synchronisation and communication costs Speedup s(t, p) of a task T by using p processors: Be t(t, p) the time needed to execute T using p processors 5 / 32

17 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Ideally: the more processors the faster a task is executed Reality: synchronisation and communication costs Speedup s(t, p) of a task T by using p processors: Be t(t, p) the time needed to execute T using p processors Speedup is given by: s(t, p) = t(t, 1) t(t, p) 5 / 32

18 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T 6 / 32

19 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Efficiency e(t, p) of a task T by using p processors: 6 / 32

20 2. Parallel programming paradigms Parallel Computing principles We have p processors available to execute a task T Efficiency e(t, p) of a task T by using p processors: e(t, p) = t(t, 1) p t(t, p) 6 / 32

21 2. Parallel programming paradigms Considerations It is not worth using a lot of processors for solving small problems 7 / 32

22 2. Parallel programming paradigms Considerations It is not worth using a lot of processors for solving small problems Algorithms should increase efficiency with problem size 7 / 32

23 2. Parallel programming paradigms Paradigms - Shared Memory All the processors have access to all the data D := {d 1,..., d n } 8 / 32

24 2. Parallel programming paradigms Paradigms - Shared Memory All the processors have access to all the data D := {d 1,..., d n } Pieces of data can be overwritten 8 / 32

25 2. Parallel programming paradigms Paradigms - Shared Memory All the processors have access to all the data D := {d 1,..., d n } Pieces of data can be overwritten Processors need to lock datapoints before using them 8 / 32

26 2. Parallel programming paradigms Paradigms - Shared Memory All the processors have access to all the data D := {d 1,..., d n } Pieces of data can be overwritten Processors need to lock datapoints before using them For each processor p: 1. lock(d i ) 2. process(d i ) 3. unlock(d i ) 8 / 32

27 2. Parallel programming paradigms Word Count Example Given a corpus of text documents D := {d 1,..., d n } 9 / 32

28 2. Parallel programming paradigms Word Count Example Given a corpus of text documents D := {d 1,..., d n } each containing a sequence of words: w 1,..., w m pooled from a set W of possible words. 9 / 32

29 2. Parallel programming paradigms Word Count Example Given a corpus of text documents D := {d 1,..., d n } each containing a sequence of words: w 1,..., w m pooled from a set W of possible words. the task is to generate word counts for each word in the corpus 9 / 32

30 2. Parallel programming paradigms Word Count - Shared Memory Shared vector for word counts: c R W c {0} W Each processor: 1. access a document d D 2. for each word w i in document d: 2.1 lock(c i ) 2.2 c i c i unlock(c i ) 10 / 32

31 2. Parallel programming paradigms Paradigms - Shared Memory Inefficient in a distributed scenario 11 / 32

32 2. Parallel programming paradigms Paradigms - Shared Memory Inefficient in a distributed scenario Results of a process can easily be overwritten 11 / 32

33 2. Parallel programming paradigms Paradigms - Shared Memory Inefficient in a distributed scenario Results of a process can easily be overwritten Possible long waiting times for a piece of data because of the lock mechanism 11 / 32

34 2. Parallel programming paradigms Paradigms - Message passing Each processor sees only one part of the data π(d, p) := {d p,..., d p+ n p 1 } 12 / 32

35 2. Parallel programming paradigms Paradigms - Message passing Each processor sees only one part of the data π(d, p) := {d p,..., d p+ n p 1 } Each processor works on its partition 12 / 32

36 2. Parallel programming paradigms Paradigms - Message passing Each processor sees only one part of the data π(d, p) := {d p,..., d p+ n p 1 } Each processor works on its partition Results are exchanged between processors (message passing) 12 / 32

37 2. Parallel programming paradigms Paradigms - Message passing Each processor sees only one part of the data π(d, p) := {d p,..., d p+ n p 1 } Each processor works on its partition Results are exchanged between processors (message passing) For each processor p: 1. For each d π(d, p) 1.1 process(d) 2. Communicate results 12 / 32

38 2. Parallel programming paradigms Word Count - Message passing We need to define two types of processes: 1. Slave - counts the words on a subset of documents and informs the master 2. Master - gathers counts from the slaves and sums them up 13 / 32

39 2. Parallel programming paradigms Word Count - Message passing Slave: Local memory: subset of documents: π(d, p) := {d p,..., d p+ n p 1 } address of the master: addr_master local word counts: c R W 1. c {0} W 2. for each document d π(d, p) for each word w i in document d: c i c i Send message send(addr_master, c) 14 / 32

40 2. Parallel programming paradigms Word Count - Message passing Master: Local memory: 1. Global word counts: c global R W 2. List of slaves: S c global {0} W s {0} S For each received message (p, c p ) 1. c global c global + c p 2. s p 1 3. if s 1 = S return c global 15 / 32

41 2. Parallel programming paradigms Paradigms - Message passing We need to manually assign master and slave roles for each processor 16 / 32

42 2. Parallel programming paradigms Paradigms - Message passing We need to manually assign master and slave roles for each processor Partition of the data needs to be done manually 16 / 32

43 2. Parallel programming paradigms Paradigms - Message passing We need to manually assign master and slave roles for each processor Partition of the data needs to be done manually Implementations like OpenMPI only provide services to exchange messages 16 / 32

44 Outline 1. Introduction 2. Parallel programming paradigms 17 / 32

45 Map-Reduce Builds on the distributed message passing paradigm 17 / 32

46 Map-Reduce Builds on the distributed message passing paradigm Considers the data is partitioned over the nodes 17 / 32

47 Map-Reduce Builds on the distributed message passing paradigm Considers the data is partitioned over the nodes Pipelined procedure: 17 / 32

48 Map-Reduce Builds on the distributed message passing paradigm Considers the data is partitioned over the nodes Pipelined procedure: 1. Map phase 17 / 32

49 Map-Reduce Builds on the distributed message passing paradigm Considers the data is partitioned over the nodes Pipelined procedure: 1. Map phase 2. Reduce phase 17 / 32

50 Map-Reduce Builds on the distributed message passing paradigm Considers the data is partitioned over the nodes Pipelined procedure: 1. Map phase 2. Reduce phase High level abstraction: programmer only specifies a map and a reduce routine 17 / 32

51 Map-Reduce No need to worry about how many processors are available No need to specify which ones will be mappers and which ones will be reducers 18 / 32

52 Key-Value input data Map-Reduce requires the data to be stored in a key-value format 19 / 32

53 Key-Value input data Map-Reduce requires the data to be stored in a key-value format Natural if one works with column databases 19 / 32

54 Key-Value input data Map-Reduce requires the data to be stored in a key-value format Natural if one works with column databases Examples: 19 / 32

55 Key-Value input data Map-Reduce requires the data to be stored in a key-value format Natural if one works with column databases Examples: Key document document user user user Value array of words word movies friends tweet 19 / 32

56 The Paradigm - Formally Given A set of input keys I A set of output keys O A set of input values X A set of intermediate values V A set of output values Y We can define: map : I X P(O V ) and reduce : O P(V ) O Y where P denotes the powerset 20 / 32

57 The Paradigm - Informally 1. Each mapper transforms some key-value pairs into a set of pairs of an output key and an intermediate value 2. all intermediate values are grouped according to their output keys 3. each reducer receives some pairs of a key and all its intermediate values 4. each reducer for each key aggregates all its intermediate values to one final value 21 / 32

58 Word Count Example Map: Input: document-word list pairs Output: word-count pairs Reduce: Input: word-(count list) pairs Output: word-count pairs (d k, w 1,..., w m) [(w i, c i )] (w i, [c i ]) (w i, c [c i ] c) 22 / 32

59 Word Count Example (d1, love ain't no stranger ) (d2, crying in the rain ) (love, 1) (stranger,1) (crying, 1) (rain, 1) (ain't, 1) (stranger,1) (love, 1) (love, 2) (love, 2) (crying, 1) (stranger,1) (love, 5) (crying, 2) (crying, 1) (d3, looking for love ) (love, 2) (d4, I'm crying ) (d5, the deeper the love ) (crying, 1) (looking, 1) (deeper, 1) (ain't, 1) (ain't, 2) (ain't, 1) (rain, 1) (rain, 1) (looking, 1) (d6, is this love ) (love, 2) (ain't, 1) (looking, 1) (deeper, 1) (deeper, 1) (this, 1) (d7, Ain't no love ) (this, 1) (this, 1) Mappers Reducers 23 / 32

60 Map 1 public static class Map 2 extends MapReduceBase 3 implements Mapper<LongWritable, Text, Text, IntWritable> { 4 private final static IntWritable one = new IntWritable(1); 5 private Text word = new Text(); 6 7 public void map(longwritable key, Text value, 8 OutputCollector<Text, IntWritable> output, 9 Reporter reporter) 0 throws IOException { 1 2 String line = value. tostring (); 3 StringTokenizer tokenizer = new StringTokenizer(line ); 4 5 while ( tokenizer.hasmoretokens()) { 6 word.set( tokenizer.nexttoken()); 7 output. collect (word, one); 8 } 9 } 0 } 24 / 32

61 Reduce 1 public static class Reduce 2 extends MapReduceBase 3 implements Reducer<Text, IntWritable, Text, IntWritable> { 4 5 public void reduce(text key, Iterator <IntWritable> values, 6 OutputCollector<Text, IntWritable> output, 7 Reporter reporter) 8 throws IOException { 9 0 int sum = 0; 1 while ( values.hasnext()) 2 sum += values.next().get(); 3 4 output. collect (key, new IntWritable(sum)); 5 } 6 } 25 / 32

62 Execution snippet 1 public static void main(string [] args) throws Exception { 2 JobConf conf = new JobConf(WordCount.class); 3 conf.setjobname("wordcount"); 4 5 conf.setoutputkeyclass(text.class); 6 conf.setoutputvalueclass(intwritable.class); 7 8 conf.setmapperclass(map.class); 9 conf.setcombinerclass(reduce.class); 0 conf.setreducerclass(reduce.class); 1 2 conf.setinputformat(textinputformat.class); 3 conf.setoutputformat(textoutputformat.class); 4 5 FileInputFormat.setInputPaths(conf, new Path(args[0])); 6 FileOutputFormat.setOutputPath(conf, new Path(args[1])); 7 8 JobClient.runJob(conf); 9 } 0 } 26 / 32

63 Considerations Maps are executed in parallel Reduces are executed in parallel Bottleneck: Reducers can only execute after all the mappers are finished 27 / 32

64 Fault tolerance When the master node detects node failures: Re-executes completed and in-progress map() Re-executes in-progress reduce tasks 28 / 32

65 Fault tolerance When the master node detects node failures: Re-executes completed and in-progress map() Re-executes in-progress reduce tasks When the master node detects particular key-value pairs that cause mappers to crash: Problematic pairs are skipped in the execution 28 / 32

66 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations 29 / 32

67 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd 29 / 32

68 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd Time for producing intermediate data σd after the map phase: t(t inter, 1) = σd 29 / 32

69 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd Time for producing intermediate data σd after the map phase: t(t inter, 1) = σd Overheads: 29 / 32

70 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd Time for producing intermediate data σd after the map phase: t(t inter, 1) = σd Overheads: intermediate data per mapper: σd p 29 / 32

71 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd Time for producing intermediate data σd after the map phase: t(t inter, 1) = σd Overheads: intermediate data per mapper: σd p each of the p reducers needs to read one p-th of the data written by each of the p mappers: σd p 1 p p = σd p 29 / 32

72 Parallel Efficiency of Map-Reduce We have p processors for performing map and reduce operations Time to perform a task T on data D: t(t, 1) = wd Time for producing intermediate data σd after the map phase: t(t inter, 1) = σd Overheads: intermediate data per mapper: σd p each of the p reducers needs to read one p-th of the data written by each of the p mappers: σd p 1 p p = σd p Time for performing the task with Map-reduce: t MR (T, p) = wd p + 2K σd p K - constant for representing the overhead of IO operations (reading and writing data to disk) 29 / 32

73 Parallel Efficiency of Map-Reduce Time for performing the task in one processor: wd 30 / 32

74 Parallel Efficiency of Map-Reduce Time for performing the task in one processor: wd Time for performing the task with p processors on Map-reduce: t MR (T, p) = wd p + 2K σd p 30 / 32

75 Parallel Efficiency of Map-Reduce Time for performing the task in one processor: wd Time for performing the task with p processors on Map-reduce: t MR (T, p) = wd p + 2K σd p Efficiency equation: e(t, p) = t(t, 1) p t(t, p) 30 / 32

76 Parallel Efficiency of Map-Reduce Time for performing the task in one processor: wd Time for performing the task with p processors on Map-reduce: t MR (T, p) = wd p + 2K σd p Efficiency equation: Efficiency of Map-Reduce: e(t, p) = t(t, 1) p t(t, p) e MR (T, p) = wd p( wd p + 2K σd p ) 30 / 32

77 Parallel Efficiency of Map-Reduce e MR (T, p) = wd p( wd p + 2K σd p ) 31 / 32

78 Parallel Efficiency of Map-Reduce e MR (T, p) = wd p( wd p + 2K σd p ) = wd wd + 2KσD 31 / 32

79 Parallel Efficiency of Map-Reduce e MR (T, p) = wd p( wd p + 2K σd p ) = wd wd + 2KσD = wd 1 wd wd 1 wd + 2KσD 1 wd 31 / 32

80 Parallel Efficiency of Map-Reduce e MR (T, p) = wd p( wd p + 2K σd p ) = wd wd + 2KσD = wd 1 wd wd 1 wd + 2KσD 1 wd 1 = 1 + 2K σ w 31 / 32

81 Parallel Efficiency of Map-Reduce e MR (T, p) = K σ w 32 / 32

82 Parallel Efficiency of Map-Reduce 1 e MR (T, p) = 1 + 2K σ w Apparently the efficiency is independent of p 32 / 32

83 Parallel Efficiency of Map-Reduce 1 e MR (T, p) = 1 + 2K σ w Apparently the efficiency is independent of p High speedups can be achieved with large number of processors 32 / 32

84 Parallel Efficiency of Map-Reduce 1 e MR (T, p) = 1 + 2K σ w Apparently the efficiency is independent of p High speedups can be achieved with large number of processors If σ is high (too much intermediate data) the efficiency deteriorates 32 / 32

85 Parallel Efficiency of Map-Reduce 1 e MR (T, p) = 1 + 2K σ w Apparently the efficiency is independent of p High speedups can be achieved with large number of processors If σ is high (too much intermediate data) the efficiency deteriorates In many cases σ depends on p 32 / 32

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

EE657 Spring 2012 HW#4 Zhou Zhao

EE657 Spring 2012 HW#4 Zhou Zhao EE657 Spring 2012 HW#4 Zhou Zhao Problem 6.3 Solution Referencing the sample application of SimpleDB in Amazon Java SDK, a simple domain which includes 5 items is prepared in the code. For instance, the

More information

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 9 Map Reduce 2010-2011 Introduction Up until now Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling 1 Outline Map Reduce:

More information

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed databases Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

PARLab Parallel Boot Camp

PARLab Parallel Boot Camp PARLab Parallel Boot Camp Cloud Computing with MapReduce and Hadoop Matei Zaharia Electrical Engineering and Computer Sciences University of California, Berkeley What is Cloud Computing? Cloud refers to

More information

Lab 11 Hadoop MapReduce (2)

Lab 11 Hadoop MapReduce (2) Lab 11 Hadoop MapReduce (2) 1 Giới thiệu Để thiết lập một Hadoop cluster, SV chọn ít nhất là 4 máy tính. Mỗi máy có vai trò như sau: - 1 máy làm NameNode: dùng để quản lý không gian tên (namespace) và

More information

Teaching Map-reduce Parallel Computing in CS1

Teaching Map-reduce Parallel Computing in CS1 Teaching Map-reduce Parallel Computing in CS1 Richard Brown, Patrick Garrity, Timothy Yates Mathematics, Statistics, and Computer Science St. Olaf College Northfield, MN rab@stolaf.edu Elizabeth Shoop

More information

Map- reduce programming paradigm

Map- reduce programming paradigm Map- reduce programming paradigm Some slides are from lecture of Matei Zaharia, and distributed computing seminar by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. In pioneer days they

More information

Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am

Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am 9:00am Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

Internet Measurement and Data Analysis (13)

Internet Measurement and Data Analysis (13) Internet Measurement and Data Analysis (13) Kenjiro Cho 2016-07-11 review of previous class Class 12 Search and Ranking (7/4) Search systems PageRank exercise: PageRank algorithm 2 / 64 today s topics

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L12: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L3b: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop Vassilis Christophides christop@csd.uoc.gr http://www.csd.uoc.gr/~hy562 University of Crete 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming

More information

Map Reduce & Hadoop. Lecture BigData Analytics. Julian M. Kunkel.

Map Reduce & Hadoop. Lecture BigData Analytics. Julian M. Kunkel. Map Reduce & Hadoop Lecture BigData Analytics Julian M. Kunkel julian.kunkel@googlemail.com University of Hamburg / German Climate Computing Center (DKRZ) 2017-11-10 Disclaimer: Big Data software is constantly

More information

Introduction to Hadoop

Introduction to Hadoop Hadoop 1 Why Hadoop Drivers: 500M+ unique users per month Billions of interesting events per day Data analysis is key Need massive scalability PB s of storage, millions of files, 1000 s of nodes Need cost

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

CS 525 Advanced Distributed Systems Spring 2018

CS 525 Advanced Distributed Systems Spring 2018 CS 525 Advanced Distributed Systems Spring 2018 Indranil Gupta (Indy) Lecture 3 Cloud Computing (Contd.) January 24, 2018 All slides IG 1 What is MapReduce? Terms are borrowed from Functional Language

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Computing as a Utility. Cloud Computing. Why? Good for...

Computing as a Utility. Cloud Computing. Why? Good for... Computing as a Utility Cloud Computing Leonidas Fegaras University of Texas at Arlington Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing

More information

3. Big Data Processing

3. Big Data Processing 3. Big Data Processing Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC Fall - 2013 Jordi Torres, UPC - BSC www.jorditorres.eu Slides are only for presentation guide We will discuss+debate

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop V. CHRISTOPHIDES University of Crete & INRIA Paris 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming model and associated implementation for

More information

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data!

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data! Hadoop 1 Data Deluge Billions of users connected through the net WWW, FB, twitter, cell phones, 80% of the data on FB was produced last year Storage getting cheaper Store more data! Why Hadoop Drivers:

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

Interfaces 3. Reynold Xin Aug 22, Databricks Retreat. Repurposed Jan 27, 2015 for Spark community

Interfaces 3. Reynold Xin Aug 22, Databricks Retreat. Repurposed Jan 27, 2015 for Spark community Interfaces 3 Reynold Xin Aug 22, 2014 @ Databricks Retreat Repurposed Jan 27, 2015 for Spark community Spark s two improvements over Hadoop MR Performance: 100X faster than Hadoop MR Programming model:

More information

Hadoop Map-Reduce Tutorial

Hadoop Map-Reduce Tutorial Table of contents 1 Purpose...2 2 Pre-requisites...2 3 Overview...2 4 Inputs and Outputs... 3 5 Example: WordCount v1.0... 3 5.1 Source Code...3 5.2 Usage... 6 5.3 Walk-through...7 6 Map-Reduce - User

More information

September 2013 Alberto Abelló & Oscar Romero 1

September 2013 Alberto Abelló & Oscar Romero 1 duce-i duce-i September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate several use cases of duce 2. Describe what the duce environment is 3. Explain 6 benefits of using duce 4.

More information

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838. COMP4442 Service and Cloud Computing Lab 12: MapReduce www.comp.polyu.edu.hk/~csgeorge/comp4442 Prof. George Baciu csgeorge@comp.polyu.edu.hk PQ838 1 Contents Introduction to MapReduce A WordCount example

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Apache Spark Feb. 2, 2016 1 / 67 Big Data small data big data Amir H. Payberah (SICS) Apache Spark

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve

More information

BigData and MapReduce with Hadoop

BigData and MapReduce with Hadoop BigData and MapReduce with Hadoop Ivan Tomašić 1, Roman Trobec 1, Aleksandra Rashkovska 1, Matjaž Depolli 1, Peter Mežnar 2, Andrej Lipej 2 1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana 2 TURBOINŠTITUT

More information

Map-Reduce for Parallel Computing

Map-Reduce for Parallel Computing Map-Reduce for Parallel Computing Amit Jain Department of Computer Science College of Engineering Boise State University Big Data, Big Disks, Cheap Computers In pioneer days they used oxen for heavy pulling,

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

ECE 587 Hardware/Software Co-Design Lecture 09 Concurrency in Practice Message Passing

ECE 587 Hardware/Software Co-Design Lecture 09 Concurrency in Practice Message Passing ECE 587 Hardware/Software Co-Design Spring 2018 1/14 ECE 587 Hardware/Software Co-Design Lecture 09 Concurrency in Practice Message Passing Professor Jia Wang Department of Electrical and Computer Engineering

More information

Using the Cloud to Crunch Your Data. Adrian Cockcro,

Using the Cloud to Crunch Your Data. Adrian Cockcro, Using the Cloud to Crunch Your Data Adrian Cockcro, acockcro,@ne0lix.com What is Cloud Compu;ng? What is Capacity Planning We care about CPU, Memory, Network and Disk resources, and Applica;on response

More information

Hadoop Map/Reduce Tutorial

Hadoop Map/Reduce Tutorial Table of contents 1 Purpose...2 2 Pre-requisites...2 3 Overview...2 4 Inputs and Outputs... 3 5 Example: WordCount v1.0... 3 5.1 Source Code...3 5.2 Usage... 6 5.3 Walk-through...7 6 Map/Reduce - User

More information

Semantics with Failures

Semantics with Failures Semantics with Failures If map and reduce are deterministic, then output identical to non-faulting sequential execution For non-deterministic operators, different reduce tasks might see output of different

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

MRUnit testing framework is based on JUnit and it can test Map Reduce programs written on 0.20, 0.23.x, 1.0.x, 2.x version of Hadoop.

MRUnit testing framework is based on JUnit and it can test Map Reduce programs written on 0.20, 0.23.x, 1.0.x, 2.x version of Hadoop. MRUnit Tutorial Setup development environment 1. Download the latest version of MRUnit jar from Apache website: https://repository.apache.org/content/repositories/releases/org/apache/ mrunit/mrunit/. For

More information

Hadoop 3.X more examples

Hadoop 3.X more examples Hadoop 3.X more examples Big Data - 09/04/2018 Let s start with some examples! http://www.dia.uniroma3.it/~dvr/es2_material.zip Example: LastFM Listeners per Track Consider the following log file UserId

More information

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1 Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources

More information

Hadoop 2.X on a cluster environment

Hadoop 2.X on a cluster environment Hadoop 2.X on a cluster environment Big Data - 05/04/2017 Hadoop 2 on AMAZON Hadoop 2 on AMAZON Hadoop 2 on AMAZON Regions Hadoop 2 on AMAZON S3 and buckets Hadoop 2 on AMAZON S3 and buckets Hadoop 2 on

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming)

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming) Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 16 Big Data Management VI (MapReduce Programming) Credits: Pietro Michiardi (Eurecom): Scalable Algorithm

More information

Hadoop 2.8 Configuration and First Examples

Hadoop 2.8 Configuration and First Examples Hadoop 2.8 Configuration and First Examples Big Data - 29/03/2017 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of

More information

Big Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme

Big Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme Big Data Analytics C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer

More information

FAQs. Topics. This Material is Built Based on, Analytics Process Model. 8/22/2018 Week 1-B Sangmi Lee Pallickara

FAQs. Topics. This Material is Built Based on, Analytics Process Model. 8/22/2018 Week 1-B Sangmi Lee Pallickara CS435 Introduction to Big Data Week 1-B W1.B.0 CS435 Introduction to Big Data No Cell-phones in the class. W1.B.1 FAQs PA0 has been posted If you need to use a laptop, please sit in the back row. August

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Java in MapReduce. Scope

Java in MapReduce. Scope Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on

More information

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google

More information

ILO3:Algorithms and Programming Patterns for Cloud Applications (Hadoop)

ILO3:Algorithms and Programming Patterns for Cloud Applications (Hadoop) DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Indian Institute of Science, Bangalore DREAM:Lab SE252:Lecture 13-14, Feb 24/25 ILO3:Algorithms and Programming Patterns for Cloud

More information

Hadoop 3 Configuration and First Examples

Hadoop 3 Configuration and First Examples Hadoop 3 Configuration and First Examples Big Data - 26/03/2018 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of companies

More information

W1.A.0 W2.A.0 1/22/2018 1/22/2018. CS435 Introduction to Big Data. FAQs. Readings

W1.A.0 W2.A.0 1/22/2018 1/22/2018. CS435 Introduction to Big Data. FAQs. Readings CS435 Introduction to Big Data 1/17/2018 W2.A.0 W1.A.0 CS435 Introduction to Big Data W2.A.1.A.1 FAQs PA0 has been posted Feb. 6, 5:00PM via Canvas Individual submission (No team submission) Accommodation

More information

Big Data Overview. Nenad Jukic Loyola University Chicago. Abhishek Sharma Awishkar, Inc. & Loyola University Chicago

Big Data Overview. Nenad Jukic Loyola University Chicago. Abhishek Sharma Awishkar, Inc. & Loyola University Chicago Big Data Overview Nenad Jukic Loyola University Chicago Abhishek Sharma Awishkar, Inc. & Loyola University Chicago Introduction Three Types of Data stored in Corporations and Organizations Transactional

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case 1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems

More information

Topics covered in this lecture

Topics covered in this lecture 9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.0 CS435 Introduction to Big Data 9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.1 FAQs How does Hadoop mapreduce run the map instance?

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany GraphLab GraphLab 1 / 34 Outline 1. Introduction

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Big Data: Architectures and Data Analytics

Big Data: Architectures and Data Analytics Big Data: Architectures and Data Analytics July 14, 2017 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Database System Architectures Parallel DBs, MapReduce, ColumnStores Database System Architectures Parallel DBs, MapReduce, ColumnStores CMPSCI 445 Fall 2010 Some slides courtesy of Yanlei Diao, Christophe Bisciglia, Aaron Kimball, & Sierra Michels- Slettvet Motivation:

More information

Computer Science 572 Exam Prof. Horowitz Monday, November 27, 2017, 8:00am 9:00am

Computer Science 572 Exam Prof. Horowitz Monday, November 27, 2017, 8:00am 9:00am Computer Science 572 Exam Prof. Horowitz Monday, November 27, 2017, 8:00am 9:00am Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

Experiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018

Experiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018 Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018 abstract In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However,

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

Enter the Elephant. Massively Parallel Computing With Hadoop. Toby DiPasquale Chief Architect Invite Media, Inc.

Enter the Elephant. Massively Parallel Computing With Hadoop. Toby DiPasquale Chief Architect Invite Media, Inc. Enter the Elephant Massively Parallel Computing With Hadoop Toby DiPasquale Chief Architect Invite Media, Inc. Philadelphia Emerging Technologies for the Enterprise March 26, 2008 Image credit, http,//www.depaulca.org/images/blog_1125071.jpg

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Apache Spark. Easy and Fast Big Data Analytics Pat McDonough

Apache Spark. Easy and Fast Big Data Analytics Pat McDonough Apache Spark Easy and Fast Big Data Analytics Pat McDonough Founded by the creators of Apache Spark out of UC Berkeley s AMPLab Fully committed to 100% open source Apache Spark Support and Grow the Spark

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

I N F O R M A T I O N R E T R I E V A L A SSIGNMENT 6 P R O T O T Y P E O F A B I L I N G U A L S EARCH E NGINE. Architecture

I N F O R M A T I O N R E T R I E V A L A SSIGNMENT 6 P R O T O T Y P E O F A B I L I N G U A L S EARCH E NGINE. Architecture Even Cheng 41123717 Karthik Raman - 95687560 CS221-IR March 18, 2010 I N F O R M A T I O N R E T R I E V A L A SSIGNMENT 6 P R O T O T Y P E O F A B I L I N G U A L S EARCH E NGINE Suppose you are bilingual.

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 7: Parallel Computing Cho-Jui Hsieh UC Davis May 3, 2018 Outline Multi-core computing, distributed computing Multi-core computing tools

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

Chapter 3. Distributed Algorithms based on MapReduce

Chapter 3. Distributed Algorithms based on MapReduce Chapter 3 Distributed Algorithms based on MapReduce 1 Acknowledgements Hadoop: The Definitive Guide. Tome White. O Reilly. Hadoop in Action. Chuck Lam, Manning Publications. MapReduce: Simplified Data

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

1/30/2019 Week 2- B Sangmi Lee Pallickara

1/30/2019 Week 2- B Sangmi Lee Pallickara Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable

More information

Parallel Computing. Prof. Marco Bertini

Parallel Computing. Prof. Marco Bertini Parallel Computing Prof. Marco Bertini Apache Hadoop Chaining jobs Chaining MapReduce jobs Many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Local MapReduce debugging

Local MapReduce debugging Local MapReduce debugging Tools, tips, and tricks Aaron Kimball Cloudera Inc. July 21, 2009 urce: Wikipedia Japanese rock garden Common sense debugging tips Build incrementally Build compositionally Use

More information

Using Numerical Libraries on Spark

Using Numerical Libraries on Spark Using Numerical Libraries on Spark Brian Spector London Spark Users Meetup August 18 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm with

More information

Map-Reduce Applications: Counting, Graph Shortest Paths

Map-Reduce Applications: Counting, Graph Shortest Paths Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

MapReduce. Arend Hintze

MapReduce. Arend Hintze MapReduce Arend Hintze Distributed Word Count Example Input data files cat * key-value pairs (0, This is a cat!) (14, cat is ok) (24, walk the dog) Mapper map() function key-value pairs (this, 1) (is,

More information

MapReduce Algorithms

MapReduce Algorithms Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 2/5/2018 Week 4-A Sangmi Lee Pallickara. FAQs. Total Order Sorting Pattern

CS435 Introduction to Big Data Spring 2018 Colorado State University. 2/5/2018 Week 4-A Sangmi Lee Pallickara. FAQs. Total Order Sorting Pattern W4.A.0.0 CS435 Introduction to Big Data W4.A.1 FAQs PA0 submission is open Feb. 6, 5:00PM via Canvas Individual submission (No team submission) If you have not been assigned the port range, please contact

More information

Big Data: Tremendous challenges, great solutions

Big Data: Tremendous challenges, great solutions Big Data: Tremendous challenges, great solutions Luc Bougé ENS Rennes Alexandru Costan INSA Rennes Gabriel Antoniu INRIA Rennes Survive the data deluge! Équipe KerData 1 Big Data? 2 Big Picture The digital

More information