Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)


1 Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce)

2 So far, we have... Storage as file system (HDFS) 13

3 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14

4 Data is only useful if we can query it Querying Storage as tables (HBase) Storage as file system (HDFS) 15

5 ... in parallel Querying Storage as tables (HBase) Storage as file system (HDFS) 16

6 Data Processing Input data 17

7 Data Processing Input data Query 18

8 Data Processing Input data Query Output data 19

9 MapReduce 20

10 Data Processing: data comes in chunks Query 21

11 Data Processing: the ideal case Query Query Query Query Query Query Query Query 22

12 Data Processing: the worst case 23

13 Data Processing: the typical case 24

14 Data Processing: Map here... 25

15 Data Processing:... and shuffle there 26

16 A common and useful sub-case: MapReduce Input data 27

17 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map 28

18 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle 29

19 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 30

20 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 31

21 Data Processing: Data Model Input data Map Map Map Map Map Map Map Map Intermediate data (shuffled) Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 32

22 Data Processing: Data Shape Key-value pairs Map Map Map Map Map Map Map Map Key-value pairs Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Key-value pairs 33

23 Data Processing: Data Types key type 1 -> value type 1 Map Map Map Map Map Map Map Map key type I -> value type I Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 34

24 Data Processing: Most often key type 1 -> value type 1 Map Map Map Map Map Map Map Map key type A -> value type A Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 35

25 Splitting 36

26 Splitting Split 37

27 Splitting key 1 Split key 2 key 3 key 4 38

28 Mapping function key 1 39

29 Mapping function key 1 Map 40

30 Mapping function key 1 Map key I key II 41

31 Mapping function... in parallel key 1 Map key I key II 42

32 Mapping function... in parallel key 1 Map key I key II key 2 Map key I key III 43

33 Mapping function... in parallel key 1 Map key I key II key 2 Map key I key III key 3 Map key II key III 44

34 Put it all together key I key II key I key III key II key III 45

35 Put it all together key I key II key I key III key II key III 46

36 Put it all together key I key II key I key II key I key I key III key III key II key II key III key III 47

37 Sort by key key I key II key I key III key III key II key I 48

38 Sort by key key I key I key II key I key I key I key III key II key III key II key II key III key I key III 49

39 Partition key I key I key I key II key II key III key III 50

40 Partition key I key I key I key II key II key III key III 51

41 Partition key I key I key I key I key I key I key II key II key II key II key III key III key III key III 52
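One way to picture partitioning: every key is assigned to exactly one reduce task, so all pairs that share a key end up in the same place. The following is a minimal sketch, assuming hash-based assignment along the lines of Hadoop's default HashPartitioner; the class name PartitionSketch and the method partitionFor are hypothetical and used only for illustration.

```java
final class PartitionSketch {
    // Assumption: keys are plain strings and partitions are numbered 0..numReduceTasks-1.
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key, two pairs with the same key always land in the same partition, whichever mapper produced them.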

42 Reduce function key I key I key I 53

43 Reduce function key I key I key I Reduce 54

44 Reduce function key I key I key I Reduce key A 55

45 Reduce function (with identical key sets) key A key A key A A B C Reduce key A 56

46 Reduce function (most generic) key I key I key I Reduce key A ( key B ) More is fine, but uncommon 57

47 Reduce function... in parallel key I key I key I Reduce key A 58

48 Reduce function... in parallel key I key I key I Reduce key A key II key II Reduce key B 59

49 Reduce function... in parallel key I key I key I Reduce key A key II key II Reduce key B key III key III Reduce key C 60

50 Overall 61

51 Overall Map 62

52 Overall Map 63

53 Overall Map Sort 64

54 Overall Map Sort 65

55 Overall Map Sort Partition 66

56 Overall Map Sort Partition 67

57 Overall Map Sort Partition Reduce 68

58 Overall Map Sort Partition Reduce 69
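The stages above (map, sort, partition, reduce) can be sketched in plain Java without Hadoop. This is only an in-memory illustration under assumptions: the class MiniMapReduce and its method names are hypothetical, word count is picked as the example workload, and sorting and partitioning are merged into one grouping step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // Map: one input record (a line of text) yields zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sort and partition: group the intermediate values by key, keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce: all values of one key collapse into one output value (here, their sum).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps sequentially; a real cluster runs map and reduce tasks in parallel.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }
}
```

Calling MiniMapReduce.run on the two lines "a b a" and "b c" gives a count of 2 for a, 2 for b and 1 for c.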

59 Input/Output formats 70

60 Input and output formats 71

61 Input and output formats From/to tables 72

62 Input and output formats From/to tables From/to files 73

63 Formats: tabular 74

64 Formats: tabular RDBMS 75

65 Formats: tabular RDBMS Row ID A1 1E0 22A 4A2 HBase 76

66 Formats: tabular 77

67 Formats: tabular 78

68 Formats: files (e.g., from HDFS) 79

69 Formats: files (e.g., from HDFS) Text 80

70 Formats: files (e.g., from HDFS) Text KeyValue 81

71 Formats: files (e.g., from HDFS) Text KeyValue SequenceFile 82

72 Text files Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 83

73 Text files Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed 84

74 Text files: NLine Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 85

75 Text files: NLine Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed 86

76 Key-Value Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 87

77 Key-Value Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... Lorem sit consectetur adipiscing... ipsum dolor amet, elit, sed 88
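How a line of a key-value text file turns into a key and a value can be sketched as follows. This assumes a tab as the separator, which is the default of Hadoop's KeyValueTextInputFormat; the class KeyValueLine and its split method are hypothetical names for illustration.

```java
final class KeyValueLine {
    // Splits a line at the first occurrence of the separator.
    // Assumption: a line without a separator becomes a key with an empty value.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

Only the first separator counts, so the value itself may contain further tabs.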

78 Sequence files Hadoop binary format Stores generic key-value pairs 89

79 Sequence files Hadoop binary format Stores generic key-value pairs KeyLength Key ValueLength Value 90

80 Optimization 91

81 Optimization key type 1 -> value type 1 Mapper Map Map Map Map Map Map key type A -> value type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 92

82 Optimization How to reduce* the amount of data shuffled around? (*pun intended, as a mnemonic; German: Eselsbrücke) key type 1 -> value type 1 Mapper Map Map Map Map Map Map key type A -> value type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 93

83 Optimization: Combine key type 1 -> value type 1 Mapper Map Map Map Map Map Map key type A -> value type A Combine Combine Combine Combine Combine Combine key type A -> value type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 94

84 Combine: the 90% case 95

85 Combine: the 90% case Often, the combine function is identical to the reduce function. Combine Reduce Disclaimer: there are assumptions 96

86 Combine=Reduce: Assumption 1 Key and value types must be identical for reduce input and output. key type A -> value type A Reduce Reduce Reduce Reduce Reduce Reduce key type A -> value type A 97

87 Combine=Reduce: Assumption 2 98

88 Combine=Reduce: Assumption 2 The reduce function must be commutative: for key A, reducing values A, B gives the same result as reducing B, A. 99

89 Combine=Reduce: Assumption 2 The reduce function must be commutative (values A, B for key A, in either order) and associative (values A, B, C for key A, reduced in any grouping). 100
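The assumptions can be checked on a small example. In the sketch below (plain Java, hypothetical class CombineCheck), summing is commutative and associative and keeps the same type for input and output, so it can serve as a combiner; the arithmetic mean is not associative, so reusing it as a combiner would change the result.

```java
import java.util.List;

final class CombineCheck {
    // Sum: commutative and associative, same type in and out. Safe to reuse as a combiner.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Mean: a mean of partial means is generally not the overall mean. Unsafe as a combiner.
    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }
}
```

With the values 1 to 5 split into the groups (1, 2) and (3, 4, 5), the sum of the partial sums equals the overall sum, whereas the mean of the partial means (1.5 and 4.0) differs from the overall mean of 3.0.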

90 Optimization: Bring the Query to the Data Query Data 101

91 MapReduce: the APIs 102

92 Supported frameworks Hadoop MapReduce 103

93 Supported frameworks Hadoop MapReduce Java Streaming 104

94 Supported frameworks Hadoop MapReduce Java Streaming 105

95 Java API: Mapper

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Mapper;

    // K1, V1, K2, V2 stand for concrete key and value types.
    public class MyOwnMapper extends Mapper<K1, V1, K2, V2> {
        @Override
        public void map(K1 key, V1 value, Context context)
                throws IOException, InterruptedException {
            ...
            K2 newKey = ...
            V2 newValue = ...
            // Emit one intermediate key-value pair.
            context.write(newKey, newValue);
            ...
        }
    }

98 Java API: Reducer

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Reducer;

    // K2, V2, K3, V3 stand for concrete key and value types.
    public class MyOwnReducer extends Reducer<K2, V2, K3, V3> {
        @Override
        public void reduce(K2 key, Iterable<V2> values, Context context)
                throws IOException, InterruptedException {
            ...
            K3 newKey = ...
            V3 newValue = ...
            // Emit one output key-value pair.
            context.write(newKey, newValue);
            ...
        }
    }

101 Java API: Job

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyMapReduceJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            // Register the user-defined map and reduce classes.
            job.setMapperClass(MyOwnMapper.class);
            job.setReducerClass(MyOwnReducer.class);
            // Input and output locations.
            FileInputFormat.addInputPath(job, ...);
            FileOutputFormat.setOutputPath(job, ...);
            // Submit the job and wait: exit code 0 on success, 1 on failure.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

106 Java API: Combiner (=Reducer)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyMapReduceJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setMapperClass(MyOwnMapper.class);
            // The reducer class doubles as the combiner.
            job.setCombinerClass(MyOwnReducer.class);
            job.setReducerClass(MyOwnReducer.class);
            FileInputFormat.addInputPath(job, ...);
            FileOutputFormat.setOutputPath(job, ...);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

107 Java API: InputFormat classes

    InputFormat
        DBInputFormat (RDBMS)
        TableInputFormat (HBase)
        FileInputFormat
            KeyValueTextInputFormat (key-value file)
            SequenceFileInputFormat (sequence file)
            TextInputFormat (text)
            FixedLengthInputFormat
            NLineInputFormat

116 Java API: OutputFormat classes OutputFormat DBOutputFormat RDBMS TableOutputFormat HBase FileOutputFormat SequenceFileOutputFormat TextOutputFormat Text MapFileOutputFormat Sequence file 127

117 MapReduce: the physical layer 128

118 Possible storage layers Hadoop MapReduce 129

119 Possible storage layers Hadoop MapReduce Local Filesystem HDFS S3 Azure Blob Storage 130

120 Possible storage layers Hadoop MapReduce Local Filesystem HDFS S3 Azure Blob Storage 131

121 Hadoop MapReduce: Numbers Several TBs of data Data 132

122 Hadoop MapReduce: Numbers Several TBs of data Data MapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMap Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map M 1000s of nodes 133

123 Hadoop infrastructure (version 1) Namenode Datanode Datanode Datanode Datanode Datanode Datanode 134

124 Master-slave architecture Master Slave Slave Slave Slave Slave Slave 135

125 Hadoop infrastructure (version 1) Namenode + JobTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 136

126 Hadoop infrastructure (version 1) Namenode + JobTracker Bring the Query to the Data Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 137

127 Tasks Task = or 138

128 Splits Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 139

129 Splits Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 140

130 Splits vs. map tasks Split 141

131 Splits vs. map tasks 1 split = 1 map task Split M M M M M 142

132 In practice M M M M 1 split Split 143

133 In practice M M M M 1 split = 1 block (subject to min and max size) Split Block 144
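The rule "1 split = 1 block, subject to min and max size" can be written as a small formula. The sketch below follows the clamping used by Hadoop's FileInputFormat, as far as I know; the class SplitSize and the helper numberOfSplits are hypothetical, and the helper is a simplification that ignores the small slack Hadoop allows on the last split.

```java
final class SplitSize {
    // The split size is the block size, clamped into the range [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified estimate of the number of splits (and thus map tasks) for one file:
    // the file length divided by the split size, rounded up.
    static long numberOfSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }
}
```

With a 128 MB block size and default bounds, the split size equals the block size; raising the minimum above the block size makes splits span several blocks, and lowering the maximum below it cuts a block into several splits.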

134 Splits vs. blocks: possible confusion Logical Level (MapReduce) Split Physical Level (HDFS) Block 145

135 Splits vs. blocks: possible confusion Logical Level (MapReduce) Split Record (key/value pair) Bit Physical Level (HDFS) Block 146

136 Records across blocks Logical Level (MapReduce) Split Physical Level (HDFS) Block 147

137 Records across blocks Logical Level (MapReduce) Split Remote read Physical Level (HDFS) Block 148

138 Fine-tuning to adjust splits to blocks Logical Level (MapReduce) Split Physical Level (HDFS) Block 149

139 Hadoop infrastructure (version 1) Namenode + JobTracker /dir/file Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 150

140 Hadoop infrastructure: map tasks Namenode + JobTracker As many map tasks as splits /dir/file M Datanode + TaskTracker M Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker 151

141 Hadoop infrastructure: map tasks As many map tasks as splits Namenode + JobTracker /dir/file Occasionally not possible to co-locate task and block M Datanode + TaskTracker M Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker 152

142 Hadoop infrastructure: reduce tasks A few reduce tasks Namenode + JobTracker /dir/file R R Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 153

143 Hadoop infrastructure: shuffling (inbetween) M R Namenode + JobTracker /dir/file M Datanode + TaskTracker R M Datanode + TaskTracker R M Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker 154

144 Shuffling phase Reducer Mappers 155

145 Shuffling phase Reducer Mappers Each mapper sorts its output key-value pairs 156

146 Spilling to disk Key-value pairs are spilled to disk if necessary 157

147 Shuffling phase Reducer Gets its key-value pairs over HTTP Mappers 158

148 Issue 1: Tight coupling Namenode + JobTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 159

149 Issue 2: Scalability Namenode + JobTracker Only one! Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 160


Steps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/ SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

The core source code of the edge detection of the Otsu-Canny operator in the Hadoop

The core source code of the edge detection of the Otsu-Canny operator in the Hadoop Attachment: The core source code of the edge detection of the Otsu-Canny operator in the Hadoop platform (ImageCanny.java) //Map task is as follows. package bishe; import java.io.ioexception; import org.apache.hadoop.fs.path;

More information
