Introduction to HDFS and MapReduce

1 Introduction to HDFS and MapReduce

2 Who Am I
- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.

4 Think Big is the leading professional services firm that's purpose-built for Big Data.
- One of Silicon Valley's fastest-growing Big Data start-ups
- 100% focus on Big Data consulting & Data Science solution services
- Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture
- C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999
- Clients: 40+
- North America locations:
  - US East: Boston, New York, Washington D.C.
  - US Central: Chicago, Austin
  - US West: HQ Mountain View, San Diego, Salt Lake City
- EMEA & APAC
Confidential Think Big Analytics

5 Think Big Recognized as a Top Pure-Play Big Data Vendor
Source: Forbes, February 2012
Confidential Think Big Analytics 01/04/13

6 Agenda
- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions

7 Big Data

8 A Data Shift...
Source: EMC Digital Universe Study*

9 Motivation
"Simple algorithms and lots of data trump complex models."
Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

10 Pioneers
Google and Yahoo:
- Index 850+ million websites, over one trillion URLs.
Facebook ad targeting:
- million users, > 50% of whom are active daily.

11 Hadoop Ecosystem

12 Common Tool? Hadoop
- Cluster: distributed computing platform.
- Commodity*, server-class hardware.
- Extensible platform.

13 Hadoop Origins
- MapReduce and Google File System (GFS) pioneered at Google.
- Hadoop is the commercially-supported open-source equivalent.

14 What Is Hadoop?
- Hadoop is a platform.
- Distributes and replicates data.
- Manages parallel tasks created by users.
- Runs as several processes on a cluster.
- The term Hadoop generally refers to a toolset, not a single tool.

15 Why Hadoop?
- Handles unstructured to semi-structured to structured data.
- Handles enormous data volumes.
- Flexible data analysis and machine learning tools.
- Cost-effective scalability.

16 The Hadoop Ecosystem
- HDFS - Hadoop Distributed File System.
- Map/Reduce - A distributed framework for executing work in parallel.
- Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
- Pig - A top-down scripting language to manipulate data.
- HBase - A NoSQL, non-sequential data store.

18 HDFS

19 What Is HDFS?
- Hadoop Distributed File System.
- Stores files in blocks across many nodes in a cluster.
- Replicates the blocks across nodes for durability.
- Master/Slave architecture.

20 HDFS Traits
- Not fully POSIX compliant.
- No file updates.
- Write once, read many times.
- Large blocks, sequential read patterns.
- Designed for batch processing.

21 HDFS Master
NameNode
- Runs on a single node as a master process
- Holds file metadata (which blocks are where)
- Directs client access to files in HDFS
SecondaryNameNode
- Not a hot failover
- Maintains a copy of the NameNode metadata

22 HDFS Slaves
DataNode
- Generally runs on all nodes in the cluster
- Block creation/replication/deletion/reads
- Takes orders from the NameNode

23 HDFS Illustrated: Put File
Diagram: a file is put into a cluster of one NameNode and DataNodes 1 through 6. Its three blocks are replicated one step at a time until the NameNode's block map reads:
- 1,4,6
- 2,5,3
- 3,2,6
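
The put sequence in the HDFS Illustrated slides (a file cut into blocks, each block recorded against three DataNodes) can be imitated with a toy model in plain Java. This is only a sketch and not the HDFS API: the class name `BlockPlacementSketch`, the round-robin placement rule, and the 300 MB file and 128 MB block size are assumptions made for illustration. The slides do not say how the NameNode picks its targets, so the replica lists printed here differ from the ones in the diagram.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a NameNode block map; hypothetical names, not the HDFS API.
final class BlockPlacementSketch {

    // Number of blocks needed for a file: ceil(fileSize / blockSize).
    static int blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (int) ((fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    // Assign every block to `replication` distinct DataNodes (ids 1..dataNodes).
    // Round-robin is an assumed policy chosen only to keep the sketch short.
    static Map<Integer, List<Integer>> place(int blocks, int dataNodes, int replication) {
        if (replication > dataNodes) {
            throw new IllegalArgumentException("replication exceeds number of DataNodes");
        }
        Map<Integer, List<Integer>> blockMap = new LinkedHashMap<>();
        int next = 0;
        for (int block = 1; block <= blocks; block++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(next % dataNodes + 1);
                next++;
            }
            blockMap.put(block, replicas);
        }
        return blockMap;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        int blocks = blockCount(300 * mb, 128 * mb); // 3 blocks
        // prints {1=[1, 2, 3], 2=[4, 5, 6], 3=[1, 2, 3]}
        System.out.println(place(blocks, 6, 3));
    }
}
```

The map from block number to replica list plays the role of the "which blocks are where" metadata that the NameNode holds.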

30 Power of Hadoop: Read File
Diagram: the file is read from the same cluster, where the NameNode's block map starts as 1,4,6 / 2,5,3 / 3,2,6. DataNode 1 then drops out of the picture and the first block's replica list becomes 5,4,6, leaving DataNodes 2 through 6 to serve the read.
Aggregate read rate = Transfer Rate x Number of Machines*
100 MB/s x 3 = 300 MB/s
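
The arithmetic on this slide can be written out as a small Java sketch. It assumes 100 MB/s per machine, which is what the slide's 300 MB/s total over 3 machines implies, and it treats the product as an aggregate read rate rather than a time. The class name `ParallelReadSketch` and the 3000 MB file size are hypothetical, and the model is idealized: it ignores network contention and any other overhead the slide's asterisk may refer to.

```java
// Idealized parallel-read arithmetic; hypothetical names, no Hadoop classes involved.
final class ParallelReadSketch {

    // Aggregate read rate when `machines` DataNodes stream different blocks at once.
    static double aggregateRateMBps(double perMachineMBps, int machines) {
        return perMachineMBps * machines;
    }

    // Seconds needed to read a file of the given size at the aggregate rate.
    static double readSeconds(double fileSizeMB, double perMachineMBps, int machines) {
        return fileSizeMB / aggregateRateMBps(perMachineMBps, machines);
    }

    public static void main(String[] args) {
        System.out.println(aggregateRateMBps(100.0, 3));   // prints 300.0
        System.out.println(readSeconds(3000.0, 100.0, 3)); // prints 10.0
    }
}
```

Under these assumptions the same 3000 MB file would take 30 seconds from a single machine, so the read time falls as machines are added while the rate rises.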

38 HDFS Shell
- Easy to use command line interface.
- Create, copy, move, and delete files.
- Administrative duties - chmod, chown, chgrp.
- Set replication factor for a file.
- Head, tail, cat to view files.

41 MapReduce in Hadoop

42 MapReduce Basics
- Logical functions: Mappers and Reducers.
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
- Hadoop handles distributing the Map and Reduce tasks across the cluster.
- Typically batch oriented.

43 MapReduce Daemons
JobTracker (Master)
- Manages MapReduce jobs, giving tasks to different nodes, managing task failure
TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to JobTracker

45 MapReduce in Hadoop
Let's look at how MapReduce actually works in Hadoop, using WordCount.

46 WordCount data flow: Input, Mappers, Sort/Shuffle, Reducers, Output
We need to convert the Input into the Output.

Input (one key-value pair per document):
- (doc1, "Hadoop uses MapReduce")
- (doc2, "There is a Map phase")
- (doc3, "")
- (doc4, "There is a Reduce phase")

Mappers (each emits one (word, 1) pair per word):
- doc1: (hadoop, 1) (uses, 1) (mapreduce, 1)
- doc2: (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
- doc3: no output
- doc4: (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)

Sort, Shuffle (keys are partitioned by range and the values for each key are grouped):
- 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
- m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
- r-z: (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers, Output (one reducer per key range):
- 0-9, a-l: a 2, hadoop 1, is 2
- m-q: map 1, mapreduce 1, phase 2
- r-z: reduce 1, there 2, uses 1

Map: Transform one input to 0-N outputs.
Reduce: Collect multiple inputs into one output.
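
The WordCount walk-through can be replayed in a single process with plain Java collections. The sketch below is only an imitation of the data flow and uses no Hadoop classes: the class name `WordCountFlowSketch` and its method names are hypothetical, and the hard-coded three partitions simply copy the key ranges shown on the slides (0-9 and a-l, m-q, r-z).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the WordCount data flow; hypothetical names.
final class WordCountFlowSketch {

    static final int REDUCERS = 3;

    // Map: one document in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Partition: choose a reducer from the key's first character (slide ranges).
    static int partition(String key) {
        char c = key.charAt(0);
        if (c <= 'l') return 0; // 0-9, a-l
        if (c <= 'q') return 1; // m-q
        return 2;               // r-z
    }

    // Shuffle and sort: group values by key inside each partition.
    // TreeMap keeps the keys of a partition in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < REDUCERS; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            partitions.get(partition(pair.getKey()))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    // Reduce: all values for one key in, one total out.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Run map, shuffle, and reduce in order and merge the reducers' outputs.
    static TreeMap<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String document : documents) {
            pairs.addAll(map(document));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        for (TreeMap<String, List<Integer>> partition : shuffle(pairs)) {
            partition.forEach((word, counts) -> output.put(word, reduce(counts)));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {a=2, hadoop=1, is=2, map=1, mapreduce=1, phase=2, reduce=1, there=2, uses=1}
        System.out.println(wordCount(List.of(
                "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase")));
    }
}
```

The empty document yields no pairs at all, which is the "0" end of "one input to 0-N outputs".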

57 Cluster View of MapReduce
Diagram: NameNode, JobTracker, three TaskTrackers, and three DataNodes. A jar containing the map (M) and reduce (R) code is submitted, and the job then moves through these stages:
- Map Phase: map tasks (M) run under the TaskTrackers and emit k,v pairs. * Intermediate data is stored locally.
- Shuffle/Sort: the k,v pairs are redistributed across the nodes.
- Reduce Phase: reduce tasks (R) run under the TaskTrackers and consume the k,v pairs.
- Job Complete!

68 The Hadoop Java API

70 MapReduce in Java
Let's look at WordCount written in the MapReduce Java API.

71 Map Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper class with 4 type parameters for the input key-value types and output types.
public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output key-value objects we'll reuse.
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  // Map method with input, output collector, and reporting.
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Tokenize the line, collect each (word, 1).
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}

78 Reduce Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer class with 4 type parameters for the input key-value types and output types.
public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Reduce method with input, output collector, and reporting.
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Count the counts per word and emit (word, N).
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

84 Other Options
- HDFS - Hadoop Distributed File System.
- Map/Reduce - A distributed framework for executing work in parallel.
- Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
- Pig - A top-down scripting language to manipulate data.
- HBase - A NoSQL, non-sequential data store.

87 Conclusions

88 Hadoop Benefits
A cost-effective, scalable way to:
- Store massive data sets.
- Perform arbitrary analyses on those data sets.

89 Hadoop Tools
Offers a variety of tools for:
- Application development.
- Integration with other platforms (e.g., databases).

90 Hadoop Distributions
A rich, open-source ecosystem.
- Free to use.
- Commercially-supported distributions.

91 Thank You!
- Feel free to contact me at ryan.tabora@thinkbiganalytics.com
- Or our solutions consultant matt.mcdevitt@thinkbiganalytics.com
- As always, THINK BIG!

92 Bonus Content

95 Hive: SQL for Hadoop

97 Hive
Let's look at WordCount written in Hive, the SQL for Hadoop.

98

-- Create a table to hold the raw text we're counting.
-- Each line of text becomes a row with a single column, line.
CREATE TABLE docs (line STRING);

-- Load the text in the docs directory into the table.
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

-- Create the final table and fill it with the results from a nested query
-- of the docs table that performs WordCount on the fly.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
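
For readers more at home in Java than SQL, the nested query can be read as a three-step pipeline: split each line, emit one row per token, then group and count. The sketch below mirrors that shape with Java streams. It only illustrates what the query computes and says nothing about how Hive executes it; the class name `HiveWordCountSketch` is hypothetical, and edge cases such as empty lines or repeated spaces may not behave exactly as in Hive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java reading of the word_counts query; hypothetical names, no Hive involved.
final class HiveWordCountSketch {

    static Map<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
                // split(line, '\s') followed by explode: one row per token
                .flatMap(line -> Arrays.stream(line.split("\\s")))
                // GROUP BY word with count(1); the TreeMap stands in for ORDER BY word
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {Map=1, Reduce=1, There=2, a=2, is=2, phase=2}
        System.out.println(wordCounts(List.of("There is a Map phase", "There is a Reduce phase")));
    }
}
```

Like the query, the sketch does not lowercase the words, so "There" and "there" would be counted separately, unlike in the Java mapper shown earlier.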

105 Hive
Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

108 Pig: Data Flow for Hadoop

110 Pig
Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

111

-- Like the Hive example, load docs content, each line is a field.
inpt = LOAD 'docs' USING TextLoader AS (line:chararray);

-- Tokenize into words (a bag) and flatten into separate records.
words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect the same words together.
grpd = GROUP words BY word;

-- Count each word.
cntd = FOREACH grpd GENERATE group, COUNT(words);

-- Save the results. Profit!
STORE cntd INTO 'output';
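
The intermediate relations in the Pig script are easier to picture once their shapes are written down. The sketch below walks through the same steps in plain Java, with one method per relation (`words`, `grpd`, `cntd`). It is an illustration only: the class name `PigWordCountSketch` is hypothetical, tokens are split on whitespace alone (a simplification of what Pig's tokenizer does), and the sorted maps are used just to make the printed output deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Step-by-step imitation of the Pig relations; hypothetical names, no Pig involved.
final class PigWordCountSketch {

    // words: one record per token, after tokenizing and flattening each line.
    static List<String> words(List<String> inpt) {
        List<String> words = new ArrayList<>();
        for (String line : inpt) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    // grpd: each distinct word (the group key) paired with the bag of matching records.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> grpd = new TreeMap<>();
        for (String word : words) {
            grpd.computeIfAbsent(word, k -> new ArrayList<>()).add(word);
        }
        return grpd;
    }

    // cntd: the group key paired with the size of its bag.
    static Map<String, Long> count(Map<String, List<String>> grpd) {
        Map<String, Long> cntd = new TreeMap<>();
        grpd.forEach((group, bag) -> cntd.put(group, (long) bag.size()));
        return cntd;
    }

    public static void main(String[] args) {
        List<String> inpt = List.of("There is a Map phase", "There is a Reduce phase");
        // prints {Map=1, Reduce=1, There=2, a=2, is=2, phase=2}
        System.out.println(count(group(words(inpt))));
    }
}
```

Seen this way, the count step does nothing more than measure each bag produced by the grouping step.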

120 Pig
Pig and Hive overlap, but Pig is popular for ETL (data transformation, cleansing, ingestion, etc.).

121 Questions?


More information

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

ExamTorrent.   Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig

More information

Distributed Systems. CS422/522 Lecture17 17 November 2014

Distributed Systems. CS422/522 Lecture17 17 November 2014 Distributed Systems CS422/522 Lecture17 17 November 2014 Lecture Outline Introduction Hadoop Chord What s a distributed system? What s a distributed system? A distributed system is a collection of loosely

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

Hadoop ecosystem. Nikos Parlavantzas

Hadoop ecosystem. Nikos Parlavantzas 1 Hadoop ecosystem Nikos Parlavantzas Lecture overview 2 Objective Provide an overview of a selection of technologies in the Hadoop ecosystem Hadoop ecosystem 3 Hadoop ecosystem 4 Outline 5 HBase Hive

More information

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. 1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information
