Introduction to HDFS and MapReduce

2 Who Am I
- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.

4 Think Big is the leading professional services firm that's purpose-built for Big Data.
- One of Silicon Valley's fastest-growing Big Data start-ups.
- 100% focus on Big Data consulting and Data Science solution services.
- Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture.
- C-bridge Internet Solutions (CBIS): founder (1996) and executives, IPO 1999.
- Clients: 40+.
- North America locations: US East (Boston, New York, Washington D.C.); US Central (Chicago, Austin); US West (HQ Mountain View, San Diego, Salt Lake City).
- EMEA & APAC.
Confidential Think Big Analytics

5 Think Big Recognized as a Top Pure-Play Big Data Vendor
Source: Forbes, February 2012
Confidential Think Big Analytics 01/04/13

6 Agenda
- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions

7 Big Data

8 A Data Shift...
Source: EMC Digital Universe Study*

9 Motivation
"Simple algorithms and lots of data trump complex models."
Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

10 Pioneers
- Google and Yahoo: index 850+ million websites, over one trillion URLs.
- Facebook ad targeting: million users, > 50% of whom are active daily.

11 Hadoop Ecosystem

12 Common Tool? Hadoop
- Cluster: distributed computing platform.
- Commodity*, server-class hardware.
- Extensible platform.

13 Hadoop Origins
- MapReduce and Google File System (GFS) pioneered at Google.
- Hadoop is the commercially-supported open-source equivalent.

14 What Is Hadoop?
- Hadoop is a platform.
- Distributes and replicates data.
- Manages parallel tasks created by users.
- Runs as several processes on a cluster.
- The term Hadoop generally refers to a toolset, not a single tool.

15 Why Hadoop?
- Handles unstructured to semi-structured to structured data.
- Handles enormous data volumes.
- Flexible data analysis and machine learning tools.
- Cost-effective scalability.

16 The Hadoop Ecosystem
- HDFS: Hadoop Distributed File System.
- Map/Reduce: a distributed framework for executing work in parallel.
- Hive: a SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
- Pig: a top-down scripting language to manipulate data.
- HBase: a NoSQL, non-sequential data store.

18 HDFS

19 What Is HDFS?
- Hadoop Distributed File System.
- Stores files in blocks across many nodes in a cluster.
- Replicates the blocks across nodes for durability.
- Master/Slave architecture.

20 HDFS Traits
- Not fully POSIX compliant.
- No file updates.
- Write once, read many times.
- Large blocks, sequential read patterns.
- Designed for batch processing.

21 HDFS Master
NameNode
- Runs on a single node as a master process.
- Holds file metadata (which blocks are where).
- Directs client access to files in HDFS.
SecondaryNameNode
- Not a hot failover.
- Maintains a copy of the NameNode metadata.

22 HDFS Slaves
DataNode
- Generally runs on all nodes in the cluster.
- Block creation/replication/deletion/reads.
- Takes orders from the NameNode.

23 HDFS Illustrated
[Figure: a client puts a file into HDFS. The file is stored as three blocks, and the NameNode records which of DataNode 1 through DataNode 6 hold each block's three replicas.]
- First block: DataNodes 1, 4, 6
- Second block: DataNodes 2, 5, 3
- Third block: DataNodes 3, 2, 6
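
The block map in the illustration can be mimicked with a toy model. The sketch below is plain Java, not HDFS code: the class name `BlockPlacementSketch`, the 128 MB block size, and the round-robin choice of DataNodes are all assumptions made for illustration. The real NameNode applies its own placement policy, so the node ids differ from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block placement; NOT the real HDFS placement policy.
class BlockPlacementSketch {

    // For each block of the file, returns the ids of the DataNodes holding a replica.
    static List<List<Integer>> placeBlocks(long fileSize, long blockSize,
                                           int replication, int dataNodes) {
        if (replication > dataNodes) {
            throw new IllegalArgumentException("need at least as many DataNodes as replicas");
        }
        // Ceiling division: a trailing partial block still needs its own entry.
        long blockCount = (fileSize + blockSize - 1) / blockSize;
        List<List<Integer>> placement = new ArrayList<>();
        for (long block = 0; block < blockCount; block++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Round-robin (an assumption): replicas of one block land on distinct nodes.
                replicas.add((int) ((block + r) % dataNodes) + 1);
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 300 MB file, 128 MB blocks, 3 replicas, 6 DataNodes.
        List<List<Integer>> placement = placeBlocks(300L << 20, 128L << 20, 3, 6);
        for (int i = 0; i < placement.size(); i++) {
            System.out.println("block " + i + " -> DataNodes " + placement.get(i));
        }
    }
}
```

With these inputs the file needs three blocks, and every block ends up on three different DataNodes, which is the property that lets the cluster survive the loss of a single node.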

30 Power of Hadoop
[Figure: a client reads the file. The NameNode's block map starts as 1,4,6 / 2,5,3 / 3,2,6. DataNode 1 then drops out of the cluster, the first block's entry shrinks to 4,6, and a new replica on DataNode 5 restores it to 5,4,6. The blocks are read from the remaining DataNodes in parallel.]
Read throughput = Transfer Rate x Number of Machines*
100 MB/s x 3 = 300MB/s
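
The slide's arithmetic (three machines giving 300MB/s, so 100 MB/s each) is an aggregate throughput, and it translates directly into a shorter read time. A minimal sketch, assuming the idealized model in which every machine streams its share at the same rate with no network or disk contention; the class name, method names, and the 3000 MB file size are hypothetical.

```java
// Back-of-the-envelope model of parallel reads; ignores contention and skew.
class ParallelReadSketch {

    // Aggregate throughput when every machine streams at the same rate.
    static double aggregateThroughputMBps(double perMachineMBps, int machines) {
        return perMachineMBps * machines;
    }

    // Time to read a file whose blocks are spread evenly over the machines.
    static double readTimeSeconds(double fileMB, double perMachineMBps, int machines) {
        return fileMB / aggregateThroughputMBps(perMachineMBps, machines);
    }

    public static void main(String[] args) {
        System.out.println(aggregateThroughputMBps(100.0, 3));   // 300.0
        System.out.println(readTimeSeconds(3000.0, 100.0, 1));   // 30.0 on one machine
        System.out.println(readTimeSeconds(3000.0, 100.0, 3));   // 10.0 on three machines
    }
}
```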

38 HDFS Shell
- Easy-to-use command line interface.
- Create, copy, move, and delete files.
- Administrative duties: chmod, chown, chgrp.
- Set replication factor for a file.
- Head, tail, cat to view files.

41 MapReduce in Hadoop

42 MapReduce Basics
- Logical functions: Mappers and Reducers.
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
- Hadoop handles distributing the Map and Reduce tasks across the cluster.
- Typically batch oriented.

43 MapReduce Daemons
JobTracker (Master)
- Manages MapReduce jobs, giving tasks to different nodes, managing task failure.
TaskTracker (Slave)
- Creates individual map and reduce tasks.
- Reports task status to JobTracker.

45 MapReduce in Hadoop
Let's look at how MapReduce actually works in Hadoop, using WordCount.

46 Input -> Mappers -> Sort, Shuffle -> Reducers -> Output
We need to convert the Input into the Output.

Input (one (document, text) pair per mapper call):
- doc1: Hadoop uses MapReduce
- doc2: There is a Map phase
- doc3: (empty)
- doc4: There is a Reduce phase

Mappers (Map: transform one input to 0-N outputs):
- doc1: (hadoop, 1) (uses, 1) (mapreduce, 1)
- doc2: (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
- doc3: no output
- doc4: (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)

Sort, Shuffle (keys partitioned by range, values grouped per key):
- 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
- m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
- r-z: (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers (Reduce: collect multiple inputs into one output) and Output:
- 0-9, a-l: a 2, hadoop 1, is 2
- m-q: map 1, mapreduce 1, phase 2
- r-z: reduce 1, there 2, uses 1
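
The WordCount walk-through can be replayed end to end in a single process. The following is a minimal sketch in plain Java, with no Hadoop classes: `WordCountFlowSketch` and its method names are hypothetical, and the three-way partition by first letter is copied from the slide's key ranges rather than from Hadoop's default partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory replay of the slide's data flow; not Hadoop code.
class WordCountFlowSketch {

    // Map: one input line -> 0-N words, each standing for a (word, 1) pair.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // Partition by first character, following the slide's ranges:
    // anything up to 'l' -> reducer 0, m-q -> reducer 1, the rest -> reducer 2.
    static int partition(String word) {
        char first = word.charAt(0);
        if (first <= 'l') return 0;
        if (first <= 'q') return 1;
        return 2;
    }

    // Reduce: collect all counts for one key into a single total.
    static int reduce(List<Integer> counts) {
        int total = 0;
        for (int count : counts) total += count;
        return total;
    }

    static List<Map<String, Integer>> run(List<String> lines) {
        // Shuffle/sort: one sorted map per reducer, key -> list of 1s.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int i = 0; i < 3; i++) shuffled.add(new TreeMap<>());
        for (String line : lines) {
            for (String word : map(line)) {
                shuffled.get(partition(word))
                        .computeIfAbsent(word, k -> new ArrayList<>())
                        .add(1);
            }
        }
        // Reduce phase: each reducer totals the grouped values of its keys.
        List<Map<String, Integer>> output = new ArrayList<>();
        for (TreeMap<String, List<Integer>> part : shuffled) {
            Map<String, Integer> totals = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                totals.put(e.getKey(), reduce(e.getValue()));
            }
            output.add(totals);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase");
        // Prints [{a=2, hadoop=1, is=2}, {map=1, mapreduce=1, phase=2}, {reduce=1, there=2, uses=1}]
        System.out.println(run(docs));
    }
}
```

The empty document contributes nothing, which is the "0" in "0-N outputs", and each of the three result maps corresponds to one reducer's output on the slide.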

57 Cluster View of MapReduce
[Figure: the master runs the NameNode and the JobTracker; each of three worker nodes runs a TaskTracker alongside a DataNode. The build sequence shows the following steps.]
- The job jar, carrying the map (M) and reduce (R) code, is submitted to the JobTracker.
- Map Phase: map tasks (M) start under the TaskTrackers and emit key-value pairs (k,v).
- * Intermediate data is stored locally.
- Shuffle/Sort: the key-value pairs are redistributed across the nodes.
- Reduce Phase: reduce tasks (R) under the TaskTrackers consume the key-value pairs.
- Job Complete!

68 The Hadoop Java API

70 MapReduce in Java
Let's look at WordCount written in the MapReduce Java API.

71 Map Code
Let's drill into this code...

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper class with 4 type parameters for the input key-value types and output types.
public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output key-value objects we'll reuse.
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  // Map method with input, output collector, and reporting.
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Tokenize the line, collect each (word, 1).
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}
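
The `length() > 0` check in the map method is easy to overlook, so here is a small plain-Java sketch of what it guards against. It uses only `String.split`; the class and method names are hypothetical and nothing here touches the Hadoop API.

```java
import java.util.Arrays;

// Shows why the mapper filters out zero-length tokens.
class SplitGuardSketch {

    // The same split the mapper uses, with no filtering.
    static String[] rawTokens(String line) {
        return line.split("\\s+");
    }

    // Number of (word, 1) pairs the mapper's loop would emit for this line.
    static int emittedPairs(String line) {
        int emitted = 0;
        for (String token : rawTokens(line)) {
            if (token.length() > 0) {
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        String indented = "  Hadoop uses MapReduce";
        System.out.println(Arrays.toString(rawTokens(indented))); // [, Hadoop, uses, MapReduce]
        System.out.println(emittedPairs(indented));               // 3
        System.out.println(emittedPairs(""));                     // 0
    }
}
```

A line with leading whitespace yields an empty first token, and an empty line yields a single empty token. Without the guard, both would be counted as a word.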

78 Reduce Code
Let's drill into this code...

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer class with 4 type parameters for the input key-value types and output types.
public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Reduce method with input, output collector, and reporting.
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Count the counts per word and emit (word, N).
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

84 Other Options
- HDFS: Hadoop Distributed File System.
- Map/Reduce: a distributed framework for executing work in parallel.
- Hive: a SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.
- Pig: a top-down scripting language to manipulate data.
- HBase: a NoSQL, non-sequential data store.

87 Conclusions

88 Hadoop Benefits
A cost-effective, scalable way to:
- Store massive data sets.
- Perform arbitrary analyses on those data sets.

89 Hadoop Tools
Offers a variety of tools for:
- Application development.
- Integration with other platforms (e.g., databases).

90 Hadoop Distributions
A rich, open-source ecosystem.
- Free to use.
- Commercially-supported distributions.

91 Thank You!
- Feel free to contact me at ryan.tabora@thinkbiganalytics.com
- Or our solutions consultant matt.mcdevitt@thinkbiganalytics.com
- As always, THINK BIG!

92 Bonus Content

95 Hive: SQL for Hadoop

97 Hive
Let's look at WordCount written in Hive, the SQL for Hadoop.

Let's drill into this code...

-- Create a table to hold the raw text we're counting.
-- Each line of text is stored in the table's single column, line.
CREATE TABLE docs (line STRING);

-- Load the text in the docs directory into the table.
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

-- Create the final table and fill it with the results from a nested query
-- of the docs table that performs WordCount on the fly.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
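
For readers coming from the Java side, the shape of the nested Hive query can be mirrored with Java streams. This is only an analogy under stated assumptions, not how Hive executes the query: the class and method names are hypothetical, `flatMap` stands in for `explode(split(...))`, `groupingBy` with `counting` stands in for `GROUP BY word` with `count(1)`, and a `TreeMap` stands in for `ORDER BY word`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy of the Hive WordCount query; not Hive code.
class HiveQueryAnalogySketch {

    static Map<String, Long> wordCounts(List<String> docs) {
        return docs.stream()
                // One output row per token, like explode(split(line, '\s')).
                // As in the query, tokens are neither lowercased nor filtered.
                .flatMap(line -> Arrays.stream(line.split("\\s")))
                // Group equal words and count them; TreeMap keeps keys sorted.
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("There is a Map phase", "There is a Reduce phase");
        // Prints {Map=1, Reduce=1, There=2, a=2, is=2, phase=2}
        System.out.println(wordCounts(docs));
    }
}
```

Unlike the Java mapper shown earlier, this version does not lowercase the words, because the Hive query does not either.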

105 Hive
Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

108 Pig: Data Flow for Hadoop

110 Pig
Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

Let's drill into this code...

-- Like the Hive example, load docs content; each line is a field.
inpt = LOAD 'docs' USING TextLoader() AS (line:chararray);

-- Tokenize into words (a bag) and flatten into separate records.
words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect the same words together.
grpd = GROUP words BY word;

-- Count each word.
cntd = FOREACH grpd GENERATE group, COUNT(words);

-- Save the results. Profit!
STORE cntd INTO 'output';

120 Pig
Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.

121 Questions?
