Introduction to HDFS and MapReduce
Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
Think Big is the leading professional services firm that's purpose-built for Big Data. One of Silicon Valley's fastest-growing Big Data startups. 100% focus on Big Data consulting & Data Science solution services. Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founder (1996) & executives, IPO 1999. Clients: 40+. North America locations - US East: Boston, New York, Washington D.C.; US Central: Chicago, Austin; US West: HQ Mountain View, San Diego, Salt Lake City; EMEA & APAC. Confidential Think Big Analytics
Think Big recognized as a Top Pure-Play Big Data Vendor. Source: Forbes, February 2012.
Agenda - Big Data - Hadoop Ecosystem - HDFS - MapReduce in Hadoop - The Hadoop Java API - Conclusions
Big Data
A Data Shift... Source: EMC Digital Universe Study
Motivation: "Simple algorithms and lots of data trump complex models." - Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
Pioneers. Google and Yahoo: - Index 850+ million websites, over one trillion URLs. Facebook ad targeting: - 840+ million users, > 50% of whom are active daily.
Hadoop Ecosystem
Common Tool? Hadoop - Cluster: distributed computing platform. - Commodity, server-class hardware. - Extensible platform.
Hadoop Origins: MapReduce and the Google File System (GFS) were pioneered at Google. Hadoop is the commercially supported open-source equivalent.
What Is Hadoop? Hadoop is a platform. Distributes and replicates data. Manages parallel tasks created by users. Runs as several processes on a cluster. The term Hadoop generally refers to a toolset, not a single tool.
Why Hadoop? Handles unstructured to semi-structured to structured data. Handles enormous data volumes. Flexible data analysis and machine learning tools. Cost-effective scalability.
The Hadoop Ecosystem: HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS. Pig - A high-level data flow scripting language for manipulating data on HDFS. HBase - A NoSQL data store offering random, real-time access.
HDFS
What Is HDFS? Hadoop Distributed File System. Stores files in blocks across many nodes in a cluster. Replicates the blocks across nodes for durability. Master/Slave architecture.
HDFS Traits: Not fully POSIX compliant. No file updates - write once, read many times. Large blocks, sequential read patterns. Designed for batch processing.
HDFS Master: NameNode - Runs on a single node as a master process. Holds file metadata (which blocks are where). Directs client access to files in HDFS. SecondaryNameNode - Not a hot failover. Maintains a copy of the NameNode metadata.
HDFS Slaves: DataNode - Generally runs on all nodes in the cluster. Handles block creation/replication/deletion/reads. Takes orders from the NameNode.
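To applications, this master/slave machinery hides behind an ordinary file-system API. A minimal sketch of writing and then reading a file through Hadoop's FileSystem class (assumes a reachable cluster configured on the classpath; the path is hypothetical):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name (e.g. hdfs://namenode:8020) from the
    // configuration files on the classpath; defaults to the local FS otherwise.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/ryan/hello.txt");  // hypothetical path

    // Write once...
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("Hello, HDFS\n");
    }

    // ...read many times. The NameNode supplies block locations;
    // the bytes stream directly from the DataNodes.
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}
```

Behind fs.create and fs.open, the client talks to the NameNode only for metadata; the actual block traffic goes to the DataNodes.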
HDFS Illustrated: [Diagram sequence - a client puts a file into HDFS. The file is split into blocks 1, 2, and 3; as the DataNodes store and replicate the blocks, the NameNode records each block's locations, ending with block 1 on DataNodes 1, 4, 6; block 2 on DataNodes 2, 5, 3; and block 3 on DataNodes 3, 2, 6.]
Power of Hadoop: [Diagram sequence - a client reads the file. The NameNode returns the block locations and the client reads blocks from several DataNodes in parallel. When DataNode 1 fails, the NameNode re-replicates its blocks from the surviving copies (block 1's locations change from 1, 4, 6 to 5, 4, 6) and the read still succeeds.] Aggregate read throughput ≈ per-node transfer rate x number of machines read in parallel, e.g., 100 MB/s x 3 = 300 MB/s.
HDFS Shell: Easy-to-use command line interface. Create, copy, move, and delete files. Administrative duties - chmod, chown, chgrp. Set the replication factor for a file. Head, tail, cat to view files.
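The operations above map onto the hadoop fs CLI. A few representative invocations (assuming a running cluster; all paths here are hypothetical):

```shell
# List, create, and populate a directory
hadoop fs -ls /
hadoop fs -mkdir /user/ryan/docs
hadoop fs -put localfile.txt /user/ryan/docs/

# View file contents
hadoop fs -cat /user/ryan/docs/localfile.txt
hadoop fs -tail /user/ryan/docs/localfile.txt

# Permissions, ownership, group
hadoop fs -chmod 644 /user/ryan/docs/localfile.txt
hadoop fs -chown ryan:hadoop /user/ryan/docs/localfile.txt

# Set the replication factor for a file (-w waits for it to take effect)
hadoop fs -setrep -w 2 /user/ryan/docs/localfile.txt
```

Most commands mirror their Unix namesakes, which keeps the learning curve shallow.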
The Hadoop Ecosystem: HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS. Pig - A high-level data flow scripting language for manipulating data on HDFS. HBase - A NoSQL data store offering random, real-time access.
MapReduce in Hadoop
MapReduce Basics: Logical functions: Mappers and Reducers. Developers write map and reduce functions, then submit a jar to the Hadoop cluster. Hadoop handles distributing the Map and Reduce tasks across the cluster. Typically batch oriented.
MapReduce Daemons: JobTracker (Master) - Manages MapReduce jobs, giving tasks to different nodes and managing task failure. TaskTracker (Slave) - Creates individual map and reduce tasks. Reports task status to the JobTracker.
MapReduce in Hadoop: Let's look at how MapReduce actually works in Hadoop, using WordCount.
WordCount, stage by stage - we need to convert the Input into the Output.
Input, as (document, line) pairs: (doc1, "Hadoop uses MapReduce"), (doc2, "There is a Map phase"), (doc3, ""), (doc4, "There is a Reduce phase").
Mappers emit one (word, 1) pair per word: (hadoop, 1), (uses, 1), (mapreduce, 1); (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1); (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1).
Sort, Shuffle partitions the keys across reducers (here by key ranges 0-9 a-l, m-q, r-z) and groups the values for each key: (a, [1,1]), (hadoop, [1]), (is, [1,1]); (map, [1]), (mapreduce, [1]), (phase, [1,1]); (reduce, [1]), (there, [1,1]), (uses, [1]).
Reducers sum each list, producing the Output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
Map: Transform one input into 0-N outputs. Reduce: Collect multiple inputs into one output.
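The pipeline above can be sketched in plain Java with standard collections - a toy single-process simulation of map, shuffle/sort, and reduce (no Hadoop involved; the class and method names are illustrative only):

```java
import java.util.*;

// Toy simulation of the WordCount data flow:
// map -> shuffle/sort (group values by key) -> reduce (sum).
public class WordCountSim {

    // Map: one input line -> 0..N (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (token.length() > 0) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    static SortedMap<String, Integer> run(List<String> docs) {
        // Shuffle/sort: group all values by key; TreeMap keeps keys sorted.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs) {
            for (Map.Entry<String, Integer> kv : map(doc)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce: sum each key's list of 1s.
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Hadoop uses MapReduce", "There is a Map phase",
                "", "There is a Reduce phase");
        System.out.println(run(docs));
        // {a=2, hadoop=1, is=2, map=1, mapreduce=1, phase=2, reduce=1, there=2, uses=1}
    }
}
```

Hadoop's contribution is not this logic - it is running the map and reduce calls in parallel across machines, with the shuffle moving data between them.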
Cluster View of MapReduce: [Diagram sequence - the client submits a jar containing the Map (M) and Reduce (R) code to the JobTracker, while the NameNode tracks the input blocks on the DataNodes. The JobTracker hands map tasks to TaskTrackers running alongside the DataNodes that hold the input. Map Phase: each map task emits intermediate (k,v) pairs, which are stored locally rather than in HDFS. Shuffle/Sort: the intermediate pairs move to the nodes that will reduce them. Reduce Phase: reduce tasks consume the grouped pairs and write the final output. Job complete!]
The Hadoop Java API
MapReduce in Java: Let's look at WordCount written in the MapReduce Java API.
Map Code

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());   // normalize case
        collector.collect(word, one);         // emit (word, 1)
      }
    }
  }
}

Let's drill into this code...
- A Mapper class with 4 type parameters: the input key-value types and the output key-value types.
- Output key-value objects we'll reuse across calls.
- The map method takes the input key-value pair, an output collector, and a reporting object.
- It tokenizes the line and collects one (word, 1) pair per word.
Reduce Code

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();               // sum the 1s for this word
    }
    output.collect(key, new IntWritable(count));  // emit (word, N)
  }
}

Let's drill into this code...
- A Reducer class with 4 type parameters: the input key-value types and the output key-value types.
- The reduce method takes a key with an iterator over its values, an output collector, and a reporting object.
- It sums the counts per word and emits (word, N).
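The slides show the mapper and reducer but not the driver that wires them into a job and submits it. A minimal driver sketch in the same old org.apache.hadoop.mapred API might look like this (the WordCount class name is an assumption; input and output paths come from the command line):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver class; ties the mapper and reducer into one job.
public class WordCount {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(SimpleWordCountMapper.class);
    conf.setReducerClass(SimpleWordCountReducer.class);

    // Key-value types emitted by the mapper and the reducer.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submit the job and wait for completion
  }
}
```

Packaged into a jar, this would be submitted with something like: hadoop jar wordcount.jar WordCount input output.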
Other Options: HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS. Pig - A high-level data flow scripting language for manipulating data on HDFS. HBase - A NoSQL data store offering random, real-time access.
Conclusions
Hadoop Benefits: A cost-effective, scalable way to: - Store massive data sets. - Perform arbitrary analyses on those data sets.
Hadoop Tools: Offers a variety of tools for: - Application development. - Integration with other platforms (e.g., databases).
Hadoop Distributions: A rich, open-source ecosystem. - Free to use. - Commercially supported distributions.
Thank You! - Feel free to contact me at ryan.tabora@thinkbiganalytics.com - Or our solutions consultant matt.mcdevitt@thinkbiganalytics.com - As always, THINK BIG!
Bonus Content
The Hadoop Ecosystem: HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS. Pig - A high-level data flow scripting language for manipulating data on HDFS. HBase - A NoSQL data store offering random, real-time access.
Hive: SQL for Hadoop
Hive: Let's look at WordCount written in Hive, the SQL for Hadoop.
CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Let's drill into this code...
- Create a table to hold the raw text we're counting. Each line becomes a row with a single string column.
- Load the text in the docs directory into the table.
- Create the final table and fill it with the results of a nested query over the docs table that performs WordCount on the fly: split each line into words, explode the words into rows, then group and count.
Hive: Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!
The Hadoop Ecosystem: HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS. Pig - A high-level data flow scripting language for manipulating data on HDFS. HBase - A NoSQL data store offering random, real-time access.
Pig: Data Flow for Hadoop
Pig: Let's look at WordCount written in Pig, the Data Flow language for Hadoop.
inpt  = LOAD 'docs' USING TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd  = GROUP words BY word;
cntd  = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';

Let's drill into this code...
- Like the Hive example, load the docs content; each line is a field.
- Tokenize each line into words (a bag) and flatten the bag into separate records.
- Collect the same words together.
- Count each word.
- Save the results. Profit!
Pig: Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
Questions?