Introduction to HDFS and MapReduce

Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2

Think Big is the leading professional services firm that's purpose-built for Big Data. One of Silicon Valley's fastest-growing Big Data start-ups. 100% focus on Big Data consulting & Data Science solution services. Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founders & executives, 1996, IPO 1999. Clients: 40+. North America locations: US East - Boston, New York, Washington D.C.; US Central - Chicago, Austin; US West - HQ Mountain View, San Diego, Salt Lake City; plus EMEA & APAC. 3

Think Big Recognized as a Top Pure-Play Big Data Vendor. Source: Forbes, February 2012. 4

Agenda - Big Data - Hadoop Ecosystem - HDFS - MapReduce in Hadoop - The Hadoop Java API - Conclusions 5

Big Data 6

A Data Shift... Source: EMC Digital Universe Study* 7

Motivation "Simple algorithms and lots of data trump complex models." - Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems 8

Pioneers Google and Yahoo: - Index 850+ million websites, over one trillion URLs. Facebook ad targeting: - 840+ million users, > 50% of whom are active daily. 9

Hadoop Ecosystem 10

Common Tool? Hadoop - Cluster: distributed computing platform. - Commodity*, server-class hardware. - Extensible Platform. 11

Hadoop Origins MapReduce and the Google File System (GFS) were pioneered at Google. Hadoop is the open-source equivalent, with commercial support available. 12

What Is Hadoop? Hadoop is a platform. Distributes and replicates data. Manages parallel tasks created by users. Runs as several processes on a cluster. The term Hadoop generally refers to a toolset, not a single tool. 13

Why Hadoop? Handles unstructured to semi-structured to structured data. Handles enormous data volumes. Flexible data analysis and machine learning tools. Cost-effective scalability. 14

The Hadoop Ecosystem HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore, allowing SQL manipulation of data stored on HDFS. Pig - A top-down scripting language for manipulating data on HDFS. HBase - A NoSQL, non-sequential (random-access) data store. 15

HDFS 16

What Is HDFS? Hadoop Distributed File System. Stores files in blocks across many nodes in a cluster. Replicates the blocks across nodes for durability. Master/Slave architecture. 17

HDFS Traits Not fully POSIX compliant. No file updates. Write once, read many times. Large blocks, sequential read patterns. Designed for batch processing. 18
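
For scale: HDFS blocks default to 64 MB in Hadoop 1.x (tunable via the dfs.block.size property), versus a few kilobytes in a typical local file system; large blocks keep reads sequential and amortize seek overhead.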

HDFS Master NameNode - Runs on a single node as a master process. - Holds file metadata (which blocks are where). - Directs client access to files in HDFS. SecondaryNameNode - Not a hot failover. - Maintains a copy of the NameNode metadata. 19

HDFS Slaves DataNode - Generally runs on all nodes in the cluster. - Handles block creation/replication/deletion/reads. - Takes orders from the NameNode. 20
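
To see the NameNode's block map for yourself, fsck can report where each block of a file lives (the path here is illustrative; output format varies by version):

hadoop fsck /user/ryan/docs/file.txt -files -blocks -locations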

HDFS Illustrated (animated sequence): A client asks the NameNode to put a file. The file is split into blocks 1, 2, and 3, and each block is written to three DataNodes for redundancy: block 1 to DataNodes 1, 4, and 6; block 2 to DataNodes 2, 5, and 3; block 3 to DataNodes 3, 2, and 6. The NameNode records which blocks live where. 21

Power of Hadoop (animated sequence): A client asks the NameNode to read the file and fetches blocks from several DataNodes in parallel. When DataNode 1 dies, the NameNode re-replicates its block to DataNode 5 (block 1's locations become 5, 4, 6) and the read continues unaffected. Aggregate read throughput = per-node transfer rate x number of machines*: at 100 MB/s per node across 3 nodes, 100 MB/s x 3 = 300 MB/s. 22

HDFS Shell Easy to use command line interface. Create, copy, move, and delete files. Administrative duties - chmod, chown, chgrp. Set replication factor for a file. Head, tail, cat to view files. 23
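
A few representative commands (paths are illustrative):

hadoop fs -put localfile.txt /user/ryan/          # copy a local file into HDFS
hadoop fs -ls /user/ryan                          # list a directory
hadoop fs -cat /user/ryan/localfile.txt           # print a file
hadoop fs -tail /user/ryan/localfile.txt          # view the end of a file
hadoop fs -setrep -w 2 /user/ryan/localfile.txt   # change a file's replication factor
hadoop fs -chmod 644 /user/ryan/localfile.txt     # administrative duties
hadoop fs -rm /user/ryan/localfile.txt            # delete a file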

The Hadoop Ecosystem HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore, allowing SQL manipulation of data stored on HDFS. Pig - A top-down scripting language for manipulating data on HDFS. HBase - A NoSQL, non-sequential (random-access) data store. 24

MapReduce in Hadoop 25

MapReduce Basics Logical functions: Mappers and Reducers. Developers write map and reduce functions, then submit a jar to the Hadoop cluster. Hadoop handles distributing the Map and Reduce tasks across the cluster. Typically batch oriented. 26
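
In the terms of the original MapReduce paper, the two functions have the shapes map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2); Hadoop supplies everything in between (scheduling, shuffling, fault tolerance).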

MapReduce Daemons JobTracker (Master) - Manages MapReduce jobs, giving tasks to different nodes, managing task failure. TaskTracker (Slave) - Creates individual map and reduce tasks. - Reports task status to the JobTracker. 27
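
Running jobs can also be watched from the command line in Hadoop 1.x (the job ID below is a made-up example):

hadoop job -list                           # list running jobs
hadoop job -status job_201301041234_0001   # check map/reduce completion
hadoop job -kill job_201301041234_0001     # kill a job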

MapReduce in Hadoop Let s look at how MapReduce actually works in Hadoop, using WordCount. 28

Input Mappers Sort, Shuffle Reducers Output (animated sequence): WordCount turns the input documents into per-word counts. We need to convert the Input into the Output.

Input: each document arrives as a (docid, contents) pair - (doc1, "Hadoop uses MapReduce"), (doc2, "There is a Map phase"), (doc3, ""), (doc4, "There is a Reduce phase").

Mappers: emit one (word, 1) pair per word - (hadoop, 1), (uses, 1), (mapreduce, 1) from doc1; (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1) from doc2; (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1) from doc4.

Sort, Shuffle: pairs are grouped by key and routed to reducers by key range - 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1]); m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1]); r-z: (reduce, [1]), (there, [1,1]), (uses, [1]).

Reducers, Output: each reducer sums the list for each word - a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.

Map: transform one input to 0-N outputs. Reduce: collect multiple inputs into one output. 29-36

Cluster View of MapReduce (animated sequence): A client submits a job jar to the JobTracker, which runs alongside the NameNode. The JobTracker hands map tasks (M) to TaskTrackers on the nodes whose DataNodes hold the input blocks. Map Phase: each map task emits intermediate (k,v) pairs; intermediate data is stored locally, not in HDFS. Shuffle/Sort: the intermediate pairs move across the network, grouped by key. Reduce Phase: the JobTracker schedules reduce tasks (R) on TaskTrackers, which consume the grouped pairs and write the final output. Job Complete! 37

The Hadoop Java API 38

MapReduce in Java Let s look at WordCount written in the MapReduce Java API. 39

Map Code

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}

Let's drill into this code...
- Mapper class with 4 type parameters for the input key-value types and output key-value types.
- Output key-value objects we'll reuse across calls.
- Map method with the input key-value pair, an output collector, and a reporting object.
- Tokenize the line, then collect a (word, 1) pair for each word. 40-44

Reduce Code

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

Let's drill into this code...
- Reducer class with 4 type parameters for the input key-value types and output key-value types.
- Reduce method with the input key and value iterator, an output collector, and a reporting object.
- Count the counts per word and emit (word, N). 45-48
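
The slides show only the mapper and reducer; to run them you also need a small driver. A minimal sketch using the same old org.apache.hadoop.mapred API (class name and argument handling are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SimpleWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SimpleWordCount.class);
    conf.setJobName("wordcount");

    // Key-value types emitted by the mapper and reducer.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(SimpleWordCountMapper.class);
    conf.setReducerClass(SimpleWordCountReducer.class);

    // Read lines of text; write tab-separated (word, count) lines.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // HDFS input and output paths, taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job and block until it finishes.
    JobClient.runJob(conf);
  }
}

Packaged into a jar, it would be launched with something like: hadoop jar wordcount.jar SimpleWordCount docs output (jar and path names illustrative).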

Other Options HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore, allowing SQL manipulation of data stored on HDFS. Pig - A top-down scripting language for manipulating data on HDFS. HBase - A NoSQL, non-sequential (random-access) data store. 49

Conclusions 50

Hadoop Benefits A cost-effective, scalable way to: - Store massive data sets. - Perform arbitrary analyses on those data sets. 51

Hadoop Tools Offers a variety of tools for: - Application development. - Integration with other platforms (e.g., databases). 52

Hadoop Distributions A rich, open-source ecosystem. - Free to use. - Commercially-supported distributions. 53

Thank You! - Feel free to contact me at ryan.tabora@thinkbiganalytics.com - Or our solutions consultant matt.mcdevitt@thinkbiganalytics.com - As always, THINK BIG! 54

Bonus Content 55

The Hadoop Ecosystem HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore, allowing SQL manipulation of data stored on HDFS. Pig - A top-down scripting language for manipulating data on HDFS. HBase - A NoSQL, non-sequential (random-access) data store. 56

Hive: SQL for Hadoop 57

Hive Let s look at WordCount written in Hive, the SQL for Hadoop. 58

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Let's drill into this code...
- Create a table to hold the raw text we're counting; each line of input becomes a row in the single line column.
- Load the text in the docs directory into the table.
- Create the final table and fill it with the results of a nested query of the docs table that performs WordCount on the fly. 59-62
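
Once built, word_counts is an ordinary Hive table and can be queried directly; for example (an illustrative query, assuming the count alias is accepted as a column name):

SELECT word, count FROM word_counts ORDER BY count DESC LIMIT 10;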

Hive Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem! 63

The Hadoop Ecosystem HDFS - Hadoop Distributed File System. Map/Reduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore, allowing SQL manipulation of data stored on HDFS. Pig - A top-down scripting language for manipulating data on HDFS. HBase - A NoSQL, non-sequential (random-access) data store. 64

Pig: Data Flow for Hadoop 65

Pig Let s look at WordCount written in Pig, the Data Flow language for Hadoop. 66

inpt = LOAD 'docs' USING TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';

Let's drill into this code...
- Like the Hive example, load the docs content; each line is a field.
- Tokenize each line into words (a bag) and flatten the bag into separate records.
- Collect the same words together.
- Count each word.
- Save the results. Profit! 67-72
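
To rank words by frequency, one more relation could be added before storing; a sketch (alias and path names are illustrative):

srtd = ORDER cntd BY $1 DESC;  -- sort by the count column, descending
STORE srtd INTO 'output_sorted';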

Pig Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc. 73

Questions? 74