Your First Hadoop App, Step by Step


Learn Hadoop in one evening

By Martynas Miliauskas

Published in 2013 by Martynas Miliauskas
On the web:
Please send errors to

Copyright 2013 by Martynas Miliauskas. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without prior permission in writing from the publisher.

Table of Contents

Introduction
What is Hadoop?
Typical Hadoop Cluster
MapReduce
HDFS
What's Next?
Step 1. Getting Started
Step 2. Making Data Sample
Step 3. Simple MapReduce App
Step 4. Mapper
Step 5. Reducer
Step 6. Running Hadoop Job
Step 7. Using Combiners
Step 8. Using Aggregators
Step 9. Configuring Pseudo-Distributed Mode
Step 10. Setting up HDFS
Step 11. Booting Hadoop
Step 12. Storing Data on HDFS
Step 13. Running Job in Pseudo-Cluster
Step 14. Compressing Input
Step 15. Storing Data on S3
Step 16. Setting up Amazon EMR
Step 17. Plotting Scores
Conclusion

Introduction

If you have ever wondered what Hadoop and MapReduce are but never had time to look into them, then you will love this book. It will take you from having no idea what Hadoop is to your first MapReduce application spinning on an Amazon EMR cluster. You will not need to learn Java or any other language. Throughout this book we are going to use the Hadoop Streaming API, which lets you write MapReduce applications in your favorite language.

What do I need to have?

You need to have Hadoop installed. You can use the following guide as a reference.


What will I build?

The practical part of this book will walk you through building a MapReduce application with the Streaming API and Ruby. Our application will take the serverfault.com data dump (~280MB) and calculate a histogram of posts' scores. If you do not want to download this data or install and configure Hadoop, you can still follow this book. Every step is illustrated by around 20 vivid screenshots that will help you relive the building process as if you were doing it yourself. Should you wish to learn Ruby, "Your First Ruby Script, Step-by-Step" can help you master the language in a few evenings.

What is Hadoop?

Hadoop is a tool that helps you process large amounts of data quickly. It does so by using a cluster of computers, to which data and work are distributed. With the Hadoop Streaming API and the Amazon EMR service, you will find it very easy to write and deploy Hadoop applications. The Streaming API requires only two scripts written by the developer, the mapper and the reducer (we will get back to these shortly), and then your app is ready. Amazon EMR makes it easy to create and launch a Hadoop cluster in seconds. Using a simple wizard, the developer can pick the number of worker nodes in the cluster and specify the location of the input data and the MapReduce code, and the job is ready to be run.

Before diving into building our first Hadoop application, let's first spend some time on getting a basic understanding of how Hadoop functions under the hood.
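To make the "two scripts" idea concrete, here is a minimal in-memory sketch of the mapper, sort, reducer flow. It is an illustration only: the word-counting mapper, the method names (map_line, reduce_sorted, run_streaming) and the in-process sort are all assumptions made for this example, not the book's scripts and not Hadoop's own code.

```ruby
# Hypothetical mapper: emits one "key<TAB>1" record per word on a line.
def map_line(line)
  line.split.map { |word| "#{word.downcase}\t1" }
end

# Hypothetical reducer: sums the values of each run of identical keys.
# It relies on the records arriving sorted by key.
def reduce_sorted(records)
  records
    .map { |record| record.split("\t") }
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |run| "#{run.first[0]}\t#{run.sum { |_, value| value.to_i }}" }
end

# Stand-in for the framework: run the mapper over every input line,
# sort the mapper output, then hand the sorted records to the reducer.
def run_streaming(lines)
  mapped = lines.flat_map { |line| map_line(line) }
  reduce_sorted(mapped.sort)
end
```

Calling run_streaming(["b a", "a c a"]) returns one record per distinct word. On a real cluster the sort step is performed by Hadoop between the map and reduce phases, which is why the developer only supplies the two scripts.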

Typical Hadoop Cluster

A typical Hadoop cluster has many worker/slave nodes which store and process chunks of data in parallel. Two master nodes, namely the NameNode and the JobTracker, are responsible for managing how data is stored and processed on the slave nodes. Each slave runs DataNode and TaskTracker daemons. The DataNode, instructed by the NameNode, stores information; the TaskTracker, instructed by the JobTracker, runs Map and Reduce tasks.

[Figure: two master nodes (NameNode and JobTracker) above four slave nodes, each running a DataNode and a TaskTracker.]

DataNode and TaskTracker are bundled together in order to keep data as close as possible to the processing. However, a TaskTracker might execute a task that takes data from a foreign DataNode, though priority is usually given to a DataNode on the same rack.
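The locality preference described above can be pictured with a toy model. This is a sketch under stated assumptions, not Hadoop's scheduler: the Node struct, the three-level ranking and the method names are hypothetical and exist only to illustrate "same node first, then same rack, then anywhere".

```ruby
# Hypothetical description of a slave node: its name and the rack it sits in.
Node = Struct.new(:name, :rack)

# Lower rank means closer to the task:
# 0 = data on the same node, 1 = same rack, 2 = a foreign rack.
def locality_rank(task_node, data_node)
  return 0 if data_node.name == task_node.name
  return 1 if data_node.rack == task_node.rack
  2
end

# Pick the copy of the data that is closest to where the task runs.
# Returns nil when there are no replicas to choose from.
def closest_replica(task_node, replicas)
  replicas.min_by { |replica| locality_rank(task_node, replica) }
end
```

For example, a task on Node.new("slave1", "rackA") offered replicas on "slave3" in "rackB" and "slave2" in "rackA" would read from "slave2" in this model.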

Pages 9-20 have been skipped.

Step 5. Reducer

reducer.rb

    #!/usr/bin/env ruby

    posts_count = 0
    last_key = nil

    STDIN.each_line do |line|
      key, value = line.split("\t")
      if last_key && last_key != key    # (1) the key has changed
        puts "#{last_key}\t#{posts_count}"
        last_key = key
        posts_count = value.to_i
      else                              # (2) same key, or the very first line
        last_key = key
        posts_count += value.to_i
      end
    end

    puts "#{last_key}\t#{posts_count}"

A sorted version of the mapper output that we saw in the previous chapter is going to be the input for the reducer script. Our reducer will sum up all the values (1s) that have the same key (score) and send the result to the output stream.

Since the reducer input comes sorted by key in a single continuous stream, we can assume that once the key changes, we won't see the previous key again. Every time we notice a new key (marker 1), we output the accumulated sum of all the 1s (posts_count) that belong to the previous key (last_key) and restart the posts_count counter for the new key. If the current key is the same as last_key (marker 2), we add the integer version (to_i) of value to posts_count. The final puts after the loop emits the total for the last key, which no key change would otherwise flush.
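To see why the sorted input matters, the same run-based summing can be wrapped in a method and fed both sorted and unsorted records. This is a sketch, not the book's script: the method name reduce_lines and the sample records are made up for illustration, and the guard for empty input is an addition.

```ruby
# Same run-based summing as reducer.rb, wrapped in a method (hypothetical
# name) so it can be called on an array of lines instead of STDIN.
def reduce_lines(lines)
  out = []
  last_key = nil
  count = 0
  lines.each do |line|
    key, value = line.chomp.split("\t")
    if last_key && last_key != key
      # Key changed: flush the total for the previous key and start over.
      out << "#{last_key}\t#{count}"
      count = 0
    end
    last_key = key
    count += value.to_i
  end
  # Flush the last key; emit nothing at all for empty input.
  out << "#{last_key}\t#{count}" if last_key
  out
end

# Made-up mapper output: three posts with score 0, two with score 5.
unsorted = ["5\t1", "0\t1", "5\t1", "0\t1", "0\t1"]
sorted_result = reduce_lines(unsorted.sort)   # one record per score
unsorted_result = reduce_lines(unsorted)      # scores show up repeatedly
```

With sorted input each score appears once in the result. With unsorted input the same score is emitted several times with partial counts, which is exactly the situation the sort step between mapper and reducer prevents.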

We can test our mapper and reducer scripts using the same one-liner that we saw in Step 3:

    $ cat posts_sample.xml | ./mapper.rb | sort | ./reducer.rb > output.txt

Let's see if our output makes sense:

    $ less output.txt

At a glance it seems to be fine. We are now ready to poke the Streaming API by running our first Hadoop job with the scripts that we just wrote.
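Checking the output "at a glance" can also be done mechanically. The sketch below is one possible sanity check, assuming each line of output.txt has the form score, tab, count; the method name histogram_problems and the optional expected_total argument are hypothetical.

```ruby
# Returns a list of problems found in reducer output lines.
# An empty list means the output looks consistent.
# expected_total is optional: the number of records the mapper emitted.
def histogram_problems(lines, expected_total = nil)
  problems = []
  seen = {}
  total = 0
  lines.each_with_index do |line, index|
    key, count = line.chomp.split("\t")
    if key.nil? || count.nil? || count !~ /\A\d+\z/
      problems << "line #{index + 1}: expected \"score<TAB>count\""
      next
    end
    # A correctly reduced histogram lists every score exactly once.
    problems << "line #{index + 1}: score #{key} appears more than once" if seen[key]
    seen[key] = true
    total += count.to_i
  end
  if expected_total && total != expected_total
    problems << "counts sum to #{total}, expected #{expected_total}"
  end
  problems
end
```

It could be run as histogram_problems(File.readlines("output.txt")). A score reported more than once usually means the sort step was skipped, and a mismatching total means records were lost or counted twice.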
