Getting Started with Hadoop/YARN


Michael Völske <michael.voelske@uni-weimar.de>
April 28, 2016

Outline

Part One: Hadoop, HDFS, and MapReduce
- Hadoop
- HDFS - Distributed File System
- MapReduce
- MapReduce on YARN
- Summary

Part Two: Setting Up Your Own Virtual Cluster
- Starting And Connecting To The Virtual Machines
- Basic Linux Shell Commands
- Installing and Configuring Hadoop

Part Three: Using The Cluster
- Testing, testing...
- Our First Job
- MapReduce Streaming API
- MapReduce Java API

Part One: Hadoop, HDFS, and MapReduce

What is Hadoop
- Started in 2004 by Yahoo
- Open-source implementation of Google MapReduce, the Google File System, and Google BigTable
- Apache Software Foundation top-level project
- Written in Java

Key Concepts
- Scale out, not up: 4000+ nodes, 100+ PB of data; cheap commodity hardware instead of supercomputers; fault tolerance and redundancy
- Bring the program to the data: storage and data processing on the same node; local processing (the network is the bottleneck)
- Work sequentially instead of with random access: optimized for large datasets
- Hide system-level details: the user doesn't need to know what code runs on which machine

Hadoop 1 vs. Hadoop 2
[Figure: comparison of the Hadoop 1 and Hadoop 2 architectures]

HDFS - Distributed File System

HDFS Overview
- Designed for storing large files
- Files are split into blocks
- Integrity: blocks are checksummed
- Redundancy: each block is stored on multiple machines
- Optimized for sequentially reading whole blocks
- Daemon processes:
  NameNode: central registry of block locations
  DataNode: block storage on each node
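The block mechanics can be illustrated locally with the standard `split` tool. This is only an analogy for how HDFS chops a file into fixed-size blocks (10 bytes here instead of HDFS's default of 128 MB), not HDFS itself; all file names below are made up:

```shell
# Split a 26-byte file into 10-byte "blocks", then reassemble it.
printf 'abcdefghijklmnopqrstuvwxyz' > file.txt
split -b 10 file.txt block_                      # block_aa, block_ab, block_ac
nblocks=$(ls block_* | wc -l | tr -d ' ')        # two full blocks plus one partial
cat block_* > rejoined.txt                       # concatenating the blocks restores the file
cmp -s file.txt rejoined.txt && intact=yes || intact=no
echo "$nblocks blocks, intact=$intact"
rm file.txt rejoined.txt block_*
```

The last block is smaller than the block size, just as the final HDFS block of a file usually is.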

Reading a File
[Figure: reading a file from HDFS]

MapReduce

Motivation
Problem: collecting data is easy and cheap; evaluating data is difficult.
Solution: divide and conquer; parallel processing.

MapReduce Steps
1. Map step: each worker applies the map() function to its local data and writes the output to temporary storage. Each output record gets a key.
2. Shuffle step: worker nodes redistribute data based on the output keys, so that all records with the same key go to the same worker node.
3. Reduce step: workers apply the reduce() function to each group, per key, in parallel.
The user specifies the map() and reduce() functions.

Word Count Example
Input (split across four mappers):
  Mary had a little lamb / its fleece was white as snow / and everywhere that Mary went / the lamb was sure to go
Map() output (one (word, 1) record per word):
  Mary 1, had 1, a 1, little 1, lamb 1, its 1, fleece 1, was 1, white 1, as 1, snow 1, and 1, everywhere 1, that 1, Mary 1, went 1, the 1, lamb 1, was 1, sure 1, to 1, go 1
Reduce() output (counts summed per word):
  a 1, as 1, lamb 2, little 1, ... Mary 2, was 2, went 1, ...
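The whole example can be mimicked in an ordinary shell, which makes the three steps concrete (a local sketch, not a Hadoop job):

```shell
# Word count as a local pipeline: tr/sed play the mappers,
# sort plays the shuffle step, and awk plays the reducers.
counts=$(printf 'Mary had a little lamb its fleece was white as snow\nand everywhere that Mary went the lamb was sure to go\n' |
  tr ' ' '\n' |       # map: one word per line
  sed 's/$/ 1/' |     # map: emit a (word, 1) record per word
  sort -k1,1 |        # shuffle: group records by key
  awk '{ c[$1] += $2 } END { for (w in c) print w, c[w] }' |  # reduce: sum per key
  sort)
echo "$counts"
```

Hadoop does the same thing, except that each pipeline stage runs distributed across many machines.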

Key-Value Pairs
Map step: Map(k1, v1) -> list(k2, v2)
Sorting and shuffling: all pairs with the same key are grouped together; one group per key.
Reduce step: Reduce(k2, list(v2)) -> list(v3)

MapReduce on Hadoop
[Figure: MapReduce execution on Hadoop]

MapReduce on YARN

YARN Basic Concepts
YARN processes:
- ResourceManager: single instance, runs on the cluster's master node
- NodeManager: runs on each cluster node; executes computations and provides containers to applications
MapReduce job components (each runs in a YARN container):
- ApplicationMaster: controls execution on the cluster (one per YARN application)
- Mapper: processes input data
- Reducer: processes (sorted) mapper output

YARN + MapReduce
Basic process:
1. The client application requests a container for the ApplicationMaster.
2. The ApplicationMaster runs on the cluster and requests further containers for mappers and reducers.
3. Mappers execute the user-provided map() function on their part of the input data.
4. The shuffle phase distributes map output to the reducers.
5. Reducers execute the user-provided reduce() function on their group of map output.
6. The final result is stored in HDFS.
See also: [Anatomy of a MapReduce Job]

Summary

YARN and HDFS Components
[Figure: how the YARN and HDFS daemons fit together across the cluster]

End of Part One. Questions?

Part Two: Setting Up Your Own Virtual Cluster

Starting And Connecting To The Virtual Machines

Virtual Toy Cluster
[Figure: VirtualBox on the host runs three virtual machines (master, slave1, slave2) on a shared network. From the host, an SSH client connects to the master (ssh localhost -p 2222), a browser reaches the Hadoop NameNode status page at http://localhost:50070, and jobs are submitted from a terminal on the master (hadoop jar ...).]

Betaweb Host Cluster
[Figure: the Betaweb cluster of 130 machines on a routed network. An SSH client connects directly to a node (ssh betaweb020), a browser reaches the NameNode status page (http://betaweb020:...), and jobs are submitted from a terminal on the node (hadoop jar ...).]

Starting the VM for the first time
- Make sure you have the appliance BigData.ova on your machine.
- Start up VirtualBox.
- "File" -> "Import Appliance..."
- Make sure the VM starts up from the VirtualBox interface.

Setting Up The Virtual Network
- "File" -> "Preferences" -> "Network"
- Create a "Host-only Network"
- Remember the name of the new network (on Linux it's something like "vboxnetX")
- Change its settings to:
  IPv4 Address: 10.42.23.1
  Network Mask: 255.255.255.0

Verifying the VM Network Settings
Right-click each of the three VMs and go to "Settings" -> "Network". The four network adapters should be configured like this:
- Adapter 1: enabled, attached to "NAT"
- Adapter 2: enabled, attached to "Host-only Adapter"; the "Name" field must be the network you created in the previous step
- Adapters 3 and 4: disabled

Set Up Host Names for the Virtual Machines I
- The hosts file maps hostnames to IP addresses (without a DNS server).
- We will use it to connect to our virtual cluster machines by host name.
- We will add the following hosts entries:

  10.42.23.101 master.cluster master
  10.42.23.102 slave1.cluster slave1
  10.42.23.103 slave2.cluster slave2
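Each entry has the format "address, canonical name, alias". The lookup below mimics what the resolver does with the alias column, on a throwaway copy rather than the real hosts file (a sketch; the scratch file name is made up):

```shell
# Write the three entries to a scratch file and look one up by alias.
cat > hosts.test <<'EOF'
10.42.23.101 master.cluster master
10.42.23.102 slave1.cluster slave1
10.42.23.103 slave2.cluster slave2
EOF
addr=$(awk '$3 == "master" { print $1 }' hosts.test)   # column 3 is the short alias
echo "$addr"
rm hosts.test
```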

Set Up Host Names for the Virtual Machines II
Linux:
- Open a terminal and run: sudo nano /etc/hosts
- Paste in the three lines at the end.
- Save and close (Ctrl-X, then y, then Enter).
OSX (10.6 or newer):
- Run: sudo nano /private/etc/hosts
- Paste the lines, save and close.
- Run: dscacheutil -flushcache

Set Up Host Names for the Virtual Machines III
OSX (before 10.6):
1. Open /Applications/Utilities/NetInfo Manager.
2. Select "Machines".
3. Select "localhost", then "Edit" -> "Duplicate".
4. Change the "ip_address" property to "10.42.23.101" and the "name" property to "master". Delete the "serves" property and save.
5. Repeat the previous step for the other two entries on the list.

Set Up Host Names for the Virtual Machines IV
Windows (XP, 7, 8 and 10):
- Find "Notepad" in the start menu/launcher, right-click it, and choose "Run as administrator".
- With Notepad, open the file C:\Windows\System32\Drivers\etc\hosts
- Paste the three lines at the end, save, and exit.
- If the file is marked as read-only, try saving it under a different name and copying it over the current hosts file.

Install an SSH Client on your Laptop
Linux and OSX: you should already have one. To check, open a terminal and run: ssh -V
Windows: download PuTTY [http://www.putty.org/], run the installer, and find "PuTTY" in your start menu.

Start all VMs Without Graphical Output
Option 1: select "Start" -> "Headless Start" from the UI.
Option 2: open a terminal and run:

  VBoxHeadless -s "BigData Hadoop VM (master)"
  VBoxHeadless -s "BigData Hadoop VM (slave1)"
  VBoxHeadless -s "BigData Hadoop VM (slave2)"

Test Routing from your Laptop to the VMs
- Open a command prompt (terminal) on your laptop. If you can't find this on Windows, try typing "cmd" into the search box in the Start menu.
- Type "ping master"; the output should look like this:

  PING master.cluster (10.42.23.101) 56(84) bytes of data.
  64 bytes from master.cluster (10.42.23.101): ...

(If That Didn't Work)
VirtualBox should update the routing table automatically, but sometimes that fails. If you're on Linux and the above ping didn't work, try the commands below. I don't know how to do this on Windows/OSX :-(

Linux:
  sudo ip link set dev vboxnet1 up
  sudo ip addr replace 10.42.23.1 dev vboxnet1
  sudo ip route add dev vboxnet1 to 10.42.23.0/24

Connect via SSH
Linux and OSX:
  ssh -p 2222 hadoop-admin@localhost   ## for master
  ssh -p 2223 hadoop-admin@localhost   ## for slave1
  ssh -p 2224 hadoop-admin@localhost   ## for slave2
Windows:
- Open PuTTY and type in the host name (e.g. "master") and port (after the "-p" above).
- When asked: "Login as: hadoop-admin"
- The password is always "hadoop-admin"

Basic Linux Shell Commands

Connect to the "master" Machine

  ssh -p 2222 hadoop-admin@localhost

Basic Shell Commands

  See what's in the current directory   ls (or ls -l)
  Change directory                      cd NAME
  Show the current directory            pwd
  Create a directory                    mkdir NAME
  Delete a file                         rm NAME
  Delete an empty directory             rmdir NAME
  Edit a text file                      nano NAME
  Make a file executable                chmod +x NAME
  Download a file                       wget URL
  Run a command as root                 sudo COMMAND
  Disconnect                            exit (or logout)
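The commands in the table compose into a short practice session; a sketch you can run in any scratch location (the file and directory names are made up):

```shell
mkdir scratch                 # create a directory
cd scratch
here=$(pwd)                   # the current directory, ending in /scratch
printf 'echo hello\n' > greet.sh
chmod +x greet.sh             # make the script executable
listing=$(ls -l greet.sh)     # the long listing now shows the x permission bits
rm greet.sh                   # delete the file
cd ..
rmdir scratch                 # delete the now-empty directory
echo "$here"
echo "$listing"
```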

Where To Learn More
There are a lot of great resources online for learning to use the command line more effectively. These are a few good places to start:
- [github.com/aleksandar-todorovic/awesome-linux#learning-resources]
- [github.com/alebcay/awesome-shell#guides]
- William Shotts, The Linux Command Line; [linuxcommand.org/tlcl.php] (free PDF book)

Installing and Configuring Hadoop

Download and Unpack Hadoop
Download Hadoop:
  wget http://webis16/bigdata-seminar/hadoop-2.7.2.tar.gz
(If that doesn't work, use webis16.medien.uni-weimar.de.)
Unpack it:
  tar xf hadoop-2.7.2.tar.gz
Move it to /opt:
  sudo mv hadoop-2.7.2 /opt/hadoop

Set up the Shell Environment
Create the file /etc/profile.d/99-hadoop.sh with contents:

  export HADOOP_PREFIX="/opt/hadoop"
  export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin"

Make it executable and source it:

  sudo chmod +x /etc/profile.d/99-hadoop.sh
  source /etc/profile

Edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh (line 25):

  export JAVA_HOME="/usr/lib/jvm/java-7-oracle"
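Before touching /etc/profile.d, the two export lines can be sanity-checked by sourcing a local copy in a subshell (a sketch; the temporary file name is made up, and no root access is needed):

```shell
# Write the environment file to a local copy and source it in a subshell.
cat > 99-hadoop-test.sh <<'EOF'
export HADOOP_PREFIX="/opt/hadoop"
export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin"
EOF
prefix=$( . ./99-hadoop-test.sh && echo "$HADOOP_PREFIX" )
newpath=$( . ./99-hadoop-test.sh && echo "$PATH" )
echo "$prefix"                # the prefix the Hadoop scripts will see
rm 99-hadoop-test.sh
```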

Test the Hadoop Binary

  hadoop
  Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
    CLASSNAME            run the class named CLASSNAME
   or
    where COMMAND is one of:
    fs                   run a generic filesystem user client
    version              print the version
    jar <jar>            run a jar file
    ...

Configure HDFS I
/opt/hadoop/etc/hadoop/core-site.xml:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:9000</value>
    </property>
  </configuration>

Configure HDFS II
/opt/hadoop/etc/hadoop/hdfs-site.xml:

  <configuration>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/hadoop-admin/dfs/dn</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/hadoop-admin/dfs/nn</value>
    </property>
    <property>
      <name>dfs.permissions.superusergroup</name>
      <value>hadoop-admin</value>
    </property>
  </configuration>

Configure MapReduce
This file doesn't exist yet; we will create it.
/opt/hadoop/etc/hadoop/mapred-site.xml:

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>

Configure YARN
/opt/hadoop/etc/hadoop/yarn-site.xml:

  <configuration>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  </configuration>

Configure Slave Host Names
Remove the "localhost" entry that is already in the file.
/opt/hadoop/etc/hadoop/slaves:

  slave1
  slave2

Copy Configuration to the Slave Hosts
Log into the first slave host:
  ssh slave1
Copy over the Hadoop distribution and environment scripts:
  1. scp -r master:/opt/hadoop .   (note the dot at the end)
  2. scp master:/etc/profile.d/99-hadoop.sh .
  3. sudo mv hadoop /opt/hadoop
  4. sudo cp 99-hadoop.sh /etc/profile.d/
  5. rm 99-hadoop.sh
  6. logout
Repeat these steps for the other slave host: ssh slave2

Start Hadoop
On the master node, format the HDFS...
  hdfs namenode -format
...and start it:
  start-dfs.sh
Then start YARN:
  start-yarn.sh

Check That Everything Is Running
On the master:
  jps
  5178 Jps
  4646 NameNode
  1339 SecondaryNameNode
  4076 ResourceManager
On the slaves:
  jps
  3900 DataNode
  4161 Jps
  3994 NodeManager
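That eyeball check can be scripted. The helper below is hypothetical (not part of Hadoop) and is exercised here on a simulated jps listing rather than real daemon output:

```shell
# check_daemons: succeed only if every named daemon appears in jps.out
check_daemons() {
  for d in "$@"; do
    grep -q " $d\$" jps.out || { echo "missing: $d"; return 1; }
  done
  echo "all daemons running"
}
# Simulated master-node listing; on the VM you would use: jps > jps.out
printf '5178 Jps\n4646 NameNode\n1339 SecondaryNameNode\n4076 ResourceManager\n' > jps.out
status=$(check_daemons NameNode SecondaryNameNode ResourceManager)
echo "$status"
rm jps.out
```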

Part Three: Using The Cluster

Testing, testing...

Test the Web UI
If everything went according to plan, you should be able to open the Hadoop web UI in your browser:
  NameNode        http://master:50070
  ResourceManager http://master:8088

Create User Home Directory in HDFS
You can access HDFS, but you need to create a home directory for your user:
  hadoop fs -mkdir -p /user/hadoop-admin
Browsing HDFS from the shell (see the [Hadoop FS Shell Documentation]):
  List files                     hadoop fs -ls NAME
  Remove a directory             hadoop fs -rmdir NAME
  Remove a file                  hadoop fs -rm NAME
  Copy from the local FS to HDFS hadoop fs -put LOCAL REMOTE


Starting Our First Job

Now that everything is set up, let's start one of the standard example MapReduce jobs.

Running the job:

    cd /opt/hadoop/share/hadoop/mapreduce
    hadoop jar hadoop-mapreduce-examples-*.jar pi 16 1000000

Output:

    Number of Maps  = 16
    Samples per Map = 1000000
    Wrote input for Map #0
    ...
    Job Finished in 186.733 seconds
    Estimated value of Pi is 3.14159125000000000000
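The pi example estimates π by sampling points in the unit square and counting the fraction that fall inside the quarter circle; each map task draws its own samples and the reduce step combines the counts. The Hadoop job uses a quasi-random Halton sequence, so the awk sketch below, which uses ordinary pseudo-random numbers, only illustrates the idea, not the exact algorithm:

```shell
# Local sketch of the Monte Carlo idea behind the pi example (not Hadoop):
# points uniform in the unit square land inside the quarter circle with
# probability pi/4, so 4 * (inside / n) approximates pi.
awk 'BEGIN {
  srand(1); n = 200000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1) inside++
  }
  printf "%.3f\n", 4 * inside / n
}'
```

With n = 200000 the estimate typically lands close to 3.14, though the exact digits depend on the awk implementation's random generator.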

Part Three: Using The Cluster MapReduce Streaming API Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, 2016 58 / 66


Streaming API: MapReduce with shell scripts

You can write MapReduce programs with simple shell scripts, or arbitrary binaries you compile yourself.

A pipe in the shell:

    grep the input | sort | wc -l > output

The corresponding Hadoop Streaming program:

    hadoop jar /[...]/hadoop-streaming-*.jar \
        -input myinputdirs \
        -output myoutputdir \
        -mapper "grep the" \
        -reducer "wc -l"

Note: the hadoop-streaming jar file is in /opt/hadoop/share/hadoop/tools/lib/
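Because streaming just wires your mapper and reducer together with pipes, you can test the same logic locally before submitting the job. A quick check on made-up input (the sample lines are illustrative):

```shell
# Streaming mapper/reducer logic can be tested locally with an ordinary
# pipe before running it on the cluster. Sample input is made up.
printf 'the quick fox\nno match here\nover the moon\n' > sample-input.txt

# Same pipeline the streaming job runs: mapper = grep, reducer = wc -l
grep the sample-input.txt | sort | wc -l
```

Here two of the three lines contain "the", so the pipeline prints 2; the streaming job computes the same count, just distributed over the cluster.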

Try it for Yourself

Download a book from Project Gutenberg [www.gutenberg.org] and find out how many lines contain the word "the". For example, with Shakespeare's complete works:

    wget www.gutenberg.org/cache/epub/100/pg100.txt
    hadoop fs -put pg100.txt shakespeare.txt
    hadoop jar /[...]/hadoop-streaming-*.jar \
        -input shakespeare.txt -output the-shake \
        -mapper "grep the" -reducer "wc -l"
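One thing to watch: `grep the` matches "the" as a substring, so lines containing only words like "there" or "rather" are counted too; `grep -w` restricts the match to whole words. A quick local illustration (the sample lines are made up):

```shell
# grep matches substrings by default; -w matches whole words only.
printf 'there was a feast\nthe cat sat\nrather odd\n' > words.txt

grep -c  the words.txt   # substring match: all three lines count
grep -cw the words.txt   # whole-word match: only "the cat sat" counts
```

If whole-word counting is what you want, the same flag works in the streaming job: -mapper "grep -w the".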


MapReduce with Java: API

Prefer the new API: org.apache.hadoop.mapreduce.* (not org.apache.hadoop.mapred.*).

In general, a Java program extends the Mapper and/or Reducer classes and implements the setup(), map(), and reduce() methods. See the [Online Tutorial] with its WordCount example.
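Before looking at the Java code, the shape of the computation can be sketched with shell tools. This is a local analogue only, not the Hadoop API, and the sample input is made up: the "map" step emits one word per line, sorting plays the role of the shuffle that groups equal keys, and the "reduce" step counts each group.

```shell
# Local analogue of WordCount: tokenize ("map"), sort (the shuffle),
# then count runs of equal words ("reduce"). Sample input is made up.
printf 'the cat and the dog\nthe end\n' |
  tr -s ' ' '\n' |   # map: one word per line
  sort |             # shuffle: bring equal keys together
  uniq -c |          # reduce: count each group of equal words
  sort -rn           # most frequent word first
```

The Java WordCount does the same thing, except that Hadoop performs the grouping between the map and reduce phases across the whole cluster.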

WordCount Example

We'll run the WordCount.java example from the Apache MapReduce tutorial.

Prepare a working directory:

    mkdir wordcount
    cd wordcount

Download the code:

    wget webis16/bigdata-seminar/wordcount.java

Compiling the code:

    javac -cp $(hadoop classpath) WordCount.java

Compiler output:

    ls
    WordCount$IntSumReducer.class  WordCount$TokenizerMapper.class
    WordCount.class                WordCount.java


Packing a jar file for Hadoop:

    jar cf wc.jar WordCount*.class

Running the code on Hadoop:

    hadoop jar wc.jar WordCount shakespeare.txt shakes-count

Looking at the result:

    hadoop fs -cat shakes-count/\* | head

That's it for today. Questions?