Getting Started with Hadoop/YARN

Size: px
Start display at page:

Download "Getting Started with Hadoop/YARN"

Transcription

1 Getting Started with Hadoop/YARN Michael Völske 1 April 28, michael.voelske@uni-weimar.de Michael Völske Getting Started with Hadoop/YARN April 28, / 66

2 Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

3 Part One: Hadoop, HDFS, and MapReduce Hadoop Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

4 Part One: Hadoop, HDFS, and MapReduce Hadoop What is Hadoop Started in 2004 by Yahoo Open-Source implementation of Google MapReduce, Google Filesystem and Google BigTable Apache Software Foundation top level project Written in Java Michael Völske Getting Started with Hadoop/YARN April 28, / 66

5 Part One: Hadoop, HDFS, and MapReduce Hadoop Key Concepts Scale out, not up! nodes, 100PB+ data cheap commodity hardware instead of supercomputers fault-tolerance, redundancy Bring the program to the data storage and data processing on the same node local processing (network is the bottleneck) Working sequentially instead of random-access optimized for large datasets Hide system-level details User doesn t need to know what code runs on which machine Michael Völske Getting Started with Hadoop/YARN April 28, / 66

6 Part One: Hadoop, HDFS, and MapReduce Hadoop Hadoop 1 Hadoop 2 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

7 Part One: Hadoop, HDFS, and MapReduce HDFS - Distributed File System Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

8 Part One: Hadoop, HDFS, and MapReduce HDFS - Distributed File System HDFS Overview Designed for storing large files Files are split in blocks Integrity: Blocks are checksummed Redundancy: Each block stored on multiple machines Optimized for sequentially reading whole blocks Daemon processes: NameNode: Central registry of block locations DataNode: Block storage on each node Michael Völske Getting Started with Hadoop/YARN April 28, / 66

9 Part One: Hadoop, HDFS, and MapReduce HDFS - Distributed File System Reading a file Michael Völske Getting Started with Hadoop/YARN April 28, / 66

10 Part One: Hadoop, HDFS, and MapReduce MapReduce Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

11 Part One: Hadoop, HDFS, and MapReduce MapReduce Motivation Problem Collecting data is easy and cheap Evaluating data is difficult Solution Divide and Conquer Parallel Processing Michael Völske Getting Started with Hadoop/YARN April 28, / 66

12 Part One: Hadoop, HDFS, and MapReduce MapReduce Steps Map Step Each worker applies the map() function to the local data and writes the output to temporary storage. Each output record gets a key. Shuffle Step Worker nodes redistribute data based on the output keys: all records with the same key go to the same worker node. Reduce Step Workers apply the reduce() function to each group, per key, in parallel. The user specifies the map() and reduce() functions Michael Völske Getting Started with Hadoop/YARN April 28, / 66

13 Part One: Hadoop, HDFS, and MapReduce MapReduce Word Count Example Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go Michael Völske Getting Started with Hadoop/YARN April 28, / 66

14 Part One: Hadoop, HDFS, and MapReduce MapReduce Word Count Example Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go Map() Map() Map() Map() Mary 1 had 1 a 1 little 1 lamb 1 its 1 fleece 1 was 1 white 1 as 1 snow 1 and 1 everywhere 1 that 1 Mary 1 went 1 the 1 lamb 1 was 1 sure 1 to 1 go 1 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

15 Part One: Hadoop, HDFS, and MapReduce MapReduce Word Count Example Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go Map() Map() Map() Map() Mary 1 had 1 a 1 little 1 lamb 1 its 1 fleece 1 was 1 white 1 as 1 snow 1 and 1 everywhere 1 that 1 Mary 1 went 1 the 1 lamb 1 was 1 sure 1 to 1 go 1 Reduce() Reduce() a 1 as 1 lamb 2 little 2... Mary 2 was 2 went 1... Michael Völske Getting Started with Hadoop/YARN April 28, / 66

16 Part One: Hadoop, HDFS, and MapReduce MapReduce Key-Value-Pairs Map Step: Map(k1,v1) list(k2,v2) Sorting and Shuffling: All pairs with the same key are grouped together; one group per key. Reduce Step: Reduce(k2, list(v2)) list(v3) Michael Völske Getting Started with Hadoop/YARN April 28, / 66

17 Part One: Hadoop, HDFS, and MapReduce MapReduce MapReduce on Hadoop Michael Völske Getting Started with Hadoop/YARN April 28, / 66

18 Part One: Hadoop, HDFS, and MapReduce MapReduce on YARN Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

19 Part One: Hadoop, HDFS, and MapReduce MapReduce on YARN YARN Basic Concepts YARN Processes ResourceManager Single instance, runs on the cluster s master node NodeManager Runs on each cluster node; executes computations and provides containers to applications MapReduce Job ApplicationMaster Controls execution on the cluster (one for each YARN application) Mapper Process input data Reducer Process (sorted) Mapper output each of these runs in a YARN Container Michael Völske Getting Started with Hadoop/YARN April 28, / 66

20 Part One: Hadoop, HDFS, and MapReduce MapReduce on YARN YARN+MapReduce Basic process: 1. Client application requests a container for the ApplicationMaster 2. ApplicationMaster runs on the cluster, requests further containers for Mappers and Reducers 3. Mappers execute user-provided map() function on their part of the input data 4. The shuffle() phase is started to distribute map output to reducers 5. Reducers execute user-provided reduce() function on their group of map output 6. Final result is stored in HDFS See also: [Anatomy of a MapReduce Job] Michael Völske Getting Started with Hadoop/YARN April 28, / 66

21 Part One: Hadoop, HDFS, and MapReduce Summary Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

22 Part One: Hadoop, HDFS, and MapReduce Summary YARN and HDFS Components Michael Völske Getting Started with Hadoop/YARN April 28, / 66

23 Part One: Hadoop, HDFS, and MapReduce Summary YARN and HDFS Components Michael Völske Getting Started with Hadoop/YARN April 28, / 66

24 Part One: Hadoop, HDFS, and MapReduce Summary YARN and HDFS Components Michael Völske Getting Started with Hadoop/YARN April 28, / 66

25 Part One: Hadoop, HDFS, and MapReduce Summary YARN and HDFS Components Michael Völske Getting Started with Hadoop/YARN April 28, / 66

26 Part One: Hadoop, HDFS, and MapReduce Summary End of Part One. Questions? Michael Völske Getting Started with Hadoop/YARN April 28, / 66

27 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

28 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Virtual Toy Cluster Host SSH Client ssh localhost -p 2222 [master] $ ls -l VirtualBox 3 Virtual Machines Browser Hadoop NameNode Status master slave1 Hadoop Client Terminal [master] $ hadoop jar... slave2 Shared Network Michael Völske Getting Started with Hadoop/YARN April 28, / 66

29 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Betaweb Host Cluster SSH Client ssh betaweb020 [betaweb] $ ls -l Browser Hadoop NameNode Status Hadoop Client Terminal [betaweb] $ hadoop jar... Routed Network 130 Machines Michael Völske Getting Started with Hadoop/YARN April 28, / 66

30 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Starting the VM for the first time Make sure you have the appliance BigData.ova on your machine. Start up VirtualBox. "File" "Import Appliance... " Make sure the VM starts up from the VirtualBox interface. Michael Völske Getting Started with Hadoop/YARN April 28, / 66

31 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Starting the VM for the first time Make sure you have the appliance BigData.ova on your machine. Start up VirtualBox. "File" "Import Appliance... " Make sure the VM starts up from the VirtualBox interface. Michael Völske Getting Started with Hadoop/YARN April 28, / 66

32 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Setting Up The Virtual Network "File" "Preferences" "Network" Create a "Host-only Network" Remember the name of the new network (on Linux it s something like "vboxnetx") Change its settings to: IPv4 Address Network Mask Michael Völske Getting Started with Hadoop/YARN April 28, / 66

33 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Verifying the VM Network Settings Right-click each of the three VMs and go to "Settings" "Network" The four network adapters should be configured like this: Adapter 1 enabled, attached to "NAT" Adapter 2 enabled, attached to "Host-only Adapter" The "Name" field must be the network you created in the previous step Adapter 3 and 4 disabled Michael Völske Getting Started with Hadoop/YARN April 28, / 66

34 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Set Up Host Names for the Virtual Machines I The hosts file maps hostnames to IP addresses (without a DNS server) We will use it to connect to our virtual cluster machines using their host names We will add the following hosts entries master.cluster master slave1.cluster slave slave2.cluster slave2 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

35 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Set Up Host Names for the Virtual Machines II Linux Open a terminal and run: sudo nano /etc/hosts Paste in the three lines at the end Save and close (Ctrl-X, then y, then enter) OSX (10.6 or newer) Run: sudo nano /private/etc/hosts Paste the lines, save and close Run: dscacheutil -flushcache Michael Völske Getting Started with Hadoop/YARN April 28, / 66

36 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Set Up Host Names for the Virtual Machines III OSX (before 10.6) 1. Open /Applications/Utilities/NetInfo Manager 2. Select "Machines" 3. Select "localhost", then "Edit" "Duplicate" 4. Change the "ip_address" property to " " and the "name" property to "master" Delete the "serves" property and save 5. Repeat the previous step for the other two entries on the list Michael Völske Getting Started with Hadoop/YARN April 28, / 66

37 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Set Up Host Names for the Virtual Machines IV Windows (XP, 7, 8 and 10) Find "Notepad" in the start menu/launcher, right click and "Run as administrator" With notepad, open the file C:\Windows\System32\Drivers\etc\hosts Paste the three lines at the end, save and exit If the file is marked as read-only, try saving it under a different name and copying it over the current hosts file Michael Völske Getting Started with Hadoop/YARN April 28, / 66

38 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Install an SSH Client on your Laptop Linux and OSX You should already have one. To check, open a terminal and run: ssh -V Windows Download PuTTY [ and run the installer Find "PuTTY" in your start menu Michael Völske Getting Started with Hadoop/YARN April 28, / 66

39 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Start all VMs Without Graphical Output Option 1 Select "Start" "Headless Start" from the UI Option 2: Open a terminal and run: VBoxHeadless -s "BigData Hadoop VM (master)" VBoxHeadless -s "BigData Hadoop VM (slave1)" VBoxHeadless -s "BigData Hadoop VM (slave2)" Michael Völske Getting Started with Hadoop/YARN April 28, / 66

40 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Test Routing from your Laptop to the VMs Open a command prompt (terminal) on your Laptop If you can t find this on Windows, try typing "cmd" into the search box in the Start menu Type "ping master"; the output should look like this: PING master.cluster ( ) 56(84) bytes of data. 64 bytes from master.cluster ( ):... Michael Völske Getting Started with Hadoop/YARN April 28, / 66

41 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines (If That Didn t Work) VirtualBox should update the routing table automatically, but sometimes that fails If you re on linux, and the above ping didn t work, try below I don t know how to do this on Windows/OSX :-( Linux sudo ip link set dev vboxnet1 up sudo ip addr replace dev vboxnet1 sudo ip route add dev vboxnet1 to /24 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

42 Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Connect via SSH Linux and OSX ssh -p 2222 ssh -p 2223 ssh -p 2224 ## for master ## for slave1 ## for slave2 Windows Open PuTTY and type in the host name (e.g. "master") and port (after the "-p" above) When asked: "Login as: hadoop-admin" The password is always "hadoop-admin" Michael Völske Getting Started with Hadoop/YARN April 28, / 66

43 Part Two: Setting Up Your Own Virtual Cluster Basic Linux Shell Commands Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

44 Part Two: Setting Up Your Own Virtual Cluster Basic Linux Shell Commands Connect to the "master" Machine ssh -p 2222 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

45 Part Two: Setting Up Your Own Virtual Cluster Basic Linux Shell Commands Basic Shell Commands What How See what s in the current directory ls or ls -l Change directory cd NAME Show current directory pwd Create a directory mkdir NAME Delete a file rm NAME Delete an empty directory rmdir NAME Edit a text file nano NAME Make a file executable chmod +x NAME Download a file wget URL Run a command as root sudo COMMAND Disconnect exit or logout Michael Völske Getting Started with Hadoop/YARN April 28, / 66

46 Part Two: Setting Up Your Own Virtual Cluster Basic Linux Shell Commands Where To Learn More There are a lot of great resources online for learning to use the commandline more effectively. These are a few good places to start: [github.com/aleksandar-todorovic/awesome-linux#learning-resources] [github.com/alebcay/awesome-shell#guides] William Shotts, The Linux Command Line; [linuxcommand.org/tlcl.php] (free PDF book) Michael Völske Getting Started with Hadoop/YARN April 28, / 66

47 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

48 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Download and Unpack Hadoop Download Hadoop wget (if that doesn t work, use webis16.medien.uni-weimar.de) Michael Völske Getting Started with Hadoop/YARN April 28, / 66

49 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Download and Unpack Hadoop Download Hadoop wget (if that doesn t work, use webis16.medien.uni-weimar.de) Unpack it tar xf hadoop tar.gz Michael Völske Getting Started with Hadoop/YARN April 28, / 66

50 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Download and Unpack Hadoop Download Hadoop wget (if that doesn t work, use webis16.medien.uni-weimar.de) Unpack it tar xf hadoop tar.gz Move it to /opt sudo mv hadoop /opt/hadoop Michael Völske Getting Started with Hadoop/YARN April 28, / 66

51 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Set up the Shell Environment Create the file /etc/profile.d/99-hadoop.sh with contents: export HADOOP_PREFIX="/opt/hadoop" export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin" Michael Völske Getting Started with Hadoop/YARN April 28, / 66

52 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Set up the Shell Environment Create the file /etc/profile.d/99-hadoop.sh with contents: export HADOOP_PREFIX="/opt/hadoop" export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin" Make it executable and source it sudo chmod +x /etc/profile.d/99-hadoop.sh source /etc/profile Michael Völske Getting Started with Hadoop/YARN April 28, / 66

53 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Set up the Shell Environment Create the file /etc/profile.d/99-hadoop.sh with contents: export HADOOP_PREFIX="/opt/hadoop" export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin" Make it executable and source it sudo chmod +x /etc/profile.d/99-hadoop.sh source /etc/profile Edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh (line 25) export JAVA_HOME="/usr/lib/jvm/java-7-oracle" Michael Völske Getting Started with Hadoop/YARN April 28, / 66

54 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Test the Hadoop Binary hadoop Usage: hadoop [--config confdir] [COMMAND CLASSNAME] CLASSNAME run the class named CLASSNAME or where COMMAND is one of: fs run a generic filesystem user client version print the version jar <jar> run a jar file... Michael Völske Getting Started with Hadoop/YARN April 28, / 66

55 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Configure HDFS I /opt/hadoop/etc/hadoop/core-site.xml <configuration> <property> <name>fs.defaultfs</name> <value>hdfs://master:9000</value> </property> </configuration> Michael Völske Getting Started with Hadoop/YARN April 28, / 66

56 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Configure HDFS II /opt/hadoop/etc/hadoop/hdfs-site.xml <configuration> <property> <name>dfs.datanode.data.dir</name> <value>/home/hadoop-admin/dfs/dn</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>/home/hadoop-admin/dfs/nn</value> </property> <property> <name>dfs.permission.supergroup</name> <value>hadoop-admin</value> </property> </configuration> Michael Völske Getting Started with Hadoop/YARN April 28, / 66

57 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Configure MapReduce This file doesn t exist yet; we will create it /opt/hadoop/etc/hadoop/mapred-site.xml <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> Michael Völske Getting Started with Hadoop/YARN April 28, / 66

58 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Configure YARN /opt/hadoop/etc/hadoop/yarn-site.xml <configuration> <property> <name>yarn.resourcemanager.hostname</name> <value>master</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> </configuration> Michael Völske Getting Started with Hadoop/YARN April 28, / 66

59 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Configure Slave Host Names remove the "localhost" entry that is already in the file /opt/hadoop/etc/hadoop/slaves slave1 slave2 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

60 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Copy Configuration to the Slave Hosts Log into the first slave host ssh slave1 Copy over the Hadoop distribution and environment scripts 1. scp -r master:/opt/hadoop. (note the dot at the end) 2. scp master:/etc/profile.d/99-hadoop.sh. 3. sudo mv hadoop /opt/hadoop 4. sudo cp 99-hadoop.sh /etc/profile.d/ 5. rm 99-hadoop.sh 6. logout Repeat these steps for the other slave host: ssh slave2 Michael Völske Getting Started with Hadoop/YARN April 28, / 66

61 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Start Hadoop On the master node, format the HDFS... hdfs namenode -format... and start it: start-dfs.sh Then start YARN: start-yarn.sh Michael Völske Getting Started with Hadoop/YARN April 28, / 66

62 Part Two: Setting Up Your Own Virtual Cluster Installing and Configuring Hadoop Check That Everything Is Running On the master jps 5178 Jps 4646 NameNode 1339 SecondaryNameNode 4076 ResourceManager On the slaves jps 3900 DataNode 4161 Jps 3994 NodeManager Michael Völske Getting Started with Hadoop/YARN April 28, / 66

63 Part Three: Using The Cluster Testing, testing... Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

64 Part Three: Using The Cluster Testing, testing... Test the Web UI If everything went according to plan, you should be able to open the Hadoop Web UI in your browser. NodeManager ResourceManager Michael Völske Getting Started with Hadoop/YARN April 28, / 66

65 Part Three: Using The Cluster Testing, testing... Create User Home Directory in HDFS You can access HDFS, but you need to create a home directory for your user: hadoop fs -mkdir -p /user/hadoop-admin Browsing HDFS from the shell 2 List files Remove directory Remove file Copy from local FS to HDFS hadoop fs -ls NAME hadoop fs -rmdir NAME hadoop fs -rm NAME hadoop fs -put LOCAL REMOTE 2 [Hadoop FS Shell Documentation] Michael Völske Getting Started with Hadoop/YARN April 28, / 66

66 Part Three: Using The Cluster Our First Job Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

67 Part Three: Using The Cluster Our First Job Starting Our First Job Now that everything is set up, let s start one of the standard example MapReduce jobs: Running the Job cd /opt/hadoop/share/hadoop/mapreduce hadoop jar hadoop-mapreduce-examples-*.jar pi Output Number of Maps = 16 Samples per Map = Wrote input for Map #0... Job Finished in seconds Estimated value of Pi is Michael Völske Getting Started with Hadoop/YARN April 28, / 66

68 Part Three: Using The Cluster MapReduce Streaming API Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

69 Part Three: Using The Cluster MapReduce Streaming API Note: the hadoop-streaming jar file is in /opt/hadoop/share/hadoop/tools/lib/ Michael Völske Getting Started with Hadoop/YARN April 28, / 66 Streaming API: MapReduce with shell scripts You can write MapReduce programs with simple shell scripts, or arbitrary binaries you compile yourself A pipe in the shell grep the input sort wc -l > output

70 Part Three: Using The Cluster MapReduce Streaming API Note: the hadoop-streaming jar file is in /opt/hadoop/share/hadoop/tools/lib/ Michael Völske Getting Started with Hadoop/YARN April 28, / 66 Streaming API: MapReduce with shell scripts You can write MapReduce programs with simple shell scripts, or arbitrary binaries you compile yourself A pipe in the shell grep the input sort wc -l > output The corresponding Hadoop Streaming program hadoop jar /[...]/hadoop-streaming-*.jar \ -input myinputdirs \ -output myoutputdir \ -mapper "grep the" \ -reducer "wc -l"

71 Part Three: Using The Cluster MapReduce Streaming API Try it for Yourself Download a book from Project Gutenberg [ Find out how many lines contain the word "the" For example, with Shakespeare s complete works: wget hadoop fs -put pg100.txt shakespeare.txt hadoop jar /[...]/hadoop-streaming-*.jar \ -input shakespeare.txt -output the-shake \ -mapper "grep the" -reducer "wc -l" Michael Völske Getting Started with Hadoop/YARN April 28, / 66

72 Part Three: Using The Cluster MapReduce Java API Outline Part One: Hadoop, HDFS, and MapReduce Hadoop HDFS - Distributed File System MapReduce MapReduce on YARN Summary Part Two: Setting Up Your Own Virtual Cluster Starting And Connecting To The Virtual Machines Basic Linux Shell Commands Installing and Configuring Hadoop Part Three: Using The Cluster Testing, testing... Our First Job MapReduce Streaming API MapReduce Java API Michael Völske Getting Started with Hadoop/YARN April 28, / 66

73 Part Three: Using The Cluster MapReduce Java API MapReduce with Java: API Prefer the new API: org.apache.hadoop.mapreduce.* (not org.apache.hadoop.mapred.*) In general: Java program extends the Mapper and/or Reducer classes, and implements the setup(), map() and reduce() methods [Online Tutorial] with WordCount example. Michael Völske Getting Started with Hadoop/YARN April 28, / 66

74 Part Three: Using The Cluster MapReduce Java API WordCount Example We ll run the WordCount.java example from the Apache MapReduce tutorial. Prepare a working directory mkdir wordcount cd wordcount Download the code wget webis16/bigdata-seminar/wordcount.java Michael Völske Getting Started with Hadoop/YARN April 28, / 66

75 Part Three: Using The Cluster MapReduce Java API WordCount Example Compiling the code javac -cp $( hadoop classpath ) WordCount.java Compiler output ls WordCount$IntSumReducer.class WordCount$TokenizerMapper.class WordCount.class WordCount.java Michael Völske Getting Started with Hadoop/YARN April 28, / 66

76 Part Three: Using The Cluster MapReduce Java API WordCount Example Packing a jar file for Hadoop jar cf wc.jar WordCount*.class Running the code on Hadoop hadoop jar wc.jar WordCount shakespeare.txt shakes-count Michael Völske Getting Started with Hadoop/YARN April 28, / 66

77 Part Three: Using The Cluster MapReduce Java API WordCount Example Packing a jar file for Hadoop jar cf wc.jar WordCount*.class Running the code on Hadoop hadoop jar wc.jar WordCount shakespeare.txt shakes-count Looking at the result hadoop fs -cat shakes-count/\* head Michael Völske Getting Started with Hadoop/YARN April 28, / 66

78 Part Three: Using The Cluster MapReduce Java API That s it for today. Questions? Michael Völske Getting Started with Hadoop/YARN April 28, / 66

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g. Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You

More information

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016 Installing Hadoop 2.7.3 / Yarn, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes By: Nicholas Propes 2016 1 NOTES Please follow instructions PARTS in order because the results

More information

Hadoop Setup on OpenStack Windows Azure Guide

Hadoop Setup on OpenStack Windows Azure Guide CSCI4180 Tutorial- 2 Hadoop Setup on OpenStack Windows Azure Guide ZHANG, Mi mzhang@cse.cuhk.edu.hk Sep. 24, 2015 Outline Hadoop setup on OpenStack Ø Set up Hadoop cluster Ø Manage Hadoop cluster Ø WordCount

More information

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. SDJ INFOSOFT PVT. LTD Apache Hadoop 2.6.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.x Table of Contents Topic Software Requirements

More information

Installation of Hadoop on Ubuntu

Installation of Hadoop on Ubuntu Installation of Hadoop on Ubuntu Various software and settings are required for Hadoop. This section is mainly developed based on rsqrl.com tutorial. 1- Install Java Software Java Version* Openjdk version

More information

Hadoop Tutorial. General Instructions

Hadoop Tutorial. General Instructions CS246H: Mining Massive Datasets Hadoop Lab Winter 2018 Hadoop Tutorial General Instructions The purpose of this tutorial is to get you started with Hadoop. Completing the tutorial is optional. Here you

More information

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński Introduction into Big Data analytics Lecture 3 Hadoop ecosystem Janusz Szwabiński Outlook of today s talk Apache Hadoop Project Common use cases Getting started with Hadoop Single node cluster Further

More information

BIG DATA TRAINING PRESENTATION

BIG DATA TRAINING PRESENTATION BIG DATA TRAINING PRESENTATION TOPICS TO BE COVERED HADOOP YARN MAP REDUCE SPARK FLUME SQOOP OOZIE AMBARI TOPICS TO BE COVERED FALCON RANGER KNOX SENTRY MASTER IMAGE INSTALLATION 1 JAVA INSTALLATION: 1.

More information

UNIT II HADOOP FRAMEWORK

UNIT II HADOOP FRAMEWORK UNIT II HADOOP FRAMEWORK Hadoop Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models.

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog About the Tutorial HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools

More information

CMU MSP Intro to Hadoop

CMU MSP Intro to Hadoop CMU MSP 36602 Intro to Hadoop H. Seltman, April 3 and 5 2017 1) Carl had created an MSP virtual machine that you can download as an appliance for VirtualBox (also used for SAS University Edition). See

More information

Hadoop Lab 2 Exploring the Hadoop Environment

Hadoop Lab 2 Exploring the Hadoop Environment Programming for Big Data Hadoop Lab 2 Exploring the Hadoop Environment Video A short video guide for some of what is covered in this lab. Link for this video is on my module webpage 1 Open a Terminal window

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Part II (c) Desktop Installation. Net Serpents LLC, USA

Part II (c) Desktop Installation. Net Serpents LLC, USA Part II (c) Desktop ation Desktop ation ation Supported Platforms Required Software Releases &Mirror Sites Configure Format Start/ Stop Verify Supported Platforms ation GNU Linux supported for Development

More information

Compile and Run WordCount via Command Line

Compile and Run WordCount via Command Line Aims This exercise aims to get you to: Compile, run, and debug MapReduce tasks via Command Line Compile, run, and debug MapReduce tasks via Eclipse One Tip on Hadoop File System Shell Following are the

More information

Cloud Computing II. Exercises

Cloud Computing II. Exercises Cloud Computing II Exercises Exercise 1 Creating a Private Cloud Overview In this exercise, you will install and configure a private cloud using OpenStack. This will be accomplished using a singlenode

More information

Contents. Note: pay attention to where you are. Note: Plaintext version. Note: pay attention to where you are... 1 Note: Plaintext version...

Contents. Note: pay attention to where you are. Note: Plaintext version. Note: pay attention to where you are... 1 Note: Plaintext version... Contents Note: pay attention to where you are........................................... 1 Note: Plaintext version................................................... 1 Hello World of the Bash shell 2 Accessing

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Problem Set 0. General Instructions

Problem Set 0. General Instructions CS246: Mining Massive Datasets Winter 2014 Problem Set 0 Due 9:30am January 14, 2014 General Instructions This homework is to be completed individually (no collaboration is allowed). Also, you are not

More information

This brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop Distributed File System.

This brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop Distributed File System. About this tutorial Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed

More information

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute,

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute, Multi-Node Cluster Setup on Hadoop Tushar B. Kute, http://tusharkute.com What is Multi-node? Multi-node cluster Multinode Hadoop cluster as composed of Master- Slave Architecture to accomplishment of BigData

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor)

Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) 1 OUTLINE Objective What is Big data Characteristics of Big Data Setup Requirements Hadoop Setup Word Count

More information

Create Test Environment

Create Test Environment Create Test Environment Describes how to set up the Trafodion test environment used by developers and testers Prerequisites Python Passwordless ssh If you already have an existing set of ssh keys If you

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam.

Vendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam. Vendor: Cloudera Exam Code: CCA-505 Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam Version: Demo QUESTION 1 You have installed a cluster running HDFS and MapReduce

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

3 Hadoop Installation: Pseudo-distributed mode

3 Hadoop Installation: Pseudo-distributed mode Laboratory 3 Hadoop Installation: Pseudo-distributed mode Obecjective Hadoop can be run in 3 different modes. Different modes of Hadoop are 1. Standalone Mode Default mode of Hadoop HDFS is not utilized

More information

Tutorial for Assignment 2.0

Tutorial for Assignment 2.0 Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2011 Slides based on last years tutorial by Florian Klien and Chris Körner 1 IMPORTANT The presented information has been tested on the

More information

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn).

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 1 Hadoop Primer Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 2 Passwordless SSH Before setting up Hadoop, setup passwordless

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

COMS 6100 Class Notes 3

COMS 6100 Class Notes 3 COMS 6100 Class Notes 3 Daniel Solus September 1, 2016 1 General Remarks The class was split into two main sections. We finished our introduction to Linux commands by reviewing Linux commands I and II

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

AMS 200: Working on Linux/Unix Machines

AMS 200: Working on Linux/Unix Machines AMS 200, Oct 20, 2014 AMS 200: Working on Linux/Unix Machines Profs. Nic Brummell (brummell@soe.ucsc.edu) & Dongwook Lee (dlee79@ucsc.edu) Department of Applied Mathematics and Statistics University of

More information

GNU/Linux 101. Casey McLaughlin. Research Computing Center Spring Workshop Series 2018

GNU/Linux 101. Casey McLaughlin. Research Computing Center Spring Workshop Series 2018 GNU/Linux 101 Casey McLaughlin Research Computing Center Spring Workshop Series 2018 rccworkshop IC;3df4mu bash-2.1~# man workshop Linux101 RCC Workshop L101 OBJECTIVES - Operating system concepts - Linux

More information

Unix/Linux Basics. Cpt S 223, Fall 2007 Copyright: Washington State University

Unix/Linux Basics. Cpt S 223, Fall 2007 Copyright: Washington State University Unix/Linux Basics 1 Some basics to remember Everything is case sensitive Eg., you can have two different files of the same name but different case in the same folder Console-driven (same as terminal )

More information

Getting Started with Spark

Getting Started with Spark Getting Started with Spark Shadi Ibrahim March 30th, 2017 MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Welcome to getting started with Ubuntu Server. This System Administrator Manual. guide to be simple to follow, with step by step instructions

Welcome to getting started with Ubuntu Server. This System Administrator Manual. guide to be simple to follow, with step by step instructions Welcome to getting started with Ubuntu 12.04 Server. This System Administrator Manual guide to be simple to follow, with step by step instructions with screenshots INDEX 1.Installation of Ubuntu 12.04

More information

Expert Lecture plan proposal Hadoop& itsapplication

Expert Lecture plan proposal Hadoop& itsapplication Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile

More information

Unix/Linux Operating System. Introduction to Computational Statistics STAT 598G, Fall 2011

Unix/Linux Operating System. Introduction to Computational Statistics STAT 598G, Fall 2011 Unix/Linux Operating System Introduction to Computational Statistics STAT 598G, Fall 2011 Sergey Kirshner Department of Statistics, Purdue University September 7, 2011 Sergey Kirshner (Purdue University)

More information

Introduction to the Hadoop Ecosystem - 1

Introduction to the Hadoop Ecosystem - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the

More information

Computer Systems and Architecture

Computer Systems and Architecture Computer Systems and Architecture Stephen Pauwels Computer Systems Academic Year 2018-2019 Overview of the Semester UNIX Introductie Regular Expressions Scripting Data Representation Integers, Fixed point,

More information

Intro to Linux. this will open up a new terminal window for you is super convenient on the computers in the lab

Intro to Linux. this will open up a new terminal window for you is super convenient on the computers in the lab Basic Terminal Intro to Linux ssh short for s ecure sh ell usage: ssh [host]@[computer].[otheripstuff] for lab computers: ssh [CSID]@[comp].cs.utexas.edu can get a list of active computers from the UTCS

More information

Running Kmeans Spark on EC2 Documentation

Running Kmeans Spark on EC2 Documentation Running Kmeans Spark on EC2 Documentation Pseudo code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step1: Read D from HDFS as RDD Step 2: Initialize first k data

More information

Introduction to Linux Environment. Yun-Wen Chen

Introduction to Linux Environment. Yun-Wen Chen Introduction to Linux Environment Yun-Wen Chen 1 The Text (Command) Mode in Linux Environment 2 The Main Operating Systems We May Meet 1. Windows 2. Mac 3. Linux (Unix) 3 Windows Command Mode and DOS Type

More information

Hadoop Setup Walkthrough

Hadoop Setup Walkthrough Hadoop 2.7.3 Setup Walkthrough This document provides information about working with Hadoop 2.7.3. 1 Setting Up Configuration Files... 2 2 Setting Up The Environment... 2 3 Additional Notes... 3 4 Selecting

More information

Useful Unix Commands Cheat Sheet

Useful Unix Commands Cheat Sheet Useful Unix Commands Cheat Sheet The Chinese University of Hong Kong SIGSC Training (Fall 2016) FILE AND DIRECTORY pwd Return path to current directory. ls List directories and files here. ls dir List

More information

Actual4Dumps. Provide you with the latest actual exam dumps, and help you succeed

Actual4Dumps.   Provide you with the latest actual exam dumps, and help you succeed Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

Introduction to remote command line Linux. Research Computing Team University of Birmingham

Introduction to remote command line Linux. Research Computing Team University of Birmingham Introduction to remote command line Linux Research Computing Team University of Birmingham Linux/UNIX/BSD/OSX/what? v All different v UNIX is the oldest, mostly now commercial only in large environments

More information

Parallel Programming Pre-Assignment. Setting up the Software Environment

Parallel Programming Pre-Assignment. Setting up the Software Environment Parallel Programming Pre-Assignment Setting up the Software Environment Authors: B. Wilkinson and C. Ferner. Modification date: Aug 21, 2014 (Minor correction Aug 27, 2014.) Software The purpose of this

More information

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : About Quality Thought We are

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Hands-on Exercise Hadoop

Hands-on Exercise Hadoop Department of Economics and Business Administration Chair of Business Information Systems I Prof. Dr. Barbara Dinter Big Data Management Hands-on Exercise Hadoop Building and Testing a Hadoop Cluster by

More information

Hadoop Lab 3 Creating your first Map-Reduce Process

Hadoop Lab 3 Creating your first Map-Reduce Process Programming for Big Data Hadoop Lab 3 Creating your first Map-Reduce Process Lab work Take the map-reduce code from these notes and get it running on your Hadoop VM Driver Code Mapper Code Reducer Code

More information

Exam Questions CCA-505

Exam Questions CCA-505 Exam Questions CCA-505 Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam https://www.2passeasy.com/dumps/cca-505/ 1.You want to understand more about how users browse you public

More information

Introduction to UNIX Command Line

Introduction to UNIX Command Line Introduction to UNIX Command Line Files and directories Some useful commands (echo, cat, grep, find, diff, tar) Redirection Pipes Variables Background processes Remote connections (e.g. ssh, curl) Scripts

More information

Linux Essentials Objectives Topics:

Linux Essentials Objectives Topics: Linux Essentials Linux Essentials is a professional development certificate program that covers basic knowledge for those working and studying Open Source and various distributions of Linux. Exam Objectives

More information

CSE Linux VM. For Microsoft Windows. Based on opensuse Leap 42.2

CSE Linux VM. For Microsoft Windows. Based on opensuse Leap 42.2 CSE Linux VM For Microsoft Windows Based on opensuse Leap 42.2 Dr. K. M. Flurchick February 2, 2017 Contents 1 Introduction 1 2 Requirements 1 3 Procedure 1 4 Usage 3 4.1 Start/Stop.................................................

More information

CS197U: A Hands on Introduction to Unix

CS197U: A Hands on Introduction to Unix CS197U: A Hands on Introduction to Unix Lecture 3: UNIX Operating System Organization Tian Guo CICS, Umass Amherst 1 Reminders Assignment 2 is due THURSDAY 09/24 at 3:45 pm Directions are on the website

More information

Big Data for Engineers Spring Resource Management

Big Data for Engineers Spring Resource Management Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem Sep 27, 2017 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE INSTALLATION....................................

More information

Computer Systems and Architecture

Computer Systems and Architecture Computer Systems and Architecture Introduction to UNIX Stephen Pauwels University of Antwerp October 2, 2015 Outline What is Unix? Getting started Streams Exercises UNIX Operating system Servers, desktops,

More information

CS November 2017

CS November 2017 Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Setting Up AWS/EMR Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 November 21,

More information

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017 Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 Setting Up AWS/EMR November 21,

More information

More Raspian. An editor Configuration files Shell scripts Shell variables System admin

More Raspian. An editor Configuration files Shell scripts Shell variables System admin More Raspian An editor Configuration files Shell scripts Shell variables System admin Nano, a simple editor Nano does not require the mouse. You must use your keyboard to move around the file and make

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

CENG 334 Computer Networks. Laboratory I Linux Tutorial

CENG 334 Computer Networks. Laboratory I Linux Tutorial CENG 334 Computer Networks Laboratory I Linux Tutorial Contents 1. Logging In and Starting Session 2. Using Commands 1. Basic Commands 2. Working With Files and Directories 3. Permission Bits 3. Introduction

More information

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:

Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering

More information

Network Monitoring & Management. A few Linux basics

Network Monitoring & Management. A few Linux basics Network Monitoring & Management A few Linux basics Our chosen platform Ubuntu Linux 14.04.3 LTS 64-bit LTS = Long Term Support no GUI, we administer using ssh Ubuntu is Debian underneath There are other

More information

Developer s Manual. Version May, Computer Science Department, Texas Christian University

Developer s Manual. Version May, Computer Science Department, Texas Christian University Developer s Manual Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that

More information

Unix basics exercise MBV-INFX410

Unix basics exercise MBV-INFX410 Unix basics exercise MBV-INFX410 In order to start this exercise, you need to be logged in on a UNIX computer with a terminal window open on your computer. It is best if you are logged in on freebee.abel.uio.no.

More information

Chapter-3. Introduction to Unix: Fundamental Commands

Chapter-3. Introduction to Unix: Fundamental Commands Chapter-3 Introduction to Unix: Fundamental Commands What You Will Learn The fundamental commands of the Unix operating system. Everything told for Unix here is applicable to the Linux operating system

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

About 1. Chapter 1: Getting started with oozie 2. Remarks 2. Versions 2. Examples 2. Installation or Setup 2. Chapter 2: Oozie

About 1. Chapter 1: Getting started with oozie 2. Remarks 2. Versions 2. Examples 2. Installation or Setup 2. Chapter 2: Oozie oozie #oozie Table of Contents About 1 Chapter 1: Getting started with oozie 2 Remarks 2 Versions 2 Examples 2 Installation or Setup 2 Chapter 2: Oozie 101 7 Examples 7 Oozie Architecture 7 Oozie Application

More information

Lab Working with Linux Command Line

Lab Working with Linux Command Line Introduction In this lab, you will use the Linux command line to manage files and folders and perform some basic administrative tasks. Recommended Equipment A computer with a Linux OS, either installed

More information

CS 460 Linux Tutorial

CS 460 Linux Tutorial CS 460 Linux Tutorial http://ryanstutorials.net/linuxtutorial/cheatsheet.php # Change directory to your home directory. # Remember, ~ means your home directory cd ~ # Check to see your current working

More information

Inria, Rennes Bretagne Atlantique Research Center

Inria, Rennes Bretagne Atlantique Research Center Hadoop TP 1 Shadi Ibrahim Inria, Rennes Bretagne Atlantique Research Center Getting started with Hadoop Prerequisites Basic Configuration Starting Hadoop Verifying cluster operation Hadoop INRIA S.IBRAHIM

More information

Linux Operating System Environment Computadors Grau en Ciència i Enginyeria de Dades Q2

Linux Operating System Environment Computadors Grau en Ciència i Enginyeria de Dades Q2 Linux Operating System Environment Computadors Grau en Ciència i Enginyeria de Dades 2017-2018 Q2 Facultat d Informàtica de Barcelona This first lab session is focused on getting experience in working

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

Introduction to Unix The Windows User perspective. Wes Frisby Kyle Horne Todd Johansen

Introduction to Unix The Windows User perspective. Wes Frisby Kyle Horne Todd Johansen Introduction to Unix The Windows User perspective Wes Frisby Kyle Horne Todd Johansen What is Unix? Portable, multi-tasking, and multi-user operating system Software development environment Hardware independent

More information

The Unix Shell. Pipes and Filters

The Unix Shell. Pipes and Filters The Unix Shell Copyright Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. shell shell pwd

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH) Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE

More information

Friends Workshop. Christopher M. Judd

Friends Workshop. Christopher M. Judd S Friends Workshop Christopher M. Judd Christopher M. Judd CTO and Partner at leader Columbus Developer User Group (CIDUG) Introduction http://hadoop.apache.org/ HCatalog Tez Zebra Owl Beyond MapReduce

More information

Hadoop Cluster Implementation

Hadoop Cluster Implementation Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project

More information

MapReduce & YARN Hands-on Lab Exercise 1 Simple MapReduce program in Java

MapReduce & YARN Hands-on Lab Exercise 1 Simple MapReduce program in Java MapReduce & YARN Hands-on Lab Exercise 1 Simple MapReduce program in Java Contents Page 1 Copyright IBM Corporation, 2015 US Government Users Restricted Rights - Use, duplication or disclosure restricted

More information

An Introduction to Cluster Computing Using Newton

An Introduction to Cluster Computing Using Newton An Introduction to Cluster Computing Using Newton Jason Harris and Dylan Storey March 25th, 2014 Jason Harris and Dylan Storey Introduction to Cluster Computing March 25th, 2014 1 / 26 Workshop design.

More information

Unix Essentials. BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th

Unix Essentials. BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th Unix Essentials BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th 2016 http://barc.wi.mit.edu/hot_topics/ 1 Outline Unix overview Logging in to tak Directory structure

More information

LINUX VPS GUIDE. Pre-requisites: (this guide assumes you are using windows)

LINUX VPS GUIDE. Pre-requisites: (this guide assumes you are using windows) LINUX VPS GUIDE Pre-requisites: (this guide assumes you are using windows) Philscurrency Wallet Download PHILS wallet if you don t have already from the link below https://github.com/philscurrency/philscurrency/releases/download/v1.2/phils

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Spark Programming at Comet. UCSB CS240A Tao Yang

Spark Programming at Comet. UCSB CS240A Tao Yang Spark Programming at Comet UCSB CS240A 2016. Tao Yang Comet Cluster Comet cluster has 1944 nodes and each node has 24 cores, built on two 12-core Intel Xeon E5-2680v3 2.5 GHz processors 128 GB memory and

More information