Getting Started with Hadoop/YARN
1 Getting Started with Hadoop/YARN
Michael Völske, michael.voelske@uni-weimar.de
April 28
2 Outline
Part One: Hadoop, HDFS, and MapReduce: Hadoop; HDFS - Distributed File System; MapReduce; MapReduce on YARN; Summary
Part Two: Setting Up Your Own Virtual Cluster: Starting And Connecting To The Virtual Machines; Basic Linux Shell Commands; Installing and Configuring Hadoop
Part Three: Using The Cluster: Testing, testing...; Our First Job; MapReduce Streaming API; MapReduce Java API
3 Part One: Hadoop, HDFS, and MapReduce: Hadoop
4 What is Hadoop
- Started in 2004 by Yahoo
- Open-source implementation of Google MapReduce, the Google File System, and Google BigTable
- Apache Software Foundation top-level project
- Written in Java
5 Key Concepts
Scale out, not up!
- many nodes, 100PB+ data
- cheap commodity hardware instead of supercomputers
- fault tolerance, redundancy
Bring the program to the data
- storage and data processing on the same node
- local processing (the network is the bottleneck)
Work sequentially instead of with random access
- optimized for large datasets
Hide system-level details
- the user doesn't need to know what code runs on which machine
6 Hadoop 1 vs. Hadoop 2 (figure)
7 Part One: Hadoop, HDFS, and MapReduce: HDFS - Distributed File System
8 HDFS Overview
- Designed for storing large files
- Files are split into blocks
- Integrity: blocks are checksummed
- Redundancy: each block is stored on multiple machines
- Optimized for sequentially reading whole blocks
Daemon processes:
- NameNode: central registry of block locations
- DataNode: block storage on each node
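The splitting and checksumming idea can be sketched in plain Java. This is a toy illustration, not the HDFS implementation: the 8-byte block size and a single CRC32 per block stand in for HDFS's 128 MB default blocks and per-chunk checksums, and all names are our own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

public class BlockSplitDemo {
    // Tiny block size for demonstration; HDFS blocks are 128 MB by default.
    static final int BLOCK_SIZE = 8;

    // Split a byte stream into fixed-size blocks (the last one may be shorter).
    static List<byte[]> split(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE)
            blocks.add(Arrays.copyOfRange(data, off,
                    Math.min(off + BLOCK_SIZE, data.length)));
        return blocks;
    }

    // Checksum one block; a reader can recompute this to detect corruption.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = "a small file stored in HDFS".getBytes();
        for (byte[] block : split(file))
            System.out.printf("block len=%d crc=%d%n", block.length, checksum(block));
    }
}
```

In real HDFS the NameNode only records which DataNodes hold each block; the block bytes and their checksums live on the DataNodes themselves.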
9 HDFS: Reading a file (figure)
10 Part One: Hadoop, HDFS, and MapReduce: MapReduce
11 Motivation
Problem: collecting data is easy and cheap; evaluating data is difficult.
Solution: divide and conquer; parallel processing.
12 MapReduce Steps
Map step: each worker applies the map() function to the local data and writes the output to temporary storage. Each output record gets a key.
Shuffle step: worker nodes redistribute data based on the output keys: all records with the same key go to the same worker node.
Reduce step: workers apply the reduce() function to each group, per key, in parallel.
The user specifies the map() and reduce() functions.
13 Word Count Example
Input: Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go
After Map() (applied in parallel): Mary 1, had 1, a 1, little 1, lamb 1, its 1, fleece 1, was 1, white 1, as 1, snow 1, and 1, everywhere 1, that 1, Mary 1, went 1, the 1, lamb 1, was 1, sure 1, to 1, go 1
After Reduce() (applied in parallel, per key): a 1, as 1, lamb 2, little 1, ..., Mary 2, was 2, went 1, ...
16 Key-Value Pairs
Map step: Map(k1, v1) -> list(k2, v2)
Sorting and shuffling: all pairs with the same key are grouped together; one group per key.
Reduce step: Reduce(k2, list(v2)) -> list(v3)
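To make the key-value flow concrete, the three steps can be sketched as a standalone plain-Java simulation of word count. This is not the Hadoop API; the class and method names are our own, and the "shuffle" here is just an in-memory grouping by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSim {

    // Map step: Map(k1, v1) -> list(k2, v2); here, one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle step: group all values by key; TreeMap sorts the keys,
    // mirroring the sort that happens before the reduce phase.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce step: Reduce(k2, list(v2)) -> list(v3); here, sum each group.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Mary had a little lamb its fleece was white as snow",
            "and everywhere that Mary went the lamb was sure to go"
        };
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)          // each "worker" maps its own split
            mapped.addAll(map(line));
        shuffle(mapped).forEach((word, values) ->
            System.out.println(word + "\t" + reduce(values)));
    }
}
```

On a real cluster the mapped pairs never sit in one list: each Mapper writes its output locally, and the shuffle moves each key's pairs over the network to the Reducer responsible for that key.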
17 MapReduce on Hadoop (figure)
18 Part One: Hadoop, HDFS, and MapReduce: MapReduce on YARN
19 YARN Basic Concepts
YARN processes:
- ResourceManager: single instance, runs on the cluster's master node
- NodeManager: runs on each cluster node; executes computations and provides containers to applications
MapReduce job:
- ApplicationMaster: controls execution on the cluster (one for each YARN application)
- Mapper: processes input data
- Reducer: processes (sorted) Mapper output
Each of these runs in a YARN container.
20 YARN+MapReduce
Basic process:
1. The client application requests a container for the ApplicationMaster
2. The ApplicationMaster runs on the cluster and requests further containers for Mappers and Reducers
3. Mappers execute the user-provided map() function on their part of the input data
4. The shuffle phase is started to distribute map output to the Reducers
5. Reducers execute the user-provided reduce() function on their group of map output
6. The final result is stored in HDFS
See also: [Anatomy of a MapReduce Job]
21 Part One: Hadoop, HDFS, and MapReduce: Summary
22 YARN and HDFS Components (figure)
26 End of Part One. Questions?
27 Part Two: Setting Up Your Own Virtual Cluster: Starting And Connecting To The Virtual Machines
28 Virtual Toy Cluster (figure)
The host machine runs VirtualBox with 3 virtual machines (master, slave1, slave2) on a shared network. From the host, an SSH client connects to the VMs (ssh localhost -p 2222), a browser shows the Hadoop NameNode status, and a terminal runs the Hadoop client ([master] $ hadoop jar ...).
29 Betaweb Cluster (figure)
The Betaweb cluster consists of 130 machines on a routed network. From the host, an SSH client connects directly (ssh betaweb020), a browser shows the Hadoop NameNode status, and a terminal runs the Hadoop client ([betaweb] $ hadoop jar ...).
30 Starting the VM for the first time
- Make sure you have the appliance BigData.ova on your machine.
- Start up VirtualBox.
- "File" > "Import Appliance..."
- Make sure the VM starts up from the VirtualBox interface.
32 Setting Up The Virtual Network
- "File" > "Preferences" > "Network"
- Create a "Host-only Network"
- Remember the name of the new network (on Linux it's something like "vboxnetX")
- Change its IPv4 Address and Network Mask settings
33 Verifying the VM Network Settings
Right-click each of the three VMs and go to "Settings" > "Network". The four network adapters should be configured like this:
- Adapter 1: enabled, attached to "NAT"
- Adapter 2: enabled, attached to "Host-only Adapter"; the "Name" field must be the network you created in the previous step
- Adapters 3 and 4: disabled
34 Set Up Host Names for the Virtual Machines I
- The hosts file maps hostnames to IP addresses (without a DNS server)
- We will use it to connect to our virtual cluster machines using their host names
- We will add the following hosts entries:
master.cluster master
slave1.cluster slave1
slave2.cluster slave2
35 Set Up Host Names for the Virtual Machines II
Linux
- Open a terminal and run: sudo nano /etc/hosts
- Paste in the three lines at the end
- Save and close (Ctrl-X, then y, then Enter)
OSX (10.6 or newer)
- Run: sudo nano /private/etc/hosts
- Paste the lines, save and close
- Run: dscacheutil -flushcache
36 Set Up Host Names for the Virtual Machines III
OSX (before 10.6)
1. Open /Applications/Utilities/NetInfo Manager
2. Select "Machines"
3. Select "localhost", then "Edit" > "Duplicate"
4. Change the "ip_address" property to the machine's IP address and the "name" property to "master"; delete the "serves" property and save
5. Repeat the previous step for the other two entries on the list
37 Set Up Host Names for the Virtual Machines IV
Windows (XP, 7, 8 and 10)
- Find "Notepad" in the start menu/launcher, right-click and "Run as administrator"
- With Notepad, open the file C:\Windows\System32\Drivers\etc\hosts
- Paste the three lines at the end, save and exit
- If the file is marked as read-only, try saving it under a different name and copying it over the current hosts file
38 Install an SSH Client on your Laptop
Linux and OSX
- You should already have one. To check, open a terminal and run: ssh -V
Windows
- Download PuTTY and run the installer
- Find "PuTTY" in your start menu
39 Start all VMs Without Graphical Output
Option 1: select "Start" > "Headless Start" from the UI
Option 2: open a terminal and run:
VBoxHeadless -s "BigData Hadoop VM (master)"
VBoxHeadless -s "BigData Hadoop VM (slave1)"
VBoxHeadless -s "BigData Hadoop VM (slave2)"
40 Test Routing from your Laptop to the VMs
- Open a command prompt (terminal) on your laptop
- If you can't find this on Windows, try typing "cmd" into the search box in the Start menu
- Type "ping master"; the output should look like this:
PING master.cluster ( ) 56(84) bytes of data.
64 bytes from master.cluster ( ): ...
41 (If That Didn't Work)
- VirtualBox should update the routing table automatically, but sometimes that fails
- If you're on Linux and the above ping didn't work, try the commands below
- I don't know how to do this on Windows/OSX :-(
Linux
sudo ip link set dev vboxnet1 up
sudo ip addr replace dev vboxnet1
sudo ip route add dev vboxnet1 to /24
42 Connect via SSH
Linux and OSX
ssh -p 2222   ## for master
ssh -p 2223   ## for slave1
ssh -p 2224   ## for slave2
Windows
- Open PuTTY and type in the host name (e.g. "master") and port (after the "-p" above)
- When asked: "Login as: hadoop-admin"
- The password is always "hadoop-admin"
43 Part Two: Setting Up Your Own Virtual Cluster: Basic Linux Shell Commands
44 Connect to the "master" Machine
ssh -p 2222
45 Basic Shell Commands
What                                   How
See what's in the current directory    ls or ls -l
Change directory                       cd NAME
Show current directory                 pwd
Create a directory                     mkdir NAME
Delete a file                          rm NAME
Delete an empty directory              rmdir NAME
Edit a text file                       nano NAME
Make a file executable                 chmod +x NAME
Download a file                        wget URL
Run a command as root                  sudo COMMAND
Disconnect                             exit or logout
46 Where To Learn More
There are a lot of great resources online for learning to use the command line more effectively. These are a few good places to start:
- [github.com/aleksandar-todorovic/awesome-linux#learning-resources]
- [github.com/alebcay/awesome-shell#guides]
- William Shotts, The Linux Command Line; [linuxcommand.org/tlcl.php] (free PDF book)
47 Part Two: Setting Up Your Own Virtual Cluster: Installing and Configuring Hadoop
48 Download and Unpack Hadoop
Download Hadoop:
wget ... (if that doesn't work, use webis16.medien.uni-weimar.de)
Unpack it:
tar xf hadoop-*.tar.gz
Move it to /opt:
sudo mv hadoop /opt/hadoop
51 Set up the Shell Environment
Create the file /etc/profile.d/99-hadoop.sh with contents:
export HADOOP_PREFIX="/opt/hadoop"
export PATH="$PATH:/opt/hadoop/bin:/opt/hadoop/sbin"
Make it executable and source it:
sudo chmod +x /etc/profile.d/99-hadoop.sh
source /etc/profile
Edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh (line 25):
export JAVA_HOME="/usr/lib/jvm/java-7-oracle"
54 Test the Hadoop Binary
hadoop
Usage: hadoop [--config confdir] COMMAND | CLASSNAME
  CLASSNAME      run the class named CLASSNAME
 or where COMMAND is one of:
  fs             run a generic filesystem user client
  version        print the version
  jar <jar>      run a jar file
...
55 Configure HDFS I
/opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
56 Configure HDFS II
/opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop-admin/dfs/dn</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop-admin/dfs/nn</value>
  </property>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop-admin</value>
  </property>
</configuration>
57 Configure MapReduce
This file doesn't exist yet; we will create it.
/opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
58 Configure YARN
/opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
59 Configure Slave Host Names
Remove the "localhost" entry that is already in the file.
/opt/hadoop/etc/hadoop/slaves
slave1
slave2
60 Copy Configuration to the Slave Hosts
Log into the first slave host:
ssh slave1
Copy over the Hadoop distribution and environment scripts:
1. scp -r master:/opt/hadoop . (note the dot at the end)
2. scp master:/etc/profile.d/99-hadoop.sh .
3. sudo mv hadoop /opt/hadoop
4. sudo cp 99-hadoop.sh /etc/profile.d/
5. rm 99-hadoop.sh
6. logout
Repeat these steps for the other slave host: ssh slave2
61 Start Hadoop
On the master node, format the HDFS...
hdfs namenode -format
...and start it:
start-dfs.sh
Then start YARN:
start-yarn.sh
62 Check That Everything Is Running
On the master:
jps
5178 Jps
4646 NameNode
1339 SecondaryNameNode
4076 ResourceManager
On the slaves:
jps
3900 DataNode
4161 Jps
3994 NodeManager
63 Part Three: Using The Cluster: Testing, testing...
64 Test the Web UI
If everything went according to plan, you should be able to open the Hadoop Web UI in your browser: the NodeManager and ResourceManager status pages (figure).
65 Create User Home Directory in HDFS
You can access HDFS, but you need to create a home directory for your user:
hadoop fs -mkdir -p /user/hadoop-admin
Browsing HDFS from the shell (see the [Hadoop FS Shell Documentation]):
List files                    hadoop fs -ls NAME
Remove directory              hadoop fs -rmdir NAME
Remove file                   hadoop fs -rm NAME
Copy from local FS to HDFS    hadoop fs -put LOCAL REMOTE
66 Part Three: Using The Cluster: Our First Job
67 Starting Our First Job
Now that everything is set up, let's start one of the standard example MapReduce jobs.
Running the job:
cd /opt/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-*.jar pi
Output:
Number of Maps = 16
Samples per Map =
Wrote input for Map #0
...
Job Finished in seconds
Estimated value of Pi is
68 Part Three: Using The Cluster: MapReduce Streaming API
69 Streaming API: MapReduce with shell scripts
You can write MapReduce programs with simple shell scripts, or arbitrary binaries you compile yourself.
A pipe in the shell:
grep the input | sort | wc -l > output
The corresponding Hadoop Streaming program:
hadoop jar /[...]/hadoop-streaming-*.jar \
  -input myinputdirs \
  -output myoutputdir \
  -mapper "grep the" \
  -reducer "wc -l"
Note: the hadoop-streaming jar file is in /opt/hadoop/share/hadoop/tools/lib/
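Because a streaming mapper is just a program that reads records from stdin and writes records to stdout, the "grep the" mapper above could equally be a small compiled Java program. The class below is an illustrative stand-in of our own, not part of Hadoop:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class GrepTheMapper {
    // A line "matches" if it contains the substring "the", like `grep the`.
    static boolean matches(String line) {
        return line.contains("the");
    }

    // Read lines from stdin and forward matching ones to stdout; Hadoop
    // Streaming wires stdin/stdout to the job's input splits and map output.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null)
            if (matches(line)) System.out.println(line);
    }
}
```

Shipped to the cluster (for instance via the streaming jar's -file option), such a binary could replace -mapper "grep the" in the command above.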
71 Try it for Yourself
Download a book from Project Gutenberg and find out how many lines contain the word "the". For example, with Shakespeare's complete works:
wget
hadoop fs -put pg100.txt shakespeare.txt
hadoop jar /[...]/hadoop-streaming-*.jar \
  -input shakespeare.txt -output the-shake \
  -mapper "grep the" -reducer "wc -l"
72 Part Three: Using The Cluster: MapReduce Java API
73 MapReduce with Java: API
- Prefer the new API: org.apache.hadoop.mapreduce.* (not org.apache.hadoop.mapred.*)
- In general, a Java program extends the Mapper and/or Reducer classes and implements the setup(), map() and reduce() methods
- [Online Tutorial] with WordCount example
74 WordCount Example
We'll run the WordCount.java example from the Apache MapReduce tutorial.
Prepare a working directory:
mkdir wordcount
cd wordcount
Download the code:
wget webis16/bigdata-seminar/wordcount.java
75 WordCount Example
Compiling the code:
javac -cp $(hadoop classpath) WordCount.java
Compiler output:
ls
WordCount$IntSumReducer.class WordCount$TokenizerMapper.class
WordCount.class WordCount.java
76 WordCount Example
Packing a jar file for Hadoop:
jar cf wc.jar WordCount*.class
Running the code on Hadoop:
hadoop jar wc.jar WordCount shakespeare.txt shakes-count
Looking at the result:
hadoop fs -cat shakes-count/\* | head
78 That's it for today. Questions?
More informationCompile and Run WordCount via Command Line
Aims This exercise aims to get you to: Compile, run, and debug MapReduce tasks via Command Line Compile, run, and debug MapReduce tasks via Eclipse One Tip on Hadoop File System Shell Following are the
More informationCloud Computing II. Exercises
Cloud Computing II Exercises Exercise 1 Creating a Private Cloud Overview In this exercise, you will install and configure a private cloud using OpenStack. This will be accomplished using a singlenode
More informationContents. Note: pay attention to where you are. Note: Plaintext version. Note: pay attention to where you are... 1 Note: Plaintext version...
Contents Note: pay attention to where you are........................................... 1 Note: Plaintext version................................................... 1 Hello World of the Bash shell 2 Accessing
More informationHadoop Quickstart. Table of contents
Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone
More informationProblem Set 0. General Instructions
CS246: Mining Massive Datasets Winter 2014 Problem Set 0 Due 9:30am January 14, 2014 General Instructions This homework is to be completed individually (no collaboration is allowed). Also, you are not
More informationThis brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop Distributed File System.
About this tutorial Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed
More informationMulti-Node Cluster Setup on Hadoop. Tushar B. Kute,
Multi-Node Cluster Setup on Hadoop Tushar B. Kute, http://tusharkute.com What is Multi-node? Multi-node cluster Multinode Hadoop cluster as composed of Master- Slave Architecture to accomplishment of BigData
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationBig Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor)
Big Data Retrieving Required Information From Text Files Desmond Hill Yenumula B Reddy (Advisor) 1 OUTLINE Objective What is Big data Characteristics of Big Data Setup Requirements Hadoop Setup Word Count
More informationCreate Test Environment
Create Test Environment Describes how to set up the Trafodion test environment used by developers and testers Prerequisites Python Passwordless ssh If you already have an existing set of ssh keys If you
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationVendor: Cloudera. Exam Code: CCA-505. Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam.
Vendor: Cloudera Exam Code: CCA-505 Exam Name: Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam Version: Demo QUESTION 1 You have installed a cluster running HDFS and MapReduce
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationSTATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns
STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big
More informationA brief history on Hadoop
Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)
More information3 Hadoop Installation: Pseudo-distributed mode
Laboratory 3 Hadoop Installation: Pseudo-distributed mode Obecjective Hadoop can be run in 3 different modes. Different modes of Hadoop are 1. Standalone Mode Default mode of Hadoop HDFS is not utilized
More informationTutorial for Assignment 2.0
Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2011 Slides based on last years tutorial by Florian Klien and Chris Körner 1 IMPORTANT The presented information has been tested on the
More informationHadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn).
1 Hadoop Primer Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 2 Passwordless SSH Before setting up Hadoop, setup passwordless
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationCOMS 6100 Class Notes 3
COMS 6100 Class Notes 3 Daniel Solus September 1, 2016 1 General Remarks The class was split into two main sections. We finished our introduction to Linux commands by reviewing Linux commands I and II
More informationTop 25 Big Data Interview Questions And Answers
Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent
More informationAMS 200: Working on Linux/Unix Machines
AMS 200, Oct 20, 2014 AMS 200: Working on Linux/Unix Machines Profs. Nic Brummell (brummell@soe.ucsc.edu) & Dongwook Lee (dlee79@ucsc.edu) Department of Applied Mathematics and Statistics University of
More informationGNU/Linux 101. Casey McLaughlin. Research Computing Center Spring Workshop Series 2018
GNU/Linux 101 Casey McLaughlin Research Computing Center Spring Workshop Series 2018 rccworkshop IC;3df4mu bash-2.1~# man workshop Linux101 RCC Workshop L101 OBJECTIVES - Operating system concepts - Linux
More informationUnix/Linux Basics. Cpt S 223, Fall 2007 Copyright: Washington State University
Unix/Linux Basics 1 Some basics to remember Everything is case sensitive Eg., you can have two different files of the same name but different case in the same folder Console-driven (same as terminal )
More informationGetting Started with Spark
Getting Started with Spark Shadi Ibrahim March 30th, 2017 MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More informationWelcome to getting started with Ubuntu Server. This System Administrator Manual. guide to be simple to follow, with step by step instructions
Welcome to getting started with Ubuntu 12.04 Server. This System Administrator Manual guide to be simple to follow, with step by step instructions with screenshots INDEX 1.Installation of Ubuntu 12.04
More informationExpert Lecture plan proposal Hadoop& itsapplication
Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile
More informationUnix/Linux Operating System. Introduction to Computational Statistics STAT 598G, Fall 2011
Unix/Linux Operating System Introduction to Computational Statistics STAT 598G, Fall 2011 Sergey Kirshner Department of Statistics, Purdue University September 7, 2011 Sergey Kirshner (Purdue University)
More informationIntroduction to the Hadoop Ecosystem - 1
Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the
More informationComputer Systems and Architecture
Computer Systems and Architecture Stephen Pauwels Computer Systems Academic Year 2018-2019 Overview of the Semester UNIX Introductie Regular Expressions Scripting Data Representation Integers, Fixed point,
More informationIntro to Linux. this will open up a new terminal window for you is super convenient on the computers in the lab
Basic Terminal Intro to Linux ssh short for s ecure sh ell usage: ssh [host]@[computer].[otheripstuff] for lab computers: ssh [CSID]@[comp].cs.utexas.edu can get a list of active computers from the UTCS
More informationRunning Kmeans Spark on EC2 Documentation
Running Kmeans Spark on EC2 Documentation Pseudo code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step1: Read D from HDFS as RDD Step 2: Initialize first k data
More informationIntroduction to Linux Environment. Yun-Wen Chen
Introduction to Linux Environment Yun-Wen Chen 1 The Text (Command) Mode in Linux Environment 2 The Main Operating Systems We May Meet 1. Windows 2. Mac 3. Linux (Unix) 3 Windows Command Mode and DOS Type
More informationHadoop Setup Walkthrough
Hadoop 2.7.3 Setup Walkthrough This document provides information about working with Hadoop 2.7.3. 1 Setting Up Configuration Files... 2 2 Setting Up The Environment... 2 3 Additional Notes... 3 4 Selecting
More informationUseful Unix Commands Cheat Sheet
Useful Unix Commands Cheat Sheet The Chinese University of Hong Kong SIGSC Training (Fall 2016) FILE AND DIRECTORY pwd Return path to current directory. ls List directories and files here. ls dir List
More informationActual4Dumps. Provide you with the latest actual exam dumps, and help you succeed
Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version
More informationCS60021: Scalable Data Mining. Sourangshu Bhattacharya
CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer
More informationIntroduction to remote command line Linux. Research Computing Team University of Birmingham
Introduction to remote command line Linux Research Computing Team University of Birmingham Linux/UNIX/BSD/OSX/what? v All different v UNIX is the oldest, mostly now commercial only in large environments
More informationParallel Programming Pre-Assignment. Setting up the Software Environment
Parallel Programming Pre-Assignment Setting up the Software Environment Authors: B. Wilkinson and C. Ferner. Modification date: Aug 21, 2014 (Minor correction Aug 27, 2014.) Software The purpose of this
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : About Quality Thought We are
More informationMap Reduce & Hadoop Recommended Text:
Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange
More informationHands-on Exercise Hadoop
Department of Economics and Business Administration Chair of Business Information Systems I Prof. Dr. Barbara Dinter Big Data Management Hands-on Exercise Hadoop Building and Testing a Hadoop Cluster by
More informationHadoop Lab 3 Creating your first Map-Reduce Process
Programming for Big Data Hadoop Lab 3 Creating your first Map-Reduce Process Lab work Take the map-reduce code from these notes and get it running on your Hadoop VM Driver Code Mapper Code Reducer Code
More informationExam Questions CCA-505
Exam Questions CCA-505 Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam https://www.2passeasy.com/dumps/cca-505/ 1.You want to understand more about how users browse you public
More informationIntroduction to UNIX Command Line
Introduction to UNIX Command Line Files and directories Some useful commands (echo, cat, grep, find, diff, tar) Redirection Pipes Variables Background processes Remote connections (e.g. ssh, curl) Scripts
More informationLinux Essentials Objectives Topics:
Linux Essentials Linux Essentials is a professional development certificate program that covers basic knowledge for those working and studying Open Source and various distributions of Linux. Exam Objectives
More informationCSE Linux VM. For Microsoft Windows. Based on opensuse Leap 42.2
CSE Linux VM For Microsoft Windows Based on opensuse Leap 42.2 Dr. K. M. Flurchick February 2, 2017 Contents 1 Introduction 1 2 Requirements 1 3 Procedure 1 4 Usage 3 4.1 Start/Stop.................................................
More informationCS197U: A Hands on Introduction to Unix
CS197U: A Hands on Introduction to Unix Lecture 3: UNIX Operating System Organization Tian Guo CICS, Umass Amherst 1 Reminders Assignment 2 is due THURSDAY 09/24 at 3:45 pm Directions are on the website
More informationBig Data for Engineers Spring Resource Management
Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationInstallation and Configuration Documentation
Installation and Configuration Documentation Release 1.0.1 Oshin Prem Sep 27, 2017 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE INSTALLATION....................................
More informationComputer Systems and Architecture
Computer Systems and Architecture Introduction to UNIX Stephen Pauwels University of Antwerp October 2, 2015 Outline What is Unix? Getting started Streams Exercises UNIX Operating system Servers, desktops,
More informationCS November 2017
Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Setting Up AWS/EMR Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 November 21,
More informationDistributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017
Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 Setting Up AWS/EMR November 21,
More informationMore Raspian. An editor Configuration files Shell scripts Shell variables System admin
More Raspian An editor Configuration files Shell scripts Shell variables System admin Nano, a simple editor Nano does not require the mouse. You must use your keyboard to move around the file and make
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationCENG 334 Computer Networks. Laboratory I Linux Tutorial
CENG 334 Computer Networks Laboratory I Linux Tutorial Contents 1. Logging In and Starting Session 2. Using Commands 1. Basic Commands 2. Working With Files and Directories 3. Permission Bits 3. Introduction
More informationLogging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:
Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering
More informationNetwork Monitoring & Management. A few Linux basics
Network Monitoring & Management A few Linux basics Our chosen platform Ubuntu Linux 14.04.3 LTS 64-bit LTS = Long Term Support no GUI, we administer using ssh Ubuntu is Debian underneath There are other
More informationDeveloper s Manual. Version May, Computer Science Department, Texas Christian University
Developer s Manual Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that
More informationUnix basics exercise MBV-INFX410
Unix basics exercise MBV-INFX410 In order to start this exercise, you need to be logged in on a UNIX computer with a terminal window open on your computer. It is best if you are logged in on freebee.abel.uio.no.
More informationChapter-3. Introduction to Unix: Fundamental Commands
Chapter-3 Introduction to Unix: Fundamental Commands What You Will Learn The fundamental commands of the Unix operating system. Everything told for Unix here is applicable to the Linux operating system
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationAbout 1. Chapter 1: Getting started with oozie 2. Remarks 2. Versions 2. Examples 2. Installation or Setup 2. Chapter 2: Oozie
oozie #oozie Table of Contents About 1 Chapter 1: Getting started with oozie 2 Remarks 2 Versions 2 Examples 2 Installation or Setup 2 Chapter 2: Oozie 101 7 Examples 7 Oozie Architecture 7 Oozie Application
More informationLab Working with Linux Command Line
Introduction In this lab, you will use the Linux command line to manage files and folders and perform some basic administrative tasks. Recommended Equipment A computer with a Linux OS, either installed
More informationCS 460 Linux Tutorial
CS 460 Linux Tutorial http://ryanstutorials.net/linuxtutorial/cheatsheet.php # Change directory to your home directory. # Remember, ~ means your home directory cd ~ # Check to see your current working
More informationInria, Rennes Bretagne Atlantique Research Center
Hadoop TP 1 Shadi Ibrahim Inria, Rennes Bretagne Atlantique Research Center Getting started with Hadoop Prerequisites Basic Configuration Starting Hadoop Verifying cluster operation Hadoop INRIA S.IBRAHIM
More informationLinux Operating System Environment Computadors Grau en Ciència i Enginyeria de Dades Q2
Linux Operating System Environment Computadors Grau en Ciència i Enginyeria de Dades 2017-2018 Q2 Facultat d Informàtica de Barcelona This first lab session is focused on getting experience in working
More informationBig Data 7. Resource Management
Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage
More informationIntroduction to Unix The Windows User perspective. Wes Frisby Kyle Horne Todd Johansen
Introduction to Unix The Windows User perspective Wes Frisby Kyle Horne Todd Johansen What is Unix? Portable, multi-tasking, and multi-user operating system Software development environment Hardware independent
More informationThe Unix Shell. Pipes and Filters
The Unix Shell Copyright Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information. shell shell pwd
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationQuick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine
Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationCCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)
Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE
More informationFriends Workshop. Christopher M. Judd
S Friends Workshop Christopher M. Judd Christopher M. Judd CTO and Partner at leader Columbus Developer User Group (CIDUG) Introduction http://hadoop.apache.org/ HCatalog Tez Zebra Owl Beyond MapReduce
More informationHadoop Cluster Implementation
Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project
More informationMapReduce & YARN Hands-on Lab Exercise 1 Simple MapReduce program in Java
MapReduce & YARN Hands-on Lab Exercise 1 Simple MapReduce program in Java Contents Page 1 Copyright IBM Corporation, 2015 US Government Users Restricted Rights - Use, duplication or disclosure restricted
More informationAn Introduction to Cluster Computing Using Newton
An Introduction to Cluster Computing Using Newton Jason Harris and Dylan Storey March 25th, 2014 Jason Harris and Dylan Storey Introduction to Cluster Computing March 25th, 2014 1 / 26 Workshop design.
More informationUnix Essentials. BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th
Unix Essentials BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th 2016 http://barc.wi.mit.edu/hot_topics/ 1 Outline Unix overview Logging in to tak Directory structure
More informationLINUX VPS GUIDE. Pre-requisites: (this guide assumes you are using windows)
LINUX VPS GUIDE Pre-requisites: (this guide assumes you are using windows) Philscurrency Wallet Download PHILS wallet if you don t have already from the link below https://github.com/philscurrency/philscurrency/releases/download/v1.2/phils
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationSpark Programming at Comet. UCSB CS240A Tao Yang
Spark Programming at Comet UCSB CS240A 2016. Tao Yang Comet Cluster Comet cluster has 1944 nodes and each node has 24 cores, built on two 12-core Intel Xeon E5-2680v3 2.5 GHz processors 128 GB memory and
More information