Hands-on Exercise Hadoop


Department of Economics and Business Administration
Chair of Business Information Systems I
Prof. Dr. Barbara Dinter

Big Data Management
Hands-on Exercise Hadoop: Building and Testing a Hadoop Cluster by Means of Apache Ambari

1. Configuration of the Cluster Nodes

First of all, define which computer is the master node and which computers are the slave nodes of your future Hadoop cluster. Then configure the computers by installing an operating system (OS) on them, paying attention to the following hints.

Hints:

- For the purpose of this exercise we use CentOS (Community Enterprise Operating System) in version 6.7 (64 bit), a free enterprise Linux distribution. (CentOS can be downloaded free of charge at http://isoredirect.centos.org/centos/6/isos/x86_64.) Install the OS on the master and the first slave node in parallel; once this is done, move on and install CentOS on the second slave node. It is recommended that during this process one student manually (not electronically) and carefully writes down all relevant information for each node, e.g. the IP address, the unique host name (Fully Qualified Domain Name, FQDN), the defined usernames and passwords, and the installed services.
- Media test: Skip the media test. In our case it is not necessary and only costs time.
- Assigning the host name: Configure the fully qualified domain name (FQDN) of the node as the host name (in the form <computer name>.<domain>.<top level domain>). ATTENTION: On the screen where you assign the host name, scroll down a bit and select the option "Configure Network" on the left. Then set up the configuration as follows: eth0 or eth1 (network interface controller) -> Edit -> check the option "Connect automatically" -> Apply. Otherwise you will have to perform this step after the installation of CentOS and will have no internet connection right after the installation. A short sketch for checking the host name follows this list.
- Assigning the root password: Choose a non-trivial password, but one which you can share with your team mates and/or your supervisor during the exercise. Do not store it anywhere electronically! This advice applies to all passwords you define during this exercise.
- Installation type: Install CentOS in the Desktop version.
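A minimal sketch for verifying the host name after the installation (the FQDN shown is only an example; use the names you wrote down for your own nodes):

    hostname                      # should print the FQDN, e.g. master.bigdata.example.org
    cat /etc/sysconfig/network    # CentOS 6 stores it here as HOSTNAME=master.bigdata.example.org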

The installation of the OS takes a considerable amount of time (about 18 minutes). Use this time to read ahead through the following steps.

2. Configuration of CentOS

Perform the following configuration of CentOS simultaneously for the master and the first slave node, paying attention to the following hints.

Hints:

- License policy: Accept the license policy.
- Choosing the username: Choose a username for the Linux system. It is recommended to choose intuitive names for the related nodes, such as master, slave01 and slave02. Subsequently assign a password for this account (comply with the aforementioned password policy!).
- Time settings: Activate the option "Synchronize the date and time via network" (Network Time Protocol, NTP). This is necessary because all nodes must be exactly synchronized.
- Kdump: Deactivate Kdump. It is not needed in our use case and only wastes valuable computational resources.
- HINT: Make your work a bit easier by deactivating the screen saver on each node.

Repeat this configuration process in parallel (in the background) for the second slave node.

3. Configuration of the Hadoop Cluster - Preparation of CentOS

Please download the installation guide for installing a Hortonworks Data Platform 2.3 (HDP 2.3) Hadoop cluster using Apache Ambari. You can find the installation guide at: http://docs.hortonworks.com/hdpdocuments/ambari-2.2.0.0/bk_Installing_HDP_AMB/bk_Installing_HDP_AMB-20151221.pdf. Note that not all of the steps mentioned in the installation guide must be performed; many of them, for instance, deal with problems under other operating systems and do not apply to CentOS. Think before you type!

Before you start installing your Hadoop cluster, take a short look at the Linux terminal commands which you may need during the cluster installation.

Linux terminal commands that you should have in mind:

You can find the terminal (also known as console or shell) under: Applications -> System Tools -> Terminal (it is advisable to create a shortcut for the terminal on the system panel by right-clicking on the terminal entry and selecting "Add to Panel"). You can also set a key combination (e.g. F3 or Ctrl + T) to access the terminal directly via: System -> Settings -> Hot Keys -> Desktop -> "Start a terminal".

Table 1 gives an overview of all Linux terminal commands which you may need during the exercise. A more general and complete overview of these commands and their applications can be found at: https://ubuntudanmark.dk/filer/fwunixref.pdf.

Table 1: Overview of required Linux terminal commands

- su (Super User): Changes the active user to the super user (root); this is necessary because some commands may only be run as root; you are asked for the root password afterwards.
- exit: Closes the terminal or the current session.
- Tab key: Completes the input based on the available files located on the working path.
- clear: Clears the terminal window.
- Ctrl + C: Cancels the terminal input or a running process (keyboard interrupt).
- pwd (Print Working Directory): Shows the path of the current working directory.
- ls (LiSt): Shows all available files in the current working directory.
- cd (Change Directory): Changes the working directory to another given location: cd <path> switches to the given path; cd / switches to the root directory; cd .. switches to the parent directory; cd - switches to the previous directory.
- mkdir (MaKe DIRectory): Creates a new directory.
- ifconfig: Shows the IP and the MAC address of the computer.
- hostname: Shows the assigned host name of the node.
- ssh (Secure SHell): Opens a remote connection to another node, where the destination node is addressed as <user>@<FQDN of destination node>.
- scp (Secure CoPy): (Securely) copies data from a source path to a destination path; especially suitable for exchanging data among nodes.

Now start the installation by correctly configuring CentOS. Please follow the steps in the installation guide carefully and pay attention to the following hints. Pages 1-4 of the installation guide contain only introductory information; Section 1.2.8 (page 5) is the first one where you have to get active.

Hints:

- If you do not know a command, you can open its manual in the terminal with: man <command>
- Preparation of OpenSSL (important!):
  o First update the OpenSSL version on all nodes. To do so, execute as root: yum update openssl
- Regarding Section 1.2.8 (Check the Maximum Open File Descriptors) of the HDP installation guide:
  o The number of open file descriptors (i.e. the number of files that can be processed simultaneously) can be set as root with the commands ulimit -Hn 16384 and ulimit -Sn 16384.
- Regarding Section 1.4.1 (Set Up Password-less SSH) of the HDP installation guide:
  o Please use your master node as the Ambari server.
  o Generate the public SSH key as the root user (!).
  o The key is available afterwards at /root/.ssh/.
  o After setting the read and write permissions (Step 4), additionally run the following command on all slave nodes: restorecon -R ~/.ssh
  o NOTE: Password-less SSH must be set up not only towards both slave nodes but also towards the master node itself. In the case of three cluster nodes, Section 1.4.1 of the HDP installation guide must therefore be carried out three times.
- Regarding Section 1.4.3 (Enable NTP on the Cluster and on the Browser Host) of the HDP installation guide:
  o Enable NTP: You should actually have done this already during the installation of CentOS.
- Regarding Section 1.4.4 (Check DNS and NSCD) of the HDP installation guide:
  o If you use the DNS of your institution: it is important that you assigned the FQDN correctly during the installation phase. Nevertheless, check the mentioned configuration files with respect to your FQDNs.
  o Otherwise: the mapping between the chosen FQDNs and their related IP addresses has to be entered manually on each node. To do so, adapt the hosts file (/etc/hosts) on each node (cf. Section 1.4.5.1, Edit the Host File, of the HDP installation guide).
- Regarding Section 1.4.5 (Configuring iptables) of the HDP installation guide:
  o You can only deactivate iptables when you are logged in as the root user. Run these commands on all nodes!
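A minimal command sketch of the preparation steps above, to be run as root (host names, IP addresses and permissions are examples; adapt them to the FQDNs you noted for your own nodes and to the exact steps in the installation guide):

    # On every node: update OpenSSL and raise the open-file-descriptor limits
    yum update openssl
    ulimit -Hn 16384
    ulimit -Sn 16384

    # On the master (Ambari server): generate the root SSH key pair
    ssh-keygen -t rsa                      # creates /root/.ssh/id_rsa and /root/.ssh/id_rsa.pub

    # Copy the public key to each node (including the master itself) and append it to authorized_keys
    scp /root/.ssh/id_rsa.pub root@slave01.bigdata.example.org:/root/
    ssh root@slave01.bigdata.example.org \
        "mkdir -p ~/.ssh && cat /root/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys && restorecon -R ~/.ssh"
    ssh root@slave01.bigdata.example.org   # test: no password prompt should appear

    # Without an institutional DNS: add one line per node to /etc/hosts on every node, e.g.
    #   192.168.0.101  master.bigdata.example.org
    #   192.168.0.102  slave01.bigdata.example.org

    # On every node: deactivate iptables
    chkconfig iptables off
    service iptables stop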

4. Configuration of the Hadoop Cluster - Installing Apache Ambari

The installation of the Apache Ambari server is covered from page 21 of the HDP installation guide onwards. Pay attention to the following hint.

Hint:

- Regarding Section 2.2 (Set Up the Ambari Server) of the HDP installation guide:
  o Step 6, "Enter advanced database configuration": enter n (default database).
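A short sketch of the corresponding commands on the master node (the repository URL and version are an assumption based on the Ambari 2.2.0.0 guide linked above; use whatever the guide you downloaded specifies):

    wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.0.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
    yum install ambari-server      # install the Ambari server package
    ambari-server setup            # accept the defaults; answer 'n' at "Enter advanced database configuration"
    ambari-server start            # the Ambari web interface is then reachable on port 8080 of the master node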

5. Installation and Start of the Hadoop Cluster

The installation of the Hadoop cluster is explained from page 32 of the HDP installation guide onwards. Pay attention to the following hints:

Hints:

- Regarding Section 3.5 (Install Options) of the HDP installation guide:
  o Target Hosts: Provide the FQDNs of all nodes, including the master node (!).
  o SSH Private Key: To be able to enter the SSH private key, copy the file /root/.ssh/id_rsa as the root user to the desktop (e.g. /home/master/Desktop) and transfer the file ownership to the master user (e.g. chown -c master id_rsa). Subsequently, you can select and open the key via the web interface. OR: Open the id_rsa file as root in the terminal using the cat command, mark the text, copy it and paste it into the Ambari web interface.
- Regarding Section 3.6 (Confirm Hosts) of the HDP installation guide:
  o Confirm Hosts: Pay attention to potential errors and/or warnings after the clients have been registered automatically. Do not move on to the next step before you have handled and resolved all errors and/or warnings!
- Regarding Section 3.7 (Choose Services) of the HDP installation guide:
  o Choose Services: To keep the installation process short, select only the Hadoop components and applications that are necessary for this exercise, namely HDFS, YARN + MapReduce2 and Pig. (The applications Hive + HCatalog and HBase consume too many resources to be launched and are therefore not recommended for our test purposes.) Confirm all dependencies with OK.
- Regarding Section 3.9 (Assign Slaves and Clients) of the HDP installation guide:
  o Assign Clients: To keep the installation process short, install the necessary HDFS and Pig clients only on your master node. As a result, you can execute HDFS commands and start Pig scripts only from your master node.
- Regarding Section 3.10 (Customize Services) of the HDP installation guide:
  o Customize Services: Do not change any settings of the services. However, you can scroll through them to see what kind of settings you could select, if you are interested. If you have selected the Hadoop application Apache Oozie for installation, you must provide a database user and a password in this step.
- Regarding Section 3.11 (Review) of the HDP installation guide:
  o Deploy: Depending on how many Hadoop applications you have selected, the deploy step may take up to 10 minutes to complete. Hence, it is time for another short break.

6. Using and Testing the Hadoop Cluster

Use and test your newly installed Hadoop cluster. First read the following hints on using the Hadoop Distributed File System (HDFS), then solve the exercises.

Hints for using HDFS:

- When the Hadoop cluster is installed automatically with Apache Ambari, a user account called hdfs is created by Ambari. The hdfs user has read and write permissions for the virtual Hadoop Distributed File System. With the terminal command passwd hdfs you can choose a new password for this account; do so on your master node. After you have successfully changed the password, switch the CentOS user and log in as hdfs (do not log out, so that the Ambari server keeps running in the background!). Solve the exercises using this account.
- Execute all HDFS commands as the hdfs user (not as root!) and use the HDFS directory /tmp/ for the following exercises.

- With the following terminal commands you can put data into HDFS:
      hdfs dfs -copyFromLocal foo.txt /tmp/
  get data from HDFS back onto your physical local file system:
      hdfs dfs -copyToLocal /tmp/wordcountoutput/part-r-00000 result.txt
  or show a file directly in the terminal window:
      hdfs dfs -cat /tmp/wordcountoutput/part-r-00000
- The following command lists the content of an HDFS directory:
      hdfs dfs -ls /tmp/
- An extensive overview of all HDFS terminal commands can be found on the HDFS cheat sheet of the book (Dirk deRoos, 2014: Hadoop For Dummies) at http://www.dummies.com/how-to/content/hadoop-for-dummies-cheat-sheet.html.
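A short example session as the hdfs user on the master node, tying these commands together (file names are placeholders):

    echo "hello hadoop hello hdfs" > foo.txt
    hdfs dfs -copyFromLocal foo.txt /tmp/            # put the local file into HDFS
    hdfs dfs -ls /tmp/                               # list the HDFS directory
    hdfs dfs -cat /tmp/foo.txt                       # print the file straight from HDFS
    hdfs dfs -copyToLocal /tmp/foo.txt result.txt    # fetch it back to the local file system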

Exercise 1 - Airline On-time Performance

Implement the HDFS example "Airline on-time performance" from the book (Dirk deRoos, 2014: Hadoop For Dummies, Chapter 13) on your own Hadoop cluster. The following steps are of interest:

- Downloading the sample dataset
- Copying the sample dataset into HDFS
- Your first Hadoop program: Hello Hadoop!

For your inputs and outputs, use the HDFS directory /tmp/ rather than the one mentioned in the book.

NOTE: The Pig script in the book contains "\" characters to indicate line breaks. These backslashes are not part of the actual script and should therefore be omitted. In addition, the script contains a small bug: the path of the input data in the LOAD command (first line) must be enclosed in single quotation marks (cf. the structure of the Pig script below). If they are missing, an error is shown in the terminal window.

Exercise 2 - The Word Count Example

Apply the knowledge you gained in the first exercise to implement another Pig script which counts the frequency of words in a text ("word count example"). Use the freely available RFC 7230 "Hypertext Transfer Protocol (HTTP/1.1)" as the input text file: http://tools.ietf.org/rfc/rfc7230.txt. Your Pig script should have the following structure (sample code obtained from http://hortonworks.com/hadoop-tutorial/word-counting-with-apache-pig/):

    a = load '/foo.txt';
    b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
    c = group b by word;
    d = foreach c generate COUNT(b), group;
    store d into '/output';

A sketch of how to run this script on your cluster follows the questions below.

Subsequently answer the following questions:

2.1 Which essential text mining step have you just applied?

2.2 Where are the map and reduce phases located in the script?

2.3 Have a careful look at the physical representation of the HDFS blocks in your file system. Navigate as root to the following directory (the name of the fifth subdirectory varies, depending on the date and time of the file generation as well as the IP address of the node): /hadoop/hdfs/data/current/BP-<nb>-<ip>-<datetime>/current/finalized. With the terminal command ls -lh you can see all files located in the current working directory, including their sizes in bytes. Compare the directories on your master and slave nodes with each other! (Reminder: the default replication factor of HDFS is 3.)
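A sketch of one way to run the word count from Exercise 2 as the hdfs user on the master node (the script and output names are examples; in the script, replace '/foo.txt' and '/output' with your own HDFS paths under /tmp/):

    wget http://tools.ietf.org/rfc/rfc7230.txt           # download the input text
    hdfs dfs -copyFromLocal rfc7230.txt /tmp/            # put it into HDFS
    pig wordcount.pig                                    # wordcount.pig contains the script above,
                                                         # loading '/tmp/rfc7230.txt' and storing into '/tmp/wordcountoutput'
    hdfs dfs -cat /tmp/wordcountoutput/part-r-00000      # inspect the result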

Exercise 3 - Performance Tests

3.1 How long does your Hadoop cluster take to compute the word count for the aforementioned RFC 7230?

3.2 How long does your Hadoop cluster take to compute the word count for a significantly shorter text, e.g. a text of roughly one sentence in length (determination of the administrative overhead)?

3.3 How long does your Hadoop cluster take to compute the word count for a significantly longer text, on a scale of 100 to 1000 times the size of RFC 7230?
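A minimal sketch for taking these measurements (it assumes the wordcount.pig script from Exercise 2; file names and the repetition factor are examples):

    time pig wordcount.pig                                       # 3.1: RFC 7230 as input

    echo "A single short sentence." > tiny.txt                   # 3.2: minimal input to estimate the overhead
    hdfs dfs -copyFromLocal tiny.txt /tmp/

    for i in $(seq 1 500); do cat rfc7230.txt; done > big.txt    # 3.3: an input roughly 500 times larger
    hdfs dfs -copyFromLocal big.txt /tmp/

    # For each run, adjust the LOAD and STORE paths in the script and remove the previous
    # output directory first (e.g. hdfs dfs -rm -r /tmp/wordcountoutput), since Pig will not overwrite it.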

Exercise 4 - Extension of an Existing Hadoop Cluster

4.1 Consider the following question: Which steps are needed to add another slave node to your existing Hadoop cluster?

4.2 Which scaling category does this method correspond to?

Additional Exercises: Explorative Learning

The following exercises are supplementary. Start solving Exercise 5 once your group has finished all four previous exercises. If several groups are done with their exercises and there is still some time left (about 20 minutes), move on to Exercise 6 and work on it collectively.

Exercise 5 - The Hadoop Ecosystem

Try to run a Hadoop application other than Pig on your Hadoop cluster. For this purpose, search for a short tutorial on the Internet by yourself. (A good starting point for free Hadoop tutorials: http://hortonworks.com/products/hortonworks-sandbox/.)

NOTE: You can add further services to your Hadoop cluster via the Ambari web dashboard: Actions -> Add Services.

Exercise 6 - Think Big!

Connect all available nodes in your laboratory into one single Hadoop cluster. Finally, repeat your performance tests from Exercise 3. Can you observe a performance boost?

NOTE: With the following command you can reset all settings on your nodes which were made by the Ambari installation wizard:

    python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent