Using Hive for Data Warehousing

An IBM Proof of Technology
Unit 1: Exploring Hive

Copyright IBM Corporation, 2013. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

LAB 1 EXPLORING HIVE
    1.1 GETTING STARTED
    1.2 HIVE AND THE WEB CONSOLE
        1.2.1 STARTING/STOPPING HIVE FROM THE BIGINSIGHTS WEB CONSOLE
        1.2.2 HIVE WEB INTERFACE
    1.3 EXPLORING THE HIVE ENVIRONMENT
        1.3.1 INVESTIGATING HIVE DIRECTORY STRUCTURE WITH THE CONSOLE
        1.3.2 EXPLORING THE HIVE COMMAND LINE INTERFACE (CLI)
    1.4 SUMMARY

Lab 1 Exploring Hive

The overwhelming trend towards digital services, combined with cheap storage, has generated massive amounts of data that enterprises need to gather, process, and analyze effectively. Data analysis techniques from the data warehouse and high-performance computing communities are invaluable for many enterprises, but their cost or the complexity of scaling them up often discourages the accumulation of data without an immediate need. Because valuable knowledge may nevertheless be buried in this data, related scaled-up technologies have been developed. Examples include Google's MapReduce and its open-source implementation, Apache Hadoop.

Writing MapReduce programs to analyze your Big Data can get complex. Apache Hive can make querying your data much easier. Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL.

After completing this hands-on lab, you will be able to:

- Start and stop Hive from both the command line and the BigInsights Web Console.
- Use the Linux command line to explore the Hive directory structure.
- Interact with the Hive CLI in interactive mode, in one-shot mode, and via a file.

Allow 30 to 45 minutes to complete this section of the lab.

This version of the lab was designed using the InfoSphere BigInsights 2.1 Quick Start Edition. Throughout this lab you will be using the following account login information:

                             Username    Password
    VM image setup screen    root        password
    Linux                    biadmin     biadmin

1.1 Getting Started

To prepare for this lab, you must first start all of the Hadoop components. These instructions assume you have already followed the IBM InfoSphere BigInsights Quick Start Edition, v2.1 README setup guide.

1. Start the VMware image by clicking the Play virtual machine button in the VMware Player if it is not already on.

2. Log in to the VMware virtual machine using the following credentials.

       User: biadmin
       Password: biadmin

3. After you log in, your screen should look similar to the one below.

   Before we can start working with Hive and the Hadoop Distributed File System, we must first start all the BigInsights components. There are two ways of doing this: through the terminal, or by double-clicking an icon. Both methods are shown in the following steps.

4. Open the terminal by double-clicking the BigInsights Shell icon.

5. Click on the Terminal icon.

6. Once the terminal has opened, change to the $BIGINSIGHTS_HOME/bin directory ($BIGINSIGHTS_HOME is /opt/ibm/biginsights by default).

       cd $BIGINSIGHTS_HOME/bin

   or

       cd /opt/ibm/biginsights/bin

7. Start the Hadoop components (daemons) on the BigInsights server with the following command. Note that it will take a few minutes to run.

       ./start-all.sh

8. Sometimes certain Hadoop components may fail to start. You can start and stop the failed components one at a time by using start.sh and stop.sh respectively. For example, to start and stop Hive use:

       ./start.sh hive
       ./stop.sh hive

   Because Hive did not fail to start initially, the terminal reports that Hive is already running.

9. Once all components have started successfully, you may move on.

10. If you would like to stop all components, execute the command below. However, for this lab please leave all components started.

        ./stop-all.sh

Next, let us look at how you would start all the components by double-clicking an icon.

11. Double-clicking the Start BigInsights icon runs a script that performs the steps above. Once all components have started, the terminal exits and you are done.

12. You can stop the components in a similar manner by double-clicking the Stop BigInsights icon (to the right of the Start BigInsights icon).

Now that the components are started, you may move on to the next section.
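The terminal steps in this section can be collected into a small script. The following is a minimal sketch, not part of the BigInsights product: it assumes the default install path from step 6 and that start.sh accepts a component name as shown in step 8. The helper name start_component is hypothetical.

```shell
#!/bin/sh
# Sketch only: wraps the start scripts named in this section.
# Assumption: BigInsights lives in /opt/ibm/biginsights unless
# BIGINSIGHTS_HOME says otherwise.
BIGINSIGHTS_HOME="${BIGINSIGHTS_HOME:-/opt/ibm/biginsights}"
BIN_DIR="$BIGINSIGHTS_HOME/bin"

# start_component (hypothetical helper): start one component by name,
# for example "hive", using the start.sh script in BIN_DIR.
start_component() {
    if [ ! -x "$BIN_DIR/start.sh" ]; then
        echo "start.sh not found in $BIN_DIR" >&2
        return 1
    fi
    # Run from the bin directory, as the lab steps do.
    (cd "$BIN_DIR" && ./start.sh "$1")
}

# Start everything only when the scripts are actually present.
if [ -x "$BIN_DIR/start-all.sh" ]; then
    (cd "$BIN_DIR" && ./start-all.sh)
    # Retry a single component the same way step 8 does.
    start_component hive
fi
```

On a machine without BigInsights the script does nothing, which makes it safe to try out before running it on the lab image.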

1.2 Hive and the Web Console

Hive can also be started and stopped very easily from the BigInsights Web Console. Additionally, we can work with Hive from the Hive Web Interface that is packaged with Apache Hive.

1.2.1 Starting/Stopping Hive from the BigInsights Web Console

1. Start the Web Console by double-clicking the BigInsights WebConsole icon.

2. Once logged in, click on the Cluster Status tab at the top of the page.

3. Click on the Hive service and note the detailed information provided for this service in the pane at right. For example, you can see the URL for Hive's Web interface and its process ID. From here, you can also start or stop the Hive service depending on your needs.

4. In the pane to the right (which displays the Hive status), click the red Stop button to stop the service.

5. When prompted to confirm that you want to stop the Hive service, click OK and wait for the operation to complete. The right pane should appear similar to the following image.

6. Restart the Hive service by clicking on the green arrow just beneath the Hive Status heading. (See the previous figure.) When the operation completes, the Web Console will indicate that Hive is running again, likely under a process ID that differs from the Hive process ID shown at the beginning of this lab module. (You may need to use the Refresh button of your Web browser to reload the information displayed in the left pane.)

1.2.2 Hive Web Interface

1. Copy and paste the URL for Hive's Web interface (http://bivm:9999/hwi) into a new tab of your browser. You'll see the open source Hive Web Interface provided with Hive for administration purposes, as shown below.
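After restarting Hive from the Web Console, you may also want to confirm from a terminal that the Hive Web Interface answers at its URL. The following is a minimal sketch using curl, assuming curl is installed on the machine you run it from. The helper name hwi_http_code is hypothetical, and the lab does not state which HTTP status code the interface returns, so the sketch only reports the code rather than checking for a particular value.

```shell
#!/bin/sh
# Sketch only: report the HTTP status code of the Hive Web Interface.
# HWI_URL defaults to the address used in this lab.
HWI_URL="${HWI_URL:-http://bivm:9999/hwi}"

# hwi_http_code (hypothetical helper): print the HTTP status code for
# the given URL, or "unreachable" if curl cannot complete the request.
hwi_http_code() {
    code=$(curl --silent --output /dev/null --max-time 10 \
        --write-out '%{http_code}' "$1") || code="unreachable"
    echo "$code"
}
```

For example, running hwi_http_code "$HWI_URL" while Hive is stopped should print "unreachable", provided nothing else is listening on that port.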

1.3 Exploring the Hive environment

1.3.1 Investigating Hive directory structure with the console

Let's navigate to the Hive home directory on the Linux file system and investigate the directories that make up Hive.

1. Open the Linux terminal by double-clicking the BigInsights Shell icon on the desktop.

2. Click on the Terminal icon.

3. In the terminal, change to the Hive home directory.

       $ cd $HIVE_HOME

   Note: This is equivalent to

       $ cd $BIGINSIGHTS_HOME/hive

4. Check the current directory.

       $ pwd

   You are now in the /opt/ibm/biginsights/hive directory. This is where Hive is set up on this BigInsights virtual machine.

5. Explore the directory structure inside the hive folder by running the ls command.

       $ ls

6. You will notice the following directories:

       bin        executables to start, stop, configure, and check the status of Hive
       lib        server's JAR files
       conf       Hive environment, metastore, security, and log configuration files
       docs       Hive documentation
       scripts    scripts for upgrading Derby and MySQL metastores from one version of Hive to the next
       examples   Hive examples
       src        Hive source and test scripts

1.3.2 Exploring the Hive Command Line Interface (CLI)

From the Hive CLI shell you can perform queries, DML, DDL and more. We will be doing much of our work in the Hive CLI, so let's briefly check it out!

1. In the Linux terminal, change into the $HIVE_HOME/bin directory.

       $ cd $HIVE_HOME/bin

2. Inside the bin directory, run a command that shows the command line options for the Hive CLI.

       $ ./hive --help --service cli

3. Page through the variables and properties already set in the Hive CLI.

       $ ./hive -S -e "set" | more

4. Execute a Hive one-shot command (designated by the -e option) to show the current schemas in the system. You should see that only the default schema is listed. (Schema and database are equivalent terms in Hive.)

       $ ./hive -S -e "SHOW SCHEMAS;"

   Note that the -S option in the above command stands for Silent mode and removes some inessential output.

5. Create a new file in the /tmp directory with a simple HQL command inside of it. The quotes are required; without them the shell treats the semicolon as the end of the echo command and the file is left empty.

       $ echo "SHOW DATABASES;" > /tmp/myfile.hql

6. Tell Hive to run the commands in your file by passing the -f option.

       $ ./hive -f /tmp/myfile.hql

   Note the output: Hive lists only a single database, the default Hive database.

7. Start an interactive Hive shell session.

       $ ./hive

8. Run the SHOW DATABASES statement from within the interactive Hive session.

       hive> SHOW DATABASES;

9. Quit Hive.

       hive> quit;

1.4 Summary

Congratulations! You now know how to start and stop Hive using the terminal and the BigInsights Web Console. You can navigate to the Hive directories and understand the contents of those directories. You also know how to interact with the Hive CLI. You may move on to the next Unit.
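As a recap of section 1.3.2, the two non-interactive ways of invoking the Hive CLI (one-shot with -e and file-based with -f) can be wrapped in a short script. The following is a minimal sketch under stated assumptions: Hive lives in $HIVE_HOME, defaulting to the lab's /opt/ibm/biginsights/hive, and the helper names hive_one_shot and hive_from_file are hypothetical.

```shell
#!/bin/sh
# Sketch only: non-interactive use of the Hive CLI from this lab.
# Assumption: the hive launcher is at $HIVE_HOME/bin/hive.
HIVE_HOME="${HIVE_HOME:-/opt/ibm/biginsights/hive}"
HIVE_BIN="${HIVE_BIN:-$HIVE_HOME/bin/hive}"

# hive_one_shot (hypothetical helper): run one statement with -e.
# -S is Silent mode, which removes some inessential output.
hive_one_shot() {
    "$HIVE_BIN" -S -e "$1"
}

# hive_from_file (hypothetical helper): write the given statements to
# a script file, one per line, then run that file with -f.
hive_from_file() {
    script_file="$1"
    shift
    # Each statement is a quoted argument, so its trailing semicolon
    # reaches the file instead of ending the shell command.
    printf '%s\n' "$@" > "$script_file"
    "$HIVE_BIN" -S -f "$script_file"
}

# Run the lab's statements only when the Hive launcher is present.
if [ -x "$HIVE_BIN" ]; then
    hive_one_shot "SHOW SCHEMAS;"
    hive_from_file /tmp/myfile.hql "SHOW DATABASES;"
fi
```

Passing each statement as a quoted argument avoids the quoting pitfall of building the file with an unquoted echo. The interactive mode has no wrapper here because it is simply the launcher run with no arguments.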


Copyright IBM Corporation 2013.

The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.