
Aims

This exercise aims to get you to:

- Import data into HBase using bulk load
- Read MapReduce input from HBase and write MapReduce output to HBase
- Manage data using Hive
- Manage data using Pig

Background

In HBase-speak, bulk loading is the process of preparing and loading HFiles (HBase's own file format) directly into the RegionServers. Bulk load steps:

1. Extract the data from a source, typically text files or another database.

2. Transform the data into HFiles. This step requires a MapReduce job, and for most input types you will have to write the Mapper yourself. The job needs to emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; you configure it using HFileOutputFormat2.configureIncrementalLoad().

3. Load the files into HBase by telling the RegionServers where to find them. This requires using LoadIncrementalHFiles (more commonly known as the completebulkload tool): you pass it a URL that locates the files in HDFS, and it loads each file into the relevant region via the RegionServer that serves it.

In this process, the data flows from the original source to HDFS, where the RegionServers simply move the files to their regions' directories. See more details at: http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.
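
As a concrete (but hypothetical) illustration of step 2, a mapper that emits (row key, Put) pairs might look like the sketch below. The input layout, the column family "cf" and the qualifier "col" are assumptions for illustration only, not the schema used in the lab code.

// Minimal bulk-load mapper sketch: emit (row key, Put) pairs for HFileOutputFormat2.
// The field positions, the column family "cf" and qualifier "col" are illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BulkLoadMapperSketch
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // assumed field separator
        byte[] rowKey = Bytes.toBytes(fields[0]);          // first field used as the row key
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}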

Because HBase is not installed in the VM image on the lab computers, you need to install HBase again following the instructions in Lab 5.

Create a project Lab6 and a package comp9313.lab6 in this project. Put all your Java code in this package and keep a copy. Right click the project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> go to the folder comp9313/hbase-1.2.2/lib, and add all the jar files to the project.

Data Set

Download the two files Votes and Comments from the course homepage. The data set contains many questions asked on http://www.stackexchange.com and the corresponding answers. The two files used in this week's lab are obtained from https://archive.org/details/stackexchange, as part of datascience.stackexchange.com.7z. The format of the data set is described at: https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.

The data format of Votes is (the field BountyAmount is ignored):

- **votes**.xml
  - Id
  - PostId
  - VoteTypeId
    - ` 1`: AcceptedByOriginator
    - ` 2`: UpMod
    - ` 3`: DownMod
    - ` 4`: Offensive
    - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
    - ` 6`: Close
    - ` 7`: Reopen
    - ` 8`: BountyStart
    - ` 9`: BountyClose
    - `10`: Deletion
    - `11`: Undeletion
    - `12`: Spam
    - `13`: InformModerator
  - CreationDate
  - UserId (only for VoteTypeId 5)
  - BountyAmount (only for VoteTypeId 9)

The data format of Comments is:

- **comments**.xml
  - Id
  - PostId
  - Score
  - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
  - CreationDate, e.g.: "2008-09-06T08:07:10.730"
  - UserId

HBase Data Bulk Load

Import Votes as a table in HBase.

1. HBase uses a staging folder to store temporary data, and we need to configure this directory for HBase. Create a folder /tmp/hbase-staging in HDFS, and change its mode to 711 (i.e., rwx--x--x):

$ hdfs dfs -mkdir /tmp/hbase-staging
$ hdfs dfs -chmod 711 /tmp/hbase-staging

Add the following lines to $HBASE_HOME/conf/hbase-site.xml (in between <configuration> and </configuration>):

<property>
  <name>hbase.bulkload.staging.dir</name>
  <value>/tmp/hbase-staging</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
</property>

In your MapReduce code, you need to configure the two properties hbase.fs.tmp.dir and hbase.bulkload.staging.dir. After creating a Configuration object, you need to:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");

2. The code for bulk loading Votes into HBase is available at the course homepage, i.e., Vote.java and HBaseBulkLoadExample.java. Some explanations of the code:

- Only the mapper is required in bulk load, because the Reducer is handled by HBase and you configure it using HFileOutputFormat2.configureIncrementalLoad(). The map output key data type must be ImmutableBytesWritable, and the map output value data type can only be a KeyValue/Put/Delete object. In this example, you create a Put object, which will be used to insert the data into the HBase table.

- The table can either be created using the HBase shell or the HBase Java API. In the given code, the table is created using the Java API.

- In the example code, the class HBaseBulkLoadExample implements the interface Tool, and the job is configured and started in the run() function. ToolRunner.run() is then used to invoke HBaseBulkLoadExample.run(). You can also configure and start the job in the main function, as you did in the previous labs on MapReduce.

- Before starting the job, you need to use HFileOutputFormat2.configureIncrementalLoad() to configure the bulk load. After the job has completed, that is, after the mapper has generated the Put objects for all input data, you use LoadIncrementalHFiles to do the bulk load. It is the tool to load the output of HFileOutputFormat2 into an existing table.

3. After Votes is loaded into the table votes, open the HBase shell to check the table and its contents.
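
Putting these pieces together, a bulk-load driver might look roughly like the sketch below. It reuses the BulkLoadMapperSketch from the Background section; the table name "comments", the class names and the command-line paths are illustrative assumptions, so follow HBaseBulkLoadExample.java for the exact lab code.

// A minimal sketch of a bulk-load driver (not the exact lab code).
// The table name "comments", the mapper class and the paths in args are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
        conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");

        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            TableName tableName = TableName.valueOf("comments");   // assumed table, created beforehand
            Table table = connection.getTable(tableName);
            RegionLocator regionLocator = connection.getRegionLocator(tableName);

            Job job = Job.getInstance(conf, "bulk load sketch");
            job.setJarByClass(BulkLoadDriverSketch.class);
            job.setMapperClass(BulkLoadMapperSketch.class);         // mapper emitting (rowkey, Put)
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // raw input file in HDFS
            Path hfileDir = new Path(args[1]);                      // temporary HFile output directory
            FileOutputFormat.setOutputPath(job, hfileDir);

            // HBase configures the reducer, partitioner and HFile output format here.
            HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);

            if (job.waitForCompletion(true)) {
                // Hand the generated HFiles over to the RegionServers.
                LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
                loader.doBulkLoad(hfileDir, connection.getAdmin(), table, regionLocator);
            }
        }
    }
}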

Your Task: Import Comments as a table in HBase. Create a class HBaseBulkLoadComments.java and a class Comment.java in package comp9313.lab6 to finish this task. Use Id as the rowkey, and create three column families: postinfo (containing PostId), commentinfo (containing Score, Text, and CreationDate), and userinfo (containing UserId).

Read MapReduce Input from HBase

Problem 1: Read input data from table votes in HBase, and count for each post the number of each type of vote for this post. The output data is of format: (PostID, {<VoteTypeId, count>}). For example, if the post with ID 1 has two votes, one of type 1 and another of type 2, then you should output (1, {<1, 1>, <2, 1>}).

Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce read.

Hints:

1. Your mapper should extend TableMapper<K, V>. The input key data type is ImmutableBytesWritable, and the value data type is Result. Each map() call reads one row from the HBase table, and you can use Result.getValue(CF, COLUMN) to get the value in a cell. Your mapper code will look like:

public static class AggregateMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // do your job
    }
}

2. The reducer is just like a normal MapReduce reducer.

3. In the main function, you will need to use TableMapReduceUtil.initTableMapperJob() to configure the mapper (a driver sketch is given below, after Problem 2).

4. Because the data is read from HBase, you do not need to configure the data input path. You only need to specify the output path in Eclipse.

The code ReadHBaseExample.java is available at the course webpage. Try to write the mapper by yourself, and learn how to configure the HBase read job from that file.

Problem 2: Read input data from table comments in HBase, and calculate the number of comments per UserId. Refer to the code ReadHBaseExample.java and write your code in ReadHBaseComment.java in package comp9313.lab6.
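
For hint 3, the driver configuration of a read-from-HBase job might look roughly like the following sketch. The reducer class name AggregateReducer and the output types are assumptions; ReadHBaseExample.java shows the exact pattern used in the lab.

// Sketch of configuring a MapReduce job that reads its input from an HBase table.
// The table name "votes", the AggregateMapper/AggregateReducer classes (which you write)
// and the output types are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadHBaseDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read from hbase sketch");
        job.setJarByClass(ReadHBaseDriverSketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching for MapReduce scans
        scan.setCacheBlocks(false);  // do not pollute the block cache

        // Input comes from the "votes" table; no input path is needed.
        TableMapReduceUtil.initTableMapperJob(
                "votes",                 // source table
                scan,
                AggregateMapper.class,   // your TableMapper subclass (see hint 1)
                Text.class,              // mapper output key
                Text.class,              // mapper output value
                job);

        job.setReducerClass(AggregateReducer.class);  // a normal reducer (assumed name)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));  // only the output path is set
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}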

Write MapReduce Output to HBase

Problem 1: Read input data from Votes, and count the number of votes per user. The result will be written to an HBase table votestats, rather than being stored in files generated by reducers.

Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce write.

Hints:

1. The mapper is just like a normal MapReduce mapper.

2. Your reducer should extend TableReducer<K, V>. The output key data type is ImmutableBytesWritable (the key itself is ignored when writing), and the output value must be a Put object. The reduce() function will aggregate the number of votes for a user. You need to create a Put object to store the information, and HBase will use this object to insert the information into table votestats. Your reducer code will look like:

public static class UserVotesReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // do your job
    }
}

3. In the main function, you will need to use TableMapReduceUtil.initTableReducerJob() to configure the reducer (a driver sketch is given below, after Problem 2).

4. You can create the table in the main function, or using the HBase shell.

5. Because the data is written to HBase, you do not need to configure the data output path. You only need to specify the input path in Eclipse.

The code WriteHBaseExample.java is available at the course webpage. Try to write the reducer by yourself, and learn how to configure the HBase write job from that file.

Problem 2: Read input data from Comments, and calculate the average score of comments for each question. The result will be written to an HBase table post_comment_score, with only one column family stats. Refer to the code WriteHBaseExample.java and write your code in WriteHBaseComment.java in package comp9313.lab6.
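
For hint 3, the driver configuration of a write-to-HBase job might look roughly like the following sketch. The mapper class name UserVotesMapper is an assumption; WriteHBaseExample.java shows the exact pattern used in the lab.

// Sketch of configuring a MapReduce job that writes its output to an HBase table.
// The table name "votestats" and the UserVotesMapper/UserVotesReducer classes
// (which you write) are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WriteHBaseDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "write to hbase sketch");
        job.setJarByClass(WriteHBaseDriverSketch.class);

        job.setMapperClass(UserVotesMapper.class);     // a normal mapper (assumed name)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // only the input path is set

        // Output goes to the "votestats" table; no output path is needed.
        TableMapReduceUtil.initTableReducerJob(
                "votestats",              // target table (create it beforehand)
                UserVotesReducer.class,   // your TableReducer subclass (see hint 2)
                job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}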

Manage Data Using Hive

Hive Installation and Configuration

1. Download Hive 2.1.0

$ wget http://apache.uberglobalmirror.com/hive/stable-2/apache-hive-2.1.0-bin.tar.gz

Then unpack the package:

$ tar xvf apache-hive-2.1.0-bin.tar.gz

2. Define environment variables for Hive

We need to configure the working directory of Hive, i.e., HIVE_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export HIVE_HOME=~/apache-hive-2.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH

Save the file, and then run the following command to make these configurations take effect:

$ source ~/.bashrc

3. Create /tmp and /user/hive/warehouse in HDFS and set them chmod g+w so that more than one user can use them:

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

4. Run the schematool command to initialize Hive:

$ schematool -dbType derby -initSchema

Now you have done the basic configuration of Hive, and it is ready to use. Start the Hive shell with the following command (start HDFS and YARN first!):

$ hive

Practice Hive

1. Download the test file employees.txt from the course webpage. The file contains only 7 records. Put the file in your home folder.

2. Create a database:

$ hive> create database employee_data;
$ hive> use employee_data;

3. All databases are created under the /user/hive/warehouse directory:

$ hdfs dfs -ls /user/hive/warehouse

4. Create the employees table:

$ hive> CREATE TABLE employees (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Because '\001', '\002', '\003', and '\n' are the default delimiters, you can omit the whole ROW FORMAT DELIMITED clause. STORED AS TEXTFILE is also the default and can be omitted as well.
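
For illustration only, relying on those defaults the same table could equivalently have been declared with the shorter statement below (keep the original definition for the lab; do not run both, since the table already exists):

CREATE TABLE employees (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);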

5. Show all tables in the current database:

$ hive> show tables;

6. Load data from the local file system into the table:

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees;

After loading the data into the table, you can check in HDFS what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees

The file employees.txt has been copied into the folder corresponding to the table.

7. Check the data in the table:

$ hive> select * from employees;

8. You can run various queries on the employees table, just as in an RDBMS. For example:

Question 1: Show the number of employees and their average salary. Hint: use count() and avg().

Question 2: Find the employee who has the highest salary. Hint: use max(), the IN clause, and a subquery in the WHERE clause.
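
If you get stuck, one possible (not the only) formulation of the two queries is sketched below:

$ hive> SELECT count(*), avg(salary) FROM employees;

$ hive> SELECT name, salary FROM employees
        WHERE salary IN (SELECT max(salary) FROM employees);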

9. Usage of explode(). Find all employees who are the subordinate of another person. explode() takes an array (or a map) as input and outputs the elements of the array (map) as separate rows.

$ hive> SELECT explode(subordinates) FROM employees;

10. Hive partitions. The table employees was defined without partitions, and thus you cannot add a partition to it. You can only add a new partition to a table that has already been partitioned! Create a table employees2, and load the same file into it:

$ hive> CREATE TABLE employees2 (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (join_year STRING);

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees2 PARTITION (join_year='2015');

Now check HDFS again to see what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees2

You will see a folder join_year=2015 created in this folder, corresponding to the partition join_year='2015'.

Add a new partition join_year='2016' to the table:

$ hive> ALTER TABLE employees2 ADD PARTITION (join_year='2016') LOCATION '/user/hive/warehouse/employee_data.db/employees2/join_year=2016';

Check in HDFS, and you will see a new folder created for this partition.

11. Insert a record into partition join_year='2016'. Because Hive does not support literals for complex types (array, map, struct, union), it is not possible to use them in INSERT INTO ... VALUES clauses. You need to create a file to store the new record, and then load it into the partition.

$ cp employees.txt employees2016.txt

Then use vim or gedit to edit employees2016.txt to add some records, and load the file into the partition.

12. Query on a partition. Question: Find all employees who joined in the year 2016 whose salary is more than 60000.

13. (optional) Do word count in Hive, using the file employees.txt.
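
If you get stuck on steps 12 and 13, the following sketches show one possible approach. For step 12, assuming the 2016 records were loaded into partition join_year='2016':

$ hive> SELECT * FROM employees2 WHERE join_year = '2016' AND salary > 60000;

For the optional word count in step 13, one common approach is to load the raw file into a single-column table and combine split() with explode(). The table name doc is an assumption introduced only for this sketch, and it simply counts space-separated tokens:

$ hive> CREATE TABLE doc (line STRING);
$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE doc;
$ hive> SELECT word, count(1) AS cnt
        FROM (SELECT explode(split(line, ' ')) AS word FROM doc) w
        GROUP BY word;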

Manage Data Using Pig

Pig Installation and Configuration

1. Download Pig 0.16.0

$ wget http://mirror.ventraip.net.au/apache/pig/pig-0.16.0/pig-0.16.0.tar.gz

Then unpack the package:

$ tar xvf pig-0.16.0.tar.gz

2. Define environment variables for Pig

We need to configure the working directory of Pig, i.e., PIG_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export PIG_HOME=~/pig-0.16.0
export PATH=$PIG_HOME/bin:$PATH

Save the file, and then run the following command to make these configurations take effect:

$ source ~/.bashrc

3. Now you have done the basic configuration of Pig, and it is ready to use. Start the Pig Grunt shell with the following command (start HDFS and YARN first!):

$ pig

Practice Pig

1. Download the test file NYSE_dividends.txt from the course webpage. The file contains 670 records. Put the file into HDFS:

$ hdfs dfs -put NYSE_dividends.txt

Start the Hadoop job history server:

$ mr-jobhistory-daemon.sh start historyserver

2. Load the data using the load command into the schema (exchange, symbol, date, dividend):

$ grunt> dividends = load 'NYSE_dividends.txt' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
$ grunt> dump dividends;
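
Optionally, you can inspect the schema that Pig has recorded for the relation before grouping:

$ grunt> describe dividends;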

3. Group rows by symbol:

$ grunt> grouped = group dividends by symbol;

4. Compute the average dividend for each symbol. The dividend value is obtained using the expression dividends.dividend (or dividends.$3). Store this result in a relation avg:

$ grunt> avg = foreach grouped generate group, AVG(dividends.$3);

Use dump to check the contents of avg.

5. Store the result avg into HDFS using the store command:

$ grunt> store avg into 'average_dividend';

6. Check the result stored in HDFS:

$ grunt> fs -cat /user/comp9313/average_dividend/*

7. (optional) Do word count in Pig, using the file employees.txt (a sketch is given at the end of this document).

More Practices

More practices of Hive and Pig are included in the second assignment.
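
For the optional word count in Pig (step 7 above), a minimal sketch along the following lines should work; it counts whitespace-separated tokens and assumes employees.txt has been put into your HDFS home directory. The relation names are illustrative:

$ grunt> lines = load 'employees.txt' as (line:chararray);
$ grunt> words = foreach lines generate flatten(TOKENIZE(line)) as word;
$ grunt> grouped_words = group words by word;
$ grunt> word_counts = foreach grouped_words generate group, COUNT(words);
$ grunt> dump word_counts;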