Aims

This exercise aims to get you to:
- Import data into HBase using bulk load
- Read MapReduce input from HBase and write MapReduce output to HBase
- Manage data using Hive
- Manage data using Pig

Background

In HBase-speak, bulk loading is the process of preparing and loading HFiles (HBase's own file format) directly into the RegionServers. Bulk load steps:

1. Extract the data from a source, typically text files or another database.

2. Transform the data into HFiles. This step requires a MapReduce job, and for most input types you will have to write the Mapper yourself. The job needs to emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; you configure it using HFileOutputFormat2.configureIncrementalLoad().

3. Load the files into HBase by telling the RegionServers where to find them. This step uses LoadIncrementalHFiles (more commonly known as the completebulkload tool): you pass it a URL that locates the files in HDFS, and it loads each file into the relevant region via the RegionServer that serves it.

The overall data flow goes from the original source to HDFS, where the RegionServers simply move the files to their regions' directories. See more details at: http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.
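As a side note, step 3 can also be run from the command line once the HFiles have been generated in HDFS. A minimal sketch, in which the HDFS directory /user/comp9313/hfiles_output and the table name votes are hypothetical placeholders:

$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/comp9313/hfiles_output votes

This does the same work as calling LoadIncrementalHFiles from Java: each HFile is handed to the RegionServer responsible for its key range.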
Because HBase is not installed in the VM image on the lab computers, you need to install HBase again following the instructions in Lab 5.

Create a project Lab6 and create a package comp9313.lab6 in this project. Put all your Java code in this package and keep a copy. Right click the project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> go to the folder comp9313/hbase-1.2.2/lib, and add all the jar files to the project.

Data Set

Download the two files Votes and Comments from the course homepage. The data set contains many questions asked on http://www.stackexchange.com and the corresponding answers. The two files used in this week's lab were obtained from: https://archive.org/details/stackexchange, as part of datascience.stackexchange.com.7z. The format of the data set is described at: https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.

The data format of Votes is (the field BountyAmount is ignored):

- **votes**.xml
  - Id
  - PostId
  - VoteTypeId
    - ` 1`: AcceptedByOriginator
    - ` 2`: UpMod
    - ` 3`: DownMod
    - ` 4`: Offensive
    - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
    - ` 6`: Close
    - ` 7`: Reopen
    - ` 8`: BountyStart
    - ` 9`: BountyClose
    - `10`: Deletion
    - `11`: Undeletion
    - `12`: Spam
    - `13`: InformModerator
  - CreationDate
  - UserId (only for VoteTypeId 5)
  - BountyAmount (only for VoteTypeId 9)

The data format of Comments is:
- **comments**.xml
  - Id
  - PostId
  - Score
  - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
  - CreationDate, e.g.: "2008-09-06T08:07:10.730"
  - UserId

HBase Data Bulk Load

Import Votes as a table in HBase.

1. HBase will use a staging folder to store temporary data, and we need to configure this directory for HBase. Create a folder /tmp/hbase-staging in HDFS, and change its mode to 711 (i.e., rwx--x--x).

$ hdfs dfs -mkdir /tmp/hbase-staging
$ hdfs dfs -chmod 711 /tmp/hbase-staging

Add the following lines to $HBASE_HOME/conf/hbase-site.xml (in between <configuration> and </configuration>):

<property>
  <name>hbase.bulkload.staging.dir</name>
  <value>/tmp/hbase-staging</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
</property>

In your MapReduce code, you need to configure the two properties hbase.fs.tmp.dir and hbase.bulkload.staging.dir. After creating a Configuration object, you need to:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.fs.tmp.dir", "/tmp/hbase-staging");
conf.set("hbase.bulkload.staging.dir", "/tmp/hbase-staging");
2. The code for bulk loading Votes into HBase is available at the course homepage, i.e., Vote.java and HBaseBulkLoadExample.java. Some explanations of the code:

- Only the mapper is required in bulk load, because the Reducer is handled by HBase and you configure it using HFileOutputFormat2.configureIncrementalLoad(). The map output key data type must be ImmutableBytesWritable, and the map output value data type can only be a KeyValue/Put/Delete object. In this example, you create a Put object, which will be used to insert the data into the HBase table.

- The table can be created either using the HBase shell or the HBase Java API. In the given code, the table is created using the Java API.

- In the example code, the class HBaseBulkLoadExample implements the interface Tool, and the job is configured and started in the run() function. ToolRunner.run() is then used to invoke HBaseBulkLoadExample.run(). You can also configure and start the job in the main function, as you did in the previous labs on MapReduce.

- Before starting the job, you need to use HFileOutputFormat2.configureIncrementalLoad() to configure the bulk load. After the job is completed, that is, the mapper has generated the Put objects for all input data, you use LoadIncrementalHFiles to do the bulk load. It is the tool to load the output of HFileOutputFormat2 into an existing table.

3. After Votes is loaded into the table votes, open the HBase shell to check the table and its contents.

Your Task: Import Comments as a table in HBase. Create a class HBaseBulkLoadComments.java and a class Comment.java in package comp9313.lab6 to finish this task. Use Id as the rowkey, and create three column families: postinfo (containing PostId), commentinfo (containing Score, Text, and CreationDate), and userinfo (containing UserId).
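To get you started on the task, below is a minimal sketch of what the bulk-load mapper for Comments could look like, following the pattern of the provided HBaseBulkLoadExample.java. The helper Comment.parse() and the getter names are hypothetical; adapt them to however your Comment.java extracts fields from one XML row.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CommentBulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical helper: parse one <row .../> line of Comments into a
        // Comment object; returns null for non-data lines (XML header/footer).
        Comment c = Comment.parse(value.toString());
        if (c == null) {
            return;
        }
        byte[] rowKey = Bytes.toBytes(c.getId());
        Put put = new Put(rowKey);
        // One Put carries all columns for this row, grouped by column family.
        put.addColumn(Bytes.toBytes("postinfo"), Bytes.toBytes("PostId"), Bytes.toBytes(c.getPostId()));
        put.addColumn(Bytes.toBytes("commentinfo"), Bytes.toBytes("Score"), Bytes.toBytes(c.getScore()));
        put.addColumn(Bytes.toBytes("commentinfo"), Bytes.toBytes("Text"), Bytes.toBytes(c.getText()));
        put.addColumn(Bytes.toBytes("commentinfo"), Bytes.toBytes("CreationDate"), Bytes.toBytes(c.getCreationDate()));
        put.addColumn(Bytes.toBytes("userinfo"), Bytes.toBytes("UserId"), Bytes.toBytes(c.getUserId()));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}

In the job setup you would still call HFileOutputFormat2.configureIncrementalLoad() and then run LoadIncrementalHFiles afterwards, exactly as the Votes example does.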
Read MapReduce Input from HBase

Problem 1. Read input data from table votes in HBase, and count for each post the number of each type of vote for this post. The output data is of format: (PostID, {<VoteTypeId, count>}). For example, if the post with ID 1 has two votes, one of type 1 and another of type 2, then you should output (1, {<1, 1>, <2, 1>}).

Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce read.

Hints:

1. Your mapper should extend TableMapper<K, V>. The input key data type is ImmutableBytesWritable, and the value data type is Result. Each map() call reads one row from the HBase table, and you can use Result.getValue(CF, COLUMN) to get the value in a cell. Your mapper code will look like (a filled-in sketch follows the hints below):

public static class AggregateMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        //do your job
    }
}

2. The reducer is just like a normal MapReduce reducer.

3. In the main function, you will need to use the function TableMapReduceUtil.initTableMapperJob() to configure the mapper.

4. Because the data is read from HBase, you do not need to configure the data input path. You only need to specify the output path in Eclipse.

The code ReadHBaseExample.java is available at the course webpage. Try to write the mapper by yourself, and learn how to configure the HBase read job from that file.
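To make hint 1 concrete, here is one possible way to fill in the map() body (using the imports already present in ReadHBaseExample.java). The column family and qualifier names "voteinfo", "PostId", and "VoteTypeId" are assumptions for illustration; replace them with whatever names your bulk-load code actually used when creating the votes table.

public static class AggregateMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Assumed schema: adjust the family/qualifier names to match your table.
        String postId = Bytes.toString(value.getValue(Bytes.toBytes("voteinfo"), Bytes.toBytes("PostId")));
        String voteTypeId = Bytes.toString(value.getValue(Bytes.toBytes("voteinfo"), Bytes.toBytes("VoteTypeId")));
        if (postId != null && voteTypeId != null) {
            // Emit one (PostId, VoteTypeId) pair per row; the reducer then
            // counts how many times each VoteTypeId occurs for each PostId.
            context.write(new Text(postId), new Text(voteTypeId));
        }
    }
}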
Problem 2: Read input data from table comments in HBase, and calculate the number of comments per UserId. Refer to the code ReadHBaseExample.java and write your code in ReadHBaseComment.java in package comp9313.lab6.

Write MapReduce Output to HBase

Problem 1. Read input data from Votes, and count the number of votes per user. The result will be written to an HBase table votestats, rather than being stored in files generated by reducers.

Please refer to https://hbase.apache.org/book.html#mapreduce.example for examples of HBase MapReduce write.

Hints:

1. The mapper is just like a normal MapReduce mapper.

2. Your reducer should extend TableReducer<KeyIn, ValueIn, KeyOut>. The output key data type is ImmutableBytesWritable, although HBase ignores the output key; the output value you write must be a Mutation, i.e., the Put object you create. The reduce() function aggregates the number of votes for a user; HBase will use the Put object to insert the information into table votestats. Your reducer code will look like (a filled-in sketch follows the hints below):

public static class UserVotesReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        //do your job
    }
}

3. In the main function, you will need to use the function TableMapReduceUtil.initTableReducerJob() to configure the reducer.

4. You can create the table in the main function, or using the HBase shell.

5. Because the data is written to HBase, you do not need to configure the data output path. You only need to specify the input path in Eclipse.

The code WriteHBaseExample.java is available at the course webpage. Try to write the reducer by yourself, and learn how to configure the HBase write job from that file.
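To make hint 2 concrete, here is one possible way to fill in the reduce() body (using the imports already present in WriteHBaseExample.java). The column family "stats" and qualifier "count" are assumptions; they must match however you create the votestats table.

public static class UserVotesReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the partial counts emitted by the mapper for this user.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // The row key is the UserId; the Put is what actually lands in votestats.
        byte[] rowKey = Bytes.toBytes(key.toString());
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(Integer.toString(sum)));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}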
Problem 2: Read input data from Comments, and calculate the average score of comments for each question. The result will be written to an HBase table post_comment_score, with only one column family stats. Refer to the code WriteHBaseExample.java and write your code in WriteHBaseComment.java in package comp9313.lab6.

Manage Data Using Hive

Hive Installation and Configuration

1. Download Hive 2.1.0

$ wget http://apache.uberglobalmirror.com/hive/stable-2/apache-hive-2.1.0-bin.tar.gz

Then unpack the package:

$ tar xvf apache-hive-2.1.0-bin.tar.gz

2. Define environment variables for Hive

We need to configure the working directory of Hive, i.e., HIVE_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export HIVE_HOME=~/apache-hive-2.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH

Save the file, and then run the following command for these settings to take effect:

$ source ~/.bashrc

3. Create /tmp and /user/hive/warehouse in HDFS, and set them chmod g+w so that more than one user can use them:

$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse

4. Run the schematool command to initialize Hive:

$ schematool -dbType derby -initSchema

Now you have done the basic configuration of Hive, and it is ready to use. Start the Hive shell with the following command (start HDFS and YARN first!):

$ hive
Practice Hive

1. Download the test file employees.txt from the course webpage. The file contains only 7 records. Put the file in your home folder.

2. Create a database

$ hive> create database employee_data;
$ hive> use employee_data;

3. All databases are created under the /user/hive/warehouse directory.

$ hdfs dfs -ls /user/hive/warehouse

4. Create the employee table

$ hive> CREATE TABLE employees (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Because '\001', '\002', '\003', and '\n' are the defaults, you can omit the ROW FORMAT DELIMITED clause. STORED AS TEXTFILE is also the default, and can be omitted as well.

5. Show all tables in the current database

$ hive> show tables;

6. Load data from the local file system into the table

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees;
After loading the data into the table, you can check in HDFS what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees

The file employees.txt is copied into the folder corresponding to the table.

7. Check the data in the table

$ hive> select * from employees;

8. You can run various queries over the employees table, just as in an RDBMS. For example:

Question 1: show the number of employees and their average salary. Hint: use count() and avg().

Question 2: find the employee who has the highest salary. Hint: use max(), the IN clause, and a subquery in the where clause. (Possible solutions are sketched below.)
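If you get stuck on the two questions, here is one possible shape for each query; these are sketches only, written against the schema above, and other formulations are equally valid:

$ hive> SELECT count(*), avg(salary) FROM employees;

$ hive> SELECT name, salary FROM employees
        WHERE salary IN (SELECT max(salary) FROM employees);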
9. Usage of explode(). Find all employees who are the subordinate of another person. explode() takes an array (or a map) as input and outputs the elements of the array (map) as separate rows.

$ hive> SELECT explode(subordinates) FROM employees;

10. Hive partitions. The table employees was not defined with partitions, and thus you cannot add a partition to it. You can only add a new partition to a table that has already been partitioned! Create a table employees2, and load the same file into it.

$ hive> CREATE TABLE employees2 (
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
) PARTITIONED BY (join_year STRING);

$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE employees2 PARTITION (join_year='2015');

Now check HDFS again to see what happened:

$ hdfs dfs -ls /user/hive/warehouse/employee_data.db/employees2

You will see a folder join_year=2015 created in this folder, corresponding to the partition join_year='2015'. Add a new partition join_year='2016' to the table:

$ hive> ALTER TABLE employees2 ADD PARTITION (join_year='2016') LOCATION '/user/hive/warehouse/employee_data.db/employees2/join_year=2016';

Check in HDFS, and you will see a new folder created for this partition.

11. Insert a record into partition join_year='2016'. Because Hive does not support literals for complex types (array, map, struct, union), it is not possible to use them in INSERT INTO ... VALUES clauses. You need to create a file to store the new record, and then load it into the partition.

$ cp employees.txt employees2016.txt

Then use vim or gedit to edit employees2016.txt to add some records, and load the file into the partition.

12. Query on a partition. Question: find all employees who joined in the year 2016 and whose salary is more than 60000.

13. (optional) Do word count in Hive, using the file employees.txt. (One possible sketch is given below.)
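For step 12, the partition column can be used like an ordinary column in the WHERE clause; one possible sketch:

$ hive> SELECT name, salary FROM employees2 WHERE join_year = '2016' AND salary > 60000;

For the optional word count in step 13, one possible approach is to load each line as a single string column, then split and explode it. A sketch, assuming a hypothetical helper table called docs (split() here breaks on whitespace, so adjust the pattern if the file's actual delimiters differ):

$ hive> CREATE TABLE docs (line STRING);
$ hive> LOAD DATA LOCAL INPATH '/home/comp9313/employees.txt' OVERWRITE INTO TABLE docs;
$ hive> SELECT word, count(1) AS count
        FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
        GROUP BY word;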
Manage Data Using Pig

Pig Installation and Configuration

1. Download Pig 0.16.0

$ wget http://mirror.ventraip.net.au/apache/pig/pig-0.16.0/pig-0.16.0.tar.gz

Then unpack the package:

$ tar xvf pig-0.16.0.tar.gz

2. Define environment variables for Pig

We need to configure the working directory of Pig, i.e., PIG_HOME. Open the file ~/.bashrc and add the following lines at the end of this file:

export PIG_HOME=~/pig-0.16.0
export PATH=$PIG_HOME/bin:$PATH

Save the file, and then run the following command for these settings to take effect:

$ source ~/.bashrc

3. Now you have done the basic configuration of Pig, and it is ready to use. Start the Pig Grunt shell with the following command (start HDFS and YARN first!):

$ pig

Practice Pig

1. Download the test file NYSE_dividends.txt from the course webpage. The file contains 670 records. Put the file into HDFS:

$ hdfs dfs -put NYSE_dividends.txt

Start the Hadoop job history server:

$ mr-jobhistory-daemon.sh start historyserver

2. Load the data using the load command into the schema (exchange, symbol, date, dividend).

$ grunt> dividends = load 'NYSE_dividends.txt' as (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
$ grunt> dump dividends;
You should see the records of dividends printed to the screen.

3. Group rows by symbol.

$ grunt> grouped = group dividends by symbol;

4. Compute the average dividend for each symbol. The dividend value is obtained using the expression dividends.dividend (or dividends.$3). Store the result in a relation avg.

$ grunt> avg = foreach grouped generate group, AVG(dividends.$3);

Use dump to check the contents of avg.

5. Store the result avg into HDFS using the store command

$ grunt> store avg into 'average_dividend';

6. Check the stored result in HDFS

$ grunt> fs -cat /user/comp9313/average_dividend/*

7. (optional) Do word count in Pig, using the file employees.txt. (One possible sketch is given below.)

More Practices

More practices of Hive and Pig are included in the second assignment.
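For the optional word count in step 7 above, one possible Pig sketch, assuming employees.txt has been copied into HDFS (TOKENIZE splits on whitespace and common punctuation, so adjust if the file uses other delimiters):

$ grunt> lines = load 'employees.txt' as (line:chararray);
$ grunt> words = foreach lines generate flatten(TOKENIZE(line)) as word;
$ grunt> grpd = group words by word;
$ grunt> counts = foreach grpd generate group, COUNT(words);
$ grunt> dump counts;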