CIS 612 Advanced Topics in Database: Big Data Project
Lawrence Ni, Priya Patil, James Tench

Abstract

Implementing a Hadoop-based system for processing big data and performing analytics is a task many others have completed before, and there is ample documentation about the process. For a beginner, or even someone with little experience building a big data system from scratch, the process can be overwhelming. Wading through the documentation, making mistakes along the way, and correcting those mistakes are, for many, part of the learning process. This paper shares our experience installing Hadoop on an Amazon Web Services cluster and analyzing the data in a meaningful way. The goal is to highlight the areas where we ran into trouble so the reader may benefit from our learning.

Introduction

The Hadoop-based installation was implemented by Lawrence Ni, Priya Patil, and James Tench as a group, working on the project over a series of Sunday afternoons. Between meetings, the individual members performed additional research to prepare for the following session. The Hadoop installation on Amazon Web Services (AWS) consisted of four servers hosted on micro EC2 instances. The cluster was set up with one NameNode and three DataNodes. In a production implementation, multiple NameNodes would be used to account for machine failures. In addition to running Hadoop, the NameNode ran Hive as the SQL-like query layer over the data. Every step was also implemented and tested on a local machine before any job was run on the AWS cluster. On our local machines we ran MongoDB to query JSON data conveniently, and the team implemented a custom Flume agent to handle streaming data from Twitter's firehose.

AWS

Amazon Web Services offers various products that can be used in a cloud environment; running an entire cluster of hardware in the cloud is referred to as infrastructure as a service. To get started setting up a cloud infrastructure, you begin by creating an account with AWS. AWS offers a free tier, which provides low-end machines, and for our implementation these low-end machines served our needs. After creating an account, the documentation for creating an EC2 instance is the place to start. An EC2 instance is the standard type of machine that can be launched in the cloud. The entire AWS setup was as easy as following a wizard to launch the instances.

Configuration

After successfully launching four instances, getting the machines to run Hadoop required downloading the Hadoop files and configuring each node. This is the first spot where the group encountered configuration issues. The trouble was minor and easy to resolve, but it was mostly a matter of remembering the installation steps for Hadoop in pseudo-distributed mode. Hadoop communicates via SSH and must be able to do so without being prompted for a password. For AWS machines to communicate over SSH, your digitally signed key must be available.
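As an illustration of the kind of passwordless SSH setup this requires (not the project's actual values; the hostnames, private IP addresses, user name, and key path below are hypothetical), each node can be given an entry in ~/.ssh/config that points at the shared AWS key:

    # ~/.ssh/config on the NameNode (addresses, user, and key name are illustrative)
    Host datanode1
        HostName 172.31.0.11
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

    Host datanode2
        HostName 172.31.0.12
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

    Host datanode3
        HostName 172.31.0.13
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

With entries like these in place, a plain "ssh datanode1" connects without a password prompt, which is what Hadoop's start-up scripts need.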

To remedy the communication problem, a copy of the PEM file used locally was copied to each machine. Once the file was in place, a connection entry was made in the ~/.ssh/config file with the IP address information for the other nodes. The next step after configuring the SSH connection settings was to set up each of the Hadoop config files. Again, this process was straightforward; following the documentation on the Apache Hadoop website was all that was needed. The key differences between installing on a cluster and installing in pseudo-distributed mode were creating a slaves file, setting the replication factor, and adding the IP addresses of the DataNodes.

Flume

The Twitter firehose API was chosen as the data source for our project. The firehose is a live stream of tweets coming from Twitter. To connect to the API, it is necessary to go to Twitter's developer page and register as a developer. Upon registration you may create an app and obtain an API key, which is used to connect to and download data from the various Twitter APIs. Because the data arrives as a stream (rather than from a REST API), a method is needed for moving the data from the stream into HDFS, and Flume provides this capability. Flume works with sources, channels, and sinks. A source is where data enters Flume; in our case, the streaming API. A channel buffers data as it moves toward permanent storage; for this project, memory was used as the channel. Finally, the sink is where data is stored; in our case, HDFS.

Flume is also very well documented, and the documentation guides you through the majority of the process for creating a Flume agent. One area documented on the Flume website references the Twitter API and warns the user that the code is experimental and subject to change. This was the first area of configuring Flume where trouble was encountered. For the most part, the Apache Flume example worked for downloading data and storing it in HDFS. However, the Twitter API allows filtering of the data via keywords passed with the API request, and the default Apache implementation did not support passing keywords, so there was no filter. To get around this problem, there is a well-documented Java class from Cloudera that adds the ability to use a Flume agent with a filter condition. For our project we elected to copy the Apache implementation and modify it by adding in the filter code from Cloudera. Once we had this in place, Flume was streaming data from Twitter to HDFS.

After a few minutes of letting Flume run on a local machine, the program began throwing exceptions, and the exceptions kept increasing. To solve this problem it was necessary to modify the Flume agent config file so that the memory channel was flushed to disk often enough. After modifying the transaction capacity setting, and some trial and error, the Flume agent ran without exceptions. The key was to set the transaction capacity higher than the batch size. Once this was working as desired, the Flume agent was copied to the NameNode on AWS. The NameNode launched the Flume agent, which was allowed to download data for days.

Flume Java code
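The team's custom Java source is not reproduced in this transcription. As an illustration of the configuration fix described above, a hypothetical agent definition might look like the sketch below; the source class name stands in for the modified Twitter source, the credentials and paths are placeholders, and the important relationship is that the channel's transactionCapacity is larger than the HDFS sink's batch size.

    # hypothetical Flume agent definition (class name, credentials, and paths are illustrative)
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # custom Twitter source (stand-in class name for the modified Apache/Cloudera source)
    TwitterAgent.sources.Twitter.type = com.example.flume.FilteredTwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <API key>
    TwitterAgent.sources.Twitter.consumerSecret = <API secret>
    TwitterAgent.sources.Twitter.accessToken = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
    TwitterAgent.sources.Twitter.keywords = hadoop, big data

    # memory channel; transactionCapacity is kept above the sink's batch size
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 1000

    # HDFS sink writing the raw JSON events
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:9000/user/flume/tweets
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000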

MongoDB

The Twitter API sends data in JSON format, and MongoDB handles JSON naturally because it stores data in a binary JSON format called BSON. For these reasons, we used MongoDB on a local machine to better understand the raw data. Sample files were copied from the AWS cluster to a local machine and imported into MongoDB via the mongoimport command.
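For illustration (the database, collection, and file names below are hypothetical, not the project's), the import runs from the command line and the inspection happens in the mongo shell:

    # import a sample of raw tweets into a local database
    mongoimport --db twitter --collection tweets --file sample_tweets.json

    // in the mongo shell: look at one raw tweet, then count tweets per user
    db.tweets.findOne()
    db.tweets.aggregate([
        { $group: { _id: "$user.screen_name", tweetCount: { $sum: 1 } } },
        { $sort: { tweetCount: -1 } },
        { $limit: 10 }
    ])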

Once the data was loaded, viewing the format of the tweets, testing for valid data, and reviewing simple aggregations were done with the mongo query language. Realizing we wanted a method to process large amounts of data directly on HDFS, the group decided that MongoDB would not be the best choice for direct manipulation of the data there. For those reasons, MongoDB usage was limited to analyzing and reviewing sample data.

MapReduce

The first attempt to process large queries on the Hadoop cluster involved writing a MapReduce job. The JSONObject library created by Douglas Crockford was used to parse the raw JSON and extract the components being aggregated. A MapReduce job for a single summary metric was easily implemented by using the JSONObject library to extract screen_name as the key and followers_count as the value. Once again, the job was tested locally first, then run on the cluster. With about 3.6 GB of data, the cluster processed our count job in about 90 seconds; we did not consider this bad performance for four low-end machines processing almost 4 GB of data. Although the MapReduce job was not difficult to create in Java, it lacked the flexibility of running ad hoc queries at will. This led to the next phase of processing our data on the cluster.

mapreduce code
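The project's MapReduce listing is not reproduced in this transcription. Below is a minimal sketch, not the team's code, of a job in the same spirit: the mapper uses org.json's JSONObject to pull screen_name and followers_count out of each raw tweet, and the reducer keeps the largest follower count seen per screen name (one reasonable aggregation standing in for whatever summary the team computed).

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.json.JSONObject;

    public class FollowerCount {

        public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    // Each input line is one raw tweet in JSON form.
                    JSONObject tweet = new JSONObject(value.toString());
                    JSONObject user = tweet.getJSONObject("user");
                    String screenName = user.getString("screen_name");
                    long followers = user.getLong("followers_count");
                    context.write(new Text(screenName), new LongWritable(followers));
                } catch (Exception e) {
                    // Skip malformed or incomplete tweets rather than failing the job.
                }
            }
        }

        public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Keep the largest follower count observed for each screen name.
                long max = 0;
                for (LongWritable v : values) {
                    max = Math.max(max, v.get());
                }
                context.write(key, new LongWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "follower count");
            job.setJarByClass(FollowerCount.class);
            job.setMapperClass(TweetMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }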

HIVE

Apache Hive, like the other products mentioned previously, was very well documented and easy to install on the cluster by following the standard docs. Moving data into Hive proved to be the challenge.

For Hive to process the data, it needs a method for serializing and deserializing data when a query is issued; this is referred to as a SerDe. Finding a JSON SerDe was the easy part: we used the Hive-JSON-Serde from user rcongiu on GitHub. The initial trouble with setting up the Hive table was telling the SerDe what the format of the data would look like. Typically a create table statement needs to be written to define each field inside the nested JSON document. During the development and implementation of the table, many of the data fields that we expected to hold a value were returning null. This is where we learned that, for the SerDe to work properly, the table definition needs to be very precise. Because each tweet from Twitter did not always contain complete data, our original definition was failing. To create the correct schema definition, another library called hive-json-schema by user quux00 on GitHub was used. This tool is very good at auto-generating a Hive schema if you provide it with a single sample JSON document. After using the tool to generate the create table statement, the data was tested again. Once again, fields that should have had values were returning null. This ended up being one of the most tedious areas of the project to debug. After spending time researching and debugging, the problem was discovered: it once again stemmed from Twitter data sometimes being incomplete, so the sample tweet used by the tool to generate the create table statement was itself incomplete. To correct this, a sample tweet was reconstructed with dummy data in any field we found to be missing, using the Twitter API to validate what each field should look like in terms of data types and nested structures. After fixing a few typos, we finally constructed a full tweet. Using this new tweet sample, a create table statement was generated with the same tool, and queries began returning the expected values.

hive code
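The generated schema is not reproduced here. A minimal sketch of the kind of table definition involved is shown below; it is not the project's schema (the real one mirrors the full nested tweet structure), and the jar path and HDFS location are illustrative.

    -- register the JSON SerDe jar (path is illustrative)
    ADD JAR /usr/local/hive/lib/json-serde-jar-with-dependencies.jar;

    -- external table over the raw tweets Flume wrote to HDFS;
    -- only a handful of the tweet fields are shown
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<
        screen_name : STRING,
        followers_count : INT,
        friends_count : INT
      >,
      entities STRUCT<
        hashtags : ARRAY<STRUCT<text : STRING>>
      >
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/flume/tweets';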

python code

Queries & Visualization

Now that we had Hive up and running, we generated sample queries that aggregated the data in various ways. Creating Hive queries is just like creating standard SQL queries, and it was easy to use Java-style string manipulation to aid in processing the data.
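As an illustration of such queries (assuming the hypothetical tweets table sketched earlier), counting tweets per user and tweet volume per hour of day looks like ordinary SQL:

    -- top users by number of tweets
    SELECT `user`.screen_name, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY `user`.screen_name
    ORDER BY tweet_count DESC
    LIMIT 20;

    -- tweet volume by hour of day, parsed from created_at
    -- (Twitter's format, e.g. "Mon Sep 28 19:31:44 +0000 2015")
    SELECT substr(created_at, 12, 2) AS hour_of_day, COUNT(*) AS tweets
    FROM tweets
    GROUP BY substr(created_at, 12, 2)
    ORDER BY hour_of_day;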

After we queried and aggregated the data in different ways, we moved the aggregated data into summary files. The aggregated data included information about who tweeted, how often they tweeted, and even the hours of the day when users were most actively sending tweets. Watching Hive generate MapReduce jobs in the terminal window was fun the first one or two times, but then we realized we should find a better way to represent our data. The final piece of software we used was Plotly, a Python library that offers multiple graphing options. To use Plotly you need a developer account. Once you create an account, you use Python to define your data set and format it based on the graph or chart you intend to create. The library then generates a custom URL that can be used to view the data in chart form in a web browser.

Conclusion

From the perspective of a beginner, it may seem very difficult and overwhelming to implement and configure a complex computer system. However, breaking these complex systems down into more manageable pieces makes it easier to understand how the different parts work and communicate with each other. This type of structured learning not only helps you understand the material but also makes debugging issues much easier. While configuring and installing our various systems, we encountered a variety of issues. Whether it was environment variables not being set or jar files no longer compatible with the current software, these issues were easier to debug because we were able to break down the different parts and localize the error. Experiencing errors and bugs when setting up these complex systems is when the learning truly begins. Having to break down the error messages and think about the different moving parts helps you develop a deeper understanding of how these different aspects work and interact as a whole.

References

The Apache Software Foundation. Apache Hadoop. https://hadoop.apache.org/
The Apache Software Foundation. Apache Hive. https://hive.apache.org/
The Apache Software Foundation. Apache Flume. https://flume.apache.org/
Cloudera Engineering Blog. Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume. http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
rcongiu. Hive-JSON-Serde. https://github.com/rcongiu/hive-json-serde
quux00. hive-json-schema. https://github.com/quux00/hive-json-schema
Plotly, the Online Chart Maker. https://plot.ly/