CIS 612 Advanced Topics in Database: Big Data Project
Lawrence Ni, Priya Patil, James Tench

Abstract

Implementing a Hadoop-based system for processing big data and performing analytics is a task many others have completed before, and there is ample documentation about the process. For a beginner, or even someone with little experience building a big data system from scratch, the process can be overwhelming. Wading through the documentation, making mistakes along the way, and correcting those mistakes are, for many, part of the learning process. This paper shares our experience installing Hadoop on an Amazon Web Services cluster and analyzing the data in a meaningful way. The goal is to highlight the areas where we ran into trouble so the reader may benefit from our learning.

Introduction

The Hadoop-based installation was implemented by Lawrence Ni, Priya Patil, and James Tench as a group, working on the project over a series of Sunday afternoons. Between meetings, the individual members performed additional research to prepare for the following session. The Hadoop installation on Amazon Web Services (AWS) consisted of four servers hosted on micro EC2 instances. The cluster was set up with one NameNode and three DataNodes. In a production implementation, multiple NameNodes would be used to account for machine failures. In addition to running Hadoop, the NameNode ran Hive as the SQL-like query layer over the data. Every step was also implemented and tested on a local machine before any job was run on the AWS cluster. On our local machines we ran MongoDB to query JSON data conveniently, and the team implemented a custom Flume agent to handle streaming data from Twitter's firehose.

AWS

Amazon Web Services offers various products that can be used in a cloud environment; running an entire cluster of hardware in the cloud is referred to as infrastructure as a service. To get started setting up a cloud infrastructure, you begin by creating an account with AWS. AWS offers a free tier, which provides low-end machines, and for our implementation these low-end machines served our needs. After creating an account, the documentation for creating an EC2 instance is the place to start. An EC2 instance is the standard type of machine that can be launched in the cloud. The entire AWS setup was as easy as following a wizard to launch the instances.

Configuration

After successfully launching four instances, getting the machines to run Hadoop required downloading the Hadoop files and configuring each node. This is the first spot where the group encountered configuration issues. The trouble was minor and easy to resolve, but it was mostly a matter of remembering the installation steps for Hadoop in pseudo-distributed mode. Hadoop communicates via SSH and must be able to do so without being prompted for a password. For AWS machines to communicate over SSH, your digitally signed key must be available.
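As an illustration of the kind of passwordless SSH setup this requires (not the project's actual values; the hostnames, private IP addresses, user name, and key path below are hypothetical), each node can be given an entry in ~/.ssh/config that points at the shared AWS key:

    # ~/.ssh/config on the NameNode (addresses, user, and key name are illustrative)
    Host datanode1
        HostName 172.31.0.11
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

    Host datanode2
        HostName 172.31.0.12
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

    Host datanode3
        HostName 172.31.0.13
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

With entries like these in place, a plain "ssh datanode1" connects without a password prompt, which is what Hadoop's start-up scripts need.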

To remedy the communication problem, a copy of the PEM file used locally was copied to each machine. Once the file was in place, a connection entry was made in the ~/.ssh/config file with the IP address information for the other nodes. The next step after configuring the SSH connection settings was to set up each of the Hadoop config files. Again, this process was straightforward; following the documentation on the Apache Hadoop website was all that was needed. The key differences between installing on a cluster and installing in pseudo-distributed mode were creating a slaves file, setting the replication factor, and adding the IP addresses of the DataNodes.

Flume

The Twitter firehose API was chosen as the data source for our project. The firehose is a live stream of tweets coming from Twitter. To connect to the API, it is necessary to go to Twitter's developer page and register as a developer. Upon registration you may create an app and obtain an API key, which is used to connect to and download data from the various Twitter APIs. Because the data arrives as a stream (rather than from a REST API), a method is needed for moving the data from the stream into HDFS, and Flume provides this capability. Flume works with sources, channels, and sinks. A source is where data enters Flume; in our case, the streaming API. A channel buffers data as it moves toward permanent storage; for this project, memory was used as the channel. Finally, the sink is where data is stored; in our case, HDFS.

Flume is also very well documented, and the documentation guides you through the majority of the process for creating a Flume agent. One area documented on the Flume website references the Twitter API and warns the user that the code is experimental and subject to change. This was the first area of configuring Flume where trouble was encountered. For the most part, the Apache Flume example worked for downloading data and storing it in HDFS. However, the Twitter API allows filtering of the data via keywords passed with the API request, and the default Apache implementation did not support passing keywords, so there was no filter. To get around this problem, there is a well-documented Java class from Cloudera that adds the ability to use a Flume agent with a filter condition. For our project we elected to copy the Apache implementation and modify it by adding in the filter code from Cloudera. Once we had this in place, Flume was streaming data from Twitter to HDFS.

After a few minutes of letting Flume run on a local machine, the program began throwing exceptions, and the exceptions kept increasing. To solve this problem it was necessary to modify the Flume agent config file so that the memory channel was flushed to disk often enough. After modifying the transaction capacity setting, and some trial and error, the Flume agent ran without exceptions. The key was to set the transaction capacity higher than the batch size. Once this was working as desired, the Flume agent was copied to the NameNode on AWS. The NameNode launched the Flume agent, which was allowed to download data for days.

Flume Java code
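The team's custom Java source is not reproduced in this transcription. As an illustration of the configuration fix described above, a hypothetical agent definition might look like the sketch below; the source class name stands in for the modified Twitter source, the credentials and paths are placeholders, and the important relationship is that the channel's transactionCapacity is larger than the HDFS sink's batch size.

    # hypothetical Flume agent definition (class name, credentials, and paths are illustrative)
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # custom Twitter source (stand-in class name for the modified Apache/Cloudera source)
    TwitterAgent.sources.Twitter.type = com.example.flume.FilteredTwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <API key>
    TwitterAgent.sources.Twitter.consumerSecret = <API secret>
    TwitterAgent.sources.Twitter.accessToken = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
    TwitterAgent.sources.Twitter.keywords = hadoop, big data

    # memory channel; transactionCapacity is kept above the sink's batch size
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 1000

    # HDFS sink writing the raw JSON events
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:9000/user/flume/tweets
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000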

MongoDB

The Twitter API sends data in JSON format, and MongoDB handles JSON naturally because it stores data in a binary JSON format called BSON. For these reasons, we used MongoDB on a local machine to better understand the raw data. Sample files were copied from the AWS cluster to a local machine and imported into MongoDB via the mongoimport command.
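For illustration (the database, collection, and file names below are hypothetical, not the project's), the import runs from the command line and the inspection happens in the mongo shell:

    # import a sample of raw tweets into a local database
    mongoimport --db twitter --collection tweets --file sample_tweets.json

    // in the mongo shell: look at one raw tweet, then count tweets per user
    db.tweets.findOne()
    db.tweets.aggregate([
        { $group: { _id: "$user.screen_name", tweetCount: { $sum: 1 } } },
        { $sort: { tweetCount: -1 } },
        { $limit: 10 }
    ])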

Once the data was loaded, viewing the format of the tweets, testing for valid data, and reviewing simple aggregations were done with the mongo query language. Realizing we wanted a method to process large amounts of data directly on HDFS, the group decided that MongoDB would not be the best choice for direct manipulation of the data there. For those reasons, MongoDB usage was limited to analyzing and reviewing sample data.

MapReduce

The first attempt to process large queries on the Hadoop cluster involved writing a MapReduce job. The JSONObject library created by Douglas Crockford was used to parse the raw JSON and extract the components being aggregated. A MapReduce job for a single summary metric was easily implemented by using the JSONObject library to extract screen_name as the key and followers_count as the value. Once again, the job was tested locally first, then run on the cluster. With about 3.6 GB of data, the cluster processed our count job in about 90 seconds; we did not consider this bad performance for four low-end machines processing almost 4 GB of data. Although the MapReduce job was not difficult to create in Java, it lacked the flexibility of running ad hoc queries at will. This led to the next phase of processing our data on the cluster.

mapreduce code
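The project's MapReduce listing is not reproduced in this transcription. Below is a minimal sketch, not the team's code, of a job in the same spirit: the mapper uses org.json's JSONObject to pull screen_name and followers_count out of each raw tweet, and the reducer keeps the largest follower count seen per screen name (one reasonable aggregation standing in for whatever summary the team computed).

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.json.JSONObject;

    public class FollowerCount {

        public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    // Each input line is one raw tweet in JSON form.
                    JSONObject tweet = new JSONObject(value.toString());
                    JSONObject user = tweet.getJSONObject("user");
                    String screenName = user.getString("screen_name");
                    long followers = user.getLong("followers_count");
                    context.write(new Text(screenName), new LongWritable(followers));
                } catch (Exception e) {
                    // Skip malformed or incomplete tweets rather than failing the job.
                }
            }
        }

        public static class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Keep the largest follower count observed for each screen name.
                long max = 0;
                for (LongWritable v : values) {
                    max = Math.max(max, v.get());
                }
                context.write(key, new LongWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "follower count");
            job.setJarByClass(FollowerCount.class);
            job.setMapperClass(TweetMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }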

HIVE

Apache Hive, like the other products mentioned previously, was very well documented and easy to install on the cluster by following the standard docs. Moving data into Hive proved to be the challenge.

For Hive to process the data, it needs a method for serializing and deserializing data when a query is issued; this is referred to as a SerDe. Finding a JSON SerDe was the easy part: we used the Hive-JSON-Serde from user rcongiu on GitHub. The initial trouble with setting up the Hive table was telling the SerDe what the format of the data would look like. Typically a create table statement needs to be written to define each field inside the nested JSON document. During the development and implementation of the table, many of the data fields that we expected to hold a value were returning null. This is where we learned that, for the SerDe to work properly, the table definition needs to be very precise. Because each tweet from Twitter did not always contain complete data, our original definition was failing. To create the correct schema definition, another library called hive-json-schema by user quux00 on GitHub was used. This tool is very good at auto-generating a Hive schema if you provide it with a single sample JSON document. After using the tool to generate the create table statement, the data was tested again. Once again, fields that should have had values were returning null. This ended up being one of the most tedious areas of the project to debug. After spending time researching and debugging, the problem was discovered: it once again stemmed from Twitter data sometimes being incomplete, so the sample tweet used by the tool to generate the create table statement was itself incomplete. To correct this, a sample tweet was reconstructed with dummy data in any field we found to be missing, using the Twitter API to validate what each field should look like in terms of data types and nested structures. After fixing a few typos, we finally constructed a full tweet. Using this new tweet sample, a create table statement was generated with the same tool, and queries began returning the expected values.

hive code
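The generated schema is not reproduced here. A minimal sketch of the kind of table definition involved is shown below; it is not the project's schema (the real one mirrors the full nested tweet structure), and the jar path and HDFS location are illustrative.

    -- register the JSON SerDe jar (path is illustrative)
    ADD JAR /usr/local/hive/lib/json-serde-jar-with-dependencies.jar;

    -- external table over the raw tweets Flume wrote to HDFS;
    -- only a handful of the tweet fields are shown
    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<
        screen_name : STRING,
        followers_count : INT,
        friends_count : INT
      >,
      entities STRUCT<
        hashtags : ARRAY<STRUCT<text : STRING>>
      >
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/flume/tweets';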

python code

Queries & Visualization

Now that we had Hive up and running, we generated sample queries that aggregated the data in various ways. Creating Hive queries is just like creating standard SQL queries, and it was easy to use Java-style string manipulation to aid in processing the data.
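As an illustration of such queries (assuming the hypothetical tweets table sketched earlier), counting tweets per user and tweet volume per hour of day looks like ordinary SQL:

    -- top users by number of tweets
    SELECT `user`.screen_name, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY `user`.screen_name
    ORDER BY tweet_count DESC
    LIMIT 20;

    -- tweet volume by hour of day, parsed from created_at
    -- (Twitter's format, e.g. "Mon Sep 28 19:31:44 +0000 2015")
    SELECT substr(created_at, 12, 2) AS hour_of_day, COUNT(*) AS tweets
    FROM tweets
    GROUP BY substr(created_at, 12, 2)
    ORDER BY hour_of_day;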

After we queried and aggregated the data in different ways, we moved the aggregated data into summary files. The aggregated data included information about who tweeted, how often they tweeted, and even the hours of the day when users were most actively sending tweets. Watching Hive generate MapReduce jobs in the terminal window was fun the first one or two times, but then we realized we should find a better way to represent our data. The final piece of software we used was Plotly, a Python library that offers multiple graphing options. To use Plotly you need a developer account. Once you create an account, you use Python to define your data set and format it based on the graph or chart you intend to create. The library then generates a custom URL that can be used to view the data in chart form in a web browser.

Conclusion

From the perspective of a beginner, it may seem very difficult and overwhelming to implement and configure a complex computer system. However, breaking these complex systems down into more manageable pieces makes it easier to understand how the different parts work and communicate with each other. This type of structured learning not only helps you understand the material but also makes debugging issues much easier. While configuring and installing our various systems, we encountered a variety of issues. Whether it was environment variables not being set or jar files no longer compatible with the current software, these issues were easier to debug because we were able to break down the different parts and localize the error. Experiencing errors and bugs when setting up these complex systems is when the learning truly begins. Having to break down the error messages and think about the different moving parts helps you develop a deeper understanding of how these different aspects work and interact as a whole.

References

The Apache Software Foundation. Apache Hadoop. https://hadoop.apache.org/
The Apache Software Foundation. Apache Hive. https://hive.apache.org/
The Apache Software Foundation. Apache Flume. https://flume.apache.org/
Cloudera Engineering Blog. Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume. http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
rcongiu. Hive-JSON-Serde. https://github.com/rcongiu/hive-json-serde
quux00. hive-json-schema. https://github.com/quux00/hive-json-schema
Plotly, the Online Chart Maker. https://plot.ly/