South Asian Journal of Engineering and Technology Vol. 2, No. 50 (2016) 5-10


ISSN Number (online): 2454-9614

Weather Data Analytics using Hadoop Components like MapReduce, Pig and Hive

Sireesha. M 1, Tirumala Rao. S. N 2
Department of CSE, Narasaraopeta Engineering College, Narasaraopet, JNTUK, AP, India
1 sireeshamoturi@gmail.com 2 naga_tirumalarao@yahoo.co.in

Abstract

In the big data world, the Hadoop Distributed File System (HDFS) is very popular. It provides a framework for storing data in a distributed environment, along with a set of tools to retrieve and process these data sets using the MapReduce concept. This paper discusses how big data analytics can be performed on weather data stored in HDFS. Collecting, storing and processing huge amounts of climatic data is necessary for accurate weather prediction. Meteorological departments use different types of sensors, such as temperature and humidity sensors, to obtain these values. The volume and velocity of the data from each sensor make data processing time-consuming and complex. We leverage MapReduce with Hadoop to process this massive amount of data. Hadoop is an open-source framework suitable for large-scale data processing, and the MapReduce programming model helps to process large data sets in a parallel, distributed manner.

I. INTRODUCTION

Big Data has become one of the buzzwords in IT during the last couple of years. Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, is unstructured, and does not fit the structures of conventional database architectures. To gain value from this data we need an alternative way to process it. Sources that generate such large amounts of data include Facebook, Twitter, weather stations, the New York Stock Exchange and worldwide electric transmission systems. In our project we deal with a huge amount of unstructured weather data.

India is an emerging country, and most of its cities are now becoming smart cities. Different sensors employed in a smart city can be used to measure weather parameters. The weather forecast department has begun to collect and analyze massive amounts of data such as temperature. It uses sensor values such as temperature and humidity to predict rainfall and other conditions. As the number of sensors increases, the data becomes high in volume and arrives at high velocity. A scalable analytics tool is needed to process this massive amount of data. The traditional approach to processing the data is very slow. Processing the sensor data with MapReduce in the Hadoop framework removes the scalability bottleneck. So, to manage these databases we need highly parallel software. First of all, data is acquired from different sources such as social media, traditional enterprise data or sensor data. Then, this data can be organized using distributed file systems such as the Google File System or the Hadoop Distributed File System. These file systems are very efficient when the number of reads is very high compared to the number of writes. Finally, the data is analyzed using MapReduce so that queries can be run on it easily and efficiently. Figure 1 shows the Hadoop ecosystem.

Figure 1: Hadoop EcoSystem

Weather forecasting is always a big challenge for meteorologists, who must predict the state of the atmosphere at some future time and the weather conditions that may be expected. Knowing the future weather can be important for individuals and organizations. Accurate weather forecasts can tell a farmer the best time to plant, an airport control tower what information to send to planes that are landing and taking off, and residents of a coastal region when a hurricane might strike. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semistructured and record-oriented. The data we

will use is from the National Climatic Data Center (NCDC) [2]. The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature.

II. HADOOP

Hadoop and MapReduce are the most widely used models today for Big Data processing. Hadoop is an open-source large-scale data processing framework that supports distributed processing of large chunks of data using simple programming models. The Apache Hadoop project consists of HDFS and Hadoop MapReduce in addition to other modules. The software is designed to harness the processing power of clustered computing while managing failures at the node level. Hadoop is widely used in big data applications in industry, e.g., weather forecasting, spam filtering, network searching, click-stream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which was growing by 0.5 PB per day as of November. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [3].

A. HDFS

HDFS is Hadoop's implementation of a distributed filesystem. It is designed to hold a large amount of data and to provide access to this data to many clients distributed across a network [4]. On a fully configured cluster, running Hadoop means running a set of daemons, or resident programs, on the different servers in your network.
These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker [5]. The topology of a typical Hadoop cluster is shown in Figure 2.

Figure 2: The topology of a Hadoop cluster

The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.

The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
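The retry behaviour described for the JobTracker can be illustrated with a small, self-contained Python sketch. It is not Hadoop code: the names `run_with_retries`, `flaky_task` and `MAX_ATTEMPTS`, the node names, and the limit of 4 attempts are all assumptions chosen for illustration. The sketch only shows the idea of relaunching a failed task, possibly on a different node, up to a predefined limit.

```python
from itertools import cycle

MAX_ATTEMPTS = 4  # assumed retry limit, for illustration only


def run_with_retries(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Run task(node), moving to the next node after each failure.

    Returns (result, attempts_used); raises RuntimeError once the
    attempt limit is exhausted.
    """
    node_iter = cycle(nodes)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        node = next(node_iter)
        try:
            return task(node), attempt
        except Exception as exc:  # any failure triggers a relaunch
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(node):
    """Hypothetical task that only succeeds on node 'worker-3'."""
    if node != "worker-3":
        raise IOError(f"task crashed on {node}")
    return f"done on {node}"


result, attempts = run_with_retries(flaky_task, ["worker-1", "worker-2", "worker-3"])
print(result, attempts)  # prints: done on worker-3 3
```

Cycling through the node list is one simple way to model "possibly on a different node"; a real scheduler would also weigh data locality and node health.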
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

B. Map Reduce
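As a rough illustration of the two-phase model that this section describes, the following Python sketch simulates the map, shuffle and reduce steps in memory for the maximum-temperature-per-year task. It does not use Hadoop. The fixed-width offsets (year at characters 15-19, signed temperature in tenths of a degree Celsius at 87-92, quality code at 92, 9999 as the missing value) follow a layout commonly used in examples for NCDC records and should be treated as assumptions here, as should the helper names and the synthetic records built by `make_record`.

```python
from collections import defaultdict

MISSING = 9999  # assumed sentinel for a missing temperature reading


def make_record(year, temp_tenths, quality="1"):
    """Build a synthetic fixed-width line (not real NCDC data)."""
    line = ["0"] * 93
    line[15:19] = f"{year:04d}"
    line[87] = "+" if temp_tenths >= 0 else "-"
    line[88:92] = f"{abs(temp_tenths):04d}"
    line[92] = quality
    return "".join(line)


def map_phase(line):
    """Map: emit (year, temperature) for each valid record."""
    year = line[15:19]
    temp = int(line[87:92])  # int() accepts a leading '+' or '-'
    quality = line[92]
    if abs(temp) != MISSING and quality in "01459":
        yield year, temp


def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(year, temps):
    """Reduce: keep the maximum temperature seen for a year."""
    return year, max(temps)


records = [
    make_record(1975, 111),
    make_record(1975, 317),
    make_record(1976, -25),
    make_record(1976, 9999),      # missing reading, dropped by map
    make_record(1976, 289, "2"),  # quality code outside the accepted set
]
mapped = [pair for line in records for pair in map_phase(line)]
result = dict(reduce_phase(y, t) for y, t in shuffle(mapped).items())
print(result)  # prints: {'1975': 317, '1976': -25}
```

Both phases consume and produce key-value pairs: the map step turns each text line into a (year, temperature) pair, and the reduce step turns each (year, list of temperatures) group into a (year, maximum) pair.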


MapReduce is an excellent model for distributed computing, introduced by Google in 2004 and now adopted by Apache Hadoop. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.

III. PIG ARCHITECTURE

Pig was initially developed at Yahoo Research around 2006 but later moved into the Apache Software Foundation. The Pig programming language is designed to handle any kind of data: structured, semistructured or unstructured. Pig consists of a language and an execution environment. Pig's language, called Pig Latin, is a data flow language. The architecture of Pig is shown in Figure 4.

Figure 4: The architecture of Pig

A. Pig Latin

Pig Latin is a high-level dataflow language that allows you to write data processing and analysis programs. Pig Latin allows us to define a data stream and a series of transformations that are applied to the data.

B. The Pig Latin compiler

The Pig Latin compiler converts the Pig Latin code into executable code. The executable code is either in the form of MapReduce jobs, or it can spawn a process where a virtual Hadoop instance is created to run the Pig code on a single node.

IV. HIVE ARCHITECTURE

Hive is a first step in building an open-source warehouse over a map-reduce data processing system (Hadoop) [9]. Hive sits on top of the Hadoop Distributed File System (HDFS) and MapReduce systems. Data in Hive is queried using the Hive Query Language (HiveQL). HiveQL is submitted via the Command Line Interface (CLI), the web UI or an external client like iReport using the Thrift, ODBC or JDBC interfaces. The driver passes the query to the compiler, where it goes through the typical parse, type check and semantic analysis phases, using the metadata stored in the Metastore [6]. The compiler generates a logical plan that is then optimized through a simple rule-based optimizer. Finally, an optimized plan in the form of map-reduce tasks and HDFS tasks is generated. The execution engine then executes these tasks in the order of their dependencies, using Hadoop. Figure 3 shows the major components of Hive and its interactions with Hadoop [7].

Figure 3: The Hive architecture

The main components of Hive are:

External Interface: Hive provides both a CLI and a web UI, and application programming interfaces (APIs) like JDBC and ODBC.

A. Thrift Server

The Hive Thrift Server enables a rich set of clients to access the Hive subsystem. Thrift [8] is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.

B. Metastore

The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the metastore service API. By default, Hive includes the Apache Derby RDBMS configured with the metastore.

C. Hive Driver

The Hive Driver compiles, optimizes, and executes the HiveQL. The Hive Driver may choose to execute HiveQL statements and commands locally or spawn


a MapReduce job, depending on the task. The Hive Driver stores table metadata in the metastore and its database.

D. Data Model

Similar to traditional databases, Hive stores data in tables, where each table consists of a number of rows, and each row consists of a specified number of columns.

V. NCDC WEATHER DATA ANALYTICS

NCDC provides weather datasets [2]. The big data collected by NCDC is used for analysis. NCDC maintains weather datasets for different purposes from 1901 to date. From the temperature datasets we generate the maximum temperature for every year from 1975 to 2016. The analysis is performed on this big data using MapReduce, Pig and Hive. For this purpose Hadoop, Pig and Hive are installed on an Ubuntu system in pseudo-distributed mode. The data generated by sensors is unstructured, which makes it challenging to analyze. A sample of the raw data collected from NCDC is shown in Figure 5.

Figure 5: Sample data from NCDC

The records are first stored in HDFS. They are split and sent to different mappers. Finally, all results go to the reducer. The MapReduce framework carries out the execution, as demonstrated in Figure 6.

Figure 6: Execution environment for a MapReduce program in Hadoop

The maximum temperature for each year from 1975 to 2016 obtained using MapReduce is shown in Figure 7. The status report of the JobTracker and the 100% completion of the map and reduce jobs are shown in Figure 8.

Figure 7: Maximum temperature using MapReduce

Figure 8: Completion of map and reduce jobs

Pig Latin is used in Pig to analyze the weather database. Figure 9 shows Pig's grunt shell and Figure 10 shows the Pig Latin script to find the maximum temperature from


the weather database in less time than writing MapReduce code. The maximum temperature for each year from 1975 to 2016 obtained using Pig is shown in Figure 11.

Figure 9: Pig's grunt shell

Figure 10: Pig Latin script to find the maximum temperature

Figure 11: Maximum temperature using Pig Latin

Along with Pig and MapReduce coding, Hive is also used to find the maximum temperature from the weather datasets, using HiveQL. The Hive CLI is shown in Figure 12. The HiveQL statements that find the maximum temperature from the weather database, in less time than writing MapReduce code, are shown in Figure 13. The maximum temperature for each year from 1975 to 2016 obtained using Hive is shown in Figure 14.

Figure 12: Hive CLI

Figure 14: Maximum temperature using Hive

VI. CONCLUSION

In traditional systems, processing millions of records is a time-consuming process. It is therefore a highly complex task to handle big data using traditional database management systems such as relational databases. In the era of

Internet of Things, the meteorological department uses different sensors to obtain temperature, humidity and other values. So we require highly parallel software, along with suitable tools, to handle these big databases. The data can then be organized using a distributed file system such as HDFS. Finally, it can be analyzed with faster response times with the help of MapReduce, Pig and Hive. Leveraging MapReduce with Hadoop to analyze big data is an effective solution.

Figure 13: HiveQL statements to find the maximum temperature

REFERENCES

[1] E. Laxmi Lydia and M. Ben Swarup, "Big Data Analysis using Hadoop Components like Flume, MapReduce, Pig and Hive," in IJCSET, issue 11, Nov.
[2]
[3]
[4] K. Morton, M. Balazinska and D. Grossman, "ParaTimer: a progress indicator for MapReduce DAGs," in Proceedings of the 2010 International Conference on Management of Data, 2010.
[5] Chuck Lam, Hadoop in Action.
[6] Hadoop for Dummies.
[7] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: a warehousing solution over a map-reduce framework," Proceedings of the VLDB Endowment, vol. 2, no. 2, 2009.
[8] A. Sumaray and S. K. Makki, "A comparison of data serialization formats for optimal efficiency on a mobile platform," in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication. ACM, p. 48.
[9] Haritha Chennamsetty, Suresh Chalasani, and Derek Riley, "Predictive Analytics on Electronic Health Records (EHRs) using Hadoop and Hive," in Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on, 5-7 March 2015.
