Comprehensive Analysis of Hadoop Ecosystem Components: MapReduce, Pig and Hive


N. Suneetha 1, Ch. Sekhar 2, A. Viswanath Sharma 3 and P. Sandhya 3
1,2,3,4 Vignan's Institute of Information Technology, Duvvada, Visakhapatnam
1 suneekir9@gmail.com, 2 sekhar1203@gmail.com

Abstract. Big Data is a term for large-volume, complex, growing data sets from multiple autonomous sources, generated in fields ranging from economic and business activities to public administration, and from national security to scientific research. It is an emerging technology that draws considerable attention from researchers seeking to extract value from voluminous datasets. As the velocity of data growth increases along with the technological challenges, the organization and storage of data are of primary concern. To address a given big data situation, the parameters to consider are, first, the background of the data; then the value chain phases, namely data generation, data acquisition, data storage and data analysis; and finally the representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, financial service sectors and the smart grid. This paper emphasizes the performance and working of data analysis through Hadoop ecosystem components such as MapReduce, Pig and Hive.

Keywords: Big Data, MapReduce, Pig, Hive.

1 Introduction

Over the past few decades, data has grown on a large scale in various fields. According to a report from International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (approximately 10^21 B), which had increased by nearly nine times within five years [1]. This figure is expected to at least double every two years in the near future. Under this explosive growth of information, the term Big Data refers to large or complex data sets that cannot be handled by traditional database technologies.
Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. Big data is characterized by the five Vs, namely volume, variety, velocity, value and veracity (Fig. 1). This 5V definition highlights the meaning and necessity of big data.

ISSN: ASTL Copyright 2017 SERSC

Volume: Volume refers to the generation and collection of masses of data. Nowadays the volume, or size, of data in different areas and enterprises exceeds terabytes and petabytes. The task of Big Data is to extract valuable information from high volumes of low-density, unstructured Hadoop data, that is, data of unknown value, such as Twitter data feeds, click streams on a web page or a mobile app, network traffic, sensor-enabled equipment capturing data at the speed of light, and many more. Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data collected from many transactions and stored in individual files or databases so as to be accessible, searchable, processable and manageable. As an industry example, global service providers such as Google, Facebook and Twitter produce, analyze and store huge amounts of data as part of their regular activity and production services. Dealing with such large volumes of data requires Big Data development.

Variety: Variety deals with the complexity of big data and with the information and semantic models behind the data. Big data comes from a great variety of sources and generally falls into four types: structured, semi-structured, unstructured and mixed data. Structured data enters a data warehouse already tagged and easily sorted, whereas unstructured data is random and difficult to analyze. Unstructured and semi-structured data types, such as text, audio and video, do not conform to fixed fields but contain tags to separate data elements, and they require additional processing both to derive meaning and to produce the supporting metadata.

Velocity: Velocity is the fast rate at which Big Data streams are generated into memory or onto disks by arrays of sensors or multiple events, and need to be processed in real time, near real time, in batch, or as streams (as in the case of visualisation). Velocity is required not only for big data, but also for all processes.
For time-limited processes, big data should be used as it streams into the organization in order to maximize its value [4,16]. Applications such as the Internet of Things (IoT), consumer e-commerce and mobile communications deal with large amounts of data in real time or near real time.

Value: Whereas variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, web pages and text as well as traditional structured data, value is an important feature of Big Data defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. Data value depends on the events or processes the data represent, such as stochastic, probabilistic, regular or random ones. For example, in consumer applications the intrinsic value of data is derived using quantitative and investigative techniques, from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. Depending on this, requirements may be imposed to collect all data,

store it for a longer period (for some possible event of interest), etc. However, finding the value of big data requires new discovery processes to make more accurate and precise decisions.

Veracity: Big Data veracity ensures that the data used are trusted, authentic and protected from unauthorized access and modification. The data must be secured during their whole lifecycle, from collection from trusted sources to processing on trusted compute facilities and storage on protected and trusted storage facilities. Data veracity relies entirely on the security infrastructure deployed and available from the Big Data infrastructure.

With this definition, the characteristics of big data may be summarized as five Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), Value (huge value but very low density) and Veracity (trusted and authentic), as shown in Fig. 1.

Fig. 1. Five V's of Big Data

This paper is organized as follows. Section II presents the Hadoop ecosystem components. Section III illustrates the distributed working environment. Section IV gives the conclusion of the presented work.

2 Hadoop EcoSystem Components

The size of datasets is currently increasing at a rapid pace, to the tune of petabytes (PB), which makes data analysis difficult. The challenges for such analysis are capture, curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. To meet these challenges, "parallel data processing software" such as the Hadoop framework is required. Hadoop is an open-source framework that allows Big Data to be stored and processed in a distributed environment across clusters of computers using simple programming models. The Hadoop platform consists of two main services: a reliable, distributed file system called the Hadoop Distributed File System (HDFS), and a high-performance parallel data processing engine called Hadoop MapReduce.
Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM,

and Amazon. Data is acquired from diverse sources, such as social networks, traditional enterprise data or sensor data.

The two main components of Hadoop are the Hadoop Distributed File System (HDFS), which provides distributed storage, and the MapReduce framework, which provides distributed processing. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch operations against massive datasets that can scale from tens of terabytes to petabytes in size. Figure 2 shows the Hadoop ecosystem.

Fig. 2. Hadoop ecosystem

In the ecosystem of Hadoop, there have been several recent research projects exploiting sharing opportunities and eliminating unnecessary data movements, e.g. [2]. For huge amounts of inconsistent, incomplete and noisy data, a number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies [5].

Data capture and storage

Data sets are captured from different sources, such as traditional organizational data from different enterprises (including transactional ERP data, web store transactions, general ledger data, etc.), machine-generated or sensor data (including data from information-sensing mobile devices, aerial sensory technologies, remote sensing, radio-frequency identification readers, etc.), social data (including data from blogging sites, social media platforms, etc.) and so

on. The world's technological capacity to store information is increasing exponentially, in terms of quintillions of bytes.

Data organization

Data is organized into the following file systems, namely the Google File System (GFS) and the Hadoop Distributed File System (HDFS).

Google File System

Google Inc. developed a distributed file system for its own use, designed for efficient and reliable access to data using large clusters of commodity hardware. It uses the approach of "Big Files", which were created by Larry Page and Sergey Brin. Here files are partitioned into fixed-size chunks of 64 MB with a replication factor of 3. HDFS is the distributed storage layer, which places data into fixed-size blocks, each replicated three times. There is one master node and multiple slave nodes. MapReduce jobs are processed using Version 1 and Version 2.

Data analysis

Pig and Hive are frameworks with inherent MapReduce functionality and good processing speed. The analysis is performed on huge databases ranging from thousands to lakhs of records using Hive and MapReduce. Both frameworks run on different platforms such as Red Hat Linux, Cloudera, Hortonworks, etc. Figure 3 shows the Uber transport dataset, which was collected from GitHub and placed in HDFS using the hadoop put command. Ubuntu Server 14.04 LTS is used to run the operations on the specified data set.

Fig. 3. Uber data set from the GitHub site
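The fixed-size chunking and three-way replication described above for GFS and HDFS can be made concrete with a little arithmetic. The following is a minimal Python sketch, not part of Hadoop or GFS: the function name chunk_layout is hypothetical, the 64 MB chunk size and replication factor of 3 are the values quoted in the text, and the sketch assumes that a final partial chunk occupies only its actual bytes.

```python
def chunk_layout(file_size_bytes, chunk_size_bytes=64 * 1024**2, replication=3):
    """Return (chunks, replicas, raw_bytes) for one file.

    chunks    : number of fixed-size chunks the file is split into
    replicas  : total chunk copies stored across the cluster
    raw_bytes : storage consumed, assuming a partial last chunk is not padded
    """
    if file_size_bytes < 0 or chunk_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes must be non-negative and factors positive")
    # Integer ceiling division: a partial last chunk still counts as a chunk.
    chunks = -(-file_size_bytes // chunk_size_bytes)
    replicas = chunks * replication
    raw_bytes = file_size_bytes * replication
    return chunks, replicas, raw_bytes


if __name__ == "__main__":
    one_gib = 1024**3
    chunks, replicas, raw = chunk_layout(one_gib)
    print(f"1 GiB file: {chunks} chunks, {replicas} replicas, {raw / one_gib:.0f} GiB raw")
```

Under these assumptions a 1 GiB file is split into 16 chunks and stored as 48 chunk copies, so the cluster spends three times the logical file size on raw storage in exchange for fault tolerance.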

The data shown in Figure 3 above is taken from the GitHub site and has the fields DATE/TIME, LATITUDE, LONGITUDE and BASE. For the given data, the number of cars with base numbers waiting at a particular latitude position is analyzed, and the execution speeds of Pig and Hive are compared.

4 Working of Distributed Environment

Map Reduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to schedule tasks effectively on the nodes where data is already present, resulting in high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions through implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration.
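The map, sort and reduce phases described above can be imitated in a few lines of single-process Python. This is only a sketch of the programming model, not Hadoop code: the function names map_phase, shuffle, reduce_phase and run_job are hypothetical, and the records and base codes in the example are made up, loosely following the DATE/TIME, LATITUDE, LONGITUDE, BASE layout of the data set.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (base, 1) pair for a CSV record 'datetime,lat,lon,base'."""
    fields = record.split(",")
    yield fields[3], 1


def shuffle(pairs):
    """Shuffle/sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory input."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle(pairs).items()]


if __name__ == "__main__":
    # Hypothetical records; a real job would read input splits from HDFS.
    sample = [
        "t1,40.7,-74.0,B2",
        "t2,40.7,-74.0,B1",
        "t3,40.8,-73.9,B2",
    ]
    print(run_job(sample))
```

In a real cluster each map call would run on the node holding its input split and the grouped keys would be partitioned across reducers; the sketch keeps everything in one process so that the data flow is easy to follow.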
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

PIG: Pig was initially developed at Yahoo Research around 2006 and later moved into the Apache Software Foundation. Pig consists of a scripting language and an execution environment based on the Hadoop MapReduce framework. The Pig scripting language is usually called Pig Latin [6]. Pig entered the world of big data because many industries and companies found it difficult to handle even a small amount of data without writing a great many lines of code, which wasted time and money. Pig Latin connects processing steps easily with very few lines of code. It is a high-level language and does not require Java. Pig supports a dataflow language. Pig can handle complex data structures, even those with several levels of nesting. It has two types of execution environment, local and distributed. The local environment is used

for testing when the distributed environment cannot be deployed. A Pig Latin program is a collection of statements. A statement can be an operation or a command. The Pig distributed environment is chosen with the command pig. Pig local mode is chosen with the command pig -x local.

Fig. 4. Execution of PIG

Figure 4 shows the execution process of the Pig framework. The Grunt shell is the console for writing and executing Pig scripts. The execution engine converts the scripts into MapReduce jobs, and the result is stored in HDFS.

Handling of semi-structured data using Pig

Semi-structured data with extensions such as .xml, .jpeg, etc. can also be handled using Pig. To handle .xml files, some additional libraries need to be included with Pig. The two additional libraries for Pig are as follows.

Piggy Bank
1) Collection of useful LOAD, STORE and UDF functions.
2) Has many user-defined functions.
3) Open-source Apache project that must be downloaded separately and linked with Pig.

Apache DataFu
1) Collection of libraries for working with large-scale data in Hadoop.
2) Project inspired by the need for stable, well-tested libraries for data mining and statistics.
3) Also an Apache open-source project, included with Apache Pig from version 0.14.

The additional library Piggy Bank is used for loading the data into Pig from HDFS. For handling the XML file, the XMLLoader() function from Piggy Bank is used.
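As a rough illustration of what loading XML by element tag amounts to, the following Python sketch (not Pig, and not the Piggy Bank implementation) splits a small document into one string record per employee element. It assumes that the loader yields each matching element as a single text record; the function name load_elements is hypothetical, and the two records reuse values from the employee example in this paper.

```python
import xml.etree.ElementTree as ET

# Two employee records in the same shape as the paper's sample file.
SAMPLE = (
    "<Data>"
    "<employee><id>5</id><name>aravind</name><gender>m</gender></employee>"
    "<employee><id>6</id><name>krishna</name><gender>m</gender></employee>"
    "</Data>"
)


def load_elements(xml_text, tag):
    """Return every element named `tag` as one serialized string record."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.iter(tag)
    ]


if __name__ == "__main__":
    for record in load_elements(SAMPLE, "employee"):
        # Each record is still raw XML; fields are extracted in a later step.
        name = ET.fromstring(record).findtext("name")
        print(name, "->", record)
```

Keeping each element as an opaque string mirrors the single chararray field used in the Pig script; extracting id, name and gender is a separate parsing step.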

The following is the sample XML file:

    <Data>
      <employee>
        <id>5</id>
        <name>aravind</name>
        <gender>m</gender>
      </employee>
      <employee>
        <id>6</id>
        <name>krishna</name>
        <gender>m</gender>
      </employee>
      <employee>
        <id>7</id>
        <name>thiru</name>
        <gender>m</gender>
      </employee>
      <employee>
        <id>8</id>
        <name>mani</name>
        <gender>m</gender>
      </employee>
    </Data>

The above data is loaded into Pig using the LOAD command, whose syntax is as follows:

    employee = LOAD 'input/employee.xml' using org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray);

The output of the script described above is as follows:

The loaded records are projected with:

    s = foreach employee generate x;

Data is displayed using DUMP, e.g.:

    dump employee;

Hive

Apache Hive is a data warehouse system for Apache Hadoop [7]. Hive is a technology developed by Facebook which turns Hadoop into a complete data warehouse with an extension of SQL for querying. HiveQL is the declarative language used by Hive; its commands execute against a schema. The configuration can be set in three ways, including editing the hive-site.xml file or using the set command at the command prompt, for example set hive-conf. The database specified above is used to describe the processing of Hive.

DATA ANALYSIS USING HIVE

The Uber data set is taken from GitHub. The data can be downloaded from GitHub using the wget command with the link address in the console. The data is then moved from the local system to the Hadoop Distributed File System using the hadoop put command, e.g.:

    hadoop fs -put localfileaddress destinationaddress
    hadoop fs -put uber-rawdata-may14.csv /user/allamrajuviswanath_gmail/viswanath ;

Open the Hive console using the command hive. The Hive shell starts in the console with the prompt hive>.

Hive: First create a database with a name. Next create a table for the data set. For the Uber data set, the table is created as follows:

    create table ubersep14(pickdatetime string, lat float, log float, base string)
    row format delimited fields terminated by ',' stored as textfile;

Load the data into the table using the command:

    LOAD DATA INPATH '/user/allamrajuviswanath_gmail/viswanath/uberraw-data-sep14.csv' OVERWRITE INTO TABLE ubersep14;

The query to retrieve the total number of base numbers of Uber cars from the table is as follows:

    select count(base) from ubersep14;

The resultant output screen is shown below:

To count the number of Uber cars with their base numbers at a particular latitude position, the command used is (the latitude value is left as a placeholder):

    select base, count(*) as count from ubersep14 where lat = <latitude> group by base;
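The shape of the two queries above can be tried out without a Hadoop cluster. The following Python sketch uses the standard-library sqlite3 module as a stand-in for Hive, which is an assumption made purely for illustration: SQLite is not Hive, the function name count_by_base is hypothetical, the rows are made up, and the longitude column is named lon here rather than log.

```python
import sqlite3


def count_by_base(rows, lat):
    """Count rows per base at one latitude, mirroring the GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE ubersep14 "
            "(pickdatetime TEXT, lat REAL, lon REAL, base TEXT)"
        )
        conn.executemany("INSERT INTO ubersep14 VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT base, COUNT(*) AS cnt FROM ubersep14 "
            "WHERE lat = ? GROUP BY base ORDER BY base",
            (lat,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical pickup rows: (datetime, latitude, longitude, base code).
    sample_rows = [
        ("t1", 40.75, -73.99, "B1"),
        ("t2", 40.75, -73.98, "B2"),
        ("t3", 40.75, -73.97, "B1"),
        ("t4", 40.70, -74.00, "B1"),
    ]
    print(count_by_base(sample_rows, 40.75))
```

The filter compares latitudes for exact equality, as the original query does; with measured coordinates a small tolerance or a rounded latitude column may be a more robust choice.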

5 Conclusion

The results above show that huge databases can be analyzed with little time and effort. The Pig, Hive and MapReduce frameworks can perform analyses in a short time. Hive can analyze a database of more than 8 lakh records in only 34 seconds, and Pig can do the same in 40 seconds. Hence these components make it possible to handle and use vast databases in a simple and efficient way.

References

1. Apache Hadoop. Available at
2. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-MapReduce Translator. In ICDCS.
3. H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation-based Optimizer for MapReduce Workflows. In VLDB.
4. T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing Across Multiple Queries in MapReduce. In VLDB.
5. X. Wang, C. Olston, A. D. Sarma, and R. Burns. CoScan: Cooperative Scan Sharing in the Cloud. In SoCC.
6. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy. Hive: A Petabyte Scale Data Warehouse Using Hadoop. Facebook Data Infrastructure Team.
7. Zhifeng Yang, Qichen Tu, Kai Fan, Lei Zhu, Rishan Chen, Bo Peng. Performance Gain with Variable Chunk Size in GFS-like File Systems. Journal of Computational Information Systems 4:3.
8. Sam Madden. From Databases to Big Data. IEEE Computer Society.
9. Sanjeev Dhawan and Sanjay Rathee. Big Data Analytics using Hadoop Components like Pig and Hive. American International Journal of Research in Science, Technology, Engineering & Mathematics, pp. 1-5.
