International Journal of Computer Engineering and Applications, BIG DATA ANALYTICS USING APACHE PIG Prabhjot Kaur


Prabhjot Kaur, Department of Computer Engineering, ME CSE (Big Data Analytics), Chandigarh University, Gharuan. kaurprabhjot770@gmail.


ABSTRACT: Data keeps expanding as applications and technologies advance. Day by day it increases in velocity, volume, value, veracity and variety. Hadoop is a solution for maintaining this data: it is a framework that stores big data in a distributed environment so that it can be processed in parallel. Hadoop has 2 components:

1- HDFS - Hadoop Distributed File System, which is used to store large amounts of data.
2- MAP REDUCE - the processing component of Apache Hadoop, with which we can process the big data stored in HDFS.

The problem is that we need to write a program in Java/Python to process big data. What if we come from a non-programming background? Are our Hadoop days over before they have even started? The answer is no, because the Hadoop ecosystem contains many other tools that do not require a programming background. In this paper we describe how Apache Pig came into the picture. Apache Pig is an open source, high-level data flow system developed by Yahoo. In Pig we write simple queries; the Pig tool converts these queries into a MapReduce program, the program is executed on the Hadoop cluster, and the result is sent back to the client.

Keywords: Big Data, Apache Pig, Hadoop, Map-reduce, HDFS

[1] INTRODUCTION
We live in a digitalized world, and with this digitalization data is being produced in ever larger volumes; this has led to big data. Big data is a term used to characterize the exponential growth and presence of data in unstructured and structured formats. Big data is described by the 5 V's: velocity, volume, variety, veracity and value. Data is produced in different formats and comes from different sources such as social networks, sensors and media. Big data is a collection of large, complicated data that is very hard to manage with traditional database management tools. Apache Hadoop is an open source, Java-based programming framework that supports the storage and processing of extremely large data sets in a distributed computing environment. It processes data in parallel in that environment. In this paper we discuss Apache Pig, its components, its architecture and the performance of Pig scripts.

[2] HADOOP
Apache Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. The Hadoop framework and its ecosystem include HDFS, MapReduce, Hive, Apache Pig, etc. There are two components of Hadoop:
1- Hadoop HDFS
2- Hadoop MapReduce

[2.1] Hadoop HDFS
HDFS is an abbreviation for Hadoop Distributed File System. HDFS is a Java-based file system that provides reliable and scalable data storage. It is used to store any kind of data across the cluster, and it divides files across the cluster.

[2.2] Map Reduce
Hadoop MapReduce is the processing element of Apache Hadoop. With it we can process the big data present in HDFS in parallel in the distributed environment.

Fig-1. Map reduce

We need MapReduce because data in HDFS is not stored in the traditional fashion: the data is divided into chunks, which are stored on the respective data nodes, so the complete data is never present at one single location. Hence a native application cannot process the data right away.
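The chunk-wise processing described above can be illustrated with a small, single-machine sketch. It is not Hadoop code: the `chunks` list is a hypothetical stand-in for file blocks held on separate data nodes, and the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here only for illustration. The sketch counts words, with the map step seeing one chunk at a time.

```python
from collections import defaultdict

# Hypothetical stand-in for HDFS: each entry plays the role of one chunk
# of a file stored on a different data node.
chunks = [
    "pig hadoop pig",
    "hadoop hdfs",
    "pig hdfs hdfs",
]


def map_phase(chunk):
    """Emit (word, 1) pairs for a single chunk, independently of other chunks."""
    return [(word, 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key across the outputs of all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(all_chunks):
    # In Hadoop the map step would run in parallel on the data nodes;
    # here it is a plain sequential loop.
    mapped = [map_phase(chunk) for chunk in all_chunks]
    return reduce_phase(shuffle(mapped))


result = word_count(chunks)
print(result)
```

No function ever needs the whole file in one place: the map step works per chunk, and only the small intermediate pairs are brought together for the reduce step.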
A special framework is therefore needed that has the capability to process the data where it is stored and bring back the result. Hadoop MapReduce is that kind of framework. In MapReduce we need to write a program in Java/Python to process big data, but what if we do not come from a programming background? Are our Hadoop days over? The answer is no, because besides MapReduce there are ecosystem tools that do not require programming. That is how Pig came into the picture.

[3] APACHE PIG
Apache Pig is an open source, high-level data flow system developed by Yahoo. In Apache Pig we write simple queries; the Pig tool converts these queries into a MapReduce program, the program is executed on the Hadoop cluster, and the result is sent back to the client.

10 lines of Pig Latin = approx. 200 lines of MapReduce Java program

[3.1] COMPONENTS OF APACHE PIG
A) PIG LATIN LANGUAGE - the language in which we write simple queries.
B) PIG EXECUTION - the execution engine, which converts Pig queries into MapReduce jobs.

Why do we need Apache Pig if MapReduce is there?

MAP REDUCE
1- More lines of code.
2- Development time is longer.
3- Low-level data processing paradigm.
4- Need to write complex programs in Java/Python.
5- Performing data operations in MapReduce is a humongous task.
6- Nested data types are not available in MapReduce.

APACHE PIG
1- Lines of code are reduced by 1/
2- Development time is also reduced by 1/16.
3- High-level data flow tool.
4- No need to write programs; we use Pig Latin queries.
5- Built-in support for data operations like joins, filters, ordering, sorting etc.
6- Provides nested data types like tuples, bags, maps.


Fig-2. Difference between lines of code in MR and Pig

[3.2] Twitter case study
Many people share content on Twitter every day, so big data is generated at Twitter at a very fast rate. It therefore became very important for Twitter to process this big data, so that it could improve what it offers its users and increase its user base. Twitter decided to move its archived data to HDFS and adopt Hadoop, so that it could analyze the data stored in Hadoop and come up with multiple insights on a daily, weekly or monthly basis. The task considered here is to analyze how many tweets are stored per user in the tweet table.

[3.2.1] High level implementation
Twitter stores its archived data in many tables; the insight we want to extract is related to these 2 tables:
User table - holds the user id and name.
Twitter table - holds all the tweets posted on Twitter.

Fig-3. High level implementation

[3.2.2] Detailed implementation

Fig-4. Detailed implementation

The tables are imported to HDFS with the Sqoop tool. Sqoop is one of the ecosystem tools of Hadoop and is used to move data from an RDBMS to HDFS or from HDFS to an RDBMS. The operations performed are:
1- import to HDFS.
2- load to Apache Pig.
3- join + group.
4- count/aggregate.
5- join user name with the data.
6- result is stored back to HDFS.
The stored output can then be used by the existing system to view the result.

[3.3] APACHE PIG ARCHITECTURE

Fig-5. Architecture of Apache Pig

Pig Latin scripts are used to write queries over big data. A query is executed through the Grunt shell, the native shell provided by Apache Pig for executing Pig queries. The query is then parsed, optimized, compiled and converted into MapReduce code, which is executed on the Hadoop cluster by the MapReduce engine.
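The operations listed for the Twitter case study in [3.2.2] (group, count, then join with the user names) can be sketched on in-memory data. This is only an illustration in Python, not Pig Latin and not Twitter's actual implementation: the sample rows, the column layout and the function name `tweets_per_user` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample rows. In the case study both tables would first be
# imported into HDFS and loaded into Pig (operations 1 and 2).
users = [(1, "alice"), (2, "bob"), (3, "carol")]  # (user_id, name)
tweets = [                                        # (user_id, text)
    (1, "hello"),
    (2, "pig"),
    (1, "hadoop"),
    (1, "hdfs"),
    (2, "latin"),
]


def tweets_per_user(user_rows, tweet_rows):
    """Return (name, tweet_count) pairs, sorted by name."""
    # Operations 3 and 4: group the tweets by user id and count each group.
    # In Pig Latin this would roughly correspond to GROUP followed by COUNT.
    counts = Counter(user_id for user_id, _text in tweet_rows)

    # Operation 5: join the counts with the user table to attach the names.
    # Users without tweets do not appear, as in an inner join.
    names = dict(user_rows)
    return sorted(
        (names[user_id], count)
        for user_id, count in counts.items()
        if user_id in names
    )


result = tweets_per_user(users, tweets)

# Operation 6 would store the result back to HDFS; here it is just printed.
for name, count in result:
    print(name, count)
```

In Pig each of these steps is a single statement, which is where the reduction in lines of code compared with hand-written MapReduce comes from.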

[3.4] APACHE PIG RUNNING MODES
Apache Pig has two running modes:
1) MAP REDUCE MODE - Apache Pig is executed over the Hadoop cluster and HDFS. In this mode the input and output are stored in HDFS. Command: pig
2) LOCAL MODE - The query is executed locally, that is, we process data present in a single, centralized file system. In this mode the input and output are in the local file system. Command: pig -x local

[3.5] PIG OPERATORS
1- LOAD - load data from the local file system or HDFS into Pig.
2- FILTER - filter a relation based on a condition.
3- JOIN - join 2 tables based on a column.
4- ORDER BY - sort a relation.
5- STORE - save the result to HDFS or the local file system.
6- DISTINCT - remove duplicate tuples from a relation.
7- GROUP - group the data on the basis of a particular field.

[4] CONCLUSION
Big data is growing in size day by day, and gathering such large amounts of data is a huge challenge for researchers. Hadoop provides a solution for managing big data. Apache Pig gives appropriate results by working with a data flow that is easy to understand, including its time and space complexity. Pig is a user-friendly tool. Using Apache Pig together with Hadoop can deliver appropriate results for structuring and analyzing big data with fewer lines of code than traditional coding. Research also shows that Apache Pig is one of the most efficient scripting platforms for studying and organizing large data with minimal implementation time.

