Lab 3 Pig, Hive, and JAQL

Lab objectives

In this lab you will practice what you have learned in this lesson. Specifically, you will work with the Pig, Hive, and Jaql languages.

Lab instructions

This lab has been developed as a tutorial. Simply execute the commands provided and analyze the results. BigInsights should already be started before working on this lab (refer to Lab 0 Setup for instructions on starting BigInsights).

Part 1: Pig

In this tutorial, we are going to use Apache Pig to process the 1988 subset of the Google Books 1-gram records and produce a histogram of the frequencies of words of each length. A subset of this database (0.5 million records) has been stored in the file googlebooks-1988.csv under the /BigDataUniversity/input directory.

1. Examine the format of the Google Books 1-gram records:

> head -5 /BigDataUniversity/input/googlebooks-1988.csv

The columns represent the word, the year, the number of occurrences of that word in the corpus, the number of pages on which that word appeared, and the number of books in which that word appeared.

2. Copy the data file into HDFS.

> hadoop fs -put /BigDataUniversity/input/googlebooks-1988.csv googlebooks-1988.csv
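As an optional sanity check (not part of the original lab steps), you can confirm that the copy succeeded before moving on by using the standard hadoop fs commands; the exact listing you see will depend on your user directory:

> hadoop fs -ls googlebooks-1988.csv
> hadoop fs -cat googlebooks-1988.csv | head -5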

3. Start Pig.

> pig
grunt>

4. We are going to use a Pig UDF to compute the length of each word. The UDF is located inside the piggybank.jar file. Use the REGISTER command to load this jar file:

grunt> REGISTER /opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar;

The first step in processing the data is to LOAD it:

grunt> records = LOAD 'googlebooks-1988.csv' AS (word:chararray, year:int, wordcount:int, pagecount:int, bookcount:int);

This statement returns instantly; Pig delays the actual processing until output is requested.

To produce a histogram, group the records by the length of the word:

grunt> grouped = GROUP records BY org.apache.pig.piggybank.evaluation.string.LENGTH(word);

Sum the word counts for each word length using the SUM function with the FOREACH ... GENERATE command:

grunt> final = FOREACH grouped GENERATE group, SUM(records.wordcount);

Use the DUMP command to print the result to the console. This causes all the previous steps to be executed:

grunt> DUMP final;

This produces a histogram listing each word length together with the total number of word occurrences of that length.

5. Quit Pig.

grunt> quit
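If you would rather run the whole flow non-interactively, the same statements can be collected into a script and run in batch mode, with a STORE statement writing the histogram to HDFS instead of dumping it to the console. The sketch below is one way to do this; the script name wordhist.pig and the output path histogram-1988 are placeholders, not part of the lab:

-- wordhist.pig: batch version of the interactive steps above (sketch)
REGISTER /opt/ibm/biginsights/pig/contrib/piggybank/java/piggybank.jar;

-- Alias the piggybank UDF so the statements stay readable
DEFINE WORDLEN org.apache.pig.piggybank.evaluation.string.LENGTH();

records = LOAD 'googlebooks-1988.csv'
          AS (word:chararray, year:int, wordcount:int, pagecount:int, bookcount:int);
grouped = GROUP records BY WORDLEN(word);
hist    = FOREACH grouped GENERATE group AS wordlength, SUM(records.wordcount) AS total;

-- Write the result to HDFS instead of DUMPing it to the console
STORE hist INTO 'histogram-1988' USING PigStorage('\t');

You could then run it with pig wordhist.pig and inspect the result with hadoop fs -cat histogram-1988/part-*.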

Part 2: Hive

In this tutorial, we are going to use Hive to process the 1988 subset of the Google Books 1-gram records and produce a histogram of the frequencies of words of each length. A subset of this database (0.5 million records) has been stored in the file googlebooks-1988.csv under the /BigDataUniversity/input directory.

1. Start Hive.

> hive
hive>

2. Create a table called wordlist.

hive> CREATE TABLE wordlist (word STRING, year INT, wordcount INT, pagecount INT, bookcount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Load the data from the googlebooks-1988.csv file into the wordlist table.

hive> LOAD DATA LOCAL INPATH '/BigDataUniversity/input/googlebooks-1988.csv' OVERWRITE INTO TABLE wordlist;
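As an optional check (not part of the original lab steps), you can confirm that the table definition and the loaded rows look as expected before building the histogram:

hive> DESCRIBE wordlist;
hive> SELECT * FROM wordlist LIMIT 5;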

4. Create a table named wordlengths to store the counts for each word length for our histogram.

hive> CREATE TABLE wordlengths (wordlength INT, wordcount INT);

5. Fill the wordlengths table with word length data computed from the wordlist table using the built-in length function.

hive> INSERT OVERWRITE TABLE wordlengths SELECT length(word), wordcount FROM wordlist;
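length is one of Hive's built-in functions. If you are curious what a built-in does, Hive can describe it for you; this is optional and not required for the lab:

hive> SHOW FUNCTIONS;
hive> DESCRIBE FUNCTION length;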

6. Produce the histogram by summing the word counts grouped by word length.

hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength;
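For comparison, the intermediate wordlengths table is not strictly necessary; the same histogram can be computed in a single query directly against wordlist. This is just an alternative formulation if you only need the final result:

hive> SELECT length(word) AS wordlength, sum(wordcount) AS total FROM wordlist GROUP BY length(word);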

7. Quit Hive.

hive> quit;
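The hive shell can also run statements non-interactively, which is convenient once you are happy with a query. For example (the file name histogram.hql is only a placeholder):

> hive -e "SELECT length(word), sum(wordcount) FROM wordlist GROUP BY length(word);"
> hive -f histogram.hql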

Part 3: Jaql

In this tutorial, we are going to use Jaql to process the 1988 subset of the Google Books 1-gram records and produce a histogram of the frequencies of words of each length. A subset of this database (0.5 million records) has been stored in the delimited file googlebooks-1988.del under the /BigDataUniversity/input directory.

1. Examine the format of the Google Books 1-gram records:

> head -5 /BigDataUniversity/input/googlebooks-1988.del

The columns represent the word, the year, the number of occurrences of that word in the corpus, the number of pages on which that word appeared, and the number of books in which that word appeared.

2. Copy the googlebooks-1988.del file to HDFS.

> hadoop fs -put /BigDataUniversity/input/googlebooks-1988.del googlebooks-1988.del

3. Start the Jaql shell using the -c option to connect to the running Hadoop cluster.

> jaqlshell -c
jaql>

4. Read the comma-delimited file from HDFS.

jaql> $wordlist = read(del("googlebooks-1988.del", { schema: schema { word: string, year: long, wordcount: long, pagecount: long, bookcount: long } }));

5. Transform each word into its length by applying the strlen function.

jaql> $wordlengths = $wordlist -> transform { wordlength: strlen($.word), wordcount: $.wordcount };

6. Produce the histogram by summing the word counts grouped by word length.

jaql> $wordlengths -> group by $word = {$.wordlength} into { $word.wordlength, counts: sum($[*].wordcount) };

This produces a histogram listing each word length together with the total number of word occurrences of that length.
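If you want to keep the histogram rather than just display it, Jaql can write query results back to HDFS. The sketch below assumes that the del() I/O adapter can also be used for output in this BigInsights release; the variable $hist and the output path histogram-1988.del are only placeholders:

jaql> $hist = $wordlengths -> group by $word = {$.wordlength} into { $word.wordlength, counts: sum($[*].wordcount) };
jaql> $hist -> write(del("histogram-1988.del"));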

7. Quit Jaql.

jaql> quit;

------ This is the end of this lab ------