
COSC 6339 Big Data Analytics
Hadoop MapReduce Infrastructure: Pig, Hive, and Mahout
Edgar Gabriel, Fall 2018

Pig
- Pig is a platform for analyzing large data sets; an abstraction on top of Hadoop
- Provides a high-level programming language designed for data processing
- Pig scripts are converted into MapReduce jobs and executed on Hadoop clusters

Why use Pig?
- MapReduce requires programmers
  - Must think in terms of map and reduce functions
  - More than likely will require Java programming
- Pig provides a high-level language that can be used by analysts and scientists
  - Does not require know-how in parallel programming

Pig's Features
- Join datasets
- Sort datasets
- Filter
- Data types
- Group By
- User Defined Functions

Pig Components
- Pig Latin
  - Command-based language
  - Designed specifically for data transformation and flow expression
- Execution environment
  - The environment in which Pig Latin commands are executed
  - Supports local and Hadoop execution modes
- Pig compiler converts Pig Latin to MapReduce
  - Optimizations are applied automatically, compared to the user-level optimizations needed in manual MapReduce code

Running Pig
- Script: execute commands in a file
  $ pig scriptfile.pig
- Grunt: interactive shell for executing Pig commands
  - Started when a script file is NOT provided
  - Can execute scripts from Grunt via the run or exec commands
- Embedded: execute Pig commands using the PigServer class
  - Can have programmatic access to Grunt via the PigRunner class

Pig Latin concepts
- Building blocks
  - Field: a piece of data
  - Tuple: an ordered set of fields, represented with ( and )
    (10.4, 5, word, 4, field1)
  - Bag: a collection of tuples, represented with { and }
    { (10.4, 5, word, 4, field1), (this, 1, blah) }
- Some similarities to relational databases
  - A bag is like a table in a database
  - A tuple is like a row in a table

Simple Pig Latin example
- Load Grunt in default map-reduce mode; Grunt supports file system commands
$ pig
grunt> cat /input/pig/a.txt
a 1
d 4
c 9
k 6
- Load the contents of a text file into a bag called records, then display records on screen
grunt> records = LOAD '/input/a.txt' as (letter:chararray, count:int);
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>

Simple Pig Latin example
- No action is taken until the DUMP or STORE commands are encountered
  - Pig will parse, validate and analyze statements but not execute them
- STORE saves results (typically to a file; see the sketch below)
- DUMP displays the results to the screen
  - It doesn't make sense to print a large bag to the screen
  - For information and debugging purposes you can print a small subset to the screen
grunt> records = LOAD '/input/excite-small.log' AS (userid:chararray, timestamp:long, query:chararray);
grunt> toprint = LIMIT records 5;
grunt> DUMP toprint;
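Since DUMP is only meant for small result sets, larger results are typically written out with STORE instead. A minimal sketch (the output directory and delimiter are hypothetical, not from the lecture):

grunt> STORE toprint INTO '/output/toprint' USING PigStorage(',');

Pig writes the result as one or more part files under the given directory rather than as a single file.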

Simple Pig Latin example
LOAD 'data' [USING function] [AS schema];
- data: name of the directory or file
  - Must be in single quotes
- USING: specifies the load function to use
  - By default uses PigStorage, which parses each line into fields using a delimiter
  - Default delimiter is tab ('\t')
  - The delimiter can be customized using regular expressions (see the sketch below)
- AS: assigns a schema to the incoming data
  - Assigns names and types to fields (alias:type), e.g.
    (name:chararray, age:int, gpa:float)

records = LOAD '/input/excite-small.log' USING PigStorage() AS (userid:chararray, timestamp:long, query:chararray);

Pig Latin data types
int        Signed 32-bit integer                      10
long       Signed 64-bit integer                      10L or 10l
float      32-bit floating point                      10.5F or 10.5f
double     64-bit floating point                      10.5 or 10.5e2 or 10.5E2
chararray  Character array (string) in Unicode UTF-8  hello world
bytearray  Byte array (blob)
tuple      An ordered set of fields                   (T: tuple(f1:int, f2:int))
bag        A collection of tuples                     (B: bag {T: tuple(t1:int, t2:int)})
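As an illustration of the USING clause described above, a comma-delimited file could be loaded by handing the delimiter to PigStorage; this is a sketch with a hypothetical file name and schema, not an example from the lecture:

people = LOAD '/input/people.csv' USING PigStorage(',')
         AS (name:chararray, age:int, gpa:float);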

Pig Latin Diagnostic Tools
- Display the structure of a bag (sample output sketched below):
  grunt> DESCRIBE <bag_name>;
- Display the execution plan; produces various reports, e.g. the logical plan and the MapReduce plan:
  grunt> EXPLAIN <bag_name>;
- Illustrate how the Pig engine transforms the data:
  grunt> ILLUSTRATE <bag_name>;

Joining Two Data Sets
- Join steps
  1. Load records into a bag from input #1
  2. Load records into a bag from input #2
  3. Join the two data sets (bags) by the provided join key
- The default join is an inner join
  - Rows are joined where the keys match
  - Rows that do not have matches are not included in the result
[Venn diagram: inner join of Set 1 and Set 2]
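For example, applied to the records bag loaded earlier, DESCRIBE reports the schema along these lines (sketched output; the exact formatting differs slightly between Pig versions):

grunt> DESCRIBE records;
records: {letter: chararray, count: int}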

Simple join example
1. Load records into a bag from input #1
   posts = load '/input/user-posts.txt' using PigStorage(',') as (user:chararray, post:chararray, date:long);
2. Load records into a bag from input #2
   likes = load '/input/user-likes.txt' using PigStorage(',') as (user:chararray, likes:int, date:long);
3. Join the data sets: when a key is equal in both data sets, the rows are joined into a new single row; in this case, when the user name is equal
   userinfo = join posts by user, likes by user;
   dump userinfo;

$ hdfs dfs -cat /input/user-posts.txt
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394

$ hdfs dfs -cat /input/user-likes.txt
user1,12,1343182026191
user2,7,1343182139394
user3,0,1343182154633
user4,50,1343182147364

$ pig /code/innerjoin.pig
(user1,Funny Story,1343182026191,user1,12,1343182026191)
(user2,Cool Deal,1343182133839,user2,7,1343182139394)
(user4,Interesting Post,1343182154633,user4,50,1343182147364)

Outer Join
- Records which do not join with the other record set are still included in the result
- Left Outer
  - Records from the first data set are included whether they have a match or not; fields from the unmatched (second) bag are set to null (syntax sketched below)
- Right Outer
  - The opposite of a Left Outer Join: records from the second data set are included no matter what; fields from the unmatched (first) bag are set to null
- Full Outer
  - Records from both sides are included; for unmatched records the fields from the other bag are set to null

Pig Use Cases
- Loading large amounts of data
  - Pig is built on top of Hadoop -> scales with the number of servers
  - Alternative to manual bulk loading, e.g. in HBase
- Combining different data sources, e.g. collect web server logs, use external programs to fetch geo-location data for the users' IP addresses, join the new set of geo-located web traffic to stored click maps
- Support for data sampling
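Reusing the posts and likes bags from the inner-join example above, a left outer join keeps user5 even though there is no matching likes record. This is a sketch of the syntax, not an example from the lecture:

userinfo_left = JOIN posts BY user LEFT OUTER, likes BY user;
dump userinfo_left;

In the dumped result, user5's row would carry empty (null) fields in the likes columns, e.g. (user5,Yet Another Blog,13431839394,,,).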

Hive
- Data warehousing solution built on top of Hadoop
- Provides an SQL-like query language named HiveQL
  - Minimal learning curve for people with SQL expertise
  - Data analysts are the target audience
- Early Hive development work started at Facebook in 2007
- Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster

Hive
- Ability to bring structure to various data formats
- Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
- Access to files on various data stores such as HDFS and HBase
- Hive does NOT provide low-latency or real-time queries
  - Even querying small amounts of data may take minutes
  - Designed for scalability and ease of use rather than low-latency responses

Hive
- To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
  - Packaged with Derby, a lightweight embedded SQL DB
    - The default Derby setup is good for evaluation and testing
    - The schema is not shared between users, as each user has their own instance of embedded Derby
    - Stored in the metastore_db directory, which resides in the directory that Hive was started from
  - Can easily switch to another SQL installation such as MySQL (see the hive-site.xml sketch below)

Hive Interface Options
- Command Line Interface (CLI)
- Hive Web Interface: https://cwiki.apache.org/confluence/display/hive/hivewebinterface
- Java Database Connectivity (JDBC): https://cwiki.apache.org/confluence/display/hive/hiveclient

Hive data model (re-used from relational databases)
- Database: set of tables, used for name conflict resolution
- Table: set of rows that have the same schema (same columns)
- Row: a single record; a set of columns
- Column: provides value and type for a single value
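Switching the metastore from embedded Derby to MySQL is done through the JDBC connection properties in hive-site.xml. A minimal sketch, assuming a MySQL server and a metastore database already exist (host name, database name and credentials below are hypothetical):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>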

Hive - creating a table
hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 10.606 seconds
- Creates a table with 3 columns
- The ROW FORMAT, FIELDS TERMINATED BY and STORED AS clauses describe how the underlying file should be parsed

hive> show tables;
OK
posts
Time taken: 0.221 seconds

- Display the schema for the posts table:
hive> describe posts;
OK
user string
post string
time bigint
Time taken: 0.212 seconds

Hive - querying data
hive> select * from posts where user="user2";
...
OK
user2 Cool Deal 1343182133839
Time taken: 12.184 seconds

hive> select * from posts where time<=1343182133839 limit 2;
...
OK
user1 Funny Story 1343182026191
user2 Cool Deal 1343182133839
Time taken: 12.003 seconds
hive>
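The slides query the posts table without showing how data gets into it. One way is HiveQL's LOAD DATA statement, sketched here under the assumption that the comma-separated user-posts.txt file from the Pig examples is reused:

hive> LOAD DATA INPATH '/input/user-posts.txt' INTO TABLE posts;

LOAD DATA INPATH moves the file from its current HDFS location into Hive's warehouse directory; LOAD DATA LOCAL INPATH would copy a file from the local file system instead.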

Partitions
- To increase performance, Hive has the capability to partition data
  - The values of the partitioned column divide a table into segments
  - Entire partitions can be ignored at query time
  - Similar to relational database indexes, but not as granular
- Partitions have to be properly created by users
  - When inserting data, a partition must be specified (see the example below)
- At query time, whenever appropriate, Hive will automatically filter out partitions

Joins
- Hive supports outer joins
  - left, right and full joins
- Can join multiple tables
- The default join is an inner join
  - Rows are joined where the keys match
  - Rows that do not have matches are not included in the result
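As a sketch of how a partitioned table might be declared and populated (the table name, partition column and file path are made up for this example, not taken from the lecture):

hive> CREATE TABLE posts_by_country (user STRING, post STRING, time BIGINT)
    > PARTITIONED BY (country STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/input/user-posts-us.txt'
    > INTO TABLE posts_by_country PARTITION (country='US');
hive> select * from posts_by_country where country='US';

The last query only has to read the country='US' partition; all other partitions are skipped.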

Pig vs. Hive
- Hive
  - Uses an SQL-like query language called HiveQL (HQL)
  - Gives non-programmers the ability to query and analyze data in Hadoop
- Pig
  - Uses a workflow-driven scripting language
  - You don't need to be an expert Java programmer, but you do need a few coding skills
  - Can be used to convert unstructured data into a meaningful form

Mahout
- Scalable machine learning library
  - Built with MapReduce and Hadoop in mind
  - Written in Java
- Focuses on three application scenarios
  - Recommendation systems
  - Clustering
  - Classifiers
- Multiple ways of utilizing Mahout
  - Java interfaces
  - Command line interfaces
- The newest Mahout releases target Spark, not MapReduce anymore!

Classification
- Currently supported algorithms
  - Naïve Bayes classifier
  - Hidden Markov Models
  - Logistic regression
  - Random Forest

Clustering
- Currently supported algorithms
  - Canopy clustering
  - k-means clustering
  - Fuzzy k-means clustering
  - Spectral clustering
- Multiple tools available to support clustering
  - clusterdump: utility to output the results of a clustering to a text file
  - cluster visualization

Mahout input arguments
- Input data has to be sequence files and sequence vectors
  - Sequence file: generic Hadoop concept for binary files containing a list of key/value pairs
    - Classes are used for the key and the value pair
  - Sequence vector: binary file containing a list of key/(array of values) pairs
- For using Mahout algorithms, the key has to be Text and the value has to be of type VectorWritable (which is a Mahout class, not a Hadoop class; see the seq2sparse sketch below for text input)

Sequence Files
- Creating a sequence file using the command line:
gabriel@shark> mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles
- Looking at the output of a sequence file:
gabriel@shark> mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
Input Path: file:/lastfm/seqfiles/control-data.seq
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}
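When the input directory contains text documents rather than numeric data, the sequence files produced by seqdirectory still have to be converted into Text/VectorWritable pairs before clustering. Mahout's seq2sparse job does this conversion; the following is a sketch with hypothetical paths, not part of the original slides:

gabriel@shark> mahout seq2sparse -i /lastfm/seqfiles -o /lastfm/vectors

The clustering input would then be the generated TF-IDF vectors, e.g. /lastfm/vectors/tfidf-vectors.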

Using Mahout clustering
- Required inputs for a clustering run:
  - The SequenceFile containing the input vectors
  - The SequenceFile containing the initial cluster centers
  - The similarity measure to be used
  - The convergence threshold
  - The number of iterations to be done
  - The Vector implementation used in the input files

Using Mahout clustering
[This slide showed a code example that is not preserved in the transcription.]

Distance measures
- Euclidean distance measure
- Squared Euclidean distance measure
- Manhattan distance measure

Distance measures
- Cosine distance measure
- Tanimoto distance measure
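The slides list the measures by name only; for reference, the standard definitions for two n-dimensional vectors a and b are (added here for completeness, not from the original slides):

d_{\mathrm{Euclidean}}(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
d_{\mathrm{SquaredEuclidean}}(a,b) = \sum_{i=1}^{n} (a_i - b_i)^2
d_{\mathrm{Manhattan}}(a,b) = \sum_{i=1}^{n} |a_i - b_i|
d_{\mathrm{Cosine}}(a,b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
d_{\mathrm{Tanimoto}}(a,b) = 1 - \frac{a \cdot b}{\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b}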

Running Mahout clustering algorithms
bin/mahout kmeans -i <input vectors directory> \
    -c <input clusters directory> \
    -o <output working directory> \
    -k <optional number of initial clusters> \
    -dm <DistanceMeasure> \
    -x <maximum number of iterations> \
    -cd <optional convergence delta; default is 0.5> \
    -ow <overwrite output directory if present> \
    -cl <run input vector clustering after computing Canopies> \
    -xm <execution method: sequential or mapreduce>

mahout clusterdump -i /gabriel/clustering/canopy/clusters-0-final \
    --pointsdir /gabriel/clustering/canopy/clusteredpoints \
    -o /home/gabriel/mahouttest/synthetic-control-data/canopy.out