Introduction to Hive
Cloudera, Inc.

Outline
Motivation
Overview
Data Model
Working with Hive
Wrap-up & Conclusions

Background
Hive started at Facebook. Data was collected by nightly cron jobs into an Oracle database, with ETL done in hand-coded Python. The data grew from tens of GBs in 2006 to 1 TB/day of new data in 2007, and is now 10x that.

Hadoop as Enterprise Data Warehouse
Scribe and MySQL data are loaded into HDFS, and Hadoop MapReduce jobs process the data. Missing components:
A command-line interface for end users
Ad-hoc query support without writing full MapReduce jobs
Schema information

Hive Applications
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g., Google Analytics)
Predictive modeling and hypothesis testing

Hive Components
Shell: allows interactive queries, like a MySQL shell connected to a database; web and JDBC clients are also supported
Driver: manages session handles, fetch, and execute
Compiler: parses, plans, and optimizes queries
Execution engine: runs a DAG of stages (MapReduce, HDFS, or metadata operations)
Metastore: holds each table's schema, location in HDFS, and SerDe

Data Model
Tables: typed columns (int, float, string, date, boolean); also list and map types for JSON-like data
Partitions: e.g., range-partition tables by date
Buckets: hash partitions within ranges (useful for sampling and join optimization); see the sketch below
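To make the table/partition/bucket distinction concrete, here is a minimal sketch; the page_views table, its columns, and the particular partition and bucket choices are hypothetical, not from the original deck:

  hive> CREATE TABLE page_views (user_id INT, url STRING)
        PARTITIONED BY (ds STRING)
        CLUSTERED BY (user_id) INTO 32 BUCKETS
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;

Each distinct ds value becomes its own partition, and within each partition rows are hashed on user_id into 32 buckets, which is what enables the sampling and join optimizations mentioned above.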

Metastore
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout) and partition data
Implemented with the JPOX ORM layer; can be backed by Derby, MySQL, and many other relational databases
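A quick way to see what the metastore records for a table is DESCRIBE EXTENDED, which prints the columns plus the stored metadata (HDFS location, SerDe, partition keys); for example, with the shakespeare table created later in this tutorial:

  hive> DESCRIBE EXTENDED shakespeare;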

Physical Layout
Warehouse directory in HDFS, e.g., /home/hive/warehouse
Tables are stored in subdirectories of the warehouse
Partitions and buckets form subdirectories of tables
Actual data is stored in flat files: control-character-delimited text, or SequenceFiles
With a custom SerDe, Hive can use an arbitrary file format
A sketch of the resulting directory tree follows.
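As a hypothetical illustration (table and partition names invented, listing trimmed to just the paths), a warehouse holding an unpartitioned table and a date-partitioned one might look like:

  $ hadoop fs -ls /home/hive/warehouse
  /home/hive/warehouse/shakespeare
  /home/hive/warehouse/page_views
  $ hadoop fs -ls /home/hive/warehouse/page_views
  /home/hive/warehouse/page_views/ds=2009-04-01
  /home/hive/warehouse/page_views/ds=2009-04-02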

Starting the Hive shell
Start a terminal and run:
  $ cd /usr/share/cloudera/hive/
  $ bin/hive
You should see a prompt like:
  hive>

Creating tables
  hive> SHOW TABLES;
  hive> CREATE TABLE shakespeare (freq INT, word STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;
  hive> DESCRIBE shakespeare;

Generating Data
Let's get (word, frequency) data from the Shakespeare data set:
  $ hadoop jar \
      $HADOOP_HOME/hadoop-*-examples.jar \
      grep input shakespeare_freq '\w+'

Loading data
Remove the MapReduce job logs:
  $ hadoop fs -rmr shakespeare_freq/_logs
Load the dataset into Hive:
  hive> LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;

Selecting data
  hive> SELECT * FROM shakespeare LIMIT 10;
  hive> SELECT * FROM shakespeare WHERE freq > 100
        SORT BY freq ASC LIMIT 10;

Most common frequency
  hive> SELECT freq, COUNT(1) AS f2 FROM shakespeare
        GROUP BY freq SORT BY f2 DESC LIMIT 10;
  hive> EXPLAIN SELECT freq, COUNT(1) AS f2 FROM shakespeare
        GROUP BY freq SORT BY f2 DESC LIMIT 10;

Joining tables
A powerful feature of Hive is the ability to create queries that join tables together. We have (freq, word) data for Shakespeare, and we can calculate the same for the King James Version of the Bible (KJV). Let's see which words show up a lot in both.

Create the dataset:
  $ tar zxf ~/bible.tar.gz -C ~
  $ hadoop fs -put ~/bible bible
  $ hadoop jar \
      $HADOOP_HOME/hadoop-*-examples.jar \
      grep bible bible_freq '\w+'

Create the new table
  hive> CREATE TABLE kjv (freq INT, word STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;
  hive> SHOW TABLES;
  hive> DESCRIBE kjv;

Import data to Hive
  $ hadoop fs -rmr bible_freq/_logs
  hive> LOAD DATA INPATH 'bible_freq' INTO TABLE kjv;

Create an intermediate table
  hive> CREATE TABLE merged (word STRING, shake_f INT, kjv_f INT);

Running the join
  hive> INSERT OVERWRITE TABLE merged
        SELECT s.word, s.freq, k.freq FROM shakespeare s
        JOIN kjv k ON (s.word = k.word)
        WHERE s.freq >= 1 AND k.freq >= 1;
  hive> SELECT * FROM merged LIMIT 20;

Most common intersections
Which words appeared most frequently in both corpora?
  hive> SELECT word, shake_f, kjv_f, (shake_f + kjv_f) AS ss
        FROM merged SORT BY ss DESC LIMIT 20;

Some more advanced features
TRANSFORM: use custom MapReduce scripts inside SQL statements (sketch below)
Custom SerDe: read and write arbitrary file formats
Metastore check tool
Structured query log
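As a minimal sketch of TRANSFORM, where the script name my_script.py is hypothetical (any executable that reads tab-separated rows on stdin and writes rows to stdout will do):

  hive> ADD FILE my_script.py;
  hive> SELECT TRANSFORM (freq, word)
        USING 'python my_script.py'
        AS (freq INT, word STRING)
        FROM shakespeare;

Hive streams each (freq, word) row through the script, much as Hadoop Streaming does, and reads the script's output back as the rows of the query result.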

Project Status
Open source under the Apache 2.0 license
An official subproject of Apache Hadoop
Four committers (all from Facebook)
First release candidate coming soon

Related work
Parallel databases: Gamma, Bubba, Volcano
Google: Sawzall
Yahoo: Pig
IBM Research: JAQL
Microsoft: DryadLINQ, SCOPE
Greenplum: YAML MapReduce
Aster Data: in-database MapReduce
Business.com: CloudBase

Conclusions
Hive supports rapid iteration on ad-hoc queries
It can perform complex joins with minimal code
It scales to handle much more data than many similar systems