
About the Tutorial

HCatalog is a table storage management tool for Hadoop that exposes the tabular data of the Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. HCatalog ensures that users don't have to worry about where or in what format their data is stored.

This is a small tutorial that explains just the basics of HCatalog and how to use it.

Audience

This tutorial is meant for professionals aspiring to make a career in Big Data Analytics using the Hadoop framework. ETL developers and professionals who are into analytics in general may also use this tutorial to good effect.

Prerequisites

Before proceeding with this tutorial, you need a basic knowledge of Core Java, database concepts of SQL, the Hadoop file system, and any of the Linux operating system flavors.

Copyright & Disclaimer

Copyright 2016 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

PART 1: HCATALOG BASICS

1. HCatalog Introduction
   What is HCatalog?
   Why HCatalog?
   HCatalog Architecture

2. HCatalog Installation
   Step 1: Verifying Java Installation
   Step 2: Installing Java
   Step 3: Verifying Hadoop Installation
   Step 4: Downloading Hadoop
   Step 5: Installing Hadoop in Pseudo Distributed Mode
   Step 6: Verifying Hadoop Installation
   Step 7: Downloading Hive
   Step 8: Installing Hive
   Step 9: Configuring Hive
   Step 10: Downloading and Installing Apache Derby
   Step 11: Configuring the Hive Metastore
   Step 12: Verifying Hive Installation
   Step 13: Verifying HCatalog Installation

3. HCatalog CLI

PART 2: HCATALOG CLI COMMANDS

4. HCatalog Create Table
   Create Table Statement
   Load Data Statement

5. HCatalog Alter Table
   Alter Table Statement
   Rename To Statement
   Change Statement
   Add Columns Statement
   Replace Statement
   Drop Table Statement

6. HCatalog View
   Create View Statement
   Drop View Statement

7. HCatalog Show Tables

8. HCatalog Show Partitions
   Show Partitions Statement
   Dynamic Partition
   Adding a Partition
   Renaming a Partition
   Dropping a Partition

9. HCatalog Indexes
   Creating an Index
   Dropping an Index

PART 3: HCATALOG APIS

10. HCatalog Reader Writer
    HCatReader
    HCatWriter

11. HCatalog Input Output Format
    HCatInputFormat
    HCatOutputFormat

12. HCatalog Loader and Storer
    HCatLoader
    HCatStorer
    Running Pig with HCatalog
    Setting the CLASSPATH for Execution

Part 1: HCatalog Basics

1. HCATALOG INTRODUCTION

What is HCatalog?

HCatalog is a table storage management tool for Hadoop. It exposes the tabular data of the Hive metastore to other Hadoop applications and enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. It ensures that users don't have to worry about where or in what format their data is stored.

HCatalog works as a key component of Hive, and it enables users to store their data in any format and any structure.

Why HCatalog?

Enabling the right tool for the right job

The Hadoop ecosystem contains different tools for data processing such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. If all these tools share one metastore, then the users of each tool have immediate access to data created with another tool. No loading or transfer steps are required. (A minimal sketch of such a workflow is given at the end of this section.)

Capture processing states to enable sharing

HCatalog can publish your analytics results, so other programmers can access your analytics platform via REST. The schemas you publish are also useful to other data scientists, who can use your discoveries as inputs into a subsequent discovery.

Integrate Hadoop with everything

Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption, it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and an SQL-like language. Enterprise data management systems use HCatalog to integrate more deeply with the Hadoop platform.
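To make the shared-metastore workflow concrete, here is a minimal sketch (not part of the original tutorial text): one table is defined once through the HCatalog CLI, read from Pig through HCatLoader, and analyzed from Hive, with no copy or export step in between. The table name, columns, and query are invented for illustration, and the HCatLoader package shown (org.apache.hive.hcatalog.pig.HCatLoader) is the one used by recent Hive releases; older releases shipped it as org.apache.hcatalog.pig.HCatLoader.

$ hcat -e "CREATE TABLE web_logs (ip STRING, url STRING) PARTITIONED BY (dt STRING);"

$ cat > read_logs.pig <<'EOF'
-- HCatLoader fetches the schema from the shared metastore,
-- so the Pig script does not re-declare the columns.
raw = LOAD 'web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP raw;
EOF
$ pig -useHCatalog read_logs.pig

$ hive -e "SELECT url, COUNT(*) FROM web_logs GROUP BY url;"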

HCatalog Architecture

The following illustration shows the overall architecture of HCatalog.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command-line interface for issuing data definition and metadata exploration commands.
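As a hedged illustration of the format support just described (not from the original text): the file format is declared in the table's DDL, and HCatalog then picks up the matching SerDe from the metastore; for a custom format you name the SerDe, InputFormat, and OutputFormat classes yourself. The table names below are invented, and the com.example.* classes are placeholders for whatever custom classes you would provide.

$ hcat -e "CREATE TABLE orders_orc (id INT, amount DOUBLE) STORED AS ORC;"

$ hcat -e "CREATE TABLE orders_custom (id INT, amount DOUBLE)
           ROW FORMAT SERDE 'com.example.MyCustomSerDe'
           STORED AS INPUTFORMAT 'com.example.MyCustomInputFormat'
                     OUTPUTFORMAT 'com.example.MyCustomOutputFormat';"

The first form uses one of the formats Hive and HCatalog already know how to serialize; the second form is only needed when none of the built-in formats fit.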

2. HCATALOG INSTALLATION

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux flavor on your system. HCatalog was merged into the Hive installation on March 26, 2013. From version Hive-0.11.0 onwards, HCatalog comes with the Hive installation. Therefore, follow the steps given below to install Hive, which in turn will automatically install HCatalog on your system.

Step 1: Verifying Java Installation

Java must be installed on your system before installing Hive. You can use the following command to check whether you have Java already installed on your system:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you don't have Java installed on your system, then you need to follow the steps given below.

Step 2: Installing Java

Download Java (JDK <latest version> - X64.tar.gz) from the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system. Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz

$ tar zxf jdk-7u71-linux-x64.tar.gz

$ ls
jdk1.7.0_71  jdk-7u71-linux-x64.tar.gz

To make Java available to all the users, you have to move it to the location /usr/local/. Open root and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply the changes to the current running session and verify the installation using the java -version command as explained above.

$ source ~/.bashrc
$ java -version

Step 3: Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command:

$ hadoop version

If Hadoop is already installed on your system, then you will get the following response:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the following steps.

Step 4: Downloading Hadoop

Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1 hadoop
# exit

Step 5: Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes to the current running system.

$ source ~/.bashrc
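Not part of the original steps, but a quick sanity check you can run at this point, assuming the paths used above: confirm that the new variables are visible and that the hadoop command found on the PATH is the freshly extracted 2.4.1 build.

$ echo $HADOOP_HOME    # should print /usr/local/hadoop
$ hadoop version       # should now report Hadoop 2.4.1, as shown in Step 3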

Hadoop Configuration

You can find all the Hadoop configuration files in the location $HADOOP_HOME/etc/hadoop. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the read/write buffers. Open core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode path of your local file systems, that is, the place where you want to store the Hadoop infrastructure. Let us assume the following data:

dfs.replication (data replication value) = 1

(In the following path, /hadoop/ is the user name; hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note: In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN into Hadoop. Open the yarn-site.xml file and add the following property in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following property in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Step 6: Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Namenode Setup

Set up the namenode using the command "hdfs namenode -format" as follows:

$ cd ~
$ hdfs namenode -format

The expected result is as follows:

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Verifying Hadoop DFS

The following command is used to start the DFS. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Verifying Yarn Script

The following command is used to start the Yarn script. Executing this command will start your Yarn daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get the Hadoop services on your browser.

http://localhost:50070/

Verifying All Applications of the Cluster

The default port number to access all the applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/
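In addition to the two browser URLs above, you can confirm from the terminal that the daemons are running and that HDFS accepts commands. This is an optional check, not part of the original steps; the /user/hadoop directory simply matches the hadoop user assumed throughout this installation.

$ jps                              # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
$ hdfs dfs -mkdir -p /user/hadoop  # create a home directory in HDFS for the hadoop user
$ hdfs dfs -ls /                   # the listing should now show /user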

Once you are done with the installation of Hadoop, proceed to the next step and install Hive on your system.

End of ebook preview. If you liked what you saw, buy it from our store @ https://store.tutorialspoint.com