Apache Hive Cookbook

Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world

Hanish Bansal
Saurabh Chauhan
Shrey Mehrotra

BIRMINGHAM - MUMBAI
Apache Hive Cookbook

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2016

Production reference: 1260416

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-108-0

www.packtpub.com
Credits

Authors: Hanish Bansal, Saurabh Chauhan, Shrey Mehrotra
Reviewer: Aristides Villarreal Bravo
Commissioning Editor: Wilson D'souza
Acquisition Editor: Tushar Gupta
Content Development Editor: Anish Dhurat
Project Coordinator: Bijal Patel
Proofreader: Safis Editing
Indexer: Priya Sane
Graphics: Kirk D'Penha
Production Coordinator: Shantanu N. Zagade
Cover Work: Shantanu N. Zagade
Technical Editor: Vishal K. Mewada
Copy Editor: Dipti Mankame
About the Authors

Hanish Bansal is a software engineer with over 4 years of experience in developing big data applications. He loves to study emerging solutions and applications, mainly related to big data processing, NoSQL, natural language processing, and neural networks. He has worked with technologies such as Spring Framework, Hibernate, Hadoop, Hive, Flume, Kafka, and Storm, NoSQL databases such as HBase, Cassandra, and MongoDB, and search engines such as Elasticsearch.

In 2012, he graduated in Information Technology from Jaipur Engineering College and Research Center, Jaipur, India. He was also the technical reviewer of the book Apache ZooKeeper Essentials. In his spare time, he loves to travel and listen to music.

You can read his blog at http://hanishblogger.blogspot.in/ and follow him on Twitter at https://twitter.com/hanishbansal786.

I would like to thank my parents for their love, support, encouragement, and the amazing chances they've given me over the years.

Saurabh Chauhan is a module lead with close to 8 years of experience in data warehousing and big data applications. He has worked with multiple Extract, Transform, and Load tools, such as Oracle Data Integrator and Informatica, as well as with big data technologies such as Hadoop, Hive, Pig, Sqoop, and Flume.

He completed his bachelor of technology in 2007 at Vishveshwarya Institute of Engineering and Technology. In his spare time, he loves to travel and discover new places. He also has a keen interest in sports.

I would like to thank everyone who has supported me throughout my life.
Shrey Mehrotra has 6 years of IT experience and, for the past 4 years, has been designing and architecting cloud and big data solutions for the governmental and financial domains.

Having worked with big data R&D Labs and Global Data and Analytical Capabilities, he has gained insights into Hadoop, focusing on HDFS, MapReduce, and YARN. His technical strengths also include Hive, Pig, Spark, Elasticsearch, Sqoop, Flume, Kafka, and Java. He likes spending time performing R&D on different big data technologies. He is the coauthor of the book Learning YARN, a certified Hadoop developer, and has also written various technical papers. In his free time, he listens to music, watches movies, and spends time with friends.

I would like to thank my mom and dad for giving me support to accomplish anything I wanted. Also, I would like to thank my friends, who bear with me while I am busy writing.
About the Reviewer

Aristides Villarreal Bravo is a Java developer, a member of the NetBeans Dream Team, and a Java User Groups leader. He has organized and participated in various conferences and seminars related to Java, JavaEE, NetBeans, NetBeans Platform, free software, and mobile devices, nationally and internationally. He has written tutorials and blogs about Java, NetBeans, and web development. He has participated in several interviews on sites such as NetBeans, NetBeans Dzone, and JavaHispano. He has developed plugins for NetBeans. He has been a technical reviewer for the book PrimeFaces Blueprints. Aristides is the CEO of Javscaz Software Developers. He lives in Panamá.

To my mother, father, and all family and friends.
Table of Contents

Preface

Chapter 1: Developing Hive
  Introduction
  Deploying Hive on a Hadoop cluster
  Deploying Hive Metastore
  Installing Hive
  Configuring HCatalog
  Understanding different components of Hive
  Compiling Hive from source
  Hive packages
  Debugging Hive
  Running Hive
  Changing configurations at runtime

Chapter 2: Services in Hive
  Introducing HiveServer2
  Understanding HiveServer2 properties
  Configuring HiveServer2 high availability
  Using HiveServer2 Clients
  Introducing the Hive metastore service
  Configuring high availability of metastore service
  Introducing Hue

Chapter 3: Understanding the Hive Data Model
  Introduction
  Using numeric data types
  Using string data types
  Using Date/Time data types
  Using miscellaneous data types
  Using complex data types
  Using operators
  Partitioning
  Partitioning a managed table
  Partitioning an external table
  Bucketing

Chapter 4: Hive Data Definition Language
  Introduction
  Creating a database schema
  Dropping a database schema
  Altering a database schema
  Using a database schema
  Showing database schemas
  Describing a database schema
  Creating tables
  Dropping tables
  Truncating tables
  Renaming tables
  Altering table properties
  Creating views
  Dropping views
  Altering the view properties
  Altering the view as select
  Showing tables
  Showing partitions
  Show the table properties
  Showing create table
  HCatalog
  WebHCat

Chapter 5: Hive Data Manipulation Language
  Introduction
  Loading files into tables
  Inserting data into Hive tables from queries
  Inserting data into dynamic partitions
  Writing data into files from queries
  Enabling transactions in Hive
  Inserting values into tables from SQL
  Updating data
  Deleting data