InfoSphere DataStage Hive Connector to read data from Hive data sources



Alekhya Telekicherla (alekhya102@in.ibm.com), Software Developer, IBM
Pallavi Koganti (palkogan@in.ibm.com), Software Developer, IBM
Srinivas Mudigonda, Lead Software Developer, IBM India Pvt Ltd
Sunil Kumar Mogulla, Application Developer, IBM

22 March 2017

This article describes a solution based on the integration of IBM InfoSphere DataStage with Apache Hive. Data can be fetched from various Hive data sources into Information Server modules for further processing. You will learn how IBM InfoSphere Information Server can be used to perform read operations on a Hive data source. This step-by-step guide helps you create, configure, compile, and execute DataStage Hive Connector jobs that read data from Apache Hive.

Introduction

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. Queries are expressed in a SQL-like language called HiveQL, which Hive automatically translates into MapReduce jobs executed on Hadoop.

An efficient solution is needed to move information from different Hive data sources into the ETL space for further operations. The integration of IBM InfoSphere DataStage with Apache Hive is achieved by the InfoSphere Hive Connector, which is a DataStage component. The Hive Connector stage fetches data from Hive and passes it to other Information Server modules for further ETL processing. This solution helps Hive users make intelligent business decisions based on the data.

Copyright IBM Corporation 2017


developerWorks (ibm.com/developerworks/)

Configuring Hive Connector in Read mode

Hive Connector supports normal read and partitioned read, using either generated SQL or user-defined SQL. This section demonstrates a sample use case that performs a read operation on Hive using the Hive Connector stage. The DataStage job includes a Hive Connector stage, which specifies the details for accessing Apache Hive, and a Sequential File stage, to which the extracted data is written. In Read mode, the Hive Connector in a DataStage job supports only one output link.

1. Generated SQL

The steps required to read data from Hive in generated SQL mode are as follows.

Figure 1. Hive Connector Read job

Setting up Hive Connector properties

1. In the Properties tab, set "Generated SQL at run time" to Yes and provide a value for "Table name", as shown in Figure 2.
2. If the table is partitioned and you want to use parallelism in the form of partitioned reads, set "Enable Partitioned Reads" to Yes.

Figure 2. Generated SQL Read properties


3. The connector uses the primary partition key to exploit parallelism. In this case, pc1 is the primary partition column of the table, and the generated statements have the following format: select c1, c2 from part_test4 where pc1=1
4. Note that the statement generated by Hive Connector is in regular SQL format, not in HiveQL format. The driver handles the conversion from SQL to HiveQL internally.
5. Under Output, provide the name and type of each column that you want to extract, as shown in Figure 3.

Figure 3. Column Properties

6. Provide the file name details in the Sequential File stage.
7. Compile and run the job.

Figure 4. Job Execution1

8. The output is shown in Figure 5.

Figure 5. Output Rows

2. User-defined SQL

The steps required to read data from Hive in user-defined SQL mode are as follows.


Figure 6. Hive Connector Read job 2

Setting up Hive Connector properties

1. In the Properties tab, set "Generated SQL at run time" to No.
2. Provide the read statement to be executed in the "Select Statement" property.
3. If the table is partitioned and you want to use parallelism in the form of partitioned reads, set "Enable Partitioned Reads" to Yes.

Figure 7. User Defined SQL Read properties

   i) For a partitioned read, provide the "Select Statement" in the following format: "select c1,c2 from part_test4 where pc1=[[part-value]]", where pc1 is the primary partition column and [[part-value]] is a placeholder that is replaced by the values of the partition column during the job run.
   ii) Note that the connector accepts only the primary (first) partition column of the table as the partition column for the select statement.
   iii) If the table is not partitioned, the job aborts because the user-defined query is no longer valid.

4. Under Output, provide the name and type of each column that you want to extract, as shown in Figure 8.

Figure 8. Column Properties

5. Provide the file name details in the Sequential File stage.
6. Compile and run the job.

Figure 9. Job Execution2

7. The output is shown in Figure 10.

Figure 10. Output Rows2
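The two statement formats described above can be illustrated with a short Python sketch. It is not the connector's implementation, and it does not talk to Hive; it only shows, under stated assumptions, what the per-partition statements look like as text. The function names and the partition values 1, 2, 3 are hypothetical. The table part_test4, the columns c1 and c2, the partition column pc1, and the [[part-value]] placeholder come from the article.

```python
# Illustrative sketch only: this is NOT the Hive Connector's implementation.
# Function names and the partition values used below are hypothetical.

PLACEHOLDER = "[[part-value]]"


def generated_statement(table, columns, partition_column, value):
    """Build a statement in the format the article shows for generated SQL."""
    column_list = ", ".join(columns)
    return f"select {column_list} from {table} where {partition_column}={value}"


def expand_user_statement(statement, partition_values):
    """Return one statement per partition value for user-defined SQL.

    Values are inserted as-is, which matches the numeric example pc1=1.
    Quoting of string values is outside the scope of this sketch.
    """
    if PLACEHOLDER not in statement:
        raise ValueError(f"statement has no {PLACEHOLDER} placeholder")
    return [statement.replace(PLACEHOLDER, str(v)) for v in partition_values]


if __name__ == "__main__":
    values = [1, 2, 3]  # hypothetical values of the partition column pc1

    # Generated SQL mode: one statement per partition value.
    for v in values:
        print(generated_statement("part_test4", ["c1", "c2"], "pc1", v))

    # User-defined SQL mode: the placeholder is replaced per partition value.
    user_sql = "select c1,c2 from part_test4 where pc1=[[part-value]]"
    for stmt in expand_user_statement(user_sql, values):
        print(stmt)
```

In both modes the result is one select statement per value of the primary partition column, which is what allows the reads to be spread across parallel workers. How the connector actually assigns partition values to workers is not described in the article and is not modeled here.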

Resources

Infocenter link: com.ibm.swg.im.iis.conn.hive.usage.doc/topics/hive_connector_top_of_nav.html


About the authors

Alekhya Telekicherla

Alekhya Telekicherla is a software developer working in the IBM InfoSphere Information Server Connectivity team. She has around 7 years of experience at IBM in the Data Integration domain and has worked on the development of various connectors, including MDM, Hive, ODBC, and Sybase. She has a Bachelor's degree in Computer Science Engineering from IIT Guwahati.

Pallavi Koganti

Pallavi Koganti is a developer working in the Data Integration portfolio in IBM InfoSphere Information Server. She has 11 years of experience in software development. Having worked in domains ranging from Network and Systems Management to Data Integration, she is always interested in working on the latest technologies. She holds a Master's degree (MCA) from Andhra University.

Srinivas Mudigonda

Srinivas Mudigonda is a lead developer working in the Data Integration portfolio in IBM InfoSphere Information Server. He has over 16 years of experience in the IT industry, ranging from distributed file systems to the Data Integration domain. He is fascinated by the latest technologies and keen on applying them to complex customer problems. He has a Bachelor's degree in Electrical and Electronics Engineering (Hons.) from BITS Pilani.

Sunil Kumar Mogulla

Sunil K Mogulla has around 6 years of experience as a Senior QA in IBM Information Server, handling various DataStage connectors, including Hive, File, JDBC, Oracle, ODBC, and Streams. He has been involved in the implementation of Hadoop solutions using Information Server. He worked as an Oracle PL/SQL developer for 3 years, supporting performance tuning and design with Oracle Database. He is a Certified Oracle Associate on the Developer track, which includes SQL and PL/SQL.
