Configuring a Hadoop Environment for Test Data Management


© Copyright Informatica LLC 2016, 2017. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

You must install and configure a Hadoop environment if you want to perform Test Data Management (TDM) operations with Hadoop connections. This article describes how to install a Hadoop environment, configure the Data Integration Service, and configure Hive and Hadoop Distributed File System (HDFS) connections.

Supported Versions

Test Data Management 9.7.0
Test Data Management 9.7.1

Table of Contents

Overview
Configure Hadoop Environment
Step 1. Install RPM on Hadoop
Step 2. Configure Hadoop Cluster Properties
Step 3. Configure Hadoop Pushdown Properties for the Data Integration Service
Step 4. Configure Hadoop Connections
Creating a Hive Connection
Creating an HDFS Connection
Step 5. Configure Hive Properties

Overview

You can perform data masking, data domain discovery, and data movement operations on Big Data Edition Hadoop clusters. You must install a Hadoop environment for TDM. The Informatica Big Data Edition installation is distributed as a Red Hat Package Manager (RPM) installation package. The RPM package and the binary files that you need to run the Big Data Edition installation are compressed into a tar.gz file.

Configure Hadoop Environment

In TDM, you can use Hive or HDFS connections as sources or targets. Create Hive or HDFS connections in Test Data Manager to perform data masking, data domain discovery, and data movement operations.

To configure the Hadoop environment for TDM operations, perform the following steps:

1. Install RPM on Hadoop.
2. Configure Hadoop cluster properties.
3. Configure Hadoop pushdown properties for the Data Integration Service.
4. Configure Hadoop connections.
5. Configure Hive properties.

Step 1. Install RPM on Hadoop

You can install the RPM package for Hadoop on a single-node or a multiple-node cluster.

1. Install the Informatica RPM on the Hadoop machine that you want to use as the target.
2. If there are multiple nodes, install the RPM on all the nodes of the cluster. The installation path must be the same on all the nodes of the cluster. For example, you can install the RPM in the /opt folder on all the nodes of the cluster.

Step 2. Configure Hadoop Cluster Properties

Configure Hadoop cluster properties in the yarn-site.xml file that the Data Integration Service uses when it runs mappings on a Cloudera CDH cluster or a Hortonworks HDP cluster.

1. Copy the yarn-site.xml file from the Hadoop cluster to the following location:
   <Informatica installation directory>/services/shared/hadoop/<Hadoop_distribution_name>/conf/
2. Ensure that the following properties are present in the yarn-site.xml file that you copied:

   <name>mapreduce.jobhistory.address</name>
   <value><NAMENODE>:10020</value>
   <description>MapReduce JobHistory Server IPC host:port</description>

   <name>mapreduce.jobhistory.webapp.address</name>
   <value><NAMENODE>:19888</value>
   <description>MapReduce JobHistory Server Web UI host:port</description>

   <name>yarn.resourcemanager.scheduler.address</name>
   <value><NAMENODE>:8030</value>
   <description>The address of the Resource Manager scheduler interface, in host:port format</description>

3. Copy the hive-site.xml file from the Hadoop cluster to the following location:
   <Informatica installation directory>/services/shared/hadoop/<Hadoop_distribution_name>/conf/
4. Ensure that the following properties are updated in the hive-site.xml file that you copied:

   <name>hive.metastore.uris</name>
   <value>thrift://<NAMENODE>:9083</value>
   <description>Thrift URI for the remote metastore. Used by the metastore client to connect to the remote metastore.</description>

   <name>mapreduce.jobhistory.webapp.address</name>
   <value><NAMENODE>:19888</value>

   <name>fs.defaultFS</name>
   <value>hdfs://<NAMENODE>:8020</value>

   <name>mapreduce.jobhistory.address</name>
   <value><NAMENODE>:10020</value>

5. Verify that the ODBC entry files, TNS entry files, and DB2 installation entries are specified in the following location:
   <Informatica installation directory>/services/shared/hadoop/<Hadoop_distribution_name>/infaConf/hadoopEnv.properties

   The following example shows the environment variables that you can edit:

   infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/DataTransformation/bin:/opt/teradata/client/14.10/tbuild/lib64:/opt/teradata/client/14.10/odbc_64/lib:/databases/oracle_11.2.0/lib:/databases/db2v9.5_64bit/lib64:$HADOOP_NODE_HADOOP_DIST/lib/native:$HADOOP_NODE_INFA_HOME/ODBC7.1/lib:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH
   infapdo.env.entry.path=PATH=$HADOOP_NODE_HADOOP_DIST/scripts:$HADOOP_NODE_INFA_HOME/services/shared/bin:/databases/oracle_11.2.0/bin:/databases/db2v9.5_64bit/bin:$HADOOP_NODE_INFA_HOME/ODBC7.1/bin:$PATH
   infapdo.env.entry.oracle_home=ORACLE_HOME=/databases/oracle_11.2.0/
   infapdo.env.entry.tns_admin=TNS_ADMIN=/opt/ora_tns
   infapdo.env.entry.db2_home=DB2_HOME=/databases/db2v9.5_64bit
   infapdo.env.entry.db2instance=DB2INSTANCE=db95inst
   infapdo.env.entry.db2codepage=DB2CODEPAGE="1208"
   infapdo.env.entry.odbchome=ODBCHOME=$HADOOP_NODE_INFA_HOME/ODBC7.1
   infapdo.env.entry.odbcini=ODBCINI=/opt/odbcini/odbc.ini

6. When you install on multiple nodes of a cluster, copy the hdfs-site.xml, core-site.xml, and mapred-site.xml files from the /usr/lib/hadoop/conf directory on the cluster to the <Domain_installation>/services/shared/hadoop/<Hadoop_distribution_name>/conf directory.
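Before you continue, you can optionally confirm that the copied files contain the values you expect by loading them with the Hadoop client API. The following Java sketch is an illustration only and is not part of the Informatica installation: the installation path /opt/Informatica, the distribution name cloudera_cdh5u4, and the class name VerifyClusterConfig are assumed values, and the Hadoop client library (hadoop-common) must be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class VerifyClusterConfig {
        public static void main(String[] args) {
            // Example location only; substitute your Informatica installation
            // directory and Hadoop distribution name.
            String confDir = "/opt/Informatica/services/shared/hadoop/cloudera_cdh5u4/conf/";

            // Start from an empty configuration so that only the copied files are read.
            Configuration conf = new Configuration(false);
            conf.addResource(new Path(confDir + "yarn-site.xml"));
            conf.addResource(new Path(confDir + "hive-site.xml"));

            // Print the properties that items 2 and 4 of Step 2 ask you to verify.
            String[] keys = {
                "mapreduce.jobhistory.address",
                "mapreduce.jobhistory.webapp.address",
                "yarn.resourcemanager.scheduler.address",
                "hive.metastore.uris",
                "fs.defaultFS"
            };
            for (String key : keys) {
                System.out.println(key + " = " + conf.get(key));
            }
        }
    }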

Step 3. Configure Hadoop Pushdown Properties for the Data Integration Service

Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hive environment.

1. Log in to the Administrator tool.
2. In the Manage Services and Nodes view, select the Data Integration Service in the domain from the Navigator pane.
3. Click the Processes tab on the right pane.
4. In the Execution Options section, configure the following properties:

   Informatica Home Directory on Hadoop
   The Big Data Edition home directory on every data node, created by the Hadoop RPM install. Type /<HadoopInstallationDirectory>/Informatica.

   Hadoop Distribution Directory
   The directory that contains a collection of Hive and Hadoop JAR files on the cluster from the RPM install locations. The directory contains the minimum set of JAR files required to process Informatica mappings in a Hadoop environment. Type /<HadoopInstallationDirectory>/Informatica/services/shared/hadoop/<Hadoop_distribution_name>.

   Data Integration Service Hadoop Distribution Directory
   The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes. The Hadoop distribution name at the end of the path determines which JAR files are used when mappings run in the Hadoop environment and on the Data Integration Service. The Hadoop RPM installs the Hadoop distribution directories in the following path: <Informatica installation directory>/services/shared/hadoop/<Hadoop_distribution_name>.

5. Restart the Data Integration Service.
   Note: When you create the Data Integration Service, the Mapping Service Module is enabled.

Step 4. Configure Hadoop Connections

After you install RPM on a Hadoop machine and configure the Data Integration Service, you must create Hadoop connections in Test Data Manager. You can create Hive or HDFS connections to perform TDM operations.

Creating a Hive Connection

In Test Data Manager, create a Hive connection and use the connection as a source or a target when you perform TDM operations.

1. Log in to Test Data Manager.
2. Click Administrator > Connections.
3. Click Actions > New Connection.
   The New Connection wizard appears with the connection properties.
4. Select the Hive connection type.
5. Enter the connection name, description, and owner information.
   The following image shows the New Connection wizard parameters:
6. Click Next.

7. To use Hive as a source or a target, select Access Hive as a source or target.
8. To use the connection to run mappings in the Hadoop cluster, select Use Hive to run mappings in Hadoop cluster.
9. To access the Hive database, enter the user name.
   The following image shows the connection modes and attributes that you can configure for the Hive connection:
10. Click Next.
11. To access the metadata from the Hadoop cluster, enter the metadata connection string in the following format: jdbc:hive2://<NodeName>:10000/default
    For example: jdbc:hive2://ivlhdp35:10000/default
    You can also create a schema and provide the schema name instead of the default schema.
12. To access data from the Hadoop cluster, enter the data access connection string in the following format: jdbc:hive2://<NodeName>:10000/default
    For example: jdbc:hive2://ivlhdp35:10000/default
    You can also create a schema and provide the schema name instead of the default schema.
13. To run mappings in the Hadoop cluster, enter the following parameters:
    Database Name. Enter the name default for tables that do not have a specified database name.
    Default FS URI. Enter the URI to access the default HDFS in the following format: hdfs://<NodeName>:8020/
    For example: hdfs://ivlhdp35:8020
    JobTracker/YARN Resource Manager URI. Enter the specific node in the Hadoop cluster in the following format: <NodeName>:<Port>. For Cloudera, the port is 8032, and for Hortonworks, the port is 8050.
    Hive Warehouse Directory on HDFS. Enter the HDFS file path of the default database. For example, the following file path specifies a local warehouse: /user/hive/warehouse
14. To access a Hive metastore, select Local or Remote.
    Remote. Connects to the Thrift server, which in turn interacts with the Hive server.
    Local. Uses a JDBC connection string to connect directly to the MySQL database.
    Default is Local.
15. To connect to a remote metastore, select Remote. If you select Remote, specify only the Remote Metastore URI with the Thrift server details in the following format: thrift://<NodeName>:9083
    For example: thrift://ivlhdp35:9083

    The following image shows the Hive properties that you can configure:
16. If you select Local mode, specify the Metastore Database URI, driver, user name, and password.
    The following image shows the local metastore execution mode properties that you can configure:
17. To test the connection, click Test Connection.
18. To save the connection, click OK.
    The connection is visible in the Administrator > Connections view.
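Outside of Test Data Manager, you can also sanity-check the metadata and data access connection strings with any JDBC client before you use them in the connection. The following Java sketch is a minimal illustration, not part of the TDM configuration: the host ivlhdp35, the user name hiveuser, and the class name VerifyHiveConnection are assumed values, and the Hive JDBC driver (hive-jdbc) with its dependencies must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VerifyHiveConnection {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Same format as the metadata and data access connection strings.
            String url = "jdbc:hive2://ivlhdp35:10000/default";

            // Connect with the Hive user name and list the tables in the schema.
            try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

If the query returns without errors, the same connection string and user name should work when you test the Hive connection in Test Data Manager.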

Creating an HDFS Connection

In Test Data Manager, create an HDFS connection and use the connection as a source or a target when you perform TDM operations.

1. Log in to Test Data Manager.
2. Click Administrator > Connections.
3. Click Actions > New Connection.
   The New Connection wizard appears with the connection properties.
4. Select the HDFS connection type.
5. Enter the connection name, description, and owner information.
   The following image shows the New Connection wizard with the HDFS connection parameters:
6. Click Next.
7. To access HDFS, enter the user name.
8. For the HDFS URI, enter the NameNode URI in the following format: hdfs://<NameNode>:8020
   HDFS runs on port 8020.
9. Enter the directory for the Hadoop instance on which you want to perform TDM operations.
   The following image shows the connection properties that you can configure for the HDFS connection:
10. To test the connection, click Test Connection.
11. To save the connection, click OK.
    The connection is visible in the Administrator > Connections view.
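You can also verify the NameNode URI, user name, and directory with the Hadoop FileSystem API before you create the connection. The following Java sketch is an illustration only: the host ivlhdp35, the user name hdfsuser, the directory /user/hdfsuser, and the class name VerifyHdfsConnection are assumed values, and the Hadoop client libraries (hadoop-common and hadoop-hdfs) must be on the classpath.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VerifyHdfsConnection {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Connect to the NameNode URI as the user that you enter in the HDFS connection.
            FileSystem fs = FileSystem.get(new URI("hdfs://ivlhdp35:8020"), conf, "hdfsuser");

            // List the directory that you plan to use for TDM operations.
            for (FileStatus status : fs.listStatus(new Path("/user/hdfsuser"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }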

Step 5. Configure Hive Properties

To run the mappings from TDM, you must configure the Hive pushdown connection.

1. Click Administrator > Preferences.
2. In the Hive Properties section, click Edit.
3. Select the Hive connection.
4. To view the mappings in the Data Integration Service in the Administrator tool, enable Persist Mapping.
   The following image shows the Hive properties that you can configure:

You can now perform data masking, data movement, and data domain discovery operations on Hadoop connections.

Author
Krishnakanth K S
Senior Software Engineer, QA

Acknowledgements
The author would like to acknowledge Ramesh Manchala, QA Engineer, for his technical assistance.