How to Install and Configure EBF15545 for MapR 4.0.2 with MapReduce 2


© 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

Enable Big Data Edition to run mappings on a Hadoop cluster on MapR 4.0.2 with MapReduce 2.

Supported Versions

Informatica Big Data Edition 9.6.1 HotFix 2

Table of Contents

Overview
Step 1. Download EBF15545
Step 2. Update the Informatica Domain
Applying EBF15545 to the Informatica Domain
Configuring MapR Distribution Variables for Mappings in a Hive Environment
Configuring Hadoop Cluster Properties in yarn-site.xml
Step 3. Update the Hadoop Cluster
Applying EBF15545 to the Hadoop Cluster
Configure the Heap Space for the MapR-FS
Verifying the Cluster Details
Step 4. Update the Developer tool
Applying EBF15545 to the Informatica Clients
Configuring the Developer tool
Step 5. Update PowerCenter
Configuring the PowerCenter Integration Service
Copying MapR Distribution Files for PowerCenter Mappings in the Native Environment
Enable User Impersonation for Native and Hive Execution Environments
Connections Overview
HDFS Connection Properties
HBase Connection Properties
Hive Connection Properties
Creating a Connection
Troubleshooting

Overview

EBF15545 adds support for MapR 4.0.2 with MapReduce 2 to Informatica 9.6.1 HotFix 2.

Note: Teradata Connector for Hadoop 1.3.3 (Command Line Edition) does not support MapR 4.0.2. Only MapR 3.1 is supported.

To apply the EBF and configure Informatica, perform the following tasks:

1. Download the EBF.

2. Update the Informatica domain.
   Note: If the Data Integration Service runs on a machine that uses SUSE 11, the native mode of execution and Hive pushdown are not supported. Use a Data Integration Service that runs on a machine that uses RHEL.
3. Update the Hadoop cluster.
4. Update the Developer tool client.
5. Update PowerCenter.

Optionally, you can enable support for user impersonation.

Step 1. Download EBF15545

Before you enable MapR 4.0.2 with MapReduce 2 for Informatica 9.6.1 HotFix 2, download the EBF.

1. Open a browser.
2. In the address field, enter the following URL: https://tsftp.informatica.com.
3. Navigate to the following directory: /updates/informatica9/9.6.1 HotFix2/EBF15545.
4. Download the following files:
   EBF15545.Linux64-X86.tar.gz
       Contains the EBF installer for the Informatica domain and the Hadoop cluster.
   EBF15545_Client_Installer_win32_x86.zip
       Contains the EBF installer for the Informatica client. Use this file to update the Developer tool.
5. Extract the files from EBF15545.Linux64-X86.tar.gz. The archive contains the following .tar files:
   EBF15545_Server_installer_linux_em64t.tar
       EBF installer for the Informatica domain. Use this file to update the Informatica domain.
   EBF15545_HadoopRPM_EBFInstaller.tar
       EBF installer for the Hadoop RPM. Use this file to update the Hadoop cluster.
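For reference, the following shell commands sketch steps 4 and 5 of the download procedure. They assume the archive was downloaded to the current working directory and that the two .tar files extract into it; adjust the paths for your environment.

    # extract the server-side archive downloaded from tsftp.informatica.com
    tar -xzvf EBF15545.Linux64-X86.tar.gz
    # the two EBF installers described above should now be available
    ls EBF15545_Server_installer_linux_em64t.tar EBF15545_HadoopRPM_EBFInstaller.tar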

Step 2. Update the Informatica Domain

Update the Informatica domain to enable MapR 4.0.2 with MapReduce 2.

Note: If the Data Integration Service runs on a machine that uses SUSE 11, the native mode of execution and Hive pushdown are not supported. Use a Data Integration Service that runs on a machine that uses RHEL.

Perform the following tasks:

1. Apply the EBF to the Informatica domain.
2. Configure MapR distribution variables for mappings in a Hive environment.
3. Configure Hadoop cluster properties in yarn-site.xml.

Applying EBF15545 to the Informatica Domain

Apply the EBF to every node in the domain that is used to connect to HDFS or HiveServer on MapR 4.0.2. To apply the EBF to a node in the domain, perform the following steps:

1. Copy EBF15545_Server_installer_linux_em64t.tar to a temporary location on the node.
2. Extract the installer file. Run the following command:
   tar -xvf EBF15545_Server_installer_linux_em64t.tar
3. Configure the following properties in the Input.properties file:
   DEST_DIR=<Informatica installation directory>
   ROLLBACK=0
4. Run installebf.sh.
5. Repeat steps 1 through 4 for every node in the domain that is used for Hive pushdown.

Note: To roll back the EBF for the Informatica domain on a node, set ROLLBACK to 1 and run installebf.sh.
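As an illustration, the commands below sketch steps 1 through 4 on a single domain node. The temporary directory /tmp/EBF15545 and the source path of the installer are placeholders; substitute the values for your environment.

    mkdir -p /tmp/EBF15545 && cd /tmp/EBF15545
    # copy the installer from the location where you extracted EBF15545.Linux64-X86.tar.gz
    cp /path/to/EBF15545_Server_installer_linux_em64t.tar .
    tar -xvf EBF15545_Server_installer_linux_em64t.tar
    # edit Input.properties so that DEST_DIR points to the Informatica installation directory and ROLLBACK=0
    vi Input.properties
    sh installebf.sh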

Configuring MapR Distribution Variables for Mappings in a Hive Environment

When you use the MapR distribution to run mappings in a Hive environment, you must configure MapR environment variables. Configure the following MapR variables:

- Add MAPR_HOME to the environment variables in the Data Integration Service Process properties. Set MAPR_HOME to the following path: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn.
- Add -Dmapr.library.flatclass to the custom properties in the Data Integration Service Process properties. For example, add JVMOption1=-Dmapr.library.flatclass.
- Add -Dmapr.library.flatclass to the Data Integration Service advanced property JVM Command Line Options.
- Set the MapR Container Location Database name variable CLDB in the following file: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf/mapr-clusters.conf. For example, add the following property:
  INFAMAPR402 secure=false <master_node_name>:7222

Configuring Hadoop Cluster Properties in yarn-site.xml

To run mappings on a MapR 4.0.2 cluster, you must configure the cluster properties in yarn-site.xml on the machine where the Data Integration Service runs. yarn-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf/.

In yarn-site.xml, configure the following properties:

mapreduce.jobhistory.address
    Location of the MapReduce JobHistory Server. The default port is 10020. Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/mapred-site.xml
mapreduce.jobhistory.webapp.address
    Web address of the MapReduce JobHistory Server. The default port is 19888. Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/mapred-site.xml
yarn.resourcemanager.scheduler.address
    Scheduler interface address. The default port is 8030. Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/yarn-site.xml

The following sample code describes the properties you can set in yarn-site.xml:

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hostname:port</value>
        <description>MapReduce JobHistory Server IPC host:port</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hostname:port</value>
        <description>MapReduce JobHistory Server Web UI host:port</description>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hostname:port</value>
        <description>The address of the scheduler interface</description>
    </property>

Step 3. Update the Hadoop Cluster

To update the Hadoop cluster to enable MapR 4.0.2, perform the following tasks:

1. Apply the EBF to the Hadoop cluster.
2. Configure the heap space for the MapR-FS.
3. Verify the cluster details.

Applying EBF15545 to the Hadoop Cluster

To apply the EBF to the Hadoop cluster, perform the following steps:

1. Copy EBF15545_HadoopRPM_EBFInstaller.tar to a temporary location on the cluster machine.
2. Extract the installer file. Run the following command:
   tar -xvf EBF15545_HadoopRPM_EBFInstaller.tar
3. Provide the node list in the HadoopDataNodes file.
4. Configure the destdir parameter in the input.properties file:
   destdir=<Informatica home directory>
   For example, set the destdir parameter to the following value:
   destdir="/opt/informatica"
5. Run InformaticaHadoopEBFInstall.sh.
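The following is a minimal sketch of steps 2 through 5 on the cluster machine. It assumes the installer files extract into the current directory and that HadoopDataNodes takes one host name per line; confirm both for your cluster before running the installer.

    tar -xvf EBF15545_HadoopRPM_EBFInstaller.tar
    # list the cluster nodes that should receive the EBF, for example one host name per line
    vi HadoopDataNodes
    # set destdir to the Informatica home directory on the cluster nodes, for example destdir="/opt/informatica"
    vi input.properties
    sh InformaticaHadoopEBFInstall.sh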

Configure the Heap Space for the MapR-FS

You must configure the heap space reserved for the MapR-FS on every node in the cluster. Perform the following steps:

1. Navigate to the following directory: /opt/mapr/conf.
2. Edit the warden.conf file.
3. Set the value for the service.command.mfs.heapsize.percent property to 20.
4. Save and close the file.
5. Repeat steps 1 through 4 for every node in the cluster.
6. Restart the cluster.

Verifying the Cluster Details

Verify the following settings for the MapR cluster:

MapReduce Version
    If the cluster is configured for Classic MRv1, use the MapR Control System (MCS) to change the configuration to YARN. Then, restart the cluster.
MapR User Details
    Verify that the MapR user exists on each Hadoop cluster node and that the following properties match:
    - User ID (uid)
    - Group ID (gid)
    - Groups
    For example, the MapR user might have the following properties:
    uid=2000(mapr) gid=2000(mapr) groups=2000(mapr)
Data Integration Service User Details
    Verify that the user who runs the Data Integration Service is assigned the same gid as the MapR user and belongs to the same group. For example, a Data Integration Service user named testuser might have the following properties:
    uid=30103(testuser) gid=2000(mapr) groups=2000(mapr)

After you verify the Data Integration Service user details, perform the following steps:

1. Create a user that has the same user ID and name as the Data Integration Service user.
2. Add this user to all the nodes in the Hadoop cluster and assign it to the mapr group.
3. Verify that the user you created has read and write permissions for the following directory: /opt/mapr/hive/hive-0.13/logs. A directory corresponding to the user will be created at this location.
4. Verify that the user you created has permissions for the Hive warehouse directory.
   The Hive warehouse directory is set in the following file: /opt/mapr/hive/hive-0.13/conf/hive-site.xml. For example, if the warehouse directory is /user/hive/warehouse, run the following command to grant the user permissions for the directory:
   hadoop fs -chmod -R 777 /user/hive/warehouse
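One way to spot-check the user and group IDs described in this section is to run id on a Hadoop cluster node and on the Data Integration Service machine and compare the results; testuser is the example Data Integration Service user shown above.

    id mapr        # expected to resemble: uid=2000(mapr) gid=2000(mapr) groups=2000(mapr)
    id testuser    # expected to resemble: uid=30103(testuser) gid=2000(mapr) groups=2000(mapr)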

Step 4. Update the Developer tool

Update the Developer tool to enable MapR 4.0.2. Perform the following tasks:

1. Apply the EBF to the Informatica clients.
2. Configure the Developer tool.

Applying EBF15545 to the Informatica Clients

To apply the EBF to the Informatica client, perform the following steps:

1. Copy EBF15545_Client_Installer_win32_x86.zip to the Windows client machine.
2. Extract the installer.
3. Configure the following properties in the Input.properties file:
   DEST_DIR=<Informatica installation directory>
   ROLLBACK=0
   Use two backslashes when you set the DEST_DIR property. For example, include the following lines in the Input.properties file:
   DEST_DIR=C:\\Informatica\\9.6.1HF2RC
   ROLLBACK=0
4. Run installebf.bat.

Note: To roll back the EBF for the Informatica client, set ROLLBACK to 1 and run installebf.bat.

Configuring the Developer tool

To configure the Developer tool after you apply the EBF, perform the following steps:

1. Go to the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf.
2. Find the mapr-cluster.conf file.
3. Copy the file to the following directory on the machine on which the Developer tool runs: <Informatica installation directory>\clients\developerclient\hadoop\mapr_402\conf
4. Go to the following directory on the machine on which the Developer tool runs: <Informatica installation directory>\<version>\clients\developerclient
5. Edit run.bat to include the MAPR_HOME environment variable and the -clean setting. For example, include the following lines:
   <Informatica installation directory>\clients\developerclient\hadoop\mapr_402
   developercore.exe -clean
6. Save and close the file.

7. Add the following values to the developercore.ini file:
   -Dmapr.library.flatclass
   -Djava.library.path=hadoop\mapr_402\lib\native\Win32;bin;..\DT\bin
   You can find developercore.ini in the following directory: <Informatica installation directory>\clients\developerclient
8. Save and close the file.
9. Use run.bat to start the Developer tool.

Step 5. Update PowerCenter

Update PowerCenter to enable MapR 4.0.2. Perform the following tasks:

1. Update the repository plugin.
2. Configure the PowerCenter Integration Service.
3. Copy MapR distribution files for PowerCenter mappings in the native environment.

Configuring the PowerCenter Integration Service

To enable support for MapR 4.0.2, configure the PowerCenter Integration Service. Perform the following steps:

1. Log in to the Administrator tool.
2. In the Domain Navigator, select the PowerCenter Integration Service.
3. Click the Processes view.
4. Add the following environment variable: MAPR_HOME
   Use the following value: <INFA_HOME>/server/bin/javalib/hadoop/mapr402
5. Add the following custom property: JVMClassPath
   Use the following value: <INFA_HOME>/server/bin/javalib/hadoop/mapr402/*:<INFA_HOME>/server/bin/javalib/hadoop/*

Copying MapR Distribution Files for PowerCenter Mappings in the Native Environment

When you use the MapR distribution to run mappings in a native environment, you must copy MapR files to the machine on which you install Big Data Edition.

1. Go to the following directory on any node in the cluster: <MapR installation directory>/conf. For example, go to the following directory on any node in the cluster: /opt/mapr/conf.
2. Find the following files:
   mapr-cluster.conf
   mapr.login.conf
3. Copy the files to the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/server/bin/javalib/hadoop/mapr402/conf.
4. Log in to the Administrator tool.
5. In the Domain Navigator, select the PowerCenter Integration Service.
6. Recycle the service. Click Actions > Recycle Service.
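For the copy in steps 1 through 3, one option is to pull the two files over SSH from a cluster node. The host name cluster-node1 is a placeholder, and <Informatica installation directory> stands for the actual installation path on the Data Integration Service machine.

    # run on the machine on which the Data Integration Service runs
    scp cluster-node1:/opt/mapr/conf/mapr-cluster.conf \
        cluster-node1:/opt/mapr/conf/mapr.login.conf \
        "<Informatica installation directory>/server/bin/javalib/hadoop/mapr402/conf/"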

Enable User Impersonation for Native and Hive Execution Environments

User impersonation allows the Data Integration Service to submit Hadoop jobs as a specific user. By default, Hadoop jobs are submitted as the user who runs the Data Integration Service.

To enable user impersonation for the native and Hive environments, perform the following steps:

1. Go to the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf
2. Create a directory named "proxy". Run the following command:
   mkdir <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf/proxy
3. Change the permissions for the proxy directory to -rwxr-xr-x. Run the following command:
   chmod 755 <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf/proxy
4. Verify the following details for the user that you want to impersonate with the Data Integration Service user:
   - Exists on the machine on which the Data Integration Service runs.
   - Exists on every node in the Hadoop cluster.
   - Has the same user ID and group ID on the machine on which the Data Integration Service runs and on the Hadoop cluster.
5. Create a file for the Data Integration Service user that impersonates other users. Run the following command:
   touch <Informatica installation directory>/services/shared/hadoop/mapr_4.0.2_yarn/conf/proxy/<username>
   For example, to create a file for the Data Integration Service user named user1 that is used to impersonate other users, run the following command:
   touch $INFA_HOME/services/shared/hadoop/mapr_4.0.2_yarn/conf/proxy/user1
6. Log in to the Administrator tool.
7. In the Domain Navigator, select the Data Integration Service.
8. Recycle the Data Integration Service. Click Actions > Recycle Service.

Connections Overview

Define the connections you want to use to access data in Hive or HDFS. You can create the following types of connections:

- HDFS connection. Create an HDFS connection to read data from or write data to the Hadoop cluster.

- HBase connection. Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.
- Hive connection. Create a Hive connection to access Hive data or run Informatica mappings in the Hadoop cluster. Create a Hive connection in the following connection modes:
  - Use the Hive connection to access Hive as a source or target. If you want to use Hive as a target, you need to have the same connection or another Hive connection that is enabled to run mappings in the Hadoop cluster. You can access Hive as a source if the mapping is enabled for the native or Hive environment. You can access Hive as a target only if the mapping is run in the Hadoop cluster.
  - Use the Hive connection to validate or run an Informatica mapping in the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.

You can create the connections using the Developer tool, Administrator tool, and infacmd.

Note: For information about creating connections to other sources or targets such as social media web sites or Teradata, see the respective PowerExchange adapter user guide.

HDFS Connection Properties

Use a Hadoop File System (HDFS) connection to access data in the Hadoop cluster. The HDFS connection is a file system type connection. You can create and manage an HDFS connection in the Administrator tool, Analyst tool, or the Developer tool. HDFS connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes HDFS connection properties:

Name
    Name of the connection. The name is not case sensitive and must be unique within the domain. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID
    String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
    The description of the connection. The description cannot exceed 765 characters.
Location
    The domain where you want to create the connection. Not valid for the Analyst tool.
Type
    The connection type. Default is Hadoop File System.
User Name
    User name to access HDFS.
NameNode URI
    The URI to access MapR-FS. Use the following URI: maprfs:///

HBase Connection Properties

Use an HBase connection to access HBase. The HBase connection is a NoSQL connection. You can create and manage an HBase connection in the Administrator tool or the Developer tool. HBase connection properties are case sensitive unless otherwise noted.

The following table describes HBase connection properties:

Name
    The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID
    String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
    The description of the connection. The description cannot exceed 4,000 characters.
Location
    The domain where you want to create the connection.
Type
    The connection type. Select HBase.
ZooKeeper Host(s)
    Name of the machine that hosts the ZooKeeper server.
ZooKeeper Port
    Port number of the machine that hosts the ZooKeeper server. Use the value specified for hbase.zookeeper.property.clientPort in hbase-site.xml. You can find hbase-site.xml on the NameNode machine in the following directory: /opt/mapr/hbase/hbase-0.98.7/conf
Enable Kerberos Connection
    Enables the Informatica domain to communicate with the HBase master server or region server that uses Kerberos authentication.

HBase Master Principal
    Service Principal Name (SPN) of the HBase master server. Enables the ZooKeeper server to communicate with an HBase master server that uses Kerberos authentication.
    Enter a string in the following format: hbase/<domain.name>@<YOUR-REALM>
    Where:
    - domain.name is the domain name of the machine that hosts the HBase master server.
    - YOUR-REALM is the Kerberos realm.
HBase Region Server Principal
    Service Principal Name (SPN) of the HBase region server. Enables the ZooKeeper server to communicate with an HBase region server that uses Kerberos authentication.
    Enter a string in the following format: hbase_rs/<domain.name>@<YOUR-REALM>
    Where:
    - domain.name is the domain name of the machine that hosts the HBase master server.
    - YOUR-REALM is the Kerberos realm.

Hive Connection Properties

Use the Hive connection to access Hive data. A Hive connection is a database type connection. You can create and manage a Hive connection in the Administrator tool, Analyst tool, or the Developer tool. Hive connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes Hive connection properties:

Name
    The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] \ : ; " ' < , > . ? /
ID
    String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
    The description of the connection. The description cannot exceed 4000 characters.
Location
    The domain where you want to create the connection. Not valid for the Analyst tool.
Type
    The connection type. Select Hive.

Connection Modes
    Hive connection mode. Select at least one of the following options:
    - Access Hive as a source or target. Select this option if you want to use the connection to access the Hive data warehouse. If you want to use Hive as a target, you must enable the same connection or another Hive connection to run mappings in the Hadoop cluster.
    - Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the connection to run mappings in the Hadoop cluster.
    You can select both options. Default is Access Hive as a source or target.
User Name
    User name of the user that the Data Integration Service impersonates to run mappings on a Hadoop cluster. Use the user name of an operating system user that is present on all nodes in the Hadoop cluster.

Common Attributes to Both the Modes:

Environment SQL
    SQL commands to set the Hadoop environment. In the native environment type, the Data Integration Service executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes the environment SQL at the beginning of each Hive session.
    The following rules and guidelines apply to the usage of environment SQL in both connection modes:
    - Use the environment SQL to specify Hive queries.
    - Use the environment SQL to set the classpath for Hive user-defined functions and then use environment SQL or PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to the JAR files for user-defined functions.
    - You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
    If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes only the environment SQL of the Hive connection. If the Hive sources and targets are on different clusters, the Data Integration Service does not execute the different environment SQL commands for the connections of the Hive source or target.

Properties to Access Hive as Source or Target

The following table describes the connection properties that you configure to access Hive as a source or target:

Metadata Connection String
    The JDBC connection URI used to access the metadata from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service. To connect to HiveServer2, specify the connection string in the following format:
    jdbc:hive2://<hostname>:<port>/<db>
    Where
    - <hostname> is name or IP address of the machine on which HiveServer2 runs.
    - <port> is the port number on which HiveServer2 listens.
    - <db> is the database name to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
Bypass Hive JDBC Server
    JDBC driver mode. Select the check box to use the embedded JDBC driver mode. To use the JDBC embedded mode, perform the following tasks:
    - Verify that Hive client and Informatica services are installed on the same machine.
    - Configure the Hive connection properties to run mappings in the Hadoop cluster.
    If you choose the non-embedded mode, you must configure the Data Access Connection String. Informatica recommends that you use the JDBC embedded mode.
Data Access Connection String
    The connection string to access data from the Hadoop data store. To connect to HiveServer2, specify the non-embedded JDBC mode connection string in the following format:
    jdbc:hive2://<hostname>:<port>/<db>
    Where
    - <hostname> is name or IP address of the machine on which HiveServer2 runs.
    - <port> is the port number on which HiveServer2 listens.
    - <db> is the database to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.

Properties to Run Mappings in Hadoop Cluster

The following table describes the Hive connection properties that you configure when you want to use the Hive connection to run Informatica mappings in the Hadoop cluster:

Database Name
    Namespace for tables. Use the name default for tables that do not have a specified database name.
Default FS URI
    The URI to access the default MapR File System. Use the following connection URI: maprfs:///

Yarn Resource Manager URI
    The service within Hadoop that submits the MapReduce tasks to specific nodes in the cluster. For MapR 4.0.2 with YARN, use the following format:
    <hostname>:<port>
    Where
    - <hostname> is the host name or IP address of the JobTracker or Yarn resource manager.
    - <port> is the port on which the JobTracker or Yarn resource manager listens for remote procedure calls (RPC).
    Use the value specified by yarn.resourcemanager.address in yarn-site.xml. You can find yarn-site.xml in the following directory on the NameNode: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop.
    For MapR 4.0.2 with MapReduce 1, use the following URI: maprfs:///
Hive Warehouse Directory on HDFS
    The absolute HDFS file path of the default database for the warehouse that is local to the cluster. For example, the following file path specifies a local warehouse: /user/hive/warehouse
    If the Metastore Execution Mode is remote, then the file path must match the file path specified by the Hive Metastore Service on the Hadoop cluster. Use the value specified for the hive.metastore.warehouse.dir property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Advanced Hive/Hadoop Properties
    Configures or overrides Hive or Hadoop cluster properties in hive-site.xml on the machine on which the Data Integration Service runs. You can specify multiple properties. Use the following format:
    <property1>=<value>
    Where
    - <property1> is a Hive or Hadoop property in hive-site.xml.
    - <value> is the value of the Hive or Hadoop property.
    To specify multiple properties, use &: as the property separator. The maximum length for the format is 1 MB. If you enter a required property for a Hive connection, it overrides the property that you configure in the Advanced Hive/Hadoop Properties.
    The Data Integration Service adds or sets these properties for each map-reduce job. You can verify these properties in the JobConf of each mapper and reducer job. Access the JobConf of each job from the JobTracker URL under each MapReduce job. The Data Integration Service writes messages for these properties to the Data Integration Service logs. The Data Integration Service must have the log tracing level set to log each row or have the log tracing level set to verbose initialization tracing.
    For example, specify the following properties to control and limit the number of reducers to run a mapping job:
    mapred.reduce.tasks=2&:hive.exec.reducers.max=10
Temporary Table Compression Codec
    Hadoop compression library for a compression codec class name.

Codec Class Name
    Codec class name that enables data compression and improves performance on temporary staging tables.
Metastore Execution Mode
    Controls whether to connect to a remote metastore or a local metastore. By default, local is selected. For a local metastore, you must specify the Metastore Database URI, Driver, Username, and Password. For a remote metastore, you must specify only the Remote Metastore URI.
Metastore Database URI
    The JDBC connection URI used to access the data store in a local metastore setup. Use the following connection URI:
    jdbc:<datastore type>://<node name>:<port>/<database name>
    where
    - <node name> is the host name or IP address of the data store.
    - <data store type> is the type of the data store.
    - <port> is the port on which the data store listens for remote procedure calls (RPC).
    - <database name> is the name of the database.
    For example, the following URI specifies a local metastore that uses MySQL as a data store:
    jdbc:mysql://hostname23:3306/metastore
    Use the value specified for the javax.jdo.option.ConnectionURL property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Driver
    Driver class name for the JDBC data store. For example, the following class name specifies a MySQL driver:
    com.mysql.jdbc.Driver
    Use the value specified for the javax.jdo.option.ConnectionDriverName property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Username
    The metastore database user name. Use the value specified for the javax.jdo.option.ConnectionUserName property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Metastore Database Password
    Required if the Metastore Execution Mode is set to local. The password for the metastore user name. Use the value specified for the javax.jdo.option.ConnectionPassword property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
Remote Metastore URI
    The metastore URI used to access metadata in a remote metastore setup. For a remote metastore, you must specify the Thrift server details. Use the following connection URI:
    thrift://<hostname>:<port>
    Where
    - <hostname> is name or IP address of the Thrift metastore server.
    - <port> is the port on which the Thrift server is listening.
    Use the value specified for the hive.metastore.uris property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /opt/mapr/hive/hive-0.13/conf.
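To look up the metastore values referenced above, one option is to read them directly from hive-site.xml on the node that runs HiveServer2. The grep invocation below is a sketch and assumes the default MapR Hive path used in this article.

    grep -A 1 -E "javax.jdo.option.Connection(URL|DriverName|UserName|Password)|hive.metastore.uris" \
        /opt/mapr/hive/hive-0.13/conf/hive-site.xml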

Creating a Connection

Create a connection before you import data objects, preview data, profile data, and run mappings.

1. Click Window > Preferences.
2. Select Informatica > Connections.
3. Expand the domain in the Available Connections list.
4. Select the type of connection that you want to create:
   - To select a Hive connection, select Database > Hive.
   - To select an HDFS connection, select File Systems > Hadoop File System.
5. Click Add.
6. Enter a connection name and optional description.
7. Click Next.
8. Configure the connection properties. For a Hive connection, you must choose the Hive connection mode and specify the commands for environment SQL. The SQL commands apply to both the connection modes. Select at least one of the following connection modes:
   Access Hive as a source or target
       Use the connection to access Hive data. If you select this option and click Next, the Properties to Access Hive as a source or target page appears. Configure the connection strings.
   Run mappings in a Hadoop cluster
       Use the Hive connection to validate and run Informatica mappings in the Hadoop cluster. If you select this option and click Next, the Properties used to Run Mappings in the Hadoop Cluster page appears. Configure the properties.
9. Click Test Connection to verify the connection. You can test a Hive connection that is configured to access Hive data. You cannot test a Hive connection that is configured to run Informatica mappings in the Hadoop cluster.
10. Click Finish.

Troubleshooting

This section describes troubleshooting information.

A Hive pushdown mapping fails with the following error in the Hadoop job log:

    Container [pid=25720,containerid=container_1428396763721_0253_01_000002] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 21.8 GB of 2.1 GB virtual memory used. Killing container

To resolve this issue, you must modify yarn-site.xml on every node in the Hadoop cluster. Then, restart the cluster services. yarn-site.xml is located in the following directory on the Hadoop cluster nodes: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop.

In yarn-site.xml, configure the following properties:

Note: If a property does not exist, add it to yarn-site.xml.

yarn.nodemanager.resource.memory-mb
    Amount of physical memory, in MB, that can be allocated for containers. Use 24000 for the value.
yarn.scheduler.minimum-allocation-mb
    The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum. Use 2048 for the value.
yarn.scheduler.maximum-allocation-mb
    The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value. Use 24000 for the value.
yarn.app.mapreduce.am.resource.mb
    The amount of memory the MR AppMaster needs. Use 2048 for the value.
yarn.nodemanager.resource.cpu-vcores
    Number of CPU cores that can be allocated for containers. Use 8 for the value.

The following sample code shows the properties you can configure in yarn-site.xml:

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
        <value>24000</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.</description>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value.</description>
        <value>24000</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <description>The amount of memory the MR AppMaster needs.</description>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <description>Number of CPU cores that can be allocated for containers.</description>
        <value>8</value>
    </property>
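Because the change must be made on every node in the cluster, a small loop can push the edited file out and restart the MapR services. The sketch below assumes passwordless SSH as root, a nodes.txt file that lists one cluster host per line, and that restarting the warden service on each node is acceptable in your environment.

    # copy the updated yarn-site.xml to each node and restart the MapR warden service
    for node in $(cat nodes.txt); do
        scp yarn-site.xml root@"$node":/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/yarn-site.xml
        ssh root@"$node" "service mapr-warden restart"
    done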

Author

Big Data Edition Team