Configuring Sqoop Connectivity for Big Data Management

Copyright Informatica LLC 2017. Informatica, the Informatica logo, and Big Data Management are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html.
Abstract

Sqoop is a Hadoop command line program that transfers data between relational databases and HDFS through MapReduce programs. This article explains how to configure Sqoop connectivity with Big Data Management. Configure Sqoop connectivity for relational data objects, customized data objects, and logical data objects that are based on a JDBC-compliant database.

Supported Versions

Informatica Big Data Management 10.1

Table of Contents

Overview
Download the JDBC Driver JAR Files
Configure the HADOOP_NODE_JDK_HOME Property in the hadoopenv.properties File
Configure the mapred-site.xml File for Cloudera Clusters
Configure the yarn-site.xml File for Cloudera Kerberos Clusters
Configure the mapred-site.xml File for Cloudera Kerberos Non-HA Clusters
Configure the core-site.xml File for Ambari-based Non-Kerberos Clusters

Overview

Big Data Management uses third-party Hadoop utilities such as Sqoop to process data efficiently. You can use Sqoop to import and export data. When you use Sqoop, you do not need to install the relational database client and software on any node in the Hadoop cluster.

If you did not configure Sqoop connectivity when you installed Informatica Big Data Management, you can configure it later. Perform the following tasks to configure Sqoop connectivity with Big Data Management:

1. Download the JDBC driver JAR files for Sqoop connectivity.
2. Configure the HADOOP_NODE_JDK_HOME property in the hadoopenv.properties file.
3. Configure the mapred-site.xml file for Cloudera clusters.
4. Configure the yarn-site.xml file for Cloudera Kerberos clusters.
5. Configure the mapred-site.xml file for Cloudera Kerberos non-HA clusters.
6. Configure the core-site.xml file for Ambari-based non-Kerberos clusters.
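For context, the following is roughly what a standalone Sqoop import invocation looks like. The host, database, credentials, table, and target directory are placeholders, not values from this article; when you run Sqoop mappings through Big Data Management, the Data Integration Service constructs the equivalent call for you.

```shell
# Illustrative sketch only: all connection details below are placeholders.
# Sqoop launches a MapReduce job whose mappers read table rows over JDBC in
# parallel and write them to the HDFS target directory.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user \
  --password-file /user/sqoop/db.password \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4
```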
Download the JDBC Driver JAR Files

To configure Sqoop connectivity for relational databases, you must download the relevant JDBC driver JAR files and copy them to the node where the Data Integration Service runs. At run time, the Data Integration Service copies the JAR files to the Hadoop distributed cache so that the JAR files are accessible to all nodes in the Hadoop cluster. You can use any Type 4 JDBC driver that the database vendor recommends for Sqoop connectivity.

Note: The DataDirect JDBC drivers that Informatica ships are not licensed for Sqoop connectivity.
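As a concrete sketch of where the downloaded driver ends up, the commands below stage a vendor JDBC driver JAR on the Data Integration Service node. The installation directory and driver file name are assumptions for illustration; substitute your actual Informatica installation directory and the Type 4 driver JAR that your database vendor recommends.

```shell
# Sketch only: INFA_HOME and the driver jar name are assumptions.
INFA_HOME="${INFA_HOME:-$HOME/Informatica/10.1.0}"

# Create a placeholder driver jar so this sketch is self-contained; in
# practice this is the jar you downloaded from the database vendor.
touch /tmp/mysql-connector-java-5.1.40-bin.jar

# Copy the driver into the directory that the Data Integration Service
# ships to the Hadoop distributed cache at run time.
mkdir -p "${INFA_HOME}/externaljdbcjars"
cp /tmp/mysql-connector-java-5.1.40-bin.jar "${INFA_HOME}/externaljdbcjars/"
ls "${INFA_HOME}/externaljdbcjars"
```

Repeat the copy for every driver JAR your mappings need, including the Teradata connector JARs described below if you use them.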
If you use the Cloudera Connector Powered by Teradata or the Hortonworks Connector for Teradata, you must download additional JAR files and copy them to the node where the Data Integration Service runs.

1. Download the JDBC driver JAR files for the database that you want to connect to.
2. If you use the Cloudera Connector Powered by Teradata, perform the following steps:
   a. Download the Cloudera Connector Powered by Teradata package from the following URL:
      http://www.cloudera.com/downloads.html
      The package is named sqoop-connector-teradata-<version>.tar.gz. Download all the JAR files in the package.
   b. Download the terajdbc4.jar file and tdgssconfig.jar file from the following URL:
      http://downloads.teradata.com/download/connectivity/jdbc-driver
3. If you use the Hortonworks Connector for Teradata, perform the following steps:
   a. Download the Hortonworks Connector for Teradata package from the following URL:
      http://hortonworks.com/downloads/#addons
      The package is named hdp-connector-for-teradata-<version>-distro.tar.gz. Download all the JAR files in the package.
   b. Download the avro-mapred-1.7.4-hadoop2.jar file from the following URL:
      https://archive.apache.org/dist/avro/avro-1.7.4/java/
4. On the node where the Data Integration Service runs, copy all the JAR files mentioned in the earlier steps to the following directory:
   <Informatica installation directory>/externaljdbcjars

Configure the HADOOP_NODE_JDK_HOME Property in the hadoopenv.properties File

Before you run Sqoop mappings, you must configure the HADOOP_NODE_JDK_HOME property in the hadoopenv.properties file on the Data Integration Service node. Configure the HADOOP_NODE_JDK_HOME property to point to the JDK version that the cluster nodes use. You must use JDK version 1.7 or later.

1. Go to the following location:
   <Informatica installation directory>/services/shared/hadoop/<Hadoop_distribution_name>_<version_number>/infaConf
2. Find the file named hadoopenv.properties.
3.
Back up the file before you update it.
4. Use a text editor to open the file.
5. Define the HADOOP_NODE_JDK_HOME property as follows:
   infapdo.env.entry.hadoop_node_jdk_home=HADOOP_NODE_JDK_HOME=<cluster_jdk_home>/jdk<version>
   For example:
   infapdo.env.entry.hadoop_node_jdk_home=HADOOP_NODE_JDK_HOME=/usr/java/default
6. Save the properties file with the name hadoopenv.properties.

Configure the mapred-site.xml File for Cloudera Clusters

Before you run Sqoop mappings on Cloudera clusters, you must configure MapReduce properties in the mapred-site.xml file on the Hadoop cluster, and then restart the Hadoop services and the cluster.

1. Open the YARN configuration in Cloudera Manager.
2. Find the property named NodeManager Advanced Configuration Snippet (Safety Valve) for mapred-site.xml.
3. Click + and configure the following properties:

   Property: mapreduce.application.classpath
   Value: $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,$CDH_MR2_HOME

   Property: mapreduce.jobhistory.intermediate-done-dir
   Value: <Directory where the map-reduce jobs write history files>

4. Select the Final check box.
5. Redeploy the client configurations.
6. Restart the Hadoop services and the cluster.

Configure the yarn-site.xml File for Cloudera Kerberos Clusters

To run Sqoop mappings on Cloudera clusters that use Kerberos authentication, you must configure properties in the yarn-site.xml file on the Data Integration Service node and restart the Data Integration Service.

Copy the following properties from the mapred-site.xml file on the cluster and add them to the yarn-site.xml file on the Data Integration Service node:

mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020.
   <name>mapreduce.jobhistory.address</name>
   <value>hostname:port</value>
   <description>MapReduce JobHistory Server IPC host:port</description>

mapreduce.jobhistory.principal
SPN for the MapReduce JobHistory Server.
   <name>mapreduce.jobhistory.principal</name>
   <value>mapred/_HOST@YOUR-REALM</value>
   <description>SPN for the MapReduce JobHistory Server</description>

mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default port is 19888.
   <name>mapreduce.jobhistory.webapp.address</name>
   <value>hostname:port</value>
   <description>MapReduce JobHistory Server Web UI host:port</description>

mapreduce.application.classpath
Classpaths for MapReduce applications.
   <name>mapreduce.application.classpath</name>
   <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,$CDH_MR2_HOME</value>
   <description>Classpaths for MapReduce applications</description>

Configure the mapred-site.xml File for Cloudera Kerberos Non-HA Clusters

Before you run Sqoop mappings on the Spark and Blaze engines, and on Cloudera Kerberos clusters that are not enabled with NameNode high availability, you must configure the mapreduce.jobhistory.address property in the mapred-site.xml file on the Hadoop cluster, and then restart the Hadoop services and the cluster.

1. Open the YARN configuration in Cloudera Manager.
2. Find the property named NodeManager Advanced Configuration Snippet (Safety Valve) for mapred-site.xml.
3. Click +.
4. Enter the name as mapreduce.jobhistory.address.
5. Set the value as follows: <MapReduce JobHistory Server hostname>:<port>
6. Select the Final check box.
7. Redeploy the client configurations.
8. Restart the Hadoop services and the cluster.

Configure the core-site.xml File for Ambari-based Non-Kerberos Clusters

To run Sqoop mappings on IBM BigInsights, Hortonworks HDP, or Azure HDInsight clusters that do not use Kerberos authentication, you must create a proxy user for the yarn user who impersonates other users. You must configure the impersonation properties in the core-site.xml file on the Hadoop cluster, and then restart the Hadoop services and the cluster.

Configure the following user impersonation properties in the core-site.xml file:

hadoop.proxyuser.yarn.groups
   <name>hadoop.proxyuser.yarn.groups</name>
   <value><name_of_the_impersonation_user></value>
   <description>Allows impersonation from any group.</description>

hadoop.proxyuser.yarn.hosts
   <name>hadoop.proxyuser.yarn.hosts</name>
   <value>*</value>
   <description>Allows impersonation from any host.</description>

Author
Ellen Chandler
Principal Technical Writer