Informatica (Version 9.6.1 HotFix 3 Update 3) Big Data Edition Installation and Configuration Guide


Informatica Big Data Edition Installation and Configuration Guide Version 9.6.1 HotFix 3 Update 3 January 2015 Copyright (c) 1993-2016 Informatica LLC. All rights reserved. This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013 (1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable. The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing. Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange Informatica On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners. 
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright Sun Microsystems. All rights reserved. Copyright RSA Security Inc. All Rights Reserved. Copyright Ordinal Technology Corp. All rights reserved.copyright Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright Meta Integration Technology, Inc. All rights reserved. Copyright Intalio. All rights reserved. Copyright Oracle. All rights reserved. Copyright Adobe Systems Incorporated. All rights reserved. Copyright DataArt, Inc. All rights reserved. Copyright ComponentSource. All rights reserved. Copyright Microsoft Corporation. All rights reserved. Copyright Rogue Wave Software, Inc. All rights reserved. Copyright Teradata Corporation. All rights reserved. Copyright Yahoo! Inc. All rights reserved. Copyright Glyph & Cog, LLC. All rights reserved. Copyright Thinkmap, Inc. All rights reserved. Copyright Clearpace Software Limited. All rights reserved. Copyright Information Builders, Inc. All rights reserved. Copyright OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright International Organization for Standardization 1986. All rights reserved. Copyright ejtechnologies GmbH. All rights reserved. Copyright Jaspersoft Corporation. All rights reserved. Copyright International Business Machines Corporation. All rights reserved. Copyright yworks GmbH. All rights reserved. Copyright Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved. Copyright Daniel Veillard. All rights reserved. Copyright Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright MicroQuill Software Publishing, Inc. All rights reserved. Copyright PassMark Software Pty Ltd. 
All rights reserved. Copyright LogiXML, Inc. All rights reserved. Copyright 2003-2010 Lorenzi Davide, All rights reserved. Copyright Red Hat, Inc. All rights reserved. Copyright The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright EMC Corporation. All rights reserved. Copyright Flexera Software. All rights reserved. Copyright Jinfonet Software. All rights reserved. Copyright Apple Inc. All rights reserved. Copyright Telerik Inc. All rights reserved. Copyright BEA Systems. All rights reserved. Copyright PDFlib GmbH. All rights reserved. Copyright Orientation in Objects GmbH. All rights reserved. Copyright Tanuki Software, Ltd. All rights reserved. Copyright Ricebridge. All rights reserved. Copyright Sencha, Inc. All rights reserved. Copyright Scalable Systems, Inc. All rights reserved. Copyright jqwidgets. All rights reserved. Copyright Tableau Software, Inc. All rights reserved. Copyright MaxMind, Inc. All Rights Reserved. Copyright TMate Software s.r.o. All rights reserved. Copyright MapR Technologies Inc. All rights reserved. Copyright Amazon Corporate LLC. All rights reserved. This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the Licenses for the specific language governing permissions and limitations under the Licenses. 
This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License Agreement, which may be found at http:// www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright ( ) 1993-2006, all rights reserved. This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html. This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg, <daniel@haxx.se>. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. The product includes software copyright 2001-2005 ( ) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/ license.html. The product includes software copyright 2004-2007, The Dojo Foundation. All Rights Reserved. 
Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license. This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html. This product includes software copyright 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http:// www.gnu.org/software/ kawa/software-license.html. This product includes OSSP UUID software which is Copyright 2002 Ralf S. Engelschall, Copyright 2002 The OSSP Project Copyright 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php. This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http:/ /www.boost.org/license_1_0.txt. This product includes software copyright 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http:// www.pcre.org/license.txt. This product includes software copyright 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http:// www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.

This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?license, http:// www.stlport.org/doc/ license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/license.txt, http://hsqldb.org/web/hsqllicense.html, http:// httpunit.sourceforge.net/doc/ license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/ license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/opensourcelicense.html, http://fusesource.com/downloads/licenseagreements/fuse-message-broker-v-5-3- license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/license.txt; http://jotm.objectweb.org/bsd_license.html;. http://www.w3.org/consortium/legal/ 2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http:// forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http:// www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iodbc/license; http:// www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/ license.html; http://www.openmdx.org/#faq; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http:// www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/createjs/easeljs/blob/master/src/easeljs/display/bitmap.js; 
http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/license; http://jdbc.postgresql.org/license.html; http:// protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/license; http://web.mit.edu/kerberos/krb5- current/doc/mitk5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/license; https://github.com/hjiang/jsonxx/ blob/master/license; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/license; http://one-jar.sourceforge.net/index.php? page=documents&file=license; https://github.com/esotericsoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/ blueprints/blob/master/license.txt; and http://gee.cs.oswego.edu/dl/classes/edu/oswego/cs/dl/util/concurrent/intro.html. This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php) the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http:// www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/ licenses/bsd-3-clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artisticlicense-1.0) and the Initial Developer s Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/). This product includes software copyright 2003-2006 Joe WaInes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. 
This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/. This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject to terms of the MIT license. See patents at https://www.informatica.com/legal/patents.html. DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice. NOTICES This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions: 1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. 2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS. Part Number: IN-BDI-961HF3-0001

Table of Contents

Preface.... 7
    Informatica Resources.... 7
        Informatica My Support Portal.... 7
        Informatica Documentation.... 7
        Informatica Product Availability Matrixes.... 7
        Informatica Web Site.... 7
        Informatica How-To Library.... 8
        Informatica Knowledge Base.... 8
        Informatica Support YouTube Channel.... 8
        Informatica Marketplace.... 8
        Informatica Velocity.... 8
        Informatica Global Customer Support.... 8
Chapter 1: Installation and Configuration.... 9
    Installation and Configuration Overview.... 9
        Informatica Big Data Edition Installation Process.... 9
    Before You Begin.... 10
        Install and Configure PowerCenter.... 10
        Install and Configure PowerExchange Adapters.... 10
        Install and Configure Data Replication.... 11
        Pre-Installation Tasks for a Single Node Environment.... 11
        Pre-Installation Tasks for a Cluster Environment.... 11
    Informatica Big Data Edition Installation.... 13
        Installing in a Single Node Environment.... 13
        Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol.... 14
        Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol.... 15
        Installing in a Cluster Environment from any Machine.... 15
        Installing Big Data Edition Using Cloudera Manager.... 16
    After You Install.... 17
        Configure Hadoop Pushdown Properties for the Data Integration Service.... 17
        Update Hadoop Cluster Configuration Parameters.... 18
        Reference Data Requirements.... 18
        Hive Variables for Mappings in a Hive Environment.... 19
        Library Path and Path Variables for Mappings in a Hive Environment.... 20
        Hadoop Environment Properties File.... 20
        Configure Big Data Edition for Hive CLI or HiveServer2.... 20
    Informatica Big Data Edition Uninstallation.... 22
        Uninstalling Big Data Edition.... 22

Chapter 2: Mappings on Hadoop Distributions.... 24
    Mappings on Hadoop Distributions Overview.... 24
    Mappings on Cloudera CDH.... 25
        Configure Hadoop Cluster Properties for Cloudera CDH.... 25
        Create a Staging Directory on HDFS.... 26
        Configure Virtual Memory Limits.... 27
        Add hbase_protocol.jar to the Hadoop classpath.... 27
        Configure the HiveServer2 Environment.... 27
        Configure HiveServer2 for DB2 Partitioning.... 28
        Disable SQL Standard Based Authorization for HiveServer2.... 28
    Mappings on Hortonworks HDP.... 29
        Configure Hadoop Cluster Properties for Hortonworks HDP.... 29
        Choose MapReduce or Tez.... 31
        Add hbase_protocol.jar to the Hadoop classpath.... 33
        Enable HBase Support for Hive CLI.... 33
        Create the HiveServer2 Environment Variables.... 34
        Disable SQL Standard Based Authorization.... 35
        Enable Storage Based Authorization with HiveServer2.... 35
        Enable Support for HBase with HiveServer2.... 36
        Configure HiveServer2 for DB2 Partitioning.... 37
    Mappings on IBM BigInsights.... 37
        User Account for the JDBC and Hive Connections.... 37
    Mappings on MapR.... 37
        Verify the Cluster Details.... 38
        Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1.... 39
        Configure hive-site.xml on Every Node in the Hadoop Cluster for MapReduce 1.... 40
        Configure yarn-site.xml on the Data Integration Service Machine for MapReduce 2.... 40
        Configure yarn-site.xml on Every Node in the Cluster for MapReduce 2.... 41
        Configure MapR Distribution Variables for Mappings in a Hive Environment.... 42
        Copy MapR Distribution Files for PowerCenter Mappings in the Native Environment.... 42
        Configure the PowerCenter Integration Service.... 43
        Configure the Heap Space for the MapR-FS.... 43
        Enable Hive Pushdown for HBase.... 43
        Configure the HiveServer2 Environment.... 44
        Disable SQL Standard Based Authorization.... 44
    Mappings on Pivotal HD.... 45
        Configure Hadoop Cluster Properties for Pivotal HD in yarn-site.xml.... 45
        Configure Virtual Memory Limits.... 46
    Update the Repository Plug-in.... 46
    Informatica Developer Files and Variables.... 47
    Enable Support for Lookup Transformations with Teradata Data Objects.... 47

    Configure High Availability.... 48
        Configuring a Highly Available Cloudera Cluster.... 48
        Enable Support for a Highly Available Hortonworks HDP Cluster.... 49
        Configuring a Highly Available IBM BigInsights Cluster.... 53
        Configuring a Highly Available MapR Cluster.... 54
        Configuring a Highly Available Pivotal Cluster.... 55
Index.... 56

Preface

The Informatica Big Data Edition Installation and Configuration Guide is written for the system administrator who is responsible for installing Informatica Big Data Edition. This guide assumes you have knowledge of operating systems, relational database concepts, and the database engines, flat files, or mainframe systems in your environment. This guide also assumes you are familiar with the interface requirements for the Hadoop environment.

Informatica Resources

Informatica My Support Portal

As an Informatica customer, you can access the Informatica My Support Portal at http://mysupport.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, Informatica Product Documentation, and access to the Informatica user community.

Informatica Documentation

The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at infa_documentation@informatica.com. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.

The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to Product Documentation from http://mysupport.informatica.com.

Informatica Product Availability Matrixes

Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types of data sources and targets that a product release supports. You can access the PAMs on the Informatica My Support Portal at https://mysupport.informatica.com/community/my-support/product-availability-matrices.

Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Informatica How-To Library

As an Informatica customer, you can access the Informatica How-To Library at http://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more about Informatica products and features. It includes articles and interactive demonstrations that provide solutions to common problems, compare features and behaviors, and guide you through performing specific real-world tasks.

Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge Base, contact the Informatica Knowledge Base team through email at KB_Feedback@informatica.com.

Informatica Support YouTube Channel

You can access the Informatica Support YouTube channel at http://www.youtube.com/user/infasupport. The Informatica Support YouTube channel includes videos about solutions that guide you through performing specific tasks. If you have questions, comments, or ideas about the Informatica Support YouTube channel, contact the Support YouTube team through email at supportvideos@informatica.com or send a tweet to @INFASupport.

Informatica Marketplace

The Informatica Marketplace is a forum where developers and partners can share solutions that augment, extend, or enhance data integration implementations. By leveraging any of the hundreds of solutions available on the Marketplace, you can improve your productivity and speed up time to implementation on your projects. You can access Informatica Marketplace at http://www.informaticamarketplace.com.

Informatica Velocity

You can access Informatica Velocity at http://mysupport.informatica.com. Developed from the real-world experience of hundreds of data management projects, Informatica Velocity represents the collective knowledge of our consultants who have worked with organizations from around the world to plan, develop, deploy, and maintain successful data management solutions. If you have questions, comments, or ideas about Informatica Velocity, contact Informatica Professional Services at ips@informatica.com.

Informatica Global Customer Support

You can contact a Customer Support Center by telephone or through Online Support. Online Support requires a user name and password. You can request a user name and password at http://mysupport.informatica.com. The telephone numbers for Informatica Global Customer Support are available from the Informatica web site at http://www.informatica.com/us/services-and-training/support-services/global-support-centers/.

Chapter 1: Installation and Configuration

This chapter includes the following topics:

Installation and Configuration Overview, 9
Before You Begin, 10
Informatica Big Data Edition Installation, 13
After You Install, 17
Informatica Big Data Edition Uninstallation, 22

Installation and Configuration Overview

The Informatica Big Data Edition installation is distributed as a Red Hat Package Manager (RPM) installation package. The RPM package includes the Informatica engine and adapter components. The RPM package and the binary files that you need to run the Big Data Edition installation are compressed into a tar.gz file.

For Cloudera CDH 5.2 and 5.3, you can use Cloudera Manager to distribute the Big Data Edition installation as parcels across the Hadoop cluster nodes. To create parcel repositories, upload the Informatica parcels and manifest.json to a web server hosted on your local network. For more information about how to use Cloudera Manager to distribute the Big Data Edition installation, see Informatica Knowledge Base article 303347.

After you complete the installation, you must enable Informatica mappings to run on the Hadoop cluster for your Hadoop distribution. You must also complete the upgrade tasks to upgrade Big Data Edition.

Informatica Big Data Edition Installation Process

You can install Big Data Edition in a single node or cluster environment.

Installing in a Single Node Environment

You can install Big Data Edition in a single node environment.

1. Extract the Big Data Edition tar.gz file to the machine.
2. Install Big Data Edition by running the installation shell script in a Linux environment.
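As a minimal sketch of the parcel repository described in the overview above: the repository is a web-served directory that holds the Informatica parcel files and manifest.json. The directory path, parcel file name, and port below are placeholders, not values from this guide.

```shell
# Sketch only: the parcel file name and repository path are placeholders.
# The layout (parcels plus manifest.json in one HTTP-served directory) is
# what Cloudera Manager expects for a remote parcel repository.
REPO=/tmp/informatica-parcels
mkdir -p "$REPO"

# Copy the Informatica parcel for your OS and the manifest.json shipped
# with the release into the repository directory (placeholder files here).
touch "$REPO/INFORMATICA-9.6.1-1.el6.parcel"
touch "$REPO/manifest.json"

# Serve the directory over HTTP on the local network, for example:
#   (cd "$REPO" && python3 -m http.server 8080) &
# Then register http://<web-server-host>:8080/ as a parcel repository URL
# in Cloudera Manager and distribute the parcels to the cluster nodes.
ls "$REPO"
```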

Installing in a Cluster Environment

You can install Big Data Edition in a cluster environment.
1. Extract the Big Data Edition tar.gz file to a machine.
2. Distribute the RPM package to all of the nodes within the Hadoop cluster. You can distribute the RPM package using any of the following protocols: File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Network File System (NFS), or Secure Copy Protocol (SCP).
3. Install Big Data Edition by running the installation shell script in a Linux environment. You can install Big Data Edition from the primary NameNode or from any machine using the HadoopDataNodes file.

Install from the primary NameNode. You can install Big Data Edition using the FTP, HTTP, NFS, or SCP protocol. During the installation, the installer shell script picks up all of the DataNodes from the $HADOOP_HOME/conf/slaves file. Then, it copies the Big Data Edition binary files to the /<BigDataEditionInstallationDirectory>/Informatica directory on each of the DataNodes. You can perform this step only if you are deploying Hadoop from the primary NameNode.

Install from any machine. Add the IP addresses or machine host names, one for each line, for each of the nodes in the Hadoop cluster in the HadoopDataNodes file. During the Big Data Edition installation, the installation shell script picks up all of the nodes from the HadoopDataNodes file and copies the Big Data Edition binary files to the /<BigDataEditionInstallationDirectory>/Informatica directory on each of the nodes.

Before You Begin

Before you begin the installation, install the Informatica components and PowerExchange adapters, and perform the pre-installation tasks.

Install and Configure PowerCenter

Before you install Big Data Edition, install and configure Informatica PowerCenter.
You can install the following PowerCenter editions:
- PowerCenter Advanced Edition
- PowerCenter Standard Edition
- PowerCenter Real Time Edition

You must install the Informatica services and clients. Run the Informatica services installation to configure the PowerCenter domain and create the Informatica services. Run the Informatica client installation to create the PowerCenter Client.

Install and Configure PowerExchange Adapters

Based on your business needs, install and configure PowerExchange adapters. Use Big Data Edition with PowerCenter and Informatica adapters for access to sources and targets. To run Informatica mappings in a Hive environment, you must install and configure PowerExchange for Hive. For more information, see the Informatica PowerExchange for Hive User Guide.

PowerCenter Adapters

Use PowerCenter adapters, such as PowerExchange for Hadoop, to define sources and targets in PowerCenter mappings. For more information about installing and configuring PowerCenter adapters, see the PowerExchange adapter documentation.

Informatica Adapters

You can use the following Informatica adapters as part of PowerCenter Big Data Edition:
- PowerExchange for DataSift
- PowerExchange for Facebook
- PowerExchange for HBase
- PowerExchange for HDFS
- PowerExchange for Hive
- PowerExchange for LinkedIn
- PowerExchange for Teradata Parallel Transporter API
- PowerExchange for Twitter
- PowerExchange for Web Content-Kapow Katalyst

For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication

To migrate data with minimal downtime and perform auditing and operational reporting functions, install and configure Data Replication. For more information, see the Informatica Data Replication User Guide.

Pre-Installation Tasks for a Single Node Environment

Before you begin the Big Data Edition installation in a single node environment, perform the following pre-installation tasks:
- Verify that Hadoop is installed with Hadoop Distributed File System (HDFS) and MapReduce. The Hadoop installation should include a Hive data warehouse that is configured to use a non-embedded database as the MetaStore. For more information, see the Apache website at http://hadoop.apache.org.
- To perform both read and write operations in native mode, install the required third-party client software. For example, install the Oracle client to connect to the Oracle database.
- Verify that the Big Data Edition administrator user can run sudo commands or has root user privileges.
- Verify that the temporary folder on the local node has at least 700 MB of disk space.
- Download the InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz file to the temporary folder.
- Extract the InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz file to the local node where you want to run the Big Data Edition installation.

Pre-Installation Tasks for a Cluster Environment

Before you begin the Big Data Edition installation in a cluster environment, perform the following tasks:
- Install third-party software.

- Verify the distribution method.
- Verify system requirements.
- Verify connection requirements.
- Download the RPM.

Install Third-Party Software

Verify that the following third-party software is installed:

Hadoop with Hadoop Distributed File System (HDFS) and MapReduce. Hadoop must be installed on every node within the cluster. The Hadoop installation must include a Hive data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive to use a local or remote MetaStore server. For more information, see the Apache website at http://hadoop.apache.org/.
Note: Informatica does not support embedded MetaStore server setups.

Database client software to perform read and write operations in native mode. Install the client software for the database. Informatica requires the client software to run MapReduce jobs. For example, install the Oracle client to connect to the Oracle database. Install the database client software on all the nodes within the Hadoop cluster.

Verify the Distribution Method

You can distribute the RPM package with one of the following protocols:
- File Transfer Protocol (FTP)
- Hypertext Transfer Protocol (HTTP)
- Network File System (NFS) protocol
- Secure Copy Protocol (SCP)

If the Hadoop cluster uses Cloudera CDH, you can use Cloudera Manager to distribute the Big Data Edition installation as parcels across the Hadoop cluster nodes.

To verify that you can distribute the RPM package with one of the protocols, perform the following tasks:
Note: If you use Cloudera Manager to distribute Big Data Edition, skip these tasks.
1. Ensure that the server or service for your distribution method is running.
2. In the config file on the machine where you want to run the Big Data Edition installation, set the DISTRIBUTOR_NODE parameter to the following setting:
   - FTP: DISTRIBUTOR_NODE=ftp://<Distributor Node IP Address>/pub
   - HTTP: DISTRIBUTOR_NODE=http://<Distributor Node IP Address>
   - NFS: DISTRIBUTOR_NODE=<shared file location on the node>. The file location must be accessible to all nodes in the cluster.

Verify System Requirements

Verify the following system requirements:
- The Big Data Edition administrator can run sudo commands or has root user privileges.
- The temporary folder on each of the nodes on which Big Data Edition will be installed has at least 700 MB of disk space.
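As a concrete illustration of the DISTRIBUTOR_NODE settings described above, the following sketch writes the parameter for an HTTP distribution. The temporary file stands in for the installer's config file, and the address 10.0.0.5 is a placeholder, not a value from the product.

```shell
# Sketch: set DISTRIBUTOR_NODE for HTTP distribution.
# The file here stands in for the installer's config file;
# 10.0.0.5 is a placeholder distributor node address.
CFG=$(mktemp)
echo 'DISTRIBUTOR_NODE=http://10.0.0.5' > "$CFG"
# For FTP the value would be ftp://10.0.0.5/pub; for NFS, a shared path
# that all nodes in the cluster can access.
grep '^DISTRIBUTOR_NODE=' "$CFG"
```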

Verify Connection Requirements

Verify the connection to the Hadoop cluster nodes. Big Data Edition requires a Secure Shell (SSH) connection without a password between the machine where you want to run the Big Data Edition installation and all the nodes in the Hadoop cluster.

Download the RPM

Download the InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz file to a temporary folder. Extract the file to the machine from where you want to distribute the RPM package and run the Big Data Edition installation.

Copy the InformaticaHadoop-<InformaticaForHadoopVersion>.rpm package to a shared directory based on the transfer protocol you are using. For example:
- HTTP: /var/www/html
- FTP: /var/ftp/pub
- NFS: <shared location on the node>. The file location must be accessible by all the nodes in the cluster.

Note: The RPM package must be stored on a local disk and not on HDFS.

Informatica Big Data Edition Installation

You can install Big Data Edition in a single node environment. You can also install Big Data Edition in a cluster environment from the primary NameNode or from any machine.

Install Big Data Edition in a single node environment or cluster environment:
- Install Big Data Edition in a single node environment.
- Install Big Data Edition in a cluster environment from the primary NameNode using the SCP protocol.
- Install Big Data Edition in a cluster environment from the primary NameNode using the FTP, HTTP, or NFS protocol.
- Install Big Data Edition in a cluster environment from any machine.
- Install Big Data Edition from a shell command line.

Installing in a Single Node Environment

You can install Big Data Edition in a single node environment.
1. Log in to the machine.
2. Run the following command from the Big Data Edition root directory to start the installation in console mode:
   bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Edition terms of agreement.

4. Press Enter.
5. Press 1 to install Big Data Edition in a single node environment.
6. Press Enter.
7. Type the absolute path for the Big Data Edition installation directory and press Enter. Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; ' ` < > , ? + [ ] \
   If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.
8. Press Enter.

The installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package. To get more information about the tasks performed by the installer, you can view the informatica-hadoop-install.<datetimestamp>.log installation log file.

Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol

You can install Big Data Edition in a cluster environment from the primary NameNode using the SCP protocol.
1. Log in to the primary NameNode.
2. Run the following command to start the Big Data Edition installation in console mode:
   bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Edition terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Edition in a cluster environment.
6. Press Enter.
7. Type the absolute path for the Big Data Edition installation directory. Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; ' ` < > , ? + [ ] \
   If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.
8. Press Enter.
9. Press 1 to install Big Data Edition from the primary NameNode.
10. Press Enter.
11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.
12. Press Enter.
13. Type y.
14. Press Enter.
The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package. You can view the informatica-hadoop-install.<datetimestamp>.log installation log file to get more information about the tasks performed by the installer.
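Conceptually, the SCP-based distribution behaves like the following dry-run loop. This is an illustration of the behavior described above, not the actual installer script: the host names are invented, a temporary file stands in for $HADOOP_HOME/conf/slaves, and echo stands in for the real scp copy.

```shell
# Dry-run sketch of how the installer distributes the binaries (not the
# real installer script). Host names are invented for illustration.
SLAVES=$(mktemp)                      # stand-in for $HADOOP_HOME/conf/slaves
printf 'datanode1\ndatanode2\n' > "$SLAVES"
OUT=$(while read -r node; do
  # The real installer copies with scp; echo keeps this a dry run.
  echo "scp InformaticaHadoop-<version>.rpm root@$node:/opt/Informatica"
done < "$SLAVES")
echo "$OUT"    # prints one copy command per DataNode
```

Because the copy relies on SCP, this is also why the passwordless SSH connection described in the connection requirements must be in place before the installation starts.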

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol

You can install Big Data Edition in a cluster environment from the primary NameNode using the FTP, HTTP, or NFS protocol.
1. Log in to the primary NameNode.
2. Run the following command to start the Big Data Edition installation in console mode:
   bash InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Edition terms of agreement.
4. Press Enter.
5. Press 2 to install Big Data Edition in a cluster environment.
6. Press Enter.
7. Type the absolute path for the Big Data Edition installation directory. Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; ' ` < > , ? + [ ] \
   If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.
8. Press Enter.
9. Press 1 to install Big Data Edition from the primary NameNode.
10. Press Enter.
11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.
12. Press Enter.
13. Type n.
14. Press Enter.
15. Type y.
16. Press Enter.

The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package. You can view the informatica-hadoop-install.<datetimestamp>.log installation log file to get more information about the tasks performed by the installer.

Installing in a Cluster Environment from Any Machine

You can install Big Data Edition in a cluster environment from any machine.
1. Verify that the Big Data Edition administrator has root user privileges on the node that will be running the Big Data Edition installation.
2. Log in to the machine as the root user.
3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop cluster on which you want to install Big Data Edition. The HadoopDataNodes file is located on the node from where you want to launch the Big Data Edition installation. Add one IP address or machine host name per line.

4. Run the following command to start the Big Data Edition installation in console mode:
   bash InformaticaHadoopInstall.sh
5. Press y to accept the Big Data Edition terms of agreement.
6. Press Enter.
7. Press 2 to install Big Data Edition in a cluster environment.
8. Press Enter.
9. Type the absolute path for the Big Data Edition installation directory and press Enter. Start the path with a slash. Default is /opt.
10. Press Enter.
11. Press 2 to install Big Data Edition using the HadoopDataNodes file.
12. Press Enter.

The installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package on the first node that appears in the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

Installing Big Data Edition Using Cloudera Manager

You can install Big Data Edition on a Cloudera CDH cluster using Cloudera Manager. Perform the following steps:
1. Download the INFORMATICA-<version>-informatica-<version>.parcel.tar file.
2. Extract manifest.json and the parcels from the .tar file.
3. Verify the location of your Local Parcel Repository. In Cloudera Manager, click Administration > Settings > Parcels.
4. Create a SHA file with the parcel name and hash listed in manifest.json that corresponds with your Hadoop cluster. For example, use the following parcel name for Hadoop cluster nodes that run Red Hat Enterprise Linux 6.4 64-bit: INFORMATICA-<version>-informatica-<version>-el6.parcel
   Use the following hash listed for Red Hat Enterprise Linux 6.4 64-bit: 8e904e949a11c4c16eb737f02ce4e36ffc03854f
   To create a SHA file, run the following command:
   echo <hash> > <ParcelName>.sha
   For example, run the following command:
   echo 8e904e949a11c4c16eb737f02ce4e36ffc03854f > INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha
5. Transfer the parcel and SHA file to the Local Parcel Repository with FTP.
6. Check for new parcels with Cloudera Manager. To check for new parcels, click Hosts > Parcels.
7. Distribute the Big Data Edition parcels.
8. Activate the Big Data Edition parcels.
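The SHA file creation in step 4 can be scripted as follows, using the parcel name and hash from the example above. Substitute the name and hash from manifest.json that match your own distribution.

```shell
# Create the .sha file that Cloudera Manager expects next to the parcel.
# Parcel name and hash are the Red Hat Enterprise Linux 6.4 example values
# from the text; replace them with the values from your manifest.json.
HASH=8e904e949a11c4c16eb737f02ce4e36ffc03854f
PARCEL=INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel
cd "$(mktemp -d)"                 # work in a scratch directory for this sketch
echo "$HASH" > "$PARCEL.sha"
cat "$PARCEL.sha"                 # the file contains only the hash
```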

After You Install

After you install Big Data Edition, perform the post-installation tasks to ensure that Big Data Edition runs properly. Perform the following tasks:
- Configure the Hadoop pushdown properties for the Data Integration Service.
- Optionally, update Hadoop cluster configuration parameters for mappings in a Hive environment.
- Optionally, install the address validation reference data.
- Configure Hive variables for mappings in a Hive environment.
- Configure library path and path variables for mappings in a Hive environment.
- Configure environment variables in the Big Data Edition properties file.
- Choose Hive CLI or HiveServer2.

Configure Hadoop Pushdown Properties for the Data Integration Service

Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hive environment. You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.

The Data Integration Service has the following Hadoop pushdown properties:

Informatica Home Directory on Hadoop
   The Big Data Edition home directory on every data node created by the Hadoop RPM install. Type /<BigDataEditionInstallationDirectory>/Informatica.

Hadoop Distribution Directory
   The directory containing a collection of Hive and Hadoop JARs on the cluster from the RPM install locations. The directory contains the minimum set of JARs required to process Informatica mappings in a Hadoop environment. Type /<BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[hadoop_distribution_name].

Data Integration Service Hadoop Distribution Directory
   The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

Hadoop Distribution Directory

You can modify the Hadoop distribution directory on the data nodes.
When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop JARs, and the Snappy libraries, required to process Informatica mappings in a Hive environment from your Hadoop install location. The actual Hive and Hadoop JARs can vary depending on the Hadoop distribution and version. The Hadoop RPM installs the Hadoop distribution directories in the following path: <BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop

Update Hadoop Cluster Configuration Parameters

Hadoop cluster configuration parameters that set the Java library path in the mapred-site.xml file can override the paths set in hadoopEnv.properties. Update the mapred-site.xml cluster configuration file on all the cluster nodes to remove Java options that set the Java library path.

The following cluster configuration parameters in mapred-site.xml can override the Java library path set in hadoopEnv.properties:
- mapreduce.admin.map.child.java.opts
- mapreduce.admin.reduce.child.java.opts

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, mappings can fail to run in a Hive environment. After you install, perform the following steps:
- Update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path from the property configuration.
- Edit hadoopEnv.properties to include the user Hadoop libraries in the Java library path.

Example to Update mapred-site.xml on Cluster Nodes

If mapred-site.xml sets the following configuration for the mapreduce.admin.map.child.java.opts parameter:

   <name>mapreduce.admin.map.child.java.opts</name>
   <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/:/mylib/ -Djava.net.preferIPv4Stack=true</value>
   <final>true</final>

the path to the Hadoop libraries in mapreduce.admin.map.child.java.opts overrides the following path set in hadoopEnv.properties:

   infapdo.java.opts=-Xmx512m -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64 -Djava.security.egd=file:/dev/./urandom

To run mappings in a Hive environment, complete the following steps:
- Remove the -Djava.library.path Java option from the mapreduce.admin.map.child.java.opts parameter.
- Change hadoopEnv.properties to include the Hadoop libraries in the paths /usr/lib/hadoop/lib/native and /mylib/ with the following syntax:

   infapdo.java.opts=-Xmx512m -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/:/mylib/ -Djava.security.egd=file:/dev/./urandom

Reference Data Requirements

If you have a Data Quality product license, you can push a mapping that contains data quality transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data values are accurate and correctly formatted. When you apply a pushdown operation to a mapping that contains data quality transformations, the operation can copy the reference data that the mapping uses. The pushdown operation copies reference table data,

content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster deletes the reference data that the pushdown operation copied with the mapping.

Note: The pushdown operation does not copy address validation reference data. If you push a mapping that performs address validation, you must install the address validation reference data files on each DataNode that runs the mapping. The cluster does not delete the address validation reference data files after the address validation mapping runs.

Address validation mappings validate and enhance the accuracy of postal address records. You can buy address reference data files from Informatica on a subscription basis. You can download the current address reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files

To install the address reference data files on each DataNode in the cluster, create an automation script.
1. Browse to the address reference data files that you downloaded from Informatica. You download the files in a compressed format.
2. Extract the data files.
3. Copy the files to the NameNode machine or to another machine that can write to the DataNodes.
4. Create an automation script to copy the files to each DataNode.
   - If you copied the files to the NameNode, use the slaves file for the Hadoop cluster to identify the DataNodes.
   - If you copied the files to another machine, use the Hadoop_Nodes.txt file to identify the DataNodes. Find the Hadoop_Nodes.txt file in the Big Data Edition installation package.
   The default directory for the address reference data files in the Hadoop environment is /reference_data. If you install the files to a non-default directory, create the AV_HADOOP_DATA_LOCATION custom property on the Data Integration Service to identify the directory. Create the custom property on the Data Integration Service that performs the pushdown operation in the native environment.
5. Run the automation script. The script copies the address reference data files to the DataNodes.

Hive Variables for Mappings in a Hive Environment

To run mappings in a Hive environment, configure Hive environment variables. You can configure Hive environment variables in the /<BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/<Hadoop_distribution_name>/conf/hive-site.xml file.

Configure the following Hive environment variables:
- hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict. Configure these variables if you want to use Hive dynamic partitioned tables.
- hive.optimize.ppd=false. Disable predicate pushdown optimization to get accurate results for mappings with Hive version 0.9.0. You cannot use predicate pushdown optimization for a Hive query that uses multiple insert statements. The default Hadoop RPM installation sets hive.optimize.ppd to false.
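In hive-site.xml, variables like those described in the Hive Variables section above take the standard Hadoop property form. The property names and values come from the text; the surrounding layout is a sketch of how such entries typically appear in the file, not a verbatim excerpt from the shipped configuration.

```xml
<!-- Sketch: hive-site.xml entries for the Hive variables described above. -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.optimize.ppd</name>
  <value>false</value>
</property>
```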
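The automation script that step 5 of "Installing the Address Reference Data Files" above runs could be sketched as a loop like the following. This is an illustration only: the host names are invented, a temporary file stands in for the slaves or Hadoop_Nodes.txt file, and echo stands in for the real scp so the sketch runs anywhere.

```shell
# Dry-run sketch of an automation script that copies the address reference
# data files to each DataNode. Host names and source path are assumptions.
NODES=$(mktemp)                 # stand-in for the slaves or Hadoop_Nodes.txt file
printf 'datanode1\ndatanode2\n' > "$NODES"
OUT=$(while read -r node; do
  # Replace echo with the actual scp to perform the copy.
  echo "scp -r /tmp/reference_data root@$node:/reference_data"
done < "$NODES")
echo "$OUT"    # prints one copy command per DataNode
```

The target /reference_data matches the default directory named in the text; if you copy to a different directory, remember the AV_HADOOP_DATA_LOCATION custom property described in step 4.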