Informatica Data Lake Management on the AWS Cloud


Informatica Data Lake Management on the AWS Cloud

Quick Start Reference Deployment

January 2018

Informatica Big Data Team
Vinod Shukla
AWS Quick Start Reference Team

Contents

Overview
Informatica Components
Costs and Licenses
Architecture
Informatica Services on AWS
Planning the Data Lake Management Deployment
Deployment Options
Prerequisites
Deployment Steps
Step 1. Prepare Your AWS Account
Step 2. Upload Your Informatica License
Step 3. Launch the Quick Start
Step 4. Monitor the Deployment
Step 5. Download and Install Informatica Developer
Manual Cleanup
Troubleshooting
Using Informatica Data Lake Management on AWS
Transient and Persistent Clusters
Common AWS Architecture Patterns for Informatica Data Lake Management
Process Flow
Additional Resources
GitHub Repository
Document Revisions

This Quick Start deployment guide was created by Amazon Web Services (AWS) in partnership with Informatica. Quick Starts are automated reference deployments that use AWS CloudFormation templates to deploy key technologies on AWS, following AWS best practices.

Overview

This Quick Start reference deployment guide provides step-by-step instructions for deploying the Informatica Data Lake Management solution on the AWS Cloud. A data lake uses a single, Hadoop-based data repository that you create to manage the supply and demand of data. Informatica's solution on the AWS Cloud integrates, organizes, administers, governs, and secures large volumes of both structured and unstructured data. The solution delivers actionable, fit-for-purpose, reliable, and secure information for business insights.

Consider the following key principles when you implement a data lake:

- The data lake must remove barriers to onboarding data of any type and size from any source.
- Data must be easily refined and immediately provisioned for consumption.
- Data must be easy to find, retrieve, and share within the organization.
- Data is a corporate asset, managed accountably and collaboratively by data governance, data quality, and data security initiatives.

This Quick Start is for users who want to deploy and develop an Informatica Data Lake Management solution on the AWS Cloud.

Informatica Components

The Data Lake Management solution uses the following Informatica products:

- Informatica Big Data Management enables your organization to process large, diverse, and fast-changing datasets so you can get insights into your data. Use Big Data Management to perform big data integration and transformation without writing or maintaining Apache Hadoop code. Collect diverse data faster, build business logic in a visual environment, and eliminate hand-coding to get insights from your data.

- Informatica Enterprise Data Catalog brings together all data assets in an enterprise and presents a comprehensive view of the data assets and their relationships. Enterprise Data Catalog captures the technical, business, and operational metadata for a large number of data assets that you use to determine the effectiveness of enterprise data. From across the enterprise, Enterprise Data Catalog gathers information related to metadata, including column data statistics, data domains, data object relationships, and data lineage. A comprehensive view of enterprise metadata can help you make critical decisions on data integration, data quality, and data governance in the enterprise.

- The Developer tool includes the native and Hadoop run-time environments for optimal processing. In the native environment, the Data Integration Service processes the data. In the Hadoop environment, the Data Integration Service pushes the processing to nodes in a Hadoop cluster.

Costs and Licenses

You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you will be using.

This Quick Start requires a license to deploy the Informatica Data Lake Management solution, as described in the Prerequisites section. To sign up for a demo license, contact Informatica.

Architecture

Figure 1 shows the typical components of a generic data lake management solution.

Figure 1: Components of a data lake management solution

The solution includes the following core components, beginning with the lower part of the diagram in Figure 1:

- Big Data Infrastructure: From a connectivity perspective (for example, on-premises, cloud, IoT, unstructured, and semi-structured sources), the solution reliably accommodates an expanding volume and variety of data types. The solution can scale up (by increasing individual hardware capacity) or scale out (by increasing infrastructure capacity linearly for parallel processing), and can be deployed directly into your AWS environment.

- Big Data Storage: The solution can store large amounts of a variety of data (structured, unstructured, semi-structured) at scale, with performance that guarantees timely delivery of data to business analysts.

- Big Data Processing: The solution can process data at any latency, such as real time, near real time, and batch, using big data processing frameworks such as Apache Spark.

- Metadata Intelligence manages all the metadata from a variety of data sources. For example, a data catalog manages data generated by big data and by traditional sources. To do this, it collects, indexes, and applies machine learning to metadata. It also provides metadata services such as semantic search, automated data domain discovery and tagging, and data intelligence that can guide user behavior.

- Big Data Integration: A data lake architecture must integrate data from various disparate data sources, at any latency, with the ability to rapidly develop ELT (extract, load, and transform) or ETL (extract, transform, and load) data flows.

- Big Data Governance and Quality are critical to a data lake, especially when dealing with a variety of data. The purpose of big data governance is to deliver trusted, timely, and relevant information to support business outcomes.

- Big Data Security is the process of minimizing data risk. Activities include discovering, identifying, classifying, and protecting sensitive data, as well as analyzing its risk based on value, location, protection, and proliferation.

- Finally, Intelligent Data Applications (Self-Service Data Preparation, Enterprise Data Catalog, and Data Security Intelligence) provide data analysts, data scientists, data stewards, and data architects with a collaborative self-service platform for data governance and security that can discover, catalog, and prepare data for big data analytics.

Informatica Services on AWS

Deploying this Quick Start with default parameters builds the Informatica Data Lake environment illustrated in Figure 2 in the AWS Cloud. The Quick Start deployment automatically creates the following Informatica elements:

- Domain
- Model Repository Service
- Data Integration Service

In addition, the deployment automatically embeds Hadoop clusters in the virtual private cloud (VPC) for metadata storage and processing. The deployment then assigns the connection to the Amazon EMR cluster for the Hadoop Distributed File System (HDFS) and Hive. It also sets up connections to enable scanning of Amazon Simple Storage Service (Amazon S3) and Amazon Redshift environments as part of the data lake.

The Informatica domain and repository database are hosted on Amazon Relational Database Service (Amazon RDS) using Oracle, which handles management tasks such as backups, patch management, and replication.

To access Informatica services on the AWS Cloud, you can install the Informatica client to run Big Data Management on a Microsoft Windows machine. You can then access Enterprise Data Catalog by using a web browser.

Figure 2 shows the Informatica Data Lake Management solution deployed on AWS.

Figure 2: Informatica Data Lake Management solution deployed on AWS

The Quick Start sets up a highly available architecture that spans two Availability Zones, and a VPC configured with public and private subnets according to AWS best practices. Managed network address translation (NAT) gateways are deployed into the public subnets and configured with Elastic IP addresses for outbound internet connectivity.

The Quick Start also installs and configures the following Informatica services during the one-click deployment:

- Informatica domain, which is the fundamental administrative unit of the Informatica platform. The Informatica platform has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines.

- Model Repository Service, which manages the model repository, a relational database that stores all the metadata for projects created using Informatica client tools. The model repository also stores run-time and configuration information for applications that are deployed to a Data Integration Service.

- Data Integration Service, which is a compute component within the Informatica domain that manages requests to submit big data integration, big data quality, and profiling jobs to the Hadoop cluster for processing.

- Content Management Service, which manages reference data. It provides reference data information to the Data Integration Service and Informatica Developer.

- Analyst Service, which runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool. You can perform column and rule profiling, manage scorecards, and manage bad records and duplicate records in the Analyst tool.

- Profiling, which helps you find the content, quality, and structure of data sources of an application, schema, or enterprise. A profile is a repository object that finds and analyzes data irregularities across data sources in the enterprise, as well as hidden data problems that put data projects at risk. The profiling results include unique values, null values, data domains, and data patterns. When you use this Quick Start, you can run profiling on the Data Integration Service (the default) or on Hadoop.

- Business Glossary, which consists of online glossaries of business terms and policies that define important concepts within an organization. Data stewards create and publish terms that include information such as descriptions, relationships to other terms, and associated categories. Glossaries are stored in a central location for easy lookup by consumers. Glossary assets include business terms, policies, and categories that contain information that consumers might search for. A glossary is a high-level container that stores Glossary assets. A business term defines relevant concepts within the organization, and a policy defines the business purpose that governs practices related to the term. Business terms and policies can be associated with categories, which are descriptive classifications.

- Catalog Service, which runs Enterprise Data Catalog and manages connections between service components and external applications.

- An embedded Hadoop cluster that uses Hortonworks, running HDFS, HBase, YARN, and Solr.

- Informatica Cluster Service, which runs and manages all Hadoop services, the Apache Ambari server, and the Apache Ambari agents on the embedded Hadoop cluster.

- Metadata and Catalog, which include the metadata persistence store, search index, and graph database in an embedded Hadoop cluster. The catalog represents an indexed inventory of all the data assets in the enterprise that you configure in Enterprise Data Catalog. Enterprise Data Catalog organizes all the enterprise metadata in the catalog and enables the users of external applications to discover and understand the data.

The Informatica domain and the Informatica Model Repository databases are configured on Amazon RDS using Oracle.

Planning the Data Lake Management Deployment

Deployment Options

This Quick Start provides two deployment options:

- Deployment of the Data Lake Management solution into a new VPC (end-to-end deployment). This option builds a new virtual private cloud (VPC) with public and private subnets, and then deploys the Informatica Data Lake Management solution into that infrastructure.

- Deployment of the Data Lake Management solution into an existing VPC. This option provisions data lake components into your existing AWS infrastructure.

The Quick Start provides separate templates for these options. It also lets you configure CIDR blocks, instance types, and data lake settings, as discussed later in this guide.

Prerequisites

Specialized Knowledge

Before you deploy this Quick Start, we recommend that you become familiar with the following AWS services:

- Amazon VPC
- Amazon EC2
- Amazon EMR

If you are new to AWS, see the Getting Started Resource Center.

Technical Requirements

Before you deploy this Quick Start, verify the following prerequisites:

- You have an account with AWS, and you know the account login information.

- You have purchased a license for the Informatica Data Lake Management solution. To sign up for a demo license, contact Informatica, your sales representative, or the consulting partner you're working with. The license file has a name like AWSDatalakeLicense.key.

Deployment Steps

Step 1. Prepare Your AWS Account

1. If you don't already have an AWS account, create one at https://aws.amazon.com by following the on-screen instructions.

2. Use the region selector in the navigation bar to choose the AWS Region where you want to deploy the Informatica Data Lake Management solution on AWS.

3. Create a key pair in your preferred region. When you log in to any Amazon EC2 system or Amazon EMR cluster, you use a private key file for authentication. The file has a file name extension of .pem. If you do not have an existing .pem key to use, follow the instructions in the AWS documentation to create a key pair.

   Note: Your administrator might ask you to use a particular existing key pair.

   When you create a key pair, you save the .pem file to your desktop system. At the same time, AWS saves the key pair to your account. Make a note of the key pair that you want to use for the Data Lake Management instance, so that you can provide the key pair name during network configuration.

4. If necessary, request a service limit increase for the Amazon EC2 M3 and M4 instance types. You might need to do this if you already have an existing deployment that uses these instance types and you think you might exceed the default limit with this reference deployment.

Step 2. Upload Your Informatica License

Upload the license for the Informatica Data Lake Management solution to an S3 bucket, following the instructions in the Amazon S3 documentation. You will be prompted for the bucket name during deployment. To sign up for a demo license, contact Informatica, your sales representative, or the consulting partner you're working with.
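If you prefer to script this step, the following sketch uses the AWS SDK for Python (Boto3) to create a bucket and upload the license key. It is a minimal illustration, not part of the Quick Start itself; the bucket name and file name are placeholders, and the key file must sit at the top level of the bucket.

    import boto3

    # Hedged sketch: upload the Informatica license key to an S3 bucket with Boto3.
    s3 = boto3.client("s3")

    bucket = "my-informatica-license-bucket"   # placeholder bucket name
    license_file = "AWSDatalakeLicense.key"    # local path to the license file

    # Create the bucket if it does not already exist (in us-east-1 no
    # LocationConstraint is needed; other regions require one).
    s3.create_bucket(Bucket=bucket)

    # The key file must be at the top level of the bucket, not in a subfolder.
    s3.upload_file(license_file, bucket, license_file)
    print("Uploaded license to s3://{}/{}".format(bucket, license_file))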

Step 3. Launch the Quick Start

Note: You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using this Quick Start. For full details, see the pricing pages for each AWS service you will be using in this Quick Start. Prices are subject to change.

1. Choose one of the following options to launch the AWS CloudFormation template into your AWS account. For help choosing an option, see Deployment Options earlier in this guide.

   Option 1: Deploy the data lake into a new VPC on AWS (Launch)
   Option 2: Deploy the data lake into an existing VPC on AWS (Launch)

   Important: If you're deploying Informatica Data Lake Management into an existing VPC, make sure that your VPC has two private and two public subnets in different Availability Zones for the database instances. These subnets require NAT gateways or NAT instances in their route tables, to allow the instances to download packages and software without exposing them to the internet. You'll also need the domain name option configured in the DHCP options, as explained in the Amazon VPC documentation. You'll be prompted for your VPC settings when you launch the Quick Start.

   Each deployment takes about two hours to complete.

2. Check the region that's displayed in the upper-right corner of the navigation bar, and change it if necessary. This is where the network infrastructure for Informatica Data Lake Management will be built. The template is launched in the US East (Ohio) Region by default.

3. On the Select Template page, keep the default setting for the template URL, and then choose Next.

4. On the Specify Details page, change the stack name if needed. Review the parameters for the template. Provide values for the parameters that require input. For all other parameters, review the default settings and customize them as necessary. When you finish reviewing and customizing the parameters, choose Next.

In the following tables, parameters are listed by category and described separately for the two deployment options:

- Parameters for deploying Informatica components into a new VPC
- Parameters for deploying Informatica components into an existing VPC

Note: The templates for the two scenarios share most, but not all, of the same parameters. For example, the template for an existing VPC prompts you for the VPC and subnet IDs in your existing VPC environment. You can also download the templates and edit them to create your own parameters based on your specific deployment scenario.

Option 1: Parameters for deploying into a new VPC (View template)

Network Configuration:

- Availability Zones (AvailabilityZones): The two Availability Zones that will be used to deploy Informatica Data Lake Management components. The Quick Start preserves the logical order you specify.
- VPC CIDR (VPCCIDR): The CIDR block for the VPC. The default is a /16 block.
- Private Subnet 1 CIDR (PrivateSubnet1CIDR): The CIDR block for the private subnet located in Availability Zone 1. The default is a /19 block.
- Private Subnet 2 CIDR (PrivateSubnet2CIDR): The CIDR block for the private subnet located in Availability Zone 2. The default is a /19 block.
- Public Subnet 1 CIDR (PublicSubnet1CIDR): The CIDR block for the public (DMZ) subnet located in Availability Zone 1. The default is a /20 block.
- Public Subnet 2 CIDR (PublicSubnet2CIDR): The CIDR block for the public (DMZ) subnet located in Availability Zone 2. The default is a /20 block.
- IP Address Range (RemoteAccessCIDR): The CIDR IP range that is permitted to access the Informatica domain and the Amazon EMR cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses.

Amazon EC2 Configuration:

- Informatica Embedded Cluster Size (ICSClusterSize): The size of the Informatica embedded cluster. Choose Small (c4.8xlarge, single node), Medium (c4.8xlarge, three nodes), or Large (c4.8xlarge, six nodes). The default is Small.
- Informatica Domain Instance Type (InformaticaServerInstanceType): The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge. The default is c4.4xlarge.
- Key Pair Name (KeyPairName): A public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair you created in your preferred region when you set up your AWS account.

Amazon EMR Configuration:

- EMR Cluster Name (EMRClusterName): The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.
- EMR Core Instance Type (EMRCoreInstanceType): The instance type for Amazon EMR core nodes. The default is m4.xlarge.
- EMR Core Nodes (EMRCoreNodes): The number of core nodes. Enter a value between 1 and 500.
- EMR Master Instance Type (EMRMasterInstanceType): The instance type for the Amazon EMR master node. The default is m4.xlarge.
- EMR Logs Bucket Name (EMRLogBucket): The S3 bucket where the Amazon EMR logs will be stored.

Amazon RDS Configuration:

- Informatica Database Username (DBUser): The user name for the database instance associated with the Informatica domain and services (such as Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string. The default is awsquickstart.
- Informatica Database Instance Password (DBPassword): The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

- Redshift Cluster Type (RedshiftClusterType): The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes to provision in your cluster. The default is single-node.
- Redshift Database Name (RedshiftDatabaseName): The name of the first database to create when the cluster is created. The default is dev.
- Redshift Database Port (RedshiftDatabasePort): The port number on which the cluster accepts incoming connections. The default is 5439.
- Redshift Number of Nodes (RedshiftNumberOfNodes): The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1. The default is 1.
- Redshift Node Type (RedshiftNodeType): The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation. The default is ds2.xlarge.
- Redshift Username (RedshiftUsername): The user name that is associated with the master user account for the cluster that is being created. The default is defaultuser.
- Redshift Password (RedshiftPassword): The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that contains at least one uppercase letter, one lowercase letter, and one number.

Informatica Enterprise Catalog and BDM Configuration:

- Informatica Administrator Username (InformaticaAdminUser): The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password; you will use them later to log in to the Administrator tool and configure the Informatica domain.
- Informatica Administrator Password (InformaticaAdminPassword): The administrator password for accessing Big Data Management. You can specify any string.
- License Key Location (InformaticaKeyS3Bucket): The name of the S3 bucket in your account that contains the Informatica license key.
- License Key Name (InformaticaKeyName): The Informatica license key name; for example, INFALicense_10_2.key. Note: The key file must be in the top level of the S3 bucket, not in a subfolder.
- Import Sample Content (ImportSampleData): Select Yes to import sample catalog data. You can use the sample data to get started with the product. The default is No.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the parameters in this category.

- Quick Start S3 Bucket Name (QSS3BucketName): The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. You can specify your own bucket if you copy all of the assets and submodules into it and want to customize the templates and override the Quick Start behavior for your specific implementation. The default is quickstart-reference.
- Quick Start S3 Key Prefix (QSS3KeyPrefix): The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation. The default is informatica/datalake/latest/.

Option 2: Parameters for deploying into an existing VPC (View template)

Network Configuration:

- VPC (VPCID): The ID of your existing VPC where you want to deploy the Informatica Data Lake Management solution. The VPC must meet the following requirements: it must be set up with public access through the internet via an attached internet gateway, the DNS Resolution property of the VPC must be set to Yes, and the Edit DNS Hostnames property of the VPC must be set to Yes.
- Informatica Domain Subnet (InformaticaServerSubnetID): A publicly accessible subnet ID where the Informatica domain will reside. Select one of the available subnets listed.
- Informatica Database Subnets (DBSubnetIDs): The IDs of two private subnets in the selected VPC. Note: These subnets must be in different Availability Zones in the selected VPC.
- IP Address Range (IPAddressRange): The CIDR IP range that is permitted to access the Informatica domain and the Informatica embedded cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses.

Amazon EC2 Configuration:

- Key Pair Name (KeyPairName): A public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair you created in your preferred region when you set up your AWS account.
- Informatica Domain Instance Type (InformaticaServerInstanceType): The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge. The default is c4.4xlarge.
- Informatica Embedded Cluster Size (ICSClusterSize): The size of the Informatica embedded cluster. Choose Small (c4.8xlarge, single node), Medium (c4.8xlarge, three nodes), or Large (c4.8xlarge, six nodes). The default is Small.

Amazon EMR Configuration:

- EMR Master Instance Type (EMRMasterInstanceType): The instance type for the Amazon EMR master node. The default is m4.xlarge.
- EMR Core Instance Type (EMRCoreInstanceType): The instance type for Amazon EMR core nodes. The default is m4.xlarge.
- EMR Cluster Name (EMRClusterName): The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.
- EMR Core Nodes (EMRCoreNodes): The number of core nodes. Enter a value between 1 and 500.
- EMR Logs Bucket Name (EMRLogBucket): The S3 bucket where the Amazon EMR logs will be stored.

Amazon RDS Configuration:

- Informatica Database Username (DBUser): The user name for the database instance associated with the Informatica domain and services (such as Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string. The default is awsquickstart.
- Informatica Database Instance Password (DBPassword): The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

- Redshift Database Name (RedshiftDatabaseName): The name of the first database to create when the cluster is created. The default is dev.
- Redshift Cluster Type (RedshiftClusterType): The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes to provision in your cluster. The default is single-node.
- Redshift Number of Nodes (RedshiftNumberOfNodes): The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1. The default is 1.
- Redshift Node Type (RedshiftNodeType): The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation. The default is ds2.xlarge.
- Redshift Username (RedshiftUsername): The user name that is associated with the master user account for the cluster that is being created. The default is defaultuser.
- Redshift Password (RedshiftPassword): The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that contains at least one uppercase letter, one lowercase letter, and one number.
- Redshift Database Port (RedshiftDatabasePort): The port number on which the cluster accepts incoming connections. The default is 5439.

Informatica Enterprise Catalog and BDM Configuration:

- Informatica Administrator Username (InformaticaAdminUsername): The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password; you will use them later to log in to the Administrator tool and configure the Informatica domain.
- Informatica Administrator Password (InformaticaAdminPassword): The administrator password for accessing Big Data Management. You can specify any string.
- License Key Location (InformaticaKeyS3Bucket): The name of the S3 bucket in your account that contains the Informatica license key.
- License Key Name (InformaticaKeyName): The Informatica license key name; for example, INFALicense_10_2.key. Note: The key file must be in the top level of the S3 bucket, not in a subfolder.
- Import Sample Content (ImportSampleData): Select Yes to import sample catalog data. You can use the sample data to get started with the product. The default is No.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the parameters in this category.

- Quick Start S3 Bucket Name (QSS3BucketName): The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. You can specify your own bucket if you copy all of the assets and submodules into it and want to customize the templates and override the Quick Start behavior for your specific implementation. The default is quickstart-reference.
- Quick Start S3 Key Prefix (QSS3KeyPrefix): The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation. The default is informatica/datalake/latest/.

When you finish reviewing and customizing the parameters, choose Next.

5. On the Options page, you can specify tags (key-value pairs) for resources in your stack and set advanced options. When you're done, choose Next.

6. On the Review page, review and confirm the template settings. Under Capabilities, select the check box to acknowledge that the template will create IAM resources.

7. Choose Create to deploy the stack.
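As an alternative to the console steps above, you can launch the stack with the AWS SDK for Python (Boto3). This is a hedged sketch rather than the documented launch procedure: the template URL, stack name, and parameter values are placeholders, and only a few of the parameters from the tables above are shown.

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-2")  # US East (Ohio), the default region

    response = cfn.create_stack(
        StackName="Informatica-Data-Lake",
        # Placeholder URL; point this at the Quick Start master template for the
        # option you chose (new VPC or existing VPC).
        TemplateURL="https://s3.amazonaws.com/my-bucket/templates/informatica-datalake.template",
        Parameters=[
            {"ParameterKey": "KeyPairName", "ParameterValue": "my-keypair"},
            {"ParameterKey": "InformaticaKeyS3Bucket", "ParameterValue": "my-informatica-license-bucket"},
            {"ParameterKey": "InformaticaKeyName", "ParameterValue": "AWSDatalakeLicense.key"},
            # ...provide the remaining required parameters from the tables above...
        ],
        # Acknowledges that the template creates IAM resources (step 6 above).
        Capabilities=["CAPABILITY_IAM"],
    )
    print("Stack ID:", response["StackId"])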

Step 4. Monitor the Deployment

During deployment, you can monitor the creation of the cluster instance and the Informatica domain, and get more information about system resources.

1. Choose the stack that you are creating, and then choose the Events tab to monitor the creation of the stack. Figure 3 shows part of the Events tab.

   Figure 3: Monitoring the deployment in the Events tab

   When stack creation is complete, the Status field shows CREATE_COMPLETE, and the Outputs tab displays a list of the stacks that have been created, as shown in Figure 4.

   Figure 4: Stack creation complete

2. Choose the Resources tab. This tab displays information about the stack and the Data Lake instance. You can select the linked physical ID properties of individual resources to get more information about them, as shown in Figure 5.

   Figure 5: Resources tab

3. Choose the Outputs tab. When the Informatica domain setup is complete, the Outputs tab displays the following information:

   - RedShiftIamRole: Amazon Resource Name (ARN) for the Amazon Redshift IAM role
   - EICCatalogURL: URL for the Informatica EIC user console
   - InstanceID: Informatica domain host name
   - InformaticaAdminConsoleURL: URL for the Informatica administrator console
   - EtcHostFileEntry: Entry to add to the /etc/hosts file to enable access to the domain, using the host name of the administrative server
   - EICAdminURL: URL for the EIC administrator
   - EMRResourceManagerURL: URL for the Amazon EMR Resource Manager
   - RedShiftClusterEndpoint: Amazon Redshift cluster endpoint
   - CloudFormationLogs: Location of the AWS CloudFormation installation log
   - S3DatalakeBucketName: Name of the S3 bucket used for the data lake
   - InstanceSetupLogs: Location of the setup log for the Informatica domain EC2 instance
   - InformaticaHadoopInstallLogs: Location of the master node Hadoop installation log
   - InformaticaDomainDatabaseEndPoint: Informatica domain database endpoint
   - InformaticaAdminConsoleServerLogs: Location of the Informatica domain installation log
   - InformaticaHadoopClusterURL: URL to the IHS Hadoop gateway node

   - InformaticaBDMDeveloperClient: Location where you can download the Informatica Developer tool (see Step 5)

   Note: If the Outputs tab is not populated with this information, wait for domain setup to complete.

4. Use the links in the Outputs tab to access Informatica management tools. For example:

   - InformaticaAdminConsoleURL opens the Instance Administration screen. You can use this screen to manage Informatica services and resources. You can also get additional information about the instance, such as the public DNS and public IP address.
   - EICAdminURL administers the Enterprise Data Catalog environment.
   - EICCatalogURL accesses Enterprise Data Catalog. See the Informatica Enterprise Data Catalog User Guide for information about logging in to Enterprise Data Catalog.

Step 5. Download and Install Informatica Developer

Informatica Developer (the Developer tool) is an application that you use to design and implement data integration, data quality, data profiling, data services, and big data solutions. You can use the Developer tool to import metadata, create connections, and create data objects. You can also use the Developer tool to create and run profiles, mappings, and workflows.

1. Log in to the AWS CloudFormation console and choose the Outputs tab.

2. Right-click the value of the InformaticaBDMDeveloperClient key to download the Developer tool client installer.

3. Uncompress and launch the installer to install the Developer tool on a local drive.
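The Outputs tab can also be read programmatically. The following hedged Boto3 sketch waits for the stack to finish creating and then prints selected output values, such as the InformaticaAdminConsoleURL and the InformaticaBDMDeveloperClient download location; the stack name and region are placeholders.

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-2")
    stack_name = "Informatica-Data-Lake"   # placeholder stack name

    # Block until CREATE_COMPLETE; the deployment takes about two hours, so widen
    # the waiter's polling window accordingly.
    cfn.get_waiter("stack_create_complete").wait(
        StackName=stack_name,
        WaiterConfig={"Delay": 60, "MaxAttempts": 180},
    )

    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    outputs = {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}

    print("Admin console:   ", outputs.get("InformaticaAdminConsoleURL"))
    print("EDC catalog:     ", outputs.get("EICCatalogURL"))
    print("Developer client:", outputs.get("InformaticaBDMDeveloperClient"))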

Manual Cleanup

If you deploy the Quick Start for a new VPC, Amazon EMR creates security groups that are not deleted when you delete the Amazon EMR cluster. To clean up after deployment, follow these steps:

1. Delete the Amazon EMR cluster.

2. Delete the Amazon EMR-managed security groups (ElasticMapReduce-master, ElasticMapReduce-slave). Because the groups reference each other, first delete the circularly dependent rules, and then delete the security groups themselves.

3. Delete the AWS CloudFormation stack.

Troubleshooting

Q. I encountered a CREATE_FAILED error when I launched the Quick Start.

A. If you encounter this error in the AWS CloudFormation console, we recommend that you relaunch the template with Rollback on failure set to No. (This setting is under Advanced in the AWS CloudFormation console, Options page.) With this setting, the stack's state will be retained and the instance will be left running, so you can troubleshoot the issue. (You'll want to look at the log files in %ProgramFiles%\Amazon\EC2ConfigService and C:\cfn\log.)

Important: When you set Rollback on failure to No, you'll continue to incur AWS charges for this stack. Be sure to delete the stack when you've finished troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website.

Q. I encountered an error while installing the Informatica domain and services.

A. We recommend that you view the /installation.log file to get more information about the errors you encountered.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation templates.

A. We recommend that you launch the Quick Start templates from the location we've provided or from another S3 bucket. If you deploy the templates from a local copy on your computer or from a non-S3 location, you might encounter template size limitations when you create the stack. For more information about AWS CloudFormation limits, see the AWS documentation.
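If you prefer to script step 2 of the manual cleanup above, the following hedged Boto3 sketch revokes the circularly dependent rules in the EMR-managed security groups and then deletes the groups; the VPC ID is a placeholder.

    import boto3

    ec2 = boto3.client("ec2")

    groups = ec2.describe_security_groups(
        Filters=[
            {"Name": "vpc-id", "Values": ["vpc-0123abcd"]},   # placeholder VPC ID
            {"Name": "group-name",
             "Values": ["ElasticMapReduce-master", "ElasticMapReduce-slave"]},
        ]
    )["SecurityGroups"]

    # The two groups reference each other, so remove their ingress rules first.
    for group in groups:
        if group["IpPermissions"]:
            ec2.revoke_security_group_ingress(
                GroupId=group["GroupId"], IpPermissions=group["IpPermissions"]
            )

    # With the circular dependency gone, the groups themselves can be deleted.
    for group in groups:
        ec2.delete_security_group(GroupId=group["GroupId"])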

Using Informatica Data Lake Management on AWS

After you deploy this Quick Start, you can use any of the patterns described in this section to work with the Informatica Data Lake Management solution on AWS.

Transient and Persistent Clusters

Amazon EMR provides two ways to configure a cluster: transient and persistent. Transient clusters shut down when their jobs are complete. For example, if a batch-processing job pulls web logs from Amazon S3 and processes the data once a day, it is more cost-effective to use a transient cluster to process the web log data and shut down the nodes when processing is complete. Persistent clusters continue to run after data processing is complete. The Informatica Data Lake Management solution supports both cluster types. For more information, see the Amazon EMR best practices whitepaper.

This Quick Start sets up a persistent EMR cluster with a configurable number of core nodes, as defined by the EMRCoreNodes parameter.
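For the transient case, a cluster can be launched with its processing steps and configured to terminate when those steps finish. The following Boto3 sketch illustrates the transient pattern only; it is not part of the Quick Start, and the cluster name, EMR release, roles, and S3 paths are placeholders.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="weblog-batch-transient",
        ReleaseLabel="emr-5.10.0",                      # placeholder EMR release
        LogUri="s3://my-emr-logs-bucket/",
        Instances={
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 3,
            # Transient behavior: shut the cluster down when the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "process-weblogs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/process_weblogs.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Transient cluster ID:", response["JobFlowId"])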

Common AWS Architecture Patterns for Informatica Data Lake Management

Informatica Data Lake Management supports the following patterns, which leverage AWS for big data processing.

Pattern 1: Using Amazon S3

In this first pattern, data is loaded to Amazon S3 using Informatica. For data processing, the Informatica Big Data Management mapping logic pulls data from Amazon S3 and sends it to Amazon EMR for processing. Amazon EMR does not copy the data to local disk or HDFS. Instead, the mappings open multithreaded HTTP connections to Amazon S3, pull data into the Amazon EMR cluster, and process the data in streams, as illustrated in Figure 6.

Figure 6: Pattern 1 using Amazon S3

Pattern 2: Using HDFS and Amazon S3 as Backup Storage

In this pattern, Informatica writes data directly to HDFS, leverages the Amazon EMR task nodes to process the data, and periodically copies data to Amazon S3 as backup storage, as illustrated in Figure 7. The advantage of this pattern is the ability to process data without copying it to Amazon EMR for each job. Although copying data to Amazon EMR may improve performance, the disadvantage is durability: because Amazon EMR uses ephemeral disks to store data, data could be lost if an EC2 instance in the Amazon EMR cluster fails. HDFS replicates data within the Amazon EMR cluster and can usually recover from node failures. However, data loss could still occur if the number of lost nodes is greater than your replication factor. Informatica recommends that you back up HDFS data to Amazon S3 periodically.

Figure 7: Pattern 2 using HDFS and Amazon S3 as backup
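One way to run that periodic backup is to submit an s3-dist-cp step to the persistent EMR cluster. The following Boto3 sketch is a hedged illustration; the cluster ID, HDFS path, and backup bucket are placeholders, and in practice you would run it on a schedule.

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",   # placeholder: the persistent EMR cluster ID
        Steps=[{
            "Name": "backup-hdfs-to-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "hdfs:///user/informatica/data",        # placeholder HDFS path
                    "--dest", "s3://my-backup-bucket/hdfs-backup/",  # placeholder backup bucket
                ],
            },
        }],
    )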

Pattern 3: Using Amazon Kinesis and Kinesis Firehose for Real-Time and Streaming Analytics

In the third pattern, unbounded event streams that are continuously generated by devices, IoT applications, and cloud applications are ingested into Amazon Kinesis in real time, using Informatica Edge Data Streaming. With Informatica Big Data Streaming, which leverages the existing Informatica platform, streaming pipelines can be built using pre-built transformations, connectors, and parsers. These elements are optimized to run on an Amazon EMR cluster in streaming mode using Spark Streaming. They can consume data records from an Amazon Kinesis stream and act as a producer that writes data to a defined Amazon Kinesis Firehose delivery stream. Data can be persisted to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES), and delivered as JSON and binary payloads. For more information about deploying Informatica Big Data Streaming on AWS, contact Informatica or your implementation partner.

Figure 8 shows the Informatica Big Data Streaming architecture.

Figure 8: Pattern 3 using the Informatica Big Data Streaming architecture
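On the ingest side of this pattern, producers write event records to the Kinesis stream that the streaming pipeline consumes. The sketch below is a minimal Boto3 illustration of such a producer (it stands in only conceptually for Informatica Edge Data Streaming); the stream name and event fields are placeholders.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    event = {
        "device_id": "sensor-42",            # placeholder device identifier
        "temperature": 21.7,
        "timestamp": "2018-01-15T10:00:00Z",
    }

    kinesis.put_record(
        StreamName="iot-events",                    # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["device_id"],            # keeps each device's records ordered
    )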

Pattern 4: Using AWS for Self-Service Data Discovery and Preparation

In the last pattern, Informatica Enterprise Data Lake provides data analysts with a collaborative, self-service big data discovery and preparation solution. Analysts can rapidly discover raw data and turn it into insights, with quality and governance powered by data intelligence deployed on AWS.

When deployed on AWS, Informatica Enterprise Data Lake leverages the existing Informatica platform, which allows analysts to discover, search, and explore data assets for analysis using an AI-driven data catalog. The Data Lake Management solution makes recommendations based on the behavior and shared knowledge of the data assets used for analysis. Once analysts find the relevant data, they can blend, transform, cleanse, and enrich it by using a Microsoft Excel-like data preparation interface, at scale on an Amazon EMR cluster. Data is prepared, published, and made available for consumption in the data lake. An analyst can assess the prepared data using ad hoc queries to generate charts, tables, and other visual formats. IT can operationalize the analysts' ad hoc data preparation steps into Informatica big data mappings, which run in batch on an Amazon EMR cluster.

You can deploy Informatica Enterprise Data Lake on the same AWS infrastructure that supports Informatica Big Data Management and Informatica Enterprise Data Catalog. Figure 9 shows the data flows for Informatica Enterprise Data Lake.

Figure 9: Data flows used in pattern 4

Process Flow

Figure 10 shows the process flow for using the Informatica Data Lake Management solution on AWS. It illustrates the data flow through the Informatica Data Lake Management solution and Amazon EMR, Amazon S3, and Amazon Redshift.

Figure 10: Informatica Data Lake Management solution process flow using Amazon EMR

The numbers in Figure 10 refer to the following steps:

Step 1: Collect and move data from on-premises systems into Amazon S3 storage. Consider offloading infrequently used data, and batch-load raw data to a defined landing zone in Amazon S3.

Step 2: Collect cloud application data and streaming data generated by machines and sensors in Amazon S3 storage instead of staging it in a temporary file system or a data warehouse.

Step 3: Discover and profile data stored in Amazon S3, using Amazon EMR as the processing infrastructure. Profile data to better understand its structure and context. Parse raw data, in multi-structured or unstructured formats, to extract features and entities, and cleanse data with data quality tasks. To prepare data for analysis, execute prebuilt transformations and data quality rules natively in EMR.

Step 4: Match duplicate data within and across big data sources, and link the matches to create a single view.

Step 5: Perform data masking to protect confidential data, such as credit card information, social security numbers, names, addresses, and phone numbers, from unintended exposure and to reduce the risk of data breaches. Data masking helps IT organizations manage access to their most sensitive data, providing enterprise-wide scalability, robustness, and connectivity to a vast array of databases.

Step 6: Data analysts and data scientists can prepare and collaborate on data for analytics by using semantic search, data discovery, and intuitive data preparation tools for interactive analysis with trusted, secure, and governed data assets.

Step 7: After cleansing and transforming data on Amazon EMR, move high-value curated data back to Amazon S3 or to Amazon Redshift. From there, users can directly access the data with BI reports and applications.
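For step 7, one common way to move curated data from Amazon S3 into Amazon Redshift is a COPY command. The following hedged sketch issues the COPY over a standard PostgreSQL connection using the psycopg2 library; the cluster endpoint, credentials, table, S3 prefix, and IAM role ARN (the RedShiftIamRole stack output) are all placeholders.

    import psycopg2

    # Placeholder connection details; use the RedShiftClusterEndpoint stack output.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-2.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="defaultuser",
        password="MyRedshiftPassw0rd",
    )

    copy_sql = """
        COPY analytics.curated_orders
        FROM 's3://my-datalake-bucket/curated/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV;
    """

    # The transaction is committed when the 'with conn' block exits without error.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)   # Redshift loads the S3 files in parallel

    conn.close()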

Additional Resources

AWS services

- AWS CloudFormation
- Amazon EBS
- Amazon EC2
- Amazon EMR
- Amazon Redshift
- Amazon S3
- Amazon VPC

Informatica

- Informatica Network: a source for product documentation, Knowledge Base articles, and other information

Quick Start reference deployments

- AWS Quick Start home page

GitHub Repository

You can visit our GitHub repository to download the templates and scripts for this Quick Start, to post your comments, and to share your customizations with others.

Document Revisions

Date            Change
January 2018    Initial publication

© 2018, Amazon Web Services, Inc. or its affiliates, and Informatica LLC. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS's current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS's products or services, each of which is provided "as is" without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions, or assurances from AWS, its affiliates, suppliers, or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the "license" file accompanying this file. This code is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information

CloudHealth. AWS and Azure On-Boarding

CloudHealth. AWS and Azure On-Boarding CloudHealth AWS and Azure On-Boarding Contents 1. Enabling AWS Accounts... 3 1.1 Setup Usage & Billing Reports... 3 1.2 Setting Up a Read-Only IAM Role... 3 1.3 CloudTrail Setup... 5 1.4 Cost and Usage

More information

What s New at AWS? looking at just a few new things for Enterprise. Philipp Behre, Enterprise Solutions Architect, Amazon Web Services

What s New at AWS? looking at just a few new things for Enterprise. Philipp Behre, Enterprise Solutions Architect, Amazon Web Services What s New at AWS? looking at just a few new things for Enterprise Philipp Behre, Enterprise Solutions Architect, Amazon Web Services 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

More information

Amazon AWS-Solution-Architect-Associate Exam

Amazon AWS-Solution-Architect-Associate Exam Volume: 858 Questions Question: 1 You are trying to launch an EC2 instance, however the instance seems to go into a terminated status immediately. What would probably not be a reason that this is happening?

More information

Infosys Information Platform. How-to Launch on AWS Marketplace Version 1.2.2

Infosys Information Platform. How-to Launch on AWS Marketplace Version 1.2.2 Infosys Information Platform How-to Launch on AWS Marketplace Version 1.2.2 Copyright Notice 2016 Infosys Limited, Bangalore, India. All Rights Reserved. Infosys believes the information in this document

More information

Leverage the Oracle Data Integration Platform Inside Azure and Amazon Cloud

Leverage the Oracle Data Integration Platform Inside Azure and Amazon Cloud Leverage the Oracle Data Integration Platform Inside Azure and Amazon Cloud WHITE PAPER / AUGUST 8, 2018 DISCLAIMER The following is intended to outline our general product direction. It is intended for

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive

More information

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS 1. What is SAP Vora? SAP Vora is an in-memory, distributed computing solution that helps organizations uncover actionable business insights

More information

EdgeConnect for Amazon Web Services (AWS)

EdgeConnect for Amazon Web Services (AWS) Silver Peak Systems EdgeConnect for Amazon Web Services (AWS) Dinesh Fernando 2-22-2018 Contents EdgeConnect for Amazon Web Services (AWS) Overview... 1 Deploying EC-V Router Mode... 2 Topology... 2 Assumptions

More information

Transit VPC Deployment Using AWS CloudFormation Templates. White Paper

Transit VPC Deployment Using AWS CloudFormation Templates. White Paper Transit VPC Deployment Using AWS CloudFormation Templates White Paper Introduction Amazon Web Services(AWS) customers with globally distributed networks commonly need to securely exchange data between

More information

Hortonworks DataPlane Service

Hortonworks DataPlane Service Data Steward Studio Administration () docs.hortonworks.com : Data Steward Studio Administration Copyright 2016-2017 Hortonworks, Inc. All rights reserved. Please visit the Hortonworks Data Platform page

More information

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide FAQs 1. What is the browser compatibility for logging into the TCS Connected Intelligence Data Lake for Business Portal? Please check whether you are using Mozilla Firefox 18 or above and Google Chrome

More information

Introduction to AWS GoldBase. A Solution to Automate Security, Compliance, and Governance in AWS

Introduction to AWS GoldBase. A Solution to Automate Security, Compliance, and Governance in AWS Introduction to AWS GoldBase A Solution to Automate Security, Compliance, and Governance in AWS September 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

At Course Completion Prepares you as per certification requirements for AWS Developer Associate. [AWS-DAW]: AWS Cloud Developer Associate Workshop Length Delivery Method : 4 days : Instructor-led (Classroom) At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

More information

Store, Protect, Optimize Your Healthcare Data in AWS

Store, Protect, Optimize Your Healthcare Data in AWS Healthcare reform, increasing patient expectations, exponential data growth, and the threat of cyberattacks are forcing healthcare providers to re-evaluate their data management strategies. Healthcare

More information

AWS Glue. Developer Guide

AWS Glue. Developer Guide AWS Glue Developer Guide AWS Glue: Developer Guide Copyright 2017 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may not be used in connection

More information

AWS Landing Zone. AWS User Guide. November 2018

AWS Landing Zone. AWS User Guide. November 2018 AWS Landing Zone AWS User Guide November 2018 Copyright (c) 2018 by Amazon.com, Inc. or its affiliates. AWS Landing Zone User Guide is licensed under the terms of the Amazon Software License available

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

Pexip Infinity and Amazon Web Services Deployment Guide

Pexip Infinity and Amazon Web Services Deployment Guide Pexip Infinity and Amazon Web Services Deployment Guide Contents Introduction 1 Deployment guidelines 2 Configuring AWS security groups 4 Deploying a Management Node in AWS 6 Deploying a Conferencing Node

More information

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing

More information

AWS Remote Access VPC Bundle

AWS Remote Access VPC Bundle AWS Remote Access VPC Bundle Deployment Guide Last updated: April 11, 2017 Aviatrix Systems, Inc. 411 High Street Palo Alto CA 94301 USA http://www.aviatrix.com Tel: +1 844.262.3100 Page 1 of 12 TABLE

More information

What s New at AWS? A selection of some new stuff. Constantin Gonzalez, Principal Solutions Architect, Amazon Web Services

What s New at AWS? A selection of some new stuff. Constantin Gonzalez, Principal Solutions Architect, Amazon Web Services What s New at AWS? A selection of some new stuff Constantin Gonzalez, Principal Solutions Architect, Amazon Web Services Speed of Innovation AWS Pace of Innovation AWS has been continually expanding its

More information

Amazon AppStream 2.0: Getting Started Guide

Amazon AppStream 2.0: Getting Started Guide 2018 Amazon AppStream 2.0: Getting Started Guide Build an Amazon AppStream 2.0 environment to stream desktop applications to your users April 2018 https://aws.amazon.com/appstream2/ 1 Welcome This guide

More information

Overview of AWS Security - Database Services

Overview of AWS Security - Database Services Overview of AWS Security - Database Services June 2016 (Please consult http://aws.amazon.com/security/ for the latest version of this paper) 2016, Amazon Web Services, Inc. or its affiliates. All rights

More information

Securing Amazon Web Services (AWS) EC2 Instances with Dome9. A Whitepaper by Dome9 Security, Ltd.

Securing Amazon Web Services (AWS) EC2 Instances with Dome9. A Whitepaper by Dome9 Security, Ltd. Securing Amazon Web Services (AWS) EC2 Instances with Dome9 A Whitepaper by Dome9 Security, Ltd. Amazon Web Services (AWS) provides business flexibility for your company as you move to the cloud, but new

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: +34916267792 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive data integration platform

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

Microsoft SharePoint Server 2013 on the AWS Cloud: Quick Start Reference Deployment

Microsoft SharePoint Server 2013 on the AWS Cloud: Quick Start Reference Deployment Microsoft SharePoint Server 2013 on the AWS Cloud: Quick Start Reference Deployment Mike Pfeiffer August 2014 Last updated: April 2015 (revisions) Table of Contents Abstract... 3 What We ll Cover... 4

More information

SAP Vora - AWS Marketplace Production Edition Reference Guide

SAP Vora - AWS Marketplace Production Edition Reference Guide SAP Vora - AWS Marketplace Production Edition Reference Guide 1. Introduction 2 1.1. SAP Vora 2 1.2. SAP Vora Production Edition in Amazon Web Services 2 1.2.1. Vora Cluster Composition 3 1.2.2. Ambari

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

Oracle WebLogic Server 12c on AWS. December 2018

Oracle WebLogic Server 12c on AWS. December 2018 Oracle WebLogic Server 12c on AWS December 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only. It represents

More information

Datameer for Data Preparation:

Datameer for Data Preparation: Datameer for Data Preparation: Explore, Profile, Blend, Cleanse, Enrich, Share, Operationalize DATAMEER FOR DATA PREPARATION: EXPLORE, PROFILE, BLEND, CLEANSE, ENRICH, SHARE, OPERATIONALIZE Datameer Datameer

More information

4) An organization needs a data store to handle the following data types and access patterns:

4) An organization needs a data store to handle the following data types and access patterns: 1) A company needs to deploy a data lake solution for their data scientists in which all company data is accessible and stored in a central S3 bucket. The company segregates the data by business unit,

More information

AWS Service Catalog. User Guide

AWS Service Catalog. User Guide AWS Service Catalog User Guide AWS Service Catalog: User Guide Copyright 2017 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may not be used in

More information

Hackproof Your Cloud Responding to 2016 Threats

Hackproof Your Cloud Responding to 2016 Threats Hackproof Your Cloud Responding to 2016 Threats Aaron Klein, CloudCheckr Tuesday, June 30 th 2016 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Changing Your Perspective Moving

More information

Community Edition Getting Started Guide. July 25, 2018

Community Edition Getting Started Guide. July 25, 2018 Community Edition Getting Started Guide July 25, 2018 Copyright 2018 by Qualys, Inc. All Rights Reserved. Qualys and the Qualys logo are registered trademarks of Qualys, Inc. All other trademarks are the

More information

Information empowerment for your evolving data ecosystem

Information empowerment for your evolving data ecosystem Information empowerment for your evolving data ecosystem Highlights Enables better results for critical projects and key analytics initiatives Ensures the information is trusted, consistent and governed

More information

Tetration Cluster Cloud Deployment Guide

Tetration Cluster Cloud Deployment Guide First Published: 2017-11-16 Americas Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com Tel: 408 526-4000 800 553-NETS (6387) Fax: 408 527-0883 THE

More information

Security & Compliance in the AWS Cloud. Amazon Web Services

Security & Compliance in the AWS Cloud. Amazon Web Services Security & Compliance in the AWS Cloud Amazon Web Services Our Culture Simple Security Controls Job Zero AWS Pace of Innovation AWS has been continually expanding its services to support virtually any

More information

Automating Elasticity. March 2018

Automating Elasticity. March 2018 Automating Elasticity March 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only. It represents AWS s current product

More information

Introduction to AWS GoldBase

Introduction to AWS GoldBase Introduction to AWS GoldBase A Solution to Automate Security, Compliance, and Governance in AWS October 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

@Pentaho #BigDataWebSeries

@Pentaho #BigDataWebSeries Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of

More information

Architecting for Greater Security in AWS

Architecting for Greater Security in AWS Architecting for Greater Security in AWS Jonathan Desrocher Security Solutions Architect, Amazon Web Services. Guy Tzur Director of Ops, Totango. 2015, Amazon Web Services, Inc. or its affiliates. All

More information

AWS 101. Patrick Pierson, IonChannel

AWS 101. Patrick Pierson, IonChannel AWS 101 Patrick Pierson, IonChannel What is AWS? Amazon Web Services (AWS) is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help

More information

The Emerging Data Lake IT Strategy

The Emerging Data Lake IT Strategy The Emerging Data Lake IT Strategy An Evolving Approach for Dealing with Big Data & Changing Environments bit.ly/datalake SPEAKERS: Thomas Kelly, Practice Director Cognizant Technology Solutions Sean Martin,

More information

Virtual Private Cloud. User Guide. Issue 21 Date HUAWEI TECHNOLOGIES CO., LTD.

Virtual Private Cloud. User Guide. Issue 21 Date HUAWEI TECHNOLOGIES CO., LTD. Issue 21 Date 2018-09-30 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2018. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any

More information

ForeScout CounterACT. (AWS) Plugin. Configuration Guide. Version 1.3

ForeScout CounterACT. (AWS) Plugin. Configuration Guide. Version 1.3 ForeScout CounterACT Hybrid Cloud Module: Amazon Web Services (AWS) Plugin Version 1.3 Table of Contents Amazon Web Services Plugin Overview... 4 Use Cases... 5 Providing Consolidated Visibility... 5 Dynamic

More information

Move Amazon RDS MySQL Databases to Amazon VPC using Amazon EC2 ClassicLink and Read Replicas

Move Amazon RDS MySQL Databases to Amazon VPC using Amazon EC2 ClassicLink and Read Replicas Move Amazon RDS MySQL Databases to Amazon VPC using Amazon EC2 ClassicLink and Read Replicas July 2017 2017, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided

More information

Enroll Now to Take online Course Contact: Demo video By Chandra sir

Enroll Now to Take online Course   Contact: Demo video By Chandra sir Enroll Now to Take online Course www.vlrtraining.in/register-for-aws Contact:9059868766 9985269518 Demo video By Chandra sir www.youtube.com/watch?v=8pu1who2j_k Chandra sir Class 01 https://www.youtube.com/watch?v=fccgwstm-cc

More information

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India (AWS) Overview: AWS is a cloud service from Amazon, which provides services in the form of building blocks, these building blocks can be used to create and deploy various types of application in the cloud.

More information

Splunk & AWS. Gain real-time insights from your data at scale. Ray Zhu Product Manager, AWS Elias Haddad Product Manager, Splunk

Splunk & AWS. Gain real-time insights from your data at scale. Ray Zhu Product Manager, AWS Elias Haddad Product Manager, Splunk Splunk & AWS Gain real-time insights from your data at scale Ray Zhu Product Manager, AWS Elias Haddad Product Manager, Splunk Forward-Looking Statements During the course of this presentation, we may

More information

Serverless Computing. Redefining the Cloud. Roger S. Barga, Ph.D. General Manager Amazon Web Services

Serverless Computing. Redefining the Cloud. Roger S. Barga, Ph.D. General Manager Amazon Web Services Serverless Computing Redefining the Cloud Roger S. Barga, Ph.D. General Manager Amazon Web Services Technology Triggers Highly Recommended http://a16z.com/2016/12/16/the-end-of-cloud-computing/ Serverless

More information

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility. AWS Whitepaper

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility. AWS Whitepaper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility AWS Whitepaper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility: AWS Whitepaper Copyright 2018 Amazon Web

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Amazon Web Services 101 April 17 th, 2014 Joel Williams Solutions Architect. Amazon.com, Inc. and its affiliates. All rights reserved.

Amazon Web Services 101 April 17 th, 2014 Joel Williams Solutions Architect. Amazon.com, Inc. and its affiliates. All rights reserved. Amazon Web Services 101 April 17 th, 2014 Joel Williams Solutions Architect Amazon.com, Inc. and its affiliates. All rights reserved. Learning about Cloud Computing with AWS What is Cloud Computing and

More information

Managing and Auditing Organizational Migration to the Cloud TELASA SECURITY

Managing and Auditing Organizational Migration to the Cloud TELASA SECURITY Managing and Auditing Organizational Migration to the Cloud 1 TELASA SECURITY About Me Brian Greidanus bgreidan@telasasecurity.com 18+ years of security and compliance experience delivering consulting

More information

Amazon Virtual Private Cloud. Getting Started Guide

Amazon Virtual Private Cloud. Getting Started Guide Amazon Virtual Private Cloud Getting Started Guide Amazon Virtual Private Cloud: Getting Started Guide Copyright 2017 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks

More information