Tuning Intelligent Data Lake Performance


© 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

You can tune Intelligent Data Lake for better performance. You can tune parameters for ingesting data, previewing data assets, adding data assets to projects, managing projects, publishing projects, searching for data assets, exporting data assets, and profiling data. This article lists the parameters that you can tune in Intelligent Data Lake and the steps that you must perform to configure them.

Supported Versions

Intelligent Data Lake 10.1

Table of Contents

Overview
Target Audience
Intelligent Data Lake Sizing Recommendations
Tune the Hardware
    CPU Frequency
    NIC Card Ring Buffer Size
Tune the Hadoop Cluster
    Hard Disk Recommendations
    Transparent Huge Page Compaction
    HDFS Block Size Recommendations
    YARN Settings Recommendations for Parallel Jobs
Tune the Services
    Intelligent Data Lake Service Parameters
    Intelligent Data Lake Service Properties
    Catalog Service Properties
    Solr Tuning Parameters
    Tuning for Profiling Performance
    Data Preparation Service Properties
    Model Repository Service Properties
    Catalog Service Sizing Recommendations
    Metadata Extraction Scanner Memory Parameters

Overview

Tuning Intelligent Data Lake for better performance includes tuning the performance parameters at different stages: data ingestion, file upload, data asset preview, project creation, addition of data assets to projects, worksheet preparation, publishing, exporting, and Catalog Service operations. Each operation involves many hardware and software entities, such as the Intelligent Data Lake Service, Catalog Service, Data Preparation Service, Data Integration Service, Hive Server, and Hadoop processes such as the resource manager and node manager.

You can optimize the performance of Intelligent Data Lake by tuning the following areas:

1. Hardware
2. Hadoop cluster level parameters
3. The following services and their properties:
   - Intelligent Data Lake Service
   - Catalog Service
   - Data Preparation Service
   - Model Repository Service
   - Data Integration Service

Tuning profiling performance involves tuning parameters for the Data Integration Service and the profiling warehouse database properties.

The following table lists the Intelligent Data Lake operations and the services that each operation affects:

Operation | Model Repository Service | Data Integration Service | Catalog Service | Hadoop Cluster | Intelligent Data Lake Service | Data Preparation Service
Ingestion | No | Yes | Yes | Yes | No | No
File Upload | No | Yes (BDM mapping or preview) | Yes | Yes | Yes | No
Data Asset Preview | No | No | Yes | No | Yes | No
Project Management | Yes | No | No | No | Yes | No
Addition of Data Asset to a Project | Yes | No | No | No | Yes | No
Worksheet Preparation | Yes | No | No | No | Yes | Yes
Publishing a Worksheet | Yes (mapplets or mappings are saved in the repository) | Yes (BDM mapping execution) | No | No | Yes | Yes (recipe to mapplet)
Exporting a Worksheet | No | No | Yes | No | Yes | No
Search Catalog | No | No | Yes | No | Yes | No

Target Audience

This article is intended for Intelligent Data Lake administrators and users who are familiar with configuring Hadoop and with the features of Informatica Big Data Management and the Catalog Service.

Intelligent Data Lake Sizing Recommendations

Based on data volume, operational volume, and the number of users, you must add memory and CPU cores to tune the performance of Intelligent Data Lake. There are three deployment sizes, based on data volume, operational volume, and the number of concurrent users.

Small Deployment
The following table lists the size of a small deployment in terms of data volume, operational volume, and the supported number of concurrent users:

Number of Catalog Objects | Data Volume | Operational Data Volume | Number of Concurrent Users
1 million | 10 TB | 100 GB | 5

Medium Deployment
The following table lists the size of a medium deployment in terms of data volume, operational volume, and the supported number of concurrent users:

Number of Catalog Objects | Data Volume | Operational Data Volume | Number of Concurrent Users
20 million | 100 TB | 1 TB | 10

Large Deployment
The following table lists the size of a large deployment in terms of data volume, operational volume, and the supported number of concurrent users:

Number of Catalog Objects | Data Volume | Operational Data Volume | Number of Concurrent Users
50 million | 1 PB | 10 TB | 50

Sizing Recommendations for the Deployments
The recommendations in this article assume that jobs run with 8 map waves (that is, data equivalent to the operational volume is processed in about 8 to 10 minutes for a simple formula) at the operational volume specified in the deployment definitions, with the specified number of concurrent users.

The following table lists the high-level resource recommendations for the Informatica servers and the Hadoop cluster for each deployment size:

IDL Requirements | Informatica Server CPU Cores | Informatica Server Memory | Hadoop Cluster (YARN) CPU Cores | Hadoop Cluster (YARN) Memory
Small Deployment | ~24 | ~16 GB | ~120 | ~90 GB
Medium Deployment | ~44 | ~24 GB | ~600 | ~700 GB
Large Deployment | ~102 | ~72 GB | ~5240 | ~5.4 TB

Note:
1. The table assumes that profiling is not done concurrently.
2. The Hadoop cluster CPU core and memory requirements include the Catalog Service requirements.
3. The Catalog Service and the Data Preparation Service can be provisioned on a different node in the domain if the primary gateway node does not have enough CPU cores available.
4. The Catalog Service and the Data Preparation Service require 35% to 45% of the total CPU cores recommended for the Informatica server in the table above, across the various deployment options.

The following table shows the system-level requirements for all deployments:

IDL Requirements | Informatica Server File Descriptors | Informatica Server Disk Space | Hadoop Cluster File Descriptors per Node | Hadoop Cluster Temp Space
All Deployments | 15K | At least 72 GB | 64K | Minimum: 10 GB, Recommended: 30 GB

Tune the Hardware

You can tune the following hardware parameters to optimize performance:
- CPU frequency
- NIC card ring buffer size

CPU Frequency
Dynamic frequency scaling allows the processor frequency to be adjusted on the fly, either to save power or to reduce heat. Ensure that the CPU operates at least at the base frequency. When CPUs are underclocked, that is, when they run below the base frequency, performance degrades by 30% to 40%. Informatica recommends that you work with your IT system administrator to ensure that all the nodes in the cluster are configured to run at least at their supported base frequency. To tune the CPU frequency for Intel multi-core processors, perform the following steps:

1. Run the lscpu command to determine the current CPU frequency, the base CPU frequency, and the maximum CPU frequency that the processor supports.

2. Request your system administrator to perform the following tasks:
   a. Increase the CPU frequency at least to the supported base frequency.
   b. Change the power management setting to OS Control at the BIOS level.
3. Run CPU-intensive tests to monitor the CPU frequency in real time and adjust the frequency for improved performance. On Red Hat operating systems, you can install a monitoring tool such as cpupower.
4. Work with your IT department to ensure that the CPU frequency and power management settings persist across future system reboots.

NIC Card Ring Buffer Size
NIC configuration is a key factor in network performance tuning. When you deal with large volumes of data, it is crucial that you tune the Receive (RX) and Transmit (TX) ring buffer sizes. The ring buffers contain descriptors, or pointers, to the socket kernel buffers that hold the actual packet data. Run the ethtool command to determine the current configuration. For example:

# ethtool -g eth0

The following sample output shows the ring parameters for eth0:

Ring parameters for eth0:
Pre-set maximums:
RX:        2040
RX Mini:   0
RX Jumbo:  8160
TX:        255
Current hardware settings:
RX:        255
RX Mini:   0
RX Jumbo:  0
TX:        255

The Pre-set maximums section shows the maximum value that you can set for each parameter. The Current hardware settings section shows the current configuration. A low buffer size leads to low latency, but low latency comes at the cost of throughput. For higher throughput, configure large RX and TX ring buffer sizes. Informatica recommends that you use the ethtool command to determine the current hardware settings and the maximum supported values, and then set the values based on the maximums supported on your system. For example, if the maximum supported value for RX is 2040, run the following command to set the RX value to 2040:

# ethtool -G eth0 rx 2040

If you set a low ring buffer size for data transfer, packets might get dropped. To find out whether packets were dropped, use the netstat and ifconfig commands.
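For example, the following commands report per-interface drop counters. This is a sketch only; the interface name eth0 is an assumption, so substitute the interface that carries your cluster traffic. The ethtool -S statistics are an addition to the netstat and ifconfig checks described here and expose driver-level drop counters.

# netstat -i
# ifconfig eth0
# ethtool -S eth0 | grep -i drop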

In the netstat output, the RX-DRP column indicates the number of packets that were dropped. Set the RX value such that no packets get dropped and the RX-DRP column shows a value of 0. In the ifconfig output, the status messages indicate the number of packets that were dropped.

Tune the Hadoop Cluster

You can tune the following Hadoop cluster level areas to optimize performance:
- Hard disk
- Transparent huge pages
- HDFS block size
- YARN settings for parallel jobs

Hard Disk Recommendations
Hadoop workloads are composite: they demand multiple resources, such as CPU, memory, disk I/O, and network I/O. Disk performance plays a critical role in overall Hadoop job performance. Consider the following recommendations:
1. Use EXT4 or XFS file systems for the directories that the cluster uses.
2. Use SAS disks with 15K RPM for best performance.

Transparent Huge Page Compaction
Linux has a feature called transparent huge page compaction that impacts the performance of Hadoop workloads. Informatica recommends that you disable transparent huge page compaction. For more information about disabling it, see KB article 147609.

HDFS Block Size Recommendations
Set the HDFS block size based on your requirements. The dfs.block.size parameter in hdfs-site.xml is the file system block size for data stored in HDFS. The default block size is 128 MB. Increasing or decreasing the block size affects parallelism and resource contention when you run MapReduce tasks. You can set the block size to 256 MB on a medium-sized cluster (up to 40 nodes) and to a smaller value on a larger cluster. Tune this value after experimenting with your own workloads.

YARN Settings Recommendations for Parallel Jobs
You can fine-tune the YARN settings to improve the performance of Intelligent Data Lake. YARN (Yet Another Resource Negotiator) splits resource management and job scheduling and monitoring into separate services. The number of containers that the YARN node manager can run depends on the memory size, the number of CPU cores, the number of physical disks, and the type of tasks. It is recommended not to let the number of parallel containers exceed the minimum of four times the number of physical disks and the number of physical cores. You can modify the following parameters to ensure that a Hadoop node can allocate that many parallel containers; a configuration sketch follows the note at the end of this section.

yarn.nodemanager.resource.memory-mb - The amount of physical memory, in MB, that can be allocated for containers. It is recommended to reserve some memory for other processes that run on the node.
yarn.nodemanager.resource.cpu-vcores - The number of CPU cores that can be allocated for containers. It is recommended to set this value to the number of physical cores available on the node.
yarn.nodemanager.pmem-check-enabled - Informatica recommends that you disable the physical memory check by setting this parameter to false.
yarn.nodemanager.vmem-check-enabled - Informatica recommends that you disable the virtual memory check by setting this parameter to false.

Note: Consult your Hadoop administrator before you change these settings. These recommendations are based on internal tests and might differ from the Hadoop vendor's recommendations.
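The following yarn-site.xml fragment is a minimal sketch of the settings described above. The memory and vcore values assume, for illustration only, a worker node with 16 physical cores and 64 GB of RAM with some memory reserved for the operating system and other daemons; substitute values that match your own nodes.

<!-- yarn-site.xml: illustrative values only, size them to your own nodes -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>57344</value>  <!-- about 56 GB of a 64 GB node, leaving memory for other processes -->
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>  <!-- number of physical cores on the node -->
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>  <!-- disable the physical memory check, as recommended above -->
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>  <!-- disable the virtual memory check, as recommended above -->
</property>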

Tune the Services

After you tune the hardware and the Hadoop cluster, tune the different services and their parameters for optimum performance. You can tune the following areas:
- Intelligent Data Lake Service parameters
- Intelligent Data Lake Service properties
- Catalog Service properties
- Data Preparation Service properties
- Model Repository Service properties

Intelligent Data Lake Service Parameters
The parameters in the hive-site.xml file associated with the Data Integration Service impact the performance of Intelligent Data Lake file upload and publish operations. You can tune the following parameters for optimum performance; a sample hive-site.xml sketch with the recommended values appears after the Hive Table Storage Format description below.

mapred.compress.map.output - Determines whether the map phase output is compressed. The default value is false. Informatica recommends that you set this parameter to true for better performance.
mapred.map.output.compression.codec - Specifies the compression codec used for map output compression. The default value is org.apache.hadoop.io.compress.DefaultCodec. The Snappy codec (org.apache.hadoop.io.compress.SnappyCodec) is recommended for better performance.
mapred.map.tasks.speculative.execution - Specifies whether map tasks can be speculatively executed. The default value is true. With speculative execution, duplicate tasks are spawned for tasks that are not making much progress; the task (original or speculative) that completes first is kept and the other is killed. It is recommended to keep this parameter enabled for better performance.
mapred.reduce.tasks.speculative.execution - Specifies whether reduce tasks can be speculatively executed. The default value is true. This parameter is similar in function to map task speculative execution. It is recommended to disable it for better performance by setting it to false.
hive.exec.compress.intermediate - Determines whether the results of intermediate map/reduce jobs in a Hive query execution are compressed. The default value is false. Do not confuse this parameter with mapred.compress.map.output, which controls compression of the map task output. It is recommended to enable this parameter by setting it to true. It uses the codec specified by mapred.map.output.compression.codec, and SnappyCodec is recommended for better performance.

Intelligent Data Lake Service Properties
You can fine-tune properties such as Hive Table Storage Format and Export Options for better performance.

Hive Table Storage Format
On the Data Lake Options tab of the Intelligent Data Lake Service properties, change the value of Hive Table Storage Format. Informatica recommends that you use a columnar storage format such as ORC or Parquet for better storage efficiency (less space consumption) and improved performance, because operations on such tables are faster.
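The recommendations under Intelligent Data Lake Service Parameters above translate into hive-site.xml entries such as the following. This is a sketch of the recommended values only, not a complete hive-site.xml; apply the changes to the hive-site.xml associated with the Data Integration Service.

<!-- hive-site.xml: recommended values from the parameter descriptions above -->
<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>  <!-- compress map output -->
</property>
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>  <!-- Snappy for map output and intermediate compression -->
</property>
<property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>  <!-- keep map-side speculative execution enabled -->
</property>
<property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>  <!-- disable reduce-side speculative execution -->
</property>
<property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>  <!-- compress intermediate results between Hive stages -->
</property>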

Export Options
In the Number of Rows to Export field, enter the number of rows to export. This value impacts the performance of exporting a data asset to the local machine that runs the browser. Set this value carefully to avoid downloading large files and long export times. You configure this value on the Export Options tab.

Data Asset Recommendation Options
Recommendation is an Intelligent Data Lake feature that suggests alternate and additional data assets for a project, based on the existing data assets, to improve productivity. By default, the number of recommendations to display is set to 10. Recommendations involve requests to the Catalog Service, so set the number of recommendations carefully. You configure this value on the Data Asset Recommendation Options tab.

Sampling Options
The Data Preparation Sample Size property determines the number of sample rows that are fetched for data preparation. The default value is 50,000 rows. The minimum supported value is 1,000 and the maximum supported value is one million. The value of this property affects performance and the user experience when the data preparation page is loaded or refreshed. Although rows are loaded asynchronously after the first 10,000 rows are fetched, operations can be performed only after all the rows are fetched. Setting the sample size to a higher value can result in higher memory usage for the Data Preparation Service. You configure this value on the Sampling Options tab.

Logging Options
The Log Severity property defines the severity level for service logs. The supported levels are FATAL, ERROR, WARNING, INFO, TRACE, and DEBUG. The default value is INFO. For performance reasons, it is recommended to set this property to INFO or a higher severity level. You configure this property on the Logging Options tab.

Catalog Service Properties
Intelligent Data Lake uses the Catalog Service to catalog and ingest the data lake table metadata and to serve search results. Properly tuning the Catalog Service is critical for Intelligent Data Lake Service performance.

Live Data Map LoadType Property
The resources that Live Data Map demands from the Hadoop cluster, and its performance, depend on the load type, which you can set with the custom property CustomOptions.loadType. The parameters for the different load types can be found in Appendix A. It is recommended to always specify the load type explicitly rather than leave it at the default. The recommended values are as follows:
- Low. One-node setup with up to one million catalog objects.
- Medium. Three-node setup with up to 20 million catalog objects.
- High. Six-node setup with up to 50 million catalog objects.
The number of objects is the sum of the number of tables and the number of columns in all the tables. You set the property on the Custom Properties tab of the Catalog Service.

Scanner Configuration
You must configure the Hive resource scanner to ingest the Hive tables that are catalogued and searched in the Catalog Service for use in Intelligent Data Lake, and you must configure the scanner for better performance. You can configure the scanner to extract all schemas or a particular schema. Select a specific schema for better performance. You can also configure the memory that the scanner process consumes. Set this value based on the total number of columns that need to be ingested. The recommended scanner memory settings are as follows:
- Low. Up to one million columns.
- Medium. Up to four million columns.
- High. Up to 12 million columns.
You set the scanner memory values in the Advanced Properties section of the scanner configuration.

Note: Incremental scanning is not supported in Intelligent Data Lake 10.1. You must consider the number of columns in all the tables when you configure the scanner memory.

Solr Tuning Parameters
These parameters include the Apache Solr Slider app master properties and the Catalog Service custom options for the Apache Solr node.

Apache Solr Slider App Master Properties
The following parameters tune the performance of search in the Catalog Service:
- jvm.heapsize. Memory used by the Slider app master.
- yarn.component.instances (2). Number of instances of each component for Slider. This parameter specifies the number of master servers that are run.
- yarn.memory. Amount of memory allocated for the container that hosts the master server.
- yarn.vcores (1). Number of cores allocated for the master server.

(1) For external clusters, when you increase the value of this parameter, it is recommended that you also increase the maximum number of cores that YARN allows for a container.
(2) Before you increase this parameter, you must add the required number of nodes to the cluster.

Catalog Service Custom Options for Apache Solr Node Properties
- xmx_val (*). Solr maximum heap size.
- xms_val. Solr minimum heap size.
- yarn.component.instances (1). Number of instances of each component for Slider. This parameter specifies the number of master servers that are run.
- yarn.memory. Amount of memory allocated for the container that hosts the master server.
- yarn.vcores (2). Number of cores allocated for the master server.

(*) When you increase the value of this parameter, it is recommended that you also increase the maximum memory allocation that YARN allows for a container. Failing to increase the memory allocation might result in YARN shutting down the applications. When you increase the memory configuration of any component, for example ingestion, keep a buffer of 30% over the actual memory required for the component. For example, if a component requires 100 MB of memory, increase the memory configuration to 130 MB for that component.
(1) Before you increase this parameter, you must add the required number of nodes to the cluster.
(2) For external clusters, when you increase the value of this parameter, it is recommended that you also increase the maximum number of cores that YARN allows for a container.

Tuning for Profiling Performance
Tuning the profiling options is important because the profile jobs run on the same cluster as Intelligent Data Lake.

Native and Blaze Mode
For Catalog Service and Intelligent Data Lake 10.1, profiling with the Hive engine is recommended over the Blaze engine for performance reasons.

Incremental Profiling
For Catalog Service and Intelligent Data Lake 10.1, incremental profiling is not supported for Hive data sources. Do not enable incremental profiling for Hive data sources in Intelligent Data Lake.

Profiling Performance Parameters
You can tune the following parameters to improve profiling performance:

Profiling Warehouse Database - The connection name of the profiling warehouse database. In addition to the profile results, the profiling warehouse holds the persisted profile job queue. Verify that no profile job runs when you change the connection name; otherwise, the profile jobs might stop running because the profile jobs run on the Data Integration Service where the Profiling Service Module submitted them. You set the default value when you create the instance.
Maximum Profile Execution Pool Size - The number of profile mappings that the Profiling Service Module can run concurrently when the Data Integration Service runs on a single node or on a grid. The pool size depends on the aggregate processing capacity of the Data Integration Service, which you specify for each node on the Processes tab of the Administrator tool. The pool size cannot be greater than the sum of the processing capacity of all nodes. When you plan a deployment, consider the threads used for profile tasks. It is important to understand the mixture of mappings and profile jobs so that you can configure the Maximum Execution Pool Size parameter. Default is 10.
Maximum Execution Pool Size - The maximum number of requests that the Data Integration Service can run concurrently. Requests include data previews, mappings, and profiling jobs. This parameter has an impact on the Data Integration Service.

Maximum Concurrent Columns - The number of columns that a mapping runs in parallel. The default value of 5 is optimal for most profiling use cases. You can increase the value for columns with lower-than-average cardinality and decrease it for columns with higher-than-average cardinality. You might also want to decrease this value when you consistently run profiles on large source files and temporary disk space is low. Default is 5.
Maximum Column Heap Size - The cache size for each column profile mapping for flat files. You can increase this value to prevent the Data Integration Service from writing parts of the intermediate profile results to temporary disk, although this effect does not apply to large data sources. The default setting is optimal for most profiling use cases. In Live Data Map, profiling is done on a lower data volume, for example in batches of 10,000 rows. To avoid creating multiple mappings and to prevent an impact on compilation performance, you can combine multiple columns in a single mapping. It is recommended to set Maximum Column Heap Size to 512 to avoid temporary disk usage with combined mappings. Default is 64.

Note: See the Informatica How-To Library article "Tuning Live Data Map Performance" for more information about tuning profile jobs.

Running Profile Jobs for IDL
Profiling jobs on Hive tables also require Hadoop cluster resources through YARN and compete with other Intelligent Data Lake jobs such as publish and file upload. This is critical in Catalog Service and Intelligent Data Lake 10.1 because incremental scanning is not supported for Hive tables and incremental profiling is not supported for Hive sources, so all tables are profiled every time the scan runs. Informatica recommends that you run scans with a profile configuration during off-business hours or at times of zero or reduced cluster demand, rather than with every scan. If profiling must run with every scan, submit the profile jobs to a profile-specific YARN queue with restricted capacity, as in the sketch that follows.
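One way to restrict the capacity available to profile jobs is a dedicated YARN queue. The following capacity-scheduler.xml fragment is a minimal sketch that assumes the cluster uses the YARN Capacity Scheduler and adds a hypothetical queue named profiling capped at 20% of cluster capacity; the queue name and percentages are illustrative only, and clusters that use the Fair Scheduler would configure an equivalent pool instead.

<!-- capacity-scheduler.xml: illustrative queue split, adjust names and percentages -->
<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,profiling</value>  <!-- add a profiling queue next to the default queue -->
</property>
<property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>80</value>  <!-- capacities of the queues under root must add up to 100 -->
</property>
<property>
    <name>yarn.scheduler.capacity.root.profiling.capacity</name>
    <value>20</value>  <!-- guaranteed share for profile jobs -->
</property>
<property>
    <name>yarn.scheduler.capacity.root.profiling.maximum-capacity</name>
    <value>30</value>  <!-- cap how far profile jobs can grow into idle capacity -->
</property>

Profile jobs submitted to such a queue compete only for its restricted share, so publish and file upload jobs are not starved of cluster resources.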

Data Preparation Service Properties
Change the Data Preparation storage options and heap size for optimum performance.

Local and Durable Storage Disk Space Requirements
Data Preparation, by design, stores a copy of the worksheet in local storage and in durable storage whenever a worksheet is opened or refreshed from its source. The default sample size is 50,000 rows and the default IDL file upload limit is 100 MB, so the sample copy of the worksheet in local or durable storage is the smaller of 50,000 rows and 100 MB. A cleanup thread removes all local store data when the user session timeout is reached. In durable storage, data is stored and replicated in HDFS. The Data Preparation Service keeps three copies of opened or refreshed worksheets in both local and durable storage. Configure a fast disk with enough storage space for the Local Storage Location option on the Data Preparation Storage Options tab.

Use the following guideline to estimate the disk space requirements for local and durable storage:

(size of the opened data preparation worksheet, that is, the size of 50,000 rows if the default sample size is used) x (number of concurrent users working on data preparation) x 3 (the Data Preparation Service keeps three copies of the worksheet in local and durable storage) + 2 GB of additional disk space

For example, if a worksheet sample is about 100 MB and 10 users prepare data concurrently, the estimate is 100 MB x 10 x 3 + 2 GB, or about 5 GB.

Heap Size Recommendations
The default maximum heap size is 8 GB. Set the maximum heap size to a higher value for larger concurrency, allocating 512 MB per user. For 20 users, for example, increase the maximum heap size to 10 GB. You set the maximum heap size on the Advanced Options tab.

Data Preparation Repository Options
The Data Preparation Service uses a MySQL database to persist all recipe and mapping metadata. The supported database is MySQL 5.6.x. The CPU and memory requirements for the MySQL database are as follows:
- Small deployment. Minimum 1 CPU core and 2 GB memory.
- Medium deployment. Minimum 2 CPU cores and 4 GB memory.
- Large deployment. Minimum 4 CPU cores and 8 GB memory.
The disk space requirements for the MySQL database are as follows:
- Small deployment. 50 GB.
- Medium deployment. 100 GB.
- Large deployment. 200 GB.

Model Repository Service Properties
You can fine-tune the Model Repository Service JVM heap size based on the number of concurrent users:

Number of Concurrent Users | MRS JVM Heap Size
Single user | 1 GB
Fewer than 10 users | 2 GB
Fewer than 50 users | 4 GB
More than 50 users | 8 GB

Catalog Service Sizing Recommendations
Based on the size of the data set, you must add memory and CPU cores to tune the performance of Live Data Map. Also note the minimum number of nodes required to deploy each supported data set size.

Note: Each node in the recommendations in the following sections requires 32 logical cores and 64 GB of available memory. It is recommended that each node have a minimum of four physical SATA hard disks.

Small Data Set
You can deploy a small data set on a single node. The following table lists the size of a small data set and the recommended number of CPU cores and memory:

Number of Catalog Objects | Memory | Number of CPU Cores | Minimum Number of Hadoop Nodes Required
One million | 64 GB | 32 | 1

Medium Data Set
You can deploy a medium data set on a minimum of three nodes. The following table lists the size of a medium data set and the recommended number of CPU cores and memory:

Number of Catalog Objects | Memory | Number of CPU Cores | Minimum Number of Hadoop Nodes Required
20 million | 192 GB | 96 | 3

Large Data Set
You can deploy a large data set on a minimum of six nodes. The following table lists the size of a large data set and the recommended number of CPU cores and memory:

Number of Catalog Objects | Memory | Number of CPU Cores | Minimum Number of Hadoop Nodes Required
50 million | 384 GB | 192 | 6

Default Data Set
A default data set is smaller than a small data set. You can use a configuration lower than the small data set configuration to process default data sets.

Demo Data Set
You can deploy a demo data set on a single node. Use the demo data set for product demonstrations with a feature-restricted version of Live Data Map.

Metadata Extraction Scanner Memory Parameters
Depending on the size of the data set, use the scanner memory parameters to configure the memory that the scanner requires to extract metadata. The Memory column in the following table lists the default value configured for the scanner for each data set size:

Parameter | Data Set Size | Memory
LdmCustomOptions.scanner.memory.low | Small | 1024 MB
LdmCustomOptions.scanner.memory.medium | Medium | 4096 MB
LdmCustomOptions.scanner.memory.high | Large | 12288 MB

Authors
Anand Sridharan, Staff Performance Engineer
Mohammed Morshed, Principal Performance Engineer
Chakravarthy Tenneti, Lead Technical Writer