Tuning Intelligent Data Lake Performance

Copyright Informatica LLC. Informatica, the Informatica logo, Intelligent Data Lake, Big Data Management, and Live Data Map are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at

Abstract

You can tune Intelligent Data Lake for better performance. You can tune parameters for data ingestion, data asset preview, adding data assets to projects, project management, project publication, data asset search, data asset export, and data profiling. This article lists the parameters that you can tune in Intelligent Data Lake and the steps that you must perform to configure them.

Supported Versions
Intelligent Data Lake

Table of Contents
Overview
Target Audience
Intelligent Data Lake Sizing Recommendations
Sizing Recommendations for Each Deployment
Tune the Hardware
CPU Frequency
NIC Card Ring Buffer Size
Tune the Hadoop Cluster
Hard Disk Recommendations
Transparent Huge Page Compaction
HDFS Block Size Recommendations
YARN Settings Recommendations for Parallel Jobs
Tune the Services
Intelligent Data Lake Service Parameters
Intelligent Data Lake Service Properties
Catalog Service Properties
Solr Tuning Parameters
Tuning for Profiling Performance
Data Preparation Service Properties
Model Repository Service Properties
Data Integration Service Properties
Catalog Service Sizing Recommendations
Metadata Extraction Scanner Memory Parameters

Overview

Tuning Intelligent Data Lake for better performance includes tuning performance parameters at different stages: data ingestion, file upload, data asset preview, project creation, addition of data assets to projects, worksheet preparation, publishing, exporting, and Catalog Service operations. Each operation involves several hardware and software components, such as the Intelligent Data Lake Service, Catalog Service, Data Preparation Service, Data Integration Service, Hive Server, and Hadoop processes such as the resource manager and node manager.

You can optimize the performance of Intelligent Data Lake by tuning the following areas:
1. Hardware
2. Hadoop cluster parameters
3. Application services in the domain

Improving profiling performance includes tuning parameters for the Data Integration Service and the profiling warehouse database properties.

The following list shows the Intelligent Data Lake operations and the services that each operation affects:
- Import Data Asset: Model Repository Service, Data Integration Service, Catalog Service, Hadoop cluster, Intelligent Data Lake Service.
- File Upload: Data Integration Service, Catalog Service, Hadoop cluster, Intelligent Data Lake Service.
- Data Asset Preview: Catalog Service, Intelligent Data Lake Service.
- Project Management: Model Repository Service, Intelligent Data Lake Service.
- Addition of Data Asset to a Project: Model Repository Service, Intelligent Data Lake Service.
- Worksheet Preparation: Model Repository Service, Intelligent Data Lake Service, Data Preparation Service.
- Worksheet Publication: Model Repository Service (mapplets and mappings are saved in the repository), Data Integration Service (Big Data Management mapping execution), Catalog Service (enrichment), Hadoop cluster, Intelligent Data Lake Service, Data Preparation Service (recipe to mapplet).
- Export Data Asset: Model Repository Service, Data Integration Service, Catalog Service, Hadoop cluster, Intelligent Data Lake Service.
- Search Catalog: Catalog Service, Intelligent Data Lake Service.

The following list shows the Intelligent Data Lake operations and the execution engines that can run them:
- Worksheet Publication: MapReduce version 2, Blaze
- Random Sampling: MapReduce version 2, Spark
- File Upload: MapReduce version 2
- Import Data Asset: MapReduce version 2, Blaze
- Export Data Asset: MapReduce version 2, Blaze

Target Audience

This article is intended for Intelligent Data Lake administrators and users who are familiar with configuring Hadoop and with the features of Informatica Big Data Management and the Catalog Service.

Intelligent Data Lake Sizing Recommendations

Based on the data volume, operational volume, and number of users, add memory and CPU cores to tune the performance of Intelligent Data Lake. There are three deployment sizes, based on factors such as data volume, operational volume, and the number of concurrent users.

Small deployment
- Number of catalog objects: 1 million
- Data volume: 10 TB
- Operational data volume: 100 GB
- Number of concurrent users: 5

Medium deployment
- Number of catalog objects: 20 million
- Data volume: 100 TB
- Operational data volume: 1 TB
- Number of concurrent users: 10

Large deployment
- Number of catalog objects: 50 million
- Data volume: 1 PB
- Operational data volume: 10 TB
- Number of concurrent users: 50

Sizing Recommendations for Each Deployment

This article assumes that jobs run with eight map waves when they process the operational volume given in the deployment definitions with the specified number of concurrent users.

Small deployment
- CPU cores used by Informatica services: 24
- Memory used by Informatica services: 16 GB
- File descriptors for Informatica services: 15 K
- Disk space used by Informatica services: at least 72 GB
- CPU cores used by the Hadoop cluster (YARN): (1 x number of nodes in the Hadoop cluster) + 1
- Memory used by the Hadoop cluster (YARN): 90 GB + (4 GB x number of nodes in the Hadoop cluster)
- File descriptors per node for the Hadoop cluster (YARN): 64 K
- Temp space used by the Hadoop cluster (YARN): minimum 10 GB, recommended 30 GB

Medium deployment
- CPU cores used by Informatica services: 44
- Memory used by Informatica services: 24 GB
- File descriptors for Informatica services: 15 K
- Disk space used by Informatica services: at least 72 GB
- CPU cores used by the Hadoop cluster (YARN): (1 x number of nodes in the Hadoop cluster) + 1
- Memory used by the Hadoop cluster (YARN): 700 GB + (4 GB x number of nodes in the Hadoop cluster)
- File descriptors per node for the Hadoop cluster (YARN): 64 K
- Temp space used by the Hadoop cluster (YARN): minimum 10 GB, recommended 30 GB

Large deployment
- CPU cores used by Informatica services: 102
- Memory used by Informatica services: 72 GB
- File descriptors for Informatica services: 15 K
- Disk space used by Informatica services: at least 72 GB
- CPU cores used by the Hadoop cluster (YARN): (1 x number of nodes in the Hadoop cluster) + 1
- Memory used by the Hadoop cluster (YARN): 5.4 TB + (4 GB x number of nodes in the Hadoop cluster)
- File descriptors per node for the Hadoop cluster (YARN): 64 K
- Temp space used by the Hadoop cluster (YARN): minimum 10 GB, recommended 30 GB

Consider the following points about the preceding recommendations:
- The recommendations assume that profiling does not run concurrently.
- The Hadoop cluster CPU core and memory requirements include the Catalog Service requirements.
- The Blaze Grid Manager uses one core and 4 GB of memory per node. For example, on a cluster with 14 nodes, the Blaze Grid Manager uses 14 cores and 14 x 4 = 56 GB of memory. The Grid Monitor uses one core and 1 GB of memory across the cluster.
- The Catalog Service and Data Preparation Service can be provisioned on a different node in the domain if the primary gateway node does not have enough CPU cores available. The Catalog Service and Data Preparation Service require 35-45% of the total CPU cores recommended for Informatica services in the preceding lists.
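The file descriptor recommendations above are enforced through operating system limits. The following commands are a minimal sketch, assuming a Linux node; the user name and values are examples only, and the exact mechanism (limits.conf or systemd unit settings) depends on your distribution, so work with your system administrator.

# Check the current soft and hard limits for open file descriptors for the current user.
ulimit -Sn
ulimit -Hn

To raise the limit persistently, an administrator can, for example, add soft and hard nofile entries of 64000 for the Hadoop service user in /etc/security/limits.conf and then start a new login session.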

Tune the Hardware

You can tune the following hardware parameters to optimize performance:
- CPU frequency
- NIC card ring buffer size

CPU Frequency

Dynamic frequency scaling allows the processor frequency to be adjusted dynamically, either to save power or to reduce heat. Ensure that the CPUs operate at least at their base frequency. When CPUs run below the base frequency, performance can degrade by 30% to 40%.

Tuning tip: Informatica recommends that you work with your IT system administrator to ensure that all the nodes in the cluster are configured to run at least at their supported base frequency.

To tune the CPU frequency for Intel multi-core processors, perform the following steps:
1. Run the lscpu command to determine the current CPU frequency, the base CPU frequency, and the maximum CPU frequency that the processor supports.
2. Ask your system administrator to perform the following tasks:
   a. Increase the CPU frequency at least to the supported base frequency.
   b. Change the power management setting to OS Control at the BIOS level.
3. Run CPU-intensive tests to monitor the CPU frequency in real time and adjust the frequency for improved performance. On Red Hat Enterprise Linux operating systems, you can install a monitoring tool such as cpupower.
4. Work with your IT department to ensure that the CPU frequency and power management settings persist across future system reboots.
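The following commands are a minimal sketch of steps 1 and 3 above. They assume a Linux node with the cpupower utility installed; the output fields vary by distribution and processor model.

# Show the current, minimum, and maximum CPU frequencies reported by the processor.
lscpu | grep -i 'mhz'
# Show the active frequency governor and the current hardware frequency limits.
cpupower frequency-info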

NIC Card Ring Buffer Size

NIC configuration is a key factor in network performance tuning. When you process large volumes of data, tune the receive (RX) and transmit (TX) ring buffer sizes. The ring buffers contain descriptors, or pointers, to the socket kernel buffers that hold the packet data.

Run the ethtool command to determine the current configuration. For example, run the following command:
# ethtool -g eth0

The following sample output shows the result:
Ring parameters for eth0:
Pre-set maximums:
RX: 2040
RX Mini: 0
RX Jumbo: 8160
TX: 255
Current hardware settings:
RX: 255
RX Mini: 0
RX Jumbo: 0
TX: 255

The Pre-set maximums section shows the maximum values that you can set for each parameter. The Current hardware settings section shows the current configuration.

A low buffer size leads to low latency, but low latency comes at the cost of throughput. For greater throughput, configure large RX and TX ring buffer sizes. Informatica recommends that you use the ethtool command to determine the current hardware settings and the maximum supported values, and then set values based on the maximums that the hardware and operating system support. For example, if the maximum supported value for RX is 2040, run the following command to set the RX value to 2040:
# ethtool -G eth0 rx 2040

If you set a low ring buffer size, packets might get dropped during data transfer. To find out whether packets were dropped, use the netstat and ifconfig commands. In the netstat output, the RX-DRP column indicates the number of packets dropped; set the RX value so that the RX-DRP column shows 0. In the ifconfig output, the status lines show the number of packets that were dropped for each interface.
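The following commands are a minimal sketch of how to check for dropped packets, assuming an interface named eth0; on newer distributions, ip -s link is available in place of ifconfig.

# Per-interface statistics; the RX-DRP column shows dropped receive packets.
netstat -i
# Interface statistics, including dropped packet counters.
ifconfig eth0
ip -s link show eth0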

Tune the Hadoop Cluster

You can tune the following Hadoop cluster areas to optimize performance:
- Hard disk
- Transparent huge pages
- HDFS block size
- YARN settings for parallel jobs

Hard Disk Recommendations

Hadoop workloads are composite: they demand multiple resources such as CPU, memory, disk I/O, and network I/O. Disk performance plays a critical role in overall Hadoop job performance. Consider the following recommendations to improve Hadoop job performance:
- Use EXT4 or XFS file systems for the directories that the cluster uses.
- Use SAS disks with 15K RPM for best performance.

Transparent Huge Page Compaction

Linux transparent huge page compaction can impact the performance of Hadoop workloads. Informatica recommends that you disable transparent huge page compaction. For more information about disabling this feature, see the Informatica Knowledge Base article.

HDFS Block Size Recommendations

Set the HDFS block size based on your requirements. The dfs.block.size parameter in the hdfs-site.xml file defines the file system block size for data stored in HDFS. The default block size is 128 MB. Increasing or decreasing the block size affects parallelism and resource contention when you run MapReduce tasks. You can set the block size to 256 MB on a medium-sized cluster with up to 40 nodes, and to a smaller value on a larger cluster. Tune the dfs.block.size value after experimenting with your own workload.
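As a hedged illustration of the block size recommendation, the following commands show how to check the configured default and how to write a single data set with a larger block size without changing the cluster-wide setting. Newer Hadoop releases use the key dfs.blocksize; dfs.block.size is the older name for the same setting. The file path is an example only; 268435456 bytes is 256 MB.

# Print the block size that the cluster is currently configured with.
hdfs getconf -confKey dfs.blocksize
# Write one file with a 256 MB block size, overriding the default for this write only.
hdfs dfs -D dfs.blocksize=268435456 -put /tmp/sales.csv /data/lake/sales/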

YARN Settings Recommendations for Parallel Jobs

You can tune the Yet Another Resource Negotiator (YARN) settings to improve the performance of Intelligent Data Lake. YARN splits resource management and job scheduling and monitoring into separate services. The number of containers that a YARN node manager can run depends on the memory size, the number of CPU cores, the number of physical disks, and the type of tasks. Do not let the number of parallel containers exceed the minimum of four times the number of physical disks and the number of physical cores. For example, a node with 8 physical disks and 24 physical cores should run at most min(4 x 8, 24) = 24 parallel containers.

You can tune the following parameters to make sure that each Hadoop node can allocate that many parallel containers:

yarn.nodemanager.resource.memory-mb
  The amount of physical memory, in MB, allocated for containers. Informatica recommends that you reserve some memory for other processes that run on the node.

yarn.nodemanager.resource.cpu-vcores
  The number of CPU cores allocated for containers. Informatica recommends that you set this value to the number of physical cores available on the node.

yarn.nodemanager.pmem-check-enabled
  Informatica recommends that you disable the physical memory check by setting this parameter to false.

yarn.nodemanager.vmem-check-enabled
  Informatica recommends that you disable the virtual memory check by setting this parameter to false.

Note: Consult your Hadoop administrator before you change these settings. These recommendations are based on internal tests and might differ from the Hadoop vendor's recommendations.

Tune the Services

After you tune the hardware and the Hadoop cluster, tune the individual services and their parameters for optimum performance. You can tune the following areas:
- Intelligent Data Lake Service parameters
- Intelligent Data Lake Service properties
- Catalog Service properties
- Data Preparation Service properties
- Model Repository Service properties

Intelligent Data Lake Service Parameters

The parameters in the hive-site.xml file associated with the Data Integration Service affect the performance of Intelligent Data Lake file upload and publish operations. You can tune the following parameters for optimum performance:

mapred.compress.map.output
  Determines whether the map phase output is compressed. The default value is false. Informatica recommends that you set this parameter to true for better performance.

mapred.map.output.compression.codec
  Specifies the compression codec used for map output compression. The default value is org.apache.hadoop.io.compress.DefaultCodec. Informatica recommends that you use org.apache.hadoop.io.compress.SnappyCodec for better performance.

mapred.map.tasks.speculative.execution
  Specifies whether map tasks can be speculatively executed. The default value is true. With speculative execution, duplicate tasks are spawned for tasks that are not making much progress; whichever of the original or speculative task completes first is used and the other is killed. Keep the default value for better performance.

mapred.reduce.tasks.speculative.execution
  Specifies whether reduce tasks can be speculatively executed. The default value is true. This is similar in function to speculative execution of map tasks. Informatica recommends that you disable it by setting the value to false for better performance.

hive.exec.compress.intermediate
  Determines whether the results of intermediate map/reduce jobs in a Hive query are compressed. The default value is false. Do not confuse hive.exec.compress.intermediate with mapred.compress.map.output, which controls compression of map task output. Informatica recommends that you set the value to true. The hive.exec.compress.intermediate parameter uses the codec specified by mapred.map.output.compression.codec; Informatica recommends SnappyCodec for better performance.

Intelligent Data Lake Service Properties

You can tune properties such as the Hive table storage format and the export options in the Intelligent Data Lake Administrator for better performance.

Hive Table Storage Format

In the Data Lake Options section of the Intelligent Data Lake Service properties, set the value of Hive Table Storage Format. Informatica recommends that you use a columnar storage format such as ORC or Parquet for better storage efficiency and improved performance, because operations on such tables are faster.

Export Options

If users download data assets as .csv or .tde files, enter the number of rows to export in the Number of Rows to Export field. The number of rows affects the performance of exporting a data asset to the local machine that runs the browser. To avoid downloading very large files and long export times, set the value of Number of Rows to Export carefully. The Export Options section is on the Properties tab of the Intelligent Data Lake Service.

Data Asset Recommendation Options

Intelligent Data Lake recommends alternate and additional data assets for a project, based on the existing data assets, to improve productivity. By default, the number of recommendations to display is 10. Generating recommendations sends requests to the Catalog Service, so set the number of recommendations carefully. The Data Asset Recommendation Options section is on the Properties tab of the Intelligent Data Lake Service.

Sampling Options

Sample Size

The Data Preparation Sample Size property determines the number of sample rows fetched for data preparation. The default value is 50,000 records. The minimum supported value is 1,000 and the maximum supported value is one million. This property affects performance and the user experience when the data preparation page is loaded or refreshed. Although rows are loaded asynchronously after the first 10,000 rows are fetched, operations can be performed only after all the records are fetched. Setting the sample size to a higher value can result in higher memory usage for the Data Preparation Service.

Random Sampling

Intelligent Data Lake also supports random sampling, where records are fetched randomly for data preparation instead of the first N records. Random sampling spawns a MapReduce or Spark job. Use this option if you have high volumes of data or if the input involves multiple splits. Note that random sampling requires Hadoop cluster resources.

Execution Options for the Hive Execution Engine

Effective in version , you can select the Hive execution engine to perform sampling. When the data volume is more than one map wave, the Spark engine performs better. However, because Intelligent Data Lake does not yet support Spark executor reuse, the initialization cost is higher for smaller volumes of data. Sampling a table that has Hive statistics (row count) spawns a query without a reduce task, so the YARN container demand is lower than when sampling a table without statistics, in addition to the functional benefit of a better quality sample. Update Hive statistics at regular intervals, for example weekly or biweekly, for better sampling performance.
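A minimal sketch of refreshing row-count statistics for a Hive table, assuming a hypothetical table named customer_orders and a placeholder HiveServer2 host; for partitioned tables, include a PARTITION clause. You can schedule a command like this from cron or an equivalent scheduler.

# Gather basic statistics, including row count, so that sampling can avoid the reduce task.
beeline -u jdbc:hive2://hiveserver-host:10000/default -e "ANALYZE TABLE customer_orders COMPUTE STATISTICS;"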

Import Data into the Lake

Effective in version , you can import data into the data lake through Intelligent Data Lake. Submitting an import job in Intelligent Data Lake triggers a Blaze job that persists the data into the lake. The Blaze engine uses Sqoop to connect to the source system and read data by spawning map tasks. The default number of map tasks is 1. You can tune this at the level of the connection that the import job uses by providing the required number of mappers; a higher number of mappers improves import performance. The connection properties are in the Connection Properties section on the Properties tab in the Connections view.

Note: If the table that you import does not have a primary key, you must specify additional arguments such as --split-by.
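As a hedged illustration of how the mapper count and --split-by arguments influence the underlying Sqoop read, the following standalone Sqoop command sketches the equivalent options. The connection string, table, and column names are hypothetical; in Intelligent Data Lake you set these values on the connection rather than running Sqoop directly.

sqoop import \
  --connect jdbc:oracle:thin:@source-host:1521/ORCL \
  --username idl_user -P \
  --table CUSTOMER_ORDERS \
  --split-by ORDER_ID \
  --num-mappers 8 \
  --target-dir /data/lake/customer_orders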

Export Data from the Lake

You cannot use the --num-mappers or -m Sqoop parameters to control the number of map tasks when writing to an external target because of a third-party dependency.

Note: For more information about exporting data from the lake, see "Exporting a Data Asset" in the Informatica Intelligent Data Lake User Guide.

Logging Options

The Log Severity property defines the severity level for service logs. The supported levels are FATAL, ERROR, WARNING, INFO, TRACE, and DEBUG. INFO is the default value. For performance reasons, Informatica recommends that you keep the Log Severity property at INFO or a less verbose level.

Catalog Service Properties

Intelligent Data Lake uses the Catalog Service to catalog and ingest the data lake table metadata and to serve search results. Properly tuning the Catalog Service is critical for Intelligent Data Lake performance.

Live Data Map LoadType Property

The resource demand that Live Data Map places on the Hadoop cluster, and its performance, depend on the load type, which you can set with the custom property CustomOptions.loadType in the Intelligent Data Lake Administrator. For the parameters associated with each load type, see Catalog Service Sizing Recommendations.

Informatica recommends that you always specify the load type and not leave it at the default. The recommended values are as follows:
- Low. One-node setup with up to one million catalog objects.
- Medium. Three-node setup with up to 20 million catalog objects.
- High. Six-node setup with up to 50 million catalog objects.

Objects are counted as the number of tables plus the number of columns in all the tables. You set the property in the Custom Properties section on the Properties tab of the Catalog Service.

Scanner Configuration

To ingest Hive tables so that they can be cataloged and searched in the Catalog Service for use in Intelligent Data Lake, configure a Hive resource in the Live Data Map Administrator and tune the scanner. You can configure the scanner to extract all schemas or a particular schema; select a specific schema for better performance. The schema property is in the Source Metadata section on the Metadata Load Settings tab of the Hive resource.

You can configure the memory that the scanner process consumes in the Advanced Properties section on the Metadata Load Settings tab of the Hive resource. Informatica recommends that you set the memory based on the total number of columns to be ingested:
- Low. Up to one million columns.
- Medium. Up to four million columns.
- High. Up to 12 million columns.

Note: Incremental scanning is not supported in Intelligent Data Lake. You must consider the number of columns in all the tables when you configure the scanner memory.

Solr Tuning Parameters

These parameters include the Apache Solr Slider app master properties and the Catalog Service custom options for the Apache Solr node.

Apache Solr Slider App Master Properties

You can tune the following Apache Solr parameters to improve the performance of search in the Catalog Service:

jvm.heapsize
  Memory used by the Slider app master.

yarn.component.instances
  Number of instances of each component for the Slider. This parameter specifies the number of master servers that run. Before you increase this parameter, add the required number of nodes to the cluster.

yarn.memory
  Amount of memory allocated for the container that hosts the master server.

yarn.vcores
  Number of cores allocated for the master server. For external clusters, when you increase this value, Informatica recommends that you also increase the maximum number of cores that YARN allows for a container.

Catalog Service Custom Options for the Apache Solr Node

You can tune the following Catalog Service custom options for the Apache Solr node to improve the performance of search in the Catalog Service:

xmx_val
  Solr maximum heap. When you increase this value, Informatica recommends that you also increase the maximum memory allocation that YARN allows for a container. Failing to increase the memory allocation might result in YARN shutting down the applications. When you increase the memory configuration of any component, for example ingestion, keep a buffer of 30% over the actual memory that the component requires. For example, if a component requires 100 MB of memory, increase the memory configuration to 130 MB for that component.

xms_val
  Solr minimum heap.

yarn.component.instances
  Number of instances of each component for the Slider. This parameter specifies the number of master servers that run. Before you increase this parameter, add the required number of nodes to the cluster.

yarn.memory
  Amount of memory allocated for the container that hosts the master server.

yarn.vcores
  Number of cores allocated for the master server. For external clusters, when you increase this value, Informatica recommends that you also increase the maximum number of cores that YARN allows for a container.

Tuning for Profiling Performance

Tuning the profiling options is important because profile jobs run on the same cluster as Intelligent Data Lake.

Native and Blaze Mode

For the Catalog Service and Intelligent Data Lake , Informatica recommends that you run profiling on the Blaze engine for better performance.

Incremental Profiling

For the Catalog Service and Intelligent Data Lake , incremental profiling is not supported for Hive data sources: incremental changes in data remain undetected, and only new tables can be detected. Do not enable incremental profiling for a Hive data source in Intelligent Data Lake if changes in data need to be detected by incremental scans.

Profiling Performance Parameters

You can tune the following parameters to improve profiling performance:

Profiling Warehouse Database
  The connection name of the profiling warehouse database. In addition to the profile results, the profiling warehouse holds the persisted profile job queue. Verify that no profile job is running when you change the connection name. Otherwise, the profile jobs might stop running, because profile jobs run on the Data Integration Service where the Profiling Service Module submitted them. You set the default value when you create the instance.

Maximum Profile Execution Pool Size
  The number of profile mappings that the Profiling Service Module can run concurrently when the Data Integration Service runs on a single node or on a grid. The pool size depends on the aggregate processing capacity of the Data Integration Service, which you specify for each node on the Processes tab of the Administrator tool. The pool size cannot be greater than the sum of the processing capacity of all nodes. When you plan a deployment, consider the threads used for profile tasks. It is important to understand the mixture of mappings and profile jobs so that you can configure the Maximum Execution Pool Size parameter. Default is 10.

Maximum Execution Pool Size
  The maximum number of requests that the Data Integration Service can run concurrently. Requests include data previews, mappings, and profiling jobs. This parameter affects the Data Integration Service as a whole.

Maximum Concurrent Columns
  The number of columns that a mapping profiles in parallel. The default value of 5 is optimal for most profiling use cases. You can increase the value for columns with lower-than-average cardinality and decrease it for columns with higher-than-average cardinality. You might also want to decrease this value when you consistently run profiles on large source files and temporary disk space is low. Default is 5.

Maximum Column Heap Size
  The cache size for each column profile mapping for flat files. You can increase this value to prevent the Data Integration Service from writing parts of the intermediate profile results to temporary disk. However, this effect does not apply to large data sources. The default setting is optimal for most profiling use cases. In Live Data Map, profiling is done with a lower data volume, for example in batches of rows. To avoid creating multiple mappings and to prevent an impact on compilation performance, you can combine multiple columns in a single mapping. Informatica recommends that you set Maximum Column Heap Size to 512 to avoid temporary disk usage with combined mappings. Default is 64.
Note: For more information about tuning profile jobs, see the Informatica How-To Library article "Tuning Live Data Map Performance".

Running Profile Jobs for Intelligent Data Lake

Profiling jobs on Hive tables also require Hadoop cluster resources through YARN and compete for resources with other Intelligent Data Lake jobs such as publish and file upload. This is critical because incremental scanning is not supported for Hive tables in the Catalog Service and Intelligent Data Lake, and incremental profiling is also unsupported for Hive sources, so Intelligent Data Lake profiles all the tables every time the scan runs.

Profiling tip: Informatica recommends running scans with a profile configuration during off-business hours or at times of zero or reduced cluster demand, rather than with every scan. If you want to run profiling with every scan, submit the profile jobs to a profile-specific YARN queue with restricted capacity.

Data Preparation Service Properties

Change the data preparation storage options and heap size in the Intelligent Data Lake Administrator for optimum performance.

Local and Durable Storage Disk Space Requirements

The Data Preparation Service, by design, stores a copy of the worksheet in local storage and in durable storage whenever a worksheet is opened or refreshed from its source. The default sample size is 50,000 rows and the Intelligent Data Lake file upload limit is 100 MB, so the sample copy of the worksheet in local or durable storage is the size of 50,000 rows or 100 MB, whichever is lower. A cleanup thread removes all local store data when the user session timeout is reached. In durable storage, data is stored and replicated in HDFS. The Data Preparation Service keeps three copies of opened or refreshed worksheets in both local and durable storage. Make sure that you configure a fast disk with enough available space for the Local Storage Location option, which is in the Data Preparation Storage Options section on the Properties tab of the Data Preparation Service.

Use the following guideline to estimate the disk space requirements for local and durable storage:

(size of the opened data preparation worksheet, that is, the size of 50,000 rows if the default sample size is used) x (number of concurrent users working on data preparation) x 3 (the Data Preparation Service keeps three copies of the worksheet in local and durable storage) + 2 GB of additional disk space
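A worked example of this guideline, assuming a hypothetical deployment in which each 50,000-row sample is close to the 100 MB upload limit and 10 users prepare data concurrently:

100 MB x 10 users x 3 copies + 2 GB = approximately 5 GB for each of local and durable storage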

Heap Size Recommendations

The default maximum heap size for the Data Preparation Service is 8 GB. Set the maximum heap size to a higher value for larger concurrency, and allocate 512 MB for each user. For example, for 20 users, increase the maximum heap size to 10 GB. You can set the maximum heap size in the Advanced Options section on the Processes tab of the Data Preparation Service.

Data Preparation Repository Options

The Data Preparation Service uses a MySQL database to persist all the recipe and mapping metadata. The supported database is MySQL 5.6.x.

The CPU and memory requirements for the MySQL database are as follows:
- Small deployment: minimum 1 CPU core and 2 GB of memory
- Medium deployment: minimum 2 CPU cores and 4 GB of memory
- Large deployment: minimum 4 CPU cores and 8 GB of memory

The disk space requirements for the MySQL database are as follows:
- Small deployment: 50 GB
- Medium deployment: 100 GB
- Large deployment: 200 GB

Model Repository Service Properties

You can tune the Model Repository Service heap size based on the number of concurrent users. Use the following guidelines for the Model Repository Service JVM heap size:
- Single user: 1 GB
- Fewer than 10 users: 2 GB
- Fewer than 50 users: 4 GB
- More than 50 users: 8 GB

Data Integration Service Properties

You can tune the Maximum Execution Pool Size for the Data Integration Service in the Intelligent Data Lake Administrator based on the number of concurrent users. This parameter controls the maximum number of requests that the Data Integration Service can run concurrently. The default value is 10. Intelligent Data Lake uses the Data Integration Service to run data preview on an asset and to publish an asset to the data lake. Use the following guidelines for the Maximum Execution Pool Size:
- Single user: 10 (default)
- Small deployment (5 users): 10 (default)
- Medium deployment (10 users): 15
- Large deployment (50 users): 75

The Maximum Execution Pool Size is in the Execution Options section on the Properties tab of the Data Integration Service.

Catalog Service Sizing Recommendations

Based on the size of the data set, add memory and CPU cores to tune the performance of Live Data Map. Also note the minimum number of nodes required to deploy each supported data set size.

Note: Each node in the following recommendations requires 32 logical cores and 64 GB of available memory. Informatica recommends a minimum of four physical SATA hard disks for each node.

Small Data Set

You can deploy a small data set on a single node.
- Number of catalog objects: one million
- Memory: 64 GB
- Number of CPU cores: 32
- Minimum number of Hadoop nodes required: 1

Medium Data Set

You can deploy a medium data set on a minimum of three nodes.
- Number of catalog objects: 20 million
- Memory: 192 GB
- Number of CPU cores: 96
- Minimum number of Hadoop nodes required: 3

Large Data Set

You can deploy a large data set on a minimum of six nodes.
- Number of catalog objects: 50 million
- Memory: 384 GB
- Number of CPU cores: 192
- Minimum number of Hadoop nodes required: 6

Default Data Set

A default data set is a data set that is smaller than the small data set. You can use a configuration lower than the small data set configuration to process default data sets.

Demo Data Set

You can deploy a demo data set on a single node. Use the demo data set for product demonstrations with a feature-restricted version of Live Data Map.

Metadata Extraction Scanner Memory Parameters

Depending on the size of the data set, use the scanner memory parameters to configure the memory that the scanner uses to extract metadata. The following values are the defaults configured for the scanner for each data set size:
- LdmCustomOptions.scanner.memory.low (small data set): 1024 MB
- LdmCustomOptions.scanner.memory.medium (medium data set): 4096 MB
- LdmCustomOptions.scanner.memory.high (large data set): MB

Authors

Anand Sridharan, Staff Performance Engineer
Mohammed Morshed, Principal Performance Engineer
Chakravarthy Tenneti, Lead Technical Writer
Anupam Nayak, Documentation Trainee


More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows

More information

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

Optimizing the Data Integration Service to Process Concurrent Web Services

Optimizing the Data Integration Service to Process Concurrent Web Services Optimizing the Data Integration Service to Process Concurrent Web Services 2012 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group WHITE PAPER: BEST PRACTICES Sizing and Scalability Recommendations for Symantec Rev 2.2 Symantec Enterprise Security Solutions Group White Paper: Symantec Best Practices Contents Introduction... 4 The

More information

Installing and configuring Apache Kafka

Installing and configuring Apache Kafka 3 Installing and configuring Date of Publish: 2018-08-13 http://docs.hortonworks.com Contents Installing Kafka...3 Prerequisites... 3 Installing Kafka Using Ambari... 3... 9 Preparing the Environment...9

More information

Sybase Adaptive Server Enterprise on Linux

Sybase Adaptive Server Enterprise on Linux Sybase Adaptive Server Enterprise on Linux A Technical White Paper May 2003 Information Anywhere EXECUTIVE OVERVIEW ARCHITECTURE OF ASE Dynamic Performance Security Mission-Critical Computing Advanced

More information

Sync Services. Server Planning Guide. On-Premises

Sync Services. Server Planning Guide. On-Premises Kony MobileFabric Sync Services Server Planning Guide On-Premises Release 6.5 Document Relevance and Accuracy This document is considered relevant to the Release stated on this title page and the document

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

Importing Connections from Metadata Manager to Enterprise Information Catalog

Importing Connections from Metadata Manager to Enterprise Information Catalog Importing Connections from Metadata Manager to Enterprise Information Catalog Copyright Informatica LLC, 2018. Informatica, the Informatica logo, and PowerCenter are trademarks or registered trademarks

More information

Sync Services. Server Planning Guide. On-Premises

Sync Services. Server Planning Guide. On-Premises Kony Fabric Sync Services Server On-Premises Release V8 Document Relevance and Accuracy This document is considered relevant to the Release stated on this title page and the document version stated on

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

PowerCenter 7 Architecture and Performance Tuning

PowerCenter 7 Architecture and Performance Tuning PowerCenter 7 Architecture and Performance Tuning Erwin Dral Sales Consultant 1 Agenda PowerCenter Architecture Performance tuning step-by-step Eliminating Common bottlenecks 2 PowerCenter Architecture:

More information

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?

More information

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1 ABSTRACT This introductory white paper provides a technical overview of the new and improved enterprise grade features introduced

More information

Informatica Power Center 10.1 Developer Training

Informatica Power Center 10.1 Developer Training Informatica Power Center 10.1 Developer Training Course Overview An introduction to Informatica Power Center 10.x which is comprised of a server and client workbench tools that Developers use to create,

More information

FAQ. Release rc2

FAQ. Release rc2 FAQ Release 19.02.0-rc2 January 15, 2019 CONTENTS 1 What does EAL: map_all_hugepages(): open failed: Permission denied Cannot init memory mean? 2 2 If I want to change the number of hugepages allocated,

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive

More information

Upgrading Big Data Management to Version Update 2 for Cloudera CDH

Upgrading Big Data Management to Version Update 2 for Cloudera CDH Upgrading Big Data Management to Version 10.1.1 Update 2 for Cloudera CDH Copyright Informatica LLC 2017. Informatica, the Informatica logo, and Informatica Cloud are trademarks or registered trademarks

More information

Oracle Database 11g Direct NFS Client Oracle Open World - November 2007

Oracle Database 11g Direct NFS Client Oracle Open World - November 2007 Oracle Database 11g Client Oracle Open World - November 2007 Bill Hodak Sr. Product Manager Oracle Corporation Kevin Closson Performance Architect Oracle Corporation Introduction

More information

SmartSense Configuration Guidelines

SmartSense Configuration Guidelines 1 SmartSense Configuration Guidelines Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents SmartSense configuration guidelines...3 HST server...3 HST agent... 9 SmartSense gateway... 12 Activity

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

Linux Network Tuning Guide for AMD EPYC Processor Based Servers

Linux Network Tuning Guide for AMD EPYC Processor Based Servers Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.00 Issue Date: November 2017 Advanced Micro Devices 2017 Advanced Micro Devices, Inc. All rights reserved.

More information

Overview of the Performance and Sizing Guide

Overview of the Performance and Sizing Guide Unifier Performance and Sizing Guide 16 R2 October 2016 Contents Overview of the Performance and Sizing Guide... 5 Architecture Overview... 7 Performance and Scalability Considerations... 9 Vertical Scaling...

More information

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti Solution Overview Cisco UCS Integrated Infrastructure for Big Data with the Elastic Stack Cisco and Elastic deliver a powerful, scalable, and programmable IT operations and security analytics platform

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information