Optimizing Performance for Partitioned Mappings

© 1993-2015 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

If your license includes partitioning, you can enable the Data Integration Service to maximize parallelism when it runs mappings. When you maximize parallelism, the Data Integration Service dynamically divides the underlying data into partitions and processes all of the partitions concurrently. This article describes how to enable partitioning and how to optimize the performance of partitioned mappings.

Supported Versions

Data Quality 10.0

Table of Contents

Overview
Set the Maximum Parallelism for the Data Integration Service
Enabling Partitioning for Mappings
Override the Maximum Parallelism for a Mapping
Suggested Parallelism for a Transformation
Execution Instances for Address Validator and Match Transformations
Overriding the Maximum Parallelism Value
Optimize Throughput from Flat File Sources
Write to Separate Files for Each Partition
Write to Multiple Cache, Temporary, and Target Directories
Configuring Multiple Directories for the Data Integration Service
Optimize Relational Databases for Partitioning
Optimize the Source Database for Partitioning
Optimize the Target Database for Partitioning
Scenarios that Cannot be Optimized
Relational Source Restrictions
Transformation Restrictions
Relational Target Restrictions

Overview

If your license includes partitioning, you can enable the Data Integration Service to maximize parallelism when it runs mappings. When you maximize parallelism, the Data Integration Service dynamically divides the underlying data into partitions and processes all of the partitions concurrently.

If mappings process large data sets or contain transformations that perform complicated calculations, the mappings can take a long time to process and can cause low data throughput. When you enable partitioning for these mappings, the Data Integration Service uses additional threads to process the mapping.

The Data Integration Service can create partitions for mappings that have physical data as input and output. The Data Integration Service can use multiple partitions to complete the following actions during a mapping run:

- Read from flat file, IBM DB2 for LUW, or Oracle sources.
- Run transformations.
- Write to flat file, IBM DB2 for LUW, or Oracle targets.

To enable partitioning, administrators and developers perform the following tasks:

- Administrators set maximum parallelism for the Data Integration Service to a value greater than 1 in the Administrator tool. Maximum parallelism determines the maximum number of parallel threads that process a single pipeline stage. Administrators increase the Maximum Parallelism property value based on the number of CPUs available on the nodes where the Data Integration Service runs mappings.
- Optionally, developers can override the maximum parallelism value for a mapping in the Developer tool. By default, the Maximum Parallelism property for each mapping is set to Auto, and each mapping uses the maximum parallelism value defined for the Data Integration Service. Developers can override the maximum parallelism value in the mapping run-time properties to define a maximum value for a particular mapping. When maximum parallelism is set to different integer values for the Data Integration Service and the mapping, the Data Integration Service uses the minimum of the two values.

After you enable partitioning, you can optimize the performance of partitioned mappings by performing the following tasks:

- Verify that flat file data objects are configured to optimize throughput from flat file sources.
- Verify that flat file data objects are configured to write to separate files for each partition.
- Configure the Data Integration Service to write to multiple cache, temporary, and target directories.
- Configure relational databases to optimize partitioning.

Set the Maximum Parallelism for the Data Integration Service

By default, the Maximum Parallelism property for the Data Integration Service is set to 1. When the Data Integration Service runs a mapping, it separates the mapping into pipeline stages and uses one thread to process each stage.

When you use the Administrator tool to set maximum parallelism for the Data Integration Service to a value greater than 1, you enable partitioning. The Data Integration Service separates a mapping into pipeline stages and uses multiple threads to process each stage. Maximum parallelism determines the maximum number of parallel threads that can process a single pipeline stage.

Configure the Maximum Parallelism property for the Data Integration Service based on the available hardware resources. When you increase maximum parallelism for a mapping that is CPU-bound, you can achieve near-linear scaling. For example, a mapping processed in a single partition takes 25 minutes to complete. If you set maximum parallelism to two, the Data Integration Service processes the mapping with two partitions, which decreases the processing time to 13 minutes.

When you increase maximum parallelism for a mapping that is bound by input/output (I/O) operations or by memory, the scaling that you can achieve also depends on the disk I/O performance and on the available memory. For example, for a mapping that includes transformations that use cache memory and that includes flat file sources and targets, performance depends on the available CPUs, memory, and disk I/O performance.

Consider the following guidelines when you configure maximum parallelism:

Increase the value based on the number of available CPUs.
Increase the maximum parallelism value based on the number of CPUs available on the nodes where the Data Integration Service runs mappings. When you increase the maximum parallelism value, the Data Integration Service uses more threads to run the mapping and leverages more CPUs. A simple mapping runs faster in two partitions, but typically requires twice the amount of CPU that it requires when it runs in a single partition.

When the Data Integration Service runs on multiple nodes, a Data Integration Service process can run on each node with the service role. Each service process uses the maximum parallelism value configured for the service. Verify that each node where a service process runs has an adequate number of CPUs.

Consider additional jobs that the Data Integration Service must run and additional processing that occurs on the node.
If you configure maximum parallelism such that each mapping uses a large number of threads, fewer threads are available for the Data Integration Service to run additional jobs or for additional processing that occurs on the node. For example, if a node has four CPUs, you can set maximum parallelism to three to leave one thread available for additional processing.

Enabling Partitioning for Mappings

To enable partitioning, set maximum parallelism for the Data Integration Service to a value greater than 1. The maximum value is 64.

1. In the Administrator tool, click the Manage tab > Services and Nodes view.
2. In the Domain Navigator, select the Data Integration Service.
3. In the contents panel, click the Properties view.
4. In the Execution Options section, click Edit. The Edit Execution Options dialog box appears.
5. Enter a value greater than 1 for the Maximum Parallelism property. The following image shows a Data Integration Service configured with a maximum parallelism of six:
6. Click OK.
7. Recycle the Data Integration Service to apply the changes.
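To put rough numbers on the near-linear scaling example above (25 minutes with one partition, about 13 minutes with two), it can help to think of a mapping run as a small serial portion plus a parallel portion that is divided across partitions. The following Python sketch is an illustrative model only, with an assumed 4% serial fraction chosen to match that example; it is not how the Data Integration Service estimates run times.

```python
# Illustrative model only: a mapping run treated as a serial portion plus a
# parallel portion divided across partitions. The 4% serial fraction is an
# assumption chosen to reproduce the 25-minute / 13-minute example above.

def estimated_minutes(single_partition_minutes: float,
                      serial_fraction: float,
                      partitions: int) -> float:
    serial = single_partition_minutes * serial_fraction
    parallel = single_partition_minutes * (1.0 - serial_fraction)
    return serial + parallel / partitions

for n in (1, 2, 4):
    print(n, round(estimated_minutes(25.0, 0.04, n), 1))
# 1 -> 25.0, 2 -> 13.0, 4 -> 7.0 minutes, assuming the mapping stays CPU-bound
# and enough CPUs are available. I/O-bound or memory-bound mappings scale less.
```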

Override the Maximum Parallelism for a Mapping

By default, the Maximum Parallelism property for each mapping in the Developer tool is set to Auto. Each mapping uses the maximum parallelism value defined for the Data Integration Service. You can override the maximum parallelism value to define a maximum value for a particular mapping. When maximum parallelism is set to different integer values for the Data Integration Service and the mapping, the Data Integration Service uses the minimum value of the two.

You might want to use the Developer tool to override the Maximum Parallelism property for a mapping for the following reasons:

You run a complex mapping that results in more threads than the CPU can handle.
Each partition point adds an additional pipeline stage. A complex mapping with multiple Aggregator or Joiner transformations might have many pipeline stages. A large number of pipeline stages can cause the Data Integration Service to use more threads than the CPU can handle.

Mapping performance is satisfactory with fewer parallel threads for each pipeline stage.
When a single mapping runs with fewer parallel threads, more threads are available for the Data Integration Service to run additional jobs or for additional processing that occurs on the node.

You want to define a suggested parallelism value for a transformation.
If you override the maximum parallelism for a mapping, you can define a suggested parallelism value for a specific transformation in the mapping. You might want to define a suggested parallelism value to optimize performance for a transformation that contains many ports or performs complicated calculations.

You want to define an execution instances value for an Address Validator or Match transformation.
If you override the maximum parallelism for a mapping, the Data Integration Service considers the execution instances value for an Address Validator or Match transformation in the mapping. You might want to define an execution instances value to optimize performance for the transformation.

Suggested Parallelism for a Transformation

If you override the Maximum Parallelism run-time property for a mapping, you can define the Suggested Parallelism property for a specific transformation in the mapping run-time properties. The Data Integration Service considers the suggested parallelism value for the number of threads for that transformation pipeline stage as long as the transformation can be partitioned. For example, if you configure the mapping to maintain row order, the Data Integration Service might need to use one thread for the transformation.

If the Maximum Parallelism run-time property for the mapping is set to Auto, you cannot define a suggested parallelism value for any transformation in the mapping. If you set the maximum parallelism value for the mapping to Auto after defining a suggested parallelism value for a transformation, the Data Integration Service ignores the suggested parallelism value.

You might want to define a suggested parallelism value to optimize performance for a transformation that contains many ports or performs complicated calculations. For example, if a mapping enabled for partitioning processes a small data set, the Data Integration Service might determine that one thread is sufficient to process an Expression transformation pipeline stage. However, if the Expression transformation contains many complicated calculations, the transformation pipeline stage can still take a long time to process. You can enter a suggested parallelism value greater than 1 but less than the maximum parallelism value defined for the mapping or the Data Integration Service. The Data Integration Service uses the suggested parallelism value for the number of threads for the Expression transformation.

You can configure the following values for the Suggested Parallelism property for a transformation when you override the maximum parallelism for the mapping:

1
The Data Integration Service uses one thread to run the transformation.

Auto
The Data Integration Service considers the maximum parallelism defined for the mapping and for the Data Integration Service. The service uses the lowest value to determine the optimal number of threads that run the transformation. Default for each transformation.

Greater than 1
The Data Integration Service considers the suggested parallelism defined for the transformation, the maximum parallelism defined for the mapping, and the maximum parallelism defined for the Data Integration Service. The service uses the lowest value for the number of threads that run the transformation.

You can define the Suggested Parallelism property in the mapping run-time properties for the following transformations:

- Aggregator
- Expression
- Filter
- Java
- Joiner
- Lookup
- Normalizer
- Rank
- Router
- Sequence Generator
- Sorter
- SQL
- Union
- Update Strategy

Execution Instances for Address Validator and Match Transformations

If you override the Maximum Parallelism run-time property for a mapping, the Data Integration Service considers the value of the Execution Instances advanced property defined for an Address Validator or Match transformation. The Data Integration Service considers the execution instances value for the number of threads for that transformation pipeline stage as long as the transformation can be partitioned. For example, if you configure the mapping to maintain row order, the Data Integration Service might need to use one thread for the transformation.

You can increase the number of execution instances on a Match transformation when you configure the transformation for identity match analysis. You cannot increase the number of execution instances on a Match transformation when you configure the transformation for field match analysis. In field match analysis, the Match transformation uses a single execution instance.

If the Maximum Parallelism run-time property for a mapping is set to Auto, the Data Integration Service ignores the execution instances value defined for an Address Validator or Match transformation.

You can configure the following values for the Execution Instances advanced property for an Address Validator or Match transformation when you override the maximum parallelism for the mapping:

1
The Data Integration Service uses one thread to run the transformation. Default for the Address Validator transformation.

Auto
The Data Integration Service considers the maximum parallelism defined for the mapping and for the Data Integration Service. The service uses the lowest value to determine the optimal number of threads that run the transformation. Default for the Match transformation in identity match analysis.

Greater than 1
The Data Integration Service considers the execution instances defined for the transformation, the maximum parallelism defined for the mapping, and the maximum parallelism defined for the Data Integration Service. The service uses the lowest value for the number of threads that run the transformation.

Note: The Data Integration Service also considers the Max Address Object Count property on the Content Management Service when it calculates the optimal number of threads for an Address Validator transformation. The Max Address Object Count property determines the maximum number of address validation instances that can run concurrently in a mapping. The Max Address Object Count value must be greater than or equal to the maximum parallelism value on the Data Integration Service.
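Both the Suggested Parallelism and Execution Instances values follow the same basic resolution rule: the transformation-level value matters only when the mapping overrides maximum parallelism, and the service then considers the lowest applicable value. The following Python sketch is a simplified, hypothetical model of that rule; the function name is invented, and the actual calculation is internal to the Data Integration Service and also accounts for factors such as row order and the Max Address Object Count.

```python
# Simplified, hypothetical model of the documented resolution rule for the
# Suggested Parallelism and Execution Instances properties. Not Informatica
# code; the actual thread calculation is internal to the service.

def stage_thread_limit(service_max: int, mapping_max, transformation_value="Auto") -> int:
    """Upper bound on the threads considered for one transformation pipeline stage.

    mapping_max: integer override from the mapping run-time properties, or "Auto".
    transformation_value: Suggested Parallelism or Execution Instances
                          (1, "Auto", or an integer greater than 1).
    """
    if mapping_max == "Auto":
        return service_max        # transformation-level value is ignored
    limit = min(service_max, int(mapping_max))
    if transformation_value == "Auto":
        return limit              # lowest of the mapping and service values
    if int(transformation_value) == 1:
        return 1                  # force a single thread for the stage
    return min(limit, int(transformation_value))  # lowest of all three values

print(stage_thread_limit(8, 4))         # 4 - Auto, capped by the mapping override
print(stage_thread_limit(8, 4, 2))      # 2 - transformation value is the lowest
print(stage_thread_limit(8, 4, 6))      # 4 - the mapping override is lower than 6
print(stage_thread_limit(8, "Auto", 2)) # 8 - mapping is Auto, so the 2 is ignored
```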

Overriding the Maximum Parallelism Value

To override the maximum parallelism value for a mapping, set maximum parallelism in the mapping run-time properties to an integer value greater than 1 and less than the value set for the Data Integration Service.

1. In the Developer tool, open the mapping.
2. In the Properties view, click the Run-time tab.
3. Select Native for the Execution Environment.
4. For the Maximum Parallelism property, enter an integer value greater than 1 and less than the value set for the Data Integration Service. Or, you can assign a user-defined parameter to the property and then define the parameter value in a parameter set or a parameter file.
5. To define a suggested parallelism value for a specific transformation in the mapping, enter an integer value greater than 1 for the transformation in the Suggested Parallelism section. The following image shows a mapping with an overridden maximum parallelism value and with the default suggested parallelism values for transformations:
6. Save the mapping.
7. To define an execution instances value for an Address Validator or for a Match transformation configured for identity match analysis, complete the following steps:
   a. Open the Address Validator or Match transformation.
   b. In the Advanced view, enter an integer value greater than 1 for the Execution Instances property.
   c. Save the transformation.

Optimize Throughput from Flat File Sources

For optimal performance when multiple threads read from a flat file, verify that the flat file data object is configured to optimize throughput instead of preserving row order.

By default, flat file data objects are configured to optimize throughput. When you optimize throughput, the Data Integration Service does not preserve row order because it does not read the rows in the file or file list sequentially. If you need to maintain row order in the mapping, you can configure the Concurrent Read Partitioning property to preserve the order. However, preserving row order can slow performance.

1. In the Developer tool, open the flat file data object.
2. In the Advanced view, locate the Runtime: Read section.
3. Verify that the Concurrent Read Partitioning property is set to Optimize throughput. The following image shows a flat file data object configured to optimize throughput:

Write to Separate Files for Each Partition

For optimal performance when multiple threads write to a flat file, verify that the flat file data object is configured to write to separate files for each partition. When you merge target data into a single file, the Data Integration Service can take a longer amount of time to write to the file.

By default, the Data Integration Service concurrently writes the target output to a separate file for each partition. If you require merged target data, you can configure the flat file data object to create a single merge file for all target partitions. The concurrent merge type optimizes performance more than the sequential merge type. However, using any merge type can slow performance.

1. In the Developer tool, open the flat file data object.
2. In the Advanced view, locate the Runtime: Write section.
3. Verify that the Merge Type property is set to No merge. The following image shows a flat file data object configured to write to separate files for each partition:

Write to Multiple Cache, Temporary, and Target Directories

For optimal performance when multiple transformation threads page cache files to the disk or when multiple writer threads write to separate flat files, configure multiple cache, temporary, and target directories for the Data Integration Service. When multiple threads write to a single directory, the mapping might encounter a bottleneck due to I/O contention, which can occur when threads write data to the file system at the same time.

Note: Transformation threads write to the cache or temporary directory when the Data Integration Service runs Aggregator, Joiner, Rank, and Sorter transformations and must page cache files to the disk. When the Data Integration Service pages cache files to the disk, processing time increases. For optimal performance, configure the transformation cache size so that the Data Integration Service can run the complete transformation in memory. If sufficient memory is not available, configure multiple cache and temporary directories to avoid I/O contention.

When you configure multiple directories, the Data Integration Service determines the directory for each thread in a round-robin fashion. If the Data Integration Service does not use cache partitioning for transformations or does not use multiple threads to write to the target, the service writes the files to the first listed directory.

For optimal performance, configure each directory on a separate disk drive and configure the number of directories to be equal to the maximum parallelism value to ensure that each thread writes to a separate directory. For example, if maximum parallelism is four, configure four cache, temporary, and target directories.

In the Administrator tool, you configure multiple cache, temporary, and target directories by entering multiple directories separated by semicolons for the Data Integration Service execution properties. Configure the directories in the following execution properties:

Cache Directory
Defines the cache directories for Aggregator, Joiner, and Rank transformations. By default, the transformations use the CacheDir system parameter to access the cache directory value defined for the Data Integration Service.
Note: A Lookup transformation can only use a single cache directory.

Temporary Directories
Defines the temporary directories for Sorter transformations. By default, the Sorter transformation uses the TempDir system parameter to access the temporary directory value defined for the Data Integration Service.

Target Directory
Defines the target directories for flat file targets. By default, flat file data objects use the TargetDir system parameter to access the target directory value defined for the Data Integration Service.

Instead of using the default system parameters, developers can configure multiple directories specific to the transformation or flat file data object in the Developer tool. However, if the Data Integration Service runs on multiple nodes, developers must verify that each DTM instance that runs a job can access the directories.

Configuring Multiple Directories for the Data Integration Service

Use the Administrator tool to configure multiple cache, temporary, and target directories for the Data Integration Service.

1. In the Administrator tool, click the Manage tab > Services and Nodes view.
2. In the Domain Navigator, select the Data Integration Service.
3. In the contents panel, click the Properties view.
4. In the Execution Options section, click Edit. The Edit Execution Options dialog box appears.
5. Enter multiple directories separated by semicolons for the Temporary Directories, Cache Directory, and Target Directory properties.

The following image shows two temporary, cache, and target directories configured for the Data Integration Service:

6. Click OK.
7. Recycle the Data Integration Service to apply the changes.
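As a rough mental model of the round-robin assignment described above, the following Python sketch splits a semicolon-separated directory list and assigns one directory to each transformation thread. The paths and the parallelism value are hypothetical examples, not defaults.

```python
# Rough, hypothetical illustration of the documented behavior: a semicolon-
# separated directory list and round-robin assignment of directories to
# threads. The paths and thread count are examples only.
from itertools import cycle

cache_directory_property = "/disk1/cache;/disk2/cache;/disk3/cache;/disk4/cache"
directories = cache_directory_property.split(";")

max_parallelism = 4  # equal to the number of directories, as recommended above
for thread, directory in zip(range(1, max_parallelism + 1), cycle(directories)):
    print(f"transformation thread {thread} pages cache files to {directory}")
```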

Optimize Relational Databases for Partitioning

When a mapping that is enabled for partitioning reads from or writes to an IBM DB2 for LUW or an Oracle relational database, the Data Integration Service can use multiple threads to read from the relational source or to write to the relational target. You can configure options in the source or target database to optimize the performance of partitioned reads or writes to the database.

Optimize the Source Database for Partitioning

To optimize the source database for partitioning, perform the following tasks:

Add database partitions to the source.
Add database partitions to the relational source to increase the speed of the Data Integration Service query that reads the source. If the source does not have database partitions, the Data Integration Service uses one thread to read from the source.

Enable parallel queries.
Relational databases might have options that enable parallel queries to the database. Refer to the database documentation for these options. If these options are not enabled, the Data Integration Service runs multiple partition SELECT statements serially.

Separate data into different tablespaces.
Each database provides an option to separate the data into different tablespaces. Each tablespace can refer to a unique file system, which prevents any I/O contention across partitions.

Increase the maximum number of sessions allowed to the database.
The Data Integration Service creates a separate connection to the source database for each partition. Increase the maximum number of allowed sessions so that the database can handle a larger number of concurrent connections.

Optimize the Target Database for Partitioning

To optimize the target database for partitioning, perform the following tasks:

Add database partitions to a DB2 for LUW target.
The Data Integration Service can use multiple threads to write to a DB2 for LUW target that does not have database partitions. However, you can optimize load performance when the target has database partitions. In this case, each writer thread connects to the DB2 for LUW node that contains the database partition. Because the writer threads connect to different DB2 for LUW nodes instead of all threads connecting to the single master node, performance increases.

Enable parallel inserts.
Relational databases might have options that enable parallel inserts to the database. Refer to the database documentation for these options. For example, set the db_writer_processes option in an Oracle database and the max_agents option in a DB2 for LUW database to enable parallel inserts.

Separate data into different tablespaces.
Each database provides an option to separate the data into different tablespaces. Each tablespace can refer to a unique file system, which prevents any I/O contention across partitions.

Increase the maximum number of sessions allowed to the database.
The Data Integration Service creates a separate connection to the target database for each partition. Increase the maximum number of allowed sessions so that the database can handle a larger number of concurrent connections.

Set options to enhance database scalability.
Relational databases might have options that enhance scalability. For example, disable archive logging and timed statistics in an Oracle database to enhance scalability.

Scenarios that Cannot be Optimized

Performance might not increase for the following mapping scenarios when you enable partitioning:

Mappings that process small data sets or that run in a short amount of time.
Mappings that process small data sets or that run in a short amount of time might not result in performance improvements when partitioning is enabled. For example, if a mapping takes 30 seconds to run when it is not partitioned, enabling partitioning is not likely to decrease the processing time. If a mapping takes five minutes or more to run when it is not partitioned, enabling partitioning might decrease the processing time.

Mappings that maintain row order.
If you configure a mapping to maintain row order, the Data Integration Service might need to use one thread to process some transformations, even if partitioning is enabled for the mapping.
When you configure Write transformations to maintain row order, the Data Integration Service always uses a single thread to write to the target. If an Aggregator transformation that uses sorted input precedes the Write transformation, the Data Integration Service uses a single thread to process both the Aggregator transformation and the target.
When you configure Expression, Java, Sequence Generator, or SQL transformations to maintain row order, the Data Integration Service determines the optimal number of threads for the transformation pipeline stage while maintaining the order. The service might use one thread to process the transformation if one thread is required to maintain the order.

Mappings that include connected Lookup transformations that do not use cache partitioning.
The Data Integration Service uses cache partitioning for connected Lookup transformations under the following conditions:
- The lookup condition contains only equality operators.
- When the connected Lookup transformation looks up data in a relational table, the database is configured for case-sensitive comparison.
For example, if the lookup condition contains a string port and the database is not configured for case-sensitive comparison, the Data Integration Service does not use cache partitioning. When the Data Integration Service does not use cache partitioning for a Lookup transformation, all threads that run the Lookup transformation share the same cache. Each thread queries the same cache serially, which can cause a bottleneck.

In some situations, the Data Integration Service uses one thread to process a mapping pipeline stage, even if partitioning is enabled for the mapping. Using one thread for a pipeline stage can cause a bottleneck at that stage.

Relational Source Restrictions

The Data Integration Service uses one thread to read the relational source, but can use multiple threads for the remaining mapping pipeline stages, in the following situations:

- The mapping reads from a relational source other than IBM DB2 for LUW or Oracle.
- The mapping uses a JDBC or ODBC connection to read from an IBM DB2 for LUW or Oracle source.
- The mapping pushes transformation logic to the source database. The Data Integration Service uses one thread to read the source and the transformations that are pushed to the source database. The service can use multiple threads for the remaining mapping pipeline stages.
- You use the simple query in the Read transformation to select the ports to sort by or to configure a user-defined join.
- You use the advanced query in the Read transformation to create a custom SQL query.

Transformation Restrictions

Some transformations do not support partitioning. When a mapping enabled for partitioning contains a transformation that does not support partitioning, the Data Integration Service uses one thread to run the transformation. The Data Integration Service can use multiple threads to run the remaining mapping pipeline stages.

The following transformations do not support partitioning:

- Association
- Consolidation
- Exception
- Match, when configured for field match analysis
- REST Web Service Consumer
- Unconnected Lookup
- Web Service Consumer

Some transformations that support partitioning require specific configurations. If a mapping enabled for partitioning contains a transformation with an unsupported configuration, the Data Integration Service uses one thread to run the transformation. The Data Integration Service can use multiple threads to process the remaining mapping pipeline stages.

The following transformations require specific configurations to support partitioning:

- Aggregator transformations must include a group by port.
- Aggregator transformations must not include a pass-through port.
- Aggregator transformations must not include numeric functions that calculate running totals and averages on a row-by-row basis.
- Expression transformations must not include the following types of functions or variables:
  - Numeric functions that calculate running totals and averages on a row-by-row basis.
  - Special functions that might return different results when multiple threads process the transformation.
  - Local variables that depend on the value of a previous row.
- Decision, Java, and SQL transformations must have the Partitionable property enabled.
- Joiner transformations must include a join condition that uses an equality operator. If the join condition includes multiple equality conditions, the conditions must be combined using the AND operator.
- Rank transformations must include a group by port.

The following image shows an example mapping where the maximum parallelism for the Data Integration Service is two. Maximum parallelism for the mapping is Auto. The Expression transformation uses the numeric function MOVINGSUM, which is an unsupported configuration for partitioning. The Data Integration Service uses two threads to read the source, one thread to run the Expression transformation, and two threads to write to the target:

Relational Target Restrictions

The Data Integration Service uses one thread to write to the relational target, but can use multiple threads for the remaining mapping pipeline stages, in the following situations:

- The mapping writes to a relational target other than IBM DB2 for LUW or Oracle.
- The mapping uses a JDBC or ODBC connection to write to an IBM DB2 for LUW or Oracle target.

Author

Alison Taylor
Principal Technical Writer

Acknowledgements

The author would like to acknowledge Anmol Chaturvedi, Navneeth Mandavilli, Mohammed Morshed, Ganesh Ramachandramurthy, and Madan Vijayakumar for their contributions to this article.