Optimizing Testing Performance With Data Validation Option


Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

To optimize test performance in Data Validation Option, you must understand the factors that affect test performance, and the features and techniques that you can use to improve test performance.

Supported Versions

- Data Validation Option Hotfix 2 Update 1
- Data Validation Option
- Data Validation Option

Table of Contents

- Data Validation Option and Performance
- Understanding Tests
- Performance and PowerCenter Mappings
- Factors Affecting Performance
  - Data Volume Tested
  - Data Complexity
  - Types of Tests
  - Quantity of Tests
  - PowerCenter Server Capacity
  - Load on PowerCenter Integration Service
- Performance Optimization Techniques
  - Database Optimization Techniques
  - Sorted Input
  - Caching
  - Sampling
  - Splitting Wide Tables
- Server Configurations
- Performance Metrics Calculations
  - COUNT Tests
  - COUNT_ROWS Tests
  - SUM Tests
  - SET Tests
  - OUTER_VALUE Tests
  - VALUE Tests
- Conclusion

Data Validation Option and Performance

When people look to deploy Data Validation Option, a key concern is performance. This concern is expressed in different ways by different people and includes the following questions:

- How fast are tests executed?
- How much data does Data Validation Option test?
- What type of server does Data Validation Option need?
- Will the deployment of Data Validation Option in the production environment impact the performance of existing jobs?

There is no absolute answer to these questions because many factors contribute to the performance of tests conducted with Data Validation Option. The performance of Data Validation Option tests depends on the following factors:

- Data volume (the number of rows and columns) tested
- Data complexity (heterogeneous, homogeneous, complex joins)
- Types of tests performed (Aggregate, Set, Value, Expressions)
- Number of tests in a given single table object or table pair object
- Capacity of the PowerCenter Integration Service (memory or CPU) executing the tests
- Load on the PowerCenter Integration Service when tests are executed
- Configuration of Data Validation Option jobs to optimize performance

This document explains the various factors that affect performance and the features and techniques that can be applied to increase performance for tests conducted with Data Validation Option. Additionally, some baseline performance numbers are provided to give context. Those numbers should be viewed as a minimal performance level, given that they were produced on a smaller server with no optimizations configured.

Understanding Tests

The first thing to understand is how Data Validation Option executes tests. Data Validation Option provides a client and metadata repository that uses PowerCenter as an execution engine. Thus, performance is about optimizing the processing performed by PowerCenter. Consider the following image:

The image above has three major components:

- Data Validation Option, which consists of a desktop client, its own metadata repository, and a results warehouse with predefined database views.
- PowerCenter, which consists of a metadata repository and a set of services for accessing and processing data.
- Enterprise Data, which consists of a broad spread of data sources, including relational DBMS, warehousing appliances, mainframe data, flat files, and data in the cloud such as Salesforce.com.

The numbers in the following steps correspond with the numbers in the image above and explain the execution process for Data Validation Option:

1. Data Validation Option users define tests and store those (metadata) definitions in the Data Validation Option repository.
2. Tests are executed from a GUI or command line. Data Validation Option generates mappings that embody the test conditions and executes those mappings in PowerCenter.
3. PowerCenter accesses the data being tested and applies the tests to that data.
4. Test results (pass or fail, error data, and so on) are stored in the Data Validation Option warehouse.
5. Users can view test results in reports that are generated from the warehouse.

The key steps from a performance perspective are 2, 3, and 4. In particular, step 2 matters because the generated mapping is what PowerCenter executes, so anything that can be done at design time (that is, in the Data Validation Option client) to optimize for performance should be done.

Performance and PowerCenter Mappings

The basic principle of performance is that overall performance will never be faster than the slowest point in the mapping. PowerCenter mappings consist of three main parts: the reader, the transformation pipeline, and the writer, as the simple model below illustrates.
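To make the slowest-stage principle concrete, here is a minimal sketch. It is only a model of the reader/pipeline/writer decomposition described above; the throughput figures are invented for illustration and are not measurements of any product.

```python
# Minimal model of the reader -> transformation pipeline -> writer principle:
# end-to-end throughput is bounded by the slowest stage.
# Stage throughputs (rows/second) are hypothetical, for illustration only.
stage_throughput = {
    "reader": 120_000,    # rate of reading source data
    "pipeline": 55_000,   # rate of evaluating tests (joins, expressions)
    "writer": 900_000,    # rate of writing results to the warehouse
}

bottleneck = min(stage_throughput, key=stage_throughput.get)
effective_rate = stage_throughput[bottleneck]

rows = 10_000_000
print(f"Bottleneck stage: {bottleneck}")
print(f"Estimated run time: {rows / effective_rate:.0f} seconds")
# Bottleneck stage: pipeline; about 182 seconds for 10 million rows.
```

In this model, making the writer faster changes nothing; only raising the pipeline rate (fewer tests, better caching, pushing work to the database) shortens the run.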

The reader reads all the data from all the sources that are being tested and feeds that data into the transformation pipeline. The pipeline is where tests are processed, results computed, error rows identified, and so on. The writer writes those results to the Data Validation Option warehouse.

If there is a lot of data to read but few tests to perform, the reader could be the slow point in the mapping. Conversely, if there are a lot of tests, or a lot of complex joins and lookups defined on the data, the transformation pipeline will be the slow point. It is rare for the writer to be the slow point, as there is usually very little data written to the Data Validation Option warehouse relative to the amount of data read.

Though simplistic, this description of the process gives a high-level background for the next sections, which explain the factors that affect performance in more detail.

Factors Affecting Performance

A frequent question is how complex Data Validation Option mappings are. Data Validation Option mappings are generated based on the configuration defined by the user in the Data Validation Option client. This configuration includes what data to read, how that data should be pre-processed for testing, what tests are to be performed, where they are performed, and the results to write out. If you think through the various source data types and configuration options in the table pair object, the use of join, SQL, or lookup views, and the various tests defined and where and how they should be executed, you can end up with what appear to be complex mappings. And if those mappings run on underpowered hardware, read data across a slow network, or do not have enough memory to execute effectively, things will slow down. But while those mappings may appear complex, their design is efficient, and they run on PowerCenter, which has an unparalleled track record in the industry for scalable, high-performance data processing. The key is to understand what the issues are, and how to address them, to deliver efficient and high-performance testing scenarios.

Data Volume Tested

The amount of data tested affects performance. More data to test means more data to read from the source, more rows to process, and, potentially, more errors to write. Data volume has the following three dimensions:

Number of Rows
The number of rows dimension is straightforward: the more rows read from a source, the more time spent in the reader and in test execution. But all rows are not created equal.

Number of Columns
A wide table or file with dozens or hundreds of columns, if read, will take more time to read and process than a narrow table.

Width of Individual Columns
Wide columns will take more time to read and process than narrow columns. For example, a string that is hundreds of characters wide will take more compute power than a short string or a number. A rough sizing estimate that multiplies these dimensions together appears in the sketch after the next section.

Data Complexity

Data complexity refers to the complexity of the input. Files or database tables are straightforward, but if you combine different sources into a join view, or different tables into a SQL view, and then use those in a table pair object, the processing time could be higher than with a single table object or file. The time taken to join various tables or to execute the SQL view's SQL in the database depends on the complexity and size of the joins, the complexity of the SQL, and the size of the associated tables.
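Because the three volume dimensions multiply together, a rough sizing estimate is easy to compute. The sketch below is a back-of-the-envelope calculation only; the column names and byte widths are invented, and actual PowerCenter buffer sizing works differently.

```python
# Rough data volume estimate: rows x (sum of column widths in bytes).
# Column layout and widths are illustrative assumptions, not real metadata.
columns = {
    "order_id": 8,         # numeric, ~8 bytes
    "status_code": 10,     # short string
    "customer_name": 100,  # wide string
    "comments": 500,       # very wide string; dominates the row size
}

rows = 10_000_000
bytes_per_row = sum(columns.values())
print(f"{bytes_per_row} bytes/row -> ~{rows * bytes_per_row / 1e9:.1f} GB to move")

# Data Validation Option propagates only the columns that tests use, so
# testing just the two narrow columns moves a fraction of the data.
tested = ["order_id", "status_code"]
tested_bytes = sum(columns[c] for c in tested)
print(f"Testing 2 narrow columns -> ~{rows * tested_bytes / 1e9:.2f} GB")
```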

Types of Tests

Aggregate tests like SUM, MIN, MAX, and AVG execute faster than SET or VALUE tests. This is due to the nature of the tests: aggregates act on a single column, whereas VALUE and SET tests act across columns and need joined tables. Complex expression evaluation can also impact performance, particularly if applied to many columns in the data. For very large tables, the join process itself takes significant resources to complete.

Quantity of Tests

When mapping generation occurs, all tests defined in a single table object or table pair object are encoded into the mapping. The mapping will read and then propagate only the data necessary for the tests. For example, if a flat file with 100 columns is read in, but only five tests are defined that evaluate data in five different columns, then only those five columns are propagated in the mapping, and only five tests are made on each row. This is an efficiency built into the generated mappings. Now, if the same 100-column file has 50 tests on 50 columns instead of five, then 50 columns are propagated and 50 tests are made on each row. Thus both the amount of data tested and the number of tests per row increase by a factor of 10, and both of these affect performance.

PowerCenter Server Capacity

The famous Greek mathematician Archimedes once said: "Give me a lever long enough and a fulcrum on which to place it, and I shall move the world." The same principle applies to PowerCenter. With enough compute power and memory, virtually any amount of data can be processed. But unlimited computing resources are rare. The PowerCenter environment used to execute the Data Validation Option mappings has a tremendous impact on performance. Testing large data sets on an underpowered server with insufficient RAM or processing power is almost guaranteed to be slow.

Load on PowerCenter Integration Service

A powerful server under heavy load can perform poorly. If you test large or complex data sets on a server that has other jobs running on it, ensure that the server has enough additional capacity to run Data Validation Option mappings efficiently. If you run Data Validation Option in production, the Data Validation Option mappings will add load to the server. As with any other PowerCenter job, it is best to have a clear understanding of the resource requirements of Data Validation Option jobs before adding them to an already overloaded server.

Performance Optimization Techniques

There are features in Data Validation Option that can be used to maximize performance. Basic techniques such as reading only the data that is needed for testing, minimizing unnecessary tests, and providing Data Validation Option mappings with the resources they need contribute to increased performance.

Database Optimization Techniques

A simple step to increase performance is to use the database optimization features available in Data Validation Option. These can be applied to data coming from any supported SQL database. Use the following features:

WHERE Clauses

WHERE clauses in table pair objects and single table objects can be defined with SQL or with the PowerCenter expression language. When sourcing data from databases, it is usually more efficient to define the WHERE clause with SQL and then select the Execute Where Clause in DB check box. This executes the WHERE clause in the database and feeds PowerCenter only the rows matching the WHERE condition. This is more efficient than reading the entire table and then processing the WHERE clause in PowerCenter, which reads unmatched data only to throw it away immediately (see the SQL sketch at the end of this section).

Aggregate and COUNT Tests

Aggregate and COUNT tests can be processed in a database very efficiently. To do this, set the Optimization Level in the single table object or table pair object to Where Clause, Sorting and Aggregation in DB.

Sorted Input

When using table pairs with VALUE or SET tests, a join condition is required across table A and table B. If the data coming into the table pair object is already sorted in the same order as the join condition (for example, via a WHERE clause in the table pair sent to the database), PowerCenter can optimize the join operation and use less memory for the process. To indicate that the input is already sorted, select Already Sorted Input in the Optimization Level drop-down in the table pair object or single table object dialog box.

Caching

Caching is a means of memory allocation when executing PowerCenter jobs. As data is processed by PowerCenter, memory is allocated to specific operations up to a specified limit. Operations like joins, sorting, lookups, and aggregations all use cached memory. If the PowerCenter job requires more memory than has been allocated to it, it spills data to disk and then swaps data between disk and memory as required. In-memory processing is significantly faster than spilling to disk, so for best performance, keep all data in memory whenever possible.

By default, Data Validation Option lets PowerCenter decide how to allocate cache memory for a given job. This is known as automatic caching; the option appears on the Advanced tab of the table pair object or single table object dialog box. In general, automatic caching works well, but there are cases where it is not sufficient, for example, with very large data sets, or when complex join views, large lookup views, or extensive sorting are needed within PowerCenter. In those cases, users must explicitly allocate memory to the job.

This explanation of caching is an intentional simplification of the actual process, but it should be sufficient to gain a general understanding of the concepts. From a PowerCenter perspective, cache allocation can get quite involved, as the amount of cache required is specific to each transformation and depends on the amount of data processed by that transformation at runtime.
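Returning to the WHERE clause feature above, the difference between the two strategies can be seen in plain SQL. This is a hedged sketch: the table and column names are hypothetical, and this is not the SQL that Data Validation Option itself generates.

```python
# Contrast of the two WHERE clause strategies described above.
# Table and column names are invented for illustration.

# Execute Where Clause in DB: the database returns only matching rows,
# so the PowerCenter reader moves far less data.
pushed_down = """
    SELECT order_id, order_amount
    FROM sales.orders
    WHERE order_date >= DATE '2015-01-01'
"""

# Without pushdown: every row crosses the wire, and PowerCenter discards
# the non-matching rows immediately after reading them.
read_everything = """
    SELECT order_id, order_amount, order_date
    FROM sales.orders
"""

# If only 5% of the rows match, pushdown reads roughly 5% of the data:
total_rows, match_fraction = 100_000_000, 0.05
print(f"Pushdown reads ~{int(total_rows * match_fraction):,} rows "
      f"instead of {total_rows:,}")
```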

Detailed information on necessary cache allocation is given in the PowerCenter session logs, but for the uninitiated, those logs are quite daunting. If you need more detailed information on the topic of caching, see the appropriate sections in the PowerCenter Performance Tuning Guide.

Setting Table Pair Object and Single Table Object Cache

The simplest approach is to set the total cache allocation on the Advanced tab of the single table object and table pair object dialog box. Simply uncheck Automatic and enter the total amount of cache memory (for example, 256 MB or 1 GB) that you want to allocate to the job, and Data Validation Option will take care of the rest. If the amount is insufficient and data spills to disk, allocate a larger amount and rerun.

Sampling

In some situations, either with very large data sets (hundreds of millions or billions of rows) or in cases where 100% of the data is not required, data sampling can be employed. Sampling is available for both table pair objects and single table objects and can be accessed via the Advanced tab in these dialog boxes. Users specify the percentage of rows needed from the source and an optional seed value that is used to add repeatability to the sampled data. In table pair objects, data is sampled on only one side, that is, either table A or table B. When the data from both sides is joined (via a join condition), only the matching rows across tables A and B pass through and are tested.

Data sampling can operate in one of two modes: in the database (native) or wholly in PowerCenter. Native sampling is supported for databases (Oracle, SQL Server, DB2, Teradata) that support sampling directly. Here, the database does the sampling, and only the sampled data is returned from the database for testing. For all other data sources (flat files, other databases, mainframe data, and so on), the sampling is done in PowerCenter, which means all data is read and then filtered based on the sampling criteria.

It is important to understand how sampling actually works. If a user asks for 5% of the data, then each row has a 5% chance of being selected and passed on for testing. This is true for native sampling as well as for sampling performed by PowerCenter. Thus, if there are 100,000 rows in a table and the user specifies 5%, then about 5,000 rows will be delivered, though not guaranteed to be exactly 5,000. The net result, regardless of the type and amount of data sampled, is a subset of large data volumes that can be tested efficiently to find issues or to give confidence in a system.
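The per-row selection behavior described above is easy to verify with a short simulation. This sketch is independent of the product; the seed here plays the same repeatability role as the seed value in the sampling dialog box.

```python
import random

# Each row has an independent 5% chance of selection, so the delivered
# count is close to, but not guaranteed to be exactly, 5% of the table.
rows, pct, seed = 100_000, 0.05, 42

rng = random.Random(seed)  # fixed seed makes the sample repeatable
sampled = sum(1 for _ in range(rows) if rng.random() < pct)
print(f"Selected {sampled:,} of {rows:,} rows (expected ~{int(rows * pct):,})")

# Re-running with the same seed selects the same rows; a different seed
# gives a different, but statistically similar, sample.
```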

Splitting Wide Tables

Wide tables or files (that is, tables or files with hundreds of columns) are not uncommon in enterprise data environments. Testing such tables, especially with value tests, can place a heavy load on an underpowered server. Unlike sampling, which limits the number of rows being tested, splitting wide tables reduces the number of columns being tested per job.

Imagine a table 500 columns wide, where value tests are required across all 500 columns. Instead of creating a single table pair object with 500 tests and slowing down the PowerCenter Integration Service, you can create five table pair objects, each with 100 tests, and run these table pair objects separately. The end result of this split is the same, that is, all 500 columns are tested, but in smaller chunks that put much less load on the server. Additionally, with the reduced number of tests in the table pair objects, inspecting results in the GUI or creating reports will be simpler, more manageable, and more efficient.

Server Configurations

Though there is no absolute answer to Data Validation Option performance questions, some data is always better than none to help understand what can be expected and what looks anomalous. The following tables provide a baseline set of performance metrics to give some context on expected performance. They provide a rough minimum level of expected performance for tests executed with Data Validation Option. All tests were performed on the following server configuration:

- Operating System: Linux el5 (Red Hat)
- CPU: AMD Opteron 6220
- Processor Speed: 3000 MHz
- Cores: 4
- Memory: 128 GB

All tests used data in Oracle tables for both table A and table B in the table pair object. All tests used default configurations and were performed entirely in PowerCenter. No database optimization, caching, sampling, or other performance enhancements were made. The intention is to provide sample baseline performance numbers on a small server. Larger or more powerful servers, and enabling Data Validation Option optimizations, would likely result in significantly improved performance.

Performance Metrics Calculations

All tests were conducted five times, and the elapsed time for each test run was recorded. The lowest and highest times were discarded, and the average of the middle three was computed and rounded to the nearest second. This is the number displayed in the tables below under Average Time. Rows/Sec is shown to provide a normalized number to compare test performance across different row counts and test types. The ideal situation is a consistent (that is, flat) Rows/Sec measure as the number of rows increases for a given test scenario, which shows linear scalability of the system.
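As a quick illustration of that methodology, the sketch below applies the discard-high-and-low averaging to five run times. The timings are made up for the example and are not taken from the benchmark.

```python
# Trimmed-mean methodology: run five times, discard the lowest and highest
# elapsed times, average the middle three, and round to the nearest second.
timings_sec = [15.2, 16.1, 15.8, 19.4, 16.3]  # hypothetical run times

middle_three = sorted(timings_sec)[1:-1]      # drop the min and the max
average_time = round(sum(middle_three) / len(middle_three))

rows = 1_000_000
print(f"Average Time: {average_time} sec")        # 16 sec
print(f"Rows/Sec:     {rows // average_time:,}")  # 62,500
```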

COUNT Tests

COUNT tests count all non-null values in a column and check whether the expected number of values is present. In the data below, five COUNT tests are performed on a table pair object with 50 columns of data. The performance is consistent, averaging about 57,000 rows per second across all tests.

The following table shows the results of five COUNT tests:

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec |
|--------|------------|---------|------------|--------------|----------|
| No | 1,000,000 | 50 | Five COUNT | 16 sec | 63,830 |
| No | 2,000,000 | 50 | Five COUNT | 36 sec | 56,604 |
| No | 5,000,000 | 50 | Five COUNT | 1 min 35 sec | 52,817 |
| No | 10,000,000 | 50 | Five COUNT | 3 min 1 sec | 55,351 |

COUNT_ROWS Tests

The COUNT_ROWS test counts all values, including nulls, in a column. The COUNT_ROWS test is faster than the COUNT test because the COUNT test checks each data value for null before incrementing the count. Also, because the COUNT_ROWS test counts all rows in a table, only one test is needed for a table pair object or single table object.

The table below shows results for one COUNT_ROWS test:

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec |
|--------|------------|---------|----------------|--------------|----------|
| No | 1,000,000 | 50 | One COUNT_ROWS | 7 sec | 136,364 |
| No | 2,000,000 | 50 | One COUNT_ROWS | 12 sec | 171,249 |
| No | 5,000,000 | 50 | One COUNT_ROWS | 23 sec | 220,588 |
| No | 10,000,000 | 50 | One COUNT_ROWS | 51 sec | 196,078 |

SUM Tests

SUM tests calculate the sum of a numeric column. They are a common high-level test performed on data to see whether sums match (for example, across a set of transactions) across data sets.

The following table shows results for five SUM tests:

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec |
|--------|------------|---------|------------|--------------|----------|
| No | 1,000,000 | 50 | Five SUM | 19 sec | 52,632 |
| No | 2,000,000 | 50 | Five SUM | 36 sec | 56,075 |
| No | 5,000,000 | 50 | Five SUM | 1 min 47 sec | 46,584 |
| No | 10,000,000 | 50 | Five SUM | 3 min 9 sec | 53,191 |

The Rows/Sec is consistent, which shows the linear scalability of SUM tests.
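The COUNT versus COUNT_ROWS distinction above is simply nulls-excluded versus nulls-included counting, which a few lines of code make concrete (the sample column data is invented):

```python
# COUNT counts non-null values; COUNT_ROWS counts every row, nulls included.
# The per-value null check is what makes COUNT the slower of the two.
column = ["A", None, "B", "C", None, "D"]  # hypothetical column data

count_rows = len(column)                                 # COUNT_ROWS -> 6
count = sum(1 for value in column if value is not None)  # COUNT      -> 4

print(f"COUNT_ROWS = {count_rows}, COUNT = {count}")
```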

SET Tests

Set tests look at distinct values within a pair of columns and identify whether there are missing or extra distinct values across those columns.

SET AinB

Use the SET AinB test to identify whether the values in a column in the source table (A) all exist in a column in the lookup table (B). Any value in A that does not exist in B is revealed as an error row.

The following table shows the results of five SET AinB tests:

| Joined | Rows | Columns | Test Type | Average Time | Rows/Sec |
|--------|------------|---------|---------------|--------------|----------|
| Yes | 1,000,000 | 50 | Five SET AinB | 32 sec | 31,250 |
| Yes | 2,000,000 | 50 | Five SET AinB | 1 min 14 sec | 27,027 |
| Yes | 5,000,000 | 50 | Five SET AinB | 3 min 57 sec | 21,097 |
| Yes | 10,000,000 | 50 | Five SET AinB | 7 min 24 sec | 22,523 |

SET ANotinB

Use the SET ANotinB test to ensure that no value in A is also in B. This test is useful for identifying duplicates when merging two data sets, or for validating masked data to ensure that all values were appropriately masked.

The following table shows the results of five SET ANotinB tests:

| Joined | Rows | Columns | Test Type | Average Time | Rows/Sec |
|--------|------------|---------|------------------|--------------|----------|
| Yes | 1,000,000 | 50 | Five SET ANotinB | 38 sec | 26,316 |
| Yes | 2,000,000 | 50 | Five SET ANotinB | 1 min 21 sec | 24,691 |
| Yes | 5,000,000 | 50 | Five SET ANotinB | 4 min 13 sec | 19,763 |
| Yes | 10,000,000 | 50 | Five SET ANotinB | 8 min 36 sec | 19,380 |

Comparing the two types of set tests shows that SET AinB (approximately 25,000 rows/sec) is slightly faster than SET ANotinB (approximately 22,000 rows/sec).
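In set terms, the two tests above are complementary membership checks. The following sketch shows only the semantics on toy data; it is not how Data Validation Option implements the joins internally.

```python
# SET AinB flags values of A that are missing from B;
# SET ANotinB flags values of A that also appear in B.
a = {"CA", "NY", "TX", "ZZ"}  # distinct values in column A
b = {"CA", "NY", "TX", "WA"}  # distinct values in column B

ainb_errors = a - b     # SET AinB error rows: values of A not found in B
anotinb_errors = a & b  # SET ANotinB error rows: values of A found in B

print(f"SET AinB errors:    {sorted(ainb_errors)}")     # ['ZZ']
print(f"SET ANotinB errors: {sorted(anotinb_errors)}")  # ['CA', 'NY', 'TX']
```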

OUTER_VALUE Tests

OUTER_VALUE tests perform a full outer join across table A and table B in the table pair object and display any orphans from either side. In general, OUTER_VALUE tests are performed across the key columns of the data sets, and only one OUTER_VALUE test is typically needed in a given table pair object.

The table below shows the performance of one OUTER_VALUE test:

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec |
|--------|------------|---------|-----------------|--------------|----------|
| No | 1,000,000 | 50 | One OUTER_VALUE | 10 sec | 103,448 |
| No | 2,000,000 | 50 | One OUTER_VALUE | 18 sec | 111,111 |
| No | 5,000,000 | 50 | One OUTER_VALUE | 38 sec | 131,579 |
| No | 10,000,000 | 50 | One OUTER_VALUE | 1 min 26 sec | 116,279 |

VALUE Tests

VALUE tests are the most common tests executed in Data Validation Option, as they provide the most detail about missing or erroneous values. VALUE tests are executed within PowerCenter and are individual evaluations of row and column data across the table pair object.

In the following tables, an additional column, Comparisons/Sec, shows the number of comparisons executed per second. Total comparisons is the total number of rows multiplied by the total number of VALUE tests; divide that by the number of seconds it takes to complete the test and you have Comparisons/Sec. This number can be used to compare performance across the two scenarios presented below, that is, five VALUE tests versus 50 VALUE tests.

5 VALUE Tests

The following table shows that the performance of five VALUE tests is very good, with an average of about 35,700 rows/second across all scenarios:

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec | Comparisons/Sec |
|--------|------------|---------|------------|--------------|----------|-----------------|
| No | 1,000,000 | 50 | 5 VALUE | 34 sec | 29,412 | 147,059 |
| No | 2,000,000 | 50 | 5 VALUE | 52 sec | 38,462 | 192,308 |
| No | 5,000,000 | 50 | 5 VALUE | 2 min 16 sec | 36,765 | 183,824 |
| No | 10,000,000 | 50 | 5 VALUE | 4 min 52 sec | 34,247 | 171,233 |
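The Comparisons/Sec arithmetic is a direct application of the definition above. As a worked check against the first row of the table, using the rounded Average Time:

```python
# Comparisons/Sec = (rows x number of VALUE tests) / elapsed seconds.
rows, value_tests, elapsed_sec = 1_000_000, 5, 34

total_comparisons = rows * value_tests  # 5,000,000 comparisons
print(f"Rows/Sec:        {rows / elapsed_sec:,.0f}")               # ~29,412
print(f"Comparisons/Sec: {total_comparisons / elapsed_sec:,.0f}")  # ~147,059
```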

50 VALUE Tests

This scenario contains 50 VALUE tests, one COUNT test, and one OUTER_VALUE test. It is a typical set of tests and can be generated automatically for a table pair object in Data Validation Option.

Here, the Rows/Sec drops by about 10x as compared to the five VALUE tests scenario. This is expected, given the fixed capacity of the server and the tenfold increase in the number of columns being tested. But Comparisons/Sec is, on average, slightly higher than in the five VALUE tests scenario.

| Joined | Rows | Columns | Test Types | Average Time | Rows/Sec | Comparisons/Sec |
|--------|------------|---------|----------------------------------|---------------|----------|-----------------|
| No | 1,000,000 | 50 | 50 VALUE, 1 COUNT, 1 OUTER_VALUE | 3 min 49 sec | 4,367 | 218,341 |
| No | 2,000,000 | 50 | 50 VALUE, 1 COUNT, 1 OUTER_VALUE | … min 14 sec | 2,… | … |
| No | 5,000,000 | 50 | 50 VALUE, 1 COUNT, 1 OUTER_VALUE | … min 20 sec | 2,… | … |
| No | 10,000,000 | 50 | 50 VALUE, 1 COUNT, 1 OUTER_VALUE | 53 min 29 sec | 3,117 | 155,850 |

Conclusion

All performance statistics depend on the conditions under which the performance was measured. This is as true for speed tests for cars as it is for computer software. Many factors affect the performance of Data Validation Option, including, but not limited to, the characteristics of the data, the types of tests, the amount of testing done on the data, the server hardware where the tests are run, and the Data Validation Option testing configuration set by the user. It is important to understand these factors when designing and implementing tests with Data Validation Option.

There are product features and testing tactics that can be used to improve performance, and different approaches suit different situations. For example, when the data is primarily in relational databases, some processing (WHERE clauses, COUNT and aggregate tests) can be performed in the database itself. When very large (hundreds of millions of rows) data sets need to be tested, statistical sampling can be used.

The baseline performance statistics provide performance metrics for a variety of test types with increasing amounts of data. Run on a modest 4-core Linux server without any specific optimizations, these numbers serve as a baseline reference for what can be achieved. More powerful servers or specific performance optimizations will yield even better results, but as the numbers show, Data Validation Option tests perform very well out of the box and scale linearly as data volumes grow. This makes for a predictable and efficient framework for large-scale validation testing, whether in development, QA, or production environments.

Author

Saeed Khan
Principal Product Manager, PowerCenter
