ETL Benchmarks V 1.1


V 1.1

Comparing:
o DATASTAGE SERVER 7.5
o DATASTAGE PX 7.5
o TALEND OPEN STUDIO
o INFORMATICA
o PENTAHO DATA INTEGRATOR

info@manapps.tm.fr

This document is published under the Creative Commons license.

You are free:
o to Share: to copy, distribute, display, and perform the work
o to Remix: to make derivative works

Under the following conditions:
o Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
o For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.
o Any of the above conditions can be waived if you get permission from the copyright holder.
o Apart from the remix rights granted under this license, nothing in this license impairs or restricts the author's moral rights.

Table of Contents

General comments
Hardware Configuration
Test 1: File Input Delimited > File Output Delimited
Test 2: File Input Delimited > Table MySQL Output
Test 3: Table Oracle Input > File Output Delimited
Test 4: File Input Delimited > Table Output Oracle BULK
Test 5: File Input Delimited > Transform > File Output Delimited
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
Test 8: File Input Delimited > Sort > File Output Delimited
Test 9: File Input Delimited > Aggregate > File Output Delimited
Test 10: File Input Delimited > Lookup > File Output Delimited
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects

Each test has two subsections: Scenario and Test results.

General comments

This document constitutes Version 1.1 of the ETL Benchmark. Version 1.0 showed inaccurate test results for the PowerCenter solution from Informatica, because our tests were carried out with inadequate settings for this product. An expert from Informatica suggested adapted settings, and the same tests were run again in the same environment, in order to preserve the benchmarking basis between all compared ETL tools. Using these settings greatly improves the results obtained by Informatica PowerCenter on the same ETL benchmark tests, as detailed in this corrected version of our benchmark.

Version 1.1 of the benchmark thus includes the updated results and the comparison between all tested tools, and Annex 1 details the changes in the use of the Informatica software.

We are open to comments from all tested vendors, and from other publishers as well, and we are ready to give access to our testing conditions so that they can verify the results obtained by their products and suggest applicable best practices.

For the tests with DataStage PX, we used 2 nodes to take advantage of the dual cores and of the parallelization feature of the tool.

Results:
It is difficult to give overall results for this kind of benchmark, and we think that each test is different. Nevertheless, some people asked us for a global synthesis of these tests.

Global performance:
As requested by some readers after the release of version 1.0 of this ETL Benchmark, we have assigned, for each test, a specific number of points to the tested solutions (5 points to the best, 4 to the second, and so on down to 1 for the fifth). Under this scheme, the results are as follows:
o First: Informatica (353 points)
o Second: Talend Open Studio (333 points)
o Third: IBM DataStage PX 7.5 (239 points)
o Fourth: IBM DataStage Server 7.5 (199 points)
o Fifth: Pentaho Data Integration (148 points)

Below are the detailed results:
[Table: points per test and total for TOS, PDI, IBM DS 7.5, IBM DS PX 7.5 and INFA PWC]

In terms of intuitiveness and ease of use, Talend Open Studio and DataStage Server are ahead of the pack. DataStage PX comes third, Informatica fourth, and the least intuitive is Pentaho Data Integrator. Our assessment of Pentaho is mostly linked to the many parameters that need to be learnt. However, we think that with a large investment of time it could become a powerful tool.

Open Source ETL & Parallelization:
Pentaho Data Integrator claims the first position here: it is easier to parallelize with PDI. We did, however, find some issues: the tool lets you parallelize all the components, but some results are inconsistent.

Hardware Configuration

OS: Windows XP Pro SP2
CPU: Intel Core2 Duo 2 GHz
JVM 1.6.0_87
RAM: 4 GB

Test 1: File Input Delimited > File Output Delimited

Scenario:
Reading X lines from a delimited input file and writing them to a delimited output file.

File input delimited extract:
[Screenshot: extract of the delimited input file]

TALEND OPEN STUDIO
Job name: file_input_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited]

PENTAHO DATA INTEGRATION
Job name: file_input_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited]

DATASTAGE SERVER
Job name: file_input_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited]

DATASTAGE PX
Job name: PX_file_input_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited]

INFORMATICA
Job name: file_input_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited]

Test results: Test 1: File Input Delimited > File Output Delimited

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              1,00        7,80       39,10      162,09
PDI 3.0.0              2,00       15,50       83,80      417,80
IBM DS 7.5             2,00        4,00       12,50       66,00
IBM DS PX 7.5          3,40       12,00       40,00      150,00
INFA PWC               ?,00        7,00       18,00       74,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000               ?               ?                  ?             ?
1 000 000          1,99            0,51               1,54           0,9
5 000 000          2,14            0,32               1,02           0,?
20 000 000         2,58            0,41               0,93          0,47

Test 2: File Input Delimited > Table MySQL Output

Scenario:
Reading X lines from a delimited input file and writing them into a MySQL output table.

Comments:
DataStage 7.5, DataStage PX 7.5 and Informatica were not tested for this use case.
The test was first run with default parameters. To optimize performance, the commit parameter was then tuned. Finally, the job was parallelized.
To parallelize with TOS 2.4.1, we just split the delimited input file (using the header and limit parameters) and run two sub-jobs in parallel. With PDI 3.0.0, we just increase the number of copies.
TOS allows the use of the extended insert, a MySQL feature that limits the number of database accesses and improves performance. With this feature, TOS is 6 times faster.
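To make the extended insert mentioned in the comments more concrete, here is a minimal sketch of the idea: several rows are grouped into one multi-row `INSERT ... VALUES (...), (...)` statement, so the database is accessed once per batch rather than once per row. This is an illustration only, not the code Talend Open Studio generates. The table name `clients`, the column names, the helper names `batched` and `build_extended_insert`, and the `%s` placeholder style are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def batched(rows: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield consecutive batches of at most `size` rows."""
    batch: List[Sequence] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_extended_insert(table: str, columns: Sequence[str],
                          batch: Sequence[Sequence]) -> Tuple[str, list]:
    """Build one multi-row INSERT statement and its flat parameter list."""
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([one_row] * len(batch)))
    params = [value for row in batch for value in row]
    return sql, params


# Hypothetical sample data: three rows sent in batches of two.
rows = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
statements = [build_extended_insert("clients", ["id", "firstname"], b)
              for b in batched(rows, 2)]
```

With a batch size of two, the three rows above need two statements instead of three. Larger batches reduce the number of round trips further, at the cost of longer statements.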

TALEND OPEN STUDIO
Job name: file_input_delimited table_output_mysql
[Screenshots: job (Multi Thread Execution checked in Job Settings); schema of file_input_delimited]

PENTAHO DATA INTEGRATION
Job name: file_input_delimited table_output_mysql
[Screenshots: job; schema of file_input_delimited]

Test results: Test 2: File Input Delimited > Table MySQL Output

Statistics:
Lines                               100 000   1 000 000   5 000 000
TOS 2.4.1                             15,26      144,50      731,78
PDI 3.0.0                             14,90      151,80      843,90
TOS 2.4.1 with Extended Insert         2,60       25,00      129,00

Ratio compared with TOS:
Number of lines     PDI   TOS Extended Insert
100 000            0,98                   0,?
1 000 000          1,05                   0,?
5 000 000          1,15                  0,18

Test 3: Table Oracle Input > File Output Delimited

Scenario:
Reading X lines from an Oracle input table and writing them to a delimited output file.

TALEND OPEN STUDIO
Job name: table_input_oracle file_output_delimited
[Screenshots: job; schema of table_input_oracle]

PENTAHO DATA INTEGRATION
Job name: table_input_oracle file_output_delimited
[Screenshot: job]
Schema of table_input_oracle: SCHEMA VIEWER NOT POSSIBLE

DATASTAGE SERVER
Job name: table_input_oracle file_output_delimited
[Screenshots: job; schema of table_input_oracle]

DATASTAGE PX
Job name: PX_table_input_oracle file_output_delimited
[Screenshots: job; schema of table_input_oracle]

INFORMATICA
Job name: table_input_oracle file_output_delimited
[Screenshots: job; schema of table_input_oracle]

Test results: Test 3: Table Oracle Input > File Output Delimited

Statistics:
Lines               100 000     500 000   1 000 000
TOS 2.4.1              2,25        6,26       14,25
PDI 3.0.0              4,78       21,20       37,40
IBM DS 7.5             4,00       11,00       19,00
IBM DS PX 7.5          4,00        8,00       15,00
INFA PWC                  ?           ?           ?

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            2,12            1,78                1,?             ?
500 000            3,39            1,76               1,28           0,?
1 000 000          2,62            1,33               1,05          0,63
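The "ratio compared with TOS" tables in this benchmark divide each tool's timing by the Talend Open Studio timing for the same number of lines, so a ratio below 1 means the tool was faster than TOS. The sketch below recomputes the ratios from the Test 3 timings; the function name `ratios_vs_baseline` and the two-decimal rounding are assumptions, and the Informatica row is left out because its timings are not available here.

```python
# Test 3 timings for 100 000, 500 000 and 1 000 000 lines.
timings = {
    "TOS": [2.25, 6.26, 14.25],
    "PDI": [4.78, 21.20, 37.40],
    "IBM DS 7.5": [4.00, 11.00, 19.00],
    "IBM DS PX 7.5": [4.00, 8.00, 15.00],
}


def ratios_vs_baseline(timings, baseline="TOS"):
    """Divide every tool's timings by the baseline tool's timings."""
    base = timings[baseline]
    return {
        tool: [round(value / ref, 2) for value, ref in zip(values, base)]
        for tool, values in timings.items()
        if tool != baseline
    }


ratios = ratios_vs_baseline(timings)
```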

Test 4: File Input Delimited > Table Output Oracle BULK

Scenario:
Reading X lines from a delimited input file and writing them into an Oracle output table in BULK mode.

TALEND OPEN STUDIO
Job name: file_input_delimited table_output_oracle_bulk
[Screenshot: job]

PENTAHO DATA INTEGRATION
Job name: file_input_delimited table_output_oracle_bulk
[Screenshots: job; schema of file_input_delimited]

DATASTAGE SERVER
Job name: file_input_delimited table_output_oracle_bulk
[Screenshots: job; schema of file_input_delimited]

DATASTAGE PX
Job name: PX_file_input_delimited table_output_oracle_bulk
[Screenshots: job; schema of file_input_delimited]

INFORMATICA
Job name: file_input_delimited table_output_oracle_bulk
[Screenshots: job; schema of file_input_delimited]

Test results: Test 4: File Input Delimited > Table Output Oracle BULK

Statistics:
Lines               100 000   1 000 000   2 000 000
TOS 2.4.1              4,36       22,12       49,66
PDI 3.0.0              2,60       30,60       72,70
IBM DS 7.5             3,00       18,00       40,00
IBM DS PX 7.5          6,00       27,00       55,00
INFA PWC                  ?           ?           ?

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000             0,6            0,69               1,38           0,?
1 000 000          1,38            0,81               1,22           0,?
2 000 000          1,46             0,8               1,11          0,22

Test 5: File Input Delimited > Transform > File Output Delimited

Scenario:
Reading X lines from a delimited input file and writing them to a delimited output file after some changes.

Changes list:
o The content of the field `rate` is multiplied by 100.
o The new field `name` is a concatenation (`firstname`+ +`lastname`).
o The content of the field `address` is converted to uppercase.

Comments:
Pentaho Data Integration has no graphical component to transform data, so we have to use a custom code component. The language used is JavaScript. The four other ETL tools have a transformer component to do this. Talend Open Studio also has custom code components, named tjavarow and tperlrow.
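As a rough illustration of the three changes listed for this test, the sketch below applies them row by row to a delimited stream. It is not the code of any of the benchmarked tools. The `;` delimiter, the presence of a header line, the single space used as separator in `name`, and the function names `transform_row` and `transform_stream` are all assumptions.

```python
import csv
import io


def transform_row(row: dict) -> dict:
    """Apply the three Test 5 changes to one input row."""
    out = dict(row)
    out["rate"] = float(row["rate"]) * 100
    # The separator between firstname and lastname is assumed to be a space.
    out["name"] = row["firstname"] + " " + row["lastname"]
    out["address"] = row["address"].upper()
    return out


def transform_stream(src, dst, delimiter=";"):
    """Read delimited rows from src, transform them, and write them to dst."""
    reader = csv.DictReader(src, delimiter=delimiter)
    fieldnames = list(reader.fieldnames) + ["name"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(transform_row(row))


# Hypothetical one-row input, held in memory instead of a real file.
source = io.StringIO("firstname;lastname;address;rate\n"
                     "Ada;Lovelace;12 main st;0.25\n")
target = io.StringIO()
transform_stream(source, target)
result = target.getvalue()
```

Because rows are processed one at a time, memory use stays constant regardless of the number of lines, which is the behaviour one would expect from a streaming transformer component.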

TALEND OPEN STUDIO
Job name: file_input_delimited transformation file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited; tmap]

PENTAHO DATA INTEGRATION
Job name: file_input_delimited transformation file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited; JavaScript Custom Code; Select Values]

DATASTAGE SERVER
Job name: file_input_delimited transformation file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited; Transformer]

DATASTAGE PX
Job name: PX_file_input_delimited transformation file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited; Transformer]

INFORMATICA
Job name: file_input_delimited transformation file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited; Mapping]

Test results: Test 5: File Input Delimited > Transform > File Output Delimited

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              1,30        8,50       43,10      183,13
PDI 3.0.0              5,30       51,00      259,40     1126,10
IBM DS 7.5             2,00       10,00       56,00      178,00
IBM DS PX 7.5          4,75       11,33       41,00      155,00
INFA PWC               ?,00        6,00       17,00       74,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            4,07            1,54               3,65           2,?
1 000 000             ?            1,18               1,33           0,?
5 000 000          6,02             1,3               0,95           0,?
20 000 000         6,16            0,97               0,84           0,4

Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)

Scenario:
Reading X lines from Oracle input tables and writing them into other Oracle output tables (ELT mode).

Comments:
Only Talend Open Studio offers an ELT mode. Informatica has Pushdown Optimization, but I did not find this feature in the tool.
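The ELT mode discussed here means that the database itself reads, aggregates and writes the rows, instead of the ETL tool pulling every row out and pushing the result back. A minimal sketch of the idea follows. It uses an in-memory SQLite database purely as a stand-in for Oracle so that it can run anywhere, and the table and column names (`table_input`, `table_output`, `nb`) are hypothetical. The `age` grouping with a count follows the job name `aggregate_group_by_age_count` used in this test.

```python
import sqlite3

# SQLite in memory stands in for the Oracle database of the benchmark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_input (id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE table_output (age INTEGER, nb INTEGER)")
conn.executemany("INSERT INTO table_input VALUES (?, ?)",
                 [(1, 30), (2, 30), (3, 45)])

# ELT: one statement makes the database read, aggregate and write.
# No row travels through the client program.
conn.execute(
    "INSERT INTO table_output (age, nb) "
    "SELECT age, COUNT(*) FROM table_input GROUP BY age"
)

result = conn.execute(
    "SELECT age, nb FROM table_output ORDER BY age").fetchall()
conn.close()
```

This would be consistent with the nearly flat TOS timings reported for this test: in such a design the cost of the job depends on the database engine rather than on the volume of rows moved by the tool.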

TALEND OPEN STUDIO
Job name: ELT table_input_oracle aggregate_group_by_age_count table_output_oracle
[Screenshots: job (ELT); schema of table_input_oracle]

PENTAHO DATA INTEGRATION
Job name: table_input_oracle aggregate_group_by_age_count table_output_oracle
[Screenshot: job]
Schema of table_input_oracle: SCHEMA VIEWER NOT POSSIBLE

DATASTAGE SERVER
Job name: table_input_oracle aggregate_group_by_age_count table_output_oracle
[Screenshots: job; schema of table_input_oracle]

DATASTAGE PX
Job name: PX_table_input_oracle aggregate_group_by_age_count table_output_oracle
[Screenshots: job; schema of table_input_oracle]

INFORMATICA
Job name: table_input_oracle aggregate_group_by_age_count table_output_oracle
[Screenshots: job; schema of table_input_oracle]

Test results: Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)

Statistics:
Lines               100 000     500 000   1 000 000
TOS 2.4.1              1,24         1,4        1,69
PDI 3.0.0              4,26       22,26       47,80
IBM DS 7.5             2,40        8,00       13,67
IBM DS PX 7.5          8,00       12,00       17,50
INFA PWC                  ?           ?           ?

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            3,44            1,94               6,45           3,?
500 000            15,9            5,71               8,57           2,?
1 000 000         28,28            8,09              10,36          2,36

Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)

Scenario:
Reading X lines from Oracle input tables and writing them into other Oracle output tables (ELT mode) after some changes.

TALEND OPEN STUDIO
Job name: table_input_oracle elt table_output_oracle
[Screenshots: job (ELT); schema of table_lookup_oracle; schema of table_input_oracle]

PENTAHO DATA INTEGRATION
Job name: table_input_oracle elt table_output_oracle
[Screenshot: job]
Schema of table_lookup_oracle: SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle: SCHEMA VIEWER NOT POSSIBLE

DATASTAGE SERVER
Job name: table_input_oracle elt table_output_oracle
[Screenshots: job; schema of table_lookup_oracle; schema of table_input_oracle]

DATASTAGE PX
Job name: PX_table_input_oracle elt table_output_oracle
[Screenshots: job; schema of table_lookup_oracle; schema of table_input_oracle]

INFORMATICA
Job name: table_input_oracle elt table_output_oracle
[Screenshots: job; schema of table_lookup_oracle; schema of table_input_oracle]

Test results: Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)

Statistics:
Lines               100 000     500 000   1 000 000
TOS 2.4.1              5,99       23,26       52,72
PDI 3.0.0              ?,35      201,60      382,60
IBM DS 7.5             ?,70       65,00      116,00
IBM DS PX 7.5          ?,00       30,50       47,50
INFA PWC                  ?           ?           ?

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000             ?,4            2,12                2,5           0,?
500 000            8,67            2,79               1,31           0,?
1 000 000          7,26             2,2                0,9          0,27

Test 8: File Input Delimited > Sort > File Output Delimited

Scenario:
Reading X lines from a delimited input file and writing them, sorted, to a delimited output file.

Sorts list:
o Order by the integer field `age` ASC.
o Order by the string field `firstname` ASC.
o Order by the fields `age` and `firstname` ASC.

Comments:
With the version used, I could not sort in memory with Pentaho Data Integrator, but the feature is present in the latest version. On Talend Open Studio, with a large volume (... and ...), we have to use the component texternalsort, which uses GNU sort, an external sorting program.
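When the input no longer fits in memory, sorting has to go through the disk, which is what an external sort does. The sketch below shows the general technique (sorted runs written to temporary files, then a k-way merge). It is a generic external merge sort written for illustration, not the GNU sort program or the Talend component. The function name `external_sort`, the `firstname;age` field layout and the tiny chunk size are assumptions.

```python
import heapq
import os
import tempfile


def external_sort(lines, key, chunk_size):
    """Sort text lines using sorted runs on disk and a k-way merge."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to its own temporary file.
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    handles = [open(path) for path in run_paths]
    try:
        runs = [(line.rstrip("\n") for line in handle) for handle in handles]
        # heapq.merge reads the runs lazily and keeps one line per run in memory.
        # A real job would stream the merged lines to the output file
        # instead of collecting them in a list.
        return list(heapq.merge(*runs, key=key))
    finally:
        for handle in handles:
            handle.close()
        for path in run_paths:
            os.remove(path)


# Hypothetical rows in "firstname;age" layout, sorted on age then firstname.
rows = ["Bob;30", "Ann;25", "Cy;30", "Al;25", "Zed;18"]
sort_key = lambda line: (int(line.split(";")[1]), line.split(";")[0])
result = external_sort(rows, key=sort_key, chunk_size=2)
```

The integer conversion in the key matters: sorting `age` as a string would place "100" before "18", which is presumably why the scenario distinguishes the integer field from the string field.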

TALEND OPEN STUDIO
Job names:
file_input_delimited sort_on_age file_output_delimited
file_input_delimited sort_on_firstname file_output_delimited
file_input_delimited sort_on_firstname_and_age file_output_delimited
[Screenshots: job; schema of file_input_delimited]

PENTAHO DATA INTEGRATION
Job names:
file_input_delimited sort_on_age file_output_delimited
file_input_delimited sort_on_firstname file_output_delimited
file_input_delimited sort_on_firstname_and_age file_output_delimited
[Screenshots: job; schema of file_input_delimited]

DATASTAGE SERVER
Job names:
file_input_delimited sort_on_age file_output_delimited
file_input_delimited sort_on_firstname file_output_delimited
file_input_delimited sort_on_firstname_and_age file_output_delimited
[Screenshots: job; schema of file_input_delimited]

DATASTAGE PX
Job names:
PX_file_input_delimited sort_on_age file_output_delimited
PX_file_input_delimited sort_on_firstname file_output_delimited
PX_file_input_delimited sort_on_firstname_and_age file_output_delimited
[Screenshots: job; schema of file_input_delimited]

INFORMATICA
Job names:
file_input_delimited sort_on_age file_output_delimited
file_input_delimited sort_on_firstname file_output_delimited
file_input_delimited sort_on_firstname_and_age file_output_delimited
[Screenshots: job; schema of file_input_delimited]

Test results: Test 8: File Input Delimited > Sort > File Output Delimited

Sorted by age

Statistics:
Lines                     ?           ?           ?           ?
TOS 2.4.1              ?,44       15,73       188,?        ?,03
PDI 3.0.0              ?,63       32,85      155,95      668,20
IBM DS 7.5             4,20       60,70      267,70
IBM DS PX 7.5          4,00       16,25       64,50      492,67
INFA PWC               ?,00       13,00       50,00      201,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
?                  ?,51            2,92               2,78           3,?
?                  2,09            3,86               1,03           0,?
?                  0,83            1,42               0,34          0,26
?                     ?                               ?,48           0,2

Sorted by firstname

Statistics:
Lines                     ?           ?           ?           ?
TOS 2.4.1              ?,69       18,05       168,?        ?,20
PDI 3.0.0              ?,40       31,20      157,15      739,20
IBM DS 7.5             6,00       58,00      426,00
IBM DS PX 7.5          4,00       16,00       57,00      624,00
INFA PWC               ?,00       13,00       51,00      223,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
?                  ?,01            3,55               2,37           2,?
?                  1,73            3,21               0,89          0,72
?                  0,93            2,53               0,34           0,?
?                     ?                               ?,58          0,21

Sorted by age and firstname

Statistics:
Lines                     ?           ?           ?           ?
TOS 2.4.1              ?,33       17,40       225,?        ?,00
PDI 3.0.0              ?,22       29,27      159,10      842,20
IBM DS 7.5             7,33       60,00      360,00
IBM DS PX 7.5          4,50       16,33       59,00      582,50
INFA PWC               ?,00       13,00       49,00      211,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
?                  ?,42            5,51               3,38           3,?
?                  1,68            3,45               0,94           0,?
?                  0,71             1,6               0,26          0,22
?                     ?                               ?,58          0,21

Test 9: File Input Delimited > Aggregate > File Output Delimited

Scenario:
Reading X lines from a delimited input file, performing an aggregation and writing the result of the operations to a delimited output file.
1. Group by the field `age`; Operation: COUNT.
2. Group by the field `age`; Operations: COUNT, SUM(rate), AVG(rate), MIN(rate), MAX(rate).
3. Group by the field `firstname`; Operation: COUNT.

Comments:
When the output flow is too big (here, the aggregation by firstname on a large volume), we have to use the tsortedaggregaterow component on Talend Open Studio. This component sorts rows before the aggregation. In this case, Pentaho Data Integrator failed.
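The advantage of sorting before aggregating is that, once rows arrive ordered by the grouping field, each group is complete as soon as the key changes, so only one group has to be held at a time instead of one accumulator per distinct key. The sketch below illustrates this for the operations of scenario 2. It is an illustration of the general technique rather than the Talend component, and the function name `aggregate_sorted` and the sample rows are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows, key_field, value_field):
    """Aggregate rows that are already sorted on key_field, group by group."""
    for key, group in groupby(rows, key=itemgetter(key_field)):
        # Only the values of the current group are held in memory.
        values = [row[value_field] for row in group]
        yield {
            key_field: key,
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }


# Hypothetical sample rows.
rows = [
    {"age": 30, "rate": 2.0},
    {"age": 25, "rate": 1.0},
    {"age": 30, "rate": 4.0},
]
rows.sort(key=itemgetter("age"))  # the sort step that precedes the aggregation
result = list(aggregate_sorted(rows, "age", "rate"))
```

For very large inputs the in-memory `rows.sort` above would itself be replaced by an external sort; the aggregation step stays the same.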

TALEND OPEN STUDIO
Job names:
file_input_delimited aggregate_group_by_age_count file_output_delimited
file_input_delimited aggregate_group_by_age_count_sum_avg_min_max file_output_delimited
file_input_delimited aggregate_group_by_firstname_count file_output_delimited
[Screenshots: job; job using the texternalsortrow component; schema of file_input_delimited; schema of file_output_delimited for file_input_delimited aggregate_group_by_age_count file_output_delimited]

PENTAHO DATA INTEGRATION
Job names:
file_input_delimited aggregate_group_by_age_count file_output_delimited
file_input_delimited aggregate_group_by_age_count_sum_avg_min_max file_output_delimited
file_input_delimited aggregate_group_by_firstname_count file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited for file_input_delimited aggregate_group_by_age_count file_output_delimited]

DATASTAGE SERVER
Job names:
file_input_delimited aggregate_group_by_age_count file_output_delimited
file_input_delimited aggregate_group_by_age_count_sum_avg_min_max file_output_delimited
file_input_delimited aggregate_group_by_firstname_count file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited for file_input_delimited aggregate_group_by_age_count file_output_delimited]

DATASTAGE PX
Job names:
PX_file_input_delimited aggregate_group_by_age_count file_output_delimited
PX_file_input_delimited aggregate_group_by_age_count_sum_avg_min_max file_output_delimited
PX_file_input_delimited aggregate_group_by_firstname_count file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited for file_input_delimited aggregate_group_by_age_count file_output_delimited]

INFORMATICA
Job names:
file_input_delimited aggregate_group_by_age_count file_output_delimited
file_input_delimited aggregate_group_by_age_count_sum_avg_min_max file_output_delimited
file_input_delimited aggregate_group_by_firstname_count file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_output_delimited for file_input_delimited aggregate_group_by_age_count file_output_delimited]

Test results: Test 9: File Input Delimited > Aggregate > File Output Delimited

Group by Age (Count)

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              ?,62        6,99       30,05      124,16
PDI 3.0.0              ?,70       26,53      134,30      466,50
IBM DS 7.5             2,00        6,00       21,00      128,00
IBM DS PX 7.5          4,00        6,50       21,33       78,00
INFA PWC               ?,00        5,00        8,00       27,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            ?,35            3,23               6,45           4,?
1 000 000           3,8            0,86               0,93          0,72
5 000 000          4,47             0,7               0,71          0,27
20 000 000         3,76            1,03               0,63          0,22

Group by Age (Count, Sum(Rate), Avg(Rate), Min(Rate), Max(Rate))

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              ?,84        7,44       37,61      139,12
PDI 3.0.0              ?,60       25,20      138,30      426,00
IBM DS 7.5             2,00       11,00       50,00      184,00
IBM DS PX 7.5          ?,25       15,33       33,50      254,33
INFA PWC               ?,00        6,00       12,00       38,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000             ?,1            2,38              13,39           2,?
1 000 000          3,39            1,48               2,06           0,?
5 000 000          3,68            1,33               0,89          0,31
20 000 000         3,06            1,32               1,91          0,27

Group by FirstName (Count)

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              ?,86        7,89      198,79      928,08
PDI 3.0.0              ?,70       29,70      162,30      544,00
IBM DS 7.5             2,00       14,00       68,00      424,00
IBM DS PX 7.5          4,50       11,00       40,00      505,00
INFA PWC                  ?           ?           ?           ?

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            ?,14            2,33               5,23           4,?
1 000 000          3,76            1,77               1,39           1,?
5 000 000          0,82            0,34                0,?             ?
20 000 000         0,59            0,46               0,54         0,092

Test 10: File Input Delimited > Lookup > File Output Delimited

Scenario:
Reading X lines from a delimited input file and looking up 4 fields in another delimited input file, using the id_client column. Writing the join result to a delimited output file.
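A lookup of this kind is commonly implemented by loading the lookup file into an in-memory index keyed on the join column, then streaming the main file past it. The sketch below shows that approach under stated assumptions: the four looked-up field names are hypothetical (the benchmark does not list them), unmatched rows are simply dropped, and the function names `build_index` and `lookup_join` are made up for the example.

```python
# Hypothetical names for the 4 looked-up fields.
LOOKUP_FIELDS = ("firstname", "lastname", "city", "country")


def build_index(lookup_rows, key):
    """Load the lookup rows into a dict keyed on the join column."""
    return {row[key]: row for row in lookup_rows}


def lookup_join(main_rows, index, key, fields):
    """Stream the main rows and enrich each one from the index."""
    for row in main_rows:
        match = index.get(row[key])
        if match is not None:  # assumption: unmatched rows are dropped
            joined = dict(row)
            joined.update({field: match[field] for field in fields})
            yield joined


lookup = [
    {"id_client": 1, "firstname": "Ann", "lastname": "Lee",
     "city": "Paris", "country": "FR"},
    {"id_client": 2, "firstname": "Bob", "lastname": "Ray",
     "city": "Lyon", "country": "FR"},
]
main = [
    {"id_client": 2, "amount": 20},
    {"id_client": 9, "amount": 5},
    {"id_client": 1, "amount": 10},
]
index = build_index(lookup, "id_client")
result = list(lookup_join(main, index, "id_client", LOOKUP_FIELDS))
```

The whole lookup file sits in memory in this design, so the size of the lookup file, rather than the number of main rows, determines the memory footprint.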

TALEND OPEN STUDIO
Job name: file_input_delimited file_lookup_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; tmap component]

PENTAHO DATA INTEGRATION
Job name: file_input_delimited file_lookup_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Mapping component]

DATASTAGE SERVER
Job name: file_input_delimited file_lookup_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]

DATASTAGE PX
Job name: PX_file_input_delimited file_lookup_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]

INFORMATICA
Job name: file_input_delimited file_lookup_delimited file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]

Test results: Test 10: File Input Delimited > Lookup > File Output Delimited

Lookup 100 000 rows ~7MB

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              ?,45        6,39       28,72      108,37
PDI 3.0.0              ?,14       21,40       87,60      288,90
IBM DS 7.5             5,00       10,60       33,00      139,00
IBM DS PX 7.5          5,00       12,20       40,00      122,00
INFA PWC               ?,00       11,00       32,00      116,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            ?,86            3,45               3,45           3,?
1 000 000          3,35            1,66               1,91           1,?
5 000 000          3,05            1,15               1,39          1,11
20 000 000         2,67            1,28               1,13          1,07

Lookup 500 000 rows ~34MB

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               ?,9        8,89       32,36      115,67
PDI 3.0.0              ?,90       24,50       97,40      291,10
IBM DS 7.5             ?,00       33,00       56,00      195,00
IBM DS PX 7.5          7,00       13,00       40,00      122,00
INFA PWC               ?,00       11,00       33,00      122,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            ?,03            7,18               1,79           1,?
1 000 000          2,76            3,71               1,46           1,?
5 000 000          3,01            1,73               1,24           1,?
20 000 000         2,52            1,69               1,05          1,05

Lookup 1 000 000 rows ~68MB

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              9,86       14,26        38,6      121,44
PDI 3.0.0              ?,50       32,20      116,60      487,25
IBM DS 7.5             ?,30       80,00      102,00      203,00
IBM DS PX 7.5          9,25       15,00       40,00      123,00
INFA PWC               ?,00       12,00       35,00      142,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000            ?,47            6,93               0,94           0,?
1 000 000          2,26            5,61               1,05           0,?
5 000 000          3,02            2,64               1,04           0,?
20 000 000         4,01            1,67               1,01          1,16

Lookup 5 000 000 rows ~365MB

Statistics:
Lines               100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              ?,51        69,1      199,26       557,1
PDI 3.0.0
IBM DS 7.5             ?,00      407,00      496,00      973,00
IBM DS PX 7.5          ?,00       30,00       55,00      134,00
INFA PWC               ?,00       14,00       42,00      141,00

Ratio compared with TOS:
Number of lines     PDI   DataStage 7.5   DataStage PX 7.5   Informatica
100 000          Failed            6,53               0,42           0,?
1 000 000        Failed            5,89               0,43           0,?
5 000 000        Failed            2,49               0,28           0,?
20 000 000       Failed            1,75               0,24          0,25

Test 11: File Input Delimited > Lookup > File Output Delimited && rejects

Scenario:
Reading X lines from a delimited input file and looking up 4 fields in another delimited input file, using the id_client column. Writing the join result to a delimited output file and the rejected rows to other delimited output files.
1. Filter rejects: `age` content < 18
2. Filter rejects: `age` content < 18 and inner join rejects

Comments:
Talend Open Studio and DataStage Server are the most ergonomic tools for managing expression filter rejects and inner join rejects (with the Transformer component, tmap on Talend Open Studio). For DataStage PX, Pentaho Data Integrator and Informatica, we have to use filter components.
Talend Open Studio, Informatica and DataStage Server are the most ergonomic tools for managing expression filter rejects and inner join rejects. For DataStage PX and Pentaho Data Integrator, we have to use filter components.
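The routing described in the second scenario can be pictured as a three-way split of each input row: rows under 18 go to the filter rejects, rows with no match in the lookup file go to the inner join rejects, and the remaining rows are joined and written to the main output. The sketch below is one possible reading of that flow, not the logic of any specific tool. Applying the age filter before the join check is an assumption, as are the function name `route_rows` and the sample data.

```python
def route_rows(main_rows, index, key="id_client", min_age=18):
    """Split rows into joined output, filter rejects and inner join rejects."""
    accepted, filter_rejects, join_rejects = [], [], []
    for row in main_rows:
        if row["age"] < min_age:
            # Expression filter reject: age below the threshold.
            filter_rejects.append(row)
        elif row[key] not in index:
            # Inner join reject: no matching row in the lookup file.
            join_rejects.append(row)
        else:
            accepted.append({**row, **index[row[key]]})
    return accepted, filter_rejects, join_rejects


# Hypothetical lookup index and input rows.
index = {1: {"id_client": 1, "city": "Paris"}}
main = [
    {"id_client": 1, "age": 40},
    {"id_client": 1, "age": 12},
    {"id_client": 7, "age": 33},
]
accepted, filter_rejects, join_rejects = route_rows(main, index)
```

In a file-based job each of the three lists would be a separate delimited output file written as the rows stream through, rather than a list held in memory.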

TALEND OPEN STUDIO
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited (age >= 18); tmap component]
Schema of file_output_delimited (age < 18) = schema of file_output_delimited

PENTAHO DATA INTEGRATION
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Mapping component]
Schema of file_output_delimited (age < 18) = schema of file_output_delimited

DATASTAGE SERVER
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]
Schema of file_output_delimited (age < 18) = schema of file_output_delimited

DATASTAGE PX
Job name: PX_file_input_delimited file_lookup_delimited file_output_delimited rejects_file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]
Schema of file_output_delimited (age < 18) = schema of file_output_delimited

INFORMATICA
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_file_output_delimited
[Screenshots: job; schema of file_input_delimited; schema of file_lookup_delimited; schema of file_output_delimited; Transformer component]
Schema of file_output_delimited (age < 18) = schema of file_output_delimited

Pg 109 Tests result: Test 11: File Input Delimited > Lookup > File Output Delimited && rejects Lookup 100 000 rows ~7MB + Filter 18 years Statistics: Lookup 100 000 rows ~7MB Lines 100 000 1 000 000

109 Pg 109 Tests result: Test 11: File Input Delimited > Lookup > File Output Delimited && rejects Lookup rows ~7MB + Filter 18 years Statistics: Lookup rows ~7MB Lines TOS ,51 6,74 29,55 101,65 PDI ,30 17,10 78,40 305,00 IBM DS 7.5 6,00 10,50 36,00 144,00 IBM DS PX 7.5 7,00 14,00 41,00 137,00 INFA PWC ,00 10,00 33,00 120,00 Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica ratio compared with TOS ,19 3,97 4,64 3, ,54 1,56 2,08 1, ,65 1,22 1,39 1, ,42 1,35 1,18

Pg 110 Test 11: File Input Delimited > Lookup > File Output Delimited && rejects Lookup 500 000 rows ~34MB + Filter 18 years Statistics: Lookup 500 000 rows ~34MB Lines 100 000 1 000 000 5 000 000 20

110 Pg 110 Test 11: File Input Delimited > Lookup > File Output Delimited && rejects Lookup rows ~34MB + Filter 18 years Statistics: Lookup rows ~34MB Lines TOS ,26 9,28 32,44 111,98 PDI ,80 20,50 81,50 310,00 IBM DS ,60 34,00 57,00 173,00 IBM DS PX 7.5 7,50 14,25 44,67 155,20 INFA PWC ,00 10,00 34,00 126,00 Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica ratio compared with TOS ,83 6,71 1,76 1, ,21 3,66 1,54 1, ,51 1,76 1,38 1, ,77 1,54 1,39 1,13

111 Pg 111

Pg 112
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Lookup 1 000 000 rows ~68MB + Filter 18 years
Statistics: Lookup 1 000 000 rows ~68MB
Lines 100 000 1 000 000 5 000
TOS                ,2    15,22   38,31  126,63
PDI                ,10   32,35  111,35  319,05
IBM DS             ,00   68,00   95,00  220,00
IBM DS PX 7.5     9,00   18,00   51,00  153,33
INFA PWC           ,00   14,00   34,00  130,00
Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica
ratio compared with TOS
,38 6,47 0,88 0,
,13 4,47 1,18 0,
,91 1,7 1,33 0,
,52 1,74 1,21 1,03

Pg 113
TALEND OPEN STUDIO
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited

Pg 114
Schema of file_lookup_delimited
Schema of file_output_delimited (age>=18)
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited

Pg 115
tmap Component

Pg 116
PENTAHO DATA INTEGRATION
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited

Pg 117
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited

Pg 118
Mapping Component
DATASTAGE SERVER
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_and_innerjoin_rejects_file_output_delimited

Pg 119
Job
Schema of file_input_delimited
Schema of file_lookup_delimited

Pg 120
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited

Pg 121
Transformer Component

Pg 122
DATASTAGE PX
Job name: PX_file_input_delimited file_lookup_delimited file_output_delimited rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited

Pg 123
Schema of file_lookup_delimited
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited

Pg 124
Transformer Component

Pg 125
INFORMATICA
Job name: file_input_delimited file_lookup_delimited file_output_delimited rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited

Pg 126
Schema of file_lookup_delimited
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Transformer Component
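The Test 12 jobs listed above all split the input into three outputs: rows with age>=18, rows with age<18, and inner join rejects. What follows is a minimal Python sketch of that routing, not any tool's actual implementation. The join key name `id`, the `city` lookup column, the assumption that `age` lives in the main file, and the choice to test the lookup before the age filter are all hypothetical; the benchmark text does not state them.

```python
def route_rows(main_rows, lookup_rows, key="id", min_age=18):
    """Route main rows into (accepted, age_rejects, join_rejects).

    `key` and the "age" field are hypothetical names used for illustration.
    """
    # Build the lookup table once, keyed on the join column.
    lookup = {row[key]: row for row in lookup_rows}

    accepted, age_rejects, join_rejects = [], [], []
    for row in main_rows:
        match = lookup.get(row[key])
        if match is None:
            # No lookup match: this row is an inner join reject.
            join_rejects.append(row)
            continue
        merged = {**match, **row}
        if int(merged["age"]) >= min_age:
            accepted.append(merged)
        else:
            # Matched the lookup but failed the age filter.
            age_rejects.append(merged)
    return accepted, age_rejects, join_rejects


# Tiny made-up sample, only to show the three outputs.
sample_main = [
    {"id": "1", "age": "25"},
    {"id": "2", "age": "12"},
    {"id": "3", "age": "40"},
]
sample_lookup = [
    {"id": "1", "city": "Paris"},
    {"id": "2", "city": "Lyon"},
]

accepted, age_rejects, join_rejects = route_rows(sample_main, sample_lookup)
print(len(accepted), len(age_rejects), len(join_rejects))
```

In this sample, row 1 is accepted, row 2 goes to the age rejects, and row 3 goes to the inner join rejects because it has no match in the lookup file.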

Pg 127
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited rejects && innerjoin_rejects_file_output_delimited
Lookup 100 000 rows ~7MB
Lookup 100 000 rows ~7MB
Lines 100 000
TOS                ,42    5,65   24,63  106,78
PDI                ,60   13,00   59,80  327,60
IBM DS 7.5        6,00   10,00   30,00  137,00
IBM DS PX 7.5     9,00   15,25   47,33  146,00
INFA PWC           ,00   12,00   33,00  121,00
Statistics:
Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica
ratio compared with TOS
,83 4,22 6,34 2,
,3 1,77 2,7 2,
,43 1,22 1,92 1,
,07 1,28 1,37 1,13

Pg 128
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited rejects && innerjoin_rejects_file_output_delimited
Lookup rows ~34MB
Statistics: Lookup rows ~34MB
Lines
TOS                ,16    8,74   30,34  120,53
PDI                ,26   19,30   72,25  319,60
IBM DS             ,00   35,50   63,00  189,50
IBM DS PX          ,00   16,00   44,00  150,00
INFA PWC
Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica
ratio compared with TOS
,75 6,73 6,73 1,
,21 4,06 1,83 1,
,38 2,08 1,45 1,
,65 1,57 1,24 1,05

Pg 129
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited rejects && innerjoin_rejects_file_output_delimited
Lookup 1 000 000 rows ~68MB
Statistics: Lookup 1 000 000 rows ~68MB
Lines
TOS                ,98   15,18   38,49  126,57
PDI                ,30   27,35   79,00  413,45
IBM DS             ,49   90,40  108,00  231,00
IBM DS PX          ,00   19,00   49,00  134,00
INFA PWC
Number of lines TOS PDI DataStage 7.5 DataStage PX 7.5 Informatica
ratio compared with TOS
,21 3,51 1,18 0,
,8 5,96 1,25 0,
,05 2,81 1,27 0,
,27 1,83 1,06 1,04

Pg 130
Annex 1: Informatica settings and results

This annex presents the settings changes made by Informatica and the limitations they found.

Comments and amendments made to the basic PowerCenter installation:

*** Since the 'benchmark' machine is a small laptop with limited resources (XP 32-bit, Core2 Duo CPU and 3,43 GB of RAM), we made the following changes:
- Auto Memory deactivation: MaxMem at 0 in the Default Session Config
- High Availability storage deactivation: EnableHAStorage at No for the Integration Service
- Metadata Manager and Reporting Service deactivation

*** Configuration amendments:
- Unix environment variable INFA_DEFAULT_DOMAIN added
- Custom variable FileRdrTreatNullCharAs added on the Integration Service (NULL characters are encountered in the source data files)

*** Standard Oracle 10g ( ) Database installation with:
- sga_max_size=164mb
- pga_aggregate_target=115mb

Comments and "best practices" for the tests:

Test 1: File Input Delimited > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows
This is a disk-bound test.

Test 2: File Input Delimited > Table MySQL Output
Not applicable.

Test 3: Table Oracle Input > File Output Delimited
- no partitioning, as the test is too small in volume and too short in time

Test 4: File Input Delimited > Table Output Oracle BULK
Pg 131
- commit size at
- dynamic partitioning at 2 with 2 million rows
This is a disk-bound test.

Test 5: File Input Delimited > Transform > File Output Delimited
- function "CONCAT(CONCAT(firstname,' '),lastname)" is replaced by "firstname || ' ' || lastname"
- dynamic partitioning at 2 with more than 5 million rows
This is a disk-bound test.

Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
- no partitioning, as the test is too small in volume and too short in time
The Oracle database is not 'tuned' for ELT mode.

Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
- commit size at
- no partitioning, as the test is too small in volume and too short in time
The Oracle database is not 'tuned' for ELT mode.

Test 8: File Input Delimited > Sort > File Output Delimited
- sorter memory adjustment
This is a memory-limited test at 20 million rows (a 2-pass sort is required) and sometimes also disk-limited.

Test 9: File Input Delimited > Aggregate > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows in source
- aggregator memory adjustment
This is a CPU-bound test.

Test 10: File Input Delimited > Lookup > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows in source or lookup
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.

Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
- use of router in place of filters
- dynamic partitioning at 2 with more than 5 million rows in source
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.

Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited rejects && innerjoin_rejects_file_output_delimited
- use of router in place of filters
- dynamic partitioning at 2 with more than 5 million rows in source
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.
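The notes for Tests 10 to 12 mention "dynamic partitioning at 2" together with a "hash partitioning point" on the lookup. As a rough illustration of the general idea only, and not of how PowerCenter implements it, the Python sketch below hashes the join key so that a main row and its matching lookup row always land in the same one of two partitions. Each partition can then be looked up independently. The key name `id`, the `city` column, the CRC32 hash, and all function names are assumptions made for this example, and the partitions are processed one after the other here rather than in parallel.

```python
import zlib


def partition_of(key: str, partitions: int = 2) -> int:
    """Map a join key to a partition index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def hash_partition(rows, key, partitions=2):
    """Split rows into `partitions` buckets according to the hashed key."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        buckets[partition_of(str(row[key]), partitions)].append(row)
    return buckets


def partitioned_lookup(main_rows, lookup_rows, key="id", partitions=2):
    """Inner-join main rows to lookup rows, one partition at a time."""
    main_parts = hash_partition(main_rows, key, partitions)
    lookup_parts = hash_partition(lookup_rows, key, partitions)

    joined = []
    for main_part, lookup_part in zip(main_parts, lookup_parts):
        # Each partition only needs the slice of the lookup that hashes to it.
        table = {row[key]: row for row in lookup_part}
        for row in main_part:
            match = table.get(row[key])
            if match is not None:
                joined.append({**match, **row})
    return joined


# Made-up sample data: ids 0..9 in the main file, even ids in the lookup.
sample_main_rows = [{"id": str(i), "age": str(10 + i)} for i in range(10)]
sample_lookup_rows = [{"id": str(i), "city": "X"} for i in range(0, 10, 2)]

result = partitioned_lookup(sample_main_rows, sample_lookup_rows)
print(sorted(row["id"] for row in result))
```

Because both inputs are partitioned with the same hash, the joined rows are the same as with a single unpartitioned lookup; only the size of each in-memory lookup table changes.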
