Transformer Looping Functions for Pivoting the Data


Convert a single row into multiple rows using the Transformer looping function? (Pivoting of data using the parallel Transformer in DataStage 8.5, 8.7 and 9.1.) Refer to this link for more details: Looping Concept in Datastage.

Now you can argue that this is possible using a Pivot stage. But for the sake of this article, let's try doing it using a Transformer! Below is a screenshot of our input data. We are going to read the above data from a sequential file and transform it to look like this:

So let's get to the job design.

Step 1: Read the input data.
Step 2: Set up the looping logic in the Transformer properties.

In the adjacent image you can see a new box called Loop Condition. This is where we are going to control the loop variables. Below is the screenshot of the expanded Loop Condition box.

The Loop While constraint is used to implement functionality similar to a WHILE statement in programming. So, as with a while statement, we need a condition that tells the Transformer how many times the loop is supposed to be executed. The @ITERATION system variable was introduced for this purpose. In our example we need to loop the data 3 times to get the column data onto subsequent rows, so the Loop While condition is @ITERATION <= 3.

Now create a new loop variable with the name LoopName. The derivation for this loop variable should be:

If @ITERATION = 1 Then DSLink2.Name1
Else If @ITERATION = 2 Then DSLink2.Name2
Else DSLink2.Name3

Below is a screenshot illustrating the same. Now all we have to do is map this loop variable LoopName to our output column Name.
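For readers who prefer to see the row-to-rows behaviour spelled out, here is a minimal Python sketch of what the loop condition and the LoopName derivation achieve together. The sample values are hypothetical; this is an illustration of the logic, not DataStage code.

    # Each input row carries three name columns; the Transformer loop runs
    # three times per row (@ITERATION <= 3) and maps the loop variable to
    # the single output column "Name", producing three output rows per input row.
    input_rows = [
        {"Name1": "John", "Name2": "Paul", "Name3": "George"},   # hypothetical data
    ]

    output_rows = []
    for row in input_rows:
        for iteration in (1, 2, 3):                  # mirrors @ITERATION <= 3
            loop_name = row["Name%d" % iteration]    # mirrors the LoopName derivation
            output_rows.append({"Name": loop_name})

    print(output_rows)   # three rows, one per original name column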

Let's map the output to a Sequential File stage and see if the output is as desired. After running the job, we did a View Data on the output stage, and here is the data, as desired.

Making some tweaks to the above design we can implement things like:
1. Adding new rows to existing rows
2. Splitting data in a single column into multiple rows
and many more such things.

Posted by Devendra Kumar Yadav at 4:37 AM

Partitioning Considerations for Best Performance of Datastage Jobs

This blog gives you complete details on how we can improve the performance of DataStage parallel jobs using appropriate partitioning methods. Refer to these links as well: 1. Datastage Partitioning Methods and Use 2. Datastage Jobs Performance Improvement Tips1 3. Datastage Performance Tuning Tips

1.0 Partitioning considerations:

Choose a partition method which makes sure that the number of rows per partition is close to equal. This minimizes the processing workload and thereby improves the overall run time.

Any stage that processes a group of related records must be partitioned using a keyed partitioning technique (e.g. the Aggregator, Remove Duplicates, Change Capture, Change Apply, Join and Merge stages, as well as Transformers that process groups of related records).

Minimize repartitioning, as it decreases performance unless the partition distribution is highly skewed. Repartitioning incurs network transport overhead, and the even distribution of data among partitions is also disturbed.

Specify hash partitioning for stages that require processing of groups of related records. Partitioning keys should include only those key columns that are necessary for proper grouping.

If the grouping is on a single integer key column, go for Modulus partitioning on the same key column.

If the data is highly skewed and the key column values and distribution will not change significantly over time, use the Range partitioning technique.

Use Round Robin partitioning to distribute data evenly across all partitions (if grouping is not needed). This is strongly suggested when the input data is in sequential mode or is highly skewed.

Same partitioning requires minimum resources; it can be used to optimize a job and to eliminate repartitioning of already partitioned data.
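As a rough illustration of the difference between keyed and non-keyed partitioning described above, the following Python sketch shows how a hash partitioner keeps all rows with the same key in one partition (so grouping stages see complete groups), while round robin simply balances row counts. The key name and node count are made up for the example.

    # Hash partitioning: rows with the same key always land in the same partition,
    # which is what grouping stages (Aggregator, Join, Remove Duplicates, ...) need.
    # Round robin partitioning: rows are dealt out evenly regardless of key.
    NUM_PARTITIONS = 4   # assumed number of logical nodes

    def hash_partition(row, key):
        return hash(row[key]) % NUM_PARTITIONS

    def round_robin_partition(row_index):
        return row_index % NUM_PARTITIONS

    rows = [{"cust_id": i % 10, "amount": i} for i in range(100)]   # hypothetical data

    for i, row in enumerate(rows[:5]):
        print(row["cust_id"],
              "hash ->", hash_partition(row, "cust_id"),
              "round robin ->", round_robin_partition(i))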

When the input data set is sorted in parallel, we need to use the Sort Merge collector, which produces a single sorted stream of rows. When the input data set is sorted in parallel and range partitioned, the Ordered collector method is preferred for collection. For a round-robin partitioned input data set, use the Round Robin collector to reconstruct rows in input order, as long as the data set has not been repartitioned or reduced. Minimize the use of sorts in a job.

Figure: Partitioning tab in a DataStage stage's properties
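For intuition, a Sort Merge collection is essentially a k-way merge of already sorted partition streams. The sketch below uses Python's heapq.merge to show the idea; the partition contents are invented for the example.

    import heapq

    # Each partition has already been sorted on the collection key in parallel.
    partition_0 = [1, 4, 9]        # hypothetical sorted partition streams
    partition_1 = [2, 3, 10]
    partition_2 = [5, 6, 7, 8]

    # Sort Merge collector: merge the sorted streams into one sorted stream
    # without re-sorting the whole data set.
    collected = list(heapq.merge(partition_0, partition_1, partition_2))
    print(collected)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]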

Posted by Devendra Kumar Yadav at 12:22 AM

Datastage Jobs Best Practices and Performance Tuning

This blog gives you complete details on how we can improve the performance of DataStage parallel jobs and the best practices we have to follow while creating DataStage jobs. This blog will help you on the following topics:
1. Performance Tuning Guidelines
1.1 General Job Design
1.2 Transformer Stage
1.3 Data Grouping Stages
1.4 ODBC Stages

Refer to this link as well: Parallel Job Performance Tuning Tips1

1.0 Performance Tuning Guidelines

1.1 General Job Design

Jobs should be developed using a modular development approach. Large jobs can be broken down into smaller modules, which helps in improving performance.

In scenarios where the same data (a huge number of records) is to be shared among more than one job in the same project, use the Data Set stage approach instead of re-reading the same data again.

Eliminate unused columns.

Eliminate unused references.

If the input file has a huge number of records and the business logic allows splitting up of the data, then run the job in parallel to get a significant improvement in performance.

1.2 Transformer Stage

Use the parallel Transformer stage instead of Filter/Switch stages (Filter/Switch stages take more resources to execute; for example, in the case of the Filter stage the where clause is evaluated at run time, creating the need for more resources and thereby degrading job performance).

Figure: Example of using a Transformer stage instead of a Filter stage. The filter condition is given in the constraint section of the Transformer stage properties.

Use a BuildOp stage only when the required logic cannot be implemented using the parallel Transformer stage.

Avoid calling routines in derivations in the Transformer stage; implement the logic in the derivation itself. This avoids the overhead of a procedure call.

Implement common logic using stage variables and reference these stage variables in the derivations. During processing, execution starts with the stage variables, then the constraints, and then the individual columns. If there is a prerequisite formula that is used by both constraints and individual columns, define it in a stage variable so that it is evaluated once and reused in multiple places. If the formula has to change for each and every row or column, it is advisable to place the code at the column-derivation level rather than at the stage-variable level.

Figure: Example of using stage variables and referencing them in the derivations.
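To illustrate the stage-variable idea outside of DataStage, here is a small Python sketch: a value needed by several output columns is computed once per row (the "stage variable") instead of being recomputed in every derivation. The column names are invented for the example.

    rows = [{"qty": 3, "unit_price": 9.99, "tax_rate": 0.18}]   # hypothetical input

    for row in rows:
        # "Stage variable": computed once per row, reused by several derivations.
        gross = row["qty"] * row["unit_price"]

        out = {
            "gross_amount": gross,                   # derivation 1 reuses it
            "tax_amount": gross * row["tax_rate"],   # derivation 2 reuses it
            "net_amount": gross * (1 + row["tax_rate"]),
        }
        print(out)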

1.3 Data Grouping Stages

When dealing with stages like the Aggregator, Filter, etc., always try to use sorted data for better performance.

Figure: Sorting the input data on the grouping keys in an Aggregator stage.

The example shown in the figure is the properties window for an Aggregator stage that finds the sum of a quantity column by grouping on the columns shown above. In such scenarios we sort the input data on the same columns so that records with the same values for these grouping columns come together, thereby increasing performance. Also note that if we are using more than one node, the input data set should be properly partitioned so that similar records are available on the same node.
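The benefit of feeding sorted, key-partitioned data to a grouping stage can be sketched with Python's itertools.groupby, which, like a sort-mode Aggregator, only works correctly on input that is already sorted on the grouping key. The column names are hypothetical.

    from itertools import groupby
    from operator import itemgetter

    rows = [
        {"product": "A", "qty": 5},
        {"product": "B", "qty": 2},
        {"product": "A", "qty": 3},
    ]

    # Sort on the grouping key first, exactly as we sort/partition before an Aggregator.
    rows.sort(key=itemgetter("product"))

    # Streaming aggregation: each group is processed as soon as its rows are together,
    # so the whole data set does not have to be held in memory per key.
    for product, group in groupby(rows, key=itemgetter("product")):
        print(product, sum(r["qty"] for r in group))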

1.4 ODBC Stages

If possible, sort the data in the ODBC stage itself; this reduces the overhead of DataStage sorting the data. Don't use the Sort stage when we have an ORDER BY clause in the ODBC SQL.

Select only the required records, or remove the unwanted rows as early as possible, so that the job does not have to deal with unnecessary records that degrade performance. Using a constraint to filter a record is much slower compared to having a SELECT ... WHERE in the ODBC stage. Use the power of the database wherever possible and reduce the overhead on DataStage.
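A hedged sketch of the idea: push the filter and the ordering into the user-defined SQL rather than pulling everything into the job and filtering with a constraint. The table and column names are invented, and the pyodbc connection string is a placeholder.

    import pyodbc   # assumed ODBC driver module; any DB-API connection works the same way

    conn = pyodbc.connect("DSN=my_dsn;UID=user;PWD=secret")   # placeholder connection string

    # Work done by the database: filtering and ordering happen in SQL,
    # so the job only receives the rows it actually needs, already ordered.
    user_defined_sql = """
        SELECT order_id, customer_id, amount
        FROM   orders
        WHERE  order_date >= '2013-01-01'
        ORDER  BY customer_id
    """

    rows = conn.cursor().execute(user_defined_sql).fetchall()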

Figure: Using the User-defined SQL option in ODBC stages to reduce the overhead on DataStage by specifying the WHERE and ORDER BY clauses in the SQL used to get the data.

Avoid using the LIKE operator in user-defined queries in ODBC stages. One thing to note here, though, is that if our custom SQL genuinely has to filter on a string pattern, we will be forced to use LIKE to get the requirement done.

Avoid using stored procedures unless the functionality cannot be implemented in DataStage jobs.

Posted by Devendra Kumar Yadav at 12:07 AM

TUESDAY, OCTOBER 22, 2013

Know about Conductor Node, Section Leaders and Players Process in Datastage

Details about the Conductor Node, Section Leader and Player processes in DataStage. Refer to this link as well for more details: Job Run Time Architecture

Jobs developed with DataStage Enterprise Edition (EE) are independent of the actual hardware and degree of parallelism used to run the job. The parallel configuration file provides a mapping at runtime between the job and the actual runtime infrastructure and resources, by defining logical processing nodes. To facilitate scalability across the boundaries of a single server, and to maintain platform independence, the parallel framework uses a multi-process architecture. The runtime architecture of the parallel framework uses a process-based architecture that enables scalability beyond server boundaries while avoiding platform-dependent threading calls. The actual runtime deployment for a given job design is composed of a hierarchical relationship of operating system processes, running on one or more physical servers.

Section Leaders (one per logical processing node): used to create and manage the player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.

Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.

Conductor Node (one per job): the main process used to start up jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as a single coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error. The conductor node is run from the primary server.

It is the main process that:
1. Starts up jobs.
2. Determines resource assignments.
3. Creates the Section Leaders (which in turn create and manage the player processes that perform the actual job execution).
4. Acts as the single coordinator for status and error messages.
5. Manages orderly shutdown when processing completes or in the event of a fatal error.

When the job is initiated, the primary process (called the conductor) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable. Once the execution nodes are known (from the configuration file), the conductor causes a coordinating process called a section leader to be started on each: by forking a child process if the node is on the same machine as the conductor, or by remote shell execution if the node is on a different machine from the conductor (things are a little more dynamic in a grid configuration, but essentially this is what happens). Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP.

Scenarios to calculate the processes:

Sample APT_CONFIG_FILE (node1 carries the "conductor" pool):

{
  node "node1" {
    fastname "DevServer1"
    pools "conductor"
    resource disk "/datastage/ascential/datastage/datasets/node1" {pools "conductor"}
    resource scratchdisk "/datastage/ascential/datastage/scratch/node1" {pools ""}
  }
  node "node2" {
    fastname "DevServer1"
    pools ""
    resource disk "/datastage/ascential/datastage/datasets/node2" {pools ""}
    resource scratchdisk "/datastage/ascential/datastage/scratch/node2" {pools ""}
  }
}

Please find the different answers below. For every job that starts there will be one (1) conductor process (started on the conductor node), one (1) section leader for each node in the configuration file, and one (1) player process (may or may not be true) for each stage in your job, for each node. So if you have a job that uses a two (2) node configuration file and has 3 stages, then your job will have:

1 conductor process
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job will run 9 processes on 2 nodes. This kind of information is very helpful when determining the impact that a particular job or process will have on the underlying operating system and system resources.

Posted by Devendra Kumar Yadav at 11:53 PM

Situations to choose Parallel or Server Datastage Jobs

1. The choice of server or parallel depends upon time to implement, functionality and cost.
2. When we have lots of functionality to implement for lower volumes, limited hardware and ease of implementation, we can go for server jobs.
3. Parallel jobs are costly due to the high scale of hardware and are more difficult to implement, but offer extreme processing capability for huge volumes, with a vast array of operators for high-performance manipulation.
4. When the data volume is low it is better to go for a server job, as parallel jobs can have a longer start-up time.
5. When the data volume is high, it is better to choose a parallel job over a server job. A parallel job will be a lot faster than a server job even if it runs on a single node.

The obvious incentive for going parallel is data volume. Parallel jobs can remove bottlenecks and run across multiple nodes in a cluster for almost unlimited scalability. At that point parallel jobs become the faster and easier option. A parallel Sort stage is a lot faster than the server Sort stage. A Transformer stage in a parallel job is faster than a server job with the same transformations; even on one node, the compiled parallel Transformer stage was three times faster. Even on a 1-node configuration that does not have a lot of parallel processing, we can still get big performance improvements from an Enterprise Edition job, and the improvements are multiplied 10 times or more if we work on 2-CPU machines with two nodes in most stages.
6. Parallel jobs take advantage of both pipeline parallelism and partitioning parallelism.
7. We can improve the performance of a server job by enabling inter-process row buffering. This helps stages exchange data as soon as it is available on the link. The IPC stage also helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.
8. Lookup against a sequential file is possible in parallel jobs and not possible in server jobs.
9. DataStage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators - instances of executable C++ classes, pre-built components representing the stages used in DataStage jobs. Server jobs are compiled into BASIC, which is an interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
10. The major difference between InfoSphere DataStage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the DataStage Server objects, however their capabilities are quite different. In rough outline: parallel jobs are executable DataStage programs, managed and controlled by the DataStage Server runtime environment; they have built-in mechanisms for pipelining, partitioning and parallelism, and in most cases no manual intervention is needed to apply those techniques optimally; and they are a lot faster at ETL tasks like sorting, filtering and aggregating.

Refer to this link to know more about parallel job stages: Parallel Jobs Stages

Posted by Devendra Kumar Yadav at 11:02 PM

Surrogate Key Generator Implementation

Surrogate Key Generator implementation in DataStage 8.1, 8.5 & 9.1.

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source. A surrogate key is a unique primary key that is not derived from the data that it represents; therefore changes to the data will not change the primary key. In a star schema database, surrogate keys are used to join a fact table to a dimension table.

The Surrogate Key Generator stage can be used to:
1. Create or delete the key source before other jobs run.
2. Update a state file with a range of key values.
3. Generate surrogate key columns and pass them to the next stage in the job.
4. View the contents of the state file.

Generated keys are 64-bit integers, and the key source can be a state file or a database sequence.

Surrogate keys are used to join a dimension table to a fact table in a star schema database. When the SCD stage performs a dimension lookup:
A) If a matching record is found, it retrieves the value of the existing surrogate key.
B) If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys in this way, use a key source that you created with a Surrogate Key Generator stage, as described above. If you want to use your own method to handle surrogate keys, you should derive the Surrogate Key column from a source column instead.

You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.

Creating the key source:

Drag the Surrogate Key Generator stage from the palette onto the parallel job canvas, with no input or output links. Double-click the stage and click on the Properties tab.

Properties:
Key Source Action = Create
Source Type = Flat File or Database sequence (in this case we are using Flat File)

When you run the job it will create an empty state file. If you want to check the contents, change View State File = Yes and check the job log for details:

skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.

If you try to create the same file again, the job will abort with the following error:

skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.

Deleting the key source:

Updating the state file:

To update the state file, add a Surrogate Key Generator stage to the job with a single input link from another stage. We use this process to update the state file if it is corrupted or deleted. Open the surrogate key stage editor and go to the Properties tab.

If the state file exists we can update it; otherwise we can create it and then update it. We are using the SkeyValue parameter to update the state file via a Transformer stage.

Generating surrogate keys:

Now that we have created the state file, we will generate keys using it. Click on the Surrogate Key Generator stage, go to Properties, and type a name for the surrogate key column in the Generated Output Column Name property.

Go to the Output tab and define the mapping as shown below.

The Row Generator stage is supplying 10 rows, so when we run the job we see 10 surrogate key values in the output. I have updated the state file with 100, and below is the output.

If you want to control where key generation starts (for example, to generate key values from the beginning again), you can use the following properties in the Surrogate Key Generator stage.

A. If the key source is a flat file, specify how keys are generated:
1. To generate keys in sequence from the highest value that was last used, set the Generate Key from Last Highest Value property to Yes. Any gaps in the key range are ignored.
2. To specify a value to initialize the key source, add the File Initial Value property to the Options group, and specify the start value for key generation.
3. To control the block size for key ranges, add the File Block Size property to the Options group, set this property to User specified, and specify a value for the block size.

B. If there is no input link, add the Number of Records property to the Options group, and specify how many records to generate.
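To make the state-file behaviour easier to picture, here is a small, hedged Python sketch of a file-based key source: it remembers the last highest value and hands out the next run of keys from there. The file path is illustrative and this is only an analogy for what the stage manages internally, not its real file format.

    import os

    STATE_FILE = "/tmp/skey_demo.stat"   # illustrative path, not the stage's real file format

    def next_keys(count):
        # Read the last highest value handed out (0 if the state file does not exist yet).
        last_highest = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last_highest = int(f.read().strip() or 0)

        # Generate the next `count` keys in sequence from the last highest value.
        keys = [last_highest + i for i in range(1, count + 1)]

        # Persist the new highest value so the next run (or job) continues after it.
        with open(STATE_FILE, "w") as f:
            f.write(str(keys[-1]))
        return keys

    print(next_keys(10))   # e.g. 1..10 on the first run, 11..20 on the next run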
