Data Stage ETL Implementation Best Practices

Copyright (C) SIMCA IJIS
Dr. B. L. Desai
Bhimappa.desai@capgemini.com

ABSTRACT: This paper is the outcome of the expertise gained from live implementations using DataStage. It details the best practices to be followed during the different phases of the ETL life cycle when using DataStage, and includes a comparative study of a few DataStage ETL processes observed during implementation.

Keywords: Data Warehouse, ETL, Configuration Management, Backup and Recovery, DataStage, User Administration

INTRODUCTION:
A data warehouse is generally a collection of subject-oriented, integrated, time-variant, non-volatile, business-oriented databases designed to support management's decision-making function. A data warehouse environment typically contains data that has been integrated into one architecture and offers summarized, read-only, historical information.

The ETL process of building a data warehouse consists of capturing, integrating and storing the data in a warehouse or mart. It consists of several basic concepts that must be integrated into an executable process:

- Accessing and extracting data that may be spread across an enterprise's diverse systems architecture and held in many different data structures.
- Validating, and often improving, the consistency and quality of that data as it is repurposed from its operational role to a more strategic, decision-making role (quality requirements differ between these two roles).
- Adding business context to the operational data through the data transformation process (converting "0" to "male" and "1" to "female", for example). This is also where data that is stored differently in various systems is transformed to one consistent definition for the business.
- Storing this business information in an efficient and effective manner that allows rapid access by the information consumer and analyst, as needed.
- Finally, capturing process, business and technical metadata along this data flow. This metadata is later used to help the consumer access and understand the information in the data warehouse or mart, and to control the process flow for building and refreshing the data.

If an ETL tool is not used, separate extraction, transformation and loading programs must be developed and scheduled to execute in sequence, with FTP and staging of the data typically required. The data extraction and loading programs can be written with traditional programming techniques, but the use of ETL tools has proven to be more effective.
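As a minimal illustration of the decode step described above, the sketch below maps operational codes to business values before loading; the column name, code values and Python implementation are illustrative assumptions, not part of any particular DataStage job.

```python
# Illustrative decode transformation: map operational codes to
# business-friendly values before loading. Column names and code values
# are hypothetical examples.
GENDER_DECODE = {"0": "male", "1": "female"}

def transform_row(row: dict) -> dict:
    """Return a copy of the row with the gender code decoded."""
    out = dict(row)
    out["gender"] = GENDER_DECODE.get(row.get("gender_code", ""), "unknown")
    return out

if __name__ == "__main__":
    source_rows = [{"emp_id": 1, "gender_code": "0"},
                   {"emp_id": 2, "gender_code": "1"}]
    for r in source_rows:
        print(transform_row(r))
```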

The ETL process may be the same for all ETL tools, but the way it is implemented varies from tool to tool. The complete process involves many real-time problems and needs a well-defined strategy; a well-defined process for data warehouse projects brings business value and project success. This document deals with the best ETL practices for DataStage as an ETL tool. DataStage is an integrated ETL product that supports extraction of the source data, cleansing, decoding, transformation, integration, aggregation, and loading of target databases. This document covers the strategies used at different phases of the ETL process, such as configuration management, backup and recovery, naming standards, user administration, performance tuning, building reusable components and customized use of the DataStage tool.

CONFIGURATION MANAGEMENT:
In a typical enterprise environment there may be many developers working on jobs that are all at different stages of their development cycle. Without some sort of version control, managing these jobs is time consuming and difficult. Version control allows you to:

- Store different versions of DataStage jobs.
- Run different versions of the same job.
- Revert to a previous version of a job.
- View version histories.
- Ensure that everyone is using the same version of a job.
- Protect jobs by making them read-only.
- Store all changes in one centralized place.

Guidelines for effective version control of DataStage mappings: create separate DataStage projects for Development, Testing and Production, and also create a separate VERSION project for maintaining the history of all jobs. For example: knpc_dev for the development phase, knpc_ver for the version project, knpc_testing for the testing phase and knpc_prod for the production phase.

VERSION CONTROL FOR DATA STAGE:
The following steps are followed for version control of DataStage jobs during the project life cycle. Once development is over, check the code into the version control software. The same checked-in jobs can then be deployed into the Testing environment, the UAT environment and finally the Production environment. In case of a failure in any environment, follow the version control procedure for the respective phase as described below.

VERSION CONTROL DURING DEVELOPMENT PHASE:
The following steps are followed for version control of DataStage jobs during the development phase:

1. Create jobs in the Development project.
2. Take the latest/required version from PVCS/VSS and import it into the DataStage Development project.
3. Initialize the developed jobs into the Version project.
4. To change an existing job, promote the required job from the Version project to the Development project.
5. After making the changes in the Development project, initialize the job back into the Version project.

VERSION CONTROL DURING TESTING PHASE:
The following steps are followed for version control of DataStage jobs during the testing phase:

1. After the completion of the development phase, promote the latest version of all jobs and other related components from the Version project (knpc_ver) to the Test project (knpc_testing) with read-only privileges.
2. If any errors are found during the testing phase, promote the job that contains the errors from the Version project (knpc_ver) to the Development project (knpc_dev).
3. After debugging the job, initialize it into the Version project and then promote it from the Version project to the Test project.
VERSION CONTROL DURING PRODUCTION PHASE:
The following steps are followed for version control of DataStage jobs during the production phase:

1. After the completion of the testing phase, promote the latest version of all jobs and other related components from the Version project (knpc_ver) to the Production project (knpc_prod) with read-only privileges.
2. If any errors are found during the production phase, promote the job that contains the errors from the Version project (knpc_ver) to the Development project (knpc_dev).
3. After debugging the job, initialize it into the Version project and then promote it from the Version project (knpc_ver) to the Production project (knpc_prod).
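The promotion path described in the three phases above (Development to Version, then Version to Test or Production with read-only copies downstream) can be summarized as a simple rule set. The following Python sketch only models that rule set for clarity; the promote helper and the reuse of the knpc_* project names are illustrative assumptions, and this is not a DataStage or PVCS/VSS API.

```python
# Illustrative sketch (not DataStage code): enforce the promotion path
# dev -> version -> test/prod described in the version-control guidelines.
ALLOWED_PROMOTIONS = {
    "knpc_dev": {"knpc_ver"},                                # dev work is initialized into the version project
    "knpc_ver": {"knpc_dev", "knpc_testing", "knpc_prod"},   # the version project feeds all other projects
}

READ_ONLY_TARGETS = {"knpc_testing", "knpc_prod"}  # downstream copies are read-only

def promote(job: str, source: str, target: str) -> str:
    """Validate a promotion request and describe the action to perform."""
    if target not in ALLOWED_PROMOTIONS.get(source, set()):
        raise ValueError(f"Promotion {source} -> {target} is not allowed for {job}")
    mode = "read-only" if target in READ_ONLY_TARGETS else "editable"
    return f"Promote {job} from {source} to {target} as {mode}"

if __name__ == "__main__":
    print(promote("JOB_EISFINKPIFact", "knpc_ver", "knpc_prod"))
```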

BACKUP AND RECOVERY:
Backup is the process of storing valuable data, and recovery is the corrective process of restoring the database to a usable state from an erroneous state. Backup and recovery are necessary to prevent data loss due to hardware or software problems and to ensure that the available data is up to date, consistent and correct. DataStage has an inbuilt facility to back up and recover the repository or its components, including job designs, shared containers, data elements, stage types, table definitions, transforms, job schedules and routines, by using the import and export features of DataStage Manager. The DataStage Developer and Administrator roles are privileged to back up or recover the repository. Regular backups avoid data loss in the different phases of the project development life cycle. The following guidelines suggest a backup frequency:

Development phase:
- Daily backup of the development repository.

Testing phase:
- When a change occurs in the project.
- At the end of the testing phase.

Production phase:
- At the beginning of the production phase.
- When a change occurs in the DataStage project.

USER ADMINISTRATION:
In a multi-user environment, user administration is a system management task that involves assigning proper privileges to users. DataStage provides default roles such as administrator, developer and operator. The following are the key inputs for categorizing users into different groups:

- User profiles: a description of the static information concerning each user, along with the access control requirements based on the nature of the activity involved.
- Applications: a description of each application and the controls that apply to its users.

GUIDELINES FOR USER ADMINISTRATION:
- Each repository must have one user with administrator privilege. The administrator is responsible for assigning privileges to other users, taking backups of the repository and migrating the repository.
- DataStage uses operating system user groups. Ensure that the "none" privilege is assigned to users who belong to the network but not to the project.
- Assign privileges to users based on their role in the project: assign developer privilege to those involved in developing jobs and operator privilege to those who execute the jobs.

SHARED CONTAINERS / REUSABLE COMPONENTS:
Shared containers are the reusable components in DataStage. Create containers for mappings that involve common business logic. Stages and links can be grouped together to form a container. Shared containers are created separately and stored in the repository; instances of a shared container can be inserted into any server job in the project, so they are reusable by other jobs. Standard error handling can be implemented with a shared container within a project or across projects: the container takes error information and severity as input, loads the data into a standard error or log table and, according to the severity, aborts or continues further.

GUIDELINES FOR BUILDING SHARED CONTAINERS:
1. Create shared containers for common job components in a project. They can also be used within a job where sequences of stages are replicated.
2. The input metadata for the shared container must match the output metadata of the stage to which it connects in a job. Similarly, the output metadata from the container must match the input metadata of the stage to which it connects in the job.
3. If a shared container is modified, the jobs that use it must be recompiled for the changes to take effect.
4. To deconstruct a shared container, first convert it into a local container.
5. Deconstructing a container is not recursive: if you deconstruct a reusable component that contains other reusable components, the nested components must be deconstructed separately.
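The standard error-handling shared container described above takes error details and a severity, writes them to a standard error or log table, and then aborts or continues. A minimal Python sketch of that decision logic follows; the severity levels, field names and in-memory stand-in for the log table are assumptions made for illustration, not DataStage code.

```python
# Illustrative sketch (not DataStage code) of severity-based error handling:
# log the error to a standard table, then abort or continue by severity.
from datetime import datetime

ABORT_SEVERITIES = {"FATAL", "ERROR"}   # assumed severity levels
error_log = []                          # stand-in for a standard error/log table

def handle_error(process_name: str, message: str, severity: str) -> None:
    """Record the error and abort the run if the severity demands it."""
    error_log.append({
        "logged_at": datetime.now().isoformat(timespec="seconds"),
        "process": process_name,
        "severity": severity,
        "message": message,
    })
    if severity.upper() in ABORT_SEVERITIES:
        raise RuntimeError(f"{process_name} aborted: {message}")
    # Lower severities (e.g. WARNING) are logged and processing continues.

if __name__ == "__main__":
    handle_error("JOB_EISFINKPIFact", "Null product key defaulted", "WARNING")
    print(error_log)
```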

BASIC ROUTINES:
DataStage has many inbuilt routines and functions that can be called while performing transformations. In addition, the user can write customized BASIC routines to meet business requirements. This gives the flexibility to code any complex transformation or functionality; routines can be written for transformations that are not possible with the built-in routines. Create routines that perform the required task so that they can be reused in other jobs and by other DataStage users. Writing routines requires some basic programming skills.

EXCEPTION HANDLING / ERROR HANDLING:
It is advisable to have a proper error handling mechanism in every project, whether as a module-level or a project-level standard. For example, a standard error handling routine can take any user-defined exception details, load them in a standard format and raise them according to the severity of the error.

STANDARD PARAMETERS:
It is also advisable to parameterize jobs instead of hardcoding details such as the report date. As a generic set, the following parameters can be used:

1. ReportDate
2. UpdatedUser
3. ProcessName

Using this standard information makes it easy to trace activity when logging into the log table.

STANDARD ROUTINES:
It is advisable to create routines at project level. For example, a routine UR_GetDateTimeStamp can take a datetime value and a format as inputs and return the timestamp value in a standard format.
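A conceptual sketch of how the standard parameters and a UR_GetDateTimeStamp-style routine might work together when writing log records is shown below, in Python rather than DataStage BASIC; the field names and default format are illustrative assumptions.

```python
# Conceptual sketch: use the standard parameters (ReportDate, UpdatedUser,
# ProcessName) together with a UR_GetDateTimeStamp-style helper to build a
# consistently formatted log record. Field names are illustrative.
from datetime import datetime

def get_datetime_stamp(value: datetime, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Return the timestamp in one standard format, like UR_GetDateTimeStamp."""
    return value.strftime(fmt)

def build_log_record(report_date: str, updated_user: str, process_name: str) -> dict:
    """Assemble a log-table row from the standard job parameters."""
    return {
        "report_date": report_date,      # passed as a job parameter, never hardcoded
        "updated_user": updated_user,
        "process_name": process_name,
        "logged_at": get_datetime_stamp(datetime.now()),
    }

if __name__ == "__main__":
    print(build_log_record("2010-10-01", "etl_batch", "JOB_EISFINKPIFact"))
```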

NAMING CONVENTION STANDARDS:
DataStage has many components, such as projects, jobs, batches, job sequencers, job containers, built-in stages, plug-in stages, transformations and routines. These components should follow a standard naming convention for effective management, clarity and readability. Table.1 and Table.2 below give the naming conventions for various DataStage components.

Table.1. Data Source Names (DSN) Naming Convention

  DSN       Convention                                          Example
  Source    SRC_<database name (first 3 letters)>_DSN           SRC_ORA_DSN
  Target    DW(M)_<database name (first 3 letters)>_DSN         DW_ORA_DSN, DM_ORA_DSN
  Staging   STG<number>_<database name (first 3 letters)>_DSN   STG1_ORA_DSN

Table.2. DataStage Components Naming Convention

  DataStage Component     Convention                                                            Example
  Project                 <name of the project>_<phase of the project>                          XYZ_DEV
  Category                CAT_<name of the category>                                            CAT_SourceToSTG
  Job                     JOB_<name of the job>                                                 JOB_EISFINKPIFact
  Batch                   BAT_<batch name>                                                      BAT_SourceToSTG
  Job Sequencer           JOBSEQ_<name of the job sequencer>                                    JOBSEQ_ForCommonMasterTables
  Source/Target Stages    <stage name (3 letters)>_<file name (table name)>                     SEQ_Product
  Transformation Stages   TRS_<transformation name>                                             TRS_Rank
  Lookup                  LKP_<lookup type>_<lookup name>                                       LKP_ORA_ProductMaster
  Local Containers        LOC_CON_<container name>                                              LOC_CON_Salaryrank
  Shared Containers       SHR_CON_<container name>                                              SHR_CON_agecalc
  Sequences               SEQ_<column name of the table for which the sequence is generated>    KeyMgtNextValue("SEQ_Time_Key")
  User Defined Routines   UR_<name of the routine (describing its purpose)>                     UR_GetMonth
  Stage Variables         V_<stage variable name>                                               V_Ename

PERFORMANCE IMPROVEMENT TECHNIQUES:
Performance tuning is the process of getting optimal results in terms of time, hardware, the manpower required for monitoring, and the cost of management. It is a disciplined practice that should be carried out with a strategy. The hints below outline the performance tuning methodology.

HINTS FOR IMPROVING PERFORMANCE:
1. Run the DataStage server and related machines on high-performance CPUs to ensure maximum performance.
2. Increased network speed improves performance.
3. Minimize the number of stages in the DataStage job.
4. Use only the columns required for processing and populating the target table; eliminate unwanted tables and columns at the source level itself.
5. Join tables at the source level when the lookup and source tables are in the same environment.
6. Apply filter conditions in the source query itself where possible, instead of extracting the data and then filtering it with a Filter stage.
7. Create temporary tables for jobs involving complex SQL queries: first load the data into temporary tables and then into the target tables.
8. Create indexes on lookup, ORDER BY and GROUP BY columns.
9. Keep source and target files on the DataStage server machine.
10. Split a complex mapping into many simpler mappings.
11. Avoid unnecessary data conversions in jobs.
12. Use parallel jobs rather than server jobs.
13. Enable both the pipeline and parallel processing options for parallel jobs.
14. Use filter conditions as near to the source stage as possible.
15. Use the Aggregator stage as near to the source stage as possible.
16. Use plug-in stages, such as OCI for Oracle, rather than ODBC stages for loading data.
17. Use bulk loading (the ORAOCIBL stage for Oracle) rather than normal loading. For more details, see Normal Loading and Bulk Loading below.
18. Use Hash File stage lookups rather than ODBC lookups for better performance. For more details, see the Hashed File Stage section.
19. Select the Pre-load file in memory check box in the Hash File stage when the hash file is used both as input and output; the records are loaded into memory, giving faster loading and retrieval.

NORMAL LOADING AND BULK LOADING:
Normal loading and bulk loading are two techniques available for loading data. In normal load, the database is accessed each time a record is loaded from source to target; normal load is achieved using the ODBC (Open Database Connectivity) or OCI stages. In bulk load, the database is accessed only once to create two files, a control file (containing the schema of the database) and a data file (containing the data to be loaded); the data is then loaded from source to target using these files. Bulk load is achieved using the Orabulk stage, the plug-in provided by DataStage for bulk load, and gives better performance than normal load. Table.3 and Figure.1 compare normal load and bulk load.

Table.3. Comparison between normal load and bulk load
(ODBC stage used as source and target in normal loading; ODBC stage used as source and Orabulk stage used as target in bulk loading.)

  No. of Rows Loaded   Bulk Loading (time taken in sec)   Normal Loading (time taken in sec)
  10,000               10                                 51
  50,000               51                                 220
  1,000,000            860                                4,553

Figure.1. Comparison between normal load and bulk load.
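Table.3 reflects a general effect: loading row by row pays a database round-trip cost for every record, while a bulk or batched interface amortizes it. The sketch below reproduces the effect in miniature with SQLite, purely as an illustration of the principle; it is not the Orabulk/ORAOCIBL mechanism itself, and the table layout is an assumption.

```python
# Miniature illustration of row-by-row vs. batched loading using SQLite.
# This only demonstrates the principle behind Table.3; Orabulk bulk loading
# works differently (control file + data file) but wins for the same reason:
# the per-row overhead is amortized.
import sqlite3
import time

rows = [(i, f"product_{i}") for i in range(50_000)]

def load_row_by_row(conn):
    for r in rows:
        conn.execute("INSERT INTO tgt VALUES (?, ?)", r)     # one call per record
    conn.commit()

def load_batched(conn):
    conn.executemany("INSERT INTO tgt VALUES (?, ?)", rows)  # one batched call
    conn.commit()

for loader in (load_row_by_row, load_batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tgt (id INTEGER, name TEXT)")
    start = time.perf_counter()
    loader(conn)
    print(f"{loader.__name__}: {time.perf_counter() - start:.2f}s")
    conn.close()
```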

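The hashed file comparison in the next section measures the idea behind hints 18 and 19: resolving lookups from a structure pre-loaded into memory is far cheaper than querying the database for every input row. The following sketch illustrates that contrast, again using SQLite only as a stand-in reference table; the table and column names are assumptions.

```python
# Conceptual sketch of hints 18 and 19: pre-load the lookup reference data
# into memory (as a hashed file lookup does) instead of querying per row.
# SQLite stands in for the reference database; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_master (product_key INTEGER, product_name TEXT)")
conn.executemany("INSERT INTO product_master VALUES (?, ?)",
                 [(i, f"product_{i}") for i in range(1000)])

def lookup_per_row(key: int) -> str:
    """Database round trip for every input row (ODBC-style lookup)."""
    cur = conn.execute(
        "SELECT product_name FROM product_master WHERE product_key = ?", (key,))
    row = cur.fetchone()
    return row[0] if row else "UNKNOWN"

# Pre-load the reference data once into a dictionary (hashed-file-style lookup).
lookup_cache = dict(conn.execute("SELECT product_key, product_name FROM product_master"))

def lookup_in_memory(key: int) -> str:
    return lookup_cache.get(key, "UNKNOWN")

if __name__ == "__main__":
    print(lookup_per_row(42), lookup_in_memory(42))
```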
HASHED FILE STAGE:
A hashed file is a file that uses a hashing algorithm to distribute one or more groups on disk. Using a hashed file as a lookup gives a significant advantage when the amount of data handled is high. Table.4 compares normal (ODBC) lookups with a hashed file lookup.

Table.4. Comparison of normal and hashed file lookups
(Number of records used: 50,000)

  Source   Target   Lookup                          Time taken
  ODBC     ODBC     ODBC lookup                     29 m 11 s
  ODBC     ODBC     ODBC lookup with index column   8 m 1 s
  ODBC     ODBC     Hashed file lookup              3 m 10 s

CONCLUSION:
This paper is an attempt to share the expertise gained through ETL implementation using DataStage. It covers most of the technical difficulties and pitfalls faced while implementing a data warehouse, along with the main ETL aspects and their best practices: techniques for configuration management, backup and recovery strategy, effective user administration, performance tuning, guidelines for building reusable components, naming standards and customization of the DataStage tool. The paper also includes a comparison study of some performance-related DataStage processes, such as bulk loading and hashed file lookup, observed during implementation. It will enlighten ETL developers with the practical knowledge gained through building a data warehouse using DataStage, and is an effort to put across the experience gained at this point in time.