IBM Software Group
Designing your BI Architecture: Data Movement and Transformation
David Cope, EDW Architect, Asia Pacific
2007 IBM Corporation
DataStage and DWE SQW
[Diagram: source systems (Complex Files, ERP, IMS, XML, Other) flow through the ETL Engine and SQL Scripts into the DB2 EDW]
IBM Information Server: Delivering information you can trust
[Diagram: Information Services Director over four functions, Understand (Information Analyzer), Cleanse (QualityStage), Transform & Move (DataStage), and Federate (Federation Server), all sharing a Metadata Server, with parallel processing and rich connectivity to applications, data, and content]
IBM Information Server Architecture
[Diagram: a Unified User Interface (Analysis, Development, and Web Admin interfaces); Common Services (Metadata Services, Unified Service Deployment, Security Services, Logging & Reporting Services); Unified Parallel Processing of Understand, Cleanse, Transform, Deliver; Unified Metadata (design and operational); Common Connectivity to structured, unstructured, application, and mainframe sources]
Introducing DataStage
WebSphere DataStage Client (Designer, Director, Administrator, Manager) and WebSphere DataStage Server
- Integrates data from the widest range of enterprise and external data sources
- Incorporates data validation rules
- Processes and transforms large amounts of data using scalable parallel processing
- Handles very complex transformations
- Manages multiple integration processes
- Provides direct connectivity to enterprise applications as sources or targets
- Leverages metadata for analysis and maintenance
- Operates in batch, real time, or as a Web service
IBM DataStage Enterprise Edition Components: a Client/Server Development Environment
- Designer: a design interface used to create WebSphere DataStage applications (known as jobs). User: ETL Developer
- Manager: used to view and edit the contents of the WebSphere DataStage Repository. User: ETL Developer
- Administrator: used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria. User: ETL Administrator
- Director: used to validate, schedule, run, and monitor DataStage jobs. Users: ETL Developer / ETL Operator
What is Enterprise Edition?
WebSphere DataStage Enterprise Edition (EE) takes performance to a new level, allowing you to handle the massive volume, velocity, and variety of data flowing into your organization. Enterprise Edition provides native parallel processing capabilities, including:
- Near-linear scalability across parallel hardware environments
- Isolation of job design from actual runtime resources (hardware and software)
- Data pipelining
- Data partitioning (including automatic and dynamic re-partitioning)
- Parallel I/O
- High-performance parallel Sort, Aggregator, Lookup, Join, and Merge
- Native (compiled) parallel Transformer
- Parallel database interfaces
- More than 50 native parallel stages
DataStage Enterprise Edition Architecture
[Diagram: the DataStage Client (Manager, Designer, Director on WinNT or Win2000) connects through the DataStage Connect API to the DataStage Server + Enterprise Edition (Win2003/Linux/UNIX/USS, running on uniprocessors, SMPs, clusters, or MPPs); data flows through ODBC/native interfaces from sources (database or file) to targets (database or file)]
Traditional Batch (ETL) Processing
- Writes to disk and reads from disk before each processing operation
- Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O (for example, the initial 10 GB read plus three intermediate landings that are each written and re-read: 10 + 3 × 20 = 70 GB), and processing resources can sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical with big data volumes: disk I/O consumes the processing capacity, and terabytes of disk are required for temporary staging
Data Flow Architecture: Data Pipelining
- Think of a conveyor belt moving the records from step to step
- Run each step simultaneously, passing data records along; e.g., Transform, Enrich, and Load run simultaneously
- Eliminates intermediate staging to disk
- Keeps the processors busy
- But pipelining alone still limits overall scalability
Combined Partition and Pipeline Parallelism
- Record repartitioning occurs automatically
- No need to manually repartition data as you add processors or change the hardware architecture
- A broad range of partitioning methods is available
Execution: Production Environment
- Supports all hardware configurations with a single job design
- Scale by simply adding processors or nodes, with no application change or re-compilation
- An external configuration file specifies the hardware configuration and resources (a sketch follows this list)
- Unlimited scalability
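For illustration, here is a minimal two-node parallel configuration file sketch; the node names, host names, and resource paths are hypothetical, not taken from this deck. Adding node entries to such a file scales the same compiled job with no design change:

```
{
  node "node1" {
    fastname "etlhost1"
    pools ""
    resource disk "/ds/data1" { pools "" }
    resource scratchdisk "/ds/scratch1" { pools "" }
  }
  node "node2" {
    fastname "etlhost2"
    pools ""
    resource disk "/ds/data2" { pools "" }
    resource scratchdisk "/ds/scratch2" { pools "" }
  }
}
```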
Job Design vs. Execution
- The developer assembles the flow using the DataStage Designer
- At runtime, the same job runs in parallel on any configuration (1 node, 4 nodes, N nodes)
- No need to modify or recompile the job design!
Job Monitoring and Scheduling
Job Performance Analysis
A visualization tool which:
- Provides deeper insight into runtime job behavior
- Offers several categories of visualizations, including:
  - Record throughput
  - CPU utilization
  - Job timing
  - Job memory utilization
  - Physical machine utilization
DataStage and DWE SQW
[Section divider: sources (Complex Files, ERP, IMS, XML, Other) flowing into the DB2 EDW]
SQL Warehousing Tool (SQW)
Build and execute intra-warehouse (SQL-based) data movement and transformation services
- Integrated development environment and metadata system:
  - Model logical flows of higher-level operations
  - Generate code and create execution plans
  - Test and debug flows
  - Package generated code and artifacts into a data warehouse application
  - Integrate SQW flows and DataStage jobs
- Runtime infrastructure:
  - Configuration of runtime environments
  - Deployment of warehouse applications
  - Manage, execute, and monitor processes and activities
- SQW flows execute in a DB2 execution database; DataStage jobs execute on a DataStage server
[Diagram: the SQW life cycle from design to production. Design: data/mining flow and control flow creation in the GUI (Design Center, non-WAS, with debugger and executor). Deployment preparation: define a warehouse application, parameterize the application, generate execution plans (EPG), and create a deployment package. Production deployment via the Admin Console: deploy the application (WAS) and prepare the DB environment. Execution: the DIS executor on WAS drives the DataStage execution engine and the DB2 SQL execution engine between sources and targets. Administration: schedule and manage processes, with statistics and logging]
DWE Components
[Diagram: the Design Studio, with a control flow editor and a data/mining flow editor over shared metadata (Eclipse Modeling Framework), designs flows built from operators such as FTP, FF/JDBC, DS Job, SQL, DF, Email, Verify, Extract, Join, Lookup, and subflows; at run time, WebSphere Application Server (DIS), a DataStage Server, DB2, and the metadata database execute them, administered from the DWE Admin Console in a Web browser]
Life Cycle of a SQW Data Warehouse Application
1. Install and set up design and runtime environments
2. Design and validate data flows
3. Test-run data flows
4. Design and validate control flows
5. Test-run control flows
6. Prepare the control flow application for deployment
7. Deploy the application (from the console)
8. Run and manage the application at the process (control flow) level (from the console)
9. Iterate based on changes in source and target databases
Note: For testing purposes, you can design and run applications from the Design Studio (a built-in runtime environment without WebSphere; you just need a DB2 instance)
Data Flows: Definition and Simple Example
- Data flows are models that represent data movement and transformation requirements
- SQW Codegen translates the models into repeatable, SQL-based warehouse-building processes
- Data from source files and tables moves through a series of transformation steps, then loads or updates a target file or table
- The following example selects data from a DB2 staging table, removes duplicates, sorts the rows, and inserts the result into another DB2 table; discarded duplicates go to a flat file (a SQL sketch follows)
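A minimal sketch of the kind of SQL this example flow corresponds to; the staging and target tables (STAGE.CUSTOMER_RAW, DWE.CUSTOMER_CLEAN) and their columns are hypothetical, and the actual SQW-generated code will differ:

```sql
-- Insert distinct rows from the staging table into the target
-- (the Order by operator would add an ORDER BY to the generated query).
INSERT INTO DWE.CUSTOMER_CLEAN (CUST_ID, CUST_NAME, REGION)
SELECT DISTINCT CUST_ID, CUST_NAME, REGION
FROM STAGE.CUSTOMER_RAW;

-- Rows discarded as duplicates (the set that goes to the flat file).
SELECT CUST_ID, CUST_NAME, REGION
FROM (SELECT CUST_ID, CUST_NAME, REGION,
             ROW_NUMBER() OVER (PARTITION BY CUST_ID, CUST_NAME, REGION
                                ORDER BY CUST_ID) AS RN
      FROM STAGE.CUSTOMER_RAW) AS DUP
WHERE RN > 1;
```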
Data Flows: Anatomy
- Operators: sources, targets, and transformations
- Ports: define the points of data input or output for an operator; also define the data layout
- Connectors: direct the flow of data from an output port of one operator to the input port of another operator
[Diagram: source operators feeding transform operators through input (I) and output (O) ports, ending at a target]
Data Flows: Source and Target Operators
Sources:
- File import
- Table source
- SQL replication source
Targets:
- File export
- Table target (SQL insert, update)
- Bulk load target (DB2 load utility)
- SQL merge (upsert; see the sketch after this list)
- Slowly changing dimension (SCD)
- Data station (special staging operator, intermediate target)
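Conceptually, the SQL merge (upsert) target produces a DB2 MERGE statement along these lines; the table and column names are hypothetical, not actual SQW output:

```sql
-- Upsert: update matching dimension rows, insert new ones.
MERGE INTO DWE.DIM_PRODUCT AS tgt
USING STAGE.PRODUCT_DELTA AS src
  ON tgt.PRODUCT_ID = src.PRODUCT_ID
WHEN MATCHED THEN
  UPDATE SET PRODUCT_NAME = src.PRODUCT_NAME,
             LIST_PRICE   = src.LIST_PRICE
WHEN NOT MATCHED THEN
  INSERT (PRODUCT_ID, PRODUCT_NAME, LIST_PRICE)
  VALUES (src.PRODUCT_ID, src.PRODUCT_NAME, src.LIST_PRICE);
```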
Data Flows: Transform Operators
- Select list (columns and expressions)
- Distinct (similar to a SELECT DISTINCT)
- Where condition (constraints)
- Table join (inner and outer joins supported)
- Group by (aggregations, HAVING clause)
- Order by
- Union (also INTERSECT and EXCEPT)
- Pivot and unpivot
- Key lookup
- Fact key replace
- Sequence (DB2 key generator)
- Splitter
- Custom SQL
- DB2 table function
A SQL sketch combining several of these operators follows.
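To ground several of these operators, here is an illustrative DB2 query combining Where condition, Table join, Group by (with a HAVING clause), and Order by; the schema, table, and column names are hypothetical:

```sql
-- Join, filter, aggregate, and sort in one generated statement.
SELECT st.REGION,
       SUM(s.AMOUNT) AS TOTAL_SALES
FROM STAGE.SALES AS s
JOIN STAGE.STORES AS st              -- Table join (inner)
  ON s.STORE_ID = st.STORE_ID
WHERE s.SALE_DATE >= '2007-01-01'    -- Where condition
GROUP BY st.REGION                   -- Group by
HAVING SUM(s.AMOUNT) > 100000        -- HAVING clause
ORDER BY TOTAL_SALES DESC;           -- Order by
```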
Data Flows: Operator Properties
- A Properties view exists for all operators: properties for operators and properties for operator ports
- Properties are duplicated in a wizard view for object-dependent operators (table/file sources and targets, data station, etc.)
- The wizard view prompts for the object definition but does not require it
- The Properties view approach is the standard Eclipse interface for defining object attributes
[Screenshots: wizard properties and the Properties view]
Data Flows: Ports and Port Properties
- Operators have input and/or output ports
- Connections go from upstream output ports to downstream input ports
- Ports have properties (virtual table definitions)
Data Flows: Column-Level Connections
- Connections may need to be made at the column level, for example when:
  - You change your mind about a flow definition, delete a connection, or delete an upstream operator
  - You do not use all of the attributes that you defined downstream
- You can use column-level connections to refresh or modify the new input schema
Data Flows: Variables
Variables can be used in data flows to:
- Defer the definition of certain properties until a later phase in the life cycle:
  - File names
  - Table names
  - Database schema names
  - etc.
- Generalize a data flow
Data Flows: Variable Definition and Selection
- Define a variable using the Variable Manager
- Set its properties, current value, and phase (the phase determines when the value can be set during the life cycle)
- Use the same variable in multiple operators in different flows
Data Flows: Validation
- When you save or validate a data flow, any errors are identified: yellow exclamation marks are warnings; red X marks are serious errors
- Hover help text exists for these error conditions; just mouse over the icon
- Also check the Problems view (next to Properties) to see the errors
- Validation rules cover a variety of error conditions: missing links and properties, for example
Data Flows: Data Station Operators
- Staging points in a data flow
- Station types: persistent table, temporary table, view, or file (temporary tables and views are dropped after execution)
- Data stations with persistent tables can serve as target operators
- Useful as a recovery mechanism and as a checkpoint (what does the data set look like at this point in the flow?)
- Pass-through option: switch a data station on and off for different runs
A sketch of the corresponding staging object follows.
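For a temporary-table station, the staging object corresponds to a DB2 declared temporary table along these lines (a sketch with hypothetical names; it assumes a user temporary table space exists):

```sql
-- Session-scoped staging table; dropped when execution ends.
DECLARE GLOBAL TEMPORARY TABLE SESSION.STAGE_CHECKPOINT
  (CUST_ID INTEGER, CUST_NAME VARCHAR(100), REGION VARCHAR(50))
  ON COMMIT PRESERVE ROWS NOT LOGGED;
```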
Data Flows: Subflows
- A subflow is a predefined set of operators that you can place inside a data flow
- Useful as a plug-in for multiple versions of the same or similar data flows
- Containers or building blocks for complex flows (division of labor)
- Blue ports represent subflow inputs and outputs
Data Flows: Subflows (continued)
- Subflows consist of input ports and/or output ports and operators
- Where the subflow fits:
  - Input ports only = subflow at the beginning of a data flow
  - Output ports only = subflow at the end of a data flow
  - Input and output ports = subflow is mid-flow
- After creating a subflow, drop it into a data flow
- Subflows can be nested
- Data flows can be saved as subflows
- DataStage jobs can be imported into data flows as subflows
Data Flows: Design Studio Execution
- Validate the flow first and troubleshoot any errors
- Generate and review the code (optional)
- Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
- Wait for the execution results to be displayed
Design Studio execution is intended for testing and training purposes. Deploy applications to the DWE Runtime for production runs, scheduling, and administration.
Data Flows: Testing Logs and Tracing
- Diagnostics tab of the Flow Execution wizard
- Log file path
- Log files can be appended or overwritten
- Tip: tracing overhead does not depend on data input size, so tracing time is negligible for large data sets
Data Flows: Complete Example
Control Flows: Definition and Simple Example
- A control flow is a container model that sequences one or more data flows and integrates other data processing rules and activities
- Data warehouse applications that you deploy to the DWE Runtime Environment depend on control flows: you cannot deploy data flows independently; wrap them inside a control flow first
- This simple example processes two data flows in sequence; if either fails, e-mail is sent to an administrator
Control Flows: Anatomy
- Operators: define the type of activity
- Ports: define the entry and exit points of an operator
- Connectors: direct the flow of control between operators
Control Flows: Ports
- Port types: Entry, On-Success Exit, On-Failure Exit, Unconditional Exit
- An unconditional connection supersedes conditional connections
Control Flows: Start/End Operators
[Diagram: a Start operator leading into a sequence of processes, with on-failure and cleanup branches]
- Start: only one Start operator per control flow
- On-Failure process: invoked after an activity's on-failure branch, if any
- Cleanup process: invoked after reaching the terminal point of any branch; optional, but a control flow may have multiple as needed
Control Flows: Operators
- SQW flow operators: Data Flow, Mining Flow
- Command operators: DB2 Shell (OS scripts), DB2 Scripts, FTP, Executable
- Control operators: File Wait, Iterator, End
- Email operator
- DataStage operators: Job Sequence, Parallel Job
Control Flows: Iterators
Data processing loops that iterate over:
- A series of delimited items in a file
- A series of files in a directory
- A fixed number of operations
For example, a data flow can be executed multiple times inside one control flow, based on the existence of a set of different input files at runtime.
Control Flows: Design Studio Execution
- Validate the flow first and troubleshoot any errors
- Generate and review the code (optional)
- Complete the Flow Execution wizard:
  - Choose or define the run profile
  - Select resources and variable values if required
- Wait for the execution results to be displayed
- Design Studio execution is intended for testing and training purposes; deploy applications to the DWE Runtime for production runs, scheduling, and administration
- Code for control flow operators is validated and generated sequentially
- For subflows/macros, code is generated every time the subflow is referenced in the data flows
Control Flows: Command Line Execution
- Execute a data warehouse application process through a command line interface
- A Java program that can be invoked outside of WAS, for example:
  startsqwinstance -app <application_name> -process <process_name>
- Embeddable inside a user application: for example, a means to integrate a third-party or customized scheduler by invoking a data warehouse process directly from the third-party scheduler application
Examples of command line interface commands:

  Command                                                Description
  getsqwapplicationlist -file filename                   Get the list of applications from an application profile
  getsqwprocesslist -app app_name                        Get the list of instances of an application
  startsqwinstance -app app_name -process process_name   Start an application process
  restartsqwinstance                                     Restart an application instance
  setsqwapplicationstatus                                Enable/disable an application
  setsqwprocessstatus                                    Enable/disable a process
Control Flows: Complete Example
DataStage and DWE SQW
[Section divider: sources (Complex Files, ERP, IMS, XML, Other) flowing into the DB2 EDW]
Design Studio with DataStage: Integration Points
- Import a DataStage job as an opaque runtime object
- Import a DataStage job as a visual subflow
- Export SQL to DataStage as a CMD operator
- Call DWE flows directly in the DataStage scheduler
[Diagram: the Design Studio (control flow editor, data flow editor, EMF metadata, code generator/optimizer) connected at run time to WebSphere Application Server (DIS), the DataStage Server, DB2, and the DWE Admin Console]
Integrated Tools for Dynamic Warehousing
Seamless integration of DataStage jobs into the SQW environment via IBM Information Server
Import Capabilities: Subflow
- From the DataStage Designer, export a DataStage job in XML format
- Bring the job into the Design Studio as a subflow
Import Capabilities: Control Flow
- Not really an import, per se
- The ability to execute a DataStage job or sequence as a black box within a control flow
Export Capabilities
- Deploy a data flow as a set of DataStage executables (SQL, XML, and DSX files)
- Open the data flow in the DataStage Designer as a parallel job