Designing your BI Architecture


IBM Software Group
Designing your BI Architecture: Data Movement and Transformation
David Cope, EDW Architect, Asia Pacific
2007 IBM Corporation

DataStage and DWE SQW
[Architecture diagram: source systems (complex files, ERP, IMS, XML, other databases) feed an ETL engine and SQL scripts that load the DB2 EDW.]

IBM Information Server: Delivering information you can trust
[Overview diagram: Information Services Director exposes four capabilities - Understand (Information Analyzer), Cleanse (QualityStage), Transform & Move (DataStage), and Federate (Federation Server) - on a shared Metadata Server, with parallel processing and rich connectivity to applications, data, and content.]

IBM Information Server Architecture
- Unified user interface: analysis interface, development interface, web admin interface
- Common services: metadata services, unified service deployment, security services, logging and reporting services
- Unified parallel processing: understand, cleanse, transform, deliver
- Unified metadata: design and operational metadata
- Common connectivity: structured and unstructured data, applications, mainframe

Introducing DataStage
WebSphere DataStage Client: Designer, Director, Administrator, Manager
WebSphere DataStage Server:
- Integrates data from the widest range of enterprise and external data sources
- Incorporates data validation rules
- Processes and transforms large amounts of data using scalable parallel processing
- Handles very complex transformations
- Manages multiple integration processes
- Provides direct connectivity to enterprise applications as sources or targets
- Leverages metadata for analysis and maintenance
- Operates in batch, in real time, or as a Web service

IBM DataStage Enterprise Edition Components (Client/Server Development Environment)
- Designer: a design interface used to create WebSphere DataStage applications, known as jobs. User: ETL Developer.
- Manager: used to view and edit the contents of the WebSphere DataStage Repository. User: ETL Developer.
- Administrator: used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria. User: ETL Administrator.
- Director: used to validate, schedule, run, and monitor DataStage jobs. User: ETL Developer / ETL Operator.

What is Enterprise Edition?
WebSphere DataStage Enterprise Edition (EE) takes performance to a new level, allowing you to handle the massive volume, velocity, and variety of data flowing into your organization. Enterprise Edition provides native parallel processing capabilities, including:
- Near-linear scalability across parallel hardware environments
- Isolation of job design from actual runtime resources (hardware and software)
- Data pipelining
- Data partitioning, including automatic and dynamic re-partitioning
- Parallel I/O
- High-performance parallel Sort, Aggregator, Lookup, Join, and Merge stages
- Native (compiled) parallel Transformer
- Parallel database interfaces
- More than 50 native parallel stages

DataStage Enterprise Edition Architecture
[Architecture diagram: the DataStage Client (Manager, Designer, Director on WinNT or Win2000) connects through the DataStage Connect API to the DataStage Server + Enterprise Edition (Win2003/Linux/UNIX/USS, running on uniprocessors, SMPs, clusters, or MPPs), which reads data sources and writes targets (database or file) over ODBC or native connectors.]

Traditional Batch (ETL) Processing
- Writes to disk and reads from disk before each processing operation
- Sub-optimal utilization of resources: a 10 GB stream can lead to 70 GB of I/O, since each intermediate step that lands the full data set on disk adds both a write and a re-read, and processing resources sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical with big data volumes: disk I/O dominates the processing, and terabytes of disk are required for temporary staging

Data Flow Architecture: Data Pipelining
- Think of a conveyor belt moving records from step to step
- Each step runs simultaneously, passing data records downstream; e.g. Transform, Enrich, and Load all run at the same time
- Eliminates intermediate staging to disk
- Keeps the processors busy
- But pipelining alone still limits overall scalability

Combined Partition and Pipeline Parallelism
- Partition parallelism is combined with pipelining
- Record repartitioning occurs automatically
- No need to repartition data as you add processors or change the hardware architecture
- A broad range of partitioning methods is available

Execution, Production Environment
- Supports all hardware configurations with a single job design
- Scale by simply adding processors or nodes, with no application change or re-compilation
- An external configuration file specifies the hardware configuration and resources
- Unlimited scalability

Job Design vs. Execution
The developer assembles the flow using DataStage Designer; at runtime, the job runs in parallel on any configuration (1 node, 4 nodes, N nodes). There is no need to modify or recompile the job design.

Job Monitoring and Scheduling

Job Performance Analysis
A visualization tool which:
- Provides deeper insight into runtime job behavior
- Offers several categories of visualizations, including record throughput, CPU utilization, job timing, job memory utilization, and physical machine utilization

DataStage and DWE SQW
[Section divider: architecture diagram with sources (complex files, ERP, IMS, XML, other) loading the DB2 EDW.]

SQL Warehousing Tool (SQW)
Build and execute intra-warehouse (SQL-based) data movement and transformation services.
Integrated development environment and metadata system:
- Model logical flows of higher-level operations
- Generate code and create execution plans
- Test and debug flows
- Package generated code and artifacts into a data warehouse application
- Integrate SQW flows and DataStage jobs
Runtime infrastructure:
- Configuration of runtime environments
- Deployment of warehouse applications
- Manage, execute, and monitor processes and activities
- SQW flows execute in a DB2 execution database; DataStage jobs execute on a DataStage server

[Lifecycle diagram spanning design, deployment, and administration:
- Design: data/mining flow creation and control flow creation in the GUI (Design Center, non-WAS), with a debugger and executor
- Deployment preparation: define a warehouse application, parameterize the application and generate execution plans (EPG), create a deployment package
- Production deployment via the Admin Console: deploy the application (WAS), prepare the DB environment, production ready
- Administration: schedule processes, statistics and logging, manage processes
- Execution: the DIS executor drives the DataStage execution engine and the DB2 SQL execution engine between sources and targets]

DWE Components
[Component diagram: the Design Studio hosts the Control Flow Editor (operators such as FTP, FF/JDBC, DS Job, SQL DF, Email, Verify) and the Data/Mining Flow Editor (operators such as SQL source, DS, Extract, SQL Join, Lookup, subflow), backed by metadata in the Eclipse Modeling Framework. At run time, the WebSphere Application Server (DIS), the DataStage Server, the metadata store, DB2, and the DWE Admin Console (web browser) execute and manage the flows.]

Life Cycle of a SQW Data Warehouse Application
1. Install and set up design and runtime environments
2. Design and validate data flows
3. Test-run data flows
4. Design and validate control flows
5. Test-run control flows
6. Prepare the control flow application for deployment
7. Deploy the application (from the console)
8. Run and manage the application at the process (control flow) level (from the console)
9. Iterate based on changes in source and target databases
Note: For testing purposes, you can design and run applications from the Design Studio (built-in runtime environment without WebSphere; you just need a DB2 instance).

Data Flows: Definition and Simple Example
Data flows are models that represent data movement and transformation requirements. The SQW code generator translates the models into repeatable, SQL-based warehouse building processes. Data from source files and tables moves through a series of transformation steps and then loads or updates a target file or table.
The following example selects data from a DB2 staging table, removes duplicates, sorts the rows, and inserts the result into another DB2 table. Discarded duplicates go to a flat file.
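To make the example concrete, here is a minimal sketch of the kind of SQL the SQW code generator could produce for this flow, assuming hypothetical table names (STAGE.ORDERS_STG and MART.ORDERS) that are not taken from the presentation. The duplicates discarded to the flat file would be handled by a separate export step, not by this statement.

-- Minimal sketch (hypothetical names): remove duplicates from the staging
-- table and insert the result into the warehouse table.
INSERT INTO MART.ORDERS (ORDER_ID, CUSTOMER_ID, ORDER_DATE, AMOUNT)
  SELECT DISTINCT ORDER_ID, CUSTOMER_ID, ORDER_DATE, AMOUNT
  FROM STAGE.ORDERS_STG;
-- The Order By operator would contribute an ORDER BY clause where ordering
-- matters, for example when exporting the sorted result to a file.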

Data Flows: Anatomy
- Operators: sources, targets, and transformations
- Ports: define the points of data input or output for an operator, and also define the data layout
- Connectors: direct the flow of data from an output port of one operator to the input port of another operator
[Diagram: several source operators feed transform operators through input/output ports, ending at a target operator.]

Data Flows: Source and Target Operators
Sources:
- File import
- Table source
- SQL replication source
Targets:
- File export
- Table target (SQL insert, update)
- Bulk load target (DB2 load utility)
- SQL merge (upsert) (see the sketch below)
- Slowly changing dimension (SCD)
- Data station (special staging operator, intermediate target)
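The SQL merge (upsert) target maps naturally onto a DB2 MERGE statement. This is a hedged sketch with hypothetical table and column names, not code from the presentation.

-- Upsert: update matching warehouse rows, insert the rest (hypothetical names).
MERGE INTO MART.CUSTOMER AS tgt
USING STAGE.CUSTOMER_STG AS src
  ON tgt.CUSTOMER_ID = src.CUSTOMER_ID
WHEN MATCHED THEN
  UPDATE SET tgt.NAME = src.NAME,
             tgt.CITY = src.CITY
WHEN NOT MATCHED THEN
  INSERT (CUSTOMER_ID, NAME, CITY)
  VALUES (src.CUSTOMER_ID, src.NAME, src.CITY);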

Data Flows: Transform Operators
- Select list (columns and expressions)
- Distinct (similar to a SELECT DISTINCT)
- Where condition (constraints)
- Table join (inner and outer joins supported)
- Group by (aggregations, HAVING clause)
- Order by
- Union (also INTERSECT and EXCEPT)
- Pivot and unpivot
- Key lookup
- Fact key replace
- Sequence (DB2 key generator)
- Splitter
- Custom SQL
- DB2 table function
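To show how several of these operators map onto ordinary SQL, here is an illustrative sketch combining a select list, a table join (used as a key lookup), a where condition, a group by with a HAVING clause, and an order by. The table and column names are hypothetical.

-- Join staged sales to a product dimension, filter, aggregate,
-- and keep only groups above a threshold.
SELECT p.PRODUCT_KEY,
       s.SALE_DATE,
       SUM(s.AMOUNT) AS TOTAL_AMOUNT          -- select list with an expression
FROM STAGE.SALES_STG AS s
JOIN MART.PRODUCT_DIM AS p                    -- table join / key lookup
  ON p.PRODUCT_CODE = s.PRODUCT_CODE
WHERE s.SALE_DATE >= '2007-01-01'             -- where condition
GROUP BY p.PRODUCT_KEY, s.SALE_DATE           -- group by
HAVING SUM(s.AMOUNT) > 0                      -- HAVING clause
ORDER BY p.PRODUCT_KEY, s.SALE_DATE;          -- order by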

Data Flows: Operator Properties
- A Properties view is available for all operators
- There are properties for operators and properties for operator ports
- Properties are duplicated in a wizard view for object-dependent operators (table/file sources and targets, data station, etc.)
- The wizard view prompts for the object definition but does not require it
- The Properties view approach is the standard Eclipse interface for defining object attributes
[Screenshots: Properties Wizard and Properties View.]

Data Flows: Ports and Port Properties
- Operators have input and/or output ports
- Connections go from upstream output ports to downstream input ports
- Ports have properties (virtual table definitions)

Data Flows: Column Level Connections
- Connections may need to be made at the column level
- You might change your mind about a flow definition, delete a connection, or delete an upstream operator, or you might not use all of the attributes you defined downstream
- In such cases, you can use column-level connections to refresh or modify the new input schema

Data Flows: Variables
Variables can be used in data flows to:
- Defer the definition of certain properties (file names, table names, database schema names, etc.) until a later phase in the life cycle
- Generalize a data flow

Data Flows: Variable Definition and Selection
- Define a variable using the Variable Manager
- Set its properties, current value, and phase (the phase determines when the value can be set during the life cycle)
- The same variable can be used in multiple operators in different flows
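To show the idea, here is a hedged sketch of how a schema-name variable might appear in the SQL behind a generalized flow. The variable name STAGE_SCHEMA and the ${...} substitution syntax are purely illustrative and are not the exact notation used by SQW.

-- Template with an illustrative variable reference; the schema name is
-- resolved at a later phase (for example at deployment or at run time).
INSERT INTO MART.ORDERS (ORDER_ID, AMOUNT)
  SELECT ORDER_ID, AMOUNT
  FROM ${STAGE_SCHEMA}.ORDERS_STG;
-- With STAGE_SCHEMA resolved to STAGE_PROD, the generated statement reads
-- from STAGE_PROD.ORDERS_STG; resolved to STAGE_TEST, it reads from
-- STAGE_TEST.ORDERS_STG.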

Data Flows: Validation
When you save a data flow or validate it, any errors are identified. Yellow exclamation marks are warnings; red X marks are serious errors. Hover help text exists for these error conditions; just mouse over the icon. Also check the Problems view (next to Properties) to see the errors. Validation rules cover a variety of error conditions, such as missing links and properties.

Data Flows: Data Station Operators
- Staging points in a data flow
- Station types: persistent table, temporary table, view, or file (temporary tables and views are dropped after execution)
- Data stations with persistent tables can serve as target operators
- Useful as a recovery mechanism and as a checkpoint (what does the data set look like at this point in the flow?)
- Pass-through option: switch the data station on and off for different runs
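As a rough illustration of a persistent-table data station used as a checkpoint, the generated code could materialize the intermediate result in a staging table that downstream steps then read. The names used here (STAGE.CHECKPOINT_SALES and the surrounding tables) are hypothetical.

-- Materialize the intermediate result set (the data station) so it can be
-- inspected, and so a failed run can restart from this point.
INSERT INTO STAGE.CHECKPOINT_SALES (PRODUCT_CODE, SALE_DATE, AMOUNT)
  SELECT PRODUCT_CODE, SALE_DATE, AMOUNT
  FROM STAGE.SALES_STG
  WHERE AMOUNT IS NOT NULL;
-- Downstream operators read from the checkpoint instead of recomputing it.
SELECT PRODUCT_CODE, SUM(AMOUNT) AS TOTAL_AMOUNT
FROM STAGE.CHECKPOINT_SALES
GROUP BY PRODUCT_CODE;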

Data Flows: Subflows
A subflow is a predefined set of operators that you can place inside a data flow.
- Useful as a plug-in for multiple versions of the same or similar data flows
- Containers or building blocks for complex flows (division of labor)
- Blue ports represent subflow inputs and outputs

Data Flows: Subflows
Subflows consist of input ports and/or output ports plus operators. Where does the subflow fit?
- Input ports only: subflow at the beginning of a data flow
- Output ports only: subflow at the end of a data flow
- Input and output ports: subflow is mid-flow
After creating a subflow, drop it into a data flow.
- Subflows can be nested
- Data flows can be saved as subflows
- DataStage jobs can be imported into data flows as subflows

Data Flows: Design Studio Execution
- Validate the flow first and troubleshoot any errors
- Generate and review the code (optional)
- Complete the Flow Execution wizard: choose or define the run profile, select resources and variable values if required, and wait for the execution results to be displayed
- Design Studio execution is intended for testing and training purposes; deploy applications to the DWE Runtime for production runs, scheduling, and administration

Data Flows: Testing Logs and Tracing
- Diagnostics tab of the Flow Execution wizard
- Log file path
- Log files can be appended to or overwritten
- Tip: Tracing overhead does not depend on the data input size, so tracing time will be negligible for large data sets.

Data Flows: Complete Example

Control Flows: Definition and Simple Example
A control flow is a container model that sequences one or more data flows and integrates other data processing rules and activities.
- Data warehouse applications that you deploy to the DWE Runtime Environment depend on control flows
- You cannot deploy data flows independently; wrap them inside a control flow first
This simple example processes two data flows in sequence; if either fails, an e-mail is sent to an administrator.

Control Flows: Anatomy
- Operators: define the type of activity
- Ports: define the entry and exit points of an operator
- Connectors: direct the processing flow of control between operators

Control Flows: Ports
Port types: entry, on-success exit, on-failure exit, and unconditional exit. An unconditional connection supersedes conditional connections.

Control Flows: Start/End Operators
- Start: provides the entry point; only one Start operator per control flow
- On-Failure process: invoked after an activity's on-failure branch, if any
- Cleanup process: invoked after reaching the terminal point of any branch
- These end-of-flow processes are optional, but a control flow may have multiple as needed

Control Flows: Operators
- SQW flow operators: Data Flow, Mining Flow
- Command operators: DB2 Shell (OS scripts), DB2 Scripts, FTP, Executable
- Control operators: File Wait, Iterator, End, Email
- DataStage operators: Job Sequence, Parallel Job

Control Flows: Iterators
Data processing loops that iterate over:
- A series of delimited items in a file
- A series of files in a directory
- A fixed number of operations
For example, a data flow can be executed multiple times inside one control flow, based on the existence of a set of different input files at runtime.

Control Flows: Design Studio Execution
- Validate the flow first and troubleshoot any errors
- Generate and review the code (optional)
- Complete the Flow Execution wizard: choose or define the run profile, select resources and variable values if required, and wait for the execution results to be displayed
- Design Studio execution is intended for testing and training purposes; deploy applications to the DWE Runtime for production runs, scheduling, and administration
- Code for control flow operators is validated and generated sequentially
- For subflows/macros, code is generated every time the subflow is referenced in the data flows

Control Flows: Command Line Execution
- Execute a data warehouse application process through a command line interface
- A Java program that can be invoked outside of WAS, for example:
  startsqwinstance -app <application_name> process <process_name>
- Embeddable inside a user application; for example, a means to integrate a third-party or customized scheduler by invoking a data warehouse process directly from the third-party scheduler application
Examples of the command line interface:
- getsqwapplicationlist file filename: get the list of applications from an application profile
- getsqwprocesslist app app_name: get the list of instances of an application
- startsqwinstance app app_name process process_name: start an application process
- restartsqwinstance: restart an application instance
- setsqwapplicationstatus: enable/disable an application
- setsqwprocessstatus: enable/disable a process

Control Flows: Complete Example

DataStage and DWE SQW
[Section divider: architecture diagram with sources (complex files, ERP, IMS, XML, other) loading the DB2 EDW.]

Design Studio with DataStage: Integration points
- Import a DataStage job as an opaque runtime object
- Import a DataStage job as a visual subflow
- Export SQL to DataStage as a command (CMD) operator
- Call DWE flows directly from the DataStage scheduler
[Diagram: the Design Studio (Control Flow Editor and Data Flow Editor, metadata in EMF, code generator/optimizer) integrates at run time with the WebSphere Application Server, the DataStage Server, DB2, and the DWE Admin Console.]

Integrated Tools for Dynamic Warehousing
Seamless integration of DataStage jobs into the SQW environment (IBM Information Server).

Import capabilities - Subflow
- From the DataStage Designer, export a DataStage job in XML format
- Bring the job into the Design Studio as a subflow

Import Control Flow
- Not really an import, per se
- The ability to execute a DataStage job or sequence as a black box within a control flow

Export capabilities
- Deploy a data flow as a set of DataStage executables (SQL, XML, and DSX files)
- Open the data flow in the DataStage Designer as a parallel job
