Perform scalable data exchange using InfoSphere DataStage DB2 Connector


Angelia Song (azsong@us.ibm.com), Technical Consultant, IBM
Brian Caufield (bcaufiel@us.ibm.com), Software Architect, IBM
Fan Ding (fding@us.ibm.com), Senior Technical Consultant, IBM
Mi Wan Shum (msshum@us.ibm.com), Senior Performance Manager, IBM
Sriram Padmanabhan (srp@us.ibm.com), Engineering Leader, Analytics and Insights

13 August 2015

DB2 for z/OS delivers innovations in operational efficiency for out-of-the-box savings, business resiliency, warehouse deployment, and enhanced business analytics. It is a cost-effective, simple, and proven database that offers 24x7 availability, reliability, and security. This article illustrates how to use InfoSphere DataStage to access DB2 for z/OS database systems. It also provides insights into the performance of the different connectivity options and presents best-practice recommendations.

Introduction

InfoSphere Information Server is a data integration platform that enables customers to understand, cleanse, transform, and deliver trusted information to their critical business initiatives. It is designed to provide the infrastructure for scalable, high-performance information integration environments, helping organizations turn the data they have into the innovation they need.

InfoSphere DataStage is the information integration component of InfoSphere Information Server. It helps businesses integrate data across multiple high-volume data sources and target applications, and it supports the collection, integration, and transformation of large volumes of data with data structures ranging from simple to complex.

DB2 for z/OS is IBM's relational database management system (DBMS) for the z/OS operating system. It is designed for, and tightly integrated with, the IBM System z mainframe to leverage the strengths of System z. DB2 for z/OS delivers innovations in operational efficiency for out-of-the-box savings, business resiliency, warehouse deployment, and enhanced business analytics, and it is a cost-effective, simple, and proven database that offers 24x7 availability, reliability, and security.

This article illustrates how to use InfoSphere DataStage to access DB2 for z/OS database systems. It also provides insights into the performance of the different connectivity options and presents best-practice recommendations.

Overview of InfoSphere DataStage

InfoSphere DataStage includes design tools that enable developers to create new data integration jobs visually. The term job is used in InfoSphere DataStage to describe extract, transform, and load (ETL) tasks. Jobs are composed from a rich palette of operators called stages. These stages include:

- Source and target access for databases, applications, and files
- General processing stages such as filter, sort, join, union, lookup, and aggregation
- Built-in and custom transformations
- Copy, move, FTP, and other data movement stages
- Real-time, XML, SOA, and message queue processing

In addition, InfoSphere DataStage allows developers to apply pre- and post-conditions to each of these stages. Multiple data integration jobs can be controlled and linked by a sequencer that provides control logic, processing jobs in a particular order based on user-defined triggers. InfoSphere DataStage also provides rich administration capabilities for deploying, scheduling, and monitoring jobs.

One of the strengths of InfoSphere DataStage is that it enables you to design jobs independently of the underlying structure of the system. If the system changes, is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design itself rarely needs to change. This separation is possible because InfoSphere DataStage gathers information about the shape and size of the system from a configuration file provided at run time. This file defines one or more processing nodes (logical, rather than physical) on which the job will run. Using the definitions in this file, DataStage organizes the necessary resources. When the system changes, the file is changed, not the job. Note that the number of processing nodes need not correspond to the number of CPU cores in the system: the file may define one processing node for each physical core, or it may define more or fewer, depending on the requirements of the job. In this manner, InfoSphere DataStage delivers impressive parallel processing capabilities.
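The configuration file itself is a short, plain-text description of the processing nodes. As a minimal sketch, a file defining two logical nodes on a single server might look like the following (the host name and resource paths here are hypothetical; see the parallel engine documentation for the full syntax and options):

    {
        node "node1" {
            fastname "etlserver"
            pools ""
            resource disk "/opt/ibm/ds/data" {pools ""}
            resource scratchdisk "/opt/ibm/ds/scratch" {pools ""}
        }
        node "node2" {
            fastname "etlserver"
            pools ""
            resource disk "/opt/ibm/ds/data" {pools ""}
            resource scratchdisk "/opt/ibm/ds/scratch" {pools ""}
        }
    }

Adding node entries (for example, growing this file to 16 nodes to match a 16-way partitioned table) raises the degree of parallelism without any change to the job design itself.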

The following factors affect the degree of parallelism required by a particular job or application:

- CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism supported by a system.
- Jobs with large memory requirements can take advantage of parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
- Applications that are disk- or I/O-intensive, such as those that extract data from and load data into databases, gain from configurations in which the number of logical nodes equals the number of I/O paths being accessed. For example, if a table is fragmented 16 ways inside a database, or if a data set is spread across 16 disk drives, administrators should set up a node pool consisting of 16 processing nodes.

InfoSphere DataStage DB2 connector

InfoSphere Information Server connectivity options enable InfoSphere DataStage jobs to transfer data between InfoSphere Information Server and external data sources, such as:

- Relational databases
- Mainframe databases
- Enterprise resource planning (ERP) or customer relationship management (CRM) databases
- Online analytical processing (OLAP) or performance management databases

An InfoSphere connector is a component that provides data connectivity and metadata integration for external data sources. A connector typically includes a stage specific to the external data source. InfoSphere supplies a native connector, called the DB2 connector, to access DB2 for Linux, UNIX, and Windows, and DB2 for z/OS databases. When you use InfoSphere DataStage to access DB2 database systems, the DB2 connector provides a single way to work with DB2 data. With the DB2 connector, you can perform transactional and non-transactional operations on DB2 data:

- Read, write, and look up data in DB2 database tables.
- Import metadata of table and column definitions, and lists of DB2 databases, tables, views, and database aliases, into a DB2 connector job.
- Manage rejected data.
- Work with national language data. The DB2 connector provides national language support (NLS).

Prior to the introduction of the DB2 connector, a variety of older stages had been developed to access DB2 data. The DB2 connector effectively replaces all of these older techniques: it is the latest technology, offers more functionality and improved performance, and is recommended for new job designs. If you have existing jobs that use the older stages and would like to update them to use the connector, the InfoSphere Connector Migration Tool can facilitate the migration.

Bulk load into DB2 for z/OS

The InfoSphere DataStage DB2 connector provides the capability to bulk load data into DB2 for z/OS. The following figure shows a DataStage job that reads data from a text file and loads it into a DB2 for z/OS table using the DB2 connector.

Figure 1. Bulk loading data into a DB2 table

To enable the DB2 connector for bulk loading data into DB2 for z/OS, several parameters must be set in the connector. See the IBM InfoSphere Information Server 9.1.2 documentation in IBM Knowledge Center for the necessary configuration steps.

To take advantage of the parallel bulk load feature in the DB2 connector, the Partition type property under the Input > Partitioning tab must be set to DB2 connector.

Figure 2. The Partition type property

In addition, under the Properties tab, the Write mode must be set to Bulk load, and a target table name must be specified for Table name.

Figure 3. Setting the write mode and table name in the DB2 connector

Next, the Bulk load to DB2 on z/OS property must be set to Yes, and the host name or IP address of the DB2 for z/OS system must be specified for the Transfer to property.

If you choose to bulk load to DB2 for z/OS, you must also specify a load method. You can select one of the following valid values:

- MVS dataset(s)
- Batch pipe(s)
- UNIX System Services (USS) pipe(s)

In this example, the Load method is set to USS pipe(s); hence, a valid USS pipe directory must be specified. If another load method is selected, you must complete the fields required for that method.

Figure 4. Selecting the load method

Figure 5. Bulk load into a partitioned DB2 for z/OS table using the USS pipe load method

Figure 5 illustrates the data movement during a bulk load into a partitioned DB2 for z/OS table using the USS pipe load method. On the InfoSphere DataStage system, the DB2 connector runs in parallel, with one connector process per DataStage partition. In addition, one UNIX pipe and one FTP connection are established for each DataStage partition for data transport.

Once the USS pipes are created on the DB2 for z/OS system to buffer the data received by the FTP server (step 1A), the DB2 connector calls the LOAD utility via the DSNUTILS stored procedure (step 1B), and one LOAD utility is invoked for each table partition. When data starts arriving from upstream stages in the DataStage job, the DB2 connector immediately sends it to the UNIX pipe, feeding the FTP client. The data is then sent to the DB2 for z/OS system via FTP, directly into the USS pipe, and loaded into the table partition. The loads execute independently and in parallel for all table partitions, making the DB2 connector a highly efficient approach for bulk load into DB2 for z/OS.
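For context, the work that the connector automates at step 1B is conceptually similar to submitting a LOAD utility control statement that reads from a USS pipe, one statement per partition. The following is an illustrative, simplified sketch only (the template name, pipe path, table, and options are hypothetical; the connector generates and submits the real statements itself through DSNUTILS):

    TEMPLATE SYSREC
        PATH '/u/dsload/pipes/part0001.pipe'
    LOAD DATA INDDN SYSREC LOG NO
        INTO TABLE MYSCHEMA.SALES PART 1 REPLACE

Because each LOAD reads from its own pipe, the per-partition loads can proceed independently and in parallel, which is what gives the pipe-based methods their scalability.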

Read from DB2 for z/OS

The DB2 connector supports reading in parallel according to how the data is partitioned in DB2 for z/OS, that is, how the data is natively partitioned. The following figure shows a DataStage job that reads data from a DB2 for z/OS table using the DB2 connector and outputs the data to a Copy stage.

Figure 6. Reading data from a DB2 table to a Copy stage

The connector determines the number of partitions in the table and dynamically configures the number of DataStage processing nodes to match it. For DB2 for z/OS tables, this is the number of table ("range") partitions. The connector associates each node with one partition and, for each node, reads the records that belong to the partition associated with that node. It accomplishes this by modifying the SELECT statement defined in the stage: it adds a WHERE clause that uses the column(s) in the partitioning key to select the rows belonging to the partition associated with the current node.

The example below demonstrates how the DB2 connector reads data in a natively partitioned fashion. Assume a four-way partitioned DB2 for z/OS table defined as:

    CREATE TABLE TABLE1 (
        COL1 INTEGER,
        COL2 BIGINT
    )
    PARTITION BY (COL2) (
        PART 1 VALUES (10000000000),
        PART 2 VALUES (20000000000),
        PART 3 VALUES (30000000000),
        PART 4 VALUES (MAXVALUE)
    )

If the DB2 connector is configured in the following way, the connector will run four distinct SELECT statements in parallel.

Figure 7. The DB2 connector configuration

    Node 1: SELECT * FROM TABLE1 WHERE ((COL2 <= '10000000000')) AND (COL1 > 10)
    Node 2: SELECT * FROM TABLE1 WHERE ((COL2 > '10000000000') AND (COL2 <= '20000000000')) AND (COL1 > 10)
    Node 3: SELECT * FROM TABLE1 WHERE ((COL2 > '20000000000') AND (COL2 <= '30000000000')) AND (COL1 > 10)
    Node 4: SELECT * FROM TABLE1 WHERE ((COL2 > '30000000000')) AND (COL1 > 10)

To enable the native parallel read in the DB2 connector, the Enable partitioned reads property must be set to Yes, and the Partitioned reads method property must be set to DB2 connector. The name of the table whose partitioning will be used must also be specified in the Table name sub-property. In addition, the Generate partitioning SQL sub-property defaults to Yes and must be set to Yes when reading from DB2 for z/OS.
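For reference, the partition boundaries that drive these generated WHERE clauses can also be inspected manually in the DB2 for z/OS catalog. The following query is a minimal sketch, assuming a table named TABLE1 owned by MYSCHEMA (the connector derives this information internally, and the exact statements it issues may differ):

    -- List each partition of the table's table space with its high limit key
    SELECT TP.PARTITION, TP.LIMITKEY
    FROM SYSIBM.SYSTABLEPART TP
    JOIN SYSIBM.SYSTABLES T
        ON T.DBNAME = TP.DBNAME AND T.TSNAME = TP.TSNAME
    WHERE T.CREATOR = 'MYSCHEMA' AND T.NAME = 'TABLE1'
    ORDER BY TP.PARTITION

Each row returned corresponds to one table partition and its high limit key, matching the ranges that appear in the generated SELECT statements above.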

Discussion of performance results

Performance test system configuration

To evaluate the performance of the different connectivity options in the DB2 connector, the following InfoSphere DataStage server and DB2 for z/OS system were set up.

Table 1. Hardware configuration for the InfoSphere DataStage server
    Processor:        40 Intel Xeon processors @ 2.4 GHz
    Memory:           256 GB
    Network:          10 Gbit Ethernet
    Storage:          DS5300 SAN storage
    Operating system: RHEL 5.8, 64-bit

Table 2. Hardware configuration for the DB2 for z/OS system
    Processor:        14 IBM System z10 CPs (central processors)
    Memory:           80 GB (central storage)
    Network:          10 Gbit Ethernet
    Storage:          700 GB DASD storage
    Operating system: z/OS 1.12

The following table lists the server software used in the performance study.

Table 3. Server software for InfoSphere DataStage and DB2 for z/OS
    InfoSphere DataStage server: InfoSphere Information Server V9.1.2
    Database system:             DB2 for z/OS V10

Bulk load into DB2 for z/OS

Figure 8. Throughput of data bulk load into partitioned DB2 for z/OS tables using different load methods

Figure 8 shows the throughput of bulk loads into partitioned DB2 for z/OS tables using the different load methods. To evaluate bulk load scalability, 1-, 2-, 4-, 8-, and 16-way partitioned tables were tested. For each table, DataStage loads data n ways in parallel, where n matches the number of partitions of the target table. Tests were repeated with each of the three load methods for each partitioned table.

For bulk load using the DB2 connector, our results show that throughput increases with the number of table partitions, though the degree of improvement depends on the specific load method. The USS pipe and batch pipe load methods perform comparably, and both outperform the MVS dataset load method. The two pipe-based load methods scale nearly linearly with the number of table partitions. On our DB2 for z/OS system, during bulk load into a 16-way partitioned table, the 14 CPs were over-committed, limiting the load throughput and causing noticeable deviation from the theoretical linear scalability.

We expect that enabling additional CPs on the system would increase the throughput in these cases.

USS pipes are a standard feature of the USS environment, and their configuration and usage is straightforward. In addition, USS pipes demonstrated performance comparable to batch pipes and significantly better performance than the MVS dataset method. For bulk load into DB2 for z/OS using the DB2 connector, USS pipes are therefore recommended.

Reading from DB2 for z/OS

Figure 9. Throughput of data read from partitioned DB2 for z/OS tables using different methods

Figure 9 shows the throughput of data read from partitioned DB2 for z/OS tables using different methods. To evaluate the native parallel read scalability, 1-, 2-, 4-, 8-, and 16-way partitioned tables were tested. For each table, DataStage reads data n ways in parallel, where n matches the number of target table partitions. Tests were repeated using the DB2 connector and the legacy DB2 for z/OS stage, which is another option for reading data in parallel using the native table partitioning. To illustrate the efficiency of the native parallel read, tests were also conducted using the DB2 connector in sequential mode, where data is read from each table partition serially instead of in parallel.

Our results show that the native parallel read, using either the DB2 connector or the DB2 for z/OS stage, substantially outperforms the sequential read. In addition, for the same number of table partitions, the DB2 connector yields higher throughput than the DB2 for z/OS stage, and it consumes less computing power on the DataStage server, which is attributed to the difference in their underlying implementations. Finally, the throughput of the native parallel read using the DB2 connector scales nearly linearly with the number of table partitions. This combination of performance, lower resource overhead, and scalability makes the DB2 connector the preferred option for data retrieval from DB2 for z/OS.

Conclusion

As demonstrated by the performance study results in this article, the InfoSphere DataStage DB2 connector enables a high-performance, scalable approach for data exchange between DB2 for z/OS and InfoSphere DataStage. This connectivity option enables organizations to leverage these industry-leading database system and data integration platform offerings.

Resources

- Check out InfoSphere DataStage for more information.
- Learn more about the InfoSphere Platform.
- Visit Data Integration to learn more.
- The Information Management area on developerWorks provides resources for architects, developers, and engineers.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Follow developerWorks on Twitter.
- Watch developerWorks demos, ranging from product installation and setup for beginners to advanced functionality for experienced developers.
- Get involved in the developerWorks community. Connect with other developerWorks users while you explore developer-driven blogs, forums, groups, and wikis.

About the authors

Angelia Song
Angelia Song is a member of the Information Server Performance team. Prior to this, she worked with IBM Data Studio tools. She has a master's degree in computer science from the University of Florida.

Brian Caufield
Brian Caufield is a software architect at IBM Silicon Valley Lab. Brian has been working in the DataStage development organization for 10 years and was involved in the design of the Slowly Changing Dimension stage.

Fan Ding
Fan Ding is a member of the Information Server Performance team. Prior to this, he worked in Information Integration Federation Server development. Fan has a doctorate in mechanical engineering and a master's degree in computer science from the University of Wisconsin.

Mi Wan Shum
Mi Wan Shum is a senior performance manager for the Big Data (BigInsights/BigSQL) Performance team at the IBM Silicon Valley Lab. She also has many years of software development experience at IBM. Mi Wan graduated from the University of Texas at Austin.

Sriram Padmanabhan
Sriram Padmanabhan is an IBM Distinguished Engineer and chief architect for IBM InfoSphere servers. He was a research staff member and a manager of the Database Technology group at the IBM T.J. Watson Research Center for several years. Padmanabhan has authored more than 25 publications, including a book chapter on DB2 in a popular database textbook, several journal articles, and many papers in leading database conferences. He is currently the engineering leader, Analytics and Insights, at Apigee.

Copyright IBM Corporation 2015 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)