Saving ETL Costs Through Virtualization Across the Enterprise
IBM Virtualization Manager for z/OS
Marcos Caurim, z Analytics Technical Sales Specialist
2017 IBM Corporation
What Is Wrong with the Status Quo?
"There is not enough time in the day to move all the data."
"My mobile users expect to see current data, not yesterday's data."
Current Integration Limitations: Movement Using ETL Tools
[Diagram: data flows from the system of record (OLTP, files) through an ETL server and staging servers into a warehouse that serves SQL reporting, ad-hoc queries, and OLAP.]
- ETL introduces data inconsistency
- High latency
- Complex, with high mainframe costs
ETL Drives Up Mainframe Costs
ETL costs are found in three areas:
- Additional hardware, storage, and networking costs
- Labor involved in managing file transfers
- Wasted system cycles (MIPS)
An IBM study found that moving one terabyte of data, with three derivative copies each day, amortized over a four-year period, added up to $8,269,335. ETL is responsible for consuming 16-18% of total MIPS.
(Clabby Analytics, "The ETL Problem," October 2013)
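To put the study's headline figure in run-rate terms, a quick back-of-envelope amortization (straight-line over four years, an assumption made here purely for illustration) works out as follows:

```python
# Back-of-envelope amortization of the Clabby Analytics figure:
# $8,269,335 to move 1 TB (with three derivative copies daily) over four years.
# Straight-line amortization is an assumption for illustration only.
total_cost = 8_269_335
years = 4

per_year = total_cost / years
per_day = per_year / 365

print(round(per_year))  # annual run rate in dollars
print(round(per_day))   # daily run rate in dollars
```

That is roughly $2.07M per year, or about $5.7K per day, for a single terabyte's worth of daily ETL movement.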
Virtualizing Data Movement
Mainframe virtualization enables data structures that were designed independently to be leveraged together, from a single source, in real time, and without complex, costly data movement.
[Diagram: a logical data source on the mainframe serves cloud, RDBMS, web/mobile, big data, and unstructured consumers.]
Virtualization Use Cases
- Modernization: faster, easier delivery of modern systems of engagement
- Real-time analytics: need for immediate insight into your customers or business
- Optimization: reduce the cost and complexity of accessing mainframe data
IBM Virtualization Manager for z/OS
Cost-Efficient Information Processing
Mainframes have multiple processor types:
- General-purpose processors (GPPs): all processing counts against capacity
- Specialty engines: eligible workloads don't count against GPP capacity
IBM Virtualization Manager can run up to 99% of its own processing on the zIIP engine, enabling mainframe data to be integrated in place without a processing penalty.
[Diagram: eligible workloads can run on the zIIP instead of the GPP.]
Typical ETL Process
[Diagram: extracts from mainframe sources (Adabas, IDMS, Natural, IMS, CICS, sequential files, Db2 for z/OS, VSAM) and non-mainframe sources (Db2 LUW, Informix, dashDB, Oracle, SQL Server via IBM Federation Server) are transformed into compatible formats, staged, and loaded into the warehouse for analytics and search.]
Issues:
- Data inconsistency; not timely
- Complex process, prone to errors
- Costly: high MIPS usage
Augmenting ETL with Virtualization
All data transformations run on the zIIP specialty engine, significantly reducing MIPS capacity usage. Combined data from mainframe and non-mainframe sources is delivered to analytics in the right format, in real time.
[Diagram: IBM Virtualization Server for z/OS (mapping, caching, map/reduce, join, query optimization, parallel I/O, security, monitoring, metadata) runs on the zIIP specialty engine, exposing SQL (JDBC/ODBC/DRDA), NoSQL (JSON), services (SOAP), and z/OS Connect REST APIs over mainframe sources (Adabas, IDMS, Natural, IMS, CICS, sequential files, Db2 for z/OS, VSAM) and distributed sources (Db2 LUW, Informix, Derby, Oracle, SQL Server via IBM Federation Server).]
Augmenting the Warehouse via DVS
[Diagram: IBM Virtualization Server for z/OS, running on the zIIP specialty engine, joins VSAM data with warehouse data and delivers the combined result to analytics and search through SQL (JDBC/ODBC/DRDA), NoSQL (JSON), services (SOAP), and z/OS Connect REST APIs, over mainframe sources (Adabas, IDMS, Natural, IMS, CICS, sequential files, Db2 for z/OS, VSAM).]
Complex ETL Script
[Diagram: a multi-flow ETL environment — source-system extract program, pre-landing ETL (flow 1), landing ETL (flow 2), staging ETL (flow 3), vendor extract ETL (flow 4), vendor landing ETL (flow 5), and vendor updates (flow 6) — spanning the source-system, services, and database environments, with hub key-generation services, cross-reference tables, and an enterprise exchange interface.]
SQL INSERT INTO ... SELECT Statement
A single SQL INSERT INTO ... SELECT statement can replace complex, hard-to-manage ETL scripts.
[Diagram: IBM Virtualization Server for z/OS, running on the zIIP specialty engine, executes the statement across mainframe sources (SMF, syslogs, tape, Adabas, IDMS, Natural, IMS, CICS, sequential files, Db2 for z/OS, VSAM) and distributed sources (Big SQL, Hadoop, MongoDB, Db2 LUW, Informix, dashDB, Oracle, SQL Server via IBM Federation Server), exposing SQL (JDBC/ODBC/DRDA), NoSQL (MongoDB API), services (SOAP/REST/HTML), web (HTTP), and events (CDC/streams) interfaces to web/mobile, ESB/ETL, analytics/search, and transactional consumers.]
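To show the pattern concretely (using Python's built-in sqlite3 as a stand-in, not IDVM-specific syntax; the table and column names are invented for the example), a single INSERT INTO ... SELECT selects, transforms, filters, and loads in one pass where a traditional ETL flow would extract to a file, transform, and reload:

```python
import sqlite3

# Illustrative only: sqlite3 stands in for virtualized sources behind one
# SQL endpoint. All table/column names are invented for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE policies (id INTEGER, holder TEXT, premium_cents INTEGER)")
con.executemany("INSERT INTO policies VALUES (?, ?, ?)",
                [(1, "Ada", 12000), (2, "Grace", 9550), (3, "Alan", 21025)])
con.execute("CREATE TABLE warehouse_policies (id INTEGER, holder TEXT, premium_usd REAL)")

# One statement replaces a multi-flow extract/transform/load script:
# it selects, converts units (cents -> dollars), filters, and loads at once.
con.execute("""
    INSERT INTO warehouse_policies (id, holder, premium_usd)
    SELECT id, holder, premium_cents / 100.0
    FROM policies
    WHERE premium_cents >= 10000
""")

rows = con.execute(
    "SELECT holder, premium_usd FROM warehouse_policies ORDER BY id").fetchall()
print(rows)  # [('Ada', 120.0), ('Alan', 210.25)]
```

The same shape applies whether the sources behind the SELECT are relational tables or, through a virtualization layer, VSAM files or IMS segments exposed as tables.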
Functional Architecture
[Diagram: input (transactions) -> ingest/transform (ETL) -> persist (landing zone, enterprise warehouse via engineering ETL, Hadoop cluster) -> analyse (exploratory analytics) -> visualise/interact (visualization, reporting, dashboarding).]
Functional Architecture
[Diagram: input (mainframe transactions) -> ingest/transform (ETL/ELT, if necessary) -> persist (analytical LPAR if necessary, enterprise warehouse via engineering ETL, Hadoop cluster) -> analyse (exploratory analytics) -> visualise/interact (visualization, reporting, dashboarding).]
Analytics LPAR Architecture
[Diagram: the mainframe hosts a transactional LPAR (Db2, IMS) and an analytics LPAR (QMF, Cognos, and other tools; SparkSQL, IDVM, and BigSQL access; Db2 and IMS data sharing; IDAA), alongside distributed visualization and distributed data stores (distributed DBs, Hadoop, Db2, dashDB, PDA).]
Ingestion and Integration
Component description: the Integration component focuses on the processes and environments that deal with the capture, qualification, processing, and movement of data, preparing it for storage in the Repository Layer, which is subsequently shared with analytical and access applications and systems.
[Diagram: the OLTP LPAR (Db2, IMS) feeds the Lake LPAR (Db2, IMS, IDAA, data sharing, IDVM, Spark, existing COBOL apps, CDC) and the distributed environment (ETL tool, Hadoop, IDAA Loader, DataStage, Db2, dashDB, PDA).]
- IDAA Loader: loads non-Db2 for z/OS data (IMS, VSAM, logs, etc.) directly into IDAA; can accelerate exploration and discovery.
- CDC: updates, if needed, from an OLTP Db2 schema to an OLAP Db2 schema, and also to IDAA (both OLTP and OLAP).
- Existing COBOL apps: several COBOL programs are already deployed; leverage them in the new Lake LPAR to control data-movement costs, and invest in exploration and discovery to reduce their total number.
- DataStage and other ETL tools: leverage IDVM or SparkSQL to connect to mainframe data when needed, reducing dependency on in-house COBOL development; can be deployed on Linux on the mainframe to reduce latency and footprint; load into Hadoop or into the warehouse and marts, depending on the use case.
- Z Connector for Hadoop: accelerates known mainframe data movement to the Hadoop environment.
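The CDC step above boils down to replaying captured change records against a copy. The sketch below is a minimal illustration of that apply logic only (it is not the IBM CDC product's API; the record shape and field names are invented):

```python
# Minimal sketch of CDC "apply" logic: change records captured from an OLTP
# source are replayed, in order, against an OLAP copy keyed by primary key.
# Illustrative only — not the IBM CDC product API.

def apply_changes(target: dict, changes: list) -> dict:
    """Apply insert/update/delete change records to the target copy."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            target[key] = change["row"]     # upsert the captured row image
        elif op == "delete":
            target.pop(key, None)           # remove the row if present
    return target

olap_copy = {}
captured = [
    {"op": "insert", "key": 1, "row": {"acct": 1, "balance": 500}},
    {"op": "insert", "key": 2, "row": {"acct": 2, "balance": 75}},
    {"op": "update", "key": 1, "row": {"acct": 1, "balance": 425}},
    {"op": "delete", "key": 2},
]
apply_changes(olap_copy, captured)
```

Applying changes in capture order is what keeps the OLAP copy (or IDAA) consistent with the OLTP source without full reloads.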
IBM Analytics Banking: Student Loan Processing
Optimizing ETL to enable faster loan review and approval.
- Mountains of data to process: poor data quality, complicated by millions of records, meant loads took 12 hours.
- Faster time to insight: accessing more than 7 million records went from 12 hours to less than 13 minutes.
- Improved TCO: complex in-memory joins were performed on the mainframe, with 93% of processing on the zIIP engine.
Software: IBM Virtualization Manager for z/OS
The challenge: student loan processing was taking too long due to poor data quality and huge volumes of student data stored in IMS DB on the customer's z12 mainframe. With IBM Virtualization Manager for z/OS accessing more than 7 million IMS records, the cycle went from 12 hours (via ETL) to less than 13 minutes. Complex in-memory joins were performed on the mainframe, with 93% of the related processing running on the IBM z Integrated Information Processor (zIIP). The lending institution was able to use real-time insight to process student loans faster and more accurately, improving business efficiency and avoiding regulatory fines.
Unlocking Z for Real-time Business Insight
IBM Virtualization Manager for z/OS is:
- Simple: transactional access with no data movement
- Open to all apps: modern APIs enable access, including to non-z/OS data
- Secure: avoids risk by reducing data movement off Z Systems
- Fast: exploits the Z architecture, including parallelism and in-memory processing
- Cost-effective: keeps Z costs down with up to 99% zIIP offload
IBM Analytics Insurance: North American Insurance Firm
Modernization to accelerate adding new online customers.
- From days to milliseconds: online account origination went from 3 days to 200 milliseconds.
- Improved operational efficiency: overcame time delays associated with inefficient batch processes.
- API-enabled IBM Z apps/data: enhanced developer productivity with APIs to actuarial data in IMS DB.
Software: IBM Virtualization Manager for z/OS; IBM z/OS Connect Enterprise Edition
The challenge: new online customers at a major insurance company had to wait days for confirmation of coverage when adding a new insurance product (motorcycle, boat, RV, etc.). Batch processes associated with the policy management system running on their z13 mainframe meant a new product request took approximately 3 business days to complete. Actuarial data in IMS DB was API-enabled using IBM Virtualization Manager, which allowed developers to incorporate risk calculations and cost estimates into a new online service. Online policy origination went from 3 days to 200 milliseconds, and registered 400+ new policies in the first 2 weeks after going live.
IBM Analytics Financial Services: Global Financial Services Firm
Real-time, self-service analytics for faster insight into customer investment needs.
- Huge data volumes: 15 VSAM files concatenated together brought back 17 million records.
- Faster time to insight: enabling portfolio managers to provide timely investment advice.
- Real-time information: business analysts no longer waited for data to be loaded.
Software: IBM Virtualization Manager for z/OS
The challenge: prior to doing analytics, business analysts had to enlist database programmers to create reports from VSAM data residing on the IBM z13 mainframe. Getting mainframe data into the data warehouse involved a complicated, multi-step extraction process that created delays for business analysts. IBM Virtualization Manager enabled real-time access to IMS DB and VSAM data from the online dashboard of the business intelligence application. Analysts can respond faster to business requests for customer insights, enabling portfolio managers to use the intelligence to make more relevant, timely investment suggestions to their clients.
Thank You
Backup slides
Runtime flow
[Diagram: sources (transactions, mainframe) feed integration (ETL), the analytical lake storage (landing zone, enterprise warehouse and marts, archive), and the discovery & exploration, engineering, and stewardship functions, producing actionable insight for interactive and long-running workloads.]
1. Transaction data is extracted on a periodic basis from operational systems. Mainframe data can be accessed directly for discovery and exploration, and is extracted only as needs and use cases require (not all data needs to, or should, be moved).
2. Data is ingested into the analytics environment using an ETL engine (DataStage or BigIntegrate), which generates the technical and operational metadata and stores it in the metadata repository for access during engineering, stewardship, and discovery.
3. Data is placed initially, when needed, in a landing zone (Hortonworks) where it can be staged, transformed, and integrated.
4. Data is then loaded into an enterprise warehouse (Db2, dashDB, PDA, IDAA) and possibly into downstream marts (IDAA) for reporting, dashboarding, and other interactive workloads.
5. As data ages, it is extracted from the enterprise warehouse (again using the ETL engine) and loaded into the archive repository (Hadoop), where it can be accessed for long-running workloads such as exploratory analytics. Db2 for z/OS transaction history can leverage IDAA capabilities to archive data.
6. Data in either location (and its associated metadata) can be accessed for engineering, modeling, etc., using InfoSphere Data Architect, and for stewardship (curation, adding business metadata, etc.) and discovery using the Information Governance Catalog UI.
7. Business users and data scientists can access data either directly or through virtualization/federation tools such as BigSQL and IDVM, then visualize and analyze it with their favorite tools (Cognos, SPSS, R Studio, QMF, etc.).
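As a toy illustration of the load-then-age portion of this flow (landing zone -> warehouse -> archive), the sketch below routes freshly landed records into a warehouse and moves aged rows to an archive. The record shape and the 90-day retention threshold are invented for the example:

```python
from datetime import date

# Illustrative sketch of the landing -> warehouse -> archive flow.
# The 90-day retention threshold is an invented example value.
RETENTION_DAYS = 90

def load_and_age(landing_zone: list, today: date):
    """Load landed records into the warehouse, then archive rows past retention."""
    warehouse = list(landing_zone)                 # load into the warehouse
    aged = [r for r in warehouse
            if (today - r["loaded"]).days > RETENTION_DAYS]
    warehouse = [r for r in warehouse if r not in aged]
    archive = aged                                 # extract aged rows to archive
    return warehouse, archive

landing = [
    {"id": 1, "loaded": date(2017, 1, 5)},
    {"id": 2, "loaded": date(2017, 6, 1)},
]
warehouse, archive = load_and_age(landing, today=date(2017, 6, 15))
```

In the real flow the "archive" would be Hadoop (or IDAA's archive capability for Db2 history) rather than an in-memory list, but the age-based routing decision is the same.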
Mainframe data sources
Traditional sources: the original corporate data sources are still very valuable resources. They are made up of application data (CRM, HR, and other customer data systems), transactional data (sales, events, claims, etc.), systems of record (historical data, reference data, etc.), and third-party data (e.g., census data provided by outside organizations).
Sources: Db2, IMS, VSAM, other third-party DBs, logs (SMF, RMF, middleware).
- Db2: a high-performance RDBMS with in-memory capabilities for even faster performance; explores NoSQL capabilities with native support for XML and JSON (up to 540 million transactions per hour arriving through a RESTful web API into Db2); together with IDAA, delivers true hybrid transactional/analytical processing.
- IMS: a high-performance NoSQL (hierarchical) database; Fast Path High Volume Transaction Processing has reached a sustained average rate of over 117,000 transactions per second on a single IMS instance.
- VSAM: Virtual Storage Access Method, another NoSQL data store on the mainframe with extreme performance; Db2 and IMS are built on VSAM; Visa processes up to 145K transactions per second.
- Logs: another very important data source; mainframe logs have well-defined data structures that can and should be used for analytics.
Mainframe analytical data lake storage
Component description: the Mainframe Analytical Data Lake Storage component is a set of secure data repositories allowing for discovery and exploration of real-time data, performing actionable insight, and utilizing enhanced applications, without a need to physically move data from its source. Although not mandatory, it can be used to control mainframe costs and fine-tune workload management.
- Data sharing (Db2, IMS, IDAA): allows applications running on more than one Db2 or IMS subsystem to read and write the same set of data concurrently. Possible architectures include one Db2 member for transactional workload and one Db2 member for analytical workload. This avoids unnecessary ETL: exploration and discovery can start right on transactional data without impacting applications.
- Db2 HTAP (hybrid transactional/analytical processing): leverages the same infrastructure to run any kind of workload. Databases in Db2 are logical objects, which makes it possible to have transactional and analytical data models controlled by the same RDBMS: an OLTP application can access analytical data, and an OLAP application can access transactional data.
- IDAA: can be used to deploy a data warehouse and/or specific data marts directly on the mainframe. IMS, VSAM, and other mainframe data can be loaded directly for use in temporal data marts. Historical Db2 data can be archived to free up mainframe storage while remaining accessible.
Data access
Component description: the Access component expresses the various capabilities needed to interact with the Lake Repository component. These capabilities serve the access needs of data scientists, business analysts, developers, and others who need access to valuable data.
Data virtualization: any approach to data management that allows a user or application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located.
- SparkSQL: securely integrates OLTP and business-critical data; can access almost all types of mainframe data; same distribution as on other platforms, so no mainframe skills are needed; the same application languages can be used (Scala, Python, Java, R, SQL); can be called from BigSQL.
- IBM Virtualization Manager for z/OS: the base for several IBM products such as QMF, Spark for z/OS, and the IDAA Loader; virtualizes almost all data on the mainframe, including third-party DBs like Adabas and IDMS; can virtualize BigSQL objects for easier integration with Hadoop environments; can also virtualize other distributed data stores.
- BigSQL: a Hadoop query engine derived from decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization; can access Db2 for z/OS directly through a DRDA connection, and mainframe data through IDVM or SparkSQL.
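To make the virtualization idea concrete without any IDVM-specific syntax, the sketch below uses sqlite3's ATTACH as a stand-in: two physically separate databases are queried and joined through a single SQL interface, so the application never deals with where each source lives. All table and column names are invented:

```python
import sqlite3

# Illustrative stand-in for data virtualization: sqlite3 ATTACH lets one
# connection join two physically separate databases in a single SQL query,
# much as a virtualization layer joins, e.g., VSAM and warehouse data.
con = sqlite3.connect(":memory:")                  # "warehouse" side
con.execute("CREATE TABLE accounts (acct INTEGER, region TEXT)")
con.execute("INSERT INTO accounts VALUES (1, 'EMEA'), (2, 'AMER')")

con.execute("ATTACH DATABASE ':memory:' AS src")   # second, separate store
con.execute("CREATE TABLE src.balances (acct INTEGER, balance REAL)")
con.execute("INSERT INTO src.balances VALUES (1, 500.0), (2, 75.0)")

# One query spans both stores; the application sees a single logical source.
rows = con.execute("""
    SELECT a.region, b.balance
    FROM accounts AS a
    JOIN src.balances AS b ON a.acct = b.acct
    ORDER BY a.acct
""").fetchall()
print(rows)  # [('EMEA', 500.0), ('AMER', 75.0)]
```

The design point is the same one the slide makes: the consumer writes ordinary SQL against logical names, and the layer underneath resolves format and location.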
Access through SparkSQL, BigSQL, and IDVM
[Diagram: applications reach mainframe data (Db2, IDAA, IMS, VSAM, others) and distributed stores (distributed DBs, Hadoop, Db2, dashDB, PDA) through the SparkSQL, BigSQL, and IDVM layers.]
Access through BigSQL
[Diagram: a distributed application accesses data from several sources (e.g., Hadoop, Db2, and VSAM) through BigSQL, which calls SparkSQL via a UDF, uses native connections to Hadoop and distributed stores, and reaches mainframe data through a JDBC connection to IDVM.]
Access through IDVM
[Diagram: any application (distributed or mainframe) accesses and joins data from several sources (e.g., Hadoop, Db2, and IMS) through a JDBC connection to IDVM, which reaches all data on the mainframe; BigSQL objects can be declared in IDVM to simplify access.]
Access through SparkSQL on the mainframe
[Diagram: data scientist tasks (Scala, Python, Java, R, SQL) run on Spark for z/OS, leveraging mainframe data (Db2, IDAA, IMS, VSAM, others) as well as distributed DBs, Hadoop, Db2, dashDB, and PDA.]