SQL Maestro and the ELT Paradigm Shift


Abstract

ELT (extract, load, and transform) is replacing ETL (extract, transform, load) as the usual method of populating data warehouses. Modern data warehouse appliances and appliance-like systems, such as IBM PureData/Netezza, are ideally suited for an ELT environment. SQL Maestro is a framework for ELT development, deployment, and production. SQL Maestro is not an ETL tool that has been repurposed for ELT; rather, it is designed from the ground up with ELT in mind and fully supports the ELT paradigm. At the same time, SQL Maestro is familiar and easy to learn for developers with experience in traditional ETL tools.

In the beginning

Data warehousing emerged from the realization that databases that worked well for operational purposes usually did not work well for analytical purposes. In a bank, for example, one operational database might keep track of customers' accounts. Another operational database, from a different vendor and maintained in a different department, contains demographic information about the bank's customers. A marketing analyst wants to know the average account balance for customers in a certain demographic category. This is difficult to determine from the operational databases because (1) information from two different vendors' databases must be combined, and (2) large operations (such as looking at every account balance) put the bank's operations at risk, since they take computing resources away from the basic business of serving customers.

To address these kinds of problems, banks and other businesses started using data warehouses. Data warehouses collected all of the information needed for analytical tasks into an environment where analysts could do their work without affecting the business's operations. Data warehouses were periodically refreshed with new information from the operational systems, preferably during times when demand on the operational systems was low, like the middle of the night.
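To make the bank example concrete, the following is a minimal sketch of the kind of analytical query the warehouse makes practical. The table and column names (wh.accounts, wh.customer_demographics, and so on) are hypothetical and simply stand in for data consolidated from the two operational systems.

    -- Hypothetical warehouse tables consolidated from the two operational systems:
    --   wh.accounts(customer_id, balance)              -- from the account system
    --   wh.customer_demographics(customer_id, segment) -- from the demographics system
    SELECT d.segment,
           AVG(a.balance) AS avg_balance,
           COUNT(*)       AS num_accounts
    FROM   wh.accounts a
    JOIN   wh.customer_demographics d
           ON d.customer_id = a.customer_id
    WHERE  d.segment = 'young professionals'   -- the demographic category of interest
    GROUP  BY d.segment;

Run against the warehouse, a query like this touches neither operational system, so the analyst's work competes with no customer-facing workload.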

It was soon obvious that it wasn't enough to simply move the data from the operational systems to the warehouse unchanged.¹ The data from the operational systems needed to be rearranged, summarized, combined with data from other systems, and in general transformed in order to become useful to data warehouse users. This need gave rise to the Extract, Transform, and Load, or ETL, paradigm: data was extracted from operational systems, suitably transformed, and then loaded into the data warehouse.

The three-step ETL process was conceived as a process involving three servers: the original operational server, an intermediate server which extracted and transformed the data, and the data warehouse itself, which was typically implemented in a traditional database on a third server. An ecosystem of tools and developers evolved to fill the role of the intermediate step of the ETL process. Notable players in this field included Ascential DataStage, Informatica, and Ab Initio. Traditional database vendors such as Oracle also offered ETL tools together with editions of their databases adapted for use as data warehouses. Some business intelligence vendors, such as Business Objects, also had ETL offerings.

¹ This was often done as an intermediate step, resulting in what was called an operational data warehouse, or ODW.

Figure 1. Traditional ETL architecture: data is extracted from the sources (operational systems), transformed on an ETL server, and loaded into the data warehouse.

As ETL implementations became more widespread, the following considerations became dominant:

1. Data volumes rapidly increased. Both the supply of data available from operational systems and users' demand for that data in the warehouse tended to grow exponentially.

2. ETL bottlenecks became important. ETL systems usually collected data from operational systems at night and were expected to have incorporated the new data by the next morning. As data volumes increased, ETL systems found it difficult to meet their service level agreements.

3. ETL bottlenecks often occurred in data movement from one stage to the next.

4. Users submitted increasingly difficult queries to data warehouses, and wanted the results of those queries returned quickly.

5. It was observed that databases that performed well on OLTP tasks (basically, many small queries, each involving few table rows) did not necessarily perform well as data warehouses, whose use generally involved fewer queries over a large number of table rows. (A sketch contrasting the two kinds of query appears below.)

Vendors such as Netezza began to develop systems to address these issues (particularly, at first, issues 4 and 5). Netezza produced a machine that specialized in quickly answering queries that were too large and complex for traditional databases. Netezza called their machines data warehousing appliances.
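As a minimal illustration of the contrast in item 5 (the table names accounts and account_transactions are hypothetical), an OLTP-style query touches a handful of rows, typically through an index, while a warehouse-style query scans and aggregates a large fraction of a table, which is the kind of work an appliance is built for.

    -- OLTP-style: fetch one customer's current balance (a few rows, index lookup)
    SELECT balance
    FROM   accounts
    WHERE  account_id = 1234567;

    -- Warehouse-style: scan years of transaction history and aggregate (millions of rows)
    SELECT EXTRACT(YEAR FROM posting_date) AS posting_year,
           product_code,
           SUM(amount) AS total_amount
    FROM   account_transactions
    GROUP  BY EXTRACT(YEAR FROM posting_date), product_code;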

Data warehouse appliances and the ELT paradigm

Initially, the success of specialized systems such as the Netezza data warehouse appliance lay in their ability to address items 4 and 5 in the list above. It has turned out, however, that the same characteristics that make the appliances excellent at answering difficult queries also make them ideal platforms for performing transformations. This means that the appliances can support a new paradigm for populating data warehouses.

In the ELT (Extract, Load, Transform) paradigm, operational data is loaded directly to the server that hosts the data warehouse, into tables that have the same structure as the source tables. The data is then transformed in place and loaded into the data warehouse tables.

Figure 2. ELT architecture: data is extracted from the sources (operational systems) and loaded directly into the data warehouse platform, where it is then transformed.

The first thing to notice is that, compared with ETL, there is one less hop between machines; in fact, there is one less platform altogether. This is important because, as we mentioned, data movement is a major source of bottlenecks. Having one less platform to buy and maintain, and one less environment for developers to learn, are also obvious advantages. These advantages might make the case for ELT even if the target machine performed less well than traditional ETL servers on the transformation task. But Netezza machines often perform much better.

The Netezza architecture makes queries run faster by distributing them over tens or hundreds of parallel processing nodes. The architecture maintains detailed maps of the slices of data allotted to each node, and employs sophisticated algorithms to minimize the amount of data that needs to be moved to produce the desired result. Netezza will often run queries faster than traditional databases out of the box, that is, with no changes to the queries to accommodate the Netezza architecture. The queries will run even faster if the database designer uses domain knowledge to help Netezza decide how to distribute table data.

These characteristics certainly improve query performance, by orders of magnitude in many cases. But the same characteristics (knowing the contents of each slice of data, minimizing data movement, and organizing data in optimal configurations) are often just as useful for the tasks involved in transforming data. Simply put, Netezza isn't just good at SELECT statements; it is good at SQL and data manipulation in general, and often outperforms traditional dedicated ETL systems at transformation tasks.

The price/performance advantages of the ELT/Netezza approach over traditional ETL are especially striking in a side-by-side comparison of processing power. Enterprise ETL tools are most commonly licensed per processor. Licenses for 48 or 96 processors will cost multiple millions of dollars, a price that includes only software and is many times the price of Netezza hardware and software. This is a potent consideration at a time when an increasing number of companies with moderate IT budgets are dealing with terabyte-range data volumes, and require processing power in this range.
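The following is a minimal sketch of the ELT pattern just described, in Netezza-flavored SQL. The table and column names (stg_orders, dw_customer_daily_sales) are hypothetical, and the DISTRIBUTE ON clause is the point at which a designer can apply the domain knowledge mentioned above to control how rows are spread across the processing nodes.

    -- Staging table mirrors the source system's structure; rows are spread
    -- across the appliance's nodes on the likely join/aggregation key.
    CREATE TABLE stg_orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        order_date   DATE,
        amount       NUMERIC(18,2)
    ) DISTRIBUTE ON (customer_id);

    -- (The extract/load step bulk-copies source rows into stg_orders,
    -- for example via an external table or a bulk loader.)

    -- Transform in place: this set-based SQL runs entirely inside the appliance,
    -- so no data moves between platforms between the load and the transform.
    -- In practice the result would usually be inserted into an existing
    -- warehouse table; a CREATE TABLE AS keeps the sketch self-contained.
    CREATE TABLE dw_customer_daily_sales AS
    SELECT customer_id,
           order_date,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM   stg_orders
    GROUP  BY customer_id, order_date;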

Problems with the ELT approach

We have seen that there are several reasons why ELT would be a tempting choice for a data warehouse architect (and in fact there are other reasons that we haven't mentioned)², especially an architect with a Netezza machine at their disposal. Even so, there are reasons why an architect might hesitate to pursue the ELT path. Designers of ELT systems must deal with the following issues:

1. Lack of tools and resources. The ecosystem of tools and developers that has evolved to support ETL does not exist for ELT. It is difficult to find developers who can implement complex transformations on a Netezza machine.

2. Preference for dataflow diagrams. ETL architects, developers, and the business users they serve tend to think in terms of dataflow diagrams like the one below, rather than the abstract set operations that make up SQL. Traditional ETL tools usually offer a GUI based on the concept of data flows.

Figure 3. Dataflow diagram.

3. Deployment. It is not enough to write a working program that transforms and loads data to the warehouse; the program must then be deployed to testing and production environments. To support deployment, all program features that depend on a particular environment's hardware, drivers, connections, auxiliary programs, and so on must be parameterized. Production environments are substantially different from development environments; typically they are more isolated and are implemented on more powerful machines. It is usually difficult or impossible to test directly on production machines, for fear of contaminating them or interfering with production programs. On the other hand, when a new program or version is taken to production, it is expected to work flawlessly! Deployment is obviously not a problem unique to ELT, ETL, or data warehousing, but data warehouse practitioners have more experience and comfort with deploying ETL systems.

4. Restartability. Loading data warehouses involves processes and transactions that may take many hours to complete. One must assume that these loads will occasionally fail. It is more than likely that failures will occur at inconvenient times, since data warehouses are most often loaded during non-business hours. This is not a good time to deal with complex manual procedures to unwind and clean up the partially accomplished work so that the load can be resumed. On the other hand, production probably won't be able to afford the time to restart from scratch if a significant amount of work has already been done. What is needed is a process that automatically determines the last known good point from which work can be resumed, and automatically clears (or preserves) any temporary files as necessary to produce the correct result. Restartability that meets these criteria should not be added as an afterthought, but rather designed into the process from the beginning.

5. Audit, balance, and control. It is a data warehousing best practice to maintain a database of information about the loading process. Examples of this kind of information include:

- The date and time that a certain target table was loaded from a certain source table.
- How many rows were loaded.
- Whether the load transaction succeeded or failed.
- If appropriate, whether or not the load balanced, that is, whether a total or checksum computed from the target matched the corresponding total or checksum in the source.

Maintaining records concerning the movement and handling of important information is almost always recommended; in some cases it is legally mandated. Audit, balance, and control (ABC) records can support the restartability mentioned above, and can be an important source of debugging information.

Again, restartability and ABC are important issues regardless of whether an ETL or an ELT approach is employed, but they will be implemented differently, and most architects will have more experience with them in the context of ETL. Architects who have dealt with these issues in an ETL context may be hesitant to solve them all over again in an ELT context. (A sketch of what ABC records and restart checkpoints look like at the SQL level follows this section.)

² http://www.dataacademy.com/files/etl-vs-ELT-White-Paper.pdf
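To give a feel for what ABC and restartability look like at the SQL level, here is a minimal, hypothetical sketch; the table and column names are illustrative only and do not represent the actual SQL Maestro ABC schema.

    -- A hypothetical ABC log table: one row per table-load step in a batch.
    CREATE TABLE abc_load_log (
        batch_id      BIGINT,
        source_table  VARCHAR(128),
        target_table  VARCHAR(128),
        rows_loaded   BIGINT,
        load_status   VARCHAR(16),    -- 'SUCCEEDED' or 'FAILED'
        started_at    TIMESTAMP,
        finished_at   TIMESTAMP
    );

    -- Each completed step records what was done, so the run can be audited and
    -- balanced against source-side row counts or checksums.
    INSERT INTO abc_load_log
    VALUES (1042, 'src.orders', 'dw.fact_orders', 1250000,
            'SUCCEEDED', '2015-06-01 01:05:00', '2015-06-01 01:22:00');

    -- On restart, the last known good point is simply the set of steps already
    -- marked SUCCEEDED for the batch; only the remaining steps need to be re-run.
    SELECT target_table
    FROM   abc_load_log
    WHERE  batch_id = 1042
      AND  load_status = 'SUCCEEDED';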

Role of SQL Maestro

SQL Maestro is an environment for ELT development on the Netezza/PureData platform. SQL Maestro is partly a dataflow-style GUI tool that compiles to SQL commands to be run on Netezza, and partly a framework supporting the data warehouse development lifecycle. SQL Maestro enables and encourages the ELT approach by addressing the issues raised in the previous section.

1. SQL Maestro gives developers a tool as powerful as the best traditional ETL tools, but one designed to implement the ELT/Netezza paradigm described above, currently the highest-performing, most cost-effective solution available. SQL Maestro is lightweight, easy to administer, and inexpensive, in all of these respects by orders of magnitude compared with the best-known ETL tools. This is in no small measure because the engine of SQL Maestro, normally the largest component of an ETL tool, is in this case already provided by Netezza.

2. SQL Maestro uses a dataflow-based GUI. Those familiar with traditional ETL tools understand SQL Maestro jobs when they see them, and can construct them with little to no training.

Figure 4. SQL Maestro GUI.

3. SQL Maestro includes a framework for testing and deployment. Developers, testers, and production engineers who have deployed ETL systems will find the concepts couched in terms and practices familiar from their previous experience, but appropriately adapted to ELT and Netezza.

4. SQL Maestro includes an ABC database. The ABC database may reside in a DBMS of the installer's choosing, or in a PostgreSQL instance that is part of the installation. The ABC database supports restartability and other aspects of audit, balance, and control.

Conclusion

Many data warehouse architects have concluded that, with currently available hardware, ELT has marked performance and cost advantages over ETL as a way of loading data warehouses, especially when high-performing data warehouse appliances such as Netezza are available. Objections to ELT are mainly pragmatic: there are more ETL tools available, developers and other stakeholders have more experience with ETL, and so on. In short, ELT is newer, and therefore involves more change. SQL Maestro unlocks the potential of ELT by obviating these objections.

A data warehouse hosted on Netezza but loaded by an ETL system allows two expensive systems to do work that could be accomplished faster with one of them by using ELT. SQL Maestro and Netezza provide the path to remove this inefficiency, permitting data warehouse development to be more agile while also allowing the warehouse itself to handle larger data volumes more quickly, all within a best-practices framework that meets the reliability and data quality needs of the most highly regulated industries.

Contact sales at New England Systems or IQ Associates Inc. to arrange a test drive.
www.nesystems.com
www.iq-associates.com
978-226-1470 x3