Data modelling patterns used for integration of operational data stores


A D ONE insight brought to you by Catalin Ehrmann

Abstract

Modern enterprise software development and data processing approaches have diverged onto separate paths over the last decade. Recently, the two communities have realized the benefit of sharing expertise across domains. In this paper, we explain how a clever mix of data warehouse and OLTP (Online Transaction Processing) patterns creates a robust operational system, and we discuss the advantages and disadvantages of this approach.

Introduction

There is no shortage of design patterns for data processing or software development, but each pattern comes with its own set of trade-offs. Developers and database administrators often have to weigh the pros and cons of several options. Before choosing a pattern, it is important to understand the business requirements and the data model. What is the tolerance for failure of an operation? What are the legal requirements? Is the data used globally or only locally? What kind of analysis will be done on the data? We also have to consider connected systems and the hardware our database and software must run on. However, if we can design a pattern that has only a few minor trade-offs, we can meet more cross-organization business requirements without slowing down the business or increasing error rates.

DWH Patterns

Before we can design an improved enterprise DWH pattern, we must understand two basic patterns central to DWH design and implementation: Slowly Changing Dimensions (SCD) and Change Data Capture (CDC).

SCD Type 1

SCD1 updates data by overwriting existing values (see Figures 1 and 2). It is used in roughly half of all cases because it is easy to implement and use. This is a good approach for error fixes, but compliance laws could be violated since all historical values are lost. It should not be used when a value is updated because the information itself has changed, for example when an organization moves to a new location.

Figure 1: SCD1 Sample Code

-- update a single record in the Vendors table
UPDATE dbo.Vendors
SET    CITY = 'BERLIN'
WHERE  UID = 1234;

Figure 2: SCD1 Example

VENDOR TABLE, BEFORE SCD1:

UID   TAX_ID    VENDOR     CITY    COUNTRY
1234  46857832  ACME, INC  MUNICH  GERMANY

AFTER SCD1:

UID   TAX_ID    VENDOR     CITY    COUNTRY
1234  46857832  ACME, INC  BERLIN  GERMANY

Additionally, analysis cubes and pre-computed aggregates must be rebuilt any time a data point is changed using SCD1. If there are distributed copies of the data, the change will have to be applied to the copies as well, and calculations must be rebuilt on each copy. As compliance requirements grow, SCD1 will likely be used less over time; organizations will be forced to choose another method to maintain good standing with compliance enforcement agencies. SCD Type 2 is a bit more complex than SCD1, but it has some important advantages.

SCD Type 2

In SCD2, the current record is expired and a new row is added to take its place, using SQL Server's MERGE functionality (see Figure 3). SCD Type 2 is a bit more difficult to implement, but it has the advantage of preserving historical data. This method is an excellent choice when the law requires history preservation. The disadvantage of SCD2 is that database storage and performance can quickly become a concern, since new rows are added for every update.
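The expire-and-insert step behind Figure 3 can be sketched in T-SQL as follows. The staging table dbo.Vendors_Stage, the tracked columns and the assumption that the dimension table has no triggers or foreign keys blocking nested DML output are all illustrative choices, not details taken from the project described later:

-- SCD2 sketch: MERGE expires the changed current rows (and inserts brand-new
-- vendors); the outer INSERT then adds a fresh current row for every vendor
-- that MERGE reported as updated.
INSERT INTO dbo.Vendors (UID, TAX_ID, VENDOR, CITY, COUNTRY, VALID_FROM, VALID_TO, CURR_FL)
SELECT UID, TAX_ID, VENDOR, CITY, COUNTRY, CAST(GETDATE() AS date), '9999-12-31', 'Y'
FROM (
    MERGE dbo.Vendors AS tgt
    USING dbo.Vendors_Stage AS src
        ON tgt.UID = src.UID AND tgt.CURR_FL = 'Y'
    WHEN MATCHED AND (tgt.CITY <> src.CITY OR tgt.COUNTRY <> src.COUNTRY) THEN
        UPDATE SET tgt.VALID_TO = CAST(GETDATE() AS date), tgt.CURR_FL = 'N'
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (UID, TAX_ID, VENDOR, CITY, COUNTRY, VALID_FROM, VALID_TO, CURR_FL)
        VALUES (src.UID, src.TAX_ID, src.VENDOR, src.CITY, src.COUNTRY,
                CAST(GETDATE() AS date), '9999-12-31', 'Y')
    OUTPUT $action, src.UID, src.TAX_ID, src.VENDOR, src.CITY, src.COUNTRY
) AS changes (merge_action, UID, TAX_ID, VENDOR, CITY, COUNTRY)
WHERE changes.merge_action = 'UPDATE';   -- only expired rows need a replacement

A plain UPDATE followed by an INSERT achieves the same result and is easier to reason about; MERGE simply keeps the change detection and the expiry in a single statement.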

Figure 3: SCD2 Example

VENDOR TABLE, BEFORE:

UID   TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
1234  46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  31-12-9999  Y

AFTER:

UID   TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
1234  46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  15-04-2009  N
1234  46857832  ACME, INC  BERLIN  GERMANY  15-04-2009  31-12-9999  Y

When implementing SCD2, it is important to include metadata columns so users can determine which records are current and which are historical (see Figure 3). Administrators should also make end users aware of the metadata columns and their meaning. A current flag is not strictly necessary, but it makes querying for current or historical records easier. It is sometimes useful to include a reason flag or description noting why the data was updated, to distinguish error fixes from genuine information changes. Administrators should also keep in mind that updates to downstream systems may not be applied properly when a natural key is updated and no surrogate key is present. It is recommended that surrogate keys always be present in data updated using SCD2 (see Figure 4).

Figure 4: Changing a Natural Key (Tax ID) With No Surrogate Key (UID) Present Is Not Recommended

VENDOR TABLE, BEFORE:

TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  31-12-9999  Y

AFTER:

TAX_ID    VENDOR     CITY    COUNTRY  VALID_FROM  VALID_TO    CURR_FL
46857832  ACME, INC  MUNICH  GERMANY  26-05-2001  15-04-2009  N
56857833  ACME, INC  BERLIN  GERMANY  15-04-2009  31-12-9999  Y

Change Data Capture

CDC is a method of extracting data for ETL (extract, transform and load) processing. Rather than performing a full refresh, CDC isolates only the data that has changed. It captures all inserts, updates and deletes from every system that interfaces with the database, including front-end applications and database processes such as triggers. Provided the necessary metadata is captured, CDC can also help satisfy compliance regulations.
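SQL Server, the platform used in the business case later in this paper, ships change data capture as a built-in, log-based feature. The following is a minimal, illustrative sketch of enabling it on the sample Vendors table and reading the captured changes; the table name and the defaults shown are assumptions for this example:

-- Enable change data capture at the database level, then on one table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Vendors',
     @role_name     = NULL;        -- no gating role in this sketch

-- Read every captured insert, update and delete since tracking started.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Vendors');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Vendors(@from_lsn, @to_lsn, N'all');

Where such a feature is not available or not permitted, the changed rows have to be detected by other means.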

Changes can be detected in four different ways: via audit columns, database log scraping, timed extracts or a full database difference comparison. See Figure 5 for a comparison of these CDC methods.

Audit columns can be very easy to use, but they can also be unreliable. If front-end applications modify the data, or if the data contains null values, audit columns should not be used to detect changes. If the administrator is certain that only database triggers update the metadata, audit columns may be a good option.

Database log scraping should be used as a last resort. While this method is slightly more reliable than audit columns, it is error-prone and tedious to build a system that takes a snapshot of a log, extracts the useful information from it and then acts on that information. Furthermore, log files tend to be the first thing erased when database performance or storage volumes suffer, resulting in missed change captures.

Figure 5: Comparison of CDC Methods

Audit columns
  Implementation & speed: fast, easy implementation.
  Accuracy: highly accurate if database triggers are used to modify the metadata.

Database log scraping
  Implementation & speed: tedious and time-consuming.
  Accuracy: highly prone to error due to the nature of log-file scraping; an alternative method is needed if the DBA empties log files to protect database performance.

Timed extracts
  Implementation & speed: fast, but manual cleanup is often required and can be time-consuming.
  Accuracy: very unreliable; mid-job failures or skipped jobs can cause large amounts of data to be missed.

Full difference comparison
  Implementation & speed: somewhat easy to implement, but highly resource-intensive.
  Accuracy: highly accurate.

Timed extracts are notoriously unreliable, but novice DBAs often mistakenly choose this technique. Here, an extract of the data captured within a particular timeframe is taken at a scheduled time. If the process fails before completing all its steps, duplicate rows can be introduced into the table, and a failed or stopped process will cause entire sets of data to be missed. In either case, an administrator faces the tedious task of cleaning up duplicate rows and identifying which rows should be included in the next CDC run and which should be excluded.

A full database difference comparison is the only method guaranteed to find all changes. Unfortunately, it can be very resource-intensive, since snapshots are compared record by record. To improve performance, a checksum can be used to quickly determine whether a record has changed. This method is a good choice for environments where reliability and accuracy are the primary concerns.
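As an illustration of the checksum shortcut, the following sketch compares the current table with a snapshot taken at the previous load. The snapshot table dbo.Vendors_Snapshot and the concatenation scheme are assumptions made for this example:

-- Full difference comparison with a checksum shortcut: hash the tracked
-- columns of each row and flag only the rows whose hashes differ.
-- NB: CONCAT treats NULL as '', so a real implementation needs explicit NULL markers.
SELECT cur.UID
FROM   dbo.Vendors AS cur
JOIN   dbo.Vendors_Snapshot AS prev
       ON prev.UID = cur.UID
WHERE  HASHBYTES('SHA2_256',
           CONCAT(cur.TAX_ID, '|', cur.VENDOR, '|', cur.CITY, '|', cur.COUNTRY))
   <>  HASHBYTES('SHA2_256',
           CONCAT(prev.TAX_ID, '|', prev.VENDOR, '|', prev.CITY, '|', prev.COUNTRY));

Rows that exist in only one of the two tables are the inserts and deletes; an anti-join in each direction finds those.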

The OLTP Pattern

OLTP (Online Transaction Processing) is a very popular method for processing data. Data input is gathered, the information is processed and the data is updated accordingly, all in real time. Most front-end applications that allow users to interact with a database use OLTP. If you have ever paid for something with a credit card, you have used OLTP (see Figure 6): you swiped your card (input), the credit card machine or website sent the data to your card company (information gathering), and your card was charged according to your purchase (data update).

OLTP is an all-or-nothing process: if any step fails, the entire operation must fail (a T-SQL sketch of this behaviour appears at the end of this section). If you swiped your card and funds were verified, but the system failed to actually charge you for the purchase, the vendor would not get their money; therefore the whole process must fail.

Figure 6: Sample OLTP Process

OLTP's design makes it ideal for real-time, mission-critical operations. If a process has zero tolerance for error, OLTP is the pattern of choice. Additionally, OLTP supports concurrent operations: you and other customers can all make purchases from the same vendor at the same time without waiting for another transaction to finish. This is another reason OLTP is a good choice for front-end applications.

When implementing an OLTP system, it is important to keep queries highly selective and highly optimized. Query execution times must be kept to a minimum, or users will tire of waiting and abandon the task at hand. To improve performance, the data used by an OLTP system should be highly normalized, and the transaction load should be distributed across multiple machines or networks if the anticipated traffic requires additional processing power and memory.

Traditionally, OLTP updates have followed the SCD1 pattern, and history is erased when an update is made. In the next section, we show how to preserve historical data and use SCD2 in an OLTP environment.
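Before turning to the business case, here is the promised sketch of the all-or-nothing behaviour: a hypothetical card-charge procedure. The tables, columns and error number are invented for this example and are not part of the pattern itself:

-- Hypothetical all-or-nothing card charge: verify funds, debit the account
-- and record the charge inside a single transaction.
CREATE PROCEDURE dbo.ChargeCard
    @AccountId  int,
    @MerchantId int,
    @Amount     decimal(12, 2)
AS
BEGIN
    SET XACT_ABORT ON;                        -- any error aborts the transaction
    BEGIN TRY
        BEGIN TRANSACTION;

        UPDATE dbo.Accounts
        SET    Balance = Balance - @Amount
        WHERE  AccountId = @AccountId
          AND  Balance >= @Amount;            -- funds verification

        IF @@ROWCOUNT = 0
            THROW 50001, 'Insufficient funds or unknown account.', 1;

        INSERT INTO dbo.Charges (AccountId, MerchantId, Amount, ChargedAt)
        VALUES (@AccountId, @MerchantId, @Amount, SYSUTCDATETIME());

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;             -- undo any partial work
        THROW;                                -- surface the error to the caller
    END CATCH;
END;

Concurrency is handled by the database engine: two customers buying from the same vendor at the same time simply run in separate transactions.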

Business Case: Using SCD and CDC in an OLTP Environment

Our client, a newly formed company, required an infrastructure that would support multi-system integrations between its customer and partner systems. The client was using a Microsoft stack, so SQL Server 2012 was the database of choice, with SQL Server Integration Services (SSIS), Analysis Services (SSAS) and Reporting Services (SSRS) as supporting applications. Code was written in T-SQL and C# and managed using Team Foundation Server (TFS).

The Process

A process was designed that would reduce the impact on performance while making historical and current data easily available to customers and partners (see Figure 7). First, the customer sends three files to our client via a secure inbox, containing the deletes, updates and inserts to their database. That data is then imported into an operational data store (ODS). If the customer has not yet configured partner-system credentials and integration parameters, they can log in to a customer portal to do so.

Figure 7: SCD1 & SCD2 Mix in an OLTP Environment

After the data is imported into the ODS, the unmodified data from both the partner systems and the ODS is loaded into a pre-staging environment. The data is then enriched with SCD2 metadata elements, including valid-from and valid-to dates and a current-record flag. The enriched data is imported into a persistent staging environment. Any changes to the data are then made in the core database using SCD1 in SSIS. When changes are completed in the core database, CDC is used to detect them, and they are propagated to the other connected databases, excluding the database the change originated from. No deletes are made in the ODS; instead, the record is marked as inactive.
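The "mark as inactive" rule can be pictured with a short sketch. The ods and stage schemas and the column names below are assumptions made for illustration, not the project's actual objects:

-- Illustrative soft delete: incoming delete records never remove rows from
-- the ODS; they only flip an active flag, so history stays queryable.
UPDATE ods.Vendors
SET    ACTIVE_FL      = 'N',
       DEACTIVATED_AT = SYSUTCDATETIME()
WHERE  UID IN (SELECT UID FROM stage.Vendor_Deletes);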

Advantages

The biggest benefit of this mixed approach is the preservation of historical data without rapidly expanding the size of the production database. As a result, we can comply with the law while keeping the production database stable and responsive. Because history is intact, analysis can be performed on the staging database; in fact, an analysis component is planned for the project discussed above. With analysis computations taking place on the staging database, we preserve resources on the production database for OLTP operations. If any analysis is performed on the production database, queries will perform better, since historical records do not need to be filtered out from current records. Additionally, users have a clearer picture of the data they are querying and do not need to work out what metadata columns such as the current flag and the valid dates mean or how they might affect their query. Finally, downstream systems and the DWH will integrate changes more smoothly, since each record has its own primary key that never changes, and database triggers will perform reliably.

Disadvantages

Compared with most patterns, this mixed approach has few disadvantages. The main concern is that the multiple steps in the process introduce more potential points of failure, as with any multi-step process. Because there are more steps, troubleshooting failures and errors will be more time-consuming than with a single-pattern approach. If any analysis computations are performed on production data, they will need to be rebuilt whenever there is an update; however, using the staging environment for complex analytical functions negates this issue.

Conclusion

In an OLTP environment, reliability and speed are paramount. Combining the SCD1 and SCD2 approaches allows us to benefit from the advantages of both patterns without suffering many of the disadvantages. Implemented well, this mixed approach can be an ideal solution for the legal, IT, marketing and analytics departments without sacrificing customer and user workflows.