Data Vault Brisbane User Group 26-02-2013
Agenda
- Introductions
- A brief introduction to Data Vault
- Creating a Data Vault based Data Warehouse
- Comparisons with 3NF/Kimball
- When is it good for you?
- Examples
- What's next?
2/28/2013

Introductions - About Analytics8
- Founded in 2002 in Australia
- Offices in Sydney, Melbourne, Brisbane, Chicago, Raleigh and Dallas
- 85+ Consultants
- Cross industry
- Technology and vendor agnostic
- 100% Services organisation
- Consulting, Training, Support, Software Procurement
- Business Intelligence and Data Warehousing
- Strategy, Enablement and Optimisation
- Leverage your data to hit your targets
www.analytics8.com

Introductions - About Analytics8
Strategic Services and Implementation Services:
- DW/BI Strategy and Roadmaps
- DW, BI and ETL Architecture
- Data / Business Modeling
- Business Intelligence and Analytics
- Project Management & Governance
- Competency Centers
- DW, BI and ETL Assessments
- Data Integration
- Tool / Vendor Selection and procurement assistance
- Training
- Support

Introductions - Brisbane User Group

Data Vault
- "There are no facts, only interpretations" (Friedrich Nietzsche)
- "Get the facts first, then distort them as you please" (Mark Twain)
- "Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth." (Marcus Aurelius)

Data Vault
- Data is managed as an asset
- Business Rules are moved closer to the business
- The Truth is subjective and based on changing business rules

Data Vault
- "Data Vault is the optimal choice for modelling the EDW in the DW 2.0 framework." (Bill Inmon, about DW 2.0)
- "The Data Vault is a detailed, historically oriented, uniquely linked set of normalized tables that support one or more functional areas of business." (Dan Linstedt)

Data Vault: it's not necessarily about
- Oracle Database Vault
- An end-to-end solution; it complements existing approaches
- An ETL framework
- Creation of information

Data Vault principles
- The goal is to integrate (disparate) data from many source systems and link them together while maintaining source system context
- An Enterprise Data Warehouse is a collection of transactions, a single source of the facts as they were at the time (not the single source for the truth)
- The Truth is subjective: based on soft and changing business rules
- Data centric view of integration: everything is many-to-many, everything is time dependent
- Late binding of data: simplified load dependencies and resulting options for parallel processing (application and database level)
- Repeatable, consistent, scalable, auditable and fault-tolerant
- It's all about flexibility:
  - Handling changes in structure and data (expand)
  - Changing the Data Warehouse structure and performance (manage)
- Uses RDBMS basics

Data Vault architecture
[Diagram, two flows]
- Source Systems > Business Rules & IQ > EDW > Data Marts
- Source Systems > Hard Business Rules > EDW > Business Rules & IQ > Data Marts
[Additional diagram label: Virtualisation]

Data Vault architecture
The business rules are moved closer to the business, which:
- Improves IT reaction time
- Enables business users to direct Business Intelligence
- Reduces cost
- Minimises impact

Reference Architecture
Challenges:
- Dealing with complexities
- Dealing with dependencies
- Ability to respond to a changing environment
Principles:
- Flexibility in design and maintenance
- Change resilient
- Future proof
- (Near) Real Time ready
- Modular
- Scalable
- Durable and predictable
Provide a bottom up architecture which can be applied incrementally with a top down approach
Results:
- Separation of Data Warehouse concepts
- Flexible error handling
- Hybrid modelling
- Parallelism
- Built-in audit trail

Reference Architecture
[Diagram labels: Staging Layer, Integration Layer / SOR, Presentation Layer, Exception Handling, Operational Meta Data]

Data Vault entity types
- Hub: Unique list of business keys
- Satellite: Historical descriptive data (about the hub or link)
- Link: Unique list of relationships between keys
- (Current table)
- (Point in time table)
- (Reference table)

Data Vault entities: Hubs
A Hub entity contains the unique list of business keys.
Contains:
- Surrogate key
- Business key
- Load date timestamp
- Last seen timestamp
- Record source indication for traceability

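
As a rough illustration of the hub columns listed above, the following Python sketch uses the standard-library sqlite3 module. The table name hub_customer, its column names and the load_hub helper are assumptions made for this example and do not come from the presentation. The "refresh last seen, otherwise insert" logic is one possible way to maintain these columns, not a prescribed load routine.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical customer hub; all names are illustrative assumptions.
HUB_DDL = """
CREATE TABLE hub_customer (
    customer_sk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_bk   TEXT NOT NULL UNIQUE,               -- business key
    load_dts      TEXT NOT NULL,                      -- when the key was first loaded
    last_seen_dts TEXT NOT NULL,                      -- when the key was last seen
    record_source TEXT NOT NULL                       -- originating system
)
"""

def load_hub(conn, business_keys, record_source, now=None):
    """Insert unseen business keys; refresh last_seen_dts for known ones."""
    now = now or datetime.now(timezone.utc).isoformat()
    for bk in business_keys:
        cur = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (now, bk),
        )
        if cur.rowcount == 0:  # key not in the hub yet
            conn.execute(
                "INSERT INTO hub_customer "
                "(customer_bk, load_dts, last_seen_dts, record_source) "
                "VALUES (?, ?, ?, ?)",
                (bk, now, now, record_source),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(HUB_DDL)
load_hub(conn, ["C001", "C002"], "CRM", now="2013-02-26T00:00:00")
load_hub(conn, ["C002", "C003"], "BILLING", now="2013-02-27T00:00:00")

hub_rows = conn.execute(
    "SELECT customer_bk, load_dts, last_seen_dts, record_source "
    "FROM hub_customer ORDER BY customer_bk"
).fetchall()
for row in hub_rows:
    print(row)
```

In this sketch the key C002 appears in both loads, so it keeps a single hub row with its original load date and record source, and only the last seen timestamp moves forward.
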
Data Vault entities: Satellites
Satellite entities provide context for a Hub or Link. Much like a Type-2 dimension, a Satellite's information is subject to change over time.
Contains:
- Hub or Link primary key
- Load date timestamp
- End date valid timestamp
- Record source indication
- All context attributes

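
The Type-2 style behaviour of a satellite can be sketched as follows, again with sqlite3. The table sat_customer, the context attributes name and city, and the load_satellite helper are hypothetical names chosen for this example. The sketch assumes that a NULL end date marks the current version, which is one common convention rather than something the slides specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sat_customer (
    customer_sk   INTEGER NOT NULL,  -- primary key of the parent hub
    load_dts      TEXT NOT NULL,     -- when this version was loaded
    end_dts       TEXT,              -- NULL while the version is current (assumption)
    record_source TEXT NOT NULL,
    name          TEXT,              -- context attributes (illustrative)
    city          TEXT,
    PRIMARY KEY (customer_sk, load_dts)
)
""")

def load_satellite(conn, customer_sk, attrs, record_source, load_dts):
    """Add a new version only when the context attributes have changed."""
    current = conn.execute(
        "SELECT name, city FROM sat_customer "
        "WHERE customer_sk = ? AND end_dts IS NULL",
        (customer_sk,),
    ).fetchone()
    new = (attrs["name"], attrs["city"])
    if current == new:
        return False  # nothing changed, keep the current version
    # Close the current version (if any), then insert the new one.
    conn.execute(
        "UPDATE sat_customer SET end_dts = ? "
        "WHERE customer_sk = ? AND end_dts IS NULL",
        (load_dts, customer_sk),
    )
    conn.execute(
        "INSERT INTO sat_customer VALUES (?, ?, NULL, ?, ?, ?)",
        (customer_sk, load_dts, record_source, *new),
    )
    conn.commit()
    return True

first = load_satellite(conn, 1, {"name": "Ann", "city": "Sydney"}, "CRM", "2013-01-01")
same = load_satellite(conn, 1, {"name": "Ann", "city": "Sydney"}, "CRM", "2013-01-15")
moved = load_satellite(conn, 1, {"name": "Ann", "city": "Brisbane"}, "CRM", "2013-02-01")

sat_rows = conn.execute(
    "SELECT load_dts, end_dts, city FROM sat_customer ORDER BY load_dts"
).fetchall()
print(first, same, moved)
print(sat_rows)
```

The second call changes nothing, so no row is written; the third closes the Sydney version and opens a Brisbane one, leaving the full history in place.
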
Data Vault entities: Links
Link entities are many-to-many relationships. They:
- Determine the grain
- Lead to fact tables
- Are valid for a certain period of time
Contains:
- Surrogate key (optional), relation to Link-Satellite
- Hub key(s), determines the relationship
- Load date timestamp
- Last seen date timestamp
- Record source indication for traceability

Links: everything is many to many
[Diagram: Portfolio and Customer related directly, shown with one-to-many, many-to-many and many-to-one cardinalities]
When the EDW is modelled for today it breaks down when loading history

Links: everything is many to many
[Diagram: Portfolio and Customer related through a Link entity, across the same cardinalities]
Historical, present and future data can be loaded without re-engineering

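
A minimal sketch of the point made on the Link slides: because a link stores each relationship as its own row, a relationship that is one-to-many today and many-to-many in older or future data fits the same table. The table lnk_customer_portfolio, its columns and the load_link helper are assumptions for illustration only, and the hub keys are given as plain integers to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE lnk_customer_portfolio (
    link_sk       INTEGER PRIMARY KEY AUTOINCREMENT,  -- optional surrogate key
    customer_sk   INTEGER NOT NULL,                   -- hub key
    portfolio_sk  INTEGER NOT NULL,                   -- hub key
    load_dts      TEXT NOT NULL,
    last_seen_dts TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_sk, portfolio_sk)                -- unique list of relationships
)
""")

def load_link(conn, pairs, record_source, load_dts):
    """Insert unseen relationships; refresh last_seen_dts for known ones."""
    for customer_sk, portfolio_sk in pairs:
        conn.execute(
            "INSERT OR IGNORE INTO lnk_customer_portfolio "
            "(customer_sk, portfolio_sk, load_dts, last_seen_dts, record_source) "
            "VALUES (?, ?, ?, ?, ?)",
            (customer_sk, portfolio_sk, load_dts, load_dts, record_source),
        )
        conn.execute(
            "UPDATE lnk_customer_portfolio SET last_seen_dts = ? "
            "WHERE customer_sk = ? AND portfolio_sk = ?",
            (load_dts, customer_sk, portfolio_sk),
        )
    conn.commit()

# Current data: one customer holds many portfolios (one-to-many).
load_link(conn, [(1, 10), (1, 11)], "PMS", "2013-02-26")
# Other data: portfolio 10 is also held by customer 2 (many-to-many).
load_link(conn, [(1, 10), (2, 10)], "LEGACY", "2013-02-27")

link_pairs = conn.execute(
    "SELECT customer_sk, portfolio_sk FROM lnk_customer_portfolio "
    "ORDER BY customer_sk, portfolio_sk"
).fetchall()
print(link_pairs)
```

The second load changes the effective cardinality of the relationship without any change to the table definition.
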
Data Vault: why isolate keys?
- Data Warehouse management is reduced because of decoupling
- Keys are distributed early and data can be traced by these keys throughout the system
[Diagram labels: Relation, Relation, History, History, Extra History]

Data Vault - Load strategy
- Hybrid modelling reduces dependencies and simplifies ETL
- Loading processes are self-contained (no dependencies on other loads of the same type)
- Capable of Near Real-Time loading
- Simple, Scalable, Parallel and Consistent

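
The parallelism claimed above can be sketched in plain Python. The example assumes a two-stage pattern in which hubs are loaded independently of each other and links are loaded once the hub keys exist; the staging rows and the functions load_hub and load_link are hypothetical, and in-memory dictionaries stand in for real tables.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical staging rows; column names are illustrative.
staging = [
    {"customer": "C001", "portfolio": "P10"},
    {"customer": "C001", "portfolio": "P11"},
    {"customer": "C002", "portfolio": "P10"},
]

def load_hub(column):
    """Give each distinct business key in one staging column a surrogate key."""
    keys = sorted({row[column] for row in staging})
    return {bk: sk for sk, bk in enumerate(keys, start=1)}

def load_link(hub_customer, hub_portfolio):
    """Build the unique list of (customer key, portfolio key) relationships."""
    return sorted(
        {(hub_customer[r["customer"]], hub_portfolio[r["portfolio"]]) for r in staging}
    )

with ThreadPoolExecutor() as pool:
    # Stage 1: the hub loads do not depend on each other, so they can run together.
    customer_future = pool.submit(load_hub, "customer")
    portfolio_future = pool.submit(load_hub, "portfolio")
    hub_customer = customer_future.result()
    hub_portfolio = portfolio_future.result()
    # Stage 2: links (and satellites) only need the hub keys from stage 1.
    link = pool.submit(load_link, hub_customer, hub_portfolio).result()

print(hub_customer, hub_portfolio, link)
```

Each loader follows the same small template, which is also what makes this style of loading a candidate for generation from metadata.
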
Data Vault - Flexibility
[Diagram labels: Products, Suppliers, Product Supplier Link, Shipment dates, Billed amounts, Availability schedule, Stocks, Address, Descriptions, Descriptions, Rating Score]

Data Vault architecture comparisons
Kimball or Inmon (CIF):
- Complex ETL
- Truth oriented
- Business Rules before EDW
Data Vault:
- 100% of the data (within scope) 100% of the time
- Source driven
- Auditable
- Transaction / data oriented
- Template/metadata driven
- No Business Rules
- No destructive loading
Kinstedt or Dinmon!

Compared to 3NF
3rd Normal Form: the corporate data model
- Long development time (mainly due to business changes)
- Subject Area Database, modelled to current views
Adaptation issues: changing the model can be hard
- Definitions changing ("customer" means something else now)
- Growth of new relationships
- Duplicate data sources require a priority / trust layer
- Cascading impact: changes ripple through to underlying tables
Integration issues:
- Load dependencies because of referential integrity
- Data Quality != Referential Integrity
- Time driven PK issues (new parent or child; key change)

Data Vault case study: late arriving requirement
[Diagram: Normalised core DWH model]

Data Vault case study: late arriving requirement
Late arriving requirement: introduction of a Cover Group Policy

Data Vault case study: late arriving requirement
[Two diagrams of the normalised model: impacted areas marked with x, then numbered 1 to 5]
Downstream impacts of normalisation

Data Vault case study: late arriving requirement
Data Vault approach (before the introduction of the Cover Group Policy)
- HUB_POLICY: Policy Id, PMS_PLCY_NO
- HUB_POLICY STATUS: Policy Status Id, Policy Status Type Id
- HUB_POLICY INSURED: Policy Insured Id, Insured Id
- HUB_POLICY RISK: Policy Risk Id, PMS Risk Pt 1, PMS Risk Pt 2, PMS Risk Pt 3
- LNK_POL_ST_INS_RISK: Link Policy Status ID, Policy Id (FK), Policy Status Id (FK), Policy Insured Id (FK), Policy Risk Id (FK)
- POLICY OFFER: derived on output

Data Vault case study: late arriving requirement
- HUB_POLICY (marked "=" in the diagram): Policy Id, PMS_PLCY_NO
- HUB_POLICY STATUS: Policy Status Id, Policy Status Type Id
- HUB_POLICY INSURED: Policy Insured Id, Insured Id
- HUB_POLICY RISK: Policy Risk Id, PMS Risk Pt 1, PMS Risk Pt 2, PMS Risk Pt 3
- HUB_COVER DEVELOPMENT GROUP: Cover Development Group Id, Cover Development Group Cd
- LNK_POL_ST_INS_RISK (marked "x" in the diagram): Link Policy Status ID, Policy Id (FK), Policy Status Id (FK), Policy Insured Id (FK), Policy Risk Id (FK)
- LNK_POL_ST_INS_RISK_CDG: Link Policy Status ID, Policy Id (FK), Policy Status Id (FK), Policy Insured Id (FK), Policy Risk Id (FK), Cover Development Group Id (FK)

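
The case study can be approximated in a few lines of sqlite3 to show that the late arriving requirement is met by adding tables rather than altering existing ones. The table names below are lower-case, underscore-joined adaptations of the entity names on the slides; the reduced column lists, the sample rows and the table_sql helper are assumptions for this sketch. Whether the original link keeps being loaded after the new link is introduced is a design decision the sketch does not address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified model before the change; satellites and most columns are omitted.
conn.executescript("""
CREATE TABLE hub_policy         (policy_id INTEGER PRIMARY KEY, pms_plcy_no TEXT UNIQUE);
CREATE TABLE hub_policy_status  (policy_status_id INTEGER PRIMARY KEY);
CREATE TABLE hub_policy_insured (policy_insured_id INTEGER PRIMARY KEY);
CREATE TABLE hub_policy_risk    (policy_risk_id INTEGER PRIMARY KEY);
CREATE TABLE lnk_pol_st_ins_risk (
    link_policy_status_id INTEGER PRIMARY KEY,
    policy_id         INTEGER REFERENCES hub_policy (policy_id),
    policy_status_id  INTEGER REFERENCES hub_policy_status (policy_status_id),
    policy_insured_id INTEGER REFERENCES hub_policy_insured (policy_insured_id),
    policy_risk_id    INTEGER REFERENCES hub_policy_risk (policy_risk_id)
);
INSERT INTO hub_policy VALUES (1, 'PMS-0001');
INSERT INTO hub_policy_status VALUES (1);
INSERT INTO hub_policy_insured VALUES (1);
INSERT INTO hub_policy_risk VALUES (1);
INSERT INTO lnk_pol_st_ins_risk VALUES (1, 1, 1, 1, 1);
""")

EXISTING = ["hub_policy", "hub_policy_status", "hub_policy_insured",
            "hub_policy_risk", "lnk_pol_st_ins_risk"]

def table_sql(name):
    """Return the stored CREATE TABLE statement for a table."""
    return conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()[0]

ddl_before = {t: table_sql(t) for t in EXISTING}
rows_before = conn.execute("SELECT * FROM lnk_pol_st_ins_risk").fetchall()

# Late arriving requirement: add one hub and one wider link, touch nothing else.
conn.executescript("""
CREATE TABLE hub_cover_development_group (
    cover_development_group_id INTEGER PRIMARY KEY,
    cover_development_group_cd TEXT UNIQUE
);
CREATE TABLE lnk_pol_st_ins_risk_cdg (
    link_policy_status_id INTEGER PRIMARY KEY,
    policy_id         INTEGER REFERENCES hub_policy (policy_id),
    policy_status_id  INTEGER REFERENCES hub_policy_status (policy_status_id),
    policy_insured_id INTEGER REFERENCES hub_policy_insured (policy_insured_id),
    policy_risk_id    INTEGER REFERENCES hub_policy_risk (policy_risk_id),
    cover_development_group_id INTEGER
        REFERENCES hub_cover_development_group (cover_development_group_id)
);
""")

ddl_after = {t: table_sql(t) for t in EXISTING}
rows_after = conn.execute("SELECT * FROM lnk_pol_st_ins_risk").fetchall()
print(ddl_before == ddl_after, rows_before == rows_after)
```
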
Data Vault - Disadvantages
- Scaling versus performance: lots of outer joins and tables in queries
- Not intended for ad hoc end user access
- Aging relationships
- Currently not an open platform
- Does not provide solutions for the data mart layer

Compared to Star Schema models
Star Schema / fact and dimensions issues:
- Expensive updates and deletes
- Dimensions over time (Type 1, 2 and 3)
- Architecture includes many kinds of tables (helper, bridge, junk, mini)
- Grain issues difficult to resolve
- Real-time loading impractical
- Issues with transactions appearing before dimension data
- Complex loading and changing of history
- Begins to fail under very heavy loads
- Inflexible mix of basic elements (history, structure, key distribution)

Data Vault - Advantages
- Completely auditable architecture
- DWH model is aligned with the business model
- Extremely adaptable to (business) changes
- Designed and optimised for the EDW
- Durable, consistent and predictable
- Consistency pays back over time
- Lends itself to real-time processing
- Simple and consistent
- Isolation from change
- Incrementally built
- Easy to load a Dimensional Model

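
To make the last advantage above more concrete, here is a hedged sketch of deriving a Type-2 style customer dimension from a hub and its satellite with a single join. The tables hub_customer and sat_customer, the view dim_customer, the sample rows and the '9999-12-31' open-ended date are all assumptions for this example, not structures taken from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE);
CREATE TABLE sat_customer (
    customer_sk INTEGER,
    load_dts    TEXT,
    end_dts     TEXT,   -- NULL while the version is current (assumption)
    city        TEXT
);
INSERT INTO hub_customer VALUES (1, 'C001'), (2, 'C002');
INSERT INTO sat_customer VALUES
    (1, '2013-01-01', '2013-02-01', 'Sydney'),
    (1, '2013-02-01', NULL,         'Brisbane'),
    (2, '2013-01-15', NULL,         'Melbourne');

-- The dimension is derived on output: one row per satellite version.
CREATE VIEW dim_customer AS
SELECT h.customer_bk,
       s.city,
       s.load_dts                        AS valid_from,
       COALESCE(s.end_dts, '9999-12-31') AS valid_to,
       s.end_dts IS NULL                 AS is_current
FROM hub_customer h
JOIN sat_customer s ON s.customer_sk = h.customer_sk;
""")

dim_rows = conn.execute(
    "SELECT customer_bk, city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_bk, valid_from"
).fetchall()
for row in dim_rows:
    print(row)
```

Because the history already sits in the satellite, the dimension needs no change-detection logic of its own in this sketch; it is only a reshaping of stored data.
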
Atomicity
Data Warehouses try to do too much in a loading cycle, addressing all kinds of problems in a single load pattern

Is it good for you?
Is the introduction of Data Vault as the middle tier (Integration / SOR / Core DWH layer) worth the additional effort in terms of (ETL) development and space?

Not a good match
- You're using a 2-tiered architecture / don't want (or don't think you need) the extra layer (i.e. not an EDW).
- You're unfamiliar with the approach. These concerns are often deeply rooted and overriding them may not get the best result.
- There is a relatively low maturity regarding Data Modelling. Data Vault requires a relatively senior/firm Modeller. Data Vault leaves less room for deviations, requires adequate assignment of business keys (not 1 on 1 with source primary keys) and generally requires a firm adherence to the standards.
- There is not enough involvement / drive to pursue the program. Related to familiarity: working with Data Vault requires continuous selling of the approach, as to date it is still fairly uncommon.

A good match
- The outcomes and/or requirements are not clear or are likely to change.
- You are following an agile approach for Project Management or have specified very short delivery cycles.
- You want to incrementally expand your data model.
- You want to plan for / expect to require additional scalability.
- You want to leverage (ETL) automation / enforce standards through automation.
- You are stuck in a tactical (2-tiered / Dimensional Bus Architecture) solution and want to expand; Data Vault can be used to incrementally backfill the solution.

Demonstration
- Assemblies
- Use of BIML and C#
- Model Driven Design

What's next?
- Data Vault 2.0?
- Big Data?
- Model Driven Design?
- Case Studies?
- Software / ETL specific implementations?

Thank you!