Drawing the Big Picture: Multi-Platform Data Architectures, Queries, and Analytics
Philip Russom, TDWI Research Director for Data Management
August 26, 2015
Sponsor
Speakers
Philip Russom, TDWI Research Director, Data Management
Imad Birouty, Director, Technical Product Marketing, Teradata
Agenda
- The Mission: Queries, analytics, and other BI that reach multiple warehouse and data platforms simultaneously
- Enabling Technologies: Modern data warehouse environments (DWEs); single-console tools; data exploration and discovery; standard SQL, but extended; grid, fabric, virtualization, logical DW
- Benefits of the single big picture: New ways to view data and develop queries or analytics; simplification for architecture, governance, stewardship, compliance, auditing, security...
- Recommendations
PLEASE TWEET @prussom, @Teradata, #TDWI, #Analytics, #BigData
The Mission, Redux
Today's BI/DW/analytics demands:
- As much data as possible
- From more sources and source types
- In many structures, or structure-free
- Persisted on old and new data platform types
- Virtualized, as appropriate
- All the above, available all the time, for everyone
We've always aspired toward these goals, but success is more likely today because we have better software, hardware, skills, and best practices. We also have better executive support: organizations want more business value from big data, new data, analytics, and new data-driven business programs.
Enablers for the Revised Mission
New tool types and functions, plus their disciplines and practices:
- Data exploration and data discovery
- More agile data preparation
- Data visualization: ease of use, analytics, fun and compelling presentations, storytelling
New data platforms:
- Hadoop, whether open source or a vendor distro
- MPP RDBMSs, appliances, and columnar databases
Old skills and technologies, too:
- SQL and other relational techs are as important as ever
All the above, integrated and interoperable:
- Single console, or as few tools as possible
- Single access and query method: SQL, but for any data or platform
- Data architecture to integrate the back end
DEFINITION: Multi-Platform Data Warehouse Environments
Many enterprise data warehouses (EDWs) are evolving into multi-platform data warehouse environments (DWEs). Users continue to add standalone data platforms to their warehouse tool and platform portfolio. The new platforms don't replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP. Instead, the new platforms complement the warehouse, because they are optimized for workloads that manage, process, and analyze new forms of big data, non-structured data, and real-time data.
Modern DW Architectures are Complex
The tech stack for DW, BI, DI, and analytics has always been a multi-platform environment. What's new? The trend toward a portfolio of many physical data platforms has accelerated, and the logical architecture that integrates them is very important. Why do it? More platform types serve more types of users, data, and workloads.
[Diagram: over the passage of time, the DW environment has grown from a data warehouse with star/snowflake and multidimensional data models, data marts, federated marts, customer marts, ODSs, data staging areas, and OLAP cubes and DBMSs to also include metrics for performance management, real-time ODSs, detailed source data, analytic sandboxes, data federation and virtualization, columnar DBMSs, DW appliances, MapReduce, cloud-based DBMSs, the Hadoop Distributed File System, NoSQL databases, and complex event processing and streaming data tools, unified under a logical data warehouse.]
The logical data warehouse is a logical and/or virtual layer of the DW architecture that complements the physical layer of architecture under it.
DEFINITIONS OF THE Logical Data Warehouse
TDWI: A data warehouse is user-defined data architecture.
- The architecture and its design components must be populated by data
- But the data can be physical, logical/virtual, or both
- So, most DW architectures have two key layers: physical and logical
Gartner's view: A logical DW depends on virtual tech, from simple federation to object-oriented virtualization, plus virtual views, indices, semantics, and server memory.
Building out the logical layer of your DW is important: the logical layer enables cross-platform integration and interoperability for broad queries, exploration, and analytics.
DEFINITIONS OF THE Logical Data Warehouse (LDW)
The LDW layer provides a unified view (or a collection of views) of data in multiple platforms, plus a simplified (yet diverse and high-performance) collection of interfaces into such sources and targets to achieve interoperability, especially for queries.
The point of the LDW layer is to provide:
- A fairly comprehensive big picture of data in the DWE
- A single layer through which data can be accessed, thereby reducing data redundancy, movement, and processing
- A simplified view and related mechanisms that enable more user types
Similar concepts:
- Virtual DW (the LDW is often partially virtual, but mostly physical)
- Real-Time DW, Operational DW, Active DW, Dynamic DW
- Query Grid, Data Grid, Data Fabric
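To make the unified-view idea concrete, here is a minimal sketch of how an LDW layer might expose warehouse and Hadoop data behind a single view. It assumes a federation feature that addresses remote tables with a table@server syntax (as in the QueryGrid demo later in this deck); every table, column, and schema name here is illustrative, not from any real system.

```sql
-- Hypothetical sketch of an LDW-style unified view. Assumes a federation
-- layer that exposes a remote Hadoop table via table@server syntax;
-- all object names are illustrative.
CREATE VIEW all_sales AS
SELECT order_id, customer_id, order_total, 'warehouse' AS source_system
FROM   edw.sales_current              -- local relational warehouse table
UNION ALL
SELECT order_id, customer_id, order_total, 'hadoop' AS source_system
FROM   sales_archive@hdp21;           -- remote Hadoop table, via federation

-- Consumers query the unified view without knowing where the data lives:
SELECT customer_id, SUM(order_total) AS lifetime_value
FROM   all_sales
GROUP  BY customer_id;
```

The design point is that consumers see one logical object; the federation layer decides which queries run locally, which are pushed down to Hadoop, and how results are combined.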
NEW ARCHITECTURES: Hadoop Integrated with a Relational DBMS
The strengths of one balance the weaknesses of the other.
A relational DBMS is good at:
- Metadata management
- Complex query optimization
- Table joins, views, keys, etc.
- Security, including roles and directories
HDFS and other Hadoop tools are good at:
- Massive, linear scalability
- Multi-structured and no-schema data
- Some ETL and ELT functions
- Custom code for algorithmic analytics
Other platforms are also being tightly integrated with the relational DW: analytic DBMSs based on columnar, appliance, MapReduce, and graph technologies.
To make this integration of diverse data platforms practical, you need good design by users for the logical DW architectural layer, plus vendor tools that can reach all the above and more from one query.
Importance of Data Exploration
Exploring data is a first step to leveraging new data:
- Never allow new data into a DW without proper vetting
- Assess the value and use cases for new (big) data via exploration
Exploring data is a prerequisite to analyzing data:
- By its nature, analysis makes correlations across data of diverse sources, structures, subjects, and vintages
- Finding just the right combination for successful analysis depends on data exploration as a first step
High ease of use for user productivity:
- Some users are business people who need a business-friendly view
- Ease of use accelerates developers' productivity, too
- Support for all data platforms, from relational to Hadoop
- A modern data exploration tool will merge diverse data via a single complex query
A data exploration tool must do more than exploration:
- Profile data to understand its content and condition
- Extract data, model the result set, index big data
- Deduce data's structure and develop metadata
- Perform tasks as you go, not ahead of time, for greater agility
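Profiling of the kind listed above is typically query driven. As a hedged illustration (all table and column names are hypothetical), the SQL an exploration tool might generate to assess the content and condition of a newly landed table could look like:

```sql
-- Basic profile of a newly landed table: row counts, cardinality,
-- null rates, and value ranges. All names are illustrative.
SELECT COUNT(*)                                        AS row_cnt,
       COUNT(DISTINCT customer_id)                     AS distinct_customers,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)  AS null_emails,
       MIN(order_date)                                 AS earliest_order,
       MAX(order_date)                                 AS latest_order
FROM   staging.new_orders;

-- Frequency distribution of a candidate dimension column:
SELECT order_status, COUNT(*) AS cnt
FROM   staging.new_orders
GROUP  BY order_status
ORDER  BY cnt DESC;
```

Results like these tell you whether the data is clean enough to join, which columns could serve as keys or dimensions, and where metadata must be deduced before the data enters the warehouse.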
ITERATIVE, FOUR-STEP PROCESS FOR Exploratory Analytics with New (Big) Data
[Diagram: an iterative cycle of Explore, Data Prep, Analyze, and Visualize.]
A FEW REQUIREMENTS FOR Advanced Analytics
Market direction: seamless integration
- In one tool environment: exploration, data prep, analysis, visualization, and more
- The iterative, four-step process of exploratory analytics demands tight tool integration
Advanced forms of analytics
- Mining, predictive, statistics, NLP (not OLAP)
- Algorithmic, as well as query-based
- Both canned and home-grown algorithms: the tool should include a library of pre-built algorithms and also help you write your own
High ease of use for broad collaboration
- Functions for both technical and business users, who both develop analytic apps and consume them
- Assume that many user types will share their work
SQL is More Important than Ever
Data professionals want and depend on SQL. It must be ANSI standard, high performance, iterative, and optimized. Why? To leverage user skills and SQL-based tool portfolios.
The "SQL on Hadoop" versus "SQL off Hadoop" argument: users interviewed want BOTH! In a survey, SQL on Hadoop is a must-have (69%); only 4% don't need SQL on Hadoop.
Source: TDWI survey run in late 2014, based on 99 respondents.
SQL-Based Analytics
Data exploration = ad hoc queries on steroids: a query grows in size, scope, and complexity with each iteration. KLOCs = thousands of lines of [SQL] code, whether tool-generated, hand-written, or both.
Complex SQL expresses many things:
- Data access via many interfaces, in near real time
- Data models, even dimensional ones
- Multi-way joins, but also complex transformations
A growing number and diversity of users: data analysts, data scientists, BI/DW pros, and business analysts. All the above demand a hefty tool environment, as described on the next slide.
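As a hedged sketch of what "complex SQL" means here, a query from a later exploration iteration might combine a multi-way join with transformations and derived metrics. All table and column names below are hypothetical, chosen only to show the shape such SQL takes:

```sql
-- Illustrative iteration of an exploratory query: a multi-way join
-- (facts, two dimensions, an exclusion) plus derived metrics.
SELECT c.region,
       d.fiscal_quarter,
       COUNT(DISTINCT o.order_id)  AS orders,
       SUM(o.order_total)          AS revenue,
       SUM(o.order_total)
         / NULLIF(COUNT(DISTINCT o.customer_id), 0) AS revenue_per_customer
FROM   orders    o
JOIN   customers c ON c.customer_id = o.customer_id
JOIN   date_dim  d ON d.date_key    = o.order_date_key
LEFT JOIN returns r ON r.order_id   = o.order_id
WHERE  r.order_id IS NULL           -- exclude returned orders
GROUP  BY c.region, d.fiscal_quarter
ORDER  BY revenue DESC;
```

Each iteration tends to add another join, filter, or derived column, which is exactly why such queries grow into the KLOC range and why iterative query performance matters.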
SUMMARY & CONCLUSION: TOOLS AND REQUIREMENTS FOR Logical Data Warehousing and Other Complex Data Ecosystems
Look for tools and environments that enable:
- Designing and architecting a big picture
- Interoperability among diverse systems and data types
- Data operations optimized across multiple platforms
- ANSI SQL support; performance for iterative queries
Features that help with complex data architectures:
- Distributed queries, in the extreme
- High performance, even with multiple platforms
- Metadata management and metadata deduction
- Easy ingestion of new data, whether streaming or static
- Real-time indexing, to keep pace with data ingestion
- Single sign-on security, despite multiple systems
RECOMMENDATIONS: Draw the Big Picture for its Benefits
Benefits of the unified big picture of data:
- New ways to view data and develop queries and analytics
- Simplification for data architecture, governance, stewardship, compliance, auditing, security...
Revisit your mission as a data professional: tons of data, sources, and source types, in many structures (or structure-free), persisted on old and new data platform types (virtualized, as appropriate); all the above, available all the time, for everyone.
Satisfy new requirements with tools/platforms that provide a unified view:
- Virtual DW and miscellaneous approaches to Real-Time DW
- Query Grid, Data Grid, Data Fabric
- Special functions: Hadoop, exploration, SQL-based analytics
Teradata QueryGrid Imad Birouty Director, Teradata Product Marketing
1990s (data mart): Just Give Me Some Data, and Fast!
2000s (EDW/IDW): Give Me Good Data, But Do It Efficiently!
2010s (logical data warehouse): Give Me All Data, Fast, Simply, and Effectively!
What's Different Today? There Is No Single Technology That Can Do Everything
- New types of data
- New sources of data
- Higher volume of data
- New technologies
- New economic models
- Increased prevalence of analytics
What's The Same Today?
- Users need access to all relevant data to make informed business decisions
- Users need timely access to data when they need it
- User skills and tools
Shift from a Single Platform to an Ecosystem: the "Logical" Data Warehouse
"We will abandon the old models based on the desire to implement for high-value analytic applications."
Not All Data Should Be Treated Equally
Data of different value:
- High value density: ERP, CRM, ...
- Low value density: sensors, weblogs, social, ...
Different processing techniques required:
- Structured data: SQL
- Multi-structured data: SQL, NoSQL
Different integration requirements:
- Pre-defined schema, integrated upon data acquisition (schema-on-write)
- Schema defined during query runtime (schema-on-read)
Regardless... data and analytics should be accessible.
Data Fabric Enabled by QueryGrid
Analytic flexibility to meet your business needs.
Pick your best-of-breed technology:
- Data types
- Analytic engines
- Economic options
Run the right analytic on the right platform:
- Minimize data movement; process data where it resides
- Minimize data duplication
- Optimized work distribution through pushdown processing
- Bi-directional data movement
Users direct their queries to a cohesive data fabric using existing SQL skills and tools, so they can focus on data and business questions, not on integrating separate systems.
Teradata QueryGrid Demo
Metadata Goal: View Database in Hadoop
HELP FOREIGN SERVER hdp21;
Teradata Confidential
Metadata Goal: View Tables in Hadoop
HELP FOREIGN DATABASE "default"@hdp21;
Metadata Goal: View Specific Table in Hadoop
HELP FOREIGN TABLE "default".carpricedata@hdp21;
Querying Hadoop Table Goal: Select a Sample of Rows From a Hadoop Table
SELECT * FROM sample_08@hdp21;
Multi-System Query
For all cars that received warranty repair, find the reported diagnostic trouble code (DTC).
- Requires data from Hadoop and the Teradata data warehouse
- Query is passed through; data is not persisted
Hadoop holds raw, multi-structured data: massive amounts of detailed sensor data. Teradata QueryGrid connects it to Teradata production data: VINs, service records, warranty data, and DTC descriptions.
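A hedged sketch of what such a multi-system query could look like, reusing the table@server syntax and the hdp21 server name from the demo slides. The table and column names are hypothetical; the slide does not show the actual SQL:

```sql
-- Join warranty data in the Teradata warehouse with raw sensor data on
-- Hadoop, in one query, without persisting the Hadoop data. Only the
-- hdp21 server name comes from the demo; all other names are illustrative.
SELECT DISTINCT
       w.vin,
       s.dtc_code,
       d.dtc_description
FROM   warranty_repairs  w               -- Teradata: service/warranty records
JOIN   sensor_data@hdp21 s               -- Hadoop: detailed sensor readings
       ON s.vin = w.vin
JOIN   dtc_descriptions  d               -- Teradata: DTC reference data
       ON d.dtc_code = s.dtc_code;
```

The federation layer fetches only the matching sensor rows from Hadoop at query time, which is the "query passed through, data not persisted" behavior the slide describes.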
Questions?
Contact Information
If you have further questions or comments:
Philip Russom, TDWI: prussom@tdwi.org
Imad Birouty, Teradata: imad.birouty@teradata.com