marko.hotti@microsoft.com
GARTNER MAGIC QUADRANT DW & BI
» Data Warehouse Database Management Systems
» Business Intelligence and Analytics Platforms
* Disclaimer: Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
The Traditional Data Warehouse
Breaking Points of The Traditional Data Warehouse
Introducing The Modern Data Warehouse
[Diagram labels: Business Intelligence, Data Sources]
Microsoft Hadoop Vision
Insights to all users by activating new types of data
Limitations: Performance and Scale Today
» Diminishing performance
» Scale up
» Rowstore
» Existing tables (partitions)
» Diminishing scale as requirements grow
» Non-optimal performance for many DW queries
SQL Server 2012 Parallel Data Warehouse (PDW)
Insights on any data of any size
» Next-generation performance at scale
» Built for Big Data
» Manageable costs
» Scale-out MPP versus scale-up SMP
» Big Data integration
» Query performance
» Updateable xVelocity columnstore
» Appliance simplicity: HW + SW
What is Parallel Data Warehouse?
Shared-nothing parallel database system
» Massively parallel processing (MPP)
» A Control server that accepts user queries, generates a plan, and distributes operations in parallel to compute nodes
» Multiple Compute servers running SQL Server
» A Management server for administering the system
» A Data Movement Service that facilitates parallel SQL operations
Delivered as an appliance
» Balanced and pre-configured software and industry-standard hardware from Dell or HP
» Single-call support
» Fastest time to market
» Scales from 2 to 56 nodes
[Diagram: HP example]
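The control-server role described above (accept a query, distribute operations in parallel to the compute nodes, combine the results) can be illustrated with a small scatter-gather sketch. This is only a toy model in Python, not PDW's actual implementation: the data, the function names `compute_node_partial_sum` and `control_node_query`, and the use of threads to stand in for compute nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of rows held by one compute node.
# Each row is (product_key, dollars_sold).
node_slices = [
    [("A", 10), ("B", 5)],
    [("A", 7), ("C", 3)],
    [("B", 1), ("C", 9)],
]


def compute_node_partial_sum(rows):
    """Work done on one compute node: aggregate only its own local rows."""
    partial = {}
    for key, amount in rows:
        partial[key] = partial.get(key, 0) + amount
    return partial


def control_node_query(slices):
    """Control node: fan the operation out in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_node_partial_sum, slices))
    merged = {}
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] = merged.get(key, 0) + subtotal
    return merged


totals = control_node_query(node_slices)
print(totals)
```

Because no node needs rows held by another node to compute its partial sum, the only shared step is the final merge, which is the property a shared-nothing design relies on.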
Key Design Elements
» Modular design
» High density
» Leverage latest Microsoft software features
  » Windows Server 2012 Storage Spaces
  » Windows Server 2012 Hyper-V
  » SQL Server 2012 xVelocity columnstore
[Diagram: HP example]
Ultra Shared Nothing Architecture: Distribution
Larger fact table is hash distributed across all compute nodes
Tables shown in the diagram:
» Time Dim (TD): Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
» Product Dim (PD): Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
» Store Dim (SD): Store Dim ID, Store Name, Store Mgr, Store Size
» Mktg Campaign Dim (MD): Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
» Sales Facts (SF): Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp ID, Qty Sold, Dollars Sold
[Diagram: each compute node is drawn with TD, SD, PD and MD plus one slice of SF: SF 01-08, SF 09-16, SF 17-24, SF 25-32, SF 33-n]
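A minimal Python sketch of the distribution idea on this slide: fact rows are assigned to a distribution by hashing a column, while a small dimension table is copied to every distribution so each join can run locally. The distribution count, the CRC32 hash, the choice of the product ID as the hash column, and the sample rows are all assumptions for illustration; the deck does not state which hash function or column PDW uses here.

```python
import zlib

NUM_DISTRIBUTIONS = 4  # illustrative only; not the appliance's real distribution count


def distribution_for(key: str) -> int:
    """Map a distribution-column value to a distribution (assumed hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_DISTRIBUTIONS


# Hypothetical small dimension table, replicated to every distribution.
store_dim = {1: "Store A", 2: "Store B"}

# Hypothetical fact rows: (store_dim_id, prod_dim_id, dollars_sold).
sales_facts = [
    (1, "P1", 100),
    (2, "P2", 50),
    (1, "P3", 25),
    (2, "P1", 75),
    (1, "P2", 10),
]

# Hash-distribute the fact rows on the product ID.
distributions = [[] for _ in range(NUM_DISTRIBUTIONS)]
for row in sales_facts:
    distributions[distribution_for(row[1])].append(row)


def local_join_and_sum(fact_rows, local_store_dim):
    """Join local fact rows to the local dimension copy and sum dollars per store."""
    result = {}
    for store_id, _prod_id, dollars in fact_rows:
        name = local_store_dim[store_id]
        result[name] = result.get(name, 0) + dollars
    return result


# Each distribution joins against its own copy of the dimension; partial sums are merged.
totals = {}
for fact_rows in distributions:
    for name, subtotal in local_join_and_sum(fact_rows, dict(store_dim)).items():
        totals[name] = totals.get(name, 0) + subtotal

print(totals)
```

The same product ID always lands in the same distribution, and because every distribution has the full store dimension, no rows have to move between distributions to resolve the join.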
In-Memory Columnstore in PDW V2 & SQL Server 2014
xVelocity in-memory columnstore in PDW
» Columnstore index as primary data store in a scale-out MPP data warehouse (PDW V2 appliance)
» Updateable clustered columnstore index (CCI)
» Support for bulk load and insert/update/delete
» Extended data types: decimal/numeric for all precision and scale
» Query processing enhancements for more batch mode processing (for example, outer/semi/antisemi joins, UNION ALL, scalar aggregation)
Customer benefits
» Outstanding query performance from in-memory columnstore index
  » 600 GB per hour for a single 12-core server
» Significant hardware cost savings due to high compression
  » 4-15x compression ratio
» Improved productivity through updateable index
» Ships in PDW V2 appliance and SQL Server 2014
Introducing PolyBase
Fundamental breakthrough in data processing
» SQL Server 2012 PDW, powered by PolyBase
» Single query across structured and unstructured data
» Query and join Hadoop tables with relational tables
» Use standard SQL language: SELECT, FROM, WHERE
» Existing SQL skillset
» No IT intervention
» Save time and costs
» Analyze all data types
[Diagram labels: SQL, Database, HDFS (Hadoop)]
External Tables
» An external table is PDW's representation of data residing in HDFS
» The table (metadata) lives in the context of a SQL Server database
» The actual table data resides in HDFS

    CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
    {WITH (LOCATION = <URI>,[FORMAT_OPTIONS = (<VALUES>)])}
    [;]

» LOCATION: required to indicate the location of the Hadoop cluster
» FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters and reject-related thresholds)
Native Query Across Hadoop and PDW: Parallel Data Import from HDFS into PDW
» Persistently storing data from HDFS in PDW tables
» Fully parallelized via CREATE TABLE AS SELECT (CTAS), with external tables as the source and PDW tables (either distributed or replicated) as the destination

    CREATE TABLE ClickStream_PDW
    WITH (DISTRIBUTION = HASH(url))
    AS SELECT url, event_date, user_ip FROM ClickStream;

» Retrieval of data in HDFS on-the-fly
[Diagram: Hadoop (unstructured data from sensor & RFID, web, social and mobile apps); parallel HDFS reads through the HDFS bridge by DMS Readers 1 to N; CTAS over the external table in the enhanced PDW query engine; parallel importing into PDW (structured data, traditional DW applications)]
Native Query Across Hadoop and PDW: Parallel Data Export from PDW into HDFS
» Fully parallelized via CREATE EXTERNAL TABLE AS SELECT (CETAS), with external tables as the destination and PDW tables as the source
» Round trip of data is possible: first import data from HDFS, join it with relational data, then export the results back to HDFS

    CREATE EXTERNAL TABLE ClickStream (url, event_date, user_ip)
    WITH (LOCATION = 'hdfs://myhadoop:5000/users/outputdir',
          FORMAT_OPTIONS (FIELD_TERMINATOR = ' '))
    AS SELECT url, event_date, user_ip FROM ClickStream_PDW;

[Diagram: PDW (structured data, traditional DW applications); parallel reading by the enhanced PDW query engine; CETAS over the external table; parallel HDFS writes through the HDFS bridge by DMS Writers 1 to N; HDFS data nodes (unstructured data from sensor & RFID, web, social and mobile apps)]
PDW V2.0 Management Dashboard
Microsoft Business Intelligence Platform