Tomasz Libera Azure SQL Data Warehouse
Thanks to our partners!
About me Microsoft MVP Data Platform Microsoft Certified Trainer SQL Server Developer Academic Trainer datacommunity.org.pl One of the leaders of the Data Community Poland Organizer of conferences: SQLDay, SQLSaturday Interests Mountain biking MTB marathons tomasz.libera@datacommunity.pl blog.libera.net.pl
Agenda Introduction MPP DWU DISTRIBUTIONS Gen2 Creating tables and differences in TSQL CTAS DISTRIBUTION METHODS INDEXES STATS Monitor and tune JOIN QUERY PERFORMANCE Data Loading BCP AZCOPY POLYBASE SSIS
Introduction
MPP Architecture MPP - Massively Parallel Processing PDW - Parallel Data Warehouse -> APS - Analytics Platform System One control node + up to 60 compute nodes Distributed queries - all queries are distributed across the compute nodes Data Movement Service - internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results TSQL - tables, views, stored procedures, temporary tables, variables Columnstore architecture - default index type for all tables Microsoft PolyBase - enables querying data in Hadoop and Blob Storage using TSQL
Dynamically scale Pause and resume compute to save costs - while paused you are charged only for storage Scale compute in a few minutes Service level: Data Warehouse Unit (DWU) - unit of compute scale (CPU, memory, IO operations) DW100 1,28 EUR/h, DW6000 76,51 EUR/h
              Pause/Resume  Scale
Azure portal  Yes           Yes
PowerShell    Yes           Yes
REST API      Yes           Yes
T-SQL         No            Yes
GEN 1 vs GEN 2 DW capacity limits
Gen1:                            DW100   DW400   DW1000  DW6000
Compute nodes                        1       4       10      60
Distributions per compute node      60      15        6       1
Memory per data warehouse (GB)      24      96      240    1440
Price/hour (EUR)                  1,28    5,10    12,75   76,51
Gen2:                           DW500c  DW2000c DW5000c DW30000c
Compute nodes                        1       4       10      60
Distributions per compute node      60      15        6       1
Memory per data warehouse (GB)     300    1200     3000   18000
Price/hour (EUR)                  6,38   25,50    63,75  382,52
Distributed tables Every row in a table is assigned to a distribution (a distributed storage location) Distributions are grouped into compute nodes The number of distributions in SQL DW is static - 60 (for every DWU) Higher DWU = more compute nodes = fewer distributions assigned to each compute node DW6000/DW30000c: 1 compute node = 1 distribution
Creating databases Azure Portal, TSQL, PowerShell Database name - unique within the SQL server that hosts Azure SQL Database and SQL Data Warehouse Collation - Windows/SQL collation, default: SQL_Latin1_General_CP1_CI_AS Maximum database size - 250 GB - 240 TB, by default 10 TB (Gen2: unlimited columnar storage) Edition - datawarehouse (the only option for ASDW)
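The TSQL option above can be sketched as follows (run against the master database; the database name and service objective are hypothetical example values):

```sql
-- Minimal sketch: create a Gen1 data warehouse at DW400.
-- Name, collation, size, and service objective are example values.
CREATE DATABASE MyDataWarehouse
COLLATE SQL_Latin1_General_CP1_CI_AS
(
    MAXSIZE = 10240 GB,          -- default maximum size: 10 TB
    EDITION = 'datawarehouse',   -- the only edition for SQL DW
    SERVICE_OBJECTIVE = 'DW400'  -- compute scale (DWU)
);
```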
Gen2 5x query performance via adaptive caching technology - an NVMe solid state disk cache keeps the most frequently accessed data close to the CPUs (compute node) NVMe SSD = 3 GB/sec throughput, 0,02 ms latency SATA SSD = 500 MB/sec, 0,2 ms Improved concurrency (from 32 to 128 concurrent queries per cluster; Amazon Redshift maximum: 50) Unlimited columnar storage Offers the greatest level of scale, up to 30,000 Data Warehouse Units (Gen1 tops out at DW6000) Microsoft recommends migrating to Gen2 SQL Data Warehouse
DEMO 1 Create database PDW_SHOWSPACEUSED() CTAS Demo10 CreateDB.sql
Tables and differences in TSQL
Create Table As Select (CTAS) Fully parallelized operation Creates a new table based on the output of a SELECT statement CTAS is the simplest and fastest way to create a copy of a table Use CTAS to: Re-create a table with a different hash distribution column Re-create a table as replicated Create a columnstore index on just some of the columns in the table Query or import external data Use partitioning
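A minimal CTAS sketch (table and column names are hypothetical):

```sql
-- Re-create a table with a different hash distribution column.
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- new distribution column
    CLUSTERED COLUMNSTORE INDEX        -- default index type
)
AS
SELECT ProductKey, OrderDateKey, CustomerKey, SalesAmount
FROM dbo.FactSales;
```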
Distribution methods Each database is divided into 60 distributions, using one of 3 methods: ROUND ROBIN (default) - distributes evenly, but randomly; doesn't require knowledge about the data/queries HASH DISTRIBUTION - distributes using a hash algorithm; equal values go to the same distribution; optimal for large tables REPLICATED TABLES - all data present on every node; simplifies many query plans and reduces data movement; best for small lookup tables
Distribution method ROUND ROBIN (default) The assignment of rows to distributions is random - rows with equal values are not assigned to the same distribution When to use: no obvious joining key; no good candidate column for hash-distributing the table; the table does not share a common join key with other tables; the join is less significant than other joins in the query; the table is a temporary staging table
Distribution method HASH DISTRIBUTION Distributes table rows across the compute nodes by using a deterministic hash function Hash column: static, many unique values When to use: table size on disk is more than 2 GB
Distribution method REPLICATED TABLES All data present on every node; simplifies many query plans and reduces data movement; best for small lookup tables A replicated table caches a full copy of the table on each compute node Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation
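The three methods above map to the DISTRIBUTION option of CREATE TABLE; a sketch with hypothetical tables:

```sql
-- ROUND_ROBIN: staging table with no obvious join key.
CREATE TABLE dbo.StageSales
( SaleId INT NOT NULL, CustomerKey INT NOT NULL, Amount MONEY )
WITH ( DISTRIBUTION = ROUND_ROBIN );

-- HASH: large fact table, distributed on a frequently joined column.
CREATE TABLE dbo.FactSalesHash
( SaleId INT NOT NULL, CustomerKey INT NOT NULL, Amount MONEY )
WITH ( DISTRIBUTION = HASH(CustomerKey) );

-- REPLICATE: small lookup table cached on every compute node.
CREATE TABLE dbo.DimCurrency
( CurrencyKey INT NOT NULL, CurrencyName NVARCHAR(50) )
WITH ( DISTRIBUTION = REPLICATE );
```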
Indexes clustered columnstore index (default) clustered index (rowstore, for more selective queries) nonclustered index (rowstore) heap (faster row insert) When NOT to use the default columnstore: unsupported data types (varchar/nvarchar(max)); temporary tables; small tables (< 100 million rows)
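The index choices correspond to the WITH clause of CREATE TABLE; a sketch with hypothetical tables:

```sql
-- Default: clustered columnstore index (could also be omitted).
CREATE TABLE dbo.FactLarge
( Id INT NOT NULL, Amount MONEY )
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX );

-- Rowstore clustered index for more selective queries.
CREATE TABLE dbo.DimSelective
( Id INT NOT NULL, Name NVARCHAR(50) )
WITH ( DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (Id) );

-- Heap for faster row inserts into small/staging tables.
CREATE TABLE dbo.StageRows
( Id INT NOT NULL, Payload NVARCHAR(200) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );
```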
Statistics Statistical information about the distribution of values in a table/index Created on one or more columns SQL Data Warehouse supports auto-create statistics (since May 2018)
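A statistics sketch (statistics, table, and column names are hypothetical):

```sql
-- Single-column statistics on a frequently joined/filtered column.
CREATE STATISTICS stat_FactSales_CustomerKey
ON dbo.FactSales (CustomerKey);

-- Refresh all statistics on the table after a large load.
UPDATE STATISTICS dbo.FactSales;
```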
Temporary tables Rows are visible only in the current session Dropped when the session ends (user logs out) SQL DW vs SQL Server temp tables Similarities: a temp table created BEFORE execution of a stored proc is visible within the procedure Differences: a temp table created within a stored proc is visible after the proc execution; usually created by a CTAS statement; can be indexed (heap, clustered columnstore, clustered)
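A temp-table sketch via CTAS (names are hypothetical):

```sql
-- Session-scoped temp table created with CTAS; heap for fast insert.
CREATE TABLE #CustomerTotals
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT CustomerKey, SUM(SalesAmount) AS TotalAmount
FROM dbo.FactSales
GROUP BY CustomerKey;
```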
Partitioning Improves query performance Speeds up loading and archiving of data (partition switching) Maintenance operations on individual partitions instead of the whole table Remember that in SQL DW all tables are already divided into 60 distributions Simpler syntax than SQL Server: no partition scheme/partition function
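A partitioning sketch showing the inline syntax, with no partition scheme or function (names and boundary values are hypothetical):

```sql
-- Date-partitioned fact table; RANGE RIGHT puts each boundary value
-- into the partition to its right.
CREATE TABLE dbo.FactSalesPart
( SaleId INT NOT NULL, OrderDateKey INT NOT NULL, Amount MONEY )
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT
                FOR VALUES (20170101, 20180101, 20190101) )
);
```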
Not supported Identity (available since June 2017) Sequences Primary key, foreign key, unique, check constraints Unique indexes Computed columns Sparse columns User-defined data types Triggers Indexed views Synonyms
IDENTITY The IDENTITY column property has been supported since June 2017 Not supported: @@IDENTITY and SCOPE_IDENTITY functions; using the IDENTITY column as the hash-distribution key; IDENTITY on external tables Doesn't guarantee the order in which the surrogate values are allocated Supported: SET IDENTITY_INSERT ON
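An IDENTITY sketch (names are hypothetical; note the column cannot serve as the distribution key):

```sql
-- Surrogate key via IDENTITY; distributed by another strategy.
CREATE TABLE dbo.DimCustomer
(
    CustomerSK   INT IDENTITY(1,1) NOT NULL,  -- cannot be the hash key
    CustomerName NVARCHAR(100)     NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN );
```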
DEMO 2 Demo20 Create tables.sql Demo21 Identity.sql Demo22 Indexes.sql Demo23 Statistics.sql Demo24 Temp tables.sql Demo25 Partitioning.sql Distributions - Round_robin - hash - replicate Identity Indexes Statistics Temporary tables Partitioning
Monitor and tune
LABEL Query hint to assign a comment to a query Simplifies the monitoring process Easy to find the query in the sys.dm_pdw_exec_requests DMV Use brackets when querying the label column, as it is a keyword
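A label sketch (label text and table name are hypothetical):

```sql
-- Tag a query with a label...
SELECT COUNT(*) AS Cnt
FROM dbo.FactSales
OPTION ( LABEL = 'Demo: count fact rows' );

-- ...then find it in the DMV; brackets needed around [label].
SELECT request_id, status, total_elapsed_time, [label]
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Demo: count fact rows';
```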
sys.dm_pdw_exec_requests Contains the last 10,000 executed queries One row = one request/query Status - 'Running', 'Suspended', 'Completed', 'Cancelled', 'Failed' Resource_class - pre-determined resource limits that govern compute resources and concurrency for query execution Resource classes are implemented as pre-defined database roles
sys.dm_pdw_request_steps All steps that compose a given request or query One row = one query step Operation_type: DMS query plan operations (selected); SQL query plan operations ('OnOperation', 'RemoteOperation'); other query plan operations
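The two DMVs combine like this (the request_id value is hypothetical; take it from sys.dm_pdw_exec_requests):

```sql
-- List the steps of one request, including data movement operations.
SELECT step_index, operation_type, distribution_type,
       row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'   -- hypothetical id from the requests DMV
ORDER BY step_index;
```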
Data Movement Service DMS - data transport technology that coordinates data movement between the Compute nodes. Some queries require data movement to ensure the parallel queries return accurate results. When data movement is required, DMS ensures the right data gets to the right location.
Diagram: distribution column and data movement between nodes. Fact rows (ProductKey, OrderDateKey, CustomerKey, SalesAmount) and customer rows (CustomerKey, Firstname, Lastname) are spread across NODE 1-4; before the join, customer rows are moved between nodes so that matching CustomerKey values (e.g. 24604, 15460, 18125, 11264) end up on the same node.
DEMO 3 Monitor and tune sys.dm_pdw_exec_requests sys.dm_pdw_request_steps join query performance - Hash - Round_robin - Replicate
Data loading
Data loading BCP - export to a text file from SQL Server, import from text to SQL Data Warehouse PolyBase (external tables) - 1. bcp export to flat file 2. AZCopy 3. PolyBase SSIS - ADO.NET/OLE DB source and destination, Azure Blob Upload Task, Azure SQL DW Upload Task Redgate Data Platform Studio Other (Azure Data Lake Store, Azure Data Factory)
bcp Export/ import process: From text file to SQL Data Warehouse From SQL Data Warehouse to text file
PolyBase Microsoft PolyBase enables querying data in Hadoop and Blob Storage using TSQL With PolyBase, the data loads in parallel from the data source directly to the compute nodes The best (and fastest) method to load data into SQL Data Warehouse Data should first be loaded into Azure Blob Storage The higher the DWU, the faster the import https://docs.microsoft.com/en-us/azure/sql-data-warehouse/design-elt-data-loading
PolyBase step by step 1. bcp/SSIS export from SQL Server to a text file 2. AZCOPY - copy the data to Azure Blob Storage 3. Access to Azure Blob Storage based on a DATABASE SCOPED CREDENTIAL 4. Data source referencing the credential from the previous step - EXTERNAL DATA SOURCE 5. File format and external table definition - EXTERNAL FILE FORMAT, EXTERNAL TABLE 6. Load data into a new table using a CTAS statement
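Steps 3-6 can be sketched in T-SQL (all names, locations, and the secret are hypothetical placeholders):

```sql
-- 3. Credential for the storage account (key is a placeholder).
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

-- 4. External data source pointing at the blob container.
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://container@account.blob.core.windows.net',
       CREDENTIAL = BlobCredential );

-- 5. File format and external table definition.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',' ) );

CREATE EXTERNAL TABLE dbo.ExtSales
( SaleId INT, CustomerKey INT, Amount MONEY )
WITH ( LOCATION = '/sales/',
       DATA_SOURCE = AzureBlob,
       FILE_FORMAT = CsvFormat );

-- 6. Parallel load into an internal table via CTAS.
CREATE TABLE dbo.FactSalesLoaded
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.ExtSales;
```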
SQL Server Integration Services Feature Pack for Azure Azure Blob Upload Task replaces AZCopy
SQL Server Integration Services Feature Pack for Azure Azure SQL DW Upload Task Loads data from a text file to Blob Storage, then uses PolyBase integration to load it into a table in SQL DW
DEMO 4 Bcp Polybase SSIS Demo40 BCP.sql Demo41 Polybase.sql Demo42 SSIS.sql
THANK YOU! tomasz.libera@datacommunity.pl @tomasz_libera Slides, demos: http://bit.ly/sqlsatbanialuka_asdw