Build ETL efficiently (10x) with Minimal Logging


Build ETL efficiently (10x) with Minimal Logging
Simon Cho
Blog: Simonsql.com | Simon@simonsql.com

Please Support Our Sponsors. SQL Saturday is made possible with the generous support of these sponsors. You can support them by opting in and visiting them in the sponsor area.

Local User Groups
- Orange County User Group: 2nd Thursday of each month (bigpass.pass.org)
- Los Angeles User Group: 3rd Thursday of each odd month (sql.la)
- Malibu User Group: 3rd Wednesday of each month (sqlmalibu.pass.org)
- San Diego User Group: 1st & 3rd Thursday of each month (meetup.com/sdsqlug, meetup.com/sdsqlbig)
- Los Angeles - Korean: every other Tuesday (sqlangeles.pass.org)

Who are we? SQLAngeles.com, an official local chapter group in SQLPASS.org, and the only Korean-speaking community in SQL PASS. Blog: Simonsql.com/SQLmvp.kr. Email: SQLAngeles@sqlpass.org, SQLAngeles@gmail.com

Agenda
- Want to discuss first: quick review
  - SARG
  - Index access methods
  - Tipping Point
- Case 1: What's the best way to pull this table?
- Introduction to Minimal Logging
  - What is minimal logging?
  - How does it work?
  - Conditions
  - Recovery model

Question 1: SARG. How do they differ?

--Query1
DECLARE @BaseAccountID BIGINT = 1
SELECT * FROM dbo.[Accounts] WHERE AccountID = @BaseAccountID + 1000
GO
--Query2
DECLARE @BaseAccountID BIGINT = 1
SELECT * FROM dbo.[Accounts] WHERE AccountID - 1000 = @BaseAccountID
GO

Question 1: SARG. How do they differ?

Question 2: SARG. How do they differ?
(AccountID: BIGINT, AccountName: VARCHAR)

--Query1
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountID = 1000
SELECT AccountID, DCID FROM dbo.[Accounts] WHERE AccountID = '1000'
GO
--Query2
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountName = 1000
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountName = '1000'
GO

Question 2: SARG. How do they differ?

SARG - Search Arguments
Sargability: Why %string% Is Slow (and is 'string%' a SARG argument?)
https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/
CAST and CONVERT (Transact-SQL)
https://msdn.microsoft.com/en-us/library/ms187928.aspx
Data Type Precedence (Transact-SQL)
https://msdn.microsoft.com/en-us/library/ms190309.aspx
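The data type precedence rules referenced above explain Question 2: comparing a VARCHAR column to an integer literal implicitly converts the column on every row, which defeats the index. A sketch (table and column names follow the earlier slides and are illustrative):

```sql
-- Implicit conversion: AccountName is VARCHAR, 1000 is INT. INT has higher
-- precedence, so SQL Server applies CONVERT_IMPLICIT to AccountName on every
-- row, turning a potential index seek into a scan.
SELECT AccountID, AccountName FROM dbo.Accounts WHERE AccountName = 1000;

-- Sargable: compare against a string literal matching the column's own type,
-- so an index on AccountName can be sought directly.
SELECT AccountID, AccountName FROM dbo.Accounts WHERE AccountName = '1000';
```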

Let's talk about index access methods. There are only three methods, regardless of clustered or nonclustered:
(1) Index Scan (execution plan shows: Table Scan, Index Scan)
(2) Index Seek: individual key lookup (execution plan shows: Index Seek)
(3) Index Seek: Ordered Partial Scan

(1) Table Scan/Index Scan SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders;

(2) Index Seek : Individual key lookup SELECT * FROM [Account].[Accounts] where AccountID in (100,200,300)

(3) Index Seek : Ordered Partial Scan SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders WHERE orderdate = '20060212';

Combination Partial Scan + Index Seek SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders WHERE orderid BETWEEN 101 AND 120;

Let's talk about index access methods. What's the most efficient method for ETL? (1) Index Scan, (2) Index Seek: individual key lookup, or (3) Index Seek: Ordered Partial Scan? Book: Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan, Chapter 3. Must read.


What's the problem of Index Seek?

CREATE TABLE #tmp (AccountID BIGINT)
GO
TRUNCATE TABLE #tmp
INSERT INTO #tmp (AccountID) SELECT 1
GO 1000
SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.AccountID = b.AccountID

It can read the same page multiple times!

What's the problem of Index Seek?

--Query A: 1,000 individual key lookups on the same key
SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.AccountID = b.AccountID

--Query B: the same rows via a range predicate
SELECT a.* FROM dbo.[Accounts] a JOIN dbo.Num b ON n BETWEEN 1 AND 1000 WHERE AccountID = 1

Query A output:
(1000 row(s) affected)
Table 'Accounts'. Scan count 0, logical reads 3071, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tmp 00000000157D'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)

Query B output:
(1000 row(s) affected)
Table 'Num'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)

What s the problem of Index Seek? SELECT a.* FROM [dbo].[accounts] a JOIN #tmp b ON a.accountid = b.accountid SELECT a.* FROM [dbo].[accounts] a JOIN dbo.num b ON n BETWEEN 1 AND 1000 WHERE a.accountid=1
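The dbo.Num table used in the comparison above is an auxiliary numbers table that the slides never define. If you want to reproduce the test, a minimal sketch (the name and size are assumed from the queries):

```sql
-- Build a small auxiliary numbers table (1..1000) for the range query.
CREATE TABLE dbo.Num (n INT NOT NULL PRIMARY KEY);

INSERT INTO dbo.Num (n)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;  -- any catalog view with >= 1000 rows works as a row source
```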

So, what's the best case for data extraction? Using a partial scan instead of individual record seeks. What do we need for this? A covering + ordered index. If that's not an option, what should we do: an unordered scan (table scan) or individual seeks?

Tipping Point. That's what we call the Tipping Point. (Ref: Kimberly Tripp, SQL PASS 2010: Indexing Strategies.) It's the point where the number of rows returned is "no longer selective enough": SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan. What % of the data do you guess it is?

Tipping Point

Query #1: table with 1 million rows over 50,000 pages (20 rows/page). Tipping point: 12,500-16,666 (25-33% of the page count) rows = 1.25-1.66% of rows.
Query #2: table with 1 million rows over 10,000 pages (100 rows/page). Tipping point: 2,500-3,333 rows = 0.25-0.33% of rows.
Query #3: table with 1 million rows over 100,000 pages (10 rows/page). Tipping point: 25,000-33,333 rows = 12.5-16.66% of rows.

Note: this testing was on SQL 2005 with a non-fragmented index. It depends on many things: system, SQL version, fragmentation, etc. So there is no exact number to go by. In my experience, when more than roughly 1% of the data is returned, a table scan is generally better.
http://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/

Tipping Point: what exactly does it affect?
Directly: Index Seek vs. Scan for data access; bookmark lookup vs. clustered scan.
Indirectly, for ETL: pulling the whole data set vs. just the delta. We need to consider writes as well, where we can utilize minimal logging: update records in bulk, or create a new table? Generally, 5-10% data change is the tipping point for ETL. At about 20% data change, a rebuild is most likely better.

Case 1
Source table (Source_A): 1.5 GB, 20 M records.
Target table (Target_B): 1 clustered index + 3 nonclustered indexes.
Goal: refresh the target table on a daily basis, pulling data over a linked server.

Case 1: which method?

Method 1:
- Drop clustered index
- Drop all nonclustered indexes
- Truncate table Target_B
- Insert Target_B select * from Source_A
- Create clustered index

Method 2:
- Drop table Target_B
- Create table Target_B
- Create clustered index
- Create all nonclustered indexes
- Insert Target_B select * from Source_A

Method 3 (simple is the best!):
- Truncate table Target_B
- Insert Target_B select * from Source_A
- Create all nonclustered indexes

Or do I have a better one?

Case 1: which method?

Method 4:
- Truncate table
- Drop all nonclustered indexes
- Drop clustered index
- Insert Target_B with(tablock) select * from Source_A
- Create clustered index
- Create all nonclustered indexes

Method 5:
- Truncate table
- Drop all nonclustered indexes (clustered index remains)
- Insert Target_B with(tablock) select * from Source_A
- Create all nonclustered indexes

Method 6:
- Drop table Target_B
- Select * into Target_B from Source_A
- Create clustered index
- Create all nonclustered indexes
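Method 5 above, written out as T-SQL. This is a sketch: the index names and key columns are hypothetical, since the case study doesn't name them.

```sql
-- Method 5: keep the clustered index, drop only the nonclustered indexes.
TRUNCATE TABLE dbo.Target_B;

DROP INDEX IX_Target_B_1 ON dbo.Target_B;  -- hypothetical index names
DROP INDEX IX_Target_B_2 ON dbo.Target_B;
DROP INDEX IX_Target_B_3 ON dbo.Target_B;

-- TABLOCK makes the insert eligible for minimal logging
-- (subject to the prerequisites covered later in the deck).
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;

-- Recreate the nonclustered indexes after the load.
CREATE INDEX IX_Target_B_1 ON dbo.Target_B (Col1);  -- hypothetical columns
CREATE INDEX IX_Target_B_2 ON dbo.Target_B (Col2);
CREATE INDEX IX_Target_B_3 ON dbo.Target_B (Col3);
```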

Scenario 1 Summary: 1 million rows = 80 MB, SSD, linked server

Recovery | Method | Seconds | Log Count | Log Length (bytes) | Compare
Simple   | 1      | 23      | 1,121,768 | 183,817,760        | 80x
Simple   | 2,3    | 27      | 4,346,543 | 567,734,748        | 250x
Simple   | 4,5,6  | 11      | 32,823    | 2,304,100          | 1x
Bulk     | 1      | 22      | 1,126,958 | 184,106,236        | 80x
Bulk     | 2,3    | 27      | 4,346,495 | 567,733,504        | 250x
Bulk     | 4,5,6  | 11      | 37,521    | 2,573,788          | 1x
Full     | 1      | 23      | 1,174,056 | 523,121,874        | 230x
Full     | 2,3    | 29      | 4,372,115 | 569,862,522        | 250x
Full     | 4,5,6  | 11      | 69,618    | 308,086,556        | 130x

Note: how to check log count and log length:
1. Run CHECKPOINT or a log backup, depending on the recovery model.
2. Execute:
SELECT COUNT(*) AS LogRecordCount, SUM([Log Record Length]) FROM sys.fn_dblog(NULL, NULL)
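The note above can be wrapped into a small repeatable harness. A sketch, assuming a scratch database in SIMPLE recovery and the illustrative Target_B/Source_A tables from the case study:

```sql
-- CHECKPOINT truncates the log in SIMPLE recovery, giving each test run
-- a near-empty log to measure from.
CHECKPOINT;

-- The load method under test, e.g. a TABLOCK insert into a heap:
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;

-- sys.fn_dblog (undocumented but widely used) returns the active log
-- records; counting them right after the load shows how much log it wrote.
SELECT COUNT(*)                 AS LogRecordCount,
       SUM([Log Record Length]) AS LogBytes
FROM sys.fn_dblog(NULL, NULL);
```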

Maybe someone will say something like this!

Somebody may not agree: "SSIS is super fast!! Do not use linked servers!" So, I tested method 3 with an SSIS package.

Recovery | Method | Seconds | Log Count | Log Length (bytes) | Compare
Simple   | 3      | 24      | 4,273,104 | 557,659,212        | 250x
Bulk     | 3      | 24      | 4,315,214 | 563,119,236        | 250x
Full     | 3      | 24      | 4,340,768 | 567,257,152        | 250x

No difference!

Somebody says the duration doesn't differ much, so we're OK to keep using the old method. Yes, I'm quite impressed with SSD. So, I tested with a USB external disk: 20 million rows, 1.5 GB, only the slowest (method 3) and the fastest (method 4).

Recovery | Method | Seconds | Log Count  | Log Length (bytes) | Compare
Simple   | 3      | 1705    | 86,924,396 | 11,354,768,774     | 130x
Simple   | 4      | 610     | 1,100,271  | 84,231,150         | 1x

Still impressive; the duration doesn't differ as much as expected (I expected about a 20x difference). But I'm pretty sure that if you are dealing with a bigger data set, you will see a much larger difference. Ex: in my experience, a big ETL job that took 24 hours now takes only 45 minutes.

Let's look at the detailed numbers (20 M rows, 1.5 GB):
- 1,005,690 physical writes: 1,005,690 * 8 KB / 1024 / 1024 = 7.9 GB
- 727,303 physical reads: 727,303 * 8 KB / 1024 / 1024 = 5.5 GB
- 1,298,861 CPU time: 1,298,861 / 1000 = 1,298 sec
- 554,488,821 logical reads (buffer pool reads): 554,488,821 * 8 KB / 1024 / 1024 = 4,230 GB

Just for a 1.5 GB table! 8x more log created on disk, vs. 6x CPU, 1.5x reads, 2x writes, 250x buffer pool reads.

I know you feel like this

Minimal Logging Operation

Full logging: everything is logged at the individual row level.
Minimal logging: some call it "no logging", but technically that's not true. It does not log every individual row change; it logs only enough information for rollback, and only extent allocations each time a new extent is allocated to the table.

Minimal Logging Prerequisites: http://technet.microsoft.com/en-us/library/ms190422%28v=sql.100%29.aspx
For a database under the full recovery model, all row-insert operations performed by bulk import are fully logged in the transaction log. Note: since SQL 2008 this is improved with TABLOCK; in SQL 2008 and above, under the full recovery model, logging happens at the page allocation level.
CDC or replication: always fully logged.

What can be minimally logged? With the prerequisites met, the following operations can be minimally logged:
- SELECT INTO
- Bulk import: BULK INSERT, bcp
- CREATE/ALTER/DROP INDEX
- INSERT INTO table WITH (TABLOCK) SELECT ...
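The first and last patterns in the list can be sketched as follows. Table names are illustrative, and minimal logging applies only when the prerequisites on the later slides are met (e.g. a heap target, SIMPLE or BULK_LOGGED recovery, no replication):

```sql
-- SELECT INTO creates the target table and can be minimally logged.
SELECT * INTO dbo.Target_B FROM dbo.Source_A;

-- INSERT ... WITH (TABLOCK) into a heap can also be minimally logged.
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;
```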

Are you concerned about the table lock? Minimal logging requires a table lock while inserting data. A table lock blocks other update/insert operations, and READ COMMITTED readers can't read the data either; dirty reads (NOLOCK) are still possible. So, not many developers like the blocking.
* SELECT INTO takes a schema modification lock, which does not allow dirty reads either.

Are you concerned about TABLOCK? Table lock is the default setting of the SSIS package Data Flow destination.

Prerequisites of Minimal Logging

Table requirements for minimally logging bulk-import operations:
- The table is not being replicated.
- Table locking is specified (using TABLOCK).
- Database recovery model: bulk-logged or simple. Note: the bulk-logged recovery model is designed to temporarily replace the full recovery model during large bulk operations.
- No indexes (heap): data pages are minimally logged, whether the table is empty or not.
- Nonclustered indexes only: data pages are minimally logged. If the table is empty, index pages are minimally logged as well; if the table is non-empty, index pages are fully logged. Ex: if you start with an empty table and bulk import the data in multiple batches, both index and data pages are minimally logged for the first batch, but beginning with the second batch, only data pages are minimally logged.
- Clustered index, no nonclustered indexes, empty table: both data and index pages are minimally logged. If the table is non-empty, data pages and index pages are both fully logged.
- Clustered index and nonclustered indexes: fully logged.

https://technet.microsoft.com/en-us/library/ms190422(v=sql.120).aspx

http://technet.microsoft.com/en-us/library/dd425070%28v=sql.100%29.aspx

Minimal Logging Operation

Indexes         | Rows in table | Hints              | Logging
Heap            | Any           | TABLOCK            | Minimal
Heap            | Any           | None               | Full
Heap + Index    | Any           | TABLOCK            | Full
Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal
Cluster         | Empty         | None               | Full
Cluster         | Any           | None               | Full
Cluster         | Any           | TABLOCK            | Full
Cluster + Index | Any           | None               | Full
Cluster + Index | Any           | TABLOCK            | Full

(1) If you are using the INSERT ... SELECT method, the ORDER hint does not have to be specified, but the rows must be in the same order as the clustered index. If using BULK INSERT, the ORDER hint must be used.
(2) Concurrent loads are only possible under certain conditions. See "Bulk Loading with the Indexes in Place". Also, only rows written to newly allocated pages are minimally logged.
(3) Depending on the plan chosen by the optimizer, the nonclustered index on the table may be either fully or minimally logged.

What is a Recovery Model?

Full: log backups required; everything is fully logged. Note: starting with SQL 2008, some statements can produce less log even in the full recovery model.
Simple: automatically reclaims log space; no log backups. The log chain is broken, so point-in-time restore is not possible. Fully supports minimal logging operations.
Bulk_Logged: log backups required. Note: the log backup size is pretty much the same as in the full recovery model, even for minimally logged operations. Fully supports minimal logging operations. The log chain is NOT broken.

Note: unfortunately, a mirrored database's recovery model can't be changed (mirroring requires full recovery).

Bulk-Logged Recovery Model
- Protects against media failure.
- Normal (fully logged) transactions: same as the full recovery model.
- Minimally logged operations: provides the best performance and least log space usage. The LDF file doesn't contain the full transaction log, so point-in-time restore is NOT available (STOPAT is not allowed). The log backup contains the whole modified pages (extents).
Please check the URLs below before changing to the bulk-logged recovery model:
https://technet.microsoft.com/en-us/library/ms190692.aspx
https://msdn.microsoft.com/en-us/library/ms179451.aspx

Bulk-Logged Recovery Model (cont.)
There is risk, but the performance gain is very big, and it's required only on the target DB.
Ex: how to use it. Temporarily change to the bulk-logged recovery model for the minimally logged statement in a full-recovery database, run log backups more frequently, and change back after the minimally logged operation completes.
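The pattern just described can be sketched as follows. The database name, table names, and backup paths are illustrative; the surrounding log backups keep the log chain intact and shrink the window where point-in-time restore is unavailable:

```sql
-- 1. Take a log backup so the bulk-logged window starts cleanly.
BACKUP LOG MyETLDb TO DISK = N'D:\backup\MyETLDb_pre.trn';

-- 2. Temporarily switch the target database to BULK_LOGGED.
ALTER DATABASE MyETLDb SET RECOVERY BULK_LOGGED;

-- 3. Run the minimally logged load (heap target + TABLOCK).
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;

-- 4. Switch back to FULL and immediately back up the log.
-- Point-in-time restore is unavailable only inside the bulk-logged window.
ALTER DATABASE MyETLDb SET RECOVERY FULL;
BACKUP LOG MyETLDb TO DISK = N'D:\backup\MyETLDb_post.trn';
```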

Introducing Trace Flag 610 (SQL 2008 and above)
When the bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged. Please don't enable this TF on production right away: it may add overhead to regular fully logged transactions.
https://simonsql.com/2011/12/05/minimal-logging-operation-with-traceflag-610/
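Enabling the trace flag for a test might look like this (a sketch: session-level first, and instance-wide only after validating the workload impact, per the warning above):

```sql
-- Enable TF 610 for the current session only (safer for testing).
DBCC TRACEON (610);

-- ... run the bulk load under test here ...

-- Check which trace flags are active, then turn it back off.
DBCC TRACESTATUS (610);
DBCC TRACEOFF (610);

-- Instance-wide (-1) only after validating on a non-production workload:
-- DBCC TRACEON (610, -1);
```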

Minimal Logging Operation with TF 610

Table Indexes   | Rows in table | Hints              | Without TF 610 | With TF 610 | Concurrent possible
Heap            | Any           | TABLOCK            | Minimal        | Minimal     | Yes
Heap            | Any           | None               | Full           | Full        | Yes
Heap + Index    | Any           | TABLOCK            | Full           | Depends (3) | No
Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal        | Minimal     | No
Cluster         | Empty         | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | TABLOCK            | Full           | Minimal     | No
Cluster + Index | Any           | None               | Full           | Depends (3) | Yes (2)
Cluster + Index | Any           | TABLOCK            | Full           | Depends (3) | No

https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx

Q & A
Simon Cho
Blog: Simonsql.com | Simon@simonsql.com