Build ETL efficiently (10x) with Minimal Logging Simon Cho Blog : Simonsql.com Simon@simonsql.com
Please Support Our Sponsors SQL Saturday is made possible with the generous support of these sponsors. You can support them by opting-in and visiting them in the sponsor area.
Local User Groups
Orange County User Group: 2nd Thursday of each month (bigpass.pass.org)
Los Angeles User Group: 3rd Thursday of each odd month (sql.la)
Malibu User Group: 3rd Wednesday of each month (sqlmalibu.pass.org)
San Diego User Group: 1st & 3rd Thursday of each month (meetup.com/sdsqlug, meetup.com/sdsqlbig)
Los Angeles - Korean: every other Tuesday (sqlangeles.pass.org)
Who are we? SQLAngeles.com, an official local chapter group of SQLPASS.org and the only Korean-speaking community in SQL PASS. Blog: Simonsql.com / SQLmvp.kr. Email: SQLAngeles@sqlpass.org, SQLAngeles@gmail.com
Agenda
Quick review:
- SARG
- Index access methods
- Tipping Point
Case 1: What's the best way to pull this table?
Introduction to Minimal Logging:
- What is minimal logging?
- How does it work?
- Conditions
- Recovery model
Question 1: SARG. How do these differ?
--Query1
DECLARE @BaseAccountID BIGINT = 1;
SELECT * FROM dbo.[Accounts] WHERE AccountID = @BaseAccountID + 1000;
GO
--Query2
DECLARE @BaseAccountID BIGINT = 1;
SELECT * FROM dbo.[Accounts] WHERE AccountID - 1000 = @BaseAccountID;
GO
Question 1: SARG. How do these differ? (Query 1 computes on the variable side, so the optimizer can seek on AccountID; Query 2 applies the computation to the column itself, which is not sargable and forces a scan.)
Question 2: SARG. How do these differ?
(AccountID: BIGINT, AccountName: VARCHAR)
--Query1
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountID = 1000;
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountID = '1000';
GO
--Query2
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountName = 1000;
SELECT AccountID, AccountName FROM dbo.[Accounts] WHERE AccountName = '1000';
GO
Question 2: SARG. How do these differ? (Data type precedence: BIGINT outranks VARCHAR. In Query 1 the literal '1000' is converted to BIGINT, so the seek on AccountID survives; in Query 2 the column AccountName must be converted to match the integer 1000, which prevents a seek.)
SARG (Search Arguments)
- Sargability: Why %string% Is Slow. Is string% a sargable argument? https://www.brentozar.com/archive/2010/06/sargable-why-string-is-slow/
- CAST and CONVERT (Transact-SQL): https://msdn.microsoft.com/en-us/library/ms187928.aspx
- Data Type Precedence (Transact-SQL): https://msdn.microsoft.com/en-us/library/ms190309.aspx
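The fix for a non-sargable predicate is to move the computation off the column. A minimal sketch, reusing the Accounts example from the questions above:

```sql
DECLARE @BaseAccountID BIGINT = 1;

-- Non-sargable: the computation is applied to the column, so the
-- optimizer cannot use an index seek on AccountID.
SELECT * FROM dbo.[Accounts]
WHERE AccountID - 1000 = @BaseAccountID;

-- Sargable rewrite: the column stands alone on one side of the
-- comparison, so an index seek on AccountID is possible.
SELECT * FROM dbo.[Accounts]
WHERE AccountID = @BaseAccountID + 1000;
```

The same principle applies to implicit conversions: compare the column against a literal of the column's own type so the conversion lands on the literal, not the column.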
Let's talk about index access methods. There are only three methods, regardless of clustered or nonclustered:
(1) Index Scan. Execution plan shows: Table Scan, Index Scan.
(2) Index Seek: individual key lookup. Execution plan shows: Index Seek.
(3) Index Seek: Ordered Partial Scan.
(1) Table Scan/Index Scan SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders;
(2) Index Seek : Individual key lookup SELECT * FROM [Account].[Accounts] WHERE AccountID IN (100, 200, 300)
(3) Index Seek : Ordered Partial Scan SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders WHERE orderdate = '20060212';
Combination Partial Scan + Index Seek SELECT orderid, custid, empid, shipperid, orderdate FROM dbo.orders WHERE orderid BETWEEN 101 AND 120;
Let's talk about index access methods. What's the most efficient method for ETL? (1) Index Scan; (2) Index Seek, individual key lookup; (3) Index Seek, Ordered Partial Scan. Book: Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan, Chapter 3. A must-read.
What's the problem with Index Seek?
CREATE TABLE #tmp (AccountID BIGINT)
GO
TRUNCATE TABLE #tmp
INSERT INTO #tmp (AccountID) SELECT 1
GO 1000
SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.AccountID = b.AccountID
It can read the same page multiple times!
What's the problem with Index Seek?
--Query1
SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.AccountID = b.AccountID

(1000 row(s) affected)
Table 'Accounts'. Scan count 0, logical reads 3071, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tmp 00000000157D'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)

--Query2
SELECT a.* FROM dbo.[Accounts] a JOIN dbo.Num b ON n BETWEEN 1 AND 1000 WHERE AccountID = 1

(1000 row(s) affected)
Table 'Num'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)

The 1000 individual seeks in Query 1 touch 3071 pages (roughly one root-to-leaf traversal per key), even though the rows themselves live on far fewer pages.
What's the problem with Index Seek?
SELECT a.* FROM dbo.[Accounts] a JOIN #tmp b ON a.AccountID = b.AccountID
SELECT a.* FROM dbo.[Accounts] a JOIN dbo.Num b ON n BETWEEN 1 AND 1000 WHERE a.AccountID = 1
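The dbo.Num comparison above relies on an auxiliary numbers table. A minimal sketch of building one (the table name dbo.Num and the size of 1000 rows follow the demo; both are illustrative):

```sql
-- Auxiliary numbers table: n = 1..1000, clustered on n.
CREATE TABLE dbo.Num (n INT NOT NULL PRIMARY KEY);

-- Cross join two catalog views just to generate enough rows,
-- then number them sequentially.
INSERT INTO dbo.Num (n)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```

Because dbo.Num is ordered and contiguous, joining through it lets the optimizer replace many individual key lookups with one ordered partial scan.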
So, what's the best case for data extraction? Use a partial scan instead of individual record seeks. What do we need for this? A covering + ordered index. If that's not an option, what should we do: an unordered scan (table scan) or individual seeks?
Tipping Point
That's what we call the Tipping Point. (Ref: Kimberly Tripp, SQL PASS 2010: Indexing Strategies.) It's the point where the number of rows returned is "no longer selective enough": SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan. What percentage of the data do you guess that is?
Tipping Point
Query #1: table with 1 million rows over 50,000 pages (20 rows/page). Tipping point: 12,500-16,666 rows, i.e. 1.25-1.66% of the table.
Query #2: table with 1 million rows over 10,000 pages (100 rows/page). Tipping point: 2,500-3,333 rows, i.e. 0.25-0.33%.
Query #3: table with 1 million rows over 100,000 pages (10 rows/page). Tipping point: 25,000-33,333 rows, i.e. 2.5-3.33%.
Note: this testing was on SQL 2005, with a non-fragmented index. It depends on many things: system, SQL version, fragmentation, etc. So there is no exact number. In my experience, when more than roughly 1% of the data is needed, a table scan is generally better.
http://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/
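The pattern in the three queries above is that the tipping point falls between 1/4 and 1/3 of the table's page count, expressed in rows. A sketch of checking that range for a real table (the table name dbo.Accounts is illustrative):

```sql
-- Estimate the tipping-point row range for a table: roughly between
-- 1/4 and 1/3 of its page count, expressed as a number of rows.
SELECT
    p.used_page_count,
    p.row_count,
    p.used_page_count / 4.0 AS tipping_low_rows,
    p.used_page_count / 3.0 AS tipping_high_rows
FROM sys.dm_db_partition_stats AS p
WHERE p.object_id = OBJECT_ID('dbo.Accounts')
  AND p.index_id IN (0, 1);   -- heap (0) or clustered index (1)
```

If a query is expected to return more rows than tipping_high_rows, a scan is likely cheaper than nonclustered-index lookups.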
Tipping Point
What exactly does it affect? Directly: the index seek vs. scan decision for data access (bookmark lookups vs. a clustered scan). Indirectly, for ETL: pulling the whole data set vs. just the delta. You need to consider writes as well, and that's where minimal logging comes in. Update records in bulk, or create a new table? Generally, 5-10% data change is the tipping point for ETL; at around 20% data change, a full rebuild is most likely better.
Case 1
Source table (Source_A): 1.5 GB, 20 M records.
Target table (Target_B): 1 clustered index + 3 nonclustered indexes.
Goal: refresh the target table daily, pulling data over a linked server.
Case 1: which method?
Method 1:
- Drop clustered index
- Drop all nonclustered indexes
- Truncate table Target_B
- INSERT Target_B SELECT * FROM Source_A
- Create clustered index
Method 2:
- Drop table Target_B
- Create table Target_B
- Create clustered index
- Create all nonclustered indexes
- INSERT Target_B SELECT * FROM Source_A
Method 3 (simple is best!):
- Truncate table Target_B
- INSERT Target_B SELECT * FROM Source_A
- Create all nonclustered indexes
Or do I have a better one?
Case 1: which method?
Method 4:
- Truncate table
- Drop all nonclustered indexes
- Drop clustered index
- INSERT Target_B WITH (TABLOCK) SELECT * FROM Source_A
- Create clustered index
- Create all nonclustered indexes
Method 5:
- Truncate table
- Drop all nonclustered indexes (clustered index remains)
- INSERT Target_B WITH (TABLOCK) SELECT * FROM Source_A
- Create all nonclustered indexes
Method 6:
- Drop table Target_B
- SELECT * INTO Target_B FROM Source_A
- Create clustered index
- Create all nonclustered indexes
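Method 5 can be sketched as a script. The index names, linked server name, and column names here are illustrative, not from the original demo:

```sql
-- Method 5: keep the clustered index, drop only the nonclustered
-- indexes, bulk-insert with TABLOCK, then rebuild the NCIs.
TRUNCATE TABLE dbo.Target_B;

DROP INDEX IX_Target_B_Col1 ON dbo.Target_B;
DROP INDEX IX_Target_B_Col2 ON dbo.Target_B;
DROP INDEX IX_Target_B_Col3 ON dbo.Target_B;

-- Empty clustered-index table + TABLOCK: eligible for minimal
-- logging under the SIMPLE or BULK_LOGGED recovery model.
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM LinkedSrv.SourceDb.dbo.Source_A;

CREATE NONCLUSTERED INDEX IX_Target_B_Col1 ON dbo.Target_B (Col1);
CREATE NONCLUSTERED INDEX IX_Target_B_Col2 ON dbo.Target_B (Col2);
CREATE NONCLUSTERED INDEX IX_Target_B_Col3 ON dbo.Target_B (Col3);
```

Index creation is itself a bulk operation, so rebuilding the NCIs after the load is also minimally logged under simple or bulk-logged recovery.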
Scenario 1 summary (1 million rows = 80 MB, SSD, linked server)

Recovery   Method   Seconds   Log Count   Log Length (bytes)   Compare
Simple     1        23        1,121,768   183,817,760          80x
Simple     2,3      27        4,346,543   567,734,748          250x
Simple     4,5,6    11        32,823      2,304,100            1x
Bulk       1        22        1,126,958   184,106,236          80x
Bulk       2,3      27        4,346,495   567,733,504          250x
Bulk       4,5,6    11        37,521      2,573,788            1x
Full       1        23        1,174,056   523,121,874          230x
Full       2,3      29        4,372,115   569,862,522          250x
Full       4,5,6    11        69,618      308,086,556          130x

Note: how to check log count and log length:
1. CHECKPOINT or log backup, depending on the recovery model.
2. Execute:
SELECT COUNT(*) AS LogRecordCount, SUM([Log Record Length]) FROM sys.fn_dblog(NULL, NULL)
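The measurement in the note can be wrapped into a small harness. A sketch, assuming a database in the SIMPLE recovery model (where CHECKPOINT marks the inactive log for reuse, so fn_dblog afterwards mostly reflects the operation under test):

```sql
-- 1. Clear the reusable portion of the log.
CHECKPOINT;

-- 2. Run the operation under test, e.g. one of the load methods:
--    INSERT INTO dbo.Target_B WITH (TABLOCK)
--    SELECT * FROM dbo.Source_A;

-- 3. Count the log records the operation generated.
SELECT COUNT(*)                 AS LogRecordCount,
       SUM([Log Record Length]) AS LogBytes
FROM sys.fn_dblog(NULL, NULL);
```

sys.fn_dblog is undocumented and intended for diagnostics, so use it on a test system, not as part of a production job.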
Maybe someone will say something like this!
Somebody may not agree: "SSIS is super fast!! Don't use a linked server!" So I tested method 3 with an SSIS package.

Recovery   Method   Seconds   Log Count   Log Length (bytes)   Compare
Simple     3        24        4,273,104   557,659,212          250x
Bulk       3        24        4,315,214   563,119,236          250x
Full       3        24        4,340,768   567,257,152          250x

No difference!!
Somebody says the duration doesn't differ much, so we're OK to keep the old method. Yes, I'm quite impressed with SSD. So I tested with a USB external disk: 20 million rows, 1.5 GB, only the slowest (method 3) and fastest (method 4).

Recovery   Method   Seconds   Log Count    Log Length (bytes)   Compare
Simple     3        1705      86,924,396   11,354,768,774       130x
Simple     4        610       1,100,271    84,231,150           1x

Still impressive. The duration doesn't differ as much as I expected (about 20x). But I'm pretty sure that if you are dealing with a bigger data set, you will see a much larger difference. Ex) In my experience, a big ETL job that used to take 24 hours now takes only 45 minutes.
Let's look at the detailed numbers (20 M rows, 1.5 GB):
- 1,005,690 physical writes: 1,005,690 * 8 KB / 1024 / 1024 = 7.9 GB
- 727,303 physical reads: 727,303 * 8 KB / 1024 / 1024 = 5.5 GB
- 1,298,861 ms CPU time: 1,298,861 / 1000 = 1,298 seconds
- 554,488,821 logical reads (buffer pool): 554,488,821 * 8 KB / 1024 / 1024 = 4,230 GB
Just for a 1.5 GB table! 8x more log created on disk, plus 6x CPU, 1.5x reads, 2x writes, and 250x buffer pool reads.
I know you feel like this
Minimal Logging Operation
- Full logging: everything is logged at the individual row level.
- Minimal logging: some call it "no-logging", but technically that's not true. It does not log every individual row change; it logs only enough information to roll back, plus extent allocations each time a new extent is allocated to the table.
- Minimal Logging Prerequisites: http://technet.microsoft.com/en-us/library/ms190422%28v=sql.100%29.aspx
- Under the full recovery model, all row-insert operations performed by bulk import are fully logged in the transaction log. Note: starting with SQL 2008, this is improved with TABLOCK; in full recovery mode, logging happens at the page-allocation level.
- CDC or replication: always fully logged.
What can be minimally logged? When the prerequisites are met, these operations can be minimally logged:
- SELECT INTO
- Bulk import: BULK INSERT, bcp
- CREATE/ALTER/DROP INDEX
- INSERT INTO table WITH (TABLOCK) SELECT ...
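The bulk-import variants above can be sketched as follows. The file path and table names are illustrative:

```sql
-- SELECT INTO: creates the target table (as a heap) and loads it;
-- minimally logged under SIMPLE or BULK_LOGGED recovery.
SELECT * INTO dbo.Target_B FROM dbo.Source_A;

-- BULK INSERT from a file, with the table lock required for
-- minimal logging.
BULK INSERT dbo.Target_B
FROM 'C:\data\source_a.dat'
WITH (TABLOCK);
```

Without TABLOCK (or its SSIS "Table lock" equivalent), BULK INSERT falls back to fully logged row inserts even when every other prerequisite is met.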
Are you concerned about the table lock? Minimal logging requires a table lock while inserting data. A table lock blocks other update/insert operations, and READ COMMITTED readers can't read the table either; only dirty reads (NOLOCK) can. So not many developers like the blocking. * SELECT ... INTO takes a schema modification (Sch-M) lock, which does not allow dirty reads either.
Are you concerned about TABLOCK? It is the SSIS package's default setting on the Data Flow destination ("Table lock" is checked).
Prerequisites of minimal logging
Table requirements for minimally logging bulk-import operations:
- The table is not being replicated.
- Table locking is specified (using TABLOCK).
- Database recovery model: bulk-logged or simple. Note: the bulk-logged recovery model is designed to temporarily replace the full recovery model during large bulk operations.
- No indexes (heap): data pages are minimally logged, whether the table is empty or not.
- Nonclustered indexes only: data pages are minimally logged. If the table is empty, index pages are minimally logged as well; if the table is non-empty, index pages are fully logged. Ex) if you start with an empty table and bulk import the data in multiple batches, both index and data pages are minimally logged for the first batch, but beginning with the second batch, only data pages are minimally logged.
- Clustered index, no nonclustered indexes, empty table: both data and index pages are minimally logged. If the table is non-empty, data pages and index pages are both fully logged. Ex) the same batching behavior applies: only the first batch into the empty table is minimally logged.
- Clustered index plus nonclustered indexes: fully logged.
https://technet.microsoft.com/en-us/library/ms190422(v=sql.120).aspx
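The empty-table rule for clustered indexes makes batching a pitfall. A sketch (the BatchID column and table names are illustrative):

```sql
-- Target has a clustered index and starts empty: this first bulk
-- batch is minimally logged (both data and index pages).
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A WHERE BatchID = 1;

-- The table is now non-empty, so per the prerequisites above this
-- second batch no longer gets minimally logged index pages.
INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A WHERE BatchID = 2;
```

This is why methods 4-6 above load the whole table in a single statement rather than in batches.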
http://technet.microsoft.com/en-us/library/dd425070%28v=sql.100%29.aspx

Minimal logging operations (without TF 610):
Indexes           Rows in table   Hints                Logging
Heap              Any             TABLOCK              Minimal
Heap              Any             None                 Full
Heap + Index      Any             TABLOCK              Full
Cluster           Empty           TABLOCK, ORDER (1)   Minimal
Cluster           Empty           None                 Full
Cluster           Any             None                 Full
Cluster           Any             TABLOCK              Full
Cluster + Index   Any             None                 Full
Cluster + Index   Any             TABLOCK              Full

(1) If you are using the INSERT ... SELECT method, the ORDER hint does not have to be specified, but the rows must be in the same order as the clustered index. If using BULK INSERT, the ORDER hint must be used.
(2) Concurrent loads are only possible under certain conditions; see "Bulk Loading with the Indexes in Place". Also, only rows written to newly allocated pages are minimally logged.
(3) Depending on the plan chosen by the optimizer, the nonclustered index on the table may be either fully or minimally logged.
What is a recovery model?
- Full: log backup required; everything is fully logged. Note: starting with SQL 2008, some statements generate reduced logging even in the full recovery model.
- Simple: automatically reclaims log space; no log backup; the log chain is broken, so point-in-time restore is not possible; fully supports minimal logging operations.
- Bulk_Logged: log backup required; fully supports minimal logging operations; the log chain is NOT broken. Note: the log backup size is pretty much the same as under the full recovery model, even for minimal logging operations. Note: unfortunately, a mirrored database can't be changed to bulk_logged.
Bulk-Logged recovery model
- Protects against media failure.
- Normal (fully logged) transactions: same as the full recovery model.
- Minimal logging operations: best performance and least log space usage. The LDF file doesn't contain the full transaction log, so point-in-time restore is NOT available (STOPAT is not allowed for a log backup containing bulk operations). The log backup contains the whole modified pages (extents).
- Please check the URLs below before changing to the bulk-logged recovery model:
https://technet.microsoft.com/en-us/library/ms190692.aspx
https://msdn.microsoft.com/en-us/library/ms179451.aspx
Bulk-Logged recovery model, cont. There is risk, but the performance gain is very big, and it's required only on the target DB. Ex) how to use it: temporarily change to the bulk-logged recovery model for a minimally logged statement in a full-recovery database, run log backups more frequently, and change back after the minimal logging operation.
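The temporary switch described above can be sketched as a script. The database name and backup path are illustrative:

```sql
-- Temporarily switch the target DB to BULK_LOGGED around the load.
ALTER DATABASE TargetDb SET RECOVERY BULK_LOGGED;

INSERT INTO dbo.Target_B WITH (TABLOCK)
SELECT * FROM dbo.Source_A;

-- Switch back, then take a log backup immediately: log backups that
-- contain bulk operations cannot be restored point-in-time, so this
-- shortens the window with reduced restore granularity.
ALTER DATABASE TargetDb SET RECOVERY FULL;
BACKUP LOG TargetDb TO DISK = 'C:\backup\TargetDb_log.trn';
```

Taking a log backup just before the switch as well bounds the exposure on both sides of the bulk operation.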
Introducing Trace Flag 610 (SQL 2008 and above). When a bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged. Please don't enable this TF on production right away: it may add overhead to regular, fully logged transactions. https://simonsql.com/2011/12/05/minimal-logging-operation-with-traceflag-610/
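Enabling the trace flag per session avoids turning it on server-wide while you evaluate the overhead mentioned above. A sketch (table names are illustrative):

```sql
-- Session-scoped: affects only bulk loads on this connection.
DBCC TRACEON (610);

-- With TF 610, an insert into a non-empty clustered-index table
-- can be minimally logged for newly allocated pages.
INSERT INTO dbo.Target_B
SELECT * FROM dbo.Source_A;

DBCC TRACEOFF (610);

-- Server-wide alternative (use with care on production):
-- DBCC TRACEON (610, -1);
```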
Minimal Logging Operations with TF 610:
Indexes           Rows in table   Hints                Without TF 610   With TF 610   Concurrent possible
Heap              Any             TABLOCK              Minimal          Minimal       Yes
Heap              Any             None                 Full             Full          Yes
Heap + Index      Any             TABLOCK              Full             Depends (3)   No
Cluster           Empty           TABLOCK, ORDER (1)   Minimal          Minimal       No
Cluster           Empty           None                 Full             Minimal       Yes (2)
Cluster           Any             None                 Full             Minimal       Yes (2)
Cluster           Any             TABLOCK              Full             Minimal       No
Cluster + Index   Any             None                 Full             Depends (3)   Yes (2)
Cluster + Index   Any             TABLOCK              Full             Depends (3)   No
https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
Q & A Simon Cho Blog : Simonsql.com Simon@simonsql.com