2016 Edition. SQL Server Sampler


2016 Edition
THE EXPERT'S VOICE IN SQL SERVER
SQL Server Sampler
Adam Aspin, Regis Baccaro, Jason Brimhall, Bryan Cafferky, Peter Carter, Miguel Cebollero, Michael Coles, Grant Fritchey, Jonathan Gennick, Clayton Groom, Kathi Kellenberger, Dmitri Korotkevitch, Mike McQuillan, Jay Natarajan, Robert Pearl, Wayne Sheffield, Jason Strate, Enrico van de Laar, and Joost van Rossum

PREFACE

SQL Server Sampler: 2016 Edition

Welcome to our SQL Server 2016 sampler! This book is Apress's gift to the community. It represents one chapter each from all the books we published in calendar year 2015 that relate to SQL Server. We hope that you enjoy this book and find value from it, and that you'll support our authors and their books.

About Apress

Apress Media LLC is a technical publisher devoted to meeting the needs of IT professionals, software developers, and programmers, with more than 1,000 books in print and electronic formats. Apress provides high-quality, no-fluff content that helps serious technology professionals build a comprehensive pathway to career success. The Apress editorial and production teams work hand-in-hand with all authors to ensure that their unique voices come through in each book. Apress is committed to supporting the ever-growing programming community by taking risks on publishing books in niche and nascent technologies. Based in New York City, Apress strives to promote innovation in publishing, boasting a global network of authors, editors, technical reviewers, and sales and marketing teams who work together to provide our readers books and electronic products of the highest quality.

Note Since 2007, Apress has been part of Springer Nature, one of the world's leading scientific, technical, and medical publishing houses, enabling global distribution of Apress publications.

About our Books

We are proud to publish an extensive range of books in print and electronic formats geared toward programmers from all areas, including industry bestsellers Pro C# and the .NET Platform by Andrew Troelsen, CSS Mastery: Advanced Web Standards Solutions by Andy Budd, Beginning Ubuntu Linux: From Novice to Professional by Keir Thomas, and Beginning iPhone Development: Exploring the iPhone SDK by Dave Mark and Jeff LaMarche. Our business and software management line, which features titles by a trove of well-respected authors (including Founders at Work by Jessica Livingston, Coders at Work by Peter Seibel, and multiple books from the inimitable Joel Spolsky), has helped legions of programmers adapt to the business side of technology and become more productive. Apress Beginning books provide a reliable and readable place to start learning something new, while developers have long relied on Apress Pro books to provide the no-fuss, honest approach that is

needed for solving everyday problems and coverage of topics that is broad enough to support their complete career development. At Expert level, our readers appreciate the accuracy and singular voice that our books provide for the highest-intensity topics. Our friends of ED books offer a combination of inspiration and techniques to help both experienced and novice designers and developers who are searching for fresh ideas or guidance.

Coupon Code

We are offering a coupon code, good for 30% off the e-book version of each of the 13 SQL Server titles featured in this book. The following code is good now through December 31, 2016: PASS2016

Thank you again for your enthusiasm, commitment to learning, and passion for all things SQL Server. Enjoy!

PREFACE

Call for Authors

Apress is looking for authors with technical expertise and the ability to explain complicated concepts clearly. We want authors who are passionate, innovative, and original. Do you have something you'd like to say back to the industry? We'd love to hear from you.

Apress doesn't work quite the same as more traditional computer book publishers do. For example, the first question many authors ask a potential publisher is "What are you looking for?" At Apress, we ask authors these questions instead: What are you an expert on? What do you have real-world experience in? What are you passionate about?

In short, we're more interested in your ideas, expertise, and passions than we are in having you fill in some hole in our list. If you want to write about XML, that's possible, but we don't want you to write about XML just because we want you to, nor do we want you to write yet another XML book that's just like the dozens of "Learn XML" books that exist in various forms already. Of course, the Apress editorial board will work with you to help ensure that your book is marketable and ultimately successful, but the first stage of that process has to be for us to hear about your ideas, not for you to hear about ours!

Note If you're interested in writing a book for Apress, the first step is to submit a proposal. Please see our standard proposal form on the next pages; this form gives us the answers to all of the questions we need to evaluate any new book project. To get this document sent as a Word file, please contact:

Jonathan Gennick, Assistant Editorial Director, Databases / Java / Big Data
Susan McDermott, Senior Editor, Security / Big Data / Cloud / Enterprise

PREFACE

Apress New Book Proposal Form

The following information is required for all new book proposals. However, we are more than happy to chat with you if you want to discuss potential ideas/topics before filling out a full proposal.

New Print/Digital Proposal

AUTHOR DATA (for all authors)
1. Author's full name:
2. Mailing address:
3. Phone:
Citizenship:
6. Name to use on cover:

TITLE INFORMATION
Tentative book title:
Subtitle:

SUBJECT MATTER OF YOUR BOOK/PRODUCT
Description: words or so; think of it as back cover copy. Use key words or searchable terms in the first paragraph, as well as benefits (e.g., solves, simplifies, equips, clarifies, or explains, etc.)

and content (supplies, shares, provides, offers, contains, etc.). Roughly 2-3 paragraphs plus three bullet points detailing content coverage:

AUDIENCE
List expressly who the customers are by specific job title and the need this content fills.

WHAT YOU'LL LEARN
Please list 5-6 bullet points that summarize what a reader will be able to do as a result of having read your book.

KEY WORDS
If you were searching for a book on this topic, what key terms would you search for? Assume a potential reader does not know you or your book, but is trying to find content based on a specific subject. List targeted words and short phrases (5-10; more is better):

AUTHOR BIO
Write a brief one-two paragraph biographical statement.

AUTHOR INFORMATION / PLATFORM
Do you currently speak at industry seminars or conferences? If so, on what topic(s) and how frequently do you speak?
Do you or any of your co-authors regularly contribute to any media? Please provide details (title or publication, circulation, number of unique visitors, etc.)
Do you participate in any social media (blogs, Facebook, Twitter, LinkedIn, etc.) that will be leveraged to promote your book? Please provide statistical details (number of followers/friends/members, etc.) and links.

PAGE COUNT
Approximate number of manuscript pages?

MANUSCRIPT DELIVERY INFORMATION (SCHEDULE)
Three dates needed:
Due-date for first draft of first three chapters:
Due-date for first full draft of all chapters:
Publication month desired (in case book is needed for a specific event):

PREFACE

SELLING POINTS
List three of the most salient sales handles for your book (why is this book a need-to-have product?)

INFORMATION ON THE COMPETITION
List any books/products/other sources (web offerings, white papers, etc.) with which your book will compete. Include author, title, publisher, and publication year. How will your book be different and/or better?

TABLE OF CONTENTS
A description of each chapter that includes:
Chapter name
Chapter content/goal
The first level of headings for the chapter

PREFACE

Contents

We asked authors of 2015 titles on SQL Server to select one chapter that best represents their book and that is useful in isolation. What we ended up with is a mix of material that provides a look into the book content as a whole, but that can be used on its own. There are 13 titles featured in this e-book. Featured in order according to publication date are the following:

Pro T-SQL Programmer's Guide by Miguel Cebollero, Michael Coles, and Jay Natarajan
Business Intelligence with SQL Server Reporting Services by Adam Aspin
Expert T-SQL Functions in SQL Server by Kathi Kellenberger and Clayton Groom
Healthy SQL by Robert Pearl
SQL Server T-SQL Recipes by Jason Brimhall, Jonathan Gennick, and Wayne Sheffield
Pro SQL Server Wait Statistics by Enrico van de Laar
Expert SQL Server In-Memory OLTP by Dmitri Korotkevitch
Introducing SQL Server by Mike McQuillan
Extending SSIS with .NET Scripting by Joost van Rossum and Regis Baccaro
Pro PowerShell for Database Developers by Bryan Cafferky
Expert Performance Indexing in SQL Server by Grant Fritchey and Jason Strate
Pro SQL Server Administration by Peter Carter
SQL Server AlwaysOn Revealed by Peter Carter

On the following pages you will see the book opener, its Table of Contents, and the author-selected chapter from each. We hope you enjoy this resource, and look forward to the possibility of working with you in the future.

Pro T-SQL Programmer's Guide
4th Edition
Miguel Cebollero
Jay Natarajan
Michael Coles

10 Pro T-SQL Programmer s Guide Copyright 2015 by Miguel Cebollero, Jay Natarajan, and Michael Coles This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Edgar Lanting Editorial Board: Steve Anglin, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Tiffany Taylor Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary materials referenced by the author in this text is available to readers at at For detailed information about how to locate your book s source code, go to

Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Foundations of T-SQL
Chapter 2: Tools of the Trade
Chapter 3: Procedural Code
Chapter 4: User-Defined Functions
Chapter 5: Stored Procedures
Chapter 6: In-Memory Programming
Chapter 7: Triggers
Chapter 8: Encryption
Chapter 9: Common Table Expressions and Windowing Functions
Chapter 10: Data Types and Advanced Data Types
Chapter 11: Full-Text Search
Chapter 12: XML
Chapter 13: XQuery and XPath
Chapter 14: Catalog Views and Dynamic Management Views
Chapter 15: .NET Client Programming
Chapter 16: CLR Integration Programming

Chapter 17: Data Services
Chapter 18: Error Handling and Dynamic SQL
Chapter 19: Performance Tuning
Appendix A: Exercise Answers
Appendix B: XQuery Data Types
Appendix C: Glossary
Appendix D: SQLCMD Quick Reference
Index

CHAPTER 6

In-Memory Programming

SQL Server 2014 introduces new In-Memory features that are a game-changer in how you consider the data and physical architecture of database solutions. The manner in which data is accessed, the indexes used for in-memory tables, and the methods used for concurrency make this a significant new feature of the database software in SQL Server 2014. In-Memory OLTP is a performance enhancement that allows you to store data in memory using a completely new architecture. In addition to storing data in memory, database objects are compiled into a native DLL in the database. This release of SQL Server has made investments in three different In-Memory technologies: In-Memory OLTP, In-Memory data warehousing (DW), and the SSD Buffer Pool Extension. This chapter covers the In-Memory OLTP programming features; In-Memory DW and the Buffer Pool Extension aren't applicable to the subject matter in this book.

In-Memory solutions provide a significant performance enhancement targeted at OLTP workloads. In-Memory OLTP specifically targets the high concurrency, processing, and retrieval contention typical in OLTP transactional workloads. These are the first versions of such features for SQL Server, and therefore they have numerous limitations, which are discussed in this chapter. Regardless of the limitations, some use cases see as much as a 30x performance improvement. Such performance improvements make In-Memory OLTP compelling for use in your environment. In-Memory OLTP is available in existing SQL Server 2014 installations; no specialized software is required. Additionally, the use of commodity hardware is a benefit of SQL Server's implementation of this feature over other vendors that may require expensive hardware or specialized versions of their software.

The Drivers for In-Memory Technology

Hardware trends, larger datasets, and the speed at which OLTP data needs to become available are all major drivers for the development of in-memory technology. This technology has been in the works for the past several years, as Microsoft has sought to address these technological trends.

Hardware Trends

CPU, memory, disk speeds, and network connections have continually increased in speed and capacity since the invention of computers. However, we're at the point that traditional approaches to making computers run faster are changing due to the economics of the cost of memory versus the speed of CPU processing. In 1965, Gordon E. Moore made the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit doubles approximately every two years.1 Since then, this statement has been known as Moore's Law. Figure 6-1 shows a graph of the increase in the number of transistors on a single circuit.

1 Moore's Law.

Figure 6-1. Moore's Law transistor counts (microprocessor transistor count versus date of introduction; the curve shows the transistor count doubling every two years)

Manufacturers of memory, pixels on a screen, network bandwidth, CPU architecture, and so on have all used Moore's Law as a guide for long-term planning. It's hard to believe, but today, increasing the amount of power to a transistor, for faster CPU clock speed, no longer makes economic sense. As the amount of power being sent to a transistor is increased, the transistor heats up to the point that the physical components begin to melt and malfunction. We've essentially hit a practical limitation on the clock speed for an individual chip, because it isn't possible to effectively control the temperature of a CPU. The best way to continue to increase the power of a CPU with the same clock speed is via additional cores per socket.

In parallel to the limitations of CPUs, the cost of memory has continued to decline significantly over time. It's common for servers and commodity hardware to come equipped with more memory than multimillion-dollar servers had available 20 years ago. Table 6-1 shows the historical price of 1 gigabyte of memory.

Table 6-1. Price of RAM over time2 — historic average cost per gigabyte of RAM by year, falling from over $6,635,000 in 1980 to $9.34 in recent years

2 Average Historic Price of RAM, Statistic Brain.

In order to make effective use of additional cores and the increase in memory available with modern hardware, software has to be written to take advantage of these hardware trends. The SQL Server 2014 In-Memory features are the result of these trends and customer demand for additional capacity on OLTP databases.

Getting Started with In-Memory Objects

SQL Server 2014 In-Memory features are offered in Enterprise, Developer, and Evaluation (64-bit only) Editions of the software. These features were previously available only to corporations that had a very large budget to spend on specialized software and hardware. Given the way Microsoft has deployed these features in existing editions, you may be able to use them in an existing installation of your OLTP database system.

The in-memory objects require a FILESTREAM data file (container) to be created using a memory-optimized data filegroup. From here on, this chapter uses the term container rather than data file; it's more appropriate because a data file is created on disk at the time data is written to the new memory-optimized tables. Several checkpoint files are created in the memory-optimized data filegroup for the purposes of keeping track of changes to data in the FILESTREAM container file. The data for memory-optimized tables is stored in a combination of the transaction log and checkpoint files until a background thread called an offline checkpoint appends the information to data and delta files. In the event of a server crash or availability group failover, all durable table data is recovered from a combination of the data, delta, transaction log, and checkpoint files. All nondurable tables are re-created, because the schema is durable, but the data is lost. The differences between durable and non-durable tables, advantages, disadvantages, and some use cases are explained further in the section Step 3, later in this chapter.

You can alter any existing database or new database to accommodate in-memory data files (containers) by adding the new data and filegroup structures. Several considerations should be taken into account prior to doing so. The following sections cover the steps, both in code and in SQL Server Management Studio, to create these structures.
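Before running the steps that follow, it can be worth a quick check that the instance you are connected to actually supports In-Memory OLTP. The sketch below is not part of the chapter's walkthrough; it simply assumes the IsXTPSupported server property (reported by SQL Server 2014 and later) alongside the standard Edition and ProductVersion properties.

-- Sketch: confirm the edition and In-Memory OLTP support before altering a database.
SELECT SERVERPROPERTY('Edition')        AS Edition,
       SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('IsXTPSupported') AS IsXTPSupported;  -- 1 = memory-optimized objects supported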

Step 1: Add a New Memory-Optimized Data FILEGROUP

Typically, before you can begin using FILESTREAM in SQL Server, you must enable FILESTREAM on the instance of the SQL Server Database Engine. With memory-optimized filegroups, you don't need to enable FILESTREAM, because the mapping to it is handled by the In-Memory OLTP engine. The memory-optimized data filegroup should be created on a solid state drive (SSD) or fast serial attached SCSI (SAS) drive. Memory-optimized tables have different access patterns than traditional disk-based tables and require the faster disk subsystems to fully realize the speed benefit of this filegroup. Listing 6-1 adds a new memory-optimized filegroup to our existing AdventureWorks2014 database. This syntax can be used against any existing 2014 database on the proper SQL Server edition of the software.

Listing 6-1. Adding a New Filegroup

IF NOT EXISTS (SELECT * FROM AdventureWorks2014.sys.data_spaces WHERE TYPE = 'FX')
    ALTER DATABASE AdventureWorks2014
        ADD FILEGROUP [AdventureWorks2014_mem] CONTAINS MEMORY_OPTIMIZED_DATA
GO

This adds an empty memory-optimized data filegroup to which you'll add containers in the next step. The key words in the syntax are CONTAINS MEMORY_OPTIMIZED_DATA, which create the filegroup as a memory-optimized data filegroup. You can create multiple containers but only one memory-optimized data filegroup. Adding additional memory-optimized data filegroups results in the following error:

Msg 10797, Level 15, State 2, Line 2
Only one MEMORY_OPTIMIZED_DATA filegroup is allowed per database.

In Listing 6-1, we added a new memory-optimized filegroup using T-SQL code. In the following example, we will do the same using SQL Server Management Studio. Following are the steps to accomplish adding the filegroup via Management Studio (see Figure 6-2):

1. Right-click the database to which you want to add the new filegroup, and select Properties.
2. Select the Filegroups option, and type in the name of the memory-optimized data filegroup you wish to add.
3. Click the Add Filegroup button.

Figure 6-2. Adding a new memory-optimized data filegroup

Note Memory-optimized data filegroups can only be removed by dropping the database. Therefore, you should carefully consider the decision to move forward with this architecture.

Step 2: Add a New Memory-Optimized Container

In step 2 we will add a new memory-optimized container. Listing 6-2 shows an example of how this is accomplished using T-SQL code. This code can be used against any database that has a memory-optimized filegroup.

Listing 6-2. Adding a New Container to the Database

IF NOT EXISTS ( SELECT *
                FROM AdventureWorks2014.sys.data_spaces ds
                JOIN AdventureWorks2014.sys.database_files df
                    ON ds.data_space_id = df.data_space_id
                WHERE ds.type = 'FX' )
    ALTER DATABASE AdventureWorks2014
        ADD FILE (name='AdventureWorks2014_mem', filename='c:\sqldata\adventureworks2014_mem')
        TO FILEGROUP [AdventureWorks2014_mem]
GO

In Listing 6-2, we added a new memory-optimized container to our database using T-SQL code. In the following steps we will do the same using Management Studio. In order to accomplish this, follow the steps outlined below (see Figure 6-3):

1. Right-click the database to which you want to add the new container, and select Properties.
2. Select the Files option, and type in the name of the file you wish to add.
3. Select FILESTREAM Data from the File Type list, and click the Add button.

Figure 6-3. Adding a new filestream container file to a memory-optimized filegroup

It is a best practice to adjust the Autogrowth / Maxsize of a filegroup; this option is to the right of the "Filegroup" column in Figure 6-4. For a memory-optimized filegroup, you will not be able to adjust this option when creating the filegroup through Management Studio. This filegroup lives in memory; therefore, the previous practice of altering this option no longer applies. Leave the Autogrowth / Maxsize option set to Unlimited. It's a limitation of the current version that you can't specify a MAXSIZE for the specific container you're creating.

You now have a container in the memory-optimized data filegroup that you previously added to the database. Durable tables save their data to disk in the containers you just defined; therefore, it's recommended that you create multiple containers across multiple disks, if they're available to you. SSDs won't necessarily help performance, because data is accessed in a sequential manner and not in a random-access pattern. The only requirement is that you have performant disks so the data can be accessed efficiently from disk. Multiple disks allow SQL Server to recover data in parallel in the event of a system crash or availability group failover. Your in-memory tables won't become available until SQL Server has recovered the data into memory.
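Acting on that recommendation is simply a repeat of Listing 6-2 against the same filegroup. The sketch below assumes a second, hypothetical D: drive; adjust the logical name and path to whatever disks your server actually has.

ALTER DATABASE AdventureWorks2014
    ADD FILE (name='AdventureWorks2014_mem2',                  -- hypothetical second container
              filename='D:\sqldata\adventureworks2014_mem2')
    TO FILEGROUP [AdventureWorks2014_mem]
GO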

Note Data and delta file pairs can't be moved to other containers in the memory-optimized filegroup.

Step 3: Create Your New Memory-Optimized Table

Steps 1 and 2 laid out the foundation necessary to add memory-optimized objects. Listing 6-3 creates a table that resides in memory. The result is a compiled table with data that resides in memory.

Listing 6-3. Creating a New Memory-Optimized Table

USE AdventureWorks2014;
GO
CREATE SCHEMA [MOD] AUTHORIZATION [dbo];
GO
CREATE TABLE [MOD].[Address]
(
    AddressID INT NOT NULL IDENTITY(1,1),
    AddressLine1 NVARCHAR(120) COLLATE Latin1_General_100_BIN2 NOT NULL,
    AddressLine2 NVARCHAR(120) NULL,
    City NVARCHAR(60) COLLATE Latin1_General_100_BIN2 NOT NULL,
    StateProvinceID INT NOT NULL,
    PostalCode NVARCHAR(30) COLLATE Latin1_General_100_BIN2 NOT NULL,
    rowguid UNIQUEIDENTIFIER NOT NULL
        INDEX [AK_MODAddress_rowguid] NONCLUSTERED
        CONSTRAINT [DF_MODAddress_rowguid] DEFAULT (NEWID()),
    ModifiedDate DATETIME NOT NULL
        INDEX [IX_MODAddress_ModifiedDate] NONCLUSTERED
        CONSTRAINT [DF_MODAddress_ModifiedDate] DEFAULT (GETDATE()),
    INDEX [IX_MODAddress_AddressLine1_City_StateProvinceID_PostalCode] NONCLUSTERED
        ( [AddressLine1] ASC, [StateProvinceID] ASC, [PostalCode] ASC ),
    INDEX [IX_MODAddress_City] ( [City] DESC ),
    INDEX [IX_MODAddress_StateProvinceID] NONCLUSTERED ( [StateProvinceID] ASC ),
    CONSTRAINT PK_MODAddress_Address_ID PRIMARY KEY NONCLUSTERED HASH
        ( [AddressID] ) WITH (BUCKET_COUNT=30000)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA);
GO

Note You don't need to specify a filegroup when you create an in-memory table. You're limited to a single memory-optimized filegroup; therefore, SQL Server knows the filegroup to which to add this table.

The sample table used for this memory-optimized table example is similar to the AdventureWorks2014 Person.Address table, with several differences:

The hint at the end of the CREATE TABLE statement is extremely important: WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA). The option MEMORY_OPTIMIZED=ON tells SQL Server that this is a memory-optimized table. The DURABILITY=SCHEMA_AND_DATA option defines whether this table will be durable (data recoverable) or non-durable (schema-only recovery) after a server restart. If the durability option isn't specified, it defaults to SCHEMA_AND_DATA.

PRIMARY KEY is NONCLUSTERED, because data isn't physically sorted: CONSTRAINT PK_MODAddress_Address_ID PRIMARY KEY NONCLUSTERED HASH ( [AddressID] ) WITH (BUCKET_COUNT=30000). The NONCLUSTERED hint is required on a PRIMARY KEY constraint, because SQL Server attempts to create it as a CLUSTERED index by default. Because CLUSTERED indexes aren't allowed, not specifying the index type results in an error. Additionally, you can't add a sort hint on the column being used in this index, because HASH indexes can't be defined in a specific sort order.

All character string data that is used in an index must use BIN2 collation: COLLATE Latin1_General_100_BIN2. Notice that the MOD.Address table purposely doesn't declare BIN2 collation for the AddressLine2 column, because it isn't used in an index. Figure 6-8 shows the effect that BIN2 collation has on data in different collation types.

If you compare the MOD.Address table to Person.Address, you see that the column SpatialLocation is missing. In-memory tables don't support LOB objects. The SpatialLocation column in Person.Address is defined as a GEOGRAPHY data type, which isn't supported for memory-optimized tables. If you were converting this data type to be used in a memory-optimized table, you would potentially need to make coding changes to accommodate the lack of the data type.

The index type HASH with the hint WITH (BUCKET_COUNT=30000) is new. This is discussed further in the In-Memory OLTP Table Indexes section of this chapter.

Listing 6-3 added a memory-optimized table using T-SQL. We will now add a memory-optimized table using Management Studio. Right-click the Tables folder, and select New Memory-Optimized Table (see Figure 6-4). A new query window opens with the In-Memory Table Creation template script available.
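For contrast with the SCHEMA_AND_DATA table above, the following is a minimal sketch of a non-durable table. The table name and columns are invented for illustration; the point is the DURABILITY=SCHEMA_ONLY option, which keeps the schema across a restart but discards the rows, a trade-off that is often acceptable for staging or session-cache data.

-- Sketch: a hypothetical non-durable (schema-only) memory-optimized table.
CREATE TABLE [MOD].[SessionCache]
(
    SessionID INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=100000),
    UserName NVARCHAR(60) COLLATE Latin1_General_100_BIN2 NOT NULL,
    LastTouched DATETIME NOT NULL
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY);  -- rows are lost on restart; the schema survives
GO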

Figure 6-4. Creating a new memory-optimized table

You now have a very basic working database and table and can begin using the In-Memory features. You can access the table and different index properties using a system view (see Listing 6-4 and Figure 6-5) or Management Studio (see Figure 6-6).

Listing 6-4. Selecting Table Properties from a System View

SELECT t.name AS 'Table Name',
       t.object_id,
       t.schema_id,
       filestream_data_space_id,
       is_memory_optimized,
       durability,
       durability_desc
FROM sys.tables t
WHERE type = 'u'
  AND t.schema_id = SCHEMA_ID(N'MOD');
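Listing 6-4 surfaces the table-level properties. A companion query, sketched below, does the same for the indexes; it assumes the sys.hash_indexes catalog view, which extends sys.indexes with the declared bucket count of each hash index.

-- Sketch: list the indexes on the memory-optimized tables in the MOD schema,
-- including the bucket count of any hash indexes.
SELECT t.name AS 'Table Name',
       i.name AS 'Index Name',
       i.type_desc,
       hi.bucket_count
FROM sys.tables t
JOIN sys.indexes i
    ON i.object_id = t.object_id
LEFT JOIN sys.hash_indexes hi
    ON hi.object_id = i.object_id
   AND hi.index_id = i.index_id
WHERE t.is_memory_optimized = 1
  AND t.schema_id = SCHEMA_ID(N'MOD');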

Figure 6-5. System view showing MOD.Address table properties

Figure 6-6. Management Studio showing MOD.Address table properties

Now that you've configured your database and created a new table, let's look at an example of the data in this table and some specific issues you may encounter. First you must load the data into the newly created memory-optimized table [MOD].[Address], as shown in Listing 6-5.

Listing 6-5. Inserting Data into the Newly Created Table

SET IDENTITY_INSERT [MOD].[Address] ON;

INSERT INTO [MOD].[Address]
    ( AddressID, AddressLine1, AddressLine2,
      City, StateProvinceID, PostalCode
      --, SpatialLocation
      , rowguid, ModifiedDate )
SELECT AddressID, AddressLine1, AddressLine2,
       City, StateProvinceID, PostalCode
       --, SpatialLocation
       , rowguid, ModifiedDate
FROM [Person].[Address];

SET IDENTITY_INSERT [MOD].[Address] OFF;

UPDATE STATISTICS [MOD].[Address] WITH FULLSCAN, NORECOMPUTE;
GO

Note In-memory tables don't support statistics auto-updates. In Listing 6-5, you manually update the statistics after inserting new data.

Because AddressLine1 is being used in an index on the table, you have to declare the column with a BIN2 collation. The limitation with this collation is that all uppercase AddressLine1 values are sorted before lowercase string values ('Z' sorts before 'a'). In addition, string comparisons of BIN2 columns don't give the results you might expect: a lowercase value doesn't equal an uppercase value when selecting data ('A' != 'a'). Listing 6-6 gives an example query of the string-comparison scenario.

Listing 6-6. Selecting Data from the AddressLine1 Column

SELECT AddressID, AddressLine1, RowGuid
FROM [MOD].[Address]
WHERE AddressID IN (804, 831)
  AND AddressLine1 LIKE '%plaza'

This query correctly results in only one record. However, with disk-based tables you would expect two records to be returned. Pay careful attention in this area when you're considering moving your disk-based tables to memory-optimized tables. Figure 6-7 displays the result of the query.

Figure 6-7. AddressLine1 results with no collation

When the collation for the column is altered with a hint (Listing 6-7), the query correctly returns two records (Figure 6-8).

Listing 6-7. Selecting Data from the AddressLine1 Column with Collation

SELECT AddressID, AddressLine1, RowGuid
FROM [MOD].[Address]
WHERE AddressID IN (804, 831)
  AND AddressLine1 COLLATE SQL_Latin1_General_CP1_CI_AS LIKE '%plaza';

Figure 6-8. AddressLine1 results with collation

In order to ensure proper results and behavior, you need to specify the collation explicitly for any string-type columns that use BIN2 collation when performing comparison and sort operations.

Limitations on Memory-Optimized Tables

When you create a table, you need to take several limitations into account. Following are some of the more common restrictions that you may encounter:

None of the LOB data types can be used to declare a column (XML, CLR, spatial data types, or any of the MAX data types).
All the row lengths in a table are limited to 8,060 bytes. This limit is enforced at the time the table is initially created. Disk-based tables allow you to create tables that could potentially exceed 8,060 bytes per row.
All in-memory tables must have at least one index defined. No heap tables are allowed.
No DDL/DML triggers are allowed.
No schema changes are allowed (ALTER TABLE). To change the schema of the table, you would need to drop and re-create the table.
Partitioning or compressing a memory-optimized table isn't allowed.
When you use an IDENTITY column property, it must be initialized to start at 1 and increment by 1.
If you're creating a durable table, you must define a primary key constraint.

Note For a comprehensive and up-to-date list of limitations, visit Microsoft's online documentation.

In-Memory OLTP Table Indexes

Indexes are used to more efficiently access data stored in tables. Both in-memory tables and disk-based tables benefit from indexes; however, In-Memory OLTP table indexes have some significant differences from their disk-based counterparts. Two types of indexes differ from those of disk-based tables: nonclustered hash and nonclustered range indexes. These indexes are both contained in memory and are optimized for memory-optimized tables. The differences between in-memory and disk-based table indexes are outlined in Table 6-2.

Table 6-2. Comparison of in-memory and disk-based indexes

In-memory table: Must have at least one index. Disk-based table: No indexes required.
In-memory table: Clustered index not allowed; only hash or range nonclustered indexes allowed. Disk-based table: Clustered index usually recommended.
In-memory table: Indexes added only at table creation. Disk-based table: Indexes can be added to the table after table creation.
In-memory table: No auto update statistics. Disk-based table: Auto update statistics allowed.
In-memory table: Indexes only exist in memory. Disk-based table: Indexes persist on disk and in the transaction log.
In-memory table: Indexes are created during table creation or database startup. Disk-based table: Indexes are persisted to disk; therefore, they are not rebuilt and can be read from disk.
In-memory table: Indexes are covering, since the index contains a memory pointer to the actual row of the data. Disk-based table: Indexes are not covering by default.
In-memory table: There is a limitation of 8 indexes per table. Disk-based table: 1 clustered index + 999 nonclustered = 1,000 indexes, or 249 XML indexes.

Note Durable memory-optimized tables require a primary key. By default, a primary key attempts to create a clustered index, which will generate an error for a memory-optimized table. You must specifically indicate NONCLUSTERED as the index type.

The need for at least one index stems from the architecture of an in-memory table. The table uses index pointers as the only method of linking rows in memory into a table. This is also why clustered indexes aren't needed on memory-optimized tables; the data isn't specifically ordered or arranged in any manner. A new feature of SQL Server 2014 is that you can create indexes inline with the table create statement. Earlier, notice that Listing 6-3 creates an inline nonclustered index with the table create:

, rowguid UNIQUEIDENTIFIER NOT NULL
    INDEX [AK_MODAddress_rowguid] NONCLUSTERED
    CONSTRAINT [DF_MODAddress_rowguid] DEFAULT (NEWID())

Inline index creation is new to SQL Server 2014 but not unique to memory-optimized tables. It's also valid for disk-based tables. Both hash and range indexes are allowed on the same column. This can be a good strategy when the use cases vary for how the data is accessed.

Hash Indexes

A hash index is an efficient mechanism that accepts input values into a hashing function and maps to a hash bucket. The hash bucket is an array that contains pointers to efficiently return a row of data. The collection of pointers in the hash bucket is the hash index. When created, this index exists entirely in memory.
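Once a hash index exists and the table has been loaded, you can get a feel for how well its buckets are being used. The sketch below assumes the sys.dm_db_xtp_hash_index_stats DMV, which reports bucket counts and chain lengths; long average chains or very few empty buckets are a sign that the bucket count was set too low.

-- Sketch: bucket usage and chain lengths for each hash index in the current database.
SELECT OBJECT_NAME(hs.object_id) AS 'Table Name',
       i.name                    AS 'Index Name',
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats hs
JOIN sys.indexes i
    ON i.object_id = hs.object_id
   AND i.index_id = hs.index_id;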

Hash indexes are best used for single-item lookups, a WHERE clause with an =, or equality joins. They can't be used for range lookups such as LIKE operations or BETWEEN queries. The optimizer won't give you an error, but it isn't an efficient way of accessing the data. When creating the hash index, you must decide at table-creation time how many buckets to assign for the index. It's recommended that the bucket count be 1.5 to 2 times larger than the number of existing unique key values in your table. This is an important assessment, because the bucket count can't be changed without re-creating the index and the table. The performance of point lookups doesn't degrade if you have a bucket count that is larger than necessary. However, performance will suffer if the bucket count is too small. Listing 6-3 used a hash bucket count of 30,000, because the number of unique rows in the table is slightly less than 20,000. Here's the code that defines the constraint with the bucket count:

, CONSTRAINT PK_MODAddress_Address_ID PRIMARY KEY NONCLUSTERED HASH
    ( [AddressID] ) WITH (BUCKET_COUNT=30000)

If your use case requires it, you can create a composite hash index. There are some limitations to be aware of if you decide to use a composite index. The hash index will be used only if the point-lookup search is done on both columns in the index. If both columns aren't used in the search, the result is an index scan or a scan of all the hash buckets. This occurs because the hash function converts the values from both columns into a single hash value. Therefore, in a composite hash index, the hash value of one column never equates to the hash value of two columns:

HASH(<Column1>) <> HASH(<Column1>, <Column2>)

Let's compare the effect of a hash index on a memory-optimized table versus a disk-based table clustered index.

Warning This applies to the code in Listing 6-8 and several other examples. Do not attempt to run the DBCC commands on a production system, because they can severely affect the performance of your entire instance.

Listing 6-8 includes some DBCC commands to flush all cache pages and make sure the comparisons start in a repeatable state with nothing in memory. It's highly recommended that these types of commands be run only in a non-production environment that won't affect anyone else on the instance.

Listing 6-8. Point Lookup on a Hash Index vs. Disk-Based Clustered Index

CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON;

SELECT * FROM Person.Address WHERE AddressId = 26007;
SELECT * FROM MOD.Address WHERE AddressId = 26007;

This first example simply compares performance when doing a simple point lookup for a specific value. The disk-based table (Person.Address) and the memory-optimized table (MOD.Address) have a clustered index and a hash index, respectively, on the AddressID column. The result of running the entire batch is as shown in Figure 6-9, in the Messages tab.

Figure 6-9. Hash index vs. clustered index IO statistics

There are two pieces of information worth noting. The first batch to run was against the disk-based table, which resulted in two logical reads and two physical reads. The second batch was against the memory-optimized table, which didn't register any logical or physical IO reads, because this table's data and indexes are completely held in memory. Figure 6-10 clearly shows that the disk-based table took 99% of the entire batch execution time; the memory-optimized table took 1% of the time relative to the entire batch. Both query plans are exactly the same; however, this illustrates the significant difference that a memory-optimized table can make to the simplest of queries.

Figure 6-10. Hash index vs. clustered index point lookup execution plan

Hovering over the Index Seek operator in the execution plan shows a couple of differences. The first is that the Storage category now differentiates between the disk-based table as RowStore and the memory-optimized table as MemoryOptimized. There is also a significant difference between the estimated row sizes of the two tables.

Next let's experiment with running a range lookup against the disk-based table and the memory-optimized table. Listing 6-9 does a simple range lookup against the primary key of the table to demonstrate some of the difference in performance (see Figure 6-11).

Listing 6-9. Range Lookup Using a Hash Index

SELECT * FROM Person.Address WHERE AddressID BETWEEN 100 AND 26007;
SELECT * FROM MOD.Address WHERE AddressID BETWEEN 100 AND 26007;

Figure 6-11. Hash index vs. clustered index range lookup execution plan

This example clearly displays that a memory-optimized table hash index isn't necessarily quicker than a disk-based clustered index for all use cases. The memory-optimized table had to perform an index scan and then filter the results for the specific criteria you're looking to get back. The disk-based table clustered index seek is still more efficient for this particular use case. The moral of the story is that it always depends. You should always run through several use cases and determine the best method of accessing your data.

Range Indexes

A range index might best be defined as a memory-optimized nonclustered index. When created, this index exists entirely in memory. The memory-optimized nonclustered index works similarly to a disk-based nonclustered index, but it has some significant architectural differences. The architecture for range indexes is based on a new data structure called a Bw-tree.3 The Bw-tree architecture is a latch-free architecture that can take advantage of modern processor caches and multicore chips.

Memory-optimized nonclustered indexes are best used for range-type queries (<, >, IN, all sales orders between two dates, and so on). These indexes also work with point lookups but aren't as optimized for those types of lookups as a hash index. Memory-optimized nonclustered indexes should also be considered over hash indexes when you're migrating a disk-based table that has a considerable number of duplicate values in a column. The size of the index grows with the size of the data, similar to B-tree disk-based table structures.

3 Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta, "The Bw-Tree: A B-tree for New Hardware Platforms," Microsoft Research, April 8, 2013.

When you're using memory-optimized nonclustered indexes, a handful of limitations and differences from disk-based nonclustered indexes are worth mentioning. Listing 6-3 created the nonclustered index on the City column. Below is an excerpt from the listing that displays the creation of the nonclustered index.

INDEX [IX_MODAddress_City] ( [City] DESC )

All the columns that are part of an index must be defined as NOT NULL. If the column is defined as a string data type, it must be defined using a BIN2 collation. The NONCLUSTERED hint is optional unless the column is the primary key for the table, because SQL Server will try to define a primary key constraint as clustered. The sort-order hint on a column in a range index is especially important for a memory-optimized table. SQL Server can't perform a seek on the index if the order in which the records are accessed is different from the order in which the index was originally defined, which would result in an index scan.

Following are a couple of examples that demonstrate the comparison of a disk-based nonclustered index and a memory-optimized nonclustered index (range index). The two queries in Listing 6-10 select all columns from the disk-based Address table and the memory-optimized table using a single-point lookup of the date. The result of the queries is displayed in Figure 6-12.

Listing 6-10. Single-Point Lookup Using a Range Index

CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON

SELECT * FROM [Person].[Address] WHERE ModifiedDate = ' ';
SELECT * FROM [MOD].[Address] WHERE ModifiedDate = ' ';

Figure 6-12. Single-point lookup using a nonclustered index comparison

This example displays a significant difference between Query 1 (disk-based table) and Query 2 (memory-optimized index). Both queries use an index seek to get to the row in the table, but the disk-based table has to do an additional key-lookup operation on the clustered index. Because the query is asking for all the columns of data in the row, the disk-based nonclustered index must obtain the pointer to the data through the clustered index. The memory-optimized index doesn't have the added cost of the key lookup, because all the indexes are covering and, therefore, the index already has a pointer to the additional columns of data.

Next, Listing 6-11 does a range lookup on the disk-based table's nonclustered index and a range lookup on the memory-optimized nonclustered index. The difference between the two queries is displayed in Figure 6-13.

Listing 6-11. Range Lookup Using a Range Index

CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON

SELECT * FROM [Person].[Address] WHERE ModifiedDate BETWEEN ' ' AND ' ';
SELECT * FROM [MOD].[Address] WHERE ModifiedDate BETWEEN ' ' AND ' ';

Figure 6-13. Range Lookup Comparison

The results are as expected. The memory-optimized nonclustered index performs significantly better than the disk-based nonclustered index when performing a range query using a range index.

Natively Compiled Stored Procedures

Natively compiled stored procedures are similar in purpose to disk-based stored procedures, with the major difference that a natively compiled stored procedure is compiled into C and then into machine language stored as a DLL. The DLL allows SQL Server to access the stored-procedure code more quickly, to take advantage of parallel processing, and to deliver significant improvements in execution. There are several limitations, but if used correctly, natively compiled stored procedures can yield a 2x or more increase in performance. To get started, let's examine the outline of a natively compiled stored procedure in Listing 6-12 in detail.

Listing 6-12. Natively Compiled Stored Procedure Example

1  CREATE PROCEDURE seladdressmodifieddate
2  ( @StartDate DATETIME
3  , @EndDate DATETIME )
4  WITH
5    NATIVE_COMPILATION
6  , SCHEMABINDING
7  , EXECUTE AS OWNER
8  AS

9  BEGIN ATOMIC
10 WITH
11 ( TRANSACTION ISOLATION LEVEL = SNAPSHOT
12 , LANGUAGE = N'us_english')
13
14 -- T-SQL logic here
15 SELECT AddressID, AddressLine1
16 , AddressLine2, City
17 , StateProvinceID, PostalCode
18 , rowguid, ModifiedDate
19 FROM [MOD].[Address]
20 WHERE ModifiedDate BETWEEN @StartDate AND @EndDate
21 END;

The requirements to create a natively compiled stored procedure are as follows:

Line 5, NATIVE_COMPILATION: This option tells SQL Server that the procedure is to be compiled into a DLL. If you add this option, you must also specify the SCHEMABINDING, EXECUTE AS, and BEGIN ATOMIC options.

Line 6, SCHEMABINDING: This option binds the stored procedure to the schema of the objects it references. At the time the stored procedure is compiled, the schema of the objects it references is compiled into the DLL. When the procedure is executed, it doesn't have to check to see whether the columns of the objects it references have been altered. This offers the fastest and shortest method of executing a stored procedure. If any of the underlying objects it references are altered, you're forced to first drop and then recompile the stored procedure with any changes to the underlying objects it references.

Line 7, EXECUTE AS OWNER: The default execution context for a stored procedure is EXECUTE AS CALLER. Natively compiled stored procedures don't support the caller context, and the execution context must be specified as one of the options EXECUTE AS OWNER, SELF, or USER. This is required so that SQL Server doesn't have to check execution rights for the user every time they attempt to execute the stored procedure. The execution rights are hardcoded and compiled into the DLL to optimize the speed of execution.

Line 9, BEGIN ATOMIC: Natively compiled stored procedures have the requirement that the body must consist of exactly one atomic block. The atomic block is part of the ANSI SQL standard and specifies that either the entire stored procedure succeeds or the entire stored procedure logic fails and rolls back as a whole. At the time the stored procedure is called, if an existing transaction is open, the stored procedure joins the transaction and commits. If no transaction is open, then the stored procedure creates its own transaction and commits.

Lines 11 and 12, TRANSACTION ISOLATION and LANGUAGE: All the session settings are fixed at the time the stored procedure is created. This is done to optimize the stored procedure's performance at execution time.
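Once the procedure has been created, a quick way to confirm which modules in the database actually ended up natively compiled is to look at the uses_native_compilation flag in sys.sql_modules; the sketch below simply lists them.

-- Sketch: list the natively compiled modules in the current database.
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS 'Schema Name',
       OBJECT_NAME(m.object_id)        AS 'Module Name',
       m.uses_native_compilation
FROM sys.sql_modules m
WHERE m.uses_native_compilation = 1;   -- 1 = compiled to a native DLL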

Those are the main options in a natively compiled stored procedure that are unique to its syntax, versus a disk-based stored procedure. There are a significant number of limitations when creating a natively compiled stored procedure. Some of the more common limitations are listed next:

Objects must be called using two-part names (schema.table).
Temporary tables from tempdb can't be used and should be replaced with table variables or nondurable memory-optimized tables.
A natively compiled stored procedure can't be accessed from a distributed transaction.
The stored procedure can't access disk-based tables, only memory-optimized tables.
The stored procedure can't use any of the ranking functions.
DISTINCT in a query isn't supported.
The EXISTS and IN operators are not supported.
Common table expressions (CTEs) are not supported constructs.
Subqueries aren't available.

Note For a comprehensive list of limitations, visit Microsoft's online documentation.

Execution plans for queries in the procedure are optimized when the procedure is compiled. This happens only when the procedure is created and when the server restarts, not when statistics are updated. Therefore, the tables need to contain a representative set of data, and statistics need to be up-to-date before the procedures are created. (Natively compiled stored procedures are recompiled if the database is taken offline and brought back online.)

EXERCISES

1. Which editions of SQL Server support the new In-Memory features?
a. Developer Edition
b. Enterprise Edition
c. Business Intelligence Edition
d. All of the above

2. When defining a string type column in an in-memory table, you must always use a BIN2 collation. [True / False]

3. You want to define the best index type for a date column in your table. Which index type might be best suited for this column, if it is being used for reporting purposes using a range of values?
a. Hash index
b. Clustered index
c. Range index
d. A and B

4. When creating a memory-optimized table, if you do not specify the durability option for the table, it will default to SCHEMA_AND_DATA. [True / False]

5. Memory-optimized tables always require a primary key constraint. [True / False]

6. Natively compiled stored procedures allow for which of the following execution contexts?
a. EXECUTE AS OWNER
b. EXECUTE AS SELF
c. EXECUTE AS USER
d. A and B
e. A, B, and C

Business Intelligence with SQL Server Reporting Services
Adam Aspin

38 Business Intelligence with SQL Server Reporting Services Copyright 2015 by Adam Aspin This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewers: Rodney Landrum and Ian Rice Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Mary Behr Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

Contents at a Glance

About the Author
About the Technical Reviewers
Acknowledgments
Introduction
Chapter 1: SQL Server Reporting Services as a Business Intelligence Platform
Chapter 2: KPIs and Scorecards
Chapter 3: Gauges for Business Intelligence
Chapter 4: Charts for Business Intelligence
Chapter 5: Maps in Business Intelligence
Chapter 6: Images in Business Intelligence
Chapter 7: Assembling Dashboards and Presentations
Chapter 8: Interface Enhancements for Business Intelligence Delivery
Chapter 9: Interface Enhancements
Chapter 10: BI for SSRS on Tablets and Smartphones
Chapter 11: Standardizing BI Report Suites
Chapter 12: Optimizing SSRS for Business Intelligence
Appendix A: Sample Data
Index

40 CHAPTER 10 BI for SSRS on Tablets and Smartphones Give me Mobile BI is the cry that has been coming from the executive suite for some time now. Admittedly, until SQL Server 2012 SP1 was released, all a Microsoft Reporting Services specialist could do in answer to this request was to shuffle their feet and look sheepish while they tried to implement a third-party add-on. Now, however, the landscape has changed. Thanks to SQL Server 2012 SP1 (and naturally SQL Server 2014) you can output reports to a host of mobile devices, including ipads and iphones as well as Android and Windows phones and tablets. Moreover, nothing has fundamentally changed as far as SSRS is concerned. You still develop reports as you did before. All you have to do is ensure that your reports are designed for the output device you will be using. In the case of both tablets and smartphones, this means the following: Design your output as a function of the size of the phone or tablet s screen. Take account of the height-to-width aspect ratio of the output device s screen. Tweak your reports to be used with the device held either vertically or horizontally for optimum viewing. Do not force the same report to appear on a multitude of output devices. Be prepared to start by building widgets that display the data, and then reuse the widgets in possibly several different reports, where each report is tailored to the size and aspect (height to width) ratio of each device. Attempt to use shared datasets so that your initial effort can be reused more easily. So you are likely to be looking at a minimum of redesigning, and possibly rewriting, a good few reports. However, this could be the case with any suite of reports that have to be reworked for mobile output, whatever the tool used to create them. When all is said and done, the constraints come from the output device. Good mobile reports are the ones that have been designed with the specific mobile device in mind. In this chapter, you will build on some of the visualizations that you saw in previous chapters. More specifically, you will look at further techniques that you can use to build gauges. This will extend the knowledge that you acquired in Chapter

41 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Designing Mobile Reports There is definitely an art to designing tablet and smartphone reports. In any case, there is no one way of doing things, and most people will disagree on the best approach anyway. However, there are probably a few core guidelines that you need to take into account when designing reports for tablets and smartphones. These include the following: Don't overload the screen. It is too easy to think the user only has to zoom in. The result is that you create a report that is hard to read. The user will probably want to view their data immediately and clearly, without having to use pinch movements to read the data, be it a chart, gauge, or table. Hide the Report Toolbar and mask parameters using a custom interface. Users of mobile devices are accustomed to slick apps with state of the art interfaces. While you can never get to the highest levels of swish user interfaces using Reporting Services, you can at least make the interaction smoother. Develop a clear interface hierarchy. You will inevitably need a couple of reports on a tablet, and possibly several on a smartphone, to do the work of one report designed for a large laptop screen. So accept this, and be prepared to break up existing reports into separate reports, and to drill through from report to report. Learn about firewalls. You will have to take corporate firewalls into account when preparing to deploy mobile BI using SSRS. So it is a worthwhile investment to make friends with the IT people who deal with this, or learn about it if the buck stops with you. Delivering Mobile Reports At the start of this chapter I mentioned that mobile BI has only become practical since SQL Server 2012 SP1. The other main point to remember is that if you are intending to view reports on mobile devices, you can only access them using the Web Service Report Viewer. You cannot use the Report Manager to view or browse reports. Indeed, if you try to use the Report Manager all you will see is a discouragingly blank screen. Using Report Viewer is not a handicap in any way when it comes to displaying BI reports. However, report navigation may well require some tweaking. There are a couple of main reasons for this: Do you really want to show your users a report navigation screen that looks like it stems from the 1990s, and one that has never even heard of interactive interfaces? Your users will swipe to another app in microseconds! The URL that returns a report will be almost impossible to memorize. An example of the first of these two limitations is shown in Figure

42 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure The standard report web services interface As to the URLs, well, you will find one or two of those further on in this chapter. So be prepared to wrap your mobile BI reports in a navigation hierarchy, as described at the end of this chapter. It can require a little extra effort, but it is a major step on the road to ensuring that business users buy into your mobile BI strategy. Tablet Reports I am prepared to bet that tablets are the most frequently-used output platform for mobile business intelligence. They combine portability, ease of use, and a practical screen size, and can become an extremely efficient medium for BI using SSRS. This is not to say, however, that the transition from laptop or PC to tablet will always be instantaneous. You will almost certainly need to adapt reports to tablet display for at least some of the following of reasons: The screen is smaller than most laptops, and despite often extremely high resolution, cannot physically show all that a desktop monitor is capable of displaying. Even if you can fit all of a report that was designed for a large desktop screen onto a tablet, users soon get tired of zooming in and out to make the data readable. Interactivity will require the application of many of the revamping techniques introduced in Chapter 8. Let s now look at one of the principal techniques that you can use when creating business intelligence reports for tablets. Multi-Page Reports A classic way to make large report easier to use is to separate different elements on separate pages. Obviously, in Reporting Services, pages are a display, not a design concept. Yet you can use paging effectively when creating or adapting reports for tablet devices. As an example, think back to the dashboard Dashboard_Monthly that you assembled in Chapter 7. A dashboard like this that was designed for a large high-resolution monitor would be unreadable if it were squeezed onto a 9-inch tablet screen. Yet if you separate out its component elements and adjust the layout a little, you end up with a presentation that is both appealing and easy to read, as shown in Figure

43 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure A dashboard broken down into separate screens for mobile BI This is still a single SSRS report, but it uses page breaks to separate the elements into three parts. Then you add buttons that jump to bookmarks inside the report to make flipping from page to page easier. To make the point, Figure 10-3 shows the design view of the report. As you can see, the three pages are nothing more than a vertical report layout. 296

44 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure Design view of a tablet report designed for multi-page display 297

45 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES The Source Data Assuming that you have already built the dashboard Dashboard_Monthly.rdl in Chapter 7, there is no data needed; you have it already. This is to underline the point that one positive aspect of SSRS BI is the potential reusability of the various objects that you create. Building the Report Let s see how to adapt the source dashboard so that it is perfectly adapted to dashboard delivery. 1. Make a copy of the report named Dashboard_Monthly.rdl. Name the copy Tablet_Dashboard_Monthly.rdl. 2. Add the following datasets to the copied report to prepare for the addition of the date selector: a. YearSelector, using the shared dataset DateWidgetYear b. ColorScheme, using the shared dataset DateWidgetColorScheme c. Dummy, using the SQL query SELECT 3. Add the three following images from the directory C:\BIWithSSRS\Images: a. SmallGrayButton_Overview b. SmallGrayButton_Make c. SmallGrayButton_Country 4. Select everything in the report and move all elements down a good 6-10 inches. This will give you some room to tweak the existing objects. 5. Open the report DateSelector.rdl that you created in Chapter 9, and copy the contents into the report Tablet_Dashboard_Monthly.rdl. Place the date selector elements at the top left of the report. 6. Cut and paste the three pyramid charts from the table that currently contains them. This way you will have three independent charts. Place them vertically on the top left of the report under the date selector. Make them slightly smaller, and align and space them until they look something like the leftmost part of the first page in Figure Make four copies of the table containing the figures that were originally at the top of the dashboard. Delete all the columns but one, leaving a different metric each time, until you have five separate tables, each containing one of the sales metrics from the table. Place them vertically on the top right of the report under the date selector. Align and space them until they look something like the rightmost part of the first page in Figure Leaving about an inch of clear space, place the gauges for the sales by country and the table of color sales under each other and under the elements that you rearranged in steps 6 and 7. They should look like the center page in Figure

46 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 9. Drag both remaining visualizations out of the rectangle that contains them, and then delete the container rectangle. Again, leaving about an inch of clear space, place the two elements under each other and under the elements that you rearranged in step 8. They should look like the third page in Figure Add two image elements at the top right of the report just above the month list. Set the left-hand one to use the image SmallGrayButton_Country and the right-hand one to use the image SmallGrayButton_Make. 11. Set the images to a size of around 1 inch by 0.2 inches. 12. Copy the images twice. Place one copy above and at the right of the second group of elements (gauges for the sales by country and the table of color sales), and one copy above and at the right of the third group of elements (gauges for the sales by make and the chart of key ratios). 13. For the second set of images, set the left-hand one to use the image SmallGrayButton_Overview and the right-hand one to use the image SmallGrayButton_Make. 14. For the third set of images, set the left-hand one to use the image SmallGrayButton_Overview and the right-hand one to use the image SmallGrayButton_Country. Refer back to Figure 10-3 (the design view) if you need to see exactly how to set these items in the report. 15. Add a rectangle just under the bottom pyramid chart. Set the following properties: a. Hidden: False b. BackgroundColor: No Color c. BorderStyle: None d. PageBreak > PageLocation: Start e. Bookmark: MiddlePage 16. Copy this rectangle and place the copy just below the table of color sales. Set its bookmark property to BottomPage. 17. Ensure that the tops of the sets of buttons are always just below the bottom of the rectangles. 18. Select the tablix containing the years in the year selector, and using the Properties window, set its bookmark property to TopPage. 19. Right-click the top left image button (it should display Country) and select Image Properties from the context menu. Select Action on the left and then Go to bookmark as the action to enable. 20. Enter MiddlePage as the bookmark, and click OK. 21. Do the same for all the image buttons, using the following bookmarks per button: a. Country: MiddlePage b. Make: BottomPage c. Overview: TopPage 299

47 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES You can now preview the report. When you click the buttons you should jump to the next part of the report. You may need to tweak the height of the sections of the report to match the display size of the actual mobile device that you are using. How It Works This report uses a tried-and-tested technique of using the vertical parts of a report as separate pages when displayed. A rectangle (which, although technically not hidden, is not visible because it has no border or fill) is used to force the page breaks. The same rectangles serve to act as bookmarks, except for the top of the page, where using the year selector is more appropriate. Then a series of images (here they are buttons) are used to provide the action, which jumps to the appropriate bookmark if the image is clicked or tapped. Quite possibly the hardest part when rejuggling dashboards like this is deciding how best to reuse, and how to group, the existing visualizations. In reality, you might find yourself re-tweaking an original dashboard more than you did here. So do not be afraid to remove objects or add other elements if it suits the purpose of the report. You do not need to use images for the buttons that allow users to jump around a report. Any SSRS object that triggers an action will do. However, tweaking a handful of images so that they contain the required text takes only a few minutes, and it certainly looks professional. Also, these buttons could become a standard across all your reports and consequently familiar to all your users. Creating Tabbed Reports Sometimes you need to present information in a series of separate areas that are nonetheless part of a whole. In these cases, scrolling down through a report (or jumping to a different part of the report) is distracting, and moving to a completely different report can confuse the user. One classic, yet effective, way to overcome these issues is to design reports that break down the available information into separate sections. You then make these sections available to the user as tabs on the report. This avoids an unnecessarily complex navigation path through a set of reports, and lets the user focus on a specific area of information. These reports group different elements in separate sections (or tabs ) where one click on the tab displays the chosen subset of data. This is a bit like having an Excel file with multiple worksheets, only in an SSRS report. A tabbed report is a single report consisting of two or more sections. These sections are laid out vertically in SSRS, one above the other. The trick is only to make visible the elements that make up one section at a time when viewing the report. This approach is a little different from most of the techniques you have seen so far in this chapter. Up until now you have been filtering data in charts or tables. What you will be doing now is making report elements visible or invisible as an interface technique. In this example, the report will have three tabs as visual indicators at the top of the report. The sections of the report are implemented as three sets of elements (charts, tables, text boxes, etc.), one above the other in the actual report design. The key trick is to handle the Hidden property of each element so that it is set by clicking on the appropriate tab. The final report appears to the user like one of the tabs shown in Figure

48 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure A tabbed report (the screenshot shows the same report three times: once with the "Sales List" tab active, once with the "Sales Chart" tab active, and once with the "Sales Gauges" tab active) However, the report is somewhat different in Report Designer, shown in Figure Here, as you can see, the three tabbed screens are, in effect, a single report. 301

49 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure A tabbed report in Report Designer The Source Data You saw how to create the gauges in Chapter 4, so I will not repeat the code for them here. However, the code for the initial table and the column chart is as follows:
DECLARE @ReportingYear INT = 2014
DECLARE @ReportingMonth INT = 10

-- Code.pr_TabletTabbedReportCountrySales 302

50 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
SELECT CASE WHEN CountryName IN ('France','United Kingdom','Switzerland', 'United States', 'Spain')
            THEN CountryName ELSE 'OTHER' END AS CountryName
      ,SUM(SalePrice) AS Sales
FROM Reports.CarSalesData
WHERE ReportingYear = @ReportingYear
  AND ReportingMonth = @ReportingMonth
GROUP BY CountryName

-- Code.pr_TabletTabbedReportSalesList
SELECT Make
      ,SUM(SalePrice) AS Sales
      ,SUM(TotalDiscount) AS Discounts
      ,SUM(DeliveryCharge) AS DeliveryCharge
      ,SUM(CostPrice) AS CostPrice
FROM Reports.CarSalesData
WHERE ReportingYear = @ReportingYear
  AND ReportingMonth = @ReportingMonth
GROUP BY Make

-- Code.pr_TabletTabbedReportRatioGauges
IF OBJECT_ID('tempdb..#Tmp_GaugeOutput') IS NOT NULL
    DROP TABLE #Tmp_GaugeOutput

CREATE TABLE #Tmp_GaugeOutput
(
     ManufacturerName NVARCHAR(80) COLLATE DATABASE_DEFAULT
    ,Sales NUMERIC(18,6)
    ,Discount NUMERIC(18,6)
    ,DeliveryCharge NUMERIC(18,6)
    ,SpareParts NUMERIC(18,6)
    ,LabourCost NUMERIC(18,6)
    ,DiscountRatio NUMERIC(18,6)
    ,DeliveryChargeRatio NUMERIC(18,6)
    ,SparePartsRatio NUMERIC(18,6)
    ,LabourCostRatio NUMERIC(18,6)
)

INSERT INTO #Tmp_GaugeOutput (ManufacturerName, Sales, Discount, DeliveryCharge, SpareParts, LabourCost)
SELECT Make
      ,SUM(SalePrice) AS Sales
      ,SUM(TotalDiscount) AS Discount
      ,SUM(DeliveryCharge) AS DeliveryCharge
      ,SUM(SpareParts) AS SpareParts
      ,SUM(LaborCost) AS LabourCost 303

51 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
FROM Reports.CarSalesData
WHERE ReportingYear = @ReportingYear
  AND ReportingMonth = @ReportingMonth
GROUP BY Make

UPDATE #Tmp_GaugeOutput
SET DiscountRatio = CASE WHEN Discount IS NULL OR Discount = 0 THEN 0 ELSE (Discount / Sales) * 100 END
   ,DeliveryChargeRatio = CASE WHEN DeliveryCharge IS NULL OR DeliveryCharge = 0 THEN 0 ELSE (DeliveryCharge / Sales) * 100 END
   ,SparePartsRatio = CASE WHEN SpareParts IS NULL OR SpareParts = 0 THEN 0 ELSE (SpareParts / Sales) * 100 END
   ,LabourCostRatio = CASE WHEN LabourCost IS NULL OR LabourCost = 0 THEN 0 ELSE (LabourCost / Sales) * 100 END

SELECT * FROM #Tmp_GaugeOutput

How the Code Works These code snippets return a few key metrics aggregated by make or country for a specific year and month. Running this code returns the data shown in Figure Figure The output for the table in the tabbed report 304
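If you want to check the three datasets outside of the report before wiring them up, you can execute the stored procedures directly. Treat the following calls as a sketch only: they assume the procedures expose the same @ReportingYear and @ReportingMonth parameters as the reconstructed script above, which this sampler does not show explicitly.

-- Run each procedure for the same year and month the report will use
EXEC Code.pr_TabletTabbedReportCountrySales @ReportingYear = 2014, @ReportingMonth = 10;
EXEC Code.pr_TabletTabbedReportSalesList @ReportingYear = 2014, @ReportingMonth = 10;
EXEC Code.pr_TabletTabbedReportRatioGauges @ReportingYear = 2014, @ReportingMonth = 10;

Each call should return the rows that feed, respectively, the column chart, the sales list table, and the ratio gauges.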

52 Building the Report CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES This report is nothing more than a series of objects whose visibility is controlled by the action properties of the text boxes that are used to give a tabbed appearance. As I want to concentrate on the way that objects are made to appear and disappear, I will be somewhat succinct about how to create the objects (table, chart, and gauges) themselves, as they are essentially variations on a theme of visualizations that you have already seen in previous chapters. 1. Make a copy of the.rdl file named DateSelector. Name the copy Tablet_TabbedReport.rdl, and open the copy. 2. Remove the month elements as described for the tiled report earlier in this chapter. Leave the ReportingMonth parameter, however. 3. Add the following three parameters, all of which are Boolean and hidden: a. TabSalesList - default value: False b. TabSalesChart - default value: True c. TabSalesGauge - default value: True 4. Add the following datasets: a. CountrySales, using the stored procedure Code.pr_TabletTabbedReportCountrySales b. SalesList, using the stored procedure Code.pr_TabletTabbedReportSalesList c. RatioGauges, using the stored procedure Code.pr_TabletTabbedReportRatioGauges d. MonthList, using the shared dataset ReportingFullMonth 5. Embed the following seven images into the report (all.png files from the directory C:\BIWithSSRS\Images that you have downloaded from the Apress web site): EuropeFlag, GermanFlag, USAFlag, GBFlag, SpainFlag, FranceFlag, and SwissFlag. 6. Copy the pop-up menu text box and table that you created in Chapter 8. You can see this type of visualization in Figure 8-5. Set the table for the menu to use the dataset MonthList. 7. Add three text boxes in a row above the year selector. Format them to Arial Black 14 point centered. Enter the following texts (in this order): Sales List, Sales Chart, and Sales Gauges. Name the three text boxes (also in this order): TxtTabSalesList, TxtTabSalesChart, and TxtTabSalesGauges. 8. Add a line under the text boxes. Set the LineColor to dark blue and the LineWidth to 3 points. This is the tabbed header that you can see in Figure Create a table and delete the second (detail) row. Delete all but one column. Add two more rows and set the dataset to SalesList. In the Properties window, name this table TabSalesList. 10. Create a second table of five columns and drag it into the second row of the table that you created previously. As the nested table will inherit the outer table s dataset, you can set the detail row to use the fields Make, Sales, Discounts, DeliveryCharge, and CostPrice (in this order). 305

53 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 11. Add the text Core Sales Data to the top row of the outer table. Format the table as you see fit, possibly using Figure 9-11 as an example. 12. Add a text box under the table and name it TxtChartTitles. Set the font to Arial blue 18 point. 13. Under this title, add a column chart that you name Chart. Set it to use the dataset CountrySales. Set the Values to use the Sales field and the category groups to use the CountryName field. 14. Click any column in the chart and set it to use an embedded background image where the Value for the image is set in the Properties window using the following expression: =Switch ( Fields!CountryName.Value= "United Kingdom", "GBFlag",Fields!CountryName.Value= "France", "FranceFlag",Fields!CountryName.Value= "UnitedStates", "USAFlag",Fields!CountryName.Value= "Germany", "GermanFlag",Fields!CountryName.Value= "Spain", "SpainFlag",Fields!CountryName.Value= "Switzerland", "SwissFlag",Fields!CountryName.Value= "OTHER", "EuropeFlag" ) 15. Format this chart as you want, possibly using Figure 6-13 from Chapter 6 as a model. 16. Add a text box under the chart and name it TxtGaugeTitles. Set the font to Arial blue 18 point. 17. Create a gauge similar to the one that you created in Chapter 3 for Figure 3-3. However, you must add a second scale by right-clicking the gauge and selecting Add Scale. 18. The first scale needs the following properties setting in the Linear Scale Properties dialog: Parameter Property Value Layout Position in gauge (percent) 50 Start margin (percent) 40 End margin (percent) 11 Scale bar width (percent) 0 Labels Placement relative to scale Cross Major Tick Marks Hide major tick marks Checked Minor Tick Marks Hide minor tick marks Checked 306

54 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 19. The second scale needs the following properties setting in the Linear Scale Properties dialog: Parameter Property Value Layout Position in gauge (percent) 50 Start margin (percent) 8 End margin (percent) 70 Scale bar width (percent) 0 Labels Placement relative to scale Cross Major Tick Marks Hide major tick marks Checked Minor Tick Marks Hide minor tick marks Checked 20. In this case, however, use the dataset RatioGauges and add a total of four bar pointers, two for each scale. You do this by right-clicking inside the gauge and selecting Add Pointer For and selecting the relevant scale for each pointer that you add. These pointers must use (clockwise from top left) the following fields: DeliveryChargeRatio, DiscountRatio, LaborCostRatio, SparePartsRatio. 21. Set all the pointers to 15 percent width. The two left-hand pointers should have their distance from the scale set to 33 percent. The two right-hand pointers should have their distance from the scale set to 15 percent. 22. Add titles under each pointer. See Chapter 3 for techniques on how to do this. Add a title at the top of the gauge and enter the text Rolls Royce. 23. Filter the gauge on the make Rolls Royce. 24. Make two copies of the gauge and place them side by side with the first gauge. Name the second gauge GaugeAstonMartin and filter it on Make = Aston Martin. Name the third gauge GaugeJaguar and filter it on Make = Jaguar. Now that you have all the elements in place, you can add the final touch: managing the visibility of the various parts of the report. 25. Select the table named TabSalesList and in the Properties window, set its Hidden property to =Parameters!TabSalesList.Value. 26. Select the text box title for the chart as well as the chart, and set their Hidden property to =Parameters!TabSalesChart.Value. 27. Select the three gauges and the text box title for the gauges and set their Hidden property to =Parameters!TabSalesGauge.Value. 307

55 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 28. Click the TxtTabSalesList text box, and set the following properties:
Property: BackgroundColor, Value: =Switch(Parameters!TabSalesList.Value=False,"Blue",1=1,"LightBlue")
Property: Color, Value: =Switch(Parameters!TabSalesList.Value=False,"White",1=1,"Black")
29. Set the same properties for the two other text boxes (TxtTabSalesChart and TxtTabSalesGauges), only change the parameter reference in the expression to the appropriate parameter. So, for the text box TxtTabSalesChart you should use Parameters!TabSalesChart.Value, and for the text box TxtTabSalesGauges you should use Parameters!TabSalesGauge.Value. 30. Right-click the text box TxtTabSalesList and select Text Box Properties from the context menu. Click Action on the left, and set the following options:
Option: Enable as action, Value: Go to report
Option: Specify a report, Value: [&ReportName]
Option: Use these parameters, Values:
ReportingYear = [@ReportingYear]
TabSalesList = False
TabSalesChart = True
TabSalesGauge = True
ReportingMonth = [@ReportingMonth]
31. The dialog should look like Figure 308

56 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure The action parameters for a tabbed report 32. Do exactly the same for the text box TxtTabSalesChart, only alter the following parameters: a. TabSalesList: True b. TabSalesChart: False c. TabSalesGauge: True 33. Do exactly the same for the text box TxtTabSalesGauges, only alter the following parameters: a. TabSalesList: True b. TabSalesChart: True c. TabSalesGauge: False That's it. You have defined a report where two out of the three parts of the report will always be hidden, giving the impression that the user is flipping from tab to tab when a text box at the top of the report is clicked or tapped. 309

57 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES How It Works Because there are three groups of report items that have to be made visible or invisible as coherent units, I find it easier to use parameters to contain the state of the visibility of items in the report. Then, when a text box is clicked, its action property can set the value for the parameters that control visibility. This causes the report to be redisplayed with a different set of visible items. I used this report as an opportunity to reuse a technique that you saw in the previous chapter: the pop-up menu. This is not strictly necessary when creating tabbed reports, but I want to illustrate that you can soon end up with a plethora of parameters that have to be kept in synch when developing reports with more complex interfaces like this one. The property used to handle visibility is called Hidden, so you have to remember to set the parameter to True to hide an element and False to display it. This can seem more than a little disconcerting at first. Note A tabbed report is loaded in its entirety, including the hidden elements. This means that switching from tab to tab can be extremely rapid once the report has been loaded. This makes tabbed reports ideal candidates for caching. Caching techniques are described in greater detail in Chapter 12. Other Techniques for Tablet Reports Tabbed reports are not the only solution that you can apply when creating or adapting reports for handheld devices. Other approaches include the following: Drilldown to hierarchical tables, just as you would with classic reports in SSRS. Restructuring the visualizations into a whole new set of reports and linking the reports. An example of this technique is given at the end of this chapter. Simplifying the visualizations that make up a dashboard to allow a greater concentration of widgets in a smaller space. Most, if not all, of the examples in this chapter do this in some way. Removing dashboard elements to declutter the report. Smartphone Reports There is probably one word that defines successful reports for smartphones : simplicity. As in most areas, simplicity in report design is often harder than complexity. While we all have our opinions about design, here are a few tips: Isolate truly key elements. Remove anything not essential from the report. Create single-focus screens. Ideally the information should be bite-sized. Deliver key data higher in the sequence of reports. Give less detail at higher levels and reserve granularity for further reports lower in the hierarchy. Limit the number of metrics you deliver on each screen. Consider providing data or titles as tooltips. It can save screen real estate. Share datasets. This encourages widget reuse. 310

58 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Let's see how these principles can be applied in practice. I'll cover using gauges and charts to display metrics, and then provide a few tips on delivering text-based metrics on a smartphone. Multiple Gauges A user looking at their phone for business intelligence will be a busy person and their attention span will be limited. After all, many other distractions are jostling for their time and focus. So you need to give them the information they want as simply and effectively as possible. Gauges can be an ideal way to achieve this objective. Inevitably you will have to limit the information that can be displayed efficiently. This will depend on the size of the screen you are targeting, so there are no definite limits. In this example, I will use six gauges to show car sales for the current month. Moreover, I will deliberately break down the display into two groups: five gauges for specific makes of cars sold, and one gauge for all the others. This could require tweaking the source data to suit the desired output, but this is all too often what you end up doing when designing BI for smartphones. The final output will look like Figure In this particular visualization, there is no facility for selecting the year and month; the current year and month are displayed automatically. Figure Gauges showing sales for the current month 311

59 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES The Source Data You can get the figures for sales for the current month using the following SQL (Code.pr_SmartPhoneCarSalesGauges). You saw this in Chapter 3, so I will refer you to back to this chapter rather than show it all again here. Building the Gauges Now it is time to put the pieces together and assemble the gauges for the smartphone display. You will be building on some of the experience gained in Chapter 3, specifically the report named _CarSalesGauge.rdl. 1. Make a copy of the report named _CarSalesGauge.rdl. Name the copy SmartPhone_CarSalesGauges.rdl. 2. Right-click the ReportingYear and ReportingMonth parameters (in turn) and set the parameter to hidden. 3. Increase the size of the report so that it is three times wider and five times taller than the gauge. Place the gauge on the right of the report. Resize the gauge so that it is 1.5 inches square. 4. Add a table to the left of the report. Delete the detail row. Make the table three columns by six rows, adding the required number of columns and rows. Set the dataset to SmartPhoneCarSalesGauges. 5. Set the first, third, and fifth rows to be 0.25 inches tall. Set the second, fourth, and sixth rows to be 1.5 inches tall. 6. Set the first and third columns to be 1.5 inches wide, and the second column to be 0.25 inches wide. 7. Set all the cell backgrounds to black. 8. Add the titles shown in Figure 10-8 to the first and third columns of the first, third, and fifth rows. Set the font to Arial 10 point white. 9. Drag the gauge into the left-hand cell of row two. Ensure that the filter is set to Aston Martin. 10. Copy the gauge into the cells of the left-hand and right-hand columns of rows two, four, and six. 11. Filter each of the gauges to apply only the make that appears in the title above the gauge. 12. Resize the report to fit to the size of the table. How It Works This report simply uses a table as a placeholder to contain the six gauges. The table uses the same dataset as the gauges, even though it does not use it. Each gauge is then filtered on the field that allows it to display data for a single record in the output recordset. The big advantage of gauges here is that they can be resized easily, and will even resize if you adjust the height or width of the row and column that they are placed in. 312
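The Chapter 3 procedure referenced in the Source Data section above is not reprinted in this sampler. Purely as a rough sketch of the kind of query it wraps (the Reports.CarSalesData columns follow the other listings in this chapter, but the real Code.pr_SmartPhoneCarSalesGauges may well differ), a current-month aggregation by make could look like this:

-- Sketch only: aggregate this month's sales by make, with no parameters needed
SELECT Make
      ,SUM(SalePrice) AS Sales
FROM Reports.CarSalesData
WHERE ReportingYear = YEAR(GETDATE())
  AND ReportingMonth = MONTH(GETDATE())
GROUP BY Make;

In the report, each of the six gauges is then filtered so that it displays only one row of this output, which is exactly what the How It Works section above describes.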

60 Slider Gauges CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES A type of presentation that is extremely well suited to smartphones is the lateral gauge. These gauges have been popularized by certain web sites, and are sometimes called slider gauges. An example is shown in Figure Figure A slider gauge The Source Data The code needed to produce this mobile report is as follows:
DECLARE @ReportingYear INT = 2014
DECLARE @ReportingMonth INT = 10

IF OBJECT_ID('Tempdb..#Tmp_Output') IS NOT NULL
    DROP TABLE Tempdb..#Tmp_Output

CREATE TABLE #Tmp_Output
(
     Make NVARCHAR(80) COLLATE DATABASE_DEFAULT
    ,Sales NUMERIC(18,6) 313

61 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
    ,SalesBudget NUMERIC(18,6)
    ,PreviousYear NUMERIC(18,6)
    ,PreviousMonth NUMERIC(18,6)
    ,ScaleMax INT
)

INSERT INTO #Tmp_Output (Make, Sales)
SELECT CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END AS Make
      ,SUM(SalePrice)
FROM Reports.CarSalesData
WHERE ReportingYear = @ReportingYear
  AND ReportingMonth = @ReportingMonth
GROUP BY CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END

-- Previous Year Sales
; WITH SalesPrev_CTE AS
(
    SELECT CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END AS Make
          ,SUM(SalePrice) AS Sales
    FROM Reports.CarSalesData
    WHERE ReportingYear = @ReportingYear - 1
      AND ReportingMonth = @ReportingMonth
    GROUP BY CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END
)
UPDATE Tmp
SET Tmp.PreviousYear = CTE.Sales 314

62 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
FROM #Tmp_Output Tmp
    INNER JOIN SalesPrev_CTE CTE
        ON Tmp.Make = CTE.Make

; WITH Budget_CTE AS
(
    SELECT SUM(BudgetValue) AS BudgetValue
          ,BudgetDetail
    FROM Reference.Budget
    WHERE BudgetElement = 'Sales'
      AND Year = @ReportingYear
      AND Month = @ReportingMonth
    GROUP BY BudgetDetail
)
UPDATE Tmp
SET Tmp.SalesBudget = CTE.BudgetValue
FROM #Tmp_Output Tmp
    INNER JOIN Budget_CTE CTE
        ON Tmp.Make = CTE.BudgetDetail

-- Previous month sales
; WITH PreviousMonthSales_CTE AS
(
    SELECT SUM(SalePrice) AS Sales
          ,CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END AS Make
    FROM Reports.CarSalesData
    WHERE InvoiceDate >= DATEADD(mm, -1, CONVERT(DATE, CAST(@ReportingYear AS CHAR(4)) + RIGHT('0' + CAST(@ReportingMonth AS VARCHAR(2)),2) + '01'))
      AND InvoiceDate <= DATEADD(dd, -1, CONVERT(DATE, CAST(@ReportingYear AS CHAR(4)) + RIGHT('0' + CAST(@ReportingMonth AS VARCHAR(2)),2) + '01'))
    GROUP BY CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END
)
UPDATE Tmp
SET Tmp.PreviousMonth = CTE.Sales
FROM #Tmp_Output Tmp
    INNER JOIN PreviousMonthSales_CTE CTE
        ON Tmp.Make = CTE.Make 315

63 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
-- Scale maximum
UPDATE #Tmp_Output
SET ScaleMax = CASE WHEN Sales >= SalesBudget THEN (SELECT Code.fn_ScaleDecile(Sales))
                    ELSE (SELECT Code.fn_ScaleDecile(SalesBudget))
               END

-- Output
SELECT * FROM #Tmp_Output

Running this code gives the results shown in Figure (for March 2013): Figure The data for slider display How the Code Works This code first creates a temporary table, and then it adds the sales data for the current year up to the selected month, sales up to the same month for the previous year, the budget data for the current year, and the sales for the preceding month. This code block also includes the maximum value for the scale inside the table where the core data is held. This is because the maximum value can change for each make of car. Building the Report So, with the data in place, here is how to build the report. 1. Make a copy of the report DateSelector and name the copy SmartPhone_SalesAndTargetWithPreviousMonthAndPreviousPeriod.rdl. 2. Add the following two datasets: a. MonthList, based on the shared dataset ReportingFullMonth b. MonthlyCarSalesWithTargetPreviousMonthAndPreviousYear, based on the stored procedure Code.pr_MonthlyCarSalesWithTargetPreviousMonthAndPreviousYear 3. Delete the table containing the months. 4. Copy the text box and table that act as a pop-up menu from the report Tablet_TabbedReport.rdl that you created earlier in this chapter, and place them under the year selector. 316

64 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 5. Add a gauge to the report and choose Horizontal as the gauge type. Apply the dataset MonthlyCarSalesWithTargetPreviousMonthAndPreviousYear. 6. Add three new pointers and one range so that there are four pointers and two ranges in total. Set the pointers to use the following fields (the easiest way is to click the gauge so that the Gauge Data pane appears): a. LinearPointer1: PreviousMonth b. LinearPointer2: SalesBudget c. LinearPointer3: Sales d. LinearPointer4: PreviousYear 7. Add the following tooltips in the Properties window for three of the four pointers: a. LinearPointer1: = Last Month b. LinearPointer3: = This Month c. LinearPointer4: = Same Month Last Year 8. Set the four pointer properties as follows (right-click each one individually and select Linear Pointer Properties from the context menu): Pointer Option Parameter Value LinearPointer1 Pointer options Pointer type Marker Marker style Triangle Placement relative to scale Inside Distance from scale (percent) 17 Width (percent) 21 Length (percent) 15 Pointer fill Fill style Solid Color Yellow LinearPointer2 Pointer options Pointer type Marker Marker style Rectangle Placement relative to scale Cross Distance from scale (percent) 5 Width (percent) 3 Length (percent) 40 Pointer fill Fill style Solid Color Blue (continued) 317

65 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Pointer Option Parameter Value LinearPointer3 Pointer options Pointer type Bar Bar start Scale start Placement relative to scale Cross Distance from scale (percent) 6 Width (percent) 13 Pointer fill Fill style Solid Color White LinearPointer4 Pointer options Pointer type Marker Marker style Triangle Placement relative to scale Outside Distance from scale (percent) 6 Width (percent) 21 Length (percent) 15 Pointer fill Fill style Solid Color Orange 9. Set the following properties in the Linear Range Scale Properties dialog for the two ranges: Range Option Parameter Value Range 1 General Start range at scale value 0 End range at scale value =Fields!SalesBudget.Value Placement relative to scale Cross Distance from scale (percent) 5 Start width (percent) 40 End width (percent) 40 Fill Fill style Gradient Color Blue Secondary color Cornflower blue Gradient style Left Right Border Line style Solid Line color Silver (continued) 318

66 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Range Option Parameter Value Range 2 General Start range at scale value =Fields!SalesBudget.Value End range at scale value Placement relative to scale Cross Distance from scale (percent) 5 Start width (percent) 40 End width (percent) 40 Fill Fill style Gradient Color Lime green Secondary color White Gradient style Top Bottom Border Line style Solid Line color Silver =Fields!ScaleMax.Value 10. Right-click the scale, select Scale Properties, and set the following properties: Section Property Value General Minimum 0 Maximum (expression) =Fields!ScaleMax.Value Layout Position in gauge (percent) 57 Start margin (percent) 4 End margin (percent) 4 Scale bar width (percent) 1 Labels Placement (relative to scale) Outside Distance from scale 28 Font Font Arial Size 8 point Color Dim gray Number Category Number Use 1000 separator Checked Major Tick Marks Hide major tick marks Checked Minor Tick Marks Hide minor tick marks Checked 11. Add a table to the report area and delete all but one column and the header row. Set the remaining cell to be the width of the gauge and approximately 0.8 inches high. 12. Drag the gauge into the table cell. 319

67 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 13. Right-click the gauge and set the following Gauge Panel Properties. This will filter the gauge data so that only the data for the current record is displayed: Section Property Value Filters Expression Make Operator = Value Make Fill Background color of the gauge panel White 14. Right-click the gauge and set the following Gauge Properties: Section Property Value Back Fill Fill style Gradient Color Dim gray Secondary Color White smoke GradientStyle Diagonal left Frame Style Simple Width (percent) 4.5 Frame Fill Fill style Gradient Color White smoke Secondary Color Dim gray GradientStyle Horizontal center Frame Border Line style None How It Works This gauge uses three different pointer styles as well as pointer placement to add a lot of information to a single gauge. The main pointer is the central bar (which contains the data for sales to the selected month in the year) while the budget is a bar across the gauge that lets the user see how sales relate to budget. Because they are ancillary data, the pointers for last month s sales and the sales up until the same month for the previous year are shown as small triangular pointers above and below the main pointer. The budget is then reflected in the two ranges to indicate more clearly how sales relate to budget. Finally, tooltips are added for all the pointers so that the user can see which pointer represents which value. Note Every time a new user sees this gauge, they ask how they can move the slider pointers and change the data. You have been warned! 320
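One helper that the slider's source data relies on but that is not shown in this chapter is Code.fn_ScaleDecile, which supplies the ScaleMax column used as the gauge scale maximum. The following is only a guess at one way such a function could be written; the real implementation comes from earlier in the book and may round differently. The idea is simply to round a value up to a clean figure so that the scale always ends just above the largest pointer.

-- Sketch only: not the book's actual Code.fn_ScaleDecile
CREATE FUNCTION Code.fn_ScaleDecile (@InputValue NUMERIC(18,6))
RETURNS NUMERIC(18,6)
AS
BEGIN
    -- Guard against NULL, zero, or negative input; the gauges only ever pass positive figures
    IF @InputValue IS NULL OR @InputValue <= 0
        RETURN 0;

    -- Find the power of ten just below the value (100,000 for 123,456, for example)
    DECLARE @Magnitude FLOAT = POWER(CAST(10 AS FLOAT), FLOOR(LOG10(CAST(@InputValue AS FLOAT))));

    -- Round the value up to the next multiple of one tenth of that magnitude
    -- (123,456 becomes 130,000), which gives a tidy scale maximum
    DECLARE @Step NUMERIC(18,6) = CAST(@Magnitude / 10 AS NUMERIC(18,6));
    RETURN CEILING(@InputValue / @Step) * @Step;
END;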

68 Text-Based Metrics CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES I would never advise creating BI visualizations for smartphones that rely on lots of text. Users will simply not read reams of prose on a tiny mobile device. Instead, think of some of the apps that you currently use. They probably have simple screens and large buttons. Smartphone BI should emulate this approach. As an example, consider the list of sales by make in Figure You will note that this example also gives the user the possibility to select the year and month using some of the techniques for interactive selection described earlier in this chapter. Figure Sales by make for smartphone display The Source Data The code needed to deliver this visualization is not overly complex, and is as follows:
DECLARE @ReportingYear INT = 2014
DECLARE @ReportingMonth INT = 10

IF OBJECT_ID('Tempdb..#Tmp_Output') IS NOT NULL
    DROP TABLE Tempdb..#Tmp_Output

CREATE TABLE #Tmp_Output
(
     Make NVARCHAR(80) COLLATE DATABASE_DEFAULT
    ,Sales NUMERIC(18,6) 321

69 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
    ,SalesBudget NUMERIC(18,6)
    ,SalesStatus TINYINT
)

INSERT INTO #Tmp_Output (Make, Sales)
SELECT CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END AS Make
      ,SUM(SalePrice)
FROM Reports.CarSalesData
WHERE ReportingYear = @ReportingYear
  AND ReportingMonth = @ReportingMonth
GROUP BY CASE WHEN Make IN ('Aston Martin','Bentley','Jaguar','Rolls Royce') THEN Make ELSE 'Other' END

; WITH Budget_CTE AS
(
    SELECT SUM(BudgetValue) AS BudgetValue
          ,BudgetDetail
    FROM Reference.Budget
    WHERE BudgetElement = 'Sales'
      AND Year = @ReportingYear
      AND Month = @ReportingMonth
    GROUP BY BudgetDetail
)
UPDATE Tmp
SET Tmp.SalesBudget = CTE.BudgetValue
FROM #Tmp_Output Tmp
    INNER JOIN Budget_CTE CTE
        ON Tmp.Make = CTE.BudgetDetail

-- Set Sales Status
UPDATE #Tmp_Output
SET SalesStatus = CASE WHEN Sales < (SalesBudget * 0.9) THEN 1
                       WHEN Sales >= (SalesBudget * 0.9) AND Sales <= (SalesBudget * 1.1) THEN 2 322

70 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES
                       WHEN Sales > (SalesBudget * 1.1) THEN 3
                       ELSE 0
                  END

-- Output
SELECT * FROM #Tmp_Output

Running this code produces the table shown in Figure Figure The data for sales by make for the current year How the Code Works Using the approach that you have probably become used to by now in this book, a temporary table is used to hold the sales data for a hard-coded set of vehicle makes. Then, using a CTE, the budget data for these makes is added, and a status flag is calculated on a scale of 1 through 3. Building the Display Let's assemble the table to display this key data on your smartphone. 1. Make a copy of the report DateSelector and name the copy SmartPhone_CarSalesByCountryWithFlagDials.rdl. Delete the table containing the months. 2. Add the following two datasets: a. MonthList, based on the shared dataset ReportingFullMonth b. MonthlyCarSalesWithTargetAndStatus, based on the stored procedure Code.pr_MonthlyCarSalesWithTargetAndStatus 3. Copy the text box and table that act as a pop-up menu from the report SmartPhone_CarSalesByCountryWithFlagDials.rdl that you created earlier in this chapter, and place them under the year selector. 4. Add a table to the report and delete the third column. Apply the dataset MonthlyCarSalesWithTargetAndStatus. 5. Set the left column to 1.75 inches wide and the right column to 1.25 inches wide (approximately). Set the details row to be about 1/2 inch high. Merge the cells on the top row. 323

71 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 6. Add the text Make to the top row and center it. Set the text box background to black and the font to light gray Arial 12 point bold. 7. Add the field Make to the left. Set the text box background to silver and the font to black Arial 16 point bold. 8. Set the text box background of the right column to dim gray. 9. Drag an indicator from the toolbox into the right-hand column cell and select 3 signs as the indicator type. 10. Click the indicator (twice if necessary) and when the Gauge Data pane appears, select SalesStatus as the Value. 11. Right-click the indicator and select Indicator Properties from the context menu. Set the Value and States measurement unit property to numeric. 12. Add an indicator state so that there are a total of four. Set all to use the diamond icon. Set the indicator properties as follows: Indicator Color Start End First No Color 0 0 Second Maroon 1 1 Third Green 2 2 Fourth Dark Blue Right-click the indicator and select Add Label from the context menu. Then right-click the label and select Label Properties from the context menu. Set the following properties: Section Property Value General Text (Expression) =Microsoft.VisualBasic.Strings. Format(Fields!Sales.Value, "#,#") Text alignment Center Vertical alignment Middle Top (percent) 30 Left (percent) 10 Width (percent) 90 Height (percent) 50 Font Font Arial Style Bold Color Yellow 14. Right-click the table and select Tablix properties from the context menu. Select Sorting on the left and click Add. Select Sales as the column to sort by and Z to A as the sort order. 324
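Before previewing the report, it can be worth checking the data that drives the indicator states. The call below is a sketch: it assumes that the stored procedure behind the MonthlyCarSalesWithTargetAndStatus dataset takes the same @ReportingYear and @ReportingMonth parameters used elsewhere in this chapter.

-- Return the rows that feed the table and indicator
EXEC Code.pr_MonthlyCarSalesWithTargetAndStatus @ReportingYear = 2014, @ReportingMonth = 10;

In the output, a SalesStatus of 1 means sales below 90 percent of budget, 2 means within 10 percent of budget, and 3 means more than 10 percent above budget, which is exactly the mapping that the four indicator states in step 12 rely on.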

72 How It Works CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES This visualization uses a table structure to show the records from the source data where the data is not only directly input into a table cell (the Make) but is also part of the indicator, as a text. However, when text is added to an indicator you must use an expression to display it and some simple Visual Basic to format the numbers. One advantage of a table here is that you can resize the indicator simply by adjusting the height and width of the column that contains it; the text will resize automatically as well. Because I am demonstrating smartphone delivery, the color scheme is deliberately brash. You may prefer more moderate pastel shades. Just be aware that this medium has little respect for subtlety or discretion. Multiple Charts Some, but not all, chart types can be extremely effective when used on smartphones. I do not want to dismiss any of the more traditional chart types as unsuitable, so the only comment I will make is that it s best to deliver a single chart per screen if the charts are complex or dense line or bar charts. If you need to deliver multiple charts, consider using a chart like the one from the tabbed report at the start of this chapter. While it can be a little laborious to configure multiple chart areas in this way, it does make resizing the whole chart to suit a specific mobile device extremely easy. Alternatively, if you need multiple charts to make comparisons easier for the user, consider a trellis chart structure. You saw an example of this at the end of Chapter 4 if you need to refer back to refresh your memory. Smartphone and Tablet Report Hierarchy You may find that the way you present the hierarchy of reports for smartphone users is not the same way you create a reporting suite for tablet devices. This is because smartphone users are (probably) more used to simpler, more bite-sized chunks of information. So consider the following tips: Breaking down reports into sub-elements, with bookmark links inside the report, as you saw at the start of this chapter. A menu-style access to a report hierarchy, using buttons rather than lists (as you saw for tablets above) to navigate down and up a hierarchy. Use graphically consistent buttons in all reports. Using visualizations that are tappable and that become part of the navigation when possible. As you have seen the first of these solutions already, let s take a quick look at a simple implementation of the second. This is fairly classic stuff, but it is worth showing how it can be applied to some of the tablet-based reports that you have developed so far in this book. Access to a Report Hierarchy To hide the undeniably horrendous Web Services interface, all you have to do is to set up a home page for your suite of reports and send this URL to your users. They can then add it to the favorites menu in their browser, and use it as the starting point for accessing the set of tablet-oriented BI that you have lovingly crafted. A simple example is shown in Figure

73 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Figure Accessing reports through a custom interface I am presuming here that you have deployed your reports to a SQL Server Reporting Services Instance, and that you have structured the reports into a set of subfolders. In this specific instance, I have set up a virtual folder named CarSalesReports, with a subfolder named Tablet that contains all the reports in the sample solution that begin with Tablet_. Here is how to create this entry page. 1. Create a new, blank report and save it under the name Tablet_Reportsmenu.rdl. 2. Add the image CarsLogo.png. 3. Drag this image to the top left of the report. 4. Add a suitably formatted title such as Mobile BI Sales Reports. 5. Add a text box. Enter the text Sales Report, and set the text to Arial 12 point bold in blue. 6. Right-click the text box and select Text Box Properties from the context menu. 7. Click Action on the left. 8. Select Go to URL and enter the following URL: orts%2ftablet%2ftablet_tabbedreport&rs:command=render 9. Confirm with OK. 326

74 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES 10. Add four new text boxes, all formatted like the first one. Set the texts and URLs to the following: a. Sales Highlights: ReportViewer.aspx?%2fCarSalesReports%2fTablet%2fTablet_ SimpleSlicer&rs:Command=Render b. Sales by Make: px?%2fcarsalesreports%2ftablet%2ftablet_charthighlight&rs:comman d=render c. Sales and Costs: spx?%2fcarsalesreports%2ftablet%2ftablet_slicerhighlight&rs:comm and=render d. Reseller Sales: x?%2fcarsalesreports%2ftablet%2ftablet_tile&rs:command=render 11. Deploy the report (to the Tablet subdirectoy of the CarSalesReports virtual directory) and run it using the following URL: orts%2ftablet%2ftablet_reportsmenu&rs:command=render You can now display the home page for your reporting suite, and then drill down into any report with a simple tap on the tablet. How It Works As you can see, you are using the Web Services URL for every report and studiously avoiding the Report Manager interface. This is to ensure that tablets (and that means ipads too) can display the report. However, all that your users see is a structured drill-down to the reports that they want, so hopefully the few minutes of work to set up a menu-driven access to your reporting suite will be worth the effort. Structuring the Report Hierarchy Now that you have seen the principles of creating an interface to replace the Report Manager, you should be able to extend this to as many sublevels as you need. One extra point is that you might find it useful to add a small text box, using the principles that you just saw, to each of the subreports. This text box uses the URL of the home menu in its action property. This way you can flip back from any report at a sublevel to the home page or to another sublevel. If you are thinking yes, but I have the browser s back button for that, remember that your users could be using the postback techniques that you have been using in this and previous chapters. If so, the back button will only display the same report, but using the previous set of parameters and not return to a higher level in the report hierarchy. This technique can also be used to break down existing reports into multiple separate reports. In the case of both the tabbed report and the bookmark report that you saw earlier, you could cut a report into separate.rdl files and use action properties to switch to another report. 327
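Since all of this navigation hangs off the Web Service URL, it helps to keep the general pattern in mind. The following is a template rather than a URL from the sample solution; the server name is a placeholder for your own deployment, and the two rc: arguments are the standard URL access parameters for hiding the report toolbar and the parameter area.

http://<YourServer>/ReportServer/Pages/ReportViewer.aspx?%2fCarSalesReports%2fTablet%2fTablet_ReportsMenu&rs:Command=Render&rc:Toolbar=false&rc:Parameters=false

Send your users this kind of URL (or embed it in the Go to URL action of a button, as in the steps above) and the Report Manager interface never needs to appear on a mobile device.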

75 CHAPTER 10 BI FOR SSRS ON TABLETS AND SMARTPHONES Conclusion This chapter showed you some of the ways you can deliver business intelligence to tablets and smartphones using SSRS. You saw that you frequently have to tailor the way information is delivered to a tablet or phone. If you have an existing report, you may have to adapt it for optimum effect on a mobile device. In any case, there is a series of methods that you can apply to save space and use the available screen real estate to greatest effect. These can involve creating tabbed reports, breaking reports down into separate reports, or using bookmarks to flip around inside a report. Where smartphones are concerned, you probably need to think in terms of simplicity first. In this chapter, you applied the "less is more" principle and consequently removed any element that added clutter to a screen. This way your key BI metrics are visible, instantly and clearly, for your users. 328

76 Expert T-SQL Window Functions in SQL Server Kathi Kellenberger with Clayton Groom

77 Expert T-SQL Window Functions in SQL Server Copyright 2015 by Kathi Kellenberger with Clayton Groom This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Development Editor: Douglas Pundick Technical Reviewer: Stéphane Faroult Edit orial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Mary Behr Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. 
For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

78 Contents at a Glance About the Authors...xi About the Technical Reviewer...xiii Acknowledgments...xv Author s Note...xvii Chapter 1: Looking Through the Window...1 Chapter 2: Discovering Ranking Functions...17 Chapter 3: Summarizing with Window Aggregates...33 Chapter 4: Tuning for Better Performance...47 Chapter 5: Calculating Running and Moving Aggregates...61 Chapter 6: Adding Frames to the Window...71 Chapter 7: Taking a Peek at Another Row...83 Chapter 8: Understanding Statistical Functions...97 Chapter 9: Time Range Calculations and Trends Index v

79 CHAPTER 9 Time Range Calculations and Trends
A common reporting requirement is to produce totals by different ranges of time for comparison. Typical reports contain totals by month, quarter, and year, sometimes with comparisons to the same period in the prior year or for month-to-date or year-to-date totals. Products like SQL Server Analysis Services and PowerPivot provide functions to navigate date hierarchies. With window functions in SQL Server 2012 or later, you can produce the same calculations using the techniques provided earlier in this book. In this chapter, you will put all the techniques you have learned previously to work to create calculations for the following:
Percent of Parent
Year-to-Date (YTD)
Quarter-to-Date (QTD)
Month-to-Date (MTD)
Same Period Prior Year (SP PY)
Prior Year-to-Date (PY YTD)
Moving Total (MT)
Moving Average (MA)
Putting It All Together
You learned about using window aggregates to add summaries to queries without grouping in Chapter 3, accumulating aggregates in Chapter 5, and frames in Chapter 6. You will be putting these all together, with a bit of common sense, to create complex calculations that without the use of window functions would have required many more steps. 107

80 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Remember, in the case of accumulating aggregates, PARTITION BY and ORDER BY determine which rows end up in the window. The FRAME DEFINITION is used to define the subset of rows from within the partition that will be aggregated. The examples in this section use the frame definition to do the heavy lifting. To review, here's the syntax:
<AggregateFunction>(<col1>) OVER([PARTITION BY <col2>[,<col3>,...<coln>]]
    ORDER BY <col4>[,<col5>,...<coln>]
    [Frame definition])
In this chapter, you will need to use the AdventureWorksDW sample database.
Percent of Parent
Comparing the performance of a product in a specific period to the performance for all products in that same period is a common analytic technique. In this next set of examples, you will build upon a simple base query by adding columns that calculate the pieces needed to produce the final Percent of Parent results. You will start out with a straightforward query that aggregates sales by month, and will add new calculation columns as they are covered, enabling each new column to be introduced on its own without needing to replicate the entire block of example code for each iteration. The code for the base query is shown in Listing 9-1 and the results are shown in Figure 9-1.
Listing 9-1. Base Query
SELECT f.ProductKey,
    YEAR(f.OrderDate) AS OrderYear,
    MONTH(f.OrderDate) AS OrderMonth,
    SUM(f.SalesAmount) AS [Sales Amt]
FROM dbo.FactInternetSales AS f
WHERE OrderDate BETWEEN ' ' AND ' '
GROUP BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
ORDER BY 2, 3, f.ProductKey;
Figure 9-1. Results of the simple base query 108

81 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Ratio-to-Parent calculations can be generally defined as [Child Total] / [Parent Total]. In order to calculate the ratio as defined, you need to calculate the numerator and denominator inputs, and combine them in a third measure. To calculate the overall contribution each product has made to all sales, you need to determine the total for [All Sales] and for [Product All Sales]. Once you have those defined, you can calculate [Product % of All Sales] as [Product All Sales] / [All Sales]. You can multiply the resulting ratio by 100 to display it as a percentage, or rely on the formatting functions in the reporting or front-end tool to display them as percentages. For each of these measures, the window aggregate SUM() function encapsulates a regular SUM() function, which might look a little bit strange to begin with, but is required to aggregate [SalesAmount] to levels higher than the level of granularity of the query. The result is the ability to aggregate the same source column to different levels in a single query without having to resort to temporary tables or common table expressions. Listing 9-2 contains the additional column logic you need to append to the base query, just after the last column in the select list. Be sure to include a comma after [Sales Amt]. See Figure 9-2 for the results. Listing 9-2. Additional Column Logic SUM(SUM(f.SalesAmount)) OVER () AS [All Sales], SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.productkey) AS [Product All Sales], SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.productkey) / SUM(SUM(f.SalesAmount)) OVER() AS [Product % of All Sales] Figure 9-2. Results of calculating product sales as a percentage of all sales The frame for the [All Sales] column does not have a PARTITION clause, which means it will aggregate across all the data available to the query, providing a total of sales for all time. This value will be the same for each row in the resulting table. The PARTITION clause for the [Product All Sales] column restricts the partition to each instance of a product key, providing a total of sales by product for all time. This value will be the same for all rows sharing the same [ProductKey] value. 109
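If the nested SUM(SUM(...)) pattern looks odd, it can help to see the same idea written out in two explicit steps. The following sketch is mine rather than one of the chapter's listings: it pre-aggregates sales to the product and month grain in a derived table and then applies ordinary window aggregates to those grouped rows, which is logically what SUM(SUM(f.SalesAmount)) OVER (...) does in a single statement (the date filter is omitted here to keep the sketch short).
-- A hypothetical two-step equivalent of the SUM(SUM(...)) OVER (...) pattern
SELECT g.ProductKey,
    g.OrderYear,
    g.OrderMonth,
    g.SalesAmt AS [Sales Amt],
    SUM(g.SalesAmt) OVER () AS [All Sales],
    SUM(g.SalesAmt) OVER (PARTITION BY g.ProductKey) AS [Product All Sales]
FROM (
    SELECT f.ProductKey,
        YEAR(f.OrderDate) AS OrderYear,
        MONTH(f.OrderDate) AS OrderMonth,
        SUM(f.SalesAmount) AS SalesAmt
    FROM dbo.FactInternetSales AS f
    GROUP BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
) AS g;
Both forms return the same values; the nested form simply avoids the extra derived table.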

82 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS The [Product % of All Sales] column combines the two prior statements to calculate the ratio between them. The key thing to realize is that you can combine the results of multiple aggregation results within a single column. Knowing this will allow you to create all manner of complex calculations that otherwise would have been relegated to a reporting tool or application code. This approach works best if you work out the logic for each input component for a given calculation, then create the complex calculation that leverages the input calculations. The column calculations for the Annual and Monthly levels follow a similar pattern, so once you have calculations for one level worked out, the rest will follow quickly. There is no need to worry about handling a divide-by-zero error at the all level, as the only case that will result in an error is if there are no rows at all in the source table, but for every level below it, you must account for situations where the denominator value can be zero. The calculations for the annual and month levels demonstrate how this can be done. By wrapping the window SUM() statement in a NULLIF() function, any zero aggregate values are turned into a NULL value, avoiding the divide-by-zero error. You could also use a CASE statement instead of NULLIF(). See Listing 9-3. Listing 9-3. Additional Columns to Calculate the Annual and Monthly Percentage of Parent Columns SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate)) AS [Annual Sales], SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.productkey, YEAR(f.OrderDate)) AS [Product Annual Sales], --Pct of group: --[Product % Annual Sales] = [Product Annual Sales] / [Annual Sales] SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.productkey, YEAR(f.OrderDate)) / NULLIF(SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate)), 0) AS [Product % Annual Sales], SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate), MONTH(f.OrderDate)) AS [Month All Sales], SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.productkey, YEAR(f.OrderDate), MONTH(f.OrderDate)) / NULLIF(SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate), MONTH(f.OrderDate)), 0) AS [Product % Month Sales] If you want to make your code easier to read, understand, and maintain, you can calculate all of the base aggregations in one pass in a common table expression (CTE) and then perform the second-order calculation in a following query, using the named result columns from the CTE instead of the expanded logic shown above. Once you 110

83 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS have worked out the logic for any given combined calculation, you can comment out or remove the input columns and just return the final result columns that you are interested in. See Listing 9-4 for the code and Figure 9-3 for the results.
Listing 9-4. Base Query with Percent of Parent Calculations for [SalesAmount]
SELECT f.ProductKey,
    YEAR(f.OrderDate) AS OrderYear,
    MONTH(f.OrderDate) AS OrderMonth,
    SUM(f.SalesAmount) AS [Sales Amt],
    SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.ProductKey)
        / SUM(SUM(f.SalesAmount)) OVER() AS [Product % of All Sales],
    SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.ProductKey, YEAR(f.OrderDate))
        / NULLIF(SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate)), 0) AS [Product % Annual Sales],
    SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate))
        / NULLIF(SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate), MONTH(f.OrderDate)), 0) AS [Product % Month Sales]
FROM dbo.FactInternetSales AS f
WHERE OrderDate BETWEEN ' ' AND ' '
GROUP BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
ORDER BY 2, 3, f.ProductKey;
Figure 9-3. The results for Listing 9-4
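As suggested above, the base aggregations can also be computed once in a CTE and the ratios built from the named result columns. The query below is my own sketch of that refactoring, not one of the numbered listings; the CTE name and aliases are invented, and the empty date placeholders simply mirror the ones shown in the listings.
WITH BaseAgg AS (
    SELECT f.ProductKey,
        YEAR(f.OrderDate) AS OrderYear,
        MONTH(f.OrderDate) AS OrderMonth,
        SUM(f.SalesAmount) AS SalesAmt
    FROM dbo.FactInternetSales AS f
    WHERE f.OrderDate BETWEEN ' ' AND ' '
    GROUP BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
)
SELECT b.ProductKey,
    b.OrderYear,
    b.OrderMonth,
    b.SalesAmt AS [Sales Amt],
    SUM(b.SalesAmt) OVER (PARTITION BY b.ProductKey)
        / SUM(b.SalesAmt) OVER () AS [Product % of All Sales],
    SUM(b.SalesAmt) OVER (PARTITION BY b.ProductKey, b.OrderYear)
        / NULLIF(SUM(b.SalesAmt) OVER (PARTITION BY b.OrderYear), 0) AS [Product % Annual Sales]
FROM BaseAgg AS b
ORDER BY b.OrderYear, b.OrderMonth, b.ProductKey;
Because the grouping happens in the CTE, the outer query uses plain window aggregates instead of the nested SUM(SUM(...)) form, which many people find easier to read and maintain.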

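The chapter also notes that a CASE expression can stand in for NULLIF() when guarding the denominator. Purely as an illustration (this fragment is mine, not a numbered listing), the [Product % Annual Sales] column could be written this way:
SUM(SUM(f.SalesAmount)) OVER (PARTITION BY f.ProductKey, YEAR(f.OrderDate))
    / CASE WHEN SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate)) = 0
           THEN NULL
           ELSE SUM(SUM(f.SalesAmount)) OVER (PARTITION BY YEAR(f.OrderDate))
      END AS [Product % Annual Sales]
NULLIF() is the more compact choice; the CASE form just spells out the divide-by-zero intent.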
84 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Period-to-Date Calculations Period-to-date calculations are a mainstay of financial reports, but they are notoriously difficult to incorporate into query-based reports without resorting to multiple CTEs or temporary tables. Typically the grouping is performed in a reporting tool such as SSRS or Excel to provide the aggregated results, but this can be tricky to implement. The examples you will work through next will show you how to create multiple levels of rolling totals in a single result set by adding a frame clause to the mix. The frame clause is covered in more detail in Chapter 6. Listing 9-5 demonstrates how to use the frame definition to calculate period-to-date totals by date, by product, for months, quarters, and years. The base query is essentially the same as before, but the level of granularity is at the date level instead of the month level. This is so that you can see the results of the aggregate columns in more detail.
Listing 9-5. Calculating Period-to-Date Running Totals by Date
--x.2.1 day level aggregates, with rolling totals for MTD, QTD, YTD
SELECT f.OrderDate,
    f.ProductKey,
    YEAR(f.OrderDate) AS OrderYear,
    MONTH(f.OrderDate) AS OrderMonth,
    SUM(f.SalesAmount) AS [Sales Amt],
    SUM(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
        ORDER BY f.ProductKey, f.OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Sales Amt MTD],
    SUM(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey, YEAR(f.OrderDate), DATEPART(QUARTER, f.OrderDate)
        ORDER BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Sales Amt QTD],
    SUM(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey, YEAR(f.OrderDate)
        ORDER BY f.ProductKey, f.OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Sales Amt YTD],
    SUM(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey
        ORDER BY f.ProductKey, f.OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Sales Amt Running Total]
FROM dbo.FactInternetSales AS f
GROUP BY f.OrderDate, f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
ORDER BY f.ProductKey, f.OrderDate; 112

85 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS The OVER clause examples shown in Listing 9-5 use the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame. This results in the calculation aggregating all rows from the beginning of the frame to the current row, giving you the correct total to date for the level specified in the PARTITION clause. For instance, the [Sales Amt MTD] aggregate column will calculate the SUM([SalesAmount]) from the first day of the month, the first unbounded preceding row, through to the current row. The ORDER BY clause becomes mandatory when using a frame clause; otherwise there would be no context for the frame to be moved along within the partition. Figure 9-4 shows the partial results. The [Sales Amt MTD], [Sales Amt QTD], and [Sales Amt YTD] column values increase until reaching a different [ProductKey] or a different [ProductKey] and time level (month or quarter). The results in Figure 9-4 show the break in aggregations at the end of the third quarter, so you can see by looking at the rows where [ProductKey] is equal to 311, 312, or 313 that the aggregation resets on October 1. Figure 9-4. Partial results at the end of a quarter (September 30th) Averages, Moving Averages, and Rate-of-Change Before moving on to more involved examples, you need to stop and consider the challenges faced when working with dates. Dates as a data type are continuous and sequential. The T-SQL functions that work with dates are written with this in mind and handle any involved date math correctly. In reality, data based on dates will not be continuous. Transaction data will have gaps where there is no data for a day, a week, or possibly even a month or more. Window functions are not date aware, so it is up to you to ensure that any aggregate calculations handle gaps in the data correctly. If you use the LEAD() and LAG() window functions over date ranges or date period ranges, you have to provide partitions in your result sets that contain continuous and complete sets of dates, months, quarters, or years as needed by your calculations. Failure to do so will produce incorrect results. The reason for this is that the LEAD() and LAG() functions operate over the result set of the query, moving the specified number of rows forward or backward in the result set, regardless of the number of days or months represented. 113

86 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS For example, a three-month rolling average implemented incorrectly using window functions won't take into account cases where there is no data for a product in a given month. It will perform the frame subset over the data provided and produce an average over the prior three months, regardless of whether they are contiguous months or not. Listing 9-6 demonstrates how not accounting for gaps in a date range will result in incorrect or misleading results. In this example, the data is being aggregated to the month level by removing any grouping reference to [OrderDate].
Listing 9-6. Incorrectly Handling Gaps in Dates
-- Handling gaps in dates, Month level: not handling gaps
SELECT ROW_NUMBER() OVER(ORDER BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)) AS [RowID],
    f.ProductKey,
    YEAR(f.OrderDate) AS OrderYear,
    MONTH(f.OrderDate) AS OrderMonth,
    ROUND(SUM(f.SalesAmount), 2) AS [Sales Amt],
    -- month level
    ROUND(SUM(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey, YEAR(f.OrderDate)
        ORDER BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 2) AS [Sales Amt YTD],
    ROUND(AVG(SUM(f.SalesAmount)) OVER(PARTITION BY f.ProductKey
        ORDER BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW), 2) AS [3 Month Moving Avg]
FROM [dbo].[FactInternetSales] AS f
WHERE ProductKey = 332 AND f.OrderDate BETWEEN ' ' AND ' '
GROUP BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
ORDER BY f.ProductKey, YEAR(f.OrderDate), MONTH(f.OrderDate)
The results are shown in Figure 9-5; notice that for the time range selected, only nine months are represented. Calculating a moving average over the range of months that contain no data will produce incorrect results. You will learn how to address this next by filling in the gaps. 114
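To see the problem in isolation, here is a throwaway example of mine with made-up numbers rather than AdventureWorksDW data: March is missing, yet the row-based frame happily averages January, February, and April as if they were consecutive months.
WITH MonthlySales (OrderYear, OrderMonth, SalesAmt) AS (
    SELECT y, m, s
    FROM (VALUES (2011, 1, 100.0), (2011, 2, 200.0), (2011, 4, 400.0), (2011, 5, 500.0)) AS v(y, m, s)
)
SELECT OrderYear, OrderMonth, SalesAmt,
    AVG(SalesAmt) OVER (ORDER BY OrderYear, OrderMonth
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS [3 Row Moving Avg]
FROM MonthlySales;
For the April row, the frame contains January, February, and April, so the "3 Row Moving Avg" actually spans four calendar months and silently skips March.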

87 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Figure 9-5. Incorrect calculation of a moving average In order to address this problem, a supplementary Date table needs to be used to fill in the gaps in the transaction data. This is not a new problem, and it has been solved in data warehouse designs by including a Date dimension table that contains a row for every date in a specified range of years. The AdventureWorksDW database contains a table called DimDate that will be used in the following examples. In the event that you do not have a date dimension at your disposal, you can also use a CTE to create a Date dimension table. The use of a table of date values will result in much better performance over using a CTE. In Listing 9-7, the DimDate table is cross joined with the DimProduct table to produce a set containing all products for all dates in the specified range. The resulting CTE table is used as the primary table in the SELECT portion of the query so that every date in the range is represented in the aggregated results even if there were no transactions for that product in a given time period. You can also pick up additional attributes from the Product table such as product category, color, etc., row counts, and distinct counts from the fact table. These can be used to create additional statistics. In this case, [ ProductAlternateKey] is added and takes the place of [ ProductKey] in all grouping operations in order to make the results more user-friendly. Listing 9-7. Correctly Handling Gaps in Dates month level. Now handling gaps in transaction dates WITH CTE_ProductPeriod AS ( SELECT p.productkey, p.productalternatekey as [ProductID], Datekey, CalendarYear, CalendarQuarter, MonthNumberOfYear AS CalendarMonth FROM DimDate AS d CROSS JOIN DimProduct p WHERE d.fulldatealternatekey BETWEEN ' ' AND ' ' AND EXISTS(SELECT * FROM FactInternetSales f WHERE f.productkey = p.productkey AND f.orderdate BETWEEN ' ' AND ' ') ) 115

88 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS SELECT ROW_NUMBER() OVER(ORDER BY p.[productid], p.calendaryear, p.calendarmonth ) as [RowID], p.[productid], p.calendaryear AS OrderYear, p.calendarmonth AS OrderMonth, ROUND(SUM(COALESCE(f.SalesAmount,0)), 2) AS [Sales Amt], ROUND(SUM(SUM(f.SalesAmount)) OVER(PARTITION BY p.[productid], p.calendaryear ORDER BY P.[ProductID], p.calendaryear, p.calendarmonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ), 2) AS [Sales Amt YTD], ROUND(SUM(SUM(COALESCE(f.SalesAmount, 0))) OVER(PARTITION BY p.[productid] ORDER BY p.[productid], p.calendaryear, p.calendarmonth ROWS BETWEEN 3 PRECEDING AND CURRENT ROW ) / 3, 2) AS [3 Month Moving Avg] FROM CTE_ProductPeriod AS p LEFT OUTER JOIN [dbo].[factinternetsales] AS f ON p.productkey = f.productkey AND p.datekey = f.orderdatekey WHERE p.productkey = 332 AND p.calendaryear = 2011 GROUP BY p.[productid], p.calendaryear, p.calendarmonth ORDER BY p.[productid], p.calendaryear, p.calendarmonth The results are shown in Figure 9-6. Compare the results of the two previous queries. The [3 Month Moving Avg] column is now correct for the months where there were no sales for the product (Feb, Mar, Nov) and for the months immediately after the empty periods (May, June, December). The calculation in the second query did not use the AVG() function but rather divides the SUM() by three to arrive at the average. This ensures a more accurate average for the first three periods. In following sections you will learn how to limit calculations only to ranges that are complete when calculating moving averages. 116
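If no date dimension table is available at all, a calendar can be generated on the fly. The recursive CTE below is a minimal sketch of that idea, assuming an arbitrary start and end date of my choosing; it is not code from the book, and as the author notes, a persisted table of dates will generally perform better than a CTE built at query time.
WITH Dates AS (
    SELECT CAST('20100101' AS date) AS CalendarDate   -- start date is an assumption
    UNION ALL
    SELECT DATEADD(DAY, 1, CalendarDate)
    FROM Dates
    WHERE CalendarDate < '20141231'                   -- end date is an assumption
)
SELECT CalendarDate,
    YEAR(CalendarDate) AS CalendarYear,
    DATEPART(QUARTER, CalendarDate) AS CalendarQuarter,
    MONTH(CalendarDate) AS CalendarMonth
FROM Dates
OPTION (MAXRECURSION 0);
The result can be cross joined to DimProduct in the same way the DimDate table is used in Listing 9-7.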

89 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Figure 9-6. Moving average, taking gaps in sales data into account Same Period Prior Year Part and parcel with providing period-to-date calculations, you will need to provide comparisons to the same period in the prior year, the prior period in the same year, and quite possibly difference amounts and difference percentages. These aggregates can be calculated in exactly the same way as the items you have work with so far: by defining the formula in simple terms, determining the input calculations at a column level, and then building the output column using the input calculations. For this example, the [ProductKey] is dropped from the query so that the granularity of the results is at a month level. This makes it easier for you to see the effect of the new calculations in the smaller number of result rows. In order to calculate a value from a prior year, the query cannot be limited to a single year in the WHERE clause. For a window function to be able to look back into a prior year, there has to be more than one year available in the result set. The LAG() function can retrieve and aggregate data by looking back in the record set by the number of rows specified. It also has an optional default parameter that can be used to return a zero value for cases where there is no row available when navigating back through the records. See Listing 9-8 for the code and Figure 9-7 for the results. Listing 9-8. Retrieving Results for the Same Month of the Prior Year -- Listing 9.8 Same Month Prior Year WITH CTE_ProductPeriod AS ( SELECT p.productkey, --p.productalternatekey as [ProductID], Datekey, CalendarYear, CalendarQuarter, MonthNumberOfYear AS CalendarMonth FROM DimDate AS d CROSS JOIN DimProduct p WHERE d.fulldatealternatekey BETWEEN ' ' AND ' ' 117

90 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS AND EXISTS(SELECT * FROM FactInternetSales f WHERE f.productkey = p.productkey AND f.orderdate BETWEEN ' ' AND ' ') ) SELECT ROW_NUMBER() OVER(ORDER BY p.calendaryear, p.calendarmonth) as [RowID], p.calendaryear AS OrderYear, p.calendarmonth AS OrderMonth, ROUND(SUM(COALESCE(f.SalesAmount,0)), 2) AS [Sales Amt], ROUND(SUM(SUM(COALESCE(f.SalesAmount, 0))) OVER(PARTITION BY p.calendaryear ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ), 2) AS [Sales Amt YTD], ROUND(LAG(SUM(f.SalesAmount), 12, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth),2) as [Sales Amt Same Month PY] FROM CTE_ProductPeriod AS p LEFT OUTER JOIN [dbo].[factinternetsales] AS f ON p.productkey = f.productkey AND p.datekey = f.orderdatekey GROUP BY p.calendaryear, p.calendarmonth ORDER BY p.calendaryear, p.calendarmonth Figure 9-7. Same month, prior year results 118
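The default argument of LAG() mentioned above is easy to see on a tiny inline data set. This throwaway example is mine and the values are made up; the first two rows have no row two positions back, so LAG() returns the supplied default of 0 instead of NULL. The same behavior applies to the 12-row offset used in Listing 9-8.
WITH Monthly (OrderYear, OrderMonth, SalesAmt) AS (
    SELECT y, m, s
    FROM (VALUES (2013, 11, 100.0), (2013, 12, 120.0), (2014, 1, 90.0), (2014, 2, 150.0)) AS v(y, m, s)
)
SELECT OrderYear, OrderMonth, SalesAmt,
    LAG(SalesAmt, 2, 0) OVER (ORDER BY OrderYear, OrderMonth) AS SalesAmtTwoRowsBack
FROM Monthly;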

91 Difference and Percent Difference CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Once you have the ability to look back and pluck a value from the past, you can calculate differences between those values very easily. The commonly accepted method for calculating Percent Difference is ([current] - [previous]) / [previous]. You can also multiply the result by 100 if you want the percentage values to be in the format of ##.###. Add the code shown in Listing 9-9 to the query from Listing 9-8 to incorporate the calculations and run the query.
Listing 9-9. Difference: Current Month Over the Same Month of the Prior Year
LAG(SUM(f.SalesAmount), 12, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth) AS [Sales Amt Same Month PY],
-- [Diff] = [CY] - [PY]
SUM(COALESCE(f.SalesAmount,0)) - LAG(SUM(f.SalesAmount), 12, 0)
    OVER(ORDER BY p.calendaryear, p.calendarmonth) AS [PY MOM Diff],
-- [Pct Diff] = ([CY] - [PY]) / [PY]
(SUM(COALESCE(f.SalesAmount,0)) - LAG(SUM(f.SalesAmount), 12, 0)
    OVER(ORDER BY p.calendaryear, p.calendarmonth))
    / NULLIF(LAG(SUM(f.SalesAmount), 12, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth), 0) AS [PY MOM Diff %]
Figure 9-8. Difference: current month over the same month of the prior year
Figure 9-8 shows the results. The same approach can be used to determine the value for the prior month and the difference between it and the current month. Add the code from Listing 9-10 and run the query to calculate the prior month value, month-over-month difference, and month-over-month difference percentage. The results are shown in Figure 9-9. 119

92 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Listing 9-10. Difference: Current Month Over the Prior Month
LAG(SUM(f.SalesAmount), 1, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth) AS [Sales Amt PM],
-- [Difference] = [CM] - [PM]
SUM(COALESCE(f.SalesAmount,0)) - LAG(SUM(f.SalesAmount), 1, 0)
    OVER(ORDER BY p.calendaryear, p.calendarmonth) AS [PM MOM Diff],
-- [Pct Difference] = ([CM] - [PM]) / [PM]
(SUM(COALESCE(f.SalesAmount,0)) - LAG(SUM(f.SalesAmount), 1, 0)
    OVER(ORDER BY p.calendaryear, p.calendarmonth))
    / NULLIF(LAG(SUM(f.SalesAmount), 1, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth), 0) AS [PM MOM Diff %]
Figure 9-9. Difference: current month to prior month
Moving Totals and Simple Moving Averages
The complexity of the queries in this chapter has been building with each example. They are about to become even more complex. In order to help keep them understandable, you need to think about them conceptually as the number of passes the queries are taking over the data. The early examples were a single query, aggregating over a set of records in a single pass. In order to introduce a contiguous range of dates to eliminate gaps in the transaction data, a second pass was added by introducing a CTE to do some pre-work before the main aggregation query. In this section of the chapter, you will be adding a third pass by turning the aggregate from the last example into a CTE, and aggregating results on top of it. In some cases, the same results can be achieved with one or two passes, but for cases where nesting of a window function is required, the only 120

93 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS option is to add another pass. One example of this is calculating [Sales Amt PY YTD] at the Month level. To create the [Sales Amt YTD] measure, you had to use all the clauses of the window function. There is no method to allow you to shift the partition back to the prior year. By first calculating [Sales Amt YTD] in the second pass, you can then use a window function in the third pass to calculate [Sales Amt PY YTD]. A secondary advantage is that the column calculations on any higher-order passes can use the meaningful column names from the lower-order passes to make the calculations easier to understand and the code more compact. Managing the trade-offs of performance and code readability has to be considered as well. The last item that needs to be addressed is making sure the sets for any moving total or moving average calculation contain the correct number of input rows. For example, a three-month moving average [3 MMA] must contain data from three complete months, or it is not correct. By addressing gaps in transaction date ranges, part of the problem was solved but not the complete problem. At the beginning of a set, the first row has no prior rows, so it is incorrect to calculate a moving average for that row. The second row of the set only has one preceding row, making it incorrect to calculate the average for it as well. Only when the third row is reached are the conditions correct for calculating the three-month moving average. Table 9-1 shows how a three-month average should be calculated, given that the prior year had no data for the product. Table 9-1. Eliminating Incomplete Results for Averages Over Ranges of Periods Month Product Sales Sales YTD Sales 3 MMA January Bacon February Bacon March Bacon April Bacon May Bacon June Bacon To make this work, you simply have to count the number of rows in the frame for the three-month period instead of averaging the results, and use the count to determine when to perform the calculation. If there are three rows, perform the calculation; otherwise, return a null value. The following example uses a CASE statement to determine which rows have two preceding rows: CASE WHEN COUNT(*) OVER (ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3 THEN AVG(SUM(f.SalesAmount)) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND current row) ELSE null END AS [Sales Amt 3 MMA] 121
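An equivalent guard can be written with ROW_NUMBER(), since in a gap-free, ordered set the Nth row is the first one with N - 1 preceding rows. This variation is mine rather than the book's, and it assumes the same month-level query with no product partition:
IIF(ROW_NUMBER() OVER(ORDER BY p.calendaryear, p.calendarmonth) >= 3,
    AVG(SUM(f.SalesAmount)) OVER(ORDER BY p.calendaryear, p.calendarmonth
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
    NULL) AS [Sales Amt 3 MMA]
Either form works; the COUNT(*)-based CASE used in Listing 9-11 has the advantage of reading the same way for the 3-month and 12-month windows.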

94 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS The query in Listing 9-11 includes Moving Total and Moving Average calculations for rolling 3 and rolling 12 month periods. These are implemented in the second order CTE, so that the results are available for further manipulation in the third order SELECT statement. Listing Updates to the Base Query Month level, no product. Handling gaps, All products, 3 "pass" query WITH CTE_ProductPeriod AS ( SELECT p.productkey, Datekey, CalendarYear, CalendarQuarter, MonthNumberOfYear AS CalendarMonth FROM DimDate AS d CROSS JOIN DimProduct p WHERE d.fulldatealternatekey BETWEEN ' ' AND GETDATE() AND EXISTS(SELECT * FROM FactInternetSales f WHERE f.productkey = p.productkey AND f.orderdate BETWEEN ' ' AND GETDATE()) ), CTE_MonthlySummary AS ( SELECT ROW_NUMBER() OVER(ORDER BY p.calendaryear, p.calendarmonth) AS [RowID], p.calendaryear AS OrderYear, p.calendarmonth AS OrderMonth, count(distinct f.salesordernumber) AS [Order Count], count(distinct f.customerkey) AS [Customer Count], ROUND(SUM(COALESCE(f.SalesAmount,0)), 2) AS [Sales Amt], ROUND(SUM(SUM(COALESCE(f.SalesAmount, 0))) OVER(PARTITION BY p.calendaryear ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ), 2) AS [Sales Amt YTD], ROUND(LAG(SUM(f.SalesAmount), 11, 0 ) OVER(ORDER BY p.calendaryear, p.calendarmonth), 2) AS [Sales Amt SP PY], ROUND(LAG(SUM(f.SalesAmount), 1, 0) OVER(ORDER BY p.calendaryear, p.calendarmonth), 2) AS [Sales Amt PM], CASE WHEN COUNT(*) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3 THEN AVG(SUM(f.SalesAmount)) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND current row) ELSE null 122

95 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS END AS [Sales Amt 3 MMA], -- 3 Month Moving Average CASE WHEN count(*) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND current row) = 3 THEN SUM(SUM(f.SalesAmount)) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 2 PRECEDING AND current row) ELSE null END AS [Sales Amt 3 MMT], -- 3 month Moving Total CASE WHEN COUNT(*) OVER (ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) = 12 THEN AVG(SUM(f.SalesAmount)) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 11 PRECEDING AND current row) ELSE null END AS [Sales Amt 12 MMA], Month Moving Average CASE WHEN count(*) OVER(ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 11 PRECEDING AND current row) = 12 THEN SUM(SUM(f.SalesAmount)) OVER (ORDER BY p.calendaryear, p.calendarmonth ROWS BETWEEN 11 PRECEDING AND current row) ELSE null END AS [Sales Amt 12 MMT] month Moving Total FROM CTE_ProductPeriod AS p LEFT OUTER JOIN [dbo].[factinternetsales] AS f ON p.productkey = f.productkey AND p.datekey = f.orderdatekey GROUP BY p.calendaryear, p.calendarmonth ) SELECT [RowID], [OrderYear], [OrderMonth], [Order Count], [Customer Count], [Sales Amt], [Sales Amt SP PY], [Sales Amt PM], [Sales Amt YTD], [Sales Amt 3 MMA], [Sales Amt 3 MMT], [Sales Amt 12 MMA], [Sales Amt 12 MMT], [Sales Amt] - [Sales Amt SP PY] AS [Sales Amt SP PY Diff], ([Sales Amt] - [Sales Amt SP PY]) 123

96 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS
    / NULLIF([Sales Amt SP PY], 0) AS [Sales Amt SP PY Pct Diff],
    [Sales Amt] - [Sales Amt PM] AS [Sales Amt PM MOM Diff],
    ([Sales Amt] - [Sales Amt PM]) / NULLIF([Sales Amt PM], 0) AS [Sales Amt PM MOM Pct Diff]
FROM CTE_MonthlySummary
ORDER BY [OrderYear], [OrderMonth]
Notice how much simpler and more readable the final SELECT is. Because you encapsulated the logic behind the columns in the previous CTE, the resulting columns are also available to be used in another layer of window functions. The Moving Monthly Totals (MMT) and Moving Monthly Averages (MMA) added to the summary provide a way to address seasonality in the data by averaging the monthly totals across a range of months. Moving Totals smooth out the volatility in seasonal/noisy data. They can also be used to calculate an annual Rate-of-Change (RoC), which can be used to identify trends and measure cyclical change. You could not create a column for [Sales Amt PY YTD] until now. With the [Sales Amt YTD] column present in every row of the CTE, you can now use a window function to look back to the same period in the prior year and use it to calculate a difference between the current year to date and the prior year to date. Remember, even though you are working with a query that returns data at a month level, this technique works for date-level results as well. Add the block of column calculations in Listing 9-12 to the new base query from Listing 9-11 and explore the results.
Listing 9-12. Same Period Prior Year to Date Calculations
LAG([Sales Amt YTD], 11, 0) OVER(ORDER BY [OrderYear], [OrderMonth]) AS [Sales Amt PY YTD],
[Sales Amt YTD] - LAG([Sales Amt YTD], 11, 0)
    OVER(ORDER BY [OrderYear], [OrderMonth]) AS [Sales Amt PY YTD Diff],
([Sales Amt YTD] - LAG([Sales Amt YTD], 11, 0) OVER(ORDER BY [OrderYear], [OrderMonth]))
    / NULLIF(LAG([Sales Amt YTD], 11, 0) OVER(ORDER BY [OrderYear], [OrderMonth]), 0) AS [Sales Amt PY YTD Pct Diff]
Because of the number of columns returned in this query, a chart makes more sense to demonstrate the results; see Figure 9-10. The whole purpose of creating the difference and difference percent calculations is to be able to use them to analyze the data for trends. Plotting the data in a chart is a great way to visualize the results and present them to business users. 124

97 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Figure 9-10. Monthly trend chart showing Month over Month (MOM) difference percent and Same Period Previous Year (SP PY) difference percent
MOM difference calculations are generally not that useful as a direct measure. They show the change from one period to the next, which can be very noisy and can hide underlying trends. SP PY difference calculations show a better picture of the growth trend over the prior year, but can also be prone to seasonality. That being said, you are now going to improve upon these calculations by implementing the Rate-of-Change calculations mentioned previously and smooth out the seasonal ups and downs into long-term trends. Rate-of-Change Calculations Rate-of-Change is the percentage of change in a Moving Total or Moving Average, and it indicates whether a measure is improving over the prior year or getting worse. It is useful for determining leading and lagging indicators between divergent information sources. For example, if your business relies on petrochemical feedstock to produce its products, changes to oil prices are likely to presage a change in demand for your products and could be considered a leading indicator. Charting the Rate-of-Change for corporate sales alongside the Rate-of-Change for stock market and commodity indices allows you to determine whether your company's performance leads, lags, or is coincident with the performance of the stock market. 125

98 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Add the code in Listing 9-13, a block of column calculations, to the query from Listing 9-11 and explore the results.
Listing 9-13. Rate-of-Change Calculations
--Rate of Change [3 MMT]/([3 MMT].LAG( 12 months))
[Sales Amt 3 MMT] / LAG(NULLIF([Sales Amt 3 MMT], 0), 11, NULL)
    OVER(ORDER BY [OrderYear], [OrderMonth]) AS [3/12 RoC],
--[12 MMT]/([12 MMT].LAG( 12 months))
[Sales Amt 12 MMT] / LAG(NULLIF([Sales Amt 12 MMT], 0), 11, NULL)
    OVER(ORDER BY [OrderYear], [OrderMonth]) AS [12/12 RoC]
A RoC less than 1.0 (100%) is a downward trend, and a RoC of 1.0 or greater is a positive change. The calculation can also be amended to turn the ratio into a positive or negative number as follows:
--Rate of Change +/- ([3 MMT]/([3 MMT].LAG( 12 months)) * 100) - 100
([Sales Amt 3 MMT] / LAG(NULLIF([Sales Amt 3 MMT], 0), 11, NULL)
    OVER(ORDER BY [OrderYear], [OrderMonth]) * 100) - 100 AS [3/12 RoC2],
--([12 MMT]/([12 MMT].LAG( 12 months)) * 100) - 100
([Sales Amt 12 MMT] / LAG(NULLIF([Sales Amt 12 MMT], 0), 11, NULL)
    OVER(ORDER BY [OrderYear], [OrderMonth]) * 100) - 100 AS [12/12 RoC2]
The results of the Rate-of-Change calculations are best visualized in a chart; see Figure 9-11. In comparison to the chart for the difference percentages, you should notice a closer correlation of the [RoC] measures to the natural curve of the Total. 126
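Because a RoC below 1.0 signals a downward trend and 1.0 or above a positive one, it can be handy to turn the ratio into a label for reporting. The fragment below is a small illustration of mine, not part of Listing 9-13; it reuses the [12/12 RoC] expression shown above.
CASE
    WHEN [Sales Amt 12 MMT] / LAG(NULLIF([Sales Amt 12 MMT], 0), 11, NULL)
        OVER(ORDER BY [OrderYear], [OrderMonth]) >= 1.0 THEN 'Improving'
    WHEN [Sales Amt 12 MMT] / LAG(NULLIF([Sales Amt 12 MMT], 0), 11, NULL)
        OVER(ORDER BY [OrderYear], [OrderMonth]) < 1.0 THEN 'Declining'
    ELSE 'Insufficient history'
END AS [12/12 RoC Trend]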

99 CHAPTER 9 TIME RANGE CALCULATIONS AND TRENDS Figure 9-11. Rate-of-Change trends for 3 and 12 month ranges Summary This chapter covered a lot of ground, building on previously learned concepts to create complex calculations for time-based financial analysis. Some of these calculations were not previously easy to accomplish in T-SQL, let alone in a single query. The approach of creating and validating input calculations before tackling more complex calculations will serve you well when developing your own complex window calculations, as will the step-wise method of using CTEs to get around the nesting limitation of window functions. I hope you have realized just how versatile and powerful window functions are by reading this book. As you use these functions, you will begin to see even more ways to use them. They will change the way you approach queries, and you will become a better T-SQL developer. Happy querying! 127

100 Healthy SQL A Comprehensive Guide to Healthy SQL Server Performance Robert Pearl

101 Healthy SQL Copyright 2015 by Robert Pearl This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Development Editor: Douglas Pundick Technical Reviewer: Steve Jones Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Kim Wimpsett Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

102 Contents at a Glance About the Author...xvii About the Technical Reviewer...xix Acknowledgments...xxi Foreword...xxv Chapter 1: Introduction to Healthy SQL... 1 Chapter 2: Creating a Road Map Chapter 3: Waits and Queues Chapter 4: Much Ado About Indexes Chapter 5: Tools of the Trade: Basic Training Chapter 6: Expanding Your Tool Set Chapter 7: Creating a SQL Health Repository Chapter 8: Monitoring and Reporting Chapter 9: High Availability and Disaster Recovery Chapter 10: Surviving the Audit Index vii

103 CHAPTER 3 Waits and Queues Now that the road map has been built, you are on your way to a healthy SQL Server infrastructure. In this chapter, you will learn about one of the most effective methodologies for performance tuning and troubleshooting. The key methodology is waits and queues, which is an accurate way to determine why your SQL Server is performing slowly and to pinpoint where the bottleneck is. By analyzing the wait statistics, you can discover where SQL Server is spending most of the time waiting and focus on the most relevant performance counters. In other words, by using this process, you will quickly discover what SQL Server is waiting on. Anyone who is involved in the development or performance of SQL Server can benefit from this methodology. The purpose of this chapter is to help DBAs, developers, and other database professionals by spreading the word about waits and queues. The objective here is to lay a foundation for your own independent investigation into the more advanced aspects of this in-depth topic. Consider this chapter your performance-tuning primer for identifying and understanding SQL Server performance issues. Introducing Waits and Queues The methodology known as waits and queues is a performance-tuning and analysis goldmine that came with the release of SQL Server 2005. However, there was not a lot of fanfare around it, nor was there significant adoption of it by those who would benefit from it the most: database administrators. There was not a lot of early promotion of the concept or a complete understanding of how to use it, especially the correlation between waits and queues. I began writing and speaking about the topic some time ago and was amazed at how many DBAs had not heard about it or used it. Today, this is not the case, and many SQL Server MVPs, speakers, and presenters have effectively created an awareness of the methodology. Moreover, many vendors have built third-party monitoring software upon this methodology. Before drilling down into how to use this methodology, I'll define what a wait state is and how it works internally. Because a resource may not be immediately available for a query request, the SQL Server engine puts the request into a wait state. Therefore, waits and queues is also known as wait state analysis. The official definition according to Microsoft is as follows: Whenever a request is made within SQL Server that for one of many reasons can't be immediately satisfied, the system puts the request into a wait state. The SQL Server engine internally tracks the time spent waiting, aggregates it at the instance level, and retains it in memory. 43

104 CHAPTER 3 WAITS AND QUEUES SQL Server now collects and aggregates in-memory metadata about the resources that a query or thread is waiting on. It exposes this information through the primary DMVs with respect to wait statistics: sys.dm_os_wait_stats and sys.dm_os_waiting_tasks. When one or more queries start to wait for excessive periods of time, the response time for a query is said to be slow, and performance is poor. Slow query response times and poor performance are the results that are apparent to end users. Once you know where the contention is, you can work toward resolving the performance issue. The queue side of the waits and queues equation comes in the form of Windows OS and SQL Server performance counters, which are discussed later in this chapter. All of the SQL Server performance counters that are available can also be viewed by using the sys.dm_os_performance_counters DMV. This DMV replaced the sys.sysperfinfo view, which was deprecated in SQL Server 2005 and therefore shouldn t be used. There will be broader discussion of how to calculate and derive useful numbers from the counter values exposed by sys.dm_os_performance_counters in this chapter briefly and in the upcoming chapters. Moreover, when I show you which counters to use for monitoring performance, relative to the various wait types, I will not necessarily define each one. You can see the counter detail from within Performance Monitor by clicking the Show Description checkbox for each counter, as shown in Figure 3-1. Or, you can look up more information about each counter by searching on MSDN.com. Figure 3-1. Performance Monitor displays a detailed description for each counter 44
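As a small preview of deriving useful numbers from sys.dm_os_performance_counters, some counters are ratios that only make sense when divided by a companion base counter. The buffer cache hit ratio is the classic example; the query below is a common pattern rather than one of this book's scripts.
SELECT CAST(a.cntr_value * 100.0 / b.cntr_value AS decimal(5, 2)) AS [Buffer Cache Hit Ratio %]
FROM sys.dm_os_performance_counters AS a
JOIN sys.dm_os_performance_counters AS b
    ON a.object_name = b.object_name
WHERE a.counter_name = 'Buffer cache hit ratio'
    AND b.counter_name = 'Buffer cache hit ratio base';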

105 CHAPTER 3 WAITS AND QUEUES You can run this quick SELECT statement to view the columns and inspect the raw data:
SELECT object_name,
    counter_name,
    CASE WHEN instance_name = '' THEN NULL ELSE instance_name END AS instance_name,
    cntr_type,
    cntr_value
FROM sys.dm_os_performance_counters
Figure 3-2 shows the results from this query and shows a sample of the myriad of performance counters and objects available in the system view. Figure 3-2. A sample of the myriad of performance counters and objects via the sys.dm_os_performance_counters DMV S-l-o-w Performance The term slow performance is broad, often used by end users when they can't get their reports to run, their data to display, or their front-end application screens to render in a timely manner. Usually, the cause of this slowness is happening on the back end. As database professionals, you know there is a much deeper technical explanation. Poor performance is often because of a resource limitation. Either a SQL query is waiting on a resource to become available or the resources themselves are insufficient to complete the queries in a timely manner. Insufficient resources cause a performance bottleneck, and it is precisely the point of resource contention that the waits and queues methodology aims to identify and resolve. Once the bottleneck is identified, whether it is CPU, memory, or I/O, you can tune the query. Query optimization may make the query use the existing resources more efficiently, or you may need to add resources, such as more RAM or faster CPU processors and disks. The delay experienced between a request and an answer is called the total response time (TRT). In essence, the request is called a query, which is a request to retrieve data, while the answer is the output or data that is returned. TRT, also called the total query response time, is the time it takes from when the user executes the query to the time the user receives the output, after a query is run against SQL Server. The TRT can be measured as the overall performance of an individual transaction or query. 45

106 CHAPTER 3 WAITS AND QUEUES You can also measure the average cumulative TRT of all the server queries running. Many developers often ask whether there is a way to calculate the total query response time. In this chapter, I will discuss CPU time, wait time, and signal waits; what they represent; and how you can calculate each of these individually. However, for now, know that if you add these wait times together (as in CPU time + wait time + signal wait time), you can derive the overall total query response time, as shown in Figure 3-3. Figure 3-3. Computing the total query response time One way you can get the average total query response time is to use the sys.dm_exec_query_stats DMV, which gives you the cumulative performance statistics for cached query plans in SQL Server. Within the cached plan, the view contains one row per query statement. The information here is persisted only until a plan is removed from the cache and the corresponding rows are deleted. You will have more use for this _query_stats view later in the book. In the meantime, you can focus on the total_elapsed_time column in this view, which will give you the time for all the executions of the cached queries. So, for completed executions of the query plan, you take the total_elapsed_time value and divide it by the number of times that the plan has been executed since it was last compiled, which is from the execution_count column. Here's an example:
SELECT AVG(total_elapsed_time / execution_count)/1000 AS avg_query_response_time
--total_avg_elapsed_time (div by 1000 for ms, div by 1000000 for sec)
FROM sys.dm_exec_query_stats
The output in my case is as follows: avg_query_response_time 153 Likewise, you can use the same formula to approximate the total query response time for the individual cached queries, along with the average CPU time and statement text. This can help you identify which queries are taking a long time to return results. Here is the script:
SELECT TOP 20
    AVG(total_elapsed_time/execution_count)/1000 AS "Total Query Response Time",
    SUM(query_stats.total_worker_time) / SUM(query_stats.execution_count)/1000 AS "Avg CPU Time",
    MIN(query_stats.statement_text) AS "Statement Text" 46

107 CHAPTER 3 WAITS AND QUEUES FROM (SELECT QS.*, SUBSTRING(ST.text, (QS.statement_start_offset/2) + 1, ((CASE statement_end_offset WHEN -1 THEN DATALENGTH(ST.text) ELSE QS.statement_end_offset END - QS.statement_start_offset)/2) + 1) AS statement_text FROM sys.dm_exec_query_stats AS QS CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as ST) as query_stats WHERE statement_text IS NOT NULL GROUP BY query_stats.query_hash ORDER BY 2 DESC; -- ORDER BY 1 DESC uncomment to sort by TRT The following is the output. (I ve elided text from some of the longer statements in order to fit within the limitations of the page.) Total Query Response Time Avg CPU Time Statement Text SELECT DISTINCT Status FROM dbo.rpt_ SELECT DISTINCT dep.deploymentname AS SELECT [ReportComponent_ID],... Blame Game: Blame SQL Server When something goes wrong and performance is slow, there is often a visceral instinct to attribute immediate blame to SQL Server. The database is often the usual suspect, so SQL Server must be the cause. Directly translated, this puts you, the DBA, in the line of fire. Thus, the quicker you can identify and resolve the performance issue, the more confident the users will be in your technical, troubleshooting, database administration, and tuning skills. You must also gather evidence and document your findings, explaining and identifying the true cause. You might ask why it is that everyone blames SQL Server and points their finger at the DBA. One common reason is the database is always at the scene of the crime; another reason is that people target SQL Server because they don t understand much about it. It is easier to blame that which you don t know. SQL Server used to be, and still is in some respects, a black box to nondatabase professionals. Oftentimes, SQL Server will record an error message that is intended to lead you in the right direction as to where the bottleneck is located. The nondatabase professional will use this as evidence of database culpability, and end users will pile on. I will highlight some of these blame game antics throughout the chapter. Caution Remember, the database is the usual suspect, so the DBA will be to blame. Back to Waiting If I haven t convinced you yet why you want to use waits and queues in your performance troubleshooting, tuning, and monitoring, let s consider the following scenarios. An end user has a 4 p.m. dead line to generate some important reports from SQL Server, and the server has crawled to a halt. He calls to complain about the slowness in performance. Not very detailed, is he? Take another example: it s Friday afternoon, and suddenly the developers find that the application is running slow and it s time to deploy code into production. 47

108 CHAPTER 3 WAITS AND QUEUES The application is poorly designed; there are no indexes, no primary keys, and no standards, so there are many performance problems. The project managers get wind of this and, fearing missing their project milestones, escalate it to your higher-ups. Your managers then tell you to immediately drop everything and resolve these performance problems as a top priority. These folks likely don t understand the time it might take to identify the cause of the problem, let alone to resolve it, and most times they don t care. They know only one thing: they want it fixed now, and they know the DBA is the person to do it. Traditionally, to analyze and discover the root cause of the performance bottleneck, you needed to employ various methods and tools such as Performance Monitor, profiler traces, and network sniffers. Since there were some limitations in versions of SQL Server 2000 and earlier, many companies turned to expensive third-party software solutions. Another challenge with identifying and isolating a performance issue is knowing the right counters to set up. If at the outset you are uncertain whether the bottleneck is a memory, CPU, or I/O issue, it makes troubleshooting more difficult. Once you collect all the performance data, you still need to parse it. When setting up performance counters, you need to cast a wide net, and such methods are time-consuming and often frustrating. Moreover, what do tell your users, colleagues, clients, and managers when they call complaining about slow performance? Do you let them know that you need to set up about 20 or so Perfmon counters, collect the data over a period of a few days, trend it against peak and nonpeak hours, and get back to them real soon? The next time you have half your company and irate users approaching your desk like zombies from the Night of the Living Dead complaining about some vague or urgent performance issue, remember that wait stats can quickly identify the cause of the bottleneck and what SQL Server is waiting on. Therefore, the quickest way to get to the root cause is to look at the wait stats first. By employing the use of waits and queues, you can avoid troubleshooting the wrong bottleneck and sidestep a wild-goose chase. Once you identify what a query or session is waiting on, you have successfully uncovered the likely bottleneck. To dig deeper, the next step you will want to take is to set up the relevant performance counters related to the resource wait. Using the waits and queues methodology can offer the best opportunities to improve performance and provide the biggest return on time invested in performance tuning. Microsoft has proclaimed that this methodology is the biggest bang for the buck. And that is why you should use it and why it is a major component to healthy SQL. Note The waits and queues methodology offers the biggest bang for the buck and has a significant return on the performance-tuning time investment, according to the Microsoft engineering team. So, let s continue and correlate a query request with the wait queue. When a query is run and SQL Server cannot complete the request right away because either the CPU or some other resource is being used, your query must wait for the resource to become available. In the meantime, your query gets placed in the suspended or resource queue or on the waiter list and assigned a wait type that is reflective of the resource the query is waiting for. 
By examining the wait stats DMVs, you can see exactly what each session is waiting on and track how such waits are affecting overall performance. Wait Type Categories SQL Server tracks more than 400 wait types; I will discuss only a handful of them in this chapter, focusing on potential resource contention and bottlenecks, such as CPU pressure, I/O pressure, memory pressure, parallelism, and blocking. These cover the most typical and commonly occurring waits. Before digging into the different wait types, I'll discuss the four major categories of wait types so that you understand what you will be looking at before running any wait stats queries. Since there are so many, two great resources you should refer to are SQL Server 2005 Waits and Queues, by Microsoft, and The SQL Server Wait Type Repository, 48

109 CHAPTER 3 WAITS AND QUEUES by the CSS SQL Server engineers. SQL Server 2005 Waits and Queues was published originally for SQL Server 2005 but is an invaluable resource that I have based my waits and queues presentations on, as well as information in this chapter. It is very much valid for all subsequent versions and can be downloaded from The Microsoft CSS SQL Server engineers maintain a wait type repository on their official blog. Not only can you find a description for every wait type, but it describes the type of wait; the area it affects, such as IO, memory, network, and so on; and possible actions to take. You can access the wait repository here: The various waits can be broadly categorized as follows: Resource waits occur when a worker thread requests access to a resource that is not available because it is being used by another worker and is not yet available. These are the most common types of waits. These wait types typically surface as locks, latches, network, and I/O wait states. Signal waits are the time that worker threads spend waiting for the CPU to become available to run. The worker threads are said to be in the runnable queue, where they wait until it s their turn. Since there is no specific wait type that indicates CPU pressure, you must measure the signal wait time. Signal wait time is the difference between the time the waiting thread was signaled and when it started running. Queue waits occur when a worker is idle, waiting for work to be assigned. This wait type is most typically seen with system background tasks, such as the deadlock monitor and ghost record cleanup tasks. External waits occur when a SQL Server worker is waiting for an external event, such as an extended stored procedure call or a linked server query, to finish. Table 3-1 sums up each of these categories. Table 3-1. Details for Each Wait Category Category Resource waits Signal waits Queue waits External waits Details Locks, latches, memory, network, I/O Time spent waiting for CPU Idle workers, background tasks Extended procs (XPs), linked server queries Is Waiting a Problem? The key misconception by database professionals, and other folks looking at top wait stats, is that the wait type at the top of the list of results is the problem. So, let s compare waits to our everyday lives to make this easier to understand. When you go to the supermarket and complete your shopping, you must get in line and wait your turn until the cashier calls you. Waiting in line is certainly expected, and you wouldn t consider a fast-moving line a problem. Only when the little old lady, who I will call Granny, begins to argue with the cashier about the soap being 10 cents cheaper in last week s circular does the waiting become longer, as does the line. Now you have a performance problem. SQL Server has a line too, and its queries (the customers) must get inline and wait to be called. 49

110 CHAPTER 3 WAITS AND QUEUES Although the length of a queue may indicate a busy OLTP system, the line itself (the runnable queue) is not necessarily a problem. Therefore, the queue length will not cause a performance issue if the queue is getting processed in a speedy manner. If this is the case, there should be no high wait times, which is the nature of a healthy SQL Server. However, when the line becomes too long and the time it takes to get the customers to the cashier is excessive, this indicates a different type of problem. In other words, if the queue is long and the thread cannot get to the CPU to be processed, it indicates there is CPU pressure. CPU pressure is measured a little differently than all other resource waits. With the SQL Server scheduling system, keep in mind that a queue of queries waiting on resources to become available will always be present. If wait times are high and one query then blocks the other queries from executing, you must discover the cause of the bottleneck, whether it is I/O, memory, disk, network, or CPU. I will focus on the central concepts of query queues and provide an overview of the engine internals and how they apply to an OLTP system. You will learn how the process or query queue works so that you understand that processes waiting for available resources are normal when managing threads and query execution. All threads basically must wait in line to get executed, and only when these wait times become high does it become a concern. For example, just because you see that the PAGEIOLATCH_X wait type accounts for 85 percent of your waits does not immediately imply a performance issue. What users need to be concerned with are recurring waits where the total wait time is greater than some number, let s say greater than 30 seconds. People will run scripts and see that a particular resource wait is their top wait on the system, but all in all they have a very low wait time, as in the previous example. They automatically assume it is a problem and attempt to troubleshoot the issue. However, the top waits are only part of the story. What matters are top waits having excessive wait times. Observing Wait Statistics One of the most significant system views discussed in this chapter is sys.dm_os_wait_stats. As the primary DMV for collecting wait statistics, you will use this view that shows the time for waits that have already completed. Within this DMV, you will see the wait type, the number of waits for each type, the total wait time, the maximum wait time, and the signal wait time. The columns and description of the sys.dm_os_wait_stats DMV are as follows: wait_type : The name of the wait type waiting_tasks_count : The number of waits on this wait type wait_time_ms : The total wait time for this wait type in milliseconds (includes signal_wait_time ) max_wait_time_ms : The maximum wait time on this wait type for a worker signal_wait_time_ms : The difference between the time the waiting thread was signaled and when it started running (time in runnable queue!) These metrics are shown at the instance level and are aggregated across all sessions since SQL Server was last restarted or since the last time that the wait statistics were cleared. This means that the metrics collected are not persisted and are reset each time SQL Server is restarted. You also need to know that the waits and related measures in the table are cumulative and won t tell you what the top waits are at this moment. 
If you want to store this data beyond a restart, you must create a historical user-tracking table. You can easily dump this information into your tracking table for later use and historical trend analysis. Therefore, to derive any useful information for any point-in-time analysis, you need to take at least two snapshots of the data. The first snapshot is your baseline, and the second snapshot is your comparative delta. 50
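To make the baseline-and-delta idea concrete, here is a minimal sketch of my own (not a script from the chapter). It assumes a hypothetical snapshot table named Wait_Stats_Snapshot that you populate once as the baseline and once again at the end of the interval you want to analyze; subtracting the baseline from the second capture gives the wait activity for just that interval.

-- Hypothetical snapshot table (illustration only)
CREATE TABLE Wait_Stats_Snapshot
(
    snapshot_id         int          NOT NULL,
    capture_time        datetime     NOT NULL,
    wait_type           nvarchar(60) NOT NULL,
    waiting_tasks_count bigint       NOT NULL,
    wait_time_ms        bigint       NOT NULL,
    signal_wait_time_ms bigint       NOT NULL
);

-- Snapshot 1 is the baseline; run the same INSERT later with snapshot_id = 2
INSERT INTO Wait_Stats_Snapshot
SELECT 1, GETDATE(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;

-- The delta between the two snapshots is the wait activity for the interval
SELECT s2.wait_type,
       s2.waiting_tasks_count - s1.waiting_tasks_count AS delta_waiting_tasks,
       s2.wait_time_ms        - s1.wait_time_ms        AS delta_wait_time_ms,
       s2.signal_wait_time_ms - s1.signal_wait_time_ms AS delta_signal_wait_ms
FROM Wait_Stats_Snapshot s1
JOIN Wait_Stats_Snapshot s2
     ON s1.wait_type = s2.wait_type
WHERE s1.snapshot_id = 1
  AND s2.snapshot_id = 2
ORDER BY delta_wait_time_ms DESC;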

When querying or building historical data from the sys.dm_os_wait_stats view, there are a number of system and background process wait types that are safe to ignore. This list changes and grows all the time, but you definitely want to filter as many of these harmless waits as possible out of your analysis. You may even want to store them in a table so you can dynamically keep your wait stats analysis queries up-to-date. Here is a quick list of the typical system and background process wait types that can be excluded:

LAZYWRITER_SLEEP
BROKER_TO_FLUSH
RESOURCE_QUEUE
BROKER_TASK_STOP
SLEEP_TASK
CLR_MANUAL_EVENT
SLEEP_SYSTEMTASK
CLR_AUTO_EVENT
SQLTRACE_BUFFER_FLUSH
DISPATCHER_QUEUE_SEMAPHORE
WAITFOR
FT_IFTS_SCHEDULER_IDLE_WAIT
LOGMGR_QUEUE
XE_DISPATCHER_WAIT
CHECKPOINT_QUEUE
XE_DISPATCHER_JOIN
REQUEST_FOR_DEADLOCK_SEARCH
BROKER_EVENTHANDLER
XE_TIMER_EVENT
TRACEWRITE
FT_IFTSHC_MUTEX
BROKER_TRANSMITTER
SQLTRACE_INCREMENTAL_FLUSH_SLEEP
SQLTRACE_WAIT_ENTRIES
BROKER_RECEIVE_WAITFOR
SLEEP_BPOOL_FLUSH
ONDEMAND_TASK_QUEUE
SQLTRACE_LOCK

DBMIRROR_EVENTS_QUEUE
DIRTY_PAGE_POLL
HADR_FILESTREAM_IOMGR_IOCOMPLETION

Using sys.dm_os_wait_stats, you can isolate the top waits for a server instance by percentage. You can use a popular DMV query, written by SQL MVP Glenn Berry, that gets the top waits on the server by percentage and converts the wait time to seconds. You will see that you can set the percentage threshold and eliminate nonimportant wait types.

WITH Waits AS
(SELECT wait_type,
        wait_time_ms / 1000. AS wait_time_s,
        100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
 FROM sys.dm_os_wait_stats
 WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK',
       'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE','CHECKPOINT_QUEUE',
       'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP',
       'CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE',
       'FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT','XE_DISPATCHER_JOIN',
       'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'))
SELECT W1.wait_type,
       CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
       CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
       CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 99 -- percentage threshold
OPTION (RECOMPILE)

The output returns the wait_type, wait_time_s, pct, and running_pct columns. In this sample, the top waits, in descending order of wait time, were CXPACKET, ASYNC_NETWORK_IO, LCK_M_IS, LCK_M_SCH_M, LCK_M_IX, LATCH_EX, OLEDB, BACKUPIO, LCK_M_SCH_S, SOS_SCHEDULER_YIELD, LCK_M_S, WRITELOG, BACKUPBUFFER, PAGELATCH_UP, ASYNC_IO_COMPLETION, THREADPOOL, and CXROWSET_SYNC.

When establishing a baseline of wait stat data for performance monitoring, you may want to first manually clear the wait stat data by running the following command:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)

You should run this only in your test and development environments. Before you do this, you may want to save the current data in an archive table first. You can do this by running the following SELECT INTO statement. You can add a column called ArchiveDate that appends the current datetime to the data rows, indicating the time each row was captured. Here I will call the new table Wait_Stats_History:

SELECT *, GETDATE() AS ArchiveDate
INTO Wait_Stats_History
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK',
      'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE','CHECKPOINT_QUEUE',
      'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP',
      'CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE',
      'FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT','XE_DISPATCHER_JOIN',
      'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')

You can check the results by querying the newly created Wait_Stats_History table using the following query:

SELECT * FROM Wait_Stats_History

The results are the rows of the sys.dm_os_wait_stats view with the ArchiveDate column appended. This shows a simple example of creating an archive table to persist historical wait stats data; the data from the sys.dm_os_wait_stats view is stamped with the current datetime. In this sample, the output returned the wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms, and ArchiveDate columns, with rows for MISCELLANEOUS, the lock wait types (LCK_M_SCH_S, LCK_M_SCH_M, LCK_M_S, LCK_M_U, LCK_M_X, LCK_M_IS, LCK_M_IU, LCK_M_IX, LCK_M_SIU, LCK_M_SIX, LCK_M_UIX, LCK_M_BU, LCK_M_RS_S, LCK_M_RS_U, LCK_M_RIn_NL, LCK_M_RIn_S, LCK_M_RIn_U, LCK_M_RIn_X, LCK_M_RX_S, LCK_M_RX_U, LCK_M_RX_X), LATCH_NL, and so on, each stamped with the same capture datetime.

The overall wait time reflects the time that elapses when a thread leaves the RUNNING state, goes to the SUSPENDED state, and returns to the RUNNING state again. Therefore, you can derive the resource wait time by subtracting the signal wait time from the overall wait time. You can use the simple query shown next to get the resource, signal, and total wait times. You would also want to order by total wait time descending to force the highest wait times to the top of the results. You can calculate all the wait times by querying the sys.dm_os_wait_stats DMV as follows:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms AS total_wait_time_ms,
       signal_wait_time_ms,
       (wait_time_ms - signal_wait_time_ms) AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY total_wait_time_ms DESC

The output is the raw data from the sys.dm_os_wait_stats DMV, ordered by total wait time in milliseconds with the highest total wait times at the top, along with the waiting task count and the signal and resource wait times. In this sample, the top rows were BROKER_TASK_STOP, SQLTRACE_INCREMENTAL_FLUSH_SLEEP, HADR_FILESTREAM_IOMGR_IOCOMPLETION, and LOGMGR_QUEUE.

A Multithreaded World

Multithreading is the ability of a process to manage its use by more than one user at a time. The smallest unit that can be managed independently by a scheduler is a thread. All threads follow a first-in-first-out (FIFO) model. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources. More than one thread can be executed simultaneously across multiple processors. This is an example of multithreading.

SQL Server also uses what's called cooperative scheduling (or nonpreemptive scheduling), where a thread voluntarily yields the processor to another thread when it is waiting for a system resource or non-CPU event. These voluntary yields often show up as SOS_SCHEDULER_YIELD waits, which require further investigation to determine whether there is CPU pressure. When the current thread does not voluntarily yield, it must still give up the processor after a certain amount of time; at that point it is said to have reached the end of its execution quantum. The quantum is the amount of time a thread is scheduled to run.

The SQLOS layer implements what is called thread scheduling. An individual scheduler maps to a single logical CPU, so the number of schedulers is equivalent to the number of logical CPUs on the system. If there are 8 OS schedulers, there are 8 logical CPUs; sixteen OS schedulers equal 16 logical CPUs, and so forth. For each logical CPU, there is one scheduler. It should be noted that scheduler mapping is not to physical CPU cores or sockets but to logical CPUs. If you want to get a count of the existing number of SQLOS schedulers on the system, you can query the sys.dm_os_schedulers view as follows:

SELECT COUNT(*) AS NO_OF_OS_SCHEDULERS
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'

NO_OF_OS_SCHEDULERS
32

You can take a look at the sys.dm_os_schedulers DMV, which shows all the activity associated with each scheduler, such as active tasks, number of workers, scheduler status, context switching, and runnable tasks. This DMV is useful in troubleshooting issues specifically related to CPU pressure.
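As a quick illustration of what sys.dm_os_schedulers exposes per scheduler, here is a simple sketch of my own (not a script from the chapter); all of the columns shown are standard columns of the view.

SELECT scheduler_id,
       status,                    -- e.g., VISIBLE ONLINE
       current_tasks_count,       -- tasks currently assigned to this scheduler
       runnable_tasks_count,      -- tasks waiting in the runnable queue for CPU
       current_workers_count,     -- worker threads associated with this scheduler
       active_workers_count,
       context_switches_count,    -- cumulative context switches on this scheduler
       work_queue_count,
       pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;         -- excludes hidden/internal schedulers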

115 CHAPTER 3 WAITS AND QUEUES A scheduler is mapped to an individual physical or logical CPU, which manages the work done by worker threads. Once the SQLOS creates the schedulers, the total number of workers is divided among the schedulers. Context switching occurs when the OS or the application is forced to change the executing thread on one processor to be switched out of the CPU so another process can run. Since the scheduler is doing the work, this keeps context switching to a minimum. Context switching is a natural occurrence in any CPU system, but when this happens excessively, you have a potential performance problem. A scenario where excessive context switching can cause high CPU and I/O contention is a potentially expensive operation and can have a serious impact on performance. Only one scheduler is created for each logical processor to minimize context switching. The Execution Model How does SQL Server manage the execution of user requests? To answer this question, you can examine what s known as the SQL Server SQLOS execution model. This model will help you understand how SQL Server uses schedulers to manage these requests. Let s go back to the example of the old lady in the supermarket holding up the line to see how the SQL Server execution model works (Figure 3-4 ). I will focus here on what happens on one OS scheduler for simplicity and easy visualization. First, let s set up the scenario. Compare the SQL execution model to the supermarket checkout line. You have the supermarket, the customers, the line, and the cashier. During the course of the transaction, the customer will be in three states (not U.S. states). The customers are running, runnable, or suspended. The customer is equivalent to a thread. The line is the runnable queue, the cashier is the CPU, and the price is the resource. The person or thread that is at the front of the line about to be checked out is said to be running, or currently using the CPU. A thread using the CPU is running until it has to wait for a resource that is not available. Each thread is assigned a SPID or session_id so SQL Server can keep track of them. Figure 3-4. A graphic depiction of the execution model running process So, Granny, despite her age, is in a running state, and the rest of the customers in line are waiting to be checked out (waiting for the cashier to become available), like the threads in SQL Server are waiting for the CPU to become available. However, because Granny is arguing that the soap should be on sale, the cashier must ask for a price-check. Fortunately for the other customers, Granny must step aside and wait for the price check or, in essence, for the price to become available. 55

116 CHAPTER 3 WAITS AND QUEUES She is now in a suspended state, and the next customer moves to the front of the line to checkout. The Granny thread gets assigned a wait type and is in the suspended (or resource) queue, waiting for the resource (the price) to become available. All the threads (or customers) like Granny that stay in the suspended queue go on the waiter list. You can measure the time that Granny spends in this queue by measuring the resource wait time. Meanwhile, the next customer, now in the running state getting checked out, continues with the transaction until it is complete. Like the thread running on the CPU, if the resource is available to proceed, it continues until the execution of the query is completed. Figure 3-5 shows what the suspended state would look like in the execution model. Figure 3-5. The execution model: the suspended queue If for any reason the thread needs to wait again, it goes back on the waiter list into the SUSPENDED queue, and the process continues in a circular and repeatable manner. Let s not forget about Granny, who s still in the SUSPENDED state, waiting for the price check. Once she gets her price check confirmed, she can proceed again toward getting checked out. Granny is signaled that the price or the resource is now available. However, the cashier CPU is busy now checking out other customers, so she must go to the back of the line. The line getting longer at the supermarket while Granny is arguing over 10 cents savings is stressing out the cashier, like the CPU. The thread, now back in the RUNNABLE queue, is waiting on the CPU, like Granny is waiting again on the cashier. Again, as discussed, the waiting time spent in the runnable queue is called signal wait time. In terms of resource versus signal waits, the price check Granny was waiting for is a resource wait, and waiting in line for the cashier is the signal wait. Figure 3-6 shows an example of the RUNNABLE queue. It shows the time spent waiting in the RUNNABLE queue(the transition from the suspended to runnable queue) or waiting for the CPU to become available; this time spent waiting is known as the signal wait time. You will learn more about signal wait time in the CPU Pressure section. 56

117 CHAPTER 3 WAITS AND QUEUES Figure 3-6. A visual example of signal wait time or transition from the suspended queue to the runnable queue As expected, Granny is pretty mad she has to wait in line again, and since she was already waiting, the sympathetic customers might, like the other threads, yield to Granny as a higher-priority customer. Higher-priority tasks will run before lower-priority tasks when there are yields or preemption. As mentioned earlier, these yields will show up in the wait stat view as SOS_SCHEDULER_YIELD. The threads alternating between the three states, RUNNABLE, RUNNING, and SUSPENDED, until the query is completed is known as the query lifecycle. By querying the DMV sys.dm_exec_requests, you can see the current status of a thread. Figure 3-7 gives a visual demonstration of the query life cycle. Figure 3-7. Query life cycle and what happens to a query that starts out executing on the CPU 57
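As just mentioned, sys.dm_exec_requests reports the state of each executing request. The following is a minimal sketch of my own (not a script from the chapter) that shows the lifecycle states for active sessions; the status column reports running, runnable, or suspended, and wait_type is populated while a request sits in the suspended queue.

SELECT session_id,
       status,            -- RUNNING, RUNNABLE, SUSPENDED (or sleeping/background)
       wait_type,         -- populated while the request is suspended
       wait_time,         -- current wait time in milliseconds
       last_wait_type,
       cpu_time,
       scheduler_id
FROM sys.dm_exec_requests
WHERE session_id > 50;    -- rough filter for user sessions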

CPU Pressure

On a multiprocessor system, you can see what queries are running on a particular CPU. One way to see what activity is currently assigned to a particular scheduler or CPU, and to check for any CPU pressure, is to execute the query I present later in the "Anatomy of a CPU Metadata Query" section. You can correlate scheduler data with queries currently running against the system. This query captures the current status of the session and the number of context switches, as well as the pending disk I/O count, the scheduler (CPU), the database, the command, and the actual statement being executed. It uses joins with some other DMVs and returns key information such as the SQL query that is being executed on an individual CPU. If a particular CPU is being pegged, you can see which queries are currently executing against it. CPU pressure may be apparent when the following conditions are true:

A high number of SOS_SCHEDULER_YIELD waits
A high percentage of signal waits over resource waits
Runnable task counts greater than zero

Runnable Task Count

You can also get an overall count of the RUNNING (that is, current) and RUNNABLE tasks by querying sys.dm_os_schedulers. If SQL Server tasks are waiting on the CPU (in other words, if the runnable_tasks_count value is greater than 0), this can be another indicator of an increase in CPU pressure. The higher the runnable task count, the busier the CPU is. Here is a query that can get you the average task counts against the SQL Server schedulers:

SELECT AVG(current_tasks_count) AS [Avg Current Task],
       AVG(runnable_tasks_count) AS [Avg Wait Task]
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
  AND status = 'VISIBLE ONLINE'

The query results show the average count for the current (RUNNING) tasks, as well as the average waiting (RUNNABLE) tasks. Here is an example of the result:

Avg Current Task
2

The value here is an average count of the current running tasks across all the schedulers, or total CPU processors. The Avg Wait Task column shows you the average number of tasks waiting on the CPU that are in the RUNNABLE state. The result is generally zero, and it is important to pay attention to any value greater than zero. If the Avg Wait Task column shows a higher number, you are looking at potential CPU pressure on SQL Server. You can view a raw current versus waiting task count for each individual scheduler on the system with the following query:

SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
  AND runnable_tasks_count > 0

In this query, you are asking only for counts where runnable_tasks_count is greater than zero. You can use a query similar to this for monitoring and tracking CPU pressure.

Signal Waits

Although excessive thread yields on the system can indicate potential CPU contention, there is no specific wait type to indicate CPU pressure. Therefore, by looking at the signal wait time, stored in the signal_wait_time_ms column of the wait stat DMV sys.dm_os_wait_stats, you can compare the two major categories of waits. Taking an overall snapshot of whether SQL Server is waiting mostly on resources to become available (resource waits) or waiting for the CPU to become available (signal waits) can reveal whether you have any serious CPU pressure. You will measure the resource waits (all the threads that are put into a suspended or resource queue) and the signal waits (the time spent waiting for the threads to run on the CPU). So, let's use the following query to compare resource waits to signal waits:

SELECT ResourceWaitTimeMs = SUM(wait_time_ms - signal_wait_time_ms),
       '%resource waits' = CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2)),
       SignalWaitTimeMs = SUM(signal_wait_time_ms),
       '%signal waits' = CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2))
FROM sys.dm_os_wait_stats

The query output returns the ResourceWaitTimeMs, %resource waits, SignalWaitTimeMs, and %signal waits columns, giving the percentage of signal waits as compared to resource waits. What you see are cumulative wait times since SQL Server was last restarted or the statistics were cleared, expressed as a percentage of the overall wait time.

Remember that I said SQL Server waiting on resources is the healthy nature of a busy OLTP system? The query results show resource waits versus signal waits as percentages of the overall wait time. When the percentage of resource waits is significantly higher than the percentage of signal waits, that is exactly what you want to see (more resource waits than signal waits). Higher resource waits should not be misinterpreted as a performance issue. However, if the CPU itself is the bottleneck, the signal wait percentage will show up much higher. Relative to the overall wait time on the server, you want signal waits to be as low as possible. As a rough guideline, signal waits greater than about 30 percent of total wait time may be actionable data leading you to consider CPU pressure, depending on the workload. For example, you may need faster or more CPUs to keep up with the workload requests, since too slow or too few cores can be one reason for a stressed CPU to cause a performance bottleneck.

Anatomy of a CPU Metadata Query

Here is a query that will identify lots of interesting information you can use to pinpoint the offending query that is causing CPU pressure:

SELECT t.task_state,
       r.session_id,
       s.context_switches_count,
       s.pending_disk_io_count,
       s.scheduler_id AS CPU_ID,
       s.status AS Scheduler_Status,
       db_name(r.database_id) AS Database_Name,

       r.command,
       px.text
FROM sys.dm_os_schedulers AS s
INNER JOIN sys.dm_os_tasks t ON s.active_worker_address = t.worker_address
INNER JOIN sys.dm_exec_requests r ON t.task_address = r.task_address
CROSS APPLY sys.dm_exec_sql_text(r.plan_handle) AS px
WHERE r.session_id <> @@SPID      -- filters out this session
-- AND t.task_state = 'RUNNABLE'  -- to see only the sessions that are waiting on CPU, uncomment this line

Let's break down this query and discuss the pieces of information available. You can see that it is possible to identify the point of contention down to the exact CPU or scheduler and to see which query is causing the process to be CPU bound. If you query the sys.dm_os_tasks DMV, all sorts of interesting information is displayed about the tasks that are currently running. This DMV returns one row for each task that is active in the instance of SQL Server. The sample output of the query returns the task_state, session_id, context_switches_count, pending_disk_io_count, CPU_ID, Scheduler_Status, Database_Name, command, and text columns; in this sample, the row showed a task in the RUNNING state on a VISIBLE ONLINE scheduler, executing against the master database, with the actual query statement text in the text column.

In addition, you have the scheduler_id value, which is associated with a particular session_id (or SPID). The scheduler_id value is displayed as CPU_ID; as discussed earlier, each scheduler maps to an individual CPU. For all intents and purposes, the scheduler and CPU are equivalent terms and are used interchangeably. One piece of information is pending_io_count, which is the count of physical I/Os being performed by a particular task. Another is the actual number of context switches that occur while performing an individual task. You learned about context switching earlier in the chapter. Another column that is relevant here is task_address. The task address is the memory address allocated to the task that is associated with this request.

Because each DMV has some information you need that the others don't have, you will need to join them together, as well as use the APPLY operator, to get everything you want in the results. To get the actual query that is executing on the CPU, you need the plan_handle value, which is not available in sys.dm_os_tasks. Therefore, you will need to join the sys.dm_os_tasks view to the sys.dm_exec_requests view. Both of these have the task_address information and can be joined on this column. sys.dm_exec_requests is another handy DMV, even by itself, that returns information about each request that is executing within SQL Server. You will find a lot of useful statistics and information here, such as wait_time, total_elapsed_time, cpu_time, wait_type, status, and so on. With sys.dm_exec_requests, you can also find out whether there is any blocking or any open transactions; you can do this by querying the blocking_session_id and open_transaction_count columns in this view. Please note that, for the purpose of this discussion, these columns are not included in the query output, but they are worth mentioning in general. One piece of information that is also not in sys.dm_os_tasks, but is available in sys.dm_exec_requests, is database_id. Thus, you can find out what database the request is executing against. There is a lot more useful data in this view; see the TechNet documentation for the sys.dm_exec_requests DMV for more information.

You also use the task_state column from the sys.dm_os_tasks view, which shows you whether the task is PENDING, RUNNABLE, RUNNING, or SUSPENDED. Assuming the task has a worker thread assigned (no longer PENDING), when the worker thread is waiting to run on the CPU, you say that it is RUNNABLE.
The thread is therefore in the RUNNABLE queue. The longer it must wait to run on the CPU, the more pressure is on the CPU. As discussed earlier in the chapter, the signal wait time consists of the threads spending a lot of time waiting in the runnable queue. 60

121 CHAPTER 3 WAITS AND QUEUES Finally, to complete the query, you need to use the APPLY operator to derive the text of the SQL batch that is identified by the specified SQL handle using the sys.dm_exec_sql_text function. This invaluable DMF replaces the system function fn_get_sql and pretty much makes the old DBCC INPUTBUFFER command obsolete. If you want to see only the sessions that are waiting for CPU, you can filter the query by using where task_state='runnable'. A number of factors can affect CPU utilization adversely such as compilation and recompilation of SQL statements, missing indexes, excessive joins, unneeded sorts, order by, and group by options in your queries, multithreaded operations, disk bottlenecks, and memory bottlenecks, among others. Any of these can force the query to utilize more CPU. As you can see now, other bottlenecks can lead to CPU bottlenecks. You can either tweak your code or upgrade your hardware. This goes for any of the resource bottlenecks. More often than not, rewriting the code more efficiently and employing proper indexing strategies can do great wonders and is in the immediate control of the DBA. Throwing more hardware at a performance problem can help only so much. Therefore, the key to healthy CPU utilization, as well as healthy SQL, is ensuring that the CPU is spending its time processing queries efficiently and not wasting its processing power on poorly optimized code or inefficient hardware. Ideally, the aforementioned query will be useful in your performance-tuning efforts, especially when you need to pinpoint which CPU is causing the most contention and what query is causing it. Note Throwing more hardware at a performance problem can help only so much. So, you have observed various CPU-related queries and conditions that can help you identify CPU pressure. To further investigate and gather conclusive data, you will now want to set up some specific Performance Monitor counters to support your findings. Some of the traditional Performance Monitor counters specific to CPU performance that you would want to set up are as follows: Processor: % Processor Time System: Processor Queue Length System: Context Switches/sec You may also want to take a look at the SQL Compilations, Re-compilations, and Batch Requests per second counters. The reason to consider these has to do with plan reuse and how it affects CPU performance. When executing a stored procedure or a query, if the plan does not yet exist in memory, SQL Server must perform a compilation before executing it. The compilation step creates the execution plan, which is then placed in the procedure cache for use and potential reuse. Once a plan is compiled and stored in memory, or cache, SQL Server will try to use the existing plan, which will reduce the CPU overhead in compiling new plans. Excessive compilations and recompilations (essentially the same process) cause performance degradation. The fewer the compilations per second, the better, but only through your own performance baselines can you determine what an acceptable number is. When a new plan is compiled, you can observe that the SQLCompilations/sec counter increases, and when an existing plan gets recompiled (for a number of reasons), you can see that the Re-compilations/sec counter is higher. Based on these respective counters, you look to see whether a large number of queries need to be compiled or recompiled. When you are performance tuning a query or stored procedure, the goal is to reduce the number of compilations. 
Batch Requests per second shows the number of SQL statements that are being executed per second. A batch is a group of SQL statements. Again, I can t say what an ideal number is here, but what you want to see is a high ratio of batches to compiles. Therefore, you can compare the Batch Requests/sec counter to the SQLCompilations/sec and Re-compilations/sec counters. In other words, you want to achieve the most batch requests per second while utilizing the least amount of resources. While these performance counters don t tell you much by themselves, the ratio of the number of compilations to the number of batch requests gives you a better story on plan reuse and, in this case, CPU load. 61

Therefore, to get an overall understanding of how your server is performing, you need to correlate these with other metrics. So, you usually look at these three counters together and can calculate from them the percentage of plan reuse. The higher the plan reuse, the less CPU is needed to execute a query or stored procedure. Here are the counters referred to:

SQL Statistics\Batch Requests/sec
SQL Statistics\SQL Compilations/sec
SQL Statistics\SQL Re-Compilations/sec

You can also see an example of the Performance Monitor counters in Figure 3-8.

Figure 3-8. The recommended Performance Monitor counters for CPU utilization

You can calculate the overall percentage of plan reuse as follows:

plan_reuse = (BatchRequestsPerSecond - SQLCompilationsPerSecond) / BatchRequestsPerSecond

Here is the script I wrote to derive this data using the sys.dm_os_performance_counters DMV:

SELECT t1.cntr_value AS [Batch Requests/sec],
       t2.cntr_value AS [SQL Compilations/sec],
       plan_reuse_percentage = CONVERT(DECIMAL(15,2),
           (t1.cntr_value*1.0 - t2.cntr_value*1.0) / t1.cntr_value * 100)

FROM master.sys.dm_os_performance_counters t1,
     master.sys.dm_os_performance_counters t2
WHERE t1.counter_name = 'Batch Requests/sec'
  AND t2.counter_name = 'SQL Compilations/sec'

The results return the Batch Requests/sec, SQL Compilations/sec, and plan_reuse_percentage columns. A high percentage indicates that plan reuse is high, which reduces the amount of server resources required to run the same queries over and over again. Therefore, a high percentage for plan reuse is ideal. This type of calculation is somewhat discounted as an accurate indicator of actual plan reuse, though. Later in the book, you will see more accurate ways to measure and analyze plan cache usage, using DMVs.

While sys.dm_os_wait_stats provides you with aggregated historical data, it does not show current sessions that are waiting. To know what's happening right now on your server, you can use the following DMVs and join them together to get a holistic view:

sys.dm_os_waiting_tasks
sys.dm_exec_requests

By using the sys.dm_os_waiting_tasks DMV, you can know what SQL Server is waiting on right now, as well as the current suspended sessions that are waiting in the SUSPENDED queue. In fact, the actual waiter list of all waiting sessions, with the reasons for the waits, is revealed, as well as any blocking and all the sessions involved. The waiting tasks DMV is helpful in that it filters out all other nonwaiting sessions. For a full reference on this DMV, please see the MSDN library documentation.

The sys.dm_exec_requests DMV shows the current status of a thread and all its associated metadata. All the current activity (for example, active sessions) on the server can be viewed by using sys.dm_exec_requests. Each SQL Server session has a unique session_id value, and you can filter out the system queries from the user queries by specifying session_id > 50 or, more accurately, is_user_process = 1. This DMV returns a lot of information, and you can see the entire list of column data in the MSDN library documentation. The following query joins sys.dm_os_waiting_tasks and sys.dm_exec_requests to return the most important columns of interest:

SELECT dm_ws.session_id,
       dm_ws.wait_type,
       UPPER(dm_es.status) AS status,
       dm_ws.wait_duration_ms,
       dm_t.text,
       dm_es.cpu_time,
       dm_es.memory_usage,
       dm_es.logical_reads,
       dm_es.total_elapsed_time,

       dm_ws.blocking_session_id,
       dm_es.program_name,
       DB_NAME(dm_r.database_id) DatabaseName
FROM sys.dm_os_waiting_tasks dm_ws
INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
CROSS APPLY sys.dm_exec_sql_text (dm_r.sql_handle) dm_t
WHERE dm_es.is_user_process = 1

The output from the previous query returns the session_id, wait_type, status, wait_duration_ms, statement text, cpu_time, memory_usage, logical_reads, total_elapsed_time, blocking_session_id, program_name, and DatabaseName columns. In this sample, the waiting sessions were all running against the MyTroubledDB database: two .Net SqlClient Data Provider sessions, plus the monitoring query itself (session 72, with an OLEDB wait in the RUNNING state, issued from Microsoft SQL Server Management Studio - Query).

CPU Blame Game

Now that I've discussed the internals of waits and queues, the categories, and the DMVs to report on wait statistics and I've shown some basic queries, let's focus on the key bottleneck areas. I also detailed performance issues with respect to signal waits and CPU pressure. In this section, I will highlight some of the resource wait types you will typically see associated with I/O, memory, locking and blocking, and parallelism issues. Once you identify and translate the wait types, you can provide further troubleshooting and analysis by setting up the proper Performance Monitor objects and counters.

Let's play another quick round of the blame game, CPU edition. Suddenly, there are CPU spikes on your SQL Server, and CPU utilization is trending over 90 percent. The system administrator takes screenshots of the CPU usage on the Performance tab of Windows Task Manager. He also shows some Performance Monitor information indicating that the Processor time is consistently greater than 90 percent. Then, he identifies the box as a SQL Server instance. Blame time. He tells all the managers that SQL Server must be the problem, and you, the DBA, are contacted to look into the issue. What do you do?

Fortunately for you, you're reading this book, and you will run a handy little CPU script that shows historical CPU utilization derived from the sys.dm_os_ring_buffers view. The advantage of this script over looking at Task Manager is that it breaks down CPU time into three key columns by the percentage used, SQLProcessUtilization, SystemIdle, and OtherProcessUtilization, along with the event time.

SET NOCOUNT ON
DECLARE @ts_now bigint
SELECT @ts_now = cpu_ticks / (cpu_ticks / ms_ticks) FROM sys.dm_os_sys_info
SELECT /*top 1*/ record_id,
       DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) AS EventTime,
       SQLProcessUtilization,
       SystemIdle,
       100 - SystemIdle - SQLProcessUtilization AS OtherProcessUtilization

FROM (SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS SystemIdle,
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS SQLProcessUtilization,
             timestamp
      FROM (SELECT timestamp, CONVERT(xml, record) AS record
            FROM sys.dm_os_ring_buffers
            WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
              AND record LIKE '%<SystemHealth>%') AS x
     ) AS y
ORDER BY record_id DESC

Since the instance was last started (or statistics cleared), you will clearly see whether SQL Server is in fact the cause of high CPU. If the SQLProcessUtilization column shows zero or a low value and there is a high value in either the OtherProcessUtilization or SystemIdle column, SQL Server has been cleared as the suspect. Specifically, if the values in the OtherProcessUtilization column are consistently high, this means that other applications or processes are causing high CPU utilization. The question now shifts to the system administrators as to why there are CPU-bound applications sharing the server with SQL Server. The results from the previous CPU query return the record_id, EventTime, SQLProcessUtilization, SystemIdle, and OtherProcessUtilization columns, captured at one-minute intervals; they show the historical total CPU utilization of SQL Server versus other processes since the instance was last restarted.

I/O May Be Why Your Server Is So Slow

SQL Server under normal processing will read from and write to the disk where the data is stored. However, when the path to the data on the I/O subsystem is stressed, performance can take a hit for a number of reasons. Performance will suffer if the I/O subsystem cannot keep up with the demand being placed on it by SQL Server. Through various performance-tuning strategies, you can minimize I/O bottlenecks in order to retrieve and write data from disk. When I speak of performance related to disk or storage, it's called I/O. You can measure I/O operations per second (IOPS) as well as latency with respect to I/O. You will take a look at I/O usage, which can be useful in many scenarios.

Tasks that are waiting for I/O to finish can surface as IO_COMPLETION and ASYNC_IO_COMPLETION waits, representing nondata page I/Os. When an SPID is waiting for asynchronous I/O requests to complete, this shows up as ASYNC_IO_COMPLETION. If you observe these common I/O wait types consistently, the I/O subsystem is probably the bottleneck, and for further analysis you should set up I/O-related physical disk counters in Performance Monitor (counters such as Disk sec/Read and Disk sec/Write).

I/O Blame Game

An example of such an occurrence, one that I experienced first-hand in my DBA travels, is an error message relating to I/O requests taking a long time to complete. There are various reasons this can occur. Specifically, as in my case, latency in the communication between SQL Server and the SAN storage can cause poor I/O performance. If you are experiencing suboptimal throughput with respect to I/O, you may examine the wait statistics and see PageIOLatch_ wait types appear. Values that are consistently high for these wait types indicate I/O subsystem issues. I will discuss these waits more in-depth later in the chapter. This of course

126 CHAPTER 3 WAITS AND QUEUES affects database integrity and performance but is clearly not the fault of SQL Server. In SQL Server 2005 SP2, an error message was added to help diagnose data read and write delays occurring with the I/O subsystem. When there are such issues, you may see this in the SQL Server error log: SQL Server has encountered n occurrence(s) of I/O requests taking longer than 15 seconds to complete on file <filename> in database <dbname>. The intention of this error message was in fact to help pinpoint where the delay is and what SQL Server is waiting on, not indict SQL Server itself as the cause. In a situation where SQL Server was cluster attached to a SAN, it was consistently crashing. After examining the error logs and encountering the previous I/O requests taking longer than 15 seconds... error, the IT folks and management incorrectly determined that SQL Server was corrupt. It took several conference calls and production downtime to convince everyone to check the SAN drivers and firmware versions. For various reasons, sometimes the server may fail to access the SAN altogether. Poor I/O throughput commonly occurs when a SAN s firmware is out-of-date. If the hardware bus adapter(hba) drivers are not updated, an insufficient HBA queue length will cause poor I/O as well. This I/O problem can occur when one or both have not been upgraded. Be aware that if you upgrade SAN firmware, you need to upgrade the HBA drivers at the same time. In the scenario I am describing, the SAN firmware had been upgraded but not the HBA drivers, which caused poor I/O throughput, and the subsystem couldn t keep up with the I/O requests generated from SQL Server. The SAN firmware and the HBA drivers should have been upgraded simultaneously. This in fact was the cause of SQL Server crashing, as indicated by the error message, and another instance of blaming SQL Server as the culprit. Other reasons why I/O performance is not good is the I/O load. Did you ever have a situation where there are I/O issues at night but SQL Server performs well during the day with no warnings in the logs? This is usually because of more intensive I/O operations that are running, such as batch jobs, backups, and DBCC checks. Moreover, several such jobs may be running at the same time because of poor scheduling or because one or more jobs are overlapping each other. Provided there is sufficient free time, you can reschedule one or more jobs and monitor for a couple of days to see whether the changes have made a difference in I/O. Another cause of poor I/O performance, which also can have a significant negative impact on SQL Server, is if you have antivirus software installed on your SQL Server. You need to ensure that.mdf,.ndf,.ldf,.bak, and.trn files are added to the exclusion list. In addition, you will also want to exclude filestream/filetable/ OLTP files and folders. If possible, exclude the entire directory tree for SQL Server. In addition, real-time virus checking should be disabled completely, and any virus scans should be scheduled at off-peak times instead. Note If you have antivirus (AV) software installed on your SQL Server, this can negatively impact I/O performance, and you need to exclude SQL Server from AV scans. Fragmentation Affects I/O Fragmentation is another issue that can occur with respect to I/O and may be internal (within tables/ indexes) or external (file fragmentation on the disk). 
Physical file fragmentation can contribute to an additional load on your I/O subsystem and reduce performance because the disks have to work harder to read and write data. To find all of the required data, the disk needs to hop around to different physical locations as a result of files not being on the disk contiguously. A physically fragmented file is when it is stored noncontiguously on disk. When file data is spread all over various locations on a disk that is fragmented, this will often result in slower I/O performance. Things to do to reduce fragmentation is to pre-allocate data and log file size (instead of using default auto-grow), never use the AutoShrink option on databases, and never manually shrink data files. Shrinking databases will rapidly cause major fragmentation and incur a significant performance penalty. In some 66

127 CHAPTER 3 WAITS AND QUEUES cases, you can and will need to shrink data files but will need to rebuild your indexes to ensure you have removed any fragmentation in the database. In addition, the disk should be dedicated for SQL Server and not shared with any other applications. Log files can be shrunk without issue if need be. Sometimes they grow out of control because of certain factors and you can reclaim space from the log, after active transactions are committed to disk. I mention this because this distinction is often the cause of confusion when shrinking database files in general. For shrinking out-of-control log files, you would run a DBCC Shrinkfile(2), with 2 representing the log. Caution Never, ever use the AutoShrink database option; avoid DBCC ShrinkDatabase! I/O Latch Buffer Issues Another category of I/O wait types relate to buffer I/O requests that are waiting on a latch. These types of waits show up as PAGEIOLATCH wait types, as described next. Pending the completion of a physical disk I/O, PAGEIOLATCH wait types can occur when there are buffer requests that are currently blocked. The technical definition of a latch is that they are lightweight internal structures used to synchronize access to buffer pages and are used for disk-to-memory transfers. As you should be aware, data modifications occur in an area of the memory reserved for data, called the buffer or data cache. Once the data is modified there, it then is written to disk. Therefore, to prevent further modifications of a data page that is modified in memory, a latch is created by SQL Server before the data page is written to disk. Once the page is successfully written to disk, the latch is released. It is similar in function to a lock (discussed further in the Blocking and Locking section in this chapter) to ensure transactional consistency and integrity and prevent changes while a transaction is in progress. Many papers cover the subjects of locks and latches together. Books Online describes some of the PAGIOLATCH wait types as follows: PAGEIOLATCH_DT : Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Destroy mode. Long waits may indicate problems with the disk subsystem. PAGEIOLATCH_EX : Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem. PAGEIOLATCH_KP : Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Keep mode. Long waits may indicate problems with the disk subsystem. PAGEIOLATCH_SH : Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem. PAGEIOLATCH_UP : Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Update mode. Long waits may indicate problems with the disk subsystem When the percentage of the total waits on the system is high, it can indicate disk subsystem issues but possibly memory pressure issues as well. High values for both PageIOLatch_ex and PageIOLatch_sh wait types indicate I/O subsystem issues. PageIOLatch_ex indicates an exclusive I/O page latch, while PageIOLatch_sh is a shared I/O page latch. I will talk about steps to take to minimize these waits, but first let s talk about which DMVs can help you with analyzing I/O statistics. 67
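Before moving on to the file-level DMVs, a quick way to check how much these buffer I/O latches account for is to filter sys.dm_os_wait_stats for the PAGEIOLATCH wait types. This is a simple sketch of my own, not a script from the chapter:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;

If the resource wait time for PAGEIOLATCH_SH and PAGEIOLATCH_EX dominates your overall wait profile, that is the cue to dig into the file-level I/O statistics that follow.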

A particularly useful DMV for observing I/O usage is sys.dm_io_virtual_file_stats, which returns I/O statistics for database files, both data and log. This DMV's num_of_bytes_read and num_of_bytes_written columns let you easily calculate total I/O. In addition to calculating total I/O, you can use common table expressions (CTEs) to determine the percentage of I/O usage, which tells you where most of the I/O is occurring. You can view the default column data output with the following query:

SELECT * FROM sys.dm_io_virtual_file_stats(NULL, NULL)

With sys.dm_io_virtual_file_stats, you can look for I/O stalls, as well as view the I/O usage by database, file, or drive. An I/O stall is a situation, caused by disk latency for example, where an I/O request takes a long time to complete. You can also refer to an I/O stall as stuck or stalled I/O. The I/O stall value is the total time, in milliseconds, that users waited for I/O to be completed on the file. By looking at the I/O stall information, you can see how much time was spent waiting for I/O to complete and how long the users were waiting.

Aside from the actual latency of I/O, you may want to know which database accounts for the highest I/O usage, which is a different metric than I/O stalls. Regardless of the disk layout of the particular database, the following query will return the I/O usage for each database across all drives. This gives you the opportunity to consider moving files from one physical disk to another by determining which files have the highest I/O. Where there is high I/O usage, moving data, log, and/or tempdb files to other physical drives can reduce the I/O contention, as well as spread the I/O over multiple disks. A percentage of I/O usage by database will be shown with this query:

WITH Cu_IO_Stats AS
(
    SELECT DB_NAME(database_id) AS database_name,
           CAST(SUM(num_of_bytes_read + num_of_bytes_written) / 1048576. AS DECIMAL(12, 2)) AS io_in_mb
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS DM_IO_Stats
    GROUP BY database_id
)
SELECT ROW_NUMBER() OVER(ORDER BY io_in_mb DESC) AS row_num,
       database_name,
       io_in_mb,
       CAST(io_in_mb / SUM(io_in_mb) OVER() * 100 AS DECIMAL(5, 2)) AS pct
FROM Cu_IO_Stats
ORDER BY row_num;

The results from this query rank the databases by I/O usage out of the total database I/O, in descending order with the highest percentage at the top. In this example, MDW_HealthySQL topped the list, followed by msdb, tempdb, model, ReportServer,

master, ReportServerTempDB, XEvents_ImportSystemHealth, TEST, and Canon. With respect to these I/O results, keep in mind that although you can identify which database is using the most I/O bandwidth, it doesn't necessarily indicate that there is a problem with the I/O subsystem. Suppose you determine the I/O usage of the files or databases on a server and that a database's I/O usage is 90 percent. If an individual file or database is using 90 percent of the total I/O but there's no waiting for reads or writes, there are no performance issues. The percentage that you see is out of the total I/O usage of all the databases on the instance. Please note that all the percentages here add up to 100 percent, so this represents only SQL Server I/O. It is possible that other processes can be using I/O and impacting SQL Server. Again, it comes down to how long the system is waiting. The more users wait, the more performance is potentially affected. So, in this case, you also need to look at statistics that tell you how long users have to wait for reads and writes to occur. With the next I/O query, you can calculate the percentage of I/O by drive letter:

WITH IOdrv AS
(
    SELECT db_name(mf.database_id) AS database_name,
           mf.physical_name,
           LEFT(mf.physical_name, 1) AS drive_letter,
           vfs.num_of_writes,
           vfs.num_of_bytes_written AS bytes_written,
           vfs.io_stall_write_ms,
           mf.type_desc,
           vfs.num_of_reads,
           vfs.num_of_bytes_read,
           vfs.io_stall_read_ms,
           vfs.io_stall,
           vfs.size_on_disk_bytes
    FROM sys.master_files mf
    JOIN sys.dm_io_virtual_file_stats(NULL, NULL) vfs
         ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    --ORDER BY vfs.num_of_bytes_written DESC
)
SELECT database_name,
       drive_letter,
       bytes_written,
       Percentage = RTRIM(CONVERT(DECIMAL(5,2),
           bytes_written * 100.0 / (SELECT SUM(bytes_written) FROM IOdrv))) + '%'
FROM IOdrv
--WHERE drive_letter = 'D'   -- you can specify a drive here
ORDER BY bytes_written DESC

Results will resemble the output shown here, listing the databases by percentage of I/O usage per drive letter; in this sample, the top five rows were the tempdb files and the DBABC and DB123 databases, all writing to the D: drive, each with its bytes_written value and percentage of the total.

Another way to look at I/O waiting is to use the io_stall column in sys.dm_io_virtual_file_stats. This column can tell you the total time that users waited for I/O on a given file. You can also look at the latency of I/O, which is measured in I/O stalls as defined earlier in this section. Looking at I/O stalls is a more accurate way to measure I/O since you are concerned with the time it takes to complete I/O operations, such as reads and writes, on a file. Here's an example query:

WITH DBIO AS
(
    SELECT DB_NAME(IVFS.database_id) AS db,
           CASE WHEN MF.type = 1 THEN 'log' ELSE 'data' END AS file_type,
           SUM(IVFS.num_of_bytes_read + IVFS.num_of_bytes_written) AS io,
           SUM(IVFS.io_stall) AS io_stall
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS IVFS
    JOIN sys.master_files AS MF
         ON IVFS.database_id = MF.database_id AND IVFS.file_id = MF.file_id
    GROUP BY DB_NAME(IVFS.database_id), MF.type
)
SELECT db,
       file_type,
       CAST(1. * io / (1024 * 1024) AS DECIMAL(12, 2)) AS io_mb,
       CAST(io_stall / 1000. AS DECIMAL(12, 2)) AS io_stall_s,
       CAST(100. * io_stall / SUM(io_stall) OVER() AS DECIMAL(10, 2)) AS io_stall_pct
FROM DBIO
ORDER BY io_stall DESC;

The query results show the top databases and file types by I/O stalls, with the io_mb, io_stall_s, and io_stall_pct columns. In this sample, the top ten entries were the DBA123 data files, the tempdb log, the DBABC data files, the tempdb data files, the msdb data files, the DBABC log, the SQLDB1 data files, the DBCUP log, the DBA_stats data files, and the ReportServer data files.

You can also use the io_stall_read_ms and io_stall_write_ms columns in sys.dm_io_virtual_file_stats. These columns tell you the total time that users waited for reads and writes to occur on a given file. As mentioned in the previous paragraphs, if the I/O subsystem cannot keep up with the requests from SQL Server, you say that there is latency in the communication between SQL Server and the disk, which affects the rate of reads and writes to and from disk. With one comprehensive script, provided by Paul Randal, a well-known Microsoft SQL Server MVP, you can examine the latencies of the I/O subsystem. This next script filters the stats to show ratios when there are reads and writes taking place and shows you where the read and write latencies are occurring. Also, you can see the database names and file paths by joining to the sys.master_files view.

SELECT
    --virtual file latency
    [ReadLatency] = CASE WHEN [num_of_reads] = 0 THEN 0
                         ELSE ([io_stall_read_ms] / [num_of_reads]) END,
    [WriteLatency] = CASE WHEN [num_of_writes] = 0 THEN 0
                          ELSE ([io_stall_write_ms] / [num_of_writes]) END,
    [Latency] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0
                     ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
    --avg bytes per IOP
    [AvgBPerRead] = CASE WHEN [num_of_reads] = 0 THEN 0
                         ELSE ([num_of_bytes_read] / [num_of_reads]) END,
    [AvgBPerWrite] = CASE WHEN [num_of_writes] = 0 THEN 0
                          ELSE ([num_of_bytes_written] / [num_of_writes]) END,
    [AvgBPerTransfer] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0
                             ELSE (([num_of_bytes_read] + [num_of_bytes_written]) /
                                   ([num_of_reads] + [num_of_writes])) END,
    LEFT ([mf].[physical_name], 2) AS [Drive],
    DB_NAME ([vfs].[database_id]) AS [DB],
    --[vfs].*,
    [mf].[physical_name]
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
JOIN sys.master_files AS [mf]
     ON [vfs].[database_id] = [mf].[database_id]
    AND [vfs].[file_id] = [mf].[file_id]
-- WHERE [vfs].[file_id] = 2  -- log files
-- ORDER BY [Latency] DESC
-- ORDER BY [ReadLatency] DESC
ORDER BY [WriteLatency] DESC;
GO

You can observe where the read and write latencies are happening by examining the output, which shows read and write I/O latency along with other stats, including drive, database name, and file path, as shown here:

ReadLatency  WriteLatency  Latency  AvgBPerRead  AvgBPerWrite  AvgBPerTransfer  Drive  DB
...          ...           ...      ...          ...           ...              D:     msdb
...          ...           ...      ...          ...           ...              D:     eip_sp_prod
...          ...           ...      ...          ...           ...              D:     model

physical_name
D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf
D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\eip_sp_prod.mdf
D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\model.mdf
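One caveat to keep in mind when interpreting these numbers: sys.dm_io_virtual_file_stats accumulates its counters from the moment the instance starts, so the latencies above are averages over the instance's whole lifetime. If you want latency for a specific window of time, you can snapshot the DMV, wait, and then diff the two samples. The following is only a minimal sketch of that approach (the #io_snapshot temp table and the five-minute delay are illustrative choices, not part of the original script):

-- Snapshot the cumulative counters, wait for a sample window, then diff.
SELECT database_id, file_id, num_of_reads, io_stall_read_ms,
       num_of_writes, io_stall_write_ms
INTO #io_snapshot
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:05:00';   -- length of the sample window

SELECT DB_NAME(cur.database_id) AS db,
       cur.file_id,
       ReadLatency  = CASE WHEN cur.num_of_reads - snap.num_of_reads = 0 THEN 0
                           ELSE (cur.io_stall_read_ms - snap.io_stall_read_ms)
                                / (cur.num_of_reads - snap.num_of_reads) END,
       WriteLatency = CASE WHEN cur.num_of_writes - snap.num_of_writes = 0 THEN 0
                           ELSE (cur.io_stall_write_ms - snap.io_stall_write_ms)
                                / (cur.num_of_writes - snap.num_of_writes) END
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS cur
JOIN #io_snapshot AS snap
     ON cur.database_id = snap.database_id
    AND cur.file_id = snap.file_id
ORDER BY WriteLatency DESC;

DROP TABLE #io_snapshot;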

Pending I/O requests can be found by querying both the sys.dm_io_virtual_file_stats and sys.dm_io_pending_io_requests DMVs and can be used to identify which disk is responsible for the bottleneck. Remember, pending I/O requests are not in themselves a performance problem, but if several such requests are consistently waiting long periods of time to complete, whether they are slow or blocked, then you can reasonably assume you have an I/O bottleneck and consider the options discussed toward the end of this section. The processing of I/O requests usually takes subseconds, so you will likely not see anything appear in these results unless requests are waiting a long time to complete. Here is a pending I/O requests query:

SELECT iovfs.database_id,
       iovfs.file_id,
       iovfs.io_stall,
       iopior.io_pending_ms_ticks,
       iopior.scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS iovfs
JOIN sys.dm_io_pending_io_requests AS iopior
     ON iovfs.file_handle = iopior.io_handle;

As you can see from the number of queries that can be run against the sys.dm_io_virtual_file_stats DMV alone, there are several ways to analyze and view I/O statistics. Looking at I/O stalls, as well as at read and write latencies, is most useful in finding the I/O contention hotspots for databases and their file locations. Let's now set up the related Performance Monitor counters you can use for further I/O analysis.

Related Performance Monitor Counters

If buffer I/O latch issues occur and the PAGEIOLATCH waits show high wait times, you may want to look at the Perfmon counters that relate to memory under the Buffer Manager object. The counters here monitor physical I/O as database pages are read from and written to disk, as well as how memory is used to store data pages.

Page Lookups/sec
Page Life Expectancy
Page Reads/sec
Full Scans/sec

In Figure 3-9, you can see a Perfmon screen that includes several of the relevant counters that show both physical and logical I/O.

Figure 3-9. Physical and logical I/O disk counters

Another category of I/O wait is nondata page I/O, which occurs while waiting for I/O operations other than data page reads and writes to complete. These nondata page I/O operations usually appear as IO_COMPLETION waits. Therefore, you will want to set up physical disk counters that measure the following:

Current Disk Queue Length
Avg Disk sec/Read
Avg Disk sec/Write
Avg Disk Bytes/Read
Avg Disk Bytes/Write
Avg Disk sec/Transfer

The numbers that you get will determine whether you have an I/O bottleneck. What you're looking for are numbers that fall outside your normal baselines. I'll now list some action steps to reduce and minimize I/O bottlenecks. There are a number of things you can do from the SQL Server performance-tuning DBA perspective to reduce I/O bandwidth, such as proper indexing and fixing bad queries that are I/O intensive. You can also work to reduce the number of joins and eliminate table scans by replacing them with seeks. When indexes are missing, SQL Server spends more time waiting on the disk to do table scan operations to find records. Unused indexes waste space and cause additional I/O activity, so they should be dropped. Another thing to look at is that if you see high I/O usage percentages on databases and their files, you should be concerned with proper file placement. By checking the file system for where the files reside, you can identify where the I/O contention lies and move the database files to separate, dedicated, and faster storage.

As mentioned in the previous chapter, ideally the LDF files, MDF files, and TempDB should each be on their own separate drives. If you identify the top hotspot tables by I/O, you can place them in their own filegroup on a separate disk. Here are some other specific I/O-related waits that might surface:

ASYNC_IO_COMPLETION: Occurs when a task is waiting for I/Os to finish.

IO_COMPLETION: Occurs while waiting for I/O operations to complete. This wait type generally represents nondata page I/Os.

PAGEIOLATCH_EX: Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.

PAGEIOLATCH_SH: Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.

WRITELOG: Occurs while waiting for a log flush to complete. Common operations that cause log flushes are checkpoints and transaction commits.

BACKUPIO: Occurs when a backup task is waiting for data or is waiting for a buffer in which to store data.

Memory Pressure

In the prior section on I/O, I discussed how I/O buffer issues are related to memory pressure, since a lack of available memory forces a query to use disk resources. Because it is desirable for a query to run in memory, if there are not enough memory resources to complete the query request, you say the system is under memory pressure, which of course affects query performance. There are a number of things to look at to ensure that SQL Server has sufficient memory to perform optimally and prevent paging to disk. You will look at memory usage and the waits that can show up under memory pressure.

Before a query actually uses memory, it goes through several stages prior to execution. Parsing and compiling a query execution plan was already discussed, but in addition, SQL Server requires the query to reserve memory. To reserve memory, memory must be granted to the query. Referred to as a memory grant, it is memory used to store temporary rows for sort and hash join operations. A query is allowed to reserve memory only if there is enough free memory available. The Resource Semaphore, which is responsible for satisfying memory grant requests, keeps track of how much memory is granted. The Resource Semaphore also keeps overall memory grant usage within the server limit, and rather than letting a query fail with an out-of-memory error, it makes the query wait for the memory. The life of a memory grant is equal to that of the query.

Much of the memory pressure SQL Server experiences is related to the number of memory grants pending and memory grants outstanding. Memory grants pending is the total number of processes waiting for a workspace memory grant, while memory grants outstanding is the total number of processes that have already acquired a workspace memory grant. If query memory cannot be granted immediately, the requesting query is forced to wait in a queue and is assigned the RESOURCE_SEMAPHORE wait type. When you look at the number of memory grants, as you will with some diagnostic queries in this section, you typically want to see this queue empty. If Memory Grants Pending averages greater than 0 for extended periods of time, queries cannot run because they can't get enough memory.
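As a quick first check (a small supplementary query, not one of the chapter's scripts), you can read both of these counters straight from sys.dm_os_performance_counters, using the same Memory Manager counter names that Perfmon exposes:

-- Memory grant pressure check: Memory Grants Pending should normally be 0.
SELECT [counter_name], [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
  AND [counter_name] IN ('Memory Grants Pending', 'Memory Grants Outstanding');

A sustained nonzero value for Memory Grants Pending is the signal to dig into the individual queries, which is what the next queries do.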
To find all of the queries waiting in the memory queue, or to see which queries are using the most memory grants, you can run the following queries against the sys.dm_exec_query_memory_grants system view.

If there are any queries waiting in the memory queue for memory grants, they will show up in response to the following SELECT statement:

SELECT *
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL;

To find the queries that are using the most memory grants, run this query:

SELECT mg.granted_memory_kb,
       mg.session_id,
       t.text,
       qp.query_plan
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(mg.plan_handle) AS qp
ORDER BY 1 DESC
OPTION (MAXDOP 1);

When there is a user query waiting for memory, the output displays how much memory is being granted in kilobytes, the session ID, the statement text, and the query plan as XML:

granted_memory_kb  session_id  text                                                             query_plan
...                ...         CREATE PROCEDURE [dbo].[sp_syscollector_purge_collection_logs]  <ShowPlanXML xmlns="...">

SQL Server is a memory hog. The more memory available, the better for SQL Server performance, and rest assured that it will take and use most of the memory for the buffer pool if no upper limit is defined. SQL Server is a memory-intensive application that may drain the system dry of buffer pool memory, and under certain conditions this can cause Windows to swap to disk. You of course don't want this to occur, and you also need to take into account the memory needed for the OS. Competition for memory resources between SQL Server and the OS does not make for a well-tuned database engine. Furthermore, if other applications reside on the same server, they require memory as well, and resources can starve; these applications will compete with SQL Server, which will attempt to use most of the available memory. The ideal scenario is a well-balanced and properly configured system.

It is usually a good idea to set the maximum server memory and minimum server memory for each instance to control memory usage. Setting the maximum server memory ensures that an instance will not take up all the memory on the box. It is also recommended that the box be dedicated to SQL Server, leaving some room for the OS. The minimum server memory option guarantees that the specified amount of memory will be available for the SQL Server instance. Once this minimum memory is allocated, it cannot be freed up until the minimum server memory setting is reduced. Determining the right numbers depends on collecting and reviewing memory counter data under load.

Once you know what thresholds you will set for your minimum and maximum memory usage, you can set them through SSMS or by using sp_configure. Since you configure the memory here in megabytes, you convert gigabytes to megabytes by multiplying by 1,024. For example, if you want the maximum memory SQL Server uses to be 8 GB (assuming this much or more physical memory is available on the box), you would multiply 8 by 1,024, which equals 8,192. The max server memory setting is an advanced option, so you must enable advanced options first. Because the memory settings are dynamic, they take effect right away without a restart of the instance. Here is how you would set the maximum memory setting to 8 GB:

sp_configure 'show advanced options', 1
go
reconfigure
go

sp_configure 'max server memory (MB)', 8192
go
reconfigure
go

Once the configuration changes are executed, you will receive messages like the following:

Configuration option 'show advanced options' changed from 1 to 1. Run the RECONFIGURE statement to install.
Configuration option 'max server memory (MB)' changed from ... to 8192. Run the RECONFIGURE statement to install.

You can use the DBCC MEMORYSTATUS command to get a snapshot of the current memory status of Microsoft SQL Server and help monitor memory usage, as well as troubleshoot memory consumption and out-of-memory errors. You can find a reference to this command on Microsoft's website. Additionally, you can investigate some of the other available memory-related DMVs to monitor memory usage. Some to look at include the following:

sys.dm_os_sys_info
sys.dm_os_memory_clerks
sys.dm_os_memory_cache_counters
sys.dm_os_sys_memory (SQL Server 2008 and later)

Using Performance Monitor, you can set up the memory-related object counters, which monitor overall server memory usage:

Memory: Available MBytes
Paging File: % Usage
Buffer Manager: Buffer Cache Hit Ratio
Buffer Manager: Page Life Expectancy
Memory Manager: Memory Grants Pending
Memory Manager: Memory Grants Outstanding
Memory Manager: Target Server Memory (KB)
Memory Manager: Total Server Memory (KB)

Figure 3-10 shows the memory-related Performance Monitor counters you would select to monitor memory usage.

Figure 3-10. Memory-related performance counters

Of particular interest is the Target Server Memory value versus the Total Server Memory value. The Target Server Memory value is the ideal memory amount for SQL Server, while the Total Server Memory value is the actual amount of memory committed to SQL Server. The Target Server Memory value is equal to the max server memory setting in sys.configurations. These numbers ramp up during SQL Server instance startup and stabilize to a point where the total should ideally approach the target. Once this is the case, the Total Server Memory value should not decrease or drop far below the Target Server Memory value, because this could indicate memory pressure from the OS forcing SQL Server to deallocate its memory. You can measure these using sys.dm_os_performance_counters:

SELECT [counter_name], [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
  AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')

The raw counter data output would look like this:

counter_name               cntr_value
Target Server Memory (KB)  ...
Total Server Memory (KB)   ...

A good way to see how close the Total Server Memory value comes to the Target Server Memory value is to calculate Total Server Memory as a percentage of Target Server Memory, where 100 percent means they are equal. The closer the percentage is to 100 percent, the better. Here's an example:

SELECT ROUND(100.0 *
       ( SELECT CAST([cntr_value] AS FLOAT)
         FROM sys.dm_os_performance_counters
         WHERE [object_name] LIKE '%Memory Manager%'
           AND [counter_name] = 'Total Server Memory (KB)' )
       /
       ( SELECT CAST([cntr_value] AS FLOAT)
         FROM sys.dm_os_performance_counters
         WHERE [object_name] LIKE '%Memory Manager%'
           AND [counter_name] = 'Target Server Memory (KB)' ), 2) AS [IDEAL MEMORY USAGE]

The output of the previous query shows the ideal memory usage in terms of Total vs. Target Server Memory:

IDEAL MEMORY USAGE
99.8

Parallelism and CXPACKET

In the previous chapter I mentioned one of the common wait types related to parallelism and parallel query processing, called CXPACKET. With respect to memory grants, as discussed in the previous section, parallelism can affect a query's memory requirement. When multiple threads are created for a single query, not all threads are given the same amount of work to do; this causes one or more threads to lag behind, producing CXPACKET waits that affect overall throughput. However, just because you see a majority of these wait types doesn't mean the situation is necessarily actionable. The biggest misconception among DBAs is that in order to reduce the occurrence of CXPACKET waits, you need to turn parallelism off by changing the instance-level MAXDOP setting to 1. You need to understand and factor in other considerations before you do that. I would avoid turning it off altogether; instead, set it to a number equal to the number of physical processors, not exceeding 8, after which you reach a point of diminishing returns. Testing your MAXDOP settings is essential: measure query performance for each value before you settle on an optimal one.

The idea is that SQL Server's optimizer identifies opportunities to execute a query or index operation across multiple processors in parallel, via multiple threads, so that the operation completes faster. This concept of parallelism is intended to increase performance rather than degrade it. Moreover, the execution plan for a parallel query can use more than one thread, unlike a serial execution plan. Parallelism is usually best left to SQL Server to determine the most optimal execution plan. However, performance is not always optimal. You can manually control parallelism at the server level and at the individual query level by setting the maximum degree of parallelism, which is the maximum number of CPUs used for query execution. You can modify the server-wide setting using sp_configure 'max degree of parallelism'. By default, SQL Server uses a value of 0, meaning all available processors. You can also override this option at the individual query level by using the query hint MAXDOP = x, where x is the number of processors used for that specific query. Often this is a better option than changing the setting server-wide; this way you can control the behavior and parallelism of a particular offending query without affecting the parallelism of the entire instance, as sketched in the example that follows.
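To make the two levels of control concrete, here is a brief sketch (the value 4 and the dbo.BigTable name are purely illustrative; choose values based on your own testing): the instance-wide setting is changed with sp_configure, while a single statement can be constrained with the MAXDOP query hint.

-- Instance-wide: cap parallelism at 4 schedulers ('max degree of parallelism' is an advanced option).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Per-query: override the instance setting for this statement only.
-- dbo.BigTable is a hypothetical table used for illustration.
SELECT COUNT(*)
FROM dbo.BigTable
OPTION (MAXDOP 1);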
If you see that parallelism is becoming a performance issue, before you change your MAXDOP configuration, consider raising the cost threshold for parallelism configuration option, via sp_configure, to a higher value than the default. This reduces the number of queries running under parallelism while still allowing the more expensive queries to utilize it. Another thing you want to do at the table level is to correct any index issues so as to prevent large table scans, which are often caused by missing indexes.

Missing indexes can cause the query optimizer to use a parallel plan to compensate for the missing index, and they should therefore all be identified; I will provide some missing index scripts later in the book. Such behavior unnecessarily increases parallelism and creates disk- and CPU-bound performance penalties. In addition, ensure that statistics are up to date on your tables to prevent a bad query plan. Please refer to Books Online or MSDN for how to run sp_updatestats.

Blocking and Locking, Oh My!

What is blocking, and how does it show up as a wait? Blocking occurs when one SQL Server session holds a lock on one or more records while a second session requires a conflicting lock type on the records locked by the first session. The second session waits until the first session releases its locks, potentially for a long time, which can significantly impact performance. If the second session continues to wait on the first, and another session ends up waiting on the second, and so on, this creates a blocking chain.

Blocking and locking for short periods of time are part of normal SQL Server processing; they protect a database's integrity and ensure that all transactions have certain characteristics. Let's discuss the SQL Server ACID properties. Going back to database theory 101, SQL Server must ensure through its locking mechanisms that transactions are atomic, consistent, isolated, and durable; these properties are known as ACID. Atomicity ensures that a transaction is all or none, meaning that it is completed in its entirety (with no errors), has not yet begun, or is rolled back. Consistency ensures that once the transaction is completed, the system is in a valid state, with all the changes made, both before and after the transaction takes place. Isolation of a transaction (or transaction isolation) is the key to making sure that an individual transaction believes it has exclusive access to the resources needed to complete; it also prevents a transaction from accessing data that is not in a consistent state. Each and every transaction needs to run in isolation until it finishes or is rolled back. Durability means that once the changes are successfully completed, they are made permanent, preventing the loss of information that has been saved and written to disk.

Therefore, a normal amount of locking and blocking is expected, but when it begins to degrade SQL Server performance, you have an issue that requires troubleshooting and resolution. Two blocking scenarios can slow performance: one where a session holds a lock on resources for a long period of time but eventually releases them, and another where a session holds a lock on resources indefinitely, preventing other processes from getting access to them. The latter scenario does not resolve itself and requires DBA intervention. With respect to this topic, you would typically see various LCK_M_x wait types that indicate blocking issues or sessions waiting for a lock to be granted. Performance problems can also stem from lock escalation or from increased memory pressure that causes increased I/O; the greater I/O can cause locks to be held longer and transaction times to go up as well. Transaction duration should be as short as possible. You can see more information about currently active lock manager resources in the sys.dm_tran_locks DMV, including whether a lock has been granted or is waiting to be granted.
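To see that last point in action, a minimal sketch (not one of the chapter's scripts) is to filter sys.dm_tran_locks on its request_status column, which separates lock requests that have been granted from those still waiting:

-- Lock requests not yet granted (waiting or converting), with the requesting session.
SELECT request_session_id,
       resource_type,
       resource_database_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE request_status <> 'GRANTED';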
You can also set up the Lock Wait Time (ms) Perfmon counter and check the isolation level in use for shared locks.

Find All Locks Being Held by a Running Batch Process

The following query uses sys.dm_exec_requests to find the batch process in question and get its transaction_id from the results:

SELECT *
FROM sys.dm_exec_requests;
GO

Then, use the transaction_id from the previous query output with the sys.dm_tran_locks DMV to find the lock information:

SELECT *
FROM sys.dm_tran_locks
WHERE request_owner_type = N'TRANSACTION'
  AND request_owner_id = < your transaction_id >;
GO

Find All Currently Blocked Requests

The old-school way was to run a quick sp_who2 and check whether there is an SPID entry in the BlkBy column in one or more of the rows. Another quick way was to query the sys.sysprocesses view, returning only the rows for sessions that are blocked:

SELECT *
FROM sys.sysprocesses
WHERE blocked > 0

Once you obtained the session or SPID for a blocked process, you could find the statement text associated with it by running this:

DBCC INPUTBUFFER (SPID)

The following example is a more elegant and modern method that queries the sys.dm_exec_requests view to find information about blocked requests; it also returns the statements involved in the blocking and related wait stats:

SELECT sqltext.text,
       xr.session_id,
       xr.status,
       xr.blocking_session_id,
       xr.command,
       xr.cpu_time,
       xr.total_elapsed_time,
       xr.wait_resource
FROM sys.dm_exec_requests xr
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext
WHERE status = 'suspended'

Summary

This was quite a technical and detailed chapter, introducing you to the waits and queues performance-tuning methodology, as well as demonstrating other useful DMVs that you will use in the upcoming chapters. Now that you understand the methodology and the DMVs and performance counters you need in order to implement your performance-tuning strategy, you will move on to the tools of the trade, where you build upon this knowledge and set of DMVs and counters and leverage them for further automated analysis. Always remember in your DBA travels that when performance is slow, and it will be at some point, look at wait statistics first as the quickest way to identify your performance bottlenecks. Mining, collating, comparing, and compiling this data will allow you to build your data repository, enable you to create comprehensive reporting, and give you a clear path toward a certifiably healthy SQL Server.

SQL Server T-SQL Recipes
Fourth Edition

Jason Brimhall
Jonathan Gennick
Wayne Sheffield

SQL Server T-SQL Recipes

Copyright © 2015 by Jason Brimhall, Jonathan Gennick, and Wayne Sheffield

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk):
ISBN-13 (electronic):

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director: Welmoed Spahr
Lead Editor: Jonathan Gennick
Technical Reviewer: Louis Davidson
Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Jill Balzano
Copy Editor: April Rondeau
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY. Phone 1-800-SPRINGER, fax (201) ..., e-mail orders-ny@springer-sbm.com, or visit the Springer website. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit the Apress website.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales and eBook Licensing web page on the Apress website.

Any source code or other supplementary material referenced by the author in this text is available to readers on the Apress website, which also has detailed information about how to locate your book's source code.

Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Getting Started with SELECT
Chapter 2: Elementary Programming
Chapter 3: Working with NULLS
Chapter 4: Querying from Multiple Tables
Chapter 5: Aggregations and Grouping
Chapter 6: Advanced Select Techniques
Chapter 7: Windowing Functions
Chapter 8: Inserting, Updating, Deleting
Chapter 9: Working with Strings
Chapter 10: Working with Dates and Times
Chapter 11: Working with Numbers
Chapter 12: Transactions, Locking, Blocking, and Deadlocking
Chapter 13: Managing Tables
Chapter 14: Managing Views
Chapter 15: Managing Large Tables and Databases
Chapter 16: Managing Indexes

Chapter 17: Stored Procedures
Chapter 18: User-Defined Functions and Types
Chapter 19: In-Memory OLTP
Chapter 20: Triggers
Chapter 21: Error Handling
Chapter 22: Query Performance Tuning
Chapter 23: Hints
Chapter 24: Index Tuning and Statistics
Chapter 25: XML
Chapter 26: Files, Filegroups, and Integrity
Chapter 27: Backup
Chapter 28: Recovery
Chapter 29: Principals and Users
Chapter 30: Securables, Permissions, and Auditing
Chapter 31: Objects and Dependencies
Index

CHAPTER 7

Windowing Functions

by Wayne Sheffield

SQL Server is designed to work best on sets of data. By definition, sets of data are unordered; it is not until the query's ORDER BY clause that the final results of the query become ordered. Windowing functions allow your query to look at a subset of the rows being returned by your query before applying the function to just those rows. In doing so, the functions allow you to specify an order for your unordered subset of data so as to evaluate that data in a particular order. This is performed before the final result is ordered (and in addition to it). This allows processes that previously required self-joins, the use of inefficient inequality operators, or non-set-based row-by-row (iterative) processing to use more efficient set-based processing.

The key to windowing functions is in controlling the order in which the rows are evaluated, when the evaluation is restarted, and what set of rows within the result set should be considered for the function (the window of the data set that the function will be applied to). These actions are performed with the OVER clause.

There are three groups of functions that the OVER clause can be applied to; in other words, there are three groups of functions that can be windowed. These groups are the aggregate functions, the ranking functions, and the analytic functions. Additionally, the sequence object's NEXT VALUE FOR function can be windowed. The functions that can have the OVER clause applied to them are shown in the following tables.

Table 7-1. Aggregate Functions

AVG, CHECKSUM_AGG, COUNT, COUNT_BIG, MAX, MIN, STDEV, STDEVP, SUM, VAR, VARP

Ranking functions allow you to return a ranking value that is associated with each row in a partition of a result set. Depending on the function used, multiple rows may receive the same value within the partition, and there may be gaps between assigned numbers.

Table 7-2. Ranking Functions

ROW_NUMBER: Returns an incrementing integer for each row within a partition of a set. ROW_NUMBER will return a unique number within each partition, starting with 1.
RANK: Similar to ROW_NUMBER, RANK increments its value for each row within a partition of the set. The key difference is that if rows with tied values exist within the partition, they receive the same rank value, and the next value receives the rank value as if there had been no ties, producing a gap between assigned numbers.
DENSE_RANK: The difference between DENSE_RANK and RANK is that DENSE_RANK doesn't have gaps in the rank values when there are tied values; the next value has the next rank assignment.
NTILE: Divides the result set into a specified number of groups, based on the ordering and optional partition clause.

Analytic functions (introduced in SQL Server 2012) compute an aggregate value on a group of rows. In contrast to the aggregate functions, they can return multiple rows for each group.

Table 7-3. Analytic Functions

CUME_DIST: Calculates the cumulative distribution of a value in a group of values, that is, the relative position of a specified value in a group of values.
FIRST_VALUE: Returns the first value from an ordered set of values.
LAG: Retrieves data from a previous row in the same result set, as specified by a row offset from the current row.
LAST_VALUE: Returns the last value from an ordered set of values.
LEAD: Retrieves data from a subsequent row in the same result set, as specified by a row offset from the current row.
PERCENTILE_CONT: Calculates a percentile based on a continuous distribution of the column value. The value returned may or may not be equal to any of the specific values in the column.
PERCENTILE_DISC: Computes a specific percentile for sorted values in the result set. The value returned is the value with the smallest CUME_DIST value (for the same sort specification) that is greater than or equal to the specified percentile, and it will be equal to one of the values in the column.
PERCENT_RANK: Computes the relative rank of a row within a set.

Many people break these functions down into two groups: the LAG, LEAD, FIRST_VALUE, and LAST_VALUE functions are considered to be offset functions, and the remaining functions are called analytic functions. These functions come in complementary pairs, and many of the recipes will cover them in this manner.
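The offset functions are easiest to picture with a tiny example. The following sketch is not one of the chapter's recipes; it uses an inline table constructor with made-up values purely for illustration, showing LAG and LEAD pulling the previous and next amounts relative to each row's date:

-- LAG/LEAD sketch over a small inline data set (illustrative values only).
SELECT d.SaleDate,
       d.Amount,
       PrevAmount = LAG(d.Amount)  OVER (ORDER BY d.SaleDate),
       NextAmount = LEAD(d.Amount) OVER (ORDER BY d.SaleDate)
FROM (VALUES ('2014-01-01', 100),
             ('2014-01-02', 150),
             ('2014-01-03', 90)) AS d (SaleDate, Amount);

For the first row, LAG returns NULL (there is no previous row), and for the last row, LEAD returns NULL.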

The syntax for the OVER clause is as follows:

OVER ( [ <PARTITION BY clause> ]
       [ <ORDER BY clause> ]
       [ <ROW or RANGE clause> ] )

The PARTITION BY clause is used to restart the calculations when the values in the specified columns change. It specifies columns from the tables in the FROM clause of the query, scalar functions, scalar subqueries, or variables. If a PARTITION BY clause isn't specified, the entire data set will be the partition.

The ORDER BY clause defines the order in which the OVER clause evaluates the data subset for the function. It can only refer to columns that are in the FROM clause of the query.

The ROWS | RANGE clause defines a subset of rows that the window function will be applied to within the partition. If ROWS is specified, this subset is defined by the position of the current row relative to the other rows within the partition. If RANGE is specified, this subset is defined by the value(s) of the column(s) in the current row relative to the other rows within the partition. This range is defined as a beginning point and an ending point. For both ROWS and RANGE, the beginning point can be UNBOUNDED PRECEDING or CURRENT ROW, and the ending point can be UNBOUNDED FOLLOWING or CURRENT ROW, where UNBOUNDED PRECEDING means the first row in the partition, UNBOUNDED FOLLOWING means the last row in the partition, and CURRENT ROW is just that: the current row. Additionally, when ROWS is specified, an offset can be specified with <X> PRECEDING or <X> FOLLOWING, which is simply the number of rows prior to or following the current row. There are two methods to specify the subset range: you can specify just the beginning point (which uses the default CURRENT ROW as the ending point), or you can specify both with the BETWEEN <starting point> AND <ending point> syntax. Finally, the entire ROWS | RANGE clause itself is optional; if it is not specified, it defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Each of the windowing functions permits and requires various clauses from the OVER clause. With the exception of the CHECKSUM, GROUPING, and GROUPING_ID functions, all of the aggregate functions can be windowed through the use of the OVER clause, as shown in Table 7-1 above. Additionally, the ROWS | RANGE clause allows you to perform running aggregations and sliding (moving) aggregations. The first four recipes in this section utilize the following table and data:

CREATE TABLE #Transactions
(
    AccountId INTEGER,
    TranDate DATE,
    TranAmt NUMERIC(8, 2)
);
INSERT INTO #Transactions
SELECT *
FROM ( VALUES ( 1, ' ', 500),
              ( 1, ' ', 50),
              ( 1, ' ', 250),
              ( 1, ' ', 75),
              ( 1, ' ', 125),
              ( 1, ' ', 175),
              ( 2, ' ', 500),
              ( 2, ' ', 50),
              ( 2, ' ', 25),
              ( 3, ' ', 5000),
              ( 3, ' ', 550),
              ( 3, ' ', 95),
              ( 3, ' ', 2500) ) dt (AccountId, TranDate, TranAmt);

Note that within AccountIds 1 and 3, there are two rows that have the same TranDate value. This duplicate date will be used to highlight the differences in some of the clauses used in the OVER clause in subsequent recipes.

7-1. Calculating Totals Based upon the Prior Row

Problem

You need to calculate the total of a column, where the total is the sum of the column values through the current row. For instance, for each account, calculate the total transaction amount to date in date order.

Solution

Utilize the SUM function with the OVER clause to perform a running total:

SELECT AccountId,
       TranDate,
       TranAmt,
       -- running total of all transactions
       RunTotalAmt = SUM(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate)
FROM #Transactions AS t
ORDER BY AccountId, TranDate;

This query returns the following result set:

AccountId  TranDate  TranAmt  RunTotalAmt

How It Works

The OVER clause, when used in conjunction with the SUM function, allows us to perform a running total of the transactions. Within the OVER clause, the PARTITION BY clause is specified so as to restart the calculation every time the AccountId value changes. The ORDER BY clause is specified and determines the order in which the rows should be calculated. Since the ROWS | RANGE clause is not specified, the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is utilized.

When the query is executed, the TranAmt column from all of the rows prior to and including the current row is summed up and returned. In this example, for the first row for each AccountId value, the RunTotalAmt returned is simply the value from the TranAmt column for that row. For subsequent rows, this value is incremented by the value in the current row's TranAmt column. When the AccountId value changes, the running total is reset and recalculated for the new AccountId value. So, for AccountId = 1, the RunTotalAmt value for the first TranDate is 500 (the value of that row's TranAmt column). For the next row, the TranAmt of 50 is added to the 500 for a running total of 550. In the next row, the TranAmt of 250 is added to the 550 for a running total of 800.

Note the duplicate TranDate value within each AccountId: the running total did not increment in the way that you would expect it to. Since this query did not specify a ROWS | RANGE clause, the default of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW was utilized. RANGE does not work on a row-position basis; instead, it works off of the values in the columns. For the rows with the duplicate TranDate, the TranAmt for all of the rows with that duplicate value was summed together. To see the data in the manner in which you would most likely want to see a running total, modify the query to include an additional column that performs the same running total calculation with the ROWS clause:

SELECT AccountId,
       TranDate,
       TranAmt,
       -- running total of all transactions
       RunTotalAmt = SUM(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- "Proper" running total by row position
       RunTotalAmt2 = SUM(TranAmt) OVER (PARTITION BY AccountId
                                         ORDER BY TranDate
                                         ROWS UNBOUNDED PRECEDING)
FROM #Transactions AS t
ORDER BY AccountId, TranDate;

This query produces these more desirable results in the RunTotalAmt2 column:

AccountId  TranDate  TranAmt  RunTotalAmt  RunTotalAmt2

Running aggregations can be performed over the other aggregate functions as well. In this next example, the query is modified to perform running averages, counts, and minimum/maximum calculations:

SELECT AccountId,
       TranDate,
       TranAmt,
       -- running average of all transactions
       RunAvg = AVG(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- running total # of transactions
       RunTranQty = COUNT(*) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- smallest of the transactions so far
       RunSmallAmt = MIN(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- largest of the transactions so far
       RunLargeAmt = MAX(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- running total of all transactions
       RunTotalAmt = SUM(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate)
FROM #Transactions AS t
WHERE AccountId = 1
ORDER BY AccountId, TranDate;

This query returns the following result set:

AccountId  TranDate  TranAmt  RunAvg  RunTranQty  RunSmallAmt  RunLargeAmt  RunTotalAmt

7-2. Calculating Totals Based upon a Subset of Rows

Problem

When performing these aggregations, you want only the current row and the two previous rows to be considered for the aggregation.

Solution

Utilize the ROWS clause of the OVER clause:

SELECT AccountId,
       TranDate,
       TranAmt,
       -- average of the current and previous 2 transactions
       SlideAvg = AVG(TranAmt)
                  OVER (PARTITION BY AccountId
                        ORDER BY TranDate
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- total # of the current and previous 2 transactions
       SlideQty = COUNT(*)
                  OVER (PARTITION BY AccountId
                        ORDER BY TranDate
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- smallest of the current and previous 2 transactions
       SlideMin = MIN(TranAmt)
                  OVER (PARTITION BY AccountId
                        ORDER BY TranDate
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- largest of the current and previous 2 transactions
       SlideMax = MAX(TranAmt)
                  OVER (PARTITION BY AccountId
                        ORDER BY TranDate
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- total of the current and previous 2 transactions
       SlideTotal = SUM(TranAmt)
                    OVER (PARTITION BY AccountId
                          ORDER BY TranDate
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM #Transactions AS t
ORDER BY AccountId, TranDate;

This query returns the following result set:

AccountId  TranDate  TranAmt  SlideAvg  SlideQty  SlideMin  SlideMax  SlideTotal

How It Works

The ROWS clause is added to the OVER clause of the aggregate functions to specify that the aggregate functions should look only at the current row and the previous two rows for their calculations. As you look at each column in the result set, you can see that the aggregation was performed over just these rows (the window of rows that the aggregation is applied to). As the query progresses through the result set, the window slides to encompass the specified rows relative to the current row.

Let's examine the results row by row for AccountId 1. Remember that we are applying a subset (the ROWS clause) consisting of the current row and the two previous rows.

For the first TranDate, there are no previous rows. For the COUNT calculation, there is just one row, so SlideQty returns 1. For each of the other columns (SlideAvg, SlideMin, SlideMax, SlideTotal), there are no previous rows, so the current row's TranAmt is returned as the AVG, MIN, MAX, and SUM values.

For the second row, there are now two rows visible in the subset of data, starting from the first row. The COUNT calculation sees these two and returns 2 for the SlideQty. The AVG calculation of these two rows is 275: (500 + 50) / 2. The MIN of these two values (500, 50) is 50. The MAX of these two values is 500. And finally, the SUM (total) of these two values is 550. These are the values returned in the SlideAvg, SlideMin, SlideMax, and SlideTotal columns.

For the third row, there are now three rows visible in the subset of data, starting from the first row. The COUNT calculation sees these three and returns 3 for the SlideQty. The AVG calculation of the TranAmt column for these three rows is 266.67: (500 + 50 + 250) / 3. The MIN of these three values (500, 50, 250) is still 50, and the MAX of these three values is still 500. And finally, the SUM (total) of these three values is 800. These are the values returned in the SlideAvg, SlideMin, SlideMax, and SlideTotal columns.

For the fourth row, we still have three rows visible in the subset of data; however, we have started our sliding (moving) aggregation window: the window starts with the second row and goes through the current (fourth) row. The COUNT calculation still sees that we are applying the function to only three rows, so it returns 3 in the SlideQty column. The AVG calculation of the TranAmt column for the three rows is applied over the values (50, 250, 75), which produces an average of 125: (50 + 250 + 75) / 3. The MIN of the three values is still 50, while the MAX of these three values is now 250. The SUM total of these three values is 375. Again, these are the values returned in the SlideAvg, SlideMin, SlideMax, and SlideTotal columns.

As we progress to the fifth row (TranAmt 125), the window slides again. We are still looking at only three rows (the third row through the fifth row), so SlideQty still returns 3. The other calculations are looking at the TranAmt values of 250, 75, and 125 for these three rows, so the AVG, MIN, MAX, and SUM calculations are 150, 75, 250, and 450. For the sixth row, the window again slides, and the calculations are recalculated for the new subset of data. For the seventh row, we now have the AccountId changing from 1 to 2. Since the query has a PARTITION BY clause set on the AccountId column, the calculations are reset.
The seventh row of the result set is the first row for this partition (AccountId 2), so the SlideQty is 1, and the other columns return the value of the TranAmt column for the AVG, MIN, MAX, and SUM calculations. The sliding window continues as defined above.

7-3. Calculating a Percentage of Total

Problem

With each row in your result set, you want to include the data needed to calculate what percentage of the total the row represents.

Solution

Use the SUM function with the OVER clause without specifying any ordering, so as to have each row return the total for that partition:

SELECT AccountId,
       TranDate,
       TranAmt,
       AccountTotal = SUM(TranAmt) OVER (PARTITION BY AccountId),
       AmountPct = TranAmt / SUM(TranAmt) OVER (PARTITION BY t.AccountId)
FROM #Transactions AS t

This query returns the following result set (AmountPct column truncated at 7 decimals for brevity):

AccountId  TranDate  TranAmt  AccountTotal  AmountPct

How It Works

When the SUM function is utilized with the OVER clause, and the OVER clause does not contain the ORDER BY clause, the SUM function returns the total amount for the partition. The current row's value can be divided by this total to obtain the percentage of the total that the current row represents. If the ORDER BY clause had been included, then a ROWS | RANGE clause would have been used; if one wasn't specified, then the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW would have been used, as shown in recipe 7-1. If you wanted to get the total for the entire result set instead of the total for each partition (in this example, AccountId), you would use:

SELECT AccountId,
       TranDate,
       TranAmt,
       Total = SUM(TranAmt) OVER (),
       AmountPct = TranAmt / SUM(TranAmt) OVER ()
FROM #Transactions AS t
ORDER BY AccountId, TranDate;

7-4. Calculating a Row X of Y

Problem

You want your result set to display a Row X of Y, where X is the current row number and Y is the total number of rows.

Solution

Use the ROW_NUMBER function to obtain the current row number, and the COUNT function with the OVER clause to obtain the total number of rows:

SELECT AccountId,
       TranDate,
       TranAmt,
       AcctRowID = ROW_NUMBER() OVER (PARTITION BY AccountId ORDER BY AccountId, TranDate),
       AcctRowQty = COUNT(*) OVER (PARTITION BY AccountId),
       RowID = ROW_NUMBER() OVER (ORDER BY AccountId, TranDate),
       RowQty = COUNT(*) OVER ()
FROM #Transactions AS t
ORDER BY AccountId, TranDate;

This query returns the following result set:

AccountId  TranDate  TranAmt  AcctRowID  AcctRowQty  RowID  RowQty

How It Works

The ROW_NUMBER function is used to get the current row number within a partition, and the COUNT function is used to get the total number of rows within a partition. Both the ROW_NUMBER and COUNT functions are used twice, once with a PARTITION BY clause and once without. The ROW_NUMBER function returns a sequential number (as ordered by the specified ORDER BY clause in the OVER clause) for each row that has the partition specified.

In the AcctRowID column, this is partitioned by AccountId, so the sequential numbering restarts upon each change in the AccountId column; in the RowID column, a PARTITION BY is not specified, so this returns a sequential number for each row within the entire result set. Likewise for the COUNT function: the AcctRowQty column is partitioned by the AccountId column, so this returns, for each row, the number of rows within this partition (AccountId). The RowQty column is not partitioned, so this returns the total number of rows in the entire result set. The corresponding columns (AcctRowID, AcctRowQty and RowID, RowQty) utilize the same PARTITION BY clause (or lack of one) in order to make the results meaningful.

For each row for AccountId = 1, the AcctRowID column returns a sequential number for each row, and the AcctRowQty column returns 6 (since there are 6 rows for this account). In a similar way, the RowID column returns a sequential number for each row in the result set, and the RowQty returns the total number of rows in the result set (13), since both of these are calculated without a PARTITION BY clause. For the first row where AccountId = 1, this will be row 1 of 6 within AccountId 1, and row 1 of 13 within the entire result set. The second row will be 2 of 6 and 2 of 13, and this proceeds through the remaining rows for this AccountId. When we get to AccountId = 2, the AcctRowID and AcctRowQty columns reset (due to the PARTITION BY clause) and return row 1 of 3 for the AccountId, and row 7 of 13 for the entire result set.

7-5. Using a Logical Window

Problem

You want the rows being considered by the OVER clause to be affected by the value in the column instead of the row positioning as determined by the ORDER BY clause in the OVER clause.

Solution

In the OVER clause, utilize the RANGE clause instead of the ROWS option:

CREATE TABLE #Test
(
    RowID INT IDENTITY,
    FName VARCHAR(20),
    Salary SMALLINT
);
INSERT INTO #Test (FName, Salary)
VALUES ('George', 800),
       ('Sam', 950),
       ('Diane', 1100),
       ('Nicholas', 1250),
       ('Samuel', 1250),      --<< duplicate value of above row
       ('Patricia', 1300),
       ('Brian', 1500),
       ('Thomas', 1600),
       ('Fran', 2450),
       ('Debbie', 2850),
       ('Mark', 2975),
       ('James', 3000),
       ('Cynthia', 3000),     --<< duplicate value of above row
       ('Christopher', 5000);

SELECT RowID,
       FName,
       Salary,
       SumByRows = SUM(Salary) OVER (ORDER BY Salary ROWS UNBOUNDED PRECEDING),
       SumByRange = SUM(Salary) OVER (ORDER BY Salary RANGE UNBOUNDED PRECEDING)
FROM #Test
ORDER BY RowID;

This query returns the following result set:

RowID  FName        Salary  SumByRows  SumByRange
1      George       800     800        800
2      Sam          950     1750       1750
3      Diane        1100    2850       2850
4      Nicholas     1250    4100       5350
5      Samuel       1250    5350       5350
6      Patricia     1300    6650       6650
7      Brian        1500    8150       8150
8      Thomas       1600    9750       9750
9      Fran         2450    12200      12200
10     Debbie       2850    15050      15050
11     Mark         2975    18025      18025
12     James        3000    21025      24025
13     Cynthia      3000    24025      24025
14     Christopher  5000    29025      29025

How It Works

When utilizing the RANGE clause, the SUM function adjusts its window based upon the values in the specified column. The window is sized upon the beginning- and ending-point boundaries specified; in this case, the beginning point of UNBOUNDED PRECEDING (the first row in the partition) was specified, and the default ending boundary of CURRENT ROW was used.

This example shows the salaries of your employees, and the SUM function is performing a running total of the salaries in order of the salary. For comparison purposes, the running total is being calculated with both the ROWS and RANGE clauses. Within this dataset, there are two groups of employees that have the same salary: RowIDs 4 and 5 are both 1,250, and 12 and 13 are both 3,000. When the running total is calculated with the ROWS clause, you can see that the salary of the current row is being added to the prior total of the previous rows. However, when the RANGE clause is used, all of the rows that contain the value of the current row are totaled and added to the total of the previous value. The result is that for rows 4 and 5, both employees with a salary of 1,250 are added together for the running total (and this action is repeated for rows 12 and 13).

Tip: If you need to perform running aggregations, and there is the possibility that you can have multiple rows with the same value in the columns specified by the ORDER BY clause, you should use the ROWS clause instead of the RANGE clause.

7-6. Generating an Incrementing Row Number

Problem

You need to have a query return total sales information. You need to include a row number for each row that corresponds to the order of the date of the purchase (so as to show the sequence of the transactions), and the numbering needs to start over for each account number.

Solution

Utilize the ROW_NUMBER function to assign row numbers to each row:

SELECT TOP 10 AccountNumber,
       OrderDate,
       TotalDue,
       ROW_NUMBER() OVER (PARTITION BY AccountNumber ORDER BY OrderDate) AS RowNumber
FROM AdventureWorks2014.Sales.SalesOrderHeader
ORDER BY AccountNumber;

This query returns the following result set:

AccountNumber  OrderDate  TotalDue  RowNumber

How It Works

The ROW_NUMBER function is utilized to generate a row number for each row in the partition. The PARTITION BY clause is utilized to restart the number generation for each change in the AccountNumber column. The ORDER BY clause is utilized to order the numbering of the rows by the value in the OrderDate column.

You can also utilize the ROW_NUMBER function to create a virtual numbers, or tally, table. (A numbers, or tally, table is simply a table of sequential numbers, and it can be utilized to eliminate loops. Use your favorite Internet search tool to find information about what a numbers or tally table is and how it can replace loops.)

For instance, the sys.all_columns system view has more than 8,000 rows. You can utilize this to easily build a numbers table with this code:

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM sys.all_columns;

This query will produce a row number for each row in the sys.all_columns view. In this instance, the ordering doesn't matter, but it is required, so the ORDER BY clause is specified as "(SELECT NULL)". If you need more records than what are available in this view, you can simply cross join it to itself, which will produce more than 64 million rows. In this example, a table scan is required. Another method is to produce the numbers or tally table by utilizing constants. The following example creates a one-million-row virtual tally table without incurring any disk I/O operations:

WITH TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
                  SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
                  SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
     THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),
     MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),
     TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM MILLIONS)
SELECT N
FROM TALLY;

7-7. Returning Rows by Rank

Problem

You want to calculate a ranking of your data based upon specified criteria. For instance, you want to rank your salespeople based upon their sales quotas on a specific date.

Solution

Utilize the RANK or DENSE_RANK functions to rank your salespeople:

SELECT BusinessEntityID,
       SalesQuota,
       RANK() OVER (ORDER BY SalesQuota DESC) AS RankWithGaps,
       DENSE_RANK() OVER (ORDER BY SalesQuota DESC) AS RankWithoutGaps,
       ROW_NUMBER() OVER (ORDER BY SalesQuota DESC) AS RowNumber
FROM Sales.SalesPersonQuotaHistory
WHERE QuotaDate = ' '
  AND SalesQuota < ;

This query returns the following result set:

BusinessEntityID  SalesQuota  RankWithGaps  RankWithoutGaps  RowNumber

How It Works

RANK and DENSE_RANK both assign a ranking value to each row within a partition. If multiple rows within the partition tie with the same value, they are assigned the same ranking value. When there is a tie, RANK assigns the following ranking value as if there had not been any ties, and DENSE_RANK assigns the next ranking value. If there are no ties in the partition, the ranking value assigned is the same as if the ROW_NUMBER function had been used with the same OVER clause definition.

In this example, we have eight rows returned, and the RowNumber column shows these rows with their sequential numbering. The fourth and fifth rows have the same SalesQuota value, so for both RANK and DENSE_RANK, these are ranked as 4. The sixth row has a different value, so it continues with the ranking values. It is with this row that we can see the difference between the functions: with RANK, the ranking continues with 6, which is the ROW_NUMBER that was assigned (as if there had not been a tie), while with DENSE_RANK, the ranking continues with 5, the next value in the ranking. With this example, we can see that RANK produces a gap between the ranking values when there is a tie, and DENSE_RANK does not. The decision of which function to utilize depends upon whether gaps are allowed or not. For instance, when ranking sports teams, you would want the gaps.

7-8. Sorting Rows into Buckets

Problem

You want to split your salespeople up into four groups based upon their sales quotas.

Solution

Utilize the NTILE function and specify the number of groups to divide the result set into:

SELECT BusinessEntityID,
       QuotaDate,
       SalesQuota,
       NTILE(4) OVER (ORDER BY SalesQuota DESC) AS [NTILE]
FROM Sales.SalesPersonQuotaHistory
WHERE SalesQuota BETWEEN AND ;

160 CHAPTER 7 WINDOWING FUNCTIONS This query produces the following result set: BusinessEntityID QuotaDate SalesQuota NTILE :00: :00: :00: :00: :00: :00: :00: :00: :00: :00: How It Works The NTILE function divides the result set into the specified number of groups based upon the partitioning and ordering specified in the OVER clause. Notice that the first two groups have three rows in each group, and the final two groups have two. If the number of rows in the result set is not evenly divisible by the specified number of groups, then the leading groups will have one extra row assigned to those groups until the remainder has been accommodated. Additionally, if you do not have as many buckets as were specified, all of the buckets will not be assigned Grouping Logically Consecutive Rows Together Problem You need to group logically consecutive rows together so that subsequent calculations can treat those rows identically. For instance, your manufacturing plant utilizes RFID tags to track the movement of your products. During the manufacturing process, a product may be rejected and sent back to an earlier part of the process to be corrected. You want to track the number of trips that a tag makes to an area. The manufacturing plant has four rooms. The first room has two sensors in it. An RFID tag is affixed to a part of the item being manufactured. As the item moves about room 1, the RFID tag affixed to it can be picked up by the different sensors. As long as the consecutive entries (when ordered by the time the sensor was read) for this RFID tag are in room 1, then this RFID tag is to be considered to be in its first trip to room 1. Once the RFID tag leaves room 1 and goes to room 2, the sensor in room 2 will pick up the RFID tag and place an entry into the database this will be the first trip into room 2 for this RFID tag. The RFID tag subsequently is moved into room 3, where the sensor in that room detects the RFID tag and places an entry into the database the first trip into room 3. While in room 3, the item is rejected and is sent back into room 2 for corrections. As it enters room 2, it is picked up by the sensor in room 2 and entered into the system. Since there is a different room between the two entries for room 2, the entries for room 2 are not consecutive, which makes this the second trip into room 2. Subsequently, when the item is corrected and is moved back 156

161 CHAPTER 7 WINDOWING FUNCTIONS into room 3, the sensor in room 3 enters a second entry for the item. Since the item was in room 2 between the two sensor readings in room 3, this is the second trip into room 3. The item subsequently is moved to room 4. What we are looking to produce from the query for this tag is: Tag # Room # Trip # This recipe will utilize the following data: CREATE TABLE #RFID_Location ( TagId INTEGER, Location VARCHAR(25), SensorReadTime DATETIME); INSERT INTO #RFID_Location (TagId, Location, SensorReadTime) VALUES (1, 'Room1', ' T08:00:01'), (1, 'Room1', ' T08:18:32'), (1, 'Room2', ' T08:25:42'), (1, 'Room3', ' T09:52:48'), (1, 'Room2', ' T10:05:22'), (1, 'Room3', ' T11:22:15'), (1, 'Room4', ' T14:18:58'), (2, 'Room1', ' T08:32:18'), (2, 'Room1', ' T08:51:53'), (2, 'Room2', ' T09:22:09'), (2, 'Room1', ' T09:42:17'), (2, 'Room1', ' T09:59:16'), (2, 'Room2', ' T10:35:18'), (2, 'Room3', ' T11:18:42'), (2, 'Room4', ' T15:22:18'); Solution The goal of this recipe is to introduce the concept of an island of data, where rows that are desired to be sequential are compared to other values to determine if they are in fact sequential. This is accomplished by utilizing two ROW_NUMBER functions, differing only in that one uses an additional column in the PARTITION BY clause. This gives us one ROW_NUMBER function returning a sequential number per RFID tag ( PARTITION BY TagId ), and the second ROW_NUMBER function returning a number that is desired to be sequential 157

162 CHAPTER 7 WINDOWING FUNCTIONS ( PARTITION BY TagId, Location ) The difference between these results will group logically consecutive rows together. See the following: WITH cte AS ( SELECT TagId, Location, SensorReadTime, ROW_NUMBER() OVER (PARTITION BY TagId ORDER BY SensorReadTime) - ROW_NUMBER() OVER (PARTITION BY TagId, Location ORDER BY SensorReadTime) AS Grp FROM #RFID_Location ) SELECT TagId, Location, SensorReadTime, Grp, DENSE_RANK() OVER (PARTITION BY TagId, Location ORDER BY Grp) AS TripNbr FROM cte ORDER BY TagId, SensorReadTime; This query returns the following result set: TagId Location SensorDate Grp TripNbr Room :00: Room :18: Room :25: Room :52: Room :05: Room :22: Room :18: Room :32: Room :51: Room :22: Room :42: Room :59: Room :35: Room :18: Room :22: How It Works This recipe i ntroduces the concept of islands, where the data is logically grouped together based upon the values in the rows. As long as the values are sequential, they are part of the same island. A gap in the values separates one island from another. Islands are created by subtracting a value from each row that is desired to be sequential for the ordering column(s) from a value from that row that is sequential for the ordering column(s). In this example, we utilized two ROW_NUMBER functions to generate these numbers (if the columns had contained either of these numbers, then the associated ROW_NUMBER function could have been removed and that column itself used instead). The first ROW_NUMBER function partitions the result set by the TagId and assigns the row number as ordered by the SensorDate. This provides us with the sequential numbering within the TagId. The second ROW_NUMBER function partitions the result set by the TagId and Location and assigns the row number, as ordered by the SensorDate. This provides us with the numbering that is desired to be sequential. The difference between these two calculations will assign consecutive rows in the same location to the same Grp number. The previous results show that consecutive entries in the same location 158

163 CHAPTER 7 WINDOWING FUNCTIONS are indeed assigned the same Grp number. The following query breaks down the ROW_NUMBER functions into individual columns so that you can see how this is performed: WITH cte AS ( SELECT TagId, Location, SensorReadTime, -- For each tag, number each sensor reading by its timestamp ROW_NUMBER()OVER (PARTITION BY TagId ORDER BY SensorReadTime) AS RN1, -- For each tag and location, number each sensor reading by its timestamp. ROW_NUMBER() OVER (PARTITION BY TagId, Location ORDER BY SensorReadTime) AS RN2 FROM #RFID_Location ) SELECT TagId, Location, SensorReadTime, -- Display each of the row numbers, -- Subtract RN2 from RN1 RN1, RN2, RN1-RN2 AS Grp FROM cte ORDER BY TagId, SensorReadTime; This query r eturns the following result set: TagId Location SensorDate RN1 RN2 Grp Room :00: Room :18: Room :25: Room :52: Room :05: Room :22: Room :18: Room :32: Room :51: Room :22: Room :42: Room :59: Room :35: Room :18: Room :22: With this query, you can see that for each TagId, the RN1 column is sequentially numbered from 1 to the total number of rows for that TagId. For the RN2 column, the Location is added to the PARTITION BY clause, resulting in the assigned row numbers being restarted every time the location changes. Let s walk through what is going on with TagId #1. For the first sensor reading, RN1 is 1 (the first reading for this tag). This sensor was located in Room1. For RN2, this is the first sensor reading for this Tag/Location. The difference between these two values is 0. For the second row, RN1 is 2 (the second reading for this tag). The sensor reading is still from Room1, so RN2 returns a 2. Again, the difference between these two values is 0. For the third row, this is the third reading for this tag, so RN1 is 3. This sensor reading is from Room2. Since RN2 is calculated with a PARTITION BY clause that includes the location, this resets the numbering and RN2 returns a 1. The difference between these two values is

164 CHAPTER 7 WINDOWING FUNCTIONS For the fourth row, this is the fourth reading for this tag, so RN1 is 4. This sensor reading is from Room3, so RN2 is reset again and returns a 1. The difference between the two values is 3. For the fifth row, RN1 will return 5. This sensor reading is from Room2, and looking at just the values for Room2, this is the second row for Room2, so RN2 will return a 2. The difference between these two values is 3. For the sixth row, RN1 will return 6. This is from the second time in Room3, so RN2 will return a 2. The difference between these two values is 4. For the seventh and last row, RN1 will return 7. This reading is from Room4 (the first reading from this location), so RN2 will return a 1. The difference between these two values is 6. In looking at the data sequentially, as long as we are in the same location, then the difference between the two values will be the same. A subsequent trip to this location, after having been picked up by a second location first, will return a value that is higher than this difference. If we were to have multiple return trips to a location, each time this difference would be a higher value than what was returned for the last time in this location. This difference does not need to be sequential at this stage (that will be handled in the next step); what is important is that a return trip to this location will generate a difference that is higher than the previous difference, and that multiple consecutive readings in the same location will generate the same difference. In considering this difference (the Grp column) for all of the rows within the same location, as long as this difference is the same, those rows with the same difference value are in the same trip to that location. If the difference changes for that location, then you are in a subsequent trip to this location. To handle calculating the trips, the DENSE_RANK function is utilized so that there will not be any gaps, using the ORDER BY clause against this difference (the Grp column). The following query takes the first example and adds in both the DENSE_RANK and RANK functions to illustrate the difference that these would have on the results: WITH cte AS ( SELECT TagId, Location, SensorReadTime, ROW_NUMBER() OVER (PARTITION BY TagId ORDER BY SensorReadTime) - ROW_NUMBER() OVER (PARTITION BY TagId, Location ORDER BY SensorReadTime) AS Grp FROM #RFID_Location ) SELECT TagId, Location, SensorReadTime, Grp, DENSE_RANK() OVER (PARTITION BY TagId, Location ORDER BY Grp) AS TripNbr, RANK() OVER (PARTITION BY TagId, Location ORDER BY Grp) AS TripNbrRank FROM cte ORDER BY TagId, SensorReadTime; This query returns the following result set: TagId Location SensorDate Grp TripNbr TripNbrRank Room :00: Room :18: Room :25: Room :52: Room :05: Room :22: Room :18: Room :32: Room :51: Room :22:

165 CHAPTER 7 WINDOWING FUNCTIONS 2 Room :42: Room :59: Room :35: Room :18: Room :22: In this result, the first two rows are both in Room1, and they both produced the Grp value of 0, so they are both considered as Trip1 for this location. For the next two rows, the tag was in locations Room2 and Room3. These were both the first times in these locations, so each of these is considered as Trip1 for their respective locations. You can see that both the RANK and DENSE_RANK functions produced this value. For the fifth row, the tag was moved back into Room2. This produced the Grp value of 3. This location had a previous Grp value of 2, so this is a different island for this location. Since this is a higher value, its RANK and DENSE_RANK value is 2, indicating the second trip to this location. You can follow this same logic for the remaining rows for this tag. When we move to the second tag, you can see how the RANK function returns the wrong trip number for TagId 2 for the second trip to Room1 (the fourth and fifth rows for this tag). Since in this example we are looking for no gaps, DENSE_RANK would be the proper function to use, and we can see that DENSE_RANK did return that this is trip 2 for that location Accessing Values from Other Rows Problem You need to write a sales summary report that shows the total due from orders by year and quarter. You want to include a difference between the current quarter and prior quarter, as well as a difference between the current quarter of this year and the same quarter of the previous year. Solution Aggregate the total due by year and quarter, and utilize the LAG function to look at the previous records: WITH cte AS ( -- Break the OrderDate down into the Year and Quarter SELECT DATEPART(QUARTER, OrderDate) AS Qtr, DATEPART(YEAR, OrderDate) AS Yr, TotalDue FROM Sales.SalesOrderHeader ), cteagg AS ( -- Aggregate the TotalDue, Grouping on Year and Quarter SELECT Yr, Qtr, SUM(TotalDue) AS TotalDue FROM cte GROUP BY Yr, Qtr ) 161

166 CHAPTER 7 WINDOWING FUNCTIONS SELECT Yr, Qtr, TotalDue, -- Get the total due from the prior quarter TotalDue - LAG(TotalDue, 1, NULL) OVER (ORDER BY Yr, Qtr) AS DeltaPriorQtr, -- Get the total due from 4 quarters ago. -- This will be for the prior Year, same Quarter. TotalDue - LAG(TotalDue, 4, NULL) OVER (ORDER BY Yr, Qtr) AS DeltaPriorYrQtr FROM cteagg ORDER BY Yr, Qtr; This query returns the following result set: Yr Qtr TotalDue DeltaPriorQtr DeltaPriorYrQtr NULL NULL NULL NULL NULL How It Works The first CTE is utilized to retrieve the year and quarter from the OrderDate column and to pass the TotalDue column to the rest of the query. The second CTE is used to aggregate the TotalDue column, grouping on the extracted Yr and Qtr columns. The final SELECT statement returns these aggregated values and then makes two calls to the LAG function. The first call retrieves the TotalDue column from the previous row in order to compute the difference between the current quarter and the previous quarter. The second call retrieves the TotalDue column from four rows prior to the current row in order to compute the difference between the current quarter and the same quarter one year ago. The syntax for the LAG and LEAD functions is as follows: LAG LEAD (scalar_expression [,offset] [,default]) OVER ( [ partition_by_clause ] order_by_clause ) The scalar_expression is an expression of any type that returns a scalar value (typically a column), offset is the number of rows to offset the current row by, and default is the value to return if the value returned is NULL. The default value for offset is 1, and the default value for default is NULL. 162
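If the NULL values in the first rows of the DeltaPriorQtr and DeltaPriorYrQtr columns are undesirable, the third argument of LAG (and LEAD) can supply a default instead. The following is a minimal, self-contained sketch using a handful of hypothetical quarterly totals rather than the AdventureWorks data above; note that a default of 0 makes the first delta equal to that row's own TotalDue, which may or may not be what a report should show.

-- LAG/LEAD with an explicit default: rows that have no preceding (or
-- following) row within the window return 0 instead of NULL.
WITH Quarters (Yr, Qtr, TotalDue) AS
(
    SELECT * FROM (VALUES (2013, 1, 100.00), (2013, 2, 150.00),
                          (2013, 3, 125.00), (2013, 4, 175.00),
                          (2014, 1, 110.00)) AS v (Yr, Qtr, TotalDue)
)
SELECT Yr, Qtr, TotalDue,
       TotalDue - LAG(TotalDue, 1, 0) OVER (ORDER BY Yr, Qtr) AS DeltaPriorQtr,
       LEAD(TotalDue, 1, 0) OVER (ORDER BY Yr, Qtr) AS NextQtrTotalDue
FROM Quarters
ORDER BY Yr, Qtr;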

167 7-11. Finding Gaps in a Sequence of Numbers Problem CHAPTER 7 WINDOWING FUNCTIONS You have a table with a series of numbers that has gaps in the series. You want to find these gaps. Solution Utilize the LEAD function in order to compare the next row with the current row to look for a gap: CREATE TABLE #Gaps (col1 INTEGER PRIMARY KEY CLUSTERED); INSERT INTO #Gaps (col1) VALUES (1), (2), (3), (50), (51), (52), (53), (54), (55), (100), (101), (102), (500), (950), (951), (952), (954); -- Compare the value of the current row to the next row. -- If > 1, then there is a gap. WITH cte AS ( SELECT col1 AS CurrentRow, LEAD(col1, 1, NULL) OVER (ORDER BY col1) AS NextRow FROM #Gaps ) SELECT cte.currentrow + 1 AS [Start of Gap], cte.nextrow - 1 AS [End of Gap] FROM cte WHERE cte.nextrow - cte.currentrow > 1; This query returns the following result set: Start of Gap End of Gap How It Works The LEAD function works in a similar manner to the LAG function, which was covered in the previous recipe. In this example, a table is created that has gaps in the column. The table is then queried, comparing the value in the current row to the value in the next row. If the difference is greater than 1, then a gap exists and is returned in the result set. 163

168 CHAPTER 7 WINDOWING FUNCTIONS To explain this in further detail, let s look at all of the rows, with the next row being returned: SELECT col1 AS CurrentRow, LEAD(col1, 1, NULL) OVER (ORDER BY col1) AS NextRow FROM #Gaps; This query returns the following result set: CurrentRow NextRow NULL For the current row of 1, we can see that the next value for this column is 2. For the current row value of 2, the next value is 3. For the current row value of 3, the next value is 50. At this point, we have a gap. Since we have the values of 3 and 50, the gap is from 4 through 49 or, as is coded in the first query, CurrentRow +1 to NextRow 1. Adding the WHERE clause for where the difference is greater than 1 results in only the rows with a gap being returned Accessing the First or Last Value from a Par tition Problem You need to write a report that shows, for each customer, the date that they placed their least and most expensive orders. Solution Utilize the FIRST_VALUE and LAST_VALUE functions : SELECT DISTINCT TOP (5) CustomerID, -- Get the date for the customer's least expensive order FIRST_VALUE(OrderDate) 164

169 CHAPTER 7 WINDOWING FUNCTIONS OVER (PARTITION BY CustomerID ORDER BY TotalDue ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OrderDateLow, -- Get the date for the customer's most expensive order LAST_VALUE(OrderDate) OVER (PARTITION BY CustomerID ORDER BY TotalDue ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS OrderDateHigh FROM Sales.SalesOrderHeader ORDER BY CustomerID; This query returns the following result set for the first five customers: CustomerID OrderDateLow OrderDateHigh :00: :00: :00: :00: :00: :00: :00: :00: :00: :00: How It Works The FIRST_VALUE and LAST_VALUE functions are used to return a scalar expression (typically a column) from the first and last rows in the partition; in this example they are returning the OrderDate column. The window is set to a partition of the CustomerID, ordered by the TotalDue, and the ROWS clause is used to specify all of the rows for the partition. The syntax for the FIRST_VALUE and LAST_VALUE functions is as follows: FIRST_VALUE LAST_VALUE ( scalar_expression ) OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] ) where scalar_expression is an expression of any type that returns a scalar value (typically a column). Let s prove that this query is returning the correct results by examining the data for the first customer: SELECT CustomerID, TotalDue, OrderDate FROM Sales.SalesOrderHeader WHERE CustomerID = ORDER BY TotalDue; CustomerID TotalDue OrderDate :00: :00: :00: With these results, you can easily see that the date for the least expensive order was , and the date for the most expensive order was This matches up with the data returned in the previous query. 165
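One point worth emphasizing is why the solution spells out ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. When an OVER clause has an ORDER BY but no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE would simply return the date from the current row (or from its last peer with the same TotalDue), while FIRST_VALUE is unaffected. The following self-contained sketch, built on a few hypothetical orders for one customer, shows the difference.

-- With the default frame, LAST_VALUE returns the current row's OrderDate;
-- with an explicit frame covering the whole partition, it returns the date
-- of the most expensive order, as intended.
WITH Orders (CustomerID, OrderDate, TotalDue) AS
(
    SELECT * FROM (VALUES (1, CAST('2014-01-15' AS DATE),  50.00),
                          (1, CAST('2014-02-10' AS DATE), 125.00),
                          (1, CAST('2014-03-02' AS DATE), 200.00))
                  AS v (CustomerID, OrderDate, TotalDue)
)
SELECT CustomerID, OrderDate, TotalDue,
       LAST_VALUE(OrderDate) OVER (PARTITION BY CustomerID
                                   ORDER BY TotalDue) AS DefaultFrame,
       LAST_VALUE(OrderDate) OVER (PARTITION BY CustomerID
                                   ORDER BY TotalDue
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING) AS WholePartition
FROM Orders
ORDER BY TotalDue;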

170 CHAPTER 7 WINDOWING FUNCTIONS Calculating the Relative Position or Rank of a Value within a Set of Values Problem You want to know the relative position and rank of a customer s order by the total of the order in respect to the total of all of the customers orders. Solution Utilize the CUME_DIST and PERCENT_RANK functions to obtain the relative position and the relative rank of a value: SELECT CustomerID, TotalDue, CUME_DIST() OVER (PARTITION BY CustomerID ORDER BY TotalDue) AS CumeDistOrderTotalDue, PERCENT_RANK() OVER (PARTITION BY CustomerID ORDER BY TotalDue) AS PercentRankOrderTotalDue FROM Sales.SalesOrderHeader WHERE CustomerID IN (11439, 30116) ORDER BY CustomerID, TotalDue; This code returns the following result set: CustomerID TotalDue CumeDistOrderTotalDue PercentRankOrderTotalDue How It Works The CUME_DIST functi on returns the cumulative distribution of a value within a set of values (that is, the relative position of a specific value within a set of values), while the PERCENT_RANK function returns the relative rank of a value in a set of values (that is, the relative standing of a value within a set of values). NULL values will be included, and the value returned will be the lowest possible value. There are two basic differences between these functions first, CUME_DIST checks to see how many values are less than or equal 166

171 CHAPTER 7 WINDOWING FUNCTIONS to the current value, while PERCENT_RANK checks to see how many values are less than the current value only. Secondly, CUME_DIST divides this number by the number of rows in the partition, while PERCENT_RANK divides this number by the number of other rows in the partition. The syntax of these functions is as follows: CUME_DIST() PERCENT_RANK( ) OVER ( [ partition_by_clause ] order_by_clause ) The result returned by CUME_DIST will be a float(53) data type, with the value being greater than 0 and less than or equal to 1 (0 < x <= 1). CUME_DIST returns a percentage defined as the number of rows with a value less than or equal to the current value, divided by the total number of rows within the partition. PERCENT_RANK also returns a float(53) data type, and the value being returned will be greater than or equal to 0, and less than or equal to 1 (0 <= x <= 1). PERCENT_RANK returns a percentage defined as the number of rows with a value less than the current row divided by the number of other rows in the partition. The first value returned by PERCENT_RANK will always be zero, since there will always be zero rows with a smaller value, and zero divided by anything will always be zero. In examining the results from this query, we see that for the first row for the first CustomerID, the TotalDue value is For CUME_DIST, there is 1 row that has this value or less, and there are 6 total rows, so 1/6 = For PERCENT_RANK, there are 0 rows that have a value lower than this, and there are 5 other rows, so 0/5 = 0. Regarding the second row s ( TotalDue value of ) CUME_DIST column, there are 2 rows with this value or less, which will return a CUME_DIST value of 2/6, or For PERCENT_RANK, there is 1 row with a value lower than this TotalDue value, and there are 5 other rows, so this will return a PERCENT_RANK value of 1/5, or 0.2. When we get down to t he fourth row, we see that the fourth and fifth rows have the same TotalDue value. For CUME_DIST, there are 5 rows with this value or less, so 5/6 = for both of these rows. For PERCENT_RANK, for both rows, there are 3 rows with a value less than the current value, so 3/5 = 0.6 for both rows. Note that for PERCENT_RANK, we are counting the number of other rows that are not the current row, not the number of other rows with a different value Calculating Continuous or Discrete Percentiles Problem You want to see both the median salary and the 75 th percentile salary for all employees per department. Solution Utilize the PERCENTILE_CONT and PERCENTILE_DISC functions to return percentile calculations based upon a value at a specified percentage: TABLE ( EmplId INT PRIMARY KEY CLUSTERED, DeptId INT, Salary NUMERIC(8, 2) ); 167

172 CHAPTER 7 WINDOWING FUNCTIONS INSERT VALUES (1, 1, 10000), (2, 1, 11000), (3, 1, 12000), (4, 2, 25000), (5, 2, 35000), (6, 2, 75000), (7, 2, ); SELECT EmplId, DeptId, Salary, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId) AS MedianCont, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId) AS MedianDisc, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId) AS Percent75Cont, PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId) AS Percent75Disc, CUME_DIST() OVER (PARTITION BY DeptId ORDER BY Salary) AS CumeDist ORDER BY DeptId, EmplId; This query returns the following result set: EmplId DeptId Salary MedianCont MedianDisc Percent75Cont Percent75Disc CumeDist

173 How It Works CHAPTER 7 WINDOWING FUNCTIONS PERCENTILE_CONT calc ulates a percentile based upon a continuous distribution of values of the specified column, while PERCENTILE_DISC calculates a percentile based upon a discrete distribution of the column values. The syntax for these functions is as follows: PERCENTILE_CONT ( numeric_literal) PERCENTILE_DISC ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC DESC ] ) OVER ( [ <partition_by_clause> ] ) For PERCENTILE_CONT, this is performed by using the specified percentile value (SP) and the number of rows in the partition (N), and by computing the row number of interest (RN) after the ordering has been applied. The row number of interest is computed from the formula RN = (1 + (SP * (N 1))). The result returned is the average of the values from the rows at CRN = CEILING(RN) and FRN = FLOOR(RN). The value returned may or may not exist in the partition being analyzed. In looking at DeptId 1, with the specified percentile of 50%, we see that the RN = (1 + (0.5 * (3 1))). Working from the inside out, this goes to (1 + (0.5 * 2)), then to (1 + 1), with a final result of 2. The CRN and FRN of this value is the same: 2. When we look at DeptId 2, we see it has 4 rows. This changes the calculation to (1 + (0.5 * (4 1))), to (1 + (0.5 * 3)) to ( ) to 2.5. In this case, the CRN of this value is 3, and the FRN of this value is 2. When we use the 75th percentile, for DeptId 1 we get (1 + (.75 * (3 1))), which evaluates to RN = 2.5, CRN = 3 and FRN = 2. For DeptID 2, we get (1 + (.75 * (4 1))), which evaluates to RN = 3.25, CRN = 4, and FRN = 3. The next step is to return a linear interpolation of the values at these two row numbers. If CRN = FRN = RN, then return the value at RN. Otherwise, use the calculation ((CRN RN) * (value at FRN)) + ((RN FRN) * (value at CRN)). Starting with DeptId 1, for the 50th percentile, since CRN = FRN = RN, the value at RN (11,000) is returned. For the 75th percentile, the values of interest are those at rows 2 and 3. The more complicated calculation is used: ((3 2.5) * 11000) + ((2.5 2) * 12000) = (.5 * 11000) + (.5 * 12000) = ( ) = Notice that this value does not exist in the data set. When we evaluate DeptId 2, at the 50% percentile, we are looking at rows 2 and 3. The linear interpolation of these values is ((3 2.5) * 35000) + ((2.5 2) * 75000) = (.5 * 35000) + (.5 * 75000) = ( ) = For the 75% percentile, we are looking at rows 3 and 4. The linear interpolation of these values is ((4 3.25) * 75000) + ((3.25-3) * ) = (.75 * 75000) + (.25 * ) = ( ) = Again, notice that neither of these values exists in the data set. For PERCENTILE_DISC, and for the specified percentile (P), the values of the partition are sorted, and the value returned will be from the row with the smallest CUME_DIST value (with the same ordering) that is greater than or equal to P. The value returned will exist in one of the rows in the partition being analyzed. Since the result for this function is based on the CUME_DIST value, that function was included in the previous query in order to show its value. In the example, PERCENTILE_DISC(0.5) is utilized to obtain the median value. For DeptId = 1, there are three rows, so the CUME_DIST is split into thirds. The row with the smallest CUME_DIST value that is greater than or equal to the specified value is the middle row (0.667), so the median value is the value from the middle row (after sorting), or For DeptId = 2, there are four rows, so the CUME_DIST is split into fourths. 
For the second row, its CUME_DIST value matches the specified percentile, so the value used is the value from that row. When looking at the 75th percentile, for DeptId 1 the row with the smallest CUME_DIST that is greater than or equal to.75 is the last row, which has a CUME_DIST value of 1, so the salary value from that row (12000) is returned for each row. For DeptId 2, the third row has a CUME_DIST that matches the specified percentile, so the salary value from that row (75000) is returned for each row. Notice that PERCENTILE_DISC always returns a value that exists in the partition. 169
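Because PERCENTILE_CONT and PERCENTILE_DISC are window functions, they repeat the computed value on every employee row. If you only want one row per department, a common approach is to wrap the calculation with DISTINCT. The sketch below assumes a table shaped like the one used in this recipe; the name #EmployeeSalary is a stand-in, since only the DeptId and Salary columns are needed.

-- One median row per department instead of one per employee.
SELECT DISTINCT
       DeptId,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary)
           OVER (PARTITION BY DeptId) AS MedianCont,
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Salary)
           OVER (PARTITION BY DeptId) AS MedianDisc
FROM #EmployeeSalary;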

174 CHAPTER 7 WINDOWING FUNCTIONS Assigning Sequences in a Specified Order Problem You are inserting multiple student grades into a table. Each record needs to have a sequence assigned, and you want the sequences to be assigned in order of the grades. Solution Utilize the OVER clause of the NEXT VALUE FOR function, specifying the desired order. IF EXISTS (SELECT * FROM sys.sequences AS seq JOIN sys.schemas AS sch ON seq.schema_id = sch.schema_id WHERE sch.name = 'dbo' AND seq.name = 'CH7Sequence') DROP SEQUENCE dbo.ch7sequence; CREATE SEQUENCE dbo.ch7sequence AS INTEGER START WITH 1; TABLE ( StudentID TINYINT, Grade TINYINT, SeqNbr INTEGER ); INSERT (StudentId, Grade, SeqNbr) SELECT StudentId, Grade, NEXT VALUE FOR dbo.ch7sequence OVER (ORDER BY Grade ASC) FROM (VALUES (1, 100), (2, 95), (3, 85), (4, 100), (5, 99), (6, 98), (7, 95), (8, 90), (9, 89), (10, 89), (11, 85), (12, 82)) dt(studentid, Grade); SELECT StudentId, Grade, SeqNbr 170

175 CHAPTER 7 WINDOWING FUNCTIONS This query returns the following result set: StudentID Grade SeqNbr How It Works The optional OVER clause of the NEXT VALUE FOR function is utilized to specify the order in which the sequence should be applied. The syntax is as follows: NEXT VALUE FOR [ database_name. ] [ schema_name. ] sequence_name [ OVER (<over_order_by_clause>) ] Sequences are used to create an incrementing number. While similar to an identity column, they are not bound to any table, can be reset, and can be used across multiple tables. Sequences are discussed in detail in recipe Sequences are assigned by calling the NEXT VALUE FOR function, and multiple values can be assigned simultaneously. The order of these assignments can be controlled by the use of the optional OVER clause of the NEXT VALUE FOR function. 171
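As noted above, one advantage of sequences over identity columns is that they can be reset. A short sketch, reusing the dbo.CH7Sequence object created in this recipe:

-- Restart the sequence so that the next NEXT VALUE FOR call returns 1 again.
ALTER SEQUENCE dbo.CH7Sequence RESTART WITH 1;

SELECT NEXT VALUE FOR dbo.CH7Sequence AS NextVal;  -- returns 1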

176 Pro SQL Server Wait Statistics Enrico van de Laar

177 Pro SQL Server Wait Statistics Enrico van de Laar De Nijverheid Drachten, The Netherlands ISBN-13 (pbk): ISBN-13 (electronic): DOI / Library of Congress Control Number: Copyright 2015 by Enrico van de Laar This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Chris Presley Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: April Rondeau Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. 
For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to Readers can also access source code at SpringerLink in the Supplementary Material section for each chapter. Printed on acid-free paper

178 Contents at a Glance About the Author... xvii About the Technical Reviewer... xix Acknowledgments... xxi Introduction... xxiii Part I: Foundations of Wait Statistics Analysis... 1 Chapter 1: Wait Statistics Internals... 3 Chapter 2: Querying SQL Server Wait Statistics Chapter 3: Building a Solid Baseline Part II: Wait Types Chapter 4: CPU-Related Wait Types Chapter 5: IO-Related Wait Types Chapter 6: Backup-Related Wait Types Chapter 7: Lock-Related Wait Types Chapter 8: Latch-Related Wait Types Chapter 9: High-Availability and Disaster-Recovery Wait Types Chapter 10: Preemptive Wait Types Chapter 11: Background and Miscellaneous Wait Types iii

179 Contents at a Glance Chapter 12: In-Memory OLTP Related Wait Types Appendix I: Example SQL Server Machine Configurations Appendix II: Spinlocks Appendix III: Latch Classes Index iv

180 Chapter 1 Wait Statistics Internals SQL Server Wait Statistics are an important tool you can use to analyze performancerelated problems or to optimize your SQL Server s performance. They are, however, not that well known to many database administrators or developers. I believe this has to do with their relatively complex nature, the sheer volume of the different types of Wait Statistics, and the lack of documentation for many types of Wait Statistics. Wait Statistics are also directly related to the SQL Server you are analyzing them on, which means that it is impossible to compare the Wait Statistics of Server A to the Wait Statistics of Server B, even if they had an identical hardware and database configuration. Every configuration option, from the hardware firmware level to the configuration of the SQL Server Native Client on the client computers, will have an impact on the Wait Statistics! For the reasons just mentioned, I firmly believe we should start with the foundation and internals of SQL Server Wait Statistics so you can get familiar with how they are generated, how you can access them, and how you can use them for performance troubleshooting. This approach will get you ready for Part II of this book, where we will examine specific Wait Statistics. In this chapter we will take a brief look at the history of Wait Statistics through the various versions of SQL Server. Following that, we will take a close look at the SQL Operating System, or SQLOS. The architecture of the SQLOS is closely tied to Wait Statistics and to performance troubleshooting in general. The rest of the chapter is dedicated to one of the most important aspects of Wait Statistics thread scheduling. Before we begin with the foundation and internals of SQL Server Wait Statistics, I would like to mention a few things related to the terminology used when discussing Wait Statistics. In the introduction of this book and the paragraphs above, I only mentioned the term Wait Statistics. The sentence compare the Wait Statistics of Server A to the Wait Statistics of Server B is actually wrong, since we can only compare the Wait Time (the total time we have been waiting on a resource) of a specific Wait Type (the specific Wait Type related to the resource we are waiting on). From this point on, when I use the term Wait Statistics I mean the concept of Wait Statistics, and I will use the correct terms Wait Time and Wait Type where appropriate. Electronic supplementary material The online version of this chapter (doi: / _1) contains supplementary material, which is available to authorized users. Enrico van de Laar 2015 E. van de Laar, Pro SQL Server Wait Statistics, DOI / _1 3

181 Chapter 1 Wait Statistics Internals A Brief History of Wait Statistics SQL Server has been around for quite some time now; the first release of SQL Server dates back to 1989 and was released for the OS/2 platform. Until SQL Server 6.0, released in 1995, Microsoft worked together with Sybase to develop SQL Server. In 1995, however, Microsoft and Sybase went their own ways. Microsoft and Sybase stayed active in the database world (SAP actually acquired Sybase in 2010), and in 2014 Microsoft released SQL Server 2014 and SAP released SAP Sybase ASE 16, both relational enterprise-level database systems. Between SQL Server 6.0 and SQL Server 2014, so many things have changed that you simply cannot compare the two versions. One thing that hasn't changed in all these years is Wait Statistics. In one way or another, SQL Server stores information about its internal processes, and even though the way we access that information has changed over the years, Wait Statistics remain an important part of the internal logging process. In early versions of SQL Server we needed to access the Wait Statistics using undocumented commands. Figure 1-1 shows how you would query Wait Statistics information in SQL Server 6.5 using the DBCC command. Figure 1-1. SQL Server Wait Statistics in SQL Server 6.5 One of the big changes that was introduced in SQL Server 2005 was the conversion of many internal functions and commands into Dynamic Management Views (DMV), including Wait Statistics information. This made it far easier to query and analyze the information returned by functions or commands. A new way of performance analysis was born with the release of the SQL Server 2005 Microsoft whitepaper SQL Server 2005 Waits and Queues by Tom Davidson. 4
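To give a sense of how much simpler the DMV route is, the sketch below queries the aggregated wait information that SQL Server 2005 and later expose through sys.dm_os_wait_stats; the counters accumulate from the moment the instance starts (or from the last time they were manually cleared), and this DMV is examined in detail in Chapter 2.

-- Aggregated wait information per Wait Type, ordered by total Wait Time.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;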

182 Chapter 1 Wait Statistics Internals In the various releases of SQL Server, the number of different Wait Types grew exponentially whenever new features or configuration options were introduced. If you take a good look at Figure 1-1 you will notice that 21 different Wait Types were returned. Figure 1-2 shows the number of Wait Types, as the number of rows returned, available in SQL Server 2014. Figure 1-2. SQL Server Wait Statistics in SQL Server 2014 Those 771 rows are all different Wait Types and hold wait information for different parts of the SQL Server engine. The number of Wait Types will continue to grow in future SQL Server releases, as new features are introduced or existing features are changed. Thankfully, there is a lot more information available about Wait Statistics now than there was in SQL Server 6.5! The SQLOS The world of computer hardware changes constantly. Every year, or in some cases every month, we manage to put more cores inside a processor, increase the memory capacity of mainboards, or introduce entirely new hardware concepts like PCI-based persistent flash storage. Database Management Systems (or DBMSs) are always one of the first types of applications that want to take advantage of new hardware trends. Because of the fast-changing nature of hardware and the need to utilize new hardware options as soon as they become available, the SQL Server team decided to change the SQL Server platform layer in SQL Server 2005. 5

183 Chapter 1 Wait Statistics Internals Before SQL Server 2005, the platform layer of SQL Server was pretty restricted, and many operations were performed by the operating system. This meant that it was difficult for SQL Server to keep up with the fast-changing world of server hardware, as changing a complete operating system in order to utilize faster hardware or new hardware features is a time-consuming and complex operation. Figure 1-3 shows the (simplified) architecture of SQL Server before the introduction of the SQLOS in SQL Server Figure 1-3. SQL Server architecture before the introduction of the SQLOS SQL Server 2005 introduced one of the biggest changes to the SQL Server engine seen to this day, the SQLOS. This is a completely new platform layer that functions as a user-level operating system. This new operating system has made it possible to fully utilize current and future hardware and has enabled features like advanced parallelism. The SQLOS is highly configurable and adjusts itself to the hardware it is running on, thus making it perfectly scalable for high-end or low-end systems alike. Figure 1-4 shows the (simplified) architecture of SQL Server 2005, including the SQLOS layer. 6

185 Chapter 1 Wait Statistics Internals There are some exceptions when the SQLOS cannot use non-preemptive scheduling, for instance, when the SQLOS needs to access a resource through the Windows operating system. We will discuss these exceptions later in this book in the Preemptive Wait Types chapter. Schedulers, Tasks, and Worker Threads Because the SQLOS uses a different method to execute requests than the Windows operating system uses, SQL Server introduced a different way to schedule processor time using schedulers, tasks, and worker threads. Figure 1-5 shows the different parts of SQL Server scheduling and how they relate to each other. Figure 1-5. SQL Server scheduling 8

185 Chapter 1 Wait Statistics Internals There are some exceptions when the SQLOS cannot use non-preemptive scheduling, for instance, when the SQLOS needs to access a resource through the Windows operating system. We will discuss these exceptions later in this book in the Preemptive Wait Types chapter. Schedulers, Tasks, and Worker Threads Because the SQLOS uses a different method to execute requests then the Windows operating system uses, SQL Server introduced a different way to schedule processor time using schedulers, tasks, and worker threads. Figure 1-5 shows the different parts of SQL Server scheduling and how they relate to each other. Figure 1-5. SQL Server scheduling 8

186 Sessions Chapter 1 Wait Statistics Internals A session is the connection a client has to the SQL Server it is connected to (after it has been successfully authenticated). We can easily access session information by querying the sys.dm_exec_sessions DMV using the following query: SELECT * FROM sys.dm_exec_sessions; Generally speaking, user sessions will have a session_id higher than 50; everything lower is reserved for internal SQL Server processes. However, on very busy servers there is a possibility that SQL Server needs to use a session_id higher than 50. If you are only interested in information about user-initiated sessions, it is better to filter the results of the sys.dm_exec_sessions DMV using the is_user_process column instead of filtering on a session_id greater than 50. The following query will only return user sessions and will filter out the internal system sessions: SELECT * FROM sys.dm_exec_sessions WHERE is_user_process = 1; Figure 1-6 shows a small part of the results of this query. Figure 1-6. sys.dm_exec_sessions results There are many more columns returned by the sys.dm_exec_sessions DMV that will give us information about the specific session. Some of the more interesting columns that deserve some extra explanation are the host_process_id, which is the Process ID (or PID) of the client program connected to the SQL Server. The cpu_time column will give you information about the processor time (in milliseconds) the session has used since it was first established. The memory_usage column displays the amount of memory used by the session. This is not the amount in MB or KB, but the number of 8 KB pages used. Another column I would like to highlight is the status column. This will show you if the session has any active requests. The most common values of the status column are Running, which indicates that one or more requests are currently being processed from this session, and Sleeping, which means no requests are currently being processed from this session. 9

188 Chapter 1 Wait Statistics Internals When you query the sys.dm_os_tasks DMV you will discover it will return many results, even on servers that have no user activity. This is because SQL Server uses tasks for its own processes as well; you can identify those by looking at the session_id column. There are some interesting columns in this DMV that are worth exploring to see the relations between the different DMVs. The task_address column will show you the memory address of the task. The session_id will return the ID of the session that has requested the task, and the worker_address will hold the memory address of the worker thread associated with the task. Worker Threads Worker threads are where the actual work for the request is being performed. Every task that gets created will get a worker thread assigned to it, and the worker thread will then perform the actions requested by the task. A worker thread will actually not perform the work itself; it will request a thread from the Windows operating system to perform the work for it. For the sake of simplicity, and the fact that the actual Windows thread runs outside the SQLOS, I have left this step out of Figure 1-5. You can access information about the Windows operating system threads by querying sys.dm_os_threads if you are interested. When a task requests a worker thread, SQL Server will look for an idle worker thread and assign it to the task. If no idle worker thread can be located and the maximum number of worker threads has been reached, the request will be queued until a worker thread finishes its current work and becomes available. There is a limit to the number of worker threads SQL Server has available for processing requests. This number will be automatically calculated and configured by SQL Server during the installation. We can also calculate the maximum number of worker threads ourselves using these formulas: 32-bit system with less than, or equal to, 4 logical processors: 256 worker threads; 32-bit system with more than 4 logical processors: 256 + ((number of logical processors - 4) * 8); 64-bit system with less than, or equal to, 4 logical processors: 512 worker threads; 64-bit system with more than 4 logical processors: 512 + ((number of logical processors - 4) * 16). Example: If we have a 64-bit system with 16 processors (or cores) we can calculate the maximum number of worker threads using the formula 512 + ((16 - 4) * 16), which would give us a maximum of 704 worker threads. 11
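Rather than working the formula out by hand, you can also ask a running instance what it calculated. A sketch using the sys.dm_os_sys_info DMV:

-- The number of logical processors SQL Server sees and the maximum number
-- of worker threads it calculated (or the configured override).
SELECT cpu_count,
       max_workers_count
FROM sys.dm_os_sys_info;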

188 Chapter 1 Wait Statistics Internals When you query the sys.dm_os_tasks DMV you will discover it will return many results, even on servers that have no user activity. This is because SQL Server uses tasks for its own processes as well; you can identify those by looking at the session_id column. There are some interesting columns in this DMV that are worth exploring to see the relations between the different DMVs. The task_address column will show you the memory address of the task. The session_id will return the ID of the session that has requested the task, and the worker_address will hold the memory address of the worker thread associated with the task. Worker Threads Worker threads are where the actual work for the request is being performed. Every task that gets created will get a worker thread assigned to it, and the worker thread will then perform the actions requested by the task. A worker thread will actually not perform the work itself; it will request a thread from the Windows operating system to perform the work for it. For the sake of simplicity, and the fact the actual Windows thread runs outside the SQLOS, I have left this step out of Figure 1-5. You can access information about the Windows operating system threads by querying sys.dm_os_threads if you are interested. When a task requests a worker thread SQL Server will look for an idle worker thread and assign it to the task. In the case when no idle worker thread can be located and the maximum amount of worker threads has been reached, the request will be queued until a worker thread finishes its current work and becomes available. There is a limit to the number of worker threads SQL Server has available for processing requests. This number will be automatically calculated and configured by SQL Server during the installation. We can also calculate the maximum amount of worker threads ourselves using these formulas: 32-bit system with less than, or equal to, 4 logical processors: 256 worker threads 32-bit system with more than 4 logical processors: ((number of logical processors - 4) * 8) 64-bit system with less then, or equal to, 4 logical processors: 512 worker threads 64-bit system with more than 4 logical processors: ((number of logical processors - 4) * 16) Example: If we have a 64-bit system with 16 processors (or cores) we can calculate the maximum number of worker threads using the formula, ((16-4) * 16), which would give us a maximum of 704 worker threads. 11
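For completeness, the same setting can also be changed from T-SQL instead of the Properties page shown in Figure 1-9. This is a sketch only, and the advice above still stands: leave the value at 0 unless you have a very specific reason not to.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- 0 means: let SQL Server calculate the maximum number of worker threads.
EXEC sp_configure 'max worker threads', 0;
RECONFIGURE;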

189 Chapter 1 Wait Statistics Internals The amount of worker threads can be changed from the default of 0 (which means SQL Server sets the amount of max worker threads using the formulas above when it starts) by changing the max worker threads options in your SQL Server s properties, as illustrated by Figure 1-9. Figure 1-9. Processors page in the Server Properties Generally speaking, there should be no need to change the max worker threads option, and my advice is to leave the setting alone, as it should only be changed in very specific cases (I will discuss one of those potential cases in Chapter 4 when we talk about THREADPOOL waits). 12
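A quick way to see these states on your own system is to group the sys.dm_os_workers DMV by its state column, as in the following sketch:

-- Count the worker threads currently in each state.
SELECT [state],
       COUNT(*) AS worker_threads
FROM sys.dm_os_workers
GROUP BY [state]
ORDER BY worker_threads DESC;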

191 Chapter 1 Wait Statistics Internals Schedulers The scheduler component's main task is to, unsurprisingly, schedule work, in the form of tasks, on the physical processor(s). When a task requests processor time it is the scheduler that assigns worker threads to that task so the request can get processed. It is also responsible for making sure worker threads cooperate with each other and yield the processor when their slice of time, or quantum, has expired. We call this cooperative scheduling. The need for worker threads to yield when their processor time has expired comes from the fact that a scheduler will only let one worker thread run on a processor at a time. If the worker threads didn't need to yield, a worker thread could stay on the processor for an infinite amount of time, blocking all usage of that processor. There is a one-to-one relationship between logical processors and schedulers. If your system has two processors, each with four cores, there will be eight schedulers that the SQLOS can use to process user requests, each of them mapped to one of the logical processors. We can access information about the schedulers by running a query against the sys.dm_os_schedulers DMV: SELECT * FROM sys.dm_os_schedulers; The results of the query are shown in Figure 1-11. Figure 1-11. sys.dm_os_schedulers query results The SQL Server on which I ran this query has one processor with two cores, which means there should be two schedulers that can process my user requests. If we look at Figure 1-11, however, we notice there are more than two schedulers returned by the query. SQL Server uses its own schedulers to perform internal tasks, and those schedulers are also returned by the DMV and are marked HIDDEN ONLINE in the status column of the DMV. The schedulers that are available for user requests are marked as VISIBLE ONLINE in the DMV. There is also a special type of scheduler with the status VISIBLE ONLINE (DAC). This is a scheduler dedicated for use with the Dedicated Administrator Connection. This scheduler makes it possible to connect to SQL Server in situations where it is unresponsive; for instance, when there are no free worker threads available on the schedulers that process user requests. We can view the number of worker threads a scheduler has associated with it by looking at the current_workers_count column. This number also includes worker threads that aren't performing any work. The active_workers_count shows us the worker threads that are active on the specific scheduler. This doesn't mean they are actually 14

191 Chapter 1 Wait Statistics Internals Schedulers The scheduler component s main task is to surprise schedule work, in the form of tasks, on the physical processor(s). When a task requests processor time it is the scheduler that assigns worker threads to that task so the request can get processed. It is also responsible for making sure worker threads cooperate with each other and yield the processor when their slice of time, or quantum, has expired. We call this cooperative scheduling. The need for worker threads to yield when their processor time has expired comes from the fact that a scheduler will only let one worker thread run on a processor at a time. If the worker threads didn t need to yield, a worker thread could stay on the processor for an infinite amount of time, blocking all usage of that processor. There is a one-on-one relation between processors and schedulers. If your system has two processors, each with four cores, there will be eight schedulers that the SQLOS can use to process user requests, each of them mapped to one of the logical processors. We can access information about the schedulers by running a query against the sys.dm_os_schedulers DMV: SELECT * FROM sys.dm_os_schedulers; The results of the query are shown in Figure Figure sys.dm_os_schedulers query results The SQL Server on which I ran this query has one processor with two cores, which means there should be two schedulers that can process my user requests. If we look at Figure 1-11, however, we notice there are more than two schedulers returned by the query. SQL Server uses its own schedulers to perform internal tasks, and those schedulers are also returned by the DMV and are marked HIDDEN ONLINE in the status column of the DMV. The schedulers that are available for user requests are marked as VISIBLE ONLINE in the DMV. There is also a special type of scheduler with the status VISIBLE ONLINE (DAC). This is a scheduler dedicated for use with the Dedicated Administrator Connection. This scheduler makes it possible to connect to SQL Server in situations where it is unresponsive; for instance, when there are no free worker threads available on the schedulers that process user requests. We can view the number of worker threads a scheduler has associated with it by looking at the current_workers_count column. This number also includes worker threads that aren t performing any work. The active_workers_count shows us the worker threads that are active on the specific scheduler. This doesn t mean they are actually 14

192 Chapter 1 Wait Statistics Internals running on the processor, as worker threads with states of RUNNING, RUNNABLE, and SUSPENDED also count toward this number. The work_queue_count is also an interesting column since it will give you insight into how many tasks are waiting for a free worker thread. If you see high numbers in this column, it might mean that you are experiencing CPU pressure. Putting It All Together All the parts of the SQL Server scheduling we have discussed so far are connected to each other, and every request passes through these same components. The following text is an example of how a query request would get processed. A user connects to the SQL Server through an application. The SQL Server will create a session for that user after the login process is completed successfully. When the user sends a query to the SQL Server, a task and a request will be created to represent the unit of work that needs to be done. The scheduler will assign worker threads to the task so it can be completed. To see all this information in SQL Server, we can join some of the DMVs we used. The query in Listing 1-1 will show you an example of how we can combine the different DMVs to get scheduling information about a specific session (in this case a session with an ID of 53). Listing 1-1. Join the different DMVs together to query scheduling information SELECT r.session_id AS 'Session ID', r.command AS 'Type of Request', qt.[text] AS 'Query Text', t.task_address AS 'Task Address', t.task_state AS 'Task State', w.worker_address AS 'Worker Address', w.[state] AS 'Worker State', s.scheduler_address AS 'Scheduler Address', s.[status] AS 'Scheduler State' FROM sys.dm_exec_requests r CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) qt INNER JOIN sys.dm_os_tasks t ON r.task_address = t.task_address INNER JOIN sys.dm_os_workers w ON t.worker_address = w.worker_address INNER JOIN sys.dm_os_schedulers s ON w.scheduler_address = s.scheduler_address WHERE r.session_id = 53 Figure 1-12 shows the information that the query returned on my test SQL Server. To keep the results readable, I only selected columns from the DMVs to show the relation between them. 15
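One way to observe the Waiter List directly is through the sys.dm_os_waiting_tasks DMV, which lists the suspended worker threads together with the Wait Type they are waiting on and the time they have spent waiting so far. A sketch:

-- Tasks currently on the Waiter List, with their Wait Type and wait time.
SELECT session_id,
       wait_type,
       wait_duration_ms,
       resource_description
FROM sys.dm_os_waiting_tasks
WHERE session_id IS NOT NULL;  -- leave out purely internal tasks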

193 Chapter 1 Wait Statistics Internals Figure Results of the query from Listing 1-1 In the results we can see that Session ID 53 made a SELECT query request. I did a cross apply with the sys.dm_exec_sql_text Dynamic Management Object to show the query text of the request. The request was mapped to a task, and the task began running. The task was then mapped to a worker thread that was then also in a running state. This meant that this query began being processed on a processor. The Scheduler Address column shows on which specific scheduler our worker thread was being run. Wait Statistics So far we have gone pretty deep into the different components that perform scheduling for SQL Server and how they are interconnected, but we haven t given a lot of attention to the topic of this book: Wait Statistics. In the section about worker threads earlier in this chapter, I described the states a worker thread can be in while it is performing work on a scheduler. When a worker thread is performing its work, it goes through three different phases (or queues) in the scheduler process. Depending on the phase (or queue) a worker thread is in, it will get either the RUNNING, RUNNABLE, or SUSPENDED state. Figure 1-13 shows an abstract view of a scheduler with the three different phases. Figure Scheduler and its phases and queues When a worker thread gets access to a scheduler it will generally start in the Waiter List and get the SUSPENDED state. The Waiter List is an unordered list of worker threads that have the SUSPENDED state and are waiting for resources to become available. Those resources can be just about anything on the system, from data pages to a lock request. While a worker thread is in the Waiter List the SQLOS records the type of resource it needs to continue its work (the Wait Type) and the time it spends waiting before that specific resource becomes available, known as the Resource Wait Time. 16

Whenever a worker thread receives access to the resources it needs, it will move to the Runnable Queue, a first-in-first-out list of all the worker threads that have access to their resources and are ready to be run on the processor. The time a worker thread spends in the Runnable Queue is recorded as the Signal Wait Time. The first worker thread in the Runnable Queue will move to the RUNNING phase, where it will receive processor time to perform its work. The time it spends on the processor is recorded as CPU time. In the meantime, the other worker threads in the Runnable Queue will move a spot higher in the list, and worker threads that have received their requested resources will move from the Waiter List into the Runnable Queue.

While a worker thread is in the RUNNING phase there are three scenarios that can happen:

The worker thread needs additional resources; in this case it will move from the RUNNING phase to the Waiter List.

The worker thread spends its quantum (fixed value of 4 milliseconds) and has to yield; the worker thread is moved to the bottom of the Runnable Queue.

The worker thread is done with its work and will leave the scheduler.

Worker threads move through the three different phases all the time, and it is very common that one worker thread moves through them multiple times until its work is done. Figure 1-14 will show you the scheduler view from Figure 1-13 combined with the different types of Wait Time and the flow of Worker Threads.

Figure 1-14. Scheduler view complete with Wait Times and worker thread flow

Knowing all the different lengths of time a request spends in one of the three different phases makes it possible to calculate the total request execution time, and also the total time a request had to wait for either processor time or resource time. Figure 1-15 shows the calculation of the total execution time and its different parts.

Figure 1-15. Request execution time calculation

Since there is a lot of terminology involved in the scheduling of worker threads in SQL Server, I would like to give you an example of how worker threads move through a scheduler. Figure 1-16 shows an abstract image of a scheduler like those we have already looked at, but this time I added requests that are being handled by that scheduler.

Figure 1-16. Scheduler with running requests

In this example we see that the request from SID (Session ID) 76 is currently being executed on the processor; this request will have the state RUNNING. There are two other requests, SID 83 and SID 51, in the Waiter List waiting for their requested resources. The Wait Types they are waiting for are LCK_M_S and CXPACKET. I won't go into detail here about these Wait Types since we will be covering both of them in Part II of this book. While these two sessions are in the Waiter List, SQL Server will be recording the time they spend there as Wait Time, and the Wait Type will be noted as the representation of the resource they are waiting on. If we were to query information about these two threads, they would both have the SUSPENDED state. SID 59, SID 98, and SID 74 have their resources ready and are waiting in the Runnable Queue for SID 76 to complete its work on the processor. While they are waiting in the Runnable Queue, SQL Server records the time they spend there as the Signal Wait Time and adds that time to the total Wait Time. These three worker threads will have the status of RUNNABLE.

In Figure 1-17 we have moved a few milliseconds forward in time; notice how the scheduler and worker threads have moved through the different phases and queues.

Figure 1-17. Scheduler a few milliseconds later

SID 76 completed its time on the processor; it didn't need any additional resources to complete its request and thus left the scheduler. SID 59 was the first worker thread in the Runnable Queue, and now that the processor is free it will move from the Runnable Queue to the processor, and its state will change from RUNNABLE to RUNNING. SID 51 is done waiting on the CXPACKET Wait Type and moved from the Waiter List to the bottom of the Runnable Queue, changing its state from SUSPENDED to RUNNABLE.

Summary

In this chapter we took a look at the history of Wait Statistics throughout various versions of SQL Server. Even though the method of analyzing SQL Server performance using Wait Statistics is relatively new, Wait Statistics have been a part of the SQL Server engine for a very long time.

With the introduction of the SQLOS in SQL Server 2005, a lot changed in how SQL Server processed requests, introducing schedulers, worker threads, and tasks. All the information for the various parts is stored in Dynamic Management Views (DMVs) or Dynamic Management Functions (DMFs), which are easily queried and return a lot of information about the internals of SQL Server. Using these DMVs, we can view the progress of requests while they are being handled by a SQL Server scheduler and learn if they are waiting for any specific resources. The resources the requests are waiting for and the time they spend waiting for those resources are recorded as Wait Statistics, which is the main topic of this book.
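As a practical starting point for the chapters that follow, the aggregated wait times that the SQLOS keeps per Wait Type can already be queried. The query below is a minimal sketch that lists the top waits on an instance and splits each one into its signal portion (time spent in the Runnable Queue) and its resource portion (time spent in the Waiter List), matching the execution time breakdown described earlier in this chapter.

SELECT TOP (10)
       wait_type AS 'Wait Type',
       waiting_tasks_count AS 'Waiting Tasks',
       wait_time_ms AS 'Total Wait Time (ms)',
       signal_wait_time_ms AS 'Signal Wait Time (ms)',
       wait_time_ms - signal_wait_time_ms AS 'Resource Wait Time (ms)'
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;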

198 Expert SQL Server In-Memory OLTP Dmitri Korotkevitch

199 Expert SQL Server In-Memory OLTP Copyright 2015 by Dmitri Korotkevitch This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Development Editor: Douglas Pundick Technical Reviewer: Sergey Olontsev Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Mary Behr Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. 
For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

200 Contents at a Glance About the Author... xiii About the Technical Reviewer... xv Acknowledgments... xvii Introduction... xix Chapter 1: Why In-Memory OLTP?... 1 Chapter 2: In-Memory OLTP Objects... 7 Chapter 3: Memory-Optimized Tables Chapter 4: Hash Indexes Chapter 5: Nonclustered Indexes Chapter 6: In-Memory OLTP Programmability Chapter 7: Transaction Processing in In-Memory OLTP Chapter 8: Data Storage, Logging, and Recovery Chapter 9: Garbage Collection Chapter 10: Deployment and Management Chapter 11: Utilizing In-Memory OLTP Appendix A: Memory Pointers Management Appendix B: Page Splitting and Page Merging in Nonclustered Indexes v

201 CONTENTS AT A GLANCE Appendix C: Analyzing the States of Checkpoint File Pairs Appendix D: In-Memory OLTP Migration Tools Index vi

202 CHAPTER 11 Utilizing In-Memory OLTP This chapter discusses several design considerations for systems utilizing In-Memory OLTP and shows a set of techniques that can be used to address some of In-Memory OLTP s limitations. Moreover, this chapter demonstrates how to benefit from In-Memory OLTP in scenarios when refactoring of existing systems is cost-ineffective. Finally, this chapter talks about systems with mixed workload patterns and how to benefit from the technology in those scenarios. Design Considerations for the Systems Utilizing In-Memory OLTP As with any new technology, adoption of In-Memory OLTP comes at a cost. You will need to acquire and/or upgrade to the Enterprise Edition of SQL Server 2014, spend time learning the technology, and, if you are migrating an existing system, refactor code and test the changes. It is important to perform a cost/benefits analysis and determine if In-Memory OLTP provides you with adequate benefits to outweigh the costs. In-Memory OLTP is hardly a magical solution that will improve server performance by simply flipping a switch and moving data into memory. It is designed to address a specific set of problems, such as latch and lock contentions on very active OLTP systems. Moreover, it helps improve the performance of the small and frequently executed OLTP queries that perform point-lookups and small range scans. In-Memory OLTP is less beneficial in the case of Data Warehouse systems with low concurrent activity, large amounts of data, and queries that require large scans and complex aggregations. While in some cases it is still possible to achieve performance improvements by moving data into memory, you can often obtain better results by implementing columnstore indexes, indexed views, data compression, and other database schema changes. It is also worth remembering that most performance improvements with In-Memory OLTP are achieved by using natively compiled stored procedures, which can rarely be used in Data Warehouse workloads due to the limited set of T-SQL features that they support. The situation is more complicated with systems that have a mixed workload, such as an OLTP workload against hot, recent data and a Data Warehouse/Reporting workload against old, historical data. In those cases, you can partition the data into multiple tables, 169

203 CHAPTER 11 UTILIZING IN-MEMORY OLTP moving recent data into memory and keeping old, historical data on-disk. Partition views can be beneficial in this scenario by hiding the storage details from the client applications. We will discuss such implementation later in this chapter. Another important factor is whether you plan to use In-Memory OLTP during the development of new or the migration of existing systems. It is obvious that you need to make changes in existing systems, addressing the limitations of memory-optimized tables, such as missing support of triggers, foreign key constraints, check and unique constraints, calculated columns, and quite a few other restrictions. There are other factors that can greatly increase migration costs. The first is the 8,060-byte maximum row size limitation in memory-optimized tables without any off-row data storage support. This limitation can lead to a significant amount of work when the existing active OLTP tables use LOB data types, such as (n)varchar(max), xml, geography and a few others. While it is possible to change the data types, limiting the size of the strings or storing XML as text or in binary format, such changes are complex, timeconsuming, and require careful planning. Don t forget that In-Memory OLTP does not allow you to create a table if there is a possibility that the size of a row exceeds 8,060 bytes. For example, you cannot create a table with three varchar(3000) columns even if you do not plan to exceed the 8,060-byte row size limit. Indexing of memory-optimizing tables is another important factor. While nonclustered indexes can mimic some of the behavior of indexes in on-disk tables, there is still a significant difference between them. Nonclustered indexes are unidirectional, and they would not help much if the data needs to be accessed in the opposite sorting order of an index key. This often requires you to reevaluate your index strategy when a table is moved from disk into memory. However, the bigger issue with indexing is the requirement of case-sensitive binary collation of the indexed text columns. This is a breaking change in system behavior, and it often requires non-trivial changes in the code and some sort of data conversion. It is also worth noting that using binary collations for data will lead to changes in the T-SQL code. You will need to specify collations for variables in stored procedures and other T-SQL routines, unless you change the database collation to be a binary one. However, if the database and server collations do not match, you will need to specify a collation for the columns in temporary tables created in tempdb. There are plenty of other factors to consider. However, the key point is that you should perform a thorough analysis before starting a migration to In-Memory OLTP. Such a migration can have a very significant cost, and it should not be done unless it benefits the system. SQL Server 2014 provides the tools that can help during In-Memory OLTP migration. These tools are based on the Management Data Warehouse, and they provide you with a set of data collectors and reports that can help identify the objects that would benefit the most from the migration. While those tools can be beneficial during the initial analysis stage, you should not make a decision based solely on their output. Take into account all of the other factors and considerations we have already discussed in this book. Note We will discuss migration tools in detail in Appendix D. 170
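As a small, self-contained illustration of the collation points discussed above (explicit collations for individual comparisons and for temporary table columns), the snippet below is only a sketch; the temporary table, variable, and values are invented for the example. It shows a COLLATE clause applied to a temporary table column, so it does not silently pick up the tempdb/server collation, and to a single predicate when a binary, case-sensitive match is needed.

declare @ProductName nvarchar(64) = N'surface 3';

-- Temporary table column that takes the collation of the current
-- user database rather than the tempdb/server collation
create table #ProductLookup
(
    ProductName nvarchar(64) collate database_default not null
);

insert into #ProductLookup(ProductName) values (N'Surface 3');

-- Forcing a binary (case-sensitive) comparison for this predicate only;
-- no row is returned here because the casing differs
select ProductName
from #ProductLookup
where ProductName = @ProductName collate Latin1_General_100_BIN2;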

204 CHAPTER 11 UTILIZING IN-MEMORY OLTP New development, on the other hand, is a very different story. You can design a new system and database schema taking In-Memory OLTP limitations into account. It is also possible to adjust some functional requirements during the design phase. As an example, it is much easier to store data in a case-sensitive way from the beginning compared to changing the behavior of existing systems after they were deployed to production. You should remember, however, that In-Memory OLTP is an Enterprise Edition feature, and it requires powerful hardware with a large amount of memory. It is an expensive feature due to its licensing costs. Moreover, it is impossible to set it and forget it. Database professionals should actively participate in monitoring and system maintenance after deployment. They need to monitor system memory usage, analyze data and recreate hash indexes if bucket counts need to be adjusted, update statistics, redeploy natively compiled stored procedures, and perform other tasks as well. All of that makes In-Memory OLTP a bad choice for Independent Software Vendors who develop products that need be deployed to a large number of customers. Moreover, it is not practical to support two versions of a system with and without In-Memory OLTP due to the increase in development and support costs. Addressing In-Memory OLTP Limitations Let s take a closer look at some of the In-Memory OLTP limitations and the ways to address them. Obviously, there is more than one way to skin a cat, and you can work around these limitations differently. 8,060-Byte Maximum Row Size Limit The 8,060-byte maximum row size limit is, perhaps, one of the biggest roadblocks in widespread technology adoption. This limitation essentially prevents you from using (max) data types along with CLR and system data types that require off-row storage, such as XML, geometry, geography and a few others. Even though you can address this by changing the database schema and T-SQL code, these changes are often expensive and time-consuming. When you encounter such a situation, you should analyze if LOB data types are required in the first place. It is not uncommon to see a column that never stores more than a few hundred characters defined as (n)varchar(max). Consider an Order Entry system and DeliveryInstruction column in the Orders table. You can safely limit the size of the column to 500-1,000 characters without compromising the business requirements of the system. Another example is a system that collects some semistructured sensor data from the devices and stores it in the XML column. If the amount of semistructured data is relatively small, you can store it in varbinary(n) column, which will allow you to move the table into memory. Tip It is more efficient to use varbinary rather than nvarchar to store XML data in cases when you cannot use the XML data type. 171
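To illustrate the tip above, the round trip between the xml type and a varbinary column is just a pair of conversions. The snippet below is a minimal sketch (the sensor document and the 2,000-byte size are made up for the example), and it only works as long as the binary representation of the document fits the size you pick for the column in the memory-optimized table.

declare @SensorData xml = N'<reading device="42" temp="21.5" ts="2016-01-01T10:00:00"/>';

-- Storing: convert the xml value to varbinary before writing it
-- to a varbinary(N) column of a memory-optimized table
declare @StoredValue varbinary(2000) = convert(varbinary(2000), @SensorData);

-- Reading: convert the binary value back to xml when it is needed
select convert(xml, @StoredValue) as SensorData;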

205 CHAPTER 11 UTILIZING IN-MEMORY OLTP Unfortunately, sometimes it is impossible to change the data types and you have to keep LOB columns in the tables. Nevertheless, you have a couple options to proceed. The first approach is to split data between two tables, storing the key attributes in memory-optimized and rarely-accessed LOB attributes in on-disk tables. Again, consider the situation where you have an Order Entry system with the Products table defined as shown in Listing Listing Products Table Definition create table dbo.products ( ProductId int not null identity(1,1), ProductName nvarchar(64) not null, ShortDescription nvarchar(256) not null, Description nvarchar(max) not null, Picture varbinary(max) null, ) constraint PK_Products primary key clustered(productid) As you can guess, in this scenario, it is impossible to change the data types of the Picture and Description columns, which prevents you from making the Products table memory-optimized. You can split that table into two, as shown in Listing The Picture and Description columns are stored in an on-disk table while all other columns are stored in the memory-optimized table. This approach will improve performance for the queries against the ProductsInMem table and will allow you to access it from natively compiled stored procedures in the system. Listing Splitting Data Between Two Tables create table dbo.productsinmem ( ProductId int not null identity(1,1) constraint PK_ProductsInMem primary key nonclustered hash with (bucket_count = 65536), ProductName nvarchar(64) collate Latin1_General_100_BIN2 not null, ShortDescription nvarchar(256) not null, index IDX_ProductsInMem_ProductName nonclustered(productname) ) with (memory_optimized = on, durability = schema_and_data); 172

206 CHAPTER 11 UTILIZING IN-MEMORY OLTP create table dbo.productattributes ( ProductId int not null, Description nvarchar(max) not null, Picture varbinary(max) null, ); constraint PK_ProductAttributes primary key clustered(productid) Unfortunately, it is impossible to define a foreign key constraint referencing a memory-optimized table, and you should support referential integrity in your code. You can hide some of the implementation details from the SELECT queries by defining a view as shown in Listing You can also define INSTEAD OF triggers on the view and use it as the target for data modifications; however, it is more efficient to update data in the tables directly. Listing Creating a View That Combines Data from Both Tables create view dbo.products(productid, ProductName, ShortDescription, Description, Picture) as select p.productid, p.productname, p.shortdescription,pa.description, pa.picture from dbo.productsinmem p left outer join dbo.productattributes pa on p.productid = pa.productid As you should notice, the view is using an outer join. This allows SQL Server to perform join elimination when the client application does not reference any columns from the ProductAttributes table when querying the view. For example, if you ran the query from Listing 11-4, you would see the execution plan as shown in Figure As you can see, there are no joins in the plan and the ProductAttributes table is not accessed. Listing Query Against the View select ProductId, ProductName from dbo.products 173

Figure 11-1. Execution plan of the query

You can use a different approach and store LOB data in memory-optimized tables, splitting it into multiple 8,000-byte chunks. Listing 11-5 shows the table that can be used for such a purpose.

Listing 11-5. Splitting LOB Data into Multiple Rows: Table Schema

create table dbo.LobData
(
    ObjectId int not null,
    PartNo smallint not null,
    Data varbinary(8000) not null,

    constraint PK_LobData
    primary key nonclustered hash(ObjectId, PartNo)
    with (bucket_count= ),

    index IDX_ObjectID
    nonclustered hash(ObjectId)
    with (bucket_count= )
)
with (memory_optimized = on, durability = schema_and_data)

Listing 11-6 demonstrates how to insert XML data into the table using T-SQL code in interop mode. It uses an inline table-valued function called dbo.SplitData that accepts the varbinary(max) parameter and splits it into multiple 8,000-byte chunks.

Listing 11-6. Splitting LOB Data into Multiple Rows: Populating Data

create function dbo.SplitData
(
    @LobData varbinary(max)
)
returns table
as
return
(
    with Parts(Start, Data)
    as
    (
        select 1, substring(@LobData,1,8000)
        where @LobData is not null

        union all

        select
            Start + 8000
            ,substring(@LobData,Start + 8000,8000)
        from Parts
        where len(substring(@LobData,Start + 8000,8000)) > 0
    )
    select
        row_number() over(order by Start) as PartNo
        ,Data
    from Parts
)
go

declare @X xml = (select * from master.sys.objects for xml raw)

insert into dbo.LobData(ObjectId, PartNo, Data)
    select 1, PartNo, Data
    from dbo.SplitData(convert(varbinary(max),@X))

Figure 11-2 illustrates the contents of the LobData table after the insert.

Figure 11-2. Dbo.LobData table content

209 CHAPTER 11 UTILIZING IN-MEMORY OLTP Note SQL Server limits the CTE recursion level to 100 by default. You need to specify OPTION (MAXRECURSION 0) in the statement that uses the SplitData function in case of very large input. You can construct original data using the code shown in Listing Alternatively, you can develop a CLR aggregate and concatenate binary data there. Listing Spllitting LOB Data into Multiple Rows: Getting Data ;with ConcatData(BinaryData) as ( select convert(varbinary(max), ( select convert(varchar(max),data,2) as [text()] from dbo.lobdata where ObjectId = 1 order by PartNo for xml path('') ),2) ) select convert(xml,binarydata) from ConcatData The biggest downside of this approach is the inability to split and merge large objects in natively compiled stored procedures due to the missing (max) parameters and variables support. You should use the interop engine for this purpose. However, it is still possible to achieve performance improvements by moving data into memory even when the interop engine is in use. This approach is also beneficial when memory-optimized tables are used just for the data storage, and all split and merge logic is done inside the client applications. We will discuss this implementation in much greater depth later in this chapter. Lack of Uniqueness and Foreign Key Constraints The inability to create unique and foreign key constraints rarely prevents us from adopting new technology. However, these constraints keep the data clean and allow us to detect data quality issues and bugs in the code at early stages of development. Unfortunately, In-Memory OLTP does not allow you to define foreign keys or unique indexes and constraints besides a primary key. To make matter worse, the lock-free nature of In-Memory OLTP makes uniqueness support in the code tricky. In-Memory OLTP transactions do not see any uncommitted changes done by other transactions. For example, if you ran the code from Table 11-1 in the default SNAPSHOT isolation level, both transactions would successfully commit without seeing each other s changes. 176

210 CHAPTER 11 UTILIZING IN-MEMORY OLTP Table Inserting the Duplicated Rows in the SNAPSHOT Isolation Level Session 1 Session 2 set transaction isolation level snapshot begin tran if not exists ( select * from dbo.productsinmem where ProductName = 'Surface 3' ) insert into dbo.productsinmem (ProductName) values ('Surface 3') commit set transaction isolation level snapshot begin tran if not exists ( select * from dbo.productsinmem where ProductName = 'Surface 3' ) insert into dbo.productsinmem (ProductName) values ('Surface 3') commit Fortunately, this situation can be addressed by using the SERIALIZABLE transaction isolation level. As you remember, In-Memory OLTP validates the serializable consistency rules by maintaining a transaction scan set. As part of the serializable rules validation at commit stage, In-Memory OLTP checks for phantom rows, making sure that other sessions do not insert any rows that were previously invisible to the transaction. Listing 11-8 shows a natively compiled stored procedure that runs in the SERIALIZABLE isolation level and inserts a row into the ProductsInMem table we defined earlier. Any inserts done through this stored procedure guarantee uniqueness of the ProductName even in a multi-user concurrent environment. The SELECT query builds a transaction scan set, which will be used for serializable rule validation. This validation will fail if any other session inserts a row with the same ProductName while the transaction is still active. Unfortunately, the first release of In-Memory OLTP does not support subqueries in natively compiled stored procedures and it is impossible to write the code using an IF EXISTS construct. 177

Listing 11-8. InsertProduct Stored Procedure

create procedure dbo.InsertProduct
(
    @ProductName nvarchar(64) not null
    ,@ShortDescription nvarchar(256) not null
    ,@ProductId int output
)
with native_compilation, schemabinding, execute as owner
as
begin atomic with
(
    transaction isolation level = serializable
    ,language = N'English'
)
    declare
        @Exists bit = 0

    -- Building scan set and checking existence of the product
    select @Exists = 1
    from dbo.ProductsInMem
    where ProductName = @ProductName

    if @Exists = 1
    begin
        ;throw 50000, 'Product Already Exists', 1;
        return
    end

    insert into dbo.ProductsInMem(ProductName, ShortDescription)
    values(@ProductName, @ShortDescription)

    select @ProductId = scope_identity()
end

You can validate the behavior of the stored procedure by running it in two parallel sessions, as shown in Table 11-2. Session 2 successfully inserts a row and commits the transaction. Session 1, on the other hand, fails at the commit stage with error 41325, a serializable validation failure.

212 CHAPTER 11 UTILIZING IN-MEMORY OLTP Table Validating dbo.insertproduct Stored Procedure Session 1 Session 2 begin tran int exec dbo.insertproduct 'Surface 3','Microsoft Tablet',@ProductId output commit Error: Msg 41325, Level 16, State 0, Line 62 The current transaction failed to commit due to a serializable validation failure. int exec dbo.insertproduct 'Surface 3','Microsoft Tablet',@ProductId output -- Executes and commits successfully Obviously, this approach will work and enforce the uniqueness only when you have full control over the data access code in the system and have all INSERT and UPDATE operations performed through the specific set of stored procedures and/or code. The INSERT and UPDATE statements executed directly against a table could easily violate uniqueness rules. However, you can reduce the risk by revoking the INSERT and UPDATE permissions from users, giving them EXECUTE permission on the stored procedures instead. You can use the same technique to enforce referential integrity rules. Listing 11-9 creates the Orders and OrderLineItems tables, and two stored procedures called InsertOrderLineItems and DeleteOrders enforce referential integrity between those tables there. I omitted the OrderId update scenario, which is very uncommon in the real world. Listing Enforcing Referential Integrity create table dbo.orders ( OrderId int not null identity(1,1) constraint PK_Orders primary key nonclustered hash with (bucket_count= ), 179

213 CHAPTER 11 UTILIZING IN-MEMORY OLTP OrderNum varchar(32) collate Latin1_General_100_BIN2 not null, OrderDate datetime2(0) not null constraint DEF_Orders_OrderDate default GetUtcDate(), /* Other Columns */ index IDX_Orders_OrderNum nonclustered(ordernum) ) with (memory_optimized = on, durability = schema_and_data); create table dbo.orderlineitems ( OrderId int not null, OrderLineItemId int not null identity(1,1) constraint PK_OrderLineItems primary key nonclustered hash with (bucket_count= ), ArticleId int not null, Quantity decimal(8,2) not null, Price money not null, /* Other Columns */ index IDX_OrderLineItems_OrderId nonclustered hash(orderid) with (bucket_count= ) ) with (memory_optimized = on, durability = schema_and_data); go create type dbo.tvporderlineitems as table ( ArticleId int not null primary key nonclustered hash with (bucket_count = 1024), Quantity decimal(8,2) not null, Price money not null /* Other Columns */ ) with (memory_optimized = on); go create proc dbo.deleteorder int not null ) 180

214 CHAPTER 11 UTILIZING IN-MEMORY OLTP with native_compilation, schemabinding, execute as owner as begin atomic with ( transaction isolation level = serializable,language=n'english' ) -- This stored procedure emulates ON DELETE NO ACTION -- foreign key constraint behavior bit = 0 = 1 from dbo.orderlineitems where OrderId = 1 begin ;throw 60000, N'Referential Integrity Violation', 1; return end end go delete from dbo.orders where OrderId create proc dbo.insertorderlineitems int not null,@orderlineitems dbo.tvporderlineitems readonly ) with native_compilation, schemabinding, execute as owner as begin atomic with ( transaction isolation level = repeatable read,language=n'english' ) bit = 0 = 1 from dbo.orders where OrderId 181

215 CHAPTER 11 UTILIZING IN-MEMORY OLTP = 0 begin ;throw 60001, N'Referential Integrity Violation', 1; return end end insert into dbo.orderlineitems(orderid, ArticleId, Quantity, Price) ArticleId, Quantity, Price It is worth noting that the InsertOrderLineItems procedure is using the REPEATABLE READ isolation level. In this scenario, you need to make sure that the referenced Order row has not been deleted during the execution and that REPEATABLE READ enforces this with less overhead than SERIALIZABLE. Case-Sensitivity Binary Collation for Indexed Columns As discussed, the requirement of having binary collation for the indexed text columns introduces a breaking change in the application behavior if case-insensitive collations were used before. Unfortunately, there is very little you can do about it. You can convert all the data and search parameters to uppercase or lowercase to address the situation; however, this is not always possible. Another option is to store uppercase or lowercase data in another column, indexing and using it in the queries. Listing shows such an example. Listing Storing Indexed Data in Another Column create table dbo.articles ( ArticleID int not null constraint PK_Articles primary key nonclustered hash with (bucket_count = 16384), ArticleName nvarchar(128) not null, ArticleNameUpperCase nvarchar(128) collate Latin1_General_100_BIN2 not null, -- Other Columns index IDX_Articles_ArticleNameUpperCase nonclustered(articlenameuppercase) ); -- Example of the query that uses upper case column select ArticleId, ArticleName from dbo.articles where ArticleNameUpperCase = upper(@articlename); 182
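A minimal sketch of an insert that populates both columns looks like the following (the key value and article name are made up for the example, and any other columns of the table are assumed to be optional):

declare @ArticleName nvarchar(128) = N'Widget Deluxe';

-- The application or stored procedure has to maintain both columns,
-- writing the uppercase copy alongside the original value
insert into dbo.Articles(ArticleID, ArticleName, ArticleNameUpperCase)
values (1, @ArticleName, upper(@ArticleName));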

Unfortunately, memory-optimized tables don't support calculated columns, and you will need to maintain the data in both columns manually in the code.

However, in the grand scheme of things, binary collations have benefits. The comparison operations on columns that store data in binary collations are much more efficient compared to their non-binary counterparts. You can achieve significant performance improvements when a large number of rows need to be processed. One such example is a substring search in large tables. Consider the situation when you need to search by part of the product name in a large Products table. Unfortunately, a substring search leads to a predicate in the form WHERE ProductName LIKE '%' + @Param + '%', which is not SARGable, and SQL Server cannot use an Index Seek operation in such a scenario. The only option is to scan the data, evaluating every row in the table, which is significantly faster with binary collation.

Let's look at an example and create the table shown in Listing 11-11. The table has four text columns that store Unicode and non-Unicode data in binary and non-binary format. Finally, we populate it with 65,536 rows of random data.

Listing 11-11. Binary Collation Performance: Table Creation

create table dbo.CollationTest
(
    ID int not null,
    VarCol varchar(108) not null,
    NVarCol nvarchar(108) not null,
    VarColBin varchar(108) collate Latin1_General_100_BIN2 not null,
    NVarColBin nvarchar(108) collate Latin1_General_100_BIN2 not null,

    constraint PK_CollationTest
    primary key nonclustered hash(ID)
    with (bucket_count=131072)
)
with (memory_optimized=on, durability=schema_only);

create table #CollData
(
    ID int not null,
    Col1 uniqueidentifier not null default NEWID(),
    Col2 uniqueidentifier not null default NEWID(),
    Col3 uniqueidentifier not null default NEWID()
);

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select NULL)) from N5)
insert into #CollData(ID)
    select ID from IDs;

insert into dbo.CollationTest(ID,VarCol,NVarCol,VarColBin,NVarColBin)
    select ID
        /* VarCol */
        ,convert(varchar(36),Col1) + convert(varchar(36),Col2) + convert(varchar(36),Col3)
        /* NVarCol */
        ,convert(nvarchar(36),Col1) + convert(nvarchar(36),Col2) + convert(nvarchar(36),Col3)
        /* VarColBin */
        ,convert(varchar(36),Col1) + convert(varchar(36),Col2) + convert(varchar(36),Col3)
        /* NVarColBin */
        ,convert(nvarchar(36),Col1) + convert(nvarchar(36),Col2) + convert(nvarchar(36),Col3)
    from #CollData

As the next step, run the queries from Listing 11-12, comparing the performance of the search in different scenarios. All of the queries scan the primary key hash index, evaluating the predicate for every row in the table.

Listing 11-12. Binary Collation Performance: Test Queries

declare
    @Param varchar(16)
    ,@NParam varchar(16)

-- Getting substring for the search
select
    @Param = substring(VarCol,43,6)
    ,@NParam = substring(NVarCol,43,6)
from dbo.CollationTest
where ID = 1000;

select count(*) from dbo.CollationTest where VarCol like '%' + @Param + '%';

select count(*) from dbo.CollationTest where NVarCol like '%' + @NParam + N'%';

select count(*) from dbo.CollationTest where VarColBin like '%' + upper(@Param) + '%' collate Latin1_General_100_Bin2;

select count(*) from dbo.CollationTest where NVarColBin like '%' + upper(@NParam) + N'%' collate Latin1_General_100_Bin2;

The execution time of all queries in my system is shown in Table 11-3. As you can see, the queries against binary collation columns are significantly faster, especially in the case of Unicode data.

Table 11-3. Binary Collation Performance: Test Results

varchar column with non-binary collation: 191ms
varchar column with binary collation: 109ms
nvarchar column with non-binary collation: 769ms
nvarchar column with binary collation: 62ms

Finally, it is worth noting that this behavior is not limited to memory-optimized tables. You will get a similar level of performance improvement with on-disk tables when binary collations are used.

Thinking Outside the In-Memory Box

Even though the limitations of the first release of In-Memory OLTP can make refactoring existing systems cost-ineffective, you can still benefit from it by using some In-Memory OLTP components.

Importing Batches of Rows from Client Applications

In Chapter 12 of my book Pro SQL Server Internals, I compare the performance of several methods that insert a batch of rows from the client application. I looked at the performance of calling individual INSERT statements; encoding the data into XML and passing it to a stored procedure; using the .NET SqlBulkCopy class; and passing data to a stored procedure utilizing table-valued parameters. Table-valued parameters became the clear winner of the tests, providing performance on par with the SqlBulkCopy implementation plus the flexibility of using stored procedures during the import. Listing 11-13 illustrates the database schema and stored procedure I used in the tests.

Listing 11-13. Importing a Batch of Rows: Table, TVP, and Stored Procedure

create table dbo.Data
(
    ID int not null,
    Col1 varchar(20) not null,
    Col2 varchar(20) not null,
    /* Seventeen more columns: Col3 - Col19 */
    Col20 varchar(20) not null,

    constraint PK_DataRecords
    primary key clustered(ID)
)
go

create type dbo.tvpData as table
(
    ID int not null,
    Col1 varchar(20) not null,
    Col2 varchar(20) not null,
    /* Seventeen more columns: Col3 - Col19 */
    Col20 varchar(20) not null,

    primary key(ID)
)
go

create proc dbo.InsertDataTVP
(
    @Data dbo.tvpData readonly
)
as
    insert into dbo.Data
    (
        ID,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10
        ,Col11,Col12,Col13,Col14,Col15,Col16,Col17,Col18,Col19,Col20
    )
    select
        ID,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10
        ,Col11,Col12,Col13,Col14,Col15,Col16,Col17,Col18,Col19,Col20
    from @Data
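Before looking at the client-side code, a quick way to exercise the procedure directly from T-SQL is to fill a table variable of the dbo.tvpData type and pass it in. The snippet below is only a sketch with two dummy rows; with the full 20-column schema, the elided Col3 - Col19 columns would have to be supplied in the same way.

-- Populating a TVP-typed table variable and passing it to the import procedure
declare @Data dbo.tvpData;

insert into @Data(ID, Col1, Col2, /* Col3 - Col19 */ Col20)
values (1, 'Value 1-1', 'Value 1-2', 'Value 1-20'),
       (2, 'Value 2-1', 'Value 2-2', 'Value 2-20');

exec dbo.InsertDataTVP @Data = @Data;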

220 CHAPTER 11 UTILIZING IN-MEMORY OLTP Listing shows the ADO.Net code that performed the import in case of tablevalued parameter. Listing Importing a Batch of Rows: Client Code using (SqlConnection conn = GetConnection()) { /* Creating and populating DataTable object with dummy data */ DataTable table = new DataTable(); table.columns.add("id", typeof(int32)); for (int i = 1; i <= 20; i++) table.columns.add("col" + i.tostring(), typeof(string)); for (int i = 0; i < packetsize; i++) table.rows.add(i, "Parameter: 1","Parameter: 2" /* Other columns */,"Parameter: 20"); } /* Calling SP with TVP parameter */ SqlCommand insertcmd = new SqlCommand("dbo.InsertDataTVP", conn); insertcmd.parameters.add("@data", SqlDbType.Structured); insertcmd.parameters[0].typename = "dbo.tvpdata"; insertcmd.parameters[0].value = table; insertcmd.executenonquery(); You can improve performance even further by replacing the dbo.tvpdata tablevalued type to be memory-optimized, which is transparent to the stored procedure and client code. Listing shows the new type definition. Listing Importing a Batch of Rows: Defining a Memory-Optimized Table Type create type dbo.tvpdata as table ( ID int not null, Col1 varchar(20) not null, Col2 varchar(20) not null, /* Seventeen more columns: Col3 - Col19 */ Col20 varchar(20) not null, primary key nonclustered hash(id) with (bucket_count=65536) ) with (memory_optimized=on); 187

221 CHAPTER 11 UTILIZING IN-MEMORY OLTP The degree of performance improvement depends on the table schema, and it grows with the size of the batch. In my test environment, I got about 5-10 percent improvement on the small 5,000-row batches, percent improvement on the 50,000-row batches, and percent improvement on the 500,000-row batches. You should remember, however, that memory-optimized table types cannot spill to tempdb, which can be dangerous in case of very large batches and with servers with an insufficient amount of memory. You should also define the bucket_count for the primary key based on the typical batch size, as discussed in Chapter 4 of this book. Note You can download the test application from this book s companion materials and compare the performance of the various import methods. Using Memory-Optimized Objects as Replacements for Temporary and Staging Tables Memory-optimized tables and table variables can be used as replacements for on-disk temporary and staging tables. However, the level of performance improvement may vary, and it greatly depends on the table schema, workload patterns, and amount of data in the table. Let s look at a few examples and, first, compare the performance of a memoryoptimized table variable with on-disk temporary objects in a simple scenario, which you will often encounter in OLTP systems. Listing shows stored procedures that insert up to 256 rows into the object, scanning it afterwards. Listing Comparing Performance of a Memory-Optimized Table Variable with On-Disk Temporary Objects create type dbo.tttemp as table ( Id int not null primary key nonclustered hash with (bucket_count=512), Placeholder char(255) ) with (memory_optimized=on) go create proc dbo.testinmemtemptables(@rows int) as dbo.tttemp,@cnt int 188

222 CHAPTER 11 UTILIZING IN-MEMORY OLTP ;with N1(C) as (select 0 union all select 0) -- 2 rows,n2(c) as (select 0 from N1 as t1 cross join N1 as t2) -- 4 rows,n3(c) as (select 0 from N2 as t1 cross join N2 as t2) rows,n4(c) as (select 0 from N3 as t1 cross join N3 as t2) rows,ids(id) as (select row_number() over (order by (select null)) from N4) insert select Id from Ids where Id go = count(*) create proc dbo.testtemptables(@rows int) as int create table #TTTemp ( Id int not null primary key, Placeholder char(255) ) ;with N1(C) as (select 0 union all select 0) -- 2 rows,n2(c) as (select 0 from N1 as t1 cross join N1 as t2) -- 4 rows,n3(c) as (select 0 from N2 as t1 cross join N2 as t2) rows,n4(c) as (select 0 from N3 as t1 cross join N3 as t2) rows,ids(id) as (select row_number() over (order by (select null)) from N4) insert into #TTTemp(Id) select Id from Ids where Id go = count(*) from #TTTemp create proc dbo.testtempvars(@rows int) as int table ( Id int not null primary key, Placeholder char(255) ) 189

223 CHAPTER 11 UTILIZING IN-MEMORY OLTP ;with N1(C) as (select 0 union all select 0) -- 2 rows,n2(c) as (select 0 from N1 as t1 cross join N1 as t2) -- 4 rows,n3(c) as (select 0 from N2 as t1 cross join N2 as t2) rows,n4(c) as (select 0 from N3 as t1 cross join N3 as t2) rows,ids(id) as (select row_number() over (order by (select null)) from N4) insert select Id from Ids where Id go = count(*) Table 11-4 illustrates the execution time of the stored procedures called 10,000 times in the loop. As you can see, the memory-optimized table variable outperformed on-disk objects. The level of performance improvements growth with the amount of data when on-disk tables need to allocate more data pages to store it. Table Execution Time of Stored Procedures (10,000 Executions) 16 rows 64 rows 256 rows Memory-Optimized Table Variable 920ms 1,496ms 3,343ms Table Variable 1,203ms 2,994ms 8,493ms Temporary Table 5,420ms 7,270ms 13,356ms It is also worth mentioning that performance improvements can be even more significant in the systems with a heavy concurrent load due to possible allocation pages contention in tempdb. You should remember that memory-optimized table variables do not keep index statistics, similar to on-disk table variables. The Query Optimizer generates execution plans with the assumption that they store just the single row. This cardinality estimation error can lead to highly inefficient plans, especially when a large amount of data and joins are involved. Important As the opposite of on-disk table variables, statement-level recompile with OPTION (RECOMPILE) does not allow SQL Server to obtain the number of rows in memory-optimized table variables. The Query Optimizer always assumes that they store just a single row. Memory-optimized tables can be used as the staging area for ETL processes. As a general rule, they outperform on-disk tables in INSERT performance, especially if you are using user database and durable tables for the staging. Scan performance, on the other hand, greatly depends on the row size and number of data pages in on-disk tables. Traversing memory pointers is a fast operation and it is significantly faster compared to getting a page from the buffer pool. However, on-page 190

224 CHAPTER 11 UTILIZING IN-MEMORY OLTP row access could be faster than traversing long memory pointers chain. It is possible that with the small data rows and large number of rows per page, on-disk tables would outperform memory-optimized tables in the case of scans. Query parallelism is another important factor to consider. The first release of In-Memory OLTP does not support parallel execution plans. Therefore, large scans against on-disk tables could be significantly faster when they use parallelism. Update performance depends on the number of indexes in memory-optimized tables, along with update patterns. For example, page splits in on-disk tables significantly decrease the performance of update operations. Let s look at a few examples based on a simple ETL process that inserts data into an imaginary Data Warehouse with one fact, FactSales, and two dimension, the DimDates and DimProducts tables. The schema is shown in Listing Listing ETL Performance: Data Warehouse Schema create table dw.dimdates ( ADateId int identity(1,1) not null, ADate date not null, ADay tinyint not null, AMonth tinyint not null, AnYear smallint not null, ADayOfWeek tinyint not null, ); constraint PK_DimDates primary key clustered(adateid) create unique nonclustered index IDX_DimDates_ADate on dw.dimdates(adate); create table dw.dimproducts ( ProductId int identity(1,1) not null, Product nvarchar(64) not null, ProductBin nvarchar(64) collate Latin1_General_100_BIN2 not null, ); constraint PK_DimProducts primary key clustered(productid) create unique nonclustered index IDX_DimProducts_Product on dw.dimproducts(product); 191

225 CHAPTER 11 UTILIZING IN-MEMORY OLTP create unique nonclustered index IDX_DimProducts_ProductBin on dw.dimproducts(productbin); create table dw.factsales ( ADateId int not null, ProductId int not null, OrderId int not null, OrderNum varchar(32) not null, Quantity decimal(9,3) not null, UnitPrice money not null, Amount money not null, constraint PK_FactSales primary key clustered(adateid,productid,orderid), constraint FK_FactSales_DimDates foreign key(adateid) references dw.dimdates(adateid), ); constraint FK_FactSales_DimProducts foreign key(productid) references dw.dimproducts(productid) Let s compare the performance of two ETL processes utilizing on-disk and memoryoptimized tables as the staging areas. We will use another table called InputData with 1,650,000 rows as the data source to reduce import overhead so we can focus on the INSERT operation performance. Listing shows the code of the ETL processes. Listing ETL Performance: ETL Process create table dw.factsalesetldisk ( OrderId int not null, OrderNum varchar(32) not null, Product nvarchar(64) not null, ADate date not null, Quantity decimal(9,3) not null, UnitPrice money not null, Amount money not null, /* Optional Placeholder Column */ -- Placeholder char(255) null, primary key (OrderId, Product) ) go 192

226 CHAPTER 11 UTILIZING IN-MEMORY OLTP create table dw.factsalesetlmem ( OrderId int not null, OrderNum varchar(32) not null, Product nvarchar(64) collate Latin1_General_100_BIN2 not null, ADate date not null, Quantity decimal(9,3) not null, UnitPrice money not null, Amount money not null, /* Optional Placeholder Column */ -- Placeholder char(255) null, constraint PK_FactSalesETLMem primary key nonclustered hash(orderid, Product) with (bucket_count = ) /* Optional Index */ -- index IDX_Product nonclustered(product) ) with (memory_optimized=on, durability=schema_and_data) go /*** ETL Process ***/ /* On Disk Table */ -- Step 1: Staging Table Insert insert into dw.factsalesetldisk (OrderId,OrderNum,Product,ADate,Quantity,UnitPrice,Amount) select OrderId,OrderNum,Product,ADate,Quantity,UnitPrice,Amount from dbo.inputdata; /* Optional Index Creation */ --create index IDX1 on dw.factsalesetldisk(product); -- Step 2: DimProducts Insert insert into dw.dimproducts(product) select distinct f.product from dw.factsalesetldisk f where not exists ( select * from dw.dimproducts p where p.product = f.product ); 193

227 CHAPTER 11 UTILIZING IN-MEMORY OLTP -- Step 3: FactSales Insert insert into dw.factsales(adateid,productid,orderid,ordernum, Quantity,UnitPrice,Amount) select d.adateid,p.productid,f.orderid,f.ordernum, f.quantity,f.unitprice,f.amount from dw.factsalesetldisk f join dw.dimdates d on f.adate = d.adate join dw.dimproducts p on f.product = p.product; /* Memory-Optimized Table */ -- Step 1: Staging Table Insert insert into dw.factsalesetlmem (OrderId,OrderNum,Product,ADate,Quantity,UnitPrice,Amount) select OrderId,OrderNum,Product,ADate,Quantity,UnitPrice,Amount from dbo.inputdata; -- Step 2: DimProducts Insert insert into dw.dimproducts(product) select distinct f.product from dw.factsalesetlmem f where not exists ( select * from dw.dimproducts p where f.product = p.productbin ); -- Step 3: FactSales Insert insert into dw.factsales(adateid,productid,orderid,ordernum, Quantity,UnitPrice,Amount) select d.adateid,p.productid,f.orderid,f.ordernum, f.quantity,f.unitprice,f.amount from dw.factsalesetlmem f join dw.dimdates d on f.adate = d.adate join dw.dimproducts p on f.product = p.productbin; I have repeated the tests in four different scenarios, varying row size, with and without Placeholder columns and the existence of nonclustered indexes on Product columns. Table 11-5 illustrates the average execution time in my environment for the scenarios when tables don t have nonclustered indexes. Table 11-6 illustrates the scenario with additional nonclustered indexes on the Product column. 194

228 CHAPTER 11 UTILIZING IN-MEMORY OLTP Table Execution Time of the Tests: No Additional Indexes On-Disk Staging Table Memory-Optimized Staging Table Small Row Large Row Small Row Large Row Staging Table Insert 5,586ms 7,246ms 3,453ms 3,655ms DimProducts Insert 1,263ms 1,316ms 976ms 993ms FactSales Insert 13,333ms 13,303ms 13,266ms 13,201ms Total Time 20,183ms 21,965ms 17,796ms 17,849ms Table Execution Time of the Tests: With Additional Indexes On-Disk Staging Table Memory-Optimized Staging Table Small Row Large Row Small Row Large Row Staging Table Insert 9,233ms 11,656ms 4,751ms 4,893ms DimProducts Insert 513ms 520ms 506ms 513ms FactSales Insert 13,163ms 13,276ms 12,283ms 12,300ms Total Time 22,909ms 25,453ms 17,540ms 17,706ms As you can see, memory-optimized table INSERT performance can be significantly better compared to the on-disk table. The performance gain increases with the row size and when extra indexes are added to the table. Even though extra indexes slow down the insert in both cases, their impact is smaller in the case of memory-optimized tables. On the other hand, the performance difference during the scans is insignificant. In both cases, the most work is done by accessing DimProducts and inserting data into the FactSales on-disk tables. Listing illustrates the code that allows us to compare UPDATE performance of the tables. The first statement changes a fixed-length column and does not increase the row size. The second statement, on the other hand, increases the size of the row, which triggers the large number of page splits in the on-disk table. Listing ETL Performance: UPDATE Performance update dw.factsalesetldisk set Quantity += 1; update dw.factsalesetldisk set OrderNum += ' '; update dw.factsalesetlmem set Quantity += 1; update dw.factsalesetlmem set OrderNum += ' '; Tables 11-7 and 11-8 illustrate the average execution time of the tests in my environment. As you can see, the page split operation can significantly degrade update performance for on-disk tables. This is not the case with memory-optimized tables, where new row versions are generated all the time. 195

229 CHAPTER 11 UTILIZING IN-MEMORY OLTP Table Execution Time of Update Statements: No Additional Indexes On-Disk Staging Table Memory-Optimized Staging Table Small Row Large Row Small Row Large Row Fixed-Length Column Update 2,625ms 2,712ms 2,900ms 2,907ms Row Size Increase 4,510ms 8,391ms 2,950ms 3,050ms Table Execution Time of Update Statements: With Additional Indexes On-Disk Staging Table Memory-Optimized Staging Table Small Row Large Row Small Row Large Row Fixed-Length Column Update 2,694ms 2,709ms 4,680ms 5,083ms Row Size Increase 4,456ms 8,561ms 4,756ms 5,186ms Nonclustered indexes, on the other hand, do not affect update performance of on-disk tables as long as their key columns were not updated. It is not the case with memory-optimized tables where multiple index chains need to be maintained. As you can see, using memory-optimized tables with a Data Warehouse workload completely fits into the It depends category. In some cases, you will benefit from it, while in others performance is degraded. You should carefully test your scenarios before deciding if memory-optimized objects should be used. Finally, it is worth mentioning that all tests in that section were executed with warm cache and serial execution plans. Physical I/O and parallelism could significantly affect the picture. Moreover, you will get different results if you don t need to persist the staging data and can use temporary and non-durable memory-optimized tables during the processes. Using In-Memory OLTP as Session - or Object State-Store Modern software systems have become extremely complex. They consist of a large number of components and services responsible for various tasks, such as interaction with users, data processing, integration with other systems, reporting, and quite a few others. Moreover, modern systems must be scalable and redundant. They need to be able to handle load growth and survive hardware failures and crashes. The common approach to solving scalability and redundancy issues is to design the systems in a way that permits to deploy and run multiple instances of individual services. It allows adding more servers and instances as the load grows and helps you survive hardware failures by distributing the load across other active servers. The services are usually implemented in stateless way, and they don t store or rely on any local data. Most systems, however, have data that needs to be shared across the instances. For example, front-end web servers usually need to maintain web session states. Back-end processing services often need to have shared cache with some data. 196

230 CHAPTER 11 UTILIZING IN-MEMORY OLTP Historically, there were two approaches to address this issue. The first one was to use dedicated storage/cache and host it somewhere in the system. Remember the old ASP.Net model that used either a SQL Server database or a separate web server to store session data? The problem with this approach is limited scalability and redundancy. Storing session data in web server memory is fast but it is not redundant. A SQL Server database, on the other hand, can be protected but it does not scale well under the load due to page latch contention and other issues. Another approach was to replicate content of the cache across multiple servers. Each instance worked with the local copy of the cache while another background process distributed the changes to the other servers. Several solutions on the market provide such capability; however, they are usually expensive. In some cases, the license cost for such software could be in the same order of magnitude as SQL Server licenses. Fortunately, you can use In-Memory OLTP as the solution. In the nutshell, it looks similar to the ASP.Net SQL Server session-store model; however, In-Memory OLTP throughput and performance improvements address the scalability issues of the old on-disk solution. You can improve performance even further by using non-durable memoryoptimized tables. Even though the data will be lost in case of failover, this is acceptable in most cases. However, the 8,060-byte maximum row size limit introduces challenges to the implementation. It is entirely possible that a serialized object will exceed 8,060 bytes. You can address this by splitting the data into multiple chunks and storing them in multiple rows in memory-optimized table. You saw an example of a T-SQL implementation earlier in the chapter. However, using T-SQL code and an interop engine will significantly decrease the throughput of the solution. It is better to manage serialization and split/merge functional on the client side. Listing shows the table and natively compiled stored procedures that you can use to store and manipulate the data in the database. The client application calls the LoadObjectFromStore and SaveObjectToStore stored procedures to load and save the data. The PurgeExpiredObjects stored procedure removes expired rows from the table, and it can be called from a SQL Agent or other processes based on the schedule. Listing Implementing Session Store: Database Schema create table dbo.objstore ( ObjectKey uniqueidentifier not null, ExpirationTime datetime2(2) not null, ChunkNum smallint not null, Data varbinary(8000) not null, ) constraint PK_ObjStore primary key nonclustered hash(objectkey, ChunkNum) with (bucket_count = ), index IDX_ObjectKey nonclustered hash(objectkey) with (bucket_count = ) 197

create type dbo.tvpObjData as table
(
    ChunkNum smallint not null
        primary key nonclustered hash with (bucket_count = 1024),
    Data varbinary(8000) not null
)
with (memory_optimized = on)
go

create proc dbo.SaveObjectToStore
(
    @ObjectKey uniqueidentifier not null
    ,@ExpirationTime datetime2(2) not null
    ,@ObjData dbo.tvpObjData not null readonly
)
with native_compilation, schemabinding, execute as owner
as
begin atomic with
(
    transaction isolation level = snapshot
    ,language = N'English'
)
    delete dbo.ObjStore
    where ObjectKey = @ObjectKey;

    insert into dbo.ObjStore(ObjectKey, ExpirationTime, ChunkNum, Data)
        select @ObjectKey, @ExpirationTime, ChunkNum, Data
        from @ObjData;
end
go

create proc dbo.LoadObjectFromStore
(
    @ObjectKey uniqueidentifier not null
)
with native_compilation, schemabinding, execute as owner
as
begin atomic with
(
    transaction isolation level = snapshot
    ,language = N'English'
)
    declare @CurrentTime datetime2(2) = sysutcdatetime();

    select t.Data
    from dbo.ObjStore t
    where t.ObjectKey = @ObjectKey and t.ExpirationTime >= @CurrentTime
    order by t.ChunkNum;
end
go

create proc dbo.PurgeExpiredObjects
with native_compilation, schemabinding, execute as owner
as
begin atomic with
(
    transaction isolation level = snapshot
    ,language = N'English'
)
    declare @CurrentTime datetime2(2) = sysutcdatetime();

    delete dbo.ObjStore
    where ExpirationTime < @CurrentTime;
end
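For completeness, here is roughly what calling these procedures directly from T-SQL could look like. This usage sketch is not part of the original example: the key, the 20-minute expiration, and the sample chunk bytes are made-up values for illustration, and in the real solution the client classes shown next perform these calls through ADO.NET.

-- Hypothetical usage: store one small object as a single chunk, then read it back
declare @Key uniqueidentifier = newid();
declare @Expiration datetime2(2) = dateadd(minute, 20, sysutcdatetime());

declare @Data dbo.tvpObjData;
insert into @Data(ChunkNum, Data)
values (1, 0x0102030405); -- in practice, the client supplies the serialized chunks

exec dbo.SaveObjectToStore
    @ObjectKey = @Key
    ,@ExpirationTime = @Expiration
    ,@ObjData = @Data;

-- Returns the chunks in ChunkNum order, ready to be merged and deserialized
exec dbo.LoadObjectFromStore @ObjectKey = @Key;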

The client implementation includes several static classes. The ObjStoreUtils class provides four methods to serialize and deserialize objects to and from byte arrays, and to split and merge those arrays to and from 8,000-byte chunks. You can see the implementation in the listing below.

Listing: Implementing Session Store: ObjStoreUtils Class

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;

public static class ObjStoreUtils
{
    /// <summary>
    /// Serialize an object of type T to a byte array
    /// </summary>
    public static byte[] Serialize<T>(T obj)
    {
        using (var ms = new MemoryStream())
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(ms, obj);

            return ms.ToArray();
        }
    }

    /// <summary>
    /// Deserialize a byte array to an object
    /// </summary>
    public static T Deserialize<T>(byte[] data)
    {
        using (var output = new MemoryStream(data))
        {
            var binForm = new BinaryFormatter();
            return (T)binForm.Deserialize(output);
        }
    }

    /// <summary>
    /// Split a byte array into multiple chunks
    /// </summary>
    public static List<byte[]> Split(byte[] data, int chunkSize)
    {
        var result = new List<byte[]>();

        for (int i = 0; i < data.Length; i += chunkSize)
        {
            int currentChunkSize = chunkSize;
            if (i + chunkSize > data.Length)
                currentChunkSize = data.Length - i;

            var buffer = new byte[currentChunkSize];
            Array.Copy(data, i, buffer, 0, currentChunkSize);

            result.Add(buffer);
        }
        return result;
    }

    /// <summary>
    /// Combine multiple chunks into a single byte array
    /// </summary>
    public static byte[] Merge(List<byte[]> arrays)
    {
        var rv = new byte[arrays.Sum(a => a.Length)];
        int offset = 0;

        foreach (byte[] array in arrays)
        {
            Buffer.BlockCopy(array, 0, rv, offset, array.Length);
            offset += array.Length;
        }
        return rv;
    }
}

The ObjStoreDataAccess class shown in the next listing loads and saves the binary data to and from the database. It utilizes another static class called DBConnManager, which returns a SqlConnection object to the target database. That class is not shown in the listing.

Listing: Implementing Session Store: ObjStoreDataAccess Class

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class ObjStoreDataAccess
{
    /// <summary>
    /// Saves data to the database
    /// </summary>
    public static void SaveObjectData(Guid key, DateTime expirationTime, List<byte[]> chunks)
    {
        using (var cnn = DBConnManager.GetConnection())
        {
            using (var cmd = cnn.CreateCommand())
            {
                cmd.CommandText = "dbo.SaveObjectToStore";
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add("@ObjectKey", SqlDbType.UniqueIdentifier).Value = key;
                cmd.Parameters.Add("@ExpirationTime", SqlDbType.DateTime2).Value = expirationTime;

                var tvp = new DataTable();
                tvp.Columns.Add("ChunkNum", typeof(short));
                tvp.Columns.Add("ChunkData", typeof(byte[]));
                for (int i = 0; i < chunks.Count; i++)
                    tvp.Rows.Add(i, chunks[i]);

                var tvpParam = new SqlParameter("@ObjData", SqlDbType.Structured)
                {
                    TypeName = "dbo.tvpObjData",
                    Value = tvp
                };

                cmd.Parameters.Add(tvpParam);
                cmd.ExecuteNonQuery();
            }
        }
    }

    /// <summary>
    /// Loads data from the database
    /// </summary>
    public static List<byte[]> LoadObjectData(Guid key)
    {
        using (var cnn = DBConnManager.GetConnection())
        {
            using (var cmd = cnn.CreateCommand())
            {
                cmd.CommandText = "dbo.LoadObjectFromStore";
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add("@ObjectKey", SqlDbType.UniqueIdentifier).Value = key;

                var result = new List<byte[]>();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        result.Add((byte[])reader["Data"]);
                }
                return result;
            }
        }
    }
}

Finally, the ObjStoreService class shown in the next listing puts everything together and manages the entire process. It implements two simple methods, Load and Save, which call the helper classes defined above.

Listing: Implementing Session Store: ObjStoreService Class

public static class ObjStoreService
{
    private const int MaxChunkSize = 8000;

    /// <summary>
    /// Saves an object in the object store
    /// </summary>
    public static void Save(Guid key, DateTime expirationTime, object obj)
    {
        var objectBytes = ObjStoreUtils.Serialize(obj);
        var chunks = ObjStoreUtils.Split(objectBytes, MaxChunkSize);

        ObjStoreDataAccess.SaveObjectData(key, expirationTime, chunks);
    }

    /// <summary>
    /// Loads an object from the object store
    /// </summary>
    public static T Load<T>(Guid key) where T : class
    {
        var chunks = ObjStoreDataAccess.LoadObjectData(key);
        if (chunks.Count == 0)
            return null;

        var objectBytes = ObjStoreUtils.Merge(chunks);

        return ObjStoreUtils.Deserialize<T>(objectBytes);
    }
}

Obviously, this is an oversimplified example, and a production implementation could be significantly more complex, especially if there is a possibility that multiple sessions can update the same object simultaneously. You can implement retry logic or some form of object locking in the system if that is the case.

It is also worth mentioning that you can compress the binary data before saving it into the database. The compression introduces unnecessary overhead in the case of small objects; however, it can provide significant space savings and performance improvements if the objects are large. I did not include compression code in the example, although you can easily implement it with the GZipStream or DeflateStream classes.

Note  The code and test application are included in the companion materials of this book.

Using In-Memory OLTP in Systems with Mixed Workloads

In-Memory OLTP can provide significant performance improvements in OLTP systems. However, with a Data Warehouse workload, results may vary. The complex queries that perform large scans and aggregations do not necessarily benefit from In-Memory OLTP.

In-Memory OLTP is targeted at the Enterprise market and strong SQL Server teams. It is common to see separate Data Warehouse solutions in those environments. Nevertheless, even in those environments, some degree of reporting and analysis workload is always present in the OLTP systems. The situation is even worse when systems do not have dedicated Data Warehouse and Analysis databases, and OLTP and Data Warehouse queries run against the same data. Moving the data into memory could negatively impact the performance of the reporting queries.

One solution in this scenario is to partition the data between memory-optimized and on-disk tables. You can put recent, hot data into memory-optimized tables while keeping old, historical data on disk. Moreover, it is very common to see different access patterns in such systems: hot data is mainly customer-facing and accessed by OLTP queries, while old, historical data is used for reporting and analysis.

Data partitioning also allows you to create a different set of indexes on the tables based on their access patterns. In some cases, you can even use columnstore indexes on the old data, which significantly reduces the storage size and improves the performance of Data Warehouse queries. Finally, you can use partitioned views to hide the partitioning details from the client applications.

The listing below shows an example of such an implementation. The memory-optimized table called RecentOrders stores the most recently submitted orders. The on-disk LastYearOrders table stores the previous year's orders. Lastly, the OldOrders table stores the orders submitted before that. The view Orders combines the data from all three tables.

Listing: Data Partitioning: Tables and Views

-- Stores the most recent orders (the boundary dates in this listing are illustrative placeholders)
create table dbo.RecentOrders
(
    OrderId int not null identity(1,1),
    OrderDate datetime2(0) not null,
    OrderNum varchar(32) collate Latin1_General_100_BIN2 not null,
    CustomerId int not null,
    Amount money not null,
    /* Other columns */

    constraint PK_RecentOrders
    primary key nonclustered hash(OrderId)
    with (bucket_count = 1048576), -- bucket_count value lost in transcription

    index IDX_RecentOrders_CustomerId nonclustered(CustomerId)
)
with (memory_optimized = on, durability = schema_and_data)
go

create partition function pfLastYearOrders(datetime2(0))
as range right for values
('2015-04-01','2015-07-01','2015-10-01','2016-01-01') -- placeholder quarterly boundaries
go

create partition scheme psLastYearOrders
as partition pfLastYearOrders
all to ([LastYearOrders])
go

create table dbo.LastYearOrders
(
    OrderId int not null,
    OrderDate datetime2(0) not null,
    OrderNum varchar(32) collate Latin1_General_100_BIN2 not null,
    CustomerId int not null,
    Amount money not null,
    /* Other columns */

    -- We have to include OrderDate in the PK due to partitioning
    constraint PK_LastYearOrders
    primary key clustered(OrderDate, OrderId)
    with (data_compression = row)
    on psLastYearOrders(OrderDate),

    constraint CHK_LastYearOrders
    check (OrderDate >= '2015-01-01' and OrderDate < '2016-01-01') -- placeholder dates
);

create nonclustered index IDX_LastYearOrders_CustomerId
on dbo.LastYearOrders(CustomerId)
with (data_compression = row)
on psLastYearOrders(OrderDate);
go

create partition function pfOldOrders(datetime2(0))
as range right for values
(
    /* Older intervals */
    '2013-07-01','2013-10-01','2014-01-01',  -- placeholder boundaries
    '2014-04-01','2014-07-01','2014-10-01'
)
go

create partition scheme psOldOrders
as partition pfOldOrders
all to ([OldOrders])
go

create table dbo.OldOrders
(
    OrderId int not null,
    OrderDate datetime2(0) not null,
    OrderNum varchar(32) collate Latin1_General_100_BIN2 not null,
    CustomerId int not null,

    Amount money not null,
    /* Other columns */

    constraint CHK_OldOrders
    check (OrderDate < '2015-01-01') -- placeholder date
)
on psOldOrders(OrderDate);

create clustered columnstore index CCI_OldOrders
on dbo.OldOrders
with (data_compression = columnstore_archive)
on psOldOrders(OrderDate);
go

create view dbo.Orders(OrderId, OrderDate, OrderNum, CustomerId, Amount)
as
    select OrderId, OrderDate, OrderNum, CustomerId, Amount
    from dbo.RecentOrders
    where OrderDate >= '2016-01-01' -- placeholder boundary

    union all

    select OrderId, OrderDate, OrderNum, CustomerId, Amount
    from dbo.LastYearOrders

    union all

    select OrderId, OrderDate, OrderNum, CustomerId, Amount
    from dbo.OldOrders
go

As you know, memory-optimized tables do not support CHECK constraints, which prevents the Query Optimizer from knowing what data is stored in the RecentOrders table. You can compensate for that by specifying the date range in the WHERE clause of the first SELECT in the view. This allows SQL Server to eliminate access to the table when queries do not need any data from it. You can see this by running the code in the following listing.

Listing: Data Partitioning: Querying Data

select top 10 CustomerId, sum(Amount) as [TotalSales]
from dbo.Orders
where OrderDate >= '2015-05-01' and OrderDate < '2015-06-01' -- placeholder dates
group by CustomerId
order by sum(Amount) desc

Figure 11-3 shows the partial execution plan of the query. As you can see, the query does not access the memory-optimized table at all.

Figure 11-3. Execution plan of the query

The biggest downside of this approach is the inability to seamlessly move data from a memory-optimized table to an on-disk table as the operational period changes. With on-disk tables, it is possible to make the data movement transparent by utilizing online index rebuilds and partition switches. However, this does not work with memory-optimized tables, where you have to copy the data to the new location and delete it from the source table afterwards. This should not be a problem if the system has a maintenance window during which such operations can be performed. Otherwise, you will need to put significant development effort into preventing customers from modifying data while it is on the move.

Note  Chapter 15 in my book Pro SQL Server Internals discusses various data partitioning aspects, including how to move data between different tables and filegroups while keeping the process transparent to users.
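To make the copy-and-delete step concrete, here is a minimal sketch of rotating one closed-out period from the memory-optimized table to an on-disk table. It is not code from the original text: the boundary dates are placeholders, it assumes the target table's CHECK constraint and the view definition have already been adjusted to cover the new period, and a production version would also deal with batching and with blocking concurrent modifications.

-- Run during a maintenance window; dates are placeholders
begin tran;
    insert into dbo.LastYearOrders(OrderId, OrderDate, OrderNum, CustomerId, Amount)
        select OrderId, OrderDate, OrderNum, CustomerId, Amount
        from dbo.RecentOrders with (snapshot)
        where OrderDate >= '2016-01-01' and OrderDate < '2016-04-01';

    delete from dbo.RecentOrders with (snapshot)
    where OrderDate >= '2016-01-01' and OrderDate < '2016-04-01';
commit;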

Summary

In-Memory OLTP can dramatically improve the performance of OLTP systems. However, it can carry a large implementation cost, especially when you need to migrate existing systems. You should perform a cost/benefit analysis, making sure that the implementation cost is acceptable.

It is still possible to benefit from In-Memory OLTP objects even when you cannot utilize the technology to its full scope. Some of the In-Memory OLTP limitations can be addressed in code. You can split data between multiple tables to work around the 8,060-byte maximum row size limitation or, alternatively, store large objects in multiple rows in the table. Uniqueness and referential integrity can be enforced with the REPEATABLE READ and SERIALIZABLE transaction isolation levels.

You should be careful when using In-Memory OLTP with a Data Warehouse workload and queries that perform large scans. While it can help in some scenarios, it can degrade the performance of the system in others. You can implement data partitioning, combining data from memory-optimized and on-disk tables, when this is the case.

Introducing SQL Server

Mike McQuillan

243 Introducing SQL Server Copyright 2015 by Mike McQuillan This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Bradley Beard Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: James Fraleigh Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

244 Contents at a Glance About the Author...xix About the Technical Reviewer...xxi Acknowledgments...xxiii Chapter 1: What Is SQL Server?... 1 Chapter 2: Obtaining and Installing SQL Server... 7 Chapter 3: Database Basics Chapter 4: Tables Chapter 5: Putting Good Tables Together Chapter 6: Automating Deployment with SQLCMD Chapter 7: NULLs and Table Constraints Chapter 8: DML (or Inserts, Updates, and Deletes) Chapter 9: Bulk Inserting Data Chapter 10: Creating Data Import Scripts Chapter 11: The SELECT Statement Chapter 12: Joining Tables Chapter 13: Views Chapter 14: Indexes Chapter 15: Transactions Chapter 16: Functions v

245 CONTENTS AT A GLANCE Chapter 17: Table-Valued Functions Chapter 18: Stored Procedures Part Chapter 19: Stored Procedures Part Chapter 20: Bits and Pieces Appendix A: SQL Data Types Appendix B: Glossary Appendix C: Common SQL Server System Objects Appendix D: Exercises Index vi

CHAPTER 14

Indexes

We've already met indexes during our database travels to date. We were creating clustered indexes when we created primary keys, and we also created unique indexes when we created unique constraints. Indexes are important because they can greatly affect how well your database performs. The benefits of an index become apparent once your database contains a certain number of rows; an index can reduce a query that takes 20 minutes (in some cases) to just two seconds. That's how important they are. We'll take a tour of the various types of index you can create, look at some examples, and see how to create an indexed view. This is a big subject, so hold on to your hat!

What Is an Index?

A database index is not particularly different from an index in a book. Just as you might use a book index to look up a particular word or phrase (e.g., "CREATE VIEW"), SQL Server uses an index to quickly find records based on search criteria you supply to it. Think what happens if you ask SQL Server to bring back all contacts who are developers. If there is an index on the RoleTitle column, SQL Server can use the index to find the rows that match the search criteria. If no index exists, SQL Server has to inspect every single row to find a match.

Why Are Indexes Useful?

Whenever it executes a query, SQL Server uses something called the Query Optimizer. This is a built-in component of SQL Server that takes the T-SQL code you provide and figures out the fastest way of executing it. Indexes can help the Query Optimizer decide on the best path to take. This often helps to avoid disk input/output (disk I/O) operations, in which data has to be read from disk instead of memory. Disk I/O operations are expensive from a time-taken perspective, as disks are slower to access than memory. Indexes help to reduce disk I/O because SQL Server can access the data it needs in fewer steps, especially as the data in the index is sorted according to the indexed columns, resulting in faster lookups.

What Do Indexes Affect?

Indexes affect SELECT, UPDATE, and DELETE statements. They also affect the MERGE statement, which I won't cover in this book. Anything that uses joins or WHERE conditions might benefit from an index.
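To make this concrete, here is a minimal sketch of the kind of index that would support the "all contacts who are developers" search mentioned above, along with a query whose WHERE condition can use it. The RoleTitle column and the 'Developer' value are assumptions for illustration and may not match the sample database used later in this chapter.

-- Assumed for illustration: a RoleTitle column on dbo.Contacts
CREATE NONCLUSTERED INDEX IX_NC_Contacts_RoleTitle
ON dbo.Contacts (RoleTitle);

-- With the index in place, this search no longer has to inspect every row
SELECT ContactId, FirstName, LastName
FROM dbo.Contacts
WHERE RoleTitle = 'Developer';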

247 CHAPTER 14 INDEXES Identifying Which Columns to Index Quite often you ll come to a database and find a lot of indexes have been created that make no sense. While I could probably write a thesis on all of the ways you can identify what you should index, and how to develop a good indexing strategy, there are actually a couple of basic rules of thumb that you can apply to determine whether a column should be included in an index or not. We ll use this statement as a basis for our discussion: SELECT C.ContactId, C.FirstName, C.LastName, C.DateOfBirth, PNT.PhoneNumberType, CPN. PhoneNumber FROM Contacts C INNER JOIN dbo.contactphonenumbers CPN ON C.ContactId = CPN.ContactId INNER JOIN dbo.phonenumbertypes PNT ON CPN.PhoneNumberTypeId = PNT.PhoneNumberTypeId WHERE C.DateOfBirth BETWEEN '1950' AND '2010' AND PNT.PhoneNumberType = 'Home'; The basic steps to index identification are: Write or obtain the SQL statements that will be used to return data from the database. This includes any UPDATE or DELETE statements that use joins or WHERE conditions. Once you have the statements, make a note of the columns used in joins. Contrary to popular belief, a foreign key column is not automatically included in an index. (I ve been told this by many non-sql developers down the years it isn t true!) Look for any WHERE conditions used by the queries and identify the columns used by such queries. Finally, make a note of the columns returned by the SELECT statements. The statement returns six columns. We join on: Contacts.ContactId (this is the primary key and is clustered), ContactPhoneNumbers.ContactId (a non-indexed foreign key), ContactPhoneNumbers.PhoneNumberTypeId (a non-indexed foreign key), and PhoneNumberTypes.PhoneNumberTypeId (a clustered primary key). Finally, the WHERE clause uses two conditions, each of which uses a different column: Contacts.DateOfBirth and PhoneNumberTypes.PhoneNumberType An index can only apply to one table. You cannot spread an index over two tables (Indexed Views offer a kind of workaround to this problem, as we ll see later). This means you cannot create a single index to optimize the preceding query you ll have to create appropriate indexes on all of the tables involved in the query. How Indexes Work I ve already mentioned why indexes are useful. Before we create any indexes, we ll take a glance at how they work. This will help you understand why indexes help queries run faster. 236

Imagine we have 30 rows in a table, and we write a query that will cause row 17 to be returned. Without an index, SQL Server finds the row we are interested in by executing something called a Table Scan (Figure 14-1).

Figure 14-1. Executing a query with a Table Scan

Figure 14-1 shows something called an Execution Plan. You can turn these on to help you determine whether SQL Server is executing your queries in the most efficient manner. Table Scans are bad, and most definitely not efficient. To turn execution plans on, click the Include Actual Execution Plan option in the Query menu of SSMS (or press Ctrl+M). The Execution Plan tab will appear after your query completes.

Table Scans

Why is a table scan bad? Because it means every single row in your table is being interrogated. Our query looks for a particular postcode, which can be found in row 17 of 30. To find that row, SQL Server inspected the first 16 rows before finding the match. It marked row 17 as a match, then carried on inspecting rows 18 to 30 to check whether any of those matched. This isn't too bad when only 30 rows exist, but imagine if a million rows existed!

B-Trees

Indexes offer a much more efficient method of finding data by using something called a B-Tree, more formally known as a Balanced Tree. This is a structure that splits data into different levels, allowing for very fast querying. The B-Tree for a clustered index consists of three levels. Figure 14-2 shows how the levels are structured.

249 CHAPTER 14 INDEXES Figure A clustered index B-Tree There is only ever one root level, and it only ever contains one page. There can be multiple intermediate levels, depending upon the amount of data the index holds. Finally, there is only ever one data level, and the pages in this level hold the actual data, sorted as per the clustered index. You are probably wondering what these pages I ve mentioned are. I m not going to delve deeply into how SQL Server structures its data in this book, but here s a quick overview. Do you remember how a SQL Server database can consist of one or more files? Tables may be contained within one of those files, or spread across multiple files. Each file is split into extents. You could think of an extent as a folder. Each extent contains eight pages. Pages hold rows. If an extent is a folder, a page is a sheet of paper, and a row is a line on that sheet of paper. These pages that hold the data rows are what we are talking about with regards to indexes. Now, let s say we ve added a clustered index on the Postcode column, and we re again looking for postcode NR2 1NQ. Figure 14-3 has the execution plan. 238

250 CHAPTER 14 INDEXES Figure Running the same query after adding a clustered index Now we have a Clustered Index Seek instead of a Table Scan (you may see a Clustered Index Scan don t worry about it if you do). Is this any better? Let s see. Data is now ordered by the postcode. For simplicity, we ll assume the row we are interested in is still at position 17. We ll also assume the data is ordered using numbers, so we can use our earlier diagram (in reality, it would be sorted by postcode). Here s what SQL Server will do: Start at the root page and look at the value of the record at the start of the page. We have the value 1. The next value is 15. Seventeen is not between 1 and 15, so the index navigates to the page in the intermediate level that begins with value 15. The same process occurs. The start value is 15 and the next value is 20. Seventeen is between these two values, so the search now drops down to the data page level. The data page contains the index identifier for row 17. As this is a clustered index, it also holds the data. We re done! We only had to navigate three times to find the value using the clustered index, rather than checking all 30 rows to see if they matched. Much more efficient, especially when you consider what could happen if the table contained many more rows. This is a much-simplified explanation of how indexes work, but it should demonstrate that indexes can be very effective when used correctly. Non-Clustered Indexes And B-Trees Non-clustered indexes work in a very similar manner to clustered indexes, except the data is not stored in a sorted order, and indeed is not stored on the data pages. The data pages in a non-clustered index store an identifier that points at the row containing the data. So once the B-Tree has done its work and found the correct data page, there is an additional step as the index hops over to the actual row to retrieve the columns you have requested. You can see this extra step in Figure

251 CHAPTER 14 INDEXES Figure A non-clustered index B-Tree Included columns can work around this; you can include columns in an index and the values for those columns will be stored right inside the index. This can avoid the additional step required for non-clustered indexes if all of the relevant columns are available. We ll see how this works in a few pages time, in the section Indexed Columns vs. Included Columns. Basics Of The CREATE INDEX Statement Rather unsurprisingly, the CREATE INDEX statement is used to create indexes. The basic structure of this command is: CREATE INDEX IndexName ON TableName (Columns); To create a clustered index you d write: CREATE CLUSTERED INDEX IndexName ON TableName (Columns); Creating non-clustered indexes is very similar: CREATE NONCLUSTERED INDEX IndexName ON TableName (Columns); 240

Including additional columns with the index is as simple as adding the INCLUDE keyword:

CREATE INDEX IndexName ON TableName (Columns) INCLUDE (Columns);

You can also create something called a filtered index, which works on a subset of data (more to come on this, in the section Filtered Indexes). The command to create a filtered index is:

CREATE INDEX IndexName ON TableName (Columns) WHERE (Conditions);

We'll delve into all these commands starting right now.

Clustered Indexes

A clustered index dictates how the data in a table is sorted on disk. As it's only possible to sort data on disk in one particular way, you can only have one clustered index per table. Clustered indexes are often the most performant kind of index because the data is returned as soon as it is located by the index. This is because the data is stored with the index.

To understand how a clustered index works, think of a telephone book (if this seems too old-fashioned for you, think of the Contacts app on your mobile phone). Say you want to look up Grace McQuillan's phone number. You open your phone book (or app) and scan for Grace McQuillan. All data is stored alphabetically. As soon as you find Grace McQuillan you have access to her phone number. The data was there as soon as you located Grace McQuillan. This is different to a non-clustered index, which will tell you where to locate the data.

Creating a Clustered Index

Most of the tables in our AddressBook database already have clustered indexes (we created all tables with clustered primary keys), so we cannot create additional clustered indexes on those tables. However, there is one exception: the ContactAddresses table, shown in Figure 14-5. This was created with a non-clustered primary key.

Figure 14-5. ContactAddresses with a non-clustered primary key

253 CHAPTER 14 INDEXES Why did we do this? The primary key on this table is AddressId. This column exists purely to give a unique, fast primary key to the table. As a piece of data the users are interested in, it is utterly irrelevant. Users will never see it and we are unlikely to use it to find addresses; we are more likely to use the ContactId or the Postcode when searching for addresses. It therefore makes infinitely more sense to have the data sorted by these columns lookups will be much faster. Before creating the index, open up a New Query Window and execute the T-SQL shown in Figure We can see that the data is sorted by AddressId. Figure Query showing how the primary key orders data Open up another New Query Window and enter this script (don t run it yet). USE AddressBook; CREATE CLUSTERED INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses(postcode, ContactId); GO The statement is pretty simple. The CLUSTERED keyword informs SQL Server we want to create a clustered index if this weren t present, a NONCLUSTERED index would be created by default. We ve given the index a descriptive name, so future developers coming to our database can easily see what its purpose is. The IX_C at the start indicates the object is a clustered index. We then have the name of the table for which we are creating the index, followed by the columns we are indexing. Execute this script to create the clustered index, then return to the SELECT * FROM dbo.contactaddresses script and run it again (Figure 14-7 ). 242

254 CHAPTER 14 INDEXES Figure The same query after adding a clustered index Wow, Figure 14-7 shows us that the order has changed somewhat! AddressId, which was neatly ordered earlier, is now all over the place. So is ContactId. If you look at the Postcode column you ll see it is nicely ordered. This makes sense, as it was the first column specified in our index. Take a look at rows 11 and 12 these are both addresses for ContactId 1, and they are ordered by Postcode in ascending order. At a glance, the data in Figure 14-7 looks oddly ordered, because the ID columns are jumbled up. But the rows are sorted according to our index. It s because of indexes like the one we ve just implemented that query results sometimes seem to be arbitrarily ordered. So we ve created a clustered index. We can see the results of the index simply by executing a SELECT * statement. Do you think this is a good index? It s almost a good index. To figure out why, we need to assess how this table is likely to be used. This table is probably going to be used in two ways: Obtain address information when a contact is being viewed through an app of some description, so address details can be displayed alongside the contact details (this will use a join of some sort) Support searching for addresses via a search interface (e.g., searching by postcode) Of these two use cases, the most common is probably going to be the first. The likely user interface will ask the operator to ask for the contact s name or date of birth, from which their record will be found and displayed on-screen. A join will then be executed to return the contact s addresses using the ContactId. It seems ContactId is likely to be used more than Postcode, so we ll make ContactId the first field in the index, with Postcode the second. 243

255 CHAPTER 14 INDEXES Return to the window containing your CREATE INDEX statement. Swap the field names around and execute the CREATE INDEX statement again. As Figure 14-8 tells us, it all goes wrong! Figure The index already exists We have a couple of options here. Modify the index using SSMS Check if the index exists, drop it if it does, then recreate it ALTER INDEX is not like other ALTER statements, as it doesn t allow modification of the index s definition. It is used to change index options, and we ll look at what it can do a bit later. Modifying using SSMS isn t a good idea, as it breaks our principle of providing DBAs with a set of scripts they can execute on any environment. Still, let s take a look at what we could do here. Modifying an Index Using SSMS In the Object Explorer, right-click the ContactAddresses table and refresh it. Then expand the Indexes node. You should see the two indexes shown in Figure 14-9 : our new index and the primary key index. Figure The indexes present for ContactAddresses Right-click IX_C_ContactAddresses_ContactIdPostcode and choose the Properties option from the context menu (alternatively, double-click the index name). The index properties dialog in Figure opens up. 244

256 CHAPTER 14 INDEXES Figure Index properties dialog Note the Index key columns section. You could use the Move Up/Move Down buttons to change the order of the index columns. There are other options on the right that allow you to customize various aspects of the index we ll take a look at one or two of those using T-SQL later. You d also see this screen if you chose to create a new index using SSMS. It s worth pointing out that you can right-click an index in the Object Explorer and script it to a file or New Query Window, just as we did with tables and databases earlier. For me, using SSMS involves a lot more work! Let s modify our script to check if the index already exists. Checking If an Index Exists with T-SQL Return to our CREATE INDEX script window. We ll add a check to see if the index exists, very similar to the checks we ve already added for tables and views. We used sys.tables and sys.views to check if a table or view existed, so no prizes for guessing that the system table holding the indexes is called sys.indexes. Let s add that check! USE AddressBook; IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_C_ContactAddresses_ContactIdPostcode') BEGIN DROP INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses; END; CREATE CLUSTERED INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses(contactid, Postcode); GO 245

257 CHAPTER 14 INDEXES This should be pretty familiar to you now. We check if the index exists and drop it if it does, then we create the index. Make sure you check the last line carefully we ve switched the order of ContactId and Postcode around. The DROP INDEX statement is slightly different to other DROP statements we ve met we had to specify not only the index name, but the name of the table on which the index exists, too. If you execute this script, it should run successfully. It should work no matter how many times you run it. Return to the SELECT * FROM dbo.contactaddresses window and run the SELECT statement. The order has changed back to its original form, as you can see in Figure Figure ContactAddresses with ContactId in the clustered index The fact that AddressId is in order is actually a coincidence the table is now sorted by ContactId, then Postcode. If we added a new address for ContactId 1 it would appear in the top three rows, depending upon the postcode provided. Save the index script as c:\temp\sqlbasics\apply\21 - Create ContactAddresses Clustered Index.sql. Then add a call to the bottom of the 00 - Apply.sql script. :setvar currentfile "21 - Create ContactAddresses Clustered Index.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) Now that we have a handle on clustered indexes, we ll move on to creating some indexes of the non-clustered variety. 246
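As an aside (this is not part of the original walkthrough), CREATE INDEX also supports a DROP_EXISTING option that replaces an existing index with a new definition in a single statement, which can be convenient when you know the index is already there:

-- Recreates the index with the new key order in one step; it fails if the index does not exist yet
CREATE CLUSTERED INDEX IX_C_ContactAddresses_ContactIdPostcode
ON dbo.ContactAddresses(ContactId, Postcode)
WITH (DROP_EXISTING = ON);
GO

The IF EXISTS/DROP pattern used above remains the safer choice for deployment scripts that must run against databases where the index may not exist yet.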

258 Non-Clustered Indexes CHAPTER 14 INDEXES You can have pretty much all the non-clustered indexes you want on any table in your database you can add up to 999 of them (in SQL Server 2008 and later you were previously limited to only 249). These are the indexes your queries will normally use. Clustered indexes are great, but they are limited because you can only have one of them. You also need to keep the key as small and efficient as possible, which can limit their effectiveness in queries. Non-clustered indexes can exist with or without a clustered index. A table will generally work better if a clustered index exists, as will its non-clustered indexes. But it isn t a deal breaker. We ve already seen how a non-clustered index differs from a clustered index, in that it doesn t dictate the order in which table data is stored. Rather, the keys in the index are stored in a separate structure, which is used to identify the rows we are interested in. Once the rows have been identified, a link held within the non-clustered index is used to obtain the appropriate data from the appropriate rows. We ll use the phone number tables to demonstrate non-clustered indexes. We want to find all contacts who have a home phone number. Not a problem; we can write that query in a jiffy! USE AddressBook; SELECT C.ContactId, C.FirstName, C.LastName, C.AllowContactByPhone, PNT.PhoneNumberType, CPN.PhoneNumber FROM dbo.contacts C INNER JOIN dbo.contactphonenumbers CPN ON C.ContactId = CPN.ContactId INNER JOIN dbo.phonenumbertypes PNT ON CPN.PhoneNumberTypeId = PNT.PhoneNumberTypeId WHERE PNT.PhoneNumberType = 'Home'; Running this returns four rows, and generates the execution plan you can see in Figure (remember, Ctrl+M will toggle execution plans on/off): Figure An execution plan, using an Index Scan for ordering We have two inner joins here: one for each join specified in our query. There is a Clustered Index Seek at the bottom of the plan. This is used to seek the ContactId values from the Contacts table. In the top right, we have a Clustered Index Scan over the ContactId column in the ContactPhoneNumbers table. This is 247

259 CHAPTER 14 INDEXES because we configured the PhoneNumberId column as the clustered index when we created it as the primary key. In hindsight, this probably wasn t a good decision. The scan has been used because the column we are using in the join does not order the data. Scans are used when the data is not ordered, and seeks are used when the data is ordered. The final seek is for the PhoneNumberTypes table, which is clustered on the PhoneNumberTypeId column. CLUSTERED INDEX SEEKS AND SCANS You are probably wondering what the difference between an index seek and scan is. A seek will use the B-Tree to locate the data it requires. It will use the search parameters provided to limit the number of pages it searches through. A scan starts at the beginning of the index and moves through each row in order, pulling out matching rows as it finds them. Every row in the index is scanned. You are no doubt thinking seeks are more efficient than scans, and a lot of the time you would be right. But there are instances where a scan will outperform a seek. Further complications arise by considering that a seek sometimes contains a scan! The general rule to follow is that seeks are usually better for fairly straightforward queries (e.g., queries using JOIN s and WHERE clauses), while more complicated queries may benefit from the use of a scan. If in doubt, play around with your indexes until you are satisfied with the performance level. Nowhere in this query plan do we have an index used that supports the PhoneNumberType column. This is being used in our WHERE clause. We can create a non-clustered index on this column, which will assist our query when the number of records in our tables start to grow. It s time for a New Query Window, into which we ll type our non-clustered index statement. USE AddressBook; IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_NC_PhoneNumberTypes_PhoneNumberType') BEGIN DROP INDEX IX_NC_PhoneNumberTypes_PhoneNumberType ON dbo.phonenumbertypes; END; CREATE INDEX IX_NC_PhoneNumberTypes_PhoneNumberType ON dbo.phonenumbertypes(phonenumbertype); GO Run this, then execute the SELECT statement again. Check out the execution plan now (Figure ). 248

260 CHAPTER 14 INDEXES Figure The updated execution plan (with highlighted Index Seek) MOUSING OVER OPERATIONS IN THE EXECUTION PLAN Top tip time: If you place your mouse over an operation within the execution plan, a detailed tool tip will appear, providing you with various statistics about that particular operation. Aha! Now our new, non-clustered index is being used by the query. This means any queries we write in the future that use the PhoneNumberType column will be optimized. Top stuff. Save this index as c:\temp\sqlbasics\apply\22 - Create PhoneNumberTypes Index.sql. Here s the SQLCMD code to add to 00 - Apply.sql : :setvar currentfile "22 - Create PhoneNumberTypes Index.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) Execution Plan Percentages Let s take a moment to quickly look at the percentage assigned to each item in the execution plans. The percentage tells you how much work SQL Server has to perform for each individual part of the plan. When we added our non-clustered index, it took up 45% of the plan. This is a good thing, as the aim of the index was to cause it to be used by the plan. The two clustered index items use up most of the remaining percentage. Interestingly, these figures changed when we added our non-clustered index. This is because the non-clustered index is doing some of the work these indexes were previously doing. Execution Plans: A Quick Summary To say this has been a crash course in execution plans is an understatement. In truth, we ve only looked at them to demonstrate that our indexes are being used. But even this basic knowledge can help you. You can mouse over the items in an index plan and a pop-up will appear, telling you which columns are being 249

261 CHAPTER 14 INDEXES used by that item. If columns you are joining on or are using in WHERE clauses are not mentioned, consider adding an index, then running the query again. If things don t work out as expected, you can always remove the index. Execution plans will actually suggest missing indexes it thinks you should add they appear just above the execution plan diagram. Execution plans especially a good knowledge of them is an advanced topic, but hopefully this brief introduction has whetted your appetite for more. We still need them in this chapter! Indexed Columns vs. Included Columns All of the indexes we ve created so far have used indexed columns that is, the columns have been declared as part of the index key. The index key consists of all columns declared within brackets after the table name. So in this index: CREATE CLUSTERED INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses(contactid, Postcode); the index key is made up of ContactId and Postcode. There are no included columns. Included columns come in useful when you want to store more data alongside the index. You don t necessarily want to make these columns part of the index key. Indeed, you may not be able to do so the size of the key is limited to 900 bytes and 16 columns. If you had a Postcode column of size VARCHAR(880), you d only have 20 bytes left to play with. Included columns can work around this problem, and also the problem of non-clustered indexes needing to take an extra step. Open a New Query Window and run this SELECT statement, making sure you turn on Execution Plans with Ctrl+M. USE AddressBook; SELECT C.ContactId, C.FirstName, C.LastName, C.AllowContactByPhone FROM dbo.contacts C WHERE C.AllowContactByPhone = 1; Ten rows are returned, and the execution plan (displayed in Figure ) uses the clustered index to find the data. Figure Execution plan using the clustered index 250

Really, we need an index on AllowContactByPhone, as this is the WHERE clause being used. Let's go ahead and create that index in another New Query Window.

USE AddressBook;

IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_NC_Contacts_AllowContactByPhone')
BEGIN
    DROP INDEX IX_NC_Contacts_AllowContactByPhone ON dbo.Contacts;
END;

CREATE NONCLUSTERED INDEX IX_NC_Contacts_AllowContactByPhone
ON dbo.Contacts(AllowContactByPhone);
GO

Execute this, then run the SELECT statement again. Figure 14-15 shows the query plan:

Figure 14-15. Execution plan after adding the non-clustered index

It hasn't changed! Why is this? Well, SQL Server's Query Optimizer has decided the clustered index will execute the query more efficiently than the new non-clustered index we created. This is because the clustered index has immediate access to the data. As soon as a match is found by the clustered index, we return the data. With the non-clustered index, a match is found and then we have to skip across to the row to pull the data back. We can remedy this situation by including the columns we are interested in as part of the index.

USE AddressBook;

IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_NC_Contacts_AllowContactByPhone')
BEGIN
    DROP INDEX IX_NC_Contacts_AllowContactByPhone ON dbo.Contacts;
END;

CREATE NONCLUSTERED INDEX IX_NC_Contacts_AllowContactByPhone
ON dbo.Contacts(AllowContactByPhone)
INCLUDE (ContactId, FirstName, LastName);
GO

The INCLUDE line is the only change. We've told SQL Server to store the ContactId, FirstName, and LastName columns with the index. These columns do not form part of the index key (they wouldn't be used by the index to find data matching a WHERE clause, for example), but they are stored alongside the index, meaning the extra hop over to the row once a match is found is not required. Run the script to update the index, then run the SELECT statement again. A different query plan appears this time, as shown in Figure 14-16.

263 CHAPTER 14 INDEXES Figure Execution plan now using the non-clustered index! We ve just created something called a covering index. This is an index that can be used to return all data for a particular query. The index in Figure utilizes all columns required to fulfil the query s requirements, so we say it covers the query s requirements. Save the index script as c:\temp\sqlbasics\apply\23 - Create Contacts AllowContactByPhone Index.sql, and add it to the SQLCMD file 00 - Apply.sql. :setvar currentfile "23 - Create Contacts AllowContactByPhone Index.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) Think carefully when including columns as part of your indexes. They are very useful, but just remember that SQL Server has to maintain all of this information whenever you insert, update, or delete data in your table. The benefits of the index must outweigh the downside of keeping it up to date. Filtered Indexes Filtered indexes were introduced in SQL Server 2008, and are severely underused in my experience. This is a shame, as they present an elegant solution to certain problems. A filtered index is the same as any other type of index you create, with one big difference: you specify a WHERE clause, limiting the index to certain types of data. Why would you want to do this? Filtered indexes are smaller than normal, full-table indexes Not all DML statements will cause filtered indexes to be updated, reducing the cost of index maintenance Less disk space is required to store a filtered index, as it only stores the rows matching the filter We ll change the index we just created for AllowContactByPhone into a filtered index. Before we do that, return to the SELECT statement we were using to test it. USE AddressBook SELECT C.ContactId, C.FirstName, C.LastName, C.AllowContactByPhone FROM dbo.contacts C WHERE C.AllowContactByPhone = 1; If you run this and check the execution plan, you ll see the non-clustered index was used. Change the query to = 0 instead of = 1. As Figure proves, you ll see the non-clustered index is still used. 252

Figure 14-17. Using a non-clustered index regardless of the value

We'll change our index so it only applies to records where AllowContactByPhone = 1. Open c:\temp\sqlbasics\apply\23 - Create Contacts AllowContactByPhone Index.sql and change the CREATE INDEX statement:

CREATE NONCLUSTERED INDEX IX_NC_Contacts_AllowContactByPhone
ON dbo.Contacts(AllowContactByPhone)
INCLUDE (ContactId, FirstName, LastName)
WHERE AllowContactByPhone = 1;

Save and run the script to update the index. Execute the SELECT statement again as per Figure 14-18, setting AllowContactByPhone = 1.

265 CHAPTER 14 INDEXES Figure Using a filtered index when the value matches Good news: our index is being used. Change the statement to use AllowContactByPhone = 0 and run it again. This time, our index is ignored, and the clustered index is used instead (Figure ). 254

266 CHAPTER 14 INDEXES Figure Not using a filtered index when the value doesn t match This is perfect we ve told SQL Server the non-clustered index should only apply when the filter value for AllowContactByPhone is 1. If you use them wisely, filtered indexes can really boost your queries, and can reduce the impact indexes may have on your DML statements. Keep them up your sleeve, as many SQL Server developers are not aware of filtered indexes. Knowing what features like this can do will help you stand out from the crowd. Don t forget to add script 23 to the 00 - Apply.sql SQLCMD script. Here is the code: :setvar currentfile "23 - Create Contacts AllowContactByPhone Index.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) Unique Indexes We met unique indexes when we were talking about constraints in Chapter 7. Whenever we created a unique constraint, we were actually creating a unique index. A unique index is an index that prevents duplicate values from being entered into different rows across the columns it represents. A unique index on the PhoneNumber column in ContactPhoneNumbers would prevent the same phone number from being added twice, for example. Refer back to Chapter 7 if you need to refresh your knowledge. Other Types of Index SQL Server provides an XML data type, which can store the tiniest fragment of XML up to huge documents of 2GB in size. As you might imagine, searching all of this XML data can take a while. SQL Server has the ability to create both primary and secondary XML indexes to assist with such searching. If you have plans to create XML columns you should investigate how these indexes work performance gains can be impressive. 255
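To give a flavor of the syntax (the table and column here are hypothetical and are not part of the AddressBook database), creating XML indexes might look like this:

-- Hypothetical table holding XML data; a primary XML index requires a clustered primary key
CREATE TABLE dbo.ContactNotes
(
    NoteId INT NOT NULL CONSTRAINT PK_ContactNotes PRIMARY KEY CLUSTERED,
    NoteData XML NOT NULL
);
GO

CREATE PRIMARY XML INDEX IX_XML_ContactNotes_NoteData
ON dbo.ContactNotes(NoteData);
GO

-- Secondary XML indexes build on the primary XML index
CREATE XML INDEX IX_XML_ContactNotes_NoteData_Path
ON dbo.ContactNotes(NoteData)
USING XML INDEX IX_XML_ContactNotes_NoteData FOR PATH;
GO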

267 CHAPTER 14 INDEXES A new type of index was introduced in SQL Server 2012: the Columnstore index. These are indexes aimed at bulk loads and read-only queries. It is primarily intended for use in data warehouses (essentially, flattened databases managed by SSAS), although they can be used for other purposes, too. If used in the correct manner, Microsoft claims index performance can be increased by up to 10 times that of traditional indexes. Maintaining Indexes Usually, your tables and other database objects require minimal maintenance. Once the object is structured how you want it and has been proven to be working correctly, you can pretty much leave it running, with the occasional check. This isn t the case with indexes. Over time, indexes will become fragmented. This means there are gaps in the index or data are not structured in the index as well as they might be. This happens because of DML statements. INSERTs, UPDATE s, and DELETE s cause data to be removed from or moved about inside the index. This causes new pages to be added to the index, which can increase the number of intermediate levels created, and require more pages to be checked when searching for records in the index. We ll take a look at some of the maintenance features SQL Server provides that help you keep on top of your indexes. Identifying Index Fragmentation SQL Server provides a set of Dynamic Management Views, or DMVs for short. There are lots of DMVs that provide access to all sorts of information. This query uses a DMV called sys.dm_db_index_physical_stats to tell you if any of your indexes are fragmented. SELECT DB_NAME(PS.[database_id]) AS DatabaseName, OBJECT_NAME(PS.[object_id]) AS TableOrViewName, SI.[name] AS IndexName, PS.[index_type_desc] AS IndexType, PS.[avg_fragmentation_in_percent] AS AmountOfFragementation FROM sys.dm_db_index_physical_stats(db_id(n'addressbook'), NULL, NULL, NULL, 'DETAILED') PS INNER JOIN sys.indexes SI ON PS.[object_id] = SI.[object_id] AND PS.[index_id] = SI.[index_id] ORDER BY OBJECT_NAME(PS.[object_id]) ASC; On the FROM line in Figure 14-20, note I ve specified 'AddressBook'. By substituting any database name here the query will return basic fragmentation information for your database s indexes. 256

268 CHAPTER 14 INDEXES Figure Returning index fragmentation details Altering Indexes To manage an existing index, you use the ALTER INDEX statement. This ALTER statement works differently from the ALTER VIEW or ALTER TABLE statements we ve seen so far, and also from the other ALTER statements we ll meet later in the book. Usually, an ALTER statement makes direct changes to the object concerned; ALTER VIEW allows you to completely change the definition of the view, for example. ALTER INDEX is used for maintenance purposes. Its principal aim is to allow you to either disable, rebuild, or reorganize an index. You can change certain options for the index, but you cannot change its definition to do that, you need to drop the index and then recreate it. Disabling Indexes You may occasionally need to disable an index. You might do this if you want to see how a query performs with the index and without it, but you don t want to lose the various metadata held for and about the index. Note that if you disable a clustered index you won t be able to query the table (but the data are still present; you just need to re-enable the index). In a New Query Window, run this query. USE AddressBook; SELECT * FROM dbo.contactaddresses; 257

All rows will be returned from the table. Now, change the script so it includes an ALTER INDEX statement above the SELECT, disabling the clustered index we created earlier.

USE AddressBook;

ALTER INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.ContactAddresses DISABLE;

SELECT * FROM dbo.ContactAddresses;

You'll see some interesting messages, which are displayed in Figure 14-21.

Figure 14-21. Querying a table with a disabled index

The two warning messages are generated by the ALTER INDEX statement. When a clustered index is disabled, other indexes on the table are disabled, too. This includes foreign keys. We've ended up disabling three indexes here: the index we requested, the FK_ContactAddresses_Contacts foreign key index, and the primary key index PK_ContactAddresses. These will all need to be re-enabled separately, unless we use the ALL keyword when rebuilding.

The error message was raised by the SELECT statement. Our ALTER INDEX was successful, but because the clustered index is disabled, data can no longer be retrieved from the table. We'll have to re-enable the indexes so we can query our data.

Rebuilding Indexes

There is no option to re-enable an index. Instead, you must rebuild it. You also rebuild indexes when an index is not performing as expected, probably due to fragmentation. Rebuilding an index causes the index to be dropped and recreated, resolving any fragmentation issues. Specifying the ALL keyword causes every index on the table to be dropped and recreated. This statement will rebuild just the clustered index on the ContactAddresses table:

ALTER INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.ContactAddresses REBUILD;

But this statement will rebuild every index on the ContactAddresses table:

ALTER INDEX ALL ON dbo.ContactAddresses REBUILD;

Running this will make our SELECT statement work again. Note in Figure 14-22 that we've had to add a GO before the SELECT statement; this is needed, as the REBUILD has to complete in its own batch before we can query the table.

270 CHAPTER 14 INDEXES Figure Running the query after rebuilding the index Reorganizing Indexes Reorganizing causes the leaf level of the index the level that holds the data (or points to the data in a non-clustered index) to be, well, reorganized. This eliminates fragmentation. This is similar to rebuilding, but crucially it can be done without impacting access to the table. The indexes we are playing with here are very small and rebuild instantly. Imagine a table with millions of rows. Rebuilding an index on these tables can sometimes take hours. If this happens, rebuilding an index may not be desirable it could prevent access to the table during the rebuild. It is for this kind of scenario that reorganization was introduced. The index is reorganized but the table is still accessible. You cannot reorganize a disabled index; the index must be active. If we wanted to reorganize our clustered index we would specify the REORGANIZE keyword. ALTER INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses REORGANIZE; Again, we could reorganize all indexes with the ALL keyword: ALTER INDEX ALL ON dbo.contactaddresses REORGANIZE ; Altering Indexes Using SSMS You can disable, rebuild, or reorganize using SSMS. Locate the required index in the Indexes node (found within the table the index is applied to), right-click it, and choose the appropriate option. 259
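A common way to tie the fragmentation query, REBUILD, and REORGANIZE together is to choose the operation based on how fragmented each index actually is. The 5 and 30 percent boundaries below are widely quoted rules of thumb, not values taken from this chapter, so treat the generated commands as a starting point and review them before running anything against a production database.

USE AddressBook;

SELECT 'ALTER INDEX ' + QUOTENAME(SI.[name]) + ' ON ' +
       QUOTENAME(SCHEMA_NAME(SO.[schema_id])) + '.' + QUOTENAME(SO.[name]) +
       CASE
           WHEN PS.[avg_fragmentation_in_percent] >= 30 THEN ' REBUILD;'
           ELSE ' REORGANIZE;'
       END AS MaintenanceCommand
FROM sys.dm_db_index_physical_stats(DB_ID(N'AddressBook'), NULL, NULL, NULL, 'LIMITED') PS
INNER JOIN sys.indexes SI
    ON PS.[object_id] = SI.[object_id] AND PS.[index_id] = SI.[index_id]
INNER JOIN sys.objects SO
    ON PS.[object_id] = SO.[object_id]
WHERE PS.[avg_fragmentation_in_percent] >= 5   -- below this, maintenance rarely pays off
  AND PS.[index_id] > 0;                       -- skip heaps

The query doesn't change anything itself; it just generates one ALTER INDEX statement per fragmented index, which you can copy into a new query window and run selectively.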

271 CHAPTER 14 INDEXES Dropping Indexes Over time, you may decide certain indexes have outlived their usefulness. They can be removed quite easily by using the DROP INDEX command, which we ve already met. Just for practice, we ll create three rollback scripts to drop the indexes we ve created in this chapter. Create these three scripts in c:\temp\sqlbasics\rollback Create ContactAddresses Clustered Index Rollback.sql USE AddressBook; IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_C_ContactAddresses_ ContactIdPostcode') BEGIN DROP INDEX IX_C_ContactAddresses_ContactIdPostcode ON dbo.contactaddresses; END; GO 22 - Create PhoneNumberTypes Index Rollback.sql USE AddressBook; IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_PhoneNumberTypes_ PhoneNumberType') BEGIN DROP INDEX IX_PhoneNumberTypes_PhoneNumberType ON dbo.phonenumbertypes; END; GO 23 - Create Contacts AllowContactByPhone Index Rollback.sql USE AddressBook; IF EXISTS (SELECT 1 FROM sys.indexes WHERE [name] = 'IX_NC_Contacts_ AllowContactByPhone') BEGIN DROP INDEX IX_NC_Contacts_AllowContactByPhone ON dbo.contacts; END; GO Add these lines to the top of 00 - Rollback.sql to ensure these scripts are executed whenever we rollback the database. :setvar currentfile "23 - Create Contacts AllowContactByPhone Index Rollback.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) :setvar currentfile "22 - Create PhoneNumberTypes Index Rollback.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) :setvar currentfile "21 - Create ContactAddresses Clustered Index Rollback.sql" PRINT 'Executing $(path)$(currentfile)'; :r $(path)$(currentfile) 260
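Before you write a rollback script to drop an index, it's worth checking whether the index is being used at all. The sketch below leans on sys.dm_db_index_usage_stats, a DMV we haven't covered in this chapter; bear in mind that its counters reset when the SQL Server service restarts, so an empty row doesn't automatically mean the index is dead.

USE AddressBook;

SELECT OBJECT_NAME(SI.[object_id]) AS TableName,
       SI.[name] AS IndexName,
       US.user_seeks,
       US.user_scans,
       US.user_lookups,
       US.user_updates   -- the maintenance cost paid on every DML statement
FROM sys.indexes SI
LEFT JOIN sys.dm_db_index_usage_stats US
    ON US.[object_id] = SI.[object_id]
   AND US.[index_id] = SI.[index_id]
   AND US.[database_id] = DB_ID()
WHERE OBJECTPROPERTY(SI.[object_id], 'IsUserTable') = 1
  AND SI.[name] IS NOT NULL
ORDER BY TableName, IndexName;

An index with plenty of user_updates but no seeks, scans, or lookups is a strong candidate for one of the DROP INDEX rollback scripts above.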

272 Statistics CHAPTER 14 INDEXES I view statistics as an advanced topic, but I want to mention them briefly so you at least know they exist. There, I ve mentioned them let s move on! Only joking (it s very good for morale!). SQL Server uses statistics to figure out how it can best process your query. Statistics include the number of records for an index, how many pages those records cover, and details of table records, such as how many records in Contacts have a value of 1 in the AllowContactByPhone column. As an example, assume you are running the query WHERE AllowContactByPhone = 1. You have an index for this column, but for some reason it isn t being used. There are a couple of possible reasons for this. One is the statistics for the index are out of date. SQL Server automatically maintains statistics based on particular rules. Sometimes these rules are not met and the statistics are not updated perhaps as often as they should be. Another reason the index may not be used is because the statistics are up to date, and they inform SQL Server that using the index will be less efficient than using a scan or an alternative index. You can delve into statistics by expanding the Statistics node under a table in Object Explorer. In Figure 14-23, we can see the statistics for the ContactAddresses table. Figure Viewing statistics for a table Double-clicking one of these items will bring up more information about the statistics. As you are just starting out in your SQL Server journey, don t worry too much about statistics at the moment. I know people who ve worked with SQL Server for years and don t understand them properly (or at all). But know they are there it just might be worth delving into them in more detail. Creating Indexed Views The last piece of code we ll write in what has been an extremely involving chapter is to take one of our views and transform it into an indexed view. Indexed views perform better than normal views because they have a unique clustered index added to them. This means the view exists as a physical object, making it work in a manner very similar to a table. Creating an indexed view is the only way you can create an index that bridges multiple tables. You can create both clustered and non-clustered indexes against a view, but you can only create nonclustered indexes if a unique clustered index already exists. Also, the view definition must meet certain rules. If any of these rules are not met, you cannot create a clustered index on the view: The view definition must include WITH SCHEMABINDING All table names must include the schema name (e.g., dbo.contacts, not Contacts ) All expressions in the view must be deterministic; that is, the same value is always returned ( GETDATE() is not deterministic as it never returns the same value, but ADDNUMBERS(1,2) would always return 3, so it is deterministic) 261

273 CHAPTER 14 INDEXES The view can only include tables, not other views The tables in the view must exist in the same database Most aggregate functions cannot be used (e.g., COUNT(), MIN(), MAX() ) There are more rules, such as certain SET options that must be configured, but these are usually set to the default values anyway, so we won t concern ourselves with them. The preceding rules represent the most common things you have to think about. A full list of rules can be found at en-us/library/ms aspx. We ll create an index on the VerifiedContacts view. Before we can create any other type of index, we must create a unique, clustered index. You cannot just create a clustered index; it must be unique, too. The statement to create an index on a view is no different from other indexes we ve seen, other than the inclusion of the UNIQUE keyword: CREATE UNIQUE CLUSTERED INDEX IX_C_VerifiedContacts_ContactIdFirstNameLastName ON dbo.verifiedcontacts(contactid, LastName, AllowContactByPhone); The combination of columns declared in the index must be unique. ContactId will always be unique as it is an IDENTITY column. This means we can add other columns to the index, as the first column guarantees uniqueness in this case. If the first column didn t provide us with a unique value, we d need to add more columns so the combination of values would be unique. Running this statement is successful, and allows us to go ahead and create a non-clustered index on the view: CREATE INDEX IX_NC_VerifiedContacts_DrivingLicenseNumber ON dbo.verifiedcontacts(drivinglicensenumber); At this point, we could create as many non-clustered indexes as we wished on the view. These indexes will really help when your database has grown and your view is retrieving lots of rows. I once reduced a three-minute query to under a second using an indexed view. This query was used all over the system, so you can imagine the gains that were made. There s actually a lot more to indexed views than we ve discussed here, so take some time to read the MSDN article it s well worth a look. Are Indexes Ever a Bad Idea? We ve seen lots of index-related escapades in this chapter, all universally positive. It s worth asking: Is there ever a time when using indexes could be a bad thing? The answer is simple: Heck, yes! Like any other piece of SQL Server technology, indexes can have a negative effect on the database when used incorrectly. The main reason indexes can cause problems is because of INSERT, UPDATE, and DELETE statements the DML statements. Let s think about how indexes work again for a moment. If an INSERT or UPDATE statement executes, the index must be updated with the current data If a DELETE statement executes, the rows in question need to be removed from the index The point here is when a DML statement executes, it means SQL Server does some additional work to keep the index up to date. This additional work marginally slows down your INSERT, UPDATE, and DELETE statements. They don t slow down so much that you d notice, but there is an effect. 262
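If you'd like to see this overhead for yourself rather than take my word for it, you can time the same batch of inserts with and without an extra index. This is a rough, illustrative test only; the scratch table and the 100,000-row count are made up for the demonstration, and on tables as small as the ones in AddressBook the difference will be lost in the noise.

USE AddressBook;

-- Scratch table purely for this timing test
CREATE TABLE dbo.IndexCostDemo (Id INT IDENTITY(1,1), Payload VARCHAR(200));
CREATE INDEX IX_NC_IndexCostDemo_Payload ON dbo.IndexCostDemo(Payload);

SET STATISTICS TIME ON;

INSERT INTO dbo.IndexCostDemo (Payload)
SELECT TOP (100000) REPLICATE('x', 200)
FROM sys.all_columns AC1 CROSS JOIN sys.all_columns AC2;

SET STATISTICS TIME OFF;

-- Now drop the index, empty the table, and rerun the timed INSERT to compare elapsed times
DROP INDEX IX_NC_IndexCostDemo_Payload ON dbo.IndexCostDemo;
TRUNCATE TABLE dbo.IndexCostDemo;
-- ...rerun the INSERT between SET STATISTICS TIME ON/OFF, then clean up:
DROP TABLE dbo.IndexCostDemo;

Compare the elapsed times reported on the Messages tab; the run with the non-clustered index in place has to do the extra bookkeeping described above.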

274 CHAPTER 14 INDEXES Now, imagine a couple of scenarios. An index that has become badly defragmented An index whose statistics are wildly out of date A covering index with multiple columns Any of these can cause update problems. Let me tell you a story about a covering index that went bad. I should point out this is an extreme example! A company I once worked for had built an event logging system. All log requests were sent to a queue and a service then came along, picked up the requests from the queue, and inserted them into an EventLog table. This table had a covering index on it, allowing easy querying of the log. For the first couple of months, the event logging system worked really well. Eventually, a developer needed the log to investigate a problem, and was surprised to discover data for the past week wasn t in the table. I took a look at the queue and saw there was a huge backlog of queue items waiting to be processed. This was strange; querying the table returned results effectively enough. I tried manually inserting a row and was stunned to discover it took five minutes to insert! This was into a table of about six columns. The covering index was the problem. Removing it fixed the problem immediately, resulting in inserts of less than a second. Because the index had multiple columns and hadn t been maintained properly, it was taking forever (well, five minutes) to figure out where it should place new rows. Remember this cautionary tale when building your indexes! They are a great thing, but make sure you use them correctly and in moderation. Don t add them for the sake of it. Summary My word, this has been a big chapter. We ve covered just about every aspect of indexes, although we ve hardly skimmed the surface of what is a huge topic. Indexes are the most impressive speed improvement you can make to your queries. Take some time to tinker with them and you ll be the toast of your department. Our AddressBook database has some nice indexes now, but if we can t add data in a consistent manner, indexes won t help us at all. Our next chapter will help us in this regard, as we take a look at transactions. 263

275 Extending SSIS with.net Scripting A Toolkit for SQL Server Integration Services Joost van Rossum Régis Baccaro

276 Extending SSIS with.net Scripting Copyright 2015 by Joost van Rossum and Régis Baccaro This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Development Editor: Douglas Pundick Technical Reviewer: John Welch Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Kim Burton-Weisman Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. 
For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

277 Contents at a Glance About the Authors...xix About the Technical Reviewer...xxi Acknowledgments...xxiii Introduction...xxv Part I: Getting Started... 1 Chapter 1: Getting Started with SSIS and Scripting... 3 Chapter 2: Script Task vs. Script Component Chapter 3:.NET Fundamentals Part II: Script Tasks Chapter 4: Script Task Chapter 5: File Properties Chapter 6: Working Through the Internet and the Web Chapter 7: Working with Web Services and XML Chapter 8: Advanced Solutions with Script Task Part III: Script Component Chapter 9: Script Component Foundation Chapter 10: Script Component As Source Chapter 11: Script Component Transformation Chapter 12: Script Component As Destination iii

278 CONTENTS AT A GLANCE Chapter 13: Regular Expressions Chapter 14: Script Component Reflection Chapter 15: Web Services Part IV: Custom Tasks and Components Chapter 16: Create a Custom Task Chapter 17: Create Custom Transformation Part V: Scripting from.net Applications Chapter 18: Package Creation Chapter 19: Package Execution from.net Index iv

279 CHAPTER 4 Script Task If you want to work with SFTP (SSH File Transfer Protocol), automate sending a highly customized e-mail, apply passwords to files, or check file properties such as the creation date and modified date, then the Script Task is one of your best choices. This chapter dives into the Script Task and some of its most common usages, including how to use it with variables and connection managers, how to use it with logging and error handling, and how to reference custom assemblies. Editor When you add a Script Task to the Control Flow and edit it, you first have to choose the Script Language: C# or VB.NET (see Figure 4-1). If you still work with SSIS 2005, you can only select VB.NET. This choice is write-once: after hitting the Edit Script... button, the option is grayed out and you cannot change the Script Language anymore. The only way to change it is to delete the entire Script Task, add a new Script Task to the Control Flow, and start over again. 69

280 CHAPTER 4 SCRIPT TASK Figure 4-1. The Script Task Editor You can change the default programming language to your own preference. In Visual Studio, go to the Tools menu and select Options... Expand the Business Intelligence Designer section and then the Integration Services Designer. The default Language option appears on the right side (see Figure 4-2 ). 70

281 CHAPTER 4 SCRIPT TASK Figure 4-2. Changing the default Script Language The second property in the editor is the Entry Point. With this property, you can change the name of the method that will be the starting point for your script. The default is the Main method. Unless you have a good reason, changing it could perhaps be a little confusing for others. ReadOnlyVariables and ReadWriteVariables give you the ability to read and change variables within the Script Task code, but you can also use them to read package and project parameters if you are using the project deployment model (available since SSIS 2012). You can either enter the variable names manually or use the pop-up window. These fields are optional, but more information and examples are provided later in this chapter. 71

282 CHAPTER 4 SCRIPT TASK Script Layout Hitting the Edit Script... button starts the VSTA editor, which gives you the ability to write.net code. This editor is a new instance of Visual Studio in a VSTA project with either C# or VB.NET code. VSTA stands for Visual Studio Tools for Applications. If you are still using SSIS 2005, then the VSA (Visual Studio for Applications) editor gives you the ability to write VB.NET code. The VSA editor is a stripped version of Visual Studio that lacks a lot of functionality, including the ability to write C# code. The first time you start a VSTA editor for a Script Task, it generates default code to help you get started. The VSTA environment has three main sections (see Figure 4-3 ). Figure 4-3. The VSTA editor for the Script Task A. ScriptMain: The editor in which you type your code. Saving is done automatically when you close the VSTA environment. B. Solution Explorer: In this section, you can add extra references to other.net libraries, such as LINQ or to custom libraries. You could also change project properties, such as the target framework, and optionally add extra C# or VB.NET files. Changes should be saved with the Save All button; otherwise, they will be lost when you close the VSTA editor. C. Properties: Here you can see the properties of the item you selected in the Solution Explorer. You can see where the (temporary) VSTA project is stored on disk, for example. 72

283 CHAPTER 4 SCRIPT TASK The script in section A is generated and the code varies per SSIS version and, of course, per scripting language (see Figures 4-4 and 4-5 ). Figure 4-4. SSIS 2008 C# Script Task code 73

284 CHAPTER 4 SCRIPT TASK Figure VB.NET Script Task code The Script Task always starts with a general comment. The text changes between SSIS versions. Remove these, or even better, replace them with a useful comment about the file/script. Why did you use a Script Task and what is your code doing? In 2012, regions were added to make the code more orderly. You could also add them manually to SSIS 2008 script. #region Help: Introduction to the script task /* The Script Task allows you to perform virtually any operation that can be * accomplished in a.net application within the context of an Integration * Services control flow. * * Expand the other regions which have "Help" prefixes for examples of specific * ways to use Integration Services features within this script task. */ #endregion 74

285 CHAPTER 4 SCRIPT TASK This is the VB.NET code: #Region "Help: Introduction to the script task" 'The Script Task allows you to perform virtually any operation that can be 'accomplished in a.net application within the context of an Integration 'Services control flow. 'Expand the other regions which have "Help" prefixes for examples of specific 'ways to use Integration Services features within this script task. #End Region Next part are the using directives or import statements. In C# they are called using directives and in VB.NET they are called Imports statements. For example: #region Namespaces using System; using System.Data; using Microsoft.SqlServer.Dts.Runtime; using System.Windows.Forms; #endregion And here is the VB.NET code: #Region "Imports" Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Runtime #End Region Which namespaces are included varies per SSIS version and even per scripting language. You can add extra usings/imports to make your code more compact. They enable/allow the use of types in a given namespace. See Chapter 3 for more information about.net fundamentals. The third part is the namespace and class declaration. These are generated. Don t change these unless you are an experienced.net developer with a good reason to do it. namespace ST_abfa556bdb974f78a26e3c3af4606e6e { /// <summary> /// ScriptMain is the entry point class of the script. Do not change the /// name, attributes, or parent of this class. /// </summary> [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute] public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask. VSTARTScriptObjectModelBase { 75

286 CHAPTER 4 SCRIPT TASK And here is the VB.NET code: 'ScriptMain is the entry point class of the script. Do not change the name, 'attributes, or parent of this class. <Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _ <System.CLSCompliantAttribute(False)> _ Partial Public Class ScriptMain Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase Next is the Script Result declaration (in SSIS 2012 these were moved to the bottom of the script; they don t exist in SSIS 2005). This generated code is for assigning a result to the Script Task: Success or Failure. Don t change this. To save space, the book examples do no show this code, but you can t delete it from the actual code! #region ScriptResults declaration /// <summary> /// This enum provides a convenient shorthand within the scope of this class /// for setting the result of the script. /// /// This code was generated automatically. /// </summary> enum ScriptResults { Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success, Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure }; #endregion And this is the VB.NET code: #Region "ScriptResults declaration" 'This enum provides a convenient shorthand within the scope of this class 'for setting the result of the script. 'This code was generated automatically. Enum ScriptResults Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure End Enum #End Region The fifth part consists of help text and example code. You can leave it, remove it, or change it to general comments about your class and its methods. The final part is the Main method. This is the method that starts when you run the Script Task and where you add your custom code. The Main method should always result in either ScriptResults.Success or ScriptResults.Failure. In SSIS 2005, you see a different syntax. To succeed the Script Task, it is Dts.TaskResult = Dts.TaskResult.Success and to fail the Script Task, it is Dts.TaskResult = Dts.TaskResult.Failure. 76
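A pattern you will see used throughout the rest of this chapter is wrapping the body of Main in a try/catch, so that an unexpected exception fails the Script Task cleanly and its message ends up in the SSIS log instead of the generic invocation error. The skeleton below is a suggestion of mine rather than generated code; the only template pieces it reuses are the ScriptResults assignments, and the Dts.Events.FireError call is covered in detail later in this chapter.

public void Main()
{
    try
    {
        // Your custom code goes here.

        Dts.TaskResult = (int)ScriptResults.Success;
    }
    catch (Exception ex)
    {
        // Fire an error event so the real exception text shows up in the SSIS log
        Dts.Events.FireError(0, "Script Task", ex.Message, string.Empty, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
}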

287 Variables and Parameters CHAPTER 4 SCRIPT TASK The package variables in SSIS 2012 introduced parameters that can be used in a Script Task. You can use them to avoid hard-coded values in the script itself, or you can adjust the variable values in the script so that they can be used in other tasks or expressions. There are two different methods. In this example, you are counting the number of files in a folder. The folder path will be provided by a string variable or parameter, and the number of files will be stored in an integer variable. Note Parameters are read-only. You can t change them, and you will get this error if you try to: Exception has been thrown by the target of an invocation. First, create a new package called variables.dtsx and add a Script Task to the Control Flow. Give the Script Task a useful name like SCR Count Files. Next, create a string variable (or a string package parameter or a string project parameter), name it FolderPath, and fill it with the path of an existing directory. Also create an integer (Int32) variable called FileCount for storing the number of files. Figure 4-6 shows the two variables. Figure 4-6. Use variables for the FileCount and FolderPath Method 1: ReadOnlyVariables and ReadWriteVariables For the first method, you need to fill the ReadOnlyVariables and ReadWriteVariables properties in the Script Task Editor so that these variables are locked by the Script Task during runtime. This can be done by typing the name or using the selection window (click the button with the three dots that appears when you select the field). Add the string variable name FolderPath (or one of the string parameters) as the read-only variable and add the integer variable FileCount as the read-write variable. 77

288 CHAPTER 4 SCRIPT TASK Figure 4-7. Select the variables (or parameters) that you want to use After filling the ReadOnlyVariables and ReadWriteVariables properties, you can click the Edit Script... button to open the VSTA environment. First, add an extra using/import for System.IO on top so that you can do IO operations such as counting all the files in a folder. using System.IO; And here is the VB.NET code: Imports System.IO Next, add the actual code to the Main method. First you need to get the folder path from the variable and store it in a local.net variable. string myfolder = Dts.Variables["User::FolderPath"].Value.ToString(); And this is the VB.NET code: Dim myfolder As String = Dts.Variables("User::FolderPath").Value.ToString() 78

289 CHAPTER 4 SCRIPT TASK Then you need to use that local variable in the actual file counting code and store the file count directly in the SSIS integer variable FileCount. Dts.Variables["User::FileCount"].Value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length; And this is the VB.NET code: Dts.Variables("User::FileCount").Value = Directory.GetFiles(myFolder, "*.*", _ SearchOption.TopDirectoryOnly).Length Your total script now should look something like the following, but the namespace has a different name because it is generated. And the ScriptResults declaration is not shown in this code. #region Namespaces using System; using System.Data; using Microsoft.SqlServer.Dts.Runtime; using System.Windows.Forms; #endregion #region customnamespaces using System.IO; #endregion namespace ST_a0107ad99e244d5ca57c869184dd6a52 { /// <summary> /// This is an example on how to use variables and parameters in a Script /// Task. It gets a folder from a variable or parameter. Counts the number of /// files in it and fill the read write integer variable with the filecount. /// </summary> [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute] public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask. VSTARTScriptObjectModelBase { /// <summary> /// Get folder and count the number of files in it. /// Pass te file count to the SSIS variable /// </summary> public void Main() { // First create a.net string variable to store the path in. A little // redundant in this example but you could add extra steps to for // example validate the existance of the folder. Then choose which // variable or parameter you want to use to get the path from. In this // case I used the variable string myfolder = Dts.Variables["User::FolderPath"].Value.ToString(); 79

290 CHAPTER 4 SCRIPT TASK // If you rather want to use a parameter then use one of these codelines instead // of the variable line above. One of the three lines should be uncommented. //string myfolder = Dts.Variables["$Package::FolderPath"].Value.ToString(); //string myfolder = Dts.Variables["$Project::FolderPath"].Value.ToString(); // Get the file count from the my folder and store that number // in the SSIS integer variable. Dts.Variables["User::FileCount"].Value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length; // Close the script with result success. Dts.TaskResult = (int)scriptresults.success; } } } And here is the VB.NET code: #Region "Imports" Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Runtime #End Region #region customnamespaces Imports System.IO #endregion <Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _ <System.CLSCompliantAttribute(False)> _ Partial Public Class ScriptMain Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase ' This is an example on how to use variables and parameters in a Script Task. ' It gets a folder from a variable or parameter. counts the number of files ' in it and fill the read write integer variable with the file count. Public Sub Main() ' First create a.net string variable to store the path in. A little ' redundant in this example but you could add extra steps to for ' example validate the existance of the folder. Then choose which ' variable or parameter you want to use to get the path from. In this ' case I used the variable Dim myfolder As String = Dts.Variables("User::FolderPath").Value.ToString() ' If you rather want to use a parameter then use one of these codelines instead ' of the variable line above. One of the three lines should be uncommented. ' string myfolder = Dts.Variables("$Package::FolderPath").Value.ToString() ' string myfolder = Dts.Variables("$Project::FolderPath").Value.ToString() 80

291 CHAPTER 4 SCRIPT TASK ' Get the file count from the my folder and store that number in the SSIS ' integer variable. Dts.Variables("User::FileCount").Value = Directory.GetFiles(myFolder, "*.*", _ SearchOption.TopDirectoryOnly).Length ' Close the script with result success. Dts.TaskResult = ScriptResults.Success End Sub End Class When you copy and paste a lot of code to other Script Tasks, it is also possible to check if a variable exists in the collection of ReadOnlyVariables and ReadWriteVariables, and then log a meaningful error if it isn t available: if (Dts.Variables.Contains("FileCount")) { // Get the file count from the my folder and store that number in the SSIS integer variable. Dts.Variables["User::FileCount"].Value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length; } else { // Handle error } And here is the VB.NET code: If (Dts.Variables.Contains("FileCount")) Then ' Get the file count from the my folder and store that number in the SSIS ' integer variable. Dts.Variables("User::FileCount").Value = Directory.GetFiles(myFolder, "*.*", _ SearchOption.TopDirectoryOnly).Length Else ' Handle error End If And it is also possible to loop through the collection of variables that are added to the ReadOnlyVariables and ReadWriteVariables properties. In this example, you use a message box, but in one of the next paragraphs, you see a more elegant way to show the variables: foreach (Variable myvar in Dts.Variables) { MessageBox.Show(myVar.Namespace + "::" + myvar.name); } And this is the VB.NET code: For Each myvar As Variable In Dts.Variables MessageBox.Show(myVar.Namespace & "::" & myvar.name) Next 81

292 CHAPTER 4 SCRIPT TASK However, it is not possible to loop through all package variables and parameters because the Script Task doesn t support enumerating the list of all variables and parameters. You always have to hard-code the names unless you hard-code a reference to your package, and then you can iterate through all package variables and parameters. You should be aware that in this case it will create a second instance in the memory of the package. If you want to try this code, change the file path of the package. Microsoft.SqlServer.Dts.Runtime.Application app = new Microsoft.SqlServer.Dts.Runtime.Application(); Package mypackage = app.loadpackage(@"y:\ssis\variables.dtsx", null); // Loop through package variables and parameters foreach (Variable myvar in mypackage.variables) { // Filter System variables if (!myvar.namespace.equals("system")) { MessageBox.Show(myVar.Name); } } 82 And this is the VB.NET code: Dim app As Microsoft.SqlServer.Dts.Runtime.Application = _ New Microsoft.SqlServer.Dts.Runtime.Application() Dim mypackage As Package = app.loadpackage("y:\ssis\variables.dtsx", Nothing) ' Loop through package variables and parameters For Each myvar As Variable In mypackage.variables ' Filter System variables If Not myvar.namespace.equals("system") Then MessageBox.Show(myVar.Name) End If Next If you want to get the value of a sensitive parameter, then you have to slightly change the code. Instead of using.value.tostring(), you need to use.getsensitivevalue().tostring(). But be aware that you re now responsible for not accidently leaking sensitive values like passwords. // Create string variable to store the parameter value string mysecretpassword = Dts.Variables["$Package::MySecretPassword"].GetSensitiveValue().ToString(); // Show the parameter value with a messagebox MessageBox.Show("Your secret password is " + mysecretpassword); And here is the VB.NET code: ' Create string variable to store the parameter value Dim mysecretpassword as string = _ Dts.Variables("$Package::MySecretPassword").GetSensitiveValue().ToString() ' Show the parameter value with a messagebox MessageBox.Show("Your secret password is " + mysecretpassword)
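Before moving on to the second method, note that you can wrap the Contains check in a small helper inside ScriptMain so you don't repeat it for every variable. GetVariableValue is an invented name, not part of the SSIS API; the only framework calls it uses are Dts.Variables.Contains, the Dts.Variables indexer, and Dts.Events.FireError, all shown above. It assumes Contains resolves the same name string you would pass to the indexer; if you prefer, pass the unqualified name exactly as the earlier Contains example does.

// Hypothetical helper: returns the value of an SSIS variable, or fails loudly.
private object GetVariableValue(string variableName)
{
    if (Dts.Variables.Contains(variableName))
    {
        return Dts.Variables[variableName].Value;
    }

    // A descriptive error beats the generic "variable not found" exception
    Dts.Events.FireError(0, "Script Task", "Variable " + variableName +
        " is not listed in ReadOnlyVariables or ReadWriteVariables.", string.Empty, 0);
    throw new ArgumentException("Missing SSIS variable: " + variableName);
}

// Usage inside Main:
// string myfolder = GetVariableValue("User::FolderPath").ToString();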

293 Method 2: Variable Dispenser CHAPTER 4 SCRIPT TASK For the second method, you don't use the ReadOnlyVariables and ReadWriteVariables properties in the Script Task Editor. Instead, you lock the variables in the script itself, using the variable dispenser with the LockForRead and LockForWrite methods. It is a different approach than before, but it has the same end result. Add a new Script Task to your variables.dtsx package and connect it to the existing Script Task to make sure that the two Script Tasks don't execute at the same time; otherwise, they will both try to lock the same variables. Edit the Script Task and open the VSTA environment. Add an extra using/import for System.IO on top so that you can do IO operations such as counting all the files in a folder.

using System.IO;

And this is the VB.NET code:

Imports System.IO

Now go to the Main method and add the following lines to lock the variables by code. The FolderPath variable is locked for read and the FileCount variable is locked for write.

Dts.VariableDispenser.LockForRead("User::FolderPath");
Dts.VariableDispenser.LockForWrite("User::FileCount");

And here is the VB.NET code:

Dts.VariableDispenser.LockForRead("User::FolderPath")
Dts.VariableDispenser.LockForWrite("User::FileCount")

Then read the FolderPath variable and store its content in a local string variable.

Variables vars = null;
Dts.VariableDispenser.GetVariables(ref vars);
string myfolder = vars["User::FolderPath"].Value.ToString();

And here is the VB.NET code:

Dim vars As Variables = Nothing
Dts.VariableDispenser.GetVariables(vars)
Dim myfolder As String = vars("User::FolderPath").Value.ToString()

The next step is to count the number of files in the folder and store it in the SSIS integer variable FileCount.

vars["User::FileCount"].Value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length;

And this is the VB.NET code:

vars("User::FileCount").Value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length
83

294 CHAPTER 4 SCRIPT TASK Now the last part: releasing the lock on the variables. vars.unlock(); And here is the VB.NET code: vars.unlock() The finale code should look something like this: #region Namespaces using System; using System.Data; using Microsoft.SqlServer.Dts.Runtime; using System.Windows.Forms; #endregion #region customnamespaces using System.IO; #endregion namespace ST_fb03c633e7fc4e20a58e8e1ffc40b68e { /// <summary> /// This is an example on how to use variables and parameters in a Script /// Task. It gets a folder from a variable or parameter. Counts the number /// of files in it and fill the read write integer variable with the file /// count. /// </summary> [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute] public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase { /// <summary> /// Get folder and count the number of files in it. Pass te file count to /// the SSIS variable /// </summary> public void Main() { // Lock variables for read or for write Dts.VariableDispenser.LockForRead("User::FolderPath"); Dts.VariableDispenser.LockForWrite("User::FileCount"); // If you want to use a parameter instead of a variable then change the code to one of these lines. //Dts.VariableDispenser.LockForRead("$Package::FolderPath"); //Dts.VariableDispenser.LockForRead("$Project::FolderPath"); // Create a variable 'container' to store variables Variables vars = null; 84

295 CHAPTER 4 SCRIPT TASK // Add variables from the VariableDispenser to the variables container Dts.VariableDispenser.GetVariables(ref vars); // First create a.net string variable to store the path in. A little // redundant in this example but you could add extra steps to for // example validate the existance of the folder. Then choose which // variable or parameter you want to use to get the path from. In // this case I used the variable string myfolder = vars["user::folderpath"].value.tostring(); // Same alternative for using a parameter instead of a variable. Only use one of these three lines. //string myfolder = vars["$package::folderpath"].value.tostring(); //string myfolder = vars["$project::folderpath"].value.tostring(); // Get the file count from the my folder and store that number in the // SSIS integer variable. vars["user::filecount"].value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length; // Release the locks vars.unlock(); // Close the script with result success. Dts.TaskResult = (int)scriptresults.success; } } } And this is the VB.NET code: #Region "Imports" Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Runtime #End Region #region customnamespaces Imports System.IO #endregion ' This is an example on how to use variables and parameters in a Script ' Task. It gets a folder from a variable or parameter. Counts the number ' of files in it and fill the read write integer variable with the file ' count. <Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute()> _ <System.CLSCompliantAttribute(False)> _ Partial Public Class ScriptMain Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase 85

296 CHAPTER 4 SCRIPT TASK ' Get folder and count the number of files in it. Pass te file count to ' the SSIS variable Public Sub Main() ' Lock variables for read or for write Dts.VariableDispenser.LockForRead("User::FolderPath") Dts.VariableDispenser.LockForRead("$Package::FolderPath") Dts.VariableDispenser.LockForRead("$Project::FolderPath") Dts.VariableDispenser.LockForWrite("User::FileCount") ' Create a variable 'container' to store variables Dim vars As Variables = Nothing ' Add variables from the VariableDispenser to the variables container Dts.VariableDispenser.GetVariables(vars) ' First create a.net string variable to store the path in. A little ' redundant in this example but you could add extra steps to for example ' validate the existance of the folder. Then choose which variable or ' parameter you want to use to get the path from. In this case I used the ' variable Dim myfolder As String = vars("user::folderpath").value.tostring() ' Get the file count from the my folder and store that number in ' the SSIS integer variable. vars("user::filecount").value = Directory.GetFiles(myFolder, "*.*", _ SearchOption.TopDirectoryOnly).Length ' Release the locks vars.unlock() ' Close the script with result success. Dts.TaskResult = ScriptResults.Success End Sub End Class If you only need to lock one variable for read or for write, then you can use the LockOneForRead and LockOneForWrite methods. You need a little less code, but the result is similar. Here you only show the alternative code from the Main method. The extra using/import is the same as before. public void Main() { // Create a variable 'container' to store variables Variables vars = null; // Lock variable for read and add it to the variables 'container' Dts.VariableDispenser.LockOneForRead("User::FolderPath", ref vars); // First create a.net string variable to store the path in. A little // redundant in this example but you could add extra steps to for example // validate the existance of the folder. Then choose which variable or 86

297 CHAPTER 4 SCRIPT TASK // parameter you want to use to get the path from. In this case I used // the variable string myfolder = vars["user::folderpath"].value.tostring(); // Release the lock vars.unlock(); // Lock variable for write and add it to the variables container Dts.VariableDispenser.LockOneForWrite("User::FileCount", ref vars); // Get the file count from the my folder and store that number in the // SSIS integer variable. vars["user::filecount"].value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length; // Release the lock vars.unlock(); // Close the script with result success. Dts.TaskResult = (int)scriptresults.success; } And here is the VB.NET code: Public Sub Main() ' Create a variable 'container' to store variables Dim vars As Variables = Nothing ' Lock variable for read and add it to the variables 'container' Dts.VariableDispenser.LockOneForRead("User::FolderPath", vars) ' Dts.VariableDispenser.LockOneForRead("$Package::FolderPath", vars) ' Dts.VariableDispenser.LockOneForRead("$Project::FolderPath", vars) ' First create a.net string variable to store the path in. A little ' redundant in this example but you could add extra steps to for example ' validate the existence of the folder. Then choose which variable or ' parameter you want to use to get the path from. In this case I used the ' variable Dim myfolder As String = vars("user::folderpath").value.tostring() ' Release the locks vars.unlock() ' Lock variable for write and add it to the variables 'container' Dts.VariableDispenser.LockOneForWrite("User::FileCount", vars) ' Get the file count from the my folder and store that number in the ' SSIS integer variable. vars("user::filecount").value = Directory.GetFiles(myFolder, "*.*", SearchOption.TopDirectoryOnly).Length 87

298 CHAPTER 4 SCRIPT TASK ' Release the locks vars.unlock() ' Close the script with result success. Dts.TaskResult = ScriptResults.Success End Sub Advantages and Disadvantages of Both Methods With the variable dispenser method, you have a little more control when you lock and unlock your variables. The big downside is that you cannot quickly see which variables you are using. You have to check the entire code. Another disadvantage is that you need more code to accomplish the same thing. Therefore, the first method should be your preferred method. And you can also unlock the variables manually when you use the first method: if (Dts.Variables.Count > 0) { Dts.Variables.Unlock(); } And here is the VB.NET code: If (Dts.Variables.Count > 0) Then Dts.Variables.Unlock() End If Parent Package Variables A parent package is a package that executes another package (a child) via the Execute Package Task. In the child package, you can read and write variables from a parent package with a Script Task, but the variables are only available in run-time mode and not in design-time mode. This means you cannot use the selection window, but you can type it manually. You can use both methods to read/write parent package variables in a Script Task. There is one downside: without proper error handling, you won t be able to run the child package without the parent package, because it expects a variable that is not available. An alternative for reading parent package variables is to use parent package configurations, but that is only for reading and not for writing. Referencing Assemblies Sometimes you reuse a piece of code in multiple Script Tasks. If for some reason you have to change that piece of code, then you have to edit all the Scripts Tasks that use that code. To avoid this, you could create an assembly and reference it in your Script Tasks. An assembly is a piece of precompiled code that can be used by.net applications. You can create an assembly to store your often-used methods. If you have to change one of those methods, then you only have to change the assembly (and not all of those Script Tasks). Third-party companies (including Microsoft) can create assemblies for you as well; for example, an assembly to unzip files or to download files via SFTP. So, you don t have to reinvent the wheel. In some of the following chapters, you will learn how to use these third-party assemblies. In this chapter, you will learn how to create a simple assembly and use it in a Script Task. 88
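Going back to the Parent Package Variables section for a moment, here is a minimal sketch of the defensive error handling mentioned there, before we move on to assemblies. The variable name ParentValue is invented for the example; the sketch uses the variable dispenser from Method 2 so that the lock attempt can simply be wrapped in a try/catch, which is what lets the child package still run on its own when the parent variable isn't there.

public void Main()
{
    bool fireAgain = true;
    string parentValue = "default value";   // fallback when running without the parent package

    try
    {
        Variables vars = null;
        Dts.VariableDispenser.LockOneForRead("User::ParentValue", ref vars);
        parentValue = vars["User::ParentValue"].Value.ToString();
        vars.Unlock();
    }
    catch (Exception)
    {
        // The parent variable is not available, so we are probably running the child stand-alone.
        Dts.Events.FireInformation(0, "Script Task",
            "Parent variable ParentValue not found; using the default value.", string.Empty, 0, ref fireAgain);
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}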

299 Creating an Assembly CHAPTER 4 SCRIPT TASK If you want to create your own assembly, then you need the full version of Visual Studio, or at least a version that supports a C# or VB.NET project. (So not just BIDS or SSDT BI.) In this example, you will create an assembly with a method to validate the format of an address. Start Visual Studio and create a new Class Library project called mymethodsforssis. This template can be found under Visual Basic and Visual C# (see Figure 4-8 ). Make sure that you choose the right.net Framework version (see Table 4-1 ); otherwise, you cannot reference it. Referencing a lower.net version is possible with some extra steps, but you can t reference an assembly with a higher.net version. Figure 4-8. New Class Library project Table 4-1. Choose the Correct.NET Framework Version SSIS Version Supported Framework (R2) 2.0 => => =>

300 CHAPTER 4 SCRIPT TASK When you create the new project, the Class1.cs or Class1.vb file is pretty empty. Start with adding the usings or imports at the top of the Class1 file. The C# file already has some usings, but the VB.NET file has none. They are in the project properties, but to keep the examples the same, you are adding them to the file as well. using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Text.RegularExpressions; // Added And this is the VB.NET code: Imports System Imports System.Collections.Generic Imports System.Linq Imports System.Text Imports System.Text.RegularExpressions ' Added The next step is the namespace and class declaration. You used a static class with static methods so that you can simply call the methods in your Script Task. The VB.NET equivalent for a static class is a module with functions. For more information about the static class, go to library/79b3xss3.aspx. For more information about the module, go to en-us/library/aaxss7da.aspx. The classname/modulename is Methods and the namespace is mymethodsforssis. namespace mymethodsforssis { // A static class with methods public static class Methods { } } And here is the VB.NET code: Namespace mymethodsforssis ' A module with methods Public Module Methods End Module End Namespace And the last step for the code is to add a public static method that validates the address format to the C# class, or a public function to the VB.NET module. It is called IsCorrect and it takes an address as input and returns either true or false, indicating whether the format is correct. You can copy the method from the sources added to this book. When you are finished, the complete code should look like this. 90

301 CHAPTER 4 SCRIPT TASK

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions; // Added

namespace mymethodsforssis
{
    // A static class with methods
    public static class Methods
    {
        // A Boolean method that validates an e-mail address
        // with a regex pattern.
        public static bool IsCorrect(String emailAddress)
        {
            // The pattern for an e-mail address
            string emailAddressPattern = @"^(([^<>()[\]\\.,;:\s@\""]+" +
                                         @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" +
                                         @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" +
                                         @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" +
                                         @"[a-zA-Z]{2,}))$";

            // Create a regex object with the pattern
            Regex emailAddressRegex = new Regex(emailAddressPattern);

            // Check if it is a match and return that value (Boolean)
            return emailAddressRegex.IsMatch(emailAddress);
        }
    }
}

And this is the VB.NET code:

Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Text.RegularExpressions ' Added

Namespace mymethodsforssis
    ' A module with methods
    Public Module Methods
        ' A Boolean method that validates an e-mail address
        ' with a regex pattern.
        Public Function IsCorrect(emailAddress As String) As Boolean
            ' The pattern for an e-mail address
            Dim emailAddressPattern As String = "^(([^<>()[\]\\.,;:\s@\""]+" & _
                                                "(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@" & _
                                                "((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" & _
                                                "\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+" & _
                                                "[a-zA-Z]{2,}))$"
91

302 CHAPTER 4 SCRIPT TASK

            ' Create a regex object with the pattern
            Dim emailAddressRegex As New Regex(emailAddressPattern)

            ' Check if it is a match and return that value (Boolean)
            Return emailAddressRegex.IsMatch(emailAddress)
        End Function
    End Module
End Namespace

Strong Name Before you can use the new assembly in an SSIS Script Task, you first have to strong name it. Go to the properties of the project and then to the Signing page. Check the Sign the assembly check box and then add a new key file in the drop-down list. The name for this example is PublicPrivateKeyFile.snk, with sha256RSA as the signature algorithm and no password (see Figure 4-9). After clicking the OK button, the new key file will be visible in the Solution Explorer. Now close the project properties and build the project as a Release instead of the default Debug mode. You can change this in the Visual Studio toolbar. When you build the project, the assembly is signed with this new key file. Note Adding a strong name is not a security measure. It only provides a unique identity. 92

303 Global Assembly Cache CHAPTER 4 SCRIPT TASK The last step for preparing the assembly is to add it to the Global Assembly Cache (GAC) on the SSIS machine. SSIS can only use assemblies that are available via the GAC. Open the Visual Studio (2008/2010/ etc.) command prompt, but run it as administrator; otherwise, you can t add assemblies to the GAC. Go to the Bin\Release folder of your project, where you will find the.dll file of your newly created assembly. Execute the following command to add it to the GAC (see Figure 4-10 ): gacutil /i mymethodsforssis.dll Figure NET Global Assembly Cache Utility If you don t have gacutil on your server, which is often the case if don t have Visual Studio installed on a server, then you can use PowerShell to deploy your assembly (an example script is available with the source code for this book), or you can create a Setup and Deployment project in Visual Studio to create an installer for your assembly. Depending on the Visual Studio version installer, projects are located in Other Project Types Setup Deployment Projects. For Visual Studio 2013 and above, you need to download this project template: Build Events If you don t want to use the command prompt to add the assembly to the GAC each time you change it, then you could also add a post-build event to your Visual Studio project. Go to the properties of your assembly project by right-clicking it in the Solution Explorer. For C#, you can go to the Build Events tab and locate the post-build event command line (see Figure 4-11 ). For VB.NET, you need to go to the Compile tab and hit the Build events button in the lower-right corner, and then locate the same post-build event command line (see Figure 4-12 ). 93

304 CHAPTER 4 SCRIPT TASK Figure Post-build event with C# Figure Post-build event with VB.NET 94

305 CHAPTER 4 SCRIPT TASK Now copy the following command, but change the path of the gacutil. The path depends on the.net Framework version and even the Windows version. This example is for.net 3.5 on Windows 7. cd GACUTIL="C:\Program Files (x86)\microsoft SDKs\Windows\v7.0A\bin\gacutil.exe" Echo Installing dll in GAC Echo $(OutDir) Echo $(TargetFileName) %GACUTIL% -if "$(OutDir)$(TargetFileName)" Here are some alternative paths: C:\Program Files (x86)\microsoft SDKs\Windows\v7.0A\Bin\NETFX 4.0 Tools\gacutil.exe (4.0 on Win7) C:\Program Files (x86)\microsoft SDKs\Windows\v8.1A\bin\NETFX Tools\gacutil.exe (4.5.1 on Win8.1) Now you can add the assembly to the GAC by building the assembly project, but you need to run Visual Studio as an administrator, otherwise it won t work. To use the assembly in a Script Task you also need to think of a location for the actual dll file. For SSIS 2005, it is mandatory to put the.dll file in the Assemblies folder of SQL Server: C:\Program Files\Microsoft SQL Server\90\SDK\Assemblies\. For newer versions, you can put it anywhere. You can do that manually, but you could also add some extra lines to the post-build DLLDIR="C:\Program Files (x86)\microsoft SQL Server\100\DTS\Assemblies\" Echo Copying files to Assemblies copy "$(OutDir)$(TargetFileName)" %DLLDIR% Add a Reference in the Script Task Now you can add a reference in the Script Task to the newly created assembly. For C#, right-click References in the Solution Explorer and choose Add Reference (see Figure 4-13 ). For VB.NET, right-click the project in the Solution Explorer and choose Add Reference (see Figure 4-14 ). Now browse to your.dll file and click OK to add it. The location of the browse button varies per version of Visual Studio. After adding the new reference, you need to click Save All to save the internal project and its new references. In newer versions of Visual Studio (2014), this mandatory Save All step isn t necessary anymore. 95

306 CHAPTER 4 SCRIPT TASK Figure Add reference in C# Script Task Figure Add reference in VB.NET Script Task 96

307 CHAPTER 4 SCRIPT TASK Note Adding a reference to a third-party assembly works the same as adding a reference to your own assembly. The following is some example code for using the assembly in a Script Task. To test this code, you first need to create an SSIS string variable named EmailAddress and fill it with a valid e-mail address. Then add a new Script Task and select the string variable as a ReadOnlyVariable in the Script Task Editor. Now go to the VSTA environment and add the assembly mymethodsforssis as a reference. To create more compact code, you can add an extra using/import. Note that the import of the VB assembly in the VB.NET code is slightly different.

#region customnamespaces
using mymethodsforssis;
#endregion

And here is the VB.NET code:

#Region "customnamespaces"
Imports mymethodsforssis.mymethodsforssis
#End Region

Now comes the actual code in the Main method. The script uses a variable as the parameter for the method that checks the e-mail format with a regular expression. If the format is correct, the Script Task succeeds. If it's not correct, the Script Task fails, but it also fires an error event explaining why it failed. Firing events for logging purposes is explained in detail in the last part of this chapter.

public void Main()
{
    // Get the e-mail address from the variable
    string emailAddress = Dts.Variables["User::EmailAddress"].Value.ToString();

    // Let the Script Task fail if the format of the e-mail address is incorrect
    if (Methods.IsCorrect(emailAddress))
    {
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    else
    {
        // Show why the Script Task is failing by firing an error event
        Dts.Events.FireError(0, "E-mail check", "Incorrect e-mail format: " + emailAddress, string.Empty, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
}
97

308 CHAPTER 4 SCRIPT TASK And this is the VB.NET code:

Public Sub Main()
    ' Get the e-mail address from the variable
    Dim emailAddress As String = Dts.Variables("User::EmailAddress").Value.ToString()

    ' Let the Script Task fail if the format of the e-mail address is incorrect
    If (IsCorrect(emailAddress)) Then
        Dts.TaskResult = ScriptResults.Success
    Else
        Dts.Events.FireError(0, "E-mail check", "Incorrect e-mail format: " & emailAddress, String.Empty, 0)
        Dts.TaskResult = ScriptResults.Failure
    End If
End Sub

Connection Managers Integration Services uses connection managers to provide access to various data sources, such as flat files and databases, but also to web servers, FTP servers, or a message queue. You can use these connection managers in a Script Task to avoid hard-coded paths and connection strings. When used in the correct way, you can even let them participate in MSDTC transactions, but only for connection managers that support it. File Connection Managers Let's first cover the connection managers for files and folders, such as File, Flat File, and Excel. There are two different methods for using connection managers in a Script Task. The first one is just getting properties from the connection manager, such as the ConnectionString property of a flat file connection manager or the ExcelFilePath property of an Excel connection manager. In the first code example, you need a File or Flat File Connection Manager to an existing text file. The content doesn't matter as long as the connection manager is called myFlatFile and the file contains some data. The script gets the file path and checks whether it contains data by checking the file size. It uses FileInfo from the System.IO namespace, which you will add as a using/import.

#region customnamespaces
using System.IO;
#endregion

And this is the VB.NET code:

#Region "customnamespaces"
Imports System.IO
#End Region
98

309 CHAPTER 4 SCRIPT TASK Now the actual code from the Main method : public void Main() { // Declare string variable and fill it with the connection string of the Flat File. string filepath = Dts.Connections["myFlatFile"].ConnectionString; // Declare file info object and fill it with the filepath from the string variable. // Then fill a bigint variable with the actual filesize for the next if-statement FileInfo fi = new FileInfo(filePath); Int64 length = fi.length; // Let the script fail when the filesize is 0 bytes if (length.equals(0)) { Dts.TaskResult = (int)scriptresults.failure; } else { Dts.TaskResult = (int)scriptresults.success; } } And here is the VB.NET code: Public Sub Main() ' Declare string variable and fill it with the connection string of the Flat File. Dim filepath As String = Dts.Connections("myFlatFile").ConnectionString ' Declare file info object and fill it with the filepath from the string variable. ' Then fill a bigint variable with the actual filesize for the next if-statement Dim fi As FileInfo = New FileInfo(filePath) Dim length As Int64 = fi.length ' Let the script fail when the filesize is 0 bytes If (length.equals(0)) Then Dts.TaskResult = ScriptResults.Failure Else Dts.TaskResult = ScriptResults.Success End If End Sub When you are using an Excel file, you cannot use the whole ConnectionString property because it contains more than just the file path. The next example extracts the file path from a connection string of an Excel connection manager named myexcel by using a substring method. It is stored in a string variable that you can use in your actual code. 99

// Declare string variable and fill it with the connection string of the Excel file.
string filePath = Dts.Connections["myExcel"].ConnectionString;

// For an Excel connection you only need a part of the connection string:
// ======================================================================
// Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\MyExcelFile.xls;
// Extended Properties="Excel 8.0;HDR=YES";
// Provider=Microsoft.ACE.OLEDB.12.0;Data Source=D:\MyExcelFile.xlsx;
// Extended Properties="Excel 12.0 XML;HDR=YES";
// ======================================================================
// You only want the part after 'Source=' until the next semicolon (;)
filePath = filePath.Substring(filePath.IndexOf("Source=") + 6);
filePath = filePath.Substring(1, filePath.IndexOf(";") - 1);

And here is the VB.NET code:

' Declare string variable and fill it with the connection string of the Excel file.
Dim filePath As String = Dts.Connections("myExcel").ConnectionString

' For an Excel connection you only need a part of the connection string:
' ======================================================================
' Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\MyExcelFile.xls;
' Extended Properties="Excel 8.0;HDR=YES";
' Provider=Microsoft.ACE.OLEDB.12.0;Data Source=D:\MyExcelFile.xlsx;
' Extended Properties="Excel 12.0 XML;HDR=YES";
' ======================================================================
' You only want the part after 'Source=' until the next semicolon (;)
filePath = filePath.Substring(filePath.IndexOf("Source=") + 6)
filePath = filePath.Substring(1, filePath.IndexOf(";") - 1)

Another trick is to first create a connection manager variable and fill it with a reference to the Excel Connection Manager. You can then read the ExcelFilePath property to get the file path instead of the complete connection string. It stores the file path in a string variable that you can use in your actual code.

// Get the Excel connection manager to read its properties
ConnectionManager myExcelConn = Dts.Connections["myExcel"];

// Declare string variable and fill it with the ExcelFilePath property of the Excel connection manager.
string filePath = myExcelConn.Properties["ExcelFilePath"].GetValue(myExcelConn).ToString();

And this is the VB.NET code:

' Get the Excel connection manager to read its properties
Dim myExcelConn As ConnectionManager = Dts.Connections("myExcel")

' Declare string variable and fill it with the ExcelFilePath property of the Excel connection manager.
Dim filePath As String = myExcelConn.Properties("ExcelFilePath").GetValue(myExcelConn).ToString()
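If you prefer not to depend on character positions, the same file path can also be pulled out with the .NET DbConnectionStringBuilder class, which parses any key/value-style connection string. This is an alternative technique, not the approach used in this chapter; it reuses the myExcel connection manager from the examples above.

// Alternative: parse the connection string instead of using Substring/IndexOf.
// Requires: using System.Data.Common;
DbConnectionStringBuilder builder = new DbConnectionStringBuilder();
builder.ConnectionString = Dts.Connections["myExcel"].ConnectionString;

// "Data Source" holds the Excel file path in both the Jet and ACE connection strings.
string filePath = builder.ContainsKey("Data Source")
    ? builder["Data Source"].ToString()
    : string.Empty;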

The big downside of this first method is that it doesn't validate expressions on the connection manager. If you are using it within a Foreach Loop Container, that could cause some unexpected results. By using the AcquireConnection method, you can overcome this, because it forces SSIS to re-evaluate any expressions on the connection manager. Here are two examples that fill the same string variable with the file path of the flat file:

// Declare string variable and fill it with the connection string of the Flat File.
string filePath = Dts.Connections["myFlatFile"].AcquireConnection(Dts.Transaction).ToString();

// Or a slightly more complicated version that applies to all connection manager types
// Declare object variable to reference a connection manager
object rawConnection = Dts.Connections["myFlatFile"].AcquireConnection(Dts.Transaction);

// Declare string variable and fill it with the connection string of the Flat File.
string filePath = rawConnection.ToString();

// And optionally release the connection manager manually to let SSIS know you're done
Dts.Connections["myFlatFile"].ReleaseConnection(rawConnection);

And here is the VB.NET code:

' Declare string variable and fill it with the connection string of the Flat File.
Dim filePath As String = Dts.Connections("myFlatFile").AcquireConnection(Dts.Transaction).ToString()

' Or a slightly more complicated version that applies to all connection manager types
' Declare object variable to store a connection manager
Dim rawConnection As Object = Dts.Connections("myFlatFile").AcquireConnection(Dts.Transaction)

' Declare string variable and fill it with the connection string of the Flat File.
Dim filePath As String = rawConnection.ToString()

' And optionally release the connection manager manually to let SSIS know you're done
Dts.Connections("myFlatFile").ReleaseConnection(rawConnection)

Note  If you don't want to use MSDTC transactions, you can replace the AcquireConnection parameter Dts.Transaction with null.

The same can be done for OLE DB and ADO.NET Connection Managers, but beware of using database connection managers in a Script Task. Don't use them unnecessarily if you can also use an Execute SQL Task, such as for executing a query or stored procedure. The preferred connection manager for connecting to databases in a Script Task is the ADO.NET Connection Manager. OLE DB is also possible, but it is a lot more difficult and it has some limitations, such as not being able to pass the current transaction, and it doesn't honor the Retain Same Connection property. This is because the Script Task contains managed code, which interacts better with other managed code, while the OLE DB provider is unmanaged code. The following code is a very simplified example of using a database connection manager in a Script Task. It could have been

accomplished more easily with an Execute SQL Task, but more sophisticated examples will follow later in this book. For this example, you add an SSIS string variable named sqlServerVersion in the ReadWriteVariables property. It will be filled with the SQL Server version information by the Script Task. Also make sure that you have an ADO.NET Connection Manager in your package named myADONETConnection. This example uses the SqlClient namespace, which you will add as a using/import to shorten the code.

#region customnamespaces
using System.Data.SqlClient;
#endregion

And here is the VB.NET code:

#Region "customnamespaces"
Imports System.Data.SqlClient
#End Region

And now the actual code in the Main method:

public void Main()
{
    // Declare a SqlClient connection and assign your ADO.NET Connection Manager to this connection.
    SqlConnection myADONETConnection =
        (SqlConnection)Dts.Connections["myADONETConnection"].AcquireConnection(Dts.Transaction);

    // Create string variable with the query
    string myQueryText = "SELECT @@VERSION as SqlVersion";

    // Create a SqlClient command to store a query in it. In this case a simple query to get the SQL Server version
    SqlCommand myQuery = new SqlCommand(myQueryText, myADONETConnection);

    // Execute the query and store the result in a SqlClient datareader object
    SqlDataReader myQueryResult = myQuery.ExecuteReader();

    // Go to the first record of the datareader
    myQueryResult.Read();

    // Store the value of the 'SqlVersion' column in an SSIS string variable
    Dts.Variables["User::sqlServerVersion"].Value = myQueryResult["SqlVersion"].ToString();

    // Close Script Task with success
    Dts.TaskResult = (int)ScriptResults.Success;
}

And this is the VB.NET code:

Public Sub Main()
    ' Declare a SqlClient connection and assign your ADO.NET Connection Manager to this connection
    Dim myADONETConnection As SqlConnection = DirectCast(Dts.Connections("myADONETConnection") _
        .AcquireConnection(Dts.Transaction), SqlConnection)

    ' Create string variable with the query
    Dim myQueryText As String = "SELECT @@VERSION as SqlVersion"

    ' Create a SqlClient command to store a query in it. In this case a simple query to get the SQL Server version
    Dim myQuery As SqlCommand = New SqlCommand(myQueryText, myADONETConnection)

    ' Execute the query and store the result in a SqlClient datareader object
    Dim myQueryResult As SqlDataReader = myQuery.ExecuteReader()

    ' Go to the first record of the datareader
    myQueryResult.Read()

    ' Store the value of the 'SqlVersion' column in an SSIS string variable
    Dts.Variables("User::sqlServerVersion").Value = myQueryResult("SqlVersion").ToString()

    ' Close Script Task with success
    Dts.TaskResult = ScriptResults.Success
End Sub

As I said earlier, getting the OLE DB version to work is a lot more difficult. The AcquireConnection method cannot be used for OLE DB connection managers because it returns a native COM object. In this example, you need an OLE DB Connection Manager named myOLEDBConnection, and the same sqlServerVersion string variable as in the previous example in the ReadWriteVariables property. The work-around is casting the OLE DB connection manager's InnerObject to the IDTSConnectionManagerDatabaseParameters100 interface (SSIS 2005 uses 90 instead of 100). To do that, you first have to add a reference to Microsoft.SqlServer.DTSRuntimeWrap.dll in the VSTA project. That assembly can be found in the GAC 64-bit folder. The folder path should look something like this: C:\Windows\Microsoft.NET\assembly\GAC_64\Microsoft.SqlServer.DTSRuntimeWrap\v4.0_ dcd8080cc91\. For C#, go to the Solution Explorer of the VSTA project and right-click References. Choose Add Reference to open the Add Reference window (see Figure 4-15). You can search for the assembly in newer versions of Visual Studio or browse to it; in the Browse tab, you can browse to the correct folder. Select the assembly and click OK to add the new reference. After adding the reference, it is necessary to click the Save All button if you are using Visual Studio 2010 or lower!

Figure 4-15. Add reference to DTSRuntimeWrap.dll in C# project

For VB.NET, go to the Solution Explorer of the VSTA project and right-click the project. Choose Add Reference to open the Add Reference window (see Figure 4-16). In the Browse tab, you can browse to the correct folder. Select the assembly and click OK to add the new reference. After adding the reference, it is necessary to click the Save All button!

Figure 4-16. Add reference to DTSRuntimeWrap.dll in VB project

Add two extra usings/imports for shorter code: one for OLE DB and one for the newly added reference.

#region customnamespaces
using System.Data.OleDb;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
#endregion

And here is the VB.NET code:

#Region "customnamespaces"
Imports System.Data.OleDb
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
#End Region

Next is the actual code for the Main method. The major difference is at the beginning of the code; the rest looks very similar to the ADO.NET example.

public void Main()
{
    // Store the connection in a Connection Manager object
    ConnectionManager myConnectionManager = Dts.Connections["myOLEDBConnection"];

    // Cast the Connection Manager's InnerObject to the
    // IDTSConnectionManagerDatabaseParameters100
    // interface (SSIS 2005 uses 90 instead of 100).
    IDTSConnectionManagerDatabaseParameters100 cmParams;
    cmParams = myConnectionManager.InnerObject as IDTSConnectionManagerDatabaseParameters100;

    // Get the connection from the IDTSConnectionManagerDatabaseParameters100 object
    OleDbConnection myConnection = cmParams.GetConnectionForSchema() as OleDbConnection;

    // Create string variable with the query
    string myQueryText = "SELECT @@VERSION as SqlVersion";

    // Create a new OleDbCommand object to store a query in it.
    OleDbCommand myQuery = new OleDbCommand(myQueryText, myConnection);

    // Execute the query and store the result in an OleDb datareader object
    OleDbDataReader myQueryResult = myQuery.ExecuteReader();

    // Go to the first record of the datareader
    myQueryResult.Read();

    // Store the value of the 'SqlVersion' column in an SSIS string variable
    Dts.Variables["User::sqlServerVersion"].Value = myQueryResult["SqlVersion"].ToString();

    // Close Script Task with success
    Dts.TaskResult = (int)ScriptResults.Success;
}

And here is the VB.NET code:

Public Sub Main()
    ' Store the connection in a Connection Manager object
    Dim myConnectionManager As ConnectionManager = Dts.Connections("myOLEDBConnection")

    ' Cast the Connection Manager's InnerObject to the IDTSConnectionManagerDatabaseParameters100
    ' interface (SSIS 2005 uses 90 instead of 100).
    Dim cmParams As IDTSConnectionManagerDatabaseParameters100
    cmParams = TryCast(myConnectionManager.InnerObject, IDTSConnectionManagerDatabaseParameters100)

    ' Get the connection from the IDTSConnectionManagerDatabaseParameters100 object
    Dim myConnection As OleDbConnection = DirectCast(cmParams.GetConnectionForSchema(), OleDbConnection)

    ' Create a new OleDbCommand object to store a query in it.
    Dim myQuery As OleDbCommand = New OleDbCommand("SELECT @@VERSION as SqlVersion", myConnection)

    ' Execute the query and store the result in an OleDb datareader object
    Dim myQueryResult As OleDbDataReader = myQuery.ExecuteReader()

    ' Go to the first record of the datareader
    myQueryResult.Read()

    ' Store the value of the 'SqlVersion' column in an SSIS string variable
    Dts.Variables("User::sqlServerVersion").Value = myQueryResult("SqlVersion").ToString()

    ' Close Script Task with success
    Dts.TaskResult = ScriptResults.Success
End Sub

Note  If you don't want to use this complicated method, you could always just use the ConnectionString property of the OLE DB connection manager and create a new connection.

Logging Events

When you want to log messages from a Script Task into the SSIS log, you have to raise events in your code. Whether they show up in the log depends on the chosen log level (project deployment) or on the chosen logging configuration (package deployment). The Script Task can raise events by calling the event-firing methods on the Events property of the Dts object. In this example, you will check whether a file from a File Connection Manager exists and contains data. Create a File or Flat File Connection Manager named myFile that points to a random text file; the content doesn't matter. Because you will try to get some file properties, you need the System.IO namespace, which you will add to the usings/imports.

#region customnamespaces
using System.IO;
#endregion

317 CHAPTER 4 SCRIPT TASK And this is the VB.NET code: #region customnamespaces Imports System.IO #End Region Next is the actual code for the Main method. First get the file path from the connection manager and then check if the file exists and contains data. public void Main() { // Get filepath from File Connection Manager and store it in a string variable string filepath = Dts.Connections["myFile"].AcquireConnection(Dts.Transaction).ToString(); // Create File Info object with filepath variable FileInfo fi = new FileInfo(filePath); // Check if file exists if (fi.exists) { // File exists, but check size if (fi.length > 0) { // Boolean variable indicating if the same event can fire // multiple times bool fireagain = true; // File exists and contains data. Fire Information event Dts.Events.FireInformation(0, "Script Task File Check", "File exists and contains data.", string.empty, 0, ref fireagain); } else { // File exists, but contains no data. Fire Warning event Dts.Events.FireWarning(0, "Script Task File Check", "File exists, but contains no data.", string.empty, 0); } // Succeed Script Task Dts.TaskResult = (int)scriptresults.success; } else { // File doesn't exists. Fire Error event and fail Script Task Dts.Events.FireError(0, "Script Task File Check", "File doesn't exists.", string.empty, 0); // Fail Script Task Dts.TaskResult = (int)scriptresults.failure; } } 107

And here is the VB.NET code:

Public Sub Main()
    ' Get filepath from File Connection Manager and store it in a string variable
    Dim filePath As String = Dts.Connections("myFile").AcquireConnection(Dts.Transaction).ToString()

    ' Create File Info object with filepath variable
    Dim fi As FileInfo = New FileInfo(filePath)

    ' Check if file exists
    If (fi.Exists) Then
        ' File exists, but check size
        If (fi.Length > 0) Then
            ' Boolean variable indicating if the same event can fire multiple times
            Dim fireAgain As Boolean = True

            ' File exists and contains data. Fire Information event
            Dts.Events.FireInformation(0, "Script Task File Check", "File exists and contains data.", _
                                       String.Empty, 0, fireAgain)
        Else
            ' File exists, but contains no data. Fire Warning event
            Dts.Events.FireWarning(0, "Script Task File Check", "File exists, but contains no data.", _
                                   String.Empty, 0)
        End If

        ' Succeed Script Task
        Dts.TaskResult = ScriptResults.Success
    Else
        ' File doesn't exist. Fire Error event and fail Script Task
        Dts.Events.FireError(0, "Script Task File Check", "File doesn't exist.", String.Empty, 0)

        ' Fail Script Task
        Dts.TaskResult = ScriptResults.Failure
    End If
End Sub

Now you can test the script by emptying or deleting the file that is referenced in the connection manager, then running the package and watching the Execution Results tab. Besides the common FireInformation, FireWarning, and FireError, there are more event-firing methods available, but they are used less often (a short FireProgress sketch follows below):

FireBreakpointHit: Raises an event indicating that a breakpoint has been hit in the Script Task

FireCustomEvent: Raises a custom event

FireProgress: Raises an event that shows the progress of the Script Task

FireQueryCancel: Raises an event that indicates whether the Script Task should shut down prematurely
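For instance, FireProgress can report how far a long-running script has come. The following is a minimal sketch; the description text, the percentage steps, and the subcomponent name are made up for illustration.

public void Main()
{
    bool fireAgain = true;

    for (int percent = 0; percent <= 100; percent += 25)
    {
        // Report progress; the two counter arguments can stay 0 when they are not used.
        Dts.Events.FireProgress("Processed " + percent + "% of the work", percent, 0, 0,
                                "Script Task Progress Demo", ref fireAgain);

        // ... do a chunk of the actual work here ...
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}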

Note  Because firing events is expensive, you shouldn't use them excessively. Some event-firing methods have a Boolean parameter, fireAgain, to suppress firing the same event multiple times.

FireCustomEvents

Once in a while you end up in a situation where SSIS lacks some of the enterprise features that a complete ETL solution offers, such as when you want to implement a custom metadata-driven logging solution. Let's say that you want to centrally configure the logging setup for all the running packages. What happens with the logging logic has to be transparent to all the child packages. The only thing a package has to do is notify the master package by firing a custom event. A good way to handle that on the master package is to use the event-handling functionality. There are several types of events at your disposal (see Figure 4-17).

Figure 4-17. Events at our disposal

A good candidate for this example is OnVariableValueChanged, because:

It has some built-in variables you can use.

It is not fired automatically, even when a variable changes.

Basically, you want to be able to catch a custom event fired in the child packages by using an event handler of the parent package. Let's start by building the child package.

Child Package

Create a new SSIS package called Child and add a Script Task called SCR_FireCustomEvent to it. The code for the event is simple: it does nothing more than take some of the variables available to the package and fire them in an event.

The Script

In the Main method of the class, add the following:

// fire once or multiple times
bool fireAgain = false;

// the values that we want to surface in the custom event
object[] parameters = new object[] { "This is the value I want to log", "Second value to log",
    DateTime.Now.ToLongDateString(), "More value to log" };

// fire the right event type: OnVariableValueChanged
Dts.Events.FireCustomEvent("OnVariableValueChanged", "", ref parameters, "", ref fireAgain);

Dts.TaskResult = (int)ScriptResults.Success;

And in VB.NET code:

' fire once or multiple times
Dim fireAgain As Boolean = False

' the values that we want to surface in the custom event
Dim parameters As Object() = New Object() {"This is the value I want to log", _
    "Second value to log", DateTime.Now.ToLongDateString(), "More value to log"}

' fire the right event type: OnVariableValueChanged
Dts.Events.FireCustomEvent("OnVariableValueChanged", "", parameters, "", fireAgain)

Dts.TaskResult = CInt(ScriptResults.Success)

As you can see, it is quite simple. You call the SSIS method FireCustomEvent with:

The name of the event to be fired: OnVariableValueChanged

The event text, which you don't need, so it is ""

An object array with some string parameters (these can also be other types)

The name of a subcomponent (not needed)

An instruction about firing the event again, false in this case

This is all that you need for the Script Task. Now let's create a second package called Parent.

The Parent Package

This package invokes the child package, so for that you need an Execute Package Task that points at the child package, as shown in Figure 4-18.

Figure 4-18. Execute Package Task Editor

On the Event Handlers tab of the package surface, choose the OnVariableValueChanged event in the drop-down list. This opens the designer surface for this specific event, and you can place a Script Task on the surface of the event handler (see Figure 4-19).

Figure 4-19. Configuring the event handler

Here is the callout for Figure 4-19:

1. The Executable is set to the top-level element (the package).

2. The event handler is OnVariableValueChanged.

3. The Script Task that you use for capturing the event.

The Script Task

You added a Script Task on the surface of the event handler, which you called SCR_CaptureEvent. Inside the script, you need to add some of the available system variables as read-only variables:

System::TaskName
System::SourceName
System::VariableDescription
System::VariableID
System::VariableName
System::VariableValue

The last four variables are the ones that you populated with the object array from the child package.

The Code

You open the script by clicking the Edit Script... button on the Script page of the Script Task Editor. Next, you add the following lines of code to the Main method:

// Building a string that gets the values of the variables
// from the event fired in the child package
string result = "VariableName: " + Dts.Variables["System::VariableName"].Value.ToString() + Environment.NewLine;
result += "VariableID: " + Dts.Variables["System::VariableID"].Value.ToString() + Environment.NewLine;
result += "VariableDescription: " + Dts.Variables["System::VariableDescription"].Value.ToString() + Environment.NewLine;
result += "VariableValue: " + Dts.Variables["System::VariableValue"].Value.ToString() + Environment.NewLine;
result += "TaskName: " + Dts.Variables["System::TaskName"].Value.ToString() + Environment.NewLine;
result += "SourceName: " + Dts.Variables["System::SourceName"].Value.ToString();

// Showing the string value as a message box
MessageBox.Show(result);

And in VB.NET:

' Building a string that gets the values of the variables
' from the event fired in the child package
Dim result As String = "VariableName: " + Dts.Variables("System::VariableName").Value.ToString() + Environment.NewLine

result += "VariableID: " + Dts.Variables("System::VariableID").Value.ToString() + Environment.NewLine
result += "VariableDescription: " + Dts.Variables("System::VariableDescription").Value.ToString() + Environment.NewLine
result += "VariableValue: " + Dts.Variables("System::VariableValue").Value.ToString() + Environment.NewLine
result += "TaskName: " + Dts.Variables("System::TaskName").Value.ToString() + Environment.NewLine
result += "SourceName: " + Dts.Variables("System::SourceName").Value.ToString()

' Showing the string value as a message box
MessageBox.Show(result)

Running the parent package invokes the child package, which in turn fires a custom event that is captured by the event handler of the parent package and shows the results in Figure 4-20.

Figure 4-20. The results of running the package

In this example, you kept it really simple, but it wouldn't be a problem to create a metadata framework to control the logging or to implement some custom auditing using FireCustomEvent.

Summary

In this chapter you learned the basic functionality of the Script Task, such as the use of variables and connection managers to avoid hard-coded values in your scripts, and logging useful information by firing events. You also saw how you can reference custom or third-party assemblies. In the next few chapters you will see solutions for common problems. With the knowledge from this Script Task chapter, you can now customize those examples by adding logging or by using a different connection manager.

325 Pro PowerShell for Database Developers Bryan Cafferky

326 Pro PowerShell for Database Developers Copyright 2015 by Bryan Cafferky This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Jason Horner and Mike Robbins Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: April Rondeau Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: PowerShell Basics
Chapter 2: The PowerShell Language
Chapter 3: Advanced Programming
Chapter 4: Writing Scripts
Chapter 5: Writing Reusable Code
Chapter 6: Extending PowerShell
Chapter 7: Accessing SQL Server
Chapter 8: Customizing the PowerShell Environment
Chapter 9: Augmenting ETL Processes
Chapter 10: Configurations, Best Practices, and Application Deployment
Chapter 11: PowerShell Versus SSIS
Chapter 12: PowerShell Jobs
Chapter 13: PowerShell Workflows
Index

CHAPTER 7

Accessing SQL Server

We have seen some examples of executing queries against SQL Server. There are many ways to access a database, and it can hinder development to custom code each variation, as there are a number of questions to be resolved. What type of access method will be used: ADO.Net, OleDB, ODBC, or the native SQL Server driver? What type of authentication will be used: Windows Integrated Security, logon ID and password, or a credential object? What database platform: on-premise SQL Server, SQL Azure, or a third-party database like MySQL or PostgreSQL? Why not write some functions to make this task easier? In this chapter, we will be carefully walking through a custom module that provides easy access to databases using ADO.Net.

About the Module

The module, umd_database, centers around a function that returns a custom database-connection object that is built on PowerShell's object template, PSObject. By setting the object's properties, we tell it how to execute SQL statements against the database. Using a method called RunSQL, we will run various queries. To run stored procedures that require output parameters, we can use the RunStoredProcedure method. Since an object is returned by the New-UdfConnection function, we can create as many database-connection object instances as we want. Each instance can be set for specific database-access properties. For example, one instance might point to a SQL Server database, another might point to an Azure SQL database, and yet another might point to a MySQL database. Once the connection attributes are set, we can run as many queries as we want by supplying the SQL. Beyond providing a useful module, the code will demonstrate many techniques that can be applied elsewhere. Using what we learned in the previous chapter, we copy the module umd_database to a folder named umd_database in a directory that is in $env:PSModulePath, such as \WindowsPowerShell\Modules\ under the current user's Documents folder. To run the examples that use SQL Server, we will need the Microsoft training database AdventureWorks installed. You can get it at

Using umd_database

Before we look at the module's code, let's take a look at how we can use the module. The idea is to review the module's interface to make it easier to understand how the code behind it works. Let's assume we want to run some queries against the SQL Server AdventureWorks database. We'll use the local instance and integrated security, so the connection is pretty simple. Since code using ADO.Net is so similar to code using ODBC, we will only review the ADO.Net-specific code in this chapter. However, you can also see the ODBC-specific coding in the module. Listing 7-1 shows the code to create and define the connection object.

Note: You will need to change the property values to ones appropriate for your environment.

Listing 7-1. Using the umd_database module

Import-Module umd_database

[psobject] $myssint = New-Object psobject
New-UdfConnection ([ref]$myssint)

$myssint.ConnectionType = 'ADO'
$myssint.DatabaseType = 'SqlServer'
$myssint.Server = '(local)'
$myssint.DatabaseName = 'AdventureWorks'
$myssint.UseCredential = 'N'
$myssint.SetAuthenticationType('Integrated')
$myssint.AuthenticationType

In Listing 7-1, the first thing we do is import the umd_database module to load its functions. Then we create an instance of a PSObject named $myssint. Next, a statement calls New-UdfConnection to attach a number of properties and methods to the object we created. Note: The parameter $myssint is being passed as a [REF] type. As we saw previously, a REF is a reference to an object; i.e., we are passing our object to the function, where it is modified. The lines that follow the function call set the properties of the object. First, we set the ConnectionType to ADO, meaning ADO.Net will be used. The DatabaseType is SQL Server. The server instance is '(local)'. Since we are using integrated security, we will not be using a credential object, which is a secure way to connect when a user ID and password are being passed; we just set UseCredential = 'N'. Finally, we set the AuthenticationType to 'Integrated', i.e., integrated security. Notice that the authentication type is being set by a method rather than by just assigning it a value. This is so the value can be validated; an invalid value will be rejected. With these properties set, we can generate the connection string using the BuildConnectionString method, as shown here:

$myssint.BuildConnectionString()

If there is a problem creating the connection string, we will get the message 'Error - Connection string not found!'. Technically, we don't have to call the BuildConnectionString method, because it is called by the method that runs the SQL statement, but calling it confirms that a valid connection string was created. We can view the object's ConnectionString property to verify it looks correct using the statement here:

$myssint.ConnectionString

We should see the following output:

Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorks

Let's try running a query using our new connection object:

$myssint.RunSQL("SELECT TOP 20 * FROM [HumanResources].[EmployeeDepartmentHistory]", $true)

We use the object's RunSQL method to execute the query. The first parameter is the query. The second parameter indicates whether the query is a select statement, i.e., returns results. Queries that return results need code to retrieve them. When we run the previous statement, we should see query results displayed to the console.
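Because RunSQL hands the rows back as ordinary objects (DataRow objects coming from the underlying DataTable), they can be captured, piped, and inspected like any other PowerShell output. The following is a small sketch; the column name used here assumes the AdventureWorks table from the query above.

$history = $myssint.RunSQL("SELECT TOP 20 * FROM [HumanResources].[EmployeeDepartmentHistory]", $true)

# Format the rows, or pull individual columns out as properties
$history | Format-Table -AutoSize
$history | ForEach-Object { $_.BusinessEntityID }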

We can look at all the object's properties and their current values by entering the object name on a line, as follows:

$myssint

We should get the following output:

ConnectionType      : ADO
DatabaseType        : SqlServer
Server              : (local)
DatabaseName        : AdventureWorks
UseCredential       : N
UserID              :
Password            :
DSNName             : NotAssigned
Driver              : NotAssigned
ConnectionString    : Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorks
AuthenticationType  : Integrated
AuthenticationTypes : {Integrated, Logon, Credential, StoredCredential...}

Being able to display the object properties is very useful, especially for the connection string and authentication types. If there is a problem running the query, we can check the connection string. If we don't know what the valid values are for authentication types, the AuthenticationTypes property shows them. Notice that UserID and Password are blank, since they don't apply to integrated security. Imagine we want to have another connection, but this time using a table in the SQL Azure cloud. Azure does not support integrated security or the credential object, so we'll use a login ID and password. We can code this as follows:

[psobject] $azure1 = New-Object psobject
New-UdfConnection ([ref]$azure1)

$azure1 | Get-Member

$azure1.ConnectionType = 'ADO'
$azure1.DatabaseType = 'Azure'
$azure1.Server = 'fxxxx.database.windows.net'
$azure1.DatabaseName = 'Development'
$azure1.UseCredential = 'N'
$azure1.UserID = 'myuserid'
$azure1.Password = "mypassword"
$azure1.SetAuthenticationType('Logon')
$azure1.BuildConnectionString()

With the connection properties set, we can run a query as follows:

$azure1.RunSQL("SELECT TOP 20 * FROM [HumanResources].[EmployeeDepartmentHistory]", $true)

Assuming you change the properties to fit your Azure database, you should see results displayed in the console.
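The AuthenticationTypes list above also includes 'Credential'. That path expects a SecureString password, because the ADO code later in this chapter calls MakeReadOnly on it before building a SqlCredential. A minimal sketch of such a connection follows; whether UseCredential must be set to 'Y' for this path is an assumption, since the examples here only show 'N'.

[psobject] $mysscred = New-Object psobject
New-UdfConnection ([ref]$mysscred)

$mysscred.ConnectionType = 'ADO'
$mysscred.DatabaseType = 'SqlServer'
$mysscred.Server = '(local)'
$mysscred.DatabaseName = 'AdventureWorks'
$mysscred.UseCredential = 'Y'                              # assumption: flag the credential path
$mysscred.UserID = 'myuserid'
$mysscred.Password = Read-Host 'Password' -AsSecureString  # SecureString, not plain text
$mysscred.SetAuthenticationType('Credential')
$mysscred.BuildConnectionString()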

Now that we have some connection objects, we can execute more queries, as follows:

$myssint.RunSQL("SELECT TOP 20 * FROM [HumanResources].[Department]", $true)
$azure1.RunSQL("SELECT TOP 20 * FROM [Person].[Person]", $true)

Each statement will return results. The connection is closed after getting the results, but the properties are retained so that we can keep running queries. The result data is not retained, so if we want to capture it, we need to assign it to a variable, as follows:

$myresults = $myssint.RunSQL("SELECT TOP 20 * FROM [HumanResources].[Department]", $true)

What if we want to run a query that does not return results? That's where the second parameter to the RunSQL method comes in. By setting it to $false, we are telling the method not to return results. For example, the following lines will execute an update statement using the first connection object we created:

$esql = "UPDATE [AdventureWorks].[HumanResources].[CompanyUnit] SET UnitName = 'NewSales2' WHERE UnitID = 3;"
$myssint.RunSQL($esql, $false)

Here we are assigning the SQL statement to $esql, which is passed to the RunSQL method.

How umd_database Works

The umd_database module consists of a set of functions and is designed so that it can easily be extended. As we can see from the examples that use the module, the core function, New-UdfConnection, is the one that returns the connection object to us. Because this function is so long, I'm not going to list all of it at once. However, you can see the entire function in the code accompanying this book, in the umd_database module script. Let's start with the following function header:

function New-UdfConnection {
    [CmdletBinding()]
    param (
        [ref]$p_connection
    )

The first thing to notice in the function header is that the parameter is of type [REF], which means reference. The caller passes an object reference to the function. This gives the function the ability to make changes to the object passed in, which is how the connection object is customized with properties and methods. Before calling the New-UdfConnection function, the calling code needs to create an instance of a PSObject and then pass a reference to it, as shown in the example code here:

[psobject] $myssint = New-Object psobject
New-UdfConnection ([ref]$myssint)

The function New-UdfConnection attaches properties and methods to the object passed in, as shown here:

# Connection Type...
$p_connection.Value | Add-Member -MemberType noteproperty `
                                 -Name ConnectionType `
                                 -Value $p_connectiontype

To change the object passed to the function, we need to use the object's Value property, $p_connection.Value. The Add-Member cmdlet adds properties and methods to the object. MemberType specifies what type of property or method we want to attach. In the previous code, we add a noteproperty, which is just a place to hold static data. The Name parameter assigns an externally visible name to the property. The Value parameter assigns an initial value to the property, and the Passthru parameter tells PowerShell to pass the property through the pipe. This same type of coding is used to create the following properties: DatabaseType, Server, DatabaseName, UseCredential, UserID, Password, DSNName, and Driver. Since the coding is identical except for Name and Value, I'm not going to list them all here. AuthenticationType is handled a bit differently. To start, the property is created just like the other properties are. The code is as follows:

# AuthenticationType. Value must be in AuthenticationTypes.
$p_connection.Value | Add-Member -MemberType noteproperty `
                                 -Name AuthenticationType `
                                 -Value 'NotAssigned' `
                                 -Passthru

However, AuthenticationType will not be set by directly assigning a value to it. Instead, a method named SetAuthenticationType will be called. This is so we can validate the AuthenticationType value before allowing it to be assigned. So, when we call SetAuthenticationType on the object, as in "$myssint.SetAuthenticationType('Integrated')", an invalid value will be rejected. Let's look at the code that stores the valid list of values as an object property:

$p_authentiationtypes = ('Integrated','Logon','Credential', 'StoredCredential', 'DSN', 'DSNLess')

$p_connection.Value | Add-Member -MemberType noteproperty `
                                 -Name AuthenticationTypes `
                                 -Value $p_authentiationtypes

In this code, we are creating the array $p_authentiationtypes and loading it with values. Then, we use the Add-Member cmdlet to attach the array as a property named AuthenticationTypes to the object passed in to the function. Attaching the list to the object as a property enables the user to see what the valid values are. Now, let's look at the code that sets the AuthenticationType:

$bauth = @'
    param([string]$p_authentication)

    if ($this.AuthenticationTypes -contains $p_authentication)
      { $this.AuthenticationType = $p_authentication }
    Else
      { Throw "Error - Invalid Authentication Type, valid values are " + $this.AuthenticationTypes }

    RETURN $this.AuthenticationType
'@

333 CHAPTER 7 ACCESSING SQL SERVER The here string is the script to be executed when the SetAuthenticationType is called. The only parameter is the authentication type to assign. To validate that the desired authentication type is valid, the code uses the ob ject s AuthenticationTypes property. The special keyword $this refers to the current object i.e., the object we are extending. The line if ($this.authenticationtypes -contains $p_authentication) is checking whether the value passed in is contained in the object s AuthenticationTypes hash table. If it is, the object $this.authenticationtype is assigned the parameter passed in. If the value is not in the list, an error is thrown with a message telling the caller that the authentication type is not valid. The code that follows attaches the script to the object as a script method named SetAuthenticationType : $sauth = [scriptblock]::create($bauth) $p_connection.value Add-Member -MemberType scriptmethod ` -Name SetAuthenticationType ` -Value $sauth ` -Passthru Then first statement, "$sauth = [scriptblock]::create($bauth)", converts the here string into a script block. Then the Add-Member cmdlet is used to attach the scriptblock to the object as a scriptmethod named SetAuthenticationType. The method takes one parameter the value to set the AuthenticationType. You may be asking yourself, why is the code for the script method first defined as a here string and then converted to a script block? Why not just define it as a script block to start with? The reason is that a script block has limitations, including that it will not expand variables and it cannot define named parameters. By defining the code as a here string, which does support variable expansion, we can get the script block to support it too; i.e., when we convert it to the script block using the script block constructor create method. This technique gives our script methods the power of functions. I got this idea from a blog by Ed Wilson, the Scripting Guy: It is not always necessary to expand variables, so in some cases we could just directly define the script method as a script block. However, I prefer to code for maximum flexibility since requirements can change, so I make it a practice to always code script methods for my custom objects using this approach. The next section of the New-UdfConnection function creates the script method BuildConnectionString, w hich creates the required connection string so as to connect to the database. Let s look at the code that follows: $bbuildconnstring param() '@ 178 If ($this.authenticationtype -eq 'DSN' -and $this.dsnname -eq 'NotAssigned') { Throw "Error - DSN Authentication requires DSN Name to be set." } If ($this.authenticationtype -eq 'DSNLess' -and $this.driver -eq 'NotAssigned') { Throw "Error - DSNLess Authentication requires Driver to be set." } $Result = Get-UdfConnectionString $this.connectiontype ` $this.databasetype $this.server $this.authenticationtype ` $this.databasename $this.userid $this.password $this.dsnname $this.driver If (!$Result.startswith("Err")) { $this.connectionstring = $Result } Else { $this.connectionstring = 'NotAssigned' } RETURN $Result

334 CHAPTER 7 ACCESSING SQL SERVER A here string is assigned the code to build the connection string. The script does not take any parameters. The first two lines do some validation to make sure that, if this is an ODBC DSN connection, the object s DSNName property has been set. Alternatively, if this is an ODBC DSNLess connection, it verifies that the Driver property is set. If something is wrong, an error is thrown. There could be any number of validations, but the idea here is to demonstrate how they can be implemented. Again, to access the object properties, we use the $this reference. If the script passes the validation tests, it calls function Get-UdfConnectionString, passing required parameters to build the connection string, which is returned to the variable $Result. If there was an error getting the connection string, a string starting with Err is returned. $Result is checked, and if it does not start with Err, i.e., If (!$Result.startswith("Err")), the object s ConnectionString property is set to $Result. Remember,! means NOT. Otherwise, the ConnectionString is set to 'NotAssigned '. The function Get-UdfConnectionString is not part of the object. It is just a function in the module, which means it is a static function i.e., there is only one instance of it. The example shows how we can mix instance-specific methods and properties with static ones. Get-UdfConnectionString can also be called outside of the object method, which might come in handy if someone just wants to build a connection string. We ll cover the code involved in Get-UdfConnectionString after we ve finished covering New-UdfConnection. Now, let s take a look at the code that attaches the script to the script method: $sbuildconnstring = [scriptblock]::create($bbuildconnstring) $p_connection.value Add-Member -MemberType scriptmethod ` -Name BuildConnectionString ` -Value $sbuildconnstring ` -Passthru The first line converts the here string with the script into a scriptblock. Then the Add-Member cmdlet is used to attach the scriptblock to the scriptmethod BuildConnectionString. The Passthru parameter tells PowerShell to pass this scriptmethod through the pipe. The next method, RunSQL, is the real powerhouse of the object, and yet it is coded in the same way. This method will execute any SQL statement passed to it, using the connection string generated by the BuildConnectionString method to connect to a database. Let s look at the script definition here: # Do NOT put double quotes around the object $this properties as it messes up the values. # **** RunSQL *** $bsql param([string]$p_insql,[boolean]$isselect) '@ $this.buildconnectionstring() If ($this.connectionstring -eq 'NotAssigned') {Throw "Error - Cannot create connection string." } $Result = Invoke-UdfSQL $this.connectiontype $this.databasetype $this.server ` $this.databasename "$p_insql" ` $this.authenticationtype $this.userid $this.password ` $this.connectionstring $IsSelect ` $this.dsnname $this.driver RETURN $Result 179

335 CHAPTER 7 ACCESSING SQL SERVER The script is assigned to a variable, $bsql, as a here string. The script takes two parameters, $p_insql, which is the SQL statement to be executed, and $IsSelect, a Boolean, which is $true if the SQL statement is a Select statement and $false if it is not a Select statement, such as an update statement. Statements that return results need to be handled differently than statements that do not. Just in case the user did not call the BuildConnectionString method yet, the script calls it to make sure we have a connection string. Then the script checks that the ConnectionString has been assigned a value, and, if not, it throws an error. Finally, Invoke-UdfSQL is called with the parameters for connection type, database type, server name, database name, SQL statement, authentication type, User ID, Password, Connection String, $true if the statement is a Select statement or $false if it is not, DSN Name, and Driver Name. If the SQL were a Select statement, $Result will hold the data returned. The code that follows attaches the script to the object: $ssql = [scriptblock]::create($bsql) $p_connection.value Add-Member -MemberType scriptmethod ` -Name RunSQL ` -Value $ssql ` -Passthru The here string holding the script is converted to a scriptblock variable $ssql. Then $ssql is attached to the object with the Add-Member cmdlet and becomes scriptmethod RunSQL. The Passthru parameter tells PowerShell to pass the scriptmethod through the pipe. Supporting Functions There are a few support functions used by the object that New-UdfConnection returns. The object user need not be aware of these functions, but they help the connection object perform tasks. If the user is familiar with these functions, they can call them directly rather than through the connection object. This adds flexibility to the module, as a developer can pick and choose what they want to use. Get-UdfConnectionString The connection object returned by New-UdfConnection calls Get-UdfConnectionString when the method BuildConnectionString is executed. This function creates a connection string based on the parameters passed to it. Rather than write code that dynamically formats a connection string, why not read them from a text file? The format of a connection string is pretty static. Given the connection requirements i.e., connection type, such as ADO.Net or ODBC, the database type, and the authentication method, such as Windows Integrated Security or logon credentials the connection string format does not change. Only the variables like server name, database name, user ID, and password change. Ideally, we want to use connection strings in a file, like a template in which we can fill in the variable values. Fortunately, PowerShell makes this very easy to do. We ll use the CSV file named connectionstring.txt with the columns Type, Platform, AuthenticationType, and ConnectionString. Type, Platform, and AuthenticationType will be used to locate the connection string we need. Let s take a look at the contents of the file: "Type","Platform","AuthenticationType","ConnectionString" "ADO","SqlServer","Integrated","Data Source=$server;Integrated Security=SSPI;Initial Catalog=$databasename" "ADO","SqlServer","Logon","Data Source=$server;Persist Security Info=False; IntegratedSecurity=false;Initial Catalog=$databasename;User ID=$userid;Password=$password" "ADO","SqlServer","Credential","Data Source=$server;Persist Security Info=False;Initial Catalog=$databasename" 180

"ADO","Azure","Logon","Data Source=$server;User ID=$userid;Password=$password;Initial Catalog=$databasename;Trusted_Connection=False;TrustServerCertificate=False;Encrypt=True;"
"ODBC","PostgreSQL","DSNLess","Driver={$driver};Server=$server;Port=5432;Database=$databasename;uid=$userid;pwd=$password;sslmode=disable;readonly=0;"
"ODBC","PostgreSQL","DSN","DSN=$dsn;SSLmode=disable;ReadOnly=0;"

Notice that we have what appear to be PowerShell variables in the data, prefixed with $, as values for things in the connection string that are not static. Now, let's take a look at the code that uses the text file to build a connection string:

function Get-UdfConnectionString {
    [CmdletBinding()]
    param (
        [string] $type,            # Connection type, ADO, OleDB, ODBC, etc.
        [string] $dbtype,          # Database Type; SQL Server, Azure, MySQL.
        [string] $server,          # Server instance name
        [string] $authentication,  # Authentication method
        [string] $databasename,    # Database
        [string] $userid,          # User Name if using credential
        [string] $password,        # password if using credential
        [string] $dsnname,         # dsn name for ODBC connection
        [string] $driver           # driver for ODBC DSN Less connection
    )

    $connstrings = Import-CSV ($PSScriptRoot + "\connectionstrings.txt")

    foreach ($item in $connstrings)
    {
        if ($item.Type -eq $type -and $item.Platform -eq $dbtype -and
            $item.AuthenticationType -eq $authentication)
        {
            $connstring = $item.ConnectionString
            $connstring = $ExecutionContext.InvokeCommand.ExpandString($connstring)
            Return $connstring
        }
    }

    # If this line is reached, no matching connection string was found.
    Return "Error - Connection string not found!"
}

In this code, the function parameters are documented by comments; let's look at the first statement. The first statement loads a list of connection strings from a CSV file into a variable named $connstrings. It looks for the file in the path pointed to by the automatic variable $PSScriptRoot, which is the folder the module umd_database is stored in. Make sure you copied the connectionstrings.txt file to this folder. Using a foreach loop, we loop through each row in $connstrings until we find a row that matches the connection type, database type, and authentication type passed to the function. The matching row is loaded into $connstring. Now, a nice thing about PowerShell strings is that they automatically expand to replace variable names with their values. By having the function parameter names match the variable names used in

the connection strings in the file, we can have PowerShell automatically fill in the values for us by expanding the string using the command $ExecutionContext.InvokeCommand.ExpandString. Then, the connection string is returned to the caller. If no match is found, the statement after the foreach loop will be executed, which returns the message "Error - Connection string not found!". Normally, creating connection strings is tedious, but this function can be extended to accommodate many different requirements just by adding new rows to the input file.

Invoke-UdfSQL

Invoke-UdfSQL is the function called by the connection object's RunSQL method. This function acts like a broker to determine which specific function to call to run the SQL statement. It currently supports ADO and ODBC, but you can easily add other types, like OleDB. Let's take a look at the Invoke-UdfSQL function:

function Invoke-UdfSQL {
    param ([string]$p_inconntype,
           [string]$p_indbtype,
           [string]$p_inserver,
           [string]$p_indb,
           [string]$p_insql,
           [string]$p_inauthenticationtype,
           [string]$p_inuser,
           $p_inpw,                # No type defined; can be securestring or string
           [string]$p_inconnectionstring,
           [boolean]$p_inisselect,
           [string]$p_indsnname,
           [string]$p_indriver,
           [boolean]$p_inisprocedure,
           $p_inparms)

    If ($p_inconntype -eq "ADO")
    {
        If ($p_inisprocedure)
        {
            RETURN Invoke-UdfADOStoredProcedure $p_inserver $p_indb $p_insql `
                   $p_inauthenticationtype $p_inuser $p_inpw $p_inconnectionstring $p_inparms
        }
        Else
        {
            $datatab = Invoke-UdfADOSQL $p_inserver $p_indb $p_insql `
                       $p_inauthenticationtype $p_inuser $p_inpw $p_inconnectionstring $p_inisselect
            Return $datatab
        }
    }
    ElseIf ($p_inconntype -eq "ODBC")
    {
        If ($p_inisprocedure)
        {
            RETURN Invoke-UdfODBCStoredProcedure $p_inserver $p_indb $p_insql `
                   $p_inauthenticationtype $p_inuser $p_inpw $p_inconnectionstring $p_inparms
        }

        Else
        {
            $datatab = Invoke-UdfODBCSQL $p_inserver $p_indb $p_insql $p_inauthenticationtype `
                       $p_inuser $p_inpw $p_inconnectionstring $p_inisselect $p_indsnname $p_indriver
            Return $datatab
        }
    }
    Else
    {
        Throw "Connection Type Not Supported."
    }
}

This function has an extensive parameter list. Some of them are not needed, but it's good to have the extra information in case we need it. Notice that the parameter $p_inpw has no type defined. This is so the parameter can accept whatever is passed to it. As I will demonstrate later, the connection object can support either an encrypted or a clear-text password. An encrypted string is of type securestring, and clear text is of type string. Note: Depending on the connection requirements, some of the parameters may not have values. The function code consists of an If/Else block. If the parameter connection type $p_inconntype is 'ADO', then the Boolean is checked to see if this is a stored procedure call. If yes, the function Invoke-UdfADOStoredProcedure is called; otherwise, Invoke-UdfADOSQL is called, with the results loaded into $datatab, which is then returned to the caller. If the parameter connection type $p_inconntype is 'ODBC', then the Boolean is checked to see if this is a stored procedure call. If yes, the function Invoke-UdfODBCStoredProcedure is called; otherwise, Invoke-UdfODBCSQL is called, with the results loaded into $datatab, which is returned to the caller. If the connection type is neither of these, a terminating error is thrown.

Invoke-UdfADOSQL

Invoke-UdfADOSQL uses ADO.Net to execute the SQL statement with the connection string passed in. ADO.Net is a very flexible provider supported by many vendors, including Oracle, MySQL, and PostgreSQL. Let's look at the function Invoke-UdfADOSQL code:

function Invoke-UdfADOSQL {
    [CmdletBinding()]
    param (
        [string] $sqlserver,              # SQL Server
        [string] $sqldatabase,            # SQL Server Database.
        [string] $sqlquery,               # SQL Query
        [string] $sqlauthenticationtype,  # $true = Use Credentials
        [string] $sqluser,                # User Name if using credential
        $sqlpw,                           # password if using credential
        [string] $sqlconnstring,          # Connection string
        [boolean]$sqlisselect             # true = select, false = non select statement
    )

339 CHAPTER 7 ACCESSING SQL SERVER
  if ($sqlauthenticationtype -eq 'Credential')
  {
    $pw = $sqlpw
    $pw.makereadonly()
    $SqlCredential = new-object System.Data.SqlClient.SqlCredential($sqluser, $pw)
    $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring, $SqlCredential)
  }
  else
  {
    $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring)
  }

  $conn.open()

  $command = new-object system.data.sqlclient.sqlcommand($sqlquery,$conn)

  if ($sqlisselect)
  {
    $adapter = New-Object System.Data.sqlclient.SqlDataAdapter $command
    $dataset = New-Object System.Data.DataSet
    $adapter.fill($dataset) | Out-Null
    $conn.close()
    RETURN $dataset.tables[0]
  }
  Else
  {
    $command.ExecuteNonQuery()
    $conn.close()
  }
}

This function takes a lot of parameters, but most of them are not needed; rather, they are included in case there is a need to extend the functionality. Let's take a closer look at the first section of code that handles credentials:

if ($sqlauthenticationtype -eq 'Credential')
{
  $pw = $sqlpw
  $pw.makereadonly()
  $SqlCredential = new-object System.Data.SqlClient.SqlCredential($sqluser, $pw)
  $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring, $SqlCredential)
}
else
{
  $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring);
}

We can see that if the authentication type passed is equal to Credential, it indicates that the caller wants to use a credential object to log in. A credential object allows us to separate the connection string from the user ID and password so no one can see them. Note: This concept can also be used in making

340 CHAPTER 7 ACCESSING SQL SERVER web-service requests. If the authentication type is Credential, the password parameter, $sqlpw, is assigned to $pw, and then this variable is set to read only. This is required by the credential object. Then, we create an instance of the SqlCredential object, passing the user ID and password as parameters, which returns a reference to the variable $SqlCredential. Finally, the connection object is created as an instance of SqlConnection with the connection string, $sqlconnectonstring, and the credential object, $SqlCredential, as parameters. The connection reference is returned to $conn. If the authentication type is not equal to Credential, the connection is created by passing only the connection string to the SqlConnection method. Note: The user ID and password would be found in the connection string if the authentication type were Logon or DSNLess. Now the function is ready to open the connection as shown here: $conn.open() $command = new-object system.data.sqlclient.sqlcommand($sqlquery,$conn) First, the connection must be opened. Then, a command object is created via the SqlCommand method, and the SQL statement and connection object are passed as parameters, returning the object to $command. The code to execute the SQL statement is as follows: if ($sqlisselect) { $adapter = New-Object System.Data.sqlclient.SqlDataAdapter $command $dataset = New-Object System.Data.DataSet $adapter.fill($dataset) Out-Null $conn.close() RETURN $dataset.tables[0] } Else { $command.executenonquery() $conn.close() } Above, the Boolean parameter $ sqlisselect is tested for a value of true i.e., if ($sqlisselect). Since it is a Boolean type, there is no need to compare the value to $true or $false. If true, this is a select statement and it needs to retrieve data and return it to the caller. We can see a SQL data adapter being created, with the SqlDataAdapter method call passing the command object as a parameter. Then a dataset object, DataSet, is created to hold the results. The data adapter $adapter Fill method is called to load the dataset with the results. The connection is closed, as we don t want to accumulate open connections. It is important to take this cleanup step. A dataset is a collection of tables. A command can submit multiple select statements, and each result will go into a separate table element in the dataset. This function only supports one select statement so as to keep things simple. Therefore, only the first result set is returned, as in the statement RETURN $dataset.table[0]". If $issqlselect is false, the statement just needs to be executed. No results are returned. Therefore, the command method ExecuteNonQuery is called. Then, the connection is closed. There is nothing to return to the caller. 185
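To see the two paths in isolation, here is a small, hypothetical sketch that calls Invoke-UdfADOSQL directly rather than through the connection object. The server name, database, and connection string are placeholders for a local AdventureWorks instance using integrated security, and the function is assumed to be exported by the umd_database module:

Import-Module umd_database

$connstring = "Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorks"

# Select path: the final $true asks for the first result table back.
$rows = Invoke-UdfADOSQL '(local)' 'AdventureWorks' `
        "SELECT TOP 5 BusinessEntityID, JobTitle FROM HumanResources.Employee" `
        'Integrated' '' '' $connstring $true
$rows | Format-Table -AutoSize

# Non-select path: the final $false just executes the statement; no result table comes back.
Invoke-UdfADOSQL '(local)' 'AdventureWorks' `
        "UPDATE HumanResources.Employee SET ModifiedDate = ModifiedDate WHERE BusinessEntityID = 1" `
        'Integrated' '' '' $connstring $false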

341 CHAPTER 7 ACCESSING SQL SERVER

Using the Credential Object

The credential object provides good protection of the user ID and password. On-premises SQL Server supports using a credential object at logon, but Azure SQL does not. As we saw, the object returned by New-UdfConnection supports using a credential. However, using it requires a slightly different set of statements. Let's look at the following example:

[psobject] $myconnection = New-Object psobject
New-UdfConnection ([ref]$myconnection)

$myconnection.connectiontype = 'ADO'
$myconnection.databasetype = 'SqlServer'
$myconnection.server = '(local)'
$myconnection.databasename = 'AdventureWorks'
$myconnection.usecredential = 'Y'
$myconnection.userid = 'bryan'
$myconnection.password = Get-Content 'C:\Users\BryanCafferky\Documents\password.txt' | ConvertTo-SecureString
$myconnection.setauthenticationtype('credential')
$myconnection.buildconnectionstring()   # Should return the connection string.

These statements will set all the properties needed to run queries against our local instance of SQL Server. Most of this is the same as what we did before. The line I want to call your attention to is copied here:

$myconnection.password = Get-Content 'C:\Users\BryanCafferky\Documents\password.txt' | ConvertTo-SecureString

This line loads the password from a file that is piped into the ConvertTo-SecureString cmdlet to encrypt it. If you try to display the password property of $myconnection, you will just see 'System.Security.SecureString'. You cannot view the contents. So the password is never readable to anyone once it has been saved. Other than that difference, the connection object is used the same way as before. We can submit a query such as the one here:

$myconnection.runsql("select top 20 * FROM [HumanResources].[EmployeeDepartmentHistory]", $true) |
  Select-Object -Property BusinessEntityID, DepartmentID, ShiftID, StartDate |
  Out-GridView

The query results will be returned and piped into the Select-Object cmdlet, which selects the columns desired, and then will be piped into Out-GridView for display.

Encrypting and Saving the Password to a File

We have not seen how to save the password to a file. We can do that with this statement:

Save-UdfEncryptedCredential

342 CHAPTER 7 ACCESSING SQL SERVER This statement calls a module function that will prompt us for a password and, after we enter it, will prompt us for the filename to save it to with the Windows Save File common dialog. Let's take a look at the code for this function:

function Save-UdfEncryptedCredential {
  [CmdletBinding()]
  param ()

  $pw = read-host -Prompt "Enter the password:" -assecurestring

  $pw | ConvertFrom-SecureString |
    Out-File (Invoke-UdfCommonDialogSaveFile ("c:" + $env:homepath + "\Documents\" ))
}

The first executable line prompts the user for a password, which is not displayed as typed because the assecurestring parameter to Read-Host suppresses display of the characters. The second line displays the Save File common dialog so the user can choose where to save the password. Because the call to Invoke-UdfCommonDialogSaveFile is in parentheses, it will execute first, and the password will be stored to the file specified by the user. The password entered is piped into ConvertFrom-SecureString, which obfuscates it by making it a readable series of numbers that is piped into the Out-File cmdlet, which saves the file. For completeness, the function Invoke-UdfCommonDialogSaveFile, which displays the Save File common dialog form, is listed here:

function Invoke-UdfCommonDialogSaveFile($initialDirectory)
{
  [System.Reflection.Assembly]::LoadWithPartialName("System.windows.forms") | Out-Null

  $OpenFileDialog = New-Object System.Windows.Forms.SaveFileDialog
  $OpenFileDialog.initialDirectory = $initialDirectory
  $OpenFileDialog.filter = "All files (*.*)|*.*"
  $OpenFileDialog.ShowDialog() | Out-Null
  $OpenFileDialog.filename
}

This function uses the Windows form SaveFileDialog to present a user with the familiar Save As dialog. The only parameter is the initial directory the caller wants the dialog to default to. The last line returns the selected folder and filename entered.

Calling Stored Procedures

In simple cases, stored procedures can be called like any other SQL statement. If the procedure returns a query result, it can be executed as a select query. For example, consider the stored procedure in Listing 7-2 that returns a list of employees.

343 CHAPTER 7 ACCESSING SQL SERVER

Listing 7-2. The stored procedure HumanResources.uspListEmployeePersonalInfoPS

Create PROCEDURE [HumanResources].[uspListEmployeePersonalInfoPS]
    @BusinessEntityID [int]
WITH EXECUTE AS CALLER
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRY
        select * from [HumanResources].[Employee]
        where [BusinessEntityID] = @BusinessEntityID;
    END TRY
    BEGIN CATCH
        EXECUTE [dbo].[uspLogError];
    END CATCH;
END;
GO

The procedure in Listing 7-2 will return the list of employees that have the BusinessEntityID passed to the function, i.e., one employee, since this is the primary key. We can test it on SQL Server as follows:

exec [HumanResources].[uspListEmployeePersonalInfoPS] 1

One row is returned. Using the connection object returned by New-UdfConnection, we can code the call to this stored procedure, as shown in Listing 7-3. Note: Change properties as needed to suit your environment.

Listing 7-3. Calling a SQL Server stored procedure

Import-Module umd_database -Force

[psobject] $myconnection = New-Object psobject
New-UdfConnection ([ref]$myconnection)

$myconnection.connectiontype = 'ADO'
$myconnection.databasetype = 'SqlServer'
$myconnection.server = '(local)'
$myconnection.databasename = 'AdventureWorks'
$myconnection.usecredential = 'N'
$myconnection.setauthenticationtype('integrated')
$myconnection.buildconnectionstring()

$empid = 1

$myconnection.runsql("exec [HumanResources].[uspListEmployeePersonalInfoPS] $empid", $true)
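The call hands back the rows as a DataTable, so they can be captured and shaped in the pipeline like any other objects. The following is a small, hypothetical follow-on sketch that assumes the connection object and $empid from Listing 7-3 are still in scope; the column names are those of HumanResources.Employee in AdventureWorks:

# Capture the rows returned by the stored procedure and shape them in the pipeline.
$employee = $myconnection.runsql("exec [HumanResources].[uspListEmployeePersonalInfoPS] $empid", $true)
$employee | Select-Object -Property BusinessEntityID, JobTitle, HireDate, VacationHours |
    Format-Table -AutoSize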

344 CHAPTER 7 ACCESSING SQL SERVER Since the result set is passed back from a select query within the stored procedure, the call to the procedure is treated like a select statement. Notice that we can even include input parameters by using PowerShell variables. Calling Stored Procedures Using Output Parameters When we want to call a stored procedure that uses output parameters to return results, we need to add the parameters object to the database call. To demonstrate, let s consider the stored procedure that follows that takes the input parameters BusinessEntityID, NationalIDNumber, BirthDate, MaritalStatus, and Gender and returns the output parameters JobTitle, HireDate, and VacationHours. This is a modified version of an AdventureWorks stored procedure named [ HumanResources].[uspUpdateEmployeePersonalInfo], with PS appended to the name, indicating it is for use by our PowerShell script. The input parameters are used to update the employee record. The output parameters are returned from the call. Let s review the stored procedure in Listing 7-4. Listing 7-4. A stored procedure with output parametersq USE [AdventureWorks] GO SET ANSI_NULLS ON SET QUOTED_IDENTIFIER ON GO Create PROCEDURE [nvarchar](50) [date] [smallint] output WITH EXECUTE AS CALLER AS BEGIN SET NOCOUNT ON; BEGIN TRY UPDATE [HumanResources].[ Em ployee] SET [NationalIDNumber] WHERE [BusinessEntityID] = = = VacationHours from [HumanResources].[Employee] where [BusinessEntityID] 189

345 CHAPTER 7 ACCESSING SQL SERVER END TRY BEGIN CATCH EXECUTE [dbo].[usplogerror]; END CATCH; END; GO Listing 7-4 is a simple stored procedure, but it allo ws us to see how to pass input and output parameters of different data types. We can see that the Update statement will update the employee record with the input parameters. Then a select statement will load the values for JobTitle, HireDate, and VacationHours into the Let s look more closely at how the parameters are [nvarchar](50) [date] [smallint] output Parameters can be of three possible types, which are Input, Output, or InputOutput. Input parameters can be read but not updated. Output parameters by definition should only be updatable but not read. InputOutput parameters can be read and updated. SQL Server does not support an Output parameter that can only be updated. Rather, it treats Output parameters as InputOutput parameters. By default, a parameter is Input, so it does not need to be specified. Notice that for the Output parameters, the word 'output' is specified after the data type. Before we look at how to call this stored procedure with PowerShell, let s review how we would call it from SQL Server. Actually, it s a good idea to test calls to SQL from within a SQL Server tool like SQL Server Management Studio before trying to develop and test PowerShell code to do the same thing. The short SQL script in Listing 7-5 executes our stored procedure. Listing 7-5. A SQL script to call a stored procedure Use AdventureWorks go [nvarchar](50) [date] [smallint] exec [HumanResources].[uspUpdateEmployeePersonalInfoPS] 1, , ' ', 'M', Output 190

346 CHAPTER 7 ACCESSING SQL SERVER When you run this script, you should see the following output: Chief Executive Officer Don t worry if the actual values are different. It will be w hatever is in the database at the time you execute this. PowerShell Code to Call a Stored Procedure with Output Parameters Now that we know we can execute the stored procedure with T-SQL, let s do the same thing using PowerShell. Let s look at the PowerShell script in Listing 7-6 that runs the stored procedure. Note: To get the ideas across, the script is pretty hard coded. Listing 7-6. PowerShell code to call a stored procedure with output parameters Import-Module umd_database $SqlConnection = New-Object System.Data.SqlClient.SqlConnection $SqlConnection.ConnectionString = "Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorks" $SqlCmd = New-Object System.Data.SqlClient.SqlCommand $SqlCmd.CommandText = "[HumanResources].[uspUpdateEmployeePersonalInfoPS]" $SqlCmd.Connection = $SqlConnection $SqlCmd.CommandType = [System.Data.CommandType]'StoredProcedure' $SqlCmd.Parameters.AddWithValue("@BusinessEntityID", 1) >> $null $SqlCmd.Parameters.AddWithValue("@NationalIDNumber", ) >> $null $SqlCmd.Parameters.AddWithValue("@BirthDate", ' ') >> $null $SqlCmd.Parameters.AddWithValue("@MaritalStatus", 'S') >> $null $SqlCmd.Parameters.AddWithValue("@Gender", 'M') >> $null # -- Output Parameters --- # JobTitle $outparameter1 = new-object System.Data.SqlClient.SqlParameter $outparameter1.parametername = "@JobTitle" $outparameter1.direction = [System.Data.ParameterDirection]::Output $outparameter1.dbtype = [System.Data.DbType]'string' $outparameter1.size = 50 $SqlCmd.Parameters.Add($outParameter1) >> $null # HireDate $outparameter2 = new-object System.Data.SqlClient.SqlParameter $outparameter2.parametername = "@HireDate" $outparameter2.direction = [System.Data.ParameterDirection]::Output $outparameter2.dbtype = [System.Data.DbType]'date' $SqlCmd.Parameters.Add($outParameter2) >> $null # VacationHours $outparameter3 = new-object System.Data.SqlClient.SqlParameter $outparameter3.parametername = "@VacationHours" $outparameter3.direction = [System.Data.ParameterDirection]::Output 191

347 CHAPTER 7 ACCESSING SQL SERVER $outparameter3.dbtype = [System.Data.DbType]'int16' $SqlCmd.Parameters.Add($outParameter3) >> $null $SqlConnection.Open() $result = $SqlCmd. ExecuteNonQue ry() $SqlConnection.Close() $SqlCmd.Parameters["@jobtitle"].value $SqlCmd.Parameters["@hiredate"].value $SqlCmd.Parameters["@VacationHours"].value There s a lot of code there, but we ll walk through it. As before, first we import the umd_database module. Then, we define the parameters. First, let s look at the code that creates the SQL connection and command objects: $SqlConnection = New-Object System.Data.SqlClient.SqlConnection $SqlConnection.ConnectionString = "Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorks" $SqlCmd = New-Object System.Data.SqlClient.SqlCommand $SqlCmd.CommandText = "[HumanResources].[uspUpdateEmployeePersonalInfoPS]" $SqlCmd.Connection = $SqlConnection $SqlCmd.CommandType = [System.Data.CommandType]'StoredProcedure'; We ve seen the first few lines before creating the connection, assigning the connection string, and creating a command object. Notice that for the command s CommandText property, we re just giving the name of the stored procedure, i.e., = [HumanResources].[uspUpdateEmployeePersonalInfoPS]. Then, we connect the command to the connection with the line "$SqlCmd.Connection = $SqlConnection". Finally, the last line assigns the CommandType property as 'StoredProcedure'. This is critical in order for the command to be processed correctly. The Input parameters are assigned by the code here: $SqlCmd.Parameters.AddWithValue("@BusinessEntityID", 1) >> $null $SqlCmd.Parameters.AddWithValue("@NationalIDNumber", ) >> $null $SqlCmd.Parameters.AddWithValue("@BirthDate", ' ') >> $null $SqlCmd.Parameters.AddWithValue("@MaritalStatus", 'S') >> $null $SqlCmd.Parameters.AddWithValue("@Gender", 'M') >> $null Input parameters can use the abbreviated format for assignment. The parameters collection of the command object holds the parameter details. The AddWithValue method adds each parameter with value to the collection. To suppress any output returned from the AddWithValue method, we direct it to $null. Output parameters need to be coded in a manner that provides more details. Now, let s look at the code that creates the JobTitle output parameters: # <---- Output Parameters -----> # JobTitle $outparameter1 = new-object System.Data.SqlClient.SqlParameter $outparameter1.parametername = "@JobTitle" $outparameter1.direction = [System.Data.ParameterDirection]::Output $outparameter1.dbtype = [System.Data.DbType]'string' $outparameter1.size = 50 $SqlCmd.Parameters.Add($outParameter1) >> $null 192

348 CHAPTER 7 ACCESSING SQL SERVER In the first non-comment line above, we create a new SQL parameter object instance to hold the information about the parameter. Once created, we just assign details about the parameter, such as ParameterName, Direction, DbType, and Size, to the associated object properties. The ParameterName is the name defined in the stored procedure, which is why it has sign prefix, as SQL Server variables and parameters have. The Direction property tells whether this is an Input, Output, or InputOutput parameter. Note: Although SQL Server does not support an InputOutput direction, ADO.Net does. We define the direction as Output. The DbType property is not of the SQL Server data type. Rather, it is an abstraction that will equate to a SQL Server data type in our case. For another type of database, the underlying database-type columns may be different. For a complete list of DbType to SQL Server data-type mappings, see the link: We use the DbType 'string' for the JobTitle, which is defined as nvarchar(50). For string types, we need to assign the Size property, which is the length of the column. Finally, we use the Parameters collection Add method to add the parameter to the collection. Table 7-1 shows some of the most common SQL Server data types and their ADO.Net corresponding DbType. Table 7-1. Common Data Types SQL Server Database Format int Smallint bigint varchar nvarchar char bit date decimal text ntext ADO.Net DbType Int32 Int16 Int64 String or Char[] String or Char[] String or Char[] Boolean Date Decimal String or Char[] String or Char[] Now, let s look at the code that creates the remaining two output parameters: # HireDate $outparameter2 = new-object System.Data. SqlClien t.sqlparameter $outparameter2.parametername = "@HireDate" $outparameter2.direction = [System.Data.ParameterDirection]::Output $outparameter2.dbtype = [System.Data.DbType]'date' $SqlCmd.Parameters.Add($outParameter2) >> $null # VacationHours $outparameter3 = new-object System.Data.SqlClient.SqlParameter $outparameter3.parametername = "@VacationHours" $outparameter3.direction = [System.Data.ParameterDirection]::Output $outparameter3.dbtype = [System.Data.DbType]'int16' $SqlCmd.Parameters.Add($outParameter3) >> $null 193

349 CHAPTER 7 ACCESSING SQL SERVER The code to assign the HireDate and VacationHours parameters is not very different from the code we saw to define the JobTitle parameter. Notice that the DbTypes of date and int16 do not require a value for the size property. Now, let's look at the code to execute the stored procedure:

$SqlConnection.Open();
$result = $SqlCmd.ExecuteNonQuery()
$SqlConnection.Close();

First, we open the connection. Then, we use the ExecuteNonQuery method of the command object to run the stored procedure, returning any result to $result. The returned value is usually -1 for success and 0 for failure. However, when there is a trigger on a table being inserted to or updated, the number of rows inserted and/or updated is returned. We can see the Output parameters by getting their Value property from the Parameters collection, as shown here:

$SqlCmd.Parameters["@jobtitle"].value
$SqlCmd.Parameters["@hiredate"].value
$SqlCmd.Parameters["@VacationHours"].value

Notice that although the connection is closed, we can still retrieve the Output parameters.

Calling Stored Procedures the Reusable Way

We have seen how we can use PowerShell to call stored procedures. Now, let's take a look at how we can incorporate those ideas as reusable functions in the umd_database module. Overall, executing a stored procedure is like executing any SQL statement, except we need to provide the Input and Output parameters. The challenge here is that there can be any number of parameters, and each has a set of property values. How can we provide for passing a variable-length list of parameters to a function? Since PowerShell supports objects so nicely, why not create the parameter list as a custom object collection? The function that gets the object collection as a parameter can iterate over the list of parameters to create each SQL command parameter object. To help us build this collection, we'll use the helper function Add-UdfParameter, which creates a single parameter as a custom object:

function Add-UdfParameter {
  [CmdletBinding()]
  param (
    [string] $name,       # Parameter name from stored procedure
    [string] $direction,  # Input or Output or InputOutput
    [string] $value,      # parameter value
    [string] $datatype,   # db data type, i.e. string, int64, etc.
    [int]    $size        # length
  )

  $parm = New-Object System.Object
  $parm | Add-Member -MemberType NoteProperty -Name "Name" -Value "$name"
  $parm | Add-Member -MemberType NoteProperty -Name "Direction" -Value "$direction"
  $parm | Add-Member -MemberType NoteProperty -Name "Value" -Value "$value"
  $parm | Add-Member -MemberType NoteProperty -Name "Datatype" -Value "$datatype"
  $parm | Add-Member -MemberType NoteProperty -Name "Size" -Value "$size"

  RETURN $parm
}

350 CHAPTER 7 ACCESSING SQL SERVER This function takes the parameters for each of the properties of the SQL parameter object required to call a stored procedure. The first executable line in the function stores an instance of System.Object into $parm, which gives us a place to which to attach the properties. We then pipe $parm into the Add-Member cmdlet to add each property with a value from each of the parameters passed to the function. Finally, we return the $parm object back to the caller. Let's see how we would use Add-UdfParameter in Listing 7-7.

Listing 7-7. Using Add-UdfParameter

Import-Module umd_database

$parmset = @()   # Create a collection object.

# Add the parameters we need to use...
$parmset += (Add-UdfParameter "@BusinessEntityID" "Input" "1" "int32" 0)
$parmset += (Add-UdfParameter "@NationalIDNumber" "Input" " " "string" 15)
$parmset += (Add-UdfParameter "@BirthDate" "Input" " " "date" 0)
$parmset += (Add-UdfParameter "@MaritalStatus" "Input" "S" "string" 1)
$parmset += (Add-UdfParameter "@Gender" "Input" "M" "string" 1)
$parmset += (Add-UdfParameter "@JobTitle" "Output" "" "string" 50)
$parmset += (Add-UdfParameter "@HireDate" "Output" "" "date" 0)
$parmset += (Add-UdfParameter "@VacationHours" "Output" "" "int16" 0)

$parmset | Out-GridView   # Verify the parameters are correctly defined.

Listing 7-7 starts by importing the umd_database module. Then, we create an empty collection object named $parmset by setting it equal to @(). We load this collection with each parameter by calling the function Add-UdfParameter. There are a few things to notice here. We enclose the function call in parentheses to make sure that the function is executed first. We use the += assignment operator to append the object returned to the $parmset collection. Normally, the += operator is used to increment the variable on the left side of the operator with the value on the right side, i.e., x += 1 is the same as saying x = x + 1. However, in the case of objects, it appends the object instance to the collection. The last statement pipes the parameter collection $parmset to Out-GridView so we can see the list. We should see a display like the one in Figure 7-1.

Figure 7-1. Out-GridView showing the parameter collection
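Because each entry in $parmset is just a PSObject with Name, Direction, Value, Datatype, and Size properties, the collection can also be inspected or adjusted in the pipeline before it is handed to a database call. A minimal sketch, assuming the collection built in Listing 7-7 is in the current session:

# Show only the output parameters that the stored procedure will fill in.
$parmset | Where-Object { $_.Direction -eq 'Output' } | Select-Object Name, Datatype, Size

# Change the value of one input parameter before the call.
($parmset | Where-Object { $_.Name -eq '@MaritalStatus' }).Value = 'M'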

351 CHAPTER 7 ACCESSING SQL SERVER Now that we have the parameter collection, we want to pass this to a function that will execute the stored procedure. Let s look at the code to do this: function Invoke-UdfADOStoredProcedure { [CmdletBinding()] param ( [string] $sqlserver, # SQL Server [string] $sqldatabase, # SQL Server Database. [string] $sqlspname, # SQL Query 196 ) [string] $sqlauthenticationtype, # $true = Use Credentials [string] $sqluser, # User Name if using credential $sqlpw, # password if using credential [string] $sqlconnstring, # Connection string $parameterset # Parameter properties if ($sqlauthenticationtype -eq 'Credential') { $pw = $sqlpw $pw.makereadonly() $SqlCredential = new-object System.Data.SqlClient.SqlCredential($sqluser, $pw) $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring, $SqlCredential) } else { $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring); } $conn.open() $command = new-object system.data. sqlcl ient.sqlcommand($sqlspname,$conn) $command.commandtype = [System.Data.CommandType]'StoredProcedure'; foreach ($parm in $parameterset) { if ($parm.direction -eq 'Input') { $command.parameters.addwithvalue($parm.name, $parm.value) >> $null; } elseif ($parm.direction -eq "Output" ) { $outparm1 = new-object System.Data.SqlClient.SqlParameter; $outparm1.parametername = $parm.name $outparm1.direction = [System.Data.ParameterDirection]::Output; $outparm1.dbtype = [System.Data.DbType]$parm.Datatype; $outparm1.size = $parm.size $command.parameters.add($outparm1) >> $null } }

352 CHAPTER 7 ACCESSING SQL SERVER $command.executenonquery() $conn.close() $outparms foreach ($parm in $parameterset) { if ($parm.direction -eq 'Output') { $outparms.add($parm.name, $command.parameters[$parm.name].value) } } RETURN $outparms } Let s review this code in detail. The first executable block of lines should look familiar, as it is the same code used in the function earlier to run a SQL statement, i.e., Invoke-UdfADOSQL : if ($sqlauthenticationtype -eq 'Credential') { $pw = $sqlpw $pw.makereadonly() $SqlCredential = new-object System.Data.SqlClient.SqlCredential($sqluser, $pw) $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring, $SqlCredential) } else { $conn = new-object System.Data.SqlClient.SqlConnection($sqlconnstring); } $conn.open() $command = new-object system.data. sqlclient.sqlcommand($sqlspname,$conn) First, we check if we need to use a credential object. If yes, then we make the password read only and create the credential object, passing in the user ID and password. Then, we create the connection object using the connection string and credential object. If no credential object is needed, we just create the connection object using the connection string. Then, we open the connection using the Open method. By now the code should be looking familiar. The next statement sets the CommandType property to 'StoredProcedure' : $command.commandtype = [System.Data.CommandType]'StoredProcedure'; 197

353 CHAPTER 7 ACCESSING SQL SERVER From here we are ready to create the stored procedure parameters, which we do with the code that follows: foreach ($parm in $parameterset) { if ($parm.direction -eq 'Input') { $command.parameters.addwithvalue($parm.name, $parm.value) >> $null; } elseif ($parm.direction -eq "Output" ) { $outparm1 = new-object System.Data.SqlClient.SqlParameter; $outparm1.parametername = $parm.name $outparm1.direction = [System.Data.ParameterDirection]::Output; $outparm1.dbtype = [System.Data.DbType]$parm.Datatype; $outparm1.size = $parm.size $command.parameters.add($outparm1) >> $null } } The foreach loop will iterate over each object in the collection $parameterset that was passed to this function. On each iteration, $parm will hold the current object. Remember, we created this parameter by using the helper function Add-UdfParameter. On each iteration, if the parameter s direction property is Input, we use the AddWithValue method to add the parameter to the command s parameters collection. Otherwise, if the direction is Output, we create a new SqlParameter object instance and assign the required property values from those provided in $parm. Then, we execute the stored procedure with the statements here: $command.executenonquery() $conn.close() Now, we need to return the Output parameters back to the caller. There are a number of ways this could be done. One method I considered was updating the parameter object collection passed into the function. However, that would make the caller do a lot of work to get the values. Instead, the function creates a hash table and loads each parameter name as the key and the return value as the value. Remember, hash tables are just handy lookup tables. Let s look at the code for this: $outparms foreach ($parm in $parameterset) { if ($parm.direction -eq 'Output') { $outparms.add($parm.name, $command.parameters[$parm.name].value) } } RETURN $outparms } 198

354 CHAPTER 7 ACCESSING SQL SERVER Here, $outparms is created as an empty hash table by the statement "$outparms = @{}". Then, we iterate over the parameter set originally passed to the function. For each parameter with a direction of Output, we add a new entry into the $outparms hash table with the key of the parameter name and the value taken from the SQL command parameters collection. Be careful here not to miss what is happening. We are not using the value from the parameter collection passed to the function. We are going into the SQL command object and pulling the return value from there.

Now we have nice, reusable functions to help us call stored procedures. Wouldn't it be nice to integrate this with the umd_database module's connection object we covered earlier? That object was returned by New-UdfConnection. Let's integrate the functions to call a stored procedure with New-UdfConnection so that we have one nice interface for issuing SQL commands. To do that, we'll need to add a method to the connection object returned by New-UdfConnection. The code that follows creates that function and attaches it to the object:

# For call, $false set for $IsSelect as this is a stored procedure.
# $true set for IsProcedure
$bspsql = @'
  param([string]$p_insql, $p_parms)

  $this.buildconnectionstring()

  If ($this.connectionstring -eq 'NotAssigned') {Throw "Error - Cannot create connection string." }

  $Result = Invoke-UdfSQL $this.connectiontype $this.databasetype `
            $this.server $this.databasename "$p_insql" `
            $this.authenticationtype $this.userid $this.password `
            $this.connectionstring $false `
            $this.dsnname $this.driver $true $p_parms

  RETURN $Result
'@

$sspsql = [scriptblock]::create($bspsql)

$p_connection.value | Add-Member -MemberType scriptmethod `
                                 -Name RunStoredProcedure `
                                 -Value $sspsql `
                                 -Passthru

As with other object methods, we first assign the code block to a here string variable, which is called $bspsql in this case. The code block shows that two parameters are accepted by the function: $p_insql, which is the name of the stored procedure, and $p_parms, which is the parameter collection we created using Add-UdfParameter. The first executable line of the function uses the object's BuildConnectionString to create the connection string so as to connect to the database. If there is a problem creating the connection string, as indicated by a value of 'NotAssigned', an error is thrown. Finally, the function Invoke-UdfSQL is called, passing the object properties and the parameters. Notice two Boolean values are being passed. The first, which is $false, is the $IsSelect parameter, so we are saying this is not a select query. The second, which is $true, is for a parameter that tells the function this is a stored procedure call.
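Once RunStoredProcedure is attached, the output parameter values come back in the hash table built by Invoke-UdfADOStoredProcedure. One wrinkle, as the functions are written here, is that the unsuppressed ExecuteNonQuery return code is also emitted to the pipeline, so a hypothetical caller might separate the two before reading values by name (assuming a prepared $myconnection and the $parmset collection from Listing 7-7):

# Run the stored procedure through the connection object.
$spresult = $myconnection.RunStoredProcedure('[HumanResources].[uspUpdateEmployeePersonalInfoPS]', $parmset)

# Pick the hash table of output parameters out of what came back,
# then read individual values by parameter name.
$outvalues = $spresult | Where-Object { $_ -is [hashtable] }
$outvalues['@JobTitle']
$outvalues['@HireDate']
$outvalues['@VacationHours']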

355 CHAPTER 7 ACCESSING SQL SERVER We re almost there, but there is one piece missing. We need to discuss the function being called, Invoke-UdfSQL. This function acts as a broker to determine which specific function to call, i.e., for ADO or ODBC. To see how the function determines which call to make, let s look at the code for Invoke-UdfSQL : function Invoke-UdfSQL ( [string]$p_inconntype, [string]$p_indbtype, [string]$p_inserver, [string]$p_indb, [string]$p_insql, [string]$p_inauthenticationtype, [string]$p_inuser, $p_inpw, # No type defined; can be securestring or string [string]$p_inconnectionstring, [boolean]$p_inisselect, [string]$p_indsnname, [string]$p_indriver, [boolean]$p_inisprocedure, $p_inparms) { If ($p_inconntype -eq "ADO") { If ($p_inisprocedure) { RETURN Invoke-UdfADOStoredProcedure $p_inserver $p_indb $p_insql ` $p_inauthenticationtype $p_inuser $p_inpw $p_inconnectionstring $p_inparms } Else { $datatab = Invoke-UdfADOSQL $p_inserver $p_indb $p_insql $p_inauthenticationtype ` $p_inuser $p_inpw $p_inconnectionstring $p_inisselect Return $datatab } } ElseIf ($p_inconntype -eq "ODBC") { If ($p_inisprocedure) { write-host 'sp' RETURN Invoke-UdfODBCStoredProcedure $p_inserver $p_indb $p_insql ` $p_inauthenticationtype $p_inuser $p_inpw $p_inconnectionstring $p_inparms } Else { $datatab = Invoke-UdfODBCSQL $p_inserver $p_indb $p_insql $p_inauthenticationtype ` $p_inuser $p_inpw $p_inconnectionstring $p_inisselect $p_indsnname $driver } 200 } Return $datatab

356 CHAPTER 7 ACCESSING SQL SERVER Else { Throw "Connection Type Not Supported." Return "Failed - Connection type not supported" } } Don t be intimidated by this code. It s really just a set of if conditions. There are a lot of parameters, but some of them are just to support later expansion of the function. This function supports two types of SQL calls: ADO.Net and ODBC. Once it determines that this call is ADO.Net, it checks the Boolean passed in, $p_inisprocedure, to determine if this is a stored procedure call. If this is true, then the function Invoke- UdfADOStoredProcedure is called. Since Invoke-UdfADOStoredProcedure returns a value, i.e., a hash table of the output parameter values, the RETURN statement makes the function call. We ve seen the other statements before, which have to do with non stored procedure calls. Now that we ve covered how all this works, let s look at the code in Listing 7-8 that uses the connection object to call the stored procedure. Listing 7-8. Using the umd_database module s connection object to call a stored procedure Import-Module umd_database [psobject] $myconnection = New-Object psobject New-UdfConnection([ref]$myconnection) $myconnection.connectiontype = 'ADO' $myconnection.databasetype = 'SqlServer' $myconnection.server = '(local)' $myconnection.databasename = 'AdventureWorks' $myconnection.usecredential = 'N' $myconnection.userid = 'bryan' $myconnection.password = 'password' $myconnection.setauthenticationtype('integrated') $myconnection.buildconnectionstring() $parmset # Create a collection object. # Add the parameters we need to use... $parmset += (Add-UdfParameter "@BusinessEntityID" "Input" "1" "int32" 0) $parmset += (Add-UdfParameter "@NationalIDNumber" "Input" " " "string" 15) $parmset += (Add-UdfParameter "@BirthDate" "Input" " " "date" 0) $parmset += (Add-UdfParameter "@MaritalStatus" "Input" "S" "string" 1) $parmset += (Add-UdfParameter "@Gender" "Input" "M" "string" 1) $parmset += (Add-UdfParameter "@JobTitle" "Output" "" "string" 50) $parmset += (Add-UdfParameter "@HireDate" "Output" "" "date" 0) $parmset += (Add-UdfParameter "@VacationHours" "Output" "" "int16" 0) $myconnection.runstoredprocedure('[humanresources].[uspupdateemployeepersonalinfops]', $parmset) $parmset # Lists the output parameters 201
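Because the module raises terminating errors with Throw when a connection string cannot be built or a connection type is not supported, a call like the one in Listing 7-8 can be wrapped in try/catch when it runs inside a larger script. A minimal sketch, reusing the objects from Listing 7-8:

try {
    $myconnection.runstoredprocedure('[HumanResources].[uspUpdateEmployeePersonalInfoPS]', $parmset)
}
catch {
    # Errors thrown by the module (and by ADO.Net itself) surface here as terminating errors.
    Write-Error "Stored procedure call failed: $($_.Exception.Message)"
}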

357 CHAPTER 7 ACCESSING SQL SERVER Listing 7-8 shows how we use our custom functions to call a stored procedure that returns Output parameters. First, we create a new PSObject variable, i.e., $myconnection. A reference to this object is passed to the function New-UdfConnection, which attaches our custom connection methods and properties. Then, a series of statements sets the properties of our connection object. We call the BuildConnectionString method to generate the connection string needed to access the database. Then, the statement $parmset creates an empty object collection named $parmset. Subsequent statements append custom parameter objects to the collection using Add-UdfParameter. Finally, the stored procedure is executed, passing the stored procedure name and the parameter collection. The results should be written to the console. The last line just displays the parameter set, i.e., $parmset. Summary In this chapter we explored executing queries using ADO.Net, which will handle most database platforms, including SQL Server, MS Access, Oracle, MySQL, and PostgreSQL. This chapter bypassed using the SQLPS module in favor of writing our own custom module that provides a generic and easy-to-use set of functions to run queries. To provide one point of interaction with these functions, the module provides a database connection object that can use ADO.Net or ODBC to execute any SQL statement. The first part of this chapter focused on running select statements and queries that yield no results, such as an update statement. We covered different authentication modes to provide the most secure connection possible. Then, we discussed calling stored procedures and how to support Output parameters using the SQL Command Parameters collection. We stepped through how the umd_database module integrates these features into one consistent interface. Beyond showing how to connect to databases and execute queries, the goal of this chapter was to demonstrate how to create reusable code to extend PowerShell and simplify your development work. 202

358 Expert Performance Indexing in SQL Server Second Edition Jason Strate Grant Fritchey

359 Expert Performance Indexing in SQL Server Copyright 2015 by Jason Strate and Grant Fritchey This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Rodney Landrum Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Kim Wimpsett Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

360 Contents at a Glance About the Authors...xvii About the Technical Reviewer...xix Introduction...xxi Chapter 1: Index Fundamentals... 1 Chapter 2: Index Storage Fundamentals Chapter 3: Index Metadata and Statistics Chapter 4: XML Indexes Chapter 5: Spatial Indexing Chapter 6: Full-Text Indexing Chapter 7: Indexing Memory-Optimized Tables Chapter 8: Indexing Myths and Best Practices Chapter 9: Index Maintenance Chapter 10: Indexing Tools Chapter 11: Indexing Strategies Chapter 12: Query Strategies Chapter 13: Monitoring Indexes Chapter 14: Index Analysis Chapter 15: Indexing Methodology Index v

361 CHAPTER 8 Indexing Myths and Best Practices

In the past few chapters, I've defined indexes and showed how they are structured. In the upcoming chapters, you'll be looking at strategies to build indexes and ensure that they behave as expected. In this chapter, I'll dispel some common myths and show how to build the foundation for creating indexes. Myths result in an unnecessary burden when attempting to build an index. Knowing the myths associated with indexes can prevent you from using indexing strategies that will be counterproductive. The following are the indexing myths discussed in this chapter:

Databases don't need indexes.
Primary keys are always clustered.
Online index operations don't block.
Any column can be filtered in multicolumn indexes.
Clustered indexes store records in physical order.
Indexes always output in the same order.
Fill factor is applied to indexes during inserts.
Deleting from heaps results in unrecoverable space.
Every table should be a heap or have a clustered index.

When reviewing myths, it's also a good idea to take a look at best practices. Best practices are like myths in many ways, in the sense that they are commonly held beliefs. The primary difference is that best practices stand up to scrutiny and are useful recommendations when building indexes. This chapter will examine the following best practices:

Use clustered indexes on primary keys by default.
Balance index count.
Properly target database-level fill factors.
Properly target index-level fill factors.
Index foreign key columns.
Index to your environment.

362 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES Index Myths One of the problems that people encounter when building databases and indexes is dealing with myths. Indexing myths originate from many different places. Some come from previous versions of SQL Server and its tools or are based on former functionality. Others come from the advice of others, based on conditions in a specific database that don t match those of other databases. The trouble with indexing myths is that they cloud the water of indexing strategies. In situations where an index can be built to resolve a serious performance issue, a myth can sometimes prevent the approach from being considered. Throughout the next few sections, I ll cover a number of myths regarding indexing and do my best to dispel them. Myth 1: Databases Don t Need Indexes Usually when developers are building applications, one or more databases are created to store data for the application. In many development processes, the focus is on adding new features with the mantra Performance will work itself out. An unfortunate result is that there are many databases that get developed and deployed without indexes being built because of the belief that they aren t needed. Along with this, there are developers who believe their databases are somehow unique from other databases. The following are some reasons that are heard from time to time: It s a small database that won t get much data. It s just a proof of concept and won t be around for long. It s not an important application, so performance isn t important. The whole database already fits into memory; indexes will just make it require more memory. I am going to use this database only for inserting data; I will never look at the results. Each of these reasons is easy to break down. In today s world of big data, even databases that are expected to be small can start growing quickly as they are adopted. Besides that, small in terms of a database is definitely in the eye of the beholder. Any proof-of-concept or unimportant database and application wouldn t have been created if there weren t a need or someone wasn t interested in expending resources for the features. Those same people likely expect that the features they asked for will perform as expected. Lastly, fitting a database into memory doesn t mean it will be fast. As was discussed in previous chapters, indexes provide an alternative access path to data, with the aim of decreasing the number of pages required to access the data. Without these alternative routes, data access will likely require reading every page of a table. These reasons may not be the ones you hear concerning your databases, but they will likely be similar. The general idea surrounding this myth is that indexes don t help the database perform better. One of the strongest ways to break apart this excuse is by demonstrating the benefits of indexing against a given scenario. To demonstrate, let s look at the code in Listing 8-1. This code sample creates the table MythOne. Next, you will find a query similar to one in almost any application. In the output from the query, in Listing 8-2, the query generated 1,496 reads. 154

363 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES Listing 8-1. Table with No Index SELECT * INTO MythOne FROM Sales.SalesOrderDetail; GO SET STATISTICS IO ON SET NOCOUNT ON GO SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, LineTotal FROM MythOne WHERE CarrierTrackingNumber = ' C-98'; GO SET STATISTICS IO OFF GO Listing 8-2. I/O Statistics for Table with No Index Table 'MythOne'. Scan count 1, logical reads 1496, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. It could be argued that 1,496 isn t a lot of input/output (I/O). This might be true given the size of some databases and the amount of data in today s world. But the I/O of a query shouldn t be compared to the performance of the rest of the world; it needs to be compared to its potential I/O, the needs of the application, and the platform on which it is deployed. Improving the query from the previous demonstration can be as simple as adding an index on the table on the CarrierTrackingNumber column. To see the effect of adding an index to MythOne, execute the code in Listing 8-3. With the index created, the reads for the query were reduced from 1,496 to 15 reads, shown in Listing 8-4. With just a single index, the I/O for the query was reduced by nearly two orders of magnitude. Suffice it to say, an index in this situation provides a significant amount of value. Listing 8-3. Adding an Index to MythOne CREATE INDEX IX_CarrierTrackingNumber ON MythOne (CarrierTrackingNumber) GO SET STATISTICS IO ON SET NOCOUNT ON GO SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount, LineTotal FROM MythOne WHERE CarrierTrackingNumber = ' C-98'; GO SET STATISTICS IO OFF GO 155

364 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES Listing 8-4. I/O Statistics for Table with an Index Table 'MythOne'. Scan count 1, logical reads 15 physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. I ve shown in these examples that indexes do provide a benefit. If you encounter a situation where there is angst for building indexes on a database, try to break down the real reason for the pushback and provide an example similar to the one presented in this section. In Chapter 11, I ll discuss approaches that can be used to determine what indexes to create in a database. Myth 2: Primary Keys Are Always Clustered The next myth that is quite prevalent is the idea that primary keys are always clustered. While this is true in many cases, you cannot assume that all primary keys are also clustered indexes. Earlier in this book, I discussed how a table can have only a single clustered index on it. If a primary key is created after the clustered index is built, then the primary key will be created as a nonclustered index. To illustrate the indexing behavior of primary keys, I ll use another demonstration that includes building two tables. On the first table, named dbo.mythtwo1, I ll build the table and then create a primary key on the RowID column. For the second table, named dbo.mythtwo2, after the table is created, the script will build a clustered index before creating the primary key. The code for this is in Listing 8-5. Listing 8-5. Two Ways to Create Primary Keys CREATE TABLE dbo.mythtwo1 ( RowID int NOT NULL,Column1 nvarchar(128),column2 nvarchar(128) ); ALTER TABLE dbo.mythtwo1 ADD CONSTRAINT PK_MythTwo1 PRIMARY KEY (RowID); GO CREATE TABLE dbo.mythtwo2 ( RowID int NOT NULL,Column1 nvarchar(128),column2 nvarchar(128) ); CREATE CLUSTERED INDEX CL_MythTwo2 ON dbo.mythtwo2 (RowID); ALTER TABLE dbo.mythtwo2 ADD CONSTRAINT PK_MythTwo2 PRIMARY KEY (RowID); GO 156

365 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES SELECT OBJECT_NAME(object_id) AS table_name,name,index_id,type,type_desc,is_unique,is_primary_key FROM sys.indexes WHERE object_id IN (OBJECT_ID('dbo.MythTwo1'),OBJECT_ID('dbo.MythTwo2')); After running the code segment, the final query will return results like those shown in Figure 8-1. This figure shows that PK_MythTwo1, which is the primary key on the first table, was created as a clustered index. Then on the second table, PK_MythTwo2 was created as a nonclustered index. Figure 8-1. Primary key sys.indexes output The behavior discussed in this section is important to remember when building primary keys and clustered indexes. If you have a situation where they need to be separated, the primary key will need to be defined after the clustered index. Myth 3: Online Index Operations Don t Block One of the advantages of SQL Server Enterprise Edition is the ability to build indexes online. During an online index build, the table on which the index is being created will still be available for queries and data modifications. This feature can be extremely useful when a database needs to be accessed and maintenance windows are short to nonexistent. A common myth with online index rebuilds is that they don t cause any blocking. Of course, like many myths, this one is false. When using an online index operation, there is an intent shared lock held on the table for the main portion of the build. At the finish, either a shared lock, for a nonclustered index, or a schema modification lock, for a clustered index, is held for a short time while the operation moves in the updated index. This differs from an offline index build where the shared or schema modification lock is held for the duration of the index build. Of course, you will want to see this in action; to accomplish this, you will create a table and use Extended Events to monitor the locks that are applied to the table while creating indexes with and without the ONLINE options. To start this demo, execute the code in Listing 8-6. This script creates the table dbo.myththree and populates it with ten million records. The last item it returns is the object_id for the table, which is needed for the subsequent parts of the demo. For this example, the object_id for dbo.myththree is Note The demos for this myth all require SQL Server Enterprise or Developer Edition. 157
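Because online index builds are an Enterprise-class feature, it can be worth confirming the edition of the instance before starting the demo. A quick check (the EngineEdition value 3 covers Enterprise, Developer, and Evaluation):

-- Confirm the edition before running the online index demos.
SELECT SERVERPROPERTY('Edition')       AS Edition,
       SERVERPROPERTY('EngineEdition') AS EngineEdition;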

366 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES Listing 8-6. MythThree Table Create Script USE AdventureWorks2014 GO CREATE TABLE dbo.myththree ( RowID int NOT NULL,Column1 uniqueidentifier ); WITH L1(z) AS (SELECT 0 UNION ALL SELECT 0), L2(z) AS (SELECT 0 FROM L1 a CROSS JOIN L1 b), L3(z) AS (SELECT 0 FROM L2 a CROSS JOIN L2 b), L4(z) AS (SELECT 0 FROM L3 a CROSS JOIN L3 b), L5(z) AS (SELECT 0 FROM L4 a CROSS JOIN L4 b), L6(z) AS (SELECT TOP FROM L5 a CROSS JOIN L5 b) INSERT INTO dbo.myththree SELECT ROW_NUMBER() OVER (ORDER BY z) AS RowID, NEWID() FROM L6; GO SELECT OBJECT_ID('dbo.MythThree') GO To monitor those events in this scenario, you ll use Extended Events to capture the lock_acquired and lock_released events fired during index creation. Open sessions in SSMS for the code in Listing 8-7 and Listing 8-8. Use the session_id from Listing 8-8 for the session_id in Listing 8-7 ; for this scenario, the session_id is 42. After the Extended Events session is running, you can use the live view to monitor the locks as they occur. Listing 8-7. Extended Events Session for Lock Acquired and Released IF EXISTS(SELECT * FROM sys.server_event_sessions WHERE name = 'MythThreeXevents') DROP EVENT SESSION [MythThreeXevents] ON SERVER GO CREATE EVENT SESSION [MythThreeXevents] ON SERVER ADD EVENT sqlserver.lock_acquired(set collect_database_name=(1) WHERE [sqlserver].[session_id]=(42) AND [object_id]=( )), ADD EVENT sqlserver.lock_released( WHERE [sqlserver].[session_id]=(42) AND [object_id]=( )) ADD TARGET package0.ring_buffer GO ALTER EVENT SESSION [MythThreeXevents] ON SERVER STATE = START GO In the example from Listing 8-8, creating the index with the ONLINE option causes the lock acquired and the released events shown in Figure 8-2. In the output, the SCH_S (Schema_Shared ) lock is held from the beginning of the build to the end. The S (Shared ) locks are held only for a few milliseconds at the beginning and ending of the index build. For the time between the S locks, the indexes are fully available and ready for use. 158

367 CHAPTER 8 INDEXING MYTHS AND BEST PRACTICES

Listing 8-8. Online Index Operations on Nonclustered Index Creation

USE AdventureWorks2014
GO

CREATE INDEX IX_MythThree_ONLINE ON MythThree (Column1) WITH (ONLINE = ON);
GO

CREATE INDEX IX_MythThree ON MythThree (Column1);
GO

Figure 8-2. Index create with ONLINE option

By default, only the name and timestamp appear in the live viewer. The live viewer allows for customizing the columns that are displayed. In Figure 8-2, the columns object_id, mode, resource_type, and sql_text have been added to the defaults of name and timestamp. To add additional columns, right-click a column header and select Choose Columns. With the default index creation, which does not use the ONLINE option, S locks are held for the entirety of the index build. Shown in Figure 8-3, the S lock is taken before the SCH_S lock and isn't released until after the index is built. The result is that the index is unavailable during the index build.

Figure 8-3. Index create without ONLINE option
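The same locking pattern can be observed when an existing index is rebuilt online, and the demo objects can be cleaned up once you are done. A short sketch, assuming the table, indexes, and Extended Events session from Listings 8-6 through 8-8 are still in place:

-- Rebuild one of the indexes online; the live view shows the same SCH-S plus brief S locks.
ALTER INDEX IX_MythThree_ONLINE ON dbo.MythThree REBUILD WITH (ONLINE = ON);
GO

-- Clean up the Extended Events session and the demo table.
ALTER EVENT SESSION [MythThreeXevents] ON SERVER STATE = STOP;
DROP EVENT SESSION [MythThreeXevents] ON SERVER;
DROP TABLE dbo.MythThree;
GO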

Myth 4: Any Column Can Be Filtered in Multicolumn Indexes

The next common myth with indexes is that, regardless of the position of a column in an index, the index can be used to filter on that column. As with the other myths discussed so far in this chapter, this one is also incorrect. A query does not need to use all the columns in an index; it does, however, need to start with the leftmost column of the index and use the columns from left to right, in order. This is why the order of the columns in an index is so important.

To demonstrate this myth, I'll run through a few examples, shown in Listing 8-9. In the script, a table is created based on Sales.SalesOrderHeader with a primary key on SalesOrderID. To test the myth of searching all columns through multicolumn indexes, an index with the columns OrderDate, DueDate, and ShipDate is created.

Listing 8-9. Multicolumn Index Myth

USE AdventureWorks2014
GO
IF OBJECT_ID('dbo.MythFour') IS NOT NULL
DROP TABLE dbo.MythFour
GO
SELECT SalesOrderID, OrderDate, DueDate, ShipDate
INTO dbo.MythFour
FROM Sales.SalesOrderHeader;
GO
ALTER TABLE dbo.MythFour
ADD CONSTRAINT PK_MythFour PRIMARY KEY CLUSTERED (SalesOrderID);
GO
CREATE NONCLUSTERED INDEX IX_MythFour ON dbo.MythFour (OrderDate, DueDate, ShipDate);
GO

With the test objects in place, the next thing to check is the behavior of queries against the table that could potentially use the nonclustered index. First, I'll run a query that uses the leftmost column in the index; Listing 8-10 gives the code for this. As shown in Figure 8-4, by filtering on the leftmost column, the query uses a seek operation on IX_MythFour.

Listing 8-10. Query Using Leftmost Column in Index

USE AdventureWorks2014
GO
SELECT OrderDate FROM dbo.MythFour
WHERE OrderDate = ' :00:00.000'

Figure 8-4. Execution plan for leftmost column in index

Next, you'll look at what happens when querying from the other side of the index key columns. In Listing 8-11, the query filters the results on the rightmost column of the index. The execution plan for this query, shown in Figure 8-5, uses a scan operation on IX_MythFour. Instead of being able to go directly to the records that match the ShipDate, the query needs to check all records to determine which match the filter. While the index is used, it isn't able to actually filter the rows.

Listing 8-11. Query Using Rightmost Column in Index

USE AdventureWorks2014
GO
SELECT ShipDate FROM dbo.MythFour
WHERE ShipDate = ' :00:00.000'

Figure 8-5. Execution plan for rightmost column in index

At this point, you've seen that the leftmost column can be used for filtering and that filtering on the rightmost column can use the index, but not optimally with a seek operation. The last validation is to check the behavior of a column that is on neither the left nor the right side of the index. Listing 8-12 includes a query that filters on the middle column of the index IX_MythFour. The execution plan for the middle-column query, shown in Figure 8-6, again uses the index, but with a scan operation. The query is able to use the index, but not in an optimal fashion.

Listing 8-12. Query Using Middle Column in Index

USE AdventureWorks2014
GO
SELECT DueDate FROM dbo.MythFour
WHERE DueDate = ' :00:00.000'
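For completeness: if filters on ShipDate or DueDate matter to the workload, the remedy is an index that leads with the filtered column. A hedged sketch for the ShipDate case (the index name is an assumption, not from the chapter):

CREATE NONCLUSTERED INDEX IX_MythFour_ShipDate
ON dbo.MythFour (ShipDate);

With ShipDate as the leading key, the same query can seek instead of scanning IX_MythFour.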

Figure 8-6. Execution plan for middle column in index

The myth of how columns in a multicolumn index can be used is one that can sometimes be confusing. As the examples showed, queries can use the index regardless of which columns of the index are being filtered. The key is to use the index effectively, and to accomplish that, filtering must start at the leftmost column of the index.

Myth 5: Clustered Indexes Store Records in Physical Order

One of the more pervasive myths commonly held is the idea that a clustered index stores the records in a table in their physical order on disk. This myth seems to be driven primarily by confusion between what is stored on a page and where records are stored on those pages. As discussed in Chapter 2, there is a difference between data pages and records. As a refresher, you'll see a simple demonstration that dispels this myth.

To begin this example, execute the code in Listing 8-13. The code creates a table named dbo.MythFive and then adds three records to it. The last part of the script outputs, using sys.dm_db_database_page_allocations, the page location for the table. In this example, the page with the records inserted into dbo.MythFive is page 24189, shown in Figure 8-7.

Note The dynamic management function sys.dm_db_database_page_allocations is a replacement for DBCC IND. This function, introduced in SQL Server 2012, provides an improved interface for examining page allocations for objects in a database over its DBCC predecessor.

Listing 8-13. Create and Populate MythFive Table

USE AdventureWorks2014
GO
IF OBJECT_ID('dbo.MythFive') IS NOT NULL
DROP TABLE dbo.MythFive
CREATE TABLE dbo.MythFive
(
RowID int PRIMARY KEY CLUSTERED,
TestValue varchar(20) NOT NULL
);
GO

INSERT INTO dbo.MythFive (RowID, TestValue) VALUES (1, 'FirstRecordAdded');
INSERT INTO dbo.MythFive (RowID, TestValue) VALUES (3, 'SecondRecordAdded');
INSERT INTO dbo.MythFive (RowID, TestValue) VALUES (2, 'ThirdRecordAdded');
GO
SELECT database_id, object_id, index_id, extent_page_id, allocated_page_page_id, page_type_desc
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('dbo.MythFive'), 1, NULL, 'DETAILED')
GO

Figure 8-7. sys.dm_db_database_page_allocations output

The evidence to dispel this myth can be uncovered with the DBCC PAGE command. To do this, use the page ID identified in the output of Listing 8-13 for the row with a page_type_desc of DATA_PAGE. Since there is only a single data page for this table, that is where the data is located. (For more information on DBCC commands, see Chapter 2.) For this example, Listing 8-14 shows the T-SQL required to look at the data in the table. The command outputs a lot of information, including some header information that isn't useful in this example. The portion you need is at the end, in the memory dump of the page, as shown in Figure 8-8. In the memory dump, the records appear in the order in which they were placed on the page. As the far-right column of the dump shows, the records are in the order in which they were added to the table, not the order in which they appear in the clustered index.

Listing 8-14. Examining the MythFive Data Page with DBCC PAGE

DBCC TRACEON (3604);
GO
DBCC PAGE (AdventureWorks2014, 1, 24189, 2);
GO

Figure 8-8. Page contents portion of DBCC PAGE output

Based on this evidence, it is easy to discern that clustered indexes do not store records in the physical order of the index. If this example were expanded, you would see that the pages are in physical order, but the rows on the pages are not.

Myth 6: Indexes Always Output in the Same Order

One of the more common myths pertaining to indexes is that they guarantee the output order of results from queries. This is not correct. As previously described in this book, the purpose of indexes is to provide an efficient access path to the data. That purpose does not guarantee the order in which the data will be accessed. The trouble with this myth is that, oftentimes, SQL Server appears to maintain order when queries are executed under certain conditions; but when those conditions change, the execution plans change, and the results are returned in the order in which the data is processed rather than the order the end user might desire.

To explore this myth, you'll first look at how conditions can change on a query that uses a clustered index. Listing 8-15 contains a single query, repeated twice, against the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables that performs a simple aggregation. This is something that might appear in many types of use cases for SQL Server.

Listing 8-15. Unordered Results with Clustered Index

USE AdventureWorks2014
GO
SELECT soh.SalesOrderID, COUNT(*) AS DetailRows
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID
GROUP BY soh.SalesOrderID;
GO
DBCC FREEPROCCACHE
DBCC SETCPUWEIGHT(1000)
GO
SELECT soh.SalesOrderID, COUNT(*) AS DetailRows
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID
GROUP BY soh.SalesOrderID;
GO
DBCC FREEPROCCACHE
DBCC SETCPUWEIGHT(1)
GO

The conditions under which the two queries execute vary a bit. The first query runs under the standard SQL Server cost model and generates an execution plan that performs a couple of index scans and a stream aggregate to return the results, shown in Figure 8-9. The results of the query, provided in Figure 8-10, appear to support the idea that SQL Server will return data in the desired order, provided that SalesOrderID is the column the user wants sorted.

Figure 8-9. Default aggregation execution plan

Figure 8-10. Results from default aggregation execution plan

But what happens if the conditions on the SQL Server change while the business rules do not? The second query executed in Listing 8-15 is the same query, but with a change in conditions. For this example, the DBCC command SETCPUWEIGHT is leveraged to change the costing of the execution plan. The change in cost results in a parallel execution plan being created and executed, shown in Figure 8-11. The effect of the new plan is a change in the results of the query, provided in Figure 8-12. The logic of the query hasn't changed, and the results still appear to be ordered, yet the first value in the two result sets differs; the rows missing from the start of the second result set appear further down in its output. The danger here is that the results look sorted when a validation of them proves that they are not.

Figure 8-11. Aggregation execution plan with parallelism

Figure 8-12. Results from the aggregation execution plan with parallelism

Warning Do not use DBCC SETCPUWEIGHT in production code to control parallelism or for any other reason. This DBCC command exists strictly to manipulate environmental variables within SQL Server in order to test and validate execution plans.

The other condition to consider is when the business rules for a query change. For instance, maybe a set of results wasn't originally filtered, but after a change to the application, the query starts using a different set of indexes. This can result in a change in the order of the results, such as when a query switches from using a clustered index to a nonclustered index. To demonstrate this change in behavior, execute the code in Listing 8-16. This code runs two queries, both of which return SalesOrderID, CustomerID, and Status. For the purposes of the example, the business rule dictates that the results must be sorted by SalesOrderID. The results of the first query are sorted as the business rule states, shown at the top of Figure 8-13. But in the second query, when the logic changes to request fewer rows by adding a filter, the results are no longer ordered, shown at the bottom of Figure 8-13. The cause of the change is a change in the indexes that SQL Server uses to execute the query. The change in indexes drives the results to be processed, and ordered, in the manner in which those indexes sort the data.

Listing 8-16. Unordered Results with Nonclustered Index

USE AdventureWorks2014
GO
SELECT SalesOrderID, CustomerID, Status
FROM Sales.SalesOrderHeader soh
GO
SELECT SalesOrderID, CustomerID, Status
FROM Sales.SalesOrderHeader soh
WHERE CustomerID IN (11020, 11021, 11022)
GO
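As the chapter goes on to stress, the dependable fix is an explicit ORDER BY rather than reliance on index order; a minimal sketch applied to the second query above:

SELECT SalesOrderID, CustomerID, Status
FROM Sales.SalesOrderHeader soh
WHERE CustomerID IN (11020, 11021, 11022)
ORDER BY SalesOrderID;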

Figure 8-13. Query results demonstrating the effect of filtering on order

In these examples, you looked at just a couple of the conditions that can change how SQL Server streams the results of a query. While an index might provide results in the desired order this time, there is no guarantee that this will not change. Don't rely on indexes to enforce ordering, and don't rely on being clever to get the results ordered as desired. Rely on ORDER BY clauses to get the results ordered as needed.

Myth 7: Fill Factor Is Applied to Indexes During Inserts

When a fill factor is set on an index, it is applied when the index is built, rebuilt, or reorganized. Unfortunately, many people believe that fill factor is applied when records are inserted into a table. In this section, you'll investigate this myth and see that it is not correct.

To begin pulling this myth apart, let's look at what most people believe: that if a fill factor has been specified, it is used when rows are added to the table. To dispel this portion of the myth, execute the code in Listing 8-17. In this script, the table dbo.MythSeven is created with a clustered index that has a 50 percent fill factor, meaning that 50 percent of every page in the index should be left empty. With the table built, you'll insert records into the table. Finally, you'll check the average amount of space used on each page through the sys.dm_db_index_physical_stats dynamic management function. Looking at the results of the script, shown in Figure 8-14, the index is using 95 percent of every page versus the 50 percent specified in the creation of the clustered index.

Listing 8-17. Create and Populate MythSeven Table

USE AdventureWorks2014
GO
IF OBJECT_ID('dbo.MythSeven') IS NOT NULL
DROP TABLE dbo.MythSeven;
GO
CREATE TABLE dbo.MythSeven
(
RowID int NOT NULL,
Column1 varchar(500)
);
GO
ALTER TABLE dbo.MythSeven ADD CONSTRAINT PK_MythSeven PRIMARY KEY CLUSTERED (RowID) WITH(FILLFACTOR = 50);
GO
WITH L1(z) AS (SELECT 0 UNION ALL SELECT 0),
L2(z) AS (SELECT 0 FROM L1 a CROSS JOIN L1 b),
L3(z) AS (SELECT 0 FROM L2 a CROSS JOIN L2 b),
L4(z) AS (SELECT 0 FROM L3 a CROSS JOIN L3 b),
L5(z) AS (SELECT 0 FROM L4 a CROSS JOIN L4 b),
L6(z) AS (SELECT TOP FROM L5 a CROSS JOIN L5 b)
INSERT INTO dbo.MythSeven
SELECT ROW_NUMBER() OVER (ORDER BY z) AS RowID, REPLICATE('X', 500)
FROM L6;
GO
SELECT object_id, index_id, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MythSeven'), NULL, NULL, 'DETAILED')
WHERE index_level = 0;

Figure 8-14. Fill factor myth on inserts

Sometimes when this myth is dispelled, the belief is reversed, and people conclude that fill factor is broken or doesn't work. This is also incorrect. Fill factor isn't applied to indexes during data modifications; as stated previously, it is applied when the index is created, rebuilt, or reorganized. To demonstrate this, rebuild the clustered index on dbo.MythSeven with the script in Listing 8-18.

Listing 8-18. Rebuild Clustered Index on MythSeven Table

USE AdventureWorks2014
GO
ALTER INDEX PK_MythSeven ON dbo.MythSeven REBUILD
SELECT object_id, index_id, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MythSeven'), NULL, NULL, 'DETAILED')
WHERE index_level = 0

After the clustered index is rebuilt, the index has the specified fill factor, or a value close to it, as shown in Figure 8-15. The average space used per page changed from 95 to 51 percent after the rebuild, which is in line with the fill factor specified for the index.

Figure 8-15. Fill factor myth after index rebuild

When it comes to fill factor, a number of myths surround this index property. The key to understanding fill factor is to remember when and how it is applied. It isn't a property enforced on an index as the index is used; it is, instead, a property used to distribute data within an index when the index is created or rebuilt.

Myth 8: Deleting From Heaps Results in Unrecoverable Space

Heaps are an interesting structure in SQL Server. In Chapter 2, you examined how they aren't really an index, just a collection of pages for storing data. One of the index maintenance tasks covered in the next chapter is recovering space from heap tables. As will be discussed more deeply in that chapter, when rows are deleted from a heap, the pages associated with those rows are not removed from the heap. This is generally referred to as bloat within the heap.

An interesting side effect of the concept of heap bloat is the myth that the bloat never gets reused: the space stays in the heap and is not recoverable until the heap is rebuilt. Fortunately for heaps and database administrators, this isn't the case. When data is removed from a heap, the space that the data previously occupied is made available for future inserts into the table.

To demonstrate how this works, build a table using the code in Listing 8-19. The demonstration creates a heap named MythEight and then inserts 400 records, which results in 100 pages of data. This page count can be validated with the page_count column in the first result set in Figure 8-16. The next part of the script deletes every other row that was inserted into the heap. Generally, this should leave each page with half as many rows as it had previously, as shown in the second result set in Figure 8-16. The last part of the script re-inserts 200 rows into the MythEight table, returning the row count to 400 records and reusing the previously used pages that had data removed from them. There is a slight growth in the page count in the last result set in Figure 8-16, but most of the new rows fit into the space already allocated.

Listing 8-19. Reusing Space in the MythEight Heap

USE AdventureWorks2014
GO
IF OBJECT_ID('dbo.MythEight') IS NOT NULL
DROP TABLE dbo.MythEight;
CREATE TABLE dbo.MythEight
(
RowId INT IDENTITY(1,1),
FillerData VARCHAR(2500)
);
INSERT INTO dbo.MythEight (FillerData)
SELECT TOP 400 REPLICATE('X',2000) FROM sys.objects;
SELECT OBJECT_NAME(object_id), index_type_desc, page_count, record_count, forwarded_record_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.MythEight'), NULL, NULL, 'DETAILED');
DELETE FROM dbo.MythEight
WHERE RowId % 2 = 0;
SELECT OBJECT_NAME(object_id), index_type_desc, page_count, record_count, forwarded_record_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.MythEight'), NULL, NULL, 'DETAILED');
INSERT INTO dbo.MythEight (FillerData)
SELECT TOP 200 REPLICATE('X',2000) FROM sys.objects;
SELECT OBJECT_NAME(object_id), index_type_desc, page_count, record_count, forwarded_record_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.MythEight'), NULL, NULL, 'DETAILED');

Figure 8-16. Heap reuse query results
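Looking ahead to the point made next: when deleted space will not be refilled, the heap can be compacted with a table rebuild. A minimal sketch (the statement itself is covered fully in the next chapter):

ALTER TABLE dbo.MythEight REBUILD;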

As the demonstration for this myth shows, space in a heap that previously held data is released for reuse by the table. For heaps that have a lot of data coming in and out of the table, there isn't a significant need to monitor for page reuse, and the myth can be considered inaccurate. For heaps that have a lot of data removed with no intention of replacing it, you can recover the space with ALTER TABLE REBUILD; the syntax and impact of this statement are discussed in the next chapter.

Myth 9: Every Table Should Have a Heap/Clustered Index

The last myth to consider is twofold. On the one hand, some people recommend that you build all of your tables on heaps. On the other hand, others recommend that you create clustered indexes on all of your tables. The trouble is that either viewpoint excludes considering the benefits that each structure can offer. It makes a doctrinaire argument for or against a way to store data in your databases without any consideration of the actual data being stored and how it is used.

Some of the arguments against the use of clustered indexes are as follows:
Fragmentation negatively impacts performance through additional I/O.
The modification of a single record can impact multiple records in the clustered index when a page split is triggered.
Excessive key lookups will negatively impact performance through additional I/O.

Of course, there are also arguments against using heaps:
Excessive forwarded records negatively impact performance through additional I/O.
Removing forwarded records requires a rebuild of the entire table.
Nonclustered indexes are required for efficient filtered data access.
Heaps don't release pages when data is removed.

The negative impacts associated with clustered indexes or heaps aren't the only things to consider when deciding between one or the other; each has circumstances where it outperforms the other. For instance, clustered indexes perform best in the following circumstances:
The key on the table is a unique, ever-increasing key value.
The table has a key column with a high degree of uniqueness.
Ranges of data in the table will be accessed via queries.
Records in the table will be inserted and deleted at a high rate.

On the other hand, heaps are ideal in some of the following situations:
Data in the table will be used only for a limited amount of time, where index creation time exceeds query time on the data.
Key values will change frequently, which in turn would change the position of the record in an index.
You are inserting copious numbers of records into a staging table.
The primary key is a nonascending value, such as a unique identifier.

Although this section doesn't include a demonstration of why this myth is false, it is important to remember that both heaps and clustered indexes are available and should be used where appropriate. Knowing which type of structure to choose is a matter of testing, not a matter of doctrine. A good resource for those in the cluster-everything camp is the Fast Track Data Warehouse Architecture white paper. The white paper addresses some significant performance improvements that can be found with heaps, as well as the point at which those improvements dissipate. It shows how changes in I/O technologies, such as flash and cache-based devices, can change patterns and practices with regard to heaps and clustered indexes, which helps promote the idea of revalidating myths and best practices from time to time.

Index Best Practices

Similar to myths are the indexing best practices. A best practice should be considered the default recommendation to apply when there isn't enough information available to justify proceeding in another direction. Best practices are not the only option; they are just a place to start from when working with any technology. When using a best practice provided by someone else, such as those appearing in this chapter, check it out for yourself first and take it with a grain of salt. You can trust that best practices will steer you in the correct direction, but you need to verify that it is appropriate to follow the practice in your situation.

Given these precautions, there are a number of best practices to consider when working with indexes. This section reviews these best practices and discusses what they are and what they mean.

Use Clustered Indexes on Primary Keys by Default

The first best practice is to use clustered indexes on primary keys by default. This may seem to run contrary to the ninth myth presented in this chapter. Myth 9 discussed choosing between clustered indexes and heaps as a matter of doctrine: whichever structure a database was built around, the myth would have you believe that a table design that doesn't match the doctrine should be changed, regardless of the situation. This best practice instead recommends clustered indexes on primary keys as a starting point. By clustering the primary key of a table by default, there is an increased likelihood that the indexing choice will be appropriate for the table. As stated earlier in this chapter, clustered indexes control how the data in a table is stored. Many primary keys, possibly most, are built on a column that uses the identity property, which increments as each new record is added to the table. Choosing a clustered index for such a primary key provides the most efficient method of accessing the data.

Balance Index Count

As previously discussed in this book, indexes are extremely useful for improving performance when accessing information in a record. Unfortunately, indexes are not without costs, and those costs go beyond just space within your database. When you build an index, you need to consider some of the following:
How frequently will records be inserted or deleted?
How frequently will the key columns be updated?
How often will the index be used?
What processes does the index support?
How many other indexes are on the table?

These are just some of the first considerations that need to be accounted for when building indexes. After the index is built, how much time will be spent updating and maintaining it? Will you modify the index more frequently than it is used to return results for queries? The trouble with balancing the index count on a table is that there is no precise number that can be recommended; deciding how many indexes make sense is a per-table decision. You don't want too few, which may result in excessive scans of the clustered index or heap to return results, but the table also shouldn't have so many indexes that more time is spent keeping them current than returning results. As a rule of thumb, if a table in a transactional system has more than ten indexes, it is increasingly likely that it has too many.

Specify Fill Factors

Fill factor controls the amount of free space left on the data pages of an index after the index is built or defragmented. This free space allows records on the page to expand without, or with less, risk that the change in record size will cause a page split. This makes fill factor an extremely useful property for index maintenance, because modifying it can mitigate the risk of fragmentation. A more thorough discussion of fill factor is presented in Chapter 6. For the purposes of best practices, you are concerned with the ability to set the fill factor at the database and index levels.

Database Level Fill Factor

As already mentioned, one of the properties of SQL Server is the option to set a default fill factor for indexes. This is a SQL Server-wide setting and can be altered in the properties of SQL Server on the Database Settings page. By default, this value is set to zero, which equates to 100. Do not modify the default fill factor to anything other than 0, or 100, which has the same effect. Doing so changes the fill factor for every index on the instance to the new value; the specified amount of free space is added to all indexes the next time they are created, rebuilt, or reorganized. On the surface this may seem like a good idea, but it blindly increases the size of all indexes by the specified amount, and the increased size requires more I/O to perform the same work as before the change. For many indexes, making this change results in a needless waste of resources.

Index Level Fill Factor

At the index level, you should modify the fill factor for indexes that frequently become heavily fragmented. Decreasing the fill factor increases the amount of free space in the index and provides additional room to absorb the changes in record length that lead to fragmentation. Managing fill factor at the index level is appropriate because it lets you tune each index precisely to the needs of the database.

Index Foreign Key Columns

When a foreign key is created on a table, the foreign key column in the referencing table should be indexed. The index helps SQL Server determine which rows in the referencing table are constrained by each row in the referenced table. This matters when changes are made to the referenced table, because those changes may require checking every row in the referencing table that matches the affected record. If no index exists on the foreign key column, that check becomes a scan of the column. On a large referencing table, this can result in a significant amount of I/O and potentially some concurrency issues.
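As a hedged sketch (the table and column names are illustrative and anticipate the state/address example discussed next, not objects created by the chapter), indexing the foreign key column looks like this:

-- Address is the referencing table; StateID is the foreign key column pointing at the State table.
CREATE NONCLUSTERED INDEX IX_Address_StateID
ON dbo.Address (StateID);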

An example of this issue is a state table and an address table. There would likely be thousands or millions of records in the address table and maybe a hundred records in the state table, with the address table including a column that references the state table. Consider what happens if one of the records in the state table needs to be deleted. If there were no index on the foreign key column in the address table, how would SQL Server identify the rows in the address table affected by deleting the state record? Without an index, SQL Server would have to check every record in the address table. If the column is indexed, SQL Server can instead perform a range scan across the records that match the value being deleted from the state table. By indexing your foreign key columns, you avoid performance issues such as the one described in this section. The best practice with foreign keys is to index their columns. Chapter 11 includes more details on this best practice and a code example.

Index to Your Environment

The indexing that exists today will likely not be the indexing your databases need in the future. For this reason, the last best practice is to continuously review, analyze, and implement changes to the indexes in your environment. Realize that regardless of how similar two databases are, if the data in them is not the same, then the indexing for the two databases may also differ. For an expanded conversation on monitoring and analyzing indexes, see Chapters 13 and 14.

Summary

This chapter looked at some myths surrounding indexes as well as some best practices. For both areas, you investigated commonly held beliefs and examined the details behind them. With the myths, you looked at a number of ideas that are generally believed about indexes but that are in fact not true. The myths covered clustered indexes, fill factor, the column makeup of indexes, and more. The key to evaluating anything believed about indexes that may be a myth is to take it upon yourself to test it. You also looked at best practices, which should form the basis on which indexes for your databases are built. I defined what a best practice is and what it is not, and then discussed a number of best practices to consider when indexing your databases.

Peter A Carter Pro SQL Server Administration

384 Pro SQL Server Administration Copyright 2015 by Peter A Carter This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewers: Alex Grinberg and Louis Davidson Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss Coordinating Editor: Jill Balzano Copy Editor: Rebecca Rider Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

Contents at a Glance

About the Author
About the Technical Reviewers
Part I: Installing and Configuring SQL Server
Chapter 1: Planning the Deployment
Chapter 2: GUI Installation
Chapter 3: Server Core Installation
Chapter 4: Configuring the Instance
Part II: Database Administration
Chapter 5: Files and Filegroups
Chapter 6: Configuring Tables
Chapter 7: Indexes and Statistics
Chapter 8: Database Consistency
Part III: Security, Resilience, and Scaling
Chapter 9: SQL Server Security Model
Chapter 10: Encryption
Chapter 11: High Availability and Disaster Recovery Concepts
Chapter 12: Implementing Clustering
Chapter 13: Implementing AlwaysOn Availability Groups

Chapter 14: Implementing Log Shipping
Chapter 15: Backups and Restores
Chapter 16: Scaling Workloads
Part IV: Monitoring and Maintenance
Chapter 17: SQL Server Metadata
Chapter 18: Locking and Blocking
Chapter 19: Extended Events
Chapter 20: Distributed Replay
Chapter 21: Automating Maintenance Routines
Chapter 22: Policy-Based Management
Chapter 23: Resource Governor
Chapter 24: Triggers
Part V: Managing a Hybrid Cloud Environment
Chapter 25: Cloud Backups and Restores
Chapter 26: SQL Data Files in Windows Azure
Chapter 27: Migrating to the Cloud
Index

CHAPTER 15 Backups and Restores

Backing up a database is one of the most important tasks that a DBA can perform. Therefore, after discussing the principles of backups, we look at some of the backup strategies that you can implement for SQL Server databases. We then discuss how to perform the backup of a database before we finally look in depth at restoring it, including restoring to a point in time, restoring individual files and pages, and performing piecemeal restores.

Backup Fundamentals

Depending on the recovery model you are using, you can take three types of backup within SQL Server: full, differential, and log. We discuss the recovery models, in addition to each of the backup types, in the following sections.

Recovery Models

As discussed in Chapter 5, you can configure a database in one of three recovery models: SIMPLE, FULL, and BULK LOGGED. These models are discussed in the following sections.

SIMPLE Recovery Model

When a database is configured in the SIMPLE recovery model, the transaction log (or, to be more specific, the VLFs [virtual log files] within the transaction log that contain transactions that are no longer required) is truncated after each checkpoint operation. This means that you usually do not have to administer the transaction log. However, it also means that you can't take transaction log backups. The SIMPLE recovery model can increase performance for some operations, because transactions are minimally logged. Operations that can benefit from minimal logging are as follows:
Bulk imports
SELECT INTO
UPDATE statements against large data types that use the .WRITE clause
WRITETEXT
UPDATETEXT
Index creation
Index rebuilds

The main disadvantage of the SIMPLE recovery model is that it is not possible to recover to a specific point in time; you can only restore to the end of a full backup. This disadvantage is amplified by the fact that full backups can have a performance impact, so you are unlikely to be able to take them as frequently as you would take a transaction log backup without affecting users. Another disadvantage is that the SIMPLE recovery model is incompatible with some SQL Server HA/DR features, namely:
AlwaysOn Availability Groups
Database mirroring
Log shipping

Therefore, in production environments, the most appropriate use of the SIMPLE recovery model is for large data warehouse style applications with a nightly ETL load followed by read-only reporting for the rest of the day. This is because the model provides the benefit of minimally logged transactions while not harming recoverability, since you can take a full backup after the nightly ETL run.

FULL Recovery Model

When a database is configured in the FULL recovery model, log truncation does not occur after a CHECKPOINT operation. Instead, it occurs after a transaction log backup, as long as a CHECKPOINT operation has occurred since the previous transaction log backup. This means that you must schedule transaction log backups to run on a frequent basis. Failing to do so not only leaves your database at risk of being unrecoverable in the event of a failure, but it also means that your transaction log continues to grow until it runs out of space and a 9002 error is thrown.

When a database is in the FULL recovery model, many factors can cause the VLFs within a transaction log not to be truncated. This is known as delayed truncation. You can find the most recent reason for delayed truncation in the log_reuse_wait_desc column of sys.databases; a full list of reasons for delayed truncation appears in Chapter 5.

The main advantage of the FULL recovery model is that point-in-time recovery is possible, which means that you can restore your database to a point in the middle of a transaction log backup, as opposed to only being able to restore it to the end of a backup. Point-in-time recovery is discussed in detail later in this chapter. Additionally, the FULL recovery model is compatible with all SQL Server functionality. It is usually the best choice of recovery model for production databases.

Tip If you switch from the SIMPLE recovery model to the FULL recovery model, you are not actually in the FULL recovery model until after you take a transaction log backup. Therefore, make sure to back up your transaction log immediately.

BULK LOGGED Recovery Model

The BULK LOGGED recovery model is designed to be used on a short-term basis while a bulk import operation takes place. The idea is that your normal mode of operations is the FULL recovery model; you temporarily switch to the BULK LOGGED recovery model just before a bulk import takes place and switch back to the FULL recovery model when the import completes. This may give you a performance benefit and also stops the transaction log from filling up, since bulk import operations are minimally logged. Immediately before you switch to the BULK LOGGED recovery model, and immediately after you switch back to the FULL recovery model, it is good practice to take a transaction log backup. This is because you cannot use any transaction log backup that contains minimally logged transactions for point-in-time recovery.
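A minimal sketch of that sequence, using the Chapter15 database created later in Listing 15-1 (the backup paths are assumptions):

BACKUP LOG Chapter15 TO DISK = 'C:\MSSQL\Backup\Chapter15_PreBulk.trn';
ALTER DATABASE Chapter15 SET RECOVERY BULK_LOGGED;
-- Perform the bulk import here (for example, a BULK INSERT or SELECT INTO).
ALTER DATABASE Chapter15 SET RECOVERY FULL;
BACKUP LOG Chapter15 TO DISK = 'C:\MSSQL\Backup\Chapter15_PostBulk.trn';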

For the same reason, it is also good practice to safe-state your application before you switch to the BULK LOGGED recovery model. You normally achieve this by disabling any logins except for the login that performs the bulk import and the logins of administrators, to ensure that no other data modifications take place. You should also ensure that the data you are importing is recoverable by a means other than a restore. Following these rules mitigates the risk of data loss in the event of a disaster.

Although minimally logged inserts keep the transaction log small and reduce the amount of IO to the log during the bulk import, the transaction log backup is more expensive, in terms of IO, than it is in the FULL recovery model. This is because when you back up a transaction log that contains minimally logged transactions, SQL Server also backs up any data extents that contain pages altered by minimally logged transactions. SQL Server keeps track of these pages by using bitmap pages called ML (minimally logged) pages. ML pages occur once in every 64,000 extents and use a flag to indicate whether each extent in the corresponding block of extents contains minimally logged pages.

Caution BULK LOGGED recovery model may not be faster than FULL recovery model for bulk imports unless you have a very fast IO subsystem. This is because the BULK LOGGED recovery model forces data pages updated by minimally logged operations to be flushed to disk as soon as the operation completes, instead of waiting for a checkpoint operation.

Changing the Recovery Model

Before we show you how to change the recovery model of a database, let's first create the Chapter15 database, which we use for demonstrations in this chapter. You can create this database using the script in Listing 15-1.

Listing 15-1. Creating the Chapter15 Database

CREATE DATABASE Chapter15
ON PRIMARY
( NAME = 'Chapter15', FILENAME = 'C:\MSSQL\DATA\Chapter15.mdf'),
FILEGROUP FileGroupA
( NAME = 'Chapter15FileA', FILENAME = 'C:\MSSQL\DATA\Chapter15FileA.ndf' ),
FILEGROUP FileGroupB
( NAME = 'Chapter15FileB', FILENAME = 'C:\MSSQL\DATA\Chapter15FileB.ndf' )
LOG ON
( NAME = 'Chapter15_log', FILENAME = 'C:\MSSQL\DATA\Chapter15_log.ldf' ) ;
GO
ALTER DATABASE [Chapter15] SET RECOVERY FULL ;
GO
USE Chapter15
GO
CREATE TABLE dbo.Contacts
(
ContactID INT NOT NULL IDENTITY PRIMARY KEY,
FirstName NVARCHAR(30),
LastName NVARCHAR(30),
AddressID INT
) ON FileGroupA ;

CREATE TABLE dbo.Addresses
(
AddressID INT NOT NULL IDENTITY PRIMARY KEY,
AddressLine1 NVARCHAR(50),
AddressLine2 NVARCHAR(50),
AddressLine3 NVARCHAR(50),
PostCode NCHAR(8)
) ON FileGroupB ;

You can change the recovery model of a database from SQL Server Management Studio (SSMS) by selecting Properties from the context menu of the database and navigating to the Options page, as illustrated in Figure 15-1. You can then select the appropriate recovery model from the Recovery Model drop-down list.

Figure 15-1. The Options tab

We can also use the script in Listing 15-2 to switch our Chapter15 database from the FULL recovery model to the SIMPLE recovery model and then back again.

Listing 15-2. Switching Recovery Models

ALTER DATABASE Chapter15 SET RECOVERY SIMPLE ;
GO
ALTER DATABASE Chapter15 SET RECOVERY FULL ;
GO

Tip After changing the recovery model, refresh the database in Object Explorer to ensure that the correct recovery model displays.

Backup Types

You can take three types of backup in SQL Server: full, differential, and log. We discuss these backup types in the following sections.

Full Backup

You can take a full backup in any recovery model. When you issue a backup command, SQL Server first issues a CHECKPOINT, which causes any dirty pages to be written to disk. It then backs up every page within the database (this is known as the data read phase) before it finally backs up enough of the transaction log (this is known as the log read phase) to be able to guarantee transactional consistency. This ensures that you are able to restore your database to the most recent point, including any transactions committed during the data read phase of the backup.

Differential Backup

A differential backup backs up every page in the database that has been modified since the last full backup. SQL Server keeps track of these pages by using bitmap pages called DIFF pages, which occur once in every 64,000 extents. These pages use flags to indicate whether each extent in their corresponding block of extents contains pages that have been updated since the last full backup. The cumulative nature of differential backups means that your restore chain only ever needs to include one differential backup: the latest one. Only ever needing to restore one differential backup is very useful if there is a significant time lapse between full backups but log backups are taken very frequently, because restoring the latest differential can drastically decrease the number of transaction log backups you need to restore.

Log Backup

A transaction log backup can only be taken in the FULL or BULK LOGGED recovery model. When a transaction log backup is issued in the FULL recovery model, it backs up all transaction log records since the last log backup. When it is performed in the BULK LOGGED recovery model, it also backs up any pages that include minimally logged transactions. When the backup is complete, SQL Server truncates VLFs within the transaction log until the first active VLF is reached.
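As a hedged illustration of the three backup types against the Chapter15 database (the backup paths and file names are assumptions):

-- Full backup: every page in the database plus enough log for consistency.
BACKUP DATABASE Chapter15 TO DISK = 'C:\MSSQL\Backup\Chapter15_Full.bak';

-- Differential backup: only extents changed since the last full backup.
BACKUP DATABASE Chapter15 TO DISK = 'C:\MSSQL\Backup\Chapter15_Diff.bak' WITH DIFFERENTIAL;

-- Transaction log backup: log records generated since the last log backup.
BACKUP LOG Chapter15 TO DISK = 'C:\MSSQL\Backup\Chapter15_Log.trn';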

Transaction log backups are especially important for databases that support OLTP (online transaction processing), since they allow a point-in-time recovery to the moment immediately before a disaster occurred. They are also the least resource-intensive type of backup, meaning that you can perform them more frequently than a full or differential backup without having a significant impact on database performance.

Backup Media

Databases can be backed up to disk, tape, or URL. Tape backups are deprecated, however, so you should avoid using them; their support will be removed in a future version of SQL Server. The terminology surrounding backup media consists of backup devices, logical backup devices, media sets, media families, and backup sets. The structure of a media set is depicted in Figure 15-2, and the concepts are discussed in the following sections.

Figure 15-2. Backup media diagram

Backup Device

A backup device is a physical file on disk, a tape, or a Windows Azure Blob. When the device is a disk, the disk can reside locally on the server or on a backup share specified by a URL. A media set can contain a maximum of 64 backup devices, and data can be striped across the backup devices and can also be mirrored. In Figure 15-2, there are six backup devices, split into three mirrored pairs. This means that the backup set is striped across three of the devices and then mirrored to the other three.

Note Windows Azure Blobs are discussed in Part V of this book.

Striping the backup can be useful for a large database, because doing so allows you to place each device on a different drive array to increase throughput. It can pose administrative challenges, however: if one of the devices in the stripe becomes unavailable, you are unable to restore your backup. You can mitigate this by using a mirror. When you use a mirror, the contents of each device are duplicated to an additional device for redundancy. If one backup device in a media set is mirrored, then all devices within the media set must be mirrored. Each backup device, or mirrored set of backup devices, is known as a media family. Each device can have up to four mirrors. The backup devices within a media set must be all disk or all tape. If they are mirrored, then the mirror devices must have similar properties; otherwise, an error is thrown. For this reason, Microsoft recommends using the same make and model of device for mirrors.

It is also possible to create logical backup devices, which abstract a physical backup device. Using logical devices can simplify administration, especially if you are planning to use many backup devices in the same physical location. A logical backup device is an instance-level object and can be created in SSMS by choosing New Backup Device from the context menu of Server Objects > Backup Devices; this causes the Backup Device dialog box to be displayed, as illustrated in Figure 15-3.

Figure 15-3. Backup Device dialog box
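Before moving on to the T-SQL route for logical devices, here is a hedged sketch of the striping and mirroring just described, with two striped devices mirrored to two more (all paths are assumptions; mirrored media sets require Enterprise Edition, and WITH FORMAT starts a new media set):

BACKUP DATABASE Chapter15
TO   DISK = 'E:\Backups\Chapter15_Stripe1.bak',
     DISK = 'F:\Backups\Chapter15_Stripe2.bak'
MIRROR TO
     DISK = 'G:\Backups\Chapter15_Mirror1.bak',
     DISK = 'H:\Backups\Chapter15_Mirror2.bak'
WITH FORMAT;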

As an alternative to the New Backup Device dialog box, you can create the same logical backup device via T-SQL using the sp_addumpdevice system stored procedure. The command in Listing 15-3 uses the sp_addumpdevice procedure to create the Chapter15Backup logical backup device. In this example, we use the @devtype parameter to pass in the type of the device, in our case, disk. We then pass the abstracted name of the device into the @logicalname parameter and the physical file into the @physicalname parameter.

Listing 15-3. Creating a Logical Backup Device

EXEC sp_addumpdevice
    @devtype = 'disk',
    @logicalname = 'Chapter15Backup',
    @physicalname = 'C:\MSSQL\Backup\Chapter15Backup.bak' ;
GO

Media Sets

A media set contains the backup devices to which the backup is written. Each media family within a media set is assigned a sequential number based on its position in the media set. This is called the family sequence number. Additionally, each physical device is allocated a physical sequence number to identify its physical position within the media set. When a media set is created, the backup devices (files or tapes) are formatted, and a media header is written to each device. This media header remains until the devices are formatted again and contains details such as the name of the media set, the GUID of the media set, the GUIDs and sequence numbers of the media families, the number of mirrors in the set, and the date/time that the header was written.

Backup Sets

Each time a backup is taken to the media set, it is known as a backup set. New backup sets can be appended to the media, or you can overwrite the existing backup sets. If the media set contains only one media family, then that media family contains the entire backup set; otherwise, the backup set is distributed across the media families. Each backup set within the media set is given a sequential number, which allows you to select which backup set to restore.

Backup Strategies

A DBA can implement numerous backup strategies for a database, but you should always base your strategy on the RTO (recovery time objective) and RPO (recovery point objective) requirements of the data-tier application. For example, if an application has an RPO of 60 minutes, you cannot achieve this goal if you only back up the database once every 24 hours.

Full Backup Only

Backup strategies in which you take only full backups are the least flexible. If a database is updated infrequently and there is a regular backup window long enough to take a full backup, then this may be an appropriate strategy. A full-backup-only strategy is also often used for the Master and MSDB system databases, and it may be appropriate for user databases that are used for reporting only and are not updated by users. In this scenario, it may be that the only updates to the database are made via an ETL load. If this is the case, then your backup only needs to be as frequent as this load. You should, however, consider adding a dependency between the ETL load and the full backup, such as putting them in the same SQL Server

Agent job. This is because if your backup takes place halfway through an ETL load, it may render the backup useless when you come to restore, at least not without unpicking the transactions from the ETL load that were included in the backup and then re-running the load.

Using a full-backup-only strategy also limits your flexibility for restores. If you only take full backups, then your only restore option is to restore the database to the point of the last full backup. This can pose two issues. The first is that if you take nightly backups at midnight and your database becomes corrupt at 23:00, you lose 23 hours of data modifications. The second issue occurs if a user accidentally truncates a table at 23:00; the earliest restore point for the database is midnight the previous night, so once again the RPO for the incident is 23 hours, meaning 23 hours of data modifications are lost.

Full and Transaction Log Backups

If your database is in the FULL recovery model, then you are able to take transaction log backups as well as full backups. This means that you can back up much more frequently, since a transaction log backup is quicker than a full backup and uses fewer resources. This approach is appropriate for databases that are updated throughout the day, and it also offers more flexible restores, since you are able to restore to a point in time just before a disaster occurred. If you are taking transaction log backups, you should schedule them in line with your RPO. For example, if you have an RPO of one hour, then you can schedule your log backups to occur every 60 minutes, because this means that you can never lose more than one hour of data. (This is true as long as you have a complete log chain, none of your backups are corrupt, and the share or folder where the backups are stored is accessible when you need it.)

When you use this strategy, you should also consider your RTO. Imagine that you have an RPO of 30 minutes, so you are taking transaction log backups every half hour, but you are only taking a full backup once per week, at 01:00 on a Saturday. If your database becomes corrupt on Friday night at 23:00, you need to restore 330 backups. This is perfectly feasible from a technical viewpoint, but if you have an RTO of 1 hour, you may not be able to restore the database within the allotted time.

Full, Differential, and Transaction Log Backups

To overcome the issue just described, you may choose to add differential backups to your strategy. Because a differential backup is cumulative, as opposed to incremental in the way that log backups are, if you also took a differential backup nightly at 01:00, then you only need to restore 43 backups to recover the database to the point just before the failure. This restore sequence consists of the full backup, the differential backup taken on Friday morning at 01:00, and then the transaction logs, in sequence, between 01:30 and 23:00.

Filegroup Backups

For very large databases, it may not be possible to find a maintenance window that is large enough to take a full backup of the entire database. In this scenario, you may be able to split your data across filegroups and back up half of the filegroups on alternate nights. When you come to a restore scenario, you are able to restore only the filegroup that contains the corrupt data, providing that you have a complete log chain from the time the filegroup was backed up to the end of the log.

Tip Although it is possible to back up individual files as well as a whole filegroup, I find this less helpful, because tables are spread across all files within a filegroup. Therefore, if a table is corrupted, you need to restore all files within the filegroup; if you only have a handful of corrupt pages, you can restore just those pages.
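Returning to filegroup backups, a minimal sketch against the Chapter15 database built in Listing 15-1 (the backup path is an assumption):

BACKUP DATABASE Chapter15
    FILEGROUP = 'FileGroupA'
TO DISK = 'C:\MSSQL\Backup\Chapter15_FileGroupA.bak';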

Partial Backup

A partial backup involves backing up all read/write filegroups but not backing up any read-only filegroups. This can be very helpful if you have a large amount of archive data in the database. The BACKUP DATABASE command in T-SQL supports this through the READ_WRITE_FILEGROUPS option, which means that you can easily perform a partial backup of a database without having to list out the read/write filegroups, which, of course, can leave you prone to human error if you have many filegroups (a sketch appears after Figure 15-4).

Backing Up a Database

A database can be backed up through SSMS or via T-SQL. We examine these techniques in the following sections. Usually, regular backups are scheduled to run with SQL Server Agent or are incorporated into a maintenance plan. These topics are discussed in Chapter 21.

Backing Up in SQL Server Management Studio

You can back up a database through SSMS by selecting Tasks > Back Up from the context menu of the database; this causes the General page of the Backup Database dialog box to display, as shown in Figure 15-4.

Figure 15-4. The General page
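A hedged sketch of the READ_WRITE_FILEGROUPS option mentioned under Partial Backup (the backup path is an assumption):

BACKUP DATABASE Chapter15
    READ_WRITE_FILEGROUPS
TO DISK = 'C:\MSSQL\Backup\Chapter15_Partial.bak';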

397 CHAPTER 15 BACKUPS AND RESTORES In the Database drop-down list, select the database that you wish to back up, and in the Backup Type drop-down, choose to perform either a Full, a Differential, or a Transaction Log backup. The Copy-Only Backup check box allows you to perform a backup that does not affect the restore sequence. Therefore, if you take a copy-only full backup, it does not affect the differential base. Under the covers, this means that the DIFF pages are not reset. Taking a copy-only log backup does not affect the log archive point, and therefore the log is not truncated. Taking a copy-only log backup can be helpful in some online restore scenarios. It is not possible to take a copy-only differential backup. If you have selected a full or differential backup in the Backup Component section, choose if you want to back up the entire database, or specific files and filegroups. Selecting the Files And Filegroups radio button causes the Select File And Filegroups dialog box to display, as illustrated in Figure Here, you can select individual files or entire filegroups to back up. Figure The Select Files And Filegroups dialog box In the Back Up To section of the screen, you can select either Disk, Tape, or URL from the drop-down list before you use the Add and Remove buttons to specify the backup devices that form the definition of the media set. You can specify a maximum of 64 backup devices. The backup device may contain multiple backups (backup sets), and when you click the Contents button, the details of each backup set contained within the backup device will be displayed. Figure 15-6 illustrates the contents of a backup file that contains multiple backups. 529

398 CHAPTER 15 BACKUPS AND RESTORES Figure The Backup file contents On the Media Option page, shown in Figure 15-7, you can specify if you want to use an existing media set or create a new one. If you choose to use an existing media set, then specify if you want to overwrite the content of the media set or append a new backup set to the media set. If you choose to create a new media set, then you can specify the name, and optionally, a description for the media set. If you use an existing media set, you can verify the date and time that the media set and backup set expire. These checks may cause the backup set to be appended to the existing backup device, instead of overwriting the backup sets. Figure The Media Options page 530

399 CHAPTER 15 BACKUPS AND RESTORES Under the Reliability section, specify if the backup should be verified after completion. This is usually a good idea, especially if you are backing up to a URL, since backups across the network are prone to corruption. Choosing the Perform Checksum Before Writing To Media option causes the page checksum of each page of the database to be verified before it is written to the backup device. This causes the backup operation to use additional resources, but if you are not running DBCC CHECKDB as frequently as you take backups, then this option may give you an early warning of any database corruption. (Please see Chapter 8 for more details.) The Continue On Error option causes the backup to continue, even if a bad checksum is discovered during verification of the pages. On the Backup Options page, illustrated in Figure 15-8, you are able to set the expiration date of the backup set as well as select if you want the backup set to be compressed or encrypted. For compression, you can choose to use the instance default setting, or you can override this setting by specifically choosing to compress, or not compress, the backup. Figure The Backup Options page If you choose to encrypt the backup, then you need to select a preexisting certificate. (You can find details of how to create a certificate in Chapter 10.) You then need to select the algorithm that you wish to use to encrypt the backup. Available algorithms in SQL Server 2014 are AES 128, AES 192, AES 256, or 3DES. You should usually select an AES algorithm, because support for 3DES will be removed in a future version of SQL Server. 531
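As a bridge to the T-SQL syntax covered next, the sketch below performs a compressed, encrypted full backup. The certificate name BackupCert and the backup path are assumptions; the certificate must already exist in the master database (see Chapter 10 for creating one).

BACKUP DATABASE Chapter15
TO DISK = 'H:\MSSQL\Backup\Chapter15Encrypted.bak'
WITH COMPRESSION,
     ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert) ;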

400 CHAPTER 15 BACKUPS AND RESTORES Backing Up via T-SQL When you back up a database or log via T-SQL, you can specify many arguments. These can be broken down into the following categories: Backup options (described in Table 15-1 ) Table Backup Options Argument DATABASE/LOG database_name file_or_filegroup READ_WRITE_FILEGROUPS TO MIRROR TO Description Specify DATABASE to perform a full or differential backup. Specify LOG to perform a transaction log backup. The name of the database to perform the backup operation against. Can also be a variable containing the name of the database. A comma-separated list of files or filegroups to back up, in the format FILE = logical file name or FILEGROUP = Logical filegroup name. Performs a partial backup by backing up all read/write filegroups. Optionally, use comma-separated FILEGROUP = syntax after this clause to add read-only filegroups. A comma-separated list of backup devices to stripe the backup set over, with the syntax DISK = physical device, TAPE = physical device, or URL = physical device. A comma-separated list of backup devices to which to mirror the backup set. If the MIRROR TO clause is used, the number of backup devices specified must equal the number of backup devices specified in the TO clause. WITH options (described in Table 15-2 ) Table WITH Options Argument Description CREDENTIAL Use when backing up to a Windows Azure Blob. This is discussed in Chapter 25. DIFFERENTIAL Specifies that a differential backup should be taken. If this option is omitted, then a full backup is taken. ENCRYPTION Specifies the algorithm to use for the encryption of the backup. If the backup is not to be encrypted, then NO_ENCRYPTION can be specified, which is the default option. Backup encryption is only available in Enterprise, Business Intelligence, and Standard Editions of SQL Server. encryptor_name The name of the encryptor in the format SERVER CERTIFICATE = encryptor name or SERVER ASYMETRIC KEY = encryptor name. Backup set options (described in Table 15-3 ) 532

401 CHAPTER 15 BACKUPS AND RESTORES Table Backup Set Options Argument COPY_ONLY COMPRESSION/NO COMPRESSION NAME DESCRIPTION EXPIRYDATE/RETAINEDDAYS Description Specifies that a copy_only backup of the database or log should be taken. This option is ignored if you perform a differential backup. By default, SQL Server decides if the backup should be compressed based on the instance-level setting. You can override this setting, however, by specifying COMPRESSION or NO COMPRESSION, as appropriate. Backup compression is only available in Enterprise, Business Intelligence, and Standard Editions of SQL Server. Specifies a name for the backup set. Adds a description to the backup set. Use EXPIRYDATE = datetime to specify a precise date and time that the backup set expires. After this date, the backup set can be overwritten. Specify RETAINDAYS = int to specify a number of days before the backup set expires. Media set options (described in Table 15-4 ) Table Media Set Options Argument INIT/NOINIT SKIP/NOSKIP FORMAT/NOFORMAT MEDIANAME MEDIADESCRIPTION BLOCKSIZE Description INIT attempts to overwrite the existing backup sets in the media set but leaves the media header intact. It first checks the name and expiry date of the backup set, unless SKIP is specified. NOINIT appends the backup set to the media set, which is the default behavior. SKIP causes the INIT checks of backup set name and expiration date to be skipped. NOSKIP enforces them, which is the default behavior. FORMAT causes the media header to be overwritten, leaving any backup sets within the media set unusable. This essentially creates a new media set. The backup set names and expiry dates are not checked. NOFORMAT preserves the existing media header, which is the default behavior. Specifies the name of the media set. Adds a description of the media set. Specifies the block size in bytes that will be used for the backup. The BLOCKSIZE defaults to 512 for disk and URL and defaults to 65,536 for tape. 533

402 CHAPTER 15 BACKUPS AND RESTORES Error management options (described in Table 15-5 ) Table Error Management Options Argument CHECKSUM/NO_CHECKSUM CONTINUE_AFTER_ERROR/ STOP_ON_ERROR Description Specifies if the page checksum of each page should be validated before the page is written to the media set. STOP_ON_ERROR is the default behavior and causes the backup to fail if a bad checksum is discovered when verifying the page checksum. CONTINUE_AFTER_ERROR allows the backup to continue if a bad checksum is discovered. Tape options (described in Table 15-6 ) Table Tape Options Argument UNLOAD/NOUNLOAD REWIND/NOREWIND Description NOUNLOAD specifies that the tape remains loaded on the tape drive after the backup operation completes. UNLOAD specifies that the tape is rewound and unloaded, which is the default behavior. NOREWIND can improve performance when you are performing multiple backup operations by keeping the tape open after the backup completes. NOREWIND implicitly implies NOUNLOAD as well. REWIND releases the tape and rewinds it, which is the default behavior. * Tape options are ignored unless the backup device is a tape. Log-specific options (described in Table 15-7 ) Table Log-Specific Options Argument NORECOVERY/STANDBY NO_TRUNCATE Description NORECOVERY causes the database to be left in a restoring state when the backup completes, making it inaccessible to users. STANDBY leaves the database in a read-only state when the backup completes. STANDBY requires that you specify the path and file name of the transaction undo file, so it should be used with the format STANDBY = transaction_undo_file. If neither option is specified, then the database remains online when the backup completes. Specifies that the log backup should be attempted, even if the database is not in a healthy state. It also does not attempt to truncate an inactive portion of the log. Taking a tail-log backup involves backing up the log with NORECOVERY and NO_TRUNCATE specified. Miscellaneous options (described in Table 15-8 ) 534

403 CHAPTER 15 BACKUPS AND RESTORES Table Miscellaneous Options Argument BUFFERCOUNT MAXTRANSFERSIZE STATS Description The total number of IO buffers used for the backup operation. The largest possible unit of transfer between SQL Server and the backup media, specified in bytes. Specifies how often progress messages should be displayed. The default is to display a progress message in 10-percent increments. To perform the full database backup of the Chapter 15 database, which we demonstrate through the GUI, we can use the command in Listing Before running this script, modify the path of the backup device to meet your system s configuration. Listing Performing a Full Backup BACKUP DATABASE Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15.bak' WITH RETAINDAYS = 90, FORMAT, INIT, MEDIANAME = 'Chapter 15 ', NAME = 'Chapter 15 -Full Database Backup', COMPRESSION ; GO If we want to perform a differential backup of the Chapter 15 database and append the backup to the same media set, we can add the WITH DIFFERENTIAL option to our statement, as demonstrated in Listing Before running this script, modify the path of the backup device to meet your system s configuration. Listing Performing a Differential Backup BACKUP DATABASE Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15.bak' WITH DIFFERENTIAL, RETAINDAYS = 90, NOINIT, MEDIANAME = 'Chapter 15 ', NAME = 'Chapter 15 -Diff Database Backup', COMPRESSION ; GO If we want to back up the transaction log of the Chapter 15 database, again appending the backup set to the same media set, we can use the command in Listing Before running this script, modify the path of the backup device to meet your system s configuration. Listing Performing a Transaction Log Backup BACKUP LOG Chapter 15 TO DISK = 'H: \MSSQL\Backup\Chapter 15.bak' WITH RETAINDAYS = 90, NOINIT 535

404 CHAPTER 15 BACKUPS AND RESTORES GO, MEDIANAME = 'Chapter 15 ', NAME = 'Chapter 15 -Log Backup', COMPRESSION ; Note In enterprise scenarios, you may wish to store full, differential, and log backups in different folders. If we are implementing a filegroup backup strategy and want to back up only FileGroupA, we can use the command in Listing We create a new media set for this backup set. Before running this script, modify the path of the backup device to meet your system s configuration. Listing Performing a Filegroup Backup BACKUP DATABASE Chapter 15 FILEGROUP = 'FileGroupA' TO DISK = 'H:\MSSQL\Backup\Chapter 15 FGA.bak' WITH RETAINDAYS = 90, FORMAT, INIT, MEDIANAME = 'Chapter 15 FG', NAME = 'Chapter 15 -Full Database Backup-FilegroupA', COMPRESSION ; GO To repeat the full backup of the Chapter15 but stripe the backup set across two backup devices, we can use the command in Listing This helps increase the throughput of the backup. Before running this script, you should modify the paths of the backup devices to meet your system s configuration. Listing Using Multiple Backup Devices BACKUP DATABASE Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15 Stripe1.bak', 'G:\MSSQL\Backup\Chapter 15 Stripe2.bak' WITH RETAINDAYS = 90, FORMAT, INIT, MEDIANAME = 'Chapter 15 Stripe', NAME = 'Chapter15-Full Database Backup-Stripe', COMPRESSION ; GO For increased redundancy, we can create a mirrored media set by using the command in Listing Before running this script, modify the paths of the backup devices to meet your system s configuration. Listing Using a Mirrored Media Set BACKUP DATABASE Chapter15 TO DISK = 'H:\MSSQL\Backup\Chapter15Stripe1.bak', 'G:\MSSQL\Backup\Chapter15Stripe2.bak' MIRROR TO = 'J:\MSSQL\Backup\Chapter15Mirror1.bak', 'K:\MSSQL\Backup\Chapter15Mirror2.bak' 536

405 CHAPTER 15 BACKUPS AND RESTORES GO WITH RETAINDAYS = 90, FORMAT, INIT, MEDIANAME = 'Chapter15Mirror', NAME = 'Chapter15-Full Database Backup-Mirror', COMPRESSION ; Restoring a Database You can restore a database either through SSMS or via T-SQL. We explore both of these options in the following sections. Restoring in SQL Server Management Studio To begin a restore in SSMS, select Restore Database from the context menu of Databases in Object Explorer. This causes the General page of the Restore Database dialog box to display, as illustrated in Figure Selecting the database to be restored from the drop-down list causes the rest of the tab to be automatically populated. Figure The General page 537

406 CHAPTER 15 BACKUPS AND RESTORES You can see that the contents of the Chapter15 media set are displayed in the Backup Sets To Restore pane of the page. In this case, we can see the contents of the Chapter15 media set. The Restore check boxes allow you to select the backup sets that you wish to restore. The Timeline button provides a graphical illustration of when each backup set was created, as illustrated in Figure This allows you to easily see how much data loss exposure you have, depending on the backup sets that you choose to restore. In the Timeline window, you can also specify if you want to recover to the end of the log, or if you wish to restore to a specific date/time. Figure The Backup Timeline page Clicking the Verify Backup Media button on the General page causes a RESTORE WITH VERIFYONLY operation to be carried out. This operation verifies the backup media without attempting to restore it. In order to do this, it performs the following checks: The backup set is complete. All backup devices are readable. The CHECKSUM is valid (only applies if WITH CHECKSUM was specified during the backup operation). Page headers are verified. There is enough space on the target restore volume for the backups to be restored. On the Files page, illustrated in Figure 15-11, you can select a different location to which to restore each file. The default behavior is to restore the files to the current location. You can use the ellipses, next to each file, to specify a different location for each individual file, or you can use the Relocate All Files To Folder option to specify a single folder for all data files and a single folder for all log files. 538
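The T-SQL equivalent of the Relocate All Files To Folder option is the MOVE argument. A minimal sketch follows; the logical file names and target paths are assumptions, and you can confirm the real logical names with RESTORE FILELISTONLY.

RESTORE DATABASE Chapter15
FROM DISK = 'H:\MSSQL\Backup\Chapter15.bak'
WITH FILE = 1,
     MOVE 'Chapter15' TO 'D:\MSSQL\Data\Chapter15.mdf',
     MOVE 'Chapter15_log' TO 'E:\MSSQL\Logs\Chapter15_log.ldf',
     RECOVERY ;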

407 CHAPTER 15 BACKUPS AND RESTORES Figure The File page On the Options page, shown in Figure 15-12, you are able to specify the restore options that you plan to use. In the Restore Options section of the page, you can specify that you want to overwrite an existing database, preserve the replication settings within the database (which you should use if you are configuring log shipping to work with replication), and restore the database with restricted access. This last option makes the database accessible only to administrators and members of the db_owner and db_creator roles after the restore completes. This can be helpful if you want to verify the data, or perform any data repairs, before you make the database accessible to users. 539
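In T-SQL, the overwrite and restricted access check boxes map to the REPLACE and RESTRICTED_USER options. A minimal sketch, with an assumed backup path:

RESTORE DATABASE Chapter15
FROM DISK = 'H:\MSSQL\Backup\Chapter15.bak'
WITH FILE = 1,
     REPLACE,
     RESTRICTED_USER,
     RECOVERY ;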

408 CHAPTER 15 BACKUPS AND RESTORES Figure The Options page In the Restore Options section, you can also specify the recovery state of the database. Restoring the database with RECOVERY brings the database online when the restore completes. NORECOVERY leaves the database in a restoring state, which means that further backups can be applied. STANDBY brings the database online but leaves it in a read-only state. This option can be helpful if you are failing over to a secondary server. If you choose this option, you are also able to specify the location of the Transaction Undo file. Tip If you specify WITH PARTIAL during the restore of the first backup file, you are able to apply additional backups, even if you restore WITH RECOVERY. There is no GUI support for piecemeal restores, however. Performing piecemeal restores via T-SQL is discussed later in this chapter. 540
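For reference, restoring WITH STANDBY in T-SQL looks broadly like the sketch below; the undo file path is an assumption. The database is then readable, and it remains possible to apply further log backups or to recover it fully later.

RESTORE DATABASE Chapter15
FROM DISK = 'H:\MSSQL\Backup\Chapter15.bak'
WITH FILE = 1,
     STANDBY = 'H:\MSSQL\Backup\Chapter15_Undo.dat' ;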

409 CHAPTER 15 BACKUPS AND RESTORES In the Tail-Log Backup section of the screen, you can choose to attempt a tail-log backup before the restore operation begins, and if you choose to do so, you can choose to leave the database in a restoring state. A tail-log backup may be possible even if the database is damaged. Leaving the source database in a restoring state essentially safe-states it to mitigate the risk of data loss. If you choose to take a tail-log backup, you can also specify the file path for the backup device to use. You can also specify if you want to close existing connections to the destination database before the restore begins and if you want to be prompted before restoring each individual backup set. Restoring via T-SQL When using the RESTORE command in T-SQL, in addition to restoring a database, the following options are available. Table Restore Options Restore Option RESTORE FILELISTONLY RESTORE HEADERONLY RESTORE LABELONLY RESTORE REWINDONLY RESTORE VERIFYONLY Description Returns a list of all files in the backup device. Returns the backup headers for all backup sets within a backup device. Returns information regarding the media set and media family to which the backup device belongs. Closes and rewinds the tape. Only works if the backup device is a tape. Checks that all backup devices exist and are readable. Also performs other high-level verification checks, such as ensuring there is enough space of the destination drive, checking the CHECKSUM (providing the backup was taken with CHECKSUM) and checking key Page Header fields. When using the RESTORE command to perform a restore, you can use many arguments to allow many restore scenarios to take place. These arguments can be categorized as follows: Restore arguments (described in Table ) Table Restore Arguments Argument DATABASE/LOG database_name file_or_filegroup_or_pages READ_WRITE_FILEGROUPS FROM Description Specify DATABASE to which to restore all or some of the files that constitute the database. Specify LOG to restore a transaction log backup. Specifies the name of the target database that will be restored. Specifies a comma-separated list of the files, filegroups, or pages to be restored. If restoring pages, use the format PAGE = FileID:PageID. In simple recovery model, files and filegroups can only be specified if they are read-only or if you are performing a partial restore using WITH PARTIAL. Restores all read/write filegroups but no read-only filegroups. A comma-separated list of backup devices that contains the backup set to restore or the name of the database snapshot from which you wish to restore. Database snapshots are discussed in Chapter

410 CHAPTER 15 BACKUPS AND RESTORES WITH options (described in Table ) Table WITH Options Argument PARTIAL RECOVERY/NORECOVERY/STANDBY MOVE CREDENTIAL REPLACE RESTART RESTRICTED_USER Description Indicates that this is the first restore in a piecemeal restore, which is discussed later in this chapter. Specifies the state that the database should be left in when the restore operation completes. RECOVERY indicates that the database will be brought online. NORECOVERY indicates that the database will remain in a restoring state so that subsequent restores can be applied. STANDBY indicates that the database will be brought online in read-only mode. Used to specify the file system location that the files should be restored to if this is different from the original location. Used when performing a restore from a Windows Azure Blob, which will be discussed in Chapter 25. If a database already exists on the instance with the target database name that you have specified in the restore statement, or if the files already exist in the operating system with the same name or location, then REPLACE indicates that the database or files should be overwritten. Indicates that if the restore operation is interrupted, it should be restarted from that point. Indicates that only administrators and members of the db_owner and db_creator roles should have access to the database after the restore operation completes. Backup set options (described in Table ) Table Backup Set Options Argument FILE PASSWORD Description Indicates the sequential number of the backup set, within the media set, to be used. If you are restoring a backup that was taken in SQL Server 2008 or earlier where a password was specified during the backup operation, then you need to use this argument to be able to restore the backup. Media set options (described in Table ) 542

411 CHAPTER 15 BACKUPS AND RESTORES Table Media Set Options Argument MEDIANAME MEDIAPASSWORD BLOCKSIZE Description If you use this argument, then the MEDIANAME must match the name of the media set allocated during the creation of the media set. If you are restoring from a media set created using SQL Server 2008 or earlier and a password was specified for the media set, then you must use this argument during the restore operation. Specifies the block size to use for the restore operation, in bytes, to override the default value of 65,536 for tape and 512 for disk or URL. Error management options (described in Table ) Table Error Management Options Argument CHECKSUM/NOCHECKSUM CONTINUE_AFTER_ ERROR/STOP_ON_ERROR Description If CHECKSUM was specified during the backup operation, then specifying CHECKSUM during the restore operation will verify page integrity during the restore operation. Specifying NOCKECKSUM disables this verification. STOP_ON_ERROR causes the restore operation to terminate if any damaged pages are discovered. CONTINUE_AFTER_ERROR causes the restore operation to continue, even if damaged pages are discovered. Tape options (described in Table ) Table Tape Options Argument UNLOAD/NOUNLOAD REWIND/NOREWIND Description NOUNLOAD specifies that the tape will remain loaded on the tape drive after the backup operation completes. UNLOAD specifies that the tape will be rewound and unloaded, which is the default behavior. NOREWIND can improve performance when you are performing multiple backup operations by keeping the tape open after the backup completes. NOREWIND implicitly implies NOUNLOAD as well. REWIND releases the tape and rewinds it, which is the default behavior. * Tape options are ignored unless the backup device is a tape. 543

412 CHAPTER 15 BACKUPS AND RESTORES Miscellaneous options (described in Table ) Table Miscillaneous Options Argument BUFFERCOUNT MAXTRANSFERSIZE STATS FILESTREAM (DIRECTORY_NAME) KEEP_REPLICATION KEEP_CDC ENABLE_BROKER/ ERROR_BROKER_CONVERSATIONS/NEW BROKER STOPAT/STOPATMARK/STOPBEFOREMARK Description The total number of IO buffers used for the restore operation. The largest possible unit of transfer between SQL Server and the backup media, specified in bytes. Specifies how often progress messages should be displayed. The default is to display a progress message in 5-percent increments. Specifies the name of the folder to which FILESTREAM data should be restored. Preserves the replication settings. Use this option when configuring log shipping with replication. Preserves the Change Data Capture (CDC) settings of a database when it is being restored. Only relevant if CDC was enabled at the time of the backup operation. ENABLE_BROKER specifies that service broker message delivery will be enabled after the restore operation completes so that messages can immediately be sent. ERROR_BROKER_CONVERSATIONS specifies that all conversations will be terminated with an error message before message delivery is enabled. NEW_BROKER specifies that conversations will be removed without throwing an error and the database will be assigned a new Service Broker identifier. Only relevant if Service Broker was enabled when the backup was created. Used for point-in-time recovery and only supported in FULL recovery model. STOPAT specifies a datetime value, which will determine the time of the last transaction to restore. STOPATMARK specifies either an LSN (log sequence number) to restore to, or the name of a marked transaction, which will be the final transaction that is restored. STOPBEFOREMARK restores up to the transaction prior to the LSN or marked transaction specified. To perform the same restore operation that we performed through SSMS, we use the command in Listing Before running the script, change the path of the backup devices to match your own configuration. 544

413 CHAPTER 15 BACKUPS AND RESTORES Listing Restoring a Database USE master GO --Back Up the tail of the log BACKUP LOG Chapter 15 TO DISK = N'H:\MSSQL\Backup\Chapter 15 _LogBackup_ _ bak' WITH NOFORMAT, NAME = N'Chapter 15 _LogBackup_ _ ', NORECOVERY, STATS = 5 ; --Restore the full backup RESTORE DATABASE Chapter 15 FROM DISK = N'H:\MSSQL\Backup\Chapter 15.bak' WITH FILE = 1, NORECOVERY, STATS = 5 ; --Restore the differential RESTORE DATABASE Chapter 15 FROM DISK = N'H:\MSSQL\Backup\Chapter 15.bak' WITH FILE = 2, NORECOVERY, STATS = 5 ; --Restore the transaction log RESTORE LOG Chapter 15 FROM DISK = N'H:\MSSQL\Backup\Chapter 15.bak' WITH FILE = 3, STATS = 5 ; GO Restoring to a Point in Time In order to demonstrate restoring a database to a point in time, we first take a series of backups, manipulating data between each one. The script in Listing first creates a base full backup of the Chapter 15 database. It then inserts some rows into the Addresses table before it takes a transaction log backup. It then inserts some further rows into the Addresses table before truncating the table; and then finally, it takes another transaction log backup. 545

414 CHAPTER 15 BACKUPS AND RESTORES Listing Preparing the Chapter15 Database USE Chapter 15 GO BACKUP DATABASE Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15 PointinTime.bak' WITH RETAINDAYS = 90, FORMAT, INIT, SKIP, MEDIANAME = 'Chapter 15 Point-in-time', NAME = 'Chapter 15 -Full Database Backup', COMPRESSION ; INSERT INTO dbo.addresses VALUES('1 Carter Drive', 'Hedge End', 'Southampton', 'SO32 6GH'),('10 Apress Way', NULL, 'London', 'WC10 2FG') ; BACKUP LOG Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15 PointinTime.bak' WITH RETAINDAYS = 90, NOINIT, MEDIANAME = 'Chapter 15 Point-in-time', NAME = 'Chapter 15 -Log Backup', COMPRESSION ; INSERT INTO dbo.addresses VALUES('12 SQL Street', 'Botley', 'Southampton', 'SO32 8RT'),('19 Springer Way', NULL, 'London', 'EC1 5GG') ; TRUNCATE TABLE dbo.addresses ; BACKUP LOG Chapter 15 TO DISK = 'H:\MSSQL\Backup\Chapter 15 PointinTime.bak' WITH RETAINDAYS = 90, NOINIT, MEDIANAME = 'Chapter 15 Point-in-time', NAME = 'Chapter 15 -Log Backup', COMPRESSION ; GO Imagine that after the series of events that occurred in this script, we discover that the Addresses table was truncated in error and we need to restore to the point immediately before this truncation occurred. To do this, we either need to know the exact time of the truncation and need to restore to the date/time immediately before, or to be more accurate, we need to discover the LSN of the transaction where the truncation occurred and restore up to this transaction. In this demonstration, we choose the latter option. We can use a system function called sys.fn_dump_dblog() to display the contents of the final log backup that includes the second insert statement and the table truncation. The procedure accepts a massive 68 parameters, and none of them can be omitted! The first and second parameters allow you to specify a beginning and end LSN with which to filter the results. These parameters can both be set to NULL to return all entries in the backup. The third parameter specifies if the backup set is disk or tape, whereas the fourth parameter specifies the sequential ID of the 546

415 CHAPTER 15 BACKUPS AND RESTORES backup set within the device. The next 64 parameters accept the names of the backup devices within the media set. If the media set contains less than 64 devices, then you should use the value DEFAULT for any parameters that are not required. The script in Listing uses fn_dump_dblog() to identify the starting LSN of the autocommit transaction in which the truncation occurred. The issue with this function is that it does not return the LSN in the same format required by the RESTORE command. Therefore, the calculated column, ConvertedLSN, converts each of the three sections of the LSN from binary to decimal, pads them out with zeros as required, and finally concatenates them back together to produce an LSN that can be passed into the RESTORE operation. Listing Finding the LSN of the Truncation SELECT CAST( CAST( CONVERT(VARBINARY, '0x' + RIGHT(REPLICATE('0', 8) + SUBSTRING([Current LSN], 1, 8), 8), 1 ) AS INT ) AS VARCHAR(11) ) + RIGHT(REPLICATE('0', 10) + CAST( CAST( CONVERT(VARBINARY, '0x' + RIGHT(REPLICATE('0', 8) + SUBSTRING([Current LSN], 10, 8), 8), 1 ) AS INT ) AS VARCHAR(10)), 10) + RIGHT(REPLICATE('0',5) + CAST( CAST(CONVERT(VARBINARY, '0x' + RIGHT(REPLICATE('0', 8) + SUBSTRING([Current LSN], 19, 4), 8), 1 ) AS INT ) AS VARCHAR ), 5) AS ConvertedLSN,* FROM sys.fn_dump_dblog ( NULL, NULL, N'DISK', 3, N'H:\MSSQL\Backup\Chapter 15 PointinTime.bak' DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) WHERE [Transaction Name] = 'TRUNCATE TABLE' ; 547
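If you know the approximate time of the truncation rather than its LSN, the STOPAT argument offers a simpler, if less precise, alternative to STOPBEFOREMARK. The path and timestamp below are purely illustrative, and the full backup and any earlier log backups must already have been restored WITH NORECOVERY, exactly as in the listing that follows.

RESTORE LOG Chapter15
FROM DISK = N'H:\MSSQL\Backup\Chapter15PointinTime.bak'
WITH FILE = 3, STOPAT = '2016-06-01 14:30:00', RECOVERY ;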

416 CHAPTER 15 BACKUPS AND RESTORES
Now that we have discovered the LSN of the transaction that truncated the Addresses table, we can restore the Chapter 15 database to this point. The script in the following listing restores the full and first transaction log backups in their entirety. It then restores the final transaction log but uses the STOPBEFOREMARK argument to specify the first LSN that should not be restored. Before running the script, change the locations of the backup devices, as per your own configuration. You should also replace the LSN with the LSN that you generated using sys.fn_dump_dblog().
Listing Restoring to a Point in Time
USE master
GO
RESTORE DATABASE Chapter 15
FROM DISK = N'H:\MSSQL\Backup\Chapter 15 PointinTime.bak'
WITH FILE = 1, NORECOVERY, STATS = 5, REPLACE ;

RESTORE LOG Chapter 15
FROM DISK = N'H:\MSSQL\Backup\Chapter 15 PointinTime.bak'
WITH FILE = 2, NORECOVERY, STATS = 5, REPLACE ;

RESTORE LOG Chapter 15
FROM DISK = N'H:\MSSQL\Backup\Chapter 15 PointinTime.bak'
WITH FILE = 3, STATS = 5, STOPBEFOREMARK = 'lsn: ', RECOVERY, REPLACE ;

Restoring Files and Pages
The ability to restore a filegroup, a file, or even a page gives you great control and flexibility in disaster recovery scenarios. The following sections demonstrate how to perform a file restore and a page restore.
Restoring a File
You may come across situations in which only some files or filegroups within the database are corrupt. If this is the case, then it is possible to restore just the corrupt file, assuming you have the complete log chain available between the point when you took the file or filegroup backup and the end of the log. In order to demonstrate this functionality, we first insert some rows into the Contacts table of the Chapter 15 database before we back up the primary filegroup and FileGroupA. We then insert some rows into the Addresses table, which resides on FileGroupB, before we take a transaction log backup. These tasks are performed by the script in the following listing.

417 CHAPTER 15 BACKUPS AND RESTORES Listing Preparing the Database INSERT INTO dbo.contacts VALUES('Peter', 'Carter', 1), ('Danielle', 'Carter', 1) ; BACKUP DATABASE Chapter 15 FILEGROUP = N'PRIMARY', FILEGROUP = N'FileGroupA' TO DISK = N'H:\MSSQL\Backup\Chapter 15 FileRestore.bak' WITH FORMAT, NAME = N'Chapter 15 -Filegroup Backup', STATS = 10 ; INSERT INTO dbo.addresses VALUES('SQL House', 'Server Buildings', NULL, 'SQ42 4BY'), ('Carter Mansions', 'Admin Road', 'London', 'E3 3GJ') ; BACKUP LOG Chapter 15 TO DISK = N'H:\MSSQL\Backup\Chapter 15 FileRestore.bak' WITH NOFORMAT, NOINIT, NAME = N'Chapter 15 -Log Backup', NOSKIP, STATS = 10 ; If we imagine that Chapter 15 FileA has become corrupt, we are able to restore this file, even though we do not have a corresponding backup for Chapter 15 FileB, and recover to the latest point in time by using the script in Listing This script performs a file restore on the file Chapter 15 FileA before taking a tail-log backup of the transaction log and then finally applying all transaction logs in sequence. Before running this script, change the location of the backup devices to reflect your own configuration. Caution If we had not taken the tail-log backup, then we would no longer have been able to access the Contacts table (In FileGroupB ), unless we had also been able to restore the Chapter 15 FileB file. Listing Restoring a File USE master GO RESTORE DATABASE Chapter 15 FILE = N'Chapter 15 FileA' FROM DISK = N'H:\MSSQL\Backup\Chapter 15 FileRestore.bak' WITH FILE = 1, NORECOVERY, STATS = 10, REPLACE ; GO BACKUP LOG Chapter 15 TO DISK = N'H:\MSSQL\Backup\Chapter 15 _LogBackup_ _ bak' WITH NOFORMAT, NOINIT, NAME = N'Chapter 15 _LogBackup_ _ ' 549

418 CHAPTER 15 BACKUPS AND RESTORES
, NOSKIP, NORECOVERY, STATS = 5 ;

RESTORE LOG Chapter 15
FROM DISK = N'H:\MSSQL\Backup\Chapter 15 FileRestore.bak'
WITH FILE = 2, STATS = 10, NORECOVERY ;

RESTORE LOG Chapter 15
FROM DISK = N'H:\MSSQL\Backup\Chapter 15 _LogBackup_ _ bak'
WITH FILE = 1, STATS = 10, RECOVERY ;
GO

Restoring a Page
If a page becomes corrupt, then it is possible to restore this page instead of restoring the complete file or even the database. This can significantly reduce downtime in a minor DR scenario. In order to demonstrate this functionality, we take a full backup of the Chapter15 database and then use the undocumented DBCC WRITEPAGE to cause a corruption in one of the pages of our Contacts table. These steps are performed in the following listing.
Caution DBCC WRITEPAGE is used here for educational purposes only. It is undocumented, but also extremely dangerous. It should never be used on a production system and should only ever be used on any database with extreme caution.
Listing Preparing the Database
--Back up the database
BACKUP DATABASE Chapter15
TO DISK = N'H:\MSSQL\Backup\Chapter15PageRestore.bak'
WITH FORMAT, NAME = N'Chapter15-Full Backup', STATS = 10 ;

--Corrupt a page in the Contacts table
ALTER DATABASE Chapter15 SET SINGLE_USER WITH NO_WAIT ;
GO

DECLARE @SQL NVARCHAR(MAX)

419 CHAPTER 15 BACKUPS AND RESTORES
SET @SQL = 'DBCC WRITEPAGE(' +
    ( SELECT CAST(DB_ID('Chapter15') AS NVARCHAR) ) + ', ' +
    ( SELECT TOP 1 CAST(file_id AS NVARCHAR)
      FROM dbo.contacts
      CROSS APPLY sys.fn_physloccracker(%%physloc%%) ) + ', ' +
    ( SELECT TOP 1 CAST(page_id AS NVARCHAR)
      FROM dbo.contacts
      CROSS APPLY sys.fn_physloccracker(%%physloc%%) ) + ', 2000, 1, 0x61, 1)' ;

EXEC(@SQL) ;

ALTER DATABASE Chapter15 SET MULTI_USER ;
GO

If we attempt to access the Contacts table after running the script, we receive the error message warning us of a logical consistency-based I/O error, and the statement fails. The error message also provides details of the page that is corrupt, which we can use in our RESTORE statement. To resolve this, we can run the script in the following listing. The script restores the corrupt page before taking a tail-log backup, and then finally it applies the tail of the log. Before running the script, modify the location of the backup devices to reflect your configuration. You should also update the PageID to reflect the page that is corrupt in your version of the Chapter15 database. Specify the page to be restored in the format FileID:PageID.
Tip The details of the corrupt page can also be found in MSDB.dbo.suspect_pages.
Listing Restoring a Page
USE master
GO
RESTORE DATABASE Chapter15
PAGE='3:8'
FROM DISK = N'H:\MSSQL\Backup\Chapter15PageRestore.bak'
WITH FILE = 1, NORECOVERY, STATS = 5 ;

BACKUP LOG Chapter15
TO DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak'
WITH NOFORMAT, NOINIT, NAME = N'Chapter15_LogBackup_ _ ', NOSKIP, STATS = 5 ;

420 CHAPTER 15 BACKUPS AND RESTORES RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak' WITH STATS = 5, RECOVERY ; GO Piecemeal Restores A piecemeal restore involves bringing the filegroups of a database online one by one. This can offer a big benefit for a large database, since you can make some data accessible, while other data is still being restored. In order to demonstrate this technique, we first take filegroup backups of all filegroups in the Chapter15 database and follow this with a transaction log backup. The script in Listing performs this task. Before running the script, modify the locations of the backup devices to reflect your own configurations. Listing Filegroup Backup BACKUP DATABASE Chapter15 FILEGROUP = N'PRIMARY', FILEGROUP = N'FileGroupA', FILEGROUP = N'FileGroupB' TO DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FORMAT, NAME = N'Chapter15-Fiegroup Backup', STATS = 10 ; BACKUP LOG Chapter15 TO DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH NOFORMAT, NOINIT, NAME = N'Chapter15-Full Database Backup', STATS = 10 ; The script in Listing now brings the filegroups online, one by one, starting with the primary filegroup, followed by FileGroupA, and finally, FileGroupB. Before beginning the restore, we back up the tail of the log. This backup is restored WITH RECOVERY after each filegroup is restored. This brings the restored databases back online. It is possible to restore further backups because we specify the PARTIAL option on the first restore operation. Listing Piecemeal Restore USE master GO BACKUP LOG Chapter15 TO DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak' WITH NOFORMAT, NOINIT, NAME = N'Chapter15_LogBackup_ _ ', NOSKIP, NORECOVERY, NO_TRUNCATE, STATS = 5 ; RESTORE DATABASE Chapter15 FILEGROUP = N'PRIMARY' FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' 552

421 CHAPTER 15 BACKUPS AND RESTORES WITH FILE = 1, NORECOVERY, PARTIAL, STATS = 10 ; RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FILE = 2, NORECOVERY, STATS = 10 ; RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak' WITH FILE = 1, STATS = 10, RECOVERY ; The PRIMARY Filegroup is now online RESTORE DATABASE Chapter15 FILEGROUP = N'FileGroupA' FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FILE = 1, NORECOVERY, STATS = 10 ; RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FILE = 2, NORECOVERY, STATS = 10 ; RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak' WITH FILE = 1, STATS = 10, RECOVERY ; The FilegroupA Filegroup is now online RESTORE DATABASE Chapter15 FILEGROUP = N'FileGroupB' FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FILE = 1, NORECOVERY, STATS = 10 ; RESTORE LOG Chapter15 FROM DISK = N'H:\MSSQL\Backup\Chapter15Piecemeal.bak' WITH FILE = 2, NORECOVERY, STATS = 10 ; 553

422 CHAPTER 15 BACKUPS AND RESTORES
RESTORE LOG Chapter15
FROM DISK = N'H:\MSSQL\Backup\Chapter15_LogBackup_ _ bak'
WITH FILE = 1, STATS = 10, RECOVERY ;
--The database is now fully online

Summary
A SQL Server database can operate in three recovery models. The SIMPLE recovery model automatically truncates the transaction log after CHECKPOINT operations occur. This means that log backups cannot be taken and, therefore, point-in-time restores are not available. In FULL recovery model, the transaction log is only truncated after a log backup operation. This means that you must take transaction log backups, both for disaster recovery and to stop the log from growing indefinitely. The BULK LOGGED recovery model is intended to be used only for the duration of a bulk load operation; if you normally run in FULL recovery model, you switch to BULK LOGGED just before the bulk load and switch back when it completes.
SQL Server supports three types of backup. A full backup copies all database pages to the backup device. A differential backup copies all database pages that have been modified since the last full backup to the backup device. A transaction log backup copies the contents of the transaction log to the backup device.
A DBA can adopt many backup strategies to provide the best possible RTO and RPO in the event of a disaster that requires a database to be restored. These include taking full backups only, which is applicable to the SIMPLE recovery model; scheduling full backups along with transaction log backups; or scheduling full, differential, and transaction log backups. Scheduling differential backups can help improve the RTO of a database if frequent log backups are taken. DBAs may also elect to implement a filegroup backup strategy, which allows them to stagger their backups into more manageable windows, or to perform a partial backup, which involves backing up only read/write filegroups.
Ad hoc backups can be taken via T-SQL or SQL Server Management Studio (SSMS). In production environments, you invariably want to schedule the backups to run periodically, and we discuss how to automate this action in Chapter 21. You can also perform restores either through SSMS or with T-SQL. However, you can only perform complex restore scenarios, such as piecemeal restores, via T-SQL. SQL Server also provides you with the ability to restore a single page or file. You can restore a corrupt page as an online operation, and doing so usually provides a better alternative for fixing small-scale corruption than either restoring a whole database or using DBCC CHECKDB with the ALLOW_DATA_LOSS option. More details on DBCC CHECKDB can be found in Chapter 8.
Tip Many other restore scenarios are beyond the scope of this book, because a full description of every possible scenario would be worthy of a volume in its own right. I encourage you to explore various restore scenarios in a sandpit environment before you need to use them for real!
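Once a backup schedule is in place, it is worth checking periodically that it is actually being honored. One simple sketch, using the backup history that SQL Server records in msdb, is shown below; adapt it to your own monitoring standards.

SELECT d.name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS LastFullBackup,
       MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END) AS LastDiffBackup,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS LastLogBackup
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name
GROUP BY d.name ;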

423 SQL Server AlwaysOn Revealed Peter A Carter

424 SQL Server AlwaysOn Revealed Copyright 2015 by Peter A Carter This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. ISBN-13 (pbk): ISBN-13 (electronic): Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director: Welmoed Spahr Lead Editor: Jonathan Gennick Technical Reviewer: Alex Grinberg and Louis Davidson Editorial Board: Steve Anglin, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing Coordinating Editor: Jill Balzano Copy Editor: Rebecca Rider Compositor: SPi Global Indexer: SPi Global Artist: SPi Global Cover Designer: Anna Ishchenko Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY Phone SPRINGER, fax (201) , orders-ny@springer-sbm.com, or visit Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please rights@apress.com, or visit Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. ebook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales ebook Licensing web page at Any source code or other supplementary material referenced by the author in this text is available to readers at For detailed information about how to locate your book s source code, go to

425 Contents at a Glance
About the Author ... xi
About the Technical Reviewers ... xiii
Chapter 1: High Availability and Disaster Recovery Concepts ... 1
Chapter 2: Understanding High Availability and Disaster Recovery Technologies ... 7
Chapter 3: Implementing a Cluster
Chapter 4: Implementing an AlwaysOn Failover Clustered Instance
Chapter 5: Implementing AlwaysOn Availability Groups
Chapter 6: Administering AlwaysOn
Index

426 CHAPTER 2 Understanding High Availability and Disaster Recovery Technologies SQL Server provides a full suite of technologies for implementing high availability and disaster recovery. This chapter provides an overview of these technologies and discuss their most appropriate uses. AlwaysOn Failover Clustering A Windows cluster is a technology for providing high availability in which a group of up to 64 servers works together to provide redundancy. An AlwaysOn Failover Clustered Instance (FCI) is an instance of SQL Server that spans the servers within this group. If one of the servers within this group fails, another server takes ownership of the instance. Its most appropriate usage is for high availability scenarios where the databases are large or have high write profiles. This is because clustering relies on shared storage, meaning the data is only written to disk once. With SQL Server level HA technologies, write operations occur on the primary database, and then again on all secondary databases, before the commit on the primary completes. This can cause performance issues. Even though it is possible to stretch a cluster across multiple sites, this involves SAN replication, which means that a cluster is normally configured within a single site. Each server within a cluster is called a node. Therefore, if a cluster consists of three servers, it is known as a three-node cluster. Each node within a cluster has the SQL Server binaries installed, but the SQL Server service is only started on one of the nodes, which is known as the active node. Each node within the cluster also shares the same storage for the SQL Server data and log files. The storage, however, is only attached to the active node. If the active node fails, then the SQL Server service is stopped and the storage is detached. The storage is then reattached to one of the other nodes in the cluster, and the SQL Server service is started on this node, which is now the active node. The instance is also assigned its own network name and IP address, which are also bound to the active node. This means that applications can connect seamlessly to the instance, regardless of which node has ownership. The diagram in Figure 2-1 illustrates a two-node cluster. It shows that although the databases are stored on a shared storage array, each node still has a dedicated system volume. This volume contains the SQL Server binaries. It also illustrates how the shared storage, IP address, and network name are rebound to the passive node in the event of failover. 7

427 CHAPTER 2 UNDERSTANDING HIGH AVAILABILITY AND DISASTER RECOVERY TECHNOLOGIES
Figure 2-1. Two-node cluster
Active/Active Configuration
Although the diagram in Figure 2-1 illustrates an active/passive configuration, it is also possible to have an active/active configuration. Although it is not possible for more than one node at a time to own a single instance, and therefore it is not possible to implement load-balancing, it is possible to install multiple instances on a cluster, and a different node may own each instance. In this scenario, each node has its own unique network name and IP address. Each instance's shared storage also consists of a unique set of volumes. Therefore, in an active/active configuration, during normal operations, Node1 may host Instance1 and Node2 may host Instance2. If Node1 fails, both instances are then hosted by Node2, and vice versa. The diagram in Figure 2-2 illustrates a two-node active/active cluster.

428 CHAPTER 2 UNDERSTANDING HIGH AVAILABILITY AND DISASTER RECOVERY TECHNOLOGIES
Figure 2-2. Active/Active cluster
Caution In an active/active cluster, it is important to consider resources in the event of failover. For example, if each node has 128GB of RAM and the instance hosted on each node is using 96GB of RAM and locking pages in memory, then when one node fails over to the other node, this node fails as well, because it does not have enough memory to allocate to both instances. Make sure you plan both memory and processor requirements as if the two nodes are a single server. For this reason, active/active clusters are not generally recommended for SQL Server.
Three-Plus Node Configurations
As previously mentioned, it is possible to have up to 64 nodes in a cluster. When you have 3 or more nodes, it is unlikely that you will want to have a single active node and two redundant nodes, due to the associated costs. Instead, you can choose to implement an N+1 or N+M configuration. In an N+1 configuration, you have multiple active nodes and a single passive node. If a failure occurs on any of the active nodes, they fail over to the passive node. The diagram in Figure 2-3 depicts a three-node N+1 cluster.

429 CHAPTER 2 UNDERSTANDING HIGH AVAILABILITY AND DISASTER RECOVERY TECHNOLOGIES
Figure 2-3. Three-node N+1 configuration
In an N+1 configuration, in a multi-failure scenario, multiple nodes may fail over to the passive node. For this reason, you must be very careful when you plan resources to ensure that the passive node is able to support multiple instances. However, you can mitigate this issue by using an N+M configuration. Whereas an N+1 configuration has multiple active nodes and a single passive node, an N+M cluster has multiple active nodes and multiple passive nodes, although there are usually fewer passive nodes than there are active nodes. The diagram in Figure 2-4 shows a five-node N+M configuration. The diagram shows that Instance3 is configured to always fail over to one of the passive nodes, whereas Instance1 and Instance2 are configured to always fail over to the other passive node. This gives you the flexibility to control resources on the passive nodes, but you can also configure the cluster to allow any of the active nodes to fail over to either of the passive nodes, if this is a more appropriate design for your environment.

430 CHAPTER 2 UNDERSTANDING HIGH AVAILABILITY AND DISASTER RECOVERY TECHNOLOGIES
Figure 2-4. Five-node N+M configuration
Quorum
So that automatic failover can occur, the cluster service needs to know if a node goes down. In order to achieve this, you must form a quorum. The definition of a quorum is "the minimum number of members required in order for business to be carried out." In terms of high availability, this means that each node within a cluster, and optionally a witness device (which may be a cluster disk or a file share that is external to the cluster), receives a vote. If more than half of the voting members are unable to communicate with a node, then the cluster service knows that it has gone down and any cluster-aware applications on the server fail over to another node. The reason that more than half of the voting members need to be unable to communicate with the node is to avoid a situation known as a split brain. To explain a split-brain scenario, imagine that you have three nodes in Data Center 1 and three nodes in Data Center 2. Now imagine that you lose network connectivity between the two data centers, yet all six nodes remain online. The three nodes in Data Center 1 believe that all of the nodes in Data Center 2 are unavailable. Conversely, the nodes in Data Center 2 believe that the nodes in Data Center 1 are unavailable. This leaves both sides (known as partitions) of the cluster thinking that they should take control. This can have unpredictable and undesirable consequences for any application that successfully connects to one or the other partition. The Quorum = (Voting Members / 2) + 1 formula protects against this scenario.
Tip If your cluster loses quorum, then you can force one partition online by starting the cluster service using the /fq switch. If you are using Windows Server 2012 R2 or higher, then the partition that you force online is considered the authoritative partition. This means that other partitions can automatically rejoin the cluster when connectivity is re-established.
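To make the formula concrete: in a five-node cluster with no witness there are five votes, so the quorum requirement is (5 / 2) + 1 = 3, using integer division. The cluster can therefore survive the loss of two voting members; if a third is lost, quorum is lost and the remaining nodes take their clustered resources offline rather than risk a split brain.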

Various quorum models are available, and the most appropriate model depends on your environment. Table 2-1 lists the models that you can utilize and details the most appropriate way to use them.

Table 2-1. Quorum Models

Node Majority: when you have an odd number of nodes in the cluster.
Node + Disk Witness Majority: when you have an even number of nodes in the cluster.
Node + File Share Witness Majority: when you have nodes split across multiple sites, or when you have an even number of nodes and are required to avoid shared disks.*

*Reasons for needing to avoid shared disks due to virtualization are discussed later in this chapter.

Although the default option is one node, one vote, it is possible to manually remove a node's vote by changing the NodeWeight property to zero. This is useful if you have a multi-subnet cluster (a cluster in which the nodes are split across multiple sites). In this scenario, it is recommended that you use a file-share witness in a third site. This helps you avoid a cluster outage as a result of network failure between data centers. If you have an odd number of nodes in the quorum, however, then adding a file-share witness leaves you with an even number of votes, which is dangerous. Removing the vote from one of the nodes in the secondary data center eliminates this issue.

Caution A file-share witness does not store a full copy of the quorum database. This means that a two-node cluster with a file-share witness is vulnerable to a scenario known as partition in time. In this scenario, if one node fails while you are in the process of patching or altering the cluster service on the second node, then there is no up-to-date copy of the quorum database. This leaves you in a position in which you need to destroy and rebuild the cluster.

Windows Server 2012 R2 also introduces the concepts of Dynamic Quorum and Tie Breaker for 50% Node Split. When Dynamic Quorum is enabled, the cluster service automatically decides whether or not to give the quorum witness a vote, depending on the number of nodes in the cluster. If you have an even number of nodes, then the witness is assigned a vote; if you have an odd number of nodes, it is not. Tie Breaker for 50% Node Split expands on this concept. If you have an even number of nodes and a witness, and the witness fails, then the cluster service automatically removes a vote from one random node within the cluster. This maintains an odd number of votes in the quorum and reduces the risk of the cluster going offline due to a witness failure.

Note Clustering is discussed in more depth in Chapter 3 and Chapter 4.
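To see how votes have actually been allocated, you can query SQL Server's cluster DMVs whenever the instance participates in a Windows Server Failover Cluster (for example, when AlwaysOn features are enabled). The following query is a minimal sketch; the DMV and its columns are standard, but it only returns rows on clustered environments.

-- Inspect cluster membership and quorum vote allocation.
-- Returns rows only when the instance is part of a Windows Server Failover Cluster.
SELECT  member_name,            -- node, disk witness, or file share witness
        member_type_desc,       -- CLUSTER_NODE, DISK_WITNESS, or FILE_SHARE_WITNESS
        member_state_desc,      -- UP or DOWN
        number_of_quorum_votes  -- 0 if the vote has been removed (NodeWeight = 0)
FROM    sys.dm_hadr_cluster_members;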

Database Mirroring

Database mirroring is a technology that can provide configurations for both high availability and disaster recovery. As opposed to relying on the Windows cluster service, database mirroring is implemented entirely within SQL Server and provides availability at the database level, as opposed to the instance level. It works by compressing transaction log records and sending them to the secondary server via a TCP endpoint. A database mirroring topology consists of precisely one primary server, precisely one secondary server, and an optional witness server.

Database mirroring is a deprecated technology, which means that it will be removed in a future version of SQL Server. In SQL Server 2014, however, it can still prove useful. For instance, if you are upgrading a data-tier application from SQL Server 2008, where AlwaysOn Availability Groups were not supported and database mirroring had been implemented, and also assuming your expectation is that the lifecycle of the application will end before the next major release of SQL Server, then you can continue to use database mirroring. Some organizations, especially where there is a disconnect between the Windows administration team and the SQL Server DBA team, are also choosing not to implement AlwaysOn Availability Groups, especially for DR, until database mirroring has been removed; this is because of the relative complexity and multi-team effort involved in managing an AlwaysOn environment.

Database mirroring can also be useful when you upgrade data-tier applications from older versions of SQL Server in a side-by-side migration. This is because you can synchronize the databases and fail them over with minimal downtime. If the upgrade is unsuccessful, then you can move them back to the original servers with minimal effort and downtime.

Database mirroring can be configured to run in three different modes: High Performance, High Safety, and High Safety with Automatic Failover. When running in High Performance mode, database mirroring works in an asynchronous manner. Data is committed on the primary database and is then sent to the secondary database, where it is subsequently committed. This means that it is possible to lose data in the event of a failure. If data is lost, the recovery point is the beginning of the oldest open transaction. This means that you cannot guarantee an RPO that relies on asynchronous mirroring for availability, since it will be nondeterministic. There is also no support for automatic failover in this configuration. Therefore, asynchronous mirroring offers a DR solution, as opposed to a high availability solution. The diagram in Figure 2-5 illustrates a mirroring topology configured in High Performance mode.

Figure 2-5. Database mirroring in High Performance mode

When running in High Safety with Automatic Failover mode, data is committed at the secondary server using a synchronous method, as opposed to an asynchronous method. This means that the data is committed on the secondary server before it is committed on the primary server. This can cause performance degradation and requires a fast network link between the two servers. The network latency should be less than 3 milliseconds.

In order to support automatic failover, the database mirroring topology needs to form a quorum. In order to achieve quorum, it needs a third server. This server is known as the witness server, and it is used to arbitrate in the event that the primary and secondary servers lose network connectivity. For this reason, if the primary and secondary servers are in separate sites, it is good practice to place the witness server in the same data center as the primary server, as opposed to with the secondary server. This reduces the likelihood of a failover caused by a network outage between the data centers that leaves them isolated from each other. The diagram in Figure 2-6 illustrates a database mirroring topology configured in High Safety with Automatic Failover mode.

Figure 2-6. Database mirroring in High Safety with Automatic Failover mode

High Safety mode combines the negative aspects of the other two modes. You have the same performance degradation that you expect with High Safety with Automatic Failover, but you also have the manual server failover associated with High Performance mode. The benefit that High Safety mode offers is resilience in the event that the witness goes offline. If database mirroring loses the witness server, instead of suspending the mirroring session to avoid a split-brain scenario, it switches to High Safety mode. This means that database mirroring continues to function, but without automatic failover.

High Safety mode is also useful in planned failover scenarios. If your primary server is online but you need to fail over for maintenance, then you can change to High Safety mode. This essentially puts the database in a safe state, where there is no possibility of data loss, without you needing to configure a witness server. You can then fail over the database. After the maintenance work is complete and you have failed the database back, you can revert to High Performance mode.
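The planned failover pattern just described maps onto a handful of ALTER DATABASE commands. The following is a minimal sketch; the database name Sales is hypothetical, and it assumes that a mirroring session has already been established and is currently running in High Performance mode.

-- Run on the principal. Switch the session to synchronous (High Safety) mode
-- so that the mirror catches up and no data can be lost during the failover.
ALTER DATABASE Sales SET PARTNER SAFETY FULL;

-- Once the session is synchronized, perform the manual failover.
-- The mirror becomes the new principal.
ALTER DATABASE Sales SET PARTNER FAILOVER;

-- After maintenance is complete and the database has been failed back,
-- revert to asynchronous (High Performance) mode.
ALTER DATABASE Sales SET PARTNER SAFETY OFF;

You can confirm that the session has reached the SYNCHRONIZED state before issuing the failover by checking the mirroring_state_desc column in sys.database_mirroring.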

Tip Database mirroring is not supported on databases that use In-Memory OLTP. You will be unable to configure database mirroring if your database contains a memory-optimized filegroup.

AlwaysOn Availability Groups

AlwaysOn Availability Groups (AOAG) replaces database mirroring and is essentially a merger of database mirroring and clustering technologies. SQL Server is installed as a stand-alone instance (as opposed to an AlwaysOn Failover Clustered Instance) on each node of a cluster. A cluster-aware application, called an Availability Group Listener, is then installed on the cluster; it is used to direct traffic to the correct node. Instead of relying on shared disks, however, AOAG compresses the log stream and sends it to the other nodes, in a similar fashion to database mirroring.

AOAG is the most appropriate technology for high availability in scenarios where you have small databases with low write profiles. This is because, when used synchronously, it requires that the data is committed on all synchronous replicas before it is committed on the primary database. Unlike with database mirroring, however, you can have up to eight replicas, including two synchronous replicas. AOAG may also be the most appropriate technology for implementing high availability in a virtualized environment. This is because the shared disk required by clustering may not be compatible with some features of the virtual estate. As an example, when VMs use shared disks presented over Fibre Channel, VMware does not support the use of vMotion, which is used to manually move virtual machines (VMs) between physical servers, or the Distributed Resource Scheduler (DRS), which is used to automatically move VMs between physical servers based on resource utilization.

Tip The limitations surrounding shared disks with VMware features can be worked around by presenting the storage directly to the guest OS over an iSCSI connection, at the expense of some performance degradation.

AOAG is the most appropriate technology for DR when you have a proactive failover requirement but do not need to implement a load delay. AOAG may also be suitable for disaster recovery in scenarios where you wish to utilize your DR server for offloading reporting. When used for disaster recovery, AOAG works in an asynchronous mode. This means that it is possible to lose data in the event of a failover. The RPO is nondeterministic and is based on the time of the last uncommitted transaction.

When you use database mirroring, the secondary database is always offline. This means that you cannot use the secondary database to offload any reporting or other read-only activity. It is possible to work around this by creating a database snapshot against the secondary database and pointing read-only activity to the snapshot. This can still be complicated, however, because you must configure your application to issue read-only statements against a different network name and IP address. Availability groups, on the other hand, allow you to configure one or more replicas as readable. The only limitation is that readable replicas and automatic failover cannot be configured on the same secondaries. The norm, however, would be to configure readable secondary replicas in asynchronous commit mode so that they do not impair performance.
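As an illustration of combining synchronous and asynchronous replicas, the following is a minimal sketch of creating an availability group. The group name (App1AG), database name (Sales), node names, domain, and port are all hypothetical. It also assumes that the database mirroring endpoints already exist on each instance, that the database is in the FULL recovery model with a full backup taken, and that each secondary is subsequently joined with ALTER AVAILABILITY GROUP ... JOIN.

-- Run on the instance that will host the primary replica.
CREATE AVAILABILITY GROUP App1AG
FOR DATABASE Sales
REPLICA ON
    N'Node1' WITH (ENDPOINT_URL = N'TCP://Node1.domain.com:5022',
                   AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                   FAILOVER_MODE = AUTOMATIC),   -- primary
    N'Node2' WITH (ENDPOINT_URL = N'TCP://Node2.domain.com:5022',
                   AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                   FAILOVER_MODE = AUTOMATIC),   -- HA partner
    N'Node4' WITH (ENDPOINT_URL = N'TCP://Node4.domain.com:5022',
                   AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
                   FAILOVER_MODE = MANUAL);      -- DR replica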

To further simplify this, the Availability Group Replica checks for the read-only or read-intent properties in an application's connection string and points the application to the appropriate node (a configuration sketch follows Figure 2-7). This means that you can easily scale reporting and database maintenance routines horizontally with very little development effort and with the applications being able to use a single connection string. Because AOAG allows you to combine synchronous replicas (with or without automatic failover), asynchronous replicas, and replicas for read-only access, it allows you to satisfy high availability, disaster recovery, and reporting scale-out requirements using a single technology.

When you are using AOAG, failover does not occur at the database level, nor at the instance level. Instead, failover occurs at the level of the availability group. The availability group is a concept that allows you to group similar databases together so that they can fail over as an atomic unit. This is particularly useful in consolidated environments, because it allows you to group together the databases that map to a single application. You can then fail over this application to another replica for the purposes of DR testing, among other reasons, without having an impact on the other data-tier applications that are hosted on the instance.

No hard limits are imposed for the number of availability groups you can configure on an instance, nor are there any hard limits for the number of databases on an instance that can take part in AOAG. Microsoft, however, has tested up to, and officially recommends, a maximum of 100 databases and 10 availability groups per instance. The main limiting factor in scaling the number of databases is that AOAG uses a database mirroring endpoint, and there can only be one per instance. This means that the log stream for all data modifications is sent over the same endpoint.

Figure 2-7 depicts how you can map data-tier applications to availability groups for independent failover. In this example, a single instance hosts two data-tier applications. Each application has been added to a separate availability group. The first availability group has failed over to Node2. Therefore, the availability group listeners point traffic for Application1 to Node2 and traffic for Application2 to Node1. Because each availability group has its own network name and IP address, and because these resources fail over with the AOAG, the application is able to seamlessly reconnect to the databases after failover.

Figure 2-7. Availability groups failover
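The read-intent routing described before Figure 2-7 is configured per replica. The following is a minimal sketch, reusing the hypothetical App1AG group and node names from the earlier sketch; the routing URL, port, and the use of Node3 as the readable secondary are also assumptions.

-- Allow read-intent connections against Node3 when it is a secondary,
-- and define the URL that read-intent traffic should be routed to.
ALTER AVAILABILITY GROUP App1AG
MODIFY REPLICA ON N'Node3'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY,
                      READ_ONLY_ROUTING_URL = N'TCP://Node3.domain.com:1433'));

-- Define the routing list that the primary uses for read-intent requests.
ALTER AVAILABILITY GROUP App1AG
MODIFY REPLICA ON N'Node1'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'Node3')));

With this in place, a connection that specifies the availability group listener name together with ApplicationIntent=ReadOnly in its connection string is transparently redirected to the readable secondary.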

The diagram in Figure 2-8 depicts an AlwaysOn Availability Group topology. In this example, there are four nodes in the cluster and a disk witness. Node1 is hosting the primary replicas of the databases, Node2 is being used for automatic failover, Node3 is being used to offload reporting, and Node4 is being used for DR. Because the cluster is stretched across two data centers, multi-subnet clustering has been implemented. Because there is no shared storage, however, there is no need for SAN replication between the sites.

Figure 2-8. AlwaysOn Availability Group topology

Note AlwaysOn Availability Groups are discussed in more detail in Chapter 5 and Chapter 6.

Automatic Page Repair

If a page becomes corrupt in a database configured as a replica in an AlwaysOn Availability Group topology, then SQL Server attempts to fix the corruption by obtaining a copy of the page from one of the secondary replicas. This means that a logical corruption can be resolved without you needing to perform a restore or to run DBCC CHECKDB with a repair option. However, automatic page repair does not work for the following page types:

- File Header page
- Database Boot page
- Allocation pages: GAM (Global Allocation Map), SGAM (Shared Global Allocation Map), and PFS (Page Free Space)

If the primary replica fails to read a page because it is corrupt, it first logs the page in the msdb.dbo.suspect_pages table. It then checks that at least one replica is in the SYNCHRONIZED state and that transactions are still being sent to the replica. If these conditions are met, then the primary sends a broadcast to all replicas, specifying the PageID and LSN (log sequence number) at the end of the flushed log. The page is then marked as restore pending, meaning that any attempts to access it will fail, with error code 829.

After receiving the broadcast, the secondary replicas wait until they have redone transactions up to the LSN specified in the broadcast message. At this point, they try to access the page. If they cannot access it, they return an error. If they can access the page, they send the page back to the primary replica. The primary replica accepts the page from the first secondary to respond. The primary replica then replaces the corrupt copy of the page with the version that it received from the secondary replica. When this process completes, it updates the page in the msdb.dbo.suspect_pages table to reflect that it has been repaired by setting the event_type column to a value of 5 (Repaired).

If a secondary replica fails to read a page while redoing the log because it is corrupt, it places the secondary into the SUSPENDED state. It then logs the page in the msdb.dbo.suspect_pages table and requests a copy of the page from the primary replica. The primary replica attempts to access the page. If it is inaccessible, then it returns an error and the secondary replica remains in the SUSPENDED state. If it can access the page, then it sends it to the secondary replica that requested it. The secondary replica replaces the corrupt page with the version that it obtained from the primary replica. It then updates the msdb.dbo.suspect_pages table with an event_type of 5. Finally, it attempts to resume the AOAG session.

Note It is possible to manually resume the session, but if you do, the corrupt page is hit again during the synchronization. Make sure you repair or restore the page on the primary replica first.
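Because both the primary and secondary replicas log suspect and repaired pages to msdb.dbo.suspect_pages, a simple query lets you monitor automatic page repair activity. The following is a minimal sketch using the standard columns of that table; the comments summarize the event_type interpretation described above.

-- Review pages that have been flagged as suspect, and whether they have been repaired.
-- event_type 5 indicates a repaired page; values 1 through 3 indicate unresolved errors.
SELECT  database_id,
        file_id,
        page_id,
        event_type,
        error_count,
        last_update_date
FROM    msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;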

Log Shipping

Log shipping is a technology that you can use to implement disaster recovery. It works by backing up the transaction log on the principal server, copying it to the secondary server, and then restoring it. It is most appropriate to use log shipping in DR scenarios in which you require a load delay, because this is not possible with AOAG.

As an example of where a load delay may be useful, consider a scenario in which a user accidentally deletes all of the data from a table. If there is a delay before the database on the DR server is updated, then it is possible to recover the data for this table from the DR server and then repopulate the production server. This means that you do not need to restore a backup to recover the data. Log shipping is not appropriate for high availability, since there is no automatic failover functionality. The diagram in Figure 2-9 illustrates a log shipping topology.

Figure 2-9. Log shipping topology

Recovery Modes

In a log shipping topology, there is always exactly one principal server, which is the production server. It is possible to have multiple secondary servers, however, and these servers can be a mix of DR servers and servers used to offload reporting. When you restore a transaction log, you can specify three recovery modes: Recovery, NoRecovery, and Standby. The Recovery mode brings the database online, which is not supported with log shipping. The NoRecovery mode keeps the database offline so that more backups can be restored. This is the normal configuration for log shipping and is the appropriate choice for DR scenarios.

The Standby option brings the database online, but in a read-only state, so that you can restore further backups. This functionality works by maintaining a TUF (Transaction Undo File). The TUF file records any uncommitted transactions in the transaction log. This means that these uncommitted transactions can be rolled back, which allows the database to be made accessible (although it is read-only). The next time a restore needs to be applied, the uncommitted transactions in the TUF file are reapplied to the log before the redo phase of the next log restore begins.

Figure 2-10 illustrates a log shipping topology that uses both a DR server and a reporting server.

Figure 2-10. Log shipping with DR and reporting servers

Remote Monitor Server

Optionally, you can configure a monitor server in your log shipping topology. This helps you centralize monitoring and alerting. When you implement a monitor server, the history and status of all backup, copy, and restore operations are stored on the monitor server. A monitor server also allows you to have a single alert job, which is configured to monitor the backup, copy, and restore operations on all servers, as opposed to needing separate alerts on each server in the topology.
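To make the difference between the NoRecovery and Standby restore modes concrete, the following is a minimal sketch of the restore statements that the log shipping restore job effectively issues; the database name, backup file, and TUF path are hypothetical.

-- DR secondary: keep the database offline so that further log backups
-- can be applied (the normal log shipping configuration).
RESTORE LOG Sales
FROM DISK = N'\\FileShare\LogShipping\Sales_20160101.trn'
WITH NORECOVERY;

-- Reporting secondary: bring the database online in a read-only state.
-- Uncommitted transactions are rolled back into the Transaction Undo File
-- and reapplied before the next restore.
RESTORE LOG Sales
FROM DISK = N'\\FileShare\LogShipping\Sales_20160101.trn'
WITH STANDBY = N'\\FileShare\LogShipping\Sales.tuf';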

Caution If you wish to use a monitor server, it is important to configure it when you set up log shipping. After log shipping has been configured, the only way to add a monitor server is to tear down and reconfigure log shipping.

Failover

Unlike with other high availability and disaster recovery technologies, an amount of administrative effort is associated with failing over log shipping. To fail over log shipping, you must back up the tail end of the transaction log and copy it, along with any other uncopied backup files, to the secondary server. You then need to apply the remaining transaction log backups to the secondary server in sequence, finishing with the tail-log backup. You apply the final restore using the WITH RECOVERY option to bring the database back online in a consistent state. If you are not planning to fail back, you can reconfigure log shipping with the secondary server as the new primary server.

Combining Technologies

To meet your business objectives and non-functional requirements (NFRs), you need to combine multiple high availability and disaster recovery technologies to create a reliable, scalable platform. A classic example of this is the requirement to combine an AlwaysOn Failover Cluster with AlwaysOn Availability Groups.

The reason you may need to combine these technologies is that when you use AlwaysOn Availability Groups in synchronous mode, which you must do for automatic failover, it can cause a performance impediment. As discussed earlier in this chapter, the performance issue is caused by the transaction being committed on the secondary server before being committed on the primary server. Clustering does not suffer from this issue, however, because it relies on a shared disk resource, and therefore the transaction is only committed once. Therefore, it is common practice to first use a cluster to achieve high availability and then use AlwaysOn Availability Groups to perform DR and/or offload reporting. The diagram in Figure 2-11 illustrates a HA/DR topology that combines clustering and AOAG to achieve high availability and disaster recovery, respectively.

Figure 2-11. Clustering and AlwaysOn Availability Groups combined

The diagram in Figure 2-11 shows that the primary replica of the database is hosted on a two-node active/passive cluster. If the active node fails, the rules of clustering apply, and the shared storage, network name, and IP address are reattached to the passive node, which then becomes the active node. If both nodes are inaccessible, however, the availability group listener points the traffic to the third node of the cluster, which is situated in the DR site and is synchronized using log stream replication. Of course, when asynchronous mode is used, the database must be failed over manually by a DBA.

Another common scenario is the combination of a cluster and log shipping to achieve high availability and disaster recovery, respectively. This combination works in much the same way as clustering combined with AlwaysOn Availability Groups and is illustrated in Figure 2-12.

Figure 2-12. Clustering combined with log shipping

The diagram shows that a two-node active/passive cluster has been configured in the primary data center. The transaction log(s) of the database(s) hosted on this instance are then shipped to a stand-alone server in the DR data center. Because the cluster uses shared storage, you should also use shared storage for the backup volume and add the backup volume as a resource in the role. This means that when the instance fails over to the other node, the backup share also fails over, and log shipping continues to synchronize, uninterrupted.

Caution If failover occurs while the log shipping backup or copy jobs are in progress, then log shipping may become unsynchronized and require manual intervention. This means that after a failover, you should check the health of your log shipping jobs.
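Finally, the manual log shipping failover described in the Failover section earlier can be sketched in T-SQL. The database name and file paths are hypothetical, and the sketch assumes that all remaining log backups have already been applied to the secondary WITH NORECOVERY, leaving only the tail-log backup to restore.

-- On the principal (if it is still accessible): back up the tail of the log
-- and leave the database in the RESTORING state.
BACKUP LOG Sales
TO DISK = N'\\FileShare\LogShipping\Sales_tail.trn'
WITH NORECOVERY;

-- On the secondary: apply the tail-log backup and bring the database online
-- in a consistent state.
RESTORE LOG Sales
FROM DISK = N'\\FileShare\LogShipping\Sales_tail.trn'
WITH RECOVERY;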
