SQL Server 2014 Column Store Indexes
Vivek Sanil
Sr. Premier Field Engineer, Microsoft
Vivek.sanil@microsoft.com

Trends in the Data Warehousing Space
Approximate data volume managed by DW (today / in 3 years):
- Less than 1 TB: 41% / 17%
- 1-3 TB: 21% / 18%
- 3-10 TB: 19% / 25%
- More than 10 TB: 17% / 34%
- Don't know: 2% / 6%
Source: TDWI Report, Next Generation DW
- Scale more: DW systems continue to grow at a fast pace; scalability is a key concern when growing a system from 10s of TB to 100s of TB to PBs
- Performance at scale: the ability to analyze massive amounts of data while offering interactive query response
- Data warehousing for the masses: drive down the price per TB
Columnstore is designed to address these needs

Columnstore Index
- In-memory columnstore
- Lives in both memory and disk
- Built in to the core RDBMS engine
Customer benefits:
- 10-100x faster
- Reduced design effort
- Hyper-efficient storage subsystem
- Works on customers' existing hardware
- Easy upgrade, easy management
"By using SQL Server 2012 In-Memory Columnstore, we were able to extract about 100 million records in 2 or 3 seconds versus the 30 minutes required previously."
- Atsuo Nakajima, Asst Director, Bank of Nagoya

Columnstore Index Representation
[Figure: existing tables (partitions) with columns C1 to C6]

Columnstore Index Storage Model
Data stored column-wise
- Each page stores data from a single column
- Highly compressed
- More data fits in memory
Each column can be accessed independently
- Fetch only the columns that are needed
- Can dramatically decrease I/O
[Figure: columns C1 to C6 stored separately]

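The I/O argument above can be made concrete with a small Python sketch. It is a toy model, not SQL Server's storage format: the helper names are hypothetical, and every value is assumed to occupy a fixed 8 bytes with no compression.

```python
# Toy model: bytes touched by a row-wise vs. a column-wise layout
# when a query needs only some columns. All names are hypothetical.

VALUE_SIZE = 8  # assumption: every value occupies 8 bytes, uncompressed


def bytes_read_row_store(num_rows, all_columns, needed_columns):
    # Row-wise pages hold whole rows, so every column is read
    # even when the query needs only a few of them.
    return num_rows * len(all_columns) * VALUE_SIZE


def bytes_read_column_store(num_rows, all_columns, needed_columns):
    # Each column is stored separately, so only the needed ones are read.
    missing = set(needed_columns) - set(all_columns)
    if missing:
        raise ValueError(f"unknown columns: {sorted(missing)}")
    return num_rows * len(set(needed_columns)) * VALUE_SIZE


columns = ["C1", "C2", "C3", "C4", "C5", "C6"]
row_bytes = bytes_read_row_store(1_000_000, columns, ["C1", "C4"])
col_bytes = bytes_read_column_store(1_000_000, columns, ["C1", "C4"])
print(row_bytes, col_bytes)
```

In this model a query touching 2 of 6 columns reads one third of the data, before any compression is taken into account.
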
Batch Mode: Improving CPU Utilization
- Biggest advancement in query processing in years!
- Data moves in batches through query plan operators
- Minimizes instructions per row
- Takes advantage of cache structures
- Highly efficient algorithms
- Better parallelism

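As a rough illustration of "minimizes instructions per row", the Python sketch below counts how often a hypothetical sum operator is invoked in row mode and in batch mode. The batch size of 1000 is an assumption made for the example, not a SQL Server setting, and the model ignores caching and parallelism.

```python
# Toy model of row mode vs. batch mode for a single SUM operator.

BATCH_SIZE = 1000  # assumption for illustration only


def sum_row_mode(values):
    calls = 0
    total = 0
    for value in values:
        calls += 1          # one operator invocation per row
        total += value
    return total, calls


def sum_batch_mode(values, batch_size=BATCH_SIZE):
    calls = 0
    total = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        calls += 1          # one operator invocation per batch
        total += sum(batch)
    return total, calls


data = list(range(10_000))
row_total, row_calls = sum_row_mode(data)
batch_total, batch_calls = sum_batch_mode(data)
print(row_calls, batch_calls)
```

Both functions return the same total; only the number of per-operator invocations differs, which is the overhead batch mode amortizes.
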
Columnstore in SQL 2012
SQL Server 2012 columnstore functionality
- Non-clustered columnstore indexes
- Improved compression compared to ROW/PAGE compression
- Improved query performance for large read scenarios
Limitations
- No DML support, no updates (data refresh)
- Only secondary, non-clustered columnstore indexes supported
- Poor memory management (Resource Governor was not honored during index build/rebuild or at run time)
- Limited data type support
- Limited batch operations supported

SQL 2014 - Clustered Columnstore Index
Why is a clustered index important?
- Saves space
- Simplifies management: no secondary indexes to maintain
- Columnstore (and the clustered columnstore index) will be the PREFERRED storage engine for DW scenarios
- We encourage users to either move existing tables to CCI or start using CCI for new tables
Additional data types are supported
- High-precision decimal, datetimeoffset, binary, varbinary, uniqueidentifier, etc.
- Unsupported types: spatial, XML, max types
Clustered columnstore DDL supported
- Evolve your schema design as needed
[Chart: Sample Space Used in GB (101 million row table), comparing a table with customary indexing, a table with customary indexing (page compression), a table with no indexing, a table with no indexing (page compression), a table with a columnstore index, and a clustered columnstore. Annotation: 91% savings. Space Used = Table space + Index space]

Clustered Columnstore Index: Key Characteristics
- Available in Enterprise, Developer, and Evaluation editions
- Updateable
- Includes all columns in the table
- Only index on the table; cannot be combined with any other indexes
- Uses columnstore compression
- Columns are not physically sorted; data is stored to improve compression and performance

Nonclustered Columnstore Index: Key Characteristics
- No need to include all of the columns in the table
- Stores a copy of the columns in the index
- Is not updateable: changes require an index rebuild
- Can be combined with other indexes on the table
- Uses columnstore compression
- Columns are not physically sorted; data is stored to improve compression and performance

Archival Compression: What's New?
- Adds an additional layer of compression on top of the inherent compression used by columnstore
- Shrinks on-disk database sizes by up to 27%
- Compression applies per partition and can be set either during index creation or during rebuild
- Use archival compression only when the extra time and CPU resources needed to compress and retrieve the data are affordable

Columnstore Enhancements Summary
New functionality delivered
- Clustered and updateable columnstore index
- Columnstore archive option for data compression
- Global batch aggregation
Main benefits
- Real-time, super-fast data warehouse engine
- Ability to continue queries while updating, without the need to drop and re-create the index or use partition switching
- Huge disk space savings due to compression
- Ability to compress data 5-15x using archival per-partition compression
- Better performance and more efficient (less memory) query processing using batch mode rather than row mode

Columnstore Index Structure: Row Groups & Segments
Segment
- A segment contains the values of one column for a set of rows
- Segments are compressed
- Each segment is stored in a separate LOB
- A segment is the unit of transfer between disk and memory
Row group
- The segments for the same set of rows comprise a row group
[Figure: segments of columns C1 to C6 grouped into row groups]

Columnstore Index Processing Example
Horizontally Partition - Row Groups (~1M rows)
Vertical Partition - Segments
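The two partitioning steps can be sketched in Python as follows. This is a toy model with hypothetical helper names; a tiny row group size is used so the output is easy to inspect, whereas the deck describes row groups of roughly 1M rows.

```python
# Toy model: split a table horizontally into row groups, then split each
# row group vertically into one segment per column.


def to_row_groups(rows, max_rows):
    # Horizontal partition: consecutive slices of at most max_rows rows.
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


def to_segments(row_group, column_names):
    # Vertical partition: one list of values per column for this row group.
    return {
        name: [row[idx] for row in row_group]
        for idx, name in enumerate(column_names)
    }


column_names = ["C1", "C2", "C3"]
rows = [(i, i % 2, "x") for i in range(5)]

row_groups = to_row_groups(rows, max_rows=2)
segments = [to_segments(group, column_names) for group in row_groups]
print(segments[0])
```

Five rows with a row group size of 2 give three row groups, each holding three segments (one per column).
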
Compress Each Segment*
Some compress more than others
*Encoding and reordering not shown
Concepts Coming Together: Loading Data into a Nonclustered Columnstore Index
- Rows to load are divided into row groups
- Each row group is split into column segments (C1 to C4)
- Compressed column segments are added to the columnstore

Syntax
Clustered columnstore index:
    CREATE CLUSTERED COLUMNSTORE INDEX CL_Simple ON SIMPLETABLE
    WITH (MAXDOP = 0) ON [PRIMARY];
Nonclustered columnstore index (the columns have to be specified):
    CREATE COLUMNSTORE INDEX NCI_Simple ON SIMPLETABLE
    (SimpleID, SimpleAddressID, SimpleStateID, Amt);
DROP_EXISTING = ON is required if there is an existing clustered index / columnstore index:
    CREATE CLUSTERED COLUMNSTORE INDEX CL_Simple ON SIMPLETABLE
    WITH (DROP_EXISTING = ON) ON [PRIMARY];

Limitations and Restrictions
- Combination with nonclustered indexes: a table with a clustered columnstore index cannot have any type of nonclustered index
- Constraints: a table with a clustered columnstore index cannot have unique constraints, primary key constraints, or foreign key constraints
- Views: cannot be created on a view or indexed view
- Keywords: cannot be created by using the INCLUDE, ASC, and DESC keywords

Unsupported Data Types
The following data types are not supported:
- ntext, text, and image
- varchar(max) and nvarchar(max)
- rowversion (and timestamp)
- sql_variant
- CLR types (hierarchyid and spatial types)
- xml

Updatable Columnstore Index
- The table consists of a column store and a delta (row) store
- DML (update, delete, insert) operations leverage the delta store
- INSERT: values always land in the delta store*
- DELETE: a logical operation; data is physically removed only after a REBUILD operation is performed
- UPDATE: a DELETE followed by an INSERT
- BULK INSERT: if the batch is smaller than about 100k rows (102,400), inserts go into the delta store; otherwise into the columnstore
- SELECT: unifies data from the column and row stores (internal UNION operation)
Tuple Mover
- Converts data into columnar format once a delta store row group is full (about 1M rows)
- The REORGANIZE statement forces the Tuple Mover to start
[Figure: column store and delta (row) store, each with columns C1 to C6]

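The DML behavior described above can be modeled with a short Python class. This is a toy model with hypothetical names, not the engine's implementation; in particular, removing delta store rows directly on delete is an assumption of the sketch.

```python
# Toy model: compressed rows + delete markers + a delta (row) store.


class ToyColumnstoreTable:
    def __init__(self, compressed_rows):
        self.compressed = list(compressed_rows)  # read-only columnstore part
        self.deleted = set()                     # positions marked as deleted
        self.delta = []                          # row store for new rows

    def insert(self, row):
        self.delta.append(row)                   # inserts land in the delta store

    def delete(self, predicate):
        # Compressed rows are only marked (logical delete).
        for pos, row in enumerate(self.compressed):
            if predicate(row):
                self.deleted.add(pos)
        # Assumption: rows still in the delta store are removed directly.
        self.delta = [row for row in self.delta if not predicate(row)]

    def update(self, predicate, change):
        old_rows = [row for row in self.select() if predicate(row)]
        self.delete(predicate)                   # UPDATE = DELETE ...
        for row in old_rows:
            self.insert(change(row))             # ... followed by INSERT

    def select(self):
        live = [row for pos, row in enumerate(self.compressed)
                if pos not in self.deleted]
        return live + self.delta                 # union of both stores

    def rebuild(self):
        # Rebuild physically drops deleted rows and absorbs the delta store.
        self.compressed = self.select()
        self.deleted.clear()
        self.delta = []


table = ToyColumnstoreTable([(1, "a"), (2, "b"), (3, "c")])
table.insert((4, "d"))
table.delete(lambda row: row[0] == 2)
table.update(lambda row: row[0] == 3, lambda row: (row[0], "C"))
visible = sorted(table.select())
marked_before_rebuild = len(table.deleted)
table.rebuild()
print(visible, marked_before_rebuild)
```

Until `rebuild()` runs, the deleted and updated rows still occupy space in the compressed part and are only filtered out at read time.
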
RowGroup DMV
    SELECT * FROM sys.column_store_row_groups
Row group states:
- OPEN: row store (deltastore) that can accept rows
- CLOSED: full, waiting to be compressed
- COMPRESSED: stored in columnstore format
- RETIRED*: all rows deleted
- INVISIBLE: data in memory only
Each row group has its own deltastore

Bulk Insert Optimizations
- The threshold for the Tuple Mover is now 102,400 rows
- < 102,400 rows: inserted into the delta store
- >= 102,400 rows: inserted directly into the columnstore
- If a batch is greater than 1,048,576 rows, the row group size is limited to 1,048,576
- Less-than-full columnstore row groups created by bulk insert will not be consolidated
- Batches of 90K-row inserts: you eventually get large segments
- Batches of 120K-row inserts: you get many 120K segments, and long-term performance may not be optimal
- An index rebuild will fix this by defragmenting the index, but that is resource intensive
- The physical order of the data file determines how segments are created

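The thresholds above can be expressed as a small Python function. This is a sketch under stated assumptions: the function name and return shape are hypothetical, and sending a trailing remainder smaller than 102,400 rows to the delta store is an assumption that goes beyond what the slide states.

```python
# Toy model of where the rows of one bulk-insert batch land.

BULK_THRESHOLD = 102_400        # from the slide
MAX_ROWGROUP_ROWS = 1_048_576   # from the slide


def route_bulk_insert(batch_rows):
    if batch_rows < 0:
        raise ValueError("batch_rows must be non-negative")
    compressed = []
    remaining = batch_rows
    # Carve off compressed row groups while enough rows remain.
    while remaining >= BULK_THRESHOLD:
        size = min(remaining, MAX_ROWGROUP_ROWS)
        compressed.append(size)
        remaining -= size
    # Assumption: whatever is left (below the threshold) goes to the delta store.
    return {"columnstore_rowgroups": compressed, "delta_store_rows": remaining}


for batch in (90_000, 120_000, 2_252_152):
    print(batch, route_bulk_insert(batch))
```

In this model a 90K batch lands entirely in the delta store, while a 120K batch becomes one small compressed row group, which is the pattern the slide warns about for long-term performance.
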
Bulk Insert Optimizations (continued)
- Bulk insert of < 102,400 rows
- Bulk insert of 105K rows (>= 102,400 rows)
- ALTER INDEX REBUILD

Tuple Mover
- Runs every 5 minutes by default
- When a row store reaches 1,048,576 rows, converts it to columnstore
- De-allocates row groups where all rows are deleted
- Start manually with ALTER INDEX REORGANIZE
Extended events
- columnstore_tuple_mover_begin_compress
- columnstore_tuple_mover_end_compress

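A minimal Python model of one Tuple Mover pass is sketched below. The function names and the dictionary layout are hypothetical, and the OPEN / CLOSED / COMPRESSED labels are used only as state names for the sketch; scheduling (the 5-minute interval) and de-allocation of fully deleted row groups are not modeled.

```python
# Toy model: trickle inserts fill delta row groups; a tuple mover pass
# compresses the ones that have reached the maximum size.

MAX_ROWS = 1_048_576  # from the slide


def insert_rows(rowgroups, count, max_rows=MAX_ROWS):
    """Add `count` rows to the delta store, closing row groups as they fill."""
    while count > 0:
        if not rowgroups or rowgroups[-1]["state"] != "OPEN":
            rowgroups.append({"state": "OPEN", "rows": 0})
        open_rg = rowgroups[-1]
        take = min(count, max_rows - open_rg["rows"])
        open_rg["rows"] += take
        count -= take
        if open_rg["rows"] == max_rows:
            open_rg["state"] = "CLOSED"   # full, waiting to be compressed


def tuple_mover_pass(rowgroups):
    """Compress every CLOSED row group; OPEN ones are left untouched."""
    moved = 0
    for rg in rowgroups:
        if rg["state"] == "CLOSED":
            rg["state"] = "COMPRESSED"
            moved += 1
    return moved


rowgroups = []
insert_rows(rowgroups, 2_500_000)
states_before = [rg["state"] for rg in rowgroups]
moved = tuple_mover_pass(rowgroups)
states_after = [rg["state"] for rg in rowgroups]
print(states_before, moved, states_after)
```

The partially filled row group stays open after the pass, which is why rows can remain in the delta store indefinitely under light insert load.
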
Tuple Mover Control
- The Tuple Mover does consume resources
- Trace flag 634 disables the Tuple Mover (for edge cases)
- When disabled, it has to be invoked manually with ALTER INDEX ( ) REORGANIZE/REBUILD
If disabled and not manually invoked:
- Can cause performance issues when querying data
- Can end up with multiple row stores (deltastores) that won't be compressed

Index Maintenance Operations
Index rebuild: re-creates the clustered columnstore index completely
- ALTER TABLE REBUILD
- ALTER INDEX REBUILD
- CREATE CLUSTERED COLUMNSTORE INDEX WITH (DROP_EXISTING = ON)
Reorganize: forces delta store operations only
- ALTER INDEX REORGANIZE: compresses closed row groups
- ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON): compresses all row groups

Statistics for Columnstore Index
The need for statistics
- The optimizer requires statistics histograms to generate query plans that use columnstore indexes
Best practices
- Keep statistics up to date
- Create multicolumn statistics on correlated columns

Best Practices
- Create columnstore indexes on large fact tables
- Leverage star joins
- Join on integer keys
- Leverage parallelism
- Provide sufficient memory
- Use in conjunction with partitioned tables

Non-Clustered Columnstore Indexes: Do We Still Need Them?
Yes, if you need constraints or triggers on the table
- Creating a CCI will fail if there is a B-tree enforcing a key constraint
- However, you won't be able to update the table
No, if constraints are not needed
- Create the table and add a clustered columnstore index
- No other indexes to worry about
- Can insert / update / delete in the table
- Consistent fast query performance

Updating Non-Clustered Columnstore
- Disable the index, update the data, rebuild
- or use partition switching
- or use a delta table and UNION ALL

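The delta table and UNION ALL option can be illustrated with Python's built-in `sqlite3` module. SQLite has no columnstore indexes, so this only demonstrates the query pattern; the table and view names (`sales_main`, `sales_delta`, `sales_all`) are hypothetical.

```python
import sqlite3

# sales_main stands in for the large table carrying the read-only index;
# sales_delta is a small, updatable table that receives new rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_main  (id INTEGER, amt REAL);
    CREATE TABLE sales_delta (id INTEGER, amt REAL);
    CREATE VIEW sales_all AS
        SELECT id, amt FROM sales_main
        UNION ALL
        SELECT id, amt FROM sales_delta;
""")

# Historical rows are loaded once into the main table.
conn.executemany("INSERT INTO sales_main VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# New rows go to the delta table, so the main table is never modified.
conn.execute("INSERT INTO sales_delta VALUES (?, ?)", (3, 5.0))

# Queries read through the view and see both tables.
row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amt) FROM sales_all"
).fetchone()
print(row_count, total)
```

Periodically the delta rows would be moved into the main table (for example by one of the other two options on this slide) and the delta table emptied.
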
Questions?