Introduction to Column Stores with Microsoft SQL Server 2016


Seminar Database Systems
Master of Science in Engineering, Major Software and Systems
HSR Hochschule für Technik Rapperswil
Supervisor: Prof. Hansjörg Huser
Author: Andreas Büchler
Rapperswil, December 2015

Abstract

Fast performance is one of the most important requirements for database systems today, and there are several approaches to get more performance out of them. In a database, a table is organized into rows and columns. Microsoft SQL Server, like most relational database management systems, uses row-based data storage in the background, a so-called Row Store. This paper is about the Column Store of Microsoft SQL Server 2016, which was designed and engineered specifically to provide more performance for OLAP workloads. The Column Store Index is a technology in which the data is stored in a columnar format. Its advantage is the ability to read the values of a particular column without reading the values of the other columns, because the values of each column of a table are grouped and stored together. Microsoft SQL Server provides a non-clustered and a clustered Column Store Index. Both use several compression techniques and have a number of limitations. This paper includes a benchmark that compares the performance of a Column Store in Microsoft SQL Server with a Row Store in Microsoft SQL Server and in PostgreSQL.

Keywords: Microsoft SQL Server 2016; PostgreSQL 9.4; OLAP; Data Warehouse; Column Store Index; clustered; non-clustered; compression; Row Store; benchmark; database; SQL

Table of Contents

1 Introduction and Overview
  1.1 Motivation
  1.2 Problem Statement
  1.3 Structure of this Work
2 Column Store
  2.1 Row Store
  2.2 Introduction Column Store
  2.3 Why do we need a Column Store?
  2.4 Segments and Row Groups
  2.5 Non-Clustered Column Store
  2.6 Clustered Column Store
  2.7 Compression
    2.7.1 Run-Length Encoding (RLE)
    2.7.2 Bit-Array Encoding
    2.7.3 Dictionary Encoding
  2.8 Supported Data Types
  2.9 Limitations
  2.10 Creating a Column Store
    2.10.1 Create a clustered Column Store Index
    2.10.2 Create a non-clustered Column Store Index
    2.10.3 Convert a Column Store to a Row Store
3 Benchmarking
  3.1 Environment
  3.2 Notes about the SQL Scripts
  3.3 Data Set
  3.4 How is benchmarked?
  3.5 Results
    3.5.1 Benchmark with Table gnis
    3.5.2 Benchmark with Tables osm_poi_ch_*
    3.5.3 Short Conclusion on the Benchmark
4 Conclusion and Outlook
Bibliography
List of Figures
Appendix
  A. Microsoft SQL Server Scripts
  B. PostgreSQL Scripts

1 Introduction and Overview

In the context of the seminar Advanced Database Systems at the University of Applied Sciences Rapperswil, this paper gives an introduction to Column Stores with Microsoft SQL Server 2016. It is mainly based on the following sources: [1], [2], [3] and [4]. The paper fulfills the following tasks given in the seminar context:

1. Give a brief introduction to the Column Store feature of Microsoft SQL Server 2016.
2. Benchmark the Column Store of Microsoft SQL Server 2016 against PostgreSQL 9.4.
3. Present lessons learned and a conclusion.

1.1 Motivation

Database management systems (DBMS) have been in use for a very long time, and every day more data is stored in them. Fast performance is therefore one of the most important requirements today. There are several approaches to get more performance out of database systems. DBMS manufacturers constantly improve their processing engines, data store structures, and so forth to gain more speed. Moreover, a DBMS can be run on the most modern hardware available, which enables concurrent transaction processing and high data throughput. The Column Store was designed and engineered specifically to provide more performance within a specific context.

1.2 Problem Statement

The previous section has shown why better performance is needed. This paper is about the Column Store feature in Microsoft SQL Server 2016. It is a different approach from the Row Store data structure and should therefore provide more performance within a specific context. This paper is going to answer the following questions:

- What is a Column Store?
- Why or when is a Column Store needed?
- How does a Column Store work internally?
- What is the difference between a clustered and a non-clustered Column Store?
- What are the limitations of the Column Store?
- What is the performance gain of a Column Store?
- Should a Column Store be used?

1.3 Structure of this Work

Chapter 2 first gives an introduction to the Row Store and then discusses the Column Store. It covers why a Column Store might be needed, as well as compression, segments, and row groups. The end of chapter 2 lists the limitations of a Column Store and provides some SQL snippets to create and maintain one. Chapter 3 is about the benchmark of a Column Store in Microsoft SQL Server 2016 compared to a Row Store in Microsoft SQL Server 2016 and in PostgreSQL 9.4. It gives a short overview of the data set used in both database systems and of the environment and setup for the benchmark. At the end of chapter 3 the results of the benchmark are presented and discussed. The last chapter, chapter 4, summarizes the main points of this paper and gives a short outlook on Column Stores.

2 Column Store

This chapter is about the Column Store of Microsoft SQL Server. As a side note, Microsoft calls its Column Store the Column Store Index. Microsoft has continuously improved the Column Store Index feature since its introduction. As a reminder, this paper, and this chapter in particular, covers the Column Store Index features of the current Microsoft SQL Server version 2016. To understand the Column Store, it is wise to first understand how Microsoft SQL Server stores its data when no Column Store Index is used. Section 2.1 gives a brief introduction to the Row Store, which is commonly used not only in Microsoft SQL Server.

2.1 Row Store

In a database, a table is organized into rows and columns. But how is the data stored? Microsoft SQL Server, like most relational database management systems, uses row-based data storage in the background, a so-called Row Store. Each row is a contiguous unit on the page. This means that all the column values of a row are stored together, side by side. The advantage is that data which belongs together can be written to and read from storage quickly. In an online transaction processing (OLTP) application this technique provides the best performance. The following figure shows the basic concept of row-based storage.

Figure 1: Row Store Concept. Source: [5] (adapted)

In addition, SQL Server organizes the data into pages. Each page has a size of 8 KB and contains a header at the beginning and row offsets at the end. The data rows are placed on the page right after the header, and further rows are added to the same page as long as there is enough space for a whole row. To improve performance, it is possible to create an index on one or more columns. This speeds up searching and filtering when the indexed columns are used in the query criteria. This paper does not go into more detail about paging or index creation. Please refer to additional literature if you want to dig deeper into these topics.

2.2 Introduction Column Store

The Column Store Index is a technology in which the data is stored in a (compressed) columnar format instead of the widely used row-based format (see section 2.1). The following figure shows the main difference between column-based storage and row-based storage. It extends Figure 1 from the previous section.

Figure 2: Row Store Concept compared to Column Store Concept. Source: [5]

Figure 2 illustrates both concepts, Row Store and Column Store, and shows very clearly where the Column Store is at a disadvantage compared to a Row Store. If a query needs all columns of a row (like select * from table where ...), a column-based system needs multiple disk operations to retrieve all the data, whereas a row-based system needs only one disk operation (assuming you want one specific row out of hundreds of rows). On the other hand, if only the column Sales is needed for all rows, a column-based system has all of that data grouped together, whereas a row-based system has to go through the whole data set row by row to read the sales value.

The advantage of a Column Store is the ability to read the values of a particular column without reading the values of the other columns, because the values of each column of a table are grouped and stored together. In a Row Store, by contrast, each row has to be retrieved to fetch the particular column value, because the individual columns of a row are stored together. In row-based storage the whole page therefore always has to be loaded into memory to get one particular column value of one row. This means that in row-based storage multiple pages might be loaded to get all the values of one particular column, whereas in column-based storage a single segment (more about segments in section 2.4) probably contains all the values of that column.

Storing all values of one particular column together also leads to better compression of the data. Compression is discussed in detail in section 2.7.

Microsoft SQL Server 2016 supports two different Column Store types: non-clustered Column Store Indexes and clustered Column Store Indexes. They share the same basic structure to store the data. The differences between the two types are discussed in sections 2.5 and 2.6. The next section explains why such a column-based system has its advantages and where it is used to provide much more performance than a row-based system.

2.3 Why do we need a Column Store?

The Column Store Index is a technology which optimizes the performance of analytic queries as they are typically used on Data Warehouses. According to Microsoft, column-based storage and query processing give a performance gain of up to 10x compared with traditional row-oriented storage [2]. Performance is discussed in chapter 3, which contains a benchmark test.

A typical Data Warehouse query joins some fact and dimension tables and aggregates over a subset of the columns of a fact table. The following listing shows an example of such a query:

    select da.articlecode, sum(fs.quantity) as [Units Sold]
    from dbo.factsales fs
        join dbo.dimarticles da on da.articleid = fs.articleid
        join dbo.dimdates dd on dd.dateid = fs.dateid
    where dd.year = 2015
    group by da.articlecode

Listing 1: Data Warehouse query. From: [1] page 712, Listing 34-1 (adapted)


If this query is executed on row-based storage, SQL Server needs to access each row and load the entire row (and therefore the entire page) into memory. All columns are always loaded, regardless of how many columns are actually required. In this case only three columns of the fact table are actually needed. If millions of rows are processed this way, which is quite common in a production Data Warehouse, this leads to high memory consumption, high processing load, and long run times, which is certainly not desired. With a Column Store Index only the three needed columns are actually loaded, so fewer read operations are performed. Less data is read, less memory is consumed, and the entire process speeds up.

2.4 Segments and Row Groups

To understand the Column Store Index of Microsoft SQL Server it is also important to understand segments and row groups. A segment in a Column Store Index contains the values of one column for a particular set of rows. Such a set of rows is called a row group. The segments are compressed and always hold data of one homogeneous type. A row group can contain up to 2^20 = 1'048'576 rows. Compression is discussed later in section 2.7. A segment is also the unit of transfer between disk and memory. [6] [7] The following figure illustrates the segments together with the row groups.

Figure 3: Columns, Segments and Row Groups. Source: [8]
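The relationship between columns, row groups, and segments can be illustrated with a small Python sketch. It is only a simplified model of the idea described above, not SQL Server's actual storage format: the function name `split_into_row_groups` and the dictionary-of-lists table representation are assumptions made for this illustration, and compression is left out.

```python
from typing import Any, Dict, List

# Upper bound for the number of rows in one row group, as stated in the text.
ROW_GROUP_MAX_ROWS = 2 ** 20  # 1,048,576


def split_into_row_groups(
    table: Dict[str, List[Any]], max_rows: int = ROW_GROUP_MAX_ROWS
) -> List[Dict[str, List[Any]]]:
    """Split a column-oriented table into row groups.

    `table` maps a column name to the list of its values (hypothetical
    representation). Each returned row group maps a column name to one
    segment, i.e. the values of that column for the rows of the group.
    """
    if not table:
        return []
    n_rows = len(next(iter(table.values())))
    row_groups = []
    for start in range(0, n_rows, max_rows):
        # One segment per column, all covering the same range of rows.
        row_groups.append(
            {col: values[start:start + max_rows] for col, values in table.items()}
        )
    return row_groups


# Tiny example with an artificially small row group size.
sales = {"id": [1, 2, 3, 4, 5], "sales": [10, 20, 30, 40, 50]}
groups = split_into_row_groups(sales, max_rows=2)
print(len(groups), "row groups; first 'sales' segment:", groups[0]["sales"])
```

A query that only needs the column `sales` would touch only the `sales` segment of each row group, which mirrors the reasoning in section 2.3.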

2.5 Non-Clustered Column Store

As discussed in the previous sections, there are different kinds of Column Stores. This section is about the non-clustered Column Store Index and its particularities. The clustered Column Store Index is discussed in the following section 2.6.

A non-clustered Column Store is created in addition to a row store table, like a traditional B-Tree index. This means that the row-based table is not transformed into a column-based table. The non-clustered Column Store Index differs from a traditional B-Tree, however, because it stores its data in a columnar structure. The base table remains the same and is still organized in a row-based manner. This implies that the index contains a copy of part or all of the rows and columns of the underlying table, so SQL Server needs extra space to store it. In return, analytics can run on the fast non-clustered Column Store Index while transactions are processed on the row-based table at the same time. It is possible to define which columns should be included in the index, but there is no need to define a specific order of the columns, because they are stored column by column.

With Microsoft SQL Server 2016 it is now possible to update the table on which the non-clustered Column Store Index is based. The index is updated as soon as the data of the underlying table changes. [3] This was not possible in versions 2012 and 2014 of Microsoft SQL Server.

Creating a Column Store Index is a memory-consuming operation. The memory consumption (at least for Microsoft SQL Server 2012; version 2016 might need less or more memory) can roughly be estimated with the following formula [1] page 724:

    Memory Grant Request (MB) = (4.2 * Number of columns in the index + 68) * (Degree of Parallelism) + (Number of text columns in the index * 34)
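The estimate above is easy to evaluate for different index definitions. The following Python sketch simply transcribes the formula from [1]; the function and parameter names are chosen for this illustration only, and the result is a rough estimate, not a value reported by SQL Server.

```python
def estimate_memory_grant_mb(
    num_columns: int, num_text_columns: int, degree_of_parallelism: int = 1
) -> float:
    """Rough memory grant estimate in MB for creating a Column Store Index.

    Direct transcription of the formula quoted in the text:
    (4.2 * columns + 68) * degree of parallelism + text columns * 34
    """
    return (4.2 * num_columns + 68) * degree_of_parallelism + num_text_columns * 34


# Example: an index over nine columns, five of them text, built single-threaded.
print(round(estimate_memory_grant_mb(9, 5, 1), 1))
```

The number of rows does not appear in the formula, and each text column adds a fixed 34 MB, which is why text columns weigh heavily in the estimate.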

An example with the table gnis (see section 3.3), which has nine columns, five of them text columns, shows that the creation of a clustered Column Store would need (4.2 * 9 + 68) * 1 + (5 * 34) = 275.8 MB of memory. Two points are interesting: text columns are expensive during the creation of a Column Store Index, and the size of the table does not matter.

A Column Store Index can also be created on a temporary table. As soon as the temporary table is dropped or the session ends, the Column Store Index is dropped as well.

2.6 Clustered Column Store

The clustered Column Store Index works more or less in the same way as the non-clustered Column Store Index. The differences are discussed in this section. In contrast to the non-clustered Column Store Index, the clustered Column Store Index is the principal storage for the entire table and therefore contains all of the columns of the table (like any clustered index). This makes the table and the index essentially the same, eliminating the need to leave the index to retrieve the table's data [9]. As the previous section has shown, the non-clustered Column Store Index can be defined on an arbitrary number of columns, which is a main difference from the clustered Column Store Index. Additionally, a clustered Column Store Index has the restriction that no other index can be created on the table.

2.7 Compression

Just as row-based tables have their own compression formats, column-based tables and indexes have theirs. These compressions are not configurable and are always applied to the Column Store. Because each segment stores data of one homogeneous type, the compression algorithms are much more efficient. In row-based storage compression is much harder, because rows usually contain data of different types. There is an additional archival compression to further reduce the data size. It is configurable but out of scope of this paper. [10]

According to [11] there are three basic phases in the creation of a Column Store Index:

1. Row Groups Separation
2. Segment Creation
3. Compression

This section discusses the compression. Row group separation and segment creation have already been discussed in the previous sections. Several compression techniques are applied to achieve the best possible compression without spending too much CPU [11]:

- Value Scale
- Bit Array
- Run-length Encoding
- Dictionary encoding
- Huffman encoding
- Lempel-Ziv-Welch
- Binary compression

Only run-length encoding, bit-array encoding, and dictionary encoding are discussed in this paper.

2.7.1 Run-Length Encoding (RLE)

This compression procedure is simple. Its effectiveness depends on the number of distinct values [11]: the fewer distinct values a column has, and the longer the runs of repeated values, the better the compression. The run-length encoding algorithm tries to figure out on which column the data should be sorted to get the best compression result. The algorithm first sorts the data on a particular column and then simply counts the consecutive repetitions of each value. The count for each value is stored in a hash table [11]. The following two figures show the algorithm once sorted on the name column and once sorted on the last name column.
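The counting step described above can be sketched in a few lines of Python. This is a minimal illustration of run-length encoding in general, not SQL Server's implementation: it keeps the runs in a plain list of (value, count) pairs instead of a hash table, and the sample names are made up.

```python
from typing import Any, List, Tuple


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repetitions into (value, run length) pairs."""
    runs: List[List[Any]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # same value as before: extend the run
        else:
            runs.append([value, 1])   # new value: start a new run
    return [(value, count) for value, count in runs]


def rle_lookup(runs: List[Tuple[Any, int]], row: int) -> Any:
    """Return the value of a zero-based row by adding up the run lengths."""
    if row < 0:
        raise IndexError("row must not be negative")
    rows_seen = 0
    for value, count in runs:
        rows_seen += count
        if row < rows_seen:
            return value
    raise IndexError("row is outside the encoded column")


# Hypothetical column that has already been sorted.
names = ["Anna", "Anna", "Anna", "Ben", "Ben", "Charlie"]
runs = rle_encode(names)
print(runs)
print(rle_lookup(runs, 4))
```

Sorting matters because the encoding only merges neighbouring values: the same six names in an alternating order would produce six runs instead of three.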


Figure 4: RLE with ordering on the name column. 5 name values; 9 last name values. Source: [12]

Figure 5: RLE with ordering on the last name column. 9 name values; 3 last name values. Source: [13]

If the values for row number 8 are needed, the algorithm simply adds up the counts until it reaches that row. Here the name for row number 8 (counting from 0) is Charlie and the last name is Simpson. Comparing the two compression results, the latter (12 stored values instead of 14) is slightly better.

2.7.2 Bit-Array Encoding

A bit array (or bit vector) is an array that contains only bits. The idea is to store data in a very compact way. It is only effective when the column to be compressed has few distinct values. For each distinct value a separate bit column is created. The following figure illustrates the encoding.
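The bit-array idea can also be sketched in Python. This is an illustrative model only: the bits are kept in ordinary Python lists for readability, whereas a real implementation would pack them into machine words. The function names and the sample values are assumptions of this example.

```python
from typing import Any, Dict, List


def bit_array_encode(values: List[Any]) -> Dict[Any, List[int]]:
    """Create one bit column per distinct value.

    Bit i of a value's column is 1 if row i holds that value, otherwise 0.
    """
    bitmaps: Dict[Any, List[int]] = {}
    for row, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return bitmaps


def bit_array_decode(bitmaps: Dict[Any, List[int]]) -> List[Any]:
    """Rebuild the original column from the bit columns."""
    if not bitmaps:
        return []
    n_rows = len(next(iter(bitmaps.values())))
    column: List[Any] = [None] * n_rows
    for value, bits in bitmaps.items():
        for row, bit in enumerate(bits):
            if bit:
                column[row] = value
    return column


# Hypothetical column with three distinct values.
countries = ["CH", "DE", "CH", "AT"]
encoded = bit_array_encode(countries)
print(encoded)
print(bit_array_decode(encoded) == countries)
```

Every distinct value costs one bit per row, which shows why the encoding only pays off for columns with few distinct values.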

Figure 6: Bit-Array. Source: [14]

2.7.3 Dictionary Encoding

The idea behind this encoding is to replace the distinct values with smaller values. Instead of always storing the value Mark, a mapping table assigns Mark the value 1, and only the value 1 is stored in the column. This uses significantly less space. Figure 7 illustrates the compression using dictionary encoding.

Figure 7: Dictionary Encoding. Source: [15]

2.8 Supported Data Types

Unfortunately, a Column Store Index cannot be created for every data type. Each column in a Column Store Index must be of one of the following common business data types: [3]

- datetimeoffset [ ( n ) ]
- datetime2 [ ( n ) ]
- datetime
- smalldatetime
- date
- time [ ( n ) ]
- float [ ( n ) ]
- real [ ( n ) ]
- decimal [ ( precision [, scale ] ) ]
- money
- smallmoney
- bigint
- int
- smallint
- tinyint
- bit
- nvarchar [ ( n ) ] (nvarchar(max) is not supported)
- nchar [ ( n ) ]
- varchar [ ( n ) ]
- char [ ( n ) ]
- varbinary [ ( n ) ] (varbinary(max) is not supported)
- binary [ ( n ) ]
- uniqueidentifier (applies starting with SQL Server 2014)

So keep in mind that no Column Store Index can be created on columns using the following data types:

- ntext, text, and image
- varchar(max) and nvarchar(max)
- xml
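When a schema is ported from another DBMS, it can be handy to check column types against the two lists above before creating the index. The following Python sketch does this purely on the basis of the lists in this section. The helper name `is_columnstore_type_supported` is hypothetical, the check does not query SQL Server, and it knows nothing about other versions or editions.

```python
import re

# Base type names taken from the list of supported data types above.
SUPPORTED_BASE_TYPES = {
    "datetimeoffset", "datetime2", "datetime", "smalldatetime", "date", "time",
    "float", "real", "decimal", "money", "smallmoney",
    "bigint", "int", "smallint", "tinyint", "bit",
    "nvarchar", "nchar", "varchar", "char", "varbinary", "binary",
    "uniqueidentifier",
}

# Matches a type name with an optional argument list, e.g. "decimal(10, 2)".
_TYPE_PATTERN = re.compile(r"([a-z0-9]+)\s*(?:\(\s*([^)]*?)\s*\))?")


def is_columnstore_type_supported(type_declaration: str) -> bool:
    """Return True if the declared type appears in the supported list."""
    match = _TYPE_PATTERN.fullmatch(type_declaration.strip().lower())
    if match is None:
        return False
    base_type, argument = match.group(1), match.group(2)
    if base_type not in SUPPORTED_BASE_TYPES:
        return False  # e.g. text, ntext, image, xml
    if argument == "max":
        return False  # the (max) variants are listed as unsupported
    return True


for declared in ["varchar(95)", "varchar(max)", "text", "decimal(10, 2)"]:
    print(declared, is_columnstore_type_supported(declared))
```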

2.9 Limitations

The non-clustered Column Store Index has the following limitations: [3]

- It is limited to the supported data types. See section 2.8.
- It cannot have more than 1024 columns.
- It cannot be created on a view or indexed view.
- It cannot be changed by using the ALTER INDEX statement. To change the non-clustered index, you must drop and re-create the Column Store Index instead. You can use ALTER INDEX to disable and rebuild a Column Store Index.
- A table with a non-clustered Column Store Index can have unique constraints, primary key constraints, or foreign key constraints, but the constraints cannot be included in the non-clustered Column Store Index.
- It cannot include the ASC or DESC keywords for sorting the index. Non-clustered Column Store Indexes are ordered according to the compression algorithms. Sorting would eliminate many of the performance benefits.

2.10 Creating a Column Store

The following sections contain some listings for creating non-clustered and clustered Column Store Indexes. These snippets can be used like recipes. [3]

2.10.1 Create a clustered Column Store Index

The following listing creates a clustered Column Store Index. The index will be in-memory. All data will be compressed and stored in a columnar manner. Because a clustered Column Store Index is created, all of the columns of the table are included. The table is automatically converted to a clustered Column Store Index. To revert a Column Store Index to a Row Store, refer to section 2.10.3.

    -- first we create a simple table as a heap.
    CREATE TABLE SimpleTable(
        ProductKey [int] NOT NULL,
        OrderDateKey [int] NOT NULL,
        DueDateKey [int] NOT NULL,
        ShipDateKey [int] NOT NULL);
    GO
    -- now we convert the table to a clustered Column Store Index.
    -- This changes the storage for the entire table from
    -- row store to Column Store.
    -- cci_simple is the name of the index.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_simple ON SimpleTable;
    -- optionally we could add:
    -- WITH (DROP_EXISTING = ON)
    -- to drop an existing index.
    GO

Listing 2: Create a clustered Column Store Index. Source: [3] (adapted)

2.10.2 Create a non-clustered Column Store Index

The following listing creates a non-clustered Column Store Index. The index will be in-memory. Remember from section 2.5 that this index is created as a secondary index on a row store table. It is not the primary storage for the entire table, as the clustered Column Store is. There is no need to include all columns of the underlying table. This kind of index is updated as soon as the underlying table changes.

    -- first we create a simple table as a heap.
    CREATE TABLE SimpleTable (
        ProductKey [int] NOT NULL,
        OrderDateKey [int] NOT NULL,
        DueDateKey [int] NOT NULL,
        ShipDateKey [int] NOT NULL);
    GO
    -- now we create a clustered index
    CREATE CLUSTERED INDEX cl_simple ON SimpleTable (ProductKey);
    GO
    -- finally we add a non-clustered Column Store Index
    CREATE NONCLUSTERED COLUMNSTORE INDEX csindx_simple
        ON SimpleTable (OrderDateKey, DueDateKey, ShipDateKey);
    GO

Listing 3: Create a non-clustered Column Store Index. Source: [3] (adapted)

2.10.3 Convert a Column Store to a Row Store

To revert a Column Store to a Row Store, one can simply drop the Column Store Index; the result is a row store based table.

    DROP INDEX MyCCI ON MyFactTable;

Listing 4: Drop a Column Store Index. Source: [3] (adapted)

3 Benchmarking

This chapter is about benchmarking the different DBMS. The DBMS used were Microsoft SQL Server 2016 CTP2.4 and PostgreSQL 9.4.

3.1 Environment

The following tables give an overview of the environment on which the benchmark was executed.

Type        Host
OS          OS X Yosemite
Processor   Intel Core 2.6 GHz
Memory      16 GB 1600 MHz DDR3
Storage     APPLE SSD SM512E

Type        Virtual Box Guest (with installed Guest Additions)
OS          Microsoft Windows Server 2012 R2
Processor   4 virtual processors at 2.59 GHz
Memory      8 GB
Notes       no anti-malware installed (no I/O slowdown);
            all Windows Server updates as of December 2015 installed;
            out-of-the-box Windows Server configuration, no special tweaks

Type        Microsoft SQL Server (running on the Guest)
Version     2016 (CTP2.4) (X64)
Notes       out-of-the-box SQL Server configuration, no special tweaks

Type        PostgreSQL (running on the Guest)
Version     PostgreSQL 9.4.5, compiled by Visual C++ build 1800, 64-bit
Notes       out-of-the-box PostgreSQL configuration, no special tweaks

3.2 Notes about the SQL Scripts

The scripts used for the benchmark on both DBMS, PostgreSQL and Microsoft SQL Server, can be found in Appendix A and B. Because the provided SQL script was written for PostgreSQL, and because the Column Store Index in Microsoft SQL Server 2016 has some limitations, the following modifications were made to the scripts for Microsoft SQL Server:

- Neither the data type text nor varchar(max) is supported in a Column Store Index. Columns defined with these data types were therefore defined as varchar(n). The actual maximum length of the values in each of these columns was determined and taken into account when creating the tables.
- The collation of the created database was set to Latin1_General_BIN. This was necessary because the 3mio, 2mio and 1mio tables are created dynamically out of the osm_poi_ch table. The select statements for these tables include an ORDER BY, so the same sort order as in PostgreSQL had to be achieved.
- The primary key on the column fid in the table gnis could not be created, because no primary key is allowed on columns used in the Column Store Index.
- Some minor modifications were made to the SQL statements:
  o the casts were rewritten
  o the PostgreSQL LIMIT keyword was replaced by the Microsoft SQL Server TOP keyword
- In the queries for the 14a, 14b and 14c benchmarks, quotes had to be added to the text in the where clause, because it was not easily possible to remove them during the import of the data. Since this has no major effect, it was left like that.

3.3 Data Set

This section gives an overview of the data set used in the benchmark. The benchmark was executed on the following tables, with this configuration for each DBMS:

- In PostgreSQL, an index was created on one column (shown in red) of each table.
- In Microsoft SQL Server, on each table
  o once an index including one column (shown in red) was created
  o and once a clustered Column Store Index (including all columns) was created.

Table gnis (count(*) = )

Column      PostgreSQL data type   Microsoft SQL data type   Distinct Values
x           double precision       double precision
y           double precision       double precision
fid         integer                integer
name        text                   varchar(95)
class       text                   varchar(15)               59
state       text                   varchar(2)                5
county      text                   varchar(13)               282
elevation   integer                integer
map         text                   varchar(27)

Table osm_poi_ch (count(*) = )

Column        PostgreSQL data type    Microsoft SQL data type   Distinct Values
id            character varying(64)   character varying(64)
lastchanged   character varying(35)   character varying(35)
changeset     integer                 integer
version       integer                 integer                   106
uid           integer                 integer
lon           double precision        double precision
lat           double precision        double precision

Table osm_poi_ch_*mio

These tables are created out of osm_poi_ch with the same data structure and the same data, but limited in the number of rows. In PostgreSQL an index was also created on the column id of each table.

Table osm_poi_tag (count(*) = )

Column   PostgreSQL data type    Microsoft SQL data type   Distinct Values
id       character varying(64)   character varying(64)
key      text                    varchar(54)
value    text                    varchar(571)

The table osm_poi_tag is related to the tables osm_poi_ch and osm_poi_ch_*mio: their id columns share many values. These columns can be used for joins, which is done in the benchmark.

3.4 How is benchmarked?

In the benchmark script for PostgreSQL the psql switch \timing on was used to measure, in milliseconds, how long each SQL statement takes. Microsoft SQL Server offers a comparable feature. It returns the parse and compile time as well as the execution time, each as CPU time and as elapsed time. To get the total duration of a SQL statement, the elapsed time of parse and compile and the elapsed time of execution have to be added together [16]. Each SQL statement was measured with the features explained above. Once the benchmark had been executed on both DBMS, the results were written down in Microsoft Excel and a bar chart was generated.

It is important to mention that the timing function of PostgreSQL returns a value with a resolution of three digits after the decimal point (#.###), while Microsoft SQL Server returns no digits after the decimal point (an integer #). It is unknown whether the time measurement function of Microsoft SQL Server does any rounding internally. The results of PostgreSQL were therefore rounded (23.4 becomes 23 while 23.5 becomes 24). This may or may not distort the results a little, but it does not really matter, because the benchmark is not about whether a Column Store is 1 ms faster than a Row Store. It is about whether a Column Store is something like 10x faster or slower than a Row Store.

The benchmark was executed several times and no major differences were observed between the runs. All results shown in the following section are from one specific benchmark run, so no averages have been calculated.

3.5 Results

This section presents and comments on the results of the benchmark. To understand the measurements and the presented values, it is important to read section 3.4 first. The following abbreviations are used in the charts:

Caption / abbreviation   Explanation
MSSQL w/ CCI             Microsoft SQL Server 2016 with clustered Column Store Index
MSSQL w/ IDX             Microsoft SQL Server 2016 with non-clustered index
PSQL                     PostgreSQL 9.4 with index

3.5.1 Benchmark with Table gnis

The following figure shows the benchmark results for the queries that involve the table gnis. This includes the queries labeled 1a, 1b, 2a, 2b, 3a, 3b, 4a, and 4b. Within each number group, queries a and b are identical, so the only difference between them is a possible caching effect.

Figure 8: Benchmark results for the queries 1a - 4b (bar chart for table gnis; query time in ms per query for MSSQL w/ CCI, MSSQL w/ IDX, and PSQL)

Query 1a, 1b:
The column fid is the primary key of the table in PSQL. There is only one matching row for this query (a search with a single tuple in the result set). Row Store and Column Store seem to have the same speed. PSQL is a bit slower in query 1a than in query 1b. In query 1b a cache might be used.

Query 2a, 2b:
Although the where clause filters on a column which does not have many distinct values, the Column Store Index is 4-5 times slower than the Row Store on MSSQL. This might be because multiple rows are returned and several columns are loaded in the select statement, which is a no-go for a Column Store. Furthermore, the search is done on a text field. The Row Store of PSQL is twice as slow as the Row Store of MSSQL.

Query 3a, 3b:
This is a typical Data Warehouse query with an aggregate function. The Column Store Index of MSSQL is therefore twice as fast as the Row Store of MSSQL. This is because only 3 columns in total are involved in the select statement. The Row Store needs to load each row to get the value of the column elevation. The Row Store of MSSQL is around 2-3 times faster than PSQL.

Query 4a, 4b:
This query is executed much faster on the Column Store of MSSQL than on the Row Store. This is because of the GROUP BY clause, which is much faster on a Column Store, because in a Row Store each row has to be loaded to compare the value. There is no index on the class column in either MSSQL or PSQL with Row Store. Moreover, the select statement only includes one column (class), so no extra lookup in other columns is needed.

3.5.2 Benchmark with Tables osm_poi_ch_*

The following figure shows the benchmark results for the queries that involve the tables osm_poi_ch_1mio, osm_poi_ch_2mio, and osm_poi_ch_3mio. This includes the queries labeled 10x to 14c. Within each number group, queries a, b, and c are identical; only the table is different.

Figure 9: Benchmark results for the queries 10x - 14c (tables osm_poi_*). Axes: Query, Query Time [ms]. Series: MSSQL w/ CCI, MSSQL w/ IDX, PSQL.

Query: 10x, 10a, 10b, 10c
Comment: Query 10x is exactly the same as query 10a; it serves to warm up the DBMS with the data. In all of these queries MSSQL with the Column Store Index is much slower than the Row Store of MSSQL and PSQL. This is interesting because queries 1a and 1b have a similar character, yet there the Column Store is not slower than the Row Store. The tables involved in the queries 10a, 10b, and 10c contain much more data, and loading all the columns (id, version, lon, and lat) takes much longer on a columnar based storage.

Query: 11a, 11b, 11c
Comment: In all of these queries the Column Store Index of MSSQL is much faster than the Row Store of MSSQL and PSQL. This is because the WHERE clause filters on the column version, which has around 100 distinct values. In a columnar based storage the rows needed for the filter can be found much faster than in a row based storage.

Query: 12a, 12b, 12c
Comment: These queries are very interesting. In query 12a the Column Store Index of MSSQL is much faster than the Row Store of MSSQL, but the latter is much slower than the Row Store of PSQL. It seems reasonable that the Column Store is slower than the PSQL Row Store, because the WHERE clause includes the columns lon and lat, which are nearly unique, and there are no aggregate functions. It is remarkable that the Row Store of MSSQL is much slower than PSQL. In query 12b the Column Store Index is twice as fast as the Row Store of MSSQL and 10 times faster than the Row Store of PSQL. In query 12c more data needs to be loaded and the Column Store Index is a bit slower than the Row Store of MSSQL, but the Row Store of PSQL is much slower (around 10 times) than MSSQL regardless of the store used.

Query: 13a, 13b, 13c
Comment: In all of these queries the Column Store Index of MSSQL is much faster than the Row Store of MSSQL and PSQL. This is because of the GROUP BY clause on the uid column, which has only 4730 (1mio table), 6570 (2mio table), and 7956 (3mio table) distinct values. The Column Store Index can load all these values at once and can then filter the rows that need to be fetched. The Row Store needs to go through all rows because there is no index on this column. This query shows the benefit of a Column Store Index well.

Query: 14a, 14b, 14c
Comment: This is a rather complex query compared to the previous queries. It includes 3 joins and neither a GROUP BY clause nor aggregate functions. In query 14a the Column Store Index is twice as slow as the Row Store of MSSQL. In queries 14b and 14c the Column Store is first a bit slower and then a bit faster than the Row Store of MSSQL. The Row Store of PSQL is up to 10 times slower than MSSQL in all queries. There is no index defined on the key column in the Row Store, yet it can compete with the Column Store Index. This column has only distinct values on rows.

3.6 Short Conclusion on the Benchmark

The benchmark has shown that the Column Store Index indeed gives a performance boost for the kind of queries found in a Data Warehouse. PSQL is sometimes much slower than its Row Store counterpart in MSSQL. This might be because PSQL was not properly set up: it was used in the benchmark with the out-of-the-box configuration, whereas Microsoft SQL Server might tune itself to provide optimal performance. It is interesting that the Column Store Index is sometimes slower than the MSSQL Row Store in the first query of a group but faster in the second query, which runs on another table with more data.

To reveal the real performance benefit of the Column Store, a deeper investigation of specific Data Warehouse queries with GROUP BY and aggregate functions would need to be done. Query group 3 included an average function, and the Column Store Index was twice as fast as the Row Store of MSSQL. Microsoft says that it can be 10 times faster. Query group 3 would probably have shown a much larger performance boost with the Column Store Index if more data had been included.
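The contrast between the two query shapes discussed above can be reproduced with a small T-SQL sketch. It assumes the benchmark database and the gnis table from Appendix A. Using SET STATISTICS TIME to obtain timings is an assumption made for this illustration and not necessarily how the figures in this paper were measured; the column aliases are illustrative as well.

```sql
-- Sketch only: assumes the benchmark database and the gnis table from Appendix A.
USE benchmark;

-- Report parse/compile and execution times for each following statement.
SET STATISTICS TIME ON;

-- Analytical shape: one grouping column, aggregates, few columns touched.
-- This is the kind of query where a Column Store is expected to do well.
SELECT class,
       COUNT(*)                      AS features,      -- alias is illustrative
       AVG(CAST(elevation AS float)) AS avg_elevation  -- alias is illustrative
FROM dbo.gnis
GROUP BY class
ORDER BY features DESC;

-- Lookup shape: several columns of the matching rows, no aggregate.
-- This is the kind of query where the Row Store was faster in the benchmark.
SELECT name, county, state
FROM dbo.gnis
WHERE county = 'texas';

SET STATISTICS TIME OFF;
```

Running both statements once with the clustered Column Store Index in place and once without it gives a rough feel for the differences described in the result tables.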

4 Conclusion and Outlook

This paper has given a short introduction to the Column Store of Microsoft SQL Server 2016. The paper started off with the widely used Row Store, which many Database Management Systems use to store their data. After the introduction to the Row Store, the Column Store was introduced. The Column Store takes a different approach to storing its data. While a row based storage stores the values of the columns of a row grouped together, a columnar based storage stores all values of a column together. The individual values of a row are therefore stored distributed and not grouped together.

Microsoft SQL Server 2016 provides two slightly different Column Store Indexes: a non-clustered and a clustered Column Store Index. The clustered Column Store Index works more or less in the same way as the non-clustered Column Store Index. In contrast to the non-clustered Column Store Index, the clustered Column Store Index is the principal storage for the entire table and therefore contains all of the columns of the table. This makes the table and the index essentially the same. The non-clustered Column Store Index, on the other hand, can be defined on an arbitrary number of columns. Additionally, a clustered Column Store Index has the restriction that no other index can be created. Before version 2016 of SQL Server, a table with a non-clustered Column Store Index was not updateable. With version 2016 this has changed, and the table is updateable even when a non-clustered Column Store Index is defined on it.

The Column Store Index has some limitations, especially regarding the supported data types. A Column Store Index of Microsoft SQL Server also uses compression to provide a smaller memory footprint. Compression encodings such as Run-Length Encoding, Bit-Array Encoding, and Dictionary Encoding have been introduced and discussed.

Benchmarking the Column Store Index was a main part of the paper as well. A Column Store Index has its advantage in Data Warehouses where a lot of aggregate functions are used. The index was created especially for these kinds of queries, where only one or very few columns are loaded. The aim of the benchmark was to provide different queries, including some typical Data Warehouse queries with aggregate functions. It was run on different database management systems as well as on different tables with different datasets. The two DBMS used were PostgreSQL 9.4 and Microsoft SQL Server 2016 CTP2.4. In PostgreSQL the data was stored and benchmarked in row based tables. On Microsoft SQL Server the data was benchmarked once in row based storage and once in columnar based storage, each table with a clustered Column Store Index.

The different queries have been commented on and analyzed in the paper. The analysis showed that for typical Data Warehouse queries for online analytical processing the Column Store was faster than the Row Store. In contrast, queries that included no-gos (like loading a lot of columns with neither aggregate functions nor groupings) ran much slower on a Column Store than on a Row Store.

To summarize, a Column Store provides a performance boost where it is appropriate, meaning in a Data Warehouse where a lot of analytical queries are executed. According to Microsoft, numerous clients are already using a Column Store in their production environments and are very happy with its performance. It is very easy to create (and drop!) a Column Store Index on tables, which is a huge plus as well. It is therefore possible to try a Column Store and test its performance boost without spending a lot of time on setup, writing complex SQL statements, or changing the data structure of the tables.
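How little is needed to try a Column Store and to go back again can be shown with a short T-SQL sketch. It assumes the gnis table from Appendix A is currently stored as a plain row based table without any Column Store Index. The index name gnis_cci is taken from the appendix; gnis_ncci and the chosen column list are illustrative assumptions.

```sql
-- Sketch only: assumes dbo.gnis exists as a row based table (heap)
-- without a Column Store Index.
USE benchmark;

-- Try the Column Store: the clustered index becomes the storage of the table.
CREATE CLUSTERED COLUMNSTORE INDEX gnis_cci ON dbo.gnis;

-- ... run the analytical queries of interest here ...

-- Go back: dropping the clustered Column Store Index turns the table
-- into a row based heap again.
DROP INDEX gnis_cci ON dbo.gnis;

-- Alternative: a non-clustered Column Store Index on selected columns only.
-- The name gnis_ncci and the column list are illustrative.
CREATE NONCLUSTERED COLUMNSTORE INDEX gnis_ncci ON dbo.gnis (class, elevation);
DROP INDEX gnis_ncci ON dbo.gnis;
```

Neither variant changes the table definition or the queries that run against the table, which is what makes such an experiment cheap.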

Bibliography

[1] D. Korotkevitch, Pro SQL Server Internals, Apress,
[2] "msdn.microsoft.com," Microsoft, [Online]. Available: [Accessed ].
[3] "msdn.microsoft.com," Microsoft, [Online]. Available: [Accessed ].
[4] "wiki.technet.microsoft.com," [Online]. Available: [Accessed ].
[5] "saphanatutorial.com, Image," [Online]. Available: content/uploads/2013/09/difference-between-column-based-and-row-based- Tables.png. [Accessed ].
[6] S. Govoni, "blogs.msdn.com," Microsoft, [Online]. Available: [Accessed ].
[7] Microsoft, "SQL Server 2014 Developer Training Kit: containing pdf: SQL Server 2014 In-Memory Data Warehouse (Columnstore)," Microsoft, [Online]. Available: [Accessed ].
[8] S. Govoni, "blogs.msdn.com, Image," Microsoft, [Online]. Available: key/communityserver-blogs- components-weblogfiles/ metablogapi/7380.clip_5f00_image002_5f00_3d jpg. [Accessed ].
[9] R. Sheldon, "searchsqlserver.techtarget.com," [Online]. Available: [Accessed ].
[10] "msdn.microsoft.com," Microsoft, [Online]. Available: [Accessed ].
[11] N. Neugebauer, " [Online]. Available: compression-algorithms/. [Accessed ].
[12] "nikoport.com, Image," [Online]. Available: [Accessed ].
[13] "nikoport.com, Image," [Online]. Available: [Accessed ].
[14] "nikoport.com, Image," [Online]. Available: [Accessed ].
[15] "nikoport.com, Image," [Online]. Available: [Accessed ].
[16] P. Carter, "stackoverflow.com," [Online]. Available: [Accessed ].

List of Figures

Figure 1: Row Store Concept. Source: [5] (adapted)
Figure 2: Row Store Concept compared to Column Store Concept. Source: [5]
Figure 3: Columns, Segments and Row Groups. Source: [8]
Figure 4: RLE with ordering on the name column. 5 name values; 9 last name values. Source: [12]
Figure 5: RLE with ordering on the last name column. 9 name values; 3 last name values. Source: [13]
Figure 6: Bit-Array. Source: [14]
Figure 7: Dictionary Encoding. Source: [15]
Figure 8: Benchmark results for the queries 1a - 4b
Figure 9: Benchmark results for the queries 10x - 14c

Appendix

A. Microsoft SQL Server Scripts

Import Script:

-- Import for Benchmark
-- Tested on Microsoft SQL Server 2016 CTP
-- Andreas Büchler
-- Requirements:
-- Files gnis_names09.csv, osm_poi_ch.csv, osm_poi_tag_ch.csv

CREATE DATABASE [benchmark] COLLATE Latin1_General_BIN;
GO
USE benchmark;

-- create table gnis
CREATE TABLE gnis (
    x double precision not null,
    y double precision not null,
    fid integer,
    name varchar(95),
    class varchar(15),
    state varchar(2),
    county varchar(13),
    elevation integer,
    map varchar(27)
);

-- load data into table gnis
BULK INSERT gnis
FROM 'C:\temp\import_mssql\gnis_names09.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    TABLOCK
);

-- create table osm_poi_ch
CREATE TABLE osm_poi_ch (
    id character varying(64) not null,
    lastchanged character varying(35),
    changeset integer,
    version integer,
    uid integer,
    lon double precision not null,
    lat double precision not null
);

-- load data into table osm_poi_ch
BULK INSERT osm_poi_ch
FROM 'C:\temp\import_mssql\osm_poi_ch.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    TABLOCK
);

-- create table osm_poi_tag_ch
CREATE TABLE osm_poi_tag_ch (
    id character varying(64) not null,
    [key] varchar(54) not null, -- because of the quotes ""
    value varchar(571)          -- because of the quotes ""
);

-- load data into osm_poi_tag_ch
BULK INSERT osm_poi_tag_ch
FROM 'C:\temp\import_mssql\osm_poi_tag_ch.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    TABLOCK
);

Prepare Benchmark Script:

-- Preparation for Benchmark
-- Tested on Microsoft SQL Server 2016 CTP
-- Andreas Büchler
-- Requirements:
-- Tables gnis, osm_poi_ch and osm_poi_tag_ch exist and are loaded.

USE benchmark;

PRINT 'Preparing tables. Pls. wait...'

PRINT '=== Table gnis'
CREATE CLUSTERED COLUMNSTORE INDEX gnis_cci
    ON dbo.gnis WITH (DROP_EXISTING = OFF);
-- CREATE NONCLUSTERED INDEX gnis_fid_idx ON dbo.gnis (fid ASC);

PRINT '=== Table osm_poi_ch'
CREATE CLUSTERED COLUMNSTORE INDEX osm_poi_ch_cci
    ON dbo.osm_poi_ch WITH (DROP_EXISTING = OFF);

-- CREATE NONCLUSTERED INDEX osm_poi_ch_id_idx ON dbo.osm_poi_ch (id ASC);

PRINT '=== Table osm_poi_tag_ch'
CREATE CLUSTERED COLUMNSTORE INDEX osm_poi_tag_ch_cci
    ON dbo.osm_poi_tag_ch WITH (DROP_EXISTING = OFF);
-- CREATE NONCLUSTERED INDEX osm_poi_tag_ch_id_idx
--     ON dbo.osm_poi_tag_ch (id ASC);

PRINT '=== Table osm_poi_ch_3mio'
-- drop table if it exists
IF OBJECT_ID('dbo.osm_poi_ch_3mio', 'U') IS NOT NULL
    DROP TABLE dbo.osm_poi_ch_3mio;

-- create table osm_poi_ch_3mio out of osm_poi_ch
PRINT '* create and import table'
SELECT TOP id,
    max(version) "version",
    max(lastchanged) lastchanged,
    max(uid) uid,
    max(changeset) changeset,
    max(lon) lon,
    max(lat) lat
INTO osm_poi_ch_3mio
FROM osm_poi_ch
GROUP BY id
ORDER BY id;

-- create clustered column store index
PRINT '* create clustered column store index'
CREATE CLUSTERED COLUMNSTORE INDEX osm_poi_ch_3mio_cci
    ON dbo.osm_poi_ch_3mio WITH (DROP_EXISTING = OFF);
-- CREATE NONCLUSTERED INDEX osm_poi_ch_3mio_pk_idx
--     ON dbo.osm_poi_ch_3mio (id ASC);

PRINT '=== Table osm_poi_ch_2mio'
-- drop table if it exists
IF OBJECT_ID('dbo.osm_poi_ch_2mio', 'U') IS NOT NULL
    DROP TABLE dbo.osm_poi_ch_2mio;

-- create table osm_poi_ch_2mio out of osm_poi_ch_3mio
PRINT '* create and import table'
SELECT TOP *
INTO osm_poi_ch_2mio
FROM osm_poi_ch_3mio
ORDER BY id;

-- create clustered column store index
PRINT '* create clustered column store index'
CREATE CLUSTERED COLUMNSTORE INDEX osm_poi_ch_2mio_cci
    ON dbo.osm_poi_ch_2mio WITH (DROP_EXISTING = OFF);
-- CREATE NONCLUSTERED INDEX osm_poi_ch_2mio_pk_idx
--     ON dbo.osm_poi_ch_2mio (id ASC);

PRINT '\n=== Table osm_poi_ch_1mio'
-- drop table if it exists
IF OBJECT_ID('dbo.osm_poi_ch_1mio', 'U') IS NOT NULL
    DROP TABLE dbo.osm_poi_ch_1mio;

-- create table osm_poi_ch_1mio out of osm_poi_ch_3mio
PRINT '* create and import table'
SELECT TOP *
INTO osm_poi_ch_1mio
FROM osm_poi_ch_3mio
ORDER BY id;

-- create clustered column store index
PRINT '* create clustered column store index'
CREATE CLUSTERED COLUMNSTORE INDEX osm_poi_ch_1mio_cci
    ON dbo.osm_poi_ch_1mio WITH (DROP_EXISTING = OFF);
-- CREATE NONCLUSTERED INDEX osm_poi_ch_1mio_pk_idx
--     ON dbo.osm_poi_ch_1mio (id ASC);

PRINT 'Ok.'

Benchmark Script:

-- Benchmark
-- Tested on Microsoft SQL Server 2016 CTP
-- Andreas Büchler

USE benchmark;

PRINT '=== Table gnis'

-- Simple equality search with single tuple in return set
SELECT count(*) FROM osm_poi_tag_ch;
PRINT ';1a'
SELECT name, county, state FROM gnis t WHERE t.fid= ;
PRINT ';1b'
SELECT name, county, state FROM gnis t WHERE t.fid= ;

-- Simple equality search on county Texas:
SELECT count(*) FROM osm_poi_tag_ch;
PRINT ';2a'
SELECT name, county, state FROM gnis t WHERE t.county='texas';
PRINT ';2b'
SELECT name, county, state FROM gnis t WHERE t.county='texas';

-- Range search with aggregate function
SELECT count(*) FROM osm_poi_tag_ch;
PRINT ';3a'
SELECT cast(avg(t.elevation) as int)
FROM gnis t
WHERE t.x> and t.y> and t.x< and t.y<33.460;
PRINT ';3b'
SELECT cast(avg(t.elevation) as int)
FROM gnis t
WHERE t.x> and t.y> and t.x< and t.y<33.460;

-- Group by query
SELECT count(*) FROM osm_poi_tag_ch;
PRINT ';4a'
SELECT count(*), class FROM gnis GROUP BY class ORDER BY 1 DESC;
PRINT ';4b'
SELECT count(*), class FROM gnis GROUP BY class ORDER BY 1 DESC;

PRINT '=== Table osm_poi_ch'

-- Query with equality condition
SELECT count(*) FROM gnis;
PRINT ';10x'
select id,version,lon,lat from osm_poi_ch_1mio where id=' pt';
PRINT ';10a'
select id,version,lon,lat from osm_poi_ch_1mio where id=' pt';
PRINT ';10b'

select id,version,lon,lat from osm_poi_ch_2mio where id=' pt';
PRINT ';10c'
select id,version,lon,lat from osm_poi_ch_3mio where id=' pt';

-- Query with range condition
SELECT count(*) FROM gnis;
PRINT ';11a'
select TOP 10 id,version,lon,lat
from osm_poi_ch_1mio
where version>300
order by version desc;
PRINT ';11b'
select TOP 10 id,version,lon,lat
from osm_poi_ch_2mio
where version>300
order by version desc;
PRINT ';11c'
select TOP 10 id,version,lon,lat
from osm_poi_ch_3mio
where version>300
order by version desc;

-- Query with range condition II.
SELECT count(*) FROM gnis;
PRINT ';12a'
select id,version,lon,lat
from osm_poi_ch_1mio
where lon> and lat> and lon< and lat<
order by version desc;
PRINT ';12b'
select id,version,lon,lat
from osm_poi_ch_2mio
where lon> and lat> and lon< and lat<
order by version desc;
PRINT ';12c'
select id,version,lon,lat
from osm_poi_ch_3mio
where lon> and lat> and lon< and lat<
order by version desc;

-- Query with group by
SELECT count(*) FROM gnis;
PRINT ';13a'
