Understanding Interleaved Sort Keys in Amazon Redshift


Introduction

Recently, Amazon announced interleaved sort keys for Amazon Redshift. Until now, compound sort keys were the only option and, while they deliver incredible performance for workloads that include a common filter on a single dimension known ahead of time, they don't do much to facilitate ad hoc multidimensional analysis. In this paper, we will use some dummy data and a set of Postgres queries to discuss the role of sort keys and compare how both types of keys work in theory. We won't be performing any work in Redshift directly until Part 2, where we will examine how interleaved sort keys are implemented in practice, discuss a common tactic that can benefit from using compound and interleaved sort keys together, and run some benchmark queries against a data set with billions of rows. In both parts we will link to code that can be used to recreate our results.

Life Without the B-tree

Redshift bills itself as a fast, fully managed, petabyte-scale data warehouse, and it uses techniques that you may not find in a relational database built for transactional (OLTP) workloads. Most folks are familiar with the concept of using multiple B-tree indexes on the same table in order to optimize performance across a variety of queries with varying 'where' clauses. However, B-tree indexes have a couple of drawbacks when applied to the large-scale analytical (OLAP) workloads that are common in data warehousing.

First, they are secondary data structures. Every index you create makes a copy of the columns on which you've indexed and stores this copy separately from the table, as a doubly-linked list sorted within the leaf nodes of a B-tree. The additional space required to store multiple indexes in addition to the table can be prohibitively expensive when dealing with large volumes of data.

Second, B-tree indexes are most useful for highly selective queries that look up a specific data point. Analytical queries that aggregate a range of data points will typically be better served by a simple full table scan, because traversing the B-tree, scanning the leaf nodes and fetching non-indexed columns from the table suffers from a lot of random I/O, even for small ranges.

Compound Sort Keys and Zone Maps

For the above reasons, Redshift eschews the B-tree and instead employs a lighter form of indexing that lends itself well to table scans. Each table in Redshift can optionally define a sort key, which is simply a subset of columns that will be used to sort the table on disk.
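In Redshift DDL this is a single clause on the table definition. Here is a minimal sketch using hypothetical column names that mirror the four dimensions of our dummy data:

    -- Sort the table on disk by the four listed columns; COMPOUND is the
    -- default sort key style in Redshift.
    CREATE TABLE dummy_data (
        year     int,
        region   int,
        customer int,
        product  int
    ) COMPOUND SORTKEY (year, region, customer, product);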

[Figure: access methods chosen by Postgres for a range of lookup values. The first column is the value being looked up in the 'where' clause of a Postgres query, the second column is the number of records that contain the lookup value, and the third column is the access method employed. Postgres' default configuration abandons index scans pretty quickly; bitmap scans don't make it much further.]

A compound sort key produces a sort order similar to that of the 'order by' clause: the first column is sorted in its entirety, then within each first-column grouping the second column is sorted in its entirety, and so on until the entire key has been sorted. This is very similar to the leaf nodes of a B-tree index with a compound key, but instead of using a secondary data structure, Redshift stores the data in the table sorted in this manner from the jump.

[Figure: the sort order produced by a compound key on our dummy data (abbreviated screenshot).]
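The 'order by' analogy can be taken literally. Against the hypothetical dummy_data table sketched earlier, this query returns rows in exactly the order the compound key stores them on disk:

    SELECT year, region, customer, product
    FROM dummy_data
    ORDER BY year, region, customer, product;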

Once the sort order of the table has been computed, Redshift, being a columnar database, breaks out each column, optionally compresses it, and stores blocks of the column values contiguously on disk. Additionally, it maintains a secondary data structure called a zone map that lists the minimum and maximum column values for each block. Since the zone map is maintained at the block level (as opposed to the row level with a B-tree), its size as a secondary data structure is not prohibitively expensive and could conceivably be held in memory. Redshift blocks are 1 MB in size, so assuming each zone map entry has a width of 100 bytes, the zone map for a 1 TB compressed table could be stored in 100 MB.

The zone map allows for pruning of blocks that are irrelevant for any given query. If a query against our dummy data requests records 'where region = 2', then we can see from the zone map below that only blocks 2, 6, 10 and 14 need to be accessed. However, the zone map's ability to facilitate pruning for a column depends on the sort order of the table on disk. As we can see in the zone map, sorting on our compound key favors the leading columns in the key. An equality filter on year will scan only 4 contiguous blocks. Likewise, an equality filter on region also scans only 4 blocks, albeit non-contiguous ones. An equality filter on customer will scan the entire table (16 blocks) because the zone map provides nothing valuable in the way of pruning, and an equality filter on product must scan the entire table as well.

[Figure: the zone map for our dummy data when using a compound sort key. Filtering on year or region scans 4 blocks; filtering on customer or product scans all 16 blocks.]

Another way of saying this is that the compound sort key produces great locality for the leading columns of the key and terrible locality for the trailing columns. The values of the trailing columns are spread across many of the blocks in the table, and any particular block includes a broad range of column values. Until recently, we either had to live with this fact or make multiple copies of the table with different compound sort keys for each.
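The pruning mechanics are easy to model in plain Postgres. Everything below (table, columns, values) is invented purely for illustration:

    -- A toy zone map: one row per 1 MB block, recording the min/max of
    -- 'region' within that block, laid out as in the figure above.
    CREATE TEMP TABLE zone_map (
        blocknum   int,
        min_region int,
        max_region int
    );

    INSERT INTO zone_map VALUES
        (0, 0, 0),  (1, 1, 1),  (2, 2, 2),  (3, 3, 3),
        (4, 0, 0),  (5, 1, 1),  (6, 2, 2),  (7, 3, 3),
        (8, 0, 0),  (9, 1, 1),  (10, 2, 2), (11, 3, 3),
        (12, 0, 0), (13, 1, 1), (14, 2, 2), (15, 3, 3);

    -- Pruning: only blocks whose [min, max] range can contain the filter
    -- value need to be scanned.
    SELECT blocknum
    FROM zone_map
    WHERE 2 BETWEEN min_region AND max_region;  -- returns 2, 6, 10, 14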

Now there's a better way!

The Z-order Curve and Interleaved Sort Keys

To sort a table on disk we need to produce an ordered list of sort keys. A list of sort keys is a one-dimensional structure, so we must linearize any multidimensional data set before mapping it to sort keys. If we think of our data as being spread amongst the coordinates of an N-dimensional space, with N being the number of columns in our sort key, linearization is the process of finding a path or curve through every point in this space. That path represents our ordered list of sort keys, and we want it to weight each dimension equally, preserving locality across dimensions and maximizing the utility of a zone map.

[Figure: a 2-dimensional Z-order curve formed by interleaving the binary representations of the coordinates (x, y) and sorting the resulting interleaved binary numbers.]

The sort order produced by a compound sort key represents one possible path, called a row-order curve, but as mentioned earlier it does a poor job of preserving locality. Luckily, the Z-order curve was introduced almost 50 years ago; it not only preserves locality across dimensions but can easily be calculated by interleaving the bits of the binary coordinates and sorting the resulting interleaved binary numbers.
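That calculation can be sketched in plain Postgres for two 2-bit coordinates, with the bit positions spelled out by hand:

    -- Interleave the bits of x (x1 x0) and y (y1 y0) into the 4-bit key
    -- y1 x1 y0 x0, then sort by it to walk the Z-order curve.
    SELECT x, y,
           ((y & 2) << 2) | ((x & 2) << 1) | ((y & 1) << 1) | (x & 1) AS z_key
    FROM generate_series(0, 3) AS x
    CROSS JOIN generate_series(0, 3) AS y
    ORDER BY z_key;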

If we apply this technique to our dummy data, the data is sorted in equal measure for both the leading and trailing columns of the key, and no block stores a range broader than 50% of the column values. This means that a filter on any single column in the sort key will scan 8 out of 16 blocks.

[Figure: the zone map for our demo data set when using an interleaved sort key. Filtering on any single column scans 8 of 16 blocks.]

Ideally, the number of blocks scanned for a particular 'where' clause with an interleaved sort key can be calculated as N^((S-P)/S), where N is the total number of blocks required to store the table, S is the number of columns in the sort key, and P is the number of sort key columns specified in the 'where' clause. So for the 16 blocks that make up our dummy data, a four-column interleaved key results in the following block scan counts:

Filter on 1 of 4 sort key columns: 8 blocks
Filter on 2 of 4 sort key columns: 4 blocks
Filter on 3 of 4 sort key columns: 2 blocks
Filter on 4 of 4 sort key columns: 1 block

Although this suggests worse performance when filtering on the leading column compared to a compound sort key (remember we only scanned 4 blocks when filtering on the leading column of the compound key for our dummy data), the average performance across all possible sort key filter combinations will be much better with interleaved keys. This is what makes interleaved sort keys a great match for ad hoc multidimensional analysis.

[Figure: the average number of block scans on our dummy data for compound vs. interleaved sort keys.]

We can also infer two additional points from this formula. First, the benefits of interleaved sort keys increase as the size of the data set increases. For example, if we filter on 2 out of 4 sort key columns, then block scans scale with the square root of N, so it would take two orders of magnitude of table growth to produce one order of magnitude of block scan growth. Second, the benefits of interleaved sort keys decrease as the number of columns in the sort key increases.
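The formula is easy to sanity-check in plain Postgres for the 16-block, four-column case:

    -- Blocks scanned = N^((S-P)/S) with N = 16 blocks and S = 4 key columns.
    SELECT p AS filtered_columns,
           round(power(16::numeric, (4 - p)::numeric / 4)) AS blocks_scanned
    FROM generate_series(1, 4) AS p;
    -- filtered_columns 1 -> 8, 2 -> 4, 3 -> 2, 4 -> 1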

If instead our key was an eight-column key (which is the maximum), the block scan counts would be roughly as follows:

Filter on 1 of 8 sort key columns: 12 blocks
Filter on 2 of 8 sort key columns: 8 blocks
Filter on 3 of 8 sort key columns: 6 blocks
Filter on 4 of 8 sort key columns: 4 blocks
Filter on 5 of 8 sort key columns: 3 blocks
Filter on 6 of 8 sort key columns: 2 blocks
Filter on 7 of 8 sort key columns: 2 blocks
Filter on 8 of 8 sort key columns: 1 block

In this case, our 'where' clause would need to be twice as robust to match the performance of the four-column key.

[Chart: block scan growth as the table grows, O(n) vs. O(sqrt(n)).]

The Zone Map: stv_blocklist

The zone maps for each table, as well as additional metadata, are stored in the stv_blocklist system table. If we take a look at the definition of the stv_blocklist table, we can see it has a width of about 100 bytes. This corresponds with our estimates from Part 1 regarding the small total size of this secondary data structure relative to the tables it describes.
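As a rough check of that estimate, a query along these lines tallies the zone map footprint of a single table; svv_table_info resolves the table name to the id stored in stv_blocklist's tbl column, and the 100 bytes per row is the assumption from above:

    SELECT COUNT(*) AS zone_map_entries,
           COUNT(*) * 100.0 / (1024 * 1024) AS approx_zone_map_mb
    FROM stv_blocklist
    WHERE tbl = (SELECT table_id FROM svv_table_info
                 WHERE "table" = 'orders_compound');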

Taking a look at stv_blocklist for a table with a compound key confirms what we've learned about zone maps for compound keys. Below we can see that the zone map for c_custkey, the leading column of the sort key for the orders_compound table, would prove helpful in pruning irrelevant blocks. However, the zone map for any of the other columns in the key would be useless.

[Figure: the zone map for a compound sort key.]

If we take a look at the zone map for orders_interleaved_4, which has an interleaved sort key, we can see that the zone maps for all of the sort key columns look pretty useful.

[Figure: the zone map for an interleaved sort key.]

However, it's not the minimum and maximum column values that we are seeing here for each block but instead the minimum and maximum sort key values. Presumably, Redshift de-interleaves the zone map values on the fly whenever a query is issued in order to compare the column values in the 'where' clause of the query to the zone map. But what's interesting is that these sort keys don't seem to map directly to our column values.
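A query in this spirit lists the per-block minimum and maximum recorded for each sort key column; it assumes, as in our examples, that the sort key columns are the table's first four (stv_blocklist numbers them from 0):

    SELECT col, blocknum, minvalue, maxvalue
    FROM stv_blocklist
    WHERE tbl = (SELECT table_id FROM svv_table_info
                 WHERE "table" = 'orders_interleaved_4')
      AND col < 4
    ORDER BY col, blocknum;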

If we were to interleave the binary representation of the sort key column values for the first record in the table, then we would expect the resulting binary value to be equal to the minimum sort key (which has a decimal value of 1,711,302,144 as shown above). However, this is not the case. Behind the scenes, Redshift is taking care of a few important details for us, which we can see by examining the stv_interleaved_counts table.

The Multidimensional Space: stv_interleaved_counts

With our dummy data set in Part 1, we were able to interleave the binary representation of the column values directly to produce our sort keys because we were ignoring a few details about working with real data. Sort keys need to be represented by a data type within the database. Redshift uses a bigint to represent the sort key, which is a signed 64-bit integer. However, the values of our sort key columns could be of type bigint as well. If we had four bigint sort key columns, then interleaving their bits would produce a 256-bit sort key, which could not be stored as a bigint. To account for this, Redshift uses a proprietary compression algorithm to compress the values of each sort key column (this is separate from the compression encoding type of each column). These compressed values are what actually get interleaved, and we can take a look at them in the stv_interleaved_counts system table.

[Figure: an abbreviated view of the stv_interleaved_counts system table for orders_interleaved_4. Notice the count of records assigned to each compressed_val is somewhat balanced.]
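The view in the screenshot can be reproduced with a query of this shape; compressed_val and count are the columns visible there, though the exact column set of stv_interleaved_counts may vary by Redshift version:

    SELECT col, compressed_val, SUM(count) AS records
    FROM stv_interleaved_counts
    WHERE tbl = (SELECT table_id FROM svv_table_info
                 WHERE "table" = 'orders_interleaved_4')
    GROUP BY col, compressed_val
    ORDER BY col, compressed_val;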

We can check that it is in fact these compressed values that are being interleaved to produce the sort key by interleaving the binary representation of the minimum compressed value for each sort key column to produce the minimum sort key. Likewise, we can interleave the binary representation of the maximum compressed value for each sort key column to produce the maximum sort key.

[Figure: on the left, the minimum and maximum compressed values for each sort key column of orders_interleaved_4; on the right, the minimum and maximum sort key values for the entire table. Interleaving the binary representations of the values on the left yields the binary representations of the values on the right.]

What these compressed values ultimately represent are the dimension coordinates of the multidimensional space to which Redshift will fit a Z-order curve. The stv_interleaved_counts table gives us an idea of how this space is laid out. The coordinate range of each dimension is constrained by the size of the sort key. Since the sort key can be a maximum of 64 bits in size, the size in bits of the maximum coordinate of each dimension can be no more than 64/N, where N is the number of columns in the sort key. For example, the range of coordinates for orders_interleaved_8, which has an eight-column interleaved sort key, is 0 to 255. This makes sense, since 2^(64/8) equals 256. If the sort key had only a single column then Redshift could potentially use the entire 64 bits for the maximum coordinate, resulting in a range of 0 to 9,223,372,036,854,775,807. In this case, the sort should theoretically be no different than a compound key, since we only have a single column. However, zone maps for compound keys only take into account the first 8 bytes of the sort key column values.

[Figure: the coordinate range for each column of an eight-column sort key, as seen in orders_interleaved_8, is limited to 256 values.]
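The arithmetic behind those ranges checks out directly:

    -- Coordinates available per dimension when the 64-bit sort key is
    -- split evenly across N columns: 2^(64/N).
    SELECT n AS sortkey_columns,
           power(2::numeric, (64 / n)::numeric) AS coordinates_per_dimension
    FROM (VALUES (1), (2), (4), (8)) AS t(n);
    -- 8 columns -> 256 coordinates; 4 columns -> 65,536 at most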

If the column values have a long common prefix, a zone map on the first 8 bytes won't be of much help. Redshift's proprietary compression algorithm for interleaved sort keys does a better job of taking the entire column value into account, so that different values with common prefixes get different compressed values, and thus different sort keys, yielding a more useful zone map.

Interestingly, as we've seen before, the range of coordinates for orders_interleaved_4, which has a four-column interleaved sort key, is 0 to 1,023. In this case, it looks like the coordinates are actually 10 bits in size (2^10 equals 1,024), but conceivably they could be as large as 16 bits, for a range of 0 to 65,535 (2^(64/4) equals 65,536). The reason for the smaller size is likely the size and distribution of the data and Redshift's attempts to prevent skew.

Skew: svv_interleaved_columns

Interleaved sort keys aren't free. The magic of the Z-order curve works best when data is evenly distributed among the coordinates of the multidimensional space. Upon ingestion, Redshift will analyze the size and distribution of the data to determine the appropriate size of the space and distribute the data within the space to the best of its ability. This logic is part of the proprietary compression algorithm that maps column values to the compressed values in stv_interleaved_counts, and it may need to be recalibrated over time to prevent skew (this is separate from distribution skew across slices). We can see a measure of this skew by looking at the svv_interleaved_columns system table.

[Figure: the upper table shows the skew for each column of the sort key for orders_interleaved_4. Notice c_region (column 1) has a really high skew. This is because it has a cardinality of 5, as shown in the table on the bottom left. Because of this, Redshift can only assign records to 5 coordinates on the c_region dimension of the multidimensional space, as shown in the table on the bottom right.]
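The skew measurements in that screenshot come from a query like the following; interleaved_skew and last_reindex are the view's documented columns:

    SELECT col, interleaved_skew, last_reindex
    FROM svv_interleaved_columns
    WHERE tbl = (SELECT table_id FROM svv_table_info
                 WHERE "table" = 'orders_interleaved_4')
    ORDER BY col;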

Skew can be caused by a few different factors, some of which cannot be accounted for. Columns with low cardinality will show a degree of skew because the records will be assigned to a small number of coordinates in the dimension. Redshift will try to space these few coordinates out as evenly as possible, but there will still be a lot of coordinates that are left empty. This is demonstrated by the c_region column of the orders_interleaved_4 table.

Note: AWS has released a patch that significantly improves the skew measurements in svv_interleaved_columns to better account for duplicate values. The concepts discussed here will still mostly apply, though the numbers may vary.

Data that is inherently skewed cannot be accounted for either. In the orders_interleaved_skew_custkey table we've reassigned half of the data a c_custkey of 1. The c_custkey column still has a high cardinality, in that there are still a lot of customers, but one customer dominates the data set. We can see this skews the data towards a single coordinate in the dimension.

[Figure: the upper table shows the skew for each column of the sort key for orders_interleaved_skew_custkey. Notice c_custkey (column 0) has a really high skew. This is because the data set is dominated by a c_custkey of 1, so a large number of records are being assigned to coordinate 512 on the c_custkey dimension of the multidimensional space, as shown in the bottom table. Also note that only half of the c_custkey dimension is being used, as 512 is the lowest coordinate with records assigned to it.]

For both of these cases there's no way to recalibrate the compression algorithm and resort the data to eliminate the skew. However, if the skew is artificial, meaning a large number of records are being assigned to the same coordinate when they could in fact be redistributed, then we can use the vacuum reindex command to fix this. This commonly occurs for data that increases monotonically over time, such as dates, where all values inserted after the first data load will be assigned to the last coordinate in the dimension, causing artificial skew. For example, the orders_interleaved_skew_date table was loaded with data up until 01/01/1998 on the first load. After this initial load, only the c_region and c_mktsegment columns are skewed, due to low cardinality; the d_date column has a relatively low skew.

[Figure: the skew for each column of the sort key for orders_interleaved_skew_date after the initial load. Notice d_date (column 3) has a low skew; the bottom table shows that records are somewhat evenly spread amongst the coordinates of the d_date dimension.]

[Figure: the skew for each column of the sort key for orders_interleaved_skew_date after loading additional data. Notice d_date (column 3) has a higher skew because all of the new records have been assigned to the last coordinate in the d_date dimension, as seen in the lower table.]

After inserting the remaining data from later than 01/01/1998, we can see the skew for the d_date column has skyrocketed because the majority of records have been assigned to the last coordinate in the dimension. Running vacuum reindex eliminates this skew, at the cost of extending vacuum times by 10% to 50%, because resorting the data in an interleaved fashion potentially involves touching every storage block.
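The command itself is a single statement; this is Redshift's documented syntax:

    -- Re-analyze the distribution of the sort key columns, then re-sort.
    VACUUM REINDEX orders_interleaved_skew_date;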

Current and Archive Tables

We can offset some of the maintenance cost associated with vacuum reindex, as well as optimize for common data warehousing workloads, by using compound and interleaved sort keys together in a common data warehousing design pattern: current and archive tables.

For many business processes we can assume that after a certain period of time a record is highly unlikely to be updated. For example, an e-commerce company with a return policy of 60 days can safely assume that order records over 60 days old aren't very likely to be updated. This allows them to create an orders_current table that includes orders placed within roughly the last two months plus the current month, and an orders_archive table that includes everything earlier than this window. To make this transparent to the end user, they can add a view, orders_view, that uses union all to combine both underlying tables.

The goal of the orders_current table is to facilitate quick lookups of individual orders, either for pure reads or read/write operations such as update. This makes it well suited for a compound key with the order id as the leading column of the key. The goal of the orders_archive table is to facilitate historical analysis, which could very well include ad hoc multidimensional analysis. This makes it well suited for an interleaved sort key on the dimensions that are most likely to be analyzed.

[Figure: the skew for each column of the sort key for orders_interleaved_skew_date after running vacuum reindex. Notice that the skew for d_date (column 3) has dramatically decreased and the records are once again somewhat evenly spread amongst the coordinates of the d_date dimension, as shown in the bottom table.]

Using this design pattern, when the end of the current month is reached, the oldest month in the orders_current table is copied to the orders_archive table and then deleted from the orders_current table. That way the orders_archive table is likely to be written to in an append-only fashion once a month, and the cost of running vacuum reindex on orders_archive is amortized over an entire month's worth of data.
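A minimal sketch of the pattern in Redshift DDL; the column list is an illustrative assumption, not the paper's exact schema:

    CREATE TABLE orders_current (
        order_id     bigint,
        c_custkey    bigint,
        c_region     int,
        c_mktsegment int,
        d_date       date
    ) COMPOUND SORTKEY (order_id);  -- fast lookups of individual orders

    CREATE TABLE orders_archive (
        order_id     bigint,
        c_custkey    bigint,
        c_region     int,
        c_mktsegment int,
        d_date       date
    ) INTERLEAVED SORTKEY (c_custkey, c_region, c_mktsegment, d_date);

    -- One transparent relation for end users.
    CREATE VIEW orders_view AS
        SELECT * FROM orders_current
        UNION ALL
        SELECT * FROM orders_archive;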

Benchmark

Now that we've seen the inner workings of both compound and interleaved keys, let's benchmark their performance against each other with a set of 9 queries that are representative of their pros and cons. We'll run each query 3 times on a data set of 6 billion rows (600 GB) using a 4-node dc1.8xlarge cluster and measure the average execution time. We're not concerned with the absolute execution time here, just the relative execution time between keys for different operations.

[Figure: benchmark results on 6 billion rows (600 GB) using a 4-node dc1.8xlarge cluster. The most dramatic improvements can be seen in the chart titled "WHERE on 3 non-leading columns": orders_compound runs in about 18 seconds, while orders_interleaved_4 runs over 35x faster, in about 0.5 seconds.]

Plotting the results shows that, generally speaking, the operations tested (where, group by, order by and join) perform better against a compound key when operating on the leading column of the key. Once the leading column is excluded, the interleaved sort key performs better. Also, as the number of interleaved sort key columns referenced in a query increases, so does performance, as seen by the decrease in execution time between "WHERE on trailing column" and "WHERE on 3 non-leading columns" for orders_interleaved_4. But as the number of columns in the interleaved sort key definition increases, performance decreases, as seen by the increase in execution time between orders_interleaved_4 and orders_interleaved_8 for all queries.
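For reference, a query in the spirit of the "WHERE on 3 non-leading columns" test would look something like this; the predicate values are placeholders:

    SELECT COUNT(*)
    FROM orders_interleaved_4
    WHERE c_region = 2
      AND c_mktsegment = 3
      AND d_date = '1997-06-15';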

Conclusion

Hopefully you now have a much better understanding of Redshift's interleaved sort keys and why they are such a great fit for ad hoc multidimensional analysis. Please let us know if we missed anything, as well as if you come across other interesting use cases for interleaved sort keys!

About Chartio

Chartio's vision is to make business intelligence as accessible and widely used in the enterprise as the common spreadsheet. Chartio accomplishes this by making business intelligence tools available to organizations that have been poorly served by legacy BI vendors, simplifying setup and maintenance, streamlining storage decisions, and enabling business users to perform their own analyses of complex data. Finally, Chartio enables Agile Business Intelligence: rather than requiring a monolithic waterfall implementation of Planning, ETL, Governance, Data Warehousing, and Deployment, Chartio makes it possible to start small and roll out business intelligence as your organization's needs increase. Learn how to quickly understand your business data at chartio.com.

222 Kearny Street, Suite 525
San Francisco, California, United States

Copyright. All Rights Reserved.
