1/3/2015. Column-Store: An Overview. Row-Store vs Column-Store. Column-Store Optimizations. Compression Compress values per column

Size: px

Start display at page:

Download "1/3/2015. Column-Store: An Overview. Row-Store vs Column-Store. Column-Store Optimizations. Compression Compress values per column"

Evelyn Lucas
6 years ago
Views:

1 //5 Column-Store: An Overview Row-Store (Classic DBMS) Column-Store Store one tuple ata-time Store one column ata-time Row-Store vs Column-Store Row-Store Column-Store Tuple Insertion: + Fast Requires multiple seeks Queries: May access more + data than needed Read only relevant data Perfect for analytics! Column-Store Optimizations Operating on columns enables and/or is combined with the following optimizations: Compression Compress values per column Late Tuple Materialization Construct tuples as late as possible Block Iteration Pass blocks of values between operators Invisible Join

2 //5 Column-Store Optimizations Compression Compress column data Column-Store Optimizations > Compression Why Compression Advantages of compression in general: Lower storage space requirements Minor Better I/O performance Read fewer data (from disk, S, or RAM), gain from cache locality Better query processing performance Typically when operating directly on compressed data Column-Store Optimizations > Compression Why Column Store Compress column data: One column at a time Data in a column more similar than data across columns Phone City Fred San Diego Rubble 69-- San Diego Maggie San Francisco James -7- New York

3 //5 Column-Store Optimizations > Compression Compression Example Run-length encoding Replace list of identical values by pair (value, count) x x Column-Store Optimizations > Compression Compression Example Run-length encoding Replace list of identical values by run pair (value, count) + Good for sorted columns x x Non-sorted Sorted Column-Store Optimizations > Compression Compression Example Run-length encoding Replace list of identical values by pair (value, count) + Enables query processing on compressed data directly e.g., Select persons in uncompress x x St =

4 //5 Column-Store Optimizations > Compression Compression Example Run-length encoding Replace list of identical values by pair (value, count) + Enables query processing on compressed data directly e.g., Select persons in x x St = x Column-Store Optimizations > Compression Other Compression Algos Dictionary Encoding Replace frequent patterns with smaller fixed length codes: eg, instead of string values Dasgupta, Freund, Papakonstantinou Commonly used in row-stores also, since it enables fixed length fields, therefore random access. Bit-Vector Encoding Create for each possible value a bit vector with s in the positions containing the value: Useful for small domains. (Covered in the indexing section.) Heavyweight, Variable-Length Compression Schemes e.g., Huffman: Excellent compression ratio but () no random access () possibly poor decompression CPU performance Currently not used they are good for selected workloads Column-Store Optimizations Late Tuple Materialization Create tuples as late in the query plan as possible 4

5 //5 Early Tuple Materialization Naïve method of query evaluation in a column store: Read relevant columns & create tuples Run query at a tuple-level as usual π Query Processing: Rows Storage: Columns Early Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = π City = = Create tuples Phone City Early Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = π Create tuples City = = Phone City City 5

6 //5 Early Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = π City = = Create tuples City Phone City Early Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = π City = = Create tuples Phone City Early Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = π City = = Create tuples Phone City Reads relevant columns only 6

7 //5 Late Tuple Materialization Create tuples as late in the plan as possible U Storage & Query Processing: Columns Late Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = Extract City = = Phone City Late Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = Extract City = = Phone City Position lists 7

8 //5 Late Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = Extract City = = Phone City Late Tuple Materialization e.g., SELECT FROM Person WHERE City= AND = Extract City = = Phone City Representing Position Lists Position lists can be represented as: Bit Vectors Bit-wise AND Ranges of positions {(, )} Special AND {(, )} {(, )} Run-length Encoding Revisited,,, 4, (starting pos, length) 8

9 //5 Late Materialization Benefits Avoid materializing certain tuples since they may be filtered out before being materialized (Reminds of pushing selections down.) Avoid data decompression which has to be done when a tuple is materialized Leverage improved cache locality which exists when operating on a single column Leverage optimizations for fixed-width attributes which would not be possible if operating on the tuple level, since a tuple with at least one variable-width attribute becomes variable-width Column-Store Optimizations Block Iteration Pass blocks of values between operators Column-Store Optimizations Block Iteration Row Store - Pass single tuples between operators - Extract attribute value through function calls Column Store - Pass blocks of values between operators - No need for attribute extraction b = X extractb() getnexttuple() b a b c d b = X getnextblock() b 9

10 //5 Column-Store Optimizations Invisible Join Join Optimization for Star Schemas > Invisible Join Joins & Late Tuple Mat. Late Tuple Materialization requires out of order access when we join tables e.g., R S R a b c b d S a b a c d out of order access to S Invisible Join Optimization for joins on Star Schemas - Make sure that the fact table is accessed in order - Reduce the amount of data that are accessed out of order from the dimension tables - Reminiscent of bitmap intersection technique

11 //5 Invisible Join Example Schema Date date year LineOrder orde rkey cust key supp key date revenue Customer Supplier custkey region nation suppkey region nation Example taken from VLDB 9 Tutorial by Harizopoulos, Abadi, Boncz Invisible Join Example Query SELECT c_nation, s_nation, d_year, sum(lo_revenue) as revenue FROM customer, lineorder, supplier, date WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND c_region = 'ASIA AND s_region = 'ASIA AND d_year >= 99 AND d_year <= 997 GROUP BY c_nation, s_nation, d_year ORDER BY d_year asc, revenue desc; Invisible Join Example Data Lineorder order cust supp key key key date revenue Supplier suppkey region nation Asia Russia Europe Spain Asia Japan Customer custkey region nation Asia China Asia India Asia India 4 Europe France Date date year

12 //5 Invisible Join Example Step : Apply selections on dimension tables suppkey region nation Asia Russia Supplier Europe Spain Asia Japan custkey region nation Asia China Customer Asia India Asia India 4 Europe France date year Date region = Asia region = Asia year in [99, 997] Hashtable {, } Hashtable {,, } Hashtable {*} Invisible Join Example Step : Find fact table tuples satisfying all selections on dimension tables supp key + Hashtable {, } cust key Hashtable {,, } date Hashtable {97, 97, 97} Invisible Join Example Step : Find fact table tuples satisfying all selections on dimension tables

13 //5 Invisible Join Example Step : Find fact table tuples satisfying all selections on dimension tables & & Invisible Join Example Step : Extract data from dimension tables Lineorder supp key + Supplier suppkey region nation + Asia Russia Europe Spain Asia Japan Russia Russia Invisible Join Example Step : Extract data from dimension tables Lineorder cust key Customer custkey region nation Asia China Asia India + Asia India 4 Europe France India China

14 //5 Invisible Join Example Step : Extract data from dimension tables Lineorder date Date date year Columm-Store Optimizations Optimizations Summary Compression Late Tuple Materialization Block Iteration Invisible Join Column-Store Optimizations Comparing Optimizations What is the speedup of each column-store optimization? Compression Late Tuple Materialization Block Iteration Invisible Join x(avg)/x(sorted) x.5-.5x.5-.75x From Column-Stores vs. Row-Stores: How Different Are They Really 4

15 //5 Column-Store vs Row-Store How better is a column-store than a row-store? Heated debate: (Exaggerated) claims of performance up to 6,x Can we simulate it in a row-store and get the performance benefits or does the row-store have to be internally modified? Another heated debate: Many papers on the topic Can we create a hybrid that will accommodate both transactional and analytics workloads? A column-store can be simulated in a row-store through: Vertical Partitioning Create one table per column Index-only Plans Create one index per column & use only indexes Materialized Views Create views of interest for given workload C-Table Vertical Partitioning Create one table per column ID Val Used to link these tables (column-stores use the order) 5

16 //5 Vertical Partitioning e.g., Phone City ID 4 ID Phone ID City ID 4 Overhead from storing tuple-id Excessive joins Index-only Plans Create one index per column & use only indexes Data are kept as rows but are never accessed Index-only Plans e.g., Phone City index Phone index City index index 6

17 //5 Index-only Plans e.g., SELECT FROM Person WHERE City= AND = Scan π City = = Index-only Plans e.g., SELECT FROM Person WHERE City= AND = 4 Scan π City = = Index-only Plans e.g., SELECT FROM Person WHERE City= AND = Scan π City = = 7

18 //5 Index-only Plans e.g., SELECT FROM Person WHERE City= AND = Scan π City = = + Avoids table access Have to scan index for attributes wo predicate Create indexes with composite keys! Materialized Views Create optimal view(s) for each query For each relation involved in the query, create a view including only columns needed to answer the query Q = π a b= x (R) a b c d a b Materialized Views e.g., SELECT FROM Person WHERE City= AND = Phone City City Requires workload knowledge 8

19 //5 C-Table Extend vertical partitioning with run-length encoding Rewrite queries to operate on the compressed data Start Pos Value Length C-Table e.g., Start Pos Length 4 + Pretty close to performance of native column-store Back to our question: Column-Store vs Row-Store Column-Stores have a definite advantage on analytic workflows but Row-Stores can be improved by taking lessons from them 9

20 //5 Implementations Open Source Commercial C-Store

Column-Stores vs. Row-Stores: How Different Are They Really?

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi, Samuel Madden and Nabil Hachem SIGMOD 2008 Presented by: Souvik Pal Subhro Bhattacharyya Department of Computer Science Indian