COLUMN STORE DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: Column Stores - SoSe



2 Telco Data Warehousing Example (Real Life)
Michael Stonebraker et al.: One Size Fits All? Part 2: Benchmarking Studies. CIDR 2007
Star schema with the tables account, toll, usage, source
Query2:
    SELECT account.account_number, SUM(usage.toll_airtime), SUM(usage.toll_price)
    FROM usage, toll, source, account
    WHERE usage.toll_id = toll.toll_id
      AND usage.source_id = source.source_id
      AND usage.account_id = account.account_id
      AND toll.type_ind IN ('AE', 'AA')
      AND usage.toll_price > 0
      AND source.type != 'CIBER'
      AND toll.rating_method = 'IS'
      AND usage.invoice_date = [date missing]
    GROUP BY account.account_number
[Table: query running times (seconds) for Query1 to Query5; labels: 7 columns, Column Store, 212 columns, Row Store; values not recoverable]


3 Column Store Database Systems: Idea
Goal: reduce the number of disk accesses and the amount of data to read
Row store:
+ easy to insert/update a record
- might read in unnecessary data
Column store:
+ only need to read in relevant data
+ higher compression ratio
- insert/update require multiple accesses
- expensive reads on entire records
Column stores are suitable for read-mostly, read-intensive, large data repositories
Source: Abadi/Boncz/Harizopoulos:VLDB2009
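The trade-off above can be illustrated with a minimal sketch. The toy table, its attribute names, and the "values touched" counter are assumptions made for illustration; they are not taken from the slides and only mimic the difference in data read by a single-attribute scan.

```python
# Toy table; attribute names and values are invented for illustration.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Row layout: complete tuples stored one after another.
row_store = list(rows)

# Column layout: one list per attribute.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def sum_price_row_store(store):
    """Scan whole tuples; returns (sum, number of values touched)."""
    total, touched = 0.0, 0
    for record in store:
        touched += len(record)  # the full record is read, relevant or not
        total += record[2]
    return total, touched


def sum_price_column_store(store):
    """Scan only the price column; returns (sum, number of values touched)."""
    column = store["price"]
    return sum(column), len(column)
```

Both functions return the same sum, but the row-store scan touches every attribute of every record, while the column-store scan touches only the one column it needs.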

4 Column Store Key Features
Storage Layout
- Columnar storage
- Compression
- Multiple sort orders
Execution Engine
- Avoid decompression by operating directly on compressed data
- Early vs. late materialization
- Updates

5 Applications for Column Stores
- Data Warehousing (high end, personal analytics)
- Data Mining
- RDF
- Information Retrieval
- Scientific Datasets
- Sparse and schema-flexible data within Column Family Database Systems (see chapter NoSQL Database Systems)

6 History: From DSM to Column Stores
- First approaches in the 1970s (scientific databases and data analysis)
- 1985: DSM paper: G. P. Copeland and S. Khoshafian: A decomposition storage model. SIGMOD Conference 1985
- 1990s: Commercialization through Sybase IQ
- Late 90s to 2000s: Focus on main-memory performance (DSM on steroids with MonetDB)
- Re-birth of read-optimized DSM as Column Store (C-Store, MonetDB/X100 etc.)
Literature:
- M. Stonebraker, D. J. Abadi, A. Batkin et al.: C-Store: A Column-oriented DBMS. VLDB 2005
- D. J. Abadi, S. Madden, N. Hachem: Column-stores vs. row-stores: how different are they really? SIGMOD Conference 2008
- D. J. Abadi, P. A. Boncz, S. Harizopoulos: Column-oriented Database Systems. VLDB Conference 2009 (Tutorial)

7 Column Store Database Systems
Commercial Systems: Sybase IQ, Vertica, VectorWise, 1010data, ParAccel, Infobright, Exasol, SAP HANA
Open Source Systems: MonetDB, Infobright, (C-Store)

8 Column Store Database Systems
- Applications and Systems
- Storage Layout
- Execution Engine
- Alternatives and Trends

9 Storage Layout
Column-oriented storage layout
- Higher data value locality in column stores
- Columns compress better than rows
  - Typical row-store compression ratio 1 : 3
  - Column-store 1 : 10 (up to 1 : 30)
- Caveat: CPU cost (use lightweight compression)
- Can use extra space to store multiple copies of data in different sort orders

10 Compression: Run-length Encoding
Source: Abadi/Boncz/Harizopoulos:VLDB2009
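The slide's figure is not preserved, so the following is only a minimal sketch of run-length encoding. It assumes runs are stored as (value, run_length) pairs; real systems may also store a start position per run. The function names and sample column are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


quarters = ["Q1", "Q1", "Q1", "Q2", "Q2", "Q3"]
encoded = rle_encode(quarters)
```

The scheme pays off on sorted or clustered columns, where long runs of identical values are common.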

11 Compression: Bit-vector Encoding
- For each unique value v in column c, create bit-vector b: b[i] = 1 if c[i] = v
- Good for columns with few unique values
- Each bit-vector can be further compressed if sparse
Source: Abadi/Boncz/Harizopoulos:VLDB2009
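The definition b[i] = 1 if c[i] = v can be sketched as follows. Plain Python lists stand in for packed bit-vectors, and the function name is hypothetical.

```python
def bitvector_encode(column):
    """Return {value: bits} where bits[i] == 1 exactly when column[i] == value."""
    vectors = {}
    for i, value in enumerate(column):
        # Create an all-zero vector the first time a value is seen.
        if value not in vectors:
            vectors[value] = [0] * len(column)
        vectors[value][i] = 1
    return vectors


vectors = bitvector_encode([1, 3, 1, 2])
```

One vector is created per distinct value, which is why the scheme suits columns with few unique values.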

12 Compression: Dictionary Encoding
- For each unique value create dictionary entry
- Dictionary can be per-block or per-column
Source: Abadi/Boncz/Harizopoulos:VLDB2009
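A minimal sketch of dictionary encoding, assuming one dictionary for the whole column and codes assigned in order of first appearance. Both choices, and the function names, are assumptions for illustration.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code; returns (dictionary, codes)."""
    dictionary = []  # code -> value
    code_of = {}     # value -> code
    codes = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary to restore the column."""
    return [dictionary[code] for code in codes]


dictionary, codes = dictionary_encode(["ASIA", "EUROPE", "ASIA", "ASIA"])
```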

13 Compression: Frame of Reference Encoding
- Encodes values as b-bit offset from chosen frame of reference
- Special escape code (e.g. all bits set to 1) indicates a difference larger than can be stored in b bits
- After escape code, original (uncompressed) value is written
Source: Abadi/Boncz/Harizopoulos:VLDB2009
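The escape mechanism can be sketched as below. This is an illustration under assumptions: the output is a list of integers rather than a packed bit stream, and signed offsets are mapped to non-negative codes by adding a bias, which is one possible convention and not necessarily the one in the source figure.

```python
def for_encode(column, frame, b):
    """Frame-of-reference encode `column` into b-bit codes.

    Codes 0 .. 2**b - 2 stand for offsets -bias .. +bias around `frame`.
    The all-ones code 2**b - 1 is the escape; the raw value follows it.
    """
    escape = (1 << b) - 1
    bias = (escape - 1) // 2
    out = []
    for value in column:
        offset = value - frame
        if -bias <= offset <= bias:
            out.append(offset + bias)
        else:
            out.extend([escape, value])  # offset too large for b bits
    return out


def for_decode(stream, frame, b):
    """Invert for_encode."""
    escape = (1 << b) - 1
    bias = (escape - 1) // 2
    values, i = [], 0
    while i < len(stream):
        if stream[i] == escape:
            values.append(stream[i + 1])  # raw value after the escape code
            i += 2
        else:
            values.append(frame + stream[i] - bias)
            i += 1
    return values


prices = [45, 54, 48, 40, 50, 62]
encoded = for_encode(prices, frame=50, b=4)
```

With b = 4 the representable offsets are -7 to +7, so 40 and 62 are written after an escape code while the other values fit into four bits each.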


14 Compression: Differential Encoding
- Encodes values as b-bit offset from previous value
- Special escape code (just like frame of reference encoding) indicates a difference larger than can be stored in b bits
- After escape code, original (uncompressed) value is written
- Performs well on columns containing increasing/decreasing sequences
  - inverted lists
  - timestamps
  - object IDs
  - sorted / clustered columns
Source: Abadi/Boncz/Harizopoulos:VLDB2009
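A minimal sketch of differential encoding with an escape code. As assumptions for illustration, the first value is stored as-is, the column is non-empty, differences are mapped to non-negative codes with a bias, and a list of integers stands in for a packed bit stream.

```python
def delta_encode(column, b):
    """Encode each value as a b-bit difference from its predecessor.

    Assumes a non-empty column. The all-ones code 2**b - 1 is the escape
    and is followed by the raw value.
    """
    escape = (1 << b) - 1
    bias = (escape - 1) // 2
    out = [column[0]]  # first value stored uncompressed
    for prev, value in zip(column, column[1:]):
        diff = value - prev
        if -bias <= diff <= bias:
            out.append(diff + bias)
        else:
            out.extend([escape, value])
    return out


def delta_decode(stream, b):
    """Invert delta_encode."""
    escape = (1 << b) - 1
    bias = (escape - 1) // 2
    values, i = [stream[0]], 1
    while i < len(stream):
        if stream[i] == escape:
            values.append(stream[i + 1])
            i += 2
        else:
            values.append(values[-1] + stream[i] - bias)
            i += 1
    return values


timestamps = [1000, 1002, 1003, 1010, 1500, 1501]
encoded = delta_encode(timestamps, b=4)
```

Only the jump from 1010 to 1500 needs the escape path; the small steps typical of timestamps or sorted IDs fit into the b-bit codes.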

15 What Compression Scheme To Use?
Source: Abadi/Boncz/Harizopoulos:VLDB2009



18 Operating Directly on Compressed Data
SELECT productid, COUNT(*) FROM table WHERE quarter = 'Q2' GROUP BY productid
Source: Abadi/Boncz/Harizopoulos:VLDB2009
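How such a query might be answered without decompressing is sketched below. The sketch assumes both columns are run-length encoded as (value, start_position, run_length) triples with 0-based positions; the function name and the sample runs are hypothetical.

```python
from collections import Counter


def count_by_product(quarter_runs, product_runs, wanted_quarter):
    """COUNT(*) per productid for one quarter, computed on RLE runs only."""
    counts = Counter()
    for q_value, q_start, q_len in quarter_runs:
        if q_value != wanted_quarter:
            continue  # a whole run is skipped with one comparison
        q_end = q_start + q_len
        for p_value, p_start, p_len in product_runs:
            # Number of positions shared by the two runs.
            overlap = min(q_end, p_start + p_len) - max(q_start, p_start)
            if overlap > 0:
                counts[p_value] += overlap
    return dict(counts)


quarter_runs = [("Q1", 0, 4), ("Q2", 4, 5), ("Q3", 9, 1)]
product_runs = [(1, 0, 3), (2, 3, 1), (1, 4, 2), (2, 6, 3), (1, 9, 1)]
result = count_by_product(quarter_runs, product_runs, "Q2")
```

The counts come from run lengths alone; no individual row value is ever materialized.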

19 Early Materialization
When should tuples be constructed?
Solution 1: Create rows first = Early Materialization (EM)
SELECT custid, SUM(price) FROM table WHERE (prodid = 4) AND (storeid = 1) GROUP BY custid
Drawbacks:
- Need to construct ALL tuples
- Need to decompress data
- Poor memory bandwidth utilization
Source: Abadi/Boncz/Harizopoulos:VLDB2009

20 Solution 2: Operate on Columns = Late Materialization (LM), Step 1
SELECT custid, SUM(price) FROM table WHERE (prodid = 4) AND (storeid = 1) GROUP BY custid
Source: Abadi/Boncz/Harizopoulos:VLDB2009

21 Operate on Columns: Late Materialization, Step 2 (query as in Step 1)
Source: Abadi/Boncz/Harizopoulos:VLDB2009

22 Operate on Columns: Late Materialization, Step 3 (query as in Step 1)
Source: Abadi/Boncz/Harizopoulos:VLDB2009

23 Operate on Columns: Late Materialization, Step 4 (query as in Step 1)
Source: Abadi/Boncz/Harizopoulos:VLDB2009
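The figures for the four steps are not preserved. The sketch below shows one common way to break late materialization into four steps for the query on these slides; the step boundaries, the function name, and the sample data are assumptions and may differ from the original figures.

```python
def late_materialized_sum(prodid, storeid, custid, price):
    """SUM(price) per custid WHERE prodid = 4 AND storeid = 1, column-wise."""
    # Step 1: evaluate each predicate on its own column, giving bit-vectors.
    prod_bits = [1 if v == 4 else 0 for v in prodid]
    store_bits = [1 if v == 1 else 0 for v in storeid]
    # Step 2: AND the bit-vectors to obtain the qualifying positions.
    positions = [i for i, (p, s) in enumerate(zip(prod_bits, store_bits)) if p & s]
    # Step 3: fetch custid and price only at those positions.
    cust_vals = [custid[i] for i in positions]
    price_vals = [price[i] for i in positions]
    # Step 4: aggregate; tuples exist only for the rows that qualified.
    totals = {}
    for cust, amount in zip(cust_vals, price_vals):
        totals[cust] = totals.get(cust, 0) + amount
    return totals


result = late_materialized_sum(
    prodid=[4, 4, 2, 4, 4],
    storeid=[1, 2, 1, 1, 1],
    custid=[7, 7, 8, 8, 7],
    price=[10, 20, 30, 40, 50],
)
```

Only three of the five rows qualify, so custid and price are read at three positions instead of being stitched into five full tuples first.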


24 Early vs. Late Materialization
For plans without joins, late materialization is a win
Example: Abadi, Myers, DeWitt, and Madden. Materialization Strategies in a Column-Oriented DBMS. ICDE 2007
SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
Ran on 2 compressed columns from TPC-H
Source: Abadi/Boncz/Harizopoulos:VLDB2009

25 Early vs. Late Materialization
Even on uncompressed data, late materialization is still a win
SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
Source: Abadi/Boncz/Harizopoulos:VLDB2009

26 What about plans with joins? Early Materialization Example
Source: Abadi/Boncz/Harizopoulos:VLDB2009

27 What about plans with joins? Early Materialization Example (Cont.)
Source: Abadi/Boncz/Harizopoulos:VLDB2009

28 What about plans with joins? Late Materialization Example (figure annotation: Position!)
Source: Abadi/Boncz/Harizopoulos:VLDB2009

29 Late Materialized Join Performance
- Naïve LM join is about 2x slower than EM join on typical queries (due to random I/O)
- This number is very dependent on
  - Amount of memory available
  - Number of projected attributes
  - Join cardinality
- But we can do better
  - Invisible Join
  - Jive/Flash Join
  - Radix cluster/decluster join

30 Invisible Join [Abadi/Madden/Hachem:SIGMOD2008]
- Designed for typical joins when data is modeled using a star schema
- One ("fact") table is joined with multiple dimension tables
Typical query:
    SELECT c_nation, s_nation, d_year, sum(lo_revenue) AS revenue
    FROM customer, lineorder, supplier, date
    WHERE lo_custkey = c_custkey
      AND lo_suppkey = s_suppkey
      AND lo_orderdate = d_datekey
      AND c_region = 'ASIA'
      AND s_region = 'ASIA'
      AND d_year >= 1992 AND d_year <= 1997
    GROUP BY c_nation, s_nation, d_year
    ORDER BY d_year ASC, revenue DESC;

31 Invisible Join: Example
Source: Abadi/Boncz/Harizopoulos:VLDB2009

32 Invisible Join: Example (Cont.)
Original fact table: lineorder
Source: Abadi/Boncz/Harizopoulos:VLDB2009

33 Invisible Join: Example (Cont.)
Source: Abadi/Boncz/Harizopoulos:VLDB2009

34 Invisible Join: Bottom Line
Invisible Join
- Many data warehouses model data using star/snowflake schemas
- Joins of one (fact) table with many dimension tables are common
- Invisible join takes advantage of this by making sure that, for each join, the table that can be accessed in position order is the fact table
- Position lists from the fact table are then intersected (in position order)
- This reduces the amount of data that must be accessed out of order from the dimension tables
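The idea can be sketched in three phases. This is a simplified illustration, not the algorithm as specified in the cited paper: the dictionaries standing in for dimension tables, the two-dimension schema, and all names below are assumptions.

```python
def invisible_join_positions(fact_fk_columns, dimension_key_sets):
    """Return fact-table positions whose foreign keys all hit the key sets.

    fact_fk_columns: {name: list of foreign keys, one per fact row}
    dimension_key_sets: {name: set of dimension keys passing the predicate}
    """
    position_lists = []
    for name, fk_column in fact_fk_columns.items():
        keys = dimension_key_sets[name]
        # Scan the fact column in position order, probing a small key set.
        position_lists.append({i for i, fk in enumerate(fk_column) if fk in keys})
    # Intersect the per-dimension position lists.
    return sorted(set.intersection(*position_lists))


# Dimension tables: key -> (region, nation). Sample data is invented.
customer = {1: ("ASIA", "CHINA"), 2: ("EUROPE", "FRANCE"), 3: ("ASIA", "INDIA")}
supplier = {1: ("ASIA", "JAPAN"), 2: ("AMERICA", "US")}
lo_custkey = [1, 2, 3, 3, 1]
lo_suppkey = [1, 1, 2, 1, 2]
lo_revenue = [100, 200, 300, 400, 500]

# Phase 1: apply each dimension predicate, keeping only the matching keys.
key_sets = {
    "cust": {k for k, (region, _) in customer.items() if region == "ASIA"},
    "supp": {k for k, (region, _) in supplier.items() if region == "ASIA"},
}
# Phase 2: build and intersect position lists over the fact table.
positions = invisible_join_positions(
    {"cust": lo_custkey, "supp": lo_suppkey}, key_sets
)
# Phase 3: visit the dimension tables only for the surviving positions.
result = [
    (customer[lo_custkey[p]][1], supplier[lo_suppkey[p]][1], lo_revenue[p])
    for p in positions
]
```

The dimension tables are touched out of order only in the last phase, and only for the fact rows that survived every predicate.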


35 Jive/Flash Join
Still accessing table out of order
Jive/Flash Join
[Li and Ross: Fast Joins Using Join Indices, VLDBJ 8:1-24, 1999]
[Tsirogiannis, Harizopoulos et al.: Query Processing Techniques for Solid State Drives. SIGMOD 2009]
Source: Abadi/Boncz/Harizopoulos:VLDB2009

36 Jive/Flash Join (Cont.)
1. Add column with dense ascending integers from 1
2. Sort new position list by second column
3. Probe projected column in order using new sorted position list, keeping first column from position list around
4. Sort new result by first column
Source: Abadi/Boncz/Harizopoulos:VLDB2009
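The four steps can be traced in a short sketch. It assumes the join produced a list of 1-based positions into the inner table's projected column; the function name and sample data are hypothetical.

```python
def jive_project(join_positions, projected_column):
    """Fetch projected values for a join result while probing in order."""
    # 1. Add a column of dense ascending integers starting at 1.
    numbered = list(enumerate(join_positions, start=1))
    # 2. Sort the new position list by its second column (inner position).
    numbered.sort(key=lambda pair: pair[1])
    # 3. Probe the projected column in position order, keeping the first column.
    probed = [(seq, projected_column[pos - 1]) for seq, pos in numbered]
    # 4. Sort the result by the first column to restore the join order.
    probed.sort(key=lambda pair: pair[0])
    return [value for _, value in probed]


names = ["Smith", "Johnson", "Williams", "Jones"]
result = jive_project([2, 4, 2, 1], names)
```

Step 3 reads the projected column front to back, which is what turns the random probes of a naive late-materialized join into sequential access, at the price of the two sorts.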

37 Jive/Flash Join: Bottom Line
Jive/Flash Join
Instead of probing projected columns from the inner table out of order:
- Sort join index
- Probe projected columns in order
- Sort result using an added column
LM vs EM tradeoffs:
- LM has the extra sorts (EM accesses all columns in order)
- LM only has to fit join columns into memory (EM needs join columns and all projected columns)
- LM only has to materialize relevant columns
- In many cases LM advantages outweigh disadvantages
LM would be a clear winner if not for those pesky sorts. Can we do better?

38 LM vs EM joins
Radix Cluster/Decluster Join
- The full sort from the Jive join is actually overkill
- We just want to access the storage blocks in order (we don't mind random access within a block)
[Manegold/Boncz/Kersten: Database Architecture Optimized for the New Bottleneck: Memory Access, VLDB1999]
[Manegold/Boncz/Kersten: Generic Database Cost Models for Hierarchical Memory Systems, VLDB2004]
[Manegold/Boncz/Nes: Cache-Conscious Radix-Decluster Projections, VLDB2004]
Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins
Research papers show that LM joins become 2x faster than EM joins (instead of 2x slower) for a wide array of query types
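The observation that block order is enough can be illustrated with a deliberately simplified sketch. It is not the radix-cluster algorithm from the cited papers; it only groups positions by storage block in one pass, and the block size and function name are assumptions.

```python
def cluster_by_block(positions, block_size):
    """Order positions by storage block instead of sorting them fully.

    Positions inside one block keep their arrival order, since random
    access within a block is tolerated.
    """
    buckets = {}
    for pos in positions:
        buckets.setdefault(pos // block_size, []).append(pos)
    # Only the block numbers are sorted, not the individual positions.
    return [pos for block in sorted(buckets) for pos in buckets[block]]


clustered = cluster_by_block([9, 2, 14, 0, 11, 3], block_size=4)
```

The result visits blocks 0, 2 and 3 in ascending order, while positions 2, 0, 3 inside block 0 stay unsorted.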

39 Tuple Construction Heuristics
For queries with selective predicates, aggregations, or compressed data, use late materialization
For joins:
- Research papers: always use late materialization
- Commercial systems: the inner table of a join is often materialized before the join (reduces system complexity)
- Some systems will use late materialization only if columns from the inner table can fit entirely in memory


41 Updates
Column-stores are update-in-place averse
In-place: I/O for each column + re-compression + multiple sorted replicas + sparse tree indices
Update-in-place is infeasible!

42 Updates (Cont.)
Column-stores use differential mechanisms instead
- Differential lists/files or more advanced
- Updates buffered in RAM, merged on each query
- Checkpointing merges differences in bulk sequentially
- I/O trends favor this anyway (trade RAM for converting random into sequential I/O)
Detailed discussion in next chapter (In-Memory Database Systems)
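A minimal sketch of the differential idea, assuming a single column, an insert list, and a set of deleted positions. The class and its method names are hypothetical and do not describe any particular system.

```python
class DeltaColumn:
    """Read-optimized base column plus an in-memory buffer of differences."""

    def __init__(self, base):
        self.base = list(base)  # stands in for the compressed, sorted store
        self.inserts = []       # buffered new values
        self.deletes = set()    # buffered deleted base positions

    def insert(self, value):
        self.inserts.append(value)  # no I/O against the base column

    def delete(self, position):
        self.deletes.add(position)

    def scan(self):
        """Merge the buffered differences into the result on each query."""
        live = [v for i, v in enumerate(self.base) if i not in self.deletes]
        return live + self.inserts

    def checkpoint(self):
        """Fold all buffered differences into the base in one bulk pass."""
        self.base = self.scan()
        self.inserts = []
        self.deletes = set()


col = DeltaColumn([10, 20, 30])
col.insert(40)
col.delete(1)
before_checkpoint = col.scan()
base_before_checkpoint = list(col.base)
col.checkpoint()
```

Until the checkpoint runs, the base column is never modified; every query pays a small merge cost instead of every update paying for recompression and re-sorting.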


44 Simulate a Column-Store inside a Row-Store
Source: Abadi/Boncz/Harizopoulos:VLDB2009


45 Simulate a Column-Store inside a Row-Store [Abadi/Hachem/Madden:SIGMOD2008]
SSBM (Star Schema Benchmark): very common data warehousing benchmark (based on the TPC-H benchmark data model)
Source: Abadi/Hachem/Madden:SIGMOD2008

46 Trend: Hybrid Column-Row Systems
Column-store features added to row-stores
- Oracle
  - First approaches in Oracle 11g Release 2 on Exadata systems (Appliance, 2010): hybrid columnar compression
  - July 2014: Oracle In-Memory Database: duplicate data column-oriented in main memory
- IBM Smart Analytics Optimizer 2010
- MS SQL Server
  - MS SQL Server 2012: new index type COLUMNSTORE
  - MS SQL Server 2014: Clustered Column Store Index (full table)
- IBM DB2 BLU Acceleration (April 2013): column-organized tables
- PostgreSQL: extension for PostgreSQL (April 2014)

47 Column Store Database Systems: Conclusion
- Columnar techniques provide clear benefits for:
  - Data warehousing, BI
  - Information retrieval, graphs
- A number of crucial techniques make them effective
- Row-stores and column-stores could be combined

48 Big Data Technologies
- Introduction
- NoSQL Database Systems
- Column Store Database Systems
- In-Memory Database Systems
- Conclusion & Outlook
