Column-Stores vs. Row-Stores: How Different Are They Really?

Similar documents
Column Stores vs. Row Stores How Different Are They Really?

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

1/3/2015. Column-Store: An Overview. Row-Store vs Column-Store. Column-Store Optimizations. Compression Compress values per column

Column-Stores vs. Row-Stores: How Different Are They Really?

Column- Stores vs. Row- Stores How Different Are They Really?

COLUMN STORE DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: Column Stores - SoSe

Data Modeling and Databases Ch 7: Schemas. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Column-Stores vs. Row-Stores How Different Are They Really?

Real-World Performance Training Star Query Edge Conditions and Extreme Performance

Column Store Internals

RELATIONAL OPERATORS #1

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

Beyond Relational Databases: MongoDB, Redis & ClickHouse. Marcos Albe - Principal Support Percona

Bridging the Processor/Memory Performance Gap in Database Applications

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

complex plans and hybrid layouts

Weaving Relations for Cache Performance

Outline. Database Tuning. Join Strategies Running Example. Outline. Index Tuning. Nikolaus Augsten. Unit 6 WS 2014/2015

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Weaving Relations for Cache Performance

Storage and Indexing

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Building Workload Optimized Solutions for Business Analytics

Column Stores - The solution to TB disk drives? David J. DeWitt Computer Sciences Dept. University of Wisconsin

Real-World Performance Training Star Query Prescription

Main-Memory Database Management Systems

BDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz

Avancier Methods (AM) From logical model to physical database

Weaving Relations for Cache Performance

Midterm Review. March 27, 2017

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Query Processing and Query Optimization. Prof Monika Shah

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters

Low Overhead Concurrency Control for Partitioned Main Memory Databases. Evan P. C. Jones Daniel J. Abadi Samuel Madden"

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

University of Waterloo Midterm Examination Sample Solution

CompSci 516: Database Systems

class 5 column stores 2.0 prof. Stratos Idreos

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

CSE 544, Winter 2009, Final Examination 11 March 2009

Evaluation of Relational Operations

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Database Applications (15-415)

New Requirements. Advanced Query Processing. Top-N/Bottom-N queries Interactive queries. Skyline queries, Fast initial response time!

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

CSE 544 Principles of Database Management Systems

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

EECS 647: Introduction to Database Systems

Using Druid and Apache Hive

Most database operations involve On- Line Transaction Processing (OTLP).

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

15-415/615 Faloutsos 1

CS122 Lecture 15 Winter Term,

Architecture-Conscious Database Systems

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Data on External Storage

Materialization Strategies in a Column-Oriented DBMS Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel R. Madden

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Jignesh M. Patel. Blog:

Introduction to column stores

C-Store: A column-oriented DBMS

Processing of Very Large Data

COLUMN DATABASES A NDREW C ROTTY & ALEX G ALAKATOS

The End of an Architectural Era (It's Time for a Complete Rewrite)

CS 222/122C Fall 2016, Midterm Exam

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

Spanner A distributed database system

DATA WAREHOUSING II. CS121: Relational Databases Fall 2017 Lecture 23

Column-Oriented Database Systems

Architecture-Conscious Database Systems

Programming GPUs for database operations

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Storage hierarchy. Textbook: chapters 11, 12, and 13

Review of Storage and Indexing

CMSC424: Database Design. Instructor: Amol Deshpande

Main-Memory Databases 1 / 25

Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

External Sorting Implementing Relational Operators

basic db architectures & layouts

column-stores basics

Kathleen Durant PhD Northeastern University CS Indexes

Modern Database Systems Lecture 1

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

CSIT5300: Advanced Database Systems

Query Optimization Overview

Outline. Query Types

Clydesdale: Structured Data Processing on MapReduce

Principles of Data Management. Lecture #9 (Query Processing Overview)

Database Management Systems (COP 5725) Homework 3

class 6 more about column-store plans and compression prof. Stratos Idreos

Query Optimization Overview

Low Overhead Concurrency Control for Partitioned Main Memory Databases

Transcription:

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi, Samuel Madden and Nabil Hachem SIGMOD 2008 Presented by: Souvik Pal Subhro Bhattacharyya Department of Computer Science Indian Institute of Technology, Bombay SIGMOD (2008) Column-Stores vs. Row-Stores 1 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 2 / 36

Column Stores Figure 1: Column Stores [1]. Row store: Data are stored in the disk tuple by tuple Column store: Data are stored in the disk column by column SIGMOD (2008) Column-Stores vs. Row-Stores 3 / 36

Column Stores A relational DB shows its data as 2D tables of columns and rows Example EmpId Lastname Firstname Salary 1 Smith Joe 40000 2 Jones Mary 50000 3 Johnson Cathy 44000 Table 1: Column store vs Row Store [2] SIGMOD (2008) Column-Stores vs. Row-Stores 4 / 36

Column Stores A relational DB shows its data as 2D tables of columns and rows Row Store: serializes all values of a row together Example Row Store 1,Smith,Joe,40000; 2,Jones,Mary,50000; 3,Johnson,Cathy,44000; EmpId Lastname Firstname Salary 1 Smith Joe 40000 2 Jones Mary 50000 3 Johnson Cathy 44000 Table 1: Column store vs Row Store [2] SIGMOD (2008) Column-Stores vs. Row-Stores 4 / 36

Column Stores Example A relational DB shows its data as 2D tables of columns and rows Row Store: serializes all values of a row together Column Store: serializes all values of a column together Row Store 1,Smith,Joe,40000; 2,Jones,Mary,50000; 3,Johnson,Cathy,44000; EmpId Lastname Firstname Salary 1 Smith Joe 40000 2 Jones Mary 50000 3 Johnson Cathy 44000 Table 1: Column store vs Row Store [2] Column Store 1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000; SIGMOD (2008) Column-Stores vs. Row-Stores 4 / 36

Column Stores Row Store Column Store (+) Easy to add/modify a record (+) Only need to read in relevant data (-) Might read in unnecessary data (-) Tuple writes require multiple accesses Table 2: Column store vs Row Store [1] SIGMOD (2008) Column-Stores vs. Row-Stores 5 / 36

Column Stores Row Store Column Store (+) Easy to add/modify a record (+) Only need to read in relevant data (-) Might read in unnecessary data (-) Tuple writes require multiple accesses Table 2: Column store vs Row Store [1] Column stores are suitable for read-mostly, read-intensive, large data repositories data warehouses decision support applications business intelligent applications For performance comparison, the star schema bench mark is used(ssbm) SIGMOD (2008) Column-Stores vs. Row-Stores 5 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 6 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 7 / 36

Simulating a Column-Store in a Row-Store Column-store performance from a row-store? Vertical Partitioning Index-only plans Materialized Views SIGMOD (2008) Column-Stores vs. Row-Stores 8 / 36

Vertical Partitioning Figure 2: Vertical Partitioning [1]. Features Full vertical partitioning of each relation. 1 physical table for each column. SIGMOD (2008) Column-Stores vs. Row-Stores 9 / 36

Vertical Partitioning Figure 2: Vertical Partitioning [1]. Features Primary key of relation may be long and composite Integer valued position column for each table. Thus each table has 2 columns. Joins required on position attribute for multi-column fetch. SIGMOD (2008) Column-Stores vs. Row-Stores 9 / 36

Vertical Partitioning Figure 2: Vertical Partitioning [1]. Problems Position attribute: stored for every column wastes disk space and bandwidth large header per tuple more space is wasted Joining tables for multi-column fetch Hash Join slow Index Join slower SIGMOD (2008) Column-Stores vs. Row-Stores 9 / 36

Index-only plans Figure 3: Index-only plans [1]. Features Unclustered B+ tree index on each table column Plans never access actual tuples on the disk Tuple headers not stored, so overhead is less SIGMOD (2008) Column-Stores vs. Row-Stores 10 / 36

Index-only plans Figure 3: Index-only plans [1]. Features Indices stored as (record-id, value) pairs. All rids stored No duplicate values stored SIGMOD (2008) Column-Stores vs. Row-Stores 10 / 36

Index-only plans Figure 3: Index-only plans [1]. Problems Separate indices may require full index scan which is slow Example Solution: Composite indices required to answer queries directly SELECT AVG(SALARY) FROM emp WHERE AGE>40 SIGMOD (2008) Column-Stores vs. Row-Stores 10 / 36

Materialized views Features Optimal set of MVs created for given query Contains only those columns required to answer the query. Tuple headers are stored just once per tuple Provides just the required amount of data Problems Query should be known in advance SIGMOD (2008) Column-Stores vs. Row-Stores 11 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 12 / 36

Optimizations in Column-Oriented DBs Compression Late Materialization Block Iteration Invisible Join SIGMOD (2008) Column-Stores vs. Row-Stores 13 / 36

Compression Features Low information entropy in columns than rows Decompression performance more valuable than compression achievable Advantages Low disk space Lesser I/O Performance increases if queries executed directly on compressed data Figure 4: Compression[1]. SIGMOD (2008) Column-Stores vs. Row-Stores 14 / 36

Late Materialization Information about entities stored in different tables. Most queries access multiple attributes of an entity. Naive column-store approach-early Materialization Read necessary columns from disk Construct tuples from component attributes Perform normal row-store operations of these tuples Much of performance potential unused SIGMOD (2008) Column-Stores vs. Row-Stores 15 / 36

Late Materialization Features Keep data in columns and operate on column data until late into the query plan Example Intermediate position lists need to be created. Required for matching up operations performed on different columns. SELECT R.a FROM R WHERE R.c = 5 AND R.b = 10 Output of each predicate is a bit string Perform Bitwise AND Use final position list to extract R.a SIGMOD (2008) Column-Stores vs. Row-Stores 16 / 36

Late Materialization Advantages Selection and Aggregation limits the number of tuples generated Compressed data need not be decompressed for creating tuples Better cache performance PAX Block iteration works better on columns than on rows SIGMOD (2008) Column-Stores vs. Row-Stores 17 / 36

Partition Attributes Across (PAX) Features column interleaving minimal row reconstruction cost only relevant data in cache minimizes cache misses effective when applying querying on a particular attribute Figure 5: PAX [3]. SIGMOD (2008) Column-Stores vs. Row-Stores 18 / 36

Block Iteration Features Operators operate on blocks of tuples at once Iterate over blocks of tuples rather than a single tuple Avoids multiple function calls on each tuple to extract data Data is extracted from a batch of tuples Fixed length columns can be operated as arrays Minimizes per-tuple overhead Exploits potential for parallelism SIGMOD (2008) Column-Stores vs. Row-Stores 19 / 36

Star Schema Benchmark Figure 6: Star Schema Benchmark [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 20 / 36

Invisible Join Example SELECT c.nation, s.nation, d.year, sum(lo.revenue) as revenue FROM customer AS c, lineorder AS lo, supplier AS s, dwdate AS d WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND s.region = ASIA AND d.year >= 1992 and d.year <= 1997 GROUP BY c.nation, s.nation, d.year ORDER BY d.year asc, revenue desc; Find total revenue from customers who live in ASIA and who purchase from an Asian supplier between 1992 and 1997 grouped by nation of customer, nation of supplier and year of transaction SIGMOD (2008) Column-Stores vs. Row-Stores 21 / 36

Invisible Join Traditional Plan Pipelines join in order of predicate selectivity. Disadvantage: misses out on late materialization Late materialized join: Disadvantage After join the list of positions for dimension tables are unordered Group by columns in dimension tables need to be extracted in out-of-position order. Figure 7: Late materialization [1]. SIGMOD (2008) Column-Stores vs. Row-Stores 22 / 36

Invisible Join Phase 1 Figure 8: Phase 1 [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 23 / 36

Invisible Join Phase 2 Figure 9: Phase 2 [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 24 / 36

Invisible Join Phase 3 Figure 10: Phase 3 [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 25 / 36

Invisible Join Between-Predicate Rewriting Figure 11: Between-Predicate Rewriting [1]. SIGMOD (2008) Column-Stores vs. Row-Stores 26 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 27 / 36

Goal Performance comparison of C-Store with R-Store Performance comparison of C-Store with column-store simulation on a R-Store Finding the best optimization for a column-store Comparison between invisible join and denormalized table SIGMOD (2008) Column-Stores vs. Row-Stores 28 / 36

C-Store(CS) vs. System-X(RS) Figure 12: C-Store(CS) and System-X(RS) [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 29 / 36

C-Store(CS) vs. System-X(RS) First three rows as per expectation. For CS(Row-MV) materialized data is stored as strings in C-store. Expected that both RS(MV) and CS(Row-MV) will perform similarly However RS(MV) performs better No support for multi-threading and partitioning in C-Store. Disabling partitioning in RS(MV) halves performance Difficult to compare across systems C-Store(CS) 6 times faster than CS(Row-MV) Both read minimal amount of data from disk to answer a query I/O savings- not the only reason for performance advantage SIGMOD (2008) Column-Stores vs. Row-Stores 30 / 36

Column store simulation in Row store Traditional Vertical Partitioning: Each column is a relation Index-only plans: B+Tree on each column Materialized Views: Optimal set of views for every query SIGMOD (2008) Column-Stores vs. Row-Stores 31 / 36

Column store simulation in Row store MV < T < VP < AI (time taken) Without partitioning, T VP Vertical partitioning: Tuple Overhead 1 Column Whole Table T VP 1.1 GB CS 240 MB 4 GB 2.3 GB Index-only plans: Column Joins Hash Join: takes a long time Index Join: high index access overhead Merge Join: unable to skip sort step Figure 13: Column store simulation in Row store [4]. SIGMOD (2008) Column-Stores vs. Row-Stores 32 / 36

Breakdown of Column-Store Advantages Start with C-Store Remove optimizations one by one Finally emulate Row-Store Late materialization improves 3 times Compression improves 2 times Invisible Join improves 50% Block processing improves 5-50% SIGMOD (2008) Column-Stores vs. Row-Stores 33 / 36

Outline 1 Introduction 2 Column-Stores vs. Row-Stores Row-oriented execution Column-oriented execution 3 Experiments 4 Conclusion SIGMOD (2008) Column-Stores vs. Row-Stores 34 / 36

Conclusion C-Store emulation on R-Store is done by vertical partitioning, index plans Emulation does not yield good performance Reasons for low performance by emulation High tuple reconstruction costs High tuple overhead Reasons for high performance of C-Store Late Materialization Compression Invisible Join SIGMOD (2008) Column-Stores vs. Row-Stores 35 / 36

References [1] S. Harizopoulos, D. Abadi, and P. Boncz, Column-Oriented Database Systems, in VLDB, 2009. [2] http://en.wikipedia.org/wiki/column-oriented DBMS. [3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis, Weaving Relations for Cache Performance, in VLDB, 2001. [4] D. J. Abadi, S. R. Madden, and N. Hachem, Column-Stores vs. Row-Stores: How Different Are They Really? in SIGMOD, June 2008. SIGMOD (2008) Column-Stores vs. Row-Stores 36 / 36