COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

Similar documents
Column Stores vs. Row Stores How Different Are They Really?

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Column- Stores vs. Row- Stores How Different Are They Really?

Column-Stores vs. Row-Stores: How Different Are They Really?

Column-Stores vs. Row-Stores: How Different Are They Really?

HYRISE In-Memory Storage Engine

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

COLUMN DATABASES A NDREW C ROTTY & ALEX G ALAKATOS

1/3/2015. Column-Store: An Overview. Row-Store vs Column-Store. Column-Store Optimizations. Compression Compress values per column

Column Stores - The solution to TB disk drives? David J. DeWitt Computer Sciences Dept. University of Wisconsin

COLUMN STORE DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: Column Stores - SoSe

Column-Stores vs. Row-Stores How Different Are They Really?

In-Memory Data Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ColumnStore Indexes. מה חדש ב- 2014?SQL Server.

Column Store Internals

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

DATA WAREHOUSING II. CS121: Relational Databases Fall 2017 Lecture 23

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

External Sorting Implementing Relational Operators

Main-Memory Databases 1 / 25

Hyrise - a Main Memory Hybrid Storage Engine

Oracle Database In-Memory

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

C-STORE: A COLUMN- ORIENTED DBMS

Syllabus. Syllabus. Motivation Decision Support. Syllabus

C-Store: A column-oriented DBMS

In-Memory Data Management Jens Krueger

The Right Read Optimization is Actually Write Optimization. Leif Walsh

In-Memory Data Structures and Databases Jens Krueger

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Database Applications (15-415)

Exadata Implementation Strategy

Albis: High-Performance File Format for Big Data Systems

SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less

class 6 more about column-store plans and compression prof. Stratos Idreos

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

VOLTDB + HP VERTICA. page

HyPer-sonic Combined Transaction AND Query Processing

Single-pass restore after a media failure. Caetano Sauer, Goetz Graefe, Theo Härder

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

Evolution of Database Systems

Safe Harbor Statement

HyPer-sonic Combined Transaction AND Query Processing

Data-Intensive Distributed Computing

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

Column-Oriented Database Systems. Liliya Rudko University of Helsinki

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Storage hierarchy. Textbook: chapters 11, 12, and 13

The End of an Architectural Era (It's Time for a Complete Rewrite)

Informatica Data Explorer Performance Tuning

Query Processing and Query Optimization. Prof Monika Shah

IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:

Introduction to Column Stores with MemSQL. Seminar Database Systems Final presentation, 11. January 2016 by Christian Bisig

MemTest: A Novel Benchmark for In-memory Database

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Oracle Database 11g Release 2 for SAP Advanced Compression. Christoph Kersten Oracle Database for SAP Global Technology Center (Walldorf, Germany)

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7

HadoopDB: An open source hybrid of MapReduce

Database Applications (15-415)

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

DATABASE COMPRESSION. Pooja Nilangekar [ ] Rohit Agrawal [ ] : Advanced Database Systems

HANA Performance. Efficient Speed and Scale-out for Real-time BI

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Part 1: Indexes for Big Data

Information Systems (Informationssysteme)

BDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz

Overview of Implementing Relational Operators and Query Evaluation

Approaching the Petabyte Analytic Database: What I learned

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

Disks, Memories & Buffer Management

Oracle Database In-Memory

Low Overhead Concurrency Control for Partitioned Main Memory Databases. Evan P. C. Jones Daniel J. Abadi Samuel Madden"

Achieving Horizontal Scalability. Alain Houf Sales Engineer

RAID in Practice, Overview of Indexing

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Chapter 5. Indexing for DWH

1) Partitioned Bitvector

SQL Server 2014 Column Store Indexes. Vivek Sanil Microsoft Sr. Premier Field Engineer

Materialization Strategies in a Column-Oriented DBMS Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel R. Madden

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

SAP HANA Scalability. SAP HANA Development Team

Revolutionizing Data Warehousing in Telecom with the Vertica Analytic Database

SAP NetWeaver BW Performance on IBM i: Comparing SAP BW Aggregates, IBM i DB2 MQTs and SAP BW Accelerator

Principles of Data Management. Lecture #9 (Query Processing Overview)

Memory-Based Cloud Architectures

Transcription:

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL

Introduction On analytical workloads, Column store are found to perform order of magnitude better than traditional row-oriented database systems row stores Elevator pitch: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query Factor of 2 on price/performance Factor of 5 on performance

Data Warehouse and DBMS Software $5 Billion Data warehouse industry out of 20 Billion DBMS software industry of recurring revenues in 2010 Growing at 8.5 % annually Total industry is expected to be at 105 Billion by 2020 Source Gartner

Motivation Why not traditional DBMS system move to C-Store? C-Store: SAP HANA Sybase IQ Vertica

Key Questions: How much of the buzz around column- stores just marketing hype? Do you really need to buy products like SAP HANA or Vertica? How far will your current row-store take you? Is this assumption valid? Can we adapt our row-store to get column-store performance? Can you simulate a column-store in a row-store? If not, what makes column-store so special?

Background Source: Google Images

Background: Continued Most of the queries do not process all the attributes of a particular relation. For example the query Select c.name and c.address From CUSTOMES as c Where c.region = Vancouver ; Only processes three attributes of the relation CUSTOMER. But the customer relation can have more than three attributes. Column-stores are faster for OLAP operations and slower for OLTP operations with many row inserts

Paper Methodology Comparing row-store vs. column-store is dangerous/borderline meaningless Instead, compare row-store vs. row-store and column-store vs. column-store üsimulate a column-store inside of a row-store ü Remove column-oriented features from column-store until it behaves like a row-store

Row Oriented Execution Common assumption would be to based on the storage layout, is to modify the row store physical structure in such a way that it behaves more like column store : The three ways to do these are: üvertical Partitioning üusing index-only plans üusing materialized views

First: Vertical Partitioning Idea: Vertical partition every relation Mechanism to connect files from same row together : - This is done by addition integer position column to every table - Primary keys are bad idea as they are could be long or composite - Thus each column have one Physical table. Join are done based on position attributes when fetching multiple columns Problems: Position - Space and disk bandwidth Header for every tuple further space wastage

Vertical Partitioning : Example Google Images

Second: Index-Only Plans Approach - Create B+ tree like index for every column - Plans never access the actual tuples on disk - Tuple headers are not stored thus less overhead Example: Google Images

Index-Only Plans Problems - Separate indices may require full index scan, which is slower Example: SELECT AVG(salary) FROM emp WHERE age > 40 As there are separate indices on age and salary, an index only plan have to find ids corresponding to satisfying age and then merge with (id,salary) extracted from salary index - Slow tuple construction

Third: Materialized Views(MV) Methodology: Problems: - Create an optimal set of materialized views for every query workload. Idea behind Strategy: Only have columns needed for answer Avoid overhead perform better Advance knowledge of query workload thus practical in limited situations

Materialized Views: Example Google Images

Benchmarking Methodology Star Schema Benchmark (SSBM) Fact table contains 17 columns and 60,000,000 rows 4 dimension tables, biggest one has 80,000 rows Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table All benchmarks are done on using similar hardware configuration named SYSTEM X for this paper

Performance Comparison Row Store Average case performance across all workloads of the SSBM T Traditional T(B) - Traditional Bitmap MV Materialized Views VP Vertical portioning AI All indexes MV performs best Index only are the worst

Column Store simulation in Row Store: Analysis Vertical Partitioning: Tuple overheads: LineOrder Table 60 million tuples, 17 columns 8 bytes of over head per row 4 bytes of record-id Tuple Header TID xxxx 1 yyy xxxx 2 yyy xxx 3 yyy Column Data 1 column (Compressed) Whole table (Compressed) Row Store 0.7-1.1 GB 4 GB Column Store 240 MB 2.3 GB

Column Store simulation in Row Store: Analysis All indexes approach is a poor way to simulate a column-store Problems with partitioning are fundamental Store tuple header in a separate partition Allow virtual TIDs Combine clustered indexes, vertical partitioning What can possible be done to simulate column store in row store? Need better support for partitioning at the storage layer Need support for column-specific optimization at the executer level Full integration: buffer pool, transaction manager

Column-Oriented Execution Different optimization for column oriented database ü Compression ülate Materialization ü Block Iteration The idea is to minimize these advantages of column store to simulate row store in column store

Compression: Low information entropy (high data value locality) leads to high compression ratio Advantages Disk Space is saved Less I/O CPU cost decrease if we can perform operation without decompressing Idea: Use unsorted data

Late Materialization: Most queries require data from multiple columns that be combined together into rows to form information about an entity. Tuple Construction: Join like materialization of tuples Delay tuple construction Advantages - Unnecessary Construction of tuple is avoided - Direct operations on compressed data - Cache performance is improved Idea: Use Early Materialization in the query plan

Block Iteration Row Store first iterate through each tuple to extract needed attributes thus leading to tuple at a time processing (MySQL). C-Store blocks of values from same column are sent to operator in single function call Iterate over blocks rather than tuples, like Batch Processing Advantages: Minimizes per-tuple overhead Exploits potential for parallelism Idea : Can be applied even in Row stores IBM DB2 implements it

Experiment: Performance of C store with various optimizations removed T=tuple-at-a-time processing; t=block processing; I=invisible join enabled; i=disabled; C=compression enabled, c=disabled; L=late materialization enabled; l=disabled;

CONCLUSION: TO SIMULATE COLUMN STORE IN ROW STORE, TECHNIQUES LIKE VERTICAL PARTITIONING, INDEX ONLY PLAN DO NOT YIELD GOOD PERFORMANCE HIGH PER-TUPLE OVERHEADS, HIGH TUPLE CONSTRUCTION COSTS ARE THE REASON WHERE AS IN COLUMN STORE LATE MATERIALIZING, COMPRESSION, BLOCK ITERATION ARE THE REASONS FOR GOOD PERFORMANCE MIGHT BE POSSIBLE TO SIMULATE A ROW-STORE IN A COLUMN-STORE, NEED BETTER SUPPORT FOR VERTICAL PARTITIONING AT THE STORAGE LAYER NEED SUPPORT FOR COLUMN-SPECIFIC OPTIMIZATIONS AT THE EXECUTER LEVEL

DO WE SEE COLUMN STORE SIMULATION IN ROW STORE WORKING? WILL COLUMN STORE PROVIDE WRITE OPTIMIZATIONS COMPARABLE TO ROW STORES? USE ROW STORE AND COLUMN STORE INTERCHANGEABLY SAP HANA USES CONCEPT OF DELTA MERGE WHERE WRITES ARE DONE ON ROW STORE AND THEN MERGED WITH COLUMNS STORE