C-Store: A column-oriented DBMS


C-Store: A Column-oriented DBMS
Presented by: Manoj Karthick Selva Kumar
MIT CSAIL, Brandeis University, UMass Boston, Brown University
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Authors
- MIT CSAIL: Mike Stonebraker, Daniel Abadi, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden
- UMass Boston: Xuedong Chen, Elizabeth O'Neil, Pat O'Neil
- Brown University: Alex Rasin, Stan Zdonik
- Brandeis University: Adam Batkin, Mitch Cherniack, Nga Tran

Agenda
- Why do we need column stores?
- What is C-Store?
- C-Store architecture
- Important terms and definitions
- Data model
- Components of C-Store
- Storage management
- Query execution and optimization
- Performance benchmarking

Why do we need Column Stores?
Current row-store architectures:
- Store all attributes of a record (tuple) contiguously
- A record is inserted in a single write, giving high-performance writes
- Use cases: OLTP-style applications
Many workloads instead need ad-hoc computation of aggregates, where pre-computation is not feasible or possible. These call for read-optimized systems, in which:
- Each column is stored contiguously
- Queries read only the required columns
- Storage is densepacked (compressed)
- CPU cycles are traded for disk bandwidth
- Use cases: CRM systems, library catalogs, ad-hoc inquiry systems

What is C-Store?
- A column-oriented data store
- Stores collections of columns, each sorted on some attribute(s)
- High retrieval performance
- High availability
- Runs on off-the-shelf hardware

Architecture of C-Store
[Diagram: updates flow into the Writeable Store (WS); the Tuple Mover migrates tuples from WS into the Read-optimized Store (RS); reads are served from RS.]

Architecture Reasoning
- Hybrid architecture containing WS and RS
- Why a hybrid architecture? Historically, systems had to compromise between high-performance online updates and read-only optimizations
  - Example: warehouses need on-line updates; CRMs need real-time data visibility
- C-Store approach:
  - A Writeable Store (WS) for high-performance updates, inserts, and deletes
  - A Read-optimized Store (RS) for high-performance read-only queries
  - A Tuple Mover that migrates records from WS to RS
- Provides the best of both worlds, with the focus on read performance rather than write performance
- Result: supports large ad-hoc queries (RS), smaller update transactions (WS), and continuous inserts (Tuple Mover)

Innovative Features of C-Store
- Hybrid architecture with read- and write-optimized components
- Redundancy through storage of overlapping projections
- Heavily compressed columns
- Column-oriented optimizer and executor
- High availability and fault tolerance
- Snapshot isolation

Important Terms
Word boundaries, densepack, projections, logical tables, anchoring, sort keys, storage keys, key ranges, join indexes, K-safety, paths, space budget, horizontal partitioning, segment identifiers.

Word Boundaries, Densepack
[Diagram: a value packed into a machine word, showing its start, end, and the unused "slack" bits; densepacked storage avoids leaving slack at word boundaries.]

C-Store Data Model
- Supports the standard relational logical data model: named collections of attributes (columns) with primary keys, secondary keys, and foreign keys
- Physically, C-Store implements only projections
  - A projection is anchored on a logical table T
  - The base table T itself is not stored
- How a projection is formed:
  - Project the desired attributes from the anchor table
  - Retain duplicate values (no duplicate elimination)
  - Optionally join in attributes from other tables
  - Sort on a designated sort key

Projections, Logical Tables, Anchoring, Sort Keys
[Diagram: a logical (anchor) table with columns Col1..Col5, and two projections anchored on it, each containing a subset of the columns and sorted on its own sort key.]

C-Store Data Model: Example

Sample employee table (EMP):

  Name  Age  Dept     Salary
  Bob   25   Math     10K
  Bill  27   EECS     50K
  Jill  24   Biology  80K

Possible projections:
  EMP1 (name, age)
  EMP2 (dept, age, DEPT.floor)
  EMP3 (name, salary)
  DEPT1 (dname, floor)

The same projections with sort keys (attributes after "|" form the sort key):
  EMP1 (name, age | age)
  EMP2 (dept, age, DEPT.floor | DEPT.floor)
  EMP3 (name, salary | salary)
  DEPT1 (dname, floor | floor)
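
To make the projection notation concrete, here is a minimal Python sketch (an illustration, not C-Store code; the table layout and the helper build_projection are assumptions) that builds EMP1 and EMP3 as collections of columns sorted on their sort keys:

```python
# A minimal sketch of how a projection stores overlapping columns of the
# anchor table, sorted on the projection's sort key.

EMP = [  # logical anchor table: (name, age, dept, salary)
    ("Bob", 25, "Math", "10K"),
    ("Bill", 27, "EECS", "50K"),
    ("Jill", 24, "Biology", "80K"),
]

def build_projection(rows, attrs, sort_key):
    """Project the named attributes, keep duplicates, sort on sort_key,
    and return one contiguous list per column."""
    idx = {"name": 0, "age": 1, "dept": 2, "salary": 3}
    sorted_rows = sorted(rows, key=lambda r: tuple(r[idx[a]] for a in sort_key))
    return {a: [r[idx[a]] for r in sorted_rows] for a in attrs}

EMP1 = build_projection(EMP, ["name", "age"], ["age"])        # EMP1(name, age | age)
EMP3 = build_projection(EMP, ["name", "salary"], ["salary"])  # EMP3(name, salary | salary)

print(EMP1)  # {'name': ['Jill', 'Bob', 'Bill'], 'age': [24, 25, 27]}
print(EMP3)  # {'name': ['Bob', 'Bill', 'Jill'], 'salary': ['10K', '50K', '80K']}
```

Note how the same column (name) is stored in two different orders, one per projection.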

C-Store Data Model (continued)
Every projection is:
- Horizontally partitioned into segments; only value-based partitioning on the sort key is supported
- Each segment has an associated key range
The stored projections must form a covering set of the logical tables, and C-Store supports grid-based (multi-node) systems.
Reconstruction of complete rows of a table is still needed; it is done through:
- Storage keys
- Join indices

Horizontal Partitioning, Segment Identifiers, Key Ranges
[Diagram: a projection covering sort-key range 1..100 is partitioned into three segments with key ranges 1..25 (SID-1), 26..65 (SID-2), and 66..100 (SID-3); each partition is labeled by its segment identifier (SID).]
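
To illustrate value-based partitioning, here is a minimal Python sketch (assumed layout, not C-Store internals; SEGMENTS and segment_for are hypothetical names) that maps a sort-key value to the segment whose key range contains it:

```python
# Each segment owns a contiguous sort-key range, as in the diagram above.
import bisect

# (inclusive_low, inclusive_high, segment_id)
SEGMENTS = [(1, 25, "SID-1"), (26, 65, "SID-2"), (66, 100, "SID-3")]
_lows = [lo for lo, _, _ in SEGMENTS]

def segment_for(sort_key_value):
    """Return the segment identifier whose key range contains the value."""
    i = bisect.bisect_right(_lows, sort_key_value) - 1
    if i < 0 or sort_key_value > SEGMENTS[i][1]:
        raise KeyError(f"no segment covers sort key {sort_key_value}")
    return SEGMENTS[i][2]

print(segment_for(30))   # SID-2
print(segment_for(100))  # SID-3
```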

Storage Keys
- Every data value in every column of a segment has an associated storage key (SK)
- The storage key logically identifies the row the value belongs to
- Storage keys are inferred from position in RS, but stored explicitly in WS
- Values in different columns of a segment with the same storage key belong to the same row
[Diagram: the columns of a segment, with the storage key identifying matching positions across columns.]

Join Indices
- To reconstruct a row/tuple, segments from different projections are joined using join indices and storage keys
- Definition: if T1 and T2 are two projections that cover a table T, a join index from the M segments of T1 to the N segments of T2 is logically a collection of M tables, one per segment S of T1, consisting of rows of the form (s: SID in T2, k: storage key in segment s)
- Join indices are always 1:1
[Diagram: segments of EMP1 (name, age | age) and EMP3 (name, salary | salary), with a join index of (SID, storage key) pairs mapping each row of one projection to the position of the same logical row in the other.]
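
The following minimal single-segment sketch (illustrative, not C-Store's implementation; the matching-by-name construction is purely for demonstration) shows how a join index of (SID, storage key) pairs lets rows be reconstructed from two projections:

```python
# EMP1(name, age | age) and EMP3(name, salary | salary), stored column-wise;
# positions (1-based) act as the inferred storage keys.
EMP1 = {"name": ["Jill", "Bob", "Bill"], "age": [24, 25, 27]}
EMP3 = {"name": ["Bob", "Bill", "Jill"], "salary": ["10K", "50K", "80K"]}

# Join index from EMP1 to EMP3: for each EMP1 row, the (SID, storage key)
# of the same logical row in EMP3.  Built here by matching names.
join_index = [(1, EMP3["name"].index(n) + 1) for n in EMP1["name"]]
# -> [(1, 3), (1, 1), (1, 2)]

def reconstruct_rows():
    """Stitch (name, age, salary) rows back together via the join index."""
    for sk1, (sid, sk3) in enumerate(join_index, start=1):
        yield (EMP1["name"][sk1 - 1], EMP1["age"][sk1 - 1],
               EMP3["salary"][sk3 - 1])

print(list(reconstruct_rows()))
# [('Jill', 24, '80K'), ('Bob', 25, '10K'), ('Bill', 27, '50K')]
```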

Join Indices (continued)
- Join indices are expensive to store and maintain
- In practice, C-Store expects to store multiple overlapping projections, so only a few join indices are required
- Every modification of a projection requires updating its join indices
- To reconstruct a logical table T from a set of projections, paths must be defined
- Path: a collection of join indices that originates with a projection in some sort order, passes through zero or more intermediate join indices, and ends with a projection stored in some sort order O

K-Safety
- Segments of projections and the corresponding join indices are allocated to various nodes (sites)
- The administrator can specify that tables be K-safe
- K-safe means the database can tolerate up to K site failures: despite the failed sites, a covering set of projections must still exist
- In other words, the number of tolerable failed sites is K

How are projections, segments, sort keys, and the covering set determined?
- The chosen collection must support K-safety
- A training workload is provided by the C-Store administrator
- The design must respect space budget constraints
- An automatic schema design tool was under development as of 2005

RS (Read-optimized Store)
- A read-optimized column store
- Segments are broken into columns; each column is stored in the order of the projection's sort key
- A column may be ordered by its own values (self order) or according to a sort key defined on another column (foreign order)
- Encoding schemes:
  - Type 1: self order, few distinct values
  - Type 2: foreign order, few distinct values
  - Type 3: self order, many distinct values
  - Type 4: foreign order, many distinct values
- Each of these is explored in the upcoming slides

Type 1: Self Order, Few Distinct Values
- A column is represented as a sequence of triples (v, f, n), where:
  - v = the value stored in the column
  - f = the position in the column where v first appears
  - n = the number of occurrences of v in the column
- One triple per distinct value in the column
- B-tree indices over the value fields are used for search indexing
- Example: in column A, a run of 4s in positions 12-18 is encoded as (4, 12, 7)
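
A minimal Python sketch of this run-length style encoding (assumed behavior, not C-Store code; type1_encode and type1_decode are hypothetical names), using 1-based positions as in the slide's example:

```python
def type1_encode(column):
    """Encode a self-ordered column as (value, first_position, count) triples."""
    triples = []
    pos = 1
    for v in column:
        if triples and triples[-1][0] == v:
            value, first, count = triples[-1]
            triples[-1] = (value, first, count + 1)
        else:
            triples.append((v, pos, 1))
        pos += 1
    return triples

def type1_decode(triples):
    """Expand the triples back into the original column."""
    out = []
    for value, _first, count in triples:
        out.extend([value] * count)
    return out

col = [1, 1, 1, 4, 4, 4, 4, 7, 7]
enc = type1_encode(col)
print(enc)                         # [(1, 1, 3), (4, 4, 4), (7, 8, 2)]
assert type1_decode(enc) == col
```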

Type 2: Foreign Order, Few Distinct Values
- A column is represented as a sequence of pairs (v, b), where:
  - v = a value stored in the column
  - b = a bitmap indicating the positions in which v is stored
- For search, B-trees map positions to values
[Diagram: a column C and its bitmap index, with one bitmap per distinct value marking the positions where that value occurs.]
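
A minimal Python sketch of the bitmap encoding (illustrative only; the example column and the helper names are assumptions):

```python
def type2_encode(column):
    """One (value, bitmap) pair per distinct value of the column."""
    bitmaps = {}
    for pos, v in enumerate(column):
        bits = bitmaps.setdefault(v, [0] * len(column))
        bits[pos] = 1
    return [(v, bits) for v, bits in bitmaps.items()]

def type2_decode(pairs, length):
    """Rebuild the column by scattering each value to its marked positions."""
    out = [None] * length
    for v, bits in pairs:
        for pos, bit in enumerate(bits):
            if bit:
                out[pos] = v
    return out

col = [0, 0, 1, 1, 2, 1, 0, 0, 1]
pairs = type2_encode(col)
for v, bits in pairs:
    print(v, "".join(map(str, bits)))
# 0 110000110
# 1 001101001
# 2 000010000
assert type2_decode(pairs, len(col)) == col
```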

Type 3 and Type 4
- Type 3: self order, many distinct values
  - Each value in the column is represented as a delta from the previous value
- Type 4: foreign order, many distinct values
  - No efficient encoding was available as of 2005
  - Decompression happens in main memory

Type 3 encoding example:
  Column C:      1  4  7  7  8  12
  Delta-encoded: 1  3  3  0  1  4
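
A minimal Python sketch of the Type 3 delta encoding (illustrative only; function names are assumptions), reproducing the table above:

```python
def type3_encode(column):
    """Store each value as a delta from the previous value (first delta from 0)."""
    deltas, prev = [], 0
    for v in column:
        deltas.append(v - prev)
        prev = v
    return deltas

def type3_decode(deltas):
    """Rebuild the column by accumulating the deltas."""
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out

col = [1, 4, 7, 7, 8, 12]
enc = type3_encode(col)
print(enc)                       # [1, 3, 3, 0, 1, 4]
assert type3_decode(enc) == col
```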

WS (Writeable Store)
- Physical design is the same as RS: projections and join indices are represented as in RS
- WS is horizontally partitioned in the same way as RS, giving a 1:1 mapping between RS and WS segments
- Storage, however, is organized differently, for efficient updates to the database:
  - Storage keys (SKs) are stored explicitly rather than inferred
  - A unique SK is given to each logical insert
  - Columns are stored uncompressed
  - Each column is represented as a collection of (v, sk) pairs, where v is the value in the column and sk is the corresponding storage key
  - These pairs are stored in a conventional B-tree keyed on sk

WS (continued)
- The sort key of each projection in WS is represented as a collection of (s, sk) pairs:
  - s: a particular sort key value
  - sk: the storage key corresponding to where the sort key value first appears
  - These pairs are stored in a conventional B-tree keyed on the sort key field
- Searching in WS:
  - Step 1: use the B-tree over the (s, sk) pairs, with the query's search criterion as the condition, to obtain the matching storage keys
  - Step 2: use the B-tree over the (v, sk) pairs, with the storage keys found in step 1, to obtain the data values
[Diagram: the search criterion is mapped to sk values via the (s, sk) B-tree, and the sk values are mapped to data values via the (v, sk) B-tree.]
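
The two-step lookup can be sketched as follows in Python (B-trees approximated by a sorted list and a dict; the sample data, ws_lookup, and the structure names are illustrative assumptions):

```python
import bisect

# Step-1 structure: (sort-key value s, storage key sk of its first appearance),
# kept sorted on s.
sortkey_index = [("Bill", 102), ("Bob", 101), ("Jill", 103)]
# Step-2 structure: storage key sk -> column value v, one per WS column.
age_column = {101: 25, 102: 27, 103: 24}

def ws_lookup(sort_key_value):
    """Step 1: find the storage key for a sort-key value; step 2: fetch the value."""
    keys = [s for s, _ in sortkey_index]
    i = bisect.bisect_left(keys, sort_key_value)
    if i == len(keys) or keys[i] != sort_key_value:
        return None
    sk = sortkey_index[i][1]      # storage key from the (s, sk) "B-tree"
    return age_column[sk]         # value from the (v, sk) "B-tree"

print(ws_lookup("Jill"))  # 24
```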

Storage Management
- Segments must be allocated to nodes in the grid system; this is the job of a storage allocator
- Design constraints:
  - All columns of a single segment should be co-located
  - Join indices are co-located with their "sender" segments
  - WS segments are co-located with the RS segments covering the same key range
[Diagram: the "co-location trifecta" of segment columns, join indices, and matching WS/RS segments.]

Storage Management (continued)
- The storage allocator must be designed with the above constraints in mind
- It needs to:
  - Perform the initial allocation
  - Re-allocate when data becomes unbalanced
- The allocator was still being worked on as of 2005; its details are beyond the scope of the paper

Updates and Transactions
- In C-Store, an update is represented as an insert plus a delete
- An inserted record is added to every projection that contains its columns
- Inserts that correspond to the same logical record receive the same storage key
- Synchronization of storage keys: to avoid coordination among all the grid sites, every site maintains a locally unique counter and appends its site ID, making the storage key globally unique
- Storage keys in WS are kept consistent with RS by setting the initial WS counter to (largest key in RS) + 1
- WS is built on top of BerkeleyDB and leverages the data structures provided by that database
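
A minimal sketch of the per-site storage key generation described above (assumed scheme, not C-Store code; StorageKeyGenerator and the (counter, site_id) representation are illustrative):

```python
import itertools

class StorageKeyGenerator:
    def __init__(self, site_id, largest_rs_key=0):
        # Start the WS counter just above the largest key already in RS.
        self.site_id = site_id
        self.counter = itertools.count(largest_rs_key + 1)

    def next_key(self):
        """Return a globally unique (counter, site_id) storage key."""
        return (next(self.counter), self.site_id)

gen = StorageKeyGenerator(site_id=3, largest_rs_key=1000)
print(gen.next_key())  # (1001, 3)
print(gen.next_key())  # (1002, 3)
```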

Snapshot Isolation
- Read-only transactions are isolated by letting them read the database as of some point in the recent past
- Two timestamps must be maintained:
  - High water mark (HWM): the time before which we can guarantee there are no uncommitted transactions
  - Low water mark (LWM): the earliest effective time at which a read-only transaction can run
- To provide snapshot isolation, there are no in-place updates
- A record is visible if it was inserted before the LWM and deleted (if at all) only after the HWM
- To track effective time without fine-grained timekeeping at every site, C-Store uses coarse-grained epochs

IV and DRV
- IV is the insertion vector; DRV is the deleted record vector
- For every record of every projection, an IV entry and a DRV entry are stored
- The IV entry records the epoch in which the record was inserted; the DRV entry is 0 if the record has not been deleted, otherwise the epoch in which it was deleted
- Using the IV and DRV, we can infer whether a record is part of a given snapshot
- Since the DRV is sparse (mostly zeros), it can be compacted using the Type 2 encoding discussed earlier
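
A minimal sketch of the visibility check (assumed semantics; the epoch values, visible_in_snapshot, and the boundary handling are illustrative choices, not the paper's exact rules):

```python
def visible_in_snapshot(insert_epoch, delete_epoch, snapshot_epoch):
    """A record is visible if it was inserted at or before the snapshot epoch
    and either never deleted (DRV entry 0) or deleted only afterwards."""
    if insert_epoch > snapshot_epoch:
        return False
    return delete_epoch == 0 or delete_epoch > snapshot_epoch

IV  = [3, 5, 4]      # insertion epoch of records 1..3
DRV = [0, 6, 0]      # 0 = not deleted; otherwise deletion epoch

snapshot_epoch = 6   # assume LWM <= 6 <= HWM
print([visible_in_snapshot(IV[i], DRV[i], snapshot_epoch) for i in range(3)])
# [True, False, True]
```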

Concurrency Control
- C-Store uses strict two-phase locking: each site sets locks on the data objects it reads or writes, forming a distributed lock table
- Distributed commit processing uses a variation of two-phase commit:
  - Each transaction has a master that assigns work to the appropriate sites and determines the commit status
  - No PREPARE messages are sent
  - When the master receives the COMMIT request for a transaction, it waits until all the worker sites have completed their outstanding work, then issues a COMMIT or ABORT message to each site
  - The locks associated with the transaction can then be released
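
The following minimal Python sketch illustrates the master-driven commit flow without a PREPARE phase (an illustration only, not the paper's protocol; WorkerSite, master_commit, and the single-threaded structure are assumptions):

```python
class WorkerSite:
    def __init__(self, name):
        self.name = name
        self.outstanding = []

    def run(self, action):
        self.outstanding.append(action)

    def finish_outstanding(self):
        # Apply the queued actions; a failure would be reported to the master.
        ok = all(action() for action in self.outstanding)
        self.outstanding.clear()
        return ok

def master_commit(workers):
    """Wait for all workers to finish, then broadcast COMMIT or ABORT."""
    results = [w.finish_outstanding() for w in workers]
    decision = "COMMIT" if all(results) else "ABORT"
    for w in workers:
        print(f"{decision} sent to {w.name}")   # locks are released afterwards
    return decision

sites = [WorkerSite("site-1"), WorkerSite("site-2")]
sites[0].run(lambda: True)
sites[1].run(lambda: True)
print(master_commit(sites))  # COMMIT
```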

Failure and Recovery
Three types of site failure are considered:
- Failure with no data loss: roll forward from the previous state using data from other sites
- Both RS and WS destroyed: reconstruct from the other projections and join indices available at remote sites; functionality is needed to retrieve the DRV and IV so the restored state matches
- WS damaged, RS intact: not covered here (RS is usually expected to escape damage, since it is written to only by the tuple mover)

Tuple Mover
- Moves blocks of tuples from WS to RS (the merge-out process, MOP)
- Uses an LSM-tree-style merge to efficiently support high-speed, ordered tuple movement
- Also responsible for updating the join indices
- Records whose insertion time is at or before the LWM fall into two groups:
  - Those deleted at or before the LWM: discarded by the MOP
  - Those not deleted, or deleted after the LWM: moved to RS
- The MOP creates a new RS segment, RS'. How?
  - Reads blocks from the columns of RS
  - Deletes any RS items whose DRV entry is set
  - Merges the remaining data with the data from WS, producing RS'
- Because of the merge, the storage keys in RS' differ from those in RS, so join index maintenance is required
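
A minimal Python sketch of the merge-out step (illustrative only; merge_out, the integer records, and the DRV convention are assumptions):

```python
import heapq

def merge_out(rs_records, rs_drv, ws_records, sort_key=lambda r: r):
    """Build a new segment RS' by dropping RS records whose DRV entry is set
    and merging in the qualifying WS records, preserving sort order."""
    surviving_rs = [r for r, d in zip(rs_records, rs_drv) if d == 0]
    new_segment = list(heapq.merge(surviving_rs,
                                   sorted(ws_records, key=sort_key),
                                   key=sort_key))
    # The new positions act as the new (inferred) storage keys in RS',
    # which is why the join indices must be updated afterwards.
    return new_segment

rs  = [10, 20, 30, 40]
drv = [0,  5,  0,  0]          # record 20 was deleted (epoch 5)
ws  = [25, 15]
print(merge_out(rs, drv, ws))  # [10, 15, 25, 30, 40]
```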

Query Operators
- Decompress
- Select: returns a bitstring representation of the qualifying positions
- Mask: emits the values at positions where the bitstring is 1
- Project
- Sort
- Aggregation operators
- Concat
- Permute
- Join
- Bitstring operators: boolean AND / OR
Note: many queries are optimized so that they can operate on the compressed representation rather than needing to decompress.
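
A minimal Python sketch of the Select / Mask / bitstring-AND interplay (illustrative only; the operator functions and sample columns are assumptions, and compression is ignored here):

```python
def select(column, predicate):
    """Return a bitstring: 1 where the predicate holds, 0 elsewhere."""
    return [1 if predicate(v) else 0 for v in column]

def mask(column, bitstring):
    """Emit the column values at positions where the bitstring is 1."""
    return [v for v, bit in zip(column, bitstring) if bit]

def bit_and(b1, b2):
    """Boolean AND of two bitstrings."""
    return [x & y for x, y in zip(b1, b2)]

age    = [24, 25, 27]
salary = ["80K", "10K", "50K"]          # same projection, same row order

older  = select(age, lambda a: a >= 25)         # [0, 1, 1]
cheap  = select(salary, lambda s: s != "80K")   # [0, 1, 1]
print(mask(age, bit_and(older, cheap)))         # [25, 27]
```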

Performance Comparison
- Performance benchmarked on TPC-H data
- Read-only queries, run only on RS

Storage footprint:
  System        Space used
  C-Store       1.987 GB
  Row store     4.480 GB
  Column store  2.650 GB

Performance Comparison (continued)
[Charts: query times using non-materialized views and using materialized views.]
