basic db architectures & layouts

Similar documents
column-stores basics

column-stores basics

data systems 101 prof. Stratos Idreos class 2

SQL & intro to db architectures

data systems 101 prof. Stratos Idreos class 2

class 5 column stores 2.0 prof. Stratos Idreos

Models & Intro to DB Architectures

HOW INDEX TO STORE DATA DATA

Models & Intro to DB Architectures

class 17 updates prof. Stratos Idreos

complex plans and hybrid layouts

class 17 updates prof. Stratos Idreos

class 9 fast scans 1.0 prof. Stratos Idreos

class 6 more about column-store plans and compression prof. Stratos Idreos

from bits to systems

class 20 updates 2.0 prof. Stratos Idreos

Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

class 10 b-trees 2.0 prof. Stratos Idreos

CS 405G: Introduction to Database Systems. Storage

class 13 scans vs indexes prof. Stratos Idreos

Database Technology Introduction. Heiko Paulheim

systems & research project

class 12 b-trees 2.0 prof. Stratos Idreos

Important Note. Today: Starting at the Bottom. DBMS Architecture. General HeapFile Operations. HeapFile In SimpleDB. CSE 444: Database Internals

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

CS143: Relational Model

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

CSC 261/461 Database Systems Lecture 19

CSE 544 Principles of Database Management Systems

Locality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example

CS425 Fall 2016 Boris Glavic Chapter 1: Introduction

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

class 11 b-trees prof. Stratos Idreos

Overview of Query Processing and Optimization

CSCI1270 Introduction to Database Systems

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

CSE 444 Homework 1 Relational Algebra, Heap Files, and Buffer Manager. Name: Question Points Score Total: 50

class 10 fast scans 2.0 prof. Stratos Idreos

Overview of Data Management

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies

COLUMN DATABASES A NDREW C ROTTY & ALEX G ALAKATOS

CPSC 421 Database Management Systems. Lecture 11: Storage and File Organization

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

Column Stores vs. Row Stores How Different Are They Really?

Column-Stores vs. Row-Stores: How Different Are They Really?

Question?! Processor comparison!

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CMPSCI 645 Database Design & Implementation

Database Systems CSE 414

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

CSE544 Database Architecture

CMPT 354: Database System I. Lecture 7. Basics of Query Optimization

Spring 2013 CS 122C & CS 222 Midterm Exam (and Comprehensive Exam, Part I) (Max. Points: 100)

Disks and Files. Storage Structures Introduction Chapter 8 (3 rd edition) Why Not Store Everything in Main Memory?

Introduction to Database Systems CSE 344

BBM371- Data Management. Lecture 1: Course policies, Introduction to DBMS

Principles of Data Management. Lecture #3 (Managing Files of Records)

Introduction to Data Management. Lecture 14 (Storage and Indexing)

+ Random-Access Memory (RAM)

CMPT 354: Database System I. Lecture 4. SQL Advanced

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

Weaving Relations for Cache Performance

Disks and Files. Jim Gray s Storage Latency Analogy: How Far Away is the Data? Components of a Disk. Disks

Caches. Hiding Memory Access Times

RAID in Practice, Overview of Indexing

Chapter 1: Introduction

DBMS. Relational Model. Module Title?

CS143: Disks and Files

Introduction to MySQL /MariaDB and SQL Basics. Read Chapter 3!

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:

Why are you here? Introduction. Course roadmap. Course goals. What do you want from a DBMS? What is a database system? Aren t databases just

System R cs262a, Lecture 2

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7

File Structures and Indexing

Announcements (September 21) SQL: Part III. Triggers. Active data. Trigger options. Trigger example

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

SQL: Part III. Announcements. Constraints. CPS 216 Advanced Database Systems

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Principles of Data Management. Lecture #2 (Storing Data: Disks and Files)

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks.

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Bridging the Processor/Memory Performance Gap in Database Applications

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

CS122A: Introduction to Data Management. Lecture #14: Indexing. Instructor: Chen Li

CSE 344 FEBRUARY 14 TH INDEXING

Chapter 1 Introduction

Storing Data: Disks and Files

Architecture-Conscious Database Systems

DATABASE MANAGEMENT SYSTEMS. UNIT I Introduction to Database Systems

Query Processing & Optimization

Storing Data: Disks and Files

Advanced Databases. Lecture 1- Query Processing. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Storing Data: Disks and Files. Administrivia (part 2 of 2) Review. Disks, Memory, and Files. Disks and Files. Lecture 3 (R&G Chapter 7)

Implementing a Logical-to-Physical Mapping

Physical Data Organization. Introduction to Databases CompSci 316 Fall 2018

Transcription:

class 4 basic db architectures & layouts prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

videos for sections 3 & 4 are online check back every week (1-2 sections weekly) there is a schedule but more will be added as we go Stratos Idreos 2 /40

applications from last time sql parser in/out optimizer thread pool cpu execution transactions memory storage buffer pool disk database kernel Stratos Idreos 3 /40

algorithms/operators database kernel query plan data data data Stratos Idreos 4 /40

result select name from student where GPA>3.0 project name select GPA>3.0 student(id,name,gpa,address,class, ) Stratos Idreos 5 /40

result select name from student where GPA>3.0 project name scan all the data? select GPA>3.0 student(id,name,gpa,address,class, ) Stratos Idreos 5 /40

result select name from student where GPA>3.0 project name scan all the data? logical plan select GPA>3.0 student(id,name,gpa,address,class, ) Stratos Idreos 5 /40

result result select name from student where GPA>3.0 project name project name physical plans scan GPA>3.0 index scan GPA>3.0 students students Stratos Idreos 6 /40

result avg GPA select avg(gpa) from student where class=2017 project GPA select year=2017 student(id,name,gpa,address,class, ) Stratos Idreos 7 /40

result avg GPA data flow select avg(gpa) from student where class=2017 project GPA interfaces algorithms/ operators select year=2017 data layout student(id,name,gpa,address,class, ) Stratos Idreos 7 /40

professor (id,name, ) enrolled (studentid, courseid, ) course (id,name, profid, ) student (id,name, ) give me all students enrolled in cs165 select student.name from student, enrolled, course where course.name= cs165 and enrolled.courseid=course.id and student.id=enrolled.studentid Stratos Idreos 8 /40

project student.name join student.id=enrolled.studentid select student.name from student, enrolled, course where course.name= cs165 and enrolled.courseid=course.id and student.id=enrolled.studentid select course.name= cs165 join enrolled.courseid=course.id student enrolled course good plan Stratos Idreos 9 /40

project student.name join student.id=enrolled.studentid select student.name from student, enrolled, course where course.name= cs165 and enrolled.courseid=course.id and student.id=enrolled.studentid join enrolled.courseid=course.id pushing selects down select course.name= cs165 student enrolled course Stratos Idreos 10/40

select min(a) from R where B<10 and C<80 logical plan internal language optimizer rules/cost model/statistics internal language physical plan execution Stratos Idreos 11/40

select min(a) from R where B<10 and C<80 logical plan internal language optimizer rules/cost model/statistics internal language physical plan execution Stratos Idreos 11/40

select min(a) from R where B<10 and C<80 logical plan internal language optimizer rules/cost model/statistics project interface/ level of abstraction internal language physical plan execution Stratos Idreos 11/40

can DBAs make wrong decisions? tuning can optimizers make wrong decisions? optimizer execution storage db kernel Stratos Idreos 12/40

memory hierarchy data layouts column-stores basics Stratos Idreos 13/40

applications sql cpu - cpu - cpu - cpu cpu registers smaller/faster caches memory disk - disk - disk - disk + flash + non volatile memory memory hierarchy system where db runs Stratos Idreos 14/40

registers 2x on chip cache 10x on board cache 100x memory my head ~0 this room 1 min this building 10 min New York 1.5 hours Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations Award 100Kx disk Pluto 2 years Stratos Idreos 15/40

cheaper faster CPU registers on chip cache on board cache memory disk SRAM DRAM cache miss: looking for something which is not in the cache ~1ns ~10ns ~100ns speed memory wall memory miss: looking for something which is not in memory cpu mem time Stratos Idreos 16/40

touch/access only what you need design of storage/access methods/algorithms should minimize: data misses instruction misses Stratos Idreos 17/40

random access & page-based access CPU need to only read x but have to read all of page 1 registers data value x on chip cache page1 page2 page3 data move on board cache memory disk Stratos Idreos 18/40

query x<5 (size=120 bytes) memory level N memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

query x<5 scan (size=120 bytes) memory level N 5 10 6 4 12 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

query x<5 (size=120 bytes) memory level N memory level N-1 scan 5 10 6 4 12 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 scan 5 10 6 4 12 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 scan 5 10 6 4 12 scan 2 8 9 7 6 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 scan 5 10 6 4 12 scan 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

80 bytes query x<5 (size=120 bytes) memory level N memory level N-1 scan 5 10 6 4 12 scan 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

80 bytes query x<5 (size=120 bytes) memory level N memory level N-1 scan 7 11 3 9 6 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

80 bytes query x<5 scan (size=120 bytes) memory level N 7 11 3 9 6 2 8 9 7 6 4 2 3 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

120 bytes query x<5 scan (size=120 bytes) memory level N 7 11 3 9 6 2 8 9 7 6 4 2 3 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 19/40

an oracle gives us the positions query x<5 (size=120 bytes) memory level N memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions query x<5 oracle (size=120 bytes) memory level N 5 10 6 4 12 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions query x<5 oracle (size=120 bytes) memory level N memory level N-1 5 10 6 4 12 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 oracle 5 10 6 4 12 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 oracle 5 10 6 4 12 oracle 2 8 9 7 6 4 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 40 bytes query x<5 (size=120 bytes) memory level N memory level N-1 oracle 5 10 6 4 12 oracle 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 80 bytes query x<5 (size=120 bytes) memory level N memory level N-1 oracle 5 10 6 4 12 oracle 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 80 bytes query x<5 (size=120 bytes) memory level N memory level N-1 oracle 7 11 3 9 6 2 8 9 7 6 4 2 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 80 bytes query x<5 oracle (size=120 bytes) memory level N 7 11 3 9 6 2 8 9 7 6 4 2 3 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

an oracle gives us the positions 120 bytes query x<5 oracle (size=120 bytes) memory level N 7 11 3 9 6 2 8 9 7 6 4 2 3 memory level N-1 5 10 6 4 12 2 8 9 7 6 7 11 3 9 6 page size: 5x8 bytes Stratos Idreos 20/40

scan=120bytes vs oracle=120bytes (and there is no such thing as an Oracle so Oracle is not for free ) when does it make sense to have an oracle Stratos Idreos 21/40

in parallel/prefetching sequential access: read one block; consume it completely; discard it; read next what is next? 1 2 3 4 hardware/software can better predict/buffer sequential pages to be read Stratos Idreos 22/40

random access: read one block; consume it partially; discard it; might have to read it again in future; read random next; Stratos Idreos 23/40

level N level N-1 buffer pool remember hot blocks why not use OS caching Stratos Idreos 24/40

dbms block size os block size device block size os and db will typically refer to pages Stratos Idreos 25/40

employee (id:int, name:varchar(50), office:char(5), telephone:char(10), city:varchar(30), salary:int) data storage blocks < pages < files (1, name1, office1, tel1, city1, salary1) (2, name2, office2, tel2, city2, salary2) (3, name3, office3, tel3, city3, salary3) (4, name4, office4, tel4, city4, salary4) (5, name5, office5, tel5, city5, salary5) (6, name6, office6, tel6, city6, salary6) (7, name7, office7, tel7, city7, salary7) (8, name8, office8, tel8, city8, salary8) (9, name9, office9, tel9, city9, salary9) file remember: the way we store data defines the best possible way we can access it Stratos Idreos 26/40

slotted pages employee (id:int, name:varchar(50), office:char(5), telephone:char(10), city:varchar(30), salary:int) page (1, name1, office1, tel1, city1, salary1) (2, name2, office2, tel2, city2, salary2) (3, name3, office3, tel3, city3, salary3) (4, name4, office4, tel4, city4, salary4) (5, name5, office5, tel5, city5, salary5) (6, name6, office6, tel6, city6, salary6) (7, name7, office7, tel7, city7, salary7) (8, name8, office8, tel8, city8, salary8) (9, name9, office9, tel9, city9, salary9) header row2 row1 row3 Stratos Idreos 27/40

slotted page free_offset, N, offset1-length1, offset2-lenght2, free space scan null update var length Stratos Idreos 28/40

some things to worry about how much data we transfer through the memory hierarchy how many computations we do cpu level N-1 level N Stratos Idreos 29/40

row-store select A,B,C,D select A A BC D one page contains all fields of multiple attributes stored continuously file Stratos Idreos 30/40

select A,B,C,D select A row-store column-store A BC D A B C D one page contains fields of a single attribute stored continuously Stratos Idreos 31/40

~1960s 1970: column storage ideas start appearing monetdb ~2000: open source complete system rows rows rows rows rows rows rows history/timeline 1985: first rather complete column-store model 2005-now: more ideas and industry adoption of columnstore designs c-store,vertica,vectorwise and then ibm,microsoft,oracle, and more Stratos Idreos 32/40

column-store with materialized IDs R(A,B,C) ID A ID B ID C header row2 row1 row3 good idea Stratos Idreos 33/40

virtual ids/ positional alignment A B C columns do not need to have the same width tuple 1 tuple 2 tuple 3 tuple 4 tuple 5 tuple 6 a1 a2 a3 a4 a5 a6 b1 b2 b3 b4 b5 b6 c1 c2 c3 c4 c5 c6 fixed-width + dense positional lookups/joins A(i) = A + i * width(a) Stratos Idreos 34/40

ok so now we can selectively read columns but how do we process them? disk memory A A B C D option1 columnstore engine option2 A BC early tuple reconstruction/materialization row-store engine Stratos Idreos 35/40

cheaper faster CPU registers on chip cache on board cache memory disk SRAM DRAM ~1ns ~10ns ~100ns memory wall it is not just memory and disk we want to move as few data items as possible all the way up to the CPU Stratos Idreos 36/40

select min(c) from R where A<10 & B<20 disk memory A B C D write the query plan and the code/logic of each operator do not forget about intermediate results describe data layouts at each step (milestone1 of project) Stratos Idreos 37/40

select min(c) from R where A<10 & B<20 disk memory A B C D write the query plan and the code/logic of each operator do not forget about intermediate results describe data layouts at each step (milestone1 of project) no precise final answer is OK understanding what matters is key concepts & designs will be repeated >>1 Stratos Idreos 37/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk A B C D A<10 memory Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk A B C D A<10 memory 1: int *input=a 2: for (i=0;i<tuples;i++,input++) 3: if *input<10 4: *output=i 5: output++ A<10 Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs B Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs B B<20 Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs B B<20 IDs C Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs B B<20 IDs C minc Stratos Idreos 38/40

late reconstruction/materialization select min(c) from R where A<10 & B<20 disk memory A B C D A<10 IDs B B<20 IDs C minc always sequential access patterns memory contains only what is needed at any point in time Stratos Idreos 38/40

Notes to remember column-stores vs row-stores it all starts with how we store the data still basic concepts are the same moving data is a major cost component it is not just about disk the whole memory hierarchy matters Stratos Idreos 39/40

please keep up with reading! Architecture of a Database System (Sections 1,2,3,4) by J. Hellerstein, M. Stonebraker and J. Hamilton The Design and Implementation of Modern Column-store Database Systems by D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden Stratos Idreos 40/40

class 4 basic db architectures & layouts DATA SYSTEMS prof. Stratos Idreos