Modern Database Systems Lecture 1

Aristides Gionis, Michael Mathioudakis
T.A.: Orestis Kostakis
Spring 2016

logistics
the assignment will be up by Monday (you will receive an email); it is due Feb 12th
if you're not registered... I will post material (slides and assignments) also at http://michalis.co/moderndb/

in this lecture...
review of past material
  relational model and SQL
  storage and indexing
  access cost analysis
  hash index
  b+ tree

relational model and SQL

relational model and SQL
what is the relational model?
  a tabular representation of data
why do we study it?
  supports simple and intuitive querying
  good for educational purposes
  most widely used

definitions
relational database: a set of relations
relation instance: a table with rows and columns
schema: name of the relation + name and type of each field (fields as columns)
example!

example
relation: students

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

cardinality (number of rows) = 3, degree (number of fields/columns) = 5
> can we have the same value twice in the same column?
schema: students(sid: integer, name: string, username: string, age: integer, gpa: real)

querying
a major strength of the relational model: simple, intuitive, precise querying of data
the DBMS is responsible for efficient evaluation
Structured Query Language (SQL)
  the standard language for relational queries
  developed by IBM in the 1970s
  first standardized in 1986
  latest standard in 2011
example!

example SQL query

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

to find student records of age 23
    SELECT * FROM students WHERE age=23

sid    name         username  age  gpa
53650  Jon Edwards  jon       23   2.4

to find just names and usernames
    SELECT name, username FROM students WHERE age=23

name         username
Jon Edwards  jon

creating, altering, and destroying relations in SQL
    CREATE TABLE students (sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL);
    CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2));
the type of each column is enforced by the DBMS
    ALTER TABLE students ADD COLUMN firstyear INTEGER;
every tuple in the current instance is extended with a null value in the new column
    DROP TABLE students;
destroys the relation students (schema and instance)

adding and deleting tuples
> what do the following statements do?
    INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);
    DELETE FROM students WHERE name = 'Jane Smith';
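The statements on the last few slides can be tried end to end. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in DBMS (an assumption of this example, not something the lecture prescribes). Note that SQLite is lenient about column types, so it does not illustrate type enforcement.

```python
import sqlite3

# In-memory database; SQLite is only a convenient stand-in DBMS here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (sid CHAR(20), name CHAR(20), "
    "username CHAR(10), age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Sam Jones", "jones", 22, 3.4),
        ("53688", "Alice Smith", "smith", 22, 3.8),
        ("53650", "Jon Edwards", "jon", 23, 2.4),
    ],
)

# Selection (WHERE) plus projection (name, username).
age_23 = conn.execute(
    "SELECT name, username FROM students WHERE age = 23"
).fetchall()

# Existing tuples are extended with a null value in the new column.
conn.execute("ALTER TABLE students ADD COLUMN firstyear INTEGER")
firstyears = conn.execute("SELECT DISTINCT firstyear FROM students").fetchall()

# The INSERT adds one tuple; the DELETE removes nothing because
# no student is named 'Jane Smith'.
conn.execute(
    "INSERT INTO students (sid, name, username, age, gpa) "
    "VALUES (12345, 'Kate Doe', 'kate', 23, 4.0)"
)
row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
deleted = conn.execute(
    "DELETE FROM students WHERE name = 'Jane Smith'"
).rowcount
```

Here `age_23` holds the single matching student, `firstyears` holds only a null, and `deleted` is 0.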

candidate keys
a set of fields is a candidate key (aka key) for a relation if...
  1) distinct tuples cannot have the same values in all key fields, and
  2) this is not true for any proper subset of the key
if only part (1) is true... we have a superkey
there may be many candidate keys for a relation; the DBMS admin chooses one of them as the primary key
a key is an integrity constraint: a condition that must be true for any instance of the database
other integrity constraints?

candidate keys
in SQL, use
  PRIMARY KEY to specify the primary key
  UNIQUE to specify candidate keys
example: relation enrolled holds information about student enrollment to courses
compare the following create table statements
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid))
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))
use ICs carefully - they might forbid database instances that could arise in practice
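To compare the two Enrolled definitions concretely, here is a small sketch that loads the same rows under each schema and reports which rows the constraints accept. It uses Python's sqlite3 module as a stand-in DBMS; the course ids and the helper name `accepted_rows` are made up for illustration.

```python
import sqlite3

SCHEMA_A = ("CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), "
            "grade CHAR(2), PRIMARY KEY (sid, cid))")
SCHEMA_B = ("CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), "
            "grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))")

# Hypothetical enrollments: one student in two courses,
# and two students with the same grade in the same course.
ROWS = [
    ("53666", "db101", "A"),
    ("53666", "os201", "B"),
    ("53688", "db101", "A"),
]

def accepted_rows(create_stmt, rows):
    """Insert rows one by one; keep those that violate no constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_stmt)
    accepted = []
    for row in rows:
        try:
            conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", row)
            accepted.append(row)
        except sqlite3.IntegrityError:
            pass  # rejected by PRIMARY KEY or UNIQUE
    return accepted

accepted_a = accepted_rows(SCHEMA_A, ROWS)
accepted_b = accepted_rows(SCHEMA_B, ROWS)
```

Under the first schema all three rows are accepted. Under the second, a student can enroll in only one course and no two students can get the same grade in a course, so only the first row survives.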

storage and indexing

storage
setting: the DBMS uses disks as external storage to store relations into files of records
disks retrieve a random page at fixed cost
  it is cheaper to retrieve several consecutive pages than to retrieve each by random access
  > why?
file organization: method of arranging a file of records on external storage
  record: one row of a relation
  each record is internally assigned a record id (rid)
  the rid is sufficient to physically locate the record (address)

alternative file organizations
heap files: random order
  suitable when the typical access is a file scan to retrieve all records
sorted files: records are sorted - typically by column value(s)
  suitable if records must be retrieved in the sort order
indexes: data structures that allow organized access to records...
  via search keys - typically column value(s)
  updates are faster than in sorted files -- why?

indexes
data structures that allow us to find the rids of records with specified column values
any subset of the columns of a relation can be the search key for an index
  a search key is not the same as a primary / candidate key
an index contains a collection of data entries
  supports efficient retrieval of the data entries k* with a given key value k
[figure: an index file (index entries, data entries) pointing into a data file (data records)]

types of data entries
three alternatives
  1. data record with key value k
  2. (k, rid of data record with search key k)
  3. (k, list of rids of data records with search key k)
the type of data entries is orthogonal to the index structure
  examples of index structures: B+ trees or hash tables
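The three alternatives can be sketched on a tiny relation. The records below are the (name, age, sal) tuples used later in the composite-key example; the rids, written as (page, slot) pairs, are hypothetical, and the search key is age.

```python
from collections import defaultdict

# rid -> data record (name, age, sal); rids are made-up (page, slot) pairs.
records = {
    (0, 0): ("bob", 12, 10),
    (0, 1): ("cal", 11, 80),
    (1, 0): ("joe", 12, 20),
    (1, 1): ("sue", 13, 75),
}

# Alternative 1: the data entries are the data records themselves,
# organized by the search key (age).
alt1 = sorted(records.values(), key=lambda rec: rec[1])

# Alternative 2: one (k, rid) pair per data record.
alt2 = sorted((rec[1], rid) for rid, rec in records.items())

# Alternative 3: one (k, list of rids) pair per distinct key value.
rids_by_key = defaultdict(list)
for key, rid in alt2:
    rids_by_key[key].append(rid)
alt3 = sorted(rids_by_key.items())
```

With a repeated key (age 12 here), alternative 3 stores the key once, while alternative 2 repeats it for every record.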

data entries of type 1
the index structure is a file organization for data records
  we just have an index file
> how many indexes of a relation can be of type 1?
[figure: an index file that contains both the index entries and the data records]

data entries of types 2 & 3
data entries are typically much smaller than data records
> why?
type 3 is more compact than type 2
> why?
[figure: an index file (index entries, data entries) pointing into a data file (data records)]

index classes
primary vs secondary
  primary: the search key contains the primary key
  unique index: the search key contains a candidate key
clustered vs unclustered
  clustered: the order of data records is the same as that of data entries
  makes a big difference for some queries!
> can alternative 1 indexes be unclustered?
[figure: an unclustered index next to a clustered index]

hash-based indexes
retrieve records with exactly specified search-key values
  suitable for equality queries
the index is a collection of buckets
  bucket = 1 or more disk pages
hashing function h
  h(r) = bucket where record r belongs, based on its column values
data entries are...
  type 1: the buckets contain data records
  type 2 or 3: the buckets contain (key, rid) or (key, rids) pairs

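hash-based indexes
relation employees(name CHAR(100), age INTEGER, salary INTEGER)
[figure: a clustered (type 1) hash index on age, where hash function h1 places the data records into buckets, and an unclustered (type 2) hash index on salary, where hash function h2 places data entries that point to those records]
data records in the figure: Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004; Ashby, 25, 3000; Basu, 33, 4003; Kate, 29, 2007; Cass, 50, 5004; Basu, 33, 6003
salary values in the data entries: 3000, 3000, 5004, 5004, 4003, 2007, 6003, 6003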
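The unclustered (type 2) hash index on salary can be sketched as follows. This is only an illustration under assumptions: the number of buckets, the hash function `h2` (salary modulo the bucket count), and the use of list positions as rids are all made up; the records are the ones shown in the figure.

```python
N_BUCKETS = 4  # hypothetical number of buckets

def h2(salary):
    """Hypothetical hash function on the search key (salary)."""
    return salary % N_BUCKETS

# Data records (name, age, salary); the rid is the position in this list.
employees = [
    ("Smith", 44, 3000), ("Jones", 40, 6003), ("Tracy", 44, 5004),
    ("Ashby", 25, 3000), ("Basu", 33, 4003), ("Kate", 29, 2007),
    ("Cass", 50, 5004), ("Basu", 33, 6003),
]

# Build the index: each bucket holds (key, rid) data entries.
buckets = [[] for _ in range(N_BUCKETS)]
for rid, (_name, _age, salary) in enumerate(employees):
    buckets[h2(salary)].append((salary, rid))

def lookup(salary):
    """Equality search: read one bucket, then fetch the matching records."""
    return [employees[rid] for key, rid in buckets[h2(salary)] if key == salary]

found = lookup(6003)
missing = lookup(1234)
```

An equality search touches a single bucket. A range search gets no such help, because nearby salaries may hash to unrelated buckets.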

b+ tree indexes
leaf pages contain data entries, are sorted by search key, and are chained (prev & next)
non-leaf pages contain index entries; they are only used to direct searches
layout of a non-leaf page: P0 K1 P1 K2 P2 ... Km Pm
[figure: non-leaf pages above the chained leaf pages]

example b+ tree
root: 17 (entries < 17 to the left, entries >= 17 to the right)
second level: 5 13 | 27 30
leaf level: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*
note that data entries in the leaf level are sorted
find 28*? 29*? all > 15* and < 30*?
insert/delete
  find the data entry in its leaf, then update the leaf
  sometimes the parent needs to be adjusted
  the change sometimes bubbles up the tree
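The three searches asked about on this slide can be traced with a small sketch. The in-memory representation is an assumption made for illustration: non-leaf pages are (keys, children) pairs, leaves are Python lists, and the leaf chain is modelled by list position. Insertions and deletions are not covered.

```python
from bisect import bisect_right

# Leaf pages of the example tree, left to right.
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Non-leaf pages as (keys, children); a child is a non-leaf page or a leaf number.
left = ([5, 13], [0, 1, 2])
right = ([27, 30], [3, 4, 5])
root = ([17], [left, right])

def find_leaf(node, key):
    """Follow index entries from the root down to the leaf that may hold key."""
    while not isinstance(node, int):
        keys, children = node
        # Keys equal to a separator go to the right child (entries >= separator).
        node = children[bisect_right(keys, key)]
    return node

def lookup(key):
    """Equality search: is data entry key* present?"""
    return key in leaves[find_leaf(root, key)]

def range_search(low, high):
    """All data entries k* with low < k < high, using the leaf chain."""
    result = []
    leaf = find_leaf(root, low)
    while leaf < len(leaves):
        for k in leaves[leaf]:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        leaf += 1  # move to the next leaf in the chain
    return result

has_28 = lookup(28)
has_29 = lookup(29)
between_15_and_30 = range_search(15, 30)
```

The range search descends the tree only once, to the leaf where 15 would be, and then walks the chained leaves until it passes 30.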

access-cost analysis

access-cost model
relation students
  B: number of data pages, R: number of records per page
  D: (average) time to read or write one disk page
execute a typical select-from-where query
    SELECT * FROM students WHERE <...>
estimate the running time of the query
  ignore cpu costs
  the number of disk accesses (reads/writes) is the bottleneck

file organizations
heap file (random order; inserts at eof)
sorted file, sorted on <age, gpa>
clustered B+ tree file (type 1 data entries) on search key <age, gpa>
heap file with unclustered B+ tree index on search key <age, gpa>
heap file with unclustered hash index on search key <age, gpa>

queries to compare
scan - fetch all records
    SELECT * FROM students
equality search
    SELECT * FROM students WHERE age = 22 AND gpa = 4.0
range search
    SELECT * FROM students WHERE age >= 20
insert record
    INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Michael', 'mike', 32, 2.6)

cost analysis
what is the estimated time for each query to run, under the simplified model?
how many disk pages are accessed?
time = #disk-accesses x D

cost analysis
[table to fill in: one row per file organization (heap, sorted, clustered, unclustered b+ tree, unclustered hash) and one column per operation (scan, equality, range, insert)]

heap file
operation        cost and explanation
scan             B; simply retrieve all pages
equality search  B in the worst case; if we know that exactly one such record exists, the cost is 0.5B in expectation
range search     B; must retrieve all records
insert           2; fetch and store back the last page of the file

sorted file
operation        cost and explanation
scan             B; simply retrieve all pages
equality search  log_2 B + #qualifying-pages; since the condition matches the sort order, we can find the page of the first matching record with a binary search that retrieves log_2 B pages; if more than one record qualifies, retrieve sequentially the #qualifying-pages after the first
range search     log_2 B + #qualifying-pages; as above, log_2 B pages are retrieved to find the first matching record, possibly followed by a number (#qualifying-pages) of pages with qualifying records
insert           log_2 B + B; find the position of the record in the file (log_2 B); then read the second half of the file, insert the record, and write the second half back (0.5B + 0.5B in expectation)

clustered b+ tree
assumptions: 2/3 = 67% occupancy of record pages, i.e. 1.5B record pages; fanout F
operation        cost and explanation
scan             1.5B; simply retrieve all record pages
equality search  log_F 1.5B + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
range search     log_F 1.5B + #qualifying-pages; find the first qualifying record and retrieve consecutive qualifying ones
insert           log_F 1.5B + 1; search for the record page (log_F 1.5B) and add the record to it (1)

unclustered b+ tree
assumptions: the size of one data entry is 10% of the size of one record; also, index pages have 2/3 = 67% occupancy; therefore, the number of index leaf pages is 0.1*1.5B = 0.15B and the number of data entries in one page is 10*0.67R = 6.7R
operation        cost and explanation
scan             B(R+0.15); scan the leaf level of the index (0.15B); for each data entry, fetch the page with the corresponding data record (6.7R x 0.15B, approximately BR)
equality search  log_F 0.15B + #qualifying-records; locate the first data entry (log_F 0.15B) and do one disk access for every qualifying record (#qualifying-records)
range search     log_F 0.15B + #qualifying-records; locate the first data entry (log_F 0.15B) and do one disk access for every qualifying record (#qualifying-records)
insert           3 + log_F 0.15B; insert at the end of the heap file (2), find the page for the data entry (log_F 0.15B) and update it (1)
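The arithmetic behind the scan cost can be checked with a short sketch. The function and parameter names are made up for this example; the default values are the slide's assumptions (data entries 10% the size of a record, 2/3 occupancy).

```python
def unclustered_btree_scan(B, R, entry_ratio=0.1, occupancy=2 / 3):
    """Derive the scan cost of a heap file with an unclustered B+ tree.

    B: data pages, R: records per page (both from the cost model);
    entry_ratio and occupancy are the slide's stated assumptions.
    """
    # Entries are 10x smaller but pages are only 2/3 full: 0.1 * 1.5B = 0.15B.
    leaf_pages = entry_ratio * B / occupancy
    # A full page fits 10R entries; at 2/3 occupancy that is about 6.7R.
    entries_per_page = occupancy * R / entry_ratio
    # One record-page fetch per data entry: 0.15B * 6.7R, which is B * R.
    record_fetches = leaf_pages * entries_per_page
    return leaf_pages, entries_per_page, leaf_pages + record_fetches

# Hypothetical sizes: 1000 data pages with 100 records each.
leaf_pages, entries_per_page, scan_cost = unclustered_btree_scan(B=1000, R=100)
```

With exact 2/3 occupancy the product is exactly BR (one fetch per record); the 6.7R on the slide is a rounded value.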

unclustered hash index
assumptions: the size of one data entry is 10% of the size of one record; static hashing, no overflow pages (one bucket is one page); 4/5 = 80% occupancy; therefore, there are 0.1*1.25B = 0.125B pages for data entries, and the number of data entries in a page is 10*0.8R = 8R
operation        cost and explanation
scan             B(R+0.125); retrieve the pages that contain data entries (0.125B); for each data entry, fetch the page with the corresponding data record
equality search  2; retrieve the page with the data entry (1) and the page with the data record (1)
range search     0.125B + #qualifying-records; the hash index offers no help - scan the index (0.125B) and retrieve the pages of matching records; typically it's better to scan the entire heap file (B)
insert           4; insert the record into the heap file (1 read + 1 write); insert the data entry into the hash index (1 read + 1 write)

cost analysis
                     scan        equality                         range                            insert
heap                 B           B                                B                                2
sorted               B           log_2 B + #qualifying-pages      log_2 B + #qualifying-pages      log_2 B + B
clustered            1.5B        log_F 1.5B + #qualifying-pages   log_F 1.5B + #qualifying-pages   log_F 1.5B + 1
unclustered b+ tree  B(R+0.15)   log_F 0.15B + #qualifying-records  log_F 0.15B + #qualifying-records  3 + log_F 0.15B
unclustered hash     B(R+0.125)  2                                0.125B + #qualifying-records     4
note: we made several assumptions to obtain these numbers
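The table can be turned into a small calculator for plugging in concrete sizes. This is a sketch that simply transcribes the formulas above; the function names and the example values for B, R, F and D are hypothetical.

```python
import math

def access_costs(B, R, F, q_pages=1, q_records=1):
    """Disk accesses per operation, under the lecture's assumptions.

    B: data pages, R: records per page, F: B+ tree fanout,
    q_pages / q_records: number of qualifying pages / records.
    """
    sorted_search = math.log2(B) + q_pages
    clustered_search = math.log(1.5 * B, F) + q_pages
    unclustered_search = math.log(0.15 * B, F) + q_records
    return {
        "heap": {"scan": B, "equality": B, "range": B, "insert": 2},
        "sorted": {"scan": B, "equality": sorted_search,
                   "range": sorted_search, "insert": math.log2(B) + B},
        "clustered": {"scan": 1.5 * B, "equality": clustered_search,
                      "range": clustered_search,
                      "insert": math.log(1.5 * B, F) + 1},
        "unclustered b+ tree": {"scan": B * (R + 0.15),
                                "equality": unclustered_search,
                                "range": unclustered_search,
                                "insert": 3 + math.log(0.15 * B, F)},
        "unclustered hash": {"scan": B * (R + 0.125), "equality": 2,
                             "range": 0.125 * B + q_records, "insert": 4},
    }

def estimated_time(accesses, D):
    """time = #disk-accesses x D, with D the time per page access."""
    return accesses * D

# Hypothetical workload: 1024 pages, 100 records per page, fanout 100.
costs = access_costs(B=1024, R=100, F=100)
sorted_equality = costs["sorted"]["equality"]
hash_scan = costs["unclustered hash"]["scan"]
heap_insert_time = estimated_time(costs["heap"]["insert"], D=0.015)
```

Comparing rows for a given workload makes the trade-off visible: the hash index wins on equality searches but is by far the worst choice for a full scan.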

the moral
different queries have different costs for different file organizations
> how would you use this analysis as a db admin?
discuss

the moral
know your workload
  what queries? how often? on what relations?
what file organizations? what indexes would speed up response times for your workload?
  hint: see the WHERE clause for index key candidates -- why?
what trade-offs will you face?
  hint: queries are faster but updates take time, and the index takes space
we'll see more complex cases in query optimization

indexes with composite search keys
composite search keys: search on a combination of fields
equality query: every field value is equal to a constant
  e.g., age=20 and sal=75, wrt <sal,age> index
range query: some field value is not a constant
  e.g., age=20; or age=20 and sal > 10, wrt <sal,age> index
data entries in the index are sorted by search key to support range queries (e.g., b+ trees)
remember also: composite indexes are larger and updated more often

examples of composite key indexes
data records, sorted by name:
name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75
data entries sorted by <age,sal>: 11,80  12,10  12,20  13,75
data entries sorted by <sal,age>: 10,12  20,12  75,13  80,11
data entries sorted by <age>: 11  12  12  13
data entries sorted by <sal>: 10  20  75  80

composite search keys
if the condition is 3000<sal<5000, an <age,sal> index does not help!
why? because the index does not match the selection condition
an index matches a selection (condition AND ... AND condition) when:
  for a hash index: the selection has only equality conditions, on all fields of the search key
  for a tree index: the selection includes an equality or range condition on a prefix of the search key
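The matching rule can be written as a small predicate. This is a simplified sketch: a selection is represented as a dictionary from field name to the kind of condition on it ("=" or "range"), and the function name `index_matches` is made up. For tree indexes it checks only the shortest prefix, i.e. that the first field of the search key has a condition.

```python
def index_matches(kind, search_key, conditions):
    """Does an index on search_key match a conjunctive selection?

    kind: "hash" or "tree"
    search_key: tuple of field names, e.g. ("age", "sal")
    conditions: dict field -> "=" or "range" (simplified representation)
    """
    if kind == "hash":
        # Equality conditions on all fields of the search key.
        return all(conditions.get(field) == "=" for field in search_key)
    if kind == "tree":
        # An equality or range condition on a prefix of the search key;
        # the shortest prefix is the first field.
        return search_key[0] in conditions
    raise ValueError(f"unknown index kind: {kind}")

sal_range = {"sal": "range"}                   # 3000 < sal < 5000
age_eq_sal_range = {"age": "=", "sal": "range"}

tree_age_sal = index_matches("tree", ("age", "sal"), sal_range)
tree_sal_age = index_matches("tree", ("sal", "age"), sal_range)
hash_age_sal = index_matches("hash", ("age", "sal"), age_eq_sal_range)
```

The first call returns False, which is the slide's point: with a condition on sal only, the <age,sal> tree index cannot narrow the search, while <sal,age> can.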

composite search keys
to retrieve employee records with age=30 AND sal=4000, an index on <age,sal> or <sal,age> would be better than an index on <age> or an index on <sal>
if the condition is age=30 AND 3000<sal<5000: an <age,sal> index is much better than a <sal,age> index!
  why? hint: it allows us to locate the answer in contiguous data entries
order can make a difference depending on the selectivity of each condition
if the condition is 20<age<30 AND 3000<sal<5000: a tree index on <age,sal> or <sal,age> makes no difference if the selectivity of each condition is the same
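The hint about contiguous data entries can be made concrete by counting how many entries each key order has to scan for age=30 AND 3000<sal<5000. The sketch below models the sorted leaf level of each index as a sorted Python list and uses binary search in place of the tree descent; the employee data is synthetic.

```python
from bisect import bisect_left, bisect_right

# Synthetic employees: every age 20..39 combined with salaries 1000..9000.
employees = [(age, sal) for age in range(20, 40)
             for sal in range(1000, 10000, 1000)]

by_age_sal = sorted(employees)                          # <age,sal> entries
by_sal_age = sorted((sal, age) for age, sal in employees)  # <sal,age> entries

def search_age_sal(age, lo, hi):
    """<age,sal>: qualifying entries form one contiguous run."""
    start = bisect_right(by_age_sal, (age, lo))
    end = bisect_left(by_age_sal, (age, hi))
    scanned = by_age_sal[start:end]
    return scanned, len(scanned)

def search_sal_age(age, lo, hi):
    """<sal,age>: scan the whole salary range, then filter on age."""
    start = bisect_right(by_sal_age, (lo, float("inf")))
    end = bisect_left(by_sal_age, (hi, float("-inf")))
    scanned = by_sal_age[start:end]
    matches = [(a, s) for s, a in scanned if a == age]
    return matches, len(scanned)

answer_1, scanned_1 = search_age_sal(30, 3000, 5000)
answer_2, scanned_2 = search_sal_age(30, 3000, 5000)
```

Both orders return the same answer, but <age,sal> scans one entry while <sal,age> scans every entry in the salary range, one per age.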

index-only plans
some queries can be answered without retrieving any data records, if a suitable index is available
example: employees(name CHAR(100), depnum INTEGER, age INTEGER, salary INTEGER)
with an index on <depnum>:
    SELECT depnum, COUNT(*) FROM employees GROUP BY depnum
with a b+ tree index on <age,salary>:
    SELECT AVG(salary) FROM employees WHERE age=25 AND salary BETWEEN 3000 AND 5000
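The first query can be answered from the data entries alone, as the following sketch illustrates. The (depnum, rid) entries are made up, and the function name `count_by_depnum` is hypothetical; the point is that the rids are never followed, so no data record is fetched.

```python
from collections import Counter

# Hypothetical data entries of an index on <depnum>: (depnum, rid) pairs.
data_entries = [(10, 0), (10, 3), (20, 1), (20, 2), (20, 5), (30, 4)]

def count_by_depnum(entries):
    """SELECT depnum, COUNT(*) ... GROUP BY depnum, using index keys only."""
    counts = Counter(depnum for depnum, _rid in entries)  # rids are ignored
    return sorted(counts.items())

result = count_by_depnum(data_entries)
```

Every column the query mentions is part of the search key, which is what makes the index-only plan possible.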

index-only plans
index-only plans are possible with both a <dno,age> and an <age,dno> tree index
the <age,dno> index is better -- why?
    SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age=30 GROUP BY E.dno

summary

summary
relational model and SQL
  tabular representation: one record per row; the schema determines the names and types of columns
  simple, intuitive query language
    statements to select records that satisfy a condition and to specify columns to project
    statements to insert and delete tuples
storage
  a DBMS might use different file organizations to store relations: heap file, sorted file, index
  different queries have different access costs for different file organizations
  having the right index can make a big difference in execution time
  commonly used indexes: B+ tree and hash-based index

next
b+ trees and hash-based indexes
external sorting
joins
query optimization

references
cowbook: database management systems, by ramakrishnan and gehrke
elmasri: fundamentals of database systems, by elmasri and navathe
other database textbooks
disk access analysis
  cowbook, chapter 8
b+ tree and hashing algorithms
  elmasri, section 18.2 (hash indexes) and section 18.3.2 (b+ trees)
  cowbook, chapters 10 and 11

credits
slides based on material from database management systems, by ramakrishnan and gehrke

joins
students

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

course

sid    points  grade
53666  92      A
53688  35      D
53650  65      C

what does this compute?
    SELECT S.name, C.grade FROM Students S, Course C WHERE S.sid = C.sid AND C.points > 60

S.name       C.grade
Sam Jones    A
Jon Edwards  C
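The join can be run as is. The sketch below uses Python's sqlite3 module as a stand-in DBMS and loads the two tables from the slide; the ORDER BY clause is an addition made here only so that the result order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (sid CHAR(20), name CHAR(20), "
    "username CHAR(10), age INTEGER, gpa REAL)"
)
conn.execute("CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2))")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Sam Jones", "jones", 22, 3.4),
        ("53688", "Alice Smith", "smith", 22, 3.8),
        ("53650", "Jon Edwards", "jon", 23, 2.4),
    ],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("53666", 92, "A"), ("53688", 35, "D"), ("53650", 65, "C")],
)

# Pair each student with their course row, keeping only rows above 60 points.
joined = conn.execute(
    "SELECT S.name, C.grade FROM students S, course C "
    "WHERE S.sid = C.sid AND C.points > 60 "
    "ORDER BY S.name"  # added here for a deterministic order
).fetchall()
```

The query returns the name and grade of every student who scored more than 60 points; Alice Smith, with 35 points, is filtered out.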

index-only plans
what if we consider the second query?
    SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>30 GROUP BY E.dno
we'll come back to this after external sorting