Modern Database Systems CS-E4610 Aristides Gionis Michael Mathioudakis Spring 2017
what is a database?
  - a collection of data
what is a database management system (DBMS)? a.k.a. database system
  - software to store, access, and administer a database
  - not just a collection of files
  - provides a mechanism to query the data
  - transfers data between main memory and secondary storage (disk)
  - enables concurrent access and offers guarantees for data consistency
  - provides crash recovery mechanisms
  - provides security and access control
why use a DBMS? (group work)
why use a DBMS?
  - separate logical from physical data organization
  - efficient data access
  - guarantee data integrity and security
  - reduce application development time
  - data administration
why study database systems? (group work)
why study database systems?
  - to manage data efficiently
consider the following task
  data: records that contain information about products viewed or purchased from an online store
  task: for each pair of Games products, count the number of customers that have purchased both

  Product           Category  Customer    Date        Price  Action    other...
  Portal 2          Games     Michael M.  12/01/2015  10     Purchase
  FLWR Plant Food   Garden    Aris G.     19/02/2015  32     View
  Chase the Rabbit  Games     Michael M.  23/04/2015  1      View
  ...
  Portal 2          Games     Orestis K.  13/05/2015  10     Purchase
  ...

  case A: 10,000 records (0.5MB per record, 5GB total disk space); 10GB of main memory
  case B: 10,000,000 records (~5TB total disk space) stored across 100 nodes (50GB per node); 10GB of main memory per node
  > what challenges does case B pose compared to case A?
  hint: limited main memory, disk access, distributed setting
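A minimal sketch of the counting task for case A, where all records fit in main memory. The record layout (product, category, customer, action) and the last record are assumptions made for illustration; they are not part of the slide. In case B the per-customer sets would neither fit in one node's memory nor live on one node, which is exactly the challenge the slide asks about.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical in-memory records: (product, category, customer, action).
records = [
    ("Portal 2", "Games", "Michael M.", "Purchase"),
    ("FLWR Plant Food", "Garden", "Aris G.", "View"),
    ("Chase the Rabbit", "Games", "Michael M.", "View"),
    ("Portal 2", "Games", "Orestis K.", "Purchase"),
    ("Chase the Rabbit", "Games", "Orestis K.", "Purchase"),  # added for illustration
]


def count_copurchases(records):
    """For each pair of Games products, count the customers who purchased both."""
    purchased = defaultdict(set)  # customer -> set of Games products purchased
    for product, category, customer, action in records:
        if category == "Games" and action == "Purchase":
            purchased[customer].add(product)

    pair_counts = defaultdict(int)
    for products in purchased.values():
        # Sorting gives each unordered pair a single canonical form.
        for pair in combinations(sorted(products), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)


pair_counts = count_copurchases(records)
print(pair_counts)
```

Only Orestis K. purchased both products here; Michael M. merely viewed Chase the Rabbit, so he does not contribute to the pair.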
the main message
  to manage data efficiently
  - minimize expensive operations, e.g., disk access
  - parallelize computation
why study database systems?
  to manage data efficiently... from different roles
  - develop database systems that match application requirements
  - use database systems efficiently
      knowing how a DBMS stores data, processes queries, and accesses data allows us to
      - organize data appropriately
      - design efficient algorithms to process the data
  - combine existing database systems to match requirements
      large variety of data and applications
      "one size fits none" - Michael Stonebraker
the relational database system (layers, top to bottom)
  database user
  database design | query interface | application
  DBMS
    - query optimization & execution
    - relational operators
    - files and access methods
    - buffer (memory) management
    - disk space management
  database (data stored on disk)
  diagram annotations: introductory course: relational DBMS; our course: relational DBMS and non-relational DBMS
course requirements
  - basic programming in python
  - SQL
  - analysis of algorithms
quiz!
  complete the handout
keywords
  relational database, table, index, join, query, tree, hash table, schema, graph, path (on a graph), document retrieval, inverted index, compression, ranking algorithm, precision, recall
previously on database systems...
  - relational data model: relation, attribute, tuple, schema, domain, keys
  - relational algebra: projection, selection, cartesian product, natural joins, theta joins, outer joins, renaming, constraints
  - structured query language (SQL)
  - schema design: functional dependencies, normalization
  - applications: embedded SQL, drivers
modern database systems
  beyond the typical relational DBMS setting...
  - different data models: semi-structured data, unstructured text, graphs
  - operations at massive scale: big data platforms & map-reduce paradigm, hadoop and spark, cloud computing
  - tailored performance: key-value stores, column-stores, in-memory databases, streaming systems
about this course
  familiarize ourselves with modern database systems: principles and practice
  - database models: data, queries, and computation
  - algorithms: from simple queries (e.g., joins) to complex algorithms
  - experience with technologies
  emphasis is on understanding of core issues
  essentially: the cost of algorithms for different database models and settings
  you can use what you learn here to
  - select a database system that fits the demands of your application, based on supported data model, functionality, optimizations, scalability
  - design your database to fit the needs of your application, e.g., by building appropriate index structures
  - write fast algorithms to process your data and estimate their running time
  - adapt your knowledge to the database system you'll be using 5 years from now
syllabus
  part 1: relational database systems (Jan 27, Feb 3, Feb 10 - Michael)
    topics: access costs, indexing (B+ trees, hash tables), join algorithms
    technology: PostgreSQL
  part 2: semi-structured data (Feb 24, Mar 3, Mar 10 - Aris)
    topics: semi-structured data, information retrieval, rank aggregation
    technology: MongoDB
  part 3: big data (Mar 17 - Michael; Mar 24, Mar 31 - Aris)
    topics: spark, map-reduce algorithms, graph databases
    technology: Apache Spark
logistics
  lectures: Friday 10-12; January 20th to March 31st, no class on February 17th
  teaching assistants: to be recruited (see next slide); office hours TBD
  tutorials: week of February 10th; March 10th; March 24th; details TBD
  assignments: announced in week of February 10th; March 10th; March 31st; return within two weeks
  curriculum: slides and course notes; no single textbook, but slides will provide references for further study
  announcements: follow course website on mycourses.aalto.fi
  questions: use mycourses.aalto.fi
  programming assignments: you can use campus labs or own computer; homework handouts will provide instructions
recruiting teaching assistants
  - number of positions: 4
  - job: conducting tutorials + office hours + grading
  - waive homework assignments
  - required: taken basic course on databases + interview
  if interested, write us an email
workload & grading scheme
  3 assignments, 1 exam; 25% each
  assignments
  - pen-and-paper part: questions based on slides
  - programming part: PostgreSQL, MongoDB, Spark; tutorials

  Total marks (%)  Grade  % of students last year
  50-59            1       3.33
  60-69            2      10
  70-79            3      20
  80-95            4      43.33
  96-100           5      23.33
what to expect from the course
  in the eyes of last year's students
overall assessment
  ICS-E5040 Modern Database Systems (2016-01-08 to 2016-04-01)
  1. My overall assessment of the course
  scale: E=Not applicable, 1=Fair, 2=Satisfactory, 3=Good, 4=Very good, 5=Excellent
  number of respondents: 22
  [bar chart: number of responses per answer]
required effort
  4. According to the guidelines, one credit (ECTS) requires 27 hours of student work. Compared with this, the completion of the course required
  scale: E=Not applicable, 1=Considerably less time, 2=Slightly less time, 3=The right amount of time, 4=Slightly more time, 5=Considerably more time
  number of respondents: 22
  [bar chart: number of responses per answer]
benefit from the things learnt
  5. I think I will benefit from the things learnt on the course.
  scale: E=Not applicable, 1=Strongly disagree, 2=Disagree, 3=Neither agree nor disagree, 4=Agree, 5=Strongly agree
  number of respondents: 22
  [bar chart: number of responses per answer]
things students liked
  - course was practical
  - staff tried to engage audience and make sessions interactive
  - some assignments were useful
  - assignments helped to put theory in practice
  - assignments introduced many different systems
criticism and things to improve
  - lectures were fast-paced
  - the content balance was also too much in favour of data mining compared to data engineering
  - tutorials should have been announced earlier
  - some assignments took too long to grade
  - the programming environment for assignment 3 was terrible
  - "I hate open book exams with passion as it makes it more acceptable to have bad exams"
  - "Lecture 05: top-k retrieval" file is theoretical. Boring. Don't understand how Fagin's algorithm works. Let alone proving...
that's all for now!
  questions?
  next hour: very short introduction to SQL
  next lecture: access cost analysis, indexing
relational model and SQL
relational model and SQL
  what is the relational model?
  - tabular representation of data
  why do we study it?
  - supports simple and intuitive querying
  - good for educational purposes
  - most widely used
definitions
  - relational database: stores a set of relations
  - relation instance: a table with rows and columns
  - schema: name of relation + name and type of each field (fields as columns)
  example!
example relation: students

  sid    name         username  age  gpa
  53666  Sam Jones    jones     22   3.4
  53688  Alice Smith  smith     22   3.8
  53650  Jon Edwards  jon       23   2.4

  cardinality (number of rows) = 3, degree (number of fields/columns) = 5
  > can we have the same value twice in the same column?
  schema: students(sid: integer, name: string, username: string, age: integer, gpa: real)
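As a rough illustration, the students instance can be loaded into an in-memory SQLite database to read off its cardinality and degree. SQLite and the Python `sqlite3` module are assumptions made here for convenience (the course itself uses PostgreSQL), and the SQLite column types only approximate the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edwards", "jon", 23, 2.4),
    ],
)

# Cardinality: the number of rows in the instance.
cardinality = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]

# Degree: the number of fields, read from the result-set description.
cursor = conn.execute("SELECT * FROM students")
degree = len(cursor.description)

print(cardinality, degree)
```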
querying
  major strength of the relational model: simple, intuitive, precise querying of data
  the DBMS is responsible for efficient evaluation
  Structured Query Language (SQL)
  - the standard language for relational queries
  - developed by IBM in the 1970s
  - first standardized in 1986
  - revised several times since (e.g., SQL:2011, SQL:2016)
  example!
example SQL query - selection

  sid    name         username  age  gpa
  53666  Kate Jones   jones     22   3.4
  53688  Alice Smith  smith     22   3.8
  53650  Jon Edward   jon       23   2.4

  to find student records of age 23
    SELECT * FROM students WHERE age = 23;

    sid    name        username  age  gpa
    53650  Jon Edward  jon       23   2.4

  to find just names and usernames
    SELECT name, username FROM students WHERE age = 23;

    name        username
    Jon Edward  jon
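The two selection queries can be tried out with a small sketch like the following. It assumes an in-memory SQLite database accessed through Python's `sqlite3` module rather than the course's PostgreSQL setup; the queries themselves are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Kate Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edward", "jon", 23, 2.4),
    ],
)

# Selection: keep only the rows that satisfy the WHERE condition.
full_records = conn.execute("SELECT * FROM students WHERE age = 23").fetchall()

# Selection plus projection: keep only two columns of the matching rows.
names = conn.execute("SELECT name, username FROM students WHERE age = 23").fetchall()

print(full_records)
print(names)
```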
example SQL query - grouping

  sid    name         username  age  gpa  active
  53666  Kate Jones   jones     22   3.4  TRUE
  53688  Alice Smith  smith     22   3.8  TRUE
  53650  Jon Edward   jon       23   2.4  TRUE

  what is the result of this query?
    SELECT age, avg(gpa) AS avggrade FROM students WHERE active = TRUE GROUP BY age;

    age  avggrade
    22   3.6
    23   2.4
example SQL query - grouping with HAVING

  sid    name         username  age  gpa  active
  53666  Kate Jones   jones     22   3.4  TRUE
  53688  Alice Smith  smith     22   3.8  TRUE
  53650  Jon Edward   jon       23   2.4  TRUE

  what is the result of this query?
    SELECT age, avg(gpa) AS avggrade FROM students WHERE active = TRUE GROUP BY age HAVING avg(gpa) > 3;

    age  avggrade
    22   3.6
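A hedged sketch of the two grouping queries, again assuming in-memory SQLite via Python's `sqlite3` instead of PostgreSQL. Two details are assumptions of this sketch: the boolean `active` column is stored as the integer 1, and an `ORDER BY age` is added so the output order is deterministic. Averages are rounded in Python because floating-point arithmetic may not give exactly 3.6.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL, active INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?, ?)",
    [
        (53666, "Kate Jones", "jones", 22, 3.4, 1),
        (53688, "Alice Smith", "smith", 22, 3.8, 1),
        (53650, "Jon Edward", "jon", 23, 2.4, 1),
    ],
)

# WHERE filters rows before grouping; avg(gpa) is computed once per age group.
grouped = conn.execute(
    "SELECT age, avg(gpa) AS avggrade FROM students "
    "WHERE active = 1 GROUP BY age ORDER BY age"
).fetchall()

# HAVING filters whole groups after the aggregate has been computed.
filtered = conn.execute(
    "SELECT age, avg(gpa) AS avggrade FROM students "
    "WHERE active = 1 GROUP BY age HAVING avg(gpa) > 3 ORDER BY age"
).fetchall()

grouped = [(age, round(avggrade, 2)) for age, avggrade in grouped]
filtered = [(age, round(avggrade, 2)) for age, avggrade in filtered]
print(grouped)
print(filtered)
```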
example SQL query - grouping

  sid    name         username  age  gpa  active
  53666  Kate Jones   jones     22   3.4  TRUE
  53688  Alice Smith  smith     22   3.8  TRUE
  53650  Jon Edward   jon       23   2.4  TRUE

  would this work?
    SELECT * FROM students GROUP BY age;
creating, altering, and destroying relations in SQL
  CREATE TABLE students (sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL);
  CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2));
    the type of each column is enforced by the DBMS
  ALTER TABLE students ADD COLUMN firstyear INTEGER;
    every tuple in the current instance is extended with a null value in the new column
  DROP TABLE students;
    destroy relation students (schema and instance)
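The effect of ALTER TABLE and DROP TABLE can be observed with a short sketch, assuming in-memory SQLite via Python's `sqlite3`. One caveat of that assumption: SQLite applies type affinity rather than strict type checking, so it is more lenient about column types than PostgreSQL. The inserted row is sample data for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL)"
)
conn.execute("INSERT INTO students VALUES ('53666', 'Sam Jones', 'jones', 22, 3.4)")

# Existing tuples are extended with NULL (None in Python) in the new column.
conn.execute("ALTER TABLE students ADD COLUMN firstyear INTEGER")
row = conn.execute("SELECT sid, firstyear FROM students").fetchone()

# DROP TABLE removes both the schema and the instance.
conn.execute("DROP TABLE students")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'students'"
).fetchall()

print(row, remaining)
```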
adding and deleting tuples
  > what do the following statements do?
    INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);
    DELETE FROM students WHERE name = 'Jane Smith';
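A small sketch for experimenting with the two statements, assuming in-memory SQLite via Python's `sqlite3`. The row for Jane Smith is a hypothetical one added so that the DELETE has something to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL)"
)
# Hypothetical pre-existing row, not taken from the slides.
conn.execute(
    "INSERT INTO students (sid, name, username, age, gpa) "
    "VALUES ('53688', 'Jane Smith', 'smith', 22, 3.8)"
)

# INSERT adds one tuple; string literals need single quotes.
conn.execute(
    "INSERT INTO students (sid, name, username, age, gpa) "
    "VALUES (12345, 'Kate Doe', 'kate', 23, 4.0)"
)
names_after_insert = sorted(r[0] for r in conn.execute("SELECT name FROM students"))

# DELETE removes every tuple that satisfies the WHERE condition.
deleted = conn.execute("DELETE FROM students WHERE name = 'Jane Smith'").rowcount
names_after_delete = [r[0] for r in conn.execute("SELECT name FROM students")]

print(names_after_insert, deleted, names_after_delete)
```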
candidate keys
  a set of fields is a candidate key (aka key) for a relation if
    1) distinct tuples cannot have the same values in all key fields, and
    2) this is not true for any proper subset of the key
  if only part (1) above is true, we have a superkey
  - possibly many candidate keys for a relation
  - the DBMS admin chooses one (1) of them as the primary key
  integrity constraint (IC): a condition that must be true for any instance of the database
  other integrity constraints?
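The two conditions can be sketched as a check over a single relation instance. This is only an illustration: a key is a constraint on every legal instance, so passing the check on one instance does not prove that a set of fields is a key. The function names and the dictionary-based row representation are assumptions of this sketch.

```python
from itertools import combinations


def is_superkey(rows, fields):
    """Condition (1): no two distinct rows agree on all the given fields."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_candidate_key(rows, fields):
    """Conditions (1) and (2): a superkey with no proper subset that is a superkey."""
    if not is_superkey(rows, fields):
        return False
    # If any proper subset were a superkey, so would be some subset with
    # exactly one field removed, so checking those subsets is enough.
    smaller = combinations(fields, len(fields) - 1)
    return not any(is_superkey(rows, subset) for subset in smaller)


students = [
    {"sid": 53666, "name": "Sam Jones", "age": 22},
    {"sid": 53688, "name": "Alice Smith", "age": 22},
    {"sid": 53650, "name": "Jon Edwards", "age": 23},
]

print(is_candidate_key(students, ("sid",)))
print(is_superkey(students, ("sid", "name")), is_candidate_key(students, ("sid", "name")))
print(is_superkey(students, ("age",)))
```

On this instance, (sid) passes both conditions, (sid, name) is a superkey but not minimal, and (age) fails condition (1) because two students are 22.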
candidate keys in SQL
  use PRIMARY KEY to specify the primary key and UNIQUE to specify other candidate keys
  example: relation Enrolled holds information about student enrollment in courses
  compare the following CREATE TABLE statements
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid));
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade));
  use ICs carefully - they might forbid database instances that could arise in practice
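To compare the two statements, one can try the same inserts against each schema and see which rows are accepted. The sketch below assumes in-memory SQLite via Python's `sqlite3`; the course identifiers `db` and `algo` and the helper `try_inserts` are hypothetical names introduced for illustration.

```python
import sqlite3

SCHEMA_A = (
    "CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid, cid))"
)
SCHEMA_B = (
    "CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid), UNIQUE (cid, grade))"
)


def try_inserts(schema, rows):
    """Return the rows that the given schema accepts, in insertion order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    accepted = []
    for row in rows:
        try:
            conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", row)
            accepted.append(row)
        except sqlite3.IntegrityError:
            pass  # the row violates a key constraint and is rejected
    conn.close()
    return accepted


rows = [
    ("53666", "db", "A"),
    ("53666", "algo", "B"),  # same student, second course
    ("53650", "db", "A"),    # different student, same course and grade
]

accepted_a = try_inserts(SCHEMA_A, rows)
accepted_b = try_inserts(SCHEMA_B, rows)
print(len(accepted_a), len(accepted_b))
```

Under the second schema a student can enroll in only one course and no two students can receive the same grade in a course, which is the kind of overly strict IC the slide warns about.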
joins

  students
  sid    name         username  age
  53666  Sam Jones    jones     22
  53688  Alice Smith  smith     22
  53650  Jon Edwards  jon       23

  dbcourse
  sid    points  grade
  53666  92      A
  53650  65      C

  what does this compute?
    SELECT * FROM students S, dbcourse C WHERE S.sid = C.sid;

  S.sid  S.name       S.username  S.age  C.sid  C.points  C.grade
  53666  Sam Jones    jones       22     53666  92        A
  53650  Jon Edwards  jon         23     53650  65        C
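The join can be reproduced with a short sketch, assuming in-memory SQLite via Python's `sqlite3` rather than PostgreSQL. The `ORDER BY` clause is an addition of this sketch to make the row order deterministic; without it, SQL does not guarantee any particular order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT, username TEXT, age INTEGER)")
conn.execute("CREATE TABLE dbcourse (sid INTEGER, points INTEGER, grade TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22),
        (53688, "Alice Smith", "smith", 22),
        (53650, "Jon Edwards", "jon", 23),
    ],
)
conn.executemany(
    "INSERT INTO dbcourse VALUES (?, ?, ?)",
    [(53666, 92, "A"), (53650, 65, "C")],
)

# Each output row concatenates a students row with a matching dbcourse row.
joined = conn.execute(
    "SELECT * FROM students S, dbcourse C WHERE S.sid = C.sid ORDER BY S.sid"
).fetchall()

for row in joined:
    print(row)
```

Alice Smith has no row in dbcourse, so she does not appear in the result.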
joins
  SELECT * FROM students S, dbcourse C WHERE S.sid = C.sid
  intuitively...
    1. take all pairs of records from S and C (the cross product S x C) - expensive to materialize!
    2. keep only the records that satisfy the WHERE condition
    3. output the join result S ⋈_{S.sid = C.sid} C

  cross product S x C
    S record #1, C record #1
    S record #1, C record #2
    S record #1, C record #3
    ...
    S record #2, C record #1
    S record #2, C record #2
    S record #2, C record #3
    ...

  pairs kept after applying the WHERE condition (in this illustration)
    S record #1, C record #2
    S record #2, C record #1
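The intuition can be sketched in plain Python. The first function follows the description literally and materializes the whole cross product; the second is a simple nested-loop variant that tests each pair as it is generated, so the cross product is never stored. Both are illustrations under the assumption that records are tuples with `sid` in position 0; they are not the join algorithms a DBMS would necessarily use.

```python
from itertools import product

students = [
    (53666, "Sam Jones", "jones", 22),
    (53688, "Alice Smith", "smith", 22),
    (53650, "Jon Edwards", "jon", 23),
]
dbcourse = [(53666, 92, "A"), (53650, 65, "C")]


def join_via_cross_product(S, C):
    """Materialize all |S| * |C| pairs, then keep those with matching sid."""
    pairs = list(product(S, C))
    return [s + c for s, c in pairs if s[0] == c[0]]


def nested_loop_join(S, C):
    """Yield matching pairs one at a time without storing the cross product."""
    for s in S:
        for c in C:
            if s[0] == c[0]:
                yield s + c


materialized = join_via_cross_product(students, dbcourse)
streamed = list(nested_loop_join(students, dbcourse))
print(materialized)
```

Both versions still compare every pair of records; the difference is only in how much intermediate data is held in memory at once.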
credits
  for some of these slides, we used material from
  - Database Systems: The Complete Book, by Garcia-Molina, Ullman, Widom
  - Database Management Systems, by Ramakrishnan and Gehrke