Modern Database Systems CS-E4610


Aristides Gionis, Michael Mathioudakis
Spring 2017

what is a database?
  a collection of data
what is a database management system (a.k.a. database system)?
  software to store, access, and administer a database
  not just a collection of files
  provides mechanisms to query the data
  transfers data between main memory and secondary storage (disk)
  enables concurrent access and offers guarantees for data consistency
  provides crash recovery mechanisms
  provides security and access control

why use a DBMS?
  group work

why use a DBMS?
  separate logical from physical data organization
  efficient data access
  guarantee data integrity and security
  reduce application development time
  data administration

why study database systems?
  group work

why study database systems?
  to manage data efficiently

consider the following task
data
  records that contain information about products viewed or purchased from an online store
task
  for each pair of Games products, count the number of customers that have purchased both

Product           Category  Customer    Date        Price  Action    other...
Portal 2          Games     Michael M.  12/01/2015  10     Purchase  ...
FLWR Plant Food   Garden    Aris G.     19/02/2015  32     View
Chase the Rabbit  Games     Michael M.  23/04/2015  1      View      ...
Portal 2          Games     Orestis K.  13/05/2015  10     Purchase  ...

case A
  10,000 records (0.5MB per record, 5GB total disk space)
  10GB of main memory
case B
  10,000,000 records (~5TB total disk space)
  stored across 100 nodes (50GB per node), 10GB of main memory per node

> what challenges does case B pose compared to case A?
hint: limited main memory, disk access, distributed setting
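As a rough illustration of the task in case A, where all records fit in main memory, the counting can be sketched in a few lines of Python. This is a minimal sketch, not the course's reference solution: the function name `count_copurchases` and the tuple layout are assumptions, and the last record is a made-up row added so that at least one pair appears.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy records: (product, category, customer, action).
# The first four follow the slide's table; the last one is hypothetical.
records = [
    ("Portal 2", "Games", "Michael M.", "Purchase"),
    ("FLWR Plant Food", "Garden", "Aris G.", "View"),
    ("Chase the Rabbit", "Games", "Michael M.", "View"),
    ("Portal 2", "Games", "Orestis K.", "Purchase"),
    ("Chase the Rabbit", "Games", "Orestis K.", "Purchase"),  # hypothetical row
]


def count_copurchases(records):
    """For each pair of Games products, count customers who purchased both."""
    purchased = defaultdict(set)  # customer -> set of purchased Games products
    for product, category, customer, action in records:
        if category == "Games" and action == "Purchase":
            purchased[customer].add(product)

    pair_counts = Counter()
    for products in purchased.values():
        # sorted() gives each pair a canonical order, so (a, b) and (b, a) coincide
        for pair in combinations(sorted(products), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_copurchases(records)
print(dict(pair_counts))
```

The sketch keeps one set per customer in memory, which is exactly what stops working in case B: the per-customer sets no longer fit on one machine, and the records of a single customer may sit on different nodes.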

the main message
to manage data efficiently
  minimize expensive operations, e.g., disk access
  parallelize computation

why study database systems?
to manage data efficiently...
... from different roles
  develop database systems that match application requirements
  use database systems efficiently
    knowing how a DBMS stores data, processes queries, and accesses data allows us to
      organize data appropriately
      design efficient algorithms to process the data
  combine existing database systems to match requirements
    large variety of data and applications
    "one size fits none" - Michael Stonebraker

the relational database system (architecture diagram, top to bottom)
  database user: database design, query interface, application
  DBMS
    query optimization & execution
    relational operators
    files and access methods
    buffer (memory) management
    disk space management
  database (data stored on disk)
diagram annotations: introductory course (relational dbms); our course (relational dbms, non-relational dbms)

course requirements
  basic programming in python
  SQL
  analysis of algorithms

quiz!
  complete the handout

keywords
  relational database, table, index, join, query, tree, hash table, schema, graph, path (on a graph), document retrieval, inverted index, compression, ranking algorithm, precision, recall

previously on database systems...
  relational data model: relation, attribute, tuple, schema, domain, keys
  relational algebra: projection, selection, cartesian product, natural joins, theta joins, outer joins, renaming, constraints
  structured query language (SQL)
  schema design: functional dependencies, normalization
  applications: embedded sql, drivers

modern database systems
beyond the typical relational DBMS setting...
  different data models: semi-structured data, unstructured text, graphs
  operations at massive scale: big data platforms & map-reduce paradigm, hadoop and spark, cloud computing
  tailored performance: key-value stores, column-stores, in-memory databases, streaming systems

about this course
familiarize ourselves with modern database systems
principles and practice
  database models: data, queries, and computation
  algorithms: simple queries (e.g., joins) to complex algorithms
  experience with technologies
emphasis is on understanding of core issues
  essentially: the cost of algorithms for different database models and settings
you can use what you learn here to:
  select a database system that fits the demands of your application...
    based on supported data model, functionality, optimizations, scalability
  design your database to fit the needs of your application
    e.g., by building appropriate index structures
  write fast algorithms to process your data and estimate their running time
  adapt your knowledge to the database system you'll be using 5 years from now

syllabus
part 1: relational database systems (Jan 27, Feb 3, Feb 10 - Michael)
  topics: access costs, indexing (b+ trees, hash tables), join algorithms
  technology: PostgreSQL
part 2: semi-structured data (Feb 24, Mar 3, Mar 10 - Aris)
  topics: semi-structured data, information retrieval, rank aggregation
  technology: MongoDB
part 3: big data (Mar 17 - Michael; Mar 24, Mar 31 - Aris)
  topics: spark, map-reduce algorithms, graph databases
  technology: Apache Spark

logistics
lectures: Friday 10-12; January 20th to March 31st, no class on February 17th
teaching assistants: to be recruited (see next slide); office hours TBD
tutorials: week of February 10th; March 10th; March 24th; details TBD
assignments: announced in week of February 10th; March 10th; March 31st; return within two weeks
curriculum: slides and course notes; no single textbook, but slides will provide references for further study
announcements: follow course website on mycourses.aalto.fi
for questions: use mycourses.aalto.fi
programming assignments: you can use campus labs or own computer; homework handouts will provide instructions

recruiting teaching assistants
number of positions: 4
job: conducting tutorials + office hours + grading
  waive homework assignments
required: taken basic course on databases + interview
if interested, write us an email

workload & grading scheme
3 assignments, 1 exam: 25% each
assignments
  pen-and-paper part: questions based on slides
  programming part: PostgreSQL, MongoDB, Spark; tutorials

Total Marks (%)  Grade  % of students last year
50-59            1      3.33
60-69            2      10
70-79            3      20
80-95            4      43.33
96-100           5      23.33

what to expect from the course
  in the eyes of last year's students

overall assessment
ICS-E5040 Modern Database Systems (2016-01-08 to 2016-04-01)
1. My overall assessment of the course
  scale: E=Not applicable, 1=Fair, 2=Satisfactory, 3=Good, 4=Very good, 5=Excellent
  number of respondents: 22
  [bar chart of responses per scale value]

required effort
4. According to the guidelines, one credit (ECTS) requires 27 hours of student work. Compared with this, the completion of the course required
  scale: E=Not applicable, 1=Considerably less time, 2=Slightly less time, 3=The right amount of time, 4=Slightly more time, 5=Considerably more time
  number of respondents: 22
  [bar chart of responses per scale value]

I think I will benefit from the things I learned
5. I think I will benefit from the things learnt on the course.
  scale: E=Not applicable, 1=Strongly disagree, 2=Disagree, 3=Neither agree nor disagree, 4=Agree, 5=Strongly agree
  number of respondents: 22
  [bar chart of responses per scale value]

things students liked
  course was practical
  staff tried to engage audience and make sessions interactive
  some assignments were useful
  assignments helped to put theory in practice
  assignments introduced many different systems

criticism and things to improve
  lectures were fast-paced
  the content balance was also too much in favour of data mining compared to data engineering
  tutorials should have been announced earlier
  some assignments took too long to grade
  the programming environment for assignment 3 was terrible
  I hate open book exams with passion as it makes it more acceptable to have bad exams
  "Lecture 05: top-k retrieval" is theoretical. Boring. Don't understand how Fagin's algorithm works. Let alone proving...

that's all for now!
questions?
next hour: very short introduction to sql
next lecture: access cost analysis, indexing

relational model and SQL

relational model and sql
what is the relational model?
  tabular representation of data
why do we study it?
  supports simple and intuitive querying
  good for educational purposes
  most widely used

definitions
relational database: stores a set of relations
relation
  instance: a table with rows and columns
  schema: name of relation + name and type of each field
  fields as columns
example!

example
relation: students

sid    name         username  age  gpa
53666  Sam Jones    jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edwards  jon       23   2.4

cardinality (number of rows) = 3, degree (number of fields/columns) = 5
> can we have the same value twice in the same column?
schema
  students(sid: integer, name: string, username: string, age: integer, gpa: real)
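To make cardinality and degree concrete, the example relation can be loaded into a throwaway database and both numbers read off. This is a small sketch that uses Python's built-in sqlite3 module as a stand-in for a full DBMS (the course itself uses PostgreSQL); the column types are an approximation of the schema on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE students "
    "(sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Sam Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edwards", "jon", 23, 2.4),
    ],
)

cursor = conn.execute("SELECT * FROM students")
rows = cursor.fetchall()
cardinality = len(rows)           # number of rows (tuples)
degree = len(cursor.description)  # number of columns (fields)

# A column may hold the same value in several rows unless a constraint forbids it.
ages = [row[3] for row in rows]
has_repeated_age = len(ages) != len(set(ages))

print(cardinality, degree, has_repeated_age)  # -> 3 5 True
```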

querying
major strength of relational model
  simple, intuitive, precise querying of data
  the DBMS is responsible for efficient evaluation
Structured Query Language (SQL)
  the standard language for relational queries
  developed by IBM in the 1970s
  standardized in 1986, with later revisions such as SQL:2011
example!

example SQL query - selection

sid    name         username  age  gpa
53666  Kate Jones   jones     22   3.4
53688  Alice Smith  smith     22   3.8
53650  Jon Edward   jon       23   2.4

to find student records of age 23
  SELECT * FROM students WHERE age=23

  sid    name        username  age  gpa
  53650  Jon Edward  jon       23   2.4

to find just names and usernames
  SELECT name, username FROM students WHERE age=23

  name        username
  Jon Edward  jon
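The two selection queries can be tried out directly. The sketch below again uses Python's sqlite3 module as a convenient stand-in for a DBMS; the table contents follow the slide, and the column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (53666, "Kate Jones", "jones", 22, 3.4),
        (53688, "Alice Smith", "smith", 22, 3.8),
        (53650, "Jon Edward", "jon", 23, 2.4),
    ],
)

# WHERE picks rows; the SELECT list picks columns.
full_records = conn.execute("SELECT * FROM students WHERE age = 23").fetchall()
names_only = conn.execute(
    "SELECT name, username FROM students WHERE age = 23"
).fetchall()

print(full_records)
print(names_only)
```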

example SQL query - grouping

sid    name         username  age  gpa  active
53666  Kate Jones   jones     22   3.4  TRUE
53688  Alice Smith  smith     22   3.8  TRUE
53650  Jon Edward   jon       23   2.4  TRUE

what is the result of this query?
  SELECT age, avg(gpa) as avggrade FROM students WHERE active = TRUE GROUP BY age

  age  avggrade
  22   3.6
  23   2.4

what is the result of this query?
  SELECT age, avg(gpa) as avggrade FROM students WHERE active = TRUE GROUP BY age HAVING avg(gpa) > 3

  age  avggrade
  22   3.6

would this work?
  SELECT * FROM students GROUP BY age
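The grouping queries can be checked the same way. This is a sketch with Python's sqlite3 module, under two assumptions: the boolean column is stored as the integers 1/0 (SQLite has no separate boolean type), and averages are rounded before display to hide floating-point noise. The last query is the "would this work?" case: standard SQL requires every selected column to be either grouped or aggregated, and PostgreSQL rejects the statement for this table, whereas SQLite accepts it as an extension and returns one arbitrarily chosen row per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid INTEGER, name TEXT, username TEXT, age INTEGER, gpa REAL, active INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?, ?)",
    [
        (53666, "Kate Jones", "jones", 22, 3.4, 1),
        (53688, "Alice Smith", "smith", 22, 3.8, 1),
        (53650, "Jon Edward", "jon", 23, 2.4, 1),
    ],
)


def rounded(rows):
    """Round the average in each (age, average) row to two decimals."""
    return [(age, round(avg, 2)) for age, avg in rows]


# One output row per distinct age, with the average gpa of that group.
grouped = rounded(conn.execute(
    "SELECT age, avg(gpa) AS avggrade FROM students "
    "WHERE active = 1 GROUP BY age ORDER BY age"
).fetchall())

# HAVING filters whole groups after aggregation (WHERE filters rows before it).
filtered = rounded(conn.execute(
    "SELECT age, avg(gpa) AS avggrade FROM students "
    "WHERE active = 1 GROUP BY age HAVING avg(gpa) > 3 ORDER BY age"
).fetchall())

# Non-standard: SQLite tolerates ungrouped, unaggregated columns here.
one_row_per_group = conn.execute("SELECT * FROM students GROUP BY age").fetchall()

print(grouped)
print(filtered)
print(len(one_row_per_group))
```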

creating, altering, and destroying relations in SQL

  CREATE TABLE students (sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL);
  CREATE TABLE course (sid CHAR(20), points INTEGER, grade CHAR(2));
the type of each column is enforced by the DBMS

  ALTER TABLE students ADD COLUMN firstyear INTEGER;
every tuple in the current instance is extended with a null value in the new column

  DROP TABLE students;
destroy relation students (schema and instance)
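The effect of ALTER TABLE and DROP TABLE on an existing instance can be observed with a short experiment. This is a sketch using Python's sqlite3 module; note that SQLite is lenient about column types (it uses type affinity), so it does not illustrate the type enforcement that a system such as PostgreSQL performs. The inserted row is sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL)"
)
conn.execute("INSERT INTO students VALUES ('53666', 'Sam Jones', 'jones', 22, 3.4)")

# Adding a column extends every existing tuple with NULL (None in Python).
conn.execute("ALTER TABLE students ADD COLUMN firstyear INTEGER")
new_column_values = conn.execute("SELECT firstyear FROM students").fetchall()

# Dropping the table removes both the schema and the instance.
conn.execute("DROP TABLE students")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'students'"
).fetchall()

print(new_column_values)
print(remaining_tables)
```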

adding and deleting tuples
> what do the following statements do?
  INSERT INTO students (sid, name, username, age, gpa) VALUES (12345, 'Kate Doe', 'kate', 23, 4.0);
  DELETE FROM students WHERE name = 'Jane Smith';
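One way to answer the question is to run both statements and look at the table afterwards. The sketch below uses Python's sqlite3 module with parameter placeholders (`?`) instead of inline literals; the 'Jane Smith' row is hypothetical sample data, added so that the DELETE has something to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(sid CHAR(20), name CHAR(20), username CHAR(10), age INTEGER, gpa REAL)"
)

insert = "INSERT INTO students (sid, name, username, age, gpa) VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, ("12345", "Kate Doe", "kate", 23, 4.0))
conn.execute(insert, ("54321", "Jane Smith", "jane", 21, 3.1))  # hypothetical row

# DELETE removes every tuple that satisfies the WHERE condition.
deleted = conn.execute(
    "DELETE FROM students WHERE name = ?", ("Jane Smith",)
).rowcount
remaining_names = conn.execute("SELECT name FROM students").fetchall()

print(deleted, remaining_names)
```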

candidate keys
a set of fields is a candidate key (aka "key") for a relation if...
  1) distinct tuples cannot have the same values in all key fields, and
  2) this is not true for any proper subset of the key
if only part (1) from above is true... we have a superkey
possibly many candidate keys for a relation
  DBMS admin chooses one (1) of them as primary key
a key is an integrity constraint
  a condition that must be true for any instance of the database
other integrity constraints?
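The two conditions can be turned into a small check over a relation instance. This is a sketch with hypothetical helper names (`is_superkey`, `is_candidate_key`). One caveat matters: a key is a statement about all legal instances of a relation, so a check against one instance can only refute a proposed key, never confirm it.

```python
from itertools import combinations


def is_superkey(rows, fields):
    """True if no two rows of this instance agree on all the given fields."""
    seen = set()
    for row in rows:
        values = tuple(row[f] for f in fields)
        if values in seen:
            return False
        seen.add(values)
    return True


def is_candidate_key(rows, fields):
    """Condition (1) holds for fields, and for no proper subset of fields."""
    if not is_superkey(rows, fields):
        return False
    for size in range(len(fields)):  # sizes of proper subsets: 0 .. len-1
        for subset in combinations(fields, size):
            if is_superkey(rows, subset):
                return False
    return True


students = [
    {"sid": 53666, "name": "Sam Jones", "username": "jones", "age": 22},
    {"sid": 53688, "name": "Alice Smith", "username": "smith", "age": 22},
    {"sid": 53650, "name": "Jon Edwards", "username": "jon", "age": 23},
]

print(is_candidate_key(students, ("sid",)))         # True on this instance
print(is_superkey(students, ("sid", "name")))       # True: superkey ...
print(is_candidate_key(students, ("sid", "name")))  # False: ... but not minimal
print(is_superkey(students, ("age",)))              # False: age 22 appears twice
```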

candidate keys
in SQL, use
  PRIMARY KEY to specify the primary key
  UNIQUE to specify candidate keys
example
  relation enrolled holds information about student enrollment in courses
  compare the following create table statements

  CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid))

  CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))

use ICs carefully - they might forbid database instances that could arise in practice
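The difference between the two declarations shows up as soon as a few enrollments are inserted. The sketch below uses Python's sqlite3 module; the table names `enrolled_a` and `enrolled_b`, the course ids, and the helper `try_insert` are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrolled_a "
    "(sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid))"
)
conn.execute(
    "CREATE TABLE enrolled_b "
    "(sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))"
)


def try_insert(table, row):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


enrollments = [
    ("53666", "db", "A"),    # first enrollment
    ("53666", "algo", "B"),  # same student, second course
    ("53688", "db", "A"),    # another student, same course and grade
]

accepted_a = [try_insert("enrolled_a", row) for row in enrollments]
accepted_b = [try_insert("enrolled_b", row) for row in enrollments]

print(accepted_a)  # -> [True, True, True]
print(accepted_b)  # -> [True, False, False]
```

Under the second declaration a student can be enrolled in only one course, and no two students in the same course can receive the same grade, which is almost certainly stricter than intended.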

joins

students
sid    name         username  age
53666  Sam Jones    jones     22
53688  Alice Smith  smith     22
53650  Jon Edwards  jon       23

dbcourse
sid    points  grade
53666  92      A
53650  65      C

what does this compute?
  SELECT * FROM students S, dbcourse C WHERE S.sid = C.sid

S.sid  S.name       S.username  S.age  C.sid  C.points  C.grade
53666  Sam Jones    jones       22     53666  92        A
53650  Jon Edwards  jon         23     53650  65        C

joins
  SELECT * FROM students S, dbcourse C WHERE S.sid = C.sid
intuitively...
  take all pairs of records from S and C (the cross product S x C)
    expensive to materialize!
    S record #1, C record #1
    S record #1, C record #2
    S record #1, C record #3
    ...
    S record #2, C record #1
    S record #2, C record #2
    S record #2, C record #3
    ...
  keep only records that satisfy WHERE condition
    e.g., (S record #1, C record #2), (S record #2, C record #1)
  output join result S ⋈ (S.sid = C.sid) C
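The intuitive evaluation strategy (cross product, then filter) can be written down directly. This is a minimal sketch of the conceptual semantics only, not of how a DBMS actually executes a join; the function name `conceptual_join` is an assumption. `itertools.product` yields pairs lazily, so the cross product is never stored, but every pair is still examined.

```python
from itertools import product

students = [
    (53666, "Sam Jones", "jones", 22),
    (53688, "Alice Smith", "smith", 22),
    (53650, "Jon Edwards", "jon", 23),
]
dbcourse = [
    (53666, 92, "A"),
    (53650, 65, "C"),
]


def conceptual_join(left, right, condition):
    """Enumerate the cross product and keep the pairs that satisfy the condition."""
    return [s + c for s, c in product(left, right) if condition(s, c)]


# WHERE S.sid = C.sid: sid is the first field of both relations.
result = conceptual_join(students, dbcourse, lambda s, c: s[0] == c[0])
pairs_examined = len(students) * len(dbcourse)

for row in result:
    print(row)
print(pairs_examined)  # -> 6
```

The number of pairs examined grows as |S| x |C| regardless of how few records match, which is why real systems avoid evaluating joins this way.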

Credits
for some of these slides, we used material from
  Database Systems: The Complete Book, by Garcia-Molina, Ullman, Widom
  Database Management Systems, by Ramakrishnan and Gehrke