CSIT5300: Advanced Database Systems

Similar documents
CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems

Answers to the review questions can be found in the listed sections. What are the components of a workload description? (Section 20.1.

CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems

The Relational Model Constraints and SQL DDL

CSIT5300: Advanced Database Systems

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Natural-Join Operation

CSIT5300: Advanced Database Systems

CS 348 Introduction to Database Management Assignment 2

CS3DB3/SE4DB3/SE6M03 TUTORIAL

VIEW OTHER QUESTION PAPERS

Advance Database Management System

Relational Model Introduction

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

GUJARAT TECHNOLOGICAL UNIVERSITY

Physical Design. Elena Baralis, Silvia Chiusano Politecnico di Torino. Phases of database design D B M G. Database Management Systems. Pag.

CSCB20 Week 2. Introduction to Database and Web Application Programming. Anna Bretscher Winter 2017

Database Design. Goal: specification of database schema Methodology:

CSIT5300: Advanced Database Systems

CS222P Fall 2017, Final Exam

Modern Database Systems Lecture 1

192 Chapter 14. TotalCost=3 (1, , 000) = 6, 000

Data on External Storage

Based on the following Table(s), Write down the queries as indicated: 1. Write an SQL query to insert a new row in table Dept with values: 4, Prog, MO

Query Processing & Optimization

Entity-Relationship Models: Good Design and Constraints

CMPUT 391 Database Management Systems. An Overview of Query Processing. Textbook: Chapter 11 (first edition: Chapter 14)

Database Systems CSE Comprehensive Exam Spring 2005

Database Management Systems,

Database Management Systems

Query Processing & Optimization. CS 377: Database Systems

Database Management Systems (COP 5725) Homework 3

CS425 Midterm Exam Summer C 2012

Database Applications (15-415)

The Relational Model. Chapter 3. Comp 521 Files and Databases Fall

Homework 1: RA, SQL and B+-Trees (due Feb 7, 2017, 9:30am, in class hard-copy please)

CS34800 Information Systems

Spring 2013 CS 122C & CS 222 Midterm Exam (and Comprehensive Exam, Part I) (Max. Points: 100)

Based on handout: Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Goal. Faloutsos & Pavlo CMU SCS /615

Overview of Storage and Indexing

TotalCost = 3 (1, , 000) = 6, 000

The Relational Model 2. Week 3

MIS Database Systems Entity-Relationship Model.

1/24/2012. Chapter 7 Outline. Chapter 7 Outline (cont d.) CS 440: Database Management Systems

Relational Algebra for sets Introduction to relational algebra for bags

COMP 430 Intro. to Database Systems

The SQL data-definition language (DDL) allows defining :

Relational Algebra. Relational Query Languages

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

CS3DB3/SE4DB3/SE6M03 TUTORIAL

INDEXES MICHAEL LIUT DEPARTMENT OF COMPUTING AND SOFTWARE MCMASTER UNIVERSITY

COMP Instructor: Dimitris Papadias WWW page:

LAB 2 Notes. Conceptual Design ER. Logical DB Design (relational) Schema Refinement. Physical DD

Queen s University Faculty of Arts and Science School of Computing CISC 432* / 836* Advanced Database Systems

The Entity-Relationship (ER) Model 2

CSE 344 Midterm. Wednesday, February 19, 2014, 14:30-15:20. Question Points Score Total: 100

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

Chapter 3: Introduction to SQL

Outline. Query Types

Question 1 (a) 10 marks

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Database Systems: An Application-Oriented Approach (Complete Version) Practice Problems and Solutions for Students

IMPORTANT: Circle the last two letters of your class account:

CS127 Homework #3. Due: October 11th, :59 P.M. Consider the following set of functional dependencies, F, for the schema R(A, B, C, D, E)

The Relational Model. Chapter 3. Comp 521 Files and Databases Fall

CS430 Final March 14, 2005

CS 222/122C Fall 2017, Final Exam. Sample solutions

Overview of Storage and Indexing

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Remember. 376a. Database Design. Also. B + tree reminders. Algorithms for B + trees. Remember

Entity-Relationship Modelling. Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

King Fahd University of Petroleum and Minerals

Data Storage and Query Answering. Indexing and Hashing (5)

The Relational Model

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

Section 1: Redundancy Anomalies [10 points]

LECTURE 3: ENTITY-RELATIONSHIP MODELING

Query Processing: The Basics. External Sorting

CSCB20. Introduction to Database and Web Application Programming. Anna Bretscher Winter 2017

CS425 Fall 2016 Boris Glavic Chapter 2: Intro to Relational Model

Database Management Systems MIT Introduction By S. Sabraz Nawaz

Review of Storage and Indexing

COSC 304 Introduction to Database Systems. Entity-Relationship Modeling

Lecture3: Data Modeling Using the Entity-Relationship Model.

CSE-6490B Assignment #1

COMP 430 Intro. to Database Systems. Indexing

Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

IMPORTANT: Circle the last two letters of your class account:

CS348: INTRODUCTION TO DATABASE MANAGEMENT (Winter, 2011) FINAL EXAMINATION

Relational Query Languages. Relational Algebra and SQL. What is an Algebra? Select Operator. The Role of Relational Algebra in a DBMS

Homework 1: RA, SQL and B+-Trees (due September 24 th, 2014, 2:30pm, in class hard-copy please)

COMP3311 Database Systems

Chapter 3: Introduction to SQL. Chapter 3: Introduction to SQL

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #2: The Relational Model and Relational Algebra

Database Management Systems

Relational Algebra and SQL. Relational Query Languages

Transcription:

CSIT5300: Advanced Database Systems E11: Exercises Physical Database Design Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR, China kwtleung@cse.ust.hk CSIT5300

Exercise #1 Physical DB Design without Cost Estimation Consider the following relational schema for a university database: Prof (pid, pname, office, age, gender, specialty, dept-did) Dept (did, dname, budget, num-majors, chair-pid) Assume the 5 following queries (of equal importance) are the only ones in the workload, and specify indexes for their efficient processing: 1. List the names, ages, and offices of professors of a user-specified gender (male or female) that have a user-specified research specialty (e.g., databases). Assume that the university has a diverse set of faculty members, making it very uncommon for more than a few professors to have the same research specialty. 2. List all the department information for departments with professors in a userspecified age range. 3. List the department id, department name, and chairperson name for departments with a user-specified number of majors. 4. List the lowest budget for a department in the university. 5. List all the information about professors who are department chairpersons. kwtleung@cse.ust.hk CSIT5300 2

Exercise #1 Physical DB Design without Cost Estimation (cont.) Prof (pid, pname, office, age, gender, specialty, dept-did) Dept (did, dname, budget, num-majors, chair-pid) 1. List the names, ages, and offices of professors of a user-specified sex (male or female) that have a user-specified research specialty (e.g., database query processing). Assume that the university has a diverse set of faculty members, making it very uncommon for more than a few professors to have the same research specialty. Use a hash index for specialty or <specialty, gender> on Prof. The index does not have to be clustered because the query is very selective (i.e., it is expected to retrieve very few records). kwtleung@cse.ust.hk CSIT5300 3

Exercise #1 Physical DB Design without Cost Estimation (cont.) Prof (pid, pname, office, age, gender, specialty, dept-did) Dept (did, dname, budget, num-majors, chair-pid) 2. List all the department information for departments with professors in a user-specified age range. This is a join query since it specifies a selection condition on professors and requires information about their departments. We need a B + -Tree index for age on Prof for finding the professors. The B + - Tree should be clustered because there may be many professors in the specified range. Then for each professor we can find his/her dept using a hash index for department id on Dept. The index does not need to be clustered since each professor belongs to exactly one department. kwtleung@cse.ust.hk CSIT5300 4

Exercise #1 Physical DB Design without Cost Estimation (cont.) Prof (pid, pname, office, age, gender, specialty, dept-did) Dept (did, dname, budget, num-majors, chair-pid) 3. List the department id, department name, and chairperson name for departments with a user-specified number of majors. This is also a join query since it specifies a selection condition on departments (# of majors) and requires information about their chairpersons. Use a clustered B + -Tree on the number of majors so that all Depts with the specified number of majors are stored together. Alternatively, since the selection condition is equality we can use hash file organization of Dept on major (recall that hash indexes are non-clustered). First index is better. After we retrieve qualifying departments, we find information about their chairperson, using a hash index for pid on Prof. kwtleung@cse.ust.hk CSIT5300 5

Exercise #1 Physical DB Design without Cost Estimation (cont.) Prof (pid, pname, office, age, gender, specialty, dept-did) Dept (did, dname, budget, num-majors, chair-pid) 4. List the lowest budget for a department in the university. The query can be answered by index-only scan using a B + -Tree for budget on Dept. The index does not have to be clustered since it covers the query. 5. List all the information about professors who are department chairpersons. We have to scan the departments table and find information about their chairperson using the hash index for pid on Prof (that we created at query 3). kwtleung@cse.ust.hk CSIT5300 6

Exercise #2 Physical DB Design with Cost Estimation Keys are underlined and foreign keys are in italics. TABLES: # records #pages DEPARTMENT (DEPT_ID, NAME, HEAD_ID) 100 10 PROFESSOR (PROF_ID, PNAME, SALARY, DEPT_ID) 1,000 100 STUDENT (STU_ID, NAME, DEPT_ID) 10,000 1,000 COURSE (COURSE_ID, NAME, DEPT_ID) 1,000 100 OFFERING (OFF_ID, COURSE_ID, YEAR, SEMESTER, PROF_ID) 10,000 1,000 ENROLL (STU_ID,OFF_ID, GRADE) 100,000 10,000 There are 500 different professor names, 1,000 different student names and 100 different department names. On the average, each student is enrolled in 10 offerings and each offering has 10 students. Each department offers 10 courses. Assume that professor salaries are uniformly distributed in the range [10,000-110,000). Note: For clustered indexes assume B + -Trees with height 2. For hash indexes assume that there are no overflow buckets (i.e., the cost of finding an entry in the index is 1). For indexes on a non-candidate key, assume that you may have multiple entries with the same search key entry (e.g., if you have an index on PROFESSOR.salary there may be many entries with the same salary value pointing to different records). We want to optimize a set of queries. Propose indexes and estimate their cost. kwtleung@cse.ust.hk CSIT5300 7

Exercise #2 Physical DB Design with Cost Estimation (cont.) PROFESSOR (PROF_ID, PNAME, SALARY, DEPT_ID): 1,000 recs Q1: Display the names of all professors whose salary is in the range [20,000-25,000). Answer: Build a clustered index on PROFESSOR.salary. The query is expected to retrieve 50 (5%) professors, i.e., 5 pages. Total Cost: 2+5 Alternative answer: We can use a multi-attribute index on PROFESSOR<salary, name> and do index-only scan. Since each index entry contains only 2 of the 4 PROFESSOR attributes, we can assume that the index leaf level contains 50 pages (half of the file). The cost is 2 pages to find the first entry satisfying the input condition, and then 2 more pages for reading the remaining entries (recall that the query will retrieve 5% of professors, so it has to read 5% of the index leaf nodes, i.e., 3 pages). Total cost: 2+2 kwtleung@cse.ust.hk CSIT5300 8

Exercise #2 Physical DB Design with Cost Estimation (cont.) PROFESSOR (PROF_ID, PNAME, SALARY, DEPT_ID): 1,000 recs DEPARTMENT (DEPT_ID, NAME, HEAD_ID): 100 recs Q2: Given a professor name, display the name of the department where the professor belongs. Answer: Build a hash index on PROFESSOR.pname. Use the index to find the two professors with the input name (cost 1 for locating the pname entry in the index + 2 for following the two pointers in the file). For each dept_id of these professors, use a hash index on DEPARTMENT.dept_id to retrieve the department record (and name), with cost 2. Total cost: 3+2*2=7 Alternative answer: Instead of a hash index on DEPARTMENT.dept_id, use a hash file organization on DEPARTMENT, where the records are partitioned in 10 buckets (pages) based on their dept_id. In this case, we can retrieve each dept record with cost 1. Total cost: 3+2=5 kwtleung@cse.ust.hk CSIT5300 9

Exercise #2 Physical DB Design with Cost Estimation (cont.) DEPARTMENT (DEPT_ID, NAME, HEAD_ID): 100 recs COURSE (COURSE_ID, NAME, DEPT_ID): 1,000 recs Q3: Given a department name, display all the course names offered by the department. Answer: Use a hash index on DEPARTMENT.dept_name to retrieve the id of the department (cost 1+1). Use a clustering index on COURSE.dept_id to retrieve the names of all courses offered by the department (cost 2+1). Total cost 2+3=5. kwtleung@cse.ust.hk CSIT5300 10

Exercise #2 Physical DB Design with Cost Estimation (cont.) STUDENT (STU_ID, NAME, DEPT_ID): 10,000 recs, 1,000 names DEPARTMENT (DEPT_ID, NAME, HEAD_ID):100 recs Q4: Given a student name, display the name of the department where the student belongs. Answer: Use a clustering index on STUDENT.stu_name to retrieve the 10 students with the given name (cost 2+1). Find the dept_name where the student belongs using a hash index on DEPARTMENT.dept_id (cost 1+1 for each student). Total cost 3+2*10=23. Alternative answer: Similar to query 2, instead of a hash index on DEPARTMENT.dept_id, use a hash file organization on DEPARTMENT, where the records are partitioned in 10 buckets (pages) based on their dept_id. Total cost 3+10=13. kwtleung@cse.ust.hk CSIT5300 11

Exercise #2 Physical DB Design with Cost Estimation (cont.) ENROLL (STU_ID,OFF_ID, GRADE): 100,000 recs OFFERING (OFF_ID, COURSE_ID, YEAR, SEMESTER, PROF_ID): 10,000 recs Q5: Given a student id, display the id of the courses taken by the student and his/her average grade. Answer (using a clustering index on ENROLL.stu id): There are 100,000 records in ENROLL. Since there are 10,000 students, each student is enrolled in 10 offerings on the average. Use a clustering index on ENROLL.stu_id to retrieve the 10 offerings taken by the student with cost 2+1 (assuming that all the 10 offerings are in the same page). We can immediately compute the average of the student (based on these 10 offerings). For each of the offerings, we retrieve the corresponding course_id using a hash index on OFFERING.off_id (cost 1+1 for each offering). Total cost 3+10*2=23. Alternative answer (using a hash index on ENROLL. stu id): If we do not use a clustering index on ENROLL.stu_id (because we will need another clustering index for query Q6), we use hash index, so that we retrieve the 10 offerings taken by the student with cost 11 (we need 1 page access for retrieving the index bucket and 10 accesses for following the pointers). The rest of the solution remains the same and the total cost is: 11+20=31. kwtleung@cse.ust.hk CSIT5300 12

Exercise #2 Physical DB Design with Cost Estimation (cont.) ENROLL (STU_ID,OFF_ID, GRADE): 100,000 recs OFFERING (OFF_ID, COURSE_ID, YEAR, SEMESTER, PROF_ID): 10,000 recs Q6: Given a course id, display the average grade for each offering of the course. Each course has on the average 10 offerings. For example, if the input is id COMP231, the query will display the average grade group-by 10 offerings of COMP231. Each offering has on the average 10 students. Answer (using a hash index on ENROLL.offering id): Use a clustering index on OFFERING.course id so that we can retrieve the 10 offerings of the course with cost 2+1 (assuming that all 10 offerings are in the same page). If we have already assumed a clustering index ENROLL.studentid (for Q5), we cannot have another clustering index on the same file. Instead, we use a hash index on ENROLL.offering id, to retrieve the records of 10 students that took the offering and compute the average grade for the offering (for each offering we need 1 page access in the index and 10 in the file, i.e., the cost is 11). We repeat the same process for each of the 10 offerings. Total cost 3+ 10(11)=113. Alternative answer (using a clustering index on ENROLL.offering id): First we retrieve the 10 offerings of the course using a clustering index on OFFERING.course id (with cost 3 as explained in the first solution). Then, we use a clustering index on ENROLL.offering id so that all the records of students who took an offering are stored together and retrieved with cost 2+1. Therefore, the cost of this process for 10 offerings is 30 and the total cost is 33. kwtleung@cse.ust.hk CSIT5300 13