CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems
E10: Exercises on Query Processing

Dr. Kenneth LEUNG
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Hong Kong SAR, China

kwtleung@cse.ust.hk CSIT5300

Exercise #1 (Projection + External Sorting)

Consider a file of 10,000 students sorted on student id. Each student has 5 attributes, each 20 bytes. The page size is 1000 bytes. What is the cost of processing the following query using external sorting?

SELECT DISTINCT name FROM student

Assume that the available main memory is 100 pages and that there are 5,000 different student names. There is no index.

Each record is 5 * 20 = 100 bytes, so I can store 10 records/page and the file contains 1,000 pages.

At pass 0, I read 100 pages for each sorted run, but I write back only 20 because I keep only the name (the other four attributes are not needed). Thus I have 10 sorted runs, each with 20 pages. Cost of pass 0: 1000 + 200 = 1200.

At pass 1, I read and merge the 200 pages. Cost of pass 1: 200.

Total cost: 1200 + 200 = 1400 (without considering the cost of writing the final output). The output is only 100 pages because each name is assumed to appear twice.

[Diagram: in pass 0, pages 1 to 1,000 of the file are turned into sorted runs 1 to 10 of 20 pages each; in pass 1, runs 1 to 10 are merged into 100 pages of output.]
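The arithmetic above can be checked with a small sketch. The function name and parameters are illustrative and not part of the original slides; the sketch assumes equal-length attributes and that each merge pass can combine up to (buffer pages - 1) runs.

```python
import math

def sort_projection_cost(file_pages, buffer_pages, attrs_kept, attrs_total):
    """Return (pass-0 cost, merge cost, total) in page I/Os, excluding the final write."""
    # Pages left after projecting away the unneeded, equal-length attributes.
    projected_pages = math.ceil(file_pages * attrs_kept / attrs_total)
    # Pass 0 fills the buffer once per run: read everything, write the projection.
    runs = math.ceil(file_pages / buffer_pages)
    pass0 = file_pages + projected_pages
    # Assumption: each merge pass combines up to (buffer_pages - 1) runs into one.
    merge_passes = 0
    while runs > 1:
        runs = math.ceil(runs / (buffer_pages - 1))
        merge_passes += 1
    # Every merge pass reads the projected pages; all but the last also write them.
    merge_cost = projected_pages * max(0, 2 * merge_passes - 1)
    return pass0, merge_cost, pass0 + merge_cost

# Exercise #1: 1,000 pages, 100 buffer pages, keep 1 of 5 attributes.
print(sort_projection_cost(1000, 100, 1, 5))
```

With the numbers of Exercise #1 this gives 1200 for pass 0, 200 for the merge and 1400 in total, matching the answer above.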

Exercise #2 (Projection + Hashing)

Consider a file of 10,000 students sorted on student id. Each student has 5 attributes, each 20 bytes. The page size is 1000 bytes. What is the cost of processing the following query using hashing and 20 buckets?

SELECT DISTINCT name FROM student

Assume that the available main memory is 100 pages and that there are 5,000 different student names. There is no index.

I read the file page by page and assign each record to a bucket. For each record I keep only the name attribute, i.e., I read 1000 pages but I write back only 200. The cost of partitioning is 1000 + 200 = 1200.

The average bucket size is 200/20 = 10 pages. Thus, at the next step I load each bucket in memory, build the in-memory hash table and perform duplicate elimination within the bucket. This step reads the 200 pages once.

Total cost: 1200 + 200 = 1400 (without considering the cost of writing the final output).

Optimization: Given the large memory that I have, I can keep 8 full buckets in memory (i.e., 80 pages) and use 12 pages as output buffers for the remaining 12 buckets. In this way, I avoid writing and reading again the 8 buckets, i.e., total cost = 1400 - 80*2 = 1240.
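A minimal sketch of this cost, with and without the optimization. The function and its parameters are hypothetical; it assumes evenly filled buckets and, as an extra assumption not stated in the slides, reserves one page as an input buffer when checking whether the resident buckets fit in memory.

```python
def hash_projection_cost(file_pages, attrs_kept, attrs_total, buckets,
                         memory_pages, resident_buckets=0):
    """Page I/Os for hash-based duplicate elimination, excluding the final write."""
    projected_pages = file_pages * attrs_kept // attrs_total
    bucket_pages = -(-projected_pages // buckets)  # ceiling division
    spilled_buckets = buckets - resident_buckets
    # Assumption: resident buckets + one output page per spilled bucket + one input page.
    needed = resident_buckets * bucket_pages + spilled_buckets + 1
    if needed > memory_pages:
        raise ValueError(f"plan needs {needed} pages, only {memory_pages} available")
    partition_cost = file_pages + projected_pages  # read the file, write the buckets
    dedup_cost = projected_pages                   # read each bucket back once
    # A resident bucket is neither written during partitioning nor read back.
    saved = 2 * resident_buckets * bucket_pages
    return partition_cost + dedup_cost - saved

print(hash_projection_cost(1000, 1, 5, 20, 100))                      # no optimization
print(hash_projection_cost(1000, 1, 5, 20, 100, resident_buckets=8))  # 8 resident buckets
```

Under these assumptions, asking for 9 resident buckets raises an error, which is consistent with 8 being the number chosen in the answer.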

Exercise #3

Consider two files:

Student (s_id, name, dept_id, address)
Enroll (class_id, s_id, semester, grade)

The Student file contains 10,000 records in 1,000 pages and the Enroll file contains 50,000 records in 5,000 pages. There are 10 different departments and 25 different classes. All attributes have the same length. Each index, wherever available, is a tree with 3 levels. For non-clustering indexes, each pointer is assumed to lead to a different page.

Our goal is to process the query:

SELECT S.name FROM Student S, Enroll E WHERE S.dept_id="COMP" AND E.class_id="231" AND S.s_id=E.s_id

Some useful statistics:
- A student enrolls on average in 5 classes.
- A department contains on average 1,000 students.
- Each class contains an average of 2,000 enrolment records.

Consider that the Student file contains a clustering index on dept_id, and the Enroll file contains a non-clustering index on s_id. Describe a fully pipelined plan (i.e., do not materialize anything) for processing the query by using Student as the outer relation and taking advantage of both indexes.

Plan, from the leaves to the root:
1. σ dept_id="COMP" on Student: use the index on dept_id.
2. JOIN with Enroll: for each selected student, use the index on s_id to find the classes taken by the student, and discard records where the class is not 231.
3. π name.
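The pipelined plan can be illustrated with Python generators, where each result tuple flows to the output without any intermediate result being stored. This is only a toy sketch: the dictionaries stand in for the two indexes, and the sample records and helper names are made up for illustration.

```python
from collections import defaultdict

def build_index(rows, key):
    """Toy stand-in for an index: maps a key value to the matching records."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def pipelined_plan(student_by_dept, enroll_by_sid, dept, class_id):
    """Yield names one at a time: selection, index join, filter and projection."""
    for student in student_by_dept.get(dept, []):            # index on dept_id
        for enroll in enroll_by_sid.get(student["s_id"], []):  # index on s_id
            if enroll["class_id"] == class_id:               # discard other classes
                yield student["name"]                        # projection on name

# Hypothetical sample data, not taken from the exercise.
students = [
    {"s_id": 1, "name": "Ann", "dept_id": "COMP"},
    {"s_id": 2, "name": "Bob", "dept_id": "MATH"},
    {"s_id": 3, "name": "Cat", "dept_id": "COMP"},
]
enrolls = [
    {"class_id": "231", "s_id": 1},
    {"class_id": "101", "s_id": 1},
    {"class_id": "231", "s_id": 2},
    {"class_id": "101", "s_id": 3},
]

names = list(pipelined_plan(build_index(students, "dept_id"),
                            build_index(enrolls, "s_id"), "COMP", "231"))
print(names)
```

Only Ann is both in COMP and enrolled in class 231, so she is the only name produced.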

Estimate the approximate cost of your plan.

Selection needs to read 3 + 100 pages of Student (3 levels of the clustering index on dept_id, plus 100 pages holding the 1,000 COMP students at 10 records/page) and will return 1,000 student ids and names. For each of the 1,000 students we use the index on s_id (for Enroll) to find the corresponding records: cost = 1,000 * (3 + 5) = 8,000 (for each s_id we traverse 3 index levels and retrieve 5 enrolment records, each on a different page).

Total cost: 103 + 8,000 = 8,103 pages. This cost does not include an extra level of indirection for the non-clustering s_id index; if we also include this, the cost becomes 9,103.

For this and the following questions, assume that there are no indexes. Using the above file sizes, estimate the cost of block nested-loops with a main memory buffer of 102 pages (for each case explain briefly).

With 102 pages, 100 pages hold a block of the outer relation, leaving one page for the inner relation and one for the output. The inner relation is scanned once per block of the outer.

Student as the outer relation (10 blocks): 1,000 + 5,000 * 10 = 51,000
Enroll as the outer relation (50 blocks): 5,000 + 1,000 * 50 = 55,000

How can you optimize assuming that you have the clustering index on Student.dept_id of the previous question?
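Both estimates follow simple formulas, sketched below. The function names are illustrative; the block nested-loops version assumes that two buffer pages are reserved (one for the inner relation, one for the output).

```python
import math

def index_nested_loops_cost(index_levels, outer_pages, outer_tuples, matches_per_tuple):
    """Selection through a clustering index, then one index probe per outer tuple."""
    selection = index_levels + outer_pages
    # Each probe walks the index and fetches every match from a different page.
    probes = outer_tuples * (index_levels + matches_per_tuple)
    return selection + probes

def block_nested_loops_cost(outer_pages, inner_pages, buffer_pages):
    """Read the outer once and scan the inner once per block of the outer."""
    block_pages = buffer_pages - 2  # assumption: 1 page for the inner, 1 for the output
    blocks = math.ceil(outer_pages / block_pages)
    return outer_pages + blocks * inner_pages

print(index_nested_loops_cost(3, 100, 1000, 5))   # pipelined plan
print(block_nested_loops_cost(1000, 5000, 102))   # Student as the outer relation
print(block_nested_loops_cost(5000, 1000, 102))   # Enroll as the outer relation
```

The three calls reproduce the figures 8,103, 51,000 and 55,000 from the answers above.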

Describe sort-merge join and its cost, assuming that neither file is sorted on s_id, and that you have a buffer of 100 pages. Minimize the total cost (hint: discard attributes and records that are not needed).

For sorting Student, pass 0 reads 100 pages at a time, but writes back only the ids and names of students in COMP, i.e., 1/10 * 1/2 of 100 = 5 pages (i.e., 10% of the records and for each record only half the attributes). Thus, the result of pass 0 contains 10 sorted runs with total size 50 pages. Cost of pass 0 for Student = 1000 + 50 = 1050.

For sorting Enroll, pass 0 reads 100 pages at a time, but writes back only the s_ids of students who took class 231, i.e., 1/25 * 1/4 of 100 = 1 page (i.e., 4% of the records and for each record only one attribute). Thus, the result of pass 0 for Enroll contains 50 sorted runs with a total size of 50 pages. Cost of pass 0 for Enroll = 5000 + 50 = 5050.

We then allocate 60 pages as input buffers, one for each of the 10 + 50 sorted runs of both files, and now we can merge directly.

Total cost: 1050 (pass 0 of Student) + 5050 (pass 0 of Enroll) + 100 (reading all sorted runs during the merge) = 6200. See diagram in the next slide.
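The same computation as a sketch, using exact fractions so that the page counts stay integral. The helper names are made up for illustration; the sketch assumes that filtering and projection happen during pass 0 and that all runs of both files can be merged in a single pass with one input page per run and one output page.

```python
from fractions import Fraction
import math

def pass0(file_pages, buffer_pages, record_fraction, attribute_fraction):
    """Return (number of runs, pages written, pass-0 cost) for one relation."""
    runs = math.ceil(file_pages / buffer_pages)
    # Each buffer load shrinks to the surviving records and attributes before writing.
    run_pages = math.ceil(buffer_pages * record_fraction * attribute_fraction)
    written = runs * run_pages
    return runs, written, file_pages + written

def sort_merge_join_cost(buffer_pages, left, right):
    """left/right are (file_pages, record_fraction, attribute_fraction) tuples."""
    runs_l, written_l, cost_l = pass0(left[0], buffer_pages, left[1], left[2])
    runs_r, written_r, cost_r = pass0(right[0], buffer_pages, right[1], right[2])
    # Assumption: one input page per run plus one output page must fit in memory.
    if runs_l + runs_r + 1 > buffer_pages:
        raise ValueError("too many runs to merge both relations in one pass")
    merge_cost = written_l + written_r  # every run page is read exactly once
    return cost_l, cost_r, merge_cost, cost_l + cost_r + merge_cost

student = (1000, Fraction(1, 10), Fraction(1, 2))  # COMP only; s_id and name
enroll = (5000, Fraction(1, 25), Fraction(1, 4))   # class 231 only; s_id
print(sort_merge_join_cost(100, student, enroll))
```

For the exercise this yields 1050, 5050 and 100, for a total of 6200.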

[Diagram: sort-merge join. PASS 0: the 1,000 pages of relation STUDENT are reduced to attributes s_id, name for records of students in COMP, giving sorted runs 1 to 10 of 5 pages each; the 5,000 pages of relation ENROLL are reduced to attribute s_id for records of class 231, giving sorted runs 1 to 50 of 1 page each. PASS 1 (MERGE JOIN PHASE): runs 1 to 10 of STUDENT and runs 1 to 50 of ENROLL are merged to produce the JOIN RESULT.]

Describe hybrid hash-join and its cost, using a buffer of 101 pages and Student as the build input. Minimize the total cost (hint: discard attributes and records that are not needed) assuming that records are evenly distributed in the buckets.

Since I have 101 pages, I can use 100 buckets to partition. After discarding the records of students not in COMP and the extra attributes, the size of the Student file is 50 pages (see previous answer). Therefore, each bucket is less than 1 page and I can keep ALL buckets of Student in memory.

Then, I read the Enroll file page by page. For each record, I directly find the matching records in the corresponding bucket of Student and output the result.

Total cost: 1,000 + 5,000 = 6,000 (just for reading the files).
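A short sketch of why no partition has to be written to disk here. The function is hypothetical and covers only the case discussed in this answer, where the reduced build input fits entirely in memory; it assumes one page is kept for reading the input and the rest hold the buckets.

```python
from fractions import Fraction
import math

def hash_join_cost_if_build_fits(build_pages, probe_pages, buffer_pages,
                                 record_fraction, attribute_fraction):
    """Return (reduced build pages, pages per bucket, cost) when nothing spills."""
    buckets = buffer_pages - 1  # assumption: one page is reserved for input
    reduced = math.ceil(build_pages * record_fraction * attribute_fraction)
    if reduced > buckets:
        raise ValueError("build input does not fit in memory; some buckets must spill")
    pages_per_bucket = Fraction(reduced, buckets)
    # Both files are read exactly once and nothing is written back.
    return reduced, pages_per_bucket, build_pages + probe_pages

print(hash_join_cost_if_build_fits(1000, 5000, 101, Fraction(1, 10), Fraction(1, 2)))
```

With the exercise's numbers the reduced Student input is 50 pages, each of the 100 buckets holds half a page on average, and the cost is 6,000.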