Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building

Similar documents
Mobile and Heterogeneous databases Distributed Database System Query Processing. A.R. Hurson Computer Science Missouri Science & Technology

CSC 742 Database Management Systems

Chapter 3. Algorithms for Query Processing and Optimization

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Chapter 19 Query Optimization

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

COSC344 Database Theory and Applications. σ a= c (P) S. Lecture 4 Relational algebra. π A, P X Q. COSC344 Lecture 4 1

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Query Processing & Optimization. CS 377: Database Systems

Chapter 12: Query Processing

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Relational Algebra. Ron McFadyen ACS

Ch 5 : Query Processing & Optimization

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 6 Outline. Unary Relational Operations: SELECT and

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

Advanced Databases. Lecture 1- Query Processing. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Fundamentals of Database Systems

CS 348 Introduction to Database Management Assignment 2

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

ECE 650 Systems Programming & Engineering. Spring 2018

Query Processing & Optimization

CSIT5300: Advanced Database Systems

Chapter 6 Part I The Relational Algebra and Calculus

Chapter 13: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing

SQL STRUCTURED QUERY LANGUAGE

Hash-Based Indexing 165

Relational Query Optimization. Highlights of System R Optimizer

Database System Concepts

Relational Algebra 1

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Introduction Alternative ways of evaluating a given query using

Advanced Database Systems

Query Processing and Optimization *

CS 377 Database Systems

Overview of Implementing Relational Operators and Query Evaluation

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Relational Algebra. Relational Algebra Overview. Relational Algebra Overview. Unary Relational Operations 8/19/2014. Relational Algebra Overview

2.2.2.Relational Database concept

Chapter 6 The Relational Algebra and Calculus

SQL: A COMMERCIAL DATABASE LANGUAGE. Data Change Statements,

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

COSC344 Database Theory and Applications. Lecture 5 SQL - Data Definition Language. COSC344 Lecture 5 1

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

7. Query Processing and Optimization

Relational Query Optimization

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

DBMS Query evaluation

Query Processing and Query Optimization. Prof Monika Shah

CS 222/122C Fall 2017, Final Exam. Sample solutions

Query Optimization. Shuigeng Zhou. December 9, 2009 School of Computer Science Fudan University

Chapter 8 SQL-99: Schema Definition, Basic Constraints, and Queries

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Principles of Data Management. Lecture #12 (Query Optimization I)

Review. Support for data retrieval at the physical level:

Chapter 6 5/2/2008. Chapter Outline. Database State for COMPANY. The Relational Algebra and Calculus

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Introduction to SQL. ECE 650 Systems Programming & Engineering Duke University, Spring 2018

CSE 344 APRIL 20 TH RDBMS INTERNALS

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

RELATIONAL DATA MODEL: Relational Algebra

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Database Applications (15-415)

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

Ian Kenny. November 28, 2017

CS 564 Final Exam Fall 2015 Answers

The Relational Algebra

Chapter 8. Joined Relations. Joined Relations. SQL-99: Schema Definition, Basic Constraints, and Queries

QUERY OPTIMIZATION [CH 15]

CSC 261/461 Database Systems Lecture 19

Chapter 14 Query Optimization

CSE 344 FEBRUARY 14 TH INDEXING

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Chapter 8: Relational Algebra

Ian Kenny. December 1, 2017

Slides by: Ms. Shree Jaswal

Introduction to Database Systems CSE 444

Query processing and optimization

Chapter 5 Relational Algebra. Nguyen Thi Ai Thao

CMSC 424 Database design Lecture 18 Query optimization. Mihai Pop

Database System Concepts

SQL-99: Schema Definition, Basic Constraints, and Queries. Create, drop, alter Features Added in SQL2 and SQL-99

SQL: A COMMERCIAL DATABASE LANGUAGE. Complex Constraints

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

CS5300 Database Systems

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

Query Processing: an Overview. Query Processing in a Nutshell. .. CSC 468 DBMS Implementation Alexander Dekhtyar.. QUERY. Parser.

Outline. Database Tuning. Join Strategies Running Example. Outline. Index Tuning. Nikolaus Augsten. Unit 6 WS 2014/2015

Database design process

Transcription:

External Sorting and Query Optimization A.R. Hurson 323 CS Building

External sorting When data to be sorted cannot fit into available main memory, external sorting algorithm must be applied. Naturally, in applying external sorting, the goal is to achieve a higher performance. This is achieved by reducing the number of I/O operations.

Two-Way Merge Sort A binary tree of K leaf nodes is used to sort a file K in the number of data blocks. At the lowest level of the tree, data blocks are read one at a time into the buffer and sorted. At the next level and higher, pairs of sorted data blocks are read in and merged. At the last level two pairs of K/2 sorted blocks are read in and merged together.

Two-Way Merge Sort Overall it takes 1+ log 2 K passes over the data file in order to sort a data file of K blocks. Hence, the overall cost is proportional to: K (1+ log 2 K )

Two-Way Merge Sort Output Main Memory Buffer Note that, we just need three buffers.

4,2,6 5,10,4 1,8,6 7,12,20 15,10,5 11,9,7 2,4,6 4,5,10 1,6,8 7,12,20 5,10,15 7,9,11 2,4,4 5,6,10 1,6,7 8,12,20 5,7,9 10,11,15 1,2,4 4,5,6 6,7,8 10,12,20 1,2,4 4,5,5 6,6,7 7,8,9 10,10,11 12,12,20 5,7,9 10,11,12

Extended Merge Sort Reducing the number of passes to improve the performance We can reduce number of passes when there is more than three buffers. Assume we have B buffers: In the first pass group of B blocks are read in and sorted internally. This generates K/B groups of B sorted blocks. In successive passes, B-1 pages are read in and merged, the remaining buffers are used as output buffer. Suppose we have 64 blocks: With three buffers we have 2*64 (log 2 64 + 1) = 2*64*7 I/O s With five buffers we have 2*64 (log 4 16 + 1) = 2*64*3 I/O s

Extended Merge Sort Output

Query Optimization In general, optimization is required in such a system if the system is expected to achieve acceptable performance. This is one of the strength of relational algebra that optimization can be done automatically, since relational expression are at a sufficiently high semantic level.

The overall goal of an optimization is to choose an efficient strategy for evaluation of a given relational expression. An optimizer might actually do better than a human programmer since:

An optimizer will have a wealth of information available to it that human programmers typically do not have. If the data base statistics changes drastically, then an optimizer may choose a different strategy. Optimizer can potentially considers several strategies for a given request. Optimizer is written by an expert.

Query Optimization A Simple Example Get names of suppliers who supply part P 2 : SELECT FROM WHERE DISTINCT Sname S, SP S.S# = SP.S# AND SP.P# = P 2 ; Suppose that the cardinality of S and SP are 100 and 10,000, respectively. Furthermore, assume 50 tuples in SP are for part P 2.

Without an optimizer, the system will; Computer generates Cartesian product of S and SP. This will generate a relation of size 1,000,000 tuples Too large to be kept in the main memory. Restricts results of previous step as specified by WHERE clause. This means reading 1,000,000 tuples of which 50 will be selected. Projects the result of previous step over Sname to produce the final result.

An Optimizer on the other hand: Restricts SP to just the tuples for part P 2. This will involve reading 10,000 tuples, but produces a relation with 50 tuples. Joins the result of the previous step with S relation over S#. This involves the retrieval of only 100 tuples and the generation of a relation with at most 50 tuples. Projects the result of the last operation over Sname.

If the number of I/O s is used as the performance measure, then it is clear that the second approach is far faster that the first approach. In the first case we read about 3,000,000 tuples and in the second case we read about 10,000 tuples. So a simple policy doing restriction and then join instead of doing product and then a restriction sounds a good heuristic.

Query Optimization There are two main techniques for query optimization: Heuristic rules Systematically estimate In this course, we will talk about the heuristic rules.

Optimization Process Cast the query into some internal representation Convert the query to some internal representation that is more suitable for machine manipulation, i. e., relational algebra. Π (Sname) (σ P# = P2 (S S.S#=SP.S# SP ) ) Now we can build a query tree very easily.

Optimization Process Result Project (Sname) Restrict (Sp.P# = P 2 Join (S.S# = SP.S#) S SP

Optimization Process Convert the result of the previous step into a canonical form during this phase, optimizer performs a number of optimization that are guaranteed to be good (i.e., heuristic rules) regardless of the actual data value and the access paths. For Example: (A Join B) WHERE restriction-on-b can be transformed into (A Join (B WHERE restriction-on-b))

(A Join B) WHERE restriction-on-a AND restriction-on-b can be transformed into (A WHERE restriction-on-a) Join (B WHERE restriction-on-b)) General rule: It is a good idea to perform the restriction before the join, because: It reduces the size of the input to the join operation, It reduces the size of the output from the join.

WHERE p OR (q AND r) can be converted into WHERE (p or q) AND (p OR r) General rule: Transform restriction condition into an equivalent condition in conjunctive normal form, because: A condition that is in conjunctive normal form evaluates to true only if every conjunct evaluates to true. Consequently, it evaluates to false if any conjunct evaluates to false. This is specially useful in the domain of parallel systems where conjuncts can be evaluates in parallel.

(A WHERE restriction-1) WHERE restriction-2 can be converted into A WHERE restriction-1 AND restriction-2 General rule: A sequence of restrictions can be combined into a single restriction. (A [projection-1]) [projection-2] can be converted into A [projection-2] General rule: A sequence of projections, all but the last one can be ignored.

(A [projection]) WHERE restriction can be converted to (A WHERE restriction) [projection] General rule: A restriction and projection can be converted into a projection and restriction.

Finally, consider the following query: Get the supplier numbers who supply at least one part; (SP Join P) [S#] However, we know that P# is the foreign key in SP, therefore the above query is semantically equivalent to: SP [S#]

Optimization Process Generate query plans The final stage of optimization involve the construction of a set of candidate query plans and the choice of the best of these plans. Choosing the cheapest plan, naturally, requires a method for assigning a cost to any given plan This cost formula should estimate the number of disk accesses, CPU utilization and execution time, space utilization,.

Query Optimization So in short: after scanning and parsing, the query will be translated into an equivalent representation, this internal representation is in the form of a query tree or query graph, an execution strategy will be chosen. The execution strategy is a plan for accessing the data, executing the query, and storing the intermediate results.

Query Optimization a commercial system In this system, we are dealing with SQL statements in the form of SELECT-FROM-WHERE : An order of execution of query blocks are decided (nested query). An attempt is made to minimize the total cost by choosing the cheapest implementation for each individual block. For a given query block, two cases is considered;

Query Optimization a commercial system For a block that involves just a restriction and/or projection of a single relation: statistical information from the system catalog together with proper formulas to estimate size of the intermediate results and cost of low-level operations are used to choose an execution strategy. The statistics maintained in the system catalog are; Number of tuples in each relation, Number of pages occupied by each relation, Percentage of pages occupied by each relation with respect to all pages in the relevant portion of the data base, Number of distinct data values for each index, Number of pages occupied by each index.

Query Optimization a commercial system For a block that involves two or more relations to be joined together with local restrictions ans/or projections, the optimizer: treats each individual relation as in the case above, and decides on the sequence for performing joins.

Query Optimization a commercial system Given a set of relations to be joined, the optimizer always chooses a pair to be joined first, then a third to be joined the the result of joining the first two, and so on: A Join B Join C Join D is converted into ((D Join C) Join A) Join B Either the nested loop method or sort/merge method is used to perform the join.

Query Optimization An Example SELECT FROM WHERE AND AND Pnumber, Dnum, Lname, Bdate, Address Project, Department, Employee Dnum = Dnumber MGRSSN = SSN Plocation = California ;

Query Optimization An Example The above query based on the schemas of the relations can be translated into: Π Pnumber,Dnum,Lname,Address,Bdate (((σ Plocation= california (Project)) Dnum=Dnumber (Department ) ) MNGSSN=SSN (Employee))

Query Optimization An Example Π Pnumber,Dnum,Lname,Address,Bdate MNGSSN=SSN Dnum=Dnumber Employee σ Plocation= california Department Project