376a. Database Design
Dept. of Computer Science
Vassar College
http://www.cs.vassar.edu/~cs376

Class 16
Query optimization

What happens
- The database is given a query.
- The query is scanned: the scanner creates a list of the tokens recognized by the language.
- The query is parsed: the tokens are checked for syntactic correctness.
- The query is validated: the attributes it names exist and the query is semantically valid.

Query conversion
- The query is converted to an intermediate format:
  - Query tree
  - Query graph

Execution strategy
- How the DBMS takes the tree or graph and executes it against the database.
- Many different strategies are possible.
- This is where query optimization comes into play.
- Is query optimization picking the best execution plan (time, disk accesses, etc.)? No! Just a reasonably efficient strategy.

Next
- The execution plan is converted to code (query code generation).
- Two types of execution:
  - Interpreted - executed directly.
  - Compiled - the execution strategy is stored and executed at a later time.
- The runtime DBMS executes the final query.

Two types of techniques
- Heuristic rules for ordering the operations in a query.
- Systematically estimating costs.
(Usually a combination of these two strategies is used.)

and Procedural Abstraction 1
Query operations
- Tasks include: search, sort, merge, union, intersect, etc.
- A DBMS typically has several algorithms to perform each task.

Most start with SQL
- Take the SQL query and break it into query blocks (a block is composed of a single SELECT-FROM-WHERE).
- Convert each block to a relational algebra expression (a tree data structure).
- Optimize this expression.

Example
    SELECT Lname, Fname
    FROM EMPLOYEE
    WHERE Salary > (SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno=5);

Convert this to two queries
    SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno=5;
  which becomes the MAX aggregate applied to a selection: $\mathrm{MAX}_{Salary}(\sigma_{Dno=5}(EMPLOYEE))$
    SELECT Lname, Fname FROM EMPLOYEE WHERE Salary > c;
  (c is the result of the first query)
- This is more difficult with a correlated query, though.

Cover optimization in following order
- External sorting - required for sort-merge operations.
- SELECT operations
- JOIN operations
- PROJECT operations
- Set operations (union, intersection, difference)
- Aggregate operations (min, max, average, count)

External sorting
- Required for almost everything: sort-merge, union, intersection, difference, duplicate elimination.
- Break the file into subfiles (process in runs and merge).
- Two phases:
  - Sort
  - Merge
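As a minimal sketch of the two phases of external sorting (sort, then merge), assuming records are plain Python values and in-memory lists stand in for the run files a real DBMS would write to disk. The names `external_sort` and `buffer_records` are illustrative and do not come from the slides.

```python
import heapq
from itertools import islice


def external_sort(records, buffer_records):
    """Two-phase sort-merge sketch; lists stand in for run files on disk."""
    it = iter(records)
    runs = []
    # Phase 1 (sort): read a buffer-sized chunk, sort it in memory,
    # and store it as one sorted run.
    while True:
        chunk = list(islice(it, buffer_records))
        if not chunk:
            break
        runs.append(sorted(chunk))
    # Phase 2 (merge): combine all sorted runs in a single merge pass.
    return list(heapq.merge(*runs)), len(runs)


sorted_records, run_count = external_sort(
    [7, 3, 9, 1, 8, 2, 6, 4, 5], buffer_records=4
)
```

With nine records and room for four at a time, the sort phase produces three runs, which the merge phase combines into one sorted output. A real implementation would limit how many runs are merged per pass to the available buffer space.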
External sort calculations
- $n_R = \lceil b / n_B \rceil$
  - $n_R$ - number of initial runs
  - $b$ - number of file blocks
  - $n_B$ - available buffer space (in main memory), in blocks
- E.g., if the buffer space is 100 blocks and $b = 64000$ blocks, then the number of runs is $n_R = 640$.

Merge calculations
- $d_M$ - degree of merging (how many runs can be merged per pass)
- $d_M = \min(n_B - 1, n_R)$
- Number of passes $= \lceil \log_{d_M}(n_R) \rceil$
- For $n_R = 640$ and $n_B = 100$, $d_M = 99$, so the 640 runs can be merged in 2 passes.
(Remember, the worst case is $d_M = 2$!)

Worst case performance
- $(2 \cdot b) + (2 \cdot (b \cdot \log_2 b))$
  - First term: one read and one write of every block for sorting.
  - Second term: disk accesses for merging.
(Replace $\log_2$ with $\log_{d_M}$ for the general case.)

SELECT algorithms
- The choice depends on whether the selection is on indexed or non-indexed attributes.
- Simple methods (file scan or index scan):
  - Linear search - check every record.
  - Binary search - for an equality comparison on the attribute the file is ordered by; no index is needed.
  - Primary index for an equality test.
  - Primary index for a range (or other inequality) test.
  - Clustering index to retrieve multiple records.

SELECT (complex select)
- Conditions are combined using conjunctions or disjunctions.
- Conjunctions:
  - Use one of the simple methods for one condition, then check the remaining simple conditions on each retrieved record.
  - Use a composite index or composite hash key.
  - Intersect the record pointers when there are multiple secondary indices.
- NOTE: here an access path means an index.

Complex SELECT
- The optimizer should choose the access path that retrieves the fewest records in the most efficient way.
- $s$ - selectivity of an access path, defined as $s = r_{\text{satisfying}} / r_{\text{total}}$: the number of records (tuples) that satisfy the condition divided by the total number of records.
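The run and pass counts from the external sort and merge calculations in this section can be checked with a few lines of Python. This is a sketch under the slide's definitions (b file blocks, n_B buffer blocks, degree of merging d_M = MIN(n_B - 1, n_R)); the function names are illustrative. The pass count is found by repeatedly merging groups of d_M runs, which gives the same value as the ceiling of the logarithm without floating-point rounding.

```python
def initial_runs(b, n_b):
    """Number of initial sorted runs: ceiling of b / n_B."""
    return -(-b // n_b)  # integer ceiling division


def merge_passes(n_r, n_b):
    """Return (degree of merging, number of merge passes)."""
    d_m = min(n_b - 1, n_r)
    if n_r > 1 and d_m < 2:
        raise ValueError("merging needs at least 3 buffer blocks")
    passes = 0
    runs = n_r
    # Each pass merges groups of d_m runs into one longer run.
    while runs > 1:
        runs = -(-runs // d_m)
        passes += 1
    return d_m, passes


n_r = initial_runs(64000, 100)
d_m, passes = merge_passes(n_r, 100)
```

For the slide's numbers this gives 640 initial runs, a degree of merging of 99, and 2 merge passes: the first pass reduces 640 runs to 7, and the second merges those 7 into one.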
SELECT - disjunctions
    SELECT FROM WHERE a<x OR b<6;
- Return the UNION of the records that satisfy each condition.
- Limited by attributes without indices: the query can only be optimized if every condition in the disjunction has an index.

JOIN algorithms
- Equijoin or natural join: $R \bowtie_{a=b} S$
- Algorithms:
  - Nested-loop join - brute force.
  - Single-loop join - use an index to search for matches.
  - Sort-merge join - only if both tables are sorted by the join value. Very efficient (if not using logical blocks).

JOIN algo - cont.
- Hash join - hash both tables into the same hash table.
  - Hash the smaller of the two first.
  - The second phase is the probing phase.
- Requires blocks for both tables and 1 block for the join results.

JOIN analysis
- Join selection factor - the percentage of records in one table that will be joined.
- For the single-loop join, use the table with the highest join selection factor as the outer loop.

JOIN analysis
- Sort-merge join - linear if the tables are already sorted, n log(n) if not.
- Partition-hash join - use the same hash function for both tables. If an internal (in-memory) hash can be used, it is very fast; otherwise it is more complicated.

PROJECT algorithms
- If the projected attribute is a key, the result has the same number of tuples as R.
- If the attribute is not a key, duplicates may need to be removed through sorting or hashing.
- Remember, SQL queries do not normally remove duplicates (you need to use the DISTINCT keyword).
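The hash join described in this section (hash the smaller table first, then probe with the other) can be sketched as follows. This is a simplified in-memory version that assumes the smaller table fits in a Python dictionary; the function name, the dictionary-per-row representation, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key):
    """Equijoin of r and s on r[r_key] == s[s_key], building on the smaller input."""
    swapped = len(r) > len(s)
    build, probe = (s, r) if swapped else (r, s)
    build_key, probe_key = (s_key, r_key) if swapped else (r_key, s_key)

    # Phase 1 (build): hash the smaller table on its join attribute.
    table = defaultdict(list)
    for row in build:
        table[row[build_key]].append(row)

    # Phase 2 (probe): look up each row of the larger table.
    result = []
    for row in probe:
        for match in table.get(row[probe_key], []):
            r_row, s_row = (row, match) if swapped else (match, row)
            result.append({**r_row, **s_row})
    return result


employees = [
    {"Lname": "Smith", "Dno": 5},
    {"Lname": "Wallace", "Dno": 4},
    {"Lname": "Wong", "Dno": 5},
]
departments = [
    {"Dnumber": 5, "Dname": "Research"},
    {"Dnumber": 4, "Dname": "Administration"},
]
joined = hash_join(employees, departments, "Dno", "Dnumber")
```

Here `departments` is the smaller input, so it becomes the build side, and each employee row is probed against it once. When neither table fits in memory, the partition-hash variant first splits both tables with the same hash function and joins matching partitions.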
Set algorithms
- UNION, INTERSECTION, SET DIFFERENCE, CARTESIAN PRODUCT
  - Avoid the Cartesian product like the plague!
  - The others require union-compatible tables.
- Use sort-merge algorithms.
- Hashing using partition and probe also works well.

Set using sort-merge
- UNION - sort both tables and merge them simultaneously.
- INTERSECTION - sort, then merge only the tuples found in both tables.
- DIFFERENCE - merge only the tuples found in the first table but not in the second.

Set using hash
- UNION - hash R, hash S; on a match, don't add the tuple again.
- INTERSECTION - hash R, hash S; on a match, copy the tuple to the result set.
- DIFFERENCE - hash R, hash S; on a match, mark the record invalid (but keep it in the hash table).

Aggregate operations
- Use a table scan or an index.
- MIN and MAX are good index operators.
- COUNT, AVERAGE, and SUM only work with dense indices (the number of matching records must be counted).
- COUNT DISTINCT can be used with a sparse index.

Group by
- When using GROUP BY, partition the records using a sort or a hash, then apply the function to the records in each group.
- A clustered index has this partitioning by default (GROUP BY operations are easy to perform if the table has a cluster index).

OUTER JOIN algorithm
- Modify the join operations, or
- Use relational algebra operations:
  1. Inner join R and S.
  2. Find the tuples of R that are not in the join result.
  3. Pad the tuples from step 2 with NULL values.
  4. Union the results of steps 1 and 3.
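The four relational algebra steps for the outer join can be sketched directly in Python. This is a minimal illustration of a left outer join, assuming each tuple is a dictionary and that `None` stands for NULL; the function name, parameters, and sample data are hypothetical.

```python
def left_outer_join(r, s, r_key, s_key, s_attrs):
    """Left outer join of r and s built from inner join, difference, padding, union."""
    # Step 1: inner join R and S (nested loop, for brevity).
    inner = [{**x, **y} for x in r for y in s if x[r_key] == y[s_key]]

    # Step 2: find the tuples of R that do not appear in the join result.
    joined_keys = {row[r_key] for row in inner}
    unmatched = [x for x in r if x[r_key] not in joined_keys]

    # Step 3: pad those tuples with NULLs for the attributes of S.
    padded = [{**x, **{a: None for a in s_attrs}} for x in unmatched]

    # Step 4: union the results of steps 1 and 3.
    return inner + padded


employees = [{"Lname": "Smith", "Dno": 5}, {"Lname": "Borg", "Dno": 1}]
departments = [{"Dnumber": 5, "Dname": "Research"}]
result = left_outer_join(
    employees, departments, "Dno", "Dnumber", ["Dnumber", "Dname"]
)
```

The employee without a matching department still appears in the result, with NULLs in the department attributes. The alternative mentioned on the slide, modifying a join algorithm to emit the padded tuple when no match is found, avoids the separate difference step.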
How to do this all quickly
- Pipelining or stream-based processing - don't write intermediate results out to disk.
- For example:
    SELECT Lname
    FROM EMPLOYEE E, WORKS_ON W
    WHERE E.Ssn=W.Ssn AND W.Pno=4 AND E.Dno=4;
- The results of the SELECT operations are fed right into the join and then the project, rather than creating 4 temp files.

Query Tree
- The data structure used to hold a relational algebra (RA) or extended RA expression.
- Relations are leaf nodes.
- Relational algebra operations are internal nodes.
- The initial tree generated by the parser is not the best one.
- Heuristics are given for optimizing these trees.

Query tree
Canonical form of query tree
- The top node is PROJECT (π).
- The next node is SELECT (σ).
- The leaf nodes are joined using the Cartesian product into one relation (with all attributes and all tuples), which is connected to the big σ statement.
- VERY EXPENSIVE to execute this tree!

Query tree
- The canonical form is a good place to start optimization.
- No heuristic should change the flavor of the query: the optimized tree must return the same result.

Example
$\pi_{Pnum, Dnum, Lname, Addr, Bdate}(((\sigma_{Ploc = \text{'Stafford'}}(PROJECT)) \bowtie_{Dnum = Dnum} (DEPARTMENT)) \bowtie_{Mgrssn = Ssn} (EMPLOYEE))$
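The pipelining idea from this section can be sketched with Python generators, where each operator pulls rows from the one below it and no intermediate result is written out. This is an illustrative sketch for the EMPLOYEE / WORKS_ON query, not how any particular DBMS implements it; the operator functions and the sample rows are assumptions, and the join keeps one input in an in-memory dictionary.

```python
def select(rows, predicate):
    """Stream only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def join(outer, inner, outer_key, inner_key):
    """Equijoin that indexes the inner input in memory and streams the outer one."""
    index = {}
    for row in inner:
        index.setdefault(row[inner_key], []).append(row)
    for row in outer:
        for match in index.get(row[outer_key], []):
            yield {**row, **match}


def project(rows, attrs):
    """Stream rows reduced to the requested attributes."""
    for row in rows:
        yield {a: row[a] for a in attrs}


EMPLOYEE = [
    {"Ssn": "1", "Lname": "Smith", "Dno": 4},
    {"Ssn": "2", "Lname": "Wong", "Dno": 5},
    {"Ssn": "3", "Lname": "Zelaya", "Dno": 4},
]
WORKS_ON = [
    {"Ssn": "1", "Pno": 4},
    {"Ssn": "2", "Pno": 4},
    {"Ssn": "3", "Pno": 7},
]

# Both selections feed the join, which feeds the projection.
plan = project(
    join(
        select(EMPLOYEE, lambda e: e["Dno"] == 4),
        select(WORKS_ON, lambda w: w["Pno"] == 4),
        "Ssn",
        "Ssn",
    ),
    ["Lname"],
)
result = list(plan)
```

Nothing runs until `list(plan)` asks for rows, at which point each row flows through select, join, and project in turn. Only Smith is both in department 4 and working on project 4, so the result holds a single row.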