Query Processing and Optimization (Part-1) Prof Monika Shah
Overview of Query Execution
SQL Query → Compile → Optimize → Execute
SQL query → parse → parse tree → convert → logical query plan → apply laws → improved l.q.p. → estimate result sizes (using statistics) → l.q.p. + sizes → consider physical plans → {P1, P2, ...} → estimate costs → {(P1, C1), (P2, C2), ...} → pick best → Pi → execute → answer
Logical Plans vs. Physical Plans
A physical plan specifies how each operator will execute (which algorithm). E.g., a join can be nested-loop, hash-based, merge-based, or sort-based.
Each logical plan maps to multiple physical plans.
Logical plan: π_title( StarsIn ⋈_{starname=name} π_name( σ_{birthdate LIKE '%1960'}(MovieStar) ) )
One physical plan: hash join (parameters: join order, memory size, project attributes, ...) over a SEQ scan of StarsIn and an index scan of MovieStar (parameters: select condition, ...)
Example: SQL query
SELECT title
FROM StarsIn
WHERE starname IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);
(Find the movies with stars born in 1960)
Example: Parse Tree
<Query>
  <SFW>
    SELECT <SelList>
      <Attribute>: title
    FROM <FromList>
      <RelName>: StarsIn
    WHERE <Condition>
      <Tuple> IN <Query>
        <Tuple>
          <Attribute>: starname
        ( <Query> )
          <SFW>
            SELECT <SelList>
              <Attribute>: name
            FROM <FromList>
              <RelName>: MovieStar
            WHERE <Condition>
              <Attribute> LIKE <Pattern>
                <Attribute>: birthdate
                <Pattern>: '%1960'
Preprocessing: unfold views, semantic check
Input: SQL query
SELECT title
FROM ParamountMovies
WHERE year = 1979;
where
CREATE VIEW ParamountMovies AS
    SELECT title, year
    FROM Movies
    WHERE studioname = 'Paramount';
Steps: substitute the view definition into the query, then simplify.
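To make the unfolding step concrete, here is a minimal sketch using Python's built-in sqlite3 module. It runs the query against the view and against the hand-unfolded form (view definition substituted, the two selections merged) and checks that both return the same rows. The sample rows are illustrative data chosen for this sketch; only the schema and the view definition come from the slide.

```python
import sqlite3

# In-memory database; the rows below are sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, year INTEGER, studioname TEXT);
    INSERT INTO Movies VALUES
        ('Star Trek: The Motion Picture', 1979, 'Paramount'),
        ('Alien', 1979, 'Fox'),
        ('Grease', 1978, 'Paramount');
    CREATE VIEW ParamountMovies AS
        SELECT title, year FROM Movies WHERE studioname = 'Paramount';
""")

# The query as written against the view.
via_view = conn.execute(
    "SELECT title FROM ParamountMovies WHERE year = 1979"
).fetchall()

# The same query after substituting the view body and merging the selections.
unfolded = conn.execute(
    "SELECT title FROM Movies WHERE studioname = 'Paramount' AND year = 1979"
).fetchall()

print(via_view == unfolded)
```

The unfolded form refers only to the base relation Movies, which is what the later stages of the optimizer work with.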
Query Plan Generator : Transform Parse tree into relational algebra
Example: Generating Relational Algebra
SELECT title
FROM StarsIn
WHERE starname IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);
(Find the movies with stars born in 1960)
π_title
  σ
    StarsIn
    <condition>
      <tuple> IN π_name( σ_{birthdate LIKE '%1960'}(MovieStar) )
        <tuple>
          <attribute>: starname
Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra
Example: Logical Query Plan
Before (two-argument σ):
π_title
  σ
    StarsIn
    <condition>
      <tuple> IN π_name( σ_{birthdate LIKE '%1960'}(MovieStar) )
        <tuple>
          <attribute>: starname
After: π_title( σ_{starname=name}( StarsIn × π_name( σ_{birthdate LIKE '%1960'}(MovieStar) ) ) )
Rewrite: translate into a better equivalent logical query plan (using algebraic laws), i.e., an optimal sequence of operations
Algebraic Transformation Laws
Commutativity: A op B = B op A, for op in {×, ⋈, ∪, ∩}, but not for −.
Associativity: A op (B op C) = (A op B) op C, for op in {×, ⋈, ∪, ∩}, but not for −.
Distributive law (sets): A ∩_S (B ∪_S C) = (A ∩_S B) ∪_S (A ∩_S C).
But for bags: A ∩_B (B ∪_B C) ≠ (A ∩_B B) ∪_B (A ∩_B C).
Selection over union: σ_c(R ∪ S) = σ_c(R) ∪ σ_c(S).
Selection over product: σ_c(R × S) = σ_c(R) × S, if c is applicable only to R.
Selection over difference: σ_c(R − S) = σ_c(R) − S.
Other basic laws: R ∪ ∅ = R. ∅ ∪ R = R. R ∩ S = S, if S ⊆ R.
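The difference between set and bag semantics in the distributive law can be checked on a tiny example. The sketch below uses Python sets for set semantics and collections.Counter for bags, where "+" adds multiplicities (bag union) and "&" takes the minimum multiplicity (bag intersection). The sample values are arbitrary and chosen only for illustration.

```python
from collections import Counter

# Set semantics: intersection distributes over union.
A, B, C = {1, 2}, {2, 3}, {1, 3}
set_lhs = A & (B | C)
set_rhs = (A & B) | (A & C)

# Bag semantics: Counter "+" is bag union, Counter "&" is bag intersection.
bag_a, bag_b, bag_c = Counter("x"), Counter("x"), Counter("x")
bag_lhs = bag_a & (bag_b + bag_c)             # x appears min(1, 2) = 1 time
bag_rhs = (bag_a & bag_b) + (bag_a & bag_c)   # x appears 1 + 1 = 2 times

print(set_lhs == set_rhs)   # True
print(bag_lhs == bag_rhs)   # False
```

A single element that occurs once in each of the three bags is already enough to break the law, which is why an optimizer must know whether it is working under set or bag semantics before applying it.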
Rewrite: translate into a better equivalent logical query plan (approaches)
Approach 1: Cost-based optimization
Approach 2: Heuristic-based optimization
Heuristic Optimization Laws
Goal: reduce the size of intermediate results.
Heuristics reduce the number of choices. They are a set of rules that typically (but not in all cases) improve execution performance:
- Perform selection early (reduces the number of tuples).
- Perform projection early (reduces the number of attributes).
- Perform the most restrictive selection and join operations before other similar operations.
- Multi-way join ordering.
- Replace Cartesian product with join.
- And many others.
Example: Improved Logical Query Plan
Before (Fig. 7.18): π_title( σ_{starname=name}( StarsIn × π_name( σ_{birthdate LIKE '%1960'}(MovieStar) ) ) )
After: π_title( StarsIn ⋈_{starname=name} π_name( σ_{birthdate LIKE '%1960'}(MovieStar) ) )
Question: Push project to StarsIn?
Fig. 7.20: An improvement on fig. 7.18.
Example: Estimate Result Sizes
Need the expected size of each intermediate result in the plan over StarsIn and MovieStar.
Cost Estimation for various operators:
Notation: T(R) is the number of tuples in R; V(R, a) is the number of distinct values of attribute a in R.
Selection:
  S = σ_{a=c}(R): T(S) = T(R) / V(R, a)
  S = σ_{a<c}(R) (inequality): T(S) = T(R) / 3
  S = σ_{a<c AND b=2}(R) (AND): T(S) = (1/3) · T(R) / V(R, b)
  S = σ_{a<c OR b=2}(R) (OR): since c1 OR c2 = NOT(NOT c1 AND NOT c2), T(S) = T(R) · [1 − (1 − 1/3) · (1 − 1/V(R, b))]
Join:
  T(R ⋈_a S) = 0, when R and S have no a-values in common
  T(R ⋈_a S) = T(S), when a is a key of R (each S tuple joins with one R tuple)
  T(R ⋈_a S) = T(R) · T(S), when a is a non-key and all tuples have the same a-value
  Hence, the estimate: T(R ⋈_a S) = T(R) · T(S) / max(V(R, a), V(S, a))
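The selection and join rules can be written as small functions. This is a sketch, not a definitive implementation: the function names and the statistics for R and S are hypothetical. The OR rule is written in its selectivity form, T(R) · [1 − (1 − s1)(1 − s2)], where s1 and s2 are the fractions of tuples that satisfy each condition.

```python
def select_eq(t_r, v_r_a):
    """Estimated size of sigma_{a=c}(R): T(R) / V(R, a)."""
    return t_r / v_r_a

def select_ineq(t_r):
    """Estimated size of sigma_{a<c}(R): T(R) / 3."""
    return t_r / 3

def select_and(t_r, selectivities):
    """AND: multiply T(R) by the selectivity of every condition."""
    size = t_r
    for s in selectivities:
        size *= s
    return size

def select_or(t_r, s1, s2):
    """OR: T(R) * (1 - (1 - s1) * (1 - s2)), assuming independent conditions."""
    return t_r * (1 - (1 - s1) * (1 - s2))

def join_size(t_r, t_s, v_r_a, v_s_a):
    """Estimated size of R join_a S: T(R) * T(S) / max(V(R, a), V(S, a))."""
    return t_r * t_s / max(v_r_a, v_s_a)

# Hypothetical statistics, chosen only to exercise the formulas.
T_R, V_R_A, V_R_B = 3000, 50, 100
T_S, V_S_A = 2000, 40

eq_size = select_eq(T_R, V_R_A)                    # 3000 / 50
ineq_size = select_ineq(T_R)                       # 3000 / 3
and_size = select_and(T_R, [1 / 3, 1 / V_R_B])     # a<c AND b=2
or_size = select_or(T_R, 1 / 3, 1 / V_R_B)         # a<c OR b=2
natural_join = join_size(T_R, T_S, V_R_A, V_S_A)   # 3000 * 2000 / 50

for name, value in [("a=c", eq_size), ("a<c", ineq_size), ("AND", and_size),
                    ("OR", or_size), ("join", natural_join)]:
    print(name, round(value, 2))
```

Note that the OR estimate never exceeds T(R), which the naive sum of the two individual estimates could do.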
Cost Estimation for various operators:
Complex join (join on multiple attributes):
  T(R ⋈_{a,b} S) = T(R) · T(S) / ( max(V(R, a), V(S, a)) · max(V(R, b), V(S, b)) )
Multiple join (k relations joined on the same attribute a):
  T(R1 ⋈_a R2 ⋈_a ... ⋈_a Rk) = T(R1) · T(R2) · ... · T(Rk) / (product of the largest k−1 of the values V(Ri, a))
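A short sketch of both rules follows. The function names and all statistics are hypothetical and serve only to show how the denominators are built: one max(...) factor per join attribute in the first case, and every V(Ri, a) except the smallest in the second.

```python
import math

def join_size_multi_attr(t_r, t_s, distinct_pairs):
    """Join on several attributes; distinct_pairs holds (V(R, x), V(S, x)) per attribute x."""
    denominator = math.prod(max(v_r, v_s) for v_r, v_s in distinct_pairs)
    return t_r * t_s / denominator

def join_size_multiway(sizes, distinct_counts):
    """k relations joined on one attribute a: divide by all V(Ri, a) except the smallest."""
    largest_k_minus_1 = sorted(distinct_counts)[1:]
    return math.prod(sizes) / math.prod(largest_k_minus_1)

# Hypothetical statistics for a two-attribute join of R and S.
two_attr = join_size_multi_attr(1000, 2000, [(20, 50), (100, 50)])

# Hypothetical statistics for a three-way join on a shared attribute a.
three_way = join_size_multiway([1000, 2000, 5000], [100, 50, 200])

print(two_attr)    # 1000 * 2000 / (50 * 100)
print(three_way)   # 1000 * 2000 * 5000 / (100 * 200)
```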
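Cost Estimation for various operators:
Union:
  T(S) = T(R) + T(V), where S = R ∪_B V (bag union), or a set union of disjoint R and V
  T(S) = max(T(R), T(V)), where S = R ∪_S V and one relation contains the other
  Average estimate for set union: larger + smaller/2
Intersection: the result has between 0 and min(T(R), T(V)) tuples.
  T(S) = ½ · min(T(R), T(V)) (average estimate)
Difference: the result of R − V has between T(R) − T(V) and T(R) tuples.
  T(S) = T(R) − ½ · T(V) (average estimate)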
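The set-operation estimates are midpoints of the possible range of result sizes, which a few lines of code make explicit. This is a sketch under that midpoint assumption; the function names and the sample sizes are hypothetical. For difference it uses the midpoint of the range from T(R) − T(V) to T(R), that is T(R) − T(V)/2.

```python
def union_bag(t_r, t_v):
    """Bag union keeps every tuple: T(R) + T(V)."""
    return t_r + t_v

def union_set(t_r, t_v):
    """Set union lies between max(T(R), T(V)) and T(R) + T(V); take the midpoint."""
    larger, smaller = max(t_r, t_v), min(t_r, t_v)
    return larger + smaller / 2

def intersection(t_r, t_v):
    """Intersection lies between 0 and min(T(R), T(V)); take the midpoint."""
    return min(t_r, t_v) / 2

def difference(t_r, t_v):
    """R - V lies between T(R) - T(V) and T(R); take the midpoint."""
    return t_r - t_v / 2

# Hypothetical sizes: T(R) = 1000, T(V) = 400.
print(union_bag(1000, 400))      # 1400
print(union_set(1000, 400))      # 1200.0
print(intersection(1000, 400))   # 200.0
print(difference(1000, 400))     # 800.0
```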
Self Review Questions
- Why do we need to compute the estimated size of a logical operator?
- Is it required to compute the cost of each operator of a logical query plan?
- What is the difference between a logical query plan and a physical query plan?
- What if we generate a physical query plan directly from relational algebra?
Self Review Questions (contd.)
Here are the statistics for four relations W, X, Y, Z.
W(A,B)         X(B,C)          Y(C,D)         Z(D,E)
T(W) = 100     T(X) = 200      T(Y) = 300     T(Z) = 400
V(W,A) = 20    V(X,B) = 50     V(Y,C) = 50    V(Z,D) = 40
V(W,B) = 60    V(X,C) = 100    V(Y,D) = 50    V(Z,E) = 100
Estimate the number of tuples of the following expressions:
1. σ_{A=35}(W)
2. σ_{A=35 ∧ B=5}(W)
3. W ⋈ X
4. X ⋈ Y
5. W ⋈ X ⋈ Y ⋈ Z
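One way to check your answers is to code the estimation rules and apply them to the statistics above. The sketch below assumes the equality rule T(R)/V(R, a) for selections, independent conditions for the conjunction, and the rule T(R)·T(S)/max(V(R, a), V(S, a)) applied pairwise from left to right for the chain join, with V values of the join attributes taken from the base relations. The helper name join_size is hypothetical.

```python
# Statistics from the question.
T = {"W": 100, "X": 200, "Y": 300, "Z": 400}
V = {("W", "A"): 20, ("W", "B"): 60, ("X", "B"): 50, ("X", "C"): 100,
     ("Y", "C"): 50, ("Y", "D"): 50, ("Z", "D"): 40, ("Z", "E"): 100}

def join_size(t_left, t_right, v_left, v_right):
    """T(L) * T(R) / max(V(L, a), V(R, a)) for a join on attribute a."""
    return t_left * t_right / max(v_left, v_right)

q1 = T["W"] / V[("W", "A")]                                   # sigma_{A=35}(W)
q2 = T["W"] / (V[("W", "A")] * V[("W", "B")])                 # sigma_{A=35 AND B=5}(W)
q3 = join_size(T["W"], T["X"], V[("W", "B")], V[("X", "B")])  # W join X on B
q4 = join_size(T["X"], T["Y"], V[("X", "C")], V[("Y", "C")])  # X join Y on C

# Chain join: fold in one relation at a time, joining on B, then C, then D.
wxy = join_size(q3, T["Y"], V[("X", "C")], V[("Y", "C")])
q5 = join_size(wxy, T["Z"], V[("Y", "D")], V[("Z", "D")])

for label, value in zip(["1", "2", "3", "4", "5"], [q1, q2, q3, q4, q5]):
    print(label, round(value, 2))
```

Under these assumptions the estimates come out to roughly 5, 0.08, 333.33, 600, and 8000 tuples; a fractional estimate below one tuple simply signals that the result is expected to be empty or nearly so.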
Complete Cost Evaluation: Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Reserves: each tuple is 40 bytes long, 100 tuples per block, 1000 blocks. Assume there are 100 boats.
Sailors: each tuple is 50 bytes long, 80 tuples per block, 500 blocks. Assume there are 10 different ratings.
Assume we have 5 blocks in our buffer pool.
Assume 1 I/O takes 200 ms.
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5
Plan: π_sname (on-the-fly) over σ_{bid=100 ∧ rating>5} (on-the-fly) over Sailors ⋈_{sid=sid} Reserves (block nested join)
Cost: 500 + 500*1000 I/Os
By no means the worst plan!
Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
Goal of optimization: to find more efficient plans that compute the same answer.
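The cost figure can be turned into elapsed time with the block counts and the 200 ms per I/O from the schema slide. The sketch below evaluates the slide's formula (read Sailors once, then scan all of Reserves once per Sailors block). For comparison it also evaluates a chunked variant that is an assumption of this sketch, not something stated on the slide: with 5 buffer blocks, 3 hold a chunk of Sailors, 1 holds the current Reserves block, and 1 is the output buffer.

```python
import math

SAILORS_BLOCKS = 500
RESERVES_BLOCKS = 1000
BUFFER_BLOCKS = 5
IO_MS = 200

# Formula from the slide: 500 + 500 * 1000 I/Os.
naive_ios = SAILORS_BLOCKS + SAILORS_BLOCKS * RESERVES_BLOCKS
naive_seconds = naive_ios * IO_MS / 1000

# Assumed variant: load Sailors in chunks of (buffer - 2) blocks,
# and scan Reserves once per chunk instead of once per block.
chunks = math.ceil(SAILORS_BLOCKS / (BUFFER_BLOCKS - 2))
block_ios = SAILORS_BLOCKS + chunks * RESERVES_BLOCKS
block_seconds = block_ios * IO_MS / 1000

print(naive_ios, naive_seconds)   # 500500 100100.0
print(block_ios, block_seconds)   # 167500 33500.0
```

At 200 ms per I/O the slide's plan takes 100,100 seconds, close to 28 hours, which is the motivation for looking at plans that push selections down or use indexes.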
Compute the I/O cost of Plan A, Plan B and Plan C. Which one is the most cost-effective?
Plan A: π_sname (on-the-fly) over σ_{bid=100} (on-the-fly) over [σ_{rating>5}(Sailors) (on-the-fly)] ⋈_{sid=sid} Reserves (block-oriented nested loops)
Plan B: π_sname (on-the-fly) over σ_{rating>5} (on-the-fly) over [σ_{bid=100}(Reserves) (on-the-fly)] ⋈_{sid=sid} Sailors (block-oriented nested loops)
Plan C: π_sname (on-the-fly) over σ_{rating>5} (on-the-fly) over [σ_{bid=100}(Reserves) (use index)] ⋈_{sid=sid} Sailors (index nested join)
Indexes for Plan C: hash index on bid for Reserves; hash index on sno for Sailors.