Query Processing and Query Optimization. Prof Monika Shah


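Query Processing
- An SQL query arrives; the system checks whether it is already in the library cache.
- Scan and verify relations against the system catalog (dictionary / dictionary cache).
- Parse the query into a parse tree (relational calculus).
- View unfolding, using the view definitions.
- Query transformations into alternate relational algebra expressions.
- The query optimizer uses statistics and index information to produce an execution plan.
- Query evaluation runs the plan against the data and indices and returns the query result.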

Query Optimization
- Cost-based query optimization (recommended)
- Rule-based query optimization (for backward compatibility with legacy applications)

Cost-Based Query Optimization
Implementation of single relational operations: choices depend on indexes, memory, and statistics.
Joins:
- Blocked nested loops: simple, exploits extra memory
- Indexed nested loops: best if one relation is small and the other is indexed
- Sort-merge join: good with a small amount of memory, bad with duplicates
- Hash join: fast (given enough memory), bad with skewed data
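To make the join choices above concrete, here is a minimal in-memory sketch of two of them in Python. It is illustrative only: the tuple layout, the sample rows, and the function names are assumptions, and a real DBMS works on disk pages rather than Python lists.

```python
from collections import defaultdict

def nested_loops_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row (simple, no extra structure)."""
    return [(o, i) for o in outer for i in inner if o[outer_key] == i[inner_key]]

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then probe it once per row of the other."""
    table = defaultdict(list)
    for b in build:
        table[b[build_key]].append(b)
    return [(b, p) for p in probe for b in table.get(p[probe_key], [])]

# Hypothetical sample rows: Sailors as (sid, sname), Reserves as (sid, bid).
sailors = [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")]
reserves = [(22, 101), (31, 100), (22, 100)]

nl_result = nested_loops_join(sailors, reserves, 0, 0)
hj_result = hash_join(sailors, reserves, 0, 0)
```

Both functions return the same set of matching pairs; only the order of the pairs and the amount of work differ, which is what the optimizer weighs when it picks an implementation.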

Cost-Based Query Optimization (contd.)
- A query can be converted to relational algebra.
- The relational algebra expression is converted to a tree, with joins as branches.
- Each operator has implementation choices.
- Operators can also be applied in a different order!

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Relational algebra: π sname (σ bid=100 ∧ rating>5 (Reserves ⋈ sid=sid Sailors))

Tree (root first):
π sname
  σ bid=100 ∧ rating>5
    ⋈ sid=sid
      Reserves
      Sailors

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
As seen in previous lectures:
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages. Assume there are 100 boats.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages. Assume there are 10 different ratings.
- Assume we have 5 pages in our buffer pool!

Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Plan (root first): π sname; σ bid=100 ∧ rating>5; ⋈ sid=sid (page-oriented nested loops), with Sailors as the outer relation and Reserves as the inner.
- Cost: 500 + 500*1000 I/Os.
- By no means the worst plan!
- Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
- Goal of optimization: to find more efficient plans that compute the same answer.
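The cost figure above follows directly from the page counts on the schema slide. A minimal sketch, assuming the usual page-oriented nested-loops cost model (read the outer once, read the whole inner once per outer page); the function name is illustrative.

```python
def page_nested_loops_cost(outer_pages: int, inner_pages: int) -> int:
    """I/O cost of page-oriented nested loops: scan the outer once and
    scan the entire inner once for every page of the outer."""
    return outer_pages + outer_pages * inner_pages

SAILORS_PAGES = 500     # from the schema slide
RESERVES_PAGES = 1000   # from the schema slide

# Sailors is the outer relation, Reserves the inner.
motivating_cost = page_nested_loops_cost(SAILORS_PAGES, RESERVES_PAGES)
print(motivating_cost)  # 500500
```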

Alternative Plans: Push Selects (No Indexes)
Both plans use page-oriented nested loops; the outer relation is listed first.
- π sname (σ bid=100 ∧ rating>5 (Sailors ⋈ sid=sid Reserves)): 500,500 I/Os
- π sname (σ bid=100 ((σ rating>5 Sailors) ⋈ sid=sid Reserves)): 250,500 I/Os

Alternative Plans: Push Selects (No Indexes)
Both plans use page-oriented nested loops; the outer relation is listed first.
- π sname (σ bid=100 ((σ rating>5 Sailors) ⋈ sid=sid Reserves)): 250,500 I/Os
- π sname ((σ rating>5 Sailors) ⋈ sid=sid (σ bid=100 Reserves)): 250,500 I/Os

Alternative Plans: Push Selects (No Indexes)
Both plans use page-oriented nested loops; the outer relation is listed first.
- π sname ((σ rating>5 Sailors) ⋈ sid=sid (σ bid=100 Reserves)): 250,500 I/Os
- π sname ((σ bid=100 Reserves) ⋈ sid=sid (σ rating>5 Sailors)): 6000 I/Os

Alternative Plans: Push Selects (No Indexes)
Both plans use page-oriented nested loops; the outer relation is listed first.
- π sname ((σ bid=100 Reserves) ⋈ sid=sid (σ rating>5 Sailors)): 6000 I/Os
- π sname ((σ bid=100 Reserves) ⋈ sid=sid (σ rating>5 Sailors; scan and write to temp T2)): 4250 I/Os = 1000 + 500 + 250 + (10 * 250)

Alternative Plans: Push Selects (No Indexes)
Both plans use page-oriented nested loops; the outer relation is listed first.
- π sname ((σ bid=100 Reserves) ⋈ sid=sid (σ rating>5 Sailors; scan and write to temp T2)): 4250 I/Os
- π sname ((σ rating>5 Sailors) ⋈ sid=sid (σ bid=100 Reserves; scan and write to temp T2)): 4010 I/Os = 500 + 1000 + 10 + (250 * 10)
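The I/O counts quoted on the "Push Selects" slides can be reproduced with a few lines of arithmetic. This is a sketch under the slides' own assumptions (uniform distribution over 100 boats and 10 ratings, page-oriented nested loops); the variable names and plan labels are illustrative.

```python
SAILORS, RESERVES = 500, 1000   # pages, from the schema slide

sailors_sel = SAILORS // 2      # rating > 5 keeps half of 10 ratings: 250 pages
reserves_sel = RESERVES // 100  # bid = 100 keeps 1 of 100 boats: 10 pages

plan_costs = {
    # No selection pushed; Sailors outer, Reserves scanned per outer page.
    "no pushdown": SAILORS + SAILORS * RESERVES,
    # rating>5 applied on the fly to the outer; inner still scanned in full.
    "push rating>5, Sailors outer": SAILORS + sailors_sel * RESERVES,
    # Join order swapped: only 10 qualifying Reserves pages drive the loop.
    "push both, Reserves outer": RESERVES + reserves_sel * SAILORS,
    # Inner Sailors selection written to temp T2 (250 pages) and rescanned.
    "Reserves outer, Sailors in temp": RESERVES + SAILORS + sailors_sel
                                       + reserves_sel * sailors_sel,
    # Inner Reserves selection written to temp T2 (10 pages) and rescanned.
    "Sailors outer, Reserves in temp": SAILORS + RESERVES + reserves_sel
                                       + sailors_sel * reserves_sel,
}

for name, cost in plan_costs.items():
    print(f"{name}: {cost} I/Os")
```

Sorting the dictionary by value makes the ranking of the plans explicit, which is the comparison a cost-based optimizer performs.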

More Alternative Plans (No Indexes)
Plan: π sname ((σ bid=100 Reserves; scan, write to temp T1) ⋈ sid=sid (σ rating>5 Sailors; scan, write to temp T2)), using sort-merge join.
Main difference: sort-merge join.
With 5 buffers, cost of plan:
- Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution) = 1010.
- Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings) = 750.
- Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300.
- Total: 4060 page I/Os.
If we use a BNL join, join cost = 10 + 4*250, total cost = 2770.
We can also "push" projections, but must be careful! T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL is under 250 pages, total < 2000.
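The sort and join terms above can be checked with a short calculation. The sketch below assumes the textbook cost models for external merge sort (2 I/Os per page per pass) and block nested loops (outer read in blocks of B-2 pages); the function names are illustrative.

```python
import math

BUFFERS = 5
T1_PAGES = 10    # Reserves tuples with bid=100
T2_PAGES = 250   # Sailors tuples with rating>5

def external_sort_passes(pages: int, buffers: int) -> int:
    """Pass 0 creates sorted runs of `buffers` pages; every later pass
    merges (buffers - 1) runs at a time until one run remains."""
    runs = math.ceil(pages / buffers)
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes

def external_sort_cost(pages: int, buffers: int) -> int:
    """Each pass reads and writes every page once."""
    return 2 * pages * external_sort_passes(pages, buffers)

def block_nested_loops_cost(outer_pages: int, inner_pages: int, buffers: int) -> int:
    """Read the outer once; scan the inner once per block of (buffers - 2) outer pages."""
    blocks = math.ceil(outer_pages / (buffers - 2))
    return outer_pages + blocks * inner_pages

materialize = (1000 + T1_PAGES) + (500 + T2_PAGES)          # scan + write temps
sort_merge_join = (external_sort_cost(T1_PAGES, BUFFERS)
                   + external_sort_cost(T2_PAGES, BUFFERS)
                   + T1_PAGES + T2_PAGES)                   # sorts + merge
bnl_join = block_nested_loops_cost(T1_PAGES, T2_PAGES, BUFFERS)

sort_merge_total = materialize + sort_merge_join
bnl_total = materialize + bnl_join
print(sort_merge_total, bnl_total)  # 4060 2770
```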

More Alternative Plans: Indexes
Plan: π sname (σ rating>5 ((σ bid=100 Reserves; use hash index, do not write to temp) ⋈ sid=sid Sailors)), using index nested loops with pipelining.
- With a clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
- INL with the outer not materialized. Projecting out unnecessary fields from the outer doesn't help.
- Join column sid is a key for Sailors: at most one matching tuple, so an unclustered index on sid is OK.
- The decision not to push rating>5 before the join is based on the availability of the sid index on Sailors.
- Cost: selection of Reserves tuples (10 I/Os); then, for each, must get the matching Sailors tuple (1000*1.2); total 1210 I/Os.
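The 1210 I/O total can be reproduced as follows. This is a sketch using the numbers on the slide; the 1.2 I/Os per probe is the slide's figure, taken here as the assumed average cost of one index lookup plus fetching the matching Sailors tuple. The function name is illustrative.

```python
def index_nested_loops_cost(outer_pages: int, outer_tuples: int,
                            probe_cost: float) -> float:
    """Read the qualifying outer pages once, then pay one index probe
    per outer tuple to fetch its matching inner tuple."""
    return outer_pages + outer_tuples * probe_cost

qualifying_tuples = 100_000 // 100   # Reserves tuples with bid=100
qualifying_pages = 1000 // 100       # clustered index: 10 contiguous pages

inl_cost = index_nested_loops_cost(qualifying_pages, qualifying_tuples, 1.2)
print(round(inl_cost))  # 1210
```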

Cost-Based Query Optimization Summary
- Find alternate plans.
- Estimate the cost of each alternate plan.
- Pick the query plan with the least cost.
Disadvantage: cost estimation is expensive because a large number of alternate plans is generated. For example, to find the best join order for r1 ⋈ r2 ⋈ ... ⋈ rn, there are (2(n-1))!/(n-1)! different join orders. For n = 7 the number is 665,280; for n = 10 it is more than 17.6 billion!
Solution: there is no need to generate all the join orders. Use dynamic programming to find the least-cost join order.
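The sketch below first evaluates the join-order formula and then shows the dynamic-programming idea on a toy example. The cost model (a plan costs the sum of the sizes of its join results), the cardinalities, and the selectivities are all assumptions made for illustration; a real optimizer uses I/O-based costs and catalog statistics.

```python
import math
from functools import lru_cache
from itertools import combinations

def num_join_orders(n: int) -> int:
    """(2(n-1))! / (n-1)! complete join orders for n relations."""
    return math.factorial(2 * (n - 1)) // math.factorial(n - 1)

def best_join_order(cards, sel):
    """Dynamic programming over subsets of relations.

    cards: {relation name: cardinality}
    sel:   {frozenset({a, b}): join selectivity}; missing pairs count as 1.0
    Returns (cost, plan string) under the toy cost model described above.
    """
    def size(subset):
        result = 1.0
        for r in subset:
            result *= cards[r]
        for a, b in combinations(sorted(subset), 2):
            result *= sel.get(frozenset((a, b)), 1.0)
        return result

    @lru_cache(maxsize=None)
    def best(subset):
        if len(subset) == 1:
            (only,) = subset
            return 0.0, only
        members = sorted(subset)
        first, rest = members[0], members[1:]
        candidates = []
        # Keep `first` on the left so each split is examined only once.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                left = frozenset((first,) + extra)
                right = subset - left
                lcost, lplan = best(left)     # best sub-plans are reused,
                rcost, rplan = best(right)    # never recomputed
                candidates.append((lcost + rcost + size(subset),
                                   f"({lplan} JOIN {rplan})"))
        return min(candidates)

    return best(frozenset(cards))

# Hypothetical statistics for three relations.
cards = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
best_cost, best_plan = best_join_order(cards, sel)
```

Because `best` is memoized per subset, the cheapest plan for each subset is computed once and reused in every larger plan that contains it, which is what lets dynamic programming avoid enumerating all join orders.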

Materialization
- Create and read temporary relations.
- Create implies writing to disk: more page writes.
Example plan (root first):
π name
  σ coursename=advanced DBs
    ⋈ courseid (index-nested loop)
      ⋈ cid (hash join)
        student
        takes
      course

Pipelining (1/2)
- Creating a pipeline of operations reduces the number of read-write operations.
- Implementations:
  - demand-driven: data pull
  - producer-driven: data push
Example plan (root first):
π name
  σ coursename=advanced DBs
    ⋈ courseid (index-nested loop)
      ⋈ cid (hash join)
        student
        takes
      course
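A demand-driven (pull) pipeline maps naturally onto Python generators: each operator asks its child for the next row only when its own consumer asks for one, so no temporary relation is written. This is a minimal sketch; the operator names and the sample rows are assumptions, not part of the lecture.

```python
def scan(rows):
    """Leaf operator: hand out stored rows one at a time."""
    for row in rows:
        yield row

def select(predicate, child):
    """Pull rows from the child and pass on only those that qualify."""
    for row in child:
        if predicate(row):
            yield row

def project(columns, child):
    """Pull rows from the child and keep only the requested columns."""
    for row in child:
        yield tuple(row[c] for c in columns)

# Hypothetical contents of the course relation.
course = [
    {"courseid": 1, "coursename": "advanced DBs"},
    {"courseid": 2, "coursename": "networks"},
    {"courseid": 3, "coursename": "advanced DBs"},
]

pipeline = project(["courseid"],
                   select(lambda r: r["coursename"] == "advanced DBs",
                          scan(course)))

# Nothing has executed yet; rows flow only when the consumer pulls them.
pulled = list(pipeline)
print(pulled)  # [(1,), (3,)]
```

A producer-driven (push) design would invert the control flow: each operator would call its parent with every row it produces instead of waiting to be asked.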

Pipelining (2/2)
- Can pipelining always be used? With any algorithm?
- Cost of R ⋈ S:
  - materialization and hash join: B_R + 3(B_R + B_S)
  - pipelining and indexed nested loop join: N_R * HT_i
Figure: R is the result of student ⋈ cid takes, S is σ coursename=advanced DBs (course), and the two are joined on courseid; the figure labels the inputs as pipelined and materialized.
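The two cost expressions can be compared numerically. In the sketch below the symbols are read in the usual way, which is an assumption since the slide does not define them: B_R and B_S are the page counts of R and S, N_R is the number of tuples in R, and HT_i is the height of the index on S. The sample values are hypothetical.

```python
def materialized_hash_join_cost(b_r: int, b_s: int) -> int:
    """Write R to disk (B_R), then hash join: 3 * (B_R + B_S)."""
    return b_r + 3 * (b_r + b_s)

def pipelined_index_nl_cost(n_r: int, ht_i: int) -> int:
    """One index traversal of height HT_i for every tuple of R."""
    return n_r * ht_i

# Hypothetical sizes: R has 100 pages / 1000 tuples, S has 400 pages,
# and the index on S has height 3.
hash_cost = materialized_hash_join_cost(b_r=100, b_s=400)
inl_cost_small = pipelined_index_nl_cost(n_r=1000, ht_i=3)

# With ten times as many tuples per page in R, pipelining loses clearly.
inl_cost_large = pipelined_index_nl_cost(n_r=10_000, ht_i=3)
```

With these numbers the materialized hash join costs 1600 I/Os while the pipelined plan costs 3000 or 30,000, so pipelining avoids the temporary relation but can force a join algorithm that is more expensive overall.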

Heuristic Optimization
- Cost-based optimization is expensive, even with dynamic programming.
- Solution: reduce the search space using
  1) a randomized algorithm (iterative improvement), or
  2) heuristic optimization.
- Goal: reduce the size of intermediate results.
- Heuristics reduce the number of choices.
- Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
  - Perform selection early (reduces the number of tuples).
  - Perform projection early (reduces the number of attributes).
  - Perform the most restrictive selection and join operations before other similar operations.
  - And many others.
- Some systems use only heuristics; others combine heuristics with partial cost-based optimization.
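The "perform selection early" rule can be expressed as a rewrite on the query tree. The following is a minimal sketch, not a full optimizer: the node classes, the restriction to single-attribute predicates, and the attribute sets are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rel:
    name: str
    attrs: frozenset

@dataclass(frozen=True)
class Join:
    left: object
    right: object

@dataclass(frozen=True)
class Select:
    attr: str     # the single attribute the predicate refers to
    pred: str     # predicate text, treated as opaque
    child: object

def attrs_of(node):
    """Attributes available in the output of a plan node."""
    if isinstance(node, Rel):
        return node.attrs
    if isinstance(node, Join):
        return attrs_of(node.left) | attrs_of(node.right)
    return attrs_of(node.child)

def push_selection(node):
    """Move each Select below a Join when its attribute comes from
    exactly one of the join inputs; otherwise leave it in place."""
    if isinstance(node, Join):
        return Join(push_selection(node.left), push_selection(node.right))
    if isinstance(node, Select):
        child = push_selection(node.child)
        if isinstance(child, Join):
            in_left = node.attr in attrs_of(child.left)
            in_right = node.attr in attrs_of(child.right)
            if in_left and not in_right:
                pushed = push_selection(Select(node.attr, node.pred, child.left))
                return Join(pushed, child.right)
            if in_right and not in_left:
                pushed = push_selection(Select(node.attr, node.pred, child.right))
                return Join(child.left, pushed)
        return Select(node.attr, node.pred, child)
    return node

reserves = Rel("Reserves", frozenset({"sid", "bid", "day", "rname"}))
sailors = Rel("Sailors", frozenset({"sid", "sname", "rating", "age"}))

original = Select("bid", "bid=100",
                  Select("rating", "rating>5", Join(reserves, sailors)))
rewritten = push_selection(original)
```

After the rewrite both selections sit directly on their base relations, so the join sees fewer tuples, which is the effect the first heuristic aims for.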

Heuristic Optimization (Example)

Rule/Hint Based Query Optimization
Oracle allows hints to be embedded in SQL statements to guide the optimizer towards making more efficient choices.
Syntax:
  SELECT /*+ hint */ cola, colb,... FROM tab1, tab2,...
where /* and */ normally delimit a comment, and the + sign causes the comment to be treated as a hint.
Values for hint include:
- ALL_ROWS: optimize the query for best throughput (lowest resource utilization). CBO approach irrespective of the presence of statistics.
- FIRST_ROWS(n): optimize for fastest response time. CBO approach irrespective of the presence of statistics.
- CHOOSE: the optimizer chooses either rule-based or cost-based. If statistics are available (via the ANALYZE TABLE command), cost-based is chosen; otherwise, rule-based is chosen.
- RULE: force the use of the rule-based optimizer.

Rule/Hint Based Query Optimization (contd.)
Other hints in Oracle, for every possible step within execution plans:
- Global hints: rule, first_rows, first_rows_n, all_rows, driving_site
- Table join hints: use_nl, use_hash
- Index hints: specify an index name
- Table access hints: parallel, full, cardinality
- Table join order hints: ordered
The system ignores irrelevant hints, for example:
- Specifying an index hint on a table that has no indexes
- Specifying a parallel hint for an index range scan
- Specifying mutually exclusive hints together (such as index and parallel)

Rule/Hint Based Query Optimization (contd.)
FIRST_ROWS(n)
Used when: typically users are interested in seeing the first few rows.
This hint is ignored for DELETE and UPDATE statements, and for SELECT statements containing any of the following:
- Set operators (UNION, INTERSECT, MINUS, UNION ALL)
- GROUP BY clause
- FOR UPDATE clause
- Aggregate functions
- DISTINCT operator
- ORDER BY clauses, when there is no index on the ordering columns
Example: best response time to retrieve the first 10 rows
  SELECT /*+ FIRST_ROWS(10) */ employee_id, last_name, salary, job_id
  FROM employees
  WHERE department_id = 20;

Rule/Hint Based Query Optimization (contd.)
Hint FULL
- Ignores indexes.
- Blocks are read sequentially.
- Reads larger than a single block can be used, which speeds up a full table scan.
Used when: the table is small, or typically users are interested in seeing the first few rows.
Example:
  /* Ignore index on last_name */
  SELECT /*+ FULL(e) */ employee_id, last_name
  FROM employees e
  WHERE last_name LIKE :b1
A full table scan is applied by default when a function is used on an indexed column in the WHERE clause. For example, with an index on last_name:
  SELECT last_name, first_name
  FROM employees
  WHERE UPPER(last_name) LIKE :b1

Rule/Hint Based Query Optimization (contd.)
Hints in MS SQL Server: query hints can be added using the OPTION clause at the end of the statement.
Syntax:
  SELECT select_list
  FROM table_source
  WHERE search_condition
  GROUP BY group_by_expression
  HAVING search_condition
  ORDER BY order_expression
  OPTION (query options)
where the query options include:
- { HASH | ORDER } GROUP: use hashing or ordering in the GROUP BY or COMPUTE.
- { MERGE | HASH | CONCAT } UNION: use merging, hashing, or concatenating in UNION. If more than one UNION hint is specified, the query optimizer selects the least expensive strategy among them.
- { LOOP | MERGE | HASH } JOIN: use the specified join in the whole query. If more than one join hint is specified, the query optimizer selects the least expensive one among them.
- FORCE ORDER: specifies that the join order indicated by the query syntax is preserved during query optimization.

Distributed Query Processing Methodology
Input: calculus query on distributed relations.
At the control site:
1. Query decomposition (uses the global schema) → algebraic query on distributed relations
2. Data localization (uses the fragment schema) → fragment query
3. Global optimization (uses stats on fragments) → optimized fragment query with communication operations
At the local sites:
4. Local optimization (uses the local schemas) → optimized local queries

MDB Query Processing Architecture Global/local correspondences Allocation and capabilities Local/DBMS mappings

Distributed Query Optimization

INGRES Algorithm
1. Decompose each multi-variable query into a sequence of mono-variable queries with a common variable.
2. Process each by a one-variable query processor:
   - Choose an initial execution plan (heuristics).
   - Order the rest by considering intermediate relation sizes.
No statistical information is maintained.

INGRES Algorithm (contd.)
1. Decompose each multi-variable query into a sequence of mono-variable queries with a common variable.
2. Process each by a one-variable query processor:
   - Choose an initial execution plan (heuristics).
   - Order the rest by considering intermediate relation sizes.
   - Apply tuple substitution to integrate query q_(i-1) into q_i.
No statistical information is maintained.

System R* Algorithm
1. Simple (i.e., mono-relation) queries are executed according to the best access path.
2. Execute joins:
   - Determine the possible orderings of joins.
   - Determine the cost of each ordering.
   - Choose the join ordering with minimal cost.
   - Ship whole / fetch as needed (semijoin).

- Ordering joins: Distributed INGRES, System R*
- Semijoin ordering: SDD-1
Join Ordering Better if

SDD-1
- Based on the hill climbing algorithm
- Semijoins
- No replication
- No fragmentation
- Minimize total time or response time
- Does not consider the cost of transferring data from the result site to the user site
- Ignores local processing cost