Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Evaluation of Relational Operations
Yanlei Diao, UMass Amherst
Slides Courtesy of R. Ramakrishnan and J. Gehrke

Overview of Query Processing
[Figure: DBMS architecture. Query Processor: Query Parser, Query Rewriter, Query Optimizer, Query Executor. Storage Manager: Access Methods, Buffer Manager, Space Manager, Lock Manager, Log Manager. DB.]

Outline
- Set operators
- Group By aggregation

Why Sort?
- Important utility in DBMS:
  - Sorting is the first step in bulk loading a B+ tree index.
  - Requests for data in sorted order (e.g., ORDER BY), e.g., find students in decreasing order of gpa.
  - The sort-merge join algorithm involves sorting.
  - Eliminating duplicates in a collection of records (e.g., SELECT DISTINCT).
- Problem: sort 1GB of data with 1MB of RAM.
  - Limited memory. The key is to minimize the number of I/Os!

Two-Way External Merge Sort
- Assume that a file has N data pages to sort.
- Pass 1: Read a page, sort it, write it. Only one buffer page is used.
- Pass 2, 3, etc.: Merge two sorted subfiles. Three buffer pages are used.

2-Way Sort: Requires 3 Buffer Pages
[Figure: main memory buffers, with pages INPUT 1 and INPUT 2 being merged.]

Two-Way External Merge Sort
- Divide and conquer: sort subfiles (runs) and merge.
- A file of N pages:
  Pass 1: N sorted runs of 1 page each
  Pass 2: N/2 sorted runs of 2 pages each
  Pass 3: N/4 sorted runs of 4 pages each
  ...
  Pass P+1: 1 sorted run of $2^P$ pages
  $2^P \ge N \Rightarrow P \ge \log_2 N$

Example (pages separated by "|", runs by "||"):
  Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
  PASS 1 (1-page runs):  3,4 || 2,6 || 4,9 || 7,8 || 5,6 || 1,3 || 2
  PASS 2 (2-page runs):  2,3 | 4,6 || 4,7 | 8,9 || 1,3 | 5,6 || 2
  PASS 3 (4-page runs):  2,3 | 4,4 | 6,7 | 8,9 || 1,2 | 3,5 | 6
  PASS 4 (8-page runs):  1,2 | 2,3 | 3,4 | 4,5 | 6,6 | 7,8 | 9
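The pass structure above can be traced with a short simulation. This is a minimal sketch, not the slides' implementation: Python lists stand in for disk pages and runs, no I/O is modelled, and the function name `two_way_external_sort` is made up for illustration.

```python
from heapq import merge


def two_way_external_sort(pages):
    """Return the runs produced by each pass of a two-way external merge sort.

    `pages` is a list of pages, each page a list of keys. Lists stand in for
    disk pages and runs (an assumption of this sketch); I/O is not modelled.
    """
    # Pass 1: sort every page on its own, giving N one-page runs.
    runs = [sorted(page) for page in pages]
    passes = [runs]
    # Later passes: merge neighbouring runs pairwise until one run remains.
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes.append(runs)
    return passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
for number, runs in enumerate(two_way_external_sort(pages), start=1):
    print(f"Pass {number}: {runs}")
```

For the 7-page example file this takes four passes, in line with the bound $P \ge \log_2 N$ plus the initial sorting pass.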

Two-Way External Merge Sort
- Divide and conquer: sort subfiles (runs) and merge.
- Each pass reads and writes all N pages of the file: 2N I/Os.
- Number of passes: $\lceil \log_2 N \rceil + 1$
- So the total cost is: $2N \left( \lceil \log_2 N \rceil + 1 \right)$
[Figure: the 7-page input file 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2 sorted in four passes, producing 1-page, 2-page, 4-page and 8-page runs.]

General External Merge Sort
- Given B (>3) buffer pages, how can we utilize them?
- Pass 1: Use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.
- Pass 2, 3, etc.: Merge B-1 runs.
[Figure: B main memory buffers, with pages INPUT 1, INPUT 2, ..., INPUT B-1 being merged.]

Cost of External Merge Sort
- Number of passes = $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
- Cost = $2N \cdot \left( 1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil \right)$
- E.g., with B=5 buffer pages, sort a file of N=108 pages:

  Pass 1
    Example: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
    General: ⌈N/B⌉ sorted runs of B pages each
  Pass 2
    Example: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
    General: ⌈N/B⌉/(B-1) sorted runs of B(B-1) pages each (except the last run)
  Pass 3
    Example: ⌈6/4⌉ = 2 sorted runs, 80 pages and 28 pages
    General: ⌈N/B⌉/(B-1)^2 sorted runs of B(B-1)^2 pages (except the last run)
  Pass 4
    Example: sorted file of 108 pages
    General: ⌈N/B⌉/(B-1)^3 sorted runs of B(B-1)^3 (≥ N) pages (except the last run)

Number of Passes of External Sort

  N               B=3  B=5  B=9  B=17  B=129  B=257
  100               7    4    3     2      1      1
  1,000            10    5    4     3      2      2
  10,000           13    7    5     4      2      2
  100,000          17    9    6     5      3      3
  1,000,000        20   10    7     5      3      3
  10,000,000       23   12    8     6      4      3
  100,000,000      26   14    9     7      4      4
  1,000,000,000    30   15   10     8      5      4

- Cost is (1+P)*2N, almost linear in N.

Using B+ Trees for Sorting
- Scenario: the table to be sorted has a B+ tree index on the sorting attribute(s).
  - Retrieve students in increasing order of age.
  - There is a B+ tree on Students.age.
- Idea: records can be retrieved in order by traversing the leaf pages.
- Is this a good idea? Cases to consider:
  - B+ tree is clustered: good idea!
  - B+ tree is not clustered: could be a very bad idea!

Clustered B+ Tree Used for Sorting
- Alternative 1: cost of retrieving all leaf pages.
- Alternative 2: also the cost of retrieving data records, but each page is read just once.
- Almost always better than external sorting!
[Figure: Index (directs search) on top, Data Entries ("Sequential") in the middle, Data Records at the bottom.]
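The pass-count and cost formulas for general external merge sort can be checked numerically. The sketch below is illustrative only: the function names are made up, and repeated integer ceiling division is used in place of a floating-point logarithm, which gives the same value as $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$ without rounding concerns.

```python
def external_sort_passes(n_pages, n_buffers):
    """Number of passes to sort n_pages using n_buffers (>= 3) buffer pages."""
    runs = -(-n_pages // n_buffers)          # ceil(N / B) runs after pass 1
    passes = 1
    while runs > 1:
        runs = -(-runs // (n_buffers - 1))   # each merge pass combines B-1 runs
        passes += 1
    return passes


def external_sort_cost(n_pages, n_buffers):
    """I/O cost: every pass reads and writes each of the N pages once."""
    return 2 * n_pages * external_sort_passes(n_pages, n_buffers)


print(external_sort_passes(108, 5))   # the B=5, N=108 example
print(external_sort_cost(108, 5))
```

For B=5 and N=108 this reports 4 passes and 2 * 108 * 4 = 864 I/Os, and it agrees with the entries of the pass-count table, e.g. 7 passes for N=100 with B=3.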

Unclustered B+ Tree Used for Sorting
- Alternative 2: each data entry contains the rid of a data record. In general, one I/O per data record!
- Worst-case I/O: RN
  - R: # records per page
  - N: # pages in file
[Figure: Index (directs search) on top, Data Entries ("Sequence set") in the middle, Data Records at the bottom.]

External Sorting vs. Unclustered Index
- For sorting, B=1,000.
- R: # of records per page. R=100 is the more realistic value.
- Worst-case numbers (RN) here!

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Cost metric: # I/Os (page accesses)
- Goal: estimate I/O costs and choose the best plan

Using an Index for Selections
SELECT * FROM Sailors S WHERE S.rating > 8
SELECT * FROM Sailors S WHERE S.sname = 'Harry'
- Cost of selection =
  1. Top-down search in the tree;
  2. Scan leaf nodes to find data entries;
  3. Fetch records from the file (could be large without clustering).

Cost Factors of Selection
SELECT * FROM Sailors S WHERE S.rating > 8
- Cost 1 (top-down search): 3-4 I/Os, depending on buffer management.
- Cost 2 (scanning leaf nodes):
  - Reduction Factor (RF) of a predicate P: percentage of tuples satisfying P; e.g., RF=20% for rating > 8.
  - Cost of scanning leaf nodes: how many leaf nodes to visit?
  - # leaf pages: if a data entry in the index is 1/5 the size of a tuple in the file, the matching data entries need 1/5 of the space needed to store all matching tuples.
  - Space to store matching tuples: 500*20% = 100 pages. So, 20 leaf pages, or 20 I/Os!

Cost Factors of Selection
[Figure: CLUSTERED vs. UNCLUSTERED index. In both, data entries (index file) point to data records (data file).]
- Cost 3 (fetching data records): depends on # matching tuples and clustering.
  - rating > 8: 20% of tuples qualify, i.e. 100 pages, 8,000 tuples.
  - Retrieving records from the file:
    - Clustered index: 100 I/Os.
    - Unclustered index: worst case 1 I/O per tuple; 8,000 I/Os here!
    - Unclustered index + sorting of data entries on rid: 500 I/Os.

Statistics in System Catalog
- Statistics about each relation (R) and index (I):
  - Relation cardinality: # tuples (NTuples) in R
  - Relation size: # pages (NPages) in R
  - Index cardinality: # distinct values (NKeys) in I
  - Index size: # leaf pages (NLPages) in I
  - Index height: # nonleaf levels (Height) of I
  - Index range: low/high key values (Low/High) in I

Cost Estimates for Selections
- Sequential scan of file: NPages(R)
- Index I on a candidate key (Alt. 2) matches a selection:
  - Cost of top-down search = Height(I) of the B+ tree
  - Cost of scanning leaf pages = 1
  - Cost of record retrieval = 1
- Clustered index I (Alt. 2) matches a selection:
  Cost of top-down search + RF * NLPages(I) + RF * NPages(R)
- Non-clustered index I (Alt. 2) matches a selection:
  Cost of top-down search + RF * NLPages(I) + min(RF * NTuples(R), NPages(R))

General Selections
- Boolean combination of predicates using AND and OR.
- Conjunctive Normal Form (CNF), e.g.,
  pred1 AND (pred3 OR pred4)
  (pred1 OR pred2) AND (pred3 OR pred4)
- File scan always works for general selections.
- Index scan works when the matched predicate is a conjunct of the CNF.
  E.g., an index matching pred1 can be used for pred1 AND (pred3 OR pred4).

Conjunctive Predicates Only
- CNF without OR: e.g., pred1 AND pred2 AND pred3
- Retrieve tuples using the most selective access method: the file scan or index scan that gives the smallest I/O cost.
- Apply the remaining terms that don't match the index on the fly. These other terms do not affect I/O cost.
- E.g., day<8/9/94 AND bid=5 AND sid=3:
  - B+ index on <bid, sid>: check day<8/9/94 on the fly.
  - B+ tree index on day: apply bid=5 and sid=3 on the fly.

Improvement: Intersection of Rids
- With 2+ matching indexes (Alternative 2):
  1. Get sets of rids of data records using each index.
  2. Intersect these sets of rids.
  3. Retrieve the records and apply any remaining terms.
- E.g., day<8/9/94 AND bid=5 AND sid=3, with a B+ tree index on day and a B+ index on sid, both using Alt. 2:
  1. Retrieve rids of records satisfying day<8/9/94 using the first index, and rids of records satisfying sid=3 using the second index;
  2. intersect these rids;
  3. retrieve records, check bid=5.
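The index selection cost estimates listed earlier can be evaluated for the rating > 8 example on Sailors. This is a rough sketch under stated assumptions: the function name is hypothetical, the top-down search is taken as 3 I/Os (the slides give 3-4), NLPages(I) is taken as 100 (data entries at 1/5 the tuple size, over 500 data pages), and rounding up to whole pages is added here rather than taken from the slides.

```python
from fractions import Fraction
from math import ceil


def index_selection_cost(search_cost, rf, n_leaf_pages, n_pages, n_tuples, clustered):
    """Estimated I/Os for a selection through a B+ tree index (Alt. 2).

    Follows: search + RF * NLPages(I) + record-fetch cost, where the fetch
    cost is RF * NPages(R) if clustered, else min(RF * NTuples(R), NPages(R)).
    """
    leaf_scan = ceil(rf * n_leaf_pages)
    if clustered:
        fetch = ceil(rf * n_pages)
    else:
        fetch = min(ceil(rf * n_tuples), n_pages)
    return search_cost + leaf_scan + fetch


rf = Fraction(1, 5)  # RF = 20% for rating > 8; Fraction avoids float rounding
sailors = dict(n_leaf_pages=100, n_pages=500, n_tuples=40_000)

print("file scan:  ", sailors["n_pages"])
print("clustered:  ", index_selection_cost(3, rf, clustered=True, **sailors))
print("unclustered:", index_selection_cost(3, rf, clustered=False, **sailors))
```

Under these assumptions the clustered index costs 3 + 20 + 100 = 123 I/Os, while the unclustered estimate is 3 + 20 + 500 = 523 I/Os, slightly more than the 500 I/Os of a plain file scan.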

Creating Indexes in SQL
CREATE [UNIQUE | FULLTEXT] INDEX index_name ON table_name (index_col1, index_col2, ...);
DROP INDEX index_name;
- What are the disadvantages of creating many indexes?

Equality Joins
SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid;
SELECT M.*, A.name FROM Movies M, Actors A WHERE M.mid = A.movie_id AND A.name = 'Brad Pitt';
- R ⋈ S, natural join. Very common operation!
- Semantics: cross product (×) followed by selection (σ).
  - If R or S is large, R × S followed by a selection is inefficient.
  - Must be carefully optimized.

Cost Estimation for Equality Joins
- Cost metric: # of I/Os. Ignore output cost in analysis.
- R: M pages, T_R tuples per page.
- S: N pages, T_S tuples per page.

Our Running Example
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
- Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages (N).
- Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages (M).
- Cost metric: # I/Os
SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid

Page-Oriented Nested Loops Join
- A baseline approach:
  foreach page of R do
    foreach page of S do
      write out each matching pair <r, s>   // r is in R-page, s is in S-page
- Cost: M + M * N = 1000 + 1000*500 = 501,000 I/Os. At 5 ms per I/O, the join will take about 0.7 hours.
- How many buffer pages do we need?
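The page-oriented nested loops join above can be sketched in a few lines. The code is an illustration under assumptions: relations are lists of pages, pages are lists of tuples, a counter stands in for real page I/O, and the tiny Reserves/Sailors pages are made-up sample data.

```python
def page_nested_loops_join(r_pages, s_pages, r_key, s_key):
    """Join R and S page by page; return (matching pairs, page reads)."""
    results, page_reads = [], 0
    for r_page in r_pages:
        page_reads += 1                  # read one page of the outer R
        for s_page in s_pages:
            page_reads += 1              # re-read every page of the inner S
            for r in r_page:
                for s in s_page:
                    if r[r_key] == s[s_key]:
                        results.append((r, s))
    return results, page_reads


# Hypothetical miniature relations: (sid, bid) and (sid, sname), 2 pages each.
reserves = [[(28, 103), (31, 101)], [(58, 103)]]
sailors = [[(22, "dustin"), (28, "yuppy")], [(31, "lubber"), (58, "rusty")]]
pairs, reads = page_nested_loops_join(reserves, sailors, 0, 0)
print(len(pairs), "matches,", reads, "page reads")   # reads = M + M*N

# The running example: M = 1000 (Reserves), N = 500 (Sailors), 5 ms per I/O.
ios = 1000 + 1000 * 500
print(ios, "I/Os, about", round(ios * 0.005 / 3600, 2), "hours")
```

With 2 pages on each side the counter reaches 2 + 2*2 = 6 page reads, the same M + M*N pattern that gives 501,000 I/Os for the running example.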

Block Nested Loops Join
- How can we utilize additional buffer pages?
  - If the smaller relation, say R, fits in memory, use R as outer and read the inner S only once.
  - Otherwise, read a big chunk of R each time, reducing the number of times S is read.
- Block Nested Loops Join:
  - The smaller relation R as outer, the other relation S as inner.
  - Buffer allocation:
    - 1 buffer for scanning the inner S
    - 1 buffer for output
    - All remaining buffers for holding a "block" of outer R

Block Nested Loops Join (Contd.)
  foreach block in R do
    (build a hash table on R-block)
    foreach page in S do
      foreach matching tuple r in R-block, s in S-page do
        add <r, s> to result
[Figure: R & S on disk; in memory, a block of R (B-2 pages, held in a hash table of size B-2 pages), an input buffer for S and an output buffer; the output is the join result.]

Cost of Block Nested Loops Join
- Cost: size of outer + #outer blocks * size of inner
- With B buffer pages available:
  Cost = size of outer + ⌈size of outer / (B-2)⌉ * size of inner
- E.g., B=102, Sailors S = 500 pages, Reserves R = 1000 pages.
  - What is the cost if S is outer and R is inner?
    A block = B-2 = 100 pages.
    Cost = 500 + ⌈500/100⌉ * 1000 = 5,500 I/Os.
  - What is the cost if we swap R and S?
    Cost = 1000 + ⌈1000/100⌉ * 500 = 6,000 I/Os.
  - Which relation should be the outer for smaller cost?

Index Nested Loops Join
- Given an index on the join column of one relation, say S:
  foreach tuple r in R do
    foreach tuple s in S where r and s match on the join column (via index lookup) do
      add <r, s> to result
- Cost: M + (M * T_R * cost of finding matching S tuples)
- Cost of equality search using the S index:
  - Cost of top-down search: 2-4 I/Os for a B+ tree.
  - Cost of scanning leaf page: 1
  - Cost of retrieving matching S tuples: depends on clustering.

Examples of Index Nested Loops (Sailors ⋈ Reserves)
- Sailors: tuple size is 50 bytes, 80 tuples per page, 500 pages.
- Reserves: tuple size is 40 bytes, 100 tuples per page, 1000 pages.
- B+ tree (Alt. 2) on sid of Reserves:
  - Scan Sailors: 500 page I/Os, 80*500 = 40,000 tuples.
  - For each Sailors tuple: # of matching Reserves tuples = ?
    Uniform distribution: 2.5 Reserves tuples per sailor (100,000/40,000).
  - Assume that 1 I/O is needed to search top down in the B+ tree and 1 I/O is needed to read the leaf node with the data entries.
  - Cost of retrieving the tuples is 1 or 2.5 I/Os (clustered or not).
  - Total: 500 + 80*500*(3~4.5) = 120,500~180,500 I/Os.
  - Worse than block nested loops join.

Join algorithms:
- Block Nested Loops Join
- Index Nested Loops Join
- Sort-Merge Join
- Hash Join
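The block nested loops and index nested loops cost formulas can be compared side by side on the running example. The helper names below are hypothetical, and the per-probe costs of 3 and 4.5 I/Os are the ones assumed in the example above (1 for the top-down search, 1 for the leaf page, then 1 or 2.5 to fetch the matching tuples).

```python
def block_nlj_cost(outer_pages, inner_pages, n_buffers):
    """size of outer + ceil(size of outer / (B-2)) * size of inner."""
    blocks = -(-outer_pages // (n_buffers - 2))   # integer ceiling division
    return outer_pages + blocks * inner_pages


def index_nlj_cost(outer_pages, tuples_per_page, probe_cost):
    """Scan the outer once, then one index probe per outer tuple."""
    return outer_pages + outer_pages * tuples_per_page * probe_cost


print(block_nlj_cost(500, 1000, 102))    # Sailors as outer
print(block_nlj_cost(1000, 500, 102))    # Reserves as outer
print(index_nlj_cost(500, 80, 3))        # clustered index on Reserves.sid
print(index_nlj_cost(500, 80, 4.5))      # unclustered index on Reserves.sid
```

The numbers reproduce the figures quoted above: 5,500 versus 6,000 I/Os for the two block nested loops orderings, and 120,500 to 180,500 I/Os for index nested loops.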

Equi-Join (R ⋈ S) using Sort-Merge
- Sort R and S on the join column using external sorting.
- Merge R and S on the join column, output result tuples.

  Sailors:
  sid  sname   rating  age
  22   dustin  7       45.0
  28   yuppy   9       35.0
  31   lubber  8       55.5
  44   guppy   5       35.0
  58   rusty   10      35.0

  Reserves:
  sid  bid  day       rname
  28   103  12/4/96   guppy
  28   103  11/3/96   yuppy
  31   101  10/10/96  dustin
  31   102  10/12/96  lubber
  31   101  10/11/96  lubber
  58   103  11/12/96  dustin

The Merge Algorithm (after sorting)
Repeat until either R or S is finished:
- Scanning:
  - Advance the scan of R until current R-tuple >= current S-tuple;
  - advance the scan of S until current S-tuple >= current R-tuple;
  - do this until current R-tuple = current S-tuple.
- Matching:
  - Match all R tuples and S tuples with the same value (called the R-group and S-group of the current value).
  - Output <r, s> for all pairs of such tuples.

I/O Cost of Sort-Merge Join
- Cost: Sorting_cost(R) + Sorting_cost(S) + Merging_cost
  - Sorting_cost(N): $2N \cdot \left( 1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil \right)$
  - Merging_cost ∈ [M+N, M*N]
    - M+N: foreign key join with the parent (referenced) relation as inner.
    - M*N: uncommon but possible. When?

Refinement of Sort-Merge Join
- A two-pass algorithm for primary key-foreign key join?
[Figure: B memory buffer pages, with pages INPUT 1, INPUT 2, ..., INPUT B-1 being merged.]
- Key observation: repeated merging phases.
  - Sorting of R and S has respective merging phases.
  - Join of R and S also has a merging phase.
  - Combine all these merging phases!

Two-Pass Sort-Merge Join
When memory is sufficiently large (detailed later):
- Pass 1, Sorting: sort subfiles of R and S individually.
- Pass 2, Merging: merge sorted runs of R and S.
  - Merge sorted runs of R,
  - merge sorted runs of S, and
  - compare R and S tuples using the join condition.

Merging in Two-Pass Sort-Merge
[Figure: Run1, Run2, ..., RunK of Relation R and Run1, Run2, ..., RunK of Relation S are read into B memory buffer pages and merged into the join results.]
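The merge algorithm described above (scanning, then matching R-groups with S-groups) can be sketched as follows. This is an in-memory illustration only, assuming both inputs are Python lists of tuples already sorted on the join column; the function name `merge_join` is made up, and the data is the Sailors/Reserves instance from the slide.

```python
def merge_join(r, s, r_key, s_key):
    """Join two lists of tuples that are already sorted on their join columns."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][r_key] < s[j][s_key]:
            i += 1                       # scanning: advance R
        elif r[i][r_key] > s[j][s_key]:
            j += 1                       # scanning: advance S
        else:
            value = r[i][r_key]
            i_end = i                    # find the end of the R-group
            while i_end < len(r) and r[i_end][r_key] == value:
                i_end += 1
            j_end = j                    # find the end of the S-group
            while j_end < len(s) and s[j_end][s_key] == value:
                j_end += 1
            # matching: output every pair from the two groups
            out.extend((rt, st) for rt in r[i:i_end] for st in s[j:j_end])
            i, j = i_end, j_end
    return out


sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]

for sailor, reservation in merge_join(sailors, reserves, 0, 0):
    print(sailor[0], sailor[1], reservation[1], reservation[2])
```

Because sid is a key of Sailors, every R-group here has one tuple and each relation is scanned once, which corresponds to the M+N end of the merging cost range.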

Merging in Two-Pass Sort-Merge
[Figure: sorted runs of Relation R and Relation S (values shown: 9, 3, 1; 11, 4, 2; 8, 7, 3; 25, 12, 2; 13, 6, 1; 21, 14, 3) are read into B memory buffer pages; the current smallest values from R and S are compared to produce the join results.]

Memory Requirement
- Memory requirement for two-pass sort-merge:
  - The sorting pass produces sorted runs of size B. So, the number of runs per relation is $\lceil M/B \rceil$ or $\lceil N/B \rceil$.
  - The merging pass holds sorted runs of both relations and an output buffer. So,
    $(M+N)/B + 1 \le B \Rightarrow B > \sqrt{M+N}$

Cost of Two-Pass Sort-Merge Join
- Cost: read & write each relation in the sorting pass + read each relation in the merging pass (+ writing result tuples, ignored here) = 3 (M+N)!
- In our running example, a total of 4500 I/Os using sort-merge, around 22.5 seconds. Compare to 0.7 hours with Page NLJ!

Equi-Join using Hash-Join
- Idea: for an equi-join, partition both R and S using a hash function such that R tuples in partition i will only match S tuples in partition i.
- Phase 1, Partitioning: partition both relations using hash function h (Ri tuples will only match Si tuples).
[Figure: the original relation is read through one INPUT page, hash function h distributes tuples over output buffers 1, 2, ..., B-1 (B memory buffer pages in total), which are written out as partitions 1, 2, ..., B-1.]

Hash-Join
- Phase 2, Probing: given sufficiently large memory,
  - Read in partition Ri (build a hash table using h2 ≠ h).
  - Scan partition Si, one page at a time, and search for matches.
  - Use the smaller relation as the building relation in the probing phase.
[Figure: partitions of R & S on disk; in memory (B buffer pages), a hash table built with h2 over partition Ri (k ≤ B-2 pages), an input buffer for Si and an output buffer; the output is the join result.]

Memory Requirement and I/O Cost
- Partitioning: # partitions in memory ≤ B-1.
  Probing: to fit each Ri in memory, size of partition ≤ B-2.
  A little more memory is needed to build the hash table, but it is ignored here.
- Assuming uniformly sized partitions, with L = min(M, N):
  $L / (B-1) \le B-2$, so roughly $\lceil \sqrt{L} \rceil + 1 < B < L + 2$
- Partitioning: reads + writes both relations; 2(M+N) I/Os.
  Probing: reads both relations; M+N I/Os.
  Total cost = 3(M+N).
- Difference from block nested loops join?
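The two phases of hash join (partitioning with h, then building and probing with h2) can be sketched in memory as below. This is a hedged illustration, not the slides' implementation: Python lists stand in for on-disk partitions, `hash(key) % n_partitions` plays the role of h, the built-in dict plays the role of the h2 hash table, and the function name and sample tuples are assumptions.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key, n_partitions):
    """Two-phase hash join of tuple lists r (build side) and s (probe side)."""
    # Phase 1, partitioning: tuples of Ri can only match tuples of Si.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[hash(t[r_key]) % n_partitions].append(t)
    for t in s:
        s_parts[hash(t[s_key]) % n_partitions].append(t)

    # Phase 2, probing: build a table on each Ri, then scan the matching Si.
    out = []
    for p, r_part in r_parts.items():
        table = defaultdict(list)            # stands in for the h2 hash table
        for t in r_part:
            table[t[r_key]].append(t)
        for t in s_parts.get(p, []):
            for match in table.get(t[s_key], []):
                out.append((match, t))
    return out


sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (44, "guppy"), (58, "rusty")]
reserves = [(28, 103), (28, 103), (31, 101), (31, 102), (31, 101), (58, 103)]
print(sorted(hash_join(sailors, reserves, 0, 0, n_partitions=3)))

# I/O cost of the running example: partitioning 2(M+N) plus probing (M+N).
print(3 * (1000 + 500))
```

Sailors, the smaller relation, is used as the building side, as suggested above; the final line evaluates 3(M+N) = 4500 I/Os for the running example.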

Effect of the Hash Function
- What if hash function h does not partition uniformly?
  - One or more R partitions may not fit in memory.
  - Can apply hash-join recursively to this R-partition and the corresponding S-partition.
  - Higher cost, of course.

Two-Pass Sort-Merge vs. Hash Join
- Sort-Merge Join vs. Hash Join:
  - Given a minimum amount of memory (what is this, for each?), both have a cost of 3(M+N) I/Os.
  - Hash Join is superior if relation sizes differ greatly. Assuming M < N, what if $\sqrt{M} < B < \sqrt{M+N}$?
  - Sort-Merge is less sensitive to data skew.
  - Sort-Merge yields a sorted relation.
  - More advanced hash algorithms exist as the state of the art.
- The above discussion is mainly for common joins (e.g., primary key-foreign key joins) that avoid the O(M*N) worst case.

General Join Conditions
- Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname):
  - Block NL works fine.
  - For Index NL, use an index on <sid, sname> if available; or use an index on sid or sname and check the other predicate on the fly.
  - For Sort-Merge and Hash Join, sort/partition on the combination of the two join columns.

General Join Conditions
- Inequality conditions (e.g., R.rating < S.rating):
  - Block NL still works well.
  - For Index NL, need a B+ tree index for inequality searches.
    - Range probes on inner; the number of matches is likely to be much higher than for equality joins.
    - A clustered index is much preferred.
  - Hash Join and Sort-Merge Join are hard to apply.

Aggregate Operations (AVG, MIN, etc.)
SELECT MIN(S.age) FROM Sailors S WHERE S.rating = 10
- Aggregation without grouping:
  - File scan: in general, requires scanning the relation.
  - Index scan: if a B+ tree matches a predicate, use an index scan to retrieve the tuples and then compute the aggregate.
  - Index-only scan: if a tree index's search key includes all attributes in the SELECT and WHERE clauses, e.g., a B+ tree on <rating, age>.

Group By - Aggregate Operation
- Our running example:
  SELECT MIN(S.age) FROM Sailors S WHERE S.rating > 5 GROUP BY S.rating
- Click stream analysis:
  Clicks(time, url, referral_url, user_id, geo_info)
  SELECT COUNT(*) FROM Clicks GROUP BY url;

Group By - Aggregate Operation
- Aggregation with grouping (GROUP BY):
  - Single-relation sorting
  - Single-relation hashing
- Exercise:
  - Design a two-phase algorithm for Group By aggregation.
  - Analyze the minimum memory size for this to work.

Group By - Aggregate Operation
- Aggregation with grouping (GROUP BY):
  - Single-relation sorting: sort by the group-by attribute(s); compute the aggregate for each group in the last merging phase.
  - Single-relation hashing: hash on the group-by attribute(s); compute the aggregate using an in-memory hash table for each partition.
  - Index-only scan: if a tree index's search key includes all attributes in the SELECT, WHERE and GROUP BY clauses, e.g., a B+ tree on <rating, age> for the above query.

Questions
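One possible shape for the single-relation hashing approach is sketched below for the running query (MIN(age) per rating, for rating > 5). It is only one way to answer the exercise, under assumptions: lists stand in for on-disk partitions, `hash(...) % n_partitions` is the partitioning function, the function name is hypothetical, and the last sailor tuple (71, zorba) is added here purely so that one group has more than one member.

```python
from collections import defaultdict


def hash_group_min(rows, group_key, agg_key, n_partitions, predicate=lambda row: True):
    """Two-phase GROUP BY with MIN: partition on the group key, then aggregate."""
    # Phase 1: apply the WHERE predicate and hash-partition on the group key,
    # so that all tuples of a group land in the same partition.
    partitions = defaultdict(list)
    for row in rows:
        if predicate(row):
            partitions[hash(row[group_key]) % n_partitions].append(row)

    # Phase 2: per partition, keep one running MIN per group in a hash table.
    result = {}
    for part in partitions.values():
        groups = {}
        for row in part:
            g = row[group_key]
            groups[g] = min(groups.get(g, row[agg_key]), row[agg_key])
        result.update(groups)
    return result


# (sid, sname, rating, age); the final tuple is a hypothetical addition.
sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0), (71, "zorba", 10, 16.0)]

print(hash_group_min(sailors, group_key=2, agg_key=3, n_partitions=4,
                     predicate=lambda row: row[2] > 5))
```

Only one running value per group is kept in phase 2, so the in-memory table for a partition grows with the number of distinct groups in it rather than with the number of tuples.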