Module 8: Evaluation of Relational Operators

Joella Hamilton

Module Outline

8.1 The DBMS's runtime system
8.2 General remarks
8.3 The selection operation
8.4 The projection operation
8.5 The join operation
8.6 A Three-Way Join Operator
8.7 Other Operators
8.8 The impact of buffering
8.9 Managing long pipelines of relational operators

[Figure: DBMS architecture. Components shown: Web Forms, Applications, SQL Interface, SQL Commands; Query Processor with Parser, Optimizer, Plan Executor, Operator Evaluator; Files and Index Structures, Buffer Manager, Disk Space Manager; Transaction Manager and Lock Manager (Concurrency Control); Recovery Manager; Database with Index Files, Data Files, System Catalog. A "You are here!" marker highlights the part covered by this module.]

8.1 The DBMS's runtime system

In some sense we can consider the implementation of the relational operators as a database's runtime system:

- The query plan (a network of relational operators) constitutes the program to execute. [1]
- The relational operators act on files on disk (relations) and implement the behaviour of the plan. [2]

The efficient evaluation of the relational operators should be carefully studied and tuned:

- Each operator implements only a small step of the overall query plan (thus, a plan for a query of modest complexity may easily contain up to 100 operators).
- The set of relational operators is designed to be small; each operator fulfills multiple tasks.

[1] Compare this, e.g., to Java byte codes.
[2] Again, in the Java world, this would be comparable to the Java VM.

Representation of Query Plans

As an internal representation of queries, a DBMS typically uses an operator tree whose internal nodes represent logical (e.g., algebra-style) or physical (e.g., concrete implementation algorithms) operators. Directed arcs connect arguments (inputs) to operators and operators to their output.

As a result of query optimization, arguments that are used in multiple places may be connected to several operators, so we may end up with networks of operators rather than trees.

[Figure: example operator network over the input relations R, S, and T, including a sort operator.]

Logical vs. Physical Operators

A typical DBMS provides several implementations for a single relational operator (i.e., instead of one operator we have several variants of it). For equivalent input file(s), all variants produce an equivalent output file.

Equivalent? What do you think is precisely meant by "equivalent" here? Why don't we just say "identical"?

Terminology: the variants are the different physical operators implementing the same logical operator. We will discuss physical operators in this chapter.

The query optimizer analyzes a given query plan based on its knowledge of the system internals, statistics, and ongoing bookkeeping, and selects the best specific variant for each operator. During query optimization, logical operators are replaced by physical ones.

Physical Properties

A specific variant may be tailored to exploit several physical properties of the system:

- the presence or absence of indexes on the input file(s),
- the sortedness of the input file(s),
- the size of the input file(s),
- the available space in the buffer pool (cf. external sorting in Chapter 7),
- the buffer replacement policy,
- ...

Example: The optimizer has marked each edge of the plan to indicate whether the records flowing over this edge are sorted with respect to some sort key k or not (sorted: s, unsorted: u).

[Figure: operator network over R, S, and T with a sort operator; every edge is annotated with s or u.]

In general, the query optimizer may quite heavily transform the original plan to enable the use of the most efficient physical operator variants.

Example (assume that the physical operators can exploit sortedness of their input(s); e.g., one of them might be a sort-merge join):

[Figure: transformed plan over R, S, and T. The edge leaving S is annotated u and feeds a sort operator; all other edges are annotated s.]

8.2 General remarks

8.2.1 The system catalog

A relational table can be stored in different file structures, and we can create one or more indexes on each table, which are also stored in files. Conversely, a file may contain data from one (or more) table(s) or the entries of an index. Such data is referred to as (primary) data in the database.

A relational DBMS maintains information about every table and index it contains. Such descriptive information is itself stored in a collection of special tables, the so-called catalog tables, also known as the system catalog, the data dictionary, or just the catalog.

Catalog information includes relation and attribute names, attribute domains, integrity constraints, access privileges, and much more. The query processor (or the query optimizer) also draws a lot of information from the system catalog, e.g.:

- the file structure of each table,
- the availability of indexes,
- the number of tuples in each relation,
- the number of pages in each file,
- ...

We'll come back to some of these later.

8.2.2 Principal approaches to operator evaluation

Algorithms for evaluating relational operators have a lot in common. They are based upon one of the following principles:

1. Indexing. If some form of (selection, join) condition is given, use an index to examine just the tuples that satisfy the condition.
   In more generality: ... to examine a superset of candidate tuples that may satisfy the condition.
2. Iteration. Examine all tuples in an input table, one after the other.
   Index-only plans: ... if there is an index covering all required attributes, we can scan the index instead of the data file.
3. Partitioning. By partitioning on a subset of attribute values, we can often decompose an operation into a less expensive collection of operations on partitions. Sorting and hashing are commonly used partitioning techniques.
   Divide-and-conquer: ... partitioning is an instance of this principle of algorithm design.

8.3 The selection operation

8.3.1 No index, unsorted data

Selection ($\sigma_p$) reads an input file $r_{in}$ of records and writes those records satisfying predicate $p$ into the output file:

Algorithm: σ(p, r_in, r_out)
Input: predicate p, input file r_in
Output: output file r_out written (side-effect)

    out ← createfile(r_out);
    in ← openscan(r_in);
    while (r ← nextrecord(in)) ≠ EOF do
        if p(r) then
            appendrecord(out, r);
    closefile(out);

Observations:

- Reading the special record EOF from a file indicates the end of the input file.
- This simple procedure does not require r_in to come with any special physical properties (the procedure is exclusively defined in terms of heap files, see Section 2.4.1).
- Predicate p may be arbitrary.
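The scan-based selection can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input "file" is an in-memory iterable, records are plain dictionaries, and the name `select_scan` is hypothetical (it is not part of the slides).

```python
from typing import Callable, Iterable, Iterator

Record = dict  # assumed record representation: field name -> value


def select_scan(p: Callable[[Record], bool], r_in: Iterable[Record]) -> Iterator[Record]:
    """Yield every record of r_in that satisfies predicate p (full scan)."""
    for r in r_in:      # plays the role of openscan / nextrecord
        if p(r):        # p may be an arbitrary predicate
            yield r     # plays the role of appendrecord(out, r)


r_in = [{"A": 42, "B": 1}, {"A": 7, "B": 10}, {"A": 99, "B": 10}]
r_out = list(select_scan(lambda r: r["A"] > 10, r_in))
print(r_out)
```

Every input record is touched exactly once, regardless of how selective the predicate is, which is why the input cost of this variant is always the full size of the file.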

Query execution cost

We summarize the characteristics of this implementation of the selection operator as follows:

$\sigma_p(r_{in})$
  input access [3]:  file scan (openscan) of $r_{in}$
  prerequisites:     none ($p$ arbitrary, $r_{in}$ may be a heap file)
  I/O cost:          $\underbrace{\|r_{in}\|}_{\text{input cost}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{output cost}}$

$\|r_{in}\|$ denotes the number of pages in file $r_{in}$, $|r_{in}|$ denotes the number of records (if $b$ records fit on one page, we have $\|r_{in}\| = \lceil |r_{in}| / b \rceil$).

[3] Sometimes also called access path in the literature and textbooks.

Selectivity

$sel(p)$, the selectivity of predicate $p$, is the fraction of records satisfying predicate $p$:

$0 \le sel(p) = \frac{|\sigma_p(r_{in})|}{|r_{in}|} \le 1$

Selectivity: What can you say about the following selectivities?

1. $sel(\text{true})$
2. $sel(\text{false})$
3. $sel(A = 0)$

8.3.2 No index, sorted data

If the input file $r_{in}$ is sorted with respect to a sort key $k$, we can use binary search on $r_{in}$ to find the first record matching predicate $p$ more quickly. To find more hits, scan the sorted file from there.

Obviously, predicate $p$ must match the sort key $k$ in some way. Otherwise we won't benefit from the sortedness of $r_{in}$.

When does a predicate match a sort key? Assume $r_{in}$ is sorted on attribute $A$ in ascending order. Which of the selections below can benefit from the sortedness of $r_{in}$?

1. $\sigma_{A=42}(r_{in})$
2. $\sigma_{A>42}(r_{in})$
3. $\sigma_{A<42}(r_{in})$
4. $\sigma_{A>42 \text{ AND } A<100}(r_{in})$
5. $\sigma_{A>42 \text{ OR } A>100}(r_{in})$
6. $\sigma_{A>42 \text{ OR } A<32}(r_{in})$
7. $\sigma_{A>42 \text{ AND } A<32}(r_{in})$
8. $\sigma_{A>42 \text{ AND } B=10}(r_{in})$
9. $\sigma_{A>42 \text{ OR } B=10}(r_{in})$
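As a rough illustration of "binary search, then sorted scan", the following Python sketch evaluates a range predicate lo < A < hi on an in-memory list sorted on A. The function name `select_sorted_range` and the parallel `keys` list are assumptions made for this example; a real DBMS would search over pages of a file instead.

```python
from bisect import bisect_left, bisect_right


def select_sorted_range(keys, records, lo, hi):
    """Return the records with lo < key < hi.

    keys must be sorted ascending and keys[i] must be the sort key of records[i].
    """
    start = bisect_right(keys, lo)  # binary search: first position with key > lo
    end = bisect_left(keys, hi)     # binary search: first position with key >= hi
    return records[start:end]       # the hits are contiguous in the sorted input


records = [(10, "a"), (42, "b"), (43, "c"), (99, "d"), (100, "e")]
keys = [r[0] for r in records]
hits = select_sorted_range(keys, records, 42, 100)
print(hits)
```

The sketch only works because all qualifying records are adjacent in the sort order; a predicate on some other attribute B would not have this property.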

We defer the treatment of disjunctive predicates (e.g., $A > 42$ OR $A < 32$) until later.

The characteristics of selection via binary search are ($\|r_{in}\|$ is the number of pages of $r_{in}$):

$\sigma_p(r_{in})$
  input access:   binary search, then sorted file scan of $r_{in}$
  prerequisites:  $r_{in}$ sorted on key $k$, $p$ matches sort key $k$
  I/O cost:       $\underbrace{\log_2 \|r_{in}\|}_{\text{binary search}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{sorted scan}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{output cost}}$

8.3.3 B+ tree index

A clustered B+ tree index on $r_{in}$ whose key matches the selection predicate $p$ is clearly the superior method to evaluate $\sigma_p(r_{in})$:

- Descend the B+ tree to retrieve the first index entry that satisfies $p$.
- Then scan the sequence set to find more matching records.

If the index is unclustered and $sel(p)$ indicates a large number of qualifying records, it pays off to

1. read the index entries $\langle k, rid \rangle$ in the sequence set,
2. sort those entries on their rid field,
3. and then access the pages of $r_{in}$ in sorted rid order.

Note that lack of clustering is a minor issue if $sel(p)$ is close to 0. Why?

$\sigma_p(r_{in})$ (with $\|r_{in}\|$ the number of pages of $r_{in}$)
  input access:   access of B+ tree on $r_{in}$, then sequence set scan
  prerequisites:  clustered B+ tree on $r_{in}$ with key $k$, $p$ matches key $k$
  I/O cost:       $\underbrace{3}_{\text{B+ tree acc.}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{sorted scan}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{output cost}}$

8.3.4 Hash index, equality selection

A selection predicate $p$ matches a hash index only if $p$ contains a term of the form $A = c$ (assuming the hash index is over key attribute $A$).

We are directly led to the bucket of qualifying records and pay I/O cost only for the access of this bucket [4]. Note that $sel(p)$ is likely to be close to 0 for equality predicates.

$\sigma_p(r_{in})$ (with $\|r_{in}\|$ the number of pages of $r_{in}$)
  input access:   hash table on $r_{in}$
  prerequisites:  $r_{in}$ hashed on key $k$, $p$ has term $k = c$
  I/O cost:       $\underbrace{sel(p) \cdot \|r_{in}\|}_{\text{hash access}} + \underbrace{sel(p) \cdot \|r_{in}\|}_{\text{output cost}}$

[4] Remember that this may include access cost for the pages of an overflow chain hanging off the primary bucket page.

8.3.5 General selection conditions

Selection operations with simple predicates like $\sigma_{A \,\theta\, c}(r_{in})$ are only a special case. We also need to deal with complex predicates, built from simple comparisons and the boolean connectives AND and OR.

Conjunctive predicates and index matching

Our simple notion of matching a selection predicate with an index can be extended to cover the case where predicate $p$ has a conjunctive form:

$\underbrace{A_1 \,\theta_1\, c_1}_{\text{conjunct}} \text{ AND } A_2 \,\theta_2\, c_2 \text{ AND } \cdots \text{ AND } A_n \,\theta_n\, c_n$.

Here, each conjunct is a simple comparison ($\theta_i \in \{=, <, >, \le, \ge\}$).

An index with a multi-attribute key may match the entire complex predicate.

Matching a multi-attribute hash index.

Suppose a hash index is maintained for the 3-attribute key $k = (A, B, C)$ (i.e., all three attributes are input to the hash function). Which types of conjunctive selection predicates $p$ would match this index? $p = {}$?

Predicate matching rule for hash indexes: A conjunctive predicate $p$ matches a (multi-attribute) hash index with key $k = (A_1, A_2, \dots, A_n)$ if $p$ covers the key $k$, i.e.

1. $p \equiv A_1 = c_1 \wedge A_2 = c_2 \wedge \cdots \wedge A_n = c_n$, or
2. $p \equiv A_1 = c_1 \wedge A_2 = c_2 \wedge \cdots \wedge A_n = c_n \wedge \varphi$ (conjunct $\varphi$ is not supported by the index itself and has to be evaluated separately after index retrieval).

Matching a multi-attribute B+ tree index.

We have a B+ tree index available on the multi-attribute key $(A, B, C)$, i.e., the B+ tree entries are inserted and searched for using a lexicographic order on the three attributes. This means that inside the B+ tree two keys $k_1 = (A_1, B_1, C_1)$ and $k_2 = (A_2, B_2, C_2)$ are ordered according to

$k_1 < k_2 \;\equiv\; A_1 < A_2 \;\vee\; (A_1 = A_2 \wedge B_1 < B_2) \;\vee\; (A_1 = A_2 \wedge B_1 = B_2 \wedge C_1 < C_2)$.

Which types of conjunctive selection predicates $p$ would match this B+ tree index?

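Predicate matching rule for B+ tree indexes

A conjunctive predicate $p$ matches a (multi-attribute) B+ tree index with key $k = (A_1, A_2, \dots, A_n)$ if $p$ is a prefix of key $k$, i.e.

1. $p \equiv A_1 \,\theta_1\, c_1$
   $p \equiv A_1 \,\theta_1\, c_1 \wedge A_2 \,\theta_2\, c_2$
   $\vdots$
   $p \equiv A_1 \,\theta_1\, c_1 \wedge A_2 \,\theta_2\, c_2 \wedge \cdots \wedge A_n \,\theta_n\, c_n$
   or
2. $p \equiv A_1 \,\theta_1\, c_1 \wedge \varphi$
   $p \equiv A_1 \,\theta_1\, c_1 \wedge A_2 \,\theta_2\, c_2 \wedge \varphi$
   $\vdots$
   $p \equiv A_1 \,\theta_1\, c_1 \wedge A_2 \,\theta_2\, c_2 \wedge \cdots \wedge A_n \,\theta_n\, c_n \wedge \varphi$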
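The two matching rules can be turned into small checks. The sketch below is a simplified reading of the rules as stated on the slides, not an optimizer implementation: a conjunctive predicate is assumed to be a list of `(attribute, operator, constant)` triples, and both function names are hypothetical.

```python
def hash_index_matches(conjuncts, key):
    """Hash rule: every key attribute must occur in an equality conjunct."""
    eq_attrs = {attr for (attr, op, _) in conjuncts if op == "="}
    return all(attr in eq_attrs for attr in key)


def btree_matched_prefix(conjuncts, key):
    """B+ tree rule: length of the longest key prefix covered by conjuncts.

    A result of 0 means the predicate does not match the index; conjuncts
    outside the prefix play the role of the residual condition phi.
    """
    attrs = {attr for (attr, _, _) in conjuncts}
    n = 0
    for attr in key:
        if attr not in attrs:
            break
        n += 1
    return n


key = ("A", "B", "C")
p1 = [("A", "=", 1), ("B", "=", 2), ("C", "=", 3), ("D", "<", 5)]
p2 = [("A", ">", 1), ("C", "=", 3)]
p3 = [("B", "=", 2)]

print(hash_index_matches(p1, key), btree_matched_prefix(p1, key))
print(hash_index_matches(p2, key), btree_matched_prefix(p2, key))
print(hash_index_matches(p3, key), btree_matched_prefix(p3, key))
```

Note the asymmetry: the hash index needs equality on all key attributes, whereas the B+ tree is already usable when only the leading attribute is restricted.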

Intersecting rid sets

If we find that a conjunctive predicate does not match a single index, its (smaller) conjuncts may nevertheless match distinct indexes.

Example: The conjunctive predicate in $\sigma_{p \wedge q}(r_{in})$ does not match an index, but both conjuncts, $p$ and $q$, do.

A typical optimizer might thus decide to transform the original query $\sigma_{p \wedge q}(r_{in})$ into $\sigma_p(r_{in}) \cap_{rid} \sigma_q(r_{in})$, where $\cap_{rid}$ denotes a set intersection operator defined by rid equality.
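A minimal sketch of the rid intersection idea, assuming each index is a Python dictionary that maps an attribute value to the set of rids of the matching records, and that the table is a dictionary from rid to record. All names and the sample data are made up for illustration.

```python
# Assumed toy table: rid -> record
table = {
    0: {"A": 5, "B": "x"},
    1: {"A": 5, "B": "y"},
    2: {"A": 7, "B": "x"},
}
# Assumed single-attribute indexes: value -> set of rids
index_a = {5: {0, 1}, 7: {2}}
index_b = {"x": {0, 2}, "y": {1}}


def select_and(index_p, value_p, index_q, value_q, table):
    """Evaluate (A = value_p AND B = value_q) by intersecting rid sets."""
    rids = index_p.get(value_p, set()) & index_q.get(value_q, set())
    # Only now touch the data records, in rid order.
    return [table[rid] for rid in sorted(rids)]


result = select_and(index_a, 5, index_b, "x", table)
print(result)
```

The data records are accessed only for rids that survive the intersection, so neither index alone has to be selective as long as the combination is.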

The selectivity of conjunctive predicates

What can you say about the selectivity of the conjunctive predicate $p \wedge q$?

$sel(p \wedge q) = {}$?

Disjunctive predicates

Choosing an intelligent execution plan for disjunctive selection predicates of the general form

$A_1 \,\theta_1\, c_1 \vee A_2 \,\theta_2\, c_2 \vee \cdots \vee A_n \,\theta_n\, c_n$

is much harder:

- We are forced to fall back to a naive file scan based evaluation (see Section 8.3.1) as soon as a single term does not match an index. Why?
- If all terms are supported by indexes, we can exploit a rid-based set union $\cup_{rid}$ to improve the plan:

  $\sigma_{A_1 \theta_1 c_1}(r_{in}) \cup_{rid} \cdots \cup_{rid} \sigma_{A_n \theta_n c_n}(r_{in})$

The selectivity of disjunctive predicates

What can you say about the selectivity of the disjunctive predicate $p \vee q$?

$sel(p \vee q) = {}$?

Predicates involving attribute-attribute comparisons

Can you think of a clever query plan for a selection operation like the one shown below?

$\sigma_{A=B}(r_{in})$

Bypass Selections

Problem: parts of a selection condition may be expensive to check (so far we assumed this was not the case), or may be very unselective. It is useful to evaluate cheap (and selective) predicates first. Boolean laws used for this include:

- $\text{true} \vee P \equiv \text{true}$ (evaluating $P$ is not necessary)
- $\text{false} \vee P \equiv P$ (only now evaluate $P$)

Example: $Q := \sigma_{(F_1 \wedge F_2) \vee F_3}(R)$, where the selectivities and costs of each part of the selection condition are as follows:

formula   selectivity    cost
F_1       s_1 = 0.6      C_1 = 18
F_2       s_2 = 0.4      C_2 = 3
F_3       s_3 = 0.7      C_3 = 40

Evaluation Alternative 1:

Bring the selection condition into disjunctive normal form (DNF); it is already in DNF in our case. Push each tuple from the input through each disjunct in parallel. Collect matching tuples from each disjunct (eliminating duplicates!).

[Figure: plan for alternative 1. Upper path: #=1000 → F_3 → #=700. Lower path: #=1000 → F_2 → #=400 → F_1 → #=240. Both paths feed a duplicate elimination step.]

Mean cost per tuple (ignoring the cost for duplicate elimination!):

$\underbrace{C_3}_{\text{upper path}} + \underbrace{C_2}_{\text{lower path: } F_2} + \underbrace{s_2 \cdot C_1}_{\text{lower path: } F_1} = 50.2$

Evaluation Alternative 2:

Bring the selection condition into conjunctive normal form (CNF):

$\text{CNF}[(F_1 \wedge F_2) \vee F_3] = (F_1 \vee F_3) \wedge (F_2 \vee F_3)$.

Push each tuple from the input through each conjunct in a row. Matching tuples survive all conjuncts (no duplicate elimination necessary!).

[Figure: plan for alternative 2. #=1000 → (F_2 ∨ F_3) → #=820 → (F_1 ∨ F_3) → #=772.]

Mean cost per tuple:

$C_2 + (1 - s_2)\,(C_3 + s_3\,(C_1 + (1 - s_1)\,C_3)) + s_2\,(C_1 + (1 - s_1)\,C_3) = 54.88$

Problem: $F_3$ is evaluated multiple times; its result could be cached! Mean cost per tuple with caching:

$C_2 + C_3 + s_2\,(1 - s_3)\,C_1 = 45.16$

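Evaluation Alternative 3: Bypass Plan

Goal: eliminate tuples early, avoid duplicates.

Introduce a bypass selection operator for a formula $F$, which produces two results: a true output and a false output. (N.B. the two outputs are disjoint!)

Bypass plans are derived from the CNF, i.e., $(F_1 \vee F_3) \wedge (F_2 \vee F_3)$ in our example. Boolean factors and disjuncts in factors are sorted by cost.

[Figure: bypass plan. #=1000 tuples enter bypass selection F_2 (true: #=400, false: #=600). The false output is filtered by F_3 (#=420 remain). The true output enters bypass selection F_1 (true: #=240, false: #=160); its false output is filtered by F_3 (#=112 remain). The disjoint union of the three result streams contains #=772 tuples.]

Mean cost per tuple (the outputs are combined by a disjoint union):

$C_2 + (1 - s_2)\,C_3 + s_2\,(C_1 + (1 - s_1)\,C_3) = 40.6$

Many variations are possible, e.g., for tuning in parallel environments.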
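The cost formulas of the three alternatives can be checked with a few lines of Python. Note one assumption: the value C_3 = 40 is not legible in the cost table as extracted; it is inferred here from the stated results 50.2 (alternative 1) and 40.6 (bypass plan), which both hold exactly for C_3 = 40. Variable names are chosen for this sketch only.

```python
# Selectivities and per-tuple evaluation costs from the example.
s1, s2, s3 = 0.6, 0.4, 0.7
c1, c2 = 18, 3
c3 = 40  # ASSUMPTION: inferred from the stated totals 50.2 and 40.6

# Alternative 1 (DNF): F3 on the upper path, F2 then F1 on the lower path.
cost_dnf = c3 + c2 + s2 * c1

# Alternative 2 (CNF), without and with caching of the F3 result.
cost_cnf = (c2
            + (1 - s2) * (c3 + s3 * (c1 + (1 - s1) * c3))
            + s2 * (c1 + (1 - s1) * c3))
cost_cnf_cached = c2 + c3 + s2 * (1 - s3) * c1

# Alternative 3 (bypass plan).
cost_bypass = c2 + (1 - s2) * c3 + s2 * (c1 + (1 - s1) * c3)

# Expected result size for 1000 input tuples, assuming independent predicates.
result_size = 1000 * (s3 + (1 - s3) * s1 * s2)

for name, cost in [("DNF", cost_dnf), ("CNF", cost_cnf),
                   ("CNF cached", cost_cnf_cached), ("bypass", cost_bypass)]:
    print(f"{name}: {cost:.2f}")
print(round(result_size))
```

Under these numbers the bypass plan is the cheapest of the four variants, because the expensive predicate F_3 is evaluated at most once per tuple and only for tuples that have not yet been decided.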

8.4 The projection operation

Projection ($\pi_l$) modifies each record in its input file and cuts off any field not listed in the attribute list $l$.

Example: $\pi_{A,B}$ applied to the input on the left, after step 1 and after step 2:

Input                  After step 1        After step 2

A  B      C            A  B                A  B
1  "foo"  3            1  "foo"            1  "foo"
1  "bar"  2            1  "bar"            1  "bar"
1  "foo"  2            1  "foo"
1  "bar"  0            1  "bar"
1  "foo"  0            1  "foo"

In general, the size of the resulting file will only be a fraction of the original input file:

1. any unwanted fields (here: C) have been thrown away, and
2. cutting off record fields may lead to duplicate records which have to be eliminated [5] to produce the final result.

[5] Remember that we are bound to implement set semantics.

While step 1 calls for a rather straightforward file scan (indexes won't help much here), it is step 2 which makes projection costly. To implement duplicate elimination we have two principal alternatives:

1. sorting, or
2. hashing.

8.4.1 Projection based on sorting

Sorting is one obvious preparatory step to facilitate duplicate elimination: records with all fields equal will be adjacent to each other after the sorting step.

One benefit of a sort-based projection is that operator $\pi_l$ will write a sorted output file. (See the algorithm on the next slide.)

[Figure: plan fragment $r_{in}$ → sort → $\pi_l$; the input edge is annotated "?", the output edge "s" (sorted).]

Algorithm: π(l, r_in, r_out)
Input: attribute list l, input file r_in
Output: output file r_out written (side-effect)

    out ← createfile(r_tmp);
    in ← openscan(r_in);
    while (r ← nextrecord(in)) ≠ EOF do
        r′ ← r with any field not listed in l cut off;
        appendrecord(out, r′);
    closefile(out);
    external-merge-sort(r_tmp, r_tmp, θ);
    out ← createfile(r_out);
    in ← openscan(run_*_0);
    lastr ← ⟨⟩;
    while (r ← nextrecord(in)) ≠ EOF do
        if r ≠ lastr then
            appendrecord(out, r);
            lastr ← r;
    closefile(out);

Sort ordering θ? How do we have to specify the ordering θ to make sure the above algorithm works correctly?
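The three stages of the algorithm (cut off fields, sort, drop adjacent duplicates) can be sketched in Python as follows. This is a simplified in-memory illustration: Python's built-in sort stands in for the external merge sort, records are assumed to be dictionaries, and the name `project_sort` is hypothetical.

```python
def project_sort(records, attrs):
    """Sort-based projection with duplicate elimination (in-memory sketch)."""
    # Stage 1: cut off every field not listed in attrs.
    projected = [tuple(r[a] for a in attrs) for r in records]
    # Stage 2: sort on all remaining fields (stand-in for external merge sort).
    projected.sort()
    # Stage 3: duplicates are now adjacent; keep only the first of each group.
    out = []
    last = None  # sentinel, never equal to a tuple
    for r in projected:
        if r != last:
            out.append(r)
            last = r
    return out


r_in = [
    {"A": 1, "B": "foo", "C": 3},
    {"A": 1, "B": "bar", "C": 2},
    {"A": 1, "B": "foo", "C": 2},
    {"A": 1, "B": "bar", "C": 0},
    {"A": 1, "B": "foo", "C": 0},
]
r_out = project_sort(r_in, ["A", "B"])
print(r_out)
```

As a side effect, the output is sorted on the projected attributes, which later operators in the plan may be able to exploit.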

In this algorithm, sorting and duplicate elimination are two separate steps executed in sequence.

Marriage of sorting and duplicate elimination? Can you imagine how a DBMS could fold the formerly separate phases (1. external merge sort, 2. duplicate elimination) to avoid the two-stage approach? The outline of the external merge sort algorithm is reproduced below.

Pass 0:
1. Read $B$ pages at a time,
2. use in-memory sort to sort the records on these $B$ pages,
3. write the sorted run to disk.
(N.B.: Pass 0 writes $\lceil N/B \rceil$ runs to disk; each run contains $B$ pages except the last run, which may contain fewer.)

Passes 1, ... (until only a single run is left):
1. Select $B - 1$ runs from the previous pass, read a page from each run,
2. perform a $(B - 1)$-way merge and use the $B$-th page as temporary output buffer.

8.4.2 Projection based on hashing

If the DBMS has a fairly large number of buffer pages ($B$, say) to spare for the $\pi_l(r_{in})$ operation, a hash-based projection may be an efficient alternative to sorting.

Partitioning phase:

1. Allocate all $B$ buffer pages. One page will be the input buffer, the remaining $B - 1$ pages will be used as hash buckets.
2. Read the file $r_{in}$ page by page; for each record $r$, cut off the fields not listed in $l$.
3. For each such record, apply hash function $h_1(r) = h(r) \bmod (B - 1)$, which depends on all remaining fields of $r$, and store $r$ in hash bucket $h_1(r)$. (Write the bucket to disk if full. [6])

[Figure: partitioning phase. The input file is read from disk through one input buffer; hash function h distributes the records over the remaining B − 1 of the B main memory buffers, which are written to partitions 1, 2, ..., B − 1 on disk.]

[6] You may read this as: a bucket's overflow chain resides on disk.

After partitioning, duplicate elimination is guaranteed to be an intra-partition problem only: two identical records $r, r'$ have been mapped to the same partition:

$h_1(r) = h_1(r') \;\Leftarrow\; r = r'$.

We are not done yet, though. Due to hash collisions, the records in a partition are not guaranteed to be all equal:

$h_1(r) = h_1(r') \;\not\Rightarrow\; r = r'$.

We therefore need a separate duplicate elimination phase.

Duplicate elimination phase:

1. Read each partition page by page. (Buffer page layout as before.)
2. To each record, apply a hash function $h_2 \ne h_1$. Why?
3. If two records $r, r'$ collide w.r.t. $h_2$, check if $r = r'$. If so, discard $r'$.
4. After the entire partition has been read in, append all hash buckets to the result file (which will be free of duplicates).

N.B.: The hash-based approach is efficient only if the duplicate elimination phase can be performed in memory (i.e., no partition may exceed the buffer size).
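Both phases can be sketched in Python. The sketch makes several simplifying assumptions: partitions are in-memory lists rather than files on disk, Python's built-in `hash` plays the role of h_1, and a Python `set` (which hashes internally and compares on collision) stands in for the in-memory hash table built with h_2. The name `project_hash` is hypothetical.

```python
def project_hash(records, attrs, num_partitions):
    """Hash-based projection with duplicate elimination (in-memory sketch)."""
    # Partitioning phase: cut off fields, then distribute by h1.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        t = tuple(r[a] for a in attrs)
        partitions[hash(t) % num_partitions].append(t)

    # Duplicate elimination phase: handle each partition on its own.
    result = []
    for part in partitions:
        seen = set()  # stand-in for the in-memory hash table using h2
        for t in part:
            if t not in seen:  # equal records always meet in the same partition
                seen.add(t)
                result.append(t)
    return result


r_in = [
    {"A": 1, "B": "foo", "C": 3},
    {"A": 1, "B": "bar", "C": 2},
    {"A": 1, "B": "foo", "C": 2},
    {"A": 1, "B": "bar", "C": 0},
    {"A": 1, "B": "foo", "C": 0},
]
r_out = project_hash(r_in, ["A", "B"], num_partitions=3)
print(sorted(r_out))
```

Unlike the sort-based variant, the output order depends on the hash function, so no sort order can be promised to later operators.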

8.4.3 Use of indexes for projection

If the index key contains all attributes of the projection, we can use an index-only plan to retrieve all values from the index pages without accessing the actual data records. We then apply hashing or sorting to eliminate duplicates from this (much smaller) set of pages.

If the index key includes the projected attributes as a prefix, and the index is a sorted index (e.g., a B+ tree), we can use an index-only plan both to retrieve the projected attribute values and to eliminate the duplicates.

8.5 The join operation

The semantics of the join operation ($r_1 \bowtie_p r_2$) is most easily described in terms of two other relational operators:

$r_1 \bowtie_p r_2 \;\equiv\; \sigma_p(r_1 \times r_2)$

($\times$ denotes the cross product operator; predicate $p$ may refer to record fields in files $r_1$ and $r_2$.)

There are several alternative algorithms that implement $r_1 \bowtie_p r_2$, and some of them actually implement the above relational equivalence:

1. enumerate all records in the cross product of $r_1$ and $r_2$,
2. then pick those record pairs satisfying predicate $p$.

More advanced algorithms try to avoid the obvious inefficiency in step 1 (the size of the intermediate result is $|r_1| \cdot |r_2|$ records) and instead try to select early.

8.5.1 Nested loops join

The nested loops join (NL-⋈) is the basic join algorithm variant. Its I/O cost is forbidding, though.

Algorithm: ⋈(p, r1, r2, r_out)
Input: predicate p, input files r1, r2
Output: output file r_out written (side-effect)

    out ← createfile(r_out);
    in1 ← openscan(r1);
    while (r ← nextrecord(in1)) ≠ EOF do
        in2 ← openscan(r2);
        while (r′ ← nextrecord(in2)) ≠ EOF do
            if p(r, r′) then
                appendrecord(out, ⟨r, r′⟩);
    closefile(out);

For obvious reasons, file r1 is referred to as the outer (relation), while r2 is commonly called the inner (relation).

Cost of NL-⋈

We can easily modify the algorithm such that one scan of the inner relation is initiated for each page of the outer relation (instead of for each record). (Without this simple modification, the I/O cost for the inner loop would be a prohibitive $|r_1| \cdot \|r_2\|$!)

$\bowtie_p(r_1, r_2)$
  input access:   file scan (openscan) of $r_1$, $r_2$
  prerequisites:  none ($p$ arbitrary, $r_1$, $r_2$ may be heap files)
  I/O cost [7]:   $\underbrace{\|r_1\|}_{\text{outer loop}} + \underbrace{\|r_1\| \cdot \|r_2\|}_{\text{inner loop}}$

Here $\|r\|$ denotes the number of pages and $|r|$ the number of records of file $r$.

[7] Ignoring the cost to write the result file $r_{out}$.

The I/O cost for the simple NL-⋈ is staggering since NL-⋈ effectively enumerates all records in the cross product of $r_1$ and $r_2$.

Example: Assume $\|r_1\| = 1000$ and $\|r_2\| = 500$ pages; on current hardware, a single I/O operation takes about 10 msec (see Section 2.1.1). The resulting processing time for the NL-⋈ of $r_1$ and $r_2$ thus amounts to

$(1000 + 1000 \cdot 500) \cdot 10\,\text{msec} = 5{,}010{,}000\,\text{msec} \approx 83\,\text{mins}$.

Remark: Swapping the roles of $r_1$ and $r_2$ (outer ↔ inner) does not buy us much here. This will, however, be different for advanced join algorithms. [8]

[8] If the DBMS's record field accesses are designed with care, we can assume that $r_1 \bowtie_p r_2 = r_2 \bowtie_p r_1$.

8.5.2 Block nested loops join

Observe that plain NL-⋈ utilizes only 3 buffer pages at a time and otherwise effectively ignores the presence of spare buffer space. Given $B$ pages of buffer space, we can easily refine NL-⋈ to use the entire available space. The buffer setup is as follows:

[Figure: buffer setup. Of the B main memory buffers, B − 2 pages hold a hash table (hash function h) for the current block of r1, one page is the input buffer (scan r2 page-wise), and one page is the output buffer. The input files are read from disk, the join result is written to disk.]

The main idea is to read the outer file $r_1$ in chunks of $B - 2$ pages (instead of page by page as in NL-⋈).

Hash table? Which role does the in-buffer hash table over file $r_1$ play here?

Algorithm: ⋈(p, r1, r2, r_out)
Input: equality predicate p (r1.A = r2.B), input files r1, r2
Output: output file r_out written (side-effect)

    out ← createfile(r_out);
    in1 ← openscan(r1);
    repeat
        // try to read a chunk of maximum size (but don't read beyond EOF of r1)
        B′ ← min(B − 2, #remaining blocks in r1);
        if B′ > 0 then
            read B′ blocks of r1 into buffer, hash record r of r1 to buffer page h(r.A) mod B′;
            in2 ← openscan(r2);
            while (r′ ← nextrecord(in2)) ≠ EOF do
                compare record r′ with records r stored in buffer page h(r′.B) mod B′;
                if r.A = r′.B then
                    appendrecord(out, ⟨r, r′⟩);
    until B′ < B − 2;
    closefile(out);

If predicate p is a general predicate, block NL-⋈ is still applicable (at the cost of more CPU cycles, since all B − 2 in-buffer blocks of r1 have to be scanned to find a join partner for record r′ of r2).
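The chunk-wise processing can be sketched in Python as follows. The sketch assumes that each input file is a list of pages and each page a list of dictionary records, and it uses a Python dictionary as the in-buffer hash table; the function name `block_nl_join` and the page-read counter are additions for illustration.

```python
from collections import defaultdict


def block_nl_join(r1_pages, r2_pages, key1, key2, B):
    """Block nested loops equi-join; returns (result pairs, pages read)."""
    assert B >= 3, "need at least one chunk page, one input and one output buffer"
    chunk_size = B - 2
    out, pages_read = [], 0
    for i in range(0, len(r1_pages), chunk_size):
        # Read the next chunk of r1 and hash its records on the join key.
        table = defaultdict(list)
        for page in r1_pages[i:i + chunk_size]:
            pages_read += 1
            for r in page:
                table[r[key1]].append(r)
        # One full page-wise scan of r2 per chunk of r1.
        for page in r2_pages:
            pages_read += 1
            for s in page:
                for r in table.get(s[key2], []):
                    out.append((r, s))
    return out, pages_read


r1_pages = [[{"A": 1}, {"A": 2}], [{"A": 2}], [{"A": 4}]]
r2_pages = [[{"B": 2}, {"B": 1}], [{"B": 3}]]
result, pages_read = block_nl_join(r1_pages, r2_pages, "A", "B", B=4)
print(len(result), pages_read)
```

With B = 4 the three pages of r1 are read in two chunks, so r2 is scanned twice; a larger B reduces the number of inner scans accordingly.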

$\bowtie_p(r_1, r_2)$ (with $\|r\|$ the number of pages of file $r$)
  input access:   chunk-wise file scan of $r_1$, page-wise file scan of $r_2$
  prerequisites:  $p$ equality predicate (or arbitrary), $r_1$, $r_2$ may be heap files
  I/O cost:       $\underbrace{\|r_1\|}_{\text{outer loop}} + \underbrace{\lceil \|r_1\| / (B - 2) \rceil \cdot \|r_2\|}_{\text{inner loop}}$

Block NL-⋈ beats plain NL-⋈ in terms of I/O cost by far. To return to our running example: Assume, as before, $\|r_1\| = 1000$ and $\|r_2\| = 500$; on current hardware, a single I/O operation takes about 10 msec (see Section 2.1.1); and assume $B = 100$. Resulting processing time for the block NL-⋈ of $r_1$ and $r_2$:

$(1000 + \lceil 1000/98 \rceil \cdot 500) \cdot 10\,\text{msec} = 65{,}000\,\text{msec} = 65\,\text{secs}$

(... as opposed to 83 mins before!)

Which relation is outer?

8.5.3 Index nested loops join

Whenever there is an index on (at least) one of the join relations that matches the join predicate, we can take advantage of it by making the indexed relation the inner relation of the join algorithm. We do not need to compare the tuples of the outer relation with those of the inner, but rather use the index to retrieve the matches efficiently.

Algorithm: ⋈(p, r1, r2, r_out)
Input: predicate p, input files r1, r2, index on r2
Output: output file r_out written (side-effect)

    out ← createfile(r_out);
    in1 ← openscan(r1);
    while (r ← nextrecord(in1)) ≠ EOF do
        use index on r2 to find all matches for r, appending them to output out;
    closefile(out);

Index nested loops avoids enumeration of the cross product.
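A minimal sketch of index nested loops for an equi-join, assuming the index on the inner relation is a Python dictionary from join key to the list of matching records. The helpers `build_index` and `index_nl_join` are hypothetical names; a real system would use a B+ tree or hash index stored on disk.

```python
from collections import defaultdict


def build_index(records, key):
    """Stand-in for an existing index on the inner relation."""
    index = defaultdict(list)
    for s in records:
        index[s[key]].append(s)
    return index


def index_nl_join(r1, index_r2, key1):
    """For each outer record, look up its join partners in the index."""
    out = []
    for r in r1:                                # scan of the outer relation
        for s in index_r2.get(r[key1], []):     # one index access per outer record
            out.append((r, s))
    return out


r1 = [{"A": 1, "C": "foo"}, {"A": 2, "C": "bar"}, {"A": 4, "C": "baz"}]
r2 = [{"B": 1, "D": True}, {"B": 2, "D": False}, {"B": 2, "D": True}]
result = index_nl_join(r1, build_index(r2, "B"), "A")
print(result)
```

The inner relation is never scanned; the work per outer record is the cost of one index lookup plus the retrieval of its matches.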

The cost of index nested loops depends on the available index.

$\bowtie_p(r_1, r_2)$ (with $\|r_1\|$ the number of pages and $|r_1|$ the number of records of $r_1$)
  input access:   file scan (openscan) of $r_1$, index access to $r_2$
  prerequisites:  index on $r_2$ matching join predicate $p$
  I/O cost [9]:   $\underbrace{\|r_1\|}_{\text{outer loop}} + \underbrace{|r_1| \cdot (\text{cost of 1 index access to } r_2)}_{\text{inner loop}}$

This algorithm is especially useful if the index is a clustered index. Furthermore, even with unclustered indexes and few matches per outer tuple, index nested loops outperforms simple nested loops.

[9] Ignoring the cost to write the result file $r_{out}$.

8.5.4 Sort-merge join

In a situation like the one depicted below, sort-merge join might be an attractive alternative to block NL-⋈:

[Figure: join $\bowtie_{A=B}$ with inputs $r_1$ and $r_2$; both incoming edges are annotated s.]

1. Both join inputs are sorted (annotation s on the incoming edges), and
2. the join predicate (here: A = B) is an equality predicate.

Note that this effectively matches the situation just before the merge step of the two-way merge sort algorithm (see Chapter 7): simply consider join inputs $r_1$ and $r_2$ as runs that have to be merged.

The merge phase has to be slightly adapted to ensure correct results are produced in a situation like this (with duplicates on both sides), for the join $\bowtie_{R.A=S.B}$ of:

R                 S

A  C              B  D
1  "foo"          1  true
2  "foo"          2  false
2  "bar"          2  true
2  "baz"          3  false
4  "foo"

Notes on the algorithm shown below:

- The code assumes that any comparison with EOF (besides itself) fails.
- Function tell(f) yields the current file pointer of file f. The companion function seek(f, l) moves f's file pointer to position l. Unix: see man ftell and man fseek.

Algorithm: ⋈(p, r1, r2, r_out)
Input: equality predicate p (r1.A = r2.B), input files r1, r2
Output: output file r_out written (side-effect)

    out ← createfile(r_out);
    in1 ← openscan(r1);
    in2 ← openscan(r2);
    r ← nextrecord(in1);
    r′ ← nextrecord(in2);
    while r ≠ EOF ∧ r′ ≠ EOF do
        while r.A < r′.B do
            r ← nextrecord(in1);
        while r.A > r′.B do
            r′ ← nextrecord(in2);
        l ← tell(in2);
        while r.A = r′.B do
            // repeat the scan of r2 for the next r1 record with the same join value
            seek(in2, l);
            r″ ← getrecord(in2);
            // while we find matching records in r2 ...
            while r.A = r″.B do
                appendrecord(out, ⟨r, r″⟩);
                r″ ← nextrecord(in2);
            r ← nextrecord(in1);
        r′ ← r″;
    closefile(out);
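The merge logic, including the rescan of the matching group in the second input, can be sketched in Python. Assumptions: both inputs are in-memory lists of tuples already sorted on their join column, list positions play the role of file pointers (tell/seek), and the sample data is illustrative. The name `sort_merge_join` is hypothetical.

```python
def sort_merge_join(r1, r2, key1, key2):
    """Equi-join of r1 (sorted on column key1) and r2 (sorted on column key2)."""
    out = []
    i = j = 0
    while i < len(r1) and j < len(r2):
        a, b = r1[i][key1], r2[j][key2]
        if a < b:
            i += 1                      # advance the first input
        elif a > b:
            j += 1                      # advance the second input
        else:
            start = j                   # remember group start (like tell)
            while i < len(r1) and r1[i][key1] == a:
                j = start               # rewind to group start (like seek)
                while j < len(r2) and r2[j][key2] == a:
                    out.append((r1[i], r2[j]))
                    j += 1
                i += 1
    return out


# Illustrative inputs with duplicates on both sides.
R = [(1, "foo"), (2, "foo"), (2, "bar"), (2, "baz"), (4, "foo")]
S = [(1, True), (2, False), (2, True), (3, False)]
result = sort_merge_join(R, S, 0, 0)
print(len(result))
```

Each record of R with join value 2 is paired with both matching records of S, which is exactly why the scan position in the second input has to be reset for every duplicate in the first.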

Summary and analysis of sort-merge join:

$\bowtie_p(r_1, r_2)$
  input access:   sorted file scan of both $r_1$, $r_2$
  prerequisites:  $p$ equality predicate $r_1.A = r_2.B$, $r_1$ sorted on $A$, $r_2$ sorted on $B$
  I/O cost:       best case: If ...
                  worst case: If ...

I/O performance figures. Example: Just like before, $\|r_1\| = 1000$ and $\|r_2\| = 500$ pages; on current hardware, a single I/O operation takes about 10 msec. Resulting processing time for the sort-merge join of $r_1$ and $r_2$:

best case:   $(1000 + 500) \cdot 10\,\text{msec} = 15{,}000\,\text{msec} = 15\,\text{sec}$
worst case:  $(\dots) \cdot 10\,\text{msec} \approx 83\,\text{mins}$

Final remarks on sort-merge join:

- If either (or both) of R, S are not available in sorted order according to the join attribute(s), we can obtain the sort order by introducing an explicit sort step into the execution plan before the join operator.
- If we need to do explicit sorting before the join, we can combine the last merge phase of the (merge) sorting with the join (at the expense of slightly higher memory requirements).

8.5.5 Hash joins

Hash join algorithms (there are quite a few!) follow a simple idea of partitioning. Instead of one big join, compute many small joins:

- use the same hash function $h$ to split $r_1$ and $r_2$ into $k$ partitions,
- join each of the $k$ pairs of partitions of $r_1$, $r_2$ separately.

Due to hash partitioning, join partners from $r_1$ and $r_2$ can only be found in matching partitions $i$ (hash joins only work for equi-joins!).

Since the $k$ small joins are independent of each other, this provides good potential for parallelization.

The principal idea behind hash joins is the algorithmic divide-and-conquer paradigm.

Conceptually, a hash join is divided into a partitioning phase (or building phase) and a probing phase (or matching phase).

The building phase scans each input relation in turn, filling $k$ buckets. The probing phase scans each of the $k$ buckets once and computes a small join (hopefully in memory), e.g., using another hash function $h_2$.

[Figure: probing phase. The partitions of R and S are read from disk. Of the B main memory buffers, k < B − 1 pages hold a hash table (hash function h_2) for partition R_i, one page is the input buffer (to scan S_i), and one page is the output buffer. The join result is written to disk.]

Algorithm: ⋈(p, r1, r2, r_out)
Input: equality predicate p, input files r1, r2
Output: output file r_out written (side-effect)

    // building phase:
    in1 ← openscan(r1);
    while (r ← nextrecord(in1)) ≠ EOF do
        add r to buffer page h(r);      // flushing buffer pages as they fill
    closefile(r1);
    in2 ← openscan(r2);
    while (s ← nextrecord(in2)) ≠ EOF do
        add s to buffer page h(s);      // flushing buffer pages as they fill
    closefile(r2);

    // probing phase (r1^l and r2^l denote the l-th partitions of r1 and r2):
    out ← createfile(r_out);
    for l = 1, ..., k do
        // build in-memory hash table for r1^l, using h2
        for each tuple r in r1^l do
            read r and insert it into hash table position h2(r);
        // scan r2^l and probe for matching r1^l tuples
        for each tuple s in r2^l do
            read s and probe hash table using h2(s);
            for matching r1 tuples r, appendrecord(out, ⟨r, s⟩);
        clear hash table for next partition;
    closefile(out);
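The two phases can be sketched in Python as follows. Simplifying assumptions: partitions are in-memory lists instead of files, Python's built-in `hash` plays the role of h, and a dictionary keyed on the join value stands in for the hash table built with h_2. The name `grace_hash_join` and the sample data are hypothetical.

```python
from collections import defaultdict


def grace_hash_join(r1, r2, key1, key2, k):
    """Equi-join via hash partitioning (building) and per-partition probing."""
    # Building (partitioning) phase: same hash function for both inputs.
    parts1 = [[] for _ in range(k)]
    parts2 = [[] for _ in range(k)]
    for r in r1:
        parts1[hash(r[key1]) % k].append(r)
    for s in r2:
        parts2[hash(s[key2]) % k].append(s)

    # Probing phase: join partners can only be in partitions with the same number.
    out = []
    for p1, p2 in zip(parts1, parts2):
        table = defaultdict(list)  # in-memory hash table for this r1 partition
        for r in p1:
            table[r[key1]].append(r)
        for s in p2:
            for r in table.get(s[key2], []):
                out.append((r, s))
    return out


r1 = [{"A": i % 5, "id": i} for i in range(10)]
r2 = [{"B": i, "tag": f"t{i}"} for i in range(3)]
result = grace_hash_join(r1, r2, "A", "B", k=4)
expected = [(r, s) for r in r1 for s in r2 if r["A"] == s["B"]]
print(len(result))
```

The k partition pairs are processed independently of each other, which is where the parallelization potential mentioned for hash joins comes from.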

Cost of this hash join

Ignoring memory bottlenecks, this ("Grace Hash Join") algorithm reads each page of $r_1$, $r_2$ exactly once in the building phase and writes about the same number of pages out for the partitions. The probing phase reads each partition once.

$\bowtie_p(r_1, r_2)$ (with $\|r\|$ the number of pages of file $r$)
  input access:   file scan (openscan) of $r_1$, $r_2$
  prerequisites:  equi-join, $r_1$, $r_2$ may be heap files
  I/O cost:       $\underbrace{\underbrace{\|r_1\| + \|r_2\|}_{\text{read}} + \underbrace{\|r_1\| + \|r_2\|}_{\text{write}}}_{\text{building phase}} + \underbrace{\|r_1\| + \|r_2\|}_{\text{probing phase}} = 3 \cdot (\|r_1\| + \|r_2\|)$

The I/O cost ignores the cost to write the result file $r_{out}$.

I/O performance figures. Example: Just like before, $\|r_1\| = 1000$ and $\|r_2\| = 500$ pages; on current hardware, a single I/O operation takes about 10 msec. Resulting processing time for the hash join of $r_1$ and $r_2$:

$3 \cdot (1000 + 500) \cdot 10\,\text{msec} = 45{,}000\,\text{msec} = 45\,\text{sec}$

More elaborate hash join algorithms deal, e.g., with the case that partitions do not fit into memory during the probing phase.

Memory Requirements for Grace Hash Join

We have to try to fit each hash partition into memory for the probing phase. Hence, to minimize partition size, we have to maximize the number of partitions.

- While partitioning, we need 1 buffer page per partition and 1 input buffer. With $B$ buffers, we can thus generate $B - 1$ partitions.
- This gives partitions of size $\frac{\|R\|}{B - 1}$ (for equal distribution), where $\|R\|$ is the number of pages of $R$.
- The size of an (in-memory) hash table for the probing phase needs to be $f \cdot \frac{\|R\|}{B - 1}$, for some fudge factor $f$ a little larger than 1.
- During the probing phase, we need to keep one such in-memory hash table, one input buffer, and one output buffer in memory, which results in $B > f \cdot \frac{\|R\|}{B - 1} + 2$.

In summary, we thus need approximately $B > \sqrt{f \cdot \|R\|}$ pages of buffer space for the Grace Hash Join to perform well.

If one or more partitions do not fit into main memory during the probing phase, this degrades performance significantly.

57 Utilizing Extra Memory

Suppose we are partitioning $R$ (and $S$) into $k$ partitions where $B > f \cdot \frac{|R|}{k}$, i.e., we can build an in-memory hash table for each partition.
- The partitioning phase needs $k + 1$ buffers, which leaves us with some extra buffer space of $B - (k + 1)$ pages.
- If this extra space is large enough to hold one partition, i.e., $B - (k + 1) \geq f \cdot \frac{|R|}{k}$, we can collect the entire first partition of $R$ in memory during the partitioning phase and need not write it to disk.
- Similarly, during the partitioning of $S$, we can avoid storing its first partition on disk and instead immediately probe the tuples in $S$'s first partition against the in-memory first partition of $R$ and write out results.
- At the end of the partitioning phase for $S$, we are already done with joining the first partitions.

The savings result from not having to write out and read back in the first partitions of $R$ and $S$. This version of hash join is called Hybrid Hash Join.
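The condition and the resulting savings can be written down as a rough cost sketch. It assumes equally sized partitions (so the first partitions hold $|R|/k$ and $|S|/k$ pages) and reuses the $3 \cdot (|R| + |S|)$ cost of the Grace Hash Join; the function names are hypothetical.

```python
def hybrid_applicable(buffers, k, pages_r, f=1.2):
    """True if the spare buffers B - (k + 1) can hold one partition of R."""
    return buffers - (k + 1) >= f * pages_r / k


def grace_io(pages_r, pages_s):
    """Page I/Os of the plain Grace hash join (result writing ignored)."""
    return 3 * (pages_r + pages_s)


def hybrid_io(pages_r, pages_s, k):
    """Page I/Os of the hybrid variant, assuming uniform partition sizes.

    The first partitions of R and S are neither written nor read back,
    which saves two I/Os per page of those partitions.
    """
    first_partitions = pages_r / k + pages_s / k
    return grace_io(pages_r, pages_s) - 2 * first_partitions


if __name__ == "__main__":
    print(hybrid_applicable(200, 10, 1000), hybrid_io(1000, 500, 10))
```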

58 8.5.6 Semijoins

Origin: Distributed DBMSs (here, transport cost dominates I/O cost).

Remember: Semijoin $R \ltimes S := \pi_R(R \bowtie S) \subseteq R$

Idea: to compute the distributed join between two relations $R$, $S$ stored on different nodes $N_R$, $N_S$ (assuming we want the result on $N_R$; let the common attributes be $J$):
1. Compute $\pi_J(R)$ on $N_R$.
2. Send the result to $N_S$.
3. Compute $\pi_J(R) \bowtie S$ on $N_S$.
4. Send the result to $N_R$.
5. Compute $R \bowtie (\pi_J(R) \bowtie S)$ on $N_R$.

N.B. Step 3 computes the semijoin $S \ltimes R$ between $S$ and $R$.

This algorithm is preferable to sending all of $S$ to $N_R$ if ($C_{tr}$ denotes transport cost, depending on the size of the transferred data):

$C_{tr}(\pi_J(R)) + C_{tr}(S \ltimes R) < C_{tr}(S)$
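The five steps can be traced with a small single-process sketch. Everything here is an illustrative assumption: relations are lists of dictionaries, "sending" is just passing a list along, and transport cost is approximated by the number of attribute values shipped. The helper names `project`, `natural_join`, and `semijoin_distributed_join` are hypothetical.

```python
def project(rel, attrs):
    """Duplicate-eliminating projection of a list of dicts onto attrs."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(left, right, attrs):
    """Nested-loops join on equality of the attributes in attrs."""
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in attrs)]


def semijoin_distributed_join(R, S, J):
    """Return (join result, number of attribute values shipped)."""
    pj = project(R, J)                      # step 1, on N_R
    # step 2: pj is shipped to N_S
    reduced_s = natural_join(pj, S, J)      # step 3, on N_S: S semijoin R
    # step 4: reduced_s is shipped to N_R
    result = natural_join(R, reduced_s, J)  # step 5, on N_R
    shipped = sum(len(t) for t in pj) + sum(len(t) for t in reduced_s)
    return result, shipped


if __name__ == "__main__":
    R = [{"A": 1, "B": 10}, {"A": 2, "B": 20}]
    S = [{"B": 10, "C": "x", "D": 1}, {"B": 30, "C": "y", "D": 2},
         {"B": 40, "C": "z", "D": 3}]
    print(semijoin_distributed_join(R, S, ["B"]))
```

With this made-up data the semijoin variant ships far fewer values than sending all of S, because only one S-tuple has a join partner in R.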

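59 Example: Semijoin

Let relations $R(A, B)$ and $S(B, C, D)$ be given (example tuples not reproduced here). This yields $\pi_B(R)$ with schema $(B)$ and $S \ltimes R$ with schema $(B, C, D)$.

Cost of semijoin: $C_{tr} = \ldots = 15$, whereas sending all of $S$ has $C_{tr} = \ldots$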

60 8.5.7 Summary of join algorithms

No single join algorithm performs best under all circumstances. The choice of algorithm is affected by
- the sizes of the relations joined,
- the size of the available buffer space,
- the availability of indexes,
- the form of the join condition,
- the selectivity of the join predicate,
- available physical properties of the inputs (e.g., sort orders),
- desirable physical properties of the output (e.g., sort orders),
- ...

Performance differences between a good and a bad algorithm for any given join can be enormous. Join algorithms have been the subject of intensive research efforts, particularly also in the context of parallel DBMSs.

61 8.6 A Three-Way Join Operator

Within the INGRES project at UC Berkeley, a three-way join operator has been developed. Observations:
- Suppose we want to compute the join $R \bowtie_A S \bowtie_B T$, where $A$ is an attribute common to $R$ and $S$, and $B$ is common to $S$ and $T$. This is an instance of a (three-way) star join with $S$ as the center relation.
- Using only traditional (two-way) join algorithms, choices will include left-deep NL-$\bowtie$ plans (with or without index) iterating over, say, $S$ as the outer relation, using either $R$ or $T$ as the first inner relation and the other of the two as the second inner relation.
- When thinking of simple NL-$\bowtie$ algorithms, this means that for each combination of matching $SR$- (or $ST$-) tuples, we have to iterate over all of $T$ (or $R$), resulting in a complexity on the order of $O(n \cdot m \cdot k)$, for $n$, $m$, $k$ the sizes of the involved relations (either in terms of number of tuples or number of pages). This roughly corresponds to three levels of nested loops.

62 The INGRES Three-Way Join Algorithm

Idea: Scan the center relation, $S$ in our example. For each tuple $s \in S$ do:
- Find all matching $R$-tuples $r$ and collect them in a temporary space $R'$ (e.g., using a nested loop or an index).
- Find all matching $T$-tuples $t$ and collect them in a temporary space $T'$ (e.g., using a nested loop or an index).
- Append to the output the product (i.e., all combinations) of the one $s$ tuple with the $r$ and $t$ tuples from the two temporary spaces $R'$ and $T'$.

N.B.: This corresponds to only two levels of nested loops: one outer loop (over $S$) with two loops inside, but one after the other, hence a complexity of only $O(n \cdot (m + k))$.

Disadvantage: This three-way join algorithm makes optimization even more complex, since a sequence of two binary (logical) operators needs to be mapped to a single ternary (physical) operator.
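A minimal in-memory sketch of this idea follows. It assumes relations are lists of tuples and that the join attributes are addressed by position; the function name and the parameters `a_r`, `a_s`, `b_s`, `b_t` are made up for the illustration and are not taken from INGRES.

```python
def three_way_join(R, S, T, a_r, a_s, b_s, b_t):
    """Join R, S, T with S as center: R[a_r] == S[a_s] and S[b_s] == T[b_t]."""
    out = []
    for s in S:                                       # single outer loop over S
        r_tmp = [r for r in R if r[a_r] == s[a_s]]    # first inner loop
        t_tmp = [t for t in T if t[b_t] == s[b_s]]    # second inner loop
        for r in r_tmp:                               # product of the two
            for t in t_tmp:                           # temporary spaces
                out.append(r + s + t)
    return out


if __name__ == "__main__":
    R = [(1, "r1"), (2, "r2"), (1, "r3")]
    S = [(1, "p"), (2, "q"), (3, "p")]
    T = [("p", "t1"), ("q", "t2"), ("p", "t3")]
    for row in three_way_join(R, S, T, 0, 0, 1, 0):
        print(row)
```

The scans of `R` and `T` run one after the other for each `s`, so the scanning work is proportional to n * (m + k); producing the output combinations comes on top of that.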

63 8.7 Other Operators

8.7.1 Set Operations

Intersection and Cross Product are implemented as special joins: for intersection, use equality on all attributes as the join condition; for the product, use "true". Hence, there is no need to consider those further.

With Union and Difference, the challenge lies in duplicate identification. There are two approaches, one based on sorting and one based on hashing. Work out the details on your own.
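As a starting point for that exercise, here is one possible in-memory sketch of both approaches: a hash-based union and a sort-based difference. It assumes relations are lists of comparable, hashable tuples; a disk-based version would partition or externally sort the inputs first. The function names are hypothetical.

```python
def union_hash(r, s):
    """Union with duplicate elimination via a hash set."""
    seen, out = set(), []
    for t in list(r) + list(s):
        if t not in seen:          # duplicate identification by hashing
            seen.add(t)
            out.append(t)
    return out


def difference_sorted(r, s):
    """r minus s (set semantics) by sorting both inputs and merging."""
    r_sorted, s_sorted = sorted(r), sorted(s)
    out, j = [], 0
    for t in r_sorted:
        if out and out[-1] == t:
            continue               # duplicate within r, already emitted
        while j < len(s_sorted) and s_sorted[j] < t:
            j += 1                 # advance s up to the first tuple >= t
        if j < len(s_sorted) and s_sorted[j] == t:
            continue               # t also occurs in s: drop it
        out.append(t)
    return out


if __name__ == "__main__":
    print(union_hash([(1,), (2,)], [(2,), (3,)]))
    print(difference_sorted([(1,), (2,), (2,), (3,), (5,)], [(2,), (4,), (5,)]))
```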

64 8.7.2 Aggregates

The language SQL supports a number of aggregation operators (such as sum, avg, count, min, max).

Basic algorithm: scan the whole relation and maintain some running information during that scan. Compute the aggregate value from the running information upon completion of the scan:

  Aggregate   Running Information
  sum         Total of values read
  avg         Total, Count of values read
  count       Count of values read
  min         Smallest value read
  max         Largest value read

Grouping: if aggregation is combined with grouping, we first have to do the grouping, using hashing or sorting (or an appropriate index). Then, use the running information on a per-group basis.

Index-only: sometimes, aggregate values can be computed without accessing the data records at all, by just using an available index.
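The basic algorithm with hash-based grouping might look as follows. This is a sketch under the assumption that records are tuples held in memory and that the grouping and value attributes are given by position; a Python dictionary stands in for the hash table, and the function name is made up.

```python
def aggregate(records, group_pos, value_pos):
    """One scan; keeps running information per group in a hash table."""
    groups = {}
    for rec in records:
        g, v = rec[group_pos], rec[value_pos]
        state = groups.get(g)
        if state is None:
            groups[g] = {"sum": v, "count": 1, "min": v, "max": v}
        else:
            state["sum"] += v
            state["count"] += 1
            state["min"] = min(state["min"], v)
            state["max"] = max(state["max"], v)
    # avg is derived from the running total and count after the scan
    return {g: {**st, "avg": st["sum"] / st["count"]} for g, st in groups.items()}


if __name__ == "__main__":
    print(aggregate([("a", 3), ("b", 5), ("a", 7)], 0, 1))
```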

65 8.8 The impact of buffering

Effective use of the buffer pool is crucial for an efficient implementation of a relational query engine. Several operators use the size of the available buffer space as a parameter. Keep the following in mind:
1. When several operators execute concurrently, they share the buffer pool.
2. Using an unclustered index for accessing records makes finding a page in the buffer rather unlikely and dependent (rather unpredictably!) on the size of the buffer.
3. Furthermore, each page access is likely to refer to a new page; therefore, the buffer pool fills quickly and we obtain a high level of I/O activity.
4. If an operation has a repeated pattern of page accesses, a clever replacement policy and/or a sufficient number of buffers can speed up the operation significantly. Examples of such patterns are:

66
- Simple nested loops join: for each outer tuple, scan all pages of the inner relation. If there is enough buffer space to hold the entire inner relation, the replacement policy is irrelevant. Otherwise it is critical:
  - LRU will never find a needed page in the buffer ("Sequential Flooding" problem, see Section 2.3).
  - MRU gives the best buffer utilization; the first $B - 2$ pages of the inner relation will always stay in the buffer.
- Nested block join: for each block of the outer relation, scan all pages of the inner relation. Since only one unpinned page is available for the scan of the inner relation, the replacement policy makes no difference.
- Index nested loop join: for each tuple in the outer relation, use the index to find matching tuples in the inner relation. For duplicate values in the join attributes of the outer relation, we obtain repeated access patterns for the inner tuples and the index. The effect can be maximized by sorting the outer tuples on the join attributes.
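The difference between LRU and MRU under repeated sequential scans can be observed with a tiny simulation. The model is deliberately simplified and is an assumption of this sketch: all buffer frames are available to the inner scan (no frames pinned for the outer relation or the output), pages are numbered 0 to N-1, and the function name `count_hits` is hypothetical.

```python
from collections import OrderedDict


def count_hits(policy, buffer_pages, inner_pages, scans):
    """Count buffer hits for repeated sequential scans of the inner relation."""
    buffer = OrderedDict()       # keys ordered from least to most recently used
    hits = 0
    for _ in range(scans):
        for page in range(inner_pages):
            if page in buffer:
                hits += 1
                buffer.move_to_end(page)   # page is now most recently used
                continue
            if len(buffer) == buffer_pages:
                # LRU evicts the oldest entry, MRU the newest one
                buffer.popitem(last=(policy == "MRU"))
            buffer[page] = True
    return hits


if __name__ == "__main__":
    for policy in ("LRU", "MRU"):
        print(policy, count_hits(policy, buffer_pages=3, inner_pages=5, scans=3))
```

With a buffer smaller than the inner relation, LRU always evicts exactly the page that will be needed soonest in the next scan, so it never scores a hit; MRU keeps part of the inner relation resident across scans.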

67 8.9 Managing long pipelines of relational operators

Note that every relational operator we have discussed takes a parameter $r_{out}$, i.e., a file (name) to be written to hold the operator's output. In some sense, we are using secondary storage as a one-way communication channel between operators in a plan. Consequences of this approach:
1. We pay for the (substantial) I/O effort to feed into and read from this communication channel.
2. The operators in a plan are executed in sequence; the first result record is not produced before the last relational operator in the pipeline executes.

(Plan diagram: the inputs $r_1$ and $r_2$ feed an operator with parameter $p$, whose output file $tmp_1$ feeds an operator with parameter $l$, followed by $tmp_2$, an operator with parameter $q$, $tmp_3$, and so on up to $tmp_n$ and a final operator with parameter $k$.)

N.B.: No more than three temporary files $tmp_i$ need to exist at any point in time during execution.

68 Architecting the query processor in this fashion bears much resemblance to using the Unix shell like this:

    # report all large MP3 audio files
    # ... below the current working directory
    $ find . -size +1M > tmp1
    $ xargs file < tmp1 > tmp2
    $ grep -i MP3 < tmp2 > tmp3
    $ cut -d: -f1 < tmp3
    output
    $ rm tmp[0-9]

Unix supports another type of communication channel, the pipe, which lets the participating commands exchange data character-by-character:

    # report all large MP3 audio files
    # ... below the current working directory
    $ find . -size +1M | xargs file | grep -i MP3 | cut -d: -f1
    output

69 The execution of the pipe is driven by the rightmost command:
1. To produce a line of output, cut only needs to see the next line in its input: grep is requested to produce this input.
2. To produce this line of output, grep only needs to see the next line in its input: xargs is requested to produce this input.

As soon as find has produced a line of output, it is passed through the pipe, transformed by xargs, grep, and cut, and then echoed to the terminal.

In the database world, this mode of executing a pipeline (a query plan) is called streaming:
- A streaming query processor avoids writing temporary files (the $tmp_i$) whenever possible,
- operators communicate their output record-by-record (or block-by-block),
- a result record appears as soon as it is available (as opposed to when the complete result has been computed [11]).

[11] This is of major importance in interactive DBMS environments (ad-hoc query interfaces).
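Python generators behave much like this demand-driven pipe and make the interleaving easy to observe. The sketch below is only an analogy: `source`, `grep`, and `cut_first_field` are made-up stand-ins for the shell commands, and the `log` list exists solely to show when the source is asked for its next line.

```python
def source(lines, log):
    """Stand-in for the leftmost command; records when a line is produced."""
    for line in lines:
        log.append(line)
        yield line


def grep(pattern, stream):
    """Pass on only lines containing pattern (case-insensitive)."""
    for line in stream:
        if pattern.lower() in line.lower():
            yield line


def cut_first_field(stream, delim=":"):
    """Keep the first delimiter-separated field of every line."""
    for line in stream:
        yield line.split(delim)[0]


if __name__ == "__main__":
    log = []
    lines = ["a.mp3: MP3 audio", "b.txt: ASCII text", "c.mp3: MP3 audio"]
    pipeline = cut_first_field(grep("mp3", source(lines, log)))
    print(next(pipeline))   # the rightmost stage drives the whole pipeline
    print(len(log))         # only one source line has been produced so far
```

Asking the rightmost stage for one result pulls exactly as much input through the pipeline as is needed to produce it; nothing is materialized in between.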

70 Example:

    $ grep foo
    XML
    foobar
    foobar
    What does foo mean anyway?
    What does foo mean anyway?
    Enough already
    ^D
    $

Note, however, that we have to modify the implementations of our relational operators to support streaming. Currently, all operators consume their input as a whole, then write their output file as a whole, and only then return control to the query processor.

71 8.9.1 Streaming Interface

To support streaming we need a record-by-record calling convention. New operator interface (let op denote a relational operator):
- op.reset(): The operator is requested to reset so that a call to op.next() will produce the first result record.
- op.next(): The operator is requested to produce the next record of its result. Returns EOF if all result records have been requested already.

72 Example (implementation of $\sigma_p(r_{in})$):

Algorithm: σ(p, in).reset()
Input:     predicate p, in-bound stream in
    in.reset();

Algorithm: σ(p, in).next()
Input:     predicate p, in-bound stream in
Output:    next record of selection result (or EOF)
    while (r ← in.next()) ≠ EOF do
        if p(r) then
            // immediately return if next result record found
            return r;
    return EOF;

73 Given a query plan like the one shown below, query evaluation is driven by the query processor like this (just like in the Unix shell):
1. The whole plan is initially reset by calling reset() on the root operator, i.e., q.reset().
2. The reset() call is forwarded through the plan by the operators themselves (see the reset() implementation on the previous slide).
3. Control returns to the query processor.
4. The root is requested to produce its next result record, i.e., the call q.next() is made.
5. Operators forward the next() request as needed. As soon as the next result record is produced, control returns to the query processor again.

(Plan diagram: scan operators over $r_1$ and $r_2$ feed a chain of operators with parameters $p$, $l$, and $q$; the operator with parameter $q$ is the root.)

74 In short, the query processor uses the following routine to evaluate a query plan:

Algorithm: eval(q)
Input:     root operator of query plan q
Output:    query result sent to terminal
    q.reset();
    while (r ← q.next()) ≠ EOF do
        print(r);
    print("done.");
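The reset()/next() calling convention, the selection operator, and the driver loop can be put together in a short sketch. The class names, the `EOF` sentinel object, and the list-backed leaf operator `ListScan` are assumptions made for this illustration; unlike the routine above, `eval_plan` collects the result in a list instead of printing it so that it is easy to check.

```python
EOF = object()   # unique end-of-stream marker


class ListScan:
    """Leaf operator streaming the records of an in-memory list."""

    def __init__(self, records):
        self.records, self.pos = records, 0

    def reset(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.records):
            return EOF
        rec = self.records[self.pos]
        self.pos += 1
        return rec


class Select:
    """Streaming selection: passes on input records that satisfy p."""

    def __init__(self, p, inp):
        self.p, self.inp = p, inp

    def reset(self):
        self.inp.reset()            # forward the reset to the input stream

    def next(self):
        while (r := self.inp.next()) is not EOF:
            if self.p(r):
                return r            # return as soon as a result is found
        return EOF


def eval_plan(q):
    """Drive the plan from its root operator and collect all results."""
    q.reset()
    results = []
    while (r := q.next()) is not EOF:
        results.append(r)
    return results


if __name__ == "__main__":
    plan = Select(lambda r: r % 2 == 0, ListScan([1, 2, 3, 4]))
    print(eval_plan(plan))
```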

75 A streaming scan operator. Complete the implementation below to provide a streaming file scan operator:

Algorithm: scan(f).reset()
Input:     filename f
    ...

Algorithm: scan(f).next()
Input:     filename f
Output:    next record in file f or EOF
    ...
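One possible way to fill in this skeleton is sketched below. It assumes that a "file" is a plain text file with one record per line, which is a simplification of the heap files used elsewhere in this module; the class name `FileScan`, the `close()` method, and the `EOF` sentinel are hypothetical.

```python
import os
import tempfile

EOF = object()   # unique end-of-stream marker


class FileScan:
    """Streaming scan over a text file, one record per line (assumed format)."""

    def __init__(self, filename):
        self.filename = filename
        self.handle = None

    def reset(self):
        # (Re)open the file so that next() starts with the first record.
        if self.handle is not None:
            self.handle.close()
        self.handle = open(self.filename, "r", encoding="utf-8")

    def next(self):
        line = self.handle.readline()
        if line == "":              # readline() returns "" only at end of file
            return EOF
        return line.rstrip("\n")

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
        tmp.write("r1\nr2\n")
    scan = FileScan(tmp.name)
    scan.reset()
    while (rec := scan.next()) is not EOF:
        print(rec)
    scan.close()
    os.remove(tmp.name)
```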

76 A streaming NL-$\bowtie$ operator. Complete the implementation below to provide a streaming NL-$\bowtie$ operator (see 8.5.1):

Algorithm: $\bowtie$(p, in_1, in_2).reset()
Input:     predicate p, in-bound streams in_1, in_2
    ...

Algorithm: $\bowtie$(p, in_1, in_2).next()
Input:     predicate p, in-bound streams in_1, in_2
Output:    next record in join result or EOF
    ...
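A possible solution sketch for this exercise is shown below. The key point is that the operator has to remember the current outer record between calls, while the position in the inner stream is kept by the inner stream itself. The class names, the list-backed `ListScan` input, and the `EOF` sentinel are assumptions of this sketch; records are tuples and a result record is the concatenation of the two inputs.

```python
EOF = object()   # unique end-of-stream marker


class ListScan:
    """Leaf operator streaming the records of an in-memory list."""

    def __init__(self, records):
        self.records, self.pos = records, 0

    def reset(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.records):
            return EOF
        rec = self.records[self.pos]
        self.pos += 1
        return rec


class NLJoin:
    """Streaming nested-loops join of in1 (outer) and in2 (inner)."""

    def __init__(self, p, in1, in2):
        self.p, self.in1, self.in2 = p, in1, in2
        self.outer = EOF            # current outer record, set by reset()

    def reset(self):
        self.in1.reset()
        self.in2.reset()
        self.outer = self.in1.next()

    def next(self):
        while self.outer is not EOF:
            # Continue the inner scan where the previous call left off.
            while (s := self.in2.next()) is not EOF:
                if self.p(self.outer, s):
                    return self.outer + s
            # Inner stream exhausted: advance the outer, restart the inner.
            self.outer = self.in1.next()
            self.in2.reset()
        return EOF


if __name__ == "__main__":
    join = NLJoin(lambda r, s: r[0] == s[0],
                  ListScan([(1, "a"), (2, "b")]),
                  ListScan([(1, "x"), (2, "y"), (1, "z")]))
    join.reset()
    while (rec := join.next()) is not EOF:
        print(rec)
```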

77 Below is a code snippet used in a real DBMS product. The overall structure of this code almost perfectly matches the recent discussion:

    /* efltr -- apply filter predicate pred to stream

       Filter the in-bound stream, only stream elements that fulfill e->pred
       contribute to the result. No index support whatsoever.
    */
    erc eop_FLTR(eOp *ip)
    {
        eobj_FLTR *e = (eobj_FLTR *)eobj(ip);

        /* Challenge the in-bound stream until it is exhausted... */
        while (eintp(e->in) != eeos) {
            eintp(e->pred);
            /* ... or a stream element fulfills predicate e->pred */
            if (et_as_bool(eval(e->pred))) {
                eval(ip) = eval(e->in);
                return eok;
            }
        }
        return eeos;
    }

    erc eop_FLTR_RST(eOp *ip)
    {
        eobj_FLTR *e = (eobj_FLTR *)eobj(ip);

        ereset(e->in);
        ereset(e->pred);

        return eok;
    }

78 8.9.2 Demand-Driven vs. Data-Driven Streaming

The iterator interface as shown above implements a demand-driven query processing infrastructure: consumers (later operators) request more input (by calling next()) from their producers (earlier operators) whenever they are ready to process the input. Demand-driven streaming minimizes resource requirements and wasted effort in case a user/client does not want to see the whole result.

In contrast, data-driven streaming requires more resources, uses a different query processing infrastructure, and can exploit more parallelism:
- Each operator starts (asynchronously) to work on its input as soon and as fast as possible.
- Output is enqueued into a pipeline to the consumers as it occurs.
- The pipelines need to do buffering and/or to suspend producers.
- An operator only needs to wait if there is no more input yet, or if the output pipeline is full.
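Data-driven streaming can be imitated with threads and bounded queues: each stage runs on its own, a full queue suspends the producer, and an empty queue makes the consumer wait. The sketch below uses Python's standard `queue` and `threading` modules; the stage functions, the `DONE` marker, and the queue capacity of 2 are illustrative assumptions rather than a description of any particular DBMS.

```python
import queue
import threading

DONE = object()   # end-of-stream marker passed through the pipelines


def producer(records, out_q):
    """Push records downstream as fast as the bounded queue allows."""
    for r in records:
        out_q.put(r)                 # blocks while the output pipeline is full
    out_q.put(DONE)


def filter_stage(pred, in_q, out_q):
    """Asynchronous selection operator between two pipelines."""
    while (r := in_q.get()) is not DONE:   # blocks while there is no input yet
        if pred(r):
            out_q.put(r)
    out_q.put(DONE)


def run(records, pred, capacity=2):
    """Start all stages, then drain the final pipeline into a list."""
    q1 = queue.Queue(maxsize=capacity)
    q2 = queue.Queue(maxsize=capacity)
    threads = [
        threading.Thread(target=producer, args=(records, q1)),
        threading.Thread(target=filter_stage, args=(pred, q1, q2)),
    ]
    for t in threads:
        t.start()
    results = []
    while (r := q2.get()) is not DONE:
        results.append(r)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run(range(10), lambda x: x % 2 == 0))
```

Compared with the iterator interface, no stage ever calls another stage; all coordination happens through the queues, which is what allows the stages to run in parallel.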

79 Bibliography

- Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2).
- Kemper, A., Moerkotte, G., Peithner, K., and Steinbrunn, M. (1994). Optimizing disjunctive queries with expensive predicates. In Snodgrass, R. T. and Winslett, M., editors, Proc. ACM SIGMOD Conference on Management of Data, Minneapolis, MN. ACM Press.
- Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.
- Steinbrunn, M., Peithner, K., Moerkotte, G., and Kemper, A. (1995). Bypassing joins in disjunctive queries. In Dayal, U., Gray, P., and Nishio, S., editors, Proc. Intl. Conf. on Very Large Databases, Zurich, Switzerland. Morgan Kaufmann.
- Wong, E. and Youssefi, K. (1976). Decomposition: A strategy for query processing. ACM Transactions on Database Systems, 1(3).
