DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation.




Ada Gibson
5 years ago

Query Processing

QUERY PROCESSING refers to the range of activities involved in extracting data from a database. These activities include translating queries written in high-level database languages into expressions that can be used at the physical level of the file system, applying a variety of query-optimizing transformations, and actually evaluating queries.

1. OVERVIEW

The steps involved in processing a query are:
1. Parsing and translation.
2. Optimization.
3. Evaluation.

Before query processing can begin, the system must translate the query into a usable form. A language such as SQL is suitable for human use, but is ill suited to be the system's internal representation of a query. A more useful internal representation is one based on the extended relational algebra.

Thus, the first action the system must take in query processing is to translate a given query into its internal form. This translation process is similar to the work performed by the parser of a compiler. In generating the internal form of the query, the parser checks the syntax of the user's query, verifies that the relation names appearing in the query are names of relations in the database, and so on. The system constructs a parse-tree representation of the query, which it then translates into a relational-algebra expression. If the query was expressed in terms of a view, the translation phase also replaces all uses of the view by the relational-algebra expression that defines the view. (Most compiler texts cover parsing in detail.)

Fig: Steps in query processing

DEPT OF CSE, RGCET

Given a query, there are generally a variety of methods for computing the answer. For example, in SQL a query can be expressed in several different ways. Each SQL query can itself be translated into a relational-algebra expression in one of several ways. Furthermore, the relational-algebra representation of a query specifies only partially how to evaluate the query; there are usually several ways to evaluate a relational-algebra expression. As an illustration, consider the query:

    select salary from instructor where salary < 75000;

This query can be translated into either of the following relational-algebra expressions:

    σ_{salary<75000}(Π_{salary}(instructor))
    Π_{salary}(σ_{salary<75000}(instructor))

Further, we can execute each relational-algebra operation by one of several different algorithms. For example, to implement the preceding selection, we can search every tuple in instructor to find tuples with salary less than 75000. If a B+-tree index is available on the attribute salary, we can use the index instead to locate the tuples.

To specify fully how to evaluate a query, we need not only to provide the relational-algebra expression, but also to annotate it with instructions specifying how to evaluate each operation. Annotations may state the algorithm to be used for a specific operation, or the particular index or indices to use.

Fig: A query evaluation plan

A relational-algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive. A sequence of primitive operations that can be used to evaluate a query is a query-execution plan or query-evaluation plan. The figure illustrates an evaluation plan for our example query, in which a particular index (denoted in the figure as "index 1") is specified for the selection operation. The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. The different evaluation plans for a given query can have different costs.
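The equivalence of the two relational-algebra expressions for the salary query can be checked on a toy relation. The following is a minimal Python sketch, not part of the original notes: the rows of instructor are made-up illustration data, the helpers select and project are hypothetical names, and duplicates are kept as in SQL (multiset semantics).

```python
# Toy instructor relation; the rows are invented purely for illustration.
instructor = [
    {"ID": 1, "name": "A", "salary": 65000},
    {"ID": 2, "name": "B", "salary": 90000},
    {"ID": 3, "name": "C", "salary": 72000},
]


def select(rows, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Projection: keep only the listed attributes (duplicates are kept)."""
    return [{a: row[a] for a in attrs} for row in rows]


def cheap(row):
    return row["salary"] < 75000


# Plan A: project first, then select.
plan_a = select(project(instructor, ["salary"]), cheap)
# Plan B: select first, then project.
plan_b = project(select(instructor, cheap), ["salary"])

print(plan_a == plan_b)
```

Both plans return the same tuples, but plan B applies the projection only to the tuples that survive the selection, which is the kind of difference an evaluation plan makes explicit.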

We do not expect users to write their queries in a way that suggests the most efficient evaluation plan. Rather, it is the responsibility of the system to construct a query-evaluation plan that minimizes the cost of query evaluation; this task is called query optimization. Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output. Several operations may be grouped together into a pipeline, in which each operation starts working on its input tuples even as they are being generated by another operation.

2. MEASURES OF QUERY COST

There are multiple possible evaluation plans for a query, and it is important to be able to compare the alternatives in terms of their (estimated) cost and choose the best plan. To do so, we must estimate the cost of individual operations and combine them to get the cost of a query-evaluation plan. Thus, as we study evaluation algorithms for each operation, we also outline how to estimate the cost of the operation.

The cost of query evaluation can be measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of communication. In large database systems, the cost to access data from disk is usually the most important cost, since disk accesses are slow compared to in-memory operations. Moreover, CPU speeds have been improving much faster than disk speeds, so the time spent in disk activity is likely to continue to dominate the total time to execute a query. The CPU time taken for a task is harder to estimate, since it depends on low-level details of the execution code. Although real-life query optimizers do take CPU costs into account, for simplicity we ignore CPU costs here and use only disk-access costs to measure the cost of a query-evaluation plan.
We use the number of block transfers from disk and the number of disk seeks to estimate the cost of a query-evaluation plan. The response time for a query-evaluation plan (that is, the wall-clock time required to execute the plan), assuming no other activity is going on in the computer, would account for all these costs, and could be used as a measure of the cost of the plan. Unfortunately, the response time of a plan is very hard to estimate without actually executing the plan, for the following reasons:

1. The response time depends on the contents of the buffer when the query begins execution; this information is not available when the query is optimized, and is hard to account for even if it were available.
2. In a system with multiple disks, the response time depends on how accesses are distributed among disks, which is hard to estimate without detailed knowledge of data layout on disk.

As a result, instead of trying to minimize the response time, optimizers generally try to minimize the total resource consumption of a query plan.

3. SELECTION OPERATION

In query processing, the file scan is the lowest-level operator to access data. File scans are search algorithms that locate and retrieve records that fulfill a selection condition. In relational systems, a file scan allows an entire relation to be read in those cases where the relation is stored in a single, dedicated file.

3.1 Selections Using File Scans and Indices

Consider a selection operation on a relation whose tuples are stored together in one file. The most straightforward way of performing a selection is as follows:

A1 (linear search). In a linear search, the system scans each file block and tests all records to see whether they satisfy the selection condition. An initial seek is required to access the first block of the file. If the blocks of the file are not stored contiguously, extra seeks may be required, but we ignore this effect for simplicity.

Index structures are referred to as access paths, since they provide a path through which data can be located and accessed. Recall that a primary index (also referred to as a clustering index) is an index that allows the records of a file to be read in an order that corresponds to the physical order in the file. An index that is not a primary index is called a secondary index. Search algorithms that use an index are referred to as index scans. We use the selection predicate to guide us in the choice of the index to use in processing the query. Search algorithms that use an index are:

A2 (primary index, equality on key). For an equality comparison on a key attribute with a primary index, we can use the index to retrieve a single record that satisfies the corresponding equality condition.

A3 (primary index, equality on nonkey). We can retrieve multiple records by using a primary index when the selection condition specifies an equality comparison on a nonkey attribute, A. The only difference from the previous case is that multiple records may need to be fetched. However, these records are stored consecutively in the file, since the file is sorted on the search key.

A4 (secondary index, equality). Selections specifying an equality condition can use a secondary index.
This strategy can retrieve a single record if the equality condition is on a key; multiple records may be retrieved if the indexing field is not a key. In the first case, only one record is retrieved, and the time cost is the same as that for a primary index (case A2). In the second case, each record may be resident on a different block, which may result in one I/O operation per retrieved record, with each I/O operation requiring a seek and a block transfer. The worst-case time cost in this case is (hi + n) * (tS + tT), where hi is the height of the index, n is the number of records fetched, tS is the time for one seek and tT is the time to transfer one block; this assumes that each record is in a different disk block and that the block fetches are randomly ordered. The worst-case cost could become even worse than that of linear search if a large number of records are retrieved.

3.2 Selections Involving Comparisons

Consider a selection of the form σ_{A≤v}(r). We can implement the selection either by using linear search or by using indices in one of the following ways:

A5 (primary index, comparison). A primary ordered index (for example, a primary B+-tree index) can be used when the selection condition is a comparison. For comparison conditions of the form A > v or A ≥ v, a primary index on A can be used to direct the retrieval of tuples, as follows. For A ≥ v, we look up the value v in the index to find the first tuple in the file that has a value of A ≥ v. A file scan starting from that tuple up to the end of the file returns all tuples that satisfy the condition. For A > v, the file scan starts with the first tuple such that A > v. The cost estimate for this case is identical to that for case A3.
For comparisons of the form A < v or A ≤ v, an index lookup is not required. For A < v, we use a simple file scan starting from the beginning of the file, and continuing up to (but not including) the first tuple with attribute A ≥ v. The case of A ≤ v is similar, except that the scan continues up to (but not including) the first tuple with attribute A > v. In either case, the index is not useful.

A6 (secondary index, comparison). We can use a secondary ordered index to guide retrieval for comparison conditions involving <, ≤, ≥, or >. The lowest-level index blocks are scanned, either from the smallest value up to v (for < and ≤), or from v up to the maximum value (for > and ≥).

3.3 Implementation of Complex Selections

We now consider more complex selection predicates.

Conjunction: A conjunctive selection is a selection of the form σ_{θ1 ∧ θ2 ∧ ... ∧ θn}(r).

Disjunction: A disjunctive selection is a selection of the form σ_{θ1 ∨ θ2 ∨ ... ∨ θn}(r). A disjunctive condition is satisfied by the union of all records satisfying the individual, simple conditions θi.

Negation: The result of a selection σ_{¬θ}(r) is the set of tuples of r for which the condition θ evaluates to false. In the absence of nulls, this set is simply the set of tuples in r that are not in σ_{θ}(r).

We can implement a selection operation involving either a conjunction or a disjunction of simple conditions by using one of the following algorithms:

A7 (conjunctive selection using one index). We first determine whether an access path is available for an attribute in one of the simple conditions. If one is, one of the selection algorithms A2 through A6 can retrieve records satisfying that condition.
We complete the operation by testing, in the memory buffer, whether or not each retrieved record satisfies the remaining simple conditions. To reduce the cost, we choose a θi and one of algorithms A1 through A6 for which the combination results in the least cost for σ_{θi}(r). The cost of algorithm A7 is given by the cost of the chosen algorithm.

A8 (conjunctive selection using composite index). An appropriate composite index (that is, an index on multiple attributes) may be available for some conjunctive selections. If the selection specifies an equality condition on two or more attributes, and a composite index exists on these combined attribute fields, then the index can be searched directly. The type of index determines which of algorithms A2, A3, or A4 will be used.

A9 (conjunctive selection by intersection of identifiers). Another alternative for implementing conjunctive selection operations involves the use of record pointers or record identifiers. This algorithm requires indices with record pointers on the fields involved in the individual conditions. The algorithm scans each index for pointers to tuples that satisfy an individual condition. The intersection of all the retrieved pointers is the set of pointers to tuples that satisfy the conjunctive condition. The algorithm then uses the pointers to retrieve the actual records. If indices are not available on all the individual conditions, then the algorithm tests the retrieved records against the remaining conditions.
The cost of algorithm A9 is the sum of the costs of the individual index scans, plus the cost of retrieving the records in the intersection of the retrieved lists of pointers. This cost can be reduced by sorting the list of pointers and retrieving records in the sorted order. Thereby, (1) all pointers to records in a block come together, hence all selected records in the block can be retrieved using a single I/O operation, and (2) blocks are read in sorted order, minimizing disk-arm movement.

A10 (disjunctive selection by union of identifiers). If access paths are available on all the conditions of a disjunctive selection, each index is scanned for pointers to tuples that satisfy the individual condition. The union of all the retrieved pointers yields the set of pointers to all tuples that satisfy the disjunctive condition. We then use the pointers to retrieve the actual records.
However, if even one of the conditions does not have an access path, we have to perform a linear scan of the relation to find tuples that satisfy the condition. Therefore, if there is even one such condition in the disjunct, the most efficient access method is a linear scan, with the disjunctive condition tested on each tuple during the scan.

4. SORTING

Sorting of data plays an important role in database systems for two reasons. First, SQL queries can specify that the output be sorted. Second, and equally important for query processing, several of the relational operations, such as joins, can be implemented efficiently if the input relations are first sorted.

We may build an index on the relation, and then use the index to read the relation in sorted order. However, this may lead to one disk-block access for each tuple.
For relations that fit in memory, techniques like quicksort can be used. For relations that do not fit in memory, external sort-merge is a good choice.

External Sort-Merge

Sorting of relations that do not fit in memory is called external sorting. The most commonly used technique for external sorting is the external sort-merge algorithm, which we describe next. Let M denote the number of blocks in the main-memory buffer available for sorting, that is, the number of disk blocks whose contents can be buffered in available main memory.

1. In the first stage, a number of sorted runs are created; each run is sorted, but contains only some of the records of the relation.

    i = 0;
    repeat
        read M blocks of the relation, or the rest of the relation, whichever is smaller;
        sort the in-memory part of the relation;
        write the sorted data to run file Ri;
        i = i + 1;
    until the end of the relation

2. In the second stage, the runs are merged. Suppose, for now, that the total number of runs, N, is less than M, so that we can allocate one block to each run and have space left to hold one block of output. The merge stage operates as follows:

    read one block of each of the N files Ri into a buffer block in memory;
    repeat
        choose the first tuple (in sort order) among all buffer blocks;
        write the tuple to the output, and delete it from the buffer block;
        if the buffer block of any run Ri is empty and not end-of-file(Ri)
            then read the next block of Ri into the buffer block;
    until all input buffer blocks are empty

The output of the merge stage is the sorted relation. The output file is buffered to reduce the number of disk write operations. The preceding merge operation is a generalization of the two-way merge used by the standard in-memory sort-merge algorithm; it merges N runs, so it is called an N-way merge.

If N ≥ M, several merge passes are required. In each pass, contiguous groups of M - 1 runs are merged. A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor. E.g., if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs. Repeated passes are performed until all runs have been merged into one.

Fig: Example of external sorting using sort-merge
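The two stages described above can be sketched in Python. This is a simplified in-memory simulation under stated assumptions, not a disk-based implementation: runs are plain lists instead of run files, block_size is a hypothetical parameter giving the number of records per block, and the function names are invented for this sketch. The standard-library heapq.merge plays the role of the N-way merge.

```python
import heapq


def create_runs(records, block_size, M):
    """Stage 1: sort chunks of M blocks each; every chunk becomes one run."""
    chunk = block_size * M
    return [sorted(records[i:i + chunk]) for i in range(0, len(records), chunk)]


def merge_pass(runs, M):
    """One merge pass: merge contiguous groups of M - 1 runs into longer runs."""
    fan_in = M - 1
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]


def external_sort(records, block_size, M):
    """Return the sorted records and the number of merge passes used."""
    if M < 3:
        raise ValueError("need at least 3 buffer blocks to merge two runs")
    runs = create_runs(records, block_size, M)
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, M)
        passes += 1
    return (runs[0] if runs else []), passes


data = [24, 19, 31, 33, 14, 16, 21, 3, 2, 7, 16, 4]
# One record per block and M = 3: four runs of three records, then two passes.
sorted_data, passes = external_sort(data, block_size=1, M=3)
print(sorted_data, passes)
```

With M = 3 only two runs can be merged at a time, so the four initial runs need two merge passes; a larger M would finish in a single pass.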


Cost analysis:

Using 1 block per run leads to too many seeks during the merge. Instead, use bb buffer blocks per run, and read or write bb blocks at a time.
With this scheme, ⌊M/bb⌋ - 1 runs can be merged in one pass.
Total number of merge passes required: ⌈log_{⌊M/bb⌋-1}(br/M)⌉, where br is the number of blocks of the relation.
Block transfers for initial run creation, as well as in each pass, number 2br. For the final pass we do not count the write cost: we ignore the final write cost for all operations, since the output of an operation may be sent to the parent operation without being written to disk.
Thus the total number of block transfers for external sorting is: br (2⌈log_{⌊M/bb⌋-1}(br/M)⌉ + 1).

Cost of seeks:

During run generation: one seek to read each run and one seek to write each run, that is, 2⌈br/M⌉ seeks.

During the merge phase: 2⌈br/bb⌉ seeks are needed for each merge pass, except the final one, which does not require a write.
Total number of seeks: 2⌈br/M⌉ + ⌈br/bb⌉ (2⌈log_{⌊M/bb⌋-1}(br/M)⌉ - 1).

5. JOIN OPERATION

Several different algorithms can be used to implement joins:
    Nested-loop join
    Block nested-loop join
    Indexed nested-loop join
    Merge-join
    Hash-join
The choice is based on cost estimates.

The examples use the following information:
    Number of records of student: nstudent = 5,000.
    Number of blocks of student: bstudent = 100.
    Number of records of takes: ntakes = 10,000.
    Number of blocks of takes: btakes = 400.

Nested-Loop Join

To compute the theta join r ⋈_θ s:

    for each tuple tr in r do begin
        for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition θ
            if they do, add tr · ts to the result
        end
    end

r is called the outer relation and s the inner relation of the join.
The algorithm requires no indices and can be used with any kind of join condition.
It is expensive, since it examines every pair of tuples in the two relations.

In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is nr * bs + br block transfers, plus nr + br seeks.
If the smaller relation fits entirely in memory, use that as the inner relation. This reduces the cost to br + bs block transfers and 2 seeks.

Assuming worst-case memory availability, the cost estimate is:
    with student as the outer relation: 5000 * 400 + 100 = 2,000,100 block transfers and 5000 + 100 = 5100 seeks;
    with takes as the outer relation: 10000 * 100 + 400 = 1,000,400 block transfers and 10000 + 400 = 10,400 seeks.
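The worst-case nested-loop figures above can be reproduced with a few lines of arithmetic. This is a small Python sketch; the function name nested_loop_cost is hypothetical, and the formulas are the ones stated in the text (nr * bs + br block transfers, nr + br seeks).

```python
def nested_loop_cost(n_outer, b_outer, b_inner):
    """Worst-case nested-loop join cost: (block transfers, seeks)."""
    transfers = n_outer * b_inner + b_outer
    seeks = n_outer + b_outer
    return transfers, seeks


# Statistics from the running example.
n_student, b_student = 5000, 100
n_takes, b_takes = 10000, 400

student_outer = nested_loop_cost(n_student, b_student, b_takes)
takes_outer = nested_loop_cost(n_takes, b_takes, b_student)

print("student as outer:", student_outer)
print("takes as outer:  ", takes_outer)
```

Using student as the outer relation costs more block transfers but fewer seeks, so which order is cheaper depends on the relative cost of a seek and a block transfer.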

If the smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers.
The block nested-loop algorithm, described next, is preferable.

Block Nested-Loop Join

A variant of nested-loop join in which every block of the inner relation is paired with every block of the outer relation.

    for each block Br of r do begin
        for each block Bs of s do begin
            for each tuple tr in Br do begin
                for each tuple ts in Bs do begin
                    check if (tr, ts) satisfy the join condition
                    if they do, add tr · ts to the result
                end
            end
        end
    end

Worst-case estimate: br * bs + br block transfers + 2 * br seeks. Each block in the inner relation s is read once for each block in the outer relation.
Best case: br + bs block transfers + 2 seeks.

Improvements to the nested-loop and block nested-loop algorithms:
    In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output. Cost = ⌈br / (M - 2)⌉ * bs + br block transfers + 2 ⌈br / (M - 2)⌉ seeks.
    If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match.
    Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement).
    Use an index on the inner relation if available.

Indexed Nested-Loop Join

Index lookups can replace file scans if the join is an equi-join or natural join and an index is available on the inner relation's join attribute. An index can also be constructed just to compute a join.
For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
Worst case: the buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
Cost of the join: br (tT + tS) + nr * c, where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r. c can be estimated as the cost of a single selection on s using the join condition.
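The block nested-loop pseudocode above translates almost line for line into Python. The sketch below assumes that a relation is represented as a list of blocks and a block as a list of tuples; the names block_nested_loop_join and block_nested_loop_cost, and the sample data, are illustrative only. The cost function applies the worst-case formula from the text to the student and takes statistics given earlier.

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Join two relations stored as lists of blocks (lists of tuples)."""
    result = []
    for Br in r_blocks:              # each block of the outer relation
        for Bs in s_blocks:          # each block of the inner relation
            for tr in Br:
                for ts in Bs:
                    if theta(tr, ts):
                        result.append(tr + ts)  # concatenate the two tuples
    return result


def block_nested_loop_cost(b_outer, b_inner):
    """Worst-case cost: (block transfers, seeks)."""
    return b_outer * b_inner + b_outer, 2 * b_outer


# Made-up sample data: (ID, name) and (ID, course_id).
student_blocks = [[(1, "Ann"), (2, "Bob")], [(3, "Cy")]]
takes_blocks = [[(1, "CS-101"), (3, "BIO-101")], [(1, "CS-190")]]

joined = block_nested_loop_join(
    student_blocks, takes_blocks, lambda tr, ts: tr[0] == ts[0])
print(joined)

# student (100 blocks) as outer, takes (400 blocks) as inner.
print(block_nested_loop_cost(100, 400))
```

Compared with the plain nested-loop join, the inner relation is read once per outer block rather than once per outer tuple, which is where the saving in block transfers comes from.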

If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation.

Example of Nested-Loop Join Costs

Compute student ⋈ takes, with student as the outer relation.
Let takes have a primary B+-tree index on the attribute ID, which contains 20 entries in each index node. Since takes has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data.
student has 5000 tuples.
Cost of block nested-loop join: 400 * 100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks, assuming worst-case memory. The cost may be significantly less with more memory.
Cost of indexed nested-loop join: 100 + 5000 * 5 = 25,100 block transfers and seeks. The CPU cost is likely to be less than that for block nested-loop join.

Merge-Join

Sort both relations on their join attribute (if not already sorted on the join attributes).
Merge the sorted relations to join them. The join step is similar to the merge stage of the sort-merge algorithm. The main difference is the handling of duplicate values in the join attribute: every pair with the same value on the join attribute must be matched.

Fig: Sorted relations for merge join

In the merge-join algorithm, JoinAttrs refers to the attributes in R ∩ S, and tr ⋈ ts, where tr and ts are tuples that have the same values for JoinAttrs, denotes the concatenation of the attributes of the two tuples, followed by projecting out repeated attributes.

Merge join can be used only for equi-joins and natural joins.
Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory).
Thus the cost of merge join is: br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks, plus the cost of sorting if the relations are unsorted.

Hybrid merge-join: applicable if one relation is sorted and the other has a secondary B+-tree index on the join attribute.
    Merge the sorted relation with the leaf entries of the B+-tree.
    Sort the result on the addresses of the unsorted relation's tuples.
    Scan the unsorted relation in physical address order and merge with the previous result, to replace addresses by the actual tuples. A sequential scan is more efficient than random lookups.

Hash-Join

Applicable for equi-joins and natural joins.
A hash function h is used to partition the tuples of both relations.
h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
r0, r1, ..., rn denote partitions of r tuples.


Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
s0, s1, ..., sn denote partitions of s tuples. Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
Note: In the book, ri is denoted as Hri, si is denoted as Hsi, and n is denoted as nh.

Fig: Hash partitioning of relations

r tuples in ri need only be compared with s tuples in si. They need not be compared with s tuples in any other partition, since an r tuple and an s tuple that satisfy the join condition have the same value for the join attributes. If that value is hashed to some value i, the r tuple has to be in ri and the s tuple in si.

Hash-Join Algorithm

    /* Partition s */
    for each tuple ts in s do begin
        i := h(ts[JoinAttrs]);
        Hsi := Hsi ∪ {ts};
    end
    /* Partition r */
    for each tuple tr in r do begin
        i := h(tr[JoinAttrs]);
        Hri := Hri ∪ {tr};
    end
    /* Perform join on each partition */
    for i := 0 to nh do begin
        read Hsi and build an in-memory hash index on it;
        for each tuple tr in Hri do begin
            probe the hash index on Hsi to locate all tuples ts
                such that ts[JoinAttrs] = tr[JoinAttrs];
            for each matching tuple ts in Hsi do begin
                add tr ⋈ ts to the result;
            end
        end
    end

The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i:
    (a) Load si into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one, h.
    (b) Read the tuples in ri from the disk one by one. For each tuple tr, locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes.

Relation s is called the build input and r is called the probe input.

The value n and the hash function h are chosen such that each si fits in memory. Typically n is chosen as ⌈bs / M⌉ * f, where f is a fudge factor, typically around 1.2. The probe relation partitions ri need not fit in memory.

Recursive partitioning is required if the number of partitions n is greater than the number of pages M of memory.
    Instead of partitioning n ways, use M - 1 partitions for s.
    Further partition the M - 1 partitions using a different hash function.
    Use the same partitioning method on r.
    This is rarely required: e.g., with a block size of 4 KB, recursive partitioning is not needed for relations of < 1 GB with a memory size of 2 MB, or for relations of < 36 GB with a memory size of 12 MB.

Handling of Overflows

Partitioning is said to be skewed if some partitions have significantly more tuples than others.
Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be:
    Many tuples in s with the same value for the join attributes.
    A bad hash function.
Overflow resolution can be done in the build phase: partition si is further partitioned using a different hash function, and partition ri must be similarly partitioned.
Overflow avoidance performs the partitioning carefully to avoid overflows during the build phase. E.g., partition the build relation into many partitions, then combine them.
Both approaches fail with large numbers of duplicates. Fallback option: use block nested-loop join on the overflowed partitions.

Cost of Hash-Join

If recursive partitioning is not required, hash join needs about 3(br + bs) block transfers: each relation is read and written once during partitioning, and each partition is read once more in the build and probe phases. Partially filled partition blocks add a small overhead on top of this figure.
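The partition, build and probe steps of the hash-join algorithm described above can be sketched in Python. The sketch works entirely in memory and makes several assumptions that are not in the notes: tuples are dicts, partitions are numbered 0 to n - 1, Python's built-in hash stands in for the partitioning function h, and a dict keyed on the join attribute stands in for the in-memory hash index (it does its own hashing, separate from the modulo-n partitioning). All function and variable names are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Split rows into n partitions using h(v) = hash(v) mod n."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def hash_join(r, s, key, n=4):
    """Equi-join r and s on `key`; s is the build input, r the probe input."""
    Hs = partition(s, key, n)
    Hr = partition(r, key, n)
    result = []
    for i in range(n):
        # Build phase: in-memory hash index on partition i of s.
        index = defaultdict(list)
        for ts in Hs.get(i, []):
            index[ts[key]].append(ts)
        # Probe phase: look up each tuple of partition i of r.
        for tr in Hr.get(i, []):
            for ts in index.get(tr[key], []):
                result.append({**tr, **ts})
    return result


# Made-up sample data for illustration.
students = [{"ID": 1, "name": "Ann"}, {"ID": 2, "name": "Bob"},
            {"ID": 3, "name": "Cy"}]
takes = [{"ID": 1, "course": "CS-101"}, {"ID": 1, "course": "CS-190"},
         {"ID": 3, "course": "BIO-101"}]

joined = hash_join(students, takes, "ID")
print(sorted((t["ID"], t["name"], t["course"]) for t in joined))
```

Because matching tuples always land in partitions with the same number, each r partition is probed against exactly one s partition, which is what keeps the number of comparisons small.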


Hybrid Hash Join

Useful when memory sizes are relatively large, and the build input is bigger than memory.
Main feature of hybrid hash join: keep the first partition of the build relation in memory.
E.g., with a memory size of 25 blocks, instructor can be partitioned into five partitions, each of size 20 blocks.
Division of memory: the first partition occupies 20 blocks of memory, 1 block is used for input, and 1 block each is used for buffering the other 4 partitions.
teaches is similarly partitioned into five partitions, each of size 80 blocks; the first is used right away for probing, instead of being written out.
Cost: 3(80 + 320) + 20 + 80 = 1300 block transfers for hybrid hash join, instead of 1500 with plain hash-join. The in-memory partitions (20 + 80 blocks) are read only once, while every other block is read, written, and read again.
Hybrid hash-join is most useful if M >> √bs.

6. OTHER OPERATIONS

Duplicate elimination can be implemented via hashing or sorting.
    On sorting, duplicates come adjacent to each other, and all but one of each set of duplicates can be deleted.
    Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.
    Hashing is similar: duplicates come into the same bucket.

Projection: perform the projection on each tuple, followed by duplicate elimination.

Aggregation

Aggregation can be implemented in a manner similar to duplicate elimination. Sorting or hashing can be used to bring tuples in the same group together, and then the aggregate functions can be applied on each group.
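The hashing approach to duplicate elimination and aggregation can be illustrated with Python's built-in hash-based containers. This is a minimal in-memory sketch: the functions distinct and group_sum and the sample rows are invented for illustration, with a set and a dict standing in for the hash buckets described above.

```python
from collections import defaultdict


def distinct(rows):
    """Hash-based duplicate elimination: equal tuples hash to the same entry."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def group_sum(rows, key_idx, val_idx):
    """Hash-based aggregation: tuples of one group meet in one dict entry."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_idx]] += row[val_idx]
    return dict(totals)


# Made-up (dept_name, salary) tuples; one tuple appears twice.
rows = [("Physics", 90000), ("Music", 40000),
        ("Physics", 90000), ("Physics", 87000)]

print(distinct(rows))
print(group_sum(rows, key_idx=0, val_idx=1))
```

Note that the aggregate is computed over the original multiset, so the duplicated Physics tuple is counted twice in the sum.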

Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values.
    For count, min, max, sum: keep aggregate values on the tuples found so far in the group. When combining partial aggregates for count, add up the aggregates.
    For avg, keep sum and count, and divide sum by count at the end.

Set Operations

Set operations (∪, ∩ and -) can either use a variant of merge-join after sorting, or a variant of hash-join.
E.g., set operations using hashing:
    Partition both relations using the same hash function.
    Process each partition i as follows. Using a different hashing function, build an in-memory hash index on ri. Then process si as follows:
        r ∪ s: Add tuples in si to the hash index if they are not already in it. At the end of si, add the tuples in the hash index to the result.
        r ∩ s: Output tuples in si to the result if they are already in the hash index.
        r - s: For each tuple in si, if it is in the hash index, delete it from the index. At the end of si, add the remaining tuples in the hash index to the result.

Outer Join

Outer join can be computed either as a join followed by the addition of null-padded non-participating tuples, or by modifying the join algorithms.

7. EVALUATION OF EXPRESSIONS

Alternatives for evaluating an entire expression tree:
    Materialization: generate the results of an expression whose inputs are relations or are already computed, and materialize (store) them on disk. Repeat.
    Pipelining: pass on tuples to parent operations even as an operation is being executed.

Materialization

Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
E.g., in the figure below, first compute and store the result of the lowest-level operation, then compute and store its join with instructor, and finally compute the projection on name.

Materialized evaluation is always applicable.
The cost of writing results to disk and reading them back can be quite high. Our cost formulas for operations ignore the cost of writing results to disk, so:
    Overall cost = sum of costs of individual operations + cost of writing intermediate results to disk.
Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled. This allows disk writes to overlap with computation and reduces execution time.

Pipelining

Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
E.g., in the previous expression tree, do not store the result of the lowest-level operation; instead, pass its tuples directly to the join. Similarly, do not store the result of the join; pass its tuples directly to the projection.
Much cheaper than materialization: there is no need to store a temporary relation on disk.
Pipelining may not always be possible, e.g., for sort and hash-join.
For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for the inputs to the operation.

Pipelines can be executed in two ways: demand driven and producer driven.
In demand-driven or lazy evaluation, the system repeatedly requests the next tuple from the top-level operation. Each operation requests the next tuple from its child operations as required, in order to output its next tuple. In between calls, an operation has to maintain state so that it knows what to return next.
In producer-driven or eager pipelining, operators produce tuples eagerly and pass them up to their parents. A buffer is maintained between operators: the child puts tuples into the buffer, and the parent removes tuples from the buffer. If the buffer is full, the child waits until there is space in the buffer, and then generates more tuples.
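Demand-driven (lazy) evaluation as described above maps naturally onto Python generators, where each operator pulls the next tuple from its child only when its own parent asks for one. The sketch below is an illustration under assumptions, not a database implementation: the operator names scan, select and project, the sample relation, and the pulled list used to trace which tuples were actually read are all invented here.

```python
def scan(relation, trace):
    """File-scan operator: yields tuples one at a time, recording each read."""
    for t in relation:
        trace.append(t["name"])
        yield t


def select(child, predicate):
    """Selection operator: pulls from its child until a tuple qualifies."""
    for t in child:
        if predicate(t):
            yield t


def project(child, attrs):
    """Projection operator: keeps only the listed attributes."""
    for t in child:
        yield {a: t[a] for a in attrs}


# Made-up sample relation.
instructor = [
    {"name": "A", "salary": 65000},
    {"name": "B", "salary": 90000},
    {"name": "C", "salary": 72000},
]

pulled = []
plan = project(select(scan(instructor, pulled),
                      lambda t: t["salary"] < 75000), ["name"])

first = next(plan)                 # the top-level operator asks for one tuple
pulled_after_first = list(pulled)  # only the tuples needed so far were read
rest = list(plan)                  # drain the pipeline

print(first, pulled_after_first, rest)
```

The generator's suspended frame plays the role of the state that an operation must maintain between calls, and no intermediate relation is ever stored.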

The system schedules operations that have space in their output buffer and can process more input tuples.
Alternative names: pull and push models of pipelining.

Implementation of demand-driven pipelining
Each operation is implemented as an iterator that provides the following operations:
open()
E.g., file scan: initialize the file scan; state: pointer to the beginning of the file.
E.g., merge join: sort the relations; state: pointers to the beginning of the sorted relations.
next()
E.g., file scan: output the next tuple, then advance and store the file pointer.
E.g., merge join: continue the merge from the earlier state until the next output tuple is found. Save the pointers as iterator state.
close()

Evaluation Algorithms for Pipelining
Some algorithms are not able to output results even as they get input tuples, e.g., merge join or hash join; their intermediate results are written to disk and then read back.
Algorithm variants exist that generate (at least some) results on the fly, as input tuples are read in. E.g., hybrid hash join generates output tuples even as probe-relation tuples in the in-memory partition (partition 0) are read in.
Double-pipelined join technique: hybrid hash join, modified to buffer the partition 0 tuples of both relations in memory, reading them as they become available, and to output the results of any matches between partition 0 tuples. When a new $r_0$ tuple is found, match it with the existing $s_0$ tuples, output the matches, and save it in $r_0$. Symmetrically for $s_0$ tuples.

Query optimization is the process of selecting the most efficient query-evaluation plan from among the many strategies usually possible for processing a given query, especially if the query is complex.

1. OVERVIEW
Alternative ways of evaluating a given query:
Equivalent expressions
Different algorithms for each operation
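The open()/next()/close() iterator interface described above for demand-driven pipelining can be sketched as follows. This is a minimal illustration, not any particular system's implementation: the class names Scan, Select and Project, the in-memory list standing in for a file, and the sample instructor rows are all assumptions made for the example.

```python
class Scan:
    """Leaf iterator; a Python list stands in for a file (assumption)."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0                      # state: pointer to beginning of "file"

    def next(self):
        if self.pos >= len(self.rows):
            return None                   # None signals end of input
        row = self.rows[self.pos]
        self.pos += 1                     # advance and store the pointer
        return row

    def close(self):
        self.pos = None


class Select:
    """Passes on only the child tuples that satisfy the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            row = self.child.next()       # pull from the child on demand
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


class Project:
    """Keeps only the named attributes of each child tuple."""
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def open(self):
        self.child.open()

    def next(self):
        row = self.child.next()
        return None if row is None else {c: row[c] for c in self.columns}

    def close(self):
        self.child.close()


def run(plan):
    """Repeatedly request the next tuple from the top-level operation."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


# Hypothetical data for: select salary from instructor where salary < 75000
instructor = [
    {"name": "Wu", "salary": 90000},
    {"name": "Mozart", "salary": 40000},
    {"name": "Gold", "salary": 87000},
]
plan = Project(Select(Scan(instructor), lambda r: r["salary"] < 75000), ["salary"])
result = run(plan)
```

No intermediate relation is stored here: each call to next() on the projection pulls exactly as many tuples from the selection and the scan as it needs to produce one output tuple.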

An evaluation plan defines exactly which algorithm is used for each operation and how the execution of the operations is coordinated.
The cost difference between evaluation plans for a query can be enormous, e.g., seconds vs. days in some cases.
Steps in cost-based query optimization:
1. Generate logically equivalent expressions using equivalence rules.
2. Annotate the resulting expressions to get alternative query plans.
3. Choose the cheapest plan based on estimated cost.
Estimation of plan cost is based on:
Statistical information about relations, e.g., number of tuples, number of distinct values for an attribute.
Statistics estimation for intermediate results, to compute the cost of complex expressions.
Cost formulae for algorithms, computed using statistics.

2. TRANSFORMATION OF RELATIONAL EXPRESSIONS
Two relational-algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance. Note: the order of tuples is irrelevant.
In SQL, inputs and outputs are multisets of tuples. Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.

Equivalence Rules
An equivalence rule says that expressions of two forms are equivalent. We can replace an expression of the first form by an expression of the second form, or vice versa, since the two expressions generate the same result on any valid database. The optimizer uses equivalence rules to transform expressions into other logically equivalent expressions.
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. This transformation is referred to as a cascade of σ.
$\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative.
$\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
3. Only the final operation in a sequence of projection operations is needed; the others can be omitted. This transformation can also be referred to as a cascade of π.
$\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$. This expression is just the definition of the theta join.
b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$
5. Theta-join operations are commutative.
$E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
Actually, the order of attributes differs between the left-hand side and the right-hand side, so the equivalence does not hold if the order of attributes is taken into account. A projection operation can be added to one of the sides of the equivalence to appropriately reorder attributes, but for simplicity we omit the projection and ignore the attribute order in most of our examples. Recall that the natural-join operator is simply a special case of the theta-join operator; hence, natural joins are also commutative.
6. a. Natural-join operations are associative.
$(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
b. Theta joins are associative in the following manner:
$(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$
where $\theta_2$ involves attributes from only $E_2$ and $E_3$. Any of these conditions may be empty; hence, it follows that the Cartesian product ($\times$) operation is also associative. The commutativity and associativity of join operations are important for join reordering in query optimization.
7. The selection operation distributes over the theta-join operation under the following two conditions:
a. It distributes when all the attributes in selection condition $\theta_0$ involve only the attributes of one of the expressions (say, $E_1$) being joined.
$\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$
b. It distributes when selection condition $\theta_1$ involves only the attributes of $E_1$ and $\theta_2$ involves only the attributes of $E_2$.
$\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$

8. The projection operation distributes over the theta-join operation under the following conditions.
a. Let $L_1$ and $L_2$ be attributes of $E_1$ and $E_2$, respectively. Suppose that the join condition $\theta$ involves only attributes in $L_1 \cup L_2$. Then,
$\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$
b. Consider a join $E_1 \bowtie_{\theta} E_2$. Let $L_1$ and $L_2$ be sets of attributes from $E_1$ and $E_2$, respectively. Let $L_3$ be attributes of $E_1$ that are involved in join condition $\theta$ but are not in $L_1 \cup L_2$, and let $L_4$ be attributes of $E_2$ that are involved in join condition $\theta$ but are not in $L_1 \cup L_2$. Then,
$\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$
9. The set operations union and intersection are commutative.
$E_1 \cup E_2 = E_2 \cup E_1$
$E_1 \cap E_2 = E_2 \cap E_1$
Set difference is not commutative.
10. Set union and intersection are associative.
$(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
$(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
11. The selection operation distributes over the union, intersection, and set-difference operations.
$\sigma_P(E_1 - E_2) = \sigma_P(E_1) - \sigma_P(E_2)$
Similarly, the preceding equivalence, with $-$ replaced with either $\cup$ or $\cap$, also holds. Further:
$\sigma_P(E_1 - E_2) = \sigma_P(E_1) - E_2$
The preceding equivalence, with $-$ replaced by $\cap$, also holds, but does not hold if $-$ is replaced by $\cup$.
12. The projection operation distributes over the union operation.
$\Pi_L(E_1 \cup E_2) = (\Pi_L(E_1)) \cup (\Pi_L(E_2))$

Enumeration of Equivalent Expressions
Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression. All equivalent expressions can be generated as follows:
Repeat
    apply all applicable equivalence rules on every equivalent expression found so far
    add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated above
The above approach is very expensive in space and time. Two approaches address this:
Optimized plan generation based on transformation rules.
Special-case approach for queries with only selections, projections and joins.

Implementing Transformation Based Optimization
Space requirements are reduced by sharing common sub-expressions:

when E1 is generated from E2 by an equivalence rule, usually only the top levels of the two differ; the subtrees below are the same and can be shared using pointers, e.g., when applying join commutativity.
The same sub-expression may get generated multiple times: detect duplicate sub-expressions and share one copy.
Time requirements are reduced by not generating all expressions, using dynamic programming. We will study only the special case of dynamic programming for join order optimization.

Cost Estimation
Statistics of the input relations are needed, e.g., number of tuples, sizes of tuples.
Inputs can be results of sub-expressions, so we need to estimate the statistics of expression results. To do so, we require additional statistics, e.g., the number of distinct values for an attribute.
More on cost estimation later.

Choice of Evaluation Plans
We must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm. E.g.:
merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost of an outer-level aggregation;
nested-loop join may provide an opportunity for pipelining.
Practical query optimizers incorporate elements of the following two broad approaches:
1. Search all the plans and choose the best plan in a cost-based fashion.
2. Use heuristics to choose a plan.

Cost-Based Optimization
Consider finding the best join order for $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$.
There are $(2(n-1))!/(n-1)!$ different join orders for the above expression. With n = 7, the number is 665,280; with n = 10, the number is greater than 17.6 billion!
There is no need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots, r_n\}$ is computed only once and stored for future use.

Dynamic Programming in Optimization
To find the best join tree for a set of n relations:

To find the best plan for a set S of n relations, consider all possible plans of the form $S_1 \bowtie (S - S_1)$, where $S_1$ is any non-empty proper subset of S.
Recursively compute the costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the $2^n - 2$ alternatives.
Base case for the recursion: a single-relation access plan. Apply all selections on $R_i$ using the best choice of indices on $R_i$.
When the plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it (dynamic programming).

Join Order Optimization Algorithm
procedure findbestplan(S)
    if (bestplan[S].cost ≠ ∞)
        return bestplan[S]
    // else bestplan[S] has not been computed earlier, compute it now
    if (S contains only 1 relation)
        set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S
        /* using selections on S and indices on S */
    else for each non-empty subset S1 of S such that S1 ≠ S
        P1 = findbestplan(S1)
        P2 = findbestplan(S - S1)
        A = best algorithm for joining results of P1 and P2
        cost = P1.cost + P2.cost + cost of A
        if cost < bestplan[S].cost
            bestplan[S].cost = cost
            bestplan[S].plan = "execute P1.plan; execute P2.plan; join results of P1 and P2 using A"
    return bestplan[S]

Left Deep Join Trees
In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
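The findbestplan procedure above can be sketched in runnable form as follows. The cost model is a deliberately crude assumption made only for illustration: accessing a single relation costs nothing, joining two inputs costs the product of their row counts (as in a naive nested-loop join), and every join keeps a fixed fraction `selectivity` of the cross product. The function and parameter names are hypothetical.

```python
from itertools import combinations


def find_best_plan(sizes, selectivity=0.01):
    """Return (cost, estimated_rows, plan) for joining all relations in `sizes`.

    sizes maps a relation name to its tuple count. The plan is a nested
    tuple of relation names. Toy cost model (assumption): cost of a join
    is rows(left) * rows(right); result rows are that product * selectivity.
    """
    best = {}  # memo table: frozenset of relation names -> (cost, rows, plan)

    def solve(s):
        if s in best:                      # already computed: reuse it
            return best[s]
        if len(s) == 1:                    # base case: single-relation access
            (r,) = s
            best[s] = (0.0, sizes[r], r)
            return best[s]
        members = sorted(s)
        entry = None
        # Enumerate every non-empty proper subset S1 of S.
        for k in range(1, len(members)):
            for left in combinations(members, k):
                s1 = frozenset(left)
                c1, n1, p1 = solve(s1)
                c2, n2, p2 = solve(s - s1)
                cost = c1 + c2 + n1 * n2   # cost of both inputs plus the join
                rows = n1 * n2 * selectivity
                if entry is None or cost < entry[0]:
                    entry = (cost, rows, (p1, p2))
        best[s] = entry
        return entry

    return solve(frozenset(sizes))


cost, rows, plan = find_best_plan({"r1": 1000, "r2": 10, "r3": 100})
```

With these made-up sizes the sketch joins the two small relations first and brings in the large one last, which is the behaviour the memoized recursion is meant to find without enumerating every join order separately.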

Cost of Optimization
With dynamic programming, the time complexity of optimization with bushy trees is $O(3^n)$. With n = 10, this number is about 59,000 ($3^{10} = 59{,}049$) instead of 17.6 billion! Space complexity is $O(2^n)$.
To find the best left-deep join tree for a set of n relations, consider n alternatives, each with one relation as the right-hand-side input and the other relations as the left-hand-side input. Modify the optimization algorithm: replace "for each non-empty subset S1 of S such that S1 ≠ S" by "for each relation r in S, let S1 = S - r".
If only left-deep trees are considered, the time complexity of finding the best join order is $O(n 2^n)$. Space complexity remains at $O(2^n)$.
Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10).

Interesting Sort Orders
Consider the expression $(r_1 \bowtie r_2) \bowtie r_3$ (with A as the common attribute).
An interesting sort order is a particular sort order of tuples that could be useful for a later operation.
Using merge-join to compute $r_1 \bowtie r_2$ may be costlier than hash join, but it generates a result sorted on A. This in turn may make the merge-join with $r_3$ cheaper, which may reduce the overall cost.
A sort order may also be useful for order by and for grouping.
It is not sufficient to find the best join order for each subset of the set of n given relations; we must find the best join order for each subset, for each interesting sort order.
This is a simple extension of the earlier dynamic programming algorithms.

Usually, the number of interesting orders is quite small and doesn't affect time/space complexity significantly.

Heuristic Optimization
Cost-based optimization is expensive, even with dynamic programming. Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
Perform selection early (reduces the number of tuples).
Perform projection early (reduces the number of attributes).
Perform the most restrictive selection and join operations (i.e., those with the smallest result size) before other similar operations.
Some systems use only heuristics; others combine heuristics with partial cost-based optimization.

Structure of Query Optimizers
Many optimizers consider only left-deep join orders, plus heuristics to push selections and projections down the query tree. This reduces optimization complexity and generates plans amenable to pipelined evaluation.
Heuristic optimization used in some versions of Oracle: repeatedly pick the best relation to join next, starting from each of n starting points, and pick the best among these.
Intricacies of SQL complicate query optimization, e.g., nested subqueries.
Some query optimizers integrate heuristic selection and the generation of alternative access plans. A frequently used approach is heuristic rewriting of nested block structure and aggregation, followed by cost-based join-order optimization for each block.
Some optimizers (e.g., SQL Server) apply transformations to the entire query and do not depend on block structure.
Even with the use of heuristics, cost-based query optimization imposes a substantial overhead, but it is worth it for expensive queries. Optimizers often use simple heuristics for very cheap queries and perform exhaustive enumeration for more expensive queries.

3. ESTIMATING STATISTICS OF EXPRESSION RESULTS
Statistical Information for Cost Estimation
$n_r$: number of tuples in a relation r.
$b_r$: number of blocks containing tuples of r.
$l_r$: size of a tuple of r.
$f_r$: blocking factor of r, i.e., the number of tuples of r that fit into one block.
$V(A, r)$: number of distinct values that appear in r for attribute A; same as the size of $\Pi_A(r)$.


If tuples of r are stored together physically in a file, then:
$b_r = \lceil n_r / f_r \rceil$

Histograms
[Figure: histogram on attribute age of relation person]
Equi-width histograms
Equi-depth histograms

Selection Size Estimation
$\sigma_{A=v}(r)$: $n_r / V(A, r)$ is the number of records that will satisfy the selection. For an equality condition on a key attribute, the size estimate is 1.
$\sigma_{A \le v}(r)$ (the case of $\sigma_{A \ge v}(r)$ is symmetric): let c denote the estimated number of tuples satisfying the condition. If $\min(A, r)$ and $\max(A, r)$ are available in the catalog:
$c = 0$ if $v < \min(A, r)$
$c = n_r \cdot \dfrac{v - \min(A, r)}{\max(A, r) - \min(A, r)}$ otherwise
If histograms are available, the above estimate can be refined. In the absence of statistical information, c is assumed to be $n_r / 2$.

Size Estimation of Complex Selections
The selectivity of a condition $\theta_i$ is the probability that a tuple in the relation r satisfies $\theta_i$. If $s_i$ is the number of satisfying tuples in r, the selectivity of $\theta_i$ is given by $s_i / n_r$.
Conjunction: $\sigma_{\theta_1 \wedge \theta_2 \wedge \ldots \wedge \theta_n}(r)$. Assuming independence, the estimated number of tuples in the result is:
$n_r \cdot \dfrac{s_1 \cdot s_2 \cdot \ldots \cdot s_n}{n_r^n}$
Disjunction: $\sigma_{\theta_1 \vee \theta_2 \vee \ldots \vee \theta_n}(r)$. Estimated number of tuples:
$n_r \cdot \left(1 - \left(1 - \dfrac{s_1}{n_r}\right) \cdot \left(1 - \dfrac{s_2}{n_r}\right) \cdot \ldots \cdot \left(1 - \dfrac{s_n}{n_r}\right)\right)$
Negation: $\sigma_{\neg\theta}(r)$. Estimated number of tuples:
$n_r - \mathrm{size}(\sigma_{\theta}(r))$
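The selection-size formulas above can be checked numerically with a small sketch. The function names and the sample numbers are hypothetical; returning $n_r$ when v is at or above the maximum is an added assumption (a simple clamp) that the text above does not spell out.

```python
import math


def eq_selection_size(n_r, distinct_values):
    """Estimate for sigma_{A=v}(r): n_r / V(A, r)."""
    return n_r / distinct_values


def range_selection_size(n_r, v, min_a, max_a):
    """Estimate for sigma_{A<=v}(r), assuming values spread uniformly."""
    if v < min_a:
        return 0.0
    if v >= max_a:            # assumption: clamp to the whole relation
        return float(n_r)
    return n_r * (v - min_a) / (max_a - min_a)


def conjunction_size(n_r, sizes):
    """n_r * (s_1 * ... * s_n) / n_r^n, assuming independent conditions."""
    estimate = float(n_r)
    for s in sizes:
        estimate *= s / n_r
    return estimate


def disjunction_size(n_r, sizes):
    """n_r * (1 - product of (1 - s_i / n_r))."""
    none_satisfied = 1.0
    for s in sizes:
        none_satisfied *= 1 - s / n_r
    return n_r * (1 - none_satisfied)


def negation_size(n_r, s):
    """n_r - size(sigma_theta(r))."""
    return n_r - s


# Made-up relation with 1000 tuples; conditions satisfied by 100 and 500 tuples.
n_r = 1000
both = conjunction_size(n_r, [100, 500])
either = disjunction_size(n_r, [100, 500])
```

Here the conjunction estimate multiplies the selectivities 0.1 and 0.5, while the disjunction estimate subtracts from 1 the probability that a tuple fails both conditions.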

Running example: depositor $\bowtie$ customer
Catalog information for join examples:
$n_{customer}$ = 10,000.
$f_{customer}$ = 25, which implies that $b_{customer}$ = 10000/25 = 400.
$n_{depositor}$ = 5000.
$f_{depositor}$ = 50, which implies that $b_{depositor}$ = 5000/50 = 100.
V(customer_name, depositor) = 2500, which implies that, on average, each customer has two accounts.
Also assume that customer_name in depositor is a foreign key on customer.
V(customer_name, customer) = 10000 (primary key!)

Estimation of the Size of Joins
The Cartesian product $r \times s$ contains $n_r \cdot n_s$ tuples; each tuple occupies $s_r + s_s$ bytes.
If $R \cap S = \emptyset$, then $r \bowtie s$ is the same as $r \times s$.
If $R \cap S$ is a key for R, then a tuple of s will join with at most one tuple from r; therefore, the number of tuples in $r \bowtie s$ is no greater than the number of tuples in s.
If $R \cap S$ is a foreign key in S referencing R, then the number of tuples in $r \bowtie s$ is exactly the same as the number of tuples in s. The case for $R \cap S$ being a foreign key referencing S is symmetric.
In the example query depositor $\bowtie$ customer, customer_name in depositor is a foreign key of customer; hence, the result has exactly $n_{depositor}$ tuples, which is 5000.
If $R \cap S = \{A\}$ is not a key for R or S: if we assume that every tuple t in R produces tuples in $R \bowtie S$, the number of tuples in $R \bowtie S$ is estimated to be:
$\dfrac{n_r \cdot n_s}{V(A, s)}$
If the reverse is true, the estimate obtained will be:
$\dfrac{n_r \cdot n_s}{V(A, r)}$
The lower of these two estimates is probably the more accurate one.
The estimate can be improved if histograms are available: use a formula similar to the above for each cell of the histograms on the two relations.
Compute the size estimates for depositor $\bowtie$ customer without using information about foreign keys:
V(customer_name, depositor) = 2500, and V(customer_name, customer) = 10000.
The two estimates are 5000 × 10000/2500 = 20,000 and 5000 × 10000/10000 = 5000.
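The two join-size estimates for the running example can be reproduced with a short sketch. The function name and return shape are hypothetical; the catalog numbers are the ones given for depositor and customer.

```python
def join_size_estimates(n_r, n_s, v_a_r, v_a_s):
    """Return the two estimates n_r*n_s/V(A,s) and n_r*n_s/V(A,r) and their minimum."""
    using_s = n_r * n_s / v_a_s
    using_r = n_r * n_s / v_a_r
    return using_r, using_s, min(using_r, using_s)


# r = depositor, s = customer, A = customer_name
using_depositor, using_customer, chosen = join_size_estimates(
    n_r=5000, n_s=10000, v_a_r=2500, v_a_s=10000
)
```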

We choose the lower estimate, which in this case is the same as our earlier computation using foreign keys.

Size Estimation for Other Operations
Projection: estimated size of $\Pi_A(r)$ = V(A, r).
Aggregation: estimated size of ${}_A\mathcal{G}_F(r)$ = V(A, r).
Set operations:
For unions/intersections of selections on the same relation, rewrite and use the size estimate for selections. E.g., $\sigma_{\theta_1}(r) \cup \sigma_{\theta_2}(r)$ can be rewritten as $\sigma_{\theta_1 \vee \theta_2}(r)$.
For operations on different relations:
estimated size of $r \cup s$ = size of r + size of s.
estimated size of $r \cap s$ = minimum of size of r and size of s.
estimated size of $r - s$ = size of r.
All three estimates may be quite inaccurate, but they provide upper bounds on the sizes.

Estimation of Number of Distinct Values
Selections: $\sigma_{\theta}(r)$
If $\theta$ forces A to take a specified value (e.g., A = 3): $V(A, \sigma_{\theta}(r)) = 1$.
If $\theta$ forces A to take on one of a specified set of values (e.g., A = 1 ∨ A = 3 ∨ A = 4): $V(A, \sigma_{\theta}(r))$ = number of specified values.
If the selection condition $\theta$ is of the form A op v: estimated $V(A, \sigma_{\theta}(r)) = V(A, r) \cdot s$, where s is the selectivity of the selection.
In all other cases: use the approximate estimate $\min(V(A, r), n_{\sigma_{\theta}(r)})$.
A more accurate estimate can be obtained using probability theory, but this one generally works fine.


Estimation of distinct values is straightforward for projections: they are the same in $\Pi_A(r)$ as in r.
The same holds for grouping attributes of aggregation.
For aggregated values:
For min(A) and max(A), the number of distinct values can be estimated as min(V(A, r), V(G, r)), where G denotes the grouping attributes.
For other aggregates, assume all values are distinct, and use V(G, r).

4. CHOICE OF EVALUATION PLANS
A cost-based optimizer explores the space of all query-evaluation plans that are equivalent to the given query, and chooses the one with the least estimated cost.

Cost-Based Join Order Selection
The most common type of query in SQL consists of a join of a few relations, with join predicates and selections specified in the where clause. In this section we consider the problem of choosing the optimal join order for such a query.
For a complex join query, the number of different query plans that are equivalent to the query can be large. As before, consider the expression $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$.
We can develop a dynamic-programming algorithm for finding optimal join orders.

procedure FindBestPlan(S)
    if (bestplan[S].cost ≠ ∞) /* bestplan[S] already computed */
        return bestplan[S]
    if (S contains only 1 relation)
        set bestplan[S].plan and bestplan[S].cost based on best way of accessing S
    else for each non-empty subset S1 of S such that S1 ≠ S
        P1 = FindBestPlan(S1)
        P2 = FindBestPlan(S - S1)
        A = best algorithm for joining results of P1 and P2
        cost = P1.cost + P2.cost + cost of A
        if cost < bestplan[S].cost
            bestplan[S].cost = cost
            bestplan[S].plan = "execute P1.plan; execute P2.plan; join results of P1 and P2 using A"
    return bestplan[S]
Figure: Dynamic-programming algorithm for join order optimization.

A particular sort order of the tuples is said to be an interesting sort order if it could be useful for a later operation.

Cost-Based Optimization with Equivalence Rules
The join order optimization technique we just saw handles the most common class of queries, which perform an inner join of a set of relations. However, many queries clearly use other features, such as aggregation, outer join, and nested queries, which are not addressed by join order selection.
The benefit of using equivalence rules is that it is easy to extend the optimizer with new rules to handle different query constructs.
The procedure for generating equivalent expressions can be modified to generate all possible evaluation plans as follows: a new class of equivalence rules, called physical equivalence rules, is added that allows a logical operation, such as a join, to be transformed to a physical operation, such as a hash join or a nested-loops join. By adding such rules to the original set of equivalence rules, the procedure can generate all possible evaluation plans.
To make the approach work efficiently requires the following:
1. A space-efficient representation of expressions that avoids making multiple copies of the same subexpressions when equivalence rules are applied.
2. Efficient techniques for detecting duplicate derivations of the same expression.
3. A form of dynamic programming based on memoization, which stores the optimal query-evaluation plan for a subexpression when it is optimized for the first time; subsequent requests to optimize the same subexpression are handled by returning the already memoized plan.
4. Techniques that avoid generating all possible equivalent plans, by keeping track of the cheapest plan generated for any subexpression up to any point of time, and pruning away any plan that is more expensive than the cheapest plan found so far for that subexpression.

Heuristics in Optimization
A drawback of cost-based optimization is the cost of optimization itself. Although the cost of query optimization can be reduced by clever algorithms, the number of different evaluation plans for a query can be very large, and finding the optimal plan from this set requires a lot of computational effort. Hence, optimizers use heuristics to reduce the cost of optimization.
An example of a heuristic rule is the following rule for transforming relational-algebra queries:
Perform selection operations as early as possible.
Perform projections early.

The cost estimation techniques we have seen earlier can then be used to choose the optimal (that is, the least-cost) plan.
The System R optimizer considers only those join orders where the right operand of each join is one of the initial relations r1, ..., rn. Such join orders are called left-deep join orders. Left-deep join orders are particularly convenient for pipelined evaluation, since the right operand is a stored relation, and thus only one input to each join is pipelined.
Most optimizers allow a cost budget to be specified for query optimization. The search for the optimal plan is terminated when the optimization cost budget is exceeded, and the best plan found up to that point is returned.
Caching and reuse of query plans is referred to as plan caching.

Optimizing Nested Subqueries
SQL conceptually treats nested subqueries in the where clause as functions that take parameters and return either a single value or a set of values. The parameters are the variables from an outer-level query that are used in the nested subquery (these variables are called correlation variables).
The technique of evaluating a query with a nested subquery in this way is called correlated evaluation. Correlated evaluation is not very efficient, since the subquery is separately evaluated for each tuple in the outer-level query. A large number of random disk I/O operations may result.
The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.

TRANSACTIONS
Collections of operations that form a single logical unit of work are called transactions.

TRANSACTION CONCEPT
A transaction is a unit of program execution that accesses and possibly updates various data items. A transaction is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and end transaction.
E.g., a transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Two main issues to deal with:
Failures of various kinds, such as hardware failures and system crashes.
Concurrent execution of multiple transactions.

Atomicity requirement: if the transaction fails after step 3 and before step 6, money will be lost, leading to an inconsistent database state. The failure could be due to software or hardware. The system should ensure that updates of a partially executed transaction are not reflected in the database.
Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist even if there are software or hardware failures.
Consistency requirement: in the above example, the sum of A and B is unchanged by the execution of the transaction. In general, consistency requirements include:
Explicitly specified integrity constraints such as primary keys and foreign keys.
Implicit integrity constraints, e.g., the sum of balances of all accounts, minus the sum of loan amounts, must equal the value of cash-in-hand.
A transaction must see a consistent database. During transaction execution the database may be temporarily inconsistent. When the transaction completes successfully the database must be consistent. Erroneous transaction logic can lead to inconsistency.
Isolation requirement: if between steps 3 and 6 another transaction T2 is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).
T1                      T2
1. read(A)
2. A := A - 50
3. write(A)
                        read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
Isolation can be ensured trivially by running transactions serially, that is, one after the other.
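The isolation problem in the schedule above can be made concrete with a small simulation. This is only an illustrative sketch: the dictionary standing in for the database, the starting balances of $1000 and $2000, and the function names are assumptions, and no real concurrency is involved; T2's read is simply placed after a chosen step of T1.

```python
def transfer_steps(db, amount):
    """T1 as a generator: yields once after each of its six steps."""
    a = db["A"]          # 1. read(A)
    yield
    a = a - amount       # 2. A := A - amount
    yield
    db["A"] = a          # 3. write(A)
    yield
    b = db["B"]          # 4. read(B)
    yield
    b = b + amount       # 5. B := B + amount
    yield
    db["B"] = b          # 6. write(B)
    yield


def run_schedule(read_after_step):
    """Run T1, letting T2 read A and B right after the given step of T1.

    Returns (sum seen by T2, final database state).
    """
    db = {"A": 1000, "B": 2000}
    seen = None
    for step, _ in enumerate(transfer_steps(db, 50), start=1):
        if step == read_after_step:
            seen = db["A"] + db["B"]   # T2: read(A), read(B), print(A+B)
    return seen, db


interleaved_sum, _ = run_schedule(read_after_step=3)   # T2 runs between steps 3 and 4
serial_sum, final_db = run_schedule(read_after_step=6) # T2 runs after T1 finishes
```

When T2 reads after step 3 it observes a sum that is $50 short, whereas reading after step 6 (a serial order) observes the full amount.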

However, executing multiple transactions concurrently has significant benefits, as we will see later.

Properties of the transactions: ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

A SIMPLE TRANSACTION MODEL
Transactions access data using two operations:
read(X), which transfers the data item X from the database to a variable, also called X, in a buffer in main memory belonging to the transaction that executed the read operation.
write(X), which transfers the value in the variable X in the main-memory buffer of the transaction that executed the write to the data item X in the database.
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as:
Ti: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B).
Let us now consider each of the ACID properties. (For ease of presentation, we consider them in an order different from the order A-C-I-D.)
Consistency: The consistency requirement here is that the sum of A and B be unchanged by the execution of the transaction.
Without the consistency requirement, money could be created or destroyed by the transaction! It can be verified easily that, if the database is consistent before an execution of the transaction, the database remains consistent after the execution of the transaction. Ensuring consistency for an individual transaction is the responsibility of the application programmer who codes the transaction. This task may be facilitated by automatic testing of integrity constraints.
Atomicity: Suppose that, just before the execution of transaction Ti, the values of accounts A and B are $1000 and $2000, respectively. Now suppose that, during the execution of transaction Ti, a failure occurs that prevents Ti from completing its execution successfully. Further, suppose that the failure happened after the write(A) operation but before the write(B) operation. In this case, the values of accounts A and B reflected in the database are $950 and $2000. The system destroyed $50 as a result of this failure. In particular, we note that the sum A + B is no longer preserved. Thus, because of the failure, the state of the system no longer reflects a real state of the world that the database is supposed to capture. We term such a state an inconsistent state. We must ensure that such inconsistencies are not visible in a database system. Ensuring atomicity is the responsibility of the database system; specifically, it is handled by a component of the database called the recovery system.
Durability: Once the execution of the transaction completes successfully, and the user who initiated the transaction has been notified that the transfer of funds has taken place, it must be the case that no system failure can result in a loss of data corresponding to this transfer of funds. The durability property guarantees that, once a transaction completes successfully, all the updates that it carried out on the database persist, even if there is a system failure after the transaction completes execution. We assume for now that a failure of the computer system may result in loss of data in main memory, but data written to disk are never lost. We can guarantee durability by ensuring that either:
1. The updates carried out by the transaction have been written to disk before the transaction completes.
2. Information about the updates carried out by the transaction and written to disk is sufficient to enable the database to reconstruct the updates when the database system is restarted after the failure.
The recovery system of the database is responsible for ensuring durability, in addition to ensuring atomicity.
Isolation: Even if the consistency and atomicity properties are ensured for each transaction, if several transactions are executed concurrently, their operations may interleave in some undesirable way, resulting in an inconsistent state.
The isolation property of a transaction ensures that the concurrent execution of transactions results in a system state that is equivalent to a state that could have been obtained had these transactions executed one at a time in some order. Ensuring the isolation property is the responsibility of a component of the database system called the concurrency-control system.

STORAGE STRUCTURE

Storage media can be distinguished by their relative speed, capacity, and resilience to failure, and classified as volatile storage or nonvolatile storage. We review these terms and introduce another class of storage, called stable storage.

Volatile storage. Information residing in volatile storage does not usually survive system crashes. Examples of such storage are main memory and cache memory. Access to volatile storage is extremely fast, both because of the speed of the memory access itself, and because it is possible to access any data item in volatile storage directly.

Nonvolatile storage. Information residing in nonvolatile storage survives system crashes. Examples of nonvolatile storage include secondary storage devices such as magnetic disk and flash storage, used for online storage, and tertiary storage devices such as optical media and magnetic tapes, used for archival storage. At the current state of technology, nonvolatile storage is slower than volatile storage, particularly for random access. Both secondary and tertiary storage devices, however, are susceptible to failure, which may result in loss of information.

Stable storage. Information residing in stable storage is never lost. Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely. To implement stable storage, we replicate the information in several nonvolatile storage media (usually disk) with independent failure modes. Updates must be done with care to ensure that a failure during an update to stable storage does not cause a loss of information.

TRANSACTION ATOMICITY AND DURABILITY

A transaction may not always complete its execution successfully. Such a transaction is termed aborted. If we are to ensure the atomicity property, an aborted transaction must have no effect on the state of the database. Thus, any changes that the aborted transaction made to the database must be undone. Once the changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back. It is part of the responsibility of the recovery scheme to manage transaction aborts. This is typically done by maintaining a log. Each database modification made by a transaction is first recorded in the log.

A transaction that completes its execution successfully is said to be committed. A committed transaction that has performed updates transforms the database into a new consistent state, which must persist even if there is a system failure. Once a transaction has committed, we cannot undo its effects by aborting it. The only way to undo the effects of a committed transaction is to execute a compensating transaction.

We need to be more precise about what we mean by successful completion of a transaction. We therefore establish a simple abstract transaction model. A transaction must be in one of the following states:

Active, the initial state; the transaction stays in this state while it is executing.
Partially committed, after the final statement has been executed.
Failed, after the discovery that normal execution can no longer proceed.
Aborted, after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
Committed, after successful completion.

Fig: State diagram of a transaction.

A transaction is said to have terminated if it has either committed or aborted. Once a transaction has been aborted, the system has two options:

It can restart the transaction, but only if the transaction was aborted as a result of some hardware or software error that was not created through the internal logic of the transaction. A restarted transaction is considered to be a new transaction.
It can kill the transaction. It usually does so because of some internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database.

Transaction Isolation

Transaction-processing systems usually allow multiple transactions to run concurrently. There are two good reasons for allowing concurrency:

Improved throughput and resource utilization. A transaction consists of many steps. Some involve I/O activity; others involve CPU activity. The CPU and the disks in a computer system can operate in parallel, so the I/O of one transaction can proceed while another transaction runs on the CPU. This increases the throughput of the system. Correspondingly, the processor and disk utilization also increase.
Reduced waiting time. Concurrent execution also reduces the average response time: the average time for a transaction to be completed after it has been submitted.

The motivation for using concurrent execution in a database is essentially the same as the motivation for using multiprogramming in an operating system. The database system must control the interaction among the concurrent transactions to prevent them from destroying the consistency of the database. It does so through a variety of mechanisms called concurrency-control schemes.

Example Schedules

Let T1 and T2 be two transactions that transfer funds from one account to another. Transaction T1 transfers $50 from account A to account B. It is defined as:

T1: read(A); A := A - 50; write(A); read(B); B := B + 50; write(B).

Transaction T2 transfers 10 percent of the balance from account A to account B. It is defined as:

T2: read(A); temp := A * 0.1; A := A - temp; write(A); read(B); B := B + temp; write(B).
Suppose the current values of accounts A and B are $1000 and $2000, respectively. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.

Figure: Schedule 1, a serial schedule in which T1 is followed by T2.

After Schedule 1 executes, the final values of accounts A and B are $855 and $2145, respectively, so the sum A + B is preserved.

Similarly, if the transactions are executed one at a time in the order T2 followed by T1, then the corresponding execution sequence is that of Schedule 2. Again, as expected, the sum A + B is preserved, and the final values of accounts A and B are $850 and $2150, respectively.

Figure: Schedule 2, a serial schedule in which T2 is followed by T1.

Let T1 and T2 be the transactions defined previously. Schedule 3 in the text is not a serial schedule, but it is equivalent to Schedule 1. In both Schedule 1 and Schedule 3, the sum A + B is preserved.
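The two serial orders can be checked with a short simulation. This is only a sketch: the function names t1, t2 and run_serial are made up for illustration, the database is modelled as a plain dictionary, and the 10 percent transfer uses integer division so that balances stay in whole dollars.

```python
def t1(db):
    """T1: transfer $50 from account A to account B."""
    a = db["A"]        # read(A)
    a = a - 50         # A := A - 50
    db["A"] = a        # write(A)
    b = db["B"]        # read(B)
    b = b + 50         # B := B + 50
    db["B"] = b        # write(B)


def t2(db):
    """T2: transfer 10 percent of A's balance from A to B."""
    a = db["A"]        # read(A)
    temp = a // 10     # temp := A * 0.1 (whole dollars, an assumption)
    a = a - temp       # A := A - temp
    db["A"] = a        # write(A)
    b = db["B"]        # read(B)
    b = b + temp       # B := B + temp
    db["B"] = b        # write(B)


def run_serial(order, a=1000, b=2000):
    """Run the transactions one after another and return the final state."""
    db = {"A": a, "B": b}
    for transaction in order:
        transaction(db)
    return db


schedule_1 = run_serial([t1, t2])   # T1 followed by T2
schedule_2 = run_serial([t2, t1])   # T2 followed by T1

for name, result in (("Schedule 1", schedule_1), ("Schedule 2", schedule_2)):
    print(name, result, "sum =", result["A"] + result["B"])
```

The two orders end in different balances, yet both keep A + B at $3000, which is all that consistency requires here.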

Figure: Schedule 3, a concurrent schedule equivalent to Schedule 1.

The following concurrent schedule (Schedule 4 in the text) does not preserve the value of the sum A + B.

Figure: Schedule 4, a concurrent schedule resulting in an inconsistent state.

Schedules 1 and 2 are serial: each serial schedule consists of a sequence of instructions from various transactions, where the instructions belonging to one single transaction appear together in that schedule. Recalling a well-known formula from combinatorics, we note that, for a set of n transactions, there exist n factorial (n!) different valid serial schedules.

To keep the database consistent under concurrent execution, a schedule should, in some sense, be equivalent to a serial schedule. Such schedules are called serializable schedules.

Serializability

Basic assumption: each transaction preserves database consistency. Thus, serial execution of a set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:

1. conflict serializability
2. view serializability

We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.
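The difference between Schedule 3 and Schedule 4 can be reproduced by replaying interleaved steps against a shared database, with each transaction keeping its own local buffer. The sketch below is illustrative only: the figures are not reproduced in these notes, so the two interleavings are assumed reconstructions, and the helper names (read, write, local, run) are hypothetical.

```python
def read(item):
    def step(db, buf):
        buf[item] = db[item]      # copy the database value into the local buffer
    return step


def write(item):
    def step(db, buf):
        db[item] = buf[item]      # copy the local value back to the database
    return step


def local(update):
    def step(db, buf):
        update(buf)               # computation on the local buffer only
    return step


T1 = [read("A"), local(lambda b: b.update(A=b["A"] - 50)), write("A"),
      read("B"), local(lambda b: b.update(B=b["B"] + 50)), write("B")]

T2 = [read("A"), local(lambda b: b.update(temp=b["A"] // 10)),  # whole dollars
      local(lambda b: b.update(A=b["A"] - b["temp"])), write("A"),
      read("B"), local(lambda b: b.update(B=b["B"] + b["temp"])), write("B")]


def run(schedule, a=1000, b=2000):
    """schedule is a list of (transaction name, number of steps to run next)."""
    db = {"A": a, "B": b}
    pending = {"T1": iter(T1), "T2": iter(T2)}
    buffers = {"T1": {}, "T2": {}}
    for name, count in schedule:
        for _ in range(count):
            next(pending[name])(db, buffers[name])
    return db


# Assumed interleavings (the original figures are missing from these notes).
schedule_3 = [("T1", 3), ("T2", 4), ("T1", 3), ("T2", 3)]
schedule_4 = [("T1", 2), ("T2", 5), ("T1", 4), ("T2", 2)]

print("Schedule 3:", run(schedule_3))
print("Schedule 4:", run(schedule_4))
```

Under these assumptions Schedule 3 ends in the same state as Schedule 1, while in Schedule 4 both transactions read the old value of A, T1's later write(A) overwrites T2's, and the sum A + B grows by $50.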


Instructions I and J of transactions Ti and Tj, respectively, conflict if and only if there exists some item Q accessed by both I and J, and at least one of these instructions wrote Q. Consider two consecutive instructions I and J in a schedule S that refer to the same data item Q. There are four cases:

1. I = read(Q), J = read(Q). The order of I and J does not matter, since the same value of Q is read by Ti and Tj, regardless of the order.
2. I = read(Q), J = write(Q). If I comes before J, then Ti does not read the value of Q that is written by Tj in instruction J. If J comes before I, then Ti reads the value of Q that is written by Tj. Thus, the order of I and J matters.
3. I = write(Q), J = read(Q). The order of I and J matters for reasons similar to those of the previous case.
4. I = write(Q), J = write(Q). Since both instructions are write operations, the order of these instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q) instruction of S is affected, since the result of only the latter of the two write instructions is preserved in the database. If there is no other write(Q) instruction after I and J in S, then the order of I and J directly affects the final value of Q in the database state that results from schedule S.

Figure: Schedule 3, showing only the read and write instructions.

Figure: Schedule 5, schedule 3 after swapping of a pair of instructions.

In short, I and J conflict if they are operations by different transactions on the same data item, and at least one of these instructions is a write operation.
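The conflict rule can be written as a one-line predicate. The sketch below is a minimal illustration; the Op record and the conflicts function are names chosen here, not part of any library.

```python
from typing import NamedTuple


class Op(NamedTuple):
    txn: str    # transaction that issues the instruction, e.g. "T1"
    kind: str   # "read" or "write"
    item: str   # data item accessed, e.g. "Q"


def conflicts(i: Op, j: Op) -> bool:
    """True if the two instructions come from different transactions,
    touch the same data item, and at least one of them is a write."""
    return (
        i.txn != j.txn
        and i.item == j.item
        and "write" in (i.kind, j.kind)
    )


# The four cases for two instructions on the same item Q.
cases = [
    (Op("Ti", "read", "Q"), Op("Tj", "read", "Q")),
    (Op("Ti", "read", "Q"), Op("Tj", "write", "Q")),
    (Op("Ti", "write", "Q"), Op("Tj", "read", "Q")),
    (Op("Ti", "write", "Q"), Op("Tj", "write", "Q")),
]
for i, j in cases:
    print(f"{i.kind}(Q), {j.kind}(Q): conflict = {conflicts(i, j)}")
```

Only the read/read case is free of conflict; instructions on different items, or issued by the same transaction, never conflict under this definition.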


Figure: Schedule 6, a serial schedule that is equivalent to schedule 3.

Figure: Schedule 7.

We continue to swap nonconflicting instructions:

Swap the read(B) instruction of T1 with the read(A) instruction of T2.
Swap the write(B) instruction of T1 with the write(A) instruction of T2.
Swap the write(B) instruction of T1 with the read(A) instruction of T2.

The final result of these swaps, schedule 6, is a serial schedule. Schedule 6 is exactly the same as schedule 1, but it shows only the read and write instructions.

If a schedule S can be transformed into a schedule S' by a series of swaps of nonconflicting instructions, we say that S and S' are conflict equivalent. The concept of conflict equivalence leads to the concept of conflict serializability. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Thus, schedule 3 is conflict serializable, since it is conflict equivalent to the serial schedule 1.

Figure: Precedence graph for (a) schedule 1 and (b) schedule 2.

We now present a simple and efficient method for determining conflict serializability of a schedule. Consider a schedule S. We construct a directed graph, called a precedence graph, from S. This graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all the transactions participating in the schedule. The set of edges consists of all edges Ti → Tj for which one of three conditions holds:

1. Ti executes write(Q) before Tj executes read(Q).


2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).

If an edge Ti → Tj exists in the precedence graph, then, in any serial schedule S' equivalent to S, Ti must appear before Tj. If the precedence graph for S has a cycle, then S is not conflict serializable; if the graph has no cycles, then S is conflict serializable.

Figure: Precedence graph for schedule 4.

A serializability order of the transactions can be obtained by finding a linear order consistent with the partial order of the precedence graph. This process is called topological sorting.

Figure: Illustration of topological sorting.

Transaction Isolation and Atomicity
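Before moving on, the precedence-graph test for conflict serializability described in the previous section can be sketched in Python. This is an illustrative sketch, not a production algorithm: precedence_graph and serial_order are hypothetical names, the topological sort comes from the standard-library graphlib module (Python 3.9 or later), and the read/write sequences for schedules 3 and 4 are assumed reconstructions because the figures are not reproduced here.

```python
from graphlib import CycleError, TopologicalSorter


def precedence_graph(schedule):
    """Map each transaction to the set of transactions that must precede it.

    schedule is a list of (transaction, kind, item) tuples, where kind is
    "read" or "write". An edge Ti -> Tj is recorded whenever an instruction
    of Ti conflicts with a later instruction of Tj.
    """
    preds = {txn: set() for txn, _, _ in schedule}
    for pos, (ti, kind_i, item_i) in enumerate(schedule):
        for tj, kind_j, item_j in schedule[pos + 1:]:
            if ti != tj and item_i == item_j and "write" in (kind_i, kind_j):
                preds[tj].add(ti)   # edge Ti -> Tj
    return preds


def serial_order(schedule):
    """Return one equivalent serial order, or None if the graph has a cycle."""
    try:
        return list(TopologicalSorter(precedence_graph(schedule)).static_order())
    except CycleError:
        return None


# Assumed read/write sequences for the two concurrent example schedules.
schedule_3 = [
    ("T1", "read", "A"), ("T1", "write", "A"),
    ("T2", "read", "A"), ("T2", "write", "A"),
    ("T1", "read", "B"), ("T1", "write", "B"),
    ("T2", "read", "B"), ("T2", "write", "B"),
]
schedule_4 = [
    ("T1", "read", "A"),
    ("T2", "read", "A"), ("T2", "write", "A"), ("T2", "read", "B"),
    ("T1", "write", "A"), ("T1", "read", "B"), ("T1", "write", "B"),
    ("T2", "write", "B"),
]

print("Schedule 3:", serial_order(schedule_3))
print("Schedule 4:", serial_order(schedule_4))
```

For the assumed schedule 3 the only edge is T1 → T2, so the graph is acyclic and T1, T2 is a valid serial order. For the assumed schedule 4 the graph contains both T1 → T2 and T2 → T1, so no serial order exists. The nested loop compares every pair of instructions, which is quadratic in the schedule length; that is acceptable for a teaching example.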


Figure 14.14: Schedule 9, a nonrecoverable schedule.

If a transaction Ti fails, for whatever reason, we need to undo the effect of this transaction to ensure the atomicity property of the transaction. In a system that allows concurrent execution, the atomicity property requires that any transaction Tj that is dependent on Ti (that is, Tj has read data written by Ti) is also aborted. To achieve this, we need to place restrictions on the type of schedules permitted in the system.

Recoverable Schedules

Consider the partial schedule 9 in Figure 14.14, in which T7 is a transaction that performs only one instruction: read(A). We call this a partial schedule because we have not included a commit or abort operation for T6. Notice that T7 commits immediately after executing the read(A) instruction. Thus, T7 commits while T6 is still in the active state. Now suppose that T6 fails before it commits. T7 has read the value of data item A written by T6. Therefore, we say that T7 is dependent on T6. Because of this, we must abort T7 to ensure atomicity. However, T7 has already committed and cannot be aborted. Thus, we have a situation where it is impossible to recover correctly from the failure of T6.

Schedule 9 is an example of a nonrecoverable schedule. A recoverable schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the commit operation of Tj. For the example of schedule 9 to be recoverable, T7 would have to delay committing until after T6 commits.

Cascadeless Schedules

Even if a schedule is recoverable, to recover correctly from the failure of a transaction Ti, we may have to roll back several transactions. Such situations occur if transactions have read data written by Ti. As an illustration, consider the partial schedule of Figure 14.15. Transaction T8 writes a value of A that is read by transaction T9. Transaction T9 writes a value of A that is read by transaction T10. Suppose that, at this point, T8 fails. T8 must be rolled back. Since T9 is dependent on T8, T9 must be rolled back. Since T10 is dependent on T9, T10 must be rolled back. This phenomenon, in which a single transaction failure leads to a series of transaction rollbacks, is called cascading rollback.
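Both ideas, recoverability and cascading rollback, come down to tracking which transaction reads data written by which other transaction. The sketch below illustrates this under simplifying assumptions: schedules are lists of (transaction, kind, item) tuples with kind "read", "write" or "commit"; a read is attributed to the most recent writer of the item; the function names are hypothetical; and schedules 9 and 10 are reduced to the operations the text describes, since the figures are not reproduced here.

```python
def reads_from(schedule):
    """Return (writer, reader) pairs: reader read an item last written by writer."""
    last_writer = {}
    pairs = []
    for txn, kind, item in schedule:
        if kind == "write":
            last_writer[item] = txn
        elif kind == "read":
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                pairs.append((writer, txn))
    return pairs


def is_recoverable(schedule):
    """A committed reader must commit after every transaction it read from."""
    commit_at = {txn: pos for pos, (txn, kind, _) in enumerate(schedule)
                 if kind == "commit"}
    for writer, reader in reads_from(schedule):
        if reader in commit_at:
            if writer not in commit_at or commit_at[writer] > commit_at[reader]:
                return False
    return True


def cascading_rollback(schedule, failed):
    """Transactions that must be rolled back because `failed` is rolled back."""
    doomed = {failed}
    changed = True
    while changed:
        changed = False
        for writer, reader in reads_from(schedule):
            if writer in doomed and reader not in doomed:
                doomed.add(reader)
                changed = True
    return doomed - {failed}


# Schedule 9 as described: T7 reads A written by T6 and commits first.
schedule_9 = [("T6", "read", "A"), ("T6", "write", "A"),
              ("T7", "read", "A"), ("T7", "commit", None)]
# The same schedule with T7 delaying its commit until after T6 commits.
schedule_9_fixed = [("T6", "read", "A"), ("T6", "write", "A"),
                    ("T7", "read", "A"), ("T6", "commit", None),
                    ("T7", "commit", None)]
# Schedule 10 as described: T8 -> T9 -> T10 through data item A.
schedule_10 = [("T8", "read", "A"), ("T8", "write", "A"),
               ("T9", "read", "A"), ("T9", "write", "A"),
               ("T10", "read", "A")]

print(is_recoverable(schedule_9))
print(is_recoverable(schedule_9_fixed))
print(sorted(cascading_rollback(schedule_10, "T8")))
```

The rollback set is computed as a transitive closure over the reads-from pairs, which mirrors the chain of reasoning in the text: T9 depends on T8, and T10 depends on T9.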


Figure 14.15: Schedule 10.

Cascading rollback is undesirable, since it leads to the undoing of a significant amount of work. It is desirable to restrict the schedules to those where cascading rollbacks cannot occur. Such schedules are called cascadeless schedules. Formally, a cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj. It is easy to verify that every cascadeless schedule is also recoverable.

Transaction Isolation Levels

The isolation levels specified by the SQL standard are as follows:

Serializable usually ensures serializable execution. However, as we shall explain shortly, some database systems implement this isolation level in a manner that may, in certain cases, allow nonserializable executions.
Repeatable read allows only committed data to be read and further requires that, between two reads of a data item by a transaction, no other transaction is allowed to update it. However, the transaction may not be serializable with respect to other transactions. For instance, when it is searching for data satisfying some conditions, a transaction may find some of the data inserted by a committed transaction, but may not find other data inserted by the same transaction.
Read committed allows only committed data to be read, but does not require repeatable reads. For instance, between two reads of a data item by the transaction, another transaction may have updated the data item and committed.
Read uncommitted allows uncommitted data to be read. It is the lowest isolation level allowed by SQL.

All the isolation levels above additionally disallow dirty writes, that is, they disallow writes to a data item that has already been written by another transaction that has not yet committed or aborted. Many database systems run, by default, at the read-committed isolation level.
In SQL, it is possible to set the isolation level explicitly, rather than accepting the system's default setting. For example, the statement

set transaction isolation level serializable;

sets the isolation level to serializable; any of the other isolation levels may be specified instead. The above syntax is supported by Oracle, PostgreSQL and SQL Server; DB2 uses the syntax change isolation level, with its own abbreviations for isolation levels. Changing of the isolation level must be done as the first statement of a transaction. Further, automatic commit of individual statements must be turned off, if it is on by default; API functions, such as the JDBC method Connection.setAutoCommit(false), can be used to do so.
