UNIT 3
Transaction Management and Concurrency Control, Performance tuning and query optimization of SQL and NoSQL Databases.

1. Transaction:

A transaction is a unit of program execution that accesses and possibly updates various data items. Usually, a transaction is initiated by a user program written in a high-level data manipulation language (typically SQL), or programming language (for example, C++, or Java), with embedded database accesses in JDBC or ODBC. A transaction is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and end transaction.

This collection of steps must appear to the user as a single, indivisible unit. Since a transaction is indivisible, it either executes in its entirety or not at all. Thus, if a transaction begins to execute but fails for whatever reason, any changes to the database that the transaction may have made must be undone. This requirement holds regardless of whether the transaction itself failed (for example, if it divided by zero), the operating system crashed, or the computer itself stopped operating. As we shall see, ensuring that this requirement is met is difficult since some changes to the database may still be stored only in the main-memory variables of the transaction, while others may have been written to the database and stored on disk. This all-or-none property is referred to as atomicity.

Furthermore, since a transaction is a single unit, its actions cannot appear to be separated by other database operations not part of the transaction. While we wish to present this user-level impression of transactions, we know that reality is quite different. Even a single SQL statement involves many separate accesses to the database, and a transaction may consist of several SQL statements.
Therefore, the database system must take special actions to ensure that transactions operate properly without interference from concurrently executing database statements. This property is referred to as isolation.

Even if the system ensures correct execution of a transaction, this serves little purpose if the system subsequently crashes and, as a result, the system forgets about the transaction. Thus, a transaction's actions must persist across crashes. This property is referred to as durability.

Because of the above three properties, transactions are an ideal way of structuring interaction with a database. This leads us to impose a requirement on transactions themselves. A transaction must preserve database consistency: if a transaction is run atomically in isolation starting from a consistent database, the database must again be consistent at the end of the transaction. This consistency requirement goes beyond the data integrity constraints we have seen earlier (such as primary-key constraints, referential integrity, check constraints, and the like). Rather, transactions are expected to go beyond that to ensure preservation of those application-dependent consistency constraints that are too complex to state using the SQL constructs for data integrity. How this is done is the responsibility of the programmer who codes a transaction. This property is referred to as consistency.
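The all-or-none behaviour described above can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and a hypothetical account table with a hypothetical "no negative balance" rule; none of these names come from the notes. The transfer either commits both updates or rolls both back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Application-level consistency rule (an assumption for this
            # sketch): no account may end up with a negative balance.
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

# Hypothetical in-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

ok = transfer(conn, "A", "B", 30)       # both updates are committed
failed = transfer(conn, "A", "B", 500)  # rule violated, both updates undone
```

After the second call the balances are the same as after the first, because the partial debit and credit were rolled back together. The consistency check lives in application code, which matches the point that preserving application-dependent constraints is the programmer's responsibility.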
To restate the above more concisely, we require that the database system maintain the following properties of the transactions:
o Atomicity. Either all operations of the transaction are reflected properly in the database, or none are.
o Consistency. Execution of a transaction in isolation (that is, with no other transaction executing concurrently) preserves the consistency of the database.
o Isolation. Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions T1 and T2, it appears to T1 that either T2 finished execution before T1 started or T2 started execution after T1 finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
o Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
These properties are often called the ACID properties.

2. Concurrent Execution

Several current trends in the field of computing are giving rise to an increase in the amount of concurrency possible. As database systems exploit this concurrency to increase overall system performance, there will necessarily be an increasing number of transactions run concurrently.

Early computers had only one processor. Therefore, there was never any real concurrency in the computer. The only concurrency was apparent concurrency created by the operating system as it shared the processor among several distinct tasks or processes. Modern computers are likely to have many processors. These may be truly distinct processors that are all part of one computer. However, even a single processor may be able to run more than one process at a time by having multiple cores. The Intel Core Duo processor is a well-known example of such a multicore processor.

For database systems to take advantage of multiple processors and multiple cores, two approaches are being taken. One is to find parallelism within a single transaction or query.
Another is to support a very large number of concurrent transactions. Many service providers now use large collections of computers rather than large mainframe computers to provide their services. They are making this choice based on the lower cost of this approach. A result of this is yet a further increase in the degree of concurrency that can be supported.
3. Serializability

Our basic assumption is that each transaction preserves database consistency. Thus, serial execution of a set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
(a) conflict serializability
(b) view serializability

Simplified view of transactions
o We ignore operations other than read and write instructions.
o We assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes.
o Our simplified schedules consist of only read and write instructions.

Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
i. li = read(Q), lj = read(Q). li and lj don't conflict.
ii. li = read(Q), lj = write(Q). They conflict.
iii. li = write(Q), lj = read(Q). They conflict.
iv. li = write(Q), lj = write(Q). They conflict.

Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.

(a) Conflict serializability
If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.

Example of a schedule that is not conflict serializable:
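As a sketch, a schedule of this kind can be checked mechanically. The code below assumes an illustrative schedule over T3 and T4 (read(Q) by T3, write(Q) by T4, write(Q) by T3); this schedule is an assumption chosen for illustration. The check builds a precedence graph from the conflict rule and looks for a cycle, which is a standard technique that these notes do not themselves describe. The (transaction, operation, item) tuple encoding and all function names are hypothetical.

```python
from itertools import combinations

def conflicts(op_a, op_b):
    """Two operations conflict if they belong to different transactions,
    access the same item, and at least one of them is a write."""
    (txn_a, kind_a, item_a), (txn_b, kind_b, item_b) = op_a, op_b
    return txn_a != txn_b and item_a == item_b and "write" in (kind_a, kind_b)

def precedence_edges(schedule):
    """Return edges (Ti, Tj): Ti has an operation that conflicts with,
    and comes before, an operation of Tj."""
    return {
        (a[0], b[0])
        for a, b in combinations(schedule, 2)  # pairs keep schedule order
        if conflicts(a, b)
    }

def is_conflict_serializable(schedule):
    """True if the precedence graph of the schedule has no cycle."""
    edges = precedence_edges(schedule)
    nodes = {op[0] for op in schedule}
    while nodes:
        # Transactions with no incoming edge can go first in a serial order.
        free = {n for n in nodes if not any(dst == n for _, dst in edges)}
        if not free:
            return False  # every remaining transaction is on a cycle
        nodes -= free
        edges = {(src, dst) for src, dst in edges if src not in free}
    return True

# Assumed example: T4 writes Q between T3's read and write of Q.
not_serializable = [
    ("T3", "read", "Q"),
    ("T4", "write", "Q"),
    ("T3", "write", "Q"),
]

# Assumed example: interleaved, but every conflict orders T1 before T2.
interleaved = [
    ("T1", "read", "A"), ("T1", "write", "A"),
    ("T2", "read", "A"), ("T2", "write", "A"),
    ("T1", "read", "B"), ("T1", "write", "B"),
    ("T2", "read", "B"), ("T2", "write", "B"),
]
```

For the first schedule the graph contains both T3 -> T4 and T4 -> T3, so no sequence of swaps of non-conflicting instructions can reach a serial order. For the second, the only edge is T1 -> T2, so it is conflict equivalent to the serial schedule <T1, T2>.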
We are unable to swap instructions in the above schedule to obtain either the serial schedule <T3, T4>, or the serial schedule <T4, T3>.

(b) View serializability
Let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if the following three conditions are met, for each data item Q:
o If in schedule S, transaction Ti reads the initial value of Q, then in schedule S' also transaction Ti must read the initial value of Q.
o If in schedule S transaction Ti executes read(Q), and that value was produced by transaction Tj (if any), then in schedule S' also transaction Ti must read the value of Q that was produced by the same write(Q) operation of transaction Tj.
o The transaction (if any) that performs the final write(Q) operation in schedule S must also perform the final write(Q) operation in schedule S'.

As can be seen, view equivalence is also based purely on reads and writes alone. A schedule S is view serializable if it is view equivalent to a serial schedule. Every conflict serializable schedule is also view serializable. Below is a schedule which is view-serializable but not conflict serializable. Every view serializable schedule that is not conflict serializable has blind writes.

4. Lock-based Protocols

One way to ensure isolation is to require that data items be accessed in a mutually exclusive manner; that is, while one transaction is accessing a data item, no other transaction can modify that data item. The most common method used to implement this requirement is to allow a transaction to access a data item only if it is currently holding a lock on that item.

Locks: There are various modes in which a data item may be locked. Some of them are as follows:
i. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then Ti can read, but cannot write, Q.
ii. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item Q, then Ti can both read and write Q.

5. Deadlock Handling

A deadlock is a situation where different transactions are unable to proceed because each holds a lock that the other needs. Because both transactions are waiting for a resource to become available, neither will ever release the locks it holds.
A deadlock can occur when the transactions lock rows in multiple tables (through statements such as UPDATE or SELECT FOR UPDATE), but in the opposite order. A deadlock can also occur when such statements lock ranges of index records and gaps, with each transaction acquiring some locks but not others due to a timing issue.

Consider the following two transactions, Process A and Process B.

A: Write (X)               B: Write (Y)
   Write (Y)                  Write (X)

Process A                  Process B
Lock-X on X                Lock-X on Y
Write (X)                  Write (Y)
Wait for Lock-X on Y       Wait for Lock-X on X

Deadlock Prevention: It is better to prevent a deadlock than to recover from one. There are two ways to prevent a deadlock:
a) Wait-die scheme (non-preemptive)
b) Wound-wait scheme (preemptive)
These schemes use transaction timestamps for the sake of deadlock prevention alone.

a) Wait-Die Scheme: An older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back instead. A transaction may die several times before acquiring the needed data item.

b) Wound-Wait Scheme: An older transaction wounds (forces the rollback of) a younger transaction instead of waiting for it. Younger transactions may wait for older ones. There may be fewer rollbacks than in the wait-die scheme.

Deadlock Recovery:
a) Rollback: The database is rolled back to a previous stable state. The database stores a stable version before a transaction occurs, since it must roll back to that consistent state whenever a deadlock occurs.
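The two timestamp-based prevention schemes above can be sketched as a pair of decision functions. This is a minimal illustration, assuming that a smaller timestamp means an older transaction; the function names and the returned strings are hypothetical. Each function answers the question: a requester wants a lock that a holder currently owns, so what happens?

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die (non-preemptive): an older requester waits for the holder;
    a younger requester dies (is rolled back)."""
    return "wait" if requester_ts < holder_ts else "rollback requester"

def wound_wait(requester_ts, holder_ts):
    """Wound-wait (preemptive): an older requester wounds the holder
    (forces its rollback); a younger requester waits."""
    return "rollback holder" if requester_ts < holder_ts else "wait"

# Assumed timestamps: the transaction started at time 5 is older than
# the one started at time 9.
older, younger = 5, 9

decisions = {
    "wait-die, older requests": wait_die(older, younger),
    "wait-die, younger requests": wait_die(younger, older),
    "wound-wait, older requests": wound_wait(older, younger),
    "wound-wait, younger requests": wound_wait(younger, older),
}
```

In both schemes it is always the younger transaction that is rolled back, so waits only ever go in one direction by age and no cycle of waiting transactions can form. A rolled-back transaction is normally restarted with its original timestamp so that it eventually becomes the oldest and cannot be rolled back forever.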
6. Performance tuning and query optimization of SQL

Performance tuning means adjusting various parameters and design choices to improve system performance for a specific application. Tuning is best done by:
a) Identifying bottlenecks.
b) Eliminating bottlenecks.

A database system can be tuned at 3 levels:
a) Hardware -- e.g., add disks to speed up I/O, add memory to increase buffer hits, move to a faster processor.
b) Database system parameters -- e.g., set buffer size to avoid paging of buffer, set checkpointing intervals to limit log size. The system may have automatic tuning.
c) Higher-level database design, such as the schema, indices and transactions.
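As one illustration of tuning at the design level, the sketch below identifies a bottleneck (a full table scan) and removes it by adding an index. It assumes SQLite through Python's built-in sqlite3 module; the orders table, its columns and the index name are hypothetical. The exact wording of the plan text differs between SQLite versions, so the code only records it.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text

# Hypothetical table with 10,000 rows spread over 100 customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)
conn.commit()

sql = "SELECT total FROM orders WHERE customer_id = ?"

# Step 1: identify the bottleneck. Without an index the plan is a full scan.
plan_before = query_plan(conn, sql, (42,))

# Step 2: eliminate it by adding an index on the filtered column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))
```

Comparing plan_before with plan_after shows the query switching from scanning the whole table to searching through idx_orders_customer. Other database systems expose similar plan output through their own EXPLAIN facilities, with different syntax and detail.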