DATABASE SYSTEMS IT 0303, 5th Semester. D. Hemavathi, R. Venkatalakshmi, Assistant Professor, SRM University, Kattankulathur. School of Computing, Department of IT.

Unit 5: Physical implementation, transaction & recovery

Disclaimer The contents of the slides are solely for the purpose of teaching students at SRM University. All copyrights and Trademarks of organizations/persons apply even if not specified explicitly.

Classification of Physical Storage Media Storage media are classified by: speed with which data can be accessed; cost per unit of data; reliability (data loss on power failure or system crash, physical failure of the storage device). Can differentiate storage into: volatile storage: loses contents when power is switched off; non-volatile storage: contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed-up main memory.

Physical Storage Media Cache: fastest and most costly form of storage; volatile; managed by the computer system hardware. Main memory: fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds); generally too small (or too expensive) to store the entire database; capacities of up to a few gigabytes widely used currently. Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2 to 3 years). Volatile: contents of main memory are usually lost if a power failure or system crash occurs.

Physical Storage Media (Cont.) Flash memory Data survives power failure. Data can be written at a location only once, but the location can be erased and written to again. Can support only a limited number (10K to 1M) of write/erase cycles. Erasing of memory has to be done to an entire bank of memory. Reads are roughly as fast as main memory, but writes are slow (a few microseconds) and erase is slower.

Physical Storage Media (Cont.) Flash memory NOR Flash Fast reads, very slow erase, lower capacity Used to store program code in many embedded devices NAND Flash Page-at-a-time read/write, multi-page erase High capacity (several GB) Widely used as data storage mechanism in portable devices

Physical Storage Media (Cont.) Magnetic-disk Data is stored on spinning disk, and read/written magnetically Primary medium for the long-term storage of data; typically stores entire database. Data must be moved from disk to main memory for access, and written back for storage direct-access possible to read data on disk in any order, unlike magnetic tape Survives power failures and system crashes disk failure can destroy data: is rare but does happen

Physical Storage Media (Cont.) Optical storage non-volatile, data is read optically from a spinning disk using a laser CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms Write-once, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R) Multiple-write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM) Reads and writes are slower than with magnetic disk Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks available for storing large volumes of data

Physical Storage Media (Cont.) Tape storage non-volatile, used primarily for backup (to recover from disk failure), and for archival data sequential-access much slower than disk very high capacity (40 to 300 GB tapes available) tape can be removed from drive storage costs much cheaper than disk, but drives are expensive Tape jukeboxes available for storing massive amounts of data hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)

Storage Hierarchy

RAID RAID: Redundant Arrays of Independent Disks disk organization techniques that manage a large number of disks, providing a view of a single disk of high capacity and high speed by using multiple disks in parallel, and high reliability by storing data redundantly, so that data can be recovered even if a disk fails The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. E.g., a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of roughly 100,000 / 100 = 1,000 hours (approx. 41 days), since the expected time to the first failure shrinks in proportion to the number of disks.

Improvement in Performance via Parallelism Two main goals of parallelism in a disk system: 1. Load-balance multiple small accesses to increase throughput. 2. Parallelize large accesses to reduce response time. Improve transfer rate by striping data across multiple disks. Bit-level striping: split the bits of each byte across multiple disks; but seek/access time is worse than for a single disk; bit-level striping is not used much any more. Block-level striping: with n disks, block i of a file goes to disk (i mod n) + 1; requests for different blocks can run in parallel if the blocks reside on different disks; a request for a long sequence of blocks can utilize all disks in parallel.
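
As an illustration of block-level striping, here is a minimal sketch (not from the slides; function name is invented) of the block-to-disk mapping; the 1-based disk numbering matches the formula above.

```python
def disk_for_block(block_number: int, num_disks: int) -> int:
    """Block i of a file goes to disk (i mod n) + 1, using 1-based disk numbers."""
    return (block_number % num_disks) + 1

# With n = 4 disks, consecutive blocks are spread across all four disks:
# blocks 0..7 -> disks 1, 2, 3, 4, 1, 2, 3, 4
print([disk_for_block(i, 4) for i in range(8)])
```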

RAID Levels RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics RAID Level 0: Block striping; non-redundant. Used in high-performance applications where data loss is not critical. RAID Level 1: Mirrored disks with block striping Offers best write performance. Popular for applications such as storing log files in a database system.

RAID Levels (Cont.) RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping. RAID Level 3: Bit-Interleaved Parity a single parity bit is enough for error correction, not just detection, since we know which disk has failed When writing data, corresponding parity bits must also be computed and written to a parity bit disk To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)

RAID Levels (Cont.) RAID Level 3 (Cont.) Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O. RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks. When writing data block, corresponding block of parity bits must also be computed and written to parity disk To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block) from other disks.

RAID Levels (Cont.) RAID Level 4 (Cont.) Provides higher I/O rates for independent block reads than Level 3 block read goes to a single disk, so blocks stored on different disks can be read in parallel Before writing a block, parity data must be computed Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes) Or by recomputing the parity value using the new values of blocks corresponding to the parity block More efficient for writing large amounts of data sequentially Parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
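
A small sketch (helper names invented for this note) of the two parity operations described above: recovering a damaged block by XOR-ing the surviving blocks with the parity block, and updating the parity incrementally from the old parity, old block, and new block (the 2-reads + 2-writes case).

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def recover_block(surviving_blocks, parity_block):
    """Reconstruct a damaged block as the XOR of all surviving blocks plus parity."""
    result = parity_block
    for blk in surviving_blocks:
        result = xor_blocks(result, blk)
    return result

def update_parity(old_parity, old_block, new_block):
    """Incremental parity update when one data block is rewritten."""
    return xor_blocks(xor_blocks(old_parity, old_block), new_block)
```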

RAID Levels (Cont.) RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk. E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.

RAID Levels (Cont.) RAID Level 5 (Cont.) Higher I/O rates than Level 4. Block writes occur in parallel if the blocks and their parity blocks are on different disks. Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk. RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used as widely.

Transaction A transaction is a unit of program execution that accesses and possibly updates various data items. A transaction must see a consistent database. During transaction execution the database may be inconsistent. When the transaction is committed, the database must be consistent.

ACID Properties To ensure integrity of data, the database system must maintain: Atomicity. Either all operations of the transaction are properly reflected in the database or none are. Consistency. Execution of a transaction in isolation preserves the consistency of the database. Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Example Of Transfer Transaction to transfer $100 from checking account A to savings account B: 1. read(A) 2. A := A - 100 3. write(A) 4. read(B) 5. B := B + 100 6. write(B) Consistency requirement: the sum of A and B is unchanged by the execution of the transaction. Atomicity requirement: if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result.

Transfer Example (Cont.) Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the $100 has taken place), the updates to the database by the transaction must persist despite failures. Isolation requirement: if, between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database.
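
A minimal sketch of the fund-transfer transaction using Python's sqlite3 module (the accounts table and starting balances are assumed here for illustration): the commit/rollback pair is what gives atomicity, and only the committed state is what durability must preserve.

```python
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT OR IGNORE INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 2000)])
conn.commit()

try:
    # read(A); A := A - 100; write(A)
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'A'")
    # read(B); B := B + 100; write(B)
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'B'")
    conn.commit()        # all-or-nothing: both updates become durable together
except Exception:
    conn.rollback()      # atomicity: a failure between the two writes undoes the partial update
```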

Transaction State Active, the initial state; the transaction stays in this state while it is executing. Partially committed, after the final statement has been executed. Failed, after the discovery that normal execution can no longer proceed. Aborted, after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after abort: 1) restart the transaction, only if there is no internal logical error; 2) kill the transaction. Committed, after successful completion.

State diagram of a transaction

Implementation of Atomicity and Durability The shadow-database scheme: assume that only one transaction is active at a time. a pointer called db_pointer always points to the current consistent copy of the database. all updates are made on a shadow copy of the database, and db_pointer is made to point to the updated shadow copy only after the transaction reaches partial commit and all updated pages have been flushed to disk. in case transaction fails, old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
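
A toy sketch of the shadow-database idea (file and function names are made up for illustration): all updates go to a shadow copy, and the db_pointer switch is a single atomic rename that happens only after the shadow copy and the new pointer value are safely on disk.

```python
import os, shutil

DB_POINTER = "db_pointer"          # holds the name of the current consistent copy

# one-time setup for the sketch: an initial database copy and a pointer to it
if not os.path.exists(DB_POINTER):
    with open("db.copy0", "w") as f:
        f.write("initial state")
    with open(DB_POINTER, "w") as f:
        f.write("db.copy0")

def run_transaction(update_fn):
    current = open(DB_POINTER).read().strip()
    shadow = current + ".shadow"
    shutil.copyfile(current, shadow)              # all updates go to a shadow copy
    update_fn(shadow)
    with open(DB_POINTER + ".tmp", "w") as f:     # write new pointer value, force to disk
        f.write(shadow)
        f.flush()
        os.fsync(f.fileno())
    os.replace(DB_POINTER + ".tmp", DB_POINTER)   # atomic pointer switch = commit point
    # a crash before os.replace leaves db_pointer naming the old consistent copy

run_transaction(lambda path: open(path, "a").write("\nupdate by T1"))
```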

Cont.

Concurrent Executions Multiple transactions are allowed to run concurrently in the system. Advantages are: increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk reduced waiting time for transactions: short transactions need not wait behind long ones.

Schedules Schedules: sequences that indicate the chronological order in which instructions of concurrent transactions are executed. A schedule for a set of transactions must consist of all instructions of those transactions and must preserve the order in which the instructions appear in each individual transaction.

Example Schedules Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.

Cont. Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.

Cont. The following concurrent schedule does not preserve the value of the sum A + B.
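
A quick simulation (written for this note, not part of the slides) contrasting a serial schedule with a bad interleaving of T1 (transfer $50) and T2 (transfer 10%): the serial order preserves A + B, while the interleaving shown lets T1's write(A) overwrite T2's, so the sum changes.

```python
def run(schedule):
    db = {"A": 1000, "B": 2000}
    local = {"T1": {}, "T2": {}}
    for txn, op, *args in schedule:
        if op == "read":
            local[txn][args[0]] = db[args[0]]       # copy item into local buffer
        elif op == "write":
            db[args[0]] = local[txn][args[0]]       # write local value back
        elif op == "compute":
            args[0](local[txn])                     # update local buffer only
    return db

serial = [
    ("T1", "read", "A"), ("T1", "compute", lambda v: v.update(A=v["A"] - 50)),
    ("T1", "write", "A"), ("T1", "read", "B"),
    ("T1", "compute", lambda v: v.update(B=v["B"] + 50)), ("T1", "write", "B"),
    ("T2", "read", "A"),
    ("T2", "compute", lambda v: v.update(temp=v["A"] * 0.1, A=v["A"] - v["A"] * 0.1)),
    ("T2", "write", "A"), ("T2", "read", "B"),
    ("T2", "compute", lambda v: v.update(B=v["B"] + v["temp"])), ("T2", "write", "B"),
]

# Bad interleaving: T2 reads and writes A between T1's read(A) and write(A).
interleaved = serial[0:2] + serial[6:10] + serial[2:6] + serial[10:12]

for name, s in [("serial", serial), ("interleaved", interleaved)]:
    db = run(s)
    print(name, db, "A+B =", db["A"] + db["B"])   # serial: 3000, interleaved: 3050
```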

Serializability Basic Assumption Each transaction preserves database consistency. Thus serial execution of a set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of: 1. conflict serializability 2. view serializability

Conflict Serializability Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q. 1. li = read(Q), lj = read(Q): li and lj don't conflict. 2. li = read(Q), lj = write(Q): they conflict. 3. li = write(Q), lj = read(Q): they conflict. 4. li = write(Q), lj = write(Q): they conflict.

Conflict Serializability (Cont.) If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Example of a schedule that is not conflict serializable: T3: read(Q); T4: write(Q); T3: write(Q). We are unable to swap instructions in the above schedule to obtain either the serial schedule <T3, T4> or the serial schedule <T4, T3>.

Conflict Serializability (Cont.) Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.
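
A standard way to test conflict serializability is to build a precedence graph (an edge Ti -> Tj for each conflicting pair in which Ti's operation comes first) and check it for cycles. The following sketch is an illustration written for this note, not taken from the slides.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, action, item) with action in {'r', 'w'}.
    Returns True iff the precedence graph is acyclic (conflict serializable)."""
    edges = set()
    txns = {t for t, _, _ in schedule}
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            if ti != tj and qi == qj and 'w' in (ai, aj):
                edges.add((ti, tj))        # ti's conflicting operation precedes tj's

    def has_cycle(node, visiting, done):
        """Depth-first search for a cycle reachable from node."""
        visiting.add(node)
        for u, v in edges:
            if u == node:
                if v in visiting or (v not in done and has_cycle(v, visiting, done)):
                    return True
        visiting.discard(node)
        done.add(node)
        return False

    return not any(has_cycle(t, set(), set()) for t in txns)

# The schedule above: T3 reads Q, T4 writes Q, T3 writes Q -> edges T3->T4 and T4->T3
print(conflict_serializable([("T3", "r", "Q"), ("T4", "w", "Q"), ("T3", "w", "Q")]))  # False
```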

View Serializability Let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if the following three conditions are met: 1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q. 2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S' also read the value of Q that was produced by transaction Tj. 3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'. As can be seen, view equivalence is based purely on reads and writes alone.

View Serializability (Cont.) A schedule S is view serializable if it is view equivalent to a serial schedule. Every conflict serializable schedule is also view serializable. Schedule 9 (from the text) is a schedule which is view serializable but not conflict serializable. Every view serializable schedule that is not conflict serializable has blind writes.

Levels of Consistency in SQL-92 Serializable: the default. Repeatable read: only committed records may be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable: it may find some records inserted by a transaction but not find others. Read committed: only committed records can be read, but successive reads of a record may return different (but committed) values. Read uncommitted: even uncommitted records may be read.

Lock-Based Protocols A lock is a mechanism to control concurrent access to a data item Data items can be locked in two modes: 1. exclusive (X) mode. Data item can be both read as well as written. An X-lock is requested using the lock-X instruction. 2. shared (S) mode. Data item can only be read. An S-lock is requested using the lock-S instruction.

Lock-Based Protocols (Cont.) Lock compatibility matrix: a transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
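
The compatibility matrix referred to above (shared is compatible only with shared; exclusive is compatible with nothing) can be written down directly; this tiny sketch is an illustration added for this note.

```python
# Lock compatibility: a requested mode is granted only if it is compatible
# with every mode already held on the item by other transactions.
COMPATIBLE = {
    ("S", "S"): True,    # two shared locks can coexist
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,   # exclusive conflicts with everything
}

def can_grant(requested_mode, held_modes):
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # True
print(can_grant("X", ["S"]))       # False
```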

Lock-Based Protocols (Cont.) Example of a transaction performing locking: T2: lock-S(A); read(A); unlock(A); lock-S(B); read(B); unlock(B); display(A+B)

Pitfalls of Lock-Based Protocols Consider the partial schedule in which T3 holds an X-lock on B and T4 holds an S-lock on A. Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A. Such a situation is called a deadlock; to handle it, one of T3 or T4 must be rolled back and its locks released.

Pitfalls of Lock-Based Protocols (Cont.) The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil. Starvation is also possible if the concurrency-control manager is badly designed. For example: a transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.

The Two-Phase Locking Protocol This is a protocol which ensures conflict-serializable schedules. Phase 1: Growing Phase transaction may obtain locks transaction may not release locks Phase 2: Shrinking Phase transaction may release locks transaction may not obtain locks The protocol assures serializability. It can be shown that the transactions can be serialized in the order of their lock points (the point where a transaction acquires its final lock).

The Two-Phase Locking Protocol (Cont.) Two-phase locking does not ensure freedom from deadlocks. Cascading rollback is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking: here a transaction must hold all its exclusive locks till it commits/aborts. Rigorous two-phase locking is even stricter: all locks are held till commit/abort.

The Two-Phase Locking Protocol (Cont.) There can be conflict serializable schedules that cannot be obtained if two-phase locking is used. However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense: given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Lock Conversions Two-phase locking with lock conversions: First Phase: can acquire a lock-S on an item can acquire a lock-X on an item can convert a lock-S to a lock-X (upgrade) Second Phase: can release a lock-S can release a lock-X
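
A compact sketch (class and method names invented for this note) of a per-transaction lock tracker that enforces the two-phase rule: once the transaction releases any lock it enters its shrinking phase, and further lock requests are rejected, while an S-to-X upgrade is allowed during the growing phase.

```python
class TwoPhaseLockingError(Exception):
    pass

class Transaction2PL:
    """Tracks locks for one transaction and enforces the two-phase discipline."""
    def __init__(self, name):
        self.name = name
        self.locks = {}          # item -> "S" or "X"
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item, mode):
        if self.shrinking:
            raise TwoPhaseLockingError(f"{self.name}: cannot acquire locks in shrinking phase")
        self.locks[item] = mode  # an S -> X upgrade is just overwriting the mode

    def unlock(self, item):
        self.shrinking = True    # first release ends the growing phase
        del self.locks[item]

t = Transaction2PL("T1")
t.lock("A", "S"); t.lock("A", "X")   # upgrade during growing phase
t.unlock("A")
try:
    t.lock("B", "S")                 # violates two-phase locking
except TwoPhaseLockingError as e:
    print(e)
```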

Deadlock Handling Consider the following two transactions: T1: write(X); write(Y). T2: write(Y); write(X). Schedule with deadlock: T1 obtains lock-X on X and executes write(X); T2 obtains lock-X on Y and executes write(Y); T1 then waits for lock-X on Y, while T2 waits for lock-X on X.

Deadlock Handling System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set. Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies : Require that each transaction locks all its data items before it begins execution

More Deadlock Prevention Strategies The following schemes use transaction timestamps for the sake of deadlock prevention alone. wait-die scheme: non-preemptive. An older transaction may wait for a younger one to release a data item; younger transactions never wait for older ones, they are rolled back instead. A transaction may die several times before acquiring the needed data item. wound-wait scheme: preemptive. An older transaction wounds (forces rollback of) a younger transaction instead of waiting for it; younger transactions may wait for older ones.

Deadlock Prevention (Cont.) Both in wait-die and in wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided. Timeout-based schemes: a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
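
A sketch of the two timestamp-based decisions described above (function and variable names are mine): in wait-die the older requester waits and the younger one dies; in wound-wait the older requester wounds the younger holder and a younger requester waits.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older (smaller timestamp) requester may wait; a younger one is rolled back."""
    return "wait" if requester_ts < holder_ts else "rollback requester"

def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds (rolls back) the holder; a younger requester waits."""
    return "rollback holder" if requester_ts < holder_ts else "wait"

# T1 (ts=5) is older than T2 (ts=9):
print(wait_die(5, 9))    # older requester waits
print(wound_wait(5, 9))  # older requester preempts the younger holder
print(wait_die(9, 5))    # younger requester is rolled back ("dies")
```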

Deadlock Detection Deadlocks can be described by a wait-for graph, which consists of a pair G = (V, E), where V is a set of vertices (all the transactions in the system) and E is a set of edges; each element is an ordered pair Ti -> Tj. If Ti -> Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item. When Ti requests a data item currently being held by Tj, the edge Ti -> Tj is inserted in the wait-for graph; this edge is removed only when Tj is no longer holding a data item needed by Ti. The system is in a deadlock state if and only if the wait-for graph has a cycle.

Deadlock Detection (Cont.) Wait-for graph without a cycle Wait-for graph with a cycle
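
Deadlock detection then reduces to finding a cycle in the wait-for graph. A minimal sketch (illustrative, not from the slides) that repeatedly removes transactions waiting for nothing; whatever is left over is involved in a cycle.

```python
def has_deadlock(wait_for):
    """wait_for: dict mapping Ti -> set of Tj that Ti is waiting for.
    A deadlock exists iff the wait-for graph contains a cycle."""
    graph = {t: set(deps) for t, deps in wait_for.items()}
    changed = True
    while changed:
        changed = False
        for t, deps in list(graph.items()):
            deps.intersection_update(graph)   # drop edges to already-removed transactions
            if not deps:
                del graph[t]                  # t waits for nothing, so it can proceed
                changed = True
    return bool(graph)                        # leftover transactions form a cycle

print(has_deadlock({"T1": {"T2"}, "T2": {"T3"}, "T3": set()}))   # False: no cycle
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))                # True: T1 <-> T2
```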

Deadlock Recovery When deadlock is detected: Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost. Rollback: determine how far to roll back the transaction. Total rollback: abort the transaction and then restart it. It is more effective to roll back the transaction only as far as necessary to break the deadlock. Starvation happens if the same transaction is always chosen as victim; including the number of rollbacks in the cost factor avoids starvation.

Failure Classification Transaction failure: Logical errors: the transaction cannot complete due to some internal error condition System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock) System crash: a power failure or other hardware or software failure causes the system to crash. Fail-stop assumption: non-volatile storage contents are assumed not to be corrupted by a system crash.

Recovery Algorithms Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures Focus of this chapter Recovery algorithms have two parts 1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures 2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
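
As a flavour of part 1 (actions taken during normal processing), here is a toy write-ahead-log sketch (file and function names invented for illustration): the old and new values are forced to the log before the data item itself is updated, so a later undo/redo pass has enough information to restore consistency.

```python
import os

LOG = "db.log"

def log_update(txn, item, old_value, new_value, database):
    """Write-ahead logging: append and flush the log record before updating the item."""
    with open(LOG, "a") as log:
        log.write(f"{txn},{item},{old_value},{new_value}\n")
        log.flush()
        os.fsync(log.fileno())      # the log record reaches stable storage first
    database[item] = new_value      # only now is the data item itself changed

db = {"A": 1000}
log_update("T1", "A", db["A"], 900, db)
```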

Parallel databases Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature multiple processors and this trend is projected to accelerate Databases are growing increasingly large large volumes of transaction data are collected and stored for later analysis. multimedia objects like images are increasingly stored in databases

Parallelism in Databases Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel data can be partitioned and each processor can work independently on its own partition. Queries are expressed in high level language (SQL, translated to relational algebra)

I/O Parallelism Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk. Partitioning techniques (number of disks = n): Round-robin: send the i-th tuple inserted in the relation to disk (i mod n).

I/O Parallelism (Cont.) Partitioning techniques (cont.): Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., v(n-2)] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < v(i+1) go to disk i + 1. Tuples with v < v0 go to disk 0 and tuples with v >= v(n-2) go to disk n - 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 goes to disk 0, a tuple with value 8 goes to disk 1, and a tuple with value 20 goes to disk 2.
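
The two partitioning rules above can be written directly as functions; a small sketch for illustration (0-based disk numbers for round-robin, matching "disk i mod n", and the disk numbering from the range-partitioning rule).

```python
import bisect

def round_robin_disk(i, n):
    """Send the i-th tuple inserted in the relation to disk i mod n."""
    return i % n

def range_partition_disk(value, partition_vector):
    """Partitioning vector [v0, ..., v(n-2)] for n disks:
    value < v0 -> disk 0; vi <= value < v(i+1) -> disk i+1; value >= v(n-2) -> disk n-1."""
    return bisect.bisect_right(partition_vector, value)

# With partitioning vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
print([range_partition_disk(v, [5, 11]) for v in (2, 8, 20)])   # [0, 1, 2]
```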

Distributed Database System (DDBS) DDBS: multiple logically interrelated databases distributed over a computer network. A distributed database system consists of loosely coupled sites that share no physical component. Database systems that run on each site are independent of each other. Transactions may access data at one or more sites.

Homogeneous Distributed Databases In a homogeneous distributed database All sites have identical software Are aware of each other and agree to cooperate in processing user requests. Each site surrenders part of its autonomy in terms of right to change schemas or software Appears to user as a single system In a heterogeneous distributed database Different sites may use different schemas and software Difference in schema is a major problem for query processing Difference in software is a major problem for transaction processing Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing

Data Replication System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites. Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.

Data Replication (Cont.) Advantages of Replication Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist. Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site containing a replica of r. Disadvantages of Replication Increased cost of updates: each replica of relation r must be updated. Increased complexity of concurrency control: concurrent updates to distinct replicas must be coordinated, or the data may become inconsistent.

Data Fragmentation Division of relation r into fragments r1, r2, ..., rn which contain sufficient information to reconstruct relation r. Horizontal fragmentation: each tuple of r is assigned to one or more fragments. Vertical fragmentation: the schema for relation r is split into several smaller schemas. All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property.

Horizontal Fragmentation of account Relation (columns: account_number, branch_name, balance)
account1 = σ branch_name = 'Hillside' (account):
  (A-305, Hillside, 500), (A-226, Hillside, 336), (A-155, Hillside, 62)
account2 = σ branch_name = 'Valleyview' (account):
  (A-177, Valleyview, 205), (A-402, Valleyview, 10000), (A-408, Valleyview, 1123), (A-639, Valleyview, 750)

Vertical Fragmentation of employee_info Relation
deposit1 = Π branch_name, customer_name, tuple_id (employee_info):
  (Hillside, Lowman, 1), (Hillside, Camp, 2), (Valleyview, Camp, 3), (Valleyview, Kahn, 4), (Hillside, Kahn, 5), (Valleyview, Kahn, 6), (Valleyview, Green, 7)
deposit2 = Π account_number, balance, tuple_id (employee_info):
  (A-305, 500, 1), (A-226, 336, 2), (A-177, 205, 3), (A-402, 10000, 4), (A-155, 62, 5), (A-408, 1123, 6), (A-639, 750, 7)
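
The same idea can be sketched with pandas (an illustration added for this note, reusing the account data above for brevity): horizontal fragmentation is a row selection, and vertical fragmentation is a column projection that keeps a tuple_id so the original relation can be rebuilt by a join.

```python
import pandas as pd

account = pd.DataFrame({
    "account_number": ["A-305", "A-226", "A-177", "A-402", "A-155", "A-408", "A-639"],
    "branch_name": ["Hillside", "Hillside", "Valleyview", "Valleyview",
                    "Hillside", "Valleyview", "Valleyview"],
    "balance": [500, 336, 205, 10000, 62, 1123, 750],
})

# Horizontal fragmentation: sigma_{branch_name = ...}(account)
account1 = account[account.branch_name == "Hillside"]
account2 = account[account.branch_name == "Valleyview"]

# Vertical fragmentation: projections that share a tuple_id column
relation = account.assign(tuple_id=range(1, len(account) + 1))
fragment1 = relation[["branch_name", "tuple_id"]]
fragment2 = relation[["account_number", "balance", "tuple_id"]]

# Lossless reconstruction via a join on the common tuple_id
rebuilt = fragment1.merge(fragment2, on="tuple_id")
print(len(rebuilt) == len(relation))   # True
```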

Advantages of Fragmentation Horizontal: allows parallel processing on fragments of a relation allows a relation to be split so that tuples are located where they are most frequently accessed Vertical: allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed

Bibliography 1. Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3rd Edition, 2003. 2. Elmasri & Navathe, Fundamentals of Database Systems, Addison-Wesley Publishing, 3rd Edition, 2000. 3. Date C.J., An Introduction to Database Systems, Addison-Wesley Pub Co, 7th Edition, 2001. 4. Jeffrey D. Ullman, Jennifer Widom, A First Course in Database Systems, Prentice Hall, 1st Edition, 2001. 5. Peter Rob, Carlos Coronel, Database Systems: Design, Implementation, and Management, 4th Edition, Thomson Learning, 2001.

Review questions Define flash memory. Define log disk. What is meant by a log-based file system? Define a transaction. List the properties a transaction must have to ensure integrity of the data. What is meant by cascading rollback? Define concurrency control. Define locking protocol. Define cache coherency. Define parallel aggregation. Define query optimization. Define fuzzy checkpoint. What is meant by the write-ahead logging (WAL) rule?