TRAID: Exploiting Temporal Redundancy and Spatial Redundancy to Boost Transaction Processing Systems Performance


Abstract

In recent years, more storage system applications have employed transaction processing techniques to ensure data integrity and consistency. Logging is the most prominent transaction processing technique: it records the state of the system and provides undo or redo operations after any kind of failure. Furthermore, RAID is used as the underlying storage system in these settings to guarantee reliability and availability with high I/O performance. Current I/O-bound transaction processing applications suffer from long log latency, lock contention, and related overheads caused by large logs, which hurt the overall throughput of the system. The overlap between the spatial redundancy in RAID and the temporal redundancy in the database log enables us to minimize the log size, thereby reducing the latency. In this paper, we exploit this overlap and propose an inexpensive disk array architecture, TRAID, for Transaction Processing Systems (TPS). TRAID is implemented as a reliable storage architecture that avoids keeping double or multiple copies of the same data at different locations such as the log disk and the RAID. It also guarantees comparable RAID availability, recovery correctness, and the same ACID semantics of a TPS. We use three different workloads to inspect TRAID performance: the standard OLTP benchmark TPC-C, a modified TPC-C with strong access locality, and a modified TPC-C with a write-intensive property. Our extensive experimental results demonstrate that TRAID is up to twice as fast as RAID across these workloads (saving up to 40%-60% of response time).

1 Introduction

I/O-bound transaction processing applications, like those in multimedia [1], service-oriented computing [2], etc., are a norm in today's Internet-scale computing.
The ever-increasing transaction complexity and data sizes in these applications contribute to performance-degrading factors such as log latency, lock contention, and increased disk I/O [3]. Log latency is the wait time before a transaction commits; it includes the time to flush the log data and the real data to disk, and the time to acquire and release the locks for disk I/O. Longer log latency means that fewer transactions are committed in a given time frame, reducing the overall throughput of the system. The log data and log buffer size affect a single transaction by elongating its latency, while holding locks for a longer time affects the subsequent transactions waiting on the same data. Recent studies indicate that logging has been playing an increasingly important role in transaction processing systems and could potentially become a bottleneck [4][5][6]. Trends from both database systems and applications support the same observation, as described in the following discussion. For example, Temporal Databases [7][8][9] and Multidimensional Databases [10][11][12] capture more aspects of object activity, and data sets are combined from a multitude of data sources such as sales region, product, or time period. A typical record in a traditional database has several more versions in these new kinds of databases, which leads to bigger index structures and more complex management. More specifically, in a Temporal Database one object has several versions with different timestamps, and updating one object requires changes to many more records, or recursive updates due to stricter semantic consistency. All the timestamps and updates are logged, so the system throughput is reduced because of increased log latency, and locks are held for longer periods. In today's database applications, application data objects are getting larger as digital media becomes ubiquitous.
Similarly, web services and other network applications lead to the frequent creation and updating of application data. Instead of updating the data in place, the archive either stores multiple versions of the objects or simply performs wholesale replacement, generating large log files [13]. Several research efforts have addressed some of the issues in the aforementioned trends and have produced notable results. For example, Charm [6] reduces the waiting time of conflicting transactions by ensuring that all the required data pages are memory resident before a transaction is allowed to lock shared pages. The bulk-logged option in SQL Server reduces the penalty of logging data and metadata [13]. Other approaches include adjusting the log file size at the database or application level, running hourly backups and truncating the log nightly [14]; and structuring a transaction into sub-transactions, allowing early commit of sub-transactions, with compensating transactions provided for recovery purposes [15][16].

Our prior work concludes that existing RAID redundancy can be exploited to provide extra functionality, such as energy efficiency, in addition to reliability and without compromising it [17, 18]. In this paper, we propose a new Transactional RAID system to address the long log wait time and the log space issue for transaction processing applications. TRAID utilizes the redundancy information that already exists in RAID. The idea is to de-duplicate information redundancy at different layers, e.g., temporal redundancy (i.e., different versions of data copies in the time domain) on the log disk in the database, and spatial redundancy (i.e., mirroring redundancy or parity redundancy) in the RAID architecture. For databases supported by mirrored disk arrays [19] and erasure-coded disk arrays [20, 21], there exists an overlap between temporal redundancy and spatial redundancy. We can take advantage of this overlap to improve overall performance without violating the transaction processing ACID properties or recovery correctness. A database with underlying mirrored disk arrays enables us to directly exploit the mirroring redundancy with no extra operation, only a delayed update of one of the mirrored copies. A database with erasure-coded disk arrays, especially parity-based RAID, results in an indirect exploitation of redundancy with one extra XOR operation. The feasibility of this additional XOR relies on the existing XOR support in RAID5 designs. This minimizes the amount of data to be logged while maintaining the same redundancy ratio in the overall storage system. Consequently, both higher performance and better space efficiency of logging can be obtained in transaction processing.

2 Background

In this section we give a brief overview of the two main components that form the basis of our TRAID design for transaction processing systems: the Redundant Array of Independent Disks (RAID) and the Write-Ahead Logging (WAL) protocol, along with the features of each that are exploited in our design.
2.1 RAID in Transaction Processing Systems

The RAID architecture has been the most prominent architecture in disk I/O systems for the past two decades. For database applications, RAID1 (mirroring redundancy) and RAID5 (parity redundancy) are two of the most popular storage systems. Both are often used in commercial database systems [22][23] to improve data availability and reliability. RAID10 is the combination of RAID1 (mirroring) and RAID0 (striping): it provides two-way data redundancy to protect data and uses striping to improve I/O performance. RAID5 stripes both data and parity information across three or more drives. The choice between RAID1 and RAID5 for a database depends on workload characteristics. RAID5 is ideal for read operations, with files striped across multiple disk volumes, but it suffers a write penalty for write operations [24]. RAID10 also stripes the data, so its read performance is comparable to RAID5, but the system has to spend more physical disk space to set up the mirroring redundancy. However, the redundancy provided by the underlying storage system is often overlooked by database and file system designers. Likewise, storage architecture designers are often unaware of the fault-tolerance mechanisms deployed by the upper-level file systems and database management systems. As a result, both groups tend to implement an independent fault-tolerance system from their own perspective, leading to high overhead. We exploit the RAID redundancy in our TRAID design to improve the overall performance of the transaction processing system, without penalizing data availability or violating the transaction processing properties.

2.2 Logging in Transaction Processing Systems

A transaction log (also called a database log or binary log) is a history of the actions executed by a database management system, used to guarantee ACID (Atomicity, Consistency, Isolation, Durability) semantics [25] across crashes or hardware failures.
A database log record is made up of a Log Sequence Number (LSN), the previous LSN, a transaction ID number, a type, and information about the actual changes that triggered the log record to be written. All log records include the general log attributes above, plus other attributes depending on their type (which is recorded in the Type attribute). For all transactions that can make changes to the database, the log needs to record both the previous and the next state of the object, since undo programs reset the object to the old state while redo programs set the object to the new state. The details of the log format are shown in Figure 1.

[Figure 1. Log file format. Every record contains: Log Sequence Number; Prev LSN (a link to the last log record); Transaction ID number; and Type (describing the type of database log record). Type-specific records include the Update Log Record (Page ID, Length and Offset, Before and After Images), Commit Record, Abort Record, Compensation Log Record (undoNextLSN), and Checkpoint Record (Redo LSN, Undo LSN).]

Write-Ahead Logging [26] is one of the basic logging protocols in databases: before a block of data in main memory is output to the database (e.g., on transaction commit or partial commit, at a checkpoint, or on database memory eviction), all log records pertaining to that block must be written to persistent storage. The log records are used for recovery if a redo or undo operation is required. Therefore, a transaction has to wait for the log to be flushed to storage, and the log wait time is bounded by the log size and disk I/O. In this paper, we aim to reduce the log wait time and the log space to improve performance.
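The write-ahead rule described above (log records reach stable storage before the data pages they describe) can be sketched in a few lines. This is a minimal illustration with invented class and field names, not the paper's implementation:

```python
# Minimal sketch of the Write-Ahead Logging rule: a data page may be
# written to "disk" only after every log record describing its change
# has been flushed. Names and structures are illustrative.

class WAL:
    def __init__(self):
        self.next_lsn = 1
        self.buffer = []      # log records not yet on stable storage
        self.stable_log = []  # flushed log records

    def append(self, txn_id, page_id, before, after):
        rec = {"lsn": self.next_lsn, "txn": txn_id, "page": page_id,
               "before": before, "after": after}
        self.next_lsn += 1
        self.buffer.append(rec)
        return rec["lsn"]

    def flush(self, up_to_lsn):
        # Move every record with lsn <= up_to_lsn to stable storage.
        self.stable_log += [r for r in self.buffer if r["lsn"] <= up_to_lsn]
        self.buffer = [r for r in self.buffer if r["lsn"] > up_to_lsn]

class BufferPool:
    def __init__(self, wal):
        self.wal, self.disk, self.pages = wal, {}, {}  # page_id -> (value, rec_lsn)

    def update(self, txn_id, page_id, new_value):
        old = self.disk.get(page_id)
        lsn = self.wal.append(txn_id, page_id, old, new_value)
        self.pages[page_id] = (new_value, lsn)

    def evict(self, page_id):
        value, rec_lsn = self.pages.pop(page_id)
        self.wal.flush(rec_lsn)     # WAL: flush the log first ...
        self.disk[page_id] = value  # ... then write the data page

wal = WAL(); pool = BufferPool(wal)
pool.update("T1", "A", "A'")
pool.evict("A")
assert any(r["page"] == "A" for r in wal.stable_log)  # log reached disk first
```

Note that the transaction's commit latency in this model is bounded by the flush, which is exactly the log wait time the paper targets.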

3 TRAID Design

TRAID is implemented as reliable RAID storage for transaction processing systems. Its goal is to provide reliable storage while reducing the log wait time and the log size. Our design targets transaction processing systems; hence, we also show how redo and undo operations are performed correctly (i.e., recovery correctness) and how the ACID semantics provided by relational database systems are maintained in TRAID. The TRAID design exploits the existing redundancy in the most commonly used RAID architectures, i.e., mirroring-based (RAID1) and parity-based (RAID5) redundancy. We develop the corresponding TRAID1 by exploiting mirroring redundancy and TRAID5 by exploiting parity redundancy, as explained in Sections 3.1 and 3.2 respectively.

3.1 Mirroring Redundancy: TRAID1

[Figure 2. RAID1 and TRAID1. In RAID1, an update request A -> A' (1) reads A, (2) writes the log (before and after images), and (3) writes A' to the original and mirror pages; the transaction can commit after (1)(2). In TRAID1, old data is not logged: after (3) the updated data is on the primary disk and can provide service, and (4) the mirror copy is written after (3), after which the data are consistent.]

RAID10 combines mirroring redundancy (RAID1) and striping (RAID0), as shown in Figure 2. Every striped block in RAID0 has a mirroring block on its RAID1 partner, so there is an overlap between temporal redundancy and spatial redundancy. In TRAID1, we utilize this overlap between the log and the mirroring provided by the original RAID to reduce the log size, as shown in Figure 2. A database with RAID1 processes an update transaction as follows: (1) Reads the requested data from disk into memory. (2) Writes the Before Image (e.g., A1) into the log for undo requests. (3) Writes the After Image (e.g.
A1') into the log for redo requests. The transaction can commit after step 3. (4) Updates the data any time before or after the transaction commits (Write-Ahead Logging). In TRAID1, we use the two copies (page A and its mirror page, denoted Am) in the storage system in a novel way to avoid recording the old data in the database log file. One copy is the page that is updated immediately (e.g., A), and the other is kept as an un-updated page (e.g., Am). The un-updated page serves as a backup and is changed right after the transaction commits. The Durability property of ACID ensures that once a transaction has committed successfully, its state changes are permanent and the old data in Am will not be used for the current transaction after the commit; hence, the mirrored data can be updated safely and correctly. An update request in a database with TRAID1 is processed as follows: (1) Read the requested data from disk into memory. (2) Write the After Image (A') into the log for redo requests. The transaction can commit after step 2. (3) Update one copy of this data (e.g., to A') in the RAID. (4) Update the second copy (Am) after the transaction commits, to maintain data consistency. For a read operation within a transaction, if the data on the original and mirror disks are consistent, TRAID1 uses the same mechanism as RAID1, performing read-balancing to pick the best candidate among the disks to serve the read request; if the mirrored data is in an inconsistent state, the TRAID1 controller forwards the request to the disk with the updated data. TRAID1 knows which copy holds the old or the new data because the logical address in the update request is the same for the two disks in the mirroring group; the one-logical-to-two-physical address mapping is done by the RAID device driver. After we update the data on the primary disk, a copy of the same request is recorded in the buffer for the secondary disk.
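The TRAID1 update path can be sketched as follows. This is an illustrative model of the scheme, not the paper's driver code: the mirror copy stays un-updated until commit, so it doubles as the undo (before-image) reference, and only after-images are logged.

```python
# Sketch of the TRAID1 update path: log only the after-image,
# update the primary copy immediately, and defer the mirror update
# until commit so the mirror remains a valid undo reference.

class TRAID1:
    def __init__(self, data):
        self.primary = dict(data)  # updated before commit
        self.mirror = dict(data)   # updated only after commit
        self.log = []              # redo records (after-images only)
        self.pending = {}          # recorded requests for the secondary disk

    def update(self, txn, block, new_value):
        self.log.append((txn, block, new_value))  # (2) log after-image; commit OK now
        self.primary[block] = new_value           # (3) update one copy in the RAID
        self.pending.setdefault(txn, []).append((block, new_value))

    def commit(self, txn):
        for block, value in self.pending.pop(txn, []):
            self.mirror[block] = value            # (4) update the second copy

    def rollback(self, txn):
        for block, _ in self.pending.pop(txn, []):
            self.primary[block] = self.mirror[block]  # undo from the un-updated mirror

r = TRAID1({"A": "A1"})
r.update("T1", "A", "A2")
assert r.mirror["A"] == "A1"  # mirror still holds the undo reference
r.commit("T1")
assert r.mirror["A"] == "A2"  # both copies consistent after commit
```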
As a result, the TRAID1 controller knows which database page or disk block is updated or is going to be updated, based on the recorded requests. In this way, TRAID1 keeps track of the metadata that locates the undo and redo references for transaction recovery and commit. We can summarize the advantages of DB+TRAID1 as follows. TRAID1 avoids writing the old version of the data, which reduces the time to write log records, the log lock contention waiting time among concurrent transactions, the time between log flushing and transaction commit (WAL), and the log size. Complete Rollback: If a transaction has already updated some database data on the disk but needs to be undone, TRAID1 can guarantee that this transaction rolls back completely in order to maintain database consistency. Since we have the original version of the data on the

secondary disk, we can easily generate I/O requests in the TRAID1 driver to read it, and then write it to the corresponding location on the primary disk, which has been updated but not committed. We will discuss the recovery details for different scenarios under Recovery Correctness. Partial Rollback: The complete recovery mentioned above revokes all updates from the beginning of a transaction, often incurring significant cost. To alleviate this problem, a partial rollback scheme is used, which recovers a transaction to a savepoint [27]. TRAID1 also supports partial rollback. Since all the update requests are recorded in the TRAID1 controller for the secondary disk, we set a flag in the request list when a create-savepoint request arrives. If a transaction wants to roll back to a savepoint in the request list, we can read the original data from the secondary disk, redo the update requests issued before the savepoint in the list, and then write the results to the primary disk to finish the partial rollback. The rollback operations are treated as normal transactional operations, which record the result of the rollback as an After Image. Recovery Correctness: TRAID1 guarantees recovery correctness after failures. If the system fails after step 2 and before the transaction commits, neither undo nor redo is necessary, since no change has been made to the disk and the transaction has not committed; we just mark the transaction as aborted, which has no effect on the database. If an updating transaction has committed after step 2, a system failure may require the transaction to be redone. In this case, the updated data in the log is used as the redo reference. Also, note that since the transaction is already committed, we must not undo it, because of the Durability property of ACID.
If a transaction fails at step 3, there are three cases: (1) One copy of the data on disk is updated and the transaction has already committed; a system failure at this point can only result in a redo. We have two choices: read the after images from the log file to update the un-updated copies, or read the updated data from the primary disk and write it to the un-updated copies. We prefer the latter recovery method, since all the recovery actions can be handled inside the TRAID device driver, with no need for the file system to generate a new I/O request, interpret the request, or map the block addresses. (2) One copy of the data on disk is updated, or is in the middle of updating, but the transaction has not committed yet. In this case, either undo or redo may be required. For an undo request, we can use the un-updated copy as the reference; for a redo request, we follow the same procedure as in (1). (3) If the action is aborted before the first copy is updated, we can use the old data on the other disk to undo the transaction, or we can use the after image in the log file to redo it. ACID Semantics: TRAID1 guarantees the ACID semantics (Atomicity, Consistency, Isolation, and Durability) of transaction processing, just as a traditional storage system does. Atomicity: Atomicity refers to the ability of the transaction processing system to guarantee that either all or none of the tasks of a transaction are performed. In DB+TRAID1, any real data must be updated after the corresponding log records are flushed onto persistent storage. This means the WAL protocol is maintained and recovery references are available in case atomicity is violated. The recovery correctness discussed previously ensures that a transaction either finishes successfully or rolls back completely.
Consistency: In TRAID1, the data on the disks is consistent before and after a transaction commits; during the transaction update, the two copies of the data can be inconsistent, since one copy is updated before the transaction commits and the other after. However, we can redo or undo the current transaction at any time in case of any violation, ensuring that the state of the database is consistent. Isolation: Isolation refers to the constraint that prevents other operations from accessing or seeing data in an intermediate state during a transaction. The modification from RAID1 to TRAID1 does not affect this property, because we do not modify the lock semantics of a transaction. Durability: Durability guarantees that once the user has been notified of success, the transaction is persistent and cannot be undone. In TRAID1, once all copies of the data are updated after the transaction commit, the updates are persistent and cannot be undone. As a result, the ACID properties are well maintained in DB+TRAID1 systems.

3.2 Parity Redundancy: TRAID5

RAID5 is the representative storage system with parity redundancy and is widely used in database systems to improve read performance. To ensure the correctness of redo/undo, we must be able either to retrieve both the old and new copies, or to use existing information to reconstruct both the old and new copies, before the transaction commits. Unlike mirroring redundancy, parity redundancy, which encodes the relationship among the updated blocks, does not directly let us retrieve or reconstruct both the old and new copies. We therefore need a different way to log an amount of data comparable to that of TRAID1. This motivates us to create new redundancy on the fly for TPS recovery by exploiting the parity redundancies at different points in the time domain, while maintaining the data reliability of RAID5 and the ACID properties of transaction processing.
In this way, the overlap between the parity redundancy in RAID5 and the temporal redundancy in the database log can be eliminated. Specifically, we log the exclusive-or result of the old parity and the new parity instead of the before and after images of the updated blocks. In a database system with RAID5 storage, a block update transaction results in the following set of operations: (1) A read request of the target block and the parity block on the same stripe from disk to memory. (2) An exclusive-or (XOR) calculation based on the (new) updated data, the (old) un-updated data, and the parity data on the same stripe, to get the new parity. (3) A write of the updated data and the un-updated data into the log file for undo and redo operations. The transaction can commit after step 3. (4) One write of the XOR result as the new parity data. (5) One write of the updated data onto the target block.

More formally, the parity P in RAID5 is calculated as follows. Suppose at time T1 we have (A1, B1, C1, P1) in RAID5, where

P1 = A1 ⊕ B1 ⊕ C1

At time T2, one update request changes A1 to A2, and the data in the stripe become (A2, B1, C1, P2), where

P2 = A2 ⊕ A1 ⊕ P1 = A2 ⊕ B1 ⊕ C1

Now, instead of logging the old and new data in step (3), we log a new TRAID-parity Q such that:

Q = P1 ⊕ P2

Although this TRAID-parity equation covers only a single-block update, it is easy to adapt it for multi-block updates. As we know, the bottleneck in RAID5 is the write penalty for small writes, which is alleviated by collecting as many block-write requests as possible on one stripe and combining the small writes into one big write. This allows the new parity information to be written onto disk only once, instead of being updated several times (as many times as the number of small writes). In memory, however, the parity calculation still needs to cover all the updated blocks. For example, an update request on the stripe containing (A1, B1, C1, P1) in RAID5 may want to produce (A2, B2, C1, P2). The one-time write of P2 is

P2 = A2 ⊕ B2 ⊕ C1

But before that, we will have two versions of the parity in memory:

P2' = A2 ⊕ A1 ⊕ P1 = A2 ⊕ B1 ⊕ C1  and  P2 = B2 ⊕ B1 ⊕ P2' = A2 ⊕ B2 ⊕ C1

where the second expression equals the P2 written to disk.
The TRAID-parities of block A and block B can be obtained by

Q_A = P1 ⊕ P2'
Q_B = P2' ⊕ P2

In other words, we treat the multi-block update as several single-block updates in memory, but we still perform only one parity write (to the parity disk) for the update request, without any extra read or write. In the following discussion of the TRAID-parity calculation, we consider only a single-block update. In the TRAID5 design, instead of logging all versions of the data resulting from various update requests, we keep the TRAID-parity information as the undo and redo reference. The TRAID5 architecture is shown in Figure 3.

[Figure 3. RAID5 and TRAID5. In RAID5, an update request A -> A' (1) reads A and the parity, (2) writes the log (before and after images), and (3) writes A' and P'; the transaction can commit after (1)(2). In TRAID5, the log records TRAID-parities instead, so neither the old nor the new data is needed in the log; after (3), the transaction data is on disk, can provide service, and the data are consistent. The array management software in the controller provides the logical-to-physical mapping.]

The process in TRAID5 for an update transaction is: (1) Read the block and the corresponding parity information from disk into memory; (2) Calculate the new parity P' and the TRAID-parity Q; (3) Write the Q information (no physical undo or redo data is required) into the log file, along with all other transaction information; the transaction can commit after step 3; (4) Write the updated block and the new parity P'. The calculation of Q depends on whether partial rollback is required, which is discussed in the following two sections.

Complete Rollback

A complete rollback means that we need to reset the database to the original state when an undo is needed.
In this case, we just record the newest parity information (at time point T) QT for the updates on block A, as follows:

Q1 = φ                       (T = 1)
Q2 = P2 ⊕ P1 ⊕ Q1            (T = 2)
QT = PT ⊕ PT-1 ⊕ QT-1        (T ≥ 2)

If the old data is lost, QT guarantees that the old data can be recovered by A1 = QT ⊕ AT. Similarly, if the new data is lost, we can use the XOR result of QT and the old data A1 to obtain the new data (redo to AT). Table 1 shows the details of recovery in the case of a complete rollback.

Partial Rollback

In a real database environment, a transaction that needs a partial rollback can write the disk several times before it commits. In this case, we need a list of Q parities, i.e.

Q1, Q2, ..., Qn for all the writes, as some or all of them will be used for the partial rollback. The Q for the partial rollback is calculated as follows:

Q1 = φ                 (T = 1)
Q2 = P2 ⊕ P1           (T = 2)
QT = PT ⊕ PT-1         (T ≥ 2)

If there is a system failure at time point n with data An, and the database needs to roll back to Am at time point m, where m is in [1, n), the undo operation to Am works as follows:

Am = An ⊕ Qn ⊕ ... ⊕ Qm+1

Having Q1, Q2, ..., Qn and A1, we can also redo this transaction to any point in time m, where m is in (1, n], by the following calculation:

Am = A1 ⊕ Q2 ⊕ ... ⊕ Qm

Time   Action        Parity P                  Parity Q                  Get A0
T(0)   Initialize    P0 = A0 ⊕ B0 ⊕ C0         Q0 = NULL                 A0 = A0
T(1)   A0 -> A1      P1 = A1 ⊕ A0 ⊕ P0         Q1 = P1 ⊕ P0 ⊕ Q0         A0 = A1 ⊕ Q1
T(2)   A1 -> A2      P2 = A2 ⊕ A1 ⊕ P1         Q2 = P2 ⊕ P1 ⊕ Q1         A0 = A2 ⊕ Q2
...
T(K)   AK-1 -> AK    PK = AK ⊕ AK-1 ⊕ PK-1     QK = PK ⊕ PK-1 ⊕ QK-1     A0 = AK ⊕ QK

Table 1. Recovery of the data without disk writes during the transaction (Complete Rollback)

Time   Action        Parity P                  Parity Q          Get any version of A
T(0)   Initialize    P0 = A0 ⊕ B0 ⊕ C0         Q0 = NULL         A = A0
T(1)   A0 -> A1      P1 = A1 ⊕ A0 ⊕ P0         Q1 = P1 ⊕ P0      A0 = A1 ⊕ Q1
T(2)   A1 -> A2      P2 = A2 ⊕ A1 ⊕ P1         Q2 = P2 ⊕ P1      A0 = A2 ⊕ Q2 ⊕ Q1; A1 = A2 ⊕ Q2
...
T(K)   AK-1 -> AK    PK = AK ⊕ AK-1 ⊕ PK-1     QK = PK ⊕ PK-1    Ai = AK ⊕ QK ⊕ ... ⊕ Qi+1, 0 ≤ i < K

Table 2. Recovery of the data with disk writes during the transaction (Partial Rollback)

The details of this partial recovery of the data are shown in Table 2. In Table 1 we consider only the situation with no partial commit: we record the newest parity information Q in the log file for the latest write request on stable devices before the transaction commits. Recovery Correctness: TRAID5 guarantees recovery correctness after failures. If the system fails after step 2, the database is still in a consistent state and no recovery is needed. If a system failure happens after step 3 and before step 4, there are two cases: (1) if the transaction committed before the failure, a redo is needed for recovery.
The XOR result of the Q in the log and the un-updated data on disk is used for the redo, and the parity P can be calculated again. (2) If the transaction has not committed yet, then since the data on disk is not updated, we can just mark the transaction as aborted. If the system failure happens during step 4, we also have two cases: (1) if the transaction is already committed, we can redo the transaction using the XOR result of the TRAID-parity and the un-updated data; (2) if the transaction has not committed yet, we need to undo the whole transaction; the XOR result of the TRAID-parity and the updated data provides the undo reference. ACID Semantics: TRAID5 also guarantees the ACID properties. Since TRAID5 can undo a failed or aborted transaction, the data on the disk is guaranteed to be valid; as a result, the Consistency property is maintained. The data being modified during transaction processing is invisible to other transactions because of the transaction locks in TRAID5; in this way, Isolation is guaranteed. Once a transaction commits, the updates are persistent in TRAID5, and any write failure can be recovered by the TRAID-parity recovery methods mentioned above; hence, Durability is kept. For the Atomicity property, if any kind of failure stops the transaction from committing, the parity information in the log can be used to undo and clear the transaction's effects; if a system failure happens during the data update on disk after the transaction commits, we can use the TRAID-parity to redo the transaction. In this way, database modifications follow the all-or-nothing rule. As a result, the ACID properties are also well maintained in DB+TRAID5 systems. It may be noted that the TRAID5 technique can easily be ported to build TRAID6 and other erasure-coded arrays. Double-parity RAID, or parity-based RAID6 such as RDP [28], maintains two parities, P and P'. P is the same as the RAID5 parity, and P' is used only for spatial recovery from a second disk failure (together providing fault tolerance against one or two drive failures).
The spatial recovery requirements are different for RAID5 and RAID6, but the temporal recovery (performing undo/redo on a particular drive in the time domain) provided by the TRAID-parity Q is the same. Hence, only the P parity is used to calculate the TRAID-parity Q.
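The TRAID-parity bookkeeping and the undo/redo identities of Section 3.2 can be checked with a small numerical sketch. The code is ours rather than the paper's; block contents are modeled as integers XORed bitwise, following the single-block-update convention (A1 is the initial version, Q1 = φ):

```python
# Numerical sketch of the TRAID-parity scheme: a stripe (A, B, C, P)
# with P = A ^ B ^ C; each update of A logs only Q_t = P_t ^ P_(t-1)
# instead of before/after images.

def xor(*xs):
    out = 0
    for x in xs:
        out ^= x
    return out

A = [0b1010]             # versions of block A, starting with A1
B, C = 0b0110, 0b0011    # other blocks on the stripe (unchanged here)
P = [xor(A[0], B, C)]    # P1 = A1 ^ B ^ C
Q = [None]               # Q1 = NULL: nothing to undo yet

# Apply updates A1 -> A2 -> A3, logging Q_t = P_t ^ P_(t-1)
for new_a in (0b1111, 0b0001):
    A.append(new_a)
    P.append(xor(new_a, A[-2], P[-1]))  # new parity from new data, old data, old parity
    Q.append(xor(P[-1], P[-2]))         # TRAID-parity: the only thing logged

n = len(A) - 1  # index of the latest version (A[0] is A1)

# Undo (rollback): A_m = A_n ^ Q_n ^ ... ^ Q_(m+1); here roll back to A1
recovered_old = xor(A[n], *Q[1:n + 1])
assert recovered_old == A[0]

# Redo: A_m = A_1 ^ Q_2 ^ ... ^ Q_m; here redo up to the latest version
recovered_new = xor(A[0], *Q[1:n + 1])
assert recovered_new == A[n]
```

Because each Q_t is just the XOR of two consecutive parities, the chain of logged Q values telescopes to A_t ^ A_(t-1), which is why the same list supports both directions of recovery.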

4 Data Reliability of TRAID

The core idea of TRAID is to exploit the inherent RAID redundancy to boost the performance of transaction processing systems. The RAID architecture was developed to enhance the reliability of multi-disk subsystems; therefore, in this section we analyze the reliability of the TRAID1 and TRAID5 architectures and compare them with the RAID1 and RAID5 architectures, respectively. We show that the reliability of TRAID1 is comparable to that of RAID1, except during a small time frame in which it is traded for performance; the reliability of TRAID5 and RAID5, on the other hand, is equivalent. TRAID1: In order to calculate the reliability of TRAID1, we divide the processing of a transaction into three steps: (1) before the transaction can commit, all the transaction data and log records are in the database buffer and log buffer, respectively; (2) the log records are flushed onto the log disk, the transaction is ready to commit, and the transaction data in the database buffer is about to be written to disk; (3) the transaction commits and all the transaction data and log records are on the disks. In steps 1 and 3, TRAID1 has the same data reliability as RAID1, because both have the same number of redundant copies. In step 1, the data is lost if and only if both the database buffer and the log buffer fail, whether in TRAID or RAID; as a result, the mean time to data loss (MTTDL) depends on the mean time to failure (MTTF) of the buffer modules. Let MTTF_buf represent the mean time to failure of a buffer module, and let S_DB and S_LB be the sizes of the database buffer and the log buffer, respectively. The mean failure rate caused by both the DB buffer and the log buffer is

λ1 = (S_DB · S_LB · MTTR) / (MTTF_buf)²

The MTTDL of TRAID and RAID in step 1 is therefore given by:

MTTDL_TRAID1 = MTTDL_RAID1 = 1 / λ1

In step 3, TRAID1 and RAID1 have all the data on the disks, mirrored and striped, so the MTTDL depends on the mean failure rate of the disks.
Let N be the number of disks, and MTTF_disk be the mean time to failure of a disk. It is not straightforward to calculate the MTTDL of TRAID1 and RAID1 directly. However, we can calculate the reliability of RAID10 by using the MTTDL of RAID1 and RAID0. Supposing we have 2-way mirroring redundancy in RAID1, MTTDL_RAID1 is given by:

MTTDL_RAID1 = 2 · MTTF_disk

And MTTDL_RAID0 can be written as (one disk failure causes data loss):

MTTDL_RAID0 = MTTF_disk / N

A RAID10 with N disks can be treated as a RAID0 of N/2 groups, each of which contains 2 mirroring disks; as a result, MTTDL_RAID10 is given by:

MTTDL_TRAID1 = MTTDL_RAID1 = MTTF_group / (N/2) = 2 · MTTF_disk / (N/2) = 4 · MTTF_disk / N

In step 2, however, TRAID1 and RAID1 perform differently, since we update the two mirrored copies in different ways: RAID1 writes the two copies at the same time, while TRAID1 updates one of them before the transaction commits and the other copy after the transaction commits. In step 2, RAID1 has the same data reliability as in step 3, since data loss happens if and only if both disks in one RAID1 group fail at the same time; the MTTDL is therefore again 4 · MTTF_disk / N. The situation of TRAID1 in step 2 is a little more complicated. Suppose we have T transactions in total, the probability of a write operation is P, the average processing time of each write transaction is Tw, and the average processing time of each read transaction is Tr.
For read transactions, we do not need to update the data on the disks, so the MTTDL is still 4 · MTTF_disk / N, and the fraction of time spent on read operations is

(T_r · (1 − P) · T) / ((T_w · P + T_r · (1 − P)) · T)

which reduces to

T_r · (1 − P) / (T_w · P + T_r · (1 − P))

For write operations, during the asynchronous update the disk holding the un-updated copy still contains the old data; if this disk fails after the data on the other disk has been updated but before the transaction commits, we lose the reference needed for possible undo or rollback actions. As a result, besides the normal RAID1 group failure, which also occurs in RAID1, we must consider the failure of the disk containing the old data. The mean failure rate of the former factor is

λ2 = (N/2) · 1/(2 · MTTF_disk) = N / (4 · MTTF_disk)

The mean failure rate of the latter factor, for one write request, is

λ3 = 1 / MTTF_disk

The MTTDL of TRAID1 for write operations can therefore be denoted as

MTTDL_TRAID1Write = 1 / (λ2 + λ3) = 1 / (N/(4 · MTTF_disk) + 1/MTTF_disk) = 4 · MTTF_disk / (N + 4)

and the fraction of time spent on write operations is

(T_w · P · T) / ((T_w · P + T_r · (1 − P)) · T) = T_w · P / (T_w · P + T_r · (1 − P))

As a result, the MTTDL of TRAID1 in step 2 is given by

MTTDL_TRAID1 = [T_r · (1 − P) / (T_w · P + T_r · (1 − P))] · (4 · MTTF_disk / N) + [T_w · P / (T_w · P + T_r · (1 − P))] · (4 · MTTF_disk / (N + 4))
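As a numerical sanity check, the step-2 model above can be coded directly. This is a sketch under our own naming conventions (not the paper's code), with time measured in hours:

```python
# Sketch of the TRAID1 reliability model (Section 4).
# Function names and units are our own; MTTF is in hours.

def mttdl_raid1(n, mttf_disk):
    """MTTDL of an n-disk RAID1, treated as RAID0 over n/2 mirror groups."""
    return 4.0 * mttf_disk / n

def mttdl_traid1_write(n, mttf_disk):
    """Write-path MTTDL: group failures (lambda2 = n / (4 * MTTF)) plus the
    failure of the disk holding the un-updated copy (lambda3 = 1 / MTTF)."""
    lam2 = n / (4.0 * mttf_disk)
    lam3 = 1.0 / mttf_disk
    return 1.0 / (lam2 + lam3)          # = 4 * MTTF / (n + 4)

def mttdl_traid1_step2(n, mttf_disk, p, t_w, t_r):
    """Time-weighted mix of the read and write MTTDLs for write fraction p."""
    total = t_w * p + t_r * (1 - p)
    read_share = t_r * (1 - p) / total
    write_share = t_w * p / total
    return (read_share * mttdl_raid1(n, mttf_disk)
            + write_share * mttdl_traid1_write(n, mttf_disk))

# Limiting cases: read-only matches RAID1; write-only gives 4*MTTF/(n+4).
n, mttf = 8, 1e6
print(mttdl_traid1_step2(n, mttf, p=0.0, t_w=1.0, t_r=1.0))  # 500000.0
print(mttdl_traid1_step2(n, mttf, p=1.0, t_w=1.0, t_r=1.0))  # ~333333.3
```

With N = 8 and MTTF_disk = 10^6 hours, the read-only case gives 5 × 10^5 hours, i.e. an Annual Failure Rate of 8760/500000 ≈ 1.75%, consistent with the case study in the text.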

This equation means that if the application is read-intensive (P → 0), TRAID1 has the same data reliability as RAID1: 4 · MTTF_disk / N. If the application is write-intensive (P → 1), the MTTDL of TRAID1 is 4 · MTTF_disk / (N + 4).

Since the data reliability in steps 1 and 3 is the same, we focus on a case study of step 2. Suppose there is a database with an underlying RAID1 composed of 8 disks, and the workload is 50% reads and 50% writes. The MTTF of a disk is assumed to be 1 million hours [29]. Fitting these numbers into MTTDL_TRAID1 and MTTDL_RAID1, we get MTTDL_TRAID1 ≈ 4.9 × 10^5 hours and MTTDL_RAID1 = 5 × 10^5 hours, which correspond to a 1.79% and a 1.75% Annual Failure Rate, respectively. This 0.04% tradeoff in data reliability buys roughly a 40% transaction processing performance improvement.

We also considered an alternative implementation of TRAID1 that uses parity-style redundancy in the logs, i.e., logging the XOR of the old and new data. This alternative is expected to give the same reliability as RAID1 instead of the 0.04% tradeoff, but it incurs the extra cost of XOR hardware, which is not part of the RAID1 design. We emphasize that the main purpose of our work is to utilize existing redundancy: XOR parity calculation is feasible in TRAID5 because RAID5 already has a parity calculator, but not in TRAID1 because RAID1 lacks such a feature.

TRAID5: The only difference between databases using TRAID5 and RAID5 is the log content, which cannot affect the reliability of the storage system. Assuming the log disk cannot fail, in a database system with RAID5 more than one disk failure will result in data loss. Similarly, in TRAID5, if one disk fails, the data on the failed block can be recovered by one XOR calculation. Furthermore, by using the TRAID-Parity Q we can undo or redo according to the transaction requirement.
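The undo/redo role of an XOR-style parity record can be sketched as follows; this is illustrative Python under our own naming, not the kernel implementation:

```python
# Sketch: a parity record (old XOR new) supports both undo and redo.
# Block contents and the 512-byte size are illustrative.

def traid_parity(old_block: bytes, new_block: bytes) -> bytes:
    """Log record: bytewise XOR of the before and after images."""
    return bytes(a ^ b for a, b in zip(old_block, new_block))

def redo(old_block: bytes, parity: bytes) -> bytes:
    """Reconstruct the after image from the before image and the parity."""
    return bytes(a ^ b for a, b in zip(old_block, parity))

def undo(new_block: bytes, parity: bytes) -> bytes:
    """Reconstruct the before image from the after image and the parity."""
    return bytes(a ^ b for a, b in zip(new_block, parity))

old = bytes(512)                 # 512-byte block before the update
new = bytes([0xAB]) * 512        # block contents after the update
p = traid_parity(old, new)       # one 512-byte record, not two full images
assert redo(old, p) == new
assert undo(new, p) == old
```

Because XOR is its own inverse, a single logged record replaces both the before and after images, which is the source of the log-size savings reported in Section 5.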
If more than one disk fails, the data will be lost, since there is not enough redundant information for recovery. In other words, the TRAID-Parity Q is used to undo or redo transactional operations, not to recover from disk failures. As a result, the data reliability of RAID5 and TRAID5 is the same. Let N be the number of disks in the TRAID5 or RAID5, MTTF_disk be the mean time to failure of each disk, and MTTR be the mean repair time. The MTTDL of TRAID5 and RAID5 is then given by:

MTTDL_TRAID5 = MTTDL_RAID5 = MTTF_disk^2 / (N · (N − 1) · MTTR)

5 Performance Evaluation

5.1 Experimental Setup

There are 6 PCs interconnected through an Intel NetStructure 10/100/1000 Mbps 470T switch. We construct TRAID (TRAID1 and TRAID5) and RAID (RAID1 and RAID5) on top of 4 disks. One PC acts as a client running benchmarks, and another PC acts as a log server. The hardware and software characteristics of the environment are shown in Table 3.

Table 3. Hardware and software of the environment
PC1-6: P4 2.8GHz / 256M RAM
Database: Berkeley DB Version 4.3
OS: Linux
Benchmark: TPC-C / BTPC-C1 / BTPC-C2
Network: Intel NetStructure 470T switch / 1G bandwidth adapter (NIC)

In order to implement TRAID1 and TRAID5, we modified the corresponding RAID code in the Fedora Linux kernel.

TRAID1: In each mirroring group of disks in TRAID1, we choose one disk as the primary disk and the other as the secondary disk. On an update request in a transaction, the copy on the primary disk is handled immediately, but the one for the secondary disk is blocked temporarily in the device driver until the transaction commits on the primary disk. Since the memory available to buffer I/O requests is limited, we format a partition on the TRAID1 disk and combine it with the memory to form a virtual memory, so that all the I/O requests can be recorded, even for a long-lasting transaction.
By using the virtual memory, when the device driver needs to evict a modified page, the page is written to the swap space on disk. When the database decides to update the data on the secondary disk, the I/O requests in the buffer and swap space are sent to the next level in the device driver. If the primary disk fails, the TRAID1 controller can generate a read request inside the driver to obtain the corresponding Before Image data from the secondary disk rather than from the log disk.

TRAID5: In TRAID5, one more TRAID5-parity calculation function is added to the RAID5 source code. We add a hook to the XOR block function in the RAID5 source code to get the required block information and write the TRAID-Parity into a buffer. When the buffer is full, or the database decides to write the updated transaction data to disk, the TRAID-Parity is flushed to the log disk. The size of a TRAID-Parity block is set to 512 bytes, the same as the size of a parity block and also the default page size in Berkeley DB. The TRAID-Parity information serves as a reference for undo and redo operations. Since we only log the information for one updated block, as compared to logging two whole pages (the Before and After Images) in Berkeley DB, the log size overhead is reduced. For recovery, the TRAID-Parity and the current version of the data on the block are used to undo or redo the transaction.

The benchmarks we used in our experiments are TPC-C

and two biased, modified versions of TPC-C, which are introduced in detail in Section 5.2. These benchmarks are implemented using the industry-strength transaction processing library from the Berkeley DB (BDB) package, version 4.3. Berkeley DB supports page-level locking as well as error recovery through write-ahead logging when processing transactions. We set the logging mechanism in Berkeley DB to synchronous, so that the log records in the buffer are flushed onto the disk right after a transaction commits, while the transactional data can stay in the data buffer as long as the buffer is not full and the user does not request a flush. We use the C language and the API of the Berkeley DB library to replay the TPC-C transactions [3].

5.2 Workload Characteristics

In order to have a fair evaluation of TRAID, we use three benchmarks: the commercial benchmark for transaction processing evaluation, TPC-C [31], and two modified versions of TPC-C as micro-benchmarks.

The first benchmark, TPC-C, simulates an Online Transaction Processing (OLTP) database environment. It measures the performance of a system tasked with processing numerous short business transactions concurrently [32]. It is set in the context of a wholesale supplier operating a number of warehouses and their associated sales districts. TPC-C incorporates five types of transactions of different complexity for online and deferred execution on a database system. These transactions perform the basic database operations such as inserts, deletes, and updates. The transactions in TPC-C and their percentages of the transaction mix are [33]: (1) New Order transaction (about 45%): a new order entered into the database; read-write transaction. (2) Payment transaction (about 43%): a payment recorded as received from a customer; read-write transaction. (3) Order Status transaction (about 4%): an inquiry as to whether an order has been processed; read-only transaction.
(4) Stock Level transaction (about 4%): an inquiry as to which stocked items have a low inventory; read-only transaction. (5) Delivery transaction (about 5%): an item is removed from inventory and the status of the order is updated; read-write transaction.

Based on the implementation of standard TPC-C, we developed a special version of TPC-C for our tests, called BTPC-C1 (Biased TPC-C benchmark 1). In BTPC-C1, the key values in the queries and updates were changed from a uniformly random distribution to a biased distribution following the 90/10 rule. In this way we increase the access locality, so that the resulting workload is more sensitive to lock contention delay and log-lock contention delay. Under BTPC-C1, one locked transaction causes more transactions to wait for the lock release, so we can see how much benefit is gained from TRAID's shortened log-lock time. In the experiments with BTPC-C1, we increase the number of concurrent processes to observe the performance of the DB+TRAID and DB+RAID systems.

The third benchmark, BTPC-C2 (Biased TPC-C benchmark 2), aims to test the performance of TRAID with a write-intensive workload. In BTPC-C2, we exclude all the read-only transactions in TPC-C, such as item query transactions and order status transactions. Because read requests in TRAID and RAID are identical, read-intensive transactions may mask the performance improvement. By using BTPC-C2, we can therefore explore the advantages of TRAID for transactions with intensive update requests.

5.3 Experimental Results

We run the TPC-C benchmark workload with the warehouse [32] parameter set to 3, representing a database size of 2GB, which grows during each test run as new records are inserted. In our TPC-C benchmark, the input includes the number of transactions and the number of terminals (number of concurrent processes). The output consists of the total transaction processing time and the transactions per minute (tpm-C).
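The workload characteristics above can be sketched as simple generators; the helper names and the exact BTPC-C2 percentages are our own illustration, not the benchmark code:

```python
# Sketch of the three workloads' request generation (illustrative only).
import random

def tpcc_mix():
    """Pick a transaction type with roughly the TPC-C mix from Section 5.2."""
    kinds = ["new_order", "payment", "order_status", "stock_level", "delivery"]
    weights = [45, 43, 4, 4, 5]       # approximate percentages from the text
    return random.choices(kinds, weights=weights)[0]

def btpcc1_key(n_keys):
    """BTPC-C1: 90% of requests hit 10% of the key space (90/10 rule)."""
    hot = int(n_keys * 0.1)
    if random.random() < 0.9:
        return random.randrange(hot)          # hot 10% of keys
    return random.randrange(hot, n_keys)      # cold 90% of keys

def btpcc2_mix():
    """BTPC-C2: drop the read-only types; the rescaled weights are our guess."""
    kinds = ["new_order", "payment", "delivery"]
    return random.choices(kinds, weights=[48, 46, 6])[0]
```

The skew in `btpcc1_key` is what concentrates lock conflicts on a small set of pages, making the workload sensitive to log-lock latency.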
For each test, we run the given number of transactions ten times and take the average response time to analyze the TRAID performance, in addition to the size of the log file in each experiment.

5.3.1 Standard TPC-C Benchmark

The first experiment compares the overall response times of BDB+RAID and BDB+TRAID for a given number of transactions. In standard TPC-C, we set the number of concurrent processes to 2, and the number of warm-up transactions is 1. The overall response times of RAID1, RAID5, TRAID1 and TRAID5 are shown in Figure 4, and the corresponding throughputs are shown in Figure 5.

Figure 4. Overall Response Time (TPC-C benchmark)

Figure 5. Throughput (TPC-C benchmark)

From Figure 4, we can see that compared to RAID1 and RAID5, TRAID1 and TRAID5 improve the overall response time significantly, and the improvement grows with the number of transactions. From Figure 5, the average throughput of BDB+TRAID1 is 43.23% higher than that of BDB+RAID1. Similar results are obtained for RAID5 and TRAID5: with an average TRAID5 throughput of 31.5 tpm-C, TRAID5 outperforms RAID5 by 56.89%.

The improvement of TRAID5 over RAID5 is more significant than that of TRAID1 over RAID1, because TRAID5 replaces the before- and after-image writes with one XOR calculation on the in-memory data and a write of its result; since it uses the XOR function provided by RAID, the resulting cost is relatively negligible. In TRAID1, on the other hand, after all updates are completed on the primary disk and the transaction is about to commit, we need to flush the buffered requests onto the secondary disk in order to maintain data consistency. Although we do not need to wait for this flushing to finish (once the write request is sent to the disk, it returns, without waiting for the actual write), the buffer still needs to manage all of the page evictions. If one transaction is at the boundary of the buffer while all the other transactions are uncommitted, we have to evict some pages to the swap space on disk. When we then need to flush the pages to keep the on-disk data consistent, some of the pages may have to be read back from the swap space on the primary disk and written to the database on the secondary disk, which results in extra I/O. The throughput of TRAID5/RAID5 is a little lower than that of TRAID1/RAID1 because we do not implement any extra optimization to eliminate the small-write penalty in RAID5 or TRAID5, while TPC-C has a large percentage of small writes.

TRAID is also evaluated for log size improvement, as shown in Figure 6.

Figure 6. Log Size Comparison (TPC-C Benchmark)

The size of the log files in BDB+RAID1 and BDB+RAID5 is the same, since all the pages being updated (including the before and after images) are logged in Berkeley DB. TRAID1 avoids logging the before images in the database log file, while TRAID5 records only the TRAID-Parity information instead of the before and after images. From Figure 6, we can see that TRAID1 saves up to 33.7% of the log space compared to a RAID system, while TRAID5 saves 32.6%.
Before analyzing this result, note that we cannot avoid recording the regular transaction information in the log file (LSN, transaction ID, etc.); some other logged operations (page allocation, keeping track of record counts in a B-tree, marking a record on a page as deleted, etc.); and the relatively large checkpoint records in the BDB log file, which log all the pages being accessed by the running transactions. Hence, by recording only the After Image in TRAID1 and the TRAID-Parity in TRAID5, the log size is reduced by one-third rather than one-half.

Logging in TRAID5 is different from TRAID1. As mentioned above, we set the parity block size to 512 bytes in TRAID5, and it is the basic unit for parity computations. The actual data sizes of disk write requests (stripe size) are independent of the parity block size but are aligned with parity blocks. With this setting, one TRAID-Parity takes as much space as one Before Image does in the BDB log file, which is the only way to make a fair size comparison between the TRAID5 log and the Berkeley DB log. As a result, the log size of TRAID5 should be similar to that of TRAID1. The small difference in the experimental results is due to the different response times: TRAID5 needs a little more time to run all the transactions, which may result in several more checkpoint records (a new checkpoint is made every 6 seconds).

5.3.2 Data Access Locality Micro-benchmark

The second experiment, with the BTPC-C1 benchmark, evaluates the impact of data access locality on the TRAID

performance. Since in BTPC-C1 90% of the queries and update requests focus on 10% of the data, the overall performance is more sensitive to the log-lock-latency effect. With an increasing number of concurrent processes, the benefit of TRAID over RAID becomes more significant, because TRAID reduces the wait time of subsequent transactions. We run 1 transactions implemented by BTPC-C1 and gradually increase the number of concurrent processes. The overall response times of BTPC-C1 on top of BDB+RAID1, BDB+RAID5, BDB+TRAID1 and BDB+TRAID5 are shown in Figure 7, and the corresponding throughputs are shown in Figure 8.

Figure 7. Overall Response Time with strong access locality (BTPC-C1 benchmark)

Figure 8. Throughput with strong access locality (BTPC-C1 benchmark)

From Figure 7 and Figure 8, we can see that the performance improvement from RAID to TRAID is not substantial when there is only 1 process. The difference between TRAID and RAID in this case (sequential transaction processing) is only the log-writing wait time of sequential transactions; there is no log-locking time among concurrent transactions, which would further delay transaction commit times. The trend of the throughput improvement is shown in Figure 9.

Figure 9. Throughput Improvement (BTPC-C1 benchmark)

From Figure 9, it is clear that the throughput improvement from RAID to TRAID increases gradually with the number of concurrent transactions up to 5, after which the improvement factor starts decreasing. The lock contention delay is a crucial factor in transaction response time before the number of concurrent processes reaches 5. TRAID gains more improvement with increasing concurrency and more lock contention because it can decrease the log-lock contention delay.
However, after this point, disk I/O costs dominate the transaction response time, while the relative lock contention effect decreases. Although the throughput of TRAID is always higher than that of RAID, the throughput improvement of TRAID over RAID no longer increases once the concurrency reaches this threshold.

5.3.3 Write-Intensive Workload Micro-benchmark

The third experiment tests the performance of TRAID for the write-intensive workload BTPC-C2, in which every transaction needs to read and update the database. We changed the percentages of the five-transaction mix by deleting the read-only transactions, such as Order Status transactions and Stock Level transactions, and increasing the percentages of the other three kinds of transactions. In this experiment, we set the number of concurrent processes to 2 and run different numbers of transactions to check the overall response time. The overall response times and throughputs of BDB+RAID1, BDB+RAID5, BDB+TRAID1 and BDB+TRAID5 are shown in Figure 10 and Figure 11, respectively. Figure 11 shows that the average throughput of TRAID1 is 32.5 tpm-C and that of TRAID5 is 31.1 tpm-C; TRAID1 outperforms RAID1 by 47.4%, while TRAID5 outperforms RAID5 by 61.7%. Recall that with the standard TPC-C benchmark in the first experiment, TRAID1 and TRAID5 outperformed RAID1 and RAID5 by 43.23% and 56.89%, respectively.
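The improvement percentages quoted above follow directly from the throughput ratios; in the sketch below, the RAID1 figure is back-derived from the reported 47.4% gain and is illustrative only:

```python
# Relative throughput improvement: (TRAID - RAID) / RAID * 100.
# The RAID1 tpm-C value below is a back-derived, illustrative number,
# not a figure reported in the paper.

def improvement_pct(raid_tpmc: float, traid_tpmc: float) -> float:
    return (traid_tpmc - raid_tpmc) / raid_tpmc * 100.0

# With TRAID1 at 32.5 tpm-C, a 47.4% gain implies RAID1 near 22.05 tpm-C:
print(round(improvement_pct(22.05, 32.5), 1))  # 47.4
```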

IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 4, APRIL 2012, 517. TRAID: Exploiting Temporal Redundancy and Spatial Redundancy to Boost Transaction Processing Systems Performance. Pengju Shang, Student Member,

Transaction Management & Concurrency Control. CS 377: Database Systems

Transaction Management & Concurrency Control. CS 377: Database Systems Transaction Management & Concurrency Control CS 377: Database Systems Review: Database Properties Scalability Concurrency Data storage, indexing & query optimization Today & next class Persistency Security

More information

Problems Caused by Failures

Problems Caused by Failures Problems Caused by Failures Update all account balances at a bank branch. Accounts(Anum, CId, BranchId, Balance) Update Accounts Set Balance = Balance * 1.05 Where BranchId = 12345 Partial Updates - Lack

More information

UNIT 9 Crash Recovery. Based on: Text: Chapter 18 Skip: Section 18.7 and second half of 18.8

UNIT 9 Crash Recovery. Based on: Text: Chapter 18 Skip: Section 18.7 and second half of 18.8 UNIT 9 Crash Recovery Based on: Text: Chapter 18 Skip: Section 18.7 and second half of 18.8 Learning Goals Describe the steal and force buffer policies and explain how they affect a transaction s properties

More information

The transaction. Defining properties of transactions. Failures in complex systems propagate. Concurrency Control, Locking, and Recovery

The transaction. Defining properties of transactions. Failures in complex systems propagate. Concurrency Control, Locking, and Recovery Failures in complex systems propagate Concurrency Control, Locking, and Recovery COS 418: Distributed Systems Lecture 17 Say one bit in a DRAM fails: flips a bit in a kernel memory write causes a kernel

More information

Database Recovery. Dr. Bassam Hammo

Database Recovery. Dr. Bassam Hammo Database Recovery Dr. Bassam Hammo 1 Transaction Concept A transaction is a unit of execution Either committed or aborted. After a transaction, the db must be consistent. Consistent No violation of any

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Transactions - Definition A transaction is a sequence of data operations with the following properties: * A Atomic All

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Transaction Management: Crash Recovery (Chap. 18), part 1

Transaction Management: Crash Recovery (Chap. 18), part 1 Transaction Management: Crash Recovery (Chap. 18), part 1 CS634 Class 17 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke ACID Properties Transaction Management must fulfill

More information

some sequential execution crash! Recovery Manager replacement MAIN MEMORY policy DISK

some sequential execution crash! Recovery Manager replacement MAIN MEMORY policy DISK ACID Properties Transaction Management: Crash Recovery (Chap. 18), part 1 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke CS634 Class 17 Transaction Management must fulfill

More information

Crash Recovery CMPSCI 645. Gerome Miklau. Slide content adapted from Ramakrishnan & Gehrke

Crash Recovery CMPSCI 645. Gerome Miklau. Slide content adapted from Ramakrishnan & Gehrke Crash Recovery CMPSCI 645 Gerome Miklau Slide content adapted from Ramakrishnan & Gehrke 1 Review: the ACID Properties Database systems ensure the ACID properties: Atomicity: all operations of transaction

More information

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery In This Lecture Database Systems Lecture 15 Natasha Alechina Transactions Recovery System and Media s Concurrency Concurrency problems For more information Connolly and Begg chapter 20 Ullmanand Widom8.6

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

Physical Storage Media

Physical Storage Media Physical Storage Media These slides are a modified version of the slides of the book Database System Concepts, 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

Recovery and Logging

Recovery and Logging Recovery and Logging Computer Science E-66 Harvard University David G. Sullivan, Ph.D. Review: ACID Properties A transaction has the following ACID properties: Atomicity: either all of its changes take

More information

CSE 190D Database System Implementation

CSE 190D Database System Implementation CSE 190D Database System Implementation Arun Kumar Topic 6: Transaction Management Chapter 16 of Cow Book Slide ACKs: Jignesh Patel 1 Transaction Management Motivation and Basics The ACID Properties Transaction

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

An Efficient Commit Protocol Exploiting Primary-Backup Placement in a Parallel Storage System. Haruo Yokota Tokyo Institute of Technology

An Efficient Commit Protocol Exploiting Primary-Backup Placement in a Parallel Storage System. Haruo Yokota Tokyo Institute of Technology An Efficient Commit Protocol Exploiting Primary-Backup Placement in a Parallel Storage System Haruo Yokota Tokyo Institute of Technology My Research Interests Data Engineering + Dependable Systems Dependable

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

DBS related failures. DBS related failure model. Introduction. Fault tolerance

DBS related failures. DBS related failure model. Introduction. Fault tolerance 16 Logging and Recovery in Database systems 16.1 Introduction: Fail safe systems 16.1.1 Failure Types and failure model 16.1.2 DBS related failures 16.2 DBS Logging and Recovery principles 16.2.1 The Redo

More information

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance Chao Jin, Dan Feng, Hong Jiang, Lei Tian School of Computer, Huazhong University of Science and Technology Wuhan National

More information

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory Dhananjoy Das, Sr. Systems Architect SanDisk Corp. 1 Agenda: Applications are KING! Storage landscape (Flash / NVM)

More information

Weak Levels of Consistency

Weak Levels of Consistency Weak Levels of Consistency - Some applications are willing to live with weak levels of consistency, allowing schedules that are not serialisable E.g. a read-only transaction that wants to get an approximate

More information

Overview of Transaction Management

Overview of Transaction Management Overview of Transaction Management Chapter 16 Comp 521 Files and Databases Fall 2010 1 Database Transactions A transaction is the DBMS s abstract view of a user program: a sequence of database commands;

More information

RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E)

RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E) RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E) 2 LECTURE OUTLINE Failures Recoverable schedules Transaction logs Recovery procedure 3 PURPOSE OF DATABASE RECOVERY To bring the database into the most

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Transaction Management. Pearson Education Limited 1995, 2005

Transaction Management. Pearson Education Limited 1995, 2005 Chapter 20 Transaction Management 1 Chapter 20 - Objectives Function and importance of transactions. Properties of transactions. Concurrency Control Deadlock and how it can be resolved. Granularity of

More information

TRANSACTION PROPERTIES

TRANSACTION PROPERTIES Transaction Is any action that reads from and/or writes to a database. A transaction may consist of a simple SELECT statement to generate a list of table contents; it may consist of series of INSERT statements

More information

Database Recovery Techniques. DBMS, 2007, CEng553 1

Database Recovery Techniques. DBMS, 2007, CEng553 1 Database Recovery Techniques DBMS, 2007, CEng553 1 Review: The ACID properties v A tomicity: All actions in the Xact happen, or none happen. v C onsistency: If each Xact is consistent, and the DB starts

More information

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS Assist. Prof. Dr. Volkan TUNALI PART 1 2 RECOVERY Topics 3 Introduction Transactions Transaction Log System Recovery Media Recovery Introduction

More information

A can be implemented as a separate process to which transactions send lock and unlock requests The lock manager replies to a lock request by sending a lock grant messages (or a message asking the transaction

More information

Atomicity: All actions in the Xact happen, or none happen. Consistency: If each Xact is consistent, and the DB starts consistent, it ends up

Atomicity: All actions in the Xact happen, or none happen. Consistency: If each Xact is consistent, and the DB starts consistent, it ends up CRASH RECOVERY 1 REVIEW: THE ACID PROPERTIES Atomicity: All actions in the Xact happen, or none happen. Consistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent. Isolation:

More information

Crash Recovery Review: The ACID properties

Crash Recovery Review: The ACID properties Crash Recovery Review: The ACID properties A tomicity: All actions in the Xacthappen, or none happen. If you are going to be in the logging business, one of the things that you have to do is to learn about

More information

CAS CS 460/660 Introduction to Database Systems. Recovery 1.1

CAS CS 460/660 Introduction to Database Systems. Recovery 1.1 CAS CS 460/660 Introduction to Database Systems Recovery 1.1 Review: The ACID properties Atomicity: All actions in the Xact happen, or none happen. Consistency: If each Xact is consistent, and the DB starts

More information

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes Caching and consistency File systems maintain many data structures bitmap of free blocks bitmap of inodes directories inodes data blocks Data structures cached for performance works great for read operations......but

More information

IMPROVING THE PERFORMANCE, INTEGRITY, AND MANAGEABILITY OF PHYSICAL STORAGE IN DB2 DATABASES

IMPROVING THE PERFORMANCE, INTEGRITY, AND MANAGEABILITY OF PHYSICAL STORAGE IN DB2 DATABASES IMPROVING THE PERFORMANCE, INTEGRITY, AND MANAGEABILITY OF PHYSICAL STORAGE IN DB2 DATABASES Ram Narayanan August 22, 2003 VERITAS ARCHITECT NETWORK TABLE OF CONTENTS The Database Administrator s Challenge

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 18 Transaction Processing and Database Manager In the previous

More information

Local File Stores. Job of a File Store. Physical Disk Layout CIS657

Local File Stores. Job of a File Store. Physical Disk Layout CIS657 Local File Stores CIS657 Job of a File Store Recall that the File System is responsible for namespace management, locking, quotas, etc. The File Store s responsbility is to mange the placement of data

More information

Crash Recovery. The ACID properties. Motivation

Crash Recovery. The ACID properties. Motivation Crash Recovery The ACID properties A tomicity: All actions in the Xact happen, or none happen. C onsistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent. I solation:

More information

Review: The ACID properties. Crash Recovery. Assumptions. Motivation. Preferred Policy: Steal/No-Force. Buffer Mgmt Plays a Key Role

Review: The ACID properties. Crash Recovery. Assumptions. Motivation. Preferred Policy: Steal/No-Force. Buffer Mgmt Plays a Key Role Crash Recovery If you are going to be in the logging business, one of the things that you have to do is to learn about heavy equipment. Robert VanNatta, Logging History of Columbia County CS 186 Fall 2002,

More information

Database System Concepts

Database System Concepts Chapter 15+16+17: Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2010/2011 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan.

More information

COS 318: Operating Systems. Journaling, NFS and WAFL

COS 318: Operating Systems. Journaling, NFS and WAFL COS 318: Operating Systems Journaling, NFS and WAFL Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Topics Journaling and LFS Network

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Crash Recovery. Chapter 18. Sina Meraji

Crash Recovery. Chapter 18. Sina Meraji Crash Recovery Chapter 18 Sina Meraji Review: The ACID properties A tomicity: All actions in the Xact happen, or none happen. C onsistency: If each Xact is consistent, and the DB starts consistent, it

More information

Consistency and Scalability

Consistency and Scalability COMP 150-IDS: Internet Scale Distributed Systems (Spring 2015) Consistency and Scalability Noah Mendelsohn Tufts University Email: noah@cs.tufts.edu Web: http://www.cs.tufts.edu/~noah Copyright 2015 Noah

More information

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool.

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool. ACID Properties Transaction Management: Crash Recovery (Chap. 18), part 1 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke CS634 Class 20, Apr 13, 2016 Transaction Management

More information

Lecture X: Transactions

Lecture X: Transactions Lecture X: Transactions CMPT 401 Summer 2007 Dr. Alexandra Fedorova Transactions A transaction is a collection of actions logically belonging together To the outside world, a transaction must appear as

More information

A transaction is a sequence of one or more processing steps. It refers to database objects such as tables, views, joins and so forth.

A transaction is a sequence of one or more processing steps. It refers to database objects such as tables, views, joins and so forth. 1 2 A transaction is a sequence of one or more processing steps. It refers to database objects such as tables, views, joins and so forth. Here, the following properties must be fulfilled: Indivisibility

More information

Unit 9 Transaction Processing: Recovery Zvi M. Kedem 1

Unit 9 Transaction Processing: Recovery Zvi M. Kedem 1 Unit 9 Transaction Processing: Recovery 2013 Zvi M. Kedem 1 Recovery in Context User%Level (View%Level) Community%Level (Base%Level) Physical%Level DBMS%OS%Level Centralized Or Distributed Derived%Tables

More information

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU Crash Consistency: FSCK and Journaling 1 Crash-consistency problem File system data structures must persist stored on HDD/SSD despite power loss or system crash Crash-consistency problem The system may

More information

Introduction to Data Management. Lecture #26 (Transactions, cont.)

Introduction to Data Management. Lecture #26 (Transactions, cont.) Introduction to Data Management Lecture #26 (Transactions, cont.) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements v HW and exam

More information

Log-Based Recovery Schemes

Log-Based Recovery Schemes Log-Based Recovery Schemes If you are going to be in the logging business, one of the things that you have to do is to learn about heavy equipment. Robert VanNatta, Logging History of Columbia County CS3223

More information

Introduction to Data Management. Lecture #18 (Transactions)

Introduction to Data Management. Lecture #18 (Transactions) Introduction to Data Management Lecture #18 (Transactions) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements v Project info: Part

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9 Transactions Kathleen Durant PhD Northeastern University CS3200 Lesson 9 1 Outline for the day The definition of a transaction Benefits provided What they look like in SQL Scheduling Transactions Serializability

More information