
ANALYTICAL EVALUATION OF THE RAID 5 DISK ARRAY

by Anand Kuratti

A Thesis Submitted to the Faculty of the DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING In Partial Fulfillment of the Requirements For the Degree of MASTER OF SCIENCE WITH A MAJOR IN ELECTRICAL ENGINEERING In the Graduate College

THE UNIVERSITY OF ARIZONA

1994

STATEMENT BY AUTHOR

This thesis has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this thesis are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED:

APPROVAL BY THESIS DIRECTOR

This thesis has been approved on the date shown below:

William H. Sanders
Associate Professor of Electrical and Computer Engineering

Date

ACKNOWLEDGMENTS

There are many people I would like to acknowledge for helping me throughout the development of this thesis. I would like to thank my thesis committee members: Dr. Pamela Delaney, Dr. Bernard Zeigler, and Dr. William Sanders. I would especially like to thank Bill for his constant support and valuable ideas that helped keep me on track in the face of numerous failed approaches to this problem. I would like to thank everyone in the PMRL lab, past and present: John Diener, Bruce McLeod, Lorenz Lercher, Akber Qureshi, Luai Malhis, Fransiskus Widjanarko, Latha Kant, Bhavan Shah, Doug Obal, and Aad van Moorsel, all of whom listened patiently to my constant ramblings.

To my parents, for their love and support
To my brother, for his long distance enthusiasm

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1. INTRODUCTION
2. RAID 5 ARCHITECTURE
   2.1 Data and Parity Placement
   2.2 I/O Methods
       2.2.1 Normal I/O Methods
       2.2.2 Reconstruction I/O Methods
   2.3 System Workload
   2.4 Model Assumptions
3. PERFORMANCE MODEL
   3.1 Disk Model
       3.1.1 Seek Time
       3.1.2 Disk Access Time
       3.1.3 Disk Service Time
   3.2 Disk Arrival Process
   3.3 Disk Access Probabilities
   3.4 Response Time
4. PERFORMABILITY MODEL
   4.1 Decomposition of Reconstruction Interval
   4.2 Response Time
   4.3 Optimal Reconstruction Rate
5. CONCLUSIONS
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
REFERENCES

LIST OF FIGURES

2.1. Relationship between Data Mapping Entities
2.2. Read Request
2.3. Full Stripe Write Request
2.4. Partial Stripe Write Request
2.5. Read Reconstruction Request
2.6. Full Stripe Write Reconstruction Request
2.7. Partial Stripe Write Reconstruction Request
3.1. Disk Profile
3.2. Disk Service Time Density for Different Values of p_s
3.3. Erlang Approximation for Disk Service Time Density
3.4. Poisson Characteristic of Individual Disk Accesses
Possible Accesses of Disk 2 - Read 2 Data Stripe Units
Analytical Disk Access Probabilities
Simulated Disk Access Probabilities
Mean Total Response Time - Analytical
Mean Total Response Time - Simulation
Mean Total Response Time - Percent Difference
Mean Read Response Time - Analytical
Mean Read Response Time - Simulation
Mean Read Response Time - Percent Difference
Mean Write Response Time - Analytical
Mean Write Response Time - Simulation
Mean Write Response Time - Percent Difference
RAID 5 Data Reconstruction
Mean Batch Queue Length - 50th Reconstruction Request
Mean Batch Queue Length - 100th Reconstruction Request
Mean Batch Queue Length - 200th Reconstruction Request
Mean Batch Queue Length - 400th Reconstruction Request
Mean Batch Queue Length - Steady State
Percent Difference - 50th Reconstruction Request
Percent Difference - 100th Reconstruction Request
Percent Difference - 200th Reconstruction Request
Percent Difference - 400th Reconstruction Request
Mean Total Response Time - Analytical, percentage = 0%
Mean Total Response Time - Simulation, percentage = 0%
Mean Total Response Time - Percent Difference, percentage = 0%
Mean Total Response Time - Analytical, percentage = 20%
Mean Total Response Time - Simulation, percentage = 20%
Mean Total Response Time - Percent Difference, percentage = 20%
Mean Total Response Time - Analytical, percentage = 40%
Mean Total Response Time - Simulation, percentage = 40%
Mean Total Response Time - Percent Difference, percentage = 40%
Mean Total Response Time - Analytical, percentage = 60%
Mean Total Response Time - Simulation, percentage = 60%
Mean Total Response Time - Percent Difference, percentage = 60%
Mean Total Response Time - Analytical, percentage = 80%
Mean Total Response Time - Simulation, percentage = 80%
Mean Total Response Time - Percent Difference, percentage = 80%
RAID 5 Data Reconstruction With Additional Reconstruction Rate
Determination of Optimal - Multiple Objective Problem
Determination of Optimal - Single Objective Problem
Optimal Additional Reconstruction Rate
G.1. Mean Read Response Time - Analytical, percentage = 0%
G.2. Mean Read Response Time - Simulation, percentage = 0%
G.3. Mean Read Response Time - Percent Difference, percentage = 0%
G.4. Mean Read Response Time - Analytical, percentage = 20%
G.5. Mean Read Response Time - Simulation, percentage = 20%
G.6. Mean Read Response Time - Percent Difference, percentage = 20%
G.7. Mean Read Response Time - Analytical, percentage = 40%
G.8. Mean Read Response Time - Simulation, percentage = 40%
G.9. Mean Read Response Time - Percent Difference, percentage = 40%
G.10. Mean Read Response Time - Analytical, percentage = 60%
G.11. Mean Read Response Time - Simulation, percentage = 60%
G.12. Mean Read Response Time - Percent Difference, percentage = 60%
G.13. Mean Read Response Time - Analytical, percentage = 80%
G.14. Mean Read Response Time - Simulation, percentage = 80%
G.15. Mean Read Response Time - Percent Difference, percentage = 80%
H.1. Mean Write Response Time - Analytical, percentage = 0%
H.2. Mean Write Response Time - Simulation, percentage = 0%
H.3. Mean Write Response Time - Percent Difference, percentage = 0%
H.4. Mean Write Response Time - Analytical, percentage = 20%
H.5. Mean Write Response Time - Simulation, percentage = 20%
H.6. Mean Write Response Time - Percent Difference, percentage = 20%
H.7. Mean Write Response Time - Analytical, percentage = 40%
H.8. Mean Write Response Time - Simulation, percentage = 40%
H.9. Mean Write Response Time - Percent Difference, percentage = 40%
H.10. Mean Write Response Time - Analytical, percentage = 60%
H.11. Mean Write Response Time - Simulation, percentage = 60%
H.12. Mean Write Response Time - Percent Difference, percentage = 60%
H.13. Mean Write Response Time - Analytical, percentage = 80%
H.14. Mean Write Response Time - Simulation, percentage = 80%
H.15. Mean Write Response Time - Percent Difference, percentage = 80%

LIST OF TABLES

2.1. Assumed Disk Parameters
Read Request for 2 Data Stripe Units - Possible Disk Accesses

ABSTRACT

As processor and memory performance continue to dramatically increase, the bottleneck in modern computers has shifted to the I/O subsystem. As a result, strategies to provide better performance than current disk systems have been investigated. One effort is the RAID (Redundant Arrays of Inexpensive Disks) Level 5 disk array. The RAID 5 disk array offers increased parallelism of I/O requests through the disk array architecture and fault tolerance through rotated parity. Although analytical models of disk array performance have been developed, they often rely on simplifying assumptions or bounds which cause results to be accurate for a restricted set of the possible workload parameters. This thesis presents analytical performance and performability models to compute the mean steady state response time for a RAID 5 I/O request under a transaction-processing workload. It is shown that these models are accurate for a wider range of the workload parameters than previous studies. Using an observation of how data is reconstructed when a single disk in a row has failed, the analytical models are extended to investigate an optimal rate for data reconstruction.

CHAPTER 1

INTRODUCTION

Over the past decade, processor speed, memory speed, memory capacity, and disk capacity of computers have dramatically improved [1]. Single chip processor speeds have increased at a rate of 4%-1% per year. Access times for main memory have decreased 4%-1% per year. Main memory capacity has quadrupled every two to three years. In contrast, disk I/O performance has shown only modest gains over the same period of time. Disk seek times have improved at a rate of 7% per year. Transfer times from disk to main memory have remained at least an order of magnitude slower than transfer times from main memory to processor. This imbalanced system growth illustrates that a traditional computer organization consisting of a CPU, memory, and a single large capacity disk for mass storage is inadequate for the next generation of computers. As a result, if the imbalance in I/O performance provided by current disk systems is not remedied, future improvements in processor and

memory design will be wasted. Continued improvement in system performance depends on I/O subsystems with higher data and I/O rates. A way to increase I/O performance is to use an array of disks [2, 3]. By interleaving data across many disks, both throughput (measured in megabytes (MB) per second) and I/O rate (measured in I/O requests per second) are improved. Throughput is increased by having many disks cooperate in transferring a block of information; the I/O rate is increased by having multiple disks service multiple independent disk requests. Although disk arrays can achieve better performance, an important consequence is that the reliability of multiple disks is lower than that of a single disk. For example, if disk failure times are exponentially distributed, 100 disks have a combined failure rate 100 times larger than a single disk [4]. More importantly, if every disk failure caused data loss, a 100 disk array would lose data every few hundred hours. To protect against data loss due to disk failures, redundancy schemes have been incorporated into disk arrays. Redundancy schemes are designed to allow a disk array to continue operation when one or more disks have failed and data on failed disks becomes unavailable. Because disk arrays combined with data redundancy hold the promise of improved performance and availability over single disks, researchers have investigated different ways to design and organize disk array architectures. One effort is Redundant Arrays of Inexpensive Disks (RAID). In [5], Patterson, Gibson, and Katz present five ways to introduce redundancy into an array of disks: RAID Level 1 to RAID Level 5. For each level, data is interleaved across multiple disks, but the type of redundancy ranges from traditional mirroring to rotated parity. Using a simple formula to estimate maximum throughput, the

authors conclude that RAID 5 with rotated parity offers the best performance potential of the organizations considered. Although RAID 5 offers improved performance and availability, techniques for modeling and analyzing I/O performance are important to be able to compare RAID 5 and current disk systems. In particular, analytical models combined with a realistic assessment of workload allow for accurate design and performance prediction. However, like many parallel systems, disk arrays are difficult to model because of queuing and fork-join synchronization. Since data is placed on multiple disks, an I/O request to the disk array breaks up into several disk requests. Each disk request may wait for service, then waits for the other disk requests to complete. Under general conditions, queuing or fork-join synchronization alone is tractable, but the combination is unsolvable. Analysis is highly dependent on the characteristics of the particular system and requires careful use of approximations and simplifying assumptions. Previous work in the analytical modeling of disk arrays falls into three categories:

1. models that ignore queuing

2. models that ignore fork-join synchronization

3. models that consider queuing and fork-join synchronization using approximate techniques.

Models that ignore queuing are useful in computing minimum response time or maximum throughput. Although useful in estimating the limits of system performance, such

bounds are only accurate when the system load is extremely light or heavy. Salem and Garcia-Molina [6] derive the expected minimum response time to study the benefits of data striping in synchronized non-redundant disk arrays and show the effects of several low-level disk optimizations on response times at individual disks. Bitton and Gray [7] calculate expected disk seek times for unsynchronized mirrored disk arrays. Kim and Tantawi [8] derive service time distributions for unsynchronized, bit-interleaved, non-redundant disk arrays. Patterson, Gibson, and Katz [5, 9] compute maximum throughput estimates for several RAID levels. Models that ignore fork-join synchronization are frequently used to model bit-interleaved disk arrays. Kim [10] models synchronized bit-interleaved arrays as an M/G/1 queue and shows that such arrays provide lower service times and better load balancing, but decrease the number of concurrent requests. Chen and Towsley [11] analytically model RAID Levels 1 through 5 using bounds based on the request workload. Overhead for fork-join synchronization is ignored for small write requests, resulting in an optimistic model; large requests are modeled using a single queue for all disks in the array. Since data is placed on multiple disks, an I/O request requires requests to individual disks in the array. The I/O request is complete when all disk requests finish. This behavior is similar to a fork-join queue in which a task forks into several subtasks, each of which is sent to a different server. When a subtask completes service, it enters a join node and waits for the remaining subtasks to finish service. After all subtasks are complete, the task is complete. Because of the similarity of requests in a disk array to tasks in a

fork-join queue, results from analyses of fork-join queues have been used to model queuing and fork-join synchronization of disk requests in disk arrays. Although exact results are available for the task response time of a two server fork-join queue [12, 13], general systems that exhibit both queuing and synchronization are not tractable. As a result, attention has shifted to computation of upper and lower bounds. Menon and Mattson [14] formulate an approximate model for RAID 5 under transaction processing workloads, based on a scaling approximation for the M/M/1 queue developed by Nelson and Tantawi [15]. However, the work does not justify the approximation for more than two disks or show that exponential service is an appropriate model for disk accesses. Baccelli, Makowski, and Schwartz [16] derive bounds for response time in fork-join queues under general arrival and service patterns using stochastic ordering and associated random variables. Most previous studies of disk arrays, including RAID 5, often rely on assumptions or bounds which limit the accuracy of results when compared to system measurements or detailed simulation models. When simplifying assumptions are used, the model is developed with regard to certain operating conditions. For example, models which ignore queuing of disk requests are only accurate when the rate of I/O requests is low and the probability that a disk request waits for service is small. Models that compute bounds are usually accurate for restricted regions of the workload. For example, Chen and Towsley [11] calculate mean response time for read and write requests given different request rates and sizes. The percent difference for their analytical calculations of the I/O request response time for single stripe unit requests is less than 10% when compared to detailed

simulation. However, for multiple stripe unit requests, especially write I/O requests, the difference is greater than 10% and as high as 50%. This thesis presents analytical models to calculate the steady state average, mean read, and mean write response time of RAID 5 I/O requests under a transaction-processing workload. It is shown that these models are accurate for a wider range of system workload than previous studies. By systematically deriving the distribution of the time to access and transfer data during a disk request, the arrival process of requests to individual disks in the array, and the time for all dependent disk requests in an I/O request to complete, a more precise model which considers both queuing and fork-join synchronization is developed. To validate the analytical results, values for a wide range of I/O request sizes and arrival rates are computed and compared to results from a detailed simulation model. Finally, the optimal rate for data reconstruction is determined by formulating data reconstruction as a single objective mathematical programming problem. The organization of this thesis is as follows: Chapter 2 will briefly describe the RAID 5 architecture, including data and parity assignment, I/O methods, and components of system workload. Using this description, assumptions used in developing the analytical models are presented. Chapter 3 will discuss the performance model, as well as derivations for the time to service a disk request, the arrival process of requests to individual disks, and the time for all disk requests of an I/O request to complete. Chapter 4 demonstrates that results from the performance model can be extended to analysis of single disk failures and determination of an optimal rebuild rate. Chapter 5 will give conclusions and directions for future research.

CHAPTER 2

RAID 5 ARCHITECTURE

Redundant Arrays of Inexpensive Disks employ two concepts for improved performance and data availability: striping and data redundancy. Striping data across multiple disks provides higher performance than single disks by increasing parallelism and load balancing of requests. Redundancy improves data availability by allowing RAID to operate in the face of single disk failures without data loss. Although striping and data redundancy are simple concepts, the design of a disk array involves complex tradeoffs between availability, performance, and cost. This chapter describes how RAID 5 addresses these issues through data and parity placement and I/O methods. A more complete reference on RAID architectures can be found in [5, 9]. Using a description of the system operation, the workload and assumptions used to develop the analytical models are presented.

2.1 Data and Parity Placement

A RAID 5 disk array consists of N identical disks on which data is interleaved. The unit of data interleaving, or the amount of data that is placed on one disk before data is placed on the next disk, is a stripe unit. Since disks are organized into rows and columns, the set of stripe units with the same physical location on each disk in a row is a stripe.

The number of disks in a stripe is defined as the stripe width, W_s. The data redundancy scheme used in RAID 5 is parity-based. Each stripe contains a parity stripe unit, which is the exclusive-or (XOR) of all data stripe units within the stripe. When a single disk in a row fails, data can be reconstructed by reading the corresponding data and parity stripe units from the other disks. To illustrate the relationships between a stripe unit, stripe, parity stripe unit, and disk, consider the array of 20 disks of 4 columns and 5 rows in Figure 2.1. Stripe 11 contains data stripe units 55, 56, 57, 59 and parity unit 58 on disks 15, 16, 17, 19, and 18. Since each stripe/row contains 5 stripe units/disks, the stripe width equals 5. In contrast to redundancy schemes with dedicated parity disks, parity is distributed uniformly across all disks in a RAID 5 disk array. Because parity stripe units are rotated, I/O requests which must update parity are more evenly balanced across all disks in the array. Another advantage of rotated parity is that data is also distributed more evenly, which allows more disks to participate in I/O operations and increases throughput and I/O rate. Although there are numerous ways to encode parity relative to data, a standard policy is right asymmetric. For the right asymmetric parity placement shown in Figure 2.1, parity stripe units are laid out in a diagonal pattern starting from the top rightmost disk. Given how data and parity units are placed on a RAID 5 disk array, several functions can be defined to relate the relative locations of stripe units, stripes, and disks:

1. stripe unit number → disk: SUN % ND

Figure 2.1: Relationship between Data Mapping Entities (the figure labels a column, a disk, a stripe, a row, and a parity stripe unit, and gives the number of disks in the array, the stripe width, and the number of disk groups (rows))

2. stripe unit number → stripe number: ⌈SUN/W_s − 1⌉

3. stripe number → parity stripe unit: ⌊SN/W_s⌋(W_s)^2 + NR + (SN % W_s)(NR)

where SUN is the stripe unit number, SN is the stripe number, W_s is the stripe width, NR is the number of rows, and ND is the number of disks in the array. For instance, given stripe unit 31 in Figure 2.1, the corresponding disk, stripe number, and parity unit can be computed using the above functions (a short code sketch of these mappings is given below):

1. disk = 31 % 20 = 11

2. stripe number = ⌈31/5 − 1⌉ = 6

3. parity unit = ⌊6/5⌋(25) + 4 + (6 % 5)(4) = 33

2.2 I/O Methods

Using the above description of how data and parity are placed on a RAID 5 disk array, methods to read from and write to the array can be defined. Depending on whether disks have failed, a RAID 5 disk array operates in one of two modes. When all disks are functioning, the array is in normal mode. In reconstruction mode, one or more disks have failed and the array must reconstruct data for I/O requests which access the failed disk(s). I/O methods to read and write data for each mode are described in the following sections.
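As a concrete companion to the mapping functions and the stripe unit 31 example above, here is a short Python sketch. The helper names are hypothetical, and the parameter values (20 disks, stripe width 5, 4 rows) are taken from the Figure 2.1 example.

    # Sketch of the Section 2.1 data-mapping functions (hypothetical names).
    # ND, W_S, and NR follow the Figure 2.1 example: 20 disks, stripe width 5, 4 rows.
    import math

    ND = 20   # number of disks in the array
    W_S = 5   # stripe width (stripe units per stripe)
    NR = 4    # number of disk groups (rows)

    def disk_of(sun: int) -> int:
        """Stripe unit number -> disk: SUN % ND."""
        return sun % ND

    def stripe_of(sun: int) -> int:
        """Stripe unit number -> stripe number: ceil(SUN / W_s - 1)."""
        return math.ceil(sun / W_S - 1)

    def parity_unit_of(sn: int) -> int:
        """Stripe number -> parity stripe unit:
        floor(SN / W_s) * W_s**2 + NR + (SN % W_s) * NR."""
        return (sn // W_S) * W_S ** 2 + NR + (sn % W_S) * NR

    sun = 31
    sn = stripe_of(sun)
    print(disk_of(sun), sn, parity_unit_of(sn))   # 11 6 33

Running the sketch reproduces the worked example: stripe unit 31 lives on disk 11, in stripe 6, whose parity stripe unit is 33.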

2.2.1 Normal I/O Methods

Figure 2.2: Read Request (step 1: read data; legend: data stripe unit, requested data stripe unit, parity stripe unit)

Because data is placed on multiple disks, an I/O request to the disk array to read or write data results in requests for data stripe units at individual disks. If M bytes are requested and the stripe unit size is b bytes, n = ⌈M/b⌉ data stripe units are requested. If the request is a read, as shown in Figure 2.2, the request is complete when all n disk requests complete. For a write request, data must not only be written, but the corresponding parity stripe units must also be updated. Depending on how much of a stripe is written, three cases arise.

1. The request starts at the first data disk in a stripe and the request size is n = W_s. In this case, all data stripe units in a stripe are written, or a full stripe write as shown in Figure 2.3. Since all data stripe units are overwritten, the new parity is

generated entirely from new data. The request is complete when the n data and parity stripe units are written.

Figure 2.3: Full Stripe Write Request (step 1: write new data and parity; legend: data stripe unit, requested data stripe unit, parity stripe unit)

2. The request accesses a single partial stripe (n < W_s) and all n data stripe units requested belong to the same stripe, as illustrated in Figure 2.4. In this case, the parity stripe unit must be updated. This is accomplished by first reading the n old data and parity stripe units. Second, the new parity stripe unit is computed by XORing the old and new data stripe units. The request completes after the n data stripe units and the new parity stripe unit have been written.

3. If the request accesses two or more partial stripes, i.e. the n data stripe units are allocated across stripe boundaries, two or more parity stripe units must be updated. Since stripe units in one stripe do not depend upon stripe units in another stripe, the

operation is divided into multiple partial stripe operations. The request completes when all partial stripe operations finish.

Figure 2.4: Partial Stripe Write Request (steps: 1. read old data and parity; 2. compute new parity (XOR); 3. write new data and parity; legend: data stripe unit, requested data stripe unit, parity stripe unit)

2.2.2 Reconstruction I/O Methods

Because a parity stripe unit is associated with each stripe, data can be reconstructed when a single disk in a row fails. When a disk has failed and a requested data stripe unit cannot be accessed, it can be rebuilt through an XOR of the remaining data and parity stripe units in the stripe. Although a data stripe unit from a failed disk can be reconstructed each time it is needed, an important question is where new and reconstructed data should be stored. To address this issue, most RAID systems contain a pool of spare

disks.¹ When a stripe unit is reconstructed or overwritten by new data, it is also written to a spare disk. By writing data to a spare disk, unnecessary reads to other disks in the stripe are prevented when the same data stripe unit is requested again. As more new and reconstructed data is written to a spare disk, the spare disk eventually replaces the failed disk. When all data from the failed disk(s) have been written to the corresponding spare disk(s), operation returns to normal mode. Yet, because requested data may not always be available at the spare disk, the I/O methods described for normal operation are modified during data reconstruction. If a failed disk is not accessed during a read request, the requested data stripe units are read from each of the corresponding disks as described for a normal read request. However, if a failed disk is accessed as shown in Figure 2.5 and the needed stripe unit is available from the spare disk, the stripe unit is read from the spare disk. If the stripe unit has not been reconstructed, the other data stripe units and parity units in the stripe are read. Then the requested stripe unit is reconstructed through an XOR of the remaining data and parity stripe units. Finally, the reconstructed stripe unit is written to the spare disk to complete the I/O request. As described for normal operation, a write request can access a full stripe, a single partial stripe, or multiple partial stripes. In Figure 2.6, when data is written to a full stripe, all disks in the stripe, including the spare disk, are written with the new data and parity.

¹An important factor in the design of disk arrays is whether spare disks are hot or cold. Hot disks are on line, which allows for immediate switching, but are subject to the same failure conditions as data disks. Cold disks are powered when needed, but require a start up period, which may impact the response time of requests. In this work, a fully functioning spare disk is assumed to be available when a data disk has failed.
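As a small, self-contained illustration of the XOR relationship that reconstruction relies on (a sketch with hypothetical names, not the thesis's simulator), the following fragment rebuilds a missing 4 KB stripe unit from the surviving data and parity stripe units of its stripe.

    # Minimal sketch of stripe-unit reconstruction by XOR (hypothetical names).
    # parity = d0 ^ d1 ^ ... ^ d(k-1), so any single missing unit equals the XOR
    # of the parity unit and the remaining data units.
    from functools import reduce
    import os

    STRIPE_UNIT = 4 * 1024  # 4 KB stripe unit size, as assumed in Section 2.4

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def reconstruct(surviving_data_units, parity_unit):
        """Rebuild the stripe unit of the failed disk: XOR of survivors and parity."""
        return xor_blocks(list(surviving_data_units) + [parity_unit])

    # Example: a stripe with 4 data units and 1 parity unit (stripe width 5)
    data = [os.urandom(STRIPE_UNIT) for _ in range(4)]
    parity = xor_blocks(data)                    # parity = d0 ^ d1 ^ d2 ^ d3
    lost = data[2]                               # pretend the disk holding unit 2 failed
    survivors = data[:2] + data[3:]
    assert reconstruct(survivors, parity) == lost

The same identity underlies both the read reconstruction of Figure 2.5 and the parity update of the partial stripe write in Figure 2.4.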

When a single partial stripe is written as in Figure 2.7, old data and parity must be read to compute the new parity. This is the same as a read request from a failed disk described above. After the new parity has been computed, the data and parity disks are written. New data or parity which would have been written to the failed disk is instead written to the spare disk. A write to multiple partial stripes is considered as a series of multiple single partial stripe writes.

2.3 System Workload

In order to assess how well a system performs, the conditions under which a system operates must also be considered. Although the description of the architecture provides details of how a RAID 5 disk array operates, the RAID 5 performance depends on the inputs which drive the system. These inputs are defined as the workload. The workload for RAID 5, and disk arrays in general, is composed of the frequency and pattern of I/O request arrivals and the size of an I/O request. Since the arrival of I/O requests to the disk array depends on the characteristics of the application(s) which read from and write data to the array, it is impossible to give a general model for the arrival of I/O requests. However, for many applications, the arrival of I/O requests can be approximated by a Poisson process. In this thesis, it is assumed that the arrival of I/O requests is Poisson with rate λ.

Figure 2.5: Read Reconstruction Request (steps: 1. read data; 2. check if the stripe unit is at the spare disk; 3a. if not, reconstruct the stripe unit by reading the remaining stripe units; 3b. otherwise, the stripe unit is available; 3c. reconstruct the stripe unit (XOR); 3d. write the stripe unit to the spare disk; legend: data stripe unit from failed disk, requested data stripe unit, parity stripe unit)

Figure 2.6: Full Stripe Write Reconstruction Request (steps: 1. write new data and parity; 2. write new data to the spare disk; legend: data stripe unit requested from failed disk, requested data stripe unit, data stripe unit, parity stripe unit)

Figure 2.7: Partial Stripe Write Reconstruction Request (steps: 1. read old data and parity; 2. check if the stripe unit is at the spare disk; 3a. if not, reconstruct the stripe unit by reading the remaining stripe units; 3b. otherwise, the stripe unit is available; 3c. reconstruct the stripe unit (XOR); 3d. write the stripe unit to the spare disk; 4. old data and parity read; 5. compute new parity (XOR); 6a. write new data and parity; 6b. write data to the spare disk; legend: data stripe unit from failed disk, requested data stripe unit, parity stripe unit)

The second component of the system workload is the size of an I/O request. For many applications, request sizes can be classified as either supercomputer-based, where requests are large and infrequent, or transaction-based, where small amounts of data are frequently accessed [2, 7]. For this work, it is assumed that requests are transaction-based, where the number of data stripe units requested is less than or equal to the number of data stripe units in a stripe, W_s − 1. A distribution which reflects this type of workload is a quasi-geometric function [11]

    P{N = n} = α                                                          n = 1
    P{N = n} = (1 − α) β (1 − β)^(n−1) / [(1 − β) − (1 − β)^(W_s − 1)]    n = 2, ..., W_s − 1

where N is the request size and 1 ≤ n ≤ W_s − 1. Since the maximum number of data stripe units in an I/O request is W_s − 1 and a request for data can overlap stripe boundaries, at most two stripes can be accessed during an I/O request. Given this description of RAID 5 operation and workload, a set of assumptions used to construct models of the I/O request response time is presented.

2.4 Model Assumptions

Using a description of a RAID 5 disk array, including data and parity mapping, I/O methods, and system workload, the goal of this thesis is to develop models to accurately compute the mean response time of RAID 5 I/O requests. In doing so, the following assumptions are made:

1. Current RAID systems are typically constructed with 10 to 100 disks. To obtain numerical results without loss of generality, the array is assumed to contain 20 disks

and a stripe width of 5 disks. Each disk has the parameters shown in Table 2.1, which reflect current disk technology.

Table 2.1: Assumed Disk Parameters

    Time for full disk rotation (R_max)    16.7 ms
    Number of disk cylinders (C)           12000
    Total usable disk storage              5000 MB
    Arm acceleration time (a)              3 ms
    Seek factor (b)                        0.5
    Transfer rate                          3 MB/s

2. The stripe unit size equals 4 KB.

3. For many transaction-processing workloads, such as scientific databases, a majority of requests are queries to read data. The ratio of reads to writes for such systems is usually 2 or 3 to 1. In this thesis, the probabilities of read and write requests are assumed to be 0.7 and 0.3.

4. Each disk can service only one request at a time. Other requests wait and are serviced in first come-first served (FCFS) order.

5. The arrival of I/O requests to the system is assumed to be a Poisson process with rate λ.

6. It is assumed that I/O requests access data throughout the disk array in a uniform pattern. Since an I/O request requires multiple stripe units, this means that the starting stripe unit is random and that each disk in the array is equally likely to contain the starting stripe unit of a request.

7. Parity placement is right asymmetric.

8. Since the focus of this work is the performance of the disk array, it is assumed that the disk subsystem is disk limited, i.e. the memory and data paths are fast enough to have little relative effect on the I/O request response time.
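The workload and assumptions of Sections 2.3 and 2.4 can be summarized in a short generator sketch. Only the general quasi-geometric shape of the request-size distribution is specified above, so the pmf is passed in by the caller; the concrete pmf values, rate, and names below are illustrative assumptions.

    # Sketch of the assumed transaction-processing workload (Sections 2.3-2.4):
    # Poisson arrivals, a 0.7/0.3 read/write mix, sizes of 1 .. W_s - 1 stripe
    # units drawn from a caller-supplied pmf, and a uniform starting stripe unit.
    import random

    W_S = 5          # stripe width
    N_DISKS = 20     # disks in the array (assumption 1)
    P_READ = 0.7     # read probability; writes occur with probability 0.3 (assumption 3)

    def generate_requests(rate, horizon, size_pmf, seed=0):
        """Yield (arrival_time, kind, start_stripe_unit, size) tuples.

        rate     -- Poisson arrival rate lambda, in requests per second (assumption 5)
        horizon  -- length of the generated interval, in seconds
        size_pmf -- dict {n: P(N = n)} on n = 1 .. W_S - 1 (Section 2.3 workload)
        """
        rng = random.Random(seed)
        sizes, probs = zip(*sorted(size_pmf.items()))
        t = 0.0
        while True:
            t += rng.expovariate(rate)               # exponential inter-arrival times
            if t > horizon:
                return
            kind = "read" if rng.random() < P_READ else "write"
            size = rng.choices(sizes, weights=probs)[0]
            start = rng.randrange(N_DISKS * 1000)    # uniform starting stripe unit (toy range, assumption 6)
            yield (t, kind, start, size)

    # Example use with an arbitrary pmf on sizes 1 .. W_S - 1
    example_pmf = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
    for request in generate_requests(rate=50.0, horizon=0.1, size_pmf=example_pmf):
        print(request)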

CHAPTER 3

PERFORMANCE MODEL

An important metric for disk systems is response time, or the time for an I/O request to finish after data has been requested. Since data is interleaved over several disks, an I/O request to a RAID 5 disk array results in multiple disk requests for stripe units. The time for all disk requests to complete is defined as the response time of the I/O request. This chapter will analyze and model the operations that occur during an I/O request when all disks are functioning. First, using previous work for how a disk locates and transfers data, the distribution of the time for a disk to service a request is derived. Second, the arrival process of requests from I/O requests to individual disks is considered. Third, a method for computing the mean time needed for all disk requests in an I/O request to finish is investigated.

3.1 Disk Model

To develop a model for the response time of RAID 5 I/O requests, individual disk accesses must be analyzed. Although disk behavior involves many complex electrical and mechanical interactions, three components dominate the time for a disk access [17]: seek time, rotational latency, and transfer time. Seek time is defined as the time required

for the disk arm to move to the correct cylinder. Rotational latency is the time for the required data sector to spin under the read/write head(s). Transfer time is the time to transfer the data to memory. The probability distribution (PDF) and density functions (pdf) for the time to complete a disk request can be derived based on previous results for seek time, rotational latency, and transfer time.

3.1.1 Seek Time

In considering a model for seek time, Lynch [17] observes that there is a non-negative probability that the disk arm does not move during a disk access, the sequential access probability p_s. When disk requests are scheduled on a first-come, first-served basis, he shows through empirical measurement of several disk systems that when the disk arm does move, it tends to move to any other cylinder with equal probability. Using these observations, he expresses the probability density for seek distance as the probability that the disk arm moves i cylinders. This is written as

    P{D = i} = p_s                                   i = 0
    P{D = i} = (1 − p_s) 2(C − i) / (C(C − 1))       i = 1, 2, ..., C − 1

where D is a discrete random variable representing seek distance and C is the total number of disk cylinders. To determine seek time, which is the amount of time needed for the disk arm to move i cylinders, a relationship between seek distance and seek time must be determined. Using trace data measurements of several disks, Chen and Katz [?] empirically determine a formula for seek time that is a function of seek distance and disk specifications

    s = 0               d = 0
    s = a + b√d         d > 0

where s is the seek time, d is the seek distance in cylinders, a is the arm acceleration time, and b is the seek factor of the disk. Note that if the number of disk cylinders is C, the maximum number of cylinders that the disk arm can move during a request is C − 1 and the maximum seek time (S_max) is a + b√(C − 1). To illustrate the behavior of seek time versus seek distance, the profile of a disk with the parameters in Table 2.1 is shown in Figure 3.1.

Figure 3.1: Disk Profile (seek time in seconds versus seek distance in cylinders)
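As a numerical companion to the seek model, the following sketch samples Lynch's seek distance distribution and applies s = a + b√d. The parameter values follow the Table 2.1 assumptions, and the fragment is only an illustration, not part of the thesis's model code.

    # Sketch of the seek model: Lynch's seek-distance pmf plus s = a + b*sqrt(d).
    # Parameter values are assumptions in the spirit of Table 2.1.
    import math
    import random
    from itertools import accumulate

    C = 12000       # number of disk cylinders (assumed)
    A = 0.003       # arm acceleration time a, in seconds (3 ms)
    B = 0.0005      # seek factor b (0.5 ms per square-root cylinder)
    P_S = 1.0 / C   # sequential access probability used in Section 3.1.3

    # Cumulative weights for P{D = i} proportional to (C - i), i = 1 .. C-1
    CUM_WEIGHTS = list(accumulate(C - i for i in range(1, C)))

    def sample_seek_distance(rng):
        """P{D = 0} = p_s; P{D = i} = (1 - p_s) * 2(C - i) / (C(C - 1))."""
        if rng.random() < P_S:
            return 0
        return rng.choices(range(1, C), cum_weights=CUM_WEIGHTS)[0]

    def seek_time(d):
        """s = 0 if d = 0, otherwise s = a + b * sqrt(d)."""
        return 0.0 if d == 0 else A + B * math.sqrt(d)

    rng = random.Random(1)
    samples = [seek_time(sample_seek_distance(rng)) for _ in range(10000)]
    print(sum(samples) / len(samples))       # mean seek time, in seconds
    print(A + B * math.sqrt(C - 1))          # S_max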

Using previous work for the probability density of seek distance and the relationship of seek time to seek distance, the distribution and density of seek time can be written in terms of seek distance. Since seek time is a function of the seek distance random variable, the seek time pdf is a transformation of the seek distance pdf [18]. For the general case, where Y is a function g(X) of a random variable X, the density of Y can be written as

    f_Y(y) = f_X(x_1)/|g'(x_1)| + f_X(x_2)/|g'(x_2)| + ... + f_X(x_n)/|g'(x_n)|

where x_1, x_2, ..., x_n are the real roots of y = g(x) and g'(x) is the derivative of g(x). Using this rule, the inverse of the seek time function and its derivative are

    d = ((s − a)/b)^2,    dd/ds = 2(s − a)/b^2,    d > 0

and the density of seek time S is

    f_S(s) = (1 − p_s) [2(C − ((s − a)/b)^2) / (C(C − 1))] [2(s − a)/b^2].

Since C − ((s − a)/b)^2 = (Cb^2 − s^2 + 2as − a^2)/b^2, f_S(s) can be simplified to

    f_S(s) = [4(1 − p_s) / (C(C − 1)b^4)] (Cb^2 − s^2 + 2as − a^2)(s − a)
           = [4(1 − p_s) / (C(C − 1)b^4)] (Cb^2 s − Cab^2 − s^3 + 3as^2 − 3a^2 s + a^3)

for a < s ≤ S_max, together with the atom P{S = 0} = p_s.

3.1.2 Disk Access Time

The second component in determining the amount of time to service a disk request is the rotational latency. Rotational latency is defined as the time for the disk to rotate to

the starting sector of the data requested. Under a variety of workloads and disk scheduling policies, rotational latency is commonly observed to be uniformly distributed in [0, R_max], where R_max is the time for a full disk rotation [7, 14, 19]. The pdf of the rotational latency is written as

    f_R(r) = 1/R_max,    0 ≤ r ≤ R_max.

Because the time for a disk request depends on the amount of time to locate the needed data on the disk, the pdf of the disk access time, defined as the total time to move to the starting cylinder and track of the data requested, must be determined. If the random variable X denotes disk access time, then X can be expressed as S + R, where S represents seek time and R represents rotational latency. Since the time for the disk arm to move to the correct cylinder (S) is independent of the time for the disk to spin to the correct sector (R), the probability density of X is the convolution of S and R:

    f_X(x) = f_R(r) * f_S(s) = ∫_0^{S_max} f_R(x − s) dF_S(s).

Since seek time is based on the number of cylinders that the disk arm moves, S is not a continuous random variable. Due to the discrete nature of seek time, the regions of integration for f_X(x) depend on a, S_max, and R_max. For the case where R_max ≤ b√(C − 1), which corresponds to the parameters in Table 2.1, the density of X is written as

    f_X(x) = f_R(x) p_s + ∫_{a+}^{x} f_R(x − s) dF_S(s)        0 ≤ x ≤ R_max
    f_X(x) = ∫_{a+}^{x} f_R(x − s) dF_S(s)                     R_max < x ≤ R_max + a
    f_X(x) = ∫_{x − R_max}^{x} f_R(x − s) dF_S(s)              R_max + a < x ≤ S_max
    f_X(x) = ∫_{x − R_max}^{S_max} f_R(x − s) dF_S(s)          S_max < x ≤ S_max + R_max
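The piecewise expression above can be checked numerically. The following sketch evaluates the convolution f_X = f_R * f_S on a time grid by binning the discrete seek-time distribution and summing it against the uniform rotational-latency density; it is an illustration under the Table 2.1 assumptions, not the exact evaluation carried out in Appendix A.

    # Numerical sketch of the convolution f_X = f_R * f_S of Section 3.1.2.
    # Parameters follow Table 2.1 and are treated as assumptions.
    import math

    C, A, B = 12000, 0.003, 0.0005   # cylinders, arm acceleration (s), seek factor
    R_MAX = 0.0167                   # time for a full disk rotation, in seconds
    P_S = 1.0 / C                    # sequential access probability
    DT = 1e-4                        # grid resolution, in seconds

    # Bin the seek-time distribution onto the grid:
    # P{S = 0} = p_s, P{S = a + b*sqrt(i)} = (1 - p_s) * 2(C - i) / (C(C - 1))
    s_max = A + B * math.sqrt(C - 1)
    n_bins = int((s_max + R_MAX) / DT) + 2
    seek_bins = [0.0] * n_bins
    seek_bins[0] = P_S
    for i in range(1, C):
        s = A + B * math.sqrt(i)
        seek_bins[int(s / DT)] += (1 - P_S) * 2 * (C - i) / (C * (C - 1))

    # Convolve with the uniform rotational latency: f_X(x) = sum_s P{S=s} * f_R(x - s)
    rot_bins = int(R_MAX / DT)
    f_x = [sum(seek_bins[j] for j in range(max(0, k - rot_bins), k + 1)) / R_MAX
           for k in range(n_bins)]

    print(sum(f * DT for f in f_x))   # the density should integrate to approximately 1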

Depending on the region to which a particular value of the disk access time belongs, the corresponding density value can be computed. Evaluation of each integral expression is shown in Appendix A.

3.1.3 Disk Service Time

Once the track and sector containing the data have been located on the disk, the final component of a disk request is the time to transfer the data from disk to main memory. The transfer time T for a single block of data equals the block size in bytes divided by the transfer rate in bytes per second. Thus, the time for a disk to locate and transfer a single block of data, or disk service time, is Y = X + T. Since each RAID 5 disk request consists of transferring a stripe unit of fixed size, the transfer time for each disk request is constant. Because the transfer time for each disk request is constant, the transfer time shifts the disk service time pdf but does not change the shape of the density. Figure 3.2 shows the pdf of the disk service time for a 4 KB stripe unit given different values of p_s and the parameters listed in Table 2.1. Note that seek time and rotational latency are much greater than the transfer time for a disk request. As p_s increases, the rotational latency dominates the disk service time of the request. If the disk arm does not move, rotational latency is effectively the only component of the time for a disk request, and the density of disk service time equals the uniform density of rotational latency. This is illustrated in Figure 3.2. When p_s = 1/C, where C is the number of disk cylinders, the probability that the disk arm does not move equals the probability of moving to any other cylinder. For this case,

the function is continuous. The graph of the disk service time density in Figure 3.2 is similar to trace data measurements for several disk systems observed in [8].

Figure 3.2: Disk Service Time Density for Different Values of p_s (curves for p_s = 1/C, 0.1, 0.3, 0.5, 0.7, and 0.9, with R_max = 0.0167 seconds; x-axis: disk service time in seconds, y-axis: pdf of disk service time)

To determine the disk service time density for a disk in a RAID 5 system under a transaction-processing workload, the locations of data between successive requests must be considered. Because of the assumptions that the starting stripe unit of an I/O request is random and that requests for a disk are serviced in first come, first served order, the data accessed between successive requests will tend to be scattered across the disk. Therefore, during a request, the disk arm tends to move to any other cylinder, or not move at all, with equal probability. This is equivalent to the case where p_s = 1/C. When the sequential access probability p_s equals 1/C, the pdf of disk service time can be approximated by an Erlang density of order k and mean μ. Figure 3.3 shows an

optimized fit of the Erlang pdf to the actual pdf using the least squares curve fitting method. By shifting the mean slightly, the peak can be more closely matched (peak fit), while sacrificing the error on the right side of the curve. The parameters of the Erlang density for this case are an order (k) of 8 and a mean (μ) of 0.042 seconds. This pdf will be used in the analytical models developed in the following sections.

Figure 3.3: Erlang Approximation for Disk Service Time Density (actual density for p_s = 1/C compared with the least squares and peak fit Erlang densities; x-axis: disk service time in seconds, y-axis: pdf of disk service time)

In contrast, the simulator described in Appendix F models the actual behavior of a disk during a request. First, the time to move to the cylinder where the needed stripe unit is located is computed. Second, the time for the disk to spin to the correct data sector from the current position is calculated. To determine the actual disk service time, both of these quantities are added to the fixed time to transfer the stripe unit. In this manner, the actual disk behavior is modeled and can provide an accurate comparison to the probabilistic expressions developed above.
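The Erlang approximation can be illustrated with a small curve-fitting sketch: simulated service times (seek with p_s = 1/C, plus rotational latency and the fixed 4 KB transfer) are histogrammed and an Erlang order is chosen by a crude least squares search. All parameter values and names are assumptions; this is not the fit actually performed in the thesis.

    # Sketch of fitting an Erlang density to simulated disk service times (illustrative).
    import math
    import random
    from itertools import accumulate

    C, A, B = 12000, 0.003, 0.0005     # Table 2.1 parameters (assumed)
    R_MAX, RATE = 0.0167, 3e6          # rotation time (s) and transfer rate (bytes/s)
    TRANSFER = 4096 / RATE             # fixed 4 KB stripe unit transfer time
    CUM = list(accumulate(C - i for i in range(1, C)))

    def sample_service_time(rng):
        """Disk service time Y = seek + rotational latency + transfer, with p_s = 1/C."""
        if rng.random() < 1.0 / C:
            seek = 0.0
        else:
            d = rng.choices(range(1, C), cum_weights=CUM)[0]
            seek = A + B * math.sqrt(d)
        return seek + rng.uniform(0.0, R_MAX) + TRANSFER

    def erlang_pdf(y, k, mean):
        lam = k / mean
        return lam ** k * y ** (k - 1) * math.exp(-lam * y) / math.factorial(k - 1)

    rng = random.Random(3)
    ys = [sample_service_time(rng) for _ in range(20000)]
    mean = sum(ys) / len(ys)

    # Empirical histogram of the service time density
    bins, lo, hi = 40, min(ys), max(ys)
    width = (hi - lo) / bins
    counts = [0] * bins
    for y in ys:
        counts[min(int((y - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(ys) * width) for c in counts]

    # Least squares search over the Erlang order k, holding the mean at the sample mean
    errors = {k: sum((erlang_pdf(y, k, mean) - d) ** 2 for y, d in zip(centers, density))
              for k in range(2, 21)}
    best_k = min(errors, key=errors.get)
    print("fitted Erlang order k =", best_k, " sample mean =", round(mean, 4))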

With an understanding of the time to service a single disk request, the arrival process of requests resulting from an I/O request to individual disks in the array is investigated in the next section.

3.2 Disk Arrival Process

Given the assumption that the arrival of I/O requests is a Poisson process, it is important to characterize the arrival process of the resulting disk requests at individual disks in the array. This section will illustrate how groups of disks and individual disks are accessed during an I/O request. Let {N(t) | t ≥ 0} be the Poisson arrival process of I/O requests to the disk array and {N_k(t) | t ≥ 0}, 1 ≤ k ≤ n, be n output processes, where each output process counts the I/O requests that access the k-th group of disks; p_k is the probability that the k-th group of disks is accessed during an I/O request. Because only one of the n possible groups of disks may be accessed for each I/O request, the groups of disks accessed during successive I/O requests form a sequence of generalized Bernoulli trials. The conditional distribution that m_k is the number of I/O requests that access group k, given that there are m I/O requests in the time interval (0, t], is described by the multinomial distribution

    P{N_1(t) = m_1, N_2(t) = m_2, ..., N_n(t) = m_n | N(t) = m}
        = [m! / (m_1! m_2! ... m_n!)] p_1^{m_1} p_2^{m_2} ... p_n^{m_n}

where Σ_{k=1}^{n} p_k = 1 and Σ_{k=1}^{n} m_k = m. Multiplying by the probability of m I/O requests in (0, t], the probability mass function that m_1 requests access group 1, m_2 requests access group 2, ..., m_n requests access group n in (0, t] is

    P{N_1(t) = m_1, N_2(t) = m_2, ..., N_n(t) = m_n}
        = [m! / (m_1! m_2! ... m_n!)] p_1^{m_1} p_2^{m_2} ... p_n^{m_n} e^{−λt} (λt)^m / m!
        = ∏_{k=1}^{n} e^{−p_k λ t} (p_k λ t)^{m_k} / m_k!

This result shows that the arrivals of requests to groups of disks, N_1(t), N_2(t), ..., N_n(t), are mutually independent and are Poisson with parameters p_1 λ, p_2 λ, ..., p_n λ. Therefore, when the arrival of I/O requests to a RAID 5 disk array is Poisson with rate λ, arrivals to groups of disks are Poisson with rates p_k λ, where p_k is the probability that group k of disks is accessed during an I/O request. Furthermore, because a disk is part of different groups that can be accessed during a request, the superposition of Poisson group arrivals results in Poisson arrivals at each individual disk. The arrival rate of disk requests at disk j is p_j λ, where p_j is the probability that disk j, 1 ≤ j ≤ N, is accessed during an I/O request and N is the number of disks in the array. The following example illustrates how Poisson I/O requests result in Poisson requests to groups and individual disks. Consider an array of 4 disks and an I/O request rate of λ, as shown in Figure 3.4. If requests access disks 1 and 2, disks 2 and 3, and disks 3 and 4 with probabilities p_1 = 0.25, p_2 = 0.5, and p_3 = 0.25 (Σ_{i=1}^{3} p_i = 1), arrivals to disks 1 and 2, 2 and 3, and 3 and 4 are Poisson with rates 0.25λ, 0.5λ, and 0.25λ. Since arrivals to groups of disks are Poisson, arrivals to disks 1, 2, 3, and 4 are Poisson with rates 0.25λ (= p_1 λ), 0.75λ (= (p_1 + p_2)λ), 0.75λ (= (p_2 + p_3)λ), and 0.25λ (= p_3 λ) by superposition.
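The splitting and superposition argument can be checked with a short simulation of the 4-disk example: thin a Poisson stream of I/O requests into the three groups and count arrivals per disk. The observed per-disk rates should approach 0.25λ, 0.75λ, 0.75λ, and 0.25λ; the specific λ and the names used here are assumptions.

    # Simulation sketch of Poisson splitting and superposition for the 4-disk example.
    import random

    LAMBDA = 100.0      # I/O request arrival rate, requests per second (assumed)
    HORIZON = 200.0     # simulated interval, in seconds
    GROUPS = [((1, 2), 0.25), ((2, 3), 0.50), ((3, 4), 0.25)]   # groups and probabilities p_k

    def simulate(seed=0):
        rng = random.Random(seed)
        counts = {disk: 0 for disk in (1, 2, 3, 4)}
        t = 0.0
        while True:
            t += rng.expovariate(LAMBDA)            # Poisson arrivals: exponential gaps
            if t > HORIZON:
                break
            group = rng.choices([g for g, _ in GROUPS],
                                weights=[p for _, p in GROUPS])[0]
            for disk in group:                      # the I/O request forks to every disk in the group
                counts[disk] += 1
        return {disk: count / HORIZON for disk, count in counts.items()}

    # Expected per-disk rates: 0.25*lambda for disks 1 and 4, 0.75*lambda for disks 2 and 3
    print(simulate())    # approximately {1: 25.0, 2: 75.0, 3: 75.0, 4: 25.0}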


More information

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 485.e1 5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks Amdahl s law in Chapter 1 reminds us that

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 9: Mass Storage Structure Prof. Alan Mislove (amislove@ccs.neu.edu) Moving-head Disk Mechanism 2 Overview of Mass Storage Structure Magnetic

More information

Chapter 12: Mass-Storage

Chapter 12: Mass-Storage Chapter 12: Mass-Storage Systems Chapter 12: Mass-Storage Systems Revised 2010. Tao Yang Overview of Mass Storage Structure Disk Structure Disk Attachment Disk Scheduling Disk Management Swap-Space Management

More information

CISC 7310X. C11: Mass Storage. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 4/19/2018 CUNY Brooklyn College

CISC 7310X. C11: Mass Storage. Hui Chen Department of Computer & Information Science CUNY Brooklyn College. 4/19/2018 CUNY Brooklyn College CISC 7310X C11: Mass Storage Hui Chen Department of Computer & Information Science CUNY Brooklyn College 4/19/2018 CUNY Brooklyn College 1 Outline Review of memory hierarchy Mass storage devices Reliability

More information

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual

More information

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File File File System Implementation Operating Systems Hebrew University Spring 2007 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

COMP283-Lecture 3 Applied Database Management

COMP283-Lecture 3 Applied Database Management COMP283-Lecture 3 Applied Database Management Introduction DB Design Continued Disk Sizing Disk Types & Controllers DB Capacity 1 COMP283-Lecture 3 DB Storage: Linear Growth Disk space requirements increases

More information

Database Systems. November 2, 2011 Lecture #7. topobo (mit)

Database Systems. November 2, 2011 Lecture #7. topobo (mit) Database Systems November 2, 2011 Lecture #7 1 topobo (mit) 1 Announcement Assignment #2 due today Assignment #3 out today & due on 11/16. Midterm exam in class next week. Cover Chapters 1, 2,

More information

V. Mass Storage Systems

V. Mass Storage Systems TDIU25: Operating Systems V. Mass Storage Systems SGG9: chapter 12 o Mass storage: Hard disks, structure, scheduling, RAID Copyright Notice: The lecture notes are mainly based on modifications of the slides

More information

Lecture 9. I/O Management and Disk Scheduling Algorithms

Lecture 9. I/O Management and Disk Scheduling Algorithms Lecture 9 I/O Management and Disk Scheduling Algorithms 1 Lecture Contents 1. I/O Devices 2. Operating System Design Issues 3. Disk Scheduling Algorithms 4. RAID (Redundant Array of Independent Disks)

More information

Storing Data: Disks and Files

Storing Data: Disks and Files Storing Data: Disks and Files Chapter 7 (2 nd edition) Chapter 9 (3 rd edition) Yea, from the table of my memory I ll wipe away all trivial fond records. -- Shakespeare, Hamlet Database Management Systems,

More information

An Introduction to RAID

An Introduction to RAID Intro An Introduction to RAID Gursimtan Singh Dept. of CS & IT Doaba College RAID stands for Redundant Array of Inexpensive Disks. RAID is the organization of multiple disks into a large, high performance

More information

Chapter 10: Mass-Storage Systems

Chapter 10: Mass-Storage Systems Chapter 10: Mass-Storage Systems Silberschatz, Galvin and Gagne Overview of Mass Storage Structure Magnetic disks provide bulk of secondary storage of modern computers Drives rotate at 60 to 200 times

More information

The term "physical drive" refers to a single hard disk module. Figure 1. Physical Drive

The term physical drive refers to a single hard disk module. Figure 1. Physical Drive HP NetRAID Tutorial RAID Overview HP NetRAID Series adapters let you link multiple hard disk drives together and write data across them as if they were one large drive. With the HP NetRAID Series adapter,

More information

Chapter 14: Mass-Storage Systems. Disk Structure

Chapter 14: Mass-Storage Systems. Disk Structure 1 Chapter 14: Mass-Storage Systems Disk Structure Disk Scheduling Disk Management Swap-Space Management RAID Structure Disk Attachment Stable-Storage Implementation Tertiary Storage Devices Operating System

More information

Chapter 13: Mass-Storage Systems. Disk Scheduling. Disk Scheduling (Cont.) Disk Structure FCFS. Moving-Head Disk Mechanism

Chapter 13: Mass-Storage Systems. Disk Scheduling. Disk Scheduling (Cont.) Disk Structure FCFS. Moving-Head Disk Mechanism Chapter 13: Mass-Storage Systems Disk Scheduling Disk Structure Disk Scheduling Disk Management Swap-Space Management RAID Structure Disk Attachment Stable-Storage Implementation Tertiary Storage Devices

More information

Chapter 13: Mass-Storage Systems. Disk Structure

Chapter 13: Mass-Storage Systems. Disk Structure Chapter 13: Mass-Storage Systems Disk Structure Disk Scheduling Disk Management Swap-Space Management RAID Structure Disk Attachment Stable-Storage Implementation Tertiary Storage Devices Operating System

More information

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage CSCI-GA.2433-001 Database Systems Lecture 8: Physical Schema: Storage Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com View 1 View 2 View 3 Conceptual Schema Physical Schema 1. Create a

More information

CSE 380 Computer Operating Systems

CSE 380 Computer Operating Systems CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note on Disk I/O 1 I/O Devices Storage devices Floppy, Magnetic disk, Magnetic tape, CD-ROM, DVD User

More information

Mladen Stefanov F48235 R.A.I.D

Mladen Stefanov F48235 R.A.I.D R.A.I.D Data is the most valuable asset of any business today. Lost data, in most cases, means lost business. Even if you backup regularly, you need a fail-safe way to ensure that your data is protected

More information

Module 13: Secondary-Storage Structure

Module 13: Secondary-Storage Structure Module 13: Secondary-Storage Structure Disk Structure Disk Scheduling Disk Management Swap-Space Management Disk Reliability Stable-Storage Implementation Operating System Concepts 13.1 Silberschatz and

More information

Distributed Video Systems Chapter 5 Issues in Video Storage and Retrieval Part 2 - Disk Array and RAID

Distributed Video Systems Chapter 5 Issues in Video Storage and Retrieval Part 2 - Disk Array and RAID Distributed Video ystems Chapter 5 Issues in Video torage and Retrieval art 2 - Disk Array and RAID Jack Yiu-bun Lee Department of Information Engineering The Chinese University of Hong Kong Contents 5.1

More information

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Chapter 10: Mass-Storage Systems. Operating System Concepts 9 th Edition

Chapter 10: Mass-Storage Systems. Operating System Concepts 9 th Edition Chapter 10: Mass-Storage Systems Silberschatz, Galvin and Gagne 2013 Objectives To describe the physical structure of secondary storage devices and its effects on the uses of the devices To explain the

More information

IBM i Version 7.3. Systems management Disk management IBM

IBM i Version 7.3. Systems management Disk management IBM IBM i Version 7.3 Systems management Disk management IBM IBM i Version 7.3 Systems management Disk management IBM Note Before using this information and the product it supports, read the information in

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Performance analysis of disk mirroring techniques

Performance analysis of disk mirroring techniques Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 3-28-1994 Performance analysis of disk mirroring techniques Taysir Abdalla Florida

More information

Disk Scheduling. Based on the slides supporting the text

Disk Scheduling. Based on the slides supporting the text Disk Scheduling Based on the slides supporting the text 1 User-Space I/O Software Layers of the I/O system and the main functions of each layer 2 Disk Structure Disk drives are addressed as large 1-dimensional

More information

Disks and I/O Hakan Uraz - File Organization 1

Disks and I/O Hakan Uraz - File Organization 1 Disks and I/O 2006 Hakan Uraz - File Organization 1 Disk Drive 2006 Hakan Uraz - File Organization 2 Tracks and Sectors on Disk Surface 2006 Hakan Uraz - File Organization 3 A Set of Cylinders on Disk

More information

Principles of Data Management. Lecture #2 (Storing Data: Disks and Files)

Principles of Data Management. Lecture #2 (Storing Data: Disks and Files) Principles of Data Management Lecture #2 (Storing Data: Disks and Files) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Topics v Today

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Lecture 22: File system optimizations and advanced topics There s more to filesystems J Standard Performance improvement techniques Alternative important

More information

Database Management Systems, 2nd edition, Raghu Ramakrishnan, Johannes Gehrke, McGraw-Hill

Database Management Systems, 2nd edition, Raghu Ramakrishnan, Johannes Gehrke, McGraw-Hill Lecture Handout Database Management System Lecture No. 34 Reading Material Database Management Systems, 2nd edition, Raghu Ramakrishnan, Johannes Gehrke, McGraw-Hill Modern Database Management, Fred McFadden,

More information

Chapter 6. Storage and Other I/O Topics

Chapter 6. Storage and Other I/O Topics Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Silberschatz, et al. Topics based on Chapter 13

Silberschatz, et al. Topics based on Chapter 13 Silberschatz, et al. Topics based on Chapter 13 Mass Storage Structure CPSC 410--Richard Furuta 3/23/00 1 Mass Storage Topics Secondary storage structure Disk Structure Disk Scheduling Disk Management

More information

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010 College of Computer & Information Science Spring 21 Northeastern University 12 March 21 CS 76: Intensive Computer Systems Scribe: Dimitrios Kanoulas Lecture Outline: Disk Scheduling NAND Flash Memory RAID:

More information

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material

More information

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907 The Game of Clustering Rowena Cole and Luigi Barone Department of Computer Science, The University of Western Australia, Western Australia, 697 frowena, luigig@cs.uwa.edu.au Abstract Clustering is a technique

More information

I/O Systems and Storage Devices

I/O Systems and Storage Devices CSC 256/456: Operating Systems I/O Systems and Storage Devices John Criswell! University of Rochester 1 I/O Device Controllers I/O devices have both mechanical component & electronic component! The electronic

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Chapter 14: Mass-Storage Systems

Chapter 14: Mass-Storage Systems Chapter 14: Mass-Storage Systems Disk Structure Disk Scheduling Disk Management Swap-Space Management RAID Structure Disk Attachment Stable-Storage Implementation Tertiary Storage Devices Operating System

More information

NAS System. User s Manual. Revision 1.0

NAS System. User s Manual. Revision 1.0 User s Manual Revision 1.0 Before You Begin efore going through with this manual, you should read and focus on the following safety guidelines. Information about the NAS system s packaging and delivery

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Readings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important

Readings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

CSCI-GA Operating Systems. I/O : Disk Scheduling and RAID. Hubertus Franke

CSCI-GA Operating Systems. I/O : Disk Scheduling and RAID. Hubertus Franke CSCI-GA.2250-001 Operating Systems I/O : Disk Scheduling and RAID Hubertus Franke frankeh@cs.nyu.edu Disks Scheduling Abstracted by OS as files A Conventional Hard Disk (Magnetic) Structure Hard Disk

More information

IBM. Systems management Disk management. IBM i 7.1

IBM. Systems management Disk management. IBM i 7.1 IBM IBM i Systems management Disk management 7.1 IBM IBM i Systems management Disk management 7.1 Note Before using this information and the product it supports, read the information in Notices, on page

More information

Chapter 12: Mass-Storage

Chapter 12: Mass-Storage hapter 12: Mass-Storage Systems hapter 12: Mass-Storage Systems To explain the performance characteristics of mass-storage devices To evaluate disk scheduling algorithms To discuss operating-system services

More information

Clustering and Reclustering HEP Data in Object Databases

Clustering and Reclustering HEP Data in Object Databases Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications

More information

Chapter 12: Mass-Storage

Chapter 12: Mass-Storage hapter 12: Mass-Storage Systems hapter 12: Mass-Storage Systems Overview of Mass Storage Structure Disk Structure Disk Attachment Disk Scheduling Disk Management RAID Structure Objectives Moving-head Disk

More information

CSE380 - Operating Systems. Communicating with Devices

CSE380 - Operating Systems. Communicating with Devices CSE380 - Operating Systems Notes for Lecture 15-11/4/04 Matt Blaze (some examples by Insup Lee) Communicating with Devices Modern architectures support convenient communication with devices memory mapped

More information

Lecture 23. Finish-up buses Storage

Lecture 23. Finish-up buses Storage Lecture 23 Finish-up buses Storage 1 Example Bus Problems, cont. 2) Assume the following system: A CPU and memory share a 32-bit bus running at 100MHz. The memory needs 50ns to access a 64-bit value from

More information

Analysis of Striping Techniques in Robotic. Leana Golubchik y Boelter Hall, Graduate Student Oce

Analysis of Striping Techniques in Robotic. Leana Golubchik y Boelter Hall, Graduate Student Oce Analysis of Striping Techniques in Robotic Storage Libraries Abstract Leana Golubchik y 3436 Boelter Hall, Graduate Student Oce UCLA Computer Science Department Los Angeles, CA 90024-1596 (310) 206-1803,

More information

Performance Evaluation of Two New Disk Scheduling Algorithms. for Real-Time Systems. Department of Computer & Information Science

Performance Evaluation of Two New Disk Scheduling Algorithms. for Real-Time Systems. Department of Computer & Information Science Performance Evaluation of Two New Disk Scheduling Algorithms for Real-Time Systems Shenze Chen James F. Kurose John A. Stankovic Don Towsley Department of Computer & Information Science University of Massachusetts

More information

Disk Scheduling. Chapter 14 Based on the slides supporting the text and B.Ramamurthy s slides from Spring 2001

Disk Scheduling. Chapter 14 Based on the slides supporting the text and B.Ramamurthy s slides from Spring 2001 Disk Scheduling Chapter 14 Based on the slides supporting the text and B.Ramamurthy s slides from Spring 2001 1 User-Space I/O Software Layers of the I/O system and the main functions of each layer 2 Disks

More information

Operating Systems 2010/2011

Operating Systems 2010/2011 Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem

More information

Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management

Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management Lecture Overview Mass storage devices Disk scheduling Disk reliability Tertiary storage Swap space management Linux swap space management Operating Systems - June 28, 2001 Disk Structure Disk drives are

More information

CONFIGURING ftscalable STORAGE ARRAYS ON OpenVOS SYSTEMS

CONFIGURING ftscalable STORAGE ARRAYS ON OpenVOS SYSTEMS Best Practices CONFIGURING ftscalable STORAGE ARRAYS ON OpenVOS SYSTEMS Best Practices 2 Abstract ftscalable TM Storage G1, G2 and G3 arrays are highly flexible, scalable hardware storage subsystems that

More information

Technical Note P/N REV A01 March 29, 2007

Technical Note P/N REV A01 March 29, 2007 EMC Symmetrix DMX-3 Best Practices Technical Note P/N 300-004-800 REV A01 March 29, 2007 This technical note contains information on these topics: Executive summary... 2 Introduction... 2 Tiered storage...

More information

Tape Group Parity Protection

Tape Group Parity Protection Tape Group Parity Protection Theodore Johnson johnsont@research.att.com AT&T Laboratories Florham Park, NJ Sunil Prabhakar sunil@cs.purdue.edu Purdue University West Lafayette, IN Abstract We propose a

More information

Tape pictures. CSE 30341: Operating Systems Principles

Tape pictures. CSE 30341: Operating Systems Principles Tape pictures 4/11/07 CSE 30341: Operating Systems Principles page 1 Tape Drives The basic operations for a tape drive differ from those of a disk drive. locate positions the tape to a specific logical

More information

Module 13: Secondary-Storage

Module 13: Secondary-Storage Module 13: Secondary-Storage Disk Structure Disk Scheduling Disk Management Swap-Space Management Disk Reliability Stable-Storage Implementation Tertiary Storage Devices Operating System Issues Performance

More information

Chapter 6 External Memory

Chapter 6 External Memory Chapter 6 External Memory Magnetic Disk Removable RAID Disk substrate coated with magnetizable material (iron oxide rust) Substrate used to be aluminium Now glass Improved surface uniformity Increases

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 13

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 13 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 13 COMPUTER MEMORY So far, have viewed computer memory in a very simple way Two memory areas in our computer: The register file Small number

More information

A Disk Head Scheduling Simulator

A Disk Head Scheduling Simulator A Disk Head Scheduling Simulator Steven Robbins Department of Computer Science University of Texas at San Antonio srobbins@cs.utsa.edu Abstract Disk head scheduling is a standard topic in undergraduate

More information

CMSC 424 Database design Lecture 12 Storage. Mihai Pop

CMSC 424 Database design Lecture 12 Storage. Mihai Pop CMSC 424 Database design Lecture 12 Storage Mihai Pop Administrative Office hours tomorrow @ 10 Midterms are in solutions for part C will be posted later this week Project partners I have an odd number

More information

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected 1. Introduction Traditionally, a high bandwidth file system comprises a supercomputer with disks connected by a high speed backplane bus such as SCSI [3][4] or Fibre Channel [2][67][71]. These systems

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

CS-736 Midterm: Beyond Compare (Spring 2008)

CS-736 Midterm: Beyond Compare (Spring 2008) CS-736 Midterm: Beyond Compare (Spring 2008) An Arpaci-Dusseau Exam Please Read All Questions Carefully! There are eight (8) total numbered pages Please put your NAME ONLY on this page, and your STUDENT

More information