Distributed Video Systems Chapter 5 Issues in Video Storage and Retrieval Part 2 - Disk Array and RAID


Distributed Video Systems
Chapter 5: Issues in Video Storage and Retrieval, Part 2 - Disk Array and RAID
Jack Yiu-bun Lee, Department of Information Engineering, The Chinese University of Hong Kong

Contents
5.1 Disk Array Basics
5.2 Redundant Array of Inexpensive Disks
5.3 Performance and Cost Comparisons
5.4 Reliability
5.5 Implementation Considerations
5.6 Improving Small Write Performance for RAID 5
5.7 Declustered Parity
5.8 Exploiting On-Line Spare Disks

5.1 Disk Array Basics

Data Striping - Spindle-synchronized mode
[Figure: blocks b0..b14 of Movie A striped across disks D0-D2; each logical stripe B_i spans all three disks.]
Higher transfer rate.

Data Striping - Split-schedule mode
[Figure: the same striped layout, but each logical unit is served by a single disk.]
Lower queueing delay.
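To make the striping idea concrete, here is a minimal Python sketch (not from the slides) of round-robin block striping, assuming blocks are assigned to disks in simple round-robin order as in the figures above.

```python
# Minimal sketch of round-robin data striping (assumed layout, matching the
# three-disk figure): logical block i lives on disk i mod N at offset i div N.

N_DISKS = 3

def locate(logical_block, n_disks=N_DISKS):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % n_disks, logical_block // n_disks

# Spindle-synchronized mode: one logical stripe spans all disks (b0, b1, b2, ...).
print([locate(b) for b in range(6)])
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```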

5.1 Disk Array Basics (cont.)

Reliability
While a disk array can break the disk I/O bottleneck, overall reliability becomes a problem: a single disk failure leads to complete system failure.
[Figure: Movie A striped across the array; one disk failure renders the whole movie unreadable.]
Solution: redundant disk arrays.

5.2 Redundant Array of Inexpensive Disks

History
First proposed by Patterson et al. in 1988, defining RAID levels 1 to 5.

Principles
Compute redundant data from the existing data so that data lost on a failed disk can be recovered by computation.
A method for computing the redundant data is needed, e.g. parity, Hamming codes, or Reed-Solomon codes.
A method for distributing the redundant data is needed: centralized or distributed.

RAID Organizations

RAID Level 0 (Non-Redundant)
[Figure: blocks b0..b14 striped across disks D0-D2 with no redundancy.]
Best write performance (why?), but not the best read performance (why?).
Widely used in supercomputing environments where performance and capacity are more important than reliability.

RAID Level 1 (Mirrored)
[Figure: each block b0..b4 stored on both disks D0 and D1.]
100% storage overhead.
Best read performance (why?).
Widely used in database applications where availability and transaction rate are more important than storage efficiency.

RAID Level 2 (Memory-Style ECC)
[Figure: data disks D0-D3 plus ECC disks R0-R2.]
Hamming coded, <100% storage overhead; more disks -> lower storage overhead.
Requires spindle synchronization.
Disk-failure detection is not needed.
Unpopular.

RAID Level 3 (Bit-Interleaved Parity)
[Figure: data disks D0-D3 plus one parity disk R0.]
The parity disk is not used for reading during normal operation.
Parity coded, storage overhead = 1/(N-1).
Requires spindle synchronization.
Disk-failure detection is needed.
Tolerates a single-disk failure.
Suitable for high-bandwidth applications.

RAID Level 4 (Block-Interleaved Parity)

  D0   D1   D2   D3   R0      Parity Calculation
  b0   b1   b2   b3   p0      p0 = b0 ⊕ b1 ⊕ b2 ⊕ b3
  b4   b5   b6   b7   p1      p1 = b4 ⊕ b5 ⊕ b6 ⊕ b7
  b8   b9   b10  b11  p2      p2 = b8 ⊕ b9 ⊕ b10 ⊕ b11
  b12  b13  b14  b15  p3      p3 = b12 ⊕ b13 ⊕ b14 ⊕ b15
  b16  b17  b18  b19  p4      p4 = b16 ⊕ b17 ⊕ b18 ⊕ b19

Each b_i is a stripe unit; each p_i is a parity unit stored on the dedicated parity disk R0.
Parity coded, storage overhead = 1/(N-1).
Disk-failure detection is needed.
Tolerates a single-disk failure.

Small reads
Involve only one of the disks, so split-scheduling is desirable in this case.

Large reads
Span more disks; reading in parallel improves the transfer rate.

Large writes
Span all disks; the parity can be computed from the new data alone.

Small writes
Span a single disk and require a read-modify-write (see the sketch below):
Read the old data unit;
Read the old parity unit;
Compute the new parity from the old data, the new data, and the old parity;
Write the new data unit;
Write the new parity unit.
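The following Python sketch (my illustration, not part of the slides) shows the XOR arithmetic behind block-interleaved parity: full-stripe parity generation, recovery of a lost stripe unit, and the small-write update rule new_parity = old_parity ⊕ old_data ⊕ new_data.

```python
# Minimal sketch (not from the slides): block-interleaved parity with XOR.
# Assumes each stripe unit is a bytes object of equal length.

def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# Full-stripe parity: p = b0 ^ b1 ^ b2 ^ b3
stripe = [bytes([i] * 4) for i in range(4)]
parity = xor_blocks(*stripe)

# Recovery after losing one unit: XOR the survivors with the parity.
lost = 2
recovered = xor_blocks(*(stripe[i] for i in range(4) if i != lost), parity)
assert recovered == stripe[lost]

# Small write (read-modify-write): new_parity = old_parity ^ old_data ^ new_data
new_data = bytes([0xFF] * 4)
new_parity = xor_blocks(parity, stripe[lost], new_data)
assert new_parity == xor_blocks(*(stripe[i] for i in range(4) if i != lost), new_data)
```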

Small writes
Four I/Os are needed for a single small write, and two of them are performed on the parity disk. Since there is only one parity disk, it often becomes the bottleneck for writes.

Reads after a disk failure
Large reads: read whole stripes and use the parity to recover the units on the failed disk.
Small reads: still need to read the whole stripe in case the needed unit is on the failed disk.

RAID Level 5 (Block-Interleaved Distributed Parity)

  D0   D1   D2   D3   D4      Parity Calculation
  b0   b1   b2   b3   p0      p0 = b0 ⊕ b1 ⊕ b2 ⊕ b3
  b5   b6   b7   p1   b4      p1 = b4 ⊕ b5 ⊕ b6 ⊕ b7
  b10  b11  p2   b8   b9      p2 = b8 ⊕ b9 ⊕ b10 ⊕ b11
  b15  p3   b12  b13  b14     p3 = b12 ⊕ b13 ⊕ b14 ⊕ b15
  p4   b16  b17  b18  b19     p4 = b16 ⊕ b17 ⊕ b18 ⊕ b19

Left-Symmetric Parity Placement
Parity units are distributed across all disks, removing the parity-disk bottleneck.
Read performance is better than RAID Level 4. (Why?)
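One compact way to express the left-symmetric placement above is the mapping sketched below; the exact formula is my assumption, but it reproduces the slide's layout for N = 5 disks.

```python
# Minimal sketch (assumption, not from the slides): generating the
# left-symmetric parity placement shown above for an N-disk RAID 5.

N = 5  # number of disks

def stripe_layout(stripe, n=N):
    """Return the list of units stored on disks 0..n-1 for one stripe."""
    parity_disk = (n - 1) - (stripe % n)        # parity rotates right-to-left
    row = [None] * n
    row[parity_disk] = f"p{stripe}"
    for j in range(n - 1):                      # data units follow the parity
        disk = (parity_disk + 1 + j) % n
        row[disk] = f"b{stripe * (n - 1) + j}"
    return row

for s in range(5):
    print(stripe_layout(s))
# Reproduces the table above: p0 on D4, p1 on D3, ..., p4 on D0.
```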

RAID Level 6 (P+Q Redundancy)
Similar to RAID 5: distributed parity, block-interleaved, read-modify-write.
Differences: uses Reed-Solomon codes to protect against double-disk failure, using two more disks' worth of redundancy; small writes require 6 I/Os instead of 4.

5.3 Performance and Cost Comparisons

Similarity of different RAID levels
RAID 5 with 2 disks is very similar to RAID 1; the differences are mainly in implementation, e.g. parity versus replica, scheduling algorithms, and disk layout optimizations.
[Figure: a two-disk RAID Level 5 beside a RAID Level 1 mirror.]

Similarity of different RAID levels (cont.)
RAID 5 with a stripe unit much smaller than the average request size is very similar to RAID 3, because a request then always involves all disks in the array; only the parity placement differs.

In Practice
RAID 2 and RAID 4 are usually inferior to RAID 5, and RAID 5 can simulate RAID 1 and RAID 3. So in most cases the problem is just selecting suitable stripe sizes and parity group sizes for RAID 5.

5.4 Reliability

Device Reliability Metrics
MTTF: Mean Time To Failure (usually measured in hours)
MTTR: Mean Time To Repair (ditto)
MTBF: Mean Time Between Failures (ditto)
[Figure: a device alternates between an operational period (MTTF) and a repair period (MTTR).]

RAID Level 5 MTBF

  MTBF_array = (MTTF_disk)^2 / (N × (G - 1) × MTTR_disk)

where MTTF_disk is the single-disk MTTF, N is the number of disks, and G is the parity group size.

Example: N = 100, G = 16, MTTF_disk = 200,000 hrs, MTTR_disk = 1 hr gives MTBF_array ≈ 3,000 years.
So a RAID can in fact be more reliable than a single disk!

System Crashes and Parity Inconsistency

System crashes
Power failure, operator error, hardware breakdown, etc.
Implications
Disk I/O operations can be interrupted; write operations are affected because a data block may be updated without its parity block, or vice versa.
Consequences
The data block being written could be corrupted.
The associated parity block could become inconsistent and can no longer be used to recover lost data in case of a disk failure. (More serious, why?)
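A quick numeric check of the slide's example, using the formula above (a Python sketch with the slide's figures):

```python
# Worked check of the RAID 5 MTBF formula
#   MTBF_array = MTTF_disk**2 / (N * (G - 1) * MTTR_disk)

N = 100              # disks in the array
G = 16               # parity group size
MTTF_disk = 200_000  # hours
MTTR_disk = 1        # hour

mtbf_array_hours = MTTF_disk**2 / (N * (G - 1) * MTTR_disk)
print(mtbf_array_hours / (24 * 365))   # ~3,044 years, i.e. roughly 3,000 years
```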

For Bit-Interleaved RAID Levels (e.g. RAID 3)
Inconsistency only affects the data being written, because a stripe is small and a write operation usually spans many stripes.
[Figure: a write in progress; some stripes already updated, some not yet updated.]
Nothing needs to be done, as data consistency after a system crash is not guaranteed on non-RAID disks anyway; it is up to the application to deal with the erroneous data.

For Block-Interleaved RAID Levels (e.g. RAID 5)
Inconsistency can affect other data unrelated to the block being written: if the data block is updated but the system crashes before the parity block is updated, the other blocks of that parity group are left unprotected.
[Figure: one stripe with its data block updated but its parity block stale; the remaining blocks of the stripe are affected by the inconsistency.]
Hardware RAID controllers use non-volatile RAM to temporarily store the parity information before the write operation, to guard against system crashes. Software implementations therefore usually have inferior performance.

Uncorrectable Bit Errors

Causes
Incorrectly written data; magnetic media aging.
Common error rates
Uncorrectable bit error: one in 10^14 bits read.

Example
Reconstruction of a failed disk in a 100 GB disk array. Assume 2×10^8 (200 million) sectors are to be read from the surviving disks and that each read is independent. A bit error rate of one in 10^14 implies that one 512-byte sector cannot be read for every 2.4×10^10 (24 billion) sectors read:

  Pr{all 2×10^8 sectors ok} = (1 - 1/(2.4×10^10))^(2×10^8) ≈ 99.2%

5.5 Implementation Considerations

Regenerating Parity After a System Crash

Problem
After a system crash, some parity blocks may be inconsistent due to interrupted writes. All parity will have to be regenerated unless the inconsistent parity units can be identified.

Solution: Hardware RAID
Log the state of each parity unit (consistent/inconsistent) in stable storage (e.g. NVRAM);
Mark a parity unit as inconsistent before writing it;
Regenerate all inconsistent parity units after a crash.
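A quick check of the 99.2% figure (a Python sketch using the slide's numbers):

```python
# Worked check of the reconstruction example above.
sectors_to_read = 2e8            # sectors read during reconstruction
sector_error_rate = 1 / 2.4e10   # one unreadable 512-byte sector per 2.4e10 reads

p_all_ok = (1 - sector_error_rate) ** sectors_to_read
print(f"{p_all_ok:.1%}")         # ~99.2% chance the reconstruction hits no bit error
```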

Solution: Software RAID
Serious performance degradation results if a disk is used as the stable storage for logging parity states:
Mark the parity as inconsistent before the write;
Write the parity;
Mark the parity as consistent after the write.

Delayed Updating
Do not re-mark a parity unit as consistent immediately after the write; put it in a pool instead. If the same parity is updated again later, no re-marking is needed.
Larger pool size -> better hit rate, but more parity units to regenerate after a crash.

Operating with a Failed Disk

Problem
Potential data loss in case of a system crash, because parity information can be lost during writes.

Solution 1: Demand Reconstruction
[Figure: the failed unit is reconstructed (XOR of the surviving units in its stripe) onto a spare disk before the write is attempted.]

Solution 2: Parity Sparing
After a disk failure, to access a unit on the failed disk, the data unit is reconstructed and written over the parity unit of its stripe.
[Figure: the reconstructed data unit overwrites the parity unit.]

After relocation, a system crash only affects the data unit being written.
When the failed disk is repaired or replaced, the relocated data unit is simply copied to the new disk and marked as no longer relocated.
Extra metadata is required to keep the list of relocated data units.

5.5 Implementation Considerations (cont.)

Orthogonal RAID
How should the parity groups be arranged across the disk controllers?
[Figure: an array of disks attached to four controllers.]

Option 1: G = 4, one parity group per controller.
[Figure: each parity group consists of the four disks behind a single controller.]

Option 2: G = 4, one disk per controller.
[Figure: each parity group takes one disk from each of the four controllers.]

5.6 Improving Small Write Performance for RAID 5

RAID Level 5
Comparatively good performance in all areas except small writes, which require a read-modify-write (4 I/Os):
Read the old data unit and the associated parity unit (2 reads);
Compute the new parity;
Write the new data unit and the new parity unit (2 writes).

General Performance Degradation
Response time approximately doubles (why not 4x?);
Throughput is reduced by a factor of 4.

Application Implications
Unsuitable for applications generating lots of small writes (e.g. transaction processing systems); mirroring (RAID 1) is usually preferred for TP workloads.

Method 1: Buffering and Caching

Write Buffering
[Figure: synchronous write (issue the write, wait for success or failure) versus asynchronous write (report success immediately, schedule the actual write later).]

Advantages
Improved response time (success is reported to the application immediately);
Potential savings when a unit is overwritten again before it is committed to disk;
Potential savings from grouping sequential writes into a single I/O;
More efficient disk scheduling, as more queued requests are available to the disk scheduler (e.g. SCAN).

Problems
Potential data loss in case of a system crash, unless non-volatile memory is used to buffer the writes;
Requires additional hardware support (e.g. NVRAM);
Little improvement in throughput;
Does not work under heavy load (why?).

Caching

Read Caching
If the old data unit is in the disk cache, it can be used to compute the new parity, reducing a small write from 4 to 3 I/Os. This is very common in TP applications, where the old value is read into the application and written back after modification.

Write Caching
Caching recently written parity eliminates the need to read the old parity when another stripe unit of the same stripe is modified again. When successful, this further reduces a small write from 3 to 2 I/Os.

Method 2: Floating Parity
Cluster parity units into parity cylinders and leave one free track per parity cylinder.
[Figure: a parity cylinder on the disk spindle containing parity tracks and one free track.]

Method 2: Floating Parity (cont.)

Read-Modify-Write
Read the old parity unit, compute the new parity, and write it out.
[Figure: the new parity is written to a nearby free block on the same parity cylinder, close to where the old parity was read.]

Performance Gain
[Figure: the disk arm reads the old parity and writes the new parity almost in place.]
No seek is needed for the parity write, and the rotational latency is also reduced.

Performance Gain Example (16 tracks per cylinder)
Around 65% of the time, the block next to the parity block is a free block;
The average number of blocks to skip to reach the nearest unallocated block is between 0.7 and 0.8;
Hence only a small additional delay is needed for the parity write.

Tradeoffs
A directory of the locations of free blocks and parity blocks must be maintained in memory (around 1 MB for a RAID with 4 to 10 500 MB disks);
The exact geometry of the disk drives is needed to schedule the disk head to an unallocated block;
Hence floating parity is most likely to be implemented in the RAID controller.

Method 3: Parity Logging

Principle
Delay the read of the old parity and the write of the new parity.

First Part - Delayed Parity Update
On an application write, the RAID controller reads the old data, computes the difference (old data XOR new data), writes it to a log region in NVRAM, and flushes the log to disk when it is full.
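A rough Python sketch of this idea (my own simplification, with assumed data structures rather than a real controller interface): the write path logs the XOR difference of old and new data, and the batch pass described on the next slide later folds the logged differences into the parities.

```python
# Rough sketch (assumed structures, not a real controller API) of parity logging:
# log the XOR difference of old and new data instead of updating parity in place.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

parity_log = []  # stands in for the NVRAM log region

def small_write(stripe_id, old_data, new_data):
    """Write path: compute the parity update image and append it to the log."""
    parity_log.append((stripe_id, xor_bytes(old_data, new_data)))
    # the new data unit is written to disk here; the parity is not touched yet

def batch_update(parities):
    """Second part: fold every logged update image into its stripe's parity."""
    for stripe_id, update_image in parity_log:
        parities[stripe_id] = xor_bytes(parities[stripe_id], update_image)
    parity_log.clear()
```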

Method 3: Parity Logging (cont.)

Second Part - Batch Update
The RAID controller reads the parity log and the old parities into RAM, computes the new parities, and writes the new parities back to disk.

Performance Gain
The larger transaction size (via batching) reduces disk overhead substantially;
Overhead drops from 4 disk accesses to a little more than 2 disk accesses per small write (comparable to mirroring).

5.7 Declustered Parity

Performance Degradation After a Disk Failure

Small reads
[Figure: a parity group with units 0-17; one disk has failed. To read unit 8 on the failed disk, units 6 and 7 and the parity unit of that stripe must also be read.]
In the worst case, disk load increases by 100% after a disk failure.

Small reads in a large RAID with multiple parity groups
8 disks with two static groups (G = 4): D0-D3 form Group 1 and D4-D7 form Group 2.
[Figure: with one disk of Group 2 failed, every read of a unit on the failed disk must also read the corresponding units on the other three Group 2 disks (D5, D6, D7).]
Group 1 is not affected, but reading unit 6 still imposes 100% extra load on Group 2. This is disastrous for applications such as a video server.

8 disks, two groups (G = 4), with declustering
[Figure: parity group membership changes from stripe to stripe, so successive reconstructions read from different disk sets, e.g. (5,6,7), (0,1,5), (0,2,6), (0,3,7).]
The overhead of reading unit 6 is spread across all disks in the system.

Design of the Block-Placement Policy
How should blocks from multiple parity groups be placed so that the extra load is distributed uniformly across all disks in the system?

Method 1: Enumeration (see the sketch after this slide)
Cycle through all possible mappings. Given N disks and parity group size G, the number of mappings = C(N, G). Example: N = 8, G = 4 => 70 mappings.

Method 2: Theory of Balanced Incomplete Block Designs
See M. Hall, Combinatorial Theory (2nd Ed.), Wiley-Interscience, 1996.

Tradeoffs
Less reliable than standard RAID under double-disk failure (why?);
The complex parity group mappings can disrupt the sequential placement of large data objects across the disks.
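A small Python sketch of Method 1 (enumeration), assuming the C(N, G) possible disk subsets are simply cycled through; it also checks that each disk appears equally often, which is what spreads the post-failure load uniformly.

```python
# Sketch of Method 1: enumerate all C(N, G) ways of choosing which G of the
# N disks hold one parity group, and cycle through them so that the extra
# reconstruction load is spread over the whole array.

from itertools import combinations
from math import comb

N, G = 8, 4
mappings = list(combinations(range(N), G))
assert len(mappings) == comb(N, G) == 70   # the slide's example: 70 mappings

# Each disk appears in the same number of mappings, so the reconstruction work
# induced by any single failed disk is spread evenly across the survivors.
appearances = [sum(d in m for m in mappings) for d in range(N)]
print(appearances)   # [35, 35, 35, 35, 35, 35, 35, 35]
```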

5.8 Exploiting On-Line Spare Disks

Online spare disks
[Figure: an active RAID plus a spare disk behind the RAID controller; after a failure the failed disk is reconstructed onto the spare, and a new spare disk is then added.]
They reduce the window of vulnerability after a disk failure.

Motivation
[Figure: during normal operation the spare disk does no work.]
Shortcomings of a dedicated spare:
The spare disk sits idle without sharing the system's load;
Without active I/O, a failure of the spare disk itself could go unnoticed.

Distributed Sparing
No dedicated spare disk; spare units are spread over all disks, and failed units are reconstructed into those spare units.

Advantages
Improved performance during normal operation, since all disks contribute to serving I/O;
Less work during reconstruction, since spare units do not need to be reconstructed.

State transitions for a RAID with sparing
Normal Mode --(failure detected)--> Failure Mode --(reconstruction started)--> Reconstruction Mode --(reconstruction completed)--> Reconfigured Mode --(failed disk replaced, restoration started)--> Restoration Mode --(restoration completed)--> Normal Mode

Data organization before failure

RAID-5 with distributed sparing (spare units rotate with the parity):
  D0   D1   D2   D3   D4   D5
  b0   b1   b2   b3   p0   s0
  b4   b5   b6   p1   s1   b7
  b8   b9   p2   s2   b10  b11
  b12  p3   s3   b13  b14  b15
  p4   s4   b16  b17  b18  b19

RAID-5 with hot sparing (dedicated spare disk D5):
  D0   D1   D2   D3   D4   D5
  b0   b1   b2   b3   p0   s0
  b4   b5   b6   p1   b7   s1
  b8   b9   p2   b10  b11  s2
  b12  p3   b13  b14  b15  s3
  p4   b16  b17  b18  b19  s4
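The distributed-sparing table above follows a simple rotation; the sketch below (my assumed mapping, which reproduces that table) generates the layout for N = 6 disks.

```python
# Sketch (assumed mapping, reproducing the distributed-sparing table above) of
# RAID 5 with distributed sparing on N disks: parity and spare units rotate
# together, and every disk carries a share of data, parity, and spare space.

N = 6  # disks D0..D5

def stripe_layout(stripe, n=N):
    parity_disk = (n - 2 - stripe) % n        # parity rotates across the disks
    spare_disk = (parity_disk + 1) % n        # spare unit sits next to the parity
    row = [None] * n
    row[parity_disk] = f"p{stripe}"
    row[spare_disk] = f"s{stripe}"
    data = iter(range(stripe * (n - 2), (stripe + 1) * (n - 2)))
    for disk in range(n):                     # remaining slots filled in disk order
        if row[disk] is None:
            row[disk] = f"b{next(data)}"
    return row

for s in range(5):
    print(stripe_layout(s))
# ['b0', 'b1', 'b2', 'b3', 'p0', 's0'], ['b4', 'b5', 'b6', 'p1', 's1', 'b7'], ...
```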

Data organization after reconstruction

RAID-5 with distributed sparing:
[Figure: the units of the failed disk are reconstructed into the spare units scattered across the surviving disks, and a new spare disk is then added. The resulting data organization differs from the original one.]

RAID-5 with hot sparing:
[Figure: the units of the failed disk are reconstructed onto the dedicated spare disk; once reconstruction completes the array is done, and the new spare disk simply takes the place of the consumed one. The data organization is unchanged.]

Tradeoffs

One more step in full system recovery:
Distributed sparing: Normal -> Reconstruction -> Restoration -> Normal
Standard spare disk (hot sparing): Normal -> Reconstruction -> Normal
With distributed sparing, a restoration phase is needed to copy all reconstructed data onto the new disk that replaces the failed one.

Potential performance degradation before restoration:
Data originally on a single disk is distributed across multiple disks after reconstruction;
This may hurt applications doing lots of large, sequential reads and writes (e.g. a video server).

The performance gain in normal operation may not be usable:
A video server requires performance guarantees, usually based on worst-case assumptions;
The extra capacity available in normal operation cannot be used, otherwise the service of some users would be disrupted when a disk fails.

Parity Sparing

Principle
Store extra parity information in the spare units.
[Figure: each stripe carries two parity units (P0 and P1), splitting the array into two parity groups.]

Potential Benefits
Improved small read/write operations after a disk failure;
Two parity groups can be created to improve reliability;
P+Q redundancy can be implemented.

References
The materials in this chapter are based on:
P.M. Chen, et al., "RAID: High-Performance, Reliable Secondary Storage," ACM Computing Surveys.
J. Chandy, et al., "Failure Evaluation of Disk Array Organizations," Proc. International Conference on Distributed Computing Systems, May 1993.
D.A. Patterson, et al., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM International Conference on Management of Data (SIGMOD), June 1988.