RAID and AutoRAID

RAID background

Problem: technology trends
- Computers getting larger, need more disk bandwidth
- Disk bandwidth not riding Moore's law
- Faster CPUs enable more computation to support storage
- Data-intensive applications

Approaches:
- SLED: single large expensive disk
- RAID: redundant array of (independent, inexpensive) disks

NOTE:
- Disk arrays had been done before
- The contribution of this paper is a taxonomy and a way to compare and organize them

Key ideas:
- Striping: write blocks of a file to multiple disks; can read/write in parallel
- Redundancy: write extra data to extra disks for failure recovery, e.g. parity, ECC, duplicate data. Redundancy can also improve performance: with duplicates you have a choice of disk (latency) or two disks (throughput)

Why arrays?
- Cheaper disks
- Lower power
- Smaller enclosures
- Higher reliability
  o Can survive a disk failure
- Larger bandwidth
  o Can read or write multiple disks at a time

How do you compare disk setups?
- Price?
- Power?
- Size?
- Performance?
  o What performance?
  o Large reads
  o Small reads
  o Large writes
  o Small writes
  o Read/modify/write (TP)

Organization:
- Take N disks, put them into groups of G

RAID versions:

JBOD: just a bunch of disks, mount as separate volumes
- Read/write performance for a file is limited to a single disk
- Reliability for a byte is the same as a single disk, but the file system can tolerate some disk failures with partial data loss

RAID 0: striping
- Stripe data across disks
- Best overall performance: G reads/sec, G writes/sec
- Worst reliability: MTTF = MTTF(disk) / G

RAID 1: mirroring
- Store all data on two disks
- Write to both disks
- Read from whichever disk is faster (better positioned)
- Write performance = single disk
- Read performance = double
- Overhead is 100%

RAID 2: bit-wise ECC
- Stripe data across disks in small units
- Store ECC bitwise on a parity disk
- All reads/writes hit all disks
- Can detect/correct lots of errors
- Bad performance: since every access involves every disk, small reads and writes get roughly single-disk throughput

RAID 3: bit parity
- Rely on the disk itself for error detection
- Still read from all disks (except parity), write to all disks
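The parity redundancy used by RAID 3 (and the block-parity levels that follow) boils down to bytewise XOR. A minimal sketch, assuming fixed-size byte-string blocks (block names and sizes here are illustrative, not from the paper):

```python
# XOR parity: the parity block is the bytewise XOR of the data blocks;
# XOR-ing the surviving blocks with the parity rebuilds a lost block.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def compute_parity(blocks: list) -> bytes:
    parity = bytes(len(blocks[0]))  # all-zero block
    for blk in blocks:
        parity = xor_blocks(parity, blk)
    return parity

def reconstruct(surviving: list, parity: bytes) -> bytes:
    # Recover the single failed block: XOR of survivors plus parity.
    return compute_parity(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]   # three hypothetical data disks
parity = compute_parity(data)

# Lose "disk 1" and rebuild its block from the survivors + parity:
rebuilt = reconstruct([data[0], data[2]], parity)
assert rebuilt == data[1]

# Updating one block does not require rereading the whole stripe:
# new_parity = old_parity XOR old_data XOR new_data.
new_block = b"DDDD"
new_parity = xor_blocks(xor_blocks(parity, data[1]), new_block)
assert new_parity == compute_parity([data[0], new_block, data[2]])
```

The last two lines are the key to the small-write discussion below: a single-block update needs only the old data and old parity, not the rest of the stripe.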
RAID 4: block parity
- Use a single disk for error correction; rely on controllers for detection
- Can read from a single disk (no need to compute ECC)
- Can write to two disks (data disk + parity update)
- Bottleneck: single parity disk for all writes
- Small writes require 4 accesses: read old block, read old parity, write new block, write new parity

RAID 5: distributed parity
- Same as level 4, but the parity disk changes for each block
- Removes the hotspot of the parity disk
- Large writes are efficient: just one extra access for parity

RAID 6: more error correction
- 2 parity disks allow tolerating 2 disk failures

Throughput per dollar (relative to a single disk):

          small read   small write    large read   large write   storage eff.   Reason
RAID 0    1            1              1            1             1
RAID 1    1            1/2            1            1/2           1/2            extra disk
RAID 3    1/G          1/G            (G-1)/G      (G-1)/G       (G-1)/G        one disk doesn't contribute
RAID 5    1            max(1/G, 1/4)  1            (G-1)/G       (G-1)/G

Notes:
- RAID 2 is inferior: like RAID 3 but with more ECC drives (only useful when a failed drive cannot identify itself)
- RAID 4 is inferior to RAID 5: similar best case, but throughput limited by the single parity disk

Choices of RAID
- QUESTION: what should you choose, when?
- Issues:
  o Cost of disks: is it relevant? Perhaps space/power more relevant
  o Workload: lots of small reads/writes indicates RAID 1; lots of large reads and writes indicates RAID 5

AutoRAID
1. AutoRAID problem
   a. RAID 1 provides the best performance/reliability
   b. RAID 5 is more efficient cost-wise
      i. Performance good for large reads/writes
         1. Bad for small writes
   c. Performance depends on
      i. Number of disks, size of groups
   d. Managing the variety of RAID configurations is hard
      i. Changing layout requires copying data off to another system
      ii. Adding a disk requires copying data off to another system
      iii. All disks must be the same size
      iv. Hot spares for fast repair do nothing to improve performance
   e. NOTE: the same thing is true today with disk + flash storage
      i. Cannot migrate a device between the two
      ii. Within flash, different encoding mechanisms (MLC vs SLC) are possible, with similar tradeoffs
2. Desired goal
   a. A bunch of disks
   b. A workload
   c. Storage system determines the best configuration for the workload
      i. QUESTION: what is that?
         1. Mirror as much as possible
         2. Store cold (not overwritten) data in RAID 5
            a. RAID 5 performance is fine for small/large reads
   d. QUESTION: Who should do this?
      i. Administrator?
      ii. File system?
      iii. RAID controller?
         1. Depends on what you sell: you want to reach as much of your customer base as possible
            a. Sun: file system
            b. IBM: administrator
            c. HP: RAID controller
3. Possible organizations
   a. Cache: treat some set of mirrored disks as a fast cache in front of RAID 5
      i. QUESTION: Upsides/downsides?
         1. Less capacity?
            a. Ratio of RAID 1 to RAID 5 overhead is 100% to ~10%
            b. Performance ratio is 1-10x
      ii. QUESTION: what about for flash?
         1. The smaller volume of flash makes caching more attractive
         2. A separate physical device allows the disk to be removed and remain consistent, leaving the cache behind
   b. Tiering:
      i. Data lives in either the mirrored tier or the RAID 5 tier
      ii. Data moves between tiers but lives in only one place
      iii. NOTE: Apple's Fusion Drive does this with flash
         1. All writes go to flash
         2. When < 4 GB is left in flash, move data to disk to keep 4 GB available
4. AutoRAID layout terminology
   a. Physical layout: disk blocks grouped into:
      i. Segments: contiguous range on one disk allocated to a stripe
         1. RAID stripes write one segment per disk
         2. Size chosen to get good sequential performance (large) but spread the workload for small accesses (small)
      ii. Physical Extent (PEX): (largish) set of segments on one disk; the unit of allocation to RAID 5/mirroring
      iii. Physical Extent Group (PEG): set of PEXes on different disks with the desired redundancy (all on different disks, correct # of different disks)
   b. Logical layout: Relocation Blocks (RBs, 64 KB)
      i. Unit of storage that can be assigned to different places
         1. Larger than a disk block for efficiency
      ii. Unit of address translation: AutoRAID stores a table (persistently) saying where every RB is stored
5. AutoRAID mechanisms:
   a. Mirrored reads:
      i. Just like RAID 1, for the disks in a PEG
   b. RAID 5 reads:
      i. Pretty much like RAID 5, for the disks in a PEG
   c. Writes:
      i. Go to NVRAM buffering for low latency before going anywhere
   d. Demotion:
      i. Move data from a mirrored PEG to a RAID 5 PEG
   e. Promotion:
      i. Delete/free data from RAID 5, re-allocate in mirrored
      ii. WHY not move the data?
         1. No point: read performance is the same; the only benefit of RAID 1 is writes
         2. QUESTION: for flash with fast reads, would this change?
6. AutoRAID policies
   a. Normal access:
      i. On a read:
         1. Read the data wherever it is
      ii. On a write:
         1. All data goes to NVRAM
         2. Then to mirroring (unless the array is full)
   b. Demotion: when are blocks demoted from mirroring to RAID 5?
      i. QUESTION: What kinds of blocks benefit from mirroring?
         1. Frequently updated
         2. Randomly written
      ii. Policy: least-recently-written
         1. Read accesses do not matter (see above)
   c. Layout: how are blocks laid out?
      i. Mirroring: random access: find a free RB slot and write there (free-block bitmap)
      ii. RAID 5: logging
         1. Always write sequentially to RAID 5; try to fill a whole stripe
            a. Gives maximum write performance
            b. Avoids the read/modify/write penalty (small writes in RAID)
            c. If not a full stripe:
               i. Recompute parity on the fly
         2. Use address translation to locate data
         3. Safe parity updates
            a. Problem: what if you crash between writing the data and the new parity?
            b. Answer: use address translation (like LFS) / no-overwrite updates
               i. Write the new data
               ii. Write the new parity
               iii. Update the translation table to point to the new data
   d. Cleaning
      i. Mirrored storage: no cleaning necessary; can just overwrite holes
         1. Copying/compaction used to make free PEXes for RAID 5
            a. Disks all start as mirrored; must compact to start making RAID 5 space
      ii. RAID 5 storage:
         1. QUESTION: When do holes occur?
            a. Data overwritten
               i. Now lives in the mirrored tier
               ii. Or somewhere else in RAID 5
         2. QUESTION: How often is this?
            a. Rarely: data in RAID 5 is rarely written
         3. POLICY:
            a. Hole filling: for mostly-utilized ranges, overwrite vacant slots with new RBs from the mirrored tier
            b. Cleaning: LFS-style copy/compact
7. Interesting features:
   a. On read/write, no decision as to where to put data
      i. Reads are served in place
      ii. Writes go to mirrored
   b. Data movement is all asynchronous
      i. Background demotion
      ii. No promotion
   c. No-overwrite for consistency
      i. Write the new data/parity, then update the map
   d. NVRAM for low latency
      i. Holds blocks before they are written
         1. Can buffer data until demotion makes space
   e. Automatically balances mirrored/RAID 5 space
      i. Uses as much capacity as possible for mirroring
         1. No idle spares
      ii. Demotes cold data
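The mechanisms and policies above can be condensed into a toy model: an RB-granularity translation table maps each block to a tier, all writes land in the mirrored tier, and the least-recently-written RB is demoted to RAID 5 when mirrored space runs out. This is a sketch under those assumptions, not HP's implementation; the class and capacity parameter are invented for illustration, and NVRAM buffering and parity are elided:

```python
from collections import OrderedDict

class AutoRAIDModel:
    """Toy two-tier AutoRAID: mirrored tier + RAID 5 tier, LRW demotion."""

    def __init__(self, mirrored_capacity: int):
        self.mirrored_capacity = mirrored_capacity
        self.mirrored = OrderedDict()  # RB id -> data, ordered by last write
        self.raid5 = {}                # RB id -> data

    def write(self, rb: int, data: bytes) -> None:
        # All writes go to the mirrored tier; an RB overwritten while in
        # RAID 5 is "promoted" by freeing it there and re-allocating here.
        self.raid5.pop(rb, None)
        self.mirrored.pop(rb, None)
        self.mirrored[rb] = data       # most-recently-written at the end
        if len(self.mirrored) > self.mirrored_capacity:
            # Demote the least-recently-written RB to RAID 5.
            victim, vdata = self.mirrored.popitem(last=False)
            self.raid5[victim] = vdata

    def read(self, rb: int) -> bytes:
        # Reads never move data: serve it from wherever it lives.
        if rb in self.mirrored:
            return self.mirrored[rb]
        return self.raid5[rb]

m = AutoRAIDModel(mirrored_capacity=2)
m.write(1, b"one")
m.write(2, b"two")
m.write(3, b"three")        # mirrored tier full: RB 1 demoted to RAID 5
assert 1 in m.raid5 and 2 in m.mirrored
m.write(1, b"one-v2")       # overwrite promotes RB 1, demotes RB 2
assert 1 in m.mirrored and 2 in m.raid5
assert m.read(2) == b"two"  # read served from RAID 5, no promotion
```

Note the asymmetry the notes emphasize: demotion copies data in the background, while promotion happens only as a side effect of an overwrite, since reads perform equally well from either tier.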