
Wednesday, May 3, 2017

Topics for today:
RAID: Level 0, Level 1, Level 3, Level 4, Level 5, Beyond RAID 5
File systems

RAID revisited

Several RAID "levels" have been defined. Some are more commercially viable than others.

RAID Level 0 (see Fig. 9.26)

In RAID 0 we have the idea of striping the data. If we have N discs, then virtual sector k is stored as sector k/N on disc number k%N. There is no redundancy: if a drive fails there is data loss. However, we may be able to perform, in parallel, operations that would have to be sequential on a single big disc, for example reading virtual sector S and virtual sector S+1.

RAID Level 1 (see Fig. 9.27)

In RAID 1 we have mirror discs: every disc is duplicated. Now if a single drive crashes there is no data loss, because an exact copy of it is available. Write operations may be slightly slower, because two discs have to be written for a write to be complete. Variation: synchronize the discs in a pair 180 degrees apart. Reads are faster, because we can read from whichever disc reaches the sector first, but writes are slower because we have to write to both discs.

RAID Level 0+1 or 1+0 (see Fig. 9.28)

A combination of levels 1 and 0: data striping with mirroring. One option (1+0: striped mirrors) might be more expensive than the other (0+1: mirrored stripes) but might have greater reliability.

Comp 162 Notes Page 1 of 12 May 3, 2017
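The RAID 0 striping rule above can be sketched in a few lines. This is just the k%N / k/N mapping from the notes, assuming sectors and discs are numbered from 0; the function name is mine.

```python
# Sketch of the RAID 0 striping rule: virtual sector k lives at
# physical sector k/N on disc k%N, for an array of N discs.

def raid0_location(k, n_discs):
    """Map virtual sector k to (disc number, physical sector on that disc)."""
    return (k % n_discs, k // n_discs)

# With 4 discs, consecutive virtual sectors land on different drives,
# which is why sectors S and S+1 can be read in parallel.
print([raid0_location(k, 4) for k in range(6)])
# → [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```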

RAID Level 2 (see Fig. 9.29)

Bit interleaving with error correction (Hamming codes). The bits of a byte are stored on, for example, 8 separate discs, with 4 parity bits stored on 4 other drives. If a single drive fails we do not even need to know which is the bad one in order to be able to recreate the data. There is a lot of overhead, and RAID 2 is mostly of theoretical interest.

RAID Level 3 (see Fig. 9.30)

Bit interleaving with simple parity. Write one bit of each byte onto each of 8 separate data drives, and write the parity bit onto a ninth, parity drive. Note that if we have a data item with a parity check (for example, the number of 1 bits is even) we can discover the value of any one bit by examining the others. Consider

    0 1 1 0 1 0 0 1 0

If you cover up one bit you can tell what it was by looking at the others. Thus, if a single drive fails in a RAID 3 system we can recover its data by looking at the corresponding bits on the other drives. In contrast to RAID 2, we need to know which is the bad disc (although this is normally not a problem).

A disadvantage of RAID 3 is that there is a large minimum-size transfer amount. If we can transfer no less than 4096 bytes (for example) at a time to a single disc, we can transfer no less than 4096*(N-1) data bytes to a RAID 3 system with N discs.

RAID Level 4 (see Fig. 9.31)

Sector-level interleaving (as in RAID 0). In addition, each group of N data sectors generates a parity sector. For example, the parity might be the exclusive-or of the data:

    Data sector X    10101011011
    Data sector Y    01011111001
    Data sector Z    10000101010
    Parity(X,Y,Z)    01110001000

Reads are fast and reliable: if a sector becomes unreadable we can determine its contents by looking at the corresponding sectors on the other drives. Writes require that both the data sector and the corresponding parity sector are updated. This requires two reads (current data and current parity) and two writes (new data and new parity). The reads are needed so we can compute the new value of the parity sector.
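The exclusive-or parity above is easy to check, and the same arithmetic shows both recovery of a lost sector and why a small write needs the two reads. The sector values are the bit patterns from the notes, modeled as Python ints.

```python
# Parity computation, recovery, and the small-write update rule for the
# RAID 4 example above.

X = 0b10101011011
Y = 0b01011111001
Z = 0b10000101010

parity = X ^ Y ^ Z
print(format(parity, "011b"))    # → 01110001000, as in the notes

# If sector Y becomes unreadable, XOR-ing the survivors with the parity
# sector reconstructs it:
recovered_Y = X ^ Z ^ parity
assert recovered_Y == Y

# A small write to Y: reading the old data and old parity lets us update
# the parity without touching the other data sectors.
new_Y = 0b00000000111          # hypothetical new contents
new_parity = parity ^ Y ^ new_Y
assert new_parity == X ^ new_Y ^ Z
```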

In a RAID 4 system all the parity sectors are on the same drive, so it becomes a bottleneck in the system.

RAID Level 5 (see Fig. 9.32)

In RAID 5, the parity sectors are distributed cyclically over the set of drives. This permits greater parallelism in write operations than is possible with RAID 4.

The RAID simulator (http://www.coastalworks.com/raid/raid.html)

We can use the program written by Martha Hogan to simulate 4-disc RAID 0, RAID 10 and RAID 5 systems. The program reports the Actual Read Failure Rate (ARFR) and the Virtual Read Failure Rate (VRFR).

RAID 0. There is no redundancy, so the VRFR will be the same as the ARFR.
(a) If we set the disc failure probabilities to (1.0, 0.0, 0.0, 0.0) we would expect ARFR = VRFR = 0.25, because one quarter of the read operations on average would map to the drive that fails 100% of the time.
(b) If we set the probabilities to (0.0, 0.0, 0.0, 0.5) we would expect ARFR = VRFR = 0.125.

RAID 10. There are two pairs of discs. We can determine, by observation, which discs are paired:
Setting probabilities (1.0, 1.0, 0.0, 0.0) gives VRFR = 0.0, so Disc 1 and Disc 2 are in different pairs.
Setting probabilities (1.0, 0.0, 1.0, 0.0) gives VRFR = 0.5, so Disc 1 and Disc 3 are paired.
If the probabilities are (1.0, 0.0, 0.0, 0.0), then VRFR = 0.0 and ARFR = 0.3333. Why? Suppose we have 60 virtual reads. On average 30 will be to Disc 1 and 30 will be to Disc 2. The 30 on Disc 1 will fail, requiring 30 reads from its mirror (Disc 3). So a total of 90 reads, 30 of which fail, giving ARFR = 1/3.

Comp 162 Notes Page 3 of 12 May 3, 2017
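The 60-read accounting above can be generalized. This is an expected-value sketch, assuming the pairing deduced from the simulator (Disc 1 with Disc 3, Disc 2 with Disc 4), that the primary copy is read first, and that a failed read is retried once on the mirror; the function name is mine, not the simulator's.

```python
# Expected-value sketch of the RAID 10 read accounting. Discs are indexed
# 0-3 (Disc 1 is index 0); pairs are (0, 2) and (1, 3).

def raid10_rates(p, virtual_reads=60):
    """Return (ARFR, VRFR) for a 4-disc RAID 10 with per-disc
    read-failure probabilities p."""
    mirror = {0: 2, 1: 3}
    attempted = failed = virtual_failed = 0.0
    for primary in (0, 1):            # virtual reads stripe over the two pairs
        n = virtual_reads / 2
        attempted += n                # primary reads
        f1 = n * p[primary]           # primary reads that fail
        failed += f1
        attempted += f1               # retries on the mirror
        f2 = f1 * p[mirror[primary]]  # retries that also fail
        failed += f2
        virtual_failed += f2
    return failed / attempted, virtual_failed / virtual_reads

arfr, vrfr = raid10_rates([1.0, 0.0, 0.0, 0.0])
print(round(arfr, 4), vrfr)           # → 0.3333 0.0
```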

RAID 5. In a RAID 5 configuration, if a read fails, the corresponding sectors are read from the other discs. If all those reads succeed then we still have a successful virtual read.

(a) If the failure probabilities are (1.0, 0.0, 0.0, 0.0) then VRFR = 0.0 and ARFR = 0.1428. The ARFR (1/7) can be explained by the following diagram representing 1200 initial reads, where X denotes a failed read:

    1200
      300 to Disc 1:  all 300 X
          recovery reads: 300  300  300  (all succeed)
      300 to Disc 2:  all succeed
      300 to Disc 3:  all succeed
      300 to Disc 4:  all succeed

2100 reads are attempted (1200 original and 900 to correct errors). Of these, 300 fail, so the ARFR is 300/2100 = 1/7.

(b) If the failure probabilities are (1.0, 1.0, 0.0, 0.0) then VRFR = 0.5 and ARFR = 0.4, as you can see from the following:

    1200
      300 to Disc 1:  all 300 X
          recovery reads: Disc 2 300 X, Disc 3 300, Disc 4 300
      300 to Disc 2:  all 300 X
          recovery reads: Disc 1 300 X, Disc 3 300, Disc 4 300
      300 to Disc 3:  all succeed
      300 to Disc 4:  all succeed

3000 reads are attempted, of which 1200 fail, so ARFR = 1200/3000 = 0.4; 600 of the 1200 virtual reads fail, so VRFR = 0.5.

(c) If the failure probabilities are (0.5, 0.0, 0.0, 0.25) then VRFR = 0.0625 and ARFR = 0.16, as you can see from the following:

    1600
      400 to Disc 1:  200 succeed, 200 X
          recovery reads: Disc 2 200, Disc 3 200, Disc 4 150 + 50 X
      400 to Disc 2:  all succeed
      400 to Disc 3:  all succeed
      400 to Disc 4:  300 succeed, 100 X
          recovery reads: Disc 1 50 + 50 X, Disc 2 100, Disc 3 100

    ARFR = (200+100+50+50)/(1600+600+300) = 400/2500 = 0.16
    VRFR = (50+50)/1600 = 0.0625

Beyond RAID 5

After a failure in a RAID 5 system, a disc must be rebuilt. If there is any read failure during this time then there is data loss. Example (due to Chen et al., 1993): if we are rebuilding a 100GB disc array we need to perform 200 million sector reads. If the bit error rate is one in 10^14 bits, the probability of an error-free rebuild is 99.2%. Errors during rebuild may be more likely if the discs in the system are of the same type and the same age.

RAID level 6 is the current industry standard. In RAID 6 (dual distributed parity) there are two independent parity sectors for each group of N data sectors, so we need more space: N+2 discs for N discs' worth of data, compared with N+1 discs for RAID 5 (single distributed parity). Reads are still fast but writes are slower than RAID 5

because of the need to update two parity sectors. Reliability is improved because the system is not running without redundancy during a rebuild.

File systems (section 9.3)

Users are not normally concerned with the operation of the disc(s) in a system. Their interface is usually through the file management component of the operating system. Warford looks at three ways in which we might store files on discs. We look at these methods plus, briefly, the Unix system and a system without directories that is based on hashing.

Directories/Folders

Directories (folders) hold information about files - including other directories, so we can have a hierarchical system. If we can have links across directories then the structure is a graph rather than a tree. Information stored about a file might include the following (you may be able to think of additional items):

    name
    generation number (each modification bumps this by one)
    type - or pointer to the application that created it
    owner
    times: created, last modified, last read, last archived, expiration(?)
    access controls: user(s)/permitted-actions
    size and location of file - see below

Space Allocation

Warford's three ways of allocating space to files are termed (a) contiguous, (b) linked and (c) indexed.

(a) contiguous (Fig. 9.18)

In this first scheme, a file occupies contiguous space on a drive. This has advantages in that both sequential access to the whole file and random access to a part of the file are fast. The directory entry is likely to be simple too, just indicating where the file starts and how big it is. However, over time, the free space on a disc is likely to become fragmented. If there are multiple free areas, which one does the OS use when a user program opens a new file for writing (without any indication of how big the file might get)?

Suppose a user tries to make a copy of file A. The OS knows how much space is required, and if there is a free area big enough we are OK. Similarly, if the total free space is less than the size of A there is no way we can make a copy. But what if there

is enough free space but it is broken into chunks, no single one of which is big enough? The operating system may have to move files around to create a big enough area. This defragmentation is very time consuming.

(b) linked storage (Fig. 9.19)

In Warford's second scheme, a file is stored as a linked list of allocation units, with each unit except the last pointing to the next unit. The directory entry identifies the number of units and the location of the first. The operating system now has much more flexibility about placing units of the file on the disc. However, sequential access through the file may be slow, depending on the scattering of the units. Random access is very slow, because to get to an arbitrary point in the file we typically have to follow the pointers through the list.

(c) indexed (Fig. 9.20)

Warford's third scheme also allows the allocation units of a file to be placed anywhere on the disc. It has multiple pointers in the directory entry that point directly to allocation units. This reduces the time required to access a random component of the file. However, when we design a directory entry, how many pointer fields do we include? If it is N, how do we store a file having more than N units?

(d) Unix-like approach

In the Unix system, a directory entry points to an inode (index node) which points to the file, rather than the directory entry pointing directly to the file. Because an inode can be pointed to from multiple places, a particular file can appear in multiple directories (and have different names in those directories). The inode contains a count of the number of pointers to it, and a file is only deleted when the count reaches zero. There are a fixed number of pointers in the inode, but they are used in a creative way to permit very large files to be accessed. Here is an example of how they might be used (different Unix implementations differ in the details). Suppose there are 13 pointers; we might use those pointers as follows:

    Pointer(s)      Used for
    1 through 10    point directly to the first 10 allocation units of the file.
    11              points to a block of pointers; the pointers in the block point to the next M allocation units of the file. For example, if a block holds 256 pointers, they point to units 11 through 266.
    12              points to a block of pointers, each of which points to a block of pointers. This double indirection lets us point to a further 256*256 units: 267 through 65802.
    13              points to a block of pointers, each pointing to a double-indirect block. Via this triple indirection we can access a further 256*256*256 units: 65803 through 16843018.

This scheme enables very large files to be accessed. If allocation units are 4K bytes, we can have files in excess of 64GB. While there is a time penalty for accessing all but the first few units, most of the files in a typical user account are small and can be accessed without indirection. The inode of a currently active file is held in main memory.

(e) Directory-less file system

Why might we want a directory-less file system? Sometimes corruption occurs in a directory/folder that makes a file unreadable even though the file itself is intact. The following is a possible scheme for storing files that does not use directories at all. However, some operations on it are slow; it is appropriate only for a specialized environment.

Background: hashing

Hashing is a storage and retrieval technique that uses a key-to-address function. We apply the function to the key of the item to be stored, and the function returns the address where the item should be stored. When we need to find an item we apply the function to the search key and it gives us the address where to look. We saw hashing when discussing symbol tables in assemblers. In the best circumstances no two items hash to the same address, and storage and retrieval are very fast. However, in practice we need to make provision for an item hashing to an address that is

already occupied. One way of dealing with overflows is to look at successive addresses until free space is found. The hash function is customized for the type of keys and the range of addresses in the application. For example, if we are hashing names into a table with 30 entries we might use

    address = (rank(first letter) + rank(last letter)) % 30

Thus "Yeltsin" hashes to (25 + 14) % 30 = 39 % 30 = 9.

Hashing to disc

Imagine a disc as an array of allocation units, each of which has room for user data plus a key field. When we initialize the file system we put a reserved key, e.g. %%EMPTY, in each key field. To write block b of file F to the disc, we construct a composite key out of the filename and the block number, then hash this key to get an address. For example, if the file is mydata and we are writing block 7, the composite key might be mydata%%007. If the unit at that address is free we put the data there; otherwise we apply an overflow method to find a free space. When we wish to read block b of file F we go through the same hashing process. To speed up operations, the OS keeps a table showing the actual address of blocks that have overflowed and are not at their hash address. Unreliable portions of the disc can be flagged with a particular marker key, e.g. %%BAD. Depending on the overflow method used, we may also need a marker for a deleted unit, e.g. %%DEL.

Some operations on this structure are slow and some are fast:

    Slow:    getting a list of all the files on the disc together with their sizes
    Medium:  deleting a file
    Fast:    accessing block b of a file

We can simulate a hierarchical directory by permitting file names of the form X/Y/Z.

Reading

Section 9.1 has a page or two on disc drives. Section 9.4 discusses error detecting and correcting codes. Sections 9.5 for RAID and 9.3 for file systems.
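The hashed, directory-less file store described above can be sketched briefly. The composite-key format, linear probing, and %%EMPTY/%%DEL markers follow the notes; the class name, disc size, and other details are illustrative choices of mine, not part of the scheme as given.

```python
# Minimal sketch of a directory-less file store: each allocation unit has
# a key field and a data field, keys like "mydata%%007" are hashed to an
# address, and overflows probe successive addresses.

EMPTY, DELETED = "%%EMPTY", "%%DEL"

class HashedDisc:
    def __init__(self, n_units=30):
        self.keys = [EMPTY] * n_units   # key field of each allocation unit
        self.data = [None] * n_units    # user data of each allocation unit

    def _hash(self, key):
        # Toy key-to-address function in the spirit of the notes' example.
        return (ord(key[0]) + ord(key[-1])) % len(self.keys)

    def write_block(self, filename, block, payload):
        key = "%s%%%%%03d" % (filename, block)   # e.g. mydata%%007
        addr = self._hash(key)
        # Linear probing: scan successive addresses until a usable unit.
        for i in range(len(self.keys)):
            a = (addr + i) % len(self.keys)
            if self.keys[a] in (EMPTY, DELETED) or self.keys[a] == key:
                self.keys[a], self.data[a] = key, payload
                return a
        raise IOError("disc full")

    def read_block(self, filename, block):
        key = "%s%%%%%03d" % (filename, block)
        addr = self._hash(key)
        for i in range(len(self.keys)):
            a = (addr + i) % len(self.keys)
            if self.keys[a] == EMPTY:   # never written: block is not on disc
                raise FileNotFoundError(key)
            if self.keys[a] == key:
                return self.data[a]
        raise FileNotFoundError(key)

disc = HashedDisc()
disc.write_block("mydata", 7, b"hello")
print(disc.read_block("mydata", 7))   # → b'hello'
```

Listing all files is slow here because it means scanning every unit, which matches the Slow/Medium/Fast classification in the notes.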

Review Questions

1. We have seen that a RAID system may be able to recover from an unsuccessful read operation. Can it recover from an unsuccessful write operation?

2. On an N-disc RAID 4 system, exactly one disc has a non-zero failure rate. Fill in the following table with the read and write operations required:

                                           Reads    Writes
    Error-free read of virtual sector
    Read of virtual sector with errors
    Error-free write of virtual sector

3. If, in a RAID 5 system, one disc has a 10% failure rate and the other N-1 discs read without error, then the VRFR is 0.0. How does the ARFR change as we increase N: (a) stays the same, (b) increases, (c) decreases?

4. You are in charge of a 4-disc RAID 5 system. On arriving at work one day you notice that one disc has a 10% failure rate (one in ten reads fails). The other discs are fine. Your boss wants to know (a) the VRFR and (b) the ARFR. What do you tell her? Later, returning from lunch, you notice that a second disc is also failing, but on this drive one in five reads fails. Your boss wants to know (c) the updated VRFR and (d) the updated ARFR. What numbers do you give her?

5. Research the Unix touch command. What is the effect of touch X (a) if file X already exists, (b) if file X does not exist?

Review Answers

1. The only action is to retry the write.

2.
                                           Reads    Writes
    Error-free read of virtual sector        1        0
    Read of virtual sector with errors       N        0
    Error-free write of virtual sector       2        2

3. (c) decreases. For example, if N is 4 the ARFR is 1/43, and if N is 6 the ARFR is 1/65.

4. (a) 0.0  (b) 30/1290 = 1/43, as the following diagram of 1200 initial reads shows (X denotes a failed read):

    1200
      300 to the failing disc:  270 succeed, 30 X
          recovery reads: 30  30  30  (all succeed)
      300
      300
      300

(c) 12/1200 = 0.01  (d) (30+60+6+6)/(1200+90+180) = 102/1470, as you can see from the following:

    1200
      300 to the 10%-failure disc:  270 succeed, 30 X
          recovery reads: 30  30  and, on the 20%-failure disc, 24 + 6 X
      300 to the 20%-failure disc:  240 succeed, 60 X
          recovery reads: 60  60  and, on the 10%-failure disc, 54 + 6 X
      300
      300

5. (a) updates one or both of the access and modification timestamps of the file (held in the inode); (b) creates an empty file.
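The RAID 5 numbers in answers 3 and 4 can be reproduced with a short expected-value version of the read accounting used throughout these notes: a failed read triggers one recovery read on each other disc, and the virtual read fails if any recovery read fails. The function name is mine, not the simulator's.

```python
# Expected-value ARFR/VRFR for a RAID 5 array with per-disc read-failure
# probabilities p, following the accounting in the worked diagrams.

def raid5_rates(p):
    """Return (ARFR, VRFR) per virtual read, as expected values."""
    n = len(p)
    attempted = failed = virtual_failed = 0.0
    for i in range(n):
        share = 1.0 / n            # fraction of virtual reads hitting disc i
        attempted += share
        f = share * p[i]           # initial reads that fail
        failed += f
        ok = 1.0
        for j in range(n):
            if j != i:
                attempted += f     # one recovery read per surviving disc
                failed += f * p[j]
                ok *= 1.0 - p[j]
        virtual_failed += f * (1.0 - ok)   # some recovery read also failed
    return failed / attempted, virtual_failed

# Question 4(a)/(b): one disc failing 10% of the time -> VRFR 0, ARFR 1/43.
print(raid5_rates([0.1, 0.0, 0.0, 0.0]))

# Question 4(c)/(d): a second disc failing 20% of the time.
arfr, vrfr = raid5_rates([0.1, 0.2, 0.0, 0.0])
print(round(vrfr, 6))              # → 0.01
```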