COSC 6374 Parallel I/O (I): I/O Basics
Fall 2010

Concept of a cluster

[Figure: a compute node with two processors, memory, and local disks, attached via two network cards to a message-passing network and an administrative network]
I/O Problem (I)

- Every node has its own local disk
- Most applications require data and the executable to be locally available,
  e.g. an MPI application using multiple nodes requires the executable to be
  available on all nodes in the same directory under the same name
- Multiple processes need to access the same file, potentially different
  portions -> efficiency

Basic characteristics of storage devices

- Capacity: amount of data a device can store
- Transfer rate or bandwidth: amount of data a device can read/write in a
  certain amount of time
- Access time or latency: delay before the first byte is moved

Prefix       Abbreviation   Base ten   Base two
kilo, kibi   K, Ki          10^3       2^10 = 1024
mega, mebi   M, Mi          10^6       2^20
giga, gibi   G, Gi          10^9       2^30
tera, tebi   T, Ti          10^12      2^40
peta, pebi   P, Pi          10^15      2^50
UNIX File Access Model

- A file is a sequence of bytes.
- When a program opens a file, the file system establishes a file pointer:
  an integer indicating the position in the file where the next byte will be
  read or written.
- Disk drives read and write data in fixed-size units (disk sectors).
- File systems allocate space in blocks; a block is a fixed number of
  contiguous disk sectors.
- In UNIX-based file systems, the blocks that hold a file's data are listed
  in an inode. An inode contains the information needed to find all the
  blocks that belong to a file.
- If a file is too large for an inode to hold the whole list of blocks,
  intermediate nodes (indirect blocks) are introduced.

Write operations

- Write: the file system copies bytes from the user buffer into a system
  buffer. When the buffer fills up, the system sends the data to disk.
- System buffering:
  + allows the file system to collect full blocks of data before sending
    them to disk
  + the file system can send several blocks at once to the disk
    (delayed write or write behind)
  - data is not really saved in the case of a system crash
  - for very large write operations, the additional copy from the user
    buffer to the system buffer could/should be avoided
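The file-pointer semantics above can be observed directly through the POSIX-style calls that Python's `os` module wraps; a minimal sketch (the temporary file path is created just for the demonstration):

```python
import os
import tempfile

# A file pointer is just an integer offset maintained per open file.
# Each read/write advances it; lseek() repositions it explicitly.
path = tempfile.mktemp()
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

os.write(fd, b"hello world")    # pointer advances from 0 to 11
os.lseek(fd, 6, os.SEEK_SET)    # move the pointer back to offset 6
data = os.read(fd, 5)           # reads bytes 6..10 -> b"world"

os.close(fd)
os.remove(path)
print(data)                     # b'world'
```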
Read operations

- Read: the file system determines which blocks contain the requested data,
  reads those blocks from disk into a system buffer, and copies the data
  from the system buffer into user memory.
- System buffering:
  + the file system always reads a full block (file caching)
  + if the application reads data sequentially, prefetching (read ahead)
    can improve performance
  - prefetching hurts performance if the application has a random access
    pattern

Dealing with disk latency: caching and buffering

- Avoids repeated access to the same block
- Allows a file system to smooth out I/O behavior
- Helps to hide the latency of the hard drives
- Lowers the performance of I/O operations for irregular access
- Non-blocking I/O gives users control over prefetching and delayed writing:
  initiate read/write operations as soon as possible, and wait for their
  completion only when absolutely necessary.
Improving Disk Bandwidth: Disk Striping

- Utilize multiple hard drives
- Split a file into constant-size chunks and distribute them across all disks
- Three relevant parameters:
  - stripe factor: number of disks
  - stripe depth: size of each block
  - which disk contains the first block of the file

[Figure: blocks 1..n of a file distributed round-robin over disks 1-4]

- Ideal assumption:

      b(n, p) = p * b(n/p, 1)

  with n: number of bytes to be written, b: bandwidth, p: number of disks.
- Realistically, b(n, p) < p * b(n/p, 1), since
  - n is often not large enough to fully utilize p hard drives
  - networking overhead
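The three striping parameters determine where each byte of a file lands. A minimal sketch of the block-to-disk mapping, assuming simple round-robin placement of fixed-size blocks (the function name and layout details are illustrative, not a specific file system's scheme):

```python
def locate(offset, stripe_depth, stripe_factor, first_disk=0):
    """Map a byte offset in a striped file to (disk, offset_on_disk).

    Assumes round-robin placement of fixed-size blocks, starting at
    `first_disk`, with each disk storing its blocks back-to-back.
    """
    block = offset // stripe_depth                 # global block index
    disk = (first_disk + block) % stripe_factor    # which disk holds it
    local_block = block // stripe_factor           # index on that disk
    local_offset = local_block * stripe_depth + offset % stripe_depth
    return disk, local_offset

# With 4 disks and 64 KiB stripe depth, byte 300000 lies in global
# block 4, which wraps around to disk 0 (its second block there).
print(locate(300_000, 65_536, 4))   # -> (0, 103392)
```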
Two Levels of Disk Striping (I): Using a RAID Controller

- Hardware, typically a single box
- Number of disks: 3 ... n

Redundant Arrays of Independent Disks (RAID)

- Goals:
  - improve reliability of an I/O system
  - improve performance of an I/O system
- Several RAID levels defined

RAID 0: disk striping without redundant storage (JBOD = just a bunch of disks)
- No fault tolerance
- Good for high transfer rates, i.e. read/write bandwidth of a single large file
- Good for high request rates, i.e. access time to many (small) files

RAID 1: mirroring
- All data is replicated on two or more disks
- Does not improve write performance, and improves read performance only
  moderately
RAID Level 2

- Hamming codes: each group of data bits has several check bits appended to
  it, forming Hamming code words
- Each bit of a Hamming code word is stored on a separate disk
- Very high additional cost: e.g. up to 50% additional capacity required
- Hardly used today, since parity-based codes are faster and easier

RAID Level 3

- Parity-based protection: based on the exclusive OR (XOR), which is
  reversible.
- Example:

        01101010  (data byte 1)
    XOR 11001001  (data byte 2)
        --------
        10100011  (parity byte)

- Recovery:

        11001001  (data byte 2)
    XOR 10100011  (parity byte)
        --------
        01101010  (recovered data byte 1)
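The XOR example above generalizes to any number of data blocks: the parity block is the bytewise XOR of all of them, and any one lost block is recovered by XOR-ing the parity with the survivors. A minimal sketch (the helper name is illustrative):

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data1 = bytes([0b01101010])
data2 = bytes([0b11001001])

parity = xor_blocks(data1, data2)         # 0b10100011, as in the slide
recovered = xor_blocks(data2, parity)     # reconstructs data1

print(bin(parity[0]), recovered == data1)   # 0b10100011 True
```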
RAID Level 3 (cont.)

- Data is divided evenly into N sub-blocks (N = number of disks, typically
  4 or 5)
- Computing the parity bytes generates one additional sub-block
- Sub-blocks are written in parallel to N+1 disks
- For best performance, data should be of size N * sector size
- Problems with RAID level 3:
  - all disks participate in every operation => contention for applications
    with high access rates
  - if the data size is less than N * sector size, the system has to read
    the old sub-blocks to calculate the parity bytes
- RAID level 3 is good for high transfer rates

RAID Level 4

- Parity bytes for N disks are calculated and stored on a separate parity disk
- Files are not necessarily distributed over all N disks
- Read operations:
  - determine the disks holding the requested blocks
  - read the data from these disks
- Write operations:
  - retrieve the old data from the sector being overwritten
  - retrieve the parity block from the parity disk
  - remove the old data from the parity block using XOR operations
  - add the new data to the parity block using XOR
  - store the new data
  - store the new parity block
- Bottleneck: the parity disk is involved in every write operation
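The RAID 4 write steps above amount to a read-modify-write parity update: because XOR is its own inverse, new_parity = old_parity XOR old_data XOR new_data. A minimal sketch with single-byte "sectors" (real controllers apply this per sector):

```python
def update_parity(old_parity, old_data, new_data):
    """Remove old data from the parity and add the new data, per byte."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# A stripe of three data sectors plus their parity
d = [bytes([0x0F]), bytes([0x33]), bytes([0x55])]
parity = bytes([d[0][0] ^ d[1][0] ^ d[2][0]])   # 0x69

# Small write: overwrite sector 1 only, touching just two disks
new_d1 = bytes([0xAA])
parity = update_parity(parity, d[1], new_d1)
d[1] = new_d1

# Parity still equals the XOR of all current data sectors
assert parity == bytes([d[0][0] ^ d[1][0] ^ d[2][0]])
print(hex(parity[0]))   # 0xf0
```

This is why small writes on RAID 4/5 cost four disk accesses (read old data, read old parity, write new data, write new parity), while a full-stripe write needs no reads at all.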
RAID Level 5

- Same as RAID 4, but the parity blocks are distributed over the disks, e.g.:

  Disk 1    Disk 2    Disk 3    Disk 4      Disk 5
  Block 1   Block 2   Block 3   Block 4     P(1,2,3,4)
  Block 5   Block 6   Block 7   P(5,6,7,8)  Block 8

RAID Level 6

- Tolerates the loss of more than one disk
- Collection of several techniques, e.g.:
  - P+Q parity: compute parity bytes using two different algorithms and
    store the two parity blocks on different disks
  - two-dimensional parity, with dedicated parity disks for each row and
    column of the disk array
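The layout sketched above rotates the parity block one disk to the left on each stripe. A minimal sketch of that placement rule (this is one common variant; real RAID 5 implementations use several different rotation schemes):

```python
def parity_disk(stripe, num_disks):
    """0-based disk index holding the parity block of the given stripe.

    Assumes the parity starts on the last disk and rotates left by one
    disk per stripe, matching the layout in the figure above.
    """
    return (num_disks - 1) - (stripe % num_disks)

# With 5 disks: stripe 0 -> disk 4 (P(1,2,3,4) on disk 5),
#               stripe 1 -> disk 3 (P(5,6,7,8) on disk 4), ...
print([parity_disk(s, 5) for s in range(6)])   # [4, 3, 2, 1, 0, 4]
```

Distributing the parity this way removes the RAID 4 bottleneck: every disk carries an equal share of the parity-update traffic.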
RAID Level 10

- RAID level 1 + RAID level 0 (mirroring + striping)
- Also available: RAID 53 (RAID 0 + RAID 3)

Comparing RAID levels

RAID level   Protection      Space usage           Good at...        Poor at...
0            None            N                     Performance       Data protection
1            Mirroring       2N                    Data protection   Space efficiency
2            Hamming codes   ~1.5N                 Transfer rate     Request rate
3            Parity          N+1                   Transfer rate     Request rate
4            Parity          N+1                   Read req. rate    Write performance
5            Parity          N+1                   Request rate      Transfer rate
6            P+Q or 2-D      (N+2) or (MN+M+N)     Data protection   Write performance
10           Mirroring       2N                    Performance       Space efficiency
53           Parity          N + striping factor   Performance       Space efficiency
Two Levels of Disk Striping (II): Using a Parallel File System

- Exposes the individual units capable of handling data, often called
  storage servers, I/O nodes, etc.
- Each storage server might use multiple hard drives under the hood to
  increase its read/write bandwidth
- A metadata server keeps track of which parts of a file are on which
  storage server
- A single disk failure is less of a problem if each storage server uses a
  RAID 5 storage system under the hood

Parallel File Systems: Conceptual Overview

[Figure: compute nodes connected to a metadata server and storage servers 0-3]
File access on a parallel file system

1. The application on the compute node calls write()
2. The OS requests the list of relevant I/O nodes for this write operation
   from the metadata server
3. The metadata server sends storage IDs, offsets, etc.
4. The OS sends the data to the storage servers

Disk striping

- Requirements to improve the performance of I/O operations using disk
  striping:
  - multiple physical disks
  - network bandwidth and I/O bandwidth have to be balanced
- Problem of simple disk striping: for a fixed file size, the number of
  disks which can be used in parallel is limited

Prominent parallel file systems

- PVFS2
- Lustre
- GPFS
- NFS v4.2 (new standard currently being ratified)
Distributed vs. Parallel File Systems

Distributed file systems:
- offer access to a collection of files on remote machines
- typically a client-server based approach
- transparent for the user

NFS: The Network File System

- Protocol for a remote file service
- Stateless server (v3)
- Communication based on RPC (Remote Procedure Call)
- NFS provides session semantics: changes to an open file are initially
  only visible to the process that modified the file
- File locking is not part of the NFS protocol (v3), but often available
  through a separate protocol/daemon
- Client caching is not part of the NFS protocol (v3); the behavior is
  implementation dependent

Network File System (NFS): write path

1. The application on the compute node (= NFS client) calls write()
2. The OS forwards the data to the NFS server
3. The NFS daemon receives the data
4. The NFS daemon calls write()
Parallel vs. Distributed File Systems

- In distributed file systems, concurrent access to the same file from
  several processes is considered an unlikely event
- Distributed file systems assume different numbers of processors than
  parallel file systems
- Distributed file systems have different security requirements than
  parallel file systems