6. Results

This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to or from server disks per second, excluding any metadata, parity data, client request packet overhead bytes, server-to-server communication bytes, or any retried data. All server and client programs were started from a remote computer with a window for each node. The calculated results were displayed at a fixed time interval or packet count (presently set to 30,000 packets). The performance was measured by counting packets over a time interval and calculating the resulting throughput. The measured parameters were: time, packet count and distribution by packet type, file data in MB/sec, service time in milliseconds (msec) per block, percent cache hit/miss, inq fullness (measured in percent of capacity), blocks/writev (low, high, and average), msec/writev (average), lines/lseek (average), and reason for cleaning (the percent of lines cleaned for starvation or because the cache was too full). Only the throughput in MB/sec is reported here; the other measured parameters were used to better understand the file system activity.

The tests were performed on file systems with one, two, three, or four servers. In all cases, when a file was striped, each stripe was striped on all disks minus one disk (the parity disk, when it was needed). For example, when testing parity on a four disk file system, the data for each stripe occupied three disks and one disk was used for the stripe parity. When a three disk file system was tested, then for each stripe the data occupied two disks and a third disk held the parity information.

When comparing these two cases it is important to note that with a four disk file system, the data is 75% of the total information that is carried over the network and written to disk, while with the three disk file system it is 66.6%; therefore the expected MB/sec/disk of file data is better with a four disk file system than with a three disk file system. However, the improvement in performance due to the increased number of servers is reduced by the switch performance as the number of servers grows, as shown in Section. To illustrate the effect of the number of servers on the amount of file data that is written on a disk, consider this example: assume that the network can consistently deliver 12MB/sec/disk, that the disk can write 12MB/sec, and that we are writing a file with parity on a four or three disk file system. The ratio of data to parity is 3:1 and 2:1 respectively; therefore, using a four disk file system each disk will write 9MB/sec of file data and 3MB/sec of parity data, but with a three disk file system only 8MB/sec of file data and 4MB/sec of parity data.
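To make this arithmetic concrete, the small C program below (my own illustration, not part of RAMA) computes the per-disk split between file data and parity for three and four disk file systems, using the 12MB/sec per-disk limit assumed in the example.

    #include <stdio.h>

    /* Worked form of the example above: each disk absorbs 12 MB/sec in total,
     * and one block per stripe is parity, so with n disks a fraction (n-1)/n
     * of each disk's bandwidth carries file data and 1/n carries parity. */
    int main(void)
    {
        const double per_disk_mb = 12.0;   /* assumed network/disk limit per disk */

        for (int ndisks = 3; ndisks <= 4; ndisks++) {
            double data   = per_disk_mb * (ndisks - 1) / ndisks;
            double parity = per_disk_mb / ndisks;
            printf("%d disks: %.1f MB/sec file data + %.1f MB/sec parity per disk\n",
                   ndisks, data, parity);
        }
        return 0;   /* prints 8.0 + 4.0 for three disks and 9.0 + 3.0 for four */
    }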

The amount of file data that was written or read in each test was about 50% of the total file system capacity, i.e., 1-4 GB, and each test was repeated tens of times before a result was recorded. The tests ran until the performance was consistent. Results taken while the server cache was still empty were ignored, and results for the ramp up of adding clients to the test were ignored as well. In each test, clients were added as long as there were almost no lost packets (a few per client process). I also attempted, whenever possible, to keep a constant ratio of server NICs to client NICs. However, this was difficult to achieve because the entire system consisted of 20 NICs, and each server has 2 NICs, so as the number of servers increased, there were not enough client NICs available to keep the ratio constant.

There were no other users on the system, but Linux system processes were running as normal system tasks and NFS was mounted as well. There were instances when the tests were repeated with better results than the ones reported in this document. The variance could be 5%-10%. Based on my discussions with the system administrator, this could be related to memory usage by system processes, or to system activity on the disks or the network.

The issues affecting the actual results are:

File size - large files are expected to perform better than small files overall, because small files have fewer blocks over which to amortize metadata overhead costs.

New file versus reused file - new files have more metadata overhead cost per data block than rewritten files, resulting in slower overall performance. This is because the file size needs to be continuously updated as a new file grows.

Files with parity for improved fault tolerance - incur the cost of writing parity information, resulting in slower overall performance compared to files without parity protection.

Rewriting incomplete stripes in files with parity protection - incurs the additional cost of reading missing stripe data in addition to writing parity information, resulting in slower overall performance compared to files with complete stripe data for parity calculations.

Striping - affects the distribution of data on a disk and across all the disks. Striping on more disks reduces the overall portion of parity data that is required and therefore improves performance. In addition, storing more data blocks on a disk before striping on the next disk improves performance because it results in more vectors per writev system call (see Section 3.2.1).

Overall switch performance - switch performance per server does not scale linearly, as explained in Section. As more ports participate in communication, aggregate switch performance improves, but performance per node is reduced.

6.1. Effect of Disk Organization on Performance

Disk Capacity

The disk capacity was fixed at 2 Gigabytes per disk.

Block Size

Two block sizes were tested: 8,192 bytes (8K) and 16,384 bytes (16K). Most of the reported results in this document are for the 8K block size. The 16K block size was tested in order to demonstrate that the results are in line with the expected numbers based on the effect of the larger block size on the network delivery capabilities, as explained in Section, and on the disk capabilities, as explained in Section. In both instances the performance is expected to be somewhat better. The reasons for choosing these block sizes are discussed in Section 3.3. In both cases the same number of file bytes per disk were striped.

Number of Lines and Overflow Lines

In all tests the disks were divided into 250 lines with 0 overflow lines. The code, however, supports overflow lines. The effect on disk performance when the overflow line(s) start to fill up was not tested.

Stripe Size

The stripe size that was used in the tests was derived from the baseline tests of the network and disk. Examining the disk writev performance in Section, we see that as the number of vectors increases, the write performance increases as well; however, once the number of vectors reaches 16 for 8KB blocks (8 for 16K blocks), the improvement per additional vector is smaller (see also Figure 5). I therefore selected the stripe size to be 16 and 8 blocks per disk, respectively.

Server Cache Size

The effect of cache size was not examined in my experiments. In all tests the cache had room for 64MB of data (8000 8K blocks or 4000 16K blocks), parity, or mirror inodes, whichever arrived from the network or was read from the disk. In addition, all tests were designed to minimize cache hits, because the intent was to measure the network and disk effect on the file system, not the caching improvement of reading from cache. It was not possible to completely eliminate cache hits, but in all tests the cache hits were between 0% and 3%, and tests were repeated many times. The selection of the cache size is based on the assumption that the blocks are distributed roughly evenly among the disk lines and that, in order to optimize cache cleaning, a line needs to have on average a full stripe to be written to disk efficiently.

With 8000 cache blocks, 250 lines, and 16 blocks/stripe/disk for 8K blocks (8 blocks/stripe/disk for 16K blocks), this is equal to about 2 stripes per line. Indeed, the tests showed that this assumption was correct, and the average number of vectors per writev was more than one stripe. An 8000 block cache fills in about 8 seconds if data arrives at 12 MB/sec. If the cache had only 4000 blocks, with an average of 1 stripe per line, some lines would have less than a full stripe to be cleaned to disk. At 16 blocks per writev the disk can write 11.86 MB/sec, but with 8 blocks per writev only 7.46 MB/sec, a 37% reduction in performance for that line. In general, if the cache is not large enough, it will become full faster and cleaning will have to be performed with fewer dirty blocks in each line, hence a small vector count, yielding slower disk writes. If the cache is very large, it will initially take longer to fill. However, since the arrival rate of data is the same, no matter how large the cache is beyond some minimal size, cleaning will occur just as often, at the rate that data arrives into the cache. What I need to guarantee is that there is a minimal average number of blocks per line available in the cache, so that enough blocks per line are written in each cleaning cycle for optimal write time. Table 2 shows the relative improvement in disk write performance, and the maximum improvement for disk writev, based on the number of blocks written when cleaning a line (8K blocks, 16 blocks per disk line).

If the file system has many blocks that are frequently reused (read after write, read repeatedly, rewritten often), then a larger cache is advantageous because these reused blocks can wait in the cache between uses and repeated disk access is limited.
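The effect of the vector count shows up directly in the cleaning path. The sketch below is my own illustration, not the RAMA source: it gathers the dirty blocks of one line into a single writev call, so a line holding about two full 16-block stripes (the roughly 32 blocks per line that 8000 cache blocks spread over 250 lines give on average) is written with one seek and one system call. BLOCK_SIZE, MAX_VECTORS, and the calling convention are assumptions.

    #include <sys/uio.h>
    #include <unistd.h>

    #define BLOCK_SIZE  8192          /* 8K blocks, as in most of the tests     */
    #define MAX_VECTORS 32            /* about two full 16-block stripes        */

    /* Clean one line: write up to 'nblocks' dirty cache blocks with a single
     * writev() after a single lseek() to the start of the line on disk.        */
    ssize_t clean_line(int disk_fd, off_t line_offset, char *blocks[], int nblocks)
    {
        struct iovec iov[MAX_VECTORS];
        int n = (nblocks < MAX_VECTORS) ? nblocks : MAX_VECTORS;

        for (int i = 0; i < n; i++) {
            iov[i].iov_base = blocks[i];
            iov[i].iov_len  = BLOCK_SIZE;
        }
        if (lseek(disk_fd, line_offset, SEEK_SET) == (off_t)-1)
            return -1;
        return writev(disk_fd, iov, n);   /* one transfer; more vectors, faster */
    }

The more dirty blocks a line has accumulated when it is cleaned, the more vectors this single call carries, which is why a cache that is too small (few dirty blocks per line) drags the write rate down.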

Amount of Metadata

The metadata consists of directory blocks, inodes, and mirror inodes. The directory blocks occupy 0.28% of the total disk space for 8K byte blocks. In addition, each file carries overhead that is equal to the size of two inodes, written on two separate disks. One inode is on the disk holding data block 0, and the same inode is mirrored on the next sequential disk for fault tolerance. In the 8K byte block organization, 136 inodes fit in a block, so about 1.5% of a block is needed per file to store inodes. Therefore, for the smallest file, where the file size equals 1 block of data, the metadata overhead is 3% plus the space for the directories, for a total of 3.28%.

Table 2. Relative Improvement of Disk Write Performance With the Number of Stripes per Writev (8K Blocks, 16 Blocks/Stripe/Disk); columns: number of stripes per writev, writev MB/sec, and percent improvement over a single stripe.

6.2. Effect of Network Switch on Performance

My expectation is that aggregate performance will scale as more servers are added to the system; however, the scaling is not linear, based on observations that were explained in Figure.

Client Server Ratio

The client:server ratio deals with four separate issues:

How many client processes are sending/receiving data from the servers.

As more client processes are involved in data exchange with the servers, the likelihood of a server being idle is reduced. The reason is that while one client process is waiting for a packet from the server, another client is ready with a packet for the server. For best performance I want enough clients to ensure that the servers are continuously fed with work. The number of client processes is not unbounded, however, because I want to minimize the number of timed out requests, which waste network bandwidth and therefore reduce overall file system performance. In all tests client processes were added as long as the performance improved by the addition. This number varied with each test because the client nodes were a mix of different CPUs, memory, NICs, and speeds. The range was between 1 and 20 per node. All the write data in the tests was generated directly into the client buffers, so no data copies were involved.

How many client nodes are sending/receiving data from the servers.

All processes communicating with servers in a particular client node share the resources (system buffers, NICs, CPU) of that node, so a node has limited capability to generate work for the servers. More client nodes can generate more work for the servers, and are more likely to generate enough work so that no server is ever idle. However, more client nodes contend for network resources, therefore their capability to produce work for the servers is bounded by the switch capabilities. If all servers in a system can serve x packets per second, distributed evenly among them, there is no advantage in doing the work from a single client over many clients, and therefore, in my testing, all available clients were participating in the experiments.

How many client NICs are sending/receiving data from the servers.

In general, more NICs can improve packet delivery over the attached network, as shown in Figure 1 and Figure 4 and explained in Section. All the NICs in these experiments were enabled, and some were channel bonded as detailed in Section.

How many server NICs are sending/receiving data from the clients.

Server NICs are fixed at two NICs per server and are channel bonded.

Flow Control and Time-out

To control congestion, an initial time-out value for replies was set. When a reply did not arrive, the time-out value was increased by one millisecond. This approach made the time-out value elastic in one direction and appropriate for the specific network condition at the time of the congestion. In effect, the elasticity of the time-out value directly controls the number of packets per second flowing in the switch. If a server can service 1000 packets per second and we attempt to send 1100 packets per second, then at least 100 of them will time out without getting a reply. These clients will then attempt to retry, but with a longer time-out, thereby not attempting again for a longer period of time, effectively reducing the number of packets per second that are sent to the server, until the rate of requests can be handled by the servers. Since the time-out value represents the current congestion condition at the switch or the servers, it is desirable to share this information among all threads in a client process. The time-out value is therefore protected by a semaphore. The current implementation does not reduce a timer value that is too high (larger than some minimum threshold) when a reply does arrive on time, for several reasons. The semaphore makes it costly to modify the time-out value because locking and unlocking are involved, and since in most cases a reply does arrive on time, there would be too many inspections of a timer that needs to stay unchanged. In addition, when replies arrive on time, it does not matter how long the timer is set for, since it is not needed for the arriving replies. If I let the timer grow unbounded, it is possible that the timer will reach values that are too high; therefore, an upper limit is set. The initial time-out values are set, based on the packet size, to between 400 and 1500 milliseconds. The limit is set to 3000 milliseconds.
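A minimal sketch of this back-off is shown below, in C, with a pthread mutex standing in for the semaphore; the names and the locking primitive are my assumptions, while the policy (grow by one millisecond on a miss, never shrink, cap at 3000 milliseconds) follows the text above.

    #include <pthread.h>

    #define TIMEOUT_CAP_MS 3000L

    /* One elastic time-out shared by all threads of a client process. */
    static long            timeout_ms   = 400;   /* initial value depends on packet size */
    static pthread_mutex_t timeout_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called only when a request times out: grow the shared value by 1 ms,
     * up to the cap, and return the value to use for the retry. */
    long timeout_on_miss(void)
    {
        long t;
        pthread_mutex_lock(&timeout_lock);
        if (timeout_ms < TIMEOUT_CAP_MS)
            timeout_ms += 1;
        t = timeout_ms;
        pthread_mutex_unlock(&timeout_lock);
        return t;
    }

    /* Replies that arrive on time never touch the lock and never shrink the
     * value, which is the trade-off discussed above. */
    long timeout_current(void)
    {
        return timeout_ms;
    }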

6.3. Read Performance

Reading data involves two types of messages. One is a file open request and reply, and the other is a read data request and reply. The open request is needed so that the file striping parameters and file size are known to the reading client for the hash function calculation. The parity blocks, parity flag, and the mirror inodes are not involved in data reading. The cost of the open request is amortized over all the data blocks read from the file, and since it involves a single small packet exchange over the switch and possibly a disk read for the block that contains the file inode (which is most likely cached from previous open and create requests in the same line), its overhead is minimal. Therefore only large files were tested for read performance. All but one of the tests were designed to completely eliminate cache hits, so all data was read from disk when requested by the clients. The read data was placed in the cache for future use, because that is the design of the file system; however, the data block was never reused, and the corresponding cache slot was replaced by another read block when space was needed. The disks were performing reads only. The data blocks were read sequentially.
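The hash calculation itself is not spelled out in this section; the sketch below is only an illustration of the kind of client-side mapping the open reply makes possible, with struct and field names of my own choosing. Given the striping parameters, the client can locate the server for any block number without further metadata traffic.

    /* Illustrative only: not the actual RAMA hash function. */
    struct stripe_params {
        int nservers;          /* disks that hold this file's data              */
        int blocks_per_disk;   /* blocks written on one disk before moving on,
                                  e.g. 16 for the 8K block tests                */
    };

    /* Map a file block number to the index of the server/disk that stores it. */
    int block_to_server(const struct stripe_params *sp, long block_no)
    {
        long unit = block_no / sp->blocks_per_disk;   /* which striping unit    */
        return (int)(unit % sp->nservers);            /* round-robin over disks */
    }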

The specific disks that were used have a cache buffer (track buffer) of 2MB each. This buffer is not accessible to the file system directly, but when the file system requests data from the disk and that data is in the track buffer already, the disk head does not need to move, and the data is transferred from the track buffer to the file system buffer. This transfer is relatively quick because it does not require a mechanical motion (of the disk head). When the disk track buffer is fully loaded, it is considered warm. What was observed in the read tests is that the read performance improves as the number of file blocks in a line is increased. In other words, the read performance is directly related to the striping parameters, the number of blocks striped on a single disk before moving to another disk. The explanation for this observation is that the disk reads ahead when a block is requested, and keeps the additional data in the track buffer until the space is needed for other reads. When starting the server programs, the disk track buffer is empty and therefore each read request involves the physical reading of data from the disk surface. As the test progressed, the disk cache warmed and the read performance improved as some of the read blocks were satisfied from the track buffer. The upper limit on the performance was the network speed. The read performance fluctuates in a wide range depending on the ratio of reads that are satisfied from the track buffer to the ones that require a disk head movement. The reads were for files striped at 1 and 16 blocks/disk and were requested in groups of 1 to 16. When more than one block was requested, all the requests were sent to the server before any of the replies were examined by the client. This process is done in a tight loop and the requests leave the client very fast compared to the rate of replies coming back from the server.

When several clients are doing this same loop, possibly to the same server, a large number of read requests can queue at the server, and some are dropped because the server request queue is limited in capacity (to force a time-out on the client side when the server is too busy). Indeed, when the group was large, some packets were lost and requests had to be retried. This is also related to the flow control issues that were discussed in Section. There were always enough requests to keep the servers busy.

The read performance for a four disk system was 0.9MB/sec/server when the files were striped at 1 block/disk, which is consistent with the base read test for the disk as shown in Figure 12. The reason for this performance is that the read-ahead information in the disk cache is wasted when the next block that is read from the same file is on another disk. When the files were striped at 16 blocks/disk, the performance fluctuated between 2MB/sec/disk and 10MB/sec/disk, but stayed most of the time in the middle of that range. This performance shows the positive effect of the disk read ahead into the track buffer. Aggregate read performance is then 3.6MB/sec to 40MB/sec for a four disk file system. For comparison, a read test when all the data is in cache (100% cache hits in four servers) performed at 15.6MB/sec/disk, for an aggregate performance of 62.4MB/sec for a four disk file system. The performance of a smaller file system was not tested, since the dominating factor in read performance is the disk itself, not the network, and therefore similar results are expected.

6.4. Write Performance

Writing data involves several types of packet exchanges between client and server and between server and server. If the file already exists (in the case of rewrites), a file open request and reply is needed between client and server. For a new file, a file create message dialogue between the client and server needs to be exchanged, an inode for the new file needs to be created (an operation that might include reading inode blocks from the disk to check for duplicates), and possibly, space for the new file inode must be found if all existing inode blocks which are cached are full. In addition, the new/changed inode needs to be mirrored on another disk. All this, before any data block can be written. After the file is successfully created/opened, parity, file size update, and possibly invalidate parity messages are needed in addition to data, to guarantee consistency of file data. All of these operations are detailed in the following subsections. Because of the varied amount of overhead based on file type and how this overhead is amortized over different file types, this section is divided to distinguish results for small files, large files, files with and without parity, and new files versus rewritten files.

Large Files Without Parity

This section reports the performance results for writing files that are not fault tolerant and are therefore not protected by parity. The work involved in writing these files is as follows:

- Creating an inode for the file on one of the servers (inode disk).

- Copying the inode to another server (mirror disk). This is necessary even if the file is not fault tolerant because multiple inodes occupy the same disk block, and other inodes in the same block may be fault tolerant. This work is done by the inode disk and requires server-to-server communication.

- Sending the data blocks.

- Updating the inode with the new file size as it grows. This requires a small packet to be sent from the client to the inode disk after each data block is sent.

- Updating the mirror inode (on another disk) with the growing file size, at a lower frequency than the update of the inode. This work is done by the inode disk and requires server-to-server communication.

Because the files are large, the cost of the creation and mirroring of the inode is amortized over many data blocks and is a very small cost to performance. However, the cost of updating the inode and mirror inode with the growing size of the file is expensive, a penalty of 32%-39% for writing new files over rewriting old files. This can be reduced by less frequent updates, say every other block, once a second, or every stripe, but none of these options were measured (further discussed in Chapter 7). A different approach could be to write higher numbered blocks ahead of lower numbered blocks, in which case the file size is not updated for every data block, because lower numbered blocks look like rewrites to the file system, even when they are actually writing into holes in the file. Another solution could be to create files with an initial estimate of the file size and later (maybe when closing the file) update the size with the exact value.

I believe that creating a file which is really a big hole is not a problem, because it occupies no disk space, and attempting to read hole blocks from the file will return NULLs in the read buffers, but not EOF. The next two sections report the actual performance for new and rewritten large files without parity.

New Large Files Without Parity

The file system performance for writing new large files without parity is illustrated in Figure 11 and Figure 12. Figure 11 shows how an individual server performs among all servers in the system for 8K blocks. Figure 12 shows how the entire system scales as servers are added to the system. Both figures also show the stand-alone disk read and writev performance for reference. The disk writev performance line in the charts represents the value measured for a stand-alone disk with the same number of vectors for writev as the striping of the tested file data. The disk read performance line in the charts is for a single block read.

Examining the individual server performance, we observe that when the system grows from a single server to two servers, the performance of each server is lower than in a single server system. This is explained by several factors:

Server to server communication: The fact that a single server system has no inode mirroring activity eliminates server to server communication, thereby leaving more resources for file data processing. On a multiple server system, each inode is duplicated on a second server, but on a single server system this is not possible.

This fact reduces performance from 7.8 MB/sec/disk for a single server system to 7.1 MB/sec/disk for a two server system and 6.9 MB/sec/disk for a three or four server system, a cost to performance of roughly 9% and 11.5% respectively.

Figure 11. Individual Server Write Performance for Large Files Without Parity (MB/sec versus number of servers; reference curves: disk random seek & vectors/writev, disk random seek & read 1 blk; data curves: newFile-noParity (I+M+W), rewriteFile-noParity (W)).

Figure 12. Aggregate Server Write Performance for Large Files Without Parity (MB/sec versus number of servers; same reference and data curves as Figure 11).

HOL blocking: By having more servers, the switch performance is reduced, as explained in Section.

writev performance: The average blocks/writev (not shown in the figures) is reduced when writing inodes or mirror inodes because they consist of single blocks. When writing a single block, the disk can write at 1.4MB/sec (for 8K blocks), but when writing 16 blocks per writev call, the disk can write 11.8MB/sec.

When writing 17 blocks per writev (the case when the inode is in a line that also has a stripe ready to be transferred to disk), the performance is 12.3 MB/sec, not much better than a full stripe without the inode, but much better than writing the inode and data in two separate writev calls. When two full stripes are cleaned in a single writev call, the disk can perform at 14.4MB/sec, or 14.6MB/sec for two stripes and an inode. In general, when an inode or mirror block is cleaned alone, as opposed to being cleaned as one additional vector while cleaning a data stripe, the effective rate at which file data is written is reduced because of the overhead of writing inodes and mirror blocks. The exact reduction in speed depends on the ratio between single inodes and tag-along inodes to be cleaned per line.

In Figure 11, the performance of a single disk is shown to be better than the single-stripe writev performance shown in Figure 5. That is because in some cases there is more than one stripe ready to be cleaned in a line, while the reference figure shows the writev performance for a single stripe. For multiple servers, this benefit is reduced by the other costs outlined above, and therefore the performance line is below the reference line of full stripe writev performance. The aggregate performance of the entire system shows that the file system scales as servers are added, from 7.8MB/sec for a single server to 14.2MB/sec, 20.7MB/sec, and 27.6MB/sec for 2, 3, and 4 servers respectively.

Rewrite Large Files Without Parity

The best performance is achieved when rewriting files without parity. All communication between clients and servers consists of file data alone.

Inodes are not updated with file size and parity is not involved. The performance is expected to be as close to disk speed as possible, as long as the switch can deliver the packets as fast as the disk can rewrite them. As shown in Figure 2 and Figure 5, the network can deliver around 18MB/sec to a single server, and the disk can write at 11.86MB/sec when there are 16 vectors for the writev call. If there are more blocks on the line to write, the disk can perform better. Figure 11 and Figure 12 show that the actual performance for a single server is 12.4MB/sec, indicating that the average writev vector count is around 18. When a second server is added to the test, each individual server can write at 11.7MB/sec, a reduction of 5.6% per server compared with a single server, due to HOL blocking inside the connecting switch. Three and four servers perform at 10.2 MB/sec, a reduction of 17.7% per server compared with a single server. Three and four servers perform at about the same rate per server. This reduction in performance is consistent with the results achieved for the network baseline performance as explained in Section. The aggregate performance of the entire system shows that the file system scales as servers are added, from 12.4MB/sec for a single server to 23.4MB/sec, 30.6MB/sec, and 40.8MB/sec for two, three, and four servers respectively. 16K blocks are expected to perform somewhat better than 8K ones, and 32K blocks are expected to perform somewhat worse. Indeed, the 16K block results are 11.7MB/sec/disk and the 32K blocks perform at 9.1MB/sec/disk on a four disk file system. This is compared with the 10.2MB/sec/disk for the 8K blocks on the same system, and is consistent with the differences in the network performance and disk writev performance for the same parameters.

Large Files With Full Parity

In order to add fault tolerance, parity is added and written to the disks so that files can be recovered in case of a disk fault. The amount of additional data varies and depends on the parameters that are given by the user for each particular file when it is created or at any time later. If a file is created with the NOPARITY flag set, the file is not protected: it is lost if its inode is on a faulty disk, or data from a faulty disk is lost if the inode resides on a good disk. The amount of additional data that is written is proportional to the striping details of each file. A file that is striped on n disks carries 1/n of its size (in stripes, rounded up) as additional parity information. In addition to sending parity and data over the switch and writing them to the disk, there is an additional overhead that is caused by the need to keep data consistent at all times. The issue manifests itself as follows: assume that a client rewrites a portion of a stripe and then aborts for some reason without rewriting the changed parity block. If one of the disks where the stripe resides now crashes, the missing data can be recovered, but it will not be correct and, even worse, it will not be known that it is incorrect. Table 3 shows an example.

Table 3. Inconsistent Disk Example (columns: event, data on disk 1, data on disk 2, data on disk 3, data on parity disk; rows: the old stripe data, the changes made before disk 3 crashes, and the incorrect data on the recovered disk).
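To make the hazard concrete, the short program below walks through the scenario with illustrative values of my own (additive parity stands in for the real parity code to keep the arithmetic readable): a client rewrites one data block, aborts before updating the parity, and the later reconstruction of the crashed disk is wrong with no indication that it is wrong.

    #include <stdio.h>

    int main(void)
    {
        /* Old stripe: three data blocks and their parity (illustrative values). */
        int d1 = 2, d2 = 3, d3 = 4;
        int parity = d1 + d2 + d3;              /* parity block holds 9           */

        d1 = 7;                                  /* client rewrites disk 1 ...     */
        /* ... and aborts before rewriting the parity block, which still holds 9. */

        /* Disk 3 crashes and is reconstructed from the surviving disks.          */
        int recovered = parity - d1 - d2;        /* 9 - 7 - 3 = -1, not 4          */
        printf("recovered block value %d, correct value was 4\n", recovered);
        return 0;
    }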

The additional work that is required to guarantee consistency when a client faults is to tag a parity block as invalid while the data of the stripe that it protects is being modified. After the modification is complete, a new parity block is written and the invalid tag is erased. This action is only required when rewriting a stripe, not when a stripe is written for the first time, i.e., not for new files. The actual work involved is to send a small message to the parity disk holding the parity for the particular stripe, instructing the file system to tag the parity block as invalid. Only after the reply to this message is received by the client may the client proceed with the change to the data itself. The file system at the parity disk needs to do some additional work as well. It needs to monitor the invalid parity blocks and make sure that the update to each block arrives within some finite amount of time. If it does not, the server assumes that the client process died and the updated parity block is not coming, and it recreates a new parity block by reading the entire stripe from the servers holding the stripe data. At that time the parity block is correct and is no longer tagged. Only one tagging message is required for an entire stripe. The work that is required for inode and mirror inode maintenance is the same as with files that have no parity.
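The client-side ordering of these exchanges can be sketched as follows. The message functions are printf stubs of my own, not RAMA APIs; they only make the sequence described above explicit.

    #include <stdio.h>

    /* Placeholder message functions: each one stands for a packet exchange. */
    static int send_invalidate_parity(long stripe)
    { printf("-> parity disk: tag parity of stripe %ld invalid\n", stripe); return 0; }

    static int send_data_block(long stripe, int i)
    { printf("-> data disk %d: new block for stripe %ld\n", i, stripe); return 0; }

    static int send_parity_block(long stripe)
    { printf("-> parity disk: new parity for stripe %ld (tag cleared)\n", stripe); return 0; }

    /* Client-side sequence for rewriting one stripe of a parity-protected file. */
    static int rewrite_stripe(long stripe, int ndata)
    {
        /* 1. Invalidate the old parity and wait for the reply before touching data. */
        if (send_invalidate_parity(stripe) != 0)
            return -1;

        /* 2. Rewrite the data blocks on their servers. */
        for (int i = 0; i < ndata; i++)
            send_data_block(stripe, i);

        /* 3. Send the recomputed parity; if the client dies before this step, the
         *    parity server times out, reads the stripe, and rebuilds the parity.  */
        return send_parity_block(stripe);
    }

    int main(void) { return rewrite_stripe(42, 3); }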

Figure 13. Individual Server Write Performance for Large Files With Parity (MB/sec versus number of servers; reference curves: disk random seek & vectors/writev, disk random seek & read 1 blk; data curves: newFile-realTimeParity (I+M+W+P), rewriteFile-realTimeParity (W+V+P), rewriteFile-partialParity (R+W+V+P)).

New Large Files With Full Parity

The file system performance for writing large files with parity is demonstrated in Figure 13 and Figure 14. Figure 13 shows how an individual server performs among all servers in the system for 8KB blocks. Figure 14 shows how the entire system scales as servers are added to the system, also for 8KB blocks. Both figures also show the stand-alone disk read and write performance for reference.

The disk write performance is the value measured for a stand-alone disk with the same number of vectors for writev as the striping of the tested file data. The disk read performance is for a single block read.

Figure 14. Aggregate Server Write Performance for Large Files With Parity (MB/sec versus number of servers; same reference and data curves as Figure 13).

Since at least two disks are needed for fault tolerance protection, one for parity and at least one for data, the charts do not have any points for a single server system. No invalid tag messages are sent to the parity disks when new files are written.

In each system, at least one disk is used for parity and the rest hold data; therefore the amount of work done to generate the parity is fixed in all experiments, one block per stripe, and the amount of work done for data is what makes the difference in performance. The figures start with a two server system because at least two disks are needed to generate parity, one for the data and one for the parity. Examining the individual server performance, we observe that, contrary to files without parity protection, when the system grows from two servers to three and four servers, the performance of each server is better than in a system with fewer servers. The individual server performance shows that the file system scales as servers are added, from 4.6MB/sec/server for a two server system to 4.8MB/sec/server and 5.5MB/sec/server for 3 and 4 servers respectively. These measurements are for file data only. This is explained by several factors:

Inodes or mirror inode blocks may be combined with parity stripes when they are written to disk. This reduces the cost of writing inodes and mirror inodes to disk compared with the cost of writing them as individual blocks. This means that the cost of writing parity blocks is reduced by the saving in the cost of writing inodes and mirror inodes.

The cost of transferring a parity block through the switch is the same as transferring a data block, and writing a parity block to disk takes the same time as writing a data block to disk. Therefore, when striping on more disks, the portion of all communication and disk activity devoted to data is larger than for a system with fewer disks, and the performance scales.

For example, with a two disk system the amount of file data is the same as the amount of parity data (the data is in fact mirrored [56]), but with a three disk system 2/3 of the information is file data and 1/3 is parity, and with a four disk system 3/4 is data and 1/4 is parity.

HOL blocking: By having more servers, the switch performance is reduced, as explained in Section, but not enough to impair scalability.

The aggregate performance of the entire system shows that the file system scales as servers are added, from 9.2MB/sec for a two server system to 14.4MB/sec and 22MB/sec for 3 and 4 servers respectively.

Rewrite Large Files With Full Parity

The performance for rewriting large files is affected by the need to invalidate the parity block before any data block of a stripe is modified. This incurs the additional cost of sending a small message to the parity disk. However, performance should improve compared to new files, because no inode activity is required. In all tests, the clients did not abort before sending the updated parity blocks, and therefore there was no additional cost to generate the parity block by reading the stripe data from all other servers in the system. The individual server performance shows that the file system scales as servers are added, from 4.5MB/sec/server for two servers to 5.7MB/sec/server and 6.5MB/sec/server for three and four servers respectively. These measurements are for file data only. As we can see, for a two disk system the performance is lower for rewriting files than for new files by 0.1MB/sec/disk, because the cost of invalidating parity blocks is greater than the cost of updating inodes with a growing file size.

However, this cost is fixed per stripe, and therefore systems with 3 and 4 disks perform better when rewriting a file than when writing a new file. The aggregate performance of the entire system shows that the file system scales as servers are added, from 9MB/sec for two servers to 17.1MB/sec and 26MB/sec for three and four servers respectively.

Large Files With Partial Parity

When a file which is protected by parity is modified, the associated parity block needs to be modified as well. If the rewrite is to the entire stripe, the parity block is calculated and updated in exactly the same manner as if it were a new stripe, except for the fact that the old parity block is already tagged as invalid, as explained in Section. If, however, only some of the blocks belonging to a stripe are rewritten, a mechanism is needed to get the rest of the information that is needed in order to calculate the modified parity block. There are two options:

- read before write - read the data and parity before they are rewritten, subtract the old data from the old parity, and add the new data to create the new parity block.

- read missing data - read the data blocks that are not modified and add them to the new data to generate a new parity block.

Both solutions require reading old data in order to generate the new parity block. Which one is better? The answer depends on the portion of data that is modified in each stripe, as shown in Table 4 and Table 5. Method 2 is better when more blocks are rewritten, and Method 1 is better when few blocks are rewritten.

The amount of additional data depends on the striping parameters that are given by the user for each particular file when it is created.

number of data blocks rewritten (striping on 4 disks plus parity) | required number of read blocks | required number of write blocks
1 | 1 data block, 1 parity block | 1 data block, 1 parity block
3 | 3 data blocks, 1 parity block | 3 data blocks, 1 parity block

Table 4. Rewrite Data Cost Using the Read Before Write Method

number of data blocks rewritten (striping on 4 disks plus parity) | required number of read blocks | required number of write blocks
1 | 3 data blocks | 1 data block, 1 parity block
3 | 1 data block | 3 data blocks, 1 parity block

Table 5. Rewrite Data Cost Using the Read Missing Data Method

When partial stripes are written for a new file, the procedure is exactly the same, except for the invalid tag, which is not needed and is not sent. The client does not know ahead of time that the stripe is going to be partial. Only after waiting for some finite amount of time for the complete stripe to be ready for parity calculation does the client realize that the entire stripe will not be ready in the allotted time, and a partial stripe parity block is calculated and sent to the parity disk. The parity disk then proceeds as detailed above. The parity disk has no knowledge of why the parity block is partial. It could belong to a file with a hole, to a new file which is not being created sequentially, or to a slow client that did not finish the stripe in the allowed time. In all of these cases, the server behaves the same, in that it reads the missing data (holes) from the other servers and builds a new parity block. The invalid tag is then erased.
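The choice between the two methods follows directly from the block counts in Table 4 and Table 5. The sketch below is my own formulation, not RAMA code: for a stripe of n data blocks of which k are rewritten, read-before-write reads the k old data blocks plus the old parity, while read-missing-data reads the n-k untouched blocks.

    #include <stdio.h>

    /* Blocks that must be read to rebuild the parity of one stripe when k of
     * its n data blocks are rewritten. */
    static int reads_before_write(int n, int k) { (void)n; return k + 1; }  /* k old data + old parity */
    static int reads_missing_data(int n, int k) { return n - k; }           /* untouched data only     */

    int main(void)
    {
        const int n = 4;                       /* striping on 4 disks plus parity */
        for (int k = 1; k <= n; k++) {
            int a = reads_before_write(n, k);
            int b = reads_missing_data(n, k);
            printf("%d of %d blocks rewritten: read-before-write reads %d, "
                   "read-missing-data reads %d -> use %s\n",
                   k, n, a, b, (b < a) ? "read missing data" : "read before write");
        }
        return 0;
    }

For k = 1 and k = 3 this reproduces the read counts of Table 4 and Table 5, and it shows why Method 1 wins when few blocks are rewritten and Method 2 wins when most of the stripe is rewritten.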

The work that is required for inode and mirror inode maintenance is the same as with files that have no parity. All tests were done on existing files, and each stripe was missing all the blocks on one of the disks in the stripe. The work involved was then: tag the old parity block as invalid, rewrite the data (the blocks of one data disk for a three disk system, of two data disks for the four disk system), send the partial parity block to the server, have the server read the missing data block from the server holding the missing data, and write the new parity block. Since all files were preexisting, no inode activity was needed. The individual server performance shows that the file system scales as servers are added, from 2.9MB/sec/server for a three server system to 4.1MB/sec/server for a four server system. As we can see, the performance for rewriting partial stripes of parity-protected files is lower than for completely rewriting files with parity, by 49% for a three disk system and 37% for a four disk system, because reading is done at a rate of 0.94MB/sec. Whenever possible, reading is done in parallel if more than a single data block is missing from the parity block. The aggregate performance of the entire system shows that the file system scales as servers are added, from 8.7MB/sec for three servers to 16.4MB/sec for four servers.

Small Files Without Parity

Rewriting files without parity involves no inode activity, and therefore requires the same operations regardless of the file size. Therefore, rewriting small files is not detailed here.

It is expected that rewriting small files has about the same performance as rewriting large files with identical striping parameters. Writing new small files is affected by the inode creation, mirroring, and file open and close operations, which are amortized over fewer data blocks. In all tests the file size was 16 blocks striped on one disk. The reason is that in order to compare these results with the results for large files, we need the 16 blocks/disk striping. In order to make the inode operations as costly as possible (for comparison purposes), an inode creation was forced every 16 blocks by making the file size 16 blocks. The performance for writing small files is 4.1MB/sec/disk for a one disk file system and 3.1MB/sec/disk for two, three, or four disk file systems. The one server file system requires no mirror block updates and therefore performs better. This is a reduction of 49% and 57% compared with large files on the same file systems, and is due to the inode update work, which is amortized over fewer data blocks. The aggregate performance of writing small files without parity is 4.1, 7.2, 10.3, and 13.4MB/sec for one, two, three, and four server file systems respectively.

Small Files With Parity

As with files without parity, rewriting files with parity or partial parity involves the same operations regardless of the file size. Therefore, rewriting small files with full or partial parity is not detailed here. It is expected that rewriting small files with full or partial parity has the same performance as rewriting large files with identical striping parameters.


More information

ICS Principles of Operating Systems

ICS Principles of Operating Systems ICS 143 - Principles of Operating Systems Lectures 17-20 - FileSystem Interface and Implementation Prof. Ardalan Amiri Sani Prof. Nalini Venkatasubramanian ardalan@ics.uci.edu nalini@ics.uci.edu Outline

More information

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition Chapter 11: Implementing File Systems Operating System Concepts 9 9h Edition Silberschatz, Galvin and Gagne 2013 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory

More information

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner CPS104 Computer Organization and Programming Lecture 16: Virtual Memory Robert Wagner cps 104 VM.1 RW Fall 2000 Outline of Today s Lecture Virtual Memory. Paged virtual memory. Virtual to Physical translation:

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

To Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1

To Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1 Contents To Everyone.............................. iii To Educators.............................. v To Students............................... vi Acknowledgments........................... vii Final Words..............................

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Outlook. File-System Interface Allocation-Methods Free Space Management

Outlook. File-System Interface Allocation-Methods Free Space Management File System Outlook File-System Interface Allocation-Methods Free Space Management 2 File System Interface File Concept File system is the most visible part of an OS Files storing related data Directory

More information

CSE 451: Operating Systems Winter Page Table Management, TLBs and Other Pragmatics. Gary Kimura

CSE 451: Operating Systems Winter Page Table Management, TLBs and Other Pragmatics. Gary Kimura CSE 451: Operating Systems Winter 2013 Page Table Management, TLBs and Other Pragmatics Gary Kimura Moving now from Hardware to how the OS manages memory Two main areas to discuss Page table management,

More information

File Systems. CS170 Fall 2018

File Systems. CS170 Fall 2018 File Systems CS170 Fall 2018 Table of Content File interface review File-System Structure File-System Implementation Directory Implementation Allocation Methods of Disk Space Free-Space Management Contiguous

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University CS 370: SYSTEM ARCHITECTURE & SOFTWARE [MASS STORAGE] Frequently asked questions from the previous class survey Shrideep Pallickara Computer Science Colorado State University L29.1 L29.2 Topics covered

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Hard Disk Drives. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

Hard Disk Drives. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hard Disk Drives Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Storage Stack in the OS Application Virtual file system Concrete file system Generic block layer Driver Disk drive Build

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin) : LFS and Soft Updates Ken Birman (based on slides by Ben Atkin) Overview of talk Unix Fast File System Log-Structured System Soft Updates Conclusions 2 The Unix Fast File System Berkeley Unix (4.2BSD)

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Silberschatz, et al. Topics based on Chapter 13

Silberschatz, et al. Topics based on Chapter 13 Silberschatz, et al. Topics based on Chapter 13 Mass Storage Structure CPSC 410--Richard Furuta 3/23/00 1 Mass Storage Topics Secondary storage structure Disk Structure Disk Scheduling Disk Management

More information

CSE 120: Principles of Operating Systems. Lecture 10. File Systems. February 22, Prof. Joe Pasquale

CSE 120: Principles of Operating Systems. Lecture 10. File Systems. February 22, Prof. Joe Pasquale CSE 120: Principles of Operating Systems Lecture 10 File Systems February 22, 2006 Prof. Joe Pasquale Department of Computer Science and Engineering University of California, San Diego 2006 by Joseph Pasquale

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

OPERATING SYSTEM. Chapter 12: File System Implementation

OPERATING SYSTEM. Chapter 12: File System Implementation OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole File System Performance File System Performance Memory mapped files - Avoid system call overhead Buffer cache - Avoid disk I/O overhead Careful data

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010

College of Computer & Information Science Spring 2010 Northeastern University 12 March 2010 College of Computer & Information Science Spring 21 Northeastern University 12 March 21 CS 76: Intensive Computer Systems Scribe: Dimitrios Kanoulas Lecture Outline: Disk Scheduling NAND Flash Memory RAID:

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006 November 21, 2006 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds MBs to GBs expandable Disk milliseconds

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

COSC 6385 Computer Architecture. Storage Systems

COSC 6385 Computer Architecture. Storage Systems COSC 6385 Computer Architecture Storage Systems Spring 2012 I/O problem Current processor performance: e.g. Pentium 4 3 GHz ~ 6GFLOPS Memory Bandwidth: 133 MHz * 4 * 64Bit ~ 4.26 GB/s Current network performance:

More information

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD. File System Implementation FILES. DIRECTORIES (FOLDERS). FILE SYSTEM PROTECTION. B I B L I O G R A P H Y 1. S I L B E R S C H AT Z, G A L V I N, A N

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Worst-case Ethernet Network Latency for Shaped Sources

Worst-case Ethernet Network Latency for Shaped Sources Worst-case Ethernet Network Latency for Shaped Sources Max Azarov, SMSC 7th October 2005 Contents For 802.3 ResE study group 1 Worst-case latency theorem 1 1.1 Assumptions.............................

More information

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University Chapter 10: File System Chapter 11: Implementing File-Systems Chapter 12: Mass-Storage

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

2. PICTURE: Cut and paste from paper

2. PICTURE: Cut and paste from paper File System Layout 1. QUESTION: What were technology trends enabling this? a. CPU speeds getting faster relative to disk i. QUESTION: What is implication? Can do more work per disk block to make good decisions

More information

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early

More information

Chapter 12: File System Implementation

Chapter 12: File System Implementation Chapter 12: File System Implementation Silberschatz, Galvin and Gagne 2013 Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability Topics COS 318: Operating Systems File Performance and Reliability File buffer cache Disk failure and recovery tools Consistent updates Transactions and logging 2 File Buffer Cache for Performance What

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University File System Internals Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today s Topics File system implementation File descriptor table, File table

More information

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition,

Chapter 17: Distributed-File Systems. Operating System Concepts 8 th Edition, Chapter 17: Distributed-File Systems, Silberschatz, Galvin and Gagne 2009 Chapter 17 Distributed-File Systems Background Naming and Transparency Remote File Access Stateful versus Stateless Service File

More information

I/O Device Controllers. I/O Systems. I/O Ports & Memory-Mapped I/O. Direct Memory Access (DMA) Operating Systems 10/20/2010. CSC 256/456 Fall

I/O Device Controllers. I/O Systems. I/O Ports & Memory-Mapped I/O. Direct Memory Access (DMA) Operating Systems 10/20/2010. CSC 256/456 Fall I/O Device Controllers I/O Systems CS 256/456 Dept. of Computer Science, University of Rochester 10/20/2010 CSC 2/456 1 I/O devices have both mechanical component & electronic component The electronic

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2010 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS 28 CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS Introduction Measurement-based scheme, that constantly monitors the network, will incorporate the current network state in the

More information

Lecture 18: Reliable Storage

Lecture 18: Reliable Storage CS 422/522 Design & Implementation of Operating Systems Lecture 18: Reliable Storage Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous versions of

More information

File System Implementation

File System Implementation File System Implementation Last modified: 16.05.2017 1 File-System Structure Virtual File System and FUSE Directory Implementation Allocation Methods Free-Space Management Efficiency and Performance. Buffering

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information