The Berkeley File System

The Original File System

Background

The original UNIX file system was implemented on a PDP-11. All data transports used 512-byte blocks, and file system I/O was buffered by the kernel. When UNIX was ported to faster machines like the VAX-11, the original file system bandwidth (typically 20 KByte/s) was too low. There is nothing in the file system interface that makes it inherently slow, so it is possible to keep the file system interface and change only the implementation to make it faster.

Why is the bandwidth low?

The file system used a 512-byte block size, which is too small given a 10 ms disk seek time. All inodes are located in the first blocks of the file system, which creates long seeks between the inode area and the data blocks on the disk; commands that alternately read inodes and data blocks (like ls -l) become especially inefficient. Finally, the data blocks of a file may be randomly located on the disk (at least in a file system that has been in use for a long time).
Transfer Time for a Page

How long does it take to transfer a page between primary storage and disk storage?

Notation:

T_page       total transfer time for the page
T_transport  transport time between primary storage and disk storage
T_wait       average rotational latency + seek time
V            transport speed for the page transport
L            page size

Transfer time:

T_page = T_transport + T_wait = L/V + T_wait

Typical values: V = 10 Mbit/s, L = 10000 bits, T_wait = 10 ms. This gives T_page = 1 ms + 10 ms = 11 ms. Thus for page sizes of 1 KByte or less, the wait time totally dominates, making the transfer time almost independent of page size.

A First Attempt

In the first attempt to improve the file system bandwidth, the block size was increased to 1024 bytes. The result was that the bandwidth more than doubled compared to the original file system: every file system operation could transport twice as much data, and the bigger block size reduced the number of indirect blocks. Even after this change, the file system could not use more than 4% of the disk bandwidth. The bandwidth was higher for a newly created file system but degenerated after some time in use (especially for read operations). The reason is that the list of free blocks is sorted in optimal order when the file system is created, but as files are created and removed the free list becomes increasingly random.
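The dominance of the wait time can be checked numerically. The sketch below (a hypothetical helper, not part of any file system code) plugs the typical values from the text into T_page = L/V + T_wait:

```python
# Sketch: page transfer time T_page = L/V + T_wait, using the
# typical values from the text (V = 10 Mbit/s, T_wait = 10 ms).
def transfer_time_ms(page_bits, speed_bits_per_s=10_000_000, wait_ms=10.0):
    """Total time in milliseconds to move one page between memory and disk."""
    transport_ms = page_bits / speed_bits_per_s * 1000.0
    return transport_ms + wait_ms

# A 10000-bit (~1 KByte) page costs 1 ms transport + 10 ms wait = 11 ms.
# Doubling the page size to 20000 bits adds only 1 ms (12 ms total),
# so bandwidth nearly doubles while the transfer time barely grows.
```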
The Fast Berkeley Filesystem

Methods to increase bandwidth:

1. Use a big block size.
2. Place related blocks close to each other.

Problem with block size

Big block sizes create large fragmentation losses. The solution is to use a variable block size, which requires an allocation strategy that ensures a file contains at most one block of less than maximum size.

Locality

Locating related data together requires that there are free blocks at the wanted locations. Not everything can be located locally.

File system organization

In order to improve locality, the file system is organized in cylinder groups. A cylinder group consists of a number of consecutive cylinders on the disk and contains:

1. A copy of the super block.
2. Inodes (statically allocated when the file system is created).
3. A bitmap to keep track of free blocks in the cylinder group.
4. Data blocks.

The super block is stored in every cylinder group in order to have redundant copies in case of a file system crash.
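The cylinder group contents listed above can be sketched as a small data structure. This is only an illustration with hypothetical field names, not the actual on-disk layout:

```python
# Minimal sketch (hypothetical field names) of what one cylinder
# group holds: a redundant super-block copy, statically allocated
# inodes, a free-fragment bitmap, and data blocks.
from dataclasses import dataclass, field

@dataclass
class CylinderGroup:
    superblock_copy: bytes              # redundant copy for crash recovery
    inodes: list                        # fixed count, set when the fs is created
    free_map: list                      # one bit per fragment (True = free)
    data_blocks: dict = field(default_factory=dict)

    def free_fragments(self):
        """Count of free fragments, as the bitmap would report it."""
        return sum(self.free_map)
```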
Block size

To be able to use a big block size without getting too large fragmentation losses, the big blocks are divided into smaller fragments. The block size and fragment size are selected (within certain limits) when the file system is created. In order to be able to describe a 2^32-byte file with only two indirect levels, the minimum block size is 4096 bytes. The fragment size cannot be smaller than the disk sector size (usually 512 bytes). A block may consist of 2, 4 or 8 fragments. A bitmap in the cylinder group keeps track of free blocks at fragment level.

Allocation of data blocks and fragments

New data blocks in files are allocated in write operations. In order to keep the bandwidth that the big block size gives, only the last block in a file is allowed to contain fragments.
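The 4096-byte minimum can be verified arithmetically. Assuming 4-byte block numbers (the usual pointer size in this design), an indirect block of size B holds B/4 pointers, so two indirect levels reach (B/4)^2 blocks of B bytes each:

```python
# Worked check of the 4096-byte minimum block size: with 4-byte
# block numbers, two indirect levels address (B/4)**2 blocks of
# B bytes, which must cover a 2**32-byte file.
def double_indirect_capacity(block_size):
    pointers = block_size // 4          # 4-byte block numbers assumed
    return pointers ** 2 * block_size   # bytes reachable via two levels

assert double_indirect_capacity(4096) == 2**32   # exactly 4 GByte
assert double_indirect_capacity(2048) < 2**32    # 2048-byte blocks fall short
```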
Allocating New Blocks and Fragments

Possibilities when writing new data to a file:

1. There is enough space left in an already allocated block or fragment to hold the new data. The new data are written into the available space.

2. The file contains no fragmented blocks (and the last block in the file contains insufficient space to hold the new data). If space exists in a block already allocated, that space is filled with new data. If the remainder of the new data amounts to more than a full block, new full blocks are allocated until less than a full block remains. For the last part, a block with the necessary fragments is used.

3. The file contains one or more fragments (and the fragments contain insufficient space to hold the new data). If the size of the new data + the size of the data already in the fragments > the block size: a new block is allocated, the old fragments are copied to the beginning of the new block, and allocation continues as in point 2. Otherwise, a block with the necessary fragments or a full block is allocated, and the old fragments + the new data are copied into the allocated space.

The problem with expanding a file one fragment at a time is that the data may be copied many times as a fragmented block expands to a full block. To reduce the number of copy operations, data should be written in units of full blocks when this is possible. This method is used by the C standard I/O library.
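The three cases above can be sketched as a decision function. The helper below is hypothetical (the real allocator works on inode and cylinder-group state, not plain integers); it only shows which branch a write would take:

```python
# Sketch of the three write cases (hypothetical helper, sizes in bytes).
# free_in_last: unused space in the file's last allocated block/fragments.
# ends_in_fragments: True if the file's last block is a fragment run.
# frag_bytes: data already stored in those fragments.
def plan_write(free_in_last, ends_in_fragments, frag_bytes, new_bytes, block_size):
    if new_bytes <= free_in_last:
        return "fill"                       # case 1: fits in allocated space
    if not ends_in_fragments:
        n_full = new_bytes // block_size    # case 2: full blocks first,
        return ("full_blocks", n_full)      # then fragments for the rest
    if frag_bytes + new_bytes > block_size:
        return "copy_frags_to_new_block"    # case 3a: then continue as case 2
    return "realloc_fragments"              # case 3b: a bigger fragment run
```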
Strategies for Placement of Data Blocks

The main strategy is to place data blocks to give the best possible locality. Data blocks belonging to the same file should preferably be placed in the same cylinder group at rotationally optimal distance. However, not everything can be placed locally, because a big file could fill up an entire cylinder group and make it impossible to find blocks at the wanted locations in the future. In order for the locality strategy to work, there should always be some free blocks in every cylinder group, so unrelated data should be placed in a way that gives an equal amount of free space in all cylinder groups.

If a file grows bigger than 48 KByte, the block allocation is redirected to another cylinder group; thereafter redirection is done for every MByte of allocated data. The new cylinder groups are chosen among the cylinder groups with more than the average number of free blocks.
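The redirection rule can be sketched as a simple predicate. This is an illustration only (the choice of the new cylinder group is elided, and the real allocator tracks this per inode, not as a pure function of file size):

```python
# Sketch of the redirection rule above: stay in the starting cylinder
# group for the first 48 KByte of a file, then switch to a new group
# at every further megabyte of allocated data.
def needs_redirect(file_bytes):
    KB, MB = 1024, 1024 * 1024
    if file_bytes < 48 * KB:
        return False                      # small files stay local
    return (file_bytes - 48 * KB) % MB == 0   # boundary: pick a new group
```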
Strategies for Placement of Inodes

Directory inodes: a new directory is placed in a cylinder group that has more free inodes than average and as few directories as possible.

File inodes: the inodes for all files in a directory should, if possible, be placed in the same cylinder group. A reason for this is the commonly used command ls -l, which has to read all file inodes in the directory.

Global and Local Allocation Routines

There are two levels of block allocation routines. The global allocation routines keep information about the number of free blocks and inodes in the different cylinder groups; they are used, for example, to locate the cylinder group with the maximum number of free blocks. The local allocation routines use the bitmap in the cylinder group to allocate a specific block.
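The directory-placement policy reads directly as a selection rule: among groups with at least the average number of free inodes, pick the one with the fewest directories. A minimal sketch, with a hypothetical per-group summary in place of the real cylinder-group structures:

```python
# Sketch of the directory-placement policy. Each group is summarized
# as (free_inodes, num_dirs); the real code reads this from the
# cylinder-group bookkeeping kept by the global allocation routines.
def pick_dir_group(groups):
    """Return the index of the chosen cylinder group."""
    avg = sum(free for free, _ in groups) / len(groups)
    candidates = [i for i, (free, _) in enumerate(groups) if free >= avg]
    return min(candidates, key=lambda i: groups[i][1])  # fewest directories
```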
Local Allocation Routines

When the local allocation routines are called, it may happen that the requested block is already in use. If the requested block is not available, the following strategy is used:

1. Use the next available block rotationally closest to the requested block on the same cylinder.
2. If there are no blocks available in the same cylinder, use a block in the same cylinder group.
3. If the cylinder group is full, quadratically rehash the cylinder group number to get a new cylinder group.
4. Finally, if the hash fails, apply an exhaustive search to all cylinder groups.

File systems that are parameterized to maintain at least 10 percent free space rarely use strategies 3 and 4.

Performance evaluation

Both read and write operations are faster in the new file system. The transfer speed in the new file system does not change with time (if at least 10 percent free space is maintained). In the new file system, read operations are always as fast as or (usually) faster than write operations; the reason is that the write operations also run the block allocation routines. In the old file system, write operations were about 50 percent faster than read operations. This is because write operations are asynchronous and the disk driver uses a SCAN algorithm to sort them, whereas when a file is read the read operations must be processed immediately. Read operations are synchronous in the new file system as well, but here the blocks are better ordered on the disk.
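The four-step fallback above can be sketched at the cylinder-group level. The helper is hypothetical and collapses steps 1-2 (which operate within one group) into a single local check; the quadratic rehash is modeled by doubling the probe step each iteration:

```python
# Sketch of the fallback search: stay in the requested group if it
# has free blocks (steps 1-2), otherwise quadratically rehash the
# group number (step 3), and finally scan all groups (step 4).
def find_block(groups_free, start_group, ncg):
    """groups_free: per-group free-block counts; returns a group index."""
    if groups_free[start_group] > 0:       # steps 1-2: allocate locally
        return start_group
    g, step = start_group, 1
    for _ in range(ncg):                   # step 3: quadratic rehash
        g = (g + step) % ncg
        step *= 2
        if groups_free[g] > 0:
            return g
    for g in range(ncg):                   # step 4: exhaustive search
        if groups_free[g] > 0:
            return g
    raise OSError("file system full")
```

With at least 10 percent free space maintained, the first branch almost always succeeds, which is why steps 3 and 4 are rarely exercised in practice.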