The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google)
Presented by: Shivesh Kumar Sharma (fl4164@wayne.edu), Fall 2015, 004395771
Overview The Google File System (GFS) is a scalable distributed file system designed mainly for large, distributed, data-intensive applications. GFS provides fault tolerance and delivers high aggregate performance to a large number of clients. Its design was driven by observations of Google's application workloads and storage needs. A GFS deployment provides hundreds of terabytes of storage across thousands of machines, accessed by hundreds of clients. In GFS, sustained throughput matters more than latency.
Topics Covered: Introduction; Design Overview (Assumptions, Interface, Architecture, Single Master, Chunk Size, Metadata, In-Memory Data Structures, Chunk Locations, Operation Log); System Interactions (Lease and Mutation Order, Data Flow); Questions Discussed; Conclusions
Introduction GFS is designed to meet the rapidly growing demands of Google's data processing needs. Several observations were kept in mind while designing GFS: Component failures are common, since the system consists of thousands of storage machines built from inexpensive commodity parts; constant monitoring, error detection, fault tolerance, and automatic recovery must therefore be integral to the system. Nowadays multi-GB files are common, so the traditional block size must be revisited: it is difficult to manage such data in KB-sized blocks even if the system can support them. Most files are mutated by appending rather than overwriting, so caching data blocks in the client loses its appeal, and co-designing the applications and the file system API increases flexibility.
Design Assumptions: The system is built from inexpensive commodity components, so failure rates are high. Large files must be managed efficiently; small files must be supported, but we need not optimize for them. Once written, files are seldom modified again; workloads are dominated by large streaming reads. Append atomicity with minimal synchronization overhead is very important. High sustained bandwidth is more important than low latency.
Interface: Files are organized hierarchically in directories and identified by pathnames. The usual operations are supported: create, delete, open, close, read, and write. GFS also provides snapshot and record append operations.
Architecture: A GFS cluster has a single master. Data is stored on multiple chunkservers and accessed by multiple clients. Files are divided into chunks of a fixed size (64 MB). Each chunk is identified by an immutable, globally unique 64-bit chunk handle assigned by the master at creation time. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated three times by default. The master holds all the metadata and controls system-wide activities such as chunk lease management, garbage collection, and chunk migration between chunkservers.
The master collects each chunkserver's state through periodic HeartBeat messages. Neither the client nor the chunkserver caches file data. Not caching data simplifies the client and the overall system and removes cache coherence issues. Chunkservers need not cache file data because chunks are stored as local files, so the Linux buffer cache already keeps frequently accessed data in memory.
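The read path above can be sketched in a few lines: because chunks have a fixed 64 MB size, the client translates a file byte offset into a chunk index locally, asks the master only for the chunk handle and replica locations, and then reads the bytes directly from a chunkserver. The function names below are illustrative, not GFS's actual API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def chunk_index(byte_offset: int) -> int:
    """Chunk index a byte offset falls into. The client sends
    (filename, chunk_index) to the master; the master replies with
    the chunk handle and replica locations."""
    return byte_offset // CHUNK_SIZE

def chunk_byte_range(byte_offset: int, length: int):
    """Byte range within the first chunk touched by a read; the
    client sends this range, plus the handle, to a chunkserver."""
    start = byte_offset % CHUNK_SIZE
    end = min(start + length, CHUNK_SIZE)
    return start, end

# A read at offset 200 MB falls in chunk 3 (indices start at 0).
assert chunk_index(200 * 1024 * 1024) == 3
```

Because the mapping is deterministic, the client can also cache handle/location answers and often ask the master about several chunks in one request.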
Single Master: A single master simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. The master's involvement must be minimized so that it does not become a bottleneck. Could the single master still be a bottleneck or single point of failure? GFS mitigates this with shadow masters, which provide read-only access when the primary master is down. Memory is not a problem either: the master keeps less than 64 bytes of metadata per 64 MB chunk, so the metadata for millions of files fits in a few hundred megabytes of master memory.
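The memory argument is simple arithmetic. A hedged back-of-the-envelope sketch, assuming the paper's figure of roughly 64 bytes of master metadata per 64 MB chunk (constants are assumptions, not measured values):

```python
CHUNK_SIZE_MB = 64
METADATA_BYTES_PER_CHUNK = 64  # paper: less than 64 bytes per chunk

def master_memory_bytes(total_data_tb: int) -> int:
    """Approximate master memory needed for chunk metadata,
    given total file data in terabytes."""
    chunks = (total_data_tb * 1024 * 1024) // CHUNK_SIZE_MB  # TB -> MB -> chunks
    return chunks * METADATA_BYTES_PER_CHUNK

# 1 PB (1024 TB) of file data is about 16 million chunks,
# i.e. roughly 1 GB of master memory for chunk metadata.
mem = master_memory_bytes(1024)
print(mem // (1024 * 1024), "MB")  # prints: 1024 MB
```

So even a petabyte-scale cluster needs only about a gigabyte of master RAM for chunk metadata, which is why the memory-only design is viable.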
Chunk Size: GFS deals mostly with large files, so the chunk size was chosen as 64 MB. Lazy space allocation avoids wasting space through internal fragmentation. Advantages of a large chunk size: it reduces client-master interactions and reduces network overhead. Disadvantage: chunkservers storing small files, perhaps occupying just one chunk each, might become hot spots when many clients access the same file. GFS fixed such a problem in practice by storing the affected executables with a higher replication factor; a potential longer-term solution is to allow clients to read data from other clients in such situations.
Metadata: The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The master does not store chunk location information persistently; instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster. The operation log lets us update the master's state simply and reliably, without risking inconsistency in the event of a master crash.
In-Memory Data Structures: Since metadata is stored in memory, master operations are fast. It is also easy for the master to periodically scan its entire state in the background. Scanning is used for chunk garbage collection, re-replication when a chunkserver fails, and chunk migration to balance load and disk space. One concern is that the capacity of the whole system is limited by how much memory the master has.
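The three metadata tables can be pictured as a couple of in-memory maps. This is a minimal illustrative sketch (the names and shapes are assumptions, not GFS internals); note that the file-to-chunks mapping is persisted via the operation log, while chunk locations are rebuilt from chunkserver reports:

```python
# Hypothetical in-memory tables held by the master.
namespace = {}       # full pathname -> list of chunk handles (file -> chunks)
chunk_replicas = {}  # chunk handle -> list of chunkserver addresses

def register_file(path, handles):
    """Record a file's chunks; in real GFS this change is logged
    to the operation log before becoming visible."""
    namespace[path] = list(handles)

def report_chunks(server, handles):
    """Called at master startup or when a chunkserver joins: the
    master never persists locations, it learns them from servers."""
    for h in handles:
        chunk_replicas.setdefault(h, []).append(server)

register_file("/logs/day1", ["c1", "c2"])
report_chunks("cs1", ["c1"])
report_chunks("cs2", ["c1", "c2"])
```

A background scan then only has to walk these dictionaries, which is why garbage collection and re-replication decisions are cheap.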
Chunk Locations: The master simply polls the chunkservers for their chunks at startup rather than keeping a persistent record. Thereafter it keeps this information up to date by monitoring chunkserver status through regular HeartBeat messages.
Operation Log: The operation log contains a historical record of critical metadata changes and also serves as a logical timeline that defines the order of concurrent operations. The log must be stored reliably, or we can lose the whole file system along with recent client operations. The master recovers its file system state by replaying the operation log, so to minimize startup time the log must be kept small: the master checkpoints its state whenever the log grows beyond a certain size. It switches to a new log file and creates the new checkpoint in a separate thread. Recovery needs only the latest complete checkpoint and the log files written after it, so older files can be deleted freely.
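Recovery as described above is "load checkpoint, then replay the tail of the log". A minimal sketch, assuming JSON-encoded log records with made-up operation names (real GFS uses a compact binary log and a B-tree-like checkpoint, not JSON):

```python
import json

def recover(checkpoint, log_records):
    """Rebuild master state from the latest checkpoint plus the
    log records written after it. `checkpoint` maps path -> chunk
    handles; each log record is a JSON-encoded mutation."""
    state = {path: list(handles) for path, handles in checkpoint.items()}
    for rec in log_records:
        op = json.loads(rec)
        if op["op"] == "create":
            state[op["path"]] = []
        elif op["op"] == "add_chunk":
            state[op["path"]].append(op["handle"])
        elif op["op"] == "delete":
            state.pop(op["path"], None)
    return state
```

Because only the records after the checkpoint are replayed, checkpointing bounds recovery time regardless of how long the cluster has been running.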
System Interactions: The system is designed so that the master's involvement in all operations is minimized. With that in mind, we can describe how the client, master, and chunkservers interact to implement data mutations. Lease and Mutation Order: A mutation is an operation that changes the contents of a chunk, such as a write or a record append, and it is performed at all of the chunk's replicas.
Now, what is a lease? Leases are used to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, called the primary. The primary picks a serial order for all mutations to the chunk, and all replicas follow this order when applying them. Thus the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. The lease mechanism is designed to minimize management overhead at the master.
Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
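The serial-number mechanism can be sketched as two tiny classes: the primary stamps each mutation with the next serial number, and a secondary buffers out-of-order arrivals and applies them strictly in serial order. This is an illustrative model, not GFS's wire protocol:

```python
class Primary:
    """The replica holding the lease; assigns the serial order."""
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

class Secondary:
    """Applies mutations strictly in the primary's serial order,
    buffering any that arrive early."""
    def __init__(self):
        self.applied = []   # mutations applied so far, in order
        self.pending = {}   # serial -> mutation, waiting for a gap to fill

    def receive(self, serial, mutation):
        self.pending[serial] = mutation
        # Apply every mutation whose predecessors are all applied.
        while len(self.applied) in self.pending:
            self.applied.append(self.pending.pop(len(self.applied)))
```

Even if the network delivers forwarded mutations out of order, every replica ends up applying them in the same sequence, which is exactly the consistency the lease is meant to provide.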
Data Flow: The flow of data is decoupled from the flow of control to use the network efficiently. Data is pushed linearly along a chain of chunkservers so that each machine's full outbound network bandwidth is utilized. Latency is minimized by pipelining the data transfer over TCP connections. Ignoring network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + R*L, where T is the network throughput and L is the latency between machines.
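Plugging numbers into the formula makes the claim concrete. A worked example in the spirit of the paper, assuming a 100 Mbps link (12.5 MB/s) and 1 ms per-hop latency:

```python
def ideal_transfer_time(bytes_b, replicas_r, throughput_t, latency_l):
    """Ideal pipelined transfer time B/T + R*L, ignoring congestion.
    bytes_b in bytes, throughput_t in bytes/s, latency_l in seconds."""
    return bytes_b / throughput_t + replicas_r * latency_l

# 1 MB pushed to 3 replicas over a 100 Mbps (12.5 MB/s) link with
# 1 ms per-hop latency: about 80 ms dominated by the B/T term.
t = ideal_transfer_time(1_000_000, 3, 12_500_000, 0.001)
print(round(t * 1000), "ms")  # prints: 83 ms
```

The R*L term contributes only 3 ms here, which is why pipelining (rather than fanning data out from the client) makes the replica count nearly free in terms of elapsed time.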
Questions by the Professor:
1. "Its design has been driven by key observations of our application workloads and technological environment." What workload and technology characteristics did GFS assume in its design, and what are the corresponding design choices? Solution: Answered in slides 4 and 5.
2. "Caching data blocks in the client loses its appeal." GFS does not cache file data. Why does this design choice not lead to performance loss? What benefit does it have? Solution: File data is not cached by the client or the chunkserver. Large streaming reads gain little from caching because the working sets are too large to fit in a cache: cached data would be evicted before it is reused. The client gets data directly from the chunkserver, which means fewer interactions with the master, no cache coherence issues, and a simpler client, improving overall efficiency.
3. "Small files must be supported, but we need not optimize for them." Why? Think of a scenario where such an assumption on workloads is not valid. Solution: The assumption breaks down when many clients access the same small files at the same time. A small file may occupy just a single chunk, so the few chunkservers storing it can become hot spots, and load balancing suffers.
4. "Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers." How does this design help improve the system's performance? Solution: Clients contact the master only for metadata; reading and writing file data go directly to the chunkservers. Prefetching metadata for multiple chunks further reduces client-master interaction. This keeps the load on the master low while preserving a central point of management: the master is involved only in control messages, which increases overall efficiency.
5. A GFS cluster consists of a single master. What is the benefit of having only a single master? What is its potential performance risk? How does GFS minimize that risk? Solution: Answered in slide 9.
6. "Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed." How does GFS collaborate with the chunkserver's local file system to store file chunks? What is lazy space allocation and what is its benefit? Solution: Each chunk is an ordinary Linux file on a chunkserver, and GFS relies on the local file system to grow that file only as data is actually written. With lazy space allocation, the physical allocation of disk space is delayed as long as possible: a 64 MB chunk consumes only as much disk space as the data it currently holds, rather than being preallocated in full. This minimizes internal fragmentation (unused portions of the 64 MB chunk).
7. On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. Give an example disadvantage. Solution: Answered in slide 10.
8. "One potential concern for this memory-only approach is that the number of chunks, and hence the capacity of the whole system, is limited by how much memory the master has." Why is GFS's master able to keep the metadata in memory? Solution: The metadata is tiny relative to the data it describes: the master keeps less than 64 bytes of metadata per 64 MB chunk, and file namespace entries are stored compactly with prefix compression, so the metadata for millions of files fits in a few hundred megabytes of memory.
9. "Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent." Explain the role of the log. Solution: Explained in slide 14.
Conclusions: GFS demonstrates how to support large-scale processing workloads on commodity hardware. It is designed to tolerate frequent component failures and is optimized for huge files that are mostly appended to and then read sequentially. It favors simple solutions, e.g., a single master. Finally, GFS has met Google's storage needs.