October 13, 2010
Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003.
Outline
1 Assumptions, Interface, Architecture, Single master, Chunk size, Metadata
2 Mutation mechanism, Additional operations
3 Master operation
4 Fault tolerance
5 Micro-benchmarks
Frequent failures
- Hundreds of machines built from inexpensive commodity parts
- Component failures are the norm rather than the exception
- Constant monitoring, error detection, fault tolerance, and prompt automatic recovery must be integral to the system
Huge files
- Modest number of large files; multi-GB files are common
- Small files are supported, but not optimized for
- Design assumptions and parameters such as I/O operation and block sizes had to be revisited
Writing
- Mostly appending new data rather than overwriting existing data
- Large, sequential writes; once written, files are seldom modified again
- Appending is the focus of performance optimization and atomicity guarantees
Reading
- Once written, files are only read, often only sequentially
- Mostly large streaming reads and small random reads
- Applications batch and sort small reads to advance steadily through the file
Concurrency
- Files are often used as producer-consumer queues or for many-way merging
- Hundreds of producers concurrently append to a single file
- The file may be read later, or a consumer may be reading through it simultaneously
- Atomicity with minimal synchronization overhead is essential
Bandwidth vs. latency
- High sustained bandwidth is more important than low latency
- Most applications place a premium on processing data in bulk at a high rate
- Few have stringent response-time requirements for an individual read or write
Interface
- GFS doesn't implement a standard API such as POSIX
- Files are organized hierarchically in directories and identified by pathnames
- Standard operations: create, delete, open, close, read, and write
- Additional operations: snapshot and record append
  - Snapshot creates a copy of a file or a directory tree at low cost
  - Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append
Architecture
- A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients
- Each of these is a commodity Linux machine running a user-level server process
Files
- Files are divided into fixed-size chunks
- Each chunk is identified by a 64-bit chunk handle
- Chunkservers store chunks on local disks as Linux files
- Each chunk is replicated on multiple chunkservers (default: 3)
Master
- Maintains all file system metadata:
  - namespace
  - access control information
  - mapping from files to chunks
  - current locations of chunks
- Controls system-wide activities:
  - chunk lease management
  - garbage collection of orphaned chunks
  - chunk migration between chunkservers
- Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state
Communication
- The GFS client communicates with the master and chunkservers to read or write data on behalf of the application
- Clients interact with the master only for metadata operations
- All data-bearing communication goes directly to the chunkservers
Cache
- Clients cache only metadata
- Caching file data offers little benefit because most applications stream through huge files; eliminating data caches simplifies the client and the overall system
- Chunkservers need not cache file data because chunks are stored as local files (Linux's buffer cache already keeps frequently accessed data in memory)
Single master
- Having a single master simplifies the design
- Minimizing its involvement in reads and writes ensures that it does not become a bottleneck
- Clients only ask the master which chunkservers they should contact
- They cache this information for a limited time and interact with the chunkservers directly for many subsequent operations
Read operation
1 The client translates the file name and byte offset into a chunk index within the file
2 It sends the master a request
3 The master replies with the corresponding chunk handle and the locations of the replicas
4 The client caches this information
5 The client then sends a request to one of the replicas
6 Further reads of the same chunk require no more client-master interaction
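The read steps above can be sketched in a few lines. This is a minimal illustration, not the GFS client library: the class and method names (`Client`, `Master.lookup`) are hypothetical, and the network round trip is reduced to a direct call.

```python
# Sketch of the client-side read path. Names are hypothetical;
# only the offset arithmetic and caching behavior follow the text.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def chunk_index(byte_offset):
    """Step 1: translate a byte offset within a file into a chunk index."""
    return byte_offset // CHUNK_SIZE

class Master:
    """Stand-in for the master's metadata lookup (steps 2-3)."""
    def __init__(self, table):
        self.table = table    # (filename, chunk index) -> (handle, replica locations)
        self.lookups = 0      # counts client-master interactions
    def lookup(self, filename, idx):
        self.lookups += 1
        return self.table[(filename, idx)]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}       # step 4: cached (handle, replicas) per (file, chunk index)

    def locate(self, filename, offset):
        idx = chunk_index(offset)
        key = (filename, idx)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, idx)
        return self.cache[key]  # step 6: cache hit needs no master round trip
```

A second read of the same chunk is served entirely from the client's cache, which is exactly how the master stays out of the data path.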
[Figure: read scheme]
Chunk size
- 64 MB
- Lazy space allocation avoids wasting space due to internal fragmentation
- Advantages:
  - reduces clients' need to interact with the master
  - reduces network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time
  - reduces the size of metadata
Metadata
- Three types:
  - file and chunk namespaces
  - mapping from files to chunks
  - locations of each chunk's replicas
- All metadata is kept in the master's memory
- Namespaces and the mapping are also kept in an operation log stored on the master's local disk and replicated on remote machines
- The master does not store chunk location information persistently; it asks each chunkserver about its chunks
In-memory data structures
- Since metadata is stored in memory, master operations are fast
- The amount of memory the master has is not a concern: it maintains less than 64 bytes of metadata for each 64 MB chunk, and similarly little per file
Chunk locations
- The master does not keep a persistent record of which chunkservers have a replica of a given chunk
- It polls chunkservers for that information at startup and periodically thereafter (with HeartBeat messages)
- This eliminates the problem of keeping the master and chunkservers in sync
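The design above means the chunk-location map is simply derived data. A sketch of how the master could rebuild it from chunkserver reports (the function and data shapes are hypothetical):

```python
# Sketch: the master reconstructs chunk locations entirely from what
# chunkservers report at startup / in HeartBeat messages, so there is
# no persistent record that could drift out of sync.

def rebuild_locations(reports):
    """reports: dict mapping chunkserver id -> set of chunk handles it holds.
    Returns dict mapping chunk handle -> set of chunkservers holding it."""
    locations = {}
    for server, handles in reports.items():
        for handle in handles:
            locations.setdefault(handle, set()).add(server)
    return locations
```

A chunkserver that crashed simply stops appearing in the next round of reports, and its replicas drop out of the map without any explicit invalidation.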
Operation log
- Contains a historical record of critical metadata changes
- Serves as a logical timeline that defines the order of concurrent operations
- It is replicated on multiple machines
- The master responds to a client operation only after flushing the corresponding log record to disk
Operation log
- The master recovers its file system state by replaying the operation log
- To minimize startup time, the log must be kept small: the master checkpoints its state whenever the log grows beyond a certain size
- The checkpoint is in a compact B-tree-like form that can be mapped directly into memory and used for namespace lookup without extra parsing
- A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints
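The checkpoint-plus-replay recovery described above can be sketched as follows. The record format and operations are invented for illustration; the point is that only log records after the checkpoint need replaying, so startup cost is proportional to the log tail, not the full history.

```python
# Sketch of recovery: start from the checkpointed state and replay only
# the log records written after the checkpoint was taken.
# (Record format and operations are hypothetical.)

def recover(checkpoint_state, checkpoint_index, log):
    """checkpoint_state: namespace dict captured at checkpoint_index;
    log: full list of (op, path, value) records."""
    state = dict(checkpoint_state)
    for index, (op, path, value) in enumerate(log):
        if index < checkpoint_index:
            continue                 # already reflected in the checkpoint
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state
```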
Leases and mutation order
- A mutation is an operation that changes the contents or metadata of a chunk (e.g. a write)
- Leases are used to maintain a consistent mutation order across replicas
- The master grants a chunk lease to one of the replicas, which we call the primary
- The primary picks a serial order for all mutations to the chunk; all replicas follow this order when applying mutations
- A lease has an extendable 60-second timeout
- Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires
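The master-side lease bookkeeping implied above can be sketched like this (the `LeaseTable` API is hypothetical; only the 60-second extendable timeout and the expiry rule come from the text):

```python
# Sketch of lease bookkeeping on the master. A lease names the primary
# replica for a chunk and carries an extendable 60-second expiry; only
# after expiry may the master safely grant the lease to another replica.

LEASE_TIMEOUT = 60.0  # seconds

class LeaseTable:
    def __init__(self, clock):
        self.clock = clock       # injectable time source (e.g. time.monotonic)
        self.leases = {}         # chunk handle -> (primary, expiry time)

    def grant(self, handle, replica):
        """Grant the lease to `replica`, unless a live lease already exists."""
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if primary is not None and self.clock() < expiry:
            return primary       # existing lease still valid; keep that primary
        self.leases[handle] = (replica, self.clock() + LEASE_TIMEOUT)
        return replica

    def extend(self, handle):
        """The current primary asks for more time (e.g. via HeartBeat)."""
        primary, _ = self.leases[handle]
        self.leases[handle] = (primary, self.clock() + LEASE_TIMEOUT)
```

Injecting the clock makes the "master loses contact with the primary" case easy to see: the master never revokes anything, it just waits out the expiry.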
Data flow
- The flow of data is decoupled from the flow of control to use the network efficiently
- Control flows from the client to the primary and then to all secondaries
- Data is pushed linearly along a carefully picked chain of chunkservers
- Once a chunkserver receives some data, it starts forwarding immediately
[Figure: write control scheme]
Record append
- The client specifies only the data; GFS appends it to the file at least once atomically, at an offset of GFS's choosing
- If appending the record to the current chunk would cause it to exceed the maximum size (64 MB), the chunk is padded up to the maximum size and the record is retried on the next chunk
Record append
- For the operation to report success, the data must have been written at the same offset on all replicas of some chunk
- If a record append fails at any replica, the client retries the operation
- Replicas of the same chunk may therefore contain different data, possibly including duplicates of the same record
- GFS does not guarantee that all replicas are bytewise identical; it only guarantees that the data is written at least once as an atomic unit
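The primary's decision for a record append (described on the two slides above) reduces to a small piece of offset arithmetic. A sketch, with a hypothetical function name; the 64 MB limit and the pad-then-retry behavior are from the text:

```python
# Sketch of the primary's record-append decision: if the record would
# overflow the current chunk, the chunk is padded to the maximum size
# and the client retries on the next chunk; otherwise the primary picks
# the append offset, which all replicas then use.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def record_append(chunk_used, record_len):
    """Return ('pad', padding_bytes) to force a retry on a new chunk,
    or ('append', offset) with the offset chosen for all replicas."""
    if chunk_used + record_len > CHUNK_SIZE:
        return ("pad", CHUNK_SIZE - chunk_used)
    return ("append", chunk_used)
```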
Snapshot
- The snapshot operation makes a copy of a file or a directory tree
- It uses standard copy-on-write techniques
Snapshot
1 When the master receives a snapshot request, it first revokes any relevant leases
2 Then, the master logs the operation to disk
3 It then applies this log record to its in-memory state by duplicating the metadata
4 The newly created snapshot files point to the same chunks as the source files
5 The next time a chunk is to be written, the master notices that its reference count is greater than one
6 It then asks each chunkserver that has a current replica of the original chunk to create a copy of it
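Steps 4-6 are ordinary reference-counted copy-on-write. A minimal sketch of that mechanism (class and method names hypothetical; chunk handles are plain strings):

```python
# Sketch of copy-on-write snapshot using reference counts. Snapshot
# duplicates only metadata; a shared chunk is copied lazily when the
# next write to it arrives.

class Namespace:
    def __init__(self):
        self.files = {}      # path -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files pointing at it

    def create(self, path, handles):
        self.files[path] = list(handles)
        for h in handles:
            self.refcount[h] = self.refcount.get(h, 0) + 1

    def snapshot(self, src, dst):
        # Step 4: the snapshot points at the same chunks as the source.
        self.create(dst, self.files[src])

    def prepare_write(self, path, i, fresh_handle):
        # Steps 5-6: if the chunk is shared, switch this file to a fresh
        # copy before writing; the snapshot keeps the original handle.
        h = self.files[path][i]
        if self.refcount[h] > 1:
            self.refcount[h] -= 1
            self.files[path][i] = fresh_handle
            self.refcount[fresh_handle] = 1
        return self.files[path][i]
```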
Master operation
- The master executes all namespace operations
- It manages chunk replicas throughout the system:
  - makes placement decisions
  - creates new chunks and hence replicas
  - coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage
Namespace management and locking
- GFS represents its namespace as a lookup table mapping full pathnames to metadata
- Each node in the namespace tree has an associated read-write lock
- Each master operation acquires a set of locks before it runs: read locks on all superdirectories' pathnames, and a read or write lock on the full pathname
- Creating a file doesn't require a write lock on the parent directory, as there is no inode-like data structure
- Multiple file creations can therefore be executed concurrently in the same directory
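The lock set described above is easy to compute from the pathname alone. A sketch (the helper name is hypothetical; the rule, read locks on every ancestor plus a read or write lock on the full pathname, is from the text):

```python
# Sketch: compute the set of locks a master operation acquires for a
# given pathname: read locks on all superdirectories' pathnames, plus a
# read or write lock on the full pathname itself.

def lock_set(path, leaf_mode):
    """Return [(pathname, mode), ...]; leaf_mode is 'r' or 'w'."""
    parts = path.strip("/").split("/")
    locks = []
    for i in range(1, len(parts)):
        locks.append(("/" + "/".join(parts[:i]), "r"))  # ancestors: read locks
    locks.append((path, leaf_mode))
    return locks
```

Because file creation only read-locks the parent directory, two creations in the same directory conflict on nothing: their write locks name different pathnames.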
Replica placement
- The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization
- Replicas are spread across different machines and racks
Chunk creation
- When the master creates a chunk, it chooses where to place the initially empty replicas, considering several factors:
  - chunkservers with below-average disk space utilization are preferred
  - the number of recent creations on each chunkserver should be limited
  - replicas of a chunk should be spread across racks
Re-replication
- The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
- The priority of re-replication is based on several factors:
  - how far the chunk is from its replication goal
  - chunks of live files are replicated before chunks that belong to recently deleted files
  - chunks that are blocking client progress are prioritized
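One way to realize the priority rules above is a sort key. The relative weighting of the three factors is an assumption here; the text only names the factors, not how they combine:

```python
# Sketch of a re-replication priority ordering. The factor ordering
# below (blocking clients first, then live files, then distance from
# goal) is an assumed weighting; the text lists the factors only.

def priority_key(chunk):
    missing = chunk["goal"] - chunk["replicas"]  # distance from replication goal
    return (
        not chunk["blocking_client"],  # chunks blocking a client come first
        not chunk["live"],             # live files before recently deleted ones
        -missing,                      # most under-replicated first
    )

def schedule(chunks):
    """Order chunks for re-replication, highest priority first."""
    return sorted(chunks, key=priority_key)
```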
Rebalancing
- Performed periodically: the master examines the current replica distribution and moves replicas for better disk space and load balancing
- Through this process, the master gradually fills up new chunkservers
- Replicas are removed from the chunkservers with below-average free space
Garbage collection
- GFS does not immediately reclaim the available physical storage
- The master logs a file's deletion immediately, and the file is renamed to a hidden name
- During the master's regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days
- In a similar scan, the master identifies orphaned chunks and erases the metadata for those chunks
- In a HeartBeat message, each chunkserver reports which chunks it has, and the master replies with the chunks that are no longer present in the master's metadata
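The rename-to-hidden mechanism can be sketched in a few lines (the hidden-name convention and function names are invented; the three-day grace period and the scan behavior are from the text):

```python
# Sketch of lazy deletion: deleting a file just renames it to a hidden
# name carrying the deletion time; a periodic namespace scan removes
# hidden files older than three days. Name format is hypothetical.

GRACE = 3 * 24 * 3600  # three days, in seconds

def delete(namespace, path, now):
    hidden = "/.deleted" + path + "@" + str(now)
    namespace[hidden] = namespace.pop(path)   # rename to a hidden name

def scan(namespace, now):
    """The master's regular scan: reclaim expired hidden files."""
    for name in list(namespace):
        if name.startswith("/.deleted"):
            ts = float(name.rsplit("@", 1)[1])
            if now - ts > GRACE:
                del namespace[name]           # metadata gone; chunks now orphaned
```

Until the scan removes the hidden file, deletion is trivially reversible by renaming it back, which is the safety net the next slide describes.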
Garbage collection
- Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful
- It merges storage reclamation into the regular background activities of the master
- The delay in reclaiming storage provides a safety net against accidental, irreversible deletion
Stale replica detection
- Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down
- For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas
- Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas
- The master removes stale replicas in its regular garbage collection
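The version-number mechanism above amounts to a counter bumped on each lease grant. A sketch with hypothetical structures (replicas modeled as dicts of chunk handle to version):

```python
# Sketch of chunk version numbers: bumped on every new lease and pushed
# to the replicas that are reachable; a replica that was down during the
# bump keeps the old version and is thereby detectable as stale.

def grant_lease(master_versions, handle, live_replicas):
    master_versions[handle] += 1                 # new lease: bump the version
    for replica in live_replicas:
        replica[handle] = master_versions[handle]  # up-to-date replicas follow

def is_stale(master_versions, handle, replica):
    return replica.get(handle, -1) < master_versions[handle]
```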
High availability
- Fast recovery of master and chunkservers
- Chunk replication
- Master replication
Master replication
- The operation log and checkpoints are replicated
- A mutation is considered committed only after its log record has been flushed locally and on all replicas
- If the master machine fails, monitoring infrastructure starts a new master process elsewhere
- Shadow masters provide read-only access to the file system even when the primary master is down
- A shadow master reads a replica of the log and applies the same changes to its data structures exactly as the primary does
- Like the primary, it polls chunkservers at startup and exchanges frequent handshake messages with them to monitor their status
Data integrity
- Chunkservers use checksumming to detect data corruption
- A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum
- Checksums are kept in memory and stored persistently with logging
- On reads, the chunkserver verifies the checksum of every data block that overlaps the read range before returning any data
- If a block doesn't match its checksum, the chunkserver returns an error and reports it to the master, which clones the chunk from another replica; the invalid replica is then removed
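The per-block scheme above can be sketched with `zlib.crc32` standing in for GFS's (unspecified) 32-bit checksum; the 64 KB block layout and verify-before-return behavior follow the text, the function names are hypothetical:

```python
# Sketch of per-block checksumming: a chunk is split into 64 KB blocks,
# each with a 32-bit checksum (zlib.crc32 used as a stand-in), and every
# block overlapping a read range is verified before data is returned.
import zlib

BLOCK = 64 * 1024  # 64 KB

def checksums(chunk_data):
    """Compute the 32-bit checksum of each 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verify_read(chunk_data, sums, offset, length):
    """Verify every block overlapping [offset, offset+length), then return the data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError("checksum mismatch in block %d" % b)
    return chunk_data[offset:offset + length]
```

Because verification is per block, a small read touches only the one or two blocks it overlaps rather than the whole 64 MB chunk.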
Data integrity
- Checksum computation is optimized for appends: the checksum is incrementally updated for the last partial block and computed from scratch for any brand-new blocks filled by the append
- For writes, the first and last blocks of the range being overwritten must be read and verified first
- During idle periods, chunkservers scan and verify inactive chunks
Micro-benchmarks