The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google*
정학수, 최주영
Outline
- Introduction
- Design Overview
- System Interactions
- Master Operation
- Fault Tolerance and Diagnosis
- Measurements
- Conclusions
Introduction
GFS was designed to meet the demands of Google's data processing needs.
Design emphasis:
- Component failures are the norm
- Files are huge
- Most files are mutated by appending
DESIGN OVERVIEW
Assumptions
- The system is composed of inexpensive components that often fail
- Stores files that are typically 100 MB or larger
- Workload: large streaming reads and small random reads
- Large, sequential writes that append data to files
- Atomicity with minimal synchronization overhead is essential
- High sustained bandwidth is more important than low latency
Interface
Files are organized hierarchically in directories and identified by pathnames.
Operations:
- Create: create a file
- Delete: delete a file
- Open: open a file
- Close: close a file
- Read: read a file
- Write: write a file
- Snapshot: create a copy of a file or a directory tree
- Record append: allow multiple clients to append data to the same file
Architecture
A GFS cluster consists of a single master and multiple chunkservers, accessed by many clients.
GFS is designed for system-to-system interaction, not for user-to-system interaction.
Single Master
- The master holds all file system metadata
- Clients ask the master only for metadata, then read and write file data directly from chunkservers
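The single-master design keeps the master out of the data path. A minimal sketch of that read path, assuming hypothetical `master.lookup` and `replica.read_chunk` calls (the real client library is not shown in the paper):

```python
# Sketch of a GFS-style client read: translate byte offsets to chunk indices,
# ask the master for chunk locations, then read directly from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

def read(master, path, offset, length):
    """Read `length` bytes starting at `offset` from the file at `path`."""
    data = bytearray()
    while length > 0:
        chunk_index = offset // CHUNK_SIZE          # client translates offset to chunk index
        chunk_offset = offset % CHUNK_SIZE
        n = min(length, CHUNK_SIZE - chunk_offset)  # never read past a chunk boundary here
        handle, replicas = master.lookup(path, chunk_index)  # usually cached by the client
        replica = min(replicas, key=lambda r: r.network_distance)  # prefer the closest chunkserver
        data += replica.read_chunk(handle, chunk_offset, n)
        offset += n
        length -= n
    return bytes(data)
```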
Chunk Size
Large chunk size: 64 MB
Advantages
- Reduces client-master interaction
- Reduces network overhead
- Reduces the size of metadata
Disadvantage
- Hot spots when many clients access the same file
Metadata
All metadata is kept in the master's memory
- Less than 64 bytes of metadata per chunk
Types
- File and chunk namespaces
- File-to-chunk mapping
- Location of each chunk's replicas
Metadata (Cont'd)
In-memory data structure
- Master operations are fast
- Easy and efficient periodic scans
Operation log
- Contains a historical record of critical metadata changes
- Replicated on multiple remote machines
- The master responds to a client only after the log record has been written
- Recovery by replaying the operation log
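A rough sketch of the three metadata types and the "log before replying" rule; the class and field names are illustrative assumptions, not GFS internals:

```python
# Illustrative master metadata, assuming simple dict-based structures.
class MasterState:
    def __init__(self, log_replicas):
        self.namespace = {}        # full pathname -> file attributes
        self.file_chunks = {}      # pathname -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> set of chunkservers (not logged; polled at startup)
        self.log_replicas = log_replicas  # remote machines holding copies of the operation log

    def mutate_metadata(self, record, apply_fn):
        # Critical changes are logged locally and on remote replicas *before*
        # the client gets a response, so a crash can be recovered by replay.
        for replica in self.log_replicas:
            replica.append(record)
        apply_fn(self)             # only then apply the change to the in-memory state
```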
Consistency Model
Consistent: all clients will always see the same data, regardless of which replica they read from
Defined: consistent, and clients see what the mutation wrote in its entirety
Inconsistent: different clients may see different data at different times
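For reference, the three region states as a small illustrative enum, annotated with which mutations produce them (per the paper's Table 1):

```python
from enum import Enum

class RegionState(Enum):
    CONSISTENT = "consistent"      # all clients see the same data on every replica
    DEFINED = "defined"            # consistent, and clients see the mutation in its entirety
    INCONSISTENT = "inconsistent"  # different clients may see different data at different times

# Per the paper's Table 1: a serial successful write leaves a region defined,
# concurrent successful writes leave it consistent but undefined, and a failed
# mutation leaves it inconsistent; record append leaves it defined, possibly
# interspersed with inconsistent padding or duplicates.
```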
SYSTEM INTERACTIONS
Leases and Mutation Order
Leases
- Maintain a consistent mutation order across replicas while minimizing management overhead
- The master grants a lease to one of the replicas, which becomes the primary
- The primary picks a serial order for mutations
- All replicas follow this order when applying mutations
Leases and Mutation Order (Cont'd)
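This continuation slide presumably walked through the write control flow (the paper's Figure 2). A condensed, illustrative sketch of that flow, with assumed method names such as `get_lease_holder`:

```python
# Sketch of a lease-based mutation: data is pushed to all replicas, then the
# primary assigns a serial order that every replica applies.
def write(master, data, chunk_handle):
    # 1. Ask the master which replica holds the lease (the primary).
    primary, secondaries = master.get_lease_holder(chunk_handle)  # lease granted if none is held

    # 2. Push the data to all replicas (decoupled data flow; see the next slide).
    for replica in [primary] + secondaries:
        replica.buffer_data(data)

    # 3. Send the write request to the primary, which assigns a serial number
    #    and applies the mutation locally.
    serial = primary.apply_mutation(data)

    # 4. The primary forwards the write to all secondaries, which apply it in
    #    the same serial order and then acknowledge.
    acks = [s.apply_mutation_in_order(data, serial) for s in secondaries]

    # 5. The primary replies to the client; any replica error leaves the region
    #    inconsistent and the client retries.
    return all(acks)
```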
Data Flow
- Fully utilize network bandwidth by decoupling the control flow from the data flow
- Avoid network bottlenecks and high-latency links: each machine forwards the data to the closest machine that has not yet received it
- Minimize latency by pipelining the data transfer
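A minimal sketch of the chained, pipelined data push; the `distance_to` measure and `stream_to` call are assumptions standing in for the real transport:

```python
# Data is pushed along a chain of chunkservers rather than fanned out from the
# client, so each machine's full outbound bandwidth is used; in GFS each server
# starts forwarding as soon as it starts receiving (pipelining over TCP).
def push_data(source, replicas, data):
    remaining = list(replicas)
    sender = source
    while remaining:
        # Forward to the closest machine that has not yet received the data.
        nxt = min(remaining, key=lambda r: sender.distance_to(r))
        sender.stream_to(nxt, data)   # overlaps with receiving in the real system
        remaining.remove(nxt)
        sender = nxt
```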
Atomic Record Appends
Record append: an atomic append operation
- The client specifies only the data, not the offset
- GFS appends the data at an offset of GFS's choosing and returns that offset to the client
- Many clients can append to the same file concurrently; such files often serve as multiple-producer/single-consumer queues or contain merged results from many clients
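A rough sketch of the pad-or-append decision the primary makes for a record append (all names are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

# Sketch of record-append logic at the primary: the primary picks the offset,
# so concurrent appenders never coordinate among themselves.
def record_append(primary, record):
    if primary.chunk_used + len(record) > CHUNK_SIZE:
        # Record does not fit: pad the current chunk on all replicas and tell
        # the client to retry on the next chunk.
        primary.pad_to_chunk_boundary()
        return "RETRY_ON_NEXT_CHUNK"
    offset = primary.chunk_used            # offset chosen by GFS, not by the client
    primary.write_at(offset, record)       # primary and secondaries apply at the same offset
    primary.chunk_used += len(record)
    return offset                          # returned to the client
```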
Snapshot
- Makes a copy of a file or a directory tree almost instantaneously
- Implemented with standard copy-on-write techniques
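A simplified copy-on-write sketch of snapshot, assuming per-chunk reference counts as described in the paper; the master fields and helpers used here are hypothetical:

```python
# Snapshot duplicates metadata only; chunks are copied lazily on the next write.
def snapshot(master, src_path, dst_path):
    master.revoke_leases(src_path)                  # future writes must contact the master first
    master.log_operation(("snapshot", src_path, dst_path))
    chunks = master.file_chunks[src_path]
    master.file_chunks[dst_path] = list(chunks)     # new file points at the same chunk handles
    for handle in chunks:
        master.refcount[handle] += 1                # mark chunks as shared

def before_write(master, handle):
    # Copy-on-write: if a chunk is shared, clone it on the chunkservers that
    # hold it before handing out a new lease.
    if master.refcount[handle] > 1:
        new_handle = master.clone_chunk_on_chunkservers(handle)
        master.refcount[handle] -= 1
        return new_handle
    return handle
```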
MASTER OPERATION
Namespace Management and Locking
Namespace
- A lookup table mapping full pathnames to metadata
Locking
- Locks over regions of the namespace ensure proper serialization while allowing many operations to be active at once
- Allows concurrent mutations in the same directory
- Deadlock is prevented by acquiring locks in a consistent total order
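A small sketch of the lock set for one operation, acquired in a consistent total order (first by namespace level, then lexicographically); the representation is an assumption:

```python
# Locks are acquired in a consistent total order, which prevents deadlock.
def locks_for(path, write=False):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    wanted = [(p, "read") for p in ancestors] + [(path, "write" if write else "read")]
    return sorted(wanted, key=lambda pm: (pm[0].count("/"), pm[0]))

# Example: creating /home/user/foo takes read locks on /home and /home/user and
# a write lock on /home/user/foo; a concurrent snapshot of /home/user needs a
# write lock on /home/user, so the two operations are properly serialized.
```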
Replica Placement
- Maximize data reliability and availability
- Maximize network bandwidth utilization
- Spread replicas across machines
- Spread chunk replicas across racks
Creation, Re-replication, Rebalancing
Creation
- New chunks are created on demand by writers
Re-replication
- Triggered when the number of available replicas falls below a user-specified goal
Rebalancing
- Replicas are moved periodically for better disk space and load balancing
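An illustrative sketch of the chunk-creation placement criteria the paper describes (below-average disk utilization, limiting recent creations, spreading across racks); the field names are assumptions:

```python
# Sketch of choosing chunkservers for a new chunk's replicas.
def place_new_replicas(chunkservers, n=3):
    chosen = []
    # Prefer servers with below-average disk utilization and few recent creations.
    candidates = sorted(chunkservers,
                        key=lambda cs: (cs.disk_utilization, cs.recent_creations))
    for cs in candidates:
        if len(chosen) == n:
            break
        if all(cs.rack != other.rack for other in chosen):
            chosen.append(cs)              # keep replicas on different racks when possible
            cs.recent_creations += 1
    # Fall back to same-rack placement if there are fewer racks than replicas.
    for cs in candidates:
        if len(chosen) == n:
            break
        if cs not in chosen:
            chosen.append(cs)
    return chosen
```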
Garbage Collection
Lazy reclamation
- The deletion is logged immediately
- The file is renamed to a hidden name that includes the deletion timestamp
- Hidden files older than three days are removed during the regular namespace scan
- Until then, the file can be undeleted by renaming it back to normal
Regular scan
- Chunkservers report their chunks in heartbeat messages exchanged with the master
- The master identifies orphaned chunks and erases their metadata
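A sketch of the lazy reclamation steps above; the hidden-name encoding and helper names are illustrative, and only the three-day grace period comes from the paper:

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # deleted files are kept for 3 days before reclamation

def delete_file(master, path):
    master.log_operation(("delete", path))                  # deletion is logged immediately
    hidden = path + ".deleted." + str(int(time.time()))     # rename to a hidden name with timestamp
    master.namespace[hidden] = master.namespace.pop(path)

def namespace_scan(master, now):
    # During the regular namespace scan, hidden files older than the grace
    # period are removed; until then they can be undeleted by renaming back.
    for name in list(master.namespace):
        if ".deleted." in name:
            ts = int(name.rsplit(".", 1)[-1])
            if now - ts > GRACE_PERIOD:
                del master.namespace[name]   # chunks become orphaned, are reported in
                                             # heartbeats, and are erased later
```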
Stale Replica Detection
- The master maintains a version number for each chunk
- Replicas with an old version number are detected as stale
- Stale replicas are removed in the regular garbage collection
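A small sketch of how stale replicas might be detected from a chunkserver's report of its chunks and version numbers (field names assumed):

```python
# The master bumps a chunk's version number whenever it grants a new lease;
# replicas that missed mutations (e.g. while their server was down) keep an
# old version and are detected when the chunkserver reports its chunks.
def detect_stale(master, chunkserver_report):
    stale = []
    for handle, reported_version in chunkserver_report.items():
        if reported_version < master.chunk_version[handle]:
            stale.append(handle)          # reclaimed in the regular garbage collection
    return stale
```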
FAULT TOLERANCE AND DIAGNOSIS
High Availability
Fast recovery
- Servers restore their state and start in seconds
Chunk replication
- Different replication levels for different parts of the file namespace
- The master clones existing replicas as chunkservers go offline or when corrupted replicas are detected through checksum verification
High Availability (Cont'd)
Master replication
- The operation log and checkpoints are replicated on multiple machines
- If the master machine or its disk fails, monitoring infrastructure outside GFS starts a new master process
Shadow masters
- Provide read-only access when the primary master is down
Data Integrity
Checksums
- Used to detect corruption
- One checksum for every 64 KB block in each chunk
- Kept in memory and stored persistently with logging
Read
- The chunkserver verifies the checksum before returning data
Write (append)
- Incrementally update the checksum for the last partial block
- Compute new checksums for any new blocks
Data Integrity (Cont'd)
Write (overwrite)
- Read and verify the first and last blocks of the range, then perform the write
- Compute and record the new checksums
During idle periods
- Chunkservers scan and verify inactive chunks
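A sketch of per-64 KB-block checksumming across these two slides; `zlib.crc32` stands in for whatever checksum GFS actually uses, and the chunkserver structures are assumptions:

```python
import zlib

BLOCK = 64 * 1024  # each chunk is divided into 64 KB blocks, one checksum per block

def verify_block(chunkserver, handle, block_index, data):
    # On read, the chunkserver verifies the checksum before returning any data,
    # so corruption is never propagated to clients or other replicas.
    if zlib.crc32(data) != chunkserver.checksums[handle][block_index]:
        raise IOError("checksum mismatch: report to the master and re-replicate")
    return data

def append_checksums(chunkserver, handle, data, last_block_len):
    # On record append, the checksum of the final partial block is extended
    # incrementally (CRC32 can be continued from its current value); any new
    # blocks get freshly computed checksums.
    sums = chunkserver.checksums[handle]
    head = (BLOCK - last_block_len) % BLOCK   # bytes that complete the partial block
    if head:
        sums[-1] = zlib.crc32(data[:head], sums[-1])
    for i in range(head, len(data), BLOCK):
        sums.append(zlib.crc32(data[i:i + BLOCK]))
```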
MEASUREMENTS
Micro-benchmarks
GFS cluster
- 1 master, 2 master replicas, 16 chunkservers, 16 clients
- Server machines are connected to one switch, client machines to the other
- The two switches are connected with a 1 Gbps link
Micro-benchmarks (Cont'd)
Figure 3: Aggregate throughputs. Top curves show theoretical limits imposed by the network topology; bottom curves show measured throughputs, with error bars giving 95% confidence intervals (illegible in some cases because of low variance in the measurements).
Real World Clusters
Table 2: Characteristics of two GFS clusters
Real World Clusters (Cont'd)
Table 3: Performance metrics for the two GFS clusters
Real World Clusters (Cont'd)
Recovery time, measured in cluster B
- Killed a single chunkserver holding about 15,000 chunks (600 GB of data)
- All chunks were restored in 23.2 minutes, an effective replication rate of 440 MB/s
- Killed two chunkservers, each with about 16,000 chunks (660 GB of data)
- 266 chunks were left with only a single replica; these were re-replicated at higher priority and restored within 2 minutes
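As a quick arithmetic check, the quoted 23.2-minute restore of 600 GB is consistent with the stated 440 MB/s effective replication rate:

$$\frac{600 \times 1024\ \text{MB}}{23.2 \times 60\ \text{s}} \approx 441\ \text{MB/s} \approx 440\ \text{MB/s}$$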
Conclusions
GFS demonstrates the qualities essential for supporting large-scale data processing workloads
- Treats component failures as the norm
- Optimizes for huge files
- Extends and relaxes the standard file system interface
Fault tolerance is provided by
- Constant monitoring
- Replicating crucial data
- Fast and automatic recovery
- Checksums to detect data corruption
Delivers high aggregate throughput