The Google File System

The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. ACM Symposium on Operating Systems Principles (SOSP), October 2003. Publisher: ACM. Presented Nov. 26, 2008.

OUTLINE INTRODUCTION DESIGN OVERVIEW SYSTEM INTERACTIONS MASTER OPERATION FAULT TOLERANCE AND DIAGNOSIS MEASUREMENTS CONCLUSIONS

INTRODUCTION GFS shares many of the same goals as previous distributed file systems but departs from some earlier file system design assumptions. Multiple GFS clusters are currently deployed for different purposes.

DESIGN OVERVIEW Assumptions Interface Architecture Single Master Chunk Size Metadata Consistency Model

DESIGN OVERVIEW Assumptions The system is built from many inexpensive commodity components that often fail. The system stores a modest number of large files. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.

DESIGN OVERVIEW Assumptions The workloads also have many large, sequential writes that append data to files. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency.

DESIGN OVERVIEW Interface GFS provides a familiar file system interface but does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. The usual operations are supported: create, delete, open, close, read, and write, plus two special operations, snapshot and record append.

DESIGN OVERVIEW Architecture A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.

DESIGN OVERVIEW Architecture [Figure: GFS architecture. Applications link against the GFS client, which talks to the single GFS master and to GFS chunkservers, each storing chunks as files in a local Linux file system. A file consists of chunks identified by immutable, globally unique 64-bit chunk handles; chunk data is accessed by chunk handle and byte range, and each chunk is replicated on multiple chunkservers.]

DESIGN OVERVIEW Architecture The GFS master maintains all file system metadata: the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
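
To make this concrete, here is a minimal sketch of the master's in-memory metadata, assuming simplified, illustrative data structures (this is not GFS source code):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileMetadata:
    """Per-file state kept by the master (illustrative fields only)."""
    acl: str = ""                                            # access control information
    chunk_handles: List[int] = field(default_factory=list)   # mapping from the file to its chunks

@dataclass
class MasterState:
    """The master's in-memory metadata, heavily simplified."""
    namespace: Dict[str, FileMetadata] = field(default_factory=dict)     # full pathname -> file metadata
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)  # chunk handle -> chunkserver addresses
    next_handle: int = 0

    def create_file(self, path: str) -> None:
        self.namespace[path] = FileMetadata()

    def add_chunk(self, path: str) -> int:
        """Assign a new, globally unique chunk handle to the file's next chunk."""
        handle = self.next_handle
        self.next_handle += 1
        self.namespace[path].chunk_handles.append(handle)
        # Locations are not persisted; they are learned from chunkservers at
        # startup and kept fresh via HeartBeat messages.
        self.chunk_locations[handle] = []
        return handle
```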

DESIGN OVERVIEW Architecture The master also controls system-wide activities: chunk lease management, garbage collection, and chunk migration between chunkservers. It communicates with each chunkserver through periodic HeartBeat messages. Neither the client nor the chunkserver caches file data.

DESIGN OVERVIEW Single Master [Figure 1: read path. The client translates a file name and byte offset into a chunk index and sends the master a (file name, chunk index) request; the master replies with the chunk handle and chunk locations (e.g., /foo/bar, chunk handle 2ef0), which the client caches. The client then sends a (chunk handle, byte range) request directly to a chunkserver, which returns the chunk data. Control messages flow between client and master and between master and chunkservers (instructions and chunkserver state); data messages flow only between clients and chunkservers.]
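
A hedged sketch of the client-side read path in Figure 1; the RPC helpers `ask_master` and `read_from_chunkserver` are hypothetical placeholders passed in as callables:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

# Client-side cache: (file name, chunk index) -> (chunk handle, replica locations)
chunk_cache = {}

def read(filename, offset, length, ask_master, read_from_chunkserver):
    """Translate (file name, byte offset) into chunk reads, as in Figure 1."""
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE           # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE           # byte range within the chunk
        key = (filename, chunk_index)
        if key not in chunk_cache:
            # One request to the master; the reply is cached so further reads
            # of the same chunk need no master interaction.
            chunk_cache[key] = ask_master(filename, chunk_index)
        handle, locations = chunk_cache[key]
        n = min(length, CHUNK_SIZE - chunk_offset)   # do not read past the chunk boundary
        data += read_from_chunkserver(locations[0], handle, chunk_offset, n)
        offset += n
        length -= n
    return data
```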

DESIGN OVERVIEW Chunk Size The chunk size is 64 MB. Each chunk replica is stored as a plain Linux file on a chunkserver. A large chunk size reduces the clients' need to interact with the master, reduces network overhead, and reduces the size of the metadata stored on the master. The drawback is that small files consisting of a single chunk can become hot spots.

DESIGN OVERVIEW Metadata The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory.

DESIGN OVERVIEW Metadata In-Memory Data Structures Master operations are fast, and it is easy and efficient for the master to periodically scan its entire state in the background. Capacity is limited by how much memory the master has, but the cost of adding extra memory is far less than the benefits gained.
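
As a rough back-of-the-envelope check (the per-chunk figure is the one reported in the GFS paper): the master keeps less than 64 bytes of metadata per 64 MB chunk, so a petabyte of file data corresponds to about 16 million chunks and on the order of 1 GB of master memory.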

DESIGN OVERVIEW Metadata Chunk Locations The master does not keep a persistent record of chunk locations; it polls chunkservers for that information at startup and keeps it up to date with regular HeartBeat messages. This avoids having to keep the master and chunkservers in sync as chunkservers join and leave, change names, fail, and restart.

DESIGN OVERVIEW Metadata Operation Log The operation log contains a historical record of critical metadata changes. It is replicated on multiple remote machines, and the master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely.

DESIGN OVERVIEW Metadata Operation Log The master recovers its file system state by replaying the operation log. The master checkpoints its state whenever the log grows beyond a certain size, and a new checkpoint can be created without delaying incoming mutations.
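
A minimal sketch of this recovery path, assuming a toy log format of (operation, path) records; the helper names and the pickle-based checkpoint are illustrative, not the real GFS formats:

```python
import pickle

def save_checkpoint(state, path="checkpoint.bin"):
    """Dump the master's metadata once the log grows beyond a threshold."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def apply_mutation(state, record):
    """Apply one logged metadata mutation (illustrative operations only)."""
    op, path = record
    if op == "create":
        state.setdefault(path, [])    # new file with no chunks yet
    elif op == "delete":
        state.pop(path, None)

def recover(checkpoint_path, log_records):
    """Recover master state: load the latest checkpoint, then replay the
    (usually short) tail of the operation log recorded after it."""
    with open(checkpoint_path, "rb") as f:
        state = pickle.load(f)
    for record in log_records:        # e.g. ("create", "/foo"), ("delete", "/bar")
        apply_mutation(state, record)
    return state
```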

DESIGN OVERVIEW Consistency Model GFS has a relaxed consistency model that supports highly distributed applications well but remains relatively simple and efficient to implement. Two aspects: guarantees by GFS, and implications for applications.

DESIGN OVERVIEW Consistency Model Guarantees by GFS File namespace mutations are atomic. The state of a file region after a data mutation depends on the type of mutation and whether it succeeds: a region can be consistent or inconsistent, and defined or undefined (Table 1).

DESIGN OVERVIEW Consistency Model Guarantees by GFS

Table 1: File Region State after Mutation

                        Write                       Record Append
Serial success          defined                     defined interspersed with inconsistent
Concurrent successes    consistent but undefined    defined interspersed with inconsistent
Failure                 inconsistent                inconsistent

DESIGN OVERVIEW Consistency Model Implications for Applications Applications can accommodate the relaxed model with a few simple techniques: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
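
One way an application can produce self-validating, self-identifying records (a hedged sketch; GFS does not dictate a record format) is to frame each appended record with its length, a unique record ID, and a checksum, so a reader can skip padding, detect corruption, and discard duplicates from retried appends:

```python
import struct, zlib

def encode_record(record_id: int, payload: bytes) -> bytes:
    """Self-identifying, self-validating record: length + id + CRC-32 + payload."""
    header = struct.pack("!IQI", len(payload), record_id, zlib.crc32(payload))
    return header + payload

def decode_records(data: bytes):
    """Yield valid records, skipping padding/garbage and duplicated appends."""
    seen, pos = set(), 0
    while pos + 16 <= len(data):
        length, record_id, crc = struct.unpack_from("!IQI", data, pos)
        payload = data[pos + 16 : pos + 16 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:          # duplicates come from retried appends
                seen.add(record_id)
                yield record_id, payload
            pos += 16 + length
        else:
            pos += 1                           # corrupt or padded region: resynchronize byte by byte
```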

SYSTEM INTERACTIONS Leases and Mutation Order Data Flow Atomic Record Appends Snapshot

SYSTEM INTERACTIONS Leases and Mutation Order Each mutation is performed at all of a chunk's replicas. Leases are used to maintain a consistent mutation order across replicas. The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds.

SYSTEM INTERACTIONS Leases and Mutation Order [Figure 2: Write Control and Data Flow. (1) The client asks the master which chunkserver holds the lease (the primary) and where the other replicas are; (2) the master replies; (3) the client pushes the data to all replicas; (4) the client sends the write request to the primary; (5) the primary forwards the request to the secondary replicas A and B; (6) the secondaries reply to the primary; (7) the primary replies to the client. Control messages flow between client, master, and replicas; data flows along a chain of chunkservers.]
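
A hedged sketch of how the primary uses its lease to impose a single mutation order across replicas; the apply and forward helpers are hypothetical placeholders:

```python
import time

class PrimaryReplica:
    """Holds the chunk lease and serializes concurrent mutations."""
    LEASE_TIMEOUT = 60.0   # seconds, matching the initial lease timeout above

    def __init__(self, lease_granted_at, secondaries, apply_locally, forward):
        self.lease_granted_at = lease_granted_at
        self.secondaries = secondaries        # addresses of the secondary replicas
        self.apply_locally = apply_locally    # callable: (serial, mutation) -> None
        self.forward = forward                # callable: (secondary, serial, mutation) -> bool
        self.next_serial = 0

    def mutate(self, mutation):
        if time.time() - self.lease_granted_at > self.LEASE_TIMEOUT:
            return "error: lease expired, client must re-ask the master"
        serial = self.next_serial             # the serial number *is* the mutation order
        self.next_serial += 1
        self.apply_locally(serial, mutation)
        # All secondaries apply mutations in the same serial-number order.
        ok = all(self.forward(s, serial, mutation) for s in self.secondaries)
        return "success" if ok else "error: client retries, region may be inconsistent"
```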

SYSTEM INTERACTIONS Data Flow The flow of data is decoupled from the flow of control to use the network efficiently. Each machine forwards the data to the closest machine in the network topology that has not yet received it. [Diagram: the client pushes data along a chain of chunkservers S1 through S4.]

SYSTEM INTERACTIONS Data Flow Latency is minimized by pipelining the data transfer over TCP connections on a switched network with full-duplex links. The ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput (100 Mbps) and L is the latency to transfer bytes between two machines (far below 1 ms).
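
A quick sanity check of the formula with the numbers above (a sketch; L is taken as a 1 ms upper bound):

```python
def ideal_transfer_time(num_bytes, num_replicas, throughput=100e6 / 8, latency=1e-3):
    """B/T + R*L, with T in bytes per second and L in seconds."""
    return num_bytes / throughput + num_replicas * latency

# 1 MB pushed to 3 replicas over a 100 Mbps (12.5 MB/s) link:
# 1 MB / 12.5 MB/s + 3 * 1 ms ~= 80 ms + 3 ms, so the pipeline is dominated by B/T.
print(round(ideal_transfer_time(1_000_000, 3) * 1000, 1), "ms")   # -> 83.0 ms
```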

SYSTEM INTERACTIONS Atomic Record Appends The client specifies only the data, not the offset. Many clients on different machines append to the same file concurrently; such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients. Record append follows the control flow of a regular write, with a little extra logic at the primary.
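
A hedged sketch of the "little extra logic at the primary" for record append: if the record does not fit in the chunk, the primary pads the chunk and tells the client to retry on the next chunk; otherwise it chooses the offset, so every replica writes the record at the same place (the data structures are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append_at_primary(chunk, record):
    """chunk: dict with a 'data' bytearray standing in for the replica's contents."""
    used = len(chunk["data"])
    if used + len(record) > CHUNK_SIZE:
        # Pad the rest of the chunk (secondaries do the same) and ask the
        # client to retry the append on the next chunk of the file.
        chunk["data"].extend(b"\x00" * (CHUNK_SIZE - used))
        return None, "retry on next chunk"
    offset = used                        # the primary chooses the offset...
    chunk["data"].extend(record)         # ...appends locally...
    return offset, "forward (offset, record) to secondaries"   # ...and secondaries write at the same offset
```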

SYSTEM INTERACTIONS Snapshot Makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions of ongoing mutations. It is implemented using standard copy-on-write techniques.

SYSTEM INTERACTIONS Snapshot [Diagram: handling a snapshot request. The GFS client sends the master a snapshot request for a source directory. The master first revokes any outstanding leases on the chunks of the files about to be snapshotted (e.g., chunk C), logs the operation to the operation log, and then duplicates the metadata for the source directory tree into the destination directory; the snapshot files initially point to the same chunks (C) as the source files.]

SYSTEM INTERACTIONS Snapshot [Diagram: copy-on-write after a snapshot. When a client first wants to write to chunk C, it asks the master for the current lease holder. The master notices that the reference count for chunk C is greater than 1, so it asks each chunkserver holding a replica of C to create a local copy, a new chunk C'. It then grants one replica of C' a lease and replies to the client, which writes to C' as it would to any chunk.]
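
A minimal sketch of the reference-count bookkeeping behind this copy-on-write step, assuming a toy in-memory master state (the helper `copy_chunk_on_chunkservers` and the handle allocation are illustrative):

```python
def snapshot(master, src_dir, dst_dir):
    """Duplicate metadata: snapshot files share chunks with the source, so bump refcounts."""
    for path, handles in list(master["files"].items()):
        if path == src_dir or path.startswith(src_dir + "/"):
            new_path = dst_dir + path[len(src_dir):]
            master["files"][new_path] = list(handles)
            for h in handles:
                master["refcount"][h] += 1

def handle_first_write(master, path, index, new_handle, copy_chunk_on_chunkservers):
    """First write to chunk C of `path` after a snapshot: refcount > 1 forces a copy C'."""
    handle = master["files"][path][index]
    if master["refcount"][handle] > 1:
        copy_chunk_on_chunkservers(handle, new_handle)   # each replica copies C locally
        master["refcount"][handle] -= 1
        master["refcount"][new_handle] = 1
        master["files"][path][index] = new_handle        # the file now points at C'
        return new_handle                                # a lease is granted on C'
    return handle
```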

MASTER OPERATIONS Namespace Management and Locking Replica Placement Creation, Re-replication, Rebalancing Garbage Collection Stale Replica Detection

MASTER OPERATIONS Namespace Mgt and Locking GFS allows multiple master operations to be active at once and uses locks over regions of the namespace to ensure proper serialization. The namespace is logically represented as a lookup table mapping full pathnames to metadata. Each node in the namespace tree has an associated read-write lock, and each master operation acquires a set of locks before it runs.

MASTER OPERATIONS Namespace Mgt and Locking For an operation involving /d1/d2/.../dn/leaf, the master acquires read locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf.

MASTER OPERATIONS Namespace Mgt and Locking Example: how this locking mechanism prevents a file /home/user/foo from being created while /home/user is being snapshotted to /home/save (see the sketch below):

              Snapshot operation        File creation operation
Read locks    /home, /save              /home, /home/user
Write locks   /home/user, /save/user    /home/user/foo

The two operations conflict on /home/user (write lock versus read lock), so they are properly serialized.
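
A small sketch of the locking rule and why the two operations above conflict; this only computes the lock sets rather than taking real locks, and it assumes the creating operation takes a write lock on the new file name:

```python
def lock_sets(pathname, write_leaf):
    """Return (read_locks, write_locks) a master operation must acquire for pathname."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    if write_leaf:
        return set(ancestors), {leaf}
    return set(ancestors) | {leaf}, set()

# Snapshotting /home/user into /save/user: read locks on /home and /save,
# write locks on /home/user and /save/user.
snap_r = lock_sets("/home/user", True)[0] | lock_sets("/save/user", True)[0]
snap_w = lock_sets("/home/user", True)[1] | lock_sets("/save/user", True)[1]

# Creating /home/user/foo: read locks on /home and /home/user, write lock on the leaf.
create_r, create_w = lock_sets("/home/user/foo", True)

# The operations conflict because one wants /home/user as a write lock and the
# other as a read lock, so the master serializes them.
print(snap_w & (create_r | create_w))   # {'/home/user'}
```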

MASTER OPERATIONS Replica Placement There are hundreds of chunkservers spread across many machine racks, and communication between two machines on different racks may cross one or more network switches. The placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization.

MASTER OPERATIONS Creation, Re-replication, Rebalancing Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing. When creating a chunk, the master considers several factors (see the sketch below): place new replicas on chunkservers with below-average disk space utilization; limit the number of recent creations on each chunkserver; spread replicas of a chunk across racks.
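
A hedged sketch of a placement policy honoring the three factors above; the threshold, the sort order, and the data layout are invented for illustration (the paper lists the factors but not exact formulas):

```python
def choose_chunkservers(servers, num_replicas=3, recent_creation_limit=5):
    """servers: list of dicts with 'name', 'rack', 'disk_util', 'recent_creations'."""
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s["disk_util"] <= avg_util                        # below-average disk utilization
                  and s["recent_creations"] < recent_creation_limit]   # avoid creation hot spots
    candidates.sort(key=lambda s: s["disk_util"])
    chosen, racks = [], set()
    for s in candidates:                    # prefer spreading replicas across racks
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    for s in candidates:                    # fall back to reusing racks if needed
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]
```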

MASTER OPERATIONS Creation, Re-replication, Rebalancing The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. Each chunk that needs to be re-replicated is prioritized based on several factors. The master picks the highest-priority chunk and clones it by instructing chunkservers.

MASTER OPERATIONS Creation, Re-replication, Rebalancing The master rebalances replicas periodically. It gradually fills up a new chunkserver rather than instantly swamping it with new chunks.

MASTER OPERATIONS Garbage Collection [Diagram: a client asks to delete the file /foo. The master logs the deletion and renames the file to a hidden name that includes the deletion timestamp (e.g., /.foo-20081126); the chunks themselves are not reclaimed immediately. The master later removes such hidden files and their orphaned chunks lazily during its regular background scans, and chunkservers learn which of their chunks can be deleted.]

MASTER OPERATIONS Stale Replica Detection For each chunk, the master maintains a chunk version number. The version number is increased whenever the master grants a new lease on the chunk. The master removes stale replicas in its regular garbage collection. The client and the chunkserver also verify the version number when performing an operation, so they always access up-to-date data.
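
A small sketch of the version-number comparison, with illustrative data structures:

```python
def detect_stale_replicas(master_versions, reported):
    """master_versions: chunk handle -> current version.
    reported: iterable of (chunkserver, handle, version) tuples from HeartBeats.
    Returns the replicas the master will later garbage-collect as stale."""
    stale = []
    for server, handle, version in reported:
        if version < master_versions[handle]:
            stale.append((server, handle))        # this replica missed mutations while down
        elif version > master_versions[handle]:
            # A higher version means the master failed after granting a lease;
            # it adopts the higher version as the up-to-date one.
            master_versions[handle] = version
    return stale
```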

FAULT TOLERANCE AND DIAGNOSIS High Availability Data Integrity Diagnostic Tools

FAULT TOLERANCE AND DIAGNOSIS High Availability Fast recovery. Chunk replication. Master replication: the operation log and checkpoints are replicated on multiple machines, and monitoring infrastructure outside GFS starts a new master process elsewhere if the master fails.

FAULT TOLERANCE AND DIAGNOSIS Data Integrity Each chunkserver uses checksumming to detect corruption of stored data. A chunk is broken up into 64 KB blocks, and each block has a 32-bit checksum. Checksumming has little effect on read performance.
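
A minimal sketch of block-level checksumming as described above, using CRC-32 as a stand-in 32-bit checksum (the paper does not name the exact checksum function):

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks, each with its own 32-bit checksum

def compute_checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data: bytes, checksums, offset: int, length: int) -> bool:
    """Before returning data, verify only the blocks overlapping the read range,
    which is why checksumming adds little I/O or computation to reads."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE : (b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False   # mismatch: report to the master, which re-replicates from another replica
    return True
```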

FAULT TOLERANCE AND DIAGNOSIS Diagnostic Tools GFS servers generate diagnostic logs that record many significant events and all RPC requests and replies. The performance impact of logging is minimal because these logs are written sequentially and asynchronously.

MEASUREMENTS Micro-benchmarks Real World Clusters

MEASUREMENTS Micro-benchmarks Each test machine has dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex link to an HP 2524 switch; the two switches are connected by a 1 Gbps link.

MEASUREMENTS Micro-benchmarks Reads N clients read simultaneously. Each client reads a randomly selected 4 MB region from a 320 GB file set, repeated 256 times so that each client ends up reading 1 GB of data. At most a 10% hit rate in the Linux buffer cache is expected.

MEASUREMENTS Micro-benchmarks Writes N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes.

MEASUREMENTS Micro-benchmarks Record Appends N clients append simultaneously to a single file. In practice, applications tend to append to multiple files concurrently: N clients append to M shared files simultaneously, where both N and M are in the dozens or hundreds.

MEASUREMENTS Real World Clusters Cluster A: Used regularly for research and development by over a hundred engineers. A typical task is initiated by a human user and runs up to several hours; it reads through a few MBs to a few TBs of data.

MEASUREMENTS Real World Clusters Cluster B: Used for production data processing. The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. In both cases, a single task consists of many processes on many machines.

MEASUREMENTS Real World Clusters Storage

Characteristics of two GFS clusters:

Cluster                     A        B
Chunkservers                342      227
Available disk space        72 TB    180 TB
Used disk space             55 TB    155 TB
Number of files             735 k    737 k
Number of dead files        22 k     232 k
Number of chunks            992 k    1550 k
Metadata at chunkservers    13 GB    21 GB
Metadata at master          48 MB    60 MB

MEASUREMENTS Real World Clusters Metadata The metadata at chunkservers consists of checksums for 64 KB blocks of user data and the chunk version number. The metadata at the master is much smaller, about 100 bytes per file on average. Each individual server, whether chunkserver or master, holds only 50 to 100 MB of metadata (see the table above).

MEASUREMENTS Real World Clusters Read and Write Rates Both clusters had been up for about one week when measured.

Performance metrics for two GFS clusters:

Cluster                       A          B
Read rate (last minute)       583 MB/s   380 MB/s
Read rate (last hour)         562 MB/s   384 MB/s
Read rate (since restart)     589 MB/s   49 MB/s
Write rate (last minute)      1 MB/s     101 MB/s
Write rate (last hour)        2 MB/s     117 MB/s
Write rate (since restart)    25 MB/s    13 MB/s
Master ops (last minute)      325 Ops/s  533 Ops/s
Master ops (last hour)        381 Ops/s  518 Ops/s
Master ops (since restart)    202 Ops/s  347 Ops/s

MEASUREMENTS Real World Clusters Master Load The master can support many thousands of file accesses per second (see the operation rates in the table above). It is possible to speed this up further by placing name lookup caches in front of the namespace data structure.

MEASUREMENTS Real World Clusters Recovery Time Experiment 1: Killed a single chunkserver in cluster B, which had about 15,000 chunks containing 600 GB of data. Re-replication was limited to 91 concurrent clone operations (40% of the number of chunkservers), and each clone operation was allowed to consume at most 6.25 MB/s (50 Mbps).

MEASUREMENTS Real World Clusters Recovery Time Result of experiment 1: All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s (roughly 600 GB / 23.2 minutes).

MEASUREMENTS Real World Clusters Recovery Time Experiment 2: Killed two chunkservers, each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to having a single replica; all of them were restored to at least 2x replication within 2 minutes.

CONCLUSIONS GFS supports large-scale data processing workloads on commodity hardware. Reexamining traditional file system assumptions led to radically different points in the design space. GFS provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery.

CONCLUSIONS GFS delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. It has successfully met Google's storage needs and is widely used within Google as the storage platform for research and development as well as production data processing.