GFS CS6450: Distributed Systems Lecture 5 Ryan Stutsman Some material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Franz Kaashoek, and Nickolai Zeldovich. 1
From Last Time... The problem with Primary/Backup? Under poor connectivity or high churn, it may thrash if state synchronization to new backups is costly Think of cross-DC/cross-WAN replication, or cloud deployments across AZs 2
Compute: MapReduce. Storage: ??? MapReduce gives scalable, fault-tolerant data processing How do we do the same thing for data storage? Cheap hardware is a competitive advantage But we have to deal with faults as a result 3
Filesystems Map of (hierarchical) filenames to variable-length blobs open(filename) -> fd read(fd) -> bytes write(fd, bytes) seek(fd, pos) Multiple appenders in POSIX with O_APPEND? 4
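Aside: the slide's POSIX question hinges on O_APPEND semantics: the kernel positions every write() at the current EOF atomically, so concurrent appenders to one local file interleave whole writes rather than overwrite each other. A minimal sketch (the path is made up; this is local POSIX, not GFS):

```python
# Two processes can run this concurrently against the same local file:
# O_APPEND makes the kernel seek to EOF atomically for every write().
import os

fd = os.open("/tmp/shared.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"record from appender A\n")  # lands at EOF even if others are appending
os.close(fd)
```

The contrast GFS needs: the same property across machines and replicas, without a shared kernel to serialize the writes.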
Assumptions High component failure rates Inexpensive commodity components fail all the time Modest number of HUGE files Millions of files, many GB or even TBs each Files are write-once, mostly appended to Perhaps concurrently Large streaming reads High sustained throughput favored over low latency 5
Overview 6
Multi-writer Appends Common [Figure: several Crawlers append to a shared URL log (TBs of data), which is read by multiple Processor tasks] Need parallel processing, but log all results to a common log stream for consumer processes Want to spread read I/O over many disks 7
Biggest Questions from Forms Concerns about the Master: fault-tolerance/availability; scaling, capacity, and load Defined/Consistent: why not avoid this? How do apps deal with it? (Not asked, but what is the point/benefit?) 8
GFS Architecture [Figure: an Application uses the GFS client; the client sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); the master holds the file namespace (e.g., /foo/bar -> chunk 2ef0), sends instructions to chunkservers, and receives chunkserver state back; the client then sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data; chunkservers store chunks in the local Linux file system. Legend distinguishes data messages from control messages.] 9
Reads 1. Get server list for (filename, offset) from the master; response includes primary/secondaries 2. Contact any replica with the offset 3. Get data back, or an indication the read is beyond EOF [Figure: Client, Master, Primary Replica, Secondary Replicas A and B] 10
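A sketch of this read path may help; Master and Chunkserver here are hypothetical in-process stand-ins for the real RPC services, not GFS's actual interfaces:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks

class Chunkserver:
    def __init__(self):
        self.chunks = {}                          # chunk handle -> bytes
    def read(self, handle, off, length):
        data = self.chunks.get(handle, b"")
        return None if off >= len(data) else data[off:off + length]  # None = "beyond EOF"

class Master:
    def __init__(self):
        self.namespace = {}                       # filename -> list of chunk handles
        self.locations = {}                       # chunk handle -> replica Chunkservers
    def lookup(self, filename, chunk_index):
        handle = self.namespace[filename][chunk_index]
        return handle, self.locations[handle]

def gfs_read(master, filename, offset, length):
    # Control path: one small lookup at the master (cacheable by the client).
    handle, replicas = master.lookup(filename, offset // CHUNK_SIZE)
    # Data path: bulk bytes come from any replica, keeping the master off the data path.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)
```

The point of the split is visible here: the master answers tiny metadata queries while chunkservers serve the bulk bytes.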
Writes/Appends 1. Get server list for (filename, offset) from the master; response includes primary/secondaries 2-3. Push data along a pipeline of replicas 4. Send the write command to the primary 5-6. Primary orders it, conveys that order to the secondaries 7. Notify the client of the outcome/offset [Figure: Client, Master, Primary Replica, Secondary Replicas A and B; legend distinguishes control from data messages] 11
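A compact sketch of the ordering step, again with hypothetical in-memory stand-ins: data is buffered at every replica first, and only the primary chooses the order in which buffered mutations are applied:

```python
class Replica:
    def __init__(self):
        self.buffered = {}                 # mutation id -> pushed-but-unapplied bytes
        self.chunk = bytearray()           # applied chunk contents

    def push(self, mid, data):             # steps 2-3: data pushed along the pipeline
        self.buffered[mid] = data

    def apply(self, mid, offset):          # applied in the order the primary chose
        data = self.buffered.pop(mid)
        if len(self.chunk) < offset:       # fill any gap so the bytes land at `offset`
            self.chunk.extend(b"\x00" * (offset - len(self.chunk)))
        self.chunk[offset:offset + len(data)] = data

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, mid, offset):          # step 4: client sends the write command here
        self.apply(mid, offset)            # primary applies in its chosen serial order
        for s in self.secondaries:         # steps 5-6: convey the same order to secondaries
            s.apply(mid, offset)
        return "ok", offset                # step 7: outcome/offset reported to the client
```

If a secondary fails to apply, the real system reports failure and the client retries, which is where the duplicate/undefined regions discussed later come from.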
Mutual Exclusion/Ownership [Figure: Master issues grant(chunk 32) to one of several Servers] Need mutual exclusion between servers for chunk mutations Only one chunkserver should order writes for a given chunk 12
Mutual Exclusion/Ownership [Figure: Master issues grant(chunk 32) to one of several Servers] Heartbeats determine when a new primary is needed But what about safe revocation? 13
Leases [Figure: Master issues grant(chunk 32, until 16:31:24) to one of several Servers] Assume bounded clock drift Assume the network is only so asynchronous in delivery Need lease term >> propagation delay + clock skew: t_c = max(0, t_s - (m_prop + 2*m_proc) - ε) Very common approach to mutual exclusion 14
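The lease-term formula can be read as: the holder must stop relying on the lease early enough to cover message time in flight and clock error. A small sketch; the interpretation of the variables (t_s = granted term, m_prop = propagation bound, m_proc = per-hop processing bound, ε = clock drift/skew budget) is my assumption about the slide's notation:

```python
def safe_lease_term(t_s, m_prop, m_proc, epsilon):
    """Client-side usable lease term, in seconds:
    t_c = max(0, t_s - (m_prop + 2*m_proc) - epsilon)."""
    return max(0.0, t_s - (m_prop + 2 * m_proc) - epsilon)

# A 60 s grant with 50 ms propagation, 10 ms per-hop processing, 100 ms drift budget:
print(safe_lease_term(60.0, 0.05, 0.01, 0.1))   # -> 59.83
```

If the bounds are wrong (the network is more asynchronous or clocks drift more than assumed), two servers can believe they hold the lease at once, which is exactly the unsafe revocation the lease is meant to prevent.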
Questions How should append failures be handled? What is the effect of write to all, read from any? Assume two clients C1 and C2, and no active writers If C1 reads record r at offset o and notifies C2, then C2 reads at offset o Is C2 guaranteed to see r? 15
Master Single master Find files and their chunks Access control Heartbeating, monitoring chunkservers Chunk distribution/rebalancing Snapshotting Maintains: filename -> chunk list map; chunk list -> chunk servers map All in memory: can access/scan structures quickly Advantages/disadvantages? 16
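A back-of-the-envelope check on "all in memory": per-chunk metadata is tiny relative to the 64 MB of data it describes. The ~64 bytes/chunk figure below is an assumption in the ballpark the GFS paper cites, not something on the slide:

```python
BYTES_PER_CHUNK_META = 64          # assumed metadata cost per chunk at the master
CHUNK_SIZE = 64 * 2**20            # 64 MB chunks

def master_metadata_ram(total_file_bytes):
    return (total_file_bytes // CHUNK_SIZE) * BYTES_PER_CHUNK_META

print(master_metadata_ram(2**50) / 2**30, "GiB")   # ~1 PB of data -> ~1.0 GiB of metadata
```

So a single master's RAM comfortably covers petabytes of file data; the real limits are availability and request load, which is why the concerns on the earlier slide focus there.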
Master Recovery Filesystem metadata changes are logged locally and remotely Logged synchronously before acknowledging client requests On crash, replay the log to reconstruct state To bound recovery time, checkpoint state On restart, mmap the checkpoint, then replay the log tail Does checkpointing block normal requests? How often should checkpointing be done? What about chunk locations? They aren't logged. 17
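A minimal sketch of the replay side of recovery, assuming an append-only operation log of JSON records plus a JSON checkpoint (file formats and op names here are hypothetical):

```python
import json, os

def apply_op(state, op):
    if op["kind"] == "create":
        state["namespace"][op["file"]] = []          # new file, no chunks yet
    elif op["kind"] == "add_chunk":
        state["namespace"][op["file"]].append(op["handle"])

def recover(checkpoint_path, log_path):
    state = {"namespace": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)                     # bulk of the state, loaded fast
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:                           # replay the log (in practice only
                apply_op(state, json.loads(line))    # the tail after the checkpoint)
    return state                                     # chunk locations are NOT here
```

The answer to the slide's last question falls out of this shape: chunk locations never appear in the log or checkpoint, so recovery re-learns them from chunkserver heartbeats.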
Shadow Masters Improve scaling of metadata read operations Shadow masters consume remote copies of the replication log to provide a nearly up-to-date view of master metadata Since data ops go to chunk servers, they see up-to-date data for the chunks the shadow is aware of 18
Consistency A consistency model defines expected behavior under concurrent mutation Should a read started after an acknowledged write see the effects of that write? Must it in GFS? Answers to these questions depend on how much complexity developers can tolerate and on performance trade-offs 19
Defined, consistent? Outcome depends on type of mutation, success/failure, and concurrent mutations Consistent: all clients will see the same data regardless of which replica they read Defined: consistent, and the region contains the value written by the client 20
No Failure, No Concurrency [Figure: client C1 writes "abc"; replicas R1, R2, and R3 all hold "abc"] Defined and consistent 21
Failure, No Concurrency [Figure: client C1 writes "abc"; replicas R1 and R2 hold "abc", but the write to R3 fails] Undefined and inconsistent Why? R3 is inconsistent with the others, and a read from R3 gives an undefined value back 22
No Failure, Concurrency [Figure: C1 writes "aaaabbbbcccc" while C2 writes "ddddeeeeffff"; every replica ends up with the same interleaving, e.g. "aaaa dddd bbbb eeee ..."] Undefined and consistent Why? What's the benefit of this approach? 23
Append Serial/Concurrent Must ensure all appends go to the end of the file No append may land before a previously successful append Easy? Additional constraints Must maintain floor(offset / 64 MB) = chunk number Otherwise the master would need per-chunk sizes, and reads/writes would need a linear-time seek to find their chunk Avoid cross-chunkserver coordination when an append crosses a chunk boundary Else need distributed commit; see 2PC next week 24
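The invariant is what keeps chunk lookup pure arithmetic; a quick illustration:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB

def locate(offset):
    """Which chunk holds this byte, and where inside it, with no size table or scan."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

print(locate(0))              # (0, 0)
print(locate(64 * 2**20))     # (1, 0): first byte of the second chunk
print(locate(100 * 2**20))    # (1, 37748736): 36 MB into chunk 1
```

If appends could leave chunks of arbitrary length, this arithmetic breaks: the master would have to track every chunk's size and clients would have to walk the chunk list to find an offset.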
Problem: Straddling Appends [Figure: C1 and C2 each issue Append(...) calls that straddle a chunk boundary] Without care, a reader may miss concurrent appends 25
Solution: Padding [Figure: when C1's or C2's Append(...) would straddle a chunk boundary, the rest of the chunk is padded and the record goes in the next chunk] Side effect: defined regions interspersed with padding Where do duplicates come from? Append retry Hence, padding may actually be inconsistent 26
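This is why GFS applications add their own record framing: readers must be able to skip padding and suppress duplicates themselves. A sketch of reader-side handling; the framing (magic + length + record id) is hypothetical, not GFS's actual format:

```python
import struct

MAGIC = 0xF00DF00D
HEADER = struct.Struct("<IIQ")            # magic, payload length, unique record id

def encode(record_id, payload):
    return HEADER.pack(MAGIC, len(payload), record_id) + payload

def scan(chunk_bytes):
    """Yield each valid record once, skipping padding and duplicate append retries."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(chunk_bytes):
        magic, length, rid = HEADER.unpack_from(chunk_bytes, pos)
        if magic != MAGIC:                # padding (or torn bytes): resynchronize
            pos += 1
            continue
        payload = chunk_bytes[pos + HEADER.size : pos + HEADER.size + length]
        pos += HEADER.size + length
        if rid not in seen:               # drop the duplicate left by a retried append
            seen.add(rid)
            yield payload

data = encode(1, b"A") + b"\x00" * 8 + encode(2, b"B") + encode(2, b"B")
print(list(scan(data)))                   # -> [b'A', b'B']
```

The loose consistency model is workable because this kind of defensive reading is cheap for append-heavy, self-describing record streams.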
Ordering Summary How are operations on files ordered? First, by the master's read/write locks when metadata is mutated Master delegates ordering responsibility within a chunk to a primary chunkserver 27
Interesting Points to add/discuss Rack layout, failure domains, network topology, IP addresses as distance Separation of control path and data path Master operations, fault-tolerance, and recovery Shadow masters Evals: are 3 copies enough? Are failures independent? Recovery, replica rebalancing, placement Loosely coupled replica garbage collection Existence of chunks is effectively ground truth Doesn't matter if the master says a chunkserver has something it doesn't Doesn't matter if the master says a chunkserver doesn't have something it does System design needs to work correctly and converge on truth 28