Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015 Page 1
Example Replicated File Systems NFS Coda Ficus Page 2
NFS Originally NFS did not have any replication capability Replication of read-only file systems added later Primary copy read/write replication added later Page 3
NFS Read-Only Replication Almost by hand Sysadmin ensures multiple copies of file systems are identical Typically on different machines Avoid writing to any replica E.g., mount them read-only Use automounting facilities to handle failover and load balancing Page 4
Primary Copy NFS Replication Commonly implemented with DRBD (Distributed Replicated Block Device) Typically two replicas Primarily for reliability One replica is the primary It can be written Other replica mirrors the primary Provides service if primary unavailable Page 5
Some Primary Copy Issues Handling updates How and when do they propagate? Determining failure Of the secondary copy Of the primary copy Handling recovery Page 6
Update Issues In DRBD Two choices: Synchronous Writes don't return until both copies are updated Asynchronous Writes return once primary updated Secondary updated later Page 7
Implications of Synchronous Writes Slower, since can't indicate success till both copies are written One is written across the network, ensuring slowness Fewer consistency issues If write returned, both copies have it If not, neither does Really bad timing requires some cleanup Page 8
Implications of Asynchronous Writes Faster, since you only wait for primary copy Almost always works just fine Almost always Problems when it doesn't though Different values of same data at different copies May not be clear how it happened Perhaps even worse Page 9
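The synchronous/asynchronous trade-off above can be sketched in a few lines of Python. This is a toy in-memory model, not DRBD itself; the class and method names (`PrimaryCopyStore`, `write_sync`, `write_async`, `flush`) are hypothetical, and a dict assignment stands in for the network write to the secondary.

```python
class PrimaryCopyStore:
    """Toy primary-copy store with sync and async update propagation."""

    def __init__(self):
        self.primary = {}       # the writable copy
        self.secondary = {}     # the mirror
        self.pending = []       # updates not yet applied to the secondary

    def write_sync(self, key, value):
        # Synchronous: don't report success until BOTH copies hold the value.
        self.primary[key] = value
        self.secondary[key] = value   # stands in for a network round trip
        return "ok"

    def write_async(self, key, value):
        # Asynchronous: return once the primary is updated; the secondary
        # catches up later, so the two copies can briefly diverge.
        self.primary[key] = value
        self.pending.append((key, value))
        return "ok"

    def flush(self):
        # Later propagation of queued updates to the secondary.
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()


store = PrimaryCopyStore()
store.write_async("/etc/motd", "hello")
assert store.primary["/etc/motd"] == "hello"
assert "/etc/motd" not in store.secondary   # the divergence window
store.flush()
assert store.secondary["/etc/motd"] == "hello"
```

If the primary crashes during that divergence window, the secondary is missing the update, which is exactly the consistency problem the slide describes.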
Detecting Failures DRBD usually uses a heartbeat process Primary and secondary expect to communicate every few seconds E.g., every two seconds If too many heartbeats in a row missed, declare the partner dead Might just be unreachable, though Page 10
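A minimal sketch of that heartbeat scheme, assuming a two-second interval and a threshold of three consecutive misses. The parameter names are illustrative, not DRBD configuration options.

```python
HEARTBEAT_INTERVAL = 2.0   # seconds between expected heartbeats
MAX_MISSED = 3             # consecutive misses before declaring death


class HeartbeatMonitor:
    """Declare the partner dead after MAX_MISSED missed heartbeats."""

    def __init__(self, interval=HEARTBEAT_INTERVAL, max_missed=MAX_MISSED):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = 0.0

    def heard_from_partner(self, now):
        self.last_seen = now

    def partner_dead(self, now):
        # "Dead" really means "unreachable": a network partition looks
        # exactly like a crashed partner, which is the split-brain risk.
        return now - self.last_seen > self.interval * self.max_missed


m = HeartbeatMonitor()
m.heard_from_partner(now=10.0)
assert not m.partner_dead(now=14.0)   # only two intervals elapsed
assert m.partner_dead(now=17.0)       # more than three intervals silent
```

Note the caveat in the last comment: this detector cannot distinguish a dead partner from an unreachable one, which is why the split-brain problem on the following slides exists.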
Responding To Failures Switch service from the primary to the secondary Which becomes the primary Including write service Ensures continued operation after failure Update logging ensures new primary is up to date Page 11
Recovery From Failures Recovered node becomes the secondary Receives missed updates from primary Complications if network failure caused the failure The split brain problem Page 12
The Split Brain Problem [Diagram: a network partition separates the primary from the secondary; the secondary declares itself primary, and both sides independently accept updates (update 1, update 2, update 3)] Now what? Page 13
The Simple Solution Prevent access to both Until sysadmin designates one of them as the new primary Throw away the other and reset to the designated primary Simple for the sysadmin, maybe not for the users Page 14
What Other Solution Is Possible? Try to figure out what the correct version of the data is In NFS case, chances are good writes are to different files In which case, you probably just need the most recent copy of each file But there are complex cases NFS replication doesn't try to do this Page 15
Coda A follow-on to the Andrew File System (AFS) Using the basic philosophy of AFS But specifically to handle mobile computers Page 16
The AFS System [Diagram: client workstations request files from a server pool] Page 17
AFS Characteristics Files permanently stored at exactly one server Clients keep cached copies Writes cached until file close Asynchronous writes Other copies then invalidated Unless write conflicts Stateful servers Page 18
Adding Mobile Computers Just like AFS, except... [Diagram: a server pool and client workstations, some of which are mobile computers] Page 19
Why Does That Make a Difference? Mobile computers come and go Well, so do users at workstations But mobile computers take their files with them And expect to access them while they are gone What happens when they do? Page 20
The Mobile Problem for AFS The laptop downloads some files to its disk Then it disconnects from the network Then it uses the files And maybe writes them Now it reconnects Page 21
Why Is This Different Than Normal AFS? We might get write conflicts here Normal AFS might, too But normal AFS conflicts have a small window Truly concurrent writes only Cache invalidation when someone closes For laptop, close could occur weeks before reconnect Page 22
Handling Disconnected Operations Could use a solution like NFS Server has primary copy Client has secondary copy If client can't access server, can't write Or could use an optimistic solution Assume no one else is going to write your file, so go ahead yourself Detect problems and fix as needed Page 23
The Coda Approach Essentially optimistic When connected, operates much like AFS When disconnected, client is allowed to update cached files Access control permitting But unlike AFS, can't propagate updates on file close After all, it's disconnected Instead, remember this failure until later Page 24
Ficus A more peer-oriented replicated file system A descendant of the Locus operating system Specifically designed for mobile computers Page 25
AFS, Coda, and Caching Like AFS, Coda client machines only cache files An AFS cache miss is just a performance penalty Get it from the server A Coda cache miss when disconnected is a disaster User can't access his file Page 26
Avoiding Disconnected Cache Misses Really requires thinking ahead Initially Coda required users to do it Maintain a list of files they wanted to be sure to always cache In case of disconnected operations Eventually went to a hoarding solution We'll discuss hoarding later Page 27
Coda Reintegration When a disconnected Coda client reconnects Tries to propagate updates occurring during disconnection to a server If no one else updated that file, just like a normal AFS update If someone else updated the file during disconnection, what then? Page 28
Coda and Conflicts Such update problems on Coda reintegration are conflicts Two (or more) users made concurrent writes to a file Original solution was that later update (mostly) lost Update on server wins Other update put in special conflict directory Owning user or sysadmin notified to take action Or not take action... Page 29
Later Coda Conflict Solutions Automated reconciliation of conflicts When possible User tools to help handle them when automation doesn t work Can you think of particularly problematic issues here? Page 30
The Locus Computing Model System composed of many personal workstations Connected by a local area network And perhaps a few shared server machines All machines have dedicated storage But provide the illusion of a single file system Shared by all! Page 31
The Ficus Computing Model Just like the Locus model, except... Some of the workstations are portable computers Which might disconnect from the network Taking their storage with them Page 32
Ficus Shares Some Problems With Coda Portable computers can only access local disks while disconnected Updates involving disconnected computers are complicated And can even cause conflicts Page 33
Ficus Has Some Unique Problems, Too What happens to the shared file system when the portable's storage goes away? The files stored there are part of everyone's file system And, unfortunately, other machines may still need them Page 34
Handling the Problems Rely on replication Replicate the files that the portable needs while disconnected Replicate the files it's taking away when it departs So everyone else can still see them Page 35
Updates in Ficus Ficus uses peer replication No primary copy All replicas are equally good So if access permissions allow update And you can get to a replica You can update it How does Ficus handle that? Page 36
The Easy Case All replicas are present and available Allow update to one of the replicas Make a reasonable effort to propagate the update to all others But not synchronously On a good day, this works and everything is identical Page 37
The Hard Case The best effort to propagate an update from the original replica fails Perhaps because you can't reach one or more other replicas Perhaps because the portable computers holding them are elsewhere Page 38
Handling Updates With primary copies (primary and secondary): if they're the same, no problem; if they're different, the primary always wins, since the only possible reason is that the secondary is old With peer copies: if they're the same, still no problem But what if they're different? Page 39
What Are the Possibilities? 1. One is old and the other is updated How do we tell which is the new one? Or... 2. Both have been updated Now what? Page 40
More Complicated If >2 Replicas Here's just one example [Diagram: updates at replica 1 propagate to replica 2 and replica 3 at different times; somehow you figure out replica 2 is newer than replica 3] What's the right thing to do? And how do you figure that out? Page 41
Reconciliation Always an option in Locus and Ficus Much more important with disconnected operation When a replica notices a previously unavailable replica, it checks for missing updates and they trade information about them The asynchronous operation that ensures eventual update propagation Page 42
Gossiping in Ficus Primary copy replication and systems like Coda always propagate updates the same way Other replicas give their updates to a single site And get new updates from that site Peer systems like Ficus have another option Any peer with later updates can pass them to you Even if they aren't the primary and didn't create the updates In file systems, this is called gossiping Page 43
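Gossiping can be sketched as a simple pull of newer versions from any reachable peer. This toy uses a single scalar version per file rather than Ficus's actual mechanism, and the names (`GossipPeer`, `local_write`, `gossip_with`) are hypothetical; the point is that peer `c` learns an update from peer `b`, which never created it.

```python
class GossipPeer:
    """Toy peer that pulls newer file versions from any other peer."""

    def __init__(self, name):
        self.name = name
        self.files = {}   # path -> (version, value)

    def local_write(self, path, value):
        version = self.files.get(path, (0, None))[0] + 1
        self.files[path] = (version, value)

    def gossip_with(self, other):
        # Pull any strictly newer versions the other peer holds.
        for path, (version, value) in other.files.items():
            if self.files.get(path, (0, None))[0] < version:
                self.files[path] = (version, value)


a, b, c = GossipPeer("a"), GossipPeer("b"), GossipPeer("c")
a.local_write("/doc", "v1")
b.gossip_with(a)     # b learns the update from a, the writer...
c.gossip_with(b)     # ...and c learns it from b, which didn't create it
assert c.files["/doc"] == (1, "v1")
```

A single counter per file can tell old from new but cannot detect concurrent writes at different peers; that is what version vectors add.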
How Does Ficus Track Updates? Ficus uses version vectors An applied type of vector clock These clocks keep one vector element per replica With a vector clock stored at each replica Clocks tick only on updates Page 44
Version Vector Use Example [Diagram: replica 1 updates, ticking its vector to <1,0,0>, and the update propagates to replica 3, which then also holds <1,0,0>; replica 2, unreachable, still holds <0,0,0>] When replica 2 comes back, its version will be recognized as old Compared to either replica 1 or replica 3 Page 45
Version Vectors and Conflicts Ficus recognizes concurrent (and thus conflicting) writes Using version vectors If neither of two version vectors dominates the other, there s a conflict Implying concurrent write Typically detected during reconciliation Page 46
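The dominance test is small enough to write out. A sketch, assuming fixed-length tuples with one counter per replica; vector v1 dominates v2 if it is greater than or equal everywhere and strictly greater somewhere, and a conflict is exactly the case where neither vector dominates the other.

```python
def dominates(v1, v2):
    """True if v1 >= v2 elementwise and v1 != v2."""
    return all(a >= b for a, b in zip(v1, v2)) and v1 != v2


def classify(v1, v2):
    """Compare two version vectors for the same file."""
    if v1 == v2:
        return "identical"
    if dominates(v1, v2):
        return "v1 newer"
    if dominates(v2, v1):
        return "v2 newer"
    return "conflict"   # concurrent writes: neither dominates


# A stale replica is strictly dominated by an updated one.
assert classify((1, 0, 0), (0, 0, 0)) == "v1 newer"
# Independent updates at replica 1 and replica 2: neither dominates.
assert classify((1, 0, 0), (0, 1, 0)) == "conflict"
```

The "v1 newer" and "v2 newer" cases are the easy reconciliation outcomes (just copy the newer data); only the "conflict" case needs the resolution machinery on the following slides.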
For Example [Diagram: replica 1 holds version vector <0,0,1> and replica 2 holds <0,1,0>; neither dominates the other, so: CONFLICT!] Page 47
Now What? Conflicting files represent concurrent writes There is no correct order to apply them Use other techniques to resolve the conflicts Creating a semantically correct and/or acceptable version Page 48
Example Conflict Resolution Identical conflicts Same update made in two different places Easy to resolve Assuming updates in question are idempotent Conflicts involving append-only files Merge the appends Most Unix directory conflicts are automatically resolvable Page 49
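The append-only case from the slide has a mechanical resolution: keep the common prefix and take both replicas' new appends. A sketch, assuming each replica only appended to the shared base and the relative order of the two replicas' appends doesn't matter semantically (the function name is hypothetical).

```python
def merge_append_only(base, replica_a, replica_b):
    """Merge two conflicting copies of an append-only file.

    Assumes both replicas start with `base` unchanged and only
    appended records after it.
    """
    assert replica_a[:len(base)] == base, "replica_a rewrote the base"
    assert replica_b[:len(base)] == base, "replica_b rewrote the base"
    return base + replica_a[len(base):] + replica_b[len(base):]


base = ["rec1"]
a = ["rec1", "rec2"]            # appended at one replica
b = ["rec1", "rec3", "rec4"]    # appended at another replica
assert merge_append_only(base, a, b) == ["rec1", "rec2", "rec3", "rec4"]
```

Unix directories resolve similarly because directory operations are mostly independent entry insertions and deletions, which is why the slide says most directory conflicts are automatically resolvable.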
Ficus Replication Granularity NFS replicates volumes Coda replicates individual files Ficus replicates volumes Later, selective replication of files within volumes added Page 50
Hoarding A portable machine off the network must operate off its own disk Only! So it better replicate the files it needs If you know/predict portable disconnection, pre-replicate those files That s called hoarding Page 51
Mechanics of Hoarding Mechanically easy if you replicate at file granularity E.g., Coda or Ficus with selective replication Simply replicate what you need Inefficient if you replicate at volume granularity Page 52
What Do You Hoard? Could be done manually Doesn t work out well Could replicate every file the portable ever touches Might overfill its disk Could use LRU Experience shows that fails oddly Page 53
What Does Work Well? You might think clustering Identify files that are used together If one of them recently used, hoard them all Basic approach in Seer Actually, LRU plus some sleazy tricks works equally well And is much cheaper Page 54
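An LRU-based hoard set, the baseline the slide compares against, fits in a few lines. This is a toy stand-in, not Seer and not the "LRU plus tricks" variant: keep the N most recently used files replicated on the portable's disk, evicting the least recently used when the set overflows.

```python
from collections import OrderedDict


class HoardSet:
    """Hoard the `capacity` most recently used files (plain LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()   # path -> None, most recent last

    def touch(self, path):
        # Any access moves the file to the most-recently-used position.
        self.files.pop(path, None)
        self.files[path] = None
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)   # evict least recently used

    def hoarded(self):
        return set(self.files)


h = HoardSet(capacity=2)
for path in ["/a", "/b", "/a", "/c"]:
    h.touch(path)
assert h.hoarded() == {"/a", "/c"}   # "/b" was least recently used
```

A clustering approach like Seer's would instead hoard whole groups of related files when any member is touched, at the cost of tracking which files are used together.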