GlobalFS: A Strongly Consistent Multi-Site Filesystem Leandro Pacheco Raluca Halalai Valerio Schiavoni Fernando Pedone Etienne Rivière Pascal Felber RainbowFS Workshop May 3rd, 2017
Distributed applications GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications? GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications Distributed Storage GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications Distributed Storage SQL Databases NoSQL Databases Key-value storage Coordination Systems Caches File Systems GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Distributed applications Distributed Storage SQL Databases NoSQL Databases Key-value storage Coordination Systems Caches File Systems Easy interoperability for existing aplications GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 2
Global infrastructure Amazon s AWS global infrastructure GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 3
CAP theorem Weak Consistency Lower latency Higher availability Possibly incorrect/unexpected results Strong Consistency Clear semantics and guarantees Easier to reason about Block instead of providing incorrect results GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 4
What is GlobalFS? Geographically distributed filesystem Familiar interface (POSIX) Strong consistency Fault-tolerance through replication Flexible performance through locality GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 5
Overall design Separate data and metadata Partial replication Metadata protocol exploiting atomic multicast Causal reads GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 6
Separate data and metadata Immutable data Variable sized blobs Metadata Controls file contents, properties and filesystem structure Metadata refers to data blobs 1 2 3 4 GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 7
Partial replication Immutable data is simple to replicate consistently Metadata is partitioned between replica groups (i.e., partitions) GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 8
Partial replication US EU SA GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 9
Partial replication US EU / www bin etc home SA alice bob mark GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 10
Partial replication US EU / www bin etc home SA alice bob mark US SA EU GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 11
Partial replication US EU Global Replication / www bin etc home SA alice bob mark US SA EU GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 12
Partial replication US EU Global Replication / www bin etc home SA alice bob mark Local US multicast SA EU - fast updates - local or remote reads GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 13
US Partial replication Global multicast (global replication) - costly updates - Global fast local Replication reads / EU www bin etc home SA alice bob mark Local US multicast SA EU - fast updates - local or remote reads GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 14
Partial ordering GlobalFS exploits atomic multicast Atomic delivery to groups of processes Partial ordering: messages for different groups don t have to be ordered betweem themselves Partial ordering is critical for scalability GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 15
Architecture Metadata replicas Send read or update commands Atomic multicast Application Client (FUSE) Data store Insert or fetch immutable data GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 16
Consistent update operations Step 1 Write data blobs to data store Step 2 Issue a metadata update GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 17
Consistent update operations Step 1 Write data blobs to data store Step 2 Issue a metadata update Req Single-partition Reply Req Uncoordinated multi-partition Reply Req Coordinated multi-partition Reply G 1 G 1 G 1 G 2 G 2 G 2 write to file in G 1 write to file in {G 1,G 2 } move file from G 1 to G 2 Atomic Multicast Execution GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 17
Causal read operations Causally related updates are seen in the same order e.g., operations done by the same client GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 18
Causal read operations Causally related updates are seen in the same order e.g., operations done by the same client Client A Creates an image cat.jpg Modifies a page pets.html to include the image cat.jpg GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 18
Causal read operations Causally related updates are seen in the same order e.g., operations done by the same client Client A Creates an image cat.jpg Modifies a page pets.html to include the image cat.jpg Client B Opens the pets.html page and finds a broken image reference Where is the cat? GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 18
Causal read operations Step 1 Contact a metadata replica for a list of blob ids Step 2 Get the data from the data store Approach inspired by vector clocks Vector is composed of one counter per replica group GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 19
Evaluation Complete prototype in Java https://github.com/pacheco/globalfs Filesystem in Userspace (FUSE) URingPaxos for atomic multicast Global deployment using Amazon EC2 GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 20
Maximum throughput by operation GlobalFS throughput Operations/sec 60000 50000 40000 30000 20000 10000 0 read 1KB GlobalFS CalvinFS Locality local write 1KB local create 1KB 1800 1600 1400 1200 1000 800 600 400 200 0 glob. write 1KB glob. create 1KB 3 region deployment US west, US east and Europe GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 21
Geographical scalability 1 Region 3 Regions 6 Regions 9 Regions Geographical Scalability 1 0.8 0.6 0.4 0.2 16081 ops 6882 ops 3072 ops Ideal read 1KB create write 1KB Normalized throughput per region as more regions are added 9 regions uses all EC2 regions available at the time GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 22
GlobalFS: Summary Strong consistency at global scale Simple and familiar API (POSIX) Flexible performance through partial replication and locality Cheap causal read operations GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 23
GlobalFS: Summary Strong consistency at global scale Simple and familiar API (POSIX) Flexible performance through partial replication and locality Cheap causal read operations Thank you! Leandro Pacheco pachecol@usi.ch GlobalFS: A Strongly Consistent Multi-Site Filesystem - Leandro Pacheco 23