Corbett et al., Spanner: Google's Globally-Distributed Database MIMUW 2017-01-11
- ACID transactions
- SQL queries
- Semi-relational data model
- Lock-free distributed read-only transactions
- Global scale
- Externally consistent
Consistency matters
- Unfriend untrustworthy person X
- Post: "My government is repressive..."
External consistency
- Linearisability: if T1 commits before T2 starts, then T1's commit timestamp is smaller than T2's
- The first system to provide this guarantee at global scale
- Allows:
  - consistent reads in the past
  - consistent backups
  - consistent MapReduce executions
  - atomic schema updates
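The linearisability condition above can be checked mechanically on a recorded history. The following is a minimal sketch, assuming each transaction is described by its absolute start time, absolute commit time, and assigned commit timestamp; the `Txn` type and the function name are hypothetical and only serve the illustration.

```python
from itertools import permutations
from typing import NamedTuple


class Txn(NamedTuple):
    start: float     # absolute time at which the transaction started
    commit: float    # absolute time at which the transaction committed
    timestamp: int   # commit timestamp assigned by the system


def is_externally_consistent(txns):
    """Check: if T1 commits before T2 starts, T1's timestamp is smaller than T2's."""
    return all(
        t1.timestamp < t2.timestamp
        for t1, t2 in permutations(txns, 2)
        if t1.commit < t2.start  # only non-overlapping pairs are constrained
    )
```

Transactions that overlap in real time are not constrained by the condition, so the check only looks at pairs where one commit strictly precedes the other start.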
Organisation
- A universe consists of zones
- Each zone has:
  - a zonemaster
  - spanservers that serve data to clients
  - location proxies
- Global:
  - universe master
  - placement driver responsible for data transfer across zones
- Bucketing abstraction: directories
Spanserver
- Tablets
  - (key, timestamp) → string
  - stored on Colossus as B-trees and a write-ahead log (WAL)
- Paxos state machine
- Leader
  - long-lived leader leases
  - lock table
  - transaction manager
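The tablet mapping from (key, timestamp) to string behaves like a multi-version key-value store. Below is a toy in-memory sketch of that idea, assuming integer timestamps; the `Tablet` class and its methods are hypothetical names, and nothing here reflects how the data is actually laid out on Colossus.

```python
import bisect


class Tablet:
    """Toy multi-version map: (key, timestamp) -> string."""

    def __init__(self):
        self._timestamps = {}  # key -> sorted list of write timestamps
        self._values = {}      # key -> values, parallel to _timestamps[key]

    def write(self, key, timestamp, value):
        ts_list = self._timestamps.setdefault(key, [])
        val_list = self._values.setdefault(key, [])
        i = bisect.bisect_right(ts_list, timestamp)
        ts_list.insert(i, timestamp)
        val_list.insert(i, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, else None."""
        ts_list = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts_list, timestamp)
        return self._values[key][i - 1] if i else None
```

Keeping old versions around is what makes reads in the past possible: a read at timestamp t simply ignores every version newer than t.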
TrueTime
- Idea: expose clock uncertainty
- Time masters: GPS or atomic clocks (Armageddon masters)
- Timeslave daemon polls a variety of masters
- Marzullo's algorithm used to detect liars
- Eviction of malfunctioning masters and clients
- Assumed upper bound on clock drift: 200 µs/s
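To make "expose clock uncertainty" concrete, here is a small sketch of a clock that returns an interval instead of a single value. It assumes the uncertainty is a fixed error right after synchronisation plus the 200 µs/s drift bound accumulated since then. The class name, the `sync_error` parameter and its 1 ms default are assumptions made for illustration; only `now()`, `after()` and `before()` mirror the TrueTime operations.

```python
from dataclasses import dataclass

DRIFT_RATE = 200e-6  # assumed worst-case drift: 200 microseconds per second


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSketch:
    def __init__(self, clock, sync_error=0.001):
        self._clock = clock            # callable returning local time in seconds
        self._sync_error = sync_error  # hypothetical error right after a sync
        self._last_sync = clock()

    def sync(self):
        """Pretend we just polled the time masters."""
        self._last_sync = self._clock()

    def now(self):
        t = self._clock()
        # Uncertainty grows linearly with the time elapsed since the last sync.
        eps = self._sync_error + DRIFT_RATE * (t - self._last_sync)
        return TTInterval(t - eps, t + eps)

    def after(self, t):
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True if t has definitely not arrived yet."""
        return self.now().latest < t
```

With these illustrative numbers, 30 seconds without a sync widens the uncertainty from 1 ms to 7 ms on each side of the local clock reading.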
Transactions
Operation      Concurrency control  Replica required
RW trans.      pessimistic          leader
RO trans.      lock-free            leader (timestamp), any
Snapshot read  lock-free            any
RW transactions
- Two-phase locking; timestamps are assigned while all locks are held
- Disjoint leader lease intervals
- Start: the coordinator leader assigns a timestamp s ≥ TT.now().latest, computed after receiving the commit request, and greater than all prepare timestamps previously issued
- Commit wait: clients cannot see any data committed by the transaction until TT.after(s) is true
- Wound-wait to avoid deadlocks
- Client drives two-phase commit using the identity of the coordinator
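The start and commit-wait rules above can be sketched with a simulated clock. This is an illustration under stated assumptions, not the actual implementation: `FakeTrueTime` is a hypothetical stand-in with a fixed uncertainty `eps`, all times are integer milliseconds, and the helper names are invented for the example.

```python
class FakeTrueTime:
    """Hypothetical simulated clock with a fixed uncertainty `eps` (integer ms)."""

    def __init__(self, start, eps):
        self.t = start
        self.eps = eps

    def now(self):
        return (self.t - self.eps, self.t + self.eps)  # (earliest, latest)

    def after(self, ts):
        return self.now()[0] > ts

    def advance(self, dt):
        self.t += dt


def choose_commit_timestamp(tt, prepare_timestamps, last_assigned):
    """Pick s >= TT.now().latest, greater than all prepare timestamps
    and greater than any timestamp this leader assigned earlier."""
    _, latest = tt.now()
    return max(latest, last_assigned + 1, *(p + 1 for p in prepare_timestamps))


def commit_wait(tt, s, tick=1):
    """Advance the simulated clock until TT.after(s) holds; return the wait."""
    waited = 0
    while not tt.after(s):
        tt.advance(tick)  # a real server would sleep here instead
        waited += tick
    return waited
```

With `eps = 4` and no later prepare timestamps, the wait comes out at 9 ms, which is roughly twice the uncertainty: the price paid for external consistency.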
Snapshot reads
- Safe time: maximum timestamp at which the replica is up to date
- Minimum of:
  - timestamp of the highest-applied Paxos write
  - prepare timestamps of prepared (but not committed) transactions
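The safe-time rule can be written as a short function. This is a minimal sketch assuming integer timestamps; the function names are hypothetical. One detail goes slightly beyond the summary above: the sketch subtracts one from the smallest prepare timestamp, because a prepared transaction may still commit at exactly its prepare timestamp, so a read at that timestamp is not yet safe.

```python
def safe_time(highest_applied_paxos_ts, prepared_ts):
    """Largest timestamp at which this replica can serve reads.

    highest_applied_paxos_ts: timestamp of the highest-applied Paxos write
    prepared_ts: prepare timestamps of prepared but uncommitted transactions
    """
    if not prepared_ts:
        return highest_applied_paxos_ts
    # A prepared transaction will commit at or after its prepare timestamp,
    # so everything strictly below the smallest prepare timestamp is stable.
    return min(highest_applied_paxos_ts, min(prepared_ts) - 1)


def can_serve_read(read_ts, highest_applied_paxos_ts, prepared_ts):
    """A replica may answer a snapshot read at read_ts only if read_ts <= safe time."""
    return read_ts <= safe_time(highest_applied_paxos_ts, prepared_ts)
```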
RO transactions
- A timestamp needs to be assigned
- Scope expression required to negotiate timestamp between all Paxos groups involved
- Either TT.now().latest...
- ... or the timestamp of the last committed write at a Paxos group
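The two options for the read timestamp can be sketched as a simple decision. This is a hedged illustration only: it assumes the last-committed-write timestamp is used when the scope covers a single Paxos group with no prepared transactions, and TT.now().latest otherwise. The function and parameter names are hypothetical.

```python
def choose_read_timestamp(tt_latest, last_commit_ts_by_group, has_prepared_txns=False):
    """Pick a timestamp for a read-only transaction.

    tt_latest: value of TT.now().latest when the transaction starts
    last_commit_ts_by_group: {group name: timestamp of its last committed write}
    has_prepared_txns: whether any group in scope has prepared transactions
    """
    if len(last_commit_ts_by_group) == 1 and not has_prepared_txns:
        # Single group, nothing in flight: reading at the last committed
        # write avoids waiting for safe time to catch up.
        (last_ts,) = last_commit_ts_by_group.values()
        return last_ts
    # Several groups (or prepared transactions): fall back to TT.now().latest,
    # which avoids a negotiation round but may delay the read at the replicas.
    return tt_latest
```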
Q&A