Spanner: Google's Globally-Distributed Database*
Huu-Phuc Vo
August 03, 2013
*OSDI '12, James C. Corbett et al. (26 authors), Jay Lepreau Best Paper Award
Outline
- What is Spanner? Features & Example
- Structure of Spanner's Implementation
- Data model
- Version management
- Novel TrueTime API
  - External consistency
  - TrueTime API
  - Lock-free read-only transactions
- Performance
What is Spanner?
- scalable, multi-version database
- globally-distributed database
- synchronously-replicated database
Goal: make building rich apps easy at Google scale
Key ideas
- build a transactional storage system that is replicated globally
- manage cross-datacenter replicated data
- manage consistency globally
Features
- externally consistent reads & writes
- non-blocking reads in the past
- lock-free read-only transactions
- atomic schema changes
- SQL-like query language
Applications
- supports Google's advertising backend, replacing a sharded MySQL database
Key ideas
- relational data model with SQL & general-purpose transactions
- external consistency
  - transactions can be ordered by their commit times
  - commit times correspond to real-world notions of time
- Paxos-based replication, with the number of replicas & the distance to replicas controllable
- data partitioned across thousands of servers
Example: Global Social Network
- Spanner maintains a single system that looks like a single database
Example: Global Social Network
- sharded data (diagram: shards for US, Brazil, Spain, Russia)
Example: Global Social Network
- shard data in many ways across datacenters (diagram: user & posts shards for US, Brazil, Spain, Russia)
Example: Global Social Network
- replicate & store across datacenters (diagram: US replicas in San Francisco, Seattle, Arizona; Brazil replicas in Sao Paulo, Santiago; Spain replicas in London, Paris, Berlin, Madrid; Russia replicas in Moscow, Berlin)
Structure of Spanner's Implementation
- Spanserver software stack
- directories: the unit of data movement between Paxos groups
Spanserver software stack
- transaction manager (TM): supports distributed transactions
- participant leader: implemented using the TM
- lock table: implements concurrency control
- leader: a replica selected from among the replicas
- Paxos state machine: supports data replication
- tablet: an instance of a data structure similar to Bigtable's tablet abstraction
- Colossus: distributed file system (Google File System's successor)
Structure of Spanner's Implementation
Spanner deployment = universe
Zones
- set of zones = set of locations across which data can be distributed
- there can be more than one zone in a datacenter
Servers in a Spanner universe
- universe master: a console that shows all zones, for debugging
- placement driver: handles automated movement of data across zones on a timescale of minutes; communicates with spanservers
- location proxy: used by clients to locate the spanservers serving their data
Directory: unit of data placement
- directories move data between Paxos groups
- purposes of moving:
  - put frequently co-accessed directories into the same group
  - move a directory into a group that is close to its accessors
  - load balancing
- directories can move while client operations are ongoing
- a 50MB directory moves in a few seconds
Data Model
- schematized semi-relational tables, motivated by the popularity of Megastore
- SQL-like query language, motivated by the popularity of Dremel
- general-purpose transactions, whose lack was sorely felt in Bigtable
- not purely relational: rows must have names (primary keys)
- databases must be partitioned into hierarchies of tables
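As a toy illustration of this hierarchy (a Python sketch, not Spanner code; the Users/Albums tables follow the paper's running example), interleaving child rows under their parent by key prefix keeps each user's data contiguous, forming the directories discussed earlier:

```python
# Toy model of interleaved tables: child-table rows share their parent
# row's key prefix, so sorting by key stores each user's albums next to
# the user row -- one directory per user.

rows = {
    ("Users", 2): {"email": "bob@example.com"},
    ("Users", 1): {"email": "ann@example.com"},
    ("Users", 1, "Albums", 20): {"name": "work"},
    ("Users", 2, "Albums", 30): {"name": "pets"},
    ("Users", 1, "Albums", 10): {"name": "holiday"},
}

for key in sorted(rows):
    print(key, rows[key])
# ("Users", 1) and all ("Users", 1, "Albums", ...) rows come out together.
```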
Version Management
- timestamps order all write transactions
  - if T2 starts after T1's commit, then commit TS(T2) must be greater than commit TS(T1)
- version management
  - write transactions use strict two-phase locking (2PL)
  - strict 2PL: a transaction acquires locks at any time before its actions & releases them only when it commits
  - each transaction T is assigned a timestamp s
  - data written by T is timestamped with s
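A minimal single-process sketch of strict 2PL with versioned writes (my illustration; a counter stands in for the timestamp source): locks are held until commit and the timestamp is chosen while they are held, so a transaction starting after T1 commits necessarily receives a larger timestamp.

```python
import itertools
import threading

_ts = itertools.count(1)  # stand-in for a monotonic commit-timestamp source

class Store:
    def __init__(self):
        self.locks = {}   # key -> threading.Lock (the "lock table")
        self.data = {}    # key -> list of (timestamp, value) versions

    def lock(self, key):
        return self.locks.setdefault(key, threading.Lock())

class Txn:
    def __init__(self, store):
        self.store, self.held, self.writes = store, [], {}

    def write(self, key, value):
        lk = self.store.lock(key)
        if lk not in self.held:
            lk.acquire()          # strict 2PL: acquire before the action...
            self.held.append(lk)
        self.writes[key] = value  # buffered until commit

    def commit(self):
        s = next(_ts)             # timestamp s chosen while all locks are held
        for key, value in self.writes.items():
            self.store.data.setdefault(key, []).append((s, value))
        for lk in self.held:      # ...release everything only at commit,
            lk.release()          # so a later transaction gets a larger s
        return s
```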
Concurrency Control
- Spanner uses Multi-Paxos
  - each spanserver runs a single instance of Paxos for each "tablet" of data it stores
  - each tablet uses Paxos to agree on a "group leader"
  - all writes are done by the leader to the replicas
  - there are never concurrent outstanding writes to the same object within a replica group
  - as long as the leader is stable, only one round of communication between leader & replicas is needed to confirm writes
External Consistency
- transactions can be ordered by their commit times
- commit times correspond to real-world notions of time
Synchronizing Snapshots
- timestamp order = commit order
- ensure consistent clock values by using GPS + atomic clocks on some nodes
- network time sync protocols (nodes exchange times & adjust accordingly) still leave small differences between clocks
- TrueTime is defined to cope with these differences
External Consistency
Example (node = spanserver):
- suppose nodes N1/N2 run T1 & nodes N3/N4 run T2
- N1/N2's clocks run 30 seconds ahead of N3/N4's
- T1 commits at time T=0 on N1/N2 (which is time T=-30 on N3/N4)
- to an external observer, T2's commit time appears earlier than T1's
- if T1 & T2 both wrote data D, a recovering replica replaying in timestamp order may apply T1 last, even though T2 committed later in real time
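A tiny worked example of the anomaly, using the slide's 30-second skew (the 5-second real-time gap between the commits is an assumed value):

```python
# Illustrative numbers only: N1/N2's clocks run 30 s ahead of N3/N4's.
t1_commit_ts = 0.0           # T1 commits on N1/N2 at their local time 0
t2_commit_ts = -30.0 + 5.0   # T2 commits 5 s later in real time, but is
                             # stamped by N3/N4's slow clocks: -25

# Replay in timestamp order applies T2 before T1, so T1's write to D
# "wins" even though T2 committed later in real time.
assert t2_commit_ts < t1_commit_ts
```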
TrueTime
- TrueTime API methods estimate the accuracy of a node's clock
- global clock time with bounded uncertainty: TT.now() returns an interval [earliest, latest] guaranteed to contain absolute time
- ε: instantaneous error (half of the interval width)
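A minimal Python sketch of this interface (TT.now, TT.after & TT.before are the paper's method names; the fixed ε and the local-clock source are illustrative assumptions):

```python
import time
from dataclasses import dataclass

EPSILON = 0.004  # assumed instantaneous error bound (4 ms), for illustration

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    def now(self) -> TTInterval:
        # Spanner derives the bound from GPS/atomic-clock timemasters;
        # here the local clock plus a fixed epsilon stands in.
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)

    def after(self, t: float) -> bool:
        # True iff t has definitely passed.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True iff t has definitely not yet arrived.
        return self.now().latest < t
```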
TrueTime Example
How to ensure T1's commit time is before T2's?
- ensure that T2 doesn't commit until after t1.latest
- enforcing t1.latest < t2.earliest directly would require the transactions to know about each other
- instead, ensure that T1 holds its locks until t1.latest has passed; then T1 commits before T2 commits
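Reusing the TrueTime sketch above, commit wait can be illustrated as follows (my reconstruction of the rule: choose s = TT.now().latest, then hold locks until TT.after(s)):

```python
def commit_with_wait(tt: TrueTime, apply_writes, release_locks):
    s = tt.now().latest          # commit timestamp: top of the interval
    while not tt.after(s):       # commit wait: block until s has
        time.sleep(EPSILON / 2)  # definitely passed on every clock
    apply_writes(s)              # the commit only now becomes visible
    release_locks()
    return s
```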
TrueTime Architecture
- GPS & atomic-clock timemasters cross-check each other
- a timeslave daemon polls a variety of masters to reduce vulnerability to errors from any one master
- compute reference [earliest, latest] = now ± ε
Commit & 2-Phase Commit
(diagrams)
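Since the diagrams are lost, here is a rough Python sketch of the two-phase commit flow across Paxos group leaders (an illustration with assumed participant interfaces, building on the TrueTime sketch; the coordinator's timestamp rule follows the paper: s is at least every prepare timestamp and at least TT.now().latest, followed by commit wait):

```python
def two_phase_commit(tt: TrueTime, participants, writes):
    # Phase 1: each participant leader acquires locks & logs a prepare
    # record through its Paxos group, returning a prepare timestamp.
    prepare_ts = [p.prepare(writes) for p in participants]

    # The coordinator picks s >= every prepare timestamp & >= TT.now().latest.
    s = max(max(prepare_ts), tt.now().latest)

    # Commit wait, as in the previous sketch, before the commit is visible.
    while not tt.after(s):
        time.sleep(EPSILON / 2)

    # Phase 2: participants log the commit through Paxos, apply the
    # buffered writes at timestamp s & release their locks.
    for p in participants:
        p.commit(s)
    return s
```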
Paxos-based replication
- Spanner's Paxos implementation uses timed (10 second) leader leases to make leadership long-lived
- able to tolerate the failure/disconnection of nodes/datacenters
Idea
- ensure that all writes & reads go to a "quorum" of nodes
- quorum: a simple majority; with N nodes, read & write to floor(N/2) + 1 nodes
- in reality, send write & read requests to all nodes, but wait for only a quorum of them to complete
- (basics of quorum-based replication were covered last time)
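A small sketch of the quorum rule (my illustration, not Spanner code; replicas are assumed to expose an append(record) method returning True on acknowledgment):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, record):
    needed = len(replicas) // 2 + 1           # majority: floor(N/2) + 1
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.append, record) for r in replicas]
    acks = 0
    for f in as_completed(futures):           # wait for a quorum only
        if f.result():
            acks += 1
        if acks >= needed:
            pool.shutdown(wait=False)         # stragglers finish in background
            return True
    pool.shutdown(wait=False)
    return False
```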
Performance
Micro-benchmarks for Spanner
- a single spanserver per zone
- enough load to saturate the spanserver CPU
- all data served out of memory, to measure the overhead of the Spanner stack
- about 2,500 transactional writes/s per CPU
- about 15,000 transactional reads/s per CPU
Summary
Spanner blends & develops
- a semi-relational interface, transactions & a SQL-based query language
- scalability, automatic sharding, failure resistance, data replication, consistency & wide distribution
TrueTime
- another key piece of Spanner's functionality
- enables accurate time synchronization in a distributed system by expressing the inaccuracy of time explicitly in the time API
Thank you