# The CAP theorem: The bad, the good and the ugly

Michael Pfeiffer, Advanced Networking Technologies, FG Telematik/Rechnernetze, TU Ilmenau


1. The bad: The CAP theorem's proof
2. The good: A different perspective
3. The ugly: CAP and SDN

## Section 1. The bad: The CAP theorem's proof

### The CAP theorem

Central proposition: In a distributed system, it is impossible to provide Consistency, Availability, and Partition tolerance all at once, i.e. at least one of them has to be sacrificed.

- Suggested by Brewer in 1999/2000, proof by Gilbert and Lynch in 2002 [1]
- In many networks, the absence of partitions cannot be guaranteed (firmware bugs, administrative errors, ...), so the choice is between CP and AP

### Formal model

- Network partition: All messages between nodes in different components are lost.
- Availability (available data objects): Every request received by a non-failing node must result in a response. There is no time bound, but a network partition can last forever, so this is a strong availability requirement.
- Consistency (atomic data objects): There is a total order on all operations such that each operation looks as if it were completed at a single instant. Equivalently: Requests must act as if they were processed on a single node, one at a time.

### Proof

Proof by contradiction. Assume there is a CAP system with two nodes G1 and G2 that are separated by a network partition, and two clients C1 and C2:

1. C1 writes x ← 42 at G1.
2. G1 answers: success!
3. C2 reads x at G2.
4. G2 must answer, but it cannot have learned about the write: ???
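The four steps of the proof can be replayed in a few lines of Python. This is only an illustrative sketch: the `Node` class, its `partitioned` flag and the initial value `0` for `x` are assumptions made for the example and are not part of the formal model in [1].

```python
class Node:
    """A replica that always answers, i.e. it is available by construction."""

    def __init__(self, name, initial=0):
        self.name = name
        self.x = initial          # assumed initial value of x
        self.peers = []
        self.partitioned = False  # True: all messages to peers are lost

    def write(self, value):
        self.x = value
        if not self.partitioned:
            for peer in self.peers:
                peer.x = value    # replication message
        return "success"

    def read(self):
        return self.x


g1, g2 = Node("G1"), Node("G2")
g1.peers, g2.peers = [g2], [g1]
g1.partitioned = g2.partitioned = True  # the network partition

ack = g1.write(42)   # steps 1 and 2: C1 writes at G1 and gets "success"
answer = g2.read()   # steps 3 and 4: C2 reads at G2
print(ack, answer)
```

G2 answers with the old value, so the system stayed available but is no longer consistent. The only way for G2 to avoid the stale answer would be to refuse or delay the response, which gives up availability instead.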

### Classical strategies for CP and AP

CP systems: Delay the acknowledgement of a write operation until the new value has been propagated to all nodes. Examples:

- Relational database with synchronous replication
- 2PCP

AP systems: Answer with the (possibly stale) last known value. Examples:

- Slave DNS servers
- NoSQL databases
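The two strategies can be contrasted in a minimal sketch. All names here (`Replica`, `Unavailable`, `cp_write`, `ap_write`, `ap_read`) are hypothetical, and the unbounded delay of a CP write is modelled as an exception; a real system would rather block or time out.

```python
class Unavailable(Exception):
    """Raised instead of blocking forever (simplifying assumption)."""


class Replica:
    def __init__(self):
        self.value = None
        self.reachable = True


def cp_write(replicas, value):
    # CP: acknowledge only after every replica has the new value.
    if not all(r.reachable for r in replicas):
        raise Unavailable("not all replicas reachable, write not acknowledged")
    for r in replicas:
        r.value = value
    return "ack"


def ap_write(replicas, value):
    # AP: update whoever can be reached and acknowledge immediately.
    for r in replicas:
        if r.reachable:
            r.value = value
    return "ack"


def ap_read(replica):
    # AP: answer with the (possibly stale) last known value.
    return replica.value


replicas = [Replica(), Replica()]
cp_write(replicas, 1)            # no partition: both strategies work
replicas[1].reachable = False    # partition cuts off the second replica
ap_write(replicas, 2)            # still acknowledged
stale = ap_read(replicas[1])     # second replica still answers with 1
```

During the partition, `cp_write(replicas, 2)` would raise `Unavailable`, whereas the AP path keeps answering at the price of a stale read.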

## Section 2. The good: A different perspective

### A different perspective (by Brewer [2])

The partition decision: If a partition occurs during the processing of an operation, each node can decide to

- cancel the operation (favour C over A), or
- proceed, but risk inconsistencies (favour A over C).

But: It is possible to decide differently every time, based on the circumstances.

This means:

- No partition: no problem
- But during a partition, all systems must decide eventually
- Permanently retrying is in fact a choice for C over A
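A per-operation partition decision could look like the following sketch. The operation kinds, the `limit` parameter and the rule set are purely hypothetical and only illustrate that the choice between C and A can differ from one operation to the next.

```python
from enum import Enum


class Decision(Enum):
    CANCEL = "favour C over A"
    PROCEED = "favour A over C"


def partition_decision(op_kind, amount=0, limit=100):
    """Decide what to do with one operation while partitioned.

    The rules are illustrative assumptions, not taken from [2].
    """
    if op_kind == "read":
        return Decision.PROCEED      # may return stale data
    if op_kind == "withdraw" and amount <= limit:
        return Decision.PROCEED      # bounded risk, corrected later
    return Decision.CANCEL           # everything else waits for the partition to heal


print(partition_decision("read"))
print(partition_decision("withdraw", amount=50))
print(partition_decision("withdraw", amount=5000))
```

The same node cancels one withdrawal and accepts another, so it is neither a pure CP nor a pure AP system.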

### Mitigation strategies

- Generally: To keep consistency, some operations must be forbidden during a partition
  - Others are okay (e.g. read queries)
- Often: Guarantee consistency to a certain degree
  - Example: Read-your-own-writes consistency
  - Facebook: A user's timeline is stored at a master copy and cached at slaves
    - Usually users see (potentially stale) copies at slaves
    - But when they post something, their reads are redirected to the respective master for a certain time
- Different strategies are possible on different levels, e.g. inside a single site and between sites (latency!)
- Often: Progress is possible in one component; multiple consensus algorithms are available (e.g. dynamic voting)
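Read-your-own-writes with a master copy and a cached slave copy can be sketched as follows. `TimelineStore`, `pin_seconds` and the injectable `clock` are assumptions made for this example; it does not describe Facebook's actual implementation.

```python
import time


class TimelineStore:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.master = {}         # authoritative timelines
        self.slave = {}          # cached, possibly stale copies
        self.pinned_until = {}   # user -> time until reads go to the master
        self.pin_seconds = pin_seconds
        self.clock = clock

    def post(self, user, text):
        self.master.setdefault(user, []).append(text)
        self.pinned_until[user] = self.clock() + self.pin_seconds

    def replicate(self):
        # Asynchronous replication, triggered by hand in this sketch.
        self.slave = {u: list(posts) for u, posts in self.master.items()}

    def read(self, reader, owner):
        # A reader who posted recently is redirected to the master.
        if self.clock() < self.pinned_until.get(reader, float("-inf")):
            return list(self.master.get(owner, []))
        return list(self.slave.get(owner, []))


now = [0.0]                                  # fake clock for the demo
store = TimelineStore(pin_seconds=5.0, clock=lambda: now[0])
store.post("alice", "hello")
alice_view = store.read("alice", "alice")    # served by the master
bob_view = store.read("bob", "alice")        # served by the stale slave
now[0] = 10.0                                # pin window has expired
late_view = store.read("alice", "alice")     # slave again, not yet replicated
store.replicate()
final_view = store.read("bob", "alice")
```

The guarantee only holds "for a certain time": if replication takes longer than the pin window, even the author falls back to a stale copy, as `late_view` shows.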

### Partition recovery

What if we still want to continue service during a partition?

1. Detect partition
2. Enter a special partition mode
3. Continue service
4. After partition: Recovery

The small problem: Partition detection

- Nodes can disagree whether a partition exists
- Consensus about partition state is not possible
- Nodes may enter the partition mode at different times
- A distributed commit protocol is required (2PCP, Paxos, ...)

### The big problem: Partition recovery

A (very) simple example:

- Users register on a web site
- Every user is assigned a unique ID (SQL: serial, auto_increment)
- During partition: The same ID might be assigned twice
- Recovery: Recreate uniqueness of IDs

Partition recovery: It's about invariants

- In a consistent system, invariants are guaranteed
  - Even when the system's designer does not know them
- In an available system, invariants must be explicitly restored after a partition
  - The system's designer must know the invariants and how to restore them
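Restoring the uniqueness invariant for the user ID example might look like this sketch. The function `merge_users` and its policy (side A keeps its IDs, colliding users from side B get fresh ones, equal names under the same ID are treated as the same pre-partition record) are assumptions chosen for illustration.

```python
def merge_users(side_a, side_b):
    """Merge two {user_id: name} tables after a partition.

    Returns the merged table and a remapping {old_id: new_id} for
    users of side B whose ID collided with a different user of side A.
    """
    merged = dict(side_a)
    remap = {}
    next_id = max(list(side_a) + list(side_b), default=0) + 1
    for uid, name in sorted(side_b.items()):
        if uid in merged and merged[uid] != name:
            remap[uid] = next_id      # ID was assigned twice
            merged[next_id] = name
            next_id += 1
        else:
            merged[uid] = name
    return merged, remap


side_a = {1: "ann", 2: "bob", 3: "carol"}
side_b = {1: "ann", 2: "bob", 3: "dave", 4: "erin"}
merged, remap = merge_users(side_a, side_b)
print(merged)
print(remap)
```

The remapping is the hard part in practice: every record on side B that refers to the old ID has to be rewritten as well, which is only possible if the designer knows all places where the invariant matters.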

### CRDTs

- Commutative/Conflict-free Replicated Data Types (CRDTs) are data types that provably converge
- Example: Google Docs serialises edits into a series of insert and delete operations

The example sentence, two concurrent edits and the merged result:

- Original: "On Monday, the ANT lecture is at 13:00."
- First edit: "On Thursday, the ANT lecture is at 13:00."
- Second edit: "On Monday, the ANT lecture is at 17:00."
- Merged: "On Thursday, the ANT lecture is at 17:00."

Application-specific invariants are not ensured automatically.
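To illustrate what "provably converge" means, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest state-based CRDTs. It is a swapped-in textbook example, not the mechanism behind the document-editing example above; the class and replica names are assumptions.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merge takes the maximum."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        current = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = current + amount

    def merge(self, other):
        # Element-wise maximum is commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("A"), GCounter("B")
a.increment()
a.increment()        # two increments on one side of the partition
b.increment()        # one increment on the other side
a.merge(b)
b.merge(a)
a.merge(b)           # merging again changes nothing
print(a.value(), b.value())
```

Convergence only says that both replicas end up with the same state. Whether that state satisfies an application invariant (such as "the lecture is on the right day at the right time") is a separate question.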

### More on partition recovery

- Recovery is tedious and error prone
  - Brewer: Similar to going from single-threaded to multi-threaded programming
- Sometimes the only possibility: Ask the user (e.g. git merge)
- Balance between availability and consistency:
  - ATMs: When partitioned, limit withdrawal to amount X
  - Invariant: Not more withdrawals than allowed
  - Manual correction afterwards
- Usual tools:
  - Version vectors (vector clocks)
  - Logging, replay and rollback
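Version vectors are used during recovery to tell whether one replica's state supersedes another's or whether both were changed concurrently. A minimal sketch of the comparison, with vectors represented as plain dictionaries (an assumption for this example; missing entries count as 0):

```python
def compare(v1, v2):
    """Compare two version vectors given as {replica_id: counter}."""
    keys = set(v1) | set(v2)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
    if le and ge:
        return "equal"
    if le:
        return "before"       # v2 has seen everything v1 has
    if ge:
        return "after"        # v1 has seen everything v2 has
    return "concurrent"       # conflict: needs merging or user input


print(compare({"A": 1}, {"A": 2}))
print(compare({"A": 2}, {"A": 1, "B": 1}))
```

Only the "concurrent" case requires real recovery work; in the other cases the newer state can simply replace the older one.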

## Section 3. The ugly: CAP and SDN

### SDN and CAP

- So far, we have talked about distributed systems on the application layer (databases, web services, ...)
- SDN is much more basic (layer 2/3)
- Network functionality is essential, so pure CP is not really an option
- AP means partition recovery is required

### SDN and partition recovery

- Possible without the network up and running?
  - Beware of dependency loops...
- Is falling back to non-SDN networking possible?
  - Even if SDN has been used to replace features like VLANs?
- Relying on user input is rather unrealistic...
- Is it possible to figure out all the invariants?
- Most SDN publications ignore the issue...
- BGP does not stabilise in all cases [3]

### Wrapping up

1. The CAP theorem is proven and holds.
2. Do not think about CP or AP systems, but about the partition decision.
3. There are many possibilities to fine-tune the balance between consistency and availability, and to recover from partitions.
4. But systems tend to become very complex.
5. Can we stomach this amount of complexity for building services as basic as network connectivity?

### References

[1] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. In: ACM SIGACT News 33 (2), June 2002.

[2] Eric Brewer. CAP twelve years later: How the rules have changed. In: Computer 45 (2), Feb. 2012.

[3] Timothy G. Griffin and Gordon Wilfong. An analysis of BGP convergence properties. In: ACM SIGCOMM Computer Communication Review 29 (4), Oct. 1999.

