SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

1 SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi*, Jeffrey F. Lukman, and Haryadi S. Gunawi*

2 Internet Services

3 Internet Services, Cloud Systems

4 Reliability

5 Deep Bug Example
ZooKeeper (synchronization service) Issue #335
1. Nodes A, B, C start (w/ latest txid: 10)
2. B becomes leader
3. B crashes
4. C becomes leader
5. C commits new txid-value pair (11, X)
6. A crashes before committing the new txid
7. C loses quorum and C crashes
8. A and B are back online after C crashes
9. A becomes leader
10. A commits new txid-value pair (11, Y)
11. C is back online after A's new tx commit
12. C announces (11, X) to B
13. B replies with a diff starting with tx
Inconsistency: A has (11, Y), C has (11, X)
PERMANENT INCONSISTENT REPLICA

6 Deep Bug Characteristics
ZooKeeper (synchronization service) Issue #335
Specific order: the same 13-step sequence as in the Deep Bug Example (slide 5), ending with the inconsistency where A has (11, Y) and C has (11, X)
The sequence combines:
1. Out-of-order messages
2. Multiple crashes
3. Multiple reboots
HAPPEN IN ANY ORDER

7 Study of Deep Bugs
[Figure: one plot each for ZooKeeper, Hadoop, and Cassandra. x-axis is bug number; y-axis is number of crashes and reboots (0 to 6).]
ZooKeeper bugs: 335, 569, 769, 790, 791, 975, 1075, 1118, 1154
Hadoop bugs: 913, 3780, 3846, 4252, 4425, 4607, 4748, 4832, 4833, 4890, 5000, 5169, 5198, 5358, 5405, 5409, 5476
Cassandra bugs: include 6503
SEVERE IMPLICATIONS: INCONSISTENT REPLICAS, DATA LOSS, DOWNTIMES, ETC.

8 How do we catch deep bugs in distributed systems?

9 How to Catch Deep Bugs
Distributed system model checker
- Re-orders all non-deterministic events
- Finds which specific orderings lead to bugs

10 What's Wrong with Existing Model Checkers?
Last 7 years: MaceMC [NSDI '07], Modist [NSDI '09], dbug [SSV '10], Demeter [SOSP '13], etc.
BUT
- Too many events
- Multiple crashes and reboots create more messages
- No model checker incorporates multiple crashes and reboots
- Cannot find deep bugs!
[Figure label: 100 events]

11 How do we catch deep bugs REALLY FAST?

12 Black-Box Approach
- Existing model checkers are slow
- They treat target systems as black boxes
- A large number of event orderings is generated
[Figure: a black-box model checker delivering events A, B, C, D to a black-box system]
Orderings: ABCD, ABDC, ACBD, ACDB, ADBC, ... (24 total)
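To make the cost of the black-box approach concrete, the following minimal sketch enumerates every ordering of the four concurrent events on the slide, which is what a checker with no semantic knowledge has to explore. The names `EVENTS` and `black_box_orderings` are illustrative assumptions, not part of SAMC.

```python
from itertools import permutations

# Hypothetical concurrent events, named as on the slide.
EVENTS = ["A", "B", "C", "D"]

def black_box_orderings(events):
    """Return every ordering of the events, as a black-box checker would explore."""
    return ["".join(p) for p in permutations(events)]

orderings = black_box_orderings(EVENTS)
print(len(orderings))  # 24, i.e. 4!
print(orderings[:5])   # ['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC']
```

The count grows factorially with the number of events, which is why adding crashes and reboots to the event set quickly becomes intractable.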

13 Semantic Knowledge
- How can we make a model checker fast?
- Exploit semantic knowledge
- Semantic-aware model checker (SAMC)
[Figure: SAMC and the black-box target system with events A, B, C, D]

14 Black Box vs. SAMC
Black-box model checker: ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, BDAC, ...
SAMC with message processing semantic: ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, ...
Orderings that lead to the same state are marked as unnecessary re-orderings.
[Figure: events A, B, C, D delivered to a black box, and to a system whose message processing semantic is known]

15 SAMC with Crashes
Crash recovery semantic
Black-box model checker: ABCDX, ABCXD, ABXCD, AXBCD, XABCD, ABDCX, ABDXC
SAMC with crash recovery semantic: the same orderings, with some of them marked as unnecessary re-orderings
[Figure: nodes N1 to N4 with messages A, B and C, D, and crash X]

16 Generic Reduction Algorithms
Principle of Semantic Awareness
- Local-Message Independence (LMI)
- Crash-Message Independence (CMI)
- Crash Recovery Symmetry (CRS)
- Reboot Synchronization Symmetry (RSS)

17 SAMC Implementation and Integration
SAMC implementation: 10,000 LOC from scratch
Applied SAMC to 3 cloud systems, 7 protocols, 10 versions
Cloud system: Protocols
- Cassandra: Gossiper, Hinted handoff, Read/write
- Hadoop 2.0: Cluster management, Speculative execution
- ZooKeeper: Atomic broadcast, Leader election

18 Result
- Reproduced 12 old bugs
- Compared to state-of-the-art techniques: Dynamic Partial Order Reduction (DPOR), Random-DPOR
- Finds bugs 2x to 340x faster, 49x on average
- Found 2 new bugs and submitted them to developers

19 Outline
- Intuition
- SAMC
- Local-Message Independence
- Crash-Message Independence
- Crash Recovery Symmetry
- Reboot Synchronization Symmetry
- Evaluation

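20 Dependency vs. Independency
[Figure: starting from node state S, events A and B are applied in the orders A, B and B, A. When A and B are dependent, both orders must be explored. When A and B are independent, the order B, A is unnecessary.]
INDEPENDENT = NO REORDERING
2X SPEED-UP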
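The idea on this slide can be sketched as a check that two events commute from a given state. This is a minimal illustration under assumptions: events are modeled as pure functions on an integer state, and the helper names (`apply_all`, `independent`, `add_one`, `add_two`, `double`) are hypothetical. It checks a single starting state only, whereas a real checker reasons about the handler's semantics.

```python
def apply_all(state, events):
    """Apply each event function in order and return the final state."""
    for event in events:
        state = event(state)
    return state

def independent(state, a, b):
    """Events a and b are independent at `state` if both orders reach the same state."""
    return apply_all(state, [a, b]) == apply_all(state, [b, a])

# Hypothetical events acting on an integer node state.
def add_one(s):
    return s + 1

def add_two(s):
    return s + 2

def double(s):
    return s * 2

print(independent(0, add_one, add_two))  # True: both orders give 3
print(independent(0, add_one, double))   # False: (0 + 1) * 2 = 2, but 0 * 2 + 1 = 1
```

When the check holds, exploring one of the two orders is enough, which is where the 2X figure for a single pair comes from.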

21 Black Box vs. SAMC
Black-box model checker: treats A, B, C, D as all dependent, so it explores 4! = 24 orderings (ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, ...)
SAMC with semantic info: only two pairs are marked dependent, so it explores ABCD, ABDC, BACD, BADC
6X SPEED-UP
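The 6X figure can be reproduced with a small worked example. The sketch below assumes, from the four orderings SAMC keeps on the slide, that A and B are dependent, C and D are dependent, and every other pair is independent. It groups the 24 orderings into classes that differ only by swapping adjacent independent events and keeps one representative per class. The function name `canonical` and the swap-based grouping are illustrative choices, not SAMC's actual algorithm.

```python
from itertools import permutations

EVENTS = "ABCD"
# Assumed dependency relation: A-B dependent, C-D dependent, all other pairs independent.
DEPENDENT = {frozenset("AB"), frozenset("CD")}

def canonical(ordering):
    """Smallest ordering reachable by repeatedly swapping adjacent independent events."""
    seen = {ordering}
    frontier = [ordering]
    while frontier:
        cur = frontier.pop()
        for i in range(len(cur) - 1):
            if frozenset(cur[i:i + 2]) not in DEPENDENT:
                nxt = cur[:i] + cur[i + 1] + cur[i] + cur[i + 2:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return min(seen)

all_orderings = ["".join(p) for p in permutations(EVENTS)]
representatives = sorted({canonical(o) for o in all_orderings})
print(representatives)                             # ['ABCD', 'ABDC', 'BACD', 'BADC']
print(len(all_orderings) // len(representatives))  # 6
```

Each class contains six orderings that reach the same state, so exploring one representative per class gives the 24 / 4 = 6 reduction.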

22 Reduction Speed-up
[Figure: ordering ABDC applied from state S]

23 How to Declare Message Independency?
Q: Which concurrent messages are independent?
A: Use the message processing semantic

24 Message Processing Semantic in Simplified Leader Election
    if (vote <= belief)
      // do nothing
    else
      belief = vote;
Example with Belief = 3 (B = belief, V = vote):
- Vote=1: B = 3, V = 1, B stays 3
- Vote=2: B = 3, V = 2, B stays 3
- Vote=4: B = 3, V = 4, B becomes 4

25 Removing Re-ordering via Message Processing Semantic
    if (vote <= belief)
      // do nothing
    else
      belief = vote;
Belief=4, with concurrent messages Vote=1, Vote=2, Vote=3
Orderings: 1, 2, 3; 1, 3, 2; 2, 1, 3; ...
In every ordering each vote is discarded and B = 4 after every step, so the re-orderings are unnecessary.
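The slide's claim can be checked directly by running the simplified handler over every ordering of the three votes. This is a minimal sketch: `process_vote` mirrors the slide's pseudocode, while `final_belief` and the driver code are illustrative assumptions.

```python
from itertools import permutations

def process_vote(belief, vote):
    """Simplified leader-election handler from the slide: keep the larger value."""
    if vote <= belief:
        return belief  # do nothing: the vote is discarded
    return vote

def final_belief(belief, votes):
    """Deliver the votes in the given order and return the resulting belief."""
    for vote in votes:
        belief = process_vote(belief, vote)
    return belief

# Belief = 4 with votes 1, 2, 3: all six orderings end in the same state.
results = {order: final_belief(4, order) for order in permutations([1, 2, 3])}
print(len(results))           # 6
print(set(results.values()))  # {4}

# Contrast with Belief = 3 and a vote of 4, which does change the state.
print(final_belief(3, [1, 2, 4]))  # 4
```

Since all six orderings leave the belief at 4, exploring one of them is sufficient.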

26 Formalizing the Intuition
MESSAGE PROCESSING SEMANTIC
    if (vote <= belief)
      // do nothing
    else
      belief = vote;
DISCARD PATTERN
    if (isdiscard(msg, ls)) {
      // do nothing;
    }
DISCARD PREDICATE
    boolean discardpredicate(msg, ls) {
      if (msg.vote <= ls.belief)
        return true;
      else
        return false;
    }

27 Formalizing the Intuition
Belief=4, with messages Vote=1, Vote=2, Vote=3

vote  belief  discardpredicate
1     4       true
2     4       true
3     4       true

m_x  m_y  discard(m_x)  discard(m_y)  Independent
1    2    true          true          yes
1    3    true          true          yes
2    3    true          true          yes
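The two tables above can be generated from the discard predicate. In this sketch `discard_predicate` follows the slide's `discardpredicate`. The rule in `independent` (both messages discarded implies independent) is an assumption read off the table, and the `Msg` and `LocalState` classes are hypothetical stand-ins for the checker's real types.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Msg:
    vote: int

@dataclass
class LocalState:
    belief: int

def discard_predicate(msg, ls):
    """True when the receiving node would discard the message (slide 26)."""
    return msg.vote <= ls.belief

def independent(mx, my, ls):
    # Assumption: two messages are independent when both would be discarded,
    # because neither changes the local state in either order.
    return discard_predicate(mx, ls) and discard_predicate(my, ls)

ls = LocalState(belief=4)
msgs = [Msg(1), Msg(2), Msg(3)]
table = [(mx.vote, my.vote, independent(mx, my, ls))
         for mx, my in combinations(msgs, 2)]
for row in table:
    print(row)
# (1, 2, True)
# (1, 3, True)
# (2, 3, True)
```

With a belief of 3 and a vote of 4, `discard_predicate` returns false, so that vote would not be declared independent by this rule.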

28 Outline
- Intuition
- SAMC
- Local-Message Independence
- Crash-Message Independence
- Crash Recovery Symmetry
- Reboot Synchronization Symmetry
- Evaluation

29 SAMC Architecture
[Figure: two nodes with local states ls1 and ls2; an interceptor on each node holds messages a, b and c, d; SAMC decides which to deliver next, for example release(c)]
SAMC layers:
- Protocol-specific rules: Leader Election, Atomic Broadcast
- Generic reduction policies: LMI, CMI, CRS, RSS
- Basic reduction techniques: Symmetry, Dynamic Partial Order Reduction (DPOR)

30 Local-Message Independence
SAMC Generic Reduction Policies
- Local-Message Independence (LMI)
- Crash-Message Independence (CMI)
- Crash Recovery Symmetry (CRS)
- Reboot Synchronization Symmetry (RSS)

31 Local-Message Independence
Discard pattern
Increment pattern
    if (msg.type == ack) {
      node.ackcount++;
    }

    boolean incrementpredicate(msg, ls) {
      if (msg.type == ack)
        return true;
      else
        return false;
    }
    [Figure: two acks take counter C from 0 to 1 to 2, in either order]
Constant pattern
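The increment pattern can be illustrated in the same way as the discard pattern. In this sketch `handle` and `increment_predicate` follow the slide's pseudocode, while the `Msg` and `Node` classes, the `sender` field, and `final_count` are hypothetical additions for the example.

```python
from dataclasses import dataclass

@dataclass
class Msg:
    type: str
    sender: str  # hypothetical field, used only to tell the two acks apart

@dataclass
class Node:
    ack_count: int = 0

def handle(node, msg):
    """Handler from the slide: every ack increments the counter."""
    if msg.type == "ack":
        node.ack_count += 1

def increment_predicate(msg, ls):
    """True when processing the message only increments a counter."""
    return msg.type == "ack"

def final_count(msgs):
    """Deliver the messages to a fresh node and return its ack counter."""
    node = Node()
    for msg in msgs:
        handle(node, msg)
    return node.ack_count

a = Msg("ack", "n1")
b = Msg("ack", "n2")
print(increment_predicate(a, None) and increment_predicate(b, None))  # True
print(final_count([a, b]), final_count([b, a]))                       # 2 2
```

Because increments commute, two messages that both satisfy the predicate reach the same counter value in either order and need not be re-ordered.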

32 Crash-Message Independence
SAMC Generic Reduction Policies
- Local-Message Independence (LMI)
- Crash-Message Independence (CMI)
- Crash Recovery Symmetry (CRS)
- Reboot Synchronization Symmetry (RSS)

33 Crash-Message Independence
Black box: ABCDX, ABCXD, ABXCD, AXBCD, XABCD, ABDCX
[Figure: leader L and followers F with messages A, B and C, D; after a follower crashes, the remaining nodes keep their roles]
    void handlecrash() {
      if (X == follower && isquorum())
        followercount--;
      // No new messages
    }

    boolean localimpact(x, ls) {
      if (X == follower && isquorum())
        return true;
      else
        return false;
    }

34 Crash-Message Independence
[Figure: before the crash, nodes F, L, F, F with messages A, B and C, D; after the crash, nodes S, L, S, S]
    void handlecrash() {
      if (X == leader || !isquorum())
        electleader()
      // New messages created
    }

    boolean globalimpact(x, ls) {
      if (X == leader || !isquorum())
        return true;
      else
        return false;
    }
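The two predicates on slides 33 and 34 can be sketched together as follows. This is an illustration under assumptions: the slides do not define `isquorum()`, so the sketch takes it to mean that a strict majority of nodes is still alive after the crash. The `ClusterState` class and the helper `must_reorder_crash` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    size: int   # total number of nodes in the cluster
    alive: int  # nodes alive before the crash, including the crashing node

def is_quorum_after_crash(ls):
    # Assumption: quorum means a strict majority of nodes remains alive.
    return (ls.alive - 1) > ls.size // 2

def local_impact(role, ls):
    """The crash only updates local bookkeeping; no new messages are created."""
    return role == "follower" and is_quorum_after_crash(ls)

def global_impact(role, ls):
    """The crash triggers a leader election, which creates new messages."""
    return role == "leader" or not is_quorum_after_crash(ls)

def must_reorder_crash(role, ls):
    """Only a crash with global impact is re-ordered against outstanding messages."""
    return global_impact(role, ls)

healthy = ClusterState(size=4, alive=4)
degraded = ClusterState(size=4, alive=3)
print(local_impact("follower", healthy))         # True
print(must_reorder_crash("follower", healthy))   # False
print(must_reorder_crash("leader", healthy))     # True
print(must_reorder_crash("follower", degraded))  # True
```

The two predicates are complements of each other, so every crash falls into exactly one of the two cases.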

35 Crash Recovery Symmetry, Reboot Synchronization Symmetry
SAMC Generic Reduction Policies
- Local-Message Independence (LMI)
- Crash-Message Independence (CMI)
- Crash Recovery Symmetry (CRS)
- Reboot Synchronization Symmetry (RSS)

36 Outline
- Intuition
- SAMC
- Local-Message Independence
- Crash-Message Independence
- Crash Recovery Symmetry
- Reboot Synchronization Symmetry
- Evaluation

37 Evaluation
SAMC implementation: 10,000 LOC from scratch
Applied SAMC to 3 cloud systems, 7 protocols, 10 versions
Cloud system: Protocols
- Cassandra: Gossiper, Hinted handoff, Read/write
- Hadoop 2.0: Cluster management, Speculative execution
- ZooKeeper: Atomic broadcast, Leader election

38 Protocol-Specific Rules (e.g. ZooKeeper Leader Election)
- Guide SAMC to remove re-orderings
- 35 LOC on average per protocol

39 Catching Old Bugs
A table shows the number of executions needed to reach each bug and the speedup.
Columns: Issue#; SAMC (#exe); Black-Box DPOR (#exe, speedup); Random (#exe, speedup); Random DPOR (#exe, speedup)
Rows: 7 ZooKeeper bugs, 3 MapReduce bugs, 2 Cassandra bugs

41 Reduction Ratio
ZooKeeper leader election protocol
- Run black-box DPOR
- After each execution, check whether that execution is also executed by SAMC
- Count DPOR's executions that are executed by SAMC too

Reduction ratio:
#Crash  #Reboot  ALL   LMI  CMI  CRS  RSS
                 X     18X  5X   4X   3X
                 63X   35X  6X   5X   5X
                 103X  37X  9X   9X   14X
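The counting procedure above can be sketched as follows. The formula is an assumption inferred from the slide's description: the ratio is taken as the number of black-box DPOR executions divided by the number of those executions that SAMC would also run. The function name `reduction_ratio` and the toy data are illustrative, not the authors' measurement code.

```python
from itertools import permutations

def reduction_ratio(dpor_executions, samc_executions):
    """Ratio of DPOR executions to the DPOR executions that SAMC also runs."""
    samc = set(samc_executions)
    kept = sum(1 for execution in dpor_executions if execution in samc)
    if kept == 0:
        raise ValueError("no DPOR execution is executed by SAMC")
    return len(dpor_executions) / kept

# Toy data: DPOR explores all 24 orderings of A, B, C, D; SAMC keeps 4 of them.
dpor_runs = ["".join(p) for p in permutations("ABCD")]
samc_runs = ["ABCD", "ABDC", "BACD", "BADC"]
print(reduction_ratio(dpor_runs, samc_runs))  # 6.0
```

Under this reading, a ratio of 63X means SAMC would have run about one in every 63 executions that black-box DPOR performed.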

42 Conclusion
- Deep bugs live in the cloud
- Model checkers need to incorporate complex failures to reach deep bugs
- State space explosion
- Semantic-aware model checking: LMI, CMI, CRS, RSS
- Brings future research questions:
  - What other semantic knowledge is useful?
  - How to extract it from the code automatically?

43 Thank You
Questions?
http://ucare.cs.uchicago.edu

SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems Tanakorn Leesatapornwongsa and Mingzhe Hao, University of Chicago; Pallavi Joshi, NEC Labs America; Jeffrey F. Lukman,