Reducing Crash Recoverability to Reachability

1 Reducing Crash Recoverability to Reachability
Eric Koskinen, Yale University
Junfeng Yang, Columbia University
Principles of Programming Languages (POPL), St. Petersburg, Florida, 20 January 2016

2 We are pretty good at writing programs

7 We are pretty good at writing programs?

8 1. What do we mean by crash and recovery? (Specification)
  2. Can we prove (automatically) that a program recovers from a crash?
  3. Does this actually work on real examples?

9 What do we mean by crash and recovery?
[Figure: an execution through states 0, 1, 2, 3 with a CRASH]

10 What do we mean by crash and recovery?
  1. Boot machine
  2. Establish program env.
  3. Execute program
  4. Crash mid-execution
  5. Re-boot computer
  6. Execute Recovery Script
  7. Establish program env.
  8. Re-execute program

11-13 What do we mean by crash and recovery?
  Initial State
    1. Boot machine
    2. Establish program env.
    3. Execute program
  Crash
    4. Crash mid-execution
  Recover
    5. Re-boot computer
    6. Execute Recovery Script
    7. Establish program env.
    8. Re-execute program
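The eight steps above can be sketched as a small simulation. This is a minimal sketch, not the authors' tool: the helper `run_with_crash`, the dictionary standing in for the disk, and the two example steps are hypothetical names chosen for illustration. The only modelling assumption is that the disk survives a crash while volatile memory does not.

```python
def run_with_crash(program, recovery_script, crash_before=None):
    """Simulate the lifecycle: run `program`, optionally crash, recover, re-run.

    `program` is a list of functions step(disk, memory). `crash_before` is the
    index of the first step that is never reached (None means no crash).
    Hypothetical model: `disk` is persistent, `memory` is volatile.
    """
    disk = {}                                # 1. boot machine (empty disk)
    memory = {}                              # 2. establish program env.
    for index, step in enumerate(program):   # 3. execute program
        if index == crash_before:            # 4. crash mid-execution
            break
        step(disk, memory)
    else:
        return disk                          # ran to completion, no crash
    memory = {}                              # 5. re-boot: volatile state is lost
    recovery_script(disk)                    # 6. execute recovery script
    memory = {}                              # 7. establish program env.
    for step in program:                     # 8. re-execute program
        step(disk, memory)
    return disk


def load_greeting(disk, memory):
    memory["greeting"] = "hello"             # lives only in volatile memory


def save_output(disk, memory):
    disk["output"] = memory["greeting"]      # reaches the persistent disk


def no_recovery(disk):
    pass                                     # empty recovery script


program = [load_greeting, save_output]
clean = run_with_crash(program, no_recovery)
crashed = run_with_crash(program, no_recovery, crash_before=1)
print(clean == crashed)
```

Here the crashed run ends in the same disk state as the clean run because the program rebuilds its volatile state on re-execution.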

14 What do we mean by crash and recovery?
  in = open(input)
  read(in, buf);
  CRASH

15 in = open(input)
   out = open(output, O_CREAT | O_WRONLY | O_TRUNC)
   write(out, "A")
   CRASH
   ...

16-17 Is this new trace ok?
   in = open(input)
   out = open(output, O_CREAT | O_WRONLY | O_TRUNC)
   write(out, "A")
   CRASH
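Whether the new trace is acceptable depends on how the output file is opened. The sketch below is a toy model under stated assumptions: files are entries in a Python dictionary, the write is assumed to have reached the disk before the crash, and the "append" mode is a hypothetical variant added for contrast (it does not appear on the slides).

```python
def run(open_mode, crash_after_write):
    """Return the contents of `output` after one run.

    If `crash_after_write` is set, the program crashes right after the write
    and is then re-executed from the start on the surviving files.
    """
    files = {"input": "data"}            # hypothetical in-memory file system

    def execute():
        if open_mode == "trunc" or "output" not in files:
            files["output"] = ""         # O_CREAT | O_TRUNC: start from empty
        files["output"] += "A"           # write(out, "A")

    execute()
    if crash_after_write:                # CRASH, then re-execute
        execute()
    return files["output"]


print(run("trunc", False), run("trunc", True))
print(run("append", False), run("append", True))
```

With truncation the crashed-and-re-executed run ends with the same file contents as the uncrashed run. In the hypothetical append variant the re-execution produces contents that no uncrashed run produces, which is the kind of new behavior the talk is concerned with.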

18 Program states

19 Program states
With the possibility of crashes comes the possibility of new behaviors.
Definition: If the program crashes, then when it is re-executed it should not have new behaviors that weren't in the original program.
  Matches what the program does
  Program must handle new initial states

20 Program states
With the possibility of crashes comes the possibility of new behaviors.
We would like to prove that they are already included in the original program.
Therefore: we can use the original program as the specification for how the program should behave in the presence of crashes.

22-23 Non-determinism
   in = open(input)
   out = open(output, O_CREAT | O_WRONLY | O_TRUNC)
   if(rand()) { write(out, "A"); CRASH }
   else { write(out, "B"); }
   ...
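With non-determinism, "no new behaviors" becomes a set inclusion: every outcome of a run with crashes must already be an outcome of some run without crashes. The following sketch enumerates both sets for the program above. It is an illustration only: the function name `behaviors` is hypothetical, a behavior is reduced to the final contents of the output file, and at most one crash is considered.

```python
from itertools import product


def behaviors(allow_crash):
    """Enumerate every final content of `output` for the program above."""
    results = set()
    crash_choices = [False, True] if allow_crash else [False]
    # first: branch taken by the first run; second: branch taken on re-execution
    for first, crashes, second in product("AB", crash_choices, "AB"):
        output = "" + first              # open with O_TRUNC, then write
        if crashes and first == "A":     # the CRASH point sits in the "A" branch
            output = "" + second         # re-execution truncates, picks a branch
        results.add(output)
    return results


original = behaviors(allow_crash=False)
with_crashes = behaviors(allow_crash=True)
print(with_crashes <= original)
```

The re-executed run may take a different branch than the crashed run did, yet every outcome it can produce is one the original program could already produce, so the inclusion holds for this example.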

24 Recovery Scripts (Described in the Paper)
   in = open(input)
   out = creat(output)
   if(rand()) { write(out, "A"); CRASH; RECOVER() }
   else { write(out, "B"); }
   ...

   RECOVER() {
     if(exists(output)) unlink(output);
   }
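One way to read the recovery script above is that it moves the post-crash file system back to a state the original program could have started from. The sketch below checks exactly that on a toy model; the dictionary-based file system and the helper names are assumptions made for illustration.

```python
def run_until_crash(files):
    """Execute the "A" branch up to the CRASH point."""
    files["output"] = ""                 # out = creat(output)
    files["output"] += "A"               # write(out, "A"); then CRASH


def recover(files):
    """RECOVER(): if exists(output) unlink(output)."""
    files.pop("output", None)


initial = {"input": "data"}              # hypothetical initial file system
files = dict(initial)
run_until_crash(files)
state_at_crash = dict(files)
recover(files)

print(state_at_crash == initial)         # the crash left a new initial state
print(files == initial)                  # recovery restored the original one
```

After recovery the re-executed program sees the same initial state as an uncrashed run, so it cannot exhibit behaviors caused by a leftover partial output file.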

25 Specification: Checkpoints (Described in the Paper)
   in = open(input)
   out = creat(output)
   write(out, "pre");
   fsync_commit(out);
   chkpt:
   if(rand()) { CRASH; RECOVER() ... }
   else { ... }

   RECOVER() {
     if(committed) { in=open(input); out=open(output); goto chkpt; }
   }
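The checkpoint idea can be sketched as follows: once the prefix has been committed, recovery resumes at the checkpoint instead of repeating the prefix. This is a toy model under assumptions that are not on the slide: the work after `chkpt` is a placeholder append of "+rest", the commit is recorded as a flag on the persistent disk, and an uncommitted crash falls back to re-running from the start.

```python
def run(crash_after_commit):
    """Return (final output, list of phases executed) for one run."""
    disk = {"output": "", "committed": False}   # hypothetical persistent state
    log = []                                    # which phases actually ran

    def before_checkpoint():
        disk["output"] = "pre"                  # write(out, "pre")
        disk["committed"] = True                # fsync_commit(out)
        log.append("prefix")

    def after_checkpoint():
        disk["output"] += "+rest"               # placeholder for the elided work
        log.append("suffix")

    before_checkpoint()
    if crash_after_commit:
        # CRASH; RECOVER(): resume at chkpt only if the commit is on disk
        if disk["committed"]:
            after_checkpoint()                  # goto chkpt
        else:
            before_checkpoint()                 # assumed fallback: start over
            after_checkpoint()
    else:
        after_checkpoint()
    return disk["output"], log


print(run(False) == run(True))
```

The crashed run executes the prefix once and the suffix once, just like the uncrashed run, because the committed flag tells recovery that the prefix need not be repeated.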

26-31 Hierarchy of Crash Recoverability
   0-recoverability
   1-recoverability
   N-recoverability
   ∞-recoverability
   [Figure labels: Simulation, Recoverability]

32 1. What do we mean by crash and recovery?
   2. Can we prove (automatically) that a program recovers from a crash?
   3. Does this actually work on real examples?

33-35 Key Idea: Transformation
[Figure: control-flow automaton of a password-file update program, states 0 to 7, with these edge labels]
   fd = open(pw);
   {fd < 0}
   {fd ≥ 0} buf = read(fd); close(fd);
   {joe ∈ buf}
   {joe ∉ buf}
   d = readdir(/u);
   creat(pw2); append(pw2,buf); append(pw2,joe); fsync(pw2); close(pw2);
   rename(pw2,pw); psync(pw); mkdir(/u/joe);
Reduce to reachability: if the transformed program cannot reach qerr, then the original program is crash-recoverable.
Well-founded relation.
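The password-file example can be explored by brute force on a toy model: crash after every step, re-execute, and collect the final states. The talk's approach is a program transformation checked by a model checker; the sketch below is only an explicit-state analogue. It assumes every step reaches the disk immediately (so `fsync` and `psync` are omitted), treats `rename` as atomic, models `mkdir` as an idempotent set insertion, and considers a single crash. The names `add_joe`, `final_states`, and `Crash` are hypothetical.

```python
class Crash(Exception):
    """Raised at the chosen crash point."""


def add_joe(fs, crash_after=None):
    """Simplified add-user program; crashes after `crash_after` steps if set."""
    done = 0

    def tick():
        nonlocal done
        done += 1
        if done == crash_after:
            raise Crash()

    buf = list(fs["pw"])                 # buf = read(fd)
    if "joe" not in buf:
        fs["pw2"] = []                   # creat(pw2)
        tick()
        fs["pw2"] = buf + ["joe"]        # append(pw2, buf); append(pw2, joe)
        tick()
        fs["pw"] = fs.pop("pw2")         # rename(pw2, pw)
        tick()
    fs["dirs"].add("/u/joe")             # mkdir(/u/joe)
    tick()


def final_states():
    """Final states reachable with at most one crash plus re-execution."""
    states = set()
    for crash_after in [None, 1, 2, 3, 4]:
        fs = {"pw": ["root"], "dirs": set()}
        try:
            add_joe(fs, crash_after)
        except Crash:
            add_joe(fs)                  # re-execute on the post-crash state
        states.add((tuple(fs["pw"]), tuple(sorted(fs["dirs"])), "pw2" in fs))
    return states


print(len(final_states()))
```

If crashes introduced a new final state, the set would contain more than one element; in this model that would correspond to the error state being reachable.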

36-39 Program: r = m(a); s = n(b)
   Create Snapshot before each statement:
      pw2 := pw; mem2 := mem;
      pw3 := pw; mem3 := mem;
   [Figure: states σ, σ2, σ3, σ4, σ5 along the run, with results r and s]

40 Crash, then Recovery (Termination)

41 Load Snapshot
      `pw := pw2; `mem := mem2;

42 Execute uncrashed snapshot
      `s := n(`b);
   And recovered state
      s := n(b);

43-44 [Figure: recovered state σ and snapshot state `σ, results t and `t, with an edge to qerr]

45 Main Theorem. If the transformed program cannot reach qerr, then the original program is crash-recoverable.
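The snapshot, crash, load, and lockstep steps above can be imitated by a small explicit-state search. This is a sketch under assumptions, not the transformation from the paper: statements are Python functions that mutate a state and return an observable value, the recovery function returns the index of the statement to resume from, and `reaches_qerr` and the example statements are hypothetical names.

```python
import copy


def reaches_qerr(statements, recover, initial_state):
    """Return True if some single crash leads to a mismatch (qerr)."""
    for crash_point in range(1, len(statements) + 1):
        state = copy.deepcopy(initial_state)
        snapshots = []
        for stmt in statements[:crash_point]:
            snapshots.append(copy.deepcopy(state))   # Create Snapshot
            stmt(state)
        state["mem"] = {}                            # Crash: memory is lost
        resume_at = recover(state)                   # Recovery
        shadow = snapshots[resume_at]                # Load Snapshot
        for stmt in statements[resume_at:]:
            # Execute the uncrashed snapshot and the recovered state in lockstep
            if stmt(shadow) != stmt(state):
                return True                          # qerr
    return False


def create_output(state):
    state["disk"]["output"] = ""
    return len(state["disk"]["output"])


def append_a(state):
    state["disk"]["output"] = state["disk"].get("output", "") + "A"
    return state["disk"]["output"]


def restart(state):
    return 0                     # re-execute from the first statement


start = {"disk": {}, "mem": {}}
print(reaches_qerr([create_output, append_a], restart, start))
print(reaches_qerr([append_a], restart, start))
```

The first program truncates before writing, so the recovered run and the snapshot run agree on every observation. The second program only appends, so after a crash the recovered run observes different contents than the snapshot run and the search reports the error state.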

46 1. What do we mean by crash and recovery?
   2. Can we prove (automatically) that a program recovers from a crash?
   3. Does this actually work on real examples?

47-49 Eleven 82
[Figure: the tool outputs either a counterexample or a proof]
Notes
   Specification
   Built on CPAchecker
   Compiler Macros
   Model of the filesystem with arrays and integers
   Copying with arrays
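A file system modelled "with arrays and integers" might look like the sketch below. The layout is an assumption made for illustration (the slides do not show the actual encoding): each file has an existence flag, a length, and a fixed-size data array, file descriptors are plain integers, and a snapshot is an element-wise copy of every array.

```python
MAX_FILES, MAX_LEN = 4, 8                # hypothetical bounds


def new_fs():
    return {"exists": [0] * MAX_FILES,
            "length": [0] * MAX_FILES,
            "data": [[0] * MAX_LEN for _ in range(MAX_FILES)]}


def creat(fs, inode):
    fs["exists"][inode] = 1
    fs["length"][inode] = 0              # creating truncates the file
    return inode                         # the descriptor is just an integer


def write(fs, fd, byte):
    if not fs["exists"][fd] or fs["length"][fd] >= MAX_LEN:
        return -1                        # error: no such file or file full
    fs["data"][fd][fs["length"][fd]] = byte
    fs["length"][fd] += 1
    return 1                             # one byte written


def snapshot(fs):
    """Copy every array element-wise, as a snapshot assignment would."""
    return {"exists": list(fs["exists"]),
            "length": list(fs["length"]),
            "data": [list(row) for row in fs["data"]]}


fs = new_fs()
fd = creat(fs, 0)
write(fs, fd, 65)
snap = snapshot(fs)
write(fs, fd, 66)
print(snap["length"][0], fs["length"][0])
```

Because the snapshot owns its own copies of the arrays, later writes to the live file system do not change it, which is what comparing a recovered state against an earlier uncrashed state requires.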

50 Benchmarks
Simple examples from earlier in this talk
Examples of crash recovery protocols of real-world examples [Pillai et al. OSDI '14]
   Google's LevelDB
   PostgreSQL - Used by 30% of tech companies
   SQLite - Used by probably every Android app (1B users)
   VMware
   ZooKeeper - Distributed applications, used by Yahoo

53 Related Work
Chen et al. Using Crash Hoare logic for certifying the FSCQ file system. SOSP 2015
   Broadly complementary: verified FS versus verifying user-level programs
   Specifically different: we focus on automation while they focus on proof modularity/reusability (they require user-provided CHL specifications and user help in proof obligations)
Ntzik et al. Fault-Tolerant Resource Reasoning. APLAS
   Novel logic explicitly tracking volatile/persistent
   Supports concurrency; not automated
Gardner et al. Local Reasoning for the POSIX filesystem. ESOP
Ridge et al. SibylFS: formal specification and oracle-based testing for POSIX and real-world file systems. SOSP 2015

54 Reducing Crash Recoverability to Reachability
Eric Koskinen, Yale University
Junfeng Yang, Columbia University
POPL 2016
Contributions
   Specification - Definitions of what it means for a program to recover from a crash
   Automatic - Reduction to automaton reachability
   Proved recoverability of commit protocols from real systems (SQLite, LevelDB, ZooKeeper, etc.)
Open Challenges
   Code scope, O/S layers
   N-recoverability, infinite-recoverability
   Timing - Does recovery happen promptly?
   Concurrency

55 Thank you!
