ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Size: px

Start display at page:

Download "ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors"

Sherman Robinson
6 years ago
Views:

1 ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard Laboratories

2 Motivation Availability & Reliability increasingly important Frequency, Feature Size Errors Complexity, Verification Cost Errors Multiprocessors Errors Global software-only recovery too slow Can hardware help? Prvulovic et al. ReVive: Cost Effective Rollback Recovery 2

3 Motivation Cost vs. Performance vs. Availability Low Cost Simple changes to a few key components Low Performance Overhead Handle frequent operations in hardware High Availability Fast recovery from a wide class of errors Prvulovic et al. ReVive: Cost Effective Rollback Recovery 3

4 Contribution: New Scheme Low Cost HW changes only to directory controllers Memory overhead only 12.5% (with 7+1 parity) Low Performance Overhead Only 6% performance overhead on average High Availability Recovery from: system-wide transients, loss of one node Availability better than % (assuming 1 error/ day) Prvulovic et al. ReVive: Cost Effective Rollback Recovery 4

5 Overview of ReVive Entire main memory protected by distributed parity Like RAID-5, but in memory Periodically establish s a checkpoint c Main memory is the checkpoint state Write-back kdirty data from caches, save processor r context t Save overwritten data to enable restoring checkpoint When program execution modifies memory for 1st time Prvulovic et al. ReVive: Cost Effective Rollback Recovery 5

6 Distributed N+1 Parity Node 0 Node 1 Node N Parity Group Distributed to minimize contention Parity Data Data... Allocation Granularity: page Update Granularity: cache line Prvulovic et al. ReVive: Cost Effective Rollback Recovery 6

7 Distributed Parity Update in HW WB Line X XOR Dir Mem Par Dir Dir Par Ack Rd Wr Rd Wr Home of Line X Mem Mem XOR Home of parity for Line X Prvulovic et al. ReVive: Cost Effective Rollback Recovery 7

8 ReVive: Checkpoint Creation Timeline Checkpoint (<1ms for 2MB L2) Execute (100 ms) P0 <5μs <1μs ~400μs/MB ~20μs P1 P2 Time P3 Timer Interrupt Save CPU Context Write-Back Sync Prvulovic et al. ReVive: Cost Effective Rollback Recovery 8

9 Logging in HW RdExcl Line X Rd Line X Data Dir Wr Log Mem Note: Wr Log also updates the parity Home of Line X Prvulovic et al. ReVive: Cost Effective Rollback Recovery 9

10 Log Filtering Add L bit to directory entry of each line Clear all L bits on each checkpoint Set when logged Do not log if already set Not needed for correctness Can be only in directory cache Can be completely omitted Prvulovic et al. ReVive: Cost Effective Rollback Recovery 10

11 Classes of Recoverable Errors CPU Caches Dir Ctrl Revive Hardware Mem Ctrl DRAM Net Ctrl Module Interconnection Network Can recover from (Trans + perm) errors in 1 node Trans errors in N nodes Prvulovic et al. ReVive: Cost Effective Rollback Recovery 11

12 Permanent Node Loss: Recovery Unavailable (~840ms) Degraded P0 Execute Detection Repair Log Repair Data P1 100ms 80ms ~100ms ~490ms ~20s P2 Time P3 Checkpoint Bzzzt! HW Rollback Prvulovic et al. ReVive: Cost Effective Rollback Recovery 12

13 Evaluation Setup Splash-2 benchmarks 16 superscalar processors (6-issue at 1GHz) 16kB L1 cache, 512kB L2 cache 2-D torus network, virtual cut-through routing 100MHz DDR SDRAM Using 7+1 distributed parity Checkpoint interval: 10ms and infinite Prvulovic et al. ReVive: Cost Effective Rollback Recovery 13

14 Performance Overhead 25% 20% 15% 10% 5% L2 Misses/1000 Instructions 5 6 Ckp+Log Par 9 Cp10ms 7+1 CpInf 7+1 Cp10ms 1+1 CpInf 1+1 Exec. Time Increase 0% Barnes Cho holesky FFT FMM LU Ocean Rad adiosity Radix Ray aytrace Volrend Wat ater-n2 Wat ater-sp Average Tolerable 6% performance overhead Prvulovic et al. ReVive: Cost Effective Rollback Recovery 14

15 Worst-Case Recovery Time Millis seconds Reconstruct (Log) Reconstruct (Mem) Undo 0 Barnes Cholesky FFT FMM LU Ocean Radiosity Redo Work Radix Raytrace Volrend HW Repair Water-N2 Water-Sp Average Radix: 590ms + 180ms + 50ms = 820ms % 999% availability Prvulovic et al. ReVive: Cost Effective Rollback Recovery 15

16 Network Traffic Bytes/Instr PAR Ckp WB Exe WB RD/RDX 0 Barnes Cholesky FFT FMM LU Ocean Radiosity Radix Raytrace Volrend Water-N2 Water-Sp H Mean Prvulovic et al. ReVive: Cost Effective Rollback Recovery 16

17 Memory Traffic Bytes/Instr 2.0 PAR LOG 1.0 Ckp WB Exe WB RD/RDX 0 Barnes Cholesky FFT FMM LU Ocean Radiosity Radix Raytrace Volrend Water-N2 Water-Sp H Mean Prvulovic et al. ReVive: Cost Effective Rollback Recovery 17

18 Related Work Device- or problem-specific p schemes DIVA, Redundant Multithreading, Slipstream, ECC, etc. ReVive can handle errors that escape these schemes, improving overall availability at low additional cost Other system-recovery schemes Plank et al. - N+1 parity in software Masubuchi et al. - logging with bus-snooper SafetyNet Prvulovic et al. ReVive: Cost Effective Rollback Recovery 18

19 Related Work: SafetyNet Types of recoverable errors ReVive: Permanent (loss of a node)+transient SafetyNet: Transient; perm only w/ redundant devices HW modifications ReVive: Directory controller only SafetyNet: Memory, caches, coherence protocol Performance Overhead 6% with ReVive, negligible with SafetyNet Prvulovic et al. ReVive: Cost Effective Rollback Recovery 19

20 Conclusions Recovery from: system-wide transients, loss of 1 node Availability better than % Low performance overhead (6% on average) HW changes only to directory controllers Memory overhead 12.5% with 7+1 parity Overhead can be reduced by increasing parity groups Prvulovic et al. ReVive: Cost Effective Rollback Recovery 20

21 ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas v c

22 Rollback Recovery in Multiprocessors Checkpoint Consistency Global, Local Coordinated or Local Uncoordinated Checkpoint Separation Full or Partial Partial can be with Logging, Renaming or Buffering Checkpoint Storage Safe External, Safe Internal or for a Specialized Error Class Prvulovic et al. ReVive: Cost Effective Rollback Recovery 22

23 Checkpoint Consistency Global Synchronization is fast enough on shared-memory machines All synchronize to make a single consistent checkpoint Local Coordinated Synchronize as needed for a set of consistent checkpoints Local Uncoordinated Do not synchronize Set of consistent checkpoints computed when recovering Prvulovic et al. ReVive: Cost Effective Rollback Recovery 23

24 Checkpoint Storage Safe External (e.g. RAID) Not fast enough Recovery data on redundancy protected-disk Safe Internal (e.g. DRAM) Recovery data in redundancy-protected memory Unsafe Internal Not general enough Recovery data not protected t dby redundancy d Assumes memory content survives errors Prvulovic et al. ReVive: Cost Effective Rollback Recovery 24

25 Checkpoint Separation Full Too much storage needed Checkpoint and working data sets do not intersect Partial with Buffering Commit atomicity, overhead Buffer non-checkpoint data, flush to commit Partial with Renaming Rename to avoid overwriting checkpoint data Partial with Logging Save overwritten checkpoint data in a log Complex HW or coarse grain Prvulovic et al. ReVive: Cost Effective Rollback Recovery 25

26 Log & Parity Update Races Error while log update in progress Must fully perform log update before starting overwrite Error while parity update in progress Assume a single node fails Can recover r either old or new content t Both result in consistent recovery (see paper) Long error detection latency Keep sufficient logs to recover far enough into the past Prvulovic et al. ReVive: Cost Effective Rollback Recovery 26

27 Availability vs Overhead If checkpoint interval too short Lost work and hardware self-check dominate recovery Fault-free execution performance suffers If checkpoint interval too long Low availability Find a good balance Checkpoint intervals of 100ms to 1s Prvulovic et al. ReVive: Cost Effective Rollback Recovery 27

28 Analysis Cache size vs. checkpoint interval 512kB caches with checkpoints every 10ms 5MB caches with checkpoints every 100ms Log size vs. checkpoint interval Log will grow in sub-linear proportion to interval size 10ms: <3MB per node, only two apps >128kB per node Parity overhead: 12.5% of system memory is parity Prvulovic et al. ReVive: Cost Effective Rollback Recovery 28

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

To appear in the Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA-29) May 2002 ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors