DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently

Size: px

Start display at page:

Download "DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently"

Agatha Stone
5 years ago
Views:

1 : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign *Department of Computer Science and Engineering University of Washington

2 Motivation 2 Debugging parallel applications on a multicore is challenging: Bugs can be very timing dependent Their effects might manifest after many instructions Need effective debugging techniques for parallel applications on MPs One such technique: HW-assisted deterministic replay

3 HW-Assisted Deterministic Replay 3 Phase I: Initial execution Bacon and Goldstein. WPDD 91 HW records into a log the order of dependences between threads This log captures the interleaving of threads Phase II: Replay Re-run the parallel application Enforce the dependence orders stored in the log

4 Contributions 4 Propose : a new scheme for hw-assisted deterministic replay of parallel programs based on chunk-based execution Records initial execution at production-run speeds Requires only small log (0.6% of current schemes) Replays at high-speed Has multiple modes of execution with different speed/logging requirements

5 Outline 5 Background on replay schemes : a Chunk-based replay scheme Advanced Execution Replay Evaluation delorean

6 Point-to-Point Schemes: FDR and RTR 6 Record dependences between individual instructions of different threads Optimizations to reduce log size P1 FDR (ISCA 03) RTR (ASPLOS 06) P2 Transitive reduction n Wb Regulated transitive reduction Rb m Use Sequential Consistency (SC) RTR proposes an algorithm to support TSO: P2 s Log P1 n m Record value of loads that violate SC

7 Vector-based Schemes: Strata Stratum: vector of as many counters as processors: Count # of memory operations executed Log stores sequence of strata P1 P2 P3 7 Strata (ASPLOS 06) Stratum Wa Rd Wc Rb Ra It also uses SC Log size is similar to RTR s Log

8 Episode-Based Schemes: ReRun 8 ReRun (ISCA 08)

9 Outline 9 Background on replay schemes : a Chunk-based replay scheme Advanced Execution Replay Evaluation delorean

10 Chunk-Based Execution 10 P0 P1 P0 P1 P0 P1 Mem st X st Y chunk ld T ld Z st W chunk st X st Y commit ld T ld Z st W commit st X st Y commit collision ld X ld Z ld X re-execute ld X ld Z ld X P0 P1 P0 Group consecutive dynamic instructions Execute chunks atomically Execute chunks in isolation Limit possible interleavings Memory ops can be reordered and overlapped within and across chunks Memory access interleaving happens at chunk boundaries Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)

11 A Great Substrate for MP Replay 11 Overlapping/reordering enables high-speed execution and replay Performance similar to a system with Release Consistency (RC) Limited memory interleaving minimizes logging requirements No need to record individual dependences

12 12 Chunk-Based Replay: Replaying a chunk-based system means: Generate same chunks during initial execution and replay And commit them in the same order s Log stores total order of chunk commits and their size P0 st X st Y commit st A st X ld Z st Z st T commit ld A ld B P1 ld C commit ld Y st B st A commit s log is very small: Each entry is short: (ProcID, ChunkSize) Log updated infrequently Aggressive designs can make it even smaller Log P0 2 P1 3 P1 3 P0 5

13 13 Basic Operation Proc 0 Proc 1 Chunk A Commit Request Arbiter Commit Request Chunk B Ok Ok A Size CS Log 1 ProcID = 1 B Size CS Log ProcID = 0 0 PI Log Log implemented as two different structures: Chunk Size (CS) Log Processor Interleaving (PI) Log

14 Outline 14 Background on replay schemes : a Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean

15 Reducing s Log 15 PI Log P0 CS Log P1 CS Log 1 Use large chunks Fixed-size chunking Predefined interleaving 0 1 1

16 Reducing s Log 16 1 Use large chunks + Fewer entries in log More collisions and cache overflows 2 Fixed-size chunking + No Almost CS log no Need to handle cache overflows 3 Predefined interleaving + No PI log Performance degradation

17 Execution Modes 17 Nano Pico 1 Use large chunks 1 Use large chunks 2 Fixed-size chunking (tiny CS log) 2 Fixed-size chunking (tiny CS log) 3 Predefined interleaving (no PI Log)

18 Outline 18 Background on replay schemes : Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean

19 Overall System 19 Proc 0 Network Proc 1 Chunk A DMA Arbiter Chunk B CS Log Interrupt Log I/O Log CS Log Interrupt Log I/O Log DMA Log PI Log Interrupt, I/O and DMA logs also appear in all past replay schemes

20 Replay 20 Virtual machine monitor loads initial system checkpoint Interrupt Log provides initial-execution interrupts I/O Log provides initial-execution I/O inputs Processors execute normally (they execute in parallel): Use CS Log to generate chunks with exceptional sizes (from cache overflows) Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)

21 Also in the paper 21 Rules for deterministic chunking How to deal with interrupts, traps, cache overflows, I/O ops Scalability of Pico A detailed view of s architecture Proof of why s replay is deterministic Combination with other HW replay schemes

22 Outline 22 Background on replay schemes : Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean

23 Evaluation Setup 23 Chunk-based simulator supports both execution and replay 8-processor CMP 1000, 2000, 3000 instructions per chunk PI Log: 4-bit entries CS Log: 32-bit entries Applications: SPLASH, SPECjbb 2000 and SPECweb 2005

24 Log Space Requirements 24 Log size (bits/core/kilo-inst) CS log (Compressed) PI log (Compressed) Instructions per chunk Nano Average RTR compressed log size (estimated) Log size (bits/core/kilo-inst) CS log (Compressed) Pico Average RTR compressed log size (estimated) Instructions per chunk Nano s log is 16% of RTR s (2000-instruction configuration) With one additional optimization, Nano s log is 7.5% of RTR s Picos s log is 0.6% of RTR s (1000-instruction configuration)

25 If We Were To Record a Day of Execution GB/Core/Day FDR RTR Nano Pico Execution and replay of long periods of time is feasible

26 Performance 26 RC Nano Nano Pico Replay SC Pico Replay Speedup R O RP S R 0 Initial and Execution Replay Execution Initial execution: Nano almost as fast as RC, Pico faster than SC Replay: comparable to initial execution speed

27 Conclusions 27 is a new approach to HW replay of parallel applications Advantages over traditional HW replay systems: Executes at high speed (like most aggressive consistency models) Replays at comparable speed Summarizes execution into a very small log (0.6% of current schemes) Great help for programmers that need to debug parallel codes

28 Future Work 28 Adapt to conventional multiprocessors without chunk-based execution Take conventional replay schemes, use very aggressive SC implementations, and compare to Combine the best aspects of the different recording approaches (, RTR and Strata)

29 : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign *Department of Computer Science and Engineering University of Washington

The Bulk Multicore Architecture for Programmability

The Bulk Multicore Architecture for Programmability Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Acknowledgments Key contributors: Luis Ceze Calin