RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors

Size: px

Start display at page:

Download "RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors"

Hubert Page
5 years ago
Views:

1 RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors Nima Honarmand and Josep Torrellas University of Illinois at Urbana-Champaign 1

2 RnR: Record and Deterministic Replay Recreate execution of a (parallel) program Record: capture nondeterministic events in a log Replay: use the log to recreate the exact same execution Use cases: Debugging, Security, Fault Tolerance, Sources of non-determinism rogram inputs Capture using operating system support Memory access interleaving Memory Race Recording (MRR) Important in multiprocessor systems Capture using hardware support My Focus: Hardware-based memory race recording in relaxed-consistency multiprocessors Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 2

3 RelaxReplay Contributions 1) A hardware-based scheme for memory race recording in multi-processors with relaxed memory models 2) Design independent of memory model details Only relies on conventional cache coherence 3) Compact log format Log size comparable to previous, less general, schemes 4) Log format enables efficient replay Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 3

4 Memory Race Recording (MRR) in Hardware Observation: Shared-memory communications manifest as cache coherence transactions Can do MRR by monitoring transactions Chunk-Based Recording 1) Dynamically, break processor s execution into chunks of instructions Chunk: the work done between consecutive communications with other processors Chunk Chunk Comm. Comm. Comm. 2) Order chunks across processors such that Chunk containing source of a dependence is ordered before chunk containing destination 3) Replay chunks in recorded order to reproduce dependences Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 4

5 Chunk-Based Recording Example TS=1 TS=2 Core 1 LD A LD B TID=1 CNT=1 TS=3 LD B TID=1 CNT=2 TS=1 inval Thread inval Interleaving Order Read req. Core 2 ST B ST A TID=2 CNT=2 Each processor keeps a Read Set and a Write Set Read and Write Sets are cleared after chunk creation TS=3 TS=2 # of Chunks << # of dependences Chunk content logged efficiently as <# of insts> Compact log Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 5

6 roblem: Access Re-ordering in Relaxed Models Chunk-based recording assumes in-program-order execution of memory accesses (Sequential Consistency) Real processors re-order memory accesses Recording Number of instructions not enough to capture work done between consecutive communications revious work support Total Store Ordering (TSO) roblem open for more aggressive relaxed model (ARM, IBM, ) How to efficiently record heavily re-ordered accesses? Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 6

7 Notion of Interval in RelaxReplay Core 1 Core 2 Core 3 TS=2 TS=1 TS=3 TS=4 TS=6 TS=5 Intervals Interval: eriod of time between consecutive communications with other processors Replaces notion of chunk Instructions in an interval are not in program order How to compactly log content of an interval? Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 7

Key Idea: utting Instructions Back in rogram Order Time inst 1 inst 2 inst 3 inst 4 inst 5 inst 6 Interval 1 Interval 2 inst 1 inst 2 inst 3 inst 4 inst 5 inst 6 Interval 1 Interval 2 Record an

8 Key Idea: utting Instructions Back in rogram Order Time inst 1 inst 2 inst 3 inst 4 inst 5 inst 6 Interval 1 Interval 2 inst 1 inst 2 inst 3 inst 4 inst 5 inst 6 Interval 1 Interval 2 Record an alternative order of execution with equivalent behavior and same set of inter-thread dependences Vast majority of instructions in program order Compact log Some instructions cannot be put in program order augment the log with additional information for correct play Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 8

Design: Decouple Instruction Execution and Logging Deferred logging: instructions go through a logging step, in program order Detects instructions that

.. head Counting (C) Event: When instruction is added to the log At the head of Shadow ROB Inst should be performed Happens in program order erformed

9 Design: Decouple Instruction Execution and Logging Deferred logging: instructions go through a logging step, in program order Detects instructions that can be put back in program order erform () Event: When instruction performs in memory ROB... head Counting (C) Event: When instruction is added to the log At the head of Shadow ROB Inst should be performed Happens in program order erformed Shadow ROB... Not Retired Not performed Retired but Not Counted For each inst, try to move to C Log an interval in terms of its Counting (C) events Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 9

10 Moving Events to C Events inst 1 inst 2 inst 3 Time C C Case 1 (very common): No conflicting coherence request between and C safe to move to C We put s back in program order in-order inst Log a sequence of such insts as <# of insts> C Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 10

11 Moving Events to C Events Case 2 (very rare): Conflicting request between and C can t move to C Reordered inst Special log entry Conflicting Request Interval 10 Interval 10+n... C X Reordered loads Log the value returned by load During replay, read the value from the log (instead of memory) Reordered store Log the value, address and ID of interval Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 11

12 Detecting Conflicting Requests Naïve solution: Associatively search in-flight instructions High hardware overhead Basic Solution (BASE): Reordered if and C in different intervals, otherwise in-order Optimized Solution (OT): Snoop Table Table of counters When snoop a request, index its address into the table and increment the counter Line Address Reordered if different snoop counts at and C Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 12

13 Evaluation Setup rocessor: 8 cores, 4-way out-of-order, 2 GHz Memory subsystem: private L1, shared L2, Ring interconnect Memory model: Release Consistency Benchmarks: 10 SLASH-2 programs Record: Max interval sizes: 4K insts and INF 4 configs: 4K-BASE, 4K-OT, INF-BASE and INF-OT Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 13

% of Memory Insts Number of Reordered Accesses 0.2 0.175 0.15 0.125 0.1 0.075 0.05 0.025 0 Reordered Loads Reordered Stores 1.6 0.16 0.06 0.03 0.03 0.0005 0.003 0.

14 % of Memory Insts Number of Reordered Accesses Reordered Loads Reordered Stores K-BASE 4K-OT INF-BASE INF-OT Less than 0.04% of mem. insts re-ordered with OT Regardless of Max Interval Size Compact log Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 14

Bits per 1K instructions Log Generation Rate 50 40 360 42 30 20 10 22 12 0

(INF-OT) on 8 processors Comparable to (1-4x) previous techniques Very small

15 Bits per 1K instructions Log Generation Rate K-BASE 4K-OT INF-BASE INF-OT Small log size: 48 MB/sec (4K-OT) and 25 MB/sec (INF-OT) on 8 processors Comparable to (1-4x) previous techniques Very small compared to existing memory BW Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 15

16 Also in the aper Can combine RelaxReplay with any existing chunk-based recording scheme In fact, I presented RelaxReplay on top of QuickRec_[ISCA 13] Details of the replay algorithm More performance numbers Replay performance Scalability analysis (different numbers of processors) Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 16

17 Conclusions RelaxReplay: Memory race recording for relaxed memory models Design independent of memory model Supports Intel, ARM, IBM, Tile, etc. Can be combined with existing chunk-based recorders Compact log format that enables efficient replay In practice, log size comparable to the ideal case of no reordering Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 17

18 THANK YOU! Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 18

19 Replaying RelaxReplay Logs Before replaying an interval wait for its predecessors While (interval has entries) { ent next entry of interval if (ent is Block) { execute next ent.size instructions } else if (ent is ReorderedLoad) { Read the load value from the log } else if (ent is ReorderedStore) { Read value and address from the log Execute the store } } Signal the successors of interval Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 19

20 Normalized Execution Time Replay Overhead Sequential replay using OS to enforce interval ordering and emulate re-ordered instructions Application Cycles Overhead Cycles K-BASE 4K-OT INF-BASE INF-OT Efficient replay: 46% (4K-OT) and 18% (INF-OT) overhead cycles Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 20

21 Example Interval = 10 Interval = 15 i 1 i 2 LD i 4 i 5 ST i 7 i 8... C C C ReorderedLoad? C C C ReorderedStore? C C Base Log for Interval 15: Block(Size=2) ReorderedLoad(Val) Block(Size=2) ReorderedStore(Addr,Val,Interval 10) Block(Size=2) Opt Log for Interval 15: Block(Size=8) Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 21

QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs

QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs Intel: Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich