QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs

Size: px

Start display at page:

Download "QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs"

Meagan Dalton
5 years ago
Views:

1 QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs Intel: Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich UIUC: Nima Honarmand, Nathan Dautenhahn, Samuel T. King, Josep Torrellas 1

2 Record and Replay (RnR) Systems Fear of debugging is one major block for adoption of shared memory parallel programming Hard to diagnose and reproduce, source of much frustration Goal: Reproduce (replay) an execution so that each instruction inputs same values as the recorded one Non-determinism: input data and thread interleaving Lots of research in the past 10 years but no complete hardware and software prototypes What s overhead and complexity? How much software enabling? Can it be always-on? 2

3 Contributions 1. HW/SW implementation of a user-level record and replay on an Intel architecture multi-core CPU a) Pentium cores augmented with a memory race recorder b) Full software stack: recorder and replayer 2. Recorder and replayer implementation description in the face of hardware intricacies a) Intel Relaxed memory model b) Intel non-atomic CISC instructions 3. Full characterization of the system 3

HW stack SW stack QuickRec Recording System Application(s) to be

Configures and manages hardware Records inputs to the program

Modified Modified Pentium core Pentium Pentium core core QuickIA

hardware extensions QuickIA is a dual socket Xeon platform where

4 HW stack SW stack QuickRec Recording System Application(s) to be recorded User-level recording tool Modified Linux Kernel (Capo3) Configures and manages hardware Records inputs to the program needed to replay Virtualizes recording hardware Modified Modified Modified Modified Pentium core Pentium Pentium core core QuickIA Server Pentium core Records sharedmemory interleaving with hardware extensions QuickIA is a dual socket Xeon platform where the CPUs are replaced by FPGAs Platform is capable of booting unmodified Linux OS 4

5 QuickIA Hardware Overview Xeon Server Socket 0 Socket 1 FPGA Pentium L2$ FPGA Pentium L2$ FPGA Pentium L2$ FPGA Pentium L2$ In-order superscalar Pentium processor (P54c) 60 Mhz 5 FPGA Bridge Connects Pentium FPGAs to Intel Front Side Bus MCH DDR2 FPGA Bridge MCH-Memory Controller Hub Cache coherent, full access to memory and I/O 8 GB of 24MB/s

Memory Race Recording (MRR) SW threads running on 2 cores Core 1 TID=1 Core 2 LD @C LD @A CNT=2 TS=1 ST @B ST @A TS=1 Data dependency resulting in a log entry Data dependency not resulting in a log

6 Memory Race Recording (MRR) SW threads running on 2 cores Core 1 TID=1 Core 2 CNT=2 TS=1 TS=1 Data dependency resulting in a log entry Data dependency not resulting in a log entry U-pipe Pentium V-pipe TID=1 CNT=2 TS=3 Thread Interleaving Order TID=2 CNT=3 TS=2 TS=2 TS=3 L1$ MRR RSet WSet Counter Buffer HW extension - MRR chunk denotes a timestamped group of instructions 6 MICRO 09 Architecting a Memory Race Recording in a Modern CPU

7 Memory Order Recording Challenges Total Store Order Memory Model Problem: Retired instruction count is insufficient to order instructions during replay Solution: Track # of pending stores in store buffer (MICRO 11, CoreRacer) Instruction Atomicity Violation (IAV) x86 macro memory operations do not execute atomically Some operations require more than one load/store IAV: intervening data dependency with another processor before completion of all memory operations Microarchitecture implementation details complicate replaying the memory ordering 7

Instruction Atomicity Violation (IAV) inc (@A) 1 2 4 μop 1 : tmp = (@A) μop 2 : tmp = tmp + 1 μop 3 : (@A) = tmp Core 1 Core 2 tmp = (@A) tmp = tmp + 1 Atomicity Violation (@A) = tmp store (@A) Only

8 Instruction Atomicity Violation (IAV) inc μop 1 : tmp = (@A) μop 2 : tmp = tmp + 1 μop 3 : (@A) = tmp Core 1 Core 2 tmp = (@A) tmp = tmp + 1 Atomicity Violation (@A) = tmp store (@A) Only first memory op. belongs to the chunk 3 Solution: 1. Track retirement of memory micro ops with an IAV counter 2. Reset when macro op retires 3. Log counter value if nonzero See paper for details and replay algo. 8

Hardware Kernel User Space D r i v e r Recording

chunk log 1 Inputs: syscalls, signals, etc.

3 Chunk data from processor 4 Serialized chunk log

and input data recording 2 Original OS Kernel

CMEM_PTR MRR_CTL CMEM_IDX MRR_STATUS CMEM_SZ

9 Hardware Kernel User Space D r i v e r Recording Software Stack (Capo3) Recorded App 5 4 input log chunk log 1 Inputs: syscalls, signals, etc. 2 Execution of syscalls, signals, etc. 3 Chunk data from processor 4 Serialized chunk log 5 Serialized Input log Hardware management and input data recording 2 Original OS Kernel Pentium 3 Core Cache MRR Register interface CMEM_PTR MRR_CTL CMEM_IDX MRR_STATUS CMEM_SZ MRR_FLAGS CMEM_TH See paper for details Changes are isolated with few callbacks throughout the kernel 9

Platform Characterization Set of SPLASH benchmarks executed to completion Experimented with 1, 2, 4 and 8 threads Capo3 implemented in Linux 3.0.

10 Platform Characterization Set of SPLASH benchmarks executed to completion Experimented with 1, 2, 4 and 8 threads Capo3 implemented in Linux Goals were to understand: Performance overhead sources Bandwidth requirements Scalability Chunk behavior Photograph of the QuickRec platform See paper for complete characterization 10

4 0.2 0 hw_only input chunk combined Recording overhead

11 Performance Overhead (4p run) Normalized Execution Time Average recording overhead: 13% hw_only input chunk combined Recording overhead is entirely due to the SW stack (mostly: input logs) 11

$9 MB/s (1TB in 3 days w/ compression) Small fraction of available$

12 Chunk Length (4p run) CDF of chunk length Bandwidth requirements Measured: 80KB/s on average Estimated in a modern Xeon server: 17.9 MB/s (1TB in 3 days w/ compression) Small fraction of available bandwidth Average chunk length for 4p is 39K instructions. Small bandwidth requirements. 12

13 Percentage of IAV Chunks Tight loop with several multi-memory operations Chunks with IAV are not just a corner case 13

14 Lessons Learned RnR hardware can be built with few touch points No changes in cache structures or coherence protocols Easy to implement and validate Biggest challenge is to deal with micro-architecture idiosyncrasies that affect replay TSO and IAV Software stack causes most recording overhead We will speed up software stack always-on usage model 14

15 Q&A Thank you! 15

16 Back up 16

17 Memory Ordering Idiosyncrasies IA memory model allows load instructions to retire before prior stores to different addresses 1st 2nd 2nd 1st Stores drain from store buffer postretirement Assume chunk creation happens before is drained from buffer Retirement order Chunk creation Memory Ordering Chunk order is potentially incorrect because retirement order does not match memory order Solution Insight: add # of pending stores to a chunk 17

18 Performance Overhead Breakdown Other (sleeping) Other (working) CTU (sleeping) CTU (working) System calls (sleeping) System calls (working) 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 18

QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs

QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich