Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1
Announcements Reading for today: class notes Your main focus: class project!! The end game Today: multi-core caching + checkpointed processors Mo 11/30: guest talk on Intel Atom We 12/2: guest talk on phase change memory Thu 12/3: project papers and presentations are due Late afternoon talks + dinner Today: architecture seminar at 4pm, Gates Hall 104 Prof. Andreas Moshovos from U. Toronto on Tagless Directories for CMPs Lecture 18-2
Physical Layout of Caches for Larger-scale Systems [Figure: cores with private L1 caches connected through an intra-chip switch to a cache built from many tag/data slices] Distributed implementation Lower access latency, lower access power, easier to turn off selectively Challenge: non-uniform access to different slices Lecture 18-3
Dynamic Non-Uniform Access Caches (NUCA) (ASPLOS 02) Motivation: allow cache lines to move close to the requesting CPU Without using a directory architecture Approach: organize cache banks into bank sets Bank set determined by some bits in the address Banks within the set provide cache associativity (searched in series) Cache lines can move within a set to get closer to the requesting CPU Mechanisms: mapping, searching, migration Mapping: simple, fair, shared Searching: incremental, multicast, smart Migration: data moves closer as it is accessed, evicted data moves further away Lecture 18-4
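As a concrete but purely illustrative sketch of these mechanisms, the C++ fragment below models one bank set with incremental (nearest-first) search and one-bank-at-a-time promotion toward the CPU; the bank counts, the swap-based migration policy, and all identifiers are assumptions for illustration, not the D-NUCA paper's actual design.

#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <utility>

constexpr int kBanksPerSet  = 8;   // banks of one bank set, ordered nearest-to-farthest
constexpr int kLinesPerBank = 4;   // tiny for illustration

struct Bank { std::array<uint64_t, kLinesPerBank> tags{}; };

struct NucaBankSet {
    std::array<Bank, kBanksPerSet> banks;   // banks[0] is closest to the CPU

    // Gradual migration: on a hit, swap the line with a victim one bank closer.
    void migrate_toward_cpu(uint64_t tag, int bank) {
        if (bank == 0) return;
        for (auto& near_slot : banks[bank - 1].tags)
            for (auto& far_slot : banks[bank].tags)
                if (far_slot == tag) { std::swap(near_slot, far_slot); return; }
    }

    // Incremental search: probe banks one at a time, nearest first.
    std::optional<int> lookup(uint64_t tag) {
        for (int b = 0; b < kBanksPerSet; ++b)
            for (uint64_t t : banks[b].tags)
                if (t == tag) { migrate_toward_cpu(tag, b); return b; }
        return std::nullopt;   // miss in every bank of the set
    }
};

int main() {
    NucaBankSet set{};
    set.banks[5].tags[0] = 0xABC;                   // line initially lives in a far bank
    for (int i = 0; i < 5; ++i) set.lookup(0xABC);  // repeated accesses pull it closer
    std::printf("hit bank after repeated access: %d\n", *set.lookup(0xABC));  // prints 0
    return 0;
}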
NUCA Challenges for Multi-core (MICRO 04) [Figure: maps of accesses across the cache banks for OLTP (on-line transaction processing) and Ocean (scientific code); darker shading indicates more accesses] Lecture 18-5
Design Choices for Large Distributed Caches: Private Caching [Figure: sixteen private cache slices, one per core] One slice per core Pros: low latency to data Cons: reduced capacity Miss operation: Search other private caches Through snooping or a directory Centralized or distributed directory 2 to 3 hops Alternatively fetch from off-chip Lecture 18-6
Design Choices for Large Distributed Caches: Shared Caching [Figure: sixteen slices forming one shared cache] Slices form a distributed, shared cache Pros: good utilization of capacity Cons: variable & high latency Hit operation: Search other caches Direct access if banked cache Static placement of data in slices Or through a directory Dynamic placement of data in slices Possibility for replication, migration, ... Additional hops Miss operation: Fetch data from off-chip Place in proper slice & update directory if any Lecture 18-7
Victim Replication (ISCA 05) A compromise between shared & private designs Capacity utilization of a shared cache with the low latency of a private one Idea: start with the shared design and use the local slice as a victim cache When evicting from L1, write the data into the local slice Victim allowed to overwrite invalid blocks and other replicas Not allowed to overwrite actively shared blocks that have the local slice as home Implementation: simple modifications to the shared design On a miss, search the local slice before remote slices Directory or banking structure does not change Victim replication does not change the sharing pattern Invalidations handled locally a little differently Lecture 18-8
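A minimal sketch of the replacement decision on an L1 eviction, assuming a simple priority order (an invalid line, then an existing replica, then an unshared locally-homed line) and invented names; the paper's exact policy and structures may differ.

#include <cstdint>
#include <initializer_list>
#include <vector>

enum class LineKind { Invalid, Replica, UnsharedHome, SharedHome };

struct L2Line { uint64_t tag = 0; LineKind kind = LineKind::Invalid; };

// On an L1 eviction in the shared design, try to keep a replica in the *local*
// slice so the next miss to this line hits locally instead of at its home slice.
bool try_victim_replicate(std::vector<L2Line>& local_set, uint64_t victim_tag) {
    // Assumed priority: invalid line, then an existing replica, then an
    // unshared line whose home is this slice.
    for (LineKind k : {LineKind::Invalid, LineKind::Replica, LineKind::UnsharedHome})
        for (auto& line : local_set)
            if (line.kind == k) {
                line.tag  = victim_tag;
                line.kind = LineKind::Replica;
                return true;   // replica installed in the local slice
            }
    // Never displace an actively shared block that has the local slice as home.
    return false;              // drop the victim; the home slice still has the data
}

int main() {
    std::vector<L2Line> set(4);
    set[0].kind = set[1].kind = set[2].kind = LineKind::SharedHome;
    set[3].kind = LineKind::Replica;   // an older replica that may be overwritten
    return try_victim_replicate(set, 0x42) ? 0 : 1;   // returns 0: replica installed
}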
NuRAPID: Decoupling Tags from Data Motivation: provide a mechanism for caching optimizations Controlled replication, communication w/o movement, capacity stealing, ... Basic idea: decouple tag storage from data storage Private tag arrays & shared data arrays Access the tags first, get a pointer to the data The data may be in another slice Lecture 18-9
Using NuRAPID for CMP Optimization Controlled replication No copy on first access to on-chip data; just set the pointer Copy on second access In-situ communication Read-write sharing without copying/moving data Keep data close to the reader, adjust the pointer to perform writes Requires a new cache state (C for communication) Capacity stealing Use a remote slice as a victim cache for a processor's slice Lecture 18-10
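The pointer indirection behind all three optimizations above can be sketched as follows; the slice/frame pointer fields and the array sizes are assumptions for illustration, not NuRAPID's actual layout.

#include <cstdint>
#include <cstdio>
#include <vector>

struct TagEntry {
    uint64_t tag        = 0;
    bool     valid      = false;
    uint16_t data_slice = 0;   // which slice's data array holds the block
    uint32_t data_frame = 0;   // frame index inside that data array
};

struct DataSlice { std::vector<uint64_t> frames; };

// Tag lookup first, then an indirect access to wherever the data actually lives;
// the pointed-to frame may belong to another slice.
uint64_t* lookup(std::vector<TagEntry>& tags, std::vector<DataSlice>& slices,
                 uint64_t addr_tag) {
    for (auto& e : tags)
        if (e.valid && e.tag == addr_tag)
            return &slices[e.data_slice].frames[e.data_frame];
    return nullptr;   // tag miss
}

int main() {
    std::vector<DataSlice> slices(4, DataSlice{std::vector<uint64_t>(8, 0)});
    std::vector<TagEntry>  tags(2);
    // "No copy on first access": the local tag simply points at data in slice 3.
    tags[0].tag = 0xBEEF; tags[0].valid = true;
    tags[0].data_slice = 3; tags[0].data_frame = 5;
    slices[3].frames[5] = 123;
    if (uint64_t* p = lookup(tags, slices, 0xBEEF))
        std::printf("value read through pointer: %llu\n", (unsigned long long)*p);
    return 0;
}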
Checkpointed Processors Lecture 18-11
Reminder: Why Single-thread Performance Matters (Symmetric CMP Speedup for Fixed Area) F=0.999: R=1, 256 simple cores, Speedup=204 F=0.99: R=3, 85 medium cores, Speedup=80 F=0.9: R=28, 9 large cores, Speedup=26.7 For scalable performance, we need capable processor cores Single-thread performance does matter Lecture 18-12
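For reference, these numbers are consistent with the usual fixed-area symmetric-CMP Amdahl model; the perf(R) = sqrt(R) core-scaling assumption and the 256 base-core-equivalent budget are inferred to reconstruct the arithmetic, not stated on the slide.

Speedup(F, N, R) = 1 / ( (1-F)/perf(R) + F*R / (perf(R)*N) ), with perf(R) = sqrt(R) and N = 256 BCEs of area.

Worked example: F = 0.9, R = 28 gives perf(28) ~ 5.29 and 256/28 ~ 9 large cores, so Speedup ~ 1 / (0.1/5.29 + 0.9*28/(5.29*256)) ~ 26.7, matching the figure above.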
Reminder: Out-of-order, Superscalar Processor Lecture 18-13
Memory Bottlenecks for OOO Processors (Karkhanis Study at WMPI 02) Problem #1: ROB structural blockage Miss reaches the head of the ROB, stops other instructions from committing Eventually, back-pressure to the instruction window, front-end, etc. For short misses, a 32 to 64-entry ROB is good enough But for long misses, ROB blockage is a big problem Non-problem #2: data dependencies on the miss If the ROB were infinite, most following instructions are independent For a 1000-cycle miss, ~30 instructions left in the IW after the miss returns Most dependent instructions follow closely after the miss Hence, we don't need a huge instruction window! Problem #3: control dependencies on the miss 25% of long misses feed mispredicted branches Time wasted on mispredicted instructions Lecture 18-14
Long Misses & OOO Resources (SpecInt data with 8-way OOO from Mateo Valero, UPC) [Figure: measured speedups of 1.22X, 0.6X, 1.41X, 1.1X, and 1.86X across the configurations shown] Lecture 18-15
The Difficulty of Large-Scale OOO Processors What is difficult with 4K pending instructions? ROB: must track the state of every pending instruction Regardless of whether the instruction is stalled or completed Register files: need a physical register for every pending instruction Regardless of whether the instruction is stalled or completed Regardless of whether a later instruction overwrites the same architectural register LSU: must buffer all stores until they retire Must buffer many pending loads for disambiguation IW: need a slot for every instruction dependent on a long miss Less of a problem though Ideal solution: the IPC of a design with a 4K-entry ROB at the area, complexity, and power consumption of one with a small ROB Lecture 18-16
Key Insight Why do we need a 4K-entry ROB in order to have 4K instructions pending? Lecture 18-17
Checkpointed Execution An Alternative for In-order Commit Checkpoint: a snapshot of the system state at a certain point in time Architectural state = register state & memory state Overview of in-order commit using checkpoints: Periodically or selectively, take a checkpoint Allow instructions to commit out-of-order Release their ROB entry & their physical registers If there is no exception or misspeculation, we've done the right thing On a misspeculation Restore the last checkpoint taken before the offending instruction started Restart execution from that point On an exception Just like misspeculation, but don't allow OOO commit until you run into the exception again Lecture 18-18
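A minimal sketch of this commit flow, assuming register-only checkpoints and a tiny architectural register file; all names and the simple deque of checkpoints are illustrative, not any particular machine's design.

#include <array>
#include <cstdint>
#include <cstdio>
#include <deque>

constexpr int kArchRegs = 8;   // tiny architectural register file for the example

struct Checkpoint {
    std::array<int64_t, kArchRegs> regs{};   // snapshot of architectural registers
    uint64_t pc = 0;
};

struct Core {
    std::array<int64_t, kArchRegs> regs{};
    uint64_t pc = 0;
    std::deque<Checkpoint> ckpts;            // most designs need only 1 to 4 of these

    void take_checkpoint() {
        Checkpoint cp; cp.regs = regs; cp.pc = pc;
        ckpts.push_back(cp);
    }
    void release_oldest() { ckpts.pop_front(); }   // speculation confirmed: done with it
    void restore_latest() {                        // misspeculation or exception
        regs = ckpts.back().regs;
        pc   = ckpts.back().pc;
        ckpts.pop_back();
    }
};

int main() {
    Core c;
    c.regs[1] = 10; c.pc = 0x400;
    c.take_checkpoint();
    c.regs[1] = 99; c.pc = 0x480;   // younger instructions commit out of order meanwhile
    c.restore_latest();             // a branch before them turns out mispredicted
    std::printf("r1=%lld pc=0x%llx\n", (long long)c.regs[1], (unsigned long long)c.pc);
    return 0;
}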
Checkpoint Implementation How many checkpoints are needed? Most applications explored so far require 1 to 4 Instruction tracking in the pipeline: an extra field that identifies the # of the last checkpoint taken before the instruction was fetched Checkpoint table One entry per checkpoint Entries: valid, PC, (application-related info) Checkpoint of register state Keep a safe copy of the architectural registers at the time of the checkpoint For an in-order core, add an extra RF per active checkpoint Issue: making fast copies between multiple RFs For an OOO core, exploit the larger physical RF Use some physical registers as checkpoint registers Must have copies of any mapping tables though Lecture 18-19
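For the OOO-core variant, here is a hedged sketch of a checkpoint table that snapshots the rename map (so physical registers named by a live checkpoint are simply not freed) instead of copying register values; the four-entry table and all field names are assumptions.

#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kArchRegs = 8;
using RenameMap = std::array<uint8_t, kArchRegs>;   // architectural reg -> physical reg

struct CheckpointEntry {
    bool      valid = false;
    uint64_t  pc    = 0;
    RenameMap map{};          // snapshot of the mapping table at checkpoint time
};

struct CheckpointTable {
    std::array<CheckpointEntry, 4> entries;   // "1 to 4 checkpoints" per the slide

    // Returns the checkpoint id, or -1 if we ran out (stall or fall back).
    int take(uint64_t pc, const RenameMap& current_map) {
        for (int i = 0; i < (int)entries.size(); ++i)
            if (!entries[i].valid) {
                entries[i].valid = true;
                entries[i].pc    = pc;
                entries[i].map   = current_map;   // phys regs named here stay allocated
                return i;
            }
        return -1;
    }
    RenameMap restore(int id) { entries[id].valid = false; return entries[id].map; }
    void      release(int id) { entries[id].valid = false; }   // checkpoint no longer needed
};

int main() {
    CheckpointTable table;
    RenameMap map{};                        // say arch reg i maps to phys reg i
    for (int i = 0; i < kArchRegs; ++i) map[i] = (uint8_t)i;
    int id = table.take(0x400, map);
    map[3] = 17;                            // renaming continues past the checkpoint
    map = table.restore(id);                // misspeculation: recover the old mapping
    std::printf("arch r3 maps to p%d\n", (int)map[3]);   // prints p3
    return 0;
}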
Checkpoint Implementation (cont) Checkpoint of memory state Impractical to save/restore all of physical memory Track stores since the checkpoint and provide a mechanism to undo them Option 1: Buffer the address and new value & use them to update memory when the checkpoint is released Option 2: Log the address and old value & use the log to undo stores if the checkpoint is restored What are the advantages of each scheme? How many entries do you need? What happens if you run out of checkpoints? Or out of registers? Or out of store buffer entries? When would you take a checkpoint? Lecture 18-20
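A small sketch of Option 2 (an undo log of old values), with invented structure names and word-granularity logging for brevity; a real design would log at cache-line granularity.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

using Memory = std::unordered_map<uint64_t, uint64_t>;   // sparse toy memory

struct UndoLog {
    std::vector<std::pair<uint64_t, uint64_t>> entries;   // (address, old value)

    void on_store(Memory& mem, uint64_t addr, uint64_t value) {
        entries.push_back({addr, mem[addr]});   // save the old value before overwriting
        mem[addr] = value;
    }
    void release() { entries.clear(); }         // checkpoint released: forget the log
    void restore(Memory& mem) {                 // checkpoint restored: undo newest-first
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            mem[it->first] = it->second;
        entries.clear();
    }
};

int main() {
    Memory mem{{0x100, 7}};
    UndoLog log;
    log.on_store(mem, 0x100, 42);
    log.on_store(mem, 0x108, 1);
    log.restore(mem);                           // misspeculation: roll memory back
    std::printf("mem[0x100]=%llu\n", (unsigned long long)mem[0x100]);   // prints 7
    return 0;
}

One way to think about the slide's question: with Option 2, releasing a checkpoint is cheap (discard the log) but restoring requires walking it backwards, whereas Option 1 makes restore cheap but must drain the buffered new values into memory on release.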
Application #1: RunAhead Execution to Hide Long Misses Runahead: a technique to obtain the memory-level parallelism benefits of a large ROB without building it When the oldest instruction in the ROB is an L2 miss: Checkpoint architectural state and enter runahead mode In runahead mode: Instructions are speculatively pre-executed The purpose of pre-execution is to discover other misses The processor does not stall due to misses Runahead mode ends when the original miss returns The checkpoint is restored and normal execution resumes Lecture 18-21
Runahead Example [Timeline figure: with a small ROB, the program computes, stalls for Miss 1, computes, then stalls again for Miss 2, so the two misses are serviced serially. With runahead, Load 2's miss is discovered and serviced during the runahead period under Miss 1, so on re-execution both loads hit and the cycles that would have been spent stalling on Miss 2 are saved.] Lecture 18-22
Benefits of Runahead Execution Pre-executed loads and stores independent of the L2-miss instructions generate very accurate data prefetches: For both regular and irregular access patterns Instructions on the predicted program path are prefetched into the instruction cache Hardware prefetcher and branch predictor tables are trained using future access information Lecture 18-23
Runahead Execution Implementation Entry into runahead mode Checkpoint architectural register state Instruction processing in runahead mode Speculative execution (all results discarded at the end) Instructions independent from the miss proceed as usual Dependent instructions are removed from the pipeline, marking their results INValid One extra INV bit per register entry and store buffer entry INV values are not used for prefetching/branch resolution Stores write their results into a special store buffer (runahead cache) Exit from runahead mode Restore architectural register state from the checkpoint Lecture 18-24
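An illustrative model of the INV mechanism, assuming a toy two-source instruction format and invented names; it only shows how INV propagates and how useful prefetches are collected, not a full runahead pipeline.

#include <array>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

struct Reg  { int64_t value = 0; bool inv = false; };    // one extra INV bit per register
struct Inst { int dst, src1, src2; bool is_load; uint64_t addr; };

struct RunaheadCore {
    std::array<Reg, 16> regs{};
    std::array<Reg, 16> ckpt{};                           // architectural checkpoint
    std::unordered_map<uint64_t, int64_t> runahead_cache; // runahead stores go here only
    std::unordered_set<uint64_t> prefetches;              // misses discovered in runahead

    void enter_runahead() { ckpt = regs; }                // checkpoint register state
    void exit_runahead()  { regs = ckpt; runahead_cache.clear(); }  // discard all results

    void pre_execute(const Inst& in, bool load_misses) {
        bool src_inv = regs[in.src1].inv || regs[in.src2].inv;
        if (in.is_load) {
            if (!src_inv && load_misses) prefetches.insert(in.addr);  // useful prefetch
            regs[in.dst].inv = src_inv || load_misses;    // miss-dependent result is INV
        } else {
            regs[in.dst].inv = src_inv;                   // INV propagates through ALU ops
        }
        // INV values never feed prefetch generation or branch resolution.
    }
};

int main() {
    RunaheadCore core;
    core.enter_runahead();
    core.pre_execute({1, 0, 0, true, 0x1000}, true);   // miss: r1 marked INV, 0x1000 prefetched
    core.pre_execute({2, 1, 0, false, 0}, false);      // depends on r1: r2 inherits INV
    core.pre_execute({3, 0, 0, true, 0x2000}, true);   // independent miss: another prefetch
    core.exit_runahead();
    return (int)core.prefetches.size();                // 2 prefetches were generated
}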
Runahead Execution vs. Large Windows Lecture 18-25
Challenges with Runahead Execution Short runahead periods E.g., an L2 miss encountered 10 cycles before the prefetcher returns the value Transition overheads without significant benefits Overlapping runahead periods Due to multiple misses within a period Useless runahead periods Going into runahead without generating any new misses Expensive pre-execution E.g., of FP instructions Lecture 18-26
Application #2: Continual Flow Execution If most instructions are independent of the miss, why throw away the pre-execution results? Continual flow execution Take a checkpoint when the miss is encountered Keep executing instructions Save instructions dependent on the miss in a side buffer Execute and commit all independent instructions When the miss returns, execute all saved instructions Reintegrate the results in the pipeline (register maps etc.) This can be tricky as the architectural register may have been remapped What if a saved instruction raises an exception? Lecture 18-27
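A rough sketch of the deferral idea, assuming ready source values are captured when an instruction enters the side buffer (which is also how the WAR issue on a later slide is handled); a real slice buffer must additionally handle instructions that depend on other deferred results, which is omitted here, and all names are invented.

#include <cstdint>
#include <deque>
#include <vector>

struct DeferredInst {
    int     dst;             // destination register
    int64_t src_val[2];      // source values that were already ready, captured on deferral
    bool    src_is_miss[2];  // true if that source is the still-outstanding miss
};

struct SliceBuffer {
    std::deque<DeferredInst> q;

    void defer(const DeferredInst& d) { q.push_back(d); }   // park the dependent instruction

    // When the miss data arrives, drain and execute the deferred slice in order.
    void drain(std::vector<int64_t>& regs, int64_t miss_value) {
        while (!q.empty()) {
            DeferredInst d = q.front(); q.pop_front();
            int64_t a = d.src_is_miss[0] ? miss_value : d.src_val[0];
            int64_t b = d.src_is_miss[1] ? miss_value : d.src_val[1];
            regs[d.dst] = a + b;     // '+' stands in for whatever the real operation is
        }
    }
};

int main() {
    std::vector<int64_t> regs(8, 0);
    SliceBuffer buf;
    buf.defer({/*dst=*/1, {5, 0}, {false, true}});   // r1 = 5 + (value of the pending load)
    buf.drain(regs, 37);                             // miss returns with 37: r1 becomes 42
    return (int)regs[1];
}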
Continual Flow Pipelines: HW & Performance Lecture 18-28
Implications of Runahead & Continual Flow Runahead Discards dependent instructions Speculatively executes independent instructions When the miss returns, re-executes dependent & independent instructions Continual Flow Pipeline Saves dependent instructions Executes independent instructions When the miss returns, executes only the saved dependent instructions Assessment Both allow overlapping of misses to break past window limits Both limited by branch prediction accuracy on unresolved branches Continual Flow Pipeline sounds even more appealing But may not be worthwhile (vs. Runahead), and it has memory-ordering issues Lecture 18-29
Other Applications of Checkpointed Processors Early release of physical registers Early release of load-store queue entries Recovery from transient faults On error detection, restore a checkpoint & re-execute If error is detected again, restore an older checkpoint Speculative execution Thread-level speculation Transactional memory What are the differences here? What may need to change in the checkpoint implementation? Lecture 18-30
Sun Rock A checkpointed processor that supports multiple applications Continual flow execution Continual flow with overlapping on reintegration Transactional memory Rock chip A CMP with 16 cores and a shared L2 cache 4 cores share an instruction fetch unit ($I L1 cache), two $D L1 caches, and two floating point units Why does this make sense? Each core is dual-threaded Lecture 18-31
Rock Pipeline Instructions can get out of order in different execution pipelines Lecture 18-32
Sun Rock Checkpointing Support Each register file has a working and an architectural copy ARF & WRF allow for one checkpoint per thread Multi-level implementation to avoid the cost of register windows 0-cycle overhead to take a checkpoint Each register has an NT (not there) bit Similar to the INValid bit in RunAhead Deferred instruction queue (DQ) Buffering for instructions with NT operands 32-entry store buffer per core To buffer stores until a checkpoint is retired Lecture 18-33
Continual Flow Execution with Rock (aka Execute Ahead) Take a checkpoint on a long-latency instruction $D L1 miss (in-order core), TLB miss, divides, ... Execute Ahead mode Buffer instructions with NT arguments in the DQ If the buffer is full, wait until the long-latency instruction returns, restore the checkpoint and continue from there (i.e., runahead) Complete other instructions as usual Once the long-latency instruction finishes Execute the deferred instructions from the DQ If all deferred instructions complete successfully, execution continues Otherwise, "fail": execution restarts from the checkpoint Lecture 18-34
Simultaneous Speculative Threading with Rock Use both HW threads in the core for a single SW thread Operate as in EA mode until the long-latency instruction completes One HW (behind) thread executes the deferred instructions One HW (ahead) thread speculatively executes the rest of the program Parallelizes the execution of deferred instructions The ahead thread can start deferring instructions to its own DQ If the behind thread fails, a single thread restarts from the checkpoint If the behind thread finishes, it then starts executing deferred instructions from the other DQ Lecture 18-35
Tricky Issues with Checkpointed Execution WAR and WAW register hazards when replaying from the DQ WAR: copy register values on entry to the DQ WAW: allow data forwarding but disallow register file overwrite Loads replayed out of order May violate the consistency model (total store order) Add a bit per cache line (s-bit), set by loads during EA or replay If a cache line with its s-bit set is evicted or replaced, speculation fails and execution resumes from the checkpoint Full store buffer Speculation fails and execution resumes from the checkpoint Lecture 18-36
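A tiny sketch of the s-bit bookkeeping described above, with a set of speculatively-read line addresses standing in for per-line bits; the names are invented and the policy is paraphrased from the slide.

#include <cstdint>
#include <unordered_set>

struct SpeculativeLoadTracker {
    std::unordered_set<uint64_t> s_bits;   // line addresses read during EA or replay
    bool failed = false;

    void on_speculative_load(uint64_t line_addr) { s_bits.insert(line_addr); }

    // Called when a line is evicted or invalidated by another core's store.
    void on_line_lost(uint64_t line_addr) {
        if (s_bits.count(line_addr)) failed = true;   // TSO may have been violated
    }

    void reset() { s_bits.clear(); failed = false; }  // on commit, or after restoring the checkpoint
};

int main() {
    SpeculativeLoadTracker t;
    t.on_speculative_load(0x80);
    t.on_line_lost(0x80);        // a speculatively-read line was lost
    return t.failed ? 1 : 0;     // 1: speculation fails, resume from the checkpoint
}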
Transactional Memory (TM) with Rock Transactions execute speculatively as an EA thread The start of a transaction is treated as a long-latency operation Use the checkpoint resources to allow tracking of TM state The checkpoint is used to enable roll back if the transaction fails S-bits are used to track the cache lines read by transactions Can detect conflicting stores by other threads The store buffer tracks the new values produced by the transaction Its store addresses are also used to track conflicting loads/stores Lecture 18-37
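A hedged sketch of how these pieces compose into a small HTM: the checkpoint provides rollback (not modeled here), the s-bits act as the read set, and the store buffer acts as the write set; this paraphrases the slide and is not Sun's implementation.

#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using Memory = std::unordered_map<uint64_t, uint64_t>;

struct RockLikeTxn {
    std::unordered_set<uint64_t> read_set;      // tracked in hardware via s-bits
    Memory                       write_buffer;  // store-buffer entries: the write set
    bool aborted = false;

    uint64_t load(Memory& mem, uint64_t addr) {
        read_set.insert(addr);
        auto it = write_buffer.find(addr);      // read our own speculative stores first
        return it != write_buffer.end() ? it->second : mem[addr];
    }
    void store(uint64_t addr, uint64_t value) { write_buffer[addr] = value; }

    // A store by another thread to something we read or buffered is a conflict.
    void remote_store(uint64_t addr) {
        if (read_set.count(addr) || write_buffer.count(addr)) aborted = true;
    }
    bool commit(Memory& mem) {
        if (aborted) return false;              // roll back to the checkpoint instead
        for (auto& kv : write_buffer) mem[kv.first] = kv.second;   // drain the store buffer
        return true;
    }
};

int main() {
    Memory mem{{0x10, 1}};
    RockLikeTxn txn;                 // transaction start: checkpoint taken (not modeled)
    uint64_t v = txn.load(mem, 0x10);
    txn.store(0x10, v + 1);
    txn.remote_store(0x10);          // another thread writes a line we read: conflict
    return txn.commit(mem) ? 0 : 1;  // 1: the transaction aborts
}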
Rock Performance Self-comparison Unclear how it compares to an OOO core though Lecture 18-38