History Table. Latest

Slide 1: Lecture 15, Prefetching
Winter 2019, Prof. Ronald Dreslinski
http://
[Figure: Latest History Table (A0, A1), Correlating Prediction Table (A0,A1; A3), Prefetch A3]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.

Slide 2: The memory wall
[Figure: processor vs. memory performance. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
- Today: 1 mem access ≈ 500 arithmetic ops
- How to reduce memory stalls for existing SW?

Slide 3: Conventional approach #1: Avoid main memory accesses
- Cache hierarchies: trade off capacity for speed
- [Figure: CPU, 64K cache (2 clk), 4M cache (20 clk), main memory (200 clk)]
- Add more cache levels?
  - Diminishing locality returns
  - No help for shared data in MPs

Slide 4: Conventional approach #2: Hide memory latency
- Out-of-order execution: overlap compute & mem stalls
- [Figure: execution timelines, in order vs. OoO, showing compute and mem stall]
- Expand OoO instruction window?
  - Issue & load-store logic hard to scale
  - No help for dependent instructions

Slide 5: Challenges of server apps
- Frequent sharing & synchronization
- Many linked-data structures, e.g., linked list, B+ tree, ...
- Dependent miss chains [Ranganathan 98]
- Low memory-level parallelism [Chou 04]
- 50-66% time stalled on memory [Trancoso 97][Barroso 98][Ailamaki 99]
- [Figure: execution timelines, today vs. goal, showing compute and mem stall]
- Our goals:
  - Read misses: fetch earlier & in parallel
  - Write misses: never stall

Slide 6: What is Prefetching?
- Fetch memory ahead of time
- Targets compulsory, capacity, & coherence misses
- Big challenges:
  1. Knowing what to fetch: fetching useless info wastes valuable resources
  2. Knowing when to fetch it: fetching too early clutters storage; fetching too late defeats the purpose of "pre"-fetching

Slide 7: Software Prefetching
- Compiler/programmer places prefetch instructions
- Requires ISA support
  - why not use regular loads?
  - found in recent ISAs such as SPARC V-9
- Prefetch into
  - register (binding)
  - caches (non-binding)

Slide 8: Software Prefetching (Cont.)
e.g.,
    for (i = 1; i < rows; i++)
        for (j = 1; j < columns; j++) {
            prefetch(&x[i+1][j]);   /* request the element one row ahead of its use */
            sum = sum + x[i][j];
        }

Slide 9: Hardware Prefetching
- What to prefetch?
  - one block spatially ahead?
  - use address predictors → works for regular patterns (x, x+8, x+16, ...)
- When to prefetch?
  - on every reference
  - on every miss
  - when prior prefetched data is referenced
  - upon last processor reference
  - use more complicated rate-matching mechanisms
- Where to put prefetched data?
  - auxiliary buffers
  - caches

Slide 10: Generalized Access Pattern Prefetchers
- How do you prefetch
  1. Heap data structures?
  2. Indirect array accesses?
  3. Generalized memory access patterns?
- Taxonomy of approaches:
  - Spatial prefetchers
  - Address-correlating prefetchers
  - Precomputation prefetchers

Slide 11: Spatial Locality and Sequential Prefetching
- Works well for the I-cache
  - Instruction fetching tends to access memory sequentially
- Doesn't work very well for the D-cache
  - More irregular access patterns
  - Regular patterns may have non-unit stride (e.g., matrix code)
- Relatively easy to implement
  - Large cache block sizes already have the effect of prefetching
  - After loading one cache line, start loading the next line automatically if the line is not in the cache and the bus is not busy

Slide 12: PC-based Stride Prefetchers
- Array/stride correlated to static load instruction [Baer 91]
- Reference Prediction Table, indexed by load instruction PC
  - Fields: Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags
- Record load PC, last address, & stride between the last two addresses
- On a load → compute the distance between the current & last address
  - if the new distance matches the old stride → found a pattern, go ahead and prefetch current addr + stride
  - update last addr and last stride for the next lookup
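The table update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the design from [Baer 91]: the class name `StridePrefetcher`, the unbounded dictionary standing in for a tagged, finite table, and the omission of the flags field are all assumptions made for brevity.

```python
class StridePrefetcher:
    """Toy reference prediction table indexed by load PC (hypothetical model)."""

    def __init__(self):
        # pc -> (last address, last stride); a real table is finite and tagged.
        self.table = {}

    def access(self, pc, addr):
        """Record one load; return the address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Prefetch only when the new distance matches the previous stride.
            if stride == last_stride and stride != 0:
                prefetch = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, None)
        return prefetch


rpt = StridePrefetcher()
predictions = [rpt.access(pc=0x400, addr=a) for a in (100, 108, 116, 124)]
```

The first two loads only train the entry; from the third load on, the stride of 8 repeats and the sketch issues one prefetch per access.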

Slide 13: Stream Buffers [Jouppi]
- Each stream buffer holds one stream of sequentially prefetched cache lines
- No cache pollution
- On a load miss, check the head of all stream buffers for an address match
  - if hit, pop the entry from the FIFO and update the cache with the data
  - if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
- Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
- Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams
- Indirect array accesses (e.g., A[B[i]])?
- [Figure: several FIFOs between the DCache and the memory interface]
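The miss-handling policy above can be sketched as a small Python model. It is an illustration under stated assumptions, not Jouppi's hardware: addresses are cache-line numbers, the class name `StreamBuffers` and its parameters are hypothetical, and each buffer is refilled immediately instead of waiting for a free bus.

```python
from collections import deque


class StreamBuffers:
    """Toy model of unit-stride stream buffers with LRU recycling."""

    def __init__(self, num_buffers=2, depth=4):
        self.num_buffers = num_buffers
        self.depth = depth
        self.buffers = []  # least recently used first; each is a deque of line numbers

    def miss(self, line):
        """Handle a cache miss on `line`; return True if a buffer head matched."""
        for i, buf in enumerate(self.buffers):
            if buf[0] == line:
                next_line = buf[-1] + 1
                buf.popleft()            # the entry moves into the cache
                buf.append(next_line)    # top off with the next sequential line
                self.buffers.append(self.buffers.pop(i))  # now most recently used
                return True
        if len(self.buffers) == self.num_buffers:
            self.buffers.pop(0)          # recycle the LRU stream buffer
        # Allocate a new stream starting at the line after the miss.
        self.buffers.append(deque(range(line + 1, line + 1 + self.depth)))
        return False


sb = StreamBuffers()
hits = [sb.miss(line) for line in (10, 11, 12, 50, 13)]
```

Only the head of each FIFO is compared, so a miss that skips ahead within a stream (for example line 15 right after line 10) would allocate a new buffer in this sketch.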

Slide 14: Global History Buffer (GHB) [Nesbit 04]
- Holds miss address history in FIFO order
- Linked lists within the GHB connect related addresses
  - Same static load (PC/DC)
  - Same global miss address (G/AC)
- Linked list walk is short compared with L2 miss latency
- [Figure: Index Table, indexed by load PC, pointing into the Global History Buffer, a FIFO of miss addresses]

Slide 15: GHB Deltas
[Figure: miss address stream, delta stream, and Markov graph, with width, depth, and hybrid prefetching examples (Key => Current => Prefetches)]

Slide 16: GHB Stride Prefetching
- GHB-Stride uses the PC to access the index table
- The linked lists contain the local history of each load
- Compare the last two local strides. If they are the same, prefetch n + s, n + 2s, ..., n + ds
- [Figure: Index Table (PC, pointer) and Global History Buffer (miss address, pointer) with a head pointer; the last two strides of one load's linked list are compared]

Slide 17: GHB Delta Correlation (PC/DC)
- Form delta correlations within each load's local history
- For example, consider the local miss address stream:
  [Table: Addresses, Deltas, Correlation, Prefetch Predictions; delta pairs (1,1), (1,62), (62,1)]
- Best results among data prefetchers for SPEC2K [Gracia Pérez 04]
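Delta correlation can be illustrated with a short Python sketch. The function name `delta_correlation_prefetch`, the `degree` parameter, and the address stream used below are assumptions; the stream was chosen only because it produces the delta pairs (1,1), (1,62), and (62,1) that appear on the slide. A plain list stands in for one load's linked list inside the GHB.

```python
def delta_correlation_prefetch(addresses, degree=3):
    """Predict up to `degree` addresses from one load's miss address history."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if len(deltas) < 3:
        return []
    key = (deltas[-2], deltas[-1])  # the two most recent deltas
    # Search backwards for the most recent earlier occurrence of the same pair.
    for i in range(len(deltas) - 3, -1, -1):
        if (deltas[i], deltas[i + 1]) == key:
            # Replay the deltas that followed the earlier occurrence.
            predictions = []
            addr = addresses[-1]
            for d in deltas[i + 2 : i + 2 + degree]:
                addr += d
                predictions.append(addr)
            return predictions
    return []


stream = [0, 1, 2, 64, 65, 66, 128, 129]   # hypothetical local miss stream
predicted = delta_correlation_prefetch(stream)
```

Here the deltas are 1, 1, 62, 1, 1, 62, 1; the final pair (62, 1) was last followed by 1, 62, 1, so the sketch predicts 130, 192, and 193.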

Slide 18: Spatial Correlation
- Repetitive spatial relationships between accesses
  - Irregular layout → non-strided
  - Sparse → can't capture with cache blocks
  - But repetitive → predict to improve memory-level parallelism
- Not to be confused with spatial locality: patterns may repeat over large (e.g., a few kB) regions

Slide 19: Example: Spatial correlation [Somogyi 06]
- [Figure: an 8kB database page in memory, with page header, tuple data, and tuple slot index]
- Large-scale spatial access patterns
- Pattern is a function of the program

Slide 20: Why exploit spatial correlation?
- [Figure: normalized miss rate for 64B, 512B, and 8kB blocks, large blocks vs. perfect predictor, for OLTP, DSS, Web, and Sci. workloads; panels: L1 (64kB) and L2 (8MB). Perfect predictor = one miss per generation]
- Large blocks → prohibitive L1 miss rate → bandwidth inefficient
- Spatial correlation → eliminate misses

Slide 21: SMS Operation Summary
- Spatial patterns are stored in a pattern history table
- [Figure, over time:]
  1. Observe: PC1 ld A+4, PC2 ld A, PC3 ld A+3, evict A
  2. Store the spatial pattern
  3. Predict: PC1 ld B+4 triggers the prediction; PC2 ld B and PC3 ld B+3 become cache hits
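The observe, store, and predict steps can be sketched in Python. This is a simplified model under stated assumptions, not the SMS hardware: addresses are block numbers, regions are 8 blocks wide, patterns are keyed by the trigger access's (PC, offset) pair, and the class name `SpatialPatternPredictor` and its unbounded dictionaries are hypothetical.

```python
class SpatialPatternPredictor:
    """Toy spatial pattern predictor: learn per-region footprints, replay them."""

    def __init__(self, region_blocks=8):
        self.region_blocks = region_blocks
        self.active = {}  # region base -> (trigger key, set of offsets touched)
        self.pht = {}     # (pc, trigger offset) -> frozenset of offsets

    def access(self, pc, block):
        """Record an access; on a trigger access return the predicted blocks."""
        base = block - block % self.region_blocks
        offset = block - base
        if base in self.active:
            # Region already being observed: extend its footprint.
            self.active[base][1].add(offset)
            return []
        # First access to the region starts a generation and may predict.
        key = (pc, offset)
        self.active[base] = (key, {offset})
        pattern = self.pht.get(key, frozenset())
        return sorted(base + o for o in pattern if o != offset)

    def evict(self, block):
        """An eviction ends the region's generation and stores its pattern."""
        base = block - block % self.region_blocks
        if base in self.active:
            key, offsets = self.active.pop(base)
            self.pht[key] = frozenset(offsets)


sms = SpatialPatternPredictor()
A, B = 0, 16                      # two region bases, as in the slide's example
sms.access(pc=1, block=A + 4)
sms.access(pc=2, block=A)
sms.access(pc=3, block=A + 3)
sms.evict(A)
predicted = sms.access(pc=1, block=B + 4)
```

After the pattern from region A is stored, the trigger access by PC 1 at B+4 predicts blocks B and B+3, which the later loads would then hit.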

Slide 22: Correlation-Based Prefetching [Charney 96]
- Consider the following history of load addresses emitted by a processor:
  A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
- After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
- [Figure: Markov model over addresses A, B, C, D, E, F]
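The question on this slide can be answered by counting successors in the history. The Python sketch below builds those counts for the address sequence given above; the names `successors` and `likely_next` are illustrative, and raw counts are used here in place of the normalized transition probabilities a Markov model would carry.

```python
from collections import Counter, defaultdict

# Load address history from the slide.
history = "A B C D C E A C F F E A A B C D E A B C D C".split()

# successors[x][y] = number of times y was referenced right after x.
successors = defaultdict(Counter)
for current, following in zip(history, history[1:]):
    successors[current][following] += 1


def likely_next(addr, n=2):
    """Return up to n addresses most often seen right after `addr`."""
    return [a for a, _ in successors[addr].most_common(n)]
```

In this history E is always followed by A, and A is followed by B three times out of five, so both make good prefetch triggers; F, by contrast, has no dominant successor.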

Slide 23: Correlation-Based Prefetching
- Table indexed by load data address
  - Fields: Load Data Addr (tag) | Prefetch Candidate 1 | Confidence | ... | Prefetch Candidate N | Confidence
- Track the likely next addresses after seeing a particular address
- Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)
- Prefetch accuracy can be improved by using a longer history
  - Decide which address to prefetch next by looking at the last K load addresses instead of just the current one
  - e.g., index with the XOR of the data addresses from the last K loads
  - Using the history of a couple of loads can increase accuracy dramatically
- This technique can also be applied to just the load miss stream

Slide 24: Example: Markov Prefetchers [Joseph 97]
- Correlate subsequent cache misses
- Trigger a prefetch on a miss
- Width prefetching: predict & prefetch four candidates
  - predicting only one results in low coverage!
- Prefetch into a buffer

Slide 25: Tag-Correlating Prefetchers [Kaxiras 04]
- Correlation-based prefetching: tables are too big
  - they grow with the data working set size
- Much similarity in block addresses mapping to sets
  - when marching through arrays, tags across sets are identical!
  - save space in correlation tables by correlating tags only (not full addresses)

Slide 26: Revisit: Global History Buffer (GHB) [Nesbit 04]
- Holds miss address history in FIFO order
- Linked lists within the GHB connect related addresses
  - Same static load (PC/DC)
  - Same global miss address (G/AC)
- Linked list walk is short compared with L2 miss latency
- [Figure: Index Table, indexed by miss address, pointing into the Global History Buffer, a FIFO of miss addresses]

Slide 27: GHB (G/AC) - Example
[Figure: miss address stream; Index Table (miss address, pointer); Global History Buffer (miss address, pointer) with head pointer; Key => Current => Prefetches]
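A G/AC lookup can be sketched in Python as follows. This is a rough model under stated assumptions, not the organization from [Nesbit 04]: the buffer is an unbounded list with no FIFO wraparound, and the class name `GlobalHistoryBuffer` and the `width` and `depth` parameters are illustrative. Each entry stores a miss address plus a pointer to the previous entry with the same address, and the index table points at the most recent entry for each address.

```python
class GlobalHistoryBuffer:
    """Toy GHB with global address correlation (G/AC)."""

    def __init__(self):
        self.entries = []      # (miss address, index of previous same-address entry)
        self.index_table = {}  # miss address -> index of its most recent entry

    def miss(self, addr, width=2, depth=2):
        """Record a miss and return prefetch candidates, most recent first."""
        prefetches = []
        ptr = self.index_table.get(addr)
        walked = 0
        # Walk the linked list of earlier occurrences of this miss address.
        while ptr is not None and walked < width:
            # The misses that followed an earlier occurrence are the candidates.
            for candidate, _ in self.entries[ptr + 1 : ptr + 1 + depth]:
                if candidate != addr and candidate not in prefetches:
                    prefetches.append(candidate)
            ptr = self.entries[ptr][1]
            walked += 1
        # Append the new miss at the head and link it to its previous occurrence.
        self.entries.append((addr, self.index_table.get(addr)))
        self.index_table[addr] = len(self.entries) - 1
        return prefetches


ghb = GlobalHistoryBuffer()
results = [ghb.miss(a) for a in (10, 20, 30, 10, 40, 10)]
```

The second miss on address 10 finds one earlier occurrence followed by 20 and 30. The third finds two occurrences, so it returns 40 first and then 20 and 30.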

Slide 28: Linked-Data Prefetchers
- When traversing a linked structure:
  - Learn/record load-to-load dependence
  - Get ahead of the processor by traversing the structure in an FSM
  - FSM gets ahead of the processor by skipping computation
- Similar proposals with SW help (e.g., helper/scout threads)
Example:
    while (p != NULL) {
        if (p->key == MATCH)
            p->val++;
        p = p->next;
    }

Slide 29: Linked Data Structure Access
[Figure: a chain of nodes, each holding a next pointer at a fixed offset]

Slide 30: Detecting Recursive Accesses
- [Figure: nodes a, b, and c, each with fields at offsets +0, +4, +8, +12 and a next pointer at +14]
- Producer of b: r_src holds a; LOAD r_dest, r_src(14)
- Consumer of b / producer of c: r_src holds b; LOAD r_dest, r_src(14)
- The producer's r_dest and the consumer's r_src hold the same value
- Example: p = p->next;

Slide 31: Roth, Moshovos, Sohi (HW) [Roth 98]
- Producer of b (r_src holds a): PC-A: LOAD r_dest, r_src(14)
- Consumer of b / producer of c (r_src holds b): PC-B: LOAD r_dest, r_src(14)
- The producer's r_dest and the consumer's r_src hold the same value
- Potential Producer Window
  Memory Value Loaded | Producer Instruction Address
  b                   | PC-A
- Correlation Table
  Producer Instruction Address | Consumer Instruction Address | Consumer Instruction Template
  PC-A                         | PC-B                         | LOAD r,r(14)

Slide 32: Runahead Execution [Mutlu 03]
- Memory-level parallelism of a large window without building it!
- When the oldest instruction is an L2 miss:
  - Checkpoint state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - To discover other L2 misses
  - Processor continues to run
- Runahead mode ends when the original L2 miss returns
  - Checkpoint is restored and normal execution resumes

Slide 33: Runahead Example
[Figure: three execution timelines]
- Perfect caches: Load 1 hit, compute, Load 2 hit, compute
- Small window: Load 1 miss, compute, stall, Load 2 miss, compute, stall
- Runahead: Load 1 miss, compute, runahead (Load 2 miss), Load 1 hit, Load 2 hit, compute; saved cycles

Slide 34: Benefits of Runahead Execution
- Avoid stalling during an L2 cache miss!
- Pre-executed loads/stores generate accurate prefetches
  - both regular and irregular access patterns
- Instructions on the predicted path are prefetched into the i-cache and L2
- Hardware prefetcher and branch predictor are trained using future access information

Slide 35: Improving Cache Performance: Summary
- Miss rate
  - large block size
  - higher associativity
  - victim caches
  - skewed-/pseudo-associativity
  - hardware/software prefetching
  - compiler optimizations
- Miss penalty
  - give priority to read misses over writes/writebacks
  - subblock placement
  - early restart and critical word first
  - non-blocking caches
  - multi-level caches
- Hit time (difficult?)
  - small and simple caches
  - avoiding translation during L1 indexing (later)
  - pipelining writes for fast write hits
  - subblock placement for fast write hits in write-through caches
