Prefetching. Fall 2007, Prof. Thomas Wenisch.

1 Prefetching
  Fall 2007, Prof. Thomas Wenisch
  [Figure: History Table and Correlating Prediction Table (entries A0, A1, A3) issuing "Prefetch A3"]
  Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.

2 Announcements
  HW #5 (due 11/16): will be posted by Wednesday
  Milestone 2 (due 11/14)

3 Readings
  For Wednesday:
    H&P Appendix C.4-C.6
    Jacob & Mudge, "Virtual Memory in Contemporary Microprocessors"

4 Latency vs. Bandwidth
  Latency can be handled by
    hiding/tolerating techniques, e.g., parallelism (may increase bandwidth demand)
    reducing techniques
  Ultimately limited by physics

5 Latency vs. Bandwidth (Cont.)
  Bandwidth can be handled by
    banking/interleaving/multiporting
    wider buses
    hierarchies (multiple levels)
  What happens if average demand is not supplied?
    bursts are smoothed by queues
    if burst is much larger than average => long queue
    eventually increases delay to unacceptable levels

6 The memory wall
  [Figure: processor vs. memory performance. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
  Today: 1 mem access ≈ 500 arithmetic ops
  How to reduce memory stalls for existing SW?

7 Conventional approach #1: Avoid main memory accesses
  Cache hierarchies: trade off capacity for speed
  [Figure: CPU, 64K cache (2 clk), 4M cache (20 clk), main memory (200 clk), with write data and CPU data paths]
  Add more cache levels?
    Diminishing locality returns
    No help for shared data in MPs

8 Conventional approach #2: Hide memory latency
  Out-of-order execution: overlap compute & mem stalls
  [Figure: execution timelines for in-order vs. OoO, split into compute and mem stall]
  Expand OoO instruction window?
    Issue & load-store logic hard to scale
    No help for dependent instructions

9 Challenges of server apps
  Defy conventional approaches:
    Frequent sharing
    Many linked data structures, e.g., linked list, B+ tree, ...
    Lead to dependent miss chains
  50-66% time stalled on memory [Trancoso 97][Barroso 98][Ailamaki 99]
    Read stalls: this talk
    Write stalls: ISCA 07
  [Figure: execution timelines, "Today" vs. "Goal", split into compute and mem stall]
  Our goal: Fetch data earlier & in parallel

10 What is Prefetching?
  Fetch memory ahead of time
  Targets compulsory, capacity, & coherence misses
  Big challenges:
    1. Knowing what to fetch
         fetching useless info wastes valuable resources
    2. Knowing when to fetch it
         fetching too early clutters storage
         fetching too late defeats the purpose of prefetching

11 Prefetching Data Cache
  [Figure: superscalar pipeline (branch predictor, I-cache, decode buffer, decode, dispatch buffer, dispatch, reservation stations, branch/integer/integer/floating point/store/load units, completion buffer, complete, store buffer) with memory reference prediction and a prefetch queue feeding the data cache, backed by main memory]

12 Software Prefetching
  Compiler/programmer places prefetch instructions
    requires ISA support (why not use regular loads?)
    found in recent ISAs such as SPARC V9
  Prefetch into
    register (binding)
    caches (non-binding): preferred in multiprocessors

13 Software Prefetching (Cont.)
  e.g.,
    for (i = 1; i < rows; i++)
      for (j = 1; j < columns; j++) {
        prefetch(&x[i+1][j]);   /* hint: fetch the element one row ahead */
        sum = sum + x[i][j];
      }
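A minimal C sketch of the loop above, assuming GCC or Clang, whose `__builtin_prefetch` builtin stands in for the slide's generic `prefetch()`. The function name `sum_with_prefetch`, the array dimensions, and the bound check that skips the hint on the last row are illustrative assumptions, not part of the slide.

```c
#include <stddef.h>

#define ROWS 64      /* hypothetical array dimensions */
#define COLUMNS 64

/* Sum x[1..ROWS-1][1..COLUMNS-1], as in the slide's loop, hinting one row ahead. */
double sum_with_prefetch(double x[ROWS][COLUMNS])
{
    double sum = 0.0;
    for (size_t i = 1; i < ROWS; i++) {
        for (size_t j = 1; j < COLUMNS; j++) {
            if (i + 1 < ROWS) {
                /* Non-binding hint: read access (0), high temporal locality (3). */
                __builtin_prefetch(&x[i + 1][j], 0, 3);
            }
            sum += x[i][j];
        }
    }
    return sum;
}
```

Issuing a hint on every iteration, as the slide does, requests the same cache line several times; practical code would more likely issue one hint per line.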

14 Software Prefetching Support
  PowerPC Data Cache Block Touch instruction (dcbt EA)
    a hint that performance will probably be improved if the block containing the byte addressed by EA is fetched into the data cache
    a correct implementation of dcbt is to do nothing
    or, treat it as a load instruction with no destination register, except it should not trigger page or protection faults
  Where should compilers insert dcbt?
    in front of every load: wastes I-cache and D-cache bandwidth
    where loads are likely to miss
      when traversing large data sets (arrays in scientific code)
    where load misses would really hurt performance
      pointer arguments to functions
      linked list traversal
      find loads whose data address is itself the result of a previous load
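The linked-list case can be sketched in C as follows, again assuming GCC or Clang and using `__builtin_prefetch` as a portable stand-in for `dcbt`. The `node` type and the `list_sum` function are hypothetical. Like `dcbt`, the builtin is only a hint and does not fault on an invalid address, so passing a NULL `next` pointer at the end of the list is harmless.

```c
#include <stddef.h>

struct node {            /* hypothetical list node */
    int value;
    struct node *next;
};

/* Sum a list, hinting the next node while the current one is processed. */
long list_sum(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* The next node's address is itself the result of a previous load
           (n->next), the kind of load the slide singles out. */
        __builtin_prefetch(n->next, 0, 1);
        sum += n->value;
    }
    return sum;
}
```

This gives only one node of lookahead, so the benefit depends on how much work is done per node.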

15 Hardware Prefetching
  What to prefetch?
    one block spatially ahead?
    use address predictors: work well for regular patterns (e.g., x, x+8, x+16, ...)
  When to prefetch?
    on every reference
    on every miss
    when prior prefetched data is referenced
    upon last processor reference
  Where to put prefetched data?
    auxiliary buffers
    caches

16 Spatial Locality and Sequential Prefetching
  Works well for I-cache
    instruction fetching tends to access memory sequentially
  Doesn't work very well for D-cache
    more irregular access pattern
    regular patterns may have non-unit stride (e.g., matrix code)
  Relatively easy to implement
    a large cache block size already has the effect of prefetching
    after loading one cache line, start loading the next line automatically if the line is not in cache and the bus is not busy
  What if you fetch at the wrong time?
    imagine you started sequential prefetching of a long cache line, and it so happens that you get a load miss to the middle of that line
    a critical-word-first reload triggered by the load miss itself may actually have restarted computation sooner!

17 Stride Prefetchers
  The access pattern for a particular static load is more predictable
  Reference Prediction Table (RPT) entry fields:
    Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags
  Remembers previously executed loads: their PC, the last address referenced, and the stride between the last two references
  When executing a load, look up in RPT and compute the distance between the current data addr and the last addr
    if the new distance matches the old stride => found a pattern, go ahead and prefetch current addr + stride
    update last addr and last stride for next lookup
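The lookup-and-update steps can be sketched as a small C software model. This is an illustration under stated assumptions rather than a hardware design: the table size, the direct-mapped indexing by PC, and the names `rpt_entry` and `rpt_access` are hypothetical, the Flags field is omitted, and ignoring zero strides is an added simplification.

```c
#include <stdbool.h>
#include <stdint.h>

#define RPT_ENTRIES 64   /* hypothetical table size */

struct rpt_entry {
    bool     valid;
    uint64_t pc;          /* load instruction PC (tag) */
    uint64_t last_addr;   /* last address referenced */
    int64_t  last_stride; /* stride between the last two references */
};

static struct rpt_entry rpt[RPT_ENTRIES];

/* Called for each executed load. Returns true and sets *prefetch_addr when
   the new stride matches the recorded one. */
bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
{
    struct rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    bool predict = false;

    if (e->valid && e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->last_stride) {
            /* Pattern found: prefetch current addr + stride. */
            *prefetch_addr = addr + (uint64_t)stride;
            predict = true;
        }
        e->last_stride = stride;
    } else {
        /* First sighting of this load: allocate, no stride known yet. */
        e->valid = true;
        e->pc = pc;
        e->last_stride = 0;
    }
    e->last_addr = addr;
    return predict;
}
```

With this model, a load at one PC touching 1000, 1008, 1016 triggers its first prefetch (to 1024) on the third access, once the same stride has been seen twice.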

18 Stream Buffers
  Each stream buffer holds one stream of sequentially prefetched cache lines
    no cache pollution
  On a load miss, check the head of all stream buffers for an address match
    if hit, pop the entry from the FIFO and update the cache with the data
    if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
  Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
  [Figure: DCache backed by several FIFOs that connect to the memory interface]
  Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams
  Indirect array accesses (e.g., A[B[i]])?
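The miss-handling policy above can be sketched as a C software model. The buffer count, FIFO depth, and the names `stream_buffer` and `stream_buffer_miss` are hypothetical, addresses are cache-line numbers, and the model assumes prefetched lines arrive instantly (no bus timing), which real hardware cannot.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BUFFERS 4   /* hypothetical sizes */
#define DEPTH 4

struct stream_buffer {
    bool     valid;
    uint64_t line[DEPTH];  /* FIFO of prefetched line numbers; line[0] is the head */
    uint64_t last_used;    /* timestamp for LRU recycling; 0 = never used */
};

static struct stream_buffer bufs[NUM_BUFFERS];
static uint64_t now;

/* Handle a cache miss to line number `miss`. Returns true if a stream
   buffer head supplied the line, false if it must come from memory. */
bool stream_buffer_miss(uint64_t miss)
{
    struct stream_buffer *victim = &bufs[0];
    now++;

    for (int b = 0; b < NUM_BUFFERS; b++) {
        struct stream_buffer *sb = &bufs[b];
        if (sb->valid && sb->line[0] == miss) {
            /* Hit: pop the head and top off the tail with the next line. */
            for (int i = 0; i + 1 < DEPTH; i++)
                sb->line[i] = sb->line[i + 1];
            sb->line[DEPTH - 1] = sb->line[DEPTH - 2] + 1;
            sb->last_used = now;
            return true;
        }
        /* Remember the least recently used buffer in case nothing hits. */
        if (sb->last_used < victim->last_used)
            victim = sb;
    }

    /* No match: recycle the LRU buffer for the lines following the miss. */
    victim->valid = true;
    for (int i = 0; i < DEPTH; i++)
        victim->line[i] = miss + 1 + (uint64_t)i;
    victim->last_used = now;
    return false;
}
```

Because only the head of each FIFO is compared, a miss to a line deeper in a buffer does not hit and allocates a new stream instead.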

19 Generalized Access Pattern Prefetchers
  How do you prefetch
    1. Heap data structures?
    2. Indirect array accesses?
    3. Generalized memory access patterns?
  Current proposals:
    Precomputation prefetchers
    Address-correlating prefetchers

20 Runahead Prefetchers
  Proposed for I/O prefetching first (Gibson et al.)
  Duplicate the program
    only execute the address-generating stream
    let it run ahead
  [Figure: Main Thread alongside a Prefetch Thread]
  May run as a thread on
    a separate processor
    the same multithreaded processor
  Or custom address generation logic
  Many names: slipstream, precomp., runahead, ...

21 Runahead Prefetcher
  To get ahead:
    must avoid waiting
    must compute less
  Predict
    1. Control flow thru branch prediction
    2. Data flow thru value prediction
    3. Address generation computation only
  + Prefetch any pattern (need not be repetitive)
  - Prediction only as good as branch + value prediction
  How much prefetch lookahead?

22 Correlation-Based Prefetching
  Consider the following history of load addresses emitted by a processor:
    A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
  After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
  [Figure: Markov model over states A, B, C, D, E, F]
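The question can be answered by counting successors in the history. The C sketch below builds a first-order transition-count table for the six symbolic addresses; the function names `markov_train` and `markov_predict` are hypothetical, and real prefetchers work on memory addresses rather than letters.

```c
#include <stddef.h>

#define NUM_ADDRS 6   /* symbolic addresses A..F from the slide */

/* counts[x][y] = number of times address y directly followed address x. */
static unsigned counts[NUM_ADDRS][NUM_ADDRS];

/* Count every adjacent pair in a history string such as "ABCD". */
void markov_train(const char *history)
{
    for (size_t i = 0; history[i] != '\0' && history[i + 1] != '\0'; i++)
        counts[history[i] - 'A'][history[i + 1] - 'A']++;
}

/* Most frequent successor of `addr`, or '\0' if addr was never followed. */
char markov_predict(char addr)
{
    int best = -1;
    unsigned best_count = 0;
    for (int y = 0; y < NUM_ADDRS; y++) {
        if (counts[addr - 'A'][y] > best_count) {
            best_count = counts[addr - 'A'][y];
            best = y;
        }
    }
    return best < 0 ? '\0' : (char)('A' + best);
}
```

Trained on the slide's history (`"ABCDCEACFFEAABCDEABCDC"`), E is followed by A all three times it appears, while A is followed by B three times out of five.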

23 Correlation-Based Prefetching
  Correlation table entry fields:
    Load Data Addr (tag) | Prefetch Candidate 1 | Confidence | ... | Prefetch Candidate N | Confidence
  Track the likely next addresses after seeing a particular addr.
  Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)
  Prefetch accuracy can be improved by using longer history
    decide which address to prefetch next by looking at the last K load addresses instead of just the current one
    e.g., index with the XOR of the data addresses from the last K loads
  Using history of a couple of loads can increase accuracy dramatically
  This technique can also be applied to just the load miss stream
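The history-based indexing can be sketched in C as below. This is a simplified model under stated assumptions: K = 2, one candidate per entry with no confidence counter (the slide's table holds up to N), a hypothetical direct-mapped table, and hypothetical names `corr_entry` and `corr_access`.

```c
#include <stdbool.h>
#include <stdint.h>

#define K 2               /* history length (the slide's K) */
#define TABLE_SIZE 256    /* hypothetical table size */

struct corr_entry {
    bool     valid;
    uint64_t tag;         /* XOR of the last K load addresses */
    uint64_t next;        /* address that followed this history last time */
};

static struct corr_entry corr_table[TABLE_SIZE];
static uint64_t corr_history[K];   /* last K load addresses, oldest first */
static int corr_seen;              /* loads observed so far, saturating at K */

static uint64_t history_hash(void)
{
    uint64_t h = 0;
    for (int i = 0; i < K; i++)
        h ^= corr_history[i];
    return h;
}

/* Observe one load address. Returns true and sets *prefetch_addr if the
   table holds a successor for the updated K-address history. */
bool corr_access(uint64_t addr, uint64_t *prefetch_addr)
{
    if (corr_seen == K) {
        /* Train: the previous history was followed by addr. */
        uint64_t prev = history_hash();
        struct corr_entry *t = &corr_table[prev % TABLE_SIZE];
        t->valid = true;
        t->tag = prev;
        t->next = addr;
    }

    /* Shift addr into the history. */
    for (int i = 0; i + 1 < K; i++)
        corr_history[i] = corr_history[i + 1];
    corr_history[K - 1] = addr;
    if (corr_seen < K)
        corr_seen++;
    if (corr_seen < K)
        return false;

    /* Predict: look up the successor recorded for the new history. */
    uint64_t h = history_hash();
    const struct corr_entry *e = &corr_table[h % TABLE_SIZE];
    if (e->valid && e->tag == h) {
        *prefetch_addr = e->next;
        return true;
    }
    return false;
}
```

Because XOR is commutative, the histories (B, C) and (C, B) map to the same entry; that aliasing is a property of this indexing choice.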

24 Example Address Correlating: Markov Prefetchers
  Markov Prefetchers (Joseph & Grunwald, ISCA 97)
    Correlate subsequent cache misses
    Trigger prefetch on miss
    Predict & prefetch 4 candidates: predicting 1 results in low coverage!
    Prefetch into a buffer

25 Problems with Markov
  Either low coverage or low accuracy
    prefetches several addresses to try to eliminate one miss
  Insufficient lookahead
    distance between two misses is usually small
  [Figure: access timeline: load/store A1 (miss), load/store A1 (hit), ..., load/store C3 (miss), load/store A3 (miss); labels "lookahead" and "Fetch on miss"]

26 Spatio-Temporal Memory Streaming [ISCA 05][ISCA 06]
  Memory accesses repeat: patterns in space, sequences in time
  [Figure: repeating loads Ld Q, Ld W, Ld E, Ld Q, Ld W, Ld E; stream Q, W, E, ... flows from memory or other CPUs through L2 and L1 to the CPU]
  Observe patterns/sequences
  Stream data to CPU ahead of requests

27 My thesis: Temporal Memory Streaming (TMS)
  HW records & replays recurring miss sequences
  [Figure: timelines. Baseline: CPU must wait to follow pointers A, B, C. TMS: fetch A, B, C in parallel]
  Stream data to CPU
    In advance: ahead of CPU requests
    In parallel: even for dependent accesses
    In order: flow control to manage storage/bw
  6-21% speedup in Web & OLTP apps.

28 Miss addresses are correlated
  Intuition: miss sequences repeat
    because code/data traversals repeat
  Miss seq.: Q W A B C D E R T A B C D E Y
  Temporal Address Correlation
    Prior evidence: [Joseph 97][Luk 99][Chilimbi 01][Lai 01]
    Contrast: temporal locality
  stream = ordered address sequence

29 Recent streams recur
  Intuition: streams exhibit temporal locality
    because working set exhibits temporal locality
    MP: repetition often across CPUs
  CPU 1: Q W A B C D E R
  CPU 2: T A B C D E Y
  Temporal Stream Locality
    Streams evolve as structures change
    Track streams at run-time, not statically

30 TMS at 10,000 feet
  [Figure: three phases, Log, Lookup, and Stream. CPU loads (Load A, Load B, Load C) are recorded by TMS as {A,B,C,...}; a later Load A is looked up in TMS; TMS then issues Prefetch B and Prefetch C to the CPU]
  Key HW design challenge: lookup mechanism
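The log, lookup, and stream phases can be illustrated with a small C software model. This is a sketch under stated assumptions, not the hardware design from the thesis: addresses are cache-line numbers, the miss log is a circular buffer, lookup uses a hypothetical direct-mapped index from address to most recent log position, and the names `tms_miss`, `miss_log`, and `log_index` are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_SIZE 1024     /* hypothetical circular log capacity */
#define INDEX_SIZE 256    /* hypothetical direct-mapped index size */

static uint64_t miss_log[LOG_SIZE];
static uint64_t log_count;               /* total misses logged so far */

struct index_entry {
    uint64_t addr;
    uint64_t pos;   /* log position of the miss that followed addr; 0 = empty */
};
static struct index_entry log_index[INDEX_SIZE];

/* Record a miss to `addr` and copy up to `max` addresses that followed its
   previous occurrence into `out`. Returns how many were produced. */
size_t tms_miss(uint64_t addr, uint64_t *out, size_t max)
{
    struct index_entry *e = &log_index[addr % INDEX_SIZE];
    size_t n = 0;

    /* Lookup + stream: replay the logged successors of the earlier miss,
       provided they have not been overwritten in the circular log. */
    if (e->pos != 0 && e->addr == addr && log_count - e->pos < LOG_SIZE) {
        for (uint64_t p = e->pos; p < log_count && n < max; p++)
            out[n++] = miss_log[p % LOG_SIZE];
    }

    /* Log: append this miss and remember where its successor will land. */
    miss_log[log_count % LOG_SIZE] = addr;
    log_count++;
    e->addr = addr;
    e->pos = log_count;
    return n;
}
```

For example, after misses to lines 10, 20, 30, 40, 50, a second miss to 20 returns the stream 30, 40 when `max` is 2, mirroring the slide's Load A leading to Prefetch B and Prefetch C.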

31 System models & applications
  System models:
    16-node DSM: per node, CPU, 8MB L2, directory, memory (coherence misses)
    4-core CMP: cores & L1s, shared 8MB L2, memory (capacity/conflict misses)
  Applications:
    Web (SPECweb99): Apache, Zeus
    OLTP (TPC-C): DB2, Oracle
    DSS (DB2): Qry 1, 2, 17
    Scientific: em3d, moldyn, ocean

32 TMS opportunity
  [Figure: % off-chip misses (0-100%) for DSM and CMP on Web, OLTP, DSS, Sci; categories: recurrence system-wide, same CPU only, new/heads]
  Avg. 75% of misses in repetitive streams
  Streams recur across processors (esp. DSM)

33 Stream length
  [Figure: % of streamed blocks (0-100%) vs. stream length for Web, OLTP, DSS on DSM and CMP; median length 10]
  Long streams (esp. CMP) need flow control
  Contrast: depth prefetching [Solihin 03][Nesbit 04]
    Short sequences (~4 misses)
  Variable-length log in circular buffer

34 Stream reuse
  [Figure: % off-chip misses (0-20%) vs. stream reuse distance for Web, OLTP, DSS on DSM and CMP]
  Coherence: reuse = F(sharing behavior)
  Capacity: reuse = F(L2 size)
  Reuse distance 10^5 misses: log off-chip

35 Correlated & strided misses
  [Figure: misses for Web, OLTP, DSS, Sci broken down into temporally-correlated, strided, and non-correlated]
  TMS & stride target different access patterns
  DSS: new TMS opportunity < 20%

36 Sources of repetitive streams
  Top contributors to TMS opportunity
  Web
    Dynamic content generation in PERL (21%)
    System calls: poll, read, write, stat (15%)
    Kernel STREAMS sub-system (13%)
  OLTP (DB2)
    Index, tuple & page accesses (23%)
    SQL request control & runtime interpreter (14%)
    Kernel MMU & register window traps (13%)
  DSS
    Bulk memory copies (55%)
  Can extrapolate to new workloads via source breakdown

37 Coverage & timeliness
  [Figure: % misses (0-100%), trace vs. timing results for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus; categories: opportunity, fully hidden, partially hidden]
  Stream lookup latency ≈ miss latency
    ~1 miss/stream untimely (partial or no overlap)
  TMS-DSM achieves avg. 75% of opportunity

38 Performance impact
  [Figure: left, normalized time breakdown (busy, other stalls, off-chip read stalls), base vs. TMS, for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus; right, speedup with confidence intervals for the same applications]
  TMS-DSM eliminates 25%-95% of read stalls
  Commercial apps: 6% to 21% improvement

39 Memory wall summary
  Main memory accesses cost 100s of cycles
  Solution: Temporal Memory Streaming
    Record & replay recurring miss sequences
    Breaks pointer-chasing dependence
  Performance improvement:
    7-230% in scientific apps.
    6-21% in commercial Web & OLTP apps.

40 Improving Cache Performance: Summary
  Miss rate
    large block size
    higher associativity
    victim caches
    skewed/pseudo-associativity
    hardware/software prefetching
    compiler optimizations
  Miss penalty
    give priority to read misses over writes/writebacks
    subblock placement
    early restart and critical word first
    non-blocking caches
    multi-level caches
  Hit time (difficult?)
    small and simple caches
    avoiding translation during L1 indexing (later)
    pipelining writes for fast write hits
    subblock placement for fast write hits in write-through caches
