Prefetching (II): Using Processors-In-Memory (PIM) for Prefetching. Instructor: Josep Torrellas CS533

1 Prefetching (II): Using Processors-In-Memory (PIM) for Prefetching. Instructor: Josep Torrellas, CS533. Copyright Josep Torrellas 2003

2 Memory Stall Time
[Figure: timeline with rows Main Proc and $+Mem; labels: Execution of non-dependent instructions, stall]

3 Intelligent Memory
User-Level Memory Thread (ULMT)
How to use ULMT to improve memory performance?
[Figure: block diagram; labels: Processor chip (Main Proc, L1 $, L2 $), Interconnect, North Bridge Chip (Mem Proc, L1 $, Memory Controller), Memory DRAM chip (Mem Proc, L1 $, DRAM Cells)]

4 Exploiting Intelligent Memory
ULMT for Memory-Side Prefetching
[Figure: two timelines; the first with rows Main Proc and $+Mem, the second with rows Main Proc, $+Mem, and Mem Proc]

5 Correlation Prefetching
Records sequences of miss addresses in a correlation table
When the head of a sequence is seen, prefetch the rest
Effective if miss sequences repeat:
(1) a[4*(i++)]: &a[0] &a[4] &a[8] &a[12] &a[16] ...
(2) a[foo(i)], a[b[i]]: &a[0] &a[3] &a[7] &a[9] &a[15] ...
(3), (4) List/Tree/Graph with arbitrary but fixed traversal order: &A &B &C ... &Z
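To make the "miss sequences repeat" condition concrete, here is a minimal Python sketch. It assumes a table that remembers only the single most recent successor of each miss address; the function name `correlation_coverage` and the trace are illustrative and not taken from the slides.

```python
def correlation_coverage(misses):
    """Fraction of misses predicted by a one-successor correlation table."""
    last_successor = {}   # miss address -> address that followed it last time
    predicted = 0
    prev = None
    for addr in misses:
        if prev is not None:
            # Prediction: the address that followed `prev` the last time.
            if last_successor.get(prev) == addr:
                predicted += 1
            # Learning: remember what followed `prev` this time.
            last_successor[prev] = addr
        prev = addr
    return predicted / len(misses)


# Hypothetical trace: the irregular indices of pattern (2), traversed 3 times.
irregular = [0, 3, 7, 9, 15]
trace = irregular * 3
print(correlation_coverage(trace))             # 9 of 15 misses predicted
print(correlation_coverage(list(range(10))))   # no repetition, nothing predicted
```

The first traversal only trains the table; every later traversal is predicted, even though the addresses follow no arithmetic stride.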

6 Correlation Prefetching
Effective for irregular + regular apps
Needs little info: cache miss addresses
No compiler support
Works with existing binaries
How to implement the correlation table?
Past work: expensive hardware
  Few MBs of on-chip SRAM for the table (larger than cache!)
  Table size has to scale with app working set
Paper proposes: software
  Is software fast enough? Yes.

7 Communication Mechanism
[Figure: block diagram with communication steps numbered 1 to 5; labels: Processor chip, Main Proc, L1 $, L2 $, Memory Controller, Interconnect, Memory chip, North Bridge Chip, Mem Proc, L1 $, DRAM Cells]

8 Timeline of the Prefetch Handler
[Figure: timeline with events Miss address obtained, Prefetch addresses produced, Prefetcher free; phases: Prefetching phase, Learning phase; intervals: Response time, Occupancy time]
Ideal algorithm:
  lowest response time
  occupancy time < time between misses

9 Correlation Table
Basic Organization [1]: Addr of immediate successors
  Row: Tag, Succ_L1; NumSucc = 2
Advanced Organization: Addr of next immediate successors
  Row: Tag, Succ_L1, Succ_L2, ...; NumSucc = 2, NumLevels = n
[1] Joseph & Grunwald, ISCA 97

10-15 Learning Phase (six animation steps with identical text)
Basic Organization: Tag, Succ_L1
Advanced Organization: Tag, Succ_L1, Succ_L2
[Figure: both tables being filled as the Current miss marker advances through the miss sequence]
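The learning phase can be sketched in Python as follows. This is one reading of the two organizations on slide 9, assuming each successor list is kept in most-recently-used (MRU) order; the function names, the dictionary-based table, and the sample miss trace are hypothetical and not from the slides.

```python
from collections import deque

NUM_SUCC = 2     # successors kept per level (NumSucc = 2 on slide 9)
NUM_LEVELS = 3   # levels per row in the Advanced table (assumed value)


def insert_mru(succs, addr, limit=NUM_SUCC):
    """Move addr to the MRU position and drop entries beyond the limit."""
    if addr in succs:
        succs.remove(addr)
    succs.insert(0, addr)
    del succs[limit:]


def learn_basic(table, last_miss, miss):
    """Basic: record miss as an immediate successor of the previous miss."""
    if last_miss is not None:
        insert_mru(table.setdefault(last_miss, []), miss)


def learn_advanced(table, recent, miss):
    """Advanced: record miss at level k of the row of the k-th previous miss.

    `recent` must be a deque(maxlen=NUM_LEVELS); recent[0] is the last miss.
    """
    for level, earlier in enumerate(recent):
        row = table.setdefault(earlier, [[] for _ in range(NUM_LEVELS)])
        insert_mru(row[level], miss)
    recent.appendleft(miss)


# Hypothetical miss trace, used only for illustration.
basic, advanced = {}, {}
recent = deque(maxlen=NUM_LEVELS)
last = None
for miss in ["a", "b", "c", "a", "d", "c"]:
    learn_basic(basic, last, miss)
    learn_advanced(advanced, recent, miss)
    last = miss

print(basic["a"])      # ['d', 'b']
print(advanced["a"])   # [['d', 'b'], ['c'], ['a']]
```

In this sketch the Basic table updates one row per miss, while the Advanced table updates up to NUM_LEVELS rows, so the extra work lands in the learning phase rather than in the prefetching phase.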

16 Prefetching Phase
[Figure: table lookups on a miss for the Basic Organization, the Advanced Organization, and Chaining]
Basic: 1 miss -> immediate successor
  - no far ahead prefetching
  - low coverage and late prefetches
Basic + Chaining:
  - low accuracy
  - high response time
Advanced: 1 miss -> several succ levels
  + far ahead prefetching
  + high coverage
  + timely prefetches
  + high accuracy
  + low response time
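The three lookup strategies compared on this slide can be sketched in Python as below. The table contents, function names, and the choice to follow only the MRU successor when chaining are assumptions made for illustration; each function returns the prefetch addresses together with the number of table lookups it needed.

```python
def prefetch_basic(table, miss):
    """Basic: one lookup, return the immediate successors of the miss."""
    return list(table.get(miss, [])), 1


def prefetch_chain(table, miss, num_levels):
    """Basic + Chaining: follow the MRU successor, one lookup per level."""
    prefetches, addr, lookups = [], miss, 0
    for _ in range(num_levels):
        succs = table.get(addr, [])
        lookups += 1
        if not succs:
            break
        for s in succs:
            if s not in prefetches:
                prefetches.append(s)
        addr = succs[0]   # only the MRU path is followed to the next level
    return prefetches, lookups


def prefetch_advanced(table, miss):
    """Advanced: one lookup returns the successors stored for every level."""
    row = table.get(miss, [])
    # dict.fromkeys removes duplicates while keeping level order.
    return list(dict.fromkeys(s for level in row for s in level)), 1


# Hypothetical table contents, used only for illustration.
basic = {"a": ["d", "b"], "b": ["c"], "c": ["a"], "d": ["c"]}
advanced = {"a": [["d", "b"], ["c"], ["a"]]}

print(prefetch_basic(basic, "a"))       # (['d', 'b'], 1)
print(prefetch_chain(basic, "a", 3))    # (['d', 'b', 'c', 'a'], 3)
print(prefetch_advanced(advanced, "a")) # (['d', 'b', 'c', 'a'], 1)
```

Chaining reaches three levels ahead only after three dependent lookups, which matches the "high response time" entry on the slide, and it guesses the deeper levels from the MRU path alone. The Advanced row yields the same depth from a single lookup because the deeper successors were recorded directly during learning.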

17 Simulation Environment
Main processor: 1.6 GHz, 6-issue OOO
  L1: 2-way 16 KB; L2: 4-way 512 KB
  Mem: 243 cycle RT
Memory processor: 800 MHz, 2-issue OOO
  L1: 2-way 32 KB
  Mem: 100 cycle RT (in North Bridge), 56 cycle RT (in DRAM)
Correlation table: application dependent, e.g. 64K entries, 3 levels, 2 successors
Applications: Specint2000, Specfp2000, NAS, Olden
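As a rough sizing exercise for the example table above (64K entries, 3 levels, 2 successors), the sketch below assumes each row stores one tag plus NumLevels x NumSucc successor addresses at 4 bytes each. The 4-byte width and the function name `table_bytes` are assumptions; the slides do not give the entry format.

```python
def table_bytes(rows, num_levels, num_succ, bytes_per_addr=4):
    """Estimate table size: one tag plus num_levels * num_succ addresses per row."""
    addrs_per_row = 1 + num_levels * num_succ
    return rows * addrs_per_row * bytes_per_addr


size = table_bytes(rows=64 * 1024, num_levels=3, num_succ=2)
print(size, "bytes =", size / 2**20, "MiB")   # 1835008 bytes = 1.75 MiB
```

Under these assumptions the example configuration occupies under 2 MiB of main memory.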

18 Predictability of Miss Sequences
[Figure: bar chart, scale up to 100; series: Next, Next2, Next3; applications: CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, Average]
Can predict miss sequences accurately

19 Execution Time (Mem Proc in DRAM)
[Figure: Normalized Execution Time, split into Busy, L1 to L2, Beyond L2; applications: CG, Mcf, MST, Tree, Avg9]
Legend: N: No Prefetch; S: Sequential; B: Basic; A: Advanced; +: S + advanced

20 Customization
[Figure: Normalized Execution Time, split into Busy, L1 to L2, Beyond L2; applications: CG, Mcf, MST, Tree, Avg9]
Legend: N: No Prefetch; +: S + advanced; C: Customization

21 Cycles Between Misses
[Figure: Distribution of L2 miss distances in bins [0,80), [80,200), [200,280), [280,Infinity); applications: CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, Average]
Occupancy time of prefetch thread must be < 200 cycles

22 Thread Response and Occupancy Time
[Figure: Number of Processor Cycles, split into Busy and Mem, with a mark at 200; panels: Response Time and Occupancy Time; algorithms: Base, Chain, Repl, Custom, ReplMC]
All algo feasible: Occupancy Time < 200 cycles
Advanced/repl has the best Response Time

23 DRAM vs. Mem Controller Chip
[Figure: Normalized Execution Time, split into Busy, L1 to L2, Beyond L2; applications: CG, Equake, FT, Gap, Mcf, MST, Parser, Sparse, Tree, Avg9]

24 Prefetching Effectiveness
[Figure: Normalized Effectiveness, scale up to 3; categories: Hits, DelayedHits, NonPrefMisses, Replaced, Redundant; configurations: NoPref, Conven4, Base, Chain, Repl, Conven4+Repl, Custom, Conven4+ReplMC; applications shown include Sparse, Tree, Avg7]

25 Bus Utilization
[Figure: % Utilization from 0% to 100%; components: No prefetching, Due to the reduced execution time, Due to prefetching; configurations: NoPref, Conven4, Base, Chain, Repl, Conven4+Repl, Custom, Conven4+ReplMC]
Advanced: avg increase of 8%

Related paper: "Correlation Prefetching with a User-Level Memory Thread," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 6, June 2003, p. 563. Yan Solihin, Member, IEEE, Jaejin Lee, Member, IEEE, and Josep