A best-offset prefetcher


1 A best-offset prefetcher
Pierre Michaud
2nd Data Prefetching Championship, June 2015

2-6 DPC2 rules
[Diagram: DPC2 simulator with core, L1, L2 (with MSHR), L3, DRAM, and the prefetcher attached to the L2]
Inputs to the prefetcher on an L2 access: physical address, L2 hit/miss, IP, time
MSHR occupancy is visible to the prefetcher
Inputs to the prefetcher on an L2 fill: L2 fill line, L2 victim line, time
Prefetch address must lie in the same 4KB page as the demand address

7 Offset prefetching
Demand line X, prefetch offset O: prefetch line X+O
Next-line prefetching: O=1
Full-fledged offset prefetcher: varying offset
- Sandbox prefetcher (Pugsley et al., HPCA 2014)
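As a minimal sketch of the X+O rule combined with the DPC2 same-page constraint, the function below computes the prefetch line for a demand line. The 64-byte line size and the name `prefetch_line` are assumptions made for illustration; they do not come from the DPC2 simulator.

```python
LINE_BYTES = 64                             # assumed cache line size
PAGE_BYTES = 4096                           # 4KB page, per the DPC2 rule
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES   # 64 lines per page

def prefetch_line(demand_line, offset):
    """Return line X+O, or None if X+O leaves the 4KB page of X.

    Hypothetical helper: addresses are line numbers, not byte addresses.
    """
    target = demand_line + offset
    if target // LINES_PER_PAGE != demand_line // LINES_PER_PAGE:
        return None                         # would cross a page boundary
    return target
```

With these assumptions, next-line prefetching is simply `prefetch_line(x, 1)`, and a demand access to the last line of a page produces no prefetch.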

8 Proposed Best-Offset (BO) prefetcher
New method for setting the offset automatically
- different from Sandbox
- first implementation in an in-house simulator in 2011
Bandwidth & cache pollution: prefetch throttling method
- somewhat specific to DPC2
- DPC2 rules limit what can be done

9 Sequential stream (neglecting page boundary effects)
Example: offset=2
If the offset is too small, prefetches may not be timely

10 Strided stream
Example: stride=+96 bytes, offset=3
Constant byte-stride: periodic sequence of line-strides (1,2,1,2,...)
Offset = sum of line-strides in a period (offset=1+2=3)
...or a multiple of that sum (6,9,...)
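The +96-byte example can be checked with a short sketch. It assumes 64-byte cache lines (consistent with the 1,2,1,2 pattern on the slide) and positive strides only; `line_strides` and `period_offset` are hypothetical helper names.

```python
from math import gcd

LINE_BYTES = 64  # assumed cache line size

def line_strides(byte_stride, count, start=0):
    """Line-strides between consecutive accesses of a constant byte-stride stream."""
    lines = [(start + i * byte_stride) // LINE_BYTES for i in range(count + 1)]
    return [b - a for a, b in zip(lines, lines[1:])]

def period_offset(byte_stride):
    """Sum of line-strides over one period, for a positive byte stride.

    The pattern repeats once the stream has advanced by a whole number of
    lines, i.e. after LINE_BYTES // g accesses, where g = gcd(stride, LINE_BYTES).
    Over that period the stream advances byte_stride // g lines.
    """
    g = gcd(byte_stride, LINE_BYTES)
    return byte_stride // g
```

For a 96-byte stride, `line_strides(96, 4)` gives `[1, 2, 1, 2]` and `period_offset(96)` gives 3, matching the slide; any multiple of 3 also lands on lines the stream will touch.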

11 Interleaved streams
1st stream alone: offset = multiple of 3
2nd stream alone: offset = multiple of 2
Both streams: offset = multiple of 6
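The "multiple of 6" result is the least common multiple of the two per-stream requirements. A small sketch, with hypothetical helper names, makes the arithmetic explicit:

```python
from functools import reduce
from math import gcd

def common_offset(period_sums):
    """Smallest offset that is a multiple of every stream's period sum (their LCM)."""
    return reduce(lambda a, b: a * b // gcd(a, b), period_sums)

def covers(offset, period_sums):
    """True if the offset is a multiple of each stream's period sum."""
    return all(offset % s == 0 for s in period_sums)
```

Here `common_offset([3, 2])` is 6, while an offset of 3 satisfies only the first stream.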

12-14 BO prefetcher: main idea
Demand line X (miss / prefetched hit): prefetch X+O, where O is the current prefetch offset
Fill line Y (prefetched): write Y-O into the recent requests
Best-offset learning: test an offset O' by looking up X-O' in the recent requests (hit/miss?)

15 Recent Requests (RR) table
In 2011: 64-entry fully-associative FIFO
For DPC2: two direct-mapped banks with different hashing
- resembles 2-way skewed-associative
- 2 x 64 x 12-bit tags = 1536 bits
Write the same tag redundantly in both banks
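A rough sketch of the two-bank organization follows. The slide does not give the hash or tag functions, so `_index` and `_tag` below are placeholders chosen only for illustration; the bank count, bank size and tag width are taken from the slide.

```python
class RRTable:
    """Two direct-mapped banks of 12-bit tags, indexed with different hashes."""
    BANKS = 2
    ENTRIES = 64
    TAG_BITS = 12

    def __init__(self):
        self.banks = [[None] * self.ENTRIES for _ in range(self.BANKS)]

    def _index(self, line, bank):
        # Placeholder hashes (assumption): each bank mixes in different upper bits.
        shift = 6 if bank == 0 else 12
        return (line ^ (line >> shift)) % self.ENTRIES

    def _tag(self, line):
        # Placeholder partial tag (assumption): 12 bits above the index bits.
        return (line >> 6) & ((1 << self.TAG_BITS) - 1)

    def insert(self, line):
        # The same tag is written redundantly in both banks.
        for b in range(self.BANKS):
            self.banks[b][self._index(line, b)] = self._tag(line)

    def hit(self, line):
        tag = self._tag(line)
        return any(self.banks[b][self._index(line, b)] == tag
                   for b in range(self.BANKS))

    @classmethod
    def storage_bits(cls):
        return cls.BANKS * cls.ENTRIES * cls.TAG_BITS
```

Because only a partial tag is kept, distinct lines can alias and a lookup may report a false hit. Writing each line into both banks means a conflict in one bank does not necessarily evict it from the other.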

16 Learning the best offset
46 different offsets evaluated
- 23 positive + 23 negative
- 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,20,24,30,32,36,40
Each offset has a 5-bit score
- 46 x 5 = 230 bits
Test the 46 offsets successively (46 L2 accesses) = one round
- if hit in RR table for an offset, increment its score
Learning phase finishes after 100 rounds, or if one of the scores reaches 31
- select the offset with the greatest score: this is the new prefetch offset
- new learning phase starts: reset scores
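The learning loop can be sketched as below. This is an illustration under stated assumptions, not the championship code: the RR table is stood in for by any container supporting `in`, the initial offset of 1 is an assumption, ties are broken in favor of the offset listed first, and the class and method names are hypothetical.

```python
POSITIVE = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
            18, 20, 24, 30, 32, 36, 40]
OFFSETS = POSITIVE + [-o for o in POSITIVE]   # 46 candidate offsets
SCORE_MAX = 31                                # 5-bit score saturates at 31
ROUND_MAX = 100

class BestOffsetLearner:
    def __init__(self, recent):
        self.recent = recent          # stand-in for the RR table
        self.prefetch_offset = 1      # assumed initial offset
        self.best_score = 0
        self._reset()

    def _reset(self):
        self.scores = [0] * len(OFFSETS)
        self.idx = 0                  # next offset to test
        self.round = 0

    def on_l2_access(self, x):
        """Test one candidate offset against demand line x."""
        if (x - OFFSETS[self.idx]) in self.recent:
            self.scores[self.idx] += 1
        saturated = self.scores[self.idx] >= SCORE_MAX
        self.idx += 1
        if self.idx == len(OFFSETS):  # all 46 offsets tested: one round done
            self.idx = 0
            self.round += 1
        if saturated or self.round == ROUND_MAX:
            best = max(range(len(OFFSETS)), key=lambda i: self.scores[i])
            self.best_score = self.scores[best]
            self.prefetch_offset = OFFSETS[best]
            self._reset()             # a new learning phase starts
```

As a usage example, feed a sequential stream in which a line only becomes visible in `recent` four accesses after it was demanded: offsets of 4 and above hit on every test, offset 4 is tested first in each round, and it is therefore the first to saturate and is selected.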

17 Prefetch timeliness vs. prefetch accuracy
BO prefetcher tries to do timely prefetches
However...
Sometimes it is better to choose a smaller offset, even if it generates late prefetches
- Example: short sequential streams
Imperfect solution: delay queue

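18 BO prefetcher with a delay queue
Demand line X (miss / prefetched hit): prefetch X+O
X also enters the delay queue (delay 60 cycles) and is then written into the RR table (RR left)
Fill line Y (prefetched): write Y-O into the RR table (RR right)
Best-offset learning: test an offset O' by looking up X-O' in the RR table (hit/miss?)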
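One way to picture the delay queue is as a small FIFO of (release time, line) pairs. The sketch below uses the 60-cycle delay from this slide and the 15 slots listed in the state budget; dropping new entries when the queue is full is an assumption, and the class and method names are hypothetical.

```python
from collections import deque

DELAY_CYCLES = 60   # delay from the slide
QUEUE_SLOTS = 15    # queue size from the state budget

class DelayQueue:
    def __init__(self):
        self.entries = deque()   # (release_time, line), oldest first

    def push(self, line, now):
        """Queue demand line X; returns False if the queue is full."""
        if len(self.entries) >= QUEUE_SLOTS:
            return False         # assumption: a full queue drops the new entry
        self.entries.append((now + DELAY_CYCLES, line))
        return True

    def pop_ready(self, now):
        """Return the lines whose delay has elapsed, oldest first."""
        ready = []
        while self.entries and self.entries[0][0] <= now:
            ready.append(self.entries.popleft()[1])
        return ready
```

Lines returned by `pop_ready` would then be written into the RR table, so a demand access becomes visible to the learning logic 60 cycles later even when its prefetch has not completed. That lets smaller offsets collect score.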

19 Prefetch throttling (DPC2)
Turn prefetch on only if BO score > BADSCORE
- DPC2: BADSCORE=1 (10 for small L3 config)
- best-offset learning continues while prefetch is off
Drop prefetch request if MSHR occupancy is above a threshold
- vary the MSHR threshold depending on BO score and L3 access rate
[Figure: MSHR threshold vs. L3 access rate and BO score; axis labels: 0, 50% BW, DRAM BW, LOW, HIGH]
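The two throttling checks can be sketched as simple predicates. The slide does not give the rule that sets the MSHR threshold from the BO score and L3 access rate, so the threshold is left as a caller-supplied parameter here; the function names are hypothetical.

```python
BADSCORE = 1   # DPC2 default from the slide (10 for the small L3 configuration)

def prefetch_enabled(best_score, badscore=BADSCORE):
    """Prefetch is on only if the best offset's score exceeds BADSCORE."""
    return best_score > badscore

def should_issue(best_score, mshr_occupancy, mshr_threshold, badscore=BADSCORE):
    """Issue a prefetch only if prefetch is on and the MSHR is not too full.

    mshr_threshold is supplied by the caller (assumption: how it is derived
    from BO score and L3 access rate is not specified on the slide).
    """
    if not prefetch_enabled(best_score, badscore):
        return False
    return mshr_occupancy <= mshr_threshold
```

Learning is independent of these checks: scores keep being updated even while `prefetch_enabled` returns False, so prefetching can resume once a good offset appears.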

20 State (number of bits)
prefetch bits (1 bit per L2 line)
recent requests (2x64x12): 1536
scores (46x5): 230
delay queue (15 slots)
miscellaneous
TOTAL

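21 Fixed vs. adaptive offset (437.leslie3d)
[Plot: speedup as a function of the fixed offset, compared with BOP and BOP w/o DQ]

22 Fixed vs. adaptive offset (433.milc)
[Plot: speedup as a function of the fixed offset, compared with BOP and BOP w/o DQ]

23 Fixed vs. adaptive offset (434.zeusmp)
[Plot: speedup as a function of the fixed offset, compared with BOP and BOP w/o DQ]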

24 BO prefetcher vs. Sandbox prefetcher
Sandbox prefetcher (Pugsley et al., HPCA 2014)
- first published full-fledged offset prefetcher
- fake prefetches: evaluate an offset by setting bits in a Bloom filter
- if a demand access hits in the Bloom filter: fake prefetch successful
- prefetch timeliness not considered
- Sandbox method is orthogonal to offset prefetching
BO prefetcher
- no fake prefetches
- strives for prefetch timeliness

25 FIN
