ECE/CS 752 Final Project: Best-Offset & Signature Path Prefetcher Implementation
Qisi Wang, Hui-Shun Hung, Chien-Fu Chen
Outline
- Data Prefetching (Background)
- Existing Data Prefetchers
  - Stride Prefetcher
  - Offset Prefetcher (Best-Offset Prefetcher)
  - Look-Ahead Prefetcher (Signature Path Prefetcher)
- Experimental Results
  - Tool Background
  - Simulation Results
- Conclusion
Data Prefetching (Background)
- Prefetch data before it is needed
  - Reduces compulsory misses
  - Reduces memory access latency if
    - prefetching accuracy is high
    - prefetches are issued early enough
- Goal: predict which address will be needed in the future

Next-N-Lines Prefetching
- Always prefetch the next N cache lines after a demand access or a demand miss
- Pros
  - Easy to implement
  - Suitable for sequential access
- Cons
  - Wastes bandwidth on unwanted data if the access pattern is irregular
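The next-N-lines policy above can be sketched in a few lines. This is an illustrative sketch, not code from any real prefetcher; the 64-byte line size and the function name are assumptions.

```python
BLOCK_SIZE = 64  # cache line size in bytes (assumed)

def next_n_lines(demand_addr, n):
    """Return the addresses of the next n cache lines after a demand access."""
    line = demand_addr // BLOCK_SIZE  # align to the line containing the access
    return [(line + i) * BLOCK_SIZE for i in range(1, n + 1)]
```

With X = 1 this is exactly the next-line special case of offset prefetching described on the next slide.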
Data Prefetching (Background)
Offset Prefetching
- Prefetch the address at an offset X from the demanded address
  - If X = 1, this degenerates to next-line prefetching
- Demanded address [A] => prefetcher with offset X issues a prefetch for address [A + X]
Stride Prefetcher
- An offset prefetcher with a fixed distance
- Two kinds of stride prefetchers
  - Program counter (PC) based
    - Records the distance between memory addresses accessed by a load instruction
    - The next time the same load instruction is fetched, prefetches last address + distance
  - Cache block address based
    - Prefetches A + X, A + 2X, A + 3X, ...
    - The stream buffer is a special case of this type
      - Avoids cache pollution
      - On a load miss, check the stream buffer and pop the hit line into the cache
      - If the stream buffer also misses, allocate a new stream buffer
- Cons
  - The distance (stride) is fixed
  - Several variable-offset schemes have been proposed
    - Best-Offset (BO) Prefetcher
    - Signature Path Prefetcher (SPP)
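A minimal sketch of the PC-based variant, assuming a simple table indexed by load PC; the two-match confirmation rule and all names are illustrative, not taken from a specific design.

```python
class StridePrefetcher:
    """PC-indexed stride table: issue a prefetch only after the same
    stride is observed twice in a row, to filter out noisy strides."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Train on one load; return a prefetch address or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride  # stride confirmed: prefetch ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)  # first sighting of this load PC
        return prefetch
```

The cache-block-address-based variant would instead key the table on the region of A and issue A + X, A + 2X, ... once X is confirmed.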
Best-Offset Prefetcher (Idea)
- Varies the offset through a learning procedure
  - Finds the best offset value for each application
  - Several candidate offsets are tested
- An RR table records completed prefetch requests
  - When prefetching Y with current offset O, Y - O is saved into the RR table
Best-Offset Prefetcher (Learning)
- In the learning phase, every offset in the candidate list is tested (1 round)
  - Each L2 access tests 1 offset
  - DPC version: 46 offsets; paper version: 52 offsets
  - On a hit in the RR table, that offset's score is incremented by 1
  - All scores are reset to 0 when a learning phase begins
- When the learning phase finishes (e.g., 100 rounds) or some offset reaches SCORE_MAX (DPC version: 31), the phase ends
  - The offset with the highest score becomes the best offset
  - A new learning phase starts
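The learning loop above can be sketched as follows. SCORE_MAX = 31 and the 100-round limit follow the slide; the candidate offset list, RR-table size, and all names are illustrative assumptions, not the actual DPC-2 submission code.

```python
from collections import deque

class BestOffsetLearner:
    OFFSETS = [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16]  # small illustrative subset
    SCORE_MAX = 31    # DPC version (slide)
    MAX_ROUNDS = 100  # example round limit (slide)

    def __init__(self):
        self.rr = deque(maxlen=64)  # RR table of recent base addresses
        self.scores = {d: 0 for d in self.OFFSETS}
        self.idx = 0                # next candidate offset to test
        self.rounds = 0
        self.best_offset = 1

    def record_fill(self, addr, offset):
        """A prefetch for Y completed => Y - O is saved into the RR table."""
        self.rr.append(addr - offset)

    def on_l2_access(self, addr):
        """Each L2 access tests one candidate offset; returns the prefetch address."""
        d = self.OFFSETS[self.idx]
        if addr - d in self.rr:     # offset d would have been timely for this access
            self.scores[d] += 1
        self.idx = (self.idx + 1) % len(self.OFFSETS)
        if self.idx == 0:
            self.rounds += 1        # one full round: every offset tested once
        if self.scores[d] >= self.SCORE_MAX or self.rounds >= self.MAX_ROUNDS:
            self.best_offset = max(self.scores, key=self.scores.get)
            self.scores = {o: 0 for o in self.scores}  # new learning phase
            self.rounds = 0
        return addr + self.best_offset
```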
Best-Offset Prefetcher
- 1-degree prefetcher (prefetches only 1 address)
  - Prefetching with 2 offsets results in many useless prefetches
- The prefetcher is turned off if the best score is too low
  - BAD_SCORE is the threshold
  - The learning procedure keeps running while the prefetcher is off
- The MSHR threshold varies with the BO score and the L3 access rate
Signature Path Prefetcher
- Path-confidence-based prefetcher
  - History-based lookahead prefetching
- The SPP tables are trained by L2 accesses
- Prefetching depends on
  - The signature and pattern stored in the SPP tables
  - The overall path probability
Signature Path Prefetcher: Table Updating
- When L2 accesses a page, the corresponding signature table entry is updated
  - The offset is updated
  - The offset difference (delta) is used to generate the new signature
  - The old signature is used to update the pattern table
- Accesses with the same pattern share the same signature
  - Reduces training time and the number of pattern table entries
Signature Path Prefetcher: Prefetching
- Look up the signature of the currently accessed page
- Choose the delta with the highest probability P_i = C_delta / C_sig at the i-th prefetch depth
- If the product of all P_i is larger than the threshold
  - Prefetch current address + delta
  - Use the delta to update the signature and access the pattern table again
- If the product falls below the threshold, the procedure ends
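The lookahead loop above can be sketched as follows. The pattern table (signature -> {delta: count}) and the signature-folding function are passed in by the caller; the threshold, depth limit, and all names are illustrative assumptions.

```python
def lookahead_prefetch(pattern_table, update_signature, sig, addr,
                       threshold=0.25, max_depth=8):
    """Walk the signature path, multiplying per-depth confidence,
    until the path confidence drops below the threshold."""
    prefetches, confidence = [], 1.0
    for _ in range(max_depth):
        counts = pattern_table.get(sig)
        if not counts:
            break                            # no pattern learned for this signature
        c_sig = sum(counts.values())
        delta, c_delta = max(counts.items(), key=lambda kv: kv[1])
        confidence *= c_delta / c_sig        # P_i = C_delta / C_sig
        if confidence < threshold:
            break                            # path no longer confident enough
        addr += delta
        prefetches.append(addr)              # prefetch current address + delta
        sig = update_signature(sig, delta)   # step further down the path
    return prefetches
```

Each step both issues a prefetch and speculatively advances the signature, so the prefetcher can run several accesses ahead of the demand stream.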
Gem5 Simulation System (block diagram: CPU, L1D cache, L1I cache, L2 cache with prefetcher, memory interface)
Gem5 Implementation (block diagram: CPU, L1D cache, L1I cache, L2 cache with prefetcher, memory interface)
System Setting
- CPU: TimingSimpleCPU

  Parameter       L1 Caches (Data/Instruction)   L2 Cache
  Size            16 KB                          128 KB
  Associativity   2                              8
  Tag Latency     2 cycles                       20 cycles
  Data Latency    2 cycles                       20 cycles
  MSHR Size       4 entries                      16 entries
  Replacement     LRU                            LRU
Gem5 Implementation (block diagram: CPU, L1D cache, L1I cache, L2 cache with prefetcher, write queue, MSHR, priority queue, memory interface)
L2 Cache-Prefetcher Interface (block diagram)
- The L2 cache notifies the prefetcher on access and fill, passing: hit/miss, PC, address, set, way, whether the access is a prefetch, and the evicted address
- The prefetcher computes prefetch addresses and inserts them through a priority queue into the MSHR / write queue toward the memory interface
Benchmark Setting
- Prefetcher configurations
  - Basic PF types: Baseline, Stride (PC & Addr)
  - DPC-2 PF types: Best-Offset, SPP, AMPM
- Benchmarks (SPEC 2006)
  - 450.soplex
  - 454.calculix
  - 456.hmmer
  - 462.libquantum
  - 998.specrand
Simulation Results: Normalized Performance (chart)
Simulation Results: L2 Cache Overall Miss Rate (chart)
Simulation Results: Miss Rate Improvement (chart)
Conclusion
- Contribution
  - Open-source GitHub repository @ hfsken/gem5-with-dpc-2-prefetcher
    - Includes a DPC-2 wrapper for adding DPC PFs
    - Integrated with the following DPC PFs: Best-Offset, AMPM, Stride, SPP
- Summary
  - For a short running time
    - The Best-Offset prefetcher performs better on benchmarks with more regular access patterns and a higher overall miss rate
    - The performance gain on random access patterns is negligible
- Future Work
  - Complete the documentation in the GitHub repo
  - Analyze benchmark behavior in detail in the report
References
[1] Pierre Michaud, "Best-Offset Hardware Prefetching," IEEE HPCA, 2016.
[2] Pierre Michaud, "A Best-Offset Prefetcher," DPC-2, 2015.
[3] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path Confidence Based Lookahead Prefetching," IEEE/ACM MICRO, 2016.
[4] Jinchun Kim, Paul V. Gratz, and A. L. Narasimha Reddy, "Lookahead Prefetching with Signature Path," DPC-2, 2015.
[5] Course slides of Prof. Onur Mutlu, CMU.
[6] Course slides of Prof. Mikko Lipasti, UW-Madison.