Improving Cache Performance by Exploiting Read-Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez



1 Improving Cache Performance by Exploiting Read-Write Disparity
Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez

2 Summary
Read misses are more critical than write misses.
  Read misses can stall the processor; writes are not on the critical path.
Problem: Cache management does not exploit read-write disparity.
Goal: Design a cache that favors reads over writes to improve performance.
  Lines that are only written to are less critical.
  Prioritize lines that service read requests.
Key observation: Applications differ in their read reuse behavior in clean and dirty lines.
Idea: Read-Write Partitioning
  Dynamically partition the cache between clean and dirty lines.
  Protect the partition that has more read hits.
Improves performance over three recent mechanisms.

3 Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

4 Motivation
Read and write misses are not equally critical.
Read misses are more critical than write misses.
  Read misses can stall the processor.
  Writes are not on the critical path.
[Figure: timeline. Rd A stalls the processor, Wr B is buffered and written back, Rd C stalls the processor.]
Cache management does not exploit the disparity between read and write requests.

5 Key Idea
Favor reads over writes in the cache.
Differentiate between lines that are read and lines that are only written to.
  The cache should protect lines that serve read requests.
  Lines that are only written to are less critical.
Improve performance by maximizing read hits.
An example: Rd A, Wr B, Rd B, Wr C, Rd D
  A, D: read-only
  B: read and written
  C: write-only


6 An Example
Access sequence, repeated every iteration: Rd A, Wr B, Rd B, Wr C, Rd D
[Figure: cache contents and hit/miss outcome after each access under the two policies.]
LRU replacement policy: 2 stalls per iteration.
Read-biased replacement policy: 1 stall per iteration (cycles saved).
Dirty lines are treated differently depending on read requests.
Evicting lines that are only written to can improve performance.
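The stall counts in this example can be reproduced with a small simulation. The sketch below is an illustration, not the authors' mechanism. It assumes a fully associative cache with three lines, write-allocate, and one stall per read miss. It models the read-biased policy as "evict the least recently used line that has not served a read since it was filled, otherwise fall back to plain LRU". The names `simulate`, `ways`, and `read_biased` are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, ways, read_biased, iterations=10):
    """Return the number of read misses in the last iteration of a looping trace."""
    # Maps line -> "has served a read since fill". Order: LRU first, MRU last.
    cache = OrderedDict()
    read_misses = 0
    for _ in range(iterations):
        read_misses = 0
        for op, line in trace:
            if line in cache:
                cache.move_to_end(line)      # hit: promote to MRU
                if op == "Rd":
                    cache[line] = True       # remember that this line served a read
                continue
            if op == "Rd":
                read_misses += 1             # assumption: only read misses stall
            if len(cache) >= ways:
                victim = next(iter(cache))   # default victim: the LRU line
                if read_biased:
                    # Prefer the least recently used line that never served a read.
                    for candidate, was_read in cache.items():
                        if not was_read:
                            victim = candidate
                            break
                del cache[victim]
            cache[line] = (op == "Rd")       # write-allocate: writes also fill a line
    return read_misses


loop = [("Rd", "A"), ("Wr", "B"), ("Rd", "B"), ("Wr", "C"), ("Rd", "D")]
lru_stalls = simulate(loop, ways=3, read_biased=False)
biased_stalls = simulate(loop, ways=3, read_biased=True)
print(lru_stalls, biased_stalls)
```

Under these assumptions the loop settles at two read misses per iteration with LRU and one with the read-biased policy, because the write-only line C is evicted instead of a line that will be read again.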

7 Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

8 Reuse Behavior of Dirty Lines
Not all dirty lines are the same.
Write-only lines: do not receive read requests, can be evicted.
Read-write lines: receive read requests, should be kept in the cache.
Evicting write-only lines provides more space for read lines and can improve performance.
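The distinction between write-only and read-write dirty lines can be sketched as a classification over an access trace. This is a simplified illustration: it labels each line over the whole trace and ignores evictions, whereas a real cache would judge a line only over its residency. The function name `classify_lines` and the label strings are hypothetical.

```python
def classify_lines(trace):
    """Label each line as clean, dirty (read-write), or dirty (write-only)."""
    written, read = set(), set()
    for op, line in trace:
        if op == "Rd":
            read.add(line)
        else:
            written.add(line)
    labels = {}
    for line in written | read:
        if line not in written:
            labels[line] = "clean"               # only ever read
        elif line in read:
            labels[line] = "dirty (read-write)"  # written and also read
        else:
            labels[line] = "dirty (write-only)"  # written, never read
    return labels


trace = [("Rd", "A"), ("Wr", "B"), ("Rd", "B"), ("Wr", "C"), ("Rd", "D")]
labels = classify_lines(trace)
print(labels)
```

For the access sequence used in the earlier example, A and D come out clean, B is read-write, and C is write-only, so C is the line a read-favoring cache can afford to evict.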


9 Reuse Behavior of Dirty Lines
[Figure: percentage of cache lines in the LLC that are dirty (write-only) and dirty (read-write), per benchmark.]
On average, 37.4% of lines are write-only and 9.4% of lines are both read and written.
Applications have different read reuse behavior in dirty lines.

10 Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

11 Read-Write Partitioning
Goal: Exploit different read reuse behavior in dirty lines to maximize the number of read hits.
Observation:
  Some applications have more reads to clean lines.
  Other applications have more reads to dirty lines.
Read-Write Partitioning:
  Dynamically partitions the cache into clean and dirty lines.
  Evicts lines from the partition that has less read reuse.
Improves performance by protecting lines with more read reuse.

12 Read-Write Partitioning
[Figure: two plots, Soplex and Xalanc. X-axis: instructions (M). Y-axis: number of reads, normalized to reads in clean lines at 100M instructions. Series: clean line, dirty line.]
Applications have significantly different read reuse behavior in clean and dirty lines.


13 Read-Write Partitioning
Utilize the disparity in read reuse between clean and dirty lines.
Partition the cache into clean and dirty lines.
Predict the partition size that maximizes read hits.
Maintain the partition through replacement.
  DIP [Qureshi et al. 2007] selects the victim within the partition.
[Figure: cache sets split into dirty lines and clean lines. With a predicted best partition size of 3, the victim is replaced from the dirty partition.]
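Maintaining the partition through replacement can be sketched as a victim-selection rule for one cache set. This is a minimal sketch under stated assumptions, not the authors' implementation. Plain LRU stands in for DIP when picking the victim inside the chosen partition. The tie-break (a write evicts from the dirty partition, a read from the clean one when the dirty count equals the prediction) is an assumption of this sketch. `Line`, `choose_victim`, and `predicted_dirty_size` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    dirty: bool
    last_use: int  # larger value means more recently used


def choose_victim(cache_set, predicted_dirty_size, incoming_is_write):
    """Pick a victim so the set drifts toward the predicted dirty partition size."""
    dirty = [line for line in cache_set if line.dirty]
    clean = [line for line in cache_set if not line.dirty]
    if len(dirty) > predicted_dirty_size:
        pool = dirty    # dirty partition is too large: shrink it
    elif len(dirty) < predicted_dirty_size:
        pool = clean    # dirty partition is too small: let it grow
    else:
        # Assumed tie-break: keep the current split by evicting from the
        # partition that the incoming line will join.
        pool = dirty if incoming_is_write else clean
    if not pool:
        pool = cache_set  # fall back to the whole set if the chosen side is empty
    # LRU stands in for DIP here.
    return min(pool, key=lambda line: line.last_use)


cache_set = [
    Line(0xA, True, 5), Line(0xB, True, 1), Line(0xC, False, 0),
    Line(0xD, True, 3), Line(0xE, True, 2), Line(0xF, False, 4),
]
victim = choose_victim(cache_set, predicted_dirty_size=3, incoming_is_write=False)
print(hex(victim.tag))
```

With four dirty lines in the set and a predicted dirty partition size of 3, the victim comes from the dirty partition, as in the figure on this slide.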


14 Predicting Partition Size
Predicts the partition size using sampled shadow tags.
  Based on utility-based partitioning [Qureshi et al. 2006].
Counts the number of read hits in clean and dirty lines.
Picks the partition (x, associativity - x) that maximizes the number of read hits.
[Figure: sampled sets with a dirty shadow tag directory and a clean shadow tag directory, each with one counter per recency position (MRU, MRU-1, ..., LRU+1, LRU). The partition with the maximum number of read hits is selected.]
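The selection step can be sketched as a search over all possible splits. This is a minimal sketch: it assumes counter `i` in each shadow directory holds the read hits observed at recency position `i` (0 is MRU), so a partition of `x` ways would have captured the hits in the first `x` counters. The function name `best_dirty_partition` and the counter values are made up for illustration.

```python
def best_dirty_partition(dirty_hits, clean_hits):
    """Return (x, hits): the dirty partition size x that maximizes total read hits."""
    assoc = len(dirty_hits)
    assert len(clean_hits) == assoc
    best_x, best_hits = 0, -1
    for x in range(assoc + 1):
        # x ways for dirty lines, the remaining (assoc - x) ways for clean lines.
        hits = sum(dirty_hits[:x]) + sum(clean_hits[:assoc - x])
        if hits > best_hits:
            best_x, best_hits = x, hits
    return best_x, best_hits


# Hypothetical read-hit counters for a 4-way cache, MRU position first.
dirty_hits = [10, 4, 1, 0]
clean_hits = [30, 20, 5, 2]
best_x, best_hits = best_dirty_partition(dirty_hits, clean_hits)
print(best_x, best_hits)
```

With these example counters the best split gives one way to dirty lines and three to clean lines, for 65 read hits. Ties resolve toward the smaller dirty partition because only a strictly larger hit count replaces the current best.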

15 Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

16 Methodology
CMP$im x86 cycle-accurate simulator [Jaleel et al. 2008]
4MB 16-way set-associative LLC
32KB I+D L1, 256KB L2
[...]-cycle DRAM access time
550M representative instructions
Benchmarks:
  10 memory-intensive SPEC benchmarks
  35 multi-programmed applications

17 Comparison Points
DIP, RRIP: Insertion Policy [Qureshi et al. 2007, Jaleel et al. 2010]
  Avoid thrashing and cache pollution.
  Dynamically insert lines at different stack positions.
  Low overhead.
  Do not differentiate between read and write accesses.
SUP+: Single-Use Reference Predictor [Piquet et al. 2007]
  Avoids cache pollution.
  Bypasses lines that do not receive re-references.
  High accuracy.
  Does not differentiate between read and write accesses.
    Does not bypass write-only lines.
  High storage overhead, needs PC in LLC.


18 Comparison Points: Read Reference Predictor (RRP)
A new predictor inspired by prior works [Tyson et al. 1995, Piquet et al. 2007].
  Identifies read and write-only lines by allocating PC.
  Bypasses write-only lines.
  Writebacks are not associated with any PC.
[Figure: timeline of accesses to line A. PC P: Rd A, Wb A, Wb A. PC Q: Wb A, Wb A, Wb A. Labels: allocating PC from L1; no allocating PC.]
Marks P as a PC that allocates a line that is never read again.
Associates the allocating PC in L1 and passes the PC to L2 and LLC.
High storage overhead.


19 Single Core Performance
[Figure: speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP, and RWP. Storage overhead labels: 48.4KB, 2.6KB.]
Differentiating read vs. write-only lines improves performance over recent mechanisms.
RWP performs within 3.4% of RRP, but requires 18X less storage overhead.

20 4 Core Performance
[Figure: speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP, and RWP, grouped by workloads with no, 1, 2, 3, and 4 memory-intensive applications. Data labels: +4.5%, +8%.]
Differentiating read vs. write-only lines improves performance over recent mechanisms.
More benefit when more applications are memory intensive.

21 Average Memory Traffic
[Figure: percentage of memory traffic, split into writeback and miss traffic, for Base and RWP. Surviving data labels: 85%, 17%, 66%.]
Increases writeback traffic by 2.5%, but reduces overall memory traffic by 16%.


22 Dirty Partition Sizes
[Figure: number of cache lines in the natural dirty partition and in the predicted dirty partition, per benchmark (400.perlbench through 483.xalancbmk).]
Partition size varies significantly for some benchmarks.


23 Dirty Partition Sizes
[Figure: number of cache lines in the natural dirty partition and in the predicted dirty partition, per benchmark (400.perlbench through 483.xalancbmk).]
Partition size varies significantly during the runtime for some benchmarks.

24 Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

25 Conclusion
Problem: Cache management does not exploit read-write disparity.
Goal: Design a cache that favors read requests over write requests to improve performance.
  Lines that are only written to are less critical.
  Protect lines that serve read requests.
Key observation: Applications differ in their read reuse behavior in clean and dirty lines.
Idea: Read-Write Partitioning
  Dynamically partition the cache into clean and dirty lines.
  Protect the partition that has more read hits.
Results: Improves performance over three recent mechanisms.

26 Thank you

27 Improving Cache Performance by Exploiting Read-Write Disparity
Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez
