Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Size: px

Start display at page:

Download "Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A."

Daniella Nash
6 years ago
Views:

1 Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez

2 Summary Read misses are more cri?cal than write misses Read misses can stall processor, writes are not on the cri?cal path Problem: Cache management does not exploit read- write disparity Goal: Design a cache that favors reads over writes to improve performance Lines that are only wriden to are less cri7cal Priori7ze lines that service read requests Key observa7on: Applica?ons differ in their read reuse behavior in clean and dirty lines Idea: Read- Write Par??oning Dynamically par??on the cache between clean and dirty lines Protect the par??on that has more read hits Improves performance over three recent mechanisms 2

3 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 3

4 Mo?va?on Read and write misses are not equally cri?cal Read misses are more cri?cal than write misses Read misses can stall the processor Writes are not on the cri?cal path Rd A STALL Wr B Rd C STALL Buffer/writeback B?me Cache management does not exploit the disparity between read- write requests 4

5 Key Idea Favor reads over writes in cache Differen?ate between read vs. only wriden to lines Cache should protect lines that serve read requests Lines that are only wriden to are less cri7cal Improve performance by maximizing read hits An Example Rd A Wr B Rd B Wr C Rd D A D Read- Only B Read and WriDen C Write- Only 5

An Example Rd A Wr B Rd B Wr C Rd D Rd A M D C B Rd A H WR B M Rd B H Wr C M Rd D M STALL Write B A 2 D stalls C B A per D B itera7on A D C B A Wr B H Rd B H WR C M Rd D M Write B Write C LRU

6 An Example Rd A Wr B Rd B Wr C Rd D Rd A M D C B Rd A H WR B M Rd B H Wr C M Rd D M STALL Write B A 2 D stalls C B A per D B itera7on A D C B A Wr B H Rd B H WR C M Rd D M Write B Write C LRU Replacement Policy Write C STALL D B A D B A D B A 1 D stall C B A per Replace itera7on C D B A Read- Biased Replacement Policy STALL D C B cycles saved Evic7ng Dirty lines are that treated are only differently wriden to depending can improve on performance read requests 6

7 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 7

8 Reuse Behavior of Dirty Lines Not all dirty lines are the same Write- only Lines Do not receive read requests, can be evicted Read- Write Lines Receive read requests, should be kept in the cache Evic7ng write- only lines provides more space for read lines and can improve performance 8

Percentage of Cachelines in LLC 100 90 80 70 60 50 40 30 20 10 0 400.

zeusmp 435.gromacs 436.cactusADM 437.leslie3d 445.gobmk 447.dealII 450.soplex 456.hmmer 458.

libquantum Dirty (write- only) Dirty (read- write) Applica7ons On average have 37.

9 Percentage of Cachelines in LLC perlbench Reuse Behavior of Dirty Lines 401.bzip2 403.gcc 410.bwaves 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 445.gobmk 447.dealII 450.soplex 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum Dirty (write- only) Dirty (read- write) Applica7ons On average have 37.4% different lines read are write- only, reuse behavior 9.4% lines are in both dirty read lines and wriden 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx xalancbmk

10 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 10

11 Read- Write Par??oning Goal: Exploit different read reuse behavior in dirty lines to maximize number of read hits Observa7on: Some applica?ons have more reads to clean lines Other applica?ons have more reads to dirty lines Read- Write Par77oning: Dynamically par??ons the cache in clean and dirty lines Evict lines from the par??on that has less read reuse Improves performance by protec7ng lines with more read reuse 11

12 Read- Write Par??oning Number of Reads Normalized to Reads in clean lines at 100m Soplex Clean Line Dirty Line Instruc7ons (M) Number of Reads Normalized to Reads in clean lines at 100m Xalanc Clean Line Dirty Line Instruc7ons (M) Applica7ons have significantly different read reuse behavior in clean and dirty lines 12

?on through replacement DIP [Qureshi et al. 2007] selects vic?

13 Read- Write Par??oning U?lize disparity in read reuse in clean and dirty lines Par??on the cache into clean and dirty lines Predict the par??on size that maximizes read hits Maintain the par??on through replacement DIP [Qureshi et al. 2007] selects vic?m within the par??on Predicted Best Par77on Size 3 Replace from dirty par77on Dirty Lines Cache Sets Clean Lines 13

Predic?ng Par??on Size Predicts par??on size using sampled shadow tags Based on u?lity- based par??oning [Qureshi et al. 2006] Counts the number of read hits in clean and dirty lines Picks the par?

14 Predic?ng Par??on Size Predicts par??on size using sampled shadow tags Based on u?lity- based par??oning [Qureshi et al. 2006] Counts the number of read hits in clean and dirty lines Picks the par??on (x, associa?vity x) that maximizes number of read hits Maximum number of read hits C O U N T E R S C O U N T E R S S A M P L E D S H A D O W S H A D O W MRU MRU- 1 LRU+1 LRU MRU MRU- 1 LRU+1 LRU S E T S T A G S T A G S Dirty Clean 14

15 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 15

16 Methodology CMP$im x86 cycle- accurate simulator [Jaleel et al. 2008] 4MB 16- way set- associa?ve LLC 32KB I+D L1, 256KB L cycle DRAM access?me 550m representa?ve instruc?ons Benchmarks: 10 memory- intensive SPEC benchmarks 35 mul?- programmed applica?ons 16

17 Comparison Points DIP, RRIP: Inser?on Policy [Qureshi et al. 2007, Jaleel et al. 2010] Avoid thrashing and cache pollu?on Dynamically insert lines at different stack posi?ons Low overhead Do not differen?ate between read- write accesses SUP+: Single- Use Reference Predictor [Piquet et al. 2007] Avoids cache pollu?on Bypasses lines that do not receive re- references High accuracy Does not differen?ate between read- write accesses Does not bypass write- only lines High storage overhead, needs PC in LLC 17

Comparison Points: Read Reference Predictor (RRP) A new

ng PC Bypasses write- only lines Writebacks are not

Wb A Wb A Alloca7ng No alloca7ng PC from L1 PC Marks

18 Comparison Points: Read Reference Predictor (RRP) A new predictor inspired by prior works [Tyson et al. 1995, Piquet et al. 2007] Iden?fies read and write- only lines by alloca?ng PC Bypasses write- only lines Writebacks are not associated with any PC Time PC P: Rd A Wb A Wb A PC Q: Wb A Wb A Wb A Alloca7ng No alloca7ng PC from L1 PC Marks Associates P as a the PC alloca7ng that High allocates storage PC in a overhead L1 line and that passes is never PC in read L2, again LLC 18

00 DIP RRIP SUP+ RRP RWP Differen7a7ng RWP performs read within vs.

19 1.20 Single Core Performance 48.4KB 2.6KB Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP Differen7a7ng RWP performs read within vs. write- only 3.4% of RRP, lines improves But requires performance 18X less over storage recent overhead mechanisms 19

20 4 Core Performance Speedup vs. Baseline LRU No Memory Intensive DIP RRIP SUP+ RRP RWP +4.5% 1 Memory Intensive 2 Memory Intensive 3 Memory Intensive +8% 4 Memory Intensive Differen7a7ng More benefit when read vs. more write- only applica7ons lines improves performance are memory over intensive recent mechanisms 20

21 Average Memory Traffic Percentage of Memory Traffic % 85% Writeback Miss 17% 66% 0 Base RWP Increases writeback traffic by 2.5%, but reduces overall memory traffic by 16% 21

8 6 4 2 0 Natural Dirty Par77on Predicted Dirty

bwaves 416.gamess 429.mcf 433.milc 434.zeusmp 435.

gobmk 447.dealII 450.soplex 453.povray 454.

libquantum 464.h264ref 465.tonto 470.lbm 471.

22 Dirty Par??on Sizes Number of Cachelines Natural Dirty Par77on Predicted Dirty Par77on 400.perlbench 401.bzip2 403.gcc 410.bwaves 416.gamess 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 444.namd 445.gobmk 447.dealII 450.soplex 453.povray 454.calculix 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx3 483.xalancbmk Par77on size varies significantly for some benchmarks 22

Predicted Dirty Par77on 400.perlbench 401.bzip2 403.gcc 410.bwaves 416.gamess 429.

gobmk 447.dealII 450.soplex 453.povray 454.calculix 456.hmmer 458.sjeng 459.

23 Dirty Par??on Sizes Number of Cachelines Natural Dirty Par77on Predicted Dirty Par77on 400.perlbench 401.bzip2 403.gcc 410.bwaves 416.gamess 429.mcf 433.milc 434.zeusmp 435.gromacs 436.cactusADM 437.leslie3d 444.namd 445.gobmk 447.dealII 450.soplex 453.povray 454.calculix 456.hmmer 458.sjeng 459.GemsFDTD 462.libquantum 464.h264ref 465.tonto 470.lbm 471.omnetpp 473.astar 481.wrf 482.sphinx3 483.xalancbmk Par77on size varies significantly during the run7me for some benchmarks 23

24 Outline Mo7va7on Reuse Behavior of Dirty Lines Read- Write Par77oning Results Conclusion 24

25 Conclusion Problem: Cache management does not exploit read- write disparity Goal: Design a cache that favors read requests over write requests to improve performance Lines that are only wriden to are less cri7cal Protect lines that serve read requests Key observa7on: Applica?ons differ in their read reuse behavior in clean and dirty lines Idea: Read- Write Par??oning Dynamically par??on the cache in clean and dirty lines Protect the par??on that has more read hits Results: Improves performance over three recent mechanisms 25

26 Thank you 26

27 Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez

28 Extra Slides 28

29 Reuse Behavior of Dirty Lines in LLC Different read reuse behavior in dirty lines Read Intensive/Non- Write Intensive Most accesses are reads, only a few writes Example: 483.xalancbmk Write Intensive Generates huge amount of intermediate data Example: 456.hmmer Read- Write Intensive Itera?vely reads and writes huge amount of data Example: 450.soplex 29

30 Read Intensive/Non- Write Intensive Most accesses are reads, only a few writes 483.xalancbmk: Extensible stylesheet language transforma?ons (XSLT) processor XSLT Processor XML Input HTML/XML/TXT Output XSLT Code 92% accesses are reads 99% write accesses are stack opera?on 30

31 elementat push %ebp mov %esp, %ebp push %esi pop %esi pop %ebp ret Non- Write Intensive Logical View of Stack %ebp %esi Write- Only 1.8% lines in LLC are write- only Read and Wri,en These dirty lines should be evicted WRITEBACK WRITEBACK 31

Mul?ple Sequence Alignments Best Ranked Matches Database Viterbi

32 Write Intensive Generates huge amount of intermediate data 456.hmmer: Searches a protein sequence database Hidden Markov Model of Mul?ple Sequence Alignments Best Ranked Matches Database Viterbi algorithm, uses dynamic programming Only 0.4% writes are from stack opera?ons 32

33 Write Intensive 2D Table Read and Wri,en WRITEBACK Write- Only 92% lines in LLC are write- only These lines can be evicted WRITEBACK 33

34 Read- Write Intensive Itera?vely reads and writes huge amount of data 450.soplex: Linear programming solver Miminum/maximum of objec?ve func?on Inequali?es and objec?ve func?on Simplex algorithm, iterates over a matrix Different opera?ons over the en?re matrix at each itera?on 34

35 Read- Write Intensive Read and Wri,en Pivot Column READ WRITEBACK Pivot Row Write- Only Read and Wri,en 19% lines in LLC write- only, 10% lines are read- wriden Read and wriden lines should be protected 35

Read Reuse of Clean- Dirty Lines in LLC Percentage of Cachelines in LLC 120 100 80 60 40 20 Clean Non- Read

cact 437.lesli 445.gob 447.dealI 450.sopl 456.hm 458.sjen 459.Gem 462.libqu 464.h26 465.tont 470.lbm 471.

36 Read Reuse of Clean- Dirty Lines in LLC Percentage of Cachelines in LLC Clean Non- Read Dirty Non- Read Clean Read Dirty Read perl 401.bzip 403.gcc 410.bwa 429.mcf 433.milc 434.zeus 435.gro 436.cact 437.lesli 445.gob 447.dealI 450.sopl 456.hm 458.sjen 459.Gem 462.libqu 464.h tont 470.lbm 471.omn 473.astar 481.wrf 482.sphi 483.xala On average 37.4% blocks are dirty non- read and 42% blocks are clean non- read 36

37 1.07 Single Core Performance 48.4KB 2.6KB Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP 5% speedup over the whole SPECCPU2006 benchmarks 37

38 Speedup Speedup vs. Baseline LRU DIP RRIP SUP+ RRP RWP On average 14.6% speedup over baseline 5% speedup over the whole SPECCPU2006 benchmarks

39 Writeback Traffic to Memory Writeback Traffic Normalized to LRU RRP RWP Increases traffic by 17% over the whole SPECCPU2006 benchmarks 39

40 4- Core Performance Normalized Weighted Speedup RRP RWP Workloads 40

41 Change of IPC with Sta?c Par??oning 0.9 Xalancbmk 0.8 Soplex IPC IPC Number of Ways Assigned to Dirty Blocks Number of Ways Assigned to Dirty Blocks

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses