Annotated Memory References: A Mechanism for Informed Cache Management

1 Annotated Memory References: A Mechanism for Informed Cache Management
Alvin R. Lebeck, David R. Raymond, Chia-Lin Yang, Mithuna S. Thottethodi
Department of Computer Science, Duke University

2 Motivation
- Importance of cache performance
- Allow software to assist in cache management
- Issues in software assisted cache management and scope of this study
  - What information?
  - How is the information conveyed?
  - How is the information exploited?

3 Outline
- Motivation and scope
- The proposed mechanism
  - Static Instruction Annotation
  - The Tag Instruction
  - Overhead: not more than 2%
- Utilizing Annotated Memory References
  - Retain/Release annotation
  - Word/Block annotation
  - Between 11% and 17% speedup
- Conclusion

4 Design Decisions
- Design space (What? / How?): Instruction (PC) or Address (EA); Static or Dynamic
- Static Instruction annotation
- One additional instruction (TAG) proposed
- Annotation Register

5 The Tag Instruction
- Tag instruction fills the Annotation Register
- Future n loads get k bits of annotation
  - n: tag coverage
  - 2^k: number of possible annotations
- Implementation issues in modern processors
  - Multiple Issue
  - Out-of-order Execution
Example:
    tag 0xff44
    or  r1, r2, r3
    ld  r1, 0(r4)
    ld  r2, 8(r4)
    add r1, r2, r3
    ld  r7, 0(r1)
    ld  r8, 8(r1)
Annotation Register: f f 4 4
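The tag mechanism on this slide can be sketched as a small software model. This is only an illustration under stated assumptions: the slide does not say how the immediate is split, so the model assumes k = 4 bits per annotation, a tag coverage of n = 4 for the 16-bit value 0xff44, fields consumed most-significant first, and a default annotation of 0 for loads not covered by a tag. The names `AnnotationRegister` and `annotate` are hypothetical.

```python
class AnnotationRegister:
    """Toy model of the Annotation Register (not the authors' hardware design)."""

    def __init__(self, k_bits=4):
        self.k_bits = k_bits
        self.pending = []  # annotation fields not yet consumed by a load

    def tag(self, immediate, coverage):
        # Assumption: the immediate holds `coverage` fields of k bits each,
        # with the field for the first load in the most significant position.
        mask = (1 << self.k_bits) - 1
        self.pending = [
            (immediate >> ((coverage - 1 - i) * self.k_bits)) & mask
            for i in range(coverage)
        ]

    def next_annotation(self, default=0):
        # Each load, in program order, consumes one field.
        return self.pending.pop(0) if self.pending else default


def annotate(program, k_bits=4, coverage=4):
    """Return (load operands, annotation) pairs for every load in `program`."""
    reg = AnnotationRegister(k_bits)
    result = []
    for op, operand in program:
        if op == "tag":
            reg.tag(operand, coverage)
        elif op == "ld":
            result.append((operand, reg.next_annotation()))
        # Other instructions (or, add, ...) do not touch the register.
    return result


# Instruction sequence from the slide.
program = [
    ("tag", 0xFF44),
    ("or", "r1, r2, r3"),
    ("ld", "r1, 0(r4)"),
    ("ld", "r2, 8(r4)"),
    ("add", "r1, r2, r3"),
    ("ld", "r7, 0(r1)"),
    ("ld", "r8, 8(r1)"),
]
annotated = annotate(program)
```

Under these assumptions the four loads receive f, f, 4 and 4 in program order, matching the register contents shown on the slide.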

6 Implementation Issues
- Multiple Issue
  - WAR hazard
  - Dependence
- Out of order execution
  - Annotations associated with loads at decode time
  - Load/Store queue entries hold annotation bits
Example:
    ld  r9, 0(r8)
    tag 0xff44
    ld  r1, 0(r4)
    ld  r2, 8(r4)
    add r1, r2, r3
    ld  r7, 0(r1)
    ld  r8, 8(r1)
Annotation Register: f f 4 4
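One way to read the out-of-order bullets is that annotation bits are bound to a load while instructions are still in program order, and then travel with the load's queue entry. The sketch below illustrates that reading only. `LSQEntry` and `decode` are hypothetical names, and the field layout (k = 4, coverage 4, most-significant field first, default 0 for a load that precedes any tag) is an assumption rather than something the slide specifies.

```python
from collections import namedtuple

# Hypothetical load/store queue entry that carries its annotation bits.
LSQEntry = namedtuple("LSQEntry", ["seq", "load", "annotation"])


def decode(program, k_bits=4, coverage=4, default=0):
    """Walk instructions in program order, as a decode stage would,
    and attach annotation bits to each load's queue entry."""
    mask = (1 << k_bits) - 1
    pending = []
    lsq = []
    for seq, (op, operand) in enumerate(program):
        if op == "tag":
            pending = [
                (operand >> ((coverage - 1 - i) * k_bits)) & mask
                for i in range(coverage)
            ]
        elif op == "ld":
            bits = pending.pop(0) if pending else default
            lsq.append(LSQEntry(seq, operand, bits))
    return lsq


# Instruction sequence from the slide; the first load precedes the tag.
program = [
    ("ld", "r9, 0(r8)"),
    ("tag", 0xFF44),
    ("ld", "r1, 0(r4)"),
    ("ld", "r2, 8(r4)"),
    ("add", "r1, r2, r3"),
    ("ld", "r7, 0(r1)"),
    ("ld", "r8, 8(r1)"),
]
lsq = decode(program)

# Issue the loads in a different order (here: reversed). Each entry still
# carries the bits it was given at decode time.
issue_order = list(reversed(lsq))
seen = {entry.load: entry.annotation for entry in issue_order}
```

Because the binding happens before reordering, the order in which loads later execute does not change which annotation each one sees.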

7 Instruction Overhead
[Figure: Number of Instructions (Billions) per benchmark, showing Instructions and TAGs required. Benchmarks: wave5, turb3d, tomcatv, swim, su2cor, mgrid, hydro2d, fpppp, apsi, applu, vortex, perl, m88ksim, li, ijpeg, go, gcc, compress]
- All memory references annotated
- 4 bits/annotation; Tag coverage of 6
- Instruction overhead
  - Integer codes: 5.5% to 16.2%
  - Floating Point codes: 6.2% to 7.7%
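The arithmetic behind these parameters can be sketched as follows. With 4 bits per annotation and a tag coverage of 6, one tag carries 24 bits of annotation. The overhead function assumes that a tag never covers loads in a later basic block, so each block needs its own tags; that is an assumption made for illustration, not something stated on the slide. The block sizes in the example are made-up numbers, not measurements from the study.

```python
import math


def tag_immediate_bits(bits_per_annotation=4, coverage=6):
    """Annotation bits carried by one tag instruction."""
    return bits_per_annotation * coverage


def tags_needed(mem_refs, coverage=6):
    """Tags required to annotate `mem_refs` references in one basic block."""
    return math.ceil(mem_refs / coverage)


def instruction_overhead(blocks, coverage=6):
    """Fraction of extra instructions when every memory reference is annotated.

    `blocks` is a list of (instruction_count, mem_ref_count) pairs, one per
    executed basic block. Assumes tags do not span basic blocks.
    """
    total_instructions = sum(n for n, _ in blocks)
    total_tags = sum(tags_needed(m, coverage) for _, m in blocks)
    return total_tags / total_instructions


# Hypothetical trace of three basic blocks.
example_blocks = [(10, 3), (20, 7), (5, 0)]
overhead = instruction_overhead(example_blocks)  # 3 tags over 35 instructions
```

Short blocks with only a few memory references use a tag inefficiently, which is one plausible reason overhead can vary widely between codes.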

8 Cycle Overheads - Experiments
- Statically scheduled processors
  - ATOM based issue policies
  - Perfect branch prediction
  - Ideal memory system
  - No inter-block dependencies
- Dynamically scheduled processors
  - SimpleScalar simulator
  - 4-way issue, 64 RUU, 32 LSQ
- Simple overhead computation

9 Statically Scheduled Processor
[Figure: Cycle Overhead (%) per benchmark. Benchmarks: compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, applu, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5]
- All memory references annotated
- 4 bits/annotation; Tag coverage of 6
- Cycle overhead
  - Integer Codes: 0% to 0.85%
  - Floating Point Codes: 0% to 2%

10 Dynamically Scheduled Processor
[Figure: Cycle Overhead (%) per benchmark. Benchmarks: apsi, compress95, hydro2d, su2cor, swim, tomcatv, turb3d, vortex, wave5]
- All memory references annotated
- 4 bits/annotation; Tag coverage of 6
- Cycle overhead
  - Integer Codes: 0% to 0.2%
  - Floating Point Codes: 0% to 1.76%

11 Outline
- Motivation and scope
- The proposed mechanism
  - Static Annotation
  - The Tag Instruction
  - Overhead: not more than 2%
- Utilizing Annotated Memory References
  - Retain/Release annotation
  - Word/Block annotation
  - Between 11% and 17% speedup
- Conclusion

12 Utilizing Annotated Memory References
- Code inspection and manual insertion of annotations
- CProf tool to give insight into code operation
- Multimedia applications
  - epic, ijpeg, pegwit

13 Better Block Replacement
- Insight: some blocks should be retained even when they are the LRU block
- Retain/Release annotations
  - A block marked Retain cannot be replaced unless Released
  - Bypass cache if no replacement candidate
[Figure: Normalized Execution Time for epic, without and with annotations]
- Configuration
  - 4-way issue, OoO processor
  - 64 RUU, 32 LSQ entries
  - 8KB, 32 Byte block, Direct Mapped
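The Retain/Release rule can be illustrated with a toy direct-mapped cache using the 8KB, 32-byte-block configuration from the slide. This is a minimal sketch, not the simulator used in the study. The class name `RetainReleaseCache`, the string annotations, and the choice to apply an annotation when the annotated reference hits or fills a line are all assumptions.

```python
class RetainReleaseCache:
    """Direct-mapped cache in which a retained line is never evicted."""

    def __init__(self, size_bytes=8 * 1024, block_bytes=32):
        self.block_bytes = block_bytes
        self.num_lines = size_bytes // block_bytes
        self.tags = [None] * self.num_lines
        self.retained = [False] * self.num_lines

    def access(self, addr, annotation=None):
        """Return 'hit', 'miss' (line filled) or 'bypass' (cache untouched)."""
        block = addr // self.block_bytes
        index = block % self.num_lines
        tag = block // self.num_lines

        if self.tags[index] == tag:
            # Assumption: a hit with an annotation updates the line's state.
            if annotation == "retain":
                self.retained[index] = True
            elif annotation == "release":
                self.retained[index] = False
            return "hit"

        if self.retained[index]:
            # The only candidate line is retained, so there is no
            # replacement candidate: serve the reference without allocating.
            return "bypass"

        self.tags[index] = tag
        self.retained[index] = annotation == "retain"
        return "miss"


cache = RetainReleaseCache()
a, b = 0, 8 * 1024          # two addresses that map to the same line
trace = [
    cache.access(a, "retain"),   # fills the line and pins it
    cache.access(b),             # conflicts with a retained line
    cache.access(a),             # still resident
    cache.access(a, "release"),  # unpins the line
    cache.access(b),             # can now replace it
]
```

In a direct-mapped cache each address has exactly one candidate line, so a retained line forces every conflicting reference to bypass until the line is released.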

14 Better Block Sizes
- Insight: Implicit prefetch of larger blocks hurts performance
- WordMode/BlockMode annotations
  - WordMode annotated references bring in only a word and not the whole block
[Figure: Normalized Execution Time for ijpeg and pegwit, without and with annotations]
- Configuration
  - 4-way issue, OoO processor
  - 64 RUU, 32 LSQ entries
  - 8KB, 32 Byte block, Direct Mapped
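The difference between WordMode and BlockMode fills can be sketched with a direct-mapped cache that tracks which words of each line are valid. The slide does not describe the hardware, so per-word valid bits, a 4-byte word, and the name `WordBlockCache` are assumptions made purely for illustration. The model counts bytes fetched from memory to show the traffic that a WordMode reference avoids.

```python
class WordBlockCache:
    """Direct-mapped cache with per-word valid bits (illustrative only)."""

    def __init__(self, size_bytes=8 * 1024, block_bytes=32, word_bytes=4):
        self.block_bytes = block_bytes
        self.word_bytes = word_bytes
        self.words_per_block = block_bytes // word_bytes
        self.num_lines = size_bytes // block_bytes
        self.tags = [None] * self.num_lines
        self.valid = [set() for _ in range(self.num_lines)]
        self.bytes_fetched = 0

    def access(self, addr, mode="block"):
        """Return 'hit' or 'miss'; `mode` is 'word' or 'block'."""
        block = addr // self.block_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        word = (addr % self.block_bytes) // self.word_bytes

        if self.tags[index] == tag and word in self.valid[index]:
            return "hit"

        if self.tags[index] != tag:
            # A different block owns the line: evict it.
            self.tags[index] = tag
            self.valid[index] = set()

        if mode == "word":
            # WordMode: bring in only the referenced word.
            self.valid[index].add(word)
            self.bytes_fetched += self.word_bytes
        else:
            # BlockMode: bring in the whole block (implicit prefetch).
            self.valid[index] = set(range(self.words_per_block))
            self.bytes_fetched += self.block_bytes
        return "miss"


cache = WordBlockCache()
results = [
    cache.access(0, "word"),    # fetches 4 bytes
    cache.access(0, "word"),    # same word again
    cache.access(4, "word"),    # neighbouring word was not prefetched
    cache.access(64, "block"),  # fetches a full 32-byte block
    cache.access(68, "block"),  # neighbouring word was prefetched
]
```

The trade-off is visible in the trace: WordMode saves memory traffic for data with little spatial locality, at the cost of a miss on a neighbouring word that BlockMode would have prefetched.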

15 Conclusions
- Cache performance is critical
- Software can assist in managing caches
- We demonstrate a mechanism that allows software to help manage caches with low overheads (under 2%) and significant benefits (between 11% and 17% speedups)

16 Backup Slide: I-Cache Effects
- Maximum dynamic code expansion is 1.16
  - Worst case: if all memory references are annotated
  - Approx. 25% increase in cache misses for ill behaved codes [Lebeck and Wood, 94]
- In practice
  - Far fewer annotated memory references

17 Cycle Time Overhead
- Dependencies across basic blocks:
  - Maximum inter-block dependencies
    - Upper bound of execution time
    - Lower bound of overhead
  - No inter-block dependencies
    - Lower bound of execution time
    - Upper bound of overhead
[Figure: timeline (Time axis) comparing Max Dep and No Dep: issue of first instruction in basic block, continued execution of instructions in basic block, issue of last instruction in basic block]

18 Statically Scheduled Processor
[Figure: Percent Cycle Overhead per benchmark, with no inter-block dependencies and with max inter-block dependencies. Benchmarks: compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, applu, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5]
- Integer Codes: 0% to 0.8%
- Floating Point Codes: 0% to 2%

19 Design Decisions
Prior work in the design space (What? / How?; Static or Dynamic):
- Instruction (PC): Abraham et al., 93; Tyson et al., 95; Tyson et al., 95
- Address (EA): McFarling et al., 92; Rivers et al., 96; Johnson et al., 97; Inoue et al

20 Better Block Replacement
- Insight: some blocks should be retained even when they are the LRU block
- Retain/Release annotations
  - A block marked Retain cannot be replaced unless Released
  - Bypass cache if no replacement candidate
[Figure: Normalized Execution Time and Miss Rate (%) for epic, without and with annotations]

21 Better Block Sizes
- Insight: Implicit prefetch of larger blocks hurts performance
- WordMode/BlockMode annotations
  - WordMode annotated references bring in only a word and not the whole block
[Figure: Normalized Execution Time and Miss Rate (%) for ijpeg and pegwit, without and with annotations]
