Optimizations for Improving Fetch Bandwidth of Future Itanium Processors
Marsha Eng, Hong Wang, Perry Wang, Alex Ramirez, Jim Fung, and John Shen

Overview
- Applications of the intermediate encoding for Itanium
- Improving fetch bandwidth
- Stream-based encoding
- Performance results
- Future work
- Related work
- Conclusion
Motivation
- Macro-code vs. micro-code
- Architectural vs. microarchitectural
- Compatibility vs. implementation-specific
- Legacy-free: best of both
- An intermediate binary code representation that is:
  - Architecturally visible
  - Implementation specific
  - Legacy-free
- Like the Pentium processor performance-monitor interface

Example
[Figure: a control-flow graph of blocks A, B, C, D with taken (T) and fall-through (FT) edges. The original code occupies addresses i through i+n+3 and is left in place; a trigger bundle redirects fetch to an appendix holding a re-laid-out copy of the hot blocks, delimited by boundary bundles.]
Benefits
- Modest hardware change: decode logic to recognize triggers and boundaries
- Unobtrusive to the original code: the optimized code is an appendix to the original code
- The appended binary is guaranteed to be correct
- Backwards compatible
- Legacy-free

The Encoding for the Itanium ISA
- Target: a future out-of-order Itanium machine
- Address the code-density and fetch-bandwidth issues of Itanium
The Encoding on Itanium
- The Itanium ISA puts instructions in bundles according to pre-defined templates
- Each bundle has 3 instruction slots (Slot 0, Slot 1, Slot 2) and a template field
- Trigger: a NOP with the appendix location encoded in its offset
- Boundary: uses 2 of the 8 undefined templates

Improving Fetch Bandwidth
- Two ways in which fetch bandwidth can degrade:
  - Wasteful instructions in the code stream: reduced code density; they occupy pipeline resources
  - Instruction cache misses: time spent waiting for cache fill
- Remedy: reduce wasteful instructions and reduce I-cache misses
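The trigger/boundary recognition above can be sketched in a few lines. This is a minimal illustration, not the real Itanium decode logic: the boundary template IDs and the offset-in-NOP encoding are hypothetical stand-ins for the two undefined templates and the offset field the slides describe.

```python
# Hypothetical: pretend templates 0x16 and 0x17 are two of the eight
# undefined Itanium bundle templates, repurposed as stream boundaries.
BOUNDARY_TEMPLATES = {0x16, 0x17}

def classify_bundle(template, slots):
    """Classify a 3-slot bundle as 'boundary', 'trigger', or 'normal'.

    slots is a list of three (mnemonic, immediate) pairs; a trigger is
    modeled as a NOP whose immediate carries the appendix-code offset.
    Returns (kind, offset) where offset is None unless kind == 'trigger'.
    """
    if template in BOUNDARY_TEMPLATES:
        return ("boundary", None)
    for mnemonic, imm in slots:
        if mnemonic == "nop" and imm != 0:
            return ("trigger", imm)  # imm = offset of the appendix code
    return ("normal", None)
```

A legacy decoder that knows nothing about the encoding simply executes the trigger as a NOP and never reaches the boundary templates, which is what makes the scheme backwards compatible.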
Wasteful Instructions: NOPs and Predication
- NOPs
- Predicated-false instructions: effectively become NOPs in the dynamic code stream
[Figure: a bundle whose Slot 1 holds a predicated-false instruction and whose Slot 2 holds a NOP]

Wasteful Instructions: Misalignment
- Branch targets must be bundle-aligned
- Taken branches might not be bundle-aligned
- With a 2-bundle-wide cache line, a misaligned taken branch or branch target leaves several of the line's 6 instruction slots fetched but unused
[Figure: two cache-line diagrams, one with 3 and one with 4 wasteful instruction slots]
Wasteful Instruction Profile
[Figure: percentage breakdown of fetched instructions per benchmark — gzip (27b), gcc (3b), mcf (4.7b), crafty (2.5b), parser (7.45b), gap (20b), average (50b) — into not wasteful, misalignment due to taken branch, misalignment due to branch target, predicated false, and NOP]
- On average, nearly a third of fetched instructions are wasteful

Stream-Based Encoding
- Streams: the instructions between two retired taken branches
- A small number of static streams covers about 90% of execution
[Figure: per-benchmark cumulative stream coverage curves for 164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, and 254.gap]
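The coverage claim above is a standard hot-stream measurement: weight each static stream by the dynamic instructions it retires, then ask what fraction of execution the hottest few streams account for. A minimal sketch, with a made-up trace format of (stream_id, length) records:

```python
from collections import Counter

def stream_coverage(stream_trace, top_n):
    """Fraction of dynamic instructions covered by the top_n hottest
    static streams. stream_trace is an iterable of (stream_id, length)
    records in execution order (hypothetical format)."""
    weight = Counter()
    for stream_id, length in stream_trace:
        weight[stream_id] += length
    total = sum(weight.values())
    covered = sum(w for _, w in weight.most_common(top_n))
    return covered / total
```

Run over a real retirement trace, a plot of `stream_coverage(trace, n)` against `n` gives exactly the per-benchmark coverage curves the slide shows.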
Streams vs. Basic Blocks
- Size: streams are bigger than basic blocks
- Location: streams reside in the appendix section; basic blocks reside in the original code
- Transitions and prediction:
  - Branch prediction: block→block, block→stream
  - Stream prediction: stream→stream, stream→block
[Figure: average basic block and stream sizes (instructions per basic block vs. instructions per stream) for gzip, gcc, mcf, crafty, parser, gap, and the average]

Stream Prediction
- The original code uses the traditional branch predictor; the appendix uses the stream predictor
- Enables more aggressive instruction prefetch
- Predict the next stream instead of the next basic block
- Fetch further into the code stream: different from regular prefetching
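The stream definition on the previous slide — instructions between two retired taken branches — is easy to state operationally. A minimal sketch over a hypothetical retirement trace of (pc, taken_branch) pairs:

```python
def split_into_streams(trace):
    """Split a retirement trace into streams.

    trace is a list of (pc, taken) pairs in retirement order, where taken
    is True for a retired taken branch. Each stream is returned as
    (start_pc, length): the run of instructions ending at a taken branch.
    """
    streams, current = [], []
    for pc, taken in trace:
        current.append(pc)
        if taken:
            streams.append((current[0], len(current)))
            current = []
    if current:  # trailing partial stream at end of trace
        streams.append((current[0], len(current)))
    return streams
```

Because not-taken branches stay inside a stream, a stream typically spans several basic blocks, which is why the average stream is much larger than the average basic block in the figure above.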
Stream Prefetching
- On a stream→stream transition, a prepare-to-branch operation prefetches the next stream

Experimental Setup
- Pipeline structure — in-order: 2-stage; out-of-order: 6-stage with an 8-instruction schedule window
- Branch predictor: 2K-entry GSHARE with a 256-entry, 4-way BTB; 2K-entry GSHARE with a 32-entry, 4-way BTB
- Instruction queue: 8 bundles (24 instructions)
- Execute bandwidth: 6 instructions
- Cache structure:
  - L1 (separate I and D): 6K, 4-way, 8-way banked, 2-cycle latency
  - L2 (shared): 256K, 4-way, 8-way banked, 4-cycle latency
  - L3 (shared): 3072K, 2-way, 1-way banked, 30-cycle latency
  - Fill buffer (MSHR): 6 entries; all caches have 64-byte lines
- Memory latency: 230 cycles; TLB miss penalty: 30 cycles
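To make the prefetch mechanism concrete: predicting the *next stream* only needs a table mapping a stream's start PC to the stream that followed it last time; the predicted start PC is what the prepare-to-branch prefetch targets. This is a deliberately minimal last-successor sketch — the predictor in the setup above is GSHARE-based and set-associative, which this does not model:

```python
class NextStreamPredictor:
    """Toy last-successor stream predictor.

    For each stream (identified by its start PC), remember the stream that
    followed it last time and predict that successor on re-execution. The
    prediction drives the prepare-to-branch prefetch of the next stream.
    """

    def __init__(self):
        self.successor = {}  # stream start PC -> last observed next stream PC

    def predict(self, stream_pc):
        """Return the predicted next-stream start PC, or None if unseen."""
        return self.successor.get(stream_pc)

    def update(self, stream_pc, next_stream_pc):
        """Train on the retired stream→stream transition."""
        self.successor[stream_pc] = next_stream_pc
```

Because a stream spans several basic blocks, one correct prediction covers many instructions of prefetch, which is what lets this fetch further ahead than a per-branch predictor.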
Objectives of Experiment
1. Quantify the benefit of code-density reduction: cache alignment and NOP removal
2. Quantify the benefit of branch and stream prediction
3. Quantify the benefit of stream prefetching via prepare-to-branch

Configurations
- All configurations include the code optimizations: cache alignment and NOP removal
- Microarchitectural variations combine the stream predictor, the branch predictor, and stream prefetching:
  - No Stream Prefetch
  - No Stream Prefetch, Hybrid Predictor
  - Stream Prefetch
  - Stream Prefetch, Hybrid Predictor
In-Order Performance
[Figure: speedup of the four configurations (No Stream Prefetch, No Stream Prefetch Hybrid Predictor, Stream Prefetch, Stream Prefetch Hybrid Predictor) over the baseline for crafty, gap, gcc, gzip, mcf, parser, and the average]
- Fetch bandwidth is not a critical resource for in-order pipelines
- 5% average speedup

Out-of-Order Performance
[Figure: speedup of the same four configurations for the same benchmarks on the out-of-order pipeline]
- Fetch bandwidth is critical for out-of-order pipelines
- 32% average speedup
- The encoding enables prefetching that reduces I-cache misses
Future Work
- Other optimizations via the encoding, continuing the fetch-bandwidth theme: removal of predicated-false instructions
- Alternative encodings: dependency encoding
- Dynamic construction
- Larger workloads: database applications; managed runtime environments (e.g. Java/.NET)

Related Work
- Trace cache (microarchitectural): E. Rotenberg, S. Bennett, and J. E. Smith. "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching."
- rePLay (microarchitectural): B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta. "Performance Characterization of a Hardware Framework for Dynamic Optimization."
- Dynamo (software, architectural): V. Bala, E. Duesterwald, and S. Banerjia. "Dynamo: A Transparent Dynamic Optimization System."
- Spike (software, architectural): A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J. L. Larriba-Pey, P. G. Lowney, and M. Valero. "Code Layout Optimizations for Transaction Processing Workloads."
Conclusion
- An intermediate encoding that is architecturally visible, implementation specific, and legacy-free
- Applied to Itanium: stream-based
- Targets fetch bandwidth: code density, plus an additional prefetch opportunity
- 5% speedup on in-order; 32% speedup on out-of-order