L1 Data Cache Decomposition for Energy Efficiency


L1 Data Cache Decomposition for Energy Efficiency
Michael Huang, Jose Renau, Seung-Moon Yoo, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu/flexram

Objective
- Reduce L1 data cache energy consumption
- No performance degradation
- Partition the cache in multiple ways
- Specialization for stack accesses
International Symposium on Low Power Electronics and Design, August 2001

Outline
- L1 D-Cache decomposition
- Specialized Stack Cache
- Pseudo Set-Associative Cache
- Simulation Environment
- Evaluation
- Conclusions

L1 D-Cache Decomposition
- A Specialized Stack Cache (SSC)
- A Pseudo Set-Associative Cache (PSAC)

Selection
- Selection is done in the decode stage to speed up the access
- Based on instruction address and opcode
- 2 Kbit table to predict the PSAC way
[Diagram: address and opcode steer each access to the PSAC or the SSC]
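The decode-stage selection above can be sketched as follows. This is a minimal model of the idea, not the authors' implementation: assuming the 2 Kbit table is organized as 1024 entries of 2 bits (one predicted way out of four) indexed by a hash of the load's instruction address, and that stack accesses bypass the predictor and go straight to the SSC.

```python
TABLE_ENTRIES = 1024          # 1024 entries x 2 bits = 2 Kbit (assumed layout)

class WayPredictor:
    """Per-instruction PSAC way predictor consulted in the decode stage."""

    def __init__(self):
        self.table = [0] * TABLE_ENTRIES   # predicted way (0..3) per entry

    def index(self, pc):
        # Word-aligned PC hash; the real indexing function is not given
        # on the slide, so this is an illustrative choice.
        return (pc >> 2) % TABLE_ENTRIES

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, actual_way):
        self.table[self.index(pc)] = actual_way

def route(pc, is_stack_access, predictor):
    """Steer the access in decode: stack references go to the SSC,
    everything else probes the predicted PSAC way first."""
    if is_stack_access:
        return ("SSC", None)
    return ("PSAC", predictor.predict(pc))
```

A usage sketch: after a load at PC 0x400100 is seen to hit in way 3, the predictor is trained and subsequent executions of that load probe way 3 first.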

Stack Cache
- Small, direct-mapped cache
- Virtually tagged
- Software optimizations:
  - Very important to reduce stack cache size
  - Avoid thrashing: allocate large structs in the heap
- Easy to implement

SSC: Specialized Stack Cache
- Pointers to reduce traffic:
  - TOS: reduces the number of write-backs
  - SRB (safe-region-bottom): reduces unnecessary line fills on write misses
- Region between TOS and SRB is safe (missing lines are not initialized)
- Infrequent access
[Diagram: TOS and SRB pointers delimiting the safe region as the stack grows]
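One reading of the TOS/SRB filtering, sketched as code. This is our hedged interpretation of the slide, assuming a downward-growing stack where the region between SRB and TOS holds data from popped frames: dirty evictions there need no write-back, and write misses there need no line fill, since the contents are dead or uninitialized.

```python
def needs_writeback(line_addr, tos, srb, dirty):
    """A dirty line is written back only if it may hold live data;
    lines in the safe region [srb, tos) are assumed dead (popped frames)."""
    return dirty and not (srb <= line_addr < tos)

def needs_linefill(line_addr, tos, srb):
    """On a write miss, a line in the safe region is known to be
    uninitialized, so the fill from the next level can be skipped."""
    return not (srb <= line_addr < tos)
```

For example, with TOS = 0x8000 and SRB = 0x7000, a dirty eviction at 0x7F00 is silently dropped, while one at 0x9000 (live stack data) still goes to memory.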

Pseudo Set-Associative Cache
- Partition the cache in 4 ways
- Evaluated activation policies: Sequential, FallBackReg, Phased, FallBackPha, PredictPha
[Diagram: tag and data arrays of the four ways]

Sequential (Calder 96)
[Diagram: ways probed one per cycle over cycles 1-3]

Fallback-regular (Inoue 99)
[Diagram: probe sequence over cycles 1-2]

Phased Cache (Hasegawa 95)
[Diagram: tag comparison in cycle 1, data access in cycle 2]

Fallback-phased (ours)
- Emphasis on energy reduction
[Diagram: probe sequence over cycles 1-3]

Predictive Phased (ours)
- Emphasis on performance
[Diagram: probe sequence over cycles 1-2]
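The five activation policies above can be compared with a back-of-the-envelope cost model. The per-access costs below are illustrative simplifications of the slides' timing diagrams (not the paper's CACTI numbers): each policy trades data-array activations (energy) against probe cycles (delay) on a hit in a 4-way PSAC.

```python
WAYS = 4

def access_cost(policy, hit_way, predicted_way=0):
    """Return (data_ways_read, probe_cycles) for a load that hits in
    `hit_way` under each activation policy (illustrative model)."""
    if policy == "Sequential":      # probe one way per cycle until the hit
        return hit_way + 1, hit_way + 1
    if policy == "FallBackReg":     # way 0 first; on miss, all ways in parallel
        return (1, 1) if hit_way == 0 else (1 + WAYS, 2)
    if policy == "Phased":          # all tags in cycle 1, one data way in cycle 2
        return 1, 2
    if policy == "FallBackPha":     # way 0 first; on miss, fall back to a phased probe
        return (1, 1) if hit_way == 0 else (2, 3)
    if policy == "PredictPha":      # predicted way first; on mispredict, phased probe
        return (1, 1) if hit_way == predicted_way else (2, 2)
    raise ValueError(policy)
```

The model makes the slide's trade-off visible: Phased never reads more than one data way (low energy, fixed 2-cycle hits), while PredictPha matches Phased's worst case but collapses correctly predicted hits to a single cycle and a single data-way read.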

Simulation Environment
- Baseline configuration:
  - Processor: 1 GHz, R10000-like
  - L1: 32 KB, 2-way
  - L2: 512 KB, 8-way, phased cache
  - Memory: 1 Rambus channel
- Energy model: extended CACTI
- Energy is for the data memory hierarchy only

Applications
- Multimedia:
  - Mp3dec: MP3 decoder
  - Mp3enc: MP3 encoder
- SPECint:
  - Gzip: data compression
  - Crafty: chess game
  - MCF: traffic model
- Scientific:
  - Bsom: data mining
  - Blast: protein matching
  - Treeadd: Olden tree search

Adding a Stack Cache
Normalized to baseline:

Config      Delay  Energy  E*D
PLAIN 256B  1.01   0.83    0.84
SSC 256B    1.00   0.80    0.81
PLAIN 512B  0.99   0.78    0.77
SSC 512B    0.99   0.77    0.76
PLAIN 1KB   0.99   0.77    0.76
SSC 1KB     0.98   0.76    0.75

For the same size, the Specialized Stack Cache is always better.
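The metric behind the E*D bars is the energy-delay product: per application, normalized energy times normalized delay, then averaged across the suite (which is why the averaged E*D bar is close to, but not exactly, the product of the averaged energy and delay bars). A minimal illustration with made-up per-application numbers:

```python
# Energy-delay product per application, then averaged across the
# suite (illustrative numbers, not the paper's measurements).
apps = {"mp3dec": (0.99, 0.80),   # (normalized delay, normalized energy)
        "gzip":   (1.00, 0.76)}
ed = {name: d * e for name, (d, e) in apps.items()}
avg_ed = sum(ed.values()) / len(ed)
```

E*D is a common way to reward designs that save energy without giving the savings back as extra execution time.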

Pseudo Set-Associative Cache
Normalized to baseline:

Config             Delay  Energy  E*D
4-way Sequential   1.05   0.68    0.72
4-way FallBackReg  0.99   0.69    0.69
4-way Phased       1.05   0.74    0.78
4-way FallBackPha  1.01   0.67    0.68
4-way PredictPha   0.98   0.68    0.67

PredictPha has the best delay and energy-delay product.

PSAC: 2-way vs. 4-way
Normalized to baseline:

Config            Delay  Energy  E*D
2-way Sequential  0.99   0.78    0.77
2-way PredictPha  0.97   0.79    0.76
4-way PredictPha  0.98   0.68    0.67

For E*D, the 4-way PSAC is better than the 2-way.

Pseudo Set-Associative + Specialized Stack Cache
Normalized to baseline:

Config                      Delay  Energy  E*D
4-way PredictPha            0.98   0.68    0.67
4-way PredictPha + SSC256B  0.98   0.61    0.60
4-way PredictPha + SSC512B  0.97   0.58    0.56
4-way PredictPha + SSC1KB   0.96   0.57    0.55

Combining PSAC and SSC reduces E*D by 44% on average.
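The headline 44% figure follows directly from the slide's normalized E*D of 0.56 for the PredictPha + SSC512B configuration:

```python
# Normalized E*D for 4-way PredictPha + SSC512B (from the slide).
ed_combined = 0.56
reduction = 1.0 - ed_combined          # fraction of baseline E*D saved
print(f"{reduction:.0%}")              # prints "44%"
```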

Area Constrained: Small PSAC + SSC
Normalized to baseline:

Config                           Delay  Energy  E*D
24KB 3-way PredictPha            0.98   0.74    0.72
24KB 3-way PredictPha + SSC512B  0.98   0.61    0.60
32KB 4-way PredictPha + SSC512B  0.97   0.58    0.56

An SSC plus a small PSAC delivers a cost-effective E*D design.

Energy Breakdown
[Figure: energy broken down into SSC, L1, L2, and memory components for BLAST, MCF, and MP3D under the Baseline, 4-way PSAC, SSC512B, and combined configurations, normalized to baseline]

Conclusions
- Stack cache: important for energy efficiency
  - SW optimization required for stack caches
  - Effective Specialized Stack Cache extensions
- Pseudo Set-Associative Cache:
  - 4-way more effective than 2-way
  - Predictive Phased PSAC has the lowest E*D
- Effective to combine PSAC and SSC
  - E*D reduced by 44% on average

Backup Slides

Cache Energy
[Figure: per-access energy (pJ, up to 2000) vs. cache size from 4 KB to 64 KB for 1-way, 2-way, and 4-way caches]

Extended CACTI
- New sense amplifier
- 15% bit-line swing for reads
- Full bit-line swing for writes
- Different energy for reads, writes, line fills, and write-backs
- Multiple optimization parameters

SSC Energy Overhead
- Small energy consumption required to use TOS and SRB
- Registers updated at function call and return
- Registers checked on cache miss

Miss Rate
[Figure: miss rate (0-12%) vs. cache size from 4 KB to 64 KB for BLAST, BSOM, CRAFTY, GZIP, MCF, MP3D, MP3E, and TREE]

Overview