CS 152 Computer Architecture and Engineering
Lecture 7 - Memory Hierarchy-II
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
2/17/2009 CS152-Spring'09

Last time in Lecture 6
Dynamic RAM (DRAM) is the main form of main memory storage in use today. It holds values on small capacitors that need refreshing (hence "dynamic"), and has a slow multi-step access: precharge, read row, read column. Static RAM (SRAM) is faster but more expensive, and is used to build on-chip memory for caches.
Caches exploit two forms of predictability in memory reference streams:
- Temporal locality: the same location is likely to be accessed again soon
- Spatial locality: neighboring locations are likely to be accessed soon
A cache holds a small set of values in fast memory (SRAM) close to the processor. We need a search scheme to find values in the cache, and a replacement policy to make space for newly accessed locations.
Placement Policy
[Diagram: memory blocks 0-31 mapping into an 8-block cache under three placement policies.]
Where can memory block 12 be placed?
- Fully associative: anywhere in the cache
- (2-way) Set associative: anywhere in set 0 (12 mod 4 sets)
- Direct mapped: only into cache block 4 (12 mod 8 blocks)

Direct-Mapped Cache
[Diagram: the address is split into a tag, a k-bit index, and a b-bit block offset. The index selects one of 2^k cache lines; the stored tag is compared (=) with the address tag and, together with the valid bit V, signals HIT; the block offset selects the word or byte.]
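The placement example from the slide can be checked with a small sketch (illustrative code, not from the slides): a cache of N blocks organized into N/ways sets maps memory block number B into set (B mod number-of-sets).

```python
# Where can memory block 12 go in an 8-block cache under each placement policy?

def placement(block_number, num_blocks, ways):
    """Return the cache block indices where a memory block may be placed."""
    num_sets = num_blocks // ways
    s = block_number % num_sets          # set index = block number mod #sets
    return [s * ways + w for w in range(ways)]

# Fully associative: one set of 8 ways -> block 12 can go anywhere.
print(placement(12, 8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
# 2-way set-associative: 4 sets -> set 0 (12 mod 4), i.e. cache blocks 0 and 1.
print(placement(12, 8, 2))   # [0, 1]
# Direct-mapped: 8 one-way sets -> only cache block 4 (12 mod 8).
print(placement(12, 8, 1))   # [4]
```

This reproduces the three cases on the slide: anywhere, set 0, and block 4.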
2-Way Set-Associative Cache
[Diagram: as in the direct-mapped cache, the address is split into tag, k-bit index, and b-bit block offset, but the index now selects a set of two blocks. Both stored tags are compared (=) with the address tag in parallel; a match in either way signals HIT and selects that way's word or byte.]

Fully Associative Cache
[Diagram: no index field; the address is only a tag plus a b-bit block offset. Every block's stored tag is compared (=) with the address tag simultaneously; the matching way, if any, signals HIT and supplies the word or byte.]
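All three lookup diagrams assume the same address decomposition. A minimal sketch (the field widths k=7 and b=5 below are illustrative, not from the slides):

```python
# Split a 32-bit byte address into (tag, index, offset) fields,
# given k index bits and b block-offset bits.

def split_address(addr, k, b):
    offset = addr & ((1 << b) - 1)           # low b bits select the byte in the line
    index  = (addr >> b) & ((1 << k) - 1)    # next k bits select the set
    tag    = addr >> (b + k)                 # remaining high bits are the tag
    return tag, index, offset

# e.g. k=7 index bits (128 sets) and b=5 offset bits (32-byte lines)
tag, index, offset = split_address(0x1234ABCD, k=7, b=5)
print(hex(tag), index, offset)   # 0x1234a 94 13
```

For a fully associative cache, k=0: the index field disappears and the whole block address becomes the tag.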
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
- Random
- Least Recently Used (LRU): LRU cache state must be updated on every access. A true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way.
- First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches
- Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks
This is a second-order effect. Why? Replacement only happens on misses.

Block Size and Spatial Locality
A block is the unit of transfer between the cache and memory, e.g. a 4-word block (Word0 Word1 Word2 Word3) with b=2. The CPU address is split into a block address (32-b bits) and an offset (b bits); 2^b = block size, a.k.a. line size (in bytes).
Larger block size has distinct hardware advantages:
- less tag overhead
- exploits fast burst transfers from DRAM
- exploits fast burst transfers over wide busses
What are the disadvantages of increasing block size? Fewer blocks => more conflicts. Can waste bandwidth.
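The tree pseudo-LRU mentioned above can be sketched for a 4-way set with just three bits (one common convention among several; this is an illustration, not the slides' specific design): each bit points toward the half of the tree to evict next, and an access flips the bits on its path to point away from the way just used.

```python
class TreePLRU4:
    """4-way tree pseudo-LRU: 3 bits, each pointing toward the side to evict next."""
    def __init__(self):
        self.root = 0       # 0 -> evict from the left pair (ways 0/1), 1 -> right pair
        self.left = 0       # 0 -> evict way 0, 1 -> evict way 1
        self.right = 0      # 0 -> evict way 2, 1 -> evict way 3

    def access(self, way):
        # Point every bit on the path away from the way just used.
        if way < 2:
            self.root = 1               # next victim comes from the right pair
            self.left = 1 - (way & 1)   # within the left pair, the other way
        else:
            self.root = 0
            self.right = 1 - (way & 1)

    def victim(self):
        # Follow the bits down the tree to pick the pseudo-LRU way.
        return self.left if self.root == 0 else 2 + self.right

plru = TreePLRU4()
for w in [0, 1, 2, 3]:
    plru.access(w)
print(plru.victim())   # 0 - matches true LRU for this access sequence
```

With only 3 bits instead of the state needed for exact 4-way LRU, the tree sometimes picks a non-LRU victim, which is acceptable because replacement is a second-order effect.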
CPU-Cache Interaction (5-stage pipeline)
[Diagram: the PC addresses the primary instruction cache; on an instruction miss (hit? false) a nop is injected into the IR. Loads and stores access the primary data cache in the memory stage; the entire CPU stalls on a data cache miss while the line is refilled from lower levels of the memory hierarchy via the memory control.]

Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty
To improve performance:
- reduce the hit time
- reduce the miss rate
- reduce the miss penalty
What is the simplest design strategy? The biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32KB in modern technology).
[Design issues are more complex with out-of-order superscalar processors.]
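The average-memory-access-time formula above is easy to evaluate numerically; the numbers below are made up for illustration, not taken from the slides.

```python
# AMAT = hit time + miss rate * miss penalty (all times in cycles)

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 40-cycle miss penalty.
print(amat(1, 0.05, 40))   # 3.0 cycles on average
```

The three improvement levers map directly onto the three terms: shrink hit_time, miss_rate, or miss_penalty.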
Serial-versus-Parallel Cache and Memory Access
α is the HIT RATIO: the fraction of references that hit in the cache. 1 - α is the MISS RATIO: the remaining references.
Average access time for serial search (access the cache first, then memory on a miss):
  t_cache + (1 - α) t_mem
Average access time for parallel search (start the memory access alongside the cache lookup):
  α t_cache + (1 - α) t_mem
Savings are usually small: t_mem >> t_cache and the hit ratio α is high. Parallel search also requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache.

Causes for Cache Misses
- Compulsory: first reference to a block, a.k.a. cold start misses - misses that would occur even with an infinite cache
- Capacity: cache is too small to hold all data needed by the program - misses that would occur even under a perfect replacement policy
- Conflict: misses that occur because of collisions due to the block-placement strategy - misses that would not occur with full associativity
Effect of Cache Parameters on Performance
- Larger cache size: + reduces capacity and conflict misses; - hit time will increase
- Higher associativity: + reduces conflict misses; - may increase hit time
- Larger block size: + reduces compulsory and capacity (reload) misses; - increases conflict misses and miss penalty

Write Policy Choices
On a cache hit:
- write through: write both cache and memory; generally higher traffic, but simplifies cache coherence
- write back: write cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic
On a cache miss:
- no write allocate: only write to main memory
- write allocate (a.k.a. fetch on write): fetch the block into the cache
Common combinations: write through with no write allocate; write back with write allocate.
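A rough sketch of the traffic difference between the two common combinations, using a made-up store trace and a deliberately simplified model (it ignores capacity evictions and read traffic; purely illustrative):

```python
# Compare memory write traffic for a short trace of store addresses.

def write_through_traffic(writes):
    # Write through: every store also goes to memory.
    return len(writes)

def write_back_traffic(writes, block_of):
    # Write back + write allocate + dirty bit: memory sees at most one
    # writeback per dirty block, not one write per store (simplified model).
    dirty_blocks = {block_of(addr) for addr in writes}
    return len(dirty_blocks)

trace = [0x100, 0x104, 0x108, 0x10C, 0x200]   # five stores, two distinct lines
blk = lambda addr: addr >> 5                  # 32-byte blocks
print(write_through_traffic(trace))           # 5 memory writes
print(write_back_traffic(trace, blk))         # 2 dirty blocks written back
```

Stores that hit repeatedly in the same line are exactly where the dirty bit pays off, which is why write back is paired with write allocate.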
Cache Write Performance
[Diagram: the direct-mapped cache from before, extended with a write enable (WE) gated by the hit signal: 2^k lines, tag compare (=), valid bit V, HIT, word or byte select.]

Reducing Write Hit Time
Problem: writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write if it hits.
Solutions:
- Design a data RAM that can perform a read and a write in one cycle; restore the old value after a tag miss
- Fully-associative (CAM tag) caches: the word line is only enabled on a hit
- Pipelined writes: hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check
CS152 Administrivia

Pipelining Cache Writes
[Diagram: the address and store data arrive from the CPU; a delayed-write address register and a delayed-write data buffer sit between the tag check (=?) and the data array. Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.]
Write Buffer to Reduce Read Miss Penalty
[Diagram: CPU with register file and L1 data cache; a write buffer sits between the data cache and the unified L2 cache.]
The write buffer holds evicted dirty lines for a writeback cache, OR all writes in a writethrough cache. The processor is not stalled on writes, and read misses can go ahead of writes to main memory.
Problem: the write buffer may hold the updated value of a location needed by a read miss.
- Simple scheme: on a read miss, wait for the write buffer to go empty
- Faster scheme: check the write buffer addresses against the read miss address; if no match, allow the read miss to go ahead of the writes; else, return the value in the write buffer

Block-level Optimizations
Tags are too large, i.e., too much overhead. Simple solution: larger blocks, but then the miss penalty could be large.
Sub-block placement (a.k.a. sector cache):
- A valid bit is added to units smaller than the full block, called sub-blocks
- Only read a sub-block on a miss
- If a tag matches, is the word in the cache?
[Diagram: three tags, 100, 300, and 204, each with four per-sub-block valid bits: 1 1 1 1, 1 1 0 0, and 0 1 0 1.]
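The "faster scheme" above can be sketched in a few lines (a behavioral illustration with invented names, not the hardware on the slide): a read miss probes the pending-write addresses and is either forwarded the buffered value or allowed to bypass the queued writes.

```python
class WriteBuffer:
    """Behavioral model of a write buffer with address-match forwarding."""
    def __init__(self):
        self.pending = {}   # block address -> data awaiting writeback (ordered dict)

    def add_write(self, addr, data):
        self.pending[addr] = data      # a later write to the same block overwrites

    def read_miss(self, addr):
        if addr in self.pending:
            # Match: the freshest value is still in the buffer, so return it
            # rather than reading a stale copy from memory.
            return ("forwarded", self.pending[addr])
        # No match: the read miss may safely go ahead of all queued writes.
        return ("bypass", None)

wb = WriteBuffer()
wb.add_write(0x100, "dirty line A")
print(wb.read_miss(0x100))  # ('forwarded', 'dirty line A')
print(wb.read_miss(0x200))  # ('bypass', None)
```

The simple scheme corresponds to draining self.pending completely before servicing any read miss, which is correct but slower.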
Set-Associative RAM-Tag Cache
[Diagram: per-way tag and status RAMs, each with a comparator (=?), plus per-way data RAMs, all indexed by the index and offset fields of the address.]
This organization is not energy-efficient: a tag and a data word are read from every way.
Two-phase approach: first read the tags, then read data only from the selected way. This is more energy-efficient but doubles the latency, which is OK for L2 and above. Why?

Multilevel Caches
A memory cannot be both large and fast, so cache size increases at each level: CPU -> L1$ -> L2$ -> DRAM.
- Local miss rate = misses in cache / accesses to that cache
- Global miss rate = misses in cache / CPU memory accesses
- Misses per instruction = misses in cache / number of instructions
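The three miss-rate definitions above relate simply for a two-level hierarchy; the counts below are made up for illustration:

```python
# Two-level hierarchy: every L1 miss becomes an L2 access.
cpu_accesses = 1000
l1_misses    = 40      # = number of L2 accesses
l2_misses    = 10

l1_local  = l1_misses / cpu_accesses    # L1 local = global miss rate: 0.04
l2_local  = l2_misses / l1_misses       # L2 local miss rate: 0.25
l2_global = l2_misses / cpu_accesses    # L2 global miss rate: 0.01
print(l1_local, l2_local, l2_global)

# The global miss rate is the product of the local miss rates along the path.
assert abs(l2_global - l1_local * l2_local) < 1e-12
```

Note why the local L2 miss rate looks alarmingly high (25%): L2 only sees the references that L1 already filtered, so global rates or misses per instruction are the fairer metrics.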
A Typical Memory Hierarchy c.2008
[Diagram: CPU with multiported register file (part of the CPU), split L1 instruction and data primary caches (on-chip SRAM), a large unified secondary L2 cache (on-chip SRAM), and multiple interleaved memory banks (off-chip DRAM).]

Presence of L2 influences L1 design
- Use a smaller L1 if there is also an L2: trade an increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty; this also reduces average access energy.
- Use a simpler write-through L1 with an on-chip L2:
  - the write-back L2 cache absorbs the write traffic, which doesn't go off-chip
  - at most one L1 miss request per L1 access (no dirty victim write back), which simplifies pipeline control
  - simplifies coherence issues
  - simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)
Inclusion Policy
- Inclusive multilevel cache: the inner cache holds copies of data in the outer cache, so an external access need only check the outer cache. This is the most common case.
- Exclusive multilevel caches: the inner cache may hold data not in the outer cache; lines are swapped between the inner and outer caches on a miss. Used in the AMD Athlon, with a 64KB primary and 256KB secondary cache.
Why choose one type or the other?

Itanium-2 On-Chip Caches (Intel/HP, 2002)
- Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency
Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course 6.823. UCB material derived from course CS252.