CSE 490/590 Computer Architecture
Cache I
Steve Ko
Computer Sciences and Engineering
University at Buffalo

Last Time
  Pipelining hazards
    Structural hazards
    Data hazards
    Control hazards
  Data hazards
    Stall
    Bypass
  Control hazards
    Jump
    Conditional branch

Branch Delay Slots (expose control hazard to software)
  Change the ISA semantics so that the instruction that follows a jump or branch is always executed
    gives the compiler the flexibility to put a useful instruction where normally a pipeline bubble would have resulted.

  I1  096  ADD
  I2  100  BEQZ r1 +200
  I3  104  ADD    (delay slot instruction, executed regardless of branch outcome)
  I4  304  ADD

  Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later

Branch Pipeline Diagrams (branch delay slot)

  time                 t0  t1  t2  t3  t4  t5  t6  t7 ....
  (I1) 096: ADD        IF1 ID1 EX1 MA1 WB1
  (I2) 100: BEQZ +200      IF2 ID2 EX2 MA2 WB2
  (I3) 104: ADD                IF3 ID3 EX3 MA3 WB3
  (I4) 304: ADD                    IF4 ID4 EX4 MA4 WB4

  Resource Usage
  time  t0  t1  t2  t3  t4  t5  t6  t7 ....
  IF    I1  I2  I3  I4
  ID        I1  I2  I3  I4
  EX            I1  I2  I3  I4
  MA                I1  I2  I3  I4
  WB                    I1  I2  I3  I4

Why an Instruction may not be dispatched every cycle (CPI>1)
  Full bypassing may be too expensive to implement
    typically all frequently used paths are provided
    some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
  Loads have two-cycle latency
    Instruction after load cannot use load result
    MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
      » MIPS: "Microprocessor without Interlocked Pipeline Stages"
  Conditional branches may cause bubbles
    kill following instruction(s) if no delay slots

Early Read-Only Memory Technologies
  Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
  Punched paper tape, instruction stream in Harvard Mk 1
  Diode Matrix, EDSAC-2 µcode store
  IBM Card Capacitor ROS
  IBM Balanced Capacitor ROS

Early Read/Write Main Memory Technologies
  Babbage, 1800s: Digits stored on mechanical wheels
  Williams Tube, Manchester Mark 1, 1947
  Mercury Delay Line, Univac 1, 1951
  Also, regenerative capacitor memory on Atanasoff-Berry computer, and rotating magnetic drum memory on IBM 650

Semiconductor Memory
  Semiconductor memory began to be competitive in early 1970s
    Intel formed to exploit market for semiconductor memory
    Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
  First commercial Dynamic RAM (DRAM) was Intel 1103
    1Kbit of storage on single chip
    charge on a capacitor used to hold value
  Semiconductor memory quickly replaced core in '70s

Modern DRAM Structure
  (Figure: die photo of a modern DRAM.)
  [Samsung, sub-70nm DRAM, 2004]

DRAM Architecture
  (Figure: array of memory cells, one bit each, with word lines for Row 1 ... Row 2^N and bit lines for Col. 1 ... Col. 2^M. Of the N+M address bits, N go to the Row Address Decoder and M go to the Column Decoder & Sense Amplifiers, which produce data D.)
  Bits stored in 2-dimensional arrays on chip
  Modern chips have around 4 logical banks on each chip
    each logical bank physically implemented as many smaller arrays

DRAM Operation
  Three steps in read/write access to a given bank
  Row access (RAS)
    decode row address, enable addressed row (often multiple Kb in row)
    bitlines share charge with storage cell
    small change in voltage detected by sense amplifiers which latch whole row of bits
    sense amplifiers drive bitlines full rail to recharge storage cells
  Column access (CAS)
    decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
    on read, send latched bits out to chip pins
    on write, change sense amplifier latches which then charge storage cells to required value
    can perform multiple column accesses on same row without another row access (burst mode)
  Precharge
    charges bit lines to known value, required before next row access
  Each step has a latency of around 15-20ns in modern DRAMs
  Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture

DRAM Packaging
  (Figure: DRAM chip with three groups of pins: clock and control signals, ~7; address lines carrying the multiplexed row/column address, ~12; data bus, 4, 8, 16, or 32 bits.)
  DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
  Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)
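
The three DRAM access steps can be turned into a rough latency estimate. The sketch below is only an illustration under stated assumptions: each step costs a flat 15 ns (the low end of the 15-20ns range quoted on the DRAM Operation slide), the bank starts precharged, and a row stays latched in the sense amplifiers until a different row is requested. The names `STEP_NS`, `access_latency_ns`, and `run` are hypothetical and not part of any DRAM standard.

```python
# Assumption: every step (precharge, row access, column access) costs STEP_NS.
STEP_NS = 15

def access_latency_ns(open_row, requested_row, step_ns=STEP_NS):
    """Latency of one access to a bank, given which row (if any) is latched."""
    if open_row == requested_row:
        return step_ns        # column access only (burst mode on the open row)
    if open_row is None:
        return 2 * step_ns    # row access + column access (bank already precharged)
    return 3 * step_ns        # precharge + row access + column access

def run(trace, step_ns=STEP_NS):
    """Total latency for a sequence of row numbers sent to one bank."""
    open_row, total = None, 0
    for row in trace:
        total += access_latency_ns(open_row, row, step_ns)
        open_row = row
    return total

same_row = run([5, 5, 5, 5])      # one row access, then three column accesses
alternating = run([5, 9, 5, 9])   # every access after the first needs a precharge
dimm_width_bits = 16 * 4          # 16 chips, each contributing 4 data bits

print(same_row, alternating, dimm_width_bits)
```

Under these assumptions, four accesses to the same row cost less than half as much as four accesses that alternate between two rows, which is why burst mode matters.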

CPU-Memory Bottleneck
  Performance of high-speed computers is usually limited by memory bandwidth & latency
  Latency (time for a single access)
    Memory access time >> Processor cycle time
    Problematic
  Bandwidth (number of accesses per unit time)
    Increase the bus size, etc.
    Usually OK

Processor-DRAM Gap (latency)
  (Figure: performance on a log scale from 1 to 1000 versus time, 1980 to 2000. µProc improves 60%/year, DRAM improves 7%/year; the Processor-Memory Performance Gap grows 50%/year.)
  Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!

Physical Size Affects Latency
  (Figure: CPU with a small memory versus CPU with a big memory.)
  In the big memory:
    Signals have further to travel
    Fan out to more locations

CSE 490/590 Administrivia
  Very important to attend
    Recitations next week & the week after
  Guest lectures
    There will be a couple of guest lectures late Feb/early Mar.
  Quiz 1
    Rescheduled: Fri, 2/11
    Closed book, in-class
    Includes lectures until last Monday (1/31)
    Review: next Wed (2/9)

Memory Hierarchy
  (Figure: CPU connected to A, a small, fast memory (RF, SRAM) that holds frequently used data, which is connected to B, a big, slow memory (DRAM).)
  capacity: Register << SRAM << DRAM     why?
  latency: Register << SRAM << DRAM      why?
  bandwidth: on-chip >> off-chip         why?
  On a data access:
    if data in fast memory => low latency access (SRAM)
    if data not in fast memory => long latency access (DRAM)

Relative Memory Cell Sizes
  (Figure: on-chip SRAM cell in a logic chip next to a DRAM cell on a memory chip.)
  [ Foss, "Implementing Application-Specific Memory", ISSCC 1996 ]
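
The two numbers on the Processor-DRAM Gap slide (800 instructions, 50%/year) follow from simple arithmetic. The sketch below reproduces them; the function names are hypothetical, and the 800 figure assumes the processor would otherwise fill every issue slot.

```python
def issue_slots_per_access(issue_width, clock_ghz, access_ns):
    """Instruction issue slots that pass during one memory access."""
    cycles = clock_ghz * access_ns   # GHz x ns = clock cycles
    return issue_width * cycles

def relative_gap_growth(cpu_growth, dram_growth):
    """Yearly growth of the processor/DRAM performance ratio."""
    return (1 + cpu_growth) / (1 + dram_growth) - 1

slots = issue_slots_per_access(issue_width=4, clock_ghz=2, access_ns=100)
gap = relative_gap_growth(cpu_growth=0.60, dram_growth=0.07)

print(slots)          # 800
print(f"{gap:.1%}")   # roughly 50% per year
```

A 100 ns access spans 200 cycles at 2 GHz, and four issue slots per cycle gives 800. Dividing the two growth rates (1.60 / 1.07) gives a ratio that grows by just under 50% each year.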

Levels of the Memory Hierarchy

  Level        Capacity    Access Time            Cost
  Registers    100s Bytes  <10s ns
  Cache        K Bytes     10-100 ns              1-0.1 cents/bit
  Main Memory  M Bytes     200ns-500ns            $.0001-.00001 cents/bit
  Disk         G Bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit
  Tape         infinite    sec-min                10^-8

  Between levels        Staging Xfer Unit  Managed by      Size
  Registers <-> Cache   Instr. Operands    prog./compiler  1-8 bytes
  Cache <-> Memory      Blocks             cache cntl      8-128 bytes
  Memory <-> Disk       Pages              OS              512-4K bytes
  Disk <-> Tape         Files              user/operator   Mbytes

  Upper Level: faster
  Lower Level: larger

Memory Hierarchy: Apple iMac G5 (1.6 GHz)

              Reg          L1 Inst      L1 Data      L2           DRAM         Disk
  Size        1K           64K          32K          512K         256M         80G
  Latency     1, 0.6 ns    3, 1.9 ns    3, 1.9 ns    11, 6.9 ns   88, 55 ns    10^7, 12 ms
  (cycles, time)
  Managed by  compiler     hardware     hardware     hardware     OS, hardware, application

  Goal: Illusion of large, fast, cheap memory
  Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access

Management of Memory Hierarchy
  Small/fast storage, e.g., registers
    Address usually specified in instruction
    Generally implemented directly as a register file
      » but hardware might do things behind software's back, e.g., stack management, register renaming
  Larger/slower storage, e.g., main memory
    Address usually computed from values in register
    Generally implemented as a hardware-managed cache hierarchy
      » hardware decides what is kept in fast memory
      » but software may provide "hints", e.g., don't cache or prefetch

Real Memory Reference Patterns
  (Figure: memory address, one dot per access, plotted against time.)
  Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Typical Memory Reference Patterns
  (Figure: address plotted against time, with three bands.)
    Instruction fetches: n loop iterations, subroutine call, subroutine return
    Stack accesses: argument access
    Data accesses: vector access, scalar accesses

Common Predictable Patterns
  Two predictable properties of memory references:
    Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
    Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
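
The two locality properties can be made concrete by counting them in a synthetic address trace. The sketch below is an illustration only: the trace generator, the "reused" and "near" counters, and the 4-byte element size are all assumptions chosen for the example, not measurements of a real program.

```python
def loop_trace(base, n_elems, elem_size, iterations):
    """Addresses touched by a loop that sweeps an array `iterations` times."""
    return [base + i * elem_size
            for _ in range(iterations)
            for i in range(n_elems)]

def locality_counts(trace, near_bytes):
    """Count accesses that show temporal or spatial locality.

    reused: the exact address was referenced earlier (temporal locality).
    near:   the address lies within `near_bytes` of the previous access,
            without being the same address (spatial locality).
    """
    seen = set()
    reused = near = 0
    prev = None
    for addr in trace:
        if addr in seen:
            reused += 1
        seen.add(addr)
        if prev is not None and 0 < abs(addr - prev) <= near_bytes:
            near += 1
        prev = addr
    return reused, near

# Three sweeps over an 8-element array of 4-byte values.
trace = loop_trace(base=0x1000, n_elems=8, elem_size=4, iterations=3)
reused, near = locality_counts(trace, near_bytes=4)
print(len(trace), reused, near)
```

In this trace every access after the first sweep repeats an earlier address, and almost every access is adjacent to the one before it, which is the vector-access pattern from the Typical Memory Reference Patterns slide.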
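
Memory Reference Patterns
  (Figure: memory address, one dot per access, plotted against time, with regions of Spatial Locality and Temporal Locality marked.)
  Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
  CSE 490/590, Spring 2011

Caches
  Caches exploit both types of predictability:
    Exploit temporal locality by remembering the contents of recently accessed locations.
    Exploit spatial locality by fetching blocks of data around recently accessed locations.

Inside a Cache
  (Figure: Processor exchanges Address and Data with the CACHE, which exchanges Address and Data with Main Memory. Each cache line holds an Address Tag (examples: 100, 304, 6848, 416) and a Data Block of several Data Bytes, e.g. a copy of main memory location 100 followed by a copy of main memory location 101.)

Cache Algorithm (Read)
  Look at Processor Address, search cache tags to find match. Then either
    Found in cache, a.k.a. HIT
      Return copy of data from cache
    Not in cache, a.k.a. MISS
      Read block of data from Main Memory
      Wait ...
      Return data to processor and update cache
  Q: Which line do we replace?

Placement Policy
  (Figure: memory with Block Numbers 0 to 31; an 8-block cache shown three ways, with Set Numbers 0 to 3 for the 2-way case and block numbers 0 to 7 otherwise.)
  Memory block 12 can be placed:
    Fully Associative: anywhere
    (2-way) Set Associative: anywhere in set 0 (12 mod 4)
    Direct Mapped: only into block 4 (12 mod 8)

Direct-Mapped Cache
  (Figure: the address is split into Tag, Index (k bits), and Block Offset. The index selects one of 2^k lines, each holding a valid bit V, a Tag, and a Data Block. The stored tag is compared with the address tag to signal HIT, and the offset selects the Data Word or Byte.)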
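
The read algorithm and the direct-mapped layout above can be sketched as a small simulator. This is a minimal illustration, not the hardware design from the slides: it tracks only valid bits and tags (no data), and the class name `DirectMappedCache` and its parameters are hypothetical. The example uses 8 lines and 4-byte blocks so that memory block 12 lands in line 4, matching the Placement Policy slide.

```python
class DirectMappedCache:
    """Tag store of a direct-mapped cache with 2**index_bits lines."""

    def __init__(self, index_bits, offset_bits):
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        lines = 1 << index_bits
        self.valid = [False] * lines
        self.tags = [0] * lines

    def split(self, addr):
        """Split a byte address into (tag, index, block offset)."""
        offset = addr & ((1 << self.offset_bits) - 1)
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def read(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = self.split(addr)
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the block would be fetched from main memory here,
        # replacing whatever the indexed line held before.
        self.valid[index] = True
        self.tags[index] = tag
        return False

cache = DirectMappedCache(index_bits=3, offset_bits=2)
print(cache.split(48))   # byte address 48 is the start of memory block 12

# Blocks 12 and 4 both map to line 4, so they evict each other.
results = [cache.read(a) for a in (48, 49, 16, 48)]
print(results)
```

The second read hits because address 49 sits in the same block as 48 (spatial locality). The last read misses because block 4 replaced block 12 in the only line either can use, which is the conflict that set-associative placement reduces.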
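
Direct Map Address Selection
  higher-order vs. lower-order address bits
  (Figure: same direct-mapped structure, with the address split into Index (k bits), Tag, and Block Offset; 2^k lines, each with V, Tag, and Data Block; output is HIT and the Data Word or Byte.)

2-Way Set-Associative Cache
  (Figure: the address is split into Tag, Index (k bits), and Block Offset. The index selects one set containing two lines, each with V, Tag, and Data Block. Both tags are compared in parallel to signal HIT and select the Data Word or Byte.)

Fully Associative Cache
  (Figure: the address is split into Tag and Block Offset only. Every line holds V, Tag, and Data Block, and all tags are compared in parallel to signal HIT and select the Data Word or Byte.)

Replacement Policy
  In an associative cache, which block from a set should be evicted when the set becomes full?
    Random
    Least Recently Used (LRU)
      LRU cache state must be updated on every access
      true implementation only feasible for small sets (2-way)
      pseudo-LRU binary tree often used for 4-8 way
    First In, First Out (FIFO) a.k.a. Round-Robin
      used in highly associative caches
    Not Least Recently Used (NLRU)
      FIFO with exception for most recently used block or blocks
  This is a second-order effect. Why?
    Replacement only happens on misses

Acknowledgements
  These slides heavily contain material developed and copyright by
    Krste Asanovic (MIT/UCB)
    David Patterson (UCB)
  And also by:
    Arvind (MIT)
    Joel Emer (Intel/MIT)
    James Hoe (CMU)
    John Kubiatowicz (UCB)
  MIT material derived from course 6.823
  UCB material derived from course CS252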
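
Supplementary sketch for the Replacement Policy slide: the difference between LRU and FIFO shows up in a small set-associative model. This is an illustration under assumptions, not a description of real hardware: it works on block numbers rather than byte addresses, keeps a true LRU order (which the slide notes is only practical for small sets), and the class name `SetAssociativeCache` and its `policy` argument are hypothetical.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Set-associative tag store with LRU or FIFO replacement."""

    def __init__(self, num_sets, ways, policy="lru"):
        if policy not in ("lru", "fifo"):
            raise ValueError("policy must be 'lru' or 'fifo'")
        self.num_sets = num_sets
        self.ways = ways
        self.policy = policy
        # One ordered dict per set; the first key is the next victim.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block):
        """Return True on a hit, False on a miss (after filling the set)."""
        lines = self.sets[block % self.num_sets]
        if block in lines:
            if self.policy == "lru":
                lines.move_to_end(block)   # LRU state changes on every access
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)      # evict the oldest entry in the order
        lines[block] = None
        return False

# 4 sets, 2 ways: blocks 12, 0, and 4 all map to set 0.
sequence = [12, 0, 12, 4, 0]
lru = SetAssociativeCache(num_sets=4, ways=2, policy="lru")
fifo = SetAssociativeCache(num_sets=4, ways=2, policy="fifo")
lru_hits = [lru.access(b) for b in sequence]
fifo_hits = [fifo.access(b) for b in sequence]
print(lru_hits)
print(fifo_hits)
```

On this sequence LRU evicts block 0 when block 4 arrives, because block 12 was touched more recently, while FIFO evicts block 12, because it was loaded first. The two policies differ only in which later access misses, consistent with the slide's point that replacement is a second-order effect.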