CMSC 22200 Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li, University of Chicago

Lecture Outline: Caches

Review
Memory hierarchy; cache basics
Locality principles: spatial and temporal
How to access a cache?
Fundamental parameters: sets, block size, associativity
Performance metric: AMAT

Cache Design Decisions and Tradeoffs

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Cache Size
Cache size: total data (not including tag) capacity; a bigger cache can exploit temporal locality better
Too large: adversely affects hit and miss latency; smaller is faster => bigger is slower; access time may degrade the critical path
Too small: doesn't exploit temporal locality well; useful data gets replaced often
Working set: the whole set of data the executing application references within a time interval
[Figure: hit rate vs. cache size, with the knee around the working set size]

Block Size
Block size is the amount of data associated with an address tag
Too small: doesn't exploit spatial locality well; larger tag overhead; can't take advantage of fast burst transfers from memory
Too large: too few total blocks -> less temporal locality exploitation; wastes cache space, bandwidth, and energy if spatial locality is not high
[Figure: hit rate vs. block size, peaking at an intermediate block size]

Critical-Word First
Large cache blocks can take a long time to fill into the cache: fill the cache line critical word first; restart the cache access before the fill completes
Example: assume 8-byte words and an 8-word cache block; the application wants to access the 4th word and misses in the cache; fetch the 4th word first, then the rest

Associativity
How many blocks can map to the same index (or set)?
Higher associativity: ++ higher hit rate (reduces conflict misses); -- slower cache access time (hit latency and data access latency); -- more expensive hardware (more comparators); -- diminishing returns from higher associativity
Smaller associativity: lower cost; lower hit latency (especially important for L1 caches)
[Figure: hit rate vs. associativity, with diminishing returns]
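
As a concrete illustration (all parameters here are assumed for the example, not from the lecture): a 32 KB, 4-way set-associative cache with 64-byte blocks has 32768 / (64 * 4) = 128 sets, and an address decomposes into offset, index, and tag as in this C sketch. On a lookup, all 4 tags in the selected set are compared in parallel, which is why higher associativity costs more comparators.

#include <stdint.h>

/* Assumed example parameters: 32 KB, 4-way, 64 B blocks -> 128 sets. */
#define BLOCK_BYTES 64
#define NUM_SETS    128

uint64_t block_offset(uint64_t addr) { return addr % BLOCK_BYTES; }
uint64_t set_index(uint64_t addr)    { return (addr / BLOCK_BYTES) % NUM_SETS; }
uint64_t tag_bits(uint64_t addr)     { return addr / ((uint64_t)BLOCK_BYTES * NUM_SETS); }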

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Eviction/Replacement Policy
Which block in the set do we replace on a cache miss? Any invalid block first; if all are valid, consult the replacement policy

LRU (Least Recently Used) Policy
Idea: evict the least recently accessed block
Problem: need to keep track of the access ordering of blocks
Question: 2-way set associative cache: what do you need to implement LRU perfectly?
Question: 4-way set associative cache: what do you need to implement LRU perfectly? How many different orderings are possible for the 4 blocks in the set? How many bits are needed to encode the LRU order of a block? What is the logic needed to determine the LRU victim?
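
One standard set of answers (textbook reasoning, not spelled out on the slide): a 2-way set needs a single bit per set marking the LRU way; a 4-way set has 4! = 24 possible access orderings, so perfect LRU needs at least ceil(log2(24)) = 5 bits per set, often implemented less compactly as a 2-bit recency rank per block. A minimal C sketch of the rank scheme for one set (names are hypothetical):

#include <stdint.h>

#define WAYS 4

/* Per-block 2-bit recency rank: 0 = MRU, WAYS-1 = LRU.
   The ranks always form a permutation of 0..WAYS-1. */
typedef struct { uint8_t rank[WAYS]; } lru_state_t;

/* On an access to `way`, promote it to MRU and age every block
   that was more recently used than it. */
void lru_touch(lru_state_t *s, int way) {
    uint8_t old = s->rank[way];
    for (int w = 0; w < WAYS; w++)
        if (s->rank[w] < old)
            s->rank[w]++;
    s->rank[way] = 0;
}

/* The LRU victim is the block whose rank is WAYS-1. */
int lru_victim(const lru_state_t *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->rank[w] == WAYS - 1)
            return w;
    return 0; /* unreachable while ranks form a permutation */
}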

Approximations of LRU
Most modern processors do not implement true LRU (also called "perfect LRU") in highly-associative caches
Why? True LRU is complex, and LRU is only an approximation to predict locality anyway (i.e., not the best possible cache management policy)
Examples: Not MRU (not most recently used); many others

Random Replacement Policy
LRU vs. Random: which one is better?
Example: 4-way cache, cyclic references to A, B, C, D, E: 0% hit rate with the LRU policy (each miss evicts exactly the block that will be referenced next: E evicts A just before A is referenced again, and so on)
Set thrashing: when the program's working set in a set is larger than the set associativity; random replacement is better when thrashing occurs
In practice: depends on the workload; on average, the hit rates of LRU and Random are similar

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Write-Back vs. Write-Through Caches
When do we write the modified data in a cache to the next level? Write-through: at the time the write happens. Write-back: when the block is evicted
Write-back: + can consolidate multiple writes to the same block before eviction, potentially saving bandwidth between cache levels and saving energy; -- needs a bit indicating the block is dirty/modified
Write-through: + simpler; + all levels are up to date; -- more bandwidth intensive; no coalescing of writes

Allocate vs. No-Allocate on Write Caches
Do we allocate a cache block on a write miss? Allocate on write miss: yes. No-allocate on write miss: no
Allocate on write miss: + can consolidate writes instead of writing each of them individually to the next level (assuming a write-back cache); + simpler, because write misses can be treated the same way as read misses; -- requires transfer of the whole cache block
No-allocate: + conserves cache space if the locality of writes is low (potentially better cache hit rate)
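
A compact C sketch of how these policy choices interact on a write (a hypothetical model for illustration, pairing write-back with allocate-on-write-miss and write-through with no-allocate, which are the common pairings; all helper functions are assumed):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid, dirty; uint64_t block_addr; } line_t;

/* Assumed helpers: hit lookup, block fill (may evict), next-level write. */
extern line_t *lookup(uint64_t block_addr);
extern line_t *fill_from_next_level(uint64_t block_addr);
extern void store_into_line(line_t *line, uint64_t addr, uint64_t data);
extern void write_next_level(uint64_t addr, uint64_t data);

void handle_write(uint64_t addr, uint64_t data, bool write_back) {
    uint64_t block = addr & ~(uint64_t)63;   /* 64 B blocks assumed */
    line_t *line = lookup(block);

    if (!line && write_back)
        line = fill_from_next_level(block);  /* allocate on write miss */

    if (line) {
        store_into_line(line, addr, data);   /* update the cached copy */
        if (write_back)
            line->dirty = true;              /* written to next level on eviction */
        else
            write_next_level(addr, data);    /* write-through keeps levels up to date */
    } else {
        write_next_level(addr, data);        /* no-allocate: bypass the cache */
    }
}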

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

Instruction vs. Data Caches
Separate or unified?
Unified: + dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches); -- instructions and data can thrash each other (i.e., no guaranteed space for either); -- I and D are accessed at different places in the pipeline: where do we place the unified cache for fast access?
First-level caches are almost always split, mainly for the last reason above; second and higher levels are almost always unified

A Word on Multi-level Caching
First-level (L1) caches (instruction and data): decisions very much affected by cycle time; small, lower associativity
Second-level (L2) caches: decisions need to balance hit rate and access latency; usually large and highly associative; latency not as important
Multi-level inclusion: data in L1 is always a subset of data in L2. + easier cache analysis; + easier coherence checks; -- additional logic to maintain inclusion; -- wasted space

Cache Design Considerations
Organization: cache size, block size, associativity?
Replacement: what data to remove to make room in the cache?
Write policy: what do we do about writes?
Instructions/data: do we treat them separately?
Performance optimization: increase hit rate (reduce miss rate); reduce hit time; reduce miss penalty

How to Improve Cache Performance
Remember: average memory access time (AMAT) = (hit-rate * hit-latency) + (miss-rate * miss-latency)
Three fundamental goals: reducing miss rate; reducing miss latency/cost; reducing hit latency/cost
Tradeoffs! E.g., to reduce miss rate, hit/miss latency can increase
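
A quick worked example with assumed numbers: with a 95% hit rate, 1-cycle hit latency, and 100-cycle miss latency, AMAT = 0.95 * 1 + 0.05 * 100 = 5.95 cycles; improving the miss rate to 2% gives 0.98 * 1 + 0.02 * 100 = 2.98 cycles, nearly halving the average access time.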

Improving Cache Performance
Reducing miss rate: higher associativity; better replacement policies; ...
Reducing hit latency/cost: smaller caches / lower associativity
Reducing miss latency/cost: multi-level caches; critical-word first; non-blocking caches (multiple cache misses in parallel); high-bandwidth caches (multiple accesses per cycle); cache-friendly software approaches

Non-Blocking Caches

Handling Multiple Outstanding Accesses
Goal: enable cache access when there is a pending miss ("hit under miss", "miss under miss")
Idea: non-blocking or lockup-free caches

Benefits of Non-Blocking Caches
1 hit under miss: 9% and 12.5% reductions in cache access latency for the SPECint and SPECfp benchmarks, respectively
Slide credit: Prof. Christos Kozyrakis (EE282, Stanford University)

Handling Multiple Outstanding Accesses
Idea: keep track of the status/data of misses that are being handled in Miss Status Handling Registers (MSHRs)
A cache access checks the MSHRs to see if a miss to the same block is already pending. If pending, a new request is not generated; if pending and the needed data is available, the data is forwarded to the later load
Requires buffering of outstanding miss requests

Miss Status Handling Register
Also called the miss buffer. Keeps track of: outstanding cache misses; pending load/store accesses that refer to the missing cache block
Fields of a single MSHR entry: valid bit; cache block address (to match incoming accesses); control/status bits (prefetch, whether it's been issued to memory, which subblocks have arrived, etc.); data for each subblock
Multiple store/load entries, one for each pending load/store that accesses the cache block: valid, type (load or store), data size (how many bytes), which bytes in the block are needed, destination register or store buffer entry address

Miss Status Handling Register Entry [figure: MSHR entry layout]
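
A minimal C sketch of one MSHR entry with the fields listed above (field names, widths, and the subblock and pending-access counts are illustrative assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS   4    /* assumed subblocks per cache block */
#define SUBBLOCK_B  16   /* assumed bytes per subblock */
#define MAX_PENDING 4    /* assumed pending accesses per entry */

typedef struct {
    bool    valid;
    bool    is_store;    /* type: load or store */
    uint8_t size;        /* data size in bytes */
    uint8_t offset;      /* which bytes in the block are needed */
    uint8_t dest;        /* destination register or store buffer entry */
} mshr_access_t;

typedef struct {
    bool          valid;
    uint64_t      block_addr;                  /* matched against incoming accesses */
    bool          is_prefetch;                 /* control/status bits */
    bool          issued_to_memory;
    bool          arrived[SUBBLOCKS];          /* which subblocks have arrived */
    uint8_t       data[SUBBLOCKS][SUBBLOCK_B]; /* data for each subblock */
    mshr_access_t pending[MAX_PENDING];        /* pending loads/stores to this block */
} mshr_entry_t;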

Non-Blocking Cache Operation
On a cache miss: search the MSHRs for a pending access to the same block. Found: allocate a load/store entry in the same MSHR entry. Not found: allocate a new MSHR. No free entry: stall
When a subblock returns from the next level in memory: check which loads/stores are waiting for it; forward the data to the load/store unit; deallocate the load/store entry in the MSHR entry (mark it as invalid); write the subblock into the cache or the MSHR; if it is the last subblock, deallocate the MSHR (after writing the block into the cache)
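
In code, the lookup/allocate part of this flow might look roughly as follows, continuing the hypothetical mshr_entry_t sketch above (NUM_MSHRS and the NULL-means-stall convention are assumptions):

#define NUM_MSHRS 8
mshr_entry_t mshrs[NUM_MSHRS];

/* On a cache miss: return the MSHR entry handling this block,
   or NULL to signal that the access must stall. */
mshr_entry_t *mshr_lookup_or_allocate(uint64_t block_addr) {
    /* 1. A pending miss to the same block? Attach to it. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return &mshrs[i];  /* caller adds a load/store entry here */

    /* 2. Not found: allocate a new MSHR entry. */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i] = (mshr_entry_t){ .valid = true,
                                       .block_addr = block_addr };
            return &mshrs[i];  /* caller issues the request to memory */
        }

    /* 3. No free entry: stall. */
    return NULL;
}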

Enabling High-Bandwidth Caches

Multiple Memory Accesses per Cycle
Processors can generate multiple cache/memory accesses per cycle (e.g., superscalar)
A cache/memory can receive multiple access requests per cycle (e.g., when shared among multiple processors)
How do we ensure the cache/memory can handle multiple accesses in the same clock cycle? Solution: multiple banks

Cache Banks
Idea: rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
Bits in the address determine which bank an address maps to; the address space is partitioned across the banks
Accesses to different banks (e.g., blocks 0, 1, 2, 3) can proceed in parallel. This mapping is called interleaving
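
For example (a sketch of one simple interleaving, with an assumed bank count): with 64-byte blocks and 4 banks, consecutive block addresses map to consecutive banks, so blocks 0, 1, 2, and 3 fall in banks 0, 1, 2, and 3 and can be accessed in parallel.

#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4   /* assumed bank count */

/* Block-interleaved mapping: consecutive cache blocks hit consecutive banks. */
uint64_t bank_of(uint64_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}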

Cache Banks
Advantages: increased cache bandwidth; no increase in data store area; power benefits
Disadvantages: cannot satisfy multiple accesses to the same bank in the same cycle. This is the key issue, called bank conflicts (many techniques exist to avoid them); bank utilization; more complex logic (an interconnect network) to distribute/collect accesses

Software Optimization Techniques

General Approaches
Restructuring data access patterns; restructuring data layout
Focus: improving hit rate

Restructuring Data Access Patterns (I)
Array access example: if the layout is column-major, x[i+1,j] follows x[i,j] in memory while x[i,j+1] is far away from x[i,j]
Poor code:
for i = 1, rows
    for j = 1, columns
        sum = sum + x[i,j]
Better code:
for j = 1, columns
    for i = 1, rows
        sum = sum + x[i,j]
This is called loop interchange
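
As a runnable illustration: C arrays are row-major (x[i][j+1] is adjacent in memory), the opposite of the column-major layout assumed above, so in C the cache-friendly order puts j in the inner loop. A self-contained sketch:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024
static double x[ROWS][COLS];

int main(void) {
    double sum = 0.0;
    /* Row-major C: the inner loop over j walks memory sequentially,
       one element at a time (good spatial locality). Interchanging
       the loops would stride by COLS doubles on every access. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += x[i][j];
    printf("sum = %f\n", sum);
    return 0;
}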

Restructuring Data Access Patterns (II)
Blocking: divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
Avoids cache conflicts between different chunks of computation
Essentially: divide the working set so that each piece fits in the cache
Blocking limitations: 1. there can be conflicts among different arrays; 2. array sizes may be unknown at compile/programming time

Blocking Example
// 3 N-by-N matrices x, y, z
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r += y[i][k] * z[k][j];
        x[i][j] += r;
    }
}

Access Pattern [figure: memory access pattern of the unblocked code, distinguishing older accesses from new accesses]

Blocked Access Pattern [figure: access patterns of the unoptimized vs. blocked code]

Blocked Code
for (jj = 0; jj < N; jj += B) {
    for (kk = 0; kk < N; kk += B) {
        for (i = 0; i < N; i++) {
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++) {
                    r += y[i][k] * z[k][j];
                }
                x[i][j] += r;
            }
        }
    }
}
Here B is the blocking factor; a common rule of thumb is to pick B small enough that the B-by-B submatrix of z, plus the rows of x and y it interacts with, fits in the cache.

Restructuring Data Layout (I)
struct Node {
    struct Node* next;
    int key;
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access other fields of node
    }
    node = node->next;
}

Pointer-based traversal (e.g., of a linked list). Assume a huge linked list (1M nodes) and unique keys. Why does this code have a poor cache hit rate? The other fields occupy most of the cache line even though they are rarely accessed!

Restructuring Data Layout (II)
struct Node {
    struct Node* next;
    int key;
    struct Node_data* node_data;
};

struct Node_data {
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access node->node_data
    }
    node = node->next;
}

Idea: separate the frequently used fields of a data structure and pack them into a separate data structure
Who should do this? The programmer? The compiler (profiling vs. dynamic)? Hardware? Who can determine what is frequently used?

Restructuring Data Layout (III)
How about instruction layout?