Memory Hierarchy: Cache. General Cache Mechanics, Cache Hit, Cache Miss


Buck Byrd
5 years ago

Outline: memory hierarchy basics, locality, cache organization, cache-aware programming.

General Cache Mechanics
CPU: data is moved in block units. Block: unit of data in cache and memory (a.k.a. line).
Cache: smaller, faster, more expensive. Stores a subset of memory blocks (lines).
Memory: larger, slower, cheaper. Partitioned into blocks (lines).

Cache Hit
1. Request: the CPU requests data in block b.
2. Cache hit: block b is in the cache.

Cache Miss
1. Request: the CPU requests data in block b.
2. Cache miss: block b is not in the cache.
3. Cache eviction: evict a block to make room, maybe store it to memory.
4. Cache fill: fetch block b from memory, store it in the cache.

Placement policy: where to put the block in the cache.
Replacement policy: which block to evict.
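The hit, miss, eviction, and fill steps can be sketched as a toy model. This is only an illustration under stated assumptions: the names `toy_cache_t` and `toy_access`, the four-slot size, the "any free slot" placement, and the round-robin choice of victim are all hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4  /* assumption: a tiny cache with 4 block slots */

typedef enum { HIT, MISS_FILL, MISS_EVICT_FILL } outcome_t;

typedef struct {
    bool valid[NUM_SLOTS];           /* does the slot hold a block? */
    unsigned long block[NUM_SLOTS];  /* which block the slot holds */
    size_t next_victim;              /* round-robin victim (an assumption) */
} toy_cache_t;

/* Request block b and report what the cache had to do. */
outcome_t toy_access(toy_cache_t *c, unsigned long b) {
    assert(c != NULL);

    /* Cache hit: block b is already in some slot. */
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        if (c->valid[i] && c->block[i] == b) {
            return HIT;
        }
    }

    /* Cache miss with room left: fill a free slot. */
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        if (!c->valid[i]) {
            c->valid[i] = true;
            c->block[i] = b;
            return MISS_FILL;
        }
    }

    /* Cache miss with no room: evict one block, then fill its slot. */
    size_t victim = c->next_victim;
    c->next_victim = (victim + 1) % NUM_SLOTS;
    c->block[victim] = b;
    return MISS_EVICT_FILL;
}
```

Starting from a zero-initialized `toy_cache_t`, the first request for a block reports `MISS_FILL`, an immediate repeat reports `HIT`, and once all four slots are in use a new block reports `MISS_EVICT_FILL`.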

Locality #1

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }

Data:
  Temporal: sum is referenced in each iteration.
  Spatial: array a[] is accessed in a stride-1 pattern.
Instructions:
  Temporal: the loop is executed repeatedly.
  Spatial: instructions are executed in sequence.
What is stored in memory?
Assessing locality in code is an important programming skill.

Locality #2

    int sum_array_rows(int a[M][N]) {
        int sum = 0;
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

[Figure: row-major M x N 2D array in C, with elements numbered in the order they are visited.]
The inner loop walks along a row, so consecutive accesses are adjacent in memory: stride 1.

Locality #3

    int sum_array_cols(int a[M][N]) {
        int sum = 0;
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

[Figure: row-major M x N 2D array in C, with elements numbered in the order they are visited.]
The inner loop walks down a column, so consecutive accesses are a full row apart: stride N.

Locality #4

    int sum_array_3d(int a[M][N][N]) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

What is "wrong" with this code? How can it be fixed?
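One possible answer to the Locality #4 question, as a sketch rather than the official solution: the innermost loop varies the first index k, so consecutive accesses are N*N ints apart. Reordering the loops so that the last index varies fastest gives a stride-1 walk. The sizes `M` and `N` and both function names below are assumptions chosen for illustration.

```c
#include <assert.h>

#define M 3  /* assumption: small sizes so the example is easy to check */
#define N 4

/* Original loop order: the innermost loop changes k, the first index,
 * so consecutive accesses are N*N ints apart (poor spatial locality). */
int sum_array_3d_strided(int a[M][N][N]) {
    assert(a != 0);
    int sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

/* Reordered loops: the innermost loop changes j, the last index,
 * so the array is walked in memory order (stride 1). */
int sum_array_3d_stride1(int a[M][N][N]) {
    assert(a != 0);
    int sum = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is touched differs, which is what matters to the cache.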

Cache Performance Metrics
Miss Rate: fraction of memory accesses to data not in cache (misses / accesses). Typically a few percent for L1; may be lower still for L2, depending on size, etc.
Hit Time: time to find and deliver a block in the cache to the processor. Typically a few clock cycles for L1; 5 or more clock cycles for L2.
Miss Penalty: additional time required on a cache miss = main memory access time. Typically tens to hundreds of cycles for L2 (trend: increasing!).

Memory hierarchy: why does it work?
From small, fast, power-hungry, and expensive at the top to large, slow, power-efficient, and cheap at the bottom:
  registers
  L1 cache (SRAM, on-chip)
  L2 cache (SRAM, on-chip)
  L3 cache (SRAM, off-chip)
  main memory (DRAM)
  persistent storage (hard disk, flash, over network, cloud, etc.)
Figure label: explicitly program-controlled.

Cache Organization: Key Points
Block: fixed-size unit of data in memory/cache.
Placement Policy: where in the cache should a given block be stored? Direct-mapped, set associative.
Replacement Policy: what if there is no room in the cache for requested data? Least recently used, most recently used.
Write Policy: when should writes update lower levels of the memory hierarchy? Write back, write through, write allocate, no write allocate.

Blocks
Divide the address space into fixed-size, aligned blocks whose size is a power of 2. Example: block size = 8.
A full byte address splits into two fields:
  Block ID: address bits minus offset bits.
  Offset within block: log2(block size) bits.
Remember withinSameBlock? (Pointers Lab)
Note: drawing address order differently from here on!
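The block ID and offset split can be illustrated with a few helper functions. This is a minimal sketch assuming a power-of-2 block size; the names `block_id`, `block_offset`, and `within_same_block` are hypothetical, and the last one is only a stand-in for the idea behind the Pointers Lab exercise, not its required solution.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_two(uintptr_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Block ID: the address with the offset bits dropped. */
uintptr_t block_id(uintptr_t addr, uintptr_t block_size) {
    assert(is_power_of_two(block_size));
    return addr / block_size;
}

/* Offset within block: the low log2(block_size) bits of the address. */
uintptr_t block_offset(uintptr_t addr, uintptr_t block_size) {
    assert(is_power_of_two(block_size));
    return addr & (block_size - 1);
}

/* Two addresses share a block exactly when their block IDs match. */
bool within_same_block(uintptr_t a, uintptr_t b, uintptr_t block_size) {
    return block_id(a, block_size) == block_id(b, block_size);
}
```

With a block size of 8, address 0x1B has block ID 3 and offset 3, and addresses 0x18 and 0x1F fall in the same block while 0x1F and 0x20 do not.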

Placement: Direct-Mapped
Mapping: index(Block ID) = Block ID mod S (easy for power-of-2 block sizes...).
S = # slots = 4.

Placement: tags resolve ambiguity
Mapping: index(Block ID) = Block ID mod S.
Block ID bits not used for the index form the tag.

Address = Tag, Index, Offset
  Tag: disambiguates slot contents. (a - s - b) bits.
  Index: what slot in the cache? s bits = log2(# cache slots).
  Offset: where within a block? b bits = log2(block size).
For an a-bit address, the Block ID is the tag and index together: address bits minus offset bits.

A puzzle.
Cache starts empty. Access (address, hit/miss) stream: (, miss), (, hit), (, miss).
What could the block size be? The hit puts a lower bound on the block size, and the second miss puts an upper bound on it: block size < 8 bytes.
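The tag, index, and offset fields can be pulled out of an address with shifts and masks. The sketch below assumes 32-bit addresses; `addr_parts_t` and `split_address` are hypothetical names, and the parameters s and b are the index and offset widths from the slide.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* disambiguates slot contents */
    uint32_t index;   /* which slot in the cache */
    uint32_t offset;  /* where within the block */
} addr_parts_t;

/* Split a 32-bit address given s index bits and b offset bits. */
addr_parts_t split_address(uint32_t addr, unsigned s, unsigned b) {
    assert(s + b < 32);  /* keep every shift below the word width */
    addr_parts_t p;
    p.offset = addr & ((1u << b) - 1);
    p.index = (addr >> b) & ((1u << s) - 1);
    p.tag = addr >> (s + b);
    return p;
}
```

For example, with S = 4 slots (s = 2) and 8-byte blocks (b = 3), address 0x5D (binary 010 11 101) splits into tag 2, index 3, and offset 5.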

Placement: direct mapping conflicts
What happens when accessing blocks in a repeated pattern that alternates between blocks mapping to the same slot?
Cache conflict: every access suffers a miss and evicts the cache line needed by the next access.

Placement: Set Associative
One index per set of block slots. Store a block in any slot within its set.
Mapping: index(Block ID) = Block ID mod S, where S = # sets in the cache.
For a cache with 8 block slots:
  1-way: 8 sets, 1 block each (direct mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)
Replacement policy: if the set is full, what block should be replaced? Common: least recently used (LRU), but hardware usually implements "not most recently used".

Example: Tag, Index, Offset?
Given the address width, work out the tag, index, and offset widths for an E-way set-associative cache and for a direct-mapped cache with 4 slots.

Example: Tag, Index, Offset?
For the same address, find the tag bits, set index bits, offset bits, and the resulting set index under each configuration:
  E = 1-way, S = 8 sets: 3 set index bits
  E = 2-way, S = 4 sets: 2 set index bits
  E = 4-way, S = 2 sets: 1 set index bit
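The effect of associativity on conflict misses can be checked with a small simulator. This is a sketch under assumptions: it models true LRU (the slide notes that hardware usually approximates it), tracks only block IDs, and uses the hypothetical names `slot_t` and `count_misses` with an arbitrary upper bound of 8 sets and 8 ways.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SETS 8
#define MAX_WAYS 8

typedef struct {
    bool valid;
    unsigned long block_id;
    size_t last_used;  /* time of the most recent access, for LRU */
} slot_t;

/* Count misses for a stream of block IDs on a cold cache with
 * S sets of E slots each, using index = block ID mod S and LRU. */
size_t count_misses(const unsigned long *blocks, size_t n, size_t S, size_t E) {
    assert(S >= 1 && S <= MAX_SETS && E >= 1 && E <= MAX_WAYS);
    slot_t cache[MAX_SETS][MAX_WAYS] = {0};
    size_t misses = 0;

    for (size_t t = 0; t < n; t++) {
        slot_t *set = cache[blocks[t] % S];
        bool hit = false;

        /* Hit if any valid slot in the set holds this block. */
        for (size_t w = 0; w < E; w++) {
            if (set[w].valid && set[w].block_id == blocks[t]) {
                set[w].last_used = t + 1;
                hit = true;
                break;
            }
        }
        if (hit) continue;

        /* Miss: prefer an empty slot, else the least recently used one. */
        misses++;
        slot_t *victim = &set[0];
        for (size_t w = 1; w < E && victim->valid; w++) {
            if (!set[w].valid || set[w].last_used < victim->last_used) {
                victim = &set[w];
            }
        }
        victim->valid = true;
        victim->block_id = blocks[t];
        victim->last_used = t + 1;
    }
    return misses;
}
```

Feeding it the alternating stream 0, 8, 0, 8, 0, 8 gives 6 misses with S = 8, E = 1 (both blocks map to set 0 and evict each other) but only the 2 cold misses with S = 4, E = 2, since the set has room for both blocks.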

Replacement Policy
If the set is full, what block should be replaced? Common: least recently used (LRU), but hardware usually implements "not most recently used".

Another puzzle: cache starts empty, uses LRU.
Access (address, hit/miss) stream: (, miss); (, miss); (, miss).
What can be inferred about which addresses share a block, which block was replaced, and the associativity of the cache? Could it be a direct-mapped cache?

General Cache Organization (S, E, B)
All parameters are powers of 2.
  S = 2^s sets
  E = 2^e lines per set ("E-way")
  B = 2^b bytes of data per cache line (the data block)
Each line holds a valid bit v, a tag, and data bytes 0 through B-1.
Cache capacity: S x E x B data bytes.
Address size: t + s + b address bits.

Cache Read
Address of byte in memory: t bits (tag), s bits (set index), b bits (block offset).
1. Locate the set by index.
2. Hit if any block in the set is valid and has a matching tag.
3. Get the data at the offset in the block; the data begins at this offset.

Direct-Mapped Cache Practice
Direct-mapped cache with a 4-byte block size. Given the address width and the number of lines: how many offset bits? Index bits? Tag bits?
[Figure: table of cache contents for the practice problem, listing index, tag, valid bit, and data bytes B0 to B3 for each line.]
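The capacity and address-size formulas for an (S, E, B) cache translate directly into arithmetic. The sketch below uses hypothetical names (`cache_geometry_t`, `capacity_bytes`, `index_bits`, `offset_bits`, `tag_bits`) and assumes S and B are powers of 2, as the slide states.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t S;  /* number of sets */
    size_t E;  /* lines per set */
    size_t B;  /* data bytes per line */
} cache_geometry_t;

/* log2 of a value that must be a power of 2. */
static unsigned log2_exact(size_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Cache capacity: S x E x B data bytes. */
size_t capacity_bytes(cache_geometry_t g) {
    return g.S * g.E * g.B;
}

unsigned index_bits(cache_geometry_t g)  { return log2_exact(g.S); }  /* s */
unsigned offset_bits(cache_geometry_t g) { return log2_exact(g.B); }  /* b */

/* Address size is t + s + b, so t = address bits - s - b. */
unsigned tag_bits(cache_geometry_t g, unsigned address_bits) {
    unsigned s = index_bits(g);
    unsigned b = offset_bits(g);
    assert(address_bits >= s + b);
    return address_bits - s - b;
}
```

As an illustration with made-up parameters, a cache with S = 8, E = 2, B = 16 holds 256 data bytes and, with 16-bit addresses, uses s = 3, b = 4, and t = 9.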

Cache Example (E = 1)

    double sum_array_rows(double a[16][16]) {
        double sum = 0;
        for (int r = 0; r < 16; r++) {
            for (int c = 0; c < 16; c++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    double sum_array_cols(double a[16][16]) {
        double sum = 0;
        for (int c = 0; c < 16; c++) {
            for (int r = 0; r < 16; r++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

Locals are in registers. Assume a is aligned such that &a[r][c] is aa...a rrrr cccc 000.
Assume: cold (empty) cache, 3-bit set index, 5-bit offset, so the address fields are aa...arrr (tag), rcc (set index), cc000 (offset).
Row order: a block is 32 bytes = 4 doubles, so there are 4 misses per row of the array: 4*16 = 64 misses.
Column order: a block is 32 bytes = 4 doubles, but every access is a miss: 16*16 = 256 misses.
[Figure: cache contents after each access for the row-order and column-order traversals.]

Cache Example (E = 1)

    int dotprod(int x[8], int y[8]) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

Block = 16 bytes = 4 ints; 8 sets in the cache.
How many block offset bits? How many set index bits?
Address bits: ttt...t sss bbbb
  B = 16 = 2^b: b = 4 offset bits
  S = 8 = 2^s: s = 3 index bits
If x and y are mutually aligned (their blocks map to the same sets), the accesses x[0], y[0], x[1], y[1], ... keep evicting each other's block from the same line.
If x and y are mutually unaligned, x[0..3], x[4..7], y[0..3], and y[4..7] each occupy a different set.

Cache Example (E = 2)

    float dotprod(float x[8], float y[8]) {
        float sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

Even if x and y are aligned, the cache can still fit both because each set has space for two blocks/lines.
2 blocks/lines per set, 4 sets: x[0..3] and y[0..3] share one set; x[4..7] and y[4..7] share another.

Writing to cache
Multiple copies of data exist and must be kept in sync.
Write-hit policy
  Write-through: write immediately to the next level of the hierarchy as well.
  Write-back: defer the write to the next level until the line is evicted; needs a dirty bit.
Write-miss policy
  Write-allocate: load the block into the cache, then update it there.
  No-write-allocate: write directly to memory without loading the block into the cache.
Typical caches:
  Write-back + Write-allocate, usually
  Write-through + No-write-allocate, occasionally
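The 64 versus 256 miss counts can be reproduced with a small direct-mapped cache model. This sketch assumes the array starts at byte address 0 (one alignment consistent with the slide's assumption), 8-byte doubles, 32-byte blocks, and 8 sets; `dm_cache_t`, `touch`, `misses_row_order`, and `misses_col_order` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ROWS = 16, COLS = 16, ELEM_BYTES = 8, BLOCK_BYTES = 32, NUM_SETS = 8 };

/* Direct-mapped cache (E = 1): one tag per set, plus a miss counter. */
typedef struct {
    bool valid[NUM_SETS];
    size_t tag[NUM_SETS];
    size_t misses;
} dm_cache_t;

/* Access one byte address; on a miss, replace the line in its set. */
static void touch(dm_cache_t *cache, size_t byte_addr) {
    assert(cache != NULL);
    size_t block = byte_addr / BLOCK_BYTES;
    size_t set = block % NUM_SETS;
    size_t tag = block / NUM_SETS;
    if (!cache->valid[set] || cache->tag[set] != tag) {
        cache->misses++;
        cache->valid[set] = true;
        cache->tag[set] = tag;
    }
}

/* Misses when a[r][c] is visited row by row, starting from a cold cache. */
size_t misses_row_order(void) {
    dm_cache_t cache = {0};
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            touch(&cache, (r * COLS + c) * ELEM_BYTES);
    return cache.misses;
}

/* Misses when a[r][c] is visited column by column, from a cold cache. */
size_t misses_col_order(void) {
    dm_cache_t cache = {0};
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            touch(&cache, (r * COLS + c) * ELEM_BYTES);
    return cache.misses;
}
```

Under these assumptions `misses_row_order()` returns 64 and `misses_col_order()` returns 256: in column order, rows two apart map to the same set, so each block is evicted before its remaining three doubles are used.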

Write-back, write-allocate example

Initial state: the cache line holds U = 0xCAFE with dirty bit 0. Memory holds T = 0xFACE and U = 0xCAFE.

1. mov $T, %ecx
   Cache/memory not involved.
2. mov $U, %edx
   Cache/memory not involved.
3. mov $0xFEED, (%ecx)
   a. Miss on T.
   b. Evict U (clean: discard).
   c. Fill T (write-allocate).
   d. Write T in cache (dirty).
   Now the cache line holds T = 0xFEED with dirty bit 1; memory still holds T = 0xFACE.
4. mov (%edx), %eax
   a. Miss on U.
   b. Evict T (dirty: write back).
   c. Fill U.
   d. Set %eax.
   Now memory holds T = 0xFEED, the cache line holds U = 0xCAFE with dirty bit 0, and eax = 0xCAFE.
5. DONE.
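The same sequence can be traced with a toy model of a single-line write-back, write-allocate cache. This is a sketch under assumptions: each memory word is treated as its own block, addresses are small array indices, and `machine_t`, `bring_in`, `store`, and `load` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MEM_WORDS = 4 };  /* assumption: tiny memory, one word per block */

typedef struct {
    uint16_t memory[MEM_WORDS];
    bool valid;           /* does the single cache line hold a block? */
    bool dirty;           /* has the cached copy been modified? */
    unsigned tag;         /* which memory word the line holds */
    uint16_t data;        /* cached copy of that word */
    unsigned writebacks;  /* number of dirty evictions so far */
} machine_t;

/* Make sure the line holds addr: on a miss, evict and then fill. */
static void bring_in(machine_t *m, unsigned addr) {
    assert(addr < MEM_WORDS);
    if (m->valid && m->tag == addr) {
        return;  /* hit */
    }
    if (m->valid && m->dirty) {
        m->memory[m->tag] = m->data;  /* evict dirty line: write back */
        m->writebacks++;
    }                                 /* a clean line is just discarded */
    m->tag = addr;                    /* fill from memory */
    m->data = m->memory[addr];
    m->valid = true;
    m->dirty = false;
}

/* Write-allocate on a miss; write-back defers the memory update. */
void store(machine_t *m, unsigned addr, uint16_t value) {
    bring_in(m, addr);
    m->data = value;
    m->dirty = true;
}

uint16_t load(machine_t *m, unsigned addr) {
    bring_in(m, addr);
    return m->data;
}
```

With T and U taken as word indices 0 and 1, storing 0xFEED to T leaves memory at T unchanged until the later load from U evicts the dirty line and writes it back.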

CS 240 Stage 3 Abstractions for Practical Systems
