Computer Architecture: Memory hierarchies and caches


1 Computer Architecture: Memory hierarchies and caches. S Coudert and R Pacalet, January 23, 2019

2 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

4 The memory latency problem
- The latency of external memory accesses tends to increase (tens to hundreds of CPU clock cycles)
- Clock cycles between a CPU load and the instruction/data returning from memory are wasted
- Clocks Per Instruction (CPI) increases
- On the other hand, on average, 90% of execution time corresponds to 10% of code instructions
- Caches exploit this to mitigate the memory latency problem

5 Principles of caches
- All CPU memory accesses go to a small, fast memory; the larger, slow memory is accessed only when needed
- Along the hierarchy: size from smallest to largest, speed from fastest to slowest, cost ($/bit) from highest to lowest
- Keep the most frequently accessed data in the small (expensive), fast (close) memory
- Performance depends on hit and miss times and on the hit rate
- Technologies, by increasing latency and decreasing cost per byte: registers, Static RAM (SRAM), Dynamic RAM (DRAM), magnetic disk
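The point that performance depends on hit time, miss time and hit rate is often summarised as the average memory access time (AMAT). A minimal sketch; the formula is the standard one, but the numbers below are illustrative assumptions, not from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time;
    a fraction miss_rate of accesses additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 2-cycle cache hit, 5% miss rate,
# 100-cycle memory access on a miss.
print(amat(2, 0.05, 100))  # 7.0 cycles on average
```

Halving the miss rate or the miss penalty has the same effect on AMAT, which is why both larger blocks and faster memory-cache transfers are discussed later.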

6 Memory hierarchy
- Levels 1 to n between the CPU and main memory, with increasing size and increasing latency
- Goal: minimize the miss rate
- Ideally, the full hierarchy is as fast as level 1 (miss rate = 0) and its size is that of level n (GBytes)

7 Cache miss & cache hit
[Figure: CPU, cache and memory connected by address and data buses. On a hit the cache returns the data directly; on a miss the request goes on to memory and the data comes back through the cache.]
- Cache miss => CPU wait states

8 Most frequently accessed
- Best possible choice for cached data: the one I will need next
- "The future's not ours to see" (Que sera sera)
- Second best choice: the most frequently accessed data
- How to identify them? Approximations, heuristics based on two locality principles:
  - Spatial: in a given short period of time a program frequently accesses a small memory area (example: working on an array)
  - Temporal: a program often accesses the same memory cell several times in a short period (example: instructions in a loop)

9 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

10 Locality principles: example
Sub-program example:

    for (i = 0; i < 1000; i++) {
        C[i] = A[i] + B[i];
    }

Variable addresses (as used in the assembly listing that follows): array A at 24000, array B at 28000, array C at 32000, constants at 36000 and 36004

11 Locality principles: example
- Temporal locality: an accessed memory location is likely to be accessed again soon
- Spatial locality: a memory location near an accessed one is likely to be accessed soon

    Initialization:
    8000  lw   $1,36000($0)   # $r1 <- 0
    8004  lw   $2,36004($0)   # $r2 <- 3996
    Loop body (1000 times):
    8008  lw   $3,24000($1)   # $r3 <- A[i]
    8012  lw   $4,28000($1)   # $r4 <- B[i]
    8016  add  $3,$3,$4       # $r3 <- $r3 + $r4
    8020  sw   $3,32000($1)   # C[i] <- $r3
    8024  beq  $1,$2,8036     # jump to 8036 (sequel) if $r1 = $r2
    8028  addi $1,$1,4        # increment $r1
    8032  j    8008           # jump to 8008

13 Locality principles: example
[Figure: accessed addresses plotted over time for the loop above, across iterations 1 to 5. The instruction fetches (8000 to 8032) illustrate temporal locality: the same loop instructions are fetched at every iteration. The data fetches (A[i], B[i], C[i]) illustrate spatial locality: each iteration accesses the locations next to those of the previous one.]

16 Most frequently accessed
- Selection of data to cache is usually based on locality heuristics
- When data is fetched from lower levels:
  - it is loaded in the cache and kept there for later re-use (temporal locality), and
  - data in its neighbourhood are also loaded, just in case they are needed too (spatial locality)

17 Cache miss / cache hit
[Figure, steps 1-4: the CPU accesses address x; x is not in the cache (miss); handling the cache fault loads from memory the data at x and around it; subsequent accesses to x and to its neighbour x+1 hit; an access to an absent address y misses.]

18 Cache miss / cache hit
[Figure, steps 5-8: handling the cache fault on y loads y and its neighbourhood from memory; later accesses to y hit, until an access to yet another absent address misses again.]

19 Cache management strategies
- Where in the cache shall we store the incoming data when handling cache faults?
- Upon CPU accesses, how do we know whether a data item is in the cache, and where?
- In case a data item must be replaced, which one to choose?
- How do we handle write accesses?
- Various kinds of caches and associated strategies

20 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

21 Direct-mapped caches
- Smallest cacheable unit: the smallest Addressing Unit (AU, e.g. one byte or one word)
- For any AU there is one unique possible location in the cache
- Cache capacity: 2^k cache lines (stores at most 2^k AUs)
- Where in the cache is the AU whose memory address is a? In line a mod 2^k
- How do we know it is the right AU? The cache line also stores the tag: a / 2^k
- How do we know it is a valid AU (e.g. after reset)? The cache line also stores a validity bit
[Figure: the address is split into a tag and a k-bit line index; each of the 2^k cache lines stores a validity bit V, a tag and the data.]
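The mapping above (line index = a mod 2^k, tag = a / 2^k, plus a validity bit) can be sketched as a minimal direct-mapped cache model; the class and method names below are ours, not from the course:

```python
class DirectMappedCache:
    """Minimal direct-mapped cache: 2**k lines, one AU per line.
    Each line stores (valid, tag, data)."""
    def __init__(self, k):
        self.k = k
        self.lines = [(False, None, None)] * 2**k

    def lookup(self, a):
        """Return (hit, data) for address a."""
        index = a % 2**self.k          # the unique line the AU can be in
        tag = a // 2**self.k           # identifies which AU is stored there
        valid, stored_tag, data = self.lines[index]
        return (valid and stored_tag == tag, data)

    def fill(self, a, data):
        """Load the AU at address a into its unique line (after a miss)."""
        self.lines[a % 2**self.k] = (True, a // 2**self.k, data)

cache = DirectMappedCache(k=3)     # 8 lines
cache.fill(11, "m")                # address 11 -> line 3, tag 1
print(cache.lookup(11))            # (True, 'm')
hit, _ = cache.lookup(3)           # same line 3, but tag 0: miss
print(hit)                         # False
```

Note that addresses 3 and 11 compete for the same line: in a direct-mapped cache, conflicting addresses evict each other even when the rest of the cache is empty.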

31 Direct-mapped cache architecture
- Example: 4 GB memory with a 1 KB cache and one-byte cache lines
[Figure: the memory address is split into a tag and a 10-bit line index; the indexed line's validity bit V and stored tag are read, the stored tag is compared with the address tag, and the hit signal is the AND of the comparison result and the validity bit.]

32 Direct-mapped cache running
[Figure: four snapshots of a small direct-mapped cache in front of a 16-entry memory (a..p). An access misses when the indexed line is invalid or holds a different tag; handling the cache fault overwrites the line (validity bit, tag, data). Accesses to already-cached addresses hit; an address mapping to the same line as a cached one but with a different tag misses again.]

33 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

34 Larger addressing unit and cache line
- Addressing unit: the CPU word (e.g. 32 bits)
- Locality: a cache line stores several consecutive words (a block)
- Blocks are aligned in memory and in the cache
[Figure: the memory address is split into a tag, an index selecting one of the n cache lines, a unit offset selecting a word within the block, and a byte offset; each line stores a validity bit V, a tag and the block, and a multiplexer selects the requested word.]
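The address split above can be sketched in code. The field widths below are illustrative assumptions (4-byte words, 4-word blocks, 2^10 lines), not fixed by the slide:

```python
def split_address(a, byte_bits=2, word_bits=2, index_bits=10):
    """Split address a into (tag, index, word offset, byte offset).
    Fields occupy the address from least significant (byte offset)
    to most significant (tag)."""
    byte = a & ((1 << byte_bits) - 1)
    a >>= byte_bits
    word = a & ((1 << word_bits) - 1)
    a >>= word_bits
    index = a & ((1 << index_bits) - 1)
    tag = a >> index_bits
    return tag, index, word, byte

# With the assumed widths, 0x12345678 decomposes into a 2-bit byte
# offset, a 2-bit word offset, a 10-bit index and an 18-bit tag.
print(split_address(0x12345678))
```

Because blocks are aligned, the tag and index alone identify the block: all words of one block share them and differ only in the offsets.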

35 Direct-mapped cache running
[Figure: four snapshots of a direct-mapped cache with multi-word blocks in front of a 16-entry memory (a..p). A miss on one word loads the whole aligned block (e.g. i, j, k, l); accesses to the other words of that block then hit; a miss on a word of another block mapping to the same line (e.g. m, n, o, p) replaces the whole block.]

36 Exercise #1: Cache size and architecture
- 4-byte words
- 4-word blocks (cache lines)
- 64-bit addresses
- Direct-mapped cache
- Cache capacity: 2^16 bytes of data
- Questions: addresses breakdown? Cache architecture? Total cache size (with tags and valid bits)?

37 Limits of block size
- Fewer cache lines for the same cache capacity
- Favours spatial locality over temporal locality
- More data loaded from memory when handling a cache miss
- Potentially increases the cache miss cost
- Example (simplified): access three consecutive words in memory
  - Cache access time: 2 cycles; memory access time (1 word): 20 cycles
  - 1 word per cache line, 3 cache misses: 3 x (2 + 20) = 66 cycles
  - 4 words per cache line, 1 cache miss: 3 x 2 + 1 x (4 x 20) = 86 cycles
- Requires efficient data transfer between memory and cache
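The arithmetic above can be checked directly with the slide's simplified cost model: every access pays the cache access time, and each miss additionally fetches a whole block from memory, one word at a time:

```python
def access_cost(n_accesses, n_misses, block_words,
                cache_cycles=2, mem_cycles_per_word=20):
    """Total cycles: each access pays the cache access time; each miss
    additionally loads a full block from memory, word by word."""
    return (n_accesses * cache_cycles
            + n_misses * block_words * mem_cycles_per_word)

# Three consecutive words:
print(access_cost(3, 3, 1))  # 1-word lines, 3 misses -> 66
print(access_cost(3, 1, 4))  # 4-word lines, 1 miss  -> 86
```

With a naive word-by-word transfer, the larger block loses despite its better hit rate; this is exactly why the next slide looks at faster memory-cache transfers.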

38 Improving memory-cache transfers
- Wider data bus: expensive, bounded efficiency
- Higher bus frequency: expensive, limited by Printed Circuit Board (PCB) constraints; Double Data Rate (DDR)
- DRAM organisation: banks, rows, columns; row latency > column latency
- Wrapping bursts: requested word first
- Multi-banking: mask row latencies
[Figure: DDR memory organisation, with bank decoder, control logic, row and column decoding, multiplexer and read logic between the address and data buses.]

39 Cache efficiency vs block size
[Figure: miss rate as a function of block size, measured on a VAX machine with a direct-mapped cache. From Computer Organization and Design, the Hardware/Software Interface, Patterson and Hennessy, second edition, 1998.]

40 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

41 Set-associative caches
- Other possible improvement: several blocks per index (a set)
- N-way set-associative cache: N blocks per set
(Computer Organization and Design, the Hardware/Software Interface, Patterson and Hennessy, second edition, 1998)

42 Set-associative cache architecture
(Computer Organization and Design, the Hardware/Software Interface, Patterson and Hennessy, second edition, 1998)

43 Set-associative: replacement policy
- Several blocks per set: which block to replace when a set is full?
- First In First Out (FIFO), same as Least Recently Cached: not that good, the first in can be re-referenced frequently
- Random: simple hardware, sub-optimal but satisfactory
- Least Recently Used (LRU): complex hardware if more than 2 ways
- LRU approximations: combinations of FIFO and LRU (Not Most Recently Used)
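Not a solution to the exercise that follows, but a sketch of the machinery it uses: a set-associative cache with FIFO replacement, modelled in Python (class and parameter names are ours). Each set is a queue of tags, and the oldest tag is evicted when the set is full:

```python
from collections import deque

class SetAssociativeCache:
    """N-way set-associative cache with FIFO replacement.
    Each set holds up to `ways` tags; the oldest is evicted first."""
    def __init__(self, n_sets, ways):
        self.n_sets = n_sets
        self.ways = ways
        self.sets = [deque() for _ in range(n_sets)]

    def access(self, a):
        """Return 'hit' or 'miss' for address a, updating the cache."""
        s = self.sets[a % self.n_sets]
        tag = a // self.n_sets
        if tag in s:
            return "hit"         # FIFO: a hit does not reorder the queue
        if len(s) == self.ways:
            s.popleft()          # set full: evict the oldest block
        s.append(tag)
        return "miss"

cache = SetAssociativeCache(n_sets=2, ways=2)
for a in (0, 2, 0, 4, 0):
    print(a, cache.access(a))
```

In this trace all five addresses map to set 0; the fourth access (address 4) fills the set and evicts address 0, which was cached first, so the final access to 0 misses again. An LRU policy would have kept 0, since it was re-referenced.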

44 Exercise #2: Set-associative cache running
- 2-set, 2-way set-associative cache
- FIFO replacement policy
- Find a sequence of accesses (addresses) illustrating various situations: hits, misses

45 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

46 Write policies
- Goals: avoid memory corruption, improve performance
- Write-through: write in the cache and in memory
  - Performance issues when the write rate exceeds the memory throughput
- Write-back: memory written only when replacing a dirty block
  - A dirty flag is added to cache lines
  - Helpful when the write rate exceeds the memory throughput
- On a cache miss:
  - write only in memory (No Write Allocate), or
  - fetch the block from memory and write the data (Write Allocate)
- All combinations are possible, some are more frequent (guess which)
- Write buffers can be added to smooth the write rate to memory
[Figure: a read of x misses and x is cached; a write to x hits in the cache; a later read of y misses and replaces x, whose modified copy must not be lost.]
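The write-back, write-allocate combination described above can be sketched as follows (Python, names ours; a single cache line over a dict standing in for memory). The dirty flag is what defers the memory write until eviction:

```python
class WriteBackCache:
    """One-line write-back, write-allocate cache over a dict 'memory'.
    The line is (valid, addr, data, dirty); dirty data reaches memory
    only when the line is replaced."""
    def __init__(self, memory):
        self.memory = memory
        self.line = (False, None, None, False)

    def _evict_and_fill(self, addr):
        valid, old, data, dirty = self.line
        if valid and dirty:
            self.memory[old] = data      # write back the dirty block
        self.line = (True, addr, self.memory[addr], False)

    def read(self, addr):
        valid, a, _, _ = self.line
        if not (valid and a == addr):
            self._evict_and_fill(addr)   # miss: allocate the block
        return self.line[2]

    def write(self, addr, value):
        valid, a, _, _ = self.line
        if not (valid and a == addr):
            self._evict_and_fill(addr)   # write-allocate on a write miss
        self.line = (True, addr, value, True)  # dirty; memory untouched

mem = {0: "x", 1: "y"}
c = WriteBackCache(mem)
c.write(0, "X")
print(mem[0])      # still 'x': the write stayed in the cache
c.read(1)          # replaces the dirty block -> write-back
print(mem[0])      # now 'X'
```

A write-through cache would instead update `memory` inside `write` and need no dirty flag; the price is one memory write per store instruction.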

47 Outline
- Introduction
- Locality principles
- Direct-mapped caches
- Increasing block size
- Set-associative caches
- Write strategies
- Cache coherence in multiprocessor systems

48 Cache coherence problem
- Multiprocessor system: CPUs 1 to n, each with its own cache, sharing a memory over a bus
- Multiple copies of a block may exist in the caches and in memory: how to ensure coherence?
[Figure: x = 123 is initially cached by CPU 1 and CPU 2; CPU 2 then writes x = 237. CPU 1 still sees x = 123 while CPU 2 sees x = 237: incoherent.]

49 Cache coherence
- Note: the problem exists even in mono-processor systems (Direct Memory Access (DMA) peripherals)
- Snooping:
  - Caches inform the others about what they do
  - Each cache continuously monitors the others' activity (snooping) and takes appropriate actions when needed
  - "Appropriate" depends on the cache coherence protocol
- Directory-based:
  - Central or distributed directories track blocks and caches
  - Not studied in this course
- Cache coherence protocols require support from the bus protocol, to exchange state changes and cached data

50 Cache coherence protocols
- Define what action is taken in which circumstance (state)
- Examples:
  - Write invalidate: the writing cache sends a write-invalidate message; all snooping caches invalidate their copy of the written block. What if write-through? What if write-back?
  - Write update (broadcast): the writing cache broadcasts the new block; all snooping caches update their copy of the written block. What if write-through? What if write-back?
- Operations of one cache can:
  - be delayed upon request from another cache
  - wait for acknowledgements from other caches
  - expect a response from other caches (use a delay before acting unless responded to)
  - be served by another cache or by memory
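The write-invalidate scheme above can be modelled as a toy in Python (names ours): every cache snoops a shared "bus", and drops its copy when another cache announces a write. This sketches only the message flow, not timing, arbitration, or memory updates:

```python
class SnoopingCache:
    """Write-invalidate snooping: on a local write, broadcast an
    invalidate to the other caches; on a snooped invalidate, drop
    the local copy of the block."""
    def __init__(self, bus):
        self.data = {}            # locally cached blocks
        self.bus = bus
        bus.append(self)

    def write(self, addr, value):
        for other in self.bus:    # broadcast the invalidate on the bus
            if other is not self:
                other.data.pop(addr, None)
        self.data[addr] = value

    def read(self, addr, memory):
        if addr not in self.data: # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

bus, memory = [], {0: 123}
c1, c2 = SnoopingCache(bus), SnoopingCache(bus)
c1.read(0, memory)        # both caches now hold x = 123
c2.read(0, memory)
c2.write(0, 237)          # c1's stale copy is invalidated
print(0 in c1.data)       # False
```

A write-update protocol would replace the `pop` with an assignment of the new value into every snooping cache, trading invalidation misses for bus bandwidth.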

51 Cache coherence protocols
- Each cache maintains a state for each block: Invalid, Clean, Dirty, Exclusive, Owned
- Reacts on events from its own processor: processor read, processor write
- Reacts on messages from other caches (snooped on the bus): read, write, flush
- Emits messages to other caches (on the bus)
- Exchanges data with other caches (on the bus)

52 Cache coherence protocols
- Write-through caches
- Exercise #3: Imagine a protocol (messages, actions)

53 Example of coherence protocol: MSI
- MSI: Modified, Shared, Invalid
- Each block is in one of 3 states (M, S, I) for each cache; a block can be in different states in different caches
- State definitions:
  - Invalid: not in the cache (or not valid)
  - Shared: in the cache and valid, but read-only
  - Modified: in the cache, valid and read-write
- A block not in a cache is considered in the I (Invalid) state
- If a block is in the M (Modified) state in one cache, it is the only copy
- The 3 states can be encoded using the Valid and Dirty flags
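The encoding claim can be made concrete. The particular assignment below (I = line not valid; S = valid and clean; M = valid and dirty) is the natural one, though the slides do not fix it explicitly:

```python
def msi_state(valid, dirty):
    """Encode the MSI state of a cache line from its two flags.
    Assumed assignment: I = invalid line; S = valid & clean
    (read-only copy); M = valid & dirty (the only, writable copy)."""
    if not valid:
        return "I"
    return "M" if dirty else "S"

print(msi_state(False, False))  # I
print(msi_state(True, False))   # S
print(msi_state(True, True))    # M
```

Only three of the four (valid, dirty) combinations are used here, which is exactly the spare encoding that MESI spends on its fourth state on the next slide.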

54 Example of coherence protocol: MSI
- Write-back, write-allocate caches
- Exercise #4: Imagine the MSI protocol (events, actions)
[State diagram to complete: M, S, I]

55 Example of coherence protocol: MESI
- The 3 states of MSI can be encoded using the Valid and Dirty flags; we could add one more state for free
- Let us reduce bandwidth with a fourth, Exclusive (E), state
- Exclusive: block in the cache, valid, read-only, and it is the only copy
- Write-back, write-allocate caches
- Exercise #5: Imagine the MESI protocol (events, actions)
[State diagram to complete: M, E, S, I]

56 Cache coherence protocols
- Homework: imagine further improvements with one more state
- Example: MOESI protocol
  - O (Owner): modified in one cache, shared (S) in others
  - The owner is responsible for the write-back
  - The owner is responsible for providing the block to other caches
[State diagram to complete: M, O, E, S, I]

57 Vocabulary
- Cache hit (miss): the requested data is (not) in the cache
- Block: smallest cacheable unit (2^n words, aligned)
- Cache line: where the cache stores a block, its tag and flags
- Set: group of blocks with the same cache index
- N-way cache: cache with N blocks per set
- Direct-mapped cache: 1-way cache (number of lines = number of sets)
- Fully-associative cache: 1-set cache (number of lines = number of ways)
- Index: part of the address used to designate a set
- Tag: part of the address stored in the cache line and compared with the requested address to decide hit or miss
- Valid flag: flag stored in the cache line, indicating whether its content is valid or not
- Dirty flag: flag stored in the cache line, indicating whether its content has been modified or not
- Write-through: writing in the cache and in memory
- Write-back: writing in the cache but not in memory
- Eviction (replacement): replacing the block in a cache line by another block, after writing the replaced block to memory if the cache is write-back and the block was dirty
- Write-allocate: cache that, upon a write miss, reads the block from memory, stores it in the cache (evicting another block if needed) and writes in the cache (and in memory if it is a write-through cache)
- Write-no-allocate: cache that, upon a write miss, writes in memory but not in the cache
- Coherence: property guaranteeing that all CPUs see the same memory content despite their local caches


More information

Chapter 6 Objectives

Chapter 6 Objectives Chapter 6 Memory Chapter 6 Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Introduction to cache memories

Introduction to cache memories Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal

More information

Structure of Computer Systems

Structure of Computer Systems 222 Structure of Computer Systems Figure 4.64 shows how a page directory can be used to map linear addresses to 4-MB pages. The entries in the page directory point to page tables, and the entries in a

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Memory Organization MEMORY ORGANIZATION. Memory Hierarchy. Main Memory. Auxiliary Memory. Associative Memory. Cache Memory.

Memory Organization MEMORY ORGANIZATION. Memory Hierarchy. Main Memory. Auxiliary Memory. Associative Memory. Cache Memory. MEMORY ORGANIZATION Memory Hierarchy Main Memory Auxiliary Memory Associative Memory Cache Memory Virtual Memory MEMORY HIERARCHY Memory Hierarchy Memory Hierarchy is to obtain the highest possible access

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Chapter 8. Virtual Memory

Chapter 8. Virtual Memory Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

Cache introduction. April 16, Howard Huang 1

Cache introduction. April 16, Howard Huang 1 Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently

More information

Cache Architectures Design of Digital Circuits 217 Srdjan Capkun Onur Mutlu http://www.syssec.ethz.ch/education/digitaltechnik_17 Adapted from Digital Design and Computer Architecture, David Money Harris

More information

Recap: Machine Organization

Recap: Machine Organization ECE232: Hardware Organization and Design Part 14: Hierarchy Chapter 5 (4 th edition), 7 (3 rd edition) http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy,

More information

Cycle Time for Non-pipelined & Pipelined processors

Cycle Time for Non-pipelined & Pipelined processors Cycle Time for Non-pipelined & Pipelined processors Fetch Decode Execute Memory Writeback 250ps 350ps 150ps 300ps 200ps For a non-pipelined processor, the clock cycle is the sum of the latencies of all

More information

The University of Adelaide, School of Computer Science 13 September 2018

The University of Adelaide, School of Computer Science 13 September 2018 Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Registers. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH

Registers. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O T R O L ALU CTL ISTRUCTIO FETCH ISTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMOR ACCESS WRITE BACK A D D A D D A L U

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Cache 11232011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review Memory Components/Boards Two-Level Memory Hierarchy

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

EE 457 Unit 7a. Cache and Memory Hierarchy

EE 457 Unit 7a. Cache and Memory Hierarchy EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy & Caching Use several levels of faster and faster memory to hide delay of upper levels Registers Unit of Transfer:, Half, or Byte (LW, LH, LB

More information

CS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017

CS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017 CS 433 Homework 5 Assigned on 11/7/2017 Due in class on 11/30/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor)

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) Chapter 7-1 Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授 V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) 臺大電機吳安宇教授 - 計算機結構 1 Outline 7.1 Introduction 7.2 The Basics of Caches

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Cache and Memory. CS230 Tutorial 07

Cache and Memory. CS230 Tutorial 07 Cache and Memory CS230 Tutorial 07 Cache Overview Memory Hierarchy Blocks rom fastest and most expensive to slowest and least expensive astest memory is smallest and closest to CPU Slowest memory is largest

More information

MIPS) ( MUX

MIPS) ( MUX Memory What do we use for accessing small amounts of data quickly? Registers (32 in MIPS) Why not store all data and instructions in registers? Too much overhead for addressing; lose speed advantage Register

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12a Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1>

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1> Chapter 8 Digital Design and Computer Architecture: ARM Edition Sarah L. Harris and David Money Harris Digital Design and Computer Architecture: ARM Edition 215 Chapter 8 Chapter 8 :: Topics Introduction

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12 Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Assignment 1 due Mon (Feb 4pm

Assignment 1 due Mon (Feb 4pm Announcements Assignment 1 due Mon (Feb 19) @ 4pm Next week: no classes Inf3 Computer Architecture - 2017-2018 1 The Memory Gap 1.2x-1.5x 1.07x H&P 5/e, Fig. 2.2 Memory subsystem design increasingly important!

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

Logical Diagram of a Set-associative Cache Accessing a Cache

Logical Diagram of a Set-associative Cache Accessing a Cache Introduction Memory Hierarchy Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year Levels of memory with different sizes & speeds close to

More information

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site: Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Digital Logic & Computer Design CS Professor Dan Moldovan Spring Copyright 2007 Elsevier 8-<1>

Digital Logic & Computer Design CS Professor Dan Moldovan Spring Copyright 2007 Elsevier 8-<1> Digital Logic & Computer Design CS 4341 Professor Dan Moldovan Spring 21 Copyright 27 Elsevier 8- Chapter 8 :: Memory Systems Digital Design and Computer Architecture David Money Harris and Sarah L.

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Introduction. Memory Hierarchy

Introduction. Memory Hierarchy Introduction Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year 1 Memory Hierarchy Levels of memory with different sizes & speeds close to

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Chapter 8 :: Topics. Chapter 8 :: Memory Systems. Introduction Memory System Performance Analysis Caches Virtual Memory Memory-Mapped I/O Summary

Chapter 8 :: Topics. Chapter 8 :: Memory Systems. Introduction Memory System Performance Analysis Caches Virtual Memory Memory-Mapped I/O Summary Chapter 8 :: Systems Chapter 8 :: Topics Digital Design and Computer Architecture David Money Harris and Sarah L. Harris Introduction System Performance Analysis Caches Virtual -Mapped I/O Summary Copyright

More information

Memory System Design Part II. Bharadwaj Amrutur ECE Dept. IISc Bangalore.

Memory System Design Part II. Bharadwaj Amrutur ECE Dept. IISc Bangalore. Memory System Design Part II Bharadwaj Amrutur ECE Dept. IISc Bangalore. References: Outline Computer Architecture a Quantitative Approach, Hennessy & Patterson Topics Memory hierarchy Cache Multi-core

More information