Computer Architecture ELEC3441

Lecture 9: Cache
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering

CPU-Memory Bottleneck
[Figure: CPU connected to Memory]
Performance of high-speed computers is usually limited by memory bandwidth and latency.
- Latency (time for a single access)
  - Memory access time >> processor cycle time
- Bandwidth (number of accesses per unit time)
  - If a fraction m of instructions access memory, each instruction makes 1 + m memory references (one instruction fetch plus m data accesses)
  - CPI = 1 therefore requires 1 + m memory references per cycle (assuming the RISC-V ISA)

Bandwidth vs Latency
- Example: DDR SDRAM
  - Latency in the range of 0-50 ns
  - Bandwidth in the range of .5- GT/s, 0-0 GB/s
- Modern processors:
  - Clock rate in the range of - GHz
  - Multiple instruction issue (-4 memory instructions at the same time)
  - Multiple cores (-8)
- Gap:
  - Memory bandwidth is -6x slower
  - Latency is > 0x slower

Cache Memory
[Figure: CPU and cache on the same die (small, fast SRAM), backed by main memory (large, slow DRAM)]
- High-speed memory that holds a temporary copy of frequently used data from main memory
- Usually on the same die as the CPU
- Low latency
  - Typical: processor cycles
- Limited capacity (compared to main memory)
  - Typical: k to 0 Mbytes L cache
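The bandwidth demand on the CPU-Memory Bottleneck slide can be checked with a few lines of arithmetic. This is only a sketch: the function name `memory_refs_per_cycle` and the 30% load/store fraction are assumptions made for illustration, not values from the lecture.

```python
def memory_refs_per_cycle(m: float, cpi: float = 1.0) -> float:
    """Memory references the processor issues per cycle.

    Each instruction needs one instruction fetch plus, on average,
    m data accesses, so it makes (1 + m) references. Dividing by CPI
    converts references per instruction into references per cycle.
    """
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return (1.0 + m) / cpi


# Hypothetical workload: 30% of instructions are loads or stores.
print(memory_refs_per_cycle(0.3))       # demand at CPI = 1
print(memory_refs_per_cycle(0.3, 2.0))  # a CPI of 2 halves the demand
```

A lower CPI (a faster core) raises the number of references the memory system must serve each cycle, which is why the gap matters more as processors improve.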

Cache Operation Overview
[Figure: CPU, cache, and memory]
- To access a memory location:
  - Look up the memory content in the cache
  - If found, return it
  - If not found, look in memory
- Low-latency access, statistically
  - Need ways to make sure that data that will be needed are in the cache

Memory Hierarchy
[Figure: CPU regfile, L1$, L2$, L3$, Memory, Hard Disk]
- Gives the illusion of a large and fast memory, statistically
- Capacity: Register << SRAM << DRAM << Magnetic disk
- Latency: Register << SRAM << DRAM << Magnetic disk
- Bandwidth: on-chip >> off-chip >> I/O bus
- Cost ($/bit): Register >> SRAM >> DRAM >> Magnetic disk
- The same concept of creating the illusion of a fast and large memory spans from the register file to the hard disk

Real Memory Reference Patterns
[Figure: memory address (one dot per access) plotted against time]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Typical Memory Reference Patterns
[Figure: address plotted against time for three kinds of reference]
- Instruction fetches: n loop iterations
- Stack accesses: subroutine call, argument access, subroutine return
- Data accesses: scalar accesses

Two predictable properties of memory references:
- Temporal locality: if a location is referenced, it is likely to be referenced again in the near future.
- Spatial locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.

Memory Reference Patterns
[Figure: memory address (one dot per access) plotted against time, with regions of spatial locality and temporal locality marked]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Caches exploit both types of predictability:
- Exploit temporal locality by remembering the contents of recently accessed locations.
- Exploit spatial locality by fetching blocks of data around recently accessed locations.
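To make the two kinds of locality concrete, the sketch below classifies each access in an address trace by looking at a short window of preceding accesses. The function `locality_counts`, the 16-byte line, the window of 8 accesses, and both traces are assumptions chosen for illustration; they are not taken from the slides.

```python
def locality_counts(trace, line_bytes=16, window=8):
    """Classify each access using the previous `window` accesses.

    temporal: the same address appeared in the window
    spatial:  a different address in the same line appeared in the window
    neither:  no recent access touched this line
    """
    counts = {"temporal": 0, "spatial": 0, "neither": 0}
    recent = []  # the last `window` addresses, oldest first
    for addr in trace:
        if addr in recent:
            counts["temporal"] += 1
        elif any(a // line_bytes == addr // line_bytes for a in recent):
            counts["spatial"] += 1
        else:
            counts["neither"] += 1
        recent.append(addr)
        if len(recent) > window:
            recent.pop(0)
    return counts


# Hypothetical traces: a sequential walk over 16 four-byte words,
# and a loop that keeps re-reading the same two words.
array_walk = [0x1000 + 4 * i for i in range(16)]
loop = [0x2000, 0x2004] * 8

print(locality_counts(array_walk))  # {'temporal': 0, 'spatial': 12, 'neither': 4}
print(locality_counts(loop))        # {'temporal': 14, 'spatial': 1, 'neither': 1}
```

The array walk benefits only from fetching whole blocks, while the loop benefits mostly from remembering recently used locations, which is why a cache is built to do both.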

Inside a Cache
[Figure: the processor, the cache, and main memory linked by address paths; each cache line pairs an address tag with a block of bytes that is a copy of a main memory location]

Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
- Found in cache, a.k.a. HIT
  - Return copy of data from cache
- Not in cache, a.k.a. MISS
  - Read block of data from main memory
  - Wait
  - Return data to processor and update cache
  - Q: Which line do we replace?

Designing Cache
Factors to consider when designing a cache:
- How big is the cache? (Capacity)
- How much data to fetch from memory every time? (Line size)
- Where to put data in the cache when it is fetched? (Cache organization)
- How to deal with conflicts? (Replacement policy)
- Synchronization with memory (Read/write policies)
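The read algorithm above can be sketched in a few lines. This is an illustrative model rather than the lecture's implementation: the class name `SimpleCache`, its parameters, and the "evict the oldest-inserted line" placeholder are all assumptions, and the sketch assumes the memory size is a multiple of the line size. The question of which line to replace is deliberately left with a trivial answer here.

```python
class SimpleCache:
    """Toy read-only cache in front of a bytes-like main memory."""

    def __init__(self, memory, num_lines=4, line_bytes=4):
        self.memory = memory
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.lines = {}  # tag (line address) -> copy of the memory block
        self.hits = 0
        self.misses = 0

    def read_byte(self, addr):
        tag, offset = divmod(addr, self.line_bytes)
        if tag in self.lines:
            # HIT: return the copy held in the cache
            self.hits += 1
        else:
            # MISS: fetch the whole block from main memory
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                # Placeholder policy (an assumption): drop the line that
                # was inserted first. dicts keep insertion order.
                victim = next(iter(self.lines))
                del self.lines[victim]
            start = tag * self.line_bytes
            self.lines[tag] = bytes(self.memory[start:start + self.line_bytes])
        return self.lines[tag][offset]


memory = bytes(range(64))  # hypothetical 64-byte main memory
cache = SimpleCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 2, 3, 4, 0)]
print(values, cache.hits, cache.misses)  # [0, 1, 2, 3, 4, 0] 4 2
```

Only two of the six reads go to main memory, because each miss brings in a four-byte block that serves the neighbouring reads.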

Line Size and Spatial Locality
- A line is the unit of transfer between the cache and memory
[Figure: a 4-word line (b=) holding a tag and Word0 onward]
- Split CPU address: Line Address (32-b bits) | Offset (b bits)
- 2^b = line size (in bytes)
- Larger line size has distinct hardware advantages:
  - Less tag overhead
  - Exploit fast burst transfers from DRAM
  - Exploit fast burst transfers over wide busses
- What are the disadvantages of increasing line size?
  - Fewer lines => more conflicts. Can waste bandwidth.

Cache Configurations
- Fully Associative
- Direct Map
- Set Associative

Fully Associative Cache
- Cache lines can be stored in any location of the cache
- Evict (overwrite) a cache line only when out of space
- Works similarly to an ideal cache, except with a realistic capacity limitation

Example: Fully Associative
Fully associative with 8 entries, line size = 4 words (b=)
32-bit address: Tag (28 bits) | Offset (4 bits)

Valid  Tag      Content
T      ABC00E   D0D0D0D0
T      E0009D   E0E0E0E0
       CAA000E
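The address split can be written as two bit operations. The sketch below assumes the 4-bit offset stated in the fully associative example (a 16-byte line); the helper name `split_line_address` is hypothetical.

```python
def split_line_address(addr: int, offset_bits: int = 4):
    """Split a byte address into (line_address, offset).

    The low `offset_bits` bits select a byte inside the line; the
    remaining high bits identify the line, and serve as the tag in a
    fully associative cache.
    """
    offset = addr & ((1 << offset_bits) - 1)
    line_address = addr >> offset_bits
    return line_address, offset


for addr in (0x0000A000, 0x0000A004, 0x0000A00C):
    line_address, offset = split_line_address(addr)
    print(f"{addr:#010x} -> line {line_address:#x}, offset {offset:#x}")
```

All three addresses share the line address 0xa00, so after the first access loads the line, the other two would be served by the same cache line.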

Example: Fully Associative (continued)
Fully associative with 8 entries, line size = 4 words (b=)
Access sequence: 0x0000A000, 0x0000A004, 0xBEE6, 0x088A80, 0x0000A00C
Cache contents as the accesses proceed:

Valid  Tag      Content
T      0000A00  DDDD DDDD DDDD DDDD
T      BEE      EEEE EEEE EEEE EEEE
T      088A

Example: Fully Associative (final access)
The last access, 0x0000A00C, carries tag 0000A00, which matches the line loaded by 0x0000A000: HIT.

Fully Associative Cache
[Figure: fully associative cache. Address fields: Tag (t bits), Offset (b bits). Each line has a valid bit (V), a tag, and a data block, with one comparator (=) per line; outputs are HIT and the data word or byte.]

Direct Map Cache
- Simple and realistic implementation
- Each memory address may be stored at only one possible location in the cache
  - Usually by taking k bits of the address for a 2^k-line cache
- More than one memory address may be mapped to the same cache location => collision
  - Evict old content before new content is stored
- Simple and fast
  - Often used in L cache

Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=)
32-bit address: Tag (25 bits) | Index (3 bits) | Offset (4 bits)
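A direct-mapped lookup can be simulated by extracting the three address fields and keeping one stored tag per line. The sketch assumes the geometry of the example (8 entries, so 3 index bits, and a 4-bit offset); the function names and the four-address trace are hypothetical and are not the access sequence from the slides.

```python
def split_address(addr, index_bits=3, offset_bits=4):
    """Return (tag, index, offset) for a direct-mapped cache."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def simulate_direct_mapped(trace, index_bits=3, offset_bits=4):
    """Return 'H' or 'M' for each address in the trace."""
    lines = [None] * (1 << index_bits)  # stored tag per line; None = invalid
    results = []
    for addr in trace:
        tag, index, _ = split_address(addr, index_bits, offset_bits)
        if lines[index] == tag:
            results.append("H")
        else:
            results.append("M")
            lines[index] = tag  # collision: the old content is overwritten
    return results


# Hypothetical trace: 0x0001A008 has the same index as 0x0000A000 but a
# different tag, so it evicts that line.
trace = [0x0000A000, 0x0000A004, 0x0001A008, 0x0000A00C]
print(simulate_direct_mapped(trace))  # ['M', 'H', 'M', 'M']
```

The final access misses even though seven of the eight lines are still empty, which is the collision cost of allowing only one location per address.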

Example: Direct Map (continued)
Direct map with 8 entries, line size = 4 words (b=)
Access sequence: 0x0000A000, 0x0000A004, 0xBEE6, 0x088A80, 0x0000A00C
Cache contents as the accesses proceed:

Valid  Tag  Content
T           DDDD DDDD DDDD DDDD
T      0    EEEE EEEE EEEE EEEE

Direct-Mapped Cache
[Figure: direct-mapped cache. Address fields: Tag (t bits), Index (k bits), Offset (b bits). The index selects one of 2^k lines, each with a valid bit (V), a tag, and a data block; a single comparator (=) produces HIT, and the offset selects the data word or byte.]

Set Associative Cache
- One way to reduce misses in a cache is to increase the number of possible locations in which a data block can be stored
- An N-way set associative cache has N locations to store each data block
  - A data block can be placed in any of the N locations
  - The N locations form a set
- Each set may hold data with the same index
  - Allows N different data blocks with the same index to be stored in the cache
- Needs a replacement policy to determine which of the N data blocks in the set is evicted when block N+1 is written

Example: 2-Way Set Associative
2-way set associative with 4 entries, line size = 4 words (b=)
Access sequence: 0x0000A000, 0x0000A004, 0xBEE6, 0x088A80, 0x0000A00C

Example: 2-Way Set Associative (continued)
2-way set associative with 4 entries, line size = 4 words (b=)
Access sequence: 0x0000A000, 0x0000A004, 0xBEE6, 0x088A80, 0x0000A00C
Cache contents after the accesses:

First way:
V  Tag_Index   Content
T  0000A0_00   D D D D
T  BEE_        E E E E

Second way:
V  Tag_Index   Content
T  088A_0

What happens if the next access is 0xC000A8C?

2-Way Set-Associative Cache
[Figure: 2-way set-associative cache. Address fields: Tag (t bits), Index (k bits), Offset (b bits). The index selects one set; each of the two ways has a valid bit (V), a tag, and a data block, with one comparator (=) per way; outputs are HIT and the data word or byte.]

Cache Organizations
[Figure: same size, different organizations. The same data lines arranged as a set associative, a direct map, and a fully associative cache.]
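The three organizations differ only in how many sets and ways the same number of lines is divided into, so a single simulator can cover all of them. The sketch below is illustrative: the function name, the 4-bit offset, the use of least-recently-used replacement inside a set, and the trace are all assumptions rather than details from the slides.

```python
def simulate_set_associative(trace, num_sets, ways, offset_bits=4):
    """Return 'H' or 'M' per access for a num_sets x ways cache."""
    # Each set is a list of tags ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in trace:
        line_addr = addr >> offset_bits
        index = line_addr % num_sets
        tag = line_addr // num_sets
        entries = sets[index]
        if tag in entries:
            results.append("H")
            entries.remove(tag)
        else:
            results.append("M")
            if len(entries) == ways:
                entries.pop(0)  # set is full: evict the least recently used
        entries.append(tag)  # mark as most recently used
    return results


# Hypothetical trace: two lines that share an index, accessed alternately.
trace = [0x0000A000, 0x0001A000, 0x0000A004, 0x0001A004]

# The same 8 lines organized three ways.
print(simulate_set_associative(trace, num_sets=8, ways=1))  # direct map
print(simulate_set_associative(trace, num_sets=4, ways=2))  # 2-way
print(simulate_set_associative(trace, num_sets=1, ways=8))  # fully associative
```

With one way per set the two lines keep evicting each other and every access misses; with two or more ways both lines fit in the same set and the last two accesses hit.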

Replacement Policy
In an associative cache, which line from a set should be evicted when the set becomes full?
- Random
- Least-Recently Used (LRU)
  - LRU cache state must be updated on every access
  - True implementation only feasible for small sets (2-way)
  - Pseudo-LRU binary tree often used for 4-8 way
- First-In, First-Out (FIFO), a.k.a. Round-Robin
  - Used in highly associative caches
- Not-Most-Recently Used (NMRU)
  - FIFO with an exception for the most-recently used line or lines

Least Recently Used
- On replacement, select the line that was accessed the least recently (the oldest line)
- Need to memorize the access time of each line

Example: 4-way set associative, line size = 1 word; arrays a[], b[], c[], d[], e[] all map to set 0.

time    0  1  2  3  4  5  6  7  8  9  10 11
access  a  b  c  d  c  d  e  a  e  b  c  d
H/M     M  M  M  M  H  H  M

State of set 0 (line: access time):
- After times 0-3 (four misses): a: 0, b: 1, c: 2, d: 3
- After times 4-5 (hits on c and d): a: 0, b: 1, c: 4, d: 5
- After time 6 (e misses and replaces a, the oldest line): e: 6, b: 1, c: 4, d: 5

Least Recently Used (continued)

time    0  1  2  3  4  5  6  7  8  9  10 11
access  a  b  c  d  c  d  e  a  e  b  c  d
H/M     M  M  M  M  H  H  M  M  H  M  M  M

State of set 0 (line: access time):
- After time 7 (a misses and replaces b): e: 6, a: 7, c: 4, d: 5
- After time 8 (e hits): e: 8, a: 7, c: 4, d: 5
- After times 9-11 (b, c, and d all miss, replacing c, d, and a in turn): e: 8, d: 11, b: 9, c: 10

Pseudo LRU
- Implementation challenges for true LRU:
  - Requires storage for the access time of every line
    - Enormous amount of storage
    - Counter wrap-around
  - Requires comparison of all access times within a set
    - Comparison is slow in hardware
- Pseudo LRU relaxes the requirement to find the absolutely oldest piece of data in a set
  - Randomly pick any one of the older data in the set
- One simple implementation:
  - Set a flag bit for each cache line when it is accessed
  - To replace, evict any one of the cache lines with a 0 flag
  - Periodically reset all flags to 0
- Advanced version:
  - Evict only lines that are not dirty when there is a draw
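The LRU walkthrough (accesses a b c d c d e a e b c d into one 4-way set) can be replayed with a short simulation that memorizes the access time of each resident line, as the slide describes. The function name `lru_trace` and the use of a dictionary for the set are assumptions made for this sketch.

```python
def lru_trace(accesses, ways=4):
    """Replay accesses into one LRU-managed set.

    Returns the hit/miss outcome of every access and the lines that
    are resident at the end.
    """
    last_used = {}  # resident line -> time of its most recent access
    outcome = []
    for time, line in enumerate(accesses):
        if line in last_used:
            outcome.append("H")
        else:
            outcome.append("M")
            if len(last_used) == ways:
                # Evict the line with the oldest access time.
                victim = min(last_used, key=last_used.get)
                del last_used[victim]
        last_used[line] = time
    return outcome, sorted(last_used)


outcome, resident = lru_trace("abcdcdeaebcd")
print(" ".join(outcome))  # M M M M H H M M H M M M
print(resident)           # ['b', 'c', 'd', 'e']
```

The `min` over all access times is the comparison that the Pseudo LRU slide calls slow in hardware; the flag-bit scheme replaces it with a check for any line whose flag is 0.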

Acknowledgements
- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - John Lazzaro (UCB)
- MIT material derived from course 6.823
- UCB material derived from courses CS152, CS252
