ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 6: Memory Organization Part I


1 ELEC 5200/6200 Computer Architecture and Design, Spring 2017, Lecture 6: Memory Organization Part I
Ujjwal Guin, Assistant Professor, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)
[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]

2 Designing a Computer
[Diagram: the five classic components of a computer - Input, Output, Memory, and the Central Processing Unit (CPU) or Processor, comprising Control and Datapath.]

3 Types of Computer Memories
From the cover of: A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Upper Saddle River, New Jersey: Pearson Prentice Hall.

4 Random Access Memory (RAM)
[Diagram: address bits drive an address decoder that selects a row of the memory cell array; read/write circuits transfer the data bits.]

5 Six-Transistor SRAM Cell
[Diagram: a 6T SRAM cell - two cross-coupled inverters store bit and its complement; two access transistors, gated by the word line, connect the cell to the bit line and its complement bit line.]

6 Dynamic RAM (DRAM) Cell
[Diagram: a single-transistor DRAM cell - one access transistor, gated by the word line, connects a storage capacitor to the bit line.]
Robert Dennard's 1967 invention.

7 Electronic Memory Devices

Memory technology          | Typical access time     | $ per GiB
SRAM semiconductor memory  | 0.5-2.5 ns              | $500-$1,000
DRAM semiconductor memory  | 50-70 ns                | $10-$20
Flash semiconductor memory | 5,000-50,000 ns         | $0.75-$1.00
Magnetic disk              | 5,000,000-20,000,000 ns | $0.05-$0.10

For more on memories:
Semiconductor Memories: A Handbook of Design, Manufacture and Application, by Betty Prince, Wiley.
Emerging Memories: Technologies and Trends, by Betty Prince, Springer.

8 DRAM Evolution

Year introduced | Chip size   | $ per GiB  | Total access time to a new row/column | Average column access time to existing row
1980            | 64 Kibibit  | $1,500,000 | 250 ns | 150 ns
1983            | 256 Kibibit | $500,000   | 185 ns | 100 ns
1985            | 1 Mebibit   | $200,000   | 135 ns | 40 ns
1989            | 4 Mebibit   | $50,000    | 110 ns | 40 ns
1992            | 16 Mebibit  | $15,000    | 90 ns  | 30 ns
1996            | 64 Mebibit  | $10,000    | 60 ns  | 12 ns
1998            | 128 Mebibit | $4,000     | 60 ns  | 10 ns
2000            | 256 Mebibit | $1,000     | 55 ns  | 7 ns
2004            | 512 Mebibit | $250       | 50 ns  | 5 ns
2007            | 1 Gibibit   | $50        | 45 ns  | 1.25 ns
2010            | 2 Gibibit   | $30        | 40 ns  | 1 ns
2012            | 4 Gibibit   | $1         | 35 ns  | 0.8 ns

9 Processor-Memory Performance Gap
[Graph: performance vs. year - processor performance grows at ~55%/year (2X/1.5yr, Moore's Law) while DRAM performance grows at ~7%/year (2X/10yrs); the processor-memory performance gap grows ~50%/year.]

10 Impact of Memory Performance
Suppose a processor executes with ideal CPI = 1.1, with 50% arith/logic, 30% ld/st, and 20% control instructions, and that 10% of data memory operations miss with a 50-cycle miss penalty.
CPI = ideal CPI + average stalls per instruction
= 1.1 (cycle) + ( 0.30 (datamemops/instr) x 0.10 (miss/datamemop) x 50 (cycle/miss) )
= 1.1 cycle + 1.5 cycle = 2.6
So 1.5/2.6 = 58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 to the CPI.
[Pie chart: Ideal CPI 1.1, DataMiss 1.5, InstrMiss 0.5.]
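A minimal Python sketch of the slide's stall arithmetic (all values taken from the slide):

    ideal_cpi    = 1.1   # cycles/instruction with a perfect memory system
    ldst_frac    = 0.30  # fraction of instructions that access data memory
    miss_rate    = 0.10  # fraction of data accesses that miss
    miss_penalty = 50    # stall cycles per miss

    data_stalls = ldst_frac * miss_rate * miss_penalty  # 1.5 cycles/instruction
    cpi = ideal_cpi + data_stalls                       # 2.6
    print(cpi, data_stalls / cpi)                       # 2.6, ~0.58 (58% stalled)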

11 The Memory Hierarchy Goal
Fact: large memories are slow, and fast memories are small.
How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?
- With hierarchy
- With parallelism

12 A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
On-chip components: RegFile, ITLB/DTLB, Instr Cache and Data Cache (and eDRAM), with Control and Datapath; then a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).
Speed (# cycles): 1/2's, 1's, 10's, 100's, 1,000's
Size (bytes): 100's, K's, 10K's, M's, G's to T's
Cost: highest ... lowest

13 Characteristics of the Memory Hierarchy
Increasing distance from the processor means increasing access time: Processor -> L1$ -> L2$ -> Main Memory -> Secondary Memory.
Transfer sizes: 4-8 bytes (word) between processor and L1$; 8-32 bytes (block) between L1$ and L2$; 1 to 4 blocks between L2$ and main memory; 1,024+ bytes (disk sector = page) between main memory and secondary memory.
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.
The (relative) size of the memory grows at each level.

14 Introduction to Cache ($)

15 The Locality Principle
A program tends to access data that form a physical cluster in the memory; multiple accesses may be made within the same block.
Physical localities are temporal and may shift over longer periods of time; data not used for some time is less likely to be used in the future.
Upon a miss, the least recently used (LRU) block can be overwritten by a new block.
P. J. Denning, "The Locality Principle," Communications of the ACM, vol. 48, no. 7, pp. 19-24, July 2005.

16 The Memory Hierarchy: Why Does it Work?
Temporal Locality (locality in time): keep most recently accessed data items closer to the processor.
Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels.
[Diagram: the processor exchanges words with the upper-level memory (holding block X); blocks move between the upper level and the lower-level memory (holding block Y).]

17 Data Locality, Cache, Blocks
Increase the block size to match the locality size; increase the cache size to include most blocks.
[Diagram: two blocks of data needed by a program, mapped from main memory into the cache.]

18 The Memory Hierarchy: Terminology
Hit: data is in some block in the upper level (block X).
- Hit Rate: the fraction of memory accesses found in the upper level.
- Hit Time: time to access the upper level, consisting of the time to determine hit/miss + the RAM access time.
Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (block Y).
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Hit Time << Miss Penalty

19 How is the Hierarchy Managed?
Registers <-> Memory: by the compiler (programmer?).
Cache <-> Main memory: by the cache controller hardware.
Main memory <-> Disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (Translation Lookaside Buffer), and by the programmer (files).

20 Designs of Cache
Two questions to answer (in terms of hardware):
Q1: How do we know if the data is in the cache? By comparing the tag.
Q2: How and where do we find the data in the cache? By address mapping.

21 Direct-Mapped Cache
For each item of data at the lower level, there is exactly one location in the cache where it might be, so lots of items at the lower level must share locations in the upper level.
Address mapping: (block address) modulo (# of blocks in the cache)
Let's first consider block sizes of one word in the next example.
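A small Python sketch of this mapping (the helper name is ours, not the slide's):

    # Direct mapping: each memory block lands at exactly one cache index.
    def direct_map(block_addr: int, num_blocks: int) -> int:
        return block_addr % num_blocks

    # With 8 cache blocks, memory blocks 5, 13, 21, 29 all share index 5:
    print([direct_map(b, 8) for b in (5, 13, 21, 29)])  # [5, 5, 5, 5]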

22 Direct-Mapped Cache
[Diagram: a 32-word, word-addressable main memory mapped into a cache of 8 blocks, block size = 1 word. The 5-bit memory address splits into a 2-bit tag and a 3-bit index; the index is the local cache address.]

23 Direct-Mapped Cache
[Diagram: a 32-word, word-addressable main memory mapped into a cache of 4 blocks, block size = 2 words. The 5-bit memory address splits into a 2-bit tag, a 2-bit index, and a 1-bit block offset.]

24 Direct-Mapped Cache (Byte Address)
[Diagram: a 32-word, byte-addressable main memory mapped into a cache of 8 blocks, block size = 1 word. The 7-bit memory address splits into a 2-bit tag, a 3-bit index, and a 2-bit byte offset.]

25 Example: Direct-Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15 on a direct-mapped cache of 4 one-word blocks. Start with an empty cache; all blocks are initially marked as not valid.
0 miss, 1 miss, 2 miss, 3 miss: the cache fills with Mem(0), Mem(1), Mem(2), Mem(3) (tag 00 each).
4 miss: Mem(4) (tag 01) replaces Mem(0) in block 0.
3 hit, 4 hit.
15 miss: Mem(15) (tag 11) replaces Mem(3) in block 3.
8 requests, 6 misses.
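A minimal simulation of this trace (cache size and reference string as inferred above):

    # Direct-mapped cache: one stored block address ("tag") per cache index.
    def count_misses(refs, num_blocks):
        cache = [None] * num_blocks        # None models the not-valid state
        misses = 0
        for addr in refs:
            idx = addr % num_blocks
            if cache[idx] != addr:         # invalid or tag mismatch -> miss
                cache[idx] = addr
                misses += 1
        return misses

    print(count_misses([0, 1, 2, 3, 4, 3, 4, 15], 4))  # 6 misses of 8 requests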

26 Finding a Word in Cache
[Diagram: 32-word, byte-addressable main memory; cache size 8 words, block size = 1 word. The 7-bit address b6..b0 splits into a 2-bit tag (b6-b5), a 3-bit index (b4-b2), and a 2-bit byte offset (b1-b0). The index selects a cache entry holding a valid bit, a 2-bit tag, and the data word; the stored tag is compared with the address tag, and (valid AND equal) gives 1 = hit, 0 = miss.]

27 How Many Bits Does the Cache Have?
Consider a main memory of 32 words; the byte address is 7 bits wide: b6 b5 b4 b3 b2 b1 b0. Each word is 32 bits wide.
Assume that the cache block size is 1 word (32 bits of data) and the cache contains 8 blocks.
The cache requires, for each word: a 2-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache x (data bits/block + tag bits + valid bit) = 8 x (32 + 2 + 1) = 280 bits
Physical storage/data storage = 280/256 = 1.09
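The same bookkeeping as a sketch (the function name is ours):

    def cache_bits(num_blocks, data_bits_per_block, tag_bits, valid_bits=1):
        return num_blocks * (data_bits_per_block + tag_bits + valid_bits)

    total = cache_bits(8, 32, 2)     # 280 bits
    print(total, total / (8 * 32))   # 280, overhead ratio ~1.09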

28 A More Realistic Cache
Consider a 4 GB, byte-addressable main memory: 1G words; the byte address is 32 bits wide: b31 ... b2 b1 b0. Each word is 32 bits wide.
Assume that the cache block size is 1 word (32 bits of data) and the cache contains 64 KB of data, or 16K words, i.e., 16K blocks.
Number of cache index bits = 14, because 16K = 2^14
Tag size = 32 - byte offset bits - #index bits = 32 - 2 - 14 = 16 bits
The cache requires, for each word: a 16-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache x (data bits/block + tag size + valid bit) = 2^14 x (32 + 16 + 1) = 2^14 x 49 = 802,816 bits = 784 Kb = 98 KB
Physical storage/data storage = 98/64 = 1.53
But we need to increase the block size to match the size of locality.

29 Data Organization in Cache
[Diagram: each cache entry holds a block index, a valid bit, a tag, and a multi-word data block; the memory address of a word in the cache is formed from the tag, the block index, and the block offset. Address mapping is overhead: a 4-bit tag means memory is 16 times larger than the cache.]

30 Cache Bits for 4-Word Block
Consider a 4 GB, byte-addressable main memory: 1G words; the byte address is 32 bits wide: b31 ... b2 b1 b0. Each word is 32 bits wide.
Assume that the cache block size is 4 words (128 bits of data) and the cache contains 64 KB of data, or 16K words, i.e., 4K blocks.
Number of cache index bits = 12, because 4K = 2^12
Tag size = 32 - byte offset bits - #block offset bits - #index bits = 32 - 2 - 2 - 12 = 16 bits
The cache requires, for each block: a 16-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache x (data bits/block + tag size + valid bit) = 2^12 x (128 + 16 + 1) = 2^12 x 145 = 593,920 bits = 580 Kb = 72.5 KB
Physical storage/data storage = 72.5/64 = 1.13
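Plugging both 64 KB configurations into the same sketch shows the overhead drop with larger blocks:

    def cache_bits(num_blocks, data_bits_per_block, tag_bits, valid_bits=1):
        return num_blocks * (data_bits_per_block + tag_bits + valid_bits)

    print(cache_bits(2**14, 32, 16) / 8 / 1024)   # 98.0 KB for 1-word blocks
    print(cache_bits(2**12, 128, 16) / 8 / 1024)  # 72.5 KB for 4-word blocks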

31 Using a Larger Cache Block (4 Words): 4K Indexes
[Diagram: 4 GB = 1G words, byte-addressable memory; cache size 16K words, block size = 4 words (128 bits), 4K indexes. The 32-bit address splits into a 16-bit tag (b31-b16), a 12-bit index (b15-b4), a 2-bit block offset (b3-b2), and a 2-bit byte offset (b1-b0). The index selects a valid bit, a 16-bit tag, and a 4-word data block; tag comparison gives 1 = hit, 0 = miss, and a MUX selects the requested word using the block offset.]

32 Limitations of Direct Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4 on the same direct-mapped cache of 4 one-word blocks. Start with an empty cache; all blocks are initially marked as not valid.
Addresses 0 and 4 both map to cache block 0 (0 mod 4 = 4 mod 4 = 0), so each access evicts the other: miss, miss, miss, miss, miss, miss, miss, miss.
8 requests, 8 misses.
This ping-pong effect is due to conflict misses: two memory locations that map into the same cache block.

33 Set-Associative Cache
Consider the same reference string 0 4 0 4 0 4 0 4 on a two-way set-associative cache. Start with an empty cache; all blocks are initially marked as not valid.
0 miss, 4 miss; all remaining accesses hit, because Mem(0) and Mem(4) co-exist in the two ways of the same set.
8 requests, 2 misses.
This solves the ping-pong effect of the direct-mapped cache. Recall that the ping-pong effect is due to conflict misses; now two memory locations that map into the same cache set can co-exist in a 2-way set-associative cache.
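A sketch comparing the two organizations on this string (geometry as in the slides: 4 one-word blocks direct-mapped, or the same capacity as 2 sets x 2 ways with LRU replacement):

    def misses_direct(refs, num_blocks=4):
        cache, m = [None] * num_blocks, 0
        for a in refs:
            i = a % num_blocks
            if cache[i] != a:
                cache[i], m = a, m + 1
        return m

    def misses_2way(refs, num_sets=2):
        sets, m = [[] for _ in range(num_sets)], 0
        for a in refs:
            s = sets[a % num_sets]
            if a in s:
                s.remove(a)          # refresh LRU order on a hit
            else:
                m += 1
                if len(s) == 2:
                    s.pop(0)         # evict the least recently used way
            s.append(a)
        return m

    refs = [0, 4] * 4
    print(misses_direct(refs), misses_2way(refs))  # 8 misses vs 2 misses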

34 Two-Way Set-Associative Cache
[Diagram: a 32-word, word-addressable main memory mapped into a cache of 8 blocks organized as 4 sets of 2 ways, block size = 1 word. The memory address splits into a tag and a 2-bit set index; within the selected set, the LRU block is replaced on a miss.]

35 Miss Rate: Two-Way Set-Associative Cache
Memory references to addresses: 0, 8, 0, 6, 8, 16 (cache of 8 blocks, 2 ways per set, block size = 1 word).
1. 0: miss; 2. 8: miss (0 and 8 map to the same set but co-exist in its two ways); 3. 0: hit; 4. 6: miss; 5. 8: hit; 6. 16: miss (replaces the LRU block of its set).
6 references, 4 misses.

36 Miss Rate: Direct-Mapped Cache
Memory references to addresses: 0, 8, 0, 6, 8, 16 (cache of 8 blocks, block size = 1 word).
0, 8, and 16 all map to index 0, so they keep evicting each other: 1. 0: miss; 2. 8: miss; 3. 0: miss; 4. 6: miss; 5. 8: miss; 6. 16: miss.
6 references, 6 misses.

37 Two-Way Set-Associative Cache
[Diagram: cache size 8 words, block size = 1 word, for a 32-word byte-addressable memory. The 7-bit address b6..b0 splits into a 3-bit tag, a 2-bit set index, and a 2-bit byte offset. Each of the 4 sets holds two ways of (valid, tag, data); two comparators check both tags in parallel, producing 1 = hit, 0 = miss, and a 2-to-1 MUX selects the data from the matching way.]

38 Fully-Associative Cache (8-Way Set Associative)
[Diagram: a 32-word, word-addressable main memory mapped into a cache of 8 blocks, block size = 1 word. There is no index: the whole address is the tag, a block can be placed anywhere in the cache, and the LRU block is replaced on a miss.]

39 Miss Rate: Fully-Associative Cache
Memory references to addresses: 0, 8, 0, 6, 8, 16 (cache of 8 blocks, block size = 1 word).
1. 0: miss; 2. 8: miss; 3. 0: hit; 4. 6: miss; 5. 8: hit; 6. 16: miss. Since any block can go anywhere, no evictions are needed for this string.
6 references, 4 misses.

40 Eight-Way Set-Associative Cache
[Diagram: cache size 8 words, block size = 1 word, for a 32-word byte-addressable memory. The 7-bit address b6..b0 splits into a 5-bit tag (b6-b2) and a 2-bit byte offset (b1-b0). All 8 (valid, tag, data) entries are compared in parallel; 1 = hit, 0 = miss, and an 8-to-1 multiplexer selects the matching data.]

41 Handling Cache Hits
Read hits (Instruction-cache and Data-cache, I$ and D$): this is what we want!
Write hits (D$ only) have two possible solutions:
1. Allow cache and memory to be inconsistent (write-back): write the data only into the cache block, and write the cache contents back to the next level in the memory hierarchy when that cache block is evicted. This needs a dirty bit for each data cache block, to tell whether it must be written back to memory when evicted.
2. Require the cache and memory to be consistent (write-through): always write the data into both the cache block and the next level in the memory hierarchy, so no dirty bit is needed. But writing to the next level in the memory hierarchy is very slow, so use a write buffer; a stall is required only if the write buffer is full.
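A hedged sketch of the two write-hit policies (the dict-based cache model and helper names are ours, not the slide's):

    cache, dirty, write_buffer = {}, set(), []

    def write_hit_write_back(block, data):
        cache[block] = data        # update only the cache...
        dirty.add(block)           # ...and mark the block for write-back on eviction

    def write_hit_write_through(block, data):
        cache[block] = data        # update the cache...
        write_buffer.append((block, data))  # ...and memory, via this FIFO buffer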

42 Write Buffer for Write-Through Caching
[Diagram: Processor -> Cache -> DRAM, with a write buffer between the cache and main memory.]
The processor writes data into the cache and the write buffer; the memory controller writes the contents of the write buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4.
The memory system designer's nightmare: the rate at which writes are generated can exceed the rate at which the memory can accept them. This can happen when writes occur in bursts. One solution is to use a write-back cache; another is to use an L2 cache.

43 Handling Cache Misses
Read misses (I$ and D$): stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then resume the pipeline.

44 Handling Cache Misses (cont.)
Write misses (D$ only):
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume.
Either of the two write-miss policies below can be used with write-through or write-back:
2. Write allocate: a cache block is allocated on a write miss. Just write the word into the cache, updating both the tag and the data; no need to check for a cache hit, no need to stall. Normally used in write-back caches, hoping that subsequent writes to the block will be captured by the cache.
3. No-write allocate: skip the cache write; the block is modified only in the lower-level memory. Just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full. The cache block must be invalidated, since it would otherwise be inconsistent (holding stale data). Normally used in write-through caches with a write buffer, because subsequent writes to the block must still write through to the lower-level memory.
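A matching self-contained sketch of the two write-miss policies (the dict `memory` stands in for the next level; names are ours):

    cache, write_buffer, memory = {}, [], {}

    def write_miss_allocate(block, data):
        # Write allocate: install the block in the cache, then write into it.
        cache[block] = memory.get(block)   # fetch (may first evict a dirty block)
        cache[block] = data

    def write_miss_no_allocate(block, data):
        # No-write allocate: bypass the cache; modify only the lower level.
        write_buffer.append((block, data))  # drained to memory by the controller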

45 Cache Misses
Compulsory (cold start or process migration; first reference): the first access to a block. A cold fact of life; not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant.
Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity. (A fully associative cache allows a memory block to be mapped to any cache block; in a direct-mapped cache, a memory block maps to exactly one cache block.)
Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size.
There is some ambiguity between conflict and capacity misses.

46 Miss Rate vs Block Size vs Cache Size
[Graph: miss rate (%) vs block size (bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB; for each cache size, the miss rate eventually rises as the block size grows. Temporal locality is compromised; details on the next slide.]
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-size cache is smaller (increasing capacity misses).

47 Block Size Tradeoff
Larger block sizes take advantage of spatial locality, but:
- If the block size is too big relative to the cache size, the miss rate will go up, because the number of blocks is fewer, which compromises temporal locality.
- A larger block size means a larger miss penalty: latency to the first word in the block + transfer time for the remaining words.
[Graphs vs block size: miss rate first falls (exploits spatial locality, reduced compulsory misses) then rises (fewer blocks compromise temporal locality, increased conflict misses); miss penalty grows; average access time eventually increases from the combined miss penalty and miss rate.]
In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
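The average-access-time formula as a one-line sketch, with illustrative (not slide-given) numbers:

    def amat(hit_time, miss_rate, miss_penalty):
        # Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
        return hit_time + miss_rate * miss_penalty

    print(amat(1, 0.05, 50))    # small blocks:  1 + 0.05*50  = 3.5 cycles
    print(amat(1, 0.03, 100))   # bigger blocks: lower miss rate, higher penalty = 4.0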

48 Block Size Tradeoff (cont.)
Larger blocks reduce compulsory misses by exploiting spatial locality, but past a point the increasing capacity misses dominate.
Larger blocks also mean fewer blocks in the cache, which compromises temporal locality and increases conflict misses.

49 Multiword Block Considerations
Read misses (I$ and D$): processed the same as for single-word blocks, except that a miss returns the entire block from memory, so the miss penalty grows as the block size grows. Mitigations:
- Requested word first: the requested word is transferred from the memory to the cache (and datapath) first.
- Early restart: the datapath resumes execution as soon as the requested word of the block is returned.
- Nonblocking cache: allows the datapath to continue to access the cache while the cache is handling an earlier miss.
Write misses (D$): can't use write allocate as-is. Recall: write allocate just writes the word into the cache, updating both the tag and data, with no need to check for a cache hit and no need to stall. With multiword blocks this would end up with a garbled block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so the block must be fetched from memory first, paying the stall time.

50 Cache Summary
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time.
- Spatial Locality: locality in space.
Three major categories of cache misses:
- Compulsory misses: sad facts of life; example: cold-start misses.
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity misses: increase cache size.
Cache design space: total size, block size, associativity (replacement policy); write-hit policy (write-through, write-back); write-miss policy (write allocate, no-write allocate (i.e., write buffers)).

51 Basic Cache Design: I-Cache Example
A cache is organized into blocks or lines. Block contents:
- tag: extra bits that identify the block (part of the block address).
- data: data or instruction words from contiguous memory locations.
Our example: one-word (4-byte) block size, 30-bit tag, two blocks in the cache (b0 and b1). The 2 LSBs of an address are the byte offset; the remaining bits are the tag bits.
[Diagram: a two-block cache (b0, b1), each entry holding a tag and one data word, in front of a main memory with instruction words at 0x00, 0x04, 0x08, 0x0C, ...]

52 I-Cache Example (2)
Program (in main memory):
0x00 L: add r1,r1,r2
0x04 bne r4,r1,L
0x08 sub r1,r1,r1
0x0C L: j L
Assume: r1==0, r2==1, r4==2; 1 cycle for a cache access; 4 cycles for a main-memory access; 1 cycle for instruction execution.
At cycle 1, PC=0x00: fetch the instruction from memory; look in the (empty) cache (1 cycle): MISS, so fetch from main memory (4-cycle penalty).

53 I-Cache Example (3)
Cycles 1-5: FETCH 0x00. At cycle 6: execute add r1,r1,r2, so r1 = 1. Cache block b0 now holds the add from 0x00.

54 I-Cache Example (4)
At cycle 6, PC=0x04: fetch the instruction from memory; look in the cache (1 cycle): MISS, so fetch from main memory (4-cycle penalty, cycles 6-10).

55 I-Cache Example (5)
Cycles 6-10: FETCH 0x04. At cycle 11: execute bne r4,r1,L; since r4==2 and r1==1, the branch is taken back to L (0x00). Cache block b1 now holds the bne from 0x04.

56 I-Cache Example (6)
At cycle 11, PC=0x00: fetch the instruction from memory: HIT, the instruction (add) is in the cache.

57 I-Cache Example (7)
Cycle 11: FETCH 0x00 (hit). At cycle 12: execute add r1,r1,r2, so r1 = 2.

58 I-Cache Example (8)
At cycle 12, PC=0x04: fetch the instruction from memory: HIT, the instruction (bne) is in the cache.

59 I-Cache Example (9)
At cycle 13: execute bne r4,r1,L; since r4==2 and r1==2, the branch is not taken.

60 I-Cache Example (10)
At cycle 13, PC=0x08: fetch the instruction from memory: MISS, not in the cache; fetch from main memory (cycles 13-17).

61 I-Cache Example (11)
At cycle 17, PC=0x08: put the fetched instruction (sub r1,r1,r1) into the cache, replacing the existing add in block b0.

62 I-Cache Example (12)
At cycle 18: execute sub r1,r1,r1, so r1 = 0.

63 I-Cache Example (13)
At cycle 18, PC=0x0C: fetch the instruction from memory: MISS, not in the cache; fetch from main memory (cycles 18-22).

64 I-Cache Example (14)
At cycle 22: put the fetched instruction (j L) into the cache, replacing the existing bne in block b1.

65 I-Cache Example (15)
At cycle 23: execute j L.
Full trace: 1-5 FETCH 0x00; 6 add (r1=1); 6-10 FETCH 0x04; 11 bne (taken); 11 FETCH 0x00 (hit); 12 add (r1=2); 12 FETCH 0x04 (hit); 13 bne (not taken); 13-17 FETCH 0x08; 18 sub (r1=0); 18-22 FETCH 0x0C; 23 j L.

66 Compare No-Cache vs. Cache
NO CACHE (31 cycles):
1-5 FETCH 0x00; 6 add; 6-10 FETCH 0x04; 11 bne; 11-15 FETCH 0x00; 16 add; 16-20 FETCH 0x04; 21 bne; 21-25 FETCH 0x08; 26 sub; 26-30 FETCH 0x0C; 31 j L.
CACHE (23 cycles; misses marked M, hits H):
1-5 FETCH 0x00 (M); 6 add; 6-10 FETCH 0x04 (M); 11 bne; 11 FETCH 0x00 (H); 12 add; 12 FETCH 0x04 (H); 13 bne; 13-17 FETCH 0x08 (M); 18 sub; 18-22 FETCH 0x0C (M); 23 j L.
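A compact sketch that reproduces these two totals, under the example's assumptions: 1-cycle cache check, 4-cycle memory penalty, block index taken from bit 2 of the address, and each execute overlapping the first cycle of the next fetch (so only the final execute adds a cycle):

    def run(trace, use_cache=True):
        cache, total = {}, 0
        for pc in trace:
            idx = (pc >> 2) & 1                  # which of the two blocks
            if use_cache and cache.get(idx) == pc:
                total += 1                       # hit: 1-cycle cache access
            else:
                total += 5                       # miss: 1-cycle check + 4-cycle memory
                if use_cache:
                    cache[idx] = pc
        return total + 1                         # the last execute overlaps no fetch

    trace = [0x0, 0x4, 0x0, 0x4, 0x8, 0xC]       # fetch order from the example
    print(run(trace, False), run(trace, True))   # 31 vs 23 cycles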

67 Cache Miss and the MIPS Pipeline
[Diagram: pipeline registers PC, IR, IRex, A, B, IRm, IRwb; an I-cache miss injects invalid entries (bubbles) into the pipeline while the missing instruction is fetched; a D-cache miss similarly stalls the later stages.]

68 Cache Miss and the MIPS Pipeline
Instruction fetch: the cache compare happens in cycle 1 and the miss is detected in cycle 2; the pipeline stalls for N cycles (N = # cycles for accessing main memory) until the fetch completes and the pipeline restarts.
[Diagram: IF EX MEM WB stages across clock cycles 1, 2+N, 3+N, 4+N, 5+N, 6+N, with STALL bubbles filling the gap.]

69 Cache Miss and the MIPS Pipeline
Load instruction: the cache compare happens in cycle 4 and the miss is detected in cycle 5; the pipeline stalls until the load completes at cycle 5+N and the pipeline restarts.
[Diagram: IF EX MEM WB stages across clock cycles 1-4, then STALLs from cycle 5 to 5+N, resuming at 6+N.]

70 Improving Cache Performance

71 Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
Increasing distance from the processor means increasing access time: Processor -> L1$ (4-8 bytes, word) -> L2$ (8-32 bytes, block) -> Main Memory (1 to 4 blocks) -> Secondary Memory (1,024+ bytes, disk sector = page).
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory. The (relative) size of the memory grows at each level.

72 Review: Principle of Locality
Temporal Locality: keep most recently accessed data items closer to the processor.
Spatial Locality: move blocks consisting of contiguous words to the upper levels.
Hit Time << Miss Penalty.
Hit: data appears in some block in the upper level (block X). Hit Rate: the fraction of accesses found in the upper level. Hit Time: RAM access time + time to determine hit/miss.
Miss: data needs to be retrieved from a lower-level block (block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to replace a block in the upper level with a block from the lower level + time to deliver this block's word to the processor.
Miss types: compulsory, conflict, capacity.

73 Measuring Cache Performance
Assuming cache hit costs are included as part of the normal CPU execution cycle, then:
CPU time = InstrCnt x CPI x ClkCycleTime = IC x (CPI_ideal + Memory-stall cycles per instruction) x CCT, where CPI_ideal plus the memory stalls is CPI_stall.
Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
Memory-stall cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = reads/program x read miss rate x read miss penalty
Write-stall cycles = (writes/program x write miss rate x write miss penalty) + write buffer stalls
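These formulas as a direct sketch (the per-program counts below are illustrative assumptions, not slide values):

    def memory_stall_cycles(reads, read_mr, read_pen,
                            writes, write_mr, write_pen, wbuf_stalls=0):
        read_stalls = reads * read_mr * read_pen
        write_stalls = writes * write_mr * write_pen + wbuf_stalls
        return read_stalls + write_stalls

    # e.g., 1e6 reads at 2% miss rate / 100-cycle penalty, 5e5 writes at 4% / 100:
    print(memory_stall_cycles(1e6, 0.02, 100, 5e5, 0.04, 100))  # 4,000,000.0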

74 Measuring Cache Performance (cont.)
For write-through caches, the penalties of a read miss and a write miss are the same: both equal the time to fetch the block from memory. If write buffer stalls are assumed to be negligible, we can combine reads and writes using a single miss rate and miss penalty:
Memory-stall cycles = Memory accesses/program x Miss rate x Miss penalty = Instructions/program x Misses/instruction x Miss penalty

75 Review: The Memory Wall
The logic vs DRAM speed gap continues to grow.
[Graph: clocks per instruction (Core) falling while clocks per DRAM access (Memory) rise, from VAX/1980 to PPro/1997.]

76 Impacts of Cache Performance
The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI); memory speed is unlikely to improve as fast as the processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in the processor clock cycles needed to handle a miss. The lower the CPI_ideal, the more pronounced the impact of stalls.
Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates.
Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44
So CPI_stall = 2 + 3.44 = 5.44

77 Impacts of Cache Performance (cont.)
What happens if the processor is made faster but the memory is not? Memory stalls take up an increasing fraction of execution time.
What if CPI_ideal is reduced to 1, without changing the clock rate (possible with an improved pipeline)?
CPI_stall = 1 + 3.44 = 4.44
The execution time spent on memory stalls increases from 63% (3.44/5.44) to 77% (3.44/4.44).

78 Impacts of Cache Performance (cont.)
What if the processor clock rate is doubled (i.e., doubling the miss penalty), with the memory unchanged? The performance loss due to cache misses increases.
Example: the miss penalty doubles to 100 x 2 = 200 cycles, because the memory still takes the same time, but that time is now twice as many CPU cycles (the CPU is twice as fast).
CPI_stall = 2 + (2% x 200) + (36% x 4% x 200) = 8.88
Relative performance = performance_fast_clk / performance_slow_clk = 5.44 / (8.88 x 1/2) = 1.23
This is smaller than 2, the factor by which the processor clock rate increased; i.e., although the clock rate is doubled, performance only increases 1.23 times due to cache misses.
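The three scenarios of the last three slides, computed in one sketch:

    def cpi_stall(cpi_ideal, i_miss, d_miss, ldst_frac, penalty):
        return cpi_ideal + i_miss * penalty + ldst_frac * d_miss * penalty

    base   = cpi_stall(2, 0.02, 0.04, 0.36, 100)  # 5.44
    better = cpi_stall(1, 0.02, 0.04, 0.36, 100)  # 4.44 (improved pipeline)
    fast   = cpi_stall(2, 0.02, 0.04, 0.36, 200)  # 8.88 (2x clock, 2x penalty)
    print(base, better, fast, base / (fast / 2))  # ..., ~1.23x speedup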

79 Improving Cache Performance
CPU time = InstrCnt x ClkCycleTime x (CPI_ideal + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty) = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
1. Reduce the time to hit in the cache (details later).
2. Reduce the miss rate: allow more flexible block placement; use multiple levels of caches (more details later).
3. Reduce the miss penalty (details later).

80 Reduce the Time to Hit in the Cache (#1)
Direct-mapped cache; smaller cache; smaller blocks.
Cache read: overlap the tag comparison (TC) and the data access; discard the data if the tag mismatches.
Cache write: pipeline the write-hit stages (TC and Write).
- Write buffer for a write-through cache: between the processor and main memory, so writes go around the cache.
- Store buffer for a write-back cache: between the processor and the cache, to allow a one-cycle cache store operation by pipelining the tag check (checking for a hit) and the data access (storing).

81 Reducing Cache Miss Rates (#2): Allow More Flexible Block Placement
In a direct-mapped cache, a memory block maps to exactly one cache block. At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache.
A compromise is to divide the cache into sets, each consisting of n ways (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
Set # of a block = (block address) modulo (# sets in the cache)

82 Reducing Cache Miss Rates (#2): Use Multiple Levels of Caches
With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of caches: normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache. L2/L3 can be on-chip or off-chip SRAMs, faster than main-memory DRAMs.
Upon a miss in the primary L1 cache, the L2 cache is accessed. If L2 contains the desired data, the miss penalty for L1 is the access time of the L2 cache, << that of main memory. Furthermore, the L2 cache is not tied to the CPU clock rate (just like the concept of the write buffer).
If neither L1 nor L2 contains the data, a main memory access is required and a larger miss penalty is incurred: L2 access time + MM access time. However, such a combined L1&L2 miss rate << the miss rate of L1 alone.

83 Reducing Miss Penalty Using Multilevel Caches
The following example demonstrates the significance of the performance improvement from using an L2 cache.
Example: CPI_ideal of 2, 100-cycle miss penalty (to main memory), 36% load/stores, 2% (4%) L1 Instr$ (D$) miss rates; add an L2$ that has a 25-cycle miss penalty and a 0.5% miss rate to main memory.
CPI_stall = 2 + 2% x 25 + 36% x 4% x 25 + 0.5% x 100 + 36% x 0.5% x 100
= 2 + (2% - 0.5%) x 25 + 36% x (4% - 0.5%) x 25 + 0.5% x (25 + 100) + 36% x 0.5% x (25 + 100)
= 3.54 (as compared to 5.44 with no L2$)
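The same computation as a sketch (the 0.5% figures are global miss rates, per the next slide):

    cpi_ideal, pen_l2, pen_mm = 2, 25, 100
    i_l1, d_l1, l2_glob, ldst = 0.02, 0.04, 0.005, 0.36

    cpi = (cpi_ideal
           + i_l1 * pen_l2 + ldst * d_l1 * pen_l2         # L1 misses served by L2
           + l2_glob * pen_mm + ldst * l2_glob * pen_mm)  # L2 misses go to memory
    print(round(cpi, 2))                                  # 3.54, vs 5.44 with no L2$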

84 Multilevel Cache Design Considerations
There are very different design considerations for L1 and L2.
The primary cache (L1) should focus on minimizing hit time, in support of a shorter clock cycle: smaller cache size with smaller block sizes. Smaller blocks mean a higher miss rate, but the miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so L1 can be smaller (i.e., faster) even with a higher miss rate.
Secondary cache(s) (e.g., L2) should focus on reducing the miss rate, to reduce the penalty of long main-memory access times: larger cache size with larger block sizes. Larger blocks mean a slower access (hit) time, but for the L2 cache hit time is less important than miss rate, because the L2$ hit time affects the L1$'s miss penalty rather than the L1$ hit time or the processor cycle time.

85 Multilevel Cache Design Considerations (cont.)
Global miss rate: the fraction of references that miss in all levels of a multilevel cache.
Local miss rate: the fraction of references to one level of a cache that miss; used in multilevel hierarchies. For example, the L2$ local miss rate = all misses in the L2$ divided by the number of accesses to the L2$. In our previous example, 0.5% / 2% = 25%.
The L2$ local miss rate >> the global miss rate, because the primary cache (L1) filters accesses, especially those with good spatial and temporal locality. Luckily, the global miss rate dictates how often we must access main memory.

86 Reduce the Miss Penalty (#3)
- Smaller blocks.
- Use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading.
- Check the write buffer (and/or a victim cache) on a read miss: you may get lucky.
- For large blocks, fetch the critical word first.
- Use multiple cache levels: the L2 cache is not tied to the CPU clock rate.
- Faster backing store / improved memory bandwidth: wider buses; memory interleaving, page-mode DRAMs.

87 Key Cache Design Parameters

                           | L1 typical   | L2 typical
Total size (blocks)        | 250 to 2,000 | 4,000 to 250,000
Total size (KB)            | 16 to 64     | 500 to 8,000
Block size (B)             | 32 to 64     | 32 to 128
Miss penalty (clocks)      | 10 to 25     | 100 to 1,000
Miss rates (global for L2) | 2% to 5%     | 0.1% to 2%

88 Two Machines' Cache Parameters

                 | Intel P4                               | AMD Opteron
L1 organization  | Split I$ and D$                        | Split I$ and D$
L1 cache size    | 8KB for D$, 96KB for trace cache (~I$) | 64KB for each of I$ and D$
L1 block size    | 64 bytes                               | 64 bytes
L1 associativity | 4-way set assoc.                       | 2-way set assoc.
L1 replacement   | ~LRU                                   | LRU
L1 write policy  | write-through                          | write-back
L2 organization  | Unified                                | Unified
L2 cache size    | 512KB                                  | 1024KB (1MB)
L2 block size    | 128 bytes                              | 64 bytes
L2 associativity | 8-way set assoc.                       | 16-way set assoc.
L2 replacement   | ~LRU                                   | ~LRU
L2 write policy  | write-back                             | write-back


3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University

Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27

More information

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor)

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) Chapter 7-1 Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授 V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) 臺大電機吳安宇教授 - 計算機結構 1 Outline 7.1 Introduction 7.2 The Basics of Caches

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Lecture 17 Introduction to Memory Hierarchies" Why it s important " Fundamental lesson(s)" Suggested reading:" (HP Chapter

Lecture 17 Introduction to Memory Hierarchies Why it s important  Fundamental lesson(s) Suggested reading: (HP Chapter Processor components" Multicore processors and programming" Processor comparison" vs." Lecture 17 Introduction to Memory Hierarchies" CSE 30321" Suggested reading:" (HP Chapter 5.1-5.2)" Writing more "

More information

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced

More information

Transistor: Digital Building Blocks

Transistor: Digital Building Blocks Final Exam Review Transistor: Digital Building Blocks Logically, each transistor acts as a switch Combined to implement logic functions (gates) AND, OR, NOT Combined to build higher-level structures Multiplexer,

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Bernhard Boser & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/24/16 Fall 2016 - Lecture #16 1 Software

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

Memory Hierarchy: The motivation

Memory Hierarchy: The motivation Memory Hierarchy: The motivation The gap between CPU performance and main memory has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures)

CS 61C: Great Ideas in Computer Architecture (Machine Structures) CS 6C: Great Ideas in Computer Architecture (Machine Structures) Instructors: Randy H Katz David A PaHerson hhp://insteecsberkeleyedu/~cs6c/fa Direct Mapped (contnued) - Interface CharacterisTcs of the

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!) 7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to

More information

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now?

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now? cps 14 memory.1 RW Fall 2 CPS11 Computer Organization and Programming Lecture 13 The System Robert Wagner Outline of Today s Lecture System the BIG Picture? Technology Technology DRAM A Real Life Example

More information

Caches. Han Wang CS 3410, Spring 2012 Computer Science Cornell University. See P&H 5.1, 5.2 (except writes)

Caches. Han Wang CS 3410, Spring 2012 Computer Science Cornell University. See P&H 5.1, 5.2 (except writes) Caches Han Wang CS 3410, Spring 2012 Computer Science Cornell University See P&H 5.1, 5.2 (except writes) This week: Announcements PA2 Work-in-progress submission Next six weeks: Two labs and two projects

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

CS 61C: Great Ideas in Computer Architecture. Cache Performance, Set Associative Caches

CS 61C: Great Ideas in Computer Architecture. Cache Performance, Set Associative Caches CS 61C: Great Ideas in Computer Architecture Cache Performance, Set Associative Caches Instructor: Justin Hsia 7/09/2012 Summer 2012 Lecture #12 1 Great Idea #3: Principle of Locality/ Memory Hierarchy

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Cache Architectures Design of Digital Circuits 217 Srdjan Capkun Onur Mutlu http://www.syssec.ethz.ch/education/digitaltechnik_17 Adapted from Digital Design and Computer Architecture, David Money Harris

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COEN-4710 Computer Hardware Lecture 7 Large and Fast: Exploiting Memory Hierarchy (Chapter 5) Cristinel Ababei Marquette University Department

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review CISC 662 Graduate Computer Architecture Lecture 6 - Cache and virtual memory review Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Agenda. Recap: Components of a Computer. Agenda. Recap: Cache Performance and Average Memory Access Time (AMAT) Recap: Typical Memory Hierarchy

Agenda. Recap: Components of a Computer. Agenda. Recap: Cache Performance and Average Memory Access Time (AMAT) Recap: Typical Memory Hierarchy // CS 6C: Great Ideas in Computer Architecture (Machine Structures) Set- Associa+ve Caches Instructors: Randy H Katz David A PaFerson hfp://insteecsberkeleyedu/~cs6c/fa Cache Recap Recap: Components of

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review Memories: Review Chapter 7 Large and Fast: Exploiting Hierarchy DRAM (Dynamic Random Access ): value is stored as a charge on capacitor that must be periodically refreshed, which is why it is called dynamic

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 24: Cache Performance Analysis Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Last time: Associative caches How do we

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies

Let!s go back to a course goal... Let!s go back to a course goal... Question? Lecture 22 Introduction to Memory Hierarchies 1 Lecture 22 Introduction to Memory Hierarchies Let!s go back to a course goal... At the end of the semester, you should be able to......describe the fundamental components required in a single core of

More information