Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy


Matilda Merritt
5 years ago

ENG3380 Computer Organization and Architecture
Cache Memory Part II
Winter 2017
S. Areibi, School of Engineering, University of Guelph

With thanks to W. Stallings, Hamacher, J. Hennessy, M. J. Irwin for lecture slide contents. Many slides adapted from the PPT slides accompanying the textbook and the CSE331 course.

Topics
- Memory Hierarchy
- Locality
- Cache Motivation
- Cache Principles
- Elements of Cache Design: Cache Addresses, Cache Size, Mapping Function, Replacement Algorithms, Write Policy, Line Size
- Summary

References
I. Computer Organization and Architecture: Designing for Performance, 10th edition, by William Stallings, Pearson.
II. Computer Organization and Design: The Hardware/Software Interface, 5th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann.
III. Computer Organization and Architecture: Themes and Variations, 2014, by Alan Clements, CENGAGE Learning.

Memory Hierarchy
- The design constraints on a computer memory can be summed up by three questions: (i) How much? (ii) How fast? (iii) How expensive?
- There is a tradeoff among these three key characteristics.
- A variety of technologies are used to implement memory systems.
- The dilemma facing the designer is clear: large capacity, fast, and low cost!
- Solution: employ a memory hierarchy.

[Figure: levels of the memory hierarchy and their technologies: registers (flip-flops), cache (static RAM), main memory (dynamic RAM), magnetic disk, removable media]

Chapter 5: Large and Fast: Exploiting Memory Hierarchy

Memory Hierarchy: Main Memory vs. Cache
As you go further from the processor, capacity and latency increase.

Level                          Capacity  Latency
Registers                      1KB       1 cycle
L1 data or instruction cache   32KB      2 cycles
L2 cache                       2MB       15 cycles
Main memory                    1GB       300 cycles
Disk                           largest   millions of cycles

[Figure: PC system organization: CPU with registers and a static RAM cache on the local CPU bus; DRAM main memory with its controller; PCI (Peripheral Component Interconnect) bus with co-processor, hard drive controller, video adaptor, and SCSI adaptor (SCSI bus); EISA/PCI bridge controller to the EISA PC bus with PC cards 1 to 3]

Memory Hierarchy Levels
- Block (aka line): unit of copying; may be multiple words
- If accessed data is present in the upper level:
  - Hit: access satisfied by the upper level
  - Hit ratio: hits/accesses
- If accessed data is absent:
  - Miss: block copied from the lower level
  - Time taken: miss penalty
  - Miss ratio: misses/accesses = 1 - hit ratio
  - Then accessed data is supplied from the upper level

How is the Hierarchy Managed?
- registers <-> memory: by the compiler (programmer?)
- cache <-> main memory: by the cache controller hardware
- main memory <-> disks:
  - by the operating system (virtual memory)
  - virtual to physical address mapping assisted by the hardware (TLB)
  - by the programmer (files)

Taking Advantage of Locality
- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory)
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (cache memory attached to the CPU)

Principle of Locality (Temporal)
- Programs access a small proportion of their address space at any time (locality in time)
- Temporal locality: items accessed recently are likely to be accessed again soon
- Keep the most recently accessed data items closer to the processor
- E.g., instructions in a loop, induction variables

Cache and Locality
- Why do caches work?
  - Temporal locality: if you used some data recently, you will likely use it again
  - Spatial locality: if you used some data recently, you will likely access its neighbors
- No hierarchy: average access time for data = 300 cycles
- 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles

  for (i = 0; i < N; i++) x[i] = x[i] + s;

Principle of Locality (Spatial)
- Programs access a small proportion of their address space at any time (locality in space)
- Spatial locality: items near those accessed recently are likely to be accessed soon
- Move blocks consisting of contiguous words to the upper levels
- E.g., sequential instruction access, scanning array data

Cache Motivation

  for (i = 0; i < N; i++) x[i] = x[i] + s;

[Figure: upper level holding Blk X and lower level holding Blk Y, with data moving to and from the processor]
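The average access time quoted on the locality slide can be checked with a few lines of Python. This is a minimal sketch, assuming a 1-cycle L1 hit, a 300-cycle main memory, and a miss that costs the L1 probe plus the memory access (301 cycles); the function name `average_access_time` is made up for illustration.

```python
def average_access_time(hit_rate, hit_cycles, miss_cycles):
    """Weighted average of the hit cost and the miss cost, in cycles."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Assumed figures: 1-cycle L1 hit, 300-cycle main memory, 95% hit rate.
no_hierarchy = 300
with_l1 = average_access_time(0.95, 1, 1 + 300)

print(no_hierarchy, round(with_l1, 2))  # 300 16.0
```

Even a small cache with a 95% hit rate cuts the average from 300 cycles to about 16, which is the point the slide makes about locality.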


A Typical Memory Hierarchy
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

[Figure: on-chip components (control, datapath, register file, ITLB, DTLB, instruction cache, data cache), second-level cache (SRAM), main memory (DRAM), secondary memory (disk). Moving away from the processor, speed (ns) gets slower by orders of magnitude, size (bytes) grows by orders of magnitude up to terabytes, and cost per byte goes from highest to lowest.]

Why Pipeline? For Throughput!
- To avoid a structural hazard, two caches are needed on-chip: one for instructions (I$) and one for data (D$).
- To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle.
- What happens when they can't do that?

[Figure: pipeline diagram, instruction order (Inst 0 to Inst 4) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg]

The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

[Figure: distance from the processor increases access time. Transfer units between levels: processor <-> L1$: 4-8 bytes (word); L1$ <-> L2$: 8-32 bytes (block); L2$ <-> main memory: 1 to 4 blocks; main memory <-> secondary memory: 1,024+ bytes (disk sector = page). The (relative) size of the memory grows at each level.]

- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory (MM), which is a subset of what is in secondary memory (SM).

Processor-Memory Performance Gap
- The processor vs. DRAM speed disparity continues to grow.
- Good memory hierarchy (cache) design is increasingly important to overall performance.

Terminology
- Hit: data is in some block in the upper level (Blk X)
  - Hit Rate: fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of SRAM access time + time to determine hit/miss
- Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to bring in a block from the lower level and replace a block in the upper level with it + time to deliver the block to the processor
  - Hit Time << Miss Penalty

[Figure: upper level holding Blk X and lower level holding Blk Y, with data moving to and from the processor]

Four Questions for Cache Design
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement strategy)
- Q4: What happens on a write? (Write strategy)

Cache Memory
- The level of the memory hierarchy closest to the CPU
- Given accesses X1, ..., Xn-1, Xn:
  - How do we know if the data is present?
  - Where do we look?

Cache: Block Placement
- Determined by associativity
- Direct mapped (1-way set associative): one choice for placement
- n-way set associative: n choices within a set
- Fully associative: any location
- Higher associativity reduces miss rate, but increases complexity, cost, and access time

Cache: Block Placement (Q1: Where can a block be placed in the upper level?)
Block 12 placed in an 8-block cache:
- Fully associative: anywhere
- Direct mapped: (12 mod 8) = 4
- 2-way set associative: (12 mod 4) = 0
- Set-associative mapping = block number modulo number of sets

Block Identification: Finding a Block

Associativity           Location method                                 Tag comparisons
Direct mapped           Index                                           1
n-way set associative   Set index, then search entries within the set   n
Fully associative       Search all entries                              #entries
Fully associative       Full lookup table                               0

- Hardware caches: reduce comparisons to reduce cost

Cache: Block Identification
- Use the lower part of the address as the index into the cache
- How do we know which requested block is in the cache (e.g., 9 or 13)?
- Store the block address as well as the data
- Actually, only the high-order bits are needed: these are called the tag
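The block placement rule (set index = block number modulo number of sets) can be sketched in a few lines of Python. This is only an illustration; `candidate_frames` and its parameter names are hypothetical, and cache frames are simply numbered 0 to 7 with consecutive frames grouped into sets.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the cache frames where a memory block may be placed.

    ways = 1 gives direct mapped, ways = num_frames gives fully associative.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets       # block number modulo number of sets
    first_frame = set_index * ways            # frames of a set are assumed adjacent
    return list(range(first_frame, first_frame + ways))


print(candidate_frames(12, 8, 1))  # direct mapped: [4]
print(candidate_frames(12, 8, 2))  # 2-way set associative: set 0 -> [0, 1]
print(candidate_frames(12, 8, 8))  # fully associative: all eight frames
```

The number of frames returned is also the number of tag comparisons a lookup needs, which is why higher associativity costs more hardware.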


Block Identification: Tags
- Every block has a tag in addition to data
- Tag: upper part of the address, the part that is not used to index the cache

Another Reference String Mapping
- Consider the main memory word reference string 0 4 0 4 0 4 0 4
- Start with an empty cache: all blocks initially marked as not valid
- Words 0 and 4 both map to cache block 00, so every access replaces the other word: Mem(0), Mem(4), Mem(0), Mem(4), ...
- 8 requests, 8 misses
- Ping pong effect due to conflict misses: two memory locations that map into the same cache block

Valid Bits
- What if there is no data in a location?
- Valid bit: 1 = present, 0 = not present
- Initially 0
- Each cache entry holds: valid bit, tag, data

Direct Mapped Cache Hit/Miss: Example
- Consider the main memory word reference string 0 1 2 3 4 3 4 15
- Start with an empty cache: all blocks initially marked as not valid

Access  Tag  Index  Result
0       00   00     miss
1       00   01     miss
2       00   10     miss
3       00   11     miss
4       01   00     miss (replaces Mem(0))
3       00   11     hit
4       01   00     hit
15      11   11     miss (replaces Mem(3))

- Final contents: Mem(4), Mem(1), Mem(2), Mem(15)
- 8 requests, 6 misses

Direct Mapped Cache
- Location determined by address
- Direct mapped: only one choice
- Index = (BlockAddress) modulo (#Blocks in cache)
- Example: Index = 9 mod 4 = 1
- If #Blocks (i.e., the number of entries in the cache) is a power of 2, then the modulo (i.e., the index) can be computed simply by using the low-order log2(cache size in blocks) bits of the address (log2(4) = 2)
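The two reference-string examples can be replayed with a small direct-mapped simulator. This is a sketch under stated assumptions: a 4-block cache with one word per block, word addresses used directly as block addresses, and the reference strings 0 1 2 3 4 3 4 15 and 0 4 0 4 0 4 0 4 as reconstructed from the hit/miss annotations. The name `simulate_direct_mapped` is hypothetical.

```python
def simulate_direct_mapped(references, num_blocks):
    """Replay word addresses through a direct-mapped cache; return hit/miss per access."""
    tags = [None] * num_blocks          # None stands for valid bit = 0
    results = []
    for address in references:
        index = address % num_blocks    # low-order bits select the cache block
        tag = address // num_blocks     # remaining high-order bits form the tag
        if tags[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            tags[index] = tag           # fetch the block, replacing the old one
    return results


mixed = simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15], 4)
ping_pong = simulate_direct_mapped([0, 4, 0, 4, 0, 4, 0, 4], 4)
print(mixed.count("miss"), ping_pong.count("miss"))  # 6 8
```

In the second string words 0 and 4 share index 0 and differ only in the tag, so each access evicts the block the next access needs.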

The Direct Mapped Cache
- Direct mapped: for each item of data at the lower level (main memory), there is exactly one location in the upper level (cache) where it might be, so lots of items at the lower level must share locations in the upper level
- Address mapping: (block address) modulo (# of blocks in the cache)
- Direct-mapped cache: each memory address maps to exactly one cache location

Caching: Example
- Q1: How do we find it? Use the next 2 low-order memory address bits (the index) to determine which cache block, i.e., (block address) modulo (# of blocks in the cache)
- Q2: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache
- The two low-order bits define the byte in the word (32b words)

[Figure: a 4-entry cache (index 00 to 11, each entry with valid bit, tag, and data) next to a 16-word main memory with addresses 0000xx to 1111xx]

Accessing the Cache
[Figure: a byte address split into tag, index, and offset; 8 sets: 3 index bits; offset: 3 bits (8-byte words); data array with 8 sets (blocks)]

Equations for DM Cache
- BlockAddress = ByteAddress / BytesPerBlock
- Index = BlockAddress % #Blocks
- Tag = BlockAddress / #Blocks

Larger Block Size
- A 64-block cache with 16 bytes/block
- To what block number does byte address 1200 map?
- Block address = byte address / block size = 1200/16 = 75
- Index (block number) = BlockAddress % #Blocks = 75 modulo 64 = 11
- Tag = BlockAddress / #Blocks = 75 / 64 = 1
- Address fields: Tag 22 bits | Index 6 bits | Offset 4 bits

The Tag Array
- Because each cache location can contain the contents of a number of different memory locations, a tag is added to every block to further identify the requested item.

[Figure: byte address split into tag, index bits, and offset (3 bits, 8-byte words); the tag array entry selected by the index is compared against the address tag; tag array and data array each have 8 sets]

Direct Mapped Cache Example
- A processor generates a sequence of byte addresses
- It has a direct-mapped (1-way set associative) cache with 4 sets (blocks)
- The set (block) size is 4 bytes
- For each access, is it a hit or a miss?
- Solution:
  - Compute the index: BlockAddress % #Blocks
  - Compute the tag: BlockAddress / #Blocks
  - Tabulate byte address, block address, index, and tag for each access
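The equations for a direct-mapped cache translate directly into integer arithmetic. The sketch below assumes the worked example uses byte address 1200 (the value consistent with 1200/16 = 75), a 64-block cache, 16-byte blocks, and 32-bit addresses; `split_byte_address` and `field_widths` are illustrative names, and both sizes are assumed to be powers of two.

```python
def split_byte_address(byte_address, num_blocks, bytes_per_block):
    """Apply the DM cache equations to one byte address."""
    block_address = byte_address // bytes_per_block
    return {
        "block_address": block_address,
        "index": block_address % num_blocks,
        "tag": block_address // num_blocks,
        "offset": byte_address % bytes_per_block,
    }


def field_widths(num_blocks, bytes_per_block, address_bits=32):
    """Bit widths of (tag, index, offset); sizes are assumed to be powers of two."""
    offset_bits = bytes_per_block.bit_length() - 1
    index_bits = num_blocks.bit_length() - 1
    return address_bits - index_bits - offset_bits, index_bits, offset_bits


print(split_byte_address(1200, 64, 16))
print(field_widths(64, 16))  # (22, 6, 4)
```

Because both sizes are powers of two, the divisions and remainders amount to slicing the address into its tag, index, and offset bit fields.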

[Figure: Direct Mapped Cache Example (#Blocks = 4), animation frames for the first five accesses. Index = BlockAddress % #Blocks, Tag = BlockAddress / #Blocks. Index 10 alternates between MemBlock[22] (tag 5) and MemBlock[26] (tag 6), then index 00 is loaded with MemBlock[16] (tag 4). Hit or miss so far: m m m m m]

[Figure: Direct Mapped Cache Example (#Blocks = 4), animation frames for the last three accesses. Index 11 is loaded with MemBlock[3] (tag 0), the next access hits, and MemBlock[18] (tag 4) then replaces the block at index 10. Final contents: MemBlock[16], MemBlock[18], MemBlock[3]. Hit or miss: m m m m m m h m; 8 requests, 7 misses]

Block Size Considerations
- Larger blocks should reduce miss rate, due to spatial locality
- But in a fixed-sized cache:
  - Larger blocks mean fewer of them
  - More competition means increased miss rate
  - Larger blocks cause pollution
- Larger miss penalty (i.e., cost (time) of the transfer)
  - Can override the benefit of reduced miss rate
  - Early restart and critical-word-first can help
- Early restart: simply resume execution as soon as the requested word of the block is returned, rather than wait for the entire block.

Reduce Misses via Larger Block Size
- Increasing the cache size decreases miss rate
- Increasing block size lowers miss rates. However, the miss rate may go up eventually if the block size becomes a significant fraction of the cache size. Why?
- Because the number of blocks that can be held in the cache will become small, and there will be a great deal of competition for those blocks.

[Figure: miss rate (0% to 25%) versus block size (bytes), one curve per cache size, including 4K, 16K, 64K, and 256K]

Direct-Mapped Cache Size
The total number of bits needed for a cache is a function of (a) the cache size and (b) the address size, because the cache includes both the storage for the data and the tags.

For the following situation:
- 32-bit addresses
- A direct-mapped cache
- The cache size is $2^n$ blocks, so $n$ bits are used for the index
- The block size is $2^m$ words ($2^{m+2}$ bytes), so $m$ bits are used for the word within the block, and two bits are used for the byte part of the address

Address fields: TAG | Index ($n$ bits) | Block offset ($m$ bits) | Byte offset (2 bits)

The size of the tag field is $32 - (n + m + 2)$.

The total number of bits in a direct-mapped cache is $2^n \times (\text{block size} + \text{tag size} + \text{valid field size})$.

Since the block size is $2^m$ words (a word is 32 bits, i.e., $2^5$ bits, so a block is $2^{m+5}$ bits), and we need 1 bit for the valid field, the number of bits is:

$2^n \times (2^m \times 2^5 + (32 - n - m - 2) + 1) = 2^n \times (2^m \times 32 + 31 - n - m)$

Direct-Mapped Cache Size
- The total number of bits in a direct-mapped cache is #blocks x (block size + tag size + valid field size)
- Although this is the actual size in bits, the naming convention is to exclude the size of the tag and valid field and to count only the size of the data
- Each entry: valid bit, tag, data

DM Cache Size: Example
How many total bits are required for a DM cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?

Solution:
- 16 KiB is 4096 ($2^{12}$) words.
- With a block size of 4 words ($2^2$), there are 1024 ($2^{10}$) blocks.
- Each block has:
  - Data: 4 x 32 = 128 bits, plus
  - Tag: (32 - 10 - 2 - 2) = 18 bits, plus
  - Valid bit: 1 bit
- Thus, the total cache size is (128 + 18 + 1) x 1024 blocks = 150,528 bits (147 Kibit):

$2^{10} \times (4 \times 32 + (32 - 10 - 2 - 2) + 1) = 2^{10} \times 147 = 147$ Kibibits

- That is 18.4 KiB for a 16 KiB cache
- The total number of bits is about 1.15 times as many as needed for data storage alone!

Cache Access and Size
- This cache has 1024 entries. Each entry (block) is one word.
- Each word is 32 bits (4 bytes)
- Therefore:
  - 2 bits are used as the byte offset
  - 10 bits are used as the index
  - TAG = 32 - (10 + 2) = 20 bits
- The cache size in this case is 4 KiB

Cache Performance Metrics
- HitRate = #Hits / #Accesses
- MissRate = #Misses / #Accesses = 1 - HitRate
- HitTime = time for a hit
- MissPenalty = cost of a miss
- Average Memory Access Time (AMAT) = HitTime + MissRate x MissPenalty
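The size calculation can be reproduced mechanically. The sketch below assumes 32-bit words, a 32-bit address, one valid bit per block, and power-of-two sizes; the helper name `direct_mapped_size` is made up for this illustration.

```python
def direct_mapped_size(data_bytes, words_per_block, address_bits=32, word_bits=32):
    """Return (tag_bits, total_bits) for a direct-mapped cache with one valid bit per block."""
    bytes_per_block = words_per_block * word_bits // 8
    num_blocks = data_bytes // bytes_per_block
    index_bits = num_blocks.bit_length() - 1        # log2 of a power of two
    offset_bits = bytes_per_block.bit_length() - 1  # block offset plus byte offset
    tag_bits = address_bits - index_bits - offset_bits
    bits_per_block = words_per_block * word_bits + tag_bits + 1
    return tag_bits, num_blocks * bits_per_block


tag_bits, total_bits = direct_mapped_size(16 * 1024, 4)
print(tag_bits, total_bits)            # 18 150528
print(total_bits / 1024)               # 147.0 Kibit
print(total_bits / (16 * 1024 * 8))    # about 1.15 times the data bits alone
```

Calling `direct_mapped_size(4 * 1024, 1)` covers the 1024-entry, one-word-per-block cache and gives a 20-bit tag.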

Cache Miss Categories: the 3 Cs Model
- Compulsory: the first access to a block is always a miss
  - Also called cold start misses
  - Misses that occur even in an infinite-size cache
- Conflict: multiple memory locations mapped to the same cache location
  - Also called collision misses
  - All other misses
- Capacity: the cache cannot contain all the blocks needed; capacity misses occur due to blocks being discarded and later retrieved

Cache Example (more blocks)
- Recall the earlier cache with 4 blocks (8 requests, 7 misses)
- This cache has 8 blocks (instead of 4), 1 word/block, direct mapped
- Initial state: every entry (index 000 to 111) has V = N

Word addr  Binary addr  Hit/miss  Cache block
22         10110        Miss      110
26         11010        Miss      010
22         10110        Hit       110
26         11010        Hit       010

Cache state after these accesses:

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
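The effect of doubling the number of blocks can be replayed in a few lines. This sketch assumes the word-address trace 22, 26, 22, 26, 16, 3, 16, 18, which is the sequence consistent with the MemBlock labels and hit/miss patterns in the two examples; `trace_outcomes` is a hypothetical helper that prints one letter per access (m = miss, h = hit).

```python
def trace_outcomes(word_addresses, num_blocks):
    """Return a string with one letter per access: 'm' for miss, 'h' for hit."""
    stored_tags = {}                                  # index -> tag of the resident block
    outcome = []
    for address in word_addresses:
        index, tag = address % num_blocks, address // num_blocks
        outcome.append("h" if stored_tags.get(index) == tag else "m")
        stored_tags[index] = tag                      # a hit rewrites the same tag
    return "".join(outcome)


trace = [22, 26, 22, 26, 16, 3, 16, 18]               # assumed trace, see lead-in
print(trace_outcomes(trace, 4))  # mmmmmmhm -> 7 misses
print(trace_outcomes(trace, 8))  # mmhhmmhm -> 5 misses
```

With 4 blocks, 22 and 26 collide at index 10 and keep evicting each other; with 8 blocks they land at indices 110 and 010, so the repeated accesses hit.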

Cache Example (continued)

Word addr  Binary addr  Hit/miss  Cache block
16         10000        Miss      000
3          00011        Miss      011
16         10000        Hit       000
18         10010        Miss      010

Final cache state:

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

- 8 requests, 5 misses, vs. 8 requests with 7 misses for the 4-block cache

Address Subdivision
- A cache memory can hold 32 Kbytes. Data is transferred between main memory and the cache in blocks of 16 bytes each. The main memory consists of 512 Kbytes.
- Show the format of main memory addresses in a DM cache organization. Assume that addressing is done at the byte level.
- Solution:
  - Total address lines needed for 512 Kbytes: 19 bits
  - Number of blocks: 32 Kbytes / 16 bytes = 2048, so 11 bits for the index
  - Byte offset within a block: 16 bytes, so 4 bits for the word (byte offset)
  - Tag = 19 - 11 - 4 = 4 bits

Mapping Functions
- Use a small cache with 128 blocks of 16 words
- Use a main memory with 64K words (4K blocks)
- Word-addressable memory, so 16-bit address

Direct Mapping: Direct Mapped Cache Example
- Cache size = 1K words, one word/block

Address Subdivision & Cache Architecture
[Figure: a 32-bit address split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the index selects one entry (valid bit, 20-bit tag, 32-bit data); a comparator matches the stored tag against the address tag to produce Hit, and the entry supplies the 32-bit data]
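The address-subdivision exercise follows one recipe: count the address bits, then peel off offset and index bits. A minimal sketch, assuming byte addressing and power-of-two sizes; `address_format` is an illustrative name, not something from the slides.

```python
def address_format(memory_bytes, cache_bytes, block_bytes):
    """Return (address_bits, tag_bits, index_bits, offset_bits) for a direct-mapped cache."""
    address_bits = (memory_bytes - 1).bit_length()           # bits to address every byte
    offset_bits = (block_bytes - 1).bit_length()              # byte within a block
    index_bits = (cache_bytes // block_bytes - 1).bit_length()  # one index per cache block
    tag_bits = address_bits - index_bits - offset_bits
    return address_bits, tag_bits, index_bits, offset_bits


# 512 Kbyte main memory, 32 Kbyte cache, 16-byte blocks
print(address_format(512 * 1024, 32 * 1024, 16))   # (19, 4, 11, 4)

# 32-bit byte addresses, 1K-word cache, one 4-byte word per block
print(address_format(2 ** 32, 4 * 1024, 4))        # (32, 20, 10, 2)
```

The second call reproduces the 20-bit tag, 10-bit index, and 2-bit byte offset of the 1K-word direct-mapped example.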

Multiword Block Direct Mapped Cache
- Four words/block, cache size = 1K words
- What kind of locality are we taking advantage of?

[Figure: address split into tag, 8-bit index, block offset, and byte offset; the index selects an entry (valid bit, tag, four data words), the tag comparison produces Hit, and the block offset selects one 32-bit word]

Multiword Block Direct Mapped Cache
- Four words/block, cache size = 16K words

[Figure: address (showing bit positions) split into a 16-bit tag, index, block offset, and byte offset; 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; a mux driven by the block offset selects one 32-bit word]

Cache Misses
- On a cache hit, the CPU proceeds normally
- On a cache miss:
  - Stall the CPU pipeline
  - Fetch the block from the next level of the hierarchy
  - Instruction cache miss: restart instruction fetch
  - Data cache miss: complete data access

Cache Misses / Improving Performance
- Compulsory: first access to a block; a cold fact of life
- Conflict: multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Capacity: the cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- AMAT = HitTime + MissRate x MissPenalty
  - Reduce HitTime: small and simple cache
  - Reduce MissRate: larger block size, higher associativity
  - Reduce MissPenalty: multilevel caches, give priority to read misses

Miss & Hits Example: Intrinsity FastMATH
- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks x 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%
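The slides do not say how the 3.2% weighted average is formed. One plausible reading, stated here as an assumption, is that each cache's miss rate is weighted by its share of all memory accesses. Under that reading the implied share of data accesses can be back-solved, as this sketch shows; both function names are hypothetical.

```python
def combined_miss_rate(i_miss_rate, d_miss_rate, data_fraction):
    """Miss rate over all accesses when data_fraction of them go to the D-cache."""
    return (1.0 - data_fraction) * i_miss_rate + data_fraction * d_miss_rate


def implied_data_fraction(i_miss_rate, d_miss_rate, combined):
    """Invert combined_miss_rate for the data-access share."""
    return (combined - i_miss_rate) / (d_miss_rate - i_miss_rate)


# FastMATH miss rates in percent: I-cache 0.4, D-cache 11.4, weighted average 3.2
share = implied_data_fraction(0.4, 11.4, 3.2)
print(round(share, 3))                                   # 0.255
print(round(combined_miss_rate(0.4, 11.4, share), 2))    # 3.2
```

Under this assumption roughly a quarter of all cache accesses are data accesses, which is why the combined figure sits much closer to the I-cache rate.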

Example: Intrinsity FastMATH
[Figure: FastMATH cache organization]

Example Access Pattern
- Assume that addresses are 8 bits long
- How many of the following address requests are hits/misses?
  4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10
- Direct-mapped cache: each memory address maps to exactly one cache location

[Figure: byte address split into tag, index, and offset (8-byte words); the tag array entry is compared with the address tag; tag array and data array]

Summary
- The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life, e.g., cold start misses
  - Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!
  - Capacity misses: increase cache size

What's Next?
- Set Associative Caches
- Write and Replacement Policies

Memory Systems that Support Caches
The off-chip interconnect and memory architecture affect overall system performance dramatically.

[Figure: on-chip CPU and cache connected over a bus to main memory; 32-bit data and 32-bit address per cycle]

Assume:
1. 1 clock cycle (1 ns) to send the address from the cache to main memory
2. 50 ns (50 processor clock cycles) for DRAM first word access time, 10 ns (10 clock cycles) cycle time (remaining words in a burst for SDRAM)
3. 1 clock cycle (1 ns) to return a word of data from main memory to the cache

- Memory-bus-to-cache bandwidth: the number of bytes accessed from main memory and transferred to the cache/CPU per clock cycle
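The access-pattern question can be answered by replaying the addresses. The cache geometry is not restated on that slide, so this sketch assumes the one drawn in the earlier "Accessing the Cache" figure: direct mapped, 8 sets, 8-byte blocks (3 offset bits, 3 index bits, leaving a 2-bit tag for 8-bit addresses). It also assumes the garbled entries in the request list are 10. The name `count_hits_and_misses` is hypothetical.

```python
OFFSET_BITS = 3   # assumed: 8-byte blocks
INDEX_BITS = 3    # assumed: 8 sets


def count_hits_and_misses(byte_addresses):
    """Replay byte addresses through the assumed direct-mapped cache."""
    tag_array = [None] * (1 << INDEX_BITS)        # None means the entry is invalid
    hits = misses = 0
    for address in byte_addresses:
        index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = address >> (OFFSET_BITS + INDEX_BITS)
        if tag_array[index] == tag:
            hits += 1
        else:
            misses += 1
            tag_array[index] = tag                # bring the block in
    return hits, misses


requests = [4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10]
print(count_hits_and_misses(requests))  # (4, 9)
```

Addresses 7, 13, and 78 hit because of spatial locality within an 8-byte block; the second pass over 4 and 10 misses because 68 and 73 evicted those blocks in the meantime.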

One Word Wide Memory Organization
[Figure: on-chip CPU and cache connected over a bus to main memory]

If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
-  1 cycle to send the address
- 50 cycles to read DRAM
-  1 cycle to return the data
- 52 total clock cycles of miss penalty

The number of bytes transferred per clock cycle (bandwidth) for a miss is 4/52 = 0.077 bytes per clock.

Burst Memory Organization
[Figure: on-chip CPU and cache connected over a bus to main memory; DRAM timing of 50 cycles for the first word, then 10 cycles for each of the next three words]

What if the block size is four words and a (DDR) SDRAM is used?
-  1 cycle to send the 1st address
- 50 + 3 x 10 = 80 cycles to read DRAM
-  1 cycle to return the last data word
- 82 total clock cycles of miss penalty

The number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/82 = 0.195 bytes per clock.
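The two miss-penalty calculations share one formula, sketched below. The timing parameters are the ones assumed on the slides (1 cycle to send the address, 50 cycles for the first DRAM word, 10 cycles for each further word in a burst, 1 cycle to return data); `miss_penalty_cycles` and `bandwidth_bytes_per_cycle` are illustrative names.

```python
def miss_penalty_cycles(words_per_block, send=1, first_word=50, next_word=10, transfer=1):
    """Cycles to fetch one block over a one-word-wide bus with burst DRAM access."""
    return send + first_word + (words_per_block - 1) * next_word + transfer


def bandwidth_bytes_per_cycle(words_per_block, bytes_per_word=4):
    """Bytes delivered per clock cycle during a single miss."""
    return words_per_block * bytes_per_word / miss_penalty_cycles(words_per_block)


for words in (1, 4):
    print(words, miss_penalty_cycles(words), round(bandwidth_bytes_per_cycle(words), 3))
# 1 52 0.077
# 4 82 0.195
```

Fetching four words in a burst costs only 30 extra cycles, so the bandwidth per miss more than doubles compared with the one-word block.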
