ENG3380 Computer Organization and Architecture, Cache Memory Part II: Memory Hierarchy


ENG3380 Computer Organization and Architecture, Part II. Winter 2017. S. Areibi, School of Engineering, University of Guelph.

Topics
Memory hierarchy: locality, motivation, principles. Elements of cache design: addresses, size, mapping function, replacement algorithms, write policy, line size. Summary.

With thanks to W. Stallings, Hamacher, J. Hennessy, and M. J. Irwin for lecture slide contents. Many slides adapted from the PPT slides accompanying the textbook and the CSE331 course.

Memory Hierarchy
The design constraints on a computer's memory can be summed up by three questions: (i) how much? (ii) how fast? (iii) how expensive? There is a tradeoff among these three key characteristics, and a variety of technologies are used to implement the memory system. The dilemma facing the designer is clear: large capacity, fast, and low cost cannot all be had at once. The solution is to employ a memory hierarchy: registers (flip-flops), cache (static RAM), main memory (dynamic RAM), magnetic disk, and removable media.

References
I. Computer Organization and Architecture: Designing for Performance, 10th edition, by William Stallings, Pearson.
II. Computer Organization and Design: The Hardware/Software Interface, 5th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann.
III. Computer Organization and Architecture: Themes and Variations, 2014, by Alan Clements, CENGAGE Learning.

Memory Hierarchy: Main Memory vs. Cache
As you go further from the processor, capacity and latency increase:
Registers: 1 KB, 1 cycle
L1 data or instruction cache (static RAM): 32 KB, 2 cycles
L2 cache (static RAM): 2 MB, 15 cycles
Main memory (dynamic RAM): 1 GB, 300 cycles
Disk: 80 GB, 10M cycles

[Figure: a typical system interconnect. The CPU, with its registers and SRAM cache, sits on a local bus; a controller and the DRAM main memory attach via the PCI (Peripheral Component Interconnect) bus along with a co-processor; an EISA/PCI bridge connects the hard drive controller, video adaptor, SCSI adaptor (with its SCSI bus), and PC cards.]

Memory Hierarchy Levels
Block (aka line): the unit of copying between levels; it may be multiple words. If the accessed data is present in the upper level, the access is a hit, satisfied by the upper level; hit ratio = hits/accesses. If the accessed data is absent, the access is a miss: the block is copied from the lower level (the time taken is the miss penalty), and the accessed data is then supplied from the upper level. Miss ratio = misses/accesses = 1 - hit ratio.

How is the Hierarchy Managed?
Registers <-> memory: by the compiler (programmer?).
Cache <-> main memory: by the cache controller hardware.
Main memory <-> disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files).

Taking Advantage of Locality
Store everything on disk. Copy recently accessed (and nearby) items from disk to a smaller DRAM memory: the main memory. Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory: the cache memory attached to the CPU.

Principle of Locality (Temporal)
Programs access a small proportion of their address space at any time (locality in time). Temporal locality: items accessed recently are likely to be accessed again soon, so keep the most recently accessed data items closer to the processor. E.g., instructions in a loop, induction variables.

Caches and Locality
Why do caches work?
Temporal locality: if you used some data recently, you will likely use it again.
Spatial locality: if you used some data recently, you will likely access its neighbors.

for (i=0; i<1000; i++)
    x[i] = x[i] + s;

With no hierarchy, the average access time for data is 300 cycles. With a 32 KB, 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles.

Principle of Locality (Spatial)
Programs access a small proportion of their address space at any time (locality in space). Spatial locality: items near those accessed recently are likely to be accessed soon, so move blocks consisting of contiguous words to the upper levels. E.g., sequential instruction access, scanning an array.
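The x[i] = x[i] + s loop above exhibits both kinds of locality: s and i are reused every iteration (temporal), and successive x[i] are adjacent in memory (spatial). A minimal C sketch of the same idea (the array size and function names are illustrative, not from the slides): sweeping a row-major 2-D array in row order takes one miss per cache block, while sweeping it in column order can miss on nearly every access.

```c
#include <stdio.h>

#define N 1024
static double a[N][N];   /* 8 MB: far larger than any L1/L2 cache */

/* Good spatial locality: row-major sweep, stride-1 accesses,
 * so consecutive reads fall in the same cache block. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: column-major sweep strides N doubles (8 KB)
 * between accesses, touching a new cache block almost every time. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```

Both functions do identical arithmetic; only the access order, and hence the miss rate, differs.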

A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology. On-chip components: control, datapath, register file, ITLB, DTLB, instruction and data caches; off chip, a second-level cache (SRAM), main memory (DRAM), and secondary storage (disk). Speed (ns): 0.1's, 1's, 10's, 100's, 10,000's. Size (bytes): 100's, K's, 10K's, M's, T's. Cost per byte: highest at the register end, lowest at the disk end.

Why Pipeline? For Throughput!
To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$). [Figure: five overlapped instructions in the five-stage pipeline, each passing through I$, Reg, ALU, D$, Reg in successive clock cycles.] To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can't do that?

The Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology. Increasing distance from the processor means increasing access time: the processor exchanges 4-8 byte words with L1$, L1$ exchanges 8-32 byte blocks with L2$, L2$ exchanges 1 to 4 blocks with main memory, and main memory exchanges 1,024+ byte disk sectors (= pages) with secondary storage. The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory. The relative size of the memory grows at each level.

Processor-Memory Performance Gap
The processor vs. DRAM speed disparity continues to grow, so good memory hierarchy (cache) design is increasingly important to overall performance.

Terminology
Hit: the data is in some block in the upper level (Blk X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, consisting of the SRAM access time plus the time to determine hit/miss.
Miss: the data is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y). Miss rate = 1 - (hit rate). Miss penalty: the time to bring a block in from the lower level and replace a block in the upper level with it, plus the time to deliver the block to the processor.
Hit time << miss penalty.
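These terms combine into the standard average memory access time formula. A small sketch (the function name is mine) that reproduces the 16-cycle figure from the locality slide, using hit time 1 cycle, miss rate 5%, and a 300-cycle miss penalty; the slide's weighted form 0.95 x 1 + 0.05 x 301 gives the same 16 cycles.

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty (all times in cycles) */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* 1-cycle L1 with a 95% hit rate in front of a 300-cycle memory. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 300.0)); /* 16.0 */
    return 0;
}
```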

Four Questions for Cache Design
Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level? (block identification)
Q3: Which block should be replaced on a miss? (block replacement strategy)
Q4: What happens on a write? (write strategy)

The Cache
The cache is the level of the memory hierarchy closest to the CPU. Given accesses X_1, ..., X_{n-1}, X_n: how do we know if the data is present? Where do we look?

Q1: Block Placement
Placement is determined by associativity. Direct mapped (1-way set associative): one choice for placement. n-way set associative: n choices within a set. Fully associative: any location. Higher associativity reduces the miss rate but increases complexity, cost, and access time. Set-associative mapping = block number modulo the number of sets. Example: block 12 placed in an 8-block cache: direct mapped, (12 mod 8) = 4; 2-way set associative, (12 mod 4) = 0; fully associative, any of blocks 0-7.

Q2: Block Identification: Finding a Block
How a block is located depends on the associativity:
Direct mapped: location method is the index; 1 tag comparison.
n-way set associative: set index, then search the entries within the set; n tag comparisons.
Fully associative: search all entries (#entries tag comparisons), or a full lookup table (0 comparisons).
Hardware caches reduce comparisons to reduce cost: use the lower part of the address as the index into the cache. How do we know whether the requested block is in the cache (e.g., blocks 9 and 13 share an index in a 4-block cache)? Store the block address as well as the data; actually, only the high-order bits are needed, and these are called the tag.
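A sketch of the placement rule above (names are illustrative): the set is the block number modulo the number of sets, which collapses to one fixed location for a direct-mapped cache and to a single set covering the whole cache for a fully associative one.

```c
#include <stdio.h>

/* For an n-way set-associative cache a block may live anywhere in set
 *   set = block_number % num_sets,  where num_sets = num_blocks / ways.
 * Direct mapped is ways == 1; fully associative is ways == num_blocks. */
static unsigned placement_set(unsigned block, unsigned num_blocks, unsigned ways) {
    unsigned num_sets = num_blocks / ways;
    return block % num_sets;
}

int main(void) {
    /* Block 12 in an 8-block cache, as in the slide. */
    printf("direct mapped: set %u\n", placement_set(12, 8, 1)); /* 12 mod 8 = 4 */
    printf("2-way:         set %u\n", placement_set(12, 8, 2)); /* 12 mod 4 = 0 */
    printf("fully assoc.:  set %u\n", placement_set(12, 8, 8)); /* always set 0 */
    return 0;
}
```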

Block Identification: Tags
Every block has a tag in addition to its data. Tag: the upper part of the address, the part not used to index the cache.

Another Reference String Mapping
Consider the main memory word reference string 0, 4, 0, 4, 0, 4, 0, 4 on a 4-block direct-mapped cache. Start with an empty cache (all blocks initially marked not valid). Words 0 and 4 both map to index 00 with different tags, so each access evicts the other: 8 requests, 8 misses. This is the ping-pong effect due to conflict misses: two memory locations that map into the same cache block.

Valid Bits
What if there is no data in a location? A valid bit marks each entry: 1 = present, 0 = not present. Initially all valid bits are 0. [Figure: a table of valid bit, tag, and data columns for the cache entries.]

Direct Mapped Cache Hit/Miss: Example
Consider the main memory word reference string 0, 1, 2, 3, 4, 3, 4, 15 on a 4-block direct-mapped cache, starting with an empty cache (all blocks initially marked not valid):

0  (tag 0,  index 00): miss
1  (tag 0,  index 01): miss
2  (tag 0,  index 10): miss
3  (tag 0,  index 11): miss
4  (tag 1,  index 00): miss, replaces Mem(0)
3  (tag 0,  index 11): hit
4  (tag 1,  index 00): hit
15 (tag 11, index 11): miss, replaces Mem(3)

8 requests, 6 misses.

Direct Mapped Cache
Location is determined by the address; direct mapped means only one choice:
Index = (BlockAddress) modulo (#Blocks in cache), e.g., Index = 9 mod 4 = 1.
If #Blocks (the number of entries in the cache) is a power of 2, the modulo (i.e., the index) can be computed simply by using the low-order log2(cache size in blocks) bits of the address (log2(4) = 2).
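The hit/miss walkthrough above can be replayed mechanically. A minimal direct-mapped simulator sketch in C, assuming one word per block as on the slide (constant and variable names are mine):

```c
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 4  /* 4-block direct-mapped cache, 1 word per block */

int main(void) {
    unsigned tag[NBLOCKS];
    bool valid[NBLOCKS] = { false };
    unsigned refs[] = { 0, 1, 2, 3, 4, 3, 4, 15 };  /* word addresses */
    unsigned hits = 0, n = sizeof refs / sizeof refs[0];

    for (unsigned i = 0; i < n; i++) {
        unsigned index = refs[i] % NBLOCKS;  /* low-order bits  */
        unsigned t     = refs[i] / NBLOCKS;  /* high-order bits */
        if (valid[index] && tag[index] == t) {
            hits++;
            printf("%2u: hit\n", refs[i]);
        } else {
            valid[index] = true;             /* allocate on miss */
            tag[index]   = t;
            printf("%2u: miss\n", refs[i]);
        }
    }
    printf("%u requests, %u misses\n", n, n - hits);  /* 8 requests, 6 misses */
    return 0;
}
```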

The Direct Mapped Cache
Direct mapped: for each item of data at the lower level (main memory), there is exactly one location in the upper level (cache) where it might be, so many items at the lower level must share locations in the upper level. Address mapping: (block address) modulo (# of blocks in the cache).

Caching: Example
A four-block direct-mapped cache (index, valid, tag, data) in front of a 16-word main memory. Q1: How do we find it? The two low-order address bits define the byte in the word (32-bit words); the next two low-order address bits, the index, determine which cache block, i.e., (block address) modulo (# of blocks in the cache). Q2: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell whether the memory block is in the cache.

Accessing the Cache
Equations for a direct-mapped cache:
BlockAddress = ByteAddress / BytesPerBlock
Index = BlockAddress % #Blocks
For a cache with 8 sets of 8-byte blocks: 3 index bits and 3 offset bits. Direct-mapped: each address maps to a unique cache location.

Larger Block Size
A 64-block cache with 16 bytes/block. To what block number does byte address 1200 map?
BlockAddress = ByteAddress / BlockSize = 1200 / 16 = 75
Index (block number) = BlockAddress % #Blocks = 75 modulo 64 = 11
Tag = BlockAddress / #Blocks = 75 / 64 = 1
Address fields: tag = bits 31-10 (22 bits), index = bits 9-4 (6 bits), offset = bits 3-0 (4 bits).

The Tag Array
Because each cache location can contain the contents of a number of different memory locations, a tag is added to every block to further identify the requested item. The tag array is indexed with the same index bits as the data array, and its output is compared against the tag bits of the address.

Direct Mapped Example
A processor generates the byte addresses 88, 104, 88, 104, 64, 12, 64, 72. It has a direct-mapped (1-way set associative) cache with 4 sets (blocks); the set (block) size is 4 bytes. For each access, is it a hit or a miss?
Solution: compute Index = BlockAddress % #Blocks and Tag = BlockAddress / #Blocks for each address; the sequence is worked through below.
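A sketch of the address-splitting equations, checked against the byte-address-1200 example (the struct and function names are mine):

```c
#include <stdio.h>

/* Split a byte address for a direct-mapped cache with the slide's
 * geometry: 64 blocks of 16 bytes (6 index bits, 4 offset bits). */
struct fields { unsigned tag, index, offset; };

static struct fields split(unsigned byte_addr, unsigned bytes_per_block,
                           unsigned num_blocks) {
    unsigned block = byte_addr / bytes_per_block;
    struct fields f = {
        .tag    = block / num_blocks,
        .index  = block % num_blocks,
        .offset = byte_addr % bytes_per_block,
    };
    return f;
}

int main(void) {
    struct fields f = split(1200, 16, 64);
    /* Byte address 1200 -> block 75 -> tag 1, index 11, offset 0 */
    printf("tag=%u index=%u offset=%u\n", f.tag, f.index, f.offset);
    return 0;
}
```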

Direct Mapped Example (worked)
#Blocks = 4, block size = 4 bytes. For each byte address, compute BlockAddress = ByteAddress / 4, Index = BlockAddress % #Blocks, and Tag = BlockAddress / #Blocks:

Byte address:   88  104   88  104   64   12   64   72
Block address:  22   26   22   26   16    3   16   18
Index:           2    2    2    2    0    3    0    2
Tag:             5    6    5    6    4    0    4    4
Hit or miss?     m    m    m    m    m    m    h    m

The first four accesses ping-pong in set 2 between MemBlock[22] (tag 5) and MemBlock[26] (tag 6), missing every time. Address 64 installs MemBlock[16] (tag 4) in set 0 and address 12 installs MemBlock[3] (tag 0) in set 3, so the second access to 64 hits. The final access to 72 (tag 4, index 2) misses and replaces MemBlock[26]. Final contents: set 0 = MemBlock[16], set 2 = MemBlock[18], set 3 = MemBlock[3], set 1 invalid. 8 requests, 7 misses.

Block Size Considerations
Larger blocks should reduce the miss rate due to spatial locality. But in a fixed-size cache, larger blocks mean fewer of them, hence more competition and an increased miss rate; larger blocks can also mean pollution. Larger blocks carry a larger miss penalty (the cost in time of the transfer), which can override the benefit of the reduced miss rate. Early restart and critical-word-first can help. Early restart simply resumes execution as soon as the requested word of the block is returned, rather than waiting for the entire block.

Reduce Misses via Larger Block Size
Increasing the cache size decreases the miss rate. Increasing the block size also lowers the miss rate; however, the miss rate may eventually go back up if the block size becomes a significant fraction of the cache size. Why? Because the number of blocks that can be held in the cache becomes small, and there is a great deal of competition for those blocks. [Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K; larger blocks help at first, but the curves for the smaller caches turn back up at large block sizes.]

Direct-Mapped Cache Size
The total number of bits needed for a cache is a function of (a) the cache size and (b) the address size, because the cache includes storage for both the data and the tags. For the following situation:
32-bit addresses
A direct-mapped cache
The cache size is 2^n blocks, so n bits are used for the index
The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block and two bits for the byte part of the address
The size of the tag field is 32 - (n + m + 2).
The total number of bits in a direct-mapped cache is 2^n x (data (block) size + tag size + valid field size). Since the block size is 2^m words of 32 (= 2^5) bits each, i.e., 2^(m+5) bits, and we need 1 bit for the valid field, the number of bits is:
2^n x (2^m x 2^5 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m)
Although this is the actual size in bits, the naming convention is to exclude the tag and valid field and to count only the size of the data.

DM Cache Size: Example
How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?
Solution: 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks. Each block has:
Data: 4 x 32 = 128 bits, plus
Tag: 32 - 10 - 2 - 2 = 18 bits, plus
Valid bit: 1 bit.
Thus the total cache size is (128 + 18 + 1) x 1024 blocks = 150,528 bits:
2^10 x (4 x 32 + (32 - 10 - 2 - 2) + 1) = 2^10 x 147 = 147 Kibibits
or 18.4 KiB for a 16 KiB cache. The total number of bits is about 1.15 times as many as needed for the data storage alone!

Cache Performance Metrics
For a cache with 1024 entries, where each entry (block) is one 32-bit (4-byte) word: 2 bits are used as the byte offset, 10 bits as the index, and the tag is 32 - (10 + 2) = 20 bits. The cache size in this case is 4 KiB.
HitRate = #Hits / #Accesses
MissRate = #Misses / #Accesses = 1 - HitRate
HitTime = time for a hit
MissPenalty = cost of a miss
Average Memory Access Time (AMAT) = HitTime + MissRate x MissPenalty
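The bit-count formula can be evaluated directly. A sketch (the function name is mine) reproducing the 16 KiB example:

```c
#include <stdio.h>

/* Total bits in a direct-mapped cache with 2^n blocks of 2^m 32-bit words:
 *   2^n * (2^m * 32 + (32 - n - m - 2) + 1)    (data + tag + valid)     */
static unsigned long dm_cache_bits(unsigned n, unsigned m) {
    unsigned long data = (1UL << m) * 32;
    unsigned long tag  = 32 - n - m - 2;
    return (1UL << n) * (data + tag + 1);
}

int main(void) {
    /* 16 KiB of data in 4-word blocks: 2^12 words / 2^2 = 2^10 blocks. */
    unsigned long bits = dm_cache_bits(10, 2);
    printf("%lu bits = %.1f KiB\n", bits, bits / 8.0 / 1024.0);
    /* 150528 bits = 18.4 KiB, about 1.15x the 16 KiB of data storage. */
    return 0;
}
```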

Miss Categories: The 3 Cs Model
Compulsory: the first access to a block is always a miss. Also called cold-start misses; these are the misses that would occur even in an infinite-size cache.
Conflict: multiple memory locations map to the same cache location. Also called collision misses; all other misses.
Capacity: the cache cannot contain all the blocks needed. Capacity misses occur when blocks are discarded and later retrieved.

Example (more blocks)
Recall the earlier cache with 4 blocks (8 requests, 7 misses). This cache has 8 blocks (instead of 4), 1 word/block, direct mapped. Word address reference string: 22, 26, 22, 26, 16, 3, 16, 18. Start with an empty cache (all blocks initially marked not valid):

Word addr  Binary addr  Tag  Index  Hit/miss
22         10 110       10   110    miss
26         11 010       11   010    miss
22         10 110       10   110    hit
26         11 010       11   010    hit
16         10 000       10   000    miss
3          00 011       00   011    miss
16         10 000       10   000    hit
18         10 010       10   010    miss (replaces Mem[11010])

8 requests, 5 misses, versus 8 requests and 7 misses on the 4-block cache: the extra blocks eliminate the conflict misses.

Address Subdivision
A cache memory can hold 32 Kbytes. Data is transferred between main memory and the cache in blocks of 16 bytes each, and the main memory consists of 512 Kbytes. Show the format of main memory addresses in a direct-mapped organization, assuming addressing is done at the byte level.
Solution:
Total address lines needed for 512 Kbytes: 19 bits.
# of blocks: 32 Kbytes / 16 bytes = 2048 = 2^11, so 11 bits for the index.
Byte offset within a block: 16 bytes = 2^4, so 4 bits for the byte offset.
Tag = 19 - 11 - 4 = 4 bits. Address format: tag (4 bits) | index (11 bits) | offset (4 bits), 19 bits in all.

Mapping Functions
Use a small cache with 128 blocks of 16 words, and a main memory of 64K words (4K blocks). The memory is word-addressable, so addresses are 16 bits. Direct mapping works as above.

Direct Mapped Cache: Address Subdivision and Architecture
Cache size = 1K words, one word/block. [Figure: the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); the index selects one of 1024 entries (0, 1, 2, ..., 1021, 1022, 1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; a comparator checks the stored tag against the address tag to generate Hit.]
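A sketch computing the field widths from the cache and memory parameters (the helper name is mine; all sizes are assumed to be powers of two):

```c
#include <stdio.h>

static unsigned log2u(unsigned x) {          /* x assumed a power of 2 */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned mem_bytes   = 512 * 1024;       /* 512 KB main memory */
    unsigned cache_bytes = 32 * 1024;        /* 32 KB cache        */
    unsigned block_bytes = 16;               /* 16-byte blocks     */

    unsigned addr_bits   = log2u(mem_bytes);                     /* 19 */
    unsigned offset_bits = log2u(block_bytes);                   /*  4 */
    unsigned index_bits  = log2u(cache_bytes / block_bytes);     /* 11 */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /*  4 */

    printf("tag=%u index=%u offset=%u (total %u bits)\n",
           tag_bits, index_bits, offset_bits, addr_bits);
    return 0;
}
```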

Multiword Block Direct Mapped Cache
Cache size = 1K words, four words/block. [Figure: the address splits into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4) selecting one of 256 entries (0, 1, 2, ..., 253, 254, 255), a 2-bit block offset (bits 3-2) selecting the word within the block, and a 2-bit byte offset (bits 1-0).]

Cache Misses
On a cache hit, the CPU proceeds normally. On a cache miss: stall the CPU pipeline and fetch the block from the next level of the hierarchy. On an instruction cache miss, restart the instruction fetch; on a data cache miss, complete the data access. What kind of locality are we taking advantage of?

Multiword Block Direct Mapped Cache (larger)
Cache size = 16K words, four words/block. [Figure: a 16-bit tag (bits 31-16) and a 12-bit index (bits 15-4) select one of 4K entries; each entry holds 128 bits of data, and the 2-bit block offset drives a multiplexor that selects the requested 32-bit word.]

Cache Misses / Improving Cache Performance
Compulsory: first access to a block; a cold fact of life.
Conflict: multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase cache size.
AMAT = HitTime + MissRate x MissPenalty
Reduce HitTime: small and simple caches.
Reduce MissRate: larger block size, higher associativity.
Reduce MissPenalty: multilevel caches; give priority to read misses.

Example: Intrinsity FastMATH
Embedded MIPS processor with a 12-stage pipeline and instruction and data access on each cycle. Split cache: separate I-cache and D-cache, each 16 KB: 256 blocks of 16 words/block. D-cache: write-through or write-back. SPEC2000 miss rates: I-cache 0.4%, D-cache 11.4%, weighted average 3.2%.
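The weighted average can be checked arithmetically. A sketch, with the caveat that the slide does not give the instruction/data access mix: the data-access fraction of roughly 25% used here is an assumption chosen because it reproduces the quoted 3.2%.

```c
#include <stdio.h>

int main(void) {
    double i_miss = 0.004, d_miss = 0.114; /* SPEC2000 rates from the slide */
    double f_data = 0.25;                  /* assumed data-access fraction  */
    double avg = (1.0 - f_data) * i_miss + f_data * d_miss;
    printf("weighted average miss rate = %.2f%%\n", avg * 100.0);
    /* prints 3.15%, close to the slide's rounded 3.2% */
    return 0;
}
```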

Example: Intrinsity FastMATH (continued)
[Figure: the FastMATH cache organization; the 16-word blocks require a block offset that drives a multiplexor to select the addressed word.]

Example Access Pattern
Assume that addresses are 8 bits long and the cache is direct mapped with 8-byte blocks (each address maps to a unique cache location; the tag array is compared on every access). How many of the following address requests are hits, and how many are misses? 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10

Summary
The principle of locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space.
Three major categories of cache misses:
Compulsory misses: sad facts of life, e.g., cold-start misses.
Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
Capacity misses: increase cache size.
What's next? Set-associative caches; write and replacement policies.

Memory Systems that Support Caches
The off-chip interconnect and memory architecture affect overall system performance dramatically. The on-chip CPU and cache connect over a bus (32-bit data and a 32-bit address per cycle) to main memory. Assume:
1. 1 clock cycle (1 ns) to send the address from the cache to main memory.
2. 50 ns (50 processor clock cycles) for the DRAM first-word access time, and a 10 ns (10 clock cycle) cycle time for the remaining words in a burst (SDRAM).
3. 1 clock cycle (1 ns) to return a word of data from main memory to the cache.
Bus-to-cache bandwidth = the number of bytes accessed from main memory and transferred to the cache/CPU per clock cycle.
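The exercise can be checked by replaying the pattern on the slide's geometry (8-bit addresses, 8-byte blocks, 8 sets). A sketch; running it prints the hit/miss outcome of each request, and the multi-byte blocks show spatial locality at work (e.g., 4 and 7 share a block).

```c
#include <stdbool.h>
#include <stdio.h>

#define NSETS 8   /* direct mapped: 8 sets of 8-byte blocks */
#define BLOCK 8

int main(void) {
    unsigned tag[NSETS];
    bool valid[NSETS] = { false };
    unsigned refs[] = { 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10 };
    unsigned hits = 0, n = sizeof refs / sizeof refs[0];

    for (unsigned i = 0; i < n; i++) {
        unsigned block = refs[i] / BLOCK;   /* strip the 3 offset bits */
        unsigned set   = block % NSETS;     /* 3 index bits            */
        unsigned t     = block / NSETS;     /* remaining tag bits      */
        bool hit = valid[set] && tag[set] == t;
        if (hit) hits++;
        else { valid[set] = true; tag[set] = t; }
        printf("%3u: %s\n", refs[i], hit ? "hit" : "miss");
    }
    printf("%u hits, %u misses\n", hits, n - hits);
    return 0;
}
```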

One-Word-Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
1 cycle to send the address
50 cycles to read the DRAM
1 cycle to return the data
52 total clock cycles of miss penalty.
Number of bytes transferred per clock cycle (bandwidth) for a miss: 4/52 = 0.077 bytes per clock.

Burst Memory Organization
What if the block size is four words and a (DDR) SDRAM is used?
1 cycle to send the 1st address
50 + 3 x 10 = 80 cycles to read the DRAM (50 cycles for the first word, then 10 cycles for each of the remaining three words in the burst)
1 cycle to return the last data word
82 total clock cycles of miss penalty.
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/82 = 0.195 bytes per clock.
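A sketch that reproduces both the miss penalties and the bandwidths from the stated timing assumptions (the function and parameter names are mine):

```c
#include <stdio.h>

/* Miss penalty and bus bandwidth for the two organizations above:
 * send the address, read the DRAM (first word slow, remaining burst
 * words faster), return the data. Timing constants are the slide's
 * assumptions: 50-cycle first access, 10-cycle burst cycle time.   */
static void organization(const char *name, unsigned words_per_block,
                         unsigned first, unsigned burst) {
    unsigned penalty = 1                                      /* send address */
                     + first + (words_per_block - 1) * burst  /* read DRAM    */
                     + 1;                                     /* return data  */
    double bandwidth = (4.0 * words_per_block) / penalty;     /* bytes/clock  */
    printf("%s: %u cycles, %.3f bytes/clock\n", name, penalty, bandwidth);
}

int main(void) {
    organization("one word wide", 1, 50, 10);  /* 52 cycles, 0.077 */
    organization("burst SDRAM  ", 4, 50, 10);  /* 82 cycles, 0.195 */
    return 0;
}
```

The burst organization pays the 50-cycle first-word latency once per block, so quadrupling the block size raises the miss penalty from 52 to only 82 cycles while more than doubling the delivered bandwidth.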