Reducing Conflict Misses with Set Associative Caches


Transcription:

Reducing Conflict Misses with Set Associative Caches
Not too conflicty. Not too slow. Just right!

[Worked example, built up over several frames: an 8-byte, 2-way cache with 2B blocks, backed by a 16-byte memory holding the data values A through Q. A sequence of loads fills the cache with pairs of values (E/F, C/D, N/O, P/Q), and the slide asks: What should the offset be? What should the index be? What should the tag be? The specific load addresses were animation content.]
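The slide's three questions can be made concrete with a small sketch (not from the slides; the function name field_widths and the geometry arguments are ours): given a cache geometry, it computes how many address bits belong to the offset, the index (set selection), and the tag.

from math import log2

# Hypothetical helper: address-field widths for a given cache geometry.
def field_widths(addr_bits, capacity, block_size, ways):
    num_sets = capacity // (block_size * ways)
    offset_bits = int(log2(block_size))   # selects a byte within the block
    index_bits = int(log2(num_sets))      # selects a set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# The 8-byte, 2-way example with 2B blocks and 4-bit addresses:
print(field_widths(4, 8, 2, 2))  # -> (2, 1, 1): 2 tag bits, 1 index bit, 1 offset bit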

Misses: the Three C's
- Cold (compulsory) miss: never seen this address before
- Conflict miss: cache associativity is too low
- Capacity miss: cache is too small

ABCs of Caches
t_avg = t_hit + %_miss * t_miss
+ Associativity: decreases conflict misses, increases hit time
+ Block size: decreases cold misses, increases conflict misses
+ Capacity: decreases capacity misses, increases hit time

Which caches get what properties?
t_avg = t_hit + %_miss * t_miss
- L1 caches: fast; design with speed in mind
- L2/L3 caches: big; design with miss rate in mind: more associative, bigger block sizes, larger capacity

Summary so far
Things we've covered: The Need for Speed; Locality to the Rescue!; calculating average memory access time; $ misses: cold, conflict, capacity; $ characteristics: associativity, block size, capacity
Things we skipped (and are about to cover): cache tag overhead; replacement policies; writes

Basic Memory Array Structure
- Number of entries: n bits for lookup, 2^n entries. Example: 1024 entries, 10-bit address. A decoder changes the n-bit address into a 2^n-bit one-hot signal
- Size of entries: width of the data accessed. Here: 256 bits (32 bytes)
[Diagram: 1024-entry x 256-bit SRAM; 10 address bits in, 256 bits read from the cache]

Caches: Finding Data via Indexing
- Basic cache: an array of cache lines (or blocks). Here: a 32KB cache (1024 entries, 32B blocks). A hash table in hardware
- To find an entry: decode part of the address. Which part? With a 32-bit address: 32B blocks mean the 5 lowest bits locate the byte in the block = offset bits; 1024 entries mean the next 10 bits find the entry = index bits
- Note: nothing says the index must be these bits, but these work best (think about why)
[Diagram: address split into tag [31:15], index [14:5], offset [4:0]; index: which entry? (1K possible); offset: which byte? (32 possible)]
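The indexing slide's bit split is easy to sanity-check in a few lines. This is an illustrative sketch assuming the 32KB, 1024-entry, 32B-block geometry above; split_address is a hypothetical helper, not slide material.

# Bit extraction for the 32KB cache: offset = addr[4:0], index = addr[14:5],
# tag = addr[31:15].
OFFSET_BITS, INDEX_BITS = 5, 10

def split_address(addr):
    offset = addr & 0x1F                       # addr[4:0]: byte within block
    index = (addr >> OFFSET_BITS) & 0x3FF      # addr[14:5]: which entry
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # addr[31:15]: the rest
    return tag, index, offset

print([hex(x) for x in split_address(0x12345678)])  # ['0x2468', '0x2b3', '0x18']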

Knowing that You Found It: Tags
- Each entry can hold one of 2^17 blocks: all blocks with the same index-bit pattern
- How to know which (if any) is currently there? Attach a tag and a valid bit to each entry; compare the entry's tag to the address's tag bits. No need to match the index bits (why?)
- Lookup algorithm: read the entry indicated by the index bits; hit if the tag matches and the valid bit is set; otherwise, a miss: go to the next level
[Diagram: tag [31:15], index [14:5], offset [4:0]; the stored tag is compared (=) against the address tag to produce hit?]

Calculating Tag Overhead
- A "32KB cache" means the cache holds 32KB of data; this is called its capacity. Tag storage is considered overhead
- Tag overhead of a 32KB cache with 1024 x 32B entries:
  32B blocks: 5-bit offset
  1024 entries: 10-bit index
  32-bit address - (5-bit offset + 10-bit index) = 17-bit tag
  (17-bit tag + 1-bit valid) x 1024 entries = 18Kb tags = 2.2KB tags: ~6% overhead
- What about 64-bit addresses? The tag increases to 49 bits: ~20% overhead (worst case)

Handling a Cache Miss
- What if the requested data isn't in the cache? How does it get in there?
- Cache miss controller: a finite state machine. It remembers the miss address, accesses the next level of memory, waits for the response, and writes the data/tag into the proper locations
- All of this happens on the fill path, sometimes called the backside

Cache Performance Equation
- Access: a read or write to the cache
- Hit: the desired data is found in the cache
- Miss: the desired data is not found in the cache and must be gotten from another component. (There is no notion of a "miss" in a register file.)
- Fill: the action of placing data into the cache
- %_miss (miss rate): #misses / #accesses
- t_hit: time to read data from (write data to) the cache
- t_miss: time to read data into the cache
- Performance metric: average access time  t_avg = t_hit + %_miss * t_miss
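The performance equation in executable form, as a one-line sketch (the function name and the example numbers below are ours, chosen only to illustrate the formula):

# Average access time: t_avg = t_hit + %_miss * t_miss
def t_avg(t_hit, pct_miss, t_miss):
    return t_hit + pct_miss * t_miss

# e.g., a 1-cycle hit, 10% miss rate, 10-cycle miss penalty:
print(t_avg(1, 0.10, 10))  # -> 2.0 cycles average access time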

CPI Calculation with Cache Misses
Parameters: a simple pipeline with a base CPI of 1; instruction mix: 30% loads/stores; I$: %_miss = 2%, t_miss = 10 cycles; D$: %_miss = 10%, t_miss = 10 cycles
What is the new CPI?
CPI_I$ = %_miss,I$ * t_miss,I$ = 0.02 * 10 cycles = 0.2 cycles
CPI_D$ = %_load/store * %_miss,D$ * t_miss,D$ = 0.30 * 0.10 * 10 cycles = 0.3 cycles
CPI_new = CPI + CPI_I$ + CPI_D$ = 1 + 0.2 + 0.3 = 1.5

Measuring Cache Performance
- The ultimate metric is t_avg. Cache capacity and circuits roughly determine t_hit; lower-level memory structures determine t_miss
- Measuring %_miss: hardware performance counters; simulation; paper simulation (next)

Cache Paper Simulation
- 4-bit addresses: total memory size = 16B; much simpler cache diagrams than with 32 bits
- 8B cache, 2B blocks: number of entries (or sets) = 4 (capacity / block size)
- Figure out how the address splits into offset/index/tag bits:
  offset: least-significant log2(block size) = log2(2) = 1 bit
  index: next log2(number of entries) = log2(4) = 2 bits
  tag: the rest = 4 - 1 - 2 = 1 bit
- The cache diagram shows the addresses of the data in each block; the values don't matter (a mechanized version of this paper simulation appears below)
[Table: cache contents for Sets 0-3, the accessed address, and the outcome at each step of the access sequence]

Cache Paper Simulation, continued
8B cache, 2B blocks: tag (1 bit), index (2 bits)
[Frames: a load sequence runs against Sets 0-3; the initial accesses miss and fill, and repeated accesses to resident blocks produce Hit, Hit, Hit, Hit]
How to reduce %_miss? And hopefully t_avg?

Capacity and Performance
The simplest way to reduce %_miss: increase capacity
+ Miss rate decreases monotonically. "Working set": the insns/data a program is actively using. Diminishing returns
- However, t_hit increases: latency is proportional to sqrt(capacity)
t_avg?
[Plot: %_miss versus cache capacity, flattening past the working set size]
Given capacity, manipulate %_miss by changing the organization
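The CPI arithmetic above can be verified mechanically. A minimal sketch with the slide's parameters (the variable names are ours):

# Verify the slide's CPI calculation.
base_cpi = 1.0               # simple pipeline, base CPI of 1
f_mem = 0.30                 # 30% of instructions are loads/stores
miss_i, tmiss_i = 0.02, 10   # I$: 2% miss rate, 10-cycle miss penalty
miss_d, tmiss_d = 0.10, 10   # D$: 10% miss rate, 10-cycle miss penalty

cpi_i = miss_i * tmiss_i            # every instruction is fetched from the I$
cpi_d = f_mem * miss_d * tmiss_d    # only loads/stores access the D$
print(base_cpi + cpi_i + cpi_d)     # -> 1.5

The paper simulation itself can also be mechanized. The sketch below (a hypothetical simulate helper, not slide material) models a cache of a given capacity, block size, and associativity, replacing the least recently used block within a set; the later block-size and associativity examples reuse it.

def simulate(addresses, capacity=8, block_size=2, ways=1):
    """Return the Hit/Miss outcome for each access in `addresses`.

    Addresses are small integers (the slides use 4-bit addresses);
    replacement within a set is LRU.
    """
    num_sets = capacity // (block_size * ways)
    sets = [[] for _ in range(num_sets)]    # each set holds tags, MRU last
    outcomes = []
    for addr in addresses:
        block = addr // block_size          # drop the offset bits
        index = block % num_sets            # index bits select the set
        tag = block // num_sets             # remaining bits form the tag
        resident = sets[index]
        if tag in resident:
            resident.remove(tag)            # hit: move tag to MRU position
            resident.append(tag)
            outcomes.append("Hit")
        else:
            if len(resident) == ways:       # set full: evict the LRU tag
                resident.pop(0)
            resident.append(tag)            # fill the missing block
            outcomes.append("Miss")
    return outcomes

# The direct-mapped configuration above: 8B cache, 2B blocks, 4 sets.
print(simulate([0b0000, 0b0001, 0b0000, 0b0010]))
# -> ['Miss', 'Hit', 'Hit', 'Miss']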

Block Size
- Given capacity, manipulate %_miss by changing the organization. One option: increase block size
- Exploits spatial locality. Notice that the index/offset bits change; the tag bits remain the same
Ramifications:
+ Reduces %_miss (up to a point)
+ Reduces tag overhead (why?)
- Potentially useless data transfer
- Premature replacement of useful data
- Fragmentation
[Diagram: 512 x 512-bit SRAM; tag [31:15], 9-bit index [14:6], offset [5:0]; 64B block size]

Block Size and Tag Overhead
- Tag overhead of a 32KB cache with 1024 x 32B entries:
  32B lines: 5-bit offset; 1024 entries: 10-bit index
  32-bit address - (5-bit offset + 10-bit index) = 17-bit tag
  (17-bit tag + 1-bit valid) x 1024 entries = 18Kb tags = 2.2KB tags: ~6% overhead
- Tag overhead of a 32KB cache with 512 x 64B entries:
  64B lines: 6-bit offset; 512 entries: 9-bit index
  32-bit address - (6-bit offset + 9-bit index) = 17-bit tag
  (17-bit tag + 1-bit valid) x 512 entries = 9Kb tags = 1.1KB tags: ~3% overhead

Block Size and Cache Paper Simulation
8B cache, 4B blocks: tag (1 bit), index (1 bit), offset (2 bits); Sets 0 and 1
+ Spatial prefetching: a miss on one word brings in the whole block, so a later access to a neighboring word is a Hit (spatial locality)
- Conflicts: a miss on one block can kick out data that is still useful

Effect of Block Size on Miss Rate
Two effects on miss rate:
+ Spatial prefetching (good): for blocks with adjacent addresses; turns miss/miss pairs into miss/hit pairs
- Interference (bad): for blocks with non-adjacent addresses (but in adjacent entries); turns hits into misses by disallowing simultaneous residence. Consider the entire cache as one big block
Both effects are always present; spatial prefetching dominates initially (depends on the size of the cache)
A good block size is 16-128B; program dependent
[Plot: %_miss versus block size]

Block Size and Miss Penalty
- Does increasing block size increase t_miss? Don't larger blocks take longer to read, transfer, and fill?
- They do, but the t_miss of an isolated miss is not affected
  Critical Word First / Early Restart (CRF/ER): the requested word is fetched first and the pipeline restarts immediately; the remaining words in the block are transferred/filled in the background
- The t_miss'es of a cluster of misses will suffer
  Reads/transfers/fills of two misses can't happen at the same time; latencies start to pile up. This is a bandwidth problem (more later)
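Both tag-overhead calculations can be checked mechanically. A sketch (tag_overhead is a hypothetical helper; Kb means kilobits and KB kilobytes, following the slides):

def tag_overhead(addr_bits, capacity, block_size):
    entries = capacity // block_size
    offset_bits = (block_size - 1).bit_length()   # log2(block size)
    index_bits = (entries - 1).bit_length()       # log2(#entries)
    tag_bits = addr_bits - offset_bits - index_bits
    tag_bytes = (tag_bits + 1) * entries / 8      # +1 for the valid bit
    return tag_bits, tag_bytes, tag_bytes / capacity

print(tag_overhead(32, 32 * 1024, 32))  # (17, 2304.0, ~0.07): 2.2KB tags, ~6%
print(tag_overhead(32, 32 * 1024, 64))  # (17, 1152.0, ~0.035): 1.1KB tags, ~3%
print(tag_overhead(64, 32 * 1024, 32))  # (49, 6400.0, ~0.20): worst case

Rerunning the simulate sketch from the previous page with a larger block also shows spatial prefetching directly (the access stream here is illustrative, not the slide's):

# Sequential loads of addresses 0..3 against the same 8B cache.
loads = [0, 1, 2, 3]
print(simulate(loads, block_size=2))  # ['Miss', 'Hit', 'Miss', 'Hit']
print(simulate(loads, block_size=4))  # ['Miss', 'Hit', 'Hit', 'Hit']

Doubling the block size turns the second miss/hit pair into two hits: the miss on address 0 prefetches addresses 2 and 3.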

Conflicts
8B cache, 2B blocks: tag (1 bit), index (2 bits)
[Frames: an access sequence over Sets 0-3 in which two addresses that share index bits repeatedly evict each other, while the other accesses Hit]
Pairs of addresses like these conflict, regardless of block size (assuming capacity < 16)
Q: can we allow such pairs to simultaneously reside? A: yes, reorganize the cache to do so

Set-Associativity
- A block can reside in one of a few entries
- Groups of entries are called sets; each entry in a set is called a way
- This is 2-way set-associative (SA); 1-way is direct-mapped (DM); 1-set is fully-associative (FA)
+ Reduces conflicts
- Increases latency_hit: additional tag match and muxing
Note: valid bit not shown
[Diagram: tag [31:14], 9-bit index [13:5], offset [4:0]; 512 sets x 2 ways; per-way tag compares (=) and a mux produce hit?]

Set-Associativity
- Lookup algorithm: use the index bits to find the set; read the data/tags in all ways in parallel; any way that matches (tag match and valid bit set) is a Hit
- Notice the tag/index/offset bits: only a 9-bit index (versus 10-bit for direct-mapped)
- Notice the block numbering

Associativity and Paper Simulation
8B cache, 2B blocks, 2-way set-associative: Set 0.Way 0, Set 0.Way 1, Set 1.Way 0, Set 1.Way 1
[Frames: the conflicting access sequence replayed; one access becomes a new conflict, while the two formerly conflicting accesses now Hit]
+ Avoids conflicts: both conflicting blocks can now live in the same set
- Introduces some new conflicts: addresses get re-arranged
+ Conflict avoidance usually dominates

Replacement Policies
Associative caches present a new design choice: on a cache miss, which block in the set should be replaced (kicked out)?
Some options:
- Random
- FIFO (first-in first-out)
- LRU (least recently used): fits with temporal locality; LRU = least likely to be used in the future
- NMRU (not most recently used): an easier-to-implement approximation of LRU; is LRU for 2-way set-associative caches
- Belady's: replace the block that will be used furthest in the future; an unachievable optimum
Which policy is simulated in the previous example?
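The conflict-avoidance effect can be replayed with the same simulate sketch: two addresses whose blocks share index bits thrash a direct-mapped cache but coexist in a 2-way set (the addresses are illustrative):

# 0b0010 and 0b1010 map to the same set in the direct-mapped configuration.
conflicting = [0b0010, 0b1010, 0b0010, 0b1010]
print(simulate(conflicting, ways=1))  # ['Miss', 'Miss', 'Miss', 'Miss']
print(simulate(conflicting, ways=2))  # ['Miss', 'Miss', 'Hit', 'Hit']

Note that this sketch evicts the least recently used way; for 2-way sets, LRU and NMRU coincide.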

Associativity and Performance
Higher-associativity caches:
+ Have better (lower) %_miss, with diminishing returns
- However, t_hit increases: the more associative, the slower
What about t_avg?
[Plot: %_miss versus associativity, flattening around ~5 ways]

Block size and number of sets should be powers of two: that makes indexing easier (just rip bits out of the address). 3-way set-associativity? No problem.
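The power-of-two point is visible in code: with 2^n sets, set selection is a shift and a mask; any other set count would need an integer modulo, i.e., a real divider. A small sketch (set_index is ours):

def set_index(addr, offset_bits, num_sets):
    block = addr >> offset_bits
    if num_sets & (num_sets - 1) == 0:   # power of two: mask the bits out
        return block & (num_sets - 1)
    return block % num_sets              # otherwise: integer modulo

print(set_index(0b1100, 1, 4))  # -> 2 (just bits [2:1] of the address)

Associativity, by contrast, only changes how many tags are compared within the selected set, which is why 3-way set-associativity poses no problem.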