EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches


Cache Introduction

[Die photo] Roughly 40% of this ARM CPU's die area is devoted to SRAM cache. But the role of cache in computer design has varied widely over time.

1977: DRAM faster than microprocessors. In the Apple II (1977), the CPU cycle time was 1000 ns while DRAM access time was 400 ns. [Photo: Steve Jobs and Steve Wozniak]

1980-2003: CPU speed outpaced DRAM. [Chart: performance (1/latency), log scale 10 to 10000, versus year, 1980-2005. CPU performance grew ~60% per year (2x in 1.5 years) until the power wall; DRAM latency improved only ~9% per year (2x in 10 years), so the gap grew ~50% per year.] Q: How do architects address this gap? A: Put smaller, faster cache memories between CPU and DRAM. Create a memory hierarchy.

Nehalem Die Photo (i7, i5). [Annotated die photo highlighting L1 and L2 regions per core.] Per core: 32KB L1 I-cache (4-way set associative), 32KB L1 D-cache (8-way set associative), 256KB unified L2 (8-way set associative, 64B blocks). Common L3: 8MB.

Typical Memory Hierarchy
[Diagram: on-chip components (datapath, control, register file, L1 instruction cache, L1 data cache) backed by a second-level cache (SRAM), a third-level cache (SRAM), main memory (DRAM), and secondary memory (disk or flash).]
Speed (cycles): 1/2s | 1s | 10s | 100s | 1,000,000s
Size (bytes): 100s | 10Ks | Ms | Gs | Ts
Cost/bit: highest at the register file, lowest at secondary memory
The principle of locality plus a memory hierarchy present the programmer with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

CPU-Cache Interaction (5-stage pipeline). [Pipeline diagram: the primary instruction cache is probed in the fetch stage, where the hit? signal gates the PC update and can insert a bubble; the primary data cache is accessed in the memory stage.] On a data cache miss, stall the entire CPU while the cache is refilled with data from the lower levels of the memory hierarchy.

Review from 61C. Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.
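As an illustration (a minimal C sketch, not from the slides), spatial locality shows up directly in loop order: row-major traversal touches consecutive addresses and uses every byte of each fetched cache block, while column-major traversal jumps a full row per access and typically misses on each one.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive iterations touch consecutive
   addresses, so each cache block is fully consumed before the
   next block is fetched (spatial locality). */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations jump N*8 bytes,
   so each access usually lands in a different block and misses. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}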

Example: 1 KB direct-mapped cache with 32 B blocks. For a 2^N-byte cache with 2^M-byte blocks, the uppermost (32 - N) bits of the address are always the cache tag, and the lowest M bits are the byte select. [Diagram: a 32-bit address split into the cache tag (bits 31-10, e.g. 0x50), the cache index (bits 9-5, e.g. 0x01), and the byte select (bits 4-0, e.g. 0x00). The tag is stored as part of the cache state alongside a valid bit; each of the 32 entries holds a 32-byte block (bytes 0-31, 32-63, ..., 992-1023).]
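A small C sketch of the field extraction, assuming the slide's parameters (1 KB direct-mapped cache, 32 B blocks, 32-bit addresses); the example address is chosen so the fields reproduce the slide's values:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5   /* 32 B blocks -> 5 byte-select bits */
#define INDEX_BITS 5   /* 1 KB / 32 B = 32 entries -> 5 index bits */

int main(void) {
    uint32_t addr = 0x00014020;  /* illustrative address */
    uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag      = addr >> (BLOCK_BITS + INDEX_BITS);
    /* Prints: tag=0x50 index=0x1 byte=0x0, matching the slide. */
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);
    return 0;
}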

Extreme example: fully associative cache. Forget about the cache index; compare the cache tags of all cache entries in parallel. Example: with 32 B blocks and 32-bit addresses, the tag is 27 bits, so a cache with N entries needs N 27-bit comparators. By definition, conflict misses = 0 for a fully associative cache. [Diagram: a 32-bit address split into a 27-bit cache tag (bits 31-5) and a byte select (bits 4-0, e.g. 0x01); every entry's stored tag is compared against the address tag simultaneously.]

Set-Associative Cache. N-way set associative: N entries for each cache index, i.e., N direct-mapped caches operating in parallel. Example: two-way set-associative cache. The cache index selects a set from the cache; the two tags in the set are compared to the address tag in parallel; data is selected based on the tag comparison result. [Diagram: two banks of valid/tag/data arrays indexed by the cache index; two comparators drive select signals Sel0/Sel1 into a mux, and the OR of the compare outputs forms the hit signal and selects the cache block.]
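A minimal C sketch of the lookup logic (hypothetical data structure, not from the slides); the loop over ways models what the hardware does with parallel comparators:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 2
#define SETS 32
#define BLOCK_BITS 5
#define INDEX_BITS 5

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << BLOCK_BITS];
} Line;

static Line cache[SETS][WAYS];

/* Returns a pointer to the matching block, or NULL on a miss. */
uint8_t *lookup(uint32_t addr) {
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    for (int w = 0; w < WAYS; w++)        /* parallel in hardware */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return cache[index][w].data;  /* hit */
    return 0;                             /* miss */
}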

Disadvantage of Set-Associative Cache. N-way set-associative versus direct-mapped: N comparators vs. 1, extra mux delay for the data, and the data arrives AFTER the hit/miss decision and set selection. In a direct-mapped cache, the cache block is available BEFORE hit/miss is known: it is possible to assume a hit and continue, recovering later on a miss. [Same two-way lookup diagram as the previous slide.]

Cache Design and Optimization

CPU time = Instruction count x (ideal CPI + memory stall cycles per instruction) x clock cycle time. Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty. Improving cache performance: 3 general options: 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache (although this is very often 1 cycle).
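A quick worked example (numbers chosen purely for illustration): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, AMAT = 1 + 0.05 x 20 = 2 cycles. Halving the miss rate to 2.5% gives AMAT = 1 + 0.025 x 20 = 1.5 cycles, which shows why a small change in miss rate has an outsized effect when the miss penalty is large.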

Primary Cache Parameters Block size how many bytes of data in each cache entry? Associativity how many ways in each set? Direct-mapped => Associativity = 1 Set-associative => 1 < Associativity < #Entries Fully associative => Associativity = #Entries Capacity (bytes) = Total #Entries * Block size #Entries = #Sets * Associativity
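A concrete (illustrative) instance of these relationships: a 32 KB cache with 64 B blocks has 32768 / 64 = 512 entries. If it is 8-way set associative, it has 512 / 8 = 64 sets, so a 32-bit address splits into a 6-bit block offset, a 6-bit index, and a 32 - 6 - 6 = 20-bit tag.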

Block Size Tradeoff. In general, a larger block size takes advantage of spatial locality, BUT: a larger block size means a larger miss penalty (it takes longer to fill the block), and if the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks compromises temporal locality). In general, AMAT = Hit time + Miss rate x Miss penalty. [Three sketches versus block size: miss penalty rises with block size; miss rate first falls (exploiting spatial locality) then rises (fewer blocks); average access time therefore has a minimum, increasing once the larger miss penalty and miss rate dominate.]

Summary on Sources of Cache Misses. Compulsory (cold start, first reference): the first access to a block. A cold fact of life: not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size. Invalidation: another process (e.g., I/O) updates memory.

3Cs Analysis. Three sources of misses (SPEC2000 integer and floating-point benchmarks): compulsory misses are about 0.006% and not visible on the plot; capacity misses are a function of cache size; the conflict portion depends on both associativity and cache size.

4 Questions for Caches (and Memory Hierarchy) Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level? Example: block 12 placed in an 8-block cache, for the fully associative, direct-mapped, and 2-way set-associative organizations. Set-associative mapping: set = block number modulo number of sets. Fully associative: block 12 can go anywhere (blocks 0-7). Direct mapped: block 12 can go only into block 4 (12 mod 8). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), with the 8 blocks grouped into sets 0-3. [Diagram: memory block-frame addresses 0-31 above the three 8-block cache organizations.]

Q2: How is a block found if it is in the upper level? [Address diagram: the block number (tag + index) followed by the block offset.] Lookup uses direct indexing (via the index and block offset), tag comparison, or a combination. Increasing associativity shrinks the index and expands the tag.

Q3: Which block should be replaced on a miss? Easy for direct mapped. For set associative or fully associative: Random, or LRU (Least Recently Used). Miss rates by associativity and cache size:

Size      2-way LRU / Random    4-way LRU / Random    8-way LRU / Random
16 KB     5.2% / 5.7%           4.7% / 5.3%           4.4% / 5.0%
64 KB     1.9% / 2.0%           1.5% / 1.7%           1.4% / 1.5%
256 KB    1.15% / 1.17%         1.13% / 1.13%         1.12% / 1.12%
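A minimal sketch (illustrative, not from the slides) of LRU bookkeeping for one set using per-way age counters: on every access the touched way's age resets to 0 and strictly younger ways age by one, so the victim is always the way with the largest age.

#include <stdint.h>

#define WAYS 4

/* One age counter per way; larger age = less recently used. */
static uint8_t age[WAYS];

/* Call on every access (hit or fill) to way w of this set. */
void lru_touch(int w) {
    for (int i = 0; i < WAYS; i++)
        if (age[i] < age[w]) age[i]++;  /* only younger ways age */
    age[w] = 0;                         /* w is now most recent */
}

/* Victim selection on a miss: the oldest way. */
int lru_victim(void) {
    int v = 0;
    for (int i = 1; i < WAYS; i++)
        if (age[i] > age[v]) v = i;
    return v;
}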

Q4: What happens on a write? Write through: the information is written to both the block in the cache and the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (a dirty bit records whether the block is clean or dirty). Pros and cons of each? WT: simple, and read misses cannot result in writes. WB: no writes for repeated writes to the same block (saves energy and bandwidth). WT is always combined with write buffers so the processor doesn't wait for the lower-level memory.
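A C sketch contrasting the two policies (illustrative assumptions: 64 B blocks, a stand-in array for the lower level, and a hypothetical memory_write helper):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[64]; } Line;

/* Stand-in for the lower level of the hierarchy (assumed 1 MB). */
static uint8_t dram[1 << 20];
static void memory_write(uint32_t addr, uint8_t byte) {
    dram[addr & ((1u << 20) - 1)] = byte;
}

/* Write-through: every store updates the cache AND the lower level
   (in practice via a write buffer, as on the next slide). */
void store_write_through(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr & 63] = byte;
    memory_write(addr, byte);
}

/* Write-back: a store updates only the cache and sets the dirty bit;
   the lower level sees the data only when the block is evicted. */
void store_write_back(Line *line, uint32_t addr, uint8_t byte) {
    line->data[addr & 63] = byte;
    line->dirty = true;
}

void evict_write_back(Line *line, uint32_t block_addr) {
    if (line->dirty)  /* write the whole modified block back */
        for (int i = 0; i < 64; i++)
            memory_write(block_addr + i, line->data[i]);
    line->valid = false;
    line->dirty = false;
}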

Write Buffer for Write Through. [Diagram: Processor -> Cache -> DRAM, with a write buffer between the cache and DRAM.] A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller drains the buffer's contents to memory. The write buffer is just a FIFO, typically 4 entries deep. It works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle time. The memory system designer's nightmare: store frequency > 1 / DRAM write cycle time, i.e., write buffer saturation.
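To make the saturation condition concrete (illustrative numbers): if a DRAM write cycle takes 100 ns, memory can retire at most 1 / 100 ns = 10 million stores per second. A 1 GHz processor that executes a store 10% of the time generates 100 million stores per second, ten times the drain rate, so any finite FIFO eventually fills and the processor stalls at the speed of the DRAM.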

Write-miss Policy: Write Allocate versus Not Allocate. Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block? Yes: write allocate. No: write not allocate. [Same 1 KB direct-mapped cache diagram as before, with cache tag 0x00, cache index 0x00, and byte select 0x00.] Usually: a write-back cache uses write allocate; a write-through cache uses no-write allocate.

Another real-world example: at ISSCC 2015 in San Francisco, IBM presented details of its latest mainframe chip. The z13 is designed in 22nm SOI technology with seventeen metal layers and 4 billion transistors per chip: 8 cores per chip, with 2MB L2 cache per core, 64MB L3 cache, and 480MB L4 off-chip cache; 5GHz clock rate, 6 instructions per cycle, 2 threads per core; up to 24 processor chips in a shared-memory node.

IBM z13 Memory Hierarchy. [Diagram of the z13 cache and memory hierarchy.]