Caches. Han Wang, CS 3410, Spring 2012, Computer Science, Cornell University. See P&H 5.1, 5.2 (except writes).


Announcements
This week: PA2 work-in-progress submission.
Next six weeks: two labs and two projects.
Prelim2 will be Thursday, March 29th.

Agenda
- Memory Hierarchy Overview
- The Principle of Locality
- Direct-Mapped Cache
- Fully Associative Cache

Performance
CPU clock rates: ~0.2 ns - 2 ns (5 GHz - 500 MHz).

Technology       Capacity  $/GB    Latency
Tape             1 TB      $0.17   100s of seconds
Disk             2 TB      $0.03   millions of cycles (ms)
SSD (Flash)      128 GB    $2      thousands of cycles (us)
DRAM             8 GB      $10     50-300 cycles (10s of ns)
SRAM (off-chip)  8 MB      $4000   5-15 cycles (a few ns)
SRAM (on-chip)   256 KB    ???     1-3 cycles (ns)

Others: eDRAM (aka 1-T SRAM), FeRAM, CD, DVD, ...
Q: Can we create the illusion of memory that is cheap, large, and fast?

Memory Pyramid
RegFile (100s of bytes)        < 1 cycle access
L1 Cache (several KB)          1-3 cycle access
L2 Cache (½ MB - 32 MB)        5-15 cycle access
Memory (128 MB - a few GB)     50-300 cycle access
Disk (many GB - a few TB)      1,000,000+ cycle access

L3 is becoming more common (eDRAM?). These are rough numbers: mileage may vary for the latest and greatest. Caches are usually made of SRAM (or eDRAM).

Memory Hierarchy
Memory closer to the processor: small and fast; stores active data.
Memory farther from the processor: big and slow; stores inactive data.

Active vs. Inactive Data
Assumption: most data is not active.
Q: How do we decide what is active?
A: The programmer decides.
A: The compiler decides.
A: The OS decides at run-time.
A: The hardware decides at run-time.

Insight of Caches
Q: What is active data?
If Mem[x] was accessed recently...
... then Mem[x] is likely to be accessed again soon. Exploit temporal locality.
... and Mem[x ± ε] is likely to be accessed soon. Exploit spatial locality.

Locality Analogy
Writing a report on a specific topic: while at the library, you check out books and keep them on your desk. If you need more, you check those out and bring them to the desk too. But you don't return the earlier books, since you might still need them. Desk space is limited: which books do you keep? You hope a collection of ~20 books on the desk is enough to write the report, even though 20 is only 0.00002% of the books in the Cornell libraries.

Two Types of Locality
Temporal locality (locality in time): if a memory location is referenced, it will tend to be referenced again soon. So keep the most recently accessed data items closer to the processor.
Spatial locality (locality in space): if a memory location is referenced, locations with nearby addresses will tend to be referenced soon. So move blocks of contiguous words closer to the processor.
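Both kinds show up in even the simplest code; here is a minimal illustration (a sketch of mine, not from the slides):

    #include <stdio.h>

    int main(void) {
        int a[1024];
        for (int i = 0; i < 1024; i++)
            a[i] = i;

        int sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += a[i];   /* spatial: a[i] sits right after a[i-1]         */
                           /* temporal: sum and i are touched on every trip */

        printf("sum = %d\n", sum);
        return 0;
    }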

Locality

int n = 4;
int k[] = { 3, 14, 0, 10 };

int fib(int i) {
    if (i <= 2) return i;
    else return fib(i-1) + fib(i-2);
}

int main(int ac, char **av) {
    for (int i = 0; i < n; i++) {
        printi(fib(k[i]));
        prints("\n");
    }
}

Memory trace (stack accesses interleaved with instruction fetches):
0x7c9a2b18 0x7c9a2b19 0x7c9a2b1a 0x7c9a2b1b
0x7c9a2b1c 0x7c9a2b1d 0x7c9a2b1e 0x7c9a2b1f
0x7c9a2b20 0x7c9a2b21 0x7c9a2b22 0x7c9a2b23
0x7c9a2b28 0x7c9a2b2c
0x0040030c 0x00400310
0x7c9a2b04 0x00400314
0x7c9a2b00 0x00400318
0x0040031c ...

Memory Hierarchy
Memory closer to the processor is fast but small. It usually stores a subset of the memory farther away ("strictly inclusive"); the alternatives are strictly exclusive and mostly inclusive designs.
Data is transferred in whole blocks (cache lines):
4 KB blocks: disk to RAM
256 B blocks: RAM to L2
64 B blocks: L2 to L1

Cache Lookups (Read)
The processor tries to access Mem[x]. Check: is a block containing Mem[x] in the cache?
Yes: cache hit. Return the requested data from the cache line.
No: cache miss. Read the block from memory (or from a lower-level cache), evicting an existing cache line to make room if necessary; place the new block in the cache; return the requested data. The pipeline stalls while all of this happens.

Cache Organization
(Diagram: CPU, cache controller, cache.)
The cache has to be fast and dense.
Gain speed by performing lookups in parallel, but this costs die real estate for the lookup logic.
Reduce the lookup logic by limiting where in the cache a block might be placed, but this might reduce the cache's effectiveness.

Three Common Designs
A given data block can be placed:
in any cache line: fully associative
in exactly one cache line: direct mapped
in a small set of cache lines: set associative
(Diagrams, over three builds: memory blocks mapping into the cache under each scheme, ending with a 2-way set-associative cache whose two ways are labeled way 0 and way 1.)


TIO: Mapping the Memory Address
The lowest bits of the address (the offset) determine which byte within a block the address refers to.
Full address format: | Tag | Index | Offset |
An n-bit offset means a block holds how many bytes? An n-bit index means the cache has how many lines? (2^n in both cases.)

Direct Mapped Cache
Each block number is mapped to a single cache line index. Simplest hardware.
(Diagram, shown in two builds: memory words 0x000000 through 0x000048 mapping first onto a 2-line cache, then onto a 4-line cache, wrapping around lines 0, 1, 2, 3, 0, 1, ...)

Tags and Offsets
Assume sixteen 64-byte cache lines.
0x7FFF3D4D = 0111 1111 1111 1111 0011 1101 0100 1101
We need metadata for each cache line:
valid bit: is the cache line non-empty?
tag: which block is stored in this line (if valid)?
Q: How do we check whether X is in the cache?
Q: How do we clear a cache line?
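A worked instance of the questions above (a sketch, not from the slides): sixteen lines means a 4-bit index, 64-byte lines mean a 6-bit offset, and the tag is the remaining 22 bits of a 32-bit address.

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 6   /* 64-byte blocks: 2^6 = 64 */
    #define INDEX_BITS  4   /* sixteen lines:  2^4 = 16 */

    int main(void) {
        uint32_t addr   = 0x7FFF3D4D;
        uint32_t offset =  addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);
        /* prints tag=0x1FFFCF index=0x5 offset=0xD */
        printf("tag=0x%X index=0x%X offset=0x%X\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }

So: X is in the cache iff line[index] is valid and its stored tag equals X's tag bits, and clearing a line is just clearing its valid bit (the stale tag and data can remain).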

A Simple Direct Mapped Cache
Using byte addresses in this example! Address bus = 5 bits.

Processor trace:
lb $1 ← M[ 1 ]
lb $2 ← M[ 13 ]
lb $3 ← M[ 0 ]
lb $3 ← M[ 6 ]
lb $2 ← M[ 5 ]
lb $2 ← M[ 6 ]
lb $2 ← M[ 10 ]
lb $2 ← M[ 12 ]

Memory (address: value):
 0: 101    1: 103    2: 107    3: 109
 4: 113    5: 127    6: 131    7: 137
 8: 139    9: 149   10: 151   11: 157
12: 163   13: 167   14: 173   15: 179
16: 181

Direct mapped cache (one V, tag, and data field per line). Q: How many hits? How many misses?
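One way to check your answer is to run a tiny simulator. The slide leaves the cache geometry to the diagram, so this sketch assumes four direct-mapped lines of 2-byte blocks (1 offset bit, 2 index bits, 2 tag bits on a 5-bit address); NUM_LINES and BLOCK_SIZE are knobs to match your cache:

    #include <stdio.h>

    #define NUM_LINES  4   /* assumed geometry, not fixed by the text */
    #define BLOCK_SIZE 2

    int main(void) {
        int valid[NUM_LINES] = {0};
        int tag[NUM_LINES]   = {0};
        int trace[] = {1, 13, 0, 6, 5, 6, 10, 12};  /* byte addresses from the slide */
        int n = (int)(sizeof trace / sizeof trace[0]);
        int hits = 0, misses = 0;

        for (int i = 0; i < n; i++) {
            int block = trace[i] / BLOCK_SIZE;  /* drop the offset bits          */
            int index = block % NUM_LINES;      /* which line the block maps to  */
            int t     = block / NUM_LINES;      /* remaining bits form the tag   */
            int hit   = valid[index] && tag[index] == t;
            if (hit) hits++;
            else { misses++; valid[index] = 1; tag[index] = t; }  /* evict + refill */
            printf("M[%2d] -> line %d: %s\n", trace[i], index, hit ? "hit" : "miss");
        }
        printf("hits: %d  misses: %d\n", hits, misses);
        return 0;
    }

Under these assumptions the trace gives 2 hits and 6 misses; replacing the index computation with a search over all lines turns this into the fully associative cache later in the lecture.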

Direct Mapped Cache (Reading)
(Diagram: the address is split into Tag, Index, and Offset; the index selects one cache line, whose V bit and stored tag are checked against the address tag, and the offset selects the requested word from the line's data block.)

Direct Mapped Cache Size
With an n-bit index and an m-bit offset:
Q: How big is the cache (data only)?
Q: How much SRAM is needed (data + overhead)?
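A hedged worked answer (assuming 32-bit addresses and one valid bit per line, consistent with the "Tags and Offsets" slide):

    data only:  2^n lines x 2^m bytes/line = 2^(n+m) bytes
    total SRAM: 2^n x (8*2^m + (32 - n - m) + 1) bits
                (per line: data bits + tag bits + valid bit)

For the 512 x 64-byte L1 on the next slide (n = 9, m = 6), that is 32 KB of data plus 512 x 18 = 9216 bits (1152 bytes) of overhead.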

Cache Performance (very simplified)
L1 (SRAM): 512 x 64-byte cache lines, direct mapped
  Data cost: 3 cycles per word access
  Lookup cost: 2 cycles
Mem (DRAM): 4 GB
  Data cost: 50 cycles per word, plus 3 cycles per consecutive word
Performance depends on: access time for a hit, the miss penalty, and the hit rate.
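The standard way to combine these numbers is average memory access time (AMAT); the formula is standard, but the 90% hit rate below is only an assumed figure for illustration, and the miss penalty is one plausible reading of the DRAM numbers above (a 64-byte line is 16 words):

    AMAT         = hit time + miss rate x miss penalty
    hit time     = 2 (lookup) + 3 (data)  = 5 cycles
    miss penalty = 50 + 15 x 3            = 95 cycles
    at 90% hits:   5 + 0.10 x 95          = 14.5 cycles per access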

Misses
Classifying cache misses, part 1:
Cold (aka compulsory) miss: the line is being referenced for the first time.
Otherwise: the line was in the cache, but has been evicted...

Avoiding Misses
Q: How do we avoid them?
Cold misses: unavoidable? The data was never in the cache. Prefetching!
Other misses: buy more SRAM, or use a more flexible cache design.

A bigger cache doesn't always help
Memory access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, ...
Hit rate with four direct-mapped 2-byte cache lines?
With eight 2-byte cache lines?
With four 4-byte cache lines?
(Diagram: memory bytes 0 through 21.)
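Working it through (assuming the trace continues in this pattern): with 2-byte blocks, address 0 is block 0 and address 16 is block 8. With four lines, both map to index 0 (8 mod 4 = 0); with eight lines, they still collide (8 mod 8 = 0); with four 4-byte lines, address 0 is block 0 and address 16 is block 4, colliding at index 0 again. In every configuration each access evicts the block the next access needs, so the hit rate is 0%: this pattern defeats a direct-mapped cache of this shape no matter its size.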

Misses
Classifying cache misses:
Cold (aka compulsory) miss: the line is being referenced for the first time.
Conflict miss: the line was in the cache, but was evicted by another access with the same index.
Capacity miss: the line was evicted because the cache is too small, i.e., the working set of the program is larger than the cache.

Avoiding Misses
Q: How do we avoid them?
Cold misses: unavoidable? The data was never in the cache. Prefetching!
Capacity misses: buy more SRAM.
Conflict misses: use a more flexible cache design.

Three Common Designs
A given data block can be placed:
in any cache line: fully associative
in exactly one cache line: direct mapped
in a small set of cache lines: set associative

A Simple Fully Associative Cache
Using byte addresses in this example! Address bus = 5 bits.
Same processor trace and memory contents as the direct-mapped example above, but now any block may be placed in any line (one V, tag, and data field per line).
Q: How many hits? How many misses?

Fully Associative Cache (Reading)
(Diagram: the address is split into Tag and Offset only; there is no index. Every line's stored tag is compared against the address tag in parallel, one comparator per line; a match raises "hit?" and selects the line, and the offset selects the word: 64-byte blocks, 32-bit words.)

Fully Associative Cache Size
With an m-bit offset and 2^n cache lines:
Q: How big is the cache (data only)?
Q: How much SRAM is needed (data + overhead)?
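The same accounting as the direct-mapped case, except there is no index, so the tag is wider (again assuming 32-bit addresses and one valid bit per line):

    data only:  2^n lines x 2^m bytes/line = 2^(n+m) bytes
    total SRAM: 2^n x (8*2^m + (32 - m) + 1) bits

plus one tag comparator per line instead of a single comparator.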

Fully associative caches reduce conflict misses... assuming a good eviction strategy.
Memory access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, ...
Hit rate with four fully-associative 2-byte cache lines?
(Diagram: memory bytes 0 through 21.)
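Working it through with LRU replacement (assumed; the slide just says "good eviction strategy"): with 2-byte blocks the trace touches blocks 0, 8, 0, 8, 1, 9, 1, 9, 2, 10, ... Only two blocks are live at any moment, so four fully-associative lines never evict a still-needed block: every block misses on its first access and hits on its second, and the hit rate approaches 50%.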

... but a large block size can still reduce the hit rate.
Vector add trace: 0, 100, 200, 1, 101, 201, 2, 202, ...
Hit rate with four fully-associative 2-byte cache lines?
With two fully-associative 4-byte cache lines?
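Again assuming LRU: with four 2-byte lines the trace touches blocks 0, 50, 100, 0, 50, 100, 1, 51, 101, ...; three blocks are live at a time, so each block misses once and then hits, for roughly a 50% hit rate. With two 4-byte lines, the three streams need three lines but only two exist: every access evicts a block that is needed again before it is reused, and the hit rate drops to 0%. More bytes per line did not help; having too few lines hurt.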

Misses
Classifying cache misses, summary:
Cold (aka compulsory): the line is being referenced for the first time.
Capacity: the line was evicted because the cache was too small, i.e., the working set of the program is larger than the cache.
Conflict: the line was evicted because of another access whose index conflicted.

Summary
Caching assumptions:
small working set: the 90/10 rule
we can predict the future: spatial and temporal locality
Benefit: a big and fast memory built from (big and slow) + (small and fast).
Tradeoffs: associativity, line size, hit cost, miss penalty, hit rate.
Fully associative: higher hit cost, higher hit rate.
Larger block size: lower hit cost, higher miss penalty.
Next up: other designs; writing to caches.