Fundamentals of Computer Systems

Fundamentals of Computer Systems
Caches
Martha A. Kim
Columbia University
Fall 2015
Illustrations Copyright 2007 Elsevier
1 / 23

Computer Systems
Performance depends on whichever is slower: the processor or the memory system.
[Figure: a processor and a memory connected by CLK, MemWrite, WE, Address, WriteData, and ReadData signals.]
2 / 23

Memory Speeds Haven't Kept Up
[Figure: processor vs. memory performance, 1980-2005, log scale; CPU performance has grown far faster than memory performance.]
Our single-cycle memory assumption has been wrong since 1980.
Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.
3 / 23

Your Choice of Memories
[Figure: a triangle with corners Fast, Cheap, and Large; on-chip SRAM, commodity DRAM, and supercomputer memory each achieve only some of the three.]
4 / 23

Memory Hierarchy
An essential trick that makes big memory appear fast.

Technology | Cost ($/Gb) | Access Time (ns)    | Density (Gb/cm2)
SRAM       | 30          | 0.5                 | 0.25
DRAM       | 1           | 10-100              | 16
Flash      | 2           | 30,000-80,000       | 32
Hard Disk  | 0.1         | 1,000,000-5,000,000 | 2

(Flash and disk access times are read speeds; writing is much, much slower.)
5 / 23

A Modern Memory Hierarchy
A desktop machine: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm.

Level          | Size              | Technology
L1 Instruction | 64 KB (per core)  | SRAM
L1 Data        | 64 KB (per core)  | SRAM
L2             | 512 KB (per core) | SRAM
L3             | 2 MB              | SRAM
Memory         | 4 GB              | DRAM
Disk           | 500 GB            | Magnetic
6 / 23

A Simple Memory Hierarchy
First level: small, fast storage (typically SRAM).
Last level: large, slow storage (typically DRAM).
[Figure: the string "Hip hip hooray!\0" held in the large last level, with the fragment "hip\0" also held in the small first level.]
The upper level can fit a subset of the lower level, but which subset?
7 / 23

Locality Example: Substring Matching
Searching for "hip" in "Hip hip hooray!\0".
[Figure: the addresses accessed over time, aka the reference stream, as the needle "hip" is compared against successive positions of the haystack.]
Temporal locality: if X was needed recently, X is likely to be needed again soon.
Spatial locality: if X is needed, something near X is likely to be needed too.
8 / 23

Cache
The highest level of the memory hierarchy.
Fast: level 1 typically has a 1-cycle access time.
With luck, supplies most data.
Cache design questions:
What data does it hold? Recently accessed data.
How is data found? By a simple hash of its address.
What data is replaced? Often the oldest.
9 / 23

What is Held in the Cache?
An ideal cache always correctly guesses what you want before you want it. A real cache is never that smart.
Caches exploit:
Temporal locality: copy newly accessed data into the cache, replacing the oldest data if necessary.
Spatial locality: copy nearby data into the cache at the same time. Specifically, always read and write a block, also called a line, at a time (e.g., 64 bytes), never a single byte.
10 / 23

Memory Performance
Hit: the data is found in that level of the memory hierarchy.
Miss: not found; look in the next level.
Hit Rate = Number of hits / Number of accesses
Miss Rate = Number of misses / Number of accesses
Hit Rate + Miss Rate = 1
The expected access time E_L for a memory level L with latency t_L and miss rate M_L:
E_L = t_L + M_L * E_{L+1}
11 / 23

Memory Performance Example
Two-level hierarchy: cache and main memory.
A program executes 1,000 loads & stores; 750 of these are found in the cache.
What are the cache hit and miss rates?
Hit Rate = 750/1,000 = 75%
Miss Rate = 1 - 0.75 = 25%
If the cache takes 1 cycle and main memory 100 cycles, what's the expected access time?
Expected access time of main memory: E_1 = 100 cycles
Access time of the cache: t_0 = 1 cycle
Cache miss rate: M_0 = 0.25
E_0 = t_0 + M_0 * E_1 = 1 + 0.25 * 100 = 26 cycles
12 / 23
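The recurrence E_L = t_L + M_L * E_{L+1} and this example are easy to check in code. A minimal Python sketch (not from the slides; the function name and cycle counts are illustrative):

```python
def expected_access_time(latencies, miss_rates):
    """Expected access time per level: E_L = t_L + M_L * E_{L+1}.
    latencies: access time of each level, fastest first.
    miss_rates: miss rate of each level except the last (which always supplies data).
    """
    e = latencies[-1]  # last level always hits
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        e = t + m * e  # fold in each faster level
    return e

hits, accesses = 750, 1000
miss_rate = 1 - hits / accesses                    # 0.25
amat = expected_access_time([1, 100], [miss_rate])
print(miss_rate, amat)                             # 0.25 26.0
```

The same function extends to three or more levels by passing longer lists.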

A Direct-Mapped Cache
[Figure: a 2^30-word main memory, from mem[0x00000000] at address 00...00 up to mem[0xFFFFFFFC] at address 11...11111100, mapped onto a 2^3-word cache with sets 0 (000) through 7 (111).]
This simple cache has:
8 sets
1 block per set
4 bytes per block
To simplify answering "is this data in the cache?", each byte is mapped to exactly one set.
13 / 23

Direct-Mapped Cache Hardware
The 32-bit memory address divides into a 27-bit tag (bits 31:5), a 3-bit set number (bits 4:2), and a 2-bit byte offset (bits 1:0, the byte within the block).
The cache is an 8-entry x (1+27+32)-bit SRAM: each entry (sets 0-7) holds a valid bit V, a 27-bit tag, and a 32-bit data block.
Cache hit if, in the set selected by the address:
the block is valid (V = 1), and
the stored tag matches address bits 31:5.
14 / 23
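The tag/set/offset split can be sketched in a few lines of Python (an illustration, not part of the slides; the field widths match the 8-set, 4-byte-block cache above):

```python
def decode_address(addr, offset_bits=2, set_bits=3):
    """Split a 32-bit byte address into (tag, set, byte offset):
    2-bit byte offset (4-byte blocks), 3-bit set index (8 sets), 27-bit tag."""
    byte_offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    return tag, set_index, byte_offset

# 0x4 and 0x24 differ only in the tag, so they collide in set 1:
print(decode_address(0x04))   # (0, 1, 0)
print(decode_address(0x24))   # (1, 1, 0)
```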

Direct-Mapped Cache Behavior
A dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure: the cache when reading 0x4 the last time: sets 1, 2, and 3 are valid with tag 00...0, holding mem[0x...4], mem[0x...8], and mem[0x...C]; the other sets are empty.]
Assuming the cache starts empty, what's the miss rate?
Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
          M M M H H H H H H H H H H H H
Miss rate = 3/15 = 0.2 = 20%
15 / 23
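A tiny simulator (an illustrative sketch, not from the slides) reproduces this count; it models only which block each set currently holds:

```python
def direct_mapped_misses(addresses, num_sets=8, block_bytes=4):
    """Count misses in the 8-set, one-word-per-block direct-mapped cache
    for a stream of byte addresses."""
    cache = [None] * num_sets          # per set: the tag it holds, or None
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        s, tag = block % num_sets, block // num_sets
        if cache[s] != tag:
            misses += 1                # miss: fill the set with this block
            cache[s] = tag
    return misses

stream = [0x4, 0xC, 0x8] * 5           # the loop body, repeated 5 times
print(direct_mapped_misses(stream), len(stream))   # 3 15
```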

Direct-Mapped Cache: Conflict
A dumber loop: repeat 5 times: load from 0x4; load from 0x24.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0x24($0)
        addiu $t0, $t0, -1
        j     l1
done:

Both addresses map to set 1; they differ only in the tag, so each load evicts the other's block.
Assuming the cache starts empty, what's the miss rate?
Accesses: 4 24 4 24 4 24 4 24 4 24
          M M  M M  M M  M M  M M
Miss rate = 10/10 = 1.0 = 100%. Oops.
These are conflict misses.
16 / 23

No Way! Yes Way! A 2-Way Set Associative Cache
The memory address now divides into a 28-bit tag, a 2-bit set number, and a 2-bit byte offset.
Each of the 4 sets holds two ways (way 1 and way 0); each way stores a valid bit V, a 28-bit tag, and a 32-bit data block.
Both ways' tags are compared against the address tag in parallel; Hit1 or Hit0 selects the matching way's 32-bit data, and the cache hits if either way matches.
17 / 23

2-Way Set Associative Behavior
The same dumber loop: repeat 5 times: load from 0x4; load from 0x24.
Assuming the cache starts empty, what's the miss rate?
Accesses: 4 24 4 24 4 24 4 24 4 24
          M M  H H  H H  H H  H H
Miss rate = 2/10 = 0.2 = 20%
Associativity reduces conflict misses: set 1 now holds both mem[0x...4] and mem[0x...24], one in each way.
18 / 23
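Both loops' miss rates can be reproduced with a small set-associative simulator (an illustrative sketch, not from the slides; it assumes LRU replacement, consistent with "often the oldest" above):

```python
from collections import OrderedDict

def misses(addresses, num_sets, ways, block_bytes=4):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]  # tag -> None, in LRU order
    n = 0
    for addr in addresses:
        block = addr // block_bytes
        s, tag = sets[block % num_sets], block // num_sets
        if tag in s:
            s.move_to_end(tag)        # hit: mark most recently used
        else:
            n += 1                    # miss: fill, evicting LRU way if full
            if len(s) == ways:
                s.popitem(last=False)
            s[tag] = None
    return n

stream = [0x4, 0x24] * 5
print(misses(stream, num_sets=8, ways=1))   # 10: direct-mapped thrashes
print(misses(stream, num_sets=4, ways=2))   # 2: both blocks fit in set 1
```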

An Eight-Way Fully Associative Cache
[Figure: a single set of eight ways, way 7 down to way 0, each with a valid bit V, a tag, and data.]
No conflict misses: only compulsory or capacity misses.
Either very expensive or slow because of all the associativity.
19 / 23

Exploiting Spatial Locality: Larger Blocks
The address now divides into a 27-bit tag, a 1-bit set number, a 2-bit block offset (which word within the block), and a 2-bit byte offset.
Example: address 0x8000009C = tag 100...100, set 1, block offset 11 (word 3), byte offset 00.
This cache has 2 sets, 1 block per set (direct mapped), and 4 words per block; the block offset selects one of the four 32-bit words.
20 / 23

Direct-Mapped Cache Behavior w/ 4-Word Blocks
The dumb loop again: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure: the cache when reading 0xC: set 0 is valid with tag 00...0 and holds mem[0x...C], mem[0x...8], mem[0x...4], and mem[0x...0]; set 1 is empty.]
Assuming the cache starts empty, what's the miss rate?
Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
          M H H H H H H H H H H H H H H
Miss rate = 1/15 = 0.067 = 6.7%
Larger blocks reduce compulsory misses by exploiting spatial locality.
21 / 23
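The same sketch-style simulation, extended with multiword blocks (illustrative, not from the slides), reproduces the 1/15 figure: a miss on any word brings in the whole 16-byte block, so its neighbors then hit.

```python
def block_cache_misses(addresses, num_sets=2, block_bytes=16):
    """Direct-mapped cache with 4-word (16-byte) blocks: count misses
    for a stream of byte addresses."""
    cache = [None] * num_sets          # per set: the tag it holds, or None
    n = 0
    for addr in addresses:
        block = addr // block_bytes    # 0x4, 0x8, 0xC all land in block 0
        s, tag = block % num_sets, block // num_sets
        if cache[s] != tag:
            n += 1                     # miss: fill the whole block
            cache[s] = tag
    return n

stream = [0x4, 0xC, 0x8] * 5
print(block_cache_misses(stream), len(stream))   # 1 15
```

Note that larger blocks do not help the conflicting 0x4/0x24 stream: those addresses still map to the same set with different tags.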

Stephen's Desktop Machine Revisited
On-chip caches of the AMD Phenom 9600 (quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm):

Cache | Size   | Sets | Ways   | Block
L1I   | 64 KB  | 512  | 2-way  | 64-byte (per core)
L1D   | 64 KB  | 512  | 2-way  | 64-byte (per core)
L2    | 512 KB | 512  | 16-way | 64-byte (per core)
L3    | 2 MB   | 1024 | 32-way | 64-byte
22 / 23

Intel On-Chip Caches

Chip        | Year | Freq. (MHz) | L1 Instr            | L1 Data | L2
80386       | 1985 | 16-25       | none                | none    | off-chip
80486       | 1989 | 25-100      | 8K unified          |         | off-chip
Pentium     | 1993 | 60-300      | 8K                  | 8K      | off-chip
Pentium Pro | 1995 | 150-200     | 8K                  | 8K      | 256K-1M (MCM)
Pentium II  | 1997 | 233-450     | 16K                 | 16K     | 256K-512K (Cartridge)
Pentium III | 1999 | 450-1400    | 16K                 | 16K     | 256K-512K
Pentium 4   | 2001 | 1400-3730   | 12k-uop trace cache | 16K     | 256K-2M
Pentium M   | 2003 | 900-2130    | 32K                 | 32K     | 1M-2M
Core 2 Duo  | 2005 | 1500-3000   | 32K                 | 32K     | 2M-6M
23 / 23