CPUs: Caches, Memory Management, and CPU Performance


CPUs: caches, memory management, CPU performance.

Clicker question. Cache : Main Memory :: Window : ___
1. Door  2. Bigger Door  3. The Great Outdoors  4. Horizontal Blinds
(Results: 18%, 9%, 64%, 9%. The cache is a small, fast view onto something much larger.)

Caching: The Basic Idea
- Main memory stores words (A through Z in the example).
- The cache stores a subset of the words (4 in the example), organized into lines of multiple words to exploit spatial locality.
- Access: a word must be in the cache for the processor to access it.
[Figure: the processor talks to a small, fast cache holding lines such as (A, B) and (G, H); the cache sits in front of a big, slow memory holding A, B, C, ..., Y, Z.]

Caches and CPUs
- The CPU issues an address; the cache controller checks the cache and falls back to main memory on a miss.
- Each main memory location is mapped onto a cache entry.
- May have caches for: instructions; data; data + instructions (unified).
- Memory access time is no longer deterministic!

Locality of Reference
- Principle of locality: programs tend to reuse data and instructions near those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

Locality in an example (a fuller sketch appears at the end of this page):

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

- Data: array elements are referenced in succession (spatial).
- Instructions: referenced in sequence (spatial); the loop is cycled through repeatedly (temporal).

Cache performance benefits
- Keeps frequently accessed locations in the fast cache.
- The cache retrieves more than one word at a time, a simple prediction of what will be used next, so sequential accesses are faster after the first access.
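To make the two kinds of locality concrete, here is a small, compilable variation on the example above; the matrix name and sizes are illustrative. Row-by-row traversal touches adjacent addresses (spatial locality), while the accumulator and the loop instructions are reused on every iteration (temporal locality); swapping the loop order breaks the spatial pattern and typically runs measurably slower on real hardware.

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static int m[ROWS][COLS];   /* row-major: m[r][0], m[r][1], ... are adjacent */

    int main(void) {
        long sum = 0;

        /* Good: the inner loop walks adjacent addresses (spatial locality);
           sum and the loop code are reused every pass (temporal locality). */
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                sum += m[r][c];

        /* Bad: the same work, but successive accesses are COLS * sizeof(int)
           bytes apart, so each one may land in a different cache line. */
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                sum += m[r][c];

        printf("%ld\n", sum);
        return 0;
    }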

Clicker questions
- Loop : temporal locality :: Array : ___  (1. temporal locality  2. spatial locality)
- Array : spatial locality :: Loop : ___  (1. temporal locality  2. spatial locality)

Terms
- Cache hit: the required location is in the cache.
- Cache miss: the required location is not in the cache.
- Types of misses:
  - Compulsory (cold): the location has never been accessed.
  - Capacity: the working set is too large for the cache.
  - Conflict: multiple locations in the working set map to the same cache entry.
- Working set: the set of locations used by the program in a time interval.

Memory system performance
- h = cache hit rate.
- t_cache = cache access time; t_main = main memory access time.
- Average memory access time:

    t_av = h * t_cache + (1 - h) * t_main

Multiple levels of cache (CPU -> L1 cache -> L2 cache -> main memory)
- h1 = L1 cache hit rate.
- h2 = rate of accesses that miss in L1 but hit in L2.
- Average memory access time:

    t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main

Bug in book algorithm
- The formula above requires h2 to be measured against all accesses, which is awkward. A better way to specify it: h2 = L2 cache hit rate, i.e. the fraction of L1 misses that hit in L2. Then:

    t_av = h1 * t_L1 + (1 - h1) * h2 * t_L2 + (1 - h1) * (1 - h2) * t_main
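The revised formula is easy to sanity-check numerically. A minimal sketch in C (all hit rates and access times below are illustrative, not measurements):

    #include <stdio.h>

    /* Average memory access time for a two-level hierarchy, using the
       revised formula: h2 is the fraction of L1 misses that hit in L2. */
    static double t_av(double h1, double t_l1,
                       double h2, double t_l2, double t_main) {
        return h1 * t_l1
             + (1.0 - h1) * h2 * t_l2
             + (1.0 - h1) * (1.0 - h2) * t_main;
    }

    int main(void) {
        /* Example: 95% L1 hits at 1 cycle; 80% of L1 misses hit a 10-cycle
           L2; the rest go to a 100-cycle main memory. Result: 2.35 cycles. */
        printf("t_av = %.2f cycles\n", t_av(0.95, 1.0, 0.80, 10.0, 100.0));
        return 0;
    }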

Multi-level cache access time (revised formula)
- h1 = L1 cache hit rate; t_L1 = L1 cache access time.
- h2 = L2 cache hit rate (among L1 misses); t_L2 = L2 cache access time.
- t_main = main memory access time.
- Average memory access time:

    t_av = h1 * t_L1 + (1 - h1) * h2 * t_L2 + (1 - h1) * (1 - h2) * t_main

Computer System: Cache Concept
[Figure: caches appear throughout the system, not just next to the processor: an on-chip cache between the CPU and the memory-I/O bus; net and row caches at the memory; disk caches in the I/O controllers for the disks; a web cache behind the network interface, with displays and the network hanging off further I/O controllers.]

Design Issues for Caches
Key questions:
- Where should a line be placed in the cache? (line placement)
- How is a line found in the cache? (line identification)
- Which line should be replaced on a miss? (line replacement)
- What happens on a write? (write strategy)
Constraints:
- The design must be very simple: it is realized in hardware, and all decision making must fit within nanosecond time scales.
- We want to optimize performance for typical programs, so do extensive benchmarking and simulations; there are many subtle engineering tradeoffs.

Cache organizations
- Direct-mapped: each memory location maps onto exactly one cache entry.
- N-way set-associative: each memory location maps onto one set and can be stored in any of the N entries of that set.
- Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).

Direct-mapped cache
[Figure: main memory locations (0x0000, 0x0004, 0x0008, ...) feeding a direct-mapped cache; each cache entry holds a valid bit, a tag (0xabcd in the figure), and a block of data bytes.]
- The address is split into tag, index, and offset fields: the index selects a cache block, the stored tag is compared with the address tag, and a match with the valid bit set is a hit; the offset then selects the byte within the block.
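A minimal sketch of the lookup just described. The geometry (16 lines of 16 bytes) and all names are illustrative assumptions, not a description of any particular processor's cache:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4                   /* 16-byte blocks */
    #define INDEX_BITS  4                   /* 16 cache lines */
    #define BLOCK_SIZE  (1u << OFFSET_BITS)
    #define NLINES      (1u << INDEX_BITS)

    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    static struct line cache[NLINES];

    /* Split the address into tag/index/offset, index the line, and
       compare tags: a match with the valid bit set is a hit. */
    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t offset = addr & (BLOCK_SIZE - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (NLINES - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data[offset];   /* hit */
            return true;
        }
        return false;   /* miss: the controller must fetch the block */
    }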

Direct-mapped cache locations
- Many locations map onto the same cache block, so conflict misses are easy to generate:
  - Array a[] uses locations 0, 1, 2, ...
  - Array b[] uses locations 1024, 1025, 1026, ...
  - The operation a[i] + b[i] generates conflict misses (see the sketch at the end of this page).
- How might we improve the cache? What are the problems? What are the solutions?

Set-associative cache
- Built as a set of direct-mapped caches (ways): a location may live in the matching entry of any way, and all ways are probed in parallel to detect a hit.
[Figure: banks labeled Set 1, Set 2, ..., Set n, with their hit signals combined.]
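The a[]/b[] conflict above can be written down directly. This sketch assumes a direct-mapped cache of 1024 single-word entries and assumes the linker places b[] exactly 1024 words after a[]; both are illustrative assumptions rather than guaranteed behavior:

    #include <stdio.h>

    #define N 1024

    /* Under the stated assumptions, a[i] sits at word address i and
       b[i] at word address 1024 + i, so both map to cache entry i
       and evict each other on every iteration. */
    static int a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];   /* read a[i] (miss), then b[i] (miss,
                                     evicting a's block), and repeat */
        printf("%d\n", c[0]);
        return 0;
    }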

Indexing into a 2-Way Associative Cache
- Use the middle s bits of the physical address to select one of S = 2^s sets; each set holds two lines, each with a valid bit, a tag, and bytes 0 through B-1 of a block.
- Physical address fields: tag (t bits), set index (s bits), offset (b bits).

Example: direct-mapped vs. set-associative
Memory contents (3-bit addresses, one 4-bit value per address):

    address:  000   001   010   011   100   101   110   111
    data:     0101  1111  0000  0110  1000  0001  1010  0100

Direct-mapped cache behavior (4 blocks; the address splits into a 1-bit tag and a 2-bit index) for the access sequence 001, 010, 011, 100, 101, 111:

After the 001 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     -    -
    11     -    -

After the 010 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     -    -

Direct-mapped cache (cont'd.)

After the 011 access:

    block  tag  data
    00     -    -
    01     0    1111
    10     0    0000
    11     0    0110

After the 100 access:

    block  tag  data
    00     1    1000
    01     0    1111
    10     0    0000
    11     0    0110

After the 101 access:

    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     0    0110

After the 111 access:

    block  tag  data
    00     1    1000
    01     1    0001
    10     0    0000
    11     1    0100

2-way set-associative cache behavior
Final state of the cache (twice as big as the direct-mapped one: 4 sets of 2 blocks) after the same access sequence:

    set         00    01    10    11
    blk-0 tag   1     0     0     0
    blk-0 data  1000  1111  0000  0110
    blk-1 tag   -     1     -     1
    blk-1 data  -     0001  -     0100

2-way set-associative cache behavior (same size as the direct-mapped cache: 2 sets of 2 blocks)
Final state of the cache; with only 2 sets the index is 1 bit and the tag is 2 bits, and LRU has evicted the earlier 001 and 011 accesses from set 1:

    set         0     1
    blk-0 tag   01    10
    blk-0 data  0000  0001
    blk-1 tag   10    11
    blk-1 data  1000  0100

Example caches
- StrongARM:
  - 16 Kbyte, 32-way, 32-byte-block instruction cache.
  - 16 Kbyte, 32-way, 32-byte-block data cache (write-back).
- SHARC: 32-instruction, 2-way instruction cache.

Design issues for caches (recap): line placement, line identification, line replacement, and write strategy, all under the constraints of simple, nanosecond-scale hardware tuned by extensive benchmarking and simulation.
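The direct-mapped walkthrough above is small enough to replay in code. A sketch that prints the cache state after each access, using the same geometry (1-bit tag, 2-bit block index, one 4-bit word per block); data values print in hex, so 1111 appears as F:

    #include <stdio.h>

    int main(void) {
        /* Memory contents and access sequence from the example. */
        const int mem[8] = { 0x5, 0xF, 0x0, 0x6, 0x8, 0x1, 0xA, 0x4 };
        const int seq[6] = { 1, 2, 3, 4, 5, 7 };  /* 001 010 011 100 101 111 */
        int valid[4] = { 0 }, tag[4] = { 0 }, data[4] = { 0 };

        for (int i = 0; i < 6; i++) {
            int addr = seq[i];
            int idx  = addr & 0x3;                 /* low 2 bits: block index */
            int t    = addr >> 2;                  /* top bit: tag */
            if (!(valid[idx] && tag[idx] == t)) {  /* miss: fill the block */
                valid[idx] = 1;
                tag[idx]   = t;
                data[idx]  = mem[addr];
            }
            printf("after %d%d%d:", t, (addr >> 1) & 1, addr & 1);
            for (int b = 0; b < 4; b++) {
                if (valid[b])
                    printf("  %d%d: tag %d, data %X",
                           b >> 1, b & 1, tag[b], data[b]);
                else
                    printf("  %d%d: -", b >> 1, b & 1);
            }
            printf("\n");
        }
        return 0;
    }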

Replacement policies
- Replacement policy: the strategy for choosing which cache entry to throw out to make room for a new memory location.
- Two popular strategies: random and least-recently used (LRU).

Write strategies
- Write-through: immediately copy every write to main memory.
- Write-back: write to main memory only when a location is removed from the cache (see the sketch at the end of this page).

Clicker question. On a 2 Kbyte cache, looping over a 1 Kbyte array a[i] and a 2 Kbyte array b[i] creates:
1. Compulsory misses  2. Conflict misses  3. Capacity misses
(The working set, 3 Kbytes, does not fit in the 2 Kbyte cache.)
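A sketch of the write-back bookkeeping described above, for a single cache line; the structure, field, and helper names are illustrative, and the main-memory write is stubbed out. Write-through would simply call mem_write() on every store instead of only at eviction:

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid, dirty;
        uint32_t addr;     /* address of the cached word */
        uint32_t data;
    };

    /* Stub for the slow path to main memory. */
    static void mem_write(uint32_t addr, uint32_t value) {
        (void)addr; (void)value;   /* real hardware writes DRAM here */
    }

    void store(struct line *l, uint32_t addr, uint32_t value) {
        if (l->valid && l->addr == addr) {   /* write hit: update cache only */
            l->data  = value;
            l->dirty = true;
            return;
        }
        if (l->valid && l->dirty)            /* eviction: flush dirty victim */
            mem_write(l->addr, l->data);
        l->valid = true;                     /* allocate the new word */
        l->dirty = true;
        l->addr  = addr;
        l->data  = value;
    }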

Memory management units
- The memory management unit (MMU) translates addresses:
  CPU -- logical address --> MMU -- physical address --> main memory

Memory management tasks
- Allows programs to move in physical memory during execution.
- Allows virtual memory: memory images are kept in secondary storage and returned to main memory on demand during execution.
- Page fault: a request for a location not resident in main memory.

Address translation
- Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
- Two basic schemes: segmented and paged.
- Segmentation and paging can be combined (as on the x86).

Segments and pages
[Figure: segments are large, variable-size regions (segment 1 and segment 2 in memory); pages are small, fixed-size regions (page 1, page 2).]

Segment address translation
- Add the segment base address to the logical address to form the physical address.
- Check the result against the segment's lower and upper bounds; an out-of-range access raises a range error.

Page address translation
- Split the logical address into a page number and an offset.
- Look up the base address of page i in the page table and concatenate it with the offset to form the physical address.
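A sketch of the paged translation just described, with illustrative sizes (4 Kbyte pages and a flat table mapping each virtual page number to a physical page base); none of these names come from a real MMU:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NPAGES     (1u << 20)

    static uint32_t page_table[NPAGES];   /* physical page base per VPN */

    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* offset within page */
        return page_table[vpn] | offset;             /* concatenate */
    }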

Page table organizations
- Flat: a single table of page descriptors, indexed by page number.
- Tree: the page descriptors are reached through one or more levels of intermediate tables.

Caching address translations
- Large translation tables require main memory accesses on every translation.
- TLB: a cache for address translations; typically small (see the sketch at the end of this page).

ARM memory management
- Memory region types:
  - Section: 1 Mbyte block.
  - Large page: 64 Kbytes.
  - Small page: 4 Kbytes.
- An address is marked as section-mapped or page-mapped.
- ARM uses a two-level translation scheme.
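A sketch of a TLB sitting in front of the page-table walk; the TLB size, the names, and the fully-associative search are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 8
    #define PAGE_SHIFT  12

    struct tlb_entry { bool valid; uint32_t vpn, ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-in for the slow table walk (identity mapping here). */
    static uint32_t walk_page_table(uint32_t vpn) { return vpn; }

    uint32_t tlb_translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

        for (int i = 0; i < TLB_ENTRIES; i++)   /* hit: no memory access */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << PAGE_SHIFT) | offset;

        uint32_t ppn = walk_page_table(vpn);    /* miss: walk, then cache */
        tlb[vpn % TLB_ENTRIES] = (struct tlb_entry){ true, vpn, ppn };
        return (ppn << PAGE_SHIFT) | offset;
    }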

ARM address translation
- The translation table base register points to the first-level table.
- The first index, taken from the top bits of the virtual address, selects a first-level descriptor, which points to a second-level table.
- The second index selects a second-level descriptor; its page base address is concatenated with the offset to form the physical address. (A code sketch appears at the end of this page.)

Elements of CPU performance
- Cycle time.
- CPU pipeline.
- Memory system.

Pipelining
- Several instructions are executed simultaneously at different stages of completion.
- Various conditions can cause pipeline bubbles that reduce utilization: branches, memory system delays, and so on.
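Returning to the ARM translation above: a sketch of the two-level small-page walk in the classic ARMv4/v5 style. The index widths match that scheme, but descriptor type and permission bits are ignored and memory is simulated, so treat this as an illustration rather than a faithful MMU model:

    #include <stdint.h>

    static uint32_t MEM[1u << 20];   /* toy physical memory (4 MB) */
    static uint32_t ttb;             /* translation table base register */

    static uint32_t mem_read32(uint32_t paddr) { return MEM[paddr >> 2]; }

    uint32_t arm_translate(uint32_t vaddr) {
        uint32_t l1_index = vaddr >> 20;              /* bits 31:20 */
        uint32_t l2_index = (vaddr >> 12) & 0xFF;     /* bits 19:12 */
        uint32_t offset   = vaddr & 0xFFF;            /* bits 11:0  */

        uint32_t l1_desc = mem_read32(ttb + 4 * l1_index);
        uint32_t l2_base = l1_desc & ~0x3FFu;         /* 2nd-level table base */
        uint32_t l2_desc = mem_read32(l2_base + 4 * l2_index);
        uint32_t page    = l2_desc & ~0xFFFu;         /* 4 KB page base */

        return page | offset;                         /* concatenate */
    }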

Pipeline structures
- Both the ARM and the SHARC have 3-stage pipelines:
  - Fetch the instruction from memory.
  - Decode the opcode and operands.
  - Execute.

ARM pipeline execution:

    time:            1       2       3        4        5
    add r0,r1,#5     fetch   decode  execute
    sub r2,r3,r6             fetch   decode   execute
    cmp r2,#3                        fetch    decode   execute

Performance measures
- Latency: the time it takes for one instruction to get through the pipeline.
- Throughput: the number of instructions executed per time period.
- Pipelining increases throughput without reducing latency.
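Latency and throughput can be seen in a little arithmetic: with k stages and no stalls, n instructions take k + (n - 1) cycles. A quick illustration (the cycle model is idealized):

    #include <stdio.h>

    int main(void) {
        const int k = 3;   /* 3-stage pipeline, as on the ARM and SHARC */
        for (int n = 1; n <= 1000; n *= 10) {
            int cycles = k + (n - 1);   /* fill the pipe, then 1 per cycle */
            printf("n = %4d: %4d cycles, throughput %.3f instr/cycle\n",
                   n, cycles, (double)n / cycles);
        }
        return 0;      /* latency stays 3 cycles; throughput approaches 1 */
    }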

Pipeline stalls
- If every step cannot be completed in the same amount of time, the pipeline stalls.
- Bubbles introduced by a stall increase latency and reduce throughput.

ARM multi-cycle LDMIA instruction (the load-multiple occupies the execute stage for two cycles, stalling the instructions behind it):

    time:              1       2       3          4          5        6
    ldmia r0,{r2,r3}   fetch   decode  ex ld r2   ex ld r3
    sub r2,r3,r6               fetch   decode     (stall)    ex sub
    cmp r2,#3                          fetch      (stall)    decode   ex cmp

Control stalls
- Branches often introduce stalls (the branch penalty); the stall time may depend on whether the branch is taken.
- Instructions that have already started executing may have to be squashed.
- The pipeline does not know what to fetch next until the branch condition is evaluated.

ARM pipelined branch (a taken branch spends three cycles in the execute stage, and the instruction fetched behind it is squashed):

    time:              1       2       3        4            5        6
    bne foo            fetch   decode  ex bne   ex bne       ex bne
    sub r2,r3,r6               fetch   decode   (squashed)
    foo add r0,r1,r2                            fetch        decode   ex add

Delayed branch
- To increase pipeline efficiency, a delayed branch mechanism requires that the n instructions after a branch are always executed, whether or not the branch is taken.
- The SHARC supports both delayed and non-delayed branches, selected by a bit in the branch instruction; it has a 2-instruction branch delay slot.

Example: ARM execution time
- Determine the execution time of an FIR filter:

    for (i = 0; i < n; i++)
        f = f + c[i] * x[i];

- The only branch is the loop test, and it is the one instruction that may take more than one cycle: BLT loop takes 1 cycle in the best case and 3 in the worst case.
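The best/worst-case branch timing turns into a simple bound on loop time. In this sketch only the branch costs (1 or 3 cycles) come from the slide; the per-iteration body cost and the iteration count are illustrative placeholders:

    #include <stdio.h>

    int main(void) {
        const int n      = 100;   /* loop iterations (illustrative) */
        const int t_body = 5;     /* loads, multiply, add, index update
                                     (placeholder cycle count) */
        const int t_bt_best = 1, t_bt_worst = 3;   /* BLT loop, per slide */

        printf("best:  %d cycles\n", n * (t_body + t_bt_best));
        printf("worst: %d cycles\n", n * (t_body + t_bt_worst));
        return 0;
    }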

Superscalar execution
- A superscalar processor can execute several instructions per cycle, using multiple pipelined datapaths.
- Programs execute faster, but it is harder to determine how much faster.

Data dependencies
- Execution time depends on the operands, not just the opcode.
- A superscalar CPU checks dependencies dynamically:

    add r2,r0,r1    ; writes r2
    add r3,r2,r5    ; reads r2, so it must wait for the first add

Memory system performance
- Caches introduce indeterminacy into execution time, which now depends on the order of execution.
- Cache miss penalty: the added time due to a cache miss.
- A miss can happen for several reasons: compulsory, conflict, capacity.