A Cache Hierarchy in a Computer System

"Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

Author: Paulo J. D. Domingues

[Figure: evolution of DRAM and processor characteristics]

[Figure: a memory hierarchy example]

Why Does Cache Work?

Contents: why it works; levels of cache; how it works; comparing and testing.

- Exploiting temporal locality: the same data objects are likely to be reused multiple times.
- Exploiting spatial locality: blocks usually contain multiple data objects, so we can expect the cost of copying a block after a miss to be amortized by subsequent references to other objects within that block. (A short C example follows this section.)

Levels of Memory

Registers; cache (L1/L2); main memory; disk storage; web cache.

[Figure: typical bus structure for L1 and L2 caches (Bryant and O'Hallaron)]

Levels of Cache

- Primary (L1): always on the CPU; small (a few KB) and very fast; data and instructions may be separate. The Pentium 4 has 20 KB.
- Secondary (L2): used to be off-chip, now on-chip; much larger than the primary cache. The Pentium 4 has 512 KB on-chip.
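To make the two kinds of locality concrete, here is a minimal C sketch (not from the original slides): the loop below reuses the accumulator sum on every iteration (temporal locality) and visits consecutive array elements that share cache blocks (spatial locality).

    #include <stdio.h>

    /* Stride-1 traversal: consecutive elements share cache blocks
       (spatial locality); the accumulator "sum" is reused on every
       iteration (temporal locality). */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];                 /* sequential, cache-friendly */
        return sum;
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%d\n", sum_array(a, 8)); /* prints 36 */
        return 0;
    }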

Characteristics

Cache size:
- small enough that the overall average cost per bit is close to that of main memory alone;
- large enough that the overall average access time is close to that of the cache alone;
- large caches tend to be slightly slower than small ones, and available chip and board area is a limitation;
- studies indicate that L1: 8-64 KB and L2: 1-512 KB are optimum cache sizes (for now).

Speed: 7 to 20 ns.

[Figure: separate data and instruction caches vs. a unified data and instruction cache]

Typical Cache

A cache is composed of two types of fast memory devices:
- SRAM: holds the actual data, and the address and status tags in a direct-mapped cache.
- TAG RAM: small associative memories that help with the accounting duties; they usually hold at least the address tags for non-direct-mapped caches and provide fully parallel search capability (a sequential search would take too long).

General Organization of a Cache

A cache is an array of sets. Each set contains one or more lines, and each line contains a valid bit, some tag bits, and a block of data. The cache organization induces a partition of the m address bits into t tag bits, s set index bits, and b block offset bits.
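A minimal C sketch of this organization (the block size, associativity, and number of sets below are illustrative assumptions, not values from the slides):

    #include <stdint.h>

    #define BLOCK_BYTES   32        /* 2^b bytes per block (assumed) */
    #define LINES_PER_SET 2         /* lines per set, 2-way (assumed) */
    #define NUM_SETS      1024      /* 2^s sets (assumed) */

    /* One cache line: a valid bit, tag bits, and a block of data. */
    struct cache_line {
        int      valid;
        uint32_t tag;
        uint8_t  block[BLOCK_BYTES];
    };

    /* One set: one or more lines. */
    struct cache_set {
        struct cache_line lines[LINES_PER_SET];
    };

    /* The cache: an array of sets. */
    struct cache {
        struct cache_set sets[NUM_SETS];
    };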

Block Identification: General Addressing Model

Components of an address as they relate to the cache:
- Tag: stored in the cache and compared with the tag bits of the CPU address.
- Index: used to select the set from the cache.
- Offset: the low-order bits of the address give the offset of the byte within a block.

(A bit-manipulation sketch of this split follows this section.)

Pentium 4 Cache

- 8-KB Level 1 data cache.
- Level 1 execution trace cache: stores 12 K micro-ops and removes decoder latency from the main execution loops.
- 512-KB Advanced Transfer Cache (on-die, full-speed Level 2 cache) with 8-way associativity.

How a Cache Works

Cache mapping; replacement of a cache block; cache write policy; cache transfer technologies; improving cache performance.
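The tag, index, and offset can be extracted with shifts and masks. A minimal sketch, assuming a 32-bit address, 32-byte blocks (b = 5) and 1024 sets (s = 10); these parameters are illustrative, not taken from the slides:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5            /* b: 32-byte blocks (assumed) */
    #define INDEX_BITS  10           /* s: 1024 sets (assumed) */

    int main(void)
    {
        uint32_t addr   = 0x12345678u;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }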

Cache Mapping: Direct Mapping

- Each block of main memory gets a unique mapping.
- If a program happens to repeatedly reference words from two different blocks that map into the same cache line, the blocks will be continually swapped in the cache and the hit ratio will be low. (A worked example follows this section.)

[Figure: direct-mapped cache, 64 KB cache with 32-byte cache blocks]

Cache Mapping: Associative Mapping

- Allows each memory block to be loaded into any line of the cache.
- The tag uniquely identifies a block of main memory.
- The cache control logic must simultaneously examine every line's tag for a match.
- Requires fully associative memory: very complex circuitry whose complexity increases exponentially with size, and therefore very expensive.

Cache Mapping: Set-Associative Mapping

- A compromise between direct and associative mappings.
- The cache is divided into v sets, each of which has k lines.
- A given block maps directly to a particular set, but can occupy any line in that set (associative mapping is used within the set).
- The most common set-associative mapping is 2 lines per set, called two-way set-associative. It significantly improves the hit ratio over direct mapping, and the associative hardware is not too expensive.
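A worked example for the 64 KB, 32-byte-block configuration in the direct-mapped figure above (assuming byte addressing): the cache holds 65536 / 32 = 2048 lines, so the block at byte address A maps to line (A / 32) mod 2048. Addresses 0x00000000 and 0x00010000 are exactly 64 KB apart and therefore map to the same line; a loop that alternates between them swaps the two blocks continually, which is precisely the thrashing case described above.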

[Figure: set-associative mapping, 32 KB cache, 2-way set-associative, 16-byte blocks]

Replacement of a Cache Block

- Random: the simplest method.
- Least Recently Used (LRU): becomes expensive as the cache grows. (A sketch follows this section.)
- First In, First Out (FIFO): simpler than LRU, and almost as good.

Cache Write Policy

Write-through cache; buffered write-through; write-back cache.

Write-Back Cache

- The CPU only updates the cache line.
- A modified cache line is only written to memory when it is replaced.
- Requires a dirty bit for each cache line.
- Memory is not always consistent with the cache.
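A minimal C sketch of LRU within a single set (illustrative, not from the slides; the associativity is an assumption, and the set array is assumed zero-initialized): every valid line ages on each access, a hit resets the touched line's age to zero, and a miss evicts an invalid line if one exists, otherwise the oldest line.

    #include <stdint.h>

    #define WAYS 4   /* lines per set (assumed) */

    struct line { int valid; uint32_t tag; unsigned age; };

    /* Access one set with LRU replacement.  Returns 1 on hit, 0 on miss. */
    int access_set(struct line set[WAYS], uint32_t tag)
    {
        int hit = -1, victim = 0;

        for (int i = 0; i < WAYS; i++) {
            if (set[i].valid) set[i].age++;          /* everyone ages */
            if (set[i].valid && set[i].tag == tag) hit = i;
        }
        if (hit >= 0) { set[hit].age = 0; return 1; } /* hit: youngest */

        for (int i = 1; i < WAYS; i++)                /* pick the victim */
            if (!set[i].valid ||
                (set[victim].valid && set[i].age > set[victim].age))
                victim = i;

        set[victim].valid = 1;                        /* fill on miss */
        set[victim].tag = tag;
        set[victim].age = 0;
        return 0;
    }

With WAYS = 4, four distinct tags fill the set; a fifth distinct tag then evicts whichever line has gone longest without being touched.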

Write-Through Cache

- The CPU updates both the cache and memory.
- Memory is always consistent with the cache.
- Usually combined with write buffers, so the CPU does not have to wait for memory to complete the write. (A sketch contrasting write-through and write-back follows this section.)

Cache Transfer Technologies

- Cache bursting: since the transfers are done in sequence, there is no need to specify a different address after the first one.
- Asynchronous cache: transfers are not tied to the system clock; becomes very slow at higher clock rates.
- Synchronous burst cache: on each tick of the system clock, a transfer can be done to or from the cache.
- Pipelined burst (PLB) cache: the data transfers that occur in a "burst" are done partially at the same time; the second transfer begins before the first one is done.

Improving Cache Performance

- Reducing miss penalty: the additional time required because of a miss.
- Reducing miss rate: the fraction of memory references that miss during the execution of a program.
- Reducing miss penalty or miss rate via parallelism.
- Reducing hit time: the time taken to deliver a word from the cache to the CPU.

Reducing Miss Penalty

- Multilevel caches: a small and faster L1; a larger L2.
- Critical word first and early restart: as soon as the first (requested) word of the block arrives, it is delivered to the CPU.
- Giving priority to read misses over writes: the dirty block is written into a buffer.
- Merging write buffer: combines data without having to write immediately to memory.
- Victim caches: a small, fully associative cache between L1 and L2.
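A minimal C sketch contrasting the two write policies (illustrative; the helper names and the immediate draining of the write buffer are simplifications, not from the slides): write-through pushes every store toward memory through a buffer, while write-back only sets the dirty bit and defers the memory write until the line is replaced.

    #include <stdint.h>
    #include <stdio.h>

    struct line { int valid, dirty; uint32_t tag; uint32_t data; };

    /* Stand-ins for the memory system (hypothetical helpers). */
    static void memory_write(uint32_t addr, uint32_t value)
    {
        printf("mem[%08x] <- %u\n", (unsigned)addr, (unsigned)value);
    }

    static void write_buffer_push(uint32_t addr, uint32_t value)
    {
        memory_write(addr, value);  /* drained immediately in this sketch */
    }

    /* Write-through: update cache and memory on every store; the write
       buffer lets the CPU continue without waiting for memory. */
    static void store_write_through(struct line *l, uint32_t addr,
                                    uint32_t value)
    {
        l->data = value;
        write_buffer_push(addr, value);
    }

    /* Write-back: update only the cache line and mark it dirty. */
    static void store_write_back(struct line *l, uint32_t value)
    {
        l->data = value;
        l->dirty = 1;
    }

    /* On replacement, a dirty line must be written back to memory. */
    static void evict(struct line *l, uint32_t addr)
    {
        if (l->valid && l->dirty)
            memory_write(addr, l->data);
        l->valid = l->dirty = 0;
    }

    int main(void)
    {
        struct line a = {1, 0, 0, 0}, b = {1, 0, 0, 0};
        store_write_through(&a, 0x1000, 7);  /* memory updated at once */
        store_write_back(&b, 9);             /* memory updated on evict */
        evict(&b, 0x2000);
        return 0;
    }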

Reducing Miss Rate

- Larger block size: it is likely that the next word needed follows the last one.
- Larger caches: increases cost and hit time.
- Higher associativity: an eight-way set-associative cache is practically as effective as a fully associative one, but the more associative a cache is, the higher the hit time.
- Way prediction and pseudo-associative caches: extra bits are needed to set the multiplexor early to select the desired block.
- Compiler optimizations: merging arrays, loop interchange, loop fusion, blocking. (See the sketch after this section.)

Reducing Miss Penalty or Miss Rate via Parallelism

- Nonblocking caches to reduce stalls on cache misses: the cache can continue to supply hits during a miss.
- Hardware prefetching of instructions and data: on a cache miss, the processor fetches the next block in addition to the requested one.
- Compiler-controlled prefetching: prefetching performed by the compiler.

Reducing Hit Time

- Small and simple caches: small hardware is faster.
- Avoiding address translation during indexing of the cache: using virtual addresses for the cache.
- Pipelined cache access.
- Trace caches: finds a dynamic sequence of instructions, including taken branches, to load into a cache block.

Cache Optimization Summary

Technique                          Miss rate  Miss pen.  Hit time  HW complexity
Larger block size                      +          -                      0
Higher associativity                   +                     -           1
Victim caches                          +                                 2
Pseudo-associative caches              +                                 2
Hardware prefetching                   +                                 2
Compiler-controlled prefetching        +                                 3
Compiler techniques                    +                                 0
Giving read misses priority                       +                      1
Subblock placement                                +                      1
Early restart/critical word first                 +                      2
Nonblocking caches                                +                      3
Second-level caches                               +                      2
Small and simple caches                -                     +           0
Avoiding address translation                                 +           2
Pipelining writes                                            +           1
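As a minimal sketch of one of the compiler optimizations listed above, loop interchange (illustrative code, not from the slides): C stores 2-D arrays row-major, so swapping the loops turns a stride-N column walk into a stride-1 row walk with much better spatial locality.

    #define N 1024

    /* Before: the inner loop walks down a column; consecutive accesses
       are N ints apart, so each touches a different cache block. */
    void init_cols(int a[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0;
    }

    /* After loop interchange: the inner loop walks along a row;
       consecutive accesses are adjacent in memory and share blocks. */
    void init_rows(int a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0;
    }

With N = 1024 the array is far larger than a typical L1 cache, and the row-wise version typically runs several times faster than the column-wise one.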

Main Memory

- Latency: affects primarily the cache, through the miss penalty.
- Bandwidth: has more effect on I/O and multiprocessors.

Improving Memory Bandwidth

- Wider main memory: twice the width means half the memory accesses.
- Simple interleaved memory: several chips in a memory system organized in banks (each bank can be one word wide); with four word-wide banks, for example, word address A lives in bank A mod 4, so consecutive words can be fetched from different banks in parallel.
- Independent memory banks: multiple independent accesses, where multiple memory controllers allow the banks to operate independently.

[Figure: memory bandwidth]

Cache Coherence Problem

- Multiple copies of the same data exist in different caches.
- Data in the caches is modified locally.
- Cache and memory can become incoherent.
- Data may become inconsistent between caches on multiprocessor systems.

Cache Coherence Problem: Solutions

- Declare shared data to be non-cacheable (by software).
- Cache on reads, but do not cache write cycles.
- Use bus snooping to maintain coherence on write-back to main memory.
- Or directory-based: cache lines maintain an extra 2 bits per processor to preserve clean/dirty status bits.

Exercises: Is the Following Code Cache-Friendly?

An array of structs:

    #define N 1000

    typedef struct {
        int vel[3];
        int acc[3];
    } point;

    point p[N];

    void clear1(point *p, int n)
    {
        int i, j;

        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++)
                p[i].vel[j] = 0;
            for (j = 0; j < 3; j++)
                p[i].acc[j] = 0;
        }
    }

    void clear2(point *p, int n)
    {
        int i, j;

        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++) {
                p[i].vel[j] = 0;
                p[i].acc[j] = 0;
            }
        }
    }

    void clear3(point *p, int n)
    {
        int i, j;

        for (j = 0; j < 3; j++) {
            for (i = 0; i < n; i++)
                p[i].vel[j] = 0;
            for (i = 0; i < n; i++)
                p[i].acc[j] = 0;
        }
    }
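A brief answer sketch (not part of the original slides): clear1 and clear2 are cache-friendly, since they sweep the array once in increasing address order and all accesses within each 24-byte point fall in the same one or two cache blocks, so after the first touch of a block the remaining accesses to it hit. clear3 is not: with j as the outer loop it makes repeated passes over the whole array, touching only one word per point on each pass, so if the array does not fit in the cache every pass misses on every block again.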