EEC 483 Computer Organization. Chapter 5.3 Measuring and Improving Cache Performance. Chansu Yu


Cache Performance

Performance equation:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = read-stall cycles + write-stall cycles
  read-stall cycles = reads/program x read miss rate x read miss penalty
  write-stall cycles = writes/program x write miss rate x write miss penalty + write buffer stalls

Two ways of improving performance:
- decreasing the miss ratio
- decreasing the miss penalty

Decreasing the miss ratio by more flexible placement of blocks is the main theme of this section.
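The stall-cycle formula translates directly into arithmetic. A minimal C sketch of the equation above; every input value here is a made-up illustration, not a measurement from the slides:

#include <stdio.h>

/* Sketch of the slide's performance equation.
 * All parameter values are hypothetical. */
int main(void)
{
    double exec_cycles   = 1.0e9;   /* base execution cycles */
    double reads         = 2.0e8;   /* reads per program */
    double writes        = 1.0e8;   /* writes per program */
    double read_mr       = 0.02;    /* read miss rate */
    double write_mr      = 0.02;    /* write miss rate */
    double miss_penalty  = 50.0;    /* cycles per miss */
    double wb_stalls     = 0.0;     /* write-buffer stalls (assumed none) */
    double cycle_time_ns = 1.0;

    double read_stalls  = reads  * read_mr  * miss_penalty;
    double write_stalls = writes * write_mr * miss_penalty + wb_stalls;
    double exec_time_ns = (exec_cycles + read_stalls + write_stalls) * cycle_time_ns;

    printf("execution time: %.3g ns\n", exec_time_ns);
    return 0;
}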

Example #1

Memory and cache system:
- Cache hit time = 1 cycle, miss penalty = 50 cycles
- Miss rate of the 32KB cache = 1.99%

What is the average memory access time?

Example #2

Harvard architecture refers to one in which data and program are stored in separate storage systems, e.g., a data cache and an instruction cache.

Memory and cache system:
- Cache hit time = 1 cycle, miss penalty = 50 cycles
- 75% of memory accesses are instruction accesses, 25% are data accesses
- Miss rate of the 16KB data cache = 6.47%, miss rate of the 16KB instruction cache = 0.64%

What is the average memory access time?
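The slides leave both questions open. Assuming the standard AMAT formula and the numbers as reconstructed above, the arithmetic works out as follows (a worked sketch, not part of the original transcription):

\[
\begin{aligned}
\text{AMAT} &= \text{hit time} + \text{miss rate} \times \text{miss penalty} \\
\text{Example \#1:}\quad \text{AMAT} &= 1 + 0.0199 \times 50 \approx 2.0 \text{ cycles} \\
\text{Example \#2:}\quad \text{AMAT} &= 1 + (0.75 \times 0.0064 + 0.25 \times 0.0647) \times 50 \approx 2.05 \text{ cycles}
\end{aligned}
\]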

Direct-Mapped Caches

Advantage: simple, fast, and inexpensive.
Disadvantage: vulnerable to thrashing. Two heavily used memory blocks map to the same cache block index, and each is repeatedly evicted to make room for the other.

Thrashing Example

float dot_prod(float x[SIZE], float y[SIZE])
{
    float sum = 0.0;
    int i;
    for (i = 0; i < SIZE; i++)
        sum += x[i] * y[i];
    return sum;
}

If x[i] and y[i] map to the same cache blocks, the loop thrashes. What is the hit rate in this case?
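The slide leaves the hit rate as a question. If x[i] and y[i] always land in the same direct-mapped block, each access evicts the block the other array just loaded, so every access misses and the hit rate is 0%. One common remedy (my illustration, not from the slides) is to pad between the arrays so x[i] and y[i] no longer share an index:

#include <stddef.h>

#define SIZE 1024

/* A sketch, not from the slides: insert padding between the arrays
 * so that x[i] and y[i] no longer map to the same direct-mapped index.
 * 16 floats = 64 bytes = one block of slack, assuming 64-byte blocks. */
static struct {
    float x[SIZE];
    float pad[16];   /* never referenced; only shifts y's cache placement */
    float y[SIZE];
} a;

float dot_prod_padded(void)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < SIZE; i++)
        sum += a.x[i] * a.y[i];
    return sum;
}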

Cache Performance: Tradeoffs

(1) Increasing block size
  + decreases miss rate, until blocks get too large (spatial locality)
  - increases miss penalty

(2) Increasing cache size
  + decreases miss rate
  - increases hit time
  - increases hardware cost

(3) Increasing associativity (Section 5.3)
  + increases hit rate
  - increases hit time
  - increases hardware cost

To make exact tradeoffs, you need specific numbers, from calculation and measurement. See the book for formulae.

Other Block Placement Policies

Direct-mapped: a memory block can be cached in only one position in the cache.
Set-associative (n-way): a memory block can be cached in n positions in the cache.
Fully-associative: a memory block can be cached in any position in the cache.

Do they decrease the miss ratio?

Direct-mapped Block Placement

[Figure: a 16-block memory (16-byte blocks, 8-bit addresses 00-FF) and an 8-block direct-mapped cache with valid, tag, and data fields. The 8-bit memory address splits into a 1-bit tag, a 3-bit block index, and a 4-bit offset within the block.]

(N-Way) Set-associative Block Placement

[Figure: the same 16-block memory and an 8-block, 2-way set-associative cache organized as 4 sets of 2 blocks each. The 8-bit memory address splits into a tag (??? bits), a set number (??? bits), and a 4-bit offset within the block.]
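With the figures' toy geometry (16 memory blocks, 8 cache blocks), the placement rules reduce to modular arithmetic. A minimal sketch, assuming that geometry:

/* Toy geometry from the figures: 16 memory blocks, 8 cache blocks. */
#define CACHE_BLOCKS 8
#define SETS_2WAY    4   /* 8 blocks / 2 ways */

/* Direct-mapped: a memory block has exactly one legal slot. */
int dm_index(int block) { return block % CACHE_BLOCKS; }

/* 2-way set-associative: a memory block may go in either way of one set. */
int sa_set(int block)   { return block % SETS_2WAY; }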

Fully-associative Block Placement

[Figure: the same 16-block memory and an 8-block fully-associative cache. The 8-bit memory address splits into a tag (??? bits), a block number (??? bits), and a 4-bit offset within the block.]

Block Identification (cont'd)

N-way set-associative cache:
- There are N cache blocks to compare; the N tag comparisons are done in parallel.
- Block # to set #: choose the low-order bits of the block # as the set #, giving tag + set # + offset. Consecutive blocks then map to different sets, which means fewer conflicts in the cache, especially in the presence of spatial locality.
- Same cache size with higher associativity: #blocks/set??? index size, tag size???
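As a concrete illustration of the tag + set # + offset split, a C sketch with made-up field widths (not the slide's figure):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry for illustration: 64-byte blocks (6 offset
 * bits), 4 sets (2 set bits), N-way; the tag is whatever remains. */
#define OFFSET_BITS 6
#define SET_BITS    2

int main(void)
{
    uint32_t addr   = 0x1234ABCDu;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    printf("tag=0x%x set=%u offset=%u\n", tag, set, offset);
    return 0;
}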

2-Way Set-associative Cache Lookup

[Figure: the CPU address splits into tag, set #, and offset; the set # indexes both ways of the cache in parallel, the two stored tags are compared against the address tag, the hit signals are ORed, and a MUX selects the data from the matching way.]

An Example

[Figure: a set-associative cache with a 32-bit address (bits 31-0) split into a 22-bit tag and an 8-bit index, 256 sets (index 0-255) each holding four {V, Tag, Data} entries, 32-bit data words, hit logic, and a 4-to-1 multiplexor selecting the data output.]

x-way set-associative cache? Number of cache blocks? Block size? Tag field size? Cache size?
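Reading the surviving numbers off the figure (22-bit tag, 8-bit index, 32-bit data, 4-to-1 multiplexor), the answers appear to be: 4-way set-associative; 256 sets x 4 ways = 1024 blocks; block size = one 32-bit word, with the remaining 2 address bits as byte offset; 22-bit tag field; 1024 x 4 bytes = 4KB of data. A minimal C sketch of the parallel-compare-and-select lookup, with that geometry hardcoded for illustration:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 256   /* 8-bit index */

/* One {valid, tag, data} entry per way, as in the figure. */
struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[SETS][WAYS];

/* Look up a 32-bit address: byte offset = bits 1..0,
 * index = bits 9..2, tag = bits 31..10 (22 bits). */
bool lookup(uint32_t addr, uint32_t *data_out)
{
    uint32_t index = (addr >> 2) & (SETS - 1);
    uint32_t tag   = addr >> 10;
    int w;

    for (w = 0; w < WAYS; w++) {              /* hardware compares in parallel */
        struct line *l = &cache[index][w];
        if (l->valid && l->tag == tag) {      /* tag compare per way */
            *data_out = l->data;              /* 4-to-1 mux selects this way */
            return true;                      /* hit */
        }
    }
    return false;                             /* miss */
}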

Cache Performance: Tradeoffs

[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1KB to 128KB. What is it?]

Replacement Policies

In a direct-mapped cache, no replacement policy is necessary. In a set-associative cache, an important question is: which block in the set should be replaced (see page 54)?

Least recently used (LRU) is the most commonly used scheme. How do we keep track of block usage? A single bit per set suffices for a two-way set-associative cache; see Section 7.5 for the higher-associativity case.
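A sketch of the single-bit LRU bookkeeping for a two-way set (my illustration of the idea, not code from the slides):

#include <stdbool.h>

/* Per-set state for a 2-way set-associative cache: one bit
 * records which way (0 or 1) was used least recently. */
struct set2 { bool lru; };

/* On a hit in `way`, the other way becomes the LRU one. */
void touch(struct set2 *s, int way) { s->lru = (way == 0); }

/* On a miss, evict the LRU way; the refilled way becomes
 * most recently used, so the other way is now LRU. */
int victim(struct set2 *s)
{
    int v = s->lru;
    touch(s, v);
    return v;
}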

Intel Nehalem 4-Core Processor (Multilevel On-Chip Caches)

Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache.

3-Level Cache Organization

                   Intel Nehalem                            AMD Opteron X4
L1 caches          L1 I-cache: 32KB, 64-byte blocks,        L1 I-cache: 32KB, 64-byte blocks,
(per core)         4-way, approx. LRU replacement,          2-way, LRU replacement,
                   hit time n/a                             hit time 3 cycles
                   L1 D-cache: 32KB, 64-byte blocks,        L1 D-cache: 32KB, 64-byte blocks,
                   8-way, approx. LRU replacement,          2-way, LRU replacement,
                   write-back/allocate, hit time n/a        write-back/allocate, hit time 9 cycles
L2 unified cache   256KB, 64-byte blocks, 8-way,            512KB, 64-byte blocks, 16-way,
(per core)         approx. LRU replacement,                 approx. LRU replacement,
                   write-back/allocate, hit time n/a        write-back/allocate, hit time n/a
L3 unified cache   8MB, 64-byte blocks, 16-way,             2MB, 64-byte blocks, 32-way,
(shared)           replacement n/a,                         replace block shared by fewest cores,
                   write-back/allocate, hit time n/a        write-back/allocate, hit time 32 cycles

Miss Penalty Reduction

- Return the requested word first, then back-fill the rest of the block.
- Non-blocking miss processing:
  - Hit under miss: allow hits to proceed during an outstanding miss.
  - Miss under miss: allow multiple outstanding misses.
- Hardware prefetch of instructions and data.
- Opteron X4: bank-interleaved L1 D-cache, allowing two concurrent accesses per cycle.