Outline
1 Reiteration
2 Cache performance optimization
3 Bandwidth increase
4 Reduce hit time
5 Reduce miss penalty
6 Reduce miss rate


Lecture 7: EITF20 Computer Architecture
Anders Ardö, EIT Electrical and Information Technology, Lund University
November 21, 2012

Memory hierarchy: who cares?
1980: no cache in microprocessors
1995: 2-level caches in a processor package
2000: 2-level caches on a processor die
2003: 3-level caches on a processor die
AIM: as fast as cache, as large as disk, as cheap as possible

IBM z196 5.2 GHz microprocessor chip

Why does caching work? The principle of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
Spatial locality (locality in space): if an item is referenced, items whose addresses are close tend to be referenced soon.

Cache measures
hit rate = number of accesses that hit / number of accesses; it is close to 1, so it is often more convenient to work with miss rate = 1.0 - hit rate
hit time: cache access time plus the time to determine hit/miss
miss penalty: time to replace a block, measured in ns or in clock cycles; it depends on latency (time to get the first word) and bandwidth (time to transfer the block); out-of-order execution can hide some of the miss penalty
Average memory access time = hit time + miss rate × miss penalty

4 questions for the memory hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
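
To make the average-memory-access-time formula concrete, here is a minimal worked example in C; the numbers (1-cycle hit time, 5% miss rate, 100-cycle miss penalty) are illustrative assumptions, not values from the slides.

#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty */
int main(void) {
    double hit_time     = 1.0;    /* cycles (assumed) */
    double miss_rate    = 0.05;   /* 5% of accesses miss (assumed) */
    double miss_penalty = 100.0;  /* cycles to replace a block (assumed) */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1 + 0.05 * 100 = 6.0 cycles */
    return 0;
}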

Lecture 7 agenda
Chapter 5.2 (App. C.2-C.3) in "Computer Architecture"
Cache optimizations

Cache performance
Execution time = IC × (CPI_execution + memory accesses per instruction × miss rate × miss penalty) × T_C
Three ways to increase performance:
Reduce miss rate
Reduce miss penalty
Reduce hit time
However, remember: execution time is the only true measure!
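
As a sketch of how the three levers interact, the execution-time formula evaluated in C with assumed values (1.2 base CPI, 1.4 memory accesses per instruction, 2 GHz clock): halving either the miss rate or the miss penalty shortens execution time equally here, which is why execution time, not any single cache metric, is the true measure.

#include <stdio.h>

/* Execution time = IC * (CPI_exec + accesses/instr * miss rate * miss penalty) * Tc */
static double exec_time(double cpi_exec, double acc_per_instr,
                        double miss_rate, double miss_penalty,
                        double ic, double tc) {
    return ic * (cpi_exec + acc_per_instr * miss_rate * miss_penalty) * tc;
}

int main(void) {
    const double IC = 1e9, TC = 0.5e-9;  /* 10^9 instructions at 2 GHz (assumed) */
    printf("baseline           : %.3f s\n", exec_time(1.2, 1.4, 0.05,  100, IC, TC));
    printf("halved miss rate   : %.3f s\n", exec_time(1.2, 1.4, 0.025, 100, IC, TC));
    printf("halved miss penalty: %.3f s\n", exec_time(1.2, 1.4, 0.05,   50, IC, TC));
    return 0;
}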

Bandwidth increase
Pipelined caches: higher bandwidth, but a greater penalty on mispredicted branches
Multibanked caches: work best with an even spread of accesses across the banks; sequential interleaving maps consecutive block addresses to consecutive banks (see the sketch below)
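
A minimal sketch of sequential interleaving, assuming a hypothetical 4-bank organization: bank selection is simply the block address modulo the number of banks, so a streaming access pattern spreads evenly over all banks.

#include <stdio.h>

#define NUM_BANKS 4  /* assumed bank count */

/* Sequential interleaving: consecutive block addresses map to consecutive banks. */
static unsigned bank_of(unsigned block_addr) {
    return block_addr % NUM_BANKS;
}

int main(void) {
    for (unsigned block = 0; block < 8; block++)
        printf("block %u -> bank %u\n", block, bank_of(block));  /* banks 0,1,2,3,0,1,2,3 */
    return 0;
}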

Reduce hit time 1: KISS (Keep It Simple, Stupid)
Hit time is critical since it affects the clock rate. Smaller and simpler is faster:
a small cache fits on-chip (avoids off-chip time penalties)
a direct-mapped cache allows data fetch and tag check to proceed in parallel
[Figure: access time (microseconds) versus cache size, 16 KB to 256 KB, for 1-, 2-, 4-, and 8-way set-associative caches]

Reduce hit time 2: Address translation
The processor uses virtual addresses (VA) while caches and main memory use physical addresses (PA). Use the virtual address to index the cache in parallel with address translation.

Reduce hit time 3: Address translation
Use virtual addresses both to index the cache and for the tag check:
Processes have different virtual address spaces: flush the cache between context switches, or extend the tag with a process-identifier tag (PID)
Two virtual addresses may map to the same physical address (synonyms or aliases): hardware guarantees that every cache block has a unique physical address, or some OSes use page colouring to ensure that the cache index is unique
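
A sketch of the address split behind these hit-time tricks, under assumed geometry (a 4 KB direct-mapped cache with 32-byte blocks and 4 KB pages): index and block offset together occupy exactly the 12 page-offset bits, which are identical in the virtual and physical address, so the cache can be indexed from the virtual address while the TLB translates the tag in parallel.

#include <stdio.h>

/* Assumed geometry: 4 KB direct-mapped cache, 32-byte blocks, 4 KB pages. */
#define BLOCK_BITS 5   /* 32-byte blocks       */
#define INDEX_BITS 7   /* 4096 / 32 = 128 sets */
#define PAGE_BITS  12  /* 4 KB pages           */

int main(void) {
    unsigned addr   = 0x12345678u;  /* example virtual address */
    unsigned offset = addr & ((1u << BLOCK_BITS) - 1);
    unsigned index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (BLOCK_BITS + INDEX_BITS);

    /* BLOCK_BITS + INDEX_BITS == PAGE_BITS: the index comes entirely from
       the page offset, so indexing never needs translation. */
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}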

Reduce hit time 4: Trace caches
Dynamically find a sequence of executed instructions (including taken branches) to make up a cache block
Complicated address mapping
Avoids the header and trailer overhead of normal cache schemes
The same instruction can be stored several times in the cache (in different traces)
Used in Intel NetBurst (P4)

Reduce hit time: various tricks
Way prediction combines the speed of a direct-mapped cache with the conflict-miss reduction of set-associative caches: hardware predicts which block in the set will be accessed next (compare branch prediction). If the prediction is correct, the latency is low; otherwise it is longer. Prediction accuracy is about 85%.
Pseudo-associativity: (i) first check a direct-mapped entry; on a miss, (ii) check a second entry (see the sketch below). This gives two hit times, fast and slow; if the hit rate in the fast entry is low, performance will be dictated by the slow hit time.
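
The pseudo-associative lookup sketched in C; the common trick of forming the second probe by flipping the most-significant index bit is an assumption here, as are the geometry and all names.

#include <stdbool.h>
#include <stdio.h>

#define SETS 128  /* assumed: 128-set direct-mapped data cache */

static unsigned tags[SETS];
static bool     valid[SETS];

/* Probe the primary entry first (fast hit); on a miss, probe a second
   entry formed by flipping the MSB of the index (slow hit). */
static bool lookup(unsigned tag, unsigned index, bool *slow_hit) {
    *slow_hit = false;
    if (valid[index] && tags[index] == tag)
        return true;                      /* fast hit */
    unsigned alt = index ^ (SETS >> 1);   /* second, "pseudo" entry */
    if (valid[alt] && tags[alt] == tag) {
        *slow_hit = true;                 /* hit, but with the slow hit time */
        return true;
    }
    return false;                         /* miss in both entries */
}

int main(void) {
    valid[69] = true;  tags[69] = 0xAB;   /* 69 = 5 ^ 64: only the alternate entry matches */
    bool slow;
    bool hit = lookup(0xAB, 5, &slow);
    printf("hit = %d, slow = %d\n", hit, slow);  /* hit = 1, slow = 1 */
    return 0;
}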

Reduce miss penalty 1: Multilevel caches
The more the merrier. Use several levels of cache memory:
the 1st-level cache: fast and small = on-chip
the 2nd-level cache can be made much larger and set-associative to reduce capacity and conflict misses
... and so on for 3rd- and 4th-level caches
On-chip or off-chip? Today 3 levels are on-chip; a 4th is coming.

Multilevel caches: execution time
Performance:
Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
Design issue: multilevel inclusion or exclusion?

Multilevel caches: examples

CPU            Clock (GHz)  L1 (KB)  L2 (KB)  L3 (MB)  SPEC Int  SPEC FP
FX-51          2.2          64+64    1024     -        1376      1371
Itanium 2      1.5          16+16    256      6        1322      2119
Pentium 4      3.2          12+8     512      -        1221      1271
Pentium 4 EE   3.2          12+8     512      2        1583      1475
Core i7        3.5          32+32    256      8
Phenom II      3            128      512      8
AMD Bulldozer  4            16+64    2048     8
IBM z196       5.2          64+128   1536     24

Reduce miss penalty 2: Write buffers, read priority
Preference: give priority to read misses over writes
Companionship: merging write buffer (see the sketch below)
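
A sketch of the merging write buffer: before allocating a new entry, a write first checks whether the buffer already holds its block and, if so, merges into that entry. The entry count, block size, and all names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES  4  /* assumed buffer depth    */
#define BLOCK_WORDS 8  /* assumed words per block */

struct wb_entry {
    bool     used;
    uint64_t block;               /* block number             */
    uint32_t data[BLOCK_WORDS];
    uint8_t  word_valid;          /* bitmask of words written */
};

static struct wb_entry wb[WB_ENTRIES];

/* Returns false only when the buffer is full and nothing merged,
   in which case the CPU would stall until an entry drains to memory. */
static bool wb_put(uint64_t addr, uint32_t value) {
    uint64_t block = addr / (BLOCK_WORDS * 4);
    unsigned word  = (addr / 4) % BLOCK_WORDS;

    for (int i = 0; i < WB_ENTRIES; i++)   /* 1: try to merge */
        if (wb[i].used && wb[i].block == block) {
            wb[i].data[word] = value;
            wb[i].word_valid |= 1u << word;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)   /* 2: allocate a fresh entry */
        if (!wb[i].used) {
            wb[i].used = true;
            wb[i].block = block;
            wb[i].data[word] = value;
            wb[i].word_valid = 1u << word;
            return true;
        }
    return false;
}

int main(void) {
    wb_put(0x1000, 1);
    wb_put(0x1004, 2);  /* same block: merges instead of taking a new entry */
    printf("entry 0 word mask = 0x%x\n", wb[0].word_valid);  /* 0x3 */
    return 0;
}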

Reduce miss penalty 3: Other tricks
Impatience: early restart and critical word first
Early restart: fetch the words in normal order, but restart the processor as soon as the requested word has arrived
Critical word first: fetch the requested word first; overlap CPU execution with filling the rest of the cache block

Reduce miss penalty 4: Nonblocking caches
A nonblocking (lockup-free) cache:
(+) permits other cache operations to proceed when a miss has occurred
(+) may further lower the effective miss penalty if multiple misses can overlap
(-) has to bookkeep all outstanding references, which increases cache controller complexity
Good for out-of-order pipelined CPUs
The presence of true data dependencies may limit performance
Increases performance mainly with large block sizes

Reduce miss rate/penalty: Prefetching
Goal: overlap execution with speculative prefetching to the cache
Hardware prefetching: if there is a miss for block X, also fetch blocks X+1, X+2, ..., X+d (d = 1 gives one-block lookahead). Used in the Alpha 21064 and Pentium 4
Software prefetching: load data before it is needed, as register prefetching or cache prefetching; requires special instructions (used in Itanium). See the sketch below.
Both require lockup-free (nonblocking) caches
Prefetches can be faulting or nonfaulting (virtual memory)
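
A sketch of software cache prefetching using the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 16 elements is an assumed tuning parameter, and the benefit relies on a nonblocking cache that lets the prefetch overlap with execution.

#include <stddef.h>
#include <stdio.h>

/* Request elements a fixed distance ahead so they are already in the
   cache by the time the loop reaches them. */
static double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;  /* prefetch distance in elements (assumed) */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/3);
        sum += a[i];
    }
    return sum;
}

int main(void) {
    static double a[1024];
    for (size_t i = 0; i < 1024; i++) a[i] = 1.0;
    printf("sum = %.0f\n", sum_with_prefetch(a, 1024));  /* 1024 */
    return 0;
}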

Reduce miss rate
The three C's:
Compulsory: misses that occur even in an infinite cache
Capacity: misses that remain in a fully associative cache
Conflict: misses due to limited associativity in an N-way associative cache
How do we reduce the number of misses? Change the cache size? Change the block size? Change the associativity? Change the compiler? Other tricks! Which of the three C's are affected?

Reduce misses 1: Increase block size
An increased block size exploits spatial locality (see the sketch below)
Too-big blocks increase the capacity miss rate
Big blocks also increase the miss penalty
Beware: consider the impact on the average memory access time

Reduce misses 2: Change associativity
Rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
Beware: hit time increases with increasing associativity. Will the clock cycle time increase?
[Figure: access time (microseconds) versus cache size, 16 KB to 256 KB, for 1-, 2-, 4-, and 8-way set-associative caches]
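
An illustrative contrast (not from the slides) of why larger blocks exploit spatial locality: with unit stride every word of a fetched block is used, so one miss buys many hits; with a stride of one block, only a single word per fetched block is useful.

#include <stddef.h>
#include <stdio.h>

#define N (1 << 20)
static int a[N];

/* Unit stride: all words of each fetched cache block are used. */
static long sum_unit_stride(void) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Stride of 16 ints = 64 bytes (an assumed typical block size):
   each fetched block contributes one useful word, so a larger block
   size does not reduce the number of misses for this loop. */
static long sum_block_stride(void) {
    long s = 0;
    for (size_t i = 0; i < N; i += 16)
        s += a[i];
    return s;
}

int main(void) {
    printf("%ld %ld\n", sum_unit_stride(), sum_block_stride());
    return 0;
}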

Reduce misses 3: Compiler optimizations
Basic idea: reorganize code to increase locality!

Merging arrays. Before:

int val[size];
int key[size];

After:

struct merge {
    int val;
    int key;
};
struct merge newarr[size];

Loop interchange. Before:

for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        A[i][j] = ...;

After:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        A[i][j] = ...;

Blocking (a sketch follows below)

Reduce misses 4: Victim cache
Recycling: how to combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer to hold data discarded from the cache.
Jouppi: a 4-entry victim cache removed 25% of the conflict misses for a 4 Kbyte direct-mapped data cache
Used in AMD Athlon, HP and Alpha machines

Outline
1 Reiteration
2 Cache performance optimization
3 Bandwidth increase
4 Reduce hit time
5 Reduce miss penalty
6 Reduce miss rate
7 Summary
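
The slides name blocking without showing code, so here is a minimal blocked matrix-multiply sketch; it assumes N is a multiple of the tile size B, and B = 32 is an illustrative choice meant to let three B×B tiles fit in the cache at once.

#define N 256
#define B 32  /* tile size (assumed); N must be a multiple of B */

/* Blocking: work on B x B tiles so each tile is reused from the cache
   many times before it is evicted. C must be initialized by the caller. */
void matmul_blocked(const double A[N][N], const double Bm[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of Bm,
                   accumulating into the (ii,jj) tile of C */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}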

Cache performance
Execution time = IC × (CPI_execution + memory accesses per instruction × miss rate × miss penalty) × T_C
Three ways to increase performance:
Reduce miss rate
Reduce miss penalty
Reduce hit time
... and increase bandwidth
However, remember: execution time is the only true measure!