ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2


ECE 552 / CPS 550 Advanced Computer Architecture I
Lecture 13: Memory, Part 2

Benjamin Lee
Electrical and Computer Engineering
Duke University
www.duke.edu/~bcl15
www.duke.edu/~bcl15/class/class_ece252fall12.html

ECE552 Administrivia

19 October: Homework #3 due.
19 October: Project proposals due. One-page proposal:
1. What question are you asking?
2. How are you going to answer that question?
3. Talk to me if you are looking for project ideas.

23 October: Class discussion. Roughly one reading per class. Do not wait until the day before!
1. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers.
2. Kim et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches.
3. Fromm et al. The energy efficiency of IRAM architectures.
4. Lee et al. Phase change memory architecture and the quest for scalability.

Last Time

History of Memory
- Williams tubes were unreliable. Core memory was reliable but slow. Semiconductor memory became competitive in the 1970s.

DRAM Operation
- DRAM accesses require (1) activates, (2) reads/writes, and (3) precharges.
- The performance gap between processor and memory is growing.

Managing the Memory Hierarchy
1. Predictable patterns provide (1) temporal and (2) spatial locality.
2. Exploit predictable patterns with small, fast caches.
3. Cache data placement policies include (1) fully-associative, (2) set-associative, and (3) direct-mapped. The choice of policy determines cache effectiveness.

Datapath-Cache Interaction

[Figure: pipelined datapath with a primary instruction cache feeding instruction fetch and a primary data cache serving loads/stores. On a data cache miss, the entire CPU stalls while the cache refills with data from lower levels of the memory hierarchy.]

Average Memory Access Time

AMAT = [Hit Time] + [Miss Prob.] x [Miss Penalty]
- The miss penalty equals the AMAT of the next cache/memory/storage level.
- AMAT is recursively defined.

To improve performance
- Reduce the hit time (e.g., smaller cache).
- Reduce the miss probability (e.g., larger cache).
- Reduce the miss penalty (e.g., optimize the next level).

Simple design strategy
- Observe that hit time increases with cache size.
- Design the largest possible cache with a hit time of 1-2 cycles.
- For example, design 8-32KB of cache in modern technology.
- Design trade-offs are more complex with superscalar architectures and multi-ported memories.
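
As a quick worked example with assumed numbers: a 1-cycle hit time, 5% miss probability, and 20-cycle miss penalty give AMAT = 1 + 0.05 x 20 = 2 cycles.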

Example: Serial vs Parallel Access

Serial Access
- Check the cache for the addressed data. On a miss (probability 1-p), go to memory.
- AMAT = Tcache + (1-p) x Tmem

Parallel Access
- Check the cache and memory for the addressed data simultaneously.
- AMAT = p x Tcache + (1-p) x Tmem
- AMAT reductions from parallel access are often small, because Tmem >> Tcache and p is high.
- Parallel access consumes memory bandwidth and increases cache complexity and Tcache.

[Figure: serial organization places the cache between processor and main memory; parallel organization sends the address to both cache and main memory at once.]
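
With assumed values Tcache = 2, Tmem = 100, and p = 0.95: serial AMAT = 2 + 0.05 x 100 = 7 cycles, while parallel AMAT = 0.95 x 2 + 0.05 x 100 = 6.9 cycles, a small gain for the added bandwidth and complexity.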

Cache Misses and Causes (3 Cs)

Compulsory
- First reference to a block. May be caused by cold caches as an application begins execution.
- Compulsory misses would occur even with an infinitely sized cache.

Capacity
- The cache is too small and cannot hold all data needed by the program.
- Capacity misses would occur even under a perfect replacement policy.

Conflict
- The cache line placement policy causes collisions.
- Conflict misses would not occur with full associativity.

Reducing Misses

Larger Cache Size
- Benefit: reduces capacity and conflict misses.
- Cost: increases hit time.

Higher Associativity
- Benefit: reduces conflict misses.
- Cost: increases hit time.

Larger Line Size
- Benefit: reduces compulsory misses by exploiting spatial locality.
- Cost: increases conflict misses (fewer lines in the cache) and increases the miss penalty.

Cache Write Policies

What happens when a cache line is written?

If the write hits in the cache (i.e., the line is already cached)
- Write-Through: Write data to both cache and memory. Increases memory traffic but allows simpler datapath and cache controllers.
- Write-Back: Write data to the cache only. Write data to memory only when the cache line is replaced (e.g., on a conflict).
- Write-Back Optimization: Add a dirty bit per cache line, which indicates whether the line is modified. Write back only if the replaced cache line is dirty.

If the write misses in the cache
- No Write Allocate: Write data to memory only.
- Write Allocate: Fetch the line into the cache, then write data into the cache. Also known as fetch-on-write.

Common combinations
- Write-through with no-write-allocate.
- Write-back with write-allocate.
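
As a concrete illustration, here is a minimal sketch of the store path for a write-back, write-allocate cache, assuming a simple single-line model; the type and helper names (CacheLine, writeback_to_memory, fetch_from_memory) are hypothetical, not from the slides.

    typedef struct {
        int      valid;
        int      dirty;                 /* set when the line is modified */
        unsigned tag;
        char     data[64];              /* assumed 64-byte line */
    } CacheLine;

    void writeback_to_memory(CacheLine *line);              /* assumed helper */
    void fetch_from_memory(CacheLine *line, unsigned tag);  /* assumed helper */

    void store_byte(CacheLine *line, unsigned tag, int offset, char value) {
        if (line->valid && line->tag == tag) {   /* write hit */
            line->data[offset] = value;
            line->dirty = 1;                     /* defer the memory write */
        } else {                                 /* write miss: write allocate */
            if (line->valid && line->dirty)
                writeback_to_memory(line);       /* evict the dirty victim first */
            fetch_from_memory(line, tag);        /* fetch-on-write */
            line->data[offset] = value;
            line->dirty = 1;
        }
    }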

Write Performance

[Figure: cache write datapath. The address splits into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines, the stored tag is compared to generate HIT, and HIT gates the write enable (WE) for the data word or byte.]

Writes must (1) check for HIT and (2) perform the write only after the HIT signal resolves. The steps are serial, which harms performance.

Reducing Write Time

Problem: Writes take two cycles in memory
- Access the cache and compare tags to generate the HIT signal (1 cycle).
- Perform the cache write if HIT enables a write (1 cycle).

Solutions
- Design data SRAM that can perform a read/write in one cycle, restoring the old value if HIT is false.
- Design content-addressable data SRAM (CAM), which enables the word line only if HIT is true.
- Pipeline writes with a write buffer: write the cache for store instruction j during the tag check of store instruction j+1.

Pipelining Cache Writes

[Figure: cache with delayed-write address and data buffers between the tag check and the data array; a store's data is held in the buffers until the next store's tag check, and load hits select between the data array and the buffered store data.]

Introduce buffers for the delayed write address and data. Write the cache for the j-th store during the tag check for the (j+1)-th store. Note that loads must check the buffers for the latest data.

Buffering Writes

With buffers, writes do not stall the datapath
- Introduce a write buffer between adjacent levels in the cache hierarchy.
- After writes enter the buffer, computation proceeds.
- For example: reads bypass writes.

Problem
- The write buffer may hold the latest value for a read instruction's address.

Solutions
- Option 1: If a read misses in the cache, wait for the write buffer to empty.
- Option 2: Compare the read address with the write buffer addresses. If there is no match, allow the read to bypass the writes. Else, return the value in the write buffer.
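
A minimal sketch of Option 2, assuming a small array-style write buffer in which higher indices hold younger entries; the names (WBEntry, write_buffer, load_check_buffer) are hypothetical.

    #define WB_ENTRIES 8

    typedef struct { unsigned addr; unsigned data; int valid; } WBEntry;

    WBEntry write_buffer[WB_ENTRIES];   /* index order = age (assumed) */

    /* Returns 1 and forwards data if the buffer holds the read address;
       returns 0 if the read may safely bypass the buffered writes. */
    int load_check_buffer(unsigned addr, unsigned *out) {
        for (int i = WB_ENTRIES - 1; i >= 0; i--) {   /* newest entry first */
            if (write_buffer[i].valid && write_buffer[i].addr == addr) {
                *out = write_buffer[i].data;          /* latest store wins */
                return 1;
            }
        }
        return 0;
    }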

Cache Hierarchy

Problem
- Memory technology imposes a trade-off between speed and size.
- A memory cannot be both large and fast.

Solution
- Introduce a multi-level cache hierarchy.
- As distance from the datapath increases, increase cache size.

CPU <-> L1$ <-> L2$ <-> DRAM

L1-L2 Cache Interactions

Use a smaller L1 cache if an L2 cache is present
- Reduces L1 hit time but increases L1 miss rate.
- L2 mitigates the higher L1 miss rate by reducing the L1 miss penalty.
- May reduce average memory access time (AMAT) and energy (AMAE).

Use a write-through L1 if a write-back L2 cache is present
- Write-through L1 simplifies the pipeline and cache controller.
- Write-through avoids dirty lines, reducing complexity (no dirty flags in L1).
- Write-back L2 absorbs the write traffic. Writes do not go off-chip to DRAM.

Inclusion Policies
- Inclusive Multi-level Cache: The smaller cache (e.g., L1) holds copies of data also present in the larger cache (e.g., L2). Simpler policies.
- Exclusive Multi-level Cache: The smaller cache (e.g., L1) holds data not in the larger cache (e.g., L2). Reduces duplication. Example: AMD Athlon with a 64KB primary and 256KB secondary cache.
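
The recursion in AMAT makes this trade-off concrete. For a two-level hierarchy, AMAT = T_L1 + m_L1 x (T_L2 + m_L2 x T_DRAM). With assumed values (1-cycle L1 hit, 10% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate, 100-cycle DRAM access): AMAT = 1 + 0.1 x (10 + 0.2 x 100) = 4 cycles.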

Power7 On-Chip Caches (2009)

- 32KB L1 I$ per core and 32KB L1 D$ per core; 3-cycle latency.
- 256KB unified L2$ per core; 8-cycle latency.
- 32MB unified shared L3$ in embedded DRAM; 25-cycle latency to the local slice.

Prefetching

Speculate about future memory accesses
- Predict likely instruction and data accesses.
- Pre-emptively fetch instructions and data into caches.
- Instruction accesses are likely easier to predict than data accesses.
- Mechanisms might be implemented in HW, SW, or both.
- What type of misses does prefetching affect?

Challenges in Prefetching
- Prefetching should be useful. Prefetches should reduce misses.
- Prefetching should be timely. Prefetches pollute the cache if too early and are useless if too late.
- Prefetching consumes memory bandwidth.

Hardware Instruction Prefetching

Example: Alpha AXP 21064
- Prefetches instructions.
- Fetches two lines on a cache miss: the requested line (i) and the next consecutive line (i+1).
- Places the requested line in the L1 instruction cache and the next line in an instruction stream buffer.
- If an instruction fetch misses in the L1 cache but hits in the stream buffer, moves the stream buffer line into the L1 cache and prefetches the next line (i+2).

[Figure: CPU with register file and L1 instruction cache; on a miss, the requested block comes from the unified L2 cache while the prefetched block fills the stream buffer.]
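
A minimal sketch of the stream-buffer lookup on an L1 instruction miss, assuming line-granularity addresses and a single-entry buffer; the helper names (install_in_l1, fetch_into_l1, fill_stream_buffer) are hypothetical, not the 21064's actual interface.

    typedef struct { unsigned line_addr; int valid; char data[64]; } StreamBuf;

    void install_in_l1(unsigned line_addr, const char *data);   /* assumed helper */
    void fetch_into_l1(unsigned line_addr);                     /* assumed helper */
    void fill_stream_buffer(StreamBuf *sb, unsigned line_addr); /* assumed helper */

    void on_icache_miss(StreamBuf *sb, unsigned line_addr) {
        if (sb->valid && sb->line_addr == line_addr) {
            install_in_l1(line_addr, sb->data);     /* stream-buffer hit: promote line i+1 */
            fill_stream_buffer(sb, line_addr + 1);  /* and prefetch line i+2 */
        } else {
            fetch_into_l1(line_addr);               /* demand fetch line i from L2 */
            fill_stream_buffer(sb, line_addr + 1);  /* prefetch line i+1 */
        }
    }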

Hardware Data Prefetching

Prefetch after a cache miss
- Prefetch line (i+1) if an access for line (i) misses in the cache.

One-Block Look-ahead (OBL)
- Blocks are also known as lines.
- Initiate a prefetch for block (i+1) when block (i) is accessed.
- Generalizes to N-block look-ahead.
- How is this different from increasing the block or line size by N times?

Strided Prefetch
- Observe the sequence of accesses to cache lines.
- Suppose the sequence (i), (i+n), (i+2n) is observed. Prefetch (i+3n). Here n is the stride.
- Example: IBM Power 5 (2003) supports eight independent streams of strided prefetches per processor, prefetching 12 lines ahead of the current access.
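
A minimal sketch of a single-stream stride detector, assuming line-granularity addresses; a real prefetcher (e.g., Power 5's) tracks many streams and prefetches farther ahead, and the names here (StridePredictor, prefetch_line) are hypothetical.

    typedef struct {
        unsigned last_addr;      /* last line address observed */
        int      last_stride;    /* last observed stride n     */
        int      confidence;     /* consecutive stride matches */
    } StridePredictor;

    void prefetch_line(unsigned line_addr);   /* assumed helper */

    void on_access(StridePredictor *p, unsigned line_addr) {
        int stride = (int)(line_addr - p->last_addr);
        if (stride != 0 && stride == p->last_stride) {
            /* Saw (i), (i+n), (i+2n): stride confirmed, so prefetch (i+3n). */
            if (++p->confidence >= 2)
                prefetch_line(line_addr + stride);
        } else {
            p->confidence = 1;   /* new candidate stride */
        }
        p->last_stride = stride;
        p->last_addr   = line_addr;
    }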

Software Prefetching

for(i = 0; i < N; i++) {
    prefetch(&a[i + P]);
    prefetch(&b[i + P]);
    SUM = SUM + a[i] * b[i];
}

Challenges in Software Prefetching
- Timing is the biggest difficulty, not predictability.
- Prefetch too late if the prefetch instruction is too close to the data use.
- Prefetch too early and cause cache/bandwidth pollution.
- Requires estimating the prefetch latency: the time from issuing the prefetch to filling the L1 cache line.
- Why is this hard to do? What is the correct value of P above?
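
One hedged rule of thumb (an assumption, not from the slides): if the prefetch latency is roughly L cycles and each loop iteration takes roughly W cycles, choose P = ceil(L / W) so the prefetched line arrives just as it is needed. For example, an assumed 200-cycle latency and 10 cycles per iteration give P = 20. Both L and W vary at run time, which is what makes choosing P hard.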

Caches and Code

Restructuring code affects data access sequences
- Group data accesses together to improve spatial locality.
- Re-order data accesses to improve temporal locality.

Prevent data from entering the cache
- Useful for variables that are accessed only once.
- Requires SW to communicate hints to HW.
- Example: no-allocate instruction hints.

Kill data that will never be used again
- Streaming data provides spatial locality but not temporal locality.
- If particular lines contain dead data, prioritize them for replacement.
- A step toward software-managed caches.

Loop Interchange

Before:
for(j = 0; j < N; j++) {
    for(i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
    }
}

After:
for(i = 0; i < M; i++) {
    for(j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
    }
}

What type of locality does this improve? What does it assume about x?

Loop Fusion

Before:
for(i = 0; i < N; i++)
    a[i] = b[i] * c[i];
for(i = 0; i < N; i++)
    d[i] = a[i] * c[i];

After:
for(i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
}

What type of locality does this improve?

Matrix Multiply (X=YZ), Naïve

for(i = 0; i < N; i++) {
    for(j = 0; j < N; j++) {
        for(k = 0; k < N; k++) {
            x[i][j] += y[i][k] * z[k][j];
        }
    }
}

Notes
1. Iterate through matrix (y) by row.
2. Iterate through matrix (z) by column.
3. Update matrix (x) by column.

[Figure: access patterns over y, z, and x, marking untouched, old, and new accesses.]

Matrix Multiply (X=YZ), Blocked

for(i0 = 0; i0 < N; i0 += B) {
    for(j0 = 0; j0 < N; j0 += B) {
        for(k0 = 0; k0 < N; k0 += B) {
            for(i = i0; i < min(i0+B, N); i++) {
                for(j = j0; j < min(j0+B, N); j++) {
                    for(k = k0; k < min(k0+B, N); k++) {
                        x[i][j] += y[i][k] * z[k][j];
                    }
                }
            }
        }
    }
}

Notes
- Organize the matrices in BxB blocks.
- Iterate through the blocks; for each block, perform a standard matrix multiply.
- What type of locality does this improve? Hint: track the re-use of matrix elements during computation.
- Each block multiply performs 2(B^3) operations on 3(B^2) data. Select B such that 3(B^2) elements reside in cache.
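
For example, with an assumed 32KB data cache and 8-byte double-precision elements, 3(B^2) x 8 bytes <= 32768 bytes gives B <= 36, so a block size such as B = 32 fits all three working blocks in cache.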

Caches Summary

- Quantify cache/memory hierarchy performance with AMAT.
- Three types of cache misses: (1) compulsory, (2) capacity, (3) conflict.
- Cache structure and data placement policies determine miss rates.
- Write buffers improve performance.

Prefetching
- Identify and exploit spatial locality.
- Prefetchers can be implemented in hardware, software, or both.

Caches and Code Restructuring
- SW code can improve HW cache performance.
- Data re-use can improve with code structure (e.g., blocked matrix multiply).

Acknowledgements

These slides contain material developed and copyrighted by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)