COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II)
Edgar Gabriel
Spring 2018

Types of cache misses
Compulsory misses: the first access to a block can never be in the cache (cold-start misses).
Capacity misses: the cache cannot contain all blocks required during the execution, so blocks are discarded and later retrieved.
Conflict misses: a block has to be discarded because too many blocks map to the same set, i.e. because of the block placement and replacement strategy.
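To make the distinction concrete, here is a minimal sketch (not from the slides; the cache size, block size and addresses are illustrative assumptions) of a direct-mapped cache in which two blocks map to the same set and keep evicting each other, so nearly all accesses are conflict misses even though the cache is almost empty:

/* Minimal direct-mapped cache model illustrating conflict misses.
   NUM_SETS and BLOCK_SIZE are illustrative assumptions. */
#include <stdio.h>

#define NUM_SETS   64          /* 64 sets, one block per set (direct mapped) */
#define BLOCK_SIZE 64          /* 64-byte blocks */

static long tags[NUM_SETS];    /* tag stored in each set, -1 = empty */

static int access_cache(long addr)   /* returns 1 on hit, 0 on miss */
{
    long block = addr / BLOCK_SIZE;
    int  set   = block % NUM_SETS;
    long tag   = block / NUM_SETS;
    if (tags[set] == tag) return 1;
    tags[set] = tag;                  /* evict whatever was there */
    return 0;
}

int main(void)
{
    for (int i = 0; i < NUM_SETS; i++) tags[i] = -1;

    /* Two addresses exactly NUM_SETS * BLOCK_SIZE bytes apart map to the
       same set; alternating between them misses every single time
       (2 compulsory misses, then 18 conflict misses). */
    int misses = 0;
    for (int i = 0; i < 10; i++) {
        misses += !access_cache(0);
        misses += !access_cache(NUM_SETS * BLOCK_SIZE);
    }
    printf("misses: %d out of 20 accesses\n", misses);
    return 0;
}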

Ten advanced cache optimizations
Reducing hit time: small and simple caches, way prediction.
Increasing cache bandwidth: pipelined caches, multi-banked caches, non-blocking caches.
Reducing miss penalty: critical word first and early restart, merging write buffers.
Reducing miss rate: compiler optimizations.
Reducing miss penalty and miss rate via parallelism: hardware prefetching, compiler prefetching.

Small and simple caches
Smaller and simpler caches are faster by nature.
Cache hit time depends on
the comparison of the tag with the address,
the number of comparisons that have to be executed,
the clock cycle of the cache, i.e. on-chip vs. off-chip caches.
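These levers (hit time, bandwidth, miss penalty, miss rate) are commonly related through the average memory access time, AMAT = hit time + miss rate x miss penalty. A small numeric sketch; the parameter values are assumptions for illustration, not from the slides:

/* Illustrative AMAT calculation; the 1-cycle hit time, 5% miss rate and
   60-cycle miss penalty are assumed values for the sake of the example. */
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles */
    double miss_rate    = 0.05;
    double miss_penalty = 60.0;   /* cycles */

    double amat = hit_time + miss_rate * miss_penalty;   /* 1 + 0.05*60 = 4 */
    printf("AMAT = %.1f cycles\n", amat);

    /* Halving the miss rate (e.g. via compiler optimizations) helps as much
       as halving the miss penalty in this simple model. */
    printf("AMAT with 2.5%% miss rate = %.1f cycles\n",
           hit_time + 0.025 * miss_penalty);              /* 2.5 cycles */
    return 0;
}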

Way prediction
Add prediction bits to an n-way set-associative cache in order to predict which of the blocks in a set will be used next.
The predicted block is checked first on a data request; if the prediction was wrong, the other entries are checked.
Speeds up the initial access of the cache when the prediction is correct.

Pseudo-associative caches:
Implemented as a direct-mapped cache with a backup location.
Each address maps to exactly one primary location; if the item is not found there, a secondary location is also checked.
The secondary location is typically obtained by inverting a bit of the cache index (e.g. the most significant index bit).
Problem: have to avoid that, after a while, all data items end up in their secondary location.

Pipelined cache access
Reduce the latency of the first-level cache by pipelining the access.
E.g. the Intel Core i7 pipeline takes 4 clock cycles for an L1 instruction cache access.
Increases the number of pipeline stages: a higher penalty on branch mispredictions and a larger number of cycles between issuing a load and the execute step.
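A software sketch of the way-prediction idea for a 2-way set-associative cache; the data structures and values are illustrative assumptions (real hardware does this with comparators and a prediction bit per set, not C code):

/* Way prediction sketch: each set keeps one prediction bit naming the way
   to check first.  All names and sizes are illustrative. */
#include <stdio.h>

#define NUM_SETS 64

struct cache_set {
    long tags[2];
    int  predicted_way;            /* way checked first on the next access */
};

static struct cache_set cache[NUM_SETS];

/* Returns the number of tag comparisons needed: 1 if the prediction was
   right (fast hit), 2 if the other way had to be checked as well. */
static int lookup(int set_idx, long tag, int *hit)
{
    struct cache_set *s = &cache[set_idx];
    int first = s->predicted_way, second = 1 - first;

    if (s->tags[first] == tag) { *hit = 1; return 1; }
    if (s->tags[second] == tag) {          /* slower hit: update prediction */
        s->predicted_way = second;
        *hit = 1;
        return 2;
    }
    *hit = 0;
    return 2;                              /* miss after checking both ways */
}

int main(void)
{
    int hit;
    cache[3].tags[1] = 42;                 /* pretend tag 42 lives in way 1 */
    printf("comparisons: %d (hit=%d)\n", lookup(3, 42, &hit), hit);  /* 2 */
    printf("comparisons: %d (hit=%d)\n", lookup(3, 42, &hit), hit);  /* 1 */
    return 0;
}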

Non-blocking caches
Allow the data cache to continue to supply cache hits during a cache miss.
Reduces the effective miss penalty.
Further optimization: allow the overlap of multiple outstanding misses; only useful if the memory system can service multiple misses.
Increases the complexity of the cache controller.

Multi-bank caches (I)
Example: cost of a cache miss
4 clock cycles to send the address to main memory
56 clock cycles to access a word
4 clock cycles to send a word of data
Assuming a cache block consists of 4 words, loading a cache block takes 4 x (4 + 56 + 4) = 256 cycles.
Increase the width of the main memory bus: assuming the memory is 2 words wide, the cost of a cache miss goes down to 2 x (4 + 56 + 4) = 128 cycles.
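The same arithmetic as a small sketch that can be recomputed for other parameters; the cycle counts are the ones assumed in the example above:

/* Reproduces the miss-cost arithmetic above: 4 cycles to send the address,
   56 cycles per word access, 4 cycles to return a word; block = 4 words. */
#include <stdio.h>

int main(void)
{
    int addr = 4, access = 56, xfer = 4, words_per_block = 4;

    int one_word_wide  = words_per_block * (addr + access + xfer);        /* 4 * 64 = 256 */
    int two_words_wide = (words_per_block / 2) * (addr + access + xfer);  /* 2 * 64 = 128 */

    printf("1-word-wide memory: %d cycles per miss\n", one_word_wide);
    printf("2-word-wide memory: %d cycles per miss\n", two_words_wide);
    return 0;
}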

Multi-bank caches (II)
Interleaved main memory
Try to take advantage of the fact that reading/writing multiple words uses multiple memory chips (= banks).
Banks are typically one word wide.
Often using sequential interleaving:
Bank 0 holds all blocks with address % 4 = 0
Bank 1 holds all blocks with address % 4 = 1
etc.
With four banks, the cost of a cache miss drops to 4 + 56 + (4 x 4) = 76 cycles, since the word accesses overlap and only the transfers are serialized.

Critical word first and early restart
On a cache miss, an entire cache block has to be loaded from memory.
Idea: don't wait until the entire cache block has been loaded; focus on the requested data item.
Critical word first: request the missed word first from memory, forward it to the processor as soon as it arrives, and fill up the rest of the cache block afterwards.
Early restart: fetch the words of the cache block in sequential order, forward the requested word to the processor as soon as it arrives, and fill up the rest of the cache block afterwards.
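A sketch of the sequential-interleaving mapping and the resulting miss cost, under the same cycle-count assumptions as above:

/* Sequential interleaving across four one-word-wide banks: the four word
   accesses overlap, only the transfers are serialized. */
#include <stdio.h>

int main(void)
{
    int banks = 4;
    for (long block_addr = 0; block_addr < 8; block_addr++)
        printf("block %ld -> bank %ld\n", block_addr, block_addr % banks);

    int addr = 4, access = 56, xfer = 4, words_per_block = 4;
    int interleaved = addr + access + words_per_block * xfer;   /* 4 + 56 + 16 = 76 cycles */
    printf("interleaved miss cost: %d cycles\n", interleaved);
    return 0;
}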

Merging write buffers
Check in the write buffer whether multiple entries can be merged into a single one.
Example (write buffer entries holding four 8-byte words each): four writes to addresses 100, 108, 116 and 124 occupy four separate entries without merging; with write merging they are combined into a single entry covering all four words, leaving the other entries free.

Compiler optimizations
Re-organize the source code such that caches are used more efficiently.
Two examples:
Loop interchange:
/* original code jumps through memory and cache */
for (j=0; j<N; j++)
    for (i=0; i<M; i++)
        x[i][j] = 2*x[i][j];

/* optimized code: traverses x row by row */
for (i=0; i<M; i++)
    for (j=0; j<N; j++)
        x[i][j] = 2*x[i][j];

Blocked matrix-matrix multiply (see introductory lecture 1); a sketch follows below.
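For reference, a minimal sketch of blocked (tiled) matrix-matrix multiplication; N and the tile size BS are illustrative assumptions, and BS should be chosen so that the tiles of all three matrices fit in the cache:

/* Blocked matrix-matrix multiply: work on BS x BS tiles so that each tile
   of A and B is reused many times while it is still resident in the cache.
   Assumes C is zero-initialized; N and BS are illustrative. */
#define N  512     /* matrix dimension */
#define BS 32      /* tile size (must divide N in this sketch) */

void matmul_blocked(double C[N][N], double A[N][N], double B[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                /* multiply one BS x BS tile of A with one tile of B,
                   accumulating into the corresponding tile of C */
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}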

Hardware prefetching
Prefetch items before they are requested by the processor.
E.g. for the instruction cache: the processor fetches two blocks on a miss.
The requested block is placed in the instruction cache; the prefetched block is placed in an instruction stream buffer.
On a request to the prefetched block, the original cache request is cancelled by the hardware and the block is read from the stream buffer.
Can be utilized for data caches as well.
Downside: wastes memory bandwidth if the prefetched block is not required.

Compiler-controlled prefetching
Register prefetch: load the value into a register.
Cache prefetch: load the data into the cache only.
Faulting and non-faulting prefetches: a non-faulting prefetch turns into a no-op in case of a virtual address fault/protection violation.

Compiler-controlled prefetching (II)
Example:
for (i=0; i<3; i++) {
    for (j=0; j<100; j++) {
        a[i][j] = b[j][0] * b[j+1][0];
    }
}
Assumptions:
Each element of a and b is 8 bytes (double precision floating point).
8 KB direct-mapped cache with 16-byte blocks.
Each block can hold 2 consecutive elements of either a or b.

Compiler-controlled prefetching (III)
Number of cache misses for a:
Since a block contains a[i][j] and a[i][j+1], every even value of j leads to a cache miss and every odd value of j is a cache hit.
3 x 100/2 = 150 cache misses because of a.
Number of cache misses because of b:
b does not benefit from spatial locality, since only column 0 is accessed.
Each element of b can be reused 3 times (for i = 0, 1, 2).
Ignoring cache conflicts, this leads to a cache miss for b[0][0] when i = 0 and a cache miss for every b[j+1][0] when i = 0:
1 + 100 = 101 cache misses.

Compiler-controlled prefetching (IV)
Assume a cache miss takes 7 clock cycles, so prefetches have to be issued 7 iterations ahead.
Assume non-faulting prefetches.
This does not prevent the cache misses in the first 7 iterations.
for (j=0; j<100; j++) {
    prefetch b[j+7][0];
    prefetch a[0][j+7];
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i=1; i<3; i++) {
    for (j=0; j<100; j++) {
        prefetch a[i][j+7];
        a[i][j] = b[j][0] * b[j+1][0];
    }
}

Compiler-controlled prefetching (V)
New number of cache misses:
7 misses for b[0][0], b[1][0], ..., b[6][0] in the first loop
7/2 = 4 misses (rounding up) for a[0][0], a[0][2], ... in the first loop
7/2 = 4 misses for a[1][0], a[1][2], ... in the second loop
7/2 = 4 misses for a[2][0], a[2][2], ... in the second loop
Total number of misses: 19, reducing the number of cache misses by 232 (from 251 without prefetching).

A write hint could tell the cache: "I need the elements of a in the cache only to speed up the writes. Since I do not read the elements, don't bother to actually load the data into the cache; it will be overwritten anyway."
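On compilers such as GCC and Clang the prefetch pseudo-instructions above can be expressed with the __builtin_prefetch intrinsic. A sketch of the first loop under the same assumptions; the array padding exists only so that the j+7 prefetch addresses stay within bounds in this sketch:

/* First prefetching loop using the GCC/Clang intrinsic
   __builtin_prefetch(addr, rw, locality).  The prefetch is non-faulting:
   a prefetch of an unusable address simply has no effect.
   Arrays padded beyond the 100/101 elements actually used so that the
   j+7 addresses remain valid in this sketch. */
double a[3][110], b[110][3];

void prefetch_first_row(void)
{
    for (int j = 0; j < 100; j++) {
        __builtin_prefetch(&b[j + 7][0], 0, 1);  /* read prefetch, 7 ahead  */
        __builtin_prefetch(&a[0][j + 7], 1, 1);  /* write prefetch, 7 ahead */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
}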

Memory technology
Performance metrics:
Latency problems are handled through caches.
Bandwidth is the main concern for main memory.
Access time: time between a read request and when the desired word arrives.
Cycle time: minimum time between unrelated requests to memory.
DRAM is mostly used for main memory, SRAM for caches.

Memory technology
Static Random Access Memory (SRAM):
Requires low power to retain the bit.
Requires 6 transistors per bit.
Dynamic Random Access Memory (DRAM):
One transistor per bit.
Must be re-written after being read.
Must be periodically refreshed (~ every 8 ms); a refresh can be done for an entire row simultaneously.
During a refresh the memory system is unavailable for an entire memory cycle (access time + cycle time).
Address lines are multiplexed:
Upper half of the address: row access strobe (RAS)
Lower half of the address: column access strobe (CAS)

Source: http://www.eng.utah.edu/~cs7810/pres/11-7810-12.pdf

Memory technology
Amdahl's rule of thumb: memory capacity should grow linearly with processor speed.
Memory speed has not kept pace with processors:
Capacity has increased at ~55% per year.
The RAS cycle has improved at only ~5% per year.

Year   Chip size   RAS, slowest DRAM (ns)   RAS, fastest DRAM (ns)   CAS (ns)   Cycle time (ns)
1980   64 Kbit     180                      150                      75         250
1989   4 Mbit      100                      80                       20         165
1998   128 Mbit    70                       50                       10         100
2010   4 Gbit      36                       28                       1          37
2012   8 Gbit      30                       24                       0.5        31

DRAM optimizations
Dual Inline Memory Module (DIMM): a small module containing 4-16 DRAM chips.
Double data rate (DDR): transfer data on both the rising and the falling edge of the DRAM clock signal; doubles the peak data rate.
Synchronous DRAM (SDRAM):
Added a clock to the DRAM interface, removing the repeated synchronization overhead with the memory controller.
Contains a register to hold the number of bytes requested, so up to 8 transfers of 16 bits each can be served without sending a new address (burst mode).
Burst mode often supports critical word first.

DRAM optimizations (II)
Multiple accesses to the same row.
Wider interfaces: DDR offered a 4-bit transfer mode; DDR2 and DDR3 offer a 16-bit transfer mode.
Multiple banks on each DRAM device: 2-8 banks in DDR3.
Requires an additional field in the address: bank number, row address, column address.
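A minimal sketch of how a controller might split an address into column, bank and row fields; the field widths (10 column bits, 3 bank bits for 8 banks) are illustrative assumptions, and real address mappings differ between controllers:

/* Illustrative split of a memory address into column, bank and row fields. */
#include <stdio.h>

#define COL_BITS  10
#define BANK_BITS 3            /* 8 banks, as in DDR3 */

int main(void)
{
    unsigned long addr = 0x3A7F2C40;   /* arbitrary example address */

    unsigned long col  =  addr & ((1UL << COL_BITS) - 1);
    unsigned long bank = (addr >> COL_BITS) & ((1UL << BANK_BITS) - 1);
    unsigned long row  =  addr >> (COL_BITS + BANK_BITS);

    printf("addr 0x%lx -> row 0x%lx, bank %lu, column 0x%lx\n",
           addr, row, bank, col);
    return 0;
}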

DDR DRAMs and DIMMs
DDR2:
Lower power (2.5 V -> 1.8 V)
Higher clock rates (266 MHz, 333 MHz, 400 MHz)
DDR3:
1.5 V
800 MHz
DDR4:
1-1.2 V
1600 MHz
GDDR5 is graphics memory based on DDR3.

Memory optimizations
Graphics memory:
Achieves 2-5x the bandwidth per DRAM compared to DDR3.
Wider interfaces (32 bits vs. 16 bits).
Higher clock rate, possible because the chips are attached by soldering instead of via socketed DIMM modules.
Reducing power in SDRAMs:
Lower voltage.
Low-power mode (ignores the clock, but continues to refresh).

Memory dependability
Electronic circuits are susceptible to cosmic rays.
For SDRAM:
Soft errors: dynamic errors, detected and fixed by error correcting codes (ECC); roughly one check bit per ~8 data bits.
Hard errors: permanent hardware errors; DIMMs often contain spare rows to replace defective rows.
Chipkill: a RAID-like error recovery technique that can handle the failure of an entire chip.
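As a small illustration of the check-bit idea, a sketch that computes even parity over one byte and detects a single flipped bit; this shows detection only, since correcting the error requires a full ECC such as SECDED, and the values are illustrative:

/* Even parity over 8 data bits: the stored check bit makes the total number
   of 1s even, so any single flipped bit is detected (but not corrected). */
#include <stdio.h>

static int parity8(unsigned char byte)       /* 1 if the byte has odd weight */
{
    int p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1;
    return p;
}

int main(void)
{
    unsigned char data    = 0x5A;             /* stored data byte             */
    int stored_parity     = parity8(data);    /* stored check bit             */

    unsigned char readout = data ^ 0x08;      /* simulate a single bit flip   */
    if (parity8(readout) != stored_parity)
        printf("soft error detected in 0x%02X\n", readout);
    return 0;
}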