Memory Hierarchy: Computing Systems & Performance, MSc Informatics Eng. (most slides are borrowed)

Computing Systems & Performance: Memory Hierarchy
MSc Informatics Eng. 2012/13, A.J. Proença
(most slides are borrowed)

Memory Hierarchy Design

- Memory hierarchy design becomes more crucial with recent multi-core processors:
  - aggregate peak bandwidth grows with the number of cores:
    - an Intel Core i7 can generate two data references per core per clock
    - with four cores and a 3.2 GHz clock:
      25.6 * 10^9 64-bit data references/second + 12.8 * 10^9 128-bit instruction references/second = 409.6 GB/s
      (this arithmetic is reproduced in the sketch at the end of this slide group)
    - DRAM bandwidth is only 6% of this (25 GB/s)
  - this requires:
    - multi-port, pipelined caches
    - two levels of cache per core
    - a shared third-level cache on chip

Memory Hierarchy Basics

- When a word is not found in the cache, a miss occurs:
  - fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - the lower level may be another cache or the main memory
  - also fetch the other words contained within the block, which takes advantage of spatial locality
  - place the block into the cache in any location within its set, determined by the address:
    set = (block address) MOD (number of sets)
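
A minimal sketch in C that reproduces the bandwidth arithmetic above; the core count, clock, reference rates, and the 25 GB/s DRAM figure are the slide's own numbers:

    /* Peak memory bandwidth demand of a 4-core Core i7 (numbers from the slide). */
    #include <stdio.h>

    int main(void) {
        double cores = 4, clock_hz = 3.2e9;
        double data_refs = cores * clock_hz * 2;         /* 25.6e9 64-bit data refs/s  */
        double inst_refs = cores * clock_hz;             /* 12.8e9 128-bit inst refs/s */
        double demand  = data_refs * 8 + inst_refs * 16; /* bytes per second           */
        double dram_bw = 25e9;                           /* DRAM bandwidth (slide)     */
        printf("peak demand: %.1f GB/s\n", demand / 1e9);        /* 409.6 GB/s */
        printf("DRAM covers: %.0f%%\n", 100 * dram_bw / demand); /* ~6%        */
        return 0;
    }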

Set Associativity

- n sets => n-way set associative
- direct-mapped cache => one block per set
- fully associative => one set

Cache Performance

- CPU exec time = (CPU clock cycles + Memory stall cycles) * Clock cycle time
- Memory stall cycles = IC * (Misses / Instruction) * Miss penalty
- Writing to cache: two strategies
  - write-through: immediately update lower levels of the hierarchy
  - write-back: only update lower levels of the hierarchy when an updated block is replaced
  - both strategies use a write buffer to make writes asynchronous
- Note 1: miss rate and miss penalty are often different for reads and writes
- Note 2: speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses

Cache Performance Example

- Given:
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - miss penalty = 100 cycles
  - base CPI (ideal cache) = 2
  - loads & stores are 36% of instructions
- Miss cycles per instruction:
  - I-cache: 0.02 * 100 = 2
  - D-cache: 0.36 * 0.04 * 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44 (computed again in the sketch below)

Miss Rate

- Miss rate: the fraction of cache accesses that result in a miss
- Causes of misses:
  - compulsory: first reference to a block
  - capacity: blocks discarded and later retrieved
  - conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
  - coherency: different processors should see the same value in the same location
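
The cache performance example above, worked mechanically as a small C sketch; all constants are the slide's givens:

    /* Effective CPI with separate I- and D-cache miss contributions. */
    #include <stdio.h>

    int main(void) {
        double base_cpi  = 2.0;   /* CPI with an ideal cache        */
        double icache_mr = 0.02;  /* I-cache miss rate              */
        double dcache_mr = 0.04;  /* D-cache miss rate              */
        double penalty   = 100.0; /* miss penalty, cycles           */
        double mem_frac  = 0.36;  /* loads & stores per instruction */

        double i_stall = icache_mr * penalty;            /* 2.00 cycles/instr */
        double d_stall = mem_frac * dcache_mr * penalty; /* 1.44 cycles/instr */
        printf("actual CPI = %.2f\n", base_cpi + i_stall + d_stall); /* 5.44 */
        return 0;
    }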

Six Basic Cache Optimizations

- Larger block size
  - reduces compulsory misses
  - increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - increases hit time, increases power consumption
- Higher associativity
  - reduces conflict misses
  - increases hit time, increases power consumption
- Multilevel caches to reduce miss penalty
  - reduces overall memory access time
- Giving priority to read misses over writes
  - reduces miss penalty
- Avoiding address translation in cache indexing
  - reduces hit time

Multilevel Caches

- Primary cache attached to the CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L2 cache misses
- Server systems typically include an L3 cache

Multilevel Cache Example

- Given:
  - CPU base CPI = 1, clock rate = 4 GHz
  - miss rate per instruction = 2%
  - main memory access time = 100 ns
- With just the primary cache:
  - miss penalty = 100 ns / 0.25 ns = 400 cycles
  - effective CPI = 1 + 0.02 * 400 = 9

Example (cont.)

- Now add an L2 cache:
  - access time = 5 ns
  - global miss rate to main memory = 0.5%
- Primary miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
- Primary miss with L2 miss: extra penalty = 400 cycles
- CPI = 1 + 0.02 * 20 + 0.005 * 400 = 3.4
- Performance ratio = 9 / 3.4 = 2.6 (the sketch below repeats the calculation)
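
A small C sketch of the two-level CPI calculation above, using only the slide's numbers:

    /* CPI with L1 only vs. L1 + L2, from the multilevel cache example. */
    #include <stdio.h>

    int main(void) {
        double cycle_ns  = 0.25;  /* 4 GHz clock                     */
        double l1_miss   = 0.02;  /* L1 misses per instruction       */
        double global_mr = 0.005; /* global miss rate to main memory */
        double mem_ns    = 100.0, l2_ns = 5.0;

        double mem_penalty = mem_ns / cycle_ns; /* 400 cycles */
        double l2_penalty  = l2_ns / cycle_ns;  /*  20 cycles */

        double cpi_l1   = 1.0 + l1_miss * mem_penalty;                          /* 9.0 */
        double cpi_l1l2 = 1.0 + l1_miss * l2_penalty + global_mr * mem_penalty; /* 3.4 */
        printf("L1 only: CPI = %.1f\n", cpi_l1);
        printf("L1 + L2: CPI = %.1f (ratio %.1f)\n", cpi_l1l2, cpi_l1 / cpi_l1l2);
        return 0;
    }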

Multilevel On-Chip Caches

Intel Nehalem 4-core processor. Per core: 32KB L1 I-cache, 32KB L1 D-cache, 256KB L2 cache.

3-Level Cache Organization (n/a: data not available)

- Intel Nehalem:
  - L1 caches (per core):
    - I-cache: 32KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a
    - D-cache: 32KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  - L2 unified cache (per core): 256KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a
  - L3 unified cache (shared): 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
- AMD Opteron X4:
  - L1 caches (per core):
    - I-cache: 32KB, 64-byte blocks, 2-way, approx. LRU replacement, hit time 3 cycles
    - D-cache: 32KB, 64-byte blocks, 2-way, approx. LRU replacement, write-back/allocate, hit time 9 cycles
  - L2 unified cache (per core): 512KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a
  - L3 unified cache (shared): 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles

Ten Advanced Cache Optimizations

- Reducing the hit time:
  1. small and simple first-level caches
  2. way prediction
- Increasing cache bandwidth:
  3. pipelined cache access
  4. nonblocking caches
  5. multibanked caches
- Reducing the miss penalty:
  6. critical word first
  7. merging write buffers
- Reducing the miss rate:
  8. compiler optimizations
- Reducing the miss penalty or miss rate via parallelism:
  9. hardware prefetching of instructions and data
  10. compiler-controlled prefetching

1. Small and Simple First-Level Caches

- Critical timing path:
  - addressing the tag memory, then
  - comparing tags, then
  - selecting the correct set
- Direct-mapped caches can overlap tag compare and transmission of data
- Lower associativity reduces power because fewer cache lines are accessed

L1 Size and Associativity

(figures: access time vs. size and associativity; energy per read vs. size and associativity)

2. Way Prediction

- To improve hit time, predict the way, to pre-set the mux (a toy model follows after the next slide)
- A mis-prediction gives a longer hit time
- Prediction accuracy:
  - > 90% for two-way
  - > 80% for four-way
  - the I-cache has better accuracy than the D-cache
- First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8
- Extension: predict the block as well ("way selection"); increases the mis-prediction penalty

3. Pipelined Cache Access

- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro to Pentium III: 2 cycles
  - Pentium 4 to Core i7: 4 cycles
- Increases the branch mis-prediction penalty
- Makes it easier to increase associativity
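
To make the way-prediction idea concrete, here is a toy 2-way set-associative lookup with a one-entry-per-set way predictor; this is a purely illustrative sketch, not a model of any real design (the sizes, replacement, and training policy are all assumptions):

    /* Toy 2-way cache: probe the predicted way first; a correct prediction
       is a "fast hit", a wrong one costs an extra probe and retrains. */
    #include <stdint.h>
    #include <stdio.h>

    #define SETS  64
    #define BLOCK 64 /* bytes */

    typedef struct {
        uint64_t tag[2];
        int valid[2];
        int pred;        /* way predicted for the next access to this set */
    } Set;

    static Set cache[SETS]; /* zero-initialized: all invalid, predict way 0 */

    /* Returns 1 = fast hit, 2 = slow hit (mispredicted way), 0 = miss. */
    static int lookup(uint64_t addr) {
        uint64_t block = addr / BLOCK;
        int set = (int)(block % SETS);
        uint64_t tag = block / SETS;
        Set *s = &cache[set];

        int p = s->pred;
        if (s->valid[p] && s->tag[p] == tag) return 1; /* only one way probed */
        int q = 1 - p;
        if (s->valid[q] && s->tag[q] == tag) {
            s->pred = q; /* retrain the predictor */
            return 2;
        }
        /* Miss: fill an invalid way if there is one, else the predicted way. */
        int victim = !s->valid[0] ? 0 : (!s->valid[1] ? 1 : p);
        s->valid[victim] = 1;
        s->tag[victim] = tag;
        s->pred = victim;
        return 0;
    }

    int main(void) {
        uint64_t a = 0x1000, b = 0x1000 + (uint64_t)SETS * BLOCK; /* same set */
        printf("%d ", lookup(a));  /* 0: cold miss           */
        printf("%d ", lookup(a));  /* 1: fast hit            */
        printf("%d ", lookup(b));  /* 0: miss, fills way 1   */
        printf("%d\n", lookup(a)); /* 2: slow hit, retrained */
        return 0;
    }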

4. Nonblocking Caches

- Allow hits before previous misses complete:
  - hit under miss
  - hit under multiple misses
- The L2 must support this
- In general, processors can hide an L1 miss penalty but not an L2 miss penalty

5. Multibanked Caches

- Organize the cache as independent banks to support simultaneous access
  - the ARM Cortex-A8 supports 1-4 banks for L2
  - the Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to the block address (a small sketch of this mapping follows below)

6. Critical Word First, Early Restart

- Critical word first:
  - request the missed word from memory first
  - send it to the processor as soon as it arrives
- Early restart:
  - request words in normal order
  - send the missed word to the processor as soon as it arrives
- The effectiveness of these strategies depends on the block size and on the likelihood of another access to the portion of the block that has not yet been fetched

7. Merging Write Buffer

- When storing to a block that is already pending in the write buffer, update the write buffer entry
- Reduces stalls due to a full write buffer
- Do not apply to I/O addresses

(figure: write buffer contents without and with write merging)
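
The bank-interleaving rule from slide 5 above, shown as a tiny C program; the modulo mapping and the 4-bank count are illustrative assumptions in the spirit of the slide:

    /* Sequential blocks rotate across 4 banks when interleaved by block address. */
    #include <stdio.h>

    int main(void) {
        const unsigned block_bytes = 64, banks = 4;
        for (unsigned addr = 0; addr < 8 * block_bytes; addr += block_bytes) {
            unsigned block = addr / block_bytes;
            printf("addr 0x%03x -> block %u -> bank %u\n", addr, block, block % banks);
        }
        return 0;
    }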

8. Compiler Optimizations

- Loop interchange:
  - swap nested loops to access memory in sequential order
- Blocking:
  - instead of accessing entire rows or columns, subdivide matrices into blocks
  - requires more memory accesses but improves the locality of the accesses
- (both techniques are illustrated in the sketches at the end of this section)

9. Hardware Prefetching

- Fetch two blocks on a miss (include the next sequential block)

(figure: Pentium 4 pre-fetching)

10. Compiler Prefetching

- Insert prefetch instructions before the data is needed
- Non-faulting: a prefetch doesn't cause exceptions
- Register prefetch: loads the data into a register
- Cache prefetch: loads the data into the cache
- Combine with loop unrolling and software pipelining (see the prefetch sketch below)

Summary
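
A short C sketch of the two compiler optimizations from slide 8; N and the tile size BS are arbitrary illustrative choices (N is a multiple of BS to keep the tiling loop simple):

    /* Loop interchange and blocking. row_major() walks the array in storage
       order, so consecutive iterations reuse the fetched cache block;
       blocked_transpose() works on BS x BS tiles to keep them cache-resident. */
    #include <stdio.h>

    #define N  1024
    #define BS 32

    static double a[N][N], t[N][N];

    static void column_major(void) { /* stride N*8 bytes: poor spatial locality */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    static void row_major(void) {    /* stride 8 bytes: sequential access */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }

    static void blocked_transpose(void) { /* one BS x BS tile at a time */
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        t[j][i] = a[i][j];
    }

    int main(void) {
        column_major();
        row_major();
        blocked_transpose();
        printf("t[0][0] = %.1f\n", t[0][0]); /* 2.0: every element touched twice */
        return 0;
    }

And a cache prefetch in the sense of slide 10, imitated with GCC/Clang's __builtin_prefetch (a non-faulting prefetch hint); the prefetch distance of 16 elements is an assumed, illustrative value:

    /* Software prefetch ahead of use; combine with unrolling in practice. */
    #include <stdio.h>

    static void scale(double *x, long n, double k) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[i + 16], 1, 1); /* addr, write, low reuse */
            x[i] *= k;
        }
    }

    int main(void) {
        double v[64];
        for (int i = 0; i < 64; i++) v[i] = (double)i;
        scale(v, 64, 2.0);
        printf("v[10] = %.1f\n", v[10]); /* 20.0 */
        return 0;
    }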