Advanced Memory Organizations

CSE 3421: Introduction to Computer Architecture
Advanced Memory Organizations
Study: 5.1, 5.2, 5.3, 5.4 (only parts)
Gojko Babić, 03-29-2018

Growth in Performance of DRAM & CPU

There is a huge mismatch between the growth in CPU performance and the growth in DRAM memory performance!

The Levels in Memory Hierarchy

  Technology       Access time    Capacity        Cost per GB (in 2012)
  (registers)      < 0.25 ns      ~10^3 bytes     -
  SRAM             0.5-2.5 ns     ~10^6 bytes     $500-$1,000
  DRAM             50-70 ns       ~10^9 bytes     $10-$20
  SSD              5-50 μs        ~10^11 bytes    $0.75-$1.00
  Magnetic disk    5-20 ms        ~10^12 bytes    $0.05-$0.10

(The Apple Fusion Drive combines the SSD and magnetic disk levels.)

The higher the level, the smaller and faster the memory. Try to keep most of the action in the higher levels. The principle of locality of reference is the most important program property exploited in many parts of the memory hierarchy.

Principle of Locality of Reference

Programs access only a small proportion of their address space at any time during execution, and they tend to reuse instructions and data they have used recently:
- temporal locality: recently accessed items are likely to be accessed again soon,
- spatial locality: items near those accessed recently are likely to be accessed soon, i.e. items near one another tend to be referenced close together in time.
An implication of the principle of locality is that we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past. The principle of locality applies more strongly to code accesses than to data accesses.
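Both kinds of locality are easy to see in ordinary code. Below is a minimal C sketch (the function and names are purely illustrative, not part of the course material):

```c
#include <stddef.h>

/* Summing an array shows both kinds of locality:
 * - spatial: a[i] and a[i+1] are adjacent in memory, so they tend to fall
 *   in the same cache block once that block has been fetched;
 * - temporal: the loop counter i, the accumulator sum, and the loop's
 *   instructions are reused on every iteration. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;                 /* reused every iteration (temporal) */
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* sequential accesses (spatial)     */
    return sum;
}
```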

Typical System Organization Without Cache

[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge; the I/O bridge connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]

Typical System Organization with Cache

[Figure: the same organization, with cache memories placed on the CPU chip between the register file/ALU and the bus interface.]

Memory and Cache Systems

A cache is fast, but precisely because of that it has to be small. Why?
[Figure: the CPU exchanges data objects with the cache; the cache exchanges whole blocks with main memory.]

Basics of Cache Operation

We will first (and mostly) consider the cache read operation, since it is the more important one. Why?
When the CPU provides an address requesting the content of a given main memory location:
- first check the cache for the content; caches include a tag in each cache entry to identify the memory address of the block stored there,
- if the block is present, this is a hit: get the content (fast) and the CPU proceeds normally, with no (or only a small) delay,
- if the block is not present, this is a miss: stall the CPU and read the block containing the required content from main memory; the resulting long CPU slowdown, i.e. the time to access main memory and to place the block into the cache, is the miss penalty,
- only then (after the miss penalty) does the CPU get the required content.

Cache & DRAM Memory: Performance

Let t1 be the cache access time and t2 the main memory access time.
The hit ratio is the ratio of the number of hits to the total number of memory accesses; miss ratio = 1 - hit ratio.
[Figure: average access time falls from t1 + t2 toward t1 as the hit ratio grows from 0 to 1.]
Because of locality of reference, hit rates are normally well over 90%.

  Average access time = hit time + (1 - hit ratio) × miss penalty

Direct Mapped Cache

Direct mapped caches: only one choice of cache entry for each block.
The location of the cache entry is determined by the block address:

  location = (block address) mod (number of blocks in cache)

Since the number of blocks is a power of two, this amounts to using the low-order bits of the block address as the location of the cache entry.
[Figure: each main memory block maps to exactly one entry of the cache.]
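As a small sketch of the mapping rule just stated (function and parameter names are illustrative; the number of blocks is assumed to be a power of two):

```c
#include <stdint.h>

/* Direct-mapped placement: one possible cache entry per block.
 * location = (block address) mod (number of blocks in cache).
 * When num_blocks is a power of two, the mod is just the low-order bits
 * of the block address, which is why hardware can use them directly
 * as the index. */
uint32_t dm_index(uint32_t block_addr, uint32_t num_blocks)
{
    return block_addr % num_blocks;    /* == block_addr & (num_blocks - 1) */
}
```

For the 8-entry cache used in the following example, byte address 88 is block 22 (88 / 4 bytes per block), and 22 mod 8 = 6, i.e. entry 110 in binary.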

Tags and Valid Bits

How do we know which particular block is stored in a cache location?
- store the block address as well as the data in the cache entry,
- actually, only the high-order bits of the block address are needed; these are called the tag.
What if there is no data in a location?
- introduce a valid bit in each entry: 1 = present, 0 = not present, initially 0.

Direct Mapped Cache & Address Format

Assume a processor with 10-bit addresses and a direct mapped cache with 8 entries and 4 bytes of data in each entry.
Then the number of bytes per block = 2^(block offset bits); since 4 = 2^2, the block offset is 2 bits.
Also, the number of entries = 2^(index bits); since 8 = 2^3, the index is 3 bits.
Address format = 5 bits (tag) + 3 bits (index) + 2 bits (block offset).
[Figure: each of the 8 cache entries holds a valid bit, a 5-bit tag and 32 bits of data; initially every valid bit is N.]
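A short sketch of how the 5/3/2 split above decomposes a 10-bit address; the field widths are hard-coded to this example:

```c
#include <stdint.h>
#include <stdio.h>

/* Decode a 10-bit byte address for the example cache:
 * 8 entries x 4 bytes/block -> 2-bit block offset, 3-bit index, 5-bit tag. */
static void decode(uint16_t addr)
{
    uint16_t offset = addr & 0x3;          /* bits 1..0 */
    uint16_t index  = (addr >> 2) & 0x7;   /* bits 4..2 */
    uint16_t tag    = (addr >> 5) & 0x1F;  /* bits 9..5 */
    printf("addr %3u -> tag %u, index %u, offset %u\n", addr, tag, index, offset);
}

int main(void)
{
    decode(88);    /* tag 00010, index 110, offset 00 -> entry 6 */
    decode(104);   /* tag 00011, index 010, offset 00 -> entry 2 */
    return 0;
}
```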

Cache Example: Access 1

  Word address   Binary address   Hit/miss   Cache entry
  88             00010 110 00     Miss       110

  Index   V   Tag     Data
  000     N
  001     N
  010     N
  011     N
  100     N
  101     N
  110     Y   00010   Mem[88-91]
  111     N

Cache Example: Access 2

  Word address   Binary address   Hit/miss   Cache entry
  104            00011 010 00     Miss       010

  Index   V   Tag     Data
  000     N
  001     N
  010     Y   00011   Mem[104-107]
  011     N
  100     N
  101     N
  110     Y   00010   Mem[88-91]
  111     N

Cache Example: Accesses 3, 4

  Word address   Binary address   Hit/miss   Cache entry
  88             00010 110 00     Hit        110
  104            00011 010 00     Hit        010

  Index   V   Tag     Data
  000     N
  001     N
  010     Y   00011   Mem[104-107]
  011     N
  100     N
  101     N
  110     Y   00010   Mem[88-91]
  111     N

Cache Example: Accesses 5, 6, 7

  Word address   Binary address   Hit/miss   Cache entry
  64             00010 000 00     Miss       000
  12             00000 011 00     Miss       011
  64             00010 000 00     Hit        000

  Index   V   Tag     Data
  000     Y   00010   Mem[64-67]
  001     N
  010     Y   00011   Mem[104-107]
  011     Y   00000   Mem[12-15]
  100     N
  101     N
  110     Y   00010   Mem[88-91]
  111     N

Cache Example: Access 8

  Word address   Binary address   Hit/miss   Cache entry
  72             00010 010 00     Miss       010

  Index   V   Tag     Data
  000     Y   00010   Mem[64-67]
  001     N
  010     Y   00010   Mem[72-75]
  011     Y   00000   Mem[12-15]
  100     N
  101     N
  110     Y   00010   Mem[88-91]
  111     N

Direct Mapped Cache: 1 × 4-byte Data

[Figure: a 32-bit address (bit positions 31..0) split into a 16-bit tag, a 14-bit index and a 2-bit byte offset; the index selects one of 16K entries, each holding a valid bit, a 16-bit tag and 32 bits of data; the stored tag is compared with the address tag to generate the hit signal.]
Block offset = 2 bits, since 2^2 = 4 bytes.
Index = 14 bits, since 2^14 = 16K, the number of cache entries.
What kind of locality are we not taking advantage of?
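The field widths on this and the next slide follow directly from the cache geometry. A small sketch that recomputes them (32-bit addresses assumed, as in the figures; helper names are illustrative):

```c
#include <stdio.h>

/* log2 of a power of two (caller must pass a power of two). */
static int log2u(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* For a direct-mapped cache: offset bits come from the block size,
 * index bits from the number of entries, and the tag gets the rest. */
static void field_widths(unsigned num_entries, unsigned block_bytes)
{
    int offset = log2u(block_bytes);
    int index  = log2u(num_entries);
    int tag    = 32 - index - offset;
    printf("%u entries x %u-byte blocks: tag %d, index %d, offset %d\n",
           num_entries, block_bytes, tag, index, offset);
}

int main(void)
{
    field_widths(16 * 1024, 4);    /* the cache above: tag 16, index 14, offset 2      */
    field_widths(4 * 1024, 16);    /* the cache on the next slide: tag 16, index 12, offset 4 */
    return 0;
}
```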

Direct Mapped Cache: 4 × 4-byte Data

Block offset = 4 bits (2^4 = 16 bytes) = 2 bits of word offset + 2 bits of byte offset.
Index = 12 bits (2^12 = 4K entries).
[Figure: a 32-bit address split into a 16-bit tag, a 12-bit index, a 2-bit word (block) offset and a 2-bit byte offset; the index selects one of 4K entries, each holding a valid bit, a 16-bit tag and 128 bits of data; the tag comparison generates the hit signal and the word offset drives a multiplexer that selects one of the four 32-bit words.]

Block Size Consideration

The cache with 4-byte blocks had 16K entries, a total of 64KB of data.
The cache with 16-byte blocks had 4K entries, also a total of 64KB of data.
Thus, both caches can accommodate the same amount of data.

  Program   Block size (bytes)   Instruction miss rate   Data miss rate   Effective combined miss rate
  gcc       4                    6.1%                    2.1%             5.4%
  gcc       16                   2.0%                    1.7%             1.9%
  spice     4                    1.2%                    1.3%             1.2%
  spice     16                   0.3%                    0.6%             0.4%

Larger blocks reduce the miss rate thanks to spatial locality.
But for a fixed-size cache, larger blocks mean fewer of them, hence more competition for entries, which may increase the miss rate.
A larger miss penalty may also override the benefit of a reduced miss rate.
Keep in mind: Average access time = hit time + (1 - hit ratio) × miss penalty.

Main Memory Supporting Caches

DRAM memory has a fixed width of read/write operations and is connected to the CPU (and cache) by a fixed-width clocked bus.
Assume the following for reading from main memory:
- 1 bus cycle for the address transfer,
- 15 bus cycles per DRAM access,
- 1 bus cycle per data transfer to the cache.
For a 16-byte block and a 4-byte-wide DRAM bus:
  miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles,
  bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes/cycle.
Although caches would prefer low-latency memory, it is generally easier to improve memory bandwidth with a new memory organization than it is to reduce memory latency.
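A small sketch that reproduces the miss penalty and bandwidth above, using the bus timing assumptions from this slide:

```c
#include <stdio.h>

/* One-word-wide (4-byte) bus: each of the block's words pays one DRAM
 * access and one bus transfer; the address transfer is paid once. */
int main(void)
{
    int addr_cycles = 1;      /* address transfer             */
    int dram_cycles = 15;     /* per DRAM access              */
    int xfer_cycles = 1;      /* per 4-byte transfer to cache */
    int block_bytes = 16;
    int words       = block_bytes / 4;

    int miss_penalty = addr_cycles + words * dram_cycles + words * xfer_cycles;
    printf("miss penalty = %d bus cycles\n", miss_penalty);        /* 65    */
    printf("bandwidth    = %.2f bytes/cycle\n",
           (double)block_bytes / miss_penalty);                    /* ~0.25 */
    return 0;
}
```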

Wider Main Memory Bus

[Figure: (a) the one-word-wide memory organization with a 4-byte-wide bus, miss penalty = 65 cycles and bandwidth ≈ 0.25 bytes/cycle; (b) a 16-byte-wide bus and memory, with a multiplexor between the cache and the CPU.]
With a 16-byte-wide bus: miss penalty = 1 + 15 + 1 = 17 cycles, throughput = 16 bytes / 17 cycles ≈ 0.94 bytes/cycle.
The CPU still accesses one word at a time, so a multiplexer is needed.

Wider Main Memory Bus & Level-2 Cache

But the multiplexer is on the critical timing path. A level-2 cache helps, since the multiplexing is then between the level-1 and level-2 caches and no longer on the critical timing path.

Interleaved Memory Organization

[Figure: the CPU and cache connect over a 4-byte bus to four memory banks, bank 0 through bank 3.]
This is a four-way interleaved memory with a 4-byte memory bus, which saves wires in the bus.
Miss penalty = 1 + 15 + 4 × 1 = 20 cycles, throughput = 16 bytes / 20 cycles = 0.8 bytes/cycle.

Multilevel Caches

Level-1 (primary) cache: attached to the CPU; small, but very fast.
Level-2 (secondary) cache: services misses from the level-1 cache; larger and slower, but still faster than main memory.
Main memory services level-2 cache misses. Some high-end systems include a level-3 cache.
Level-1 cache: focus on minimal hit time.
Level-2 cache: focus on a low miss rate to avoid main memory accesses; its hit time has less overall impact.

Multilevel Caches (cont.)

Increased logic density allows on-chip caches with the processor:
- internal cache: level 1,
- internal cache: level 2,
- external or internal cache: level 3.
Unified cache, i.e. one cache for both instructions and data:
- balances the load between instruction and data fetches,
- only one cache needs to be designed and implemented.
Split cache: a data cache (d-cache) and an instruction cache (i-cache):
- essential for pipelined and other advanced architectures.
Level-1 caches are normally split caches.
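To see why the level-2 cache can focus on miss rate rather than hit time, it helps to unfold the average access time across the two levels. A sketch with made-up but plausible numbers (none of these figures come from the slides):

```c
#include <stdio.h>

/* Two-level average memory access time:
 *   AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty)
 * The L2 hit time is paid only on the small fraction of accesses that miss
 * in L1, so a somewhat slower but larger (lower-miss-rate) L2 pays off. */
int main(void)
{
    double l1_hit = 1.0,  l1_miss_rate = 0.05;   /* assumed values */
    double l2_hit = 10.0, l2_miss_rate = 0.20;   /* assumed values */
    double mem_penalty = 100.0;                  /* assumed value  */

    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);
    printf("AMAT = %.2f cycles\n", amat);        /* 1 + 0.05 * (10 + 20) = 2.5 */
    return 0;
}
```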

Associative Caches

In addition to direct mapped, there are two more cache organizations.
Fully associative caches:
- allow a given block to go into any cache entry,
- require all entries to be searched at once,
- one tag comparator per entry (expensive).
n-way set associative caches:
- each set contains n entries (blocks),
- the block number determines which set in the cache: set number = (block number) mod (number of sets),
- only the entries in that set are searched, all at once,
- n comparators (less expensive).

Illustration of Cache Organizations

[Figure: illustration of cache organizations, including a 2-way set associative cache.]
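A sketch of the n-way lookup described above (geometry, structure and names are illustrative): index the set with the modulo rule, then compare the tag against every entry of that set.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256      /* illustrative geometry */
#define NUM_WAYS 4

struct line { bool valid; uint32_t tag; /* data omitted */ };
static struct line cache[NUM_SETS][NUM_WAYS];

/* n-way set associative lookup: only the ways of one set are searched
 * (in hardware, all NUM_WAYS tag comparisons happen in parallel). */
bool lookup(uint32_t block_addr)
{
    uint32_t set = block_addr % NUM_SETS;    /* set number = block mod #sets */
    uint32_t tag = block_addr / NUM_SETS;    /* remaining high-order bits    */
    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;                     /* hit  */
    return false;                            /* miss */
}
```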

4-Way Set Associative Cache

There are 256 sets.
[Figure 5.18: a 4-way set associative cache built out of 4 direct-mapped caches. The address is split into a 22-bit tag, an 8-bit index selecting one of the 256 sets, and a 2-bit block offset; each of the four ways holds a valid bit, a 22-bit tag and 32 bits of data, and a 4-to-1 multiplexor selects the data of the way that hit.]

Comparing Ways of Finding a Block

  Associativity            Location method                                  Tag comparisons
  Direct mapped            Index the entry                                  1
  n-way set associative    Index a set, then search entries within the set  n
  Fully associative        Search all entries                               Number of entries

Reducing the number of comparisons reduces the time needed to determine whether an access is a hit or a miss.

Replacement Policy

Direct mapped cache: no choice.
Set associative or fully associative caches: prefer a non-valid entry, if there is one; otherwise, choose among the entries in the set.
Least-recently used (LRU) replacement policy: choose the entry unused for the longest time. Simple for 2-way, manageable for up to 16-way, too hard beyond that.
Random replacement policy: gives approximately the same performance as LRU for high associativity.

Cache Writing on Hit

A write is more complex and takes longer than a read, since:
- the write and the tag comparison cannot proceed simultaneously,
- only a portion of the block has to be updated.
Two approaches on a cache hit: write through and write back.
Write through technique: update both main memory and the block in the cache. It is simple to implement, but it makes writes take longer and does not take advantage of the cache.
Improvement to this technique, a special write buffer: it holds data waiting to be written to memory, the CPU continues immediately, and the CPU stalls on a write only if the write buffer is already full.
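Returning to the replacement policy at the top of this slide: for a 2-way set associative cache, LRU state is just one bit per set, which is why 2-way LRU is called simple. A minimal sketch (structure and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* For a 2-way set, LRU state is a single bit per set: which way was
 * used less recently. */
struct way { bool valid; uint32_t tag; };
struct set { struct way w[2]; int lru; /* index of the least recently used way */ };

/* After an access that hit (or filled) way 'used', the other way
 * becomes the least recently used one. */
void touch(struct set *s, int used) { s->lru = 1 - used; }

/* Choose a victim on a miss: prefer an invalid way, otherwise evict LRU. */
int victim(const struct set *s)
{
    if (!s->w[0].valid) return 0;
    if (!s->w[1].valid) return 1;
    return s->lru;
}
```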

Cache Writing on Hit (cont.)

Write back technique: just update the block in the cache and keep track of whether or not a block has been updated; each entry has a dirty bit for that purpose. The cache and memory are then inconsistent. When a dirty block is to be replaced, write it back to memory; a write buffer can also be used here.
Write through is usually found today only in level-1 data caches backed by a level-2 cache that uses write back.

Cache Writing on Miss

What should happen on a write miss?
Write-allocate: load the block into the cache and update it in the cache (good if more writes to the location follow).
No-write-allocate: write immediately only to main memory.
Write-through usually uses no-write-allocate, although write-allocate is an alternative; write-back usually uses write-allocate.
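Putting the write-back and write-allocate choices together, here is a self-contained toy sketch of the store path (a tiny direct-mapped cache over a fake memory array; all sizes and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES   4096
#define BLOCK_BYTES 16
#define NUM_LINES   8

static uint8_t memory[MEM_BYTES];

struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NUM_LINES];

/* Write-back + write-allocate store of one byte. */
static void store_byte(uint32_t addr, uint8_t value)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t index = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;
    struct line *l = &cache[index];

    if (!l->valid || l->tag != tag) {                       /* write miss          */
        if (l->valid && l->dirty) {                         /* evict a dirty block */
            uint32_t old = (l->tag * NUM_LINES + index) * BLOCK_BYTES;
            memcpy(&memory[old], l->data, BLOCK_BYTES);     /* write it back       */
        }
        memcpy(l->data, &memory[block * BLOCK_BYTES], BLOCK_BYTES); /* allocate    */
        l->valid = true;
        l->tag   = tag;
        l->dirty = false;
    }
    l->data[addr % BLOCK_BYTES] = value;    /* update only the cache               */
    l->dirty = true;                        /* memory is stale until eviction      */
}

int main(void)
{
    store_byte(88, 7);    /* miss: allocate the block, then write in the cache */
    store_byte(89, 8);    /* hit: same block, cache only                       */
    return 0;
}
```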

Intel Core i7 Cache Hierarchy

[Figure: the processor package contains cores 0 through 3; each core has its own registers, L1 i-cache and d-cache, and L2 unified cache; an L3 unified cache is shared by all cores and connects to main memory.]
L1: i-cache and d-cache, 32 KB each, 8-way, access: 4 cycles.
L2: unified cache, 256 KB, 8-way, access: 11 cycles.
L3: unified cache, 8 MB, 16-way, access: 30-40 cycles.
Block size: 64 bytes for all caches.