Lecture 20: Memory Hierarchy: Main Memory and Enhancing its Performance


Lecture 20: Memory Hierarchy: Main Memory and Enhancing its Performance
Professor Alvin R. Lebeck
Computer Science 220, Fall 1999

HW #4 due November 12
Projects
Finish reading Chapter 5
Grinch-Like Stuff

Review: Reducing Miss Penalty Summary

Five techniques:
Read priority over write on miss
Subblock placement
Early restart and critical word first on miss
Non-blocking caches (hit under miss)
Second-level cache

Can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between.

Review: Improving Cache Performance

1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Fast hit times via small and simple caches
Fast hits via avoiding virtual address translation
Fast hits via pipelined writes
Fast writes on misses via small subblocks
Fast hits by using multiple ports
Fast hits by using value prediction

Review: Cache Optimization Summary

Technique                              MR  MP  HT  Complexity
Larger block size                      +   -       0
Higher associativity                   +       -   1
Victim caches                          +           2
Pseudo-associative caches              +           2
HW prefetching of instr/data           +           2
Compiler-controlled prefetching        +           3
Compiler techniques to reduce misses   +           0
Priority to read misses                    +       1
Subblock placement                         +   +   1
Early restart & critical word first        +       2
Non-blocking caches                        +       3
Second-level caches                        +       2
Small & simple caches                  -       +   0
Avoiding address translation                   +   2
Pipelining writes                              +   1

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)

Main Memory Background

Performance of main memory:
Latency: affects cache miss penalty
  Access time: time between the request and when the word arrives
  Cycle time: minimum time between requests
Bandwidth: affects I/O and the large-block miss penalty (L2)

Main memory is DRAM: Dynamic Random Access Memory
Dynamic since it must be refreshed periodically (~8 ms)
Addresses divided into 2 halves (memory as a 2D matrix):
  RAS, or Row Access Strobe
  CAS, or Column Access Strobe

Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor/bit)
Address not divided
Capacity ratio DRAM/SRAM: 4-8x; cost and cycle-time ratio SRAM/DRAM: 8-16x
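To make the RAS/CAS split concrete, here is a minimal C sketch. It sends the high half of the address as the row and the low half as the column; the 12-bit row and 12-bit column widths are an assumption for a hypothetical 16M-address part, not figures from the lecture.

#include <stdio.h>
#include <stdint.h>

/* Split a DRAM word address into the two halves driven over the shared
 * address pins: the high half with RAS, the low half with CAS.
 * ROW_BITS and COL_BITS are illustrative assumptions. */
#define ROW_BITS 12
#define COL_BITS 12

int main(void) {
    uint32_t addr = 0xABCDE;  /* example word address */
    uint32_t row  = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);
    uint32_t col  = addr & ((1u << COL_BITS) - 1);
    printf("RAS row = 0x%03X, CAS column = 0x%03X\n", row, col);
    return 0;
}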

Main Memory Performance

Simple: CPU, cache, bus, and memory are all one word wide
Wide: CPU/mux one word; mux/cache, bus, and memory N words wide (Alpha: 64 bits and 256 bits)
Interleaved: CPU, cache, and bus one word wide; memory organized as N modules (e.g., 4 modules); the example below is word interleaved

[Figure: the three organizations, each showing CPU, cache ($), bus, and memory; the wide organization inserts a mux between the CPU and the wide cache]

Main Memory Performance

Timing model: 1 cycle to send the address, 6 cycles access time, 1 cycle to send a word of data; cache block is 4 words
Simple miss penalty = 4 x (1 + 6 + 1) = 32 cycles
Wide miss penalty = 1 + 6 + 1 = 8 cycles
Interleaved miss penalty = 1 + 6 + 4 x 1 = 11 cycles (a C sketch of this arithmetic follows below)

Word-interleaved placement across 4 banks:
Bank 0: words 0, 4, 8, 12
Bank 1: words 1, 5, 9, 13
Bank 2: words 2, 6, 10, 14
Bank 3: words 3, 7, 11, 15
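The miss-penalty arithmetic above is easy to check in code. A minimal C sketch, assuming the slide's timing model (1-cycle address, 6-cycle access, 1 cycle per word transferred, 4-word blocks); the function names are mine, not from the lecture.

#include <stdio.h>

enum { ADDR = 1, ACCESS = 6, XFER = 1, BLOCK_WORDS = 4 };

/* Simple: each of the 4 words pays address + access + transfer. */
static int simple_mp(void)      { return BLOCK_WORDS * (ADDR + ACCESS + XFER); }
/* Wide: the whole block moves in one address/access/transfer round. */
static int wide_mp(void)        { return ADDR + ACCESS + XFER; }
/* Interleaved: banks access in parallel, but the one-word bus
 * serializes the four transfers. */
static int interleaved_mp(void) { return ADDR + ACCESS + BLOCK_WORDS * XFER; }

int main(void) {
    printf("simple = %d, wide = %d, interleaved = %d cycles\n",
           simple_mp(), wide_mp(), interleaved_mp());  /* 32, 8, 11 */
    return 0;
}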

Independent Memory Banks

Memory banks for independent accesses vs. faster sequential accesses:
Multiprocessor
I/O
Miss under miss, non-blocking caches

Superbank: all of memory active on one block transfer
Bank: portion within a superbank that is word interleaved
Address fields: Superbank Number | Bank Number | Bank Offset (a sketch of this decomposition follows below)

Independent Memory Banks

How many banks? Number of banks >= number of clocks to access a word in a bank
Needed for sequential accesses; otherwise the CPU returns to the original bank before it has the next word ready
DRAM trend is toward deep, narrow chips => fewer chips => harder to build banks out of chips => put the banks inside the chips (internal banks)
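A minimal C sketch of the address decomposition drawn above, following the slide's field order (superbank in the high bits, bank offset in the low bits). The field widths, 2 bank bits and 4 offset bits, are illustrative assumptions, not values from the lecture.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4  /* assumed: 16 words per bank */
#define BANK_BITS   2  /* assumed: 4 banks per superbank */

int main(void) {
    uint32_t addr      = 0x5A;  /* arbitrary example word address */
    uint32_t offset    = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t bank      = (addr >> OFFSET_BITS) & ((1u << BANK_BITS) - 1);
    uint32_t superbank = addr >> (OFFSET_BITS + BANK_BITS);
    printf("addr 0x%02X -> superbank %u, bank %u, offset %u\n",
           addr, superbank, bank, offset);
    return 0;
}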

Avoiding Bank Conflicts

Lots of banks:

int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];

Even with 128 banks: since the inner-loop stride of 512 words is a multiple of 128, every access hits the same bank, a structural hazard.
SW fix: loop interchange, or declare the array so its row size is not a power of 2
HW fix: use a prime number of banks
  bank number = address mod number of banks
  address within bank = address / number of banks
  ...but a modulo and a divide on every memory access?
  Instead: address within bank = address mod number of words in a bank
  Bank number (mod 3, 7, or 31 banks)? Easy if there are 2^N words per bank

Fast Bank Number: Chinese Remainder Theorem

As long as two sets of integers a_i and b_i follow these rules:
  b_i = x mod a_i, 0 <= b_i < a_i, 0 <= x < a_0 * a_1 * a_2
and a_i and a_j are co-prime for i != j, then the integer x has exactly one solution (an unambiguous mapping):
  bank number = b_0, number of banks = a_0 (= 3 in the example below)
  address within bank = b_1, number of words per bank = a_1 (= 8 in the example below)
For an N-word address space 0 to N-1: a prime number of banks, and a power-of-2 number of words per bank.
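A minimal C sketch of the CRT mapping above, using the slide's example of 3 banks (prime) and 8 words per bank (a power of 2), so the within-bank address is just the low 3 bits. The loop prints the same modulo-interleaved layout tabulated on the next slide.

#include <stdio.h>

enum { BANKS = 3, WORDS_PER_BANK = 8 };

int main(void) {
    /* bank number = addr mod 3; address within bank = addr mod 8,
     * which for a power-of-2 bank size is a cheap bit mask. */
    for (int addr = 0; addr < BANKS * WORDS_PER_BANK; addr++) {
        int bank   = addr % BANKS;
        int within = addr & (WORDS_PER_BANK - 1);
        printf("addr %2d -> bank %d, word %d\n", addr, bank, within);
    }
    return 0;
}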

Prime Mapping Example (3 banks, 8 words per bank)

                 Sequential Interleaved     Modulo Interleaved
Bank:              0     1     2              0     1     2
Word 0:            0     1     2              0    16     8
Word 1:            3     4     5              9     1    17
Word 2:            6     7     8             18    10     2
Word 3:            9    10    11              3    19    11
Word 4:           12    13    14             12     4    20
Word 5:           15    16    17             21    13     5
Word 6:           18    19    20              6    22    14
Word 7:           21    22    23             15     7    23

Fast Memory Systems: DRAM Specific

Page mode: hold the row open and make multiple CAS (column) accesses; goes by several names
  64 Mbit DRAM: cycle time = 100 ns, page-mode access = 20 ns
New DRAMs to address the processor-memory gap; what will they cost, and will they survive?
  Synchronous DRAM: provide a clock signal to the DRAM; transfers are synchronous to the system clock
  RAMBUS: reinvents the DRAM interface (Intel will use it)
    Each chip is a module, vs. a slice of memory
    Short bus between CPU and chips
    Does its own refresh
    Variable amount of data returned
    1 byte / 2 ns (500 MB/s per chip)
  Cached DRAM (CDRAM): keep an entire row in SRAM
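To see what page mode buys, a back-of-the-envelope C sketch using the slide's 64 Mbit figures (100 ns full cycle, 20 ns per page-mode access); the 8-word burst length is my assumption.

#include <stdio.h>

enum { CYCLE_NS = 100, PAGE_NS = 20, WORDS = 8 };

int main(void) {
    /* Without page mode every word pays a full RAS+CAS cycle; with page
     * mode the row is opened once and later words pay only a CAS. */
    int no_page   = WORDS * CYCLE_NS;                 /* 800 ns */
    int with_page = CYCLE_NS + (WORDS - 1) * PAGE_NS; /* 240 ns */
    printf("no page mode: %d ns, page mode: %d ns\n", no_page, with_page);
    return 0;
}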

Main Memory Summary

Big DRAM + small SRAM = cost effective
  Cray C-90 uses all SRAM (how many sold?)
Wider memory
Interleaved memory: for sequential or independent accesses
Avoiding bank conflicts: SW and HW
DRAM-specific optimizations: page mode and specialty DRAMs (CDRAM)
Niche memory or main memory?
  e.g., video RAM for frame buffers: DRAM + a fast serial output
IRAM: do you know what it is?

Next Time

Memory hierarchy and virtual memory