
Hitting the Memory Wall: Implications of the Obvious

Wm. A. Wulf and Sally A. McKee
{wulf mckee}@virginia.edu

Computer Science Report No. CS-94-48
December 1994

Appeared in Computer Architecture News, 23(1):20-24, March 1995.

Hitting the Memory Wall: Implications of the Obvious

Wm. A. Wulf and Sally A. McKee
Department of Computer Science
University of Virginia
{wulf mckee}@virginia.edu
December 1994

This brief note points out something obvious: something the authors knew without really understanding. With apologies to those who did understand, we offer it to those others who, like us, missed the point.

We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed. Each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. How big and how soon? The answers to these questions are what the authors had failed to appreciate.

To get a handle on the answers, consider an old friend: the equation for the average time to access memory, where t_c and t_m are the cache and DRAM access times and p is the probability of a cache hit:

    t_avg = p * t_c + (1 - p) * t_m

We want to look at how the average access time changes with technology, so we'll make some conservative assumptions; as you'll see, the specific values won't change the basic conclusion of this note, namely that we are going to hit a wall in the improvement of system performance unless something basic changes.
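To make the equation concrete, here is a minimal Python sketch that evaluates t_avg for a few hit rates. The particular values (a hit costing 1 cycle, a miss costing 4 cycles) are illustrative assumptions for this sketch, not measurements from the paper:

    # Average memory access time: t_avg = p*t_c + (1 - p)*t_m.
    # t_c = 1 cycle and t_m = 4 cycles are illustrative assumptions.
    def t_avg(p, t_c=1.0, t_m=4.0):
        return p * t_c + (1.0 - p) * t_m

    for p in (0.90, 0.99, 0.998):
        print(f"hit rate {p:.1%}: t_avg = {t_avg(p):.3f} cycles")

Even at a 99.8% hit rate the miss term never vanishes; as t_m grows relative to t_c, it eventually dominates the average.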

First let's assume that the cache speed matches that of the processor, and specifically that it scales with the processor speed. This is certainly true for on-chip cache, and allows us to easily normalize all our results in terms of instruction cycle times (essentially saying t_c = 1 cpu cycle). Second, assume that the cache is perfect. That is, the cache never has a conflict or capacity miss; the only misses are the compulsory ones. Thus (1 - p) is just the probability of accessing a location that has never been referenced before (one can quibble and adjust this for line size, but this won't affect the conclusion, so we won't make the argument more complicated than necessary).

Now, although (1 - p) is small, it isn't zero. Therefore as t_c and t_m diverge, t_avg will grow and system performance will degrade. In fact, it will hit a wall.

In most programs, 20-40% of the instructions reference memory [Hen90]. For the sake of argument let's take the lower number, 20%. That means that, on average, during execution every 5th instruction references memory. We will hit the wall when t_avg exceeds 5 instruction times. At that point system performance is totally determined by memory speed; making the processor faster won't affect the wall-clock time to complete an application.

Alas, there is no easy way out of this. We have already assumed a perfect cache, so a bigger/smarter one won't help. We're already using the full bandwidth of the memory, so prefetching or other related schemes won't help either. We can consider other things that might be done, but first let's speculate on when we might hit the wall.

Assume the compulsory miss rate is 1% or less [Hen90] and that the next level of the memory hierarchy is currently four times slower than cache. If we assume that DRAM speeds increase by 7% per year [Hen90] and use Baskett's estimate that microprocessor performance is increasing at the rate of 80% per year [Bas91], the average number of cycles per memory access will be 1.52 in 2000, 8.25 in 2005, and 98.8 in 2010. Under these assumptions, the wall is less than a decade away.
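That projection is easy to reproduce. Below is a minimal sketch, assuming t_c stays at 1 processor cycle, a starting miss penalty of 4 cycles, a 1% compulsory miss rate, and a processor/DRAM gap that widens by a factor of 1.80/1.07 per year (80% processor growth versus 7% DRAM growth); the 1995 base year is our assumption:

    # Project t_avg = p*t_c + (1 - p)*t_m as the miss penalty, measured in
    # processor cycles, grows with the processor/DRAM performance gap.
    p      = 0.99          # 1% compulsory miss rate
    t_c    = 1.0           # a hit costs one processor cycle
    t_m0   = 4.0           # a miss currently costs 4 cycles
    growth = 1.80 / 1.07   # CPU +80%/year vs. DRAM +7%/year

    for year in (2000, 2005, 2010):
        t_m = t_m0 * growth ** (year - 1995)
        print(year, round(p * t_c + (1 - p) * t_m, 2))
    # Prints values close to the 1.52, 8.25, and 98.8 quoted above.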

Figures 1-3 explore various possibilities, showing projected trends for a set of perfect or near-perfect caches. All our graphs assume that DRAM performance continues to increase at an annual rate of 7%. The horizontal axis is various CPU/DRAM performance ratios, and the lines at the top indicate the dates these ratios occur if microprocessor performance (µ) increases at rates of 50% and 100% respectively. Figure 1 assumes that cache misses are currently four times slower than hits; Figure 1(a) considers compulsory cache miss rates of less than 1%, while Figure 1(b) shows the same trends for caches with more realistic miss rates of 2-10%. Figure 2 is a counterpart of Figure 1, but assumes that the current cost of a cache miss is 16 times that of a hit. Figure 3 provides a closer look at the expected impact on average memory access time for one particular value of µ, Baskett's estimated 80%.

Even if we assume a cache hit rate of 99.8% and use the more conservative cache miss cost of 4 cycles as our starting point, performance hits the 5-cycles-per-access wall in 11-12 years. At a hit rate of 99% we hit the same wall within the decade, and at 90%, within 5 years. Note that changing the starting point (the current miss/hit cost ratio) and the cache miss rates doesn't change the trends: if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles. Under each scenario, system speed is dominated by memory performance.

Over the past thirty years there have been several predictions of the imminent cessation of the rate of improvement in computer performance. Every such prediction was wrong. They were wrong because they hinged on unstated assumptions that were overturned by subsequent events. So, for example, the failure to foresee the move from discrete components to integrated circuits led to a prediction that the speed of light would limit computer speeds to several orders of magnitude slower than they are now.
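The "years until the wall" figures follow from the same model: find the first n for which t_avg reaches 5 cycles. A sketch under the same assumptions as above (starting miss penalty of 4 cycles, gap growing by 1.80/1.07 per year):

    import math

    # Solve p + (1 - p) * 4 * g**n = 5 for n, the number of years until the
    # average access time hits the 5-cycle wall. Assumptions as in the text.
    g = 1.80 / 1.07
    wall = 5.0

    for p in (0.998, 0.99, 0.90):
        n = math.log((wall - p) / (4.0 * (1.0 - p)), g)
        print(f"hit rate {p:.1%}: about {n:.1f} years to the wall")
    # Roughly 12, 9, and 4.5 years, matching the estimates above.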

Our prediction of the memory wall is probably wrong too, but it suggests that we have to start thinking out of the box. All the techniques that the authors are aware of, including ones we have proposed [McK94, McK94a], provide one-time boosts to either bandwidth or latency. While these delay the date of impact, they don't change the fundamentals.

The most convenient resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology and could not affect its development in any case; the only contribution we can make is to look for architectural solutions. These are probably all bogus, but the discussion must start somewhere:

- Can we drive the number of compulsory misses to zero? If we can't fix t_m, then the only way to make caches work is to drive p to 100%, which means eliminating the compulsory misses. If all data were initialized dynamically, for example, possibly the compiler could generate special "first write" instructions. It is harder for us to imagine how to drive the compulsory misses for code to zero.

- Is it time to forgo the model that access time is uniform to all parts of the address space? It is false for DSM and other scalable multiprocessor schemes, so why not for single processors as well? If we do this, can the compiler explicitly manage a smaller amount of higher speed memory?

- Are there any new ideas for how to trade computation for storage? Alternatively, can we trade space for speed? DRAM keeps giving us plenty of the former.

- Ancient machines like the IBM 650 and Burroughs 205 used magnetic drum as primary memory and had clever schemes for reducing rotational latency to essentially zero; can we borrow a page from either of those books?

As noted above, the right solution to the problem of the memory wall is probably something that we haven't thought of, but we would like to see the discussion engaged. It would appear that we do not have a great deal of time.

[Figure 1: Trends for a Current Cache Miss/Hit Cost Ratio of 4. Panels (a) and (b) plot average cycles per memory access (log scale) against the cache miss/hit cost ratio; panel (a) uses near-perfect hit rates (p = 99.0% and higher), panel (b) more realistic hit rates (p = 90.0% and higher). The top axis marks the dates at which each ratio is reached for µ = 50% and µ = 100%.]

[Figure 2: Trends for a Current Cache Miss/Hit Cost Ratio of 16. Same axes and hit rates as Figure 1.]

[Figure 3: Average Access Cost for 80% Annual Increase in Processor Performance. Average cycles per memory access by year for hit rates p = 90.0%, 99.0%, and 99.8%.]

References

[Bas91] F. Baskett, Keynote address, International Symposium on Shared Memory Multiprocessing, April 1991.

[Hen90] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990.

[McK94] S.A. McKee, et al., "Experimental Implementation of Dynamic Access Ordering", Proc. 27th Hawaii International Conference on System Sciences, Maui, HI, January 1994.

[McK94a] S.A. McKee, et al., "Increasing Memory Bandwidth for Vector Computations", Proc. Conference on Programming Languages and System Architectures, Zurich, March 1994.