18-447 Lecture 14: Memory Hierarchy
James C. Hoe, Department of ECE, Carnegie Mellon University


Your goal today: understand memory system and memory hierarchy design in the big picture.

Housekeeping
- Notices: HW 3 past due; Lab 3 due the week of 3/26; Handout #11: HW 3 solutions; Lab and HW on hiatus over Spring Break
- Readings: P&H Ch. 5 for the next many lectures

Wishful Memory
So far we imagined that a program owns a contiguous 4GB private memory, and that the program can access anywhere in 1 processor cycle. We are in good company: Burks, Goldstine, and von Neumann, 1946.

The Reality
- Can't afford, and don't need, as much memory as the size of the address space (think 64-bit ISAs); RV32I promised a 4GB address space, not 4GB of memory
- Can't find a memory technology that is affordable in GBytes and also cycles at GHz
- Most systems multitask several programs

But magic memory is nevertheless a useful approximation of reality, due to:
- memory hierarchy: large and fast (covered first)
- virtual memory: contiguous and private (covered later)

Principles behind the solution

The Law of Storage
Bigger is slower:
- SRAM: 512 Bytes @ sub-nsec
- SRAM: KByte~MByte @ nsec
- DRAM: GByte @ ~50 nsec
- hard disk: TByte @ ~10 msec

Faster is more expensive (dollars and chip area):
- SRAM: ~$10K per GByte
- DRAM: ~$10 per GByte
- hard disk: ~$0.1 per GByte

How do we make memory bigger, faster, and cheaper?

Memory Locality
Typical programs have strong locality in memory references (we put it there: loops, structs, ...):
- temporal: after accessing A, how many other distinct addresses are accessed before A is accessed again?
- spatial: after accessing A, how many other distinct addresses are accessed before a nearby B is accessed?

Corollary: a program with strong temporal and spatial locality must be accessing only a compact working set at a time. The C sketch below illustrates both kinds.
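A minimal sketch of the two kinds of locality; the array name and sizes are illustrative, not from the lecture:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    int sum = 0;                 /* sum and i are touched every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        a[i] = i;                /* consecutive addresses a[0], a[1], ...: spatial locality */
    for (int i = 0; i < N; i++)
        sum += a[i];             /* a[] revisited soon after being written: temporal locality */
    printf("sum = %d\n", sum);   /* the working set (a[], sum, i) is compact, so it caches well */
    return 0;
}
```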

Memoization
If something is costly to compute, save the result to be reused.
- With strong reuse, storing just a small number of frequently used results can avoid most recomputations.
- With poor reuse, we end up storing a large number of different results that are rarely or never reused, and locating the needed result among the many stored ones can itself become as expensive as recomputing it.
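A minimal example of memoization in C; Fibonacci is my illustration, not the lecture's:

```c
#include <stdio.h>

#define MAXN 64
static long memo[MAXN];      /* saved results; 0 means "not computed yet" here */

long fib(int n) {
    if (n <= 1) return n;
    if (memo[n] != 0)
        return memo[n];      /* strong reuse: most calls find a saved result */
    memo[n] = fib(n - 1) + fib(n - 2);   /* costly to compute, so save it once */
    return memo[n];
}

int main(void) {
    /* without the memo table this takes O(2^n) calls; with it, O(n) */
    printf("fib(40) = %ld\n", fib(40));
    return 0;
}
```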

Cost Amortization
- overhead: one-time cost to set up
- unit cost: cost for each unit of work
- total cost = overhead + unit cost x N
- average cost = total cost / N = (overhead / N) + unit cost

That last line is the essence of amortization. In memoization, a high up-front cost to compute a result once is no problem if the result is reused many times.
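A quick numeric illustration of the average-cost formula; the overhead and unit-cost values are made up:

```c
#include <stdio.h>

int main(void) {
    double overhead = 100.0, unit_cost = 2.0;
    for (long n = 1; n <= 100000; n *= 10) {
        double avg = overhead / n + unit_cost;   /* average cost = overhead/N + unit cost */
        printf("N = %6ld: average cost = %8.3f\n", n, avg);
    }
    /* as N grows, the overhead term vanishes and the average approaches the unit cost */
    return 0;
}
```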

Putting the principles to work

Memory Hierarchy
[Figure: a fast, small level sits on top of a big but slow level; levels get faster per byte toward the top and cheaper per byte toward the bottom.]
Keep what you actively use in the fast, small level; hold what isn't being used in the big but slow level. With strong locality, the hierarchy is effectively as fast as the former and as large as the latter.

Managing Memory Hierarchy
Copy data between levels explicitly and manually:
- vacuum tubes vs. Selectron (von Neumann paper)
- core vs. drum memory in the 1950s
- scratchpad SRAM used on modern embedded and DSP chips
- the register file is a level of the storage hierarchy

Single address space, automatic management:
- as early as ATLAS, 1962
- common in today's fast processors with slow DRAM
- programmers don't need to know about it for typical programs to be both fast and correct

What about atypical programs?

Modern Storage Hierarchy (the memory abstraction)
- regfile: 10~100 words, sub-nsec; managed by manual register spilling
- L1 cache: ~32KB, ~nsec; automatic cache management
- L2 cache: ~512KB~1MB, many nsec; automatic cache management
- L3 cache: .....
- main memory (DRAM): GBytes, ~100 nsec
- swap disk: 100GB~TB, ~10 msec; automatic demand paging between memory and disk

Average Memory Access Time
Memory hierarchy level L_i has an access time of t_i. The perceived access time T_i is longer than t_i:
- there is a chance (hit rate h_i) you find what you want: time t_i
- there is a chance (miss rate m_i) you don't find it: time t_i + T_(i+1)
- h_i + m_i = 1.0

In general:
  T_i = h_i t_i + m_i (t_i + T_(i+1))
  T_i = t_i + m_i T_(i+1)        where T_(i+1) can be thought of as the miss penalty

Note: h_i and m_i are measured over the references that missed at L_(i-1); h_(bottom-most) = 1.0.
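A small sketch computing T_1 bottom-up from the recurrence; the level count and cycle numbers are the P4-style values used two slides below:

```c
#include <stdio.h>

/* t[i]: access time of level i (cycles); m[i]: local miss rate at level i.
   The bottom-most level must have m = 0, i.e., hit rate 1.0. */
double perceived_time(const double *t, const double *m, int levels) {
    double T = t[levels - 1];        /* at the bottom, T = t */
    for (int i = levels - 2; i >= 0; i--)
        T = t[i] + m[i] * T;         /* T_i = t_i + m_i * T_(i+1) */
    return T;
}

int main(void) {
    double t[] = {4, 18, 180};       /* L1, L2, main memory, in cycles */
    double m[] = {0.1, 0.1, 0.0};
    printf("T1 = %.1f cycles\n", perceived_time(t, m, 3));   /* prints 7.6 */
    return 0;
}
```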

T_i = t_i + m_i T_(i+1)
Goal: achieve the desired T_1 within the allowed cost; T_i ~ t_i is not a goal in itself.

Keep m_i low:
- increasing capacity C_i lowers m_i, but increases t_i
- lower m_i by smarter management, e.g., replacement (anticipate what you don't need) and prefetching (anticipate what you will need)

Keep T_(i+1) low:
- reducing t_(i+1) with a faster next-level memory leads to increased cost and/or reduced capacity
- better solved by adding intermediate levels

Memory Hierarchy Design
- DRAM is optimized for capacity per dollar; T_DRAM is essentially the same regardless of capacity
- SRAM is optimized for capacity per latency; different tradeoffs between capacity and latency are possible, t = O(sqrt(capacity))
- The memory hierarchy bridges the difference between CPU speed and DRAM speed:
  - T_pclk ~ T_DRAM: no hierarchy needed
  - T_pclk << T_DRAM: one or more levels of increasingly larger but slower SRAMs to minimize T_1

Intel P4 Example (very fast, very deep pipeline)
90nm, 3.6 GHz
- 16KB L1 D-cache: t_1 = 4 cyc int (9 cyc fp)
- 1024KB L2 D-cache: t_2 = 18 cyc int (18 cyc fp)
- main memory: t_3 = ~50ns, or 180 cyc

- if m_1 = 0.1,  m_2 = 0.1:  T_1 = 7.6,  T_2 = 36
- if m_1 = 0.01, m_2 = 0.01: T_1 = 4.2,  T_2 = 19.8
- if m_1 = 0.05, m_2 = 0.01: T_1 = 5.00, T_2 = 19.8
- if m_1 = 0.01, m_2 = 0.50: T_1 = 5.08, T_2 = 108

Notice: the best-case latency is not 1 cycle, and the worst-case access latency is 300+ cycles, depending on exactly what happens.
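The four scenarios check out against the recurrence; a sketch reproducing them:

```c
#include <stdio.h>

int main(void) {
    double t1 = 4, t2 = 18, t3 = 180;    /* P4 latencies in cycles */
    double cases[4][2] = {{0.10, 0.10}, {0.01, 0.01}, {0.05, 0.01}, {0.01, 0.50}};
    for (int i = 0; i < 4; i++) {
        double m1 = cases[i][0], m2 = cases[i][1];
        double T2 = t2 + m2 * t3;        /* T_2 = t_2 + m_2 * T_3, with T_3 = t_3 */
        double T1 = t1 + m1 * T2;        /* T_1 = t_1 + m_1 * T_2 */
        printf("m1=%.2f m2=%.2f: T1=%.2f T2=%.1f\n", m1, m2, T1, T2);
    }
    /* prints 7.60/36.0, 4.20/19.8, 4.99/19.8, 5.08/108.0; the slide rounds 4.99 to 5.00 */
    return 0;
}
```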

Working Set/Locality/Miss Rate
[Figure: hit rate vs. cache capacity; the hit rate climbs toward 100% as capacity grows past the working set size. Given the capacities C_1 and C_2 marked on the curve, what are m_1 and m_2?]

Aside: Why is DRAM slow?
- DRAM fabrication is at the forefront of VLSI, but it has scaled with Moore's Law in capacity and cost, not speed; between 1980 and 2004, capacity went from 64Kbit to 1024Mbit (exponential, ~55% annual) while latency went from 250ns to 50ns (linear)
- This is a deliberate engineering choice: memory capacity needs to grow linearly with processing speed in a balanced system (Amdahl's Other Law)
- The DRAM/processor speed difference is reconcilable by SRAM cache hierarchies (L1, L2, L3, ...)
- Pareto-optimal faster/smaller/more-costly DRAMs do exist

Don't Forget Bandwidth and Energy
Assume a RISC pipeline at 1GHz with IPC=1:
- 4GB/sec of instruction fetch bandwidth
- 1GB/sec of load and 0.6GB/sec of store bandwidth (if 25% LW and 15% SW, Agerwala & Cocke)
- multiply by the number of cores if multicore

DDR4 provides ~20GB/sec/channel (under a best-case access pattern) at ~10 Watts at full blast. With a memory hierarchy, BW_(i+1) << BW_1, since each level filters most of the traffic to the level below. This is critical for multicore and GPU.
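A back-of-envelope check of those bandwidth numbers, assuming 4-byte instructions and words:

```c
#include <stdio.h>

int main(void) {
    double freq = 1e9, ipc = 1.0, bytes = 4.0;   /* 1GHz, IPC=1, 4B per access */
    double fetch = freq * ipc * bytes;           /* one instruction fetched per cycle */
    double load  = freq * ipc * 0.25 * bytes;    /* 25% of instructions are LW */
    double store = freq * ipc * 0.15 * bytes;    /* 15% of instructions are SW */
    printf("fetch: %.1f GB/s\n", fetch / 1e9);   /* 4.0 GB/s */
    printf("load:  %.1f GB/s\n", load  / 1e9);   /* 1.0 GB/s */
    printf("store: %.1f GB/s\n", store / 1e9);   /* 0.6 GB/s */
    return 0;
}
```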

Now we can talk about caches...
Generically in computing, a cache is any structure that memoizes frequently repeated computation results to save the cost of reproducing them from scratch, e.g., a web cache.

Cache in Computer Architecture
An invisible, automatically managed memory hierarchy:
- a program expects reading M[A] to return the most recently written value, with or without a cache
- the cache keeps copies of frequently accessed DRAM memory locations in a small, fast memory, and services loads/stores from those fast copies when they exist
- this is transparent to the program if memory is idempotent (L13); funny things happen if memory is mmap'ed or can change underneath the cache (e.g., by other cores or DMA)

Cache Interface for Dummies
[Figure: instruction memory port (instruction address in; instruction and valid out) and data memory port (address, write data, MemRead/MemWrite in; read data, valid, ready out). Based on figures from P&H CO&D, copyright 2004 Elsevier, all rights reserved.]

Like the magic memory:
- present the address, R/W command, etc.
- the result or update is valid after a short, fixed latency

Except occasionally the cache needs more time:
- it will become valid/ready eventually
- what to do with the pipeline until then? Stall!!

The Basic Problem
With potentially M = 2^m bytes of memory, how do we keep copies of the most frequently used locations in C bytes of fast storage, where C << M?

Basic issues (intertwined):
(1) when to cache a copy of a memory location
(2) where in fast storage to keep the copy
(3) how to find the copy later on (LW and SW only give indices into M)

Basic Operation (demand-driven version)
Given an address into M:
- cache lookup (issue 3)
- hit? yes: return the data (it is already in the location chosen by issue 2)
- hit? no: choose a location (issue 1); if that location is occupied, evict the old copy to L_(i+1) (issue 1'); fetch the new copy from L_(i+1), update the cache, and return the data

A sketch of this flow in C appears below.
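A minimal sketch of the demand-driven flow for a direct-mapped, write-back cache. The Line type, the backing array, and the next_level_read/next_level_write helpers are hypothetical stand-ins for L_(i+1), not from the lecture (the dirty bit anticipates the write handling covered later):

```c
#include <stdbool.h>
#include <stdint.h>

#define NLINES 1024
#define BLOCK  16                               /* block size B in bytes */

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[BLOCK];
} Line;

static Line cache[NLINES];
static uint8_t backing[1u << 20];               /* toy stand-in for L_(i+1) */

static void next_level_read(uint32_t a, uint8_t *buf) {
    for (int i = 0; i < BLOCK; i++) buf[i] = backing[a + i];
}
static void next_level_write(uint32_t a, const uint8_t *buf) {
    for (int i = 0; i < BLOCK; i++) backing[a + i] = buf[i];
}

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t bo  = addr % BLOCK;                /* byte offset within the block */
    uint32_t idx = (addr / BLOCK) % NLINES;     /* where the copy is kept (issue 2) */
    uint32_t tag = (addr / BLOCK) / NLINES;     /* how the copy is found later (issue 3) */
    Line *l = &cache[idx];

    if (!(l->valid && l->tag == tag)) {         /* miss: cache on demand (issue 1) */
        if (l->valid && l->dirty)               /* occupied? evict old copy to L_(i+1) */
            next_level_write((l->tag * NLINES + idx) * BLOCK, l->data);
        next_level_read(addr - bo, l->data);    /* fetch new copy from L_(i+1) */
        l->valid = true;
        l->dirty = false;
        l->tag   = tag;
    }
    return l->data[bo];                         /* hit path: return data */
}

int main(void) {
    backing[12345] = 42;
    return cache_read_byte(12345) == 42 ? 0 : 1;   /* miss, fill, then hit */
}
```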

Basic Cache Parameters
- M = 2^m: size of the address space in bytes (sample values: 2^32, 2^64)
- G = 2^g: cache access granularity in bytes (sample values: 4, 8)
- C: capacity of the cache in bytes (sample values: 16 KByte for L1, 1 MByte for L2)
[Figure: hit rate vs. working set size W; the hit rate stays near 100% while W fits within the capacity C.]

Direct-Mapped Cache (v1)
The lg2(M)-bit address is divided into three fields:
- tag: t bits, where t = lg2(M) - lg2(C)
- idx: lg2(C/G) bits, selecting one of the C/G lines
- g: lg2(G) bits of byte offset within an access granule

The tag bank holds C/G entries of t bits each, plus a valid bit; the data bank holds C/G lines of G bytes each. A lookup reads both banks at idx, compares the stored tag against the address's tag bits, and ANDs the result with the valid bit to produce hit?; on a hit, the data bank supplies the G bytes of data.

What about writes?

Storage Overhead and Block Size
For each cache block of G bytes, we also store t+1 bits of tag and valid state (where t = lg2(M) - lg2(C)):
- if M = 2^32, G = 4, C = 16K = 2^14: t = 18 bits for each 4-byte block, i.e., ~60% overhead; the "16KB" cache is actually 25.5KB of SRAM

Solution: amortize the tag over a larger block of B bytes, managing B/G consecutive words as an indivisible unit:
- if M = 2^32, B = 16, G = 4, C = 16K: t = 18 bits for each 16-byte block, i.e., ~15% overhead; the 16KB cache is actually 18.4KB of SRAM
- spatial locality also says this is a good idea

Larger caches want even bigger blocks.
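A quick check of the overhead arithmetic above (tag storage = C/B blocks, each with t+1 bits of tag plus valid):

```c
#include <stdio.h>

int main(void) {
    const double C = 16 * 1024;            /* 16KB cache, M = 2^32 */
    const int t = 32 - 14;                 /* t = lg2(M) - lg2(C) = 18 bits */
    const double B[] = {4, 16};            /* block sizes in bytes */
    for (int i = 0; i < 2; i++) {
        double tag_bytes = (C / B[i]) * (t + 1) / 8;    /* (tag + valid) per block */
        printf("B=%2.0f: overhead = %4.1f%%, total SRAM = %.1f KB\n",
               B[i], 100.0 * (t + 1) / (B[i] * 8), (C + tag_bytes) / 1024);
    }
    return 0;   /* B=4: 59.4%, 25.5 KB; B=16: 14.8%, 18.4 KB */
}
```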

Direct-Mapped Cache (final)
The lg2(M)-bit address is now divided into four fields:
- tag: t bits, where t = lg2(M) - lg2(C)
- idx: lg2(C/B) bits, selecting one of the C/B lines
- bo: lg2(B/G) bits of block offset, selecting a granule within the block
- g: lg2(G) bits of byte offset within a granule

The tag bank holds C/B entries of t bits each, plus a valid bit; the data bank holds C/B lines of B bytes each. A lookup reads both banks at idx, compares the stored tag against the address's tag bits for hit?, and uses bo to select the G bytes of data out of the B-byte block.
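A sketch of the field extraction with the sample parameters above (M = 2^32, C = 16KB, B = 16, G = 4, so t = 18, idx = 10 bits, bo = 2 bits, g = 2 bits); the address value is arbitrary:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t G = 4, B = 16, C = 16 * 1024;
    uint32_t addr = 0x12345678;                  /* arbitrary example address */

    uint32_t g   = addr & (G - 1);               /* byte offset within a granule */
    uint32_t bo  = (addr / G) & (B / G - 1);     /* granule within the block */
    uint32_t idx = (addr / B) & (C / B - 1);     /* one of C/B = 1024 lines */
    uint32_t tag = addr / C;                     /* remaining t = 18 high bits */

    printf("addr=0x%08x: tag=0x%05x idx=%u bo=%u g=%u\n", addr, tag, idx, bo, g);
    return 0;    /* prints tag=0x048d1 idx=359 bo=2 g=0 */
}
```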