CSCI-UA.0201 Computer Systems Organization Memory Hierarchy


CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Programmer's Wish List: memory that is private, infinitely large, infinitely fast, non-volatile, and inexpensive. But programs are getting bigger faster than memories.

Example Memory Hierarchy (smaller, faster, and costlier per byte at the top; larger, slower, and cheaper per byte at the bottom):
L0: CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM) holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM) holds cache lines retrieved from main memory.
L4: Main memory (DRAM) holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks) holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers).

SRAM vs DRAM Summary:

       Trans. per bit   Access time   Needs refresh?   Needs EDC?   Cost   Applications
SRAM   4 or 6           1x            No               Maybe        100x   Cache memories
DRAM   1                10x           Yes              Yes          1x     Main memories, frame buffers

Question: Who Cares About the Memory Hierarchy?
(Figure: processor and DRAM performance, 1980-2000, log scale.) CPU performance improved about 60% per year (Moore's Law), while DRAM performance improved only about 7% per year, so the processor-memory performance gap grew about 50% per year.

Disk

Funny facts: it took 51 years to reach 1 TB, but only 2 more years to reach 2 TB! IBM introduced the first hard disk drive to break the 1 GB barrier in 1980. The IBM 350 disk, 5 MB in size, dates from the 1950s. Source: http://royal.pingdom.com/2010/02/18/amazing-facts-and-figures-about-the-evolution-of-hard-disk-drives/

Hard Disks
- Spinning platter of special material; data is stored magnetically.
- A mechanical arm with a read/write head must be close to the platter to read/write data.
- Storage capacity is commonly between 100 GB and 3 TB.
- Disks are random access, meaning data can be read/written anywhere on the disk.
- The disk surface spins at a fixed rotational rate around the spindle.
- The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
- By moving radially, the arm can position the read/write head over any track.

What's Inside A Disk Drive? Arm, spindle, platters, actuator, SCSI connector, and electronics (including a processor and memory!). Image courtesy of Seagate Technology.

Disk Drives: each platter is divided into concentric tracks, and each track into sectors. To access data:
- seek time: position the head over the proper track
- rotational latency: wait for the desired sector to rotate under the head
- transfer time: grab the data (one or more sectors)
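As a rough model, access time is the sum of those three components. A minimal sketch in C, with made-up parameter values (4 ms average seek, 7,200 RPM, 100 MB/s transfer rate) chosen only for illustration:

#include <stdio.h>

/* Rough disk access-time model: seek + rotational latency + transfer.
   All parameter values below are illustrative assumptions, not specs. */
int main(void) {
    double seek_ms = 4.0;              /* assumed average seek time */
    double rpm = 7200.0;               /* assumed rotational speed */
    double transfer_mb_per_s = 100.0;  /* assumed sustained transfer rate */
    double bytes = 4096.0;             /* one 4 KB request */

    /* On average the desired sector is half a rotation away. */
    double rotational_ms = 0.5 * (60.0 / rpm) * 1000.0;
    double transfer_ms = bytes / (transfer_mb_per_s * 1e6) * 1000.0;

    printf("avg access = %.3f ms (seek %.1f + rotation %.3f + transfer %.4f)\n",
           seek_ms + rotational_ms + transfer_ms,
           seek_ms, rotational_ms, transfer_ms);
    return 0;
}

With these numbers the seek and rotational terms (about 4 ms each) dwarf the transfer term (about 0.04 ms), which is why locality matters so much for disks.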

A Conventional Hard Disk Structure

Hard Disk Architecture
Surface = group of tracks
Track = group of sectors
Sector = group of bytes
Cylinder = several tracks at the same position on corresponding surfaces

Disk Sectors and Access
Each sector records:
- Sector ID
- Data (512 bytes; 4096 bytes proposed)
- Error correcting code (ECC), used to hide defects and recording errors
- Synchronization fields and gaps
Access to a sector involves:
- Queuing delay if other accesses are pending
- Seek: move the heads
- Rotational latency
- Data transfer
- Controller overhead

Disks: Other Issues
- Average seek and rotation times are helped by locality.
- Disk performance improves about 10%/year; capacity increases about 60%/year.
- Examples of disk controllers: SCSI, ATA, SATA.

Flash Storage
- Nonvolatile semiconductor storage
- 100x to 1000x faster than disk
- Smaller, lower power, more robust
- But more $/GB (between disk and DRAM)

Flash Types
NOR flash: bit cell like a NOR gate
- Random read/write access
- Used for instruction memory in embedded systems
NAND flash: bit cell like a NAND gate
- Denser (bits/area), but block-at-a-time access
- Cheaper per GB
- Used for USB keys, media storage, ...
Flash bits wear out after 1000s of accesses
- Not suitable for direct RAM or disk replacement
- Wear leveling: remap data to less-used blocks
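A hedged sketch of the wear-leveling idea in C: keep a per-block erase count and steer each new write to the least-worn block. NBLOCKS and all names here are invented for illustration:

#include <stdio.h>

#define NBLOCKS 8  /* toy number of flash blocks (assumption) */

static int erase_count[NBLOCKS];  /* erases seen by each block so far */

/* Wear leveling: always write into the block with the fewest erases. */
static int pick_least_worn(void) {
    int best = 0;
    for (int b = 1; b < NBLOCKS; b++)
        if (erase_count[b] < erase_count[best])
            best = b;
    return best;
}

int main(void) {
    /* 20 writes that would otherwise all hammer one block. */
    for (int w = 0; w < 20; w++)
        erase_count[pick_least_worn()]++;
    for (int b = 0; b < NBLOCKS; b++)
        printf("block %d erased %d times\n", b, erase_count[b]);
    return 0;
}

Instead of one block taking all 20 erases, the wear is spread almost evenly (3 or 2 erases per block).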

Solid State Disk (SSD)
An SSD is flash memory plus a flash translation layer, which maps requests to read and write logical disk blocks (arriving over the I/O bus) onto the flash's blocks and pages.
Typically:
- pages are 512 bytes to 4 KB in size
- a block consists of 32 to 128 pages
- a block wears out after roughly 100,000 repeated writes; once a block wears out it can no longer be used

Main Memory (DRAM For now!)

DRAM is packaged in memory modules that plug into expansion slots on the main system board (motherboard). Example package: the 168-pin dual inline memory module (DIMM), which transfers data to and from the memory controller in 64-bit chunks.

Storage Trends

SRAM
Metric              1985     1990    1995   2000   2005   2010    2015    2015:1985
$/MB                2,900    320     256    100    75     60      25      116
access (ns)         150      35      15     3      2      1.5     1.3     115

DRAM
Metric              1985     1990    1995   2000   2005   2010    2015    2015:1985
$/MB                880      100     30     1      0.1    0.06    0.02    44,000
access (ns)         200      100     70     60     50     40      20      10
typical size (MB)   0.256    4       16     64     2,000  8,000   16,000  62,500

Disk
Metric              1985     1990    1995   2000   2005   2010    2015    2015:1985
$/GB                100,000  8,000   300    10     5      0.3     0.03    3,333,333
access (ms)         75       28      10     8      5      3       3       25
typical size (GB)   0.01     0.16    1      20     160    1,500   3,000   300,000

Cache Memory

Large gap between processor speed and memory speed

A B C of Cache

Cache Analogy: Hungry! Must eat!
- Option 1: go to refrigerator. Found? Eat! Latency = 1 minute.
- Option 2: go to store. Found? Purchase, take home, eat! Latency = 20-30 minutes.
- Option 3: grow food! Plant, wait... wait... wait, harvest, eat! Latency = ~250,000 minutes (~6 months).

What Do We Gain? Let m = cache access time, M = main memory access time, and p = probability that we find the data in the cache (the hit rate). Then:
Average access time = p*m + (1-p)*(m+M) = m + (1-p)*M
So we need to increase p.
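A minimal sketch of this formula in C; the numbers in main are placeholder values, not measurements:

#include <stdio.h>

/* Average access time = m + (1-p)*M, where m = cache access time,
   M = main memory access time, and p = hit rate. */
static double avg_access_time(double m, double M, double p) {
    return m + (1.0 - p) * M;
}

int main(void) {
    /* Placeholder values: 1-cycle cache, 100-cycle memory, 95% hit rate. */
    printf("%.1f cycles\n", avg_access_time(1.0, 100.0, 0.95)); /* 6.0 */
    return 0;
}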

Problem: Given the following: cache with 1 cycle access time; main memory with 100 cycle access time. What is the average access time for 100 memory references if you measure that 90% of the cache accesses are hits?
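Using the formula from the previous slide: average access time = 1 + (1-0.9)*100 = 11 cycles per reference, so 100 references take about 1,100 cycles.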

Class Problem: For application X, you measure that 40% of the operations access memory. The non-memory operations take one cycle. You also measure that 90% of the memory references hit in the cache. Whenever a cache miss occurs, the processor is stalled for 20 cycles to transfer the block from memory into the cache. A cache hit takes one cycle. What is the average time per operation?
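One way to work it out, assuming the 20-cycle stall is paid on top of the one-cycle hit time: a memory operation averages 0.9*1 + 0.1*(1+20) = 3 cycles, so the average over all operations is 0.6*1 + 0.4*3 = 1.8 cycles.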

Cache Organization

Basic Cache Design
Cache memory can copy data from any part of main memory. It has two parts:
- The TAG (CAM) holds the memory address.
- The BLOCK (SRAM) holds the memory data.
Accessing the cache:
- Compare the reference address with the tag. If they match, get the data from the cache block.
- If they don't match, get the data from main memory.
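To make the tag-compare step concrete, here is a minimal direct-mapped lookup sketch in C. All sizes and names (BLOCK_SIZE, NUM_SETS, struct line) are illustrative assumptions, not a real design:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32   /* bytes per block (assumption) */
#define NUM_SETS   512  /* number of lines, direct mapped (assumption) */

struct line {
    int      valid;
    uint32_t tag;
    uint8_t  block[BLOCK_SIZE];
};

static struct line cache[NUM_SETS];

/* Returns 1 on a hit and copies the byte out; 0 on a miss
   (on a miss the caller would go to main memory). */
static int cache_read_byte(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr % BLOCK_SIZE;              /* block offset bits  */
    uint32_t set    = (addr / BLOCK_SIZE) % NUM_SETS; /* set index bits     */
    uint32_t tag    = addr / BLOCK_SIZE / NUM_SETS;   /* remaining tag bits */

    if (cache[set].valid && cache[set].tag == tag) {
        *out = cache[set].block[offset];              /* hit */
        return 1;
    }
    return 0;                                         /* miss */
}

int main(void) {
    uint32_t addr = 0x12345678;
    /* Fake a fill: install the block containing addr, then read it back. */
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    cache[set].valid = 1;
    cache[set].tag = addr / BLOCK_SIZE / NUM_SETS;
    cache[set].block[addr % BLOCK_SIZE] = 42;

    uint8_t v;
    if (cache_read_byte(addr, &v))
        printf("hit: %d\n", v);
    else
        printf("miss\n");
    return 0;
}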

Direct Mapped Cache Example: 32-bit address

Direct Mapped Cache

So what is a cache? Small, fast storage used to improve average access time to slow memory; it exploits spatial and temporal locality. In computer architecture, almost everything is a cache!
- Registers: a cache on variables
- First-level cache: a cache on second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- etc.
(Going down the hierarchy from Proc/Regs through L1-Cache, L2-Cache, and Memory to Disk/Tape: slower, cheaper, bigger. Going up: faster, more expensive, smaller.)

Localities: Why Is Cache a Good Idea?
Spatial locality: if block k is accessed, it is likely that block k+1 will be accessed soon.
Temporal locality: if block k is accessed, it is likely that it will be accessed again soon.
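A small C illustration: the loop below exhibits both kinds of locality (a[i] walks successive addresses, giving spatial locality; sum is reused on every iteration, giving temporal locality):

#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];   /* a[i]: spatial locality; sum: temporal locality */

    printf("%d\n", sum);
    return 0;
}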

General cache organization:
- 1 valid bit and t tag bits per line; B = 2^b bytes per cache block
- E lines per set; S = 2^s sets
- Each line holds: valid bit, tag, and bytes 0 ... B-1 of the block
- Cache size: C = B x E x S data bytes
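Given S, E, and B, an address splits into t tag bits, s set-index bits, and b block-offset bits. A minimal sketch in C that computes the split (the helper names are mine, and the configuration in main is chosen arbitrarily for illustration; the next two problems can be checked against it):

#include <stdio.h>

/* Bits needed to index x entries, assuming x is a power of two. */
static int log2i(unsigned x) {
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Print the tag/index/offset breakdown for a cache configuration. */
static void breakdown(unsigned addr_bits, unsigned cache_bytes,
                      unsigned assoc, unsigned block_bytes) {
    unsigned sets = cache_bytes / (block_bytes * assoc); /* S = C/(B*E) */
    int b = log2i(block_bytes);        /* block offset bits */
    int s = log2i(sets);               /* set index bits */
    int t = (int)addr_bits - s - b;    /* remaining tag bits */
    printf("tag=%d, set index=%d, block offset=%d\n", t, s, b);
}

int main(void) {
    /* Example: 32-bit address, 64K cache, 2-way, 64-byte blocks. */
    breakdown(32, 64 * 1024, 2, 64);   /* tag=17, set index=9, offset=6 */
    return 0;
}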

Problem: Show the breakdown of the address (tag / set index / block offset) for the following cache configuration: 32-bit address, 16K cache, direct-mapped, 32-byte blocks.
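Worked out: block offset = log2(32) = 5 bits; sets = 16K/32 = 512, so set index = 9 bits; tag = 32 - 9 - 5 = 18 bits.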

Problem: Show the breakdown of the address (tag / set index / block offset) for the following cache configuration: 32-bit address, 32K cache, 4-way set associative, 32-byte blocks.
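Worked out: block offset = 5 bits; sets = 32K/(32*4) = 256, so set index = 8 bits; tag = 32 - 8 - 5 = 19 bits.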

Cache design parameters: associativity (DM, 2-way, 4-way, FA), cache size, block size, replacement strategy (LRU, FIFO, LFU, RANDOM).

Design Issues
- What to do in case of hit/miss?
- Block size
- Associativity
- Replacement algorithm
- Improving performance

Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
Write hits:
- replace the data in both cache and memory (write-through), or
- write the data only into the cache, and write it back to memory later (write-back).
Write misses: read the entire block into the cache, then write the word.

Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce power consumption (won't be discussed here)

Reducing Misses: Classifying Misses, the 3 Cs
- Compulsory: the first access to a block cannot be in the cache, so the block must be brought in. Also called cold start misses or first reference misses. (These occur even in an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses.

How Can We Reduce Misses?
1) Change the block size
2) Change the associativity
3) Increase the cache size

Block Size: increasing the block size tends to decrease the miss rate. (Figure: miss rate, 0%-40%, vs. block size, 4 to 256 bytes, for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.)

Decreasing miss ratio with associativity. (Figure: the same eight blocks of tag/data entries organized as one-way set associative (direct mapped, 8 sets), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative, 1 set).)

Implementation of a 4-way set associative cache. (Figure: a 32-bit address split into a 22-bit tag and an 8-bit index selecting one of 256 sets; the index reads out four (valid, tag, data) ways in parallel, four comparators check the tag, a hit signal is raised on a match, and a 4-to-1 multiplexor selects the 32-bit data word.)

Effect of Associativity on Miss Rate. (Figure: miss rate, 0%-15%, vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.)

Reducing Miss Penalty. Write Policy 1: Write-Through vs Write-Back
Write-through: all writes update both the cache and the underlying memory/cache.
- Can always discard cached data; the most up-to-date data is in memory.
- Cache control bit: only a valid bit.
Write-back: all writes simply update the cache.
- Can't just discard cached data; it may have to be written back to memory.
- Cache control bits: both valid and dirty bits.
Other advantages:
- Write-through: memory (or other processors) always has the latest data; simpler management of the cache.
- Write-back: much lower bandwidth, since data is often overwritten multiple times; better tolerance to long-latency memory?
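A hedged sketch of the two policies on a write hit, in C; the toy cache structure and all names here are invented for illustration:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 32

/* Toy cache line; 'base' is the block's starting address in memory. */
struct line { int valid, dirty; uint32_t base; uint8_t block[BLOCK_SIZE]; };

static uint8_t memory[1 << 20];   /* toy backing store */

/* Handle a write that hits in the cache, under either policy. */
static void write_hit(struct line *l, uint32_t addr, uint8_t val,
                      int write_through) {
    l->block[addr % BLOCK_SIZE] = val;  /* always update the cache copy */
    if (write_through)
        memory[addr] = val;             /* write-through: memory updated now */
    else
        l->dirty = 1;                   /* write-back: just mark the line */
}

/* A write-back cache must flush a dirty line before reusing it. */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        memcpy(&memory[l->base], l->block, BLOCK_SIZE);
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line l = { .valid = 1, .base = 0 };
    write_hit(&l, 5, 42, 0);            /* write-back: memory[5] still 0 */
    evict(&l);                          /* dirty line flushed on eviction */
    printf("memory[5] = %d\n", memory[5]);   /* now 42 */
    return 0;
}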

Reducing Miss Penalty. Write Policy 2: Write Allocate vs Non-Allocate (what happens on a write miss)
Write allocate: allocate a new cache line in the cache.
- Usually means you have to do a read miss to fill in the rest of the cache line!
Write non-allocate (or "write-around"): simply send the write data through to the underlying memory/cache; don't allocate a new cache line!

Decreasing miss penalty with multilevel caches
Add a second (and third) level cache:
- often the primary cache is on the same chip as the processor
- use SRAMs to add another cache above primary memory (DRAM)
- the miss penalty goes down if the data is in the 2nd-level cache
Using multilevel caches:
- try to optimize the hit time on the 1st-level cache
- try to optimize the miss rate on the 2nd-level cache

What about the Replacement Algorithm?
- LRU: Least Recently Used
- LFU: Least Frequently Used
- FIFO: First-In First-Out
- Random
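A minimal sketch of LRU within one set, in C. The timestamp approach and all names are illustrative assumptions; real hardware usually uses cheaper LRU approximations:

#include <stdint.h>
#include <stdio.h>

#define E 4  /* lines (ways) per set (assumption) */

struct way { int valid; uint32_t tag; uint64_t last_used; };

/* Pick the victim in a set: an invalid way if any, else the LRU way. */
static int pick_victim(struct way set[E]) {
    int victim = 0;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid) return i;          /* free way: use it */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                       /* older stamp = less recently used */
    }
    return victim;
}

/* On every access (hit or fill), stamp the way with the current time. */
static void touch(struct way *w, uint64_t now) {
    w->last_used = now;
}

int main(void) {
    struct way set[E] = {0};
    for (int i = 0; i < E; i++) { set[i].valid = 1; touch(&set[i], i); }
    touch(&set[0], 10);                            /* way 0 is now most recent */
    printf("victim = way %d\n", pick_victim(set)); /* way 1 is the LRU */
    return 0;
}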

How to write cache friendly code?

Is The Following Code Cache Friendly?

Matrix Multiplication Example
Description: multiply two N x N matrices; matrix elements are doubles (8 bytes); O(N^3) total operations; variable sum held in a register.

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
(matmult/mm.c)

Miss Rate Analysis for Matrix Multiply
Assume:
- Block size = 32B (big enough for four doubles)
- Matrix dimension (N) is very large
- Cache is not even big enough to hold multiple rows
Analysis method: look at the access pattern of the inner loop. (Figure: C = A x B; element (i,j) of C comes from row i of A and column j of B.)

Layout of C Arrays in Memory
C arrays are allocated in row-major order: each row is in contiguous memory locations.
Stepping through columns in one row:
  for (i = 0; i < N; i++) sum += a[0][i];
accesses successive elements; if block size B > sizeof(a[i][j]) bytes, this exploits spatial locality.
Stepping through rows in one column:
  for (i = 0; i < n; i++) sum += a[i][0];
accesses distant elements: no spatial locality! Miss rate = 1 (i.e., 100%).

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is accessed row-wise at (i,*), B column-wise at (*,j), and C is fixed at (i,j).
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is accessed row-wise at (i,*), B column-wise at (*,j), and C is fixed at (i,j).
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A is fixed at (i,k), B is accessed row-wise at (k,*), and C row-wise at (i,*).
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A is fixed at (i,k), B is accessed row-wise at (k,*), and C row-wise at (i,*).
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A is accessed column-wise at (*,k), B is fixed at (k,j), and C column-wise at (*,j).
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.

Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A is accessed column-wise at (*,k), B is fixed at (k,j), and C column-wise at (*,j).
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.
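Summing over the three arrays: the ijk and jik versions incur 1.25 misses per inner-loop iteration, kij and ikj only 0.5, and jki and kji 2.0, so the kij/ikj loop orderings are the most cache-friendly.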

Conclusions
The computer system's storage is organized as a hierarchy. The reason for this hierarchy is to approximate a memory that is very fast, very cheap, and almost infinite. A good programmer must try to make the code cache friendly:
- make the common case cache friendly
- exploit locality