Memory Hierarchy
Jin-Soo Kim (jinsookim@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu

The CPU-Memory Gap

[Figure: log-scale plot of time (ns) vs. year, 1985-2015, comparing disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time. The gap between DRAM/disk speeds and CPU speeds keeps widening.]

Principle of Locality

- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

Principle of Locality: Example

```c
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
```

- Data: array elements are referenced in succession (spatial locality); sum is referenced on every iteration (temporal locality).
- Instructions: instructions are referenced in sequence (spatial locality); the loop body is cycled through repeatedly (temporal locality).

Memory Hierarchy

Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte, have less capacity, and require more power.
- The gap between CPU and main memory speed is widening.
- Well-written programs tend to exhibit good locality.

These fundamental properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Memory Hierarchy: Example

Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices at the bottom:
- L0: Registers
- L1: L1 cache (SRAM)
- L2: L2 cache (SRAM)
- L3: L3 cache (SRAM)
- L4: Main memory (DRAM)
- L5: Local secondary storage (local disks)
- L6: Remote secondary storage (e.g., Web cache)

Exploiting Locality

How to exploit temporal locality?
- Speed up data accesses by caching data in faster storage.
- Caching in multiple levels forms a memory hierarchy; the lower levels tend to be slower, but larger and cheaper.

How to exploit spatial locality?
- Use a larger cache line size, so that nearby data are cached together.

Cache

A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
- Improves the average access time.
- Exploits both temporal and spatial locality.

General Cache Concepts

- Cache: smaller, faster, more expensive memory that holds a subset of the blocks (here blocks 8, 9, 14, and 3).
- Memory: larger, slower, cheaper memory viewed as partitioned into blocks (0-15).
- Data is copied between them in block-sized transfer units (e.g., block 10 being fetched from memory into the cache).

General Cache Concepts: Hit

Request: 14
- Data in block b is needed.
- Block b is in the cache (which holds blocks 8, 9, 14, 3): hit!

General Cache Concepts: Miss

Request: 12
- Data in block b is needed.
- Block b is not in the cache: miss!
- Block b is fetched from memory and stored in the cache.
- Placement policy: determines where b goes.
- Replacement policy: determines which block gets evicted (the victim).

CPU Cache Design Issues

- Cache size: 8 KB ~ 64 KB for L1.
- Cache line size: typically 32 ~ 64 bytes for L1.
- Lookup: fully associative; set associative (2-way, 4-way, 8-way, 16-way, etc.); direct mapped.
- Replacement: LRU (Least Recently Used); FIFO (First-In First-Out); random; etc.

Intel Core i7 Cache Hierarchy

Each core (Core 0 ~ Core 3) in the processor package has its own registers, L1 i-cache and d-cache, and L2 unified cache; the L3 unified cache is shared by all cores.
- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 10 cycles
- L3 unified cache (shared by all cores): 8 MB, 16-way, access: 40-75 cycles
- Block size: 64 bytes for all caches

Cache Performance

Average memory access time = T_hit + R_miss * T_miss

- Hit time (T_hit): time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. 1 clock cycle for L1; 3 ~ 8 clock cycles for L2.
- Miss rate (R_miss): fraction of memory references not found in the cache. 3 ~ 10% for L1; < 1% for L2.
- Miss penalty (T_miss): additional time required because of a miss. Typically 25 ~ 100 cycles for main memory.

Writing Cache Friendly Code

- Make the common case go fast: focus on the inner loops of the core functions.
- Minimize the misses in the inner loops: repeated references to variables are good (temporal locality); sequential reference patterns are good (spatial locality).

Matrix Multiplication (1)

Description:
- Multiply N x N matrices: c[i][j] is the dot product of row (i,*) of a and column (*,j) of b.
- O(N^3) total operations.

```c
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;                      /* variable sum held in register */
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

Assumptions:
- Line size = 32 bytes (big enough for 4 64-bit words).
- Matrix dimension (N) is very large.

Matrix Multiplication (2)

Matrix multiplication (ijk):

```c
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

Inner loop: a (i,*) row-wise, b (*,j) column-wise, c (i,j) fixed.
Misses per inner loop iteration: a = 0.25, b = 1.0, c = 0.0.

Matrix Multiplication (3)

Matrix multiplication (jik):

```c
/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

Inner loop: a (i,*) row-wise, b (*,j) column-wise, c (i,j) fixed.
Misses per inner loop iteration: a = 0.25, b = 1.0, c = 0.0.

Matrix Multiplication (4)

Matrix multiplication (kij):

```c
/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```

Inner loop: a (i,k) fixed, b (k,*) row-wise, c (i,*) row-wise.
Misses per inner loop iteration: a = 0.0, b = 0.25, c = 0.25.

Matrix Multiplication (5)

Matrix multiplication (ikj):

```c
/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```

Inner loop: a (i,k) fixed, b (k,*) row-wise, c (i,*) row-wise.
Misses per inner loop iteration: a = 0.0, b = 0.25, c = 0.25.

Matrix Multiplication (6)

Matrix multiplication (jki):

```c
/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```

Inner loop: a (*,k) column-wise, b (k,j) fixed, c (*,j) column-wise.
Misses per inner loop iteration: a = 1.0, b = 0.0, c = 1.0.

Matrix Multiplication (7)

Matrix multiplication (kji):

```c
/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```

Inner loop: a (*,k) column-wise, b (k,j) fixed, c (*,j) column-wise.
Misses per inner loop iteration: a = 1.0, b = 0.0, c = 1.0.

Matrix Multiplication (8): Summary

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25 (a: 0.25, b: 1.0, c: 0.0)

```c
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5 (a: 0.0, b: 0.25, c: 0.25)

```c
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```

jki (& kji): 2 loads, 1 store; misses/iter = 2.0 (a: 1.0, b: 0.0, c: 1.0)

```c
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```

Matrix Multiplication (9)

[Figure: performance on a Core i7 — cycles per inner loop iteration (log scale, roughly 1 to 100) vs. array size n (50 to 700) for the six loop orders. The jki/kji pair is slowest, ijk/jik sits in the middle, and kij/ikj is fastest.]

Summary

Programmer can optimize for cache performance:
- How data structures are organized.
- How data are accessed: nested loop structure; blocking is a general technique.

All systems favor cache friendly code:
- Getting absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.).
- Can get most of the advantage with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).