Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe


Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1

16: Caches. Computer Architecture and Systems Programming, 252-0061-00, Herbstsemester 2016. Timothy Roscoe

Problem: Processor-memory bottleneck. Processor performance doubled about every 18 months, while bus bandwidth to main memory evolved much more slowly. Intel Haswell: the core can process at least 512 bytes/cycle (one SSE two-operand add and mult), but main-memory bandwidth is about 10 bytes/cycle, with a latency of about 100 cycles. Solution: caches.

General cache mechanics. The cache is a smaller, faster, more expensive memory that holds a subset of the blocks of a larger, slower, cheaper memory, which is viewed as partitioned into blocks or lines (numbered 0-15 in the figure; the cache currently holds blocks 8, 9, 14, and 3). Data is copied between the two in block-sized transfer units.

16.1: Hits and misses

General cache concepts: Hit. The program requests block 14: data in block b is needed, and block b is in the cache (which holds blocks 8, 9, 14, and 3). Hit!

General cache concepts: Miss. The program requests block 12: data in block b is needed, but block b is not in the cache. Miss! Block b is fetched from memory and stored in the cache. Placement policy: determines where b goes. Replacement policy: determines which block gets evicted (the victim).

Cache performance metrics. Miss rate: the fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate. Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc. Hit time: the time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. Typical numbers: 1-2 clock cycles for L1, 5-20 clock cycles for L2. Miss penalty: the additional time required because of a miss; typically 50-200 cycles for main memory (trend: increasing!).

Let's think about those numbers. There is a huge difference between a hit and a miss: it could be 100x, with just L1 and main memory. Would you believe 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time: 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles; 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles. This is why miss rate is used instead of hit rate.

Types of cache miss. Cold (compulsory) miss: occurs on the first access to a block. Conflict miss: most caches limit placement to a small subset of the available slots, e.g., block i must be placed in slot (i mod 4). The cache may be large enough, but multiple blocks map to the same slot: e.g., referencing blocks 0, 8, 0, 8, ... would miss every time. Capacity miss: the set of active cache blocks (the working set) is larger than the cache. Coherency miss: multiprocessor systems; see later in the course.

16.2: The memory hierarchy

An example memory hierarchy (smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at the bottom):
L0: CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: on-chip L2 cache (SRAM; was off-chip!) holds cache lines retrieved from main memory.
L3: main memory (DRAM) holds disk blocks retrieved from local disks.
L4: local secondary storage (SSD, local disks) holds files retrieved from disks on remote network servers.
L5: remote secondary storage (tapes, distributed file systems, Web servers).

Examples of caching in the hierarchy (cache type: what is cached, where it is cached, latency in cycles, managed by):
Registers: 4/8-byte words, CPU core, 0, compiler.
TLB: address translations, on-chip TLB, 0, hardware.
L1 cache: 64-byte blocks, on-chip L1, 1, hardware.
L2 cache: 64-byte blocks, on-chip L2, 10, hardware.
Virtual memory: 4 kB pages, main memory (RAM), 100, hardware + OS.
Buffer cache: 4 kB sectors, main memory, 100, OS.
Network buffer cache: parts of files, local disk or SSD, 1,000,000, SMB/NFS client.
Browser cache: Web pages, local disk, 10,000,000, Web browser.
Web cache: Web pages, remote server disks, 1,000,000,000, Web proxy server.

Memory hierarchy: Core 2 Duo ("Conroe", 2006-2009); L1/L2 cache: 64 B blocks. Not drawn to scale.
L1 I-cache and D-cache: 32 kB each; throughput 16 B/cycle, latency 3 cycles.
L2 unified cache: ~4 MB; throughput 8 B/cycle, latency 14 cycles.
Main memory: ~4 GB; throughput 2 B/cycle, latency 100 cycles.
Disk: ~500 GB; throughput 1 B/30 cycles, latency millions of cycles.

Memory hierarchy: Intel Core i7-6700K ("Skylake", 2015-); L1/L2/L3 cache: 64 B blocks. Not drawn to scale.
L1 I-cache and D-cache: 32 kB each; throughput ~81 B/cycle, latency 4 cycles.
L2 cache: ~256 kB; throughput ~29 B/cycle, latency 12 cycles.
L3 unified cache: ~8 MB; throughput ~18 B/cycle, latency 44 cycles.
Main memory: ~16 GB; throughput ~10 B/cycle, latency 80 cycles.
SSD: ~300 GB; throughput 1 B/10 cycles, latency millions of cycles.

Why caches work: locality. Programs tend to use data and instructions with addresses near or equal to those they have used recently. Temporal locality: recently referenced items are likely to be referenced again in the near future. Spatial locality: items with nearby addresses tend to be referenced close together in time.

Example: locality?

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data: temporal locality: sum is referenced in each iteration; spatial locality: array a[] is accessed in a stride-1 pattern. Instructions: temporal: the loop is cycled through repeatedly; spatial: instructions are referenced in sequence. Being able to assess the locality of code is a crucial skill for a programmer.

Locality example #1:

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Locality example #2:

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Locality example #3:

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

How can this be fixed?

16.3: Cache organization

General cache organization (S, E, B). A cache is an array of S = 2^s sets; each set contains E = 2^e lines (blocks); each line holds a valid bit, a tag, and B = 2^b bytes of data. Cache size: S x E x B data bytes.

16.4: Cache reads

Cache read. The address of the word is split into t tag bits, s set-index bits, and b block-offset bits. Locate the set using the set index; check whether any line in the set has a matching tag; if yes and the line is valid: hit. The data begins at the block offset within the line's B = 2^b data bytes.

Direct mapped cache (E = 1): one line per set. Assume the cache block size is 8 bytes. The set-index bits of the address of the int (e.g. "t bits | 01 | 100") select ("find") the set.

Direct mapped cache (E = 1), continued: if the selected line is valid and its tag matches the address tag: hit; the block offset (100) locates the int (4 bytes) within the block. No match: the old line is evicted and replaced.

Example (direct-mapped): assume a cold (empty) cache with 32 B blocks (= 4 doubles); a[0][0] goes at the start of a block; ignore the variables sum, i, j.

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

2-way set-associative cache (E = 2): two lines per set. Assume the cache block size is 8 bytes; the address of a short int again splits into tag, set index (01), and block offset (100).

2-way set-associative cache (E = 2), continued: the set-index bits select ("find") the set; with E = 2 there are two candidate lines in it.

2-way set-associative cache (E = 2), continued: compare both lines in the set; if one is valid and its tag matches: hit, and the block offset locates the short int (2 bytes). No match: one line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), ...

Example (repeated for the 2-way set-associative cache): the same sum_array_rows / sum_array_cols code as before, again with a cold cache and 32 B blocks (= 4 doubles).

16.5: Cache writes

What to do on a write-hit? Write-through: write immediately to memory; memory is always consistent with the cache copy, but it is slow: what if the same value (or line!) is written several times? Write-back: defer the write to memory until the line is replaced; needs a dirty bit that indicates the line differs from memory. Higher performance (but more complex).

What to do on a write-miss? Write-allocate (load the block into the cache, then update the line in the cache): good if more writes to the location follow; more complex to implement; may evict an existing value. Common with write-back caches. No-write-allocate (write immediately to memory, bypassing the cache): simpler to implement, but slower code (bad if the value is subsequently re-read). Seen with write-through caches.

Real caches: Intel Core i7-5960X (Extreme Edition, "Haswell-E"); 1 core, 2 threads.
L1 I-cache: 32 kB, 8-way, no writes.
L1 D-cache: 32 kB, 8-way, writeback.
L2: 256 kB, 8-way, non-inclusive, writeback, private.
L3: up to 20 MB, ~16-way, writeback, shared, inclusive. Optional L4 cache.

Real caches: AMD "Warsaw" (Opteron 6380); 2 cores, 1 FP unit.
L1 I-cache: 64 kB, 2-way, no writes.
L1 D-cache: 2 x 16 kB, 4-way, writeback.
L2: 2 MB, 16-way, inclusive, writeback, private.
L3: 2 x 8 MB, 64-way, writeback, shared, non-inclusive. Then main memory.

Examples: software caches are more flexible. File system buffer caches, web browser caches, etc. Some design differences: almost always fully associative, so there are no placement restrictions, and index structures like hash tables are common. They often use complex replacement policies: misses are very expensive when disk or network is involved, so it is worth thousands of cycles to avoid them. They are not necessarily constrained to single-block transfers: they may fetch or write back in larger units, opportunistically.

16.6: Cache optimizations

Optimizations for the memory hierarchy. Write code that has locality. Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time. How to achieve this? Proper choice of algorithm, and loop transformations. Cache versus register level optimization: in both cases locality is desirable, but register space is much smaller and requires scalar replacement to exploit temporal locality. Register-level optimizations include exposing instruction-level parallelism (which can conflict with locality).

Example: matrix multiplication (c = a * b).

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

Cache miss analysis. Assume: matrix elements are doubles; a cache block holds 8 doubles; the cache size C is much smaller than n. First iteration: n/8 misses for the row of a (stride-1, 8-wide blocks) plus n misses for the column of b (stride-n) = 9n/8 misses. Afterwards, the row of a is in cache (schematic).

Cache miss analysis, continued. Second iteration: again n/8 + n = 9n/8 misses. Total misses: 9n/8 * n^2 = (9/8) n^3.

16.7: Blocking

Blocked matrix multiplication (block size B x B):

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

Cache miss analysis (blocked). Assume: a cache block holds 8 doubles; the cache size C is much smaller than n; three blocks fit into the cache: 3B^2 < C. First (block) iteration: B^2/8 misses for each block; the k loop touches n/B blocks of a and n/B blocks of b, so 2n/B * B^2/8 = nB/4 misses (omitting matrix c). Afterwards, the blocks are in cache (schematic: n/B blocks of size B x B).

Cache miss analysis (blocked), continued. Second (block) iteration: same as the first, 2n/B * B^2/8 = nB/4 misses. Total misses: nB/4 * (n/B)^2 = n^3/(4B).

Summary. No blocking: (9/8) n^3 misses. Blocking: 1/(4B) * n^3 misses. This suggests using the largest possible block size B, subject to the limit 3B^2 < C (which can possibly be relaxed a bit, but there is a limit for B). The reason for the dramatic difference: matrix multiplication has inherent temporal locality: the input data is 3n^2 elements, the computation is 2n^3 operations, so every array element is used O(n) times. But the program has to be written properly.