Caches III. CSE 351 Spring 2017. Instructor: Ruth Anderson. Teaching Assistants: Dylan Johnson, Kevin Bi, Linxing Preston Jiang, Cody Ohlsen, Yufang Sun, Joshua Curtis

Administrivia. Office hours changes: check the calendar! Homework 3, due TONIGHT (5/5). Midterm, Monday (5/8). Lab 3, due Thursday (5/11). Mid-Quarter Feedback Survey!

Question. We have a cache of size 2 KiB with a block size of 128 B. If our cache has 2 sets, what is its associativity? A. 2 B. 4 C. 8 D. 16. If addresses are 16 bits wide, how wide is the Tag field?

Cache Read. E = blocks/lines per set; S = # sets = 2^s; K = bytes per block. 1) Locate set. 2) Check if any line in set is valid and has matching tag: hit. 3) Locate data starting at offset. Address of byte in memory: t tag bits | s set-index bits | k block-offset bits. Each line holds a valid bit v, a tag, and data bytes 0 through K-1; the data begins at the block offset.

Example: Direct-Mapped Cache (E = 1). Direct-mapped: one line per set. Block size K = 8 B. Address of int: t tag bits | set index 01 | block offset 100. S = 2^s sets, each one line of v, tag, and data bytes 0-7; the set index selects (finds) the set.


Example: Direct-Mapped Cache (E = 1). Direct-mapped: one line per set. Block size K = 8 B. Address of int: t tag bits | set index 01 | block offset 100. Valid? + match?: yes = hit. The int (4 B) is at the block offset within the line. This is why we want alignment! No match? Then the old line gets evicted and replaced.

Example: Set-Associative Cache (E = 2). 2-way: two lines per set. Block size K = 8 B. Address of short int: t tag bits | set index 01 | block offset 100. Each set holds two lines of v, tag, and data bytes 0-7; the set index selects (finds) the set.


Example: Set-Associative Cache (E = 2). 2-way: two lines per set. Block size K = 8 B. Address of short int: t tag bits | set index 01 | block offset 100. Compare both lines: valid? + match?: yes = hit. The short int (2 B) is at the block offset. No match? One line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), ...

Types of Cache Misses: the 3 C's! Compulsory (cold) miss: occurs on first access to a block. Conflict miss: occurs when the cache is large enough, but multiple data objects all map to the same slot; e.g. referencing blocks 0, 8, 0, 8, ... could miss every time. Direct-mapped caches have more conflict misses than E-way set-associative caches (where E > 1). Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache (it just won't fit, even if the cache were fully-associative). Note: fully-associative caches have only compulsory and capacity misses.

Core i7: Associativity. Processor package: cores 0-3, each with registers, an L1 d-cache, an L1 i-cache, and an L2 unified cache; one L3 unified cache shared by all cores; then main memory. Block/line size: 64 bytes for all caches. L1 i-cache and d-cache: 32 KiB, 8-way, access: 4 cycles. L2 unified cache: 256 KiB, 8-way, access: 11 cycles. L3 unified cache: 8 MiB, 16-way, access: 30-40 cycles. Each level down is slower, but more likely to hit.

Making memory accesses fast! Cache basics: principle of locality; memory hierarchies. Cache organization: direct-mapped (sets; index + tag); associativity (ways); replacement policy; handling writes. Program optimizations that consider caches.

What about writes? Multiple copies of data exist: L1, L2, possibly L3, main memory. What to do on a write-hit? Write-through: write immediately to next level. Write-back: defer write to next level until line is evicted (replaced); must track which cache lines have been modified ("dirty bit"). What to do on a write-miss? Write-allocate ("fetch on write"): load into cache, update line in cache; good if more writes or reads to the location follow. No-write-allocate ("write around"): just write immediately to memory. Typical caches: write-back + write-allocate, usually; write-through + no-write-allocate, occasionally.

Write-back, write-allocate example. The cache holds one line with a dirty bit and a tag; since there is only one set in this tiny cache, the tag is the entire block address! Initially the cache holds G: 0xBEEF, dirty bit 0. Memory: F holds 0xCAFE, G holds 0xBEEF. In this example we are sort of ignoring block offsets. Here a block holds 2 bytes (16 bits, 4 hex digits). Normally a block would be much bigger and thus there would be multiple items per block. While only one item in that block would be written at a time, the entire line would be brought into cache.

Write-back, write-allocate example. mov 0xFACE, F. Cache: G 0xBEEF, dirty bit 0. Memory: F 0xCAFE, G 0xBEEF.

Write-back, write-allocate example. mov 0xFACE, F. Step 1: Bring F into cache. Cache: F 0xCAFE, dirty bit 0. Memory: F 0xCAFE, G 0xBEEF.

Write-back, write-allocate example. mov 0xFACE, F. Step 2: Write 0xFACE to cache only and set the dirty bit. Cache: F 0xFACE, dirty bit 1. Memory: F 0xCAFE, G 0xBEEF.

Write-back, write-allocate example. mov 0xFACE, F; mov 0xFEED, F. Write hit! Write 0xFEED to cache only. Cache: F 0xFEED, dirty bit 1. Memory: F 0xCAFE, G 0xBEEF.

Write-back, write-allocate example. mov 0xFACE, F; mov 0xFEED, F; mov G, %rax. Cache: F 0xFEED, dirty bit 1. Memory: F 0xCAFE, G 0xBEEF.

Write-back, write-allocate example. mov 0xFACE, F; mov 0xFEED, F; mov G, %rax. 1. Write F back to memory since it is dirty. 2. Bring G into the cache so we can copy it into %rax. Cache: G 0xBEEF, dirty bit 0. Memory: F 0xFEED, G 0xBEEF.

Question. Which of the following cache statements is FALSE? A. We can reduce compulsory misses by decreasing our block size. B. We can reduce conflict misses by increasing associativity. C. A write-back cache will save time for code with good temporal locality on writes. D. A write-through cache will always match data with the memory hierarchy level below it.

Optimizations for the Memory Hierarchy. Write code that has locality! Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time. How can you achieve locality? Adjust memory accesses in code (software) to improve miss rate (MR); requires knowledge of both how caches work and your system's parameters. Proper choice of algorithm. Loop transformations.

Example: Matrix Multiplication. C = A × B, where c_ij = a_i* · b_*j (row i of A dot column j of B).

Matrices in Memory. How do cache blocks fit into this scheme? Row-major matrix in memory: each row is contiguous, so a cache block covers consecutive elements of one row. A COLUMN of the matrix is therefore spread among many cache blocks.

Naïve Matrix Multiply

# move along rows of A
for (i = 0; i < n; i++)
  # move along columns of B
  for (j = 0; j < n; j++)
    # EACH k loop reads row of A, col of B
    # Also read & write c(i,j) n times
    for (k = 0; k < n; k++)
      c[i*n+j] += a[i*n+k] * b[k*n+j];

C(i,j) = C(i,j) + A(i,:) * B(:,j)


Cache Miss Analysis (Naïve). Scenario parameters: square matrix (n × n), elements are doubles; cache block size K = 64 B = 8 doubles; cache size C much smaller than n; ignoring matrix c. Each iteration: n/8 + n = 9n/8 misses (n/8 for the row of A, which is contiguous, plus n for the column of B, which hits a new block every element). Afterwards in cache (schematic): the row of A and 8-doubles-wide strips of B.

Linear Algebra to the Rescue (1). This is extra (non-testable) material. We can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix "blocks"). For example, multiply two 4 × 4 matrices:

Linear Algebra to the Rescue (2). This is extra (non-testable) material.

C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
C21 C22 C23 C24  =  A21 A22 A23 A24  x  B21 B22 B23 B24
C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44

Matrices of size n × n, split into 4 blocks of size r per dimension (n = 4r). C22 = A21 B12 + A22 B22 + A23 B32 + A24 B42 = sum over k of A2k * Bk2. Multiplication operates on small "block matrices". Choose the size so that they fit in the cache! This technique is called "cache blocking".

Blocked Matrix Multiply. Blocked version of the naïve algorithm (r = block matrix size; assume r divides n evenly):

# move by rxr BLOCKS now
for (i = 0; i < n; i += r)
  for (j = 0; j < n; j += r)
    for (k = 0; k < n; k += r)
      # block matrix multiplication
      for (ib = i; ib < i+r; ib++)
        for (jb = j; jb < j+r; jb++)
          for (kb = k; kb < k+r; kb++)
            c[ib*n+jb] += a[ib*n+kb]*b[kb*n+jb];


Cache Miss Analysis (Blocked). Scenario parameters: cache block size K = 64 B = 8 doubles; cache size C much smaller than n; three blocks (r × r) fit into cache: 3r^2 < C. Ignoring matrix c. r^2 elements per block, 8 per cache block, so r^2/8 misses per block. Each block iteration: 2n/r blocks read (n/r blocks in the row of A and n/r in the column of B), so 2n/r × r^2/8 = nr/4 misses. Afterwards in cache (schematic): the three r × r blocks.

Matrix Multiply Visualization. Here n = 100, C = 32 KiB, r = 30. Naïve: 1,020,000 cache misses. Blocked: 90,000 cache misses.

Cache-Friendly Code. The programmer can optimize for cache performance: how data structures are organized; how data are accessed (nested loop structure; blocking is a general technique). All systems favor cache-friendly code. Getting absolute optimum performance is very platform-specific (cache size, cache block size, associativity, etc.), but you can get most of the advantage with generic code: keep the working set reasonably small (temporal locality); use small strides (spatial locality); focus on the inner loop code.

The Memory Mountain. Core i7 Haswell, 2.1 GHz; 32 KB L1 d-cache; 256 KB L2 cache; 8 MB L3 cache; 64 B block size. (Plot: read throughput (MB/s), up to about 16000, versus stride (x8 bytes, s1-s11) and working-set size (32 KB to 128 MB). Ridges of temporal locality appear at the L1/L2/L3/memory size boundaries; slopes of spatial locality run along the stride axis; aggressive prefetching boosts throughput at small strides.)

Learning About Your Machine. Linux: lscpu; ls /sys/devices/system/cpu/cpu0/cache/index0/; e.g. cat /sys/devices/system/cpu/cpu0/cache/index*/size; cat /proc/cpuinfo | grep cache | sort | uniq. Windows: wmic memcache get <query> (all values in KB); e.g. wmic memcache get MaxCacheSize. Modern processor specs: http://www.7-cpu.com/