Memory and Caches II — The Hardware/Software Interface, CSE351 Winter. Topics: Types of Cache Misses; General Cache Organization (S, E, B); Cache Read.


Types of Cache Misses

Cold (compulsory) miss
- Occurs on the very first access to a block.

Conflict miss
- Occurs when a block is evicted from the cache but is then referenced again later.
- Conflict misses arise when the cache is large enough, but multiple data blocks all map to the same slot; e.g., if blocks 0 and 8 map to the same cache slot, then referencing 0, 8, 0, 8, ... would miss every time.
- Conflict misses may be reduced by increasing the associativity of the cache.

Capacity miss
- Occurs when the set of active cache blocks (the working set) is larger than the cache (it just won't fit).

General Cache Organization (S, E, B)
- A cache is an array of S = 2^s sets; each set contains E = 2^e lines; each line holds a valid bit, a tag, and a data block of B = 2^b bytes.
- Cache size: S x E x B data bytes.

Cache Read
- The address of a byte in memory is split into three fields: t tag bits, s set-index bits, and b block-offset bits.
- Locate the set using the set-index bits, then check whether any line in that set has a matching tag.
- Tag match and valid bit set: hit; the data begins at the given block offset within the line.

Example: Direct-Mapped Cache (E = 1)
- Direct-mapped: one line per set.
- Address of an int: the set-index bits locate the set (S = 2^s sets); valid? + match? Yes: hit, and the int (4 bytes) is found at the block offset within the line.
- No match: the old line is evicted and replaced.

Example (for E = 1)

    double sum_array_rows(double a[16][16]) {
        int i, j;
        double sum = 0;
        for (i = 0; i < 16; i++)
            for (j = 0; j < 16; j++)
                sum += a[i][j];
        return sum;
    }

    double sum_array_cols(double a[16][16]) {
        int i, j;
        double sum = 0;
        for (j = 0; j < 16; j++)
            for (i = 0; i < 16; i++)
                sum += a[i][j];
        return sum;
    }

- Assume: cache blocks are 32 B (4 doubles); sum, i, and j live in registers; the cache starts cold (empty); the address of an aligned element of a has the form aa...ayyyyxxxx000, with 3 bits for the set index and 5 bits for the block offset.
- sum_array_rows visits the array in memory order: each miss loads 4 consecutive doubles, so there are 4 misses per row, 4 x 16 = 64 misses in total.
- sum_array_cols strides through memory one full row (128 B) at a time, so consecutive accesses land in different sets and each set's line is evicted before it is revisited: every access is a miss, 16 x 16 = 256 misses in total.

Example (for E = 1)

    float dotprod(float x[8], float y[8]) {
        float sum = 0;
        int i;
        for (i = 0; i < 8; i++)
            sum += x[i] * y[i];
        return sum;
    }

- In this example, cache blocks are 16 bytes and the cache has 8 sets.
- How many bits? Address bits: tt...t sss bbbb. B = 16 = 2^b gives b = 4 offset bits; S = 8 = 2^s gives s = 3 set-index bits.
  - Address 0:   000...0 000 0000
  - Address 128: 000...1 000 0000
  - Address 160: 000...1 010 0000
- If x and y have aligned starting addresses, e.g., &x[0] = 0 and &y[0] = 128, then x[i] and y[i] map to the same sets: each load of y[i] evicts the block holding x[i] and vice versa, so every access misses.
- If x and y have unaligned starting addresses, e.g., &x[0] = 0 and &y[0] = 160, the two arrays occupy different sets and both fit in the cache at once.

E-way Set-Associative Cache (Here: E = 2)
- E = 2: two lines per set.
- Address of a short int: find the set, then compare the address tag against both lines in the set; valid? + match: hit, and the short int (2 bytes) is found at the block offset within the matching line.
- No match: one line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), ...

Example (for E = 2)

    float dotprod(float x[8], float y[8]) {
        float sum = 0;
        int i;
        for (i = 0; i < 8; i++)
            sum += x[i] * y[i];
        return sum;
    }

- If x and y have aligned starting addresses, e.g., &x[0] = 0 and &y[0] = 128, both arrays still fit because each set has two lines: x[0..3] and y[0..3] share one set, x[4..7] and y[4..7] share another.

Fully Associative Caches (S = 1)
- A fully associative cache has all lines in one single set: S = 1, so E = C / B, where C is the total cache size (since S = (C / B) / E).
- A direct-mapped cache has E = 1, so S = (C / B) / E = C / B.
- Tag matching is more expensive in associative caches: a fully associative cache needs C / B tag comparators (one for every line!), while a direct-mapped cache needs just 1; in general, an E-way set-associative cache needs E tag comparators.
- Tag size, assuming m address bits (m = 32 for IA32): t = m - log2(S) - log2(B).

Intel Core i7 Cache Hierarchy
- Per core: registers; L1 i-cache and d-cache (32 KB each, 8-way, 4-cycle access); unified L2 cache (256 KB, 8-way, 11-cycle access).
- Shared by all cores: unified L3 cache (8 MB, 16-way, 30-40-cycle access).
- Block size: 64 bytes for all caches.

What about writes?
- Multiple copies of the data may exist: L1, L2, possibly L3, and main memory.
- What to do on a write-hit? Write-through (write immediately to memory) or write-back (defer the write to memory until the line is evicted; needs a dirty bit to indicate whether the line differs from memory).
- What to do on a write-miss? Write-allocate (load the block into the cache, then update the line; good if more writes to that location follow) or no-write-allocate (just write immediately to memory).
- Typical caches: write-back + write-allocate, usually; write-through + no-write-allocate, occasionally.

Software Caches are More Flexible
- Examples: file system buffer caches, web browser caches, etc.
- Some design differences: almost always fully associative, so there are no placement restrictions; index structures like hash tables are common (for placement).
- Often use complex replacement policies: misses are very expensive when disk or network is involved, so it is worth thousands of cycles to avoid them.
- Not necessarily constrained to single-block transfers: may fetch or write back in larger units, opportunistically.

Optimizations for the Memory Hierarchy
- Write code that has locality. Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time.
- How to achieve this? Proper choice of algorithm; loop transformations.

Example: Matrix Multiplication

    c = (double *) calloc(sizeof(double), n*n);

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

- Matrix elements are doubles; cache block = 64 bytes = 8 doubles; cache size C << n (much smaller than n).
- First iteration: n/8 misses for the row of a (one per block of 8 doubles) + n misses for the column of b (one per element) = 9n/8 misses (omitting matrix c).
- Afterwards in cache (schematic): the row of a and an 8-double-wide strip of b.

- Other iterations: again n/8 + n = 9n/8 misses (omitting matrix c).
- Total misses: 9n/8 x n^2 = (9/8) n^3.

Blocked Matrix Multiplication

    c = (double *) calloc(sizeof(double), n*n);

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i += B)
            for (j = 0; j < n; j += B)
                for (k = 0; k < n; k += B)
                    /* B x B mini matrix multiplications */
                    for (i1 = i; i1 < i+B; i1++)
                        for (j1 = j; j1 < j+B; j1++)
                            for (k1 = k; k1 < k+B; k1++)
                                c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
    }

- Block size: B x B.
- Assume: cache block = 64 bytes = 8 doubles; cache size C << n (much smaller than n); three blocks fit into the cache: 3B^2 < C.
- First (block) iteration: B^2/8 misses for each block; with 2n/B blocks touched (n/B blocks of a plus n/B blocks of b), that is 2n/B x B^2/8 = nB/4 misses (omitting matrix c).
- Other (block) iterations: same as the first, nB/4 misses each.
- Total misses: nB/4 x (n/B)^2 = n^3/(4B).

Summary
- No blocking: (9/8) n^3 misses. Blocking: n^3/(4B) misses.
- If B = 8, the difference is 4 x 8 x 9/8 = 36x; if B = 16, it is 4 x 16 x 9/8 = 72x.
- This suggests using the largest possible block size B, subject to the limit 3B^2 < C.
- Reason for the dramatic difference: matrix multiplication has inherent temporal locality. The input data is 3n^2 values, but the computation takes 2n^3 operations, so every array element is used O(n) times. The program just has to be written properly.

Cache-Friendly Code
- The programmer can optimize for cache performance: how data structures are organized, and how data are accessed (nested loop structure; blocking is a general technique).
- All systems favor cache-friendly code. Getting the absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.), but you can get most of the advantage with generic code: keep the working set reasonably small (temporal locality), use small strides (spatial locality), and focus on inner-loop code.

The Memory Mountain
[Figure: read throughput (MB/s, 0-7000) as a function of stride (s1-s32, in units of 8 bytes) and working set size (2 KB to 64 MB), measured on an Intel Core i7: 32 KB i-cache, 32 KB d-cache, 256 KB unified L2, 8 MB unified L3, all caches on-chip. The plateaus of the surface correspond to the L1, L2, L3, and main-memory regions of the hierarchy.]