Computer Systems CSE 410 Autumn: Memory Organization and Caches

Computer Systems CSE 410 Autumn 2013, Lecture 10: Memory Organization and Caches

Roadmap. C: car *c = malloc(sizeof(car)); c->miles = 100; c->gals = 17; float mpg = get_mpg(c); free(c); Java: Car c = new Car(); c.setMiles(100); c.setGals(17); float mpg = c.getMPG(); Assembly language: get_mpg: pushq %rbp; movq %rsp, %rbp; ...; popq %rbp; ret. Machine code: 0111010000011000 100011010000010000000010 1000100111000010 110000011111101000011111. OS. Computer system. Course topics: memory & data; integers & floats; machine code & C; x86 assembly; procedures & stacks; arrays & structs; memory & caches; processes; virtual memory; memory allocation; Java vs. C. This lecture: caches.

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization; program optimizations that consider caches.

Making memory accesses fast! What we want: memories that are big, fast, and cheap. Hardware: pick any two. So we'll be clever.

How does execution time grow with SIZE?

int array[SIZE];
int A = 0;
for (int i = 0; i < 200000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}

Plot: execution TIME vs. SIZE.

[Plot: actual data, execution time vs. SIZE.]
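A minimal harness to reproduce this measurement, as a sketch: the inner loops are from the slide, while the timing scaffolding, the working-set sweep, and the work-normalizing repetition count (instead of the slide's fixed 200000) are our own assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time the summing loop for one working-set size (number of ints). */
static double time_sum(const int *array, long size, long reps) {
    volatile long A = 0;            /* volatile: keep the sum from being optimized away */
    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        for (long j = 0; j < size; ++j)
            A += array[j];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    for (long size = 1L << 10; size <= 1L << 24; size <<= 1) {
        int *array = calloc(size, sizeof(int));
        long reps = (1L << 28) / size;   /* keep total work constant across sizes */
        printf("%ld ints: %.3f s\n", size, time_sum(array, size, reps));
        free(array);
    }
    return 0;
}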

Problem: Processor-Memory Bottleneck. Processor performance doubled about every 18 months; bus bandwidth evolved much slower. [Diagram: CPU and registers connected over the bus to main memory.] Core 2 Duo: can process at least 256 bytes/cycle, but memory bandwidth is 2 bytes/cycle with a latency of 100 cycles. Problem: lots of waiting on memory.

Problem: Processor-Memory Bottleneck (continued). [Diagram: CPU and registers, now with a cache between the CPU and the bus to main memory.] Core 2 Duo: can process at least 256 bytes/cycle; memory bandwidth 2 bytes/cycle, latency 100 cycles. Solution: caches.

Cache. English definition: a hidden storage space for provisions, weapons, and/or treasures. CSE definition: computer memory with short access time used for the storage of frequently or recently used instructions or data (i-cache and d-cache); more generally, used to optimize data transfers between system elements with different characteristics (network interface cache, I/O cache, etc.).

General Cache Mechanics. The cache is a smaller, faster, more expensive memory that holds a subset of the blocks (here blocks 8, 9, 14, and 3). Data is copied in block-sized transfer units. Memory is the larger, slower, cheaper memory, viewed as partitioned into blocks (here blocks 0-15).

General Cache Concepts: Hit. Request: block 14, with the cache holding blocks 8, 9, 14, 3. Data in block b is needed and block b is in the cache: hit!

General Cache Concepts: Miss. Request: block 12, with the cache holding blocks 8, 9, 14, 3. Data in block b is needed but block b is not in the cache: miss! Block 12 is fetched from memory and stored in the cache. The placement policy determines where b goes; the replacement policy determines which block gets evicted (the victim).

Not to forget: a CPU, a little super-fast memory (the cache, $), and lots of slower memory.

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization; program optimizations that consider caches. (Now: caches and locality.)

Why Caches Work. Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently. Temporal locality: recently referenced items are likely to be referenced again in the near future. Spatial locality: items with nearby addresses tend to be referenced close together in time. How do caches take advantage of this?

Example: Locality?

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data: temporal: sum is referenced in each iteration; spatial: array a[] is accessed in a stride-1 pattern. Instructions: temporal: we cycle through the loop repeatedly; spatial: instructions are referenced in sequence. Being able to assess the locality of code is a crucial skill for a programmer.

Another Locality Example

int sum_array_3d(int a[M][N][N]) {
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

What is wrong with this code? How can it be fixed? (One possible fix is sketched below.)
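A sketch of one fix (ours, not the slide's; assuming M and N are compile-time constants): reorder the loops so the traversal matches the memory layout of a[k][i][j], making the innermost loop stride-1.

#define M 64
#define N 64

int sum_array_3d_fixed(int a[M][N][N]) {
    int i, j, k, sum = 0;
    /* k indexes the outermost dimension and j the contiguous one,
       so this visits the elements in memory order. */
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}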

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization; program optimizations that consider caches. (Now: memory hierarchies.)

Cost of Cache Misses. There is a huge difference between a hit and a miss: it could be 100x, if just L1 and main memory. Would you believe 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time: 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles; 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles. This is why miss rate is used instead of hit rate.
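The arithmetic above is the standard average-memory-access-time formula, AMAT = hit time + miss rate * miss penalty; a quick check (the helper name is ours):

#include <stdio.h>

/* Average memory access time in cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("97%% hits: %.0f cycles\n", amat(1.0, 0.03, 100.0));  /* 4 cycles */
    printf("99%% hits: %.0f cycles\n", amat(1.0, 0.01, 100.0));  /* 2 cycles */
    return 0;
}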

Cache Performance Metrics. Miss rate: fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate; typical numbers: 3%-10% for L1. Hit time: time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache; typical hit times: 1-2 clock cycles for L1. Miss penalty: additional time required because of a miss; typically 50-200 cycles.

Memory Hierarchies. Some fundamental and enduring properties of hardware and software systems: faster storage technologies almost always cost more per byte and have lower capacity; the gaps between memory technology speeds are widening (true for registers vs. cache, cache vs. DRAM, DRAM vs. disk, etc.); well-written programs tend to exhibit good locality. These properties complement each other beautifully, and they suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Memory Hierarchies. Fundamental idea of a memory hierarchy: each level k serves as a cache for the larger, slower level k+1 below it. Why do memory hierarchies work? Because of locality, programs tend to access the data at level k more often than they access the data at level k+1; thus, the storage at level k+1 can be slower, and therefore larger and cheaper per bit. Big idea: the memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

An Example Memory Hierarchy. From smaller, faster, and costlier per byte at the top to larger, slower, and cheaper per byte at the bottom: registers; on-chip L1 cache (SRAM); off-chip L2 cache (SRAM); main memory (DRAM); local secondary storage (local disks); remote secondary storage (distributed file systems, web servers). CPU registers hold words retrieved from the L1 cache; the L1 cache holds cache lines retrieved from the L2 cache; the L2 cache holds cache lines retrieved from main memory; main memory holds disk blocks retrieved from local disks; local disks hold files retrieved from disks on remote network servers.

Intel Core i7 Cache Hierarchy. Processor package with four cores (Core 0 through Core 3), each with its own registers. Per core: L1 i-cache and d-cache, 32 KB each, 8-way, access 4 cycles; L2 unified cache, 256 KB, 8-way, access 11 cycles. Shared by all cores: L3 unified cache, 8 MB, 16-way, access 30-40 cycles. Block size: 64 bytes for all caches. Below that: main memory.

[Figure: Intel i7 die photo.]

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization; program optimizations that consider caches. (Now: cache organization.)

Where should we put data in the cache? [Diagram: memory with 4-bit addresses 0000-1111, and a cache with an index column 00, 01, 10, 11 and a data column.] How can we compute this mapping?

Where should we put data in the cache? [Diagram: each memory address 0000-1111 mapped to cache index 00-11 by its low-order bits.] Hmm... the cache might get confused later! Why? And how do we solve that?

Use tags! [Diagram: the cache now stores, next to each index 00-11 and its data, a tag holding the high-order bits of the address the data came from.]

What's a cache block? (or cache line) [Diagram: byte addresses 0-15 grouped into block (line) numbers 0-7; the blocks map to cache indices 0-3.]

A puzzle. What can you infer from this: the cache starts empty, and the access stream (addr, hit/miss) is (10, miss), (11, hit), (12, miss).

Problems with direct-mapped caches? What happens if a program uses addresses 2, 6, 2, 6, 2, ...? [Diagram: memory addresses 0000-1111 and a direct-mapped cache with indices 00-11; 2 (0010) and 6 (0110) share an index but have different tags, so they keep evicting each other.]

Associativity. What if we could store data in any place in the cache? But that might slow down caches, so we do something in between. 1-way: 8 sets, 1 block each (direct-mapped). 2-way: 4 sets, 2 blocks each. 4-way: 2 sets, 4 blocks each. 8-way: 1 set, 8 blocks (fully associative).

But now how do I know where data goes? An m-bit address is split into an (m-k-n)-bit tag, a k-bit index, and an n-bit block offset. Our example used a 2^2-block cache with 2^1 bytes per block. Where would 13 (1101) be stored? 4-bit address: ? tag bits, ? index bits, ?-bit block offset.

Example placement in set-associative caches. Where would data from address 0x1833 be placed? Block size is 16 bytes; 0x1833 in binary is 0...0110000 011 0011. m-bit address: (m-k-n)-bit tag, k-bit index, n-bit block offset. k = ? for 1-way associativity (8 sets, 1 block each); k = ? for 2-way associativity (4 sets, 2 blocks each); k = ? for 4-way associativity (2 sets, 4 blocks each).

Example placement in set-associative caches (solution). Where would data from address 0x1833 be placed? Block size is 16 bytes, so the block offset takes 4 bits; 0x1833 in binary is 0...0110000 011 0011. m-bit address: (m-k-4)-bit tag, k-bit index, 4-bit block offset. k = 3 for 1-way associativity (8 sets, 1 block each); k = 2 for 2-way associativity (4 sets, 2 blocks each); k = 1 for 4-way associativity (2 sets, 4 blocks each).
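The bit manipulation is mechanical; here is a sketch in C (the function and constant names are ours; it assumes the slide's 16-byte blocks and 2^k sets):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4   /* 16-byte blocks */

/* Split an address into tag, set index, and block offset for a cache
   with 2^k sets and 16-byte blocks. */
static void split_address(uint64_t addr, int k) {
    uint64_t offset = addr & ((1ull << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ull << k) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + k);
    printf("k=%d: tag=0x%llx index=%llu offset=%llu\n", k,
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
}

int main(void) {
    split_address(0x1833, 3);   /* 1-way, 8 sets: index 011, offset 0011 */
    split_address(0x1833, 2);   /* 2-way, 4 sets */
    split_address(0x1833, 1);   /* 4-way, 2 sets */
    return 0;
}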

Block replacement. Any empty block in the correct set may be used for storing data. If there are no empty blocks, which one should we replace? Replace something, of course, but what? Caches typically use something close to least-recently-used (LRU); see the sketch below.
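A minimal sketch of LRU bookkeeping for one set (our own illustration; real hardware typically uses cheaper approximations such as pseudo-LRU):

#include <stdio.h>

#define E 4   /* lines per set */

/* age[i] counts accesses since line i was last used;
   the LRU victim is the line with the largest age. */
static int age[E];

static void touch(int line) {
    for (int i = 0; i < E; i++)
        age[i]++;
    age[line] = 0;   /* just used: youngest */
}

static int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < E; i++)
        if (age[i] > age[victim])
            victim = i;
    return victim;
}

int main(void) {
    int trace[] = { 0, 1, 2, 3, 0 };
    for (int i = 0; i < 5; i++)
        touch(trace[i]);
    printf("victim: line %d\n", lru_victim());   /* line 1: least recently used */
    return 0;
}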

Another puzzle. What can you infer from this: the cache starts empty, and the access stream (addr, hit/miss) is (10, miss), (12, miss), (10, miss).

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization (part 2); program optimizations that consider caches.

General Cache Organization (S, E, B). S = 2^s sets; E = 2^e lines per set; B = 2^b bytes of data per cache line (the data block). Each line holds a valid bit, a tag, and data bytes 0, 1, 2, ..., B-1. Cache size: S x E x B data bytes.

Cache Read. The address of a byte in memory is split into t tag bits, s set-index bits, and b block-offset bits. To read: locate the set (among the S = 2^s sets); check whether any of the E = 2^e lines in the set has a matching tag; if yes and the line is valid, it's a hit; then locate the data starting at the offset (the data begins at this offset within the B = 2^b-byte data block).

Example: Direct-Mapped Cache (E = 1). Direct-mapped: one line per set. Assume: cache block size 8 bytes. Address of an int: t tag bits, set index 0...01, block offset 100. Step 1: use the set-index bits to find the set among the S = 2^s sets (each set is one line: valid bit, tag, data bytes 0-7).

Example: Direct-Mapped Cache (E = 1), continued. Step 2: valid? + tag match? Yes = hit.

Example: Direct-Mapped Cache (E = 1), continued. Step 3: on a hit, the int (4 bytes) is found at block offset 100 within the line. If there is no match, the old line is evicted and replaced.
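The whole lookup fits in a few lines of C; here is a toy simulation (a sketch with our own names and parameters: 8-byte blocks as on the slide, an arbitrary 4 sets, and data blocks omitted since only hit/miss is tracked):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 3                /* 8-byte blocks */
#define SET_BITS   2
#define NUM_SETS   (1 << SET_BITS)  /* 4 sets, direct-mapped (E = 1) */

struct line { bool valid; uint64_t tag; };
static struct line cache[NUM_SETS];

/* Returns true on a hit; on a miss, installs the block (evicting the old line). */
static bool access_addr(uint64_t addr) {
    uint64_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (BLOCK_BITS + SET_BITS);
    struct line *l = &cache[index];
    if (l->valid && l->tag == tag)
        return true;
    l->valid = true;
    l->tag = tag;
    return false;
}

int main(void) {
    uint64_t trace[] = { 0, 4, 8, 0 };   /* 0: miss; 4: hit (same block); 8: miss; 0: hit */
    for (int i = 0; i < 4; i++)
        printf("addr %llu: %s\n", (unsigned long long)trace[i],
               access_addr(trace[i]) ? "hit" : "miss");
    return 0;
}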

E-way Set-Associative Cache (here: E = 2). E = 2: two lines per set. Assume: cache block size 8 bytes. Address of a short int: t tag bits, set index 0...01, block offset 100. Step 1: find the set (each set now contains two lines, each with a valid bit, tag, and data bytes 0-7).

E-way Set-Associative Cache (here: E = 2), continued. Step 2: compare both lines in the set: valid? + tag match? Yes = hit.

E-way Set-Associative Cache (here: E = 2), continued. Step 3: on a hit, the short int (2 bytes) is found at the block offset. If no line matches, one line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), ...

Types of Cache Misses. Cold (compulsory) miss: occurs on the first access to a block. Conflict miss: most hardware caches limit blocks to a small subset (sometimes just one) of the available cache slots; if one (e.g., block i must be placed in slot (i mod size)), the cache is direct-mapped; if more than one, it is n-way set-associative (where n is a power of 2). Conflict misses occur when the cache is large enough but multiple data objects all map to the same slot; e.g., referencing blocks 0, 8, 0, 8, ... would miss every time (see the sketch below). Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache (it just won't fit).
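To make the conflict-miss example concrete: with 8 slots and direct-mapped placement (slot = block mod 8), blocks 0 and 8 collide. A tiny sketch (our own illustration):

#include <stdio.h>

#define SLOTS 8

int main(void) {
    int slot_block[SLOTS];
    for (int i = 0; i < SLOTS; i++)
        slot_block[i] = -1;                 /* cache starts empty */

    int trace[] = { 0, 8, 0, 8, 0, 8 };     /* referencing blocks 0, 8, 0, 8, ... */
    for (int i = 0; i < 6; i++) {
        int b = trace[i], slot = b % SLOTS; /* direct-mapped placement */
        printf("block %d -> slot %d: %s\n", b, slot,
               slot_block[slot] == b ? "hit" : "miss");
        slot_block[slot] = b;               /* install, evicting any old block */
    }
    return 0;                               /* every access misses */
}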

What about writes? Multiple copies of the data exist: in L1, in L2, possibly in L3, and in main memory. What is the main problem with that?

What about writes? Multiple copies of the data exist: in L1, in L2, possibly in L3, and in main memory. What to do on a write-hit? Write-through (write immediately to memory) or write-back (defer the write to memory until the line is evicted; this needs a dirty bit to indicate whether the line differs from memory). What to do on a write-miss? Write-allocate (load the block into the cache, then update the line in the cache; good if more writes to the location follow) or no-write-allocate (just write immediately to memory). Typical caches: write-back + write-allocate, usually; write-through + no-write-allocate, occasionally.
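A sketch of the usual write-back + write-allocate combination for a single direct-mapped line (our own naming; the eviction write-back is left as a comment):

#include <stdbool.h>
#include <stdint.h>

struct line {
    bool valid, dirty;   /* dirty: cached copy differs from memory */
    uint64_t tag;
};

/* Write-back + write-allocate. */
static void write_word(struct line *l, uint64_t tag) {
    if (!(l->valid && l->tag == tag)) {   /* write miss */
        if (l->valid && l->dirty) {
            /* eviction: flush the old (dirty) block back to memory here */
        }
        l->valid = true;                  /* write-allocate: fetch the block */
        l->tag = tag;
    }
    /* update only the cached copy; memory is written when the line is evicted */
    l->dirty = true;
}

int main(void) {
    struct line l = { 0 };
    write_word(&l, 0x30);   /* miss: allocate, then mark dirty */
    write_word(&l, 0x30);   /* hit: stays dirty, no memory traffic */
    return 0;
}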

Intel Core i7 Cache Hierarchy (revisited). Processor package with four cores, each with its own registers. Per core: L1 i-cache and d-cache, 32 KB each, 8-way, access 4 cycles; L2 unified cache, 256 KB, 8-way, access 11 cycles. Shared by all cores: L3 unified cache, 8 MB, 16-way, access 30-40 cycles. Block size: 64 bytes for all caches. Below that: main memory.

Memory and Caches: cache basics; principle of locality; memory hierarchies; cache organization; program optimizations that consider caches. (Now: program optimizations that consider caches.)

Optimizations for the Memory Hierarchy. Write code that has locality. Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time. How to achieve this? Proper choice of algorithm, and loop transformations.

Example: Matrix Multiplication

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

[Diagram: element (i, j) of c is the dot product of row i of a with column j of b.]

Cache Miss Analysis. Assume: matrix elements are doubles; cache block = 64 bytes = 8 doubles; cache size C << n (much smaller than n). First iteration: n/8 + n = 9n/8 misses (omitting matrix c): the row of a is read with stride 1, so it misses once per 8 doubles (n/8 misses), while the column of b is read with stride n, so it misses on every element (n misses). [Schematic: afterwards, only the first 8-wide chunks remain in cache.]

Cache Miss Analysis, continued. Other iterations: again n/8 + n = 9n/8 misses (omitting matrix c). Total misses: 9n/8 * n^2 = (9/8) * n^3.

Blocked Matrix Multiplication

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b, one B x B block at a time */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

[Diagram: block (i1, j1) of c accumulates products of B x B blocks of a and b.]

Cache Miss Analysis (blocked). Assume: cache block = 64 bytes = 8 doubles; cache size C << n (much smaller than n); three B x B blocks fit into the cache: 3B^2 < C. First (block) iteration: B^2/8 misses for each B x B block (B^2 doubles, 8 per cache block); 2n/B blocks of a and b are touched, so 2n/B * B^2/8 = nB/4 misses (omitting matrix c). [Schematic: afterwards, the B x B blocks remain in cache.]

Cache Miss Analysis (blocked), continued. Other (block) iterations: same as the first iteration: 2n/B * B^2/8 = nB/4. Total misses: nB/4 * (n/B)^2 = n^3/(4B).

Summary. No blocking: (9/8) * n^3 misses. Blocking: 1/(4B) * n^3 misses. If B = 8, the difference is 4 * 8 * 9/8 = 36x; if B = 16, it is 4 * 16 * 9/8 = 72x. This suggests using the largest possible block size B, subject to the limit 3B^2 < C. Reason for the dramatic difference: matrix multiplication has inherent temporal locality: the input data is 3n^2 elements, the computation is 2n^3 operations, so every array element is used O(n) times. But the program has to be written properly.
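The ratio of the two miss counts is ((9/8) n^3) / (n^3 / (4B)) = 4.5B, independent of n; a quick check (the helper names are ours):

#include <stdio.h>

/* Predicted miss counts from the slide's model. */
static double misses_naive(double n)             { return 9.0 / 8.0 * n * n * n; }
static double misses_blocked(double n, double B) { return n * n * n / (4.0 * B); }

int main(void) {
    double n = 1024.0;
    for (double B = 8.0; B <= 16.0; B *= 2.0)
        printf("B = %2.0f: %.0fx fewer misses\n",
               B, misses_naive(n) / misses_blocked(n, B));   /* 36x, 72x */
    return 0;
}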

Cache-Friendly Code. The programmer can optimize for cache performance: how data structures are organized; how data are accessed (nested loop structure; blocking is a general technique). All systems favor cache-friendly code. Getting the absolute optimum performance is very platform-specific (cache sizes, line sizes, associativities, etc.), but you can get most of the advantage with generic code: keep the working set reasonably small (temporal locality); use small strides (spatial locality); focus on the inner-loop code.

The Memory Mountain. [Figure: read throughput (MB/s) versus stride (s1-s32, in units of 8 bytes) and working-set size (2 KB to 64 MB), measured on an Intel Core i7: 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, 8 MB unified L3 cache, all caches on-chip. The plateaus correspond to the L1, L2, L3, and main-memory regimes.]