Problem: Processor-Memory Bottleneck


Today: memory hierarchy, caches, locality; cache organization; program optimizations that consider caches.

Problem: Processor-Memory Bottleneck. Processor performance has doubled about every 18 months, while bus bandwidth has evolved much more slowly. Example (Core 2 Duo): the CPU can process at least 256 bytes/cycle, but the bus to main memory provides a bandwidth of only 2 bytes/cycle with a latency of about 100 cycles. Solution: caches.

Cache. English definition: a hidden storage space for provisions, weapons, and/or treasures. CSE definition: computer memory with short access time used for the storage of frequently or recently used instructions or data (i-cache and d-cache); more generally, used to optimize data transfers between system elements with different characteristics (network interface cache, I/O cache, etc.).

General Cache Mechanics. The cache is a smaller, faster, more expensive memory that holds a subset of the blocks (e.g., blocks 8, 9, 14, and 3 in the figure). Memory is a larger, slower, cheaper memory viewed as partitioned into blocks (0 through 15 in the figure). Data is copied between them in block-sized transfer units.

General Cache Concepts: Hit. Data in block b is needed (e.g., a request for block 14). Block b is in the cache: hit!

General Cache Concepts: Miss. Data in block b is needed (e.g., a request for block 12). Block b is not in the cache: miss! Block b is fetched from memory and stored in the cache. The placement policy determines where b goes; the replacement policy determines which block gets evicted (the victim).

Cache Performance Metrics. Miss rate: the fraction of memory references not found in the cache, (misses / accesses) = 1 - hit rate. Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc. Hit time: the time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. Typical numbers: 1-2 clock cycles for L1, 5-20 clock cycles for L2. Miss penalty: the additional time required because of a miss, typically 50-200 cycles for main memory (trend: increasing!).

Let's think about those numbers. There is a huge difference between a hit and a miss: it could be 100x, with just L1 and main memory. Would you believe 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time: with 97% hits, 1 cycle + 0.03 * 100 cycles = 4 cycles; with 99% hits, 1 cycle + 0.01 * 100 cycles = 2 cycles. This is why miss rate is used instead of hit rate.
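A minimal sketch of the average-access-time arithmetic above, using the slide's assumed numbers (1-cycle hit time, 100-cycle miss penalty); the helper name amat is illustrative, not from the slide:

#include <stdio.h>

/* Average memory access time = hit_time + miss_rate * miss_penalty (cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    printf("97%% hits: %.1f cycles\n", amat(1.0, 0.03, 100.0));  /* 4.0 */
    printf("99%% hits: %.1f cycles\n", amat(1.0, 0.01, 100.0));  /* 2.0 */
    return 0;
}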

Types of Cache Misses. Cold (compulsory) miss: occurs on the first access to a block. Conflict miss: most hardware caches limit each block to a small subset (sometimes just one) of the available cache slots; if there is only one choice (e.g., block i must be placed in slot (i mod size)), the cache is direct-mapped; if there is more than one, it is n-way set associative (where n is a power of 2). Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot; e.g., referencing blocks 0, 8, 0, 8, ... would miss every time (a small sketch of this placement rule appears after the next slide). Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache (it just won't fit).

Why Caches Work. Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently. Temporal locality: recently referenced items are likely to be referenced again in the near future. Spatial locality: items with nearby addresses tend to be referenced close together in time.
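Referring back to the conflict-miss example above, a minimal sketch of the (i mod size) placement rule for a direct-mapped cache; the 8-slot size is an illustrative assumption:

/* Direct-mapped placement: block i goes in slot (i mod num_slots).
 * With 8 slots, blocks 0 and 8 both map to slot 0, so the reference
 * pattern 0, 8, 0, 8, ... evicts and misses on every access. */
int slot_for_block(int block, int num_slots)
{
    return block % num_slots;
}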

Example: Locality?

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data: temporal locality, since sum is referenced in each iteration; spatial locality, since array a[] is accessed in a stride-1 pattern. Instructions: temporal locality, since we cycle through the loop repeatedly; spatial locality, since instructions are referenced in sequence. Being able to assess the locality of code is a crucial skill for a programmer.

Locality Example #1.

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Access order: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], ..., a[2][3]; this matches the row-major layout in memory, i.e., a stride-1 pattern.

Locality Example #2.

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Access order: a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], a[2][1], ..., a[2][3]; successive accesses are a full row apart in memory, i.e., a stride-N pattern.

Locality Example #3.

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

What is wrong with this code? How can it be fixed? (One possible fix is sketched below.)
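One possible fix, not spelled out on the slide: reorder the loops so that the innermost loop walks the fastest-varying dimension, restoring a stride-1 access pattern. A minimal sketch:

int sum_array_3d_fixed(int a[M][N][N])
{
    int i, j, k, sum = 0;
    /* Iterate in memory order: k selects the outermost dimension,
     * then i, then j, so successive accesses to a[k][i][j] are stride-1. */
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}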

Memory Hierarchies. Some fundamental and enduring properties of hardware and software systems: faster storage technologies almost always cost more per byte and have lower capacity; the gaps between memory technology speeds are widening (true for registers vs. cache, cache vs. DRAM, DRAM vs. disk, etc.); well-written programs tend to exhibit good locality. These properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

An Example Memory Hierarchy (smaller, faster, costlier per byte toward the top; larger, slower, cheaper per byte toward the bottom):
L0: CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory.
L3: main memory (DRAM) holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, web servers).

Examples of Caching in the Hierarchy.

Cache type       What is cached?        Where is it cached?    Latency (cycles)   Managed by
Registers        4-byte words           CPU core               0                  Compiler
TLB              Address translations   On-chip TLB            0                  Hardware
L1 cache         64-byte blocks         On-chip L1             1                  Hardware
L2 cache         64-byte blocks         Off-chip L2            10                 Hardware
Virtual memory   4-KB pages             Main memory            100                Hardware + OS
Buffer cache     Parts of files         Main memory            100                OS
Network cache    Parts of files         Local disk             10,000,000         File system client
Browser cache    Web pages              Local disk             10,000,000         Web browser
Web cache        Web pages              Remote server disks    1,000,000,000      Web server

Memory Hierarchy: Core 2 Duo (not drawn to scale; L1 and L2 caches use 64 B blocks).
L1 i-cache and d-cache: 32 KB each, latency 3 cycles, throughput 16 B/cycle.
L2 unified cache: ~4 MB, latency 14 cycles, throughput 8 B/cycle.
Main memory: ~4 GB, latency 100 cycles, throughput 2 B/cycle.
Disk: ~500 GB, latency millions of cycles, throughput 1 B / 30 cycles.

General Cache Organization (S, E, B). A cache is organized as S = 2^s sets, with E = 2^e lines per set. Each line has a valid bit, a tag, and a data block of B = 2^b bytes (the data). Cache size = S x E x B data bytes.

Cache Read. To read a word: locate the set; check whether any line in the set has a matching tag; if yes and the line is valid, it is a hit; then locate the data starting at the block offset. The address of the word is divided into t tag bits, s set-index bits, and b block-offset bits; the data begins at this offset within the line's B = 2^b-byte data block.
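A minimal sketch of splitting an address into tag, set index, and block offset; the parameters (64-byte blocks, 64 sets) and the example address are illustrative assumptions, not from the slide:

#include <stdint.h>
#include <stdio.h>

enum { B_BITS = 6, S_BITS = 6 };   /* assumed: B = 64 bytes (b = 6), S = 64 sets (s = 6) */

int main(void)
{
    uint64_t addr   = 0x7ffce123;                                /* example address */
    uint64_t offset = addr & ((1u << B_BITS) - 1);               /* low b bits */
    uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);   /* next s bits */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining t bits */
    printf("tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}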

Example: Direct-Mapped Cache (E = 1). Direct-mapped means one line per set. Assume a cache block size of 8 bytes. The address of the int splits into t tag bits, set-index bits 001, and block-offset bits 100; the set-index bits select one of the S = 2^s sets.

Example: Direct-Mapped Cache (E = 1), continued. Within the selected set, check the valid bit and compare the tag (valid? + match: assume yes here), which gives a hit; the block offset then locates the requested bytes within the line.
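A minimal sketch of the direct-mapped lookup just described; the struct layout and names are illustrative, not a particular hardware design:

#include <stdint.h>

typedef struct {
    int      valid;
    uint64_t tag;
    char     data[8];        /* B = 8-byte block, as assumed on the slide */
} line_t;

/* One line per set (E = 1): a hit iff the selected line is valid and the tags match. */
int dm_lookup(const line_t *sets, uint64_t set_index, uint64_t tag)
{
    const line_t *line = &sets[set_index];
    return line->valid && line->tag == tag;
}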

Example: Direct-Mapped Cache (E = 1), continued. On a hit, the int (4 bytes) is found at the block offset within the line. If there is no match, the old line is evicted and replaced.

Example (for E = 1). Assume sum, i, and j are kept in registers, and the cache starts cold (empty). The address of an aligned element of a has the form aa...aaxxxxyyyy000, which splits into tag aa...aaxxx, set index xyy, and byte offset yy000 (3 bits for the set, 5 bits for the byte offset, so blocks are 32 B = 4 doubles).

int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

Row-major traversal: each 32 B block holds 4 doubles, so there are 4 misses per row, 4 * 16 = 64 misses in total.

int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

Column-major traversal: each block is evicted before the next element in it is needed, so every access is a miss, 16 * 16 = 256 misses in total.

Example (for E = 1).

float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}

If x and y have aligned starting addresses, e.g., &x[0] = 0 and &y[0] = 128, then x[i] and y[i] map to the same set of the direct-mapped cache, so the blocks holding x and y keep evicting each other (x[0] y[0] x[1] y[1] ... all contend for the same sets) and every access misses (a small worked sketch of this collision appears after the next slide). If x and y have unaligned starting addresses, e.g., &x[0] = 0 and &y[0] = 144, their blocks map to different sets and both arrays can stay in the cache.

E-way Set Associative Cache (Here: E = 2). E = 2 means two lines per set. Assume a cache block size of 8 bytes. The address of the short int again splits into t tag bits, set-index bits 001, and block-offset bits 100; the set-index bits locate the set.
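Referring back to the aligned dotprod case above, a minimal worked sketch of why &x[0] = 0 and &y[0] = 128 collide; the cache parameters (direct-mapped, 128 bytes total, 16-byte blocks) are illustrative assumptions, not from the slide:

/* set(addr) = (addr / block_size) mod num_sets.
 * Assumed: block_size = 16, num_sets = 8 (128-byte direct-mapped cache).
 * set(0)   = (0   / 16) % 8 = 0
 * set(128) = (128 / 16) % 8 = 0   -> x[0..3] and y[0..3] fight over set 0.
 * With &y[0] = 144: set(144) = (144 / 16) % 8 = 1 -> no conflict. */
int set_for_addr(unsigned addr, unsigned block_size, unsigned num_sets)
{
    return (addr / block_size) % num_sets;
}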

E-way Set Associative Cache (Here: E = 2), continued. Within the selected set, compare the tags of both lines (valid? + match: yes gives a hit); the block offset then locates the data within the matching line.

E-way Set Associative Cache (Here: E = 2), continued. On a hit, the short int (2 bytes) is found at the block offset. If neither line matches, one line in the set is selected for eviction and replacement; replacement policies include random and least recently used (LRU).

Example (for E = 2). For the same dotprod code, if x and y have aligned starting addresses, e.g., &x[0] = 0 and &y[0] = 128, both can still fit because there are 2 lines in each set: x[0..3] and y[0..3] share one set, and x[4..7] and y[4..7] share another, without evicting each other.

Fully Set Associative Caches (S = 1). All lines are in one single set, S = 1, so E = C / B, where C is the total cache size (S = 1 = (C / B) / E). Direct-mapped caches have E = 1, so S = (C / B) / E = C / B. Tags are more expensive in associative caches: a fully associative cache needs C / B tag comparators, a direct-mapped cache needs 1 tag comparator, and in general an E-way set associative cache needs E tag comparators. Tag size, assuming m address bits (m = 32 for IA32): m - log2 S - log2 B.
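A worked instance of the tag-size formula above; the cache parameters are illustrative, not from the slide. With m = 32 address bits, a direct-mapped cache of C = 32 KB with B = 64-byte blocks has S = C / B = 512 sets, so:

tag bits = m - log2 S - log2 B = 32 - 9 - 6 = 17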

Typical Memory Hierarchy (Intel Core i7). L0: CPU registers (optimized by the compiler); smaller, faster, and costlier per byte toward the top. L1: on-chip L1 cache (SRAM), 8-way associative in the Intel Core i7. L2: off-chip L2 cache (SRAM), 8-way associative in the Intel Core i7. L3: off-chip L3 cache shared by multiple cores (SRAM), 16-way associative in the Intel Core i7. L4: main memory (DRAM); larger, slower, and cheaper per byte toward the bottom. L5: local secondary storage (local disks). L6: remote secondary storage (distributed file systems, web servers).

What about writes? Multiple copies of the data exist: L1, L2, main memory, disk. What to do on a write hit? Write-through (write immediately to memory) or write-back (defer the write to memory until the line is replaced; this needs a dirty bit recording whether the line differs from memory). What to do on a write miss? Write-allocate (load the block into the cache, then update the line in the cache; good if more writes to the location follow) or no-write-allocate (write immediately to memory). Typical combinations: write-through + no-write-allocate, or write-back + write-allocate.
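A minimal sketch of the write-back + write-allocate combination described above, in the style of a software cache simulator; the names and structure are illustrative assumptions, not a hardware design:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;          /* line differs from memory? (write-back) */
    uint64_t tag;
} line_t;

/* Handle a store to the selected line under write-back + write-allocate. */
void handle_write(line_t *line, uint64_t tag)
{
    if (!(line->valid && line->tag == tag)) {     /* write miss */
        if (line->valid && line->dirty) {
            /* write the evicted (dirty) line back to memory here */
        }
        /* write-allocate: fetch the block into the cache */
        line->valid = true;
        line->tag   = tag;
        line->dirty = false;
    }
    /* write hit (or newly allocated line): update the cache only */
    line->dirty = true;
}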

Software Caches are More Flexible. Examples: file system buffer caches, web browser caches, etc. Some design differences: they are almost always fully associative, so there are no placement restrictions, and index structures like hash tables are common (for placement); they often use complex replacement policies, since misses are very expensive when a disk or the network is involved and it is worth thousands of cycles to avoid them; they are not necessarily constrained to single-block transfers, and may fetch or write back in larger units, opportunistically.

The Memory Mountain. (Figure: read throughput in MB/s as a function of stride, in words, and working set size, from 2 KB to 8 MB, measured on a Pentium III Xeon at 550 MHz with a 16 KB on-chip L1 d-cache, a 16 KB on-chip L1 i-cache, and a 512 KB off-chip unified L2 cache; distinct ridges correspond to the L1, L2, and main-memory regimes.)

Optimizations for the Memory Hierarchy. Write code that has locality. Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time. How to achieve this? Proper choice of algorithm, and loop transformations. Cache-level versus register-level optimization: in both cases locality is desirable; register space is much smaller and requires scalar replacement to exploit temporal locality; register-level optimizations include exhibiting instruction-level parallelism (which conflicts with locality).

Example: Matrix Multiplication.

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

(c = a * b: element c[i][j] is the dot product of row i of a and column j of b.)

Cache Miss Analysis. Assume: matrix elements are doubles; a cache block holds 8 doubles; the cache size C is much smaller than n. First iteration (of the i, j loops): reading a row of a costs n/8 misses (stride 1, 8 doubles per block) and reading a column of b costs n misses (stride n), so n/8 + n = 9n/8 misses (omitting matrix c). Afterwards (schematically), the row of a and an 8-wide strip of b have been touched, but because C << n they do not survive in the cache for reuse.

Cache Miss Analysis, continued. Other iterations: again n/8 + n = 9n/8 misses per iteration (omitting matrix c). Total misses: 9n/8 * n^2 = (9/8) * n^3.

Blocked Matrix Multiplication.

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

(c = a * b, computed one B x B block at a time.)

Cache Miss Analysis. Assume: a cache block holds 8 doubles; the cache size C is much smaller than n; four blocks fit into the cache, 4B^2 < C. First (block) iteration: B^2/8 misses for each block; 2n/B blocks of a and b are read, so 2n/B * B^2/8 = nB/4 misses (omitting matrix c). Afterwards (schematically), the B x B blocks just used are in the cache.

Cache Miss Analysis, continued. Assume: a cache block holds 8 doubles; the cache size C is much smaller than n; three blocks fit into the cache, 3B^2 < C. Other (block) iterations: same as the first iteration, 2n/B * B^2/8 = nB/4 misses. Total misses: nB/4 * (n/B)^2 = n^3/(4B).

Summary. No blocking: (9/8) * n^3 misses. Blocking: 1/(4B) * n^3 misses. If B = 8, the difference is 4 * 8 * 9/8 = 36x; if B = 16, the difference is 4 * 16 * 9/8 = 72x. This suggests using the largest possible block size B, but with the limit 4B^2 < C (which can possibly be relaxed a bit; there is still a limit on B). The reason for the dramatic difference: matrix multiplication has inherent temporal locality: the input data is 3n^2, the computation is 2n^3, so every array element is used O(n) times. But the program has to be written properly.