write-through v. write-back


option 1: write-through
1. CPU writes 10 to 0xABCD
2. the cache updates its copy (ABCD: FF becomes ABCD: 10) and immediately sends the write on to RAM, so RAM also holds ABCD: 10 (alongside 11CD: 42)

option 2: write-back
1. CPU writes 10 to 0xABCD
2. the cache updates its copy and marks it dirty (ABCD: 10 (dirty)); RAM still holds the stale ABCD: FF

option 2: write-back, when the dirty block conflicts
1. CPU writes 10 to 0xABCD; the cache holds ABCD: 10 (dirty), RAM still holds the old value
2. CPU reads from 0x11CD, which conflicts with 0xABCD's cache block
3. when the dirty block is replaced, its value is sent to memory (RAM now holds ABCD: 10)
4. the new value (0x11CD: 42) is read into the cache to store there

writeback policy: changed value!
2-way set associative, 4 byte blocks, 2 sets

index | valid  tag     value                 dirty | valid  tag     value                   dirty | LRU
0     | 1      000000  mem[0x00] mem[0x01]   0     | 1      011000  mem[0x60]* mem[0x61]*   1     | 1
1     | 1      011000  mem[0x62] mem[0x63]   0     | 0                                      0     | 0

* dirty = 1 (different than memory): needs to be written back if evicted

allocate on write?
the processor writes less than a whole cache block, and the block is not yet in the cache; two options:
  write-allocate: fetch the rest of the cache block, replace the written part
  write-no-allocate: send the write through to memory (guess: the block will not be read soon?)
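A minimal toy model of the write-back, write-allocate policy described above (not from the slides; the tiny RAM array, the single-line cache, and all names are invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4                 /* bytes per block, as in the examples */

    static uint8_t ram[16][BLOCK_SIZE];  /* toy backing store: 16 blocks */

    struct line {
        bool valid, dirty;
        unsigned tag;                    /* toy: the tag is just the block number */
        uint8_t data[BLOCK_SIZE];
    };

    /* store one byte through a write-back, write-allocate cache line */
    static void store_byte(struct line *l, unsigned tag, unsigned off, uint8_t v) {
        if (!l->valid || l->tag != tag) {                  /* write miss */
            if (l->valid && l->dirty)
                memcpy(ram[l->tag], l->data, BLOCK_SIZE);  /* write back dirty block */
            memcpy(l->data, ram[tag], BLOCK_SIZE);         /* write-allocate: fetch rest */
            l->valid = true; l->tag = tag; l->dirty = false;
        }
        l->data[off] = v;   /* update only the cached copy... */
        l->dirty = true;    /* ...so memory is stale until this block is evicted */
    }

    int main(void) {
        struct line l = {0};
        store_byte(&l, 3, 1, 0x10);   /* miss: allocates block 3, writes, marks dirty */
        printf("dirty=%d ram=%02x cache=%02x\n", l.dirty, ram[3][1], l.data[1]);
        store_byte(&l, 5, 0, 0x42);   /* conflicting block: block 3 written back first */
        printf("after eviction, ram[3][1]=%02x\n", ram[3][1]);
        return 0;
    }

The first print shows the cached copy (0x10) differing from stale memory (0x00); the second shows the value reaching memory only at eviction.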

write-allocate (2-way set associative, LRU, writeback)

index | valid  tag     value                 dirty | valid  tag     value                   dirty | LRU
0     | 1      000000  mem[0x00] mem[0x01]   0     | 1      011000  mem[0x60]* mem[0x61]*   1     | 1
1     | 1      011000  mem[0x62] mem[0x63]   0     | 0                                      0     | 0

writing 0xFF into address 0x04? index 0, tag 000001
step 1: find the least recently used block (here, the dirty mem[0x60] block)
step 2: possibly write back the old block (it is dirty, so it must be written to memory)
step 3a: read in the new block to get mem[0x05]
step 3b: update the LRU information
afterwards, set 0's second way holds tag 000001, value 0xFF mem[0x05], dirty = 1

write-no-allocate (2-way set associative, LRU, writeback)
writing 0xFF into address 0x04?
step 1: is it in the cache yet?
step 2: no, so just send the write to memory

fast writes
write 10 to 0xABCD; write 20 to 0x1234: both go into a write buffer between the CPU and the cache
the write appears to complete immediately when placed in the buffer; memory can be much slower

cache organization and miss rate
depends on the program; one example: SPEC CPU2000 benchmarks, 64B block size, LRU replacement
data cache miss rates:

Cache size | direct-mapped | 2-way   | 8-way    | fully assoc.
1KB        | 8.63%         | 6.97%   | 5.63%    | 5.34%
2KB        | 5.71%         | 4.23%   | 3.30%    | 3.05%
4KB        | 3.70%         | 2.60%   | 2.03%    | 1.90%
16KB       | 1.59%         | 0.86%   | 0.56%    | 0.50%
64KB       | 0.66%         | 0.37%   | 0.10%    | 0.001%
128KB      | 0.27%         | 0.001%  | 0.0006%  | 0.0006%

Data: Cantin and Hill, Cache Performance for SPEC CPU2000 Benchmarks
http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

reasoning about cache performance
hit time: time to look up and find the value in the cache (L1 cache: typically 1 cycle?)
miss rate: portion of accesses that miss (value not in the cache)
miss penalty: extra time to get the value if there is a miss (time to access the next-level cache or memory)

average memory access time: AMAT = hit time + miss rate x miss penalty
the effective speed of memory (a worked example follows these slides)
miss time: hit time + miss penalty

what cache parameters are better?
one can write a program to make any cache look bad:
1. access enough blocks to fill the cache
2. access an additional block, replacing something
3. access the last block replaced
4. access the last block replaced
5. access the last block replaced
average time = hit time + miss rate x miss penalty
but typical real programs have locality

cache optimizations
                        miss rate | hit time | miss penalty
increase cache size     better    | worse    |
increase associativity  better    | worse    | worse?
increase block size     depends   | worse    | worse
add secondary cache               |          | better
write-allocate          better    |          | worse?
writeback                         | better   | worse?
LRU replacement         better?   | worse?   |
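A worked instance of the AMAT formula above (the numbers are invented for illustration): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 x 100 = 6 cycles per access on average; cutting the miss rate to 2% gives 1 + 0.02 x 100 = 3 cycles, so even small miss-rate improvements can dominate everything else.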

cache optimizations by miss type
                        capacity     | conflict     | compulsory
increase cache size     fewer misses |              |
increase associativity               | fewer misses |
increase block size     more misses  |              | fewer misses

exercise (1)
initial cache: 64-byte blocks, 64 sets, 8 ways/set
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
B. quadrupling the number of sets
C. quadrupling the number of ways/set

exercise (2)
initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

exercise (3)
initial cache: 64-byte blocks, 8 ways/set, 64KB cache
If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)
A. quadrupling the block size (256-byte blocks, 8 ways/set, 64KB cache)
B. quadrupling the number of ways/set
C. quadrupling the cache size

a note on matrix storage
A: an N x N matrix; represent it as a one-dimensional array
makes dynamic sizes easier:
    float A_2d_array[N][N];
    float *A_flat = malloc(N * N * sizeof(float));
    A_flat[i * N + j] === A_2d_array[i][j]

matrix squaring
    B_{ij} = \sum_{k=1}^{n} A_{ik} A_{kj}

    /* version 1: inner loop is k, middle is j */
    B[i*N+j] += A[i * N + k] * A[k * N + j];

    /* version 2: outer loop is k, middle is i */
    B[i*N+j] += A[i * N + k] * A[k * N + j];

the loop bodies are identical; only the loop order differs (the full loop nests are sketched below)
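The slides show only the shared loop body; a sketch of the two complete loop nests it implies (assuming N x N float matrices and a zero-initialized B; function names are invented):

    /* version 1: i outer, j middle, k inner */
    void square_ijk(const float *A, float *B, int N) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    B[i * N + j] += A[i * N + k] * A[k * N + j];
    }

    /* version 2: k outer, i middle, j inner */
    void square_kij(const float *A, float *B, int N) {
        for (int k = 0; k < N; ++k)
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    B[i * N + j] += A[i * N + k] * A[k * N + j];
    }

Both compute the same result; only the traversal order, and hence the locality, differs.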

performance
[graph: billions of instructions and billions of cycles vs. N (0 to 500), for the k-inner and k-outer versions]

alternate view 1: cycles/instruction
[graph: cycles/instruction (about 0.2 to 0.9) vs. N (0 to 500)]

alternate view 2: cycles/operation
[graph: cycles per multiply or add (about 1.0 to 3.5) vs. N (0 to 500)]

loop orders and locality
loop body: B_ij += A_ik * A_kj
kij order: B_ij and A_kj have spatial locality; A_ik has temporal locality
better than ijk order: A_ik has spatial locality; B_ij has temporal locality

matrix squaring
    B_{ij} = \sum_{k=1}^{n} A_{ik} A_{kj}

    /* version 1: inner loop is k, middle is j */
    B[i*N+j] += A[i * N + k] * A[k * N + j];

    /* version 2: outer loop is k, middle is i */
    B[i*N+j] += A[i * N + k] * A[k * N + j];

exercise: which should perform better? why?

L1 misses
[graph: read misses per 1K instructions (0 to 140) vs. N (0 to 500), for the k-inner and k-outer versions]

L1 miss detail (1)
[graph: the same data for N from 0 to 200, the region where the matrix is smaller than the L1 cache]

L1 miss detail (2)
[same graph, with spikes marked at N = 2^7; at N = 93, where 93 * 11 is about 2^10; and at N = 114, where 114 * 9 is about 2^10]

addresses
A[k*114+j]     is at 10 0000 0000 0100
A[k*114+j+1]   is at 10 0000 0000 1000
A[(k+1)*114+j] is at 10 0011 1001 0100
A[(k+2)*114+j] is at 10 0101 0101 1100
A[(k+9)*114+j] is at 11 0000 0000 1100

recall: 6 index bits, 6 block offset bits (L1)
with powers of two, the lower-order bits are unchanged between the addresses above

conflict misses
A[k*93+j] and A[(k+11)*93+j] are 1023 elements apart (4092 bytes; 63.9 cache blocks)
with 64 sets in the L1 cache, they usually map to the same set (the index arithmetic is sketched in code below)
so A[k*93+(j+1)] will not be cached on the next i-loop iteration, even if it is in the same block as A[k*93+j]

reasoning about loop orders
changing the loop order changed the locality
how do we tell which loop order will be best, besides running each one?

systematic approach (1)
    B[i*N+j] += A[i*N+k] * A[k*N+j];
goal: get the most out of each cache miss
if N is larger than the cache:
  miss for B_ij: 1 computation
  miss for A_ik: N computations
  miss for A_kj: 1 computation
effectively caching just 1 element
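A small sketch of the index arithmetic behind the conflict-miss claim above (the constants match the 64-set, 64-byte-block L1 described in the slides; the helper and addresses are invented for illustration):

    #include <stdio.h>

    #define OFFSET_BITS 6   /* 64-byte blocks */
    #define INDEX_BITS  6   /* 64 sets */

    static unsigned set_index(unsigned long addr) {
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }

    int main(void) {
        /* A[k*93+j] and A[(k+11)*93+j] are 1023 floats = 4092 bytes apart;
           4092 bytes is just under 64 blocks (4096 bytes), so for most
           starting offsets within a block the two land in the same set */
        unsigned long a = 0x10010;
        unsigned long b = a + 1023 * sizeof(float);
        printf("set(a)=%u set(b)=%u\n", set_index(a), set_index(b));
        return 0;
    }

Here both addresses print set 0, so the two elements evict each other from a direct-mapped set.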

locality exercise (1)
    /* version 1: loop i outer, loop j inner */
    A[i] += B[j] * C[i * N + j];

    /* version 2: loop j outer, loop i inner */
    A[i] += B[j] * C[i * N + j];
exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

keeping values in cache
we can't explicitly ensure values are kept in the cache, but reusing values effectively does this: the cache will try to keep recently used values
cache optimization idea: choose what's in the cache
  for thinking about it: pretend we load values explicitly
  for implementing it: access only the values we want loaded

flat 2D arrays and cache blocks
[diagram: the flat array from A[0] onward; consecutive elements A_x0 ... A_xN of one row are adjacent in memory, so a cache block holds several elements of a row]

array usage: kij order
for all k: for all i: for all j: B_ij += A_ik * A_kj
N calculations for A_ik; 1 for A_kj and B_ij
[diagram: the row A_k0 ... A_kN, the element A_ik, and the row B_i0 ... B_iN highlighted in the flat arrays]

array usage: kij order (continued)
for all k: for all i: for all j: B_ij += A_ik * A_kj

A_ik: reused in the innermost loop (over j), so definitely cached (plus the rest of its cache block)
A_kj (the row A_k0 to A_kN): reused in the next middle-loop iteration (over i); cached only if the entire row fits
B_ij (the row B_i0 to B_iN): reused in the next outer-loop iteration (over k); probably not still in cache by then (but, at least, some spatial locality)

inefficiencies
if a row doesn't fit in cache: the cache effectively holds just one element; everything else is evicted by too much other stuff between accesses
if a row does fit in cache: the cache effectively holds one row plus one element; everything else is evicted by too much other stuff between accesses

systematic approach (2)
    B[i*N+j] += A[i*N+k] * A[k*N+j];
A_ik is loaded once in this loop (N^2 times total); B_ij and A_kj are loaded each iteration (if N is big)
N^3 multiplies, N^3 adds: about 1 load per operation

a transformation
    for (int kk = 0; kk < N; kk += 2)
      for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];
split the loop over k; should be exactly the same (assuming even N)

simple blocking
    for (int kk = 0; kk < N; kk += 2)
      /* was here: for (int k = kk; k < kk + 2; ++k) */
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
          for (int k = kk; k < kk + 2; ++k)
            B[i*N+j] += A[i*N+k] * A[k*N+j];
now reorder the split loop: same calculations
now we handle B_ij for k+1 right after B_ij for k (previously: B_{i,j+1} for k right after B_ij for k)

simple blocking expanded
    for (int kk = 0; kk < N; kk += 2) {
      for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
          /* process a "block" of 2 k values: */
          B[i*N+j] += A[i*N+kk+0] * A[(kk+0)*N+j];
          B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
        }
      }
    }

simple blocking expanded (continued)
observations about the expanded loop:
  temporal locality in the B_ij's
  more spatial locality in A_ik
  still good spatial locality in A_kj and B_ij

improvement in read misses
[graph: read misses per 1K instructions as a percentage of unblocked (0 to 20) vs. N (0 to 600), blocked (kk += 2) vs. unblocked]

simple blocking (2)
same thing for i in addition to k?
    for (int kk = 0; kk < N; kk += 2) {
      for (int ii = 0; ii < N; ii += 2) {
        for (int j = 0; j < N; ++j) {
          /* process a "block": */
          for (int k = kk; k < kk + 2; ++k)
            for (int i = ii; i < ii + 2; ++i)
              B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
      }
    }

simple blocking expanded (2)
    for (int k = 0; k < N; k += 2) {
      for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; ++j) {
          /* process a "block": */
          B_{i+0,j} += A_{i+0,k+0} * A_{k+0,j}
          B_{i+0,j} += A_{i+0,k+1} * A_{k+1,j}
          B_{i+1,j} += A_{i+1,k+0} * A_{k+0,j}
          B_{i+1,j} += A_{i+1,k+1} * A_{k+1,j}
        }
      }
    }

array usage (better) generalizing cache blocking A k0 to A k+1,n A ik to A i+1,k+1 B i0 to B i+1,n more spatial locality: calculate on each A i,k and A i,k+1 together both in same cache block same amount of cache loads 44 for (int kk = 0; kk < N; kk += K) { for (int ii = 0; ii < N; ii += I) { with I by K block of A hopefully cached: for (int jj = 0; jj < N; jj += J) { with K by J block of A, I by J block of B cached: for i in ii to ii+i: for j in jj to jj+j: for k in kk to kk+k: B[i * N + j] += A[i * N + k] * A[k * N + j]; B ij used K times for one miss N 2 /K misses A ik used J times for one miss N 2 /J misses A kj used I times for one miss N 2 /I misses catch: IK + KJ + IJ elements must fit in cache 45 generalizing cache blocking generalizing cache blocking for (int kk = 0; kk < N; kk += K) { for (int ii = 0; ii < N; ii += I) { with I by K block of A hopefully cached: for (int jj = 0; jj < N; jj += J) { with K by J block of A, I by J block of B cached: for i in ii to ii+i: for j in jj to jj+j: for k in kk to kk+k: B[i * N + j] += A[i * N + k] * A[k * N + j]; for (int kk = 0; kk < N; kk += K) { for (int ii = 0; ii < N; ii += I) { with I by K block of A hopefully cached: for (int jj = 0; jj < N; jj += J) { with K by J block of A, I by J block of B cached: for i in ii to ii+i: for j in jj to jj+j: for k in kk to kk+k: B[i * N + j] += A[i * N + k] * A[k * N + j]; B ij used K times for one miss N 2 /K misses B ij used K times for one miss N 2 /K misses A ik used J times for one miss N 2 /J misses A ik used J times for one miss N 2 /J misses A kj used I times for one miss N 2 /I misses A kj used I times for one miss N 2 /I misses catch: IK + KJ + IJ elements must fit in cache 45 catch: IK + KJ + IJ elements must fit in cache 45

generalizing cache blocking array usage: block for (int kk = 0; kk < N; kk += K) { for (int ii = 0; ii < N; ii += I) { with I by K block of A hopefully cached: for (int jj = 0; jj < N; jj += J) { with K by J block of A, I by J block of B cached: for i in ii to ii+i: for j in jj to jj+j: for k in kk to kk+k: B[i * N + j] += A[i * N + k] * A[k * N + j]; A ik block (I K) A kj block (K J) B ij block (I J) B ij used K times for one miss N 2 /K misses A ik used J times for one miss N 2 /J misses inner loop keeps blocks from A, B in cache A kj used I times for one miss N 2 /I misses catch: IK + KJ + IJ elements must fit in cache 45 46 array usage: block array usage: block A ik block (I K) A kj block (K J) B ij block (I J) A ik block (I K) A kj block (K J) B ij block (I J) B ij calculation uses strips from A K calculations for one load (cache miss) A ik calculation uses strips from A, B J calculations for one load (cache miss) 46 46

array usage: block
[diagram: an A_ik block (I x K), an A_kj block (K x J), and a B_ij block (I x J); the inner loops keep blocks from A and B in cache]
the B_ij calculation uses strips from the A blocks: K calculations for one load (cache miss)
the A_ik calculation uses strips from A and B: J calculations for one load (cache miss)
(approx.) KIJ fully cached calculations for KI + IJ + KJ loads (assuming everything stays in cache)

cache blocking efficiency
load the I*K elements of A_ik: do J multiplies with each
load the K*J elements of A_kj: do I multiplies with each
load the I*J elements of B_ij: do K adds with each
bigger blocks mean more work per load!
catch: IK + KJ + IJ elements must fit in the cache

cache blocking rule of thumb
fill most of the cache with useful data and do as much work as possible with it
example: my desktop has a 32KB L1 cache; I = J = K = 48 uses 48^2 * 3 elements, or 27KB
assumption: conflict misses aren't important

view 2: divide and conquer
    partial_square(float *A, float *B,
                   int starti, int endi, ...) {
      for (int i = starti; i < endi; ++i) {
        for (int j = startj; j < endj; ++j) {
          ...
    }
    square(float *A, float *B, int N) {
      for (int ii = 0; ii < N; ii += BLOCK)
        ...
        /* segment of A, B in use fits in cache! */
        partial_square(A, B, ii, ii + BLOCK,
                       jj, jj + BLOCK, ...);
    }
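A runnable sketch of the generalized blocking pseudocode above, under the assumption that N is a multiple of the block sizes so there is no fringe (the function name and the BI/BJ/BK constants are invented; they play the roles of I, J, K):

    /* blocked matrix squaring: B += A*A; assumes N % BI == N % BJ == N % BK == 0 */
    enum { BI = 48, BJ = 48, BK = 48 };   /* 48*48*3 floats = 27KB, fits a 32KB L1 */

    void square_blocked(const float *A, float *B, int N) {
        for (int kk = 0; kk < N; kk += BK)
            for (int ii = 0; ii < N; ii += BI)
                for (int jj = 0; jj < N; jj += BJ)
                    /* all the work below stays inside one BIxBK block and one
                       BKxBJ block of A plus one BIxBJ block of B */
                    for (int i = ii; i < ii + BI; ++i)
                        for (int j = jj; j < jj + BJ; ++j)
                            for (int k = kk; k < kk + BK; ++k)
                                B[i * N + j] += A[i * N + k] * A[k * N + j];
    }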

cache blocking and miss rate
[graph: read misses per multiply or add (0 to 0.09) vs. N (0 to 500), blocked vs. unblocked]

what about performance?
[graphs: cycles per multiply/add vs. N (0 to 500) for a less-optimized loop (about 0 to 2.0) and an optimized loop (about 0 to 0.5), unblocked vs. blocked]

performance for big sizes
[graph: cycles per multiply or add vs. N (0 to 10000), unblocked vs. blocked, with the region marked where the matrix fits in the L3 cache]

optimized loop???
the performance difference wasn't visible at small sizes until I optimized the arithmetic in the loop (mostly by supplying better options to GCC):
1: reducing the number of loads
2: doing adds/multiplies/etc. with fewer instructions
3: simplifying address computations

optimized loop??? (continued)
but how can that make cache blocking better???

overlapping loads and arithmetic
[timeline: loads proceed in parallel with a chain of multiplies and adds]
the speed of a load might not matter if the multiplies and adds are slower

optimization and bottlenecks
arithmetic/loop efficiency was the bottleneck; after fixing this, cache performance was the bottleneck
common theme when optimizing: X may not matter until Y is optimized

cache blocking: summary
reorder the calculation to reduce cache misses:
  make an explicit choice about what is in the cache
  perform calculations in cache-sized blocks
  get more spatial and temporal locality
temporal locality: reuse values in many calculations before they are replaced in the cache
spatial locality: use adjacent values in calculations before their cache block is replaced

cache blocking ugliness: fringe
one option: clamp each block to the edge of the matrix:
    for (int kk = 0; kk < N; kk += K) {
      for (int ii = 0; ii < N; ii += I) {
        for (int jj = 0; jj < N; jj += J) {
          for (int k = kk; k < min(kk + K, N); ++k) {
            // ...

another option: loop over full blocks only, then handle the remainders:
    for (kk = 0; kk + K <= N; kk += K) {
      for (ii = 0; ii + I <= N; ii += I) {
        for (jj = 0; jj + J <= N; jj += J) {
          // ...
        }
        for (; jj < N; ++jj) { // handle remainder
        }
      }
      for (; ii < N; ++ii) { // handle remainder
      }
    }
    for (; kk < N; ++kk) { // handle remainder
    }

avoiding conflict misses
problem: the blocks of the array are scattered throughout memory
observation: a 32KB cache can store a 32KB contiguous array; a contiguous array is split evenly among the sets
solution: copy each block into a contiguous array

avoiding conflict misses (code)
    process_block(ii, jj, kk) {
      float B_copy[I * J];
      /* pseudocode for-loops to save space */
      for i = ii to ii + I, j = jj to jj + J:
        B_copy[i * J + j] = B[i * N + j];
      for i = ii to ii + I, j = jj to jj + J, k:
        B_copy[i * J + j] += A[k * N + j] * A[i * N + k];
      for all i, j:
        B[i * N + j] = B_copy[i * J + j];
    }

register reuse
    B[i*N+j] += A[i*N+k] * A[k*N+j];
    // optimize into:
    float Aik = A[i*N+k];   // hopefully kept in a register!
                            // faster than even a cache hit
    B[i*N+j] += Aik * A[k*N+j];
can the compiler do this for us?

can the compiler do register reuse?
not easily: what if A == B? what if A == &B[10]?
    // want to preload A[i*N+k] here!
    // but if A == B, it is modified here:
    B[i*N+j] += A[i*N+k] * A[k*N+j];

automatic register reuse
the compiler would need to generate an overlap check:
    if ((B > A + N * N || B < A) &&
        (B + N * N > A + N * N || B + N * N < A)) {
      float Aik = A[i*N+k];
      ...
      B[i*N+j] += Aik * A[k*N+j];
      ...
    } else {
      /* other version */
    }

register blocking
    for (int i = 0; i < N; i += 2) {
      float Ai0k = A[(i+0)*N + k];
      float Ai1k = A[(i+1)*N + k];
      for (int j = 0; j < N; j += 2) {
        float Akj0 = A[k*N + j+0];
        float Akj1 = A[k*N + j+1];
        B[(i+0)*N + j+0] += Ai0k * Akj0;
        B[(i+1)*N + j+0] += Ai1k * Akj0;
        B[(i+0)*N + j+1] += Ai0k * Akj1;
        B[(i+1)*N + j+1] += Ai1k * Akj1;
      }
    }

L2 misses
[graph: L2 misses per 1K instructions (0 to 10) vs. N (0 to 500), ijk vs. kij loop orders]