Cache Performance II

cache operation (associative)

[diagram: 2-way set-associative lookup of the address 111001; the index bits select a set, the address tag is compared with both ways' stored tags (ANDed with their valid bits), the comparisons are ORed to produce "is hit?" (1 here), and the matching way supplies the data (B5)]

Tag-Index-Offset formulas

m: memory address bits (Y86-64: 64)
E: number of blocks per set ("ways")
S = 2^s: number of sets (s: set index bits)
B = 2^b: block size (b: block offset bits)
t = m - (s + b): tag bits
C = B * S * E: cache size (excluding metadata)

Tag-Index-Offset exercise

(same quantities as on the previous slide)

My desktop:
L1 Data Cache: 32KB, 8 blocks/set, 64-byte blocks
L2 Cache: 256KB, 4 blocks/set, 64-byte blocks
L3 Cache: 8MB, 16 blocks/set, 64-byte blocks

Divide the address 0x34567 into tag, index, and offset for each cache.

T-I-O exercise: L1

quantity                 value for L1
block size (given)       B = 64 bytes = 2^b  (b: block offset bits)
block offset bits        b = 6
blocks/set (given)       E = 8
cache size (given)       C = 32KB = E * B * S, so S = C / (B * E)
number of sets           S = 32KB / (64 bytes * 8) = 64 = 2^s  (s: set index bits)
set index bits           s = log2(64) = 6

T-I-O results

                     L1     L2     L3
sets                 64     1024   8192
block offset bits    6      6      6
set index bits       6      10     13
tag bits             (the rest)

T-I-O: splitting

                     L1     L2     L3
block offset bits    6      6      6
set index bits       6      10     13
tag bits             (the rest)

0x34567:   3    4    5    6    7
           0011 0100 0101 0110 0111

bits 0-5 (offset, all three caches): 10 0111 = 0x27
L1: bits 6-11 (set index): 01 0101 = 0x15; bits 12 and up (tag): 0x34
L2: bits 6-15 (set index): 01 0001 0101 = 0x115; bits 16 and up (tag): 0x3
L3: bits 6-18 (set index): 0 1101 0001 0101 = 0xD15; bits 19 and up (tag): 0x0
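The same split can be written with shifts and masks. A sketch (not from the slides) that reproduces the values above:

#include <stdio.h>

/* split addr into tag/index/offset, given s set index bits and b offset bits */
static void split(unsigned long addr, int s, int b, const char *name) {
    unsigned long offset = addr & ((1UL << b) - 1);
    unsigned long index  = (addr >> b) & ((1UL << s) - 1);
    unsigned long tag    = addr >> (s + b);
    printf("%s: tag=0x%lx index=0x%lx offset=0x%lx\n", name, tag, index, offset);
}

int main(void) {
    split(0x34567, 6,  6, "L1");   /* tag=0x34 index=0x15  offset=0x27 */
    split(0x34567, 10, 6, "L2");   /* tag=0x3  index=0x115 offset=0x27 */
    split(0x34567, 13, 6, "L3");   /* tag=0x0  index=0xd15 offset=0x27 */
    return 0;
}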

matrix squaring

B_ij = sum over k = 1..n of A_ik * A_kj

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

performance

[plots: billions of instructions executed and billions of cycles versus N (0 to 600), comparing the k-inner and k-outer versions]

alternate view 1: cycles/instruction

[plot: cycles per instruction versus N (0 to 600)]

alternate view 2: cycles/operation

[plot: cycles per multiply or add versus N (0 to 600)]

loop orders and locality

loop body: B_ij += A_ik * A_kj

kij order: B_ij and A_kj have spatial locality; A_ik has temporal locality
ijk order: A_ik has spatial locality; B_ij has temporal locality
(kij is better)

matrix squaring (slide repeated from above: version 1, k inner / ijk order, versus version 2, k outer / kij order)

L1 misses

[plot: read misses per 1k instructions versus N (0 to 600), k-inner vs. k-outer]

L1 miss detail

[plot: read misses per 1k instructions versus N (0 to 200); few misses while the matrix is smaller than the L1 cache, with spikes at N = 2^7 = 128, at N = 93 (93 * 11 is about 2^10), and at N = 114 (114 * 9 is about 2^10)]

addresses

A[k*114+j]      is at ...10 0000 0000 0100
A[k*114+j+1]    is at ...10 0000 0000 1000
A[(k+1)*114+j]  is at ...10 0011 1001 0100
A[(k+2)*114+j]  is at ...10 0101 0101 1100
A[(k+9)*114+j]  is at ...11 0000 0000 1100

recall: 6 set index bits, 6 block offset bits (L1)

conflict misses

powers of two: lower-order bits unchanged
A[k*93+j] and A[(k+11)*93+j] are 1023 elements apart (4092 bytes; about 63.9 cache blocks)
with 64 sets in the L1 cache, they usually map to the same set
so A[k*93+(j+1)] will not be cached (for the next i loop), even if it is in the same block as A[k*93+j]
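To make the collision concrete, this sketch computes the L1 set index (64 sets of 64-byte blocks) for two addresses 4092 bytes apart; the base address here is made up for illustration:

#include <stdio.h>

/* L1 set index: drop 6 offset bits, keep 6 index bits */
static unsigned set_index(unsigned long addr) {
    return (addr >> 6) & 63;
}

int main(void) {
    unsigned long base  = 0x2004;            /* pretend this is &A[k*93+j] */
    unsigned long other = base + 1023 * 4;   /* &A[(k+11)*93+j], 4092 bytes later */
    /* same set whenever base's offset within its block is >= 4: "usually" */
    printf("set %u vs. set %u\n", set_index(base), set_index(other));
    return 0;
}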

L2 misses

[plot: L2 misses per 1k instructions versus N (0 to 600), ijk vs. kij order]

systematic approach (1)

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

goal: get the most out of each cache miss
if N is larger than the cache:
    miss for B_ij: 1 computation
    miss for A_ik: N computations
    miss for A_kj: 1 computation
effectively caching just 1 element

flat 2D arrays and cache blocks

[diagram: the N-by-N matrix stored as a flat array, A[0] through A[N] holding A_x0 through A_xN; a cache block covers several consecutive elements of a row]

array usage: kij order

for all k: for all i: for all j: B_ij += A_ik * A_kj

[diagram: row A_k0..A_kN, element A_ik, and row B_i0..B_iN highlighted within the flat array]

A_ik: reused in the innermost loop (over j), so definitely cached (plus the rest of A_ik's cache block); N calculations per load
A_kj: reused in the next middle-loop iteration (over i); cached only if an entire row fits; 1 calculation per load
B_ij: reused in the next outer-loop iteration (over k); probably not still in cache by then (but at least some spatial locality); 1 calculation per load

inefficiencies

if N is large enough that a row doesn't fit in cache:
    keeping one block in cache, accessing one element of that block
if N is small enough that a row does fit in cache:
    keeping one block in cache for A_ik, using one element
    keeping a row in cache for A_kj, using it N times

systematic approach (2)

for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; ++i) {
        /* A_ik loaded once in this loop (N^2 times total): */
        for (int j = 0; j < N; ++j)
            /* B_ij, A_kj loaded each iteration (if N is big): */
            B[i*N+j] += A[i*N+k] * A[k*N+j];
    }
}

N^3 multiplies, N^3 adds: about 1 load per operation

a transformation

for (int kk = 0; kk < N; kk += 2)
    for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

split the loop over k
should compute exactly the same thing (assuming even N)

simple blocking

for (int kk = 0; kk < N; kk += 2)
    /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = kk; k < kk + 2; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

now reorder the split loop

simple blocking expanded

for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            B[i*N+j] += A[i*N+kk]   * A[kk*N+j];
            B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
        }
    }
}

temporal locality in the B_ij's (two updates per load)
more spatial locality in A_ik (A[i*N+kk] and A[i*N+kk+1] share a cache block)
still good spatial locality in A_kj and B_ij (consecutive j)

improvement in read misses

[plot: read misses per 1k instructions versus N (0 to 600), blocked (kk += 2) vs. unblocked]

simple blocking (2)

same thing for i in addition to k?

for (int kk = 0; kk < N; kk += 2) {
    for (int ii = 0; ii < N; ii += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            for (int k = kk; k < kk + 2; ++k)
                for (int i = ii; i < ii + 2; ++i)
                    B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }
}

simple blocking expanded

for (int k = 0; k < N; k += 2) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            B[(i+0)*N+j] += A[(i+0)*N+(k+0)] * A[(k+0)*N+j];
            B[(i+0)*N+j] += A[(i+0)*N+(k+1)] * A[(k+1)*N+j];
            B[(i+1)*N+j] += A[(i+1)*N+(k+0)] * A[(k+0)*N+j];
            B[(i+1)*N+j] += A[(i+1)*N+(k+1)] * A[(k+1)*N+j];
        }
    }
}

now A_kj is reused within the inner loop: more calculations per load!

array usage (better)

[diagram: rows A_k0 through A_{k+1,N}, elements A_ik through A_{i+1,k+1}, and rows B_i0 through B_{i+1,N} highlighted]

more temporal locality:
    N calculations for each A_ik
    2 calculations for each B_ij (for k, k+1)
    2 calculations for each A_kj (for k, k+1)
more spatial locality:
    each A_i,k and A_i,k+1 used together, and both are in the same cache block
    same number of cache loads

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {
    for (int ii = 0; ii < N; ii += I) {
        /* with the I-by-K block of A cached: */
        for (int jj = 0; jj < N; jj += J) {
            /* with the K-by-J block of A and the I-by-J block of B cached: */
            for i, j, k in the I-by-J-by-K block:
                B[i*N + j] += A[i*N + k] * A[k*N + j];
        }
    }
}

B_ij used K times for one miss
A_ik used > J times for one miss
A_kj used I times for one miss
catch: IK + KJ + IJ elements must fit in the cache
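For reference, here is one way to turn that pseudocode into compilable C. This is a sketch: the MIN macro is my addition, and the min bounds also handle N not being a multiple of I, J, or K:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* blocked matrix squaring: process one I-by-J-by-K block at a time */
void square_blocked(const float *A, float *B, int N, int I, int J, int K) {
    for (int kk = 0; kk < N; kk += K)
        for (int ii = 0; ii < N; ii += I)
            for (int jj = 0; jj < N; jj += J)
                for (int k = kk; k < MIN(kk + K, N); ++k)
                    for (int i = ii; i < MIN(ii + I, N); ++i)
                        for (int j = jj; j < MIN(jj + J, N); ++j)
                            B[i*N + j] += A[i*N + k] * A[k*N + j];
}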

keeping values in cache

can't explicitly ensure values are kept in cache
but reusing values effectively does this: the cache will try to keep recently used values
cache optimization idea: choose what's in the cache

array usage: block

[diagram: an I-by-K block of A_ik, a K-by-J block of A_kj, and an I-by-J block of B_ij within the flat arrays]

inner loop keeps blocks from A and B in cache
B_ij calculation uses strips from A: K calculations for one load (cache miss)
A_ik calculation uses strips from A and B: J calculations for one load (cache miss)
(approx.) K*I*J fully cached calculations for K*I + I*J + K*J loads (assuming everything stays in cache)

cache blocking efficiency

load I*K elements of A_ik: do > J multiplies with each
load K*J elements of A_kj: do I multiplies with each
load I*J elements of B_ij: do K adds with each
bigger blocks: more work per load!
catch: IK + KJ + IJ elements must fit in the cache

cache blocking rule of thumb

fill most of the cache with useful data and do as much work as possible from that
example: my desktop's 32KB L1 cache
I = J = K = 48 uses 48^2 * 3 elements, or 27KB
assumption: conflict misses aren't important
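A quick arithmetic check of that example (a sketch assuming 4-byte floats):

#include <stdio.h>

int main(void) {
    int I = 48, J = 48, K = 48;
    /* IK + KJ + IJ elements must fit in the cache */
    unsigned long elems = (unsigned long)I*K + (unsigned long)K*J + (unsigned long)I*J;
    unsigned long bytes = elems * sizeof(float);
    printf("%lu elements, %lu bytes (%.1f KB)\n", elems, bytes, bytes / 1024.0);
    /* prints: 6912 elements, 27648 bytes (27.0 KB), within the 32KB L1 */
    return 0;
}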

view 2: divide and conquer

partial_square(float *A, float *B,
               int starti, int endi, ...) {
    for (int i = starti; i < endi; ++i) {
        for (int j = startj; j < endj; ++j) {
            ...

square(float *A, float *B, int N) {
    for (int ii = 0; ii < N; ii += BLOCK)
        ...
        /* segment of A, B in use fits in cache! */
        partial_square(A, B,
                       ii, ii + BLOCK,
                       jj, jj + BLOCK, ...);

cache blocking ugliness: fringe

[diagram: a blocked matrix with full blocks in the interior and partial blocks along the edges]

cache blocking ugliness: fringe

for (int kk = 0; kk < N; kk += K) {
    for (int ii = 0; ii < N; ii += I) {
        for (int jj = 0; jj < N; jj += J) {
            for (int k = kk; k < min(kk + K, N); ++k) {
                // ...

cache blocking ugliness: fringe

for (kk = 0; kk + K <= N; kk += K) {
    for (ii = 0; ii + I <= N; ii += I) {
        for (jj = 0; jj + J <= N; jj += J) {
            // ...
        }
        for (; jj < N; ++jj) { // handle remainder
        }
    }
    for (; ii < N; ++ii) { // handle remainder
    }
}
for (; kk < N; ++kk) { // handle remainder
}

cache blocking and miss rate

[plot: read misses per multiply or add versus N (0 to 600), blocked vs. unblocked]

what about performance?

[plots: cycles per multiply/add versus N, unblocked vs. blocked; top: a less optimized inner loop (N up to 600), bottom: an optimized inner loop (N up to 1000)]

performance for big sizes

[plot: cycles per multiply or add versus N (0 to 10000), unblocked vs. blocked; a marked region at small N where the matrix still fits in the L3 cache]

optimized loop???

the performance difference wasn't visible at small sizes
until I optimized the arithmetic in the loop (by supplying better options to GCC):
    1. loading B_i,j through B_i,j+7 with one instruction
    2. doing adds and multiplies with fewer instructions
    3. simplifying address computations

but how can that make cache blocking better???

overlapping loads and arithmetic

[timeline diagram: loads overlapping with a chain of multiplies and adds]

the speed of loads might not matter if the arithmetic is slower

register reuse

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

// optimize into:
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i) {
        float Aik = A[i*N+k];   // hopefully kept in a register:
                                // faster than even the cache
        for (int j = 0; j < N; ++j)
            B[i*N+j] += Aik * A[k*N+j];
    }

can the compiler do this for us?

can the compiler do register reuse?

not easily: what if A = B?

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i) {
        // want to preload A[i*N+k] here!
        for (int j = 0; j < N; ++j) {
            // but if A = B, it is modified here!
            B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }

Automatic register reuse

the compiler would need to generate an overlap check:

if (B >= A + N * N || A >= B + N * N) {
    /* no overlap: safe to keep A[i*N+k] in a register */
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i) {
            float Aik = A[i*N+k];
            for (int j = 0; j < N; ++j) {
                B[i*N+j] += Aik * A[k*N+j];
            }
        }
    }
} else {
    /* other version */
}

register blocking

for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; i += 2) {
        float Ai0k = A[(i+0)*N + k];
        float Ai1k = A[(i+1)*N + k];
        for (int j = 0; j < N; j += 2) {
            float Akj0 = A[k*N + j+0];
            float Akj1 = A[k*N + j+1];
            B[(i+0)*N + j+0] += Ai0k * Akj0;
            B[(i+1)*N + j+0] += Ai1k * Akj0;
            B[(i+0)*N + j+1] += Ai0k * Akj1;
            B[(i+1)*N + j+1] += Ai1k * Akj1;
        }
    }
}

cache blocking: summary

reorder the calculation to reduce cache misses:
    make an explicit choice about what is in the cache
    perform calculations in cache-sized blocks
    get more spatial and temporal locality

temporal locality: reuse values in many calculations before they are replaced in the cache
spatial locality: use adjacent values in calculations before their cache block is replaced

avoiding conflict misses

problem: the array is scattered throughout memory
observation: a 32KB cache can store a 32KB contiguous array,
    since a contiguous array is split evenly among the sets
solution: copy each block into a contiguous array

avoiding conflict misses (code)

process_block(ii, jj, kk) {
    float B_copy[I * J];
    /* pseudocode for loops to save space */
    for i = ii to ii + I, j = jj to jj + J:
        B_copy[i * J + j] = B[i * N + j];
    for i = ii to ii + I, j = jj to jj + J, k = kk to kk + K:
        B_copy[i * J + j] += A[k * N + j] * A[i * N + k];
    for all i, j:
        B[i * N + j] = B_copy[i * J + j];
}
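For concreteness, a compilable sketch of that pseudocode. Here N, I, J, K are assumed to be compile-time constants that evenly divide the matrix size, and A and B are assumed to be globals defined elsewhere:

#define N 512
#define I 48
#define J 48
#define K 48

extern float A[N * N];   /* assumed defined elsewhere */
extern float B[N * N];

void process_block(int ii, int jj, int kk) {
    float B_copy[I * J];  /* contiguous, so it spreads evenly across the sets */

    /* copy the I-by-J block of B into contiguous storage */
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            B_copy[i * J + j] = B[(ii + i) * N + (jj + j)];

    /* do the K updates against the copy */
    for (int k = kk; k < kk + K; ++k)
        for (int i = 0; i < I; ++i)
            for (int j = 0; j < J; ++j)
                B_copy[i * J + j] += A[(ii + i) * N + k] * A[k * N + (jj + j)];

    /* copy the block back */
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            B[(ii + i) * N + (jj + j)] = B_copy[i * J + j];
}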

prefetching

processors detect sequential access patterns
e.g. accessing memory addresses 0, 8, 16, 24, ...
the processor will prefetch 32, 48, etc.
another way to take advantage of spatial locality
part of why the miss rate is so low

matrix sum

int sum1(int matrix[4][8]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 8; ++j) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

access pattern: matrix[0][0], [0][1], [0][2], ..., [1][0], ...

matrix sum: spatial locality

with 4-byte ints and an 8-byte cache block, matrix in memory:

[0][0]  iter. 0   miss
[0][1]  iter. 1   hit (same block as before)
[0][2]  iter. 2   miss
[0][3]  iter. 3   hit (same block as before)
[0][4]  iter. 4   miss
[0][5]  iter. 5   hit
[0][6]  iter. 6   miss
[0][7]  iter. 7   hit
[1][0]  iter. 8   miss
[1][1]  iter. 9   hit

block size and spatial locality

larger blocks exploit spatial locality
but larger blocks mean fewer blocks for the same total size:
    less good at exploiting temporal locality

alternate matrix sum

int sum2(int matrix[4][8]) {
    int sum = 0;
    // swapped loop order
    for (int j = 0; j < 8; ++j) {
        for (int i = 0; i < 4; ++i) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

access pattern: matrix[0][0], [1][0], [2][0], ..., [0][1], ...

matrix sum: bad spatial locality

with an 8-byte cache block, matrix in memory:

[0][0]  iter. 0
[0][1]  iter. 4
[0][2]  iter. 8
[0][3]  iter. 12
[0][4]  iter. 16
[0][5]  iter. 20
[0][6]  iter. 24
[0][7]  iter. 28
[1][0]  iter. 1
[1][1]  iter. 5

miss unless the value was not evicted for 4 iterations

cache organization and miss rate

depends on the program; one example: SPEC CPU2000 benchmarks, 64B block size, LRU replacement policy

data cache miss rates:

Cache size   direct-mapped   2-way     8-way     fully assoc.
1KB          8.63%           6.97%     5.63%     5.34%
2KB          5.71%           4.23%     3.30%     3.05%
4KB          3.70%           2.60%     2.03%     1.90%
16KB         1.59%           0.86%     0.56%     0.50%
64KB         0.66%           0.37%     0.10%     0.001%
128KB        0.27%           0.001%    0.0006%   0.0006%

Data: Cantin and Hill, "Cache Performance for SPEC CPU2000 Benchmarks",
http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

is LRU always better?

least recently used exploits temporal locality

making LRU look bad

* = least recently used

            direct-mapped (2 sets)     fully-associative (1 set, LRU)
read 0      miss: mem[0]; -            miss: mem[0], -*
read 1      miss: mem[0]; mem[1]       miss: mem[0]*, mem[1]
read 3      miss: mem[0]; mem[3]       miss: mem[3], mem[1]*
read 0      hit:  mem[0]; mem[3]       miss: mem[3]*, mem[0]
read 2      miss: mem[2]; mem[3]       miss: mem[2], mem[0]*
read 3      hit:  mem[2]; mem[3]       miss: mem[2]*, mem[3]
read 1      miss: mem[2]; mem[1]       miss: mem[1], mem[3]*
read 2      hit:  mem[2]; mem[1]       miss: mem[1]*, mem[2]

direct-mapped: 3 hits; fully-associative LRU: 0 hits
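To check the bookkeeping, this small simulation (mine, not the lecture's) replays the access pattern against both organizations:

#include <stdio.h>

int main(void) {
    int pattern[] = {0, 1, 3, 0, 2, 3, 1, 2};
    int n = sizeof pattern / sizeof pattern[0];

    int dm[2] = {-1, -1}, dm_hits = 0;  /* direct-mapped: set = address mod 2 */
    int fa[2] = {-1, -1}, fa_hits = 0;  /* fully associative: fa[0]=LRU, fa[1]=MRU */

    for (int i = 0; i < n; ++i) {
        int a = pattern[i];
        if (dm[a % 2] == a) ++dm_hits; else dm[a % 2] = a;

        if (fa[1] == a) {            /* hit on the MRU line */
            ++fa_hits;
        } else if (fa[0] == a) {     /* hit on the LRU line: promote it */
            ++fa_hits; fa[0] = fa[1]; fa[1] = a;
        } else {                     /* miss: evict the LRU line */
            fa[0] = fa[1]; fa[1] = a;
        }
    }
    printf("direct-mapped: %d hits; LRU: %d hits\n", dm_hits, fa_hits);  /* 3 vs. 0 */
    return 0;
}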

cache operation (associative)

[diagram repeated from the beginning of the lecture: 2-way set-associative lookup of the address 111001]