Example: How are these parameters decided?

Comparing cache organizations Like many architectural features, caches are evaluated experimentally. As always, performance depends on the actual instruction mix, since different programs will have different memory access patterns. Simulating or executing real applications is the most accurate way to measure performance characteristics. The graphs on the next few slides illustrate the simulated miss rates for several different cache designs. Again, lower miss rates are generally better, but remember that the miss rate is just one component of average memory access time and execution time.

Associativity tradeoffs and miss rates As we saw last time, higher associativity means more complex hardware. But a highly associative cache will also exhibit a lower miss rate: each set has more blocks, so there is less chance of a conflict between two addresses that both belong in the same set. Overall, this reduces AMAT and memory stall cycles. The textbook shows the miss rate decreasing as the associativity increases. [Figure: miss rate (0%-12%) versus associativity (one-way, two-way, four-way, eight-way).]

Cache size and miss rates The cache size also has a significant impact on performance. The larger a cache is, the less chance there will be of a conflict. Again, this means the miss rate decreases, so the AMAT and number of memory stall cycles also decrease. This figure depicts the miss rate as a function of both the cache size and its associativity. [Figure: miss rate (0%-15%) versus associativity (one-way to eight-way) for 1 KB, 2 KB, 4 KB, and 8 KB caches.]

Block size and miss rates Finally, this figure shows miss rates relative to the block size and overall cache size. Smaller blocks do not take maximum advantage of spatial locality. But if blocks are too large, there will be fewer blocks available, and more potential misses due to conflicts. [Figure: miss rate (0%-40%) versus block size (4 to 256 bytes) for 1 KB, 8 KB, 16 KB, and 64 KB caches.]

Memory and overall performance How do cache hits and misses affect overall system performance? Assuming a hit time of one CPU clock cycle, program execution will continue normally on a cache hit. (Our earlier computations always assumed one clock cycle for an instruction fetch or data access.) For cache misses, we'll assume the CPU must stall to wait for a load from main memory. The total number of stall cycles depends on the number of cache misses and the miss penalty. Memory stall cycles = # Memory accesses x Miss rate x Miss penalty. To include stalls due to cache misses in CPU performance equations, we have to add them to the base number of execution cycles. CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time.

Performance example Assume that 33% of the I instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles. Memory stall cycles = # Memory accesses x Miss rate x Miss penalty = ?

Performance example Assume that 33% of the I instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles. Memory stall cycles = # Memory accesses x Miss rate x Miss penalty = 0.33 I x 0.03 x 20 cycles ≈ 0.2 I cycles. If I instructions are executed, then the number of wasted cycles will be 0.2 x I. This code is 1.2 times slower than a program with a perfect CPI of 1!
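To make the arithmetic concrete, here is a minimal C sketch of this example. The instruction count I is a made-up placeholder; the 33% data-access fraction, 3% miss rate, and 20-cycle miss penalty are the numbers assumed on the slide.

#include <stdio.h>

int main(void)
{
    double I = 1e6;                 /* hypothetical instruction count */
    double data_accesses = 0.33 * I;
    double miss_rate     = 0.03;    /* 1 - 0.97 hit ratio */
    double miss_penalty  = 20.0;    /* cycles */

    double stall_cycles = data_accesses * miss_rate * miss_penalty;
    double base_cycles  = I;        /* perfect CPI of 1 */
    double slowdown     = (base_cycles + stall_cycles) / base_cycles;

    printf("stall cycles = %.0f (%.2f per instruction)\n",
           stall_cycles, stall_cycles / I);
    printf("slowdown vs. a perfect CPI of 1 = %.2fx\n", slowdown);  /* ~1.20 */
    return 0;
}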

Memory systems are a bottleneck CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time. Processor performance traditionally outpaces memory performance, so the memory system is often the system bottleneck. For example, with a base CPI of 1, the CPU time from the last page is: CPU time = (I + 0.2 I) x Cycle time. What if we could double the CPU performance so the CPI becomes 0.5, but memory performance remained the same? CPU time = (0.5 I + 0.2 I) x Cycle time. The overall CPU time improves by just 1.2/0.7 ≈ 1.7 times! Speeding up only part of a system has diminishing returns.

Basic main memory design There are some ways the main memory can be organized to reduce miss penalties and help with caching. For some concrete examples, let's assume the following three steps are taken when a cache needs to load data from the main memory. 1. It takes 1 cycle to send an address to the RAM. 2. There is a 15-cycle latency for each RAM access. 3. It takes 1 cycle to return data from the RAM. In the setup shown here, the buses from the CPU to the cache and from the cache to RAM are all one word wide. If the cache has one-word blocks, then filling a block from RAM (i.e., the miss penalty) would take 1 + 15 + 1 = 17 clock cycles: the cache controller has to send the desired address to the RAM, wait, and receive the data. [Diagram: CPU, cache, and main memory connected by one-word-wide buses.]

Miss penalties for larger cache blocks If the cache has four-word blocks, then loading a single block would need four individual main memory accesses, for a miss penalty of 4 x (1 + 15 + 1) = 68 clock cycles!

A wider memory A simple way to decrease the miss penalty is to widen the memory and its interface to the cache, so we can read multiple words from RAM in one shot. If we could read four words from the memory at once, a four-word cache load would need just 1 + 15 + 1 = 17 cycles. The disadvantage is the cost of the wider buses: each additional bit of memory width requires another connection to the cache.

An interleaved memory Another approach is to interleave the memory, or split it into banks that can be accessed individually. The main benefit is overlapping the latencies of accessing each word. For example, if our main memory has four banks, each one byte wide, then we could load four bytes into a cache block in just 1 + 15 + (4 x 1) = 20 cycles. Our buses are still one byte wide here, so four cycles are needed to transfer the data to the cache. This is cheaper than implementing a four-byte bus, but not too much slower. [Diagram: CPU and cache connected to a main memory split into Bank 0 through Bank 3.]

Interleaved memory accesses Here is a diagram to show how the memory accesses can be interleaved. [Timing diagram: Load word 1 through Load word 4, each overlapping the others' 15-cycle bank latencies.] The magenta cycles represent sending an address to a memory bank. Each memory bank has a 15-cycle latency, and it takes another cycle (shown in blue) to return data from the memory. This is the same basic idea as pipelining! As soon as we request data from one memory bank, we can go ahead and request data from another bank as well. Each individual load takes 17 clock cycles, but four overlapped loads require just 20 cycles.
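Here is a small C sketch that tallies the miss penalties worked out on the last few slides; the 1-cycle address, 15-cycle RAM latency, and 1-cycle-per-transfer figures are the assumptions given above.

#include <stdio.h>

#define ADDR_CYCLES   1   /* send address to RAM */
#define RAM_LATENCY  15   /* RAM access latency  */
#define XFER_CYCLES   1   /* return one bus-width of data */

/* Narrow bus: one full access per word of the block. */
int narrow_penalty(int words)
{
    return words * (ADDR_CYCLES + RAM_LATENCY + XFER_CYCLES);
}

/* Bus as wide as the block: one access fetches the whole block. */
int wide_penalty(void)
{
    return ADDR_CYCLES + RAM_LATENCY + XFER_CYCLES;
}

/* Interleaved banks: latencies overlap, but transfers are serialized. */
int interleaved_penalty(int words)
{
    return ADDR_CYCLES + RAM_LATENCY + words * XFER_CYCLES;
}

int main(void)
{
    printf("4-word block, narrow bus:  %d cycles\n", narrow_penalty(4));      /* 68 */
    printf("4-word block, wide bus:    %d cycles\n", wide_penalty());         /* 17 */
    printf("4-word block, interleaved: %d cycles\n", interleaved_penalty(4)); /* 20 */
    return 0;
}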

Which is better? Increasing block size can improve hit rate (due to spatial locality), but transfer time increases. Which cache configuration would be better? Cache #1: 32-byte blocks, 5% miss rate. Cache #2: 64-byte blocks, 4% miss rate. Assume both caches have single-cycle hit times. Memory accesses take 1 + 15 + 1 cycles, and the interleaved memory bus is 8 bytes wide; i.e., a 16-byte memory access takes 18 cycles: 1 (send address) + 15 (memory access) + 2 (two 8-byte transfers). HINT: AMAT = Hit time + (Miss rate x Miss penalty). What is the miss penalty for a given block size?

Which is better? Increasing block size can improve hit rate (due to spatial locality), but transfer time increases. Which cache configuration would be better? Cache #1: 32-byte blocks, 5% miss rate. Cache #2: 64-byte blocks, 4% miss rate. Assume both caches have single-cycle hit times. Memory accesses take 1 + 15 + 1 cycles, and the interleaved memory bus is 8 bytes wide; i.e., a 16-byte memory access takes 18 cycles: 1 (send address) + 15 (memory access) + 2 (two 8-byte transfers). Using AMAT = Hit time + (Miss rate x Miss penalty): Cache #1: Miss penalty = 1 + 15 + 32B/8B = 20 cycles, so AMAT = 1 + (0.05 x 20) = 2.0. Cache #2: Miss penalty = 1 + 15 + 64B/8B = 24 cycles, so AMAT = 1 + (0.04 x 24) = 1.96. Cache #2 is slightly better despite its longer miss penalty.
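A minimal C sketch of this AMAT comparison, under the same assumptions as above (1-cycle hit, 1 + 15 cycles to start a memory access, 8-byte-wide interleaved bus):

#include <stdio.h>

/* Miss penalty for a block of the given size, assuming 1 cycle to send the
 * address, 15 cycles of RAM latency, and one cycle per 8-byte transfer. */
double miss_penalty(int block_bytes)
{
    return 1 + 15 + block_bytes / 8.0;
}

double amat(double hit_time, double miss_rate, int block_bytes)
{
    return hit_time + miss_rate * miss_penalty(block_bytes);
}

int main(void)
{
    printf("Cache #1 (32B blocks, 5%% misses): AMAT = %.2f cycles\n", amat(1.0, 0.05, 32));  /* 2.00 */
    printf("Cache #2 (64B blocks, 4%% misses): AMAT = %.2f cycles\n", amat(1.0, 0.04, 64));  /* 1.96 */
    return 0;
}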

The Memory Mountain Metric: read throughput (read bandwidth) Number of bytes read from memory per second (MB/s) Memory system test Run program with many reads in tight loop Examine throughput as function of read sequence Total size of memory covered (working set size) Distance between consecutive references (stride) Memory mountain A compact visualization of memory system performance. Surface represents read throughput as a function of spatial and temporal locality.

Memory Mountain Test Function

/* The test function generates the read sequence:
 * it references the first elems elements of data[] with a stride offset. */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;   /* So compiler doesn't optimize away the loop */
}

/* A wrapper to call test() and return measured read throughput (MB/s). */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* time test(elems, stride) */
    return (size / stride) / (cycles / Mhz);  /* convert cycles to MB/s */
}

Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#define MINBYTES  (1 << 10)              /* Working set size ranges from 1 KB */
#define MAXBYTES  (1 << 23)              /* ... up to 8 MB */
#define MAXSTRIDE 16                     /* Strides range from 1 to 16 */
#define MAXELEMS  (MAXBYTES / sizeof(int))

int data[MAXELEMS];                      /* The array we'll be traversing */

int main()
{
    int size;       /* Working set size (in bytes) */
    int stride;     /* Stride (in array elements) */
    double Mhz;     /* Clock frequency */

    init_data(data, MAXELEMS);           /* Initialize each element in data to 1 */
    Mhz = mhz(0);                        /* Estimate the clock frequency */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

The Memory Mountain [Surface plot: read throughput (MB/s, 0-1200) as a function of stride (words, s1-s15) and working set size (2 KB to 8 MB), measured on a 550 MHz Pentium III Xeon with 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, and 512 KB off-chip unified L2 cache. Ridges of temporal locality mark the L1, L2, and main-memory regions; slopes of spatial locality run along the stride axis.]

The Memory Mountain [Surface plot: read throughput (MB/s, 0-7000) as a function of stride (x8 bytes, s1-s32) and working set size (2 KB to 64 MB), measured on a 2.67 GHz Core i7 with 32 KB L1 d-cache, 256 KB L2 cache, and 8 MB L3 cache. Ridges of temporal locality mark the L1, L2, L3, and main-memory regions; slopes of spatial locality run along the stride axis.]

Ridges of Temporal Locality Slice through the memory mountain with stride = 16. [Plot: read throughput (MB/s) versus working set size (2 KB to 64 MB), showing distinct plateaus for the L1, L2, L3, and main-memory regions.]

A Slope of Spatial Locality Slice through the memory mountain with size = 4 MB. [Plot: read throughput (MB/s) versus stride (x8 bytes, s1 to s64); throughput falls as the stride grows until there is one access per cache line.]

Discussion Given a memory mountain for one hardware platform, what can be determined? L1 cache size? L1 block size? L1 associativity? L2 cache size? L2 block size? L2 associativity? Number of levels of cache? Main memory bandwidth?

BONUS SLIDES Didn't have time to cover in class. Good reading material on your own time.

Matrix Multiplication Example Major cache effects to consider: total cache size (exploit temporal locality, reusing data while it is in the cache) and block size (exploit spatial locality). Description: multiply N x N matrices; array elements are doubles; O(N^3) total operations; N reads per source element; N values summed per destination, but partial sums may be held in a register.

/* ijk -- variable sum held in a register */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Miss Rate Analysis for Matrix Multiply Assumptions: line size is 32B (big enough for four 64-bit doubles); matrix dimension (N) is very large, so approximate 1/N as 0.0; cache is not big enough to hold multiple rows. Analysis method: look at the access pattern of the inner loop. [Diagram: A is traversed along row i, B along column j, C at the single element (i,j).] Typical code accesses A in rows and B in columns.

Layout of C Arrays in Memory (review) C arrays are allocated in row-major order: each row occupies contiguous memory locations. Stepping through the columns in one row: for (i = 0; i < N; i++) sum += a[0][i]; accesses successive elements, so if the block size B > 8 bytes, it exploits spatial locality (compulsory miss rate = 8 bytes / B). Stepping through the rows in one column: for (i = 0; i < N; i++) sum += a[i][0]; accesses distant elements, so there is no spatial locality (compulsory miss rate = 100%, assuming large rows).
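To make the row-major point concrete, here is a small, self-contained C sketch (the array name and size are illustrative): both functions compute the same sum, but the row-wise traversal touches adjacent addresses and enjoys spatial locality, while the column-wise traversal jumps N*8 bytes between accesses and gets essentially no reuse from each cache line.

#include <stdio.h>

#define N 1024
static double a[N][N];   /* zero-initialized; the access pattern is what matters */

/* Row-major traversal: consecutive iterations touch adjacent doubles,
 * so with 32-byte lines only about 1 in 4 accesses misses. */
double sum_by_rows(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column traversal: consecutive iterations are N*8 bytes apart,
 * so for large N essentially every access misses. */
double sum_by_cols(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}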

Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A accessed row-wise (i,*), B column-wise (*,j), C fixed (i,j). Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A accessed row-wise (i,*), B column-wise (*,j), C fixed (i,j). Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A fixed (i,k), B accessed row-wise (k,*), C row-wise (i,*). Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A fixed (i,k), B accessed row-wise (k,*), C row-wise (i,*). Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A accessed column-wise (*,k), B fixed (k,j), C column-wise (*,j). Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.

Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A accessed column-wise (*,k), B fixed (k,j), C column-wise (*,j). Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.

Summary of Matrix Multiplication
ijk (& jik): 2 loads, 0 stores, misses/iter = 1.25
kij (& ikj): 2 loads, 1 store, misses/iter = 0.5
jki (& kji): 2 loads, 1 store, misses/iter = 2.0

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
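For readers who want to see the difference on their own machine, here is a minimal, self-contained timing sketch (the matrix size and the use of clock() are assumptions, not part of the slides); it compares the best (kij) and worst (jki) orderings from the summary above.

#include <stdio.h>
#include <time.h>

#define N 512
static double a[N][N], b[N][N], c[N][N];   /* zero-initialized; access pattern is what matters */

static void mm_kij(void)          /* ~0.5 misses per inner-loop iteration */
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

static void mm_jki(void)          /* ~2.0 misses per inner-loop iteration */
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}

static double seconds(void (*f)(void))
{
    clock_t start = clock();
    f();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("kij: %.3f s\n", seconds(mm_kij));
    printf("jki: %.3f s\n", seconds(mm_jki));
    return 0;
}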

Core i7 Matrix Multiply Performance Miss rates are a better predictor of performance than total memory accesses (but this is platform dependent!). [Plot: cycles per inner loop iteration (0-60) versus array size n (50 to 750) for the six loop orderings jki, kji, ijk, jik, kij, and ikj.]

Blocking: Matrix Multiplication

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

[Diagram: element (i, j) of c is the dot product of row i of a and column j of b.]

Cache Miss Analysis (no blocking) Assume: matrix elements are doubles; cache block = 8 doubles; cache size C << n (much smaller than n). First iteration: n/8 + n = 9n/8 misses (n/8 for the stride-1 row of a, eight doubles per block, plus n for the stride-n column of b, one miss per element). Afterwards in cache: [schematic showing which 8-double-wide blocks remain cached].

Cache Miss Analysis (no blocking, continued) Same assumptions: matrix elements are doubles; cache block = 8 doubles; cache size C << n. Second iteration: again n/8 + n = 9n/8 misses. Total misses over the n^2 iterations: 9n/8 * n^2 = (9/8) * n^3.

Blocked Matrix Multiplication

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b, working in B x B blocks */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

[Diagram: the B x B block of c at (i1, j1) accumulates products of a block row of a and a block column of b.]

Cache Miss Analysis (blocked) Assume: cache block = 8 doubles; cache size C << n (much smaller than n); three blocks fit into the cache: 3B^2 < C. First (block) iteration: B^2/8 misses for each block, and 2n/B blocks of a and b are touched, so 2n/B * B^2/8 = nB/4 misses (omitting matrix c). Afterwards in cache: [schematic of the B x B blocks that remain cached].

Cache Miss Analysis (blocked, continued) Same assumptions: cache block = 8 doubles; cache size C << n; three blocks fit into the cache: 3B^2 < C. Second (block) iteration: same as the first, 2n/B * B^2/8 = nB/4 misses. Total misses: nB/4 * (n/B)^2 = n^3/(4B).

Summary No blocking: (9/8) * n^3 misses. Blocking: 1/(4B) * n^3 misses. This suggests using the largest possible block size B, subject to the limit 3B^2 < C (which can possibly be relaxed a bit, but there is a limit for B). Reason for the dramatic difference: matrix multiplication has inherent temporal locality. Referenced data: 3n^2; computation: 2n^3; so every array element is used O(n) times! But the program has to be written properly, and the code is more complex.
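A quick C sketch of this back-of-the-envelope arithmetic; the 32 KB cache size and the matrix dimension are assumed example values, and matrix elements are 8-byte doubles as on the slides.

#include <stdio.h>

int main(void)
{
    long C = 32 * 1024;      /* assumed cache size in bytes */
    long n = 2048;           /* assumed matrix dimension */

    /* Largest B such that three B x B blocks of doubles fit: 3 * B*B * 8 <= C. */
    long B = 1;
    while (3 * (B + 1) * (B + 1) * 8 <= C)
        B++;

    double unblocked = 9.0 / 8.0 * n * n * n;          /* (9/8) * n^3 misses */
    double blocked   = (double)n * n * n / (4 * B);    /* n^3 / (4B) misses  */

    printf("block size B = %ld\n", B);
    printf("unblocked misses ~ %.3g, blocked ~ %.3g (%.0fx fewer)\n",
           unblocked, blocked, unblocked / blocked);
    return 0;
}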

Optimizations for the Memory Hierarchy Write code that has locality. Spatial: access data contiguously. Temporal: make sure accesses to the same data are not too far apart in time. How to achieve this? Proper choice of algorithm; loop transformations. Cache versus register-level optimization: in both cases locality is desirable. Register space is much smaller and requires scalar replacement to exploit temporal locality. Register-level optimizations include exposing instruction-level parallelism (which conflicts with locality).
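The slide mentions scalar replacement; here is a minimal illustrative sketch (the function and array names are made up, not from the slides) showing the transformation: the running value is kept in a local scalar, which the compiler can hold in a register, instead of being re-read and re-written through memory on every iteration.

/* Before: b[i] is read and written through memory on every inner iteration. */
void row_sums_naive(double *b, const double *a, int n)
{
    for (int i = 0; i < n; i++) {
        b[i] = 0.0;
        for (int j = 0; j < n; j++)
            b[i] += a[i * n + j];
    }
}

/* After scalar replacement: the running sum lives in a scalar (likely a
 * register) and is written back to memory once per row. */
void row_sums_scalar(double *b, const double *a, int n)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;            /* scalar replacement of b[i] */
        for (int j = 0; j < n; j++)
            sum += a[i * n + j];
        b[i] = sum;
    }
}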

Parting Thoughts on Optimization "The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague." Edsger Dijkstra. "The best is the enemy of the good." Voltaire. "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." Brian W. Kernighan. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." Donald Knuth. "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason, including blind stupidity." W. A. Wulf. "You are bound to be unhappy if you optimize everything." Donald Knuth. "Rules of optimization: Rule 1: Don't do it. Rule 2 (For experts only): Don't do it yet." M. A. Jackson