Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University


Lecture 12: Memory Design & Caches, part 2
Christos Kozyrakis, Stanford University
http://eeclass.stanford.edu/ee108b

Announcements
HW3 is due today. PA2 is available on-line today: Part 1 is due on 2/27, Part 2 is due on 3/6. HW4 is also available on-line today; it has been merged with HW5 and is due on 3/13.

Review: Memory Technologies
The cheaper the memory, the slower the access time.

    Technology   Access Time   $/GB in 2004
    SRAM         0.5-25 ns     $4000-$10000
    DRAM         50-120 ns     $100-$200
    Disk         5-10 ms       $0.5-$2

Review: The Processor-Memory Performance Gap
[Plot, 1980-2000: processor performance improves ~55%/year (Moore's Law) while DRAM improves ~7%/year, so the processor-memory performance gap grows ~48%/year.]
A 1-cycle access to DRAM memory in 1980 takes 100s of cycles now. We can't afford to stall the pipeline for that long!

Review: Memory Hierarchy
Trade off cost vs. speed and size vs. speed using a hierarchy of memories: small, fast, expensive caches at the top; large, slow, cheap memory at the bottom. We can't use large amounts of fast memory: it is expensive in dollars, watts, and space.
Ideal: the memory hierarchy should be almost as fast as the top level, and almost as big and cheap as the bottom level. Program locality makes this possible.

Review: Typical Memory Hierarchy: Everything is a Cache for Something Else

    Level                       Access time        Capacity   Managed by
    Registers (CPU chip)        1 cycle            ~500 B     software/compiler
    Level 1 cache (CPU chip)    1-3 cycles         ~64 KB     hardware
    Level 2 cache (CPU chip)    5-10 cycles        ~1 MB      hardware
    DRAM (chips)                ~100 cycles        ~2 GB      software/OS
    Disk (mechanical)           10^6-10^7 cycles   ~100 GB    software/OS
    Tape (mechanical)

How To Build A Cache?
Big question: I need to map a large address space into a small memory, and I want associative lookup. How do I do that? A fully associative lookup can be built in hardware, but it is complex; we need a simple but effective solution. Two common techniques: direct mapped and set associative. Details to settle: block size, write policy, replacement policy.

Direct Mapped Caches
Access is a straightforward process: index into the cache with the block number modulo the cache size, read out both data and tag (the stored upper address bits), and compare the tag with the address you want to determine hit/miss. A valid bit is needed to mark empty cache lines.
[Figure: a 2^n-byte cache with 1-byte blocks. The address splits into a cache tag (e.g. 0x50) and a cache index (e.g. 0x03) that selects one of the 2^n lines, each holding a valid bit, a tag, and one data byte.]
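To make the lookup steps concrete, here is a minimal software sketch of a direct-mapped lookup with 1-byte blocks, matching the figure above. The structure, the names, and the 256-line size are illustrative assumptions, not part of the lecture.

    #define DM_LINES 256   /* assumed: a 2^8-line cache with 1-byte blocks */

    typedef struct {
        int valid;             /* marks empty lines                        */
        unsigned tag;          /* upper address bits stored with the data  */
        unsigned char data;    /* one data byte per line in this example   */
    } dm_line_t;

    static dm_line_t dm_cache[DM_LINES];

    /* Returns 1 on a hit (byte in *out), 0 on a miss. */
    int dm_lookup(unsigned addr, unsigned char *out)
    {
        unsigned index = addr % DM_LINES;   /* block number modulo cache size */
        unsigned tag   = addr / DM_LINES;   /* remaining upper bits           */
        if (dm_cache[index].valid && dm_cache[index].tag == tag) {
            *out = dm_cache[index].data;    /* hit: data and tag read together */
            return 1;
        }
        return 0;                           /* miss: block must come from memory */
    }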

Cache Blocks
The previous example had only 1-byte blocks. That strategy took advantage of temporal locality (if a byte is referenced, it will tend to be referenced again soon) but not of spatial locality. To take advantage of spatial locality, increase the block size. Another advantage: a larger block also reduces the size of the tag memory. Potential disadvantage: fewer blocks fit in the cache.
[Figure: a cache line with a valid bit, a cache tag, and an 8-byte data block (Byte 0 through Byte 7).]

Cache Block Example
Assume a 2^n-byte direct mapped cache with 2^m-byte blocks.
Byte select: the lower m bits of the memory address.
Cache index: the next (n-m) bits, just above the byte select.
Cache tag: the upper (32-n) bits of the memory address.
[Figure: 1 KB cache with 32 B blocks. The address splits into a cache tag (bits 31-10, e.g. 0x50), a cache index (bits 9-5, e.g. 0x01), and a byte select (bits 4-0, e.g. 0x1F); the 2^5 = 32 cache lines each hold a valid bit, a tag, and 32 data bytes.]

Optimal Block Size
Larger block sizes take advantage of spatial locality, but a larger block also incurs a larger miss penalty, since it takes longer to transfer the block into the cache. A very large block can even increase the miss rate (fewer blocks in the cache compromises temporal locality), and with it the average access time. There is therefore a tradeoff in selecting the block size:
Average Access Time = Hit Time + Miss Rate × Miss Penalty
[Figure: three sketches vs. block size. Miss penalty grows with block size; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time is therefore U-shaped, rising once the increased miss penalty and miss rate dominate.]
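As a worked illustration with assumed numbers (not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, AMAT = 1 + 0.05 × 20 = 2 cycles. If doubling the block size cut the miss rate to 4% but raised the miss penalty to 28 cycles, AMAT would become 1 + 0.04 × 28 = 2.12 cycles, so the larger block would be a net loss despite the lower miss rate.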

Direct Mapped Problems
Conflict misses can in turn cause further conflict misses. Consider an extreme example where both the cache size and the block size are 4 bytes: there is only one entry in the cache, so conflict misses will be continuous if we access more than one data item. Such a situation is referred to as thrashing: the cache continually loads data but evicts it before it can be used again.
[Figure: a single cache line with a valid bit, a cache tag, and a 4-byte data block.]

This is a real problem! Consider the following example code:

    double a[8192], b[8192], c[8192];

    void vector_sum()
    {
        int i;
        for (i = 0; i < 8192; i++)
            c[i] = a[i] + b[i];
    }

Arrays a, b, and c will tend to conflict in small caches. The code will get cache misses with every array access (3 per loop iteration), and the spatial-locality savings from blocks will be eliminated. How can the severity of the conflicts be reduced?

Fully Associative (FA) Cache
The opposite extreme: it has no cache index. Any available entry can be used to store a memory element, so there are no conflict misses, only capacity misses. The cost is that the cache tags of all entries must be compared in parallel to find the desired one (this can be done in hardware, in a structure called a CAM).
[Figure: an FA cache with 32-byte blocks. The address splits into a 27-bit cache tag and a 5-bit byte select (e.g. 0x1E); every entry's tag is compared against the address tag simultaneously.]

N-way Set Associative
A compromise between direct-mapped and fully associative: a hash table with N entries per index. The set size equals the set associativity (e.g. 7-way means 7 members per set). For fast access, all entries (members) of a set are compared at the same time.
One view: N direct mapped caches operating in parallel, which must coordinate on who outputs the data and on issuing a miss (only when all of them miss). Another view: an array of (# indexes) fully associative caches. Direct mapped and fully associative caches are really just the extremes of N-way set associative (N = 1 and N = the number of blocks, respectively).
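As a rough software analogy of the hardware lookup, here is a sketch of an N-way set-associative access. The parameters (4 ways, 128 sets, 32 B blocks) and all names are assumptions for illustration, and the sequential loop stands in for the parallel tag comparison that hardware performs.

    #define WAYS        4
    #define SETS        128
    #define BLOCK_BYTES 32

    typedef struct {
        int valid;
        unsigned tag;
        unsigned char data[BLOCK_BYTES];
    } way_t;

    static way_t sa_cache[SETS][WAYS];

    /* Returns 1 on a hit (byte in *out); a miss is signaled only when all ways miss. */
    int sa_lookup(unsigned addr, unsigned char *out)
    {
        unsigned offset = addr % BLOCK_BYTES;              /* byte select */
        unsigned set    = (addr / BLOCK_BYTES) % SETS;     /* cache index */
        unsigned tag    = addr / (BLOCK_BYTES * SETS);     /* cache tag   */
        for (int w = 0; w < WAYS; w++) {                   /* hardware compares all ways at once */
            if (sa_cache[set][w].valid && sa_cache[set][w].tag == tag) {
                *out = sa_cache[set][w].data[offset];
                return 1;
            }
        }
        return 0;
    }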

Two-way Set Associative
The cache index selects a set; the two tags in the set are compared in parallel, and the data is selected based on which tag matches.
[Figure: a two-way set associative cache. The index selects one block from each way; both stored tags are compared against the address tag, the comparator outputs (Sel0/Sel1) drive a mux that selects the matching way's cache block, and the hit signal is the OR of the two compare results.]

Set Associative Caches
They work quite well: 2 ways is generally much better than direct mapped (it gets rid of many conflicts), 4 ways is a little better than 2, and more than 4 ways gives little additional benefit.
Set associative disadvantages: N comparators instead of 1, and the critical path is longer, since extra logic is needed to combine the data from all the ways; the extra mux delays the data result. Data arrives only after hit/miss is determined, whereas a direct mapped cache can assume a hit and recover later on a miss.

Tag & Index with Set-Associative Caches
Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag and the index?
The m least significant bits are the byte select within the block.
Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks, and each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks.
Cache index: the (n-m-a) bits after the byte select; the same index is used with all cache ways.
Bonus: odd numbers of ways allow odd-size (non-power-of-2) caches, e.g. a 3 MB cache = 3 ways of 1 MB each, or 6 ways of 512 KB each.
[Figure: address layout. Bits m-1..0 are the byte select, bits n-a-1..m are the cache index, and bits 31..n-a are the cache tag.]
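As a hedged worked example with assumed parameters: a 32 KB cache (n = 15) that is 4-way set-associative (a = 2) with 32 B blocks (m = 5) holds 2^10 = 1024 blocks, i.e. 2^8 = 256 blocks per way. The byte select is m = 5 bits, the cache index is n - m - a = 8 bits, and the cache tag is the remaining 32 - 8 - 5 = 19 bits.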

Intuitive Model of Cache Misses (the 3 C's)
Now, let's talk about missing in the cache. There are several categories of misses:
Compulsory (cold start): misses that would occur even in a fully associative cache of infinite size; the first access to a block. The only (partial) solution is to increase the block size.
Capacity: misses that occur in a finite-size fully associative cache because the cache cannot contain all the blocks accessed by the program. The solution is to increase the cache size.
Conflict (collision): misses that occur in a direct mapped cache because multiple memory locations map to the same cache location. One solution is to increase the cache size; a second is to increase the associativity.

Replacement Methods
Which line do you replace on a miss?
Direct mapped: easy, you have only one choice; replace the line at the index you need.
N-way set associative: you need to choose which way to replace. Options include Random (choose one at random) and Least Recently Used (LRU, the one used least recently). Exact LRU is often difficult to track, so implementations use approximations, which are often really "not recently used".
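One well-known special case: for 2-way sets, exact LRU needs only a single bit per set. A minimal sketch (the names and the set count are assumptions):

    #define LRU_SETS 256
    static int lru_way[LRU_SETS];   /* per set: which of the two ways is least recently used */

    /* Call on every access that hits (or fills) 'way' in 'set'. */
    void lru_touch(int set, int way)
    {
        lru_way[set] = 1 - way;     /* the other way is now the LRU one */
    }

    /* On a miss, evict the way that was not used most recently. */
    int lru_victim(int set)
    {
        return lru_way[set];
    }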

What About Writes?
So far we have talked about reading data, but a processor also writes data. Where do we put the data we want to write: in the cache, in main memory, or in both? Caches have different policies for this question. Most systems store the data in the cache (why?); some also store the data in memory as well. The policy determines what happens on a cache miss, whether main memory is always up to date, and whether the processor must wait until the store completes.

Cache Write Policy
Write-through (writes go to cache and memory): main memory is updated on each cache write, so replacing a cache entry just replaces the existing entry with the new one. The memory write causes a significant delay if the pipeline must stall for it.
Write-back (write data goes only to the cache): only the cache entry is updated on each cache write, so main memory and the cache data become inconsistent. A dirty bit is added to each cache entry to indicate whether the data in it must still be committed to memory; replacing a dirty entry requires writing its data back to memory before the entry can be replaced.
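A minimal sketch of the dirty-bit bookkeeping for a write-back cache; the line structure and the writeback() helper are illustrative assumptions, not from the slides.

    typedef struct {
        int valid, dirty;
        unsigned tag;
        unsigned char data[32];
    } wb_line_t;

    void writeback(wb_line_t *line);   /* assumed helper: copies the block to memory */

    /* Store hit: update only the cache copy and remember that memory is now stale. */
    void wb_store_hit(wb_line_t *line, int offset, unsigned char value)
    {
        line->data[offset] = value;
        line->dirty = 1;
    }

    /* Eviction: a dirty line must be written back before it is reused. */
    void wb_evict(wb_line_t *line)
    {
        if (line->valid && line->dirty)
            writeback(line);
        line->valid = 0;
        line->dirty = 0;
    }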

Write Policy Trade-offs
Write-through: misses are simpler and cheaper, since a block never needs to be written back. It is easier to implement, though most systems need an additional buffer (a write buffer) to be practical. It uses a lot of bandwidth to the next level of memory.
Write-back: words can be written at the cache rate, multiple writes within a block require only one writeback later, and the block transfer on write back to memory at eviction is efficient.

Buffering Writes
[Diagram: Processor → Cache and Write Buffer → DRAM]
Use a write buffer between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller slowly drains the buffer to memory. The write buffer is a FIFO (first in, first out) that typically holds a small number of writes. It works fine as long as the rate of writes to memory stays below 1 / (DRAM write cycle time).
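A minimal FIFO sketch of such a write buffer; the 4-entry depth and all names are assumptions. The processor stalls only when the buffer is full, while the memory controller drains it in the background.

    #define WB_DEPTH 4

    typedef struct { unsigned addr, data; } wb_entry_t;

    static wb_entry_t buf[WB_DEPTH];
    static int head = 0, tail = 0, count = 0;

    /* Processor side: returns 0 (must stall) when the buffer is full. */
    int wbuf_push(unsigned addr, unsigned data)
    {
        if (count == WB_DEPTH)
            return 0;
        buf[tail].addr = addr;
        buf[tail].data = data;
        tail = (tail + 1) % WB_DEPTH;
        count++;
        return 1;
    }

    /* Memory-controller side: drains the oldest write, if any. */
    int wbuf_drain(wb_entry_t *out)
    {
        if (count == 0)
            return 0;
        *out = buf[head];
        head = (head + 1) % WB_DEPTH;
        count--;
        return 1;
    }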

Reading vs. Writing With A Cache
We have talked about reading from a cache: access the cache to see if the data is there; if it hits, return the data, else fetch it from memory. Writing a cache can take more time.
If it is a direct mapped, write-through cache: fetch the tag and write the data in one access. The data has to go into the cache anyway, and destroying the old data can't hurt, since memory is up to date.
If it is a write-back cache, or a write-through cache that is not direct mapped: you need to check the tag before the write (access 1), and then write the data (access 2).
In either case, do you fetch the other words in the block?

Write Miss/Fetch Policies
Write-through caches can:
use fetch-on-write (also known as write-allocate) to fetch the rest of the block;
use no-fetch-on-write, marking the fact that the other parts of the block are not valid; or
use no-allocate-on-write (also known as write-around), writing the data to memory without placing it in the cache at all.
Write-back caches generally use fetch-on-write: fetch the rest of the block, hoping that future writes will be to the same block, though no-fetch-on-write and no-allocate-on-write would also work.

Write Miss Actions

Splitting Caches
Be careful about looking only at the miss rate: it is important, but not the whole story. Most processors have separate instruction and data caches. Why split the cache rather than have one larger unified cache? A split cache often has a slightly higher miss rate, but it doubles the access bandwidth: it allows an instruction reference and a data reference in the same cycle. Miss rate alone is not a sufficient measure of cache performance.

Measuring Performance
Memory system stalls include both cache miss stalls and write buffer stalls. The cache access time often determines the overall system clock cycle time, since the cache is often the slowest functional unit in the pipeline. Memory stalls hurt processor performance:
CPU Time = (CPU Cycles + Memory Stall Cycles) × Cycle Time
Memory stalls are caused by both reading and writing:
Memory Stall Cycles = Read Stall Cycles + Write Stall Cycles

Memory Performance
Read stalls are fairly easy to understand:
Read Stall Cycles = Reads/Program × Read Miss Rate × Read Miss Penalty
Write stalls depend upon the write policy.
Write-through: Write Stall Cycles = (Writes/Program × Write Miss Rate × Write Miss Penalty) + Write Buffer Stalls
Write-back: Write Stall Cycles = Writes/Program × Write Miss Rate × Write Miss Penalty
The write miss penalty can be complex: it can be partially hidden if the processor can continue executing, and it can include extra time to write back a value being evicted.

Worst-Case Simplicity
Assume that write and read misses cause the same delay:
Memory Stall Cycles = (Memory Accesses / Program) × Miss Rate × Miss Penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss Penalty
In a multi-level cache, be careful about local miss rates: the miss rate for the 2nd-level cache is often high if you look at its local miss rate (misses per reference into that cache), but measured as misses per instruction it is often much better.

Cache Performance Example
Consider the following: the miss rate for instruction accesses is 5%, the miss rate for data accesses is 8%, there are 0.4 data references per instruction, the CPI with a perfect cache is 2, and the read and write miss penalty is 20 cycles (including possible write buffer stalls). What is the performance of this machine relative to one without misses?

Performance Solution
Find the CPI for the base system without misses: CPI_no_misses = CPI_perfect = 2.
Find the CPI for the system with misses:
Misses/instr = I-cache misses + D-cache misses = 0.05 + (0.08 × 0.4) = 0.082
Memory Stall Cycles = Misses/instr × Miss Penalty = 0.082 × 20 = 1.64
CPI_with_misses = CPI_perfect + Memory Stall Cycles = 2 + 1.64 = 3.64
Compare the performance:
n = Performance_no_misses / Performance_with_misses = CPI_with_misses / CPI_no_misses = 3.64 / 2 = 1.82

Another Cache Problem
Given the following data: 1 instruction reference per instruction, 0.27 loads/instruction, 0.13 stores/instruction; memory access time = 4 cycles plus 1 cycle per word transferred; a 64 KB set-associative cache with 4-word blocks has a miss rate of 1.7%; base CPI of 1.5. Suppose the cache uses a write-through, write-around strategy without a write buffer. How much faster would the machine be with a perfect write buffer?
CPU Time = Instruction Count × (CPI_base + CPI_memory) × Clock Cycle Time
Execution time is therefore proportional to CPI = 1.5 + CPI_memory.

No Write Buffer
[Diagram: CPU → cache → memory]
CPI_memory = reads/instr. × miss rate × read miss penalty + writes/instr. × write penalty
write penalty = 4 cycles + 1 word @ 1 word/cycle = 5 cycles
read miss penalty = 4 cycles + 4 words @ 1 word/cycle = 8 cycles
CPI_memory = (1 if + 0.27 ld)/inst. × 0.017 × 8 cycles + (0.13 st)/inst. × 5 cycles
           = 0.17 cycles/inst. + 0.65 cycles/inst. = 0.82 cycles/inst.
CPI_overall = 1.5 cycles/inst. + 0.82 cycles/inst. = 2.32 cycles/inst.

Perfect Write Buffer
[Diagram: CPU → write buffer + cache → memory]
CPI_memory = reads/instr. × miss rate × 8-cycle read miss penalty + writes/instr. × (1 - miss rate) × 1-cycle hit penalty
A hit penalty is required because on write hits we must: (1) access the cache tags during the MEM cycle to determine a hit, then (2) stall the processor for a cycle to update the hit cache block.
CPI_memory = 0.17 cycles/inst. + (0.13 st)/inst. × (1 - 0.017) × 1 cycle
           = 0.17 cycles/inst. + 0.13 cycles/inst. = 0.30 cycles/inst.
CPI_overall = 1.5 cycles/inst. + 0.30 cycles/inst. = 1.80 cycles/inst.

Perfect Write Buffer + Cache Write Buffer
[Diagram: CPU → write buffer + cache (with one-entry cache write buffer, CWB) → memory]
CPI_memory = reads/instr. × miss rate × 8-cycle read miss penalty
The hit penalty on writes is avoided by adding a one-entry write buffer to the cache itself and writing the last store hit to the data array during the next store's MEM cycle. Hazard: on loads, the CWB must be checked along with the cache!
CPI_memory = 0.17 cycles/inst.
CPI_overall = 1.5 cycles/inst. + 0.17 cycles/inst. = 1.67 cycles/inst.

Review: Cache Design Summary
Three major categories of cache misses:
Compulsory - first-time access
Capacity - insufficient cache size
Conflict - two addresses map to the same cache block
AMAT = hit time + miss rate × miss penalty

    Design change          Hit time   Miss rate   Miss penalty
    Larger cache           -          +
    Larger block size                 +/-         -
    Higher associativity   -          +
    Second level cache                            +
    (+ = improves, - = hurts)

Intel Pentium M Memory Hierarchy
32 KB I-cache: 4-way set associative, 32 B blocks
32 KB D-cache: 4-way set associative, 32 B blocks
1 MB L2 cache: 8-way set associative, 32 B blocks
1 GB main memory, 3.2 GB/s bus to the memory interface chip
Half of the die is cache: the die is 10.6 x 7.8 mm = 83 mm^2 with 77 M transistors, 50 M of them in caches.

Fallacy: Caches are Transparent to Programmers
Example: cold cache, 4-byte words, 4-word cache blocks. Assume N >> 4 and M > the number of blocks in the cache.

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

sumarrayrows: miss rate 25%. sumarraycols: miss rate 100%. Because C stores arrays in row-major order, the row-wise loop touches the four words of each block consecutively (one miss per block), while the column-wise loop strides a whole row between accesses and misses on every reference.

Matrix Multiplication Example
Description: multiply N x N matrices, O(N^3) total operations. Accesses: each source element is read N times, and N values are summed per destination element.
Writing cache-friendly code: repeated references to variables are good (temporal locality), and stride-1 reference patterns are good (spatial locality).

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Pentium Blocked Matrix Multiply Performance
[Plot: cycles per inner-loop iteration (0-60) vs. array size n (25-400) for the loop orderings kji, jki, kij, ikj, jik, ijk and for the blocked variants bijk and bikj with bsize = 25.]
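For reference, a sketch of the blocked (bijk) variant that the plot refers to; it assumes row-major n x n matrices stored in flat arrays, c initialized to zero, and bsize dividing n (as with bsize = 25 and the plotted sizes).

    /* Blocked matrix multiply, bijk order: c += a * b. A bsize-wide slab of b
       stays in the cache while a whole stripe of a is streamed against it. */
    void bijk(double *a, double *b, double *c, int n, int bsize)
    {
        for (int kk = 0; kk < n; kk += bsize)
            for (int jj = 0; jj < n; jj += bsize)
                for (int i = 0; i < n; i++)
                    for (int j = jj; j < jj + bsize; j++) {
                        double sum = c[i*n + j];
                        for (int k = kk; k < kk + bsize; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
    }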