15-213: The course that gives CMU its Zip!

Memory System Performance
March 22, 2001

Topics
- Impact of cache parameters
- Impact of memory reference patterns
  - memory mountain range
  - matrix multiply

Basic Cache Organization
- Address space: N = 2^n bytes; each address has n = t + s + b bits (tag t, set index s, block offset b).
- Cache: C = S x E x B bytes, organized as S = 2^s sets with E lines per set.
- Each cache line holds a block of B = 2^b bytes (the line size), plus a tag (t bits) and a valid bit (1 bit).
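
The t/s/b address split can be written out directly in C. Below is a minimal sketch (my own illustration; the toy field widths follow the worked bit pattern on the next slide, not any real cache):

#include <stdio.h>
#include <stdint.h>

/* Toy parameters matching the worked example on the next slide:
   t = 2 tag bits, s = 3 set-index bits, b = 4 block-offset bits. */
#define S_BITS 3
#define B_BITS 4

int main(void) {
    uintptr_t addr = 0x58;   /* binary 00 101 1000 */
    uintptr_t offset = addr & ((1u << B_BITS) - 1);             /* low b bits   */
    uintptr_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1); /* next s bits  */
    uintptr_t tag    = addr >> (B_BITS + S_BITS);               /* upper t bits */
    printf("addr 0x%lx -> tag %lu, set %lu, offset %lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)set, (unsigned long)offset);
    return 0;
}

Running this prints tag 0, set 5, offset 8, i.e., the 00 | 101 | 1000 decomposition used on the next slide.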

Multi-Level Caches
Options: separate data and instruction caches, or a unified cache.
Hierarchy: Processor (regs, TLB) -> L1 Dcache / L1 Icache -> L2 Cache -> Memory -> disk

              regs      L1 D/I cache   L2 cache         memory           disk
size:         200 B     8-64 KB        1-4 MB (SRAM)    128 MB (DRAM)    30 GB
speed:        3 ns      3 ns           4 ns             60 ns            8 ms
$/Mbyte:                               $100/MB          $1.50/MB         $0.05/MB
line size:    8 B       32 B           32 B             8 KB

Moving down the hierarchy: larger, slower, cheaper; larger line size, higher associativity, more likely to write back.

Key Features of Caches Accessing Word Causes Adjacent Words to be Cached B bytes having same bit pattern for upper n b address bits t s b 00 101 1000 In anticipation that will want to reference these words due to spatial locality in program Loading Block into Cache Causes Existing Block to be Evicted One that maps to same set If E > 1, then generally choose least recently used 00 101 0000 00 101 0100 00 101 1000 00 101 1100 4

Cache Performance Metrics
- Miss rate: fraction of memory references not found in the cache (misses / references). Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time: time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2.
- Miss penalty: additional time required because of a miss. Typically 25-100 cycles for main memory.
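
These three metrics combine into an average memory access time. The AMAT formula is standard but not spelled out on the slide; the sketch below plugs in values from the slide's typical ranges (my choice of specific numbers is an assumption):

#include <stdio.h>

int main(void) {
    /* Values chosen from the slide's typical ranges */
    double l1_hit_time     = 1.0;   /* cycles                          */
    double l1_miss_rate    = 0.05;  /* 5%, within the 3-10% range      */
    double l1_miss_penalty = 50.0;  /* cycles, within 25-100 to memory */

    /* Average memory access time: hit time plus the expected miss cost */
    double amat = l1_hit_time + l1_miss_rate * l1_miss_penalty;
    printf("AMAT = %.1f + %.2f * %.0f = %.1f cycles\n",
           l1_hit_time, l1_miss_rate, l1_miss_penalty, amat);
    return 0;
}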

Categorizing Cache Misses
- Compulsory ("cold") misses: the first-ever access to a memory line. Since lines are only brought into the cache on demand, this is guaranteed to be a cache miss. The programmer/system cannot reduce these.
- Capacity misses: the active portion of memory exceeds the cache size. The programmer can reduce these by rearranging data and access patterns.
- Conflict misses: the active portion of the address space fits in the cache, but too many lines map to the same cache entry. The programmer can reduce these by changing data structure sizes; avoid powers of 2 (see the sketch below).
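
As an illustration of why powers of 2 can hurt (not taken from the slides), the sketch below walks down a column of a 2-D array whose row stride is exactly the size of an assumed 32 KB direct-mapped cache, so every element in the column maps to the same set; padding each row by one line changes the mapping. The cache geometry and padding amount are assumptions of the example:

#include <stdio.h>

/* Assume a 32 KB direct-mapped data cache with 32-byte lines. */
#define CACHE_INTS 8192              /* 32 KB / sizeof(int)                 */
#define ROWS 64

static int bad[ROWS][CACHE_INTS];        /* row stride = exactly the cache size */
static int good[ROWS][CACHE_INTS + 8];   /* pad each row by one 32-byte line    */

/* Walk down column j twice.  The 64 lines touched easily fit in a 32 KB
   cache, so the second pass should hit -- unless every element maps to the
   same set, in which case each access evicts the previous line and the
   second pass misses again (conflict misses). */
static long sum_column_twice(int rows, int cols, int a[rows][cols], int j) {
    long s = 0;
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < rows; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    long s1 = sum_column_twice(ROWS, CACHE_INTS, bad, 0);      /* power-of-2 stride */
    long s2 = sum_column_twice(ROWS, CACHE_INTS + 8, good, 0); /* padded stride     */
    printf("%ld %ld\n", s1, s2);
    return 0;
}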

Measuring Memory Bandwidth

/* maxsize: maximum data-set size, defined elsewhere (e.g., as a #define) */
int data[maxsize];

int test(int size, int stride)
{
    int i, result = 0;
    int wsize = size / sizeof(int);
    for (i = 0; i < wsize; i += stride)
        result += data[i];
    return result;
}

Parameters of the measurement: size (bytes) and stride (words).

Measuring Memory Bandwidth (cont.)
Measurement
- Time repeated calls to test(size, stride).
- If size is sufficiently small, the array can be held entirely in cache.
Characteristics of the computation
- Stresses the read bandwidth of the system.
- Increasing stride yields decreased spatial locality: on average get B/(stride*4) accesses per cache block, until that bottoms out at one access per block.
- Increasing size increases the size of the working set.
A possible timing harness is sketched below.
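
The slides time repeated calls to test but do not show the harness; the sketch below is one way to do it. The clock()-based timing, the repetition count, and the MB/s formula (bytes actually touched per call divided by elapsed time) are my assumptions:

#include <stdio.h>
#include <time.h>

#define MAXSIZE (1 << 24)             /* 16 MB, assumed maximum size in bytes */
int data[MAXSIZE / sizeof(int)];

int test(int size, int stride)        /* the slide's measured routine */
{
    int result = 0;
    int wsize = size / sizeof(int);
    for (int i = 0; i < wsize; i += stride)
        result += data[i];
    return result;
}

/* Time repeated calls to test() and report read bandwidth in MB/s. */
double run(int size, int stride)
{
    int reps = 100;                   /* assumed repetition count          */
    volatile int sink = 0;            /* keep the calls from being removed */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += test(size, stride);
    double secs  = (double)(clock() - start) / CLOCKS_PER_SEC;
    double bytes = (double)reps * (size / (stride * sizeof(int))) * sizeof(int);
    return bytes / (1024.0 * 1024.0) / secs;
}

int main(void)
{
    for (int stride = 1; stride <= 16; stride *= 2)
        printf("size=512KB stride=%2d : %8.1f MB/s\n",
               stride, run(512 * 1024, stride));
    return 0;
}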

Alpha Memory Mountain Range
[Figure: read bandwidth (MB/s) vs. stride (s1-s15 words) and data set size (.5k-16m bytes) on a DEC Alpha 21164, 466 MHz, with 8 KB L1, 96 KB L2, and 2 MB L3 caches. Distinct plateaus correspond to the working set being L1, L2, L3, or main-memory resident.]

Effects Seen in Mountain Range
- Cache capacity: sudden drops in bandwidth as the working set size grows past each cache size.
- Cache block effects: performance degrades as stride increases (less spatial locality), then levels off once there is only a single access per line.

Alpha Cache Sizes
[Figure: MB/s for stride = 16 as a function of data set size.]
Ranges:
- .5k - 8k: running in L1 (high overhead for small data sets)
- 16k - 64k: running in L2
- 128k: indistinct cutoff (since the L2 cache is 96 KB)
- 256k - 2m: running in L3
- 4m - 16m: running in main memory

Alpha Line Size Effects
[Figure: MB/s vs. stride (s1-s16) for data set sizes 8k, 32k, 1024k, and 16m.]
Observed phenomenon
- Each doubling of the stride halves the number of accesses per block, until reaching the point of just one access per block.
- The line size shows up at the transition from the downward slope to the horizontal part of the curve; sometimes the transition is indistinct.
A small sketch of this relationship follows below.
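
To make the transition concrete, here is a small tabulation of expected accesses per cache block as a function of word stride. It is my own illustration, assuming 4-byte words and a 32-byte line; the curve flattens once the value reaches 1:

#include <stdio.h>

int main(void) {
    const int word = 4;    /* assumed word size in bytes  */
    const int line = 32;   /* assumed cache line size     */
    for (int stride = 1; stride <= 16; stride *= 2) {
        int per_block = line / (stride * word);
        if (per_block < 1)
            per_block = 1;          /* at most one access lands in each line */
        printf("stride %2d words -> %d access(es) per %d-byte block\n",
               stride, per_block, line);
    }
    return 0;
}

For a 32-byte line this prints 8, 4, 2, 1, 1 for strides 1, 2, 4, 8, 16: the slope flattens at stride 8, which is how the line size is read off the plot.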

Alpha Line Sizes
[Figure: MB/s vs. stride (s1-s16) for data set sizes 8k, 32k, 1024k, and 16m.]
Measurements:
- 8k: entire array L1 resident; effectively flat (except for overhead)
- 32k: shows that the L1 line size = 32 B
- 1024k: shows that the L2 line size = 32 B
- 16m: L3 line size = 64 B?

Xeon Memory Mountain Range
[Figure: read bandwidth (MB/s) vs. stride (s1-s15) and data set size (.5k-16m) on a Pentium III Xeon, 550 MHz, with 16 KB L1 and 512 KB L2 caches. Plateaus correspond to L1-resident, L2-resident, and main-memory-resident working sets.]

Xeon Cache Sizes
[Figure: MB/s for stride = 16 as a function of data set size.]
Ranges:
- .5k - 16k: running in L1 (overhead at the high end)
- 32k - 256k: running in L2
- 512k: running in main memory (but the L2 is supposed to be 512 KB!)
- 1m - 16m: running in main memory

Xeon Line Sizes
[Figure: MB/s vs. stride (s1-s16) for data set sizes 4k, 256k, and 16m.]
Measurements:
- 4k: entire array L1 resident; effectively flat (except for overhead)
- 256k: shows that the L1 line size = 32 B
- 16m: shows that the L2 line size = 32 B

Interactions Between Program & Cache
Major cache effects to consider
- Total cache size: try to keep heavily used data in the cache closest to the processor.
- Line size: exploit spatial locality.
Example application: multiply N x N matrices
- O(N^3) total operations
- Accesses: N reads per source element; N values summed per destination (but may be able to hold the running sum in a register)
- The variable sum is held in a register:

/* ijk */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Matrix Mult. Performance: Sparc20
[Figure: double-precision mflops vs. matrix size (n = 50-200) for the six loop orderings ijk, jik, ikj, kij, jki, kji.]
- As matrices grow in size, they eventually exceed cache capacity.
- Different loop orderings give different performance: cache effects, and whether or not we can accumulate partial sums in registers.

Miss Rate Analysis for Matrix Multiply
Assume:
- Line size = 32 B (big enough for four 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
Analysis method: look at the access pattern of the inner loop (A indexed by (i,k), B by (k,j), C by (i,j)).

Layout of Arrays in Memory
C arrays are allocated in row-major order: each row occupies contiguous memory locations.
Stepping through columns in one row:
  for (i = 0; i < N; i++)
      sum += a[0][i];
- accesses successive elements
- if line size (B) > 8 bytes, exploits spatial locality: compulsory miss rate = 8 bytes / B
Stepping through rows in one column:
  for (i = 0; i < N; i++)
      sum += a[i][0];
- accesses distant elements: no spatial locality! compulsory miss rate = 1 (i.e., 100%)
[Memory layout diagram for a 256 x 256 array of doubles starting at 0x80000: a[0][0] at 0x80000, a[0][1] at 0x80008, a[0][2] at 0x80010, ..., a[0][255] at 0x807F8, a[1][0] at 0x80800, a[1][1] at 0x80808, ..., a[255][255] at 0xFFFF8.]
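
The row-major rule can be stated as an address formula: &a[i][j] = base + (i*NCOLS + j)*sizeof(element). A minimal sketch checking this against the layout above (the 256 x 256 array of doubles matches the slide's diagram; the particular base address will of course differ at run time):

#include <stdio.h>
#include <stdint.h>

#define N 256
static double a[N][N];   /* on the slide this array happens to start at 0x80000 */

int main(void) {
    /* Row-major: element (i,j) lives i*N + j elements past the base. */
    int i = 1, j = 3;
    uintptr_t base     = (uintptr_t)&a[0][0];
    uintptr_t computed = base + ((uintptr_t)i * N + j) * sizeof(double);
    printf("&a[%d][%d] = %p, computed = %p\n",
           i, j, (void *)&a[i][j], (void *)computed);
    return 0;
}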

Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++) {
  for (i = 0; i < n; i++) {
    sum = 0.0;
    for (k = 0; k < n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
  for (i = 0; i < n; i++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise.
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
  for (k = 0; k < n; k++) {
    r = a[i][k];
    for (j = 0; j < n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise.
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
  for (k = 0; k < n; k++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise.
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
  for (j = 0; j < n; j++) {
    r = b[k][j];
    for (i = 0; i < n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise.
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Summary of Matrix Multiplication
- ijk (and jik): 2 loads, 0 stores; misses/iter = 1.25
- kij (and ikj): 2 loads, 1 store; misses/iter = 0.5
- jki (and kji): 2 loads, 1 store; misses/iter = 2.0
(The code for each variant is shown on the preceding slides.)
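
The totals above follow from the per-matrix inner-loop access patterns and the four-doubles-per-line assumption. A small tabulation of my own (not from the slides) that reproduces the numbers:

#include <stdio.h>

/* Misses per inner-loop iteration, under the slide's assumptions
   (32-byte lines holding 4 doubles, matrices far larger than the cache):
   row-wise access    -> 1 miss every 4 elements = 0.25
   column-wise access -> 1 miss per element      = 1.0
   fixed element      -> 0 (kept in a register)                          */
static double cost(const char *pattern) {
    if (pattern[0] == 'r') return 0.25;   /* "row"    */
    if (pattern[0] == 'c') return 1.0;    /* "column" */
    return 0.0;                           /* "fixed"  */
}

int main(void) {
    /* access pattern of A, B, C in the inner loop of each ordering */
    printf("ijk/jik: %.2f\n", cost("row")    + cost("column") + cost("fixed"));
    printf("kij/ikj: %.2f\n", cost("fixed")  + cost("row")    + cost("row"));
    printf("jki/kji: %.2f\n", cost("column") + cost("fixed")  + cost("column"));
    return 0;
}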

Matrix Mult. Performance: DEC5000
[Figure: double-precision mflops vs. matrix size (n = 50-200) for the six orderings, annotated with the three groups misses/iter = 0.5, 1.25, and 2.0.]

! " # Matrix Mult. Performance: Sparc20 Multiple columns of B fit in cache mflops (d.p.) 20 18 16 14 12 10 8 ikj kij ijk jik jki kji 6 4 2 0 50 75 100 125 150 175 200 matrix size (n) (misses/iter = 0.5) (misses/iter = 1.25) (misses/iter = 2.0) 29

Matrix Mult. Performance: Alpha 21164
[Figure: double-precision mflops vs. matrix size (n = 25-500) for the six orderings, annotated with the misses/iter = 0.5, 1.25, and 2.0 groups and with the points where the matrices become too big for the L1 cache and too big for the L2 cache.]

Matrix Mult. Performance: Pentium III Xeon
[Figure: double-precision MFlops vs. matrix size (n = 25-500) for the six orderings, annotated misses/iter = 0.5 or 1.25 for one group of curves and misses/iter = 2.0 for the other.]

Blocked Matrix Multiplication
"Block" (in this context) does not mean cache block; instead, it means a sub-block within the matrix.
Example: N = 8, sub-block size = 4:

  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] x [B21 B22] = [C21 C22]

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:
  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

Blocked Matrix Multiply (bijk)

for (jj = 0; jj < n; jj += bsize) {
  for (i = 0; i < n; i++)
    for (j = jj; j < min(jj+bsize, n); j++)
      c[i][j] = 0.0;
  for (kk = 0; kk < n; kk += bsize) {
    for (i = 0; i < n; i++) {
      for (j = jj; j < min(jj+bsize, n); j++) {
        sum = 0.0;
        for (k = kk; k < min(kk+bsize, n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}

Blocked Matrix Multiply Analysis
- The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates the result into a 1 x bsize sliver of C.
- The loop over i steps through n row slivers of A and C, using the same block of B each time.

/* innermost loop pair */
for (i = 0; i < n; i++) {
  for (j = jj; j < min(jj+bsize, n); j++) {
    sum = 0.0;
    for (k = kk; k < min(kk+bsize, n); k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] += sum;
  }
}

- The row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated.
A sketch for choosing bsize follows below.
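
The working set of the inner loops is roughly one bsize x bsize block of B plus a 1 x bsize sliver each of A and C, so bsize is usually chosen so that this fits comfortably in the target cache. A minimal sketch of that calculation; the half-the-cache budget is an assumption of mine, not a rule from the slides:

#include <stdio.h>

/* Pick the largest bsize such that the blocked working set
   (bsize*bsize doubles of B plus two bsize-long slivers of A and C)
   fits in an assumed fraction of the cache. */
static int pick_bsize(long cache_bytes) {
    long budget = cache_bytes / 2;            /* assumed: use half the cache */
    int bsize = 1;
    while ((long)(bsize + 1) * (bsize + 1) * 8 + 2L * (bsize + 1) * 8 <= budget)
        bsize++;
    return bsize;
}

int main(void) {
    printf("8 KB L1  -> bsize ~ %d\n", pick_bsize(8 * 1024));
    printf("96 KB L2 -> bsize ~ %d\n", pick_bsize(96 * 1024));
    return 0;
}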

( ' & ) & $ $ * $ $ $ $ $ + % % % % % %& ' % & & & & ' ' ' ' ' Blocked Matrix Mult. Perf: DEC5000 3 bijk 2.5 bikj mflops (d.p.) 2 1.5 ikj ijk 1 0.5 0 50 75 100 125 150 175 200 matrix size (n) 35

/ 0,,,,, 1,, 2 3.. - -. -.. -. -. - - / / / / / / Blocked Matrix Mult. Perf: Sparc20 mflops (d.p.) 20 18 16 14 12 10 8 bijk bikj ikj ijk 6 4 2 0 50 75 100 125 150 175 200 matrix size (n) 36

Blocked Matrix Mult. Performance: Alpha 21164
[Figure: double-precision mflops vs. matrix size (n = 50-500) comparing bijk and bikj with ijk and ikj.]

Blocked Matrix Mult. Performance: Pentium III Xeon
[Figure: double-precision MFlops vs. matrix size (n = 50-500) comparing bijk and bikj with ijk and ikj.]

Observations
The programmer can optimize for cache performance:
- how data structures are organized
- how data are accessed (nested loop structure; blocking is a general technique)
All machines like cache-friendly code:
- getting absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.)
- you can get most of the advantage with generic code: keep the working set reasonably small.