Memory Systems and Performance Engineering


6.172 Memory Systems and Performance Engineering, Fall 2010

Basic Caching Idea
A. Smaller memory is faster to access
B. Use the smaller memory to cache the contents of the larger memory
C. Provides the illusion of a fast, large memory
D. Key reason why this works: locality
   1. Temporal
   2. Spatial
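A minimal sketch contrasting the two kinds of locality (array name and sizes are illustrative only):

    #include <stdio.h>

    #define N (1 << 20)
    static int a[N];

    int main(void) {
        long sum = 0;

        /* Temporal locality: the same word is reused on every iteration,
           so after the first miss it is always found in the cache. */
        for (int i = 0; i < N; i++)
            sum += a[0];

        /* Spatial locality: consecutive words share a cache line, so one
           miss brings in the next 15 ints (with 64-byte lines) for free. */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%ld\n", sum);
        return 0;
    }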

Levels of the Memory Hierarchy (capacity, access time, cost; transfer unit and who manages the transfer):

Level           Capacity        Access Time               Cost         Transfer Unit         Managed By
CPU Registers   100s of bytes   300-500 ps (0.3-0.5 ns)                1-8 bytes (operands)  program/compiler
L1 Cache        10s of KB       ~1 ns                     ~$1000s/GB   32-64 bytes (blocks)  cache controller
L2 Cache        100s of KB      ~10 ns                    ~$1000s/GB   64-128 bytes (blocks) cache controller
Main Memory     GBs             80-200 ns                 ~$100/GB     4-8 KB (pages)        OS
Disk            10s of TB       10 ms (10,000,000 ns)     ~$1/GB       MBs (files)           user/operator
Tape            infinite        sec-min                   ~$1/GB

Upper levels are smaller and faster; lower levels are larger and slower.

Cache Issues
- Cold miss: the first time the data is accessed. Prefetching may be able to reduce the cost.
- Capacity miss: the previous copy was evicted because too much data was touched in between (working set too large). Reorganize the data accesses so reuse occurs before eviction; prefetch otherwise.
- Conflict miss: multiple data items map to the same location, so lines are evicted even before the cache is full. Rearrange data and/or pad arrays.
- True sharing miss: a thread on another processor wanted the data, so it moved to the other cache. Minimize sharing/locks.
- False sharing miss: another processor used different data in the same cache line, so the line moved. Pad data and make sure structures such as locks don't end up in the same cache line.
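A sketch of the padding fix for false sharing: one counter per thread, padded so that no two counters share a 64-byte cache line. The names and the 64-byte line size are assumptions for illustration.

    #include <stdint.h>

    /* Without the padding, several counters would sit in one 64-byte line
       and updates from different cores would bounce that line between
       caches (false sharing). */
    struct padded_counter {
        volatile int64_t count;
        char pad[64 - sizeof(int64_t)];   /* fill out the rest of the line */
    };

    struct padded_counter counters[8];    /* e.g., one per core */

    void bump(int tid) { counters[tid].count++; }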

Simple Cache
A. 32 KB, direct mapped, 64-byte lines (512 lines)
B. Cache access = single cycle
C. Memory access = 100 cycles
D. Byte-addressable memory
E. How addresses map into the cache (see the sketch below):
   1. Bottom 6 bits are the offset within the cache line
   2. Next 9 bits determine the cache line
F. Successive 32 KB memory blocks line up in the cache
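A small sketch of that address decomposition for this direct-mapped cache (the example address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* 64-byte lines: bits 0-5 are the offset within the line.
       512 lines: bits 6-14 select the line. */
    int main(void) {
        uintptr_t addr = 0x12345678;
        unsigned offset = addr & 0x3F;          /* bits 0-5  */
        unsigned line   = (addr >> 6) & 0x1FF;  /* bits 6-14 */
        printf("offset=%u line=%u\n", offset, line);
        return 0;
    }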

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[i];
}

Array A is laid out sequentially in memory; assume sizeof(int) = 4.
Access pattern summary: read in a new line, read the rest of the line, move on to the next line.
- S reads to A
- 16 elements of A per line
- 15 of every 16 accesses hit in the cache
- Total access time: 15*(S/16) + 100*(S/16)
What kind of locality? Spatial.
What kind of misses? Cold.

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[0];
}

Access pattern summary: read A[0] every time.
- S reads to A
- All accesses (except the first) hit in the cache
- Total access time: 100 + (S-1)
What kind of locality? Temporal.
What kind of misses? Cold.

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[i % (1<<N)];
}

Access pattern summary: read the initial segment of A repeatedly.
- S reads to A
- Assume 4 <= N <= 13
- One miss for each accessed line; the rest hit in the cache
- How many accessed lines? 2^(N-4)
- Total access time: 2^(N-4)*100 + (S - 2^(N-4))
What kind of locality? Spatial, temporal.
What kind of misses? Cold.

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[i % (1<<N)];
}

Access pattern summary: read the initial segment of A repeatedly.
- S reads to A
- Assume 14 <= N
- The first access to each line misses; the remaining accesses to that line hit
- 16 elements of A per line, so 15 of every 16 accesses hit in the cache
- Total access time: 15*(S/16) + 100*(S/16)
What kind of locality? Spatial.
What kind of misses? Cold, capacity.

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[(i*16) % (1<<N)];
}

Access pattern summary: read every 16th element of the initial segment of A, repeatedly.
- S reads to A
- Assume 14 <= N
- The first access to each line misses; there is only one access per line
- Data is fetched but mostly not accessed
- Total access time: 100 * S
What kind of locality? None.
What kind of misses? Cold, conflict.

Analytically Model Access Patterns

#define S ((1<<20)*sizeof(int))
int A[S];
for (i = 0; i < S; i++) {
  read A[random()%S];
}

Access pattern summary: read random elements of A.
- S reads to A
- Chance of hitting in the cache (roughly) = cache size / array size = 8K cached ints / 4M ints = 1/512
- Total access time (roughly): S*(511/512)*100 + S*(1/512)
What kind of locality? Almost none.
What kind of misses? Cold, capacity, conflict.

Basic Cache Access Modes
A. No locality: no locality in the computation
B. Streaming: spatial locality, no temporal locality
C. In cache: most accesses hit in the cache
D. Mode shift for repeated sequential access:
   1. Working set fits in cache: in-cache mode
   2. Working set too big for the cache: streaming mode
E. Optimizations for streaming mode (see the prefetch sketch below):
   1. Prefetching
   2. Bandwidth provisioning (you can buy bandwidth)
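A minimal sketch of software prefetching for streaming mode, using the GCC/Clang builtin __builtin_prefetch. PREFETCH_AHEAD is an illustrative tuning knob; real distances depend on memory latency and per-iteration cost.

    #define PREFETCH_AHEAD 16

    long sum_streaming(const int *a, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                __builtin_prefetch(&a[i + PREFETCH_AHEAD]);  /* request the line early */
            sum += a[i];
        }
        return sum;
    }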

Analytically Model Access Patterns

#define S ((1<<19)*sizeof(int))
int A[S];
int B[S];
for (i = 0; i < S; i++) {
  read A[i], B[i];
}

Access pattern summary: read A and B sequentially.
- S reads each to A and B
- A and B interfere in the cache: corresponding elements map to the same cache line
- Total access time: 2*100*S
What kind of locality? Spatial locality, but the cache can't exploit it.
What kind of misses? Cold, conflict.

Analytically Model Access Patterns

#define S (((1<<19)+16)*sizeof(int))
int A[S];
int B[S];
for (i = 0; i < S; i++) {
  read A[i], B[i];
}

Access pattern summary: read A and B sequentially.
- S reads each to A and B
- The 16 extra elements offset B relative to A, so A and B almost, but don't, interfere in the cache
- Total access time: 2*(15/16*S + 1/16*S*100)
What kind of locality? Spatial locality.
What kind of misses? Cold.

Set Associative Caches
A. Sets with multiple lines per set
B. Each line in a set is called a way
C. Each memory line maps to a specific set
D. It can be put into any cache line in its set
E. 32 KB cache, 64-byte lines, 2-way associative (see the sketch below):
   1. 256 sets
   2. Bottom six bits determine the offset in the cache line
   3. Next 8 bits determine the set
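A sketch of the set mapping for this cache. With 256 sets of 64-byte lines, addresses 256*64 = 16 KB apart fall in the same set; with 2 ways, two such streams can coexist. Array names are illustrative, and actual placement is up to the linker.

    #include <stdint.h>
    #include <stdio.h>

    static unsigned set_of(const void *p) {
        return ((uintptr_t)p >> 6) & 0xFF;   /* bits 6-13 select the set */
    }

    int main(void) {
        static int A[1 << 12], B[1 << 12];   /* 16 KB each */
        printf("set(A[0]) = %u, set(B[0]) = %u\n",
               set_of(&A[0]), set_of(&B[0]));
        return 0;
    }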

Analytically Model Access Patterns

#define S ((1<<19)*sizeof(int))
int A[S];
int B[S];
for (i = 0; i < S; i++) {
  read A[i], B[i];
}

Access pattern summary: read A and B sequentially.
- S reads each to A and B
- Corresponding A and B lines map to the same set, but there are enough ways in the set
- Total access time: 2*(15/16*S + 1/16*S*100)
What kind of locality? Spatial locality.
What kind of misses? Cold.

Analytically Model Access Patterns

#define S ((1<<19)*sizeof(int))
int A[S];
int B[S];
int C[S];
for (i = 0; i < S; i++) {
  read A[i], B[i], C[i];
}

Access pattern summary: read A, B, and C sequentially.
- S reads each to A, B, and C
- Corresponding A, B, and C lines map to the same set, and there are NOT enough ways in the set
- Total access time (with LRU replacement): 3*S*100
What kind of locality? Spatial locality (but the cache can't exploit it).
What kind of misses? Cold, conflict.

Associativity
A. How much associativity do modern machines have?
B. Why don't they have more?
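One way to answer the first question for your own machine: on Linux, cache geometry is exported under sysfs. A hedged sketch; these paths are standard on modern Linux but may be absent elsewhere.

    #include <stdio.h>

    /* Print the associativity of each cache level reported by sysfs. */
    int main(void) {
        char path[128], buf[32];
        for (int i = 0; i < 8; i++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/ways_of_associativity", i);
            FILE *f = fopen(path, "r");
            if (!f) break;
            if (fgets(buf, sizeof buf, f))
                printf("index%d: %s", i, buf);
            fclose(f);
        }
        return 0;
    }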

Linked Lists and the Cache

struct node {
  int data;
  struct node *next;
};

sizeof(struct node) = 16 (alignment, space for padding), so 4 struct nodes fit per cache line. Assume a list of length S.

for (c = l; c != NULL; c = c->next) {
  read c->data;
}

Access pattern summary: depends on allocation/use.
- Best case, everything in cache: total access time = S
- Next best, adjacent nodes (streaming): total access time = 3/4*S + 1/4*S*100
- Worst case, random placement (no locality): total access time = 100*S
Concept of effective cache size: 4 times less for lists than for arrays.

Structs and the Cache

struct node {
  int data;
  int more_data;
  int even_more_data;
  int yet_more_data;
  int flags;
  struct node *next;
};

sizeof(struct node) = 32 (alignment, space for padding), so 2 struct nodes fit per cache line. Assume a list of length S.

for (c = l; c != NULL; c = c->next) {
  read c->data;
}

Access pattern summary: depends on allocation/use.
- Best case, everything in cache: total access time = S
- Next best, adjacent nodes (streaming): total access time = 1/2*S + 1/2*S*100
- Worst case, random placement (no locality): total access time = 100*S
Concept of effective cache size: 8 times less for lists than for arrays.

Parallel Array Conversion

struct node {
  int data;
  int more_data;
  int even_more_data;
  int yet_more_data;
  int flags;
  struct node *next;
};

becomes

int data[MAXDATA];
int more_data[MAXDATA];
int even_more_data[MAXDATA];
int yet_more_data[MAXDATA];
int flags[MAXDATA];
int next[MAXDATA];

for (c = l; c != -1; c = next[c]) {
  read data[c];
}

Advantages: better cache behavior (more of the working data fits in cache).
Disadvantages: code distortion; the maximum size has to be known, or you must manage your own memory.

Managing Code Distortion

Pointer-based version:

typedef struct node *list;

int data(list l)      { return l->data; }
int more_data(list l) { return l->more_data; }
list next(list l)     { return l->next; }

Parallel-array version:

typedef int list;

int data(list l)      { return data[l]; }
int more_data(list l) { return more_data[l]; }
list next(list l)     { return next[l]; }

This version supports only one list. It can be extended to support multiple lists (you need a list object).
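With the accessors in place, traversal code can be written once and compiled against either representation. NIL is a hypothetical sentinel name: NULL for the pointer version, -1 (say) for the index version.

    int sum_list(list l) {
        int sum = 0;
        for (; l != NIL; l = next(l))   /* same loop for both layouts */
            sum += data(l);
        return sum;
    }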

Matrix Multiply
A. Representing the matrix in memory
B. Row-major storage:

double A[4][4]; A[i][j];    or    double A[16]; A[i*4+j];

A00 A01 A02 A03 | A10 A11 A12 A13 | A20 A21 A22 A23 | A30 A31 A32 A33

C. What if you want column-major storage?

double A[16]; A[j*4+i];

A00 A10 A20 A30 | A01 A11 A21 A31 | A02 A12 A22 A32 | A03 A13 A23 A33

Standard Matrix Multiply Code

for (i = 0; i < SIZE; i++) {
  for (j = 0; j < SIZE; j++) {
    for (k = 0; k < SIZE; k++) {
      C[i*SIZE+j] += A[i*SIZE+k]*B[k*SIZE+j];
    }
  }
}

Look at the inner loop only:
- Only the first access to C misses (temporal locality)
- The A accesses have a streaming pattern (spatial locality)
- B has no temporal or spatial locality

Access Patterns for A, B, and C: [figure showing the access patterns of A, B, and C in C = A x B. Courtesy of Martin Rinard. Used with permission.]

Memory Access Pattern: [figure: scanning the memory during the multiply]

How to get spatial locality for B
A. Transpose B first, then multiply. New code:

for (i = 0; i < SIZE; i++) {
  for (j = 0; j < SIZE; j++) {
    for (k = 0; k < SIZE; k++) {
      C[i*SIZE+j] += A[i*SIZE+k]*B[j*SIZE+k];
    }
  }
}

Overall effect on execution time? 11620 ms (original), 2356 ms (after transpose).

[figure: access patterns for C = A x transpose(B). Courtesy of Martin Rinard. Used with permission.]

Profile Data: [figure]

How to get temporal locality?
A. How much temporal locality should there be?
B. How many times is each element accessed? SIZE times.
C. Can we rearrange the computation to get better cache performance?
D. Yes, we can block it!
E. Key equation (here the A_mk and B_kn are submatrices):

[ A_11 ... A_1N ]   [ B_11 ... B_1N ]   [ sum_k A_1k*B_k1 ... sum_k A_1k*B_kN ]
[  ...      ... ] x [  ...      ... ] = [       ...                  ...      ]
[ A_N1 ... A_NN ]   [ B_N1 ... B_NN ]   [ sum_k A_Nk*B_k1 ... sum_k A_Nk*B_kN ]

Blocked Matrix Multiply

for (j = 0; j < SIZE; j += BLOCK) {
  for (k = 0; k < SIZE; k += BLOCK) {
    for (i = 0; i < SIZE; i += BLOCK) {
      for (ii = i; ii < i+BLOCK; ii++) {
        for (jj = j; jj < j+BLOCK; jj++) {
          for (kk = k; kk < k+BLOCK; kk++) {
            C[ii*SIZE+jj] += A[ii*SIZE+kk]*B[jj*SIZE+kk];
          }
        }
      }
    }
  }
}

(Warning: SIZE must be a multiple of BLOCK.)
Overall effect on execution time? 11620 ms (original), 2356 ms (after transpose), 631 ms (after transpose and blocking).

After Blocking: [figure: access patterns for C = A x transpose(B) after blocking. Courtesy of Martin Rinard. Used with permission.]

Data Reuse in Blocking

Changing the order of computation can reduce the number of loads into cache.

Calculating a row at a time (1024 values):
  A: 1024*1 = 1,024
  B: 384*1 = 384
  C: 1024*384 = 393,216
  Total = 394,624 loads

Blocked matrix multiply (a 32x32 block, 32^2 = 1024 values):
  A: 32*32 = 1,024
  B: 384*32 = 12,288
  C: 32*384 = 12,288
  Total = 25,600 loads

[figure: C (1024 wide) = A (1024 x 384) x B (384 x 1024)]

Profile Data: [figure]

Blocking for Multiple Levels
A. Can block for registers, L1 cache, L2 cache, etc.
B. Really nasty code to write by hand
C. Automated by the compiler community
D. Divide and conquer is an alternative (coming up; a sketch follows below)
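A hedged sketch of the divide-and-conquer alternative: recursively split the matrices until the subproblem fits in cache, so one recursion handles every level of the hierarchy at once. Assumes square power-of-two sizes, row-major storage, and a caller-chosen CUTOFF; names are illustrative.

    #define CUTOFF 32   /* fall back to the simple loop below this size */

    /* C += A*B for n x n submatrices embedded in arrays of width rowstride. */
    void mm_rec(double *C, const double *A, const double *B,
                int n, int rowstride) {
        if (n <= CUTOFF) {
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i*rowstride+j] += A[i*rowstride+k] * B[k*rowstride+j];
            return;
        }
        int h = n / 2;
        /* Quadrant views: X11 top-left, X12 top-right, X21 bottom-left, X22 bottom-right. */
        const double *A11 = A, *A12 = A + h,
                     *A21 = A + h*rowstride, *A22 = A + h*rowstride + h;
        const double *B11 = B, *B12 = B + h,
                     *B21 = B + h*rowstride, *B22 = B + h*rowstride + h;
        double *C11 = C, *C12 = C + h,
               *C21 = C + h*rowstride, *C22 = C + h*rowstride + h;
        /* C11 += A11*B11 + A12*B21, and similarly for the other quadrants. */
        mm_rec(C11, A11, B11, h, rowstride); mm_rec(C11, A12, B21, h, rowstride);
        mm_rec(C12, A11, B12, h, rowstride); mm_rec(C12, A12, B22, h, rowstride);
        mm_rec(C21, A21, B11, h, rowstride); mm_rec(C21, A22, B21, h, rowstride);
        mm_rec(C22, A21, B12, h, rowstride); mm_rec(C22, A22, B22, h, rowstride);
    }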

Call Stack and Locality

int fib(int n) {
  if (n == 0) return 1;
  if (n == 1) return 1;
  return (fib(n-1) + fib(n-2));
}

What does the call stack look like for fib(4)?
[figure: the sequence of stack states for fib(4), expanding through fib(3), fib(2), fib(1), and fib(0)]
How deep does it go? What kind of locality?

Stages and Locality
A. Staged computational pattern:
   1. Read in lots of data
   2. Process through Stage1, ..., StageN
   3. Produce results
B. Improving cache performance (see the sketch below):
   1. For all cache-sized chunks of data:
      a. Read in the chunk
      b. Process the chunk through Stage1, ..., StageN
      c. Produce results for the chunk
   2. Merge the chunks to produce results
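A minimal sketch of the chunked pipeline in B. The stage functions and CHUNK are illustrative names; CHUNK is sized so one chunk fits comfortably in cache.

    #define CHUNK (1 << 13)   /* 8K ints = 32 KB, sized for L1 */

    static void stage1(int *d, int n) { for (int i = 0; i < n; i++) d[i] += 1; }
    static void stage2(int *d, int n) { for (int i = 0; i < n; i++) d[i] *= 2; }

    void process(int *data, int n) {
        for (int i = 0; i < n; i += CHUNK) {
            int len = (n - i < CHUNK) ? (n - i) : CHUNK;
            stage1(data + i, len);   /* chunk is brought into cache here... */
            stage2(data + i, len);   /* ...and is still hot when stage2 runs */
        }
    }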

Basic Concepts
A. Cache concepts:
   1. Lines, associativity, sets
   2. How addresses map to the cache
B. Access pattern concepts:
   1. Streaming versus in-cache versus no locality
   2. Mode switches when the working set no longer fits in the cache
C. Data structure transformations:
   1. Segregate and pack frequently accessed data
      a. Replace structs with parallel arrays and other split data structures
      b. Pack data to get a smaller cache footprint
   2. Modify representation; compute; restore representation
      a. Transpose; multiply; transpose
      b. Copy in; compute; copy out
D. Computation transformations:
   1. Reorder data accesses for reuse
   2. Recast computation stages when possible

Intel Core Microarchitecture Memory Sub-system

Intel Core 2 Quad processor: four cores, each with its own L1 instruction and L1 data cache; each pair of cores shares an L2 cache; the L2s connect to main memory.

L1 data cache:        32 KB, 64-byte lines, 3-cycle latency, 8-way associative
L1 instruction cache: 32 KB, 64-byte lines, 3-cycle latency, 8-way associative
L2 cache:             6 MB, 64-byte lines, 14-cycle latency, 24-way associative

Intel Nehalem Microarchitecture Memory Sub-system

Intel 6-core processor: each core has its own L1 instruction cache, L1 data cache, and L2 cache; all six cores share the L3, which connects to main memory.

L1 data cache:        32 KB, 64-byte lines, 4 ns latency, 8-way associative
L1 instruction cache: 32 KB, 64-byte lines, 4 ns latency, 4-way associative
L2 cache:             256 KB, 64-byte lines, 10 ns latency, 8-way associative
L3 cache:             12 MB, 64-byte lines, 50 ns latency, 16-way associative
Main memory:          64-byte lines, 75 ns latency

Intel Core 2 Quad Processor

for(rep=0; rep < REP; rep++)
  for(a=0; a < N; a++)
    A[a] = A[a] + 1;

[plot: runtime per access versus N, for N from 2 to 131072]

Intel Core 2 Quad Processor

for(r=0; r < REP; r++)
  for(a=0; a < 64*1024; a++)
    A[a*skip] = A[a*skip] + 1;

[plot: runtime per access versus skip, for skip from 0 to 1024]

Intel Nehalem Processor

for(rep=0; rep < REP; rep++)
  for(a=0; a < N; a++)
    A[a] = A[a] + 1;

[plot: runtime per access versus N, for N up to 140000]

Intel Nehalem Processor

mask = (1<<n) - 1;
for(rep=0; rep < REP; rep++) {
  addr = ((rep + 523)*253573) & mask;
  A[addr] = A[addr] + 1;
}

[plot, built up over several slides: performance versus N, with the successive drops annotated as L1 cache misses, then L2 cache misses, then L3 cache misses, then TLB misses]

TLB
- Page size is 4 KB
- Number of TLB entries is 512
- So the total memory that can be mapped by the TLB is 512 x 4 KB = 2 MB
- The L3 cache is 12 MB!
- TLB misses happen before L3 cache misses!

Other Issues
- Multiple outstanding memory references
- Hardware prefetching
- Prefetching instructions
- TLBs
- Paging

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms