Cache Impact on Program Performance. T. Yang. UCSB CS240A. 2017


Multi-level cache in computer systems

Topics:
- Performance analysis for multi-level cache
- Cache performance optimization through program transformation

[Figure: memory hierarchy — Processor (Datapath, Control) → L1 cache → L2 cache → L3 cache → Main memory → Disk]


Cache misses and data access time

D0: total memory data accesses.
D1: missed accesses at L1; local miss ratio of L1: m1 = D1/D0.
D2: missed accesses at L2; local miss ratio of L2: m2 = D2/D1.
D3: missed accesses at L3; local miss ratio of L3: m3 = D3/D2.

Memory and cache access time:
δi: access time at cache level i. δmem: access time in memory.

Average access time = total time / D0 = δ1 + m1 × penalty

Average memory access time (AMAT): AMAT = δ1 + m1 × penalty, where δ1 (~2 cycles) covers data found in L1, and the penalty term covers data found in L2, L3, or memory.

Total data access time: average time = δ1 + m1[δ2 + m2 × penalty], where δ1 (~2 cycles) covers L1 hits, δ2 covers data found in L2, and the innermost penalty covers data found in L3 or memory.

Total data access time with no L3 (misses go to memory): average time = δ1 + m1[δ2 + m2 × δmem], with δ1 ~2 cycles, δ2 ~10 cycles, δmem ~100-200 cycles.

Total data access time with L3 (data found in L3 or memory): AMAT = δ1 + m1[δ2 + m2[δ3 + m3 × δmem]]

Local vs. Global Miss Rates

Local miss rate: the fraction of references to one level of a cache that miss at that level. For example, m2 = D2/D1. Note that the total number of L2 accesses equals the number of L1 misses (D1).

Global miss rate: the fraction of all references that miss in every level of a multilevel cache. Global L2 miss rate = D2/D0 = (D1/D0) × (D2/D1) = m1 × m2. The L2 local miss rate is much larger than the global miss rate.

Global miss rate, example configuration: L1 cache: 32 KB I$, 32 KB D$; L2 cache: 256 KB; L3 cache: 4 MB.

Average memory access time with no L3 cache:
AMAT = δ1 + m1[δ2 + m2 δmem]
     = δ1 + m1 δ2 + m1 m2 δmem
     = δ1 + m1 δ2 + GMiss2 δmem

Average memory access time with L3 cache:
AMAT = δ1 + m1[δ2 + m2[δ3 + m3 δmem]]
     = δ1 + m1 δ2 + m1 m2 δ3 + m1 m2 m3 δmem
     = δ1 + m1 δ2 + GMiss2 δ3 + GMiss3 δmem

Example: What is the average memory access time? [worked numbers shown on slide]

Example: What is the average memory access time with L1, L2, and L3? [worked numbers shown on slide]

Cache-aware Programming: reuse values in cache as much as possible; exploit temporal locality in the program.

Example 1: y[2] is revisited continuously:
for i = 1 to n: y[2] = y[2] + 3

Example 2 (access sequence on slide): y[2] is revisited a few instructions later.

Cache-aware Programming: take advantage of bandwidth by fetching a whole chunk of memory into the cache and using the whole chunk; exploit spatial locality in the program:
for i = 1 to n: y[i] = y[i] + 3
Visiting y[1] benefits the next access, y[2].

[Figure: y[0], y[1], y[2], y[3], y[4], ..., y[31] laid out in memory across 32-byte cache blocks, each block with a tag.]

2D array layout in memory (just like a 1D array):

for (x = 0; x < 3; x++) {
    for (y = 0; y < 3; y++) {
        a[y][x] = 0;   /* implemented as array[3*y + x] = 0 */
    }
}

→ access order: a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], ... — stride 3 in memory.

Exploit spatial data locality via program rewriting: Example 1

Each cache block has 64 bytes; the cache holds 128 bytes. Data: char D[64][64]; each row is stored in one cache block.

Program 1 (column order):
for (j = 0; j < 64; j++)
    for (i = 0; i < 64; i++)
        D[i][j] = 0;
64*64 byte accesses → what is the cache miss rate?

Program 2 (row order):
for (i = 0; i < 64; i++)
    for (j = 0; j < 64; j++)
        D[i][j] = 0;
What is the cache miss rate?

Data access pattern and cache misses (row order):

for (i = 0; i < 64; i++)
    for (j = 0; j < 64; j++)
        D[i][j] = 0;

1 cache miss per pass of the inner loop: the pattern along each row is miss, hit, hit, ..., hit (D[i][0] misses, D[i][1] through D[i][63] hit in the fetched block).

64 cache misses out of 64*64 accesses. There is spatial locality: each fetched cache block is used 64 times (the consecutive accesses within the inner loop over j) before being evicted.

Memory layout and data access by block: the memory layout of char D[64][64] is the same as char D[64*64]. With row order, the program's data access order matches the memory layout, so each cache block (one row) is fetched once and all 64 of its bytes are used: 64 cache misses out of 64*64 accesses.

Data locality and cache misses (column order):

for (j = 0; j < 64; j++)
    for (i = 0; i < 64; i++)
        D[i][j] = 0;

64 cache misses per pass of the inner loop: D[0][j], D[1][j], ..., D[63][j] each lie in a different row, i.e., a different cache block.

100% cache miss rate. There is no spatial locality: each fetched block is used only once before being evicted (the cache holds only 2 blocks, but each inner-loop pass touches 64 distinct blocks).

Memory layout and data access by block: with column order, the program's data access order (D[0][0], D[1][0], ..., D[63][0], D[0][1], D[1][1], ...) does not match the memory layout; consecutive accesses land in different cache blocks. 100% cache miss rate.

Summary of Example 1: loop interchange alters the execution order and the data access pattern, exploiting more spatial locality in this case.

Program rewriting example 2: cache blocking for better temporal locality

Cache size = 8 blocks = 128 bytes; cache block size = 16 bytes, holding 4 integers.

Program structure:
int A[64];   /* sizeof(int) = 4 bytes */
for (k = 0; k < repcount; k++)
    for (i = 0; i < 64; i += stepsize)
        A[i] = A[i] + 1;

Analyze the cache hit ratio while varying the cache block size or the step size (stride distance).

Example 2: focus on the inner loop

for (i = 0; i < 64; i += stepsize)
    A[i] = A[i] + 1;

[Figure: data access indices for each step size (also called stride distance): S=1 touches 0, 1, 2, 3, ...; S=2 touches 0, 2, 4, ...; S=4 touches 0, 4, 8, ...; S=8 touches 0, 8, ...]

Step size = 2

for (i = 0; i < 64; i += stepsize)
    A[i] = A[i] + 1;   /* read, then write A[i] */

With S=2 and 4 integers per block, each block serves two of the accessed elements: access indices 0, 2, 4, ... give the pattern M/H (read miss, write hit) on the first element of each block, then H/H on the second — M/H H/H M/H H/H ...

Repeat many times

for (k = 0; k < repcount; k++)
    for (i = 0; i < 64; i += stepsize)
        A[i] = A[i] + 1;   /* read, then write A[i] */

With S=2, the array spans 16 blocks; each pass of the inner loop accesses 32 elements and fetches all 16 blocks, each used as read/write/read/write. The cache holds only 8 blocks and cannot keep all 16 fetched blocks, so every repetition misses again.

Cache blocking to exploit temporal locality

for (k = 0; k < 100; k++)
    for (i = 0; i < 64; i += S)
        A[i] = f(A[i]);

[Figure: instead of sweeping all indices 0, 2, 4, 6, 8, ... for each k, process one cache-sized chunk of indices for all k = 0 to 100 before moving to the next chunk; each chunk then fits in cache.]

Rewrite the program with cache blocking.

Rewrite a program loop for better cache usage: loop blocking (cache blocking)

Given:
for (i = 0; i < 64; i += S)
    A[i] = f(A[i]);

Rewrite as (the slide's concrete example uses blocksize = 2; more generally):
for (bi = 0; bi < 64; bi += blocksize)
    for (i = bi; i < bi + blocksize; i += S)
        A[i] = f(A[i]);

Example 2: cache blocking for better performance

for (k = 0; k < 100; k++)
    for (i = 0; i < 64; i += S)
        A[i] = f(A[i]);

Rewrite as:
for (k = 0; k < 100; k++)
    for (bi = 0; bi < 64; bi += blocksize)
        for (i = bi; i < bi + blocksize; i += S)
            A[i] = f(A[i]);

Then apply loop interchange:
for (bi = 0; bi < 64; bi += blocksize)
    for (k = 0; k < 100; k++)
        for (i = bi; i < bi + blocksize; i += S)
            A[i] = f(A[i]);

The inner two loops now operate on one block that fits in cache for all 100 repetitions.

Example 3: matrix multiplication C = A*B

C[i][j] = (row i of A) · (column j of B):

for i = 0 to n-1
    for j = 0 to n-1
        for k = 0 to n-1
            C[i][j] += A[i][k] * B[k][j]

Example 3: matrix multiplication code, with the 2D arrays implemented in a 1D (column-major) layout:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C[i + j*n] += A[i + k*n] * B[k + j*n];

The three loops can be interchanged freely (each C element is updated independently, with no dependence across i and j). Which order has better cache performance (runs faster)? Study the impact of the stride in the innermost loop, which does most of the computation:

for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            C[i + j*n] += A[i + k*n] * B[k + j*n];

Example 4: cache blocking for matrix transpose

for (x = 0; x < n; x++) {
    for (y = 0; y < n; y++) {
        dst[y + x * n] = src[x + y * n];
    }
}

[Figure: src read along y, dst written along x.]

Rewrite the code with cache blocking:

for (i = 0; i < n; i += blocksize)
    for (j = 0; j < n; j += blocksize)
        for (x = i; x < i + blocksize; x++)
            for (y = j; y < j + blocksize; y++)
                dst[y + x * n] = src[x + y * n];