Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access


Cache Performance
Samira Khan
March 28, 2017

Agenda
Review from last lecture: cache access, associativity, replacement. Today: cache performance.

Cache Abstraction and Metrics
An address is presented to the tag store (is the address in the cache? plus bookkeeping such as valid bits), which answers hit/miss; the data store holds the memory blocks themselves.
Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency)

Direct-Mapped Cache: Placement and Access
Assume byte-addressable memory: 256 bytes with 8-byte blocks → 32 blocks. Assume a 64-byte cache with 8 blocks. Direct-mapped: a block can go to only one location. The 8-bit address splits into a 2-bit tag, a 3-bit index, and a 3-bit byte-in-block offset; the index selects one tag-store/data-store entry, the stored tag is compared against the address tag, and a MUX selects the addressed bytes from the block. Addresses with the same index contend for the same location and cause conflict misses.

Direct-Mapped Cache: Placement and Access (example)
Access pattern: A, B, A, B, A, B, where A = 0b00 000 xxx and B = 0b01 000 xxx. With the 8-bit address split into a 2-bit tag, 3-bit index, and 3-bit offset, A and B share index 000 but have different tags.
Access A: the entry is invalid → MISS; fetch A's block and update the tag.
Access B: the tags do not match → MISS; fetch B's block and update the tag.

Access A again: the tags do not match → MISS; fetch block A and update the tag. Every access in the A, B, A, B, ... pattern therefore misses, even though only two blocks are ever used.

Set-Associative Cache
With two blocks per set (2-way), A and B can reside in the same set at the same time, so after the initial fills the same access pattern HITs.

Associativity (and Tradeoffs)
Degree of associativity: how many blocks can map to the same index (or set)?
Higher associativity: + higher hit rate; -- slower cache access time (hit latency and data access latency); -- more expensive hardware (more comparators).
There are diminishing returns in hit rate from higher associativity.

Issues in Set-Associative Caches
Think of each block in a set as having a priority indicating how important it is to keep the block in the cache. Key issue: how do you determine/adjust block priorities? There are three key decisions in a set:
Insertion: what happens to priorities on a cache fill? Where to insert the incoming block, and whether or not to insert it.
Promotion: what happens to priorities on a cache hit? Whether and how to change the block's priority.
Eviction/replacement: what happens to priorities on a cache miss? Which block to evict and how to adjust priorities.

Eviction/Replacement Policy
Which block in the set should be replaced on a cache miss? Any invalid block first; if all blocks are valid, consult the replacement policy: random, FIFO, least recently used (how to implement?), not most recently used, least frequently used, or hybrid replacement policies.

LRU worked example: a 4-way set holds A, B, C, D after the access pattern A, C, B, D (D is most recently used, A least recently used). The next access, E, misses; LRU selects A, the least recently used block, as the victim.

(Continuing the example: E replaces A and becomes the most recently used block, while the recency ranks of B, C, and D each age by one, leaving C as the new LRU block.)

(A subsequent hit on B promotes B to most recently used; nothing is evicted on a hit.)

Implementing LRU
Idea: evict the least recently accessed block. Problem: this requires keeping track of the access ordering of all blocks in the set.
Question: in a 2-way set-associative cache, what do you need to implement LRU perfectly?
Question: in a 16-way set-associative cache, what do you need to implement LRU perfectly? What is the logic needed to determine the LRU victim?

Approximations of LRU
Most modern processors do not implement true LRU (also called "perfect LRU") in highly-associative caches. Why? True LRU is complex, and LRU is itself only an approximation used to predict locality (i.e., it is not the best possible cache management policy). Example approximation: not-MRU (evict any block other than the most recently used one).

Cache Replacement Policy: LRU or Random?
LRU vs. random: which one is better? Example: in a 4-way cache, cyclic references to A, B, C, D, E yield a 0% hit rate with the LRU policy. Set thrashing occurs when the program's working set in a set is larger than the set associativity; a random replacement policy is better when thrashing occurs. In practice it depends on the workload, and the average hit rates of LRU and random are similar. Best of both worlds: a hybrid of LRU and random. How to choose between the two? Set sampling; see Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

What Is in a Tag Store Entry?
Valid bit; tag; replacement policy bits; dirty bit? (needed for write-back, but not write-through, caches).

Handling Writes (I)
When do we write the modified data in a cache to the next level?
Write through: at the time the write happens. Write back: when the block is evicted.
Write-back: + can consolidate multiple writes to the same block before eviction, potentially saving bandwidth between cache levels and saving energy; -- needs a bit in the tag store indicating the block is dirty/modified.
Write-through: + simpler; + all levels are up to date and consistent; -- more bandwidth intensive; no coalescing of writes.

Handling Writes (II)
Do we allocate a cache block on a write miss?
Allocate on write miss: + can consolidate writes instead of writing each of them individually to the next level; + simpler, because write misses can be treated the same way as read misses; -- requires (?) transfer of the whole cache block.
No-allocate on write miss: + conserves cache space if the locality of writes is low (potentially better cache hit rate).

Instruction vs. Data Caches: Separate or Unified?
Unified: + dynamic sharing of cache space, with no overprovisioning that might happen with static partitioning (i.e., split I and D caches); -- instructions and data can thrash each other (no guaranteed space for either); -- I and D are accessed in different places in the pipeline, so where do we place the unified cache for fast access?
First-level caches are almost always split, mainly for the last reason above; second- and higher-level caches are almost always unified.

Multi-level Caching in a Pipelined Design
First-level caches (instruction and data): decisions very much affected by cycle time; small, with lower associativity; tag store and data store are accessed in parallel.
Second- and third-level caches: decisions need to balance hit rate and access latency; usually large and highly associative; latency is less critical, and the tag and data stores are accessed serially.

Cache Performance: Serial vs. Parallel Access of Levels
With serial access, the second-level cache is accessed only if the first level misses, so the second level does not see the same accesses as the first: the first level acts as a filter (filtering out some temporal and spatial locality). Management policies at the two levels are therefore different.

Cache Parameters vs. Miss/Hit Rate
Hit and miss rates depend on cache size, block size, associativity, replacement policy, and insertion/placement policy.

Cache Size
Cache size is the total data capacity (not including tags). Bigger caches can exploit temporal locality better, but bigger is not ALWAYS better. Too large a cache adversely affects hit and miss latency: smaller is faster, so bigger is slower, and access time may degrade the critical path. Too small a cache doesn't exploit temporal locality well, and useful data is replaced often. Working set: the whole set of data the executing application references within a time interval; the hit-rate curve versus cache size flattens once the working set fits.

Block Size
The block size is the amount of data associated with one address tag. Too-small blocks don't exploit spatial locality well and have larger tag overhead. Too-large blocks mean too few total blocks (less exploitation of temporal locality) and waste cache space and bandwidth/energy if spatial locality is not high; hit rate versus block size is therefore a curve with a peak. We will see more examples later.

Associativity
How many blocks can map to the same index (or set)? Larger associativity: lower miss rate and less variation among programs, but diminishing returns and higher hit latency. Smaller associativity: lower cost and lower hit latency, which is especially important for L1 caches. Is power-of-2 associativity required? (No, as the next slide's 3-way example shows.)

Higher Associativity
A 3-way cache is possible: with 3 ways there are fewer sets, so the 8-bit address splits into a 4-bit tag, a 1-bit index, and a 3-bit offset, with one tag comparator per way feeding the output MUX.

Classification of Cache Misses
Compulsory miss: the first reference to an address (block) always results in a miss; subsequent references should hit unless the cache block is displaced for the reasons below.
Capacity miss: the cache is too small to hold everything needed; defined as the misses that would occur even in a fully-associative cache (with optimal replacement) of the same capacity.
Conflict miss: defined as any miss that is neither a compulsory nor a capacity miss.

How to Reduce Each Miss Type
Compulsory: caching cannot help; prefetching can.
Conflict: more associativity, or other ways to get the effect of more associativity without making the cache more associative: a victim cache, hashing, software hints?
Capacity: utilize cache space better (keep blocks that will be referenced); software management: divide the working set so that each phase fits in the cache.

Cache Performance with Code Examples: Matrix Sum

```c
int sum1(int matrix[4][8]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 8; ++j) {
            sum += matrix[i][j];
        }
    }
    return sum;
}
```

Access pattern: matrix[0][0], [0][1], [0][2], ..., [0][7], [1][0], ...

Exploiting Spatial Locality
Assume 8-byte cache blocks, 4 blocks, LRU replacement, and 4-byte integers, so each block holds two consecutive array elements. The accesses then alternate: [0][0] → miss, [0][1] → hit, [0][2] → miss, [0][3] → hit, [0][4] → miss, [0][5] → hit, [0][6] → miss, [0][7] → hit. At that point the cache holds blocks [0][0]-[0][1], [0][2]-[0][3], [0][4]-[0][5], [0][6]-[0][7]; access [1][0] misses and replaces [0][0]-[0][1] with [1][0]-[1][1], then [1][1] hits, and so on: a 50% hit rate.

Block size and spatial locality: larger blocks exploit spatial locality, but larger blocks also mean fewer blocks for the same total size, which is less good at exploiting temporal locality.

Alternate Matrix Sum

```c
int sum2(int matrix[4][8]) {
    int sum = 0;
    // swapped loop order
    for (int j = 0; j < 8; ++j) {
        for (int i = 0; i < 4; ++i) {
            sum += matrix[i][j];
        }
    }
    return sum;
}
```

Access pattern: matrix[0][0], [1][0], [2][0], [3][0], [0][1], [1][1], [2][1], [3][1], ...

Bad at Exploiting Spatial Locality
With 8-byte cache blocks and 4-byte integers, the column-order traversal touches a different block on every access within a column: [0][0] → miss, [1][0] → miss, [2][0] → miss, [3][0] → miss. The cache now holds [0][0]-[0][1], [1][0]-[1][1], [2][0]-[2][1], [3][0]-[3][1], so column 1 hits: [0][1] → hit, [1][1] → hit, [2][1] → hit, [3][1] → hit. Column 2 then misses again ([0][2] → miss replaces [0][0]-[0][1], [1][2] → miss replaces [1][0]-[1][1], ...), and the pattern repeats.

A Note on Matrix Storage
An N × N matrix can be represented as a 2D array, but a flat 1D array makes dynamic sizes easier:

```c
float A_2d_array[N][N];
float *A_flat = malloc(N * N * sizeof(float));
```

Because C stores arrays row-major, A_flat[i * N + j] is the same element as A_2d_array[i][j].

Matrix Multiply
Compute B_ij = Σ_k A_ik × A_kj. Version 1: the inner loop is k, the middle loop is j (i-j-k order):

```c
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i * N + j] += A[i * N + k] * A[k * N + j];
```

(Slide walkthrough: computing B_00 = Σ_k A_0k × A_k0 = (A_00 × A_00) + (A_01 × A_10) + (A_02 × A_20) + (A_03 × A_30). Each operand pair is highlighted in turn: the A_0k operands walk across row 0 of A, while the A_k0 operands walk down column 0.)

(The row operands A_ik are consecutive in memory, so A_ik has spatial locality; the column operands A_kj are N elements apart, so A_kj does not. The walkthrough then moves to B_01, B_02, ...: each B_ij is accumulated across the whole inner loop, so B has temporal locality.)

Conclusion (i-j-k order): A_ik has spatial locality; A_kj does not; B has temporal locality.

Version 2: the outer loop is k, the middle loop is i (k-i-j order):

```c
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i * N + j] += A[i * N + k] * A[k * N + j];
```

Access pattern for k = 0, i = 0:
B[0][0] += A[0][0] × A[0][0]
B[0][1] += A[0][0] × A[0][1]
B[0][2] += A[0][0] × A[0][2]
B[0][3] += A[0][0] × A[0][3]

Access pattern for k = 0, i = 1:
B[1][0] += A[1][0] × A[0][0]
B[1][1] += A[1][0] × A[0][1]
B[1][2] += A[1][0] × A[0][2]
B[1][3] += A[1][0] × A[0][3]

In the inner loop, B_ij and A_kj are both traversed along a row, so B and A_kj have spatial locality; A_ik is a single value reused across the whole inner loop, so A_ik has temporal locality.

Summary: in k-i-j order, B and A_kj have spatial locality and A_ik has temporal locality; in i-j-k order, only A_ik has spatial locality and only B has temporal locality.

Which Order Is Better?
The k-i-j order performs much better: its inner loop streams through memory with unit stride, so after the first access to each block the remaining accesses hit.