Program Transformations for the Memory Hierarchy


1 Program Transformations for the Memory Hierarchy
Locality Analysis and Reuse
Copyright 2014, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of these materials for their personal use.
Acknowledgement: Some of the material in this lecture is based on class notes from the CSCI595 Spring 2004 class at USC, graciously provided by Dr. Jacqueline Chame and Dr. Mary Hall.

2 Processor-Memory Bandwidth Gap
Over the last 2 decades the gap increased sharply.
It is now not as severe, but multi-core exacerbates system-level issues.

3 Problems with Modern Architectures
Uni-Processors
  Parallelism: extract and manage instruction-level parallelism (non-numerical and numerical applications)
  Memory hierarchy: improve locality and hide latency
Multi-Processors
  Parallelism: detection of loop-level parallelism
  Memory hierarchy:
    Minimize synchronization frequency
    Minimize communication (improve locality)
    Hide latency

4 Processor-Memory Bandwidth Gap
[Figure: memory hierarchy from functional units and registers through Cache I, Cache II, local memory and remote memory, annotated with relative latencies in processor cycles.]
Principle of locality: reuse data that has been recently used.
Trade-off: increasing capacity means higher latency.

5 Hiding Memory Latency: Locality
Memory local to the processor is fast but of limited capacity.
It needs to map a subset of main memory.
Mapping choices are restricted to allow easy look-up.

6 Analysis for Locality

7 Blocking for Uni- and Multi-Processors
[Figure: data accessed by a matrix product (= x).]

8 Performance of Blocking

9 Example: Givens QR Decomposition

10 Hiding Memory Latency: Prefetching
Idea: anticipate memory accesses and prefetch so that data is available when needed.
Trade-off: the prefetch must be issued at the right time, neither too late nor too early.
Control flow is an issue.

11 Prefetching Example
Suppose prefetch instructions fetch 2 words at a time.
How effective is prefetching?

12 Effectiveness of Prefetching
Memory stalls reduced by 50% to 90%.
Instruction and memory overheads are typically low.

13 Data Reuse in Caches
Data reuse: data is used multiple times.
Data locality: data remains in the cache between uses.
  Fewer cache misses
  Less memory traffic
Optimization goal: exploit reuse.
Loop transformations: Reuse -> Locality

14 Temporal Reuse
Same data accessed in distinct loop iterations.
  DO I = 1, N
    DO J = 1, N
      C(J,I) = A(J) + B(I,J)
    ENDDO
  ENDDO
A(J) has self-temporal reuse in loop I.

15 Spatial Reuse
Same cache line accessed in distinct loop iterations.
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(J,I) + B(I,J)
    ENDDO
  ENDDO
A(J,I) has self-spatial reuse in loop J.
B(I,J) has self-spatial reuse in loop I.
(FORTRAN arrays are stored in column-major order.)

16 Group Reuse
Same data or cache line accessed by distinct references.
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
    ENDDO
  ENDDO
A(J,I), A(J+1,I), A(J-1,I): group reuse in J.

17 Group Reuse
Group-temporal reuse: same data location.
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
    ENDDO
  ENDDO
Group-spatial reuse: same cache line.
  DO I = 1, N
    DO J = 1, N
      A(1,J) = A(0,J)
    ENDDO
  ENDDO

18 Group Reuse
Group-temporal, but no reuse of other data on the same cache line (cache line size = 4):
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(J+8,I) + A(J-8,I)
    ENDDO
  ENDDO
Group-spatial, but not group-temporal:
  DO I = 1, N
    DO J = 1, N
      A(J,I) = A(2*J,I) + A(2*J+1,I)
    ENDDO
  ENDDO
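The two cases on this slide can be checked by brute force. The sketch below is only an illustration: the function name `classify_group_reuse` is hypothetical, it looks at a single array column, and it assumes that element index e lives in cache line e // line_size.

```python
def classify_group_reuse(ref_a, ref_b, n, line_size):
    """Brute-force group-reuse check for two references to one array column.

    ref_a and ref_b map the loop index j (1..n) to an element index.
    Assumption: element index e lives in cache line e // line_size.
    Returns (temporal, spatial_same_iteration).
    """
    iters = range(1, n + 1)
    # Group-temporal: some element is touched by both references.
    temporal = bool({ref_a(j) for j in iters} & {ref_b(j) for j in iters})
    # Group-spatial within one iteration: both references hit the same line.
    spatial = any(ref_a(j) // line_size == ref_b(j) // line_size for j in iters)
    return temporal, spatial


if __name__ == "__main__":
    # A(J+8,I) and A(J-8,I) with a line of 4 elements
    print(classify_group_reuse(lambda j: j + 8, lambda j: j - 8, 64, 4))
    # A(2*J,I) and A(2*J+1,I)
    print(classify_group_reuse(lambda j: 2 * j, lambda j: 2 * j + 1, 64, 4))
```

The first pair shares elements but is always 16 elements apart within an iteration, so it never shares a line at the same time; the second pair shares a line in every iteration but never an element.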

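19 Reuse and Cache Misses
  DO I = 1, N
    DO J = 1, N
      ... = A(J)
    ENDDO
  ENDDO
A(J): self-spatial reuse in J, self-temporal reuse in I.
[Figure: two I-J iteration-space diagrams marking the cache misses, one for N < cache line size and one for N > cache line size.]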

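20 Cache Misses
Nest 1 (J innermost):
  DO I = 1, N
    DO J = 1, M
      A(I) = A(I) + B(J,I)
    ENDDO
  ENDDO
Nest 2 (I innermost):
  DO J = 1, M
    DO I = 1, N
      A(I) = A(I) + B(J,I)
    ENDDO
  ENDDO
Misses per reference at each loop level:
  Nest 1      loop J    loop I
  A(I)        1         N
  B(J,I)      M/L       N*M/L

  Nest 2      loop I    loop J
  A(I)        N/L       M*N/L
  B(J,I)      N         M*N
L = cache line size; N, M > cache size; LIS = innermost loop.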

21 How to Select a Loop Order to Exploit Reuse?
Identify and quantify reuse.
Allen & Kennedy:
  Compute a cost function for each loop (innermost memory cost)
Wolf & Lam:
  Locality analysis identifies reuse vectors
  Capacity misses are estimated from the reuse vectors and reuse amount
Order loops according to reuse amount.

22 A&K: Innermost Memory Cost
Innermost memory cost C_M(L_i): assume L_i is the innermost loop.
  l_i = loop variable, N = number of iterations of L_i
For each array reference r in the loop nest:
  r does not depend on l_i: cost(r) = 1
  l_i strides over a non-contiguous dimension of r: cost(r) = N
  l_i strides over a contiguous dimension of r with step s < L: cost(r) = N*s/L
C_M(L_i) = sum of cost(r) over the references, multiplied by the trip counts of the other loops

23 Example
J innermost:
  DO I = 1, N
    DO J = 1, M
      A(I) = A(I) + B(I,J)
    ENDDO
  ENDDO
I innermost:
  DO J = 1, M
    DO I = 1, N
      A(I) = A(I) + B(I,J)
    ENDDO
  ENDDO
J innermost: cost_J(A) = 1, cost_J(B) = M
I innermost: cost_I(A) = N/L, cost_I(B) = N/L
Multiplying by the trip count of the outer loop:
  C_M(J) = N * (1 + M)
  C_M(I) = M * (N/L + N/L) = 2NM/L
Select I as innermost if N(1+M) > 2NM/L.

24 Selecting a Loop Order
1. Compute the innermost memory cost for each loop in the nest.
2. Compute a desired loop order, from innermost to outermost, in order of increasing innermost memory cost.
3. Interchange loops to achieve a legal loop order that is closest to the desired order.

25 Example: Matrix-Matrix Multiply
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
      ENDDO
    ENDDO
  ENDDO
C_M(I) = 2N^3/L + N^2
C_M(J) = 2N^3 + N^2
C_M(K) = N^3 + N^3/L + N^2
Ordering by innermost loop cost, outermost to innermost: (J, K, I)
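The three costs above can be reproduced with a small sketch of the innermost-memory-cost rules. This is an illustration under simplifying assumptions, not Allen and Kennedy's implementation: every subscript is a single loop variable with stride 1, the first array dimension is the contiguous one, and each distinct reference is counted once. All function names are hypothetical.

```python
from fractions import Fraction


def reference_cost(subscripts, loop, trip, line_size):
    """Cost of one reference when `loop` is innermost.

    subscripts: loop-variable name per array dimension; the first dimension is
    the contiguous one (Fortran column-major), and every stride is 1.
    """
    if loop not in subscripts:
        return Fraction(1)                  # invariant in the innermost loop
    if subscripts[0] == loop and loop not in subscripts[1:]:
        return Fraction(trip, line_size)    # contiguous, s = 1: N*s/L
    return Fraction(trip)                   # non-contiguous dimension: N


def innermost_memory_cost(refs, loop, trips, line_size):
    """C_M(loop): per-reference costs summed, times the other loops' trip counts."""
    inner = sum(reference_cost(r, loop, trips[loop], line_size) for r in refs)
    outer = 1
    for name, trip in trips.items():
        if name != loop:
            outer *= trip
    return inner * outer


def desired_order(refs, trips, line_size):
    """Loops from innermost to outermost, by increasing innermost memory cost."""
    return sorted(trips, key=lambda l: innermost_memory_cost(refs, l, trips, line_size))


# Matrix multiply: C(I,J), A(I,K), B(K,J), each counted once.
MATMUL_REFS = [("I", "J"), ("I", "K"), ("K", "J")]

if __name__ == "__main__":
    trips = {"I": 64, "J": 64, "K": 64}
    for loop in trips:
        print(loop, innermost_memory_cost(MATMUL_REFS, loop, trips, 8))
    print(desired_order(MATMUL_REFS, trips, 8))
```

Exact fractions are used so that the results can be compared term by term with the closed forms on the slide.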

26 Groups of References (*)
(*) also known as uniformly generated references
When computing innermost memory costs, some references should be grouped:
  References that access the same data in distinct iterations of a loop I
    Example: A(J,I), A(J,I+1), A(J,I-1)
  References that access the same cache line in the same iteration of I
    Example: A(I,J), A(I+1,J), ..., A(I+L-1,J)

27 Example
  DO I = 1, T
    DO J = 1, N
      DO K = 1, N
        A(J,K) = f(A(J,K) + A(J,K+1) + A(J,K-1) + A(J+1,K) + A(J-1,K))
      ENDDO
    ENDDO
  ENDDO
K innermost:
  G1 = {A(J,K+1), A(J,K), A(J,K-1)} => ~N misses
  G2 = {A(J+1,K), A(J,K), A(J-1,K)} => ~N misses
J innermost:
  G1 = {A(J+1,K), A(J,K), A(J-1,K)} => N/L misses
  G2 = {A(J,K+1)} => N/L misses
  G3 = {A(J,K-1)} => N/L misses

28 Loop Tiling (Blocking)
Tiling reorders loop iterations to bring iterations that access the same data closer in time.
Choose tile sizes so that the reused data can be stored in the cache.
[Figure: I-J iteration space, untiled and tiled.]

29 Example: Matrix-Matrix Multiply
  DO J = 1, N
    DO K = 1, N
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      ENDDO
    ENDDO
  ENDDO
[Figure: regions of C, A and B accessed, with axes I, J and K.]

30 Reuse
  DO J = 1, N
    DO K = 1, N
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      ENDDO
    ENDDO
  ENDDO
              reuse type                        reuse amount
  reference   I          K          J           I    K    J
  C(I,J)      spatial    temporal   none        L    N    1
  A(I,K)      spatial    none       temporal    L    1    N
  B(K,J)      temporal   spatial    none        N    L    1

31 Loop Tiling: Strip-Mine and Interchange
Strip-mine loop I (I traverses strips, II traverses the iterations in a strip):
  DO J = 1, N
    DO K = 1, N
      DO I = 1, N by T
        DO II = I, MIN(I+T-1,N)
          C(II,J) = C(II,J) + A(II,K)*B(K,J)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
Interchange K and I (exploit reuse of C(II,J) in loop K):
  DO J = 1, N
    DO I = 1, N by T
      DO K = 1, N
        DO II = I, MIN(I+T-1,N)
          C(II,J) = C(II,J) + A(II,K)*B(K,J)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
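As a sanity check on the two steps, the sketch below writes the original J-K-I nest, the strip-mined nest and the interchanged nest as plain Python loops and compares their results. It is a minimal illustration with 0-based indices and nested lists; the function names are hypothetical, and it demonstrates equivalence of results only, not cache behavior.

```python
def matmul_jki(a, b, n):
    """Original nest: J outermost, K in the middle, I innermost."""
    c = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_strip_mined(a, b, n, t):
    """Loop I strip-mined with strip size t; iteration order is unchanged."""
    c = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(0, n, t):                 # traverses strips
                for ii in range(i, min(i + t, n)):   # iterations within a strip
                    c[ii][j] += a[ii][k] * b[k][j]
    return c


def matmul_tiled_i(a, b, n, t):
    """Strip loop I interchanged with K: C(II,J) is revisited across K."""
    c = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(0, n, t):
            for k in range(n):
                for ii in range(i, min(i + t, n)):
                    c[ii][j] += a[ii][k] * b[k][j]
    return c


if __name__ == "__main__":
    n = 5
    a = [[i + 2 * k for k in range(n)] for i in range(n)]
    b = [[3 * k - j for j in range(n)] for k in range(n)]
    print(matmul_jki(a, b, n) == matmul_tiled_i(a, b, n, 2))
```

Integer inputs are used so that the comparison is exact; with floating-point data the reordered sums could differ in the last bits.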

32 Tiling Inner Loops I and K
  DO K = 1, N by T_K
    DO I = 1, N by T_I
      DO J = 1, N
        DO KK = K, MIN(K+T_K-1,N)
          DO II = I, MIN(I+T_I-1,N)
            C(II,J) = C(II,J) + A(II,KK)*B(KK,J)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
[Figure: tile extents T_K and T_I in C, A and B.]

33 Cache Misses
(assuming that the data referenced in the tiled loops fits in the cache)
  DO J = 1, N by T_J
    DO K = 1, N by T_K
      DO I = 1, N by T_I
        DO JJ = J, MIN(J+T_J-1,N)
          DO KK = K, MIN(K+T_K-1,N)
            DO II = I, MIN(I+T_I-1,N)
              C(II,JJ) = C(II,JJ) + A(II,KK)*B(KK,JJ)
            ENDDO
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
Misses per reference at each loop level (LIS = the three tiled loops II, KK, JJ):
  reference   II      KK          JJ          I                   K                 J
  C(II,JJ)    T_I/L   T_I/L       T_J*T_I/L   N*T_J/L             N^2*T_J/(T_K*L)   N^3/(T_K*L)
  A(II,KK)    T_I/L   T_K*T_I/L   T_I*T_K/L   N*T_K/L             N^2/L             N^3/(T_J*L)
  B(KK,JJ)    1       T_K/L       T_J*T_K/L   N*T_J*T_K/(T_I*L)   N^2*T_J/(T_I*L)   N^3/(T_I*L)
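The last column of the table gives the total misses per reference, which makes it easy to compare tile shapes numerically. The sketch below simply evaluates those three expressions. The helper `tile_footprint` is a hypothetical addition: it counts the elements of one C, one A and one B block as a rough way to judge the slide's assumption that the tiled data fits in the cache.

```python
from fractions import Fraction


def tiled_matmul_misses(n, t_i, t_j, t_k, line_size):
    """Total misses per reference for the fully tiled nest (last table column)."""
    n3 = Fraction(n) ** 3
    return {
        "C": n3 / (t_k * line_size),   # N^3 / (T_K * L)
        "A": n3 / (t_j * line_size),   # N^3 / (T_J * L)
        "B": n3 / (t_i * line_size),   # N^3 / (T_I * L)
    }


def tile_footprint(t_i, t_j, t_k):
    """Elements touched inside one tile: a C, an A and a B block (hypothetical helper)."""
    return t_i * t_j + t_i * t_k + t_k * t_j


if __name__ == "__main__":
    misses = tiled_matmul_misses(512, 32, 32, 32, 8)
    print(misses, sum(misses.values()), tile_footprint(32, 32, 32))
```

Under this model each reference's misses shrink in proportion to one tile size, so enlarging tiles helps only as long as the footprint still fits in the cache.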

34 Legality of Tiling
Tiling = strip-mine and interchange.
  Strip-mining does not reorder iterations.
  The interchange must be legal:
    the dependence components d_0, d_1, ..., d_{k-1} are each either = or <
    OR the strip size is less than the threshold of the dependence

35 Profitability of Tiling
Tiling is profitable if there is enough reuse to offset the overheads:
  an extra controlling loop for each tiled loop
  extra misses in the controlling loops
  extra misses due to alignment to cache lines
Need to identify and quantify reuse.

36 Locality Analysis
Identify reuse in the iteration space:
  vector spaces represent the directions with reuse
Quantify reuse in a given iteration subspace:
  Localized Iteration Space (LIS): the set of iterations where reuse can be exploited
Transform the loop nest to include the directions with reuse in the LIS.

37 Types of Reuse
Self reuse: same array reference
  same data: self-temporal reuse
  same cache line: self-spatial reuse
Group reuse: several uniformly generated array references
  same data: group-temporal reuse
  same cache line: group-spatial reuse

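38 Identifying reuse: mapping iterations to data
  DO I = 1, N
    DO J = 1, N
      A(I) = B(J,I)
    ENDDO
  ENDDO
Each reference maps a point (I, J) of the iteration space to a point of the data space:
  A(I):   $\begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} I \\ J \end{pmatrix} = \begin{pmatrix} I \end{pmatrix}$
  B(J,I): $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I \\ J \end{pmatrix} = \begin{pmatrix} J \\ I \end{pmatrix}$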

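39 Mapping Iterations to Data
Array indexing function:
  $f(\vec{i}) : \mathbb{Z}^n \to \mathbb{Z}^d, \qquad f(\vec{i}) = H\vec{i} + \vec{c}$
Example: $A_1$ = A(J+2,I)
  $f_{A_1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I \\ J \end{pmatrix} + \begin{pmatrix} 2 \\ 0 \end{pmatrix}$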

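40 Matrix Multiply: Array Indexing Functions
With iteration vector $(I, J, K)^T$ and zero offsets:
  $f_C(I,J) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} I \\ J \\ K \end{pmatrix}$
  $f_A(I,K) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} I \\ J \\ K \end{pmatrix}$
  $f_B(K,J) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} I \\ J \\ K \end{pmatrix}$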

41 Identifying Self-Temporal Reuse
Iterations $\vec{i}_1$ and $\vec{i}_2$ reference the same data when
  $H\vec{i}_1 + \vec{c} = H\vec{i}_2 + \vec{c}$, or $H(\vec{i}_1 - \vec{i}_2) = \vec{0}$
Solution: reuse in direction $\vec{r}$ such that $H\vec{r} = \vec{0}$
Self-temporal reuse vector space: $R_{ST} = \ker H$
Reuse is exploited if $\vec{r}$ is included in the LIS.

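42 Identifying Temporal Reuse
Iterations that access the same A(I):
  $\begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} I_1 \\ J_1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} I_2 \\ J_2 \end{pmatrix} \;\Rightarrow\; I_1 = I_2$
  reuse vector space: span{(0, 1)}
Iterations that access the same B(J,I):
  $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I_1 \\ J_1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I_2 \\ J_2 \end{pmatrix} \;\Rightarrow\; I_1 = I_2,\ J_1 = J_2$
  no reuse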

43 Quantifying Temporal Reuse
If the number of iterations along each dimension of the reuse space is B, then each element is reused $B^{\dim(R_{ST})}$ times.
Example: A(I) is reused N^2 times in J, K.
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        A(I) = A(I) + B(J,K)
      ENDDO
    ENDDO
  ENDDO
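The count in this example can be confirmed by enumerating the iteration space. The sketch below is a brute-force illustration, not part of the analysis itself: the hypothetical function `accesses_per_element` maps every iteration through an index matrix H (offsets are ignored, since they do not change the counts) and tallies how many iterations reach each element.

```python
from collections import Counter
from itertools import product


def accesses_per_element(h, trips):
    """Count how many iterations map to each array element under f(i) = H i.

    h: index matrix as a list of rows; trips: trip count of each loop,
    in the same order as the columns of h. Loops run from 1 to the trip count.
    """
    counts = Counter()
    for point in product(*(range(1, t + 1) for t in trips)):
        element = tuple(sum(a * x for a, x in zip(row, point)) for row in h)
        counts[element] += 1
    return counts


if __name__ == "__main__":
    n = 6
    # Iteration vector (I, J, K): A(I) has H = [1 0 0], B(J,K) has H = [[0 1 0], [0 0 1]].
    print(set(accesses_per_element([[1, 0, 0]], (n, n, n)).values()))
    print(set(accesses_per_element([[0, 1, 0], [0, 0, 1]], (n, n, n)).values()))
```

With N = 6, every A(I) is reached by N^2 = 36 iterations (a two-dimensional reuse space) and every B(J,K) by N = 6 iterations (a one-dimensional reuse space).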

44 Temporal Locality
A reference has self-temporal locality if it has self-temporal reuse and $R_{ST}$ and the LIS have a non-null intersection.
Quantity of reuse utilized: the dimensionality of the intersection of $R_{ST}$ and the LIS.
Number of memory accesses per iteration: $1 / B^{\dim(R_{ST} \cap LIS)}$

45 Matrix-Matrix Multiply: RST
  DO J = 1, N
    DO K = 1, N
      DO I = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      ENDDO
    ENDDO
  ENDDO
With iteration vector (J, K, I):
  RST(C(I,J)) = span{(0, 1, 0)}
  RST(A(I,K)) = span{(1, 0, 0)}
  RST(B(K,J)) = span{(0, 0, 1)}
Localized iteration space: loop I.
It has a non-empty intersection only with the reuse vector space of B(K,J).
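Since the self-temporal reuse space is the kernel of the index matrix H, the three spans above can be reproduced with a small null-space routine. The sketch below is an illustration using exact rational Gauss-Jordan elimination from the standard library; the function name `null_space` and the row layout of each H (one row per array subscript, columns ordered J, K, I) are choices made for this example.

```python
from fractions import Fraction


def null_space(h):
    """Basis of ker H, computed by Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in h]
    ncols = len(rows[0])
    pivots = []          # pivot column of each reduced row, in row order
    r = 0
    for c in range(ncols):
        if r == len(rows):
            break
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue     # no pivot here: column c is a free direction
        rows[r], rows[p] = rows[p], rows[r]
        pivot = rows[r][c]
        rows[r] = [x / pivot for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in range(ncols):
        if free in pivots:
            continue
        v = [Fraction(0)] * ncols
        v[free] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            v[pc] = -rows[row_idx][free]
        basis.append(v)
    return basis


if __name__ == "__main__":
    # Columns are (J, K, I); one row per array subscript.
    print(null_space([[0, 0, 1], [1, 0, 0]]))   # C(I,J)
    print(null_space([[0, 0, 1], [0, 1, 0]]))   # A(I,K)
    print(null_space([[0, 1, 0], [1, 0, 0]]))   # B(K,J)
```

A reference whose kernel is empty, such as B(J,I) in a two-deep I-J nest, has no self-temporal reuse.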

46 Tiled Matrix Multiply
  DO J = 1, N by T_J
    DO K = 1, N by T_K
      DO I = 1, N by T_I
        DO JJ = J, MIN(J+T_J-1,N)
          DO KK = K, MIN(K+T_K-1,N)
            DO II = I, MIN(I+T_I-1,N)
              C(II,JJ) = C(II,JJ) + A(II,KK)*B(KK,JJ)
            ENDDO
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
Localized iteration space: all three loop dimensions.
It has a non-empty intersection with each of the reuse spaces.

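47 Identifying Self-Spatial Reuse
Iterations $\vec{i}_1$ and $\vec{i}_2$ reference the same cache line when
  $H_S\vec{i}_1 + \vec{c} = H_S\vec{i}_2 + \vec{c}$, or $H_S(\vec{i}_1 - \vec{i}_2) = \vec{0}$
where $H_S$ is $H$ with its first row replaced by zeros:
  $H_S = \begin{pmatrix} 0 & \cdots & 0 \\ h_{21} & \cdots & h_{2n} \\ \vdots & & \vdots \\ h_{d1} & \cdots & h_{dn} \end{pmatrix}$
Solution: directions $\vec{r}$ such that $H_S\vec{r} = \vec{0}$; all array indices except the first one (the position within a column) must be identical.
Self-spatial reuse space: $R_{SS} = \ker H_S$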

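48 Identifying Spatial Reuse
  DO I = 1, N
    DO J = 1, N
      A(I) = B(J,I)
    ENDDO
  ENDDO
Iterations that access the same cache line of B(J,I):
  $\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I_1 \\ J_1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} I_2 \\ J_2 \end{pmatrix} \;\Rightarrow\; I_1 = I_2$
  reuse vector space: span{(0, 1)}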

49 Quantifying Spatial Reuse
Spatial reuse only exists if the stride s is less than L.
Each cache line is reused L/s times, where s is the stride.
Example: the cache line holding B(J,I) is reused L times in loop J.
  DO I = 1, N
    DO J = 1, N
      A(I) = A(I) + B(J,I)
    ENDDO
  ENDDO
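The L/s figure can be checked by counting how many distinct cache lines a strided walk touches. The sketch below is a toy model: the function name `accesses_per_line` is hypothetical, the walk starts at a line-aligned element 0, and element e is assumed to live in line e // line_size.

```python
def accesses_per_line(n_accesses, stride, line_size):
    """Average number of accesses that land on each cache line touched.

    Elements 0, stride, 2*stride, ... are visited; element e is assumed to
    live in cache line e // line_size (line-aligned start).
    """
    lines = {(k * stride) // line_size for k in range(n_accesses)}
    return n_accesses / len(lines)


if __name__ == "__main__":
    for s in (1, 2, 4, 8, 16):
        print(s, accesses_per_line(64, s, 8))
```

For strides below the line size the average matches L/s; once the stride reaches the line size every access lands on a new line and the spatial reuse disappears.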

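50 Identifying Group-Temporal Reuse
References are uniformly generated: they have the same $H$.
$A(H\vec{i} + \vec{c}_1)$ and $A(H\vec{i} + \vec{c}_2)$ reference the same data when there are iterations $\vec{i}_1$, $\vec{i}_2$ with
  $H\vec{i}_1 + \vec{c}_1 = H\vec{i}_2 + \vec{c}_2$
that is, when there exists $\vec{r} = \vec{i}_2 - \vec{i}_1$ in the LIS such that
  $H\vec{r} = \vec{c}_1 - \vec{c}_2$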

51 Identifying Group-Temporal Reuse
Find a particular solution $\vec{r}_p$ such that $H\vec{r}_p = \vec{c}_1 - \vec{c}_2$
General solution: $R_{GT} = \ker H + \vec{r}_p$
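For small offsets, the solutions of H r = c1 - c2 can be listed by exhaustive search, which makes the structure "kernel plus a particular solution" visible. The sketch below is a brute-force illustration rather than a linear solver; the function name `group_temporal_vectors` and the search bound are assumptions of this example. It uses the references A(J,K+1) and A(J,K-1) of the earlier stencil nest, with iteration vector (I, J, K).

```python
from itertools import product


def group_temporal_vectors(h, c1, c2, bound):
    """Integer vectors r with entries in [-bound, bound] solving H r = c1 - c2."""
    n = len(h[0])
    target = [a - b for a, b in zip(c1, c2)]
    return [
        r
        for r in product(range(-bound, bound + 1), repeat=n)
        if [sum(x * y for x, y in zip(row, r)) for row in h] == target
    ]


if __name__ == "__main__":
    # A(J,K+1) versus A(J,K-1): same H, offsets c1 = (0, 1) and c2 = (0, -1).
    h = [[0, 1, 0], [0, 0, 1]]
    print(group_temporal_vectors(h, (0, 1), (0, -1), 2))
```

Every solution has the form (t, 0, 2): the particular solution (0, 0, 2) plus any multiple of the kernel direction (1, 0, 0).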

52 Back to Tiling
Tile the loops carrying reuse to move the reuse vectors into the LIS.
Tiling increases the dimensionality of the LIS => increases locality.

53 In Practice
Reuse vectors may be a combination of the elementary vectors of the iteration space.
  Example: A(I+J,K), $R_{ST}$ = span{(1, -1, 0)}
Use the smallest enclosing space spanned by the elementary vectors.
  Example: (1, -1, 0) is covered by the space spanned by (1, 0, 0) and (0, 1, 0)

54 Tiling for Real Caches
Tiling reduces capacity misses in fully-associative caches.
In real life:
  Caches are direct-mapped or have small associativity
  Blocking may introduce conflict misses: the data within a block is not contiguous in memory
  Conflict misses may offset the benefits

55 Tiling for Real Caches
Conflict misses:
  Vary with array and tile sizes
  Costly to predict, at compile time or at run time
Avoiding conflicts:
  Tile size selection
  Copying tiles at run time
  Array padding
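The effect of array padding can be illustrated with a toy direct-mapped cache model. The sketch below counts how many cache lines of one tile cannot coexist because they map to an already occupied set. Everything in it is an assumption made for illustration: the function name `tile_conflicts`, the column-major layout starting at address 0, and the example sizes (a cache of 128 lines of 8 elements, a 16 x 16 tile, leading dimensions 1024 and 1040).

```python
def tile_conflicts(leading_dim, tile, cache_lines, line_size):
    """Lines of a tile x tile block that collide in a direct-mapped cache.

    The array is column-major with the given leading dimension and starts at
    element 0; element e lives in line e // line_size, and a line maps to set
    line % cache_lines.
    """
    lines = {
        (col * leading_dim + row) // line_size
        for col in range(tile)
        for row in range(tile)
    }
    sets = {line % cache_lines for line in lines}
    return len(lines) - len(sets)


if __name__ == "__main__":
    print(tile_conflicts(1024, 16, 128, 8))   # power-of-two leading dimension
    print(tile_conflicts(1040, 16, 128, 8))   # padded by 16 elements
```

With a leading dimension of 1024 every tile column maps to the same two sets, so 30 of the 32 lines collide; padding the leading dimension to 1040 spreads the columns over distinct sets in this model.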

56 Other Levels of the Memory Hierarchy
Register level:
  Unroll-and-jam and scalar replacement
L2, L3 cache levels:
  Block for the L1 cache first
  Block the controlling loops for L2
  Hierarchical tiling

57 Blocking and Parallelization
Parallel machines:
  Conflicting goals if the parallel loop also carries reuse
  Works well if:
    coarse-grain parallelism (outer loops)
    inner loops carry reuse
  Data partitioning may cause false sharing:
    choose block sizes that are multiples of the cache line size
    array padding

58 Blocking and Parallelization
Architectures with superword parallelism:
  Instructions operate on superwords (objects consisting of a set of contiguous words)
  Spatial locality and superword parallelism are complementary optimizations
  Temporal reuse in superword registers is also complementary

59 Other Uses of Locality Analysis
Data and computation partitioning
Data prefetching:
  What references to prefetch? Potential misses identified by locality analysis
  When to prefetch? Reuse information is used to compute prefetch predicates

60 Summary
Data reuse concept
Spatial and temporal locality analysis
  Use of reuse vectors and the LIS to quantify reuse
Combination with loop transformations
  Convert reuse into locality and leverage caches


More information

Memory Hierarchy. Bojian Zheng CSCD70 Spring 2018

Memory Hierarchy. Bojian Zheng CSCD70 Spring 2018 Memory Hierarchy Bojian Zheng CSCD70 Spring 2018 bojian@cs.toronto.edu 1 Memory Hierarchy From programmer s point of view, memory has infinite capacity (i.e. can store infinite amount of data) has zero

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains: The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention

More information

A Preliminary Assessment of the ACRI 1 Fortran Compiler

A Preliminary Assessment of the ACRI 1 Fortran Compiler A Preliminary Assessment of the ACRI 1 Fortran Compiler Joan M. Parcerisa, Antonio González, Josep Llosa, Toni Jerez Computer Architecture Department Universitat Politècnica de Catalunya Report No UPC-DAC-94-24

More information

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Agenda Cache memory organization and operation Chapter 6 Performance impact of caches Cache Memories

Agenda Cache memory organization and operation Chapter 6 Performance impact of caches Cache Memories Agenda Chapter 6 Cache Memories Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal

More information

A Quantitative Algorithm for Data. IRISA, University of Rennes. Christine Eisenbeis INRIA. Abstract

A Quantitative Algorithm for Data. IRISA, University of Rennes. Christine Eisenbeis INRIA. Abstract A Quantitative Algorithm for Data Locality Optimization Francois Bodin, William Jalby, Daniel Windheiser IRISA, University of Rennes Rennes, FRANCE Christine Eisenbeis INRIA Rocquencourt, FRANCE Abstract

More information

Today Cache memory organization and operation Performance impact of caches

Today Cache memory organization and operation Performance impact of caches Cache Memories 1 Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal locality

More information

Optimizing MMM & ATLAS Library Generator

Optimizing MMM & ATLAS Library Generator Optimizing MMM & ATLAS Library Generator Recall: MMM miss ratios L1 Cache Miss Ratio for Intel Pentium III MMM with N = 1 1300 16 32/lock 4-way 8-byte elements IJ version (large cache) DO I = 1, N//row-major

More information

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time

Cache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +

More information

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995 Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Fall 1995 Review: Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course:

More information

Advanced Caching Techniques

Advanced Caching Techniques Advanced Caching Approaches to improving memory system performance eliminate memory accesses/operations decrease the number of misses decrease the miss penalty decrease the cache/memory access times hide

More information

Memory Systems and Performance Engineering

Memory Systems and Performance Engineering SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide

More information

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum

More information

PERFORMANCE OPTIMISATION

PERFORMANCE OPTIMISATION PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction

More information

high-speed-high-capacity memory

high-speed-high-capacity memory Sanjay Rajopadhye Colorado State University n Transparently provide the illusion of a high-speed-high-capacity memory n Built out of caches: small memory devices that exploit the principle of locality

More information

Memory Hierarchy. Announcement. Computer system model. Reference

Memory Hierarchy. Announcement. Computer system model. Reference Announcement Memory Hierarchy Computer Organization and Assembly Languages Yung-Yu Chuang 26//5 Grade for hw#4 is online Please DO submit homework if you haen t Please sign up a demo time on /6 or /7 at

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13 Administrative Optimizing Stencil Computations March 18, 2013 Midterm coming April 3? In class March 25, can bring one page of notes Review notes, readings and review lecture Prior exams are posted Design

More information

Lecture 7 - Memory Hierarchy-II

Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Simone Campanoni Loop transformations

Simone Campanoni Loop transformations Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010 CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,

More information

Principle of Polyhedral model for loop optimization. cschen 陳鍾樞

Principle of Polyhedral model for loop optimization. cschen 陳鍾樞 Principle of Polyhedral model for loop optimization cschen 陳鍾樞 Outline Abstract model Affine expression, Polygon space Polyhedron space, Affine Accesses Data reuse Data locality Tiling Space partition

More information

Coarse-Grained Parallelism

Coarse-Grained Parallelism Coarse-Grained Parallelism Variable Privatization, Loop Alignment, Loop Fusion, Loop interchange and skewing, Loop Strip-mining cs6363 1 Introduction Our previous loop transformations target vector and

More information

ARCHER Single Node Optimisation

ARCHER Single Node Optimisation ARCHER Single Node Optimisation Optimising for the Memory Hierarchy Slides contributed by Cray and EPCC Overview Motivation Types of memory structures Reducing memory accesses Utilizing Caches Write optimisations

More information

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines. Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast

More information

CSE P 501 Compilers. Loops Hal Perkins Spring UW CSE P 501 Spring 2018 U-1

CSE P 501 Compilers. Loops Hal Perkins Spring UW CSE P 501 Spring 2018 U-1 CSE P 501 Compilers Loops Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 U-1 Agenda Loop optimizations Dominators discovering loops Loop invariant calculations Loop transformations A quick look at some

More information

Memory Systems and Performance Engineering. Fall 2009

Memory Systems and Performance Engineering. Fall 2009 Memory Systems and Performance Engineering Fall 2009 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide illusion of fast larger memory

More information

Cache memories are small, fast SRAM based memories managed automatically in hardware.

Cache memories are small, fast SRAM based memories managed automatically in hardware. Cache Memories Cache memories are small, fast SRAM based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Memories. CPE480/CS480/EE480, Spring Hank Dietz.

Memories. CPE480/CS480/EE480, Spring Hank Dietz. Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What

More information

CS 2461: Computer Architecture 1

CS 2461: Computer Architecture 1 Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code

More information

Module 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space.

Module 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space. The Lecture Contains: Iteration Space Iteration Vector Normalized Iteration Vector Dependence Distance Direction Vector Loop Carried Dependence Relations Dependence Level Iteration Vector - Triangular

More information

Lecture 2: Single processor architecture and memory

Lecture 2: Single processor architecture and memory Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =

More information

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Memory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska

Memory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O Hallaron (CMU) Mohamed Zahran (NYU)

More information

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013

Loops. Lather, Rinse, Repeat. CS4410: Spring 2013 Loops or Lather, Rinse, Repeat CS4410: Spring 2013 Program Loops Reading: Appel Ch. 18 Loop = a computation repeatedly executed until a terminating condition is reached High-level loop constructs: While

More information

Automatic Tiling of Iterative Stencil Loops

Automatic Tiling of Iterative Stencil Loops Automatic Tiling of Iterative Stencil Loops Zhiyuan Li and Yonghong Song Purdue University Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation

More information

write-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF

write-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF write-through v. write-back option 1: write-through 1 write 10 to 0xABCD CPU Cache ABCD: FF RAM 11CD: 42 ABCD: FF 1 2 write-through v. write-back option 1: write-through write-through v. write-back option

More information

Memory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster,

Memory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster, Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Cache Memory Organization and Access Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O

More information