Program Transformations for the Memory Hierarchy
1 Program Transformations for the Memory Hierarchy: Locality Analysis and Reuse. Copyright 2014, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of these materials for their personal use. Acknowledgement: some of the material in this lecture is based on class notes from the CSCI 595 Spring 2004 class at USC, graciously provided by Dr. Jacqueline Chame and Dr. Mary Hall.
2 Processor-Memory Bandwidth Gap: increased sharply over the last two decades. It is now less severe, but multi-core exacerbates system-level issues.
3 Problems with Modern Architectures. Uni-processors: parallelism (extract and manage instruction-level parallelism, in both numerical and non-numerical applications); memory hierarchy (improve locality and hide latency). Multi-processors: parallelism (detection of loop-level parallelism); memory hierarchy (minimize synchronization frequency, minimize communication by improving locality, hide latency).
4 Processor-Memory Bandwidth Gap (figure: the memory hierarchy from functional units and registers through first- and second-level caches to local and remote memory, with relative latency growing at each level). Principle of Locality: reuse data that has been recently used. Trade-off: increasing capacity means higher latency.
5 Hiding Memory Latency: Locality. Memory local to the processor is fast but of limited capacity, so it needs to map a subset of main memory. Restrict mapping choices to allow easy look-up.
6 Analysis for Locality
7 Blocking for Uni- and Multi-Processors (figure: matrix multiply with and without blocking, comparing the total amount of data accessed in each case).
8 Performance of Blocking
9 Example: Givens QR Decomposition
10 Hiding Memory Latency: Prefetching. Idea: anticipate memory accesses and prefetch so data is available when needed. Trade-off: the prefetch must be issued in advance at the right time; do not fetch too late nor too early. Control flow is an issue.
11 Prefetching Example. Suppose prefetch instructions fetch 2 words at a time. How effective is prefetching?
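As a concrete illustration (not from the slides), here is a minimal software-prefetching sketch in C, assuming the GCC/Clang __builtin_prefetch intrinsic. PF_DIST is the prefetch distance, a tuning knob: far enough ahead to cover memory latency, but not so far that lines are evicted before use.

```c
#include <stddef.h>

/* Prefetch distance in elements; an illustrative value. */
#define PF_DIST 16

/* Sums an array, prefetching PF_DIST elements ahead of the current
   access. The result is identical to a plain loop; only the timing
   of the memory traffic changes. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)  /* guard the tail of the array */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}
```

The guard avoids prefetching past the array; in practice compilers peel the last PF_DIST iterations instead to remove the branch.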
12 Effectiveness of Prefetching. Memory stalls are reduced by 50% to 90%; the instruction and memory overheads are typically low.
13 Data Reuse in Caches. Data reuse: data used multiple times. Data locality: data remains in cache between uses, giving fewer cache misses and less memory traffic. Optimization goal: exploit reuse. Loop transformations turn reuse into locality.
14 Temporal Reuse: Same Data Accessed in Distinct Loop Iterations
DO I = 1, N
  DO J = 1, N
    C(J,I) = A(J) + B(I,J)
  ENDDO
ENDDO
A(J) has self-temporal reuse in loop I.
15 Spatial Reuse: Same Cache Line Accessed in Distinct Loop Iterations
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + B(I,J)
  ENDDO
ENDDO
A(J,I) has self-spatial reuse in loop J; B(I,J) has self-spatial reuse in loop I (FORTRAN arrays are stored in column-major order).
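The column-major point can be illustrated in C, which is row-major (the opposite of Fortran): there it is the last subscript that must vary fastest for self-spatial reuse. A small sketch with illustrative names; both functions compute the same sum, and only their cache behavior differs.

```c
#define N 64

/* Row-wise traversal: j varies fastest, giving unit-stride,
   cache-line-friendly accesses in row-major C. */
double sum_rowwise(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal: i varies fastest, so consecutive accesses
   are N elements apart and touch a new cache line almost every time. */
double sum_columnwise(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

With integer-valued entries the two sums are bit-identical, so correctness is unaffected by traversal order; only the miss counts change.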
16 Group Reuse: Same Data/Cache Line Accessed by Distinct References
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
  ENDDO
ENDDO
A(J,I), A(J+1,I), A(J-1,I): group reuse in J.
17 Group Reuse. Group-temporal reuse: same data location.
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
  ENDDO
ENDDO
Group-spatial reuse: same cache line.
DO I = 1, N
  DO J = 1, N
    A(1,J) = A(0,J)
  ENDDO
ENDDO
18 Group Reuse. Group-temporal but no reuse of other data on the same cache line (cache line size = 4):
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J+8,I) + A(J-8,I)
  ENDDO
ENDDO
Group-spatial but not group-temporal:
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(2*J,I) + A(2*J+1,I)
  ENDDO
ENDDO
19 Reuse and Cache Misses
DO I = 1, N
  DO J = 1, N
    ... = A(J)
  ENDDO
ENDDO
A(J) is self-spatial in J and self-temporal in I. (figure: iteration-space diagrams marking the cache misses for the cases where the data touched is smaller and larger than the cache)
20 Cache Misses
DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J,I)

  reference   misses per J loop   total misses
  A(I)        1                   N
  B(J,I)      M/L                 N*M/L

DO J = 1, M
  DO I = 1, N
    A(I) = A(I) + B(J,I)

  reference   misses per I loop   total misses
  A(I)        N/L                 M*N/L
  B(J,I)      N                   M*N

(L = cache line size; N, M > cache size; LIS = innermost loop)
21 How to Select a Loop Order to Exploit Reuse? Identify and quantify reuse. Allen & Kennedy: compute a cost function for each loop (the innermost memory cost). Wolf & Lam: locality analysis identifies reuse vectors; capacity misses are estimated from the reuse vectors and reuse amounts; order loops according to the amount of reuse.
22 A&K: Innermost Memory Cost. The innermost memory cost C_M(L_i) assumes L_i is the innermost loop, with loop variable l_i and N iterations. For each array reference r in the loop nest: if r does not depend on l_i, cost(r) = 1; if l_i strides over a non-contiguous dimension of r, cost(r) = N; if l_i strides over a contiguous dimension with step s < L, cost(r) = N*s/L. C_M(L_i) is the sum of cost(r) over all references.
23 Example
DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(I,J)
J innermost: cost_J(A) = 1, cost_J(B) = M, so C_M(J) = N * (1 + M).
I innermost: cost_I(A) = N/L, cost_I(B) = N/L, so C_M(I) = M * (N/L + N/L) = 2NM/L.
Select I as innermost if N(1+M) > 2NM/L.
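The comparison above can be sketched as code. This is an illustrative helper, not part of the slides: n and m are the trip counts, l the cache line size in elements, and the function returns which loop the cost model would place innermost.

```c
/* Compares the two innermost-memory-cost formulas worked out above
   and returns 'I' or 'J' for the cheaper innermost loop. */
char choose_innermost(double n, double m, double l) {
    double cm_j = n * (1.0 + m);    /* C_M(J): A costs 1, B costs M, summed over I */
    double cm_i = 2.0 * n * m / l;  /* C_M(I): A and B each cost N/L, summed over J */
    return (cm_i < cm_j) ? 'I' : 'J';
}
```

Note how the choice flips with the line size: long lines make the unit-stride I loop profitable, while l = 1 (no spatial reuse) favors keeping J innermost.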
24 Selecting a Loop Order. Compute the innermost memory cost for each loop in the nest; compute a desired loop order, from innermost to outermost, in order of increasing innermost memory cost; interchange loops to achieve a legal loop order that is closest to the desired order.
25 Example: Matrix-Matrix Multiply
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
C_M(I) = 2N^3/L + N^2
C_M(J) = 2N^3 + N^2
C_M(K) = N^3 + N^3/L + N^2
Ordering by innermost memory cost (outermost to innermost): (J, K, I), i.e. the cheapest loop, I, goes innermost.
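The three cost formulas can be written down directly (names are illustrative): for any n > 1 and line size l > 1 they order as C_M(I) < C_M(K) < C_M(J), which yields the (J, K, I) nest.

```c
/* The slide's three innermost-memory-cost formulas for NxN
   matrix multiply; l is the cache line size in elements. */
double cm_i(double n, double l) { return 2.0 * n * n * n / l + n * n; }
double cm_k(double n, double l) { return n * n * n + n * n * n / l + n * n; }
double cm_j(double n, double l) { (void)l; return 2.0 * n * n * n + n * n; }
```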
26 Groups of References (also known as uniformly generated references). When computing innermost memory costs, some references should be grouped: references that access the same data in distinct iterations of a loop I, e.g. A(J,I), A(J,I+1), A(J,I-1); and references that access the same cache line in the same iteration of I, e.g. A(I,J), A(I+1,J), ..., A(I+L-1,J).
27 Example
DO I = 1, T
  DO J = 1, N
    DO K = 1, N
      A(J,K) = f(A(J,K) + A(J,K+1) + A(J,K-1) + A(J+1,K) + A(J-1,K))
K innermost: G1 = {A(J,K+1), A(J,K), A(J,K-1)} => ~N misses; G2 = {A(J+1,K), A(J,K), A(J-1,K)} => ~N misses.
J innermost: G1 = {A(J+1,K), A(J,K), A(J-1,K)} => N/L misses; G2 = {A(J,K+1)} => N/L misses; G3 = {A(J,K-1)} => N/L misses.
28 Loop Tiling (Blocking). Tiling reorders loop iterations to bring iterations that access the same data closer in time. Choose tile sizes so that the reused data can be stored in cache. (figure: untiled vs. tiled traversal of the I-J iteration space)
29 Example: Matrix-Matrix Multiply
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
(figure: the regions of C, A, and B touched by the I, K, and J loops)
30 Reuse
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)

            reuse type in (I, K, J)    reuse amount in (I, K, J)
  C(I,J)    spatial, temporal, none    L, N, 1
  A(I,K)    spatial, none, temporal    L, 1, N
  B(K,J)    temporal, spatial, none    N, L, 1
31 Loop Tiling: Strip-Mine and Interchange
Strip-mine loop I (I traverses strips; II traverses the iterations within a strip):
DO J = 1, N
  DO K = 1, N
    DO I = 1, N BY T
      DO II = I, MIN(I+T-1,N)
        C(II,J) = C(II,J) + A(II,K) * B(K,J)
Interchange K and I (exploit the reuse of C(II,J) in loop K):
DO J = 1, N
  DO I = 1, N BY T
    DO K = 1, N
      DO II = I, MIN(I+T-1,N)
        C(II,J) = C(II,J) + A(II,K) * B(K,J)
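Strip-mining by itself visits exactly the original iterations, in the original order, which is what makes the later interchange the only legality concern. A tiny check of that property, with illustrative names (t need not divide n):

```c
/* Sums 0..n-1 by strips of size t: the (i, ii) pair enumerates
   the iterations of the original loop in order, so the result
   equals the untiled sum n*(n-1)/2 for any strip size. */
int strip_mined_sum(int n, int t) {
    int s = 0;
    for (int i = 0; i < n; i += t)                    /* strip loop */
        for (int ii = i; ii < i + t && ii < n; ii++)  /* within a strip */
            s += ii;
    return s;
}
```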
32 Tiling Inner Loops I and K
DO K = 1, N BY T_K
  DO I = 1, N BY T_I
    DO J = 1, N
      DO KK = K, MIN(K+T_K-1,N)
        DO II = I, MIN(I+T_I-1,N)
          C(II,J) = C(II,J) + A(II,KK) * B(KK,J)
(figure: the T_K x T_I tiles of C, A, and B)
33 Cache Misses (assuming the data referenced in the tiled loops fits in cache)
DO J = 1, N BY T_J
  DO K = 1, N BY T_K
    DO I = 1, N BY T_I
      DO JJ = J, MIN(J+T_J-1,N)
        DO KK = K, MIN(K+T_K-1,N)
          DO II = I, MIN(I+T_I-1,N)
            C(II,JJ) = C(II,JJ) + A(II,KK) * B(KK,JJ)

Misses as the LIS grows outward, one loop at a time:

            II       KK          JJ           I                  K                  J
  C(II,JJ)  T_I/L    T_I/L       T_J*T_I/L    N*T_J/L            N^2*T_J/(T_K*L)    N^3/(T_K*L)
  A(II,KK)  T_I/L    T_K*T_I/L   T_I*T_K/L    N*T_K/L            N^2/L              N^3/(T_J*L)
  B(KK,JJ)  1        T_K/L       T_J*T_K/L    N*T_J*T_K/(T_I*L)  N^2*T_J/(T_I*L)    N^3/(T_I*L)
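For reference, a runnable C transcription of this fully tiled multiply (a sketch: C is row-major, the opposite of Fortran, so the unit-stride accesses here run along the last subscript; the sizes are illustrative and N is chosen divisible by T so the MIN clamps are not needed):

```c
#include <string.h>

#define N 24
#define T 8   /* tile size; a real choice depends on cache capacity */

/* Tiled C = A*B. For a fixed (i, j), k still runs 0..N-1 in
   ascending order across the kk tiles, so the result is identical
   to the untiled multiply; only the access order over tiles differs. */
void matmul_tiled(double C[N][N], double A[N][N], double B[N][N]) {
    memset(C, 0, N * N * sizeof(double));
    for (int jj = 0; jj < N; jj += T)
        for (int kk = 0; kk < N; kk += T)
            for (int ii = 0; ii < N; ii += T)
                for (int j = jj; j < jj + T; j++)
                    for (int k = kk; k < kk + T; k++)
                        for (int i = ii; i < ii + T; i++)
                            C[i][j] += A[i][k] * B[k][j];
}
```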
34 Legality of Tiling. Tiling is strip-mine and interchange: strip-mining does not reorder iterations, so the interchange must be legal. For every dependence, the components d_1, ..., d_{k-1} must each be '=' or '<', OR the strip size must be less than the threshold of the dependence.
35 Profitability of Tiling. Tiling is profitable if there is enough reuse to offset the overheads: an extra controlling loop for each tiled loop, extra misses in the controlling loops, and extra misses due to alignment to cache lines. We therefore need to identify and quantify reuse.
36 Locality Analysis. Identify reuse in the iteration space: vector spaces represent the directions with reuse. Quantify reuse in a given iteration subspace. Localized Iteration Space (LIS): the set of iterations where reuse can be exploited. Transform the loop nest so that directions with reuse fall inside the LIS.
37 Types of Reuse. Self reuse (same array reference): same data is self-temporal reuse; same cache line is self-spatial reuse. Group reuse (several uniformly generated array references): same data is group-temporal reuse; same cache line is group-spatial reuse.
38 Identifying Reuse: Mapping Iterations to Data
DO I = 1, N
  DO J = 1, N
    A(I) = B(J,I)
With iteration vector (I J)^T, A(I) maps iterations to data as (1 0)(I J)^T = I, and B(J,I) as (0 1; 1 0)(I J)^T = (J I)^T. (figure: iteration space mapped to data space)
39 Mapping Iterations to Data. Array indexing function: f(i): Z^n -> Z^d, f(i) = H i + c. Example: A1 = A(J+2,I), so f_A1(I,J) = (0 1; 1 0)(I J)^T + (2 0)^T.
40 Matrix Multiply: Array Indexing Functions (CSCI Compiler Design)
With iteration vector (I J K)^T:
f_C(I,J) = (1 0 0; 0 1 0)(I J K)^T + (0 0)^T
f_A(I,K) = (1 0 0; 0 0 1)(I J K)^T + (0 0)^T
f_B(K,J) = (0 0 1; 0 1 0)(I J K)^T + (0 0)^T
41 Identifying Self-Temporal Reuse. Iterations i_1 and i_2 reference the same data when H i_1 + c = H i_2 + c, i.e. H(i_1 - i_2) = 0. Solution: there is reuse along any direction r such that H r = 0, so the self-temporal reuse vector space is R_ST = ker H. The reuse is exploited if r is included in the LIS.
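A small helper, not from the slides, that ties this to the "elementary vectors" approximation used later: since H e_k is just the k-th column of H, an elementary iteration-space direction e_k lies in ker H exactly when that column is all zero.

```c
/* Marks which elementary directions e_k of the n-dimensional
   iteration space lie in ker H (candidate self-temporal reuse
   vectors). H is the d x n index matrix; out[k] is set to 1 when
   column k of H is zero. Returns the number of such directions. */
int elementary_reuse_dirs(int d, int n, int h[d][n], int out[n]) {
    int count = 0;
    for (int k = 0; k < n; k++) {
        out[k] = 1;
        for (int r = 0; r < d; r++)
            if (h[r][k] != 0) { out[k] = 0; break; }
        count += out[k];
    }
    return count;
}
```

For C(I,J) inside loops ordered (J, K, I), H picks out I and J, its K column is zero, and the helper reports exactly one reuse direction: along K.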
42 Identifying Temporal Reuse
Iterations that access the same A(I): (1 0)(I_1 J_1)^T = (1 0)(I_2 J_2)^T, i.e. I_1 = I_2; the reuse vector space is span{(0,1)}.
Iterations that access the same B(J,I): (0 1; 1 0)(I_1 J_1)^T = (0 1; 1 0)(I_2 J_2)^T forces I_1 = I_2 and J_1 = J_2: no reuse.
43 Quantifying Temporal Reuse. If the number of iterations along each dimension of the reuse space is B, then each element is reused B^dim(R_ST) times. Example: A(I) is reused N^2 times over loops J and K:
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      A(I) = A(I) + B(J,K)
44 Temporal Locality. A reference has self-temporal locality if it has self-temporal reuse and R_ST and the LIS have a non-null intersection. The quantity of reuse utilized is given by the dimensionality of the intersection of R_ST and the LIS; the number of memory accesses drops to 1/B^dim(R_ST) of the original.
45 Matrix-Matrix Multiply: R_ST
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
With iteration vector (J,K,I):
R_ST(C(I,J)) = span{(0,1,0)}
R_ST(A(I,K)) = span{(1,0,0)}
R_ST(B(K,J)) = span{(0,0,1)}
The localized iteration space is loop I, so only B(K,J) has a non-empty intersection of its reuse vector space with the LIS.
46 Tiled Matrix Multiply
DO J = 1, N BY T_J
  DO K = 1, N BY T_K
    DO I = 1, N BY T_I
      DO JJ = J, MIN(J+T_J-1,N)
        DO KK = K, MIN(K+T_K-1,N)
          DO II = I, MIN(I+T_I-1,N)
            C(II,JJ) = C(II,JJ) + A(II,KK) * B(KK,JJ)
The localized iteration space now spans all three loop dimensions, so there is a non-empty intersection with each of the reuse spaces.
47 Identifying Self-Spatial Reuse. Iterations i_1 and i_2 reference the same cache line when H_S i_1 + c = H_S i_2 + c, i.e. H_S(i_1 - i_2) = 0, where H_S is H with the row for the contiguous (column) dimension replaced by zeros: all array indices except the column index must be identical. Solution: directions r such that H_S r = 0, so the self-spatial reuse space is R_SS = ker H_S.
48 Identifying Spatial Reuse
DO I = 1, N
  DO J = 1, N
    A(I) = B(J,I)
For B(J,I), H = (0 1; 1 0) and H_S = (0 0; 1 0), so iterations access the same cache line when I_1 = I_2; the reuse vector space is span{(0,1)}.
49 Quantifying Spatial Reuse. Spatial reuse only exists if the stride s is less than L; each cache line is then reused L/s times. Example: a cache line holding B(J,I) is reused L times in loop J:
DO I = 1, N
  DO J = 1, N
    A(I) = A(I) + B(J,I)
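This quantification can be captured in a one-line helper (an illustrative sketch, with integer division standing in for the exact ratio): stride s in elements, line size l in elements, reuse factor l/s when s < l, and no spatial reuse otherwise.

```c
/* Spatial reuse factor for a reference with stride s (in elements)
   against a cache line of l elements: l/s reuses per line when the
   stride is smaller than the line, otherwise 1 (every access is a
   fresh line). */
int spatial_reuse_factor(int s, int l) {
    return (s > 0 && s < l) ? l / s : 1;
}
```

Unit stride gives the full factor l, matching the B(J,I) example above.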
50 Identifying Group-Temporal Reuse. References A(H i + c1) and A(H i + c2) are uniformly generated (same H). They reference the same data when there exists r = i_1 - i_2 in the LIS such that H i_2 + c1 = H i_1 + c2, i.e. H r = c1 - c2.
51 Identifying Group-Temporal Reuse. Find a particular solution r_p such that H r_p = c1 - c2; the general solution is R_GT = ker(H) + r_p.
52 Back to Tiling. Tile the loops carrying reuse to move the reuse vectors into the LIS. Tiling increases the dimensionality of the LIS and therefore increases locality.
53 In Practice. Reuse vectors may be a combination of the elementary vectors of the iteration space. Example: A(I+J,K) has R_ST = span{(1,-1,0)}. Use the smallest enclosing space spanned by elementary vectors: (1,-1,0) is spanned by (1,0,0) and (0,1,0).
54 Tiling for Real Caches. Tiling reduces capacity misses in fully-associative caches. In real life, caches are direct-mapped or have small associativity, and blocking may introduce conflict misses: the data within a block is not contiguous in memory, and the conflict misses may offset the benefits.
55 Tiling for Real Caches. Conflict misses vary with the array and tile sizes and are costly to predict at compile- or run-time. Avoiding conflicts: tile size selection, copying tiles at run-time, array padding.
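Why padding helps can be seen from the set-index arithmetic. A sketch with illustrative numbers (they are not from the slides): in a direct-mapped cache of CACHE_ELEMS elements, element (i, j) of an array with leading dimension ld maps to set (i*ld + j) mod CACHE_ELEMS, so when ld is a multiple of the cache size every row of a tile lands on the same sets, and padding ld by one line's worth of elements staggers the rows.

```c
#define CACHE_ELEMS 1024  /* cache capacity in elements, illustrative */
#define PAD 8             /* one cache line of padding, illustrative  */

/* Set index of element (i, j) of a row-contiguous array with
   leading dimension ld, in a direct-mapped cache. */
int cache_set(int i, int j, int ld) {
    return (i * ld + j) % CACHE_ELEMS;
}
```

With ld = 1024 consecutive rows collide in the same set; with ld = 1024 + PAD they fall PAD sets apart.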
56 Other Levels of the Memory Hierarchy. Register level: unroll-and-jam and scalar replacement. L2 and L3 cache levels: block for the L1 cache first, then block the controlling loops for L2 (hierarchical tiling).
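A minimal sketch of the register-level transformations just named, in row-major C (an analogue of A(I) = A(I) + B(J,I); all names and sizes are illustrative): the i loop is unrolled by 2, the two copies of the j loop are jammed into one, and a[i], a[i+1] are kept in scalars (hence registers) across the inner loop. Assumes n is even.

```c
enum { M = 8 };  /* inner trip count, illustrative */

/* Unroll-and-jam with scalar replacement: two rows of b are
   consumed per inner-loop pass, and the accumulators s0, s1 keep
   a[i] and a[i+1] out of memory inside the jammed loop. */
void unroll_and_jam(double *a, double b[][M], int n) {
    for (int i = 0; i < n; i += 2) {
        double s0 = a[i], s1 = a[i + 1];  /* scalar replacement */
        for (int j = 0; j < M; j++) {     /* jammed j loops */
            s0 += b[i][j];
            s1 += b[i + 1][j];
        }
        a[i] = s0;
        a[i + 1] = s1;
    }
}
```

The jam exposes temporal reuse of the accumulators in registers and gives the scheduler two independent dependence chains per iteration.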
57 Blocking and Parallelization. On parallel machines the goals conflict if a parallel loop also carries reuse; things work well with coarse-grain parallelism (outer loops) while the inner loops carry reuse. Data partitioning may cause false sharing: choose block sizes that are a multiple of the cache line size, or use array padding.
58 Blocking and Parallelization. Architectures with superword parallelism: instructions operate on superwords (objects consisting of a set of contiguous words). Spatial locality and superword parallelism are complementary optimizations, and temporal reuse in superword registers is also complementary.
59 Other Uses of Locality Analysis. Data and computation partitioning. Data prefetching: which references to prefetch (potential misses identified by locality analysis) and when to prefetch (reuse information used to compute prefetch predicates).
60 Summary. The data reuse concept: spatial and temporal. Locality analysis: the use of reuse vectors and the LIS to quantify reuse. In combination with loop transformations, this converts reuse into locality and leverages the caches.
More informationAdvanced Caching Techniques
Advanced Caching Approaches to improving memory system performance eliminate memory accesses/operations decrease the number of misses decrease the miss penalty decrease the cache/memory access times hide
More informationMemory Systems and Performance Engineering
SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide
More informationProgramming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy
Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum
More informationPERFORMANCE OPTIMISATION
PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction
More informationhigh-speed-high-capacity memory
Sanjay Rajopadhye Colorado State University n Transparently provide the illusion of a high-speed-high-capacity memory n Built out of caches: small memory devices that exploit the principle of locality
More informationMemory Hierarchy. Announcement. Computer system model. Reference
Announcement Memory Hierarchy Computer Organization and Assembly Languages Yung-Yu Chuang 26//5 Grade for hw#4 is online Please DO submit homework if you haen t Please sign up a demo time on /6 or /7 at
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html
More informationAdministrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13
Administrative Optimizing Stencil Computations March 18, 2013 Midterm coming April 3? In class March 25, can bring one page of notes Review notes, readings and review lecture Prior exams are posted Design
More informationLecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationPrinciple of Polyhedral model for loop optimization. cschen 陳鍾樞
Principle of Polyhedral model for loop optimization cschen 陳鍾樞 Outline Abstract model Affine expression, Polygon space Polyhedron space, Affine Accesses Data reuse Data locality Tiling Space partition
More informationCoarse-Grained Parallelism
Coarse-Grained Parallelism Variable Privatization, Loop Alignment, Loop Fusion, Loop interchange and skewing, Loop Strip-mining cs6363 1 Introduction Our previous loop transformations target vector and
More informationARCHER Single Node Optimisation
ARCHER Single Node Optimisation Optimising for the Memory Hierarchy Slides contributed by Cray and EPCC Overview Motivation Types of memory structures Reducing memory accesses Utilizing Caches Write optimisations
More informationν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.
Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast
More informationCSE P 501 Compilers. Loops Hal Perkins Spring UW CSE P 501 Spring 2018 U-1
CSE P 501 Compilers Loops Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 U-1 Agenda Loop optimizations Dominators discovering loops Loop invariant calculations Loop transformations A quick look at some
More informationMemory Systems and Performance Engineering. Fall 2009
Memory Systems and Performance Engineering Fall 2009 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide illusion of fast larger memory
More informationCache memories are small, fast SRAM based memories managed automatically in hardware.
Cache Memories Cache memories are small, fast SRAM based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and
More informationMemories. CPE480/CS480/EE480, Spring Hank Dietz.
Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationModule 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space.
The Lecture Contains: Iteration Space Iteration Vector Normalized Iteration Vector Dependence Distance Direction Vector Loop Carried Dependence Relations Dependence Level Iteration Vector - Triangular
More informationLecture 2: Single processor architecture and memory
Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =
More informationCSCI-UA.0201 Computer Systems Organization Memory Hierarchy
CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationMemory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O Hallaron (CMU) Mohamed Zahran (NYU)
More informationLoops. Lather, Rinse, Repeat. CS4410: Spring 2013
Loops or Lather, Rinse, Repeat CS4410: Spring 2013 Program Loops Reading: Appel Ch. 18 Loop = a computation repeatedly executed until a terminating condition is reached High-level loop constructs: While
More informationAutomatic Tiling of Iterative Stencil Loops
Automatic Tiling of Iterative Stencil Loops Zhiyuan Li and Yonghong Song Purdue University Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation
More informationwrite-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF
write-through v. write-back option 1: write-through 1 write 10 to 0xABCD CPU Cache ABCD: FF RAM 11CD: 42 ABCD: FF 1 2 write-through v. write-back option 1: write-through write-through v. write-back option
More informationMemory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster,
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Cache Memory Organization and Access Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O
More information