Program Transformations for the Memory Hierarchy
1 Program Transformations for the Memory Hierarchy: Locality Analysis and Reuse. Copyright 2014, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of these materials for their personal use. Acknowledgement: some of the material in this lecture is based on class notes from the CSCI 595 Spring 2004 class at USC, graciously provided by Dr. Jacqueline Chame and Dr. Mary Hall.
2 Processor-Memory Bandwidth Gap: increased sharply over the last two decades. It is now less severe, but multi-core exacerbates system-level issues.
3 Problems with Modern Architectures. Uni-processors: parallelism (extract and manage instruction-level parallelism, in both numerical and non-numerical applications); memory hierarchy (improve locality and hide latency). Multi-processors: parallelism (detection of loop-level parallelism); memory hierarchy (minimize synchronization frequency, minimize communication by improving locality, hide latency).
4 Processor-Memory Bandwidth Gap (figure: the memory hierarchy from functional units and registers through first- and second-level caches to local and remote memory, with relative latency growing at each level). Principle of Locality: reuse data that has been recently used. Trade-off: increasing capacity means higher latency.
5 Hiding Memory Latency: Locality. Memory local to the processor is fast but of limited capacity, so it needs to map a subset of main memory. Restrict mapping choices to allow easy look-up.
6 Analysis for Locality
7 Blocking for Uni- and Multi-Processors (figure: matrix multiply with and without blocking, comparing the total amount of data accessed in each case).
8 Performance of Blocking
9 Example: Givens QR Decomposition
10 Hiding Memory Latency: Prefetching. Idea: anticipate memory accesses and prefetch so data is available when needed. Trade-off: the prefetch must be issued in advance at the right time; do not fetch too late nor too early. Control flow is an issue.
11 Prefetching Example. Suppose prefetch instructions fetch 2 words at a time. How effective is prefetching?
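As a concrete illustration (not from the slides), here is a minimal software-prefetching sketch in C, assuming the GCC/Clang __builtin_prefetch intrinsic. PF_DIST is the prefetch distance, a tuning knob: far enough ahead to cover memory latency, but not so far that lines are evicted before use.

```c
#include <stddef.h>

/* Prefetch distance in elements; an illustrative value. */
#define PF_DIST 16

/* Sums an array, prefetching PF_DIST elements ahead of the current
   access. The result is identical to a plain loop; only the timing
   of the memory traffic changes. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)  /* guard the tail of the array */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}
```

The guard avoids prefetching past the array; in practice compilers peel the last PF_DIST iterations instead to remove the branch.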
12 Effectiveness of Prefetching. Memory stalls are reduced by 50% to 90%; the instruction and memory overheads are typically low.
13 Data Reuse in Caches. Data reuse: data used multiple times. Data locality: data remains in cache between uses, giving fewer cache misses and less memory traffic. Optimization goal: exploit reuse. Loop transformations turn reuse into locality.
14 Temporal Reuse: Same Data Accessed in Distinct Loop Iterations
DO I = 1, N
  DO J = 1, N
    C(J,I) = A(J) + B(I,J)
  ENDDO
ENDDO
A(J) has self-temporal reuse in loop I.
15 Spatial Reuse: Same Cache Line Accessed in Distinct Loop Iterations
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + B(I,J)
  ENDDO
ENDDO
A(J,I) has self-spatial reuse in loop J; B(I,J) has self-spatial reuse in loop I (FORTRAN arrays are stored in column-major order).
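The column-major point can be illustrated in C, which is row-major (the opposite of Fortran): there it is the last subscript that must vary fastest for self-spatial reuse. A small sketch with illustrative names; both functions compute the same sum, and only their cache behavior differs.

```c
#define N 64

/* Row-wise traversal: j varies fastest, giving unit-stride,
   cache-line-friendly accesses in row-major C. */
double sum_rowwise(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal: i varies fastest, so consecutive accesses
   are N elements apart and touch a new cache line almost every time. */
double sum_columnwise(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

With integer-valued entries the two sums are bit-identical, so correctness is unaffected by traversal order; only the miss counts change.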
16 Group Reuse: Same Data/Cache Line Accessed by Distinct References
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
  ENDDO
ENDDO
A(J,I), A(J+1,I), A(J-1,I): group reuse in J.
17 Group Reuse. Group-temporal reuse: same data location.
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J,I) + A(J-1,I) + A(J+1,I)
  ENDDO
ENDDO
Group-spatial reuse: same cache line.
DO I = 1, N
  DO J = 1, N
    A(1,J) = A(0,J)
  ENDDO
ENDDO
18 Group Reuse. Group-temporal but no reuse of other data on the same cache line (cache line size = 4):
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(J+8,I) + A(J-8,I)
  ENDDO
ENDDO
Group-spatial but not group-temporal:
DO I = 1, N
  DO J = 1, N
    A(J,I) = A(2*J,I) + A(2*J+1,I)
  ENDDO
ENDDO
19 Reuse and Cache Misses
DO I = 1, N
  DO J = 1, N
    ... = A(J)
  ENDDO
ENDDO
A(J) is self-spatial in J and self-temporal in I. (figure: iteration-space diagrams marking the cache misses for the cases where the data touched is smaller and larger than the cache)
20 Cache Misses
DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J,I)

  reference   misses per J loop   total misses
  A(I)        1                   N
  B(J,I)      M/L                 N*M/L

DO J = 1, M
  DO I = 1, N
    A(I) = A(I) + B(J,I)

  reference   misses per I loop   total misses
  A(I)        N/L                 M*N/L
  B(J,I)      N                   M*N

(L = cache line size; N, M > cache size; LIS = innermost loop)
21 How to Select a Loop Order to Exploit Reuse? Identify and quantify reuse. Allen & Kennedy: compute a cost function for each loop (the innermost memory cost). Wolf & Lam: locality analysis identifies reuse vectors; capacity misses are estimated from the reuse vectors and reuse amounts; order loops according to the amount of reuse.
22 A&K: Innermost Memory Cost. The innermost memory cost C_M(L_i) assumes L_i is the innermost loop, with loop variable l_i and N iterations. For each array reference r in the loop nest: if r does not depend on l_i, cost(r) = 1; if l_i strides over a non-contiguous dimension of r, cost(r) = N; if l_i strides over a contiguous dimension with step s < L, cost(r) = N*s/L. C_M(L_i) is the sum of cost(r) over all references.
23 Example
DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(I,J)
J innermost: cost_J(A) = 1, cost_J(B) = M, so C_M(J) = N * (1 + M).
I innermost: cost_I(A) = N/L, cost_I(B) = N/L, so C_M(I) = M * (N/L + N/L) = 2NM/L.
Select I as innermost if N(1+M) > 2NM/L.
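The comparison above can be sketched as code. This is an illustrative helper, not part of the slides: n and m are the trip counts, l the cache line size in elements, and the function returns which loop the cost model would place innermost.

```c
/* Compares the two innermost-memory-cost formulas worked out above
   and returns 'I' or 'J' for the cheaper innermost loop. */
char choose_innermost(double n, double m, double l) {
    double cm_j = n * (1.0 + m);    /* C_M(J): A costs 1, B costs M, summed over I */
    double cm_i = 2.0 * n * m / l;  /* C_M(I): A and B each cost N/L, summed over J */
    return (cm_i < cm_j) ? 'I' : 'J';
}
```

Note how the choice flips with the line size: long lines make the unit-stride I loop profitable, while l = 1 (no spatial reuse) favors keeping J innermost.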
24 Selecting a Loop Order. Compute the innermost memory cost for each loop in the nest; compute a desired loop order, from innermost to outermost, in order of increasing innermost memory cost; interchange loops to achieve a legal loop order that is closest to the desired order.
25 Example: Matrix-Matrix Multiply
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
C_M(I) = 2N^3/L + N^2
C_M(J) = 2N^3 + N^2
C_M(K) = N^3 + N^3/L + N^2
Ordering by innermost memory cost (outermost to innermost): (J, K, I), i.e. the cheapest loop, I, goes innermost.
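The three cost formulas can be written down directly (names are illustrative): for any n > 1 and line size l > 1 they order as C_M(I) < C_M(K) < C_M(J), which yields the (J, K, I) nest.

```c
/* The slide's three innermost-memory-cost formulas for NxN
   matrix multiply; l is the cache line size in elements. */
double cm_i(double n, double l) { return 2.0 * n * n * n / l + n * n; }
double cm_k(double n, double l) { return n * n * n + n * n * n / l + n * n; }
double cm_j(double n, double l) { (void)l; return 2.0 * n * n * n + n * n; }
```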
26 Groups of References (also known as uniformly generated references). When computing innermost memory costs, some references should be grouped: references that access the same data in distinct iterations of a loop I, e.g. A(J,I), A(J,I+1), A(J,I-1); and references that access the same cache line in the same iteration of I, e.g. A(I,J), A(I+1,J), ..., A(I+L-1,J).
27 Example
DO I = 1, T
  DO J = 1, N
    DO K = 1, N
      A(J,K) = f(A(J,K) + A(J,K+1) + A(J,K-1) + A(J+1,K) + A(J-1,K))
K innermost: G1 = {A(J,K+1), A(J,K), A(J,K-1)} => ~N misses; G2 = {A(J+1,K), A(J,K), A(J-1,K)} => ~N misses.
J innermost: G1 = {A(J+1,K), A(J,K), A(J-1,K)} => N/L misses; G2 = {A(J,K+1)} => N/L misses; G3 = {A(J,K-1)} => N/L misses.
28 Loop Tiling (Blocking). Tiling reorders loop iterations to bring iterations that access the same data closer in time. Choose tile sizes so that the reused data can be stored in cache. (figure: untiled vs. tiled traversal of the I-J iteration space)
29 Example: Matrix-Matrix Multiply
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
(figure: the regions of C, A, and B touched by the I, K, and J loops)
30 Reuse
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)

            reuse type in (I, K, J)    reuse amount in (I, K, J)
  C(I,J)    spatial, temporal, none    L, N, 1
  A(I,K)    spatial, none, temporal    L, 1, N
  B(K,J)    temporal, spatial, none    N, L, 1
31 Loop Tiling: Strip-Mine and Interchange
Strip-mine loop I (I traverses strips; II traverses the iterations within a strip):
DO J = 1, N
  DO K = 1, N
    DO I = 1, N BY T
      DO II = I, MIN(I+T-1,N)
        C(II,J) = C(II,J) + A(II,K) * B(K,J)
Interchange K and I (exploit the reuse of C(II,J) in loop K):
DO J = 1, N
  DO I = 1, N BY T
    DO K = 1, N
      DO II = I, MIN(I+T-1,N)
        C(II,J) = C(II,J) + A(II,K) * B(K,J)
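Strip-mining by itself visits exactly the original iterations, in the original order, which is what makes the later interchange the only legality concern. A tiny check of that property, with illustrative names (t need not divide n):

```c
/* Sums 0..n-1 by strips of size t: the (i, ii) pair enumerates
   the iterations of the original loop in order, so the result
   equals the untiled sum n*(n-1)/2 for any strip size. */
int strip_mined_sum(int n, int t) {
    int s = 0;
    for (int i = 0; i < n; i += t)                    /* strip loop */
        for (int ii = i; ii < i + t && ii < n; ii++)  /* within a strip */
            s += ii;
    return s;
}
```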
32 Tiling Inner Loops I and K
DO K = 1, N BY T_K
  DO I = 1, N BY T_I
    DO J = 1, N
      DO KK = K, MIN(K+T_K-1,N)
        DO II = I, MIN(I+T_I-1,N)
          C(II,J) = C(II,J) + A(II,KK) * B(KK,J)
(figure: the T_K x T_I tiles of C, A, and B)
33 Cache Misses (assuming the data referenced in the tiled loops fits in cache)
DO J = 1, N BY T_J
  DO K = 1, N BY T_K
    DO I = 1, N BY T_I
      DO JJ = J, MIN(J+T_J-1,N)
        DO KK = K, MIN(K+T_K-1,N)
          DO II = I, MIN(I+T_I-1,N)
            C(II,JJ) = C(II,JJ) + A(II,KK) * B(KK,JJ)

Misses as the LIS grows outward, one loop at a time:

            II       KK          JJ           I                  K                  J
  C(II,JJ)  T_I/L    T_I/L       T_J*T_I/L    N*T_J/L            N^2*T_J/(T_K*L)    N^3/(T_K*L)
  A(II,KK)  T_I/L    T_K*T_I/L   T_I*T_K/L    N*T_K/L            N^2/L              N^3/(T_J*L)
  B(KK,JJ)  1        T_K/L       T_J*T_K/L    N*T_J*T_K/(T_I*L)  N^2*T_J/(T_I*L)    N^3/(T_I*L)
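For reference, a runnable C transcription of this fully tiled multiply (a sketch: C is row-major, the opposite of Fortran, so the unit-stride accesses here run along the last subscript; the sizes are illustrative and N is chosen divisible by T so the MIN clamps are not needed):

```c
#include <string.h>

#define N 24
#define T 8   /* tile size; a real choice depends on cache capacity */

/* Tiled C = A*B. For a fixed (i, j), k still runs 0..N-1 in
   ascending order across the kk tiles, so the result is identical
   to the untiled multiply; only the access order over tiles differs. */
void matmul_tiled(double C[N][N], double A[N][N], double B[N][N]) {
    memset(C, 0, N * N * sizeof(double));
    for (int jj = 0; jj < N; jj += T)
        for (int kk = 0; kk < N; kk += T)
            for (int ii = 0; ii < N; ii += T)
                for (int j = jj; j < jj + T; j++)
                    for (int k = kk; k < kk + T; k++)
                        for (int i = ii; i < ii + T; i++)
                            C[i][j] += A[i][k] * B[k][j];
}
```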
34 Legality of Tiling. Tiling is strip-mine and interchange: strip-mining does not reorder iterations, so the interchange must be legal. For every dependence, the components d_1, ..., d_{k-1} must each be '=' or '<', OR the strip size must be less than the threshold of the dependence.
35 Profitability of Tiling. Tiling is profitable if there is enough reuse to offset the overheads: an extra controlling loop for each tiled loop, extra misses in the controlling loops, and extra misses due to alignment to cache lines. We therefore need to identify and quantify reuse.
36 Locality Analysis. Identify reuse in the iteration space: vector spaces represent the directions with reuse. Quantify reuse in a given iteration subspace. Localized Iteration Space (LIS): the set of iterations where reuse can be exploited. Transform the loop nest so that directions with reuse fall inside the LIS.
37 Types of Reuse. Self reuse (same array reference): same data is self-temporal reuse; same cache line is self-spatial reuse. Group reuse (several uniformly generated array references): same data is group-temporal reuse; same cache line is group-spatial reuse.
38 Identifying Reuse: Mapping Iterations to Data
DO I = 1, N
  DO J = 1, N
    A(I) = B(J,I)
With iteration vector (I J)^T, A(I) maps iterations to data as (1 0)(I J)^T = I, and B(J,I) as (0 1; 1 0)(I J)^T = (J I)^T. (figure: iteration space mapped to data space)
39 Mapping Iterations to Data. Array indexing function: f(i): Z^n -> Z^d, f(i) = H i + c. Example: A1 = A(J+2,I), so f_A1(I,J) = (0 1; 1 0)(I J)^T + (2 0)^T.
40 Matrix Multiply: Array Indexing Functions (CSCI Compiler Design)
With iteration vector (I J K)^T:
f_C(I,J) = (1 0 0; 0 1 0)(I J K)^T + (0 0)^T
f_A(I,K) = (1 0 0; 0 0 1)(I J K)^T + (0 0)^T
f_B(K,J) = (0 0 1; 0 1 0)(I J K)^T + (0 0)^T
41 Identifying Self-Temporal Reuse. Iterations i_1 and i_2 reference the same data when H i_1 + c = H i_2 + c, i.e. H(i_1 - i_2) = 0. Solution: there is reuse along any direction r such that H r = 0, so the self-temporal reuse vector space is R_ST = ker H. The reuse is exploited if r is included in the LIS.
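A small helper, not from the slides, that ties this to the "elementary vectors" approximation used later: since H e_k is just the k-th column of H, an elementary iteration-space direction e_k lies in ker H exactly when that column is all zero.

```c
/* Marks which elementary directions e_k of the n-dimensional
   iteration space lie in ker H (candidate self-temporal reuse
   vectors). H is the d x n index matrix; out[k] is set to 1 when
   column k of H is zero. Returns the number of such directions. */
int elementary_reuse_dirs(int d, int n, int h[d][n], int out[n]) {
    int count = 0;
    for (int k = 0; k < n; k++) {
        out[k] = 1;
        for (int r = 0; r < d; r++)
            if (h[r][k] != 0) { out[k] = 0; break; }
        count += out[k];
    }
    return count;
}
```

For C(I,J) inside loops ordered (J, K, I), H picks out I and J, its K column is zero, and the helper reports exactly one reuse direction: along K.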
42 Identifying Temporal Reuse
Iterations that access the same A(I): (1 0)(I_1 J_1)^T = (1 0)(I_2 J_2)^T, i.e. I_1 = I_2; the reuse vector space is span{(0,1)}.
Iterations that access the same B(J,I): (0 1; 1 0)(I_1 J_1)^T = (0 1; 1 0)(I_2 J_2)^T forces I_1 = I_2 and J_1 = J_2: no reuse.
43 Quantifying Temporal Reuse. If the number of iterations along each dimension of the reuse space is B, then each element is reused B^dim(R_ST) times. Example: A(I) is reused N^2 times over loops J and K:
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      A(I) = A(I) + B(J,K)
44 Temporal Locality. A reference has self-temporal locality if it has self-temporal reuse and R_ST and the LIS have a non-null intersection. The quantity of reuse utilized is given by the dimensionality of the intersection of R_ST and the LIS; the number of memory accesses drops to 1/B^dim(R_ST) of the original.
45 Matrix-Matrix Multiply: R_ST
DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
With iteration vector (J,K,I):
R_ST(C(I,J)) = span{(0,1,0)}
R_ST(A(I,K)) = span{(1,0,0)}
R_ST(B(K,J)) = span{(0,0,1)}
The localized iteration space is loop I, so only B(K,J) has a non-empty intersection of its reuse vector space with the LIS.
46 Tiled Matrix Multiply
DO J = 1, N BY T_J
  DO K = 1, N BY T_K
    DO I = 1, N BY T_I
      DO JJ = J, MIN(J+T_J-1,N)
        DO KK = K, MIN(K+T_K-1,N)
          DO II = I, MIN(I+T_I-1,N)
            C(II,JJ) = C(II,JJ) + A(II,KK) * B(KK,JJ)
The localized iteration space now spans all three loop dimensions, so there is a non-empty intersection with each of the reuse spaces.
47 Identifying Self-Spatial Reuse. Iterations i_1 and i_2 reference the same cache line when H_S i_1 + c = H_S i_2 + c, i.e. H_S(i_1 - i_2) = 0, where H_S is H with the row for the contiguous (column) dimension replaced by zeros: all array indices except the column index must be identical. Solution: directions r such that H_S r = 0, so the self-spatial reuse space is R_SS = ker H_S.
48 Identifying Spatial Reuse
DO I = 1, N
  DO J = 1, N
    A(I) = B(J,I)
For B(J,I), H = (0 1; 1 0) and H_S = (0 0; 1 0), so iterations access the same cache line when I_1 = I_2; the reuse vector space is span{(0,1)}.
49 Quantifying Spatial Reuse. Spatial reuse only exists if the stride s is less than L; each cache line is then reused L/s times. Example: a cache line holding B(J,I) is reused L times in loop J:
DO I = 1, N
  DO J = 1, N
    A(I) = A(I) + B(J,I)
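This quantification can be captured in a one-line helper (an illustrative sketch, with integer division standing in for the exact ratio): stride s in elements, line size l in elements, reuse factor l/s when s < l, and no spatial reuse otherwise.

```c
/* Spatial reuse factor for a reference with stride s (in elements)
   against a cache line of l elements: l/s reuses per line when the
   stride is smaller than the line, otherwise 1 (every access is a
   fresh line). */
int spatial_reuse_factor(int s, int l) {
    return (s > 0 && s < l) ? l / s : 1;
}
```

Unit stride gives the full factor l, matching the B(J,I) example above.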
50 Identifying Group-Temporal Reuse. References A(H i + c1) and A(H i + c2) are uniformly generated (same H). They reference the same data when there exists r = i_1 - i_2 in the LIS such that H i_2 + c1 = H i_1 + c2, i.e. H r = c1 - c2.
51 Identifying Group-Temporal Reuse. Find a particular solution r_p such that H r_p = c1 - c2; the general solution is R_GT = ker(H) + r_p.
52 Back to Tiling. Tile the loops carrying reuse to move the reuse vectors into the LIS. Tiling increases the dimensionality of the LIS and therefore increases locality.
53 In Practice. Reuse vectors may be a combination of the elementary vectors of the iteration space. Example: A(I+J,K) has R_ST = span{(1,-1,0)}. Use the smallest enclosing space spanned by elementary vectors: (1,-1,0) is spanned by (1,0,0) and (0,1,0).
54 Tiling for Real Caches. Tiling reduces capacity misses in fully-associative caches. In real life, caches are direct-mapped or have small associativity, and blocking may introduce conflict misses: the data within a block is not contiguous in memory, and the conflict misses may offset the benefits.
55 Tiling for Real Caches. Conflict misses vary with the array and tile sizes and are costly to predict at compile- or run-time. Avoiding conflicts: tile size selection, copying tiles at run-time, array padding.
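Why padding helps can be seen from the set-index arithmetic. A sketch with illustrative numbers (they are not from the slides): in a direct-mapped cache of CACHE_ELEMS elements, element (i, j) of an array with leading dimension ld maps to set (i*ld + j) mod CACHE_ELEMS, so when ld is a multiple of the cache size every row of a tile lands on the same sets, and padding ld by one line's worth of elements staggers the rows.

```c
#define CACHE_ELEMS 1024  /* cache capacity in elements, illustrative */
#define PAD 8             /* one cache line of padding, illustrative  */

/* Set index of element (i, j) of a row-contiguous array with
   leading dimension ld, in a direct-mapped cache. */
int cache_set(int i, int j, int ld) {
    return (i * ld + j) % CACHE_ELEMS;
}
```

With ld = 1024 consecutive rows collide in the same set; with ld = 1024 + PAD they fall PAD sets apart.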
56 Other Levels of the Memory Hierarchy. Register level: unroll-and-jam and scalar replacement. L2 and L3 cache levels: block for the L1 cache first, then block the controlling loops for L2 (hierarchical tiling).
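A minimal sketch of the register-level transformations just named, in row-major C (an analogue of A(I) = A(I) + B(J,I); all names and sizes are illustrative): the i loop is unrolled by 2, the two copies of the j loop are jammed into one, and a[i], a[i+1] are kept in scalars (hence registers) across the inner loop. Assumes n is even.

```c
enum { M = 8 };  /* inner trip count, illustrative */

/* Unroll-and-jam with scalar replacement: two rows of b are
   consumed per inner-loop pass, and the accumulators s0, s1 keep
   a[i] and a[i+1] out of memory inside the jammed loop. */
void unroll_and_jam(double *a, double b[][M], int n) {
    for (int i = 0; i < n; i += 2) {
        double s0 = a[i], s1 = a[i + 1];  /* scalar replacement */
        for (int j = 0; j < M; j++) {     /* jammed j loops */
            s0 += b[i][j];
            s1 += b[i + 1][j];
        }
        a[i] = s0;
        a[i + 1] = s1;
    }
}
```

The jam exposes temporal reuse of the accumulators in registers and gives the scheduler two independent dependence chains per iteration.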
57 Blocking and Parallelization. On parallel machines the goals conflict if a parallel loop also carries reuse; things work well with coarse-grain parallelism (outer loops) while the inner loops carry reuse. Data partitioning may cause false sharing: choose block sizes that are a multiple of the cache line size, or use array padding.
58 Blocking and Parallelization. Architectures with superword parallelism: instructions operate on superwords (objects consisting of a set of contiguous words). Spatial locality and superword parallelism are complementary optimizations, and temporal reuse in superword registers is also complementary.
59 Other Uses of Locality Analysis. Data and computation partitioning. Data prefetching: which references to prefetch (potential misses identified by locality analysis) and when to prefetch (reuse information used to compute prefetch predicates).
60 Summary. The data reuse concept: spatial and temporal. Locality analysis: the use of reuse vectors and the LIS to quantify reuse. In combination with loop transformations, this converts reuse into locality and leverages the caches.
More informationAdvanced Caching Techniques
Advanced Caching Approaches to improving memory system performance eliminate memory accesses/operations decrease the number of misses decrease the miss penalty decrease the cache/memory access times hide
More informationMemory Systems and Performance Engineering
SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide
More informationProgramming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy
Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum
More informationPERFORMANCE OPTIMISATION
PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction
More informationhigh-speed-high-capacity memory
Sanjay Rajopadhye Colorado State University n Transparently provide the illusion of a high-speed-high-capacity memory n Built out of caches: small memory devices that exploit the principle of locality
More informationMemory Hierarchy. Announcement. Computer system model. Reference
Announcement Memory Hierarchy Computer Organization and Assembly Languages Yung-Yu Chuang 26//5 Grade for hw#4 is online Please DO submit homework if you haen t Please sign up a demo time on /6 or /7 at
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html
More informationAdministrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13
Administrative Optimizing Stencil Computations March 18, 2013 Midterm coming April 3? In class March 25, can bring one page of notes Review notes, readings and review lecture Prior exams are posted Design
More informationLecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationPrinciple of Polyhedral model for loop optimization. cschen 陳鍾樞
Principle of Polyhedral model for loop optimization cschen 陳鍾樞 Outline Abstract model Affine expression, Polygon space Polyhedron space, Affine Accesses Data reuse Data locality Tiling Space partition
More informationCoarse-Grained Parallelism
Coarse-Grained Parallelism Variable Privatization, Loop Alignment, Loop Fusion, Loop interchange and skewing, Loop Strip-mining cs6363 1 Introduction Our previous loop transformations target vector and
More informationARCHER Single Node Optimisation
ARCHER Single Node Optimisation Optimising for the Memory Hierarchy Slides contributed by Cray and EPCC Overview Motivation Types of memory structures Reducing memory accesses Utilizing Caches Write optimisations
More informationν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.
Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast
More informationCSE P 501 Compilers. Loops Hal Perkins Spring UW CSE P 501 Spring 2018 U-1
CSE P 501 Compilers Loops Hal Perkins Spring 2018 UW CSE P 501 Spring 2018 U-1 Agenda Loop optimizations Dominators discovering loops Loop invariant calculations Loop transformations A quick look at some
More informationMemory Systems and Performance Engineering. Fall 2009
Memory Systems and Performance Engineering Fall 2009 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide illusion of fast larger memory
More informationCache memories are small, fast SRAM based memories managed automatically in hardware.
Cache Memories Cache memories are small, fast SRAM based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and
More informationMemories. CPE480/CS480/EE480, Spring Hank Dietz.
Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationModule 16: Data Flow Analysis in Presence of Procedure Calls Lecture 32: Iteration. The Lecture Contains: Iteration Space.
The Lecture Contains: Iteration Space Iteration Vector Normalized Iteration Vector Dependence Distance Direction Vector Loop Carried Dependence Relations Dependence Level Iteration Vector - Triangular
More informationLecture 2: Single processor architecture and memory
Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =
More informationCSCI-UA.0201 Computer Systems Organization Memory Hierarchy
CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile
More informationCache Performance (H&P 5.3; 5.5; 5.6)
Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationMemory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O Hallaron (CMU) Mohamed Zahran (NYU)
More informationLoops. Lather, Rinse, Repeat. CS4410: Spring 2013
Loops or Lather, Rinse, Repeat CS4410: Spring 2013 Program Loops Reading: Appel Ch. 18 Loop = a computation repeatedly executed until a terminating condition is reached High-level loop constructs: While
More informationAutomatic Tiling of Iterative Stencil Loops
Automatic Tiling of Iterative Stencil Loops Zhiyuan Li and Yonghong Song Purdue University Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation
More informationwrite-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF
write-through v. write-back option 1: write-through 1 write 10 to 0xABCD CPU Cache ABCD: FF RAM 11CD: 42 ABCD: FF 1 2 write-through v. write-back option 1: write-through write-through v. write-back option
More informationMemory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster,
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Cache Memory Organization and Access Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O
More information