Memory access patterns. 5KK73 Cedric Nugteren


1 Memory access patterns 5KK73 Cedric Nugteren

2 Today's topics
1. The importance of memory access patterns
   a. Vectorisation and access patterns
   b. Strided accesses on GPUs
   c. Data re-use on GPUs and FPGAs
2. Classifying memory access patterns
   a. Berkeley's 7 dwarfs
   b. Algorithmic species
   c. Algorithmic skeletons
3. Algorithmic skeletons for accelerators (after the break, Mark Wijtvliet)


3 Vector-SIMD execution
    for (i=0; i<n; i++) c[i] = a[i] + b[i];
Scalar execution (N iterations):
    ld  r1, addr1
    ld  r2, addr2
    add r3, r1, r2
    st  r3, addr3
Vector execution (N/4 iterations):
    ldv  vr1, addr1
    ldv  vr2, addr2
    addv vr3, vr1, vr2
    stv  vr3, addr3
SIMD processes multiple scalar operations concurrently.
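The N versus N/4 iteration counts can be mimicked in plain C. The sketch below is only an illustration: the names add_scalar, add_vec4 and VL are made up here, and the inner lane loop stands in for what real hardware would do with a single vector instruction.

```c
#include <stddef.h>

#define VL 4  /* assumed vector length: four 32-bit lanes in a 128-bit register */

/* Scalar version: n iterations, one element each. */
void add_scalar(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vector-style version: n/VL trips, each standing in for one
 * ldv/ldv/addv/stv group, followed by a scalar remainder loop.
 * Returns the number of vector trips taken. */
size_t add_vec4(float *c, const float *a, const float *b, size_t n) {
    size_t trips = 0;
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++)  /* one vector op in hardware */
            c[i + lane] = a[i + lane] + b[i + lane];
        trips++;
    }
    for (; i < n; i++)                            /* leftover elements */
        c[i] = a[i] + b[i];
    return trips;
}
```

For n = 10 the vector-style loop takes two trips and leaves two elements for the remainder loop, which is why N/4 is only exact when N is a multiple of the vector length.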

4 Vector-SIMD execution
A single instruction being executed:
- By multiple processing engines (ALUs, PEs, cores, nodes)
- Concurrently in lockstep (no synchronization)
- On multiple data elements
Present in a wide range of architectures:
- SIMD, GPU, AVX, SSE, NEON, Xetal, etc.
Type of parallelism that is easy and cheap to implement:
- No coherence problem
- No lock problem
Caveat: hard to program and/or easy to lose many factors of performance
[Slides taken from P. Sadayappan]

5 How to use SIMD instructions?
Pick your favourite:
1. Vectorising compiler (ICC, latest GCCs)
    for (i=0; i<n; i++) c[i] = a[i] + b[i];
2. Macros or intrinsics
    __m128 ra, rb, rc;
    for (int i = 0; i < N; i += 4) {
        ra = _mm_load_ps(&a[i]);
        rb = _mm_load_ps(&b[i]);
        rc = _mm_add_ps(ra, rb);
        _mm_store_ps(&c[i], rc);
    }
3. Assembly
    ..B8.5:
        movaps a(,%rdx,4), %xmm0
        addps  b(,%rdx,4), %xmm0
        movaps %xmm0, c(,%rdx,4)
        addq   $4, %rdx
        cmpq   %rdi, %rdx
        jl     ..B8.5
[Slides taken from P. Sadayappan]

6 What is the performance impact?
    for (i=0; i<n; i++) a[i] = a[i] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has independent operations (no loop-carried dependences)
- Array a resides in L1 cache (12.5 KB)
Performance in GOPS on a 128-bit wide CPU (elements per vector in parentheses):
              char (16)   int (4)   float (4)   double (2)
  original    [values missing]
  SSE         [values missing]
  speed-up    20.1x       5.0x      3.9x        2.0x
[Slides taken from P. Sadayappan]

7 Strided accesses (1/2)
    for (i=0; i<n; i+=16) a[i] = a[i] + 1;
Properties of the example:
- Stride-16 accesses to array a
- Inner loop has independent operations
- Array a resides in L1 cache
Why no performance gain?
- Operands are not contiguous in memory
- Multiple loads/stores, vector pack/unpack
- No auto-vectorisation in GCC
- ICC vectorises, but no gains
Performance in GOPS on a 128-bit wide CPU (elements per vector in parentheses):
              char (16)   int (4)   float (4)   double (2)
  original    [values missing]
  SSE         [only one value, 2.7, survives]
  speed-up    1.0x        1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]

8 Strided accesses (2/2)
    for (i=0; i<n; i+=stride) a[i] = a[i] + 1;
Generalised example (still L1 resident)
Performance in GOPS on a 128-bit wide CPU, per stride:
  STRIDE      char (16)   int (4)   float (4)   double (2)
  [values missing]
[Slides taken from P. Sadayappan]

9 Dependent operations
    for (i=1; i<n; i++) a[i] = a[i-1] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has dependent operations
- Array a resides in L1 cache
Why no performance gain?
- Iteration i depends on iteration i-1
- Inner loop cannot be parallelised
Performance in GOPS on a 128-bit wide CPU (elements per vector in parentheses):
              char (16)   int (4)   float (4)   double (2)
  original    [values missing]
  SSE         [values missing]
  speed-up    1.0x        1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]

10 L1 versus main memory
    for (i=0; i<10000*n; i++) a[i] = a[i] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has independent operations
- Array a resides in main memory (DRAM)
Why is performance limited?
- Code has become memory bandwidth bound
- Explained by the roofline model
Performance in GOPS on a 128-bit wide CPU (elements per vector in parentheses):
              char (16)   int (4)   float (4)   double (2)
  SSE L1      [values missing]
  SSE DRAM    [values missing]
[Slides taken from P. Sadayappan]
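The roofline bound mentioned on this slide can be written down in a few lines of C. This is a sketch under stated assumptions: the function names are made up here, the machine numbers in the usage note are illustrative and not measurements from the slides, and each element is assumed to be read once and written once with no extra cache traffic.

```c
/* Attainable performance in Gop/s under the roofline model:
 * min(peak compute, memory bandwidth * operational intensity). */
double roofline_gops(double peak_gops, double bandwidth_gbs, double ops_per_byte) {
    double memory_bound = bandwidth_gbs * ops_per_byte;
    return memory_bound < peak_gops ? memory_bound : peak_gops;
}

/* Operational intensity of a[i] = a[i] + 1: one operation per element,
 * with each element read once and written once (assumption). */
double increment_intensity(double bytes_per_element) {
    return 1.0 / (2.0 * bytes_per_element);
}
```

On a hypothetical machine with a 10 Gop/s peak and 16 GB/s of DRAM bandwidth, the int version (4 bytes per element) is capped at 16 * 0.125 = 2 Gop/s, well below the peak. A wider data type lowers the operational intensity and therefore the cap, which matches the idea that the DRAM-resident loop is limited by bytes moved rather than by the vector unit.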

11 Multi-core scaling
L1-resident case:
    #pragma omp parallel for
    for (i=0; i<n; i++) a[i] = a[i] + 1;
  Performance per thread count (SSE L1), columns char (16), int (4), float (4), double (2): [values missing]
  speed-up    [missing]   3.0x      3.0x        3.0x
DRAM-resident case:
    #pragma omp parallel for
    for (i=0; i<10000*n; i++) a[i] = a[i] + 1;
  Performance per thread count (SSE DRAM), columns char (16), int (4), float (4), double (2): [values missing]
  speed-up    [missing]   1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]

12 Lessons learned from vectorisation
Vectorisation and parallelisation are important:
- Significant speed-ups can be obtained...
- ...depending on the memory access patterns!
Performance depends on the memory access pattern:
- Strided accesses
- Dependent / independent operations
- Size of data structures
Performance / implementation will differ per architecture:
- Vector width and data types
- L1 resident or not (L1 cache size, DRAM bandwidth, etc.)
Bottom line: let's take a closer look at memory access patterns

13 Strided accesses on GPUs
    __global__ void stride_copy(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id*STRIDE] = in[id*STRIDE];
    }
Performance in GB/s on a Tesla C2050 for S=1 up to S=16: [values missing]
Why is performance deteriorating?
- Memory accesses are no longer coalesced
- Not all data in cache-lines are used
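The effect of the stride on cache-line usage can be estimated on the host with a small C model. This is a simplified sketch, not a description of the actual hardware behaviour: it assumes 32 threads per warp, 4-byte floats, 128-byte aligned segments, and an array that starts on a segment boundary. The function names are made up here.

```c
#include <stddef.h>

/* Number of distinct aligned segments touched when `threads` consecutive
 * threads each access one element of `elem_bytes` bytes at index id*stride.
 * Assumes the array starts on a segment boundary. */
size_t segments_touched(size_t threads, size_t stride,
                        size_t elem_bytes, size_t segment_bytes) {
    size_t count = 0;
    size_t last = (size_t)-1;   /* no segment seen yet */
    for (size_t id = 0; id < threads; id++) {
        size_t seg = (id * stride * elem_bytes) / segment_bytes;
        if (seg != last) {      /* addresses grow with id, so segments are sorted */
            count++;
            last = seg;
        }
    }
    return count;
}

/* Fraction of the transferred bytes that the threads actually use. */
double transfer_efficiency(size_t threads, size_t stride,
                           size_t elem_bytes, size_t segment_bytes) {
    size_t moved = segments_touched(threads, stride, elem_bytes, segment_bytes)
                   * segment_bytes;
    return (double)(threads * elem_bytes) / (double)moved;
}
```

Under these assumptions a warp with stride 1 touches one segment and uses all of it, stride 2 touches two segments and uses half of the bytes moved, and stride 32 touches 32 segments while using only 4 bytes of each.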

14 Data-reuse on GPUs
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
    }
Properties of the example:
- Each data element is used 3 times (data-reuse)
- Memory bandwidth is the limiting performance factor
- Use the GPU's scratchpad memory (shared) to benefit from reuse
- Newer GPUs use caches to benefit automatically
- Expected performance gain: up to 2x
[Figure: neighbouring threads id and id+1 reuse overlapping elements of in[] to produce out[]]
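One plausible way to arrive at a gain of up to 2x is to count global-memory accesses per output element. The C sketch below does that count. It is an interpretation, not something the slides state: it assumes the kernel is purely bandwidth bound, that reads and writes cost the same, and that a tile of outputs plus its halo is staged on chip once. All names are made up here.

```c
/* Global-memory accesses per output without reuse: `taps` reads + 1 write. */
double accesses_without_reuse(int taps) {
    return taps + 1.0;
}

/* With a tile of `tile` outputs staged on chip, the tile plus its halo of
 * (taps - 1) elements is read once, and each output is written once. */
double accesses_with_reuse(int taps, int tile) {
    return (double)(tile + taps - 1) / tile + 1.0;
}

/* Ratio of the two counts: an upper bound on the gain under the
 * bandwidth-bound assumption. */
double reuse_gain(int taps, int tile) {
    return accesses_without_reuse(taps) / accesses_with_reuse(taps, tile);
}
```

For the 3-tap filter and a tile of 512 elements this gives 4 accesses against roughly 2, so a ratio just under 2. With a tile of one element there is nothing to reuse and the ratio is exactly 1.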

15 Data-reuse on FPGAs
Implementing an erosion filter on an FPGA:
- Manually (VHDL)
- Automatically from C (HLS)
- Automatically from C using memory access pattern information



17 Classifying program code
Berkeley's 7 dwarves of computation (extended to 13 in "A View From Berkeley"):
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
More information: "A View From Berkeley"

18 Classifying memory access patterns
Berkeley's dwarves are:
- High-level and intuitive, but...
- ...don't capture all relevant details of memory access patterns
- Not formalised nor exact: classes are based on a textual description
Can we do better?
Introducing algorithmic species: a classification of code based on memory access patterns

19 Algorithmic species examples (1/3)
    for(i=0; i<64; i++) {
        for(j=0; j<128; j++) {
            R[i][j] = 2 * M[i][j];
        }
    }
Basic forall matrix copy:
- Each i,j iteration one data element is read from M
- Each i,j iteration one data element is written to R
Species: M[0:63,0:127]|element -> R[0:63,0:127]|element

20 Algorithmic species examples (2/3)
    for(i=0; i<64; i++) {
        r[i] = 0;
        for(j=0; j<128; j++) {
            r[i] += M[i][j] * v[j];
        }
    }
Matrix-vector multiplication:
- Each i iteration a row is read from M and the full vector v
- Each i iteration one element of the vector r is produced
Species: M[0:63,0:127]|chunk(-,0:127) + v[0:127]|full -> r[0:63]|element

21 Algorithmic species examples (3/3)
    for(i=1; i<128-1; i++) {
        m[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
    }
Filter with data-reuse:
- Each i iteration three neighbouring elements from a are read
- Each i iteration one element of m is produced
Species: a[1:126]|neighbourhood(-1:1) -> m[1:126]|element

22 Can't we capture more details?
    for(i=0; i<4; i++) {
        Q[i] = P[2*i] + P[2*i + 1];
    }
Characterise each array access based on: array name, type (read or write), domain, number of elements, step.
- P[2*i]:     (P,r,[0..6],1,2)
- P[2*i + 1]: (P,r,[1..7],1,2)
- The two reads of P combine into (P,r,[0..7],2,2)
- Q[i]:       (Q,w,[0..3],1,1)
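The five-tuples on this slide can be derived mechanically for simple affine index expressions. The C sketch below shows one way to do so. The struct layout, the function names, and in particular the merge rule in combine are illustrative assumptions, not the classification algorithm from the lecture.

```c
#include <stdbool.h>

/* Five-tuple from the slide: (name, type, domain, number of elements, step). */
typedef struct {
    char name;    /* array name, e.g. 'P' */
    char type;    /* 'r' for read, 'w' for write */
    int  lo, hi;  /* domain [lo..hi] */
    int  count;   /* elements accessed per iteration */
    int  step;    /* distance between consecutive iterations */
} Access;

/* Descriptor for a reference name[scale*i + offset] with i in [0, iters).
 * Assumes scale > 0 and iters >= 1. */
Access describe(char name, char type, int scale, int offset, int iters) {
    Access a = { name, type, offset, scale * (iters - 1) + offset, 1, scale };
    return a;
}

/* Hypothetical merge rule: two accesses to the same array with the same
 * type, step and size, whose domains are shifted by `count` elements, are
 * combined into one access of twice the size.
 * Returns false if the rule does not apply. */
bool combine(Access x, Access y, Access *out) {
    if (x.name != y.name || x.type != y.type || x.step != y.step) return false;
    if (x.count != y.count || y.lo != x.lo + x.count) return false;
    out->name = x.name;
    out->type = x.type;
    out->lo = x.lo;
    out->hi = y.hi;
    out->count = x.count + y.count;
    out->step = x.step;
    return true;
}
```

Calling describe('P', 'r', 2, 0, 4) and describe('P', 'r', 2, 1, 4) reproduces the two read tuples for P, and combine merges them into (P,r,[0..7],2,2). The write to Q comes out as (Q,w,[0..3],1,1).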

23 How can we use a classification?
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
    }
Consider the earlier GPU filter example:
- Each data element is used 3 times (data-reuse)
- Use the GPU's scratchpad memory (shared) to benefit from reuse
What if we had an optimised pre-implemented skeleton (template) for such neighbourhood type of computations?
[Figure: neighbouring threads id and id+1 reuse overlapping elements of in[] to produce out[]]

24 Using algorithmic skeletons
User input:
    <args>        = float* out, float* in
    <computation> = 0.33 * (in[i-1] + in[i] + in[i+1])
    <input>       = in
    <output>      = out
    <type>        = float
Simplified skeleton:
    __global__ void neighbourhood_skeleton(<args>) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ <type> smem[512];
        smem[sid] = <input>[id];
        __syncthreads();
        // Perform the computation
        <type> res = <computation>;
        <output>[id] = res;
    }
User input + simplified skeleton = instantiated skeleton:
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ float smem[512];
        smem[sid] = in[id];
        __syncthreads();
        // Perform the computation
        float res = 0.33*(smem[sid-1]+smem[sid]+smem[sid+1]);
        out[id] = res;
    }
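The same skeleton idea can be tried out on the host in plain C, with a function pointer playing the role of <computation> and a small local array standing in for shared memory. This is a sketch under assumptions of its own: the tile size, the names stencil_fn and average3, and the explicit halo loading are additions made here, since the simplified GPU skeleton on the slide leaves the tile edges out.

```c
#include <stddef.h>

#define TILE 8  /* assumed tile size; the slide uses a 512-element buffer */

/* User-supplied computation on a 3-element neighbourhood centred at p[0]. */
typedef float (*stencil_fn)(const float *p);

float average3(const float *p) {
    return 0.33f * (p[-1] + p[0] + p[1]);
}

/* Neighbourhood skeleton: copy each tile plus a one-element halo on both
 * sides into a local buffer (standing in for shared memory), then apply
 * the computation from that buffer. Only interior elements 1..n-2 are
 * written, matching the loop bounds of the species example. */
void neighbourhood_skeleton(float *out, const float *in, size_t n, stencil_fn f) {
    float local[TILE + 2];
    if (n < 3) return;
    for (size_t base = 1; base + 1 < n; base += TILE) {
        size_t len = n - 1 - base;            /* interior elements left */
        if (len > TILE) len = TILE;
        for (size_t k = 0; k < len + 2; k++)  /* load tile and halo */
            local[k] = in[base - 1 + k];
        for (size_t k = 0; k < len; k++)      /* compute from the local copy */
            out[base + k] = f(&local[k + 1]);
    }
}
```

A call such as neighbourhood_skeleton(out, in, n, average3) then corresponds to instantiating the skeleton with the filter computation. Swapping in another stencil_fn reuses the tiling and halo logic unchanged, which is the point of having a pre-implemented skeleton.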


26 Further reading
Compiler vectorisation:
- Auto-vectorization of interleaved data for SIMD (paper). D. Nuzman, I. Rosen, A. Zaks, 2006
Roofline model:
- Roofline: an insightful visual performance model for multicore architectures (paper). S. Williams, A. Waterman, D. Patterson, Communications of the ACM, 2009
Memory access patterns:
- Patterns for parallel programming (book). T.G. Mattson, B.A. Sanders, B.L. Massingill, 2004
- The landscape of parallel computing research: A view from Berkeley (paper). K. Asanovic, R. Bodik, B.C. Catanzaro, et al., 2006
- Algorithmic species revisited: A program code classification based on array references (paper). C. Nugteren, R. Corvino, H. Corporaal
