Code optimization in a 3D diffusion model
1 Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18th 2016, Barcelona
2 Agenda Background Diffusion algorithm Performance: baseline Scaling: OpenMP Vectorization: #pragma simd Peeling out Note on bandwidth Summary
3 References Ref: Chapter 4 of Intel's Xeon Phi Coprocessor High Performance Programming Author of the code: Naoya Maruyama of RIKEN Advanced Institute for Computational Science in Japan Simulate diffusion of a solute through a volume of liquid over time within a 3D container A three-dimensional seven-point stencil operation is used
4 Diffusion model Diffusion of a solute over time through an enclosed volume
5 The diffusion equation ϕ(r, t) is the density of the diffusing material at location r and time t. D(ϕ, r) is the collective diffusion coefficient for density ϕ at location r. If D is constant, the equation reduces to the simple diffusion equation.
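The slide's formula images did not survive transcription; in standard form (a reconstruction from the definitions above, not copied from the deck) the diffusion equation and its constant-D reduction read:

```latex
\frac{\partial \phi(\mathbf{r},t)}{\partial t}
  = \nabla \cdot \bigl( D(\phi,\mathbf{r})\,\nabla \phi(\mathbf{r},t) \bigr),
\qquad
\frac{\partial \phi}{\partial t} = D\,\nabla^{2}\phi \quad\text{for constant } D.
```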
6 Numerical approach: Finite differences Regular meshing in 3D Forward time centred space (FTCS) Threading Vectorization MPI Domain decomposition Hybrid computing
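The FTCS update the slide's lost "where" equation pointed at can be reconstructed (standard second-order central differences; a reconstruction, not the deck's own typesetting):

```latex
\phi^{n+1}_{x,y,z} = \phi^{n}_{x,y,z}
  + D\,\Delta t\left[
      \frac{\phi^{n}_{x+1,y,z} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x-1,y,z}}{\Delta x^{2}}
    + \frac{\phi^{n}_{x,y+1,z} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x,y-1,z}}{\Delta y^{2}}
    + \frac{\phi^{n}_{x,y,z+1} - 2\phi^{n}_{x,y,z} + \phi^{n}_{x,y,z-1}}{\Delta z^{2}}
  \right]
```

Collecting terms gives the per-neighbour weights used by the kernels later in the deck: cw = ce = DΔt/Δx², cn = cs = DΔt/Δy², cb = ct = DΔt/Δz², and cc = 1 − (cw + ce + cn + cs + cb + ct).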
7 Seven-point stencil [Figure: 3D stencil on x, y, z axes with North/South, East/West, Top/Bottom neighbours] Used to calculate the diffusion of a solute through a liquid volume.
8 Diffusion algorithm in principle

    for (i = 0; i < niter; i++) {                  // the time loop
      for (z = 0; z < nz; z++)                     // walk the mesh
        for (y = 0; y < ny; y++)
          for (x = 0; x < nx; x++)
            f2[z,y,x] = cc*f1[z,y,x] + cw*f1[z,y,x-1] + ce*f1[z,y,x+1]
                      + cn*f1[z,y-1,x] + cs*f1[z,y+1,x]
                      + cb*f1[z-1,y,x] + ct*f1[z+1,y,x];   // update the mesh
      temp = f2; f2 = f1; f1 = temp;               // switch buffers
    }
9 Boundary conditions Molecular density for sub-volumes that sit next to the edges of our container The boundary conditions occur for any sub-volume that has x = 0, y = 0, or z = 0, or x = nx-1, y = ny-1, or z = nz-1 (the sides of the box) Replace the value of the neighbour volume with the target central density value to get a reasonable approximation of the diffusion at that point Bounds check: no overstepping the bounds! Reshape the code: linearize f1[] and f2[] using the stencil indices by adding w, e, n, s, b, t (west, east, north, south, bottom, top) variables
10 Diffusion base kernel

    for (int i = 0; i < count; ++i) {
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
          for (int x = 0; x < nx; x++) {
            int c, w, e, n, s, b, t;
            c = x + y * nx + z * nx * ny;          // boundary coordinates
            w = (x == 0)      ? c : c - 1;
            e = (x == nx - 1) ? c : c + 1;
            n = (y == 0)      ? c : c - nx;
            s = (y == ny - 1) ? c : c + nx;
            b = (z == 0)      ? c : c - nx * ny;
            t = (z == nz - 1) ? c : c + nx * ny;
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
          }
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;       // switch buffers
    }
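A self-contained, compilable restatement of the base kernel (the names `diffusion_step` and `run_diffusion` are mine, not from the deck). With symmetric coefficients the clamped boundaries conserve the total amount of solute, which makes a handy correctness check:

```c
#include <stdlib.h>

typedef double REAL;

/* One FTCS sweep with clamped boundaries, as in the base kernel above. */
static void diffusion_step(const REAL *f1, REAL *f2, int nx, int ny, int nz,
                           REAL cc, REAL cw, REAL ce, REAL cn, REAL cs,
                           REAL cb, REAL ct) {
  for (int z = 0; z < nz; z++)
    for (int y = 0; y < ny; y++)
      for (int x = 0; x < nx; x++) {
        int c = x + y * nx + z * nx * ny;
        int w = (x == 0)      ? c : c - 1;
        int e = (x == nx - 1) ? c : c + 1;
        int n = (y == 0)      ? c : c - nx;
        int s = (y == ny - 1) ? c : c + nx;
        int b = (z == 0)      ? c : c - nx * ny;
        int t = (z == nz - 1) ? c : c + nx * ny;
        f2[c] = cc * f1[c] + cw * f1[w] + ce * f1[e]
              + cn * f1[n] + cs * f1[s] + cb * f1[b] + ct * f1[t];
      }
}

/* Diffuse a unit spike for `count` sweeps; return the total solute left.
   Symmetric coefficients + clamped boundaries conserve it, so the result
   should stay at 1.0. */
REAL run_diffusion(int nx, int ny, int nz, int count) {
  int npts = nx * ny * nz;
  REAL k = 0.1;                       /* D*dt/dx^2, inside the stability limit */
  REAL cc = 1.0 - 6.0 * k;
  REAL *f1 = calloc((size_t)npts, sizeof(REAL));
  REAL *f2 = calloc((size_t)npts, sizeof(REAL));
  f1[npts / 2] = 1.0;                 /* unit spike in the middle */
  for (int i = 0; i < count; ++i) {
    diffusion_step(f1, f2, nx, ny, nz, cc, k, k, k, k, k, k);
    REAL *tmp = f1; f1 = f2; f2 = tmp;    /* switch buffers */
  }
  REAL total = 0.0;
  for (int c = 0; c < npts; ++c) total += f1[c];
  free(f1); free(f2);
  return total;
}
```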
11 Diffusion: baseline code diffusion_base.c
12 Performance metrics Floating-point performance: f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e] + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t] is 13 floating-point operations per inner-loop iteration Memory bandwidth (in GB/s): the number of bytes of volume data read and written during the call
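The two metrics can be computed as below (a sketch; the constant 13 is the slide's per-point flop count, and the byte count assumes one REAL read from f1 and one REAL written to f2 per point per sweep, which is how I read the slide — actual cache traffic on real hardware will differ):

```c
#include <stddef.h>

/* Floating-point rate in MFlops: 13 flops per mesh point per sweep. */
double mflops(int nx, int ny, int nz, long count, double elapsed_s) {
  double flops = 13.0 * (double)nx * (double)ny * (double)nz * (double)count;
  return flops / elapsed_s * 1.0e-6;
}

/* Memory bandwidth in GB/s: one REAL read and one REAL written
   per mesh point per sweep (volume data only). */
double bandwidth_gbs(int nx, int ny, int nz, long count,
                     double elapsed_s, size_t sizeof_real) {
  double bytes = 2.0 * (double)sizeof_real
               * (double)nx * (double)ny * (double)nz * (double)count;
  return bytes / elapsed_s * 1.0e-9;
}
```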
13 Baseline
14 Compilation: native, aggressive, for the Intel Xeon Phi

    $ icc -g -openmp -mmic -std=c99 -O3 -vec-report=3 diffusion_base.c -o diffusion_base

(-openmp: OpenMP switch; -mmic: build natively for the Xeon Phi; -O3: aggressive optimization; -vec-report=3: vectorization reports)

Environment, set on the Xeon Phi (what and where):

    export OMP_NUM_THREADS=<1 .. 228>
    export KMP_AFFINITY=<compact | scatter>

Execution on the Xeon Phi (ssh to the mic card and run natively):

    % ./diffusion_base
15 Runtime results

    diffusion_base_xphi thread num = 1 affinity =
    count is 65
    Running diffusion kernel 65 times
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09
16 vtune Run on the host; requires a script, run.sh:

    #!/bin/bash
    source /home/rogerphilp/psxevars.sh
    export OMP_NUM_THREADS=1
    export KMP_AFFINITY=
    echo diffusion_base_xphi thread num = ${OMP_NUM_THREADS} affinity = ${KMP_AFFINITY}
    /home/rogerphilp/diffusion/diffusion_base_xphi
17 vtune baseline statistics CPI may be too high: memory stalls, instruction starvation, branch misprediction, or long-latency instructions
18 Baseline thread analysis But only one core
19 vtune analysis of the baseline code Where the time is being spent CPU activity
20 vtune analysis of the diffusion_baseline code Red = regions of poor performance
21 Baseline vectorization report (diffusion_base.optrpt, -vec-report=3)

    diffusion_base.c(103,3)   outer loop:
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(106,9)   inner loop (f1/f2 dependency):
      remark #15344: loop was not vectorized: vector dependence prevents vectorization.
        First dependence is shown below. Use level 5 report for details
      remark #15346: vector dependence: assumed FLOW dependence between f2 (line 115) and f1 (line 115)
    diffusion_base.c(105,7)   middle loop (f1/f2 dependency):
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    diffusion_base.c(104,5)   temporal loop:
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
22 Performance requirements To improve performance we initially need two key elements Scaling: OpenMP directives Vectorization: SIMD pragmas
23 Scaling: OpenMP
24 Scaling: OpenMP See updated function diffusion_openmp() #pragma omp parallel And collapse the z and y loops with #pragma omp for collapse(2) Effectively creating a loop for (zy = 0; zy < nz*ny; ++zy) Enables each thread to be assigned larger chunks of data Allows more efficiency on each pass through the loop
25 diffusion_omp.c

    #pragma omp parallel                   // section is marked as parallel
    {
      REAL *f1_t = f1, *f2_t = f2;
      for (int i = 0; i < count; ++i) {    // each thread gets the same index
    #pragma omp for collapse(2)            // z and y loops are collapsed
        for (int z = 0; z < nz; z++) {
          for (int y = 0; y < ny; y++) {
            for (int x = 0; x < nx; x++) {
              ...
              f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                      + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
              ...
            }
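A minimal compilable illustration of the collapse (a sketch: it uses the combined `#pragma omp parallel for` form rather than the deck's split parallel/for regions, and a toy scaling kernel in place of the stencil body; `scale_volume` is my name):

```c
/* The z and y loops are collapsed into one iteration space of nz*ny rows,
   so many threads can each be handed whole x-rows of length nx. */
void scale_volume(double *out, const double *in,
                  int nx, int ny, int nz, double a) {
#pragma omp parallel for collapse(2)
  for (int z = 0; z < nz; z++)
    for (int y = 0; y < ny; y++)
      for (int x = 0; x < nx; x++) {
        int c = x + y * nx + z * nx * ny;   /* same linearization as the kernel */
        out[c] = a * in[c];
      }
}
```

Without OpenMP enabled the pragma is ignored and the loops simply run serially, so the function stays correct either way.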
26 diffusion_omp.c: compact; 228 threads

    diffusion_omp_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_base_xphi FLOPS : (MFlops)
27 Speedup Experiment with the number of threads per core. [Chart: speedup using OpenMP vs. number of threads, for omp compact and omp scatter]
28 [vtune screenshots: 228-thread compact OpenMP CPU usage, where the CPI rate is better at 4.39, vs. 228-thread scatter OpenMP CPU usage, where the CPI rate is worse]
29 Diffusion OpenMP CPU usage histograms [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
30 Diffusion OpenMP usage histograms [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
31 diffusion_omp.c: compact vs. scatter
32 Hotspots [threads = 228, affinity = compact vs. threads = 228, affinity = scatter]
33 OpenMP vectorization report (diffusion_omp.optrpt, -vec-report=3)

    temporal loop: diffusion_omp.c(106,5) inlined into diffusion_omp.c(202,3)
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    middle loop (f1/f2 dependency): diffusion_omp.c(108,7) inlined into diffusion_omp.c(202,3)
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
    inner loop (f1/f2 dependency): diffusion_omp.c(110,11) inlined into diffusion_omp.c(202,3)
      remark #15346: vector dependence: assumed FLOW dependence between f2_t line 119 and f1_t line
34 Vectorization
35 Forcing vectorization Add the vectorization pragma #pragma simd: tells the compiler to ignore suspected dependencies
36 diffusion_ompvect.c

    #pragma omp parallel
    for (int i = 0; i < count; ++i) {
    #pragma omp for collapse(2)
      for (int z = 0; z < nz; z++) {
        for (int y = 0; y < ny; y++) {
    #pragma simd                           // ignore suspected dependencies
          for (int x = 0; x < nx; x++) {
            ...
            f2_t[c] = cc * f1_t[c] + cw * f1_t[w] + ce * f1_t[e]
                    + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t];
            ...
          }
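`#pragma simd` is Intel's legacy spelling; the standardized OpenMP 4.0 form is `#pragma omp simd`, used in this sketch. A toy 1-D variant with only the west/centre/east terms keeps it short, and `restrict` gives the compiler the same no-aliasing promise for the two buffers that the pragma asserts (`row_update` is my name):

```c
/* Update one x-row with the dependence assumption overridden, so the
   compiler vectorizes despite f1/f2 looking like they might alias. */
void row_update(double *restrict f2, const double *restrict f1,
                int nx, double cc, double cw, double ce) {
#pragma omp simd
  for (int x = 1; x < nx - 1; x++)
    f2[x] = cc * f1[x] + cw * f1[x - 1] + ce * f1[x + 1];
}
```

As with the kernel in the deck, the pragma is a promise to the compiler: if the two pointers really did overlap, forcing vectorization would silently produce wrong answers.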
37 diffusion_ompvect: compact

    diffusion_ompvect_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_omp_base_xphi FLOPS : (MFlops)
38 Speedup [Chart: speedup using OpenMP and vectorization vs. number of threads, for ompvec compact/scatter and omp compact/scatter]
39 diffusion_ompvect: threads = 228, affinity = compact. CPI has increased over OpenMP alone.
40 [Comparison: OpenMP only vs. OpenMP + vectorization, threads = 228, affinity = compact, OpenMP usage]
41 Diffusion: threads = 228, affinity = compact [OpenMP + vectorization vs. OpenMP only]
42 OpenMP vectorization report (diffusion_ompvect.optrpt, -vec-report=3)

    diffusion_ompvect.c(111,11)
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15476: scalar loop cost: 66
      remark #15477: vector loop cost:
      remark #15478: estimated potential speedup:
    diffusion_ompvect.c(111,11)
      remark #15301: REMAINDER LOOP WAS VECTORIZED
    diffusion_ompvect.c(111,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
43 Peel and remainder
44 Boundary update: pesky boundaries [Inner mesh vs. boundary mesh] Currently the boundary update is mixed in with the kernel, and its conditionals can cause vectorization issues. The boundary update only has to occur before the buffer pointer swap of f1 and f2, so it can run before, after, or before and after the execution of the main kernel.
45 New main kernel: diffusion_peel

    #pragma simd
    for (x = 1; x < nx-1; x++) {           // starts at index 1, ends at nx-2
      ++c; ++n; ++s; ++b; ++t;
      f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c+1]
              + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
    }

No explicit conditionals: the new vectorised peeled kernel.
46

    int x, c, n, s, b, t;
    x = 0;                                 // first boundary (x == 0) updated
    c = x + y*nx + z*nx*ny;
    n = (y == 0)    ? c : c - nx;
    s = (y == ny-1) ? c : c + nx;
    b = (z == 0)    ? c : c - nx*ny;
    t = (z == nz-1) ? c : c + nx*ny;
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c] + ce*f1_t[c+1]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];

    // new simd kernel goes here

    ++c; ++n; ++s; ++b; ++t;               // second boundary (x == nx-1) updated
    f2_t[c] = cc*f1_t[c] + cw*f1_t[c-1] + ce*f1_t[c]
            + cs*f1_t[s] + cn*f1_t[n] + cb*f1_t[b] + ct*f1_t[t];
        }
      }
      REAL *t = f1_t; f1_t = f2_t; f2_t = t;   // system updated
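Putting slide 45's interior loop together with these boundary updates, one full x-row looks roughly as follows (a sketch: `update_row` is my name, and `#pragma omp simd` stands in for the deck's Intel-specific `#pragma simd`; indices follow the deck's x + y*nx + z*nx*ny layout):

```c
/* One x-row of diffusion_peel: explicit x == 0 and x == nx-1 boundary
   points, plus a branch-free interior loop that vectorizes cleanly. */
void update_row(double *f2, const double *f1, int nx, int ny, int nz,
                int y, int z, double cc, double cw, double ce,
                double cn, double cs, double cb, double ct) {
  int c = y * nx + z * nx * ny;            /* x == 0 */
  int n = (y == 0)      ? c : c - nx;
  int s = (y == ny - 1) ? c : c + nx;
  int b = (z == 0)      ? c : c - nx * ny;
  int t = (z == nz - 1) ? c : c + nx * ny;
  /* west neighbour clamps to the point itself on the x == 0 face */
  f2[c] = cc * f1[c] + cw * f1[c] + ce * f1[c + 1]
        + cs * f1[s] + cn * f1[n] + cb * f1[b] + ct * f1[t];
#pragma omp simd
  for (int x = 1; x < nx - 1; x++)         /* interior: no conditionals */
    f2[c + x] = cc * f1[c + x] + cw * f1[c + x - 1] + ce * f1[c + x + 1]
              + cs * f1[s + x] + cn * f1[n + x]
              + cb * f1[b + x] + ct * f1[t + x];
  int cl = c + nx - 1;                     /* x == nx-1: east clamps to itself */
  f2[cl] = cc * f1[cl] + cw * f1[cl - 1] + ce * f1[cl]
         + cs * f1[s + nx - 1] + cn * f1[n + nx - 1]
         + cb * f1[b + nx - 1] + ct * f1[t + nx - 1];
}
```

With coefficients that sum to one, a uniform row maps to itself, which gives a quick sanity check that the three pieces agree at the seams.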
47 diffusion_peel.c: compact, 228 threads

    diffusion_peel_xphi thread num = 228 affinity = compact
    Running diffusion kernel 6225 times with 228 threads
    Elapsed time : (s)
    FLOPS : (MFlops)
    Throughput : (GB/s)
    Accuracy : e-09

For comparison, diffusion_base_xphi FLOPS : (MFlops)
48 diffusion_peel.c [Chart: speedup vs. number of threads for omp, ompvec, and ompvecpl with balanced (b), compact (c), and scatter (s) affinities]
49 diffusion_peel.c: compact, 228 threads. CPI is higher still.
50 Affinity = compact, 228 threads [OpenMP + vectorization + peel vs. OpenMP + vectorization]
51 OpenMP vectorization report (diffusion_peel.optrpt, -vec-report=3)

    diffusion_peel.c(120,11)
      remark #15301: PEEL LOOP WAS VECTORIZED
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 7
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 52
      remark #15477: vector loop cost:
      remark #15478: estimated potential speedup:
52 A note on bandwidth [Chart: bandwidth (GB/s) vs. number of threads for ompvecpl, ompvec, and omp with compact and scatter affinities] The problem may move from compute bound to memory bound.
53 Note Intel's compilers are in a constant state of improvement, particularly with regard to finding vectorization opportunities. The compiler version used here was unable to vectorize automatically; a newer compiler version may succeed, but this one did not. The compiler may need a little more information to vectorize.
54 Summary
55 Overview of the procedure for optimizing the diffusion code: with -O3 optimisation throughout, profile the baseline, then OpenMP, then OpenMP + vectorization, then peel, reading the opt-report at each step; then consider tiling, affinity, and thread count.
56 diffusion_peel.c [Chart: speedup vs. number of threads for omp, ompvec, and ompvecpl with balanced (b), compact (c), and scatter (s) affinities]
57 Summary Compiled everything with -O3 Generated a baseline performance figure Applied multiple thread counts Applied two affinities: compact and scatter Reviewed the optimisation reports Analysed the program using vtune to find hotspots As a consequence we achieved a speedup of 400 times
58 Thank you
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationHybrid MPI - A Case Study on the Xeon Phi Platform
Hybrid MPI - A Case Study on the Xeon Phi Platform Udayanga Wickramasinghe Center for Research on Extreme Scale Technologies (CREST) Indiana University Greg Bronevetsky Lawrence Livermore National Laboratory
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationOpenMP: Vectorization and #pragma omp simd. Markus Höhnerbach
OpenMP: Vectorization and #pragma omp simd Markus Höhnerbach 1 / 26 Where does it come from? c i = a i + b i i a 1 a 2 a 3 a 4 a 5 a 6 a 7 a 8 + b 1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 = c 1 c 2 c 3 c 4 c 5 c
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationConvey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort?
Convey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort? Björn Meyer, Jörn Schumacher, Christian Plessl, Jens Förstner University of Paderborn, Germany 2 ), - 4 * 4 + - 6-4.
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More informationBenchmark results on Knight Landing architecture
Benchmark results on Knight Landing architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Milano, 21/04/2017 KNL vs BDW A1 BDW A2 KNL cores per node 2 x 18 @2.3 GHz 1 x 68
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationAutomatic Polyhedral Optimization of Stencil Codes
Automatic Polyhedral Optimization of Stencil Codes ExaStencils 2014 Stefan Kronawitter Armin Größlinger Christian Lengauer 31.03.2014 The Need for Different Optimizations 3D 1st-grade Jacobi smoother Speedup
More informationIntel profiling tools and roofline model. Dr. Luigi Iapichino
Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed
More informationRevealing the performance aspects in your code
Revealing the performance aspects in your code 1 Three corner stones of HPC The parallelism can be exploited at three levels: message passing, fork/join, SIMD Hyperthreading is not quite threading A popular
More informationCode Optimization Process for KNL. Dr. Luigi Iapichino
Code Optimization Process for KNL Dr. Luigi Iapichino luigi.iapichino@lrz.de About the presenter Dr. Luigi Iapichino Scientific Computing Expert, Leibniz Supercomputing Centre Member of the Intel Parallel
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationOpenMP on the IBM Cell BE
OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationSymmetric Computing. Jerome Vienne Texas Advanced Computing Center
Symmetric Computing Jerome Vienne Texas Advanced Computing Center Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently
More informationSIMD Exploitation in (JIT) Compilers
SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More information