Finite Difference Simulations on GPU Clusters: How far can you push 1D domain decomposition?

1 Finite Difference Simulations on GPU Clusters: How far can you push 1D domain decomposition?
Pierre Wahl, Hugo Thienpont, Andrew Adinetz, Eugen Trebunski, Maxim Milakov, Jiri Kraus
GPU Technology Conference, San Jose, March 2014

2 FDTD in photonics: Light-matter interaction

3 Belgium California Light Machine (B-CALM) Finite Difference (Time Domain)

4 Outline
- Question 1: Port B-CALM to GPU? Optimize performance on a single GPU. Tricks to maximize memory bandwidth.
- Question 2: Port B-CALM to GPU clusters. Overlapping computation and communication.
- Question 3: Limits of 1D domain decomposition. Performance model.

5 Porting B-CALM to GPU (or any other Finite Difference code)

6 The Algorithm
Start -> <<<Update_E>>> -> <<<Update_H>>> -> Last iteration? No: back to <<<Update_E>>>. Yes: done.
Halo of 1.
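The loop on this slide can be sketched in a few lines of plain Python. This is a minimal 1D illustration in normalized units, not the B-CALM code: the functions `update_e` and `update_h` stand in for the `<<<Update_E>>>` and `<<<Update_H>>>` kernels, and the grid size, step count, Courant number, and initial pulse are all assumptions made for the example.

```python
# Minimal 1D FDTD sketch (normalized units). Names and parameters are
# illustrative assumptions; this is not the B-CALM implementation.

def update_e(e, h, c):
    # Each E cell reads its own H and the H one cell to the left (halo of 1).
    for i in range(1, len(e)):
        e[i] += c * (h[i] - h[i - 1])

def update_h(e, h, c):
    # Each H cell reads its own E and the E one cell to the right (halo of 1).
    for i in range(len(h) - 1):
        h[i] += c * (e[i + 1] - e[i])

def run(n_cells=64, n_steps=20, courant=0.5):
    e = [0.0] * n_cells
    h = [0.0] * n_cells
    e[n_cells // 2] = 1.0  # initial pulse in the middle of the grid
    for _ in range(n_steps):  # loop until the last iteration
        update_e(e, h, courant)
        update_h(e, h, courant)
    return e, h
```

Each update touches only nearest neighbours, which is why a halo of one cell is enough when the domain is later split across GPUs.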

7 Use of shared memory (1)

9 Use of shared memory (1)
Trick: Stage your loads
1. Load from GPU RAM into temporary variables.
2. Write the temporaries into shared memory.
% faster.

10 Use of shared memory (2)

13 Use of shared memory (2)
Trick: Toggle pointer
- Instead of copying into the shared-memory array: A[i][j] = B[i][j]
- Switch the pointer to the shared-memory array: A[t][i][j] for t = 0, 1, with t = 1 - t each time.
5% faster.
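The toggle idea is independent of CUDA, so it can be illustrated on the CPU. The sketch below is a hypothetical Python analogue, not the shared-memory kernel from the talk: it keeps two buffers `buf[0]` and `buf[1]`, reads from `buf[t]`, writes to `buf[1 - t]`, and flips `t` instead of copying the result back. The averaging stencil and the name `relax` are assumptions chosen only to make the example runnable.

```python
def relax(initial, n_steps):
    """Apply a 3-point averaging stencil n_steps times using two toggled buffers."""
    n = len(initial)
    # buf[t] is the current state, buf[1 - t] is the scratch buffer.
    # Both start as copies so the fixed boundary values exist in each.
    buf = [list(initial), list(initial)]
    t = 0
    for _ in range(n_steps):
        src, dst = buf[t], buf[1 - t]
        for i in range(1, n - 1):
            dst[i] = 0.5 * (src[i - 1] + src[i + 1])
        t = 1 - t  # toggle the index instead of copying dst back into src
    return buf[t]
```

The copy that the toggle removes is an extra pass over the whole array on every step; on the GPU the slide reports that avoiding it was worth about 5%.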

14 Porting B-CALM to Multi-GPU (or any other Finite Difference code)

15 Port B-CALM on GPU clusters
- Main goal: aggregate memory, so weak scaling is important.
- Transfer borders.
- Overlap communication and computation.
- 1D domain decomposition: simplest possible, no gather operations needed.

16 Overlap communication and computation
- No overlap: <<<Update_E>>>, Copy Border E down, <<<Update_H>>>, Copy Border H up.
- With overlap, stream 1: <<<Update_E>>>, <<<Update_H>>>.
- With overlap, stream 2: <<<UpBorderE>>>, Copy Border E down, <<<UpBorderH>>>, Copy Border H up.
- Sync MPI threads and CUDA streams.

17 Limits of 1D domain decomposition (evaluated using a performance model)

18 Estimate scalability: simple performance model
- Stream 1: <<<Update_E>>> ($t_2$), then <<<Update_H>>> ($t_2$).
- Stream 2: <<<UpBorderE>>> ($t_1$) and Copy Border E down ($t_3$), then <<<UpBorderH>>> ($t_1$) and Copy Border H up ($t_3$).
- Total time: $T_{tot} = t_1 + \max(t_2, t_3)$

20 Estimate scalability: Performance model
- Kernels ($t_1$, $t_2$), scale with volume: $t_{1,2} = \text{latency}_{kernel} + \dfrac{\#\text{cells per card}}{\text{throughput}_{kernel}}$
- Communication ($t_3$), scales with surface: $t_3 = \text{latency}_{comm} + \dfrac{\text{bytes/cell} \cdot \#\text{comm cells per card}}{\text{bandwidth}_{comm}}$
- Variables can be estimated, or measured with nvprof, before any coding effort.
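The model on these slides is small enough to write down directly. The sketch below is one possible reading, not the authors' tool: it takes throughput in cells per second and bandwidth in bytes per second (so work is divided by the rate), combines the terms as $T_{tot} = t_1 + \max(t_2, t_3)$, and assumes the border is a single x-y layer. The function names and the parameter dictionary keys are hypothetical; real values would come from estimates or nvprof measurements.

```python
def kernel_time(latency_s, cells_per_s, n_cells):
    """t_1 or t_2: fixed launch latency plus time proportional to cell count."""
    return latency_s + n_cells / cells_per_s

def comm_time(latency_s, bytes_per_s, bytes_per_cell, n_comm_cells):
    """t_3: fixed message latency plus time to move the border cells."""
    return latency_s + bytes_per_cell * n_comm_cells / bytes_per_s

def total_time(t1, t2, t3):
    """Border kernel first, then bulk kernel overlapped with the border copy."""
    return t1 + max(t2, t3)

def half_step_time(nx, ny, z_layers, p):
    """Model one field update for a slab of nx * ny * z_layers cells per card."""
    border_cells = nx * ny           # assumption: one z-layer is exchanged
    bulk_cells = nx * ny * z_layers  # assumption: bulk kernel covers the slab
    t1 = kernel_time(p["kernel_latency_s"], p["cells_per_s"], border_cells)
    t2 = kernel_time(p["kernel_latency_s"], p["cells_per_s"], bulk_cells)
    t3 = comm_time(p["comm_latency_s"], p["bytes_per_s"],
                   p["bytes_per_cell"], border_cells)
    return total_time(t1, t2, t3)
```

In a 1D decomposition the border size does not shrink as more GPUs are added, so $t_3$ stays constant while $t_2$ falls with the number of z-layers per card. Sweeping `z_layers` in `half_step_time` shows where the copy stops being hidden behind the bulk kernel.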

22 B-CALM on GPU clusters
Figure: regimes $t_3 > t_2$ and $t_2 > t_3$ versus $Z_l$. Marked value: $Z_l = 10$ z-layers per GPU.
* Test system: Fermi (C2070, 2 per node), Intel Xeon 5650, InfiniBand QDR HBA, CUDA-aware MVAPICH2.

23 Scaling limits of 1D domain decomposition
- $N_x \cdot N_y \cdot Z_l \cdot \dfrac{\text{bytes}}{\text{cell}} = \text{card memory}$
- $N_x \cdot N_y \cdot 10 \cdot 40\,\dfrac{\text{bytes}}{\text{cell}} = 6\,\text{GB}$
- $N_x = N_y = 3872$
- Since $N_z > N_{x,y}$
- Scaling limit $= N_z / Z_l = 387$
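The numbers on this slide can be reproduced with a short calculation. The sketch below assumes 6 GB means 6 x 10^9 bytes, a square x-y cross-section, 10 z-layers per GPU, and 40 bytes per cell. It also takes $N_z$ equal to the 3872 found for $N_x$ and $N_y$, which is the value that yields the slide's limit of 387. The function names are hypothetical.

```python
import math

def max_square_side(card_memory_bytes, z_layers, bytes_per_cell):
    """Largest N with N * N * z_layers * bytes_per_cell <= card_memory_bytes."""
    return math.isqrt(card_memory_bytes // (z_layers * bytes_per_cell))

def max_gpus(n_z, z_layers):
    """Number of slabs of z_layers that fit along z: the 1D scaling limit."""
    return n_z // z_layers

side = max_square_side(6 * 10**9, z_layers=10, bytes_per_cell=40)
limit = max_gpus(side, z_layers=10)
print(side, limit)  # prints: 3872 387
```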

26 B-CALM on GPU clusters
Figure: regimes $t_3 > t_2$ and $t_2 > t_3$ versus $Z_l$. OK until ~500 GPUs.

27 B-CALM to do science
P. Wahl, T. Tanemura, N. Vermeulen, J. Van Erps, D. Miller and H. Thienpont, "Design of large scale plasmonic nanoslit arrays for arbitrary mode conversion and demultiplexing," Optics Express 22, (2014).
We needed: 384 GB of GPU RAM = 64 Fermi GPUs = 32 K40s

31 Conclusions
- Question 1: Port B-CALM to GPU? Use of shared memory. Tricks to maximize bandwidth.
- Question 2: Port B-CALM to GPU clusters. Overlapping computation and communication.
- Question 3: Limits of 1D domain decomposition. Performance model: easy using nvprof. The limits of 1D domain decomposition can be determined before porting.
