Progress in automatic GPU compilation and why you want to run MPI on your GPU



1 TORSTEN HOEFLER: Progress in automatic GPU compilation and why you want to run MPI on your GPU, with Tobias Grosser and Tobias Gysi @ SPCL, presented at CCDSC, Lyon, France, 2016

2 #pragma ivdep

3
    !$ACC DATA &
    !$ACC PRESENT(density1,energy1) &
    !$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
    !$ACC PRESENT(pre_vol,post_vol,ener_flux)
    !$ACC KERNELS
    IF(dir.EQ.g_xdir) THEN
      IF(sweep_number.EQ.1)THEN
    !$ACC LOOP INDEPENDENT
        DO k=y_min-2,y_max+2
    !$acc LOOP INDEPENDENT
          DO j=x_min-2,x_max+2
            pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
            post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
          ENDDO
        ENDDO
      ELSE
    !$ACC LOOP INDEPENDENT
        DO k=y_min-2,y_max+2
    !$acc LOOP INDEPENDENT
          DO j=x_min-2,x_max+2
            pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
            post_vol(j,k)=volume(j,k)
          ENDDO
        ENDDO

4 Heitlager et al.: A Practical Model for Measuring Maintainability

5
    !$ACC DATA &
    !$ACC COPY(chunk%tiles(1)%field%density0) &
    !$ACC COPY(chunk%tiles(1)%field%density1) &
    !$ACC COPY(chunk%tiles(1)%field%energy0) &
    !$ACC COPY(chunk%tiles(1)%field%energy1) &
    !$ACC COPY(chunk%tiles(1)%field%pressure) &
    !$ACC COPY(chunk%tiles(1)%field%soundspeed) &
    !$ACC COPY(chunk%tiles(1)%field%viscosity) &
    !$ACC COPY(chunk%tiles(1)%field%xvel0) &
    !$ACC COPY(chunk%tiles(1)%field%yvel0) &
    !$ACC COPY(chunk%tiles(1)%field%xvel1) &
    !$ACC COPY(chunk%tiles(1)%field%yvel1) &
    !$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
    !$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
    !$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
    !$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
    !$ACC COPY(chunk%tiles(1)%field%volume) &
    !$ACC COPY(chunk%tiles(1)%field%work_array1) &
    !$ACC COPY(chunk%tiles(1)%field%work_array2) &
    !$ACC COPY(chunk%tiles(1)%field%work_array3) &
    !$ACC COPY(chunk%tiles(1)%field%work_array4) &
    !$ACC COPY(chunk%tiles(1)%field%work_array5) &
    !$ACC COPY(chunk%tiles(1)%field%work_array6) &
    !$ACC COPY(chunk%tiles(1)%field%work_array7) &
    !$ACC COPY(chunk%tiles(1)%field%cellx) &
    !$ACC COPY(chunk%tiles(1)%field%celly) &
    !$ACC COPY(chunk%tiles(1)%field%celldx) &
    !$ACC COPY(chunk%tiles(1)%field%celldy) &
    !$ACC COPY(chunk%tiles(1)%field%vertexx) &
    !$ACC COPY(chunk%tiles(1)%field%vertexdx) &
    !$ACC COPY(chunk%tiles(1)%field%vertexy) &
    !$ACC COPY(chunk%tiles(1)%field%vertexdy) &
    !$ACC COPY(chunk%tiles(1)%field%xarea) &
    !$ACC COPY(chunk%tiles(1)%field%yarea) &
    !$ACC COPY(chunk%left_snd_buffer) &
    !$ACC COPY(chunk%left_rcv_buffer) &
    !$ACC COPY(chunk%right_snd_buffer) &
    !$ACC COPY(chunk%right_rcv_buffer) &
    !$ACC COPY(chunk%bottom_snd_buffer) &
    !$ACC COPY(chunk%bottom_rcv_buffer) &
    !$ACC COPY(chunk%top_snd_buffer) &
    !$ACC COPY(chunk%top_rcv_buffer)
Sloccount: *f90: 6,440 lines; !$ACC: 833 lines (13%)


7
    do i = 0, N
      do j = 0, i
        y(i,j) = ( y(i,j) + y(i,j+1) )/2
      end do
    end do
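One way to read the loop nest on slide 7 is as a triangular iteration space in which each row i touches only its own elements, while within a row the update of y(i,j) reads the not-yet-updated y(i,j+1). The C sketch below is an illustration under that reading, not code from the talk: the function names, the row-major layout, and the extra padding column are assumptions. It runs the rows in two different orders to show that the outer loop carries no dependence.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major accessor for a matrix with (n + 2) columns; the extra column
   keeps the read of y(i, j+1) in bounds when j == i == n. */
#define Y(y, n, i, j) ((y)[(size_t)(i) * ((size_t)(n) + 2) + (size_t)(j)])

/* Process one row i. The j loop runs in ascending order so that iteration j
   reads y(i, j+1) before iteration j+1 overwrites it. */
static void smooth_row(double *y, int n, int i) {
    for (int j = 0; j <= i; ++j)
        Y(y, n, i, j) = (Y(y, n, i, j) + Y(y, n, i, j + 1)) / 2;
}

/* Rows in the original order, i = 0..n (inclusive, as in the Fortran loop). */
void smooth_forward(double *y, int n) {
    assert(y != NULL && n >= 0);
    for (int i = 0; i <= n; ++i)
        smooth_row(y, n, i);
}

/* Rows in reverse order. Each row only reads and writes its own elements,
   so the result is the same as in smooth_forward. */
void smooth_reverse_rows(double *y, int n) {
    assert(y != NULL && n >= 0);
    for (int i = n; i >= 0; --i)
        smooth_row(y, n, i);
}
```

Because the row order does not matter, the rows could in principle be distributed across parallel workers; the inner loop would need a copy of the row or an ordered traversal to stay correct.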

8 Some results: Polybench 3.2. Speedup over icc -O3: geomean ~6x, arithmean ~30x. Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop). T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
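The gap between the geometric and the arithmetic mean on slide 8 is typical when a few benchmarks speed up far more than the rest. The C sketch below shows how the two means are computed. The function names are hypothetical and the values in the usage note are made up for illustration, not Polybench results. The n-th root is found by bisection so that the block needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n speedup factors. */
double arith_mean(const double *s, size_t n) {
    assert(s != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += s[i];
    return sum / (double)n;
}

/* x raised to a non-negative integer power. */
static double ipow(double x, size_t k) {
    double r = 1.0;
    for (size_t i = 0; i < k; ++i)
        r *= x;
    return r;
}

/* Geometric mean: the n-th root of the product of the factors, located by
   bisection on [0, max(1, largest factor)]. Assumes every factor is > 0. */
double geo_mean(const double *s, size_t n) {
    assert(s != NULL && n > 0);
    double prod = 1.0, lo = 0.0, hi = 1.0;
    for (size_t i = 0; i < n; ++i) {
        assert(s[i] > 0.0);
        prod *= s[i];
        if (s[i] > hi)
            hi = s[i];
    }
    for (int it = 0; it < 200; ++it) {
        double mid = 0.5 * (lo + hi);
        if (ipow(mid, n) < prod)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For the made-up speedups 1, 1, and 1000, the arithmetic mean is 334 while the geometric mean is about 10: a single outlier dominates the former far more than the latter.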

9 Compiles all of SPEC CPU 2006. Example: LBM. Mobile: essentially my 4-core x86 laptop with the (free) GPU that's in there. Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop). [Figure: runtime (m:s), axis 0:00 to 8:24, for icc, icc -openmp, clang, and Polly-ACC on Mobile and Workstation; annotations: ~20% and ~4x.] spcl.inf.ethz.ch



11 GPU latency hiding vs. MPI. CUDA: over-subscribe hardware; use spare parallel slack for latency hiding. MPI: host controlled; full device synchronization. [Figure legend: device, compute core, active thread, instruction latency.] T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)


12 Hardware latency hiding at the cluster level? dCUDA (distributed CUDA): unified programming model for GPU clusters; avoid unnecessary device synchronization to enable system-wide latency hiding. [Figure legend: device, compute core, active thread, instruction latency; operations shown: put, st.]


13 dCUDA: MPI-3 RMA extensions
    for (int i = 0; i < steps; ++i) {
      // computation: iterative stencil kernel, thread specific idx
      for (int idx = from; idx < to; idx += jstride)
        out[idx] = -4.0 * in[idx] +
                   in[idx + 1] + in[idx - 1] +
                   in[idx + jstride] + in[idx - jstride];

      // communication
      if (lsend)
        dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
      if (rsend)
        dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);
      dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

      swap(in, out); swap(win, wout);
    }
Key points: map ranks to blocks; device-side put/get operations; notifications for synchronization; shared and distributed memory.
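The offsets in the two dcuda_put_notify calls on slide 13 suggest a window laid out as a top halo row at [0, jstride), interior rows at [jstride, len + jstride), and a bottom halo row starting at len + jstride. The host-only C sketch below mocks that exchange between two neighbouring ranks in a single process to make the data movement concrete. Everything prefixed mock_ is hypothetical and is not the dCUDA API; the layout is an inference from the slide, and the mock copies memory directly instead of communicating.

```c
#include <assert.h>
#include <string.h>

enum { MOCK_WIN_CAP = 64 };  /* hypothetical fixed window capacity */

/* Hypothetical stand-in for one rank's window and its notification count. */
typedef struct {
    double data[MOCK_WIN_CAP];
    int pending;  /* notified puts that have arrived but are not yet consumed */
} mock_window;

/* Mock of a notified put: write `count` values at `offset` in the target
   window, then record one notification at the target. */
void mock_put_notify(mock_window *target, int offset, int count,
                     const double *src) {
    assert(offset >= 0 && count >= 0 && offset + count <= MOCK_WIN_CAP);
    memcpy(&target->data[offset], src, (size_t)count * sizeof(double));
    target->pending += 1;
}

/* Mock of waiting for notifications. A real implementation would block;
   this one returns 1 and consumes `expected` notifications if that many
   have arrived, and returns 0 otherwise. */
int mock_try_wait(mock_window *w, int expected) {
    if (w->pending < expected)
        return 0;
    w->pending -= expected;
    return 1;
}

/* One halo exchange between rank r (`left`) and rank r + 1 (`right`),
   using the same offsets as the slide. */
void mock_exchange(mock_window *left, mock_window *right,
                   int len, int jstride) {
    /* right rank's "lsend": its first interior row goes to the left
       rank's bottom halo at offset len + jstride */
    mock_put_notify(left, len + jstride, jstride, &right->data[jstride]);
    /* left rank's "rsend": its last interior row goes to the right
       rank's top halo at offset 0 */
    mock_put_notify(right, 0, jstride, &left->data[len]);
}
```

In this mock each rank ends up with one pending notification per neighbour that wrote to it, which mirrors the lsend + rsend count passed to dcuda_wait_notifications on the slide.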

14 Hardware supported communication overlap. [Figure: traditional MPI-CUDA vs. dCUDA; legend: device, compute core, active block.]

15 The dCUDA runtime system. Host-side: event handler, block manager, MPI. Queues: ack queue, notification queue, logging queue, command queue. Device-side: device library, context. [Figure label: more blocks.]

16 (Very) simple stencil benchmark. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. [Figure: execution time [ms] vs. # of copy iterations per exchange; series: no overlap, compute & exchange, halo exchange, compute only.]
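The series names on slide 16 can be related through a simple two-bound model: without overlap, an iteration costs the compute time plus the halo-exchange time, whereas with perfect overlap it cannot be faster than the longer of the two. The C sketch below encodes only this model. The function names are hypothetical and the numbers in the usage note are made up, not measurements from the talk.

```c
#include <assert.h>

/* Time per iteration when compute and halo exchange run back to back. */
double time_no_overlap(double compute_ms, double exchange_ms) {
    assert(compute_ms >= 0.0 && exchange_ms >= 0.0);
    return compute_ms + exchange_ms;
}

/* Lower bound with perfect overlap: the longer phase hides the shorter. */
double time_perfect_overlap(double compute_ms, double exchange_ms) {
    assert(compute_ms >= 0.0 && exchange_ms >= 0.0);
    return compute_ms > exchange_ms ? compute_ms : exchange_ms;
}
```

With made-up inputs of 3 ms compute and 2 ms exchange, the model gives 5 ms without overlap and a 3 ms lower bound with overlap; a measured "compute & exchange" time would be expected to fall between the two.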

17 Real stencil (COSMO weather/climate code). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. [Figure: execution time [ms] vs. # of nodes; series: MPI-CUDA, dCUDA, halo exchange.]

18 Particle simulation code (Barnes Hut). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. [Figure: execution time [ms] vs. # of nodes; series: MPI-CUDA, dCUDA, halo exchange.]

19 Sparse matrix-vector multiplication. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. [Figure: execution time [ms] vs. # of nodes; series: dCUDA, MPI-CUDA, communication.]


20 dCUDA: distributed memory
    for (int i = 0; i < steps; ++i) {
      for (int idx = from; idx < to; idx += jstride)
        out[idx] = -4.0 * in[idx] +
                   in[idx + 1] + in[idx - 1] +
                   in[idx + jstride] + in[idx - jstride];

      if (lsend)
        dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
      if (rsend)
        dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);
      dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

      swap(in, out); swap(win, wout);
    }
[Summary labels: Automatic, Regression Free, High Performance; Overlap, High Performance.]

21 LLVM Nightly Test Suite. [Figure: SCoPs by dimensionality (0-dim, 1-dim, 2-dim, 3-dim); No Heuristics vs. Heuristics.]

22 Cactus ADM (SPEC 2006). [Figure panels: Mobile, Workstation.]

23 Evading various ends: the hardware view

Progress in automatic GPU compilation and why you want to run MPI on your GPU. Torsten Hoefler (most work by Tobias Grosser, Tobias Gysi, Jeremia Baer). Presented at Saarland University, Saarbrücken, Germany. ETH