
Slide 1: A methodology to port production codes to GPU, learned from our experiments
G. Colin de Verdière, CEA, DAM, DIF, F-91297 Arpajon

Slide 2: Moving towards GPGPU
- Potentially important speedups; COMPLEX(8) usage is ideal
- The algorithm is the key («Amdahl for ever»)
- The most important part of the work: the code should be adapted
- A development methodology is needed

Slide 3: ALF: Amdahl's Law is Forever!
With n processors (n = #CPU) and a parallel fraction f of the code:

    Speedup(n, f) = 1 / ((1 - f) + f/n)

    lim (n -> infinity) Speedup(n, f) = 1 / (1 - f)

Asymptotic speedup s = 1/(1 - f) for increasing parallel fractions:

    f    0.00  0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  0.95  0.99  0.999
    s    1     1.1   1.25  1.42  1.66  2     2.5   3.33  5     10    20    100   1000

Think massively parallel!
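
A minimal sketch (not on the original slide) that evaluates this formula; for large n the printed limit matches the table above:

    PROGRAM amdahl
      IMPLICIT NONE
      REAL(8), PARAMETER :: fracs(5) = (/ 0.5d0, 0.9d0, 0.95d0, 0.99d0, 0.999d0 /)
      INTEGER :: i
      DO i = 1, SIZE(fracs)
        PRINT '(A,F6.3,A,F9.2,A,F9.1)', 'f =', fracs(i), &
              '  s(n=1024) =', speedup(1024, fracs(i)), &
              '  s(n->inf) =', 1.0d0 / (1.0d0 - fracs(i))
      END DO
    CONTAINS
      REAL(8) FUNCTION speedup(n, f)          ! Amdahl's law for n processors
        INTEGER, INTENT(IN) :: n
        REAL(8), INTENT(IN) :: f
        speedup = 1.0d0 / ((1.0d0 - f) + f / REAL(n, 8))
      END FUNCTION speedup
    END PROGRAM amdahl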

Slide 4: Methodology
- Step 1: reorganize the code
- Step 2: profiling
  - Use all available tools: gprof, timers, ...
  - Focus on the useful sections (the 80/20 rule)
  - First of all, optimize the sequential code
- Step 3: GPGPU adaptations
  - Choose the language (HMPP / PGI / CUDA / OpenCL); I recommend HMPP

Slide 5: Reorganize the source
- Architecture
- Interfaces of subroutines
- Dynamic memory
- Array utilization (each array seen as a stream)
- Introduce permanent timers: important to measure the code's evolution (see the sketch below)
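
A minimal sketch of such a permanent timer (module and procedure names are hypothetical), built on the standard SYSTEM_CLOCK intrinsic:

    MODULE perm_timers
      IMPLICIT NONE
      INTEGER(8), PRIVATE :: t_start = 0_8, t_rate = 1_8
    CONTAINS
      SUBROUTINE timer_start()
        CALL SYSTEM_CLOCK(t_start, t_rate)   ! record origin and tick rate
      END SUBROUTINE timer_start

      REAL(8) FUNCTION timer_elapsed()       ! seconds elapsed since timer_start
        INTEGER(8) :: t_now
        CALL SYSTEM_CLOCK(t_now)
        timer_elapsed = REAL(t_now - t_start, 8) / REAL(t_rate, 8)
      END FUNCTION timer_elapsed
    END MODULE perm_timers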

Slide 6: Architecture
- Clearly separate the "why" (orchestration) from the "how" (computation)
- Eases code comprehension
- Helps locate array usage
- Helps push computation onto the GPU
- Avoid mixing calls and inline loops (the refactoring is sketched below):

    CALL TOTO
    DO i = 1, N
       titi(i) = tata(i) + 1
    END DO
    CALL TUTU
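
A minimal sketch of the recommended refactoring (the subroutine name INCR is hypothetical): the loop becomes a compute subroutine with an explicit interface, so the caller reads as pure orchestration:

    CALL TOTO
    CALL INCR(titi, tata, N)     ! the loop now lives in its own compute subroutine
    CALL TUTU

    SUBROUTINE INCR(titi, tata, N)
      IMPLICIT NONE
      INTEGER, INTENT(IN)  :: N
      REAL(8), INTENT(IN)  :: tata(N)
      REAL(8), INTENT(OUT) :: titi(N)
      INTEGER :: i
      DO i = 1, N
         titi(i) = tata(i) + 1
      END DO
    END SUBROUTINE INCR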

Slide 7: Example (main time loop)

    DO WHILE (proptime .LT. t_end)
       CALL check_time
       IF (term) THEN
          CALL write_continue
          EXIT
       ENDIF
       CALL timestep                   ! this routine takes about 90% of total execution time
       CALL boundary(proptime)
       DO i = dim_z_start, dim_z_end   ! this loop takes about 10% of total execution time
          flux_f(i) = flux_f(i) + SUM(ABS(u1(:,:,i))**2)
          imax_f(i) = MAX(imax_f(i), MAXVAL(ABS(u1(:,:,i))**2))
          flux_b(i) = flux_b(i) + SUM(ABS(u2(:,:,i))**2)
          imax_b(i) = MAX(imax_b(i), MAXVAL(ABS(u2(:,:,i))**2))
       ENDDO
       outcount = outcount + 1
       proptime = proptime + delta_t
       IF (outcount .EQ. out) THEN
          ...

Slide 8: Interfaces of compute subroutines
Rules:
- F77-style declarations: no F90 assumed shapes; explicit dimensions are mandatory
- INTENT, IMPLICIT NONE, INTRINSIC
- Limit MODULE/USE to TYPE declarations; otherwise use arguments (global variables masked in modules are invisible from the GPU)
- SUBROUTINEs only; no variadic or optional arguments
- Enforce variable-name coherency
Benefits:
- Define a clear contract associated with the subroutine
- Help the compiler (optimizer)
- Promote GPU usage
- Help to understand the data flow

Slide 9: Example (HMPP codelet)

    !$hmpp <sbsgrp> Hker codelet
    SUBROUTINE ker(U1, U2, k0, n2, delta_z, dim_x, dim_y, dim_z_start, dim_z_end)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: dim_x, dim_y, dim_z_start, dim_z_end
      REAL(8), INTENT(IN) :: delta_z
      REAL(8), INTENT(IN) :: n2, k0
      COMPLEX(8), INTENT(INOUT) :: U1(dim_x, dim_y, dim_z_start-1:dim_z_end)
      COMPLEX(8), INTENT(INOUT) :: U2(dim_x, dim_y, dim_z_start:dim_z_end+1)
      REAL(8) :: i1, i2
      INTEGER(4) :: i, j, k, l
      INTRINSIC :: ABS, EXP, CMPLX, MOD, INT
    !$omp parallel do shared(U1, U2) private(i, j, k, l, i1, i2)
      DO k = dim_z_start-1, dim_z_end
        DO j = 1, dim_y
          DO i = 1, dim_x
            ! <SNIP>
          ENDDO
        ENDDO
      ENDDO
    END SUBROUTINE ker

Slide 10: Dynamic memory
Avoid:

    SUBROUTINE toto(...)
      REAL(8), ALLOCATABLE :: temp(:)
      ALLOCATE(temp(NN))
      DO i = 1, N
        temp(i) = ...
      END DO
      DEALLOCATE(temp)
      DO i = 1, N
        toto(i) = ...
      END DO
    END SUBROUTINE toto

- Dynamic memory is expensive at runtime
- It is not available on the GPU (kernel side): no ALLOCATABLE in a compute subroutine
- Pull memory allocation up to the architecture level; declare temporaries outside the compute subroutine (see the sketch below)
- Compute subroutines are meant to COMPUTE
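
A minimal sketch of the recommended structure (names hypothetical): the caller owns the allocation, the compute subroutine only computes:

    ! caller (architecture level): owns all allocations
    REAL(8), ALLOCATABLE :: temp(:)
    ALLOCATE(temp(NN))
    CALL compute_toto(toto, temp, N)
    DEALLOCATE(temp)

    SUBROUTINE compute_toto(toto, temp, N)
      IMPLICIT NONE
      INTEGER, INTENT(IN)    :: N
      REAL(8), INTENT(OUT)   :: toto(N)
      REAL(8), INTENT(INOUT) :: temp(N)   ! workspace provided by the caller
      ! ... pure computation only, no ALLOCATE ...
    END SUBROUTINE compute_toto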

Slide 11: Separate address space
- Sharing is impossible: no USE of host globals
- Remote asynchronous execution means no return value: a codelet is a pure function of its arguments, written as a SUBROUTINE
- Updating the GPU is expensive! Favor large packets
- [figure: bandwidth (GB/s) and latency (ns) of the L1, L2 and L3 caches and main memory, against the roughly 5 GB/s PCI-Express link]

Slide 12: Study arrays as streams
- A mandatory step to ease data movements: what should go on the GPU? How long should it stay on the GPU?
- Minimize PCI-Express usage
- INTENT helps a lot
- A manual tool: the array table

Slide 13: Array table

            Tab1   Tab2   Tab3   Tab4
    Sub1     D      W*
    Sub2     W      WR*     W
    Sub3     R      R*      R
    Sub4     R      W*

    *: in a USE;  D: allocate;  W: write;  R: read

- Helps to discover which arrays are useful on the GPU
- Helps to transform USE into arguments
- Helps to reduce memory usage

Slide 14: Example (call sequence annotated with the data flow)

    ! -- IN    -- U1_buffer, U2_buffer, Q_buffer
    ! -- INOUT -- U1, U2, Q
    CALL PROPAGATION(U1, U2, Q, U1_buffer, ...)

    ! -- INOUT -- U1, U2
    CALL BOUNDARY(proptime, U1, U2, ...)

    ! -- IN    -- U1, U2
    ! -- INOUT -- flux_f, flux_b, imax_f, imax_b
    CALL FLUX_UPDATE(U1, U2, flux_f, flux_b, imax_f, imax_b, ...)

Slide 15: Random generator (rand())
- Is most of the time a simple linear congruential function (a is the generator's multiplier):
    X_{n+1} = (a * X_n) mod 2^31   (the Linux one; not good)
    X_{n+1} = (a * X_n) mod 2^48   (Lavaux & Jenssens; known as good)
- X_n must be stored between two calls: what should be done with thousands of concurrent threads?
- Minimize internal storage (the Mersenne Twister keeps 624 words of state!)
- How to be independent of the degree of parallelism? Control X_0 (the seed). A per-thread sketch follows.
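
A minimal sketch of the per-thread answer: each thread keeps one word of state, seeded deterministically from X_0 and its thread id so that results do not depend on the degree of parallelism. The constants below are the classic ANSI C example generator, shown for illustration only:

    SUBROUTINE thread_rand(state, r)
      IMPLICIT NONE
      INTEGER(8), INTENT(INOUT) :: state   ! one word of state per thread,
                                           ! seeded once from X_0 and the thread id
      REAL(8), INTENT(OUT) :: r            ! uniform in [0, 1)
      INTEGER(8), PARAMETER :: a = 1103515245_8, c = 12345_8, m = 2_8**31
      state = MOD(a * state + c, m)
      r = REAL(state, 8) / REAL(m, 8)
    END SUBROUTINE thread_rand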

Slide 16: Oddities of a GPU
- Maximize the «compute intensity»: what is expensive is memory accesses and tests; a good ratio is >= 2 flops per memory access
- Fuse loops whenever possible
- Compute instead of store
- Limit conditionals: pull them up to the caller (if possible), or use masks
- No I/O (PRINT/WRITE/READ)

Slide 17: Example

    DO i = 1, nface
       pstar(ind(i)) = pold(i)
    END DO

    DO i = 1, nface
       wl(i) = SQRT(cl(i)*(one + gamma6*(pstar(i)-pl(i))/pl(i)))
       wr(i) = SQRT(cr(i)*(one + gamma6*(pstar(i)-pr(i))/pr(i)))
    END DO

    DO i = 1, nface
       ustar(i) = half*(ul(i) + (pl(i)-pstar(i))/wl(i) + &
                        ur(i) - (pr(i)-pstar(i))/wr(i))
    END DO

    DO i = 1, nface
       sgnm(i) = SIGN(one, ustar(i))
    END DO

ustar(i) is useless as an array: its value is only consumed by the next loop.
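
A sketch of the fused version the slide argues for: merging the last three loops turns ustar into a scalar held in a register, removing both a stored array and two extra passes over memory. The first, indirectly addressed loop stays separate:

    DO i = 1, nface
       wl(i)   = SQRT(cl(i)*(one + gamma6*(pstar(i)-pl(i))/pl(i)))
       wr(i)   = SQRT(cr(i)*(one + gamma6*(pstar(i)-pr(i))/pr(i)))
       ustar   = half*(ul(i) + (pl(i)-pstar(i))/wl(i) + &
                       ur(i) - (pr(i)-pstar(i))/wr(i))      ! now a scalar
       sgnm(i) = SIGN(one, ustar)
    END DO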

Slide 18: Porting to CUDA
- Create C interfaces for the CUDA routines (see the sketch below)
- Manual management of the GPU memory
- Manual management of data movements
- Duplicate each compute routine in CUDA
- MPI coupling: which data is needed? Optimize transfers over the PCI-Express just as with MPI
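
A minimal sketch of such an interface (the wrapper name ker_gpu and its arguments are hypothetical): a small C function allocates device memory, moves the data and launches the kernel; the Fortran side binds to it through ISO_C_BINDING:

    ! Fortran side: explicit interface to a hypothetical C/CUDA wrapper
    INTERFACE
       SUBROUTINE ker_gpu(u1, dim_x, dim_y) BIND(C, name="ker_gpu")
          USE iso_c_binding, ONLY: c_double_complex, c_int
          COMPLEX(c_double_complex) :: u1(*)          ! passed by reference
          INTEGER(c_int), VALUE     :: dim_x, dim_y   ! passed by value, C style
       END SUBROUTINE ker_gpu
    END INTERFACE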

Slide 19: Porting to HMPP
Annotate the source code incrementally (a sketch of the end state follows):
1. Compute on the GPU, transferring all the arrays at each call
2. Verify that the code is still valid; performance is poor at this stage (normal in this mode)
3. Iteratively minimize the transfers, until only the minimum set remains
4. Optimize the compute subroutines
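
A hedged sketch of the final state of step 3, written with the advancedload/delegatedstore directives of HMPP 2.x as I understand them (exact spellings vary across HMPP versions; the codelet label Hker comes from slide 9):

    !$hmpp Hker advancedload, args[U1;U2]     ! upload once, before the time loop
    DO iter = 1, niter
    !$hmpp Hker callsite
       CALL ker(U1, U2, ...)                  ! runs on the GPU, no transfers here
    END DO
    !$hmpp Hker delegatedstore, args[U1;U2]   ! download once, after the loop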

Slide 20: Porting to OpenMP
- «Almost free»: reuse what has been done for HMPP

Slide 21: In all cases
- Do NOT reprogram what is available as a library (MKL, CUBLAS, ...); see the example below
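
For example, a matrix product should be one BLAS call rather than hand-written loops; the standard DGEMM below is served by MKL on the CPU, and CUBLAS provides the equivalent on the GPU:

    ! C := alpha*A*B + beta*C for an m x k matrix A and a k x n matrix B
    CALL DGEMM('N', 'N', m, n, k, 1.0d0, A, m, B, k, 0.0d0, C, m)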

Slide 22: F90
- Do not use all the possibilities of the language: use only F90 syntax on an F77-style code
- Do not forget the F2003 iso_c_binding: it opens the door to efficient C routines and allows the use of pointers created in C (sketch below)
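
A minimal sketch of the last point (cbuf and n are hypothetical): a pointer created on the C side, for instance by a pinned-memory allocator, becomes an ordinary Fortran array:

    USE iso_c_binding
    TYPE(c_ptr) :: cbuf                  ! filled in by a C routine
    REAL(c_double), POINTER :: a(:)
    INTEGER :: n
    ! ... call the C side, which allocates n doubles and returns cbuf ...
    CALL c_f_pointer(cbuf, a, (/ n /))   ! view the C buffer as a Fortran array
    a(1) = 0.0d0                         ! usable like any Fortran array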

Slide 23: F90 pointer to a member of a structure
Quite unsuitable for a GPU port:

    Arrays_1D%Ro_ptr => Hydro_vars%ro(first_cell:last_cell, j, k)

Slide 24: SOA or AOS
- SOA = structure of arrays; AOS = array of structures
- CPUs prefer AOS if many members are used at the same time; otherwise use a SOA
- GPUs prefer SOA (see the sketch below)
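
A minimal sketch of the two layouts in Fortran (type and member names are hypothetical):

    INTEGER, PARAMETER :: n = 100000

    ! AOS: one derived type per cell; cache-friendly on CPUs when
    ! several members are used together
    TYPE cell
       REAL(8) :: ro, p, u
    END TYPE cell
    TYPE(cell) :: aos(n)

    ! SOA: one array per member; consecutive threads touch consecutive
    ! addresses, which the GPU coalesces into wide memory transactions
    TYPE hydro
       REAL(8), ALLOCATABLE :: ro(:), p(:), u(:)
    END TYPE hydro
    TYPE(hydro) :: soa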

Slide 25: GPGPU: further enhancements
- Overlap computations and communications, according to the algorithm
- Benefit from the asynchronism between CPU and GPU:
  1. compute the internal domain on the GPU;
  2. compute the boundaries and copy them from the GPU;
  3. MPI exchange;
  4. copy the exchanged boundaries back to the GPU.
A sketch of this overlap follows.
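
A hedged sketch in CUDA Fortran (kernel and array names are hypothetical; HMPP exposes similar asynchronous clauses): the interior kernel runs on one stream while the halo is downloaded, exchanged through MPI and uploaded again on another:

    USE cudafor
    INTEGER :: istat
    INTEGER(KIND=cuda_stream_kind) :: s_bulk, s_halo
    istat = cudaStreamCreate(s_bulk)
    istat = cudaStreamCreate(s_halo)

    CALL interior_kernel<<<grid, block, 0, s_bulk>>>(u_d)        ! internal domain
    CALL halo_kernel<<<grid_h, block_h, 0, s_halo>>>(u_d, halo_d)
    istat = cudaMemcpyAsync(halo_h, halo_d, nhalo, s_halo)       ! from the GPU
    istat = cudaStreamSynchronize(s_halo)
    CALL mpi_exchange(halo_h)            ! MPI exchange overlaps the interior kernel
    istat = cudaMemcpyAsync(halo_d, halo_h, nhalo, s_halo)       ! to the GPU
    istat = cudaDeviceSynchronize()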

Slide 26: As a summary
- It is worth moving to GPGPU: future computer architectures will be along these lines
- As of TODAY, start to prepare the codes:
  - mandatory coding practices
  - revise the algorithms for hyper-parallelism
- This work will be beneficial even for sequential or MPI codes

Slide 27: Conclusion
- Parallelism will be at all levels
- Applications MUST be hyper-parallel from the ground up: a sequential run should be a degenerate case of a parallel code
- Reactivate vectorized subroutines
- Future codes will have to adapt dynamically to the hardware
- Do not underestimate the architecture of a code: enforce best practices, beware of technical debt

Slide 28: The technical debt (see Wikipedia)
Ward Cunningham first drew the comparison between technical complexity and debt in a 1992 experience report:

"Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite... The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a standstill under the debt load of an unconsolidated implementation, object-oriented or otherwise."

Slide 29: One last word
KISS: keep it simple, stupid.
