CAPS Technology. ProHMPT, March 12th, 2009

1 CAPS Technology ProHMPT, March 12th, 2009

2 Overview of the Talk
1. HMPP in a nutshell: directives for hardware accelerators (HWA)
2. HMPP code generation capabilities: efficient code generation for CUDA
3. Library adapter: HPL / DGEMM experiment
4. Codelet Finder

4 HMPP Directives
C and Fortran directives to program hardware accelerators
- Ensure portability, with default compilation and execution
- Declare hardware implementations of native functions
- Indicate resource allocation and communication
- Place synchronization barriers
A standard and portable way of programming
- A programming glue between general-purpose and hardware-specific languages
- Insulation of hardware-specific kernels in C and Fortran code

5 Directives Principles
Declare hardware-specific implementations of functions (codelets)
- Can be specialized to the execution context (data size, ...)
Codelet calls
- Synchronous or asynchronous
Data transfers
- Data preloading
Synchronization barriers
- Host CPU waits until the remote computation has completed
[Figure: main memory (application data) and general-purpose processor cores on the host side; application data on the HWA and HWA cores on the accelerator side; the two sides are linked by upload, download, and remote procedure call]

6 Simple Example
    #pragma hmpp sgemm codelet, target=cuda, args[vout].io=inout
    extern void sgemm( int m, int n, int k, float alpha,
                       const float vin1[n][n], const float vin2[n][n],
                       float beta, float vout[n][n] );

    int main(int argc, char **argv) {
      /* ... */
      for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp sgemm callsite
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
      }
      /* ... */
    }
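The slide declares sgemm as extern and leaves its body out. A minimal sketch of what a native body might look like is given here, assuming square n-by-n operands (m and k are accepted but unused, since every array in the declaration is n by n) and the update vout = alpha * vin1 * vin2 + beta * vout that appears in the Fortran version of this codelet. It is an illustration, not the CAPS implementation.

```c
/* The codelet directive from the slide would sit directly above this
   definition:
   #pragma hmpp sgemm codelet, target=cuda, args[vout].io=inout        */
void sgemm(int m, int n, int k, float alpha,
           const float vin1[n][n], const float vin2[n][n],
           float beta, float vout[n][n])
{
    /* Assumption: square operands, so only n drives the loops. */
    (void)m;
    (void)k;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float prod = 0.0f;
            for (int kk = 0; kk < n; kk++)
                prod += vin1[i][kk] * vin2[kk][j];
            /* vout is both read and written, hence io=inout. */
            vout[i][j] = alpha * prod + beta * vout[i][j];
        }
    }
}
```

Because vout is read before it is overwritten, the io=inout property on the codelet directive is what tells the runtime to transfer it in both directions.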

7 HMPP Codelet Definition
A pure function to be executed on a remote device or specialized core
- No global variables
- No side effects
Several possible variants
- For different targets
- For different use contexts (vector size, ...)
Managed by the HMPP runtime
- The HMPP API provides the necessary support functions

8 Directives Overview
C:       #pragma hmpp <label> <directive type> [, <directive parameter>]* [&]
Fortran: !$hmpp <label> <directive type> [, <directive parameter>]* [&]
A unique label identifies a group of directives that belong to the same codelet
Directive types:
- codelet: codelet declaration
- callsite: codelet call, can be asynchronous
- advancedload: preloading of data
- delegatedstore: wait for data result upload
- synchronize: wait for the completion of a codelet
- release: free a compute unit for another codelet

9 Advanced Programming
    int main(int argc, char **argv) {
      /* ... */
    #pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
    #pragma hmpp sgemm advancedload, args[vin1;vin2;vout] &
    #pragma hmpp sgemm advancedload, args[m;n;k;alpha;beta]
      for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp sgemm callsite, asynchronous, &
    #pragma hmpp sgemm args[vin1;vin2;vout].advancedload=true &
    #pragma hmpp sgemm args[m;n;k;alpha;beta].advancedload=true
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
    #pragma hmpp sgemm synchronize
      }
    #pragma hmpp sgemm delegatedstore, args[vout]
    #pragma hmpp sgemm release
    }
Callouts:
- Allocate and initialize the device outside the loop
- Preload data
- Execute asynchronously
- Download the result when needed

10 Codelet Directive (1)
Declare a hardware-specific implementation of a function
- Several possible variants (target, execution context)
- Default is the native codelet
    #pragma hmpp label codelet, target=cuda:brook, args[v1].io=out
    #pragma hmpp label2 codelet, target=sse, args[v1].io=out, cond="n<800"
    void MyCodelet(int n, float v1[n], float v2[n], float v3[n])
    {
      int i;
      for (i = 0 ; i < n ; i++) {
        v1[i] = v2[i] + v3[i];
      }
    }
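The cond clause guards a variant: the variant is only eligible when the expression holds for the call's arguments, and the native function remains the default. The sketch below mimics that selection in plain C. It is not the HMPP runtime; variant_t, pick_variant, and the two stand-in implementations are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

typedef void (*codelet_fn)(int n, float v1[], const float v2[], const float v3[]);

/* Native version: the default when no guarded variant applies. */
static void my_codelet_native(int n, float v1[], const float v2[], const float v3[])
{
    for (int i = 0; i < n; i++)
        v1[i] = v2[i] + v3[i];
}

/* Stand-in for a specialized variant (for example the sse target):
   same result, main loop unrolled by four, then a remainder loop. */
static void my_codelet_small(int n, float v1[], const float v2[], const float v3[])
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v1[i]     = v2[i]     + v3[i];
        v1[i + 1] = v2[i + 1] + v3[i + 1];
        v1[i + 2] = v2[i + 2] + v3[i + 2];
        v1[i + 3] = v2[i + 3] + v3[i + 3];
    }
    for (; i < n; i++)
        v1[i] = v2[i] + v3[i];
}

typedef struct {
    codelet_fn fn;
    int (*cond)(int n);   /* guard; NULL means always eligible */
} variant_t;

/* Mirrors cond="n<800" from the slide. */
static int small_n(int n) { return n < 800; }

static const variant_t variants[] = {
    { my_codelet_small, small_n },
};

static const variant_t native_variant = { my_codelet_native, NULL };

/* Return the first variant whose guard holds, else the native one. */
const variant_t *pick_variant(int n)
{
    for (size_t i = 0; i < sizeof variants / sizeof variants[0]; i++)
        if (variants[i].cond == NULL || variants[i].cond(n))
            return &variants[i];
    return &native_variant;
}
```

A call site would then run pick_variant(n)->fn(n, v1, v2, v3); keeping the native version as the fallback is what preserves default execution when no variant qualifies.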

11 Advancedload Directive (1)
Data transfers strongly impact performance
- Try to preload data before the codelet call site
    #pragma hmpp simple advancedload, args[v2], asynchronous,\
                 args[v2].addr="t2"
    for (k = 0 ; k < iter ; k++) {
    #pragma hmpp simple callsite, args[v2].advancedload=true
      simplefunc1(n, &(t1[k*n]), &(t2[k*n]), &(t3[k*n]));
    #pragma hmpp simple advancedload, args[v2], asynchronous,\
                 args[v2].addr="&(t2[(k+1)*n])", args[v2].size="(n)"
      /* do something else */
    }
    #pragma hmpp simple release

12 Advancedload Directive (2)
Avoid reloading constant data
    int main(int argc, char **argv) {
      /* ... */
    #pragma hmpp simple advancedload, args[v2], const
      for (j=0; j<n; j++){
    #pragma hmpp simple callsite, args[v2].advancedload=true
        simplefunc1(n,t1[j], t2, t3[j], alpha);
      }
    #pragma hmpp label release
      /* ... */
    }
t2 is not reloaded at each loop iteration

13 Codelet Generation

14 Objectives
Allow transparent use of HWA
- From C or Fortran to CUDA, Brook, ...
Allow code tuning at the source-code level
- Directive-based approach

15 Code Generation Flow

16 Codelet Generation
C, Java or Fortran source code input
- HWA-oriented subset of the languages
Set of directives to
- Optimize target codelet generation
- Express parallelism
- Make code tuning easier
Generated code can also be tuned

17 Loop Parallelization
Force or prevent the parallelization of loops
Help define kernels in a codelet
    #pragma hmppcg parallel
    for (i=0; i < n; i++) {
    #pragma hmppcg noparallel
      for (j=0; j < n; j++) {
        D[i][j] = A[i][j] * E[3][j];
      }
    }

18 Input C Code Example 1
    typedef struct{ float r, i; } Complex;

    #pragma hmpp convolution2d codelet, args[data; opx].io=in, args[convr].io=out, target=cuda
    void convolution2d( Complex *data, int nx, int ny,
                        Complex *opx, int oplx, int oply, Complex *convr)
    {
      int hoplx = (oplx+1)/2;
      int hoply = (oply+1)/2;
      int iy, ix;
    #pragma hmppcg parallel
      for (iy = 0; iy < ny; iy++) {
    #pragma hmppcg parallel
        for (ix = 0; ix < nx; ix++) {
          float dumr = 0.0, dumi = 0.0;
          int ky;
          for(ky = -(oply - hoply - 1); ky <= hoply; ky++) {
            int kx;
            for(kx = -(oplx - hoplx - 1); kx <= hoplx; kx++){
              /* clamp the sample position to the image borders */
              int dx = min( max(ix+kx, 0), (nx - 1) );
              int dy = min( max(iy+ky, 0), (ny - 1) );
              /* complex multiply-accumulate: data * opx */
              dumr += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
              dumr -= data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
              dumi += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
              dumi += data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
            }
          }
          convr[iy*nx+ix].r = dumr;
          convr[iy*nx+ix].i = dumi;
        }
      }
    }

19 Input Fortran Code Example 2
    !$HMPP sgemm3 codelet, target=cuda, args[vout].io=inout
    SUBROUTINE sgemm(m,n,k2,alpha,vin1,vin2,beta,vout)
      INTEGER, INTENT(IN) :: m,n,k2
      REAL, INTENT(IN) :: alpha,beta
      REAL, INTENT(IN) :: vin1(n,n), vin2(n,n)
      REAL, INTENT(INOUT) :: vout(n,n)
      REAL :: prod
      INTEGER :: i,j,k
    !$hmppcg unroll(8), jam(2), noremainder
    !$hmppcg parallel
      DO j=1,n
    !$hmppcg unroll(8), splitted, noremainder
    !$hmppcg parallel
        DO i=1,n
          prod = 0.0
          DO k=1,n
            prod = prod + vin1(i,k) * vin2(k,j)
          ENDDO
          vout(i,j) = alpha * prod + beta * vout(i,j) ;
        END DO
      END DO
    END SUBROUTINE sgemm

20 MxM Performance

21 Performance Examples

22 Tuning Issue Example
    #pragma hmpp astex_codelet__1 codelet &
    #pragma hmpp astex_codelet__1, args[c].io=in &
    #pragma hmpp astex_codelet__1, args[v].io=inout &
    #pragma hmpp astex_codelet__1, args[u].io=inout &
    #pragma hmpp astex_codelet__1, target=cuda &
    #pragma hmpp astex_codelet__1, version=1.4.0
    void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                          float c[256][256][256], const int K, const float x2){
    astex_thread_begin:{
      for (int it = 0 ; it < K ; ++it){
        for (int i2 = 1 ; i2 < ; ++i2){          /* Need interchange */
          for (int i3 = 1 ; i3 < ; ++i3){
            for (int i1 = 1 ; i1 < ; ++i1){
              float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
              float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1-1];
              sum += u[i3][i2 + 1][i1] + u[i3][i2-1][i1];
              sum += u[i3 + 1][i2][i1] + u[i3-1][i2][i1];
              v[i3][i2][i1] = ( * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
            }
          }
        }
        for (int i2 = 1 ; i2 < ; ++i2){
          for (int i3 = 1 ; i3 < ; ++i3){
            for (int i1 = 1 ; i1 < ; ++i1){
              .....
            }
          }
        }
      }
    }
    astex_thread_end:;
    }
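The callout on this slide asks for a loop interchange. In C the last subscript varies fastest in memory, so for accesses of the form u[i3][i2][i1] a CPU gets unit-stride traversal with i3 outermost and i1 innermost; which order suits a CUDA target depends on how the code generator maps loops to threads. The sketch below only demonstrates that the interchange is legal for this kind of update: each point is computed from u and from its own old v, so the two loop orders give identical results. The reduced size N, the function names, and the simplified update (the neighbor sum without the constant factor) are assumptions made for the illustration.

```c
#define N 8  /* hypothetical reduced size; the slide uses 256 */

/* Simplified stand-in for the slide's update: the six-neighbor sum is
   kept, the term with the constant factor is left out. */
static float update(float u[N][N][N], float v[N][N][N],
                    int i3, int i2, int i1, float coeff)
{
    float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1 - 1];
    sum += u[i3][i2 + 1][i1] + u[i3][i2 - 1][i1];
    sum += u[i3 + 1][i2][i1] + u[i3 - 1][i2][i1];
    return coeff * sum - v[i3][i2][i1];
}

/* Loop order as on the slide: i2 outermost, then i3, then i1. */
void step_original(float u[N][N][N], float v[N][N][N], float coeff)
{
    for (int i2 = 1; i2 < N - 1; ++i2)
        for (int i3 = 1; i3 < N - 1; ++i3)
            for (int i1 = 1; i1 < N - 1; ++i1)
                v[i3][i2][i1] = update(u, v, i3, i2, i1, coeff);
}

/* After interchanging i2 and i3: loop nesting now matches the
   subscript order [i3][i2][i1]. */
void step_interchanged(float u[N][N][N], float v[N][N][N], float coeff)
{
    for (int i3 = 1; i3 < N - 1; ++i3)
        for (int i2 = 1; i2 < N - 1; ++i2)
            for (int i1 = 1; i1 < N - 1; ++i1)
                v[i3][i2][i1] = update(u, v, i3, i2, i1, coeff);
}
```

The interchange is safe here because no iteration reads a value of v written by another iteration; a stencil that updated u in place would not have that property.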

23 Library Issues

24 Motivations
Various implementations of libraries are available for a given target
- CUBLAS, MKL, ATLAS, ...
No strict performance order
- Each library has a different performance profile
- Best choice depends on platform and runtime parameters
User left with a complex issue
- Routine performance measurement
- Decision programming
- Hardware version adaptation
Development partially funded by the STREP project MILEPOST (Machine Learning for Embedded Programs Optimisation)

25 Difficult Decision Making with Alternative Codes (Multiversioning)
Various implementations of routines are available or can be generated for a given target
- CUBLAS, MKL, ATLAS, ...
- SIMD instructions, GPcore, HWA, hybrid
No strict performance order
- Each implementation has a different performance profile
- Best choice depends on platform and runtime parameters
Decision is a complex issue
- How to produce the decision?

26 Library Adapter Overview

27 Illustrating Example: Dealing with Multiple BLAS Implementations
Runtime selection of DGEMM in High Performance Linpack
- Intel(R) Xeon(R) 2.50GHz
- CUBLAS on Tesla C1060, Intel MKL
Three binaries of the application
- Static linking with CUBLAS
- Static linking with MKL
- Library mix with selection of routine at runtime (automatically generated using CAPS tooling)
Three hardware resource configurations
- GPU + 1, 2, and 4 cores used for MKL
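Selecting a routine at runtime, as the third binary does, means turning measured performance into a per-call decision. The sketch below shows one simple way this could be done in plain C; it is not the code generated by the CAPS tooling. The enum, the profile structure, and both functions are hypothetical, and the measured rates are supplied by the caller rather than taken from the slides.

```c
#include <stdlib.h>

/* Stand-ins for the two libraries of the experiment (MKL and CUBLAS). */
enum impl { IMPL_CPU = 0, IMPL_GPU = 1 };

typedef struct {
    int size;         /* calibrated problem size */
    enum impl best;   /* implementation with the higher measured rate */
} profile_entry;

/* Record, for each calibrated size, which implementation was faster.
   The rates come from the caller's own measurements. */
void build_profile(int count, const int sizes[],
                   const double cpu_gflops[], const double gpu_gflops[],
                   profile_entry out[])
{
    for (int i = 0; i < count; i++) {
        out[i].size = sizes[i];
        out[i].best = gpu_gflops[i] > cpu_gflops[i] ? IMPL_GPU : IMPL_CPU;
    }
}

/* Pick the winner recorded at the calibrated size closest to n.
   Requires count >= 1. */
enum impl select_impl(int count, const profile_entry profile[], int n)
{
    int nearest = 0;
    for (int i = 1; i < count; i++)
        if (abs(profile[i].size - n) < abs(profile[nearest].size - n))
            nearest = i;
    return profile[nearest].best;
}
```

A DGEMM wrapper would call select_impl with the matrix dimension and dispatch to the matching library. Nearest-size lookup is only one possible decision rule; the profile has to be rebuilt per platform, which is the hardware adaptation issue raised on the Motivations slide.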

28 Performance Using One Core
Performance in GFLOPS for 4 problem sizes (including 64, 500, and 1200)
[Bar chart: performance (GFLOPS) versus problem size; series: Cublas, MKL, Dyn. Sel.]

29 Performance Using Two Cores
[Bar chart: performance (GFLOPS) versus problem size; series: Cublas, MKL, Dyn. Sel.]

30 Performance Using Four Cores
[Bar chart: performance (GFLOPS) versus problem size; series: Cublas, MKL, Dyn. Sel.]

31 Codelet Finder Alpha version

32 Codelet Finder Overview
Partitioning of C code to highlight codelets
- Data value specialization
- Aliasing speculation
Useful for HWA exploitation (and maybe vectorization and parallelization)
[Figure labels: static, dynamic, partitioned code]

33 Extracted Codelets Are Not Just Hotspots
HWA data mapping in local memory adds constraints
    ... {
      for (x = 0 ; x < i_size ; x++) {
        diff[x + y * i_size] = pix1[x] - pix2[x];
      }
      pix1 += i_pix1;
      pix2 += i_pix2;
    }
[Figure: the same data sits at different addresses in main memory and in HWA local memory, so the value of pix1 differs between the two]
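The pointer increments in this loop hide which elements of pix1 and pix2 are actually read, and that extent is what must be known to map the data into HWA local memory. A minimal sketch follows, assuming the elided outer loop runs y from 0 to i_size - 1, 8-bit pixels, and non-negative strides; the function names and types are hypothetical. It rewrites the accesses with explicit indices and computes how many elements each input spans from its base pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Same computation as the slide's loop, with the row offset folded into
   the index instead of advancing pix1 and pix2 (outer loop assumed). */
void pixel_diff(int16_t *diff, int i_size,
                const uint8_t *pix1, int i_pix1,
                const uint8_t *pix2, int i_pix2)
{
    for (int y = 0; y < i_size; y++)
        for (int x = 0; x < i_size; x++)
            diff[x + y * i_size] =
                (int16_t)(pix1[x + y * i_pix1] - pix2[x + y * i_pix2]);
}

/* Number of elements, counted from the base pointer, spanned by the
   reads of one input whose rows are `stride` elements apart. */
size_t touched_extent(int i_size, int stride)
{
    if (i_size <= 0)
        return 0;
    return (size_t)(i_size - 1) * (size_t)stride + (size_t)i_size;
}
```

With the extent known, a transfer of touched_extent(i_size, i_pix1) elements starting at pix1 is enough to cover the block, and indices into the local copy remain valid because they are relative to the base rather than absolute host addresses.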

34 Example of Partitioning to Use HWA (1)
    #pragma hmpp astex_codelet__1 codelet &
    #pragma hmpp astex_codelet__1, args[c].io=in &
    #pragma hmpp astex_codelet__1, args[v].io=inout &
    #pragma hmpp astex_codelet__1, args[u].io=inout &
    #pragma hmpp astex_codelet__1, target=cuda &
    #pragma hmpp astex_codelet__1, version=1.4.0
    void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                          float c[256][256][256], const int K, const float x2){
    astex_thread_begin:{
      for (int it = 0 ; it < K ; ++it){
        for (int i2 = 1 ; i2 < ; ++i2){
          for (int i3 = 1 ; i3 < ; ++i3){
            for (int i1 = 1 ; i1 < ; ++i1){
              float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
              float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1-1];
              sum += u[i3][i2 + 1][i1] + u[i3][i2-1][i1];
              sum += u[i3 + 1][i2][i1] + u[i3-1][i2][i1];
              v[i3][i2][i1] = ( * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
            }
          }
        }
        for (int i2 = 1 ; i2 < ; ++i2){
          for (int i3 = 1 ; i3 < ; ++i3){
            for (int i1 = 1 ; i1 < ; ++i1){
              .....
            }
          }
        }
      }
    }
    astex_thread_end:;
    }

35 Example of Partitioning to Use HWA (2)
Extract codelet to be executed on HWA
- Data specialization
- Aliasing speculation
Convolution code: icc -O3 vs HWA
- Speedup is 3.4 with the HWA
Codelet tuning
- Loop interchange was needed

36 Codelet Finder in ProHMPT
Can be used to provide a codelet testbed for the various techniques

37 Conclusion

38 CAPS in ProHMPT
- Adaptive code generation
- Definition of directives
- Dynamic resource allocation

39 Tasks
Theme 2: Parallelism extraction
- Task 3: Compilation and static code analysis (based on DPIL)
- Task 4: Language (extension of OpenMP for heterogeneous systems)
Theme 3: Software support
- Task 7: Scheduling (resource allocation)
