Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives


1 Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 Amsterdam, Netherlands

2 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

3 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

4 The Parallel Challenge
[Figure] Growth in uniprocessor performance since 1978, measured relative to the VAX-11/780: roughly 25%/year in the early years, 52%/year from the mid-1980s, then slowing sharply in the mid-2000s as single-processor scaling ended and multicore designs (Intel Core i7, multi-core Xeons) took over. Source: David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

5 General Purpose Computing with GPUs
[Figure] Accelerators/co-processors in TOP500 systems: NVIDIA GPUs dominate, with smaller counts of ATI/AMD, Intel, ClearSpeed CSX600 and Cell boards. Source: The TOP500 List, June 2014.

6 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

7 GPGPU with CUDA
The first GPGPU programs looked like graphics applications. CUDA enables the use of C: a CUDA kernel specifies the operation of a single GPU thread. Main ideas:
1. Lightweight parallel threads organized in a hierarchy: grid, block
2. Shared memory
3. Barriers
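The grid/block hierarchy can be sketched in plain C by emulating the index computation each GPU thread performs (the SAXPY body and the `saxpy_thread`/`saxpy_launch` names are our own illustration, not part of CUDA or of this talk):

```c
#include <assert.h>

/* Body of one GPU thread: computes its global index from its block and
 * thread coordinates, exactly as a CUDA kernel would with
 * blockIdx.x * blockDim.x + threadIdx.x. */
static void saxpy_thread(int blockIdx, int blockDim, int threadIdx,
                         int n, float a, const float *x, float *y) {
    int i = blockIdx * blockDim + threadIdx;
    if (i < n)                 /* guard: the grid may overshoot n */
        y[i] = a * x[i] + y[i];
}

/* CPU driver that emulates launching a grid of blocks. */
static void saxpy_launch(int n, float a, const float *x, float *y) {
    int blockDim = 4;                        /* threads per block */
    int gridDim = (n + blockDim - 1) / blockDim;
    for (int b = 0; b < gridDim; b++)
        for (int t = 0; t < blockDim; t++)
            saxpy_thread(b, blockDim, t, n, a, x, y);
}
```

On a GPU the two emulation loops disappear: every (block, thread) pair runs concurrently.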

8 GPU Programming Features in CUDA
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block

9-11 GPGPU with OpenHMPP
Directive-based approaches provide several advantages:
- More readable codes
- Only one file
- Independent from the hardware platform
- Reasonable performance
They also give explicit control of software-managed caches and support standard loop transformations.

12 GPU Programming Features with OpenHMPP
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block
OpenHMPP directives covering these features: gridify, advancedload, delegatedstore, permute, unroll, fuse, tile, shared.

13 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

14 diKernel: Domain-Independent Computational Kernel
Abstraction levels:
- DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)
- DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)
- SEMANTIC LEVEL (control flow and data dependence graphs)
- SYNTACTIC LEVEL (abstract syntax tree)
- TEXT LEVEL (ASCII code)
A diKernel characterizes the computations carried out in a program without being affected by how they are coded, and exposes multiple levels of parallelism.
M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

15 Building the KIR
A non-statement-based, high-level, hierarchical IR, built in three steps:
1. diKernel recognition on the data dependence graph (DDG)
2. Identification of flow dependences
3. Hierarchy of execution scopes reflecting the computational stages, and diKernel classification
J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.

16 Example of KIR: CONV3D
(Shaded kernels are omitted in the discovery of parallelism.)
 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k]+coefx*
 8.         (
 9.           input[i-1][j][k]+input[i+1][j][k]+
10.           input[i-2][j][k]+input[i+2][j][k]+
11.           input[i-3][j][k]+input[i+3][j][k]+
12.           input[i-4][j][k]+input[i+4][j][k]
13.         );
14.       float tempy = input[i][j][k]+coefy*
15.         (
16.           input[i][j-1][k]+input[i][j+1][k]+
17.           input[i][j-2][k]+input[i][j+2][k]+
18.           input[i][j-3][k]+input[i][j+3][k]+
19.           input[i][j-4][k]+input[i][j+4][k]
20.         );
21.       float tempz = input[i][j][k]+coefz*
22.         (
23.           input[i][j][k-1]+input[i][j][k+1]+
24.           input[i][j][k-2]+input[i][j][k+2]+
25.           input[i][j][k-3]+input[i][j][k+3]+
26.           input[i][j][k-4]+input[i][j][k+4]
27.         );
28.       output[i][j][k] =
29.         output[i][j][k]+tempx+tempy+tempz;
30.     }
31.   }
32. }
KIR:
ROOT EXECUTION SCOPE
  ES_for_i,j,k (Fig. 1a, lines 4-32)
    K<tempx_7> scalar assignment
    K<tempy_14> scalar assignment
    K<tempz_21> scalar assignment
    K<output_28> regular reduction

17 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

18 GPU Programming Features addressed by our Automatic Technique
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block

19-20 Detection of coalesced accesses to the GPU global memory
CHRECS_x_k = [{0,+,1}][{0,+,1}]

(a) Source code S1:
// only for_i is threadified
for (i = 0; i <= N; i++) {
  for (j = 0; j <= N; j++) {
    ... x[i][j] ...
  }
}

(b) Non-coalesced accesses:
         T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0      x[0][0]    x[1][0]    x[2][0]
j=1      x[0][1]    x[1][1]    x[2][1]
j=2      x[0][2]    x[1][2]    x[2][2]
chrecs:
1st dim  {0}        {1}        {2}
2nd dim  {0,+,1}    {0,+,1}    {0,+,1}   (the same for all threads)

(c) Source code S2:
// only for_j is threadified
for (j = 0; j <= N; j++) {
  for (i = 0; i <= N; i++) {
    ... x[i][j] ...
  }
}

(d) Coalesced accesses:
         T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0      x[0][0]    x[0][1]    x[0][2]
i=1      x[1][0]    x[1][1]    x[1][2]
i=2      x[2][0]    x[2][1]    x[2][2]
chrecs:
1st dim  {0,+,1}    {0,+,1}    {0,+,1}
2nd dim  {0}        {1}        {2}       (a contiguous range)
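The contiguity test on the instantiated chrecs can be sketched as follows. This is a deliberate simplification of the technique: once the threadified index is fixed, each dimension's chrec reduces to its start constant, and accesses coalesce when consecutive threads yield consecutive constants in the fastest-varying dimension. All names are ours:

```c
#include <assert.h>
#include <stdbool.h>

/* Instantiated chrec of one thread, reduced to the start constant of
 * the fastest-varying (last) array dimension, e.g. {2} for thread T2. */
typedef struct { int last_dim_offset; } thread_access;

/* Accesses coalesce when consecutive threads touch a contiguous range
 * of offsets in the last dimension. */
static bool is_coalesced(const thread_access *acc, int nthreads) {
    for (int t = 1; t < nthreads; t++)
        if (acc[t].last_dim_offset != acc[t - 1].last_dim_offset + 1)
            return false;
    return true;
}
```

For S2 the last-dimension offsets of T0, T1, T2 are 0, 1, 2 (coalesced); for S1 every thread starts its row traversal at the same offset, so the test fails.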

21 Locality-Aware Generation of Efficient GPGPU Code (and III)
2. Usage of registers to store reused data within a GPU thread
3. Usage of the GPU shared memory for data shared between the threads of a warp
4. Increase the computational load of a GPU thread (loop tiling preserving coalescing)

22 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

23 Case Study: CONV3D (I)
(Same content as slide 16: the CONV3D source code and its KIR, with shaded kernels omitted in the discovery of parallelism.)

24 Case Study: CONV3D (II)
Variants: conv3d-cpu; conv3d-hmpp1 (coalescing).
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx* ...
The default OpenHMPP policy threadifies the two outermost loops:
CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_1^T0 = [{0}][{0}][{0,+,1}]
CHRECS_input_1^T1 = [{0}][{1}][{0,+,1}]
The loop nest is permuted to for_j, for_k, for_i (permute directive), so that consecutive threads access consecutive positions of the last dimension:
CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input_1^T1 = [{0,+,1}][{0}][{1}]
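The permutation is legal here because every iteration writes a different output element; a toy C check (our own example, much smaller than the CONV3D code) runs both loop orders and compares the results:

```c
#include <assert.h>

#define N 4

/* Original order: i outer, j inner. */
static void add_ij(const float in[N][N], float out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = in[i][j] + 1.0f;
}

/* Permuted order: j outer, i inner. Same result, but a threadified
 * j loop now gives consecutive threads consecutive columns. */
static void add_ji(const float in[N][N], float out[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i][j] = in[i][j] + 1.0f;
}
```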

25 Case Study: CONV3D (III)
conv3d-hmpp2: registers.
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx*
8.         (
9.           input[i-1][j][k]+input[i+1][j][k]+ ...
CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_2 = [{-1,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_3 = [{1,+,1}][{0,+,1}][{0,+,1}]
Instantiated for thread T0:
CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input_2^T0 = [{-1,+,1}][{0}][{0}]
CHRECS_input_3^T0 = [{1,+,1}][{0}][{0}]
The chrecs differ only in the start of the sequential dimension, so successive iterations re-read the same elements and the values can be kept in registers.
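A heavily simplified sketch of that reuse test: two accesses whose instantiated chrecs differ only in the start constant of the sequential dimension revisit the same elements a few iterations apart, so the loaded value is a candidate for a register. The function name and the window bound are our assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Two chrecs [{a,+,1}][c][c] and [{b,+,1}][c][c] of the same thread
 * traverse the same element sequence shifted by (b - a): what one
 * access loads, the other re-loads |b - a| iterations later.  Reuse
 * pays off when that distance fits in a small register window. */
static bool reusable_in_registers(int start_a, int start_b, int window) {
    int d = start_b - start_a;
    if (d < 0) d = -d;
    return d <= window;
}
```

For the 9-point CONV3D x-stencil the starts range over -4..4, well inside an 8-element window; an access 100 rows away would not qualify.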

26 Case Study: CONV3D (and IV)
conv3d-hmpp3: shared memory.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
21.      float tempz = input[i][j][k]+coefz*
22.        (
23.          input[i][j][k-1]+input[i][j][k+1]+
24.          input[i][j][k-2]+input[i][j][k+2]+
25.          input[i][j][k-3]+input[i][j][k+3]+
26.          input[i][j][k-4]+input[i][j][k+4]
27.        );
                          T0                         T1
                 1st dim  2nd dim  3rd dim   1st dim  2nd dim  3rd dim
CHRECS_input_19  {0,+,1}  {0}      {0}       {0,+,1}  {0}      {1}
CHRECS_input_20  {0,+,1}  {0}      {-1}      {0,+,1}  {0}      {0}
CHRECS_input_21  {0,+,1}  {0}      {1}       {0,+,1}  {0}      {2}
CHRECS_input_22  {0,+,1}  {0}      {-2}      {0,+,1}  {0}      {-1}
CHRECS_input_23  {0,+,1}  {0}      {2}       {0,+,1}  {0}      {3}
CHRECS_input_24  {0,+,1}  {0}      {-3}      {0,+,1}  {0}      {-2}
CHRECS_input_25  {0,+,1}  {0}      {3}       {0,+,1}  {0}      {4}
CHRECS_input_26  {0,+,1}  {0}      {-4}      {0,+,1}  {0}      {-3}
CHRECS_input_27  {0,+,1}  {0}      {4}       {0,+,1}  {0}      {5}
The threads of a warp read overlapping ranges, so the data is staged in the GPU shared memory via the shared clause of the gridify directive.
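The sharing pattern the table exposes can be checked with a tiny range-overlap test (a plain-C simplification; `warp_mates_share` and the two-thread model are ours):

```c
#include <assert.h>
#include <stdbool.h>

/* Thread T reads last-dimension offsets [T + lo, T + hi] (the
 * k-stencil).  Two neighbouring threads of a warp share data when
 * their ranges intersect, which flags the array for staging in the
 * GPU shared memory. */
static bool warp_mates_share(int lo, int hi) {
    int t0_lo = 0 + lo, t0_hi = 0 + hi;
    int t1_lo = 1 + lo, t1_hi = 1 + hi;
    return t0_hi >= t1_lo && t1_hi >= t0_lo;
}
```

For the CONV3D z-stencil (lo = -4, hi = 4) the ranges of T0 and T1 overlap in eight elements; a pointwise access (lo = hi = 0) shares nothing.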

27 Case Study: SGEMM (I)
 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }
KIR:
ROOT EXECUTION SCOPE
  ES_for_i,j (Fig. 3a, lines 5-13)
    K<prod_7> scalar assignment
    ES_for_l (Fig. 3a, lines 8-10)
      K<prod_9> scalar reduction
    K<C_11> regular reduction

28 Case Study: SGEMM (II)
Variants: sgemm-cpu; sgemm-mkl (Intel MKL); sgemm-hmpp1 (offloading, and check coalescing).
(SGEMM source code as on slide 27, lines 1-13.)
          not instantiated        T0                  T1
          1st dim   2nd dim   1st dim  2nd dim   1st dim  2nd dim
CHRECS_A  {0,+,1}   {0,+,1}   {0}      {0,+,1}   {0}      {0,+,1}
CHRECS_B  {0,+,1}   {0,+,1}   {0,+,1}  {0}       {0,+,1}  {1}
CHRECS_C  {0,+,1}   {0,+,1}   {0}      {0}       {0}      {1}

29 Case Study: SGEMM (and III)
sgemm-hmpp2: tiling preserving coalescing.
(SGEMM source code as on slide 27, lines 1-13.)
Algorithm 4: Increase the computational load of a GPU thread
1: procedure INCREASELOAD
   Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
   Input: loop nest L = L_1, L_2, ..., L_l where both L_1, L_2 are threadified
   Input: amount of data D to be processed by a GPU thread
2:   increment the step of the outer loop L_1 to D
3:   for each scalar variable s in L do
4:     promote s to an array s[D]
5:     transform reads and writes to s into loops of D iterations
6:   end for
7: end procedure
sgemm-hmpp3: let the compiler use the registers (unroll).
sgemm-hmpp4: use the shared memory for B.
sgemm-cublas: NVIDIA CUBLAS library.
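Algorithm 4 can be applied by hand to the SGEMM nest. The C sketch below (fixed small sizes and function names of our own choosing) steps the outer loop by D and promotes the scalar prod to prod[D], so one simulated thread computes D rows; comparing against the untransformed nest shows the result is unchanged:

```c
#include <assert.h>

enum { M = 4, N = 4, K = 4, D = 2 };  /* D = data per GPU thread */

/* Straightforward SGEMM loop nest (as in the slide). */
static void sgemm_ref(float A[M][K], float B[K][N], float C[M][N],
                      float alpha, float beta) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float prod = 0;
            for (int l = 0; l < K; l++)
                prod += A[i][l] * B[l][j];
            C[i][j] = alpha * prod + beta * C[i][j];
        }
}

/* Algorithm 4 applied by hand: the outer loop steps by D and prod is
 * promoted to prod[D].  The j-indexed accesses to B and C keep their
 * pattern, which is why the transformation preserves coalescing. */
static void sgemm_tiled(float A[M][K], float B[K][N], float C[M][N],
                        float alpha, float beta) {
    for (int i = 0; i < M; i += D)
        for (int j = 0; j < N; j++) {
            float prod[D];
            for (int d = 0; d < D; d++) prod[d] = 0;
            for (int l = 0; l < K; l++)
                for (int d = 0; d < D; d++)
                    prod[d] += A[i + d][l] * B[l][j];
            for (int d = 0; d < D; d++)
                C[i + d][j] = alpha * prod[d] + beta * C[i + d][j];
        }
}
```

Each output element is accumulated in the same order in both versions, so the results match exactly, not just approximately.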

30 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

31 Performance Evaluation: CONV3D
[Figure] GFLOPS of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 for size_x, size_y and size_z in 128, 256, 384, 512, 640 and 768, on the CPU (nova), a GPU Tesla S1070 (nova) and a GPU Tesla S2050 (pluton). Fermi cards introduced memory caches.

32 Performance Evaluation: SGEMM
[Figure] GFLOPS of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas for m, n and k in 128 to 2048 (steps of 128), 4096, 6144 and 8192, on the CPU (nova), a GPU Tesla S1070 (nova) and a GPU Tesla S2050 (pluton). The biggest improvement factor is the usage of the GPU shared memory.

33 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

34 Conclusions
- KIR-based locality-aware automatic parallelization technique that targets GPU-based heterogeneous systems
- Exploits data locality in the complex GPU memory hierarchy
- Two representative case studies: CONV3D & SGEMM
- Chains of recurrences model the accesses to n-dimensional arrays
- OpenHMPP directives make the generated GPU code understandable and portable

35 Future Work
- New automatic partitioning algorithm of the KIR to handle the interactions between computations in full-scale applications
- Auto-tuning approaches to select the best-performing variant on a given hardware architecture
- Tests with larger benchmark suites and on other manycore accelerators

36 Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 Amsterdam, Netherlands


More information

OPENACC ONLINE COURSE 2018

OPENACC ONLINE COURSE 2018 OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA ABOUT THIS COURSE 3 Part Introduction to OpenACC Week 1 Introduction to OpenACC Week

More information

Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning

Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning Tianyi David Han and Tarek S. Abdelrahman The Edward S. Rogers Department of Electrical and Computer Engineering University

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Towards a Performance- Portable FFT Library for Heterogeneous Computing

Towards a Performance- Portable FFT Library for Heterogeneous Computing Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon

More information

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

Early Experiences with the OpenMP Accelerator Model

Early Experiences with the OpenMP Accelerator Model Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department

More information

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Ahmet Erdem, Gianluca Palermo 6, and Cristina Silvano 6 Department of Electronics, Information and Bioengineering Politecnico di

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-Aware Mapping of Nested Parallel Patterns on GPUs Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Advanced OpenACC. Steve Abbott November 17, 2017

Advanced OpenACC. Steve Abbott November 17, 2017 Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective

More information

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Sequoia. Mike Houston Stanford University July 9 th, CScADS Workshop

Sequoia. Mike Houston Stanford University July 9 th, CScADS Workshop Sequoia Mike Houston Stanford University July 9 th, 2007 - CScADS Workshop Today s outline Sequoia programming model Sequoia targets Tuning in Sequoia http://sequoia.stanford.edu - Supercomputing 2006

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

GPU programming made easier

GPU programming made easier GPU programming made easier Jacob Jepsen 6. June 2014 University of Copenhagen Department of Computer Science 6. June 2014 Introduction We created a tool that reduces the development time of GPU code.

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,

More information

Accelerator programming with OpenACC

Accelerator programming with OpenACC ..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

Parallel Hybrid Computing Stéphane Bihan, CAPS

Parallel Hybrid Computing Stéphane Bihan, CAPS Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Concepts Introduced. Classes of Computers. Classes of Computers (cont.) Great Architecture Ideas. personal computers (PCs)

Concepts Introduced. Classes of Computers. Classes of Computers (cont.) Great Architecture Ideas. personal computers (PCs) Concepts Introduced Classes of Computers classes of computers great architecture ideas software levels computer components performance measures technology trends personal computers (PCs) servers intended

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

OpenMP Optimization and its Translation to OpenGL

OpenMP Optimization and its Translation to OpenGL OpenMP Optimization and its Translation to OpenGL Santosh Kumar SITRC-Nashik, India Dr. V.M.Wadhai MAE-Pune, India Prasad S.Halgaonkar MITCOE-Pune, India Kiran P.Gaikwad GHRIEC-Pune, India ABSTRACT For

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information