Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives


1 Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 Amsterdam, Netherlands

2 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

3 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

4 The Parallel Challenge
[Figure] Growth in uniprocessor performance since 1978, measured relative to the VAX-11/780: roughly 25%/year in the early years, 52%/year from the mid-1980s, then slowing sharply in the mid-2000s as single-processor scaling ended and multicore designs (Intel Core i7, multi-core Xeons) took over. Source: David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

5 General Purpose Computing with GPUs
[Figure] Accelerators/co-processors in TOP500 systems: NVIDIA GPUs dominate, with smaller counts of ATI/AMD, Intel, ClearSpeed CSX600 and Cell boards. Source: The TOP500 List, June 2014.

6 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

7 GPGPU with CUDA
The first GPGPU programs looked like graphics applications. CUDA enables the use of C: a CUDA kernel specifies the operation of a single GPU thread. Main ideas:
1. Lightweight parallel threads organized in a hierarchy: grid, block
2. Shared memory
3. Barriers
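The grid/block hierarchy can be sketched in plain C by emulating the index computation each GPU thread performs (the SAXPY body and the `saxpy_thread`/`saxpy_launch` names are our own illustration, not part of CUDA or of this talk):

```c
#include <assert.h>

/* Body of one GPU thread: computes its global index from its block and
 * thread coordinates, exactly as a CUDA kernel would with
 * blockIdx.x * blockDim.x + threadIdx.x. */
static void saxpy_thread(int blockIdx, int blockDim, int threadIdx,
                         int n, float a, const float *x, float *y) {
    int i = blockIdx * blockDim + threadIdx;
    if (i < n)                 /* guard: the grid may overshoot n */
        y[i] = a * x[i] + y[i];
}

/* CPU driver that emulates launching a grid of blocks. */
static void saxpy_launch(int n, float a, const float *x, float *y) {
    int blockDim = 4;                        /* threads per block */
    int gridDim = (n + blockDim - 1) / blockDim;
    for (int b = 0; b < gridDim; b++)
        for (int t = 0; t < blockDim; t++)
            saxpy_thread(b, blockDim, t, n, a, x, y);
}
```

On a GPU the two emulation loops disappear: every (block, thread) pair runs concurrently.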

8 GPU Programming Features in CUDA
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block

9-11 GPGPU with OpenHMPP
Directive-based approaches provide several advantages:
- More readable codes
- Only one file
- Independent from the hardware platform
- Reasonable performance
They also give explicit control of software-managed caches and support standard loop transformations.

12 GPU Programming Features with OpenHMPP
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block
OpenHMPP directives covering these features: gridify, advancedload, delegatedstore, permute, unroll, fuse, tile, shared.

13 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

14 diKernel: Domain-Independent Computational Kernel
Abstraction levels:
- DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)
- DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)
- SEMANTIC LEVEL (control flow and data dependence graphs)
- SYNTACTIC LEVEL (abstract syntax tree)
- TEXT LEVEL (ASCII code)
A diKernel characterizes the computations carried out in a program without being affected by how they are coded, and exposes multiple levels of parallelism.
M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

15 Building the KIR
A non-statement-based, high-level, hierarchical IR, built in three steps:
1. diKernel recognition on the data dependence graph (DDG)
2. Identification of flow dependences
3. Hierarchy of execution scopes reflecting the computational stages, and diKernel classification
J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.

16 Example of KIR: CONV3D
(Shaded kernels are omitted in the discovery of parallelism.)
 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k]+coefx*
 8.         (
 9.           input[i-1][j][k]+input[i+1][j][k]+
10.           input[i-2][j][k]+input[i+2][j][k]+
11.           input[i-3][j][k]+input[i+3][j][k]+
12.           input[i-4][j][k]+input[i+4][j][k]
13.         );
14.       float tempy = input[i][j][k]+coefy*
15.         (
16.           input[i][j-1][k]+input[i][j+1][k]+
17.           input[i][j-2][k]+input[i][j+2][k]+
18.           input[i][j-3][k]+input[i][j+3][k]+
19.           input[i][j-4][k]+input[i][j+4][k]
20.         );
21.       float tempz = input[i][j][k]+coefz*
22.         (
23.           input[i][j][k-1]+input[i][j][k+1]+
24.           input[i][j][k-2]+input[i][j][k+2]+
25.           input[i][j][k-3]+input[i][j][k+3]+
26.           input[i][j][k-4]+input[i][j][k+4]
27.         );
28.       output[i][j][k] =
29.         output[i][j][k]+tempx+tempy+tempz;
30.     }
31.   }
32. }
KIR:
ROOT EXECUTION SCOPE
  ES_for_i,j,k (Fig. 1a, lines 4-32)
    K<tempx_7> scalar assignment
    K<tempy_14> scalar assignment
    K<tempz_21> scalar assignment
    K<output_28> regular reduction

17 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

18 GPU Programming Features addressed by our Automatic Technique
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block

19-20 Detection of coalesced accesses to the GPU global memory
CHRECS_x_k = [{0,+,1}][{0,+,1}]

(a) Source code S1:
// only for_i is threadified
for (i = 0; i <= N; i++) {
  for (j = 0; j <= N; j++) {
    ... x[i][j] ...
  }
}

(b) Non-coalesced accesses:
         T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0      x[0][0]    x[1][0]    x[2][0]
j=1      x[0][1]    x[1][1]    x[2][1]
j=2      x[0][2]    x[1][2]    x[2][2]
chrecs:
1st dim  {0}        {1}        {2}
2nd dim  {0,+,1}    {0,+,1}    {0,+,1}   (the same for all threads)

(c) Source code S2:
// only for_j is threadified
for (j = 0; j <= N; j++) {
  for (i = 0; i <= N; i++) {
    ... x[i][j] ...
  }
}

(d) Coalesced accesses:
         T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0      x[0][0]    x[0][1]    x[0][2]
i=1      x[1][0]    x[1][1]    x[1][2]
i=2      x[2][0]    x[2][1]    x[2][2]
chrecs:
1st dim  {0,+,1}    {0,+,1}    {0,+,1}
2nd dim  {0}        {1}        {2}       (a contiguous range)
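The contiguity test on the instantiated chrecs can be sketched as follows. This is a deliberate simplification of the technique: once the threadified index is fixed, each dimension's chrec reduces to its start constant, and accesses coalesce when consecutive threads yield consecutive constants in the fastest-varying dimension. All names are ours:

```c
#include <assert.h>
#include <stdbool.h>

/* Instantiated chrec of one thread, reduced to the start constant of
 * the fastest-varying (last) array dimension, e.g. {2} for thread T2. */
typedef struct { int last_dim_offset; } thread_access;

/* Accesses coalesce when consecutive threads touch a contiguous range
 * of offsets in the last dimension. */
static bool is_coalesced(const thread_access *acc, int nthreads) {
    for (int t = 1; t < nthreads; t++)
        if (acc[t].last_dim_offset != acc[t - 1].last_dim_offset + 1)
            return false;
    return true;
}
```

For S2 the last-dimension offsets of T0, T1, T2 are 0, 1, 2 (coalesced); for S1 every thread starts its row traversal at the same offset, so the test fails.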

21 Locality-Aware Generation of Efficient GPGPU Code (and III)
2. Usage of registers to store reused data within a GPU thread
3. Usage of the GPU shared memory for data shared between the threads of a warp
4. Increase the computational load of a GPU thread (loop tiling preserving coalescing)

22 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

23 Case Study: CONV3D (I)
(Same content as slide 16: the CONV3D source code and its KIR, with shaded kernels omitted in the discovery of parallelism.)

24 Case Study: CONV3D (II)
Variants: conv3d-cpu; conv3d-hmpp1 (coalescing).
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx* ...
The default OpenHMPP policy threadifies the two outermost loops:
CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_1^T0 = [{0}][{0}][{0,+,1}]
CHRECS_input_1^T1 = [{0}][{1}][{0,+,1}]
The loop nest is permuted to for_j, for_k, for_i (permute directive), so that consecutive threads access consecutive positions of the last dimension:
CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input_1^T1 = [{0,+,1}][{0}][{1}]
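The permutation is legal here because every iteration writes a different output element; a toy C check (our own example, much smaller than the CONV3D code) runs both loop orders and compares the results:

```c
#include <assert.h>

#define N 4

/* Original order: i outer, j inner. */
static void add_ij(const float in[N][N], float out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[i][j] = in[i][j] + 1.0f;
}

/* Permuted order: j outer, i inner. Same result, but a threadified
 * j loop now gives consecutive threads consecutive columns. */
static void add_ji(const float in[N][N], float out[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i][j] = in[i][j] + 1.0f;
}
```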

25 Case Study: CONV3D (III)
conv3d-hmpp2: registers.
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx*
8.         (
9.           input[i-1][j][k]+input[i+1][j][k]+ ...
CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_2 = [{-1,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input_3 = [{1,+,1}][{0,+,1}][{0,+,1}]
Instantiated for thread T0:
CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input_2^T0 = [{-1,+,1}][{0}][{0}]
CHRECS_input_3^T0 = [{1,+,1}][{0}][{0}]
The chrecs differ only in the start of the sequential dimension, so successive iterations re-read the same elements and the values can be kept in registers.
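A heavily simplified sketch of that reuse test: two accesses whose instantiated chrecs differ only in the start constant of the sequential dimension revisit the same elements a few iterations apart, so the loaded value is a candidate for a register. The function name and the window bound are our assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Two chrecs [{a,+,1}][c][c] and [{b,+,1}][c][c] of the same thread
 * traverse the same element sequence shifted by (b - a): what one
 * access loads, the other re-loads |b - a| iterations later.  Reuse
 * pays off when that distance fits in a small register window. */
static bool reusable_in_registers(int start_a, int start_b, int window) {
    int d = start_b - start_a;
    if (d < 0) d = -d;
    return d <= window;
}
```

For the 9-point CONV3D x-stencil the starts range over -4..4, well inside an 8-element window; an access 100 rows away would not qualify.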

26 Case Study: CONV3D (and IV)
conv3d-hmpp3: shared memory.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
21.      float tempz = input[i][j][k]+coefz*
22.        (
23.          input[i][j][k-1]+input[i][j][k+1]+
24.          input[i][j][k-2]+input[i][j][k+2]+
25.          input[i][j][k-3]+input[i][j][k+3]+
26.          input[i][j][k-4]+input[i][j][k+4]
27.        );
                          T0                         T1
                 1st dim  2nd dim  3rd dim   1st dim  2nd dim  3rd dim
CHRECS_input_19  {0,+,1}  {0}      {0}       {0,+,1}  {0}      {1}
CHRECS_input_20  {0,+,1}  {0}      {-1}      {0,+,1}  {0}      {0}
CHRECS_input_21  {0,+,1}  {0}      {1}       {0,+,1}  {0}      {2}
CHRECS_input_22  {0,+,1}  {0}      {-2}      {0,+,1}  {0}      {-1}
CHRECS_input_23  {0,+,1}  {0}      {2}       {0,+,1}  {0}      {3}
CHRECS_input_24  {0,+,1}  {0}      {-3}      {0,+,1}  {0}      {-2}
CHRECS_input_25  {0,+,1}  {0}      {3}       {0,+,1}  {0}      {4}
CHRECS_input_26  {0,+,1}  {0}      {-4}      {0,+,1}  {0}      {-3}
CHRECS_input_27  {0,+,1}  {0}      {4}       {0,+,1}  {0}      {5}
The threads of a warp read overlapping ranges, so the data is staged in the GPU shared memory via the shared clause of the gridify directive.
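The sharing pattern the table exposes can be checked with a tiny range-overlap test (a plain-C simplification; `warp_mates_share` and the two-thread model are ours):

```c
#include <assert.h>
#include <stdbool.h>

/* Thread T reads last-dimension offsets [T + lo, T + hi] (the
 * k-stencil).  Two neighbouring threads of a warp share data when
 * their ranges intersect, which flags the array for staging in the
 * GPU shared memory. */
static bool warp_mates_share(int lo, int hi) {
    int t0_lo = 0 + lo, t0_hi = 0 + hi;
    int t1_lo = 1 + lo, t1_hi = 1 + hi;
    return t0_hi >= t1_lo && t1_hi >= t0_lo;
}
```

For the CONV3D z-stencil (lo = -4, hi = 4) the ranges of T0 and T1 overlap in eight elements; a pointwise access (lo = hi = 0) shares nothing.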

27 Case Study: SGEMM (I)
 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }
KIR:
ROOT EXECUTION SCOPE
  ES_for_i,j (Fig. 3a, lines 5-13)
    K<prod_7> scalar assignment
    ES_for_l (Fig. 3a, lines 8-10)
      K<prod_9> scalar reduction
    K<C_11> regular reduction

28 Case Study: SGEMM (II)
Variants: sgemm-cpu; sgemm-mkl (Intel MKL); sgemm-hmpp1 (offloading, and check coalescing).
(SGEMM source code as on slide 27, lines 1-13.)
          not instantiated        T0                  T1
          1st dim   2nd dim   1st dim  2nd dim   1st dim  2nd dim
CHRECS_A  {0,+,1}   {0,+,1}   {0}      {0,+,1}   {0}      {0,+,1}
CHRECS_B  {0,+,1}   {0,+,1}   {0,+,1}  {0}       {0,+,1}  {1}
CHRECS_C  {0,+,1}   {0,+,1}   {0}      {0}       {0}      {1}

29 Case Study: SGEMM (and III)
sgemm-hmpp2: tiling preserving coalescing.
(SGEMM source code as on slide 27, lines 1-13.)
Algorithm 4: Increase the computational load of a GPU thread
1: procedure INCREASELOAD
   Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
   Input: loop nest L = L_1, L_2, ..., L_l where both L_1, L_2 are threadified
   Input: amount of data D to be processed by a GPU thread
2:   increment the step of the outer loop L_1 to D
3:   for each scalar variable s in L do
4:     promote s to an array s[D]
5:     transform reads and writes to s into loops of D iterations
6:   end for
7: end procedure
sgemm-hmpp3: let the compiler use the registers (unroll).
sgemm-hmpp4: use the shared memory for B.
sgemm-cublas: NVIDIA CUBLAS library.
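Algorithm 4 can be applied by hand to the SGEMM nest. The C sketch below (fixed small sizes and function names of our own choosing) steps the outer loop by D and promotes the scalar prod to prod[D], so one simulated thread computes D rows; comparing against the untransformed nest shows the result is unchanged:

```c
#include <assert.h>

enum { M = 4, N = 4, K = 4, D = 2 };  /* D = data per GPU thread */

/* Straightforward SGEMM loop nest (as in the slide). */
static void sgemm_ref(float A[M][K], float B[K][N], float C[M][N],
                      float alpha, float beta) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float prod = 0;
            for (int l = 0; l < K; l++)
                prod += A[i][l] * B[l][j];
            C[i][j] = alpha * prod + beta * C[i][j];
        }
}

/* Algorithm 4 applied by hand: the outer loop steps by D and prod is
 * promoted to prod[D].  The j-indexed accesses to B and C keep their
 * pattern, which is why the transformation preserves coalescing. */
static void sgemm_tiled(float A[M][K], float B[K][N], float C[M][N],
                        float alpha, float beta) {
    for (int i = 0; i < M; i += D)
        for (int j = 0; j < N; j++) {
            float prod[D];
            for (int d = 0; d < D; d++) prod[d] = 0;
            for (int l = 0; l < K; l++)
                for (int d = 0; d < D; d++)
                    prod[d] += A[i + d][l] * B[l][j];
            for (int d = 0; d < D; d++)
                C[i + d][j] = alpha * prod[d] + beta * C[i + d][j];
        }
}
```

Each output element is accumulated in the same order in both versions, so the results match exactly, not just approximately.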

30 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

31 Performance Evaluation: CONV3D
[Figure] GFLOPS of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 for size_x, size_y and size_z in 128, 256, 384, 512, 640 and 768, on the CPU (nova), a GPU Tesla S1070 (nova) and a GPU Tesla S2050 (pluton). Fermi cards introduced memory caches.

32 Performance Evaluation: SGEMM
[Figure] GFLOPS of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas for m, n and k in 128 to 2048 (steps of 128), 4096, 6144 and 8192, on the CPU (nova), a GPU Tesla S1070 (nova) and a GPU Tesla S2050 (pluton). The biggest improvement factor is the usage of the GPU shared memory.

33 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work

34 Conclusions
- KIR-based locality-aware automatic parallelization technique that targets GPU-based heterogeneous systems
- Exploits data locality in the complex GPU memory hierarchy
- Two representative case studies: CONV3D & SGEMM
- Chains of recurrences model the accesses to n-dimensional arrays
- OpenHMPP directives make the generated GPU code understandable and portable

35 Future Work
- New automatic partitioning algorithm of the KIR to handle the interactions between computations in full-scale applications
- Auto-tuning approaches to select the best-performing variant on a given hardware architecture
- Tests with larger benchmark suites and on other manycore accelerators

36 Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 Amsterdam, Netherlands


More information

OPENACC ONLINE COURSE 2018

OPENACC ONLINE COURSE 2018 OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA ABOUT THIS COURSE 3 Part Introduction to OpenACC Week 1 Introduction to OpenACC Week

More information

Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning

Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning Use of Synthetic Benchmarks for Machine- Learning-based Performance Auto-tuning Tianyi David Han and Tarek S. Abdelrahman The Edward S. Rogers Department of Electrical and Computer Engineering University

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Towards a Performance- Portable FFT Library for Heterogeneous Computing

Towards a Performance- Portable FFT Library for Heterogeneous Computing Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon

More information

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

Early Experiences with the OpenMP Accelerator Model

Early Experiences with the OpenMP Accelerator Model Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department

More information

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications

Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Ahmet Erdem, Gianluca Palermo 6, and Cristina Silvano 6 Department of Electronics, Information and Bioengineering Politecnico di

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-Aware Mapping of Nested Parallel Patterns on GPUs Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Advanced OpenACC. Steve Abbott November 17, 2017

Advanced OpenACC. Steve Abbott November 17, 2017 Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO

How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective

More information

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects

Review for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Sequoia. Mike Houston Stanford University July 9 th, CScADS Workshop

Sequoia. Mike Houston Stanford University July 9 th, CScADS Workshop Sequoia Mike Houston Stanford University July 9 th, 2007 - CScADS Workshop Today s outline Sequoia programming model Sequoia targets Tuning in Sequoia http://sequoia.stanford.edu - Supercomputing 2006

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

GPU programming made easier

GPU programming made easier GPU programming made easier Jacob Jepsen 6. June 2014 University of Copenhagen Department of Computer Science 6. June 2014 Introduction We created a tool that reduces the development time of GPU code.

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,

More information

Accelerator programming with OpenACC

Accelerator programming with OpenACC ..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

Parallel Hybrid Computing Stéphane Bihan, CAPS

Parallel Hybrid Computing Stéphane Bihan, CAPS Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Concepts Introduced. Classes of Computers. Classes of Computers (cont.) Great Architecture Ideas. personal computers (PCs)

Concepts Introduced. Classes of Computers. Classes of Computers (cont.) Great Architecture Ideas. personal computers (PCs) Concepts Introduced Classes of Computers classes of computers great architecture ideas software levels computer components performance measures technology trends personal computers (PCs) servers intended

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

OpenMP Optimization and its Translation to OpenGL

OpenMP Optimization and its Translation to OpenGL OpenMP Optimization and its Translation to OpenGL Santosh Kumar SITRC-Nashik, India Dr. V.M.Wadhai MAE-Pune, India Prasad S.Halgaonkar MITCOE-Pune, India Kiran P.Gaikwad GHRIEC-Pune, India ABSTRACT For

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information