Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
1 Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 Amsterdam, Netherlands
2 Outline Motivation: General Purpose Computation with GPUs GPGPU with CUDA & OpenHMPP The KIR: an IR for the Detection of Parallelism Locality-Aware Generation of Efficient GPGPU Code Case Studies: CONV3D & SGEMM Performance Evaluation Conclusions & Future Work
4 The Parallel Challenge. [Figure: growth of uniprocessor performance relative to the VAX-11/780, spanning VAX, MIPS, Sun, Alpha, POWER, Pentium, Athlon and multicore Xeon/Core i7 systems: roughly 25%/year in the early 1980s, 52%/year through the early 2000s, and about 22%/year since] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.
5 General Purpose Computing with GPUs. [Figure: number of TOP500 systems with accelerators/co-processors: NVIDIA, Intel, ATI/AMD, Clearspeed CSX600, Cell] The TOP500 List, June 2014.
7 GPGPU with CUDA. The first GPGPU programs looked like graphics applications; CUDA enables the use of C. A CUDA kernel specifies the operation of a single GPU thread. Main ideas: (1) lightweight parallel threads in a hierarchy: grid, block; (2) shared memory; (3) barriers.
8 GPU Programming Features in CUDA:
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergency
7. Occupancy
8. Threads per block
9 GPGPU with OpenHMPP. Directive-based approaches provide several advantages: more readable codes, only one file, independence from the hardware platform, and reasonable performance — together with explicit control of software-managed caches and standard loop transformations.
12 GPU Programming Features with OpenHMPP:
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergency
7. Occupancy
8. Threads per block
OpenHMPP directives covering these features: gridify, advancedload, delegatedstore, permute, unroll, fuse, tile, shared.
14 diKernel: Domain-Independent Computational Kernel. Levels of abstraction: domain-specific concept level (problem solving methods and application domain); domain-independent concept level (programming practice); semantic level (control flow and data dependence graphs); syntactic level (abstract syntax tree); text level (ASCII code). A diKernel characterizes the computations carried out in a program without being affected by how they are coded, and exposes multiple levels of parallelism. M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.
15 Building the KIR: a non-statement-based, high-level, hierarchical IR. 1. diKernel recognition on the DDG. 2. Identification of flow dependences. 3. Hierarchy of execution scopes reflecting the computational stages, and diKernel classification. J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.
16 Example of KIR: CONV3D (shaded code is omitted in the discovery of parallelism)

 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx,coefy,coefz,*input,*output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k]+coefx*
 8.         (
 9.          input[i-1][j][k]+input[i+1][j][k]+
10.          input[i-2][j][k]+input[i+2][j][k]+
11.          input[i-3][j][k]+input[i+3][j][k]+
12.          input[i-4][j][k]+input[i+4][j][k]
13.         );
14.       float tempy = input[i][j][k]+coefy*
15.         (
16.          input[i][j-1][k]+input[i][j+1][k]+
17.          input[i][j-2][k]+input[i][j+2][k]+
18.          input[i][j-3][k]+input[i][j+3][k]+
19.          input[i][j-4][k]+input[i][j+4][k]
20.         );
21.       float tempz = input[i][j][k]+coefz*
22.         (
23.          input[i][j][k-1]+input[i][j][k+1]+
24.          input[i][j][k-2]+input[i][j][k+2]+
25.          input[i][j][k-3]+input[i][j][k+3]+
26.          input[i][j][k-4]+input[i][j][k+4]
27.         );
28.       output[i][j][k] =
29.         output[i][j][k]+tempx+tempy+tempz;
30.     }
31.   }
32. }

ROOT EXECUTION SCOPE
  ES_for_i,j,k (Fig. 1a, lines 4-32)
    K<tempx_7>   scalar assignment
    K<tempy_14>  scalar assignment
    K<tempz_21>  scalar assignment
    K<output_28> regular reduction
18 GPU Programming Features addressed by our Automatic Technique (the eight features listed on slide 8).
19 Detection of coalesced accesses to the GPU global memory. CHRECS_x_k = [{0,+,1}][{0,+,1}]

(a) Source code S1 (only for_i is threadified):
for (i = 0; i <= N; i++) {
  for (j = 0; j <= N; j++) {
    ... x[i][j] ...
  }
}

(b) Non-coalesced accesses:
      T0 (i=0)  T1 (i=1)  T2 (i=2)
j=0   x[0][0]   x[1][0]   x[2][0]
j=1   x[0][1]   x[1][1]   x[2][1]
j=2   x[0][2]   x[1][2]   x[2][2]
chrecs: 1st dim {0} / {1} / {2}; 2nd dim {0,+,1} in every thread — at each step, consecutive threads access different rows.

(c) Source code S2 (only for_j is threadified):
for (j = 0; j <= N; j++) {
  for (i = 0; i <= N; i++) {
    ... x[i][j] ...
  }
}

(d) Coalesced accesses:
      T0 (j=0)  T1 (j=1)  T2 (j=2)
i=0   x[0][0]   x[0][1]   x[0][2]
i=1   x[1][0]   x[1][1]   x[1][2]
i=2   x[2][0]   x[2][1]   x[2][2]
chrecs: 1st dim {0,+,1} (the same in every thread); 2nd dim {0} / {1} / {2} — a contiguous range, so the accesses coalesce.
21 Locality-Aware Generation of Efficient GPGPU Code (and III):
2. Usage of registers to store reused data within a GPU thread
3. Usage of the GPU shared memory for data shared between the threads of a warp
4. Increase of the computational load of a GPU thread (loop tiling preserving coalescing)
23 Case Study: CONV3D (I) — the source code and KIR shown on slide 16.
24 Case Study: CONV3D (II). conv3d-cpu; conv3d-hmpp1: coalescing. Under the default OpenHMPP policy (the chrecs below correspond to threadifying the two outermost loops, for_i and for_j), the access input[i][j][k] on line 7 has CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}], which instantiates per thread as CHRECS_input_1^T0 = [{0}][{0}][{0,+,1}] and CHRECS_input_1^T1 = [{0}][{1}][{0,+,1}]: consecutive threads do not access a contiguous range of memory. The loop nest is therefore permuted to for_j, for_k, for_i (permute directive), giving CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}] and CHRECS_input_1^T1 = [{0,+,1}][{0}][{1}], a contiguous range in the innermost dimension, so the accesses coalesce.
25 Case Study: CONV3D (III). conv3d-hmpp2: registers. For the accesses input[i][j][k], input[i-1][j][k] and input[i+1][j][k] on lines 7-9: CHRECS_input_1 = [{0,+,1}][{0,+,1}][{0,+,1}], CHRECS_input_2 = [{-1,+,1}][{0,+,1}][{0,+,1}], CHRECS_input_3 = [{1,+,1}][{0,+,1}][{0,+,1}]. Instantiated for thread T0: CHRECS_input_1^T0 = [{0,+,1}][{0}][{0}], CHRECS_input_2^T0 = [{-1,+,1}][{0}][{0}], CHRECS_input_3^T0 = [{1,+,1}][{0}][{0}]: the ranges overlap between consecutive i iterations of the same thread, so the reused values are kept in registers.
26 Case Study: CONV3D (and IV). conv3d-hmpp3: shared memory. For the nine input accesses of the tempz computation (input[i][j][k] on line 21 and the k±1..4 neighbours on lines 23-26), the chrecs instantiated for threads T0 and T1 are (1st / 2nd / 3rd dims):

                 T0                  T1
CHRECS_input_19  {0,+,1} {0} {0}     {0,+,1} {0} {1}
CHRECS_input_20  {0,+,1} {0} {-1}    {0,+,1} {0} {0}
CHRECS_input_21  {0,+,1} {0} {1}     {0,+,1} {0} {2}
CHRECS_input_22  {0,+,1} {0} {-2}    {0,+,1} {0} {-1}
CHRECS_input_23  {0,+,1} {0} {2}     {0,+,1} {0} {3}
CHRECS_input_24  {0,+,1} {0} {-3}    {0,+,1} {0} {-2}
CHRECS_input_25  {0,+,1} {0} {3}     {0,+,1} {0} {4}
CHRECS_input_26  {0,+,1} {0} {-4}    {0,+,1} {0} {-3}
CHRECS_input_27  {0,+,1} {0} {4}     {0,+,1} {0} {5}

Consecutive threads share most of these values, which are therefore staged in the GPU shared memory (shared clause of the gridify directive).
27 Case Study: SGEMM (I)

 1. int i, j, l, m, n, k;
 2. float A[m][k], B[k][n], C[m][n];
 3. float alpha, beta, prod;
 4.
 5. for (i = 0; i < m; i++) {
 6.   for (j = 0; j < n; j++) {
 7.     prod = 0;
 8.     for (l = 0; l < k; l++) {
 9.       prod += A[i][l] * B[l][j];
10.     }
11.     C[i][j] = alpha * prod + beta * C[i][j];
12.   }
13. }

ROOT EXECUTION SCOPE
  ES_for_i,j (Fig. 3a, lines 5-13)
    K<prod_7>  scalar assignment
    ES_for_l (Fig. 3a, lines 8-10)
      K<prod_9>  scalar reduction
    K<C_11>  regular reduction
28 Case Study: SGEMM (II). sgemm-cpu; sgemm-mkl: Intel MKL; sgemm-hmpp1: offloading (and check coalescing). For the source code on slide 27, the chrecs of the accesses to A, B and C, not instantiated and instantiated for threads T0 and T1 (1st / 2nd dims):

          not instantiated     T0               T1
CHRECS_A  {0,+,1} {0,+,1}     {0} {0,+,1}      {0} {0,+,1}
CHRECS_B  {0,+,1} {0,+,1}     {0,+,1} {0}      {0,+,1} {1}
CHRECS_C  {0,+,1} {0,+,1}     {0} {0}          {0} {1}
29 Case Study: SGEMM (and III). sgemm-hmpp2: tiling preserving coalescing (source code as on slide 27).

Algorithm 4 — Increase the computational load of a GPU thread
1: procedure INCREASELOAD
     Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
     Input: loop nest L = L_1, L_2, ..., L_l where both L_1, L_2 are threadified
     Input: amount of data D to be processed by a GPU thread
2:   increment the step of the outer loop L_1 to D
3:   for each scalar variable s in L do
4:     promote s to an array s[D]
5:     transform reads and writes to s into loops of D iterations
6:   end for
7: end procedure

sgemm-hmpp3: let the compiler use the registers (unroll). sgemm-hmpp4: use the shared memory for B. sgemm-cublas: NVIDIA CUBLAS library.
31 Performance Evaluation: CONV3D. size_x, size_y and size_z in 128, 256, 384, 512, 640 and 768. [Figure: GFLOPS of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on the CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)] Fermi cards introduced memory caches.
32 Performance Evaluation: SGEMM. m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192. [Figure: GFLOPS of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas on the CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)] The biggest improvement factor is the usage of the GPU shared memory.
34 Conclusions: a KIR-based, locality-aware automatic parallelization technique that targets GPU-based heterogeneous systems and exploits data locality in the complex GPU memory hierarchy; two representative case studies, CONV3D & SGEMM; chains of recurrences model the accesses to n-dimensional arrays; OpenHMPP directives enabled great understandability and portability of the generated GPU code.
35 Future Work: a new automatic partitioning algorithm of the KIR to handle the interactions between computations in full-scale applications; auto-tuning approaches to select the best-performing variant on a given hardware architecture; tests with a larger benchmark suite and on other manycore accelerators.
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationTowards a Performance- Portable FFT Library for Heterogeneous Computing
Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department
More informationAutomatic Pruning of Autotuning Parameter Space for OpenCL Applications
Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Ahmet Erdem, Gianluca Palermo 6, and Cristina Silvano 6 Department of Electronics, Information and Bioengineering Politecnico di
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationLocality-Aware Mapping of Nested Parallel Patterns on GPUs
Locality-Aware Mapping of Nested Parallel Patterns on GPUs HyoukJoong Lee *, Kevin Brown *, Arvind Sujeeth *, Tiark Rompf, Kunle Olukotun * * Pervasive Parallelism Laboratory, Stanford University Purdue
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationAdvanced OpenACC. Steve Abbott November 17, 2017
Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationHow to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO
How to write code that will survive the many-core revolution Write once, deploy many(-cores) F. Bodin, CTO Foreword How to write code that will survive the many-core revolution? is being setup as a collective
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationPortability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures
Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationSequoia. Mike Houston Stanford University July 9 th, CScADS Workshop
Sequoia Mike Houston Stanford University July 9 th, 2007 - CScADS Workshop Today s outline Sequoia programming model Sequoia targets Tuning in Sequoia http://sequoia.stanford.edu - Supercomputing 2006
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationGPU programming made easier
GPU programming made easier Jacob Jepsen 6. June 2014 University of Copenhagen Department of Computer Science 6. June 2014 Introduction We created a tool that reduces the development time of GPU code.
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationProgress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core
Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,
More informationAccelerator programming with OpenACC
..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationParallel Hybrid Computing Stéphane Bihan, CAPS
Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationConcepts Introduced. Classes of Computers. Classes of Computers (cont.) Great Architecture Ideas. personal computers (PCs)
Concepts Introduced Classes of Computers classes of computers great architecture ideas software levels computer components performance measures technology trends personal computers (PCs) servers intended
More informationOn Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy
On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded
More informationOpenMP Optimization and its Translation to OpenGL
OpenMP Optimization and its Translation to OpenGL Santosh Kumar SITRC-Nashik, India Dr. V.M.Wadhai MAE-Pune, India Prasad S.Halgaonkar MITCOE-Pune, India Kiran P.Gaikwad GHRIEC-Pune, India ABSTRACT For
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More information