Performance and Portability Studies with OpenACC Accelerated Version of GTC-P


Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See, James Lin
Center for High Performance Computing, Shanghai Jiao Tong University, China
Princeton Institute for Computational Science and Engineering, Princeton University, Princeton, NJ, USA
Princeton Plasma Physics Laboratory, Princeton, NJ, USA
NVIDIA, Singapore
Tokyo Institute of Technology, Japan
Corresponding author: James Lin, james@sjtu.edu.cn

Abstract: Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model that provides performance on, and portability across, a wide variety of platforms, including GPUs, multicore CPUs, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well established in the HPC area. Basic versions of this code have demonstrated performance portability on TOP500 supercomputers with different architectures, including Titan, Mira, etc. Motivated by the increasing interest in state-of-the-art portability on emerging architectures, we have implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 CPUs, and OpenPOWER CPUs. In this paper, we also propose two key optimization methods for the OpenACC implementation of the PIC algorithm on multicore CPU and GPU, focusing on removing atomic operations and taking advantage of shared memory. The OpenACC version is shown to deliver impressive productivity and performance with respect to portability and scalability, achieving more than 90% of the performance of the native versions with only about 300 LOC of changes.

Keywords: Gyrokinetic PIC code, GTC-P, OpenACC, CUDA, GPU, OpenPOWER

I. INTRODUCTION

Recent years have witnessed the widespread adoption of accelerators for High Performance Computing (HPC). By the end of 2015, more than 100 accelerated systems were on the list of the world's 500 most powerful supercomputers, accounting for 143 petaflops, over one-third of the list's total FLOPS [1]. While accelerators provide tremendous computational power, they differ significantly in architecture, which demands more generic, high-level programming models. OpenACC is one such model designed to tackle this problem. It provides a directive-based approach in which a single source code base offers functional portability across different platforms, including GPUs, multicore CPUs, and many-core processors. OpenACC today is considered among the most promising approaches to developing high-performance scientific applications [2]. The directives and programming model defined in OpenACC allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator or to manage data and program transfers between the host and the accelerator. OpenACC directives expose parallelism in C, C++, or Fortran programs, which compilers use to provide optimizations for different architectures, such as x86 multicore, GPU, and OpenPOWER. The portability provided by OpenACC makes it possible to run a single source code across platforms with different architectures.
The Gyrokinetic Toroidal Code developed at Princeton (GTC-P) is an advanced plasma turbulence HPC code that can be applied to high-resolution, size-scaling studies relevant to the next-generation International Thermonuclear Experimental Reactor (ITER) [9]. GTC-P is a modern co-design version of the comprehensive original GTC code [8], focused on performance enhancement of the basic particle-in-cell (PIC) operations; it delivers the capability to efficiently carry out computations at extreme scales with unprecedented resolution and speed on a variety of different architectures on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPUs, multicore CPUs, and many-core processors [5]. In this study, motivated by improving state-of-the-art portability on advanced architectures, we focus on the development and implementation of an OpenACC version of GTC-P, together with the associated evaluation of performance portability across multiple modern platforms and of scalability across multiple nodes.

Original contributions:
- We implement the first OpenACC-accelerated version of GTC-P.
- We present two optimization methods for the OpenACC version of GTC-P on x86 multicore and GPU, removing the atomic operation and taking advantage of shared memory, to achieve more than 90% of native performance with 34 and 214 LOC of changes, respectively.
- We study the performance portability of the OpenACC version of GTC-P on x86 multicore, x86+GPU, and OpenPOWER+GPU platforms, and the scalability of this OpenACC code on a real cluster.

The rest of this paper is organized as follows. Section II describes related work. Our implementation and optimization of GTC-P using OpenACC are presented in Sections III and IV. Section V presents the performance evaluation in detail, and Section VI concludes the paper.

II. RELATED WORK

OpenACC evaluation. Several works have focused on the performance evaluation of OpenACC. In [4], Tetsuya Hoshino et al. compare the performance of OpenACC and CUDA based on two microbenchmarks and one real-world CFD application. Their study shows that the lack of an interface for shared memory in OpenACC leads to a performance gap between OpenACC and CUDA. Yichao Wang et al. study the performance portability of OpenACC on NVIDIA Kepler and Intel Knights Corner with one benchmark suite [10]. However, their study covers only single-node evaluation, and portability is not considered. Moreover, x86 multicore and OpenPOWER, which are supported by recent releases of the PGI compiler, are not evaluated in their study.

OpenACC porting work. Several real-world applications have been ported to GPU using OpenACC. Kraus et al. [6] and Stone et al. [7] use OpenACC to accelerate CFD algorithms on multiple GPUs. Their studies showed that OpenACC allows significant acceleration of applications with a large code base with reasonable effort. However, they only study the performance of OpenACC on GPUs; portability across multiple platforms is not considered. F. Hariri et al. developed a portable platform for accelerating PIC codes on heterogeneous many-core architectures [3]. Their study showed that, by porting the PIC algorithm to GPU using only OpenACC, a performance gain of a factor of 3 could be obtained on a Kepler K20X GPU over a Sandy Bridge 8-core CPU. When CUDA is used for fine-tuned optimization, the performance gain rises to 3.4. In our study, we also use CUDA to implement the charge routine for further optimization. However, that work focuses only on parallelization strategies on a single node using OpenACC; portability across multiple platforms and scalability across multiple nodes are not evaluated.

III. OPENACC IMPLEMENTATION OF GTC-P

This section describes in detail the strategies we adopt to implement our OpenACC version of GTC-P based on the GTC-P code running on CPU [9]. The code base is the GTC-P CPU code with OpenMP directives that take full advantage of multicore. We profiled the CPU version using test case A on a Sandy Bridge CPU (E5-2670) and located the hotspots, the charge and push routines, which consume 86.1% of the overall execution time, as shown in Fig. 1. Therefore, our target is to port charge and push onto the accelerator and leave the remaining four subroutines on the host.

Fig. 1: GTC-P profiling result on an 8-core Sandy Bridge CPU (E5-2670) using test case A; the charge and push hotspots account for more than 85% of the execution time.

A. Parallelization

The porting to OpenACC is performed by converting one routine at a time, starting from charge and then moving to push. The parallel regions are defined by the parallel directives, combined with data clauses to specify the movement of data into and out of the parallel region. We add the parallel directives at the top of every loop nest, accompanied by the copy, copyin, and copyout data clauses. We use the independent clause to indicate that there are no data dependencies between loop iterations.
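As an illustration of this first step, the following sketch shows a push-like particle loop annotated with a parallel loop directive and explicit data clauses; the variable names and the update formula are illustrative placeholders, not the actual GTC-P source.

/* Sketch of the naive porting step: a push-like particle loop annotated with
   an OpenACC parallel loop and explicit data clauses (illustrative names). */
void push_sketch(int mi, double dt,
                 const double *restrict efield, double *restrict zion)
{
    #pragma acc parallel loop independent \
                copyin(efield[0:mi]) copy(zion[0:mi])
    for (int m = 0; m < mi; m++) {
        /* advance each particle using the field gathered at its location */
        zion[m] += dt * efield[m];
    }
}

With this form, the compiler generates a GPU kernel for the loop and inserts the host-device copies implied by the data clauses around it.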
B. Data management

The naive OpenACC implementation, which simply offloads the computation-heavy loop nests to the GPU, however, decreases performance relative to the CPU version. The performance gap is caused by the increased elapsed time of data transfers from host to device (HTD) and device to host (DTH). In our test case, we simulate 100 time steps; therefore, data transfers happen 100 times for each kernel, which dominates the performance on the GPU. Obviously, we do not want to waste time moving data between the host and the GPU in every iteration.

Fig. 2: Profiling of GTC-P using nvvp to analyze the data movement between the host and the device. Without the optimization, HTD and DTH memory copies occur in every step; with the optimization, the HTD memory copy occurs only at the beginning of the program.

To reduce the data movement, we create a containing data region using the data directive and combine the parallel directives with the present data clause, indicating that the data is already present in device memory. In addition, we also port the shift routine to the GPU so that all the data operated on by the GPU stays on the device. This allows the data to be initialized on the host, transferred to device memory at the start of the simulation, and retained in device memory for the lifetime of the application. Fig. 2 shows the profiling result before and after reducing the data movement between the host and the device.
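A minimal sketch of this data-region structure, with an illustrative time loop and placeholder array names rather than the actual GTC-P variables, is shown below.

/* Sketch: one enclosing data region moves the arrays to the device once,
   before the time loop, and keeps them resident until the loop completes. */
void timeloop_sketch(int nsteps, int mi, double dt,
                     const double *restrict efield, double *restrict zion)
{
    #pragma acc data copyin(efield[0:mi]) copy(zion[0:mi])
    {
        for (int istep = 0; istep < nsteps; istep++) {
            /* present() asserts the data is already on the device, so the
               compiler generates no per-step host<->device transfers */
            #pragma acc parallel loop present(efield[0:mi], zion[0:mi])
            for (int m = 0; m < mi; m++)
                zion[m] += dt * efield[m];
        }
    }
}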

C. Atomic operation

Accelerating the major part of each kernel can be accomplished by applying the parallel loop directive over the main loop. In the charge routine, however, there is one issue we need to handle carefully: the data race caused by the main loop over all the particles. Different particles may update the same grid point simultaneously, creating a data race and incorrect results. We solve this issue by using atomic memory operations, as shown in Listing 1.

Listing 1: Atomic operations in the charge routine
#pragma acc parallel loop
for (m = 0; m < mi; m++) {
    /* grid indices of the cells this particle deposits charge into */
    ij1 = kk + (mzeta+1)*(jtiontmp  - igrid_in);
    ij2 = kk + (mzeta+1)*(jtion1tmp - igrid_in);
    #pragma acc atomic update
    densityi[ij1]   += d1;
    #pragma acc atomic update
    densityi[ij1+1] += d2;
}

IV. OPENACC OPTIMIZATION OF GTC-P

In this section, we present the methods we adopt to optimize the OpenACC version of GTC-P on the Kepler GPU and on the x86 multicore CPU.

A. Kepler GPU

1) Thread mapping: On NVIDIA Kepler, up to 2048 threads can run on a single streaming multiprocessor (SMX). All the threads active on the same SMX compete for limited resources such as the L1 cache, the texture cache, and registers. Therefore, reducing the occupancy appropriately can improve performance. PGI's OpenACC implementation allows the occupancy to be controlled manually by setting the block size through the vector_length clause. Besides the occupancy, we also need to consider the setup cost of each thread. By default, the compiler launches enough gangs, with the vector width indicated by vector_length, to cover the iteration space. More threads mean fewer iterations executed by each thread, and when the number of iterations per thread is small enough, the setup cost of each thread cancels out the performance gained from the computation. We therefore use the num_gangs and vector_length clauses to control the number of iterations executed by each thread. When we apply the parallel loop directive over a two-level nested loop, OpenACC automatically applies gang parallelism to the outer loop and vector parallelism to the inner loop. In the push routine, the inner loop has just four iterations, and we need to add a loop seq directive over the inner loop to prevent this automatic parallelization by the compiler and thereby improve performance.
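The sketch below illustrates how these clauses might be combined on a push-like loop; the particular values of num_gangs and vector_length, as well as the array names, are illustrative only and not the tuned settings used in this work.

/* Sketch of the thread-mapping controls: num_gangs and vector_length bound
   the number of launched threads, and "loop seq" keeps the four-iteration
   inner loop sequential within each thread. */
void push_inner_sketch(int mi, const double *restrict zion, double *restrict dtem)
{
    #pragma acc parallel loop num_gangs(1024) vector_length(128) \
                present(zion[0:4*mi], dtem[0:mi])
    for (int m = 0; m < mi; m++) {
        double sum = 0.0;
        /* only four iterations: keep them sequential inside each thread */
        #pragma acc loop seq
        for (int j = 0; j < 4; j++)
            sum += zion[4*m + j];
        dtem[m] = sum;
    }
}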
2) Optimization with CUDA: OpenACC is a device-independent standard and does not include support for device-dependent hardware components such as the texture cache and shared memory in its GPU memory model, which means the shared memory cannot be explicitly allocated and used to share data between threads in a thread block. This gives the OpenACC version a significant performance limitation compared to the CUDA version. Moreover, the thread ID is unavailable, unlike in CUDA and OpenMP, so we cannot use cooperative computation to capture locality for co-scheduled threads as CUDA does. OpenACC is, however, interoperable with CUDA, which means we can use highly specialized features available in CUDA for further optimization, of course at the expense of portability and maintainability.

We use CUDA code to optimize the charge routine so that it uses shared memory explicitly, just as the CUDA version does. With an additional 14 LOC, the hybrid OpenACC code obtains a 1.4x speedup compared with the optimized pure OpenACC code on a single node. With this step-by-step optimization, the OpenACC code achieves a 4.21x speedup compared with the original OpenMP code on a single node, as shown in Fig. 3.

Fig. 3: Wall-clock time (secs) for test case A on one node (E5-2695 v3 with two NVIDIA K80s), showing the OpenMP baseline, the OpenACC baseline (1.24x), the version with the thread-mapping optimization (3.5x), and the version with the CUDA optimization (4.21x).
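A sketch of how this OpenACC/CUDA interoperability can be expressed is shown below. Here launch_charge_shared() is a hypothetical wrapper, assumed to be compiled separately with nvcc, whose CUDA kernel accumulates per-block partial charge arrays in shared memory before flushing them to global memory; the array names are illustrative, not the GTC-P data structures.

/* Sketch of OpenACC/CUDA interoperability for the charge routine.
   launch_charge_shared() is a hypothetical CUDA wrapper, not part of GTC-P. */
extern void launch_charge_shared(const double *weight_d, const int *cell_d,
                                 double *densityi_d, int mi, int mgrid);

void charge_hybrid_sketch(const double *weight, const int *cell,
                          double *densityi, int mi, int mgrid)
{
    /* within host_data, the listed names refer to the device copies created
       by the enclosing OpenACC data region, so the CUDA kernel can operate
       on them directly without extra transfers */
    #pragma acc host_data use_device(weight, cell, densityi)
    {
        launch_charge_shared(weight, cell, densityi, mi, mgrid);
    }
}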

B. x86 multicore CPU

When we compiled and ran the OpenACC code (V1) directly on the x86 CPU, there was a significant drop in the performance of charge (V1), which is about 200 times slower than the OpenMP code, as shown in Table I.

TABLE I: Performance of the OpenACC and OpenMP charge routine on a 16-core Sandy Bridge x86 multicore CPU with test case A.
charge V1 (OpenACC, with atomic): about 200x slower than OpenMP
charge V2 (OpenACC, without atomic): 8.91 s
charge (OpenMP): 7.87 s

Fig. 4: An illustration of the thread mapping policy of OpenACC on multicore. Assume there is a loop with eight iterations running on a CPU with four physical cores: the first two iterations are assigned to the 1st physical core, the 3rd and 4th iterations are assigned to the 2nd physical core, and so on.

From the profiling result, we found that the atomic operation is the key factor causing the performance drop of the OpenACC code on x86. In the OpenMP version of the GTC-P code, the atomic operation in charge is avoided by allocating replicated copies of the densityi array for each thread and performing a reduction at the end, which requires the thread ID. The OpenMP thread ID can be obtained through the API function omp_get_thread_num(); however, there is no such API function in OpenACC, so we cannot obtain the thread ID explicitly. But once we understand the thread mapping policy of OpenACC on multicore, we can derive the thread ID manually and remove the atomic operation. An illustration of the thread mapping policy of OpenACC on multicore is shown in Fig. 4, and Listing 2 is a code example showing how to obtain the thread ID manually. After a 34-LOC modification of the original OpenACC code, the execution time of charge (V2) was reduced to 8.91 s, an almost 177x speedup.

Listing 2: Obtaining the OpenACC thread ID manually, based on the OpenACC thread mapping policy on multicore.
if (mi % ACC_NUM_CORES == 0)
    mi_per_core = mi / ACC_NUM_CORES;
else
    mi_per_core = mi / ACC_NUM_CORES + 1;
#pragma acc parallel loop
for (m = 0; m < mi; ++m) {
    /* iterations are block-distributed, so the owning core follows directly */
    thread_id = m / mi_per_core;
}
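Building on Listing 2, the following sketch shows how the derived thread ID can be used to give each thread a private copy of densityi and to perform the reduction afterwards. The array names, the fixed core count, and the assumption that the runtime block-distributes iterations exactly as in Fig. 4 are illustrative assumptions, not the actual GTC-P code; densityi_copies is assumed to be zero-initialized by the caller.

/* Sketch of the atomic-free charge deposit on multicore (the V2 idea): each
   thread deposits into its own copy of densityi, selected via the manually
   derived thread ID, and the copies are reduced afterwards. */
#define ACC_NUM_CORES 16   /* must match the number of cores actually used */

void charge_noatomic_sketch(int mi, int mgrid, const int *restrict cell,
                            const double *restrict weight,
                            double *restrict densityi,
                            double *restrict densityi_copies /* ACC_NUM_CORES*mgrid */)
{
    int mi_per_core = (mi % ACC_NUM_CORES == 0) ? mi / ACC_NUM_CORES
                                                : mi / ACC_NUM_CORES + 1;

    #pragma acc parallel loop
    for (int m = 0; m < mi; m++) {
        int tid = m / mi_per_core;   /* thread ID from the block mapping */
        /* each thread writes only to its own copy, so no atomic is needed */
        densityi_copies[tid * mgrid + cell[m]] += weight[m];
    }

    /* reduce the per-thread copies back into the shared densityi array */
    for (int t = 0; t < ACC_NUM_CORES; t++)
        for (int i = 0; i < mgrid; i++)
            densityi[i] += densityi_copies[t * mgrid + i];
}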
V. EVALUATION RESULTS AND ANALYSIS

In this section, we evaluate the portability and scalability of the OpenACC implementation of GTC-P. For portability, we run the OpenACC code on three different architectures and compare its single-node performance across the different platforms. We evaluate the scalability of OpenACC by comparing it with the CUDA and OpenMP versions across multiple nodes.

A. Portability

In this section, we analyze the portability of the OpenACC version implemented in the previous sections across different platforms. We consider three target computing systems supported by the PGI OpenACC compiler: x86 multicore, x86 + NVIDIA GPU, and OpenPOWER + NVIDIA GPU. Table II shows the system configuration of the different platforms. The NVIDIA K80 is a dual-GPU card; since we only study the portability of the OpenACC code on a single node without MPI, only one of its two GK210 GPUs is used. The target architecture for compilation is specified by the appropriate options (e.g., -ta=tesla and -ta=multicore for the NVIDIA GPU and the x86 multicore CPU, respectively).

TABLE II: System configuration for the portability evaluation
Platform: CPU+GPU; Host: Intel Xeon E5-2695 v3; Device: Tesla K80; Compiler: PGI 16.5
Platform: CPU+GPU; Host: POWER8; Device: Tesla K80; Compiler: PGI 16.7 (Beta)
Platform: CPU; Host: Intel Xeon E5 (Sandy Bridge); Device: none; Compiler: PGI 16.5

In the x86 multicore case, the parallelization performed by the compiler is similar to that implied by OpenMP directives. The compiler uses all available cores on the processor unless a different number is specified by the gang clause or through the ACC_NUM_CORES environment variable. A multicore CPU is treated as a shared-memory accelerator, so data clauses are ignored and no data copies are executed.

From the functional point of view, the portability of the code is excellent, as the same C code, annotated with the same OpenACC pragmas, immediately runs on all architectures. However, the result on x86 was not entirely satisfying, and we optimized the code by removing the atomic operation with an additional 34 LOC, as described previously. Fig. 5 shows the performance of the OpenACC code across the different platforms. From a performance portability perspective, this is an excellent result.

Fig. 5: OpenACC performance on three different platforms (x86 multicore, x86+GPU, and POWER8+GPU): elapsed time in seconds (lower is better), broken down into the Sort, Smooth, Field, Poisson, Push, and Charge routines.

B. Scalability

We evaluated the scalability of our OpenACC implementation of GTC-P against the existing CUDA and OpenMP versions using the π cluster at the High Performance Computing Center (HPCC), Shanghai Jiao Tong University.

The hardware and software environment employed for each node is shown in Table III.

TABLE III: Machine environment for the scalability evaluation (π cluster)
CPU: Intel Xeon E5-2670 (2.6 GHz)
GPU: NVIDIA Tesla Kepler K20M × 2
Interconnect: InfiniBand (Mellanox FDR, 56 Gb/s)
OS: Red Hat Enterprise Linux Server release 6.3
MPI: MPICH 3.2
Compiler: PGI 16.4

1) Weak scaling: In this study, we perform weak scaling experiments. The simulation size of GTC-P is determined by several numerical parameters. Table IV shows the default parameters for problem sizes A, B, and C, which we used to evaluate the weak scaling of GTC-P.

TABLE IV: Parameters for the weak scaling evaluation from the A to the C plasma size, using 1 to 16 MPI processes: mpsi, mthetamax, mgrid, total number of particles, and number of MPI processes for each problem size.

For the weak scaling of the OpenACC and CUDA versions, we use up to 8 GPU nodes of π, with two Tesla K20M GPUs on each node, so the total number of GPUs ranges from 1 to 16. For the weak scaling of the OpenMP version, we use up to 8 CPU nodes of π, each with two 8-core Sandy Bridge CPUs. We use 1, 4, and 16 MPI processes to evaluate the A, B, and C problem sizes, respectively, with one MPI process per GPU or CPU; these parameter and MPI-process settings keep the number of grid points and particles per MPI process constant across the test cases.

Figure 6 shows the weak scaling results for the A, B, and C problem sizes with a particle density of 100 particles per cell using the OpenACC, CUDA, and OpenMP versions. We start from the A-size problem with npartdom=1, then the B-size problem with npartdom=4, and the C-size problem with npartdom=16. Table IV shows the other important parameter settings. From the figure, we can see that OpenACC achieves scalability comparable to that of CUDA.

Fig. 6: Weak scaling of the OpenACC, CUDA, and OpenMP versions of GTC-P (vertical scale = wall-clock time for 100 time steps).

2) Strong scaling: We then use up to 16 nodes to perform the strong scaling evaluation. For strong scaling, we keep the problem size constant and vary the number of MPI processes from 2 to 32. We perform the strong scaling experiments with radial domain decomposition for test case B at a resolution of 100 particles per cell. Fig. 7 shows the strong scaling results of the OpenMP, OpenACC, and CUDA versions of GTC-P on up to 16 nodes. From the figure, we can see that, as expected, CUDA represents the standard for scalability, but the OpenACC version nevertheless offers reasonably comparable performance together with ease of use. Fig. 8 shows the strong scaling of the push and charge routines of the OpenACC version. We can see that the push routine scales very well, with an almost linear curve; the charge routine is the main factor limiting the scalability of the OpenACC version.

Fig. 7: Strong scaling of the OpenMP, OpenACC, and CUDA versions of GTC-P.

Fig. 8: Strong scaling of the OpenACC version of GTC-P and its accelerated kernels.

VI. CONCLUSION AND FUTURE WORK

Motivated by improving state-of-the-art portability, we have developed and implemented an OpenACC version of GTC-P, a discovery-science-capable modern PIC code deployed in investigations of large nuclear fusion systems at increasing problem sizes and unprecedented resolution.
We have run the OpenACC code across different architectures and on multiple nodes to evaluate its portability and scalability. Comparisons of the performance results achieved with OpenACC against previous versions of this code were carried out to examine whether OpenACC is a good option for accelerating this example of a real-world PIC application.

It is found that the OpenACC version exhibits a high level of productivity and portability. In particular, the new OpenACC version satisfies our goal of achieving acceptable performance on NVIDIA GPUs with a level of effort considerably less demanding than doing so with CUDA. This is illustrated by the fact that, with changes to only 213 out of roughly 8,000 LOC of such an application code, the new OpenACC version achieves 73% of the performance of the CUDA version of the code on GPUs, with acceptable scalability up to 16 nodes. In future investigations, we plan to further extend the evaluation of the OpenACC version of GTC-P to other platforms such as ARM, with the goal of gaining insight into how to achieve reasonable performance portability for a production code (as exemplified by GTC-P) rather than for simple kernels alone. This approach holds promise for providing more meaningful benchmarks in realistic assessments of advanced HPC code performance on leadership-class supercomputers.

VII. ACKNOWLEDGEMENT

This research is supported in part by an NSF SAVI project in the US supporting pilot international collaborative research and development, and by the China National High-Tech R&D Plan (863 Plan) projects 214AA1A32 and 216YFB218 at Shanghai Jiao Tong University (SJTU), an NVIDIA Center of Excellence. James Lin gratefully acknowledges support from Japan's JSPS RONPAKU (PhD Thesis) Program at Tokyo Institute of Technology. Thanks are also extended to Dr. Zhen Wang and Michael Wolfe of PGI for their much appreciated advice, and to the OpenPOWER Foundation for kindly providing us access to an OpenPOWER test machine.

REFERENCES

[1] Accelerator use surges in world's top supercomputers.
[2] C. Bonati, E. Calore, S. Coscetti, M. D'Elia, M. Mesiti, F. Negro, S. F. Schifano, and R. Tripiccione. Development of scientific software for HPC architectures using OpenACC: the case of LQCD. In Software Engineering for High Performance Computing in Science (SE4HPCS), 2015 IEEE/ACM 1st International Workshop on. IEEE, 2015.
[3] F. Hariri, T. M. Tran, A. Jocksch, E. Lanti, J. Progsch, P. Messmer, S. Brunner, C. Gheller, and L. Villard. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC. Computer Physics Communications, 2016.
[4] T. Hoshino, N. Maruyama, S. Matsuoka, and R. Takaki. CUDA vs. OpenACC: performance case studies with kernel benchmarks and a memory-bound CFD application. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013.
[5] K. Z. Ibrahim, K. Madduri, S. Williams, B. Wang, S. Ethier, and L. Oliker. Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms. International Journal of High Performance Computing Applications, 27(4), 2013.
[6] J. Kraus, M. Schlottke, A. Adinetz, and D. Pleiter. Accelerating a C++ CFD code with OpenACC. In Proceedings of the First Workshop on Accelerator Programming using Directives. IEEE Press, 2014.
[7] C. P. Stone and B. H. Elton. Accelerating the multi-zone scalar pentadiagonal CFD algorithm with OpenACC. In Proceedings of the Second Workshop on Accelerator Programming using Directives, page 2. ACM, 2015.
[8] K. Tsugane, T. Boku, H. Murai, M. Sato, W. Tang, and B. Wang. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP. Parallel Computing, 2016.
[9] B. Wang, S. Ethier, W. Tang, T. Williams, K. Z. Ibrahim, K. Madduri, S. Williams, and L. Oliker. Kinetic turbulence simulations at extreme scale on leadership-class systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), page 82. ACM, 2013.
[10] Y. Wang, Q. Qin, S. See, and J. Lin. Performance portability evaluation for OpenACC on Intel Knights Corner and NVIDIA Kepler. HPC China, 2013.
