Performance Analysis and Optimization of Gyrokinetic Torodial Code on TH-1A Supercomputer


Xiaoqian Zhu 1,2, Xin Liu 1, Xiangfei Meng 2, Jinghua Feng 2
1 School of Computer, National University of Defense Technology, Changsha, China
2 National Supercomputer Center in Tianjin, Tianjin, China
zhu_xiaoqian@sina.com, liuxin_0117@yeah.net, mengxf@nscc-tj.gov.cn, fengjh@nscc-tj.gov.cn

Abstract
In this study, we test and analyze the performance of the Gyrokinetic Toroidal Code (GTC). Based on the analysis, we port GTC's compute-intensive subroutines to the GPU and accelerate them on the CPU+GPU heterogeneous architecture of the TH-1A supercomputer. Several optimization strategies are developed in the process: subroutines are integrated to reduce data transfer between host and device, GPU memory access is optimized to reduce access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run 6 to 8 times faster, and that the total performance of GTC improves by 3 to 4 times.

Keywords: TH-1A, GPU, high performance computing, GTC, nuclear fusion

I. INTRODUCTION
In recent years, high performance computing (HPC) has become a hot topic, and supercomputing power is regarded as a benchmark of a country's scientific innovation. To meet the growing demand for computing power, and because single-core processor development has hit a barrier, multicore technology has developed rapidly. Multicore technology includes homogeneous and heterogeneous multicore designs.
Heterogeneous multicore designs adopt a main core plus coprocessor pattern, in which the main core handles complex control logic while the coprocessor handles intensive computation; this cooperation is an important route to higher computing power. The CPU+GPU heterogeneous system is a typical heterogeneous multicore architecture, in which the CPU acts as the main core and the GPU acts as the coprocessor. The computing power of GPUs has grown quickly in recent years. NVIDIA's Tesla M2050 GPU is based on the Fermi architecture and includes 448 CUDA cores, which together provide more than 1 TFlops for single-precision computation and about 600 GFlops for double precision. The TH-1A supercomputer, built by the National University of Defense Technology, is based on a CPU+GPU heterogeneous hybrid architecture. It ranked No. 1 in the TOP500 list of the world's supercomputers in November 2010 [1].

Nuclear fusion energy is considered one of the most important ways to resolve humanity's energy and environmental challenges. GTC is an important program in fusion energy research. It is one of the simulation programs that can make full use of a supercomputer's computing power, and the United States Department of Energy has selected it as a benchmark for evaluating supercomputer performance. Its importance also means that GTC needs to run fast. GTC has been ported to most parallel computers for high efficiency, but its performance still needs improvement. In this study we test GTC on the TH-1A supercomputer and, to improve the performance of the whole application, we accelerate GTC's compute-intensive subroutines on the GPU within TH-1A's CPU+GPU hybrid heterogeneous architecture.
In this process, several optimization strategies are developed: subroutines are integrated to reduce data transfer between host and device, GPU memory access is optimized to reduce memory access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run about 6 to 8 times faster, and that the total performance of GTC improves by 3 to 4 times.

The rest of this paper is organized as follows. Section 2 introduces the system architecture of TH-1A and the background of GTC. Section 3 tests GTC on TH-1A and identifies its performance bottleneck, which determines the subroutines that should be accelerated on the GPU. Section 4 describes our optimization techniques in detail. Section 5 presents the performance improvement of GTC after optimization. The last section draws conclusions and proposes future work.

II. BACKGROUND

A. System Architecture of TH-1A
Figure 1 shows the system architecture of TH-1A. TH-1A is composed of a computing system, an interconnect communication system, a monitoring and diagnostic system, and an I/O system. The computing system includes 1024 service nodes and 7168 computing nodes. Each service node is configured with two FT-1000 CPUs (2.93GHz, eight-core). The computing nodes adopt the main core plus coprocessor design pattern: each is configured with two Intel Xeon CPUs (x GHz, six-core) and one NVIDIA M2050 GPU. The M2050 is based on the Fermi architecture and contains fourteen streaming multiprocessors, each with 32 CUDA cores. The frequency of each CUDA core is 1.15 GHz. The peak computing power of TH-1A is 4700 TFlops, and its Linpack performance is 2566 TFlops [2].

Figure 1. System architecture of the TH-1A supercomputer.

B. GTC Program
The Gyrokinetic Toroidal Code is a time-dependent, three-dimensional particle-in-cell (PIC) program used to study turbulent transport in fusion plasmas. The PIC method describes the complex interaction between fields and plasma. When a plasma of charged particles moves in a strong magnetic field, it generates an electrostatic field. The particles are driven by spatial gradients in the equilibrium profiles of the plasma temperature and density [3-4]. The Poisson equation must then be solved to determine the trajectories of the charged particles [5]. The main steps of the PIC algorithm are as follows. First, we distribute the charge of each particle onto its nearest grid points. Then we solve for the electrostatic potential and field at each grid point. Next, we calculate the force acting on each particle from the field at its nearest grid points. Finally, we move the particles according to the forces just calculated, and repeat these steps until the end of the simulation [6].

III. PERFORMANCE ANALYSIS OF GTC ON TH-1A
We run GTC on the TH-1A supercomputer to obtain the execution time of each module. The proportion of execution time per module is shown in Figure 2(a). The electron module takes about 80% of the total processing time and is the bottleneck of GTC's overall performance.

Figure 2. Execution time proportion of GTC: (a) time proportion of modules; (b) time proportion of subroutines.

The electron module is composed of four subroutines: DPHIEFF, PUSHE, SHIFTE and CHARGEE. The proportion of execution time per subroutine is shown in Figure 2(b). DPHIEFF and CHARGEE together take only about 1% of the electron module's time, while PUSHE and SHIFTE take about 99%.
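The profile above also bounds what GPU acceleration of PUSHE and SHIFTE can achieve for the whole program. The following is a minimal sketch using Amdahl's law, which the paper itself does not invoke; the function names are illustrative, and the 0.80 and 0.99 shares are the approximate figures quoted in the text, so the results are rough estimates only.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `f` of the runtime is
// accelerated by a factor `s` and the remaining (1 - f) is unchanged.
double overall_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Share of total runtime spent in PUSHE + SHIFTE, from the approximate
// profile in the text (electron module ~80%, PUSHE + SHIFTE ~99% of it).
double pushe_shifte_fraction() {
    const double electron_share = 0.80;      // Figure 2(a), approximate
    const double pushe_shifte_share = 0.99;  // Figure 2(b), approximate
    return electron_share * pushe_shifte_share;
}

// Ceiling on overall speedup if the accelerated part took no time at all.
double speedup_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With a fraction of about 0.79, this model gives an overall speedup of roughly 2.9 for a subroutine speedup of 6 and roughly 3.3 for a subroutine speedup of 8, with a ceiling near 4.8. Under these assumptions, the unaccelerated 20% of the runtime limits the total gain far more than the last factor of subroutine speedup does.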
PUSHE and SHIFTE are compute-intensive subroutines, so to improve the total performance of GTC we can use the GPU to accelerate them. PUSHE takes most of the electron module's execution time, so it is the primary target for the GPU. However, PUSHE and SHIFTE are called in turn in GTC's main function and there is data dependence between them, so porting only PUSHE would cause a great deal of data transfer between host and device. To avoid this, we port SHIFTE to the GPU as well.

IV. OPTIMIZATION STRATEGIES

A. Subroutines Integrated Optimization
Data processing in a CPU+GPU heterogeneous system proceeds as follows: (1) the initial data to be processed is transferred from host to device; (2) the GPU processes the data stored in device memory; (3) the CPU transfers the result from device memory back to host memory. Host memory and device memory are connected by the PCI-E bus, which provides about 8 GB/s of effective bandwidth [7], so data transfer between host and device across the PCI-E bus often becomes the bottleneck of application performance. In accelerating PUSHE and SHIFTE we adopt a subroutine integration method, based on the principle that large arrays should stay in device memory for as long as possible. This largely avoids transferring big arrays between host and device.

Figure 3(a) shows that PUSHE and SHIFTE are called in turn inside a do-loop body in GTC. When the GPU is used to accelerate these two subroutines, the initial data and the results must be transferred between host and device across the PCI-E bus. Figure 3(b) shows the framework in which the two subroutines are ported to the GPU separately.
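To get a feel for why the PCI-E bus can dominate, here is a minimal cost model that compares copying arrays on every loop iteration with copying them once before and once after the loop. It is a sketch under stated assumptions: only the 8 GB/s bandwidth comes from the text; the function names, the array size, the number of copies per iteration and the iteration count are hypothetical, and per-transfer latency is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Seconds needed to move `bytes` over a link of `bytes_per_second`.
// Per-transfer latency is ignored in this simple model.
double transfer_seconds(std::size_t bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return static_cast<double>(bytes) / bytes_per_second;
}

// Total copy time when every loop iteration performs `copies_per_iter`
// transfers of `bytes` each (the subroutines are ported separately).
double per_iteration_copy_seconds(std::size_t bytes, int copies_per_iter,
                                  long iterations, double bytes_per_second) {
    return transfer_seconds(bytes, bytes_per_second) * copies_per_iter * iterations;
}

// Total copy time when the data is copied in once before the loop and
// copied out once after it (the arrays stay resident in device memory).
double hoisted_copy_seconds(std::size_t bytes, double bytes_per_second) {
    return 2.0 * transfer_seconds(bytes, bytes_per_second);
}
```

For example, with a hypothetical 800 MB of particle data, 8 copies per iteration and 100 iterations at 8 GB/s, the per-iteration scheme spends about 80 seconds on transfers, while the hoisted scheme spends about 0.2 seconds. The absolute numbers are illustrative; the point is that the cost of the first scheme grows with the iteration count and the second does not.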
At each call of a subroutine, the initial data and the results must be transferred between host and device, which limits the improvement of the application's total performance. We therefore adopt the subroutine integration strategy shown in Figure 3(c): the PUSHE and SHIFTE subroutines are integrated into the pushe_shifte_func function. The data copies needed for computation are placed outside the loop body, and the computation of the whole loop is placed on the GPU. In this way the number of data copies is reduced from 8*ncycle*irk to two. Note that PUSHE contains some MPI communication that must be performed on the host, so the arrays needed for the MPI operation still have to be transferred from device memory to host memory. Because this communication sits inside an if statement, we place the data copy inside the if statement as well, so the copy is performed only when the condition is true. This copy has almost no effect on GTC's total performance because the condition is rarely true.

Figure 3. Subroutines integrated optimization.

B. Memory Access Optimization
The device has several memory spaces, including global memory, texture memory, shared memory, cache and registers. These memory spaces make it possible to optimize variables with different access characteristics [7]. We adopt two strategies to reduce memory access latency in the GPU computation of PUSHE and SHIFTE: one is to make threads access global memory in a coalesced (merged) way, and the other is to bind certain arrays to texture memory. The former provides higher bandwidth; the latter reduces latency when the data accessed by adjacent threads lies in the same spatial region. The access patterns and optimization methods for the arrays in PUSHE and SHIFTE are shown in Table 1. Arrays such as wzgc and wpgc0 already satisfy the condition for coalesced access to global memory, and there is no need to put them in shared memory because they are accessed only a few times. For arrays such as rgpsi and mtheta no optimization is applied because their access patterns are very irregular.

TABLE I.
MEMORY ACCESS OPTIMIZATION

Arrays | Memory Access Characteristics | Optimization Method
gradphi, bsp, xsp | Strongly local access | bind to texture memory
wbgc, wpgc, zelectron, zelectron0 | Thread accesses not merged | transpose to merge
wzgc, wpgc0, wtgc0, wtgc1, jtgc0, jtgc1 | Thread accesses merged | none
rgpsi, mtheta, igrid, qtinv and so on | Thread accesses irregular | none

C. static Keyword Optimization
The compiler allocates a fixed storage area for variables declared with the static keyword, and they exist throughout the program's execution. We define a class named gpu_array, whose code is shown below. Arrays are declared and initialized through the constructors of gpu_array. For some arrays, we place the static keyword before the declaration to avoid unnecessary address allocation and data copies from host to device.

    #include <cuda_runtime.h>
    #include <cutil_inline.h>  // CUDA SDK helper header that provides cutilSafeCall

    template <class T>
    class gpu_array {
    public:
        // Allocate device memory for `size` elements.
        gpu_array(size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
        }
        // Allocate device memory and copy `size` elements from the host array `data`.
        gpu_array(T* data, size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
            cutilSafeCall(cudaMemcpy(ptr_, data, size * sizeof(T), cudaMemcpyHostToDevice));
        }
    private:
        T *ptr_;       // device pointer
        size_t size_;  // number of elements
    };
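As a rough, CPU-only illustration of what the static keyword buys when a function is called repeatedly, the sketch below replaces gpu_array with a hypothetical stand-in, counted_buffer, that counts constructions and copies instead of calling CUDA. The names counted_buffer and step are assumptions made for this example and do not appear in the paper's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only stand-in for gpu_array: it counts constructions
// ("allocations") and host-to-buffer copies instead of calling CUDA.
struct counted_buffer {
    static int allocations;
    static int copies;
    std::vector<float> data;

    // Allocate only, like gpu_array(size).
    explicit counted_buffer(std::size_t n) : data(n) { ++allocations; }

    // Allocate and copy from the host, like gpu_array(data, size).
    counted_buffer(const float* host, std::size_t n) : data(host, host + n) {
        ++allocations;
        ++copies;
    }
};
int counted_buffer::allocations = 0;
int counted_buffer::copies = 0;

// Called once per loop iteration, in the role pushe_shifte_func plays in the text.
void step(const float* constant_in, const float* changing_in, std::size_t n) {
    static counted_buffer scratch(n);                 // built on the first call only
    static counted_buffer constants(constant_in, n);  // built and copied on the first call only
    counted_buffer changing(changing_in, n);          // rebuilt and recopied on every call
    (void)scratch;
    (void)constants;
    (void)changing;
}
```

Calling step ten times performs 12 constructions and 11 copies in this model, instead of the 30 constructions and 20 copies that three non-static buffers would need. One design consequence is that a static local is initialized only with the arguments of the first call, so the pattern is safe only for arrays whose size, and for the copying constructor whose contents, do not change between calls.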

The function pushe_shifte_func is called repeatedly in the loop body of GTC's main function. To reduce the cost of address allocation and data copies, we apply the static declaration to selected arrays. The arrays used in the PUSHE routine, both global and local, fall into the following kinds.

Local arrays: arrays whose definition, initialization and use are all inside PUSHE, for example jtgc0, jtgc1, delt, delp, vdrtmp, wzgc, wpgc0, wtgc1, wpgc, wbgc and so on. Because the sizes of these arrays do not change and their values need not be copied from host to device, we declare them static, which saves the time of allocating device addresses on every call.

    static gpu_array<float> d_array(arraysize);

Global arrays: arrays needed by PUSHE and SHIFTE that are passed into pushe_shifte_func as parameters. They are further divided into two kinds according to how they change during the program's execution:

(1) Arrays whose values are set at the beginning of GTC's execution and never change afterwards. Such an array does not need to be copied from host to device after the first call of pushe_shifte_func, so we declare it static to save both the address allocation and the host-to-device copy.

    static gpu_array<float> d_array(h_array, arraysize);

(2) Arrays whose values change frequently during GTC's execution. Because the copies in device memory must be updated every time pushe_shifte_func is called, we do not use the static keyword. Each call then refreshes the arrays in device memory with the latest values.

    gpu_array<float> d_array(h_array, arraysize);

V. EXPERIMENTS

A. Input Files
We use three input files to test the performance of GTC on the TH-1A supercomputer: gtc_63.in, gtc_125.in and gtc_250.in. The number of grid points differs among the three files and is determined by the variables mpsi, mthetamax and r0. The three parameters for each input file are shown in Table 2. We run with 32, 64 and 128 MPI processes, with one process per GPU.

TABLE II. PARAMETERS OF INPUT FILES (columns: mpsi, mthetamax, r0; rows: gtc_63.in, gtc_125.in, gtc_250.in)

B. Experiment Results

1. gtc_63.in
The results for the gtc_63.in input file are shown in Figure 4. The electron module time includes the computing time of subroutines PUSHE and SHIFTE. The label electron-gpu denotes the electron module time when PUSHE and SHIFTE are ported to the GPU, total denotes the total execution time of GTC when PUSHE and SHIFTE run on the CPU, and total-new denotes the total execution time of GTC when PUSHE and SHIFTE are ported to the GPU. The results show a clear improvement after PUSHE and SHIFTE are accelerated on the GPU: the performance of PUSHE and SHIFTE improves about 5 times, and the overall performance of GTC improves about 3 times.

Figure 4. Test of gtc_63.in.

2. gtc_125.in and gtc_250.in
The results for gtc_125.in and gtc_250.in are shown in Figure 5 and Figure 6. The performance of PUSHE and SHIFTE improves about five to eight times, and the total performance of GTC improves about three to four times.

Figure 5. Test of gtc_125.in.

Figure 6. Test of gtc_250.in.

VI. CONCLUSIONS AND FUTURE WORK
In this study, we test and analyze the performance of the Gyrokinetic Toroidal Code (GTC). Based on the analysis, we port GTC's compute-intensive subroutines to the GPU and accelerate them on the CPU+GPU heterogeneous architecture of TH-1A. In the process we develop several optimization strategies to improve the total performance of GTC. Experimental results show that the performance of PUSHE and SHIFTE improves between 5 and 8 times, and the total performance of GTC improves between 3 and 4 times. Our future work has two parts: one is to accelerate other compute-intensive routines of GTC on the GPU; the other is to develop CPU+GPU heterogeneous parallel techniques with an MPI+OpenMP+CUDA approach in order to make full use of TH-1A's computing power.

ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China under grant No , NO . We thank professors ZhiHong Lin, Yi'an Lei and Yong Xiao of Peking University, who helped us solve many problems when porting GTC to the TH-1A supercomputer. We also thank Peng Wang of NVIDIA Corporation, who gave us much good advice on optimizing GTC on the GPU.

REFERENCES
[1]
[2]
[3] S. Ethier, W. M. Tang and Z. Lin, Gyrokinetic particle-in-cell simulations of plasma microturbulence on advanced computing platforms, Journal of Physics: Conference Series,
[4] Z. Lin, T. S. Hahm and W. W. Lee, Turbulent transport reduction by zonal flows: Massively parallel simulations, Science [J], 1998, PP:
[5] Z. Lin, T. S. Hahm, W. W. Lee, et al., Method for solving the gyrokinetic Poisson equation in general geometry, Physical Review E [J], 1995, PP:
[6] S. Klasky, S. Ethier, Z. Lin, et al., Grid-Based Parallel Data Streaming implemented for the Gyrokinetic Toroidal Code, Proceedings of the 2003 ACM/IEEE Conference on Supercomputing [C],
[7] NVIDIA Corporation, NVIDIA CUDA Programming Guide [EB/OL],


More information

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms Liang Men, Miaoqing Huang, John Gauch Department of Computer Science and Computer Engineering University of Arkansas {mliang,mqhuang,jgauch}@uark.edu

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

GRAPHICAL PROCESSING UNIT-BASED PARTICLE-IN-CELL SIMULATIONS*

GRAPHICAL PROCESSING UNIT-BASED PARTICLE-IN-CELL SIMULATIONS* GRAPHICAL PROCESSING UNIT-BASED PARTICLE-IN-CELL SIMULATIONS* Viktor K. Decyk, Department of Physics and Astronomy, Tajendra V. Singh and Scott A. Friedman, Institute for Digital Research and Education,

More information

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center The Stampede is Coming Welcome to Stampede Introductory Training Dan Stanzione Texas Advanced Computing Center dan@tacc.utexas.edu Thanks for Coming! Stampede is an exciting new system of incredible power.

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

Using Graphics Chips for General Purpose Computation

Using Graphics Chips for General Purpose Computation White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1

More information

The GPU-based Parallel Calculation of Gravity and Magnetic Anomalies for 3D Arbitrary Bodies

The GPU-based Parallel Calculation of Gravity and Magnetic Anomalies for 3D Arbitrary Bodies Available online at www.sciencedirect.com Procedia Environmental Sciences 12 (212 ) 628 633 211 International Conference on Environmental Science and Engineering (ICESE 211) The GPU-based Parallel Calculation

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

A MPI-based parallel pyramid building algorithm for large-scale RS image

A MPI-based parallel pyramid building algorithm for large-scale RS image A MPI-based parallel pyramid building algorithm for large-scale RS image Gaojin He, Wei Xiong, Luo Chen, Qiuyun Wu, Ning Jing College of Electronic and Engineering, National University of Defense Technology,

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Performance Diagnosis for Hybrid CPU/GPU Environments

Performance Diagnosis for Hybrid CPU/GPU Environments Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Roberts edge detection algorithm based on GPU

Roberts edge detection algorithm based on GPU Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(7):1308-1314 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Roberts edge detection algorithm based on GPU

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Plasma Particle-in-Cell Codes on GPUs: A Developer s Perspective. Viktor K. Decyk and Tajendra V. Singh UCLA

Plasma Particle-in-Cell Codes on GPUs: A Developer s Perspective. Viktor K. Decyk and Tajendra V. Singh UCLA Plasma Particle-in-Cell Codes on GPUs: A Developer s Perspective Viktor K. Decyk and Tajendra V. Singh UCLA Abstract Particle-in-Cell (PIC) codes are one of the most important codes used in plasma physics

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

GPU Programming. Ringberg Theorie Seminar 2010

GPU Programming. Ringberg Theorie Seminar 2010 or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Running HARMONIE on Xeon Phi Coprocessors

Running HARMONIE on Xeon Phi Coprocessors Running HARMONIE on Xeon Phi Coprocessors Enda O Brien Irish Centre for High-End Computing Disclosure Intel is funding ICHEC to port & optimize some applications, including HARMONIE, to Xeon Phi coprocessors.

More information

Performance Benefits of NVIDIA GPUs for LS-DYNA

Performance Benefits of NVIDIA GPUs for LS-DYNA Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA

More information

arxiv: v1 [hep-lat] 12 Nov 2013

arxiv: v1 [hep-lat] 12 Nov 2013 Lattice Simulations using OpenACC compilers arxiv:13112719v1 [hep-lat] 12 Nov 2013 Indian Association for the Cultivation of Science, Kolkata E-mail: tppm@iacsresin OpenACC compilers allow one to use Graphics

More information

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah

More information

Game-changing Extreme GPU computing with The Dell PowerEdge C4130

Game-changing Extreme GPU computing with The Dell PowerEdge C4130 Game-changing Extreme GPU computing with The Dell PowerEdge C4130 A Dell Technical White Paper This white paper describes the system architecture and performance characterization of the PowerEdge C4130.

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine

Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Samuel Cremer 1,2, Michel Bagein 1, Saïd Mahmoudi 1, Pierre Manneback 1 1 UMONS, University of Mons Computer Science

More information

CUDA Memories. Introduction 5/4/11

CUDA Memories. Introduction 5/4/11 5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Lecture Topic Projects

Lecture Topic Projects Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4 Data reduction, similarity & distance, data augmentation

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information