Performance Analysis and Optimization of Gyrokinetic Toroidal Code on TH-1A Supercomputer
Xiaoqian Zhu 1,2, Xin Liu 1, Xiangfei Meng 2, Jinghua Feng 2
1 School of Computer, National University of Defense Technology, Changsha, China
2 National Supercomputer Center in Tianjin, Tianjin, China
zhu_xiaoqian@sina.com, liuxin_0117@yeah.net, mengxf@nscc-tj.gov.cn, fengjh@nscc-tj.gov.cn

Abstract

In this study, we test and analyze the performance of the Gyrokinetic Toroidal Code (GTC). Guided by the analysis, we port GTC's compute-intensive subroutines to the GPU and accelerate them on the CPU+GPU heterogeneous architecture of the TH-1A supercomputer. Several optimization strategies are developed in the process: subroutines are integrated to reduce data transfer between host and device, GPU memory accesses are optimized to reduce access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run between 6 and 8 times faster, and the total performance of GTC improves by a factor of 3 to 4.

Keywords: TH-1A, GPU, high performance computing, GTC, nuclear fusion

I. INTRODUCTION

In recent years, high performance computing (HPC) has become a prominent topic, and supercomputing power is widely regarded as a benchmark of a country's capacity for scientific innovation. Driven by the growing demand for computing power, and by the development barriers of the single-core processor, multicore technology has advanced greatly. It comes in two forms: homogeneous multicore and heterogeneous multicore.
A heterogeneous multicore design pairs a main core with coprocessors: the main core handles complex control logic, while the coprocessor handles intensive computation. This division of labor has become an important way to increase the computing power of computers. The CPU+GPU heterogeneous system is a typical example of this architecture, with the CPU acting as the main core and the GPU as the coprocessor. GPU computing power has grown rapidly in recent years: NVIDIA's Tesla M2050 GPU, based on the Fermi architecture, contains 448 CUDA cores and delivers more than 1 TFlops of single-precision performance and about 600 GFlops in double precision. The TH-1A supercomputer, built by the National University of Defense Technology, uses this CPU+GPU hybrid heterogeneous architecture and ranked No. 1 on the TOP500 list of world supercomputers in November 2010 [1].

Nuclear fusion energy is regarded as one of the most promising ways to solve humanity's energy and environmental challenges. GTC is an important program in fusion energy research; it is one of the few simulation codes that can fully exploit a supercomputer's computing power, and it has been selected by the United States Department of Energy as a benchmark for evaluating supercomputer performance. Its importance also means that it should run fast. GTC has been ported to most parallel computers in pursuit of high efficiency, but its performance still needs improvement. In this study we run GTC on the TH-1A supercomputer and, to improve the whole application's performance, accelerate its compute-intensive subroutines with GPUs on TH-1A's CPU+GPU hybrid heterogeneous architecture.
In this process we developed several optimization strategies: subroutines are integrated to reduce data transfer between host and device, GPU memory accesses are optimized to reduce memory access latency, and the static keyword is placed before array declarations to avoid unnecessary address allocation and data copies. Experimental results show that the subroutines ported to the GPU run about 6 to 8 times faster, and the total performance of GTC improves by a factor of 3 to 4.

The rest of this paper is organized as follows. Section 2 introduces the system architecture of TH-1A and the background of GTC. Section 3 profiles GTC on TH-1A and uses the analysis to locate the performance bottleneck and decide which subroutines should be accelerated on the GPU. Section 4 describes our optimization techniques in detail. Section 5 presents GTC's performance improvement after optimization. The last section draws conclusions and proposes future work.

II. BACKGROUND

A. System Architecture of TH-1A

Figure 1 shows the system architecture of TH-1A. TH-1A consists of a computing system, an interconnect communication system, a monitoring and diagnostic system, and an I/O system. The computing system comprises 1024 service nodes and 7168 compute nodes. Each service node is configured with two eight-core FT-1000 CPUs. The compute nodes adopt the main-core-plus-coprocessor design: each is configured with two six-core Intel Xeon CPUs (2.93 GHz) and one NVIDIA M2050 GPU. The M2050 is based on the Fermi architecture and contains fourteen streaming multiprocessors, each with 32 CUDA cores running at 1.15 GHz. The peak performance of TH-1A is 4700 TFlops, and its Linpack result is 2566 TFlops [2].

Figure 1 System architecture of TH-1A supercomputer.

B. GTC Program

The Gyrokinetic Toroidal Code is a time-dependent, three-dimensional particle-in-cell (PIC) program used to study turbulent transport in fusion plasmas. The PIC method describes the complex interaction between the field and the plasma: as the charged particles of the plasma move through a strong magnetic field, they set up an electrostatic field, and the particles are driven by spatial gradients in the equilibrium profiles of the plasma temperature and density [3-4]. The Poisson equation must then be solved to determine the trajectories of the charged particles [5]. The main steps of the PIC algorithm are as follows. First, the charge of each particle is deposited on its nearest grid points. Then the electrostatic potential and field are computed at each grid point. Next, the force acting on each particle is calculated from the field at its nearest grid points. Finally, the particles are moved according to the forces just calculated, and these steps repeat until the end of the simulation [6].

III. PERFORMANCE ANALYSIS OF GTC ON TH-1A

We profiled GTC on the TH-1A supercomputer to obtain the execution time of each module. The breakdown is shown in Figure 2(a): the electron module takes up about 80% of the whole processing time and is therefore the bottleneck of GTC's overall performance. The electron module consists of four subroutines: DPHIEFF, PUSHE, SHIFTE and CHARGEE. Their shares are shown in Figure 2(b): DPHIEFF and CHARGEE together account for only about 1% of the electron module's time, while PUSHE and SHIFTE account for about 99%.

(a) Time proportion of modules (b) Time proportion of subroutines
Figure 2 Execution time proportion of GTC.
PUSHE and SHIFTE are compute-intensive subroutines, so to improve GTC's total performance we can accelerate them on the GPU. It is worth noting that PUSHE takes up most of the electron module's execution time, so it is the first candidate for the GPU. However, PUSHE and SHIFTE are called by turns in GTC's main function, and there is some data dependence between the two subroutines; porting only PUSHE to the GPU would therefore cause a great deal of data transfer between host and device. To avoid this, we port SHIFTE to the GPU as well.

IV. OPTIMIZATION STRATEGIES

A. Subroutine Integration

Data processing on a CPU+GPU heterogeneous system proceeds in three steps: (1) the initial data to be processed is transferred from host to device; (2) the GPU processes the data in device memory; (3) the CPU transfers the processed result from device memory back to host memory. Host memory and device memory are connected by the PCI-E bus, which provides about 8 GB/s of effective bandwidth [7], so data transfer between host and device across the PCI-E bus often becomes the bottleneck of an application's performance. In accelerating PUSHE and SHIFTE we therefore integrate the two subroutines, following the principle that arrays holding large amounts of data should stay in device memory as long as possible; this largely avoids transferring big arrays between host and device. As Figure 3(a) shows, PUSHE and SHIFTE are called by turns inside a do-loop in GTC. When the GPU computes these two subroutines, the initial data and the processed results must be transferred between host and device across the PCI-E bus. Figure 3(b) shows the framework in which the two subroutines are ported to the GPU separately.
In that arrangement, each call of a subroutine must transfer its initial data and results between host and device, which limits the application's overall performance. We therefore adopt the subroutine integration strategy shown in Figure 3(c): PUSHE and SHIFTE are merged into a single function, pushe_shifte_func. The data copies needed for the computation are placed outside the loop body, and the computation of the whole loop runs on the GPU. In this way the number of data copies is reduced from 8*ncycle*irk to two. It is worth noting that PUSHE contains some MPI communication that must be performed on the host, so the arrays needed for the MPI operations must still be transferred from device memory to host memory. Because this communication sits inside an if statement, we can place the data copy inside the if statement as well; the copy then occurs only when the condition is true. Since the condition is rarely true, this copy has almost no effect on GTC's total performance.

Figure 3 Subroutines integrated optimization.

B. Memory Access Optimization

The device has several memory spaces: global memory, texture memory, shared memory, cache and registers. These spaces provide the basis for optimizing variables with different access characteristics [7]. We adopt two strategies to reduce memory access latency in the GPU computation of PUSHE and SHIFTE: making threads access global memory in a coalesced way, and binding certain arrays to texture memory. The former provides higher bandwidth; the latter reduces latency when the data accessed by adjacent threads lies in the same spatial region. The access patterns and optimization methods for the arrays in PUSHE and SHIFTE are shown in Table I. The arrays wzgc and wpgc0 already meet the condition for coalesced global-memory access, and since they are accessed only a few times there is no need to put them in shared memory. For arrays such as rgpsi and mtheta, the access patterns are too irregular to optimize.

TABLE I. MEMORY ACCESS OPTIMIZATION

Arrays                                   | Memory Access Characteristic | Optimization Method
gradphi, bsp, xsp                        | strong locality of access    | bind to texture memory
wbgc, wpgc, zelectron, zelectron0        | accesses not coalesced       | transpose to coalesce
wzgc, wpgc0, wtgc0, wtgc1, jtgc0, jtgc1  | accesses already coalesced   | none
rgpsi, mtheta, igrid, qtinv and so on    | irregular accesses           | none

C. static Keyword Optimization

The compiler allocates a fixed storage area for a variable declared with the static keyword, and the variable persists throughout the program's execution. We define a class named gpu_array, shown below. Arrays are declared and initialized through the constructors of gpu_array, and for some arrays we can place the static keyword before the declaration to avoid unnecessary address allocation and data copies from host to device.

    template <class T>
    class gpu_array {
    public:
        gpu_array(size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
        }
        gpu_array(T* data, size_t size) {
            size_ = size;
            cutilSafeCall(cudaMalloc((void**)&ptr_, size * sizeof(T)));
            cutilSafeCall(cudaMemcpy(ptr_, data, size * sizeof(T),
                                     cudaMemcpyHostToDevice));
        }
    private:
        T* ptr_;
        size_t size_;
    };

The function pushe_shifte_func is called repeatedly in the loop body of GTC's main function, so we use static declarations for some arrays to save address allocation and data copies. The arrays used in the PUSHE routine, both global and local, fall into the following kinds.

Local arrays: arrays that are defined, initialized and used entirely inside PUSHE, for example jtgc0, jtgc1, delt, delp, vdrtmp, wzgc, wpgc0, wtgc1, wpgc, wbgc and so on. Their sizes never change and their values need not be copied from host to device, so we place the static keyword before their declarations, which saves the array address allocation time:

    static gpu_array<float> d_array(arraysize);

Global arrays: the arrays needed by the computation of PUSHE and SHIFTE are passed into pushe_shifte_func as parameters. They can be divided further into two kinds according to how they change during the program's execution.

(1) Arrays whose values are set at the beginning of GTC's execution and never change afterwards. Such arrays need not be copied from host to device after the first call of pushe_shifte_func, so we place the static keyword before their declarations to save both the address allocation time and the host-to-device copy time:

    static gpu_array<float> d_array(h_array, arraysize);

(2) Global arrays whose values change often during GTC's execution. These arrays in device memory must be updated every time pushe_shifte_func is called, so we do not use the static keyword; each call then refreshes the device copies to the latest values:

    gpu_array<float> d_array(h_array, arraysize);
V. EXPERIMENTS

A. Input Files

We use three input files to test the performance of GTC on the TH-1A supercomputer: gtc_63.in, gtc_125.in and gtc_250.in. The number of grid points defined by each input file differs and is determined by the variables mpsi, mthetamax and r0; the three parameters of each input file are listed in Table II. We set the number of MPI processes to 32, 64 and 128, with one process per GPU.

TABLE II. PARAMETERS OF INPUT FILES

            | mpsi | mthetamax | r0
gtc_63.in   |      |           |
gtc_125.in  |      |           |
gtc_250.in  |      |           |

B. Experimental Results

1. gtc_63.in

The results for the gtc_63.in input file are shown in Figure 4. The electron label includes the computing time of subroutines PUSHE and SHIFTE; electron-gpu denotes the new time of the electron module with PUSHE and SHIFTE ported to the GPU; total denotes the total execution time of GTC with PUSHE and SHIFTE computed on the CPU; and total-new denotes the new total execution time with PUSHE and SHIFTE ported to the GPU. The results show a clear improvement once PUSHE and SHIFTE are accelerated on the GPU: the two subroutines run about 5 times faster, and GTC as a whole about 3 times faster.

Figure 4 Test of gtc_63.in.

2. gtc_125.in and gtc_250.in

The results for gtc_125.in and gtc_250.in are shown in Figure 5 and Figure 6. PUSHE and SHIFTE run about five to eight times faster, and the total performance of GTC improves by about three to four times.

Figure 5 Test of gtc_125.in.

Figure 6 Test of gtc_250.in.
VI. CONCLUSIONS AND FUTURE WORK

In this study we tested and analyzed the performance of the Gyrokinetic Toroidal Code (GTC). Guided by the analysis results, we ported GTC's compute-intensive subroutines to the GPU and accelerated them on the CPU+GPU heterogeneous architecture of TH-1A, developing several optimization strategies along the way to improve GTC's total performance. Experimental results show that PUSHE and SHIFTE can be accelerated by a factor of 5 to 8, and the total performance of GTC by a factor of 3 to 4. Our future work has two parts: accelerating other compute-intensive routines of GTC with the GPU, and developing CPU+GPU heterogeneous parallelism with an MPI+OpenMP+CUDA approach to make full use of TH-1A's computing power.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under grants No. and No. . We thank Professors Zhihong Lin, Yi'an Lei and Yong Xiao of Peking University, who helped us solve many problems when porting GTC to the TH-1A supercomputer, and Peng Wang of NVIDIA Corporation, who gave us much good advice on optimizing GTC on the GPU.

REFERENCES

[1]
[2]
[3] S. Ethier, W. M. Tang and Z. Lin, "Gyrokinetic particle-in-cell simulations of plasma microturbulence on advanced computing platforms," Journal of Physics: Conference Series.
[4] Z. Lin, T. S. Hahm, W. W. Lee, et al., "Turbulent transport reduction by zonal flows: massively parallel simulations," Science, 1998.
[5] Z. Lin, T. S. Hahm, W. W. Lee, et al., "Method for solving the gyrokinetic Poisson equation in general geometry," Physical Review E, 1995.
[6] S. Klasky, S. Ethier, Z. Lin, et al., "Grid-based parallel data streaming implemented for the Gyrokinetic Toroidal Code," Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003.
[7] NVIDIA Corporation, NVIDIA CUDA Programming Guide.
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationUsing Graphics Chips for General Purpose Computation
White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1
More informationThe GPU-based Parallel Calculation of Gravity and Magnetic Anomalies for 3D Arbitrary Bodies
Available online at www.sciencedirect.com Procedia Environmental Sciences 12 (212 ) 628 633 211 International Conference on Environmental Science and Engineering (ICESE 211) The GPU-based Parallel Calculation
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationLecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationTitan - Early Experience with the Titan System at Oak Ridge National Laboratory
Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid
More informationA MPI-based parallel pyramid building algorithm for large-scale RS image
A MPI-based parallel pyramid building algorithm for large-scale RS image Gaojin He, Wei Xiong, Luo Chen, Qiuyun Wu, Ning Jing College of Electronic and Engineering, National University of Defense Technology,
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationPerformance Diagnosis for Hybrid CPU/GPU Environments
Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationRoberts edge detection algorithm based on GPU
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(7):1308-1314 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Roberts edge detection algorithm based on GPU
More informationAccelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin
Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most
More informationPlasma Particle-in-Cell Codes on GPUs: A Developer s Perspective. Viktor K. Decyk and Tajendra V. Singh UCLA
Plasma Particle-in-Cell Codes on GPUs: A Developer s Perspective Viktor K. Decyk and Tajendra V. Singh UCLA Abstract Particle-in-Cell (PIC) codes are one of the most important codes used in plasma physics
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationGPU Programming. Ringberg Theorie Seminar 2010
or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationRunning HARMONIE on Xeon Phi Coprocessors
Running HARMONIE on Xeon Phi Coprocessors Enda O Brien Irish Centre for High-End Computing Disclosure Intel is funding ICHEC to port & optimize some applications, including HARMONIE, to Xeon Phi coprocessors.
More informationPerformance Benefits of NVIDIA GPUs for LS-DYNA
Performance Benefits of NVIDIA GPUs for LS-DYNA Mr. Stan Posey and Dr. Srinivas Kodiyalam NVIDIA Corporation, Santa Clara, CA, USA Summary: This work examines the performance characteristics of LS-DYNA
More informationarxiv: v1 [hep-lat] 12 Nov 2013
Lattice Simulations using OpenACC compilers arxiv:13112719v1 [hep-lat] 12 Nov 2013 Indian Association for the Cultivation of Science, Kolkata E-mail: tppm@iacsresin OpenACC compilers allow one to use Graphics
More informationRadiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman Scientific Computing and Imaging Institute & University of Utah I. Uintah
More informationGame-changing Extreme GPU computing with The Dell PowerEdge C4130
Game-changing Extreme GPU computing with The Dell PowerEdge C4130 A Dell Technical White Paper This white paper describes the system architecture and performance characterization of the PowerEdge C4130.
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationIntroduc)on to GPU Programming
Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf
More informationImproving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine
Improving performances of an embedded RDBMS with a hybrid CPU/GPU processing engine Samuel Cremer 1,2, Michel Bagein 1, Saïd Mahmoudi 1, Pierre Manneback 1 1 UMONS, University of Mons Computer Science
More informationCUDA Memories. Introduction 5/4/11
5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationLecture Topic Projects
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4 Data reduction, similarity & distance, data augmentation
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More information