VASP Accelerated with GPUs
VASP Accelerated with GPUs: Capabilities, Methods, and Road-Map
Max Hutchinson
University of Chicago; Carnegie Mellon University
GTC, May 17th, 2012
Max Hutchinson (UChicago and CMU), GPU VASP, GTC 5/17/12

Acknowledgements
The rest of our team:
- Michael Widom
- James Komianos
The real VASP team:
- Georg Kresse
- Martijn Marsman
- Jürgen Hafner
This work was supported by the PETTT project PP-CCM-KY P3. This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.

References
M. Hutchinson, M. Widom, VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron, Computer Physics Communications, Volume 183, Issue 7, July 2012.

Table of Contents
1 Context
   - Motivating Science
   - DFT and VASP
2 Capabilities and Performance
   - Low-Level Ports
   - High-Level Ports
   - System Capabilities, Requirements
3 Design Decisions and Methods
   - Guiding Principles
   - Development Cycle
   - Examples
   - Tips
4 Road-Map
   - Our plans
   - Your part

Quantum Chemistry, Hard Condensed Matter
- Modern model for atomic physics has non-classical elements
  - Electron correlation, exchange energy
  - Discretization of energy, angular momentum
- Practical understanding of some materials requires quantum models
  - Nano-scale electronics
  - Surface effects
  - High-resolution spectroscopy
  - Low-temperature structure

Scientific Perspective
- Start by approximating the n-body quantum system with the single-particle Kohn-Sham equation.
- Density functional theory (DFT) approximates correlation and exchange energies as functionals of the electron density.
- Functionals form a ladder of increasing accuracy and computational cost.
- Eigenvalue solvers are then used to find the wave-functions.
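
For reference, the single-particle Kohn-Sham equation mentioned above is usually written as follows. This is standard textbook notation rather than anything taken from the slides; the symbols $v_{\mathrm{eff}}$ (effective potential), $\psi_i$ and $\varepsilon_i$ (orbitals and eigenvalues), and $f_i$ (occupations) are the conventional ones and are assumptions here.

```latex
\left[ -\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{eff}}[n](\mathbf{r}) \right] \psi_i(\mathbf{r})
  = \varepsilon_i \, \psi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_i f_i \, \lvert \psi_i(\mathbf{r}) \rvert^2 .
```

The effective potential depends on the density $n$, which in turn depends on the wave-functions, so the eigenproblem has to be solved self-consistently. The exchange and correlation part of $v_{\mathrm{eff}}$ is where the choice of functional enters.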

One example: Boron
The low temperature structure of elemental boron is not known.
Table: Structural energies E_βα and E_β′α (units meV/atom) for the LDA, PBE, PKZB, and HF functionals. Here β refers to the ideal hR105 structure, and β′ refers to the 107 atom optimized variant of B.hR141. Energies of α are obtained from the super cell hR12x8. All values are given for the 3x3x3 k-point mesh.

Computational Perspective
- DFT is nominally O(n^2 ln n) or O(n^3), depending on system size.
- Exact-exchange is more expensive: O(n^3 ln n) or O(n^4).
- Operations have high fine-grain data parallelism
  - BLAS
  - FFT
  - Scatter-Gather
- Iterations are long (on the order of a second)
- All adds up to a great GPU candidate
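
As a rough worked example of what these scalings imply, consider doubling the system size. This is only an illustration: $C(n)$ is a hypothetical cost function, and constant prefactors and logarithmic terms are ignored.

```latex
\frac{C(2n)}{C(n)} = \frac{(2n)^3}{n^3} = 8 \quad \text{for } O(n^3),
\qquad
\frac{C(2n)}{C(n)} = \frac{(2n)^4}{n^4} = 16 \quad \text{for } O(n^4).
```

Under these idealized assumptions, each doubling of the system size makes an exact-exchange calculation a further factor of 2 more expensive relative to a conventional DFT calculation.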

FFT Port
- FFTs contribute 30-50% of CPU time.
- FFT calls are funneled through kernels (4 of them)
  - Previously used to switch between FFTW and custom FFTs
- Simple copy, compute, copy-back used
Table: PdO benchmark (87 ions, 496 bands, 822 electrons) on Dirac (NERSC). Columns: Cores, CPU, + 1 GPU, Ratio.

BLAS Port
- BLAS calls contribute 15-40% of CPU time.
- BLAS calls are made inline, but there aren't too many important ones
- Again, simple copy, compute, copy-back used
- Performance was poor (20% worse), so this was abandoned early on.
- Advances in CUBLAS might make this profitable
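
A simple way to reason about the copy, compute, copy-back scheme is the cost model below. It is a sketch, not a measurement from this work: $T_{\mathrm{H2D}}$ and $T_{\mathrm{D2H}}$ (host-to-device and device-to-host transfer times), $T_{\mathrm{kernel}}$ and $T_{\mathrm{CPU}}$ are hypothetical symbols introduced here.

```latex
T_{\mathrm{GPU}} = T_{\mathrm{H2D}} + T_{\mathrm{kernel}} + T_{\mathrm{D2H}},
\qquad
\text{offload pays off only if } T_{\mathrm{GPU}} < T_{\mathrm{CPU}} .
```

For many small calls, the two transfer terms can outweigh the time saved in the kernel. The reported 20% slowdown of the BLAS port would be consistent with that, although the slides do not give a breakdown.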

Exact-Exchange (HF) Port
- Hybrid functionals, or exact-exchange, are very intensive
  - > 98% of runtime
  - Factor of 2 in memory use
- Includes interaction between bands
  - Adds a linear order to previous complexities
- VASP implementation is somewhat compartmentalized
  - Calls funnel through two routines
  - Once per k-point per iteration

HF Port Performance: Workstation vs Workstation
Structure   hR12    hR12x8   hR105
Speedup     5.82x   12.39x   20.41x
Table: Run-times of components of VASP exact-exchange runs (rows FOCK_ACC, FOCK_FORCE and Other in seconds, Overall in hours; one cpu and one gpu column per structure). Overall times are projected assuming a total of 5 ionic minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single-device.

HF Port Performance: Workstation vs Supercomputer
Table: Actual run-times of truncated runs (reduced NELM and NSW) of different structures on different platforms. Columns: Struct., k, T-1C1G, T-2C2G, B-16C, B-32C, B-64C, B-128C. T is tirith and B is blacklight; the attribute mCnG indicates m CPU cores and n GPU devices.

Other Capabilities
- Compute capability 2.0 or higher
- Arbitrary CPU:GPU ratios
  - Round-robin
  - Uses file I/O (I'm sorry)
- Mixed or full double precision
  - FFTs in single or double
  - Everything else in double
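
The round-robin sharing of GPUs among CPU processes can be pictured with the sketch below. It is a minimal illustration and not code from the port: `device_for_rank`, `rank` and `n_devices` are hypothetical names, and the slides do not say how the actual assignment is implemented beyond "round-robin".

```c
#include <assert.h>

/* Hypothetical helper (not from the VASP port): map a CPU process,
 * identified by its rank, to a GPU device index in round-robin order,
 * so that any CPU:GPU ratio is handled. */
int device_for_rank(int rank, int n_devices)
{
    assert(rank >= 0 && n_devices > 0);
    return rank % n_devices;
}
```

With 4 ranks and 2 devices, ranks 0 and 2 share device 0 and ranks 1 and 3 share device 1. In a CUDA program the returned index would typically be passed to `cudaSetDevice`.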

Guiding Principles
1 Performance: ultimately, this is our primary concern
   - Intercept high in the call tree
   - Write/use good kernels
2 Programmability: programmer time is a limited quantity
   - Be maximally compartmental, minimally intrusive
   - Don't get too clever
3 Portability: why write something that can't be used?
   - Use standard languages (FORTRAN, C[, Python])
   - Use standard libraries (CUBLAS, CUFFT)
   - Don't add system assumptions

Figure: development cycle. Stage labels: CPU, Profile, Optimize, Translate, Profile, Validate, Optimize, GPU, Debug.

Incremental Ports
Our technique has been to climb up callgraphs.
Pros:
- Important work is done first
- Debugging is [more] palatable
- Provides rough numerical validation
Cons:
- Divergent efforts can require merges
- Inherit high-level structure from CPU code
Perturbation method.

Intercepts
    #ifdef CUDA
        /* Assumptions */
        USE_CUDA = ( condition1 && condition2 && ... );
        if ( USE_CUDA ) {
            fun_cu(foo, bar);   /* intercept (not a kernel) */
        } else {
    #endif
            /* Function to be intercepted */
            fun(foo, bar);
    #ifdef CUDA
        }
    #endif

Validation
    ./vasp_test.py -e ../exes/vasp-pgk -t PdO-v/ -n 1
    ======================================================
    Test Name: PdO-v/
    Run on:
    In: ./tests/3F0T
    Result   Parameter        Test vs Expected
    passed   energy           e+02 vs e+02
    passed   ext. pressure    e+02 vs e+02
    passed   volume           e+03 vs e+03
    passed   stress (xx)      e+02 vs e+02
    passed   stress (yy)      e+02 vs e+02
    passed   stress (zz)      e+02 vs e+02
    passed   stress (xy)      e+00 vs e+00
    passed   stress (yz)      e+00 vs e+00
    passed   stress (zx)      e+00 vs e
    x        loop time        vs
    0.95x    setdij time      vs
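
The kind of "Test vs Expected" check shown in this output can be sketched as a relative-tolerance comparison. This is an assumption about how such a check might work, not the logic of `vasp_test.py`, which the slides do not show; `within_rel_tol` and `rel_tol` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical check (not taken from vasp_test.py): returns 1 if `test`
 * agrees with `expected` to within the relative tolerance `rel_tol`,
 * and 0 otherwise. Falls back to an absolute check when expected is 0. */
int within_rel_tol(double test, double expected, double rel_tol)
{
    assert(rel_tol >= 0.0);
    double diff  = test - expected;
    double scale = expected;
    if (diff < 0.0)  diff  = -diff;    /* |test - expected| */
    if (scale < 0.0) scale = -scale;   /* |expected| */
    if (scale == 0.0)
        return diff <= rel_tol;        /* absolute comparison near zero */
    return diff <= rel_tol * scale;
}
```

A relative check suits quantities such as energy, pressure and volume, which differ by orders of magnitude. Stress components that are expected to be near zero need the absolute fallback, or a separate absolute tolerance.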

CUDA Profiler
Columns: Tot, Num, Avg, %. Methods, in the order listed:
- A_kernel
- gemm
- double_
- crrexp_mul_wave_k
- aug_charge_trace_k
- mul_vec_k
- charge_trace_k
- racc0_combine_k
- calc_dllmm_k
- apply_gfac_der_k
- apply_gfac_k
- eccp_nl_fock
- memcpy
- rpro1_combine_k
- split_complex_k
- else

CUDA Profiler
Columns: Tot, Num, Avg, %. Methods, in the order listed:
- memcpy
- A_kernel
- B_kernel
- memset32
- else
- gemm
- crrexp_mul_wave_k
- racc0_combine_k
- charge_trace_k
- aug_charge_trace_k
- apply_gfac_der_k
- apply_gfac_k
- eccp_nl_fock
- double_
- mul_vec_k
- rpro1_combine_k
- split_complex_k: 0.0 0 0.0

Persistent pointers
    /* void pointer */
    typedef struct void_p {
        unsigned int size;
        void *ptr;
    } void_p;

    /* double pointer */
    typedef struct double_p {
        unsigned int size;
        double *ptr;
    } double_p;

Persistent pointers
    /* Assign a chunk of GPU mem to a chunk of CPU mem */
    static inline void assign_cu(
        void_p *dest,       //!< destination
        void *src,          //!< source
        unsigned int size   //!< size
    ){
        /* Do we need to resize? */
        if (dest->ptr == NULL || dest->size < size) {
            if (dest->ptr != NULL) cudaFree(dest->ptr);
            cudaMalloc((void **)&dest->ptr, size);
            dest->size = size;
        }
        /* Do the actual copy */
        cudaMemcpy(dest->ptr, src, size, cudaMemcpyHostToDevice);
    }
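
The grow-only reuse logic in this routine can be tried without a GPU using the host-only analogue below. It is a sketch under stated assumptions: `host_buf`, `assign_host` and the `allocs` counter are hypothetical names introduced here, and `malloc`, `free` and `memcpy` stand in for `cudaMalloc`, `cudaFree` and `cudaMemcpy`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical host-side stand-in for the persistent device pointer. */
typedef struct {
    size_t   size;    /* current capacity in bytes */
    void    *ptr;     /* backing allocation, or NULL if none yet */
    unsigned allocs;  /* number of allocations made (illustration only) */
} host_buf;

/* Copy `size` bytes from src into dest, reallocating only when the
 * existing allocation is missing or too small. Returns 0 on success,
 * -1 if the allocation fails. */
int assign_host(host_buf *dest, const void *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    if (size == 0)
        return 0;                      /* nothing to copy */
    if (dest->ptr == NULL || dest->size < size) {
        free(dest->ptr);               /* free(NULL) is a no-op */
        dest->ptr = malloc(size);
        if (dest->ptr == NULL) {
            dest->size = 0;
            return -1;
        }
        dest->size = size;
        dest->allocs++;
    }
    memcpy(dest->ptr, src, size);
    return 0;
}
```

Starting from a zero-initialized `host_buf`, assigning 32 bytes, then 16, then 64 allocates twice: the second call reuses the existing buffer. Avoiding repeated allocation in this way is the benefit of keeping the pointer persistent across calls.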

Structs
Array of structs:
    typedef struct four_vector {
        int t;
        int x;
        int y;
        int z;
    } four_vector;
    four_vector events[N];
Improves locality for elemental functions. Mechanism is deep memory caches.

Struct of arrays:
    typedef struct four_vectors {
        int t[N];
        int x[N];
        int y[N];
        int z[N];
    } four_vectors;
    four_vectors events;
Improves memory bandwidth for vector functions. Mechanism is wide memory bus.
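
The two layouts can be compared side by side with the small host-only sketch below. All names (`event_aos`, `events_soa`, `N_EVENTS`, `sum_t_aos`, `sum_t_soa`, `aos_to_soa`) are hypothetical and chosen for illustration; the code only shows the access patterns and does not measure performance.

```c
#include <assert.h>
#include <stddef.h>

#define N_EVENTS 4

/* Array-of-structs layout: one record per event. */
typedef struct { int t, x, y, z; } event_aos;

/* Struct-of-arrays layout: one contiguous array per component. */
typedef struct {
    int t[N_EVENTS], x[N_EVENTS], y[N_EVENTS], z[N_EVENTS];
} events_soa;

/* Sum the t component; consecutive reads are sizeof(event_aos) apart. */
long sum_t_aos(const event_aos *ev, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += ev[i].t;
    return s;
}

/* Sum the t component; consecutive reads are adjacent in memory. */
long sum_t_soa(const events_soa *ev, size_t n)
{
    assert(n <= N_EVENTS);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += ev->t[i];
    return s;
}

/* Repack n events from the AoS layout into the SoA layout. */
void aos_to_soa(const event_aos *in, events_soa *out, size_t n)
{
    assert(n <= N_EVENTS);
    for (size_t i = 0; i < n; i++) {
        out->t[i] = in[i].t;
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

In the struct-of-arrays form, neighbouring threads or vector lanes that each read one `t` value touch adjacent memory addresses. On a GPU this generally lets the hardware coalesce the loads into wide transactions.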

Intercepts vs Overhauls
- Intercepts and overhauls have the same theoretical peak performance.
  - Maximal intercept is 2 codes
- One is usually easier than the other.
- Difficulty of intercepts is governed by
  - Loop position: must intercept above fine-grain loops
  - Data structures: must pass data and context to GPU
- Difficulty of overhauls is governed by
  - Size, complexity of auxiliary code
  - State of the original code
- Overhaul has side-benefits. Intercepts have side-costs.

Non-HF Port
- Port will use the same scheme as the HF port
  - Climbing up many of the non-HF versions of CPU routines
  - Trying to get all the way up to the minimization routine (e.g. RMM-DIIS)
- You can expect performance approaching HF performance
  - Less parallelism for systems of the same size
  - More rapid iteration
  - Mitigated by larger quantum systems
- Our goal is a beta by sometime this summer

Merge with VASP Core
- Our code is generally available to VASP license holders
  - Must request access through Vienna
  - Distribution through our website and git repo
- This scheme is inadequate (doesn't scale).
- We hope to put the ports in VASP 5.3, which will have some other architectural changes.

Wish List
- Users, to do science
  - It's all about science
  - Find the kinks in our implementation
- Input, to direct effort and validate results
  - Scientifically relevant systems
  - Requests for functionality
- Effort, to write the ports
  - Current VASP users with time to contribute
  - VASP is a large code

Conclusions
- We've ported HF functionality in VASP to CUDA.
  - Up to 20x performance over a single core
  - Up to 64-core performance compared to supercomputers
- The callgraph-climbing port method is effective
  - Accelerate specific functionality of large codes
  - Can inform future decisions about dedicated ports
- Accelerating scientific codes enables new science.

Thank you
Questions?
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationAlgorithms of Scientific Computing
Algorithms of Scientific Computing Fast Fourier Transform (FFT) Michael Bader Technical University of Munich Summer 2018 The Pair DFT/IDFT as Matrix-Vector Product DFT and IDFT may be computed in the form
More informationAdrian Tate XK6 / openacc workshop Manno, Mar
Adrian Tate XK6 / openacc workshop Manno, Mar6-7 2012 1 Overview & Philosophy Two modes of usage Contents Present contents Upcoming releases Optimization of libsci_acc Autotuning Adaptation Asynchronous
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationMemory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018
Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationGradient Free Design of Microfluidic Structures on a GPU Cluster
Gradient Free Design of Microfluidic Structures on a GPU Cluster Austen Duffy - Florida State University SIAM Conference on Computational Science and Engineering March 2, 2011 Acknowledgements This work
More informationWhy C? Because we can t in good conscience espouse Fortran.
C Tutorial Why C? Because we can t in good conscience espouse Fortran. C Hello World Code: Output: C For Loop Code: Output: C Functions Code: Output: Unlike Fortran, there is no distinction in C between
More informationThe VASP Scripter AddOn
The VASP Scripter AddOn Tutorial Version 11.8.1 The VASP Scripter AddOn: Tutorial Version 11.8.1 Copyright 2008 2011 QuantumWise A/S Atomistix ToolKit Copyright Notice All rights reserved. This publication
More informationTimers 1 / 46. Jiffies. Potent and Evil Magic
Timers 1 / 46 Jiffies Each timer tick, a variable called jiffies is incremented It is thus (roughly) the number of HZ since system boot A 32-bit counter incremented at 1000 Hz wraps around in about 50
More informationPerformance Analysis and Optimization of Gyrokinetic Torodial Code on TH-1A Supercomputer
Performance Analysis and Optimization of Gyrokinetic Torodial Code on TH-1A Supercomputer Xiaoqian Zhu 1,2, Xin Liu 1, Xiangfei Meng 2, Jinghua Feng 2 1 School of Computer, National University of Defense
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationStarting a Data Analysis
03/20/07 PHY310: Statistical Data Analysis 1 PHY310: Lecture 17 Starting a Data Analysis Road Map Your Analysis Log Exploring the Data Reading the input file (and making sure it's right) Taking a first
More informationAchieve Better Performance with PEAK on XSEDE Resources
Achieve Better Performance with PEAK on XSEDE Resources Haihang You, Bilel Hadri, Shirley Moore XSEDE 12 July 18 th 2012 Motivations FACTS ALTD ( Automatic Tracking Library Database ) ref Fahey, Jones,
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationFrom Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)
From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133) Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre) Overview Complex
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationWHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016
WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid
More informationMemory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017
Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing
More informationGAMER : a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics GPU 與自適性網格於天文模擬之應用與效能
GAMER : a GPU-accelerated Adaptive-MEsh-Refinement Code for Astrophysics GPU 與自適性網格於天文模擬之應用與效能 Hsi-Yu Schive ( 薛熙于 ), Tzihong Chiueh ( 闕志鴻 ), Yu-Chih Tsai ( 蔡御之 ), Ui-Han Zhang ( 張瑋瀚 ) Graduate Institute
More informationCS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationGreat Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties
Overview Course Overview Course theme Five realities Computer Systems 1 2 Course Theme: Abstraction Is Good But Don t Forget Reality Most CS courses emphasize abstraction Abstract data types Asymptotic
More informationMODELING CUDA COMPUTE APPLICATIONS BY CRITICAL PATH. PATRIC ZHAO, JIRI KRAUS, SKY WU
MODELING CUDA COMPUTE APPLICATIONS BY CRITICAL PATH PATRIC ZHAO, JIRI KRAUS, SKY WU patricz@nvidia.com AGENDA Background Collect data and Visualizations Critical Path Performance analysis and prediction
More informationLecture 2: Introduction to OpenMP with application to a simple PDE solver
Lecture 2: Introduction to OpenMP with application to a simple PDE solver Mike Giles Mathematical Institute Mike Giles Lecture 2: Introduction to OpenMP 1 / 24 Hardware and software Hardware: a processor
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationTime-dependent density-functional theory with massively parallel computers. Jussi Enkovaara CSC IT Center for Science, Finland
Time-dependent density-functional theory with massively parallel computers Jussi Enkovaara CSC IT Center for Science, Finland Outline Overview of the GPAW software package Parallelization for time-dependent
More information