VASP Accelerated with GPUs

1 VASP Accelerated with GPUs: Capabilities, Methods, and Road-Map
Max Hutchinson
University of Chicago; Carnegie Mellon University
GTC, May 17th, 2012
Max Hutchinson (UChicago and CMU), GPU VASP, GTC 5/17/12

2 Acknowledgements
The rest of our team:
- Michael Widom
- James Komianos
The real VASP team:
- Georg Kresse
- Martijn Marsman
- Jürgen Hafner
This work was supported by the PETTT project PP-CCM-KY P3.
This research was supported in part by the National Science Foundation through TeraGrid resources provided by Pittsburgh Supercomputing Center.

4 References
M. Hutchinson, M. Widom, "VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron", Computer Physics Communications, Volume 183, Issue 7, July 2012.

5 Table of Contents
1 Context
    Motivating Science
    DFT and VASP
2 Capabilities and Performance
    Low-Level Ports
    High-Level Ports
    System Capabilities, Requirements
3 Design Decisions and Methods
    Guiding Principles
    Development Cycle
    Examples
    Tips
4 Road-Map
    Our plans
    Your part

6 Section 1: Context

7 Motivating Science
Quantum Chemistry; Hard Condensed Matter
- The modern model for atomic physics has non-classical elements:
    - Electron correlation, exchange energy
    - Discretization of energy, angular momentum
- Practical understanding of some materials requires quantum models:
    - Nano-scale electronics
    - Surface effects
    - High-resolution spectroscopy
    - Low-temperature structure

8 Scientific Perspective
- Start by approximating the n-body quantum system with the single-particle Kohn-Sham equation.
- Density functional theory (DFT) approximates correlation and exchange energies as functionals of the electron density.
- Functionals form a ladder of increasing accuracy and computational cost.
- Eigenvalue solvers are then used to find the wave-functions.

9 One example: Boron
The low-temperature structure of elemental boron is not known.
[Table: rows LDA, PBE, PKZB, HF; columns E_βα and E_β′α. Numeric values were not preserved in this transcription.]
Table: Structural energies (units meV/atom). Here β refers to the ideal hR105 structure, and β′ refers to the 107-atom optimized variant of B.hR141. Energies of α are obtained from the supercell hR12x8. All values are given for the 3x3x3 k-point mesh.

10 Computational Perspective
- DFT is nominally O(n^2 ln n) or O(n^3), depending on system size.
- Exact-exchange is more expensive: O(n^3 ln n) or O(n^4).
- Operations have high fine-grain data parallelism:
    - BLAS
    - FFT
    - Scatter-Gather
- Iterations are long (on the order of a second).
- It all adds up to a great GPU candidate.

11 Section 2: Capabilities and Performance

12 FFT Port
- FFTs contribute 30-50% of CPU time.
- FFT calls are funneled through kernels (4 of them).
    - Previously used to switch between FFTW and custom FFTs
- A simple copy, compute, copy-back scheme is used.
[Table: columns Cores, CPU, + 1 GPU, Ratio. Numeric values were not preserved in this transcription.]
Table: PdO benchmark (87 ions, 496 bands, 822 electrons) on Dirac (NERSC).

13 BLAS Port
- BLAS calls contribute 15-40% of CPU time.
- BLAS calls are made inline, but there aren't too many important ones.
- Again, a simple copy, compute, copy-back scheme was used.
- Performance was poor (20% worse), so this was abandoned early on.
- Advances in CUBLAS might make this profitable.

14 Exact-Exchange (HF) Port
- Hybrid functionals, or exact-exchange, are very intensive:
    - > 98% of runtime
    - Factor of 2 in memory use
- Includes interaction between bands:
    - Adds a linear order to the previous complexities
- The VASP implementation is somewhat compartmentalized:
    - Calls funnel through two routines
    - Once per k-point per iteration

15 HF Port Performance: Workstation vs Workstation
Structure   hR12     hR12x8   hR105
Speedup     5.82x    12.39x   20.41x
[Rows FOCK ACC (s), FOCK FORCE (s), Other (s), and Overall (hr), each with cpu and gpu columns per structure: numeric values were not preserved in this transcription.]
Table: Run-times of components of VASP exact-exchange runs. Overall times are projected assuming a total of 5 ionic minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single-device.

17 HF Port Performance: Workstation vs Supercomputer
[Table: columns Struct., k, T-1C1G, T-2C2G, B-16C, B-32C, B-64C, B-128C. Row labels and numeric values were not preserved in this transcription.]
Table: Actual run-times of truncated runs (reduced NELM and NSW) of different structures on different platforms. T is tirith and B is blacklight; the attribute mCnG indicates m CPU cores and n GPU devices.

18 Other Capabilities
- Compute capability 2.0 or higher
- Arbitrary CPU:GPU ratios
    - Round-robin
    - Uses file I/O (I'm sorry)
- Mixed or full double precision
    - FFTs in single or double
    - Everything else in double
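The slide says only that CPU processes are spread over GPUs round-robin. A minimal sketch of one way to do that is below, assuming processes are identified by a zero-based rank; the helper name `device_for_rank` and the modulo rule are illustrative assumptions, not taken from the port itself.

```c
#include <assert.h>

/* Hypothetical helper: map a zero-based process rank to a GPU index.
 * When there are more ranks than devices, ranks take devices in turn,
 * so several CPU processes end up sharing each GPU. */
static int device_for_rank(int rank, int num_devices)
{
    assert(rank >= 0 && num_devices > 0);  /* precondition for the modulo */
    return rank % num_devices;
}
```

In a CUDA build the returned index would typically be passed to `cudaSetDevice` once per process, before any other device work.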

19 Section 3: Design Decisions and Methods

20 Guiding Principles
1 Performance: ultimately, this is our primary concern
    - Intercept high in the call tree
    - Write/use good kernels
2 Programmability: programmer time is a limited quantity
    - Be maximally compartmental, minimally intrusive
    - Don't get too clever
3 Portability: why write something that can't be used?
    - Use standard languages (FORTRAN, C[, Python])
    - Use standard libraries (CUBLAS, CUFFT)
    - Don't add system assumptions

21 Development Cycle
[Diagram: development cycle. Labels, in the order they appear: CPU, Profile, Optimize, Translate, Profile, Validate, Optimize, GPU, Debug.]

22 Incremental Ports
Our technique has been to climb up callgraphs.
Pros:
- Important work is done first
- Debugging is [more] palatable
- Provides rough numerical validation
Cons:
- Divergent efforts can require merges
- Inherit high-level structure from CPU code
Perturbation method.

28 Intercepts
    #ifdef CUDA
        /* Assumptions */
        USE_CUDA = ( condition1 && condition2 && ... );
        if ( USE_CUDA ) {
            fun_cu(foo, bar);   // intercept (not a kernel)
        } else {
    #endif
        /* Function to be intercepted */
        fun(foo, bar);
    #ifdef CUDA
        }
    #endif
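To make the intercept pattern concrete, here is a small host-only sketch in C. It assumes a toy routine `axpy` standing in for the intercepted function, a size threshold standing in for the slide's unspecified conditions, and `axpy_cu` as a stand-in for the GPU path; all of these names and the threshold are hypothetical. The stand-in simply calls the CPU code so the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>

#define CUDA 1  /* normally supplied by the build system, e.g. -DCUDA */

/* Original CPU routine: y[i] += a * x[i]. */
static void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

static int gpu_calls = 0;  /* counts how often the intercept was taken */

/* Hypothetical stand-in for the GPU path. A real version would copy
 * x and y to the device, launch kernels, and copy y back. */
static void axpy_cu(size_t n, double a, const double *x, double *y)
{
    ++gpu_calls;
    axpy(n, a, x, y);
}

/* Call site with the intercept wrapped around the original call. */
static void axpy_dispatch(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
#ifdef CUDA
    /* Assumed condition: only offload when there is enough work. */
    int use_cuda = (n >= 1024);
    if (use_cuda) {
        axpy_cu(n, a, x, y);   /* intercept (not a kernel) */
    } else {
#endif
        axpy(n, a, x, y);      /* function being intercepted */
#ifdef CUDA
    }
#endif
}
```

Without `CUDA` defined, the preprocessor leaves only the original call, so the CPU build is unchanged by the intercept.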

30 Validation
(Mantissas and timing values were not preserved in this transcription.)
    ./vasp_test.py -e ../exes/vasp-pgk -t PdO-v/ -n 1
    ======================================================
    Test Name: PdO-v/
    Run on:
    In: ./tests/3F0T
    Result   Parameter        Test vs Expected
    passed   energy           e+02 vs e+02
    passed   ext. pressure    e+02 vs e+02
    passed   volume           e+03 vs e+03
    passed   stress (xx)      e+02 vs e+02
    passed   stress (yy)      e+02 vs e+02
    passed   stress (zz)      e+02 vs e+02
    passed   stress (xy)      e+00 vs e+00
    passed   stress (yz)      e+00 vs e+00
    passed   stress (zx)      e+00 vs e
    x        loop time        vs
    0.95x    setdij time      vs
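The test output above compares each computed quantity against an expected value. A minimal sketch of such a check follows, assuming a relative-tolerance rule; the rule, the tolerance, and the names `values_match`, `check`, and `count_failures` are illustrative assumptions and are not taken from `vasp_test.py`.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Returns 1 when test and expected agree to within rel_tol, measured
 * relative to the larger magnitude of the two; 0 otherwise. */
static int values_match(double test, double expected, double rel_tol)
{
    assert(rel_tol >= 0.0);
    double a = abs_d(test), b = abs_d(expected);
    double scale = (a > b) ? a : b;
    return abs_d(test - expected) <= rel_tol * scale;
}

/* One row of a validation report (hypothetical layout). */
typedef struct {
    const char *name;
    double test;
    double expected;
} check;

/* Counts the rows whose values disagree beyond the tolerance. */
static int count_failures(const check *checks, int n, double rel_tol)
{
    int failures = 0;
    for (int i = 0; i < n; ++i)
        if (!values_match(checks[i].test, checks[i].expected, rel_tol))
            ++failures;
    return failures;
}
```

A relative rule is used here because the reported quantities span several orders of magnitude (e+00 to e+03), where a single absolute threshold would be either too loose or too strict.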

32 CUDA Profiler
Columns: Tot, Num, Avg, % (numeric values were not preserved in this transcription). Entries, in the order listed:
    "method=":
    A_kernel
    gemm
    double_
    crrexp_mul_wave_k
    aug_charge_trace_k
    mul_vec_k
    charge_trace_k
    racc0_combine_k
    calc_dllmm_k
    apply_gfac_der_k
    apply_gfac_k
    eccp_nl_fock
    memcpy
    rpro1_combine_k
    split_complex_k
    else

33 CUDA Profiler
Columns: Tot, Num, Avg, % (numeric values were not preserved in this transcription, apart from the fragment noted below). Entries, in the order listed:
    "method=":
    memcpy
    A_kernel
    B_kernel
    memset32
    else
    gemm
    crrexp_mul_wave_k
    racc0_combine_k
    charge_trace_k
    aug_charge_trace_k
    apply_gfac_der_k
    apply_gfac_k
    eccp_nl_fock
    double_
    mul_vec_k
    rpro1_combine_k
    split_complex_k (surviving values: 0.0, 0, 0.0)

35 Persistent pointers
    /* void pointer */
    typedef struct void_p {
        unsigned int size;
        void *ptr;
    } void_p;

    /* double pointer */
    typedef struct double_p {
        unsigned int size;
        double *ptr;
    } double_p;

36 Persistent pointers
    /* Assign a chunk of GPU mem to a chunk of CPU mem */
    static inline void assign_cu(
        void_p *dest,       //!< destination
        void *src,          //!< source
        unsigned int size   //!< size (in bytes)
    ){
        /* Do we need to resize? */
        if (dest->ptr == NULL || dest->size < size) {
            if (dest->ptr != NULL) cudaFree(dest->ptr);
            cudaMalloc((void **)&dest->ptr, size);
            dest->size = size;
        }
        /* Do the actual copy */
        cudaMemcpy(dest->ptr, src, size, cudaMemcpyHostToDevice);
    }
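The point of the persistent pointer is that the buffer is reallocated only when it has to grow, so repeated calls reuse one allocation. The host-only sketch below shows that behaviour with `malloc`, `free`, and `memcpy` standing in for the CUDA calls so it runs without a GPU. The names `buf_p`, `assign_host`, and `alloc_count` are hypothetical, and the error return is an addition that the slide's version does not have.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-side analogue of the persistent pointer struct. */
typedef struct {
    size_t size;  /* current capacity in bytes */
    void  *ptr;   /* NULL until the first assignment */
} buf_p;

static int alloc_count = 0;  /* number of (re)allocations performed */

/* Copies size bytes from src into the persistent buffer, growing the
 * buffer only when its capacity is too small.
 * Returns 0 on success and -1 if the allocation fails. */
static int assign_host(buf_p *dest, const void *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    /* Do we need to resize? */
    if (dest->ptr == NULL || dest->size < size) {
        free(dest->ptr);            /* free(NULL) is a no-op */
        dest->ptr = malloc(size);
        if (dest->ptr == NULL) {
            dest->size = 0;
            return -1;
        }
        dest->size = size;
        ++alloc_count;
    }
    /* Do the actual copy */
    memcpy(dest->ptr, src, size);
    return 0;
}
```

Because the capacity never shrinks, a buffer that has seen its largest input once incurs no further allocation cost, which matters when the routine is called once per k-point per iteration.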

37 Structs
    struct four_vector {
        int t;
        int x;
        int y;
        int z;
    };
    struct four_vector events[N];
Improves locality for elemental functions. The mechanism is deep memory caches.
    struct four_vectors {
        int t[N];
        int x[N];
        int y[N];
        int z[N];
    };
    struct four_vectors events;
Improves memory bandwidth for vector functions. The mechanism is a wide memory bus.
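The two layouts can be compared side by side in a small sketch. It assumes a fixed event count `NEVENTS`, an "elemental" function that reads every field of one event, and a "vector" function that reads one field of every event; the type and function names and the choice of the interval formula are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NEVENTS 4

/* Array of structs: the fields of one event sit next to each other. */
struct event { int t, x, y, z; };

/* Struct of arrays: one field of all events sits contiguously. */
struct events_soa { int t[NEVENTS], x[NEVENTS], y[NEVENTS], z[NEVENTS]; };

/* Elemental function: uses all four fields of a single event, which the
 * array-of-structs layout keeps together in memory. */
static int interval2(const struct event *e)
{
    return e->t * e->t - e->x * e->x - e->y * e->y - e->z * e->z;
}

/* Vector function: walks one field across all events, which the
 * struct-of-arrays layout turns into a unit-stride read. */
static int sum_t(const struct events_soa *e)
{
    int s = 0;
    for (int i = 0; i < NEVENTS; ++i)
        s += e->t[i];
    return s;
}

/* Repack from array-of-structs into struct-of-arrays. */
static void aos_to_soa(const struct event *in, struct events_soa *out)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < NEVENTS; ++i) {
        out->t[i] = in[i].t;
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

On a GPU, the struct-of-arrays form lets consecutive threads read consecutive addresses, which is the access pattern a wide memory bus rewards.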

38 Intercepts vs Overhauls
- Intercepts and overhauls have the same theoretical peak performance.
    - Maximal intercept is 2 codes
- One is usually easier than the other.
- Difficulty of intercepts is governed by:
    - Loop position: must intercept above fine-grain loops
    - Data structures: must pass data and context to GPU
- Difficulty of overhauls is governed by:
    - Size, complexity of auxiliary code
    - State of the original code
- Overhauls have side-benefits. Intercepts have side-costs.

39 Section 4: Road-Map

40 Non-HF Port
- The port will use the same scheme as the HF port:
    - Climbing up many of the non-HF versions of CPU routines
    - Trying to get all the way up to the minimization routine (e.g. RMM-DIIS)
- You can expect performance approaching HF performance:
    - Less parallelism for systems of the same size
    - More rapid iteration
    - Mitigated by larger quantum systems
- Our goal is a beta by sometime this summer.

41 Merge with VASP Core
- Our code is generally available to VASP license holders:
    - Must request access through Vienna
    - Distribution through our website and git repo
- This scheme is inadequate (it doesn't scale).
- We hope to put the ports in VASP 5.3, which will have some other architectural changes.

42 Wish List
- Users, to do science
    - It's all about science
    - Find the kinks in our implementation
- Input, to direct effort and validate results
    - Scientifically relevant systems
    - Requests for functionality
- Effort, to write the ports
    - Current VASP users with time to contribute
    - VASP is a large code

43 Conclusions
- We've ported HF functionality in VASP to CUDA.
    - Up to 20x performance over a single core
    - Up to 64-core performance compared to supercomputers
- The callgraph-climbing port method is effective.
    - Accelerates specific functionality of large codes
    - Can inform future decisions about dedicated ports
- Accelerating scientific codes enables new science.

44 Thank you
Questions?

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC Stefan Maintz, Dr. Markus Wetzstein smaintz@nvidia.com; mwetzstein@nvidia.com Companies Academia VASP USERS AND USAGE 12-25% of CPU cycles @ supercomputing

More information

An Innovative Massively Parallelized Molecular Dynamic Software

An Innovative Massively Parallelized Molecular Dynamic Software Renewable energies Eco-friendly production Innovative transport Eco-efficient processes Sustainable resources An Innovative Massively Parallelized Molecular Dynamic Software Mohamed Hacene, Ani Anciaux,

More information

Quantum ESPRESSO on GPU accelerated systems

Quantum ESPRESSO on GPU accelerated systems Quantum ESPRESSO on GPU accelerated systems Massimiliano Fatica, Everett Phillips, Josh Romero - NVIDIA Filippo Spiga - University of Cambridge/ARM (UK) MaX International Conference, Trieste, Italy, January

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications TESLA P PERFORMANCE GUIDE HPC and Deep Learning Applications MAY 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

Porting CASTEP to GPGPUs. Adrian Jackson, Toni Collis, EPCC, University of Edinburgh Graeme Ackland University of Edinburgh

Porting CASTEP to GPGPUs. Adrian Jackson, Toni Collis, EPCC, University of Edinburgh Graeme Ackland University of Edinburgh Porting CASTEP to GPGPUs Adrian Jackson, Toni Collis, EPCC, University of Edinburgh Graeme Ackland University of Edinburgh CASTEP Density Functional Theory Plane-wave basis set with pseudo potentials Heavy

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific

More information

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further

More information

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department Approaches to acceleration: GPUs vs Intel MIC Fabio AFFINITO SCAI department Single core Multi core Many core GPU Intel MIC 61 cores 512bit-SIMD units from http://www.karlrupp.net/ from http://www.karlrupp.net/

More information

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017 CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance

More information

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method Josh Romero, Massimiliano Fatica - NVIDIA Vamsi Spandan, Roberto Verzicco -

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Introduction to C omputational F luid Dynamics. D. Murrin

Introduction to C omputational F luid Dynamics. D. Murrin Introduction to C omputational F luid Dynamics D. Murrin Computational fluid dynamics (CFD) is the science of predicting fluid flow, heat transfer, mass transfer, chemical reactions, and related phenomena

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications TESLA P PERFORMANCE GUIDE Deep Learning and HPC Applications SEPTEMBER 217 TESLA P PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Introduction to Computational Fluid Dynamics Mech 122 D. Fabris, K. Lynch, D. Rich

Introduction to Computational Fluid Dynamics Mech 122 D. Fabris, K. Lynch, D. Rich Introduction to Computational Fluid Dynamics Mech 122 D. Fabris, K. Lynch, D. Rich 1 Computational Fluid dynamics Computational fluid dynamics (CFD) is the analysis of systems involving fluid flow, heat

More information

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015 PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries

More information

Efficient use of hybrid computing clusters for nanosciences

Efficient use of hybrid computing clusters for nanosciences International Conference on Parallel Computing ÉCOLE NORMALE SUPÉRIEURE LYON Efficient use of hybrid computing clusters for nanosciences Luigi Genovese CEA, ESRF, BULL, LIG 16 Octobre 2008 with Matthieu

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

GTC 2017 S7672. OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open CFD Solver

GTC 2017 S7672. OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open CFD Solver David Gutzwiller, NUMECA USA (david.gutzwiller@numeca.com) Dr. Ravi Srinivasan, Dresser-Rand Alain Demeulenaere, NUMECA USA 5/9/2017 GTC 2017 S7672 OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Improving the Performance of the Molecular Similarity in Quantum Chemistry Fits. Alexander M. Cappiello

Improving the Performance of the Molecular Similarity in Quantum Chemistry Fits. Alexander M. Cappiello Improving the Performance of the Molecular Similarity in Quantum Chemistry Fits Alexander M. Cappiello Department of Chemistry Carnegie Mellon University Pittsburgh, PA 15213 December 17, 2012 Abstract

More information

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster YALES2: Semi-industrial code for turbulent combustion and flows Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application

More information

04. CUDA Data Transfer

04. CUDA Data Transfer 04. CUDA Data Transfer Fall Semester, 2015 COMP427 Parallel Programming School of Computer Sci. and Eng. Kyungpook National University 2013-5 N Baek 1 CUDA Compute Unified Device Architecture General purpose

More information

CS 179: Lecture 10. Introduction to cublas

CS 179: Lecture 10. Introduction to cublas CS 179: Lecture 10 Introduction to cublas Table of contents, you are here. Welcome to week 4, this is new material from here on out so please ask questions and help the TAs to improve the lectures and

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

PART I - Fundamentals of Parallel Computing

PART I - Fundamentals of Parallel Computing PART I - Fundamentals of Parallel Computing Objectives What is scientific computing? The need for more computing power The need for parallel computing and parallel programs 1 What is scientific computing?

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

GPGPU Lessons Learned. Mark Harris

GPGPU Lessons Learned. Mark Harris GPGPU Lessons Learned Mark Harris General-Purpose Computation on GPUs Highly parallel applications Physically-based simulation image processing scientific computing computer vision computational finance

More information

Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud

Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud Amazon Web Services: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud Summarized by: Michael Riera 9/17/2011 University of Central Florida CDA5532 Agenda

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA
