Sampling Using GPU Accelerated Sparse Hierarchical Models


1 Sampling Using GPU Accelerated Sparse Hierarchical Models
Miroslav Stoyanov, Oak Ridge National Laboratory
Supported by the Exascale Computing Project (ECP), exascaleproject.org
April 9, 28

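2 Sparse polynomial model
Sparse grid, without the grid

Given a sequence of 1-D basis functions $\varphi_i(x) : [, ] \to \mathbb{R}$ for $i = 0, 1, \dots$ (and maybe associated interpolation nodes $\xi_i$), consider all possible $d$-dimensional tensors:

$$\varphi_{\mathbf{i}}(\mathbf{x}) = \prod_{k=1}^{d} \varphi_{i_k}(x_k), \qquad \varphi_{\mathbf{i}}(\mathbf{x}) : \bigotimes_{k=1}^{d} [, ] \to \mathbb{R}$$

A sparse polynomial basis is defined by a finite multi-index set

$$\Lambda \subset \mathbb{N}^d = \{ (i_1, i_2, \dots, i_d) : i_k \in \mathbb{N} \}$$

A sparse model is any basis combined with a set of coefficients $C = \{ c_{\mathbf{i}} \}_{\mathbf{i} \in \Lambda} \subset \mathbb{R}^m$:

$$G_{\Lambda,C}(\mathbf{x}) = \sum_{\mathbf{i} \in \Lambda} c_{\mathbf{i}} \varphi_{\mathbf{i}}(\mathbf{x})$$

The choice of $\Lambda$ and $C$ is usually made so that

$$G_{\Lambda,C}(\mathbf{x}) \approx f(\mathbf{x}), \qquad f(\mathbf{x}) : \bigotimes_{k=1}^{d} [, ] \to \mathbb{R}^m$$

For example, any sparse grid constructed from a nested set of nodes and basis functions follows this framework.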
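The tensor-product basis and the sparse model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Tasmanian code: the monomial 1-D basis, the total-degree index set, and the helper names (`phi_1d`, `phi`, `total_degree_set`, `evaluate_model`) are all hypothetical choices made to keep the example short.

```python
from itertools import product
from math import prod

def phi_1d(i, x):
    """Hypothetical 1-D basis: the monomial x**i (an assumption for illustration)."""
    return x ** i

def phi(index, x):
    """Tensor-product basis: product over dimensions of phi_1d(i_k, x_k)."""
    return prod(phi_1d(i_k, x_k) for i_k, x_k in zip(index, x))

def total_degree_set(d, level):
    """One possible finite multi-index set: all i in N^d with sum(i) <= level."""
    return [i for i in product(range(level + 1), repeat=d) if sum(i) <= level]

def evaluate_model(indexes, coeffs, x):
    """G(x) = sum over i in Lambda of c_i * phi_i(x); each c_i has m entries."""
    m = len(next(iter(coeffs.values())))
    out = [0.0] * m
    for i in indexes:
        b = phi(i, x)
        for r in range(m):
            out[r] += coeffs[i][r] * b
    return out

indexes = total_degree_set(d=2, level=2)
# Coefficients with m = 2 outputs; only a few are non-zero here.
coeffs = {i: [0.0, 0.0] for i in indexes}
coeffs[(0, 0)] = [1.0, 0.0]
coeffs[(1, 1)] = [2.0, 1.0]
coeffs[(2, 0)] = [0.0, 3.0]
value = evaluate_model(indexes, coeffs, (0.5, 0.25))
```

Swapping `phi_1d` for another 1-D family or `total_degree_set` for another index set changes the model without touching `evaluate_model`, which is the point of separating the basis from the multi-index set.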

3 Sparse polynomial basis
[Figure: sparse polynomial basis functions with their nodes labeled.]

4 Piece-wise constant hierarchy
[Figure: piece-wise constant hierarchy with its nodes labeled.]

5 Motivation: Exascale Models of Stellar Explosions
Fast surrogate models

Simulation of neutrino radiation in core-collapse supernovae:

$$u_t + u = \int_E u(E', \cdot)\, R(E, E', T, \eta)\, n(E')\, dE'$$

- The integral has to be evaluated for each cell, and the collision kernel $R(E, E')$ is computed from a separate expensive model.
- Standard practice is to approximate the kernel model using dense grids.
- Sparse grid surrogates dramatically reduce the construction cost and the memory footprint, but evaluations are more expensive than a simple table lookup.
- Consider O(, ) evaluations per time step, per discretization cell.

6 Motivation: Tomography in 4D
Sparse data in high dimensions

- There are many advanced image processing techniques for 2D and 3D data.
- The Spallation Neutron Source at ORNL produces data on the order of 2GB.
- The data lives in 4D space, which means the information is very sparse.
- But the data has a lot of structure.

7 Motivation: Bayesian inference
Large number of outputs

Bayesian inference (up to a normalizing constant):

$$p(x \mid D) = L(u(x), D)\, p(x)$$

- $u(x)$ is some model,
- $D$ is the observation data,
- $p(x)$ is the prior, the initial guess regarding $x$,
- $L(u(x), D)$ is the likelihood, the probability of the difference between $u(x)$ and $D$,
- $p(x \mid D)$ is the posterior, the informed distribution of $x$.

Analyzing $p(x \mid D)$ is challenging:
- a Markov chain Monte Carlo method is needed,
- high dimensionality, low acceptance rate, irregular structure of $p(x \mid D)$.

Brute-force solution: DiffeRential Evolution Adaptive Metropolis (DREAM)
- run multiple chains, with a huge number of batches of samples.

The model $u(x)$ can have a large number of outputs.
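To make the sampling cost concrete, here is a minimal single-chain random-walk Metropolis sketch. It is deliberately much simpler than DREAM (one chain, no differential evolution), and everything specific in it is an assumption for illustration: the Gaussian likelihood with width `sigma`, the uniform prior on [0, 2], the two-output toy `model`, and the helper names.

```python
import math
import random

def make_log_posterior(model, data, lower, upper, sigma):
    """Return log p(x | D) up to an additive constant.

    Assumes a Gaussian likelihood with standard deviation sigma and a uniform
    prior on [lower, upper]; both are illustrative choices.
    """
    def log_posterior(x):
        if not lower <= x <= upper:
            return -math.inf                      # zero prior probability
        misfit = sum((u - d) ** 2 for u, d in zip(model(x), data))
        return -0.5 * misfit / sigma ** 2
    return log_posterior

def metropolis(log_post, x0, steps, step_size, rng):
    """Single-chain random-walk Metropolis; a simplification, not DREAM."""
    x, lp = x0, log_post(x0)
    chain, accepted = [], 0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        lp_new = log_post(proposal)
        # Accept with probability min(1, p(proposal | D) / p(x | D)).
        if math.log(1.0 - rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        chain.append(x)
    return chain, accepted / steps

def model(x):
    """Hypothetical model u(x) with two outputs."""
    return [x, x * x]

data = model(1.2)                                 # noise-free synthetic data
log_post = make_log_posterior(model, data, lower=0.0, upper=2.0, sigma=0.1)
chain, acceptance = metropolis(log_post, x0=1.0, steps=5000,
                               step_size=0.05, rng=random.Random(0))
posterior_mean = sum(chain[500:]) / len(chain[500:])   # discard burn-in
```

Every step calls `model` once, so the chain length times the number of chains is the number of model evaluations; this is where a fast surrogate for $u(x)$ would be substituted.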

8 GPU Acceleration

GPU accelerators: Nvidia K20, K40, P100, V100
- massive boost of flops and mops compared to CPUs
- better energy efficiency and low cost
- massive concurrency, thousands of simultaneous operations

Challenges in working with accelerators:
- massive concurrency, cannot handle sequential algorithms
- memory management, GPUs have separate, limited memory

9 Split evaluations

We want the model output at a set of points $\{\mathbf{x}_j\}_{j=1}^{n}$, i.e.,

$$G_{\Lambda,C}(\mathbf{x}_j) = \sum_{\mathbf{i} \in \Lambda} c_{\mathbf{i}} \varphi_{\mathbf{i}}(\mathbf{x}_j)$$

Let $C$ be the matrix with columns $\{c_{\mathbf{i}}\}_{\mathbf{i} \in \Lambda}$, and let $B$ be the matrix

$$B = [b_{\mathbf{i},j}] = \varphi_{\mathbf{i}}(\mathbf{x}_j)$$

Then we want the answer

$$A = CB$$

where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{m \times |\Lambda|}$, and $B \in \mathbb{R}^{|\Lambda| \times n}$. Matrix $C$ is given, thus we need to compute matrix $B$ and then multiply by $C$.

The splitting approach allows us to exploit fast linear algebra libraries for GPU computations, e.g.,
- Nvidia: cuBLAS and cuSPARSE
- University of Tennessee at Knoxville: Matrix Algebra on GPU and Multicore Architectures (MAGMA)
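The two-stage split can be sketched as follows. This is a plain-Python illustration, not the GPU implementation: the triple-loop `matmul` stands in for a GEMM call from a BLAS-type library, and the three-function monomial basis, the points, and the helper names are hypothetical.

```python
def build_basis_matrix(basis, points):
    """B[i][j] = phi_i(x_j); one row per basis function, one column per point."""
    return [[phi(x) for x in points] for phi in basis]

def matmul(C, B):
    """Plain loops standing in for a GEMM call from a BLAS library."""
    m, k, n = len(C), len(B), len(B[0])
    return [[sum(C[r][i] * B[i][j] for i in range(k)) for j in range(n)]
            for r in range(m)]

# Hypothetical 1-D basis (monomials) used only to make the sketch runnable.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
points = [0.0, 0.5, 1.0, 2.0]            # n = 4 evaluation points
C = [[1.0, 2.0, 0.0],                    # m = 2 outputs, |Lambda| = 3
     [0.0, -1.0, 3.0]]

B = build_basis_matrix(basis, points)    # stage 1: |Lambda| x n basis values
A = matmul(C, B)                         # stage 2: m x n model outputs
```

Column j of `A` holds all m outputs at point x_j. Only stage 1 depends on the basis; stage 2 is a generic matrix product, which is why it can be handed to an optimized library.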

10 Sparsity discussion

Basis functions $\varphi_{\mathbf{i}}(\mathbf{x})$ have local support, which means that there is sparsity in $B$.

The sparsity pattern cannot be predicted for a general multi-index set; however, for testing purposes, we consider the multi-index set associated with a standard sparse grid, in which case the number of non-zero entries is

$$O\left(n \log_2^d(|\Lambda|)\right)$$

where $|\Lambda|$ is the number of multi-indexes.

Two potential algorithms for constructing $B$:
- Dense method: compute all $n |\Lambda|$ entries, which is very parallelizable.
- Sparse method: convert the multidimensional hierarchical structure (DAG) into a tree and traverse the tree, computing only the non-zeros of $B$.
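A small 1-D experiment shows where the sparsity comes from. The sketch below assumes the standard hierarchical hat functions on [0, 1] without boundary nodes, which is one common locally supported basis and not necessarily the one used in the talk. It builds B with the dense method and counts the non-zeros. In this basis at most one function per level is non-zero at any point, so each column of B has at most `max_level` non-zeros, roughly log2 of the number of basis functions.

```python
def hat(level, k, x):
    """Hierarchical hat centered at (2k+1)/2**level with half-width 2**-level."""
    h = 2.0 ** -level
    center = (2 * k + 1) * h
    return max(0.0, 1.0 - abs(x - center) / h)

def hierarchy(max_level):
    """All (level, k) pairs for levels 1..max_level: 2**(level-1) per level."""
    return [(l, k) for l in range(1, max_level + 1) for k in range(2 ** (l - 1))]

def dense_basis_matrix(functions, points):
    """Dense method: evaluate every basis function at every point."""
    return [[hat(l, k, x) for x in points] for (l, k) in functions]

max_level = 6
functions = hierarchy(max_level)                 # 2**6 - 1 basis functions
points = [(j + 0.5) / 100 for j in range(100)]   # n = 100 points inside (0, 1)
B = dense_basis_matrix(functions, points)

total = len(functions) * len(points)
nonzeros = sum(1 for row in B for b in row if b != 0.0)
max_per_point = max(sum(1 for row in B if row[j] != 0.0)
                    for j in range(len(points)))
```

Comparing `nonzeros` with `total` shows how much of the dense method's work goes into entries that are exactly zero.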

11 Basis evaluations

Dense method: parallelizes across the multi-indexes
- each block of CUDA threads is assigned 32 multi-indexes
- the associated nodes and the basis support are stored in shared memory
- there is opportunity for some reuse of data
- evaluates many functions that are zero, larger memory footprint

Sparse method: parallelizes across the number of evaluations $n$
- each CUDA thread is assigned a single $x_j$
- the hierarchy of multi-indexes is converted to a tree
- each thread independently traverses the tree
- lots of uncoalesced memory access, there is no opportunity for reuse
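The per-point work of the sparse method can be sketched in 1-D. The function below plays the role of a single thread assigned one point: it walks a binary tree of hierarchical hat functions on [0, 1] and records only the non-zero basis values. The hat basis, the tree layout, and the names `hat` and `sparse_row` are assumptions for illustration; the multidimensional DAG-to-tree conversion described above is not shown.

```python
def hat(level, k, x):
    """Hierarchical hat centered at (2k+1)/2**level with half-width 2**-level."""
    h = 2.0 ** -level
    return max(0.0, 1.0 - abs(x - (2 * k + 1) * h) / h)

def sparse_row(x, max_level):
    """Walk the binary tree from the root, keeping only non-zero basis values.

    Node (level, k) has children (level + 1, 2k) and (level + 1, 2k + 1), whose
    supports split the parent's support in half, so the walk visits at most
    one node per level.
    """
    entries = []
    level, k = 1, 0
    while level <= max_level:
        value = hat(level, k, x)
        if value == 0.0:
            break                      # every descendant is zero at x as well
        entries.append(((level, k), value))
        center = (2 * k + 1) * 2.0 ** -level
        k = 2 * k if x < center else 2 * k + 1
        level += 1
    return entries

points = [0.3, 0.5, 0.8125]
rows = [sparse_row(x, max_level=4) for x in points]   # one "thread" per point
```

Each point follows its own path through the tree, which illustrates why neighboring threads tend to read unrelated memory locations in this approach.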

12 Basis evaluations: 4D
[Figure: basis evaluation in 4D on K20, K40, Pascal, and Volta GPUs, comparing the dense method with the sparse method at batch sizes 2.5K and 5K.]

13 Basis evaluations: 8D
[Figure: basis evaluation in 8D on K20, K40, Pascal, and Volta GPUs, comparing the dense method with the sparse method at batch sizes 2.5K and 5K.]

14 Model evaluations: Diverse methods
[Figure: model evaluations in 4D on K20, K40, Pascal, and Volta GPUs, comparing the Sparse-Sparse, Dense-Dense, Sparse-Dense, and Dense-Sparse combinations.]

15 Model evaluations: 4D and few outputs
[Figure: model evaluations in 4D on K20, K40, Pascal, and Volta GPUs, comparing the sparse and dense methods for two small output counts.]

16 Model evaluations: 4D and many outputs
[Figure: model evaluations in 4D on K20, K40, Pascal, and Volta GPUs, comparing the sparse and dense methods for two large output counts.]

17 Model evaluations: 8D and few outputs
[Figure: model evaluations in 8D on K20, K40, Pascal, and Volta GPUs, comparing the sparse and dense methods for two small output counts.]

18 Model evaluations: 8D and many outputs
[Figure: model evaluations in 8D on K20, K40, Pascal, and Volta GPUs, comparing the sparse and dense methods for two large output counts.]

19 Observations

- Small multi-index sets and few $x_j$ favor the dense algorithm.
- Large batches favor the sparse algorithm.
- Newer hardware architectures handle the uncoalesced memory access better and favor the sparse algorithm.
- High dimensions and a complex tree structure favor the dense algorithm.
- A high number of outputs washes out the difference between the methods.
- The dense algorithm always uses more memory.

20 A more mainstream architecture

- Intel 6-core: i7-3930K CPU (Sandy Bridge-E) coupled with Nvidia GTX 1080.
- Intel 4-core: i7-6700K CPU (Skylake) coupled with Nvidia GTX 980 Ti.

Dimensions, level 7, number of multi-indexes.8m, outputs.

Stage                   i7-3930K / GTX 1080   i7-6700K / GTX 980 Ti
Select multi-indexes    7s                    5s
Compute coefficients    45s                   23s
CPU evaluate M          797s                  772s
GPU evaluate M          28s                   24s

21 Results: Neutrino collision kernel model

[Figure: neutrino opacity kernels for different temperatures.]

Evaluations of,, opacity values can be performed in < 2s on Pascal and < 6s on K40.

22 Results: Tomography

[Figure: original, classical tomography, and multi-index reconstructions.] Simplified Shepp-Logan example, using 26 angle measurements.
[Figure: original, classical tomography, and multi-index reconstructions.] Neutron imaging example, using 2 angle measurements.

23 Examples: Bayesian inference

Simple model: $u(x_1, x_2) = \sin(x_1 \pi t) + \sin(x_2 \pi t)$, $x_1, x_2 \in [, 2]$
Data (assuming no noise): $D = \sin(5 \pi t) + \sin(\pi t)$
Conformal likelihood:

$$L(x) = \exp\left( -3 \int (u(x) - D)^2 \, dt \right)$$

[Figure: PDF of the model parameter, with panels for the exact solution, likelihood, log-likelihood, and posterior.]

24 Tasmanian

Toolkit for Adaptive Stochastic Modeling and Non-Intrusive ApproximatioN
- github.com/ornl/tasmanian
- tasmanian.ornl.gov
- current version: 5.
- Supported interfaces: C/C++, Python, MATLAB/Octave, CLI, Fortran 90/95

25 Tasmanian

- Supported: Linux, OSX, Windows (VC++)
- Build system with cmake, install script (bash and batch), GNU-Make
- BSD License (with UT-Battelle clause)
- No external dependence (good to have CUDA and BLAS)
- Global polynomial based refinement
- Large number of global 1-D rules with different growth
- 5 Leja-type rules with different growth
- Askey quadrature rules
- Chebyshev-type 1-D rules balancing Lebesgue constant and number of nodes
- Several local 1-D rules
- arbitrary order piecewise polynomials
- linear and cubic wavelets
- locally anisotropic refinement approach
- Main focus is on surrogate modeling (Sparse Grids)
- DiffeRential Evolution Adaptive Metropolis (DREAM) method

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea. Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

Cuda C Programming Guide Appendix C Table C-

Cuda C Programming Guide Appendix C Table C- Cuda C Programming Guide Appendix C Table C-4 Professional CUDA C Programming (1118739329) cover image into the powerful world of parallel GPU programming with this down-to-earth, practical guide Table

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

VSC Users Day 2018 Start to GPU Ehsan Moravveji

VSC Users Day 2018 Start to GPU Ehsan Moravveji Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

User Manual: TASMANIAN Sparse Grids

User Manual: TASMANIAN Sparse Grids ORNL REPORT Unlimited Release Printed August 2013 User Manual: TASMANIAN Sparse Grids M. Stoyanov Prepared by Oak Ridge National Laboratory One Bethel Valley Road, Oak Ridge, Tennessee 37831 The Oak Ridge

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Page 1 of 17 In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Niloo Ranjan Jibonananda Sanyal Joshua New Page 2 of 17 Table of Contents In-Situ Statistical Analysis

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Monte Carlo for Spatial Models

Monte Carlo for Spatial Models Monte Carlo for Spatial Models Murali Haran Department of Statistics Penn State University Penn State Computational Science Lectures April 2007 Spatial Models Lots of scientific questions involve analyzing

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing Real-Time Rigid id 2D-3D Medical Image Registration ti Using RapidMind Multi-Core Platform Georgia Tech/AFRL Workshop on Computational Science Challenge Using Emerging & Massively Parallel Computer Architectures

More information

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011 Dan Negrut, 2011 ME964

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

CUDA Accelerated Compute Libraries. M. Naumov

CUDA Accelerated Compute Libraries. M. Naumov CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

A Low Level Introduction to High Dimensional Sparse Grids

A Low Level Introduction to High Dimensional Sparse Grids A Low Level Introduction to High Dimensional Sparse Grids http://people.sc.fsu.edu/ jburkardt/presentations/sandia 2007.pdf... John 1 Clayton Webster 2 1 Virginia Tech 2 Sandia National Laboratory. 21

More information

GPU-Accelerated Deep Learning

GPU-Accelerated Deep Learning GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,

More information

Algorithms of Scientific Computing

Algorithms of Scientific Computing Algorithms of Scientific Computing Overview and General Remarks Michael Bader Technical University of Munich Summer 2017 Classification of the Lecture Who is Who? Students of Informatics: Informatics Bachelor

More information

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI

More information

6 BLAS (Basic Linear Algebra Subroutines)

6 BLAS (Basic Linear Algebra Subroutines) 161 BLAS 6.1 Motivation 6 BLAS (Basic Linear Algebra Subroutines) 6.1 Motivation How to optimise programs that use a lot of linear algebra operations? Efficiency depends on but also on: processor speed

More information

High-Performance Scientific Computing

High-Performance Scientific Computing High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Data mining with sparse grids using simplicial basis functions

Data mining with sparse grids using simplicial basis functions Data mining with sparse grids using simplicial basis functions Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Part of the work was supported within the project 03GRM6BN

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

2D vector fields 3. Contents. Line Integral Convolution (LIC) Image based flow visualization Vector field topology. Fast LIC Oriented LIC

2D vector fields 3. Contents. Line Integral Convolution (LIC) Image based flow visualization Vector field topology. Fast LIC Oriented LIC 2D vector fields 3 Scientific Visualization (Part 8) PD Dr.-Ing. Peter Hastreiter Contents Line Integral Convolution (LIC) Fast LIC Oriented LIC Image based flow visualization Vector field topology 2 Applied

More information
