Sampling Using GPU Accelerated Sparse Hierarchical Models
1 Sampling Using GPU Accelerated Sparse Hierarchical Models
Miroslav Stoyanov, Oak Ridge National Laboratory
Supported by the Exascale Computing Project (ECP), exascaleproject.org
April 9, 2018
Miroslav Stoyanov 1/25
2 Sparse polynomial model: Sparse grid, without the grid
Given a sequence of 1-D basis functions $\phi_i(x) : [a, b] \to \mathbb{R}$ for $i = 0, 1, \dots$ (and maybe associated interpolation nodes $\xi_i$), consider all possible $d$-dimensional tensors, indexed by $i = (i_1, i_2, \dots, i_d)$:
$$\phi_i(x) = \prod_{k=1}^{d} \phi_{i_k}(x_k), \qquad \phi_i : \bigotimes_{k=1}^{d} [a, b] \to \mathbb{R}$$
A sparse polynomial basis is defined by a finite multi-index set
$$\Lambda \subset \mathbb{N}^d = \{(i_1, i_2, \dots, i_d) : i_k \in \mathbb{N}\}$$
A sparse model is any basis combined with a set of coefficients $C = \{c_i\}_{i \in \Lambda} \subset \mathbb{R}^m$:
$$G_{\Lambda,C}(x) = \sum_{i \in \Lambda} c_i \phi_i(x)$$
The choice of $\Lambda$ and $C$ is usually made so that
$$G_{\Lambda,C}(x) \approx f(x), \qquad f : \bigotimes_{k=1}^{d} [a, b] \to \mathbb{R}^m$$
For example, any sparse grid constructed from a nested set of nodes and basis functions follows this framework.
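The definitions above can be sketched in a few lines of Python. This is a minimal illustration only: the 1-D monomial basis $\phi_i(x) = x^i$, the domain $[0, 1]$, the total-degree choice of $\Lambda$, and the names `phi_1d`, `phi_tensor`, and `sparse_model` are all assumptions made for the example and are not taken from the talk.

```python
import numpy as np

def phi_1d(i, x):
    """Hypothetical 1-D basis: the monomial x**i (an assumption, not the talk's basis)."""
    return x ** i

def phi_tensor(multi_index, x):
    """Tensor basis: product over the dimensions k of phi_{i_k}(x_k)."""
    return np.prod([phi_1d(i_k, x_k) for i_k, x_k in zip(multi_index, x)])

def sparse_model(Lambda, C, x):
    """G_{Lambda,C}(x) = sum over i in Lambda of c_i * phi_i(x); returns a vector in R^m."""
    return sum(c * phi_tensor(i, x) for i, c in zip(Lambda, C))

# Assumed multi-index set in d = 2: all (i1, i2) with i1 + i2 <= 2.
Lambda = [(i1, i2) for i1 in range(3) for i2 in range(3) if i1 + i2 <= 2]
# One coefficient vector c_i in R^m per multi-index, here with m = 2 outputs.
C = [np.array([1.0, float(i1 - i2)]) for (i1, i2) in Lambda]

value = sparse_model(Lambda, C, np.array([0.5, 0.25]))
```

Any other 1-D basis, for example local hat functions, can be swapped in by changing `phi_1d` alone, which is the point of the "without the grid" framing.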
3 Sparse polynomial basis
[Figure: three panels showing the hierarchy of nodes for the sparse polynomial basis.]
4 Piece-wise constant hierarchy
[Figure: hierarchy of nodes for the piece-wise constant basis.]
5 Motivation: Exascale Models of Stellar Explosions (fast surrogate models)
Simulation of neutrino radiation in core collapse supernovae:
$$u_t + \nabla u = \int_E u(E', \cdot)\, R(E, E', T, \eta)\, n(E')\, dE'$$
The integral has to be evaluated for each cell, and the collision kernel $R(E, E')$ is computed from a separate expensive model.
Standard practice is to approximate the kernel model using dense grids.
Sparse grid surrogates dramatically reduce the construction cost and the memory footprint, but evaluations are more expensive than a simple table lookup.
Consider thousands of evaluations or more per time step, per discretization cell.
6 Motivating: Tomography in 4D (sparse data in high dimensions)
- There are many advanced image processing techniques for 2D and 3D
- The Spallation Neutron Source at ORNL produces data on the order of 2GB
- Data lives in 4D space, which means information is very sparse
- But the data has lots of structure
7 Motivating: Bayesian inference (large number of outputs)
Bayesian inference: $p(x \mid D) \propto L(u(x), D)\, p(x)$
- $u(x)$ is some model,
- $D$ is the observation data,
- $p(x)$ is the prior, the initial guess regarding $x$,
- $L(u(x), D)$ is the likelihood, the probability of the difference between $u(x)$ and $D$,
- $p(x \mid D)$ is the posterior, the informed distribution of $x$.
Analyzing $p(x \mid D)$ is challenging:
- a Markov-Chain Monte Carlo method is needed
- high dimensionality, low acceptance rate, irregular structure of $p(x \mid D)$
Brute force solution: DiffeRential Evolution Adaptive Metropolis (DREAM)
- run multiple chains, huge number of batches of samples
The model $u(x)$ can have a large number of outputs.
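To make the sampling cost concrete, here is a minimal sketch of a plain single-chain random-walk Metropolis sampler. It is deliberately simpler than DREAM, which runs multiple interacting chains. The 1-D target (uniform prior on $[0, 1]$, Gaussian likelihood centered at 0.3), the step size, and all function names are assumptions chosen for illustration. Every step calls the log-posterior once, which is why a fast surrogate for $u(x)$ matters.

```python
import numpy as np

def log_posterior(x):
    """Hypothetical unnormalized log-posterior: uniform prior on [0, 1], Gaussian likelihood."""
    if x < 0.0 or x > 1.0:
        return -np.inf                      # outside the support of the prior
    return -0.5 * ((x - 0.3) / 0.05) ** 2   # log-likelihood up to an additive constant

def metropolis(log_post, x0, steps, step_size, seed=0):
    """Plain random-walk Metropolis with a single chain (not DREAM)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    samples, accepted = np.empty(steps), 0
    for k in range(steps):
        proposal = x + step_size * rng.standard_normal()
        lp_new = log_post(proposal)
        # Accept with probability min(1, p(proposal | D) / p(x | D)).
        if np.log(rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        samples[k] = x
    return samples, accepted / steps

samples, acceptance_rate = metropolis(log_posterior, x0=0.5, steps=20000, step_size=0.1)
posterior_mean = samples[2000:].mean()      # discard burn-in before averaging
```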
8 GPU Acceleration
GPU accelerators (Nvidia K20, K40, P100, V100):
- massive boost of flops and mops compared to CPUs
- better energy efficiency and low cost
- massive concurrency, thousands of simultaneous operations
Challenges in working with accelerators:
- massive concurrency, cannot handle sequential algorithms
- memory management, GPUs have separate, limited memory
9 Split evaluations
We want the model output at a set of points $\{x_j\}_{j=1}^{n}$, i.e.,
$$G_{\Lambda,C}(x_j) = \sum_{i \in \Lambda} c_i \phi_i(x_j)$$
Let $C$ be the matrix with columns $\{c_i\}_{i \in \Lambda}$, and let $B$ be the matrix
$$B = [b_{i,j}], \qquad b_{i,j} = \phi_i(x_j)$$
Then, we want the answer
$$A = C B$$
where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{m \times |\Lambda|}$, and $B \in \mathbb{R}^{|\Lambda| \times n}$.
Matrix $C$ is given, thus we need to compute matrix $B$ and then multiply by $C$.
The splitting approach allows us to exploit fast linear algebra libraries for GPU computations, e.g.,
- Nvidia: cuBLAS and cuSPARSE
- University of Tennessee at Knoxville: Matrix Algebra on GPU and Multicore Architectures (MAGMA)
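The two-step split can be sketched on the CPU with NumPy. This is an illustration under stated assumptions: the 1-D hat functions, their centers and radii, and the random coefficients are hypothetical, and NumPy's `@` stands in for the matrix product that a GPU implementation would hand to a library such as cuBLAS.

```python
import numpy as np

def hat(x, center, radius):
    """Hypothetical local basis: linear hat supported on (center - radius, center + radius)."""
    return np.maximum(0.0, 1.0 - np.abs(x - center) / radius)

# Hypothetical 1-D hierarchy: (center, support radius) for each index i.
basis = [(0.5, 0.5), (0.25, 0.25), (0.75, 0.25), (0.125, 0.125), (0.375, 0.125)]

rng = np.random.default_rng(0)
m, n = 3, 4                               # m outputs, n evaluation points
C = rng.standard_normal((m, len(basis)))  # given coefficients, one column per index
x = rng.random(n)                         # points x_j in [0, 1]

# Step 1: basis matrix B with b_ij = phi_i(x_j), shape (|Lambda|, n).
B = np.array([hat(x, c, r) for c, r in basis])
# Step 2: a single matrix product gives all outputs at all points.
A = C @ B                                 # shape (m, n)
```

Column `A[:, j]` is the model output at `x[j]`, so the whole batch of evaluations costs one basis construction and one matrix product.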
10 Sparsity discussion
Basis functions $\phi_i(x)$ have local support, which means that there is sparsity in $B$.
The sparsity pattern cannot be predicted for a general multi-index set; however, for testing purposes, we consider the multi-index set associated with a standard sparse grid, in which case the number of non-zero entries is
$$O\left(n \log_2^{d}(|\Lambda|)\right)$$
where $|\Lambda|$ is the number of multi-indexes.
Two potential algorithms for constructing $B$:
- Dense method: compute all $n\,|\Lambda|$ entries, which is very parallelizable
- Sparse method: convert the multidimensional hierarchical structure (DAG) into a tree and traverse the tree, computing only the non-zeros of $B$
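The difference between the two constructions can be sketched in one dimension, where the hierarchy is already a tree and no DAG-to-tree conversion is needed. Everything here is an assumption made for illustration: the dyadic hat-function hierarchy on $[0, 1]$, the heap-style storage, and the function names. The sparse traversal relies on each child's support lying inside its parent's support, so a zero at a node lets the whole subtree be skipped.

```python
import numpy as np

def hat(x, center, radius):
    """Hypothetical local basis: linear hat supported on (center - radius, center + radius)."""
    return max(0.0, 1.0 - abs(x - center) / radius)

def make_hierarchy(levels):
    """Heap-ordered 1-D dyadic hierarchy: node k has children 2k+1 and 2k+2."""
    centers, radii = [0.5], [0.5]
    for k in range(2 ** (levels - 1) - 1):       # every node above the last level
        c, r = centers[k], radii[k]
        centers += [c - r / 2, c + r / 2]
        radii += [r / 2, r / 2]
    return np.array(centers), np.array(radii)

def dense_column(x, centers, radii):
    """Dense method: evaluate every basis function at x, zeros included."""
    return np.array([hat(x, c, r) for c, r in zip(centers, radii)])

def sparse_column(x, centers, radii):
    """Sparse method: traverse the tree and return {i: phi_i(x)} for the non-zeros only."""
    found, stack = {}, [0]
    while stack:
        k = stack.pop()
        v = hat(x, centers[k], radii[k])
        if v > 0.0:                              # children are zero whenever the parent is
            found[k] = v
            for child in (2 * k + 1, 2 * k + 2):
                if child < len(centers):
                    stack.append(child)
    return found

levels = 6
centers, radii = make_hierarchy(levels)          # 2**6 - 1 = 63 basis functions
points = np.random.default_rng(1).random(8)      # n = 8 evaluation points in [0, 1]
nnz = sum(len(sparse_column(x, centers, radii)) for x in points)
```

In this 1-D setting at most one function per level is non-zero at a given point, so `nnz` is bounded by `len(points) * levels`, while the dense method always computes `len(points) * len(centers)` entries.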
11 Basis evaluations
Dense method: parallelizes across the multi-indexes
- each block of CUDA threads is assigned 32 multi-indexes
- the associated nodes and the basis support are stored in shared memory
- there is opportunity for some reuse of data
- drawback: evaluates many functions that are zero, larger memory footprint
Sparse method: parallelizes across the number of evaluation points $n$
- each CUDA thread is assigned a single $x_j$
- the hierarchy of multi-indexes is converted to a tree
- each thread independently traverses the tree
- drawback: lots of uncoalesced memory access, no opportunity for reuse
12 Basis evaluations: 4D
[Figure: four panels titled "Basis evaluation 4D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Dense method with the Sparse method at two batch sizes (legend: "2.5K batch" and "5K batch").]
13 Basis evaluations: 8D
[Figure: four panels titled "Basis evaluation 8D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Dense method with the Sparse method at two batch sizes (legend: "2.5K batch" and "5K batch").]
14 Model evaluations: Diverse methods
[Figure: four panels titled "Evaluations 4D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Sparse-Sparse, Dense-Dense, Sparse-Dense, and Dense-Sparse combinations.]
15 Model evaluations: 4D and few outputs
[Figure: four panels titled "Evaluations 4D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Sparse and Dense methods for two small output counts.]
16 Model evaluations: 4D and many outputs
[Figure: four panels titled "Evaluations 4D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Sparse and Dense methods for two large output counts.]
17 Model evaluations: 8D and few outputs
[Figure: four panels titled "Evaluations 8D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Sparse and Dense methods for two small output counts.]
18 Model evaluations: 8D and many outputs
[Figure: four panels titled "Evaluations 8D", one each for the K20, K40, Pascal, and Volta GPUs, comparing the Sparse and Dense methods for two large output counts.]
19 Observations
- Small multi-index sets and few $x_j$ favor the dense algorithm
- Large batches favor the sparse algorithm
- Newer hardware architectures handle the uncoalesced memory access better and favor the sparse algorithm
- High dimensions and complex tree structure favor the dense algorithm
- A high number of outputs washes out the difference between the methods
- The dense algorithm always uses more memory
20 A more mainstream architecture
- Intel 6-core i7-3930K CPU (Sandy Bridge-E) coupled with an Nvidia GTX 1080
- Intel 4-core i7-6700K CPU (Skylake) coupled with an Nvidia GTX 980 Ti
Dimensions, level 7, number of multi-indexes .8M, outputs.
Stage | i7-3930K + GTX 1080 | i7-6700K + GTX 980 Ti
Select multi-indexes | 7s | 5s
Compute coefficients | 45s | 23s
CPU evaluate M | 797s | 772s
GPU evaluate M | 28s | 24s
21 Results: Neutrino collision kernel model
Neutrino opacity kernels for different temperatures.
Evaluations of a million or more opacity values can be performed in < 2s on a Pascal GPU and < 6s on a K40.
22 Results: Tomography
[Figure: Original, Classical Tomography, and Multi-index reconstructions.] Simplified Shepp-Logan example, using 26 angle measurements.
[Figure: Original, Classical Tomography, and Multi-index reconstructions.] Neutron imaging example, using 2 angle measurements.
23 Examples: Bayesian inference
Simple model: $u(x_1, x_2) = \sin(x_1 \pi t) + \sin(x_2 \pi t)$, with $x_1, x_2 \in [\,\cdot\,, 2]$
Data (assuming no noise): $D = \sin(5 \pi t) + \sin(\pi t)$
Conformal likelihood:
$$L(x) = \exp\left(-3 \int (u(x) - D)^2 \, dt\right)$$
[Figure: "PDF of Model Parameter", with labels Exact solution, Likelihood, Log-likelihood, and Posterior.]
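A minimal sketch of this likelihood, evaluated numerically, is given below. Several values are assumptions made only so the example runs: the time interval $t \in [0, 1]$, the scaling constant 3 in the exponent, the trapezoid quadrature, and the parameter pair (5, 1) used to generate the noise-free data. The function names are illustrative.

```python
import numpy as np

T = np.linspace(0.0, 1.0, 2001)     # assumed time interval t in [0, 1]
DT = T[1] - T[0]
SCALE = 3.0                         # assumed scaling constant in the exponent

def model(x1, x2):
    """u(x1, x2) = sin(x1 * pi * t) + sin(x2 * pi * t), sampled on the grid T."""
    return np.sin(x1 * np.pi * T) + np.sin(x2 * np.pi * T)

DATA = model(5.0, 1.0)              # noise-free data from the assumed parameters (5, 1)

def log_likelihood(x1, x2):
    """log L(x) = -SCALE * integral of (u(x) - D)^2 dt, using the trapezoid rule."""
    r2 = (model(x1, x2) - DATA) ** 2
    integral = DT * (0.5 * r2[0] + r2[1:-1].sum() + 0.5 * r2[-1])
    return -SCALE * integral

at_truth = log_likelihood(5.0, 1.0)
at_swapped = log_likelihood(1.0, 5.0)
elsewhere = log_likelihood(4.0, 1.0)
```

Because the model is symmetric in $x_1$ and $x_2$, the swapped pair fits the data equally well, so the resulting posterior has more than one mode. This is one simple source of the irregular structure that makes sampling difficult.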
24 Tasmanian
Toolkit for Adaptive Stochastic Modeling and Non-Intrusive ApproximatioN
github.com/ornl/tasmanian
tasmanian.ornl.gov
current version: 5.
Supported interfaces: C/C++, Python, MATLAB/Octave, CLI, Fortran 90/95
25 Tasmanian
- Supported: Linux, OSX, Windows (VC++)
- Build system with cmake, install script (bash and batch), GNU-Make
- BSD License (with UT-Battelle clause)
- No external dependence (good to have CUDA and BLAS)
- Global polynomial based refinement
- Large number of global 1-D rules with different growth
- 5 Leja-type rules with different growth
- ASKEY quadrature rules
- Chebyshev-type 1-D rules balancing Lebesgue constant and number of nodes
- Several local 1-D rules
  - arbitrary order piece-wise polynomials
  - linear and cubic wavelets
  - locally anisotropic refinement approach
- Main focus is on surrogate modeling (Sparse Grids)
- DiffeRential Evolution Adaptive Metropolis (DREAM) method
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationGpufit: An open-source toolkit for GPU-accelerated curve fitting
Gpufit: An open-source toolkit for GPU-accelerated curve fitting Adrian Przybylski, Björn Thiel, Jan Keller-Findeisen, Bernd Stock, and Mark Bates Supplementary Information Table of Contents Calculating
More informationGPU LIBRARY ADVISOR. DA _v8.0 September Application Note
GPU LIBRARY ADVISOR DA-06762-001_v8.0 September 2016 Application Note TABLE OF CONTENTS Chapter 1. Overview... 1 Chapter 2. Usage... 2 DA-06762-001_v8.0 ii Chapter 1. OVERVIEW The NVIDIA is a cross-platform
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationDeep Learning: Transforming Engineering and Science The MathWorks, Inc.
Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA
More informationUnveiling Cellular & Molecular Events of Cardiac Arrhythmias
Unveiling Cellular & Molecular Events of Cardiac Arrhythmias Hoang-Trong Minh Tuan 1, George S. William 1, Greg D. Smith 2, M. Saleet Jafri 1,3,4 1 - Department of Bioinformatics and Computational Biology
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationDouble Rewards of Porting Scientific Applications to the Intel MIC Architecture
Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford
More informationMay 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017
May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationMarkov Chain Monte Carlo on the GPU Final Project, High Performance Computing
Markov Chain Monte Carlo on the GPU Final Project, High Performance Computing Alex Kaiser Courant Institute of Mathematical Sciences, New York University December 27, 2012 1 Introduction The goal of this
More informationAES Cryptosystem Acceleration Using Graphics Processing Units. Ethan Willoner Supervisors: Dr. Ramon Lawrence, Scott Fazackerley
AES Cryptosystem Acceleration Using Graphics Processing Units Ethan Willoner Supervisors: Dr. Ramon Lawrence, Scott Fazackerley Overview Introduction Compute Unified Device Architecture (CUDA) Advanced
More informationCODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS
CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS Dániel Berényi Wigner RCP, GPU Laboratory, Budapest, Hungary Perspectives of GPU Computing in Physics and Astrophysics Rome 2014. INTRODUCTION The most
More informationGPU Parallelization of Gibbs Sampling Abstractions, Results, and Lessons Learned Alireza S Mahani Scientific Computing Group Sentrana Inc.
GPU Parallelization of Gibbs Sampling Abstractions, Results, and Lessons Learned Alireza S Mahani Scientific Computing Group Sentrana Inc. May 16, 2012 Objectives of This Talk What This Talk Is About What
More informationIntroduction to GPU Computing. 周国峰 Wuhan University 2017/10/13
Introduction to GPU Computing chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 GPU and Its Application 3 Ways to Develop Your GPU APP An Example to Show the Developments Add GPUs: Accelerate Science
More informationAn Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center
An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the
More informationParallel and Distributed Programming Introduction. Kenjiro Taura
Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel Programming? 2 What Parallel Machines Look Like, and Where Performance Come From? 3 How to Program Parallel
More information