Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics
1 Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics. Ramani Duraiswami, Computer Science & UMIACS, University of Maryland, College Park. Joint work with Nail Gumerov, Yuancheng Luo, Adam O'Donovan, Bill Dorland, Kate Despain. Partially supported by NASA, DOE, NSF, UMD, NVIDIA.
2 Problem sizes in simulation/assimilation are increasing. Change in paradigm in science: simulate, then test. Fidelity demands larger simulations. Problems being simulated are also much more. Sensors are getting varied and cheaper, and storage is getting cheaper: cameras, microphones. Other large data: text (all the newspapers, books, technical papers); genome data; medical/biological data (X-Ray, PET, MRI, Ultrasound, Electron microscopy, ...); climate (Temperature, Salinity, Pressure, Wind, Oxygen content, ...).
3 Need fast algorithms, parallel processing, better software Fast algorithms that improve asymptotic complexity of operations FFT, FMM, NUFFT, preconditioned Krylov iterations Parallel processing can divide the time needed by the number of processors GPUs, multicore CPUs Partitioning problems across heterogeneous computing environments Cloud computing Architecture aware programming Data structures for parallel architectures and cache optimization
4 Fast Multipole Methods. Follows from seminal work of Rokhlin and Greengard (1987). General method for accelerating large classes of dense matrix vector products. Solve systems, compute eigenvalues etc. in combination with iterative algorithms. Allow reduction of O(N^2) and O(N^3) operations to linear order. Dr. Gumerov and I are applying it to many areas: acoustics, synthetic beamforming; fluid mechanics (vortex methods, potential flow, Stokes flow); electromagnetic scattering and Maxwell's equations; fast statistics, similarity measures, image processing, segmentation, tracking, learning; non-uniform fast Fourier transforms and reconstruction; elastic registration, fitting thin-plate splines.
5 Decompose the matrix vector product into a sparse part, taking care of local interactions, and a dense part. FMM replaces pairwise evaluations in the dense part with an upward and downward pass via a hierarchy. Spatial data structures (octrees), associated lists of particles. [Figure: MLFMM with a source data hierarchy (S) and an evaluation data hierarchy (R), levels 2 to 5 on each side.]
6 RBF/FMM interpolation to regular spatial grid
7 Helmholtz equation (some other scattering problems were solved) Performance tests Mesh: vertices/ elements kd=29, Neumann problem kd=144, Robin problem (impedance, sigma=1) Gumerov & Duraiswami, 2006
8 FMM on GPU. N.A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors. Journal of Computational Physics, 227. N-body problems: several papers implement them on the GPU (but restricted to O(10^5)). To go to O(10^6) and beyond we need the FMM. Challenges: effect of GPU architecture on FMM complexity and optimization; accuracy; performance.
9 Basic FMM flow chart Gumerov & Duraiswami, 2006
10 Direct summation on GPU (final step in the FMM). Computations of potential, optimal settings for CPU. CPU: $\mathrm{Time} = C N s$, with $s = 8^{-l_{\max}} N$; b/c = 16. GPU: $\mathrm{Time} = A_1 N + B_1 N/s + C_1 N s$, where $A_1 N$ is read/write, $B_1 N/s$ is access to box data, and $C_1 N s$ is float computations. These parameters depend on the hardware.
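For orientation, the direct summation that this timing model describes can be sketched in a few lines of Python. This is a minimal CPU illustration and not the authors' GPU kernel; it assumes the Laplace (1/r) potential, and the names `direct_potential`, `sources`, `charges`, and `targets` are hypothetical.

```python
import math

def direct_potential(sources, charges, targets):
    """Sum q_i / |y - x_i| over all sources for each target point y."""
    potentials = []
    for ty in targets:
        total = 0.0
        for sx, q in zip(sources, charges):
            r2 = sum((a - b) ** 2 for a, b in zip(ty, sx))
            if r2 > 0.0:  # skip a source that coincides with the target
                total += q / math.sqrt(r2)
        potentials.append(total)
    return potentials

# Two sources, two targets; the second target coincides with the second source.
sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
charges = [1.0, 2.0]
targets = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
phi = direct_potential(sources, charges, targets)
```

The double loop is what makes the cost grow with the product of source and target counts. Inside the FMM it is applied only between neighboring boxes holding about s particles each, which is where the $N s$ term comes from.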
11 Direct summation on GPU. FMM requires a balance between direct summation and the rest of the algorithm. Compare the GPU final summation complexity, $\mathrm{Cost} = A_1 N + B_1 N/s + C_1 N s$, and the total FMM complexity, $\mathrm{Cost} = A N + B N/s + C N s$. Optimal cluster size for the direct summation step of the FMM: $s_{\mathrm{opt}} = (B_1/C_1)^{1/2}$. This leads to $\mathrm{Cost} = (A + A_1) N + (B + B_1) N/s + C_1 N s$, and $s_{\mathrm{opt}} = ((B + B_1)/C_1)^{1/2}$.
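The optimal cluster size follows from setting the derivative of the cost with respect to s to zero. The sketch below evaluates the cost model and the resulting optimum; the numerical constants are made up purely for illustration (the slides note that the real ones depend on the hardware), and the function names are hypothetical.

```python
import math

def fmm_cost(n, s, a, b, c):
    """Cost model a*N + b*N/s + c*N*s."""
    return a * n + b * n / s + c * n * s

def optimal_cluster_size(b, c):
    """d(cost)/ds = -b*N/s**2 + c*N = 0 gives s = sqrt(b / c)."""
    return math.sqrt(b / c)

# Illustrative constants only, not measured values.
A, B = 10.0, 400.0             # rest of the FMM
A1, B1, C1 = 2.0, 500.0, 1.0   # direct summation step on the GPU

s_direct = optimal_cluster_size(B1, C1)    # direct step considered alone
s_full = optimal_cluster_size(B + B1, C1)  # full algorithm

n = 1_048_576
cost_at_opt = fmm_cost(n, s_full, A + A1, B + B1, C1)
```

Because B is added to the numerator, the optimum for the full algorithm is never smaller than the optimum for the direct step alone.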
12 Direct summation on GPU (final step in the FMM) Computations of potential, optimal settings for GPU b/c=300
13 Other steps of the FMM on GPU Accelerations in range 5-60; Effective accelerations for N=1,048,576 (taking into account max level reduction):
14 Accuracy. Relative $L_2$ norm error measure: CPU single precision direct summation was taken as "exact"; 100 sampling points were used.
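The formula for the error measure did not survive in the transcription. A common definition of the relative $L_2$ norm error over a set of sampling points is sketched below; this is an assumption about what the slide showed, and `relative_l2_error` is a hypothetical name.

```python
import math

def relative_l2_error(approx, exact):
    """sqrt( sum (approx_j - exact_j)^2 / sum exact_j^2 ) over sampling points."""
    num = sum((a - e) ** 2 for a, e in zip(approx, exact))
    den = sum(e ** 2 for e in exact)
    return math.sqrt(num / den)

# Tiny worked case: one of two sampled values is off by 0.5.
exact = [3.0, 4.0]
approx = [3.0, 4.5]
err = relative_l2_error(approx, exact)  # sqrt(0.25 / 25) = 0.1
```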
15 What is more accurate for solution of large problems on GPU: direct summation or FMM? Error computed over a grid of 729 sampling points, relative to exact solution, which is direct summation with double precision. Possible reason why the GPU error in direct summation grows: systematic roundoff error in computation of function 1/sqrt(x). (still a question).
16 Performance, N=1,048,576. [Table, potential only: serial CPU time, GPU time, and ratio for three values of p; the surviving ratios are 33, 56, and 48.] [Table, potential + forces (gradient): serial CPU time, GPU time, and ratio for p=4, p=8, p=12.]
17 Performance p=4 p=8 p=12
18 Performance. FMM computations of the potential and forces on GPU. Peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on GPU effective rates in the Teraflops range are observed (following the citation below). M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans, Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, Gordon Bell Prize winning paper at SC'97. [Figure labels: direct, FMM, CPU, GPU.]
19 Introduction. GPUs are great, as all the previous talks have said, but they require you to program in an extended version of C and need the NVIDIA toolchain. What if you have an application that is in Fortran 9x/2003, Matlab, or C/C++, or that is too large to fit on the GPU and needs to use the CPU cores, MPI, etc. as part of a larger application, but should still take advantage of the GPU? Offload computations which have good speedups on the GPU to it using library calls in your programming environment. Enter FLAGON: an extensible open source library and middleware framework that allows use of the GPU. Implemented currently for Fortran-9X, and preliminarily for C++ and MATLAB.
20 Programming on the GPU. GPU organized as 2-30 groups of multiprocessors (8 relatively slow processors each) with a small amount of own memory and access to common shared memory. Factor of 100s difference in speed as one goes up the memory hierarchy. To achieve gains, problems must fit the SPMD paradigm and manage memory. Fortunately many practically important tasks do map well, and we are working on converting others: image and audio processing; some types of linear algebra cores; many machine learning algorithms. Research issues: identifying important tasks and mapping them to the architecture; making it convenient for programmers to call GPU code from host code. Memory hierarchy: local memory ~50 kB; GPU shared memory ~1 GB; host memory ~2-32 GB.
21 Approach to use GPU: Flagon Middleware. Programming from a higher-level language on the CPU (Fortran/C++/Matlab). Defines a module/class that provides pointers on the CPU to device variables on the GPU. Executes small, well written CU functions to perform primitive operations on the device and avoid data transfer overhead. Provides wrappers to BLAS, FFT, and other software (random number, sort, screen dump, etc.). Allows incorporation of existing mechanisms for doing distributed programming (OpenMP, MPI, etc.) to handle clusters. Allows relatively easy conversion of existing code.
22 Sample scientific computing applications Radial basis function fitting Plasma turbulence computations Fast Multipole Force calculation in particle systems Numerical Relativity Signal Processing Integral Equations
23 FLAGON Framework. Fortran layer: device variables (devvar) communicate with lower levels; Fortran interfaces and wrappers pass parameters to the C/C++ level; may directly call CUBLAS/CUFFT library functions. C/C++ layer: communicates with CUDA kernels; sets up function calls and parameter passing to kernels; module management of external functions. CUDA layer: performs operations on the device. [Figure: Fortran level; Fortran-C wrappers/interfaces; C/CUDA level with FLAGON functionality, CUBLAS/CUFFT, and device kernels.]
24 FLAGON Principles. Build a module/class that defines device variables and host pointers to them, and allows their manipulation via functions and overloaded FORTRAN 95 operators. Extensible via CUDA kernels that work with the module: use external CUDA kernel loaders and generic kernel callers. Efficient memory management: data is stored on the device and managed by the host; asynchronous operations are continuously performed on the device; data transfers between host and device are minimized. Integrated libraries: CUBLAS/CUFFT; CUDPP; some new linear algebra cores, small FFT code, random numbers.
25 FLAGON Device Variables. User instantiates device variables in Fortran. A device variable encapsulates parameters and attributes of the data structure transferred between host and device, tracks (via pointers) allocated memory on the device, and stores data attributes (type and dimensions) on the host and device. FLAGON structure devvar: Device Pointer (pointer to device memory address); Device Data Type (data type stored on device); Device Status (allocation status on device); Device Dimensions (X, Y, Z dimensions of vector or matrix on host); Device Leading Dimensions (X_L, XY_L leading dimensions of vector or matrix on device).
26 FLAGON Work-Cycle. Compile and link the library to user Fortran code. Load the library into memory. Allocate device variables and copy host data to the device. The work cycle allows subsequent computations to be performed solely on the device. Transfer data from device to host when done. Discard/free data on the device. FLAGON work cycle: Load FLAGON Library (specify GPU device, load CUBLAS library); Allocate Device Variable(s) (allocates and pads memory on the GPU device); Memory Transfer Host to Device (transfer host data from Fortran to CUDA global memory); Work (call CUBLAS, CUFFT, CUDPP, CUDA functions and perform all calculations on the GPU); Memory Transfer Device to Host (transfer data back from device to host).
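The work cycle can be mimicked on the host with a small mock, which may help when reading the function list on the next slide. Everything below is a hypothetical Python stand-in, not the FLAGON API: `DevVar`, `MockDevice`, and the padding granularity of 16 are assumptions, and the "device" is just a Python list. The record fields follow the devvar description on the previous slide.

```python
from dataclasses import dataclass

PAD = 16  # assumed padding granularity (not given in the slides)

def padded(n, pad=PAD):
    """Round n up to the next multiple of pad."""
    return ((n + pad - 1) // pad) * pad

@dataclass
class DevVar:
    pointer: int      # handle to the (mock) device allocation
    dtype: str        # data type stored on the device
    allocated: bool   # allocation status
    dims: tuple       # (nx, ny, nz) as seen by the host
    lead_dims: tuple  # (X_L, XY_L) padded leading dimensions on the device

class MockDevice:
    def __init__(self):
        self.memory = {}
        self.next_handle = 1

    def allocate(self, dtype, nx, ny=1, nz=1):
        x_l = padded(nx)
        xy_l = x_l * ny
        handle = self.next_handle
        self.next_handle += 1
        self.memory[handle] = [0.0] * (xy_l * nz)
        return DevVar(handle, dtype, True, (nx, ny, nz), (x_l, xy_l))

    def _offsets(self, dv):
        """Yield (host offset, padded device offset) for every element."""
        nx, ny, nz = dv.dims
        x_l, xy_l = dv.lead_dims
        for k in range(nz):
            for j in range(ny):
                for i in range(nx):
                    yield i + j * nx + k * nx * ny, i + j * x_l + k * xy_l

    def to_device(self, host, dv):
        buf = self.memory[dv.pointer]
        for h, d in self._offsets(dv):
            buf[d] = host[h]

    def scale(self, dv, a):
        """Stand-in for work done entirely on the device."""
        buf = self.memory[dv.pointer]
        for idx in range(len(buf)):
            buf[idx] *= a

    def to_host(self, dv):
        buf = self.memory[dv.pointer]
        nx, ny, nz = dv.dims
        host = [0.0] * (nx * ny * nz)
        for h, d in self._offsets(dv):
            host[h] = buf[d]
        return host

    def free(self, dv):
        del self.memory[dv.pointer]
        dv.allocated = False

# Work cycle: allocate, host -> device, work, device -> host, free.
dev = MockDevice()
dv = dev.allocate("r4", 3, 2)
dev.to_device([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dv)
dev.scale(dv, 2.0)
result = dev.to_host(dv)
lead = dv.lead_dims
dev.free(dv)
```

The point of the pattern is that only the two transfer calls touch host data; any number of operations can run between them without moving data back and forth.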
27 FLAGON Functions. Initialization functions: open_devobjects, close_devobjects. Memory functions: allocation/deallocation: allocate_dv(chartype, nx, ny, nz), deallocate_dv(devvar); memory transfer: transfer_[i, r, c]4(hostvar, devvar, c2g), transfer_[i, r, c] (hostvar, devvar, c2g); memory copy: copy(devvar1,devvar2), function clonedeepwdata(devvara), function clonedeepwodata(devvara); misc.: swap(devvar1, devvar2), part(devicevariable,i1,i2,j1,j2,k1,k2), get_[i, s, c], set_[i, s, c]. Point-wise functions: arithmetic: devf_[hadamardf, divide, addition, subtraction] (devvar3, devvar1, devvar2, option); scaling: devf_[i,s,c]scal(devicevariable, a, b), devf_cscalconj(devicevariable, a, b); misc.: devf_zeros(devicevariable), devf_conjugate(devicevariable), devf_partofcmplx(whichpart,devicevariable). CUBLAS functions: BLAS 1, BLAS 2, BLAS 3 (with shorter call strings). CUFFT functions: FFT plans: devf_fftplan(devvariable, fft_type, batch), devf_destroyfftplan(plan); FFT functions: devf_fft(input, plan, output), devf_bfft(input, plan, output), devf_ifft(input, plan, output), devf_fftr2c(input, plan, output), devf_fftc2r(input, plan, output). CUDPP functions: devf_anccudppsortscan(devvarin, devvarout, operation, datatype, algorithm, option), devf_anccudppsortsimple(devvarin, devvarout). Ancillary functions: devf_ancmatrixtranspose(devvarin, devvarout), devf_ancbitonicsort(devvar1).
28 Example of code conversion
29 Plasma turbulence computations. Spectral code, solved via a standard Runge-Kutta time advance, coupled with a pseudo-spectral evaluation of the nonlinear (NL) terms. Derivatives are evaluated in k space, while multiplications in Eq. (2) are carried out in real space. The standard 2/3 rule for dealiasing is applied, and small hyperviscous damping terms are added to provide stability at the grid scale. Results agree with analytic expectations and are the same on both CPU & GPU. 32x speedup!
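The 2/3 rule mentioned above can be illustrated in one dimension. The sketch below builds the mask of retained Fourier modes in FFT ordering; it assumes the strict convention that a mode k is kept when 3|k| < n, and conventions at the cutoff vary between codes. The function name `dealias_mask` is hypothetical, and this is not the code used in the talk.

```python
def dealias_mask(n):
    """Return 0/1 flags in FFT order; mode k is kept when 3*|k| < n."""
    mask = []
    for idx in range(n):
        # FFT ordering: indices above n/2 hold the negative wavenumbers.
        k = idx if idx <= n // 2 else idx - n
        mask.append(1 if 3 * abs(k) < n else 0)
    return mask

def apply_mask(coeffs, mask):
    """Zero the Fourier coefficients of the discarded modes."""
    return [c * m for c, m in zip(coeffs, mask)]

mask12 = dealias_mask(12)  # keeps k = -3..3
filtered = apply_mask([1.0 + 0.0j] * 12, mask12)
```

With n = 12 the retained modes are k = -3..3, so the product of two retained modes reaches at most |k| = 6, which wraps onto a discarded mode rather than contaminating a retained one.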
30 Device memory Multi-processors screen camera 64 microphone spherical array Forms an audio camera
31 Audio Camera. Spherical array of microphones. Using beamforming algorithms we developed, we can find sounds coming from particular directions. Run several beamformers, one per look direction, and assign the output to an audio pixel. Compose an audio image. Transform the spherical array into a camera for audio images. Requires significant processing to form pixels from all directions in a frame before the next frame is ready. [Figure axes: elevation θ, azimuth φ.]
32 O'Donovan et al.: Several papers in IEEE CVPR, IEEE ICASSP, WASPAA ( )
33 Plasma Computations via PIC
34 Data structures for coalesced access. Particles model a density or real particles. The right hand side of the evolution equation is controlled by a PDE for the field, solved on a regular grid either spectrally or via finite differences. Before/after each time step, field quantities must be interpolated at grid nodes to/from particles. Particles in a box are organized using octrees created via bit interleaving, resulting in a Morton curve layout, with update procedures at the end of each time step. George Stantchev, William Dorland, Nail Gumerov, Fast parallel particle-to-grid interpolation for plasma PIC simulations on the GPU, J. Parallel Distrib. Comput., 2008.
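The bit-interleaving step can be sketched as follows. This is an illustrative Python version, not the cited GPU implementation; the function names, the choice of x as the least significant bit, and the unit-cube domain are assumptions.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def box_index(x, level):
    """Cell index along one axis for a coordinate in [0, 1) at an octree level."""
    cells = 1 << level
    return min(int(x * cells), cells - 1)

# Sort particles in the unit cube by the Morton key of their level-1 box.
particles = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.6, 0.1)]
level = 1
keys = [morton3d(*(box_index(c, level) for c in p), bits=level) for p in particles]
order = sorted(range(len(particles)), key=lambda i: keys[i])
```

Sorting by the key places particles from the same octree box next to each other in memory, which is the property that makes coalesced access possible.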
35 Numerical relativity Beginning collaboration with Prof. Tiglio's group Hope to report more later
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More informationIntroducing Overdecomposition to Existing Applications: PlasComCM and AMPI
Introducing Overdecomposition to Existing Applications: PlasComCM and AMPI Sam White Parallel Programming Lab UIUC 1 Introduction How to enable Overdecomposition, Asynchrony, and Migratability in existing
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationCUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata
CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids
More informationdesigning a GPU Computing Solution
designing a GPU Computing Solution Patrick Van Reeth EMEA HPC Competency Center - GPU Computing Solutions Saturday, May the 29th, 2010 1 2010 Hewlett-Packard Development Company, L.P. The information contained
More informationGeorgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing
Real-Time Rigid id 2D-3D Medical Image Registration ti Using RapidMind Multi-Core Platform Georgia Tech/AFRL Workshop on Computational Science Challenge Using Emerging & Massively Parallel Computer Architectures
More informationGPUs and Einstein s Equations
GPUs and Einstein s Equations Tim Dewey Advisor: Dr. Manuel Tiglio AMSC Scientific Computing University of Maryland May 5, 2011 Outline 1 Project Summary 2 Evolving Einstein s Equations 3 Implementation
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationHigh-Performance Computing Using GPUs
High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy
More informationParallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction
Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction Sergey Kopysov, Igor Kuzmin, Alexander Novikov, Nikita Nedozhogin, and Leonid Tonkov Institute of Mechanics,
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationAdaptive Mesh Astrophysical Fluid Simulations on GPU. San Jose 10/2/2009 Peng Wang, NVIDIA
Adaptive Mesh Astrophysical Fluid Simulations on GPU San Jose 10/2/2009 Peng Wang, NVIDIA Overview Astrophysical motivation & the Enzo code Finite volume method and adaptive mesh refinement (AMR) CUDA
More informationPhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.
Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences
More informationFast Radial Basis Functions for Engineering Applications. Prof. Marco Evangelos Biancolini University of Rome Tor Vergata
Fast Radial Basis Functions for Engineering Applications Prof. Marco Evangelos Biancolini University of Rome Tor Vergata Outline 2 RBF background Fast RBF on HPC Engineering Applications Mesh morphing
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationPerformance of Implicit Solver Strategies on GPUs
9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used
More informationThe Fast Multipole Method (FMM)
The Fast Multipole Method (FMM) Motivation for FMM Computational Physics Problems involving mutual interactions of N particles Gravitational or Electrostatic forces Collective (but weak) long-range forces
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationUltra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.
Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationApplication of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures
Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationAn Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos
An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on
More informationStokes Preconditioning on a GPU
Stokes Preconditioning on a GPU Matthew Knepley 1,2, Dave A. Yuen, and Dave A. May 1 Computation Institute University of Chicago 2 Department of Molecular Biology and Physiology Rush University Medical
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationAccelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware
NSF REU - 2018: Project Report Accelerating the Fast Fourier Transform using Mixed Precision on Tensor Core Hardware Anumeena Sorna Electronics and Communciation Engineering National Institute of Technology,
More information