S4289: Efficient solution of multiple scalar and block-tridiagonal equations
1 S4289: Efficient solution of multiple scalar and block-tridiagonal equations
Endre László, endre.laszlo [at] oerc.ox.ac.uk
Oxford e-Research Centre, University of Oxford, UK
Pázmány Péter Catholic University, Budapest, Hungary
Mike Giles (Oxford), Jeremy Appleyard (NVIDIA)
GPU Technology Conference, March 26th, 2014, San Jose
2 Outline for GPU developers
1 Batch scalar-tridiagonal solvers
- ADI (Alternating Direction Implicit) method
- Thomas algorithm
- Multi-dimensional data structures - access patterns
- Optimization: local data transposition in shared memory
- Optimization: local data transposition with __shfl()
- Thomas-PCR hybrid
- Comparison to CPU, Xeon Phi and the LAPACK tridiagonal solver
2 Batch block-tridiagonal solver
- Block-tridiagonal data structure - access patterns
- Work-sharing on the GPU
- Comparison to CPU and the LAPACK banded solver
Conclusion
3 Example: Solving the heat equation with ADI
The heat-diffusion equation is the PDE that is solved with the method:
$$\frac{\partial u}{\partial t} = \nabla^2 u \qquad (1)$$
ADI (Alternating Direction Implicit) method:
- Classical FD scheme
- Computationally cheaper than Crank-Nicolson
- Relies on approximate factorization
- $O(\Delta t^2, \Delta x^2)$ order accurate in both space and time
- Unconditionally stable if the parameters are chosen right (positive)
- Introduced by Peaceman and Rachford [1]
[1] D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.
4 Example: Solving the heat equation with ADI
3 tridiagonal solves along dimensions X, Y, Z:
$$\begin{aligned}
\text{preproc:} &\quad u^{(0)} = \lambda\,(\delta_x^2 + \delta_y^2 + \delta_z^2)\,u^n \\
\text{x dim:} &\quad (1 - \lambda\delta_x^2)\,u^{(1)} = u^{(0)} \\
\text{y dim:} &\quad (1 - \lambda\delta_y^2)\,u^{(2)} = u^{(1)} \\
\text{z dim:} &\quad (1 - \lambda\delta_z^2)\,u^{(3)} = u^{(2)} \\
\text{add:} &\quad u^{n+1} = u^n + u^{(3)}
\end{aligned}$$
The upcoming discussion of tridiagonal solvers is in the context of the ADI method.
5 A tridiagonal system
Storage:
- 3 coefficient arrays
- 1 solution array
- 1 RHS array
All stored in a cubic data structure.
$$\begin{pmatrix}
b_0 & c_0 & & & \\
a_1 & b_1 & c_1 & & \\
 & a_2 & b_2 & c_2 & \\
 & & \ddots & \ddots & \ddots \\
 & & & a_{N-1} & b_{N-1}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{N-1} \end{pmatrix}
=
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_{N-1} \end{pmatrix}$$
6 Solving tridiagonal systems
Assumptions:
- Stems from real-world CFD and financial applications
- Computation domain: structured, multidimensional
- Hypercubic-ish: $\Omega = \mathbb{R}^{N_0 \times N_1 \times \cdots \times N_D}$, $D = 2..8$
- Large number of systems: $N^2$ on an $N^3$ cube - enough to saturate the GPU
- System sizes on the order of 100s-1000s
- Each system has its own coefficients and RHS
- No pivoting: diagonal dominance required
7 Batch scalar-tridiagonal solvers
cusparse?gtsvStridedBatch() - inefficient under the previous assumptions:
- CR-PCR hybrid
- Lack of multidimensional support: a global data transpose is needed in the Y and Z dimensions
- Extra space requirements: 768 MB for a SP problem
- Uses two kernel calls
There is enough parallelism in batch problems - CR/PCR is not necessarily needed:
- Tesla K40 has 12 GB device memory
- Multidimensional problem domain $N^d$ for dimensions $d = 2..8$: $N^{d-1}$ parallel systems per solve dimension
8 Thomas algorithm
Algorithm 1: Thomas algorithm
Require: thomas(a, b, c, d)
 1: d_0 := d_0 / b_0
 2: c_0 := c_0 / b_0
 3: for i = 1, ..., N-1 do
 4:   r := 1 / (b_i - a_i * c_{i-1})
 5:   d_i := r * (d_i - a_i * d_{i-1})
 6:   c_i := r * c_i
 7: end for
 8: for i = N-2, ..., 0 do
 9:   d_i := d_i - c_i * d_{i+1}
10: end for
11: return d
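To make the pseudocode concrete, here is a minimal CUDA sketch (not the speakers' library code; the signature, the cprime scratch array and the stride parameter are assumptions): one thread solves one system, and the stride argument lets the same routine walk a system laid out along X (stride 1), Y (stride NX) or Z (stride NX*NY).

__device__ void thomas(const float* a, const float* b, const float* c,
                       float* d, float* cprime, int N, int stride)
{
    // Forward pass: normalize row 0, then eliminate the sub-diagonal.
    cprime[0] = c[0] / b[0];
    d[0]      = d[0] / b[0];
    for (int i = 1; i < N; i++) {
        float r = 1.0f / (b[i*stride] - a[i*stride] * cprime[i-1]);
        d[i*stride] = r * (d[i*stride] - a[i*stride] * d[(i-1)*stride]);
        cprime[i]   = r * c[i*stride];
    }
    // Backward pass: back-substitution; the solution overwrites d.
    for (int i = N - 2; i >= 0; i--)
        d[i*stride] = d[i*stride] - cprime[i] * d[(i+1)*stride];
}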
9 Multi-dimensional data structures - access patterns
Data layout: idx = k*NX*NY + j*NX + i
Performance depends on how the threads are mapped to the domain - different efficiency along different dimensions.
Assume a sequential dependence in the algorithm iterating along a dimension (see the sketch below):
- X: stride = 1 - worst performance
- Y: stride = NX - best performance
- Z: stride = NX*NY - good performance if a high TLB miss rate is avoided
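As a concrete illustration (a hypothetical kernel, not the talk's code; the in-place update is a stand-in for a real sweep), here is a Y-solve thread mapping where consecutive threads touch consecutive i, so every access is coalesced:

__global__ void y_solve_map(float* d, int NX, int NY, int NZ)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NX * NZ) return;          // one thread per (i,k) column
    int i = tid % NX;                    // consecutive threads -> consecutive i
    int k = tid / NX;
    // Each thread iterates its own system along j with stride NX; across
    // the warp the addresses differ by 1 element, i.e. fully coalesced.
    for (int j = 1; j < NY; j++) {
        int idx = k*NX*NY + j*NX + i;
        d[idx] += d[idx - NX];           // stand-in for a forward-sweep update
    }
}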
10 Mapping threads to the domain: X/Y/Z-dimension solves
[Figure: time per grid element (ns) for PreProc, X-solve, Y-solve and Z-solve, in SP and DP; an x16.5 gap is marked at the X-solve.]
- X: offset = 1, stride = NX - 4 byte / 32 byte = 12.5% cache-line utilization in SP
- Y: offset = NX, stride = 1 - perfectly coalesced, 100% utilization
- Z: offset = NX*NY, stride = 1 - perfectly coalesced, 100% utilization + moderate TLB hit rate
11 Mapping threads to the domain: X/Y/Z-dimension solves
[Figure: achieved bandwidth (GB/s) of the X-, Y- and Z-solves in SP and DP; Nvidia Tesla K40 (GK110B) peak: 288 GB/s.]
12 TLB (Translation Lookaside Buffer) miss rate
- CUDA uses a Unified Virtual Address Space
- The virtual address space uses memory pages
- Memory page frame pointers are cached in the TLB
- The TLB is a coarser cache that works with the LLC: it translates an address tag to a frame pointer and caches frame pointers from main memory
- On NVIDIA devices the TLB is implemented in hardware and page sizes cannot be changed
- Small page size + long strides -> high TLB miss rate
- NVVP reports it implicitly within the "Global memory replay overhead" counter
- 753 clock cycles of latency in case of a TLB page miss [2]
[2] Measured on GT200 by Wong et al. in "Demystifying GPU microarchitecture through microbenchmarking," Performance Analysis of Systems and Software (ISPASS), 2010.
13 How to cope with TLB miss rate and coalescence?
TLB is easy: remap your solver for better locality
- Change the 2D thread block mapping into 1D thread blocks, so that threads within a block solve the closest neighboring set of systems
- Perform cache/register blocking
Coalesced memory access is more difficult - it is only a problem in the X dimension. Cache blocking is needed:
- Local transpose in shared memory, or
- Local transpose with register shuffle (__shfl() intrinsic), or
- Caching a whole system - Thomas-PCR hybrid
14 Thomas with shared memory transpose
Forward pass:
1 Wrap a warp (32 threads) into 4x8 blocks to perform non-caching (32-byte) loads
2 Load a 32x8 tile into shared memory: 8 steps of 4x8 block loads
3 Transpose the data by putting values into registers: float a[8]; is compiled to 8 registers if the array indexing is known at compile time
4 Perform the calculation with the 8 values along the X dimension
5 Repeat from 2 until the end of the X dimension is reached
Backward pass:
1 Do the same backwards: transpose + store
[Figure: a 32-row x 8-column tile loaded from the register file into shared memory in 32-byte steps, then transposed so that each thread holds one row in float reg[8].]
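A minimal CUDA sketch of steps 1-3 (a hypothetical fragment, not the talk's kernel; d, NX, sysBase and x0 are assumed names, and a full solver would repeat this per step 5):

__global__ void x_tile_load(const float* d, int NX, int sysBase, int x0)
{
    __shared__ float tile[32][8 + 1];    // +1 padding column avoids bank conflicts
    int lane = threadIdx.x & 31;         // lane id within the warp
    int row  = lane / 8;                 // warp wrapped into 4x8 blocks:
    int col  = lane % 8;                 // 8 lanes fetch one 32-byte segment
    // Step 2: 8 steps of 4x8 block loads fill the 32x8 tile.
    for (int s = 0; s < 8; s++)
        tile[4*s + row][col] = d[(sysBase + 4*s + row) * NX + x0 + col];
    __syncthreads();
    // Step 3: transposed view - thread 'lane' now owns one row of 8
    // consecutive x-values; compile-time indexing keeps reg[] in registers.
    float reg[8];
    #pragma unroll
    for (int v = 0; v < 8; v++)
        reg[v] = tile[lane][v];
    // Step 4 (the Thomas forward sweep on reg[0..7]) would go here.
}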
15 Thomas with register shuffle transpose
Forward pass:
1 Wrap 32 threads into 8x4 blocks to perform 4 x float4 vector loads
2 Load a 32x16 tile into registers: 4 threads read 4 consecutive float4 vectors = 64 bytes; do this 4 times for the rows under each other
3 Transpose the data within 4 threads: the 4 threads exchange data on a 4x4 2D array with __shfl() on float4 values; each element in the 2D array is a float4 vector
4 Perform the calculation with the 16 values along the X dimension
5 Repeat from 2 until the end of the X dimension is reached
Backward pass:
1 Do the same backwards: transpose + store
[Figure: 4 read steps of 64-byte float4 loads filling a 32-row x 16-column tile in the register file, then a 4x4 transpose of float4 elements leaving each thread with float reg[16].]
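The shfl(float4) exchange in step 3 is not a single hardware instruction; a helper like the sketch below (an assumption for illustration, not the talk's code) composes it from four 32-bit shuffles, written here with the modern __shfl_sync() intrinsic:

__device__ float4 shfl_float4(float4 v, int srcLane)
{
    float4 r;
    r.x = __shfl_sync(0xffffffffu, v.x, srcLane);   // each component is a
    r.y = __shfl_sync(0xffffffffu, v.y, srcLane);   // separate 32-bit shuffle
    r.z = __shfl_sync(0xffffffffu, v.z, srcLane);
    r.w = __shfl_sync(0xffffffffu, v.w, srcLane);
    return r;
}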
16 Thomas/PCR hybrid algorithm
17 Performance comparison
[Figure: time per grid element (ns) for Trid-X and Trid-Y, in SP and DP, comparing: Naïve, Shared transpose, Register shuffle, Thomas-PCR hybrid, cuSPARSE, 2-socket Xeon LAPACKE, and Xeon Phi.]
CPU: Intel Xeon E5-2680, 2-socket, 40 MB, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
18 Scalar-tridiagonal library use with OpenACC

int main() {
    int n = NX*NY*NZ;
    float* u  = (float *) malloc(sizeof(float)*n);
    float* ax = (float *) acc_malloc(sizeof(float)*n);
    ...
    #pragma acc data copy(u[0:n]) deviceptr(ax,bx,cx,ay,by,cy,az,bz,cz,du)
    for (it = 0; it < iter; it++) {
        ...
        // calculate r.h.s. and set tri-diagonal coefficients
        int ndim = 3;
        int dims[3] = {256,256,256};
        int pads[3] = {320,320,320};
        int solvedim = 0; // X-solve
        tridSmtsvStridedBatch(ax, bx, cx, du, u, ndim, solvedim, dims, pads);
        solvedim = 1; // Y-solve
        tridSmtsvStridedBatch(ay, by, cy, du, u, ndim, solvedim, dims, pads);
    }
    ...
}
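Note the design visible in this listing: the same padded cube is passed to every call and only solvedim changes, so the Y-solve (and a Z-solve with solvedim = 2) needs no global data transpose - the main inefficiency of the cuSPARSE route noted earlier. The pads argument (320 vs. dims of 256) presumably pads each row so that accesses stay aligned.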
19 Batch block-tridiagonal solver
Motivation for a block solver:
- State variables in CFD/finance PDEs have inter-dependence
- Block matrices with block sizes of 2-8
Sub-problems to be solved:
- Inverting and multiplying blocks (matrices) involves branching
- Data storage shortage limits the number of systems on the device
Optimization strategies on GPUs:
- Data storage - for better data locality
- Work sharing - to increase parallelism
- Inter-thread communication with shared memory
- Inter-thread communication with register shuffle
20 Batch block-tridiagonal solver work sharing
Threads within a warp compute (a sketch of the matrix-vector case follows below):
- Matrix-matrix products
- Matrix-vector products
- Gauss-Jordan block solves
A thread stores one column of a block and one scalar value of a vector. Special attention is needed to help register allocation. The algorithms are implemented with shared memory or __shfl() intrinsic communication. In the worst case (M = 7), 4 threads out of 32 are idle.
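A minimal sketch of the matrix-vector product under this layout (an assumption for illustration, not the talk's code; it takes M to be a power of two and the M cooperating lanes aligned within the warp): thread j holds column j of an MxM block plus the scalar x_j, and a butterfly reduction delivers each row sum.

__device__ float block_matvec_row(const float col[/*M*/], float x, int i, int M)
{
    // Thread j's contribution to row i of y = A*x: A[i][j] * x_j.
    float p = col[i] * x;
    // Butterfly reduction across the M cooperating lanes;
    // afterwards every lane holds y_i.
    for (int ofs = M >> 1; ofs > 0; ofs >>= 1)
        p += __shfl_xor_sync(0xffffffffu, p, ofs);
    return p;
}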
21 Batch block-tridiagonal matrix storage
- Blocks are stored in a row-major format
- Blocks of different problems are interleaved for better data locality:
$$A_0^0\, A_1^0 \cdots A_{P-1}^0\;\; A_0^1\, A_1^1 \cdots A_{P-1}^1\;\; A_0^2\, A_1^2 \cdots \qquad (2)$$
where $A_p^i$ denotes block $i$ of problem $p$ and $P$ is the number of problems.
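In code, the layout of Eq. (2) amounts to an addressing helper like this sketch (a hypothetical convention consistent with the equation; row-major MxM blocks assumed):

__host__ __device__ int block_offset(int p, int i, int P, int M)
{
    // All P problems' i-th blocks sit next to each other, so threads
    // working on different problems at the same step access neighbours.
    return (i * P + p) * M * M;   // start of block i of problem p
}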
22 Batch block-tridiagonal solver
- Two versions: shared memory, register shuffle
- Register spill above 8x8 DP block size
- Approx. 8-16k threads saturate the GPU
- Low shared memory use - 576 (1125) bytes per thread block in SP (DP) - good occupancy
Profile of the shared-memory SP 8x8 version:
- Shared memory efficiency: 84.5%
- Shared memory load/store throughput: 1700 GB/s
- L2 hit rate (L1 reads): 51.9%
- Executed IPC: 1.68
- Texture cache hit rate: 50%
In SP the shuffle version is better; in DP the shared memory version is better.
23 Data and compute throughput
[Figure: (a) effective data throughput (GB/s) and (b) compute throughput (GFLOPS) versus block size M, in SP and DP.]
CPU: Intel Xeon E5-2690, 2-socket, 40 MB, 16 cores (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s
24 Performance comparison
Baseline: multi-threaded LAPACKE_?gbsv_work() banded solver
[Figure: speedup over LAPACKE versus block size M for CPU, GPU - Shared, and GPU - Shuffle; (a) single precision, (b) double precision.]
25 Conclusion
Batch tridiagonal solvers:
- Scalar solver
  - Different optimization strategies: Thomas with shared memory transpose, Thomas with register shuffle transpose, Thomas/PCR hybrid
  - Library-quality solution for scalar tridiagonal solvers
- Block solver
  - High-throughput solver
  - Higher performance than a vectorized, multi-threaded CPU block solver or the banded LAPACK(E) solver
The contributions of the NVIDIA-funded summer interns James Whittle and Catherine Hastings are acknowledged.