Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units

Khor Shu Heng
Engineering Science Programme, National University of Singapore

Abstract

This paper presents a parallel alternating direction implicit (ADI) solver for the two-dimensional heat diffusion problem on an NVIDIA graphics processing unit (GPU). The first section gives a brief introduction to Compute Unified Device Architecture (CUDA), the programming interface for parallel programming on NVIDIA GPUs. The second section describes the implementation of the tridiagonal system solver and the setup of the corresponding right-hand sides for the implicit solutions in the x and y directions. The tridiagonal solver is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. The original algorithm does not support system sizes that are not powers of two and uses 5n shared memory, where n is the tridiagonal system size. We observe that the shared memory usage can be reduced to 3n when the tridiagonal system is symmetric with uniform elements on the diagonals. A slight modification handles system sizes that are not powers of two. We have also attempted to make the computation of the right-hand sides as efficient as possible, especially for the solution in the y direction. Using the CUDA Visual Profiler, the performance of the GPU ADI solver was compared with a serial CPU implementation based on Gaussian elimination without pivoting. Reasonable acceleration was achieved for both float and double computation.

1. Introduction

1.1 Parallel computing using Graphics Processing Units

Graphics processing units are designed for computational tasks that exhibit fine-grained data parallelism with a high ratio of arithmetic operations to memory operations. In three-dimensional graphics rendering, large sets of pixel and vertex data are mapped onto parallel processing threads. Modern GPUs are highly parallel and multithreaded, with many more processor cores than a CPU. This makes the GPU a viable and cheaper alternative for parallel programming compared with multicore CPUs, vector computers and grid computing.

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing architecture whose C language extensions let the user exploit the parallelism of supported NVIDIA GPUs for data-parallel tasks. Since then, CUDA has gained popularity in the high performance computing community and has been applied in diverse fields: computational finance, computational fluid dynamics, image reconstruction for CT scans and molecular simulations.

In general, programming in CUDA involves memory transfers between the host memory (CPU) and the device memory (GPU). The host calls a kernel, which performs the parallel computing task in device memory. The result is then copied back to host memory.

Figure 1. GPU architecture

The GPU processes data in parallel using threads. Threads are grouped into blocks. Each thread has a private local memory, whereas each block has a shared memory bank accessible to all threads within the block. All threads have access to the same global memory. More information on the programming methodology and optimization techniques is available in [2].

1.2 Alternating Direction Method

The two-dimensional heat diffusion problem is governed by the partial differential equation

$$\frac{\partial u}{\partial t} = \kappa\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right),$$

where u is the temperature and $\kappa$ is the thermal diffusivity.

Explicit schemes for diffusion problems are numerically unstable unless the time step is small, so such problems are usually solved with implicit schemes such as the Crank-Nicolson scheme. While the linear systems generated by the Crank-Nicolson scheme are tridiagonal in the one-dimensional case, this is no longer true in two dimensions, and more computational effort is required to solve them. The alternating direction method circumvents this problem by halving the time step and solving the governing equation implicitly in one spatial dimension for each sub-step, using the central second difference in x for the first sub-step and similarly in y for the second.

For a grid of size n × n, n independent, diagonally dominant tridiagonal systems of size n are generated for the implicit solution in each dimension.

2. Implementation of Alternating Direction Method in CUDA

2.1 Mapping of higher dimensional array

To work with a two-dimensional problem in CUDA, one can map the two-dimensional temperature data onto a one-dimensional array: element (i, j), with i the index along x and j the index along y, is stored at offset i + j × pitch.

Notice that the pitch is not necessarily equal to the row length. One should use the cudaMallocPitch function to allocate the array with a suitable pitch automatically, so that memory accesses along the x direction are coalesced. Non-coalesced memory access is slower and may reduce overall performance.
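For reference, the two half-steps of the alternating direction method described in Section 1.2 can be written out as below. This is a sketch in notation assumed here rather than taken from the paper: a uniform grid spacing h, a full time step $\Delta t$, a diffusivity $\kappa$, and $\alpha = \kappa\,\Delta t / h^2$. The form is chosen so that it agrees with the coefficients (1 - alpha) and alpha/2 that appear in the right-hand-side kernel of Section 2.2.

```latex
\begin{aligned}
\left(1 - \tfrac{\alpha}{2}\,\delta_x^2\right) u^{\,n+1/2}_{i,j}
  &= \left(1 + \tfrac{\alpha}{2}\,\delta_y^2\right) u^{\,n}_{i,j}, \\
\left(1 - \tfrac{\alpha}{2}\,\delta_y^2\right) u^{\,n+1}_{i,j}
  &= \left(1 + \tfrac{\alpha}{2}\,\delta_x^2\right) u^{\,n+1/2}_{i,j}, \\
\delta_x^2 u_{i,j} &= u_{i-1,j} - 2u_{i,j} + u_{i+1,j}, \qquad
\delta_y^2 u_{i,j} = u_{i,j-1} - 2u_{i,j} + u_{i,j+1}.
\end{aligned}
```

Expanding the first line for a fixed row j gives a tridiagonal system in i with main diagonal $1 + \alpha$ and off-diagonals $-\alpha/2$, and right-hand side $(1-\alpha)\,u_{i,j} + \tfrac{\alpha}{2}\,(u_{i,j-1} + u_{i,j+1})$. With Dirichlet boundaries, the known boundary values move to the right-hand side of the first and last equations with coefficient $\alpha/2$.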

2.2 Implicit Solution in x direction

Each independent tridiagonal system is mapped onto a block, and each equation of the system is mapped onto a thread within the block. We declare the array for the right-hand side in shared memory. Computing the right-hand side of each tridiagonal system involves communication between each element in a row and the elements adjacent to it in the y direction.

Excerpt of code from the kernel for right-hand side setup for the x direction:

    #define INDEX(i, j, pitch) ((i) + __mul24((j), (pitch)))

    __global__ void rhssetup(float *rhs, float *u, int m, int pitch, int pitch2, float alpha)
    {
        unsigned int thid = threadIdx.x;   // equation index within one system
        unsigned int blid = blockIdx.x;    // index of the tridiagonal system (grid row)
        unsigned int center = INDEX(thid + 1, blid + 1, pitch);

        // Explicit part: the point itself and its two neighbours in y.
        if (thid < m)
            rhs[INDEX(thid, blid, pitch2)] = (1 - alpha) * u[center]
                + alpha / 2 * (u[center - pitch] + u[center + pitch]);
        __syncthreads();

        // Boundary values enter the first and last equations. The guard is added
        // here so that only one thread per block performs the update; the excerpt
        // in the source shows the two statements without a guard.
        if (thid == 0)
        {
            rhs[blid * pitch2] += (alpha / 2) * u[(blid + 1) * pitch];
            // Column index m+2 is kept as printed in the source.
            rhs[INDEX(m - 1, blid, pitch2)] += (alpha / 2) * u[INDEX(m + 2, blid + 1, pitch)];
        }
    }

__mul24 is a CUDA intrinsic for fast 24-bit integer multiplication. The pitch differs between the right-hand side array and the temperature distribution array because the x and y dimensions are different in general.

2.3 Parallel Cyclic Reduction

In this work, the tridiagonal solver is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. Parallel cyclic reduction is a variant of the cyclic reduction algorithm first proposed by Hockney and Golub in 1965 [3]. It differs from cyclic reduction in having only the forward reduction phase. The algorithm solves a tridiagonal system of size n in $\log_2 n$ steps with $12\,n \log_2 n$ computations. In contrast, Gaussian elimination without pivoting solves the same problem in 2n sequential steps.
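To make the 2n-step count concrete, the following is a minimal serial sketch of Gaussian elimination without pivoting for a tridiagonal system (often called the Thomas algorithm). It is an illustration written for this section, not the paper's CPU code; the function name `thomas_solve` and the array layout (a = sub-diagonal, b = main diagonal, c = super-diagonal, d = right-hand side) are assumptions.

```c
#include <assert.h>

/* Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
 * a[0] and c[n-1] are ignored. b and d are overwritten.
 * No pivoting: intended for diagonally dominant systems. */
void thomas_solve(int n, const double *a, double *b, const double *c,
                  double *d, double *x)
{
    assert(n >= 1);

    /* Forward elimination: n - 1 sequential steps. */
    for (int i = 1; i < n; i++) {
        double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }

    /* Back substitution: n sequential steps. */
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
}
```

Each loop iteration depends on the previous one, which is why this method offers no parallelism within a single system and why a reduction-based method is attractive on a GPU.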

The idea of parallel cyclic reduction is to reduce the original tridiagonal system recursively to smaller systems of half the size. Consider an n × n tridiagonal system:

$$a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 0, \dots, n-1, \qquad a_0 = c_{n-1} = 0.$$

For each row i, $a_i$ and $c_i$ are eliminated by row operations involving row i and the two rows one stride above and below it. The initial stride is 1. This process updates $b_i$ and $d_i$ and generates new $a_i$ and $c_i$ as fill-in, which couple row i to the rows two strides away. The odd-indexed rows and the even-indexed rows have now become two independent tridiagonal systems. Repeating the process with the stride doubled each time yields smaller and smaller independent systems.

Assume for the moment that the system size is a power of two. Iterating the forward reduction phase $\log_2 n - 1$ times yields $n/2$ independent tridiagonal systems of size 2. In Zhang et al.'s implementation, the updated values of a, b, c and d always overwrite the original ones, hence only 5n storage (including the right-hand side d and the solution x) is needed in shared memory.

2.4 System size of non-power-of-two

For system sizes that are not powers of two, the number of forward reduction iterations is rounded up (ceil). The end result is then a mixture of systems of size two and systems of size one.
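The reduction described in Sections 2.3 and 2.4 can be emulated serially to check the arithmetic for any system size. The sketch below is an illustration under assumptions, not the code of Zhang et al. or of this paper: the name `pcr_solve` is hypothetical, each pass of the inner loop plays the role of one GPU thread, and temporary arrays stand in for the `__syncthreads()` barrier that separates reads from writes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parallel cyclic reduction, emulated serially.
 * Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1.
 * a, b, c, d are overwritten; n need not be a power of two. */
void pcr_solve(int n, double *a, double *b, double *c, double *d, double *x)
{
    assert(n >= 1);
    double *tmp = malloc(4 * (size_t)n * sizeof *tmp);
    assert(tmp != NULL);
    double *an = tmp, *bn = tmp + n, *cn = tmp + 2 * n, *dn = tmp + 3 * n;

    a[0] = 0.0;        /* no coupling above the first row */
    c[n - 1] = 0.0;    /* no coupling below the last row  */

    int delta = 1;
    while (2 * delta < n) {   /* stop when every subsystem has at most 2 rows */
        for (int i = 0; i < n; i++) {          /* one "thread" per row */
            int lo = i - delta, hi = i + delta;
            an[i] = 0.0; cn[i] = 0.0; bn[i] = b[i]; dn[i] = d[i];
            if (lo >= 0) {                     /* eliminate a[i] using row lo */
                double k1 = a[i] / b[lo];
                bn[i] -= c[lo] * k1;
                dn[i] -= d[lo] * k1;
                an[i] = -a[lo] * k1;
            }
            if (hi < n) {                      /* eliminate c[i] using row hi */
                double k2 = c[i] / b[hi];
                bn[i] -= a[hi] * k2;
                dn[i] -= d[hi] * k2;
                cn[i] = -c[hi] * k2;
            }
        }
        memcpy(a, an, (size_t)n * sizeof *a);
        memcpy(b, bn, (size_t)n * sizeof *b);
        memcpy(c, cn, (size_t)n * sizeof *c);
        memcpy(d, dn, (size_t)n * sizeof *d);
        delta *= 2;
    }

    /* Rows i and i+delta now form independent systems of size 2,
     * or of size 1 when row i+delta does not exist. */
    for (int i = 0; i < delta && i < n; i++) {
        int j = i + delta;
        if (j < n) {
            double det = b[j] * b[i] - c[i] * a[j];
            x[i] = (b[j] * d[i] - c[i] * d[j]) / det;
            x[j] = (d[j] * b[i] - d[i] * a[j]) / det;
        } else {
            x[i] = d[i] / b[i];
        }
    }
    free(tmp);
}
```

For n = 5, for example, the loop runs twice (strides 1 and 2) and finishes with stride 4, leaving one system of size two (rows 0 and 4) and three systems of size one.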

The original code section for the back substitution can be changed from:

    if (thid < delta)
    {
        int addr1 = thid;
        int addr2 = thid + delta;
        float tmp3 = b[addr2] * b[addr1] - c[addr1] * a[addr2];
        x[addr1] = (b[addr2] * d[addr1] - c[addr1] * d[addr2]) / tmp3;
        x[addr2] = (d[addr2] * b[addr1] - d[addr1] * a[addr2]) / tmp3;
    }

To:

    if (thid < delta)
    {
        int addr1 = thid;
        int addr2 = thid + delta;
        if (addr2 < n)
        {
            // System of size two: rows addr1 and addr2.
            // tmp3 is computed inside the branch so that b[addr2] and a[addr2]
            // are only read when addr2 is a valid row.
            float tmp3 = b[addr2] * b[addr1] - c[addr1] * a[addr2];
            x[addr1] = (b[addr2] * d[addr1] - c[addr1] * d[addr2]) / tmp3;
            x[addr2] = (d[addr2] * b[addr1] - d[addr1] * a[addr2]) / tmp3;
        }
        else
        {
            // System of size one: row addr1 has no partner.
            x[addr1] = d[addr1] / b[addr1];
        }
    }

The first branch (addr2 < n) solves a system of size two, whereas the else branch solves a system of size one.

2.5 Symmetric tridiagonal system with uniform elements on the diagonals

For the Dirichlet boundary condition, the tridiagonal system involved is symmetric with uniform elements on the diagonals. For such a system, we observe the following:

1) The upper and lower diagonals of the new tridiagonal systems formed during the forward reduction phase are filled with identical elements.
2) Only the first and the last elements of the main diagonals have values different from the other elements on the main diagonal.
3) Computing each subsequent value of the new a and c only requires knowledge of the initial value of b.

These observations allow us to reduce the shared memory storage and the number of memory read/write operations. Only b and d need to be stored in shared memory. a can be dropped, and c is stored in a register. Consider the original code section:

    for (int j = 0; j < iteration; j++)
    {
        int i = thid;
        if (i < delta)
        {
            // No row above at this stride.
            float tmp2 = c[i] / b[i + delta];
            bnew = b[i] - a[i + delta] * tmp2;
            dnew = d[i] - d[i + delta] * tmp2;
            anew = 0;
            cnew = -c[i + delta] * tmp2;
        }
        else if ((systemsize - i - 1) < delta)
        {
            // No row below at this stride.
            float tmp = a[i] / b[i - delta];
            bnew = b[i] - c[i - delta] * tmp;
            dnew = d[i] - d[i - delta] * tmp;
            anew = -a[i - delta] * tmp;
            cnew = 0;
        }
        else
        {
            float tmp1 = a[i] / b[i - delta];
            float tmp2 = c[i] / b[i + delta];
            bnew = b[i] - c[i - delta] * tmp1 - a[i + delta] * tmp2;
            dnew = d[i] - d[i - delta] * tmp1 - d[i + delta] * tmp2;
            anew = -a[i - delta] * tmp1;
            cnew = -c[i + delta] * tmp2;
        }
        __syncthreads();
        b[i] = bnew;
        d[i] = dnew;
        a[i] = anew;
        c[i] = cnew;
        delta *= 2;
        __syncthreads();
    }

This can be replaced by:

    for (int j = 0; j < iteration; j++)
    {
        float temp = c / B;   // uses the uniform values before this step
        int i = thid;
        if (i < delta)
        {
            float tmp2 = c / b[i + delta];
            bnew = b[i] - c * tmp2;
            dnew = d[i] - d[i + delta] * tmp2;
        }
        else if ((systemsize - i - 1) < delta)
        {
            float tmp = c / b[i - delta];
            bnew = b[i] - c * tmp;
            dnew = d[i] - d[i - delta] * tmp;
        }
        else
        {
            float tmp1 = c / b[i - delta];
            float tmp2 = c / b[i + delta];
            bnew = b[i] - c * (tmp1 + tmp2);
            dnew = d[i] - d[i - delta] * tmp1 - d[i + delta] * tmp2;
        }
        __syncthreads();
        b[i] = bnew;
        d[i] = dnew;
        delta *= 2;
        __syncthreads();
        // Update the uniform diagonal values held in registers.
        B = B - 2 * c * temp;
        c *= -temp;
    }

Here B is initialized to the original value of the main diagonal element and c to the original off-diagonal element; both are scalars held in registers. Reading from and writing to registers is much faster than accessing shared memory. Profiling with the CUDA Visual Profiler shows that this replacement reduces the computation time by about one fifth.

2.6 Implicit Solution in y direction

After the implicit solution in the x direction has been computed, the right-hand side for the implicit solution in the y direction can be computed in a similar manner as in the x direction.
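As a check on the reduced-storage loop of Section 2.5, the scalar recurrences B ← B − 2c²/B and c ← −c²/B can be tried out in a serial emulation. The sketch below is an assumption-laden illustration rather than the paper's kernel: the name `pcr_solve_uniform` is hypothetical, the system has a constant main diagonal B and constant off-diagonals c, and only b and d are kept as arrays, mirroring the two arrays the paper keeps in shared memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parallel cyclic reduction for a symmetric tridiagonal system with uniform
 * main diagonal B and uniform off-diagonals c, emulated serially.
 * d is overwritten; the solution is returned in x. */
void pcr_solve_uniform(int n, double B, double c, double *d, double *x)
{
    assert(n >= 1);
    double *b = malloc(3 * (size_t)n * sizeof *b);
    assert(b != NULL);
    double *bn = b + n, *dn = b + 2 * n;
    for (int i = 0; i < n; i++)
        b[i] = B;

    int delta = 1;
    while (2 * delta < n) {
        double temp = c / B;                 /* uniform values before the step */
        for (int i = 0; i < n; i++) {        /* one "thread" per row */
            bn[i] = b[i];
            dn[i] = d[i];
            if (i - delta >= 0) {
                double k1 = c / b[i - delta];
                bn[i] -= c * k1;
                dn[i] -= d[i - delta] * k1;
            }
            if (i + delta < n) {
                double k2 = c / b[i + delta];
                bn[i] -= c * k2;
                dn[i] -= d[i + delta] * k2;
            }
        }
        memcpy(b, bn, (size_t)n * sizeof *b);
        memcpy(d, dn, (size_t)n * sizeof *d);
        B -= 2.0 * c * temp;                 /* new uniform main diagonal */
        c *= -temp;                          /* new uniform off-diagonal  */
        delta *= 2;
    }

    /* Final systems of size two (rows i, i+delta) or size one. */
    for (int i = 0; i < delta && i < n; i++) {
        int j = i + delta;
        if (j < n) {
            double det = b[j] * b[i] - c * c;
            x[i] = (b[j] * d[i] - c * d[j]) / det;
            x[j] = (d[j] * b[i] - d[i] * c) / det;
        } else {
            x[i] = d[i] / b[i];
        }
    }
    free(b);
}
```

The off-diagonal coupling is read from the scalar c only where the neighbouring row exists, which is why a single register value suffices even though the true fill-in is zero for rows that have lost a neighbour.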

The code excerpt for the right-hand side computation in the y direction is:

    unsigned int thid = threadIdx.x;
    unsigned int blid = blockIdx.x;
    unsigned int center = INDEX(blid + 1, thid + 1, pitch);
    if (thid < n)
        rhs[INDEX(thid, blid, pitch2)] = (1 - alpha) * u[center]
            + alpha / 2 * (u[center - 1] + u[center + 1]);

However, the profiler result shows that the right-hand side computation for the implicit solution in the y direction is much less efficient than that for the x direction. This is due to the non-coalesced memory access pattern: consecutive threads read elements of u that are one pitch apart, which is much slower than reading adjacent elements.

Fig. Profiling result using CUDA Visual Profiler. ADISolve refers to the routine for the tridiagonal solver in the x and y directions, memcpyDtoH refers to the memory transfer from device memory to host memory, initialize refers to the routine that sets up the initial condition, and rhssetup and rhssetup2 refer to the right-hand side computation for the x direction and the y direction respectively.

3. Result

The ADI solver was implemented on an NVIDIA GTX285 GPU, which supports double precision. The serial version of the ADI solver was based on Gaussian elimination without pivoting and was run on an Intel Core2Duo E8400 CPU at 3.0 GHz with 4 GB of RAM. The heat diffusion problem tested has the Dirichlet boundary condition. Both codes were tested in float precision and double precision for three dimension sizes (including the boundary points) and for 3000 time steps. The time taken for memory transfer from device to host was included. Below is the summary of the timing results; entries marked (missing) are absent from this copy.

    System size   GPU float   GPU double   CPU float   CPU double
    (missing)     (missing)   1.75 s       2.26 s      2.33 s
    (missing)     (missing)   3.06 s       (missing)   (missing)
    (missing)     (missing)   9.89 s       (missing)   (missing)

Only NVIDIA graphics cards with compute capability 1.3 can run double precision computation. Current generation GPUs have considerably lower throughput for double precision than for float precision, which makes them less suitable when high precision is necessary. Owing to the limited shared memory size, we have not implemented the code for larger system sizes. It is possible to implement the same algorithm using global memory, but this would incur a performance penalty because of the low bandwidth of global memory access.

4. Conclusion

Reasonable acceleration of the tridiagonal system solver has been achieved. The algorithm presented here can be further optimized by improving the right-hand side computation routine. Future work will concentrate on extending the algorithm to system sizes that do not fit in shared memory.

Reference

[1] Zhang Y., Cohen J., Owens J.D. Fast Tridiagonal Solver on the GPU. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[2] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide, Version 2.0.
[3] R.W. Hockney, C.R. Jesshope. Parallel Computers. Adam Hilger, Bristol, 1981.

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Computational Acceleration of Image Inpainting Alternating-Direction Implicit (ADI) Method Using GPU CUDA

Computational Acceleration of Image Inpainting Alternating-Direction Implicit (ADI) Method Using GPU CUDA Computational Acceleration of Inpainting Alternating-Direction Implicit (ADI) Method Using GPU CUDA Mutaqin Akbar Pranowo Suyoto Abstract

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

Scan Primitives for GPU Computing

Scan Primitives for GPU Computing Scan Primitives for GPU Computing Shubho Sengupta, Mark Harris *, Yao Zhang, John Owens University of California Davis, *NVIDIA Corporation Motivation Raw compute power and bandwidth of GPUs increasing

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

S4289: Efficient solution of multiple scalar and block-tridiagonal equations

S4289: Efficient solution of multiple scalar and block-tridiagonal equations S4289: Efficient solution of multiple scalar and block-tridiagonal equations Endre László endre.laszlo [at] Oxford e-research Centre, University of Oxford, UK Pázmány Péter Catholic University,

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information



More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs

Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs Li-Wen Chang and Wen-mei W. Hwu 2.1 Introduction The tridiagonal solver has been recognized as a critical building block for many engineering

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India,

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters

A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters Clemson University TigerPrints All Theses Theses 12-2015 A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters Ashwin Trikuta Srinath Clemson

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Module 12 Floating-Point Considerations

Module 12 Floating-Point Considerations GPU Teaching Kit Accelerated Computing Module 12 Floating-Point Considerations Lecture 12.1 - Floating-Point Precision and Accuracy Objective To understand the fundamentals of floating-point representation

More information

Memory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory

Memory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory Memory Lecture 2: different memory and variable types Prof. Mike Giles Oxford University Mathematical Institute Oxford e-research Centre Key challenge in modern computer architecture

More information

CS 179: GPU Programming. Lecture 7

CS 179: GPU Programming. Lecture 7 CS 179: GPU Programming Lecture 7 Week 3 Goals: More involved GPU-accelerable algorithms Relevant hardware quirks CUDA libraries Outline GPU-accelerated: Reduction Prefix sum Stream compaction Sorting(quicksort)

More information

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- How to Build a gtsv for Performance

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Module 1: Introduction to Finite Difference Method and Fundamentals of CFD Lecture 5:

Module 1: Introduction to Finite Difference Method and Fundamentals of CFD Lecture 5: file:///d:/chitra/nptel_phase2/mechanical/cfd/lecture5/5_1.htm 1 of 1 6/20/2012 12:22 PM The Lecture deals with: Explicit and Implicit Methods file:///d:/chitra/nptel_phase2/mechanical/cfd/lecture5/5_2.htm

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Lecture 2: different memory and variable types

Lecture 2: different memory and variable types Lecture 2: different memory and variable types Prof. Mike Giles Oxford University Mathematical Institute Oxford e-research Centre Lecture 2 p. 1 Memory Key challenge in modern

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 10. Reduction Trees

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 10. Reduction Trees CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, 2007-2012 1 Objective To master Reduction Trees, arguably the

More information


CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

CS 677: Parallel Programming for Many-core Processors Lecture 6

CS 677: Parallel Programming for Many-core Processors Lecture 6 1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: E-mail: Logistics Midterm: March 11

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library Libraries Don t reinvent the wheel. Specialized math libraries are likely faster. BLAS: Basic Linear Algebra Subprograms LAPACK: Linear Algebra Package (uses BLAS) to download

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Point-to-Point Synchronisation on Shared Memory Architectures
