Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units
Khor Shu Heng
Engineering Science Programme
National University of Singapore

Abstract

This paper presents a parallel alternating direction implicit (ADI) solver for the two-dimensional heat diffusion problem on an NVIDIA Graphics Processing Unit (GPU). The first section of the work gives a brief introduction to the Compute Unified Device Architecture (CUDA), the programming interface for parallel programming on NVIDIA GPUs, whereas the second section describes the implementation details of the tridiagonal system solver and the setup of the corresponding right hand sides for the implicit solutions in the x and y directions. The tridiagonal solver used in this work is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. The original algorithm does not support system sizes that are not a power of two and uses 5n shared memory storage, where n is the tridiagonal system size. We noticed that the shared memory usage can be reduced to 3n for cases where the tridiagonal system is symmetric with uniform elements on the diagonals. A slight modification has been made to cater for system sizes that are not a power of two. We have also attempted to make the computation of the right hand sides as efficient as possible, especially for the solution in the y direction. Using the CUDA Visual Profiler, the performance of the GPU ADI solver was compared with a serial CPU implementation based on Gaussian elimination without pivoting. Reasonable acceleration was achieved for both float and double precision computation.

1. Introduction

1.1 Parallel computing using Graphics Processing Units

The Graphics Processing Unit is specially designed for computation tasks that exhibit fine-grained data parallelism with a high ratio of arithmetic operations to memory operations. In three-dimensional graphics rendering, large sets of pixel and vertex data are mapped onto parallel processing threads. Modern GPUs are highly parallel and multithreaded, with many more processing cores than a CPU. This makes the GPU a viable and cheaper alternative for parallel programming compared with multicore CPUs, vector computers and grid computing.
In November 2006, NVIDIA introduced C with CUDA extensions, a general purpose parallel computing architecture which enables the user to use the C language to leverage the parallelism of supported NVIDIA GPUs for data-parallel tasks. Since then, CUDA has gained popularity in the high performance computing community and has been applied in diverse fields: computational finance, computational fluid dynamics, image reconstruction for CT scans and molecular simulations.

In general, programming in CUDA involves memory transfer between the host memory (CPU) and the device memory (GPU). The host calls a kernel, which performs the parallel computing task on data in device memory. The computation result is then written back to the host memory.

Figure 1. GPU architecture

The GPU processes data in parallel using threads. Threads are grouped into blocks. Each thread has a private local memory, whereas each block has a shared memory bank accessible to all threads within the block. All threads have access to the same global memory bank. More information on the programming methodology and optimization techniques is available in [2].
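As a minimal sketch of this workflow (the kernel scale, the array size N and the launch configuration below are illustrative assumptions, not part of the solver), a host program allocates device memory, copies the data in, launches a kernel and copies the result back:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: scales every element of the array by a constant.
__global__ void scale(float *u, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        u[idx] *= factor;
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    float *h_u = (float *)malloc(bytes);            // host copy
    for (int i = 0; i < N; ++i) h_u[i] = 1.0f;

    float *d_u;                                     // device copy
    cudaMalloc((void **)&d_u, bytes);
    cudaMemcpy(d_u, h_u, bytes, cudaMemcpyHostToDevice);

    scale<<<4, 256>>>(d_u, 0.5f, N);                // 4 blocks of 256 threads cover the N elements

    cudaMemcpy(h_u, d_u, bytes, cudaMemcpyDeviceToHost);
    printf("u[0] = %f\n", h_u[0]);

    cudaFree(d_u);
    free(h_u);
    return 0;
}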
1.2 Alternating Direction Method

The two-dimensional heat diffusion problem is governed by the partial differential equation

    \frac{\partial u}{\partial t} = \kappa \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)

Diffusion problems usually suffer from numerical instability under explicit schemes and thus need to be solved using implicit schemes such as the Crank-Nicolson scheme. While the linear equation systems generated by the Crank-Nicolson scheme are tridiagonal in the one-dimensional case, this is no longer true in two dimensions, and more computational effort is required to solve the resulting linear systems. The alternating direction implicit method circumvents this problem by halving the time step and solving the governing partial differential equation implicitly in only one spatial dimension during each sub-step. For the sub-step implicit in x, each interior point (i, j) satisfies

    -\frac{\alpha}{2} u^{n+1/2}_{i-1,j} + (1+\alpha) u^{n+1/2}_{i,j} - \frac{\alpha}{2} u^{n+1/2}_{i+1,j} = \frac{\alpha}{2} u^{n}_{i,j-1} + (1-\alpha) u^{n}_{i,j} + \frac{\alpha}{2} u^{n}_{i,j+1}

where \alpha = \kappa \Delta t / \Delta x^2 is the grid diffusion number (taking \Delta x = \Delta y), and similarly for the sub-step implicit in y. For a grid of size n × n, n independent diagonally dominant tridiagonal systems of size n are generated for the implicit solution in each dimension.

2. Implementation of Alternating Direction Method in CUDA

2.1 Mapping of higher dimensional array

To work with a two-dimensional problem in CUDA, one can map the two-dimensional temperature distribution onto a one-dimensional array in the following manner:

    index(i, j) = i + j * pitch

Notice that the pitch is not necessarily equal to the row length. One should use the cudaMallocPitch function to allocate the array with a suitable pitch so that memory access in the x direction is coalesced. Non-coalesced memory access is slower and may impact overall performance.
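A minimal sketch of pitched allocation and indexing (the grid dimensions nx, ny and the fill kernel are illustrative assumptions):

#include <cuda_runtime.h>

// index(i, j) = i + j * pitch, with the pitch measured in elements
#define INDEX(i, j, pitch) ((i) + (j) * (pitch))

__global__ void fill(float *u, int nx, int ny, int pitchElems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index: consecutive threads
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index
    if (i < nx && j < ny)
        u[INDEX(i, j, pitchElems)] = 0.0f;           // coalesced along x
}

int main(void)
{
    const int nx = 500, ny = 300;                    // illustrative grid dimensions
    float *d_u;
    size_t pitchBytes;

    // cudaMallocPitch pads each row so that every row starts at a well-aligned address.
    cudaMallocPitch((void **)&d_u, &pitchBytes, nx * sizeof(float), ny);
    int pitchElems = (int)(pitchBytes / sizeof(float));

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_u, nx, ny, pitchElems);

    cudaFree(d_u);
    return 0;
}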
2.2 Implicit Solution in the x direction

Each independent tridiagonal system is mapped onto a block, whereas each equation of the tridiagonal system is mapped onto a thread within the block. We declare the array for the right hand side in shared memory. The computation of the right hand side corresponding to each tridiagonal system involves communication between each element in a row and the elements adjacent to it in the y direction.

Excerpt of code from the kernel for the right hand side setup in the x direction:

#define INDEX(i,j,pitch) (i + __mul24(j,pitch))

__global__ void rhssetup(float *rhs, float *u, int m, int pitch, int pitch2, float alpha)
{
    unsigned int thid = threadIdx.x;
    unsigned int blid = blockIdx.x;
    unsigned int center = INDEX(thid+1, blid+1, pitch);
    if (thid < m)
        rhs[INDEX(thid, blid, pitch2)] = (1-alpha)*u[center]
                                       + alpha/2*(u[center-pitch] + u[center+pitch]);
    __syncthreads();
    if (thid == 0)   // boundary contributions are added once per system
    {
        rhs[blid*pitch2]              += (alpha/2)*u[(blid+1)*pitch];
        rhs[INDEX(m-1, blid, pitch2)] += (alpha/2)*u[INDEX(m+2, blid+1, pitch)];
    }
}

The __mul24 intrinsic is used for efficient 24-bit integer multiplication. The pitch length is different for the right hand side array and the temperature distribution array, since the x and y dimensions are different in general.
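As a usage sketch (assuming device arrays d_rhs and d_u have already been allocated, and an interior grid of m by n points), the kernel above would be launched with one block per tridiagonal system and one thread per equation:

// One block per tridiagonal system (n systems for the x sweep),
// one thread per equation (m unknowns per system); an assumed configuration.
dim3 grid(n);
dim3 block(m);
rhssetup<<<grid, block>>>(d_rhs, d_u, m, pitch, pitch2, alpha);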
2.3 Parallel Cyclic Reduction

In this work, the tridiagonal solver used is based on the parallel cyclic reduction algorithm implemented by Zhang et al. [1]. Parallel cyclic reduction is a variant of the cyclic reduction algorithm first proposed by Hockney and Golub in 1965 [3]. Parallel cyclic reduction differs from cyclic reduction by having only the forward reduction phase. The algorithm solves a tridiagonal system of size n in log2(n) steps and 12 n log2(n) computations, whereas Gaussian elimination without pivoting solves the same problem in 2n steps.

The idea of parallel cyclic reduction is to reduce the original tridiagonal system to smaller systems of half the original size in a recursive manner. Consider an n by n tridiagonal system whose i-th equation is

    a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i

For each row i, a_i and c_i are eliminated by means of row operations involving row i and the two rows a stride above and below i; the initial stride is 1. This process updates b_i and d_i and generates new a_i and c_i as fill-in. The odd indexed rows and the even indexed rows have now become two independent tridiagonal systems. Repeating the same process with a stride double that of the previous one produces smaller and smaller independent systems. Assume for the moment that the system size is a power of two. Iterating the forward reduction phase log2(n) - 1 times will yield n/2 independent tridiagonal systems of size 2. In Zhang et al.'s implementation, the updated values always overwrite the original ones, hence only 5n storage (the diagonals a, b, c, the right hand side d and the solution x) is needed in shared memory.

2.4 System size of non-power-of-two

For system sizes which are not a power of two, the forward reduction can be performed ceil(log2(n/2)) times. The end result is a collection of independent systems of size two together with a number of systems of size one.
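As an illustrative case (not worked out in the original), take n = 6 with rows indexed 0 to 5. Two forward reduction steps leave each row coupled only to rows four positions away, so the rows fall into the independent sets

    {0, 4}, {1, 5}  : two systems of size two
    {2}, {3}        : two systems of size one

which the modified back substitution below handles.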
The original code section for the back substitution can be changed from:

if (thid < delta)
{
    int addr1 = thid;
    int addr2 = thid + delta;
    float tmp3 = b[addr2]*b[addr1] - c[addr1]*a[addr2];
    x[addr1] = (b[addr2]*d[addr1] - c[addr1]*d[addr2]) / tmp3;
    x[addr2] = (d[addr2]*b[addr1] - d[addr1]*a[addr2]) / tmp3;
}

to:

if (thid < delta)
{
    int addr1 = thid;
    int addr2 = thid + delta;
    float tmp3 = b[addr2]*b[addr1] - c[addr1]*a[addr2];
    if (addr2 < n)
    {
        // solves a system of size two
        x[addr1] = (b[addr2]*d[addr1] - c[addr1]*d[addr2]) / tmp3;
        x[addr2] = (d[addr2]*b[addr1] - d[addr1]*a[addr2]) / tmp3;
    }
    else
    {
        // solves a system of size one
        x[addr1] = d[addr1] / b[addr1];
    }
}

The branch taken when addr2 < n solves a system of size two, whereas the else branch solves a system of size one.

2.5 Symmetric tridiagonal system with uniform elements on the diagonals

For Dirichlet boundary conditions, the tridiagonal system involved is symmetric with uniform elements on the diagonals. For such a system we observe the following:

1) The upper and lower diagonals of the new tridiagonal systems formed during the forward reduction phase are filled with identical elements.
2) Only the first and the last elements of the main diagonal have values different from the other elements on the main diagonal.
3) The computation of every subsequent value of the new a and c requires only knowledge of the initial value of b.
These observations allow us to reduce the shared memory storage and the number of memory read/write operations. Only b and d need to be stored in shared memory; a can be dropped entirely, and c is kept in a register.

Consider the original code section:

for (int j = 0; j < iteration; j++)
{
    int i = thid;
    if (i < delta)
    {
        float tmp2 = c[i] / b[i+delta];
        bnew = b[i] - a[i+delta] * tmp2;
        dnew = d[i] - d[i+delta] * tmp2;
        anew = 0;
        cnew = -c[i+delta] * tmp2;
    }
    else if ((systemsize-i-1) < delta)
    {
        float tmp = a[i] / b[i-delta];
        bnew = b[i] - c[i-delta] * tmp;
        dnew = d[i] - d[i-delta] * tmp;
        anew = -a[i-delta] * tmp;
        cnew = 0;
    }
    else
    {
        float tmp1 = a[i] / b[i-delta];
        float tmp2 = c[i] / b[i+delta];
        bnew = b[i] - c[i-delta] * tmp1 - a[i+delta] * tmp2;
        dnew = d[i] - d[i-delta] * tmp1 - d[i+delta] * tmp2;
        anew = -a[i-delta] * tmp1;
        cnew = -c[i+delta] * tmp2;
    }
    __syncthreads();
    b[i] = bnew;
    d[i] = dnew;
    a[i] = anew;
    c[i] = cnew;
    delta *= 2;
    __syncthreads();
}
This can be replaced by:

for (int j = 0; j < iteration; j++)
{
    float temp = c / B;   // c and B are the current uniform off-diagonal and interior diagonal values, kept in registers
    int i = thid;
    if (i < delta)
    {
        float tmp2 = c / b[i+delta];
        bnew = b[i] - c * tmp2;
        dnew = d[i] - d[i+delta] * tmp2;
    }
    else if ((systemsize-i-1) < delta)
    {
        float tmp = c / b[i-delta];
        bnew = b[i] - c * tmp;
        dnew = d[i] - d[i-delta] * tmp;
    }
    else
    {
        float tmp1 = c / b[i-delta];
        float tmp2 = c / b[i+delta];
        bnew = b[i] - c * (tmp1 + tmp2);
        dnew = d[i] - d[i-delta] * tmp1 - d[i+delta] * tmp2;
    }
    __syncthreads();
    b[i] = bnew;
    d[i] = dnew;
    delta *= 2;
    __syncthreads();
    B = B - 2*c*temp;
    c *= -temp;
}

where B is the original value of the interior main diagonal element. Reads and writes to registers are much faster than to shared memory. Profiling with the CUDA Visual Profiler shows that this replacement reduces the computation time by about one fifth.

2.6 Implicit Solution in the y direction

After the implicit solution in the x direction has been computed, the right hand side corresponding to the implicit solution in the y direction can be computed in a similar manner as in the x direction.
Code excerpt for the right hand side computation:

unsigned int thid = threadIdx.x;
unsigned int blid = blockIdx.x;
unsigned int center = INDEX(blid+1, thid+1, pitch);
if (thid < n)
    rhs[INDEX(thid, blid, pitch2)] = (1-alpha)*u[center]
                                   + alpha/2*(u[center-1] + u[center+1]);

However, the profiler shows that the right hand side computation for the implicit solution in the y direction is much less efficient than that for the x direction. This is due to its non-coalesced memory access pattern, which is much slower.
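The difference in access pattern can be made explicit (an illustration based on the two kernels above, not additional solver code):

// With INDEX(i,j,pitch) = i + j*pitch, consecutive threads of a warp read:
//
// x-direction RHS:  center = INDEX(thid+1, blid+1, pitch)
//   thread t accesses u[(t+1) + (blid+1)*pitch]   -> consecutive addresses: coalesced
//
// y-direction RHS:  center = INDEX(blid+1, thid+1, pitch)
//   thread t accesses u[(blid+1) + (t+1)*pitch]   -> addresses pitch elements apart: non-coalesced

Staging a tile of u in shared memory, or operating on a transposed copy of the field, would be one possible way to restore coalesced loads for the y sweep; this is the kind of improvement to the right hand side routine suggested in the conclusion.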
Figure: Profiling result using the CUDA Visual Profiler. ADISolve refers to the tridiagonal solver routine in the x and y directions, memcpydtoh refers to the memory transfer from device memory to host memory, initialize refers to the routine that sets up the initial condition, while rhssetup and rhssetup2 refer to the right hand side computations for the x direction and y direction respectively.

3. Result

The ADI solver was implemented on an NVIDIA GTX 285 GPU, which is capable of double precision computation. The serial version of the ADI solver was based on Gaussian elimination without pivoting and was implemented on an Intel Core 2 Duo E8400 CPU at 3.0 GHz with 4 GB of RAM. The heat diffusion problem tested has Dirichlet boundary conditions. Both codes were tested in float precision and double precision for three grid sizes (including boundary points) over 3000 time steps. The time taken for memory transfer from device to host was included. Below is a summary of the timing results:

System Size    GPU (float)    GPU (double)    CPU (float)    CPU (double)
                              1.75 s          2.26 s         2.33 s
                              3.06 s
                              9.89 s

Only NVIDIA graphics cards with compute capability 1.3 or above can run double precision computation. Current generation GPUs have considerably lower double precision throughput than single precision throughput, which renders them less suitable when high precision is necessary. Due to the shared memory size limitation we have not implemented the code for larger system sizes. It is possible to implement the same algorithm using global memory, but this would incur a performance penalty due to the lower bandwidth of global memory access.

4. Conclusion

Reasonable acceleration for the tridiagonal system solver has been achieved. The algorithm presented here can be further optimized by improving the right hand side computation routine. Future work will concentrate on extending the algorithm to larger system sizes.

References

[1] Zhang Y., Cohen J., Owens J.D. Fast Tridiagonal Solvers on the GPU. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2010.
[2] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 2.0.
[3] Hockney R.W., Jesshope C.R. Parallel Computers. Adam Hilger, Bristol, 1981.