Parallelising Pipelined Wavefront Computations on the GPU


Parallelising Pipelined Wavefront Computations on the GPU. S.J. Pennycook, G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group, Department of Computer Science, University of Warwick, U.K. 1st UK CUDA Developers Conference, 7th Dec 2009, Oxford, U.K.

Overview: Wavefront Computations; A GPGPU Solution?; Wavefronts within Wavefronts; Performance Modelling; Beating the CPU: Optimisations to Win; Results, Validations and Model Projections; Current and Future Work; Conclusions.

Wavefront Computations Wavefront computations are at the core of a number of large scientific computing workloads. Centres including the Los Alamos National Laboratory (LANL) in the United States and the Atomic Weapons Establishment (AWE) in the UK use these codes heavily. Lamport's hyperplane algorithm, which underpins these codes, has existed for more than thirty-five years. Defining characteristics: the computation operates on a grid of cells, with each cell requiring some computation to be performed; each cell has a data dependency, such that the solutions of up to three neighbouring cells are required before it can be computed.

Cell Dependencies

Motivation Our previous work was on analysing and optimising applications that use the wavefront algorithm with MPI. [Figure: a 2D grid of processors, from (1,1) to (n,m); the computation proceeds as wavefronts through the 3D Nx x Ny x Nz data cube.]

Motivation (cont'd) The algorithm operates over a three-dimensional structure of size Nx x Ny x Nz. The grid is mapped onto a 2D m x n grid of processors; each processor is assigned a stack of (Nx/n) x (Ny/m) x Nz cells. The data dependency results in a sequence of wavefronts (or a sweep) that starts from one corner and makes its way through the other cells. We have modelled codes (e.g. Chimaera, LU and Sweep3D) that employ wavefront computations with MPI.
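As a worked example of this decomposition (the numbers here are illustrative only, not taken from the slides):

```latex
% A 480 x 480 x 480 cube on an m x n = 4 x 4 processor grid gives each
% processor a stack of
\frac{N_x}{n} \times \frac{N_y}{m} \times N_z
  = \frac{480}{4} \times \frac{480}{4} \times 480
  = 120 \times 120 \times 480 \;\text{cells.}
```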

Motivation (cont'd) Our focus is now on using GPUs to investigate improvements to the solution per processor. A canonical solution is normally employed on the CPU to solve the computation per processor.
Listing: Canonical Algorithm
  For k=1; k<=kend do
    For j=1; j<=jend do
      For i=1; i<=iend do
        A(i,j,k) = A(i-1,j,k) + A(i,j-1,k) + A(i,j,k-1)  // Compute cell
      End for
    End for
  End for

Hyperplane (Wavefront) Algorithm Let f = i + j + k, g = k and h = j. The plane defined by i + j + k = CONST is called a hyperplane.
Listing: Hyperplane Algorithm
  DO CONCURRENTLY ON EACH PROCESSOR
    For f = 3, iend+jend+kend do
      A(f-g-h,h,g) = A(f-g-h-1,h,g) + A(f-g-h,h-1,g) + A(f-g-h,h,g-1)
    End For
The critical dependencies are preserved, even though the solution is carried out across the grid in wavefronts.
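As a concrete illustration, the listing below is a minimal sequential C sketch of the same reordering (the function name, the flattened array layout and the halo-style boundary handling are our assumptions rather than anything shown on the slides). All cells visited for a given f lie on the same hyperplane and are mutually independent, so the two inner loops are the part that can be executed concurrently.

```c
#include <stddef.h>

/* Hyperplane (wavefront) reordering of the canonical triple loop.
 * The array is (iend+1) x (jend+1) x (kend+1), with index 0 in each
 * dimension holding pre-initialised boundary values (an assumption --
 * the slides do not show boundary handling). */
static size_t idx(int i, int j, int k, int iend, int jend)
{
    return (size_t)i + (size_t)(iend + 1) * ((size_t)j + (size_t)(jend + 1) * (size_t)k);
}

void hyperplane_sweep(double *A, int iend, int jend, int kend)
{
    for (int f = 3; f <= iend + jend + kend; f++) {   /* sweep the hyperplanes */
        for (int k = 1; k <= kend; k++) {             /* g = k */
            for (int j = 1; j <= jend; j++) {         /* h = j */
                int i = f - j - k;                    /* i = f - g - h */
                if (i < 1 || i > iend)
                    continue;                         /* (i,j,k) is not on plane f */
                A[idx(i, j, k, iend, jend)] =
                      A[idx(i - 1, j, k, iend, jend)]
                    + A[idx(i, j - 1, k, iend, jend)]
                    + A[idx(i, j, k - 1, iend, jend)];
            }
        }
    }
}
```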

A GPGPU Solution? Can we utilise the many cores of a GPU to speed up this algorithm? Theoretically simple...

A GPGPU Solution? (cont'd) For a 3D cube of cells:

GPU Limitations What's the practical situation? Experimental system (Daresbury Laboratory, U.K.): 8 x NVIDIA Tesla S1070 servers, each with four Tesla C1060 cards. Compute nodes consist of quad-core Nehalem processors (@ 2.53 GHz, 24 GB RAM). Each CPU core sees one Tesla card. Voltaire HCA410-4EX InfiniBand adapter. NVIDIA Tesla C1060 GPU specifications: each card has 30 Streaming Multiprocessors (SMs), with 8 cores per SM; each card therefore has 240 streaming processor cores; each core operates at 1.296 to 1.44 GHz; 4 GB of memory per card.

GPU Limitations (cont'd) CUDA device architecture: [Figure: the GPU's 30 SMs, each with 8 processor cores, registers, shared memory and constant/texture caches, attached to device DRAM holding local, global, constant and texture memory, connected to the host.]

GPU Limitations (cont'd) Each SM is allocated a number of threads, arranged as blocks. There is no synchronisation between threads in different blocks, and a limit of 512 threads per block. Memory hierarchy: global memory access is slow and should be avoided; there is a limit of 16 KB of shared memory per SM. Other considerations: a limit of 16,384 registers per block; aligning half-warps for performance.
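These limits can also be queried at runtime rather than hard-coded; the short sketch below is purely illustrative (it is not from the slides) and uses the standard CUDA runtime API:

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the per-device limits discussed above for device 0
 * (e.g. a Tesla C1060 on the system described earlier). */
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Multiprocessors (SMs):   %d\n",        prop.multiProcessorCount);
    printf("Max threads per block:   %d\n",        prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n",        prop.regsPerBlock);
    return 0;
}
```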

A Solution? Wavefronts within Wavefronts The solution needs to be scalable: run more than 512 threads by utilising parallelism across all of the multiprocessors. The cells on each diagonal are decomposed into coarse subtasks and assigned to the SMs as thread blocks.

Wavefronts within Wavefronts Each diagonal is computed by a kernel launch:
  for (wave = 0; wave < (3 * (N / dimBlock.x)) - 2; wave++)
  {
      // Run the kernel.
      hyperplane_3d<<< dimGrid, dimBlock, shared_mem_size >>>(d_gpu, wave);
  }
  cudaThreadSynchronize(); // Not strictly necessary.
The time to compute one diagonal is ceiling(number of blocks per diagonal / number of SMs). Each block utilises the resources available to an SM to solve its cells (we will talk about this later).
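The slides do not show the body of hyperplane_3d, but a minimal sketch of the "wavefronts within wavefronts" idea might look as follows. Everything here is our assumption rather than the authors' implementation: the tile size B, the flat array layout, reading and writing global memory directly (no shared-memory caching or coalescing yet), and the halo-style boundary handling. Each block owns a B x B x B tile; only blocks whose tile coordinates lie on the current coarse diagonal do any work, and each active block sweeps its own tile plane by plane, re-using its threads for every plane.

```cuda
#define B 8   /* tile edge; launch with dimBlock = (B, B) and dimGrid = (n/B, n/B) */

/* One coarse diagonal of the "wavefronts within wavefronts" scheme.
 * n is the grid edge, assumed to be a multiple of B; cells with
 * i, j or k == 0 hold pre-initialised boundary values. */
__global__ void hyperplane_3d(float *a, int n, int wave)
{
    int by = blockIdx.x;            /* tile coordinate in y */
    int bz = blockIdx.y;            /* tile coordinate in z */
    int bx = wave - by - bz;        /* tile coordinate in x on this diagonal */
    if (bx < 0 || bx >= n / B)
        return;                     /* this block's tile is not on the diagonal */

    int tj = threadIdx.x;           /* the (j, k) column owned by this thread */
    int tk = threadIdx.y;

    /* Sweep the intra-tile hyperplanes: threads are re-used on every plane. */
    for (int f = 0; f <= 3 * (B - 1); f++) {
        int ti = f - tj - tk;       /* intra-tile i index on plane f */
        if (ti >= 0 && ti < B) {
            int i = bx * B + ti, j = by * B + tj, k = bz * B + tk;
            if (i > 0 && j > 0 && k > 0) {          /* skip the boundary layer */
                size_t idx = (size_t)i + (size_t)n * ((size_t)j + (size_t)n * k);
                a[idx] = a[idx - 1]                 /* (i-1, j, k) */
                       + a[idx - (size_t)n]         /* (i, j-1, k) */
                       + a[idx - (size_t)n * n];    /* (i, j, k-1) */
            }
        }
        __syncthreads();            /* plane f must finish before plane f+1 */
    }
}
```

Cells on the faces of a tile read values from neighbouring tiles, which sit on the previous coarse diagonal and were therefore completed by an earlier kernel launch.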

A Performance Model What does this solution mean in terms of a performance model? Modelling block-level performance: assume a 3D cube of data cells with dimension N.
  P_GPU    -- Number of SMs on the GPU
  W_g,GPU  -- Time to solve a block of cells
  W_GPU    -- Time to solve the 3D cube of cells using the GPU
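Putting these symbols together with the per-diagonal cost from the previous slide gives the general shape such a model might take (a sketch of the structure only, not necessarily the authors' exact formulation); with blocks of edge B there are 3(N/B) - 2 coarse diagonals, and n_blocks(d) denotes the number of blocks on diagonal d:

```latex
W_{GPU} \;\approx\; \sum_{d=1}^{3(N/B)-2}
    \left\lceil \frac{n_{\mathrm{blocks}}(d)}{P_{GPU}} \right\rceil
    W_{g,GPU}
```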

Initial Results Each cell is randomly initialised and, at each step, calculates the average of itself and its top, north and west neighbours. How the 3D data is decomposed has a significant effect on execution time. There is strange behaviour where the number of cells is a multiple of 32 (especially at powers of 2).
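Written out, the benchmark's update for a cell is simply (taking "top", "north" and "west" to be the k-1, j-1 and i-1 neighbours respectively, which is our reading of the slide):

```latex
a_{i,j,k} \;\leftarrow\;
  \frac{a_{i,j,k} + a_{i-1,j,k} + a_{i,j-1,k} + a_{i,j,k-1}}{4}
```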

Initial Results (cont'd)

Initial Results (cont'd)

Initial Results (cont'd)

Beating the CPU Optimisations within the blocks: thread re-use; caching values in shared memory; coalesced memory accesses; avoiding shared memory bank conflicts. Optimisations over the blocks: explicit vs. implicit CPU synchronisation; inter-block synchronisation using mutexes.

Thread Reuse in a Block [Figure: sixteen threads, Thread 0 to Thread 15, within a single block.]

Coalesced Memory Access [Figure: the sixteen cells of a 4 x 4 tile in row-major order (0-15), and the same cells reordered so that cells on the same diagonal are contiguous: 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15.] Requires padding on devices below compute capability 1.3. How does this apply to 3D?
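One straightforward way to obtain that padding is to let the CUDA runtime choose an aligned pitch for each row; the sketch below is illustrative only (it is not the authors' code) and assumes one row of N floats per (j, k) pair:

```cuda
#include <cuda_runtime.h>

/* Allocate an N x N x N grid with each row padded to an aligned pitch,
 * so half-warp accesses along i can coalesce on pre-1.3 devices. */
float *alloc_padded_grid(int N, size_t *pitch)
{
    float *d_grid = NULL;
    cudaMallocPitch((void **)&d_grid, pitch,
                    (size_t)N * sizeof(float),   /* row width in bytes */
                    (size_t)N * N);              /* one row per (j, k) */
    return d_grid;
}

/* Address of element (i, j, k), given the padded pitch in bytes. */
__host__ __device__ static inline
float *cell(float *d_grid, size_t pitch, int i, int j, int k, int N)
{
    char *row = (char *)d_grid + ((size_t)j + (size_t)N * k) * pitch;
    return (float *)row + i;
}
```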

Beating the CPU (Results)

Beating the CPU (Results) The code was restructured for the GPU to avoid unnecessary branching; similar restructuring was applied to the CPU code in kind. Re-use of threads and shared memory offers a 2x speedup over the naive GPU implementation. Spikes remain, likely to be an issue at the warp level. Kernel information: 17 registers; 2,948 bytes of shared memory per block; 42% occupancy.

The Bigger Picture Current work: porting LU, Sweep3D and Chimaera to the GPU (CUDA and OpenCL). Additional barriers from larger programs: double precision; multiple computations per cell. Looking towards the future: How well does our algorithm perform on a consumer card (e.g. a GTX 295)? How well will our algorithm perform on Fermi? Benchmarking and analysis should facilitate predictions.

Conclusions Wavefront computations can utilise emerging GPU architectures, despite their dependencies. To see a speedup, Memcpy() needs to be faster, or more work must be done per Memcpy(). Codes cannot be ported naively. Hardware limitations may be a problem (particularly for larger codes). Performance modelling will offer insights into which applications can be ported successfully.