OpenACC programming for GPGPUs: Rotor wake simulation


DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation
Melven Röhrig-Zöllner, Achim Basermann
Simulations- und Softwaretechnik

DLR.de Chart 2 Outline
- Hardware architecture (CPU+GPU)
- GPU computing with OpenACC
- Rotor simulation code Freewake
- OpenACC port of Freewake
- Conclusion

DLR.de Chart 3 Hardware architecture: Overview
The host (CPU with main memory) controls the device (GPU with its own GPU memory); data transfers move data between the two.

DLR.de Chart 4 Hardware architecture: Data transfers
- CPU <-> main memory: 60 GB/s
- Host <-> device: 12 GB/s
- GPU <-> GPU memory: 200 GB/s

DLR.de Chart 5 Hardware architecture: Calculations
- Host: 2 CPUs with 8 cores each, 8-float SIMD, 512 GFlop/s (SP), main memory
- Device: 13 streaming multiprocessors with 192 SIMT cores each, 3.5 TFlop/s (SP), GPU memory

DLR.de Chart 6 Hardware architecture: Comparison
CPU:
- SIMD parallelism: SSE / AVX extensions (8 float)
- MIMD parallelism: several CPUs (2), multiple cores (8), possibly CPU threads
- Caches to avoid memory latency
GPU:
- SIMT parallelism: 32 scalar threads form a warp; 1-32 warps form a thread block (32-1024 threads per block)
- MIMD parallelism: thread blocks within a grid
- Switches threads to hide latency; requires 100,000+ threads!

DLR.de Chart 7 Outline
- Hardware architecture (CPU+GPU)
- GPU computing with OpenACC
- Rotor simulation code Freewake
- OpenACC port of Freewake
- Conclusion

DLR.de Chart 8 OpenACC: Overview
- Language extensions similar to OpenMP, directive based
- Supported languages: C, C++, Fortran
- Supported compilers: CAPS, Cray, PGI (unofficial patches for GCC)

DLR.de Chart 9 OpenACC: Fortran example

program main
  real :: a(N)
  !$acc data copyout(a)
  ! Computation in several loops on the GPU:
  !$acc parallel loop
  do i = 1, N
    a(i) = 2.5 * i
  end do
  !$acc end data
  ! Use results on the CPU
end program main

DLR.de Chart 10 OpenACC: C++ example

int main() {
  float *a = new float[N];
  // OpenACC array sections in C/C++ are start:length, so a[0:N]
  // covers all N elements; the data region spans the following block
  // (there is no "end data" pragma in C/C++).
  #pragma acc data copyout(a[0:N])
  {
    // Computation in several loops on the GPU:
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
      a[i] = 2.5f * i;
  }
  // Use results on the CPU
  return 0;
}

DLR.de Chart 11 OpenACC: Important features
- Explicit data movement between host and device (bottleneck!)
- Loop reductions: calculate sum, minimum, etc. over all iterations (currently only for scalar variables)
- Explicit mapping to GPU hardware for performance tuning: gang -> thread blocks, worker -> warps, vector -> threads within a warp
- Interoperable with CUDA

DLR.de Chart 12 OpenACC: Reduction example

program main
  real :: a(N)
  real :: norm_a
  ! initialize a
  norm_a = 0.
  !$acc parallel loop reduction(+:norm_a)
  do i = 1, N
    norm_a = norm_a + a(i)*a(i)
  end do
  !$acc end parallel loop
  norm_a = sqrt(norm_a)
end program main

DLR.de Chart 13 OpenACC: acc kernels example

program main
  real :: a(N)
  real :: norm_a
  ! initialize a
  !$acc kernels
  norm_a = 0.
  do i = 1, N
    norm_a = norm_a + a(i)*a(i)
  end do
  norm_a = sqrt(norm_a)
  !$acc end kernels
end program main

DLR.de Chart 14 OpenACC: acc kernels compiler output

> pgf90 -Minfo -fast -acc -ta=nvidia kernels_test.f90
main:
   7, Generating present_or_copyin(a(:))
      Generating NVIDIA code
   9, Loop is parallelizable
      Accelerator kernel generated
      9, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
  10, Sum reduction generated for norm_a

DLR.de Chart 15 OpenACC: Fine-tuning example

!$acc kernels
norm_a = 0.
!$acc loop gang(1024) vector(128) reduction(+:norm_a)
do i = 1, N
  norm_a = norm_a + a(i)*a(i)
end do
norm_a = sqrt(norm_a)
!$acc end kernels

DLR.de Chart 16 Outline
- Hardware architecture (CPU+GPU)
- GPU computing with OpenACC
- Rotor simulation code Freewake
- OpenACC port of Freewake
- Conclusion

DLR.de Chart 17 Freewake: Overview
- Developed 1994-1996 by FT-HS; implemented in Fortran, MPI-parallel
- Used by the FT-HS rotor simulation code S4
- Simulates the flow around a helicopter's rotor
- Vortex-lattice method: discretizes complex wake structures with a set of vortex elements
- Based on experimental data (from the international HART program, 1995)

DLR.de Chart 18 Freewake: Comparison with classical CFD
Classical CFD:
- Navier-Stokes equations (velocity), mesh in the whole 3D domain
- Discretization in time: update velocity using small stencils
- -> numerical diffusion
Vortex methods:
- Vorticity equation (curl of velocity)
- Spatial discretization: vorticity points/grid in interesting regions
- Discretization in time: move points using the induced velocity
- -> complex induced velocity calculation (similar: N-body problem)

DLR.de Chart 19 Freewake: Velocity calculation
- Moving 2D grid in space with cells of precalculated vorticity
- Good initial wake geometry from an analytical model
- Velocity at a point: sum over the induced velocities of all grid cells
- Different formulas depending on the distance between point and cell
- Interpolated cells to allow smaller time steps than the grid resolution

DLR.de Chart 20 Freewake: Vortex visualization

DLR.de Chart 21 Outline
- Hardware architecture (CPU+GPU)
- GPU computing with OpenACC
- Rotor simulation code Freewake
- OpenACC port of Freewake
- Conclusion

DLR.de Chart 22 Freewake OpenACC port: Simple benchmark
Idea: only use the formula for the medium range. Performance is compute bound:
- Working set: n = ~6300 grid points, ~250 kB (4 blades with 11x144 points)
- Operations: ~50 n^2 Flop per iteration
Fastest CPU implementation (GCC):
- Uses vector reductions from OpenMP 3.0
- ~280 GFlop/s (SP), ca. 40x faster than Free-Wake
OpenACC GPU implementation (PGI):
- Only scalar reductions possible -> not optimal & more complex
- ~360 GFlop/s (SP)
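The scalar-reduction limitation mentioned above means a vector-valued sum (such as a 3-component velocity) has to be split into one scalar reduction per component. A minimal C++ sketch with hypothetical names, not the Freewake code:

```cpp
#include <vector>

// Hypothetical sketch: since OpenACC reductions here apply only to
// scalar variables, a 3-component sum is written as three separate
// scalar reductions instead of one array reduction.
void sum_velocity(const std::vector<float>& ax, const std::vector<float>& ay,
                  const std::vector<float>& az,
                  float& vx, float& vy, float& vz) {
  const int n = (int)ax.size();
  float sx = 0.f, sy = 0.f, sz = 0.f;
  // Without an OpenACC compiler the pragma is ignored and the loop
  // runs serially; the result is the same.
  #pragma acc parallel loop reduction(+:sx,sy,sz)
  for (int j = 0; j < n; j++) {
    sx += ax[j];
    sy += ay[j];
    sz += az[j];
  }
  vx = sx; vy = sy; vz = sz;
}
```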

DLR.de Chart 23 Freewake OpenACC port: Avoiding branches in loops
If-statements in loops are problematic:
- A GPU warp needs several passes for different branches (SIMT)
- May also prevent efficient vectorization on CPUs (SIMD)
Nested if-statements in Free-Wake:
- Grid boundaries: partial loop unrolling by hand
- Flags: result = flag * a + (1 - flag) * b for flag in {0, 1}
- Different formulas depending on the distance between point and cell: difficult! Currently best results with dedicated loops for the individual cases
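The flag trick above can be sketched as an arithmetic blend that keeps the loop body free of divergent branches (illustrative names, not Freewake code):

```cpp
#include <vector>

// Illustrative sketch of the flag trick: the condition becomes a 0/1
// value and both candidate results are blended arithmetically, so a
// SIMT warp (or a CPU SIMD lane) never diverges over branch bodies.
void blend_select(const std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& x, float threshold,
                  std::vector<float>& result) {
  const int n = (int)a.size();
  #pragma acc parallel loop
  for (int i = 0; i < n; i++) {
    float flag = (x[i] > threshold) ? 1.0f : 0.0f;  // flag in {0, 1}
    result[i] = flag * a[i] + (1.0f - flag) * b[i]; // a if flag, else b
  }
}
```

Note that both a[i] and b[i] are loaded unconditionally, so the trick pays off when the branch bodies are short.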

DLR.de Chart 24 Freewake OpenACC port: Hybrid CPU/GPU calculations
MPI-parallel:
- Grid stored redundantly on all processes
- Each process calculates the velocity of a set of points
Hybrid calculation:
- First MPI process uses the GPU, all others stay on the CPU
- Uses acc_set_device_type( acc_device_... )
Dynamic load balancing:
- Measure the time in each iteration
- Redistribute the work appropriately
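The load-balancing step above (measure, then redistribute) can be sketched as a small helper that assigns the next iteration's point counts in proportion to each process's measured speed; names and the rounding policy are assumptions, not the Freewake implementation:

```cpp
#include <cmath>
#include <vector>

// Sketch of the dynamic load-balancing idea: each MPI process reports
// the wall time for its current share of points; the next distribution
// is proportional to the measured speed (points per second).
std::vector<int> rebalance(const std::vector<int>& counts,
                           const std::vector<double>& seconds,
                           int total_points) {
  const int nproc = (int)counts.size();
  std::vector<double> speed(nproc);
  double total_speed = 0.0;
  for (int i = 0; i < nproc; i++) {
    speed[i] = counts[i] / seconds[i];       // points per second
    total_speed += speed[i];
  }
  std::vector<int> next(nproc);
  int assigned = 0;
  for (int i = 0; i < nproc; i++) {
    next[i] = (int)std::floor(total_points * speed[i] / total_speed);
    assigned += next[i];
  }
  next[0] += total_points - assigned;        // rounding remainder to rank 0
  return next;
}
```

In the hybrid setup, the GPU-accelerated rank 0 would measure a much higher speed than the CPU ranks and therefore receive a correspondingly larger share of points.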

DLR.de Chart 25 Freewake OpenACC port: Performance results
[Bar chart, GFlop/s: CPU only vs. CPU+GPU, in single and double precision, on an NVIDIA K20m and 2x Xeon E5-2640v2]

DLR.de Chart 26 Outline
- Hardware architecture (CPU+GPU)
- GPU computing with OpenACC
- Rotor simulation code Freewake
- OpenACC port of Freewake
- Conclusion

DLR.de Chart 27 Conclusion
- Successfully ported the Freewake simulation to GPUs using OpenACC: the original numerical method was not modified, but a lot of code was refactored & restructured
- Porting complex algorithms to GPUs is difficult: branches in loops hurt (much more than on CPUs)
- Loop restructuring may also improve CPU performance (SIMD vectorization on modern CPUs)
- Stumbled upon several OpenACC PGI compiler bugs (all fixed very fast)
- Future work: extension to wind turbines; reduce complexity from O(n^2) to O(n log n)

DLR.de Chart 28 Summary about OpenACC
Directive based:
- Annotate loops
- Specify data movement
Portable:
- Same code basis for host (CPU) and device (GPU), hybrid calculations
- Different current and future accelerators
- Code remains short & understandable
To make it fast:
- Still a lot of work (restructuring)
- Background knowledge required

DLR.de Chart 29 Questions?