GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N. Juffa


GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa

Outline
- WRF and RRTM
- Previous Work
- CUDA Fortran Features
- RRTM in CUDA Fortran
- Results
- Hot off the Press: Batched Solver for Small Matrix Problems

WRF and RRTM
- WRF (Weather Research and Forecast) is a mesoscale numerical weather prediction code designed for both operational forecasting and the research community, with a broad spectrum of applications and multiple physics options.
- Longwave RRTM (Rapid Radiative Transfer Model) is an optional model that computes the energy transfer through the atmosphere due to electromagnetic radiation.
  - Uses look-up tables for efficiency
  - Separates the calculation into 16 spectral bands

RRTM - Previous Work
- RRTM proposed as a benchmark kernel: http://www.mmm.ucar.edu/wrf/wg2/gpu/
  - Contains only CPU code
- Initial GPU port to CUDA Fortran on PGI's website: http://www.pgroup.com/resources/accel_files/list.htm
  - One of several WRF components on the PGI Accelerator Files site
  - Contains CUDA Fortran source code and a white paper
  - Based on the C1060 and an early version (10.1) of the CUDA Fortran compiler

CUDA and CUDA Fortran
- CUDA programming model
  - Heterogeneous programming model: uses both the CPU and GPU, which have different memory spaces
  - Allows for incremental development
  - Scalable programming model: programs run on any number of processors without recompiling; write a program for one thread, instantiate it on many parallel threads
- CUDA Fortran is the Fortran analog of CUDA C
  - Implemented in PGI's Fortran compiler
  - Host and device code are programmed similarly to CUDA C
  - Host code is based on the runtime API
  - Fortran language extensions simplify data management
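To make the language extensions above concrete, here is a minimal CUDA Fortran sketch (illustrative names, not code from this talk) showing an attributes(global) kernel, a device-resident array, and assignment-based data transfer:

```fortran
! Illustrative sketch of the CUDA Fortran extensions described above.
module kernels
contains
  attributes(global) subroutine scale(a, s, n)
    real :: a(*)
    real, value :: s
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = s * a(i)
  end subroutine scale
end module kernels

program main
  use cudafor
  use kernels
  integer, parameter :: n = 1024
  real :: a(n)
  real, device :: a_d(n)   ! device-resident copy of a
  a = 1.0
  a_d = a                  ! host-to-device transfer via assignment
  call scale<<<(n + 255) / 256, 256>>>(a_d, 2.0, n)
  a = a_d                  ! device-to-host transfer via assignment
end program main
```

The device attribute and assignment-based transfers are the "language extensions to simplify data management": no explicit cudaMalloc/cudaMemcpy calls are needed as in CUDA C.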

RRTM Components
RRTM has the following steps:
- INIRAD: computes the ozone mixing ratio distributions
- MM5ATM: provides atmospheric profiles
- SETCOEF: calculates various quantities needed for the radiative transfer algorithm
- GASABS: calculates gaseous optical depths and Planck functions, computed in 16 spectral bands
- RTRN: calculates the radiative transfer for both clear and cloudy columns
Radiation depends only on data in the same vertical column; the GPU exploits this parallelism.

Data Layout
[Diagrams: CPU layout A(i,k,j) vs. GPU layout A(i,j,k), with index orderings shown]
- CPU layout A(i,k,j): outer loops over (i,j) extract 1D arrays in k and send them to RRTM
- Reorder to A(i,j,k) after transfer
- rtrn(): launch nx*ny threads; each thread calculates one column
- All other routines: launch nx*ny*nz threads
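The reorder step above can be sketched as a simple transpose kernel; the routine and argument names here are assumptions, not the paper's actual code:

```fortran
! Sketch (assumed names) of the on-device reorder from CPU layout
! a(i,k,j) to GPU layout b(i,j,k), one thread per (i,j) point per level.
attributes(global) subroutine reorder(a, b, nx, nz, ny)
  integer, value :: nx, nz, ny
  real :: a(nx, nz, ny)   ! CPU layout: k is the middle index
  real :: b(nx, ny, nz)   ! GPU layout: a horizontal (i,j) slab is
                          ! contiguous, giving coalesced column access
  integer :: i, j, k
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
  k = blockIdx%z
  if (i <= nx .and. j <= ny) b(i, j, k) = a(i, k, j)
end subroutine reorder
```

With this layout, threads that each own one vertical column access consecutive memory at every level k, which is the coalescing property the per-column parallelization relies on.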

Recent Changes
- Fermi architecture: -Mcuda=cc20
  - 64KB L1/shared allotment set to 48KB L1 cache and 16KB shared memory
- Lookup tables changed from constant to device arrays
  - constant still used for physical constants (scalar values)
- CUDA Fortran additions
  - Pinned host memory used for large arrays (faster transfers, enables asynchronous transfers): add the pinned attribute to variable declarations
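The declarations these changes imply might look like the following sketch; the variable names are assumptions, only the attributes and the cache-configuration call reflect the changes listed above:

```fortran
! Sketch of the Fermi-era additions (assumed variable names).
real, pinned, allocatable :: abs_coef(:,:,:)  ! page-locked host array:
                                              ! faster, async-capable transfers
real, device :: lookup_tbl(ntbl)              ! lookup table moved from
                                              ! constant to device memory
real, constant :: heatfac, fluxfac            ! scalar physical constants
                                              ! stay in constant memory
! Prefer 48KB L1 / 16KB shared memory for the kernel, per -Mcuda=cc20:
istat = cudaFuncSetCacheConfig(rtrn_kernel, cudaFuncCachePreferL1)
```

Moving lookup tables out of constant memory helps on Fermi because constant memory is fast only when all threads read the same address, whereas table lookups are data-dependent and benefit from the L1 cache instead.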

Results
Performed on a system with:
- Two quad-core Xeon X5550 CPUs (2.67 GHz)
- Tesla M2050 (Fermi) GPU (448 cores, 1.15 GHz, 3GB memory)
- Tesla M2090 (Fermi) GPU (512 cores, 1.3 GHz, 6GB memory)
- PGI 11.8 compilers
CPU code from http://www.mmm.ucar.edu/wrf/wg2/gpu/ utilizes a single CPU core.
Input data is on a (nx,nz,ny) mesh with nx=73, nz=28, ny=60.

Results

                     |  Previous Study    |            Current Results
                     | X5440 | C1060      | X5550 |      M2050        |      M2090
                     | -fast | Baseline   | -fast | Baseline | Modif. | Baseline | Modif.
Overall time (ms)    |  703  |    83      |  479  |    44    |   38   |    41    |   34
% time in gasabs()   |  56%  |   31%      |  52%  |   21%    |  22%   |   21%    |  21%
% time in rtrn()     |  28%  |   46%      |  27%  |   48%    |  45%   |   44%    |  43%

Tesla M2050/M2090 Baseline: code from the previous study, recompiled with PGI 11.8
Tesla M2050/M2090 Modified: items in "Recent Changes" implemented

Conclusions and Future Work
Conclusions:
- Good speedup for a small problem
- Data reordering is key
  - Leverages the separate memory space (host data unchanged)
  - Reordering is fast on the device
Future Work:
- Add asynchronous data transfers to hide communication
- Textures are coming to CUDA Fortran: a possibility for lookup tables
- Utilize both CPU and GPU

Batched Solver for Small Matrices
- Motivated by a chemical simulation containing ~20 species: an independent system is solved at each grid point
- LU decomposition, Gaussian and Gauss-Jordan elimination
  - Dense matrices
  - Partial pivoting
  - Routine chosen based on matrix size and hardware
- Double precision (and double-complex)
- Code will be available at http://nvdeveloper.nvidia.com

Batched Solver Implementation
- Each system is mapped to a thread block
- Each thread in a thread block can work on multiple matrix elements
  - One matrix element per thread is most efficient for small matrices
- Pivoting via a two-stage process:
  - A configurable number of threads each find the maximum of the elements assigned to them and write these intermediate values to shared memory
  - All threads redundantly find the maximum of the intermediate values
- Pivoting can be turned off for diagonally dominant matrices
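The two-stage pivot search could be sketched as below; the routine name, argument list, and the thread count PMAX are assumptions for illustration, not the actual NVIDIA implementation:

```fortran
! Sketch of a two-stage partial-pivot search within one thread block.
attributes(device) subroutine pivot_search(a, n, j, ip)
  integer, parameter :: PMAX = 32   ! configurable number of search threads
  real :: a(n, n)
  integer, value :: n, j
  integer, intent(out) :: ip        ! row index of the chosen pivot
  real, shared :: smax(PMAX)
  integer, shared :: srow(PMAX)
  real :: best
  integer :: t, r, k
  t = threadIdx%x
  ! Stage 1: each of the PMAX threads scans a strided subset of
  ! column j at or below the diagonal and writes its local maximum
  ! (and its row) to shared memory.
  if (t <= PMAX) then
    smax(t) = -1.0
    srow(t) = j
    do r = j + t - 1, n, PMAX
      if (abs(a(r, j)) > smax(t)) then
        smax(t) = abs(a(r, j))
        srow(t) = r
      end if
    end do
  end if
  call syncthreads()
  ! Stage 2: all threads redundantly reduce the PMAX partial maxima,
  ! so every thread ends up with the same pivot row and no broadcast
  ! or further synchronization is required.
  best = smax(1)
  ip = srow(1)
  do k = 2, PMAX
    if (smax(k) > best) then
      best = smax(k)
      ip = srow(k)
    end if
  end do
end subroutine pivot_search
```

The redundant stage-2 reduction trades a little duplicated arithmetic for the elimination of a second synchronization, which pays off when matrices are small and synchronization overhead dominates.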

Batched Solver Performance
