High Performance Ocean Modeling using CUDA

Similar documents
Numerical Ocean Modeling and Simulation with CUDA

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

High performance Computing and O&G Challenges

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

Advances of parallel computing. Kirill Bogachev May 2016

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

HPC parallelization of oceanographic models via high-level techniques

Debugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.

SENSEI / SENSEI-Lite / SENEI-LDC Updates

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

OpenACC programming for GPGPUs: Rotor wake simulation

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Introduction II. Overview

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

An Innovative Massively Parallelized Molecular Dynamic Software

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

Parallel Programming. Libraries and Implementations

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

HPC Application Porting to CUDA at BSC

CS 475: Parallel Programming Introduction

Introduction to GPU hardware and to CUDA

BMGTOOLS: A COMMUNITY TOOL TO HANDLE MODEL GRID AND BATHYMETRY

CUDA GPGPU Workshop 2012

ORAP Forum October 10, 2013

QR Decomposition on GPUs

Using OpenACC in IFS Physics Cloud Scheme (CLOUDSC) Sami Saarinen ECMWF Basic GPU Training Sept 16-17, 2015

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Applying Particle Tracking Model in the Coastal Modeling System

Porting The Spectral Element Community Atmosphere Model (CAM-SE) To Hybrid GPU Platforms

Tuning Alya with READEX for Energy-Efficiency

Solving Dense Linear Systems on Graphics Processors

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.

LooPo: Automatic Loop Parallelization

OP2 FOR MANY-CORE ARCHITECTURES

Cross Teaching Parallelism and Ray Tracing: A Project based Approach to Teaching Applied Parallel Computing

Applications of Berkeley s Dwarfs on Nvidia GPUs

OpenACC Based GPU Parallelization of Plane Sweep Algorithm for Geometric Intersection. Anmol Paudel Satish Puri Marquette University Milwaukee, WI

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Running the FIM and NIM Weather Models on GPUs

Faster Simulations of the National Airspace System

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

GPU Accelerated Solvers for ODEs Describing Cardiac Membrane Equations

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

OpenACC Course. Office Hour #2 Q&A

WHY PARALLEL PROCESSING? (CE-401)

Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

OPTIMIZATION OF THE CODE OF THE NUMERICAL MAGNETOSHEATH-MAGNETOSPHERE MODEL

CUDA. Matthew Joyner, Jeremy Williams

Numerical Algorithms on Multi-GPU Architectures

GPU - Next Generation Modeling for Catchment Floodplain Management. ASFPM Conference, Grand Rapids (June 2016) Chris Huxley

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

High Performance Computing (HPC) Introduction

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Performance Analysis and Optimization of the Regional Ocean Model System on TeraGrid

Fast BVH Construction on GPUs

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP

Parallel Programming Libraries and implementations

Placement de processus (MPI) sur architecture multi-cœur NUMA

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

Reflections on Dynamic Languages and Parallelism. David Padua University of Illinois at Urbana-Champaign

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Threaded Programming. Lecture 9: Alternatives to OpenMP

Intel Xeon Phi Coprocessors

Software and Performance Engineering for numerical codes on GPU clusters

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

Improved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes

Harnessing GPU speed to accelerate LAMMPS particle simulations

HPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)

HPC Architectures. Types of resource currently in use

AMath 483/583 Lecture 21 May 13, 2011

Allinea Unified Environment

ANSYS HPC. Technology Leadership. Barbara Hutchings ANSYS, Inc. September 20, 2011

Subset Sum Problem Parallel Solution

Chapter 3 Parallel Software

Transcription:

using CUDA Chris Lupo Computer Science Cal Poly Slide 1

Acknowledgements Dr. Paul Choboter Jason Mak Ian Panzer Spencer Lines Sagiv Sheelo Jake Gardner Slide 2

Background Joint research with Dr. Paul Choboter (Cal Poly, Math) He's the Oceanographer Research is focused on dynamics of wind driven coastal upwelling Model cross shore and alongshore transport in the California coastal ocean Useful for studying applications such as: pollution transport algal blooms sediment transport Slide 3

Background Surface temperature modeling, San Luis Obispo County Slide 4

Background Coarse grained bathymetry modeling of San Francisco and Monterey Bays ~10 km grid resolution Slide 5

Background Finer grained bathymetry modeling of Monterey Bay ~1 km grid resolution Slide 6

Objectives We want more resolution! Ideally we would have meter or sub meter grids Creek mouths Geologic / topologic features (e.g. Monterey Canyon) Existing models can take days or weeks to compute Slide 7

ROMS Regional Ocean Modeling System (ROMS) Robust, actively developed open source numerical ocean modeling software used by the scientific community Developed in Fortran 90/95, ~400k loc Staggered, finite, difference grid: velocity, temperature, salinity, and sea surface elevation Slide 8

ROMS Iterative time stepping algorithm Simulation time interval: large (slow), depth dependent time step (baroclinic) solves 3D equations, uses many short (fast), depth integrated time steps (barotropic) to solve 2D equations Short time steps (step2d) occupy more than 50% of runtime Slide 9

Parallelism ROMS has been developed with two parallel computing paradigms Distributed with message passing for clusters Shared memory with OpenMP for multicore Slide 10

Parallelism: Distributed MPI Great for multi node clusters Advantages: Very scalable, supports large number of processors Fine grained parallelism Cost: Can easily hit $100k for entry level system Maintenance: need a machine room Support: third party contracts, technicians Disadvantages: Slide 11

Parallelism: Shared OpenMP Great for desktop researchers Advantages: Cost: Great performance for < $10k Maintenance: possible for end user Support: possible for end user Not very scalable Course grained parallelism Disadvantages: Slide 12

Parallelism: Shared OpenMP Great for desktop researchers Advantages: Cost: Great performance for < $10k Maintenance: possible for end user Support: possible for end user Not very scalable Course grained parallelism Disadvantages: Slide 13

OpenMP Implementation Partition the grid into tiles and assign each tile to a core (thread) http://www.myroms.org/ Slide 14

CUDA Approach Re-use the existing OpenMP tiling Partition grid into small tiles, assign each tile to thread on GPU Use the smallest possible tile, 2x2, to saturate GPU Slide 15

CUDA Approach Rewrite the step2d function as a CUDA kernel Allocate GPU memory once Transfer memory before and after all step2d computations Slide 16

CUDA Approach Slide 17

Porting Challenges step2d function > 2000 lines Lack of encapsulation in Fortran code Necessary global variables must all be identified Index translation Slide 18

Experimental Setup OpenMP 2 Intel Xeon E5504 (8 cores total) MPI Intel Xeon 5130 processors (64 cores total) CUDA Tesla C2050 (448 cores) Slide 19

Results Upwelling simulation: internally specified idealized conditions Slide 20

Results Observational data from CA central coast (512x256) Slide 21

Next Steps Run more portions of the simulation on the GPU, this is only step2d OpenACC looks promising, but we've struggled with it GPU optimizations: faster memory, double buffering, divergence removal Multi GPU support Hybrid OpenMP+CUDA Implementation Utilize newer hardware to exploit dynamic parallelism Slide 22

Conclusions GPUs show promising results for accelerating ocean modeling software Open new doors in simulation size or time scale for research. Can allow us to do better desktop science. Slide 23

Conclusions GPUs show promising results for accelerating ocean modeling software Open new doors in simulation size or time scale for research. Can allow us to do better desktop science. Slide 24

Questions? Slide 25