Progress on Advanced Dynamical Cores for the Community Atmosphere Model. June 2010

Progress on Advanced Dynamical Cores for the Community Atmosphere Model. June 2010. Art Mirin, P. O. Box 808, Livermore, CA 94551. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-438171

A primary motivator for advanced dycores is to overcome limitations of the lat-lon grid. Convergence of meridians at the poles limits the timestep and provides unnecessarily high resolution in polar regions at the expense of other locations around the globe; it also limits scalability by inhibiting effective domain decomposition in longitude. Several new dynamical cores are currently available in versions of CAM: Homme (spectral element, cubed-sphere grid, CAM trunk); Fvcubed (finite volume, cubed-sphere grid, CAM branch); MPAS (finite volume, icosahedral/hexagonal grid, CAM branch).

AMWG session: The following three slides are for the AMWG summary.

OpenMP in Homme has been revived. Collaborators: John Dennis (NCAR); Jim Edwards (IBM); Mark Taylor (Sandia). Homme can now utilize both distributed-memory parallelism (MPI) and shared-memory parallelism (OpenMP). Both forms of parallelism operate over finite elements. OpenMP does not provide additional parallelism but could provide more effective parallelism. For the NE16NP4 configuration on the Cray XT5 (jaguarpf), OpenMP allows close to a 50% speedup (for the same number of CPUs): 1536 MPI tasks, no OpenMP => 21.4 simulated years per compute day; 256 MPI tasks, 6-way OpenMP => 31.3 simulated years per compute day. A comprehensive scaling study is in progress.

Further validation of fvcubed has been accomplished. Collaborators: Will Sawyer (CSCS); Brian Eaton (NCAR); Christiane Jablonowski (U. Mich.). Dynamics validated using Jablonowski-Williamson test cases. Diagnosed and corrected bugs relating to physics. Evaluated and upgraded the OpenMP implementation. Compared physics tendencies on the FV grid vs. the FVCUBED grid. Currently being upgraded from cam3_6_57 to cam4_9_02; about to be updated to the latest GFDL dycore version. Will carry out further validation and a comprehensive performance evaluation.

The MPAS dycore has been implemented and is undergoing validation. Collaborators: Dan Bergmann, Jeff Painter, Mike Wickett (LLNL); Todd Ringler (LANL); Bill Skamarock, Michael Duda, Joe Klemp (NCAR). MPAS is amenable to local mesh refinement. Compared physics tendencies on the FV grid vs. the MPAS grid. Compared MPAS driven by CAM (without physics) against MPAS-standalone for baroclinic wave tests. Inclusion of cell-boundary arrays in the netCDF output allows visualization with CDAT. Creating mapping files to a corresponding lat-lon grid, for the AMWG diagnostics package and operation with the land model. Running a comprehensive aquaplanet comparison with other dycores.

SEWG session: The remaining slides are for the SEWG session.

OpenMP in Homme has been revived. Collaborators: John Dennis (NCAR); Jim Edwards (IBM); Mark Taylor (Sandia). Homme can now utilize both distributed-memory parallelism (MPI) and shared-memory parallelism (OpenMP). Both forms of parallelism operate with respect to spectral elements. OpenMP does not provide additional parallelism but could provide more effective parallelism. Potential benefits of shared-memory over distributed-memory parallelism include a lower memory footprint and a more favorable surface-area-to-volume ratio.

The Homme dynamics advance is an OpenMP region. Many codes implement OpenMP over loops:
  !$omp parallel do
  do ...
  enddo
In Homme, the time integration encompasses a single parallel region, with specific elements assigned to specific threads:
  !$omp parallel
  call dynamics_advance
  !$omp end parallel
MPI communications and other key areas are limited to the master thread (!$OMP master); synchronization is accomplished through !$omp barrier. The approach is potentially more efficient but more difficult to debug.
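To make that structure concrete, here is a minimal, self-contained sketch of the pattern, assuming a simple block partition of elements across threads; the program and variable names (elem, nets, nete, and so on) are illustrative and are not the actual HOMME interfaces.

  program single_parallel_region_sketch
    use omp_lib
    implicit none
    integer, parameter :: nelem = 96
    real :: elem(nelem)
    integer :: tid, nthreads, chunk, nets, nete, ie

    elem = 0.0

    !$omp parallel private(tid, nthreads, chunk, nets, nete, ie)
    nthreads = omp_get_num_threads()
    tid      = omp_get_thread_num()
    chunk    = (nelem + nthreads - 1) / nthreads
    nets     = tid*chunk + 1               ! first element owned by this thread
    nete     = min(nelem, (tid+1)*chunk)   ! last element owned by this thread

    do ie = nets, nete                     ! thread-local dynamics work on owned elements
       elem(ie) = elem(ie) + 1.0
    end do

    !$omp barrier                          ! all threads done before communication
    !$omp master
    print *, 'master thread would perform MPI communication here'
    !$omp end master
    !$omp barrier                          ! non-master threads wait for communicated data
    !$omp end parallel

    print *, 'sum over elements =', sum(elem)
  end program single_parallel_region_sketch

The point of the design is that the parallel region is entered once per advance, each thread repeatedly works on its own fixed element range, and anything involving MPI is funneled through the master thread between barriers.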

We encountered several issues in the OpenMP implementation. The CAM reproducible sum (which references MPI) was called from a parallel region: we restricted the repro_sum call to the master thread and used a temporary thread-shared buffer to transfer data between master and non-master threads:
  global_buf(ie) = thread_private_mem(ie)
  !$omp MASTER
  call repro_sum(global_buf, global_sum)
  !$omp END MASTER
There was limited OpenMP support in the CAM/Homme interface layer: we enhanced the OpenMP support.
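A sketch of that shared-buffer workaround follows, under the assumption that a plain Fortran sum can stand in for CAM's repro_sum (whose actual interface is not reproduced here); the names are illustrative only.

  program master_only_reduction_sketch
    implicit none
    integer, parameter :: nelem = 64
    real :: global_buf(nelem)         ! thread-shared staging buffer
    real :: global_sum
    integer :: ie

    global_sum = 0.0

    !$omp parallel private(ie)
    !$omp do
    do ie = 1, nelem
       ! each thread copies its elements' contributions (the analogue of
       ! thread_private_mem(ie)) into the shared buffer
       global_buf(ie) = real(ie)
    end do
    !$omp end do                      ! implicit barrier: buffer fully populated

    !$omp master
    ! only the master thread performs the reduction that references MPI;
    ! a plain sum stands in for CAM's repro_sum here
    global_sum = sum(global_buf)
    !$omp end master
    !$omp barrier                     ! other threads wait for the result
    !$omp end parallel

    print *, 'global_sum =', global_sum
  end program master_only_reduction_sketch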

We encountered several issues in the OpenMP implementation (cont.). Indeterminate data clobbering during MPI communication: each thread packs data to be communicated into a buffer accessible to the master thread; the master thread calls MPI; the data is then extracted into a thread-private structure. We inserted !$omp barrier calls between buffer unpacks and subsequent buffer packs. Lack of an OpenMP barrier after timelevel_update (which adjusts the time-level indices and is called by the master thread only) resulted in incorrect indices being used by non-master threads:
  !$omp MASTER
  call TimeLevel_update
  !$OMP END MASTER
  !$OMP BARRIER          [was not previously present]
  call prim_advance_exp  [uses new time level indices]
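The clobbering issue can be illustrated with a small sketch; the buffer layout and loop below are made up for the example, and the final barrier marks the synchronization point that was missing: without it, a fast thread could start packing the next message into the shared buffer while a slower thread was still unpacking the previous one.

  program pack_unpack_barrier_sketch
    use omp_lib
    implicit none
    real, allocatable :: shared_buf(:)        ! staging buffer seen by all threads
    real :: private_val
    integer :: step, tid

    allocate(shared_buf(omp_get_max_threads()))
    shared_buf = 0.0

    !$omp parallel private(tid, private_val, step)
    tid = omp_get_thread_num()
    do step = 1, 3
       shared_buf(tid+1) = real(tid + step)   ! pack this thread's contribution

       !$omp barrier                          ! all packs complete
       !$omp master
       shared_buf = shared_buf * 2.0          ! stand-in for the MPI exchange
       !$omp end master
       !$omp barrier                          ! communicated data ready to unpack

       private_val = shared_buf(tid+1)        ! unpack into thread-private storage

       ! the barrier that was missing: no thread may start the next pack
       ! until every thread has finished unpacking
       !$omp barrier
    end do
    !$omp end parallel

    print *, 'final buffer:', shared_buf
  end program pack_unpack_barrier_sketch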

OpenMP appears to pay off on Cray XT5 For NE16NP4 (~2-deg) configuration on Cray XT5 (jaguarpf), OpenMP results in a ~50% speedup (for same number of cpu s) 1536 MPI tasks, no OpenMP => 21.4 sim.years per comp. day 256 MPI tasks, 6-way OpenMP => 31.3 sim.years per comp. day Above result based on a single case (run multiple times); comprehensive scaling study in progress 12
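As a quick check on the quoted figures: both configurations use the same 1536 cores (256 MPI tasks x 6 OpenMP threads = 1536), and 31.3 / 21.4 is about 1.46, i.e. roughly a 46% throughput gain, consistent with the "close to 50%" characterization.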

The effort to run CAM with fvcubed has been revived. Collaborators: Will Sawyer (CSCS); Brian Eaton (NCAR); Christiane Jablonowski (U. Mich.). Dynamics validated using Jablonowski-Williamson test cases. Initial performance tests are encouraging; the graph below is from a Cray XT4 (June 2009).
[Figure: simulated years per computing day vs. number of CPUs, comparing fvcubed and fv-lat-lon on the Cray XT4.]

Further validation of fvcubed has been accomplished. Diagnosed and corrected bugs relating to physics: multiple chunks per dynamics block (phys_loadbalance=0) are now supported; phys_loadbalance (>0) options requiring communication are not yet implemented. Evaluated and upgraded the OpenMP implementation in both the dycore itself and the CAM/fvcubed interface layer. Compared physics tendencies on the FVCUBED grid vs. the FV grid, skipping the dynamics advance (for FV, potential temperature must be converted to virtual potential temperature to account for the mismatch between the input and output states to/from dyn_run; see the sketch below).
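The conversion referred to above is the standard virtual-temperature correction; a minimal sketch follows, using the common approximation theta_v = theta*(1 + 0.61*qv) with qv the water-vapor mixing ratio. The exact constants and state handling inside CAM's dyn_run may differ.

  program theta_v_sketch
    implicit none
    real :: theta0, qv0
    theta0 = 300.0     ! potential temperature (K)
    qv0    = 0.01      ! water vapor mixing ratio (kg/kg)
    print *, 'theta_v =', virtual_theta(theta0, qv0)
  contains
    real function virtual_theta(theta, qv)
      real, intent(in) :: theta, qv
      real, parameter  :: eps = 0.61          ! ~ (Rv/Rd - 1)
      virtual_theta = theta * (1.0 + eps*qv)  ! standard virtual-temperature correction
    end function virtual_theta
  end program theta_v_sketch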

[Figure: temperature change over one month (physics tendencies only, no dynamics), comparing fvcubed and fv at the surface and at 500 hPa.]

The CAM/fvcubed implementation is being updated. Currently being upgraded from cam3_6_57 to cam4_9_02. When complete, we will update to the latest GFDL dycore version: we previously used the NASA version of fvcubed and hence needed to extract the added infrastructure (labor intensive); we have obtained direct access to the GFDL archive (Balaji) and will use that version, which provides an easier path to future updates. We will then carry out further validation and a comprehensive performance evaluation.

The MPAS dycore has been implemented in CAM. Collaborators: Dan Bergmann, Jeff Painter, Mike Wickett (LLNL); Todd Ringler (LANL); Bill Skamarock, Michael Duda, Joe Klemp (NCAR). MPAS (Model for Prediction Across Scales) uses an unstructured icosahedral (hexagonal) grid; a finite-volume approach on a C-grid (normal velocities at cell edges); and a conformal grid (no hanging nodes) that is amenable to local mesh refinement. Dycores are being developed for both atmosphere and ocean.

MPAS uses a conformal, variable-resolution grid.

MPAS uses a vertical coordinate different from those of the other CAM dycores. MPAS uses an eta-based vertical coordinate, but with dry pressure instead of total pressure:
  Pd(x,k) = A(k)*(p0 - pt) + B(k)*(psd(x) - pt) + pt
which reduces to the same functional form as the other dycores if pt = 0. Important instances of A(k), B(k) in CAM are replaced by the pressure state variable or the reference pressure. The reference pressure (used for parameterizations) is supplied by a reference-pressure module:
  Pref(k) = A(k)*(p0 - pt) + B(k)*(p0 - pt) + pt
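A minimal sketch evaluating the two expressions above; the hybrid coefficients A(k), B(k) and the pressures are placeholder values, and the real ones would come from CAM's initial and reference data rather than being hard-coded.

  program dry_pressure_coordinate_sketch
    implicit none
    integer, parameter :: nlev = 4
    real, parameter :: p0 = 100000.0        ! reference surface pressure (Pa)
    real, parameter :: pt = 100.0           ! model-top pressure (Pa)
    real :: a(nlev), b(nlev)                ! hybrid coefficients A(k), B(k)
    real :: psd                             ! dry surface pressure in one column (Pa)
    real :: pd(nlev), pref(nlev)
    integer :: k

    a   = (/ 0.05, 0.10, 0.05, 0.00 /)      ! placeholder A coefficients
    b   = (/ 0.00, 0.30, 0.70, 1.00 /)      ! placeholder B coefficients
    psd = 98000.0

    do k = 1, nlev
       pd(k)   = a(k)*(p0 - pt) + b(k)*(psd - pt) + pt   ! dry-pressure coordinate
       pref(k) = a(k)*(p0 - pt) + b(k)*(p0 - pt) + pt    ! reference pressure
    end do

    print *, 'Pd   =', pd
    print *, 'Pref =', pref
  end program dry_pressure_coordinate_sketch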

CAM/MPAS is undergoing validation. Compared physics tendencies on the FV grid vs. the MPAS grid. Inclusion of cell-boundary arrays in the netCDF output allows visualization with CDAT. Compared MPAS driven by CAM (without physics) against MPAS-standalone for baroclinic wave tests, including advection of passive tracers. The codes run on a Cray XT5 (jaguar) and an Opteron-Infiniband system (atlas).

Zonal velocity change over one month (physics tendencies only, no dynamics).
[Figure: delta zonal wind after 30 days, FV vs. MPAS. At the surface: FV mean 3.68667, max 18.7827, min -16.0681 m/s; MPAS max 18.8909, min -16.5826 m/s. At 500 hPa: FV mean -3.31968, max 22.5772, min -36.7748 m/s; MPAS max 23.0351, min -36.4717 m/s.]

Baroclinic wave test (Jablonowski and Williamson). The initial velocity field is zonally symmetric and contains midlatitude northern- and southern-hemisphere jets. Apply a zonal wind perturbation. Evaluate using a 40962-cell grid (nominally 1-deg). Compare with the literature. Compare CAM/MPAS with MPAS-standalone.

Results of the baroclinic wave test (10 days). [Figure: CAM/MPAS vs. MPAS-standalone.]

Current activities and future plans. Using the newly re-written SCRIP to create mapping files that convert history output to a latitude-longitude grid; this will enable utilization of the AMWG diagnostics package and operation with the land model. Initiating a comprehensive aquaplanet comparison with other dycores. Carrying out performance evaluation across resolutions and process counts. Plan to carry out cases with the land model. Plan to investigate simulation using a locally refined grid over a key geographic region.