The Icosahedral Nonhydrostatic (ICON) Model


The Icosahedral Nonhydrostatic (ICON) Model: Scalability on Massively Parallel Computer Architectures
Florian Prill, DWD + the ICON team
15th ECMWF Workshop on HPC in Meteorology, October 2, 2012

ICON = ICOsahedral Nonhydrostatic model: a global circulation model for atmosphere and ocean, jointly developed by DWD and the Max Planck Institute for Meteorology.

Governing equations (prognostic variables $v_n, w, \rho, \theta_v$):

$\partial_t v_n + (\zeta + f)\,v_t + \partial_n K + w\,\partial_z v_n = -c_{pd}\,\theta_v\,\partial_n \pi$
$\partial_t w + \nabla\cdot(v_n w) - w\,\nabla\cdot v_n + w\,\partial_z w = -c_{pd}\,\theta_v\,\partial_z \pi - g$
$\partial_t \rho + \nabla\cdot(v\rho) = 0$
$\partial_t(\rho\theta_v) + \nabla\cdot(v\rho\theta_v) = 0$

Code design goals:
- data parallelism and task parallelism
- multi-vendor interoperability
- operational schedule: fixed run-time characteristics

1 / 19 Florian Prill :: Parallelization of the ICON Model

In a Nutshell
ICON implements optimization strategies for:
- data placement and locality
- reduction of serial bottlenecks
- system-level optimization
ICON parallel program design: SPMD-style programming with master-only message passing + OpenMP task parallelism + asynchronous I/O + advanced data structures for special purposes.

ICON's Unstructured Grid
- Primal cells: triangles; an icosahedron serves as the macro-triangulation
- Staggered grid, velocity at edge midpoints
- Local subdomains ("nests") inside the global domain
- Example: 20 km global resolution gives approx. 1.3 x 10^6 cells and 90 vertical levels (up to 75 km)

ICON's Domain Decomposition
Criteria:
1. Static load balancing, e.g. every PE comprises sunlit and shadowed parts of the globe
2. Communication reduction
Explicit array partitioning with halo regions, lateral boundary regions, nest overlap regions, and interior points avoids conditionals in iterations.
Geometric decomposition, 20 partitions.

The same criteria apply to a min-cut decomposition (20 partitions). Example, 40 PEs, 40 km global: communication ca. -4 %, connections ca. -7.9 %.
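The halo regions mentioned above can be illustrated with a small sketch: given a cell-adjacency list and an owner map, a partition's halo is the set of neighboring cells owned by other PEs. This is a minimal stand-in for a partitioned triangular mesh, not ICON's actual data layout; all names are illustrative.

```python
# Sketch: computing the halo (ghost) cells of one partition, assuming a
# simple adjacency-list mesh representation (not ICON's data structures).

def halo_cells(owned, neighbors, owner, my_rank):
    """Return cells adjacent to this partition but owned elsewhere."""
    halo = set()
    for cell in owned:
        for nb in neighbors[cell]:
            if owner[nb] != my_rank:
                halo.add(nb)
    return sorted(halo)

# Toy mesh: 6 cells in a ring, split between two partitions.
neighbors = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
owned = [c for c, r in owner.items() if r == 0]
print(halo_cells(owned, neighbors, owner, 0))  # cells 3 and 5 form the halo
```

Before each stencil evaluation, the values of exactly these halo cells must be received from their owners, which is the ghost-cell exchange discussed later.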

Solver Parts
Run-time portions (example):
- dynamical core: 18 %
- horizontal diffusion: 3 %
- communication: 18 %
- transport: 13 %
- physics: 19 %
- radiation: 7 %
- misc, e.g. I/O and feedback: 21 %

I/O: Bottleneck for High-Resolution NWP
Classical root-process I/O of high-resolution data becomes a critical issue. The ICON model therefore offloads all computed results to dedicated output nodes:
- computation and I/O overlap
- fast system layer: Climate Data Interface (CDI)
- WMO GRIB2 standard
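The compute/I-O overlap can be sketched as a producer handing finished fields to a dedicated writer and continuing immediately. A Python thread and queue stand in for ICON's dedicated MPI output PEs; all names here are illustrative.

```python
# Sketch of asynchronous output: compute ranks enqueue finished fields,
# a dedicated writer drains the queue while computation continues.
import queue
import threading

def writer(q, sink):
    while True:
        item = q.get()
        if item is None:          # shutdown sentinel
            break
        sink.append(item)         # stands in for the actual file write

q = queue.Queue()
written = []
t = threading.Thread(target=writer, args=(q, written))
t.start()
for step in range(3):
    field = f"field@step{step}"  # pretend model output of one step
    q.put(field)                 # hand off; the compute loop goes on at once
q.put(None)                      # signal end of run
t.join()
print(written)
```

In the real model the hand-off is an MPI message to an output PE rather than an in-process queue, but the overlap principle is the same.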

Advanced Data Structures
Proximity queries in a metric space: find the near neighbors in a large data set (here: the set of cell centers, with distance measured on the sphere).
Elementary approach, the grid method: divide the search area into small spaces and keep short lists of points.
Alternative approach, search trees:
- Setup phase: OpenMP pipelining during tree construction
- Query phase: different threads may traverse the tree structure in parallel
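The grid method above can be sketched in a few lines, with a planar Euclidean distance standing in for the spherical metric. A query only inspects the 3x3 block of cells around the query point, which is valid as long as the search radius does not exceed the cell size.

```python
# Sketch of the "grid method": hash points into small cells so a proximity
# query inspects only a few neighboring cells. Planar stand-in for the sphere.
from math import dist, floor

def build_grid(points, h):
    """Bucket points into square cells of side h."""
    grid = {}
    for p in points:
        grid.setdefault((floor(p[0] / h), floor(p[1] / h)), []).append(p)
    return grid

def near(grid, q, h, r):
    """All points within radius r of q (requires r <= h)."""
    i, j = floor(q[0] / h), floor(q[1] / h)
    hits = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for p in grid.get((i + di, j + dj), []):
                if dist(p, q) <= r:
                    hits.append(p)
    return hits

pts = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9)]
g = build_grid(pts, h=0.25)
print(near(g, (0.15, 0.1), h=0.25, r=0.1))  # finds the two nearby points
```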

Geometric Near-neighbor Access Trees
A generalization of kd-trees:
1. At the top node of a GNAT, choose several distinguished split points.
2. Classify the remaining points into groups depending on which Voronoi cell they fall into.
3. Recursive application forms the tree levels.
Additionally, store the ranges of distance values from each split point to the data points associated with the other split points.
S. Brin: Near Neighbor Search in Large Metric Spaces. VLDB '95
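Step 2 of the construction, classifying points by their nearest split point (i.e. by metric Voronoi cell), can be sketched as follows; Euclidean distance stands in for the sphere distance, and the names and data are illustrative.

```python
# Sketch of GNAT's classification step: each remaining point joins the
# group of its nearest split point, i.e. its Voronoi cell in the metric.
from math import dist

def classify(splits, points):
    """Assign every point to the Voronoi cell of its nearest split point."""
    groups = {s: [] for s in splits}
    for p in points:
        nearest = min(splits, key=lambda s: dist(s, p))
        groups[nearest].append(p)
    return groups

splits = [(0.0, 0.0), (1.0, 1.0)]
pts = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.4)]
print(classify(splits, pts))
```

Applying this step recursively to each group yields the tree levels of step 3; the stored distance ranges then let a query prune whole subtrees.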

Flat-MPI + Hybrid Performance

Objective
75 % of Top500 HPC systems have 6 cores per socket, while on-chip clock rates have increased only moderately.
Amdahl '67 gives a pessimistic saturation limit for program scaling on HPC systems: speedup $S \le 1/\alpha$, where $\alpha$ is the non-parallel fraction of the code.
For NWP, the run time, not the problem size, is held constant. The parallel amount of work grows with the number of PEs, so weak scaling matters in practice!
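The saturation limit can be checked numerically: with serial fraction alpha, Amdahl's law gives a speedup of 1 / (alpha + (1 - alpha)/N) on N PEs, which approaches 1/alpha as N grows.

```python
# Amdahl's law: speedup on n PEs with serial fraction alpha,
# bounded above by 1/alpha no matter how many PEs are added.
def speedup(alpha, n):
    return 1.0 / (alpha + (1.0 - alpha) / n)

for n in (16, 256, 4096):
    print(n, round(speedup(0.05, n), 1))
# with 5 % serial code the speedup saturates near 1/0.05 = 20
```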

Flat-MPI + Hybrid Performance
[Figure: time cost / (cell/cores) vs. number of IBM P6 cores (500-2000) for dynamics, horizontal diffusion, physics, radiation, sync, transport, and total. Test setup: experiment APE NH, geometric decomposition, 1 h, reduced radiation grid, 48 cells/core]
[Figure: relative time and memory (%) for process/thread combinations 64x32x1, 32x32x2, 32x16x4, 32x8x16. Test setup: real-data test case, 2000 steps, geometric decomposition, resolution 20 km global, 10 km local (L. Linardakis, 01/2012)]

Strong Scaling Test
Test setup: non-hydrostatic test, real data, 10-day forecast
Resolution: R2B06 (approx. 40 km), 90 levels, time step 60 s
Parallel setup: IBM Power6 platform, 6 h output, 32 MPI tasks x 2 OpenMP threads/node (SMT)

Portable Efficiency?

Cache Optimization
Unstructured grids make extensive use of indirectly accessed arrays. Manual loop transformations:
- Loop interchange: array assignments are transformed to allow vectorization and improved cache performance on scalar platforms.
- Loop tiling: block-partitioned loop iteration space, enhancing cache reuse. The tunable block size enables automatic optimization for a wide range of platforms.
[Figure: normalized run time vs. block size (0-80) for the dynamical core and for physics; IBM Power 6, 40 km global]
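The loop-tiling idea can be sketched in a few lines: the iteration space is walked block by block so that indirectly addressed neighbor data stays in cache. The block-size parameter `nproma` mirrors ICON's tunable block size; the averaging stencil and all data are purely illustrative.

```python
# Sketch of loop tiling (blocking) over an indirectly addressed array:
# the outer loop walks blocks of nproma cells, the inner loop applies a
# toy neighbor-averaging stencil within each block.
def tiled_update(values, neighbors, nproma):
    out = [0.0] * len(values)
    for start in range(0, len(values), nproma):            # loop over blocks
        for i in range(start, min(start + nproma, len(values))):
            out[i] = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
    return out

vals = [1.0, 2.0, 3.0, 4.0]
nbrs = [[1], [0, 2], [1, 3], [2]]                          # 1D toy stencil
print(tiled_update(vals, nbrs, nproma=2))  # -> [2.0, 2.0, 3.0, 3.0]
```

The result is independent of the block size; only the memory-access pattern, and hence cache behavior on a real machine, changes.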

Task Assignment
Prevent tasks from going off-node and reduce switch communication. Task placement is a Quadratic Assignment Problem:

$\min_{\pi \in \Pi(n)} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, b_{\pi_i \pi_j}$

where $a_{ij}$ is the flow matrix and $b_{ij}$ the distance matrix. This is an NP-hard problem!
B. Brandfass 2010; E. Taillard 1991 (robust taboo search)
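The QAP objective can be evaluated directly for a toy instance; brute force over all n! placements is only feasible for tiny n, which is why heuristics such as robust taboo search are used in practice. The matrices below are made up for illustration.

```python
# The QAP objective: cost of placing task i on node pi[i], given
# communication volume a (flow) and network distance b.
from itertools import permutations

def qap_cost(a, b, pi):
    n = len(pi)
    return sum(a[i][j] * b[pi[i]][pi[j]] for i in range(n) for j in range(n))

a = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]   # flow matrix (illustrative)
b = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # distance matrix (illustrative)

# Exhaustive search over all 3! placements; real instances need heuristics.
best = min(permutations(range(3)), key=lambda pi: qap_cost(a, b, pi))
print(best, qap_cost(a, b, best))
```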

Task Assignment
[Figure: reordered communication matrices, 3x15 tasks and 8x32 tasks]
[Figure: task assignment benefit, relative run time 0.75-1.00, on IBM P6 and NEC SX9]

Remote Memory Access
MPI-2 provides a model for remote memory access:
- potentially more lightweight
- may benefit from special hardware
Local ghost-cell exchange in the ICON model, comparing pt2pt, PSCW (post-start-complete-wait), fence, and alltoallv variants.
[Figure: relative run time (0.8-1.1) on IBM P6 and NEC SX9]

System Noise
- OS jitter: daemons, IRQs
- MPI jitter
- high-frequency perturbations (e.g. cache-related)
Analysis with an n-processor micro-benchmark. Relative OS jitter loss (fixed work quantum $t_{compute}$, $m$ measurements):

$\mathrm{lost}_{rel} = \frac{\sum_{i=1}^{m} \max(t_{compute,i},\, t_{compute}) - m\, t_{compute}}{m\, t_{compute}}$

[Figure: relative jitter loss (0.00-0.06) vs. number of MPI tasks (0-60)]
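The jitter-loss formula translates directly into code: with a fixed work quantum and m measured iteration times, any excess over the quantum is attributed to noise. The measurement values below are illustrative.

```python
# Relative OS jitter loss for a fixed work quantum t_compute and a list
# of m measured durations: excess time over the quantum, normalized by
# the total ideal compute time.
def jitter_loss(measured, t_compute):
    m = len(measured)
    total = sum(max(t, t_compute) for t in measured)
    return (total - m * t_compute) / (m * t_compute)

# four iterations of a 10 ms quantum, one perturbed by a 2 ms interruption
print(round(jitter_loss([0.010, 0.010, 0.012, 0.010], t_compute=0.010), 3))  # 0.05
```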

In a Nutshell
ICON implements optimization strategies for:
- data placement and locality
- reduction of serial bottlenecks
- system-level optimization
ICON parallel program design: SPMD-style programming with master-only message passing + OpenMP task parallelism + asynchronous I/O + advanced data structures for special purposes.

Unsolicited Advice for HPC Vendors:
- OS jitter: the system kernel must be minimized to increase performance
- Operational use: compiler support for reproducible code
- Pragma-based, incremental use of stream architectures

Florian Prill
Met. Analyse und Modellierung, Deutscher Wetterdienst
E-mail: Florian.Prill@dwd.de