
The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy
Thomas C. Schulthess
ENES HPC Workshop, Hamburg, March 17, 2014

Piz Daint, presently one of Europe's most powerful petascale supercomputers: a Cray XC30 with 5272 hybrid nodes, each combining an Intel Sandy Bridge CPU with an NVIDIA K20x GPU.

(Figure; source: David Leutwyler)

Domain is larger by ~10x: small 500 x 500 x 60, large 1536 x 1536 x 60 grid points.
Same integration speed (1:80).
About 1.5x more nodes: small 95 nodes with 32 (AMD) cores each, large 144 hybrid (GPU+CPU) nodes.
Different implementations: small runs the regular COSMO code (MPI), large the new MPI+OpenMP/CUDA code.
(source: David Leutwyler)
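
A quick sanity check of the factors quoted above (a back-of-the-envelope sketch in Python; the grid sizes and node counts are simply copied from the slide):

small_grid = 500 * 500 * 60      # grid points, small domain
large_grid = 1536 * 1536 * 60    # grid points, large domain
print(large_grid / small_grid)   # ~9.4, i.e. the domain is roughly 10x larger
print(144 / 95)                  # ~1.5, i.e. about 1.5x more nodes (hybrid vs. CPU-only)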

(Chart) Speedup of the full COSMO-2 production problem (apples-to-apples with the 33 h forecast of MeteoSwiss), comparing the current production code with the new HP2C-funded code across four systems: Cray XE6 (Nov. 2011), Cray XK7 (Nov. 2012), Cray XC30 (Nov. 2012) and Cray XC30 hybrid (GPU) (Nov. 2013).

(Chart) Energy to solution (kWh per ensemble member) for the current production code and the new HP2C-funded code on the same four systems: Cray XE6 (Nov. 2011), Cray XK7 (Nov. 2012), Cray XC30 (Nov. 2012) and Cray XC30 hybrid (GPU) (Nov. 2013).

The bottom line: improving the implementation and introducing a new architecture (GPUs) results in a 2.5x speedup and a 4x improvement in energy to solution.
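
Since energy to solution is roughly average power times time to solution, the two headline numbers also say something about power draw. A back-of-the-envelope sketch (Python), using only the 2.5x and 4x figures from the slide:

speedup = 2.5                 # time_old / time_new
energy_improvement = 4.0      # energy_old / energy_new
power_ratio = speedup / energy_improvement
print(power_ratio)            # ~0.63: average power also dropped, by roughly 1.6x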

Refactoring COSMO, runtime based on the 2 km production model of MeteoSwiss. (Chart) Share of code lines (F90) versus share of runtime, for the original code (with OpenACC for GPU) and the rewrite in C++ (with a CUDA backend for GPU).

(Diagram) From model to machine, the traditional way. A physical model (velocities, pressure, temperature, water, turbulence) is cast into a mathematical description and a discretization / algorithm: the realm of domain science (incl. applied mathematics). The result is a code / implementation, e.g. the stencil expression
lap(i,j,k) = -4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k);
which is then compiled for a given supercomputer: the realm of computer engineering (& computer science), with the familiar path of porting serial code to supercomputers > vectorize > parallelize > petascaling > exascaling > ...
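
To make the stencil expression above concrete, here is a minimal sketch of the same 5-point Laplacian applied to a full 3D field (plain Python/NumPy, purely illustrative; this is not the COSMO implementation):

import numpy as np

def laplacian(data):
    # 5-point horizontal Laplacian at every interior (i, j), for all k;
    # the one-point halo around the domain is left untouched
    lap = np.zeros_like(data)
    lap[1:-1, 1:-1, :] = (-4.0 * data[1:-1, 1:-1, :]
                          + data[2:, 1:-1, :] + data[:-2, 1:-1, :]
                          + data[1:-1, 2:, :] + data[1:-1, :-2, :])
    return lap

field = np.random.rand(16, 16, 8)   # an (i, j, k) field with a one-point halo
lap = laplacian(field)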

(Diagram) The same chain with a different lower half. The physical model, mathematical description and discretization / algorithm remain with domain science (incl. applied mathematics). But instead of porting one implementation to a given machine, the code / implementation (e.g. the lap(i,j,k) stencil above) and its compilation are confronted with architectural options / design, and the mapping is handled through an optimal algorithm, auto tuning, and domain-specific libraries & tools: the contribution of computer engineering (& computer science).
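
One way to picture what domain-specific libraries & tools buy you: the scientist writes the stencil once, and the library selects (or generates) the architecture-specific implementation. A schematic sketch of that separation in Python; the names are invented for illustration and are not the API of the actual stencil library used in COSMO:

import numpy as np

BACKENDS = {}

def register_backend(name):
    # register one architecture-specific implementation of the stencil
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("cpu_numpy")
def laplacian_cpu(data):
    lap = np.zeros_like(data)
    lap[1:-1, 1:-1, :] = (-4.0 * data[1:-1, 1:-1, :]
                          + data[2:, 1:-1, :] + data[:-2, 1:-1, :]
                          + data[1:-1, 2:, :] + data[1:-1, :-2, :])
    return lap

# a GPU or auto-tuned variant would be registered the same way,
# e.g. @register_backend("gpu"), without touching the model code

def apply_laplacian(data, backend="cpu_numpy"):
    # the model calls this; the architectural choice is just a parameter
    return BACKENDS[backend](data)

lap = apply_laplacian(np.random.rand(16, 16, 8))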

COSMO: current and new, HP2C-developed code. (Diagram) The two codes as layered software stacks, from prototyping code / interactive data analysis at the top, through application code and domain-specific libraries & tools (DSL&T), down to basic libraries (incl. BLAS, LAPACK, FFT, ...) and the system. Current code: main, dynamics and physics sit directly on MPI and the system. New code: main drives physics with OpenMP / OpenACC and a dynamics built on a stencil library with x86 and GPU backends; boundary conditions go through GCL, on top of MPI and the system. Gory detail will be given in Xavier's talk tomorrow.
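
The boundary-condition and communication layer in the diagram above (GCL on top of MPI) is essentially a halo exchange between neighbouring subdomains. A toy sketch of the idea (Python/NumPy, both subdomains held in one process; the real GCL exchanges halos with MPI across nodes):

import numpy as np

def exchange_halos(left, right, halo=1):
    # copy each subdomain's interior edge into the neighbour's halo along i
    left[-halo:, :, :] = right[halo:2 * halo, :, :]
    right[:halo, :, :] = left[-2 * halo:-halo, :, :]

a = np.random.rand(10, 8, 4)   # two neighbouring subdomains with one-point halos
b = np.random.rand(10, 8, 4)
exchange_halos(a, b)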

(Diagram) The same picture once more, with the missing piece added: between domain science (the physical model with velocities, pressure, temperature, water and turbulence; mathematical description; discretization / algorithm) and computer engineering & computer science (code / implementation such as the lap(i,j,k) stencil above, code compilation, architectural options / design, optimal algorithm, auto tuning, domain-specific libraries & tools) sits a dynamic developer environment, i.e. not Fortran/C/C++ but based on Python or an equivalent dynamic language. This is the territory of the PASC co-design projects.

COSMO & other models: how things could develop with a dynamic developer environment main (current) main (new) scripts (future) dynamics Python environment physics dynamics physics with OpenMP / OpenACC stencil library X86 GPU boundary conditions GCL physics tools numerical tools dynamics grid tools MPI MPI backend backend system system system The main advantage: model development is scalable! ENES HPC Workshop, Hamburg, March 17, 2014 T. Schulthess!13

THANK YOU!