CUG Talk. In-situ data analytics for highly scalable cloud modelling on Cray machines. Nick Brown, EPCC, nick.brown@ed.ac.uk

Met Office NERC Cloud model (MONC). Uses Large Eddy Simulation for modelling clouds and atmospheric flows. Written in Fortran 2003 due to scientist familiarity, and uses MPI for parallelisation. Designed to be a community model which is accessible to, and changeable by, non-expert HPC programmers while still scaling and performing well; for use not just by Met Office scientists, but also those in the wider weather/climate community. Replaces an older model, the LEM, which dates from the 1980s: from 22 million to billions of grid points, and from 256 cores to many thousands.

A challenge for data analysis: prognostics and diagnostics. With much larger domains (billions of grid points), how can we best analyse the data in a scalable fashion? The previous LEM model did this in line with computation, where the model would stop and calculate diagnostics before continuing with computation. Alternatively, the raw data could be written to disk and analysed offline.

In-situ approach. Have many computational processes and a number of analytics cores; typically one core per processor is dedicated to IO, serving the other cores running the computational model. Computational cores fire and forget their data. The approach is in-situ because the raw data is never written out, which would be too time consuming. It also avoids blocking the computational cores for analytics and IO.
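
A minimal sketch of the fire-and-forget hand-off a computational core might perform; the io_server_rank mapping, tag and buffer handling here are illustrative assumptions rather than the actual MONC API.

! Sketch: a computational core hands a packed field to its node's IO server
! with a non-blocking send and immediately returns to computation.
subroutine send_field_to_io_server(field, io_server_rank, field_tag, request)
  use mpi
  implicit none
  real(kind=8), intent(in) :: field(:)        ! packed field data for this timestep
  integer, intent(in)      :: io_server_rank  ! rank running the IO server for this node
  integer, intent(in)      :: field_tag       ! identifies which registered field this is
  integer, intent(out)     :: request         ! tested/freed later, never waited on here
  integer :: ierr

  call mpi_isend(field, size(field), MPI_DOUBLE_PRECISION, io_server_rank, &
                 field_tag, MPI_COMM_WORLD, request, ierr)
  ! No wait here: the IO server receives and processes the data asynchronously.
end subroutine send_field_to_io_server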

Existing approaches. Some existing approaches: XIOS, Damaris, ADIOS, the Unified Model IO server. We need: support for dynamic time stepping; checkpoint-restart of the IO server itself; bit reproducibility; scalability and performance; easy configuration and extensibility.

[IO server architecture diagram: raw MONC data arrives via the External API and flows through the operators, inter-IO communications and diagnostic manipulation (instantaneous or time averaged) components to the diagnostic writer, writer state serialiser and NetCDF file writer, producing the NetCDF output file.]

Define the data sent from MONC: arrays, scalars or maps; mandatory (the default) or optional; a unique subset of a field (collective) or not. If collective, sizes need to be provided per dimension: z, zn, y, x and qfields. Data types are integer, double, float, boolean and string. The IO server expects this data every 'frequency' timesteps. On registration the MONC process is told what it should send and when, and the MONC process tells the IO server the sizes.

<data-definition name="data_parcel" frequency=2>
  <field name="u" type="array" data_type="double" size="z,y,x" collective=true/>
  <field name="vwp_local" type="array" data_type="double" optional=true/>
</data-definition>

Define the diagnostics and their attributes, along with how to generate each diagnostic, organised as communications and operators.

<data-handling>
  <diagnostic field="vwp_mean" type="scalar" data_type="double" units="kg/m^2">
    <operator name="localreduce" operator="sum" result="vwp_mean_loc_reduced" field="vwp_local"/>
    <communication name="reduction" operator="sum" result="vwp_mean_g" field="vwp_mean_loc_reduced" root="auto"/>
    <operator name="arithmetic" result="vwp_mean" equation="vwp_mean_g/({x_size}*{y_size})"/>
  </diagnostic>
</data-handling>

User configuration - writing. Fields can be collected into groups, and each file specifies which fields or groups to include, the time manipulation to apply, and how often to write.

<group name="3d_fields">
  <member name="w"/>
  <member name="u"/>
</group>

<data-writing>
  <file name="profile_ts.nc" write_time_frequency="100.0" title="profile">
    <include field="vwp_mean" time_manipulation="averaged" output_frequency="2.0"/>
    <include group="3d_fields" time_manipulation="instantaneous" output_frequency="5.0"/>
  </file>
</data-writing>

Reuse of configuration. Numerous existing XML configurations are provided which can be included by the user. This raises the issue of name conflicts, which is handled by the concept of namespaces.

#include checkpoint.xml
#include profile_fields.xml
#include scalar_fields.xml

Event handling. The IO server's components and their sub-activities are event handlers. Events are processed concurrently by assigning them to threads from a pool, which aids asynchronous handling: as soon as data arrives, a thread is requested from the pool and the event is processed. The internal state of event handlers needs protection (mutexes/rw locks). Challenge: bit reproducibility. For some handlers a predictable order of processing events is enforced (based on model timestep), with out-of-order events queued up.
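
A minimal sketch of the timestep-ordering idea used for bit reproducibility, assuming a hypothetical event type and leaving out the thread pool and locking that the real handlers need.

! Process events strictly in timestep order; anything arriving early is
! queued until the earlier timesteps have been handled.
module ordered_events
  implicit none

  type :: event_t
     integer :: timestep
  end type event_t

  integer :: expected_timestep = 1
  type(event_t), allocatable :: pending(:)

contains

  subroutine submit_event(ev)
    type(event_t), intent(in) :: ev
    if (.not. allocated(pending)) allocate(pending(0))
    if (ev%timestep == expected_timestep) then
       call process_event(ev)
       expected_timestep = expected_timestep + 1
       call drain_pending()
    else
       pending = [pending, ev]   ! out of order: queue it for later
    end if
  end subroutine submit_event

  ! Release any queued events that have now become the expected timestep.
  subroutine drain_pending()
    integer :: i
    logical :: found
    found = .true.
    do while (found)
       found = .false.
       do i = 1, size(pending)
          if (pending(i)%timestep == expected_timestep) then
             call process_event(pending(i))
             expected_timestep = expected_timestep + 1
             pending = [pending(1:i-1), pending(i+1:)]
             found = .true.
             exit
          end if
       end do
    end do
  end subroutine drain_pending

  subroutine process_event(ev)
    type(event_t), intent(in) :: ev
    print *, 'processing event for timestep', ev%timestep  ! stand-in for real work
  end subroutine process_event

end module ordered_events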

Inter-IO communications challenge. We promote asynchronicity and processing of events out of order where possible, yet many inter-IO communications involve collective operations (such as a reduction). We would like to use MPI collectives, but the issue order of collectives matters (i.e. if IO server 1 issues a reduce on field A and then on field B, then all other IO servers must issue their reductions in that same order), and ensuring this would require additional overhead and/or coordination. Solution: abstract the communications through active messaging.

Active messaging for inter-IO comms. These communication calls additionally provide a unique ID, for matching collectives even if they are issued out of order, and a callback: the handling procedure called on the root when the data arrives.

inter_io_reduce(data, data_type, root, uniqueid, callback)

When the reduction is completed on the root, a thread is activated from the pool and calls the handling function (typically in the event handler). The unique ID here is the concatenation of field name and timestep. The layer is built upon MPI P2P communication calls.

call inter_io_reduce(data, data_type, 0, fieldname//"_"//timestep, handler)

subroutine handler(data, data_type, uniqueid)
end subroutine handler

Active messaging for synchronisation. File writing is done by NetCDF, but this is not thread safe, so it is crucial that only one thread per IO server process calls NetCDF functions at any one time. Many NetCDF calls are collective (i.e. they will block until called on all processes in the communicator); NetCDF close is an example of this, where each process blocks until the same call has been issued on all other processes. We don't want that, as it would block access to NetCDF (and MPI). Instead, an active messaging barrier calls the handling function on every process once the barrier has been issued by all processes.

call inter_io_barrier(filename_uniqueid, closehandler)

subroutine closehandler(uniqueid)
  call close_netcdf_file( )
end subroutine closehandler

Checkpointing. We need to support checkpoint-restart of the IO server and analytics. This is challenging due to the amount of asynchronicity, especially in the analytics, so we wait for all analytics up to that point to complete and just store the state of the writer. A two step process: walk the state, lock it and determine the amount of memory needed; then serialise the state, write this out and unlock.
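
A hedged sketch of the two-step checkpoint flow; the writer_state_t type and the routine names below are illustrative assumptions rather than the IO server's actual interface.

! Hypothetical two-pass serialisation: size the state under lock, then
! pack it into a single caller-allocated buffer, write it out and unlock.
module checkpoint_sketch
  implicit none

  ! Simplified stand-in for the writer state (the real state is far richer).
  type :: writer_state_t
     logical :: locked = .false.
     real(kind=8) :: pending_values(1000) = 0.0_8
  end type writer_state_t

contains

  ! Step 1: lock the state, walk it, and report the bytes needed.
  function prepare_to_serialise(state) result(nbytes)
    type(writer_state_t), intent(inout) :: state
    integer :: nbytes
    state%locked = .true.
    nbytes = 8 * size(state%pending_values)
  end function prepare_to_serialise

  ! Step 2: serialise into the caller-allocated buffer, then unlock.
  subroutine serialise(state, buffer)
    type(writer_state_t), intent(inout) :: state
    character(len=1), intent(out) :: buffer(:)
    buffer = transfer(state%pending_values, buffer)
    state%locked = .false.
  end subroutine serialise

end module checkpoint_sketch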

Prognostic writing optimisation. IO servers servicing many computational cores means that the data is naturally split up. For prognostic field writes this can be a problem, as it is very inefficient to do lots of independent writes to the file; we want to perform a minimal number of collective writes instead. We search through the domains of the local computational cores and merge chunks together where possible, producing a smaller number of large contiguous chunks. The same number of writes must be done on every core, so some perform empty writes if they do not have enough chunks.
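
A minimal sketch of the chunk-merging step in one dimension, assuming chunks are described as (start, count) pairs sorted by start; the real optimisation works on the full decomposition and issues collective NetCDF writes, padding with empty writes where needed.

! Merge adjacent contiguous chunks so that fewer, larger writes are issued.
subroutine merge_chunks(starts, counts, nmerged)
  implicit none
  integer, intent(inout) :: starts(:), counts(:)  ! sorted by start offset
  integer, intent(out)   :: nmerged               ! number of merged chunks
  integer :: i

  if (size(starts) == 0) then
     nmerged = 0
     return
  end if

  nmerged = 1
  do i = 2, size(starts)
     if (starts(i) == starts(nmerged) + counts(nmerged)) then
        counts(nmerged) = counts(nmerged) + counts(i)   ! contiguous: extend
     else
        nmerged = nmerged + 1                           ! gap: start a new chunk
        starts(nmerged) = starts(i)
        counts(nmerged) = counts(i)
     end if
  end do
  ! If another IO server ends up with more merged chunks, the shortfall is
  ! made up with zero-sized (empty) collective writes so call counts match.
end subroutine merge_chunks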

Performance and scalability. Standard MONC stratus cloud test case; weak scaling on a Cray XC30 with 65536 local grid points. 232 diagnostic values every timestep, time averaged over 10 model seconds. A file is written every 100 model seconds, and the run terminates after 2000 model seconds.

IO overhead as a metric. To measure the performance of the IO server and different configurations we adopt an overhead metric: the time difference from the MONC communication that should trigger a write, to that write having been physically performed.

Configuration                     | Overhead
MPI serialised                    | 8.92
MPI multiple                      | 12.02
MPI serialised + hyperthreading   | 8.14
MPI multiple + hyperthreading     | 9.71
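
A trivial illustration of how such an overhead might be recorded, using MPI wall-clock timestamps; where exactly the timestamps sit inside the IO server is an assumption for illustration.

program io_overhead_metric
  use mpi
  implicit none
  integer :: ierr
  double precision :: t_trigger, t_written

  call mpi_init(ierr)

  t_trigger = mpi_wtime()   ! when the triggering MONC communication arrives

  ! ... asynchronous analytics, collation and the NetCDF write happen here ...

  t_written = mpi_wtime()   ! when that write has physically been performed
  print *, 'IO overhead (seconds):', t_written - t_trigger

  call mpi_finalize(ierr)
end program io_overhead_metric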

Performance on the KNL. Cray XC40, 64 core KNL 7210; same stratus test case, 3.3 million grid points. [Runtime chart comparing configurations; values range from roughly 299s to 437s.] As MONC is not multi-threaded, can we run one IO server per MONC on the hyper-thread?

Conclusions and further work. We have discussed our approach to in-situ analytics and IO, which performs and scales well up to 32768 computational cores, as well as the architecture, challenges and lessons learnt from implementing it. Further work: extend the active messaging layer to build upon something other than MPI; plug in other writing mechanisms such as visualisation tools; extract the IO server from MONC to enable others to integrate it with their models.