ECMWF's Next Generation IO for the IFS Model and Product Generation

ECMWF's Next Generation IO for the IFS Model and Product Generation: Future workflow adaptations. Tiago Quintino, B. Raoult, S. Smart, A. Bonanni, F. Rathgeber, P. Bauer, ECMWF. tiago.quintino@ecmwf.int. 17th Workshop on High Performance Computing in Meteorology, Reading, UK, October 24, 2016

What do we do? ECMWF's HPC targets. Operations (time critical): operational runs with 2 hours from satellite cut-off to delivery of forecast products; a 10-day forecast twice per day (00Z and 12Z); boundary conditions at 06Z and 18Z; monthly and seasonal forecasts, etc. Research (non time critical): improving our models, climate reanalysis, etc. This creates a tension in the HPC facility targets: capability, minimising the time to solution of model runs, versus capacity, maximising the throughput of research jobs per day. Time critical vs. non time critical; capacity vs. capability. Challenge: how do we design our HPC system to optimise both goals while minimising TCO? 2

ECMWF HPC Job profile 3

ECMWF's Production Workflow: the IFS model produces fields, Product Generation turns them into products, and ECPDS disseminates them to Member States & customers; MARS holds the perpetual archive. 4

Operational workload: Job allocation (1 cycle) 5

Le Petit Prince, Antoine de Saint-Exupéry 6

Operational workload: Job allocation (1 cycle) 7

Operational workload: Files opened (1 cycle). Target files = #Users x #Steps x #Ranks. 8
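
A minimal worked example of the file-count formula above. The counts are purely hypothetical (the actual per-cycle numbers are not given on this slide); the point is how quickly the file count, and hence the metadata load, multiplies.

```python
# Hypothetical illustration of: target files = #Users x #Steps x #Ranks.
# None of these counts come from the slides; they only show how quickly the
# number of files opened in one cycle explodes.
users, steps, ranks = 50, 240, 1000
target_files = users * steps * ranks
print(f"{target_files:,} files opened in one cycle")  # 12,000,000
```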

Operational workload: Input Read (1 cycle) 9

Operational workload: Output written (1 cycle) 10

ECMWF's Production Workflow: the IFS model writes raw output fields to parallel filesystem storage (Lustre, via the FDB), from which Product Generation reads 42 TiB (70%) on the time critical path and produces products (PDB) for ECPDS dissemination to Member States & customers; MARS holds the perpetual archive. 11

Estimated Growth in Model IO
2015 (16 km, 137 levels):
  Time critical: 21 TB/day written, 22 million fields, 85 million products, 11 TB/day sent to customers
  Non time critical: 100 TB/day archived, 400 research experiments, 400,000 jobs/day
2020 (increase: 2x horizontal, 1x upper air):
  Time critical: 128 TB/day written, 90 million fields, 450 million products, 60 TB/day sent to customers
  Non time critical: 1 PB/day archived, 1,000 research experiments, 1,000,000 jobs/day
12

Feeling the Byte? I/O Gap 13

Burst Buffers. Compute nodes (Compute 1 ... Compute N) see low latency, short bursts and a low(er) IO rate towards the burst buffer; the burst buffer sees high latency, long transfers and a high(er) IO rate towards the Lustre PFS. Initially designed for checkpointing, burst buffers are used to absorb IO peaks; they are layered on top of the PFS, and the application sees persistence with low latency. Concerns: sharding vs consistency vs shared data; data replication (resilience); still a POSIX file system interface. What if we could change the application? 14
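
A minimal sketch of the staging pattern a burst buffer provides: the application writes to a fast tier on the critical path, and the data is drained to the parallel filesystem asynchronously. The mount points and helper function are hypothetical and not part of any ECMWF software.

```python
# Sketch of burst-buffer style staging: write to a fast local tier, drain to
# the parallel filesystem in the background. Paths are hypothetical.
import shutil
import threading
from pathlib import Path

FAST = Path("/fast/job1234")    # assumed node-local SSD/NVRAM tier
PFS = Path("/lustre/job1234")   # assumed Lustre parallel filesystem

def write_step_output(name: str, payload: bytes) -> None:
    """Write on the critical path to the fast tier, then drain asynchronously."""
    FAST.mkdir(parents=True, exist_ok=True)
    target = FAST / name
    target.write_bytes(payload)                                    # low-latency write
    threading.Thread(target=drain, args=(target,), daemon=True).start()

def drain(src: Path) -> None:
    """Copy a staged file to the PFS; this slow copy is off the critical path."""
    PFS.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, PFS / src.name)
```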

What is NextGenIO? Integrated into ECMWF's Scalability Programme, it explores new NVRAM technologies to minimise Exascale I/O bottlenecks. Partners: EPCC (project leader), Intel, Fujitsu, T.U. Dresden, Barcelona Supercomputing Center, Allinea Software, ARCTUR, ECMWF. Project aims: build an HPC prototype system with Intel 3D XPoint technology; develop tools and systemware to support application development; design scheduler strategies that take NVRAM into account; explore how best to use this technology in I/O servers. ECMWF tasks: provide requirements and use cases; develop an I/O workload simulator; explore interaction with the I/O server layer in IFS; test and assess the system scalability. http://www.nextgenio.eu - EU funded H2020 project, runs 2015-2018. 15

NVRAM Intel 3D XPoint. Key characteristics: storage density similar to NAND flash memory; better durability; speed and latency better than NAND, though slower than DRAM; priced between NAND and DRAM. (Source: https://en.wikipedia.org/wiki/3d_xpoint, "3D XPoint" by Trolomite, own work, licensed under CC BY-SA 4.0.) How is ECMWF planning to use this technology? Large buffers for time critical applications, similar to burst buffers but in application space; persistence until archival for non time critical data. Key point: high density at very low latency, adding a new layer to the hierarchical storage system view. 16
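
To make "persistence in application space" concrete, here is a minimal sketch using a plain memory-mapped file with an explicit flush. It only stands in for the kind of load/store persistence that NVRAM-aware libraries such as pmem.io provide on 3D XPoint; the mount path and buffer size are hypothetical.

```python
# Sketch of application-space persistence: a memory-mapped buffer the
# application fills with field data and explicitly flushes. Plain mmap/msync
# stands in for NVRAM primitives (e.g. pmem.io); path and size are assumed.
import mmap
import os

PATH, SIZE = "/mnt/nvram/fields.buf", 64 * 1024 * 1024  # assumed NVRAM mount

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)

field = b"GRIB..."            # encoded field bytes (placeholder)
buf[0:len(field)] = field      # store directly into the persistent buffer
buf.flush()                    # make the write durable (msync under the hood)
```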

IFS IO Server. Based on the Meteo-France IO server for IFS; entered production in March 2016. [Diagram: compute-only IFS ranks perform an NxM MPI aggregation to a smaller set of IFS IO ranks, which handle GRIB encoding and IO to persistent storage: the FDB on Lustre today, NVRAM under NEXTGenIO.] 17

Streaming Model Output to a Computing Service. On the time critical path, IFS IO ranks stream fields through the MultIO library into NVRAM buffers feeding Product Generation, with the FDB and MARS behind them. MultIO implements IO multiplexing and removes file system IO from the critical path. Today, we could save 32 TB of writes per hour and 26 TB of reads per hour. How do we store all model output in NVRAM buffers? 18
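
A minimal sketch of the IO multiplexing idea: a single write fans the same encoded field out to several sinks, for example an in-memory buffer feeding product generation and a file-backed store. The class and sink names are illustrative and not the actual MultIO API.

```python
# Sketch of IO multiplexing: one write fans out to multiple sinks.
# Sink names and interfaces are illustrative, not the MultIO API.
class BufferSink:
    """Keeps fields in memory, e.g. NVRAM buffers feeding product generation."""
    def __init__(self):
        self.fields = {}
    def write(self, metadata: dict, data: bytes) -> None:
        self.fields[tuple(sorted(metadata.items()))] = data

class FileSink:
    """Appends fields to a file, standing in for an FDB/PFS backend."""
    def __init__(self, path):
        self.path = path
    def write(self, metadata: dict, data: bytes) -> None:
        with open(self.path, "ab") as f:
            f.write(data)

class MultiplexedOutput:
    def __init__(self, *sinks):
        self.sinks = sinks
    def write(self, metadata: dict, data: bytes) -> None:
        for sink in self.sinks:          # the same field goes to every sink
            sink.write(metadata, data)

out = MultiplexedOutput(BufferSink(), FileSink("/tmp/fdb_sketch.bin"))
out.write({"param": "t", "step": 12}, b"GRIB...")
```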

Object Store. Key-value stores offer scalability: just add more instances to increase capacity and throughput; transactional behaviour with minimal synchronization; growing popularity, notably due to Big Data analytics. Example key: date=12012007, param=temp; value: 101001...100101010110010. Object storage OID: dfaa33ecccdae7f144aeac421. But ECMWF has been using a key-value store for 30 years... MARS. 19
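
A minimal sketch of this key-value view of field storage: scientific metadata forms the key and the encoded field is the value. The dictionary-backed store is purely illustrative.

```python
# Sketch of a key-value view of meteorological fields: the key is scientific
# metadata, the value is an opaque byte stream. Purely illustrative store.
store: dict[tuple, bytes] = {}

def key(**metadata) -> tuple:
    """Canonicalise metadata into a hashable, order-independent key."""
    return tuple(sorted(metadata.items()))

store[key(date="12012007", param="temp")] = b"101001...100101010110010"

print(store[key(param="temp", date="12012007")])  # key order does not matter
```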

MARS Language
RETRIEVE,
  CLASS = OD, TYPE = FC, LEVTYPE = PL, EXPVER = 0001, STREAM = OPER,
  PARAM = Z/T, TIME = 1200, LEVELIST = 1000/500, DATE = 20160517, STEP = 12/24/36
RETRIEVE,
  CLASS = RD, TYPE = FC, LEVTYPE = PL, EXPVER = ABCD, STREAM = OPER,
  PARAM = Z/T, TIME = 1200, LEVELIST = 1000/500, DATE = 20160517, STEP = 12/24/36
A unique way to describe all ECMWF data, both operational and research. 20

FDB (version 5). A domain-specific (NWP) object store: transactional, no synchronization; a key-value store whose keys are scientific metadata (MARS metadata) and whose values are byte streams (GRIB). [Diagram: multiple IFS tasks write metadata + fields into the FDB, which feeds MARS.] Support for multiple back-ends: POSIX file system (currently on Lustre); 3D XPoint NVRAM using the pmem.io library; others could be explored (Intel DAOS, Cray DataWarp, etc.). Supports wildcard searches, ranges, data conversion, etc., e.g. param=temperature/humidity, levels=all, steps=0/240/by/3, date=01011999/to/31122015. 21
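
A minimal sketch of the archive/retrieve semantics described above: fields are archived under MARS-style metadata keys, and a retrieve request may expand a dimension into several values (e.g. a list of steps). Class and method names are illustrative, not the actual FDB API.

```python
# Sketch of FDB-like semantics: archive fields under MARS-style metadata keys,
# retrieve by a request whose dimensions may list several values.
# Class and method names are illustrative, not the real FDB API.
from itertools import product

class FieldStore:
    def __init__(self):
        self._data = {}

    def archive(self, metadata: dict, field: bytes) -> None:
        self._data[tuple(sorted(metadata.items()))] = field

    def retrieve(self, request: dict) -> list[bytes]:
        """Expand list-valued dimensions and collect every matching field."""
        dims = [(k, v if isinstance(v, list) else [v]) for k, v in request.items()]
        keys = [dict(zip([k for k, _ in dims], combo))
                for combo in product(*[v for _, v in dims])]
        return [self._data[tuple(sorted(k.items()))]
                for k in keys if tuple(sorted(k.items())) in self._data]

fdb = FieldStore()
fdb.archive({"param": "t", "step": 12, "date": "20160517"}, b"GRIB-t-12")
fdb.archive({"param": "t", "step": 24, "date": "20160517"}, b"GRIB-t-24")
print(fdb.retrieve({"param": "t", "step": [12, 24], "date": "20160517"}))
```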

Product Generation - PGen. Rewritten in C++, based on the new interpolation software (MIR), caching algorithms for operators, and (explicit) task graph analysis. Remember: users can update their requests daily. Factorise common tasks, batch and reorder execution, compute time series on-the-fly. [Diagram: each user request runs a pipeline of Decode, Transform, Interpolation, Filter (or Crop), Encode; many requests share the early stages.] 22
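
A minimal sketch of factorising common tasks in the request graph: identical (operation, inputs) pairs are merged into a single node, so that one decode, one spectral transform or one interpolation to a shared target grid is computed once and reused. The node and operation names are illustrative, not PGen internals.

```python
# Sketch of task-graph factorisation: identical (operation, inputs) nodes are
# merged so shared work (same decode, same transform, same interpolation grid)
# is computed only once. Names are illustrative.
def build_graph(requests):
    nodes = {}                           # (operation, inputs) -> node id

    def node(op, *inputs):
        key = (op, inputs)
        return nodes.setdefault(key, f"{op}#{len(nodes)}")

    outputs = []
    for req in requests:
        decoded = node("decode", req["field"])
        spectral = node("sh_transform", decoded)
        gridded = node("interpolate", spectral, req["grid"])
        outputs.append(node("encode", gridded, req["user"]))
    return nodes, outputs

requests = [
    {"user": "a", "field": "t_step12", "grid": "0.5/0.5"},
    {"user": "b", "field": "t_step12", "grid": "0.5/0.5"},   # shares all but encode
    {"user": "c", "field": "t_step12", "grid": "1.0/1.0"},   # shares decode + transform
]
nodes, outputs = build_graph(requests)
print(len(nodes), "tasks instead of", 4 * len(requests))      # 7 instead of 12
```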

PGen Task Graph Analysis 23

PGen Task Graph Analysis Merge same input 24

PGen Task Graph Analysis Merge same SH transforms 25

PGen Task Graph Analysis Merge same interpolation target grids 26

PGen Task Graph Analysis Merge same rotation and cropping 27

PGen Task Graph Analysis Caching of operators 28

PGen Task Graph Analysis Caching of operators Time series 29

PGen Task Reordering for Dynamic Load Balancing. Example: 6 tasks, 3 workers. [Diagram: plain FIFO dispatch leaves a long tail; a heuristic reorder followed by FIFO dispatch balances the load across workers.] 30
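
A minimal sketch of one such reordering heuristic: dispatch tasks longest-first onto the least-loaded worker and compare the makespan with plain FIFO dispatch. The task durations are hypothetical and this is not necessarily the heuristic PGen uses.

```python
# Sketch of a reordering heuristic: longest tasks first, each dispatched to the
# least loaded worker; compare the makespan with FIFO (arrival-order) dispatch.
# Task durations are hypothetical; not necessarily PGen's heuristic.
import heapq

def makespan(tasks, workers, reorder=False):
    if reorder:
        tasks = sorted(tasks, reverse=True)          # longest processing time first
    loads = [(0.0, w) for w in range(workers)]       # (current load, worker id)
    heapq.heapify(loads)
    for t in tasks:
        load, w = heapq.heappop(loads)               # next free (least loaded) worker
        heapq.heappush(loads, (load + t, w))
    return max(load for load, _ in loads)

tasks = [1, 1, 2, 2, 3, 9]                           # 6 tasks, hypothetical durations
print("FIFO dispatch:   ", makespan(tasks, 3))                 # 11: long task in the tail
print("Reorder + FIFO:  ", makespan(tasks, 3, reorder=True))   # 9: better balanced
```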

Summary: Overall Infrastructure Plan. [Diagram: IFS IO streams fields through MultIO into an in-memory FDB (hot buffering for performance); PGen is notified, retrieves the data from the object store and produces products; these time critical components sit in front of the on-disk FDB and MARS.] 31

Messages To Take Home. Burst buffers, SSDs and NVRAM are filling in the I/O gap and will change the way we use and store data. ECMWF is adapting its workflow to take advantage of these upcoming technologies (MultIO, FDB5, MIR, PGen). What would you do differently if your persistent storage were 10,000x faster? NEXTGenIO has received funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951. 32