Using DDN IME for Harmonie

Irish Centre for High-End Computing. Using DDN IME for Harmonie. Gilles Civario, Marco Grossi, Alastair McKinstry, Ruairi Short, Nix McDonnell. April 2016.

DDN IME: Infinite Memory Engine

IME: Major Features
Burst Buffer: takes bulk writes quicker than the PFS.
High-performance Global Cache: imports data into, or pins files in, the IME data tier.
I/O Accelerator: avoids POSIX locking bottlenecks with enhanced, low-level communications protocols, accelerating both reads and writes.
Application Workflows: integrates with job schedulers, enabling simultaneous job runs and shortening the job queue for faster application run time.
PFS Optimizer: dynamically reorganizes data into sequential writes, eliminating the latency, thrashing, and slow write times created by the fragmented I/O patterns of demanding mixed-workload applications.
Aggregated Storage Capacity: intelligently virtualizes disparate NVM devices into a single pool of shared memory, providing increased capacity and bandwidth across a cluster of IME server nodes.
Scalable and Fault-tolerant Solution: provides scalability and redundancy at both the storage device and node level. If a server becomes saturated, the IME client automatically redirects the data traffic.

DDN IME Application I/O Workflow
COMPUTE: diverse, high-concurrency applications. Fast data: NVM & SSD. Persistent data: disk.
The lightweight IME client passes fragments to the application; IME servers send fragments to IME clients; IME servers write buffers to NVM and manage internal metadata; IME prefetches data based upon scheduler requests; the parallel file system acts as the persistent store for data.

ICHEC's experience
Worked with IME in 2014: pre-GA code. Worked with MPI-IO, single program. 30-60% speedup vs. Lustre on a seismic code.

Today's work
Test a full NWP workflow, not just the forecast: post-processing, dependent jobs, simultaneous multiple jobs.
Test MPI-IO and serial I/O: adding post-processing file conversion to NetCDF to add an MPI-IO job (using MVAPICH2), read simultaneously via POSIX.

Harmonie
HIRLAM/ALADIN consortium model. In use for Met Éireann production at ICHEC. POSIX-based I/O flow. cy 40h1.1.5beta. FA/LFI files, post-processed to GRIB (conversions to NetCDF added to test MPI-IO). IO_SERVER is an optional component.

Harmonie workflow
Standard flow: multiple (serial) preprocessing jobs populate the cache; the parallel forecast runs; post-processing (serial) jobs, typically triggered on n-hour output, read from the cache.

Initial test system
Compute: 8 x compute nodes, each with 2 x Intel Xeon E5-2680v2, 128 GB RAM, FDR InfiniBand.
Filesystem storage: DDN SFA7700, Lustre 2.5 with 2 x OSS servers and 6 OSTs; 3.4 GB/s write, 3.3 GB/s read.
IME system: 4 IME servers with 24 x 240 GB SSDs each; 36 GB/s write, 39 GB/s read.

IME configuration
The persistence of the IME burst buffer is provided by SSD drives. Each IME server has:
- 24 x 240 GB 2.5" SSD drives
- 2 x SAS2308 PCI-Express Fusion-MPT SAS-2 controllers
Each IME server runs two instances of the IME software service, each pinned to a specific NUMA node and InfiniBand port; the maximum throughput per IME server is limited by the speed of the IB interface, dual-port FDR in this case. MPI-IO requests using IME as the backend are automatically balanced between the configured IME servers; data transfer between IME clients and servers is carried over RDMA.

The MPI library to use for your run is a customized version of MVAPICH: MVAPICH 2.0, which includes the ROMIO driver for IME. Using IME as the backend for MPI-IO is as simple as prepending the filename with 'im:/'; the IME namespace replicates the underlying filesystem, in this particular case the Lustre one. If, for any reason, your MPI task still needs POSIX access to a file involved in an MPI-IO request, simply open the file as usual, omitting the 'im:/' prefix.
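As a minimal sketch (not from the slides) of what this looks like in practice, the C program below performs a collective MPI-IO write through IME by prepending 'im:' to an otherwise ordinary Lustre path; the file name is illustrative, and it assumes compilation with the customized MVAPICH2 build that provides the IME ROMIO driver.

```c
/* Hypothetical example: collective MPI-IO write routed through IME.
 * Dropping the "im:" prefix from the path would open the same file
 * directly on Lustre, as described above. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative path: the IME namespace mirrors the Lustre namespace. */
    const char *path = "im:/lustre/work/harmonie/hm_data/example.dat";

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a fixed-size record at a rank-dependent offset. */
    char buf[64];
    memset(buf, 0, sizeof(buf));
    snprintf(buf, sizeof(buf), "hello from rank %d\n", rank);
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, (int)sizeof(buf), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

Data written this way lands in the IME tier first and is later replayed to Lustre by IME; a task that needs to read the same file via POSIX simply uses the path without the 'im:' prefix.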

Test cases:
IRL25: 2.5 km domain over Ireland, 500 x 540 grid, 45 s timestep. Production run for Met Éireann.
IRL10: 1 km domain, 1300 x 1300 grid, 20 s timestep. Also uses the IO_SERVER.

Results
The Harmonie workflow works; no significant speedup was seen.
Post-processing was tuned to minimize I/O delays: minimal verification writes (every 6-hour output), reduced variable set. Tested IO server configuration and job size.

Tracking via IB traffic
2.5 km case, no IO server: an 8 s I/O step each hour, with two-stage writes including SURFEX output.

1 km case (1300 x 1300)
Limited by the small test cluster size. In the IO server case, 1-2 (pinned) cores are reserved for the server. No IO server: 20% drop in I/O write time (increased compute); same time to solution with 2 IO servers. (IB traffic plot: IME startup and a typical step.)

Gotchas
On crashes, inconsistent state was seen between IME and Lustre. Scripts were needed to delete and clean up /lustre/work/harmonie/hm_data/ and /ime/lustre/work/harmonie/hm_data/...

Current work
Add extra compute nodes: 30 extra nodes for running IRL25 + IRL10 overlapped, testing the performance of mixed post-processing + forecast jobs. Based on work on Fionn, this should saturate the test filesystem with: a 1 km subdomain, 15-minute boundary updates, and the serial verification workflow. Porting other post-processing tasks (hydrological model) into the mix.

Thank you! Thanks to: James Coomer, DDN; Niall Wilson, ICHEC.

IME Architecture
IME Client: application interfaces (POSIX and MPI-IO), creates erasure coding, manages data to/from the application.
IME Server: Distributed Hash Table (extreme update rates; decentralised, fault-tolerant, scalable; place data anywhere), Log Structured Filesystem (dynamic data placement, improves flash lifetime), PFS client (manages buffers for I/O replay to the PFS, manages fast rebuilds).