Application Performance on IME


Application Performance on IME. Toine Beckers (DDN), Marco Grossi (ICHEC)

Burst Buffer Designs
- Introduce a fast buffer layer between memory and persistent storage
- Pre-stage application data
- Buffer writes from memory to fast devices
- Store intermediate application data
- Still a mount point (similar to a file system); a minimal usage sketch follows below
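
As a rough illustration of the "still a mount point" property, here is a minimal sketch (not from the talk; the mount points and file names are hypothetical) of an application treating the burst buffer as just another directory for intermediate data, with final results going to the parallel file system:

```cpp
// Minimal sketch: from the application's point of view a burst buffer is
// just another mount point, so intermediate data can be redirected to it by
// changing a path. Mount points and file names below are hypothetical.
#include <fstream>
#include <string>
#include <vector>

int main() {
    const std::string pfs_scratch  = "/lustre/scratch/run42";  // parallel FS
    const std::string burst_buffer = "/bb/scratch/run42";      // fast tier

    std::vector<char> snapshot(1 << 20, 0);  // 1 MiB of dummy intermediate data

    // Intermediate data goes to the fast buffer layer ...
    std::ofstream tmp(burst_buffer + "/snapshot_000.bin", std::ios::binary);
    tmp.write(snapshot.data(), snapshot.size());

    // ... while final results are written to (or later drained to) the PFS.
    std::ofstream out(pfs_scratch + "/result.bin", std::ios::binary);
    out.write(snapshot.data(), snapshot.size());
    return 0;
}
```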

3 Infinite Memory Engine: How does it Work?

IME Summary
- Designed for scalability: ultra-low latency I/O between compute nodes and NVM
- Fully POSIX & HPC compatible; additional APIs available
- Scale-out data protection: distributed erasure coding
- Non-deterministic system: write anywhere, no layout needed
- Integrated with file systems: accelerates Lustre and GPFS, no code modification needed
- Writes fast, reads fast too: no other system offers both at scale

5 ICHEC Background
- Irish Centre for High-End Computing, the national technology centre
- Established in 2005: 10th anniversary!
- Powered by people: 27 staff, a terrific mix of computational scientists, researchers, developers and systems administrators
- Offices in Dublin (east coast) and Galway (west coast)
- Mandates include HPC and Big Data/Data Analytics
- Industry engagement: partnerships, consultancy, training and services
- Public sector and agency engagement: services, enablement and training
- National academic HPC service: collaboration, training and service provision

6 TORTIA Intro
- TORTIA (Tullow Oil Reverse Time Imaging Application)
- Developed in house for, and in collaboration with, Tullow Oil plc: a real application for real work!
- Reverse Time Migration (RTM) code, used by oil & gas companies to analyse seismic survey data
- TORTIA is heavily optimized and tuned: parallelism and vectorization, but also optimized on the I/O side
- Achieves 30-50% of peak at scale

7 TORTIA: Some details
- Standard C++ with OpenMP & MPI
- Input and output data in SEG-Y format
- Requires a temporary scratch area: the first half of the time loop dumps snapshots of the velocity fields, the second half reads the saved snapshots back
- LIFO (Last-In First-Out) access pattern
- Three different I/O backends are implemented for the scratch: POSIX, MPI-IO, and in-memory (i.e. no I/O); a sketch of this pattern follows below
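
A minimal sketch of this scratch pattern, assuming a plain POSIX backend and hypothetical file names (not the actual TORTIA code): snapshots are dumped in order during the first half of the time loop and read back in reverse, LIFO order during the second half.

```cpp
// Minimal sketch of the LIFO scratch pattern: k snapshots of the velocity
// field are dumped during the first half of the time loop and read back in
// reverse order during the second half. Plain POSIX-style I/O via fstream;
// file names and sizes are hypothetical.
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical snapshot file name; in production this would live on the
// scratch file system (Lustre, or the IME burst buffer).
static std::string snapshot_path(int step) {
    return "snap_" + std::to_string(step) + ".bin";
}

int main() {
    const int k = 8;                       // number of time steps (toy value)
    std::vector<float> field(1024, 0.0f);  // stand-in for the velocity field

    // First half of the time loop: dump snapshots 0, 1, ..., k-1.
    for (int step = 0; step < k; ++step) {
        // ... propagate the wavefield, then write the snapshot ...
        std::ofstream out(snapshot_path(step), std::ios::binary);
        out.write(reinterpret_cast<const char*>(field.data()),
                  field.size() * sizeof(float));
    }

    // Second half: read the snapshots back in reverse (LIFO) order, so the
    // most recently written data is consumed first.
    for (int step = k - 1; step >= 0; --step) {
        std::ifstream in(snapshot_path(step), std::ios::binary);
        in.read(reinterpret_cast<char*>(field.data()),
                field.size() * sizeof(float));
        // ... combine with the backward-propagated field ...
        std::remove(snapshot_path(step).c_str());  // snapshot no longer needed
    }
    return 0;
}
```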

8 TORTIA Scratch I/O pattern: LIFO
[Diagram: compute and I/O timeline. Snapshots 0, 1, 2, ..., k-2, k-1 are written in order, then read back in reverse order: k-1, k-2, ..., 1, 0.]
- The most recently written snapshots are likely to still be in cache; the earliest ones have a high chance of a cache miss
- This applies on both the compute node and the storage side

9 TORTIA on pre-GA DDN IME: Test cluster
- Compute nodes: 8 x compute nodes, each with 2 x Intel Xeon E5-2680v2, 128 GB RAM, FDR InfiniBand
- Lustre file system storage: DDN SFA7700, Lustre 2.5 with 2 x OSS servers and 6 OSTs; 3.4 GB/s write, 3.3 GB/s read
- IME system: 4 IME servers with 24 x 240 GB SSDs each; 36 GB/s write, 39 GB/s read
- Compute nodes, IME servers and object storage servers connected over IB FDR

10 TORTIA Code porting
- Used the MPI-IO interface to DDN IME
- Some constraints on the pre-GA IME: required a patched version of MVAPICH2, with the IME libraries added at link time
- Prepended "im:" to the file path (see the sketch below)
- Used MVAPICH instead of Intel MPI, but still used the Intel compiler
- DDN Düsseldorf LAB
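
A minimal MPI-IO sketch of this porting step (the path, data layout and sizes are hypothetical; in the actual setup a patched MVAPICH2 and the IME libraries at link time were also required). The only source-level change is the "im:" prefix on the file path:

```cpp
// Minimal sketch: routing an MPI-IO scratch file through IME by prepending
// "im:" to the path. Everything below except the prefix is ordinary MPI-IO;
// the path and per-rank layout are hypothetical.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // With the IME-aware MPI-IO layer, the "im:" prefix sends this file
    // through the burst buffer; without it, the same path goes to Lustre.
    char path[] = "im:/lustre/scratch/tortia/snap_000.bin";  // hypothetical

    std::vector<float> snapshot(1024, 0.0f);  // stand-in for one snapshot
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes its own contiguous block of the snapshot file.
    MPI_Offset offset = static_cast<MPI_Offset>(rank) *
                        static_cast<MPI_Offset>(snapshot.size() * sizeof(float));
    MPI_File_write_at(fh, offset, snapshot.data(),
                      static_cast<int>(snapshot.size()), MPI_FLOAT,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```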

11 TORTIA Experiment use case

Scratch I/O targets and interfaces:
- In-memory: no I/O interface
- Lustre: MPI-IO
- DDN IME: MPI-IO

Test scenarios (total I/O size):
- Small: 80 GB (quick data validation)
- Medium: 950 GB (typical production run)
- Large: 8.4 TB (high-resolution run)

12 TORTIA on pre-GA DDN IME: Total execution time
- Run configuration: 6 nodes, 2 MPI ranks per node, 20 OpenMP threads per rank
[Chart: normalized total execution time for the three I/O targets (in-memory, Lustre, IME burst buffer) on the Small (80 GB), Medium (950 GB) and Large (8.4 TB) cases.]
- Up to 3x speedup in total execution time
- In-memory is not applicable to the Large case: not enough memory on the nodes

13 TORTIA on pre-GA DDN IME: Independent runs
- Multiple independent runs of the Small test case: 1 run per compute node, node count in {1..8}
[Chart: elapsed time in seconds for Lustre and IME, and speedup of IME over Lustre (right axis, roughly 1.2x to 1.6x), versus the number of concurrent independent runs (1 to 8).]

14 TORTIA on pre-GA DDN IME: Time spent in I/O
- Large test case; data collected using Darshan
[Chart: normalized time spent in MPI-IO reads and MPI-IO writes for Lustre and the IME burst buffer.]