IME (Infinite Memory Engine): Extreme Application Acceleration & Highly Efficient I/O Provisioning. September 22nd, 2015. Tommaso Cecchi
2 What is IME? This breakthrough software-defined storage application introduces a whole new application-aware data acceleration tier that provides game-changing latency reduction and greater bandwidth and IOPS performance for today's and tomorrow's performance-hungry scientific, analytic and big data applications.
3 What is IME? IME delivers the performance of flash with the manageability & capacity of shared storage. IME is a new tier of transparent, extendable, non-volatile memory (NVM) that provides game-changing latency reduction and greater bandwidth and IOPS performance for the next generation of performance-hungry scientific, analytic and big data applications.
4 What is IME? IME creates a new application-aware fast data tier that resides right between compute and the parallel file system to accelerate I/O, reduce latency and provide greater operational and economic efficiency.
5 How Does IME Help? Changes the I/O Provisioning Paradigm & Reduces the Total Cost of Storage. IME enables organizations to separate the provisioning of peak and sustained performance requirements, with greater operational efficiency and cost savings than relying exclusively on disk-based parallel file systems.
Storage bandwidth utilization of a major HPC production storage system: 99% of the time it is below 33% of maximum, and 70% of the time below 5% of maximum.
IME reduces storage hardware by up to 70%: fewer systems to buy, power, manage and maintain.
6 How Does IME Help? Limitless Performance Scaling Removes Architectural & Economic Barriers. IME makes exascale I/O a reality, and finally enables the enterprise to run HPC jobs with much greater performance and efficiency. IME eliminates:
Parallel file system locking, limitations & bottlenecks
70% of storage hardware and consumed floorspace
Latency driving a 30% loss of compute resources
90% of checkpoint/restart downtime
7 Why Cache Matters in HPC. Even large HPC sites drive a lot of small I/O. Cache is critical in aligning all-too-frequent unaligned writes and capturing small writes to preserve spinning disk performance. All DDN Storage products offer cache mirroring & battery-backed RAM cache, proven across 3 generations to accelerate all varieties of data. Many systems today do not even offer a protected, redundant write cache. Caching is one of the most difficult layers of a storage stack to engineer; it is also the most critical. 2014 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change. ddn.com
8 Where IME Provides Value. IME accelerates parallel filesystems: absorbs all sizes of I/O at full performance, unlike Lustre* and GPFS.
9 Where IME Provides Value
1) MITIGATES POOR PFS PERFORMANCE caused by PFS locking, small I/O, and misaligned, fragmented I/O patterns. IME makes bad apps run well and also prevents a poorly behaving app from impacting the entire supercomputer. This is especially valuable in diverse workload environments and for ISV applications. IOR benchmarks indicate a 3x-20x speedup on I/Os < 32KB. [Chart: S3D Turbulent Flow Model; 25 MB/s, 4 GB/s, 50 GB/s]
2) PROVIDES HIGHER PERFORMANCE I/O (bandwidth and latency) to the application. At ISC14, we demonstrated a three-orders-of-magnitude speed-up due to this high-performance tier.
3) DRIVES SIGNIFICANTLY MORE EFFICIENT I/O TO THE PFS by re-aligning and coalescing data within the non-volatile storage. At ISC14, we demonstrated a two-orders-of-magnitude speed-up due to this efficiency.
10 IME Lowers the Total Cost of Storage. IME+PFS delivers better price/performance than PFS alone.
Configuration: cluster memory = 400TB; 12 IME appliances (each with 48x 1.9TB NVMe SSDs); NVM capacity = 2.75x cluster memory.
Components, SFA only vs. IME + SFA, and advantages:
Cluster I/O BW: 540 GB/s vs. 756 GB/s (216 GB/s more BW delivered)
Storage fabric BW: 540 GB/s vs. 270 GB/s (50% less BW needed to the PFS)
Qty OSS: 112 vs. 56 (50% fewer OSS to buy)
Qty SFA appliances: 14 vs. 7 (50% fewer SFA appliances needed)
Qty HDDs/SFA: 400 (80 HDD x 5 enclosures) vs. 800 (80 HDD x 10 enclosures) (200% more HDD density per SFA appliance delivering the same capacity)
Qty HDDs: 5,600 (14 SFA x 400 HDD) vs. 5,600 (7 SFA x 800 HDD) (similar persistent capacity)
IME value proposition: more bandwidth to the cluster (faster job turn-around, more jobs in the same period, fewer nodes needed to complete the same amount of work); fewer OSS and SFAs; reduced power, space and operational cost; lower overall capital cost.
11 HPC Ecosystem Client IO Interfaces. Application IO Implementation: High-level IO Libraries (optional), MPI-IO, Native IO, POSIX IO. Data path for HL IO library built on POSIX. Forwarded + Exported IO (optional). File System IO Interface (VFS, User Space Library).
12 High-Level IO Libraries
Provide an application- and end-user-oriented IO interface:
Files / directories abstracted from users in favor of data sets / objects / containers / variables
Object operations (put, get) instead of byte streams (read, write)
Portable, self-describing data sets
Example high-level IO libraries:
HDF5 (http://www.hdfgroup.org/hdf5/)
netCDF (http://www.unidata.ucar.edu/software/netcdf/)
PnetCDF (http://cucis.ece.northwestern.edu/projects/pnetcdf/)
ADIOS (https://www.olcf.ornl.gov/center-projects/adios/)
Implementations leverage lower-level IO interfaces: POSIX, MPI-IO
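To make "portable, self-describing data sets" concrete, here is a minimal C sketch of the idea: a named, dimensioned variable is stored together with its metadata, so a reader can recover the layout without out-of-band knowledge. This is a toy format invented for illustration, not the on-disk format of HDF5 or netCDF.

```c
#include <stdio.h>
#include <string.h>

/* Toy self-describing variable header (hypothetical, for illustration) */
typedef struct {
    char name[32];   /* variable name, e.g. "temperature" */
    int  ndims;      /* number of dimensions */
    int  dims[4];    /* extent of each dimension */
} var_header;

/* "put": store a named double array together with its metadata */
int put_var(FILE *f, const char *name, int ndims, const int *dims,
            const double *data)
{
    var_header h = {0};
    size_t n = 1;
    strncpy(h.name, name, sizeof h.name - 1);
    h.ndims = ndims;
    for (int i = 0; i < ndims; i++) { h.dims[i] = dims[i]; n *= (size_t)dims[i]; }
    if (fwrite(&h, sizeof h, 1, f) != 1) return -1;
    return fwrite(data, sizeof *data, n, f) == n ? 0 : -1;
}

/* "get": read the header first, then exactly the payload it describes */
int get_var(FILE *f, var_header *h, double *out, size_t max)
{
    size_t n = 1;
    if (fread(h, sizeof *h, 1, f) != 1) return -1;
    for (int i = 0; i < h->ndims; i++) n *= (size_t)h->dims[i];
    if (n > max) return -1;
    return fread(out, sizeof *out, n, f) == n ? 0 : -1;
}
```

The contrast with raw byte streams is that the reader asks for "the variable and its shape", not "bytes 0 to N"; the real libraries add typing, chunking and portability layers on top of the same principle.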
13 MPI-IO
Provides a high-performance parallel IO interface and semantics, applying successful MPI capabilities to file IO:
Bulk data capabilities (MPI_File_write_at_all)
Metadata capabilities (e.g. scalable file open())
The most popular implementation is Argonne National Laboratory's ROMIO, distributed in MPICH and available in MPICH derivatives (MVAPICH, IBM MPI, Intel MPI, Cray MPI, and others).
Key features:
Independent IO: uncoordinated parallel IO from many concurrent readers and writers
Collective IO: coordinated IO from many readers and writers. Two popular implementations:
o Data sieving: selective filtering of data (reduces IOPs)
o Two-phase IO: intermediate processes collect and serve data to other processes (reduces the number of readers/writers touching the PFS)
MPI derived data type support: allows the MPI runtime to load non-contiguous data in files directly into application data structures in RAM
o Used heavily by higher-level IO libraries (e.g. PnetCDF and HDF5)
Specialization for storage system targets (ROMIO ADIO drivers)
o IME provides an ADIO driver that translates MPI-IO requests into IME requests
o ROMIO provides drivers for Lustre, GPFS, PanFS, and others
Further reading: Chapter 13 of the MPI-3 standard, http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
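The two-phase IO point above can be sketched without MPI at all: the essence is that each process owns strided pieces of the file, and an aggregator gathers them into one contiguous buffer so the PFS sees a single large write instead of many small ones. The following plain-C stand-in (the "processes" are just array rows, and the striding scheme is an assumption for illustration, not ROMIO's actual layout) shows the gather phase.

```c
#include <string.h>

#define NPROC 4     /* number of simulated processes */
#define CHUNK 2     /* elements each process owns per stripe */
#define NSTRIPES 3  /* stripes across the file */

/* Phase 1 of two-phase IO: the aggregator collects each process's
   strided pieces into file order. Process p owns elements
   [(s*NPROC + p)*CHUNK, +CHUNK) of the file for every stripe s.
   Phase 2 (not shown) would be one large contiguous write(). */
void aggregate(int local[NPROC][NSTRIPES * CHUNK],
               int filebuf[NPROC * NSTRIPES * CHUNK])
{
    for (int s = 0; s < NSTRIPES; s++)
        for (int p = 0; p < NPROC; p++)
            memcpy(&filebuf[(s * NPROC + p) * CHUNK],
                   &local[p][s * CHUNK],
                   CHUNK * sizeof(int));
}
```

In real ROMIO the gather happens over the network between MPI ranks, but the payoff is the same: NPROC * NSTRIPES small strided writes collapse into one sequential write per aggregator.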
14 POSIX IO
Provides a portable byte-stream IO interface: read(), write(), open(), close(), and so on.
POSIX IO pros: portable; inertia (a vast installed base of applications already uses it)
POSIX IO cons: some design assumptions no longer hold for modern computers (concurrency and parallelism); lots of state at runtime (file descriptors)
Further reading: the POSIX standard (POSIX.1-2001)
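The byte-stream calls named above fit in a few lines; this minimal round trip also shows the runtime state the slide mentions, namely the file descriptor carried between calls. Portable POSIX, nothing IME-specific; the path used below is arbitrary.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write msg to path, then reopen and read it back into buf.
   Returns bytes read, or -1 on error. Each open() yields a file
   descriptor that must be tracked and close()d: the per-file
   runtime state POSIX imposes. */
ssize_t posix_roundtrip(const char *path, const char *msg,
                        char *buf, size_t bufsz)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0) return -1;
    ssize_t w = write(fd, msg, strlen(msg));  /* byte stream, no structure */
    close(fd);
    if (w < 0) return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t r = read(fd, buf, bufsz);
    close(fd);
    return r;
}
```

Contrast this with the put/get object operations of the high-level libraries: here the application, not the interface, must know how many bytes live where.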
15 DDN IME Ecosystem Client IO Interfaces. Application IO Implementation: High-level IO Libraries (optional), MPI-IO (IME), (POSIX), POSIX (IME FUSE), IME Native Client Library. Data path for HL IO library built on POSIX.
16 DDN IME Ecosystem Client IO Interfaces. Three primary interfaces for IME: IME FUSE o Provides POSIX IO o Captures IO requests through the Linux VFS o Target Use Case: General purpose applications that use POSIX IME ROMIO o Provides MPI-IO support o Captures IO requests through the MPI runtime in user space o Target Use Case: Parallel applications IME Native Library o Low-level programming interface o FUSE and ROMIO layers implemented on this interface o Target Use Case: Highly-optimized customer applications that may not map cleanly onto POSIX or MPI-IO
IME Internal Architecture Overview
18 Aggregate IME Adaptive vs. Non-Adaptive WRITE Performance. Ideal, healthy system; one degraded IME server, adaptive; one degraded IME server, non-adaptive. Amdahl's Law in action!
19 Real-Time IME Adaptive vs. Non-adaptive WRITE Performance Adaptive heuristic learns quickly 4x Performance Lost with Non-adaptive
20 Use of Log Structuring in IME. What does this give us? Near-line-rate performance regardless of output pattern.
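The reason log structuring is pattern-independent can be shown in a few lines of C: whatever offsets the application writes to, the device only ever sees sequential appends of (offset, length, payload) records, and reads reconstruct the logical view from the log. This is a toy in-memory sketch of the general technique, not IME's actual on-media format.

```c
#include <string.h>

#define LOGSZ 4096

typedef struct {
    unsigned char log[LOGSZ]; /* append-only backing store */
    size_t head;              /* next free byte in the log */
} logdev;

typedef struct { size_t off, len; } rec_hdr;

/* A random-offset write becomes one sequential append: this is why
   throughput stays near line rate regardless of the output pattern. */
int log_write(logdev *d, size_t off, const void *buf, size_t len)
{
    if (d->head + sizeof(rec_hdr) + len > LOGSZ) return -1;
    rec_hdr h = { off, len };
    memcpy(d->log + d->head, &h, sizeof h);
    memcpy(d->log + d->head + sizeof h, buf, len);
    d->head += sizeof h + len;
    return 0;
}

/* Read replays the log, letting later records overwrite earlier
   ones. A real system keeps an index instead of scanning. */
void log_read(const logdev *d, size_t off, void *buf, size_t len)
{
    memset(buf, 0, len);
    for (size_t p = 0; p + sizeof(rec_hdr) <= d->head; ) {
        rec_hdr h;
        memcpy(&h, d->log + p, sizeof h);
        const unsigned char *data = d->log + p + sizeof h;
        for (size_t i = 0; i < h.len; i++)
            if (h.off + i >= off && h.off + i < off + len)
                ((unsigned char *)buf)[h.off + i - off] = data[i];
        p += sizeof h + h.len;
    }
}
```

The cost is deferred: data must eventually be re-aligned and coalesced before draining to the PFS, which is exactly the efficiency gain described on the "Where IME Provides Value" slide.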