The RAMDISK Storage Accelerator

The RAMDISK Storage Accelerator: A Method of Accelerating I/O Performance on HPC Systems Using RAMDISKs. Tim Wickberg, Christopher D. Carothers (wickbt@rpi.edu, chrisc@cs.rpi.edu), Rensselaer Polytechnic Institute.

Background: CPU performance doubles roughly every 18 months, and HPC system performance follows this trend. Disk I/O throughput, by contrast, doubles only once every 10 years. This creates an exponentially widening gap between compute and storage systems. We can't keep throwing disks at the problem to hold the current compute-to-I/O ratio intact: more disks not only cost more, their failure rates also cause problems.

RAMDISK Storage Accelerator: We introduce a new layer in the HPC storage hierarchy, the RAMDISK Storage Accelerator (RSA). It functions as a high-speed data staging area, allocated per job in proportion to the compute resources requested. It requires no application modification; only a few settings in the job script are needed to enable it.
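
As a rough illustration of the "few settings in a job script", the sketch below shows what enabling the RSA might look like from the user's side. The #RSA directives and all paths are hypothetical, invented for this example; they are not actual Slurm or RSA options.

    #!/bin/bash
    #SBATCH -N 64                     # standard Slurm resource request
    #SBATCH -t 24:00:00
    # Hypothetical RSA hints: where to stage data in from, and where to
    # push results back to, on the disk filesystem.
    #RSA stage_in=/gpfs/project/input
    #RSA stage_out=/gpfs/project/results

    # The application itself is unchanged; it reads and writes under its
    # usual path, which the RSA overlays with a bind mount when available.
    srun ./simulation --workdir /place/on/diskfs/$SLURM_JOB_ID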

RSA: Aggregate the RAMDISKs on the RSA nodes into a single volume using a parallel filesystem (PVFS in our tests; Lustre, GPFS, Ceph and others are possible as well). The resulting parallel RAMDISK is exported to the I/O nodes in the system.
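
A minimal sketch of building one brick of the parallel RAMDISK on an RSA node, assuming tmpfs for the RAMDISK and PVFS2 for aggregation; the sizes, paths and config-file names are illustrative, not taken from the test setup.

    # On each RSA node: carve a RAMDISK out of main memory and point a
    # PVFS2 server at it. 28g leaves headroom on a 32 GB node (assumption).
    mount -t tmpfs -o size=28g tmpfs /rsa-local
    # Create the PVFS2 storage space (first run only), then start the server.
    # The config file would list every RSA node as an I/O/metadata server.
    pvfs2-server /etc/pvfs2-fs.conf -f
    pvfs2-server /etc/pvfs2-fs.conf

    # On the I/O nodes: mount the aggregated volume (default PVFS2 port shown).
    mount -t pvfs2 tcp://rsa01:3334/pvfs2-fs /rsa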

Bind mount: mount -o bind /rsa /place/on/diskfs. The application sees a single filesystem hierarchy and doesn't need to know whether the RSA is functional. This decouples the RSA from the compute system and lets the application run regardless of RSA availability.
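
A sketch of how that decoupling might look in a per-job setup step (assumed logic, not the authors' code): bind the RSA over the application's usual path only if the job's RAMDISK volume is actually mounted.

    # If the RSA volume is up, overlay it on the path the application uses;
    # otherwise the same path simply resolves to the disk filesystem.
    if mountpoint -q /rsa; then
        mount -o bind /rsa /place/on/diskfs
    fi
    # Either way, the application opens its files under /place/on/diskfs.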

Scheduling: Set aside half the I/O nodes for active jobs, and the other half for jobs that have just finished or will start soon. RSA nodes are allocated in proportion to the compute allocation. Data movement is asynchronous to job execution, which frees the compute system up sooner.
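
As a toy illustration of allocating RSA nodes in proportion to the compute system (the variable names and the 50/50 split constant are assumptions for this sketch):

    # Half the RSA serves active jobs; the rest handles stage-in/stage-out.
    RSA_ACTIVE=$(( RSA_TOTAL_NODES / 2 ))
    # Give a job RSA nodes in proportion to its share of the compute system,
    # always at least one node.
    rsa_nodes=$(( job_compute_nodes * RSA_ACTIVE / COMPUTE_TOTAL_NODES ))
    [ "$rsa_nodes" -lt 1 ] && rsa_nodes=1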

Example job flow: Before the job begins, it is selected as next-likely to start and its data is staged in to the RSA. The job starts on the compute system, reads its data in from the RSA, and checkpoints to the RSA. When execution finishes, results are written to the RSA and the compute system is released. The RSA pushes the results back to disk storage after the compute system has moved on to the next job.
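
A condensed sketch of that flow as the RSA-side driver might express it (the rsa_stage_in/rsa_stage_out helpers and all paths are hypothetical):

    # 1. Before the job starts: copy its input onto the parallel RAMDISK.
    rsa_stage_in  /gpfs/project/input  /rsa/$JOB_ID/input

    # 2. While the job runs, it reads input and writes checkpoints and
    #    results under /rsa/$JOB_ID (via the bind mount on the compute side).

    # 3. After the compute allocation is released: drain results back to
    #    disk storage asynchronously, while the next job is already computing.
    rsa_stage_out /rsa/$JOB_ID/results /gpfs/project/results &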

Compare to the traditional job flow: initial load, a 15-minute read in from disk; checkpoints, 10 minutes each, once an hour over a 24-hour run; results back out, 10 minutes. Total: 225 minutes spent on I/O.

With the RSA: the initial load happens before the compute job starts; checkpoints take 1 minute each; results take 1 minute to write out to the RSA; afterwards, the RSA copies the results back to disk storage. Total: 25 minutes spent on I/O, saving 200 minutes on the compute system. Asynchronous data staging of this kind seems likely for any large-scale system at this point.

Test system: a 1-rack Blue Gene/L, plus 16 RSA nodes (borrowed from another cluster) with 32 GB of RAM each. Due to networking constraints, a single Gigabit Ethernet link connects the RSA to the BG/L functional network; this bottleneck is evident in the results.

Example RSA scheduler: the RSA scheduler implements the scheduling policy, RSA construction and destruction, and data staging. A proof-of-concept was built alongside the SLURM job scheduler.
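
A minimal sketch of what such an external scheduler loop could look like against the Slurm queue (the stage_in_for_job/stage_out_for_job helpers are hypothetical; the real proof-of-concept is described in the paper):

    #!/bin/bash
    # Poll the queue: stage data in for the pending job most likely to start,
    # and drain RSA contents for jobs that have just completed.
    while sleep 30; do
        next=$(squeue -h -t PENDING --sort=-p,i -o '%i' | head -n 1)
        [ -n "$next" ] && stage_in_for_job "$next"
        for finished in $(squeue -h -t COMPLETED -o '%i'); do
            stage_out_for_job "$finished" &
        done
    done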

Test results (more details in the paper, but briefly): an example test used file-per-process I/O with 2048 processes and 2 GB of data in total. Writing the test data to GPFS disk took 1100 seconds; the high contention exacerbates problems with GPFS metadata locking. Writing to the RSA took 36 seconds, plus 222 seconds to push the data back to GPFS after the compute system had been released. Staging the data back from a single system avoids the GPFS contention issues. That is a 2800% speedup from the application's (compute system's) viewpoint.
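
For concreteness, a file-per-process write at that scale could be mocked up as below; this is an assumed stand-in, not the benchmark actually used (2048 tasks each writing 1 MiB gives the ~2 GB total):

    # Each task writes its own file; conv=fsync forces the data out so the
    # timing reflects the filesystem, not the page cache.
    srun -n 2048 bash -c \
        'dd if=/dev/zero of=/rsa/test/file.$SLURM_PROCID bs=1M count=1 conv=fsync'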

Future work: a full-scale test system is to be implemented at CCNI in the next six months: a 1-rack (200 TFLOPS) Blue Gene/Q and 32 RSA nodes, each with 128 GB+ of RAM (4 TB+ total), connected by an FDR (56 Gb/s) InfiniBand fabric.

Extensions: RAM is faster, but SSDs are catching up and provide better price per capacity; everything shown here extends to SSD-backed systems as well. There is overhead in Linux memory management: data is copied in memory roughly five times on the way in or out, which could be reduced with major modifications to the kernel, or perhaps a simplified OS could be developed to support this. RSA scheduling could also be handled directly in the job scheduler rather than externally.

Conclusions: I/O continues to fall behind compute capacity. The RSA provides a method to mitigate this problem: it frees the compute system faster and reduces pressure on disk storage I/O, and it can be integrated into HPC systems without changing applications.

Thank You

Questions?