Advanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS)

Advanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS)
Understanding I/O Performance Behavior (UIOP) 2017
Sebastian Oeste, Mehmet Soysal, Marc-André Vef, Michael Kluge, Wolfgang E. Nagel, Achim Streit, André Brinkmann

Agenda
Motivation
ADA-FS (WIP) overview:
  The ad-hoc file system
  Data-aware scheduling
  Resource and topology discovery
  Monitoring
Tracing in distributed file systems
Conclusion
Future work
2

The ADA-FS project
New project in the second funding period of SPPEXA
SPPEXA topics: system software and runtime libraries
Three project partners:
  TU Dresden (TUD): Wolfgang E. Nagel (Principal Investigator), Andreas Knüpfer, Michael Kluge, Sebastian Oeste
  Johannes Gutenberg University Mainz (JGU): André Brinkmann (Principal Investigator), Marc-André Vef
  Karlsruhe Institute of Technology (KIT): Achim Streit (Principal Investigator), Mehmet Soysal
3

Motivation
The I/O subsystem is a general bottleneck in HPC systems (bandwidth, latency)
As a shared medium, it provides no guaranteed bandwidth or IOPS; jobs can disrupt each other
New storage technologies will emerge in HPC systems: SSDs, persistent HBM (High Bandwidth Memory), NVRAM, ...
  Faster access
  Pre-staging of inputs
4

View on three HPC systems

                               Aurora @ANL     Summit @ORNL    ForHLR II @KIT
# compute nodes                >50,000         >3,400          1,172
Bisection bandwidth            500 TByte/s**   40 TByte/s**    6 TByte/s
Global I/O bandwidth           1 TByte/s**     2.5 TByte/s**   50 GByte/s
Node-local storage bandwidth   >5 GByte/s*     >5 GByte/s*     0.5 GByte/s
Node-local storage             NVRAM           NVRAM           SSD

* assumed values   ** announced values

Future systems are planned with high-speed node-local storage
SSDs are already used in today's systems
Bisection bandwidth within the compute nodes is higher than the global I/O bandwidth
5

ADA-FS: proposed solution
Bring data closer to the compute units:
  Create a private parallel file system for each job on its compute nodes
  Tailor the file system to the application's requirements
  Discover the interconnect topology and distribute data accordingly
Advantages:
  Dedicated bandwidth and IOPS
  Independent of the global file system
  Low latency due to SSD/NVRAM
[Figure: compute nodes with node-local storage running an ad-hoc parallel FS created for a job, alongside the cluster's global parallel storage]
6

Project overview ADA-FS
Ad-hoc file system:
  Deployment on nodes that are part of a single job
  Usage of dedicated node-local storage resources
Data-aware scheduling:
  Central I/O planning
  Interface with resource management and scheduler
Application monitoring:
  Integrated I/O monitoring
  Dynamic resource usage tracking
Resource & topology discovery:
  Intelligent use of cluster topology information for optimized data placement
7

Issues with today's distributed file systems
Big players: Spectrum Scale (GPFS), Lustre, Ceph (not yet ready for production)
Not designed for Exascale; Spectrum Scale, for example, was designed for Gigascale:
  Heavy communication
  Intricate locking mechanisms
  Metadata is not scalable (rigid data structures)
Applications produce too much load on the file systems
Providing full POSIX I/O semantics is a major cause of these issues
8

Ad-hoc file system design
Based on the FUSE library
Eventual consistency, relaxed POSIX I/O semantics:
  Ignore metadata fields such as mtime, atime, and file size
  Simplify file system protocols: no locking
Scalable metadata approaches:
  Use key-value stores
  No rigid data structures (e.g., directory blocks)
  Last writer wins
Distribute data across disks (taking locality into account):
  Use cluster topology information for improved data placement
Optimize file system configurations for the running application
(A toy sketch of these ideas follows below.)
9
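The slides contain no code, so the following is only a rough, hypothetical sketch of the flavor of such a design (class and function names are invented, not ADA-FS code): per-file metadata lives as flat records in a key-value map with last-writer-wins updates, and the location of each data chunk is computable from the path, here via plain hashing onto the job's node list (the project intends to use topology information rather than a locality-oblivious hash).

```python
import hashlib
import time

class AdHocMetadata:
    """Illustrative key-value metadata store: one flat record per path,
    no directory blocks, conflicting updates resolved last-writer-wins."""

    def __init__(self):
        self.kv = {}  # path -> (timestamp, record); stand-in for a real KV store

    def put(self, path, record):
        entry = (time.time(), record)
        current = self.kv.get(path)
        if current is None or entry[0] >= current[0]:  # last writer wins
            self.kv[path] = entry

    def get(self, path):
        entry = self.kv.get(path)
        return entry[1] if entry else None


def chunk_location(path, chunk_index, nodes):
    """Map each fixed-size chunk to one of the job's nodes by hashing, so
    every client can compute placement without a central lookup."""
    key = f"{path}:{chunk_index}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return nodes[digest % len(nodes)]


if __name__ == "__main__":
    meta = AdHocMetadata()
    meta.put("/job/output.dat", {"size": 0, "chunk_size": 4 << 20})
    job_nodes = ["node01", "node02", "node03", "node04"]
    for i in range(4):
        print(i, chunk_location("/job/output.dat", i, job_nodes))
```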

Data-aware scheduling and data management
Interaction with the existing batch environment; no replacement of existing components
Challenges:
  Poor wall-clock predictions make it hard to plan the exact time to stage data into the ad-hoc file system
  Data has to be managed during data staging
Solutions:
  Use machine learning for better wall-clock prediction (see the sketch below)
  Pre-stage data using RDMA (NVMe)
  Manage data in workpools with a global identifier
  Tight interaction with the existing scheduler and resource manager
10
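As a rough illustration of the wall-clock prediction idea (this is not the project's model; the features and numbers below are invented), one could fit a simple regression over historical job records and use the prediction to decide when to start staging input data onto the nodes a job will receive:

```python
import numpy as np

# Hypothetical historical job records: requested walltime (s), node count,
# and the walltime of the user's previous job of this kind (s).
features = np.array([
    [86400, 16, 2900],
    [43200,  8, 3100],
    [86400, 32, 6200],
    [21600, 16, 2700],
], dtype=float)
actual_runtime = np.array([3000, 3200, 6500, 2800], dtype=float)

# Fit a simple linear model runtime ~ w . [1, features] as a stand-in for
# whatever machine-learning model the project ends up using.
X = np.hstack([np.ones((len(features), 1)), features])
weights, *_ = np.linalg.lstsq(X, actual_runtime, rcond=None)

new_job = np.array([1, 86400, 16, 3050], dtype=float)
predicted = float(new_job @ weights)
print(f"predicted walltime: {predicted:.0f} s")
# A tighter prediction tells the scheduler when the target nodes become free,
# i.e. when pre-staging into the ad-hoc file system should begin.
```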

Resource and topology discovery
Discover node-local storage types, sizes, and reliable local speeds
Store the information in a central resource database
Use tools to fetch the information from a number of nodes
Expectation:
  Detailed knowledge of where the job will run
  A better foundation for data-staging and scheduler decisions
[Figure: sysmap-collect gathers data from the compute nodes into the resource database]
(A per-node probe sketch follows below.)
11
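A minimal sketch of what a per-node probe in the spirit of sysmap-collect might gather (illustrative only; the paths, record fields, and the crude write test are assumptions, not the actual tool):

```python
import json
import os
import shutil
import socket
import tempfile
import time

def probe_local_storage(path="/tmp", test_bytes=64 << 20):
    """Collect a small per-node record: local storage capacity plus a crude
    sequential write rate (a stand-in for a proper storage benchmark)."""
    usage = shutil.disk_usage(path)
    buf = b"\0" * (1 << 20)
    start = time.time()
    with tempfile.NamedTemporaryFile(dir=path) as f:
        for _ in range(test_bytes // len(buf)):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.time() - start
    return {
        "node": socket.gethostname(),
        "mount": path,
        "capacity_bytes": usage.total,
        "free_bytes": usage.free,
        "seq_write_mb_s": round(test_bytes / (1 << 20) / elapsed, 1),
    }

if __name__ == "__main__":
    # In the project such records would be pushed to the central resource
    # database; here the record is simply printed.
    print(json.dumps(probe_local_storage(), indent=2))
```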

Monitoring
Record I/O behavior:
  Predict I/O phases
  Generalize I/O for application types
  Learn I/O characteristics
  Generate hints for the I/O planner
Monitor I/O operations (see the sketch below)
[Figure: I/O stack of parallel application process, ad-hoc file system (can modify or redirect I/O operations), and underlying file system]
12
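A tiny illustrative sketch (not project code) of the kind of per-operation record an interposition layer between the application and the underlying file system could collect and summarize into hints:

```python
import time

class RecordingFile:
    """Wrap a file object and record every read/write, roughly what an
    interposition layer inside the ad-hoc file system could collect."""

    def __init__(self, path, mode="wb"):
        self._f = open(path, mode)
        self.events = []  # (timestamp, operation, bytes)

    def write(self, data):
        self.events.append((time.time(), "write", len(data)))
        return self._f.write(data)

    def read(self, n=-1):
        data = self._f.read(n)
        self.events.append((time.time(), "read", len(data)))
        return data

    def close(self):
        self._f.close()

    def summary(self):
        written = sum(n for _, op, n in self.events if op == "write")
        read = sum(n for _, op, n in self.events if op == "read")
        return {"ops": len(self.events), "bytes_written": written, "bytes_read": read}


if __name__ == "__main__":
    f = RecordingFile("/tmp/adafs_demo.bin")
    for _ in range(8):
        f.write(b"x" * (1 << 20))   # an artificial 8 MiB write phase
    f.close()
    print(f.summary())   # such summaries could feed hints to the I/O planner
```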

Towards flexible and practical tracing in distributed systems
JGU visited IBM Research for a summer internship in 2016
Investigation of IBM Spectrum Scale's tracing framework:
  Over 1 million lines of code with 20,000+ static tracepoints
  Tracepoint sets: subsystems (60) and priorities (0-14)
  Two tracing modes: blocking and overwrite
Issues that arose after more than 20 years of development:
  Sets of tracepoints are impractical; developers tend to care about only two levels: default and not-default
  Too many tracepoints even in the default levels
  Either severe performance degradation (blocking) or information loss (overwrite)
  Excessive collection of byproduct traces
13

Solution
Consider: dynamic instrumentation vs. static instrumentation
Idea: combine the flexibility of dynamic instrumentation with the low overhead of static instrumentation
Evaluating a tracing prototype at IBM yielded several insights (see the sketch below):
  Use a ring buffer for trace collection
  Allow fine-grained control over all tracepoints; a simple bitmap is sufficient
  Allow flexible, user-defined tracepoint sets
14
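A minimal sketch of these ideas (illustrative only, not the IBM prototype): a bitmap gates each tracepoint on the fast path, and enabled events go into an overwriting ring buffer so tracing never blocks the writer.

```python
from collections import deque
import time

NUM_TRACEPOINTS = 64                 # illustrative; real systems have thousands
enabled = 0                          # one bit per tracepoint ID

def enable(*tracepoint_ids):
    """A user-defined tracepoint set is just a bitmask over tracepoint IDs."""
    global enabled
    for tp in tracepoint_ids:
        enabled |= 1 << tp

ring = deque(maxlen=4096)            # overwrite-style ring buffer: old events
                                     # are dropped instead of blocking writers

def trace(tp_id, message):
    if enabled & (1 << tp_id):       # cheap bitmap check on the fast path
        ring.append((time.time(), tp_id, message))

# Example: enable a small, user-chosen set instead of coarse subsystem levels.
enable(3, 17)
trace(3, "open /scratch/job42/out.dat")
trace(5, "this tracepoint is disabled and costs almost nothing")
trace(17, "write 4 MiB at offset 0")
print(list(ring))
```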

Preliminary evaluation with BeeGFS 15

Experimental setup
Hardware: ForHLR II @ KIT
  Compute nodes with InfiniBand FDR (56 Gbit/s), fat tree
  Global I/O throughput: 50 GByte/s
  Node-local SSD (400 GB), approx. 600 MByte/s read / 400 MByte/s write
  256 compute nodes used
Environment:
  BeeGFS as a job-temporal private parallel file system
  IOzone benchmark to evaluate write throughput
16

IOzone write throughput Observation: Write throughput is limited by SSD performance 17

Conclusion and future work
Fully exploiting fast storage technologies will become increasingly important
There is little related work on job-temporal file systems
Write performance looks promising, even with POSIX-compliant systems (BeeGFS, ...)
Next steps:
  Application analysis across three clusters (JGU, TUD, KIT):
    I/O phase categorization
    Which POSIX functions are used most
    Which hints can be used for improved data placement
  Analysis of the impact of FUSE and other file system configurations
  Evaluation of the performance and applicability of a minimal file system
  Investigation of efficient data-staging methods
18

We are looking for more real-world applications:
  Applications with high IOPS or large I/O footprints
  Applications that put high metadata loads on the parallel file system
Please contact us: ada-fs-all@fusionforge.zih.tu-dresden.de
  Michael Kluge <michael.kluge@tu-dresden.de>
  Sebastian Oeste <sebastian.oeste@tu-dresden.de>
  Marc-André Vef <vef@uni-mainz.de>
  Mehmet Soysal <mehmet.soysal@kit.edu>
Or visit ada-fs.github.io
19

Thank you Questions? 20