Integrating Analysis and Computation with Trios Services


Integrating Analysis and Computation with Trios Services
Approved for Public Release: SAND2012-9323P

Ron A. Oldfield
Scalable System Software
Sandia National Laboratories, Albuquerque, NM, USA
raoldfi@sandia.gov

Trilinos User Group Meeting, October 31, 2012

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Some I/O Issues for Exascale

- Storage systems are the slowest, most fragile part of an HPC system: scaling to extreme client counts is challenging, and POSIX semantics get in the way.
- Current usage models are not appropriate for petascale, much less exascale:
  - Checkpoints are a huge concern for I/O and are currently the primary focus of the file system.
  - Application workflows use storage as a communication conduit (simulate, store, analyze, store, refine, store, ...); most of the data is transient.
- One way to reduce I/O pressure on the file system is to inject nodes between the application and the file system:
  1. Reduce the effective I/O cost through data staging (a.k.a. burst buffer).
  2. Reduce the amount of data written to storage (integrated analysis, data services).
  3. Present the file system with fewer clients (I/O forwarding).
- Trios services enable application control of these nodes.

Trios Data Services Approach

Leverage available compute/service node resources for I/O caching and data processing.

Application-level I/O services:
- PnetCDF staging service
- CTH real-time analysis
- SQL proxy (for NGC)
- Interactive sparse-matrix visualization
- In-memory key-value store (in development)

Nessie, a framework for developing data services:
- Portable API for inter-application communication (NNTI)
- Client and server libraries, CMake macros, utilities
- Originally developed for LWFS

Diagram: a client application (compute nodes) ships raw data to an I/O service (compute/service nodes), which caches, aggregates, and processes it before sending processed data to the file system and a visualization client.

Example: A Simple Transfer Service

Located at Trilinos/packages/trios/examples/xfer-service.

Used to test the Nessie API:
- xfer_write_rdma: server pulls raw data from the client using an RDMA get
- xfer_read_rdma: server transfers data to the client using an RDMA put

Used for performance evaluation:
- Tests low-level network protocols
- Tests the overhead of XDR encoding
- Tests asynchronous and synchronous performance

Creating the transfer service (argument structures sketched below; client and server stubs follow on the next slides):
1. Define the XDR data structures and API arguments
2. Implement the client stubs
3. Implement the server
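As a concrete starting point, here is a minimal sketch of step 1. The struct and field names are illustrative assumptions, not the actual xfer-service sources; in the real package these would be declared as XDR definitions and run through the Trios XDR-processing macros to generate portable encode/decode code.

    /* Illustrative argument structures for the transfer service (names are
     * assumptions). Only small control data goes into the XDR-encoded
     * arguments; the bulk array travels separately over RDMA. */
    #include <stdint.h>

    /* xfer_write_rdma: client asks the server to pull 'len' elements via RDMA get. */
    typedef struct {
        int32_t len;     /* number of elements in the client's buffer */
    } xfer_write_rdma_args;

    /* xfer_read_rdma: client asks the server to push 'len' elements via RDMA put. */
    typedef struct {
        int32_t len;     /* number of elements the client expects to receive */
    } xfer_read_rdma_args;

    /* Result structure returned by either operation. */
    typedef struct {
        int32_t rc;      /* 0 on success, error code otherwise */
    } xfer_res;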

Transfer Service: Implementing the Client Stubs

The client stubs are the interface between the scientific application and the service. Steps for a client stub (sketched below):
- Initialize the remote method arguments; in this case, it's just the length of the array.
- Call the RPC function. The RPC call includes the method arguments (args) and a pointer to the data available for RDMA (buf).
- The RPC is asynchronous; the client checks for completion by calling nssi_wait(&req).
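A sketch of what the client stub for xfer_write_rdma might look like. Only nssi_wait(&req) is quoted on the slide; the header name, the nssi_call_rpc signature, and the opcode are assumptions about the Nessie client API, shown only to illustrate the three steps above (it reuses the argument struct from the earlier sketch).

    /* Client stub sketch (illustrative; names and signatures other than
     * nssi_wait are assumptions about the Nessie client API). */
    #include <stdint.h>
    /* #include <Trios_nssi_client.h>   assumed Nessie client header */

    int xfer_write_rdma(const nssi_service *svc,  /* handle to the remote service */
                        int32_t *buf,             /* data the server will pull via RDMA */
                        int32_t len,
                        nssi_request *req)        /* request handle for completion test */
    {
        xfer_write_rdma_args args;

        /* Step 1: initialize the remote method arguments (just the array length). */
        args.len = len;

        /* Step 2: issue the asynchronous RPC, passing the encoded arguments and a
         * pointer to the buffer the server may access via RDMA. */
        return nssi_call_rpc(svc, XFER_WRITE_RDMA_OP,
                             &args, (char *)buf, len * sizeof(int32_t),
                             NULL /* no result payload */, req);
    }

    /* Step 3 (in the application): the RPC is asynchronous, so the caller later
     * checks for completion with nssi_wait(&req); as quoted on the slide. */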

Transfer Service: Implementing the Server

Implement the server stubs using the standard stub arguments; for xfer_write_rdma_srvr, the server pulls the data from the client.

Implement the server executable (sketched below):
- Initialize Nessie
- Register the server stubs/callbacks
- Start the server thread(s)
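A server-side sketch along the lines described above. The stub signature and the steps in main() are assumptions about the Nessie server API, not verbatim from the xfer-service example; the peer and buffer handle types are left opaque here.

    /* Server sketch for the transfer service (illustrative only). */

    /* Server stub: invoked by Nessie when a client calls xfer_write_rdma.
     * Using the standard stub arguments, the server pulls the raw data from
     * the client's registered buffer with an RDMA get, then sends a result. */
    int xfer_write_rdma_srvr(const unsigned long request_id,
                             const void *caller,     /* peer handle (type assumed) */
                             const xfer_write_rdma_args *args,
                             const void *data_addr,  /* client's RDMA buffer handle */
                             const void *res_addr)   /* where to send the result */
    {
        /* ... allocate args->len elements, RDMA-get them from data_addr,
         * process them, and return a result through res_addr ... */
        return 0;
    }

    int main(int argc, char *argv[])
    {
        /* 1. Initialize Nessie and create the service. */
        /* 2. Register xfer_write_rdma_srvr as the callback for the write
         *    opcode (and likewise for the read path). */
        /* 3. Start the server thread(s) and process requests until shutdown. */
        (void)argc; (void)argv;
        return 0;
    }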

Transfer Service Evaluation: Put/Get Performance (two slides of performance charts; results not transcribed)

Trios Services for Analysis: CTH Analysis Using ParaView

Motivation:
- Analysis code may not scale as well as the HPC code.
- Direct integration may be fragile (e.g., large binaries).
- Fat nodes may be available on exascale architectures for buffering and analysis.

CTH fragment-detection service:
- Extra nodes provide in-line processing (overlap fragment detection with the time-step calculation).
- Only output results to storage (reduce I/O).
- Non-intrusive: looks like in-situ analysis to CTH (pvspy API).

Issues to address:
- Number of nodes for the service (based on memory and computational requirements)
- Placement of nodes
- Resilience

Diagrams: in in-situ analysis, the CTH client application runs the pvspy client and analysis code itself and writes fragment data; in in-transit analysis (with the service), CTH ships raw data to a fragment-detection service running the pvspy server and analysis code, which writes the fragment data.

Memory Requirements for the CTH Service

Memory analysis of ParaView codes (chart): the current implementations of the analysis codes have problems.

Constraints, given 32 GiB/node:
- One node can manage/process ~16K AMR blocks from CTH.
- This yields a 16:1 ratio of compute nodes to service nodes (based on our input decks).
- Our goal is to use less than 10% additional resources; for comparison, in-situ visualization adds ~10% overhead.
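As a rough sanity check (this arithmetic is not on the slide): a 16:1 compute-to-service ratio adds one service node per 16 compute nodes, i.e. 1/16 = 6.25% additional nodes, comfortably within the stated 10% budget and below the ~10% overhead quoted for in-situ visualization.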

Load Balancing for the CTH Service

Ten cycles of a 128-core run with one server node (chart legend: time waiting for the server, time transferring data):

  Server cores | Compute:service core ratio | Time for 10 cycles | Notes
  2            | 64:1                       | 37 s               | Client idle waiting for the server to complete (also affects transfers)
  4            | 32:1                       | 23 s               |
  8            | 16:1                       | 19 s               | Less than 1% of time spent waiting

Impact of Placement on Performance

- We know placement is important from a previous study.
- The goal is to place nodes within a given allocation to avoid network contention among the different traffic classes: app-to-app (MPI), app-to-service (NNTI), service-to-service (MPI), and service-to-storage (PFS).
- Approach: graph partitioning based on the network topology and the application's network traffic (with Pedretti).

Resilience for Trios Services

- Storage-efficient application resilience is still a problem after 20+ years of research.
- Trios services use memory for transient data; how do we ensure resilience in such a model?
- We are exploring transaction-based methods. The goal is to provide assurances across multiple protection domains (e.g., the application, service 1, service 2, ...).
- Jay Lofstead (1423) has an LDRD to look at this issue.

Summary

- Data services provide a new way to integrate analysis and computation.
  - Particularly useful on deployments of burst buffer architectures.
  - Other labs are also looking into this type of approach (ANL: GLEAN, ORNL: ADIOS, ...).
  - Scheduling, programming models, security, and resilience need to better support data services. Lots of work to do!
- Nessie provides an effective framework for developing services.
  - Client and server API, macros for XDR processing, utilities for managing services.
  - Supports most HPC interconnects (SeaStar, Gemini, InfiniBand, IBM).
- Trilinos provides a great research vehicle: common repository, testing support, broad distribution.

Trios Data Services development team (and current assignments):
- Ron Oldfield: PI, CTH data service, Nessie development
- Todd Kordenbrock: Nessie development, performance analysis
- Gerald Lofstead: PnetCDF/Exodus, transaction-based resilience
- Craig Ulmer: data-service APIs for accelerators (GPU, FPGA)
- Shyamali Mukherjee: protocol performance evaluations, Nessie BG/P support

Extra Slides

Nessie Network Transport Interface (NNTI)

An abstract API for RDMA-based interconnects (with transport plugins for Portals, uGNI, and InfiniBand), providing portable methods for inter-application communication.

Basic functionality (outlined in the sketch below):
- Initialize/close the interface
- Connect/disconnect to a peer
- Register/deregister memory
- Transport data (put, get, wait)

Supported interconnects: SeaStar (Cray XT), InfiniBand, Gemini (Cray XE/XK), DCMF (IBM BG/P).

Users: Nessie, ADIOS/DataStager (new), HDF5?, Ceph?

Diagram: on the client side, the application calls the NSSI client API, which uses NNTI to cross the network; on the server side, NNTI delivers requests to the NSSI service.
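To make the functionality list concrete, here is a sketch of the sequence a client-side user of an RDMA abstraction like NNTI would follow. The steps come from the slide; the NNTI_* function names and their arguments are assumptions kept in comments, since only the operation categories are given here. Consult the Trios NNTI headers for the real API.

    /* Client-side NNTI-style sequence (illustrative; calls are shown as
     * comments because the exact names/signatures are assumptions). */
    #include <stddef.h>

    void nnti_sketch(const char *server_url, char *buf, size_t len)
    {
        /* 1. Initialize the transport (Portals, uGNI, or InfiniBand plugin). */
        /* NNTI_init(...); */

        /* 2. Connect to the peer (e.g., a Nessie service) identified by a URL. */
        /* NNTI_connect(server_url, ...); */

        /* 3. Register the local buffer so the NIC can access it for RDMA. */
        /* NNTI_register_memory(buf, len, ...); */

        /* 4. Transport data: an RDMA put pushes buf to the peer, an RDMA get
         *    pulls from the peer; both complete asynchronously. */
        /* NNTI_put(...);  or  NNTI_get(...); */

        /* 5. Wait for the transfer to complete. */
        /* NNTI_wait(...); */

        /* 6. Deregister the memory, disconnect, and close the interface. */
        (void)server_url; (void)buf; (void)len;
    }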

Trios Services for Staging: PnetCDF Data Staging (an Application-Level Burst Buffer)

Motivation:
- Synchronous I/O libraries require the application to wait until the data is on the storage device.
- There is not enough cache on compute nodes to handle I/O bursts.
- NetCDF is the basis of important I/O libraries at Sandia (Exodus).

PnetCDF caching service (application-side calls sketched below):
- The service aggregates/caches data and pushes it to storage.
- Asynchronous I/O allows overlap of I/O and computation.

Status: first described in an MSST 2006 paper, implemented in 2007, presented at PDSW '11.

Diagram: the client application (compute nodes) issues NetCDF requests to the NetCDF service (compute nodes), which caches and aggregates them and writes the processed data to the Lustre file system.
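For context, a minimal sketch of the standard PnetCDF write phase an application like this would issue (error checking omitted; the file and variable names are made up). The point of the caching service described above is that these same calls can be absorbed by the staging nodes and overlapped with computation instead of blocking on the parallel file system.

    /* Minimal PnetCDF write phase using the standard API. */
    #include <mpi.h>
    #include <pnetcdf.h>

    void write_timestep(MPI_Comm comm, int rank, int nprocs,
                        const double *local_vals, MPI_Offset local_n)
    {
        int ncid, dimid, varid;
        MPI_Offset start, count;

        /* Create the dataset collectively. */
        ncmpi_create(comm, "output.nc", NC_CLOBBER | NC_64BIT_DATA,
                     MPI_INFO_NULL, &ncid);

        /* Define one global dimension and one variable, then leave define mode. */
        ncmpi_def_dim(ncid, "n", local_n * nprocs, &dimid);
        ncmpi_def_var(ncid, "field", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        /* Each rank writes its contiguous slice of the variable. */
        start = (MPI_Offset)rank * local_n;
        count = local_n;
        ncmpi_put_vara_double_all(ncid, varid, &start, &count, local_vals);

        ncmpi_close(ncid);
    }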