Exploring Use-cases for Non-Volatile Memories in support of HPC Resilience

Onkar Patil (1), Saurabh Hukerikar (2), Frank Mueller (1), Christian Engelmann (2)
(1) Dept. of Computer Science, North Carolina State University
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory

MOTIVATION
- Exaflop computers: compute devices + memory devices + interconnects + cooling and power, all in close proximity
- Manufacturing processes are not foolproof: lower durability and reliability
- The frequency of device failures and data corruptions affects effectiveness and utility
- Future applications need to be more resilient, maintain a balance between performance and power consumption, and minimize trade-offs

PROBLEM STATEMENT
- Non-volatile memory (NVM) technologies maintain the state of computation in the primary memory architecture
- More potential as specialized hardware
- Data retention is critical in improving the resilience of an application against crashes
- Persistent memory regions to improve HPC resiliency are the key aspect of this project

[Figure: three NVM usage modes, each showing an application whose static and dynamic data structures are placed in NVM: NVM-based Main Memory, Application-directed Checkpointing, Data Versioning]
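The three usage modes can be illustrated with a small sketch. This is not the authors' implementation: it assumes a memory-mapped file as a stand-in for a persistent memory region, and `PersistentRegion`, `demo`, and the slot layout are hypothetical names introduced only for illustration.

```python
import mmap
import os
import struct


class PersistentRegion:
    """Hypothetical stand-in for an NVM region: a memory-mapped file
    holding `versions` slots of `n_doubles` float64 values each."""

    def __init__(self, path, n_doubles, versions=1):
        self.n = n_doubles
        self.versions = versions
        size = 8 * n_doubles * versions
        fd = os.open(path, os.O_RDWR | os.O_CREAT)
        try:
            if os.fstat(fd).st_size < size:
                os.ftruncate(fd, size)
            self.buf = mmap.mmap(fd, size)
        finally:
            os.close(fd)  # the mapping stays valid after the descriptor closes

    def write(self, values, version=0):
        # Copy the values into the chosen slot and flush the mapping to the file.
        offset = 8 * self.n * version
        struct.pack_into(f"{self.n}d", self.buf, offset, *values)
        self.buf.flush()

    def read(self, version=0):
        offset = 8 * self.n * version
        return list(struct.unpack_from(f"{self.n}d", self.buf, offset))

    def close(self):
        self.buf.close()


def demo(path):
    state = [1.0, 2.0, 3.0]  # working copy in ordinary (volatile) memory
    region = PersistentRegion(path, len(state), versions=2)

    # Application-directed checkpointing: the application decides when to
    # copy its state into the persistent region.
    region.write(state, version=0)

    # Data versioning: the next checkpoint goes to a different slot, so the
    # previous version is still available after a crash.
    state = [x * 2 for x in state]
    region.write(state, version=1)
    region.close()

    # Simulated restart: reopen the region and recover both versions.
    recovered = PersistentRegion(path, 3, versions=2)
    old, new = recovered.read(0), recovered.read(1)
    recovered.close()
    return old, new


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        print(demo(os.path.join(d, "region.bin")))
```

In this sketch, NVM-based main memory would correspond to calling `write` on every update instead of keeping a separate working copy, which removes the copy step at the cost of slower accesses on real hardware.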

RESULTS
Experimentation setup:
- 16-node cluster with dual-socket, quad-core AMD Opteron processors, 128 GB memory, and Intel SSDs from 100 GB to 256 GB
- DGEMM benchmark of the HPCC benchmark suite
- Tested on 4-, 8- and 16-node configurations with matrix sizes of 1000, 2000 and 3000 elements
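For reference, DGEMM throughput is conventionally derived from the roughly 2n^3 floating-point operations of an n x n matrix multiply. The sketch below assumes the reported sizes (1000, 2000, 3000) are the matrix dimension n, which the poster does not state explicitly. `dgemm_gflops` is a hypothetical helper and the timings are made-up placeholders, not measured results.

```python
def dgemm_gflops(n, seconds):
    """GFLOPS for an n x n DGEMM, using the conventional 2*n^3 flop count."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * n**3 / seconds / 1e9


if __name__ == "__main__":
    # Placeholder timings chosen for illustration only.
    for n, seconds in [(1000, 2.0), (2000, 16.0), (3000, 54.0)]:
        print(n, dgemm_gflops(n, seconds))
```

Because the operation count grows with n^3, tripling n from 1000 to 3000 multiplies the work by 27, which is worth keeping in mind when reading the problem-size scaling plots.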

[Figure: GFLOPS in node scaling for StarDGEMM. x-axis: nodes (4, 8, 16); y-axis: GFLOPS (log scale); series: PMEM_ONLY, PMEM_CPY, PMEM_VER]
[Figure: Execution times in node scaling for StarDGEMM. x-axis: nodes (4, 8, 16); y-axis: time (sec, log scale); series: PMEM_ONLY, PMEM_CPY, PMEM_VER]
- Only allocation and NVM-based main memory perform better
- An inefficient lookup algorithm

[Figure: GFLOPS for problem size scaling in StarDGEMM. x-axis: no. of elements (1000, 2000, 3000); y-axis: GFLOPS (log scale); legend as extracted: PMEM_ONLY]
[Figure: Execution time for problem size scaling in StarDGEMM. x-axis: no. of elements (1000, 2000, 3000); y-axis: time (sec, log scale); series: PMEM_ONLY, PMEM_CPY, PMEM_VER]
- All modes perform similarly and consistently for node and data scaling
- Execution time increases exponentially for multiple copies of memory

CONCLUSION & FUTURE WORK
Conclusion:
- Non-volatile memory devices can be used as specialized hardware for improving the resilience of the system
Future work:
- Memory usage modes to make applications efficient and maintain complete system state
- Minimal overhead
- Support more complex applications
- Lightweight recovery mechanisms to work with the checkpointing schemes
- Reduce downtime and rollback time
- Intelligent policies that can help build resilient static and dynamic runtime systems

Evaluating Performance of Burst Buffer Models for Real-Application Workloads in HPC Systems
Harsh Khetawat, Frank Mueller, Christopher Zimmer

Introduction
- Existing storage systems are becoming a bottleneck
- Solution: burst buffers
- Use burst buffers for:
  - Checkpoint/restart
  - I/O staging
  - Write-through cache for the parallel FS
- Burst buffers on Cori

Burst buffer placement
- Placement:
  - Co-located with compute nodes (Summit)
  - Co-located with I/O nodes (Cori)
  - Separate set of nodes
- Trade-offs in choice of placement:
  - Capability: I/O models, staging, etc.
  - Predictability: impact on shared resources, runtime variability
  - Economic: infrastructure reuse, cost of storage device
- I/O performance dependent on placement
- Choice of network topology

Idea
- Simulate network and burst buffer architectures
- CODES simulation suite
- Real-world I/O traces (Darshan)
- Full multi-tenant system with mixed workloads (capability/capacity)
- Supports network topologies
- Local & external storage models
- Combine network topologies and storage architectures
- Performance under striping/protection schemes
- Reproducible tool for HPC centers
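To make "striping" concrete, here is a minimal sketch of how a contiguous write could be split across burst buffer targets. It assumes simple round-robin striping with a fixed stripe size; the poster does not say which striping scheme is simulated, and `stripe_location` and `split_write` are hypothetical helpers, not part of CODES or any burst buffer API.

```python
def stripe_location(offset, stripe_size, n_targets):
    """Map a file offset to (target index, offset within that target)
    under round-robin striping."""
    if offset < 0 or stripe_size <= 0 or n_targets <= 0:
        raise ValueError("offset must be >= 0; stripe_size and n_targets > 0")
    stripe_index = offset // stripe_size
    target = stripe_index % n_targets
    local_offset = (stripe_index // n_targets) * stripe_size + offset % stripe_size
    return target, local_offset


def split_write(offset, length, stripe_size, n_targets):
    """Split a contiguous write into (target, local_offset, length) pieces,
    one piece per stripe touched."""
    pieces = []
    end = offset + length
    while offset < end:
        target, local = stripe_location(offset, stripe_size, n_targets)
        # A piece ends at the stripe boundary or at the end of the write.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((target, local, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # An 8-byte write starting at offset 2, 4-byte stripes, 3 targets.
    print(split_write(2, 8, 4, 3))
```

A simulator can then charge each piece to the network path and storage device of its target, which is where placement and topology begin to affect the observed I/O time.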

Conclusion
- Determine based on workload characteristics:
  - Burst buffer placement
  - Network topology
  - Performance of striping across burst buffers
  - Overhead of resilience schemes
- Reproducible tool to:
  - Simulate specific workloads
  - Determine best fit

Thank You