Investigating Resilient HPRC with Minimally-Invasive System Monitoring

Investigating Resilient HPRC with Minimally-Invasive System Monitoring
Bin Huang, Andrew G. Schmidt, Ashwin A. Mendon, Ron Sass
Reconfigurable Computing Systems Lab, UNC Charlotte

Agenda
- Exascale systems are expected to fault frequently
- What can we do with FPGAs?
- Can we tell if there is a failure in the system?
- Results and analysis

Exascale systems are expected to fault frequently
Two main reasons behind this belief:
- Ever-increasing number of components
- The MTTF of these components is not expected to improve enough to compensate
Failure record from Los Alamos National Laboratory:
- 22 HPC systems in production use
- Accumulated downtime from 1996 to 2005

Downtime by hardware failures at LANL (1996-2005)
- ~907 days of work lost to memory DIMM failures alone

Downtime by software failures at LANL (1996-2005)
- ~387 days of work lost to cluster file system failures alone

State of the art
Checkpoint/restart: periodically stop the program and write its data to non-volatile memory
- Library support exists (Berkeley Lab Checkpoint/Restart)
- Often done ad hoc by the application programmer
- A checkpoint can take 30 min. each time on HPC systems (Franck Cappello's keynote at EuroPVM/MPI, 2008)
Detection
- A human (the programmer) realizes the application has not finished or is not producing results
- Some ad hoc scripts check whether log files are growing
- In short, detection is an open question
A minimal sketch of the ad hoc checkpointing style appears below.
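To make "ad hoc by the application programmer" concrete, here is a minimal C sketch of application-level checkpoint/restart. It is an illustration only, not the BLCR API: the state struct, file names, and checkpoint interval are invented for the example.

```c
/* Minimal sketch of ad hoc application-level checkpoint/restart.
 * Hypothetical example: the state layout, file names, and interval
 * are invented for illustration; real codes (or BLCR) differ. */
#include <stdio.h>

typedef struct {
    long   iteration;     /* where to resume */
    double data[1024];    /* application working set */
} app_state_t;

static void checkpoint(const app_state_t *s) {
    FILE *f = fopen("ckpt.bin.tmp", "wb");
    if (!f) return;
    fwrite(s, sizeof *s, 1, f);
    fclose(f);
    rename("ckpt.bin.tmp", "ckpt.bin");   /* atomic replace */
}

static int restore(app_state_t *s) {
    FILE *f = fopen("ckpt.bin", "rb");
    if (!f) return 0;                     /* no checkpoint: cold start */
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok;
}

int main(void) {
    app_state_t s = {0};
    if (!restore(&s))
        s.iteration = 0;                  /* fresh run */
    for (; s.iteration < 1000000; s.iteration++) {
        /* ... compute on s.data ... */
        if (s.iteration % 10000 == 0)     /* periodic checkpoint */
            checkpoint(&s);
    }
    return 0;
}
```

Note the write-to-temp-then-rename step: without it, a crash mid-checkpoint corrupts the only copy of the saved state.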

Experience with the Spirit all-FPGA cluster
- The processor crashed randomly
- Re-running the application takes hours
- The hardware cores would finish their task

Big question
Supposing we can build HPC systems with FPGAs, will HPRC systems be more resilient?

Towards a resilient HPRC system
- Hundreds of FPGAs, each a system-on-a-chip (SoC) hosting the OS, application, processor, and accelerator core
- Automated system monitoring
- Autonomous restart
Questions
- Can we tell if there is a failure in the system?
- Can we identify why it failed?
- Can we recover from the failure?

Related work and techniques
- Debugging tools (e.g., ChipScope, SignalTap): monitor a hardware core's status, but require additional JTAG and USB ports
- Triple Modular Redundancy: used by mission-critical systems, but limited by the power budget
- Owl system-monitoring framework (Schulz, et al.): snoops system transactions for the CPU; the FPGA is not a first-class element
- Other performance analysis frameworks propose source-level (HDL) instrumentation (Koehler, et al. and Lancaster, et al.)

System-level monitoring framework
[Diagram: Side-band Data Network alongside the Primary Data Network]

Monitor Core
- Monitors a target's registers or finite-state machines
- Similar to ChipScope, but with a different sampling rate and duration
- States are saved for checkpoint/restart
A sketch of the record such a core might emit follows.
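As a host-side illustration only (the monitor core itself is FPGA fabric; the field names and widths here are assumptions, not the paper's format), the snapshot a monitor core emits might be modeled as:

```c
/* Host-side model of a monitor-core snapshot. Field names, widths,
 * and layout are hypothetical; the real core is implemented in
 * hardware and sampled over the side-band network. */
#include <stdint.h>

typedef struct {
    uint32_t node_id;        /* which worker node */
    uint32_t core_id;        /* which hardware core on that node */
    uint32_t fsm_state;      /* current finite-state-machine state */
    uint32_t watch_regs[8];  /* sampled target registers */
    uint64_t cycle_count;    /* when the sample was taken */
} monitor_snapshot_t;
```

Saving this record is what makes the monitor double as checkpoint support: the same sampled state that reveals a failure can seed a restart.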

System Monitor Hub
- Collects the status of local components from the hardware monitor cores
- Interfaces with the side-band data network
- Receives requests from the Head Node
- Applies TMR if high availability is required (see the voting sketch below)
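TMR masks a single faulty replica by majority vote. A minimal sketch, assuming bitwise 2-of-3 voting over 32-bit status words (the function names and word width are illustrative, not from the paper):

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority vote over three replicated status words.
 * Each output bit is set iff at least two of the three inputs agree,
 * so a single faulty replica is masked. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

/* Optionally report which replica disagreed, for fault localization;
 * returns -1 if all three agree. */
static int tmr_disagrees(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t v = tmr_vote(a, b, c);
    return (a != v) ? 0 : (b != v) ? 1 : (c != v) ? 2 : -1;
}
```

The power-budget objection from the related-work slide is visible here: every voted signal triples the monitored logic.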

On-chip and off-chip high-speed network (Side-band Data Network)
- Latency impacts mean time to discovery
- Bandwidth impacts mean time to recovery (a large amount of data will be transferred for checkpoint/restart)

Head Node (on the Side-band Data Network)
- Interprets status information
- Recovers the system from failures autonomously
- Tells the programmer why a node has failed

Experimental setup: a simple ring network
- One Head Node
- 32 Worker Nodes

Example: Issue request
- The head node issues the request to worker node 0
- The other worker nodes are waiting for the request

Example: Append health information
- Worker node 0 appends its status information to the end of the packet
- The packet continues to flow to the next worker node

Example: Detect failure
- The packet travels back to the head node carrying failure information about worker node 1
A software sketch of this ring protocol follows.
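To make the three example slides concrete, here is a software model of the ring health scan. It is a sketch under stated assumptions: the packet layout, status codes, and names are invented, and the real implementation runs in hardware on the side-band network.

```c
/* Software model of the ring health-scan protocol. Packet layout,
 * status values, and names are hypothetical illustrations. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS 32

enum { STATUS_OK = 0, STATUS_FAILED = 1 };

typedef struct {
    uint8_t status[NUM_WORKERS];  /* one slot appended per worker */
} scan_packet_t;

/* Each worker appends its health as the packet passes through. */
static void worker_append(scan_packet_t *p, int id, int healthy) {
    p->status[id] = healthy ? STATUS_OK : STATUS_FAILED;
}

/* Head node launches the packet around the ring, then inspects it. */
static void head_node_scan(const int *healthy) {
    scan_packet_t pkt = {0};
    for (int id = 0; id < NUM_WORKERS; id++)   /* packet circulates */
        worker_append(&pkt, id, healthy[id]);
    for (int id = 0; id < NUM_WORKERS; id++)   /* packet returns */
        if (pkt.status[id] == STATUS_FAILED)
            printf("failure detected on worker node %d\n", id);
}

int main(void) {
    int healthy[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) healthy[i] = 1;
    healthy[1] = 0;        /* emulate the failure on worker node 1 */
    head_node_scan(healthy);
    return 0;
}
```

One simplification to note: a truly crashed node cannot append anything, so in the real system the head node would infer failure from a missing or stale slot rather than an explicit FAILED flag.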

Initial results

Fault Type           How to emulate          Detected?
App crash            Physical disruption
OS crash             Physical disruption
Network failure      Unplug cable
Network crash        Disable on-chip router
Accel. core failure  Fabricated

Analysis
- Encouraging initial results
- Architecture Independent Reconfigurable Network (AIREN): supports eight 4.0 Gb/s bi-directional channels, 0.8 µs latency between nodes
- The head node can scan 32 worker nodes every 26.24 µs
- Significant reduction in mean time to discovery
- Even at a scale of 40,000 FPGAs, all nodes can still be scanned every 32.8 ms (worked out below)
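The 40,000-node claim follows from the per-node cost implied by the 32-node measurement (26.24 µs / 32 = 0.82 µs per node). A quick check, assuming scan time grows linearly with ring length:

```c
#include <stdio.h>

int main(void) {
    const double scan_32   = 26.24e-6;        /* s, measured, 32 nodes */
    const double per_node  = scan_32 / 32.0;  /* 0.82 us per node */
    const double scan_40k  = per_node * 40000.0;
    printf("per-node cost:     %.2f us\n", per_node * 1e6);  /* 0.82 */
    printf("40,000-node scan:  %.1f ms\n", scan_40k * 1e3);  /* 32.8 */
    return 0;
}
```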

Conclusion
- We will design HPRC systems with the expectation of failures (and work losses)
- We conceptualize a resilient HPRC system built around an open system-monitoring framework
- Initial results from a 32-node test on the Spirit cluster demonstrate the concept
- This framework will back other ongoing research (most likely long-running jobs) by preventing work loss
- It also serves as a testbed for resilience research

Thank you
Bin Huang, Andrew G. Schmidt, Ashwin A. Mendon, Ron Sass
Reconfigurable Computing Systems Lab, UNC Charlotte
www.rcs.uncc.edu/wiki

Statistics of downtime at LANL (1996-2005)
- 3,387 days of system downtime across the 22 clusters