Investigating Resilient HPRC with Minimally-Invasive System Monitoring
Bin Huang, Andrew G. Schmidt, Ashwin A. Mendon, Ron Sass
Reconfigurable Computing Systems Lab, UNC Charlotte
Agenda
- Exascale systems are expected to fault frequently
- What can we do with FPGAs?
- Can we tell if there is a failure in the system?
- Results and analysis
Exascale systems are expected to fault frequently
Two main reasons behind this belief:
- Ever-increasing number of components
- The MTTF of these components is not expected to improve enough to compensate
Failure records from Los Alamos National Laboratory (LANL):
- 22 HPC systems in production use
- Accumulated downtime from 1996 to 2005
Downtime from hardware failures at LANL (1996-2005)
- ~907 days of work lost to memory DIMM failures alone
Downtime from software failures at LANL (1996-2005)
- ~387 days of work lost to cluster file system failures alone
State of the art
- Checkpoint/restart: periodically stop the program and write its data to non-volatile memory
  - Library support (Berkeley Lab Checkpoint/Restart)
  - Often done ad hoc by the application programmer
  - A checkpoint can take 30 min. each time in HPC systems (Franck Cappello's keynote @ EuroPVM/MPI in 2008)
- Detection
  - A human (the programmer) realizes the application has not finished or is not producing results
  - Some ad hoc scripts check whether log files are growing
- In short, detection is an open question
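The tension between checkpoint cost and failure rate has a standard first-order model, Young's approximation, which is not on the slides but makes the 30-minute checkpoint figure concrete; the 24-hour MTBF below is a hypothetical value for illustration:

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt = sqrt(2 * C * MTBF), where C is the time to
    write one checkpoint and MTBF is the system mean time between
    failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# 30-minute checkpoints (the cost cited on the slide) on a system
# with a hypothetical 24-hour MTBF:
t_opt = young_interval(30 * 60, 24 * 3600)
print(f"checkpoint roughly every {t_opt / 3600:.1f} h")
```

As the MTBF shrinks toward exascale projections, the optimal interval shrinks with its square root, so checkpoint overhead grows quickly; this is the motivation for cheaper detection and recovery.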
Experience with the Spirit all-FPGA cluster
- Processors crashed randomly
- It takes hours to re-run the application
- Hardware cores would finish their tasks
Big question
Suppose we can build HPC systems with FPGAs — will HPRC systems be more resilient?
Towards a resilient HPRC system
- Hundreds of FPGAs, each a system-on-a-chip (SoC): OS, application, processor, accelerator core
- Automated system monitoring
- Autonomous restart
Questions
- Can we tell if there is a failure in the system?
- Can we identify why it failed?
- Can we recover from the failure?
Related work and techniques
- Debugging tools (e.g., ChipScope, SignalTap)
  - Monitor a hardware core's status
  - Require additional JTAG and USB ports
- Triple Modular Redundancy (TMR)
  - Used by mission-critical systems
  - Costly under a limited power budget
- Owl system-monitoring framework (Schulz, et al.)
  - Snoops system transactions for the CPU
  - The FPGA is not a first-class element
- Other performance-analysis frameworks propose source-level (HDL) instrumentation (Koehler, et al. and Lancaster, et al.)
System-level monitoring framework
[Figure: side-band data network running alongside the primary data network]
Monitor Core
- Monitors a target's registers or finite-state machines
- Similar to ChipScope, but with a different sampling rate and duration
- States are saved for checkpoint/restart
System Monitor Hub
- Collects the status of local components from the hardware monitor cores
- Interfaces with the side-band data network
- Receives requests from the Head Node
- Applies TMR if high availability is required
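The "apply TMR" bullet can be sketched as a majority vote over three redundant status readings; this is an illustrative Python model, not the hub's actual hardware implementation:

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Majority-vote three redundant status readings.
    Returns the value at least two readings agree on, or None if
    all three disagree (the reading itself is untrustworthy)."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

# Two healthy readings out-vote one corrupted one:
print(tmr_vote("OK", "OK", "FAIL"))   # prints OK
print(tmr_vote("OK", "FAIL", "HANG")) # prints None: no majority
```

In hardware, the same vote is a small combinational circuit per status bit, which is why TMR trades area and power for availability.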
On-chip and off-chip high-speed network
Side-band data network:
- Latency impacts the mean time to discovery
- Bandwidth impacts the mean time to recovery (a large amount of data will be transferred for checkpoint/restart)
Head Node
- Interprets status information from the side-band data network
- Recovers the system from failures autonomously
- Tells the programmer why a node has failed
Experimental setup: a simple ring network
- One Head Node
- 32 Worker Nodes
Example: issue request
- The head node issues the request to worker node 0
- The other worker nodes are waiting for the request
Example: append health information
- Worker node 0 appends its status information to the end of the packet
- The packet continues to flow to the next worker node
Example: detect failure
- The packet travels back to the head node carrying failure information about worker node 1
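The three steps above — issue request, append health information, detect failure — amount to a packet accumulating one status word per hop around the ring. A toy Python model, with the failure at worker node 1 chosen to match the example:

```python
def scan_ring(worker_status):
    """Head node sends an empty packet around the ring; each worker
    appends (node_id, status); the head inspects the returned packet
    and reports any worker that did not answer "OK"."""
    packet = []  # the health-scan packet, initially empty
    for node_id, status in enumerate(worker_status):
        packet.append((node_id, status))  # each worker appends its status
    # Back at the head node: extract the failed workers.
    return [node_id for node_id, status in packet if status != "OK"]

# 32 workers, with worker node 1 emulated as failed:
status = ["OK"] * 32
status[1] = "FAIL"
print(scan_ring(status))  # prints [1]
```

One full traversal both polls every node and localizes the failure, so detection cost is a single trip around the ring.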
Initial results

Fault type            How to emulate
App crash             Physical disruption
OS crash              Physical disruption
Network failure       Unplug cable
Network crash         Disable on-chip router
Accel. core failure   Fabricated

Detected?
Analysis
- Encouraging initial results
- Architecture Independent Reconfigurable Network (AIREN)
  - Supports eight 4.0 Gb/s bi-directional channels
  - 0.8 µs latency between nodes
- The head node can scan 32 worker nodes every 26.24 µs
  - A significant reduction in mean time to discovery
  - Even if we scale to 40,000 FPGAs, we can still scan all nodes every 32.8 ms
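The scan-time figures above are mutually consistent with a per-hop cost of 0.82 µs — the 0.8 µs link latency plus a small per-node append overhead; that exact breakdown is an assumption, but the linear scaling is what the slide claims:

```python
PER_HOP_US = 0.82  # assumed per-hop cost: 0.8 us link latency + append time

def scan_time_us(n_nodes: int) -> float:
    # One full traversal of the ring visits every worker node once,
    # so the scan time grows linearly in the node count.
    return n_nodes * PER_HOP_US

print(f"{scan_time_us(32):.2f} us")           # 26.24 us, the 32-node figure
print(f"{scan_time_us(40_000) / 1000:.1f} ms") # 32.8 ms at 40,000 FPGAs
```

Linear scan time is the price of the simple ring topology; a tree-structured side-band network would scan in logarithmic depth at the cost of more wiring.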
Conclusion
- We should design HPRC systems with the expectation of failures (and work losses)
- We conceptualize a resilient HPRC system with an open system-monitoring framework
- Initial results from a 32-node test on the Spirit cluster support this concept
- This framework will back other ongoing research (most likely long-running jobs) to prevent work loss
- Testbed for resilience research
Thank you
Bin Huang, Andrew G. Schmidt, Ashwin A. Mendon, Ron Sass
Reconfigurable Computing Systems Lab, UNC Charlotte
www.rcs.uncc.edu/wiki
Statistics of downtime at LANL (1996-2005)
- 3,387 days of system downtime across 22 clusters