Architecting High Performance Computing Systems for Fault Tolerance and Reliability

Architecting High Performance Computing Systems for Fault Tolerance and Reliability
Blake T. Gonzales, HPC Computer Scientist, Dell Advanced Systems Group
blake_gonzales@dell.com

Agenda
- HPC Fault Tolerance and Reliability
- Architecture Design Techniques
- Dell HPC Solutions: Purpose-Built Reliability
- Questions

HPC Fault Tolerance and Reliability

HPC Fault Tolerance and Reliability
The complex nature of HPC systems can have a detrimental effect on their ability to reliably complete the tasks at hand. The research performed on HPC systems is important, so reliability and fault tolerance are of the utmost concern in HPC.

Single Points of Failure
Shared-memory multiprocessor (SMP) systems are generally prone to system-wide failures caused by single errors in memory, CPU, or disk. With the ubiquitous use of clustered HPC technology over the last decade, the risk of system-wide failure from a single point of failure can be minimized, but clustered solutions must be designed correctly.

HPC Subsystem Design
Cluster solutions have many moving parts, so it is important to design each subsystem with an eye to how it relates to the other subsystems. Identify the key components that are likely to cause system-wide failures, and apply architecture design techniques to prevent such failures.

Architecture Design Techniques

Component Classification
Failure has little effect on overall reliability:
- Compute nodes
- Out-of-band management
Failure has a major effect on overall reliability:
- Head / admin node(s)
- Job scheduler
- Storage
- Power
- Cooling
- Cabling
- Network
- The list goes on

Compute Nodes
Irony: the workhorse subsystem of an HPC cluster is the same subsystem that requires the least built-in fault tolerance.
- High tolerance for an occasional failed job
- Generally does not require added fault-tolerant subsystems

Compute Nodes
- It is common to have several compute nodes inoperable on large systems.
- When a single compute node fails, typically only one job, or a subset of jobs, on the system is affected.
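A quick way to reason about the blast radius of a single node failure is to map each running job to the nodes it occupies. The sketch below is hypothetical: the job-to-node map would come from your scheduler's output in practice, and the job and node names are placeholders.

```python
# Sketch: estimate which jobs a single compute node failure affects.
# The job-to-node map is illustrative; in practice it would be built
# from scheduler output (for example, a node list per running job).

job_nodes = {
    "job_1001": ["node01", "node02", "node03", "node04"],
    "job_1002": ["node05"],
    "job_1003": ["node06", "node07"],
}

def jobs_affected_by(failed_node, job_map):
    """Return the job IDs that have at least one process on the failed node."""
    return [job for job, nodes in job_map.items() if failed_node in nodes]

print(jobs_affected_by("node02", job_nodes))   # -> ['job_1001']
```

The point of the exercise: one failed compute node takes down only the jobs mapped onto it, not the whole system.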

Login Nodes
- Customer-facing; outages shape perception
- Provide multiple identical nodes
- Publish entry points
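One way to keep the published entry points trustworthy is to probe each of the identical login nodes and report which ones accept connections. A minimal sketch, assuming hypothetical hostnames and SSH on port 22:

```python
# Sketch: probe identical login nodes behind a published entry point.
# Hostnames are hypothetical placeholders.
import socket

LOGIN_NODES = ["login1.example.edu", "login2.example.edu", "login3.example.edu"]

def reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in LOGIN_NODES:
    print(f"{host}: {'up' if reachable(host) else 'DOWN'}")
```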

Head / Administrative Nodes
Consider separating these functions:
- Provisioning
- Image management
- Job scheduling (multiple nodes!)
- Network boot for compute nodes

Non-Compute Node Disk Protection
- Mirroring - RAID 1 (though this doesn't protect against data corruption)
- Hot spares
- Backups (software stack, compute node images)
- Disk cloning (weekly, multiple copies)
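On Linux-based head and admin nodes using software RAID, mirror health is visible in /proc/mdstat, where a healthy two-disk mirror reports [UU] and a degraded one shows an underscore (e.g. [U_]). A minimal monitoring sketch along those lines; the alerting hook is left as a placeholder:

```python
# Sketch: flag degraded Linux md (software RAID) arrays on a head/admin node.
# Relies on the /proc/mdstat member-status field: [UU] healthy, [U_] degraded.
import re

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return md device names whose member-status field contains '_'."""
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            status = re.search(r"\[([U_]+)\]", line)
            if current and status and "_" in status.group(1):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("Degraded arrays:", ", ".join(bad))  # hook up real alerting here
```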

Power Distribution
- Continuous power feed: generators, battery backup or UPS
- Multiple data center or rack-based PDUs
- Deliver power feeds to multiple power supplies
- Hot-swappable power supplies
- Labeling is critical

Cooling
- For every watt consumed by a component, there is corresponding cooling power that must be spent as well (a rough conversion is sketched below).
- Air handlers, chillers, fans
- Hot spots correlate with compute nodes
- Hot/cold aisles
- Chilled doors, in-row cooling
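The watt-for-watt point can be made concrete with a quick conversion: 1 W of IT load is roughly 3.412 BTU/hr of heat the cooling plant must remove. A back-of-the-envelope sketch; the 20 kW rack draw is just an example figure, not a measured value:

```python
# Sketch: rough cooling load for a rack, using 1 W = 3.412 BTU/hr.
# The 20 kW rack draw is an illustrative number.

WATTS_TO_BTU_PER_HR = 3.412

rack_draw_watts = 20_000          # example: a dense compute rack
cooling_btu_hr = rack_draw_watts * WATTS_TO_BTU_PER_HR
print(f"{rack_draw_watts} W of IT load needs ~{cooling_btu_hr:,.0f} BTU/hr of cooling")
# -> 20000 W of IT load needs ~68,240 BTU/hr of cooling
```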

Job Scheduling
- Job scheduling state database
- Multiple paths to the database
- Multiple failover daemons on separate nodes
- Jobs may continue to run on the compute node infrastructure even if the daemon nodes fail, but you need a map of the activity.
- Checkpoint / restart (see the sketch below)
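Checkpoint/restart is straightforward to sketch at the application level: periodically write enough state to resume, and on startup load the most recent checkpoint if one exists. A minimal, generic pattern; the file name, state, and work loop are placeholders, and real HPC codes often rely on application-native restart files or checkpointing libraries instead:

```python
# Sketch: application-level checkpoint/restart pattern.
# State, file names, and the work loop are illustrative placeholders.
import json, os

CHECKPOINT = "checkpoint.json"

def load_checkpoint():
    """Resume from the last checkpoint if present, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "result": 0.0}

def save_checkpoint(state):
    """Write atomically so a crash mid-write cannot corrupt the checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_checkpoint()
for step in range(state["step"], 1000):
    state["result"] += step * 0.001          # stand-in for real computation
    state["step"] = step + 1
    if step % 100 == 0:
        save_checkpoint(state)               # checkpoint every 100 steps

save_checkpoint(state)
```

If the node running the job (or the scheduler daemon) fails, the job can be resubmitted and resume from the last checkpoint rather than restarting from step zero.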

Networking and Interconnect
- Little redundancy is needed in the admin network
- Out-of-band management is crucial
- Interconnect hubs require redundancy:
  - Ethernet & InfiniBand
  - MPI / storage
  - Power
  - Uplinks/downlinks

Cable Management
- Reliability, really? Yes.
- Heavy IB cables
- Labeling is critical
- Organize the chaos
- You'll be glad the next time you have an outage and time is of the essence!

Continual Testing!
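Continual testing can start as a scheduled sweep that exercises each subsystem and reports anything unreachable before users notice. A minimal sketch that pings a list of nodes; hostnames are placeholders, and a production test would also exercise storage, the scheduler, and the interconnect:

```python
# Sketch: periodic reachability sweep over cluster nodes.
# Hostnames are placeholders; extend with storage/scheduler/interconnect checks.
import subprocess

NODES = [f"node{i:02d}" for i in range(1, 17)] + ["head01", "login1"]

def ping(host):
    """Return True if one ICMP echo to host succeeds (Linux ping syntax)."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0

down = [h for h in NODES if not ping(h)]
if down:
    print("Unreachable:", ", ".join(down))
else:
    print("All nodes reachable")
```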

Dell HPC Solutions: Purpose-Built Reliability

The Dell HPC NFS Storage Solution (NSS)

The Dell Terascala HPC Storage Solution (HSS)
- Full Lustre solution: fully configured, tested, tuned, and deployed
- On-site installation with client deployment and training included
- Redundant, highly available solution
- Simple, linear scalability
- Full management system with an easy-to-use GUI
- Building blocks: MetaData Server (MDS) and base OSS

Dell PowerEdge C6100
Four 2-socket nodes in 2U, Intel Westmere-EP
Each node:
- 12 DIMMs
- 2 GigE (Intel)
- 1 daughter card (PCIe x8): 10GigE or QDR IB
- One PCIe x16 slot (half-length, half-height)
- Optional SAS controller (in place of IB)
Chassis design:
- Hot-plug, individually serviceable system boards / nodes
- Up to 12 x 3.5" drives (3 per node)
- Up to 24 x 2.5" drives (6 per node)
- N+1 power supplies (1100W or 1400W)
- NVIDIA HIC certified
- DDR and QDR IB PCIe cards certified

Dell PowerEdge C410x
3U chassis (external)
- "Room and board" for PCIe Gen-2 x16 devices
- Up to 8 hosts
Sixteen (16) x16 Gen-2 devices
- Initial target = GPGPUs
- Support for any FH/HL or HH/HL device
- Each slot double-wide
- Individually serviceable
N+1 power (3+1), Gold (90% efficiency)
N+1 cooling (7+1)

Dell PowerEdge C410x
Sixteen (16) x16 Gen-2 modules
- PCIe Gen-2 x16 compliant
- Independently serviceable
Module callouts: LED and on/off, power connector for the GPGPU card, GPU card, board-to-board connector for x16 Gen-2 PCIe signals and power

C410x Sandwich
Stack: C6100 on top, C410x in the middle, C6100 on the bottom = 7U.

8-Card C410x Sandwich:
- 2 x C6100
- 1 QDR IB daughtercard
- 7U total
- 8 GPUs total
- 8 nodes total
- 8/7 nodes per U
- 8/7 GPUs per U
- 1 GPU per PCIe x16

16-Card C410x Sandwich:
- 2 x C6100
- 1 QDR IB daughtercard
- 7U total
- 16 GPUs total
- 8 nodes total
- 8/7 nodes per U
- 16/7 GPUs per U
- 2 GPUs per PCIe x16

Dell PowerEdge C410x
- Increased density (more GPUs per rack U)
- Introduced flexibility: GPU/host ratio = 1:1, 2:1, 3:1, 4:1, ..., (8:1), ..., (16:1)
- Purposely separates the host from the GPUs
- Purpose-built to power, cool, and manage PCIe devices:
  - (N+1) power (3+1 Gold power supplies)
  - (N+1) cooling (7+1 fans)
  - Onboard BMC web interface to monitor, manage & configure
  - Each PCIe module is individually serviceable: no un-cabling, no un-racking, no opening of compute nodes, no bumped DIMMs, no disturbed dust, vertical insertion

HPC at Dell
Blake T. Gonzales, HPC Computer Scientist - blake_gonzales@dell.com
Jim Gutowski, HPC Business Development Manager - james_gutowski@dell.com
HPC blogs, case studies, guides, tips and tricks: http://hpcatdell.com/
Twitter: @HPCatDell