The Red Storm System: Architecture, System Update and Performance Analysis


The Red Storm System: Architecture, System Update and Performance Analysis
Douglas Doerfler, Jim Tomkins
Sandia National Laboratories, Center for Computation, Computers, Information and Mathematics
LACSI 2006 Workshop on Performance & Productivity of Extreme-Scale Parallel Systems
October 17th, 2006

Outline
- Architecture & Design
- Upgrade: the numbers
- Availability & Reliability
- Application Performance

Red Storm Architecture
- Balanced System Performance: CPU, memory, interconnect and I/O
- Scalability: system hardware and system software scale from a single-cabinet system to a 32K-processor system
- Functional Partitioning: hardware and system software
- Reliability: full-system Reliability, Availability, Serviceability (RAS) designed into the architecture
- Upgradeability: designed-in path for system upgrade
- Red/Black Switching: flexible support for both classified and unclassified computing in a single system
- Custom Packaging: high-density, relatively low-power system
- Price/Performance: excellent performance per dollar, using high-volume commodity parts where feasible

Red Storm System (pre-upgrade)
- True MPP, designed to be a single system
- Distributed-memory MIMD parallel supercomputer
- Fully connected 3-D mesh interconnect
- 108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz)
- ~30 TB of DDR compute node memory (4 GB, 3 GB, or 2 GB per node)
- 8 service and I/O cabinets on each end (256 processors for each color)
- ~400 TB of disk storage (~200 TB per color)
- Less than 2 MW total power and cooling
- Less than 3,000 ft² of floor space

Red Storm System Upgrade

|                                    | Pre-Upgrade           | Post-Upgrade                           |
|------------------------------------|-----------------------|----------------------------------------|
| TeraFLOPs                          | ~41 TF                | ~125 TF                                |
| 3D Mesh (compute partition)        | 27x16x24              | 27x20x24                               |
| Nodes/Partition (Red/Center/Black) | 256/10,368/256        | 320/12,960/320                         |
| Compute Memory                     | ~31 TB (DDR-333)      | ~78 TB (DDR-400)                       |
| Processors                         | AMD Opteron @ 2.0 GHz | AMD Opteron, dual-core, @ 2.4 GHz      |
| NIC                                | SeaStar v1.2          | SeaStar v2.1 (~doubles HT bandwidth)   |
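As a back-of-the-envelope check (not part of the original slides), the peak figures in the table follow from simple arithmetic if one assumes 2 double-precision floating-point operations per clock per Opteron core (one SSE2 add plus one multiply):

```latex
% Peak compute-partition FLOPS, assuming 2 DP flops/clock per Opteron core.
\[
  \text{Pre-upgrade:}\quad
  10{,}368\ \text{cores} \times 2\ \tfrac{\text{flops}}{\text{clock}} \times 2.0\ \text{GHz}
  \approx 41.5\ \text{TF}
\]
\[
  \text{Post-upgrade:}\quad
  12{,}960\ \text{sockets} \times 2\ \tfrac{\text{cores}}{\text{socket}}
  \times 2\ \tfrac{\text{flops}}{\text{clock}} \times 2.4\ \text{GHz}
  \approx 124.4\ \text{TF}
\]
```

Both results round to the ~41 TF and ~125 TF quoted above.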

Red Storm Layout (post-upgrade): 27 x 20 x 24 compute node mesh. [Layout diagram: normally classified nodes, switchable nodes, normally unclassified nodes, I/O and service nodes at each end, disconnect cabinets; disk storage system not shown.]

Red Storm System Software
- Operating Systems: LINUX on service and I/O nodes; LWK (Catamount) on compute nodes; LINUX on RAS nodes
- File Systems: Parallel File System - Lustre; Unix File System - Lustre; NFS v3
- Run-Time System: logarithmic loader, node allocator, batch system (PBS Pro), libraries (MPI, I/O, Math), single system view
- Programming Model: message passing (MPI); support for heterogeneous applications
- Tools: ANSI standard compilers (Fortran, C, C++: PGI); debugger: TotalView; performance monitors: Cray Apprentice and PAPI
- System Management and Administration: accounting, RAS GUI interface for monitoring the system, single system view

Red Storm System Management and RAS
- RAS Workstations: Cray CMS; separate and redundant RAS workstations for the Red and Black ends of the machine; system administration and monitoring interface; error logging and monitoring for major system components including processors, memory, NIC/Router, power supplies, fans, disk controllers, and disks
- RAS Network: dedicated Ethernet network connecting RAS nodes to RAS workstations
- RAS Nodes: one for each compute board (L0); one for each cabinet (L1)

Red Storm Performance: Interconnect and I/O
- Interconnect performance
  - MPI latency, requirement & measured: neighbor < 5 µs required, measured 6.0 µs generic / 3.6 µs accelerated; full machine < 8 µs required, add ~3 µs to the above
  - Measured MPI bandwidth: ~2,200 MB/s uni-directional, ~4,000 MB/s bi-directional
  - Peak HT bandwidth: 3.2 GB/s each direction
  - Peak link bandwidth: 3.84 GB/s each direction
  - Bisection bandwidth: ~3.69 TB/s Y by Z; ~4.98 TB/s X by Z; ~8.30 TB/s X by Y (torus)
- I/O system performance
  - PFS requirement: 50 GB/s sustained for each color; observed 50 GB/s using IOR under ideal conditions
  - External requirement: 25 GB/s (aggregate) sustained for each color; observed 600 MB/s over a single 10GE link
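MPI latency and bandwidth figures of this kind are typically obtained with a two-rank ping-pong microbenchmark. The sketch below is illustrative only, not the code used to produce the numbers above: one-way latency is estimated as half the average round-trip time of a small message, and uni-directional bandwidth as message size divided by the one-way time for a large message.

```c
/* Minimal MPI ping-pong sketch (illustrative; not the Red Storm benchmark).
 * Run with exactly two ranks, ideally on neighboring nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int small = 8;          /* latency-sized message (bytes) */
    const int large = 1 << 20;    /* bandwidth-sized message (1 MiB) */
    int rank;
    char *buf = malloc((size_t)large);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Latency: time many small round trips, report half the average. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double latency = (MPI_Wtime() - t0) / (2.0 * iters);   /* seconds, one-way */

    /* Bandwidth: same pattern with a large message. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double bw = large / ((MPI_Wtime() - t0) / (2.0 * iters));  /* bytes/s, uni-directional */

    if (rank == 0)
        printf("latency %.2f us, bandwidth %.1f MB/s\n", latency * 1e6, bw / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Placing the two ranks on neighboring nodes corresponds to the "neighbor" latency case quoted above; separating them across the machine approximates the "full machine" case.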

System Availability OUO

System Reliability OUO

Application Performance Post-Upgrade Preliminary Analysis

Upgrade Performance: Before & After
Upgrade by the numbers
- 40 to 125 TeraFLOPS
- 10,368 to 12,960 compute nodes
- 512 to 640 service nodes
- 2X the SeaStar interconnect bandwidth
- 2.0 GHz single-core to 2.4 GHz dual-core AMD Opteron processors
- DDR-333 to DDR-400 memory speed
Status
- Black section: in progress
- Center section: Oct 06
- Red section: Nov 06

Upgrade Performance: Single-Core vs Dual-Core
- CTH (Shape Charge): constant work/core; speedup of at least 1.4 out to 2048 sockets
- Sage (timing_c): constant work/core; at scale, speedup is at least a factor of 1.6
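One plausible reading of the quoted metric (an interpretation, not spelled out on the slide): with the work per core held constant, speedup is the ratio of pre-upgrade to post-upgrade wall-clock time for the same per-core workload at a given machine scale:

```latex
% Upgrade speedup under constant work per core; n denotes machine scale.
% Whether sockets or cores are held equal between the two runs is not
% stated on the slide.
\[
  S(n) \;=\; \frac{T_{\text{pre-upgrade}}(n)}{T_{\text{post-upgrade}}(n)}
  \;\ge\; 1.4 \ \ (\text{CTH, out to 2048 sockets}), \qquad
  S(n) \;\ge\; 1.6 \ \ (\text{Sage, at scale})
\]
```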

Application Performance Pre-Upgrade Preliminary Analysis

Sage Scaling (John Daly, LANL) [scaling plot, with an updated data point]

CTH ASC Purple & Red Storm Performance: Sandia's CTH (Shape Charge, 90x216x90 cells/PE), execution time for 100 cycles. [Plot: wall time (secs) vs. number of processors, 1 to 10,000, for CTH-Purple and CTH-Red Storm.]

SEAM Benchmarks (aqua planet): Red Storm 5 TF max, BG/L 4 TF max. (SEAM = NCAR's Spectral Element Atmospheric Model; POP = LANL's Parallel Ocean Program.)

POP Benchmarks (1/10 degree Ocean)

The Impact of a Balanced Architecture
- Architectural balance with low system noise is the key to a scalable platform
- Well-balanced traits translate to high real-world application performance

Conclusions
- Red Storm is an architecture
- The Red Storm machine is an instantiation of that architecture
- Red Storm has demonstrated excellent scalability on real applications
- The upgrade has shown significant application speedup (more analysis to come)