The Red Storm System: Architecture, System Update and Performance Analysis


The Red Storm System: Architecture, System Update and Performance Analysis
Douglas Doerfler, Jim Tomkins
Sandia National Laboratories, Center for Computation, Computers, Information and Mathematics
LACSI 2006 Workshop on Performance & Productivity of Extreme-Scale Parallel Systems
October 17th, 2006

Outline
- Architecture & Design
- Upgrade: the numbers
- Availability & Reliability
- Application Performance

Red Storm Architecture
- Balanced System Performance: CPU, memory, interconnect and I/O
- Scalability: system hardware and system software scale from a single-cabinet system to a 32K-processor system
- Functional Partitioning: hardware and system software
- Reliability: full-system Reliability, Availability, Serviceability (RAS) designed into the architecture
- Upgradeability: designed-in path for system upgrade
- Red/Black Switching: flexible support for both classified and unclassified computing in a single system
- Custom Packaging: high-density, relatively low-power system
- Price/Performance: excellent performance per dollar, using high-volume commodity parts where feasible

Red Storm System (pre-upgrade)
- True MPP, designed to be a single system
- Distributed-memory MIMD parallel supercomputer
- Fully connected 3-D mesh interconnect
- 108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz)
- ~30 TB of DDR compute node memory (4 GB, 3 GB, or 2 GB per node)
- 8 service and I/O cabinets on each end (256 processors for each color)
- ~400 TB of disk storage (~200 TB per color)
- Less than 2 MW total power and cooling
- Less than 3,000 ft² of floor space

Red Storm System Upgrade

|                                    | Pre-Upgrade           | Post-Upgrade                           |
|------------------------------------|-----------------------|----------------------------------------|
| TeraFLOPs                          | ~41 TF                | ~125 TF                                |
| 3D Mesh (compute partition)        | 27x16x24              | 27x20x24                               |
| Nodes/Partition (Red/Center/Black) | 256/10,368/256        | 320/12,960/320                         |
| Compute Memory                     | ~31 TB (DDR-333)      | ~78 TB (DDR-400)                       |
| Processors                         | AMD Opteron @ 2.0 GHz | AMD Opteron, dual-core, @ 2.4 GHz      |
| NIC                                | SeaStar v1.2          | SeaStar v2.1 (~doubles HT bandwidth)   |
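As a back-of-the-envelope check (not part of the original slides), the peak figures in the table follow from simple arithmetic if one assumes 2 double-precision floating-point operations per clock per Opteron core (one SSE2 add plus one multiply):

```latex
% Peak compute-partition FLOPS, assuming 2 DP flops/clock per Opteron core.
\[
  \text{Pre-upgrade:}\quad
  10{,}368\ \text{cores} \times 2\ \tfrac{\text{flops}}{\text{clock}} \times 2.0\ \text{GHz}
  \approx 41.5\ \text{TF}
\]
\[
  \text{Post-upgrade:}\quad
  12{,}960\ \text{sockets} \times 2\ \tfrac{\text{cores}}{\text{socket}}
  \times 2\ \tfrac{\text{flops}}{\text{clock}} \times 2.4\ \text{GHz}
  \approx 124.4\ \text{TF}
\]
```

Both results round to the ~41 TF and ~125 TF quoted above.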

Red Storm Layout (post-upgrade): 27 x 20 x 24 compute node mesh. [Layout diagram: normally classified nodes, switchable nodes, normally unclassified nodes, I/O and service nodes at each end, disconnect cabinets; disk storage system not shown.]

Red Storm System Software
- Operating Systems: LINUX on service and I/O nodes; LWK (Catamount) on compute nodes; LINUX on RAS nodes
- File Systems: Parallel File System - Lustre; Unix File System - Lustre; NFS v3
- Run-Time System: logarithmic loader, node allocator, batch system (PBS Pro), libraries (MPI, I/O, Math), single system view
- Programming Model: message passing (MPI); support for heterogeneous applications
- Tools: ANSI standard compilers (Fortran, C, C++: PGI); debugger: TotalView; performance monitors: Cray Apprentice and PAPI
- System Management and Administration: accounting, RAS GUI interface for monitoring the system, single system view

Red Storm System Management and RAS
- RAS Workstations: Cray CMS; separate and redundant RAS workstations for the Red and Black ends of the machine; system administration and monitoring interface; error logging and monitoring for major system components including processors, memory, NIC/Router, power supplies, fans, disk controllers, and disks
- RAS Network: dedicated Ethernet network connecting RAS nodes to RAS workstations
- RAS Nodes: one for each compute board (L0); one for each cabinet (L1)

Red Storm Performance: Interconnect and I/O
- Interconnect performance
  - MPI latency, requirement & measured: neighbor < 5 µs required, measured 6.0 µs generic / 3.6 µs accelerated; full machine < 8 µs required, add ~3 µs to the above
  - Measured MPI bandwidth: ~2,200 MB/s uni-directional, ~4,000 MB/s bi-directional
  - Peak HT bandwidth: 3.2 GB/s each direction
  - Peak link bandwidth: 3.84 GB/s each direction
  - Bisection bandwidth: ~3.69 TB/s Y by Z; ~4.98 TB/s X by Z; ~8.30 TB/s X by Y (torus)
- I/O system performance
  - PFS requirement: 50 GB/s sustained for each color; observed 50 GB/s using IOR under ideal conditions
  - External requirement: 25 GB/s (aggregate) sustained for each color; observed 600 MB/s over a single 10GE link
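MPI latency and bandwidth figures of this kind are typically obtained with a two-rank ping-pong microbenchmark. The sketch below is illustrative only, not the code used to produce the numbers above: one-way latency is estimated as half the average round-trip time of a small message, and uni-directional bandwidth as message size divided by the one-way time for a large message.

```c
/* Minimal MPI ping-pong sketch (illustrative; not the Red Storm benchmark).
 * Run with exactly two ranks, ideally on neighboring nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int small = 8;          /* latency-sized message (bytes) */
    const int large = 1 << 20;    /* bandwidth-sized message (1 MiB) */
    int rank;
    char *buf = malloc((size_t)large);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Latency: time many small round trips, report half the average. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double latency = (MPI_Wtime() - t0) / (2.0 * iters);   /* seconds, one-way */

    /* Bandwidth: same pattern with a large message. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double bw = large / ((MPI_Wtime() - t0) / (2.0 * iters));  /* bytes/s, uni-directional */

    if (rank == 0)
        printf("latency %.2f us, bandwidth %.1f MB/s\n", latency * 1e6, bw / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Placing the two ranks on neighboring nodes corresponds to the "neighbor" latency case quoted above; separating them across the machine approximates the "full machine" case.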

System Availability OUO

System Reliability OUO

Application Performance Post-Upgrade Preliminary Analysis

Upgrade Performance: Before & After
Upgrade by the numbers
- 40 to 125 TeraFLOPS
- 10,368 to 12,960 compute nodes
- 512 to 640 service nodes
- 2X the SeaStar interconnect bandwidth
- 2.0 GHz single-core to 2.4 GHz dual-core AMD Opteron processors
- DDR-333 to DDR-400 memory speed
Status
- Black section: in progress
- Center section: Oct 06
- Red section: Nov 06

Upgrade Performance: Single-Core vs Dual-Core
- CTH (Shape Charge): constant work/core; speedup of at least 1.4 out to 2048 sockets
- Sage (timing_c): constant work/core; at scale, speedup is at least a factor of 1.6
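One plausible reading of the quoted metric (an interpretation, not spelled out on the slide): with the work per core held constant, speedup is the ratio of pre-upgrade to post-upgrade wall-clock time for the same per-core workload at a given machine scale:

```latex
% Upgrade speedup under constant work per core; n denotes machine scale.
% Whether sockets or cores are held equal between the two runs is not
% stated on the slide.
\[
  S(n) \;=\; \frac{T_{\text{pre-upgrade}}(n)}{T_{\text{post-upgrade}}(n)}
  \;\ge\; 1.4 \ \ (\text{CTH, out to 2048 sockets}), \qquad
  S(n) \;\ge\; 1.6 \ \ (\text{Sage, at scale})
\]
```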

Application Performance Pre-Upgrade Preliminary Analysis

Sage Scaling (John Daly, LANL) [scaling plot, with an updated data point]

CTH ASC Purple & Red Storm Performance: Sandia's CTH (Shape Charge, 90x216x90 cells/PE), execution time for 100 cycles. [Plot: wall time (secs) vs. number of processors, 1 to 10,000, for CTH-Purple and CTH-Red Storm.]

SEAM Benchmarks (aqua planet): Red Storm 5 TF max, BG/L 4 TF max. (SEAM = NCAR's Spectral Element Atmospheric Model; POP = LANL's Parallel Ocean Program.)

POP Benchmarks (1/10 degree Ocean)

The Impact of a Balanced Architecture
- Architectural balance with low system noise is the key to a scalable platform
- Well-balanced traits translate to high real-world application performance

Conclusions
- Red Storm is an architecture
- The Red Storm machine is an instantiation of that architecture
- Red Storm has demonstrated excellent scalability on real applications
- The upgrade has shown significant application speedup (more analysis to come)