Initial Performance Evaluation of the Cray SeaStar Interconnect


Initial Performance Evaluation of the Cray SeaStar Interconnect. Ron Brightwell, Kevin Pedretti, Keith Underwood. Sandia National Laboratories, Scalable Computing Systems Department. 13th IEEE Symposium on High-Performance Interconnects, August 18, 2005. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Outline: Red Storm requirements; Cray SeaStar; Portals programming interface; performance results; future work; conclusions.

ASC Red Storm Performance Goals. Peak of ~40 TFLOPS based on 2 FLOPS/clock; expected performance is ~10 times faster than ASCI Red. Linpack (HPL) performance >14 TFLOPS; expect to get ~30 TFLOPS. Aggregate system memory bandwidth ~55 TB/s. Aggregate sustained interconnect bandwidth >100 TB/s.
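
(For scale, using figures from later slides: 10,368 compute nodes x 2 FLOPS/clock x 2.0 GHz works out to roughly 41.5 TFLOPS, consistent with the ~40 TFLOPS peak.)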

Red Storm Interconnect Performance Goals. MPI zero-length half round trip latency requirements: <2 µs nearest neighbor, <5 µs furthest neighbor. Peak node bandwidth: 1.5 GB/s each direction. Peak link bandwidth: 3 GB/s each direction.

Red Storm Architecture. True massively parallel processing machine, designed to be a single system. Distributed memory MIMD parallel supercomputer. Fully connected 3D mesh interconnect. 10,368 compute nodes; ~30 TB of DDR memory. Red/Black switching: ~1/4, ~1/2, ~1/4. 8 service and I/O cabinets on each end (256 processors for each color). >240 TB of disk storage (>120 TB per color).

Red Storm Network Topology. Compute node topology: 27 x 16 x 24 (x, y, z). Red/Black split: 2,688 / 4,992 / 2,688. Service and I/O node topology: 2 x 8 x 16 (x, y, z) on each end (network is 2 x 16 x 16). 256 full-bandwidth links to the compute node mesh (384 available).

Red Storm Layout (27 x 16 x 24 mesh) [diagram: normally classified section, switchable nodes, and normally unclassified section, with I/O and service nodes and disconnect cabinets at the ends; disk storage system not shown]

Red Storm System Software. Operating systems: Linux on service and I/O nodes; Catamount lightweight kernel on compute nodes; Linux on RAS nodes. Run-time system: logarithmic loader, intelligent node allocator, batch system (PBS). Libraries: MPI, I/O, math. File systems: Lustre for both UFS and parallel FS.

Red Storm Processors and Memory. Processors: AMD Opteron (SledgeHammer) at 2.0 GHz; 64 KB L1 instruction and data caches on chip; 1 MB shared L2 (data and instruction) cache on chip; integrated dual DDR memory controllers @ 333 MHz; three integrated HyperTransport interfaces @ 3.2 GB/s each direction. Node memory system: page miss latency to local processor memory is ~80 ns; peak memory bandwidth of ~5.3 GB/s for each processor.
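
(The ~5.3 GB/s figure is consistent with two 64-bit DDR-333 channels: 2 x 8 bytes x 333 MT/s ≈ 5.3 GB/s.)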

Cray SeaStar NIC/Router. IBM 0.13 µm ASIC process. 16 x 1.6 Gb/s HyperTransport to the Opteron. 500 MHz embedded PowerPC 440. 384 KB on-board scratch RAM. Seven-port router. Six 12-channel 3.2 Gb/s high-speed serial links.

SeaStar Block Diagram

SeaStar Network Interface. Independent send/receive cache-coherent DMA engines between Opteron memory and the network. Message-based DMA engines handle packetization. Attempts to minimize host overhead. Supports reception of multiple simultaneous messages. Delivers boot code to the Opteron.

SeaStar Embedded PowerPC: message preparation, message demultiplexing, MPI matching, native IP packets, end-to-end reliability protocol, system monitoring.

Integrated Router. Six high-speed network links per ASIC, more than 4 GB/s per link. Reliable link protocol with 16-bit CRC and automatic retry. Support for up to 32K nodes in a 3D toroidal mesh.
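
(For reference, the 12-channel 3.2 Gb/s serial links described on the previous slide give 12 x 3.2 Gb/s = 38.4 Gb/s, or roughly 4.8 GB/s of raw signaling bandwidth per link, in line with the "more than 4 GB/s" figure.)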

Network Reliability. SECDED on scratch RAM with scrubbing. SEC on routing lookup tables. Parity protection on DMA tables. Monitor port accesses on PowerPC and Opteron state. Link protocol: 16-bit CRC with automatic retry. NIC computes a 32-bit message CRC.

SeaStar Programming Interface. Cray chose the Portals 3.3 API, developed by Sandia and the University of New Mexico. Portals was designed to support intelligent/programmable network interfaces.

Portals Timeline. Portals 0.0 (1991): SUNMOS (Sandia/UNM OS); nCUBE, Intel Paragon; direct access to network FIFOs; message co-processor. Portals 1.0 (1993): data structures in user space; kernel-managed and user-managed memory descriptors; published but never implemented. Portals 2.0 (1994): Puma/Cougar; message selection (match lists); four types of memory descriptors (three implemented). Portals 3.0 (1998): Cplant/Linux; functional API; target intelligent/programmable network interfaces. Portals 3.3 (2003): Red Storm.

Portals 3.3 Features. Best effort, in-order delivery. Well-defined transport failure semantics. Based on expected messages. One-sided operations: put, get, atomic swap. Zero-copy. OS-bypass. Protocol offload: no polling or threads to move data, no host CPU overhead. Runtime system independent.
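
To make the one-sided model concrete, the fragment below is a minimal initiator-side sketch written against the Portals 3.3 specification rather than the SeaStar code itself. The header name, portal index, match bits, and target id are illustrative assumptions, and error checking is omitted.

    #include <stddef.h>
    #include <portals3.h>   /* header name and location vary by Portals implementation */

    /* Sketch: one-sided put and get from the initiator's point of view. */
    static void one_sided_example(void)
    {
        static char      send_buf[4096];
        int              num_if;
        ptl_handle_ni_t  ni_h;
        ptl_handle_eq_t  eq_h;
        ptl_handle_md_t  md_h;
        ptl_md_t         md;
        ptl_process_id_t tgt = { .nid = 12, .pid = 7 };   /* illustrative target */

        PtlInit(&num_if);
        PtlNIInit(PTL_IFACE_DEFAULT, PTL_PID_ANY, NULL, NULL, &ni_h);
        PtlEQAlloc(ni_h, 64, PTL_EQ_HANDLER_NONE, &eq_h);

        md.start     = send_buf;
        md.length    = sizeof(send_buf);
        md.threshold = PTL_MD_THRESH_INF;        /* never auto-unlink on an operation count */
        md.max_size  = 0;                        /* low-water mark; unused in this sketch */
        md.options   = 0;
        md.user_ptr  = NULL;
        md.eq_handle = eq_h;
        PtlMDBind(ni_h, md, PTL_UNLINK, &md_h);  /* free-floating MD, not on a match list */

        /* Put: deliver send_buf to whatever the target attached at portal
         * index 4 under these match bits; request an acknowledgment. */
        PtlPut(md_h, PTL_ACK_REQ, tgt, 4, 0, 0x17ULL, 0, 0);

        /* Get: fetch the target's buffer back into the same MD. */
        PtlGet(md_h, tgt, 4, 0, 0x17ULL, 0);
    }

Completion would then be observed through the event queue (send, ack, and reply events), as the Event Queue slide below describes.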

Portal Table Portals Addressing Operational Boundary Match List Event Queue Memory Descriptors Memory Regions Access Control Table Portal Space Application Space

Match Entry Contents: source node id, source process id, 64 match bits, 64 ignore bits.
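
These fields map directly onto the arguments of PtlMEAttach. Continuing the sketch above (same header and handles; the portal index, match bits, and wildcard source are illustrative assumptions):

    /* Sketch: post a match entry on portal index 4 that accepts operations from
     * any source whose upper 32 match bits equal 0x12345678; the low 32 bits
     * are wildcards via the ignore bits. */
    static void post_match_entry(ptl_handle_ni_t ni_h, ptl_handle_me_t *me_h)
    {
        ptl_process_id_t any_src = { .nid = PTL_NID_ANY, .pid = PTL_PID_ANY };

        PtlMEAttach(ni_h,
                    4,                         /* portal table index */
                    any_src,                   /* source nid/pid to match (wildcards here) */
                    0x1234567800000000ULL,     /* 64 match bits */
                    0x00000000FFFFFFFFULL,     /* 64 ignore bits: low 32 are don't-care */
                    PTL_RETAIN,                /* keep the entry after it is used */
                    PTL_INS_AFTER,             /* append to the end of the match list */
                    me_h);
    }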

Memory Descriptor. Start address (optionally supports a gather/scatter list). Length in bytes. Threshold (number of operations allowed). Max size (low-water mark). Options: put/get, receiver/sender managed offset, truncate, ack/no ack, ignore start/end events. 64 bits of user data. Event queue handle. Auto-unlink option.
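
These fields correspond to the members of the ptl_md_t structure. A sketch of describing a receive buffer and attaching it to the match entry posted above, again assuming the Portals 3.3 spec names (buffer size and option choices are illustrative):

    /* Sketch: attach a receive buffer, as an MD, behind a match entry. */
    static void attach_recv_md(ptl_handle_me_t me_h, ptl_handle_eq_t eq_h,
                               void *buf, ptl_size_t len, ptl_handle_md_t *md_h)
    {
        ptl_md_t md;

        md.start     = buf;                         /* start address */
        md.length    = len;                         /* length in bytes */
        md.threshold = PTL_MD_THRESH_INF;           /* unlimited number of operations */
        md.max_size  = 0;                           /* low-water mark; unused here */
        md.options   = PTL_MD_OP_PUT                /* accept incoming put operations */
                     | PTL_MD_EVENT_START_DISABLE;  /* deliver only *_END events */
        md.user_ptr  = NULL;                        /* 64 bits of user data */
        md.eq_handle = eq_h;                        /* events are recorded here */

        /* Without PTL_MD_MANAGE_REMOTE the offset is managed locally, so
         * successive puts land back-to-back in buf (the receiver-managed
         * buffering of unexpected messages mentioned later in the talk). */
        PtlMDAttach(me_h, md, PTL_RETAIN, md_h);
    }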

Event Queue. Circular queue that records operations on MDs. Types of events: Get (PTL_EVENT_GET_{START,END}), MD has received a get request; Put (PTL_EVENT_PUT_{START,END}), MD has received a put request; Reply (PTL_EVENT_REPLY_{START,END}), MD has received a reply to a get request; Send (PTL_EVENT_SEND_{START,END}), put request has been processed; Ack (PTL_EVENT_ACK), MD has received an ack to a put request.
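
A sketch of allocating an event queue and waiting on these events, using the event-type names from this slide; the queue depth is arbitrary, and the ptl_event_t field names follow the Portals 3.3 spec (treat them as assumptions here):

    /* Sketch: allocate an event queue and block on it, dispatching by event type.
     * In a real program this eq_h is the handle placed in each MD's eq_handle. */
    static void wait_for_events(ptl_handle_ni_t ni_h)
    {
        ptl_handle_eq_t eq_h;
        ptl_event_t     ev;

        PtlEQAlloc(ni_h, 1024, PTL_EQ_HANDLER_NONE, &eq_h);   /* 1024-entry circular queue */

        for (;;) {
            PtlEQWait(eq_h, &ev);           /* block until the next event is posted */
            switch (ev.type) {
            case PTL_EVENT_PUT_END:         /* an incoming put finished in an MD */
                /* ev.initiator, ev.match_bits, ev.mlength, ev.offset describe it */
                break;
            case PTL_EVENT_GET_END:         /* an incoming get request was served */
                break;
            case PTL_EVENT_SEND_END:        /* a local put request has been processed */
                break;
            case PTL_EVENT_REPLY_END:       /* data for a local get has arrived */
                break;
            case PTL_EVENT_ACK:             /* the target acknowledged a put */
                break;
            default:
                break;
            }
        }
    }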

Event Scenarios [diagram: for a put, the initiator sees send start, send end, and ack events while the target sees put start and put end; for a get, the target sees get start and get end while the initiator sees reply start and reply end]

Event Entry Contents: event type, initiator of the event (nid, pid), portal index, match bits, requested length, manipulated length, offset, MD handle, 64 bits of out-of-band data, link, sequence number.

What Makes Portals Different? Connectionless RDMA with matching. Provides elementary building blocks for supporting higher-level protocols (MPI, RPC, Lustre, etc.) well. Allows structures to be placed in user space, kernel space, or NIC space. Receiver-managed offset allows for efficient and scalable buffering of unexpected messages. Supports multiple protocols within a process, which is needed for compute nodes where everything is a message.

Portals 3.3 for SeaStar. Cray started with the Sandia reference implementation. Needed a single version of NIC firmware that supports all combinations of user-level and kernel-level API with NIC-space and kernel-space library. Cray added a bridge layer to the reference implementation to allow the NAL interface to support multiple API NALs and multiple library NALs: qkbridge for Catamount applications, ukbridge for Linux user-level applications, kbridge for Linux kernel-level applications.

SeaStar NAL. Generic mode: Portals processing in kernel space, interrupt-driven. Accelerated mode: Portals processing in NIC space, no interrupts.

Micro-Benchmarks. PtlPerf: ping-pong latency and bandwidth (uni- and bidirectional); single, persistent ME, MD, and EQ; best-case performance for Portals. NetPIPE 3.6.2: ping-pong latency and bandwidth (uni- and bidirectional); streaming bandwidth; implemented a Portals module.

Disclaimer. PtlPerf results: from a snapshot of the developer code base in May; Sandia-developed C-based firmware; only generic mode. NetPIPE results: from a snapshot of the developer code base on Tuesday; Sandia/Cray merge of the C-based firmware; generic and accelerated modes. Some features that may impact performance are not implemented (end-to-end reliability protocol).

PtlPerf Latency [chart: latency (microseconds) vs. message size (0-1000 bytes); curves: get, put-bi, put]

PtlPerf Bandwidth (10KB-100KB) [chart: bandwidth (MB/s) vs. message size (bytes); curves: put, put-bi, get]

PtlPerf Bandwidth (100KB-2MB) [chart: bandwidth (MB/s) vs. message size (bytes); curves: put, put-bi, get]

NetPIPE Portals Latency [chart: latency (us) vs. message size (1-1000 bytes); curves: Generic (2.0 GHz), Generic (2.4 GHz), Accelerated (2.0 GHz), Accelerated (2.4 GHz)]

NetPIPE Portals Bandwidth [chart: bandwidth (MB/s) vs. message size (bytes); curves: Stream - Accelerated (2.4 GHz), Stream - Generic (2.4 GHz), Accelerated (2.4 GHz), Accelerated (2.0 GHz), Generic (2.4 GHz), Generic (2.0 GHz)]

NetPIPE Portals Bi-directional Bandwidth [chart: bandwidth (MB/s) vs. message size (bytes); curves: Accelerated (2.4 GHz), Generic (2.4 GHz)]

NetPIPE TCP/IP Latency [chart: latency (microseconds) vs. message size (1-1000 bytes); curves: Portals (2.0 GHz), Native (2.0 GHz)]

NetPIPE TCP/IP Bandwidth [chart: bandwidth (MB/s) vs. message size (bytes); curves: Portals (2.0 GHz), Native (2.0 GHz)]

Future Work. Accelerated implementation: finish the implementation, optimize. NIC-based atomic memory operations (for SHMEM). NIC-based collective operations.

Conclusions. Latency performance: accelerated 3.88 µs, generic 4.61 µs; we may be able to reduce these numbers slightly. Bandwidth performance: asymptotic 1.1 GB/s. Cray is currently working on SeaStar 2.0.

Acknowledgments. Lots of people at Cray, especially Bob Alverson (from whom I stole some slides). Lots of people at Sandia, especially Tramm Hudson, Jim Laros, and Sue Kelly.

Questions?

Die Layout

Red Storm 4-Node Compute Node Board [photo: standard DIMMs, RAS controller, SeaStar, AMD Opteron]

MPI Latency [chart: latency (microseconds) vs. message size (1-1000 bytes); curves: shmem-preposted, shmem, mpich2-preposted, mpich-1.2.6-preposted no vshort ls 5, mpich-1.2.6-preposted, mpich2, mpich-1.2.6 no vshort, mpich-1.2.6]