
Boosting Scalability of InfiniBand-based HPC Clusters
Asaf Wachtel, Senior Product Manager
2010 Voltaire Inc.

InfiniBand-based HPC Clusters: Scalability Challenges
- Cluster TCO Scalability
  - Hardware costs
  - Software license costs
  - Space, power & cooling
- Communication Scalability
  - Handle increasing compute power
  - Multi-core, GPUs
- Utilization Scalability
  - Many jobs & users
  - Varying sizes, traffic patterns & QoS
- Application Scalability
  - Home-grown or ISVs
  - MPI collectives

Voltaire 40Gb/s InfiniBand Portfolio
- Fabric provisioning and performance monitoring
- Application acceleration
- 40Gb/s InfiniBand switching platforms
  - HSSM
  - SSI Blade Switch
  - 4036: 36 x IB ports
  - 4036E: 34 x IB ports + 2 x 1/10GbE
  - 4200: 162 x IB ports
  - 4700: 324/648 x IB ports

Scalable Architectures
- Fat Tree
  - Full bisectional bandwidth at any node count
  - Uniform oversubscription options
- HyperScale
  - Scale to thousands of nodes with linear performance
  - Large non-blocking islands (more than 2,000 cores)
  - 4-hops maximum latency to any port
  - Lowest number of switches and cables
- Torus
  - Lowest cost solution
  - Built entirely with edge switches and copper cables
  - Optimized support by Voltaire software, including Torus2QoS routing
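To make the fat-tree bullets concrete, here is a minimal sizing sketch for a generic two-tier (leaf/spine) fat tree built from identical switches. It is textbook arithmetic, not a description of how any Voltaire chassis is built internally; the function name `two_tier_fat_tree` and its parameters are hypothetical.

```python
def two_tier_fat_tree(radix, oversubscription=1):
    """Size a two-tier fat tree built from identical switches.

    radix            -- ports per switch (e.g. 36 for a 36-port edge switch)
    oversubscription -- host-facing ports per uplink port on each leaf
                        (1 means non-blocking, full bisectional bandwidth)
    """
    # Split each leaf's ports so that down : up is roughly oversubscription : 1.
    up = radix // (oversubscription + 1)
    down = radix - up
    spines = up        # every leaf has one uplink to every spine
    leaves = radix     # every spine port connects to a distinct leaf
    return {
        "hosts": leaves * down,
        "leaves": leaves,
        "spines": spines,
        "inter_switch_cables": leaves * up,
        "actual_ratio": down / up,
    }


non_blocking = two_tier_fat_tree(36)
oversubscribed = two_tier_fat_tree(36, oversubscription=2)
print(non_blocking)
print(oversubscribed)
```

With 36-port switches the non-blocking case tops out at 648 hosts using 54 switches; accepting 2:1 oversubscription raises the host count to 864 while needing fewer spines and inter-switch cables, which is the trade-off behind the "uniform oversubscription options" bullet.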

HyperScale in the Top500
- Large, low-latency, non-blocking islands
- Lowest number of switches & cables
- Scales to thousands of nodes with linear performance
- 8:1 oversubscribed core
- 1,200-node interconnect in only 2 racks
- 13 x non-blocking HyperScale islands
- 1.05 PFLOPs, 83.7% efficiency
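The two headline numbers on this slide can be related with a short calculation. The sketch below assumes that the 1.05 PFLOPs figure is the measured (sustained) result and that "efficiency" means sustained divided by theoretical peak, which is the usual Top500 convention; the slide itself does not define either term. The function name is hypothetical.

```python
def implied_peak(sustained_pflops, efficiency):
    """Theoretical peak implied by a sustained result and an efficiency.

    Assumes efficiency = sustained / peak (an assumption, not stated on the slide).
    """
    return sustained_pflops / efficiency


peak = implied_peak(1.05, 0.837)
print(f"implied peak: {peak:.3f} PFLOPs")

# With an 8:1 oversubscribed core, at most 1/8 of an island's aggregate
# host bandwidth can leave the island at the same time.
inter_island_share = 1 / 8
print(f"inter-island share of host bandwidth: {inter_island_share:.3f}")
```

Under those assumptions the implied peak is roughly 1.25 PFLOPs.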

The Challenge: Static Routing Inefficiency
- The Challenge: One Size Routing does not Fit All
  - Static routing assumes uniform traffic across the entire fabric
  - Real life is different
    - Most jobs use a small portion of the cluster
    - Different traffic patterns for different jobs
    - Different requirements for different traffic types (e.g. storage)
- The Solution: Voltaire TARA (Traffic Aware Routing Algorithm)
  - A new routing algorithm on top of OpenSM
  - Dynamically optimizes routing according to defined traffic patterns:
    - Fabric topology
    - Job-specific communication patterns
    - Symmetric/asymmetric communication
    - Traffic load/QoS
  - Fully integrated with leading job schedulers
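The contrast between static and traffic-aware routing can be illustrated with a toy model. This is not TARA's algorithm, which the slides do not describe; it is a minimal sketch in which a static rule picks an uplink from the destination address alone, while a greedy rule places each of a job's known flows on the least-loaded uplink. All names and the flow list are hypothetical.

```python
from collections import Counter


def static_route(flows, n_uplinks):
    """Static rule: uplink chosen from the destination only (dst mod n_uplinks)."""
    return {flow: flow[1] % n_uplinks for flow in flows}


def traffic_aware_route(flows, n_uplinks):
    """Greedy rule: place each known flow on the currently least-loaded uplink.

    Every destination appears once in the example below, so the result is
    still expressible as per-destination forwarding.
    """
    load = [0] * n_uplinks
    routes = {}
    for flow in flows:
        port = min(range(n_uplinks), key=lambda p: load[p])
        routes[flow] = port
        load[port] += 1
    return routes


def port_weights(routes, n_uplinks):
    """Number of flows carried by each uplink."""
    counts = Counter(routes.values())
    return [counts.get(p, 0) for p in range(n_uplinks)]


# A small job whose destinations happen to collide under the static rule.
flows = [(src, 100 + 4 * src) for src in range(8)]   # (source, destination)

static_weights = port_weights(static_route(flows, 4), 4)
aware_weights = port_weights(traffic_aware_route(flows, 4), 4)
print("static:       ", static_weights)
print("traffic-aware:", aware_weights)
```

In this contrived case the static rule piles all eight flows onto one uplink, while the greedy rule spreads them two per uplink. Knowing which flows a job will actually generate is what makes the second placement possible, which is why the slide stresses job-scheduler integration.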

TARA Traffic Aware Routing Algorithm: Maximizing Cluster Utilization
[Figure: two bar charts of port weight (y-axis, 0 to 2000) per switch.port (x-axis, 1.18 through 17.30) for the internal ports on the line cards. Left panel: OpenSM without UFM TARA. Right panel: UFM TARA is ON.]

The Challenge: Collective Operations Scalability
- Grouping algorithms are unaware of the topology and inefficient
- Network congestion due to all-to-all communication
- Slow nodes & OS involvement impair scalability and predictability
- The more powerful servers get (GPUs, more cores), the more poorly collectives scale in the fabric
[Figure: three plots against # ranks: % collectives out of total run time, total run time, and run time variance.]
Significant inhibitor to MPI application scalability

Introducing: Voltaire Fabric Collective Accelerator
- Grid Director Switches: Fabric Processing Power
  - Collective operations offloaded to switch CPUs
- Unified Fabric Manager (UFM): Topology Aware Orchestrator
  - FCA Manager: topology-based collective tree, separate virtual network, IB multicast for result distribution
- FCA Agent: inter-core processing localized & optimized
- Breakthrough performance with no additional hardware
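The idea of a topology-based collective tree can be sketched as a staged sum: ranks on the same node combine locally, each node contributes one partial result at its switch, each switch contributes one partial result at the root, and the final value goes out in a single multicast. This is only an illustrative model of the concept named on the slide, not FCA's implementation or API; the function names, the placement map, and the message-counting rules are all assumptions.

```python
from collections import defaultdict


def hierarchical_allreduce(values, placement):
    """Sum `values` following the fabric hierarchy and count fabric messages.

    values    -- {rank: number}
    placement -- {rank: (switch_id, node_id)}  (hypothetical layout description)
    Returns (total, fabric_messages).
    """
    # Stage 1: cores on the same node combine locally (no fabric traffic).
    node_sums = defaultdict(int)
    for rank, value in values.items():
        node_sums[placement[rank]] += value

    # Stage 2: each node sends one partial result to its switch.
    switch_sums = defaultdict(int)
    messages = 0
    for (switch_id, _node_id), partial in node_sums.items():
        switch_sums[switch_id] += partial
        messages += 1

    # Stage 3: each switch sends one partial result to the root of the tree.
    total = sum(switch_sums.values())
    messages += len(switch_sums)

    # Stage 4: the result is distributed with one multicast (counted once).
    messages += 1
    return total, messages


def flat_messages(n_ranks):
    """Naive baseline: every other rank sends to a root and gets the result back."""
    return 2 * (n_ranks - 1)


# 2 switches x 2 nodes per switch x 4 cores per node = 16 ranks.
placement = {r: (r // 8, (r // 4) % 2) for r in range(16)}
values = {r: r for r in range(16)}

total, tree_msgs = hierarchical_allreduce(values, placement)
print(total, tree_msgs, flat_messages(16))
```

For this layout the staged version needs 7 fabric messages against 30 for the naive baseline, and the gap widens as the core count per node grows, since the local stage absorbs the extra ranks.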

FCA Fabric Collective Accelerator: Unmatched Application Scalability
- First and only system-wide solution for offloading MPI collectives
- Accelerates MPI collective computation by as much as 100X
- 10-40% improvement in application runtime
- Integrated with leading MPI implementations
[Figure: bar chart for Fluent truck_111m on 192 cores, comparing PMPI with PMPI + FCA; y-axis 0 to 180.]
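The two claims on this slide (up to 100X on collectives, 10-40% on application runtime) can be related with Amdahl-style arithmetic. The sketch below assumes that "improvement in application runtime" means a reduction in total run time and that only the collective portion is accelerated; neither assumption is stated on the slide, and the function names are hypothetical.

```python
def runtime_reduction(collective_fraction, collective_speedup):
    """Fraction of total run time saved when only collectives are accelerated."""
    return collective_fraction * (1 - 1 / collective_speedup)


def required_collective_fraction(target_reduction, collective_speedup):
    """Share of run time that must be spent in collectives to reach the target."""
    return target_reduction / (1 - 1 / collective_speedup)


low = required_collective_fraction(0.10, 100)
high = required_collective_fraction(0.40, 100)
print(f"collectives would need to be {low:.1%} to {high:.1%} of run time")

# Even an unbounded speedup cannot save more than the collective share itself.
print(f"{runtime_reduction(0.25, 100):.4f} vs upper bound 0.25")
```

Under these assumptions, a 10-40% runtime reduction at 100X corresponds to applications that spend roughly 10% to 40% of their time in collectives, so the application-level gain is bounded by the collective share rather than by the 100X factor.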

Summary
- Reduce total cost of ownership via scalable topologies (HyperScale)
- Increase cluster utilization via Traffic Aware Routing (TARA)
- Boost application scalability using Fabric Collective Acceleration (FCA)
More performance for each $ spent

Thank You
Asaf Wachtel
Senior Product Manager, InfiniBand Solutions
asafw@voltaire.com