A Long-distance InfiniBand Interconnection between two Clusters in Production Use


A Long-distance InfiniBand Interconnection between two Clusters in Production Use
Sabine Richling, Steffen Hau, Heinz Kredel, Hans-Günther Kruse
IT-Center, University of Heidelberg, Germany / IT-Center, University of Mannheim, Germany
SC'11, State of the Practice, 16 November 2011

Outline
1. Background: D-Grid and bwGRiD; bwGRiD MA/HD
2. Interconnection of two bwGRiD clusters
3. Cluster Operation: Node Management, User Management, Job Management
4. Performance: MPI Performance, Storage Access Performance
5. Summary and Conclusions

D-Grid and bwGRiD
bwGRiD Virtual Organization (VO):
- Community project of the German Grid Initiative D-Grid
- Project partners are the universities in Baden-Württemberg
bwGRiD Resources:
- Compute clusters at 8 locations
- Central storage unit in Karlsruhe
bwGRiD Objectives:
- Verifying the functionality and the benefit of Grid concepts for the HPC community in Baden-Württemberg
- Managing organizational, security, and license issues
- Development of new cluster and Grid applications

bwGRiD Resources
Compute clusters:
  Site            Nodes
  Mannheim          140  (interconnected with Heidelberg to a single cluster)
  Heidelberg        140
  Karlsruhe         140
  Stuttgart         420
  Tübingen          140
  Ulm/Konstanz      280  (joint cluster with Konstanz)
  Freiburg          140
  Esslingen         180
  Total            1580
Central storage (Karlsruhe):
  with backup      128 TB
  without backup   256 TB
  Total            384 TB

bwGRiD Situation in MA/HD before the interconnection
- Diversity of applications (1-128 nodes per job)
- Many first-time HPC users
- Access with local university accounts (authentication via LDAP in Mannheim, Active Directory in Heidelberg)
[Diagram: users in MA and HD authenticate against LDAP/AD and access the separate InfiniBand clusters in Mannheim and Heidelberg]

bwGRiD Situation in MA/HD before the interconnection (continued)
- A Grid certificate allows access to all bwGRiD clusters (and other bwGRiD resources)
- Feasible only for more experienced users
[Diagram: Grid users MA/HD with Grid certificate, VO registration and VORM in front of the directory services LDAP/AD and the two clusters]

Interconnection of the bwGRiD clusters MA/HD
- Proposal in 2008
- Acquisition and assembly until May 2009
- Running since July 2009
- InfiniBand over Ethernet over fibre optics: Obsidian Longbow adaptor
[Photo: Obsidian Longbow adaptor with InfiniBand connector (black cable) and fibre optic connector (yellow cable)]

MPI Performance Prospects
- Measurements for different distances (HLRS, Stuttgart, Germany)
- Bandwidth of 900-1000 MB/s for distances of up to 50-60 km

MPI Performance of the Interconnection MA/HD
[Diagram: Cluster Mannheim - InfiniBand - Obsidian - 28 km fibre - Obsidian - InfiniBand - Cluster Heidelberg]
- Latency is high: 145 µs = 143 µs light transit time in the fibre + 2 µs local latency
- Bandwidth is as expected: about 930 MB/s (local bandwidth 1200-1400 MB/s)
- The Obsidian needs a license for 40 km range: it has buffers for larger distances, and the license activates these buffers; a 10 km license is not sufficient for the 28 km link
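The latency figure follows directly from the propagation speed of light in optical fibre (roughly c/1.47, about 0.2 km/µs). A minimal back-of-the-envelope check, assuming a one-way PingPong latency and a fibre route of roughly 28-29 km (the fibre path may be slightly longer than the 28 km quoted distance):

```python
# Back-of-the-envelope check of the 145 microsecond one-way latency.
# Assumptions: fibre refractive index ~1.47, route length ~28-29 km.

C_VACUUM_KM_PER_US = 299_792.458 / 1e6   # speed of light in vacuum [km/us]
REFRACTIVE_INDEX = 1.47                  # typical for single-mode fibre
LOCAL_LATENCY_US = 2.0                   # local InfiniBand latency from the slide

def one_way_latency_us(fibre_km: float) -> float:
    """Propagation delay through the fibre plus the local IB latency."""
    v = C_VACUUM_KM_PER_US / REFRACTIVE_INDEX   # ~0.204 km/us
    return fibre_km / v + LOCAL_LATENCY_US

if __name__ == "__main__":
    for km in (28.0, 29.0):
        print(f"{km:.1f} km fibre -> {one_way_latency_us(km):.0f} us")
    # ~139-144 us, consistent with the measured 145 us
```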

MPI Bandwidth: Influence of the Obsidian License
[Plot: IMB 3.2 PingPong, 1 GB buffer; bandwidth in MB/s (axis 0-1000) over the measurement start time, 16 Sep - 07 Oct]

bwGRiD Cluster Mannheim/Heidelberg: Overview
[Diagram: two 140-node clusters with their InfiniBand networks, login/admin servers, directory services (LDAP in MA, AD in HD), Lustre storage (bwfs MA, bwfs HD), PBS and the passwd database on the admin server; the sites are connected by Obsidian Longbows over 28 km of fibre; Grid users MA/HD access via VORM]

Node Management
The administration server provides:
- DHCP service for the nodes (MAC-to-IP address configuration file)
- NFS export for the root file system
- NFS directory for software packages, accessible via the module utilities
- queuing and scheduling system
Node administration:
- adjusted shell scripts originally developed by HLRS
- IBM management module (command-line interface and web GUI)
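As an illustration of the MAC-to-IP configuration file, the sketch below generates ISC-dhcpd-style host entries from a node table; the node names, MAC and IP addresses are made up for the example and do not come from the slides.

```python
# Illustrative only: generate dhcpd host entries from a MAC-to-IP node table.
# Node names, MAC and IP addresses below are hypothetical examples.

NODES = {
    "node001": ("00:1a:64:aa:bb:01", "10.1.0.1"),
    "node002": ("00:1a:64:aa:bb:02", "10.1.0.2"),
}

def dhcp_host_entry(name: str, mac: str, ip: str) -> str:
    """Return one host stanza in ISC dhcpd syntax."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"}}\n"
    )

if __name__ == "__main__":
    print("".join(dhcp_host_entry(n, mac, ip)
                  for n, (mac, ip) in sorted(NODES.items())))
```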

User Management
Users should have exclusive access to compute nodes:
- user names and user IDs must be unique
- direct connection to PBS for user authorization via a PAM module
Authentication at the access nodes:
- directly against the directory services LDAP (MA) and AD (HD), or with a D-Grid certificate
Combining information from the directory services of both universities:
- prefix for group names
- offsets added to user IDs and group IDs
- activated user names from MA and HD must be different
Activation process:
- adding a special attribute for the user in the directory service (for authentication)
- updating the user database of the cluster (for authorization)

User Management: Generation of Configuration Files
[Diagram: entries from the MA directory service (LDAP) get the group prefix "ma" and an offset of +100,000 on user and group IDs; entries from the HD directory service (AD) get the prefix "hd" and an offset of +200,000; the admin server merges both into the cluster passwd database; user names must be unique]
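A minimal sketch of this merging step, assuming each directory export is a simple list of (user, uid, group, gid) records; the prefixes and ID offsets follow the slide, while the record layout and helper names are made up for illustration:

```python
# Illustrative sketch of merging two directory exports into one passwd-like
# user database, using the prefixes and ID offsets described on the slide.

from typing import NamedTuple

class Account(NamedTuple):
    user: str
    uid: int
    group: str
    gid: int

# Per-site rules from the slide: group prefix and uid/gid offset.
SITE_RULES = {
    "MA": ("ma", 100_000),   # LDAP, Mannheim
    "HD": ("hd", 200_000),   # Active Directory, Heidelberg
}

def merge_directories(exports: dict[str, list[Account]]) -> list[Account]:
    """Apply prefix/offset per site and check that user names stay unique."""
    merged: dict[str, Account] = {}
    for site, accounts in exports.items():
        prefix, offset = SITE_RULES[site]
        for acc in accounts:
            if acc.user in merged:
                raise ValueError(f"user name {acc.user!r} exists at both sites")
            merged[acc.user] = Account(
                user=acc.user,
                uid=acc.uid + offset,
                group=f"{prefix}_{acc.group}",
                gid=acc.gid + offset,
            )
    return list(merged.values())

if __name__ == "__main__":
    exports = {
        "MA": [Account("alice", 1001, "physics", 500)],   # hypothetical users
        "HD": [Account("bob", 1001, "physics", 500)],
    }
    for acc in merge_directories(exports):
        print(acc)
```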

Job Management
The interconnection (high latency, limited bandwidth):
- provides enough bandwidth for I/O operations
- is not sufficient for all kinds of MPI jobs
Therefore jobs run only on nodes located either in HD or in MA (realized with node attributes provided by the queuing system; see the sketch below).
Before the interconnection:
- in Mannheim: mostly single-node jobs, free nodes
- in Heidelberg: many MPI jobs, long waiting times
With the interconnection: better resource utilization (see Ganglia report)
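The slides do not show the scheduler configuration itself; as a hedged illustration of the placement rule, the sketch below checks that the node set assigned to a job carries a single site attribute (node names and the attribute map are hypothetical):

```python
# Illustrative check of the "one site per job" placement rule.
# The actual mechanism on the cluster uses node attributes of the queuing
# system (PBS); the node names and attribute map here are hypothetical.

NODE_SITE = {
    "ma-node001": "MA",
    "ma-node002": "MA",
    "hd-node001": "HD",
    "hd-node002": "HD",
}

def job_placement_ok(job_nodes: list[str]) -> bool:
    """A job may only use nodes that all belong to the same site."""
    sites = {NODE_SITE[node] for node in job_nodes}
    return len(sites) == 1

if __name__ == "__main__":
    print(job_placement_ok(["ma-node001", "ma-node002"]))  # True
    print(job_placement_ok(["ma-node001", "hd-node001"]))  # False
```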

Ganglia Report during activation of the interconnection
[Figure: Ganglia load report around the activation of the interconnection]

MPI Performance Measurements
Numerical model:
- High-Performance Linpack (HPL) benchmark
- OpenMPI
- Intel MKL
Model variants:
- calculations on a single cluster with up to 1024 CPU cores
- calculations on the interconnected cluster with up to 2048 CPU cores, symmetrically distributed

Results for a single cluster
[Plot: HPL 1.0a speed-up (axis 0-300) vs. number of processors p (axis 0-2000) on the local cluster, for load parameters (matrix sizes) n_p = 10000, 20000, 30000, 40000; the curves follow the simple p/ln(p) model (Kruse 2009), in which all CPU configurations have equal probability]

Results for the interconnected cluster
[Plot: HPL 1.0a speed-up (axis 0-100) vs. number of processors p (axis 0-2000) on the MA-HD cluster, load parameters n_p = 10000-40000; for p > 256 the speed-up is reduced by a factor of 4, for p > 500 it is constant or decreasing]

Performance model
Improvement of the simple analytical model (Kruse 2009) to analyze the characteristics of the interconnection:
- high latency of 145 µs
- limited bandwidth of 930 MB/s (modelled as a shared medium)
Result for the speed-up:

    S(p) ≈ p / ( ln p + (3·100)/(4·n_p) · (1 + 4p) · c(p) )

where p is the number of processors, n_p the load parameter (matrix size), and c(p) a dimensionless function representing the communication topology.
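A small sketch evaluating this model, under the assumptions that the denominator as reconstructed above is correct and that the topology function is simply c(p) = 1 (the slides do not give c(p)); it reproduces the order of magnitude of the measured speed-ups:

```python
# Sketch of the speed-up model for the interconnected cluster.
# Assumptions: the denominator as reconstructed above and c(p) = 1;
# the slides do not specify the topology function c(p).

import math

def speedup(p: int, n_p: int, c: float = 1.0) -> float:
    """S(p) ~ p / (ln p + 3*100/(4*n_p) * (1 + 4p) * c(p))."""
    return p / (math.log(p) + (3 * 100) / (4 * n_p) * (1 + 4 * p) * c)

if __name__ == "__main__":
    for n_p in (10_000, 20_000, 30_000, 40_000):
        print(f"n_p = {n_p:6d}: S(2048) ~ {speedup(2048, n_p):5.1f}")
    # Larger matrices shift work from communication to computation,
    # so the speed-up grows with n_p.
```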

Speed-up of the model
Results:
- The limited bandwidth of the shared connection between the clusters is the performance bottleneck.
- Doubling the bandwidth yields a 25% improvement for n_p = 40,000; a ten-fold bandwidth yields a 100% improvement.
- Consequently, jobs run on nodes located either in MA or in HD.

Long-term MPI performance: latency
[Plot: IMB 3.2 PingPong latency for zero-length messages between two random nodes within HD or within MA; latency axis 0-14 µs, start dates 29 Jan - 22 Apr]

Long-term MPI performance: bandwidth
[Plot: IMB 3.2 PingPong bandwidth with 1 GB buffer between two random nodes within HD or within MA; bandwidth axis 0-2000 MB/s, start dates 29 Jan - 22 Apr]

Storage Access Performance
[Plot: IOzone benchmark, 32 GB file with 4 MB record size (node against storage); write and read bandwidth in MB/s (axis 0-600) for the combinations MA-HD, MA-MA, HD-HD, HD-MA]

Summary and Conclusions
- The interconnection network (Obsidian adaptors and InfiniBand switches) is stable and works reliably.
- The bandwidth of 930 MB/s is sufficient for Lustre file system access:
  - single system administration
  - lower administration costs
  - better load balance
- Setting up a federated authorization is challenging but worthwhile:
  - further reduction of administration costs
  - lower access barrier for potential users
- The characteristics of the interconnection are not sufficient for all kinds of MPI jobs:
  - jobs remain on one side of the combined cluster
  - possible improvements: adding more parallel fibre lines (very expensive), investigating different job scheduler configurations