Performance Analysis and Prediction for Distributed Homogeneous Clusters

Performance Analysis and Prediction for Distributed Homogeneous Clusters

Heinz Kredel, Hans-Günther Kruse, Sabine Richling, Erich Strohmaier
IT-Center, University of Mannheim, Germany
IT-Center, University of Heidelberg, Germany
Future Technology Group, LBNL, Berkeley, USA

ISC'12, Hamburg, 18 June 2012

Outline

1 Background and Motivation
  - D-Grid and bwgrid
  - bwgrid Mannheim/Heidelberg
  - Next generation bwgrid
2 Performance Modeling
  - The Roofline Model
  - Analysis of a single Region
  - Analysis of two identical interconnected Regions
  - Application to bwgrid
3 Conclusions

Background and Motivation: D-Grid and bwgrid

bwgrid Virtual Organization (VO)
- Community project of the German Grid Initiative D-Grid
- Project partners are the universities in Baden-Württemberg

bwgrid Resources
- Compute clusters at 8 locations
- Central storage unit in Karlsruhe

bwgrid Objectives
- Verify the functionality and the benefit of Grid concepts for the HPC community in Baden-Württemberg
- Manage organizational, security, and license issues
- Develop new cluster and Grid applications

Background and Motivation: bwgrid Resources

Compute clusters:

  Site            Nodes
  Mannheim          140  (interconnected with Heidelberg to a single cluster)
  Heidelberg        140  (interconnected with Mannheim to a single cluster)
  Karlsruhe         140
  Stuttgart         420
  Tübingen          140
  Ulm/Konstanz      280  (joint cluster with Konstanz)
  Freiburg          140
  Esslingen         180
  Total            1580

Central storage (Karlsruhe):

  with backup      128 TB
  without backup   256 TB
  Total            384 TB

Background and Motivation: bwgrid Mannheim/Heidelberg Hardware

                        Mannheim   Heidelberg   Total
  Blade Center                10           10      20
  Blades (Nodes)             140          140     280
  CPUs (Cores)              1120         1120    2240
  Login Server                 2            2       4
  Admin Server                 1            1
  InfiniBand Switches          1            1       2
  HP Storage System        32 TB        32 TB   64 TB

Blade configuration:
- 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
- 16 GB memory
- 140 GB hard drive (since January 2009)
- Gigabit Ethernet (1 Gbit/s)
- InfiniBand network (20 Gbit/s)

Background and Motivation: bwgrid Mannheim/Heidelberg Overview

[Diagram: the two 140-node clusters in Mannheim and Heidelberg, each with its own InfiniBand network, Lustre storage (bwfs MA / bwfs HD), login/admin servers (PBS, passwd, VORM) and directory services (LDAP in Mannheim, AD in Heidelberg), coupled over 28 km via Obsidian devices.]

Background and Motivation: bwgrid Mannheim/Heidelberg Interconnection

Network technology
- InfiniBand over Ethernet over fibre optics (28 km)
- 2 Obsidian Longbow devices (150 TEUR)

MPI performance
- Latency is high: 145 µsec = 143 µsec light transit time + 2 µsec
- Bandwidth is as expected: 930 MB/sec (local: 1200-1400 MB/sec)

Operating considerations
- Operate the two clusters as a single system image
- Fast InfiniBand interconnection to the storage systems
- MPI performance is not sufficient for all kinds of parallel jobs
- Keep all nodes of a job on one side
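For orientation, the quoted 143 µsec transit time matches the one-way light propagation time over 28 km of fibre. A minimal back-of-the-envelope check (assuming a fibre refractive index of roughly 1.5, i.e. a signal speed of about 2e8 m/s):

```python
# Rough one-way light transit time over the 28 km fibre link.
C_VACUUM = 3.0e8         # speed of light in vacuum [m/s]
N_FIBRE = 1.5            # assumed refractive index of the fibre
DISTANCE_M = 28_000      # Mannheim-Heidelberg link length [m]

transit_us = DISTANCE_M / (C_VACUUM / N_FIBRE) * 1e6
print(f"one-way transit time: {transit_us:.0f} usec")  # ~140 usec, close to the quoted 143 usec
```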

Background and Motivation: Next generation bwgrid

Questions
- What bandwidth is required to allow all parallel jobs to run across two cluster regions?
- Is the expected bandwidth of the new system sufficient?
- Is there an optimal size for a cluster region?

Performance characteristics:

                                      bwgrid 1          bwgrid 2
  Bandwidth between two nodes         1.5 GByte/sec     6 GByte/sec
  Bandwidth between two regions       1.0 GByte/sec     15-45 GByte/sec
  Performance of a single core        8.5 GFlop/sec     10-16 GFlop/sec

Outline (Part 2): Performance Modeling, covering the Roofline Model, analysis of a single region, analysis of two identical interconnected regions, and application to bwgrid.

The Roofline Model: Basic Roofline

Performance is upper-bounded by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio:

  attainable Gflop/s = min( peak Gflop/s, stream BW * actual flop:byte ratio )

(Slide material adapted from the Berkeley Par Lab roofline work.)
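A minimal sketch of this bound as a helper function (the names and example numbers are illustrative, not from the talk):

```python
def roofline_gflops(peak_gflops: float, stream_bw_gb_s: float, flops_per_byte: float) -> float:
    """Attainable performance under the basic roofline model:
    min(peak compute rate, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, stream_bw_gb_s * flops_per_byte)

# A kernel with 0.5 flop/byte on a machine with 74 Gflop/s peak and 16 GB/s
# sustained bandwidth is bandwidth-bound at 8 Gflop/s.
print(roofline_gflops(74.0, 16.0, 0.5))
```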

The Roofline Model: Roofline model for Opteron (adding ceilings)

[Plot: attainable Gflop/s (2 to 128, log scale) vs. flop:DRAM byte ratio (1/8 to 16, log scale) for an AMD Opteron 2356 (Barcelona). The peak roofline is based on the manual's single-precision peak and a hand-tuned stream read for bandwidth. Slide adapted from the Berkeley Par Lab roofline work.]

The Roofline Model: A Performance Model based on the Roofline Model

Roofline principles: bottleneck analysis, bounded by peak flop rate and measured bandwidth.

The following steps are used to develop a performance model for single and multiple regions:
- Transform basic scales to dimensionless quantities to arrive at a universal scaling law
- Assume optimal floating-point operations and scaling with system size
- Introduce effective bandwidth scaling with system size
- Formulate the result with dimensionless code-to-system balance factors

Performance Model: Overall System Abstraction

Hardware
- Two regions I_1 and I_2, coupled by an interconnect with aggregate bandwidth B^E
- Each region: n cores, core performance l_th, internal bandwidth b^I

Application (load)
- #op: number of arithmetic operations, performed on
- #b: number of bytes (data)
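For the sketches that follow, the model parameters can be collected in small containers (a sketch; the class and field names are mine, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class Region:
    n: int          # number of cores in the region
    l_th: float     # peak performance per core [Gflop/s]
    b_i: float      # internal bandwidth b^I [GByte/s]

@dataclass
class Load:
    n_op: float     # #op: arithmetic operations
    n_b: float      # #b:  bytes moved within a region
    n_x: float = 0  # #x:  bytes exchanged between regions (two-region case)
```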

Analysis of a single Region

Total time = computation time + communication time. With ideal floating-point operations:

  additive:     t_V ≈ #op/d_th + #b/b^I        = (#op/d_th) * (1 + (#b/#op) * (d_th/b^I))
  overlapping:  t_V ≈ max(#op/d_th, #b/b^I)    = (#op/d_th) * max(1, (#b/#op) * (d_th/b^I))

Identify a code-to-system balance factor x based on:
- a:  arithmetic intensity (roofline model, Williams et al. 2009), a = #op/#b
- a*: operational balance ("architectural intensity"), a* = d_th/b^I

  x = a/a* = (#op/#b) * (b^I/d_th)

Throughput d = #op/t_V:

  additive:     d ≈ d_th * x/(x + 1)
  overlapping:  d ≈ d_th * min(1, x)
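These formulas translate directly into a small helper (a sketch under the definitions above; function and argument names are mine):

```python
def throughput_single(n_op: float, n_b: float, d_th: float, b_i: float,
                      overlapping: bool = False) -> float:
    """Throughput d = #op/t_V of one region under the roofline-style model.

    additive:     t_V = #op/d_th + #b/b^I       ->  d = d_th * x/(x + 1)
    overlapping:  t_V = max(#op/d_th, #b/b^I)   ->  d = d_th * min(1, x)
    with the balance factor x = a/a* = (#op/#b) * (b^I/d_th).
    """
    x = (n_op / n_b) * (b_i / d_th)
    if overlapping:
        return d_th * min(1.0, x)
    return d_th * x / (x + 1.0)
```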

Single Region Throughput

[Plot: throughput d for the additive (green) and overlapping (red) concepts.]

Single Region Speed-up

Ideal floating-point performance d_th = n * l_th and effective bandwidth scaling z = b^I/b^I_0 with a reference bandwidth b^I_0 give:

  x = (#op/#b) * (b^I/d_th) = (1/n) * (#op/#b) * (b^I_0/l_th) * (b^I/b^I_0) = x* * z / n

where x* is the balance factor of the core (or node, unit, ...).

The parallel speed-up is then:

  Sp(n) = d(n)/d(1) = (1 + x* * z) / (1 + x* * z / n)
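The speed-up formula as code (a sketch; the names are mine):

```python
def speedup_single(n: int, x_star: float, z: float = 1.0) -> float:
    """Parallel speed-up of a single region with n cores:
    Sp(n) = (1 + x* * z) / (1 + x* * z / n),
    where x* is the per-core balance factor and z = b^I/b^I_0 scales the
    bandwidth relative to the reference bandwidth b^I_0."""
    return (1.0 + x_star * z) / (1.0 + x_star * z / n)

# e.g. x* = 200, z = 1.5: nearly linear speed-up for small n, saturating for large n
print(speedup_single(2000, 200.0, 1.5))  # ~262
```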

Single Region Speed-up

[Plot: speed-up Sp for different values of x* and z.]

Analysis of two identical interconnected Regions

Total time = time for one region with half the computational load + communication time between the regions. For #x bytes exchanged over a channel with bandwidth B^E:

  t_V ≈ t_V^(1)(#op/2) + #x/B^E = (#op/2)/d_th * (1 + a*/a) + #x/B^E

Throughput:

  d = #op/t_V ≈ 2 * d_th / (1 + a*/a + 2 * (d_th/B^E) * (#x/#op))
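The two-region throughput, again as a sketch consistent with the formula above (names are mine):

```python
def throughput_two_regions(n_op: float, n_b: float, n_x: float,
                           d_th: float, b_i: float, b_e: float) -> float:
    """Throughput of two identical coupled regions (additive model):
    d = 2*d_th / (1 + a*/a + 2*(d_th/B^E)*(#x/#op)),
    with arithmetic intensity a = #op/#b and operational balance a* = d_th/b^I;
    b_e is the bandwidth of the inter-region channel (B^E)."""
    a = n_op / n_b
    a_star = d_th / b_i
    return 2.0 * d_th / (1.0 + a_star / a + 2.0 * (d_th / b_e) * (n_x / n_op))
```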

Two Regions Speed-up

Balance factors within a region (x) and between the regions (y):

  x = a/a* = x*/n
  y = (#op/#x) * B^E/(2 * d_th) = (1/2) * (x*/n) * (#b/#x) * (B^E/b^I) = y*/n

The interconnection is a shared medium with constant aggregate bandwidth B^E and an effective load factor p(n):

  b^E = B^E / p(n)

This gives for the overall speed-up:

  Sp2(n) = (x* + y* + x* * y*) / (p(n) * x* + y* + x* * y* / n)
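The overall speed-up as code; the contention model p(n) is passed as a callable so different assumptions can be tried (a sketch; names are mine):

```python
def speedup_two_regions(n: int, x_star: float, y_star: float,
                        p=lambda n: 1.0) -> float:
    """Overall speed-up of two identical interconnected regions:
    Sp2(n) = (x* + y* + x*y*) / (p(n)*x* + y* + x*y*/n),
    where x* is the balance factor within a region, y* the balance factor
    between the regions, and p(n) the contention factor of the shared
    inter-region link (effective bandwidth b^E = B^E/p(n))."""
    return (x_star + y_star + x_star * y_star) / (
        p(n) * x_star + y_star + x_star * y_star / n)
```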

Two Regions Speed-up

[Plot: speed-up Sp2 for different values of x*.]

Two Regions Speed-up

Focus on the application and the interconnection bandwidth:

  z* = 2 * y* / x* = r * (B^E/b^I)   with   r = #b/#x

z* is the ratio of the balance factor between regions to the balance factor within a region (between cores) and should be as large as possible.

The overall speed-up can be rewritten as:

  Sp2(n) = (2 + (1 + x*) * z*) / (2 * p(n) + (1 + x*/n) * z*) ≈ x* * z* / (2 * p(n) + x* * z* / n)   (for large x*)
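The rewritten form can be checked numerically against the (x*, y*) form (a quick consistency check under assumed values x* = 100, y* = 25 and p(n) = n/20; names are mine):

```python
def speedup_two_regions_z(n: int, x_star: float, z_star: float,
                          p=lambda n: 1.0) -> float:
    """Sp2 rewritten in terms of z* = 2*y*/x*:
    Sp2(n) = (2 + (1 + x*)*z*) / (2*p(n) + (1 + x*/n)*z*)."""
    return (2.0 + (1.0 + x_star) * z_star) / (
        2.0 * p(n) + (1.0 + x_star / n) * z_star)

# Consistency check: both formulations give the same value.
x_s, y_s = 100.0, 25.0          # assumed balance factors, z* = 2*y*/x* = 0.5
p = lambda n: n / 20.0          # assumed contention model
for n in (16, 256, 1024):
    sp2_xy = (x_s + y_s + x_s * y_s) / (p(n) * x_s + y_s + x_s * y_s / n)
    sp2_z = speedup_two_regions_z(n, x_s, 2.0 * y_s / x_s, p=p)
    assert abs(sp2_xy - sp2_z) < 1e-9
```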

Two Regions Speed-up

[Plot: speed-up Sp2 for x* = 100 with increasing bandwidth b^E (and consequently z*) and an assumed p(n) = n/20.]

Two Regions Maximum Speed-up

[Plot: value of the maximum speed-up of Sp2 for linear p(n) = αn, as a function of the bandwidth ratio z*.]

Application to bwgrid

Performance characteristics:

                                            bwgrid 1         bwgrid 2
  Bandwidth between two nodes (b^I)         1.5 GByte/sec    6 GByte/sec
  Bandwidth between two regions (B^E)       1.0 GByte/sec    15 GByte/sec
  Performance of a single core (l_th)       8.5 GFlop/sec    10 GFlop/sec

Reference bandwidth: b^I_0 = 1.0 GByte/sec

Application = LinPack with problem sizes n_p = 10000, 20000, 30000, 40000:

  #op ≈ (2/3) * n_p^3   and   #b ≈ 2 * n_p^2 * w
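With these inputs the per-core balance factor for LinPack can be estimated directly. A sketch (w is the word size in bytes, assumed here to be 8 for double precision; the default hardware numbers are the bwgrid1 values from the table; the function name is mine):

```python
def hpl_balance_factor(n_p: int, l_th: float = 8.5, b_i: float = 1.5,
                       b_i0: float = 1.0, w: int = 8):
    """Estimate x* and z for HPL with problem size n_p:
    #op ~ (2/3)*n_p^3 flop, #b ~ 2*n_p^2*w bytes,
    x* = (#op/#b) * (b^I_0/l_th), z = b^I/b^I_0."""
    n_op = 2.0 / 3.0 * n_p ** 3
    n_b = 2.0 * n_p ** 2 * w
    x_star = (n_op / n_b) * (b_i0 / l_th)
    z = b_i / b_i0
    return x_star, z

print(hpl_balance_factor(40000))  # roughly (196, 1.5) for bwgrid1
```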

bwgrid Single Region

[Plot: speed-up comparison of measurements (HPL 1.0a, local) and model for one region; measured and modelled curves for n_p = 10000, 20000, 30000, 40000; speed-up (0-300) vs. number of processors p (0-2000).]

bwgrid Two Regions

[Plot: speed-up comparison of measurements (HPL 1.0a, MA-HD) and model for two regions with an estimated bandwidth contention of p(n) = n/20; measured and modelled curves for n_p = 10000, 20000, 30000, 40000; speed-up (0-100) vs. number of processors p (0-2000).]

bwgrid Two Regions

[Plot: speed-up of bwgrid1 for two regions and varying bandwidth B^E.]

bwgrid Speed-up Prediction

[Plot: speed-up in bwgrid1 and bwgrid2 for one and two regions with n_p = 40000.]

Conclusions

- The performance model is based on the roofline model.
- Throughput and speed-up are described by 2-3 scaling parameters which depend on important hardware and software characteristics.
- The model reproduces the LinPack measurements for one and two regions (bwgrid1).
- The model predicts the performance of the next-generation system (bwgrid2).
- Upper bounds for region sizes are derived by analyzing the maximum speed-up.
- Lower bounds for region sizes are derived by analyzing the n_1/2 values (see paper).

Next steps:
- A more detailed model for the communication within a region
- Investigation of other applications