Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Vincent Heuveline
URZ Computing Centre of Heidelberg University
EMCL Engineering Mathematics and Computing Lab
2015-09-14

Energy Efficient High Performance Computing
1. Data centre point of view: HPC installation at URZ Computing Centre of Heidelberg University
2. Numerics point of view: current activities at EMCL Engineering Mathematics and Computing Lab at the Interdisciplinary Centre for Scientific Computing (IWR)

Energy Efficient HPC
1. Data centre point of view: HPC installation at URZ

Energy consumption in data centres
- Energy consumption for IT and cooling
  - Servers and network components
  - Chillers, computer room air conditioners: cold air, cold water
  - Power supplies including uninterruptible power supplies (UPS)
  - Fans
- Other energy consumption
  - Heating
  - Air conditioning for offices
  - Warm water
  - Other building equipment, e.g. light
[Photo: emergency power generator at URZ]

Energy consumption in data centres
- Measure for efficiency of energy usage: Power Usage Effectiveness (PUE), introduced by The Green Grid in 2007; PUE = total facility energy / IT equipment energy
- Ideal: PUE = 1.0
- Usual: PUE ~ 1.5 (good) ... 2.5 (bad)
- SuperMUC @ Leibniz Computing Centre, Munich: PUE < 1.1

Special situation in HPC centres
- High packing and energy density
  - 30 kW per rack already operational
  - Envisage 100 kW and more per rack
- High load on servers
  - Want to operate at full capacity
- Often homogeneous infrastructure
  - Dedicated cooling circuit for one machine reasonable
- Reuse of waste heat possible, but not constantly available
  - Unplanned (crash) or planned (maintenance) shutdowns reduce availability

HPC installation at URZ
- HPC cluster for research in Baden-Württemberg
- Joint project of Heidelberg and Mannheim Universities
- bwForCluster MLS & WISO: Molecular Life Science, Economic and Social Science, Scientific Computing

HPC installation at URZ
- total of ~500 compute nodes based on Intel Haswell
- total of ~9,000 CPU cores
- includes co-processor nodes based on Intel Xeon Phi
- IT power ~200 kW
[Photo: server racks in computer room at URZ]

HPC installation at URZ
- example for 200 kW: 200 kW * 24 h/day * 365 day/year * 20 Ct/kWh ≈ 350,000 EUR/year
- traditional cold air cooling: PUE ≈ 2, i.e. an additional 0.35 million EUR/year only for cooling!
- URZ uses passive cooling technology
  - outdoor environmental temperature used to cool IT
  - additional cooling by evaporation on hot summer days
- goal: achieve PUE ~ 1.1
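
The cost estimate above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a constant IT load all year and treats everything above PUE = 1 as overhead (cooling and other non-IT consumption). The function name and its parameters are illustrative and not taken from the slides.

```python
HOURS_PER_YEAR = 24 * 365

def annual_cost_eur(it_power_kw, pue, price_eur_per_kwh):
    """Return (IT cost, overhead cost) per year in EUR for a constant IT load.

    Assumes PUE = total facility energy / IT equipment energy, so the
    overhead energy is (PUE - 1) times the IT energy.
    """
    it_energy_kwh = it_power_kw * HOURS_PER_YEAR
    it_cost = it_energy_kwh * price_eur_per_kwh
    overhead_cost = (pue - 1.0) * it_cost
    return it_cost, overhead_cost

# 200 kW IT power at 20 Ct/kWh, for traditional air cooling and the URZ goal.
for pue in (2.0, 1.1):
    it_cost, overhead = annual_cost_eur(200.0, pue, 0.20)
    print(f"PUE {pue}: IT {it_cost:,.0f} EUR/year, overhead {overhead:,.0f} EUR/year")
```

Under these assumptions, moving from PUE 2 to PUE 1.1 reduces the overhead from about 350,000 EUR/year to about 35,000 EUR/year.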

HPC installation at URZ
[Figure: cooling circuit linking URZ and the URZ annex, with wetting, heat exchanger, pump, and heat exchangers (HE) at the IT racks]

HPC installation at URZ
- fans drive air through racks and heat exchangers at back doors
- temperature of air coming out of the heat exchangers is approx. 1.5 degrees higher than runback water temperature
[Figure: cooling circuit diagram with wetting, heat exchanger, pump, cold water forward, warm water return, heat exchangers (HE) in back doors of racks, air in computer room]

HPC installation at URZ
- runback water temperature must be controlled to ensure cooling of IT
- pump controls heat absorption in rack heat exchangers
[Figure: cooling circuit diagram]

HPC installation at URZ
[Photo: heat exchanger and wetting on roof of URZ]
[Figure: cooling circuit diagram]

HPC installation at URZ
- forward water temperature is determined by
  - environmental temperature for dry cooling
  - wet bulb temperature with wetting (adiabatic cooling)
[Figure: cooling circuit diagram]

Energy Efficient HPC
2. Numerics point of view: current activities at EMCL

Numerical Simulation
- Model: partial differential equations
- Discretisation: algebraic system (linear, nonlinear)
- Iterative methods
  - Krylov subspace: Conjugate Gradient (CG), Generalized Minimal Residual (GMRES)
  - Multilevel: algebraic multigrid, geometric multigrid

Fluid Flow Benchmark: Scalability Study
[Figure: parallel efficiency and number of iterations (# iter) of preconditioned GMRES]

Fluid Flow Benchmark: Scalability Study
- poor parallel efficiency due to higher number of iterations
- preconditioner is weaker for higher number of MPI processes
[Figure: parallel efficiency and number of iterations (# iter) of preconditioned GMRES]

Domain Decomposition: Parallelisation
- preconditioner acts only on diagonal blocks
- more processes imply smaller blocks
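
The effect of smaller diagonal blocks can be illustrated on a toy problem. The sketch below is not the fluid flow benchmark from the slides: it assumes a 1D Laplacian as the model matrix and swaps in preconditioned CG for the preconditioned GMRES used in the talk, since the model matrix is symmetric positive definite. All function names are illustrative. Each "process" owns one diagonal block, and the block-Jacobi preconditioner inverts only those blocks.

```python
import numpy as np

def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, a standard SPD model problem."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def block_jacobi(A, num_blocks):
    """Preconditioner that inverts only the diagonal blocks of A."""
    bounds = np.linspace(0, A.shape[0], num_blocks + 1).astype(int)
    blocks = [(s, e, np.linalg.inv(A[s:e, s:e]))
              for s, e in zip(bounds[:-1], bounds[1:])]

    def apply(r):
        z = np.empty_like(r)
        for s, e, inv in blocks:
            z[s:e] = inv @ r[s:e]
        return z
    return apply

def pcg_iterations(A, b, precond, tol=1e-8, max_iter=1000):
    """Preconditioned CG; returns the number of iterations until convergence."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return k
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return max_iter

A = laplacian_1d(64)
b = np.random.default_rng(0).standard_normal(64)
iterations = {nb: pcg_iterations(A, b, block_jacobi(A, nb)) for nb in (1, 2, 4, 8, 16)}
for nb, its in iterations.items():
    print(f"{nb:2d} blocks: {its} iterations")
```

With a single block the preconditioner is the exact inverse. As the number of blocks grows, more coupling between unknowns is ignored and the iteration count rises, which mirrors the weaker preconditioner observed for higher numbers of MPI processes.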

Bottleneck in Numerical Algorithms
- Theoretical system performance hard to achieve
- Challenge to transfer computing power into simulation performance
- Parallelization can affect mathematical properties of numerical algorithms
- Scalability may suffer from synchronizations
- No usage of energy-saving techniques

Iterative Refinement and Multilevel Methods
- Iterative refinement. Repeat:
  1. Compute residual
  2. Error correction equation
  3. Update
- Multilevel methods: solve error correction equation only in a subspace
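
The three steps of iterative refinement can be sketched as follows. The formulas on the slide are not preserved in this transcript, so this sketch assumes the usual formulation: residual r = b - A x, error correction equation A e = r solved only approximately, update x = x + e. The approximate solver used here (the lower triangular part of A, i.e. a Gauss-Seidel step) is an illustrative choice, not the method from the talk.

```python
import numpy as np

def iterative_refinement(A, b, approx_solve, tol=1e-10, max_iter=500):
    """Repeat residual / correction / update until the residual is small."""
    x = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ x                      # 1. compute residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        e = approx_solve(r)                # 2. error correction equation A e = r, solved approximately
        x = x + e                          # 3. update
    return x, max_iter

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
lower = np.tril(A)                         # assumed approximate operator (Gauss-Seidel)
x, iters = iterative_refinement(A, b, lambda r: np.linalg.solve(lower, r))
print(x, iters)
```

A multilevel method fits the same loop: `approx_solve` would then solve the error correction equation only in a coarse subspace.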

Geometric Multigrid: finite element discretisation levels
[Figure: discretisation levels Q2, 4 x Q1 and Q1, connected by p-refinement (p-ref) and h-refinement (h-ref)]

Geometric Multigrid: Grid Transfer
[Figure: compute residual, restriction, error correction equation, prolongation, update]

Geometric Multigrid: Restriction
[Figure: compute residual, restriction, restriction matrix]

Geometric Multigrid: Prolongation
[Figure: prolongation, update; prolongation matrix from linear interpolation]

High Oscillations
- linear system, error equation: correction?
- linear iterative schemes: Jacobi, Gauss-Seidel
- application of the scheme to the current approximation has a smoothing action on the unknown error

Smoothing
- smoother damps high error oscillations
- approximate on coarser grid
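
The smoothing property can be checked numerically. The sketch below assumes a 1D Laplacian model problem and damped Jacobi with damping factor 2/3 (both illustrative choices). It applies the error propagation of the smoother to one low-frequency and one high-frequency error mode.

```python
import numpy as np

n = 63
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian (assumed model problem)
omega = 2.0 / 3.0                                        # assumed damping factor
grid = np.arange(1, n + 1) / (n + 1)

def smooth_error(e, sweeps):
    """Error propagation of damped Jacobi: e <- (I - omega * D^-1 * A) e."""
    for _ in range(sweeps):
        e = e - omega * (A @ e) / np.diag(A)
    return e

low = np.sin(1 * np.pi * grid)     # smooth error mode
high = np.sin(48 * np.pi * grid)   # oscillatory error mode

low_ratio = np.linalg.norm(smooth_error(low, 5)) / np.linalg.norm(low)
high_ratio = np.linalg.norm(smooth_error(high, 5)) / np.linalg.norm(high)
print(f"low-frequency error left after 5 sweeps:  {low_ratio:.4f}")
print(f"high-frequency error left after 5 sweeps: {high_ratio:.2e}")
```

The oscillatory mode is almost eliminated after a few sweeps while the smooth mode is barely reduced, which is why the remaining smooth error is approximated on a coarser grid.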

Coarse Grid Correction
[Figure: coarse grid correction with numbered steps 1 to 7, from pre-smoothing to post-smoothing]
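
A two-grid version of the coarse grid correction can be sketched as follows. The sketch assumes a 1D Poisson problem, damped Jacobi smoothing, linear interpolation for prolongation, full weighting for restriction and a Galerkin coarse operator. These are common textbook choices and are not confirmed as the setup used in the talk. A full multigrid cycle would replace the direct coarse solve by a recursive call.

```python
import numpy as np

def laplacian_1d(n):
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation(nc):
    """Linear interpolation from nc coarse points to 2*nc + 1 fine points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        i = 2 * j + 1              # fine index of coarse node j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def jacobi(A, x, b, sweeps, omega=2.0 / 3.0):
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def two_grid_cycle(A, P, R, A_c, x, b, nu=2):
    x = jacobi(A, x, b, nu)              # pre-smoothing
    r = b - A @ x                        # compute residual
    r_c = R @ r                          # restriction
    e_c = np.linalg.solve(A_c, r_c)      # error correction equation on the coarse grid
    x = x + P @ e_c                      # prolongation and update
    return jacobi(A, x, b, nu)           # post-smoothing

nc = 31
A = laplacian_1d(2 * nc + 1)
P = prolongation(nc)
R = 0.5 * P.T                            # full weighting restriction
A_c = R @ A @ P                          # Galerkin coarse grid operator
b = np.ones(2 * nc + 1)

x = np.zeros_like(b)
r0 = np.linalg.norm(b - A @ x)
for _ in range(10):
    x = two_grid_cycle(A, P, R, A_c, x, b)
reduction = np.linalg.norm(b - A @ x) / r0
print(f"residual reduction after 10 cycles: {reduction:.2e}")
```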

Multigrid Cycle
[Figure: multigrid cycle; source: hiflow3.org]

Energy Efficient HPC
2. Numerics point of view: current activities at EMCL
2.1 Modifying the smoother to avoid synchronization for exploiting massively parallel hardware like GPUs

Smoother: Jacobi and Asynchronous Schemes
- Matrix splitting (L, U)
- Jacobi iteration
  - component-wise
  - strict synchronisation
- Basic asynchronous scheme
  - allow older values through shift function
  - allow arbitrary order of component updates
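
The difference between the two schemes can be simulated serially. The slide formulas are not preserved in this transcript, so the sketch below assumes the standard component-wise Jacobi update x_i = (b_i - sum over j != i of a_ij x_j) / a_ii. In the asynchronous variant, components are updated in random order and each value read may be older by a random shift of up to `max_delay` steps. The random shifts, the test matrix and all names are illustrative assumptions.

```python
import numpy as np
from collections import deque

def async_jacobi(A, b, num_updates, max_delay, seed=0):
    """Simulate asynchronous Jacobi with stale reads and arbitrary update order."""
    rng = np.random.default_rng(seed)
    n = len(b)
    # Keep the most recent iterates so that updates can read older values.
    history = deque([np.zeros(n)], maxlen=max_delay + 1)
    for _ in range(num_updates):
        i = int(rng.integers(n))                      # arbitrary component order
        oldest = len(history) - 1
        shifts = rng.integers(0, oldest + 1, size=n)  # shift function: age of each value read
        stale = np.array([history[-1 - int(s)][j] for j, s in enumerate(shifts)])
        x_new = history[-1].copy()
        off_diagonal = A[i] @ stale - A[i, i] * stale[i]
        x_new[i] = (b[i] - off_diagonal) / A[i, i]
        history.append(x_new)
    return history[-1]

n = 20
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # strictly diagonally dominant
b = np.ones(n)
x_exact = np.linalg.solve(A, b)
x_async = async_jacobi(A, b, num_updates=8000, max_delay=5)
error = np.max(np.abs(x_async - x_exact))
print(f"max error of asynchronous iteration: {error:.2e}")
```

The test matrix is strictly diagonally dominant, a case in which asynchronous iteration with bounded delays is known to converge. No claim is made here about convergence for general matrices.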

Block-Asynchronous Iteration on GPU
SMX = streaming multiprocessor

Tesla K40
- SMX: 15
- cores per SMX: 192
- max. thread blocks per SMX: 16
- max. threads per thread block: 1024

[Figure: thread block, block update]

Hybrid Parallel Implementation: MPI, OpenMP & CUDA
- from Kepler on: Hyper-Q (32 concurrent streams)
- must use Multi-Process Service to share one GPU
[Figure: source hiflow3.org]

Hybrid Parallel Implementation: multi-GPU
- cannot use Multi-Process Service
[Figure: source hiflow3.org]

Energy Measurement: pmlib
- source code instrumented with pmlib API calls
- application becomes pmlib client, starts and stops measurements
- extra tracing server controls power meter and collects measurement data
- no perturbation of system under investigation
[Photo: ZES LMG450 external power meter]
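
Once a power trace between the start and stop of a measurement is available, the energy to solution is the time integral of the power. The sketch below does not use the pmlib API. It only assumes that the trace arrives as timestamped power samples, and the sample values are made up for illustration.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate a sampled power trace with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

# Hypothetical samples: time in seconds, power in watts.
timestamps = [0.0, 0.5, 1.0, 1.5]
power = [700.0, 760.0, 740.0, 720.0]
energy = energy_joules(timestamps, power)
print(f"energy to solution: {energy:.1f} J over {timestamps[-1] - timestamps[0]:.1f} s")
```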

Geometric Multigrid vs. Conjugate Gradient
Showcase to demonstrate the power of multigrid:
- tests up to 16 CPUs and 1 GPU
- different variants of multigrid outperform standard CG solver

Results: 1 to 32 MPI processes, up to 2 GPUs
[Figure: time to solution [sec] and energy to solution [J]]
- optimal time to solution with p = 32 MPI processes, no GPU: t = 1.469 sec, E = 1,106.315 J; saves 14% time over best energy to solution
- optimal energy to solution with p = 16 MPI processes, 2 GPUs: t = 1.680 sec, E = 995.810 J; saves 10% energy over best time to solution
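
The trade-off between the two optimal configurations can be recomputed from the reported numbers. In this sketch the average power is derived as E / t, a quantity that is not stated on the slide. The percentages are computed with an explicit base, since the slide does not say which base was used.

```python
# Reported values: (time to solution in s, energy to solution in J).
t_fast, e_fast = 1.469, 1106.315      # p = 32 MPI processes, no GPU
t_green, e_green = 1.680, 995.810     # p = 16 MPI processes, 2 GPUs

extra_time = t_green / t_fast - 1.0       # energy-optimal run relative to time-optimal run
energy_saving = 1.0 - e_green / e_fast    # energy saved relative to time-optimal run
avg_power_fast = e_fast / t_fast          # derived average power in W
avg_power_green = e_green / t_green

print(f"energy-optimal run takes {extra_time:.1%} longer")
print(f"energy-optimal run saves {energy_saving:.1%} energy")
print(f"average power: {avg_power_fast:.0f} W vs {avg_power_green:.0f} W")
```

Computed this way, the energy-optimal run takes about 14% longer and uses about 10% less energy, in line with the figures quoted above. It also draws noticeably less average power.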

Energy Efficient HPC
2. Numerics point of view: current activities at EMCL
2.2 Adjusting hardware activity according to different problem sizes in the grid levels

Individual Decomposition on Grid Levels
- Smaller problem size on coarser levels: factor between levels ~ 2^d in d dimensions
- Observation: scaling limited by smallest problem size
- Idea: use fewer processes on coarser levels
- Hope:
  - overall multigrid performance not impaired for large number of MPI processes
  - energy savings possible due to partial deactivation of hardware
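
One simple way to pick the number of processes per grid level is to require a minimum number of unknowns per process. The sketch below is hypothetical and does not reproduce the adapted distributions used in the talk. It assumes that each coarsening step shrinks the problem by a factor 2^dim (uniform refinement), that the highest level index is the finest grid, and that the threshold `min_unknowns_per_proc` is a tunable parameter.

```python
def processes_per_level(finest_unknowns, num_levels, max_procs,
                        min_unknowns_per_proc, dim=2):
    """Return a list of (level, unknowns, processes), finest level first."""
    plan = []
    unknowns = finest_unknowns
    for level in range(num_levels, 0, -1):
        # Use as many processes as the problem size supports, at least one.
        procs = max(1, min(max_procs, unknowns // min_unknowns_per_proc))
        plan.append((level, unknowns, procs))
        unknowns = max(1, unknowns // 2**dim)   # assumed coarsening factor
    return plan

plan = processes_per_level(finest_unknowns=1024 * 1024, num_levels=5,
                           max_procs=32, min_unknowns_per_proc=10_000)
for level, unknowns, procs in plan:
    print(f"level {level}: {unknowns:8d} unknowns on {procs:2d} processes")
```

Processes that are idle on the coarse levels are the candidates for temporary deactivation.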

Results: Time and Energy to Solution
[Table: # MPI processes per grid level, levels 1 to 5, for the variants default, adapted (A) and adapted (B)]
- optimal time to solution: adapted (A) with p = 32: t = 0.148 sec, E = 88.900 J; saves 13% time over best default
- optimal energy to solution: adapted (A) with p = 16: t = 0.154 sec, E = 77.194 J; saves 14% energy over best default

C-states and Intel RAPL
- C-states
  - CPU energy / activity states
  - C0 = active, C1 = halt, etc.
  - several further states, depending on architecture
  - works by cutting clock signals, and reducing CPU voltage for deeper states
- Intel Running Average Power Limit (RAPL)
  - development of RAPL driven by thermal design power (TDP): guarantees proper operation of CPU if chassis and cooling system obey the TDP
  - provides energy consumption information based on performance counters and power models
  - can be used as power indicator
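
On Linux, RAPL energy counters are commonly exposed through the powercap sysfs interface. The sketch below is not the measurement setup from the talk. It assumes that the package 0 counter is available at the path shown (this depends on the system and usually needs read permission) and that the counter wraps around at most once between two readings.

```python
from pathlib import Path

# Assumed location of the package 0 RAPL domain; system dependent.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")

def read_counter(name):
    """Read an integer sysfs value, or return None if it is unavailable."""
    try:
        return int((RAPL_DIR / name).read_text())
    except (OSError, ValueError):
        return None

def energy_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy between two counter readings in microjoules, handling one wraparound."""
    delta = end_uj - start_uj
    if delta < 0:
        delta += max_range_uj
    return delta * 1e-6

def measure(workload):
    """Run workload() and return its package energy in joules, or None."""
    max_range = read_counter("max_energy_range_uj")
    start = read_counter("energy_uj")
    workload()
    end = read_counter("energy_uj")
    if None in (max_range, start, end):
        return None
    return energy_delta_joules(start, end, max_range)

print(measure(lambda: sum(i * i for i in range(100_000))))
```

On machines without the interface, `measure` returns `None` instead of failing.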

Results: C-states and Intel RAPL
- energy savings due to temporary CPU deactivation
[Figure: phases where the multigrid cycle is on the coarse grid and where it is on the fine grid]