READEX: A Tool Suite for Dynamic Energy Tuning. Michael Gerndt, Technische Universität München


Campus Garching

SuperMUC: 3 Petaflops, 3 MW

READEX: Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing. 09/2015 to 08/2018, www.readex.eu

Objectives: tuning for energy efficiency. Go beyond static tuning and exploit the dynamism in application characteristics. Leverage system-scenario-based tuning, bringing together automatic tuning from HPC and the system-scenario methodology from embedded systems.

System-Scenario-based Methodology

Periscope Tuning Framework (PTF): automatic application analysis & tuning. Tunes performance and energy (statically). Plug-in-based architecture; evaluates alternatives online; scalable and distributed framework. Supports a variety of parallel paradigms: MPI, OpenMP, OpenCL, parallel patterns. Developed in the AutoTune EU-FP7 project.

Score-P: Scalable Performance Measurement Infrastructure for Parallel Codes. Common instrumentation and measurement infrastructure.

Tuning Plugin Interface: [architecture figure: a tuning plugin interacts with the Periscope frontend and the application with its monitor; search-space exploration inside tuning steps drives scenario execution, tuning actions, and measurement requests]

Tuning Plugins: MPI parameters (eager limit, buffer space, collective algorithms), applied via application restart or the MPI_T tools interface. DVFS: frequency tuning for the energy-delay product, with model-based prediction of the frequency and region-level tuning. Parallelism capping: thread-number tuning for the energy-delay product, with exhaustive and curve-fitting-based prediction.

Dynamic Tuning with the READEX Tool Suite: READEX extends the concept of tuning in Periscope to dynamic tuning. Instead of one optimal configuration, SWITCH between different best configurations, dynamically adapting to changing program characteristics.
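The switching idea can be shown with a minimal C++ sketch. This is illustrative only, not the RRL implementation; the region names, configuration values, and the apply() helper are invented for the example:

  #include <map>
  #include <string>
  #include <omp.h>

  struct Configuration {
      int core_freq_mhz;    // core frequency in MHz
      int uncore_freq_mhz;  // uncore frequency in MHz
      int num_threads;      // OpenMP thread count
  };

  // Hypothetical tuning model: best configuration per significant region.
  std::map<std::string, Configuration> tuning_model = {
      {"assemble", {2300, 1600, 24}},
      {"solve",    {1700, 2200,  8}},
  };

  void apply(const Configuration& c) {
      omp_set_num_threads(c.num_threads);
      // Switching core/uncore frequencies needs platform support
      // (e.g. cpufreq or x86_adapt); omitted in this sketch.
  }

  // Called on entry of an instrumented region: switch to its best setting.
  void on_region_enter(const std::string& region) {
      auto it = tuning_model.find(region);
      if (it != tuning_model.end())
          apply(it->second);
  }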

Scenario-Based Tuning: Design-Time Analysis with the Periscope Tuning Framework (PTF) produces a Tuning Model, which the READEX Runtime Library (RRL) uses for Runtime Tuning.

Intra-phase Dynamism: [figure: a phase region contains significant regions; the runtime situations within one phase run best at different settings, e.g. FREQ=2 GHz versus FREQ=1.5 GHz]

READEX Intra-phase Tuning Plugin: a tuning plugin supporting core and uncore frequencies, the number of threads, and application tuning parameters. The search space is configurable via the READEX configuration file. Several objective functions: energy, CPU energy, EDP, EDP2, time. Several search strategies: exhaustive, individual, random, genetic. Approach: 1. Experiment with the default configuration. 2. Experiments for selected configurations: a configuration set for the phase region, with energy and time measured for all runtime situations. 3. Identification of the static best configuration for the phase and of rts-specific best configurations.
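Step 3 can be illustrated with a short C++ sketch. Assumptions not taken from the slides: measurements are already aggregated per configuration and runtime situation (rts), EDP serves as the objective, and all names are invented rather than PTF code:

  #include <limits>
  #include <map>
  #include <string>

  struct Measurement { double energy_j; double time_s; };

  // Energy-delay product, one of the selectable objectives.
  double edp(const Measurement& m) { return m.energy_j * m.time_s; }

  // results[config][rts] = measurement for that rts under that configuration
  using Results = std::map<int, std::map<std::string, Measurement>>;

  // Static best: the one configuration minimizing the objective for the phase.
  int static_best(const Results& results) {
      int best = -1;
      double best_obj = std::numeric_limits<double>::max();
      for (const auto& [config, per_rts] : results) {
          double obj = 0.0;
          for (const auto& [rts, m] : per_rts) obj += edp(m);
          if (obj < best_obj) { best_obj = obj; best = config; }
      }
      return best;
  }

  // rts-specific best: the best configuration for each runtime situation.
  std::map<std::string, int> rts_best(const Results& results) {
      std::map<std::string, int> best;
      std::map<std::string, double> best_obj;
      for (const auto& [config, per_rts] : results)
          for (const auto& [rts, m] : per_rts) {
              double obj = edp(m);
              if (!best.count(rts) || obj < best_obj[rts]) {
                  best[rts] = config;
                  best_obj[rts] = obj;
              }
          }
      return best;
  }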

Pre-Computation of the Tuning Model: [architecture figure: the Periscope Tuning Framework (READEX tuning plugin, search algorithms, analysis, plugin control, experiments engine, performance database) is connected via the Score-P Online Access Interface to the READEX Runtime Library (RTS management, DTA management, DTA process management, RTS database, scenario identification); Score-P contributes instrumentation, the substrate plugin interface, and the metric plugin interface with energy measurements (HDEEM); the output is the Application Tuning Model]

READEX Runtime Library (RRL): runtime application tuning is performed by the READEX Runtime Library. Tuning requests during Design-Time Analysis are sent to the RRL. It is a lightweight library for dynamic switching between different configurations at runtime, implemented as a substrate plugin of Score-P. Developed by TUD and NTNU.

Runtime Scenario Detection and Switching Decision during a Production Run: during runtime application tuning, the RRL classifies scenarios, a switching decision component decides when to switch, and the tuning parameters are manipulated accordingly.

BEM4I Dynamic Switching (http://bem4i.it4i.cz/): blade summary, energy [J]

  Region                         assemble_k  assemble_v  gmres_solve  print_vtu   main
  default settings                     1467        1484         2733       1142   6872
  static tuning only                   1876        1926         1306        402   5537
  dynamic tuning only                  1348        1335         1150        268   4138
  static + dynamic tuning              1343        1322         1161        265   4125
  static savings [%]                  -27.9       -29.8         52.2       64.8   19.4
  dynamic savings [%]                   8.4        10.9         57.5       76.8   40.0
  static + dynamic savings [%]          8.1        10.0         57.9       76.5   39.8

Excerpt of the corresponding tuning model:

  "static": {
      "FREQUENCY": "25",           <- 2.5 GHz
      "NUM_THREADS": "12",         <- 12 OpenMP threads
      "UNCORE_FREQUENCY": "22"     <- 2.2 GHz
  },
  "assemble_k":  { "FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16" },
  "assemble_v":  { "FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14" },
  "gmres_solve": { "FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22" },
  "print_vtu":   { "FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24" }

Scalability Tests: OpenFOAM analysis, simpleFoam strong-scaling test on the motorbike example; the optimum was detected for every run. Savings: static 11.7%, dynamic 4.4%, total 15.5%. The dynamic savings increase with higher numbers of nodes, up to the point where the application does not scale anymore.

Inter-phase Dynamism: [figure: all-to-all performance over 2048 phases of the PEPC benchmark from the DEISA Benchmark Suite]

Inter-Phase Analysis: the behavior varies among phases, so group/cluster the phases and select a best configuration for each cluster of phases. What do we need? Identifiers of phase characteristics (phase identifiers), provided by an application expert (??)

Inter-Phase Analysis Approach: we developed the interphase_tuning plugin with three tuning steps. Analysis step: a random search strategy creates the search space, since we do not want to explore the whole tuning space; the phases are clustered and the best configuration is found for each cluster. Default step: run the application with the default settings. Verification step: select for each phase the best configuration determined for its cluster, and aggregate the savings over the phases.
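The slides do not name the clustering algorithm, but the noise points in the plots below suggest a density-based method such as DBSCAN. The following generic DBSCAN sketch over per-phase feature vectors is an assumption made for illustration; the features themselves (e.g. derived from the PAPI counters mentioned later) are likewise assumed:

  #include <cmath>
  #include <cstddef>
  #include <vector>

  using Point = std::vector<double>;  // feature vector of one phase

  static double dist(const Point& a, const Point& b) {
      double s = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i)
          s += (a[i] - b[i]) * (a[i] - b[i]);
      return std::sqrt(s);
  }

  // Minimal DBSCAN: returns a cluster id per phase; -1 marks noise points
  // (the points drawn in red in the cluster plots).
  std::vector<int> dbscan(const std::vector<Point>& pts, double eps,
                          std::size_t min_pts) {
      const int UNVISITED = -2, NOISE = -1;
      std::vector<int> label(pts.size(), UNVISITED);
      auto neighbors = [&](std::size_t p) {
          std::vector<std::size_t> n;
          for (std::size_t q = 0; q < pts.size(); ++q)
              if (dist(pts[p], pts[q]) <= eps) n.push_back(q);
          return n;
      };
      int cluster = 0;
      for (std::size_t p = 0; p < pts.size(); ++p) {
          if (label[p] != UNVISITED) continue;
          auto seeds = neighbors(p);
          if (seeds.size() < min_pts) { label[p] = NOISE; continue; }
          label[p] = cluster;
          for (std::size_t i = 0; i < seeds.size(); ++i) {  // expand cluster
              std::size_t q = seeds[i];
              if (label[q] == NOISE) label[q] = cluster;    // border point
              if (label[q] != UNVISITED) continue;
              label[q] = cluster;
              auto qn = neighbors(q);
              if (qn.size() >= min_pts)                     // core point
                  seeds.insert(seeds.end(), qn.begin(), qn.end());
          }
          ++cluster;
      }
      return label;
  }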

INDEED: [cluster plot: 3 clusters identified; noise points marked in red]

Cluster Prediction in the RRL: how do we handle phase identifiers to predict clusters? The call path of an rts now includes the cluster number. Solution: add the cluster number as a user parameter, and add PAPI events to measure L3_TCM, the total instructions, and the conditional branch instructions:

  SCOREP_OA_PHASE_BEGIN()
  SCOREP_USER_PARAMETER_INT64(cluster, predict_cluster())
  SCOREP_OA_PHASE_END()

Predict the cluster of the upcoming phase; if the cluster was mispredicted for the phase, correct it at the end of the phase.
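predict_cluster() is left unspecified on the slide. A hypothetical sketch, assuming nearest-centroid classification of the counters measured for the previous phase, with centroids taken from Design-Time Analysis:

  #include <cstddef>
  #include <cstdint>
  #include <limits>
  #include <vector>

  struct Features { double l3_tcm, total_instr, branch_instr; };

  static std::vector<Features> centroids;  // one centroid per cluster (from DTA)
  static Features last_phase;              // counters of the previous phase

  // Predict the cluster of the upcoming phase; if the prediction turns out
  // wrong, the label is corrected at the end of the phase.
  std::int64_t predict_cluster() {
      std::int64_t best = 0;
      double best_d = std::numeric_limits<double>::max();
      for (std::size_t c = 0; c < centroids.size(); ++c) {
          double d0 = last_phase.l3_tcm       - centroids[c].l3_tcm;
          double d1 = last_phase.total_instr  - centroids[c].total_instr;
          double d2 = last_phase.branch_instr - centroids[c].branch_instr;
          double d  = d0 * d0 + d1 * d1 + d2 * d2;
          if (d < best_d) { best_d = d; best = static_cast<std::int64_t>(c); }
      }
      return best;
  }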

Evaluation of the readex_interphase plugin: performed on two applications, minimd and INDEED. Experiments were conducted on the Taurus HPC system at the ZIH in Dresden. Each node contains two 12-core Intel Xeon E5-2680 v3 CPUs (Intel Haswell family), running at a default CPU frequency of 2.5 GHz and an uncore frequency of 3 GHz. Energy measurements are provided on Taurus via the HDEEM measurement hardware, which delivers processor and blade energy measurements.

minimd: a lightweight, parallel molecular dynamics simulation code written in C++. It performs molecular dynamics simulations of a Lennard-Jones or an Embedded Atom Model (EAM) system and provides an input file to specify the problem size, temperature, and timesteps. Evaluation of DTA: hybrid (MPI+OpenMP) AVX-vectorized version with a problem size of 50 for the Lennard-Jones system.

minimd (2): [cluster plot: 6 clusters identified; noise points marked in red]

INDEED: INDEED performs sheet-metal-forming simulations of tools with different geometries moving towards a stationary workpiece. Contact between tool and workpiece causes adaptive mesh refinement, an increase in the number of finite element nodes, and thus increasing computational cost. The time loop computes the solution of a system of equations until equilibrium is reached. The OpenMP version was evaluated.

INDEED (2): [cluster plot: 3 clusters identified; noise points marked in red]

Energy Savings:

  Application   phase best for the rts's [%]   rts best for the rts's [%]
  minimd                   14.51                          0.03
  INDEED                    9.24                         10.45

minimd records lower dynamic savings: it has only two significant regions, and one of them is called only once during the entire application run. INDEED achieves better static and dynamic savings for its rts's: with nine significant regions, it provides more potential for dynamism.

Application Tuning Parameters (ATP): exploit the dynamism in application characteristics through the use of different code paths (e.g. preconditioners). Identify the control variables responsible for switching the control flow; ATP provides APIs to annotate the source code.
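As a hypothetical illustration of such an annotation (atp_declare_param and atp_get_param are invented names standing in for the real API, which the slide does not show):

  // Assumed, illustrative helpers, NOT the actual READEX ATP API:
  void atp_declare_param(const char* /*name*/, int /*min*/, int /*max*/) {}
  int  atp_get_param(const char* /*name*/, int def) { return def; }  // stub

  enum Preconditioner { NONE, WEIGHT_FUNCTION, LUMPED, LIGHT_DIRICHLET, DIRICHLET };

  void pcg_solve() {
      // Expose the control variable and its admissible values to the tuner.
      atp_declare_param("preconditioner", NONE, DIRICHLET);
      // Read back the value chosen for the current configuration.
      int prec = atp_get_param("preconditioner", /*default=*/LUMPED);
      switch (prec) {
          case DIRICHLET: /* apply the Dirichlet preconditioner */ break;
          case LUMPED:    /* apply the lumped preconditioner    */ break;
          default:        /* other code paths                   */ break;
      }
  }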

Evaluation of ATP: ESPRESO, finite element (FEM) tools and a domain-decomposition-based Finite Element Tearing and Interconnect (FETI) solver. It contains a projected conjugate gradient (PCG) solver whose convergence can be improved by several preconditioners. The preconditioners were evaluated on a structural mechanics problem with 23 million unknowns, on a single compute node with 24 MPI processes (15.9 s, 4091.5 J).

  Preconditioner     # iterations   1 iteration                Solution
  None                        172   125 ms / 31.6 J            21.36 s / 5501.31 J
  Weight function             100   130+2 ms / 32.3+0.53 J     12.89 s / 3284.07 J
  Lumped                       45   130+10 ms / 32.3+3.86 J     6.32 s / 1636.11 J
  Light Dirichlet              39   130+10 ms / 32.3+3.74 J     5.46 s / 1409.82 J
  Dirichlet                    30   130+80 ms / 32.3+20.62 J    6.34 s / 1594.50 J

The per-iteration column gives the base cost plus the preconditioner overhead; the solution cost is roughly the number of iterations times the per-iteration cost (e.g. Lumped: 45 × 140 ms ≈ 6.3 s), so a stronger preconditioner pays off as long as the saved iterations outweigh its overhead.

Configuration Variable Tuning: the recently added readex_configuration tuning plugin covers application configuration parameters with a search space. It replaces the selected value in the input files and reruns the application.
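A minimal sketch of that mechanism (illustrative only; the @PARAM@ placeholder, the file names, and the helper function are assumptions, not the plugin's code):

  #include <cstddef>
  #include <cstdlib>
  #include <fstream>
  #include <sstream>
  #include <string>

  // Substitute one candidate value into an input-file template and rerun.
  void run_with_value(const std::string& template_path, const std::string& value) {
      std::ifstream in(template_path);
      std::stringstream buf;
      buf << in.rdbuf();
      std::string text = buf.str();
      const std::string key = "@PARAM@";  // assumed placeholder syntax
      for (std::size_t pos; (pos = text.find(key)) != std::string::npos; )
          text.replace(pos, key.size(), value);
      std::ofstream("input.dat") << text;  // write the concrete input file
      std::system("./app input.dat");      // rerun the application
  }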

Input Identifiers: what about different inputs? Annotate them with input identifiers such as the problem size. Apply Design-Time Analysis per input and merge the generated tuning models; the matching model is selected at runtime based on the input identifiers.
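A sketch of that runtime selection (illustrative; the layout of the merged model and all names are assumptions):

  #include <map>
  #include <string>

  struct TuningModel { /* best configurations per rts, as in the excerpt above */ };

  // Merged tuning model: one sub-model per input-identifier value.
  std::map<std::string, TuningModel> merged;

  const TuningModel* select_model(const std::string& input_id) {
      auto it = merged.find(input_id);  // e.g. input_id = "problemsize=50"
      return it != merged.end() ? &it->second : nullptr;
  }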

Summary: energy-efficiency tuning with Design-Time Analysis, a Tuning Model, and Runtime Tuning. Support for intra-phase dynamism, inter-phase dynamism, application tuning parameters, application configuration parameters, and different input configurations. Based on the Periscope Tuning Framework and Score-P monitoring.

Thank you! Questions?