Accelerators in Technical Computing: Is it Worth the Pain?

Accelerators in Technical Computing: Is it Worth the Pain? A TCO Perspective
Sandra Wienke, Dieter an Mey, Matthias S. Müller
Center for Computing and Communication, JARA High-Performance Computing, RWTH Aachen University
Rechen- und Kommunikationszentrum (RZ)

Agenda
- Introduction
- Modeling
  - Total Cost of Ownership (TCO)
  - Comparison Metrics
- Case Study on Accelerators
  - Programming Models & System Types
  - TCO Components @ RWTH
  - Real-World Application
  - Results
- Conclusion & Outlook

Introduction
- Today: variety of HPC clusters; usage of accelerators (NVIDIA GPU, Intel Xeon Phi) motivated by a promising performance-per-watt ratio
- System comparison by performance or performance per watt is not sufficient for a purchase decision
- Total cost of ownership (TCO): acquisition costs, housing, operation costs, ...; inclusion of manpower costs (administration & programming)
- Comparison of costs per program run (application-dependent)
- Investigation of a real-world software package:
  - OpenMP on Intel Sandy Bridge
  - OpenMP + LEO on Intel Xeon Phi
  - OpenCL, OpenACC on NVIDIA Fermi GPU
- Impact of manpower effort / programming model?

Modeling: Total Cost of Ownership (TCO)
- Basis: a single compute node, extrapolated to the cluster size
- Investment I = TCO(n, τ) = C_ot(n) + C_pa(n) · τ
  (n: number of nodes, τ: system lifetime)
- One-time costs C_ot
  - Per node: HW acquisition, building/infrastructure, OS/environment installation
  - Per node type: OS/environment installation, programming effort
- Annual costs C_pa
  - Per node: HW maintenance, building/infrastructure, OS/environment maintenance, power consumption
  - Per node type: OS/environment maintenance, compiler/software, application maintenance
- TCO depends on architecture & application
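To make the model concrete, here is a minimal sketch in Python; all function and variable names as well as the example figures are illustrative placeholders, not values from the talk:

```python
# Minimal sketch of the TCO model above; all names and numbers are illustrative.

def tco(n, tau,
        c_ot_node,        # one-time costs per node (HW acquisition, infrastructure, installation)
        c_ot_node_type,   # one-time costs per node type (programming effort, env. installation)
        c_pa_node,        # annual costs per node (maintenance, power, infrastructure)
        c_pa_node_type):  # annual costs per node type (compiler/software, app. maintenance)
    """Total cost of ownership of n nodes over a lifetime of tau years."""
    c_ot = n * c_ot_node + c_ot_node_type    # C_ot(n)
    c_pa = n * c_pa_node + c_pa_node_type    # C_pa(n)
    return c_ot + c_pa * tau                 # TCO(n, tau) = C_ot(n) + C_pa(n) * tau

# Example: 16 nodes over 4 years with placeholder cost figures (EUR)
investment = tco(16, 4, c_ot_node=5000, c_ot_node_type=10000,
                 c_pa_node=1500, c_pa_node_type=2000)
```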

Modeling: Comparison Metrics
- Costs per program run C_ppr: includes investment/TCO & application performance
  C_ppr(n, τ) = TCO(n, τ) / (n · n_ex(τ))   with   n_ex(τ) = k · τ / t_par(n)
  (n: number of nodes, τ: system lifetime, n_ex: number of application executions, k: system usage rate, t_par: parallel runtime)
- Baseline used for comparing a system X: Intel Sandy Bridge (SNB) + OpenMP
  (C_ppr,X(n_X, τ) - C_ppr,OMP(n_OMP, τ)) / C_ppr,OMP(n_OMP, τ)   < 0 if X beneficial, >= 0 if OpenMP beneficial
- Break-even investment: minimum budget needed so that system X is beneficial over OpenMP on SNB
  Solve for I at a given fixed lifetime τ: C_ppr,X(n_X, τ) - C_ppr,OMP(n_OMP, τ) = 0   with TCO(n, τ) = I
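A sketch of these metrics, building on the `tco` helper above; the conversion of τ from years to seconds and the simple scan over candidate budgets for the break-even point are my own assumptions, not taken from the slides:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def costs_per_program_run(tco_value, n, tau, t_par, k=0.8):
    """C_ppr(n, tau) = TCO(n, tau) / (n * n_ex(tau)),
    with n_ex(tau) = k * tau / t_par (tau in years, t_par in seconds)."""
    n_ex = k * tau * SECONDS_PER_YEAR / t_par   # application executions per node over the lifetime
    return tco_value / (n * n_ex)

def relative_cppr(cppr_x, cppr_omp):
    """Relative costs per program run; < 0 means system X beats the OpenMP/SNB baseline."""
    return (cppr_x - cppr_omp) / cppr_omp

def break_even_investment(cppr_x_for_budget, cppr_omp_for_budget, budgets):
    """Smallest investment I for which system X is at least as cheap per program run
    as the baseline; each callback maps a budget I (with TCO(n, tau) = I) to C_ppr."""
    for budget in budgets:
        if cppr_x_for_budget(budget) <= cppr_omp_for_budget(budget):
            return budget
    return None
```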

Case Study on Accelerators: Programming Models & System Types

Programming Model           | Accelerator                        | Host                                   | Compiler
Serial                      | -                                  | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenMP (simple, vectorized) | -                                  | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
LEO + OpenMP                | Intel Xeon Phi 5110P, 60 cores     | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenACC                     | NVIDIA Tesla C2050 (Fermi), ECC on | 1x Intel Westmere, 4 cores, 2.4 GHz    | PGI 12.9
OpenCL                      | NVIDIA Tesla C2050 (Fermi), ECC on | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1

Case Study on Accelerators: TCO Components @ RWTH
One-time costs:
- HW purchase: list prices from Bull
- Building/infrastructure: counted as annual costs, since the building is amortized over 25 years
- OS/environment installation: -
- Programming effort: full-time employee costs of 285.71 € a day
Annual costs:
- HW maintenance: 5% of HW purchase costs
- Building/infrastructure: 200,000 € per year; costs per node: division by 1.6 MW, multiplication by the max. power consumption of each node
- OS/environment maintenance: 4 admins, 75% cluster maintenance (~2300 nodes): 180,000 € / 2300 ≈ 78 € per node and year
- Software/compiler: -
- Power: PUE 1.5, regional electricity costs 0.15 €/kWh
- Application maintenance: - (small kernels)
Given a lifetime of 4 years & a fixed investment: compute C_ppr, #nodes, #executions (usage rate 80%)
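A worked example of the per-node annual figures quoted above; the assumed node power draw (0.4 kW) and the effort of 5 days are placeholders, and whether a utilization factor enters the power bill is left open:

```python
# Worked example of the annual per-node cost components from this slide (all in EUR).
node_power_kw = 0.4                                       # assumed max. power draw per node (placeholder)

# Building/infrastructure: 200,000 EUR/year for 1.6 MW, scaled by the node's max. power
infrastructure_per_node = 200_000 / 1600 * node_power_kw  # = 50 EUR per node and year

# OS/environment maintenance: 4 admins, 75% cluster maintenance, ~2300 nodes
admin_per_node = 180_000 / 2300                           # ~78 EUR per node and year

# Power: max. power * hours per year * PUE 1.5 * 0.15 EUR/kWh (no utilization factor applied here)
power_per_node = node_power_kw * 8760 * 1.5 * 0.15        # ~788 EUR per node and year

# Programming effort: full-time employee costs of 285.71 EUR per day (one-time, per node type)
programming_cost = 5.0 * 285.71                           # e.g. 5 days of porting effort
```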

Case Study on Accelerators: Real-World Application
- KegelSpan [2]: 3D simulation of the bevel gear cutting process (source: BMW, ZF, Klingelnberg)
- Basis: serial version
- Small kernel; kernel portion artificially increased from 25% to 90%
- Assumption: homogeneous application landscape

[2] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pp. 1381-1384, Düsseldorf, VDI Verlag, 2010.

Case Study on Accelerators: TCO Components of Application
[Bar charts: programming effort in days, runtime in s, and power consumption in W for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB), and OpenMP-simp (SNB)]

Case Study on Accelerators: Results
[Charts: costs per program run relative to OpenMP-simp (values between +3.62% and -17.15%, plotted over investments from 0 to 200 K€) and break-even investments (roughly 1,809 € to 7,787 €) for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), and OpenMP-vec (SNB)]

Conclusion
- Are accelerators beneficial? It depends.
- TCO spreadsheet [1] available for your own computations
- Our results (with 90% kernel portion) show, relative to SNB-OMP (4 years, 250 K€ investment):
  - GPU (Fermi) beneficial over the 2-socket Intel SNB server: about -17% C_ppr
  - Intel Xeon Phi results disappointing for now: about +4% C_ppr, mainly due to high acquisition costs
  - NVIDIA Kepler probably similar
- Programming effort impacts the break-even investment (see OpenACC vs. OpenCL)
- Bigger codes: increase of kernel size ~ increase of break-even investment
- Projections possible (e.g. hybrid codes)

[1] Wienke, S., an Mey, D., Müller, M.S.: Accelerators for Technical Computing: Is it Worth the Pain? TCO Spreadsheet. https://sharepoint.campus.rwth-aachen.de/units/rz/hpc/public/shared%20documents/WienkeEtAl_Accelerators-TCO-Perspective.xlsx, 2013

Outlook
- Hybrid code implementation (compare to projections)
- Model extensions:
  - New programming models & architectures (OpenMP 4.0, NVIDIA Kepler)
  - Network communication (MPI)
  - Mixed job execution (heterogeneous application landscape)
  - Assessment of decreased runtime / gaining more results
- Comprehensive TCO calculation with predictive power: performance, power consumption, manpower
- Towards exascale computing: architectures might get more complex, and thus more difficult to manage & program; the impact of manpower effort might get stronger

Thank you for your attention!