Accelerators in Technical Computing: Is it Worth the Pain?

Accelerators in Technical Computing: Is it Worth the Pain? A TCO Perspective
Sandra Wienke, Dieter an Mey, Matthias S. Müller
Center for Computing and Communication, JARA High-Performance Computing, RWTH Aachen University
Rechen- und Kommunikationszentrum (RZ)

Agenda
- Introduction
- Modeling
  - Total Cost of Ownership (TCO)
  - Comparison Metrics
- Case Study on Accelerators
  - Programming Models & System Types
  - TCO Components @ RWTH
  - Real-World Application
  - Results
- Conclusion & Outlook

Introduction
- Today: variety of HPC clusters; usage of accelerators (NVIDIA GPU, Intel Xeon Phi) motivated by a promising performance-per-watt ratio
- System comparison by performance or performance per watt is not sufficient for a purchase decision
- Total cost of ownership (TCO): acquisition costs, housing, operation costs, ...; inclusion of manpower costs (administration & programming)
- Comparison of costs per program run (application-dependent)
- Investigation of a real-world software package:
  - OpenMP on Intel Sandy Bridge
  - OpenMP + LEO on Intel Xeon Phi
  - OpenCL, OpenACC on NVIDIA Fermi GPU
- Impact of manpower effort / programming model?

Modeling: Total Cost of Ownership (TCO)
- Basis: a single compute node, extrapolated to the cluster size
- Investment I = TCO(n, τ) = C_ot(n) + C_pa(n) · τ
  (n: number of nodes, τ: system lifetime)
- One-time costs C_ot
  - Per node: HW acquisition, building/infrastructure, OS/environment installation
  - Per node type: OS/environment installation, programming effort
- Annual costs C_pa
  - Per node: HW maintenance, building/infrastructure, OS/environment maintenance, power consumption
  - Per node type: OS/environment maintenance, compiler/software, application maintenance
- TCO depends on architecture & application
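To make the model concrete, here is a minimal sketch in Python; all function and variable names as well as the example figures are illustrative placeholders, not values from the talk:

```python
# Minimal sketch of the TCO model above; all names and numbers are illustrative.

def tco(n, tau,
        c_ot_node,        # one-time costs per node (HW acquisition, infrastructure, installation)
        c_ot_node_type,   # one-time costs per node type (programming effort, env. installation)
        c_pa_node,        # annual costs per node (maintenance, power, infrastructure)
        c_pa_node_type):  # annual costs per node type (compiler/software, app. maintenance)
    """Total cost of ownership of n nodes over a lifetime of tau years."""
    c_ot = n * c_ot_node + c_ot_node_type    # C_ot(n)
    c_pa = n * c_pa_node + c_pa_node_type    # C_pa(n)
    return c_ot + c_pa * tau                 # TCO(n, tau) = C_ot(n) + C_pa(n) * tau

# Example: 16 nodes over 4 years with placeholder cost figures (EUR)
investment = tco(16, 4, c_ot_node=5000, c_ot_node_type=10000,
                 c_pa_node=1500, c_pa_node_type=2000)
```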

Modeling: Comparison Metrics
- Costs per program run C_ppr: includes investment/TCO & application performance
  C_ppr(n, τ) = TCO(n, τ) / (n · n_ex(τ))   with   n_ex(τ) = k · τ / t_par(n)
  (n: number of nodes, τ: system lifetime, n_ex: number of application executions, k: system usage rate, t_par: parallel runtime)
- Baseline used for comparing a system X: Intel Sandy Bridge (SNB) + OpenMP
  (C_ppr,X(n_X, τ) - C_ppr,OMP(n_OMP, τ)) / C_ppr,OMP(n_OMP, τ)   < 0 if X beneficial, >= 0 if OpenMP beneficial
- Break-even investment: minimum budget needed so that system X is beneficial over OpenMP on SNB
  Solve for I at a given fixed lifetime τ: C_ppr,X(n_X, τ) - C_ppr,OMP(n_OMP, τ) = 0   with TCO(n, τ) = I
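A sketch of these metrics, building on the `tco` helper above; the conversion of τ from years to seconds and the simple scan over candidate budgets for the break-even point are my own assumptions, not taken from the slides:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def costs_per_program_run(tco_value, n, tau, t_par, k=0.8):
    """C_ppr(n, tau) = TCO(n, tau) / (n * n_ex(tau)),
    with n_ex(tau) = k * tau / t_par (tau in years, t_par in seconds)."""
    n_ex = k * tau * SECONDS_PER_YEAR / t_par   # application executions per node over the lifetime
    return tco_value / (n * n_ex)

def relative_cppr(cppr_x, cppr_omp):
    """Relative costs per program run; < 0 means system X beats the OpenMP/SNB baseline."""
    return (cppr_x - cppr_omp) / cppr_omp

def break_even_investment(cppr_x_for_budget, cppr_omp_for_budget, budgets):
    """Smallest investment I for which system X is at least as cheap per program run
    as the baseline; each callback maps a budget I (with TCO(n, tau) = I) to C_ppr."""
    for budget in budgets:
        if cppr_x_for_budget(budget) <= cppr_omp_for_budget(budget):
            return budget
    return None
```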

Case Study on Accelerators: Programming Models & System Types

Programming Model           | Accelerator                        | Host                                   | Compiler
Serial                      | -                                  | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenMP (simple, vectorized) | -                                  | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
LEO + OpenMP                | Intel Xeon Phi 5110P, 60 cores     | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenACC                     | NVIDIA Tesla C2050 (Fermi), ECC on | 1x Intel Westmere, 4 cores, 2.4 GHz    | PGI 12.9
OpenCL                      | NVIDIA Tesla C2050 (Fermi), ECC on | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1

Case Study on Accelerators: TCO Components @ RWTH
One-time costs:
- HW purchase: list prices from Bull
- Building/infrastructure: counted as annual costs, since the building is amortized over 25 years
- OS/environment installation: -
- Programming effort: full-time employee costs of 285.71 € a day
Annual costs:
- HW maintenance: 5% of HW purchase costs
- Building/infrastructure: 200,000 € per year; costs per node: division by 1.6 MW, multiplication by the max. power consumption of each node
- OS/environment maintenance: 4 admins, 75% cluster maintenance (~2300 nodes): 180,000 € / 2300 ≈ 78 € per node and year
- Software/compiler: -
- Power: PUE 1.5, regional electricity costs 0.15 €/kWh
- Application maintenance: - (small kernels)
Given a lifetime of 4 years & a fixed investment: compute C_ppr, #nodes, #executions (usage rate 80%)
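A worked example of the per-node annual figures quoted above; the assumed node power draw (0.4 kW) and the effort of 5 days are placeholders, and whether a utilization factor enters the power bill is left open:

```python
# Worked example of the annual per-node cost components from this slide (all in EUR).
node_power_kw = 0.4                                       # assumed max. power draw per node (placeholder)

# Building/infrastructure: 200,000 EUR/year for 1.6 MW, scaled by the node's max. power
infrastructure_per_node = 200_000 / 1600 * node_power_kw  # = 50 EUR per node and year

# OS/environment maintenance: 4 admins, 75% cluster maintenance, ~2300 nodes
admin_per_node = 180_000 / 2300                           # ~78 EUR per node and year

# Power: max. power * hours per year * PUE 1.5 * 0.15 EUR/kWh (no utilization factor applied here)
power_per_node = node_power_kw * 8760 * 1.5 * 0.15        # ~788 EUR per node and year

# Programming effort: full-time employee costs of 285.71 EUR per day (one-time, per node type)
programming_cost = 5.0 * 285.71                           # e.g. 5 days of porting effort
```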

Case Study on Accelerators: Real-World Application
- KegelSpan [2]: 3D simulation of the bevel gear cutting process (source: BMW, ZF, Klingelnberg)
- Basis: serial version
- Small kernel; kernel portion artificially increased from 25% to 90%
- Assumption: homogeneous application landscape

[2] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pp. 1381-1384, Düsseldorf, VDI Verlag, 2010.

Case Study on Accelerators: TCO Components of Application
[Bar charts: programming effort in days, runtime in s, and power consumption in W for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB), and OpenMP-simp (SNB)]

Case Study on Accelerators: Results
[Charts: costs per program run relative to OpenMP-simp (values between +3.62% and -17.15%, plotted over investments from 0 to 200 K€) and break-even investments (roughly 1,809 € to 7,787 €) for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), and OpenMP-vec (SNB)]

Conclusion
- Are accelerators beneficial? It depends.
- TCO spreadsheet [1] available for your own computations
- Our results (with 90% kernel portion) show, relative to SNB-OMP (4 years, 250 K€ investment):
  - GPU (Fermi) beneficial over the 2-socket Intel SNB server: about -17% C_ppr
  - Intel Xeon Phi results disappointing for now: about +4% C_ppr, mainly due to high acquisition costs
  - NVIDIA Kepler probably similar
- Programming effort impacts the break-even investment (see OpenACC vs. OpenCL)
- Bigger codes: increase of kernel size ~ increase of break-even investment
- Projections possible (e.g. hybrid codes)

[1] Wienke, S., an Mey, D., Müller, M.S.: Accelerators for Technical Computing: Is it Worth the Pain? TCO Spreadsheet. https://sharepoint.campus.rwth-aachen.de/units/rz/hpc/public/shared%20documents/WienkeEtAl_Accelerators-TCO-Perspective.xlsx, 2013

Outlook
- Hybrid code implementation (compare to projections)
- Model extensions:
  - New programming models & architectures (OpenMP 4.0, NVIDIA Kepler)
  - Network communication (MPI)
  - Mixed job execution (heterogeneous application landscape)
  - Assessment of decreased runtime / gaining more results
- Comprehensive TCO calculation with predictive power: performance, power consumption, manpower
- Towards exascale computing: architectures might get more complex, and thus more difficult to manage & program; the impact of manpower effort might get stronger

Thank you for your attention!