GPU computing at RZG: overview & some early performance results
Markus Rampp
Introduction
Outline:
  Hydra configuration overview
  GPU software environment
  Benchmarking and porting activities
Team:
  Renate Dohmen, Andreas Marek, Elena Erastova, Florian Merz (associated IBM applications specialist),
  Fabio Baruffa, Werner Nagel, Tilman Dannert, Klaus Reuter, Lorenz Hüdepohl, Markus Rampp
Acknowledgements: A. Köhler, P. Messmer (Nvidia), IBM team, RZG systems group
Hardware configuration: Hydra
Compute nodes (~80,000 cores, 260 TB RAM):
  3424 nodes (2x 10-core Ivy Bridge @ 2.8 GHz); RAM: 3324x 64 GB + 100x 128 GB
  628 nodes (2x 8-core Sandy Bridge @ 2.6 GHz); RAM: 608x 64 GB + 20x 128 GB
Accelerators (2 PCIe cards per node):
  676 GPUs (2x Nvidia K20x per node)
  24 MICs (Intel Xeon Phi 5110P)
Network topology (InfiniBand FDR14 x4: 5.8 GB/s): nodes arranged in 5 domains with non-blocking fat-tree interconnect:
  1x 628 nodes (Sandy Bridge)
  2x 628 nodes (Ivy Bridge)
  1x 1818 nodes (Ivy Bridge)
  1x 350 nodes (Ivy Bridge + accelerators)
I/O subsystem: 26 I/O nodes, 5 PB online (/ptmp, /u) + extension; /ptmp exported to the visualization cluster
Hardware configuration: Hydra
Node architecture (Hydra GPU nodes): 2x CPU (20 cores total) + 2x GPU (PCIe 2)
  GPU: Nvidia K20x: 1.3 TFlop/s (DP), 6 GB RAM, 250 GB/s memory bandwidth; ~6 GB/s PCIe per GPU
  CPU: Intel Xeon E5-2680v2 @ 2.8 GHz: 0.25 TFlop/s (DP), 32 GB RAM, 40 GB/s memory bandwidth; ~30 GB/s socket-to-socket link
Socket-to-socket comparison, GPU vs. multi-core CPU: similar power, similar price
Speedup on n nodes := T(2n CPU) / T(2n CPU + 2n GPU)
  [compare CRAY-type architectures: T(2n CPU) / T(1n CPU + 1n GPU)]
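Written out, the two speedup conventions referred to above read as follows; the numerical illustration at the end is hypothetical and only shows how the metric is interpreted:

  % speedup as quoted on these slides (Hydra nodes: 2 CPUs + 2 GPUs each)
  S_{\mathrm{Hydra}}(n) = \frac{T(2n\ \mathrm{CPUs})}{T(2n\ \mathrm{CPUs} + 2n\ \mathrm{GPUs})}
  % CRAY-type comparison (1 CPU + 1 GPU per node), same CPU-only baseline
  S_{\mathrm{CRAY}}(n) = \frac{T(2n\ \mathrm{CPUs})}{T(n\ \mathrm{CPUs} + n\ \mathrm{GPUs})}
  % hypothetical example: 100 s CPU-only vs. 50 s hybrid run on the same nodes
  S_{\mathrm{Hydra}} = \frac{100\,\mathrm{s}}{50\,\mathrm{s}} = 2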
GPU software environment (Hydra)
GPU programming & libraries (cf. module available):
  CUDA (5.5): C, CUBLAS, CUFFT, tools
  PGI (14.4): CUDA Fortran, OpenACC
  MAGMA (1.4.1): BLAS and LAPACK for GPUs (single node, multiple GPUs)
  Allinea DDT: interactive, graphical debugger
GPU-enabled applications (cf. module available): GROMACS, NAMD, LAMMPS, [ACEMD]
Batch system: simply add to the LoadLeveler batch script (a complete job-script sketch follows below):
  #@ requirements = (Feature=="gpu")
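For orientation, a minimal LoadLeveler job script for a GPU job could look as follows; apart from the Feature=="gpu" requirement quoted above, all directives (node and task counts, wall-clock limit, module name, launcher) are assumptions and have to be adapted to the actual Hydra setup:

  # minimal LoadLeveler GPU job sketch (illustrative; adapt to local defaults)
  #@ job_type         = parallel
  #@ node             = 2
  #@ tasks_per_node   = 20
  #@ wall_clock_limit = 01:00:00
  #@ requirements     = (Feature=="gpu")
  #@ queue

  module load cuda/5.5
  poe ./my_gpu_application   # launcher name is an assumption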
Utilization (Hydra): preliminary data for the last 4 weeks (multiply GPU numbers by 2)
Test and development environment
2 standalone nodes (identical to Hydra nodes, but without InfiniBand) for interactive test and development:
  dvl01.opt.rzg.mpg.de (2x GPU K40m + 2x CPU Xeon E5-2680v2)
  dvl02.opt.rzg.mpg.de (2x GPU K20x + 2x CPU Xeon E5-2680v2)
Access for MPG users on request.
Software environment (cf. module available):
  CUDA (5.5, 6.0): C, CUBLAS, CUFFT, tools
  PGI (14.4): CUDA Fortran, OpenACC
  MAGMA (1.4.1): BLAS and LAPACK for GPUs (single node, multiple GPUs)
  Allinea DDT: interactive, graphical debugger
No batch system: use the command-line calendar tool gpschedule.
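As an illustration of how the GPU libraries on these nodes are typically exercised, the sketch below calls CUBLAS DGEMM from host code; matrix sizes and the compile line are assumptions, not RZG-provided material:

  /* minimal CUBLAS DGEMM sketch: C = alpha*A*B + beta*C in double precision
     compile line is an assumption: nvcc dgemm_test.cu -lcublas -o dgemm_test */
  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  int main(void)
  {
      const int n = 1024;                       /* illustrative matrix size */
      const double alpha = 1.0, beta = 0.0;
      size_t bytes = (size_t)n * n * sizeof(double);

      double *hA = (double *)malloc(bytes), *hB = (double *)malloc(bytes),
             *hC = (double *)malloc(bytes);
      for (int i = 0; i < n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; hC[i] = 0.0; }

      double *dA, *dB, *dC;
      cudaMalloc((void **)&dA, bytes);
      cudaMalloc((void **)&dB, bytes);
      cudaMalloc((void **)&dC, bytes);
      cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

      cublasHandle_t handle;
      cublasCreate(&handle);
      /* column-major DGEMM on the GPU; result copied back for a simple check */
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, dA, n, dB, n, &beta, dC, n);
      cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
      printf("C[0] = %g (expected %g)\n", hC[0], 2.0 * n);

      cublasDestroy(handle);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      free(hA); free(hB); free(hC);
      return 0;
  }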
Motivation (why bother?)
1) Compute performance:
  substantial nominal performance gains (vs. multi-core CPU): 5x...10x...100x?
  2x...3x sustained speedups (GPU vs. multi-core CPU; this is nowadays called an apples-to-apples comparison in the GPU community)
  porting and achieving application performance requires hard work: porting an HPC application to Xeon Phi is a project (like GPU)
2) Energy efficiency:
  substantial nominal energy-efficiency gains: 2x...3x (a must for exascale: 50x...100x required!)
  from the Accelerating Computational Science Symposium, ORNL (2012): sustained application speedups of 2x are reasonable from an operational perspective
3) Existing resources and technology readiness:
  significant GPU-based resources around the world
  competition aspects: grants, impact, technology readiness
Porting MPG applications to MIC & GPU
Context: assessment of accelerator technology for the MPG (RZG, starting 2012)
  porting of HPC application codes developed in the MPG to GPU/MIC (-> talk by A. Marek)
  assessment of existing GPU/MIC applications (e.g. MD: GROMACS, NAMD, ...) relevant for the MPG
  => input for the configuration of the new HPC system of the MPG ("spend x% of the budget for MIC and/or GPU")
  decision by scientific steering committees (Beirat, BAR): x ~ 10%
General strategy and methodology
  we target heterogeneous GPU/MIC-cluster applications, leveraging the entire resource
  programming models:
    GPU: CUDA kernels (not much choice so far); see the minimal kernel sketch below
    MIC: guided auto-vectorization and moderate code changes (loop interchange, ...) only
  performance comparison: we always compare with highly optimized (SIMD, multi-core) CPU code
  platforms:
    Nvidia Kepler (K20x) vs. Intel Sandy/Ivy Bridge (E5-2670 8c @ 2.6 GHz, E5-2680v2 10c @ 2.8 GHz)
    Intel Xeon Phi (5110P, 7120P) vs. Sandy/Ivy Bridge (E5-2670 8c @ 2.6 GHz, E5-2680v2 10c @ 2.8 GHz)
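For readers unfamiliar with the CUDA-kernel programming model referred to above, here is a minimal, self-contained sketch of an element-wise vector update (a generic illustration, not an excerpt from any of the MPG codes; the compile line is an assumption):

  /* minimal CUDA kernel sketch: y = a*x + y (daxpy); illustration only
     compile (assumption): nvcc -arch=sm_35 daxpy.cu -o daxpy             */
  #include <stdio.h>
  #include <cuda_runtime.h>

  __global__ void daxpy(int n, double a, const double *x, double *y)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
      if (i < n)
          y[i] += a * x[i];
  }

  int main(void)
  {
      const int n = 1 << 20;
      double *x, *y;
      cudaMallocManaged(&x, n * sizeof(double));   /* unified memory (CUDA 6.0) */
      cudaMallocManaged(&y, n * sizeof(double));
      for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

      int threads = 256;                           /* threads per block         */
      int blocks  = (n + threads - 1) / threads;   /* enough blocks to cover n  */
      daxpy<<<blocks, threads>>>(n, 3.0, x, y);
      cudaDeviceSynchronize();

      printf("y[0] = %g (expected 5)\n", y[0]);
      cudaFree(x); cudaFree(y);
      return 0;
  }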
HPC applications
VERTEX (MPI for Astrophysics): hot spot (50-60% of total runtime, algorithm prototypical for GPU); application speedup: 2x (Sandy Bridge CPUs, K20x GPUs)
GENE (MPI for Plasma Physics): convolution in spectral space -> multiplication in real space (50% of total runtime); application speedup: 1.0x...2x (further optimization work in progress)
ELPA (BMBF project: RZG, FHI, TUM, U. Wuppertal, MPI-MIS, IBM): work by P. Messmer (Nvidia); application speedup: 1.7x (Sandy Bridge CPUs, K20x GPUs, work in progress)
MNDO (MPI f. Kohlenforschung): code was ported to a single GPU by Wu, Koslowski, Thiel (JCTC 2012); application speedup (single node): ~2x...4.4x (uses the multi-GPU DSYEVD from MAGMA 1.4.1; see the sketch below)
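As a rough illustration of the MAGMA routine mentioned for MNDO, the fragment below sketches a call to the multi-GPU divide-and-conquer symmetric eigensolver; the workspace-query pattern follows LAPACK conventions, but argument types changed between MAGMA releases, so treat the signature as an assumption to be checked against the installed magma.h:

  /* sketch: multi-GPU symmetric eigensolver via MAGMA (magma_dsyevd_m);
     illustrative only -- verify argument types against the MAGMA 1.4.1 headers */
  #include <stdlib.h>
  #include <magma.h>

  void eigensolve(magma_int_t ngpu, magma_int_t n, double *A, double *w)
  {
      magma_int_t info, liwork_query;
      double lwork_query;

      magma_init();

      /* LAPACK-style workspace query (lwork = liwork = -1) */
      magma_dsyevd_m(ngpu, MagmaVec, MagmaLower, n, A, n, w,
                     &lwork_query, -1, &liwork_query, -1, &info);

      magma_int_t lwork  = (magma_int_t)lwork_query;
      magma_int_t liwork = liwork_query;
      double      *work  = malloc(lwork  * sizeof(double));
      magma_int_t *iwork = malloc(liwork * sizeof(magma_int_t));

      /* eigenvalues returned in w, eigenvectors overwrite A */
      magma_dsyevd_m(ngpu, MagmaVec, MagmaLower, n, A, n, w,
                     work, lwork, iwork, liwork, &info);

      free(work); free(iwork);
      magma_finalize();
  }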
Performance Results
GPU benchmarks and production applications on Hydra:
  NAMD: MPI of Biophysics (Dept. Hummer)
  ACEMD (single GPU): 2x...3x w.r.t. CPU; MPI of Colloids and Interfaces (Dept. Lipowsky)
  GROMACS: MPI for Biophysical Chemistry (Dept. Grubmüller)
Summary and Conclusions
Application speedups: 2x speedups (time to solution) appear very competitive for complex HPC codes
Programming efforts (GENE, VERTEX): 3...6 months per GPU code (RZG, HPC specialists); sustainability? (10k...100k LoC)
Worth the effort?
  computational scientists' (our) point of view: definitely yes
    thorough expertise on the technology => consulting
    rethinking of algorithms and implementations pays off
  scientific user's point of view: not immediately obvious
    2x...3x speedups do not enable qualitatively new science objectives => reluctance to sacrifice human effort and code maintainability
    regular CPUs (Xeon) still do a very good job, and vendor roadmaps promise further performance increases (?) => business as usual?
(Dannert et al., Proc. of ParCo 2013, arXiv:1310.1485)
Summary and Conclusions
Challenges and opportunities
  there is life beyond heroic CUDA porting efforts for huge legacy codes:
    new algorithms, new codes (specifically suited for GPU-like architectures)
    drop-in libraries (e.g. the FFTW interface of CUDA 6.0; see the sketch below)
    DSLs (PyCUDA, MATLAB, etc.)
    less intrusive programming models are maturing: OpenACC, OpenMP
  don't expect a non-disruptive way forward: processor technology evolution is driven by energy/power constraints and the mass market
    (recall: a sustained EFlop/s @ 20 MW requires ~50 GFlop/s/Watt => 50x improvement!) => extreme concurrency => impact on programming models
  the community has mastered a number of revolutions before:
    recall that the MPI part in the formula (MPI/OpenMP + X) is rarely questioned today
    recall that the OpenMP (multicore) part is common: cf. "The free lunch is over..." by H. Sutter (2004)
The MPG provides 1 PFlop/s (nominal) of compute performance!
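To make the drop-in-library point concrete, the sketch below uses the FFTW-compatible interface shipped with cuFFT: the FFTW-style source stays unchanged, only the header and the link line differ; the compile line and the exact subset of supported FFTW calls are assumptions to be checked against the cuFFT documentation:

  /* sketch: drop-in use of the cuFFT FFTW interface; plain FFTW-style code,
     but compiled/linked against cuFFT (assumed compile line):
       gcc fft_test.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcufftw -lcufft */
  #include <stdio.h>
  #include <cufftw.h>              /* instead of <fftw3.h> */

  int main(void)
  {
      const int n = 1024;
      fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
      fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
      for (int i = 0; i < n; ++i) { in[i][0] = 1.0; in[i][1] = 0.0; }

      /* planning and execution look exactly like FFTW, but run on the GPU */
      fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
      fftw_execute(p);

      printf("out[0] = (%g, %g), expected (%d, 0)\n", out[0][0], out[0][1], n);

      fftw_destroy_plan(p);
      fftw_free(in); fftw_free(out);
      return 0;
  }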
Logistics
Please sign the list of participants.
Lunch: table reserved at the IPP canteen (1st floor).