GPU computing at RZG: overview & some early performance results. Markus Rampp

Introduction

Outline:
- Hydra configuration overview
- GPU software environment
- Benchmarking and porting activities

Team: Renate Dohmen, Andreas Marek, Elena Erastova, Florian Merz (associated IBM applications specialist), Fabio Baruffa, Werner Nagel, Tilman Dannert, Klaus Reuter, Lorenz Hüdepohl, Markus Rampp

Acknowledgements: A. Köhler, P. Messmer (Nvidia), IBM team, RZG systems group

Hardware configuration: Hydra

Compute nodes (~80,000 cores, 260 TB RAM):
- 3424 nodes (2x 10-core Ivy Bridge, 2.8 GHz); RAM: 3324x 64 GB + 100x 128 GB
- 628 nodes (2x 8-core Sandy Bridge, 2.6 GHz); RAM: 608x 64 GB + 20x 128 GB

Accelerators (2 PCIe cards per node):
- 676 GPUs (2 Nvidia K20x per node)
- 24 MICs (Intel Xeon Phi 5110P)

Network topology (InfiniBand FDR14 4x: 5.8 GB/s); nodes arranged in 5 domains with non-blocking fat-tree interconnect:
- 1x 628 nodes (Sandy Bridge)
- 2x 628 nodes (Ivy Bridge)
- 1x 1818 nodes (Ivy Bridge)
- 1x 350 nodes (Ivy Bridge + accelerators)

I/O subsystem:
- 26 I/O nodes
- 5 PB online (/ptmp, /u); /ptmp exported to the visualization cluster + extension

Hardware configuration: Hydra

Node architecture (Hydra GPU nodes): 2x CPU (20 cores total) + 2x GPU (PCIe 2)
- Nvidia K20x: 1.3 TFlop/s (DP), 6 GB RAM, 250 GB/s memory bandwidth
- Intel Xeon E5-2680v2 @ 2.8 GHz: 0.25 TFlop/s (DP), 32 GB RAM, 40 GB/s memory bandwidth
- intra-node links (from the node diagram): ~6 GB/s per CPU-GPU PCIe link, ~30 GB/s between the two sockets
- socket-to-socket comparison, GPU vs. multicore CPU: similar power, similar price

Speedup on n nodes := T(2n CPU) / T(2n CPU + 2n GPU)
[compare CRAY-type architectures: T(2n CPU) / T(n CPU + n GPU)] (written out below)
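Written out, the two speedup conventions differ only in how many CPU sockets the hybrid run retains; the following is just a restatement of the definitions on the slide, with a purely hypothetical numerical illustration in the last comment:

    % Hydra convention: GPUs are added to the full set of 2n CPU sockets on n nodes
    \[ S_{\mathrm{Hydra}}(n) = \frac{T(2n\,\mathrm{CPU})}{T(2n\,\mathrm{CPU} + 2n\,\mathrm{GPU})} \]
    % CRAY-type convention: the hybrid run uses only n CPU sockets plus n GPUs
    \[ S_{\mathrm{CRAY}}(n) = \frac{T(2n\,\mathrm{CPU})}{T(n\,\mathrm{CPU} + n\,\mathrm{GPU})} \]
    % Hypothetical illustration (not measured data): if T(2n CPU) = 100 s and
    % T(2n CPU + 2n GPU) = 50 s, the reported speedup is S_Hydra = 2.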

GPU software environment (Hydra)

GPU programming & libraries (cf. module avail):
- CUDA (5.5): C, CUBLAS, CUFFT, tools
- PGI (14.4): CUDA Fortran, OpenACC
- MAGMA (1.4.1): BLAS and LAPACK for GPUs (single node, multiple GPUs)
- Allinea DDT: interactive, graphical debugger

GPU-enabled applications (cf. module avail): gromacs, namd, lammps, [acemd]

Batch system: simply add to the LoadLeveler batch script (minimal example below):
  #@ requirements = (Feature=="gpu")
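For illustration, a minimal LoadLeveler job script for the GPU partition might look as follows. Only the requirements directive is taken from the slide above; job type, node and task counts, wall-clock limit, module name, and the launcher are hypothetical placeholders that have to be adapted to the actual Hydra setup:

    #!/bin/bash
    # Hypothetical values: job type, node/task counts, and wall-clock limit must be
    # adapted to the actual Hydra job classes; only the requirements line is from the slide.
    #@ shell = /bin/bash
    #@ job_type = parallel
    #@ node = 2
    #@ tasks_per_node = 2
    #@ wall_clock_limit = 01:00:00
    #@ requirements = (Feature=="gpu")
    #@ queue

    module load cuda                 # module name is an assumption (cf. module avail)
    poe ./my_gpu_application         # launcher (poe vs. mpiexec) is site-specific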

Utilization (Hydra): preliminary data for the last 4 weeks [utilization plot; GPU counts shown must be multiplied by 2]

Test and development environment

2 standalone nodes (identical to Hydra, but no InfiniBand) for interactive test and development:
- dvl01.opt.rzg.mpg.de (2x GPU K40m + 2x CPU Xeon E5-2680v2)
- dvl02.opt.rzg.mpg.de (2x GPU K20x + 2x CPU Xeon E5-2680v2)
Access for MPG users on request.

Software environment (cf. module avail):
- CUDA (5.5, 6.0): C, CUBLAS, CUFFT, tools
- PGI (14.4): CUDA Fortran, OpenACC
- MAGMA (1.4.1): BLAS and LAPACK for GPUs (single node, multiple GPUs)
- Allinea DDT: interactive, graphical debugger

No batch system: use the command-line calendar tool gpschedule.

Motivation (why bother?)

1) Compute performance:
- substantial nominal performance gains (vs. multi-core CPU): 5x...10x...100x?
- 2x...3x sustained speedups (GPU vs. multi-core CPU; this is nowadays called an "apples-to-apples" comparison in the GPU community!)
- porting and achieving application performance requires hard work: porting an HPC application to Xeon Phi is a project (like GPU)

2) Energy efficiency:
- substantial nominal energy-efficiency gains: 2x...3x (a must for exascale: 50x...100x required!)
- sustained application speedups of 2x are reasonable from an operational perspective
[from: Accelerating Computational Science Symposium, ORNL (2012)]

3) Existing resources and technology readiness:
- significant GPU-based resources around the world
- competition aspects: grants, impact, technology readiness

Porting MPG applications to MIC & GPU

Context: assessment of accelerator technology for the MPG (RZG, starting 2012)
- porting of HPC application codes developed in the MPG to GPU/MIC (see the talk by A. Marek)
- assessment of existing GPU/MIC applications (e.g. MD: GROMACS, NAMD, ...) relevant for the MPG
=> input for the configuration of the new HPC system of the MPG ("spend x% of the budget on MIC and/or GPU")
=> decision by the scientific steering committees (Beirat, BAR): x ~ 10%

General strategy and methodology:
- we are targeting heterogeneous GPU/MIC-cluster applications, leveraging the entire resource
- programming models:
  GPU: CUDA kernels (not much choice so far); a generic sketch follows below
  MIC: guided auto-vectorization and moderate code changes (loop interchange, ...) only!
- performance comparison: we always compare with highly optimized (SIMD, multi-core) CPU code!
- platforms:
  Nvidia Kepler (K20x) vs. Intel Sandy/Ivy Bridge (E5-2670 8c @ 2.6 GHz, E5-2680v2 10c @ 2.8 GHz)
  Intel Xeon Phi (5110P, 7120P) vs. Sandy/Ivy Bridge (E5-2670 8c @ 2.6 GHz, E5-2680v2 10c @ 2.8 GHz)
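As a generic illustration of the "CUDA kernels" porting model (a minimal sketch only, not code from VERTEX, GENE, or ELPA): the hot loop of the CPU code is isolated into a kernel with one thread per array element, while the surrounding MPI/OpenMP program structure is left in place. Compiled e.g. with nvcc -arch=sm_35 for the K20x.

    // saxpy.cu -- minimal CUDA porting sketch: offload one hot loop as a kernel.
    #include <cuda_runtime.h>
    #include <cstdio>

    // One GPU thread handles one element of the former CPU hot loop.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *hx = new float[n], *hy = new float[n];
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);   // 256 threads per block
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

        printf("y[0] = %f (expected 4.0)\n", hy[0]);
        cudaFree(dx); cudaFree(dy);
        delete[] hx; delete[] hy;
        return 0;
    }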

HPC applications

VERTEX (MPI for Astrophysics):
- hot spot (50-60% of total runtime), algorithm prototypical for GPU
- application speedup: 2x (Sandy Bridge CPUs, K20x GPUs)

GENE (MPI for Plasma Physics):
- convolution in spectral space => multiplication in real space (50% of total runtime)
- application speedup: 1.0x...2x (further optimization work in progress)

ELPA (BMBF project: RZG, FHI, TUM, U. Wuppertal, MPI-MIS, IBM):
- work by P. Messmer (Nvidia)
- application speedup: 1.7x (Sandy Bridge CPUs, K20x GPUs, work in progress)

MNDO (MPI für Kohlenforschung):
- code was ported to a single GPU by Wu, Koslowski, Thiel (JCTC 2012)
- application speedup (single node): ~2x...4.4x (uses the multi-GPU DSYEVD from MAGMA 1.4.1)

Performance Results

GPU benchmarks and production applications on Hydra:
- NAMD: MPI of Biophysics (Dept. Hummer)
- ACEMD (single GPU, 2x...3x wrt. CPU): MPI of Colloids and Interfaces (Dept. Lipowsky)
- GROMACS: MPI for Biophysical Chemistry (Dept. Grubmüller)

Summary and Conclusions

Application speedups:
- 2x speedups (time to solution) appear very competitive for complex HPC codes

Programming efforts (GENE, VERTEX):
- 3...6 months per GPU code (RZG, HPC specialists)
- sustainability? (10k...100k LoC)

Worth the effort?
- the computational scientists' (our) point of view: definitely yes
  thorough expertise on the technology => consulting
  rethinking of algorithms and implementations pays off
- the scientific user's point of view: not immediately obvious
  2x...3x speedups do not enable qualitatively new science objectives => reluctance to sacrifice human effort and code maintainability
  regular CPUs (Xeon) still do a very good job, and vendor roadmaps promise further performance increases (?) => business as usual?

(Dannert et al., Proc. of ParCo 2013, arXiv:1310.1485)

Summary and Conclusions

Challenges and opportunities:
- there is life beyond heroic CUDA porting efforts for huge legacy codes:
  new algorithms, new codes (specifically suited for GPU-like architectures)
  drop-in libraries (e.g. the FFTW interface of CUDA 6.0; see the sketch after this slide)
  DSLs (PyCUDA, MATLAB, etc.)
  less intrusive programming models are maturing: OpenACC, OpenMP
- don't expect a non-disruptive way forward: processor technology evolution is driven by energy/power constraints and the mass market
  (recall: a sustained EFlop/s @ 20 MW requires ~50 GFlop/s/Watt => ~50x improvement!)
  => extreme concurrency => impact on programming models
- the community has mastered a number of revolutions before:
  recall that the MPI part in the formula (MPI/OpenMP + X) is rarely questioned today
  recall that the OpenMP (multicore) part is common: cf. "The free lunch is over..." by H. Sutter (2004)
- the MPG provides 1 PFlop/s of (nominal) compute performance!
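To make the "drop-in library" point concrete, here is a minimal sketch of how an unchanged FFTW3 call site can be redirected to the GPU through the cuFFTW interface shipped with CUDA: in the ideal case only the include and the link line differ from an ordinary FFTW build. The build command, paths, and file name are assumptions, not an RZG recipe.

    /* fftw_drop_in.c -- unchanged FFTW3 code path, redirected to cuFFT via cuFFTW.
     * Build sketch (paths are assumptions):
     *   gcc fftw_drop_in.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcufftw -lcufft -o fftw_drop_in */
    #include <cufftw.h>   /* instead of <fftw3.h>; the API names below stay the same */
    #include <stdio.h>

    int main(void)
    {
        const int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
        for (int i = 0; i < n; ++i) { in[i][0] = 1.0; in[i][1] = 0.0; }  /* constant signal */

        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p);                                      /* executed by cuFFT on the GPU */
        printf("out[0] = (%g, %g)\n", out[0][0], out[0][1]);  /* expect (1024, 0) */

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }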

Logistics
- please sign the list of participants
- lunch: a table is reserved at the IPP canteen (1st floor)