HECToR UK National Supercomputing Service Andy Turner & Chris Johnson
Outline
EPCC
HECToR
  Introduction
  HECToR Phase 3
  Introduction to the AMD Bulldozer architecture
  Performance
  Application placement: the hardware really matters
CP2K in PRACE
Simulation at exascale
  What software is used on HECToR?
  Exascale software challenges
  What does this mean for HPC users?
EPCC
Founded in 1990
Based at The University of Edinburgh, within the School of Physics and Astronomy
A leading European centre of expertise in advanced research, training and technology transfer
Provides supercomputer services to academia and business
95% of our funding comes from external sources
Heavily involved in European projects such as PRACE and HPC-Europa
PRACE and HPC-Europa
PRACE DECI: a resource-exchange programme
  Projects can access several million CPU-hours of compute resource on machines across Europe
  http://www.prace-ri.eu/call-announcements
HPC-Europa: a visitor programme
  Visitors can visit one of 7 countries: Italy, UK, Spain, Germany, France, The Netherlands or Finland
  http://www.hpc-europa.eu/
  Find a host in an academic department; HPC-Europa provides travel, subsistence and access to HPC resources
HECToR
HECToR Partners
RCUK: UK research funding councils
UoE HPCx Ltd./EPCC: system host and operator
Cray Inc.: system provider
NAG Ltd.: computational science and engineering support
HECToR Details
UK National HPC Service and PRACE Tier-1 machine
Currently a 30-cabinet Cray XE6 system: 2816 nodes, 90,112 cores
Each node has two 16-core AMD Opteron processors (2.3 GHz Interlagos) and 32 GB of memory
Peak performance of over 800 TFlop/s and 90 TB of memory in total
HECToR Service (system diagram): the Cray XE6 compute and login nodes connect over a 1 GigE backbone and an InfiniBand switch to the esFS Lustre high-performance parallel filesystem (OSS and MDS servers), an NFS server, a boot/SDB node, and 10 GigE backup and archive servers.
HECToR Compute Nodes
All dies link to memory, the interconnect and each other via HyperTransport
Nodes are arranged in a 3D torus
The interconnect supports message passing and RDMA in hardware (a minimal one-sided example is sketched below)
The interconnect supports MPI, SHMEM, PGAS and ARMCI
Image courtesy of NAG.
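To make the "one-sided in hardware" point concrete, here is a minimal OpenSHMEM sketch. It is illustrative only (not taken from the slides) and assumes an OpenSHMEM 1.2-style library such as Cray SHMEM: one PE writes directly into another PE's memory with a put, with no matching receive.

```c
/* Illustrative only: one-sided (RDMA-style) communication via OpenSHMEM. */
#include <shmem.h>
#include <stdio.h>

int value = -1;   /* symmetric variable: exists on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int mine = me * 10;
    /* PE 0 writes directly into the memory of the last PE; no receive is posted */
    if (me == 0 && npes > 1)
        shmem_int_put(&value, &mine, 1, npes - 1);

    shmem_barrier_all();   /* make the put visible everywhere */

    if (me == npes - 1)
        printf("PE %d received %d\n", me, value);

    shmem_finalize();
    return 0;
}
```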
AMD Bulldozer Architecture Image courtesy of Wikipedia
Dual-core Interlagos module Image courtesy of NAG.
Phase 3 Performance Comparison
Task placement matters
Task placement (a small sketch for checking placement follows below)
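On the XE6, placement is typically controlled through aprun options such as -N (tasks per node), -d (threads per task) and -S (tasks per NUMA node); the exact flags depend on the installed ALPS version. The following minimal sketch (not from the slides) simply reports where each MPI rank and OpenMP thread actually runs, so a chosen placement can be verified; it assumes an MPI library, OpenMP support and the GNU sched_getcpu() call.

```c
/* Minimal placement check: print the core each rank/thread is running on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Each thread reports the hardware core it is currently executing on. */
        printf("rank %d thread %d on cpu %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```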
CP2K: Improving scaling
CP2K: Overview
CP2K is a freely available (GPL) Density Functional Theory code (plus support for classical, empirical potentials)
Can perform MD, MC, geometry optimisation and normal-mode calculations
"The Swiss Army Knife of Molecular Simulation" (VandeVondele)
c.f. CASTEP, VASP, CPMD etc.
CP2K million-atom KS-DFT
Focussing on CP2K on BlueGene/P (reducing memory usage) and scaling to 1,000,000 atoms (estimated to require 200,000 cores)
Led by Iain Bethune at EPCC
Supported by Dr. Joost VandeVondele et al., CP2K developers at the Physical Chemistry Institute, University of Zurich
Work done under dCSE and PRACE
Improved scaling via increased use of OpenMP directives
CP2K mixed mode
Performance improvement comes from:
Reduced impact of algorithms that scale poorly with the number of MPI tasks; e.g. when using T threads, the switchover point from the 1D-decomposed FFT (more efficient) to the 2D-decomposed FFT (less efficient) is increased by a factor of T
Improved load balancing: the existing MPI load-balancing algorithms do a coarser load balance, with fine-grained balancing done over the OpenMP threads
Significantly reduced number of messages, especially on pre-Gemini networks; for all-to-all communications the message count is reduced by a factor of T squared (a back-of-the-envelope sketch follows below)
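The T-squared factor for all-to-all traffic can be seen with a simple count. This is a back-of-the-envelope sketch, not taken from the slides: if n MPI tasks are replaced by n/T tasks each running T threads, the number of point-to-point messages behind an all-to-all scales roughly as the square of the task count.

```latex
% Message-count sketch for MPI_Alltoall with n tasks vs. n/T tasks (T threads each).
\[
  M_{\mathrm{MPI}} \approx n(n-1) \sim n^{2},
  \qquad
  M_{\mathrm{hybrid}} \approx \frac{n}{T}\left(\frac{n}{T}-1\right) \sim \frac{n^{2}}{T^{2}},
\]
\[
  \frac{M_{\mathrm{MPI}}}{M_{\mathrm{hybrid}}} \approx T^{2}
  \quad \text{for } n \gg T .
\]
```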
CP2K: Functional Evaluation
93% parallel efficiency with 6 threads, 74% with 24 threads (efficiency as defined in the sketch below)
Source: Mixed Mode Parallelism in CP2K: A Case Study
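The slide does not define efficiency explicitly; assuming the standard definition relative to the single-threaded run, 93% on 6 threads corresponds to a speed-up of roughly 0.93 x 6 = 5.6, and 74% on 24 threads to roughly 17.8.

```latex
% Standard parallel efficiency and speed-up on p threads (assumed definition).
\[
  E(p) \;=\; \frac{T(1)}{\,p \, T(p)\,},
  \qquad
  S(p) \;=\; \frac{T(1)}{T(p)} \;=\; p\,E(p).
\]
```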
CP2K: Fast Fourier Transforms
CP2K uses a 3D Fourier Transform to turn real-space data on the plane-wave grids into g-space data on the plane-wave grids
The grids may be distributed as planes or rays (pencils), so the FFT may involve one or two transpose steps between the three 1D FFT operations
The 1D FFTs are performed via an interface which supports many libraries, e.g. FFTW 2/3, ESSL, ACML, CUDA, FFTSG (built-in)
CP2K: Fast Fourier Transforms
We can parallelise two parts with OpenMP:
1D FFTs: assign each thread a subset of rows to transform
Buffer packing: threads cooperatively pack the buffers that are passed to MPI
Communication itself is still handled outside the parallel regions (a minimal sketch of the threaded parts follows below)
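The sketch below shows the two OpenMP-threaded pieces described above, using FFTW 3 for the 1D transforms. It is a minimal illustration, not CP2K's actual code: the array layout, the function and variable names, and the use of FFTW's new-array execute interface are all assumptions.

```c
/* Illustrative sketch: thread the per-row 1D FFTs and the MPI buffer packing. */
#include <fftw3.h>
#include <omp.h>
#include <string.h>

void fft_rows_and_pack(fftw_complex *grid, fftw_complex *sendbuf,
                       int nrows, int rowlen)
{
    /* Plan once, serially: FFTW planning is not thread-safe. */
    fftw_plan plan = fftw_plan_dft_1d(rowlen, grid, grid,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

    /* 1D FFTs: each thread transforms its own subset of rows.
     * fftw_execute_dft() on distinct data is thread-safe. */
    #pragma omp parallel for
    for (int r = 0; r < nrows; r++)
        fftw_execute_dft(plan, grid + r * rowlen, grid + r * rowlen);

    /* Buffer packing: threads cooperatively copy rows into the buffer
     * that is later handed to MPI_Alltoall(v) outside the parallel region. */
    #pragma omp parallel for
    for (int r = 0; r < nrows; r++)
        memcpy(sendbuf + r * rowlen, grid + r * rowlen,
               rowlen * sizeof(fftw_complex));

    fftw_destroy_plan(plan);
}
```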
Simulation Software at Exascale
Edinburgh/Tsukuba Workshop, February 2012
Scientific Software
Chemistry, materials science, climate, oceanography, engineering, plasma physics, paleontology
Dye-sensitised solar cells: F. Schiffmann and J. VandeVondele, University of Zurich
Modelling dinosaur gaits: Dr Bill Sellers, University of Manchester
Fractal-based models of turbulent flows: Christos Vassilicos & Sylvain Laizet, Imperial College
Scientific Usage Profile, HECToR XT4 (pie chart): Other/Unknown 38.2%, Chemistry/Materials Science 37.3%, Earth Science/Climate 16.6%, Physics 6.0%, Engineering 1.9%
Application usage on HECToR XT4 (pie chart): Others 45.7%, VASP 17.5%, UM 6.4%, CASTEP 5.9%, CP2K (MPI) 4.5%, NAMD 3.3%, HELIUM 2.9%, with smaller shares (roughly 0.1% to 2.7% each) for CP2K (Hybrid), NEMO, LAMMPS, Fluidity, Quantum Espresso, DL_POLY, ChemShell, SENGA, Terra and Shelf
Future Look
What does the future hold for HPC and the national facility?

                   2012         2015             2018
System Perf.       20 PFlops    100-200 PFlops   1 EFlops
Memory             1 PB         5 PB             10 PB
Node Perf.         200 GFlops   400 GFlops       1-10 TFlops
Concurrency        32           O(100)           O(1000)
Interconnect BW    40 GB/s      100 GB/s         200-400 GB/s
Nodes              100,000      500,000          O(Million)
I/O                2 TB/s       10 TB/s          20 TB/s
MTTI               Days         Days             O(1 Day)
Power              10 MW        10 MW            20 MW

Accelerators: GPGPUs
Application sustainability
National-scale HPC facilities provide a capability resource for users who want to run calculations that are too large for other resources (although, in reality, the UK facility also gets used for smaller-scale calculations)
The future of national-scale HPC (as for everyone else):
Lots of cores per node (CPU + co-processor)
Little memory per core
Lots of compute power per network interface
The balance of compute to communication power and of compute to memory are both radically different from today
We need to ensure UK researchers have software that can exploit these resources effectively
Application sustainability
Requirements for software on future capability HPC resources:
It probably cannot be pure message-passing parallel: this will not scale on nodes with a high amount of compute
It must exploit parallelism at all levels: vectorisation, shared memory, message passing (see the sketch below)
It must exploit the memory hierarchy efficiently
It must harness the co-processors/lightweight cores
It must be fault-tolerant
None of today's large codes meet all of these requirements
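As a concrete illustration of "parallelism at all levels", here is a small sketch (not from the slides) that combines the three levels named above in one distributed dot product: MPI between nodes, OpenMP threads within a node, and SIMD vectorisation within each thread.

```c
/* Illustrative sketch: MPI (between nodes) + OpenMP (within a node) + SIMD. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double x[N], y[N];   /* static to keep large arrays off the stack */
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* Shared-memory level: threads split the loop.
     * Vector level: each thread's chunk is vectorised with omp simd. */
    #pragma omp parallel for simd reduction(+:local)
    for (int i = 0; i < N; i++) {
        x[i] = rank + 1.0;
        y[i] = 2.0;
        local += x[i] * y[i];
    }

    /* Message-passing level: combine the per-rank partial sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product across all ranks = %f\n", global);

    MPI_Finalize();
    return 0;
}
```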