CP2K: HIGH PERFORMANCE ATOMISTIC SIMULATION

Iain Bethune, ibethune@epcc.ed.ac.uk
http://tinyurl.com/mcc-ukcp-2016

CP2K Overview CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods such as e.g., density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials. From www.cp2k.org (2004!)

CP2K Overview
Many force models:
- Classical
- DFT (GPW, GAPW + vdW)
- Hybrid Hartree-Fock
- LS-DFT
- post-HF (MP2, RPA)
- Combinations (QM/MM, mixed)
Simulation tools:
- MD (various ensembles)
- Monte Carlo
- Minimisation (GEO/CELL_OPT)
- Properties (spectra, excitations, ...)
Open source:
- GPL, www.cp2k.org
- 1M lines of code, ~2 commits per day, ~20 core developers

CP2K History
- 25th June 2001: CP2K repository online at berlios.de
  - Merger of the Quickstep (DFT) + FIST (MD) codes
  - Jürg Hutter, Matthias Krack, Chris Mundy
- Oct 2011: first official release, CP2K 2.2
- 14 years on: 1M lines of code, ~16k commits, 25 developers + many contributors, 100s of users
- Fully open-source (GPL)
Image from Jürg Hutter

CP2K Features
QUICKSTEP DFT: Gaussian and Plane Waves Method (VandeVondele et al., Comp. Phys. Comm., 2005)
- Advantages of the atom-centred (primary) basis: the density and KS matrices are sparse
- Advantages of the plane-wave (auxiliary) basis: efficient computation of the Hartree potential, efficient mapping between the basis sets
- -> Construction of the KS matrix is ~O(n)
(Gaussian basis functions ϕ_α(r), overlap matrix S_αβ)
Orbital Transformation Method (VandeVondele & Hutter, J. Chem. Phys., 2003)
- Replacement for traditional diagonalisation to optimise the KS orbitals within the SCF (non-metallic systems only)
- Still cubic scaling, but at ~10% of the cost of diagonalisation
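As a rough illustration of switching from diagonalisation to OT, the relevant keywords live in the &SCF section of the input. This is a minimal sketch, not a complete input file (the surrounding DFT, subsystem and XC sections are omitted, and the values are illustrative, not recommendations):

  &FORCE_EVAL
    &DFT
      &SCF
        MAX_SCF 50
        EPS_SCF 1.0E-6
        &OT                              ! enables the orbital transformation minimiser
          MINIMIZER DIIS                 ! or CG
          PRECONDITIONER FULL_SINGLE_INVERSE
        &END OT
      &END SCF
    &END DFT
  &END FORCE_EVAL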

CP2K Features
GPW energy and KS matrix expressions:

E_{el}[n] = \sum_{\mu\nu} P_{\mu\nu} \left\langle \varphi_\mu(\mathbf{r}) \middle| -\tfrac{1}{2}\nabla^2 + V^{SR}_{loc} + V_{nl} \middle| \varphi_\nu(\mathbf{r}) \right\rangle + 2\pi\Omega \sum_{\mathbf{G}} \frac{\tilde{n}^{*}_{tot}(\mathbf{G})\,\tilde{n}_{tot}(\mathbf{G})}{G^2} + \sum_{\mathbf{R}} \tilde{n}(\mathbf{R})\, v_{XC}(\mathbf{R})

H_{\mu\nu} = \left\langle \varphi_\mu(\mathbf{r}) \middle| -\tfrac{1}{2}\nabla^2 + V_{ext} \middle| \varphi_\nu(\mathbf{r}) \right\rangle + \sum_{\mathbf{R}} V_{HXC}(\mathbf{R})\, \varphi_\mu(\mathbf{R})\, \varphi_\nu(\mathbf{R})

From Marcella Iannuzzi, http://archer.ac.uk/training/course-material/2014/08/cp2k/slides/gpw_gapw.pdf

CP2K Features
OT vs. diagonalisation: 64 H2O on 32 CPUs (IBM SP4), 1 SCF iteration:

                    DZVP   TZVP   TZV2P   QZV2P   QZV3P
  OT                0.50   0.60    0.77    0.87    1.06
  Diagonalisation   6.02   8.40   13.80   17.34   24.59

Figure from Jürg Hutter

CP2K Features
- QM/MM (Laino et al., JCTC, 2005, 2006): fully periodic, linear scaling electrostatic coupling
- Gaussian and Augmented Plane Waves (Iannuzzi et al., CHIMIA, 2005): partitioning of the electronic density -> all-electron calculations
- Hartree-Fock Exchange (Guidon et al., JCP, 2008): beyond local DFT (also MP2, RPA, ...)
- Auxiliary Density Matrix Method (Guidon et al., JCTC, 2010)
- Linear Scaling DFT (VandeVondele, Borstnik & Hutter, JCTC, 2012): fully linear scaling condensed-phase DFT, up to millions of atoms
https://www.epcc.ed.ac.uk/sites/default/files/pdf/cp2k-UK-2015-Hutter.pdf

CP2K Features
And LOTS more...
- Recent review paper: Hutter et al., WIREs Comput Mol Sci 2014, 4:15-25, http://dx.doi.org/10.1002/wcms.1159
- Some highlight applications: http://www.cp2k.org/science
- All for free!
  - Please cite the references
  - Please give feedback / patches / feature requests
  - Please spread the word about CP2K!
DSSC: see Schiffmann et al., PNAS, 2010

CP2K Information
- CP2K website (http://www.cp2k.org)
  - Everything else is linked from here
  - Now a wiki, so feel free to contribute!
- CP2K SourceForge site (http://sf.net/p/cp2k)
  - Contains the source code repository (SVN): public read-only, read-write access for developers
  - Bug reporting
  - Source tarball / binary downloads

CP2K Information
- CP2K discussion group (http://groups.google.com/group/cp2k)
  - Email / web forum
  - Users and developers
  - Searchable history
- CP2K input reference manual (http://manual.cp2k.org)
  - Documents every possible CP2K input keyword
  - Mostly with helpful descriptions
- Beta: CP2K online input editor (http://cp2k-www.epcc.ed.ac.uk/cp2k-input-editor)

CP2K Information
Which version?
Current release 3.0 (Dec 2015)
  + stable, major bug-fixes are back-ported
  + source and binaries available from http://www.cp2k.org/download
  + installed on ARCHER
  - missing the absolute latest features; minor bugs are not always fixed
Previous release 2.6.2 (Sep 2015)
  + available for Ubuntu / Debian / Fedora via package managers
  (see http://www.cp2k.org/version_history)
SVN trunk, version 4.0 (development)
  + latest features, fixes, performance improvements
  + actively developed
  - bugs may exist (see http://dashboard.cp2k.org)
  - must be obtained from SVN and compiled from source

Support for UK CP2K Users
CP2K-UK: EPSRC Software for the Future
- £500,000, 2013-2018
- EPCC, UCL, KCL + 7 supporting groups
Aims:
- Grow and develop the existing CP2K community in the UK
- Lower barriers to usage and development of CP2K
- Long-term sustainability of CP2K
- Extend the ability of CP2K to tackle challenging systems
Annual user meetings (next: 22nd Feb 2016 @ KCL)
http://www.epcc.ed.ac.uk/content/cp2k-uk-workshop-2016
Updates via the mailing list

CP2K Performance
Distributed realspace grids
- Overcome the memory bottleneck
- Reduce communication costs
- Parallel load balancing:
  - on a single grid level
  - re-ordering multiple grid levels
  - finely balanced with replicated tasks
[Figure: task distribution across grid levels - level 1, fine grid, distributed; level 2, medium grid, distributed; level 3, coarse grid, replicated]
- libgrid for optimised collocate/integrate routines: ~5-10% speedup typical
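The grid levels these routines operate on are controlled from the &MGRID section of the input. A minimal sketch with illustrative values only (CUTOFF, REL_CUTOFF and NGRIDS are standard keywords; the numbers are not recommendations and should be converged for your system):

  &FORCE_EVAL
    &DFT
      &MGRID
        CUTOFF 400        ! plane-wave cutoff (Ry) of the finest grid level
        REL_CUTOFF 60     ! controls how Gaussians are assigned to grid levels
        NGRIDS 4          ! number of multigrid levels
      &END MGRID
    &END DFT
  &END FORCE_EVAL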

CP2K Performance
Fast Fourier Transforms
- 1D or 2D decomposition
- FFTW3 and CUFFT library interfaces
- Cache and re-use data: FFTW plans, Cartesian communicators
- GLOBAL%FFTW_PLAN_TYPE MEASURE | PATIENT: up to 5% speedup possible
DBCSR
- Distributed sparse matrix multiplication based on Cannon's algorithm
- Local multiplication: recursive, cache oblivious
- libsmm for small block multiplications
[Figure 5: Comparing performance (GFLOP/s) of SMM (gfortran 4.6.2) and Libsci BLAS (11.0.04) for block sizes (M,N,K) up to 22,22,22]
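For reference, the FFTW planner setting mentioned above is a keyword in the &GLOBAL section. A minimal sketch, assuming an otherwise complete input (the PROJECT and RUN_TYPE values are placeholders):

  &GLOBAL
    PROJECT h2o_md
    RUN_TYPE MD
    FFTW_PLAN_TYPE MEASURE   ! or PATIENT; extra planning cost on the first FFT,
                             ! but can give up to ~5% overall speedup
  &END GLOBAL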

CP2K Performance
OpenMP
- Now in all key areas of CP2K: FFT, DBCSR, collocate/integrate, buffer packing
- Incremental addition over time
- Usually 2 or 4 threads per process
Dense linear algebra
- Matrix operations during the SCF
- GEMM -> ScaLAPACK; SYEVD -> ScaLAPACK / ELPA
- Build with -D__ELPA2 and link the library to enable, then set GLOBAL%PREFERRED_DIAG_LIBRARY ELPA
- Up to ~5x speedup for large, metallic systems
[Figure: time per MD step (seconds) vs. number of cores, comparing XT4 and XT6, MPI-only vs. mixed MPI/OpenMP]
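On the input side, selecting ELPA is a single &GLOBAL keyword; a minimal sketch, assuming the binary was built with ELPA support:

  &GLOBAL
    PREFERRED_DIAG_LIBRARY ELPA   ! default is SL (ScaLAPACK); ELPA needs an ELPA-enabled build
  &END GLOBAL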

Parallel Performance: H2O-xx
[Figure: time per MD step (seconds) vs. number of cores for the H2O-32 to H2O-2048 benchmarks, comparing XT3 Stage 0 (2005) with XC30 ARCHER (2013)]

Parallel Performance: LiH-HFX
[Figure: performance comparison of the LiH-HFX benchmark - time (seconds) vs. number of nodes used, ARCHER vs. HECToR; point labels give the OpenMP threads per process (2TH-8TH) and ARCHER/HECToR speedups of ~2.3-2.6x]

Parallel Performance: H2O-LS-DFT
[Figure: performance comparison of the H2O-LS-DFT benchmark - time (seconds) vs. number of nodes used, ARCHER vs. HECToR; point labels give the OpenMP threads per process (2TH-8TH) and ARCHER/HECToR speedups of ~2.0-4.7x]

Parallel Performance: H2O-64-RI-MP2
[Figure: performance comparison of the H2O-64-RI-MP2 benchmark - time (seconds) vs. number of nodes used, ARCHER vs. HECToR Phase 3; point labels give the threading (MPI-only to 8TH) and ARCHER/HECToR speedups of ~1.5-2.2x]

Parallel Performance: GPUs
- ~25% speedup, and only for DBCSR (sparse matrix multiplication)

CP2K Timing Report
 -------------------------------------------------------------------------------
 -                                T I M I N G                                  -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD       SELF TIME         TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.018    0.018   57.900   57.900
 qs_mol_dyn_low                       1  2.0    0.007    0.008   57.725   57.737
 qs_forces                           11  3.9    0.262    0.278   57.492   57.493
 qs_energies_scf                     11  4.9    0.005    0.006   55.828   55.836
 scf_env_do_scf                      11  5.9    0.000    0.001   51.007   51.019
 scf_env_do_scf_inner_loop           99  6.5    0.003    0.007   43.388   43.389
 velocity_verlet                     10  3.0    0.001    0.001   32.954   32.955
 qs_scf_loop_do_ot                   99  7.5    0.000    0.000   29.807   29.918
 ot_scf_mini                         99  8.5    0.003    0.004   28.538   28.627
 cp_dbcsr_multiply_d               2338 11.6    0.005    0.006   25.588   25.936
 dbcsr_mm_cannon_multiply          2338 13.6    2.794    3.975   25.458   25.809
 cannon_multiply_low               2338 14.6    3.845    4.349   14.697   15.980
 ot_mini                             99  9.5    0.003    0.004   15.701   15.942
 -------------------------------------------------------------------------------

CP2K Timing Report
Not just for developers!
- Check that communication is < 50% of the total runtime
- Check where most time is being spent:
  - Sparse matrix multiplication: cp_dbcsr_multiply_d
  - Dense matrix algebra: cp_fm_syevd (&DIAGONALIZATION), cp_fm_cholesky_* (&OT), cp_fm_gemm
  - FFT: fft3d_*
  - Collocate / integrate: calculate_rho_elec, integrate_v_rspace
- Control the level of granularity:
    &GLOBAL
      &TIMINGS
        THRESHOLD 0.00001   ! default is 0.02 (2%)
      &END TIMINGS
    &END GLOBAL

CP2K Performance Summary
- First look for algorithmic gains: cell size, SCF settings, preconditioner, choice of basis set, QM/MM, ADMM, ...
- Check the scaling of your system: run a few MD steps / a reduced MAX_SCF (see the sketch below)
- Almost all performance-critical code is in libraries:
  - Compiler optimisation: -O3 is good enough
  - Intel vs. gfortran vs. Cray: the difference is close to zero
- Before spending 1,000s of CPU hours, build libsmm, libgrid, ELPA, FFTW3
  - Or ask your local HPC support team :-)
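A minimal sketch of the kind of short scaling test suggested above, capping the number of MD steps and SCF iterations. This is not a complete input (the subsystem, XC and basis set sections are omitted), and the project name and values are placeholders:

  &GLOBAL
    PROJECT scaling_test
    RUN_TYPE MD
  &END GLOBAL
  &MOTION
    &MD
      ENSEMBLE NVE
      STEPS 5              ! just a few MD steps for timing
      TIMESTEP 0.5
    &END MD
  &END MOTION
  &FORCE_EVAL
    &DFT
      &SCF
        MAX_SCF 10         ! cap the SCF so each MD step costs roughly the same
      &END SCF
    &END DFT
  &END FORCE_EVAL

Run the same input on a range of core counts and compare the time per MD step from the timing report before committing to production runs.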