How to Use a Quantum Chemistry Code over 100,000 CPUs



Edoardo Aprà
Pacific Northwest National Laboratory

Outline
- NWChem description
- Parallelization strategies


NWChem description
- Developed at EMSL/PNNL
- Provides major modeling and simulation capability for molecular science
  - Broad range of molecules, including catalysts, biomolecules, and heavy elements
  - Solid state capabilities
- Performance characteristics designed for MPP
- Runs on a wide range of computers
- Open source, large user community
- Uses Global Arrays/ARMCI for parallelization

NWChem Structure
- Layers (top to bottom):
  - Generic Tasks: Energy, structure, ...
  - Molecular Calculation Modules: SCF energy, gradient, ...; DFT energy, gradient, ...; MD, NMR, Solvation, ...; Optimize, Dynamics, ...
  - Molecular Modeling Toolkit
  - Molecular Software Development Toolkit
- Components shown in the diagram: Run-time database, Integral API, Geometry Object, Basis Set Object, Parallel IO, Global Arrays, PeIGS, Memory Allocator, ...
- Object-oriented design: abstraction, data hiding, APIs
- Parallel programming model: non-uniform memory access, Global Arrays, MPI
- Infrastructure: GA, Parallel I/O, RTDB, MA, ...
- Program modules: communication only through the database; persistence for easy restart

Replicated vs Distributed: Matrix distribution
- Example: parallel computer made of 8 processors
- How to distribute the elements of a 2-D matrix?
- (figure: Distributed vs. Replicated layouts)

Approaches to parallelization
- Distributed data, pros:
  - Smaller memory usage
  - Better scaling at large number of processors (more later)
- Distributed data, cons:
  - More network traffic required at small/moderate number of processors
- Replicated data, pros:
  - Better scaling at small to moderate number of processors
- Replicated data, cons:
  - Larger memory footprint
  - Worse scaling at large number of processors, because of the collective operations needed to merge the resulting matrices
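The memory trade-off between the two layouts can be made concrete with a small sketch. It assumes a regular 2-D block distribution over a processor grid and 8-byte matrix elements; the function names `block_owner` and `bytes_per_process` are hypothetical and are not part of NWChem or Global Arrays.

```python
import math

def block_owner(i, j, n, p_rows, p_cols):
    """Rank owning element (i, j) of an n x n matrix under a regular
    2-D block distribution on a p_rows x p_cols processor grid (assumed layout)."""
    block_r = math.ceil(n / p_rows)   # rows held by each processor row
    block_c = math.ceil(n / p_cols)   # columns held by each processor column
    return (i // block_r) * p_cols + (j // block_c)

def bytes_per_process(n, n_procs, replicated, word=8):
    """Memory each process needs for an n x n matrix of `word`-byte elements."""
    total = n * n * word
    return total if replicated else math.ceil(total / n_procs)

# 8 processors arranged as a 2 x 4 grid, as in the slide's example
print(block_owner(7, 7, n=8, p_rows=2, p_cols=4))        # last element lives on rank 7
print(bytes_per_process(8000, 8, replicated=True))       # every process holds the full matrix
print(bytes_per_process(8000, 8, replicated=False))      # each process holds one eighth
```

With replicated data every process pays for the whole matrix regardless of the processor count, which is why the memory footprint, and the collective merge at the end, become the limiting factors at scale.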


Global Arrays
- Distributed dense arrays that can be accessed through a shared memory-like style
- High level abstraction layer for the application developer (that's me!)
- One-sided model: no need to worry about send/receive
- Physically distributed data presented as a single, shared data structure with global indexing, e.g., access A(4,3) rather than buf(7) on task 2
- Global Address Space

Gaussian DFT computational kernel
Evaluation of XC potential matrix element:

    $\rho(x_q) = \sum_{\mu\nu} D_{\mu\nu}\, \chi_\mu(x_q)\, \chi_\nu(x_q)$
    $F_{\mu\nu} \mathrel{+}= \sum_q w_q\, \chi_\mu(x_q)\, V_{xc}[\rho(x_q)]\, \chi_\nu(x_q)$

Task loop (pseudocode):

    my_next_task = SharedCounter()
    do i=1,max_i
      if(i.eq.my_next_task) then
        call ga_get()        ! fetch a patch of D
        (do work)
        call ga_acc()        ! accumulate the result into F
        my_next_task = SharedCounter()
      endif
    enddo
    barrier()

Both GA operations (ga_get on D, ga_acc on F) are greatly dependent on the communication latency.
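The shared-counter loop above can be imitated on a single machine to see how the dynamic load balancing works. The sketch below is only an analogy under stated assumptions: Python threads stand in for processes, a lock-protected integer stands in for the GA shared counter, a second lock stands in for the atomic `ga_acc`, and one task is one grid point. `SharedCounter` and `build_xc_matrix` are hypothetical names, not Global Arrays calls.

```python
import threading

class SharedCounter:
    """Atomic fetch-and-increment counter (stand-in for the GA shared counter)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            task = self._value
            self._value += 1
            return task

def build_xc_matrix(chi, weights, vxc, D, n_workers=4):
    """chi[q][mu] is basis function mu evaluated at grid point q."""
    n_bf = len(D)
    F = [[0.0] * n_bf for _ in range(n_bf)]
    f_lock = threading.Lock()          # plays the role of the atomic ga_acc
    counter = SharedCounter()

    def worker():
        task = counter.next()
        while task < len(chi):
            c = chi[task]              # "ga_get": fetch the data for this task
            # rho(x_q) = sum_{mu,nu} D[mu][nu] * chi_mu(x_q) * chi_nu(x_q)
            rho = sum(D[m][n] * c[m] * c[n]
                      for m in range(n_bf) for n in range(n_bf))
            v = weights[task] * vxc(rho)
            with f_lock:               # "ga_acc": F[mu][nu] += w_q chi_mu V_xc chi_nu
                for m in range(n_bf):
                    for n in range(n_bf):
                        F[m][n] += c[m] * v * c[n]
            task = counter.next()      # ask the counter for the next task

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # barrier
    return F
```

Because every task costs one counter update, one fetch, and one accumulate, the latency of those three operations bounds how fine-grained the tasks can usefully be.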

Parallel scaling of the DFT code
- (figure: scaling of the XC build for the Si 28 O 148 H system, LDA wavefunction; benchmark run on Cray XK6)
- (figure: scaling of the XC build for the Si 159 O 264 H system, LDA wavefunction; benchmark run on Cray XK6)

Hybrid computing
- Goes beyond the Send-Receive construct of Message-Passing (a.k.a. MPI)
- OpenMP (directives)
- Shared memory
- Cilk
- Intel TBB
- Threads

Mirrored Arrays: Matrix distribution
- Global Arrays that are replicated between SMP nodes but distributed within SMP nodes
- (figure: Distributed, Mirrored, and Replicated layouts)
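The mirrored layout sits between the other two in memory cost, which a few lines of arithmetic make explicit. This is a sketch assuming an n x n matrix of 8-byte elements and an even split of data among processes; `layout_bytes_per_process` is a hypothetical helper, not a Global Arrays routine.

```python
def layout_bytes_per_process(n, n_nodes, procs_per_node, layout, word=8):
    """Per-process memory for an n x n matrix under three layouts."""
    total = n * n * word
    if layout == "replicated":      # full copy on every process
        return total
    if layout == "mirrored":        # one copy per SMP node, split within the node
        return total / procs_per_node
    if layout == "distributed":     # one copy overall, split across all processes
        return total / (n_nodes * procs_per_node)
    raise ValueError("unknown layout: " + layout)

for layout in ("replicated", "mirrored", "distributed"):
    print(layout, layout_bytes_per_process(1000, n_nodes=4, procs_per_node=8, layout=layout))
```

In the mirrored case the per-process cost depends only on the number of processes per node, so adding nodes neither helps nor hurts memory use; the benefit is that most accesses stay inside the node.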

Trends in Chemistry Codes
- Multi-level parallelism
  - Effective path for most applications to scale to O(10^5) processors
  - Examples: coarse grain over vibrational degrees of freedom in a numerical Hessian, or geometries in a surface scan or parameter study
  - Conventional distributed memory within each subtask
  - Fine grain parallelism within a few-processor SMP (multi-threads, OpenMP, parallel BLAS, ...)
  - Example of application later in the talk
- Efficient exploitation of fine grain parallelism is a major concern on future architectures
- Emergence of accelerators
  - GPUs
  - Intel Xeon Phi

What is CCSD(T)?
- Coupled-cluster (CC) theory is a numerical many-body technique that incorporates the effect of electron correlation on the electronic structure of molecular systems
- CCSD(T) estimates the effect of electron correlation by considering single, double, and triple excitations
- Valence-only CCSD(T) calculations are the "gold standard" of quantum chemistry for their chemical accuracy in determining molecular energetics
- Numerical cost scales as N^7 (N = number of electrons)
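The N^7 scaling explains why modest increases in system size need very large machines. A minimal sketch of the arithmetic follows; it assumes the cost is proportional to N^7 with a constant prefactor, and `relative_cost` is a hypothetical helper name.

```python
def relative_cost(n_small, n_large, power=7):
    """Cost ratio between two system sizes for a method scaling as N**power."""
    return (n_large / n_small) ** power

# Doubling the number of electrons multiplies the cost by 2**7
print(relative_cost(1, 2))
# Growing a water cluster from 18 to 24 molecules (electron count scales with
# the number of molecules) costs roughly 7.5 times more
print(round(relative_cost(18, 24), 2))
```

A 33% larger cluster therefore needs about 7.5 times the floating-point work, which is the motivation for the scaling effort described in the rest of the talk.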

CCSD(T) algorithm
- aijkbc algorithm of Rendell and coworkers
- No use of I/O: intermediate quantities (two-electron integrals and coupled-cluster wave function amplitudes) are stored in global memory (GA)
- Floating-point intensive kernel of this algorithm: BLAS DGEMM calls

Main Steps of a CCSD(T) run
- MP2 energy and transformation of molecular orbitals
- Generation of the 2-electron integrals needed and storage in a Global Array (ga_acc)
- Main CCSD(T) loop fetching the 2-electron integrals with ga_get
- Computationally intensive kernel via dgemm

CCSD(T) kernel: code features - I
- To scale at 1K procs: increased data locality to reduce communication
- To scale at 40K procs: implemented more careful tiling of intermediates to reduce memory consumption and increase parallelism and load balance
- These two modifications can be seen as PGAS-style programming (distinction between local and global memory)

CCSD(T) kernel: code features - II
- Three levels of the memory hierarchy in a dynamically load balanced algorithm
- Intermediate results fit in available global memory
- Nested loops tiled so that the data for each task fits into local memory
- Each process accesses a global shared counter to determine the next task
- Data moved from global into local memory via ga_get
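The tiling idea described above can be sketched as follows: pick the largest tile whose block fits in local memory, cut each loop range into tiles, and treat every combination of tiles as one task to be handed out by the shared counter. The helper names and the two-index task structure are illustrative assumptions, not the actual NWChem loop nest.

```python
import itertools

def tile_ranges(n, tile):
    """Split range(n) into half-open (start, stop) tiles of at most `tile` elements."""
    return [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]

def largest_tile(local_bytes, n_index, word=8):
    """Largest tile edge t such that a block of t**n_index words fits in local memory.
    Returns 1 when even a single word does not fit (caller should check)."""
    t = 1
    while word * (t + 1) ** n_index <= local_bytes:
        t += 1
    return t

def task_list(n_occ, n_virt, tile):
    """One task per (occupied tile, virtual tile) pair of a tiled two-index loop."""
    return list(itertools.product(tile_ranges(n_occ, tile), tile_ranges(n_virt, tile)))

print(tile_ranges(10, 4))
print(largest_tile(8 * 16 ** 3, n_index=3))
print(len(task_list(10, 6, 4)))
```

Smaller tiles produce more tasks, which improves load balance but increases the number of counter updates and ga_get calls, so the tile size is a tuning parameter rather than a fixed choice.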

CCSD(T) run on Cray XT5: 18 water molecules (February 2009)
- (H2O)18: 54 atoms, 918 basis functions, cc-pVTZ(-f) basis
- Floating-point performance at 90K cores: 358 TFlops

CCSD(T) run on Cray XT5: 20 water molecules (February 2009)
- (H2O)20: 60 atoms, 1020 basis functions, cc-pVTZ(-f) basis
- Floating-point performance at 96K cores: 480 TFlops
- Efficiency > 50%


NWChem code changes required to scale beyond 100K cores
- Many-to-one communication patterns cause the job to progress very slowly (best case) or to hang (or worse)
- Diagnosis: cumbersome; lucky coredumps!
- Causes: several processors simultaneously accessing the same patch of a matrix, where the patch is owned by a single processor
- Solutions:
  - Staggered access
  - Use of a subset of processing elements
  - Atomic shared counter: first token is static

CCSD(T) run on Cray XT5: 24 water molecules (November 2009)
- (H2O)24: 72 atoms, 1224 basis functions, cc-pVTZ(-f) basis
- Floating-point performance at 223K cores: 1.39 PetaFLOP/s
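The staggered-access remedy for many-to-one traffic can be illustrated with a toy model. It assumes every rank must visit every patch once and that ranks advance in lock-step; under those assumptions, starting each rank at a different patch removes the pile-up on a single owner. The function names are hypothetical and the model ignores real network behavior.

```python
from collections import Counter

def access_order(rank, n_patches, staggered):
    """Order in which a rank visits the patches of a distributed matrix."""
    start = rank % n_patches if staggered else 0
    return [(start + k) % n_patches for k in range(n_patches)]

def worst_contention(n_ranks, n_patches, staggered):
    """Largest number of ranks hitting the same patch owner in any single step."""
    worst = 0
    for step in range(n_patches):
        hits = Counter(access_order(r, n_patches, staggered)[step]
                       for r in range(n_ranks))
        worst = max(worst, max(hits.values()))
    return worst

print(worst_contention(8, 8, staggered=False))   # all 8 ranks hit one owner each step
print(worst_contention(8, 8, staggered=True))    # each owner serves a single rank per step
```

When there are more ranks than patches the staggered order still spreads the load evenly, for example two ranks per owner with 16 ranks and 8 patches, instead of concentrating all 16 on one.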


Tensor Contraction Engine (TCE)
- Symbolic algebra systems for coding complicated tensor expressions: Tensor Contraction Engine (TCE)
  - Hirata, J. Phys. Chem. A 107, 9887 (2003)
  - Sadayappan, Krishnamoorthy, et al., Proceedings of the IEEE 93, 276 (2005)
  - Lai, Zhang, Rajbhandari, Valeev, Kowalski, Sadayappan, Procedia Computer Science (2012)
- New implementation of CC methods (since 2003)
  - More effective for implementing new methods
  - Easier tuning and porting

Tensor Contraction Engine (TCE)
- Tile structure: the occupied spinorbitals and the unoccupied spinorbitals are each partitioned into tiles S1, S2, ...
- Tensor structure: $T_i^a$ is stored as tiles $T^{[p_n]}_{[h_m]}$

New elements of parallel design for the iterative EOMCCSD method
- Use of Global Arrays (GA) to implement a distributed memory model
- Iterative CCSD/EOMCCSD methods (basic challenges)
  - Global memory requirements
  - Complex load balancing
  - Complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
- Implementation improvements
  - New way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
  - Replication of low-rank tensors
  - New task scheduling for the CCSD/EOMCCSD methods

New elements of parallel design for the non-iterative CR-EOMCCSD(T) method
- Use of Global Arrays (GA) to implement a distributed memory model
- Basic challenges for the non-iterative CR-EOMCCSD(T) method
  - Local memory requirements: (tilesize)^4 (EOMCCSD) vs. M*(tilesize)^6 (CR-EOMCCSD(T))
- Implementation improvements
  - Two-fold reduction of local memory use: 2*(tilesize)^6
  - New algorithms which enable the decomposition of six-dimensional tensors
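The jump from (tilesize)^4 to (tilesize)^6 is easier to appreciate with numbers. The sketch below assumes 8-byte double-precision elements and treats the prefactor as the number of tile-sized buffers held at once; the tile size of 20 is an arbitrary illustration, and the helper names are hypothetical.

```python
def local_bytes_eomccsd(tilesize, word=8):
    """Local buffer for a four-index tile, (tilesize)**4 elements."""
    return word * tilesize ** 4

def local_bytes_triples(tilesize, n_buffers, word=8):
    """Local buffers for six-index tiles, n_buffers * (tilesize)**6 elements."""
    return word * n_buffers * tilesize ** 6

t = 20  # illustrative tile size, not a value from the talk
print(local_bytes_eomccsd(t) / 1e6, "MB for a four-index tile")
print(local_bytes_triples(t, n_buffers=2) / 1e9, "GB for two six-index tiles")
```

At the same tile size the six-index buffers are hundreds of times larger than the four-index ones, so every buffer removed from the triples step directly raises the tile size that fits in local memory.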


Scalability of iterative EOMCC methods
- Alternative task schedulers
  - Use global task pool
  - Improve load balancing
  - Reduce the number of synchronization steps to the absolute minimum
  - Larger tiles can be effectively used

Scalability of the non-iterative EOMCC code
- 94% parallel efficiency using 210,000 cores
- (figure: scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS/ORNL.)

MRCC theory in a nutshell
- Reference function and model space
- (figure: schematic representation of the complete model space corresponding to two active electrons distributed over two active orbitals (red lines). Only determinants with M_S = 0 are included in the model space.)

MRCC approaches: main challenges
- Intruder-state/intruder-solution problems
- Complete model space
  - Huge dimensionality
  - A large number of superfluous configurations not contributing to a given state
- Overall cost of the MRCC methods
  - M × N^6 (iterative MRCCSD)
  - M × N^7 (non-iterative MRCCSD(T))
- Algebraic complexity of the MRCC methods
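The complete model space described in the figure caption is small enough to enumerate by hand, and a short sketch shows how its dimension M arises. It assumes M_S = 0 means equal numbers of alpha and beta electrons placed independently in the active orbitals, with no spatial-symmetry filtering; `ms0_determinants` is a hypothetical helper.

```python
from itertools import combinations

def ms0_determinants(n_electrons, n_orbitals):
    """All M_S = 0 determinants for n_electrons in n_orbitals active orbitals.
    Each determinant is (occupied alpha orbitals, occupied beta orbitals)."""
    if n_electrons % 2:
        raise ValueError("M_S = 0 needs an even number of active electrons")
    n_each = n_electrons // 2
    alpha = list(combinations(range(n_orbitals), n_each))
    beta = list(combinations(range(n_orbitals), n_each))
    return [(a, b) for a in alpha for b in beta]

# Two active electrons in two active orbitals, as in the schematic
for det in ms0_determinants(2, 2):
    print(det)
```

Since the cost estimates carry a factor of M, the growth of this list with the number of active electrons and orbitals is what makes the "huge dimensionality" item a practical concern.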


Processor groups (PGs) and reference level parallelism

    $R^{(\mu)}(T) = F^{(\mu)}(T^{(\mu)}) + G^{(\mu)}(T^{(1)}, \ldots, T^{(M)}) = 0, \quad \mu = 1, \ldots, M$

- The reference level parallelism can be applied in:
  - Solving coupled reference-dependent MRCC iterative equations
  - Building efficient parallel schemes for non-iterative MRCC methods

Processor groups (PGs) and reference level parallelism
- (figure: scalability of the BW-MRCCSD methods for β-carotene in the 6-31G basis set (470 basis functions); (4,4) model space of 20 references)
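Reference level parallelism amounts to giving each of the M references its own processor group. A minimal sketch of such a partition follows; it assumes contiguous rank ranges and an as-even-as-possible split, and `split_into_groups` is a hypothetical helper rather than a Global Arrays processor-group call.

```python
def split_into_groups(n_procs, n_refs):
    """Partition ranks 0..n_procs-1 into n_refs contiguous processor groups,
    one group per reference, with sizes differing by at most one."""
    if n_procs < n_refs:
        raise ValueError("need at least one process per reference")
    base, extra = divmod(n_procs, n_refs)
    groups, start = [], 0
    for g in range(n_refs):
        size = base + (1 if g < extra else 0)   # first `extra` groups get one more rank
        groups.append(list(range(start, start + size)))
        start += size
    return groups

for ref, ranks in enumerate(split_into_groups(10, 4), start=1):
    print("reference", ref, "-> ranks", ranks)
```

Each group can then work on its own reference-specific equations with the usual distributed-memory scheme, and only the coupling between references requires communication across groups.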


When triple excitations are needed: MRCCSD(T)
- Improve the quality of the MRCCSD approaches
- Counteract the intruder-state problem
- (equation: the effective Hamiltonian H^eff(T) is built from H^eff(SD) plus a triples (T) contribution)
- Numerical complexity: M × N^7
- Scalability: M × (scalability of the CCSD(T) approach)

GPU implementation of non-iterative part of the MRCCSD(T) approach
- ~4x speed-up; 5x observed in CCSD(T)
- Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach

Thanks
- Karol Kowalski, Kiran Bhaskaran-Nair (PNNL)
- Wenjing Ma, Sriram Krishnamoorthy, Jeff Daily, Abhinav Vishnu, Bruce Palmer (PNNL)
- Vinod Tipparaju (AMD)
- Ryan Olson (Cray)
- Jiri Brabec (Czech Ac. Sc.)
- Oreste Villa & Norbert Juffa (NVIDIA)

Acknowledgements
- PNNL Extreme Scale Computing Initiative
- Dept. of Energy Office of Biological and Environmental Research
- Resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, allocated through the INCITE program
- EMSL computing resources (Chinook HP system)

Thank you

Backup

GPU implementation of non-iterative part of the MRCCSD(T) approach
- Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach (Oreste Villa & Norbert Juffa, NVIDIA)
