Quantum Chemistry (QC) on GPUs. Feb. 2, 2017


1 Quantum Chemistry (QC) on GPUs Feb. 2, 2017

2 Overview of Life & Material Accelerated Apps
MD: All key codes are GPU-accelerated. Great multi-GPU performance. Focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes. ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more.
QC: All key codes are ported or being optimized. Focus on using GPU-accelerated math libraries and OpenACC directives. GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*. Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more.
green* = application where >90% of the workload is on GPU

3 MD vs. QC on GPUs
Classical Molecular Dynamics: Simulates positions of atoms over time; chemical-biological or chemical-material behaviors. Forces calculated from simple empirical formulas (bond rearrangement generally forbidden). Up to millions of atoms. Solvent included without difficulty. Single-precision dominated. Uses cuBLAS, cuFFT, CUDA. GeForce (academics) or Tesla (servers); ECC off.
Quantum Chemistry (MO, PW, DFT, semi-empirical): Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties. Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies). Up to a few thousand atoms. Generally in vacuum; if needed, solvent is treated classically (QM/MM) or with implicit methods. Double precision is important. Uses cuBLAS, cuFFT, OpenACC. Tesla recommended; ECC on.
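
Because QC workloads lean on dense double-precision linear algebra, much of the GPU work goes through cuBLAS. Below is a minimal, hypothetical sketch (not taken from any application above) of an FP64 matrix multiply offloaded with cuBLAS; the matrix size and contents are placeholders.

/* Sketch: FP64 GEMM on the GPU via cuBLAS (illustrative only).
 * Build (assumed toolchain): nvcc dgemm_demo.c -lcublas -o dgemm_demo */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;                          /* placeholder dimension */
    const size_t bytes = (size_t)n * n * sizeof(double);
    double *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* C = alpha * A * B + beta * C, entirely in double precision */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f (expected %.1f)\n", hC[0], 2.0 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

On Tesla-class parts with ECC on, calls like this are bound by FP64 throughput rather than FP32, which is why the QC column above differs from the MD one.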

4 Accelerating Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, the perfect target for fighting the infection. Without GPUs, the supercomputer would need to be 5x larger for similar performance.

5 GPU-Accelerated Quantum Chemistry Apps (green lettering indicates performance slides included)
ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, RMG, TeraChem, UNM, VASP, WL-LSMS
GPU performance compared against a dual multi-core x86 CPU socket.

6 ABINIT

7 ABINIT on GPUs
Speed in the parallel version: for ground-state calculations, GPUs can be used; this is based on CUDA+MAGMA. For ground-state calculations, the wavelet part of ABINIT (which is BigDFT) is also very well parallelized: MPI band parallelism, combined with GPUs.

8 BigDFT

9-14 Courtesy of BigDFT, CEA

15 Gaussian 16 April 2017

16 GPU-ACCELERATED GAUSSIAN 16 AVAILABLE
100% PGI OpenACC port (no CUDA). Gaussian is a top-ten HPC (quantum chemistry) application. % of use cases are GPU-accelerated (Hartree-Fock and DFT: energies, 1st derivatives (gradients) and 2nd derivatives); more functionality to come.
K40 and K80 supported; P100 support coming as a minor release. Performance is good, with faster wall-clock times; early P100 results are promising.
No pricing difference between the Gaussian CPU and GPU versions. Existing Gaussian 09 customers under a maintenance contract get a free upgrade; existing non-maintenance customers are required to pay an upgrade fee. To get the bits or to ask about the upgrade fee, please contact Gaussian, Inc.'s Jim Hess, Operations Manager: jhess@gaussian.com.
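
Gaussian's sources are proprietary, so the sketch below is purely illustrative of the directive-based style an OpenACC port relies on: a double-precision loop nest annotated so the PGI compiler can generate a GPU kernel while the CPU code path remains unchanged. All names, arrays and sizes here are invented, not Gaussian's.

/* Hypothetical OpenACC sketch (not Gaussian code): offloading a
 * double-precision contraction with directives.
 * Build (assumed toolchain): pgcc -acc -Minfo=accel demo.c */
#include <stdio.h>
#define N 512

static double d[N][N], g[N][N], f[N][N];   /* invented placeholder arrays */

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { d[i][j] = 0.01; g[i][j] = 0.02; }

    /* The directive, not hand-written CUDA, produces the GPU kernel;
     * compiling without -acc leaves a plain serial loop for the CPU. */
    #pragma acc parallel loop collapse(2) copyin(d, g) copyout(f)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += g[i][k] * d[k][j];
            f[i][j] = sum;
        }

    printf("f[0][0] = %g\n", f[0][0]);   /* 512 * 0.02 * 0.01 = 0.1024 */
    return 0;
}

Keeping one source tree that runs on both CPUs and GPUs is the design point of this approach, which matches the "100% OpenACC, no CUDA" claim above.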

17 Gaussian 16: rg-a25, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Alanine 25. Two steps: Force and Frequency. APFD 6-31G*, natoms = 259.
Wall times: 7.4 hrs (1 Haswell node), 6.4 hrs (1x K80), 5.8 hrs (2x K80), 5.1 hrs (4x K80).

18 Gaussian 16: rg-a25td, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Alanine 25. Two Time-Dependent (TD) steps: Force and Frequency. APFD 6-31G*, natoms = 259.
Wall times: 27.9 hrs (1 Haswell node), 25.8 hrs (1x K80), 22.6 hrs (2x K80), 20.1 hrs (4x K80).

19 Gaussian 16: rg-on, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
GFP ONIOM. Two steps: Force and Frequency. ONIOM(APFD/6-311+G(2d,p):amber=softfirst)=embed, natoms = 3715 (48/3667).
Wall times: 2.5 hrs (1 Haswell node), 2.1 hrs (1x K80), 1.7 hrs (2x K80), 1.5 hrs (4x K80).

20 Gaussian 16: rg-ontd, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
GFP ONIOM. Two Time-Dependent (TD) steps: Force and Frequency. ONIOM(APFD/6-311+G(2d,p):amber=softfirst)=embed, natoms = 3715 (48/3667).
Wall times: 55.4 hrs (1 Haswell node), 43.6 hrs (1x K80), 33.7 hrs (2x K80), 26.8 hrs (4x K80).

21 Gaussian 16: rg-val, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Valinomycin. Two steps: Force and Frequency. APFD G(2d,p), natoms = 168.
Configurations: 1 Haswell node; 1x, 2x, 4x K80.

22 Effects of using K80 boards: speed-up over runs without GPUs (vs. a Haswell E5 node) for rg-a25, rg-a25td, rg-on, rg-ontd and rg-val, as a function of the number of K80 boards, using the wall times from the preceding slides.

23 Gaussian 16 Supported Platforms
Four-way collaboration: Gaussian, Inc., PGI, NVIDIA and HPE; HPE + NVIDIA + PGI is the development platform.
All released/certified x86_64 versions of Gaussian 16 use the PGI compilers. Certified versions of Gaussian 16 use Intel compilers only for Itanium, XLF for some IBM platforms, Fujitsu compilers for some SPARC-based machines, and PGI for the rest (including some Apple products).
Gaussian, Inc. is collaborating with IBM, PGI (and NVIDIA) to release an OpenPOWER version of Gaussian that also uses the PGI compiler.
See Gaussian Supported Platforms for more details.

24 CLOSING REMARKS
Significant progress has been made in enabling Gaussian on GPUs with OpenACC. OpenACC is becoming increasingly versatile. Significant work lies ahead to improve performance and to expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections.

25 ACKNOWLEDGEMENTS
Development is taking place with: Hewlett-Packard (HP) SL2500-series servers (Intel Xeon E v2: 2.8 GHz, 10-core, 25 MB, 8.0 GT/s QPI, 115 W, DDR3-1866); NVIDIA Tesla GPUs (K40 and later); PGI Accelerator Compilers (16.x) with OpenACC (2.5 standard).

26 GPAW

27 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 1x E5-2687W CPU (8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 1x or 2x NVIDIA K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3.

28 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 1x E5-2687W CPU (8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3; labeled speedups up to 1.7x.

29 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 2x E5-2687W CPUs (8 cores per CPU); the green nodes contain 2x E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3.

30 Used with permission from Samuli Hakala


40 NWChem

41 NWChem 6.3 Release with GPU Acceleration
Addresses large, complex and challenging molecular-scale scientific problems in the areas of catalysis, materials, geochemistry and biochemistry on highly scalable, parallel computing platforms to obtain the fastest time-to-solution.
Researchers can, for the first time, perform large-scale coupled-cluster calculations with perturbative triples using NVIDIA GPU technology. A highly scalable multi-reference coupled-cluster capability will also be available in NWChem 6.3.
The software, released under the Educational Community License 2.0, can be downloaded from the NWChem website.

42 NWChem: Speedup of the non-iterative triples calculation for various configurations/tile sizes
System: a cluster of dual-socket nodes built from 8-core AMD Interlagos processors, 64 GB of memory and Tesla M2090 (Fermi) GPUs; the nodes are connected by a high-performance QDR InfiniBand interconnect.
Courtesy of K. Kowalski and K. Bhaskaran-Nair, PNNL; JCTC (submitted).

43 Kepler, Faster Performance (NWChem): time to solution (seconds) for the uracil molecule
Configurations: CPU only; CPU + 1x K20X; CPU + 2x K20X.
Performance improves by 2x with one GPU and by 3.1x with 2 GPUs.

44 Quantum Espresso December 2016

45 Quantum Espresso: AUSURF112 on K80s (seconds; lower is better)
Running Quantum Espresso. The blue node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs; the green node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs.
Configurations: 1 Broadwell node; 4x K80.

46 Quantum Espresso: AUSURF112 on P100s PCIe (seconds; lower is better)
Running Quantum Espresso. The blue node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs.
Configurations: 1 Broadwell node; 4x P100 PCIe; 8x P100 PCIe.
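
Plane-wave codes such as Quantum Espresso/PWscf spend much of their runtime in 3-D FFTs, which is where cuFFT (mentioned on slide 3) enters. Below is a minimal, self-contained sketch; the grid size and data are placeholders, not the AUSURF112 settings.

/* Sketch: double-precision 3-D complex FFT with cuFFT (illustrative only).
 * Build (assumed toolchain): nvcc fft3d_demo.c -lcufft -o fft3d_demo */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void) {
    const int n = 64;                               /* placeholder FFT grid */
    const size_t bytes = sizeof(cufftDoubleComplex) * n * n * n;

    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemset(d_data, 0, bytes);                   /* stand-in for a wavefunction grid */

    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_Z2Z);         /* FP64 complex-to-complex */

    /* Round trip between real and reciprocal space, as a plane-wave
     * SCF step does when applying local potentials. */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    printf("3-D Z2Z FFT round trip on a %d^3 grid complete\n", n);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}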

47 TeraChem 1.5K

48 TeraChem 1.5K: TripCage on Tesla K40s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x, 4x or 8x Tesla K40 @ 875MHz.

49 TeraChem 1.5K: TripCage on Tesla K40s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K40 @ 875MHz.

50 TeraChem 1.5K: TripCage on Tesla K80s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x or 4x Tesla K80 boards.

51 TeraChem 1.5K: TripCage on Tesla K80s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K80 boards.

52 TeraChem 1.5K: BPTI on Tesla K40s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x, 4x or 8x Tesla K40 @ 875MHz.

53 TeraChem 1.5K: BPTI on Tesla K80s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x or 4x Tesla K80 boards.

54 TeraChem 1.5K: BPTI on Tesla K40s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K40 @ 875MHz.

55 TeraChem 1.5K: BPTI on Tesla K80s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K80 boards.

56 TeraChem: Supercomputer Speeds on GPUs (time for SCF step, seconds)
TeraChem running on 8 C2050s in 1 node vs. NWChem running on 4096 quad-core CPUs in the Chinook supercomputer; giant fullerene C240 molecule.
4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000): similar performance from just a handful of GPUs.

57 TeraChem: Bang for the Buck (performance/price relative to the supercomputer: 493x)
TeraChem running on 8 C2050s in 1 node vs. NWChem running on 4096 quad-core CPUs in the Chinook supercomputer; giant fullerene C240 molecule. The hardware cost ratio alone is $19,000,000 / $31,000, roughly 613x, so even at somewhat lower absolute speed the GPU node comes out near 500x better per dollar.
Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Dollars spent on GPUs do 500x more science than those spent on CPUs.

58 Kepler's Even Better (seconds)
TeraChem running on C2050 and K20C. First graph: Olestra, BLYP/6-31G(d), 453 atoms; second graph: B3LYP/6-31G(d).
Kepler (K20C) performs 2x faster than Tesla (C2050).

59 VASP February 2017

60 VASP: Interface on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

61 VASP: Interface on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

62 VASP: Interface on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

63 VASP: Silica IFPEN on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

64 VASP: Silica IFPEN on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

65 VASP: Silica IFPEN on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

66 VASP: Si-Huge on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

67 VASP: Si-Huge on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

68 VASP: Si-Huge on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

69 VASP: SupportedSystems on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

70 VASP: SupportedSystems on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

71 VASP: SupportedSystems on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

72 VASP: NiAl-MD on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

73 VASP: NiAl-MD on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

74 VASP: NiAl-MD on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

75 VASP: LiZnO on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs.
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 2x, 4x P100 PCIe.

76 VASP: LiZnO on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

77 VASP: B.hR105 on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 105 boron atoms (β-rhombohedral structure); 216 bands; hybrid functional with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE. (exact exchange).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

78 VASP: B.hR105 on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 105 boron atoms (β-rhombohedral structure); 216 bands; hybrid functional with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE. (exact exchange).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

79 VASP: B.aP107 on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 107 boron atoms (symmetry-broken 107-atom β variant); 216 bands; hybrid-functional calculation (exact exchange) with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE.; no k-point parallelization.
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

80 VASP: B.aP107 on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 107 boron atoms (symmetry-broken 107-atom β variant); 216 bands; hybrid-functional calculation (exact exchange) with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE.; no k-point parallelization.
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.
