Quantum Chemistry (QC) on GPUs. Feb. 2, 2017


1 Quantum Chemistry (QC) on GPUs Feb. 2, 2017

2 Overview of Life & Material Accelerated Apps
MD: All key codes are GPU-accelerated. Great multi-GPU performance. Focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes. ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more.
QC: All key codes are ported or being optimized. Focus on using GPU-accelerated math libraries and OpenACC directives. GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*. Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more.
green* = application where >90% of the workload is on GPU

3 MD vs. QC on GPUs
Classical Molecular Dynamics: Simulates positions of atoms over time; chemical-biological or chemical-material behaviors. Forces calculated from simple empirical formulas (bond rearrangement generally forbidden). Up to millions of atoms. Solvent included without difficulty. Single-precision dominated. Uses cuBLAS, cuFFT, CUDA. GeForce (academics) or Tesla (servers); ECC off.
Quantum Chemistry (MO, PW, DFT, semi-empirical): Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties. Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies). Up to a few thousand atoms. Generally in vacuum; if needed, solvent is treated classically (QM/MM) or with implicit methods. Double precision is important. Uses cuBLAS, cuFFT, OpenACC. Tesla recommended; ECC on.
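
Because QC workloads lean on dense double-precision linear algebra, much of the GPU work goes through cuBLAS. Below is a minimal, hypothetical sketch (not taken from any application above) of an FP64 matrix multiply offloaded with cuBLAS; the matrix size and contents are placeholders.

/* Sketch: FP64 GEMM on the GPU via cuBLAS (illustrative only).
 * Build (assumed toolchain): nvcc dgemm_demo.c -lcublas -o dgemm_demo */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;                          /* placeholder dimension */
    const size_t bytes = (size_t)n * n * sizeof(double);
    double *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* C = alpha * A * B + beta * C, entirely in double precision */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f (expected %.1f)\n", hC[0], 2.0 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

On Tesla-class parts with ECC on, calls like this are bound by FP64 throughput rather than FP32, which is why the QC column above differs from the MD one.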

4 Accelerating Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, the perfect target for fighting the infection. Without GPUs, the supercomputer would need to be 5x larger for similar performance.

5 GPU-Accelerated Quantum Chemistry Apps (green lettering indicates performance slides included)
ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, RMG, TeraChem, UNM, VASP, WL-LSMS
GPU performance compared against a dual multi-core x86 CPU socket.

6 ABINIT

7 ABINIT on GPUs
Speed in the parallel version: for ground-state calculations, GPUs can be used; this is based on CUDA+MAGMA. For ground-state calculations, the wavelet part of ABINIT (which is BigDFT) is also very well parallelized: MPI band parallelism, combined with GPUs.

8 BigDFT

9-14 Courtesy of BigDFT, CEA

15 Gaussian 16 April 2017

16 GPU-ACCELERATED GAUSSIAN 16 AVAILABLE
100% PGI OpenACC port (no CUDA). Gaussian is a top-ten HPC (quantum chemistry) application. % of use cases are GPU-accelerated (Hartree-Fock and DFT: energies, 1st derivatives (gradients) and 2nd derivatives); more functionality to come.
K40 and K80 supported; P100 support coming as a minor release. Performance is good, with faster wall-clock times; early P100 results are promising.
No pricing difference between the Gaussian CPU and GPU versions. Existing Gaussian 09 customers under a maintenance contract get a free upgrade; existing non-maintenance customers are required to pay an upgrade fee. To get the bits or to ask about the upgrade fee, please contact Gaussian, Inc.'s Jim Hess, Operations Manager: jhess@gaussian.com.
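
Gaussian's sources are proprietary, so the sketch below is purely illustrative of the directive-based style an OpenACC port relies on: a double-precision loop nest annotated so the PGI compiler can generate a GPU kernel while the CPU code path remains unchanged. All names, arrays and sizes here are invented, not Gaussian's.

/* Hypothetical OpenACC sketch (not Gaussian code): offloading a
 * double-precision contraction with directives.
 * Build (assumed toolchain): pgcc -acc -Minfo=accel demo.c */
#include <stdio.h>
#define N 512

static double d[N][N], g[N][N], f[N][N];   /* invented placeholder arrays */

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { d[i][j] = 0.01; g[i][j] = 0.02; }

    /* The directive, not hand-written CUDA, produces the GPU kernel;
     * compiling without -acc leaves a plain serial loop for the CPU. */
    #pragma acc parallel loop collapse(2) copyin(d, g) copyout(f)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += g[i][k] * d[k][j];
            f[i][j] = sum;
        }

    printf("f[0][0] = %g\n", f[0][0]);   /* 512 * 0.02 * 0.01 = 0.1024 */
    return 0;
}

Keeping one source tree that runs on both CPUs and GPUs is the design point of this approach, which matches the "100% OpenACC, no CUDA" claim above.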

17 Gaussian 16: rg-a25, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Alanine 25. Two steps: Force and Frequency. APFD 6-31G*, natoms = 259.
Wall times: 7.4 hrs (1 Haswell node), 6.4 hrs (1x K80), 5.8 hrs (2x K80), 5.1 hrs (4x K80).

18 Gaussian 16: rg-a25td, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Alanine 25. Two Time-Dependent (TD) steps: Force and Frequency. APFD 6-31G*, natoms = 259.
Wall times: 27.9 hrs (1 Haswell node), 25.8 hrs (1x K80), 22.6 hrs (2x K80), 20.1 hrs (4x K80).

19 Gaussian 16: rg-on, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
GFP ONIOM. Two steps: Force and Frequency. ONIOM(APFD/6-311+G(2d,p):amber=softfirst)=embed, natoms = 3715 (48/3667).
Wall times: 2.5 hrs (1 Haswell node), 2.1 hrs (1x K80), 1.7 hrs (2x K80), 1.5 hrs (4x K80).

20 Gaussian 16: rg-ontd, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
GFP ONIOM. Two Time-Dependent (TD) steps: Force and Frequency. ONIOM(APFD/6-311+G(2d,p):amber=softfirst)=embed, natoms = 3715 (48/3667).
Wall times: 55.4 hrs (1 Haswell node), 43.6 hrs (1x K80), 33.7 hrs (2x K80), 26.8 hrs (4x K80).

21 Gaussian 16: rg-val, speed-up vs. dual-Haswell node on K80s
Running Gaussian 16. The blue node contains dual Intel Xeon E (Haswell) CPUs; the green nodes contain dual Intel Xeon E (Haswell) CPUs + Tesla K80 (autoboost) GPUs.
Valinomycin. Two steps: Force and Frequency. APFD G(2d,p), natoms = 168.
Configurations: 1 Haswell node; 1x, 2x, 4x K80.

22 Effects of using K80 boards: speed-up over runs without GPUs (vs. a Haswell E5 node) for rg-a25, rg-a25td, rg-on, rg-ontd and rg-val, as a function of the number of K80 boards, using the wall times from the preceding slides.

23 Gaussian 16 Supported Platforms
Four-way collaboration: Gaussian, Inc., PGI, NVIDIA and HPE; HPE + NVIDIA + PGI is the development platform.
All released/certified x86_64 versions of Gaussian 16 use the PGI compilers. Certified versions of Gaussian 16 use Intel compilers only for Itanium, XLF for some IBM platforms, Fujitsu compilers for some SPARC-based machines, and PGI for the rest (including some Apple products).
Gaussian, Inc. is collaborating with IBM, PGI (and NVIDIA) to release an OpenPOWER version of Gaussian that also uses the PGI compiler.
See Gaussian Supported Platforms for more details.

24 CLOSING REMARKS
Significant progress has been made in enabling Gaussian on GPUs with OpenACC. OpenACC is becoming increasingly versatile. Significant work lies ahead to improve performance and to expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections.

25 ACKNOWLEDGEMENTS
Development is taking place with: Hewlett-Packard (HP) SL2500-series servers (Intel Xeon E v2: 2.8 GHz, 10-core, 25 MB, 8.0 GT/s QPI, 115 W, DDR3-1866); NVIDIA Tesla GPUs (K40 and later); PGI Accelerator Compilers (16.x) with OpenACC (2.5 standard).

26 GPAW

27 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 1x E5-2687W CPU (8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 1x or 2x NVIDIA K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3.

28 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 1x E5-2687W CPU (8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3; labeled speedups up to 1.7x.

29 GPAW: Increased Performance with Kepler (speedup compared to CPU only)
Running GPAW. The blue nodes contain 2x E5-2687W CPUs (8 cores per CPU); the green nodes contain 2x E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.
Benchmarks: Silicon K=1, K=2, K=3.

30 Used with permission from Samuli Hakala


40 NWChem

41 NWChem 6.3 Release with GPU Acceleration
Addresses large, complex and challenging molecular-scale scientific problems in the areas of catalysis, materials, geochemistry and biochemistry on highly scalable, parallel computing platforms to obtain the fastest time-to-solution.
Researchers can, for the first time, perform large-scale coupled-cluster calculations with perturbative triples using NVIDIA GPU technology. A highly scalable multi-reference coupled-cluster capability will also be available in NWChem 6.3.
The software, released under the Educational Community License 2.0, can be downloaded from the NWChem website.

42 NWChem: Speedup of the non-iterative triples calculation for various configurations/tile sizes
System: a cluster of dual-socket nodes built from 8-core AMD Interlagos processors, 64 GB of memory and Tesla M2090 (Fermi) GPUs; the nodes are connected by a high-performance QDR InfiniBand interconnect.
Courtesy of K. Kowalski and K. Bhaskaran-Nair, PNNL; JCTC (submitted).

43 Kepler, Faster Performance (NWChem): time to solution (seconds) for the uracil molecule
Configurations: CPU only; CPU + 1x K20X; CPU + 2x K20X.
Performance improves by 2x with one GPU and by 3.1x with 2 GPUs.

44 Quantum Espresso December 2016

45 Quantum Espresso: AUSURF112 on K80s (seconds; lower is better)
Running Quantum Espresso. The blue node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs; the green node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs.
Configurations: 1 Broadwell node; 4x K80.

46 Quantum Espresso: AUSURF112 on P100s PCIe (seconds; lower is better)
Running Quantum Espresso. The blue node contains dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs.
Configurations: 1 Broadwell node; 4x P100 PCIe; 8x P100 PCIe.
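
Plane-wave codes such as Quantum Espresso/PWscf spend much of their runtime in 3-D FFTs, which is where cuFFT (mentioned on slide 3) enters. Below is a minimal, self-contained sketch; the grid size and data are placeholders, not the AUSURF112 settings.

/* Sketch: double-precision 3-D complex FFT with cuFFT (illustrative only).
 * Build (assumed toolchain): nvcc fft3d_demo.c -lcufft -o fft3d_demo */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void) {
    const int n = 64;                               /* placeholder FFT grid */
    const size_t bytes = sizeof(cufftDoubleComplex) * n * n * n;

    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemset(d_data, 0, bytes);                   /* stand-in for a wavefunction grid */

    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_Z2Z);         /* FP64 complex-to-complex */

    /* Round trip between real and reciprocal space, as a plane-wave
     * SCF step does when applying local potentials. */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    printf("3-D Z2Z FFT round trip on a %d^3 grid complete\n", n);
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}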

47 TeraChem 1.5K

48 TeraChem 1.5K: TripCage on Tesla K40s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x, 4x or 8x Tesla K40 @ 875MHz.

49 TeraChem 1.5K: TripCage on Tesla K40s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K40 @ 875MHz.

50 TeraChem 1.5K: TripCage on Tesla K80s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x or 4x Tesla K80 boards.

51 TeraChem 1.5K: TripCage on Tesla K80s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K80 boards.

52 TeraChem 1.5K: BPTI on Tesla K40s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x, 4x or 8x Tesla K40 @ 875MHz.

53 TeraChem 1.5K: BPTI on Tesla K80s & IVB CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v2 @ 2.70GHz + 1x, 2x or 4x Tesla K80 boards.

54 TeraChem 1.5K: BPTI on Tesla K40s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K40 @ 875MHz.

55 TeraChem 1.5K: BPTI on Tesla K80s & Haswell CPUs (total processing time in seconds)
Configurations (each 1 node): 2x Xeon E v3 @ 2.30GHz + 1x, 2x or 4x Tesla K80 boards.

56 TeraChem: Supercomputer Speeds on GPUs (time for SCF step, seconds)
TeraChem running on 8 C2050s in 1 node vs. NWChem running on 4096 quad-core CPUs in the Chinook supercomputer; giant fullerene C240 molecule.
4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000): similar performance from just a handful of GPUs.

57 TeraChem: Bang for the Buck (performance/price relative to the supercomputer: 493x)
TeraChem running on 8 C2050s in 1 node vs. NWChem running on 4096 quad-core CPUs in the Chinook supercomputer; giant fullerene C240 molecule. The hardware cost ratio alone is $19,000,000 / $31,000, roughly 613x, so even at somewhat lower absolute speed the GPU node comes out near 500x better per dollar.
Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Dollars spent on GPUs do 500x more science than those spent on CPUs.

58 Kepler's Even Better (seconds)
TeraChem running on C2050 and K20C. First graph: Olestra, BLYP/6-31G(d), 453 atoms; second graph: B3LYP/6-31G(d).
Kepler (K20C) performs 2x faster than Tesla (C2050).

59 VASP February 2017

60 VASP: Interface on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

61 VASP: Interface on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

62 VASP: Interface on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: interface between a platinum slab Pt(111) (108 atoms) and liquid water (120 water molecules), 468 ions; 1256 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

63 VASP: Silica IFPEN on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

64 VASP: Silica IFPEN on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

65 VASP: Silica IFPEN on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 240 ions, cristobalite (high) bulk; 720 bands; ALGO = Very Fast (RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

66 VASP: Si-Huge on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

67 VASP: Si-Huge on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

68 VASP: Si-Huge on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 512 Si atoms; 1282 bands; ALGO = Normal (blocked Davidson).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

69 VASP: SupportedSystems on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

70 VASP: SupportedSystems on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

71 VASP: SupportedSystems on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 267 ions; 788 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

72 VASP: NiAl-MD on K80s (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs. 1x K80 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x K80.

73 VASP: NiAl-MD on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

74 VASP: NiAl-MD on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

75 VASP: LiZnO on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs.
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 2x, 4x P100 PCIe.

76 VASP: LiZnO on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 500 ions; 3200 bands; ALGO = Fast (Davidson + RMM-DIIS).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

77 VASP: B.hR105 on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 105 boron atoms (β-rhombohedral structure); 216 bands; hybrid functional with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE. (exact exchange).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

78 VASP: B.hR105 on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 105 boron atoms (β-rhombohedral structure); 216 bands; hybrid functional with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE. (exact exchange).
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.

79 VASP: B.aP107 on P100s PCIe (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs. 1x P100 PCIe is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 107 boron atoms (symmetry-broken 107-atom β variant); 216 bands; hybrid-functional calculation (exact exchange) with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE.; no k-point parallelization.
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 PCIe.

80 VASP: B.aP107 on P100s SXM2 (performance in 1/seconds)
Running VASP. The blue node contains dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs; the green nodes contain dual Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs. 1x P100 SXM2 is paired with a single Intel Xeon E v4 @ 2.2GHz [3.6GHz Turbo] (Broadwell).
Benchmark: 107 boron atoms (symmetry-broken 107-atom β variant); 216 bands; hybrid-functional calculation (exact exchange) with blocked Davidson (ALGO = Normal), LHFCALC = .TRUE.; no k-point parallelization.
Configurations: 1 Broadwell node; 1x, 2x, 4x, 8x P100 SXM2.
