How to Use a Quantum Chemistry Code over 100,000 CPUs


Edoardo Aprà
Pacific Northwest National Laboratory

Outline
- NWChem description
- Parallelization strategies

NWChem description
- Developed at EMSL/PNNL
- Provides major modeling and simulation capability for molecular science
- Broad range of molecules, including catalysts, biomolecules, and heavy elements
- Solid state capabilities
- Performance characteristics designed for MPP
- Runs on a wide range of computers
- Open source, with a large user community
- Uses Global Arrays/ARMCI for parallelization

NWChem Structure
- Run-time database
- Generic tasks: energy, structure, ...
- Molecular calculation modules: SCF energy, gradient, ...; DFT energy, gradient, ...; MD, NMR, solvation, ...; optimize, dynamics, ...
- Molecular modeling toolkit: integral API, geometry object, basis set object, ..., PeIGS, ...
- Molecular software development toolkit: parallel I/O, Global Arrays, memory allocator
- Object-oriented design: abstraction, data hiding, APIs
- Parallel programming model: non-uniform memory access, Global Arrays, MPI
- Infrastructure: GA, parallel I/O, RTDB, MA, ...
- Program modules: communication only through the database; persistence for easy restart

Replicated vs Distributed: Matrix distribution
- Example: a parallel computer made of 8 processors
- How to distribute the elements of a 2-D matrix?
[Figure: the same matrix shown distributed and replicated]

Approaches to parallelization
Distributed data, pros:
- Smaller memory usage
- Better scaling at large numbers of processors (more later)
Distributed data, cons:
- More network traffic required at small to moderate numbers of processors
Replicated data, pros:
- Better scaling at small to moderate numbers of processors
Replicated data, cons:
- Larger memory footprint
- Worse scaling at large numbers of processors, because of the collective operations needed to merge the resulting matrices
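The memory side of this trade-off can be made concrete with a small sketch. It assumes a plain 2-D block distribution over a 2 x 4 process grid for the 8-processor example; the grid shape and the function names (`block_bounds`, `owner_of`, `elements_per_process`) are illustrative assumptions, not the layout Global Arrays actually chooses.

```python
def block_bounds(n, parts, idx):
    """Half-open range [lo, hi) of block idx when n indices are split into parts blocks."""
    base, extra = divmod(n, parts)
    lo = idx * base + min(idx, extra)
    hi = lo + base + (1 if idx < extra else 0)
    return lo, hi


def block_index(i, n, parts):
    """Index of the block that contains position i."""
    for b in range(parts):
        lo, hi = block_bounds(n, parts, b)
        if lo <= i < hi:
            return b
    raise IndexError(i)


def owner_of(i, j, n_rows, n_cols, grid=(2, 4)):
    """Rank owning element (i, j) under a 2-D block distribution (hypothetical layout)."""
    grid_rows, grid_cols = grid
    return block_index(i, n_rows, grid_rows) * grid_cols + block_index(j, n_cols, grid_cols)


def elements_per_process(n_rows, n_cols, grid=(2, 4), replicated=False):
    """Largest number of matrix elements that any single process must store."""
    if replicated:
        return n_rows * n_cols  # every process holds the whole matrix
    grid_rows, grid_cols = grid
    # Block 0 is never smaller than any other block in this scheme.
    max_rows = block_bounds(n_rows, grid_rows, 0)[1]
    max_cols = block_bounds(n_cols, grid_cols, 0)[1]
    return max_rows * max_cols
```

For an 8 x 8 matrix, `elements_per_process(8, 8)` gives 8 elements per process when distributed, against 64 when replicated; the price is that any element a process does not own has to be fetched over the network.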

Global Arrays
- Distributed dense arrays that can be accessed in a shared-memory-like style
- High-level abstraction layer for the application developer (that's me!)
- One-sided model: no need to worry about send/receive
- Physically distributed data presented as a single, shared data structure with global indexing, e.g., access A(4,3) rather than buf(7) on task 2
- Global Address Space

Gaussian DFT computational kernel
Evaluation of the XC potential matrix element:

    my_next_task = SharedCounter()
    do i=1,max_i
      if(i.eq.my_next_task) then
        call ga_get()
        (do work)
        call ga_acc()
        my_next_task = SharedCounter()
      endif
    enddo
    barrier()

ga_get fetches the density matrix D, which is used to evaluate the density on the grid points x_q:
$\rho(x_q) = \sum_{\mu\nu} D_{\mu\nu}\,\chi_\mu(x_q)\,\chi_\nu(x_q)$
ga_acc accumulates the result into the matrix F:
$F_{\mu\nu} \mathrel{+}= \sum_q w_q\,\chi_\mu(x_q)\,V_{xc}[\rho(x_q)]\,\chi_\nu(x_q)$

Both GA operations are greatly dependent on the communication latency.
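The shared-counter loop in the kernel can be imitated on a single machine. The sketch below is only an analogy under stated assumptions: Python threads stand in for processes, a lock-protected integer stands in for the atomic shared counter, a second lock around an addition stands in for ga_acc, and joining the threads stands in for the barrier. The names `SharedCounter` and `run_tasks` are hypothetical and are not part of the Global Arrays API.

```python
import threading


class SharedCounter:
    """Lock-protected counter; each call to next() hands out a unique task index."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            value = self._value
            self._value += 1
            return value


def run_tasks(n_tasks, n_workers, work):
    """Dynamically load-balanced loop: each worker keeps asking the counter for its next task."""
    counter = SharedCounter()
    acc_lock = threading.Lock()
    total = [0]                              # stands in for the global matrix F
    done_by = [[] for _ in range(n_workers)]

    def worker(rank):
        task = counter.next()
        while task < n_tasks:
            contribution = work(task)        # stands in for ga_get plus the local work
            with acc_lock:                   # stands in for the atomic ga_acc
                total[0] += contribution
            done_by[rank].append(task)
            task = counter.next()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:                        # joining plays the role of barrier()
        t.join()
    return total[0], done_by
```

Calling `run_tasks(100, 4, lambda t: t)` returns a total of 4950 however the tasks end up split between the workers. This is the point of the scheme: the result does not depend on which worker picks up which task, so faster workers simply take more of them.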

Parallel scaling of the DFT code
- System: Si 28 O 148 H, LDA wavefunction
- Benchmark run on Cray XK6
[Figure: parallel scaling of the XC build]

Parallel scaling of the DFT code
- System: Si 159 O 264 H, LDA wavefunction
- Benchmark run on Cray XK6
[Figure: parallel scaling of the XC build]

Hybrid computing
- Goes beyond the send-receive construct of message passing (a.k.a. MPI)
- Shared-memory options: OpenMP (directives), Cilk, Intel TBB, threads

Mirrored Arrays: Matrix distribution
- Global Arrays that are replicated between SMP nodes but distributed within SMP nodes
[Figure: matrix shown distributed, mirrored, and replicated]

Trends in Chemistry Codes
- Multi-level parallelism
  - Effective path for most applications to scale to $O(10^5)$ processors
  - Examples: coarse grain over vibrational degrees of freedom in a numerical Hessian, or over geometries in a surface scan or parameter study
  - Conventional distributed memory within each subtask
  - Fine-grain parallelism within a few-processor SMP (multi-threading, OpenMP, parallel BLAS, ...)
  - Example of an application later in the talk
- Efficient exploitation of fine-grain parallelism is a major concern on future architectures
- Emergence of accelerators: GPUs, Intel Xeon Phi

What is CCSD(T)?
- Coupled-cluster (CC) theory is a numerical many-body technique that incorporates the effect of electron correlation on the electronic structure of molecular systems
- CCSD(T) estimates the effect of electron correlation by considering single, double, and triple excitations
- Valence-only CCSD(T) calculations are the gold standard of quantum chemistry for their chemical accuracy in determining molecular energetics
- Numerical cost scales as $N^7$ (N = number of electrons)
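A quick arithmetic sketch shows how steep the $N^7$ scaling is. It assumes the cost is a pure power law with no prefactor change, and that the number of electrons grows in proportion to the number of molecules, as in a cluster of identical molecules; `relative_cost` is an illustrative helper name.

```python
def relative_cost(n_small, n_large, exponent=7):
    """Cost ratio between two system sizes under a pure N**exponent scaling law."""
    return (n_large / n_small) ** exponent
```

Doubling the system size multiplies the cost by 2^7 = 128. Growing a cluster from 18 to 24 identical molecules gives `relative_cost(18, 24)`, which is (4/3)^7, roughly 7.5 times the work for a third more molecules.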

CCSD(T) algorithm
- aijkbc algorithm of Rendell and coworkers
- No use of I/O
- Intermediate quantities (two-electron integrals and coupled-cluster wave function amplitudes) stored in global memory (GA)
- Floating-point-intensive kernel of this algorithm: BLAS DGEMM calls

Main steps of a CCSD(T) run
- MP2 energy and transformation of molecular orbitals
- Generation of the 2-electron integrals needed, and storage in a Global Array (ga_acc)
- Main CCSD(T) loop, fetching the 2-electron integrals with ga_get
- Computationally intensive kernel via dgemm

CCSD(T) kernel: code features - I
- To scale at 1K procs: increased data locality to reduce communication
- To scale at 40K procs: implemented more careful tiling of intermediates to reduce memory consumption and to increase parallelism and load balance
- These two modifications can be seen as PGAS-style programming (distinction between local and global memory)

CCSD(T) kernel: code features - II
- Three levels of the memory hierarchy in a dynamically load-balanced algorithm
- Intermediate results fit in available global memory
- Nested loops tiled so that the data for each task fits into local memory
- Each process accesses a global shared counter to determine its next task
- Data moved from global into local memory via ga_get
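The tiling of nested loops into tasks can be sketched as follows. This is a minimal illustration assuming a two-index loop nest and a single tile size for both indices; `tile_ranges` and `build_tasks` are hypothetical names, and the real kernel tiles more indices than shown here.

```python
def tile_ranges(n, tile):
    """Split range(n) into consecutive half-open tiles of at most `tile` indices."""
    return [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]


def build_tasks(n_i, n_j, tile):
    """One task per pair of tiles; task k is what a process runs when the counter returns k."""
    return [(ti, tj) for ti in tile_ranges(n_i, tile) for tj in tile_ranges(n_j, tile)]


def task_elements(task):
    """Number of (i, j) pairs, and hence the local buffer size, needed by one task."""
    (i_lo, i_hi), (j_lo, j_hi) = task
    return (i_hi - i_lo) * (j_hi - j_lo)
```

With `build_tasks(10, 6, 4)` the 60 index pairs are split into 6 tasks, none larger than 16 elements. A smaller tile lowers the local memory needed per task and gives more tasks to balance, at the cost of more ga_get calls.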

CCSD(T) run on Cray XT5: 18 water molecules
- February 2009
- Floating-point performance at 90K cores: 358 TFlops
- (H2O)18, 918 basis functions, cc-pVTZ(-f) basis

CCSD(T) run on Cray XT5: 20 water molecules
- February 2009
- Floating-point performance at 96K cores: 480 TFlops
- Efficiency > 50%
- (H2O)20, 1020 basis functions, cc-pVTZ(-f) basis

NWChem code changes required to scale beyond 100K cores
- Many-to-one communication patterns cause the job to progress very slowly (best case) or to hang (or worse)
- Diagnosis: cumbersome (lucky coredumps!)
- Cause: several processors simultaneously accessing the same patch of a matrix, where the patch is owned by a single processor
- Solutions:
  - Staggered access
  - Use of a subset of processing elements
  - Atomic shared counter: first token is static

CCSD(T) run on Cray XT5: 24 water molecules
- November 2009
- Floating-point performance at 223K cores: 1.39 PetaFLOP/s
- (H2O)24, 1224 basis functions, cc-pVTZ(-f) basis
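The slide names staggered access without giving details, so the following is one plausible reading rather than the NWChem implementation: every rank visits all patches, but starts at a rank-dependent offset so that the ranks do not all ask the same owner at the same moment. The lockstep contention model and the function names are assumptions made for illustration.

```python
from collections import Counter


def naive_order(rank, n_patches):
    """Every rank walks the patches in the same order: 0, 1, 2, ..."""
    return list(range(n_patches))


def staggered_order(rank, n_patches):
    """Each rank starts at a different patch and wraps around."""
    start = rank % n_patches
    return [(start + k) % n_patches for k in range(n_patches)]


def worst_contention(n_ranks, n_patches, order):
    """Largest number of ranks requesting the same patch in the same step (lockstep model)."""
    worst = 0
    for step in range(n_patches):
        hits = Counter(order(rank, n_patches)[step] for rank in range(n_ranks))
        worst = max(worst, max(hits.values()))
    return worst
```

In this simple model, 8 ranks reading 8 patches in the naive order all hit the same owner at every step (contention 8), while the staggered order brings that down to 1. With 16 ranks and 8 patches the staggered order gives 2.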

Tensor Contraction Engine (TCE)
- Symbolic algebra system for coding complicated tensor expressions
  - Hirata, J. Phys. Chem. A 107, 9887 (2003)
  - Sadayappan, Krishnamoorthy, et al., Proceedings of the IEEE 93, 276 (2005)
  - Lai, Zhang, Rajbhandari, Valeev, Kowalski, Sadayappan, Procedia Computer Science (2012)
- New implementation of CC methods (since 2003)
- More effective for implementing new methods
- Easier tuning and porting

Tensor Contraction Engine (TCE)
- Tile structure: the occupied and unoccupied spinorbitals are each divided into tiles labeled S1, S2, ...
- Tensor structure: $T_i^a \rightarrow T_{[h_m]}^{[p_n]}$

New elements of parallel design for the iterative EOMCCSD method
- Use of Global Arrays (GA) to implement a distributed memory model
- Iterative CCSD/EOMCCSD methods (basic challenges)
  - Global memory requirements
  - Complex load balancing
  - Complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
- Implementation improvements
  - New way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
  - Replication of low-rank tensors
  - New task scheduling for the CCSD/EOMCCSD methods

New elements of parallel design for the non-iterative CR-EOMCCSD(T) method
- Use of Global Arrays (GA) to implement a distributed memory model
- Basic challenges for the non-iterative CR-EOMCCSD(T) method
  - Local memory requirements: $(\mathrm{tilesize})^4$ (EOMCCSD) vs. $M \times (\mathrm{tilesize})^6$ (CR-EOMCCSD(T))
- Implementation improvements
  - Two-fold reduction of local memory use: $2 \times (\mathrm{tilesize})^6$
  - New algorithms which enable the decomposition of six-dimensional tensors
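To see why the sixth power matters, the local memory figures above can be turned into bytes. The sketch assumes 8-byte double-precision values and a tile size of 20, which is an arbitrary illustrative choice, not a value from the talk.

```python
BYTES_PER_DOUBLE = 8


def local_memory_bytes(tilesize, rank, n_blocks=1):
    """Bytes needed to hold n_blocks dense tensor blocks with `rank` indices of length tilesize."""
    return n_blocks * tilesize ** rank * BYTES_PER_DOUBLE
```

For tilesize = 20, a four-index block needs `local_memory_bytes(20, 4)` = 1,280,000 bytes (about 1.3 MB), while two six-index blocks need `local_memory_bytes(20, 6, n_blocks=2)` = 1,024,000,000 bytes (about 1 GB) per process. Every additional six-index block therefore costs another half a gigabyte at this tile size, which is why decomposing the six-dimensional tensors is listed as an improvement.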

Scalability of iterative EOMCC methods
- Alternative task schedulers
  - use a global task pool
  - improve load balancing
  - reduce the number of synchronization steps to the absolute minimum
  - larger tiles can be used effectively

Scalability of the non-iterative EOMCC code
- 94% parallel efficiency using 210,000 cores
- Scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS/ORNL.

MRCC theory in a nutshell
[Equation: MRCC wavefunction ansatz, with the labels "Reference function" and "Model space"; not recoverable from the transcription]
[Figure: Schematic representation of the complete model space corresponding to two active electrons distributed over two active orbitals (red lines). Only determinants with $M_S = 0$ are included in the model space.]

MRCC approaches: main challenges
- Intruder-state/intruder-solution problems
- Complete model space
  - Huge dimensionality
  - A large number of superfluous configurations not contributing to a given state
- Overall cost of the MRCC methods
  - $M \times N^6$ (iterative MRCCSD)
  - $M \times N^7$ (non-iterative MRCCSD(T))
- Algebraic complexity of the MRCC methods
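The complete model space described above (two active electrons in two active orbitals, $M_S = 0$ only) is small enough to enumerate by hand, and the same enumeration shows how quickly the dimensionality grows. The sketch below is a plain combinatorial count that ignores spatial symmetry; the function name and the `(orbital, spin)` encoding are illustrative assumptions.

```python
from itertools import combinations


def complete_model_space(n_active_orbitals, n_active_electrons, twice_ms=0):
    """All determinants that place the active electrons in the active spin orbitals
    with the requested spin projection (twice_ms = 2 * M_S)."""
    spin_orbitals = [(orbital, spin)
                     for orbital in range(n_active_orbitals)
                     for spin in (+1, -1)]          # +1 = alpha, -1 = beta
    return [det for det in combinations(spin_orbitals, n_active_electrons)
            if sum(spin for _, spin in det) == twice_ms]
```

`complete_model_space(2, 2)` contains 4 determinants: the two closed-shell ones and the two open-shell ones with one alpha and one beta electron. Since the cost of the MRCC methods carries the number of references M as a prefactor, every determinant added to the model space adds a full CC-like calculation.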

Processor groups (PGs) and reference-level parallelism
[Equations: reference-dependent MRCC equations, not recoverable from the transcription]
The reference-level parallelism can be applied to:
- Solving the coupled reference-dependent MRCC iterative equations
- Building efficient parallel schemes for non-iterative MRCC methods

Processor groups (PGs) and reference-level parallelism
- Scalability of the BW-MRCCSD methods for β-carotene in the 6-31G basis set (470 basis functions); (4,4) model space of 20 references
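Reference-level parallelism can be pictured as a two-step bookkeeping exercise: split the ranks into processor groups, then hand each group a share of the references. The sketch assumes contiguous groups and a round-robin assignment; both choices, and the function names, are illustrative and not taken from the NWChem source.

```python
def make_groups(n_ranks, n_groups):
    """Split ranks 0..n_ranks-1 into n_groups contiguous processor groups of near-equal size."""
    base, extra = divmod(n_ranks, n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def assign_references(n_references, n_groups):
    """Round-robin assignment of reference indices to processor groups."""
    return {g: list(range(g, n_references, n_groups)) for g in range(n_groups)}
```

For the 20-reference model space mentioned above and 4 groups, `assign_references(20, 4)` gives each group 5 references; group 0 handles references 0, 4, 8, 12 and 16. Within a group, the work for one reference is parallelized in the usual way, which is why the overall scalability is roughly M times that of the single-reference method.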

When triple excitations are needed: MRCCSD(T)
- Improve the quality of the MRCCSD approaches
- Counteract the intruder-state problem
[Equation: effective Hamiltonian with the (T) correction, not recoverable from the transcription]
- Numerical complexity: $M \times N^7$
- Scalability: $M \times$ (scalability of the CCSD(T) approach)

GPU implementation of the non-iterative part of the MRCCSD(T) approach
- ~4x speed-up; 5x observed in CCSD(T)
- Ongoing effort towards porting the iterative part of the MRCCSD(T) approach to GPUs

Thanks
- Karol Kowalski, Kiran Bhaskaran-Nair (PNNL)
- Wenjing Ma, Sriram Krishnamoorthy, Jeff Daily, Abhinav Vishnu, Bruce Palmer (PNNL)
- Vinod Tipparaju (AMD)
- Ryan Olson (Cray)
- Jiri Brabec (Czech Ac. Sc.)
- Oreste Villa & Norbert Juffa (NVIDIA)

Acknowledgements
- PNNL Extreme Scale Computing Initiative
- Dept. of Energy Office of Biological and Environmental Research
- Resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, allocated through the INCITE program
- EMSL computing resources (Chinook HP system)

Thank you

Backup

GPU implementation of the non-iterative part of the MRCCSD(T) approach
- Ongoing effort towards porting the iterative part of the MRCCSD(T) approach to GPUs (Oreste Villa & Norbert Juffa, NVIDIA)

Development of Intel MIC Codes in NWChem. Edoardo Aprà Pacific Northwest National Laboratory

Development of Intel MIC Codes in NWChem. Edoardo Aprà Pacific Northwest National Laboratory Development of Intel MIC Codes in NWChem Edoardo Aprà Pacific Northwest National Laboratory Acknowledgements! Karol Kowalski (PNNL)! Michael Klemm (Intel)! Kiran Bhaskaran-Nair (LSU)! Wenjing Ma (Chinese

More information

Performance Study of Popular Computational Chemistry Software Packages on Cray HPC Systems

Performance Study of Popular Computational Chemistry Software Packages on Cray HPC Systems Performance Study of Popular Computational Chemistry Software Packages on Cray HPC Systems Junjie Li (lijunj@iu.edu) Shijie Sheng (shengs@iu.edu) Raymond Sheppard (rsheppar@iu.edu) Pervasive Technology

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Scaling Applications on Blue Waters

Scaling Applications on Blue Waters May 23, 2013 New User BW Workshop May 22-23, 2013 NWChem NWChem is ab initio Quantum Chemistry package Compute electronic structure of molecular systems PNNL home: http://www.nwchem-sw.org 2 Coupled Cluster

More information

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design

More information

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015

PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015 PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein

STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC. Stefan Maintz, Dr. Markus Wetzstein STRATEGIES TO ACCELERATE VASP WITH GPUS USING OPENACC Stefan Maintz, Dr. Markus Wetzstein smaintz@nvidia.com; mwetzstein@nvidia.com Companies Academia VASP USERS AND USAGE 12-25% of CPU cycles @ supercomputing

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Dataflow Programming Paradigms for Computational Chemistry Methods

Dataflow Programming Paradigms for Computational Chemistry Methods University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 5-2017 Dataflow Programming Paradigms for Computational Chemistry Methods Heike

More information

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003

Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia May 10, 2003 New efficient large-scale fully asynchronous parallel algorithm for calculation of canonical MP2 energies. Alex A. Granovsky Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University,

More information

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION

More information

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department

Approaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department Approaches to acceleration: GPUs vs Intel MIC Fabio AFFINITO SCAI department Single core Multi core Many core GPU Intel MIC 61 cores 512bit-SIMD units from http://www.karlrupp.net/ from http://www.karlrupp.net/

More information

Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing

Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing DOI 10.1007/s10766-012-0214-9 Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing Vinod Tipparaju Edoardo Apra Weikuan Yu Xinyu Que Jeffrey S. Vetter Received:

More information

Building Multi-Petaflop Systems with MVAPICH2 and Global Arrays

Building Multi-Petaflop Systems with MVAPICH2 and Global Arrays Building Multi-Petaflop Systems with MVAPICH2 and Global Arrays ABHINAV VISHNU*, JEFFREY DAILY, BRUCE PALMER, HUBERTUS VAN DAM, KAROL KOWALSKI, DARREN KERBYSON, AND ADOLFY HOISIE PACIFIC NORTHWEST NATIONAL

More information

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Accelerating NWChem Coupled Cluster through dataflow-based execution

Accelerating NWChem Coupled Cluster through dataflow-based execution Special OriginalIssue ArticlePaper Accelerating NWChem Coupled Cluster through dataflow-based execution The International Journal High Performance Computing Applications 2018, Vol. 32(4) 540 551 ª The

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Early Experiences Writing Performance Portable OpenMP 4 Codes

Early Experiences Writing Performance Portable OpenMP 4 Codes Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory

Philip C. Roth. Computer Science and Mathematics Division Oak Ridge National Laboratory Philip C. Roth Computer Science and Mathematics Division Oak Ridge National Laboratory A Tree-Based Overlay Network (TBON) like MRNet provides scalable infrastructure for tools and applications MRNet's

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Harnessing GPU speed to accelerate LAMMPS particle simulations

Harnessing GPU speed to accelerate LAMMPS particle simulations Harnessing GPU speed to accelerate LAMMPS particle simulations Paul S. Crozier, W. Michael Brown, Peng Wang pscrozi@sandia.gov, wmbrown@sandia.gov, penwang@nvidia.com SC09, Portland, Oregon November 18,

More information

Hybrid programming with MPI and OpenMP On the way to exascale

Hybrid programming with MPI and OpenMP On the way to exascale Institut du Développement et des Ressources en Informatique Scientifique www.idris.fr Hybrid programming with MPI and OpenMP On the way to exascale 1 Trends of hardware evolution Main problematic : how

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED Key Technologies for 100 PFLOPS How to keep the HPC-tree growing Molecular dynamics Computational materials Drug discovery Life-science Quantum chemistry Eigenvalue problem FFT Subatomic particle phys.

More information

Steve Scott, Tesla CTO SC 11 November 15, 2011

Steve Scott, Tesla CTO SC 11 November 15, 2011 Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
