How to Use a Quantum Chemistry Code over 100,000 CPUs
Edoardo Aprà
Pacific Northwest National Laboratory

Outline
- NWChem description
- Parallelization strategies

NWChem description
- Developed at EMSL/PNNL
- Provides major modeling and simulation capability for molecular science
- Broad range of molecules, including catalysts, biomolecules, and heavy elements
- Solid state capabilities
- Performance characteristics designed for MPP
- Runs on a wide range of computers
- Open source, large user community
- Uses Global Arrays/ARMCI for parallelization

NWChem Structure
- Generic tasks: energy, structure, ...
- Molecular calculation modules: SCF energy, gradient, ...; DFT energy, gradient, ...; MD, NMR, solvation, ...; optimize, dynamics, ...
- Molecular modeling toolkit: run-time database, integral API, geometry object, basis set object, ...
- Molecular software development toolkit: parallel I/O, Global Arrays, PeIGS, memory allocator, ...
- Object-oriented design: abstraction, data hiding, APIs
- Parallel programming model: non-uniform memory access, Global Arrays, MPI
- Infrastructure: GA, parallel I/O, RTDB, MA, ...
- Program modules: communication only through the database; persistence for easy restart

Replicated vs Distributed: Matrix distribution
- Example: parallel computer made of 8 processors
- How to distribute the elements of a 2-D matrix?
[Figure panels: Distributed, Replicated]

Approaches to parallelization
Distributed data
- Pros: smaller memory usage; better scaling at large numbers of processors (more later)
- Cons: more network traffic required at small to moderate numbers of processors
Replicated data
- Pros: better scaling at small to moderate numbers of processors
- Cons: larger memory footprint; worse scaling at large numbers of processors, because of the collective operations needed to merge the resulting matrices
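
The distributed layout for the 8-processor example can be sketched as a block decomposition. This is only an illustration: the 2 x 4 process grid, the 8 x 8 matrix size, and the helper names `block_bounds` and `owner_of` are assumptions, not something the slides specify.

```python
# Hypothetical sketch: block distribution of a 2-D matrix over a process grid.
# The 2 x 4 grid (8 processors) and all helper names are illustrative assumptions.

def block_bounds(n, nblocks, b):
    """Half-open index range [lo, hi) of block b when n indices are split into nblocks."""
    base, extra = divmod(n, nblocks)
    lo = b * base + min(b, extra)
    hi = lo + base + (1 if b < extra else 0)
    return lo, hi


def owner_of(i, j, n_rows, n_cols, prow, pcol):
    """Map global element (i, j), 0-based, to (rank, local_i, local_j)."""
    for br in range(prow):
        rlo, rhi = block_bounds(n_rows, prow, br)
        if rlo <= i < rhi:
            break
    for bc in range(pcol):
        clo, chi = block_bounds(n_cols, pcol, bc)
        if clo <= j < chi:
            break
    return br * pcol + bc, i - rlo, j - clo


N = 8                # matrix dimension (assumed)
PROW, PCOL = 2, 4    # 8 processors arranged as a 2 x 4 grid (assumed)

rank, li, lj = owner_of(3, 2, N, N, PROW, PCOL)
print(f"global (3, 2) lives on rank {rank} at local ({li}, {lj})")

# Each rank stores one block in the distributed layout, the whole matrix when replicated.
print("elements per rank, distributed:", (N // PROW) * (N // PCOL))
print("elements per rank, replicated: ", N * N)
```

With these assumed sizes each rank stores one eighth of the matrix in the distributed layout, which is the memory saving listed under the pros.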

Global Arrays
- Distributed dense arrays that can be accessed in a shared-memory-like style
- High-level abstraction layer for the application developer (that's me!)
- One-sided model = no need to worry about send/receive
- Physically distributed data, presented as a single, shared data structure with global indexing
- Global address space: e.g., access A(4,3) rather than buf(7) on task 2

Gaussian DFT computational kernel
Evaluation of the XC potential matrix element:

  my_next_task = SharedCounter()
  do i=1,max_i
    if(i.eq.my_next_task) then
      call ga_get()      (fetch D)
      (do work)
      call ga_acc()      (accumulate into F)
      my_next_task = SharedCounter()
    endif
  enddo
  barrier()

The work done between the two GA calls is

$\rho(x_q) = \sum_{\mu\nu} D_{\mu\nu}\, \chi_\mu(x_q)\, \chi_\nu(x_q)$

$F_{\mu\nu} \mathrel{+}= \sum_q w_q\, \chi_\mu(x_q)\, V_{xc}[\rho(x_q)]\, \chi_\nu(x_q)$

Both GA operations are greatly dependent on the communication latency.
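
The shared-counter loop on this slide can be mimicked on a single machine. The sketch below is not NWChem or Global Arrays code: Python threads stand in for processes, a lock-protected integer stands in for the shared counter, and plain lists stand in for the global arrays D and F. The names `SharedCounter`, `build_xc_matrix`, `chi`, `w` and `vxc` are hypothetical. Each task handles one grid point q: it forms the density from D and the basis-function values, then accumulates its weighted contribution into F.

```python
import threading


class SharedCounter:
    """Lock-protected counter standing in for an atomic read-and-increment."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            value = self._value
            self._value += 1
            return value


def build_xc_matrix(D, chi, w, vxc, n_workers=4):
    """chi[q][mu] is basis function mu at grid point q; w[q] is the quadrature weight."""
    nbf, nq = len(D), len(w)
    F = [[0.0] * nbf for _ in range(nbf)]
    f_lock = threading.Lock()      # stands in for the atomicity of an accumulate
    counter = SharedCounter()

    def worker():
        q = counter.next()
        while q < nq:              # same effect as the slide's "if (i == my_next_task)" test
            # "get": read D and form the density at this grid point
            rho = sum(D[m][n] * chi[q][m] * chi[q][n]
                      for m in range(nbf) for n in range(nbf))
            v = vxc(rho)
            local = [[w[q] * chi[q][m] * v * chi[q][n] for n in range(nbf)]
                     for m in range(nbf)]
            # "acc": add the local block into the shared matrix
            with f_lock:
                for m in range(nbf):
                    for n in range(nbf):
                        F[m][n] += local[m][n]
            q = counter.next()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # plays the role of barrier()
    return F


# Toy data (assumed): 2 basis functions, 3 grid points, constant potential.
D = [[1.0, 0.5], [0.5, 1.0]]
chi = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
w = [0.25, 0.5, 0.25]
F = build_xc_matrix(D, chi, w, vxc=lambda rho: 1.0)
print(F)
```

Because tasks are claimed one at a time, a slow task does not hold up the others. The price is one counter access plus one get and one accumulate per task, which is why the slide stresses communication latency.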
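
Parallel scaling of the DFT code
- System: Si 28 O 148 H (hydrogen and basis-function counts missing from the transcription), LDA wavefunction
- Benchmark run on Cray XK6
[Figure: scaling of the XC build]

Parallel scaling of the DFT code
- System: Si 159 O 264 H (hydrogen and basis-function counts missing from the transcription), LDA wavefunction
- Benchmark run on Cray XK6
[Figure: scaling of the XC build]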

Hybrid computing
- Goes beyond the send-receive construct of message passing (a.k.a. MPI)
- OpenMP (directives)
- Shared memory
- Cilk
- Intel TBB
- Threads

Mirrored Arrays: Matrix distribution
- Global Arrays that are replicated between SMP nodes but distributed within SMP nodes
[Figure panels: Distributed, Mirrored, Replicated]
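
A rough way to compare the three layouts is the memory each process needs for one n x n matrix. The sketch below assumes double precision (8 bytes per element) and an even split; the function name `bytes_per_process` and the node counts are made up for illustration.

```python
def bytes_per_process(n, nodes, procs_per_node, scheme):
    """Approximate memory per process for one n x n double-precision matrix."""
    total = 8 * n * n                              # 8 bytes per element (assumed)
    if scheme == "replicated":
        return total                               # every process holds a full copy
    if scheme == "mirrored":
        return total / procs_per_node              # one copy per node, split inside the node
    if scheme == "distributed":
        return total / (nodes * procs_per_node)    # one copy split over all processes
    raise ValueError(f"unknown scheme: {scheme}")


# Assumed machine: 4 SMP nodes with 8 processes each, matrix dimension 1000.
for scheme in ("distributed", "mirrored", "replicated"):
    mb = bytes_per_process(1000, nodes=4, procs_per_node=8, scheme=scheme) / 1e6
    print(f"{scheme:12s} {mb:6.2f} MB per process")
```

Mirrored arrays sit between the two extremes: reads stay inside the node, while the copies on different nodes still have to be merged.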

Trends in Chemistry Codes
- Multi-level parallelism
  - Effective path for most applications to scale to O(10^5) processors
  - Examples: coarse grain over vibrational degrees of freedom in a numerical Hessian, or over geometries in a surface scan or parameter study
  - Conventional distributed memory within each subtask
  - Fine-grain parallelism within a few-processor SMP (multi-threading, OpenMP, parallel BLAS, ...)
  - Example of application later in the talk
  - Efficient exploitation of fine-grain parallelism is a major concern on future architectures
- Emergence of accelerators
  - GPUs
  - Intel Xeon Phi

What is CCSD(T)?
- Coupled-cluster (CC) theory is a numerical many-body technique that incorporates the effect of electron correlation on the electronic structure of molecular systems
- CCSD(T) estimates the effect of electron correlation by considering single, double and triple excitations
- Valence-only CCSD(T) calculations are the gold standard of quantum chemistry for their chemical accuracy in determining molecular energetics
- Numerical cost scales as N^7 (N = number of electrons)
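
The N^7 scaling is easier to appreciate with numbers. The helper below is plain arithmetic and not taken from the talk; `relative_cost` is a hypothetical name, and it assumes the cost is exactly proportional to N^7 with no prefactor change.

```python
def relative_cost(n_old, n_new, power=7):
    """Cost ratio when the system size grows from n_old to n_new, assuming cost ~ N**power."""
    return (n_new / n_old) ** power


# Doubling the number of electrons multiplies the cost by 2**7.
print(relative_cost(1, 2))                  # 128.0

# Going from 18 to 24 water molecules (electron count proportional to molecule count)
# raises the cost by (24/18)**7, roughly a factor of 7.5.
print(round(relative_cost(18, 24), 2))
```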

CCSD(T) algorithm
- aijkbc algorithm of Rendell and coworkers
- No use of I/O
- Intermediate quantities (two-electron integrals and coupled-cluster wave function amplitudes) stored in global memory (GA)
- Floating-point intensive kernel of this algorithm: BLAS DGEMM calls

Main Steps of a CCSD(T) run
- MP2 energy and transformation of molecular orbitals
- Generation of the 2-electron integrals needed and storage in a Global Array (ga_acc)
- Main CCSD(T) loop
  - Fetching the 2-electron integrals with ga_get
  - Computationally intensive kernel via dgemm

CCSD(T) kernel: code features - I
- To scale at 1K procs: increased data locality to reduce communication
- To scale at 40K procs: implemented more careful tiling of intermediates to reduce memory consumption and increase parallelism and load balance
- These two modifications can be seen as PGAS-style programming (distinction between local and global memory)

CCSD(T) kernel: code features - II
- Three levels of the memory hierarchy in a dynamically load-balanced algorithm
- Intermediate results fit in available global memory
- Nested loops tiled so that the data for each task fits into local memory
- Each process accesses a global shared counter to determine the next task
- Data moved from global into local memory via ga_get
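
The tiling idea can be sketched for a generic two-index loop nest. This is not the CCSD(T) kernel itself: `tile_ranges`, `make_tasks`, `largest_tile_that_fits`, and the memory model (a fixed number of words per index pair) are assumptions chosen to show how the tile size trades local memory against the number of tasks.

```python
def tile_ranges(n, tilesize):
    """Split range(n) into consecutive half-open tiles of at most tilesize indices."""
    return [(lo, min(lo + tilesize, n)) for lo in range(0, n, tilesize)]


def make_tasks(n_outer, n_inner, tilesize):
    """One task per pair of tiles from the two tiled loops."""
    return [(a, b)
            for a in tile_ranges(n_outer, tilesize)
            for b in tile_ranges(n_inner, tilesize)]


def largest_tile_that_fits(n_outer, n_inner, words_per_pair, local_words):
    """Largest tilesize whose task data (tilesize**2 * words_per_pair) fits in local memory."""
    t = min(n_outer, n_inner)
    while t > 1 and t * t * words_per_pair > local_words:
        t -= 1
    return t


# Assumed sizes: 100 x 100 index space, 1000 words per index pair,
# 400000 words of local memory per process.
t = largest_tile_that_fits(100, 100, words_per_pair=1000, local_words=400_000)
tasks = make_tasks(100, 100, t)
print("tile size:", t, "number of tasks:", len(tasks))
```

Smaller tiles mean less local memory per task and more tasks to hand out through the shared counter, at the cost of more ga_get calls.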

CCSD(T) run on Cray XT5: 18 water molecules
- February 2009
- Floating-point performance at 90K cores: 358 TFlops
- (H2O)18: 54 atoms, 918 basis functions
- cc-pVTZ(-f) basis

CCSD(T) run on Cray XT5: 20 water molecules
- February 2009
- Floating-point performance at 96K cores: 480 TFlops
- Efficiency > 50%
- (H2O)20: 60 atoms, 1020 basis functions
- cc-pVTZ(-f) basis

NWChem code changes required to scale beyond 100K cores
- Many-to-one communication patterns cause the job to progress very slowly (best case) or to hang (or worse)
- Diagnosis: cumbersome; lucky coredumps!
- Causes: several processors simultaneously accessing the same patch of a matrix, where the patch is owned by a single processor
- Solutions:
  - Staggered access
  - Use of a subset of processing elements
  - Atomic shared counter: first token is static

CCSD(T) run on Cray XT5: 24 water molecules
- November 2009
- Floating-point performance at 223K cores: 1.39 PetaFLOP/s
- (H2O)24: 72 atoms, 1224 basis functions
- cc-pVTZ(-f) basis
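
Staggered access can be illustrated with a toy model in which every process must visit every patch once. The model below is an assumption made for illustration (lock-step progress, hypothetical helper names); it only shows why starting each process at a different patch removes the many-to-one hot spot.

```python
def naive_order(rank, n_patches):
    """Every rank walks the patches in the same order: 0, 1, 2, ..."""
    return list(range(n_patches))


def staggered_order(rank, n_patches):
    """Each rank starts at a different patch and wraps around."""
    start = rank % n_patches
    return [(start + k) % n_patches for k in range(n_patches)]


def max_contention(n_ranks, n_patches, order_fn):
    """Largest number of ranks hitting the same patch in the same step (lock-step model)."""
    worst = 0
    for step in range(n_patches):
        hits = {}
        for rank in range(n_ranks):
            patch = order_fn(rank, n_patches)[step]
            hits[patch] = hits.get(patch, 0) + 1
        worst = max(worst, max(hits.values()))
    return worst


# 16 ranks reading 8 patches, each patch owned by a single process (assumed sizes).
print("naive:    ", max_contention(16, 8, naive_order))
print("staggered:", max_contention(16, 8, staggered_order))
```

In the naive order every rank hits the owner of patch 0 at the same time; with the staggered order the requests are spread evenly over the owners.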

Tensor Contraction Engine (TCE)
- Symbolic algebra systems for coding complicated tensor expressions: Tensor Contraction Engine (TCE)
  - Hirata, J. Phys. Chem. A 107, 9887 (2003)
  - Sadayappan, Krishnamoorthy, et al., Proceedings of the IEEE 93, 276 (2005)
  - Lai, Zhang, Rajbhandari, Valeev, Kowalski, Sadayappan, Procedia Computer Science (2012)
- New implementation of CC methods (since 2003)
  - More effective for implementing new methods
  - Easier tuning and porting

Tensor Contraction Engine (TCE)
- Tile structure: occupied and unoccupied spinorbitals are each partitioned into tiles S1, S2, ...
- Tensor structure: $T_i^a \rightarrow T^{[p_n]}_{[h_m]}$

New elements of parallel design for the iterative EOMCCSD method
- Use of Global Arrays (GA) to implement a distributed memory model
- Iterative CCSD/EOMCCSD methods (basic challenges)
  - Global memory requirements
  - Complex load balancing
  - Complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
- Implementation improvements
  - New way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
  - Replication of low-rank tensors
  - New task scheduling for the CCSD/EOMCCSD methods

New elements of parallel design for the non-iterative CR-EOMCCSD(T) method
- Use of Global Arrays (GA) to implement a distributed memory model
- Basic challenges for the non-iterative CR-EOMCCSD(T) method
  - Local memory requirements: (tilesize)^4 (EOMCCSD) vs. M*(tilesize)^6 (CR-EOMCCSD(T))
- Implementation improvements
  - Two-fold reduction of local memory use: 2*(tilesize)^6
  - New algorithms which enable the decomposition of six-dimensional tensors
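
The jump from (tilesize)^4 to (tilesize)^6 is easiest to see in bytes. The helper below assumes 8-byte double-precision words; `tile_bytes` is a hypothetical name, and `copies` plays the role of the prefactor in front of (tilesize)^6, shown here only for the reduced value of 2 quoted on the slide.

```python
def tile_bytes(tilesize, rank, copies=1, word=8):
    """Local memory for `copies` tensors of the given rank, each index spanning one tile."""
    return copies * word * tilesize ** rank


for ts in (10, 20, 30):
    eom = tile_bytes(ts, rank=4)                   # (tilesize)^4, EOMCCSD
    triples = tile_bytes(ts, rank=6, copies=2)     # 2*(tilesize)^6, CR-EOMCCSD(T)
    print(f"tilesize {ts:2d}: {eom / 1e6:8.2f} MB  vs  {triples / 1e6:10.2f} MB")
```

Because the triples buffers grow with the sixth power of the tile size, a modest increase in tile size can exhaust the local memory of a node, which is why halving the prefactor matters.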

Scalability of iterative EOMCC methods
- Alternative task schedulers
  - Use a global task pool
  - Improve load balancing
  - Reduce the number of synchronization steps to the absolute minimum
  - Larger tiles can be used effectively

Scalability of the non-iterative EOMCC code
- 94% parallel efficiency using 210,000 cores
- Scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS/ORNL in

MRCC theory in a nutshell
- Reference function and model space
  [Equation not recoverable from the transcription]
- Schematic representation of the complete model space corresponding to two active electrons distributed over two active orbitals (red lines). Only determinants with M_S = 0 are included in the model space.

MRCC approaches: main challenges
- Intruder-state/intruder-solution problems
- Complete model space
  - Huge dimensionality
  - A large number of superfluous configurations not contributing to a given state
- Overall cost of the MRCC methods
  - $M N^6$ (iterative MRCCSD)
  - $M N^7$ (non-iterative MRCCSD(T))
- Algebraic complexity of the MRCC methods

Processor groups (PGs) and reference-level parallelism
[Equations: reference-dependent MRCC equations for references 1, ..., M; not recoverable from the transcription]
- The reference-level parallelism can be applied to:
  - Solving the coupled, reference-dependent MRCC iterative equations
  - Building efficient parallel schemes for non-iterative MRCC methods

Processor groups (PGs) and reference-level parallelism
- Scalability of the BW-MRCCSD methods for β-carotene in the 6-31G basis set (470 basis functions); (4,4) model space of 20 references
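
Reference-level parallelism amounts to splitting the processes into one group per reference and running a conventional parallel calculation inside each group. The partitioning below is a minimal sketch with a hypothetical helper `make_groups`; how the actual code forms and balances its processor groups is not described on the slides.

```python
def make_groups(n_procs, n_refs):
    """Split ranks 0..n_procs-1 into n_refs contiguous groups of near-equal size."""
    if n_procs < n_refs:
        raise ValueError("need at least one process per reference")
    base, extra = divmod(n_procs, n_refs)
    groups, start = [], 0
    for mu in range(n_refs):
        size = base + (1 if mu < extra else 0)   # the first `extra` groups get one more rank
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# 20 references, as in the model space quoted on the slide; 100 processes is an assumed count.
groups = make_groups(100, 20)
print(len(groups), "groups of", len(groups[0]), "ranks each")
print("ranks working on reference 3:", groups[3])
```

Each group then works on the equations of its own reference, and only the coupling between references requires communication across groups.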

When triple excitations are needed: MRCCSD(T)
- Improve the quality of the MRCCSD approaches
- Counteract the intruder-state problem
[Equation: effective Hamiltonian with the triples correction; not recoverable from the transcription]
- Numerical complexity: $M N^7$
- Scalability: M × (scalability of the CCSD(T) approach)

GPU implementation of non-iterative part of the MRCCSD(T) approach
- ~4x speed-up. Observed 5x in CCSD(T).
- Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach

Thanks
- Karol Kowalski, Kiran Bhaskaran-Nair (PNNL)
- Wenjing Ma, Sriram Krishnamoorthy, Jeff Daily, Abhinav Vishnu, Bruce Palmer (PNNL)
- Vinod Tipparaju (AMD)
- Ryan Olson (Cray)
- Jiri Brabec (Czech Ac. Sc.)
- Oreste Villa & Norbert Juffa (NVIDIA)

Acknowledgements
- PNNL eXtreme Scale Computing Initiative
- Dept. of Energy Office of Biological and Environmental Research
- Resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, allocated through the INCITE program
- EMSL computing resources (Chinook HP system)

Thank you

Backup

GPU implementation of non-iterative part of the MRCCSD(T) approach
- Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach (Oreste Villa & Norbert Juffa, NVIDIA)
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationDebugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.
Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationAn Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language
An Integrated Synchronization and Consistency Protocol for the Implementation of a High-Level Parallel Programming Language Martin C. Rinard (martin@cs.ucsb.edu) Department of Computer Science University
More information6.1 Multiprocessor Computing Environment
6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,
More informationTime-dependent density-functional theory with massively parallel computers. Jussi Enkovaara CSC IT Center for Science, Finland
Time-dependent density-functional theory with massively parallel computers Jussi Enkovaara CSC IT Center for Science, Finland Outline Overview of the GPAW software package Parallelization for time-dependent
More informationSIAL Course Lectures 3 & 4
SIAL Course Lectures 3 & 4 Victor Lotrich, Mark Ponton, Erik Deumens, Rod Bartlett, Beverly Sanders AcesQC, LLC QTP, University of Florida Gainesville, Florida SIAL Course Lect 3 & 4 July 2009 1 Lecture
More informationGROMACS (GPU) Performance Benchmark and Profiling. February 2016
GROMACS (GPU) Performance Benchmark and Profiling February 2016 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Dell, Mellanox, NVIDIA Compute
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationPreliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede
Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Qingyu Meng, Alan Humphrey, John Schmidt, Martin Berzins Thanks to: TACC Team for early access to Stampede J. Davison
More informationOpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4
OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted
More informationCP2K: HIGH PERFORMANCE ATOMISTIC SIMULATION
CP2K: HIGH PERFORMANCE ATOMISTIC SIMULATION Iain Bethune ibethune@epcc.ed.ac.uk http://tinyurl.com/mcc-ukcp-2016 CP2K Overview CP2K is a program to perform atomistic and molecular simulations of solid
More informationPerformance Evaluation of Quantum ESPRESSO on SX-ACE. REV-A Workshop held on conjunction with the IEEE Cluster 2017 Hawaii, USA September 5th, 2017
Performance Evaluation of Quantum ESPRESSO on SX-ACE REV-A Workshop held on conjunction with the IEEE Cluster 2017 Hawaii, USA September 5th, 2017 Osamu Watanabe Akihiro Musa Hiroaki Hokari Shivanshu Singh
More informationIntro to Parallel Computing
Outline Intro to Parallel Computing Remi Lehe Lawrence Berkeley National Laboratory Modern parallel architectures Parallelization between nodes: MPI Parallelization within one node: OpenMP Why use parallel
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationParallel Programming Environments. Presented By: Anand Saoji Yogesh Patel
Parallel Programming Environments Presented By: Anand Saoji Yogesh Patel Outline Introduction How? Parallel Architectures Parallel Programming Models Conclusion References Introduction Recent advancements
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More information