AACE: Applications. Director, Application Acceleration Center of Excellence National Institute for Computational Sciences glenn-

Transcription

1 AACE: Applications
R. Glenn Brook, Director, Application Acceleration Center of Excellence, National Institute for Computational Sciences, glenn-
Ryan C. Hulguin, Computational Science Associate, National Institute for Computational Sciences, ryan-

2 Codes Investigated by AACE on the Intel Xeon Phi Coprocessor
Science codes ported or optimized through the Beacon Project:
  Chemistry: NWChem (ported)
  Astrophysics: Enzo (ported and optimized)
  Magnetospheric Physics: H3D (ported and optimized)
Other codes of interest:
  Electronic Structures: Elk FP-LAPW (ported)
  Computational Fluid Dynamics (CFD): Euler and BGK Boltzmann Solver (ported and optimized)
  Linear Algebra routines: SGEMM and DGEMM (ported)

3 Enzo
Community code for computational astrophysics and cosmology
More than 1 million lines of code
Uses powerful adaptive mesh refinement
Highly vectorized, with a hybrid MPI + OpenMP programming model
Utilizes the HDF5 and HYPRE libraries
Multiple MPI tasks per coprocessor and many threads per MPI task
Enzo was ported and optimized for the Intel Xeon Phi Coprocessor by Dr. Robert Harkness, harkness@sdsc.edu

4 Preliminary Scaling Study: Native ENZO-C
128^3 mesh (non-AMR), pure MPI, native mode
[Figure: Speedup versus Number of Threads; series: Observed, Ideal]
Results were generated on the Intel Knights Ferry software development platform.
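The "Observed" and "Ideal" curves in a scaling study like this one are usually derived from wall-clock timings. The following is a minimal sketch of that arithmetic, assuming the conventional definitions of relative speedup and parallel efficiency. The function names are hypothetical, and the slides do not state how their curves were computed.

```c
#include <assert.h>

/* Relative speedup: run time of the baseline divided by run time of the scaled run. */
double speedup(double t_base, double t_scaled)
{
    assert(t_base > 0.0 && t_scaled > 0.0);
    return t_base / t_scaled;
}

/* Ideal (linear) speedup when going from n_base to n_scaled workers. */
double ideal_speedup(int n_base, int n_scaled)
{
    assert(n_base > 0 && n_scaled > 0);
    return (double)n_scaled / (double)n_base;
}

/* Parallel efficiency: observed speedup as a fraction of the ideal speedup. */
double efficiency(double t_base, int n_base, double t_scaled, int n_scaled)
{
    return speedup(t_base, t_scaled) / ideal_speedup(n_base, n_scaled);
}
```

With made-up timings of 100 s on 1 worker and 25 s on 8 workers, `speedup` gives 4, `ideal_speedup` gives 8, and `efficiency` gives 0.5.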

5 Hybrid3d (H3D)
Provides breakthrough kinetic simulations of the Earth's magnetosphere
Models the complex solar wind-magnetosphere interaction using both electron fluid and kinetic ions, unlike magnetohydrodynamics (MHD), which completely ignores ion kinetic effects
Contains the following HPC innovations:
  1. multi-zone (asynchronous) algorithm
  2. dynamic load balancing
  3. code adaptation and optimization to large number of cores
Hybrid3d (H3D) was provided for porting to the Intel Xeon Phi Coprocessor by Dr. Homa Karimabadi, hkarimabadi@ucsd.edu

6 Hybrid3d (H3D) Performance
H3D Speedup on the Intel Xeon Phi Coprocessor (codename Knights Corner)
[Figure: Relative Speedup versus Number of MPI Processes; series: Observed, Ideal Speedup]
Optimizations were provided by Intel senior software engineer Rob Van der Wijngaart.
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

7 Elk FP-LAPW
Paramount to extracting functionality from these advanced materials is having a detailed understanding of their electronic, magnetic, vibrational, and optical properties. Elk is a software platform which allows for the understanding of these properties from a first-principles approach. It employs electronic structure techniques such as density functional theory, Hartree-Fock theory, and Green's function theory for the calculation of relevant properties from first principles.
Fortran 90
Efficient hybrid MPI + OpenMP parallelization
[Figure: Antiferromagnetic structure of Sr2CuO3]
Elk was ported to the Intel Xeon Phi Coprocessor by W. Scott Thornton, wsttiger@gmail.com

8 Elk FP-LAPW Performance
Elk uses master-slave parallelism where orbitals for different momenta are computed semi-independently. In this test 27 and 64 different crystal momenta were used. The test case was bulk silicon.
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with A0 HW and Beta SW GHz and 8 GB of GDDR GHz

9 Computational Fluid Dynamics (CFD)
Two CFD solvers were developed in house at NICS:
  The first solver is based on the Euler equations
  The second solver is based on model Boltzmann equations
[Figure: Unsteady solution of a Sod shock using the Euler equations]
[Figure: Steady-state solution of a Couette flow using the Boltzmann equation with BGK collision approximation]
The above CFD solvers were developed for the Intel Xeon Phi Coprocessor by Ryan C. Hulguin, ryan-hulguin@tennessee.edu
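For reference, the two models named on this slide are commonly written as follows. The slides do not give the equations, so this is the textbook form under stated assumptions: one spatial dimension and an ideal gas for the Euler system, and no external force for the BGK model. Here \rho is density, u velocity, E total energy per unit volume, p pressure, f the distribution function, \boldsymbol{\xi} the molecular velocity, \nu the collision frequency, and f^{eq} the local Maxwellian.

```latex
% 1-D Euler equations in conservation form (assumed: ideal gas, ratio of specific heats \gamma)
\frac{\partial}{\partial t}
\begin{pmatrix} \rho \\ \rho u \\ E \end{pmatrix}
+ \frac{\partial}{\partial x}
\begin{pmatrix} \rho u \\ \rho u^2 + p \\ u\,(E + p) \end{pmatrix} = 0,
\qquad
p = (\gamma - 1)\left(E - \tfrac{1}{2}\rho u^2\right)

% Boltzmann equation with the BGK collision approximation (assumed: no external force)
\frac{\partial f}{\partial t} + \boldsymbol{\xi} \cdot \nabla_{\mathbf{x}} f
= \nu \left(f^{\mathrm{eq}} - f\right)
```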

10 Impact of Various Optimizations on the Model Boltzmann Equation Solver
The Model Boltzmann Equation solver was optimized by Intel software engineer Rob Van der Wijngaart. He took a baseline solver where all loops were vectorized except for one, and applied the following optimizations to get the most performance out of the Intel Xeon Phi Coprocessor (codename Knights Corner):
Set I, Loop Vectorization:
  Stack variable pulled out of the loop
  Class member turned into a regular structure
Set II, Data Access:
  Arrays linearized using macros
  Align data for more efficient access
Set III, Parallel Overhead:
  Reduce the number of parallel sections
Set IV, Dependency:
  Remove reduction from computational loop by saving value into a private variable
Set V, Precision:
  Use medium precision for math function calls (-fimf-precision=medium)
Set VI, Precision:
  Use single precision constants and intrinsics
Set VII, Compiler Hints:
  Use #pragma SIMD instead of #pragma IVDEP
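Several of the optimization sets above can be illustrated on a toy loop. This is a sketch under stated assumptions, not the solver's code: the `IDX` macro and `field_sum` function are hypothetical names. It shows one plausible reading of "arrays linearized using macros" (Set II), a single parallel region (Set III), and accumulation into a thread-private variable in place of a shared reduction inside the loop (Set IV). It uses standard OpenMP pragmas and runs serially with the same result if OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: map 2-D indices onto a flat, row-major array. */
#define IDX(i, j, ncols) ((size_t)(i) * (size_t)(ncols) + (size_t)(j))

/* Sum all entries of an nrows x ncols field stored in one contiguous array. */
double field_sum(const double *f, int nrows, int ncols)
{
    assert(f != NULL && nrows > 0 && ncols > 0);
    double total = 0.0;

    /* One parallel region for the whole computation. */
    #pragma omp parallel
    {
        double local = 0.0;            /* private to each thread */

        #pragma omp for nowait
        for (int i = 0; i < nrows; ++i)
            for (int j = 0; j < ncols; ++j)
                local += f[IDX(i, j, ncols)];   /* no shared update in the hot loop */

        /* Combine the per-thread partial sums once per thread. */
        #pragma omp atomic
        total += local;
    }
    return total;
}
```

Keeping the accumulator private leaves the inner loop free of cross-thread dependencies, which is the property a vectorizing compiler needs. The shared variable is updated only once per thread.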

11 Optimization Results from the Model Boltzmann Equation Solver
[Figure: Relative Speedup; recovered axis label: Loop Vectorization; series: balanced, scatter]
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

12 Model Boltzmann Equation Solver Performance
Relative Speedup of two 8-core 3.5 GHz Intel Xeon E Processors Versus an Intel Xeon Phi Coprocessor
[Figure: Relative Speedup versus Number of OpenMP Threads; series: Dual Intel Xeon E - Compiler Hints; Intel Xeon Phi - Precision II - Balanced; Intel Xeon Phi - Compiler Hints - Balanced; Intel Xeon Phi - Precision II - Scatter; Intel Xeon Phi - Compiler Hints - Scatter]
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

13 Porting to the Intel Xeon Phi Coprocessor
No major code rewrites were needed to start running on an Intel Xeon Phi coprocessor.
The previous applications were run in native mode and simply required a recompile using the -mmic flag.
Parallelism is achieved using OpenMP, MPI, or both.
The transition from the Intel Xeon Phi software development platform (codename Knights Ferry) to the Intel Xeon Phi coprocessor (codename Knights Corner) is seamless.
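To make the "recompile only" point concrete, here is a minimal sketch of an OpenMP routine with nothing coprocessor-specific in its source. The function name is hypothetical. The assumption is that the same file is built once for the host and once for native execution by adding the -mmic flag named on the slide, alongside whatever option enables OpenMP in the compiler being used.

```c
#include <assert.h>

/* Count the threads that take part in one OpenMP parallel region.
 * The source contains no coprocessor-specific code; choosing between a
 * host build and a native coprocessor build is left to the compiler flags.
 * If OpenMP is not enabled, the pragmas are ignored and the result is 1. */
int count_threads(void)
{
    int n = 0;

    #pragma omp parallel
    {
        /* Each thread adds itself exactly once. */
        #pragma omp atomic
        n += 1;
    }

    assert(n >= 1);
    return n;
}
```

On a native build the thread count would typically be controlled at run time, for example through the standard `OMP_NUM_THREADS` environment variable.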

14 Custom SGEMM and DGEMM Routines for the Intel Xeon Phi Coprocessor
Custom General Matrix-Matrix Multiply routines using single and double precision (SGEMM and DGEMM respectively) were developed for the Intel Xeon Phi coprocessor.
Square matrix sizes were used (m = n = k).
Intel Xeon Phi coprocessor results are run with 240 threads and compared against Intel Xeon E processors.
The above SGEMM and DGEMM routines were developed for the Intel Xeon Phi Coprocessor by Jonathan Peyton, jpeyton1@utk.edu
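As a reference point for what a GEMM routine computes in the square case (m = n = k), the sketch below is a plain, unoptimized double-precision version together with the usual 2n^3 operation count used to turn a timing into a FLOP rate. It is not the custom routine described on the slide. The names `dgemm_square` and `gemm_flops` are hypothetical, and row-major storage is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* C = alpha * A * B + beta * C for square n x n matrices in row-major order. */
void dgemm_square(int n, double alpha, const double *A, const double *B,
                  double beta, double *C)
{
    assert(n > 0 && A != NULL && B != NULL && C != NULL);
    size_t N = (size_t)n;

    for (size_t i = 0; i < N; ++i) {
        /* Scale the existing row of C; beta == 0 overwrites it outright. */
        for (size_t j = 0; j < N; ++j)
            C[i * N + j] = (beta == 0.0) ? 0.0 : beta * C[i * N + j];

        /* i-k-j loop order walks B and C contiguously in the inner loop. */
        for (size_t k = 0; k < N; ++k) {
            double a = alpha * A[i * N + k];
            for (size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
    }
}

/* Conventional operation count for a square GEMM: n^3 multiplies + n^3 adds. */
double gemm_flops(int n)
{
    return 2.0 * n * n * n;
}
```

Dividing `gemm_flops(n)` by a measured run time in seconds gives a rate in FLOP/s, the conventional metric for GEMM performance plots.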

15 Custom SGEMM Performance Results

16 Custom DGEMM Performance Results

17 Contact Information R. Glenn Brook, Ph.D. Director, Application Acceleration Center of Excellence National Institute for Computational Sciences glenn-


More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

Early Experiences Writing Performance Portable OpenMP 4 Codes

Early Experiences Writing Performance Portable OpenMP 4 Codes Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic

More information

Growth in Cores - A well rehearsed story

Growth in Cores - A well rehearsed story Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

More information

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation The Intel Xeon Phi Coprocessor Dr-Ing. Michael Klemm Software and Services Group Intel Corporation (michael.klemm@intel.com) Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED

More information

Knights Corner: Your Path to Knights Landing

Knights Corner: Your Path to Knights Landing Knights Corner: Your Path to Knights Landing James Reinders, Intel Wednesday, September 17, 2014; 9-10am PDT Photo (c) 2014, James Reinders; used with permission; Yosemite Half Dome rising through forest

More information

Accelerator Programming Lecture 1

Accelerator Programming Lecture 1 Accelerator Programming Lecture 1 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de January 11, 2016 Accelerator Programming

More information

Debugging Intel Xeon Phi KNC Tutorial

Debugging Intel Xeon Phi KNC Tutorial Debugging Intel Xeon Phi KNC Tutorial Last revised on: 10/7/16 07:37 Overview: The Intel Xeon Phi Coprocessor 2 Debug Library Requirements 2 Debugging Host-Side Applications that Use the Intel Offload

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

PRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,

PRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava, PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on

More information

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Intel Architecture and Tools Jureca Tuning for the platform II. Dr. Heinrich Bockhorst Intel SSG/DPD/ Date:

Intel Architecture and Tools Jureca Tuning for the platform II. Dr. Heinrich Bockhorst Intel SSG/DPD/ Date: Intel Architecture and Tools Jureca Tuning for the platform II Dr. Heinrich Bockhorst Intel SSG/DPD/ Date: 23.11.2017 Agenda Introduction Processor Architecture Overview Composer XE Compiler Intel Python

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information

Intel Math Kernel Library (Intel MKL) Latest Features

Intel Math Kernel Library (Intel MKL) Latest Features Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer Sridevi.allam@intel.com 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance

More information

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS

CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

General Plasma Physics

General Plasma Physics Present and Future Computational Requirements General Plasma Physics Center for Integrated Computation and Analysis of Reconnection and Turbulence () Kai Germaschewski, Homa Karimabadi Amitava Bhattacharjee,

More information

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED

Key Technologies for 100 PFLOPS. Copyright 2014 FUJITSU LIMITED Key Technologies for 100 PFLOPS How to keep the HPC-tree growing Molecular dynamics Computational materials Drug discovery Life-science Quantum chemistry Eigenvalue problem FFT Subatomic particle phys.

More information

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA Introduction to the Xeon Phi programming model Fabio AFFINITO, CINECA What is a Xeon Phi? MIC = Many Integrated Core architecture by Intel Other names: KNF, KNC, Xeon Phi... Not a CPU (but somewhat similar

More information