Progress Report on QDP-JIT
1 Progress Report on QDP-JIT
F. T. Winter
Thomas Jefferson National Accelerator Facility
USQCD Software Meeting 14, April 16-17, 14 at Jefferson Lab
F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 / 16
2 QDP-JIT/PTX, A Framework for Lattice QCD Calculations for GPUs
- QDP-JIT/PTX provides a reimplementation of QDP++ for NVIDIA GPUs
- Automatic off-loading of expressions to the accelerators
- Multi-GPU support
- Dynamic code generation
- Additional Just-In-Time (JIT) compilation step with the NVIDIA driver
- Data layout is optimized for coalesced memory accesses
- Automatic host-to-device (H2D) and device-to-host (D2H) memory transfers via a software cache
- Automatic tuning of CUDA kernels
[Figure: trajectory time [s] versus XE sockets / XK nodes; V = , 2+1 anisotropic clover, m_π ~ 23 MeV, τ = .2; curves: CPU only (XE nodes), CPU+QUDA, QDP-JIT+QUDA]
F. T. Winter, M. A. Clark, R. G. Edwards, B. Joo in IPDPS'14
Paper accepted for publication in IEEE International Parallel & Distributed Processing Symposium 14
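The idea of routing host/device transfers through a software cache can be illustrated with a minimal sketch. This is not QDP-JIT's implementation: the class `CachedField`, its methods, and `scale_on_device` are hypothetical names, the "device" is simulated by a second host buffer instead of CUDA memory, and eviction under device-memory pressure is left out. The sketch only shows how tracking which copy is newest avoids redundant copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a software cache for one lattice field.
// The "device" copy is simulated by a second host buffer (assumption).
class CachedField {
    enum class State { HostNewer, DeviceNewer };

public:
    explicit CachedField(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host-side access: copy back (D2H) only if the device copy is newer.
    // Every access is conservatively treated as a write.
    std::vector<float>& host() {
        if (state_ == State::DeviceNewer) { host_ = device_; ++d2h_; }
        state_ = State::HostNewer;
        return host_;
    }

    // Device-side access: upload (H2D) only if the host copy is newer.
    std::vector<float>& device() {
        if (state_ == State::HostNewer) { device_ = host_; ++h2d_; }
        state_ = State::DeviceNewer;
        return device_;
    }

    std::size_t h2d_count() const { return h2d_; }
    std::size_t d2h_count() const { return d2h_; }

private:
    std::vector<float> host_, device_;
    State state_ = State::HostNewer;
    std::size_t h2d_ = 0, d2h_ = 0;
};

// Stand-in for an off-loaded expression: scales the field "on the device".
inline void scale_on_device(CachedField& f, float a) {
    for (float& x : f.device()) x *= a;
}
```

Two consecutive calls to `scale_on_device` on the same field trigger a single upload, and the data is copied back only when the host reads it again.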
3 QDP-JIT/LLVM Motivation
- Code maintainability
  - No template specializations (SSE, AVX, etc.) for each architecture
  - No heavy usage of #ifdef constructs
- Performance portability
  - Efficient code generation for all relevant targets
  - No dependence on the compiler's ability to deal with templated code
  - Support for vector units, memory pre-fetchers, etc.
  - Efficient code: threading, scheduling, cache blocking, etc.
- QDP-JIT/LLVM
  - LLVM IR: architecture-independent implementation of QDP++
  - LLVM is a framework worth targeting
  - LLVM IR is architecture independent
  - LLVM is embraced by the HPC industry, e.g. NVIDIA, IBM, Intel, ...
4 QDP-JIT/LLVM Overview
[Diagram, repeated on slides 4 to 9: QDP-JIT/PTX next to QDP-JIT/LLVM; QDP-JIT/LLVM emits LLVM IR, which feeds the backends nvptx (GPUs), libnvvm (GPUs), x86-64 (Intel/AMD CPUs), ppc64+qpx (Blue Gene/Q), and further backends (?)]
QDP-JIT/PTX is limited to GPUs. To target a broader range of architectures, a new LLVM IR code generator was implemented.
5 QDP-JIT/LLVM Overview
The GPU route is still there, via two approaches: the open-source NVPTX backend or the closed-source libnvvm library.
6 QDP-JIT/LLVM Overview
libnvvm has been part of CUDA since 5.5 and includes PTX-specific optimizations.
7 QDP-JIT/LLVM Overview
Generate x86 code with LLVM's mature x86 backend (great SSE/AVX support).
8 QDP-JIT/LLVM Overview
Generate PowerPC 64 code. Some support for QPX (work in progress).
9 QDP-JIT/LLVM Overview
New architectures are supported provided that they support JIT compilation.
10 Optimization: Custom Data Layout
- QDP++ specifies the data layout through the nesting order of templated data types:
  Outer < Spin < Color < Reality < float > > > >
- QDP-JIT splits the outer loop by an optional inner vector length I:
  Outer < Spin < Color < Reality < Inner < float > > > > >
- The code generation step intercepts and changes the data layout:
  Spin < Color < Reality < Outer < Inner < float > > > > >  (GPUs, I = 1)
  Outer < Spin < Color < Reality < Inner < float > > > > >  (CPUs with SSE/AVX, I = 2/4/8)
  Outer < Spin < Color < Inner < Reality < float > > > > >  (BG/Q, I = 2 (DP))
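The effect of the three nesting orders can be made concrete by computing the linear memory offset of one field element under each of them. The following is a sketch under stated assumptions, not QDP-JIT code: the names `Level`, `Order`, `element_offset` and the `k...Order` constants are made up for illustration, a fermion field with 4 spin and 3 color components is assumed, and the leftmost level in an order is taken to be the slowest-running index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Index levels of one lattice-field element (illustrative names).
enum Level { Outer = 0, Spin, Color, Reality, Inner, NumLevels };
using Order  = std::array<Level, NumLevels>;
using Coords = std::array<std::size_t, NumLevels>;

// Nesting orders from the slide, slowest-running level first.
const Order kGpuOrder = {{Spin, Color, Reality, Outer, Inner}};  // GPUs, I = 1
const Order kCpuOrder = {{Outer, Spin, Color, Reality, Inner}};  // SSE/AVX, I = 2/4/8
const Order kBgqOrder = {{Outer, Spin, Color, Inner, Reality}};  // BG/Q, I = 2 (DP)

// Offset (in scalar elements) of component (spin, color, reality) at a site.
// The site index is split as site = outer * inner_len + inner.
inline std::size_t element_offset(const Order& order, std::size_t volume,
                                  std::size_t inner_len, std::size_t site,
                                  std::size_t spin, std::size_t color,
                                  std::size_t reality)
{
    assert(inner_len > 0 && volume % inner_len == 0 && site < volume);
    Coords extent{}, coord{};
    extent[Outer]   = volume / inner_len;  coord[Outer]   = site / inner_len;
    extent[Spin]    = 4;                   coord[Spin]    = spin;     // assumed Ns = 4
    extent[Color]   = 3;                   coord[Color]   = color;    // assumed Nc = 3
    extent[Reality] = 2;                   coord[Reality] = reality;  // 0 = re, 1 = im
    extent[Inner]   = inner_len;           coord[Inner]   = site % inner_len;

    std::size_t off = 0;
    for (Level l : order) {  // nested row-major indexing in the given order
        assert(coord[l] < extent[l]);
        off = off * extent[l] + coord[l];
    }
    return off;
}
```

With the GPU order, neighbouring sites of the same component sit next to each other in memory, which is the pattern coalesced accesses need. With the CPU order and I = 4, sites 4 to 7 of one component fill one contiguous group of four scalars. With the BG/Q order and I = 2, the real and imaginary parts of two sites form one group of four scalars.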
11 Benchmark on Intel Sandy Bridge
[Plot: t_linalg (single precision), QDP++(SSE) vs. QDP-JIT/LLVM; kernels: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H]
- The working set falls out of L2 cache for local problem sizes larger than L = 4.
- Within cache the code achieves up to 78% of peak on an E5-2650 at 2.0 GHz (256 GFlops SP peak).
12 Benchmark on Intel Sandy Bridge
[Plot: t_linalg (double precision), QDP++(SSE) vs. QDP-JIT/LLVM; kernels: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H]
- The working set falls out of L2 cache for local problem sizes larger than L =
- Within cache the code achieves up to 78% of peak on an E5-2650 at 2.0 GHz (128 GFlops DP peak).
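The quoted peak figures of 256 (SP) and 128 (DP) can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated on the slides: the processor is taken to be an 8-core Xeon E5-2650 running at 2.0 GHz, and each core is taken to retire one 256-bit AVX add and one 256-bit AVX multiply per cycle, i.e. 16 single-precision or 8 double-precision floating-point operations per cycle. The function names are illustrative.

```cpp
#include <cassert>

// Theoretical peak in GFlops:
// cores x clock [GHz] x floating-point operations per cycle and core.
inline double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}

// Sustained rate implied by a fraction of peak (0.78 for 78%).
inline double sustained_gflops(double peak, double fraction)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    return peak * fraction;
}
```

Under these assumptions `peak_gflops(8, 2.0, 16)` gives 256 and `peak_gflops(8, 2.0, 8)` gives 128, so 78% of the single-precision peak corresponds to roughly 200 GFlops.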
13 Benchmark on Blue Gene/Q (single node, preliminary)
[Plot: t_linalg DP, 1 node, threads=32, inner=4, layout=oscri; kernels: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H]
- QPX instructions are generated; there are, however, still alignment issues.
- The working set falls out of L2 cache for local problem sizes larger than L =
14 Benchmark on Blue Gene/Q (single node)
[Plot: t_linalg DP, 1 node, threads=64, QDP++, OMP, gcc -O3; kernels: M=M*M, M=adj(M)*M, M=M*adj(M), M=adj(M)*adj(M), M+=M*M, M+=adj(M)*M, M+=M*adj(M), M+=adj(M)*adj(M), M-=M*M, M-=adj(M)*M, M-=M*adj(M), M-=adj(M)*adj(M), V=M*V, V=adj(M)*V, V=V+V, D=M*D, D=adj(M)*D, H=M*H, H=adj(M)*H]
- GCC on vanilla QDP++ currently does better on the linear algebra than QDP-JIT/LLVM, mainly because the LLVM BG/Q backend lacks essential performance features.
15 Benchmark on Blue Gene/Q, preliminary
[Plot: rb2 Wilson DSlash, local volume V =12 1 4, DP, 1 MPI rank/node; performance versus number of BG/Q nodes; curves: QDP-JIT, 32 threads; QDP++, 16 threads]
- Shifting of sub-lattices
- Overlapping of computation and off-node communication
- For rb2 Wilson DSlash preliminary measurements show a speedup factor of
16 Summary & Outlook
- QDP-JIT/LLVM provides an architecture-independent implementation of QDP++
- Runs Chroma HMC (Wilson Clover) on GPUs, x86, and BG/Q
- Optimizations:
  - Custom data layout to support vectorization
  - Multi-threading
  - Sub-lattice shifting
- Improve performance on BG/Q (QPX, SPI)
- Intel Xeon Phi (KNL)
- Apply advanced optimizations:
  - Polyhedral model
  - Cache blocking
  - Memory prefetching
  - Overlapping MPI and compute