First Lessons Learned from Using GPUs for Structural Mechanics Applications
- Abel Powers
Transcription
1 First Lessons Learned from Using GPUs for Structural Mechanics Applications. Antoine Petitet and Stefanos Vlachoutsis, June 2011. Copyright ESI Group, All rights reserved.
2 Outline Work carried out within the OpenGPU project with the support of the DGCIS. Implicit method: solving sparse linear systems. Explicit method: Smoothed Particle Hydrodynamics (SPH).
3 Multi-frontal Solver and CUBLAS One of the major workhorses of VPS implicit is the (multifrontal) direct linear system solver (MUMPS). By design, the multifrontal method operates on dense submatrices for performance: GEMM and TRSM BLAS Level 3 kernels, sometimes with a large number of right-hand sides (RHS). In VPS, the main focus is on double-precision real and complex operands. What happens when the CUBLAS provided by NVIDIA is used on some industrial test cases?
4 CUBLAS (3.2) Level 3 Performance Performance on a C2070 (ECC on), including data transfers. GEMM and [SY,HE]RK are optimized; little has been done for the performance of the other Level 3 BLAS routines. This is true for single precision and for all other precisions (D, C and Z). TRSM is important for solves with multiple RHS. [Figure: Single Precision Level 3 CUBLAS; Gflops/s versus problem size for SGEMM, SSYMM, SSYRK, SSYR2K, STRMM and STRSM.]
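5 Recursive GEMM-based Level 3 BLAS Recursive formulation of the TRSM operation. Partition the lower triangular matrix as $A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}$ and the right-hand sides as $B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$, then compute: $B_1 := A_{11}^{-1} B_1$ (TRSM); $B_2 := B_2 - A_{21} B_1$ (GEMM); $B_2 := A_{22}^{-1} B_2$ (TRSM). The native (slower) TRSM is used on the leaves of the recursion tree and the (fast) GEMM elsewhere. The method can be applied to all Level 3 (and Level 2) operations.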
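The three-step recursion on this slide can be sketched in plain Python as follows. This is a minimal illustration, not the authors' CUBLAS-based implementation: the names `trsm_leaf`, `gemm_update` and `trsm_recursive` and the `leaf` cut-off are hypothetical, `A` is assumed to be lower triangular with a nonzero diagonal, and the two helpers merely stand in for the native TRSM and the GEMM kernel.

```python
def trsm_leaf(A, B):
    """Solve A X = B by forward substitution (stand-in for the native, slower TRSM)."""
    n, m = len(A), len(B[0])
    X = [row[:] for row in B]  # work on a copy so B is left untouched
    for i in range(n):
        for j in range(m):
            s = X[i][j] - sum(A[i][k] * X[k][j] for k in range(i))
            X[i][j] = s / A[i][i]
    return X


def gemm_update(B2, A21, B1):
    """Return B2 - A21 * B1 (stand-in for the fast GEMM kernel)."""
    inner = len(B1)
    return [[B2[i][j] - sum(A21[i][k] * B1[k][j] for k in range(inner))
             for j in range(len(B2[0]))]
            for i in range(len(B2))]


def trsm_recursive(A, B, leaf=2):
    """Solve A X = B for lower triangular A by recursive 2x2 block splitting."""
    n = len(A)
    if n <= leaf:
        return trsm_leaf(A, B)           # leaf of the recursion tree
    h = n // 2
    A11 = [row[:h] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]
    X1 = trsm_recursive(A11, B[:h], leaf)    # B1 := A11^{-1} B1  (TRSM)
    B2 = gemm_update(B[h:], A21, X1)         # B2 := B2 - A21 B1  (GEMM)
    X2 = trsm_recursive(A22, B2, leaf)       # B2 := A22^{-1} B2  (TRSM)
    return X1 + X2                           # stack the two row blocks
```

As the matrix grows, an increasing share of the arithmetic lands in `gemm_update`, which is why a scheme of this shape can approach GEMM speed when the update is performed by an optimized kernel.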
6 (Recursive) CUBLAS Level 3 Performance The recursive routines asymptotically achieve GEMM performance. [Figure, two plots of Gflops/s versus problem size. Curves: DGEMM (original), DTRSM (original), DTRSM (recursive); ZGEMM (original), ZSYMM (recursive), ZHEMM (recursive), ZSYR2K (recursive), ZHER2K (recursive), ZTRMM (recursive), ZTRSM (recursive).] [SY,HE] rank-2k updates should be implemented by a GEMM call followed by a triangular in-place copy-add. The recursive algorithm should be used until there is enough memory to use the above algorithm.
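The "GEMM call followed by a triangular in-place copy-add" can be illustrated for the symmetric case (SYR2K, lower triangle) with the sketch below. It is an assumption-laden illustration rather than the authors' code: the function names `gemm_nt` and `syr2k_lower_via_gemm` are hypothetical, and plain Python lists stand in for device matrices. It relies on the identity $B A^T = (A B^T)^T$, so a single product $W = A B^T$ is enough.

```python
def gemm_nt(A, B):
    """Return W = A * B^T for row-major nested lists (stand-in for one GEMM call)."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]


def syr2k_lower_via_gemm(alpha, A, B, beta, C):
    """Update the lower triangle of C in place:
    C := alpha*A*B^T + alpha*B*A^T + beta*C.
    The strictly upper triangle of C is not referenced or modified.
    """
    W = gemm_nt(A, B)          # full n x n workspace
    n = len(C)
    for i in range(n):
        for j in range(i + 1):
            # triangular copy-add: W[j][i] supplies the (i, j) entry of B*A^T
            C[i][j] = beta * C[i][j] + alpha * (W[i][j] + W[j][i])
    return C
```

The price of this formulation is the full n-by-n workspace `W`, which is consistent with the slide's remark that the recursive algorithm is the fallback when memory is short.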
7 VPS Implicit: Non-Linear Static Test Case Double precision real, 1 RHS. 12 numerical factorizations and 12 solves. Problem size = , non-zero terms = Speed-up: 20% over 1 Nehalem core. [Figure: time in minutes, CPU versus CPU-GPU, for Total, Matrix and Solver.]
8 VPS-Implicit: NVH Frequency Response Double precision complex, 1258 RHS. 25 numerical factorizations and 175 solves. Problem size = , non-zero terms = Speed-up: 2x over 1 Nehalem core. [Figure: Internal Acoustics; time in minutes, CPU versus CPU-GPU, for Total, Matrix and Solver.]
9 Conclusions A naïve (no overlap of data transfers and computation) recursive GEMM-based implementation was necessary to handle a large number of RHS efficiently. The library approach makes the GPU particularly easy to use within complex applications; the performance gain, however, remains limited. More work is necessary to get better speedups for sparse direct solvers on GPUs.
10 SPH The granularity of the computations performed in SPH makes it a method of choice for GPU computing. Single-precision real arithmetic. Most of the computation is evenly distributed across (only) 3 hot-spots, comprising 5 routines in total. The reported execution times include data transfers to the card (no overlap). Comparison of computing times between 1 Nehalem W5590 core and one Nvidia Fermi card (C2070, 6 GB of RAM). Industrial case: vehicle driving through water ( points, particles, plates).
11 CUDA kernels for one hot-spot [Table columns: Simulation (ms), GPU (s), CPU (s), Gain (%).] [Figure: Elapsed time CPU - GPU (1); elapsed time (s) versus simulation time (ms), GPU and CPU.] The speedup seems to increase slightly with the simulation time.
12 Estimation for 3 hot-spots [Table columns: Simulation (ms), GPU (s), CPU (s), Gain (%).] [Figure: Elapsed time CPU - GPU - estimation; elapsed time (s) versus simulation time (s), GPU and CPU.] Data re-use (fewer data transfers) as the number of kernels increases should lead to an even better speedup.
13 CUDA kernels for 3 hot-spots [Table columns: Simulation (ms), GPU (s), CPU (s), Gain (%).] The number of registers is fixed, so the thread blocks must be made smaller for the kernels to run successfully, which costs performance. The size of the kernel argument list is limited: 256 bytes on 1.x (C1060) and 4 KB on 2.0 (C2070).
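The register constraint mentioned on this slide can be made concrete with a rough back-of-the-envelope calculation. The helper below is a hypothetical sketch, not a tool used by the authors: it ignores the hardware's register allocation granularity, and the per-multiprocessor register counts (32768 for compute capability 2.0, 16384 for 1.3) and block-size limits (1024 and 512 threads) are assumptions taken from NVIDIA's published device limits that should be checked against the CUDA documentation for the target card.

```python
def max_threads_per_block(regs_per_thread, regs_per_sm, hw_limit, warp_size=32):
    """Largest warp-multiple block size whose registers fit on one multiprocessor.

    regs_per_thread: registers the compiler assigned to each thread of the kernel
    regs_per_sm:     32-bit registers available per multiprocessor (assumed value)
    hw_limit:        architectural maximum number of threads per block (assumed value)
    """
    threads = min(regs_per_sm // regs_per_thread, hw_limit)
    return (threads // warp_size) * warp_size  # round down to whole warps


# Assumed limits for the two cards named on the slide.
FERMI_C2070 = dict(regs_per_sm=32768, hw_limit=1024)   # compute capability 2.0
TESLA_C1060 = dict(regs_per_sm=16384, hw_limit=512)    # compute capability 1.3

light_kernel = max_threads_per_block(32, **FERMI_C2070)   # low register pressure
heavy_kernel = max_threads_per_block(63, **FERMI_C2070)   # high register pressure
```

Under these assumptions, a kernel that needs about twice as many registers per thread can only be launched with blocks about half as large, which is the effect the slide describes when the kernels of several hot-spots are combined.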
14 Conclusions and future work SPH is very promising for GPU computing, but the kernels still need work to achieve the potential. Hybrid GPU(s)-CPU computing: to be investigated. Other explicit-method topics to investigate: Finite Pointset Method (FPM), internal forces computation, contact mechanics, experiments on clusters of GPUs (MPI+OpenMP+GPUs). Evaluation of tools for kernel generation: HMPP, PGI.