Performance Modeling for Ranking Blocked Algorithms
1 Performance Modeling for Ranking Blocked Algorithms
Elmar Peise
Aachen Institute for Advanced Study in Computational Engineering Science (AICES)
Elmar Peise (AICES) Performance Modeling
2 Blocked algorithms
Inversion of a triangular matrix: L := L^-1, L in R^(n x n), partitioned into blocks L00, L10, L11, L20, L21, L22.
Variant 1: L10 := L10 L00;      L10 := -L11^-1 L10;    L11 := L11^-1
Variant 2: L21 := -L22^-1 L21;  L21 := L21 L11^-1;     L11 := L11^-1
Variant 3: L10 := -L11^-1 L10;  L20 := L20 + L21 L10;  L21 := L21 L11^-1;  L11 := L11^-1
Variant 4: L21 := -L22^-1 L21;  L20 := L20 - L21 L10;  L10 := L10 L00;     L11 := L11^-1
3 Execution time
Inversion of a triangular matrix: L := L^-1, L in R^(n x n).
[Plot: time [s] vs. matrix size n (up to 2,048) for variants 1-4; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
4 Efficiency
Inversion of a triangular matrix: L := L^-1, L in R^(n x n).
[Plot: efficiency [%] vs. matrix size n (up to 1,024) for variants 1-4; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
5 Blocked algorithms
Same four variants, now annotated with their observed ranking: variant 3 is 1st, variant 2 is 2nd, variant 1 is 3rd, variant 4 is 4th.
Can we rank the algorithms based on their update statements? No!
6 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
7 Sampling
Operation: B := 0.37 B L^-1, with B in R^(128 x 96) and L in R^(96 x 96).
Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
Sampling via performance counters (PAPI): ticks, flops, L1misses
8 Matrix arguments
Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
Performance is independent of the contents of L and B: assign a (random) memory chunk of sufficient size.
Input: (dtrsm, R, L, N, U, 128, 96, 0.37, , 128, , 128)
9 Multiple samples
Input:
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
(dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128)
Sampling via performance counters: ticks, flops, L1misses
10 Influence of caching
dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128)
[Plot: ticks (x10^6) across repeated experiments, in cache vs. out of cache, for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
11 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
12 Modeling
Understanding performance, model structure, model generation, modeling results
13 Argument types
Operation: B := alpha A^-1 B (side = L) or B := alpha B A^-1 (side = R).
dtrsm(side, uplo, transa, diag, m, n, alpha, A, lda, B, ldb)
Argument types: discrete (side, uplo, transa, diag); size (m, n); data: scalar (alpha), matrix (A, B), leading dimension (lda, ldb)
14 Discrete arguments
dtrsm(side, uplo, transa, diag, 256, 256, 0.5, A, 256, B, 256)
[Plot: ticks for all 16 combinations of (side, uplo, transa, diag), from (L, L, N, N) through (R, U, T, U), for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
15 Size arguments: small scale
dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m)
[Plot: ticks vs. m/n/k for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
16 Size arguments: large scale
dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m)
[Plot: ticks vs. m/n/k, varying m, n, and k individually and jointly, for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
17 Scalar arguments
dgemm(N, N, 256, 256, 256, alpha, A, 256, B, 256, beta, C, 256)
[Plot: ticks for GotoBLAS2, MKL, and ATLAS over the 16 special-case combinations of alpha and beta, from C := 0 and C := AB through C := AB + 0.5C and C := 0.5AB + 0.5C; AMD Opteron, 1 core]
18 Matrix arguments
dgemm(N, N, 256, 256, 256, 0.5, A, 256, B, 256, 0.5, C, 256)
No performance dependency on the matrix contents.
19 Leading dimension arguments
dgemm(N, N, 128, 128, 128, 0.5, A, lda, B, ldb, 0.5, C, ldc)
[Plot: ticks vs. lda/ldb/ldc for GotoBLAS2, MKL, and ATLAS; AMD Opteron, 1 core]
20 Modeling
Understanding performance, model structure, model generation, modeling results
21 Model structure
Input: (dtrsm, L, U, N, N, 256, 512, 0.7, A, 512, B, 512)
Model for dtrsm:
- parameter extraction: discrete case (L, U, N), sizes (256, 512)
- discrete cases: LLN, LLT, LUN, LUT, RLN, RUN, RLT, RUT
- per case and performance counter (ticks, flops, L1misses): piecewise polynomials in the size arguments
- per polynomial: the statistical quantities min, avg, std, max, evaluated at (256, 512)
Output: (min, avg, std, max) for each counter
22 Modeling
Understanding performance, model structure, model generation, modeling results
23 Modeler
The Modeler drives a Sampler through a Sampler Interface and holds one Routine Modeler per routine; each Routine Modeler holds one Piecewise Polynomial Modeler per discrete case (e.g. (L, U, N)), which produces one Model per performance counter (e.g. ticks).
24 Piecewise Polynomial Modelers
Two strategies: Model Expansion and Adaptive Refinement.
25 Model Expansion
[Figure: the modeled domain grows step by step, extending the completed region as long as the model stays accurate]
26 Adaptive Refinement
[Figure: regions are refined repeatedly; where the accuracy is bad (above the threshold) a region is split, where it is good the region is kept]
27 Modeling
Understanding performance, model structure, model generation, modeling results
28 Model Expansion for ticks
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Heat maps on slides 28-31: relative error (0% to 20%) over m, n up to 1,024; AMD Opteron, 1 core, GotoBLAS2]
29 Opposite direction
30 Smaller approximation error bound
31 Smaller models
32 Adaptive Refinement for ticks
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Heat maps on slides 32-34: relative error (0% to 20%) over m, n up to 1,024; AMD Opteron, 1 core, GotoBLAS2]
33 Smaller models
34 Smaller approximation error bound
35 Comparison
dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500)
[Plot: average error [%] vs. number of samples (up to 10^5) for Model Expansion and Adaptive Refinement; AMD Opteron, 1 core, GotoBLAS2]
36 Outline
1 Sampling
2 Modeling
3 Prediction and Ranking
37 Ranking
Blocked algorithm revisited: L := L^-1, partitioned into blocks L00, L10, L11, L20, L21, L22.
Variant 1: L10 := L10 L00;      L10 := -L11^-1 L10;    L11 := L11^-1
Variant 2: L21 := -L22^-1 L21;  L21 := L21 L11^-1;     L11 := L11^-1
Variant 3: L10 := -L11^-1 L10;  L20 := L20 + L21 L10;  L21 := L21 L11^-1;  L11 := L11^-1
Variant 4: L21 := -L22^-1 L21;  L20 := L20 - L21 L10;  L10 := L10 L00;     L11 := L11^-1
38 C implementation: Variant 1
Updates: L10 := L10 L00; L10 := -L11^-1 L10; L11 := L11^-1

extern void dtrmm_(char *side, char *uplo, char *transa, char *diag,
                   int *m, int *n, double *alpha, double *A, int *lda,
                   double *B, int *ldb);
extern void dtrsm_(char *side, char *uplo, char *transa, char *diag,
                   int *m, int *n, double *alpha, double *A, int *lda,
                   double *B, int *ldb);

int trinv1_(char *diag, int *n, double *A, int *lda, int *bsize)
{
    if (*n == 1) {
        if (diag[0] == 'N')
            *A = 1 / *A;
        return 0;
    }
    int ione = 1;
    double one = 1;
    double mone = -1;
    for (int p = 0; p < *n; p += *bsize) {
        int b = *bsize;
        if (p + b > *n)
            b = *n - p;
#define A00 (A)
#define A10 (A + p)
#define A11 (A + *lda * p + p)
        dtrmm_("R", "L", "N", diag, &b, &p, &one, A00, lda, A10, lda);  /* L10 := L10 L00     */
        dtrsm_("L", "L", "N", diag, &b, &p, &mone, A11, lda, A10, lda); /* L10 := -L11^-1 L10 */
        trinv1_(diag, &b, A11, lda, &ione);                             /* L11 := L11^-1      */
    }
    return 0;
}
39 Prediction
Input: trinv1(N, 300, A, 300, 100)
Compute routine invocations (update -> routine invocation):
p = 0:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 0, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 0, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
p = 100:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 100, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 100, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
p = 200:
  L10 := L10 L00      -> (dtrmm, R, L, N, N, 100, 200, 1, , 300, , 300)
  L10 := -L11^-1 L10  -> (dtrsm, L, L, N, N, 100, 200, -1, , 300, , 300)
  L11 := L11^-1       -> (trinv1, N, 100, , 300, 1)
Evaluate the performance models and accumulate the predicted ticks.
40 Results: cache-trashing
[Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
41 Results: in-cache
[Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
43 Results: probabilistic prediction
[Plot: efficiency vs. n (up to 1,024) for variants 1-4; measurements vs. predicted median, average, min, and max; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
44 Optimal block-size
[Plot: efficiency vs. block size for variants 1-4; measurements vs. predicted median, average, min, and max; Intel Harpertown 2.99 GHz, 1 core, Intel MKL BLAS]
45 Conclusion
1 Sampling: a performance measurement tool, applicable to various DLA routines
2 Modeling: an automatic modeling system, flexible and accurate
3 Prediction and Ranking: accurate predictions, correct ranking
46 Performance Modeling for Ranking Blocked Algorithms
Questions?
Funding from DFG is gratefully acknowledged.
More informationLinear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre
Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org
More informationChapter 4. Matrix and Vector Operations
1 Scope of the Chapter Chapter 4 This chapter provides procedures for matrix and vector operations. This chapter (and Chapters 5 and 6) can handle general matrices, matrices with special structure and
More informationNAG Fortran Library Routine Document F08BHF (DTZRZF).1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationHPC Numerical Libraries. Nicola Spallanzani SuperComputing Applications and Innovation Department
HPC Numerical Libraries Nicola Spallanzani n.spallanzani@cineca.it SuperComputing Applications and Innovation Department Algorithms and Libraries Many numerical algorithms are well known and largely available.
More informationINTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006
INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006 The Challenges of Multicore and Specialized Accelerators Jack Dongarra University of Tennessee
More informationArm Performance Libraries Reference Manual
Arm Performance Libraries Reference Manual Version 18.0.0 Document Number 101004_1800_00 Copyright c 2017 Arm Ltd. or its affiliates, Numerical Algorithms Group Ltd. All rights reserved. Non-Confidential
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationCUDA Toolkit 4.0 Performance Report. June, 2011
CUDA Toolkit 4. Performance Report June, 211 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse
More informationAccurate Cache and TLB Characterization Using Hardware Counters
Accurate Cache and TLB Characterization Using Hardware Counters Jack Dongarra, Shirley Moore, Philip Mucci, Keith Seymour, and Haihang You Innovative Computing Laboratory, University of Tennessee Knoxville,
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationThe LinBox Project for Linear Algebra Computation
The LinBox Project for Linear Algebra Computation A Practical Tutorial Daniel S. Roche Symbolic Computation Group School of Computer Science University of Waterloo MOCAA 2008 University of Western Ontario
More informationAuto-tuning Multigrid with PetaBricks
Auto-tuning with PetaBricks Cy Chan Joint Work with: Jason Ansel Yee Lok Wong Saman Amarasinghe Alan Edelman Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationNAG Fortran Library Routine Document F04DHF.1
F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised
More informationRELIABLE GENERATION OF HIGH- PERFORMANCE MATRIX ALGEBRA
RELIABLE GENERATION OF HIGH- PERFORMANCE MATRIX ALGEBRA Jeremy G. Siek, University of Colorado at Boulder Joint work with Liz Jessup, Boyana Norris, Geoffrey Belter, Thomas Nelson Lake Isabelle, Colorado
More informationHigh Performance Linear Algebra
High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control
More informationHardware Sizing Guide OV
Hardware Sizing Guide OV3600 6.3 www.alcatel-lucent.com/enterprise Part Number: 0510620-01 Table of Contents Table of Contents... 2 Overview... 3 Properly Sizing Processing and for your OV3600 Server...
More information