Performance Modeling for Ranking Blocked Algorithms


1 Performance Modeling for Ranking Blocked Algorithms. Elmar Peise, Aachen Institute for Advanced Study in Computational Engineering Science (AICES).

2 Blocked algorithms. Inversion of a triangular matrix: L := L^{-1}, L ∈ R^{n×n}, partitioned into blocks L00; L10, L11; L20, L21, L22. The four algorithmic variants perform the following updates per iteration:
Variant 1: L10 := L10 L00;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 2: L21 := -L22^{-1} L21;  L21 := L21 L11^{-1};  L11 := L11^{-1}
Variant 3: L21 := L21 L11^{-1};  L20 := L20 + L21 L10;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 4: L21 := -L22^{-1} L21;  L20 := L20 - L21 L10;  L10 := L10 L00;  L11 := L11^{-1}

3 Execution time. Inversion of a triangular matrix: L := L^{-1}, L ∈ R^{n×n}. [Plot: time in seconds vs. matrix size n (up to 2,048) for variants 1-4. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

4 Efficiency. Inversion of a triangular matrix: L := L^{-1}, L ∈ R^{n×n}. [Plot: efficiency in % vs. matrix size n (up to 1,024) for variants 1-4. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

5 Blocked algorithms. Inversion of a triangular matrix: L := L^{-1}, L ∈ R^{n×n}. Measured ranking: variant 3 comes 1st, variant 2 comes 2nd, variant 1 comes 3rd, variant 4 comes 4th.
Variant 1: L10 := L10 L00;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 2: L21 := -L22^{-1} L21;  L21 := L21 L11^{-1};  L11 := L11^{-1}
Variant 3: L21 := L21 L11^{-1};  L20 := L20 + L21 L10;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 4: L21 := -L22^{-1} L21;  L20 := L20 - L21 L10;  L10 := L10 L00;  L11 := L11^{-1}
Can we rank the algorithms based on their update statements? No!

6 Outline: 1. Sampling, 2. Modeling, 3. Prediction and Ranking.

7 Sampling. Operation: B := 0.37 B L^{-1}, with B ∈ R^{128×96} and L ∈ R^{96×96}. Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128). Sampling reads hardware performance counters (via PAPI): ticks, flops, L1 misses.
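A minimal sketch of what one such sampling run could look like, assuming PAPI and a Fortran-interface BLAS; the dtrsm_ prototype, the counter choice (PAPI_TOT_CYC, PAPI_FP_OPS, PAPI_L1_DCM as ticks/flops/L1 misses), and the buffer handling are assumptions for illustration, not code from the talk:

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

/* Fortran-style BLAS prototype (all arguments passed by reference). */
void dtrsm_(const char *side, const char *uplo, const char *transa,
            const char *diag, const int *m, const int *n,
            const double *alpha, const double *A, const int *lda,
            double *B, const int *ldb);

int main(void) {
    int m = 128, n = 96, lda = 128, ldb = 128;
    double alpha = 0.37;
    /* Performance is independent of the matrix contents: two random
       chunks of sufficient size (lda * n = 12288 doubles each) suffice. */
    double *A = malloc(12288 * sizeof *A), *B = malloc(12288 * sizeof *B);
    for (int i = 0; i < 12288; i++) {
        A[i] = rand() / (double)RAND_MAX;
        B[i] = rand() / (double)RAND_MAX;
    }

    int es = PAPI_NULL;
    long long counters[3];
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC);  /* ticks     */
    PAPI_add_event(es, PAPI_FP_OPS);   /* flops     */
    PAPI_add_event(es, PAPI_L1_DCM);   /* L1 misses */

    PAPI_start(es);
    dtrsm_("R", "L", "N", "U", &m, &n, &alpha, A, &lda, B, &ldb);
    PAPI_stop(es, counters);

    printf("ticks %lld  flops %lld  L1misses %lld\n",
           counters[0], counters[1], counters[2]);
    free(A); free(B);
    return 0;
}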

8 Matrix arguments. Routine call: dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128). Performance is independent of the contents of L and B, so each matrix argument is assigned a (random) memory chunk of sufficient size. Input: (dtrsm, R, L, N, U, 128, 96, 0.37, ·, 128, ·, 128).

9 Multiple samples. The same input (dtrsm, R, L, N, U, 128, 96, 0.37, 12288, 128, 12288, 128) is executed repeatedly; each repetition yields one sample of the performance counters ticks, flops, and L1 misses.
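Reducing such repeated samples to the statistical quantities the models later store (min, avg, std, max; see the model structure below) could look as follows; reduce_samples is a hypothetical helper, not from the talk:

#include <math.h>
#include <stdio.h>

typedef struct { double min, avg, std, max; } stats_t;

/* Reduce n >= 1 tick samples to min/avg/std/max. */
stats_t reduce_samples(const long long *ticks, int n) {
    stats_t s = { (double)ticks[0], 0.0, 0.0, (double)ticks[0] };
    for (int i = 0; i < n; i++) {
        if (ticks[i] < s.min) s.min = (double)ticks[i];
        if (ticks[i] > s.max) s.max = (double)ticks[i];
        s.avg += ticks[i];
    }
    s.avg /= n;
    for (int i = 0; i < n; i++)
        s.std += (ticks[i] - s.avg) * (ticks[i] - s.avg);
    s.std = sqrt(s.std / n);
    return s;
}

int main(void) {
    long long t[4] = { 101000, 99500, 100200, 100800 };
    stats_t s = reduce_samples(t, 4);
    printf("min %.0f avg %.1f std %.1f max %.0f\n", s.min, s.avg, s.std, s.max);
    return 0;
}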

10 Influence of caching. dtrsm(R, L, N, U, 128, 96, 0.37, L, 128, B, 128). [Histogram: distribution of ticks (×10^6) over thousands of repeated experiments, in-cache vs. out-of-cache operands, for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

11 Outline: 1. Sampling, 2. Modeling, 3. Prediction and Ranking.

12 Modeling: understanding performance, model structure, model generation, modeling results.

13 Argument types. dtrsm(side, uplo, transa, diag, m, n, alpha, A, lda, B, ldb) computes B := α A^{-1} B (side = L) or B := α B A^{-1} (side = R). The arguments fall into five classes: discrete (side, uplo, transa, diag), size (m, n), scalar data (alpha), matrix data (A, B), and leading dimension (lda, ldb).

14 Discrete arguments. dtrsm(side, uplo, transa, diag, 256, 256, 0.5, A, 256, B, 256). [Plot: ticks for all 16 combinations of (side, uplo, transa, diag), from LLNN to RUTU, for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

15 Size arguments, small scale. dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m). [Plot: ticks vs. m/n/k at small sizes for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

16 Size arguments, large scale. dgemm(N, N, m, n, k, 0.5, A, m, B, k, 0.5, C, m). [Plot: ticks when varying m, n, and k individually and jointly (m/n/k), for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

17 Scalar arguments. dgemm(N, N, 256, 256, 256, alpha, A, 256, B, 256, beta, C, 256). [Plot: ticks for the special cases of C := αAB + βC with α, β ∈ {0, 1, -1, 0.5}, for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

18 Matrix arguments. dgemm(N, N, 256, 256, 256, 0.5, A, 256, B, 256, 0.5, C, 256). Performance shows no dependency on the matrix contents.

19 Leading dimension arguments. dgemm(N, N, 128, 128, 128, 0.5, A, lda, B, ldb, 0.5, C, ldc). [Plot: ticks when varying lda, ldb, and ldc individually and jointly, for GotoBLAS2, MKL, and ATLAS. AMD Opteron, 1 core.]

20 Modeling: understanding performance, model structure, model generation, modeling results.

21 Model structure. Input: (dtrsm, L, U, N, N, 256, 512, 0.7, A, 512, B, 512). The model for dtrsm first performs parameter extraction, splitting the input into the discrete flags (L, U, N) and the sizes (256, 512). The flags select one of the discrete cases LLN, LLT, LUN, LUT, RLN, RUN, RLT, RUT; for each case, each performance counter (ticks, flops, L1 misses), and each statistical quantity (min, avg, std, max) the model holds a piecewise polynomial in the sizes. Evaluating these polynomials at (256, 512) yields the output: (min, avg, std, max) per counter.
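One conceivable C layout for this structure, with every name hypothetical and the piecewise polynomial reduced to a single bilinear piece for brevity:

#include <stdio.h>

enum { CNT_TICKS, CNT_FLOPS, CNT_L1MISSES, NCOUNTERS };
enum { ST_MIN, ST_AVG, ST_STD, ST_MAX, NSTATS };
enum { LLN, LLT, LUN, LUT, RLN, RUN, RLT, RUT, NCASES };

typedef struct { double c[4]; } pwpoly_t;  /* c0 + c1*m + c2*n + c3*m*n */

typedef struct { pwpoly_t poly[NCOUNTERS][NSTATS]; } case_model_t;
typedef struct { case_model_t cases[NCASES]; } dtrsm_model_t;

double pwpoly_eval(const pwpoly_t *p, double m, double n) {
    return p->c[0] + p->c[1] * m + p->c[2] * n + p->c[3] * m * n;
}

/* Parameter extraction maps the flags (L, U, N) to a discrete case and
   the sizes (256, 512) to the polynomial's evaluation point. */
double predict(const dtrsm_model_t *mdl, int cse, int counter, int stat,
               double m, double n) {
    return pwpoly_eval(&mdl->cases[cse].poly[counter][stat], m, n);
}

int main(void) {
    dtrsm_model_t mdl = {0};
    /* Fabricated coefficients, purely to exercise the lookup. */
    mdl.cases[LUN].poly[CNT_TICKS][ST_AVG] = (pwpoly_t){{1e4, 50, 30, 2.5}};
    printf("avg ticks: %g\n",
           predict(&mdl, LUN, CNT_TICKS, ST_AVG, 256, 512));
    return 0;
}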

22 Modeling: understanding performance, model structure, model generation, modeling results.

23 Modeler. [Diagram: the Modeler drives the Sampler through a sampler interface; it contains one Routine Modeler per routine, which in turn contains one Piecewise Polynomial Modeler per discrete case (e.g. (L, U, N)), producing one model per performance counter (e.g. ticks).]

24 Piecewise polynomial modelers: Model Expansion and Adaptive Refinement.

25 Model Expansion. [Figure: the model grows step by step by extending the completed region.]

26 Adaptive Refinement. [Figure: regions whose fit accuracy falls below a threshold are refined repeatedly; accuracy scale from bad, across the threshold, to good.]
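A compact sketch of the refinement idea, under the assumption that a region is split into four subregions whenever a fit through its corner samples misses a fresh sample at its center by more than a tolerance; sample_ticks is a synthetic stand-in for the sampler, and the concrete fit and split rules are illustrative only:

#include <math.h>
#include <stdio.h>

/* Synthetic stand-in for the sampler, so the sketch is self-contained. */
double sample_ticks(double m, double n) {
    return 1e3 + 0.003 * m * m * n + 40.0 * m;
}

typedef struct { double c0, cm, cn, cmn; } poly_t;

/* Bilinear fit through the four corner samples of [m0,m1] x [n0,n1]. */
poly_t fit_corners(double m0, double m1, double n0, double n1) {
    double f00 = sample_ticks(m0, n0), f10 = sample_ticks(m1, n0);
    double f01 = sample_ticks(m0, n1), f11 = sample_ticks(m1, n1);
    poly_t p;
    p.cmn = (f11 - f10 - f01 + f00) / ((m1 - m0) * (n1 - n0));
    p.cm  = (f10 - f00) / (m1 - m0) - p.cmn * n0;
    p.cn  = (f01 - f00) / (n1 - n0) - p.cmn * m0;
    p.c0  = f00 - p.cm * m0 - p.cn * n0 - p.cmn * m0 * n0;
    return p;
}

double poly_eval(poly_t p, double m, double n) {
    return p.c0 + p.cm * m + p.cn * n + p.cmn * m * n;
}

/* Keep a region if its fit is accurate at the center; split otherwise. */
void refine(double m0, double m1, double n0, double n1, double tol) {
    poly_t p = fit_corners(m0, m1, n0, n1);
    double mc = 0.5 * (m0 + m1), nc = 0.5 * (n0 + n1);
    double truth = sample_ticks(mc, nc);
    double err = fabs(poly_eval(p, mc, nc) - truth) / truth;
    if (err <= tol || m1 - m0 < 8) {
        printf("keep [%g,%g]x[%g,%g]  rel. err %.3g\n", m0, m1, n0, n1, err);
        return;
    }
    refine(m0, mc, n0, nc, tol); refine(mc, m1, n0, nc, tol);
    refine(m0, mc, nc, n1, tol); refine(mc, m1, nc, n1, tol);
}

int main(void) { refine(8, 1024, 8, 1024, 0.05); return 0; }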

27 Modeling: understanding performance, model structure, model generation, modeling results.

28 Model Expansion for ticks. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

29 Opposite expansion direction. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

30 Smaller approximation error bound. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

31 Smaller models. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

32 Adaptive Refinement for ticks. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

33 Smaller models. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

34 Smaller approximation error bound. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Heat map: relative model error (0% to 20%) over m, n up to 1,024. AMD Opteron, 1 core, GotoBLAS2.]

35 Comparison. dtrsm(L, L, N, N, m, n, 0.5, L, 2500, B, 2500). [Plot: average error [%] vs. number of samples for Model Expansion and Adaptive Refinement. AMD Opteron, 1 core, GotoBLAS2.]

36 Outline: 1. Sampling, 2. Modeling, 3. Prediction and Ranking.

37 Ranking. Blocked algorithm revisited: L := L^{-1}, partitioned into blocks L00; L10, L11; L20, L21, L22.
Variant 1: L10 := L10 L00;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 2: L21 := -L22^{-1} L21;  L21 := L21 L11^{-1};  L11 := L11^{-1}
Variant 3: L21 := L21 L11^{-1};  L20 := L20 + L21 L10;  L10 := -L11^{-1} L10;  L11 := L11^{-1}
Variant 4: L21 := -L22^{-1} L21;  L20 := L20 - L21 L10;  L10 := L10 L00;  L11 := L11^{-1}

38 C implementation: Variant 1 (L10 := L10 L00;  L10 := -L11^{-1} L10;  L11 := L11^{-1}):

void dtrmm_(const char *side, const char *uplo, const char *transa,
            const char *diag, const int *m, const int *n,
            const double *alpha, const double *A, const int *lda,
            double *B, const int *ldb);
void dtrsm_(const char *side, const char *uplo, const char *transa,
            const char *diag, const int *m, const int *n,
            const double *alpha, const double *A, const int *lda,
            double *B, const int *ldb);

int trinv1_(char *diag, int *n, double *A, int *lda, int *bsize) {
    if (*n == 1) {
        if (diag[0] == 'N')
            *A = 1 / *A;
        return 0;
    }
    int ione = 1;
    double one = 1;
    double mone = -1;
    for (int p = 0; p < *n; p += *bsize) {
        int b = *bsize;
        if (p + b > *n)
            b = *n - p;
#define A00 (A)
#define A10 (A + p)
#define A11 (A + *lda * p + p)
        /* L10 := L10 * L00 (L00 already holds its inverse) */
        dtrmm_("R", "L", "N", diag, &b, &p, &one, A00, lda, A10, lda);
        /* L10 := -L11^{-1} * L10 */
        dtrsm_("L", "L", "N", diag, &b, &p, &mone, A11, lda, A10, lda);
        /* L11 := L11^{-1}, unblocked (block size 1) */
        trinv1_(diag, &b, A11, lda, &ione);
    }
    return 0;
}

39 Prediction. Input: trinv1(N, 300, A, 300, 100). Compute the routine invocations, one per update statement and iteration:
L10 := L10 L00 → (dtrmm, R, L, N, N, 100, 0, 1, ·, 300, ·, 300)
L10 := -L11^{-1} L10 → (dtrsm, L, L, N, N, 100, 0, -1, ·, 300, ·, 300)
L11 := L11^{-1} → (trinv1, N, 100, ·, 300, 1)
L10 := L10 L00 → (dtrmm, R, L, N, N, 100, 100, 1, ·, 300, ·, 300)
L10 := -L11^{-1} L10 → (dtrsm, L, L, N, N, 100, 100, -1, ·, 300, ·, 300)
L11 := L11^{-1} → (trinv1, N, 100, ·, 300, 1)
L10 := L10 L00 → (dtrmm, R, L, N, N, 100, 200, 1, ·, 300, ·, 300)
L10 := -L11^{-1} L10 → (dtrsm, L, L, N, N, 100, 200, -1, ·, 300, ·, 300)
L11 := L11^{-1} → (trinv1, N, 100, ·, 300, 1)
Then evaluate the performance models for each invocation and accumulate the predicted ticks.
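In code, this could amount to replaying the blocked loop of trinv1 without executing any BLAS, querying a model for each invocation, and summing the predicted ticks. The predict_* hooks below are hypothetical placeholders for evaluations of the models from the previous section, with flop-count-shaped dummy bodies so the sketch compiles:

#include <stdio.h>

/* Hypothetical model hooks; a real version would evaluate the fitted
   piecewise polynomials instead of these flop-like placeholders. */
double predict_dtrmm_ticks(int m, int n) { return 1.0 * m * n * n; }
double predict_dtrsm_ticks(int m, int n) { return 1.0 * m * m * n; }
double predict_trinv_unblocked_ticks(int n) { return 1.0 * n * n * n; }

/* Replay trinv1(diag, n, A, lda, bsize) symbolically, accumulating the
   predicted ticks of every kernel invocation. */
double predict_trinv1_ticks(int n, int bsize) {
    double ticks = 0;
    for (int p = 0; p < n; p += bsize) {
        int b = (p + bsize > n) ? n - p : bsize;
        ticks += predict_dtrmm_ticks(b, p);        /* L10 := L10 L00       */
        ticks += predict_dtrsm_ticks(b, p);        /* L10 := -L11^{-1} L10 */
        ticks += predict_trinv_unblocked_ticks(b); /* L11 := L11^{-1}      */
    }
    return ticks;
}

int main(void) {
    printf("predicted ticks: %g\n", predict_trinv1_ticks(300, 100));
    return 0;
}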

40 Results: cache-thrashing. [Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

41 Results: in-cache. [Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

42 Results: in-cache, continued. [Plot: efficiency vs. n (up to 1,024) for variants 1-4, measurements vs. prediction. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

43 Results: probabilistic prediction. [Plot: efficiency vs. n (up to 1,024) for variants 1-4; measurements vs. predicted median, average, min, and max. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]

44 Optimal block size. [Plot: efficiency vs. block size for variants 1-4; measurements vs. predicted median, average, min, and max. Intel Harpertown, 2.99 GHz, 1 core, Intel MKL BLAS.]
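Given such a predictor, choosing the block size need not involve running the algorithm at all. A sketch reusing predict_trinv1_ticks from the prediction sketch above; the 8..256 search range is an arbitrary choice for illustration:

/* Pick the block size with the fewest predicted ticks for size n. */
int best_blocksize(int n) {
    int best_b = 8;
    double best_ticks = predict_trinv1_ticks(n, 8);
    for (int b = 16; b <= 256; b += 8) {
        double t = predict_trinv1_ticks(n, b);
        if (t < best_ticks) { best_ticks = t; best_b = b; }
    }
    return best_b;
}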

45 Conclusion. 1. Sampling: a performance measurement tool applicable to various DLA routines. 2. Modeling: an automatic modeling system, flexible and accurate. 3. Prediction and Ranking: accurate predictions and a correct ranking.

46 Performance Modeling for Ranking Blocked Algorithms. Questions? Funding from DFG is gratefully acknowledged.
