Resources for parallel computing


BLAS

Basic linear algebra subprograms. Originally published in ACM TOMS (1979) (Linpack Blas + Lapack). They implement matrix operations up to matrix-matrix multiplication and triangular solve, but not matrix factorizations or eigenvalue calculations. A reference implementation is on netlib.org.

Web page "Frequently asked questions" (BLAS):

1) What and where are the BLAS?
2) Are there legal restrictions on the use of BLAS reference implementation software?
3) Publications/references for the BLAS?
4) Is there a Quick Reference Guide to the BLAS available?
5) Are optimized BLAS libraries available? Where can I find vendor supplied BLAS?
6) Where can I find Java BLAS?
7) Is there a C interface to the BLAS?

8) Are prebuilt reference implementations of the Fortran77 BLAS available?
9) What about shared memory machines? Are there multithreaded versions of the BLAS available?

1) What and where are the BLAS?

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-.

See also Wikipedia.

Example of a reference subroutine:

          SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
         $                   BETA, Y, INCY )
    *     .. Scalar Arguments ..
          DOUBLE PRECISION   ALPHA, BETA
          INTEGER            INCX, INCY, LDA, M, N
          CHARACTER*1        TRANS
    *     .. Array Arguments ..
          DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
    *
    *  Purpose
    *  =======
    *
    *  DGEMV  performs one of the matrix-vector operations
    *
    *     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
    *
    *  where alpha and beta are scalars, x and y are vectors and A is an
    *  m by n matrix.
    *
    *  Parameters
    *  ==========
    *  ...
    *  M      - INTEGER.
    *           On entry, M specifies the number of rows of the matrix A.
    *           M must be at least zero.
    *  ...
                IF( INCY.EQ.1 )THEN
                   DO 60, J = 1, N
                      IF( X( JX ).NE.ZERO )THEN
                         TEMP = ALPHA*X( JX )
                         DO 50, I = 1, M
                            Y( I ) = Y( I ) + TEMP*A( I, J )
       50                CONTINUE
                      END IF
                      JX = JX + INCX
       60          CONTINUE
          ...

See also the BLAS quick reference.
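The column-oriented loop in the DGEMV excerpt can be mirrored in C. The following is only a minimal sketch, assuming the no-transpose case, unit stride in y and column-major (Fortran fashion) storage; the name dgemv_notrans_sketch is hypothetical and is not part of any BLAS library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLAS routine): computes
 *   y := alpha*A*x + beta*y
 * for an m-by-n matrix A stored column by column with leading dimension lda,
 * unit stride in y and stride incx > 0 in x. */
void dgemv_notrans_sketch(int m, int n, double alpha, const double *a, int lda,
                          const double *x, int incx, double beta, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= m && incx > 0);

    /* First form y := beta*y. */
    for (int i = 0; i < m; i++)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Then add alpha*x(j) times column j of A, one column at a time,
     * in the same order as the DO 60 / DO 50 loops of the excerpt. */
    int jx = 0;
    for (int j = 0; j < n; j++) {
        if (x[jx] != 0.0) {
            double temp = alpha * x[jx];
            const double *col = a + (size_t)lda * j;
            for (int i = 0; i < m; i++)
                y[i] += temp * col[i];
        }
        jx += incx;
    }
}
```

Walking down one column at a time keeps the inner loop at stride 1 in memory, which is presumably why the reference code is organized this way for column-major arrays.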

LAPACK

Fortran subroutines for linear equations (dense, banded), linear least squares problems, eigenvalue problems and singular values. There is a printed user guide (LAPACK Users' Guide, 11 authors, SIAM, 1999). Part of this guide is available online as several HTML documents, and there are man pages which are worth installing. The routines were written with parallel computation in mind.

Atlas

Open source implementation of BLAS and a few LAPACK routines. These must be (or should be) built (with make) on the machine where they are to be used (may take a few hours?).

GotoBLAS (pronounced: gótóblas)

The web page boasts: "currently the fastest implementation of the Basic Linear Algebra Subroutines".

Intel MKL (Math kernel library)

Academic license $160. Includes:

- BLAS
- Selected LAPACK routines
- Fortran 95 interface
- CBLAS (interface to call from C)
- Sparse BLAS
- Sparse linear equation solvers
- ScaLAPACK
- Some statistical functions (incl. random number generation)
- Some MPI support
- Fast Fourier transforms
- PDE solution support (I think incl. Poisson solver)
- Some numerical optimization

User manual and reference manual: these come with the program, and both are also available online (107 pages and 3250 pages).

AMD Core Math Library (ACML): see its web page.

DGEMM benchmarks

[Figure: DGEMM benchmark from the Intel web site]

[Figure: DGEMM benchmark from the GotoBLAS web site]

[Figure: DGEMM benchmark from a neutral (?) web site]

Example of BLAS use

The following examples are from programs for evaluation of VARMA time-series likelihood that I wrote first in Matlab (~3500 lines) and am close to finishing translating into C (~7000 lines). Eventually I hope to call some parallel BLAS routines and report on timing comparison between the Matlab and C versions. The Matlab programs are available online and the C programs are on the way there.

omega_factor.m

    %OMEGA_FACTOR  Cholesky factorization of Omega
    %
    %  [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko) calculates the Cholesky
    %  factorization Omega = L L' of Omega which is stored in two parts, a full
    %  upper left partition, Su, and a block-band lower partition, Olow, as returned
    %  by omega_build. Omega is symmetric, only the lower triangle of Su is
    %  populated, and Olow only stores diagonal and subdiagonal blocks. On exit, L =
    %  [L1; L2] with L1 = [Lu 0] and L2 is stored in block-band-storage in Ll. Info
    %  is 0 on success, otherwise the loop index resulting in a negative number
    %  square root. P and q are the dimensions of the problem and ko is a vector
    %  with ko(t) = number of observed values before time t.
    %
    %  In the complete data case ko should be 0:r:n*r. For missing values, Su and
    %  Olow are the upper left and lower partitions of Omega_o = Omega with missing
    %  rows and columns removed. In this case Lu and Ll return L_o, the Cholesky
    %  factor of Omega_o.

    function [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko)
      n = length(ko)-1;
      h = max(p,q);
      ro = diff(ko);
      Ll = zeros(size(Olow));
      [Lu,info] = chol(Su');     % upper left partition
      if info>0, return; end
      Lu = Lu';
      e = ko(h+1);               % order of Su
      for t = h+1 : n            % loop over block-lines in Olow
        K = ko(t)+1 : ko(t+1);
        KL = K - ko(t-q);
        JL = 1 : ko(t)-ko(t-q);
        tmin = t-q;
        tmax = t-1;
        Ll(K-e, JL) = omega_forward(Lu, Ll, Olow(K-e,JL)', p, q, ko, tmin, tmax)';
        [Ltt, info] = chol(Olow(K-e,KL) - Ll(K-e,JL)*Ll(K-e,JL)');
        if info>0, info = info + ko(t); return; end
        Ll(K-e, KL) = Ltt';
      end
    end

OmegaFactor.c

    #include "xassert.h"
    #include "BlasGateway.h"
    #include "VarmaUtilities.h"
    #include "Omega.h"

    void OmegaFactor ( // Cholesky-factorize Omega, or, for missing values, Omega_o
      double Su[],   // in/out  upper left part of Omega, dimension msu x msu
      double Olow[], // in/out  molow x nolow, block-diagonals of lower part of Omega
      int p,         // in      number of autoregressive terms
      int q,         // in      number of moving average terms
      int n,         // in      length of time series
      int ko[],      // in      ko[t] = N of observed values before time t+1, t<=n
      int *info)     // out     0 if ok, otherwise k for first nonpositive Ltt
    {
      // Finds Cholesky factorization of covariance matrix Omega_o for missing value
      // VARMA log-likelihood. Also handles complete data with ko[i]=r*i for all i.
      // The Cholesky factors overwrite Omega in memory. All matrices are stored in
      // Fortran fashion.
      //
      double *U, *Ltt;
      int t, j, ro;
      int h = max(p,q);
      int msu = ko[h];        // order of Su
      int molow = ko[n]-msu;  // no. of rows in Olow
      //
      // CHOLESKY-FACTORIZE Su INTO Lu Lu' (Lu OVERWRITES Su):
      if (msu>0) potrf("low", msu, Su, msu, info); else *info = 0;
      xassert(*info >= 0);
      if (*info>0) return;
      //
      // TURN ATTENTION TO Olow:
      U = Olow;
      for (t=h; t<n; t++) {
        ro = ko[t+1] - ko[t];
        if (ro>0) {
          j = ko[t] - ko[t-q];
          Ltt = U + molow*j;
          // Solve L(0:t-1,0:t-1) U' = s' (U contains s on call):
          OmegaForward("T", Su, Olow, p, q, ko, n, U, molow, ro, t-q, t-1);
          // Ltt - U U' and then Cholesky of that
          syrk("low", "N", ro, j, -1.0, U, molow, 1.0, Ltt, molow);
          potrf("low", ro, Ltt, molow, info);
          if (*info>0) { *info += ko[t]; return; }
        }
        U += ro;
      }
    }
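The step commented "Ltt - U U'" in OmegaFactor.c is a symmetric rank-k update restricted to the lower triangle. What such an update computes can be sketched in plain C as below, assuming column-major storage with leading dimensions; syrk_lower_sketch is a hypothetical name, and it is neither the syrk gateway used above nor the BLAS DSYRK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C := alpha*A*A' + beta*C, where A is n-by-k and only
 * the lower triangle of the n-by-n matrix C is read and written.
 * Both arrays are stored column by column with leading dimensions lda, ldc. */
void syrk_lower_sketch(int n, int k, double alpha, const double *a, int lda,
                       double beta, double *c, int ldc)
{
    assert(n >= 0 && k >= 0 && lda >= n && ldc >= n);
    for (int j = 0; j < n; j++) {
        for (int i = j; i < n; i++) {          /* lower triangle only: i >= j */
            double sum = 0.0;
            for (int l = 0; l < k; l++)        /* (A*A')(i,j) = sum_l A(i,l)*A(j,l) */
                sum += a[i + (size_t)lda * l] * a[j + (size_t)lda * l];
            double *cij = &c[i + (size_t)ldc * j];
            *cij = alpha * sum + beta * (*cij);
        }
    }
}
```

With alpha = -1 and beta = 1 this subtracts A*A' from C in place, which corresponds to the argument pattern of the syrk call in the listing. Entries above the diagonal are left untouched, so a matrix stored as part of a larger array is not disturbed.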

From omega_forward.m

    LT.LT = true;
    if m>0, X(1:m,:) = linsolve(Lu(j:end,j:end), Y(1:m,:), LT); end
    for t=max(h+1,tmin):tmax
      t1 = max(t-q,tmin);
      J = ko(t1)+1 : ko(t);
      K = ko(t)+1 : ko(t+1);
      JX = J - j + 1;
      KX = K - j + 1;
      JL = J - ko(t-q);
      KL = K - ko(t-q);
      X(KX,:) = X(KX,:) - Ll(K-e,JL)*X(JX,:);
      X(KX,:) = linsolve(Ll(K-e,KL), X(KX,:), LT);
    end

From OmegaForward.c

    if (transp && e>0)
      trsm("right", "Low", "T", "NotUdia", ny, m, 1.0, Luii, e, Y, iy);
    else if (e>0)
      trsm("left","low", "NoT", "NotUdia", m, ny, 1.0, Luii, e, Y, iy);
    //
    // FIND SECOND PARTITION OF X:
    incy = transp? iy : 1;
    tbeg = max(h,tmin);
    for (t=tbeg; t<=tmax; t++) {
      k1 = max(t-q,tmin);
      k = ko[t] - ko[k1];
      Yt = Y + (ko[t] - j)*incy;
      Yk = Yt - k*incy;
      Lt = Ll + ko[t] - e;
      Ltt = Lt + mll*(ko[t] - ko[t-q]);
      Ltk = Ltt - mll*k;
      ro = ko[t+1] - ko[t];
      if (transp && iy>0) {
        gemm("nt", "T", ny, ro, k, -1.0, Yk, iy, Ltk, mll, 1.0, Yt, iy);
        trsm("right", "Low", "T", "NotUdia", ny, ro, 1.0, Ltt, mll, Yt, iy);
      }
      else if (!transp && mll>0) {
        gemm("nt", "NT", ro, ny, k, -1.0, Ltk, mll, Yk, iy, 1.0, Yt, iy);
        trsm("left", "Low", "NT", "NotUdia", ro, ny, 1.0, Ltt, mll, Yt, iy);
      }
    }
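Both excerpts rely on a lower-triangular solve (linsolve with the LT option in Matlab, trsm in C). For the "left", "low", no-transpose, non-unit-diagonal case this amounts to forward substitution, which may be sketched as follows. The function name trsm_left_lower_sketch is hypothetical and column-major storage is assumed; it only illustrates the operation and is not the trsm gateway or the BLAS DTRSM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: solves L*X = B for X, where L is an m-by-m lower
 * triangular matrix with nonzero diagonal and B is m-by-n.
 * X overwrites B. Arrays are column-major with leading dimensions ldl, ldb. */
void trsm_left_lower_sketch(int m, int n, const double *l, int ldl,
                            double *b, int ldb)
{
    assert(m >= 0 && n >= 0 && ldl >= m && ldb >= m);
    for (int j = 0; j < n; j++) {               /* one right-hand side at a time */
        double *bj = b + (size_t)ldb * j;
        for (int k = 0; k < m; k++) {
            bj[k] /= l[k + (size_t)ldl * k];    /* x(k) = b(k) / L(k,k) */
            for (int i = k + 1; i < m; i++)     /* remove x(k) from later rows */
                bj[i] -= bj[k] * l[i + (size_t)ldl * k];
        }
    }
}
```

The inner update walks down column k of L, so memory access is again at stride 1, matching the Fortran-fashion storage used throughout the programs above.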

Gateway function to reference Blas/Lapack

    // Gateway function to reference Blas/Lapack
    #include "BlasGateway.h"
    #include "blasf.h"

    void gemm(char *transa, char *transb, int m, int n, int k, double alpha,
              double a[], int lda, double b[], int ldb, double beta,
              double c[], int ldc)
    {
      dgemm(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc, 1, 1);
    }

BLAS/LAPACK from Fortran examples

Cholesky factorization using Netlib LAPACK95:

    SUBROUTINE LA_POTRF( A, UPLO, RCOND, NORM, INFO )
    !  <type>(<wp>), INTENT(INOUT) :: A(:,:)
    !  CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: UPLO
    !  REAL(<wp>), INTENT(OUT), OPTIONAL :: RCOND
    !  CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: NORM
    !  INTEGER, INTENT(OUT), OPTIONAL :: INFO

Cholesky factorization from MKL

[Figure: excerpt from the MKL Reference Manual]
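The gateway idea shown earlier on this slide (scalars and dimensions taken by value, then forwarded by address to a Fortran-style routine) can be tried without linking any BLAS library. In the sketch below, dgemm_standin is an assumed stand-in written in C with an all-pointer argument list; it is not the reference DGEMM and it omits the hidden string-length arguments that the real call above passes as the trailing "1, 1". The name gemm_sketch is likewise hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in with a Fortran-style interface: every argument is a pointer.
 * Plain triple loop computing C := alpha*op(A)*op(B) + beta*C on
 * column-major arrays; op is a transpose if the flag starts with 'T' or 't'. */
static void dgemm_standin(const char *transa, const char *transb,
                          const int *m, const int *n, const int *k,
                          const double *alpha, const double *a, const int *lda,
                          const double *b, const int *ldb,
                          const double *beta, double *c, const int *ldc)
{
    int ta = toupper((unsigned char)transa[0]) == 'T';
    int tb = toupper((unsigned char)transb[0]) == 'T';
    size_t la = (size_t)(*lda), lb = (size_t)(*ldb), lc = (size_t)(*ldc);
    for (int j = 0; j < *n; j++) {
        for (int i = 0; i < *m; i++) {
            double sum = 0.0;
            for (int l = 0; l < *k; l++) {
                double ail = ta ? a[l + la * i] : a[i + la * l];
                double blj = tb ? b[j + lb * l] : b[l + lb * j];
                sum += ail * blj;
            }
            double *cij = &c[i + lc * j];
            *cij = (*alpha) * sum + ((*beta == 0.0) ? 0.0 : (*beta) * (*cij));
        }
    }
}

/* Gateway: the C caller passes dimensions and scalars by value, and the
 * gateway takes their addresses for the Fortran-style routine. */
void gemm_sketch(const char *transa, const char *transb, int m, int n, int k,
                 double alpha, const double a[], int lda,
                 const double b[], int ldb, double beta, double c[], int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    dgemm_standin(transa, transb, &m, &n, &k, &alpha, a, &lda,
                  b, &ldb, &beta, c, &ldc);
}
```

Because the parameters of gemm_sketch are local copies, taking their addresses is safe, and callers can write literal constants such as 1.0 directly in the argument list, which is not possible when calling the pointer-based routine itself.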

Matrix-matrix multiply

Using Intel MKL (help from the reference manual):

[Figure: excerpt from the MKL reference manual]

There is a description (incompatible with MKL) and a link to a reference implementation on Netlib (difficult to download and use). Here is an example from the description:

[Figure: example from the description]

For calling from Fortran 77 see the call to the reference version of DGEMM as shown above in the quote from the MKL reference manual.

Sparse BLAS

A Fortran 95 reference implementation was published in ACM TOMS in 2002. Included in MKL. Originally defined by the BLAS Technical Forum, see Netlib.

CBLAS

Netlib has a description and an implementation (interface to the Fortran reference BLAS).

Example:

[Figure: CBLAS example]

Notice that the dimension arguments are passed by value, and not by reference as is necessary when calling the Fortran routines directly from C.

Further information is on the MKL web pages. Atlas (see above) has a free implementation and a quick reference card. The GNU Scientific Library has a BLAS interface of a different design (with its own vector / matrix data types).


More information

NAG Fortran Library Routine Document F04BJF.1

NAG Fortran Library Routine Document F04BJF.1 F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

NEW ADVANCES IN GPU LINEAR ALGEBRA

NEW ADVANCES IN GPU LINEAR ALGEBRA GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

More information

Programming for supercomputers

Programming for supercomputers Programming for supercomputers Anthony Scemama http://scemama.mooo.com Labratoire de Chimie et Physique Quantiques IRSAMC (Toulouse) Note Slides can be downloaded here: http://irpf90.ups-tlse.fr/files/lttc17_supercomputing.pdf

More information

NAG Fortran Library Routine Document F04CJF.1

NAG Fortran Library Routine Document F04CJF.1 F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification

Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification Jack Dongarra *, Iain Duff, Mark Gates *, Azzam Haidar *, Sven Hammarling, Nicholas J. Higham, Jonathan Hogg, Pedro Valero Lara, Piotr

More information

Intel Math Kernel Library (Intel MKL) Team - Presenter: Murat Efe Guney Workshop on Batched, Reproducible, and Reduced Precision BLAS Georgia Tech,

Intel Math Kernel Library (Intel MKL) Team - Presenter: Murat Efe Guney Workshop on Batched, Reproducible, and Reduced Precision BLAS Georgia Tech, Intel Math Kernel Library (Intel MKL) Team - Presenter: Murat Efe Guney Workshop on Batched, Reproducible, and Reduced Precision BLAS Georgia Tech, Atlanta February 24, 2017 Acknowledgements Benoit Jacob

More information

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel

More information

The implementation of the Sparse BLAS in Fortran 95 1

The implementation of the Sparse BLAS in Fortran 95 1 The implementation of the Sparse BLAS in Fortran 95 1 Iain S. Duff 2 Christof Vömel 3 Technical Report TR/PA/01/27 September 10, 2001 CERFACS 42 Ave G. Coriolis 31057 Toulouse Cedex France ABSTRACT The

More information

*Yuta SAWA and Reiji SUDA The University of Tokyo

*Yuta SAWA and Reiji SUDA The University of Tokyo Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,

More information

COMPUTATIONAL LINEAR ALGEBRA

COMPUTATIONAL LINEAR ALGEBRA COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

ABSTRACT 1. INTRODUCTION. * phone ; fax ; emphotonics.com

ABSTRACT 1. INTRODUCTION. * phone ; fax ; emphotonics.com CULA: Hybrid GPU Accelerated Linear Algebra Routines John R. Humphrey *, Daniel K. Price, Kyle E. Spagnoli, Aaron L. Paolini, Eric J. Kelmelis EM Photonics, Inc, 51 E Main St, Suite 203, Newark, DE, USA

More information

NAG Fortran Library Routine Document F08KAF (DGELSS).1

NAG Fortran Library Routine Document F08KAF (DGELSS).1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

CHAPEL + LAPACK NEW DOG, MEET OLD DOG. Ian Bertolacci

CHAPEL + LAPACK NEW DOG, MEET OLD DOG. Ian Bertolacci CHAPEL + LAPACK NEW DOG, MEET OLD DOG. Ian Bertolacci INTRO: WHAT IS CHAPEL Chapel is a high performance programming language that has been in development at Cray since 2005. It includes many parallel

More information

PARDISO Version Reference Sheet Fortran

PARDISO Version Reference Sheet Fortran PARDISO Version 5.0.0 1 Reference Sheet Fortran CALL PARDISO(PT, MAXFCT, MNUM, MTYPE, PHASE, N, A, IA, JA, 1 PERM, NRHS, IPARM, MSGLVL, B, X, ERROR, DPARM) 1 Please note that this version differs significantly

More information

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?

More information

Compilers & Optimized Librairies

Compilers & Optimized Librairies Institut de calcul intensif et de stockage de masse Compilers & Optimized Librairies Modules Environment.bashrc env $PATH... Compilers : GNU, Intel, Portland Memory considerations : size, top, ulimit Hello

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

CUSOLVER LIBRARY. DU _v7.0 March 2015

CUSOLVER LIBRARY. DU _v7.0 March 2015 CUSOLVER LIBRARY DU-06709-001_v7.0 March 2015 TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. cusolverdn: Dense LAPACK...2 1.2. cusolversp: Sparse LAPACK...2 1.3. cusolverrf: Refactorization...3 1.4.

More information

Mathematical libraries at the CHPC

Mathematical libraries at the CHPC Presentation Mathematical libraries at the CHPC Martin Cuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu October 19, 2006 http://www.chpc.utah.edu Overview What and what

More information

F02WUF NAG Fortran Library Routine Document

F02WUF NAG Fortran Library Routine Document F02 Eigenvalues and Eigenvectors F02WUF NAG Fortran Library Routine Document Note. Before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised

More information

CUDA. CUBLAS Library

CUDA. CUBLAS Library CUDA CUBLAS Library PG-00000-002_V1.0 June, 2007 CUBLAS Library PG-00000-002_V1.0 Published by Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 Notice ALL DESIGN SPECIFICATIONS, REFERENCE BOARDS,

More information

Intel Math Kernel Library. Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication

Intel Math Kernel Library. Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Intel Math Kernel Library Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.

More information

NAG Fortran Library Routine Document F07AAF (DGESV).1

NAG Fortran Library Routine Document F07AAF (DGESV).1 NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

NAG Library Routine Document F04MCF.1

NAG Library Routine Document F04MCF.1 NAG Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

Part 3: Functions New to the IMSL Fortran Numerical Library 7.0

Part 3: Functions New to the IMSL Fortran Numerical Library 7.0 IMSL( ) Fortran Numerical Library, Version 7.0.0 January 2012 (revised) This document contains release notes for IMSL Fortran Numerical Library, Version 7.0.0. This document has the following parts: 1.

More information

EM Algorithm for a Mixed-Effect Multivariate State-Space Model with Missing Data

EM Algorithm for a Mixed-Effect Multivariate State-Space Model with Missing Data EM Algorithm for a Mixed-Effect Multivariate State-Space Model with Missing Data Yu He Supervisor: Warren Jin COMP8740 Project Report Artificial Intelligence College of Engineering and Computer Science

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Outline. MC Sampling Bisection. 1 Consolidate your C basics. 2 More C Basics. 3 Working with Arrays. 4 Arrays and Structures. 5 Working with pointers

Outline. MC Sampling Bisection. 1 Consolidate your C basics. 2 More C Basics. 3 Working with Arrays. 4 Arrays and Structures. 5 Working with pointers Outline 1 Consolidate your C basics 2 Basics 3 Working with 4 5 Working with pointers 6 Working with and File I/O 7 Outline 1 Consolidate your C basics 2 Basics 3 Working with 4 5 Working with pointers

More information

Intel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop

Intel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop Intel MKL @ MIC Architecture Dr. Momme Allalen, LRZ, allalen@lrz.de PRACE PATC: Intel MIC&GPU Programming Workshop 1 2 Momme Allalen, HPC with GPGPUs, Oct. 10, 2011 What is the Intel MKL? Math library

More information

Technical Report Performance Analysis of CULA on different NVIDIA GPU Architectures. Prateek Gupta

Technical Report Performance Analysis of CULA on different NVIDIA GPU Architectures. Prateek Gupta Technical Report 2014-02 Performance Analysis of CULA on different NVIDIA GPU Architectures Prateek Gupta May 20, 2014 1 Spring 2014: Performance Analysis of CULA on different NVIDIA GPU Architectures

More information

The AxParafit and AxPcoords Manual

The AxParafit and AxPcoords Manual The AxParafit and AxPcoords Manual A. Stamatakis 1, A. Auch 2, J. Meier-Kolthoff 2, and M. Göker 3 1 École Polytechnique Fédérale de Lausanne School of Computer & Communication Sciences Laboratory for

More information

CSDP User s Guide. Brian Borchers. August 15, 2006

CSDP User s Guide. Brian Borchers. August 15, 2006 CSDP User s Guide Brian Borchers August 5, 6 Introduction CSDP is a software package for solving semidefinite programming problems. The algorithm is a predictor corrector version of the primal dual barrier

More information