Resources for parallel computing


BLAS: Basic Linear Algebra Subprograms

Originally published in ACM TOMS (1979) (LINPACK BLAS + LAPACK). The BLAS implement matrix operations up to matrix-matrix multiplication and triangular solve, but not matrix factorizations or eigenvalue calculations. A reference implementation is on netlib.org.

Web page: www.netlib.org/blas

From the "Frequently asked questions" about the BLAS:

1) What and where are the BLAS?
2) Are there legal restrictions on the use of BLAS reference implementation software?
3) Publications/references for the BLAS?
4) Is there a Quick Reference Guide to the BLAS available?
5) Are optimized BLAS libraries available? Where can I find vendor supplied BLAS?
6) Where can I find Java BLAS?
7) Is there a C interface to the BLAS?

8) Are prebuilt reference implementations of the Fortran77 BLAS available?
9) What about shared memory machines? Are there multithreaded versions of the BLAS available?

1) What and where are the BLAS?
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. There is also a description on Wikipedia.

Example of a reference subroutine:

      SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
     $                   BETA, Y, INCY )
*     .. Scalar Arguments ..
      DOUBLE PRECISION   ALPHA, BETA
      INTEGER            INCX, INCY, LDA, M, N
      CHARACTER*1        TRANS
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
*
*  Purpose

*  =======
*
*  DGEMV  performs one of the matrix-vector operations
*
*     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
*
*  where alpha and beta are scalars, x and y are vectors and A is an
*  m by n matrix.
*
*  Parameters
*  ==========
*  ...
*  M      - INTEGER.
*           On entry, M specifies the number of rows of the matrix A.
*           M must be at least zero.
      ...
      IF( INCY.EQ.1 )THEN
         DO 60, J = 1, N
            IF( X( JX ).NE.ZERO )THEN
               TEMP = ALPHA*X( JX )
               DO 50, I = 1, M
                  Y( I ) = Y( I ) + TEMP*A( I, J )
   50          CONTINUE
            END IF
            JX = JX + INCX
   60    CONTINUE
      ...

BLAS quick reference: see www.netlib.org/blas/blasqr.pdf
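To make the calling convention concrete, here is a minimal sketch (not from the original notes) of calling the reference DGEMV from C. It assumes the common compiler convention that the Fortran symbol is dgemv_ with every argument passed by reference; the hidden string-length argument that some compilers append for TRANS is omitted, which works on the usual x86 ABIs.

#include <stdio.h>

/* Assumed prototype for the Fortran routine (dgemv_ naming convention) */
extern void dgemv_(const char *trans, const int *m, const int *n,
                   const double *alpha, const double *a, const int *lda,
                   const double *x, const int *incx, const double *beta,
                   double *y, const int *incy);

int main(void) {
    double A[6] = { 1, 4,   2, 5,   3, 6 };  /* 2 x 3, column-major: [1 2 3; 4 5 6] */
    double x[3] = { 1, 1, 1 };
    double y[2] = { 0, 0 };
    int m = 2, n = 3, lda = 2, inc1 = 1;
    double alpha = 1.0, beta = 0.0;
    /* y := 1.0*A*x + 0.0*y; note INCX/INCY allow strided vector access */
    dgemv_("N", &m, &n, &alpha, A, &lda, x, &inc1, &beta, y, &inc1);
    printf("y = %g %g\n", y[0], y[1]);       /* expect y = 6 15 */
    return 0;
}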

LAPACK

Fortran subroutines for linear equations (dense, banded), linear least squares problems, eigenvalue problems and singular values. There is a printed user guide (LAPACK Users' Guide, 11 authors, SIAM, 1999); part of this guide is available as several html documents on www.netlib.org/lapack/lug, and there are man pages on www.netlib.org/lapack/manpages.tgz which are worth installing. The routines were written with parallel computation in mind.

Web page: www.netlib.org/lapack
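Since Cholesky factorization recurs in the examples below, here is a minimal sketch (not from the notes) of calling LAPACK's DPOTRF directly from C, under the same assumed dpotrf_ naming and pass-by-reference convention:

#include <stdio.h>

/* Assumed prototype for the Fortran routine (dpotrf_ naming convention) */
extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

int main(void) {
    /* Symmetric positive definite matrix [4 2; 2 3], column-major */
    double A[4] = { 4, 2,   2, 3 };
    int n = 2, lda = 2, info;
    dpotrf_("L", &n, A, &lda, &info);  /* lower triangular factor overwrites A */
    if (info == 0)
        printf("L = [%g 0; %g %g]\n", A[0], A[1], A[3]);  /* 2, 1, sqrt(2) */
    else
        printf("dpotrf failed, info = %d\n", info);
    return 0;
}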

ATLAS

math-atlas.sourceforge.net. Open source implementation of BLAS and a few LAPACK routines. These must be (or should be) built (with make) on the machine where they are to be used (may take a few hours?).

GotoBLAS (pronunciation: "goto-blas")

http://www.tacc.utexas.edu/resources/software/. The web page boasts: currently the fastest implementation of the Basic Linear Algebra Subroutines

Intel MKL (Math Kernel Library)

Academic license $160. Includes:
  BLAS
  selected LAPACK routines
  Fortran 95 interface
  CBLAS (interface to call from C)
  sparse BLAS
  sparse linear equation solvers
  ScaLAPACK
  some statistical functions (incl. random number generation)
  some MPI support
  fast Fourier transforms
  PDE solution support (I think incl. a Poisson solver)
  some numerical optimization

A user manual and a reference manual come with the program, and both are also available on http://www.intel.com/cd/software/products/asmo-na/eng/345631.htm (107 pages and 3250 pages respectively).

AMD Core Math Library (ACML)

Web page: www.amd.com/acml

DGEMM benchmarks

[Benchmark charts, not reproduced here: one from the Intel web site, one from the GotoBLAS web site, and one from a neutral (?) web site.]

Example of BLAS use

The following examples are from programs for evaluation of the VARMA time-series likelihood, which I wrote first in Matlab (~3500 lines) and am close to finishing translating into C (~7000 lines). Eventually I hope to call some parallel BLAS routines and report on timing comparisons between the Matlab and C versions. The Matlab programs are on www.hi.is/~jonasson, and the C programs are on the way there.

omega_factor.m

%OMEGA_FACTOR  Cholesky factorization of Omega
%
%   [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko) calculates the Cholesky
%   factorization Omega = L*L' of Omega, which is stored in two parts: a full
%   upper left partition, Su, and a block-band lower partition, Olow, as
%   returned by omega_build. Omega is symmetric, only the lower triangle of Su
%   is populated, and Olow only stores diagonal and subdiagonal blocks. On
%   exit, L = [L1; L2] with L1 = [Lu 0], and L2 is stored in block-band
%   storage in Ll. Info is 0 on success, otherwise the loop index resulting in
%   a negative number square root. P and q are the dimensions of the problem
%   and ko is a vector with ko(t) = number of observed values before time t.
%
%   In the complete data case ko should be 0:r:n*r. For missing values, Su and
%   Olow are the upper left and lower partitions of Omega_o = Omega with
%   missing rows and columns removed. In this case Lu and Ll return L_o, the
%   Cholesky factor of Omega_o.

function [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko)
  n = length(ko)-1;
  h = max(p,q);
  ro = diff(ko);
  Ll = zeros(size(Olow));
  [Lu,info] = chol(Su');                        % upper left partition
  if info>0, return; end
  Lu = Lu';
  e = ko(h+1);                                  % order of Su
  for t = h+1 : n                               % loop over block-lines in Olow
    K = ko(t)+1 : ko(t+1);
    KL = K - ko(t-q);
    JL = 1 : ko(t)-ko(t-q);
    tmin = t-q;
    tmax = t-1;
    Ll(K-e,JL) = omega_forward(Lu, Ll, Olow(K-e,JL)', p, q, ko, tmin, tmax)';
    [Ltt,info] = chol(Olow(K-e,KL) - Ll(K-e,JL)*Ll(K-e,JL)');
    if info>0, info = info + ko(t); return; end
    Ll(K-e,KL) = Ltt';
  end
end

OmegaFactor.c #include "xassert.h" #include "BlasGateway.h" #include "VarmaUtilities.h" #include "Omega.h" void OmegaFactor ( // Cholesky-factorize Omega, or, for missing values, Omega_o double Su[], // in/out upper left part of Omega, dimension msu msu double Olow[], // in/out molow nolow, block-diagonals of lower part of Omega int p, // in number of autoregressive terms int q, // in number of moving average terms int n, // in length of time series int ko[], // in ko[t] = N of observed values before time t+1, t<=n int *info) // out 0 if ok, otherwise k for first nonpositive Ltt { // Finds Cholesky factorization of covariance matrix Omega_o for missing value // VARMA log-likelihood. Also handles complete data with ko[sws]=r*i for all i. // The Cholesky factors overwrite Omega in memory. All matrices are stored in // Fortran fashion. // double *U, *Ltt; int t, j, ro; int h = max(p,q); int msu = ko[h]; // order of Su int molow = ko[n]-msu; // no. of rows in Olow // // CHOLESKY-FACTORIZE Su INTO Lu Lu' (Lu OVERWRITES Su): if (msu>0) potrf("low", msu, Su, msu, info); else *info = 0; xassert(*info >= 0); if (*info>0) return; // // TURN ATTENTION TO Olow: U = Olow; for (t=h; t<n; t++) { ro = ko[t+1] - ko[t]; if (ro>0) { j = ko[t] - ko[t-q]; Ltt = U + molow*j; //Solve L(0:t-1,0:t-1) U' = s' (U contains s on call): OmegaForward("T", Su, Olow, p, q, ko, n, U, molow, ro, t-q, t-1); //Ltt-U U' and then Cholesky of that syrk("low", "N", ro, j, -1.0, U, molow, 1.0, Ltt, molow); potrf("low", ro, Ltt, molow, info); if (*info>0) { *info += ko[t]; return; } } U += ro; } } 10

From omega_forward.m:

LT.LT = true;
if m>0, X(1:m,:) = linsolve(Lu(j:end,j:end), Y(1:m,:), LT); end
for t = max(h+1,tmin) : tmax
  t1 = max(t-q,tmin);
  J = ko(t1)+1 : ko(t);
  K = ko(t)+1 : ko(t+1);
  JX = J - j + 1;
  KX = K - j + 1;
  JL = J - ko(t-q);
  KL = K - ko(t-q);
  X(KX,:) = X(KX,:) - Ll(K-e,JL)*X(JX,:);
  X(KX,:) = linsolve(Ll(K-e,KL), X(KX,:), LT);
end

From OmegaForward.c:

if (transp && e>0)
  trsm("right", "Low", "T", "NotUdia", ny, m, 1.0, Luii, e, Y, iy);
else if (e>0)
  trsm("left", "Low", "NoT", "NotUdia", m, ny, 1.0, Luii, e, Y, iy);
//
// FIND SECOND PARTITION OF X:
incy = transp ? iy : 1;
tbeg = max(h,tmin);
for (t=tbeg; t<=tmax; t++) {
  k1 = max(t-q,tmin);
  k = ko[t] - ko[k1];
  Yt = Y + (ko[t] - j)*incy;
  Yk = Yt - k*incy;
  Lt = Ll + ko[t] - e;
  Ltt = Lt + mll*(ko[t] - ko[t-q]);
  Ltk = Ltt - mll*k;
  ro = ko[t+1] - ko[t];
  if (transp && iy>0) {
    gemm("NoT", "T", ny, ro, k, -1.0, Yk, iy, Ltk, mll, 1.0, Yt, iy);
    trsm("right", "Low", "T", "NotUdia", ny, ro, 1.0, Ltt, mll, Yt, iy);
  } else if (!transp && mll>0) {
    gemm("NoT", "NoT", ro, ny, k, -1.0, Ltk, mll, Yk, iy, 1.0, Yt, iy);
    trsm("left", "Low", "NoT", "NotUdia", ro, ny, 1.0, Ltt, mll, Yt, iy);
  }
}

Gateway function to reference BLAS/LAPACK

// Gateway function to reference Blas/Lapack
#include "BlasGateway.h"
#include "blasf.h"

void gemm(char *transa, char *transb, int m, int n, int k, double alpha,
          double a[], int lda, double b[], int ldb, double beta,
          double c[], int ldc)
{
  // Fortran expects all arguments by reference; the two trailing 1's are
  // the hidden string-length arguments for transa and transb.
  dgemm(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta,
        c, &ldc, 1, 1);
}

BLAS/LAPACK from Fortran examples

Cholesky factorization using Netlib LAPACK95. From http://www.netlib.org/lapack95/html/doc:

!  SUBROUTINE LA_POTRF( A, UPLO, RCOND, NORM, INFO )
!     <type>(<wp>), INTENT(INOUT) :: A(:,:)
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: UPLO
!     REAL(<wp>), INTENT(OUT), OPTIONAL :: RCOND
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: NORM
!     INTEGER, INTENT(OUT), OPTIONAL :: INFO

Cholesky factorization from MKL

[Excerpt from the MKL Reference Manual, not reproduced here.]
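The notes do not reproduce blasf.h, so the following is only a hypothetical sketch of the prototype it might contain for dgemm, inferred from the call above: every numerical argument passed by reference, plus two trailing int string-length parameters (a common but compiler-dependent Fortran convention).

// Hypothetical sketch of a blasf.h prototype matching the call above.
// The name dgemm (rather than, say, dgemm_) and the trailing length
// parameters are assumptions; the real header is not shown in the notes.
void dgemm(char *transa, char *transb, int *m, int *n, int *k,
           double *alpha, double *a, int *lda, double *b, int *ldb,
           double *beta, double *c, int *ldc,
           int len_transa, int len_transb);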

Matrix-matrix multiply

Using Intel MKL (help from the reference manual):

[Excerpt from the MKL reference manual, not reproduced here.]

There is a description (incompatible with MKL) and a link to a reference implementation on Netlib, http://www.netlib.org/blas/blast-forum (difficult to download and use). Here is an example from the description:

[Example from the description, not reproduced here.]

For calling from Fortran 77, see the call to the reference version of DGEMM shown above in the quote from the MKL reference manual.
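For a concrete picture of the DGEMM calling sequence, here is a minimal sketch (not from the notes) of calling the Fortran routine directly from C, again assuming the dgemm_ symbol and pass-by-reference convention:

#include <stdio.h>

/* Assumed prototype (dgemm_ naming convention, all arguments by reference) */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb, const double *beta,
                   double *c, const int *ldc);

int main(void) {
    double A[4] = { 1, 3,   2, 4 };   /* [1 2; 3 4], column-major */
    double B[4] = { 5, 7,   6, 8 };   /* [5 6; 7 8], column-major */
    double C[4] = { 0, 0,   0, 0 };
    int n = 2;
    double one = 1.0, zero = 0.0;
    /* C := 1.0*A*B + 0.0*C */
    dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);  /* 19 22; 43 50 */
    return 0;
}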

Sparse BLAS

A Fortran 95 reference implementation was published in ACM TOMS in 2002. Included in MKL. Originally defined by the BLAS Technical Forum, see Netlib (http://www.netlib.org/blas/blast-forum).
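As a rough illustration of the style of this interface, the sketch below follows my recollection of the BLAS Technical Forum C bindings (handle creation, entry insertion, matrix-vector multiply). The header name and the exact routine names should be treated as assumptions; availability depends on the implementation.

#include <stdio.h>
#include "blas_sparse.h"   /* assumed header from a BLAST-forum implementation */

int main(void) {
    /* Build a 3 x 3 sparse matrix one entry at a time (point-entry interface) */
    blas_sparse_matrix A = BLAS_duscr_begin(3, 3);
    BLAS_duscr_insert_entry(A, 2.0, 0, 0);
    BLAS_duscr_insert_entry(A, 3.0, 1, 1);
    BLAS_duscr_insert_entry(A, 4.0, 2, 2);
    BLAS_duscr_insert_entry(A, 1.0, 2, 0);
    BLAS_duscr_end(A);

    double x[3] = { 1, 1, 1 }, y[3] = { 0, 0, 0 };
    /* y := 1.0*A*x + y */
    BLAS_dusmv(blas_no_trans, 1.0, A, x, 1, y, 1);
    printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expect 2 3 5 */
    BLAS_usds(A);                                  /* release the handle */
    return 0;
}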

CBLAS

Netlib has a description and an implementation (an interface to the Fortran reference BLAS), see http://www.netlib.org/blas/blast-forum. Example:

[Example not reproduced here.]

Notice that the dimension arguments are passed by value, and not by reference as is necessary when calling the Fortran routines directly from C. Information on the MKL web pages: www.intel.com/software/products/mkl/docs/mklqref/. Atlas (see above) has a free implementation and a quick reference card. The GNU Scientific Library (http://www.gnu.org/software/gsl) has a BLAS interface of its own design (with its own vector/matrix data types).
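To illustrate the by-value convention just mentioned, here is a minimal sketch (not from the notes) using cblas_dgemm; it assumes a CBLAS installation whose header is cblas.h:

#include <stdio.h>
#include <cblas.h>   /* assumed header name; MKL users would include mkl.h */

int main(void) {
    /* Row-major storage, which CBLAS supports directly; the Fortran
       interface would require column-major storage. */
    double A[4] = { 1, 2,   3, 4 };   /* [1 2; 3 4] */
    double B[4] = { 5, 6,   7, 8 };   /* [5 6; 7 8] */
    double C[4] = { 0, 0,   0, 0 };
    /* C := 1.0*A*B + 0.0*C; note m, n, k, alpha, lda, ... passed by value */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* 19 22; 43 50 */
    return 0;
}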