Resources for parallel computing


BLAS: Basic Linear Algebra Subprograms

Originally published in ACM TOMS (1979) (LINPACK BLAS + LAPACK). The BLAS implement matrix operations up to matrix-matrix multiplication and triangular solve, but not matrix factorizations or eigenvalue calculations. A reference implementation is on netlib.org.

Web page: www.netlib.org/blas

From the "Frequently asked questions" about the BLAS:

1) What and where are the BLAS?
2) Are there legal restrictions on the use of BLAS reference implementation software?
3) Publications/references for the BLAS?
4) Is there a Quick Reference Guide to the BLAS available?
5) Are optimized BLAS libraries available? Where can I find vendor supplied BLAS?
6) Where can I find Java BLAS?
7) Is there a C interface to the BLAS?

8) Are prebuilt reference implementations of the Fortran77 BLAS available?
9) What about shared memory machines? Are there multithreaded versions of the BLAS available?

1) What and where are the BLAS?
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. There is also a description on Wikipedia.

Example of a reference subroutine:

      SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
     $                   BETA, Y, INCY )
*     .. Scalar Arguments ..
      DOUBLE PRECISION   ALPHA, BETA
      INTEGER            INCX, INCY, LDA, M, N
      CHARACTER*1        TRANS
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
*
*  Purpose

*  =======
*
*  DGEMV  performs one of the matrix-vector operations
*
*     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
*
*  where alpha and beta are scalars, x and y are vectors and A is an
*  m by n matrix.
*
*  Parameters
*  ==========
*  ...
*  M      - INTEGER.
*           On entry, M specifies the number of rows of the matrix A.
*           M must be at least zero.
      ...
      IF( INCY.EQ.1 )THEN
         DO 60, J = 1, N
            IF( X( JX ).NE.ZERO )THEN
               TEMP = ALPHA*X( JX )
               DO 50, I = 1, M
                  Y( I ) = Y( I ) + TEMP*A( I, J )
   50          CONTINUE
            END IF
            JX = JX + INCX
   60    CONTINUE
      ...

BLAS quick reference: see www.netlib.org/blas/blasqr.pdf
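To make the calling convention concrete, here is a minimal sketch (not from the original notes) of calling the reference DGEMV from C. It assumes the common compiler convention that the Fortran symbol is dgemv_ with every argument passed by reference; the hidden string-length argument that some compilers append for TRANS is omitted, which works on the usual x86 ABIs.

#include <stdio.h>

/* Assumed prototype for the Fortran routine (dgemv_ naming convention) */
extern void dgemv_(const char *trans, const int *m, const int *n,
                   const double *alpha, const double *a, const int *lda,
                   const double *x, const int *incx, const double *beta,
                   double *y, const int *incy);

int main(void) {
    double A[6] = { 1, 4,   2, 5,   3, 6 };  /* 2 x 3, column-major: [1 2 3; 4 5 6] */
    double x[3] = { 1, 1, 1 };
    double y[2] = { 0, 0 };
    int m = 2, n = 3, lda = 2, inc1 = 1;
    double alpha = 1.0, beta = 0.0;
    /* y := 1.0*A*x + 0.0*y; note INCX/INCY allow strided vector access */
    dgemv_("N", &m, &n, &alpha, A, &lda, x, &inc1, &beta, y, &inc1);
    printf("y = %g %g\n", y[0], y[1]);       /* expect y = 6 15 */
    return 0;
}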

LAPACK

Fortran subroutines for linear equations (dense, banded), linear least squares problems, eigenvalue problems and singular values. There is a printed user guide (LAPACK Users' Guide, 11 authors, SIAM, 1999); part of this guide is available as several html documents on www.netlib.org/lapack/lug, and there are man pages on www.netlib.org/lapack/manpages.tgz which are worth installing. The routines were written with parallel computation in mind.

Web page: www.netlib.org/lapack
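Since Cholesky factorization recurs in the examples below, here is a minimal sketch (not from the notes) of calling LAPACK's DPOTRF directly from C, under the same assumed dpotrf_ naming and pass-by-reference convention:

#include <stdio.h>

/* Assumed prototype for the Fortran routine (dpotrf_ naming convention) */
extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

int main(void) {
    /* Symmetric positive definite matrix [4 2; 2 3], column-major */
    double A[4] = { 4, 2,   2, 3 };
    int n = 2, lda = 2, info;
    dpotrf_("L", &n, A, &lda, &info);  /* lower triangular factor overwrites A */
    if (info == 0)
        printf("L = [%g 0; %g %g]\n", A[0], A[1], A[3]);  /* 2, 1, sqrt(2) */
    else
        printf("dpotrf failed, info = %d\n", info);
    return 0;
}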

ATLAS

math-atlas.sourceforge.net. Open source implementation of BLAS and a few LAPACK routines. These must be (or should be) built (with make) on the machine where they are to be used (may take a few hours?).

GotoBLAS (pronunciation: "goto-blas")

http://www.tacc.utexas.edu/resources/software/. The web page boasts: currently the fastest implementation of the Basic Linear Algebra Subroutines

Intel MKL (Math Kernel Library)

Academic license $160. Includes:
  BLAS
  selected LAPACK routines
  Fortran 95 interface
  CBLAS (interface to call from C)
  sparse BLAS
  sparse linear equation solvers
  ScaLAPACK
  some statistical functions (incl. random number generation)
  some MPI support
  fast Fourier transforms
  PDE solution support (I think incl. a Poisson solver)
  some numerical optimization

A user manual and a reference manual come with the program, and both are also available on http://www.intel.com/cd/software/products/asmo-na/eng/345631.htm (107 pages and 3250 pages respectively).

AMD Core Math Library (ACML)

Web page: www.amd.com/acml

DGEMM benchmarks

[Benchmark charts, not reproduced here: one from the Intel web site, one from the GotoBLAS web site, and one from a neutral (?) web site.]

Example of BLAS use

The following examples are from programs for evaluation of the VARMA time-series likelihood, which I wrote first in Matlab (~3500 lines) and am close to finishing translating into C (~7000 lines). Eventually I hope to call some parallel BLAS routines and report on timing comparisons between the Matlab and C versions. The Matlab programs are on www.hi.is/~jonasson, and the C programs are on the way there.

omega_factor.m

%OMEGA_FACTOR  Cholesky factorization of Omega
%
%   [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko) calculates the Cholesky
%   factorization Omega = L*L' of Omega, which is stored in two parts: a full
%   upper left partition, Su, and a block-band lower partition, Olow, as
%   returned by omega_build. Omega is symmetric, only the lower triangle of Su
%   is populated, and Olow only stores diagonal and subdiagonal blocks. On
%   exit, L = [L1; L2] with L1 = [Lu 0], and L2 is stored in block-band
%   storage in Ll. Info is 0 on success, otherwise the loop index resulting in
%   a negative number square root. P and q are the dimensions of the problem
%   and ko is a vector with ko(t) = number of observed values before time t.
%
%   In the complete data case ko should be 0:r:n*r. For missing values, Su and
%   Olow are the upper left and lower partitions of Omega_o = Omega with
%   missing rows and columns removed. In this case Lu and Ll return L_o, the
%   Cholesky factor of Omega_o.

function [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko)
  n = length(ko)-1;
  h = max(p,q);
  ro = diff(ko);
  Ll = zeros(size(Olow));
  [Lu,info] = chol(Su');                        % upper left partition
  if info>0, return; end
  Lu = Lu';
  e = ko(h+1);                                  % order of Su
  for t = h+1 : n                               % loop over block-lines in Olow
    K = ko(t)+1 : ko(t+1);
    KL = K - ko(t-q);
    JL = 1 : ko(t)-ko(t-q);
    tmin = t-q;
    tmax = t-1;
    Ll(K-e,JL) = omega_forward(Lu, Ll, Olow(K-e,JL)', p, q, ko, tmin, tmax)';
    [Ltt,info] = chol(Olow(K-e,KL) - Ll(K-e,JL)*Ll(K-e,JL)');
    if info>0, info = info + ko(t); return; end
    Ll(K-e,KL) = Ltt';
  end
end

OmegaFactor.c #include "xassert.h" #include "BlasGateway.h" #include "VarmaUtilities.h" #include "Omega.h" void OmegaFactor ( // Cholesky-factorize Omega, or, for missing values, Omega_o double Su[], // in/out upper left part of Omega, dimension msu msu double Olow[], // in/out molow nolow, block-diagonals of lower part of Omega int p, // in number of autoregressive terms int q, // in number of moving average terms int n, // in length of time series int ko[], // in ko[t] = N of observed values before time t+1, t<=n int *info) // out 0 if ok, otherwise k for first nonpositive Ltt { // Finds Cholesky factorization of covariance matrix Omega_o for missing value // VARMA log-likelihood. Also handles complete data with ko[sws]=r*i for all i. // The Cholesky factors overwrite Omega in memory. All matrices are stored in // Fortran fashion. // double *U, *Ltt; int t, j, ro; int h = max(p,q); int msu = ko[h]; // order of Su int molow = ko[n]-msu; // no. of rows in Olow // // CHOLESKY-FACTORIZE Su INTO Lu Lu' (Lu OVERWRITES Su): if (msu>0) potrf("low", msu, Su, msu, info); else *info = 0; xassert(*info >= 0); if (*info>0) return; // // TURN ATTENTION TO Olow: U = Olow; for (t=h; t<n; t++) { ro = ko[t+1] - ko[t]; if (ro>0) { j = ko[t] - ko[t-q]; Ltt = U + molow*j; //Solve L(0:t-1,0:t-1) U' = s' (U contains s on call): OmegaForward("T", Su, Olow, p, q, ko, n, U, molow, ro, t-q, t-1); //Ltt-U U' and then Cholesky of that syrk("low", "N", ro, j, -1.0, U, molow, 1.0, Ltt, molow); potrf("low", ro, Ltt, molow, info); if (*info>0) { *info += ko[t]; return; } } U += ro; } } 10

From omega_forward.m:

LT.LT = true;
if m>0, X(1:m,:) = linsolve(Lu(j:end,j:end), Y(1:m,:), LT); end
for t = max(h+1,tmin) : tmax
  t1 = max(t-q,tmin);
  J = ko(t1)+1 : ko(t);
  K = ko(t)+1 : ko(t+1);
  JX = J - j + 1;
  KX = K - j + 1;
  JL = J - ko(t-q);
  KL = K - ko(t-q);
  X(KX,:) = X(KX,:) - Ll(K-e,JL)*X(JX,:);
  X(KX,:) = linsolve(Ll(K-e,KL), X(KX,:), LT);
end

From OmegaForward.c:

if (transp && e>0)
  trsm("right", "Low", "T", "NotUdia", ny, m, 1.0, Luii, e, Y, iy);
else if (e>0)
  trsm("left", "Low", "NoT", "NotUdia", m, ny, 1.0, Luii, e, Y, iy);
//
// FIND SECOND PARTITION OF X:
incy = transp ? iy : 1;
tbeg = max(h,tmin);
for (t=tbeg; t<=tmax; t++) {
  k1 = max(t-q,tmin);
  k = ko[t] - ko[k1];
  Yt = Y + (ko[t] - j)*incy;
  Yk = Yt - k*incy;
  Lt = Ll + ko[t] - e;
  Ltt = Lt + mll*(ko[t] - ko[t-q]);
  Ltk = Ltt - mll*k;
  ro = ko[t+1] - ko[t];
  if (transp && iy>0) {
    gemm("NoT", "T", ny, ro, k, -1.0, Yk, iy, Ltk, mll, 1.0, Yt, iy);
    trsm("right", "Low", "T", "NotUdia", ny, ro, 1.0, Ltt, mll, Yt, iy);
  } else if (!transp && mll>0) {
    gemm("NoT", "NoT", ro, ny, k, -1.0, Ltk, mll, Yk, iy, 1.0, Yt, iy);
    trsm("left", "Low", "NoT", "NotUdia", ro, ny, 1.0, Ltt, mll, Yt, iy);
  }
}

Gateway function to reference BLAS/LAPACK

// Gateway function to reference Blas/Lapack
#include "BlasGateway.h"
#include "blasf.h"

void gemm(char *transa, char *transb, int m, int n, int k, double alpha,
          double a[], int lda, double b[], int ldb, double beta,
          double c[], int ldc)
{
  // Fortran expects all arguments by reference; the two trailing 1's are
  // the hidden string-length arguments for transa and transb.
  dgemm(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta,
        c, &ldc, 1, 1);
}

BLAS/LAPACK from Fortran examples

Cholesky factorization using Netlib LAPACK95. From http://www.netlib.org/lapack95/html/doc:

!  SUBROUTINE LA_POTRF( A, UPLO, RCOND, NORM, INFO )
!     <type>(<wp>), INTENT(INOUT) :: A(:,:)
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: UPLO
!     REAL(<wp>), INTENT(OUT), OPTIONAL :: RCOND
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: NORM
!     INTEGER, INTENT(OUT), OPTIONAL :: INFO

Cholesky factorization from MKL

[Excerpt from the MKL Reference Manual, not reproduced here.]
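The notes do not reproduce blasf.h, so the following is only a hypothetical sketch of the prototype it might contain for dgemm, inferred from the call above: every numerical argument passed by reference, plus two trailing int string-length parameters (a common but compiler-dependent Fortran convention).

// Hypothetical sketch of a blasf.h prototype matching the call above.
// The name dgemm (rather than, say, dgemm_) and the trailing length
// parameters are assumptions; the real header is not shown in the notes.
void dgemm(char *transa, char *transb, int *m, int *n, int *k,
           double *alpha, double *a, int *lda, double *b, int *ldb,
           double *beta, double *c, int *ldc,
           int len_transa, int len_transb);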

Matrix-matrix multiply

Using Intel MKL (help from the reference manual):

[Excerpt from the MKL reference manual, not reproduced here.]

There is a description (incompatible with MKL) and a link to a reference implementation on Netlib, http://www.netlib.org/blas/blast-forum (difficult to download and use). Here is an example from the description:

[Example from the description, not reproduced here.]

For calling from Fortran 77, see the call to the reference version of DGEMM shown above in the quote from the MKL reference manual.
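For a concrete picture of the DGEMM calling sequence, here is a minimal sketch (not from the notes) of calling the Fortran routine directly from C, again assuming the dgemm_ symbol and pass-by-reference convention:

#include <stdio.h>

/* Assumed prototype (dgemm_ naming convention, all arguments by reference) */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb, const double *beta,
                   double *c, const int *ldc);

int main(void) {
    double A[4] = { 1, 3,   2, 4 };   /* [1 2; 3 4], column-major */
    double B[4] = { 5, 7,   6, 8 };   /* [5 6; 7 8], column-major */
    double C[4] = { 0, 0,   0, 0 };
    int n = 2;
    double one = 1.0, zero = 0.0;
    /* C := 1.0*A*B + 0.0*C */
    dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
    printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);  /* 19 22; 43 50 */
    return 0;
}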

Sparse BLAS

A Fortran 95 reference implementation was published in ACM TOMS in 2002. Included in MKL. Originally defined by the BLAS Technical Forum, see Netlib (http://www.netlib.org/blas/blast-forum).
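As a rough illustration of the style of this interface, the sketch below follows my recollection of the BLAS Technical Forum C bindings (handle creation, entry insertion, matrix-vector multiply). The header name and the exact routine names should be treated as assumptions; availability depends on the implementation.

#include <stdio.h>
#include "blas_sparse.h"   /* assumed header from a BLAST-forum implementation */

int main(void) {
    /* Build a 3 x 3 sparse matrix one entry at a time (point-entry interface) */
    blas_sparse_matrix A = BLAS_duscr_begin(3, 3);
    BLAS_duscr_insert_entry(A, 2.0, 0, 0);
    BLAS_duscr_insert_entry(A, 3.0, 1, 1);
    BLAS_duscr_insert_entry(A, 4.0, 2, 2);
    BLAS_duscr_insert_entry(A, 1.0, 2, 0);
    BLAS_duscr_end(A);

    double x[3] = { 1, 1, 1 }, y[3] = { 0, 0, 0 };
    /* y := 1.0*A*x + y */
    BLAS_dusmv(blas_no_trans, 1.0, A, x, 1, y, 1);
    printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expect 2 3 5 */
    BLAS_usds(A);                                  /* release the handle */
    return 0;
}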

CBLAS

Netlib has a description and an implementation (an interface to the Fortran reference BLAS), see http://www.netlib.org/blas/blast-forum. Example:

[Example not reproduced here.]

Notice that the dimension arguments are passed by value, and not by reference as is necessary when calling the Fortran routines directly from C. Information on the MKL web pages: www.intel.com/software/products/mkl/docs/mklqref/. Atlas (see above) has a free implementation and a quick reference card. The GNU Scientific Library (http://www.gnu.org/software/gsl) has a BLAS interface of its own design (with its own vector/matrix data types).
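To illustrate the by-value convention just mentioned, here is a minimal sketch (not from the notes) using cblas_dgemm; it assumes a CBLAS installation whose header is cblas.h:

#include <stdio.h>
#include <cblas.h>   /* assumed header name; MKL users would include mkl.h */

int main(void) {
    /* Row-major storage, which CBLAS supports directly; the Fortran
       interface would require column-major storage. */
    double A[4] = { 1, 2,   3, 4 };   /* [1 2; 3 4] */
    double B[4] = { 5, 6,   7, 8 };   /* [5 6; 7 8] */
    double C[4] = { 0, 0,   0, 0 };
    /* C := 1.0*A*B + 0.0*C; note m, n, k, alpha, lda, ... passed by value */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* 19 22; 43 50 */
    return 0;
}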