BLAS. Basic Linear Algebra Subprograms

Amberly Underwood

1 BLAS
Basic operations with vectors and matrices dominate scientific computing programs. To achieve high efficiency and clean computer programs, an effort has been made in the last few decades to standardize and optimize such operations.
Basic Linear Algebra Subprograms http://
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for efficiently performing basic vector and matrix operations. (1979)

2 BLAS
Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high-quality linear algebra software, LAPACK for example. BLAS serve as building blocks in many computer codes, and their performance on a certain computer usually reflects the performance of that code on the same computer. This is why most computer vendors optimize BLAS for specific architectures.

3 BLAS
Taxonomy based on computational complexity, i.e., the number of floating-point operations needed to complete an operation.
The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations.
Inner product of two vectors of size $n$: $n$ multiplications and $n-1$ additions, for a total of approximately $2n$ operations.
Matrix-vector multiplication: $n^2$ multiplications and $n(n-1)$ additions, approximately $2n^2$ operations.
Matrix-matrix multiplication: approximately $2n^3$ operations.
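The operation counts on this slide can be checked by instrumenting naive kernels. The following is a minimal sketch in plain Python; the function names `dot`, `matvec` and `matmul` are illustrative and are not BLAS routine names. Each function returns its result together with the number of floating-point operations it performed.

```python
def dot(x, y):
    """Inner product of two equal-length vectors; returns (value, flops)."""
    total = x[0] * y[0]          # first multiplication, no addition yet
    flops = 1
    for xi, yi in zip(x[1:], y[1:]):
        total += xi * yi         # one multiplication and one addition
        flops += 2
    return total, flops


def matvec(A, x):
    """Matrix-vector product built from one inner product per row of A."""
    y, flops = [], 0
    for row in A:
        value, f = dot(row, x)
        y.append(value)
        flops += f
    return y, flops


def matmul(A, B):
    """Matrix-matrix product built from one inner product per entry of C."""
    columns = [list(col) for col in zip(*B)]
    C, flops = [], 0
    for row in A:
        out = []
        for col in columns:
            value, f = dot(row, col)
            out.append(value)
            flops += f
        C.append(out)
    return C, flops
```

For size $n$ the counters give exactly $2n-1$, $n(2n-1)$ and $n^2(2n-1)$ operations, which match the leading terms $2n$, $2n^2$ and $2n^3$ quoted above.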

4 Level 3 Mat-Mat Multiplication
Three versions (equations not recovered): inner (dot) product version, middle (saxpy) product version, outer product version.
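The three versions named on this slide are commonly presented as three orderings of the same triple loop. The sketch below assumes that standard textbook correspondence (ijk for the inner product version, jki for the saxpy version, kij for the outer product version); the slide's own formulas are not available, so the function names and loop orders here are an assumption rather than a transcription.

```python
def zeros(n, m):
    """Return an n-by-m matrix of zeros as a list of rows."""
    return [[0.0] * m for _ in range(n)]


def matmul_inner(A, B):
    """ijk order: C[i][j] is the dot product of row i of A and column j of B."""
    n, p, m = len(A), len(B), len(B[0])
    C = zeros(n, m)
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(p):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_saxpy(A, B):
    """jki order: column j of C accumulates B[k][j] times column k of A."""
    n, p, m = len(A), len(B), len(B[0])
    C = zeros(n, m)
    for j in range(m):
        for k in range(p):
            b = B[k][j]
            for i in range(n):
                C[i][j] += b * A[i][k]
    return C


def matmul_outer(A, B):
    """kij order: C accumulates the rank-one product of column k of A and row k of B."""
    n, p, m = len(A), len(B), len(B[0])
    C = zeros(n, m)
    for k in range(p):
        for i in range(n):
            a = A[i][k]
            for j in range(m):
                C[i][j] += a * B[k][j]
    return C
```

All three perform the same arithmetic; they differ only in the order in which the entries of A, B and C are touched, which is what matters for memory access.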

5 ATLAS
The most recent trend has been to develop a new generation of self-tuning BLAS libraries targeting the rich but complex memory systems of modern processors.
(1996) Automatically Tuned Linear Algebra Software http://math-atlas.sourceforge.net/
ATLAS is an implementation of empirical optimization procedures that allow for many different ways of performing a kernel operation, with corresponding timers to determine which approach is best for a particular platform.
Multiple implementation: different hand-written versions of the same kernel are explicitly provided.

6 ATLAS
Code generation: a highly parameterized code is written that generates many different kernel implementations (different block sizes, unit or larger stride, ...).
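The empirical approach described for ATLAS, running several parameterized variants of a kernel and keeping the fastest, can be illustrated with a toy example. This is only a sketch of the idea and not ATLAS code: `blocked_matmul` and `autotune` are hypothetical names, the single tuning parameter is a block size, and in pure Python the timings say little about cache behaviour.

```python
import time


def blocked_matmul(A, B, bs):
    """Square matrix product computed block by block with block size bs."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(kk, min(kk + bs, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + bs, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(kernel, candidates, A, B, repeats=3):
    """Time kernel(A, B, bs) for each candidate bs; return (best_bs, timings)."""
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(A, B, bs)
            best = min(best, time.perf_counter() - start)
        timings[bs] = best            # keep the fastest of the repeats
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    n = 32
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best_bs, timings = autotune(blocked_matmul, [4, 8, 16, 32], A, B)
    print("selected block size:", best_bs)
```

The selected block size depends on the machine and on the run, which is the point of tuning empirically rather than choosing the parameter in advance.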

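7 BLAS and Memory Access
The practical difference among the various ways of implementing matrix-vector and matrix-matrix operations lies in the way we access memory and the type of memory, i.e., main memory or cache memory.
The cost associated with the BLAS programs can be computed by taking into account the total number of floating-point operations and also the cost of the memory accesses needed to load the operands:
$$T = n_f \, \delta t + n_m \, \tau = n_f \, \delta t \left( 1 + \frac{n_m}{n_f} \, \frac{\tau}{\delta t} \right),$$
where $n_f$ is the number of floating-point operations, $n_m$ is the number of memory references, and $\delta t$ and $\tau$ are the times per floating-point operation and per memory reference, respectively.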
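The cost model on this slide is easy to evaluate numerically. The sketch below writes it in both of its forms so the factorization can be checked; the function names and the sample values for the flop time and the memory access time are hypothetical and chosen only for illustration.

```python
def blas_time(n_f, n_m, dt, tau):
    """Total time: n_f flops at dt seconds each plus n_m memory references at tau each."""
    return n_f * dt + n_m * tau


def blas_time_factored(n_f, n_m, dt, tau):
    """Same model with the pure compute time n_f * dt factored out."""
    return n_f * dt * (1.0 + (n_m / n_f) * (tau / dt))


if __name__ == "__main__":
    dt = 1.0e-9    # assumed time per floating-point operation, in seconds
    tau = 1.0e-8   # assumed time per memory reference, in seconds
    print(blas_time(2000, 3001, dt, tau))
    print(blas_time_factored(2000, 3001, dt, tau))
```

In the factored form the overhead relative to pure computation is governed by two ratios: memory references per flop, which depends on the operation, and memory time per flop time, which depends on the machine.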

8 Level 1 - "BLAS1"
$n_f = 2n$, $n_m = 3n + 1$, $n_m / n_f \to 3/2$ for large $n$.
[Figure 2.8: Performance of the dot product on the Intel Pentium-4 with speed 1.7 GHz. Plot title: "Comparative ddot() performance on a Xeon4 1.7GHz"; axes: Mflop/sec versus array size; curves: ATLAS hot/hot, ATLAS hot/cold, ATLAS cold/hot, ATLAS cold/cold. (Courtesy of C. Evangelinos)]

9 Level 2 - "BLAS2"
$n_f = 2n^2$, $n_m = n^2 + 3n$, $n_m / n_f \to 1/2$ for large $n$.
[Figure 2.9: Performance of the matrix-vector multiply on the Intel Pentium-4 with speed of 1.7 GHz. Plot title: "Comparative dgemv() performance on a Xeon4 1.7GHz"; axes: Mflop/sec versus array size; curves: ATLAS N hot, ATLAS N cold, ATLAS T hot, ATLAS T cold. (Courtesy of C. Evangelinos)]

10 Level 3 - "BLAS3"
$n_f = 2n^3$, $n_m = 3n^2$, $n_m / n_f \to 3/(2n)$ for large $n$.
[Figure 2.10: Performance of the matrix-matrix multiply on the Intel Pentium-4 with speed 1.7 GHz. Plot title: "Comparative dgemm() performance on a Xeon4 1.7GHz"; axes: Mflop/sec versus array size; curves: ATLAS NN hot, ATLAS NN cold, ATLAS TT hot, ATLAS TT cold. (Courtesy of C. Evangelinos)]
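The memory-to-flop ratios quoted on the three level slides can be tabulated for growing problem sizes. This is a small sketch using the counts exactly as given on the slides; the helper names `counts` and `ratio` are illustrative.

```python
def counts(level, n):
    """Return (n_f, n_m) for the given BLAS level, as quoted on the slides."""
    if level == 1:
        return 2 * n, 3 * n + 1
    if level == 2:
        return 2 * n**2, n**2 + 3 * n
    if level == 3:
        return 2 * n**3, 3 * n**2
    raise ValueError("level must be 1, 2 or 3")


def ratio(level, n):
    """Memory references per floating-point operation, n_m / n_f."""
    n_f, n_m = counts(level, n)
    return n_m / n_f


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, ratio(1, n), ratio(2, n), ratio(3, n))
```

For Levels 1 and 2 the ratio settles at a constant (3/2 and 1/2), so memory traffic grows in step with the arithmetic. For Level 3 it decays like $3/(2n)$, so each operand loaded can be reused across many operations.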

11 Optimized BLAS libraries
AMD Core Math Library (ACML) http://developer.amd.com/tools/cpu/acml/pages/default.aspx
Intel Math Kernel Library (Intel MKL) us/intel-mkl
EIGEN3 http://eigen.tuxfamily.org

12 LAPACK
LAPACK: Linear Algebra PACKage http://
LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided.
LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS). LAPACK was designed from the outset to exploit the Level 3 BLAS.

13 HPL-Linpack (Top500)
A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers http://
HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers.
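The kind of problem HPL solves can be shown at toy scale: build a random dense system, solve it, and check the residual. The sketch below is a single-process illustration using Gaussian elimination with partial pivoting in plain Python. It is not HPL's distributed, blocked implementation, and the names `solve_dense` and `max_residual` are hypothetical. Python floats are normally 64-bit doubles, which matches the precision named on the slide.

```python
import random


def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]      # augmented matrix [A | b]
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def max_residual(A, x, b):
    """Largest absolute entry of A x - b."""
    return max(abs(sum(a * xj for a, xj in zip(row, x)) - bi)
               for row, bi in zip(A, b))


if __name__ == "__main__":
    random.seed(0)
    n = 50
    A = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [random.uniform(-0.5, 0.5) for _ in range(n)]
    x = solve_dense(A, b)
    print("max residual:", max_residual(A, x, b))
```

A benchmark run would also time the solve and convert the known operation count into a floating-point rate; the sketch stops at the correctness check.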
