How to Write Fast Numerical Code Spring 2012 Lecture 9. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

Today: linear algebra software (history, LAPACK and BLAS); blocking (BLAS 3) as the key to performance; how to make MMM fast (ATLAS, model-based ATLAS).

Linear Algebra Algorithms: Examples. Solving systems of linear equations, eigenvalue problems, singular value decomposition, LU/Cholesky/QR decompositions, and many others. These make up most of the numerical computation across disciplines (sciences, computer science, engineering), so efficient software is extremely relevant.
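
As a concrete taste of such library software (not from the slides), here is a minimal sketch that solves a linear system through LAPACK's C interface (LAPACKE); the matrix values and sizes are made-up example data.

    /* Minimal sketch: solve A x = b with LAPACK's dgesv driver (LAPACKE C interface).
       Compile e.g. with: gcc solve.c -llapacke -llapack -lblas */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        /* 3x3 system in row-major layout; values are arbitrary example data */
        double A[3*3] = { 4, 1, 0,
                          1, 3, 1,
                          0, 1, 2 };
        double b[3]   = { 1, 2, 3 };   /* right-hand side, overwritten with the solution */
        lapack_int ipiv[3];            /* pivot indices from the LU factorization */

        /* LU-factorize A and solve; returns 0 on success */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
        if (info != 0) { printf("dgesv failed: %d\n", (int)info); return 1; }

        printf("x = [%f %f %f]\n", b[0], b[1], b[2]);
        return 0;
    }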

The Path to LAPACK. EISPACK and LINPACK: libraries for linear algebra algorithms, developed in the early 70s by Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart, and others; LINPACK is still used as the benchmark for the TOP500 list of the most powerful supercomputers. Problem: the implementation is vector-based, i.e., has little locality in data access, which gives low performance on computers with a deep memory hierarchy; this became apparent in the 80s. Solution: LAPACK, which reimplements the algorithms block-based, i.e., with locality; developed in the late 1980s and early 1990s by Jim Demmel, Jack Dongarra, et al.

Matlab. Invented in the late 70s by Cleve Moler, commercialized (MathWorks) in 1984. Motivation: make LINPACK and EISPACK easy to use. Matlab uses LAPACK and other libraries, but can only call them if you operate on whole matrices and vectors and do not write your own loops: A*B (calls the MMM routine), A\b (calls a linear system solver).

LAPACK and BLAS. Basic idea: LAPACK (static) is implemented on top of the Basic Linear Algebra Subroutines (BLAS), which are reimplemented for each platform. BLAS 1: vector-vector operations (e.g., vector sum), operational intensity I(n) = O(1). BLAS 2: matrix-vector operations (e.g., matrix-vector product), I(n) = O(1). BLAS 3: matrix-matrix operations (e.g., MMM), I(n) = O(√C), where C is the cache size. LAPACK uses BLAS 3 as much as possible.
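
To make the three levels concrete, here is a minimal sketch (not from the slides) with one representative call per level through the CBLAS interface; the array contents and sizes are arbitrary example data.

    /* Sketch: one representative call per BLAS level (CBLAS interface, doubles).
       Link against a BLAS, e.g. -lcblas -lblas, or an optimized one such as OpenBLAS/MKL. */
    #include <cblas.h>

    int main(void) {
        enum { N = 4 };
        double x[N] = {1,1,1,1}, y[N] = {2,2,2,2};
        double A[N*N] = {0}, B[N*N] = {0}, C[N*N] = {0};
        for (int i = 0; i < N; i++) A[i*N + i] = B[i*N + i] = 1.0;  /* identity matrices */

        /* BLAS 1 (vector-vector): y = 2*x + y, O(n) flops on O(n) data */
        cblas_daxpy(N, 2.0, x, 1, y, 1);

        /* BLAS 2 (matrix-vector): y = A*x + y, O(n^2) flops on O(n^2) data */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, N, N, 1.0, A, N, x, 1, 1.0, y, 1);

        /* BLAS 3 (matrix-matrix): C = A*B, O(n^3) flops on O(n^2) data */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
        return 0;
    }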

Why is BLAS 3 so important? Using BLAS 3 (instead of BLAS 1 or 2) in LAPACK = blocking = high operational intensity I = high performance. Remember last time (blocking MMM): the plain triple loop has I(n) = O(1), whereas the blocked version reaches I(n) = O(√C), where C is the cache size.
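
As a reminder of what this blocking looks like in code, here is a rough sketch of a blocked MMM in C; the block size NB and the row-major interface are placeholder choices, not the exact version from the previous lecture.

    /* Sketch: blocked MMM, C = C + A*B for n x n doubles (row-major), n a multiple of NB.
       NB is a placeholder block size; pick it so 3*NB*NB doubles fit in the target cache. */
    #define NB 32

    void mmm_blocked(int n, const double *A, const double *B, double *C) {
        for (int i0 = 0; i0 < n; i0 += NB)
            for (int j0 = 0; j0 < n; j0 += NB)
                for (int k0 = 0; k0 < n; k0 += NB)
                    /* mini-MMM on one NB x NB block of C */
                    for (int i = i0; i < i0 + NB; i++)
                        for (int j = j0; j < j0 + NB; j++) {
                            double c = C[i*n + j];
                            for (int k = k0; k < k0 + NB; k++)
                                c += A[i*n + k] * B[k*n + j];
                            C[i*n + j] = c;
                        }
    }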

Today: linear algebra software (history, LAPACK and BLAS); blocking (BLAS 3) as the key to performance; how to make MMM fast (ATLAS, model-based ATLAS).

MMM: Complexity? Usually computed as C = AB + C. Cost as computed before: n^3 multiplications + n^3 additions = 2n^3 floating point operations = O(n^3) runtime. Blocking increases locality (see the previous example) but does not decrease the cost. Can we reduce the op count?
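
For reference, a sketch of the plain triple loop that performs exactly these 2n^3 flops (the row-major interface is just an illustrative choice):

    /* Sketch: naive MMM, C = C + A*B for n x n doubles in row-major layout.
       Performs n^3 multiplications and n^3 additions, i.e. 2n^3 flops. */
    void mmm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }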

Strassen's Algorithm. Strassen, V., "Gaussian Elimination is Not Optimal," Numerische Mathematik 13, 354-356, 1969. Until then, MMM was thought to be Θ(n^3). Recurrence: T(n) = 7T(n/2) + O(n^2); hence O(n^(log2 7)) ≈ O(n^2.808). Crossover point, in terms of cost: below n = 1000, but the structure is more complex (the performance crossover is much later) and the numerical stability is inferior. [Plot: cost by definition (2n^3) divided by Strassen's cost, versus log2(n) from 0 to 20; the crossover is marked.] Can we reduce more?
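
For concreteness, the following C sketch (an illustration, not a tuned implementation) spells out the seven half-size products behind the recurrence; it assumes n is a power of two, allocates temporaries at every level, and falls back to the naive triple loop below an arbitrary placeholder cutoff.

    /* Sketch of Strassen's algorithm, C = A*B for n x n doubles, n a power of two. */
    #include <stdlib.h>

    static void add(int n, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
    }
    static void sub(int n, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
    }
    static void naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double c = 0.0;
                for (int k = 0; k < n; k++) c += A[i*n + k] * B[k*n + j];
                C[i*n + j] = c;
            }
    }
    /* copy quadrant (r,c) of the n x n matrix X into the h x h matrix Q, h = n/2 */
    static void get_q(int n, const double *X, int r, int c, double *Q) {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) Q[i*h + j] = X[(r*h + i)*n + (c*h + j)];
    }
    static void put_q(int n, double *X, int r, int c, const double *Q) {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) X[(r*h + i)*n + (c*h + j)] = Q[i*h + j];
    }

    void strassen(int n, const double *A, const double *B, double *C) {
        if (n <= 64) { naive(n, A, B, C); return; }   /* placeholder cutoff */
        int h = n / 2;
        size_t s = (size_t)h * h * sizeof(double);
        double *A11 = malloc(s), *A12 = malloc(s), *A21 = malloc(s), *A22 = malloc(s);
        double *B11 = malloc(s), *B12 = malloc(s), *B21 = malloc(s), *B22 = malloc(s);
        double *M[7], *T1 = malloc(s), *T2 = malloc(s), *R = malloc(s);
        for (int i = 0; i < 7; i++) M[i] = malloc(s);

        get_q(n, A, 0, 0, A11); get_q(n, A, 0, 1, A12);
        get_q(n, A, 1, 0, A21); get_q(n, A, 1, 1, A22);
        get_q(n, B, 0, 0, B11); get_q(n, B, 0, 1, B12);
        get_q(n, B, 1, 0, B21); get_q(n, B, 1, 1, B22);

        add(h, A11, A22, T1); add(h, B11, B22, T2); strassen(h, T1, T2, M[0]);  /* M1 = (A11+A22)(B11+B22) */
        add(h, A21, A22, T1);                       strassen(h, T1, B11, M[1]); /* M2 = (A21+A22) B11      */
        sub(h, B12, B22, T2);                       strassen(h, A11, T2, M[2]); /* M3 = A11 (B12-B22)      */
        sub(h, B21, B11, T2);                       strassen(h, A22, T2, M[3]); /* M4 = A22 (B21-B11)      */
        add(h, A11, A12, T1);                       strassen(h, T1, B22, M[4]); /* M5 = (A11+A12) B22      */
        sub(h, A21, A11, T1); add(h, B11, B12, T2); strassen(h, T1, T2, M[5]);  /* M6 = (A21-A11)(B11+B12) */
        sub(h, A12, A22, T1); add(h, B21, B22, T2); strassen(h, T1, T2, M[6]);  /* M7 = (A12-A22)(B21+B22) */

        add(h, M[0], M[3], R); sub(h, R, M[4], R); add(h, R, M[6], R); put_q(n, C, 0, 0, R); /* C11 = M1+M4-M5+M7 */
        add(h, M[2], M[4], R);                                         put_q(n, C, 0, 1, R); /* C12 = M3+M5       */
        add(h, M[1], M[3], R);                                         put_q(n, C, 1, 0, R); /* C21 = M2+M4       */
        sub(h, M[0], M[1], R); add(h, R, M[2], R); add(h, R, M[5], R); put_q(n, C, 1, 1, R); /* C22 = M1-M2+M3+M6 */

        free(A11); free(A12); free(A21); free(A22);
        free(B11); free(B12); free(B21); free(B22);
        free(T1); free(T2); free(R);
        for (int i = 0; i < 7; i++) free(M[i]);
    }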

MMM Complexity: What is known. Coppersmith, D. and Winograd, S., "Matrix Multiplication via Arithmetic Progressions," J. Symb. Comput. 9, 251-280, 1990: MMM is O(n^2.376). MMM is obviously Ω(n^2), and it could well be close to Θ(n^2). Compare this to matrix-vector multiplication: known to be Θ(n^2) (Winograd), i.e., boring. Practically all code out there uses 2n^3 flops.

MMM: Memory Hierarchy Optimization. [Plot: performance of MMM (square, real, double) on a Core 2 Duo, 3 GHz, compiled with the Intel compiler icc -O2, versus matrix size; curves for the theoretical scalar peak, ATLAS-generated code, and the triple loop.] Huge performance difference for large sizes. Great case study to learn memory hierarchy optimization.

ATLAS. BLAS program generator and library (successor of PhiPAC). Idea: automatic porting; LAPACK stays static, BLAS is regenerated for each platform. People can also contribute handwritten code. The generator uses empirical search over implementation alternatives to find the fastest implementation; it does no vectorization or parallelization, so it is not really used anymore. We focus on BLAS 3 MMM and search only over algorithms with cost 2n^3 (cost equal to the triple loop).

ATLAS Architecture. [Diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, L*) feeds the ATLAS Search Engine (MMSearch), which passes NB, MU, NU, KU, xFetch, MulAdd, Latency to the ATLAS MM Code Generator (MMCase); the generated MiniMMM source is compiled, executed, and measured, and the MFLOPS result is fed back to the search engine.] Search parameters (for example, the blocking size NB) span the search space and specify the code; they are found by orthogonal line search. Hardware parameters: L1Size (size of the L1 data cache), NR (number of registers), MulAdd (fused multiply-add available?), L* (latency of FP multiplication). Source: Pingali, Yotov, Cornell U.
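
To give an idea of what the generated mini-MMM roughly looks like, here is a hand-written sketch (not actual ATLAS output) of an NB x NB tile with an MU x NU register-tiled micro-MMM; NB, MU, NU are placeholder values rather than the result of any search.

    /* Sketch of an ATLAS-style mini-MMM: C = C + A*B on one NB x NB tile (row-major),
       with an MU x NU register-tiled, fully unrolled micro-MMM in the inner loops.
       The unrolled body below is written out for MU = NU = 2; KU = 1 (no k-unrolling). */
    enum { NB = 32, MU = 2, NU = 2 };   /* NB divisible by MU and NU */

    void mini_mmm(const double *A, const double *B, double *C) {
        for (int i = 0; i < NB; i += MU)
            for (int j = 0; j < NB; j += NU) {
                /* micro-MMM: keep an MU x NU block of C in scalar "register" variables */
                double c00 = C[(i+0)*NB + j+0], c01 = C[(i+0)*NB + j+1];
                double c10 = C[(i+1)*NB + j+0], c11 = C[(i+1)*NB + j+1];
                for (int k = 0; k < NB; k++) {
                    double a0 = A[(i+0)*NB + k], a1 = A[(i+1)*NB + k];
                    double b0 = B[k*NB + j+0],   b1 = B[k*NB + j+1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[(i+0)*NB + j+0] = c00;  C[(i+0)*NB + j+1] = c01;
                C[(i+1)*NB + j+0] = c10;  C[(i+1)*NB + j+1] = c11;
            }
    }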

ATLAS vs. Model-Based ATLAS. [Diagrams: In ATLAS, the detected hardware parameters (L1Size, NR, MulAdd, L*) drive the search engine, which passes NB, MU, NU, KU, xFetch, MulAdd, Latency to the ATLAS MMM code generator; the MiniMMM source is compiled, executed, and measured (Mflop/s), closing the feedback loop. In model-based ATLAS, the search engine is replaced by a model that computes the same parameters directly from the hardware parameters (now also including L1I$Size), with no measurement feedback.] The search for parameters is replaced by a model that computes them; more hardware parameters are needed. Source: Pingali, Yotov, Cornell U.
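
As a toy illustration of the idea (far simpler than the actual models in the Yotov et al. paper referenced below), a block size can be computed directly from the detected L1 data cache size instead of being searched for:

    /* Toy sketch: pick the tile size NB from the L1 data cache size instead of searching.
       The simple capacity model used here (three NB x NB double tiles must fit in L1) is
       an illustration only; the model in model-based ATLAS is more refined. */
    #include <math.h>
    #include <stdio.h>

    int choose_nb(long l1_size_bytes) {
        long doubles_in_l1 = l1_size_bytes / (long)sizeof(double);
        int nb = (int)floor(sqrt(doubles_in_l1 / 3.0));   /* 3*NB^2 doubles <= L1 capacity */
        return nb - (nb % 4);                             /* round down to a multiple of 4 */
    }

    int main(void) {
        printf("NB for a 32 KB L1: %d\n", choose_nb(32 * 1024));
        return 0;
    }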

Optimizing MMM: on the blackboard. References: R. Clint Whaley, Antoine Petitet, and Jack Dongarra, "Automated Empirical Optimizations of Software and the ATLAS Project," Parallel Computing, 27(1-2):3-35, 2001. K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, "Is Search Really Necessary to Generate High-Performance BLAS?," Proceedings of the IEEE, 93(2), pp. 358-386, 2005. Our presentation is based on the latter paper.