Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS

Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS. Jianyu Huang, Leslie Rice. Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn. BLIS Retreat 2016.

STRASSEN, from 30,000 feet. Volker Strassen (born in 1936, aged 80). Original Strassen paper (1969). *Background photo: overlook of the Bay Area, taken in Mission Peak Regional Preserve, Fremont, CA, summer 2014.

One-level Strassen's Algorithm (In theory). Direct computation: 8 multiplications, 8 additions. Strassen's algorithm: 7 multiplications, 22 additions. *Strassen, Volker. "Gaussian elimination is not optimal." Numerische Mathematik 13, no. 4 (1969): 354-356.

Multi-level Strassen's Algorithm (In theory)
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01−B11); M3 := A11(B10−B00); M4 := (A00+A01)B11; M5 := (A10−A00)(B00+B01); M6 := (A01−A11)(B10+B11);
C00 = M0 + M3 − M4 + M6; C01 = M2 + M4; C10 = M1 + M3; C11 = M0 − M1 + M2 + M5
One-level Strassen (1+14.3% speedup): 8 multiplications → 7 multiplications; Two-level Strassen (1+30.6% speedup): 64 multiplications → 49 multiplications; d-level Strassen (n^3/n^2.803 speedup): 8^d multiplications → 7^d multiplications.

Strassen's Algorithm (In practice)
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01−B11); M3 := A11(B10−B00); M4 := (A00+A01)B11; M5 := (A10−A00)(B00+B01); M6 := (A01−A11)(B10+B11);
C00 = M0 + M3 − M4 + M6; C01 = M2 + M4; C10 = M1 + M3; C11 = M0 − M1 + M2 + M5
In practice, a conventional implementation forms the operand sums and differences in temporary matrices T0, ..., T9 before each of the seven multiplications.
One-level Strassen (1+14.3% speedup): 7 multiplications + 22 additions; Two-level Strassen (1+30.6% speedup): 49 multiplications + 344 additions; d-level Strassen (n^3/n^2.803 speedup): numerically unstable and not achievable in practice.

To achieve practical high performance of Strassen's algorithm... (Conventional Implementations vs. Our Implementations)
Matrix Size: conventional implementations require large matrices
Matrix Shape: conventional implementations require square matrices
No Additional Workspace
Parallelism: conventional implementations usually use task parallelism; ours can use data parallelism

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

Level-3 BLAS Matrix-Matrix Multiplication (GEMM). (General) matrix-matrix multiplication (GEMM) is supported in the level-3 BLAS* interface as DGEMM( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc ). Ignoring transa and transb, GEMM computes C := αAB + βC. We consider the simplified version of GEMM, C := αAB + C (no transposition, β = 1). *Dongarra, Jack J., et al. "A set of level 3 basic linear algebra subprograms." ACM Transactions on Mathematical Software (TOMS) 16.1 (1990): 1-17.
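For reference, a minimal (and deliberately slow) C version of this simplified operation, assuming column-major storage as in the BLAS; it serves only to pin down what is being computed:

/* Reference for the simplified GEMM, C := alpha*A*B + C, column-major. */
void gemm_ref(int m, int n, int k, double alpha,
              const double *A, int lda,
              const double *B, int ldb,
              double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j * ldc] += alpha * A[i + p * lda] * B[p + j * ldb];
}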

State-of-the-art GEMM in BLIS. BLAS-like Library Instantiation Software (BLIS) is a portable framework for instantiating BLAS-like dense linear algebra libraries. Field Van Zee and Robert van de Geijn. "BLIS: A Framework for Rapidly Instantiating BLAS Functionality." ACM TOMS 41.3 (2015): 14. BLIS provides a refactoring of the GotoBLAS algorithm (the best-known approach) to implement GEMM. Kazushige Goto and Robert van de Geijn. "High-performance implementation of the level-3 BLAS." ACM TOMS 35.1 (2008): 4. Kazushige Goto and Robert van de Geijn. "Anatomy of high-performance matrix multiplication." ACM TOMS 34.3 (2008): 12. The GEMM implementation in BLIS has six layers of loops. The outer five loops are written in C; the inner-most loop (the micro-kernel) is written in assembly for high performance. Matrices are partitioned into smaller blocks to fit into the different levels of the memory hierarchy, and the order of these loops is designed to maximize cache reuse. BLIS opens the black box of GEMM, leading to many applications built on BLIS. Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros. "Performance Optimization for the k-Nearest Neighbors Kernel on x86 Architectures." In SC'15. Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." To appear in SC'16. Devin Matthews. "High-Performance Tensor Contraction without BLAS." arXiv:1607.00291. Paul Springer and Paolo Bientinesi. "Design of a High-Performance GEMM-like Tensor-Tensor Multiplication." arXiv:1607.00145.

GotoBLAS algorithm for GEMM in BLIS (figure, built up loop by loop: the five loops around the micro-kernel and where each operand lives in the memory hierarchy).
5th loop around the micro-kernel: partitions C and B along n into column panels Cj and Bj (in main memory).
4th loop around the micro-kernel: partitions A and Bj along k; packs the panel Bp into the L3 cache.
3rd loop around the micro-kernel: partitions Cj and Ap along m into blocks of height mC; packs the block Ai into the L2 cache.
2nd loop around the micro-kernel: partitions Ci and Bp along n into slivers of width nR (micro-panels of Bp in the L1 cache).
1st loop around the micro-kernel: partitions Ci and Ai along m into slivers of height mR.
Micro-kernel: updates an mR × nR block Cij of C held in registers.
*Field G. Van Zee and Tyler M. Smith. "Implementing high-performance complex matrix multiplication." ACM Transactions on Mathematical Software (TOMS), accepted pending modifications.
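The loop structure above can be sketched in plain C. This is a hedged, simplified illustration (packing is only indicated in comments, the micro-kernel is a scalar loop, and the blocking parameters are made-up placeholders), not the BLIS source:

#include <stddef.h>

/* Illustrative blocking parameters; BLIS chooses these per architecture. */
enum { MC = 96, KC = 256, NC = 4096, MR = 8, NR = 4 };

/* Column-major C := A B + C following the five-loop GotoBLAS structure. */
void gemm_goto_sketch(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
{
    for (int jc = 0; jc < n; jc += NC)                      /* 5th loop */
      for (int pc = 0; pc < k; pc += KC) {                  /* 4th loop */
        int nb = (n - jc < NC) ? n - jc : NC;
        int kb = (k - pc < KC) ? k - pc : KC;
        /* here: pack B(pc:pc+kb-1, jc:jc+nb-1) into a buffer kept in L3 */
        for (int ic = 0; ic < m; ic += MC) {                /* 3rd loop */
          int mb = (m - ic < MC) ? m - ic : MC;
          /* here: pack A(ic:ic+mb-1, pc:pc+kb-1) into a buffer kept in L2 */
          for (int jr = 0; jr < nb; jr += NR)               /* 2nd loop */
            for (int ir = 0; ir < mb; ir += MR)             /* 1st loop */
              /* micro-kernel: rank-kb update of an (up to) MR x NR block of C,
                 kept in registers in the real implementation */
              for (int p = 0; p < kb; p++)
                for (int j = jr; j < nb && j < jr + NR; j++)
                  for (int i = ir; i < mb && i < ir + MR; i++)
                    C[(ic + i) + (size_t)(jc + j) * ldc] +=
                        A[(ic + i) + (size_t)(pc + p) * lda] *
                        B[(pc + p) + (size_t)(jc + j) * ldb];
        }
      }
}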

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

One-level Strassen's Algorithm Reloaded
M0 := α(A00+A11)(B00+B11); M1 := α(A10+A11)B00; M2 := αA00(B01−B11); M3 := αA11(B10−B00); M4 := α(A00+A01)B11; M5 := α(A10−A00)(B00+B01); M6 := α(A01−A11)(B10+B11);
C00 += M0 + M3 − M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 − M1 + M2 + M5
Reordered so that each intermediate Mi is computed once and immediately accumulated:
M0 := α(A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := α(A10+A11)B00; C10 += M1; C11 −= M1;
M2 := αA00(B01−B11); C01 += M2; C11 += M2;
M3 := αA11(B10−B00); C00 += M3; C10 += M3;
M4 := α(A00+A01)B11; C01 += M4; C00 −= M4;
M5 := α(A10−A00)(B00+B01); C11 += M5;
M6 := α(A01−A11)(B10+B11); C00 += M6;
Each line is an instance of M := α(X+Y)(V+W); C += M; D += M. General operation for one-level Strassen: M := α(X+δY)(V+εW); C += γ0 M; D += γ1 M; with γ0, γ1, δ, ε ∈ {−1, 0, 1}.
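To make the mapping concrete, here is a minimal, unoptimized sketch (not the paper's implementation): a reference version of the general operation computed element by element, followed by one-level Strassen expressed as seven calls to it. The names (str_general_op, strassen_1level) and the column-major layout are illustrative assumptions; the real implementation instead fuses the additions into the BLIS packing routines and the accumulations into the micro-kernel.

#include <stddef.h>

/* Reference version of the general operation
 *   M := alpha*(X + delta*Y)*(V + eps*W);  C += gamma0*M;  D += gamma1*M;
 * X, Y are m x k (lda), V, W are k x n (ldb), C, D are m x n (ldc),
 * all column-major.  Y, W, or D may be NULL. */
static void str_general_op(int m, int n, int k, double alpha,
                           const double *X, const double *Y, double delta, int lda,
                           const double *V, const double *W, double eps, int ldb,
                           double *C, double gamma0,
                           double *D, double gamma1, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double mij = 0.0;
            for (int p = 0; p < k; p++) {
                double x = X[i + (size_t)p * lda]
                         + (Y ? delta * Y[i + (size_t)p * lda] : 0.0);
                double v = V[p + (size_t)j * ldb]
                         + (W ? eps * W[p + (size_t)j * ldb] : 0.0);
                mij += x * v;
            }
            mij *= alpha;
            C[i + (size_t)j * ldc] += gamma0 * mij;
            if (D)
                D[i + (size_t)j * ldc] += gamma1 * mij;
        }
}

/* One-level Strassen, C := alpha*A*B + C, as seven general operations.
 * A (m x k), B (k x n), C (m x n) column-major; m, n, k assumed even. */
void strassen_1level(int m, int n, int k, double alpha,
                     const double *A, int lda, const double *B, int ldb,
                     double *C, int ldc)
{
    int mh = m / 2, nh = n / 2, kh = k / 2;
    const double *A00 = A,      *A01 = A + (size_t)kh * lda,
                 *A10 = A + mh, *A11 = A + (size_t)kh * lda + mh;
    const double *B00 = B,      *B01 = B + (size_t)nh * ldb,
                 *B10 = B + kh, *B11 = B + (size_t)nh * ldb + kh;
    double       *C00 = C,      *C01 = C + (size_t)nh * ldc,
                 *C10 = C + mh, *C11 = C + (size_t)nh * ldc + mh;

    str_general_op(mh, nh, kh, alpha, A00, A11, +1, lda, B00, B11, +1, ldb, C00, +1, C11, +1, ldc); /* M0 */
    str_general_op(mh, nh, kh, alpha, A10, A11, +1, lda, B00, NULL, 0, ldb, C10, +1, C11, -1, ldc); /* M1 */
    str_general_op(mh, nh, kh, alpha, A00, NULL, 0, lda, B01, B11, -1, ldb, C01, +1, C11, +1, ldc); /* M2 */
    str_general_op(mh, nh, kh, alpha, A11, NULL, 0, lda, B10, B00, -1, ldb, C00, +1, C10, +1, ldc); /* M3 */
    str_general_op(mh, nh, kh, alpha, A00, A01, +1, lda, B11, NULL, 0, ldb, C01, +1, C00, -1, ldc); /* M4 */
    str_general_op(mh, nh, kh, alpha, A10, A00, -1, lda, B00, B01, +1, ldb, C11, +1, NULL, 0, ldc); /* M5 */
    str_general_op(mh, nh, kh, alpha, A01, A11, -1, lda, B10, B11, +1, ldb, C00, +1, NULL, 0, ldc); /* M6 */
}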

High-performance implementation of the general operation? M := α(X+δY)(V+εW); C += γ0 M; D += γ1 M; γ0, γ1, δ, ε ∈ {−1, 0, 1}. *http://shpc.ices.utexas.edu/index.html

M := α(X+δY)(V+εW); C += γ0 M; D += γ1 M; γ0, γ1, δ, ε ∈ {−1, 0, 1}. *Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." In SC'16.

M := α(X+δY)(V+εW); C += γ0 M; D += γ1 M; γ0, γ1, δ, ε ∈ {−1, 0, 1}, mapped onto the GotoBLAS loop structure (figure, built up loop by loop):
5th loop around the micro-kernel: partitions C, D and V, W along n into column panels (in main memory).
4th loop around the micro-kernel: packs Vp + εWp into the packed buffer Bp in the L3 cache.
3rd loop around the micro-kernel: packs Xi + δYi into the packed buffer Ai in the L2 cache.
2nd and 1st loops around the micro-kernel: as in GEMM, march over the nR-wide slivers of Bp and the mR-high slivers of Ai.
Micro-kernel: computes an mR × nR block of M in registers and accumulates it into the corresponding blocks Cij and Dij.
*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." In SC'16.

(Side-by-side figure: the GotoBLAS loop structure for GEMM on the left and the modified loop structure for the fused general operation on the right. The only differences are the packing steps, which now add submatrices of the operands, and the micro-kernel, which now updates both Cij and Dij.)

Two-level Strassen's Algorithm Reloaded

Two-level Strassen's Algorithm Reloaded (continued). M := α(X0+X1+X2+X3)(V0+V1+V2+V3); C0 += M; C1 += M; C2 += M; C3 += M. General operation for two-level Strassen: M := α(X0+δ1X1+δ2X2+δ3X3)(V0+ε1V1+ε2V2+ε3V3); C0 += γ0 M; C1 += γ1 M; C2 += γ2 M; C3 += γ3 M; γi, δi, εi ∈ {−1, 0, 1}.

Additional Levels of Strassen Reloaded. The general operation of one-level Strassen: M := α(X+δY)(V+εW); C += γ0 M; D += γ1 M; γ0, γ1, δ, ε ∈ {−1, 0, 1}. The general operation of two-level Strassen: M := α(X0+δ1X1+δ2X2+δ3X3)(V0+ε1V1+ε2V2+ε3V3); C0 += γ0 M; C1 += γ1 M; C2 += γ2 M; C3 += γ3 M; γi, δi, εi ∈ {−1, 0, 1}. The general operation needed to integrate k levels of Strassen is given below.
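A sketch of that k-level general operation, inferred from the pattern of the one- and two-level cases above (treat the exact index ranges as an assumption): M := α(δ0 X0 + δ1 X1 + ... + δ_{l−1} X_{l−1})(ε0 V0 + ε1 V1 + ... + ε_{l−1} V_{l−1}); Cr += γr M for r = 0, ..., q−1, with γr, δs, εt ∈ {−1, 0, 1}; for k levels of Strassen, l and q are at most 2^k (2 for one level, 4 for two levels).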

Building blocks. From the BLIS framework: a routine for packing Bp (C/Intel intrinsics); a routine for packing Ai (C/Intel intrinsics); a micro-kernel for updating an mR × nR submatrix of C (SIMD assembly: AVX, AVX512, etc.). Adapted to the general operation: integrate the addition of multiple matrices Vt into the packing of Bp; integrate the addition of multiple matrices Xs into the packing of Ai; integrate the update of multiple submatrices of C into the micro-kernel. A sketch of such a packing routine follows.
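As an illustration of the packing adaptation, a minimal sketch of a routine that forms V + εW on the fly while copying one kC × nR micro-panel into the packed buffer. The names, the NR value, and the plain row-of-panels layout are illustrative assumptions, not the BLIS internals:

#include <stddef.h>

enum { NR_SKETCH = 4 };

/* Pack one kc x nr micro-panel of (V + eps*W) into a contiguous buffer.
 * V and W are column-major with leading dimensions ldv/ldw; pass W = NULL,
 * eps = 0 to recover ordinary GEMM packing of V alone. */
static void pack_b_micropanel_sum(int kc, int nr,
                                  const double *V, int ldv,
                                  const double *W, int ldw, double eps,
                                  double *packed)
{
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < NR_SKETCH; j++)
            /* zero-pad the panel when nr < NR_SKETCH (right edge case) */
            packed[p * NR_SKETCH + j] =
                (j < nr) ? V[p + (size_t)j * ldv] +
                           (W ? eps * W[p + (size_t)j * ldw] : 0.0)
                         : 0.0;
}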

Variations on a theme. Naïve Strassen: a traditional implementation with temporary buffers. AB Strassen: integrate the addition of matrices into the packing of Ai and Bp. ABC Strassen: integrate the addition of matrices into the packing of Ai and Bp, and integrate the update of multiple submatrices of C into the micro-kernel.

Parallelization (figure: the fused loop structure annotated with threading options). The computation can be parallelized along the 3rd loop (the mC direction), along the 2nd loop (the nR direction), or along both the 3rd and 2nd loops. *Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. "Anatomy of high-performance many-threaded matrix multiplication." In IEEE IPDPS 2014, pp. 1049-1059.

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

Performance Model. Performance metric; total time breakdown: arithmetic operations and memory operations.
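As a sketch of the metric and the breakdown (following the SC'16 paper; treat the exact form as an assumption): performance is reported as effective GFLOPS = 2·m·n·k / time · 10^−9, i.e. the classical flop count is charged to every algorithm, so a Strassen variant that beats DGEMM shows up as exceeding the DGEMM rate; the modeled total time is T = Ta + Tm, the time spent on arithmetic operations plus the time spent on memory operations (packing, reading, and writing submatrices).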

Arithmetic Operations. DGEMM: no extra additions. Strassen: submatrix multiplications plus extra additions with submatrices of A, B, and C, respectively. One-level Strassen (ABC, AB, Naïve): 7 submatrix multiplications, 5 extra additions of submatrices of A and of B, 12 extra additions of submatrices of C. (M0 := α(A00+A11)(B00+B11); C00 += M0; C11 += M0; M1 := α(A10+A11)B00; C10 += M1; C11 −= M1; M2 := αA00(B01−B11); C01 += M2; C11 += M2; M3 := αA11(B10−B00); C00 += M3; C10 += M3; M4 := α(A00+A01)B11; C01 += M4; C00 −= M4; M5 := α(A10−A00)(B00+B01); C11 += M5; M6 := α(A01−A11)(B10+B11); C00 += M6.) Two-level Strassen (ABC, AB, Naïve): 49 submatrix multiplications, 95 extra additions of submatrices of A and B, 154 extra additions of submatrices of C.
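Reading these counts off the equations gives a rough one-level arithmetic cost (a sketch; each A-, B-, and C-submatrix addition is charged (m/2)(k/2), (k/2)(n/2), and (m/2)(n/2) flops respectively): Ta ≈ 7·2(m/2)(n/2)(k/2) + 5·(m/2)(k/2) + 5·(k/2)(n/2) + 12·(m/2)(n/2), versus 2mnk for DGEMM. For large m, n, k the multiplication term dominates and the ratio approaches 8/7, the 1+14.3% figure quoted earlier.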

Memory Operations (table of modeled memory-operation costs for DGEMM; one-level ABC, AB, and Naïve Strassen; and two-level ABC, AB, and Naïve Strassen).

Modeled and Actual Performance on Single Core

Observation (Square Matrices): plots of modeled performance vs. actual performance.

Observation (Square Matrices): plots of modeled vs. actual performance (26.0%, 13.1%). Theoretical speedup over DGEMM: one-level Strassen (1+14.3% speedup), 8 multiplications → 7 multiplications; two-level Strassen (1+30.6% speedup), 64 multiplications → 49 multiplications.

Observation (Square Matrices). For both one-level and two-level: for small square matrices, ABC Strassen outperforms AB Strassen; for larger square matrices, this trend reverses. Reason: ABC Strassen avoids storing M (M resides in the registers), but ABC Strassen increases the number of times the submatrices of C are updated.

(Side-by-side figure, repeated from earlier: the GotoBLAS loop structure for GEMM and the fused loop structure for the general operation, highlighting the extra packing additions and the double update of Cij and Dij.)

Observation (Rank-k Update). What is a rank-k update? C (m × n) += A (m × k) · B (k × n), where k is small relative to m and n.

Observation (Rank-k Update). Importance of rank-k update: blocked LU with partial pivoting (getrf) spends most of its flops in rank-b updates of the trailing submatrix, A22 := A22 − A21 A12.

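A hedged sketch of where these rank-b GEMM calls arise in a right-looking blocked factorization; pivoting and the panel factorization are only indicated in comments, and cblas_dgemm from any CBLAS-compatible BLAS is assumed to be linked:

#include <stddef.h>
#include <cblas.h>

/* Right-looking blocked LU sketch over an n x n column-major matrix A.
 * After factoring the b-wide panel and solving the row panel A12, the
 * trailing submatrix is updated with a rank-jb GEMM, A22 := A22 - A21*A12.
 * For large n and small b this GEMM has m, n >> k = jb: the rank-k shape. */
void blocked_lu_trailing_updates(int n, int b, double *A, int lda)
{
    for (int j = 0; j < n; j += b) {
        int jb = (n - j < b) ? n - j : b;
        /* ... factor panel A(j:n-1, j:j+jb-1) with pivoting (dgetf2) ...  */
        /* ... apply pivots and solve the row panel A12 with dtrsm ...     */
        int rest = n - j - jb;           /* rows/cols remaining below/right */
        if (rest > 0)                    /* rank-jb update of trailing A22  */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        rest, rest, jb,
                        -1.0, &A[(j + jb) + (size_t)j * lda],        lda,  /* A21 */
                              &A[j        + (size_t)(j + jb) * lda], lda,  /* A12 */
                         1.0, &A[(j + jb) + (size_t)(j + jb) * lda], lda); /* A22 */
    }
}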
Observation (Rank-k Update): plots of modeled performance vs. actual performance.

Observation (Rank-k Update): plots of modeled vs. actual performance. Reason: ABC Strassen avoids forming the temporary matrix M explicitly in memory (M resides in the registers), which is especially important when m, n >> k.

Single Node Experiment

(Plots: rank-k update and square matrices, on 1 core, 5 cores, and 10 cores.)

Many-core Experiment

Intel Xeon Phi coprocessor (KNC)

Distributed Memory Experiment

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

Level-3 BLAS Symmetric Matrix-Matrix Multiplication (SYMM). Symmetric matrix-matrix multiplication (SYMM) is supported in the level-3 BLAS* interface as DSYMM( side, uplo, m, n, alpha, A, lda, B, ldb, beta, C, ldc ). SYMM computes C := αAB + βC (side = left) or C := αBA + βC (side = right), where A is symmetric. *Dongarra, Jack J., et al. "A set of level 3 basic linear algebra subprograms." ACM Transactions on Mathematical Software (TOMS) 16.1 (1990): 1-17.

Level-3 BLAS Symmetric Matrix-Matrix Multiplication (SYMM). A is a symmetric matrix (square and equal to its transpose); assume only the lower triangular half is stored. Follow the same algorithm we used for GEMM, modifying the packing routine for matrix A to account for the symmetric nature of A. Example: A is stored as its lower triangular part only, so an element above the diagonal, A(i, j) with i < j, is read from the stored position A(j, i); a packing sketch follows.
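A minimal sketch of that packing idea, assuming column-major storage with only the lower triangle of A valid; the element accessor and the plain column-major packed layout are illustrative simplifications, not the BLIS micro-panel format:

#include <stddef.h>

/* Read A(i, j) from a symmetric matrix of which only the lower triangle
 * (i >= j) is stored in column-major order with leading dimension lda. */
static inline double symm_elem(const double *A, int lda, int i, int j)
{
    return (i >= j) ? A[i + (size_t)j * lda] : A[j + (size_t)i * lda];
}

/* Pack the block A(i0:i0+mb-1, p0:p0+kb-1) into a contiguous column-major
 * buffer, exactly as a GEMM packing routine would, but routing each element
 * through symm_elem so the implicitly stored upper triangle is filled in. */
static void pack_a_symmetric(const double *A, int lda, int i0, int p0,
                             int mb, int kb, double *packed)
{
    for (int p = 0; p < kb; p++)
        for (int i = 0; i < mb; i++)
            packed[i + (size_t)p * mb] = symm_elem(A, lda, i0 + i, p0 + p);
}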

SYMM with One-Level Strassen. When partitioning A for the one-level Strassen operations, we integrate the symmetric nature of A: A is partitioned into quadrants A00, A01, A10, A11, where A01 = A10^T and only the lower blocks (A00, A10, A11) are actually stored.

SYMM Implementation Platform: BLISlab*. What is BLISlab? A sandbox for optimizing GEMM that mimics the implementation in BLIS; a set of exercises that use GEMM to show how high performance can be attained on modern CPUs with hierarchical memories; used in Prof. Robert van de Geijn's CS 383C Numerical Linear Algebra course. Contributions: added DSYMM; applied Strassen's algorithm to DGEMM and DSYMM (3 versions: Naïve, AB, ABC); Strassen improves performance for both DGEMM and DSYMM. *https://github.com/flame/blislab

DSYMM Results in BLISlab

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

(2,2,2) Strassen Algorithm
M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := (A10+A11)B00; C10 += M1; C11 −= M1;
M2 := A00(B01−B11); C01 += M2; C11 += M2;
M3 := A11(B10−B00); C00 += M3; C10 += M3;
M4 := (A00+A01)B11; C01 += M4; C00 −= M4;
M5 := (A10−A00)(B00+B01); C11 += M5;
M6 := (A01−A11)(B10+B11); C00 += M6;

(3,2,3) Fast Matrix Multiplication
M0 := (A01+A11+A21)(B11+B12); C02 −= M0;
M1 := (A00+A10+A21)(B00−B12); C02 −= M1; C21 −= M1;
M2 := (A00+A10)(B01−B11); C00 −= M2; C01 += M2; C21 += M2;
M3 := (A10−A20+A21)(B02−B10+B12); C02 += M3; C12 += M3; C20 −= M3;
M4 := (A00+A01+A10)(B00+B01+B11); C00 −= M4; C01 += M4; C11 += M4;
M5 := (A10)(B00); C00 += M5; C01 −= M5; C10 += M5; C11 −= M5; C20 += M5;
M6 := (A10+A11+A20−A21)(B10−B12); C02 −= M6; C12 −= M6;
M7 := (A11)(B10); C02 += M7; C10 += M7; C12 += M7;
M8 := (A00+A10+A20)(B01+B02); C21 += M8;
M9 := (A00+A01)(B00−B01); C01 += M9; C11 += M9;
M10 := (A21)(B02+B12); C02 += M10; C20 += M10; C22 += M10;
M11 := (A20−A21)(B02); C02 += M11; C12 += M11; C21 −= M11; C22 += M11;
M12 := (A01)(B00+B01+B10+B11); C00 += M12;
M13 := (A10−A20)(B00−B02+B10−B12); C20 −= M13;
M14 := (A00+A01+A10−A11)(B11); C02 −= M14; C11 −= M14;
*Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

Outline: Review of State-of-the-art GEMM in BLIS; Strassen's Algorithm Reloaded; Theoretical Model and Practical Performance; Extension to Other BLAS-3 Operations; Extension to Other Fast Matrix Multiplication; Conclusion

To achieve practical high performance of Strassen's algorithm... (Conventional Implementations vs. Our Implementations)
Matrix Size: conventional implementations require large matrices
Matrix Shape: conventional implementations require square matrices
No Additional Workspace
Parallelism: conventional implementations usually use task parallelism; ours can use data parallelism

Acknowledgement - NSF grants ACI-1148125/1340293, CCF-1218483. - Intel Corporation through an Intel Parallel Computing Center (IPCC). - Access to the Maverick and Stampede supercomputers administered by TACC. We thank Field Van Zee, Chenhan Yu, Devin Matthews, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Thank you!