How to Write Fast Numerical Code Spring 2012 Lecture 13. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato


ATLAS vs. Model-Based ATLAS [block diagram]
- ATLAS: Detect Hardware Parameters (L1Size, NR, MulAdd, L*) -> ATLAS Search Engine (compile, execute, measure Mflop/s) -> NB, MU, NU, KU, xfetch, MulAdd, Latency -> ATLAS MMM Code Generator -> MiniMMM Source
- Model-Based ATLAS: Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, L*) -> Model -> NB, MU, NU, KU, xfetch, MulAdd, Latency -> ATLAS MMM Code Generator -> MiniMMM Source
source: Pingali, Yotov, Cornell U.

Principles
- Optimization for memory hierarchy: blocking for cache, blocking for registers
- Basic block optimizations: loop order for ILP, unrolling + scalar replacement, scheduling & software pipelining
- Optimizations for virtual memory: buffering (copying spread-out data into contiguous memory)
- Autotuning: search over parameters (ATLAS), model to estimate parameters (Model-Based ATLAS)
All high performance MMM libraries do some of these (but possibly in a different way).

Today: Memory bound computations; sparse linear algebra, OSKI.

Memory Bound Computation. Data movement, not computation, is the bottleneck. Typically: computations with operational intensity I(n) = O(1). [Roofline plot: performance versus operational intensity; the memory bandwidth bound and the peak performance bound separate the memory-bound region from the compute-bound region.]
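A quick worked example (not on the slide, assuming double precision and vectors that do not fit in cache): for z = x + y on vectors of length n, W(n) = n flops and at least Q(n) = 3 * 8n = 24n bytes of memory traffic, so I(n) = W(n)/Q(n) = 1/24 flops/byte, a constant. Such kernels therefore sit under the bandwidth roof no matter how well the flops are optimized.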

Memory Bound Or Not? Depends on:
- The computer: memory bandwidth, peak performance
- How it is implemented: good/bad locality, SIMD or not (e.g., MMM)
- How the measurement is done: cold data & code, cold or warm cache for data/code, in which cache the data resides
See next slide.

Example 1: BLAS 1, Warm Data & Code. z = x + y on Core i7 (Nehalem, one core, no SSE), icc 12.0 /O2 /fp:fast /Qipo. [Plot: percentage of peak performance (peak = 1 add/cycle) versus sum of vector lengths (working set), 1 KB to 16 MB; performance steps down at the L1, L2, and L3 cache boundaries; bounds based on bandwidths of 2 doubles/cycle, 1 double/cycle, and 1/2 double/cycle are drawn in.]

Example 2: Cold Data & Code. [Preliminary roofline plots for BLAS 3, BLAS 2, BLAS 1, and FFT; roofline tool developed in our group by R. Steinmann.]

Example 3: Cold/Warm Data & Code. [Preliminary roofline plots for the four combinations: cold data/cold code, cold data/warm code, warm data/warm code, warm data/cold code; roofline tool developed in our group by R. Steinmann.]

Sparse Linear Algebra: sparse matrix-vector multiplication (MVM); Sparsity/Bebop/OSKI.
References:
- Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.
- R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee. Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply. Supercomputing, 2002.
- Sparsity/Bebop website

Sparse Linear Algebra. Very different characteristics from dense linear algebra (LAPACK etc.). Applications: finite element methods, PDE solving, physical/chemical simulation (e.g., fluid dynamics), linear programming, scheduling, signal processing (e.g., filters). Core building block: sparse MVM. Graphics: http://aam.mathematik.uni-freiburg.de/iam/homepages/claus/projects/unfitted-meshes_en.html

Sparse MVM (SMVM): y = y + Ax, A sparse but known. [Picture: y = y + A x with sparse A.] Typically executed many times for fixed A. What is reused (temporal locality)? Upper bound on operational intensity?
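A back-of-the-envelope answer, with the details left for the blackboard: only x offers reuse, since each entry of A and of y is touched once per SMVM. With K nonzeros the kernel performs 2K flops (one multiply, one add per nonzero) but must read at least the 8K bytes of the nonzero values themselves, so I <= 2K / 8K = 1/4 flops/byte regardless of the storage format; counting index data and the traffic for x and y pushes the bound lower. SMVM is therefore memory bound on current machines.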

Storage of Sparse Matrices. Standard (dense) storage is obviously inefficient: many zeros are stored, unnecessary operations, unnecessary data movement, bad operational intensity. Several sparse storage formats are available; the most popular is the compressed sparse row (CSR) format (blackboard).

CSR. Assumptions: A is m x n with K nonzero entries.
A as matrix (example: m = n = 4, K = 7):
  b c 0 c
  0 a 0 0
  0 0 b b
  0 0 c 0
A in CSR:
  values    = [ b c c a b b c ]   (length K)
  col_idx   = [ 0 1 3 1 2 3 2 ]   (length K)
  row_start = [ 0 3 4 6 7 ]       (length m+1)
Storage: K doubles + (K + m + 1) ints = Θ(max(K, m)), typically Θ(K).
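A minimal C sketch of this layout (the struct and function names are illustrative, not from any particular library):

#include <stdlib.h>

/* CSR storage for an m x n matrix with K nonzeros */
typedef struct {
    int m, n, K;
    double *values;     /* length K   */
    int    *col_idx;    /* length K   */
    int    *row_start;  /* length m+1 */
} csr_t;

/* build CSR from a dense row-major array; for illustration only,
   in practice a sparse matrix is assembled directly in sparse form */
csr_t csr_from_dense(int m, int n, const double *A)
{
    csr_t S = { m, n, 0, NULL, NULL, NULL };
    for (int i = 0; i < m * n; i++)
        if (A[i] != 0.0) S.K++;
    S.values    = malloc(S.K * sizeof(double));
    S.col_idx   = malloc(S.K * sizeof(int));
    S.row_start = malloc((m + 1) * sizeof(int));

    int k = 0;
    for (int i = 0; i < m; i++) {
        S.row_start[i] = k;
        for (int j = 0; j < n; j++)
            if (A[i*n + j] != 0.0) {
                S.values[k]  = A[i*n + j];
                S.col_idx[k] = j;
                k++;
            }
    }
    S.row_start[m] = k;   /* equals K */
    return S;
}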

Sparse MVM Using CSR: y = y + Ax

void smvm(int m, const double *values, const int *col_idx,
          const int *row_start, double *x, double *y)
{
  int i, j;
  double d;

  /* loop over m rows */
  for (i = 0; i < m; i++) {
    d = y[i];   /* scalar replacement since reused */

    /* loop over non-zero elements in row i */
    for (j = row_start[i]; j < row_start[i+1]; j++)
      d += values[j] * x[col_idx[j]];

    y[i] = d;
  }
}

CSR + sparse MVM: Advantages?
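For reference, a small hypothetical driver that applies smvm to the example matrix from the CSR slide, with the placeholder values a = 1, b = 2, c = 3 and x = (1, 1, 1, 1); it assumes the smvm function above is in scope:

#include <stdio.h>

int main(void)
{
    double values[]    = { 2, 3, 3, 1, 2, 2, 3 };   /* b c c a b b c */
    int    col_idx[]   = { 0, 1, 3, 1, 2, 3, 2 };
    int    row_start[] = { 0, 3, 4, 6, 7 };
    double x[] = { 1, 1, 1, 1 };
    double y[] = { 0, 0, 0, 0 };

    smvm(4, values, col_idx, row_start, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 8 1 4 3 */
    return 0;
}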

CSR
Advantages:
- Only nonzero values are stored
- All three arrays for A (values, col_idx, row_start) are accessed consecutively in MVM (good spatial locality)
- Good temporal locality with respect to y
Disadvantages:
- Insertion into A is costly
- Poor temporal locality with respect to x

Impact of Matrix Sparsity on Performance
- Addressing overhead (dense MVM vs. dense MVM in CSR): ~2x slower (example only)
- Irregular structure: ~5x slower (example only) for random sparse matrices
Fundamental differences between MVM and sparse MVM (SMVM):
- Sparse MVM is input dependent (sparsity pattern of A)
- Changing the order of computation (blocking) requires changing the data structure (CSR)

Bebop/Sparsity: SMVM Optimizations
- Idea: blocking for registers
- Reason: reuse x to reduce memory traffic
- Execution: block SMVM y = y + Ax into micro MVMs; block size r x c becomes a parameter
- Consequence: change A from CSR to r x c block CSR (BCSR)
BCSR: blackboard

BCSR (Blocks of Size r x c). Assumptions: A is m x n, block size r x c, K_{r,c} nonzero blocks.
A as matrix (same example, r = c = 2, K_{2,2} = 3):
  b c | 0 c
  0 a | 0 0
  ----+----
  0 0 | b b
  0 0 | c 0
A in BCSR (r = c = 2):
  b_values    = [ b c 0 a   0 c 0 0   b b c 0 ]   (length r*c*K_{r,c})
  b_col_idx   = [ 0 1 1 ]                          (length K_{r,c})
  b_row_start = [ 0 2 3 ]                          (length m/r + 1)
Storage: r*c*K_{r,c} doubles + (K_{r,c} + m/r + 1) ints = Θ(r*c*K_{r,c}); note that r*c*K_{r,c} >= K.
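As a concrete (and hypothetical) illustration of what the required change of data structure involves, here is a sketch of a CSR to 2 x 2 BCSR converter; it assumes m and n are multiples of 2 and produces the block-row-major layout used by the smvm_2x2 kernel on the next slide. The cost of this kind of conversion is what makes exhaustive search over block sizes expensive later on.

#include <stdlib.h>

void csr_to_bcsr_2x2(int m, int n,
                     const double *values, const int *col_idx, const int *row_start,
                     double **pb_values, int **pb_col_idx, int **pb_row_start)
{
    int bm = m / 2, bn = n / 2;
    int *seen = malloc(bn * sizeof(int));   /* last block row that touched this block col */
    int *slot = malloc(bn * sizeof(int));   /* block position within the current block row */
    int *b_row_start = malloc((bm + 1) * sizeof(int));
    for (int c = 0; c < bn; c++) seen[c] = -1;

    /* pass 1: count nonzero 2 x 2 blocks per block row */
    b_row_start[0] = 0;
    for (int I = 0; I < bm; I++) {
        int nb = 0;
        for (int r = 2*I; r < 2*I + 2; r++)
            for (int j = row_start[r]; j < row_start[r+1]; j++) {
                int bc = col_idx[j] / 2;
                if (seen[bc] != I) { seen[bc] = I; nb++; }
            }
        b_row_start[I+1] = b_row_start[I] + nb;
    }

    int K_rc = b_row_start[bm];
    double *b_values  = calloc(4 * K_rc, sizeof(double));   /* explicit zeros = fill */
    int    *b_col_idx = malloc(K_rc * sizeof(int));
    for (int c = 0; c < bn; c++) seen[c] = -1;

    /* pass 2: scatter every CSR entry into its block (row-major inside the block) */
    for (int I = 0; I < bm; I++) {
        int next = b_row_start[I];
        for (int r = 2*I; r < 2*I + 2; r++)
            for (int j = row_start[r]; j < row_start[r+1]; j++) {
                int bc = col_idx[j] / 2;
                if (seen[bc] != I) {
                    seen[bc] = I; slot[bc] = next; b_col_idx[next] = bc; next++;
                }
                b_values[4*slot[bc] + 2*(r - 2*I) + (col_idx[j] - 2*bc)] = values[j];
            }
    }

    free(seen); free(slot);
    *pb_values = b_values; *pb_col_idx = b_col_idx; *pb_row_start = b_row_start;
}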

Sparse MVM Using 2 x 2 BCSR

void smvm_2x2(int bm, const int *b_row_start, const int *b_col_idx,
              const double *b_values, double *x, double *y)
{
  int i, j;
  double d0, d1, c0, c1;

  /* loop over bm block rows */
  for (i = 0; i < bm; i++) {
    d0 = y[2*i];     /* scalar replacement since reused */
    d1 = y[2*i+1];

    /* dense micro MVM */
    for (j = b_row_start[i]; j < b_row_start[i+1]; j++, b_values += 2*2) {
      c0 = x[2*b_col_idx[j]+0];   /* scalar replacement since reused */
      c1 = x[2*b_col_idx[j]+1];
      d0 += b_values[0] * c0;
      d1 += b_values[2] * c0;
      d0 += b_values[1] * c1;
      d1 += b_values[3] * c1;
    }

    y[2*i]   = d0;
    y[2*i+1] = d1;
  }
}

BCSR
Advantages:
- Temporal locality with respect to x and y
- Reduced storage for indexes
Disadvantages:
- Storage for the values of A increased (explicit zeros added)
- Computational overhead (also due to the zeros)
Main factors (since memory bound):
- Plus: increased temporal locality on x + reduced index storage = reduced memory traffic
- Minus: more zeros = increased memory traffic
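For the toy matrix above (K = 7, m = 4, K_{2,2} = 3), assuming 8-byte doubles and 4-byte ints: CSR takes 7*8 + 12*4 = 104 bytes, while 2 x 2 BCSR takes 12*8 + 6*4 = 120 bytes, so here the added zeros outweigh the index savings; for matrices with a genuine block structure the balance tips the other way.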

Which Block Size (r x c) is Optimal? Example: a 20,000 x 20,000 matrix with perfect 8 x 8 block structure (only part shown in the figure); no overhead when blocked r x c where r and c divide 8. Source: R. Vuduc, LLNL

Speed-up Through r x c Blocking. [Figure: measured speed-ups for different block sizes.] Machine dependent, hard to predict. Source: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.

How to Find the Best Blocking for a Given A?
The best block size is hard to predict (see previous slide).
Solution 1: Search over all r x c within a range, e.g., 1 <= r, c <= 12. Converting A from CSR to BCSR is roughly as expensive as 10 SMVMs, so the total cost is about 144 x 10 = 1440 SMVMs. Too expensive.
Solution 2: Model. Estimate the gain through blocking, estimate the loss through blocking, and pick the best ratio.

Model: Example.
Gain by blocking (dense MVM): 1.4
Overhead (average) by blocking: 16/9 = 1.77 (values stored per nonzero in the blocked format)
Expected gain: 1.4 / 1.77 = 0.79, i.e., no gain.
Model: do this for all r and c and pick the best.

Model. Goal: find the best r x c for y = y + Ax.
Gain through r x c blocking (estimated):
  G_{r,c} = (dense MVM performance in r x c BCSR) / (dense MVM performance in CSR)
  dependent on the machine, independent of the sparse matrix
Overhead through r x c blocking (estimated by scanning part of the matrix A):
  O_{r,c} = (number of matrix values in r x c BCSR) / (number of matrix values in CSR)
  independent of the machine, dependent on the sparse matrix
Expected gain: G_{r,c} / O_{r,c}
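A hedged sketch of how such a model can be implemented (function names and the sampling scheme are illustrative, not OSKI's actual code): the G_{r,c} table is measured once per machine on a dense matrix stored in BCSR, while O_{r,c} is estimated by scanning only a small sample of the block rows of A.

#include <stdlib.h>

/* estimate O_{r,c}: values stored in r x c BCSR divided by values stored in CSR,
   computed from every step-th block row only */
double estimate_fill_ratio(int m, int n, const int *col_idx, const int *row_start,
                           int r, int c, double sample_frac)
{
    long nnz = 0, blocks = 0;
    int bn = (n + c - 1) / c;
    int *seen = malloc(bn * sizeof(int));
    for (int bc = 0; bc < bn; bc++) seen[bc] = -1;

    int step = (int)(1.0 / sample_frac);
    if (step < 1) step = 1;
    for (int I = 0; I * r < m; I += step) {            /* sampled block rows */
        for (int row = I*r; row < (I+1)*r && row < m; row++)
            for (int j = row_start[row]; j < row_start[row+1]; j++) {
                int bc = col_idx[j] / c;
                nnz++;
                if (seen[bc] != I) { seen[bc] = I; blocks++; }
            }
    }
    free(seen);
    return nnz ? (double)(blocks * r * c) / (double)nnz : 1.0;
}

/* pick the r x c that maximizes the expected gain G_{r,c} / O_{r,c};
   G[r][c] holds the dense BCSR-over-CSR speedups measured once per machine */
void pick_block_size(int m, int n, const int *col_idx, const int *row_start,
                     double G[13][13], int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= 12; r++)
        for (int c = 1; c <= 12; c++) {
            double gain = G[r][c] / estimate_fill_ratio(m, n, col_idx, row_start, r, c, 0.01);
            if (gain > best) { best = gain; *best_r = r; *best_c = c; }
        }
}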

Gain from Blocking (Dense Matrix in BCSR). [Figure: heat maps of the gain over row block size r and column block size c, for Pentium III and Itanium 2.] Machine dependent, hard to predict. Source: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.

Typical Result. [Figure: performance of CSR, BCSR with the model-selected block size, BCSR with exhaustive search, and the analytical upper bound.] How is the upper bound obtained? (blackboard) Figure: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.

Principles in Bebop/Sparsity Optimization
- Optimization for memory hierarchy = increasing locality: blocking for registers (micro MVMs); requires a change of the data structure for A; the optimizations are input dependent (on the sparse structure of A)
- Fast basic blocks for small sizes (micro MVM): unrolling + scalar replacement
- Search for the fastest over a relevant set of algorithm/implementation alternatives (parameters r, c): a performance model (versus measuring runtime) is used to evaluate the expected gain, different from ATLAS

SMVM: Other Ideas
- Cache blocking
- Value compression
- Index compression
- Pattern-based compression
- Special scenario: multiple inputs

Cache Blocking
Idea: divide the sparse matrix into blocks of sparse matrices, as sketched below.
Experiments: requires very large matrices (x and y do not fit into the cache); speed-ups of up to 2.2x, only for a few matrices, with 1 x 1 BCSR.
Figure: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.
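A rough sketch of the idea (the data layout is hypothetical: each column block of A is stored as its own CSR matrix with block-local column indices, so that the corresponding slice of x can stay in cache):

/* y = y + A x with A split column-wise into nb cache blocks */
void smvm_cache_blocked(int m, int nb, const int *col_block_start,
                        double **values, int **col_idx, int **row_start,
                        const double *x, double *y)
{
    for (int b = 0; b < nb; b++) {                  /* one pass per column block */
        const double *xb = x + col_block_start[b];  /* slice of x reused by all rows */
        for (int i = 0; i < m; i++) {
            double d = y[i];
            for (int j = row_start[b][i]; j < row_start[b][i+1]; j++)
                d += values[b][j] * xb[col_idx[b][j]];
            y[i] = d;                               /* note: y is traversed once per block */
        }
    }
}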

Value Compression
Situation: matrix A contains many duplicate values.
Idea: store only the unique values plus index information.
A in CSR:
  values    = [ b c c a b b c ]
  col_idx   = [ 0 1 3 1 2 3 2 ]
  row_start = [ 0 3 4 6 7 ]
A in CSR-VI:
  unique values = [ a b c ]
  values (indices into the unique values) = [ 1 2 2 0 1 1 2 ]
  col_idx   = [ 0 1 3 1 2 3 2 ]
  row_start = [ 0 3 4 6 7 ]
Kourtis, Goumas, and Koziris. Improving the Performance of Multithreaded Sparse Matrix-Vector Multiplication using Index and Value Compression. pp. 511-519, ICPP 2008.
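A minimal sketch of the corresponding SMVM kernel (illustrative names; the unsigned char index assumes at most 256 distinct values):

/* SMVM over CSR-VI: vals_idx[j] selects an entry of the table of unique values */
void smvm_vi(int m, const double *unique_vals, const unsigned char *vals_idx,
             const int *col_idx, const int *row_start, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double d = y[i];
        for (int j = row_start[i]; j < row_start[i+1]; j++)
            d += unique_vals[vals_idx[j]] * x[col_idx[j]];
        y[i] = d;
    }
}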

Index Compression
Situation: matrix A contains sequences of nonzero entries.
Idea: use a special byte code to jointly compress col_idx and row_start. [Figure: coding and decoding of row_start and col_idx to and from the byte code.]
Willcock and Lumsdaine. Accelerating Sparse Matrix Computations via Data Compression. pp. 307-316, ICS 2006.
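To convey the flavor of such a scheme, here is a generic delta encoding of col_idx (this is not the byte code of the paper, just an illustration; names are hypothetical):

#include <stdint.h>
#include <string.h>

/* encode column indices as deltas from the previous nonzero in the same row;
   small deltas take one byte, larger ones an escape byte plus the full index */
size_t encode_col_idx(int m, const int *col_idx, const int *row_start, uint8_t *out)
{
    size_t p = 0;
    for (int i = 0; i < m; i++) {
        int prev = 0;
        for (int j = row_start[i]; j < row_start[i+1]; j++) {
            int delta = col_idx[j] - prev;
            prev = col_idx[j];
            if (delta >= 0 && delta < 255) {
                out[p++] = (uint8_t)delta;
            } else {
                out[p++] = 255;
                memcpy(out + p, &col_idx[j], sizeof(int));
                p += sizeof(int);
            }
        }
    }
    return p;   /* compressed size in bytes; the decoder is inlined into the SMVM loop */
}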

Pattern-Based Compression
Situation: after blocking A, many blocks have the same nonzero pattern.
Idea: use a special BCSR-like format that avoids storing the zeros; needs a specialized micro-MVM kernel for each occurring pattern.
A as matrix: the example from the CSR slide (values b c c a b b c).
Values in 2 x 2 BCSR: [ b c 0 a   0 c 0 0   b b c 0 ]
Values in 2 x 2 PBR:  [ b c a   c   b b c ] + bit strings: 1101 0100 1110
Belgin, Back, and Ribbens. Pattern-based Sparse Matrix Representation for Memory-Efficient SMVM Kernels. pp. 100-109, ICS 2009.
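A sketch of what one generated micro-kernel might look like for the first block above (hypothetical code; a pattern-based scheme would emit one such kernel per occurring pattern):

/* 2 x 2 block with nonzero pattern 1101 (row-major bits), i.e. entries
   (0,0), (0,1), (1,1); v holds only the three nonzeros, in that order */
static void micro_mvm_1101(const double *v, const double *x, double *y)
{
    y[0] += v[0] * x[0] + v[1] * x[1];
    y[1] += v[2] * x[1];
}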

Special Scenario: Multiple Inputs
Situation: compute SMVM y = y + Ax for several independent x (blackboard).
Experiments: up to 9x speedup for 9 vectors.
Source: Eun-Jin Im, Katherine A. Yelick, Richard Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. Int'l Journal of High Performance Computing Applications, 18(1), pp. 135-158, 2004.
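A minimal sketch of the multiple-vector idea (hypothetical layout: the v right-hand sides are interleaved, i.e., X is n x v and Y is m x v, stored row-major, so each loaded entry of A is reused v times):

/* y_k = y_k + A x_k for k = 0, ..., v-1, with A in CSR */
void smvm_multi(int m, int v, const double *values, const int *col_idx,
                const int *row_start, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        double *yi = Y + i * v;
        for (int j = row_start[i]; j < row_start[i+1]; j++) {
            double a = values[j];                   /* loaded once, used v times */
            const double *xj = X + col_idx[j] * v;
            for (int k = 0; k < v; k++)
                yi[k] += a * xj[k];
        }
    }
}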