Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka, Tokyo Institute of Technology


Sparse Matrix
- Generated by FEM, or arising as graph data
- Often requires solving sparse linear equations fast
- Iterative methods: CG method, BiCG method
  - Level-1 BLAS (dot product + AXPY): sequential memory access
  - Sparse matrix vector multiplication (SpMV): uses a sparse matrix format, random memory access
- Performance depends on the cache hit rate

SpMV computation on GPU
- High memory bandwidth and parallelism enable high performance
  - Latency is hidden with SMT
- Available cache per thread is small, and controlling the cache is difficult
  => Lower cache hit rate compared to CPU

Cache size comparison:
                    Intel Xeon E5-2620 v2              NVIDIA Tesla K20X
  Max threads       12 threads                         28672 threads
  Caches            L1 : 192KB (instruction / data)    Read-only cache : 12KB * 4 / SMX
                    L2 : 1.5MB, L3 : 15MB              L2 : 1.5MB

Contribution
- We propose a family of cache-aware formats for GPU
  - Segmentation along the column
  - Segmented formats, Non-Uniformly Segmented formats
  - 2 ways of SpMV computation
- Achieve speedups of up to x2.1 for real datasets and x3.2 for synthetic matrices in SpMV, and x1.15 in CG

Sparse Format
- Compresses out the needless zero elements to reduce memory usage
  - E.g.) COO, CSR
- Efficient memory access to matrix data depends on the architecture
  - Vector machine, GPU : column major formats such as JDS, ELLPACK, SELL-C-σ
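
For concreteness, here is an illustrative example of mine (not from the slides): the CSR arrays for a small matrix, with a plain host-side SpMV loop showing the random access into the input vector.

    // Minimal CSR sketch (illustrative).
    // Matrix (4 x 4):          CSR arrays:
    //   [ 1 0 0 2 ]            row_ptr = {0, 2, 3, 5, 6}
    //   [ 0 3 0 0 ]            col_idx = {0, 3, 1, 1, 2, 3}
    //   [ 0 4 5 0 ]            val     = {1, 2, 3, 4, 5, 6}
    //   [ 0 0 0 6 ]
    void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                  const float *val, const float *x, float *y) {
        for (int i = 0; i < n_rows; ++i) {
            float sum = 0.0f;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * x[col_idx[k]];   // random access into x
            y[i] = sum;
        }
    }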

(Existing Sparse Format) JDS
- Reorder the rows by the number of non-zero elements per row
- Generate a column major format
- Favorable for vector machines and many-core architectures
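
For illustration (my example, not the author's), the JDS arrays for the same 4 x 4 matrix used above; this is the layout the kernel on the next slide walks through.

    // Rows sorted by non-zero count (descending): perm = {0, 2, 1, 3}
    //   sorted row 0 = row 0 : (col 0, 1) (col 3, 2)
    //   sorted row 1 = row 2 : (col 1, 4) (col 2, 5)
    //   sorted row 2 = row 1 : (col 1, 3)
    //   sorted row 3 = row 3 : (col 3, 6)
    // Non-zeros are stored jagged-diagonal by jagged-diagonal (column major):
    const int   jds_ptr[] = {0, 4, 6};           // start of each jagged diagonal
    const int   jds_col[] = {0, 1, 1, 3, 3, 2};  // column index of each stored element
    const float jds_val[] = {1, 4, 3, 6, 2, 5};
    const int   perm[]    = {0, 2, 1, 3};        // sorted position -> original row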

(Existing Sparse Format) SpMV kernel of JDS format

    __constant__ int jds_ptr[];   // offsets of each jagged diagonal (size nnz_max + 1), set from the host

    __global__ void KernelJDS(float *out, const float* __restrict__ vector,
                              int *jds_col, float *jds_val,
                              int M, int nnz_max)
    {
        // Calculate i-th row
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= M) return;
        int j = 0;
        float answer = 0;
        int index = i + jds_ptr[0];
        while (index < jds_ptr[j + 1] && j < nnz_max) {
            answer += jds_val[index] * vector[jds_col[index]];
            index = i + jds_ptr[++j];
        }
        out[i] = answer;
    }

- Use the constant cache for jds_ptr when nnz_max * sizeof(float) < 16KB
- Use the read-only cache for the input vector
- Use inline PTX assembly so that matrix data and the output do not pollute the cache:
  - jds_val[index] => ld.global.cv.f32
  - jds_col[index] => ld.global.cv.s32
  - out[i]         => st.global.cs.f32
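
The slide only names the PTX cache modifiers; as a hedged sketch (my code, not the author's), such loads and stores can be wrapped with inline PTX like this:

    // Illustrative wrappers (not from the slides). .cv tells the hardware not
    // to reuse cached data for this load (fetch again); .cs marks a streaming
    // store whose data is unlikely to be reused, so it is evicted first.
    __device__ __forceinline__ float load_val_cv(const float *p) {
        float v;
        asm volatile("ld.global.cv.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

    __device__ __forceinline__ int load_col_cv(const int *p) {
        int v;
        asm volatile("ld.global.cv.s32 %0, [%1];" : "=r"(v) : "l"(p));
        return v;
    }

    __device__ __forceinline__ void store_out_cs(float *p, float v) {
        asm volatile("st.global.cs.f32 [%0], %1;" : : "l"(p), "f"(v) : "memory");
    }

Inside the kernel, jds_val[index] would then be read as load_val_cv(&jds_val[index]) and the result written with store_out_cs(&out[i], answer).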

(Existing Sparse Format) SELL-C-σ [Kreutzer, 2013]
- Convert each row block to ELLPACK (Sliced ELLPACK) to reduce the zero filling
  - C is the block size; C = WARP size
- Sort each σ rows
  - Tradeoff between the zero fill and the cost of sorting
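
A small worked layout of mine (not from the slides), with C = 2 and σ covering the whole matrix, for the same 4 x 4 example:

    // Rows sorted within sigma = 4: order {0, 2, 1, 3} (row lengths 2, 2, 1, 1).
    // Slice 0 holds sorted rows 0-1 and is 2 elements wide; slice 1 holds
    // sorted rows 2-3 and is 1 element wide; each slice is stored column major.
    const int   slice_ptr[] = {0, 4, 6};            // start of each slice in val/col
    const int   sell_col[]  = {0, 1, 3, 2, 1, 3};
    const float sell_val[]  = {1, 4, 2, 5, 3, 6};
    // This example needs no padding; a slice whose rows differ in length is
    // padded with explicit zeros up to the slice's widest row.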

Cache Hit Rates of Existing Sparse Formats
- NVIDIA Tesla K20X
- Dataset : University of Florida Sparse Matrix Collection
- JDS format
  - Input vector is assigned to the read-only cache
  - Coalesced access to matrix data

  Matrix        Size      L2 Cache Hit Rate [%]   Read-only Cache Hit Rate [%]
  Audikw_1      943,695   82.864                  51.42
  Crankseg_2    63,838    98.338                  66.54
  mouse_gene    45,101    99.912                  8.298

PROPOSED FORMATS

Column size and cache hit rate
- SpMV execution for a random matrix
  - Number of rows : 124^3
  - Number of columns : 2^x (4 <= x <= 24)
  - Non-zero elements per row : 16
  - Single precision, using the JDS format
- The column size where the cache hit rate drops corresponds to each cache size
  - Read-only cache : 12KB
  - L2 cache : 1.5MB
- Segmenting the matrix and the input vector enables a high cache hit rate

Segmented Formats
- Column-wise segmentation
- Each segment is converted to JDS or SELL-C-σ
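
A minimal host-side sketch of mine (assuming fixed-width segments; the actual conversion is the author's) of how the non-zeros are split into column-wise segments before each segment is converted:

    // Split a CSR matrix's non-zeros into column segments of width seg_width.
    // Names and the Segment struct are hypothetical, for illustration only.
    #include <vector>

    struct Segment {
        std::vector<int>   row;   // row index of each non-zero in this segment
        std::vector<int>   col;   // column index relative to the segment start
        std::vector<float> val;
    };

    std::vector<Segment> segment_by_column(int n_rows, int n_cols, int seg_width,
                                           const int *row_ptr, const int *col_idx,
                                           const float *val) {
        int n_seg = (n_cols + seg_width - 1) / seg_width;
        std::vector<Segment> segs(n_seg);
        for (int i = 0; i < n_rows; ++i) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                int s = col_idx[k] / seg_width;          // segment this column falls into
                segs[s].row.push_back(i);
                segs[s].col.push_back(col_idx[k] % seg_width);
                segs[s].val.push_back(val[k]);
            }
        }
        return segs;   // each segment is then converted to JDS or SELL-C-sigma
    }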

Segmented formats: SpMV Execution
Two ways of SpMV computation
- 2-phase computation : reduces random writes
  - 1st phase : compute SpMV for each sub-matrix and sub-vector, and store the result into memory
  - 2nd phase : accumulate the intermediate vectors
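
A minimal CUDA sketch of mine (not the author's kernel) of the 2nd phase: each thread sums one row's partial results across all segments, assuming the intermediate vectors are stored contiguously.

    // Accumulate per-segment partial results laid out as n_seg vectors of
    // length n_rows (illustrative layout assumption).
    __global__ void accumulate(float *out, const float *partial,
                               int n_rows, int n_seg) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_rows) return;
        float sum = 0.0f;
        for (int s = 0; s < n_seg; ++s)
            sum += partial[(size_t)s * n_rows + i];   // coalesced across threads
        out[i] = sum;
    }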

Segmented formats: SpMV Execution
Two ways of SpMV computation
- 1-phase computation using atomic operations
  - Prepare additional threads to initialize the output vector
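
A hedged sketch of the 1-phase idea (my code, written per non-zero for brevity rather than in the paper's JDS/SELL layout): each segment adds its contribution directly into the output vector with atomics, so the output must be zeroed first, which is what the extra initialization threads on the slide do.

    // 1-phase scheme (illustrative): out must be zero-initialized beforehand.
    __global__ void spmv_segment_atomic(float *out, const float *seg_val,
                                        const int *seg_row, const int *seg_col,
                                        const float *x_seg, int nnz_seg) {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= nnz_seg) return;
        atomicAdd(&out[seg_row[k]], seg_val[k] * x_seg[seg_col[k]]);
    }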

Segmented Formats: disadvantages
- Increased memory access cost
  - Additional memory accesses (2-phase SpMV computation)
  - Atomic operations are expensive
- Segments with few non-zero elements may be generated
  - Improvement of reusability < overhead of segmenting => low efficiency

Non-Uniformly Segmented Formats (NUS Formats)
- Mix multiple segmentation widths
  - Large segmentation width for the low density area => reduces the number of segments
- Sort by the number of non-zero elements per column
  - Put the high density columns on the left side and the highly reusable vector elements at the top
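
An illustrative host-side step of mine (not the author's code) for the sorting the slide describes: count the non-zeros per column and build a permutation ordered by descending count, so dense, highly reused columns come first.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    std::vector<int> column_permutation(int n_cols, int nnz, const int *col_idx) {
        std::vector<int> count(n_cols, 0);
        for (int k = 0; k < nnz; ++k) count[col_idx[k]]++;   // non-zeros per column
        std::vector<int> perm(n_cols);
        std::iota(perm.begin(), perm.end(), 0);
        std::stable_sort(perm.begin(), perm.end(),
                         [&](int a, int b) { return count[a] > count[b]; });
        return perm;   // perm[p] = original column placed at position p
    }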

Converting NUS Format
[Figure: worked example of converting a small matrix to the NUS format; matrix index : column index, vector index : original row index]
- Count the # of non-zero elements per column
- Sort the columns by that count
- Reorder / permute the columns and vector elements
- Update the column indices
- Convert to Segmented CSR
- Convert each submatrix to JDS

Auto Parameter tuning mechanism for Conjugate Gradient method
- Difficulty of setting parameters: NUS formats have a 2D parameter space
  - Number of segments (seg_num)
  - Size of segment (seg_size)
- Detect the best parameter during the iterative method
  - Time of converting the matrix to NUS format <<< time until convergence

Auto Parameter tuning mechanism for Conjugate Gradient method
- Parallelized by OMP sections
  - CPU : converting the matrix
  - GPU : executing the iterations
- Parameters: given seg_size, changing the # of segments
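
A hedged sketch of mine of the overlap the slide describes (the two callables stand in for the author's conversion and solver routines): OpenMP sections let the CPU build the next candidate format while the GPU keeps iterating with the current one.

    #include <omp.h>
    #include <functional>

    void overlap_tuning(const std::function<void()> &convert_next_format,  // CPU work
                        const std::function<void()> &run_cg_iterations)    // GPU work
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            { convert_next_format(); }   // e.g. build the NUS format for the next seg_num
            #pragma omp section
            { run_cg_iterations(); }     // keep iterating with the current format meanwhile
        }
    }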

PERFORMANCE EVALUATION

Experiment Environment
- TSUBAME-KFC
  - CPU : Intel Xeon E5-2620 v2 2.1GHz x 2
  - GPU : NVIDIA Tesla K20X x 4
    - Single precision peak performance : 3.95 [TFLOPS]
    - Bandwidth : 250 [GB / sec]
    - Memory size : 6 [GB]
    - L2 cache : 1.5 [MB]
    - Read-only cache : 12 * 4 [KB / SMX]
  - CUDA 5.5
- cuSPARSE (provided by NVIDIA)
  - CSR format
  - HYBRID format

Performance Evaluation: SpMV (Florida datasets)
- Our formats show speedups of x0.86 ~ x2.13 and stable performance
- (Figure legend) Blue : existing formats, Red : proposal (2 phases ver.), Green : proposal (atomic ver.)

Performance Evaluation: Cache Hit Rate of SpMV
- Segment size suits the read-only cache
- Improvement of the cache hit rate over the non-segmented formats

Performance Evaluation: SpMV (Randomly generated matrices)
- Investigating larger matrices
  - Number of rows : 1.0M, 1.5M, 2.0M, 2.5M, 3.0M
  - Non-zero density : 0.1%, 0.2%, 0.5%
- Speedup of up to x3.2, and our formats are stable across matrix properties

Performance Evaluation: Conjugate Gradient method
- CG computation for positive definite matrices
- Similar speedup to SpMV; up to x1.15 (speedup of SpMV is x1.22)

Performance Evaluation: Auto Parameter Tuning CG method
- Speedup is x1.9 (crankseg_2, nd24k)

Performance Evaluation: Multi-node CG method
- Strong scaling: one GPU per node
- Communication between nodes by MPI
  - Send / receive the vector, and MPI_Reduce each residual at each iteration
- A row block is assigned to each node
  - Each row block has fewer non-zero elements => causes performance degradation
- Larger random matrices are generated; the row size is 8M
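
The slide names sending/receiving the vector and reducing each residual; a hedged sketch of mine of that per-iteration exchange, using MPI_Allgatherv and MPI_Allreduce as one common realization (not necessarily the author's exact calls):

    // Per-iteration communication (illustrative): every rank owns one row block
    // of the matrix and the matching slice of the vector.
    #include <mpi.h>

    void exchange(const float *x_local, int local_n, float *x_full,
                  const int *counts, const int *displs,
                  const float *residual_local, float *residual_global,
                  MPI_Comm comm) {
        // Every rank needs the full input vector for its SpMV.
        MPI_Allgatherv(x_local, local_n, MPI_FLOAT,
                       x_full, counts, displs, MPI_FLOAT, comm);
        // Combine the partial (squared) residual norms for the convergence test.
        MPI_Allreduce(residual_local, residual_global, 1, MPI_FLOAT, MPI_SUM, comm);
    }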

Performance Evaluation: Multi-node CG method
- NUS-SELL-C-σ shows superiority over CSR and SELL-C-σ
  - Speedup of up to x1.68
- For lower density matrices, data transfer time between nodes takes relatively longer
  - The performance difference between formats is not noticeable

Features of matrices
The family of Segmented formats works well for matrices where
- Input vector access is more random
  - Segmented formats improve the cache hit rate
- The matrix has many non-zero elements
  - Achieves high cache reusability
- The matrix has a large variance in the number of non-zero elements per row
  - Reduces idle threads compared to JDS or SELL-C-σ

Conclusion
- S-Formats and NUS-Formats improve cache locality and SpMV performance
- NUS formats achieved speedups of
  - up to x2.1 for real datasets and x3.2 for synthetic matrices in SpMV
  - x1.15 for real datasets in CG

E-mail : nagasaka.y.aa@m.titech.ac.jp