L11: Sparse Linear Algebra on GPUs (2/25/11). Administrative issues; next assignment: triangular solve (STRSM); a few details.

Administrative Issues
Next assignment, triangular solve
  Due 5PM, Tuesday, March 15
  handin cs6963 lab 3 <probfile>
Project proposals
  Due 5PM, Wednesday, March 7 (hard deadline)
  handin cs6963 prop <pdffile>

Triangular Solve (STRSM)

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      if (B[j*n+k] != 0.0f) {
        for (i = k+1; i < n; i++)
          B[j*n+i] -= A[k*n+i] * B[j*n+k];
      }

Equivalent to:

  cublasStrsm('L' /* left operator */, 'L' /* lower triangular */,
              'N' /* not transposed */, 'U' /* unit triangular */,
              N, N, alpha, d_A, N, d_B, N);

See: http://www.netlib.org/blas/strsm.f

A Few Details
C stores multi-dimensional arrays in row-major order.
Fortran (and MATLAB) stores multi-dimensional arrays in column-major order.
Confusion alert: BLAS libraries were designed for Fortran codes, so column-major order is implicit in CUBLAS!
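To make the layout distinction concrete, here is a minimal standalone sketch (added for illustration, not part of the assignment handout) of where logical element (row, col) of an N x N matrix lands under each convention; the function names are illustrative only:

  /* Row-major vs. column-major addressing of element (row, col). */
  #include <stdio.h>

  /* Row-major (C): elements of a row are contiguous. */
  static int idx_row_major(int row, int col, int n) { return row * n + col; }

  /* Column-major (Fortran, MATLAB, CUBLAS): elements of a column are contiguous. */
  static int idx_col_major(int row, int col, int n) { return col * n + row; }

  int main(void) {
      int n = 4, row = 2, col = 1;
      printf("element (%d,%d): row-major offset %d, column-major offset %d\n",
             row, col, idx_row_major(row, col, n), idx_col_major(row, col, n));
      return 0;  /* prints offsets 9 and 6 for a 4x4 matrix */
  }

Read with column-major indexing, B[j*n+i] in the reference loop above is element (i, j), so the loop walks down column j of B, which lines up with passing d_B directly to cublasStrsm.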

Dependences in STRSM

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      if (B[j*n+k] != 0.0f) {
        for (i = k+1; i < n; i++)
          B[j*n+i] -= A[k*n+i] * B[j*n+k];
      }

Which loop(s) carry dependences?
Which loop(s) is (are) safe to execute in parallel?

Assignment
Details:
  Integrated with simpleCUBLAS test in SDK
  Reference sequential version provided
1. Rewrite in CUDA (a naive mapping is sketched after the outline below)
2. Compare performance with CUBLAS library

Performance Issues?
+ Abundant data reuse
- Difficult edge cases
- Different amounts of work for different <j,k> values
- Complex mapping or load imbalance

Outline
Next assignment
For your projects: Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, Lee et al., ISCA 2010.
Sparse Linear Algebra
Readings:
  Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, Bell and Garland (NVIDIA), SC09, Nov. 2009.
  Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs, Choi, Singh, and Vuduc, PPoPP '10, Jan. 2010.
  Optimizing sparse matrix-vector multiply on emerging multicore platforms, Journal of Parallel Computing, 35(3):178-194, March 2009. (Expanded from SC07 paper.)
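As a starting point for the assignment, here is one possible naive CUDA mapping: a sketch under the assumption that iterations of the j loop touch disjoint columns of B and so carry no dependences, while the k and i loops stay sequential inside each thread. The kernel name and launch configuration are illustrative, and this ignores the performance issues listed above:

  /* Naive STRSM sketch: one thread per column j of B; A is read-only. */
  __global__ void strsm_naive(int n, const float *A, float *B)
  {
      int j = blockIdx.x * blockDim.x + threadIdx.x;   /* one column of B per thread */
      if (j >= n) return;

      for (int k = 0; k < n; k++) {                    /* sequential: k carries a dependence */
          float bjk = B[j * n + k];
          if (bjk != 0.0f) {
              for (int i = k + 1; i < n; i++)
                  B[j * n + i] -= A[k * n + i] * bjk;
          }
      }
  }

  /* Launch example: strsm_naive<<<(n + 127) / 128, 128>>>(n, d_A, d_B); */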

Overview: CPU and GPU Comparisons
Many projects will compare speedup over a sequential CPU implementation
  OK for this class, but not for a research contribution
Is your CPU implementation as smart as your GPU implementation?
  Parallel?
  Manages memory hierarchy?
  Minimizes synchronization or accesses to global memory?

The Comparison
Architectures
  Intel i7, quad-core, 3.2GHz, 2-way hyperthreading, SSE, 32KB L1, 256KB L2, 8MB L3
  Same i7 with NVIDIA GTX 280
Workload
  14 benchmarks, some from the GPU literature

Architectural Comparison

                                  Core i7 960    GTX 280
  Number PEs                      4              30
  Frequency (GHz)                 3.2            1.3
  Number Transistors              0.7B           1.4B
  BW (GB/sec)                     32             141
  SP SIMD width                   4              8
  DP SIMD width                   2              1
  Peak SP Scalar FLOPS (GFLOPS)   25.6           116.6
  Peak SP SIMD FLOPS (GFLOPS)     102.4          311.1/933.1
  Peak DP SIMD FLOPS (GFLOPS)     51.2           77.8

Workload Summary
(14 benchmarks; see Lee et al. for the full list.)
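A quick sanity check on the peak numbers (my reading of how they are computed, not stated on the slides): the i7 960 peaks come from 4 cores x 3.2 GHz x 2 FLOPs/cycle = 25.6 GFLOPS scalar, times the 4-wide SSE unit for 102.4 GFLOPS SP SIMD, and times the 2-wide DP unit for 51.2 GFLOPS DP SIMD. For the GTX 280, 30 SMs x 8 SPs x 1.3 GHz is roughly 311 GFLOPS counting one SP FLOP per cycle, and roughly 933 GFLOPS counting the multiply-add plus the dual-issued multiply as 3 FLOPs per cycle.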

Performance Results
CPU optimization:
  Tile for cache utilization
  SIMD execution on multimedia extensions
  Multi-threaded, beyond number of cores
  Data reorganization to improve SIMD performance

Sparse Linear Algebra
Suppose you are applying matrix-vector multiply and the matrix has lots of zero elements
  Computation cost? Space requirements?
General sparse matrix representation concepts
  Primarily only represent the nonzero data values
  Auxiliary data structures describe placement of nonzeros in the dense matrix

GPU Challenges
  Computation partitioning?
  Memory access patterns?
  Parallel reduction
BUT, the good news is that sparse linear algebra performs TERRIBLY on conventional architectures, so a poor baseline leads to improvements!
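For the computation-cost and space questions above, a dense reference point helps; this is a sketch added for illustration, with illustrative names:

  /* Dense y = A*x touches all n*n entries, so storage and work stay O(n^2)
   * even if most of A is zero.  A sparse format stores and multiplies only
   * the nnz nonzeros, at the cost of auxiliary index structures. */
  void dense_matvec(int n, const float *A, const float *x, float *y)
  {
      for (int i = 0; i < n; i++) {
          float sum = 0.0f;
          for (int j = 0; j < n; j++)
              sum += A[i * n + j] * x[j];   /* multiplies the zeros too */
          y[i] = sum;
      }
  }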

Some common representations

Example matrix:

  A = [ 1 7 0 0
        0 2 8 0
        5 0 3 9
        0 6 0 4 ]

DIA: Store elements along a set of diagonals.

  offsets = [-2 0 1]
  data    = [ * 1 7
              * 2 8
              5 3 9
              6 4 * ]

ELL: Store a set of K elements per row and pad as needed. Best suited when the number of non-zeros is roughly consistent across rows.

  data = [ 1 7 *      indices = [ 0 1 *
           2 8 *                  1 2 *
           5 3 9                  0 2 3
           6 4 * ]                1 3 * ]

Compressed Sparse Row (CSR): Store only nonzero elements, with ptr to the beginning of each row and indices giving the column of each element.

  ptr     = [0 2 4 7 9]
  indices = [0 1 1 2 0 2 3 1 3]
  data    = [1 7 2 8 5 3 9 6 4]

COO: Store nonzero elements and their corresponding coordinates.

  row     = [0 0 1 1 2 2 2 3 3]
  indices = [0 1 1 2 0 2 3 1 3]
  data    = [1 7 2 8 5 3 9 6 4]

CSR Example

  for (j = 0; j < nr; j++) {
    for (k = ptr[j]; k < ptr[j+1]; k++)
      t[j] = t[j] + data[k] * x[indices[k]];
  }

(A CUDA thread-per-row version of this loop is sketched below, after the representation slides.)

Summary of Representation and Implementation

                                           Bytes/Flop
  Kernel   Granularity     Coalescing    32-bit   64-bit
  DIA      thread : row    full          4        8
  ELL      thread : row    full          6        10
  CSR(s)   thread : row    rare          6        10
  CSR(v)   warp : row      partial       6        10
  COO      thread : nonz   full          8        12
  HYB      thread : row    full          6        10

Table 1 from Bell/Garland: Summary of SpMV kernel properties.

Other Representation Examples
Blocked CSR
  Represent non-zeros as a set of blocks, usually of fixed size
  Within each block, treat as dense and pad block with zeros
  Block looks like standard matvec, so performs well for blocks of decent size
Hybrid ELL and COO
  Find a K value that works for most of the matrix
  Use COO for rows with more nonzeros (or even significantly fewer)
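Here is a minimal sketch of the thread-per-row CSR(s) mapping from the summary table, simplified relative to the Bell and Garland kernels; the kernel name and launch configuration are illustrative only:

  /* SpMV, CSR "scalar" mapping: one thread per row. */
  __global__ void spmv_csr_scalar(int num_rows, const int *ptr, const int *indices,
                                  const float *data, const float *x, float *y)
  {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < num_rows) {
          float dot = 0.0f;
          for (int k = ptr[row]; k < ptr[row + 1]; k++)
              dot += data[k] * x[indices[k]];   /* gathered loads: rarely coalesced */
          y[row] = dot;
      }
  }

  /* Launch example: spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(...); */

Because consecutive threads walk different rows, their loads of data and indices are rarely coalesced, which is the "rare" coalescing entry for CSR(s) in the table; the CSR(v) variant assigns a warp per row to recover coalescing at the cost of an intra-warp reduction.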

Stencil Example
What is a 3-point stencil? 5-point stencil? 7-point? 9-point? 27-point?
Examples:

  a[i] = (b[i-1] + b[i+1]) / 2;
  a[i][j] = (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]) / 4;

Stencil Result (structured matrices)
See Figures 11 and 12, Bell and Garland
How is this represented by a sparse matrix? (A small worked example appears after the next two slides.)

Unstructured Matrices
See Figures 13 and 14
Note that graphs can also be represented as sparse matrices. What is an adjacency matrix?

PPoPP paper
What if you customize the representation to the problem?
  Additional global data structure modifications (like blocked representation)?
Strategy: Apply models and autotuning to identify the best solution for each application
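To make the stencil-to-matrix connection concrete (an illustration added here, not a figure from the papers): for the 1D 3-point stencil a[i] = (b[i-1] + b[i+1]) / 2 on five points, taking out-of-range neighbors as zero, the update is a = M b with

  M = [  0  1/2   0    0    0
        1/2  0   1/2   0    0
         0  1/2   0   1/2   0
         0   0   1/2   0   1/2
         0   0    0   1/2   0  ]

Every row draws its nonzeros from the same two diagonals, so DIA with offsets = [-1 1] stores M with essentially no padding, and ELL with K = 2 works equally well; this is the intuition behind structured (stencil) matrices favoring DIA and ELL in Bell and Garland's results.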

Summary of Results
BELLPACK (blocked ELLPACK) achieves up to 29 Gflop/s in SP and 15.7 Gflop/s in DP
  Up to 1.8x and 1.5x improvement over Bell and Garland

This Lecture
Exposure to the issues in a sparse matrix-vector computation on GPUs
A set of implementations and their expected performance
A little on how to improve performance through application-specific knowledge and customization of the sparse matrix representation

What's Coming
Next time: Application case study from Kirk and Hwu (Ch. 8, real-time MRI)
Wednesday, March 2: two guest speakers from last year's class, BOTH of whom use sparse matrix representations!
  Shreyas Ramalingam: program analysis on GPUs
  Pascal Grosset: graph coloring on GPUs