Sparse Matrix Formats

1 Christopher Bross Friedrich-Alexander-Universität Erlangen-Nürnberg

2 Motivation Sparse Matrices are everywhere Sparse Matrix Formats C. Bross BGCE Research Day, Erlangen, /16

3 Motivation

4 Sparse Matrix Vector Multiplication
Sparse Matrix-Vector Multiplication: y = Ax
  A is a sparse matrix
  Store (and process) only the nonzero values of A
  Indirect access to the vector x
Figure: sparsity pattern of a matrix

5 Sparse Matrix Vector Multiplication

    for (int row = 0; row < MaxRows; ++row)
        for (int col = 0; col < NonZerosInRow[row]; ++col)
            y[row] += A[row][col] * x[columid[row][col]];

Algorithm Analysis
For each nonzero matrix entry:
  2 Flops
  2 double precision loads; 1 int index load
  1 double precision store (+ load) per row
Memory intensive
  Code balance 10 Bytes/Flop
  Memory layout is important
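The code balance of 10 Bytes/Flop can be retraced from the per-nonzero counts above. The following is only a sketch of that arithmetic, assuming 8-byte doubles, 4-byte int indices, no cache reuse of x, and neglecting the per-row traffic for y:

```latex
B_c = \frac{
        \overbrace{8\,\mathrm{B}}^{\text{matrix value}}
      + \overbrace{8\,\mathrm{B}}^{\text{entry of } x}
      + \overbrace{4\,\mathrm{B}}^{\text{column index}}
      }{2\,\mathrm{Flops}}
    = 10\,\frac{\mathrm{Bytes}}{\mathrm{Flop}}
```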

6 Standard Sparse Matrix Storage Formats

7 CSR - spmv Kernel

    for (int rowid = 0; rowid < NumberOfRows; ++rowid)
    {
        double tmp = 0.;

        // rowptr[rowid] .. rowptr[rowid + 1] delimit the nonzeros of this row
        for (int id = rowptr[rowid];
             id < rowptr[rowid + 1]; ++id)
        {
            tmp += val[id] * x[colind[id]];
        }

        y[rowid] = tmp;
    }
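To make the CSR kernel concrete, here is a minimal self-contained sketch in C++. The names rowptr, colind and val follow the slide; the struct CsrMatrix, the function spmv_csr and the 3x3 matrix in example_matrix are assumptions made only for illustration.

```cpp
#include <cassert>
#include <vector>

// Minimal CSR container; field names follow the kernel on the slide.
struct CsrMatrix {
    std::vector<int> rowptr;   // size rows + 1; rowptr[r]..rowptr[r+1] delimit row r
    std::vector<int> colind;   // column index of each stored nonzero
    std::vector<double> val;   // value of each stored nonzero
};

// Computes y = A * x for a CSR matrix.
std::vector<double> spmv_csr(const CsrMatrix& A, const std::vector<double>& x) {
    assert(!A.rowptr.empty());
    const int rows = static_cast<int>(A.rowptr.size()) - 1;
    std::vector<double> y(rows, 0.0);
    for (int rowid = 0; rowid < rows; ++rowid) {
        double tmp = 0.0;
        for (int id = A.rowptr[rowid]; id < A.rowptr[rowid + 1]; ++id) {
            tmp += A.val[id] * x[A.colind[id]];  // indirect access to x
        }
        y[rowid] = tmp;
    }
    return y;
}

// Hypothetical example matrix:
//   [ 1 0 2 ]
//   [ 0 3 0 ]
//   [ 4 0 5 ]
CsrMatrix example_matrix() {
    return CsrMatrix{{0, 2, 3, 5}, {0, 2, 1, 0, 2}, {1.0, 2.0, 3.0, 4.0, 5.0}};
}
```

Only the five nonzeros are stored and touched; the zeros of the 3x3 matrix never appear in memory.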

8 CSR - spmv Kernel - Vectorization

    for (int rowid = 0; rowid < NumberOfRows; ++rowid) {
        // all four partial sums must start at zero
        double tmp0 = 0., tmp1 = 0., tmp2 = 0., tmp3 = 0.;
        for (int id = rowptr[rowid];
             id + 4 <= rowptr[rowid + 1]; id += 4)
        {
            tmp0 += val[id + 0] * x[colind[id + 0]];
            tmp1 += val[id + 1] * x[colind[id + 1]];
            tmp2 += val[id + 2] * x[colind[id + 2]];
            tmp3 += val[id + 3] * x[colind[id + 3]];
        }
        y[rowid] = tmp0 + tmp1 + tmp2 + tmp3;
        // remainder loop: add the row's leftover entries to y[rowid]
    }
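The slide leaves the remainder loop open. One possible way to write it out is sketched below; the function name spmv_csr_unrolled and its signature are assumptions, and the manual 4-way unrolling only mimics what a vectorizing compiler would generate.

```cpp
#include <cassert>
#include <vector>

// 4-way unrolled CSR SpMV with an explicit remainder loop per row.
void spmv_csr_unrolled(const std::vector<int>& rowptr,
                       const std::vector<int>& colind,
                       const std::vector<double>& val,
                       const std::vector<double>& x,
                       std::vector<double>& y) {
    assert(rowptr.size() == y.size() + 1);
    const int rows = static_cast<int>(y.size());
    for (int rowid = 0; rowid < rows; ++rowid) {
        double tmp0 = 0.0, tmp1 = 0.0, tmp2 = 0.0, tmp3 = 0.0;
        const int end = rowptr[rowid + 1];
        int id = rowptr[rowid];
        // Main loop: processes full blocks of four nonzeros.
        for (; id + 4 <= end; id += 4) {
            tmp0 += val[id + 0] * x[colind[id + 0]];
            tmp1 += val[id + 1] * x[colind[id + 1]];
            tmp2 += val[id + 2] * x[colind[id + 2]];
            tmp3 += val[id + 3] * x[colind[id + 3]];
        }
        double tmp = tmp0 + tmp1 + tmp2 + tmp3;
        // Remainder loop: at most three leftover nonzeros of this row.
        for (; id < end; ++id) {
            tmp += val[id] * x[colind[id]];
        }
        y[rowid] = tmp;
    }
}
```

For a row with only a handful of nonzeros, the main loop runs once or not at all, so most of the work ends up in the scalar remainder loop.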

9 CSR - spmv Kernel - Vectorization
The 4-way unrolled CSR kernel from the previous slide is NOT optimal for large vector lengths!

10 Standard Sparse Matrix Storage Formats

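11 Sell-C-Sigma Format
  Sort matrix rows by row length (within windows of sigma rows)
  Divide into chunks of C rows
  Fill chunks with zeros (pad every row to the longest row of its chunk)
  Save data in column major order
Chunk occupancy
  Fraction of useful data entries in Sell-C-Sigma:
  \beta = \frac{\mathrm{NonZeros}}{\sum_{i=0}^{\mathrm{Chunks}-1} C \cdot l_i}
  where l_i is the length of chunk i, i.e. the number of entries in its longest row.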
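The four construction steps and the chunk occupancy can be sketched as follows. This is an illustrative conversion from CSR, not the author's implementation: the names SellMatrix, build_sell, chunk_occupancy and perm are assumptions, and rows are padded with empty rows up to a multiple of C.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct SellMatrix {
    int C = 0;                     // chunk height
    std::vector<int> perm;         // perm[k] = original row stored at position k
    std::vector<int> chunkptr;     // start offset of each chunk in val/colind
    std::vector<int> chunklength;  // l_i: padded row length of chunk i
    std::vector<int> colind;
    std::vector<double> val;
};

SellMatrix build_sell(const std::vector<int>& rowptr, const std::vector<int>& colind,
                      const std::vector<double>& val, int C, int sigma) {
    assert(C > 0 && sigma > 0 && !rowptr.empty());
    const int rows = static_cast<int>(rowptr.size()) - 1;
    const int padded = ((rows + C - 1) / C) * C;  // extra rows are empty
    auto len = [&](int r) { return r < rows ? rowptr[r + 1] - rowptr[r] : 0; };

    SellMatrix S;
    S.C = C;
    S.perm.resize(padded);
    std::iota(S.perm.begin(), S.perm.end(), 0);
    // Step 1: sort by descending row length inside each window of sigma rows.
    for (int w = 0; w < padded; w += sigma) {
        std::stable_sort(S.perm.begin() + w,
                         S.perm.begin() + std::min(w + sigma, padded),
                         [&](int a, int b) { return len(a) > len(b); });
    }
    // Steps 2-4: cut into chunks of C rows, pad with zeros, store column major.
    for (int c = 0; c < padded / C; ++c) {
        int width = 0;
        for (int i = 0; i < C; ++i) width = std::max(width, len(S.perm[c * C + i]));
        const int offset = static_cast<int>(S.val.size());
        S.chunkptr.push_back(offset);
        S.chunklength.push_back(width);
        S.val.resize(offset + width * C, 0.0);   // zero fill-in
        S.colind.resize(offset + width * C, 0);  // padding points at column 0
        for (int i = 0; i < C; ++i) {
            const int r = S.perm[c * C + i];
            for (int j = 0; j < len(r); ++j) {
                S.val[offset + j * C + i] = val[rowptr[r] + j];
                S.colind[offset + j * C + i] = colind[rowptr[r] + j];
            }
        }
    }
    return S;
}

// beta = NonZeros / sum_i (C * l_i)
double chunk_occupancy(const SellMatrix& S, int nonzeros) {
    long stored = 0;
    for (int l : S.chunklength) stored += static_cast<long>(S.C) * l;
    return stored == 0 ? 1.0 : static_cast<double>(nonzeros) / stored;
}
```

For four rows of lengths 1, 3, 1, 3 and C = 2, sigma = 1 leaves the order untouched and pads both chunks to length 3 (beta = 8/12), while sigma = 4 groups the two long rows into one chunk and removes all padding (beta = 1).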

13 Sell-C-sigma - spmv Kernel

    for (int chunk = 0; chunk < rows / C; ++chunk) {
        int chunkoffset = chunkptr[chunk];
        double tmp[C] {};

        for (int j = 0; j < chunklength[chunk]; ++j) {
            for (int i = 0; i < C; ++i) {  // for vectorization
                tmp[i] += val[chunkoffset + j * C + i]
                        * x[colind[chunkoffset + j * C + i]];
            }
        }
        // write back results
    }
    // remainder loop or extra padding
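A runnable version of this kernel could look like the sketch below. It assumes the "extra padding" variant (the number of stored rows is a multiple of C) and writes the result in stored row order, so a sorted matrix yields a permuted y; the function name spmv_sell, its signature and the small chunk height C = 2 are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

constexpr int C = 2;  // chunk height; hypothetical small value for illustration

// y = A * x for a matrix in Sell-C-sigma layout (column major inside each chunk).
// y is indexed by stored row position, i.e. in sorted order if rows were sorted.
void spmv_sell(const std::vector<int>& chunkptr,
               const std::vector<int>& chunklength,
               const std::vector<int>& colind,
               const std::vector<double>& val,
               const std::vector<double>& x,
               std::vector<double>& y) {
    assert(chunkptr.size() == chunklength.size());
    assert(y.size() == chunkptr.size() * C);
    const int chunks = static_cast<int>(chunkptr.size());
    for (int chunk = 0; chunk < chunks; ++chunk) {
        const int chunkoffset = chunkptr[chunk];
        double tmp[C] = {};
        for (int j = 0; j < chunklength[chunk]; ++j) {
            // Inner loop over the C rows of the chunk: unit stride in val/colind.
            for (int i = 0; i < C; ++i) {
                tmp[i] += val[chunkoffset + j * C + i]
                        * x[colind[chunkoffset + j * C + i]];
            }
        }
        // Write back the results of this chunk.
        for (int i = 0; i < C; ++i) {
            y[chunk * C + i] = tmp[i];
        }
    }
}
```

The inner loop always has exactly C iterations regardless of the row lengths, which is what makes it a better vectorization target than the short, variable-length row loop of CSR.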

14 Sell-C-sigma - spmv Kernel - Compiler Issues

    for (int chunk = 0; chunk < rows / C; ++chunk) {
        int chunkoffset = chunkptr[chunk];
        double tmp[C] {};

        for (int j = 0; j < chunklength[chunk]; ++j) {
    #pragma simd
            for (int i = 0; i < C; ++i) {  // for vectorization
                tmp[i] += val[chunkoffset + j * C + i]
                        * x[colind[chunkoffset + j * C + i]];
            }
        }
    }

15 Performance Analysis

16 Test matrix
  number of rows/columns     2,063,494
  nonzeros                   12,771,361
  density                    3.43e-06
  nonzeros per row           7.08
  β (C=4, sigma=1)           73.2%
  β (C=4, sigma=16)          87.6%
  β (C=4, sigma=512)         99.5%
Performance
  CSR     1.3 Gflop/s
  Sell    Gflop/s
  Sell    Gflop/s
  Sell    Gflop/s
Intel Haswell i5-4300U CPU, dual core, 1.90 GHz

17 Problems of common formats
  CSR -> short loops, low vectorization ratio
  ELLPACK -> fill-in
Sell-C-Sigma
  combination of CSR and ELLPACK
  good performance on different architectures
Think about your data structure: it might increase your performance.


More information

GPUBenchmark results for tesla2

GPUBenchmark results for tesla2 Benchmark results for tesla2 May 4, 202 Abstract This report shows the Benchmark results obtained on tesla2 on May 4, 202. Contents Introduction 2 Hardware description 3 Transfer speed between hard disk

More information

c 2014 Society for Industrial and Applied Mathematics

c 2014 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 36, No. 2, pp. C29 C239 c 204 Society for Industrial and Applied Mathematics COMPRESSED MULTIROW STORAGE FORMAT FOR SPARSE MATRICES ON GRAPHICS PROCESSING UNITS ZBIGNIEW KOZA,

More information

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Moysey Brio & Paul Dostert July 4, 2009 1 / 18 Sparse Matrices In many areas of applied mathematics and modeling, one

More information

How to declare an array in C?

How to declare an array in C? Introduction An array is a collection of data that holds fixed number of values of same type. It is also known as a set. An array is a data type. Representation of a large number of homogeneous values.

More information

poski: Parallel Optimized Sparse Kernel Interface Library User s Guide for Version 1.0.0

poski: Parallel Optimized Sparse Kernel Interface Library User s Guide for Version 1.0.0 poski: Parallel Optimized Sparse Kernel Interface Library User s Guide for Version 1.0.0 Jong-Ho Byun James W. Demmel Richard Lin Katherine A. Yelick Berkeley Benchmarking and Optimization (BeBOP) Group

More information

Automatic Tuning of Sparse Matrix Kernels

Automatic Tuning of Sparse Matrix Kernels Automatic Tuning of Sparse Matrix Kernels Kathy Yelick U.C. Berkeley and Lawrence Berkeley National Laboratory Richard Vuduc, Lawrence Livermore National Laboratory James Demmel, U.C. Berkeley Berkeley

More information

Sparse Matrices. This means that for increasing problem size the matrices become sparse and sparser. O. Rheinbach, TU Bergakademie Freiberg

Sparse Matrices. This means that for increasing problem size the matrices become sparse and sparser. O. Rheinbach, TU Bergakademie Freiberg Sparse Matrices Many matrices in computing only contain a very small percentage of nonzeros. Such matrices are called sparse ( dünn besetzt ). Often, an upper bound on the number of nonzeros in a row can

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

PARDISO Version Reference Sheet Fortran

PARDISO Version Reference Sheet Fortran PARDISO Version 5.0.0 1 Reference Sheet Fortran CALL PARDISO(PT, MAXFCT, MNUM, MTYPE, PHASE, N, A, IA, JA, 1 PERM, NRHS, IPARM, MSGLVL, B, X, ERROR, DPARM) 1 Please note that this version differs significantly

More information

Tomonori Kouya Shizuoka Institute of Science and Technology Toyosawa, Fukuroi, Shizuoka Japan. October 5, 2018

Tomonori Kouya Shizuoka Institute of Science and Technology Toyosawa, Fukuroi, Shizuoka Japan. October 5, 2018 arxiv:1411.2377v1 [math.na] 10 Nov 2014 A Highly Efficient Implementation of Multiple Precision Sparse Matrix-Vector Multiplication and Its Application to Product-type Krylov Subspace Methods Tomonori

More information

Introduction to High Performance Computing and Optimization

Introduction to High Performance Computing and Optimization Institut für Numerische Mathematik und Optimierung Introduction to High Performance Computing and Optimization Oliver Ernst Audience: 1./3. CMS, 5./7./9. Mm, doctoral students Wintersemester 2012/13 Contents

More information

Petalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg

Petalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg Petalisp A Common Lisp Library for Data Parallel Programming Marco Heisig 16.04.2018 Chair for System Simulation FAU Erlangen-Nürnberg Petalisp The library Petalisp 1 is a new approach to data parallel

More information

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430. Parallel Systems

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430. Parallel Systems THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430 Parallel Systems Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable Calculator This

More information

Auto-tuning Multigrid with PetaBricks

Auto-tuning Multigrid with PetaBricks Auto-tuning with PetaBricks Cy Chan Joint Work with: Jason Ansel Yee Lok Wong Saman Amarasinghe Alan Edelman Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

More information

Data Structures for sparse matrices

Data Structures for sparse matrices Data Structures for sparse matrices The use of a proper data structures is critical to achieving good performance. Generate a symmetric sparse matrix A in matlab and time the operations of accessing (only)

More information

Sparse Matrices. sparse many elements are zero dense few elements are zero

Sparse Matrices. sparse many elements are zero dense few elements are zero Sparse Matrices sparse many elements are zero dense few elements are zero Special Matrices A square matrix has the same number of rows and columns. Some special forms of square matrices are Diagonal: M(i,j)

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs

A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs Procedia Computer Science Volume 80, 2016, Pages 178 189 ICCS 2016. The International Conference on Computational Science A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs Ping

More information