Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures


Slide 1: Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures. Ozan Karsavuran (1), Kadir Akbudak (2), Cevdet Aykanat (1) (speaker); sites.google.com/site/kadircs, kadir.cs@gmail.com. (1) Bilkent University, Turkey; (2) KAUST, KSA. SIAM Workshop on Combinatorial Scientific Computing (CSC16), Albuquerque, NM, USA, October 10-12, 2016. Based on: O. Karsavuran, K. Akbudak, and C. Aykanat, "Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27, no. 6, 2016, available at ieeexplore.ieee.org.

Slide 2: Outline. 1. Introduction: y = AA^T x. 2. Open problems & related work. 3. Parallel SpAA^Tx based on 1D partitioning of the A and A^T matrices: quality criteria for efficient parallelization of SpAA^Tx; proposed SpAA^Tx algorithms; experiments. 4. References.

Slide 3: Introduction: y = AA^T x. Thread-level parallelization of y = AA^T x (SpAA^Tx). y = AA^T x is computed as two Sparse Matrix-Vector Multiplies (SpMV): a Sparse Matrix-Transpose-Vector Multiply (SpA^Tx), z = A^T x, followed by a Sparse Matrix-Vector Multiply (SpAx), y = Az. The target is thread-level parallelization of repeated and consecutive SpA^Tx and SpAx operations that involve the same sparse matrix A. Examples:
- Linear Programming (LP) problems via interior point methods
- nonsymmetric systems via Bi-CG, CGNE, and Lanczos bi-orthogonalization
- least squares problems via LSQR
- the linear feasibility problem via the Surrogate Constraints method
- Krylov-based balancing algorithms used as preconditioners for sparse eigensolvers
- web page ranking via the HITS algorithm
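To make the kernel concrete, here is a minimal serial C sketch of SpAA^Tx over CSR storage; the csr struct and helper names are illustrative assumptions, not the talk's data structures. Note that the same CSR arrays serve both phases: z = A^T x is a scatter along the rows of A, while y = Az is the usual row-wise gather, which is what makes single-storage schemes possible.

```c
#include <stddef.h>

/* Minimal CSR storage for an m-by-n sparse matrix (illustrative). */
typedef struct {
    size_t m, n;          /* dimensions */
    const size_t *rowptr; /* length m+1 */
    const size_t *colind; /* length nnz */
    const double *val;    /* length nnz */
} csr;

/* z = A^T x (SpA^Tx): scatter along the CSR rows of A. */
static void spatx(const csr *A, const double *x, double *z) {
    for (size_t j = 0; j < A->n; ++j) z[j] = 0.0;
    for (size_t i = 0; i < A->m; ++i)
        for (size_t p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            z[A->colind[p]] += A->val[p] * x[i];
}

/* y = Az (SpAx): row-wise gather over the same CSR storage. */
static void spax(const csr *A, const double *z, double *y) {
    for (size_t i = 0; i < A->m; ++i) {
        double sum = 0.0;
        for (size_t p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            sum += A->val[p] * z[A->colind[p]];
        y[i] = sum;
    }
}

/* y = AA^T x as two consecutive SpMVs sharing the matrix A. */
void spaatx(const csr *A, const double *x, double *z, double *y) {
    spatx(A, x, z);  /* z = A^T x */
    spax(A, z, y);   /* y = Az   */
}
```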

4 Open problems & Related work Open problems Utilize the opportunity of reusing A-matri nonzeros? Obtain close performance for both z = A and y = Az at the same time? Single storage of A for both z = A and y = Az Storage of A for z = A and a separate storage of A for y = Az Related work Optimized Sparse Kernel Interface (OSKI), Berkeley Serial z Each row/column is reused. y A z A Compressed Sparse Blocks (CSB) by Buluc et. al. [10] Parallel Same data structure for both SpA and SpA operations without any performance degradation wo phase, i.e., SpA and SpA are not performed simultaneously 4 / 14

Slide 5: Parallel SpAA^Tx based on 1D partitioning of the A and A^T matrices — thread-level baseline parallelization of SpAA^Tx. Four baseline SpAA^Tx algorithms compute y = Az after z = A^T x by four threads, each applying 1D rowwise or columnwise partitioning to the two SpMVs: CCp (Column-Column parallel, over column slices C_1, ..., C_4), RRp (Row-Row parallel, over row slices R_1, ..., R_4), CRp (Column-Row parallel), and RCp (Row-Column parallel). In the figures, yellow scale tones mark exclusive accesses by a single thread and red marks concurrent accesses by multiple threads.
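Following the CRp picture (also shown on the quality-criteria slide), a minimal OpenMP sketch could look like this; the csc_slice block type and all names are illustrative assumptions. Thread k forms z_k = C_k^T x exclusively and then scatters C_k z_k into y, the atomic updates being exactly the red concurrent accesses:

```c
#include <omp.h>
#include <stddef.h>

/* Hypothetical per-thread column slice C_k of the m-by-n matrix A,
 * stored CSC-like; slice k covers columns cbeg[k]..cbeg[k+1]-1. */
typedef struct {
    size_t m, ncols;      /* rows of A, columns in this slice */
    const size_t *colptr; /* length ncols+1 */
    const size_t *rowind; /* row indices of nonzeros */
    const double *val;
} csc_slice;

/* CRp baseline: z_k = C_k^T x (exclusive writes), then
 * y += C_k z_k (concurrent writes by all threads). */
void crp_spaatx(const csc_slice *slices, int K, const size_t *cbeg,
                const double *x, double *z, double *y, size_t m) {
    for (size_t i = 0; i < m; ++i) y[i] = 0.0;
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < K; ++k) {
        const csc_slice *C = &slices[k];
        double *zk = z + cbeg[k];            /* thread-exclusive z_k */
        for (size_t j = 0; j < C->ncols; ++j) {
            /* z_k(j) = C_k(:,j)^T x -- no races, exclusive to thread k */
            double sum = 0.0;
            for (size_t p = C->colptr[j]; p < C->colptr[j + 1]; ++p)
                sum += C->val[p] * x[C->rowind[p]];
            zk[j] = sum;
            /* y += C_k(:,j) z_k(j) -- column-parallel on A: different
             * threads may hit the same y entry (concurrent writes) */
            for (size_t p = C->colptr[j]; p < C->colptr[j + 1]; ++p) {
                #pragma omp atomic
                y[C->rowind[p]] += C->val[p] * zk[j];
            }
        }
    }
}
```

Read against the criteria on the next slides, the sketch shows why CRp is attractive: each z_k entry and each column's nonzeros are produced and consumed by the same thread, while the atomic scatter into y is precisely the concurrent-write cost that the SB-based methods confine to a small border.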

Slide 6: Contributions. We identify five quality criteria (QC) that have an impact on the performance of parallel SpAA^Tx, and we propose two methods based on the singly-bordered block-diagonal (SB) form: sbCRp and sbRCp. Matrix A is partitioned into four slices and the submatrices are processed by four threads. For sbCRp (the SB-based Column-Row parallel algorithm), we permute matrix A into a rowwise SB form with diagonal blocks A_11, ..., A_44 and row-border blocks A_B1, ..., A_B4, which induces a columnwise SB form of matrix A^T. For sbRCp (the SB-based Row-Column parallel algorithm), we permute matrix A into a columnwise SB form, which induces a rowwise SB form of matrix A^T. Both methods achieve QC (a) (z-vector reuse) and QC (b) (A-matrix reuse); the objective of minimizing the size of the row/column border in the SB form of A achieves QC (c), (d), and (e) in sbCRp/sbRCp.

Slide 7: Quality criteria for efficient parallelization of SpAA^Tx.
(a) Reusing z-vector entries generated in z = A^T x and then read in y = Az.
(b) Reusing matrix nonzeros (together with their indices) in z = A^T x and y = Az.
(c) Exploiting temporal locality in reading input-vector entries in row-parallel SpMVs.
(d) Exploiting temporal locality in updating output-vector entries in column-parallel SpMVs.
(e) Minimizing the number of concurrent writes performed by different threads in column-parallel SpMVs.
The criteria are tabulated for RRp, CRp, RCp, sbCRp, and sbRCp with the legend: satisfied; satisfied except for the z_B border subvectors; satisfied except for the A_kB/A_Bk border submatrices; not applicable; not satisfied; may be satisfied through row/column reordering. The accompanying figure illustrates CRp (Column-Row parallel): thread k computes z_k = C_k^T x exclusively and then contributes y += C_k z_k concurrently.

Slide 8: Proposed SpAA^Tx algorithms. The SB forms are obtained by partitioning that addresses the criteria directly: maintaining balance on the number of nonzeros in each slice reduces the parallel time under arbitrary task scheduling, while reducing the border size reduces both the number of cache misses due to loss of temporal locality and the number of concurrent writes. For a matrix A partitioned into K slices A_1, ..., A_K (the example shows three slices processed by three threads), the connectivity of a column is λ(c_j) = {A_k : c_j has at least one nonzero in A_k, k = 1, ..., K}, and the connectivity of a row is λ(r_i) = {A_k : r_i has at least one nonzero in A_k, k = 1, ..., K}; rows/columns with connectivity greater than one end up in the border.
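As a small illustration, |λ(c_j)| can be computed from a rowwise partition in one pass over the nonzeros. The bitmask sketch below (assuming K ≤ 64, CSR storage, GCC/Clang builtins, and hypothetical names) is mine; the talk obtains the partitions with the hypergraph partitioner PaToH rather than post-computing connectivities like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* |lambda(c_j)| for every column j of an m-by-n CSR matrix under a
 * K-way rowwise partition (K <= 64 for the bitmask trick). Columns
 * with |lambda(c_j)| > 1 must go to the column border of the SB form. */
int *column_connectivity(size_t m, size_t n,
                         const size_t *rowptr, const size_t *colind,
                         const int *part /* row i -> slice id in 0..K-1 */) {
    uint64_t *seen = calloc(n, sizeof(uint64_t)); /* slice bitmask per column */
    for (size_t i = 0; i < m; ++i)
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; ++p)
            seen[colind[p]] |= UINT64_C(1) << part[i];
    int *lambda = malloc(n * sizeof(int));
    for (size_t j = 0; j < n; ++j)
        lambda[j] = __builtin_popcountll(seen[j]); /* # distinct slices */
    free(seen);
    return lambda;
}
```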

Slide 9: Merits of the singly-bordered block-diagonal (SB) form on CRp. Permuting A into the SB form shrinks the concurrently accessed regions: CRp accesses the whole x and y vectors concurrently, whereas sbCRp accesses only the x_B and y_B border subvectors concurrently. The SB form thus exploits temporal locality in reading x-vector entries in the row-parallel z = A^T x and in updating y-vector entries in the column-parallel y = Az, and minimizing the border size in the SB form minimizes the number of concurrent writes by different threads in the column-parallel y = Az.

Slide 10: Proposed SpAA^Tx algorithms.

sbCRp (SB-based Column-Row parallel), on the rowwise SB form of A (diagonal blocks A_kk, row-border blocks A_Bk; x_B and y_B are the border subvectors):
Require: A_kk and A_Bk matrices; x, y, and z vectors
1: for k = 1 to K in parallel do
2:   z_k = A_kk^T x_k
3:   z_k = z_k + A_Bk^T x_B
4:   y_k = A_kk z_k
5:   y_B = y_B + A_Bk z_k   (concurrent writes)
6: end for

sbRCp (SB-based Row-Column parallel), on the columnwise SB form of A (diagonal blocks A_kk, column-border blocks A_kB; z_B is the border subvector of z, complete only after the first loop finishes):
Require: A_kk and A_kB matrices; x, y, and z vectors
1: for k = 1 to K in parallel do
2:   z_k = A_kk^T x_k
3:   z_B = z_B + A_kB^T x_k   (concurrent writes)
4:   y_k = A_kk z_k
5: end for
6: for k = 1 to K in parallel do
7:   y_k = y_k + A_kB z_B
8: end for
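A minimal OpenMP rendering of sbCRp might look as follows; the blk type, the block-local indexing assumption, and all helper names are mine, and atomic updates stand in for the concurrent writes into y_B (the talk's kernels are more heavily tuned). The zk and yk subvectors are assumed zero on entry:

```c
#include <omp.h>
#include <stddef.h>

/* Minimal CSR block with block-local row/column indices (illustrative). */
typedef struct {
    size_t m; /* rows in this block */
    const size_t *rowptr, *colind;
    const double *val;
} blk;

/* z += B^T x over one CSR block; 'atomic' guards concurrent writes. */
static void add_spmtv(const blk *B, const double *x, double *z, int atomic) {
    for (size_t i = 0; i < B->m; ++i)
        for (size_t p = B->rowptr[i]; p < B->rowptr[i + 1]; ++p) {
            double v = B->val[p] * x[i];
            if (atomic) {
                #pragma omp atomic
                z[B->colind[p]] += v;
            } else
                z[B->colind[p]] += v;
        }
}

/* y += B z over one CSR block; 'atomic' guards concurrent writes. */
static void add_spmv(const blk *B, const double *z, double *y, int atomic) {
    for (size_t i = 0; i < B->m; ++i) {
        double sum = 0.0;
        for (size_t p = B->rowptr[i]; p < B->rowptr[i + 1]; ++p)
            sum += B->val[p] * z[B->colind[p]];
        if (atomic) {
            #pragma omp atomic
            y[i] += sum;
        } else
            y[i] += sum;
    }
}

/* sbCRp sketch: thread k owns diagonal block Akk[k] and border block
 * ABk[k] of the rowwise SB form; xk/zk/yk are the conformal subvectors. */
void sbcrp(const blk *Akk, const blk *ABk, int K,
           double *const *xk, double *const *zk, double *const *yk,
           const double *xB, double *yB) {
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < K; ++k) {
        add_spmtv(&Akk[k], xk[k], zk[k], 0); /* z_k  = A_kk^T x_k (exclusive)  */
        add_spmtv(&ABk[k], xB,    zk[k], 0); /* z_k += A_Bk^T x_B (exclusive)  */
        add_spmv (&Akk[k], zk[k], yk[k], 0); /* y_k  = A_kk z_k   (exclusive)  */
        add_spmv (&ABk[k], zk[k], yB,    1); /* y_B += A_Bk z_k   (concurrent) */
    }
}
```

An sbRCp sketch would instead call add_spmtv with atomic updates to accumulate z_B concurrently, and it needs the barrier between its two parallel loops before y_k += A_kB z_B can read the completed z_B.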

Slide 11: Applicability of CRp, sbCRp, and sbRCp to iterative methods.
- LP via interior point methods [1, 2] (z = A^T x; y = Az): directly applicable.
- CGNE [3] (z = A^T x; β = (z, z)/(q, q); y = Az): directly applicable; the inner product introduces no dependency since it can be delayed.
- LSQR [4] and the Surrogate Constraints method [5, 6] (z = A^T x; w = f(z); y = Aw): directly applicable; only linear vector operations, which need no synchronization, separate the two multiplies.
- BiCG [3], Lanczos bi-orthogonalization [3], and HITS [7, 8] (z = A^T x; y = Aw): independent SpMVs; the two for-loops of sbRCp can be fused.
- Krylov-based balancing [9] and CGNR [3] (z = A^T x; α = ||y||₂² / ||z||₂²; y = A·αw): not applicable due to the inner product and inter-dependency.

Slide 12: Experiments — performance results on Intel Xeon Phi. Average results over 28 sparse matrices from the UFL collection, with up to 20M nonzeros and 3.5M rows/columns. Baseline methods: RRp, CRp, and RCp (OpenMP), plus RRp with the vendor-provided MKL; Reverse Cuthill-McKee (RCM) reordering targets QC (c) and (d). Proposed methods: sbCRp and sbRCp (OpenMP; SB forms computed with PaToH, 3 runs); highly tuned SpMV libraries can be integrated. Parallel SpAA^Tx times are normalized with respect to RRp under the original ordering (smaller is better), reporting the best of 1, 2, 3, and 4 threads per core, for the original and RCM orderings of the best of RRp, MKL, and CRp/RCp, and for the SB forms of the proposed methods. The slide also shows the web-berkstan matrix in its original ordering and in the columnwise SB form used by sbCRp.

Slide 13: Experiments — performance profiles. The profiles compare the proposed methods sbCRp and sbRCp against methods with double storage of A (RRp, MKL) and methods with single storage of A (CRp, RCp), each under the original and RCM orderings. The vertical axis gives the percentage of test instances and the horizontal axis the parallel SpAA^Tx time relative to the best; curves are shown for the best of sbCRp/sbRCp, the best of RRp/MKL, and the best of CRp/RCp.

Slide 14: Experiments — performance results on Xeon. Two Intel Xeon E5 processors; 16 threads with Hyper-Threading. Normalized parallel SpAA^Tx times (smaller is better) are reported for the matrices degme, LargeRegFile, Stanford, and web-berkstan, under the original and RCM orderings of the best of RRp, MKL, and CRp/RCp, and under the SB forms of the proposed methods. Preprocessing overhead, measured as the number of SpAA^Tx operations using RRp needed to amortize it (sbCRp/sbRCp): degme 136; LargeRegFile 143; Stanford 2.

Slide 15: References.
[1] N. Karmarkar, "A new polynomial-time algorithm for linear programming," Proc. 16th Annual ACM Symposium on Theory of Computing, 1984.
[2] S. Mehrotra, "On the implementation of a primal-dual interior point method," SIAM Journal on Optimization, vol. 2, no. 4, 1992.
[3] Y. Saad, Iterative Methods for Sparse Linear Systems. SIAM, 2003.
[4] C. C. Paige and M. A. Saunders, "LSQR: An algorithm for sparse linear equations and sparse least squares," ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 1, 1982.
[5] K. Yang and K. G. Murty, "New iterative methods for linear inequalities," Journal of Optimization Theory and Applications, vol. 72, no. 1, 1992.
[6] B. Uçar, C. Aykanat, M. Ç. Pınar, and T. Malas, "Parallel image restoration using surrogate constraint methods," Journal of Parallel and Distributed Computing, vol. 67, no. 2, 2007.
[7] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, no. 5, 1999.
[8] X. Yang, S. Parthasarathy, and P. Sadayappan, "Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining," Proc. VLDB Endow., vol. 4, no. 4, Jan. 2011.
[9] T.-Y. Chen and J. W. Demmel, "Balancing sparse matrices for computing eigenvalues," Linear Algebra and its Applications, vol. 309, no. 1-3, 2000.
[10] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," Proc. 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2009.
[11] K. Akbudak, E. Kayaaslan, and C. Aykanat, "Hypergraph partitioning based models and methods for exploiting cache locality in sparse matrix-vector multiplication," SIAM Journal on Scientific Computing, vol. 35, no. 3, pp. C237-C262, 2013.
