Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
1 Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication. Penporn Koanantakool 1,2, Ariful Azad 2, Aydin Buluç 2, Dmitriy Morozov 2, Sang-Yun Oh 2,3, Leonid Oliker 2, Katherine Yelick 1,2. 1 Computer Science Division, University of California, Berkeley; 2 Lawrence Berkeley National Laboratory; 3 University of California, Santa Barbara. May 26, 2016
2 Outline: Background and Motivation; Algorithms (2D SUMMA, 3D SUMMA, 1.5D variants); Experimental Results: 100x speedup; Iterative Multiplication; Conclusions
3 Parallel Algorithm Cost Model. p distributed, homogeneous processors connected through a network. Per-processor costs along the critical path: time = #flops × time_per_flop + #messages × time_per_message + #words × time_per_word (computation + communication). time_per_flop, time_per_message, and time_per_word are machine-specific constants. Goal: minimize #messages (S) and #words (W). Image courtesy of James Demmel.
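The cost model above is simple enough to state directly in code. A minimal sketch (the machine constants below are illustrative placeholders, not measured Cray values):

```python
def run_time(flops, messages, words,
             time_per_flop=1e-10, time_per_message=1e-6, time_per_word=1e-9):
    """Critical-path time from the slide's model:
    #flops*time_per_flop + #messages*time_per_message + #words*time_per_word."""
    return (flops * time_per_flop
            + messages * time_per_message
            + words * time_per_word)

# Same data volume, two message schedules: latency makes the chatty one slower.
many_small = run_time(flops=0, messages=1000, words=1000)
one_large = run_time(flops=0, messages=1, words=1000)
print(many_small > one_large)
```

Because time_per_message dwarfs time_per_word on real networks, minimizing the message count matters even when the word count is fixed.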
4 Sparse-Dense Matmul. A (sparse) × B (dense) = C (dense). C(i,j) += A(i,k) × B(k,j): each entry of C pairs row A(i,:) with column B(:,j).
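A quick way to see the sparse-dense kernel in action, using SciPy's CSR format (the same layout the experiments later use for the local computation):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# A is sparse, B is dense, C = A @ B is dense: each C[i, j] accumulates
# A[i, k] * B[k, j] only over the nonzeros A[i, k] in row i of A.
rng = np.random.default_rng(0)
A = sparse_random(6, 5, density=0.3, format="csr", random_state=0)
B = rng.standard_normal((5, 4))
C = A @ B  # SciPy dispatches to a CSR sparse-dense kernel

assert np.allclose(C, A.toarray() @ B)  # matches the dense definition
```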
5 Matrix Multiplication Algorithms: 1D, 1.5D, 2D (SUMMA), 2.5D, 3D. Here A is sparse, B and C dense. The 2D/3D families are optimal for dense-dense and sparse-sparse matmul [Solomonik and Demmel 2011] [Ballard et al. 2013]. 2D and 3D images courtesy of Grey Ballard.
6 Existing Algorithms
7-10 1D Ring Algorithm. p=4, 1D layout. Shifts the sparse matrix A around cyclically while B stays in place. #msgs: p; msg size: nnz(A)/p; #words: nnz(A)
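The ring algorithm can be simulated serially to check its logic. A sketch, assuming each block row of A carries its partial C with it as it shifts (the name ring_spdm3 is mine, not the paper's):

```python
import numpy as np

def ring_spdm3(A_blocks, B_blocks):
    """Serial simulation of the 1D ring algorithm: the (sparse) block rows
    of A circulate; B stays put. Each block row of A picks up one
    contribution to its C block per round."""
    p = len(A_blocks)
    col_splits = np.cumsum([b.shape[0] for b in B_blocks])[:-1]
    # packages[i] = (owner j of the A block row currently at proc i, partial C_j)
    packages = [(i, np.zeros((A_blocks[i].shape[0], B_blocks[0].shape[1])))
                for i in range(p)]
    A_cur = list(A_blocks)
    for _ in range(p):      # p rounds, i.e. p messages of size nnz(A)/p each
        for i in range(p):  # every "processor" computes with the A it holds
            j, C_j = packages[i]
            A_cols = np.hsplit(A_cur[i], col_splits)[i]  # columns matching B_i
            C_j += A_cols @ B_blocks[i]
        # cyclic shift: A (and its partial C) moves to the next processor
        A_cur = A_cur[-1:] + A_cur[:-1]
        packages = packages[-1:] + packages[:-1]
    C = [None] * p
    for j, C_j in packages:
        C[j] = C_j
    return np.vstack(C)

# Demo: p=4 "processors", an 8x8 A times an 8x3 B, both split into block rows.
rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 3))
C = ring_spdm3(np.vsplit(A, 4), np.vsplit(B, 4))
print(np.allclose(C, A @ B))
```

After p shifts every A block row has visited every B block row, so each C block is complete.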
11-17 2D SUMMA Algorithm. p=16, 2D layout, √p × √p grid, outer products. At each stage, A broadcasts a block column row-wise and B broadcasts a block row column-wise. √p broadcasts of size nnz(A)/p for A; √p broadcasts of size nnz(B)/p for B.
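The stages of SUMMA are easy to mimic serially, with the two inner loops standing in for the row-wise and column-wise broadcasts. A sketch (summa is an assumed helper name, not the paper's code):

```python
import numpy as np

def summa(A, B, grid=4):
    """Serial sketch of 2D SUMMA on a grid x grid mesh: stage k broadcasts
    A's k-th block column along processor rows and B's k-th block row along
    processor columns; every C_ij accumulates one outer-product term per stage."""
    Ab = [np.hsplit(r, grid) for r in np.vsplit(A, grid)]  # Ab[i][k]
    Bb = [np.hsplit(r, grid) for r in np.vsplit(B, grid)]  # Bb[k][j]
    Cb = [[np.zeros((Ab[i][0].shape[0], Bb[0][j].shape[1]))
           for j in range(grid)] for i in range(grid)]
    for k in range(grid):            # sqrt(p) stages
        for i in range(grid):        # "row-wise broadcast" of Ab[i][k]
            for j in range(grid):    # "column-wise broadcast" of Bb[k][j]
                Cb[i][j] += Ab[i][k] @ Bb[k][j]
    return np.block(Cb)

# grid=4 corresponds to the slide's p=16 processor mesh.
rng = np.random.default_rng(2)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.allclose(summa(A, B), A @ B))
```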
18-20 3D SUMMA Algorithm. p=32, c=2 (c is #copies); grid with teams of size c. Each team member calculates 1/c of all the outer products (the first member the first half, the second member the latter half). Broadcasts of size nnz(A)c/p for A and nnz(B)c/p for B.
21 3D SUMMA: Costs. Three parts, each with its own #messages and #words terms: replication (replicate A, replicate B), propagation (broadcast A, broadcast B), and collection (reduce C).
22 1.5D Algorithms
23-24 1.5D Col A. p=4, c=2. B has p/c block columns; processors are grouped into teams of c, and A shifts cyclically in rounds. #msgs: p/c; msg size: nnz(A)c/p; #words: nnz(A)
25-26 1.5D Col ABC. p=8, c=2. A and B both have p/c block columns; each team member does 1/c of the work. #msgs: p/c
27-28 1.5D Inner ABC. p=8, c=2. A has p/c block rows; B has p/c block columns; each team member does 1/c of the work. #msgs: p/c²
29 Cost Comparison. Latency cost per algorithm (append the bandwidth terms for the bandwidth cost). Even replication and collection can be the bottleneck if nnz(A) ≪ nnz(B), nnz(C).
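Plugging the per-processor word counts quoted on these slides into a toy comparison shows why the 1.5D variants win when A is much sparser than B. This sketch ignores latency, replication, and collection terms:

```python
from math import sqrt

def words_2d_summa(nnzA, nnzB, p):
    # sqrt(p) broadcasts of size nnz/p for each of A and B (slides 11-17)
    return (nnzA + nnzB) / sqrt(p)

def words_15d_colA(nnzA, p, c):
    # p/c shifts of size nnz(A)*c/p each (slides 23-24): only sparse A moves
    return (p / c) * (nnzA * c / p)   # = nnz(A)

# The slides' experimental setup: p = 3,072, n = 64k, 41 nonzeros per row,
# B fully dense.
n, p, c = 64 * 1024, 3072, 2
nnzA, nnzB = 41 * n, n * n
ratio = words_2d_summa(nnzA, nnzB, p) / words_15d_colA(nnzA, p, c)
print(ratio)  # 1.5D Col A moves far fewer words because dense B never moves
```

The dense matrix dominates the 2D bandwidth term, which is exactly the "moving the dense matrix is costly" observation from slide 33.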
30 Experimental Results
31 Test environments. Comparison: 1.5D variants (Col A, Col ABC, and Inner ABC) against the baseline (best of 2D/2.5D/3D SUMMA ABC). C++ with MPI, hybrid (12 threads per MPI process); MKL's multithreaded routine mkl_dcsrmm for local computation (double precision, CSR format, row major). Machine: Cray XC30, 2.57 PFLOPS, 2 sockets of 12 cores each per node, Cray Aries network with Dragonfly topology.
32 Cost Breakdown. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. Existing vs. new algorithms.
33 Moving the Dense Matrix is Costly. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. Existing vs. new algorithms.
34 Replication might not pay off. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. 17x. Existing vs. new algorithms.
35 Different Computation Efficiency. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. 17x. Existing vs. new algorithms.
36 100x Improvement. A 66k × 172k, B 172k × 66k, % nnz, Cray XC30. 100x (up is good).
37 When is 1.5D better? Best theoretical bandwidth-cost regions, assuming nnz(A) = nnz(B). Also applicable to dense-dense and sparse-sparse multiplication. Most sparse matrices have < 5% nonzeroes.
38 Iterative Multiplication
39 Sparse ICM Estimation [S.-Y. Oh et al. 2014]. ICM = inverse covariance matrix, which captures direct pairwise dependency. The data could be brain fMRI scans or gene expression. With r observations of n variables/features, X is the observed data and S the sample covariance (rank-deficient). S is usually rank-deficient or too expensive to invert directly, so the sparse inverse is estimated by setting an objective function and optimizing it.
40 CONCORD-ISTA. CONCORD objective function over diagonal elements, off-diagonal elements, and Ω, the estimated sparse inverse of S. Iteratively compute ΩS, or ΩXᵀ times X. Replication (and sometimes collection) costs can be amortized across iterations.
41 Conclusions. 1.5D is beneficial when the nonzero counts of the matrices are imbalanced; it is also applicable to dense-dense/sparse-sparse multiplication. Observed up to 100x speedup. 1.5D can handle moderately imbalanced load with greedy partitioning. Iterative multiplication gains even more benefit. Future work: parallel CONCORD-ISTA, new communication lower bounds, trying a recursive approach, considering different computation efficiency.
42 References I. P. Koanantakool, A. Azad, A. Buluç, D. Morozov, S.-Y. Oh, L. Oliker, K. Yelick. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In IPDPS, 2016. E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par, 2011, vol. 6853. G. Ballard, A. Buluç, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the 25th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013.
43 References II. M. Driscoll, E. Georganas, P. Koanantakool, E. Solomonik, K. Yelick. A communication-optimal N-body algorithm for direct interactions. In IPDPS, 2013. S.-Y. Oh, O. Dalal, K. Khare, and B. Rajaratnam. Optimization methods for sparse pseudo-likelihood graphical model selection. In NIPS 27, 2014.
44 Thank you! Questions? Suggestions/Applications?
45 Communication Costs. Three phases: replication (to have c copies), propagation (to multiply), and collection (of matrix C). The non-replicating version only has the propagation cost. Replication and collection happen between team members, i.e., only c processors, so they are usually cheaper than propagation.
46 CONCORD-ISTA. The CONCORD objective function is non-smooth; Ω is the estimated sparse inverse of S, with diagonal elements and λ-penalized off-diagonal elements. Proximal gradient method: gradient descent on the smooth part, soft-thresholding on the non-smooth part (ISTA: Iterative Soft-Thresholding Algorithm). Does not assume a Gaussian distribution.
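The soft-thresholding (proximal) operator that ISTA applies each iteration is one line of NumPy:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||x||_1: shrink each entry toward zero by t,
    zeroing anything with magnitude below t -- this is what produces sparsity."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(soft_threshold(x, 1.0))  # only entries with |x| > 1 survive, shrunk by 1
```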
47 Pseudocode (Ω: estimated sparse inverse of S).
S = XᵀX/r
Ω = I
do
  Ω₀ = Ω
  compute gradient G
  try τ = {1, 1/2, 1/4, ...}  // find appropriate step size
    Ω = soft_thresholding(Ω₀ − τG, τλ)
  until descent condition satisfied
while max(|Ω₀ − Ω|) > ε  // until no more progress
Distributed operations: S = XᵀX/r is a dense-dense matrix multiplication; SΩ + ΩS is a sparse-dense matrix multiplication.
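The pseudocode above can be fleshed out into a toy serial NumPy version. This is a sketch under stated assumptions (CONCORD smooth part −Σᵢ log ωᵢᵢ + ½ tr(ΩSΩ), whose gradient contains the SΩ + ΩS term from the slide, with λ penalizing off-diagonals only), not the paper's distributed C++/MPI implementation:

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding (proximal operator of the l1 penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(Omega, S, lam):
    """Assumed CONCORD objective: -sum_i log w_ii + (1/2) tr(Omega S Omega)
    + lam * ||off-diagonal of Omega||_1."""
    off = Omega - np.diag(np.diag(Omega))
    return (-np.sum(np.log(np.diag(Omega)))
            + 0.5 * np.trace(Omega @ S @ Omega)
            + lam * np.sum(np.abs(off)))

def concord_ista(X, lam=0.1, eps=1e-6, max_iter=200):
    """Toy serial CONCORD-ISTA following the slide's pseudocode."""
    r, n = X.shape
    S = X.T @ X / r                     # dense-dense matrix multiplication
    Omega = np.eye(n)
    for _ in range(max_iter):
        Omega0 = Omega
        # gradient of the smooth part; S@W + W@S is the sparse-dense matmul
        G = -np.diag(1.0 / np.diag(Omega0)) + (S @ Omega0 + Omega0 @ S) / 2
        tau = 1.0
        while True:                     # try tau = 1, 1/2, 1/4, ...
            Omega = soft(Omega0 - tau * G, tau * lam)
            # lam penalizes off-diagonals only: plain step on the diagonal
            np.fill_diagonal(Omega, np.diag(Omega0 - tau * G))
            ok = (np.all(np.diag(Omega) > 0)
                  and objective(Omega, S, lam) <= objective(Omega0, S, lam))
            if ok or tau < 1e-12:       # descent condition satisfied
                break
            tau /= 2
        if np.max(np.abs(Omega0 - Omega)) < eps:
            break                       # no more progress
    return Omega

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))       # r = 200 observations, n = 5 features
Omega = concord_ista(X, lam=0.2)
print(np.count_nonzero(np.abs(Omega) > 1e-8))
```

In the distributed setting, Ω's replication cost can be paid once and amortized over all the iterations, which is the point of the preceding slide.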
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report
ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new
More informationSparse matrices, graphs, and tree elimination
Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationGraph Algorithms on A transpose A. Benjamin Chang John Gilbert, Advisor
Graph Algorithms on A transpose A. Benjamin Chang John Gilbert, Advisor June 2, 2016 Abstract There are strong correspondences between matrices and graphs. Of importance to this paper are adjacency matrices
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26
More informationThe Fast Multipole Method on NVIDIA GPUs and Multicore Processors
The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied
More informationA Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression
A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression Xiaoye Sherry Li, xsli@lbl.gov Liza Rebrova, Gustavo Chavez, Yang Liu, Pieter Ghysels Lawrence Berkeley National
More informationParallel Graph Algorithms. Richard Peng Georgia Tech
Parallel Graph Algorithms Richard Peng Georgia Tech OUTLINE Model and problems Graph decompositions Randomized clusterings Interface with optimization THE MODEL The `scale up approach: Have a number of
More informationCME 213 SPRING Eric Darve
CME 213 SPRING 2017 Eric Darve LINEAR ALGEBRA MATRIX-VECTOR PRODUCTS Application example: matrix-vector product We are going to use that example to illustrate additional MPI functionalities. This will
More informationHypergraph Partitioning for Computing Matrix Powers
Hypergraph Partitioning for Computing Matrix Powers Nicholas Knight Erin Carson, James Demmel UC Berkeley, Parallel Computing Lab CSC 11 1 Hypergraph Partitioning for Computing Matrix Powers Motivation
More informationParallel Computing: Parallel Algorithm Design Examples Jin, Hai
Parallel Computing: Parallel Algorithm Design Examples Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! Given associative operator!! a 0! a 1! a 2!! a
More informationCombinatorial problems in a Parallel Hybrid Linear Solver
Combinatorial problems in a Parallel Hybrid Linear Solver Ichitaro Yamazaki and Xiaoye Li Lawrence Berkeley National Laboratory François-Henry Rouet and Bora Uçar ENSEEIHT-IRIT and LIP, ENS-Lyon SIAM workshop
More informationGauss-Jordan Matrix Inversion Speed-Up using GPUs with the Consideration of Power Consumption
Gauss-Jordan Matrix Inversion Speed-Up using GPUs with the Consideration of Power Consumption Mahmoud Shirazi School of Computer Science Institute for Research in Fundamental Sciences (IPM) Tehran, Iran
More informationCommunication-efficient parallel generic pairwise elimination
Communication-efficient parallel generic pairwise elimination Alexander Tiskin Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom Abstract The model of bulk-synchronous
More informationAll About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004
All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding James S. Plank Logistical Computing and Internetworking Laboratory Department of Computer Science University of Tennessee ICL - August 2, 24
More information