Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures
X. Sherry Li, Lawrence Berkeley National Laboratory
ParLab, April 29, 2008
Acknowledgements: John Shalf (LBNL), Rich Vuduc (Georgia Tech), Sam Williams (UC Berkeley)

Overview
- Chip multiprocessor (CMP) systems have become the de facto HPC building blocks: better trade-offs between performance (parallelism) and energy efficiency, and diverse CMP architectural designs (multicore, multithreading, ...).
- Test machines in this study, all programmable in a shared address space:
  - Intel Clovertown (homogeneous multicore)
  - Sun VictoriaFalls (hardware-threaded multicore, NUMA)
  - IBM Power 5 (conventional SMP node)
- Questions:
  - programmability: Pthreads vs. MPI
  - performance of the existing code
  - where and how to improve performance
- Findings may be applicable to other algorithms, such as ILU.

Architectural summary

                      Intel Clovertown       Sun VictoriaFalls     IBM Power 5 (575)
Core type             superscalar (4)        multithreaded (8)     superscalar (4)
Clock (GHz)           2.33                   1.4                   1.9
L1 DCache             32 KB                  8 KB                  32 KB
# sockets             2                      4                     8
# cores/socket        4                      8 (256 threads)       1
L2 cache              4 MB/2 cores (16 MB)   4 MB/socket (16 MB)   1.92 MB/core (2 MB L3$/node)
Peak DP Gflop/s       74.7                   44.8                  60.8
DRAM GB/s (read)      21.3                   84                    2
Byte-to-flop ratio    0.29                   1.88                  0.29
Socket power (Watts)  16 (max)               84 (max)              5 (measured)

Sources: John Shalf, Sam Williams
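The byte-to-flop row is simply the sustained DRAM read bandwidth divided by the peak double-precision rate. A minimal check using two columns of the table above (the variable names are only illustrative):

    # Machine balance: bytes of DRAM read traffic per peak double-precision flop,
    # using the Clovertown and VictoriaFalls figures from the table above.
    machines = {
        "Clovertown":    {"dram_read_GBps": 21.3, "peak_dp_gflops": 74.7},
        "VictoriaFalls": {"dram_read_GBps": 84.0, "peak_dp_gflops": 44.8},
    }
    for name, m in machines.items():
        print(name, round(m["dram_read_GBps"] / m["peak_dp_gflops"], 2), "bytes/flop")
    # Prints ~0.29 and ~1.88, matching the byte-to-flop row; memory-bound kernels
    # such as sparse triangular solve favor the machine with the higher ratio.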

Architectural diagrams

Intel Clovertown (2 sockets, 8 cores): pairs of Xeon cores share a 4 MB L2; the sockets connect over the FSB (Front Side Bus) to the Blackford memory controller; DRAM read bandwidth 21.3 GB/s, write 10.6 GB/s (write bandwidth is half of read).

Sun VictoriaFalls, a quad-chip Niagara2 system (NUMA, 32 cores, 256 threads): each chip has a core-cache crossbar and a 4 MB L2, with FBDIMM DRAM channels (21.3 and 42.7 GB/s figures in the diagram) and external coherence hubs linking the four chips.

Single-core, single-threaded BLAS
- Clovertown: Intel MKL
- VictoriaFalls: Sun Performance Library (a single thread can't use the 8 hardware threads!!)

Sparse GE (LU factorization)

Scalar version: nested loops (figure: sparse matrix with its L and U factors)

    for i = 1 to n
        for j = i+1 to n
            A(j,i) = A(j,i) / A(i,i)
        for k = i+1 to n s.t. A(i,k) != 0
            for j = i+1 to n s.t. A(j,i) != 0
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Typical fill ratio: 10x for 2D problems, 30-50x for 3D problems
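The loop above is the scalar, right-looking form of sparse elimination: column i first scales its sub-diagonal entries by the pivot, then updates only those entries A(j,k) whose multiplier A(j,i) and pivot-row entry A(i,k) are both nonzero. A toy Python version on a dense array (illustrative only; the function name is made up, there is no pivoting, and a real sparse code stores and traverses only the nonzeros):

    import numpy as np

    def scalar_sparse_lu(A):
        """Right-looking elimination following the scalar loop above.

        A is overwritten with the multipliers (strict lower triangle, unit
        diagonal implied) and with U (upper triangle); zero entries are
        skipped, mimicking the "s.t. A(.,.) != 0" tests.
        """
        n = A.shape[0]
        for i in range(n):
            for j in range(i + 1, n):
                A[j, i] /= A[i, i]                    # multiplier L(j,i)
            for k in range(i + 1, n):
                if A[i, k] != 0.0:                    # nonzero columns of row i
                    for j in range(i + 1, n):
                        if A[j, i] != 0.0:            # nonzero rows of column i
                            A[j, k] -= A[j, i] * A[i, k]
        return A

    # Tiny check on a diagonally dominant matrix (no pivoting needed).
    A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
    F = scalar_sparse_lu(A.copy())
    L = np.tril(F, -1) + np.eye(3)
    U = np.triu(F)
    assert np.allclose(L @ U, A)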

Supernode: dense blocks in {L\U}
- Good for high performance
- Enables use of Level 3 BLAS
- Reduces inefficient indirect addressing (scatter/gather)
- Reduces time in the graph algorithms by traversing a coarser graph
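Why supernodes enable Level 3 BLAS: a group of w columns with the same nonzero structure can update the trailing matrix with one matrix-matrix multiply instead of w rank-1 updates. A minimal dense sketch (the block sizes are made up; numpy's @ stands in for a DGEMM call):

    import numpy as np

    rng = np.random.default_rng(0)
    w, m, t = 8, 32, 16          # supernode width, rows below it, trailing columns
    B = rng.random((m, w))       # sub-diagonal panel of L for the supernode
    U12 = rng.random((w, t))     # corresponding rows of U
    S = rng.random((m, t))       # block of the trailing (Schur complement) matrix

    # Column-by-column (Level 1/2 style): one rank-1 update per column.
    S_col = S.copy()
    for j in range(w):
        S_col -= np.outer(B[:, j], U12[j, :])

    # Supernodal (Level 3 style): a single rank-w matrix-matrix update,
    # which is what a DGEMM on the whole supernode computes.
    S_blk = S - B @ U12

    assert np.allclose(S_col, S_blk)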

Major stages
1. Order equations & variables to preserve sparsity. NP-hard, so heuristics are used.
2. Symbolic factorization. Identify supernodes, set up data structures and allocate memory for L & U.
3. Numerical factorization. Usually dominates the total time. How to pivot?
4. Triangular solutions. Usually less than 5% of the total time.

SuperLU_MT:
1. Sparsity ordering
2. Factorization: partial pivoting, symbolic fact., numerical fact. (BLAS 2.5)
3. Solve

SuperLU_DIST:
1. Static pivoting
2. Sparsity ordering
3. Symbolic fact.
4. Numerical fact. (BLAS 3)
5. Solve
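The same stages can be exercised through SciPy, whose scipy.sparse.linalg.splu wraps the sequential SuperLU library; a minimal sketch on a 2D Laplacian (the test matrix and parameters are chosen only for illustration):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # 2D 5-point Laplacian on a 100 x 100 grid, in CSC format for SuperLU.
    n = 100
    T = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n))
    E = sp.diags([-1.0, -1.0], [-1, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(E, sp.identity(n))).tocsc()
    b = np.ones(A.shape[0])

    # Ordering + symbolic + numerical factorization (stages 1-3);
    # permc_spec selects the column-ordering heuristic.
    lu = splu(A, permc_spec="COLAMD")

    # Fill-in: nnz(L) + nnz(U) relative to nnz(A); compare with the ~10x
    # figure quoted earlier for 2D problems.
    print("fill ratio ~ %.1fx" % ((lu.L.nnz + lu.U.nnz) / A.nnz))

    # Stage 4: triangular solves, typically a small fraction of the time.
    x = lu.solve(b)
    print("residual:", np.linalg.norm(A @ x - b))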

SuperLU_MT [Li/Demmel/Gilbert]
- Pthreads or OpenMP
- Left-looking: relatively more READs than WRITEs
- Uses a shared task queue to schedule ready columns in the elimination tree, bottom up (sketched below)
- Over 12x speedup on conventional 16-CPU SMPs (1999)

(figure: snapshot of the factorization with two processes P1 and P2 working on columns of L and U, columns marked DONE / BUSY / NOT TOUCHED)
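A toy sketch of that scheduling idea, not SuperLU_MT's actual code: worker threads pull ready columns from a shared queue, and a column becomes ready once all of its children in the elimination tree are finished (the etree, the thread count, and the timeout-based termination are all made up for the example):

    import queue
    import threading

    # Elimination tree given as parent[j] (None for roots). Children must be
    # factored before their parent, so the leaves are ready first.
    parent = {0: 2, 1: 2, 2: 4, 3: 4, 4: None}
    pending = {j: 0 for j in parent}          # unfinished children per column
    for j, p in parent.items():
        if p is not None:
            pending[p] += 1

    ready = queue.Queue()
    for j, c in pending.items():
        if c == 0:
            ready.put(j)                      # leaves of the etree

    lock = threading.Lock()
    order = []                                # completion order, for inspection

    def worker():
        while True:
            try:
                j = ready.get(timeout=0.1)    # crude termination for the toy example
            except queue.Empty:
                return
            # ... left-looking factorization of column/panel j would go here ...
            with lock:
                order.append(j)
                p = parent[j]
                if p is not None:
                    pending[p] -= 1
                    if pending[p] == 0:
                        ready.put(p)          # parent has become ready

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(order)                              # a valid bottom-up ordering of the etree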

SuperLU_DIST [Li/Demmel/Grigori]
- MPI
- Right-looking: relatively more WRITEs than READs
- 2D block-cyclic layout over the process mesh (sketched below)
- One-step look-ahead to overlap communication & computation
- Scales to 1000s of processors

(figure: matrix blocks mapped block-cyclically onto a 2 x 3 process mesh, processes 0-5, with the active panel highlighted)
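In a 2D block-cyclic layout over a Pr x Pc process mesh, matrix block (I, J) lives at mesh position (I mod Pr, J mod Pc). A minimal sketch reproducing the 2 x 3 mesh of the figure (the function name and the row-major rank numbering are assumptions for illustration):

    def owner(I, J, Pr=2, Pc=3):
        """Rank (row-major in the Pr x Pc process mesh) owning matrix block (I, J)."""
        return (I % Pr) * Pc + (J % Pc)

    # Owners of the blocks of a 6 x 6 block matrix on a 2 x 3 mesh:
    # rows alternate 0 1 2 0 1 2 and 3 4 5 3 4 5, as in the figure.
    for I in range(6):
        print(" ".join(str(owner(I, J)) for J in range(6)))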

PAPI load/store counters
- PAPI (Performance Application Programming Interface): a portable interface to hardware performance counters
- The right-looking code (slu_dist) executes several times more load and store instructions
- STOREs are costly: cache-coherence traffic and lower bandwidth

Clovertown, SuperLU_DIST
- MPICH can be configured in one of two modes:
  - ch_shmem: within a socket
  - ch_p4: across sockets
- MPICH needs a hybrid mode (not yet available!!)

Clovertown, SuperLU_MT
- Maximum speedup about 4, smaller than on a conventional SMP
- Pthreads scale better

VictoriaFalls: multicore + multithreading
- SuperLU_MT vs. SuperLU_DIST
- Pthreads are more robust and scale better
- MPICH crashes with large tasks: a mismatch between the coarse-grain and fine-grain models

Triangular solution (SuperLU_DIST)

    x_i = ( b_i - sum_{j=1}^{i-1} L_ij * x_j ) / L_ii

- Higher level of dependency
- Lower arithmetic intensity (flops per byte of DRAM access or communication)

(figure: L mapped block-cyclically onto the 2 x 3 process mesh; partial sums are accumulated along block rows of the solve)
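The recurrence above is plain forward substitution: each nonzero of L is read once and used in a single multiply-add, which is why the arithmetic intensity is so low. A sequential sketch on a SciPy CSR matrix (the function name is made up; SuperLU_DIST's distributed solve is organized by supernodes and process columns, not row by row):

    import numpy as np
    import scipy.sparse as sp

    def forward_solve(L, b):
        """Row-oriented forward substitution: x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii.

        Each nonzero of L is loaded exactly once and contributes one
        multiply-add, so the kernel streams the factor from memory with
        only ~2 flops per matrix entry read.
        """
        n = L.shape[0]
        x = np.zeros(n)
        indptr, indices, data = L.indptr, L.indices, L.data
        for i in range(n):
            s = b[i]
            diag = 0.0
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j < i:
                    s -= data[k] * x[j]
                elif j == i:
                    diag = data[k]
            x[i] = s / diag
        return x

    # Small check against the dense triangular system.
    L = sp.csr_matrix(np.array([[2., 0., 0.], [1., 3., 0.], [0., 1., 4.]]))
    b = np.array([2., 5., 9.])
    x = forward_solve(L, b)
    assert np.allclose(L.toarray() @ x, b)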

Triangular solution: PAPI counts of floating-point operations versus load instructions (the flops-to-load ratio).

Triangular solution
- Clovertown: 8 cores; IBM Power5: 8 CPUs/node
- Time of MPI_Reduce with 1, 2, 4, 8 tasks: .9, .5, 1.28, 2.52

Final remarks
- Results are preliminary; the findings may be applicable to other algorithms, such as an ILU preconditioner.
- Right-looking (and probably multifrontal) factorization incurs more memory traffic.
- Hybrid algorithms and hybrid programming will be beneficial:
  - left-looking + right-looking
  - threading + MPI
  - but they require significant code rewriting
- Good runtime profiling tools are needed to study multicore scaling: how to calibrate memory and other contention in the system?

Test matrices

Matrix     Application                Dim       nnz(A)   Fill, SLU_MT   Fill, SLU_DIST   Avg. supernode
g7jac200   economic model             59,310    0.7 M    .7 M           .7 M             1.9
stomach    3D finite diff.            213,360   3.0 M    16.8 M         17.4 M           4.
torso1     3D finite diff.            116,158   8.5 M    26.9 M         27. M            4.
twotone    nonlinear analog circuit   120,750   1.2 M    11.4 M         11.4 M           2.

SuperLU_MT was run in symmetric mode: ordering on A+A', pivoting on the main diagonal.