
The LINPACK Benchmark on the Fujitsu AP 1000
Richard P. Brent
Computer Sciences Laboratory
Australian National University

The LINPACK Benchmark
A popular benchmark for floating-point performance. Involves the solution of a nonsingular system of n equations in n unknowns by Gaussian elimination with partial pivoting.

Three Cases
- n = 100: the original benchmark (too easy for our purposes).
- n = 1000: often used to compare vector processors and parallel computers.
- n >> 1000: often used to compare massively parallel computers.

Assumptions
- Assume double-precision arithmetic (64-bit).
- Interested in n ≥ 1000.
- Assume the coefficient matrix is available in the processors.
- Use C indexing conventions: indices 0, 1, ...; row-major ordering.
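For context on how the Gflop rates quoted later are obtained, here is a minimal sketch of the rate calculation, assuming the standard LINPACK operation count of 2n^3/3 + 2n^2 flops for an n x n solve (this convention is not stated on the slides; the function name and example time are illustrative only).

```c
/* Hedged sketch: converting a solve time into a LINPACK rate, assuming the
 * standard LINPACK operation count of 2n^3/3 + 2n^2 flops for solving an
 * n x n system by Gaussian elimination.  Values are illustrative only.
 */
#include <stdio.h>

static double linpack_mflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flops / seconds / 1.0e6;   /* Mflop/s */
}

int main(void)
{
    /* e.g. n = 1000 solved in 1.10 seconds (the 512-cell time from the
     * results table later in these slides). */
    printf("%.1f Mflop/s\n", linpack_mflops(1000.0, 1.10));
    return 0;
}
```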

Hardware
The Fujitsu AP 1000 (also known as the CAP II) is a MIMD machine with up to 1024 independent 25 MHz SPARC processors (called cells). Each cell has 16 MB RAM, a 128 KB cache, and a Weitek floating-point unit capable of 5.56 Mflop for overlapped multiply and add.

Communication
The topology of the AP 1000 is a torus with wormhole routing. The theoretical bandwidth between any pair of cells is 25 MB/sec. In practice, because of system overheads, copying of buffers, etc., only about 6 MB/sec is attainable by user programs.

Data Distribution
Possible ways of storing matrices (data and results) on the AP 1000 are:
- column wrapped
- row wrapped
- scattered (= row and column wrapped)
- blocked versions of these
We chose the scattered representation because of its good load-balancing and communication bandwidth properties.

Scattered Storage
On a 2 by 2 cell configuration, a 4 by 6 matrix would be stored as follows, where each element is shown by its index pair ij and the colour-coding of the original slide (not reproduced here) indicates the cell where the element is stored:

  00 01 02 03 04 05
  10 11 12 13 14 15
  20 21 22 23 24 25
  30 31 32 33 34 35

Scattered Storage - Global-to-Local Mapping
On a machine configuration with ncelx × ncely cells (x, y), 0 ≤ x < ncelx, 0 ≤ y < ncely, element a_{i,j} is stored in cell (j mod ncelx, i mod ncely) with local indices i′ = i div ncely, j′ = j div ncelx. (Sorry about the confusing (i, j) and (y, x) conventions!) A C sketch of this mapping is given after these slides.

Blocked Storage
If the above definition of scattered storage is applied to a block matrix with b by b blocks, then we get the blocked panel-wrapped representation. Choosing larger b reduces the number of communication steps but worsens the load balance. We use b = 1, but b > 1 has been used on other local-memory machines (e.g. the Intel Delta).

Blocked Matrix Operations
The rank-1 updates in Gaussian elimination can be grouped into blocks of ω so that rank-ω updates can be performed using level-3 BLAS (i.e. matrix-matrix operations). The two possible forms of blocking are independent: we can have b > 1 or ω > 1 or both. If both, then b = ω is convenient but not necessary. In our implementation b = 1, ω ≥ 1.

Gaussian Elimination
The idea of Gaussian Elimination (G.E.) is to transform a nonsingular linear system Ax = b into an equivalent upper triangular system Ux = b′, which is (relatively) easy to solve for x. It is also called LU Factorization because PA = LU, where P is a permutation matrix and L is lower triangular.
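A minimal sketch of the global-to-local mapping just described, in C. The struct and function names are my own, not the AP 1000 library API; only the arithmetic (cell (j mod ncelx, i mod ncely), local indices (i div ncely, j div ncelx)) comes from the slide.

```c
/* Sketch of the scattered (row- and column-wrapped) storage mapping
 * described above.  Names are illustrative, not the AP 1000 library API. */
#include <stdio.h>

typedef struct {
    int cellx, celly;   /* owning cell coordinates (x, y)    */
    int li, lj;         /* local row and column on that cell */
} Location;

static Location scattered_map(int i, int j, int ncelx, int ncely)
{
    Location loc;
    loc.cellx = j % ncelx;   /* columns are wrapped over x */
    loc.celly = i % ncely;   /* rows are wrapped over y    */
    loc.li    = i / ncely;   /* local row index            */
    loc.lj    = j / ncelx;   /* local column index         */
    return loc;
}

int main(void)
{
    /* Reproduce the 4 by 6 example on a 2 by 2 cell configuration. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++) {
            Location loc = scattered_map(i, j, 2, 2);
            printf("a[%d][%d] -> cell(%d,%d) local(%d,%d)\n",
                   i, j, loc.cellx, loc.celly, loc.li, loc.lj);
        }
    return 0;
}
```

Running it for the 4 by 6 matrix on a 2 by 2 configuration reproduces the ownership pattern of the scattered-storage example above.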

A Typical Step of G.E.
[Diagram not reproduced: the matrix is converted by row operations (a rank-1 update) so that the entries below the pivot element become zero, leaving the pivot row unchanged and updating the active region.]

Comments
In the diagram, x denotes a nonzero element; the colour-coding (lost in this transcription) distinguishes the pivot element, the elements to be zeroed, the pivot row, and the active region. Row interchanges are generally necessary to bring the pivot element into the correct position (a serial sketch of one elimination step appears after these slides). The right-hand side vector has been stored as the last column of the (augmented) matrix.

Communication Requirements for G.E.
Pivot selection requires finding the largest element in (part of) a column; then, if necessary, two rows are interchanged. (We do this explicitly.) The rank-1 update requires vertical broadcast (y_brd) of the pivot row and horizontal broadcast (x_brd) of the multiplier column.

x_brd and y_brd
The AP 1000 has hardware support for x_brd and y_brd, so these can be performed in the same time as a single cell to cell communication. (A binary tree with O(log n) communication overheads is not required.)
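A serial, single-cell sketch of one elimination step as described above (pivot search, explicit row interchange, rank-1 update of the active region), assuming the right-hand side is stored as the last column of an augmented n × (n + 1) row-major matrix. The function name and layout are illustrative; the distributed parts (scattered storage, x_brd/y_brd broadcasts) are omitted entirely.

```c
/* One step of Gaussian elimination with partial pivoting on an augmented
 * n x (n+1) matrix stored row-major (C indexing).  Serial sketch only:
 * no scattered storage and no inter-cell communication. */
#include <math.h>

void ge_step(double *a, int n, int k)      /* a has n rows, n+1 columns */
{
    int m = n + 1;                         /* row length of augmented matrix */

    /* Pivot selection: largest |a[i][k]| for i >= k. */
    int piv = k;
    for (int i = k + 1; i < n; i++)
        if (fabs(a[i * m + k]) > fabs(a[piv * m + k]))
            piv = i;

    /* Explicit row interchange, as on the slide. */
    if (piv != k)
        for (int j = k; j < m; j++) {
            double t = a[k * m + j];
            a[k * m + j]   = a[piv * m + j];
            a[piv * m + j] = t;
        }

    /* Rank-1 update of the active region: zero the column below the pivot. */
    for (int i = k + 1; i < n; i++) {
        double mult = a[i * m + k] / a[k * m + k];
        a[i * m + k] = 0.0;
        for (int j = k + 1; j < m; j++)
            a[i * m + j] -= mult * a[k * m + j];
    }
}
```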

Memory Refs per Flop
The ratio R = (loads and stores) / (flops) is important because it is impossible to keep the floating-point unit busy unless R < 1. Rank-1 updates a_ij ← a_ij + u_i * v_j have R ≥ 1. To reduce R and improve performance, we need blocking (ω rank-1 updates → one rank-ω update).

G.E. with Blocking
Defer operations on the region labelled D until ω steps of G.E. have been performed. Then the rank-ω update is simply D ← D − BC and can be performed by level-3 BLAS without inter-cell communication (a hedged BLAS sketch follows the results table below).

Choice of ω
Operations in the vertical strip of width ω and the horizontal strip of depth ω are done using rank-1 updates (slow), so we want ω to be small. However, level-3 BLAS for rank-ω updates are slow unless ω is large. The optimum choice is usually ω ≈ n^(1/2). However, ω should be small enough that the parts of the strips stored on each cell fit in the cache.

LINPACK Benchmark Results (n = 1000) on the AP 1000

  cells   time (sec)   speedup   efficiency
    512        1.10      147        0.29
    256        1.50      108        0.42
    128        2.42       66.5      0.52
     64        3.51       46.0      0.72
     32        6.71       24.0      0.75
     16       11.5        13.9      0.87
      8       22.6         7.12     0.89
      4       41.3         3.90     0.97
      2       81.4         1.98     0.99
      1      160           1.00     1.00
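For the deferred rank-ω update D ← D − BC described under "G.E. with Blocking", here is a hedged sketch using the CBLAS interface to a level-3 BLAS routine. The original implementation used assembler BLAS 3 inner loops on the cells, not CBLAS; taking B as the m × ω multiplier block and C as the ω × n pivot-row block is my reading of the update, since the slide's diagram is not preserved in this transcription.

```c
/* Hedged sketch of the deferred rank-w update D <- D - B*C via level-3 BLAS
 * (CBLAS interface assumed available).  B is m x w, C is w x n, D is m x n,
 * all row-major with the given leading dimensions. */
#include <cblas.h>

void rank_w_update(int m, int n, int w,
                   const double *B, int ldb,
                   const double *C, int ldc,
                   double *D, int ldd)
{
    /* D := -1.0 * B * C + 1.0 * D, i.e. D <- D - B*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, w,
                -1.0, B, ldb,
                      C, ldc,
                 1.0, D, ldd);
}
```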

Comparison for n = 1000 using Dongarra's Table 2
[Graph not reproduced: comparison of the AP 1000 with other machines for n = 1000, based on Dongarra's Table 2.] The AP 1000 is fastest for 128 cells and shows little "tail-off" as the number of cells increases.

LINPACK Benchmark Results (n large) on the AP 1000

  cells   r_max (Gflop)   n_max   n_half   r_max / r_peak
    512        2.251      25600    2500         0.79
    256        1.162      18000    1600         0.82
    128        0.566      12800    1100         0.80
     64        0.291      10000     648         0.82
     32        0.143       7000     520         0.80
     16        0.073       5000     320         0.82

Note the high ratio r_max / r_peak and the large ratio n_max / n_half.

Comparison of Options on 64-cell AP 1000
[Graph not reproduced.] The graph shows the effect of turning off blocking, hardware x_brd and y_brd, or the assembler BLAS 3 inner loops.

Conclusions
The Fujitsu AP 1000 is a well-balanced machine for linear algebra. It is possible to attain at least 50% of peak performance over a wide range of problem sizes. Hardware support for x and y broadcast is a good feature. The communication speed is high and startup costs are low relative to the floating-point speed (which is slow by current standards).
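As a small consistency check of the "at least 50% of peak" conclusion, the last column of the (n large) table can be recomputed from r_peak = cells × 5.56 Mflop, the per-cell peak quoted earlier. The sketch below is illustrative only; it simply replays the table's r_max values.

```c
/* Recompute r_peak and r_max/r_peak from the (n large) results table,
 * taking the per-cell peak as 5.56 Mflop (overlapped multiply-add). */
#include <stdio.h>

int main(void)
{
    const double per_cell_peak_mflops = 5.56;
    const int    cells[]   = { 512, 256, 128, 64, 32, 16 };
    const double rmax_gf[] = { 2.251, 1.162, 0.566, 0.291, 0.143, 0.073 };

    for (int k = 0; k < 6; k++) {
        double rpeak_gf = cells[k] * per_cell_peak_mflops / 1000.0;
        printf("%4d cells: r_peak = %.3f Gflop, r_max/r_peak = %.2f\n",
               cells[k], rpeak_gf, rmax_gf[k] / rpeak_gf);
    }
    return 0;
}
```

All six ratios come out in the 0.79 to 0.82 range, consistent with the table and comfortably above 50% of peak.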