Advanced Numerical Techniques for Cluster Computing


Presented by Piotr Luszczek
http://icl.cs.utk.edu/iter-ref/

Presentation Outline
- Motivation: hardware
- Dense matrix calculations
- Sparse direct solvers
- Iterative methods
- Other applications

Floating-Point Operations Per Cycle

                                   Single               Double
Processor                          x87   SSE   3DNow!   x87   SSE2
Pentium                            1                    1
Pentium II                         1                    1
Pentium III                        1     4              1
Pentium 4                          1     4              1     2
Athlon                             2           4        2
Athlon4                            2     4     4        2
AthlonMP                           2     4     4        2
AltiVec, VMX (G4, G5, POWER6)

Speed of Single versus Double Precision

Processor                       BLAS        N     speak/dpeak  SGEMM/DGEMM  SGETRF/DGETRF
IBM Cell 2.4 GHz                custom      3000  ~10          ~10          ~10
IBM Cell 3.2 GHz                custom      3000  ~10          ~10          ~10
Intel Pentium III Coppermine    Goto        3500  2            2.10         2.24
Intel Pentium III Katmai        Goto        3000  2            2.12         2.11
Intel Pentium IV Prescott       Goto        4000  2            2.00         1.86
Intel Pentium IV-M Northwood    Goto        4000  2            2.02         1.98
AMD Opteron                     Goto        4000  2            1.98         1.93
Cray X1                         SciLib      4000  2            1.68         1.54
IBM PowerPC G5 2.7 GHz          VecLib      5000  2            2.29         2.05
Sun UltraSPARC IIe              Sunperf     3000  2            1.45         1.79
Compaq Alpha EV6                CXML        3000  1            0.99         1.08
IBM SP POWER3                   ESSL        3000  1            1.03         1.13
SGI Octane                      ATLAS       2000  1            1.08         1.13
Intel Itanium 2                 Goto/ATLAS  1500  1            0.71

(speak/dpeak is the ratio of single- to double-precision peak rates; SGEMM/DGEMM and SGETRF/DGETRF are single- over double-precision performance ratios at matrix order N.)
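The roughly 2x single-over-double advantage on commodity SSE hardware (and ~10x on Cell) is what mixed precision exploits. As a rough check, a small timing script like the following, assuming only NumPy linked against some optimized BLAS, reports the SGEMM/DGEMM ratio on whatever machine it runs on; the matrix size and repetition count are arbitrary.

    import time
    import numpy as np

    def gemm_gflops(dtype, n=2000, reps=3):
        # Time n x n matrix products and report the rate in Gflop/s
        # (2*n^3 floating-point operations per product).
        a = np.random.rand(n, n).astype(dtype)
        b = np.random.rand(n, n).astype(dtype)
        a @ b                                   # warm-up
        t0 = time.perf_counter()
        for _ in range(reps):
            a @ b
        elapsed = time.perf_counter() - t0
        return 2.0 * n**3 * reps / elapsed / 1e9

    sgemm = gemm_gflops(np.float32)
    dgemm = gemm_gflops(np.float64)
    print(f"SGEMM {sgemm:.1f} Gflop/s, DGEMM {dgemm:.1f} Gflop/s, "
          f"ratio {sgemm / dgemm:.2f}")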

Solving a System of Linear Equations Ax = b

Standard technique:
  L U P = lu(A)        O(n^3)
  x = U \ (L \ P*b)    O(n^2)
  r = b - A*x          O(n^2)   hopefully r is small

Accumulate the residual in extra precision, hence the name: mixed precision.

Iterative refinement (old):
  L U P = lu(A)        O(n^3)
  x = U \ (L \ P*b)    O(n^2)
  r = b - A*x          O(n^2)
  while r too big do   O(1) iterations
    z = U \ (L \ P*r)  O(n^2)
    x = x + z          O(n)
    r = b - A*x        O(n^2)
  end
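For reference, here is a minimal NumPy/SciPy sketch of this classical refinement loop, with every step kept in working (double) precision and an illustrative stopping test; the original algorithm would accumulate the residual in extra precision.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def solve_with_refinement(A, b, max_iter=10, tol=1e-12):
        lu, piv = lu_factor(A)                 # L U P = lu(A), O(n^3)
        x = lu_solve((lu, piv), b)             # x = U \ (L \ P*b), O(n^2)
        for _ in range(max_iter):
            r = b - A @ x                      # residual, O(n^2)
            if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
                break                          # r is small enough
            x = x + lu_solve((lu, piv), r)     # z = U \ (L \ P*r); x = x + z
        return x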

New Iterative Refinement: Focus on Speed (but Keep Accuracy)

  L U P = lu(A)        O(n^3)   single precision
  x = U \ (L \ P*b)    O(n^2)   single precision
  r = b - A*x          O(n^2)   double precision
  while r too big do
    z = U \ (L \ P*r)  O(n^2)   single precision
    x = x + z          O(n)     double precision
    r = b - A*x        O(n^2)   double precision
  end
(On the slide, single-precision steps are printed in black and double-precision steps in red.)

Contributions:
- Speed is the first priority; the accuracy remains acceptable
- Updated error analysis

Caveats:
- The condition number of A must be < 10^8 (helps convergence)
- The values of A cannot over/underflow in single precision
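A minimal sketch of this mixed-precision variant in NumPy/SciPy, assuming only scipy.linalg's LU routines: the factorization and the correction solves run on float32 data, while the residual and the solution update stay in float64. The tolerance and iteration limit are illustrative; LAPACK's DSGESV implements the production version of this idea.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, max_iter=30, tol=1e-14):
        A32 = A.astype(np.float32)                       # one-time cast to single
        lu32, piv = lu_factor(A32)                       # O(n^3) in single precision
        x = lu_solve((lu32, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                                # residual in double precision
            if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
                break
            z = lu_solve((lu32, piv), r.astype(np.float32))  # correction in single
            x = x + z.astype(np.float64)                 # update in double
        return x

Because the O(n^3) work happens entirely in single precision, the run time approaches that of a single-precision solve while the answer reaches double-precision accuracy, provided cond(A) stays well below 1/eps_single, roughly 10^8.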

Single versus Double Precision: Revisited

Processor                       BLAS        N     speak/dpeak  SGEMM/DGEMM  SGETRF/DGETRF  DGESV/DSGESV  # Iter
IBM Cell 2.4 GHz                custom      3000  ~10          ~10          ~10            ~8            4
IBM Cell 3.2 GHz                custom      3000  ~10          ~10          ~10            ~8            4
Intel Pentium III Coppermine    Goto        3500  2            2.10         2.24           1.92          4
Intel Pentium III Katmai        Goto        3000  2            2.12         2.11           1.79          4
Intel Pentium IV Prescott       Goto        4000  2            2.00         1.86           1.57          5
Intel Pentium IV-M Northwood    Goto        4000  2            2.02         1.98           1.84          5
AMD Opteron                     Goto        4000  2            1.98         1.93           1.53          5
Cray X1                         SciLib      4000  2            1.68         1.54           1.38          7
IBM PowerPC G5 2.7 GHz          VecLib      5000  2            2.29         2.05           1.24          5
Sun UltraSPARC IIe              Sunperf     3000  2            1.45         1.79           1.58          4
Compaq Alpha EV6                CXML        3000  1            0.99         1.08           1.01          4
IBM SP POWER3                   ESSL        3000  1            1.03         1.13           1.00          3
SGI Octane                      ATLAS       2000  1            1.08         1.13           0.91          4
Intel Itanium 2                 Goto/ATLAS  1500  1            0.71

(DGESV/DSGESV is the speedup of the mixed-precision solver over the full double-precision solver; # Iter is the number of refinement iterations.)

Single vs. Double Precision in Parallel
- Cluster of dual AMD Opteron processors; BLAS: Goto; MPI: Open MPI; interconnect: Myricom MX
- Notice the very large N: the factorization dominates the time (O(N^3)), and the iterative refinement steps are of negligible cost (O(N^2))

Proc  N      PDGETRF/PSDGETRF  PDGESV/PSDGESV  # Iter
32    22627  1.85              1.79            6
64    32000  1.90              1.83            6

Using Quad Precision: "Single" = 64 bits, "Double" = 128 bits

   N    QGESV time (s)  QDGESV time (s)  Speedup
  100         0.29            0.03          9.67
  200         2.27            0.10         22.70
  300         7.61            0.24         31.71
  400        17.81            0.44         40.48
  500        34.71            0.69         50.30
  600        60.11            1.01         59.51
  700        94.95            1.38         68.80
  800       141.75            1.83         77.46
  900       201.81            2.33         86.61
 1000       276.94            2.92         94.84

Hardware/software: Intel Xeon 3.2 GHz, Intel Fortran compiler, reference BLAS.
Expected accuracy: 32 decimal digits; up to 3 iterations needed.
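The same loop extends one level up: factor in 64-bit, refine against a 128-bit residual. Below is a rough, hypothetical emulation that uses mpmath (set to 113-bit precision) in place of true quad arithmetic; it reproduces the behavior of QDGESV in the table, not its performance, and the problem size and seed are arbitrary.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve
    from mpmath import mp, matrix

    mp.prec = 113                      # binary128-like precision, ~34 decimal digits

    n = 200
    rng = np.random.default_rng(0)
    A64 = rng.standard_normal((n, n))
    b64 = rng.standard_normal(n)
    Aq = matrix(A64.tolist())          # "quad"-precision copies of A and b
    bq = matrix(b64.tolist())

    lu, piv = lu_factor(A64)           # O(n^3) factorization in 64-bit only
    xq = matrix(lu_solve((lu, piv), b64).tolist())

    for _ in range(3):                 # slide: up to 3 iterations needed
        r = bq - Aq * xq               # residual in 128-bit arithmetic
        r64 = np.array([float(r[i]) for i in range(n)])
        z = lu_solve((lu, piv), r64)   # correction with the 64-bit factors
        xq = xq + matrix(z.tolist())

    r = bq - Aq * xq
    print(max(abs(r[i]) for i in range(n)))   # residual near the 128-bit level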

Running on IBM Cell Blade @ 2.4 GHz
[Performance plot: Gflop/s (0 to 160) versus matrix size (0 to 4000) for peak SP, linpack-sp, iter-ref, peak DP, and linpack-dp.]

Running on IBM Cell Blade @ 3.2 GHz
[Performance plot: Gflop/s (0 to 220) versus matrix size (0 to 4000) for peak SP, linpack-sp, iter-ref, peak DP, and linpack-dp.]

Linpack Report, 11/4/2006, page 76 (Rmax at full precision)

Computer                                 Procs/Cores  Rmax (Gflop/s)  Nmax   N1/2   Rpeak (Gflop/s)
Sun Fire 6900 (UltraSPARC IV, 1.2 GHz)   48           98.26           96116  8300   115.2
IBM Cell BE (3.2 GHz) *****              9            98.05           4096   1536   204.8 (32 bit); 14.6 (64 bit)
SGI Altix 3000, 900 MHz                  32           97.67           82079  82079  115

- Linpack matrices are uniformly random, centered around 0 (distribution U(0,1)); condition number ~ N
- The mixed-precision approach works on such matrices
- Mixed precision is not legal for TOP500 and HPCC submissions: it would distort the historical data

Applying Mixed Precision to Sparse Direct Solvers
- 41 real-world matrices from Tim Davis' collection, of varying disciplines, structure, and conditioning (10 to 10^16)
- Solvers: MUMPS, multifrontal (matrix-matrix kernels); SuperLU (sequential), supernodal (matrix-vector kernels)
- Machines: AMD Opteron 246; IBM PowerPC; Intel Core Duo, Pentium III, Xeon, Woodcrest; Sun UltraSPARC IIe
- Speedup: MUMPS 7-150%, SuperLU 0-30%
- Accuracy: same as full double precision* (2 failures)
  * can do better if slightly less accuracy is acceptable
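The dense refinement loop carries over with the LU factors replaced by a sparse factorization. A small illustration with SciPy's SuperLU interface follows; here the "single precision" factorization is only emulated by rounding the matrix to float32 before factoring in double, so the sketch shows the accuracy mechanism rather than the MUMPS/SuperLU speedups quoted above, and the test matrix is made up.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 10_000
    # Made-up, well-conditioned test matrix: shifted 1-D Laplacian in CSC format.
    A = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    b = np.ones(n)

    # Emulate a single-precision factorization by factoring the float32-rounded matrix.
    A_low = A.astype(np.float32).astype(np.float64)
    factors = splu(A_low)

    x = factors.solve(b)
    for _ in range(10):
        r = b - A @ x                           # residual against the true matrix
        if np.linalg.norm(r, np.inf) <= 1e-14 * np.linalg.norm(b, np.inf):
            break
        x = x + factors.solve(r)                # reuse the (perturbed) sparse factors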

Application to Iterative Methods

Original idea: preconditioned Richardson iteration,
  x_{k+1} = (I - A_s\A) x_k + A_s\b
where A_s\ denotes a solve with the lower-precision factorization of A.
- A lower-precision factorization is an ideal preconditioner
- But this doesn't work for real-world sparse matrices: the condition numbers are large, and forming even the lower-precision factors is too expensive

Basic idea:
- Use single precision (mostly) in the preconditioner
- Use an inner-outer iteration to embed an iterative method as the lower-precision solver (sketched below):
  - PCG-PCG: the inner PCG carries the computational load
  - GMRES-FGMRES: the outer method must accommodate a non-constant preconditioner
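A sketch of the inner-outer idea with SciPy's Krylov solvers: the outer iteration runs in float64 and each preconditioner application is a fixed, small number of inner CG iterations on a float32 copy of the matrix. As the slide notes, a flexible outer method (FGMRES) is what properly accommodates such a non-constant preconditioner; plain outer CG is used here only to keep the illustration short, and the matrix and iteration counts are invented.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg, LinearOperator

    n = 20_000
    # Made-up SPD test matrix: shifted 1-D Laplacian.
    A = sp.diags([-1.0, 2.1, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    A32 = A.astype(np.float32)                  # single-precision copy for the inner solver

    def inner_precond(r):
        # Preconditioner application: a few CG iterations in single precision.
        z, _ = cg(A32, r.astype(np.float32), maxiter=20)
        return z.astype(np.float64)

    M = LinearOperator((n, n), matvec=inner_precond, dtype=np.float64)

    x, info = cg(A, b, M=M, maxiter=200)        # outer iteration in double precision
    print(info, np.linalg.norm(b - A @ x, np.inf))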

Mixed-Precision Iterative Methods: Results

Model problem: adaptive discretization of a 3D linear elasticity problem on a tetrahedral mesh with piecewise-linear elements.

  N        NNZ         log10(cond)
  11,142   442,225     3
  25,980   1,061,542   4
  79,275   3,374,736   4
  230,793  9,991,028   5
  602,091  26,411,323  5

Speedup:
- Standard method: 50-120%
- Mixed-precision PCG: 40-100%
- Mixed-precision GMRES: 40-90%

Mixed-Precision Refinement Technique
- Linear systems: LU (dense and sparse), Cholesky (symmetric), QR
- Eigenvalue problems: symmetric/Hermitian, SVD; reduce to special form in lower precision, improve in higher precision (O(n^2) per value/vector)
- Iterative methods: relaxed GMRES; inner/outer iteration

Collaborators
- Alfredo Buttari
- Jack Dongarra
- Jakub Kurzak
- Julie Langou
- Julien Langou
- Stanimire Tomov

http://icl.cs.utk.edu/iter-ref/