Self Adapting Numerical Software (SANS-Effort)

1 Self Adapting Numerical Software (SANS-Effort)

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee and Oak Ridge National Laboratory

2 Work on Self Adapting Software
1. LAPACK for Clusters
2. Collective communication optimization
3. Dynamic algorithm choice depending on the data
4. Generic Code Optimization techniques applied to highly parallel systems
5. Fault tolerant linear algebra approaches

3 Motivation: Self Adapting Numerical Software (SANS) Effort

Optimizing software to exploit the features of a given processor has historically been an exercise in hand customization:
- Time consuming and tedious
- Hard to predict performance from source code
- Growing list of kernels to tune and/or choose from
- Must be redone for every architecture and compiler
- Compiler technology often lags architecture

Best algorithm may depend on input, so some tuning may be needed at run-time.
Not all algorithms are semantically or mathematically equivalent.
Need for quick/dynamic deployment of optimized routines.

4 Linear Algebra Software Packages
icl.cs.utk.edu/lapack/

LAPACK
- Used by Matlab, Mathematica, Numeric Python
- Tuned versions provided by vendors: AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics, and by most Linux distributions (Fedora, Debian, Cygwin, ...)
- Ongoing work: multi-core, performance, accuracy, extended precision, ease of use

ScaLAPACK
- Parallel implementation of LAPACK, scaling on parallel hardware from 10s to 100s to 1000s of processors
- Ongoing work: target new architectures and new parallel environments, for example a port to the Microsoft HPC cluster solution

LAPACK for Clusters (LFC)
- Most of ScaLAPACK functionality from serial clients (Matlab, Python, Mathematica)
- Ongoing work: looking at sparse data and I/O scenarios, web services

5 LFC Ease of Deployment

LFC clients: C, Mathematica, Matlab, Python

[Diagram: LFC software stack. Labels: ScaLAPACK, LAPACK, PBLAS, BLACS, MPI, BLAS (LFC), BLAS (vendor); serial interface, parallel interface; global addressing, local addressing; portable, platform specific; External to LFC]

Only one file to download.
Just type: ./configure && make && make install

6 LFC Overview

[Diagram. Labels: Interactive clients (Mathematica, Matlab, Python); Server; Software; Clients; Tunnel (IP, TCP, ...); 0,0; Firewall; Server]

7 LFC: Behind the Scenes

[Diagram. Labels and code fragments, in slide order:]
    x = lfc.gesv(A, b)
    Batch mode bypass
    x = b.copy()
    command('pdgesv', A.id, x.id)
    call pdgesv(A, x)
    send(c_sckt, buf)
    Internet / Intranet / Shared memory ...
    recv(s_sckt, buf)

8 LFC: Current Functionality

Linear systems (via factorizations)
- Cholesky: A = U^T U, A = L L^T
- Least squares: A = QR
- Gaussian: PA = LU

Singular- and eigen-value problems
- A = U Σ V^T (thin SVD)
- AV = VΛ = A^H V (symmetric AEP)
- AV = VΛ (non-symmetric AEP)

Norms, condition number estimates

Precision, data types
- Single, double
- Real, complex
- Mixed precision (by upcasting)

User data routines
- Loading/Saving: MPI I/O, ...
- Generating: Plugins
- Moving

More to come
- Now working on sparse matrix support
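As a small illustration of the Cholesky form A = L L^T listed above, the following is a minimal sequential sketch in plain Python. It is not LFC code: the function names `cholesky` and `times_transpose` and the 3x3 test matrix are assumptions made for this example, and it assumes a real, symmetric positive definite input.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T (A real, symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: A[j][j] minus the contribution of earlier columns.
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def times_transpose(L):
    """Compute the product L L^T, used here to check the factorization."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical example matrix, chosen so that the factor has small integer entries.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
for row in L:
    print(row)
```

Multiplying the factor back with `times_transpose(L)` should reproduce A up to rounding, which is a cheap sanity check for any factorization routine.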

9 Sample LFC Code: Linear System Solve

Matlab with LFC (parallel):
    n = lfc(1000); nrhs = 1;
    A = rand(n); b = rand(n, 1);
    x = A \ b;
    r = A*x - b; norm(r, 'fro')

Python with LFC (parallel):
    n = 1000
    nrhs = 1
    A = lfc.rand(n)
    b = lfc.rand(n, 1)
    x = lfc.solve(A, b)
    r = A*x - b
    print(r.norm('F'))

Matlab no LFC (sequential):
    n = 1000; nrhs = 1;
    A = rand(n); b = rand(n, 1);
    x = A \ b;
    r = A*x - b; norm(r, 'fro')

Package to be released at SC06
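For readers without LFC, the same solve-and-check pattern can be sketched sequentially in plain Python. This is only an illustration of the workflow (solve, form the residual r = A*x - b, report its norm), using Gaussian elimination with partial pivoting (PA = LU); the helper names `solve` and `residual_norm` are assumptions for this sketch, and n is kept small because pure Python is slow for n = 1000.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs stay untouched.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def residual_norm(A, x, b):
    """Frobenius norm of r = A*x - b (for a vector this is the Euclidean norm)."""
    r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return math.sqrt(sum(ri * ri for ri in r))

random.seed(0)
n = 50
A = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]
x = solve(A, b)
print(residual_norm(A, x, b))
```

A residual norm close to machine precision relative to the size of A and b indicates that the computed x satisfies the system to working accuracy.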

10 Work on Self Adapting Software
1. LAPACK for Clusters
2. Collective communication optimization
3. Dynamic applications choice depending on the data
4. Generic Code Optimization techniques applied to highly parallel systems
5. Fault tolerant linear algebra approaches

11 Self Adapting MPI Operations

MPI collective operations
- Frequently used
- Can be a performance bottleneck

MPI collective algorithms
- Numerous in the literature
- Explicit message segmentation may cause performance issues

Performance portability issue
- Different networks may have different performance points
- Tuning collective operations for a particular system
- Ideally, automatic tuning

12 The Approach

[Diagram labels: MPI collective algorithm implementations; Exhaustive Testing; Performance Modeling; Decision Process; Decision Function; Optimal MPI collective implementation]

13 Decision Selection Process
- Parametric data modeling: use algorithm performance models (Hockney, LogGP, PLogP, ...) to select the algorithm with the shortest completion time, based on the input of the collective and system parameters
- Image encoding techniques: use image encoding algorithms to capture information about algorithm switching points
- Statistical learning methods: use statistical learning methods to find patterns in algorithm performance data and to construct decision systems
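The parametric modeling idea can be sketched for a broadcast under the Hockney model, where sending m bytes point-to-point costs alpha + beta*m. The three cost formulas below are simplified textbook-style estimates chosen for this illustration, not the models used in the talk; the function names, the 64 KiB segment size, and the alpha/beta values are all assumptions.

```python
import math

def flat_tree(P, m, alpha, beta):
    """Root sends the whole message to each of the other P-1 processes in turn."""
    return (P - 1) * (alpha + beta * m)

def binomial_tree(P, m, alpha, beta):
    """Number of informed processes doubles each round: ceil(log2 P) rounds."""
    return math.ceil(math.log2(P)) * (alpha + beta * m)

def pipeline(P, m, alpha, beta, seg=65536):
    """Chain of P processes forwarding fixed-size segments (assumes m >= 1)."""
    s = min(m, seg)
    n_seg = math.ceil(m / s)
    # P-1 steps to fill the chain, then one step per remaining segment.
    return (P - 1 + n_seg - 1) * (alpha + beta * s)

MODELS = {
    "flat_tree": flat_tree,
    "binomial_tree": binomial_tree,
    "pipeline": pipeline,
}

def choose_broadcast(P, m, alpha, beta):
    """Decision function: name of the model with the shortest predicted time."""
    return min(MODELS, key=lambda name: MODELS[name](P, m, alpha, beta))

# Hypothetical system parameters: 10 microseconds latency, 1 ns per byte.
alpha, beta = 1e-5, 1e-9
for m in (8, 64 * 1024, 64 * 1024 * 1024):
    print(m, choose_broadcast(16, m, alpha, beta))
```

The message size at which the selected algorithm changes is a switching point in the sense used above. With measured rather than modeled timings, the same `min` over candidates would give the exhaustive-testing variant of the decision function.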

14 FT-MPI
mpi/

- Defines the behavior of MPI in the event a failure occurs at the process level.
- FT-MPI is based on MPI 1.3 (plus some MPI 2 features), with a fault tolerant model similar to what was done in PVM.
- Complete reimplementation, not based on other implementations.
- Gives the application the possibility to recover from a process failure.
- A regular, non fault-tolerant MPI program will run using FT-MPI.

What FT-MPI does not do:
- Recover user data (e.g. automatic check-pointing)
- Provide transparent fault-tolerance

Open-MPI for MS

15 Open-MPI Collaborators
- Los Alamos National Lab (LA-MPI)
- Sandia National Lab
- Indiana U (LAM/MPI)
- U of Tennessee (FT-MPI)
- HLRS - U of Stuttgart (PACX-MPI)
- U of Houston
- Cisco Systems
- Mellanox
- Voltaire
- Sun
- IBM
- Myricom
- Qlogic

URL:

16 Work on Self Adapting Software
1. LAPACK for Clusters
2. Collective communication optimization
3. Dynamic applications choice depending on the data
4. Generic Code Optimization techniques applied to highly parallel systems
5. Fault tolerant linear algebra approaches

17 SALSA: Motivation For Sparse Iterative Methods

Ax = b
Problem: How to pick a numerical algorithm?

Difficulties in choosing appropriate numerical algorithms:
1. More than one algorithm can solve a problem
2. Algorithms can have parameters (continuous/discrete). Example: fill-in levels, restart length
3. Unclear influence of data features on decisions

Goals:
- Identify relevant characteristics of a problem
- Predict the most suitable method based on relevant features
- Uncover relationships between characteristics and parameters
- Make decisions automatically

18 Feature Extraction and Identification

First step is to know and understand the features of a system. The features are grouped in these categories:
- Simple: norm-like quantities (1-norm, Frobenius norm, infinity-norm; excludes the 2-norm)
- Normal: departure-from-normality estimates
- Variance: estimates of how different the elements in a matrix are (this is completely heuristic)
- Spectrum: estimates of eigenvalues and singular values
- Structural: properties that are only a function of the nonzero structure

Principal Component Analysis is then used for:
- Identifying important features and correlations
- Eliminating redundant features
- Dimensionality reduction
- Visualization of data clustering (helps analysis)
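A few of the "Simple" and "Structural" features above are cheap to compute in one pass over the nonzeros. The sketch below is not SALSA code; it assumes a real sparse matrix stored as a dictionary mapping (row, column) to value, and the function name `extract_features` and the feature names in the returned dictionary are made up for this example.

```python
import math
from collections import defaultdict

def extract_features(entries, n):
    """Compute simple and structural features of an n x n sparse matrix.

    entries maps (i, j) -> nonzero real value.
    """
    col_sums = defaultdict(float)
    row_sums = defaultdict(float)
    fro_sq = 0.0
    bandwidth = 0
    for (i, j), v in entries.items():
        col_sums[j] += abs(v)   # accumulates the 1-norm (max column sum)
        row_sums[i] += abs(v)   # accumulates the infinity-norm (max row sum)
        fro_sq += v * v         # accumulates the Frobenius norm
        bandwidth = max(bandwidth, abs(i - j))
    return {
        "norm1": max(col_sums.values(), default=0.0),
        "norm_inf": max(row_sums.values(), default=0.0),
        "norm_fro": math.sqrt(fro_sq),
        # Structural features depend only on where the nonzeros are.
        "nnz": len(entries),
        "density": len(entries) / float(n * n),
        "bandwidth": bandwidth,
        "pattern_symmetric": all((j, i) in entries for (i, j) in entries),
    }

# Example input: 4 x 4 tridiagonal matrix with 2 on the diagonal, -1 off it.
n = 4
entries = {}
for i in range(n):
    entries[(i, i)] = 2.0
    if i + 1 < n:
        entries[(i, i + 1)] = -1.0
        entries[(i + 1, i)] = -1.0

features = extract_features(entries, n)
for name, value in features.items():
    print(name, value)
```

A feature vector like this, computed for many matrices, is the kind of input on which Principal Component Analysis and the classifiers discussed later would operate. The 2-norm is absent for the reason the slide gives: unlike the other three, it cannot be read off the entries directly.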

19 Recommendation based on Convergence or Performance

Based on convergence analysis, attempt to train classifiers to recommend:
- A converging configuration method
- The best method among the converging ones

Determining which are the best combinations / configurations of reliability-performance for software implementation:
- How many methods are compared at a time (may result in many combinations/classes)
- Combinations that can be disregarded (e.g. reliability of direct method vs. iterative methods)
- Certain preconditioners that are not useful for some methods

GMRES: 92%
BCGS: 70%
TFQMR: 60%
PREONLY: 98%
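To make the idea of a trained recommender concrete, here is a toy k-nearest-neighbour classifier in plain Python. It is a stand-in, not the classifier used in this work: the two-component feature vectors and their labels in `TRAINING` are invented purely so the code runs and carry no numerical meaning, and the name `recommend` is hypothetical. Only the method labels are taken from the slide.

```python
import math
from collections import Counter

# Made-up training records: (feature vector, method that performed best).
# These numbers are placeholders for illustration, not measured results.
TRAINING = [
    ((0.10, 0.20), "GMRES"),
    ((0.20, 0.10), "GMRES"),
    ((0.15, 0.30), "GMRES"),
    ((0.90, 0.80), "BCGS"),
    ((0.80, 0.90), "BCGS"),
    ((0.85, 0.70), "BCGS"),
]

def distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(features, training=TRAINING, k=3):
    """Recommend the method chosen by majority vote among the k nearest records."""
    nearest = sorted(training, key=lambda rec: distance(rec[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(recommend((0.2, 0.2)))
print(recommend((0.9, 0.9)))
```

In a real setting the training set would come from running each candidate method on many problems and recording which ones converged and how fast, with one class per method or per method/preconditioner combination.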

20 Work on Self Adapting Software
1. LAPACK for Clusters
2. Collective communication optimization
3. Dynamic applications choice depending on the data
4. Generic Code Optimization techniques applied to highly parallel systems
5. Fault tolerant linear algebra approaches

21 Collaborators / Support

U Tennessee, Knoxville
- Piotr Luszczek: LFC
- Erika Fuentes: SALSA
- Jelena Pjesivac-Grbovic: Opt communication libraries
