A Comparison of Algorithms for Sparse Matrix Factoring and Variable Reordering Aimed at Real-time Multibody Dynamic Simulation
Jose-Luis Torres-Moreno, Jose-Luis Blanco, Javier López-Martínez, Antonio Giménez-Fernández
ECCOMAS Multibody Dynamics 2013, 1-4 July 2013, Zagreb (Croatia)
Goal of this work
Problem under study: given the multibody dynamic formulation, which numerical solver algorithm is best for $A x = b$ (i.e., $x = A \backslash b$)? It depends.
Talk outline
1. Problem introduction
2. Tested algorithms
3. Benchmark
4. Results
Problem introduction
Modeling decisions: coordinates, formulation, etc.
Dependent coordinates: $\ddot{q}(t) = f(q(t), \dot{q}(t), t)$  vs.  independent coordinates: $\ddot{z}(t) = f(z(t), \dot{z}(t), t)$
Our choices for this work:
- Coordinate type for modeling: natural coordinates (dependent)
- Dynamic formulation: index-1 Lagrange with Baumgarte stabilization
Problem introduction
$A x = b$ — what is the problem structure? With N dependent coordinates and M constraints:
A is (N+M) x (N+M); x and b are (N+M) x 1.
A is square, symmetric, psd and sparse, apart from a few dense columns.
Structure of matrix A
Most important features:
- M is constant
- Static non-zero structure
- SPARSITY
Why sparsity matters?
Naive matrix inversion:
$\underbrace{\begin{bmatrix} M & \Phi_q^T \\ \Phi_q & 0 \end{bmatrix}}_{A} \begin{bmatrix} \ddot{q} \\ \lambda \end{bmatrix} = \begin{bmatrix} Q \\ c \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} \ddot{q} \\ \lambda \end{bmatrix} = \underbrace{\begin{bmatrix} M & \Phi_q^T \\ \Phi_q & 0 \end{bmatrix}^{-1}}_{A^{-1}} \begin{bmatrix} Q \\ c \end{bmatrix}$
Why sparsity matters?
[Figure: sparsity patterns of A, with nnz=528 (93.76% zeros), and of its inverse, with nnz=8266 (only 2.34% zeros).]
In general, the inverse of a sparse matrix is NOT sparse! Cost of the dense inversion: $O((N+M)^3)$.
Why sparsity matters?
How to keep matrices sparse? Avoid direct matrix inversion. Alternatives?
Sparse matrix factorization algorithms:
1) Method: LU, Cholesky.
2) Variable ordering.
Why variable reordering matters?
[Figure: with the natural ordering, A has nnz=528 (~94% zeros), but the factors A = L U fill in heavily: L has nnz=2599 (69.29% zeros) and U has nnz=2456 (70.98% zeros), i.e. only ~70% zeros.]
Why variable reordering matters?
Reordering = permutation of variable indices: A(:,:) → A(p,p).
[Figure: both A and the permuted A(p,p) have nnz=528 (93.76% zeros); reordering changes the pattern, not the number of non-zeros.]
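The symmetric permutation A(p,p) is cheap to apply before factoring. A minimal sketch, on a triplet-format sparse matrix (names and the triplet layout are illustrative, not from the paper's code):

```cpp
// Symmetric reordering of a triplet-format sparse matrix: entry (i, j, v)
// of A becomes (q[i], q[j], v) of A(p,p), where q is the inverse of p.
#include <cstddef>
#include <vector>

struct Triplet { int row, col; double val; };

// Permutation convention: new index k holds old index p[k].
std::vector<Triplet> symmetric_permute(const std::vector<Triplet>& A,
                                       const std::vector<int>& p) {
    std::vector<int> q(p.size());              // inverse permutation: q[p[k]] = k
    for (std::size_t k = 0; k < p.size(); ++k) q[p[k]] = static_cast<int>(k);
    std::vector<Triplet> B;
    B.reserve(A.size());
    for (const Triplet& t : A)                 // nnz is preserved exactly
        B.push_back({q[t.row], q[t.col], t.val});
    return B;
}
```

As the slide's figure shows (nnz=528 before and after), the permutation itself never changes the non-zero count; it only changes how much fill-in the subsequent factorization produces.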
Why variable reordering matters?
[Figure: with an optimized ordering, A still has nnz=528 (~94% zeros), and now L has nnz=244 (97.12% zeros) and U has nnz=620 (92.67% zeros), i.e. ~97% and ~93% zeros.]
Tested algorithms
Approaches: LU decomposition, Cholesky decomposition.
Tested algorithms — LU decomposition
$A x = b$, with $A = L U$, so $(L U) x = b$:
1) $L y = b$, solve for y (forward substitution)
2) $U x = y$, solve for x (back substitution)
[Figure: sparsity patterns of the factors, L with nnz=244 (97.12% zeros) and U with nnz=620 (92.67% zeros).]
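The two triangular solves above can be sketched as follows. This is a dense, illustrative version; the sparse solvers benchmarked here (KLU, UMFPACK) perform the same two steps on compressed sparse factors:

```cpp
// Given A = L U, solve L y = b (forward substitution), then U x = y
// (back substitution). Dense sketch for illustration only.
#include <vector>
using Mat = std::vector<std::vector<double>>;

std::vector<double> lu_solve(const Mat& L, const Mat& U,
                             const std::vector<double>& b) {
    const int n = static_cast<int>(b.size());
    std::vector<double> y(n), x(n);
    for (int i = 0; i < n; ++i) {              // 1) L y = b
        double s = b[i];
        for (int j = 0; j < i; ++j) s -= L[i][j] * y[j];
        y[i] = s / L[i][i];
    }
    for (int i = n - 1; i >= 0; --i) {         // 2) U x = y
        double s = y[i];
        for (int j = i + 1; j < n; ++j) s -= U[i][j] * x[j];
        x[i] = s / U[i][i];
    }
    return x;
}
```

Each triangular solve costs O(nnz) on sparse factors, which is why keeping L and U sparse (via reordering) pays off directly at run time.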
Tested algorithms
Benchmarked LU implementations (6 variations):
1) Dense LU: Eigen C++ library
2) Sparse LU: SuiteSparse's KLU
   a) AMD ordering
   b) COLAMD ordering
3) Sparse LU: SuiteSparse's UMFPACK
   a) Natural ordering
   b) AMD ordering
   c) METIS ordering
Note the distinction between symbolic and numeric factorization!
Tested algorithms — Cholesky decomposition (only when M is positive definite)
Starting from
$\underbrace{\begin{bmatrix} M & \Phi_q^T \\ \Phi_q & 0 \end{bmatrix}}_{A} \begin{bmatrix} \ddot{q} \\ \lambda \end{bmatrix} = \begin{bmatrix} Q \\ c \end{bmatrix}$
apply Cholesky to M and use the Schur complement to build two reduced systems of equations.
[R. von Schwerin, Mathematical Methods for MBS in Descriptor Form, 1999]
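One standard form of this reduction (the exact variant used in von Schwerin, 1999, may differ in details) eliminates $\ddot{q}$ from the saddle-point system:

```latex
\text{From } M\ddot{q} + \Phi_q^T \lambda = Q \text{ and } \Phi_q \ddot{q} = c,
\text{ with } M = L L^T \text{ (sparse Cholesky):}
\begin{aligned}
\left(\Phi_q M^{-1} \Phi_q^T\right) \lambda &= \Phi_q M^{-1} Q - c
  && \text{(reduced system for } \lambda\text{)} \\
M \ddot{q} &= Q - \Phi_q^T \lambda
  && \text{(recover accelerations)}
\end{aligned}
```

Every application of $M^{-1}$ becomes a pair of sparse triangular solves with $L$; only the (M x M) reduced system for $\lambda$ needs a second factorization.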
Tested algorithms
Approaches: LU decomposition, Cholesky decomposition (only when M is positive definite).
Benchmarked Cholesky implementations (3 variations):
1) Sparse Cholesky: SuiteSparse's CHOLMOD
   a) AMD ordering
   b) COLAMD ordering
   c) METIS ordering
Our benchmark model
Test mechanism: parameterized 2D lattice of $N_x \times N_y$ articulated quadrangles.
Example: $N_x = 5$, $N_y = 4$.
Our benchmark model
Irregular node locations to avoid singular configurations.
Benchmark C++ implementation
MB model → Assemble → Assembled model → Solver.
Solver-specific tasks to be profiled: preprocessing, then per time step:
1. Update Jacobians
2. Build RHS
3. Triplet -> CCS (*)
4. Numeric factoring
5. Solve Ax=b
(N.I.S.)
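Step 3 of the pipeline, converting a triplet (COO) list into Compressed Column Storage (CCS/CSC, the format the SuiteSparse solvers consume), can be sketched as below. The struct layout and names are illustrative; for simplicity, duplicate entries are not summed:

```cpp
// Triplet (COO) -> Compressed Column Storage (CCS) via counting sort.
#include <vector>

struct Entry { int row, col; double val; };

struct CCS {
    std::vector<int> colptr;   // size ncols+1: column j spans [colptr[j], colptr[j+1])
    std::vector<int> rowind;   // row index of each stored value
    std::vector<double> values;
};

CCS triplet_to_ccs(const std::vector<Entry>& T, int ncols) {
    CCS A;
    A.colptr.assign(ncols + 1, 0);
    for (const Entry& e : T) ++A.colptr[e.col + 1];                 // count per column
    for (int j = 0; j < ncols; ++j) A.colptr[j + 1] += A.colptr[j]; // prefix sum
    A.rowind.resize(T.size());
    A.values.resize(T.size());
    std::vector<int> next(A.colptr.begin(), A.colptr.end() - 1);
    for (const Entry& e : T) {                                      // scatter into place
        int k = next[e.col]++;
        A.rowind[k] = e.row;
        A.values[k] = e.val;
    }
    return A;
}
```

Since the non-zero structure of A is static (see "Structure of matrix A"), the triplet-to-CCS index mapping can be computed once and only the values refreshed each step, which is what makes this step avoidable.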
The benchmark
For each solver, for each ordering, for N = 1…32:
- Build mechanism with N_x = N, N_y = N.
- Run thousands of time steps for each configuration.
- Measure the average execution time of each calculation.
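The timing scheme in the loop above amounts to averaging wall-clock cost over many repetitions. A minimal sketch (names are illustrative, not taken from the paper's benchmark code):

```cpp
// Average wall-clock cost (in ms) of one step, measured over many runs.
#include <chrono>
#include <functional>

double average_ms(const std::function<void()>& step, int reps) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < reps; ++i) step();     // e.g. one dynamics time step
    const std::chrono::duration<double, std::milli> dt = clock::now() - t0;
    return dt.count() / reps;
}
```

Timing the whole batch and dividing, rather than timing each step, keeps the clock-read overhead out of the measurement of very fast steps.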
Video demo: http://www.youtube.com/watch?v=1red2bqbnys
Results: Preprocessing (symbolic factorization)
[Figure: log-log plot of preprocessing time (ms) vs. number of redundant coordinates, for Dense LU, UMFPACK (Natural/AMD/METIS), KLU (AMD/COLAMD) and CHOLMOD (AMD/COLAMD/METIS).]
Results: Distribution of time (KLU+COLAMD)
[Pie chart over: Update Jacobians, Build RHS, Solve Ax=b, Triplet->CCS, Numeric factoring; the dominant slice takes 81%, the rest 4-7% each.]
The Triplet -> CCS step can be eliminated with smarter coding!
Results: numeric factoring + solve
[Figure: total numeric factorization & solve time (0-300 ms, linear scale) vs. number of redundant coordinates (500-4000), for the same nine solver/ordering combinations; the UMFPACK and KLU curves cluster by solver across their different orderings.]
A comment on the Cholesky approach
[Figure: even for the sparse matrix E shown, chol(E * E) turns out dense!]
Results: numeric factoring + solve (log scale)
[Figure: log-log plot of time (ms) vs. number of redundant coordinates for all solvers; Dense LU, UMFPACK+AMD and KLU+COLAMD are highlighted.]
Results: numeric factoring + solve (log scale)
[Figure: same log-log plot, annotated with the crossover points at ~50 coordinates (Dense LU vs. KLU+COLAMD) and ~2000 coordinates (KLU+COLAMD vs. UMFPACK+AMD).]
Results: numeric factoring + solve (log scale)
[Figure: same log-log plot, with the real-time budget T = 1 ms marked as a horizontal line.]
Conclusions
- MBS dynamics leads to very sparse systems.
- Exploiting sparsity is crucial for performance unless N < 50 (approx.).
- 50 < N < 2000: use KLU+COLAMD.
- 2000 < N: use UMFPACK+AMD.
- In any case: reuse symbolic factorizations!
Slides available online: http://www.ual.es/~jlblanco/ → Publications
Thanks for your attention!
Slides available online: http://www.ual.es/~jlblanco/ → Publications