Sparse LU Factorization for Parallel Circuit Simulation on GPUs


1 Sparse LU Factorization for Parallel Circuit Simulation on GPUs
Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang
Nano-scale Integrated Circuit and System Lab., Department of Electronic Engineering, Tsinghua University

2 Motivation
- Parallel SPICE simulator bottleneck

3 Related works
- Dense LU [Volkov2008, Tomov2010]
  - Very efficient on GPUs (850 Gflop/s)
- Sparse LU
  - SuperLU and Pardiso: supernodes (dense blocks)
  - [Christen2007]: dense blocks on GPU
  - UMFPACK, MUMPS, WSMP: multifrontal
  - No dense blocks in extremely sparse matrices
- KLU: for circuit matrices, without supernodes
  - G/P left-looking algorithm [G/P 1988]
  - Sequential version only

4 Algorithm - left-looking
- Each column (k) is sequentially updated (vector multiply-and-add, MAD) by all the columns on its left
- [Figure: left-looking column update; legend: read, write, nonzero]
- The nonzero structure of U determines the dependency and the EGraph
- EGraph [Chen 2011]: nodes are columns, edges are vector MADs
- [Figure: (a) upper triangular matrix U; (b) EGraph]
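The column-by-column update on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the matrix is stored densely and no pivoting is performed, whereas the G/P algorithm works on sparse columns with partial pivoting. The function name `left_looking_lu` is hypothetical.

```python
def left_looking_lu(A):
    """Factor a square matrix A (list of rows) as A = L U, one column at a time.

    L is unit lower triangular and U is upper triangular. No pivoting is done
    (an assumption of this sketch), so every diagonal entry that appears
    during the factorization must be nonzero.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        x = [float(A[i][k]) for i in range(n)]  # working copy of column k
        for j in range(k):                      # visit the columns on the left
            if x[j] != 0.0:                     # U[j][k] != 0: column k depends on column j
                for i in range(j + 1, n):
                    x[i] -= L[i][j] * x[j]      # vector multiply-and-add (MAD)
        for i in range(k + 1):
            U[i][k] = x[i]                      # upper part becomes column k of U
        for i in range(k + 1, n):
            L[i][k] = x[i] / x[k]               # scaled lower part becomes column k of L
    return L, U
```

The test `x[j] != 0.0` is where the nonzero structure of U enters: column k needs a MAD from column j only when U[j][k] is nonzero, which is exactly an edge of the EGraph.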

5 Algorithm analysis - parallelism
- Divide EGraph into levels
- Columns in the same level are independent
- Cluster mode & pipeline mode
- [Figures: a sample EGraph; timing order in pipeline mode]
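Dividing the EGraph into levels can be sketched as a longest-path computation over the column dependencies. The input format (`deps[k]` lists the columns j < k with U[j][k] nonzero) and the name `egraph_levels` are assumptions made for this example, not taken from the slides.

```python
def egraph_levels(deps):
    """Group columns into levels of mutually independent columns.

    deps[k] lists the columns j < k that column k depends on
    (the positions where U[j][k] is nonzero).
    Returns a list of levels, each a list of column indices.
    """
    level = [0] * len(deps)
    for k, cols in enumerate(deps):
        for j in cols:
            # j < k, so level[j] is already final when column k is visited
            level[k] = max(level[k], level[j] + 1)
    groups = {}
    for k, lv in enumerate(level):
        groups.setdefault(lv, []).append(k)
    return [groups[lv] for lv in sorted(groups)]
```

For instance, `egraph_levels([[], [], [0], [0, 1], [2, 3]])` returns `[[0, 1], [2, 3], [4]]`: columns 0 and 1 can be factorized concurrently, then columns 2 and 3, then column 4.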

6 GPU implementation - avoid deadlock
- Traditionally, some warps are inactive at the beginning and are activated when other active warps finish
- But in sparse LU, all warps must be active from the beginning
- This gives an upper bound on the number of concurrent columns
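One plausible reading of this bound is plain arithmetic: if every column in flight must have all of its warps resident at the same time, the number of concurrent columns cannot exceed the number of warps the device can keep resident. The sketch below assumes a fixed number of warps per column; the function name and parameters are hypothetical, and the slides do not state the exact formula.

```python
def max_concurrent_columns(num_sms, resident_warps_per_sm, warps_per_column=1):
    """Upper bound on the columns that can be in flight at once.

    Assumption of this sketch: a warp waiting on a dependency can only be
    released by a warp that is already resident, so every column in flight
    needs all of its warps resident simultaneously.
    """
    if min(num_sms, resident_warps_per_sm, warps_per_column) < 1:
        raise ValueError("all arguments must be positive")
    return (num_sms * resident_warps_per_sm) // warps_per_column
```

With illustrative values of 16 multiprocessors and 48 resident warps each (not figures from the slides), `max_concurrent_columns(16, 48)` gives 768. In practice the number of resident warps also depends on the kernel's register and shared-memory usage.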

7 GPU implementation - memory access pattern

8 GPU implementation - workflow

9 Performance analysis
- More concurrent columns, higher performance?
- No, because of inexecutable operations.

10 Experiments
- CPU: 2 Xeon X5680
- GPU: NVIDIA GTX580
- Testing matrices: University of Florida Sparse Matrix Collection (not only circuit matrices)
- Hybrid solver: 1-core / multi-core / many-core (GPU)
- [Table: bandwidth (GB/s) for GPU, 1 CPU, 4 CPUs, 8 CPUs, and KLU; rows: group A (flop < 200M), group B (flop > 200M)]

13 Hybrid Solver
- Based on the number of flops in the factorization
  - Sequential or parallel? [Chen 2011]
  - Single-core, multi-core or many-core (GPU)
- Accuracy: pivoting once + several numerical factorizations
  - Since nonzero values do not change rapidly
  - When nonzeros do vary greatly, pivot (preprocess) again
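The flop-based selection could look roughly like the sketch below. The function name and both threshold parameters are hypothetical: the slides only say that the choice depends on the number of flops, not where the switch-over points lie.

```python
def choose_platform(flops, multicore_min, gpu_min):
    """Pick an execution target from the flop count of the factorization.

    multicore_min and gpu_min are hypothetical tuning thresholds (in flops);
    they would have to be calibrated on the target machine.
    """
    if multicore_min > gpu_min:
        raise ValueError("multicore_min must not exceed gpu_min")
    if flops < multicore_min:
        return "1-core"
    if flops < gpu_min:
        return "multi-core"
    return "many-core (GPU)"
```

A caller would estimate the flop count once after the symbolic analysis and reuse the chosen target for the repeated numerical factorizations, for example `choose_platform(5_000_000, 10_000_000, 200_000_000)`, which returns `"1-core"` under these made-up thresholds.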

14 Summary
- Sparse LU solver on GPU
  - Timing order and work partitioning on GPU
  - The optimal number of concurrent columns
  - Memory access pattern
- Hybrid solver
  - As the number of flops increases, the left-looking algorithm should move from 1-core to multi-core to many-core (GPU) execution.

15 Limitation & Future work
- On distributed-memory machines (e.g. multiple GPUs)?
- Limited memory on GPU
- Blocked algorithm?
- Circuit partition + blocked factorization
- Bordered-Block-Diagonal (BBD) matrices
Thank you!

16 References
[Volkov2008] V. Volkov and J. Demmel, Benchmarking GPUs to tune dense linear algebra, SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008.
[Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated many-core systems, Parallel Comput., vol. 36, June.
[Christen2007] M. Christen, O. Schenk, and H. Burkhart, General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform.
[SuperLU1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, 1999.
[Pardiso2002] O. Schenk and K. Gärtner, Solving unsymmetric sparse systems of linear equations with PARDISO, Computational Science - ICCS 2002, vol. 2330.
[Florida] T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, to appear in ACM Transactions on Mathematical Software.

17 References
[G/P 1988] J. R. Gilbert and T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations, SIAM J. Sci. Statist. Comput., vol. 9, 1988.
[KLU2010] T. A. Davis and E. Palamadai Natarajan, Algorithm 907: KLU, a direct sparse solver for circuit simulation problems, ACM Trans. Math. Softw., vol. 37, pp. 36:1-36:17, September.
[MC64] I. S. Duff and J. Koster, The design and use of algorithms for permuting large entries to the diagonal of sparse matrices, SIAM J. Matrix Anal. and Applics, no. 4.
[AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, Algorithm 837: AMD, an approximate minimum degree ordering algorithm, ACM Trans. Math. Softw., vol. 30, September.
[Chen 2011] X. Chen, W. Wu, Y. Wang, H. Yu, and H. Yang, An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation, IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 10, Oct.
