GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis


Abstract:

Lower-upper (LU) factorization for sparse matrices is the most important computing step in circuit simulation. However, parallelizing LU factorization on graphics processing units (GPUs) turns out to be difficult due to intrinsic data dependences and irregular memory accesses, which diminish GPU computing power. In this paper, we propose a new sparse LU solver on GPUs for circuit simulation and more general scientific computing. The new method, called the GPU-accelerated LU factorization (GLU) solver, is based on a hybrid right-looking LU factorization algorithm for sparse matrices. We show that, on GPU platforms, more concurrency can be exploited in the right-looking method than in the left-looking method, which is more popular for circuit analysis. At the same time, GLU preserves the benefits of the column-based left-looking LU method, such as symbolic analysis and column-level concurrency. We show that the resulting parallel GPU LU solver allows parallelization of all three loops of LU factorization on the GPU, whereas the existing GPU-based left-looking LU factorization approach can only parallelize two loops. Experimental results show that the proposed GLU solver delivers 5.71× and 1.46× speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, 19.56× speedup over the KLU solver, 47.13× over the UMFPACK solver, and 1.47× speedup over a recently proposed GPU-based left-looking LU solver on a set of typical circuit matrices from the University of Florida (UFL) sparse matrix collection. Furthermore, on a set of general matrices from the UFL collection, GLU achieves 6.38× and 1.12× speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, 39.39× speedup over the KLU solver, 24.04× over the UMFPACK solver, and 2.35× speedup over the same GPU-based left-looking LU solver. In addition, a comparison on self-generated RLC mesh networks shows a similar trend, further validating the advantage of the proposed method over the existing sparse LU solvers. The proposed architecture is also analyzed for logic size, area, and power consumption using Xilinx 14.2.

Existing System:

Before we present our new approach, we first review the two mainstream LU factorization methods: 1) the left-looking G/P factorization algorithm [13] and 2) a variant of the right-looking algorithms, such as the Gaussian elimination method. We then review recent work on LU factorization on GPUs and the NVIDIA CUDA programming system. The LU factorization of an n × n matrix A has the form A = LU, where L is a lower triangular matrix and U is an upper triangular matrix. For a full matrix, LU factorization has O(n^3) complexity, since it consists of three nested loops.
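To make the three nested loops concrete, the following is a minimal C sketch of dense, in-place LU factorization (an illustrative textbook routine, not the paper's solver; pivoting is omitted for brevity):

/* Minimal sketch: in-place dense LU without pivoting, showing the three
 * nested loops. On return, the strict lower triangle of A holds L (unit
 * diagonal implied) and the upper triangle holds U. A is n x n, row-major. */
void dense_lu(double *A, int n)
{
    for (int k = 0; k < n; k++) {            /* outer loop: pivot column */
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];    /* L(i,k) = A(i,k) / U(k,k) */
            for (int j = k + 1; j < n; j++)  /* update trailing submatrix */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}

Each of the three loops is a candidate for parallelization; how many of them can actually run concurrently is what separates the methods compared in this paper.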

GPU Architecture and CUDA Programming

In this section, we review the GPU architecture and CUDA programming. CUDA is the parallel programming model for NVIDIA's general-purpose GPUs. The architecture of a typical CUDA-capable GPU consists of an array of highly threaded streaming multiprocessors (SMs) together with a large amount of DRAM, referred to as global memory. Take the Tesla C2070 GPU, for example: it contains 14 SMs, each of which has 32 streaming processors (SPs, or CUDA cores in NVIDIA's terminology), four special function units (SFUs), and its own shared memory/L1 cache. The structure of an SM is shown in Fig. 1.

Fig. 1. Diagram of an SM in NVIDIA Tesla C2070 (SP; L/S for load/store unit; SFU).

As the programming model for the GPU, CUDA extends C into CUDA C and supports tasks such as thread launching and memory allocation, enabling programmers to exploit most of the GPU's parallel capabilities. In the CUDA programming model, shown in Fig. 2, threads are organized into blocks, and blocks of threads are organized into grids. CUDA also assumes that the host (CPU) and the device (GPU) maintain separate memory spaces, referred to as host memory and device memory, respectively.
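The following minimal CUDA C example (illustrative only, not taken from the paper) shows the thread/block/grid hierarchy and the explicit movement of data between the two memory spaces:

#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

void add_on_gpu(const float *h_a, const float *h_b, float *h_c, int n)
{
    float *d_a, *d_b, *d_c;
    size_t bytes = (size_t)n * sizeof(float);

    cudaMalloc(&d_a, bytes);                          /* device (global) memory */
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* host -> device */
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                                /* threads per block */
    int blocks = (n + threads - 1) / threads;         /* blocks per grid */
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}

Each thread computes one element; the launch configuration <<<blocks, threads>>> maps the grid of blocks onto the available SMs.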

Fig. 2. Programming model of CUDA.

Disadvantages:

Low speed.

Proposed System:

In this section, we explain our new hybrid column-based right-looking sparse LU factorization method on GPUs, the GLU solver. GLU was originally inspired by the observation that the existing left-looking LU factorization has inherent limitations on concurrency exploitation due to the required solving of triangular systems. To mitigate this problem, we turn to the traditional right-looking LU factorization method, which is more amenable to parallelization, especially on GPU platforms. However, we also want the benefits of the left-looking method: symbolic analysis for storage management of the factorized matrices, and column-level concurrency. The resulting method is the hybrid column-based right-looking LU method, discussed in the following.

Column-Based Right-Looking Algorithm

Our starting point is still the left-looking algorithm, since we want to keep the column concurrency and symbolic analysis, and we still compute one column at a time for both the L and U matrices. However, unlike the left-looking algorithm, once a column of L is computed, its impact on the yet-to-be-solved columns is applied right away (so we now "look right" in this sense). In this column-based right-looking algorithm, we still have three loops: 1) the outer k-loop chooses the current column k to be factorized; 2) the middle j-loop chooses a column j in the submatrix to the right of column k that depends on column k; and 3) the inner i-loop performs multiply-and-add (MAD) operations between column k and column j.
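In dense notation (the actual GLU solver works on sparse compressed columns, so this is only an illustrative sketch), the three loops look as follows:

/* Illustrative dense-notation sketch of the column-based right-looking
 * update. A is n x n, stored column-major: A[j * n + i] is entry (i, j).
 * Once column k of L is finalized, every dependent column j > k is
 * updated immediately rather than deferred. */
void column_right_looking_lu(double *A, int n)
{
    for (int k = 0; k < n; k++) {                /* outer k-loop: current column */
        for (int i = k + 1; i < n; i++)
            A[k * n + i] /= A[k * n + k];        /* finish L(:,k) */
        for (int j = k + 1; j < n; j++) {        /* middle j-loop: dependent columns */
            double ukj = A[j * n + k];           /* U(k,j), stored in column j */
            for (int i = k + 1; i < n; i++)      /* inner i-loop: MAD updates */
                A[j * n + i] -= A[k * n + i] * ukj;
        }
    }
}

In the sparse version, the j-loop visits only the columns with a nonzero U(k, j) and the i-loop visits only the nonzeros of column k, which is where the symbolic analysis pays off.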

Preprocessing and Symbolic Analysis

As mentioned earlier, the proposed method combines the benefits of both the left-looking and the right-looking methods. As a result, it still follows the preprocessing and symbolic analysis steps to improve factorization efficiency. First, the preprocessing phase preorders the matrix A to minimize fill-in and to ensure a zero-free diagonal. Second, the symbolic phase performs symbolic factorization and determines the structures of the lower triangular matrix L and the upper triangular matrix U. It then groups independent columns into levels. Note that the concurrent computing resources on the GPU (warps per SM, shared memory per block, and threads per block) are limited; consequently, the number of columns solved in parallel within each level must also be limited. Hence, we propose a resource-aware levelization scheme in which the number of columns in each level is capped at a fixed number; a minimal sketch of this capping step is given after Fig. 3. Fig. 3 (bottom) shows the levelization result for the matrix in the top figure, where the maximum number of allowed columns per level is 3. This resource-aware levelization scheme is applied to parallelize the outer k-loop of the proposed right-looking algorithm.

Fig. 3. Resource-aware levelization scheme.
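Below is a minimal C sketch (illustrative, not the paper's code) of such a capping pass: columns that share a dependency level are independent, and each level is split into chunks of at most cap columns so that one chunk matches the GPU's concurrency budget:

/* Illustrative sketch of resource-aware levelization. Columns in the same
 * dependency level are independent, but each level is split into chunks of
 * at most `cap` columns to respect the GPU's limited concurrency budget
 * (warps per SM, shared memory per block, threads per block). */
#include <stdio.h>

void resource_aware_levelize(const int *level, int n, int cap)
{
    int max_level = 0;
    for (int c = 0; c < n; c++)
        if (level[c] > max_level)
            max_level = level[c];

    for (int lv = 0; lv <= max_level; lv++) {   /* walk levels in order */
        int count = 0;
        for (int c = 0; c < n; c++) {
            if (level[c] != lv)
                continue;
            if (count % cap == 0)               /* start a new chunk */
                printf("\nchunk (level %d):", lv);
            printf(" col %d", c);               /* chunk members run in parallel */
            count++;
        }
    }
    printf("\n");
}

int main(void)
{
    /* hypothetical per-column levels, as symbolic analysis might produce */
    int level[8] = {0, 0, 0, 0, 0, 1, 1, 2};
    resource_aware_levelize(level, 8, 3);       /* cap: 3 columns per chunk */
    return 0;
}

With this sample input, the five independent columns of level 0 are split into chunks of three and two, mirroring the three-column cap illustrated in Fig. 3.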

Parallel Implementation on GPU

In the parallel implementation of sparse LU factorization, the CPU is responsible for initializing the matrix and performing the symbolic analysis, while the GPU concentrates solely on the numerical factorization. The CPU is also responsible for allocating device memory, copying the host inputs to device memory, and copying the computed device results back to the host. A sketch of this division of labor is given below.
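The following host-side sketch illustrates that flow under stated assumptions; the function and kernel names (solve_on_gpu, factor_level) and the level data layout are hypothetical, not the GLU API, and the kernel body is left as a stub:

#include <cuda_runtime.h>

__global__ void factor_level(double *vals, const int *cols, int ncols)
{
    int j = blockIdx.x;                 /* one block per column in this level */
    if (j < ncols) {
        int col = cols[j];              /* column to factorize */
        (void)col; (void)vals;          /* per-column numeric update would go here */
    }
}

void solve_on_gpu(double *h_vals, int nnz,
                  const int *h_levels, const int *h_level_ptr, int n_levels)
{
    /* The CPU has already preordered the matrix and run the symbolic
     * analysis, producing h_levels / h_level_ptr: column indices grouped
     * by dependency level. */
    double *d_vals;
    int *d_levels;
    int total_cols = h_level_ptr[n_levels];

    cudaMalloc(&d_vals, nnz * sizeof(double));       /* CPU allocates device memory */
    cudaMalloc(&d_levels, total_cols * sizeof(int));
    cudaMemcpy(d_vals, h_vals, nnz * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_levels, h_levels, total_cols * sizeof(int),
               cudaMemcpyHostToDevice);

    for (int lv = 0; lv < n_levels; lv++) {          /* one kernel launch per level */
        int ncols = h_level_ptr[lv + 1] - h_level_ptr[lv];
        factor_level<<<ncols, 32>>>(d_vals, d_levels + h_level_ptr[lv], ncols);
    }

    cudaMemcpy(h_vals, d_vals, nnz * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_levels);
}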

Fig. 4. Comparison of concurrency exploitation on the GPU in terms of warp scheduling.

Advantages:

High speed.

Software implementation:

Modelsim
Xilinx ISE