Parallel solution for finite element linear systems of equations on workstation cluster*
Aug. 2009, Volume 6, No. 8 (Serial No. 57), Journal of Communication and Computer, USA

Parallel solution for finite element linear systems of equations on workstation cluster*

FU Chao-jiang (Department of Civil Engineering, Fujian University of Technology, Fuzhou, China)

Abstract: With the advances in high-speed computer network technology, the workstation cluster is becoming the main environment for parallel processing. Finite element linear systems of equations arise throughout structural analysis in civil engineering. The preconditioned conjugate gradient method (PCGM) is an iterative method for solving finite element systems of equations with symmetric positive definite system matrices. In this paper, the PCGM algorithm is parallelized and implemented on a DELL workstation cluster. Optimization techniques for the sparse matrix-vector multiplication are adopted in the implementation, and the storage scheme is analyzed in detail. The experimental results show that the designed parallel algorithm achieves high speedup and good efficiency on a high-performance workstation cluster, illustrating the power of parallel computing in solving large problems much faster than a single processor can.

Key words: parallel computing; preconditioned conjugate gradient method; finite element method; network; workstation cluster

* Acknowledgment: This work is supported by the Fujian Province Natural Science Foundation (No. 2008J0180) and the Scientific Research Start Foundation of Fujian University of Technology (No. GY-Z0707). Corresponding author: FU Chao-jiang, Ph.D.; research field: parallel computing.

1. Introduction

In recent years, parallel computing has become an important field in scientific computing, for two main reasons. First, parallel computing makes it possible to handle large-scale problems. Second, networks of computers running Linux, connected via standard software such as MPI, form a workstation cluster, which is a low-cost parallel machine on which one can gain first experience in parallel computing. A workstation cluster is therefore poised to become a primary computing infrastructure for science and engineering: it is cheap, highly available, and provides multiple CPUs for parallel computing.

Solving finite element linear systems of equations is one of the most important problems in structural analysis in civil engineering. The application of the finite element method to engineering problems requires the solution of a system of linear equations. In general, these systems can be written as

Ax = b,

where A is an n × n sparse matrix, x is the vector of nodal unknowns, and b is the n-dimensional balance force vector. Due to the nature of the finite element method, A and b are constructed by assembling the individual element contributions,

A = Σ_{e=1}^{nel} A_e,  b = Σ_{e=1}^{nel} b_e,

where A_e are the element matrices, b_e are the element force vectors, and nel is the number of elements in the mesh. This yields a linear system with a symmetric positive definite sparse system matrix. Unfortunately, the direct methods of elementary linear algebra (e.g., Gaussian elimination) are difficult to adapt to distributed memory systems, especially if the coefficient matrix is sparse, so most sparse linear system solvers for distributed memory systems use iterative methods [1-2]. An iterative method makes an initial guess at a solution to the system and then repeatedly tries to improve that guess; it thus produces a sequence of vectors that gradually converges to the solution of the linear system. Since the finite element method yields a linear system with a symmetric positive definite sparse system matrix, PCGM can be used as the linear solver.

2. Preconditioned conjugate gradient method

To solve the finite element linear system of equations Ax = b, the PCG method is used. The CG method is an iterative technique that converges to the solution of the system through a sequence of at most n vector approximations. From any vector approximation to the solution, a search direction conjugate to the previous ones is used to determine a new approximation. In practice, only a few iterations are needed to obtain a good estimate of the solution. In the preconditioned version of the CG method, both the condition number of A and the number of iterations may be reduced by pre-multiplying the system Ax = b by a preconditioner M. Because the system matrix A has diagonal elements that are very different in magnitude, the diagonal matrix of A is adopted as the preconditioner M. The PCG method is listed below.

Preconditioned conjugate gradient algorithm
initialize: x_0; r_0 = b − A x_0; solve M u_0 = r_0; p_0 = u_0
for k = 0, 1, 2, …
 (1) α_k = (u_k, r_k)/(p_k, A p_k)
 (2) x_{k+1} = x_k + α_k p_k
 (3) r_{k+1} = r_k − α_k A p_k
 (4) solve M u_{k+1} = r_{k+1}
 (5) if (r_{k+1}, r_{k+1}) ≤ ε, return
 (6) β_k = (u_{k+1}, r_{k+1})/(u_k, r_k)
 (7) p_{k+1} = u_{k+1} + β_k p_k
end for

3. Parallel implementation of the PCG method

The PCG method can be parallelized because every vector operation of the algorithm can be parallelized separately: each processor executes scalar operations on a subset of the components of a vector, so the parallel implementation of the preconditioned conjugate gradient method is mathematically identical to the serial implementation. The parallelism comes from storing and operating on local sections of the working vectors and the system matrix. The parallel implementation of the preconditioned conjugate gradient method is presented below.
initialize: x_0; r_0 = b − A x_0; solve M u_0 = r_0; p_0 = u_0; c = dot(u_0, r_0)
M is the preconditioning matrix, M = diag(a_11, a_22, …, a_nn)
main iteration:
 (1) z = A p            /* parallel matrix-vector product */
 (2) α = c/dot(p, z)    /* parallel vector dot product */
 (3) x = x + α·p        /* BLAS daxpy operation */
 (4) r = r − α·z        /* BLAS daxpy operation */
 (5) solve M u = r for u /* solve preconditioner system */
 (6) d = dot(u, r)      /* parallel vector dot product */
 (7) if (sqrt(d) < tolerance) break
 (8) β = d/c
 (9) p = u + β·p        /* BLAS daxpy operation */
 (10) gather(p)         /* parallel gather */
 (11) c = d
 (12) go to (1)

Here BLAS (basic linear algebra subprograms) is a collection of routines that perform specific vector and matrix operations [3]. The parallel PCGM algorithm is implemented in the programming language C++, using the Message Passing Interface (MPI) standard [4] for communication between processors. For the solution of a linear system of equations of dimension n, the algorithm involves 1 matrix-vector product, 2 dot products, and 3 vector updates (daxpy operations) per iteration. The implementation is optimized with respect to memory usage by storing exactly 4 vectors of length n, the smallest number possible, and by using a function that implements the matrix-vector multiplication in matrix-free form, i.e., without storing any matrix [5-6]. The vectors stored are the approximate solution x, the search direction p, the residual vector r, and the auxiliary vector z.

The parallel form of the PCG algorithm requires 4 communication operations per iteration. To compute the matrix-vector product, processors need to exchange data with their neighboring processors; these communications are implemented with nonblocking MPI_Isend/MPI_Irecv commands. Since the dot products apply to vectors split across the processors, and since the results are needed on all processors, an MPI_Allreduce operation is required for each of the 2 dot products. With a diagonal preconditioner, no additional communication is required, since diagonal preconditioning can be applied locally to all rows contained in a process; other preconditioners, such as incomplete Cholesky, would require additional communication. The diagonal preconditioner is therefore used in this algorithm. Optimization techniques [7-8] for the sparse matrix-vector multiplication are adopted in the implementation: the serial part of each dot product, local to each processor, is implemented in a C++ inner-product function dot. The most expensive operation in each iteration is the matrix-vector product in step (1). In this paper, the computations in the PCG algorithm are reorganized in order to hide the latency of the communications. This overlap of communication with computation is implemented using asynchronous messages in the message-passing model described below. The parallelization of the product z = A p can thus be performed with an overlap between the messages and the computations.
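The serial structure of the loop above can be sketched in C++ as follows. This is a minimal illustration with a Jacobi (diagonal) preconditioner, with the matrix supplied as a callback so that a matrix-free product can be plugged in; the MPI parts (Allreduce for the dot products, halo exchange inside the matrix-vector product) are omitted, and all names are illustrative rather than taken from the paper's code.

```cpp
#include <vector>
#include <cmath>
#include <functional>

using Vec = std::vector<double>;
using MatVec = std::function<Vec(const Vec&)>;

// Serial part of the dot product; in the parallel version each processor
// computes this on its local components and an MPI_Allreduce sums the results.
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// PCG with Jacobi preconditioner M = diag(A); `diag` holds the diagonal of A.
// Returns the iterate once sqrt((u,r)) drops below tol, or after max_iter steps.
Vec pcg_jacobi(const MatVec& A, const Vec& b, const Vec& diag,
               double tol = 1e-10, int max_iter = 1000) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, u(n), p(n), z(n);
    for (std::size_t i = 0; i < n; ++i) u[i] = r[i] / diag[i];  // solve M u = r
    p = u;
    double c = dot(u, r);
    for (int k = 0; k < max_iter; ++k) {
        z = A(p);                                                // (1) z = A p
        double alpha = c / dot(p, z);                            // (2) step length
        for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i]; // (3) daxpy
        for (std::size_t i = 0; i < n; ++i) r[i] -= alpha * z[i]; // (4) daxpy
        for (std::size_t i = 0; i < n; ++i) u[i] = r[i] / diag[i]; // (5) M u = r
        double d = dot(u, r);                                    // (6) dot product
        if (std::sqrt(d) < tol) break;                           // (7) converged
        double beta = d / c;                                     // (8)
        for (std::size_t i = 0; i < n; ++i) p[i] = u[i] + beta * p[i]; // (9)
        c = d;                                                   // (11)
    }
    return x;
}
```

Passing the matrix as a callback is what later allows the matrix-free product of Section 4 to be dropped in without changing the solver.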
The required message contains the halo data of p, i.e., the data corresponding to the grid points in the domain of one processor that are used to compute entries of z in a neighboring processor; this message is overlapped with computations on the matrix rows corresponding to the inner part of the domain, where the inner part is the domain without its halos. The latency of the communication step in the inner product p^T z can be hidden by delaying the update of x by one iteration [9].

4. Storage scheme

The aim of the matrix storage scheme is to achieve better performance: the storage technique has an important impact on performance. There are several possibilities for storing a matrix; the key is not to store unnecessary data. The emphasis here is on developing an algorithm that can solve problems too large to be solved efficiently on a single CPU.

Storing all n² elements of A ∈ R^{n×n} explicitly is known as dense storage mode. This is the simplest possible storage scheme, and for dense matrices it is a good technique. For a sparse matrix it is not, because all zero elements are stored: since most of the n² elements are 0, this is wasteful both in memory and in runtime, as many multiplications involving zeros are needlessly computed. Simulations using this storage mode can be easily programmed in a language such as C++.

A common idea is to take advantage of the sparsity of the matrix, i.e., the low percentage of non-zero elements. In this sparse storage mode, only non-zero elements are stored, along with integer index information indicating the position of each element in the matrix. This reduces the memory requirements for the system matrix and also improves performance, as only multiplications with elements that are stored explicitly, i.e., non-zeros, are computed. To reduce the memory usage further, advantage can be taken of the fact that the coefficients of the system matrix are constants in pre-determined positions.
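As a concrete illustration of the constant-coefficient idea for the two-dimensional Poisson problem solved later in the paper, a matrix-free product z = A p can hard-code the stencil coefficients instead of storing them. The sketch below assumes the standard 5-point stencil on an N × N interior grid with a homogeneous Dirichlet boundary (a discretization consistent with, but not spelled out in, the paper); it is an illustration of the technique, not the paper's exact code.

```cpp
#include <vector>
#include <cstddef>

// Matrix-free product z = A p for the 5-point Laplacian stencil on an
// N x N interior grid with homogeneous Dirichlet boundary, unknowns
// numbered row by row: index(i, j) = i*N + j.  No matrix entries are
// stored; the coefficients (4 on the diagonal, -1 per neighbor) are
// hard-coded in the loop body.
std::vector<double> apply_laplacian(const std::vector<double>& p, std::size_t N) {
    std::vector<double> z(N * N);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            std::size_t k = i * N + j;
            double v = 4.0 * p[k];
            if (i > 0)     v -= p[k - N];  // neighbor in row above
            if (i < N - 1) v -= p[k + N];  // neighbor in row below
            if (j > 0)     v -= p[k - 1];  // left neighbor
            if (j < N - 1) v -= p[k + 1];  // right neighbor
            z[k] = v;
        }
    }
    return z;
}
```

In the parallel version, the rows of the grid are split across processors and the `p[k - N]`/`p[k + N]` accesses at the partition boundaries are exactly the halo data exchanged with the neighboring processors.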
A function is provided that accepts a vector p as input and returns the vector z = A p as output; each component z_k is computed as a summation of the appropriate components of p multiplied by hard-coded coefficients [5]. This matrix-vector multiplication function is inserted at the appropriate place in the C++ implementation of the PCGM. This technique is known as a matrix-free implementation, because no elements of the system matrix A are explicitly stored at all [6]. Such a matrix-free method dramatically reduces the memory requirements; it is the most efficient approach with respect to memory usage, and it is the scheme adopted in this paper.

Memory is an important limitation on the size of the system of equations that can be solved. As stated earlier, the minimum number of vectors needed to compute the solution of the linear system is 4 vectors of length n; aside from the system matrix, a prediction of 4n elements therefore accounts for almost all of the memory needed with the matrix-free method. Using the sparse storage mode, the non-zero elements of the system matrix must be stored in addition: for example, if the matrix is pentadiagonal, 5 vectors of length n are stored for the matrix, so the memory prediction is 5n elements added to the matrix-free prediction. Densely storing the matrix adds n² elements to the matrix-free prediction. The theoretical predictions of memory usage for the different storage methods, using double-precision arithmetic, are shown in Table 1.

Table 1 Predicted memory usage for the three storage methods in Megabytes (MB)
n      Dense   Sparse   Matrix-free
256    <1      <1       <1

5. Numerical test and analysis

5.1 Parallel platform

The numerical and performance tests of the parallel PCG algorithm were performed on the DELL workstation cluster of the School of Computer Engineering and Science, Shanghai University. It is a cluster with 12 processors, arranged in 1 four-processor node with 2.0 GHz Intel Xeon chips (1 MB cache) and 4 dual-processor nodes with 2.4 GHz Intel Xeon chips (512 KB cache), with 1.0 GB of memory per node. The nodes are connected by an Ethernet interconnect, and the Message Passing Interface (MPI) paradigm [10] is used. For the message-passing version, the MPICH library was chosen because of its efficiency on this computer architecture.
The MPICH library has been widely adopted as the message-passing interface of choice for many years. In the MPICH implementation, inter-processor communication is performed through a special interface provided to send a message and a matching interface on the other processor to receive it. In our program, the communications are implemented with nonblocking operations in order to overlap communication with computation.

5.2 Numerical test problem

The problem to be solved is a classical prototype problem: the Poisson equation with a homogeneous Dirichlet boundary condition, discretized using finite elements in two dimensions. This yields a linear system with a symmetric positive definite system matrix, which allows the use of the PCG method presented in this paper as the solver. The results of the computations are listed in Table 2 and Fig. 1. At each time step, the solution of the system of algebraic equations can be divided into two parts: (i) initialization, which includes the generation of the matrix A and the right-hand side vector b, the calculation of the preconditioner, and the initialization of the PCG method; and (ii) the iterations of the PCG method. Since different problem sizes imply different convergence rates, the ratio between the times required for the two parts varies as the size of the problem varies. In order to simplify the presentation of the results and avoid problems with the accuracy of the timer and perturbations to the cache behavior, we have timed both parts together.

5.3 Numerical results analysis

Memory is an important limitation on the size of the system that can be solved. Aside from the system matrix, the vectors of length n used in the method make up the bulk of the memory; any other variables are ignored.
The memory estimate is made simply by counting the number of vectors used. As stated earlier, the minimum number of vectors needed in this algorithm is 4 of length n. Table 1 shows the theoretical predictions of the memory used by the C++ code for the dense, sparse, and matrix-free implementations, respectively, when double-precision arithmetic is used. Less than 1 MB of memory is used with sparse or matrix-free storage, compared to 134 MB with dense storage, when the dimension n of the system matrix is 4096; increasing n to 16384 takes over 2 GB with dense storage. It is then not possible to compute the problem on a single processor of the workstations, because only 1 GB of memory is available to a single-processor computation, which bounds the largest system that fits in the memory of a single processor. Clearly, the matrix-free implementation is the best storage method: with it, the method is optimal with respect to memory usage, and we are able to solve problems that are much too large for single-processor computers.

Table 2 and Fig. 1 summarize the timings and speedups for this cluster, using up to 12 processors. It is readily apparent from both Table 2 and Fig. 1 that the timings continue to decrease significantly all the way up to 12 processors. When the number of processors is greater than 4, however, an apparent slow-down in the speedup is observed, because the communication overhead between processors increases significantly. The cache of each processor is of a good size for non-computational uses but does not hold much data, which leads to frequent loading of data from memory into the cache, and the nodes have a 32-bit bus that cannot serve the data as fast as the processors can consume it. With up to 4 processors, the computation runs entirely on the four-processor node; the communication overhead between processors is lower, and thus the performance is better.
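For reference, the speedup and efficiency reported in this section follow the usual definitions S_p = T_1/T_p and E_p = S_p/p, where T_1 is the single-processor time and T_p the time on p processors. A minimal sketch (illustrative helper functions, not the paper's code):

```cpp
#include <cstddef>

// Speedup S_p = T_1 / T_p: how many times faster the p-processor run is.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Efficiency E_p = S_p / p: fraction of ideal linear speedup achieved.
double efficiency(double t_serial, double t_parallel, std::size_t nprocs) {
    return speedup(t_serial, t_parallel) / static_cast<double>(nprocs);
}
```

An efficiency near 1 indicates that communication overhead is well hidden; the drop observed beyond 4 processors corresponds to E_p falling as inter-node communication grows.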
Table 2 Timings in seconds for different n on the 12-processor cluster

Fig. 1 Plot of the speedup for different n on the 12-processor cluster

Fig. 2 Plot of the speedup for different n on the 4-processor cluster

To analyze the effect of a different parallel environment on this algorithm, a second cluster was constituted, consisting of 4 single-processor nodes with 600 MHz Intel Pentium III processors and 256 MB of memory, connected by an Ethernet interconnect. The same computations were performed on this cluster, and the results are listed in Table 3 and Fig. 2. The results for up to 4 processors show that using about 2 processors constitutes the most efficient use of the resources. Though the processor speed is lower than that of the cluster used earlier, the results are good. The speedups in Fig. 1 are higher than those in Fig. 2 for the same number of processors and the same problem size. This result shows that a high-performance parallel platform is required for parallel computing.

Table 3 Timings in seconds for different n on the 4-processor cluster

6. Conclusions

Parallel computing is a useful tool for solving large-scale problems faster. The PCG algorithm is parallelized, the parallel algorithm is implemented on two different clusters, and its performance is analyzed. The reduction in memory due to the matrix-free implementation not only allows the solution of much larger problems but also decreases computing time. The parallel performance studies show that a high-performance parallel platform is necessary to obtain excellent speedup on a cluster, and that load balancing is very important to ensure good scalability and to increase efficiency. These observations reflect the fact that the parallel PCGM algorithm involves 4 communication operations per iteration and therefore requires a tightly coupled cluster. This shows how important it is to have a workstation cluster available. The algorithm, implemented on different parallel platforms, appears to be robust.

References:
[1] Barret R. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia.
[2] Law K. H. A parallel finite element solution method. Computers & Structures, 1986, 23(6).
[3] Karniadakis G. E. Parallel Scientific Computing in C++ and MPI. Cambridge University Press.
[4] Pacheco P. S. Parallel Programming with MPI. Morgan Kaufmann.
[5] Allen K. P. Efficient Parallel Computing for Solving Linear Systems of Equations. Graduate Student Seminar, University of Maryland, Baltimore County, November.
[6] Aliaga J. I., Hernandez V. Symmetric sparse matrix-vector product on distributed memory multiprocessors. Conference on Parallel Computing and Transputer Applications, Barcelona, Spain.
[7] Brown P. N., Hindmarsh A. C. Matrix-free methods for stiff systems of ODEs. SIAM J. Numer. Anal., 1986, 23.
[8] Geus R., Rollin S. Towards a fast parallel sparse symmetric matrix-vector multiplication. Parallel Computing, 2007, 27(2).
[9] Nastea S. G. Load-balanced sparse matrix-vector multiplication on parallel computers. Parallel and Distributed Computing, 1997, 46.
[10] FU Chao-jiang. The research on parallel computation of finite element structural analysis based on MPI cluster. Doctoral dissertation, Shanghai University, Shanghai, China.

(Edited by Amy, Jane)
More informationSuper Matrix Solver-P-ICCG:
Super Matrix Solver-P-ICCG: February 2011 VINAS Co., Ltd. Project Development Dept. URL: http://www.vinas.com All trademarks and trade names in this document are properties of their respective owners.
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More information3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs
3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The
More informationCost-Effective Parallel Computational Electromagnetic Modeling
Cost-Effective Parallel Computational Electromagnetic Modeling, Tom Cwik {Daniel.S.Katz, cwik}@jpl.nasa.gov Beowulf System at PL (Hyglac) l 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory,
More informationScalability of Heterogeneous Computing
Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 5: Sparse Linear Systems and Factorization Methods Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 18 Sparse
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More information2 Fundamentals of Serial Linear Algebra
. Direct Solution of Linear Systems.. Gaussian Elimination.. LU Decomposition and FBS..3 Cholesky Decomposition..4 Multifrontal Methods. Iterative Solution of Linear Systems.. Jacobi Method Fundamentals
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationLINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those
Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationA substructure based parallel dynamic solution of large systems on homogeneous PC clusters
CHALLENGE JOURNAL OF STRUCTURAL MECHANICS 1 (4) (2015) 156 160 A substructure based parallel dynamic solution of large systems on homogeneous PC clusters Semih Özmen, Tunç Bahçecioğlu, Özgür Kurç * Department
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationIntroduction to Parallel Programming for Multicore/Manycore Clusters Part II-3: Parallel FVM using MPI
Introduction to Parallel Programming for Multi/Many Clusters Part II-3: Parallel FVM using MPI Kengo Nakajima Information Technology Center The University of Tokyo 2 Overview Introduction Local Data Structure
More informationIntroduction to Parallel. Programming
University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Introduction to Parallel Section 9. Programming Parallel Methods for Solving Linear Systems Gergel V.P., Professor, D.Sc.,
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationGPU Implementation of Elliptic Solvers in NWP. Numerical Weather- and Climate- Prediction
1/8 GPU Implementation of Elliptic Solvers in Numerical Weather- and Climate- Prediction Eike Hermann Müller, Robert Scheichl Department of Mathematical Sciences EHM, Xu Guo, Sinan Shi and RS: http://arxiv.org/abs/1302.7193
More informationApproaches to Parallel Implementation of the BDDC Method
Approaches to Parallel Implementation of the BDDC Method Jakub Šístek Includes joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík. Institute of Mathematics of the AS CR, Prague
More informationHigh-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers July 14, 1997 J Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov) Jet Propulsion Laboratory California Institute of Technology
More informationData mining with sparse grids
Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationfspai-1.0 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.0 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 08 01 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller October 994 CMU-CS-94-25 Marco Zagha School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 This
More informationAccelerating Double Precision FEM Simulations with GPUs
Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University
More informationNative mesh ordering with Scotch 4.0
Native mesh ordering with Scotch 4.0 François Pellegrini INRIA Futurs Project ScAlApplix pelegrin@labri.fr Abstract. Sparse matrix reordering is a key issue for the the efficient factorization of sparse
More informationLarge-scale Structural Analysis Using General Sparse Matrix Technique
Large-scale Structural Analysis Using General Sparse Matrix Technique Yuan-Sen Yang 1), Shang-Hsien Hsieh 1), Kuang-Wu Chou 1), and I-Chau Tsai 1) 1) Department of Civil Engineering, National Taiwan University,
More informationfspai-1.1 Factorized Sparse Approximate Inverse Preconditioner
fspai-1.1 Factorized Sparse Approximate Inverse Preconditioner Thomas Huckle Matous Sedlacek 2011 09 10 Technische Universität München Research Unit Computer Science V Scientific Computing in Computer
More informationCHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer
CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical
More informationKrishnan Suresh Associate Professor Mechanical Engineering
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationUsing Analytic QP and Sparseness to Speed Training of Support Vector Machines
Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Microsoft Research 1 Microsoft Way Redmond, WA 9805 jplatt@microsoft.com Abstract Training a Support Vector Machine
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationCS 542G: Solving Sparse Linear Systems
CS 542G: Solving Sparse Linear Systems Robert Bridson November 26, 2008 1 Direct Methods We have already derived several methods for solving a linear system, say Ax = b, or the related leastsquares problem
More informationParallel Performance Studies for COMSOL Multiphysics Using Scripting and Batch Processing
Parallel Performance Studies for COMSOL Multiphysics Using Scripting and Batch Processing Noemi Petra and Matthias K. Gobbert Department of Mathematics and Statistics, University of Maryland, Baltimore
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationEfficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI
Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from
More informationA Parallel Implementation of the BDDC Method for Linear Elasticity
A Parallel Implementation of the BDDC Method for Linear Elasticity Jakub Šístek joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík Institute of Mathematics of the AS CR, Prague
More informationBDDCML. solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) Jakub Šístek version 1.
BDDCML solver library based on Multi-Level Balancing Domain Decomposition by Constraints copyright (C) 2010-2012 Jakub Šístek version 1.3 Jakub Šístek i Table of Contents 1 Introduction.....................................
More informationMemory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves
Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationTransaction of JSCES, Paper No Parallel Finite Element Analysis in Large Scale Shell Structures using CGCG Solver
Transaction of JSCES, Paper No. 257 * Parallel Finite Element Analysis in Large Scale Shell Structures using Solver 1 2 3 Shohei HORIUCHI, Hirohisa NOGUCHI and Hiroshi KAWAI 1 24-62 22-4 2 22 3-14-1 3
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationAccelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger
Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger The goal of my project was to develop an optimized linear system solver to shorten the
More informationUppsala University Department of Information technology. Hands-on 1: Ill-conditioning = x 2
Uppsala University Department of Information technology Hands-on : Ill-conditioning Exercise (Ill-conditioned linear systems) Definition A system of linear equations is said to be ill-conditioned when
More informationParallel FEM Computation and Multilevel Graph Partitioning Xing Cai
Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationBenchmarking CPU Performance. Benchmarking CPU Performance
Cluster Computing Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance,
More informationHandling Parallelisation in OpenFOAM
Handling Parallelisation in OpenFOAM Hrvoje Jasak hrvoje.jasak@fsb.hr Faculty of Mechanical Engineering and Naval Architecture University of Zagreb, Croatia Handling Parallelisation in OpenFOAM p. 1 Parallelisation
More informationIterative Sparse Triangular Solves for Preconditioning
Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations
More informationAccelerating Finite Element Analysis in MATLAB with Parallel Computing
MATLAB Digest Accelerating Finite Element Analysis in MATLAB with Parallel Computing By Vaishali Hosagrahara, Krishna Tamminana, and Gaurav Sharma The Finite Element Method is a powerful numerical technique
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More information