Parallel solution for finite element linear systems of equations on workstation cluster*


Aug. 2009, Volume 6, No. 8 (Serial No. 57), Journal of Communication and Computer, ISSN , USA

FU Chao-jiang (Department of Civil Engineering, Fujian University of Technology, Fuzhou, China)

Abstract: With the advances in high-speed computer network technologies, the workstation cluster is becoming the main environment for parallel processing. Finite element linear systems of equations are common throughout structural analysis in civil engineering. The preconditioned conjugate gradient method (PCGM) is an iterative method for solving finite element systems of equations whose system matrices are symmetric positive definite. In this paper, the PCGM algorithm is parallelized and implemented on a DELL workstation cluster. Optimization techniques for the sparse matrix-vector multiplication are adopted in the programming, and the storage scheme is analyzed in detail. The experimental results show that the designed parallel algorithm achieves high speedup and good efficiency on the high-performance workstation cluster, illustrating the power of parallel computing for solving large problems much faster than is possible on a single processor.

Key words: parallel computing; preconditioned conjugate gradient method; finite element method; network; workstation cluster

* Acknowledgment: This work is supported by the Fujian Province Natural Science Foundation (No. 2008J0180) and the Scientific Research Start Foundation of Fujian University of Technology (No. GY-Z0707). Corresponding author: FU Chao-jiang, Ph.D.; research field: parallel computing.

1. Introduction

In recent years, parallel computing has become an important field of scientific computing, for two main reasons. First, parallel computing makes it possible to handle large-scale problems. Second, networks of computers running LINUX, connected via standard software such as MPI, form a workstation cluster, a low-cost parallel machine on which one can gain first experience in parallel computing. A workstation cluster is therefore poised to become the primary computing infrastructure for science and engineering: it is cheap, highly available, and provides multiple CPUs for parallel computing.

Solving finite element linear systems of equations is one of the most important problems in structural analysis in civil engineering. The application of the finite element method to engineering problems requires the solution of a system of linear equations, which can in general be written as

    Ax = b,

where A is an n × n sparse matrix, x is the vector of nodal unknowns, and b is the n-dimensional balance force vector. Due to the nature of the finite element method, A and b are constructed by assembling the individual element contributions, i.e.,

    A = Σ A_e,   b = Σ b_e   (e = 1, ..., nel),

where the A_e are the element matrices, the b_e are the element force vectors, and nel is the number of elements in the mesh. This yields a linear system with a symmetric positive definite sparse system matrix. Unfortunately, the direct methods of elementary linear algebra (e.g., Gaussian elimination) are difficult to adapt to distributed memory systems, especially when the coefficient matrix is sparse, so most sparse linear system solvers for distributed memory systems use iterative methods [1-2]. An iterative method makes an initial guess at a solution to the system and then tries to repeatedly improve that guess.
Thus, iterative methods produce a sequence of vectors that gradually converges to the solution of the linear system. Since the finite element method yields a linear system with a symmetric positive definite sparse system matrix, this property of the system matrix allows the use of PCGM as the linear solver.

2. Preconditioned conjugate gradient method

In order to solve the finite element linear systems of equations Ax = b, the PCG method is used. The CG method is an iterative technique that converges to the solution of the system through a sequence of at most n vector approximations. From any vector approximation to the solution, a search direction conjugate to the previous ones is used to determine a new approximation; in practice, only a few iterations are needed to obtain a good estimate of the solution. In the preconditioned version of the CG method, both the condition number of A and the number of iterations may be reduced by pre-multiplying the system Ax = b by a preconditioner M. Because the system matrix A has diagonal elements that are very different in magnitude, the preconditioner M is taken to be the diagonal matrix of A. The PCG method is listed below.

Preconditioned conjugate gradient algorithm:

    initialize: x_0, r_0 = b - A x_0, M u_0 = r_0, p_0 = u_0
    begin loop: for k = 0, 1, 2, ...
        (1) alpha_k = (u_k, r_k) / (p_k, A p_k)
        (2) x_{k+1} = x_k + alpha_k p_k
        (3) r_{k+1} = r_k - alpha_k A p_k
        (4) solve M u_{k+1} = r_{k+1}
        (5) if (u_{k+1}, r_{k+1}) ≤ ε
        (6)   and (r_{k+1}, r_{k+1}) ≤ ε
        (7)       return
        (8) beta_k = (u_{k+1}, r_{k+1}) / (u_k, r_k)
        (9) p_{k+1} = u_{k+1} + beta_k p_k
        (10) end for
    end loop

3. Parallel implementation of the PCG method

The PCG method can be parallelized because every vector operation of the algorithm can be parallelized separately: each processor executes scalar operations on a subset of the components of a vector, so the parallel implementation of the preconditioned conjugate gradient method is essentially identical to the serial one. The parallelism comes from storing and operating on local sections of the working vectors and the system matrix. The parallel implementation is presented below; M is the preconditioning matrix, taken as M = diag(a_11, a_22, ..., a_nn).

    initialize: x_0, r_0 = b - A x_0, M u_0 = r_0, p_0 = u_0, c = dot(u_0, r_0)
    main iteration:
        (1)  z = A p                  / parallel matrix-vector product /
        (2)  alpha = c / dot(p, z)    / parallel vector dot product /
        (3)  x = x + alpha * p        / BLAS daxpy operation /
        (4)  r = r - alpha * z        / BLAS daxpy operation /
        (5)  solve M u = r for u      / solve matrix system /
        (6)  d = dot(r, r)            / parallel vector dot product /
        (7)  if (sqrt(d) < tolerance) break
        (8)  beta = d / c
        (9)  p = u + beta * p         / BLAS daxpy operation /
        (10) gather(p)                / parallel gather /
        (11) c = d
        (12) go to (1)

BLAS (basic linear algebra subprograms) is a collection of routines that perform specific vector and matrix operations [3]. The parallel PCGM algorithm is implemented in the programming language C++, using the Message Passing Interface (MPI) standard [4] for communication between processors. For the solution of a linear system of equations of dimension n, the algorithm involves 1 matrix-vector product, 2 dot products, and 3 vector updates (daxpy operations) per iteration. The implementation is optimized with respect to memory usage by storing exactly 4 vectors of length n, the smallest number possible, and by using a function that implements the matrix-vector multiplication in matrix-free form, i.e., without storing any matrix [5-6].
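To make the distribution of the vector operations concrete, the following is a minimal C++/MPI sketch of one process's share of the PCG loop with the diagonal preconditioner. It is a sketch under stated assumptions, not the paper's code: matvec is assumed to perform the parallel product z = Ap (for example, the overlapped version sketched in the next section), and the stopping test uses the preconditioned residual (u, r) so that only the two global reductions per iteration named above are needed.

    #include <mpi.h>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Global dot product: local partial sum followed by MPI_Allreduce,
    // so that every process obtains the same scalar.
    double pdot(const std::vector<double>& a, const std::vector<double>& b) {
        double local = 0.0, global = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }

    // Hypothetical parallel matrix-vector product z = A*p for the local
    // rows; assumed to perform the required halo exchange internally.
    void matvec(const std::vector<double>& p, std::vector<double>& z);

    // One process's share of the PCG iteration; diag holds the local part
    // of the diagonal of A (the preconditioner M), x and r the local
    // solution and residual, tol the stopping tolerance.
    void pcg(const std::vector<double>& diag, std::vector<double>& x,
             std::vector<double>& r, double tol, int maxit) {
        std::size_t nloc = x.size();
        std::vector<double> u(nloc), p(nloc), z(nloc);
        for (std::size_t i = 0; i < nloc; ++i) u[i] = r[i] / diag[i]; // Mu0 = r0
        p = u;
        double c = pdot(u, r);
        for (int k = 0; k < maxit; ++k) {
            matvec(p, z);                       // (1) parallel matvec
            double alpha = c / pdot(p, z);      // (2) parallel dot product
            for (std::size_t i = 0; i < nloc; ++i) {
                x[i] += alpha * p[i];           // (3) daxpy
                r[i] -= alpha * z[i];           // (4) daxpy
                u[i]  = r[i] / diag[i];         // (5) diagonal solve Mu = r
            }
            double d = pdot(u, r);              // (6) parallel dot product
            if (d < tol * tol) break;           // (7) preconditioned residual test
            double beta = d / c;                // (8)
            for (std::size_t i = 0; i < nloc; ++i)
                p[i] = u[i] + beta * p[i];      // (9) daxpy
            c = d;
        }
    }

Every process executes the same loop on its own block of components; the only global synchronization points are the two MPI_Allreduce calls inside pdot.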

The vectors stored are the approximate solution x, the search direction p, the residual vector r, and the auxiliary vector q. The parallel form of the PCG algorithm requires 4 communication operations per iteration. To compute the matrix-vector product, processors need to exchange one data item with their neighboring processors; these communications are implemented with nonblocking MPI_ISEND/MPI_IRECV commands. Since the dot products apply to vectors split across the processors and their results are needed on all processors, an MPI_Allreduce operation is required for each of the 2 dot products. With a diagonal preconditioner, no additional communication is required, since diagonal preconditioning can be accomplished locally on all rows contained in a process; other preconditioners, such as incomplete Cholesky, require additional communication. The diagonal preconditioner is therefore used in this algorithm. Optimization techniques [7-8] for the sparse matrix-vector multiplication are adopted in the programming: the serial part of the dot products, local to each processor, is implemented in the C++ inner function dot, and the most expensive operation in each iteration, the matrix-vector product in step (1), is implemented using parallel dot operations.

In this paper, the computations in the PCG algorithm are reorganized in order to hide the latency of the communications. This overlap of communications with computations is implemented using asynchronous messages in the message-passing model, as follows. The parallelization of the product z = Ap can be performed with an overlap between the messages and the computations: the required message with the halo data of p, i.e., the data corresponding to the grid points in the domain of one processor that are used to compute components of z in a neighboring processor, is overlapped with computations on the matrix rows corresponding to the inner part of the domain, where the inner part is the domain without its halos. The latency of the communication step in the inner product p^T z can be hidden by delaying the update of x_i by one iteration [9].
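The overlap just described can be sketched as follows, assuming a one-dimensional row-block partition in which each process exchanges a single halo value with each neighbor (here made concrete with the three-point stencil of a 1-D model problem, an illustrative assumption, not the paper's exact kernel; at least two local rows per process are assumed).

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Rows that use only local data: z_i = 2*p_i - p_{i-1} - p_{i+1}.
    void apply_inner_rows(const std::vector<double>& p, std::vector<double>& z) {
        std::size_t n = p.size();
        for (std::size_t i = 1; i + 1 < n; ++i)
            z[i] = 2.0 * p[i] - p[i - 1] - p[i + 1];
    }

    // First and last local rows, which need the neighbors' halo values.
    void apply_boundary_rows(const std::vector<double>& p, double halo_lo,
                             double halo_hi, std::vector<double>& z) {
        std::size_t n = p.size();
        z[0]     = 2.0 * p[0]     - halo_lo   - p[1];
        z[n - 1] = 2.0 * p[n - 1] - p[n - 2]  - halo_hi;
    }

    // Overlapped z = A*p: post nonblocking halo exchange, compute the
    // interior while the messages are in flight, then finish the rows
    // that depend on the halo values.
    void matvec_overlap(const std::vector<double>& p, std::vector<double>& z,
                        int rank, int nprocs) {
        const int tag = 0;
        // Halo values default to 0, matching a homogeneous Dirichlet boundary
        // on the ends of the global domain.
        double halo_lo = 0.0, halo_hi = 0.0;
        double send_lo = p.front(), send_hi = p.back();
        MPI_Request req[4];
        int nreq = 0;
        if (rank > 0) {                       // exchange with lower neighbor
            MPI_Irecv(&halo_lo, 1, MPI_DOUBLE, rank - 1, tag,
                      MPI_COMM_WORLD, &req[nreq++]);
            MPI_Isend(&send_lo, 1, MPI_DOUBLE, rank - 1, tag,
                      MPI_COMM_WORLD, &req[nreq++]);
        }
        if (rank < nprocs - 1) {              // exchange with upper neighbor
            MPI_Irecv(&halo_hi, 1, MPI_DOUBLE, rank + 1, tag,
                      MPI_COMM_WORLD, &req[nreq++]);
            MPI_Isend(&send_hi, 1, MPI_DOUBLE, rank + 1, tag,
                      MPI_COMM_WORLD, &req[nreq++]);
        }
        apply_inner_rows(p, z);               // computation overlapped with messages
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        apply_boundary_rows(p, halo_lo, halo_hi, z);
    }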
4. Storage scheme

The aim of the matrix storage scheme is to achieve better performance; the storage technique has an important impact on performance. There are several possibilities for storing a matrix, and the key is not to store unnecessary data. The emphasis here is on developing an algorithm that will solve problems too large to be solved efficiently on a single CPU. Storing all n^2 elements of A ∈ R^{n×n} explicitly is known as dense storage mode. This is the simplest possible storage scheme, and for dense matrices it is a good technique. For a sparse matrix it is not, because all zero elements are stored: since most of the n^2 elements are 0, this is wasteful both in memory and in runtime, as many multiplications involving zeros are needlessly computed. Simulations using this storage mode can be easily programmed in C++. A common idea is to take advantage of the sparsity of the matrix, i.e., the low percentage of non-zero elements. In this sparse storage mode, only non-zero elements are stored, along with integer index information indicating the position of each element in the matrix. This reduces the memory requirements for the system matrix and also improves performance, as only multiplications with the explicitly stored, i.e., non-zero, elements are computed.

To reduce the memory usage further, the constant coefficients in pre-determined positions of the system matrix are exploited. A function is provided that accepts a vector p as input and returns the vector z = Ap as output; each component z_k is computed as a summation of the appropriate components of p multiplied by hard-coded coefficients [5]. This matrix-vector multiplication function is inserted at the appropriate place in the C++ implementation of the PCGM. This technique is known as a matrix-free implementation, because no elements of the system matrix A are explicitly stored at all [6].
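As an illustration of such a hard-coded product, the sketch below applies the five-point stencil of the 2-D Poisson model problem used in the numerical tests of Section 5. The mesh parameter m (with n = m*m unknowns) and the coefficients 4 and -1 are assumptions corresponding to the standard discretization, not coefficients quoted by the paper.

    #include <vector>

    // Matrix-free z = A*p for the 2-D Poisson model problem on an m-by-m
    // grid with homogeneous Dirichlet boundary conditions; z must have
    // size m*m. No matrix is stored: each z_k is a hard-coded combination
    // of the components of p.
    void matvec_free(const std::vector<double>& p, std::vector<double>& z, int m) {
        for (int j = 0; j < m; ++j) {
            for (int i = 0; i < m; ++i) {
                int k = j * m + i;              // row index of grid point (i, j)
                double v = 4.0 * p[k];          // diagonal coefficient
                if (i > 0)     v -= p[k - 1];   // west neighbor
                if (i < m - 1) v -= p[k + 1];   // east neighbor
                if (j > 0)     v -= p[k - m];   // south neighbor
                if (j < m - 1) v -= p[k + m];   // north neighbor
                z[k] = v;
            }
        }
    }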

Such a matrix-free method dramatically reduces the memory requirements; it is the most efficient approach with respect to memory usage, and it is the scheme adopted in this paper. Memory is an important limitation on the size of the system of equations to be solved. As stated earlier, the minimum number of vectors needed to compute the solution of the linear system is 4 vectors of length n; aside from the system matrix, a prediction of 4n values accounts for almost all of the memory needed to compute the solution with the matrix-free method. Using the sparse storage mode, the non-zero elements of the system matrix must also be stored; for example, if the matrix is pentadiagonal, assume that 5 vectors of length n are stored for the matrix, so the memory prediction is 5n added to the matrix-free prediction. Densely storing the matrix adds a vector of n^2 elements to the matrix-free prediction. The theoretical predictions of memory usage for the three storage methods, when double precision arithmetic is used, are given in Table 1.

Table 1 Predicted memory usage for the three storage methods in Megabytes (MB)

    n       Dense    Sparse    Matrix-free
    256     <1       <1        <1
    (entries for larger n were lost in extraction)

5. Numerical test and analysis

5.1 Parallel platform

The numerical and performance tests of the developed parallel PCG algorithm are performed on the DELL workstation cluster at the School of Computer Engineering and Science, Shanghai University. It is a cluster with 12 processors, arranged as one four-processor node with 2.0 GHz Intel Xeon chips (1 MB cache) and 4 dual-processor nodes with 2.4 GHz Intel Xeon chips (512 KB cache), with 1.0 GB of memory per node. These nodes are connected with a Mbps Ethernet interconnect and programmed using the Message Passing Interface (MPI) paradigm [10]. For the message-passing version, the MPICH library has been chosen because of its efficiency on this computer architecture; MPICH has been widely adopted as the message-passing interface of choice for many years. In the MPICH implementation, inter-processor communication is performed through a special interface provided to send a message and a matching interface on the other processor to receive it. In our program, the communications are implemented with nonblocking communication in order to overlap the communications with the computations.

5.2 Numerical test problem and decision

The problem to be solved is a classical prototype problem, the Poisson equation with a homogeneous Dirichlet boundary condition, discretized using finite elements in two dimensions. This yields a linear system with a symmetric positive definite system matrix, which allows the use of the PCG method presented in this paper as the solver. The results of the computation are listed in Table 2 and Fig. 1. At each time step, the solution of the system of algebraic equations can be divided into two parts: (i) initialization, which includes the generation of the matrix A and the right-hand side vector b, the calculation of the preconditioner, and the initialization of the PCG method, and (ii) the iterations of the PCG method. Since different problem sizes imply different convergence rates, the ratio between the times required by the two parts varies with the size of the problem. In order to simplify the presentation of results and to avoid problems with the accuracy of the timer and perturbations to the cache behavior, we have timed both parts together, as in the sketch below.
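A minimal sketch of such a combined measurement, under the assumption that the hypothetical helpers initialize() and pcg_iterate() stand for parts (i) and (ii); MPI_Wtime with barriers gives a consistent wall-clock reading across processes.

    #include <mpi.h>
    #include <cstdio>

    void initialize();   // hypothetical part (i): build A, b, preconditioner
    void pcg_iterate();  // hypothetical part (ii): run the PCG iterations

    // Time parts (i) and (ii) together with one wall-clock measurement.
    void timed_solve(int rank) {
        MPI_Barrier(MPI_COMM_WORLD);        // synchronize before timing
        double t0 = MPI_Wtime();
        initialize();
        pcg_iterate();
        MPI_Barrier(MPI_COMM_WORLD);        // wait until all processes finish
        double t1 = MPI_Wtime();
        if (rank == 0)
            std::printf("total time: %f s\n", t1 - t0);
    }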
5.3 Numerical results analysis

Memory is an important limitation on the size of the system to be solved. Aside from the system matrix, the vectors of length n used in the method make up the bulk of the memory; any other variables are ignored, as in the worked estimate below.
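As a worked example of these counts (a sketch assuming 8-byte doubles and, for the sparse mode, the pentadiagonal matrix discussed in Section 4):

\[
\begin{aligned}
M_{\text{matrix-free}} &\approx 4n \times 8~\text{bytes} \quad (\text{the four vectors } x, p, r, q),\\
M_{\text{sparse}}      &\approx (4n + 5n) \times 8~\text{bytes} \quad (\text{five stored diagonals}),\\
M_{\text{dense}}       &\approx (4n + n^2) \times 8~\text{bytes}.
\end{aligned}
\]

At n = 256, even the dense prediction is about 256^2 × 8 bytes ≈ 0.5 MB, consistent with the "<1" entries in the first row of Table 1.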

The estimation is simply made by counting the number of vectors used. As stated earlier, the minimum number of vectors needed by this algorithm is 4 vectors of length n. Table 1 shows the theoretical predictions of the memory used by the C++ code with the dense, sparse, and matrix-free implementations, respectively, when double-precision arithmetic is used. For the larger dimensions n in Table 1, less than 1 MB of memory is used for sparse or matrix-free storage, compared to 134 MB for dense storage. Increasing the dimension n to 16384, dense storage takes over 2 GB. It is not possible to compute this problem on a single processor of the workstations, because only 1 GB of memory is available to a single-processor computation; this memory bound determines the largest system that a single processor can hold. Clearly, the matrix-free implementation is the best storage method. With a matrix-free implementation, the method is optimal with respect to memory usage, and we are able to solve problems that are much too large for single-processor computers.

Table 2 and Fig. 1 summarize the timings and speedups for this cluster, using up to 12 processors. It is readily apparent from both that the timings continue to decrease significantly all the way up to 12 processors. When the number of processors is greater than 4, however, an apparent slow-down is observed, because the communication overhead between processors increases significantly. The cache of each processor is of a good size for non-computational uses but does not hold much data, which leads to frequent loading of data from memory into the cache; moreover, the nodes have a 32-bit bus that cannot serve data as fast as the processors can use it. With up to 4 processors, the computation runs on the four-processor node, where the communication overhead between processors is lower, and thus the performance is better.

Table 2 Timings in seconds for different n on the 12-processor cluster
    (rows for n = 512, 2048, 4096, ...; the timing entries were lost in extraction)

Fig. 1 Plot of the speedup for different n (n = 512, 1024, 2048, ...) on the 12-processor cluster

To analyze the influence of a different parallel environment on this algorithm, another cluster is constituted, consisting of 4 single-processor nodes with 600 MHz Intel Pentium III processors and 256 MB of memory. These nodes are connected with a Mbps Ethernet interconnect. The same computations are performed on this cluster, and the results are listed in Table 3 and Fig. 2.

Fig. 2 Plot of the speedup for different n on the 4-processor cluster
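For reference, the speedup plotted in Figs. 1-2 and the efficiency mentioned in the abstract follow the standard definitions (assumed here, as the paper does not restate them): with T_1 the single-processor time and T_p the time on p processors,

\[
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}.
\]

For example, a code that runs 6 times faster on 8 processors has S_8 = 6 and E_8 = 0.75.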

The results for up to 4 processors show that using about 2 processors constitutes the most efficient use of the resources. Though the processor speed is lower than that of the cluster used earlier, the results are good. The speedups in Fig. 1 are higher than those in Fig. 2 for the same number of processors and the same problem size. This result shows that a high-performance parallel platform is required for parallel computing.

Table 3 Timings in seconds for different n on the 4-processor cluster
    (the timing entries were lost in extraction)

6. Conclusions

Parallel computing is a useful tool for solving large-scale problems faster. The PCG algorithm has been parallelized, the parallel algorithm has been implemented on two different clusters, and its performance has been analyzed. The reduction in memory due to the matrix-free implementation not only allows for the solution of much larger problems but also decreases computing time. The parallel performance studies show that a high-performance parallel platform is necessary to obtain excellent speedup on a cluster, and that load balancing is very important for ensuring good scalability and increasing efficiency. These observations reflect the fact that the parallel PCGM algorithm involves 4 communications per iteration and therefore requires a tight coupling of the cluster; this shows how important it is to have a workstation cluster available. The algorithm implemented on the different parallel platforms appears to be robust.

References:
[1] Barrett R. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Philadelphia: SIAM.
[2] Law K. H. A parallel finite element solution method. Computers & Structures, 1986, 23(6).
[3] Karniadakis G. E. Parallel Scientific Computing in C++ and MPI. Cambridge University Press.
[4] Pacheco P. S. Parallel Programming with MPI. Morgan Kaufmann.
[5] Allen K. P. Efficient Parallel Computing for Solving Linear Systems of Equations. Graduate Student Seminar, University of Maryland, Baltimore County, November.
[6] Aliaga J. I., Hernandez V. Symmetric sparse matrix-vector product on distributed memory multiprocessors. Conference on Parallel Computing and Transputer Applications, Barcelona, Spain.
[7] Brown P. N., Hindmarsh A. C. Matrix-free methods for stiff systems of ODEs. SIAM J. Numer. Anal., 1986, 23.
[8] Geus R., Rollin S. Towards a fast parallel sparse symmetric matrix-vector multiplication. Parallel Computing, 2007, 27(2).
[9] Nastea S. G. Load-balanced sparse matrix-vector multiplication on parallel computers. Parallel and Distributed Computing, 1997, 46.
[10] FU Chao-jiang. The research on parallel computation of finite element structural analysis based on MPI cluster. Doctoral dissertation, Shanghai University, Shanghai, China.
