
6 Implementation

6.1 Outline

This chapter presents the implementation details for optimizing the truss obtained with the ground structure approach, according to the formulation of the previous chapters. The IPM was coded in C++ and CUDA in its two versions, long-step and predictor-corrector, for comparison purposes. The predictor-corrector version requires fewer iterations due to its adaptive $\sigma$, but it must solve two linear systems per iteration. As discussed before, this is not a problem if a direct solver is used, because $M = B D_{\mathrm{ipm}} B^T$ can be factorized once and reused in two triangular (back) substitutions. If an iterative solver is used, however, two entirely different linear systems must be solved. Hence, the long-step version was used for most of the tests. The general diagram is given in Figure 6.1.

Figure 6.1: Truss topology optimization diagram.

To solve the optimization problem, a software solution called GRAND++ was developed. It uses the mesh generator of GRAND; the mesh is saved in a text file and read into C++. There are some differences between GRAND, which is MATLAB based, GRAND++, which is written in C++, and the CUDA version. When boundary conditions are applied in GRAND, the rows and columns of the arrays for every restricted degree of freedom are removed after assembly. In GRAND++ the size of the arrays does not change, as will be explained below. The file from GRAND contains the data shown in Table 6.1.

Table 6.1: Input mesh data.

Variable   Size (2D)   Size (3D)   Description
NODE       N_n x 2     N_n x 3     Each row has the x, y (2D) or x, y, z (3D) coordinates of a given node
BARS       N_b x 2     N_b x 2     Each row contains the pair of connected nodes i, j
SUPP       N_f x 3     N_f x 4     Each row consists of a node number, fixity in x, fixity in y and, in 3D, fixity in z; the total number of specified fixities is N_sup
LOAD       N_l x 3     N_l x 4     Each row consists of a node number, load in x, load in y and, in 3D, load in z

The director cosines $t_r$ of a bar connecting nodes $i$ and $j$ are obtained from the NODE matrix:

$d_r = \mathrm{NODE}_j - \mathrm{NODE}_i, \qquad t_r = d_r / \lVert d_r \rVert$.    (6-1)

6.2 Simplified Views

As noted in Chapter 3, the matrix $M = B D_{\mathrm{ipm}} B^T$ of the IPM can be formed either as a matrix product or as a sum of per-bar matrices. If a direct solver is employed, the full $M$ matrix is usually assembled. Since $M$ is quite sparse, a sparse format can be used to store it, such as COO (Coordinate List, or Triplets) or CSR (Compressed Sparse Row), among others. It is useful to know in advance the memory required to store $M$; one approach is to find the space required to store $BB^T$, which has the same size as $M$. A common alternative is to provide an approximate value and, if insufficient, double it, but this wastes space, which is a scarce resource for large meshes. Figure 6.2 shows the local $m_r$ with the matrix indices its entries will have when expanded into the global version $M_r$. These $i$ and $j$ indices are the node indices of a bar $(i, j)$.
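Equation (6-1) translates directly into a few lines of C++. The following is a minimal sketch, assuming NODE and BARS were read into flat row-major arrays; the function name and layout are illustrative, not the actual GRAND++ code.

#include <cmath>

// Director cosines for every bar of a 2D ground structure (Equation 6-1).
// NODE is flattened row-major (x0, y0, x1, y1, ...); BARS likewise (i0, j0, i1, j1, ...).
void directorCosines2D(int Nb, const double* NODE, const int* BARS,
                       double* tx, double* ty)
{
    for (int r = 0; r < Nb; ++r) {
        int i = BARS[2*r], j = BARS[2*r + 1];
        double dx = NODE[2*j]     - NODE[2*i];   // d_r = NODE_j - NODE_i
        double dy = NODE[2*j + 1] - NODE[2*i + 1];
        double len = std::sqrt(dx*dx + dy*dy);   // ||d_r||
        tx[r] = dx / len;                        // t_r = d_r / ||d_r||
        ty[r] = dy / len;
    }
}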

Figure 6.2: m_r and its indices.

The indices have the effect of dispersing the elements of $m_r$. This dispersion occurs in four blocks, denoted by the dotted lines. When the global $M_r$ is formed, these blocks are placed as shown in Figure 6.3, where each square represents a 2x2 or 3x3 sub-matrix of $m_r$.

Figure 6.3: M_r and its sub-matrices.

Figure 6.4 shows the simplified view of $M_r$, denoted $M_r^{sv}$. While $M_r$ has size $\mathrm{DIM}\,N_n \times \mathrm{DIM}\,N_n$, the simplified view has size $N_n \times N_n$, and its elements sit at the $i$ and $j$ indices. The simplified view makes it easier to find the memory requirements of $M$. We want the number of non-zeros of $M$, $nnz(M) = \mathrm{DIM}^2\, nnz(M^{sv})$. To find $nnz(M^{sv})$, note that no bar has equal indices, that $i$ is lower than $j$, and that two of the four elements each bar contributes to $M^{sv}$ lie on the diagonal. Hence, $nnz(M^{sv})$ is the difference between the total number of elements contributed by all the bars (each bar contributes 4 elements to $M^{sv}$) and the number of elements that fall in an already occupied position, $N_{rep}$:

$nnz(M^{sv}) = 4N_b - N_{rep}$.    (6-2)
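To make the dispersion of Figures 6.2 and 6.3 concrete: for a 2D bar $(i, j)$, the four 2x2 blocks of $m_r$ land at the global row/column indices 2i, 2i+1, 2j, 2j+1. A minimal sketch of this scatter into COO triplets, with illustrative names, is:

#include <vector>

struct Triplet { int row, col; double val; };

// Scatter the local 4x4 matrix m_r of a 2D bar (i, j) into global COO
// triplets, following the block placement of Figure 6.3.
void scatterBar2D(int i, int j, const double m[4][4],
                  std::vector<Triplet>& coo)
{
    const int g[4] = { 2*i, 2*i + 1, 2*j, 2*j + 1 }; // global DOF indices
    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            coo.push_back({ g[a], g[b], m[a][b] });
}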

Figure 6.4: Simplified view $M_r^{sv}$ and its indices.

If two elements are in the same position, they account for one repetition. The number of repetitions is:

$N_{rep} = 2N_b - N_n$.    (6-3)

Then $nnz(M^{sv}) = 2N_b + N_n$, and the number of non-zeros of the original $M$ is:

$nnz(M) = \mathrm{DIM}^2 (2N_b + N_n)$.    (6-4)

This equation was used in C++ to create the vector that stores $M$ in CSR format.

6.3 Linear system solvers

As discussed above, every optimizer iteration must solve a linear system $M \Delta y = r_y$. The matrix $M$ has some key features:

1. It is square, large, sparse, and symmetric positive definite;
2. It changes at every iteration;
3. It tends to become ill-conditioned as we approach the final iterations;
4. Due to (1), we can use a Cholesky factorization or an iterative solver such as PCG;
5. It is similar to the stiffness matrix, differing only in the diagonal values of $D_a$.

The next sections detail the solvers used. Some parts of the GRAND++ code are based on the work of Duarte et al. [44], who devised PolyTop++, a C++ implementation of the PolyTop MATLAB code presented by Talischi et al. [17]. Table 6.2 summarizes the features of all the solvers used in this work.
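Returning to Equation (6-4): it allows the CSR arrays of $M$ to be sized exactly before assembly. A minimal sketch, with illustrative names for the standard CSR arrays, is:

#include <cstddef>
#include <vector>

// Size the CSR storage of M exactly, using nnz(M) = DIM^2 (2 N_b + N_n)
// from Equation (6-4). DIM is 2 or 3.
struct CsrMatrix {
    std::vector<std::ptrdiff_t> rowPtr;  // size DIM*N_n + 1
    std::vector<int>            colInd;  // size nnz
    std::vector<double>         val;     // size nnz
};

CsrMatrix allocateM(int DIM, int Nb, int Nn)
{
    std::size_t nnz = std::size_t(DIM) * DIM * (2 * std::size_t(Nb) + Nn);
    CsrMatrix M;
    M.rowPtr.resize(std::size_t(DIM) * Nn + 1);
    M.colInd.resize(nnz);
    M.val.resize(nnz);
    return M;
}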

Table 6.2: Features of the employed linear solvers.

Solver    Type       Version           Platform   Dimension
UMFPACK   Direct     Serial            CPU        2D
PARDISO   Direct     Parallel          CPU        2D/3D
PCG       Iterative  Serial            CPU        2D
EbeCPU    Iterative  Serial/Parallel   CPU        2D/3D
EbeGPU    Iterative  Parallel          GPU        2D/3D

UMFPACK

The Unsymmetric MultiFrontal Package (UMFPACK) [45] is a set of routines for solving sparse linear systems of the form $Ax = b$ using the unsymmetric multifrontal method ($A$ is not required to be symmetric). It is written in ANSI/ISO C and interfaces with MATLAB; we use its C interface in GRAND++. The matrix $M$ is obtained by means of the product $B D_{\mathrm{ipm}} B^T$ and is stored in the triplet format. UMFPACK requires $M$ in the compressed sparse column (CSC) format, so we first convert it using the umfpack_dl_triplet_to_col function provided by UMFPACK. The result is the vector $\Delta y$, which is used to find $\Delta x$ and $\Delta z$ in the second and third normal equations.

PARDISO

The Parallel Direct Sparse Solver Interface (PARDISO) [46] is a high-performance, robust, memory-efficient, and easy-to-use software package for solving large sparse symmetric and non-symmetric linear systems of equations on shared-memory multiprocessors. As a parallel solver it can use all of the CPU cores and is highly efficient. PARDISO needs the matrix in CSR format, so umfpack_dl_triplet_to_col is also employed here (for the symmetric $M$, the CSC and CSR representations coincide). In this thesis PARDISO is used with all the available threads of the CPU.

PCG

We use the Preconditioned Conjugate Gradient (PCG) method [47] as an iterative solver for symmetric positive definite systems. This solver has the advantage of requiring less memory than the direct solvers above, but it may need more time to converge with a bad preconditioner, or none. The Jacobi preconditioner was used for all the iterative solvers due to its simplicity.
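With the Jacobi choice, $P$ is simply the diagonal of the system matrix, so the preconditioner solve in the algorithm below reduces to a pointwise division. A minimal sketch (illustrative names, not the GRAND++ routine):

// Jacobi preconditioner: P = diag(A), so solving P*w = r is a pointwise
// division. For the element-by-element solvers, the diagonal of M can be
// accumulated bar by bar without assembling M.
void jacobiApply(int n, const double* diagA, const double* r, double* w)
{
    for (int k = 0; k < n; ++k)
        w[k] = r[k] / diagA[k];      // w = P^{-1} r
}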

The algorithm is shown in Algorithm 3 for a generic system $Ax = b$ with preconditioner $P$.

Algorithm 3: Preconditioned CG
 1: Initialization
 2: Given x_0 and r_0 = b - A x_0
 3: for i = 1, 2, ... do
 4:     Solve P w_{i-1} = r_{i-1}
 5:     ρ_{i-1} = r_{i-1}^T w_{i-1}
 6:     if i = 1 then
 7:         p_i = w_{i-1}
 8:     else
 9:         β_{i-1} = ρ_{i-1} / ρ_{i-2}
10:         p_i = w_{i-1} + β_{i-1} p_{i-1}
11:     end if
12:     q_i = A p_i
13:     α_i = ρ_{i-1} / (p_i^T q_i)
14:     x_i = x_{i-1} + α_i p_i
15:     r_i = r_{i-1} - α_i q_i
16:     if x_i is accurate enough then
17:         quit
18:     end if
19: end for

The most complex and time-consuming operation is the matrix-vector product $q_i = A p_i$. Usually the matrix is sparse and the vector is dense, so this operation is called Sparse Matrix-Vector Multiplication (SpMV).

EbeCPU

All three solvers above have to assemble $M$, which makes them unsuitable for large meshes because of limited PC memory; they also require significant numerical processing for large problems. Element-by-Element PCG on CPU (EbeCPU) is a solver created to address these issues and to increase the number of bars the solver can handle. EbeCPU has two key features:

1. To reduce memory consumption, the full $M$ is not assembled. Rather, the matrix-vector product of step 12 in Algorithm 3 is produced as a sum of products per bar (a serial sketch is given below):

$q = \sum_{r=1}^{N_b} M_r\, p$;    (6-5)

2. To reduce the time required to solve the system, parallel computing is used to perform the matrix-vector products.
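The following is a minimal serial sketch of the per-bar accumulation of Equation (6-5) for a 2D truss, using only the director cosines and the $D_{\mathrm{ipm}}$ entries; names are illustrative.

// Equation (6-5): q = sum_r M_r * p, accumulated bar by bar without ever
// assembling M. Each bar needs only its director cosines (tx, ty) and its
// Dipm entry, since M_r = Dipm_r * b_r b_r^T with b_r = (c, s, -c, -s).
void ebeSpMV2D(int Nb, const int* bi, const int* bj,
               const double* tx, const double* ty, const double* Dipm,
               const double* p, double* q, int Ndof)
{
    for (int k = 0; k < Ndof; ++k) q[k] = 0.0;
    for (int r = 0; r < Nb; ++r) {
        int i = bi[r], j = bj[r];
        double c = tx[r], s = ty[r];
        // relative displacement of the bar end nodes
        double p13 = p[2*i]     - p[2*j];
        double p24 = p[2*i + 1] - p[2*j + 1];
        double t = Dipm[r] * (c * p13 + s * p24);
        q[2*i]     += c * t;  q[2*i + 1] += s * t;
        q[2*j]     -= c * t;  q[2*j + 1] -= s * t;
    }
}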

The SpMV operation is performed using Equation (6-5), as shown in Figure 6.5. For the 2D problem this requires:

1. 2 elements (i, j) of BARS to define the nodes;
2. 4 elements of p (6 for a 3D problem);
3. 16 elements of $M_r$ (36 for a 3D problem). To reduce the memory footprint, only the 2 director cosines (3 for 3D) and the $D_{\mathrm{ipm}}$ entry are used, because that is the minimal information needed to create the matrix of Figure 6.2;
4. 4 elements of q (6 for a 3D problem).

Figure 6.5: 2D SpMV for a bar and the elements affected.

Race condition. If two bars share a node, they have some non-zeros in the same positions of $M$, so their partial matrix-vector products have non-zeros in the same positions. This causes a problem in the parallel implementation of EbeCPU: two threads may attempt to update the same memory positions in the result vector q, with undefined behavior. This situation is called a race condition and must be avoided to prevent numerical errors. One way to overcome the problem is coloring [48]. The coloring algorithm creates groups of elements that do not share any node, allowing threads to process all the bars of the same color at the same time; once a color is completed, the next color is processed, and so on. Coloring applied to a 2D truss is shown in Figure 6.6. The algorithm and code for a 3D truss are the same, because only node connectivity is required. Three versions of EbeCPU were implemented, differing only in the way the matrix-vector product is performed.
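Before detailing those versions, here is a minimal greedy coloring sketch in the spirit of [48] (an assumption: the exact algorithm of the reference is not reproduced here). Each bar receives the smallest color not already used by a bar sharing one of its nodes.

#include <algorithm>
#include <vector>

// Greedy coloring of the bar conflict graph: bars conflict iff they share a
// node, so bars of equal color can be processed concurrently.
std::vector<int> colorBars(int Nb, int Nn, const int* bi, const int* bj)
{
    std::vector<std::vector<int>> nodeColors(Nn); // colors already touching a node
    std::vector<int> color(Nb);
    for (int r = 0; r < Nb; ++r) {
        auto used = [&](int c) {
            const auto& a = nodeColors[bi[r]];
            const auto& b = nodeColors[bj[r]];
            return std::find(a.begin(), a.end(), c) != a.end() ||
                   std::find(b.begin(), b.end(), c) != b.end();
        };
        int c = 0;
        while (used(c)) ++c;                 // smallest color free at both nodes
        color[r] = c;
        nodeColors[bi[r]].push_back(c);
        nodeColors[bj[r]].push_back(c);
    }
    return color;
}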

Figure 6.6: Bar coloring and groups by color.

EbeCPU Serial. The serial version was created and tested to provide a baseline for comparison. It sequentially performs the SpMV and accumulates into the output q.

EbeCPU Parallel 1. Bars of the same color are chosen, and the SpMV is performed in parallel: every available CPU thread computes an element product $q_r = M_r p$, and OpenMP is used for the thread distribution. Note that when the bars' director cosines are accessed according to their colors, the code can touch distant memory positions, causing cache misses. The next version attempts to solve this.

EbeCPU Parallel 2. To enhance data locality, the bars are reordered so that the indices i, j of bars of the same color are placed together. This reordering is executed prior to the calculation of the director cosines, so they are ordered as well, as illustrated in Figure 6.7. Reordering also affects the order of the force and area vector elements; when the optimization completes, the area vector is reordered back to its original order for plotting purposes.
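A sketch of the per-color parallel loop after reordering (EbeCPU Parallel 2), assuming the bars of color c occupy the contiguous range colorStart[c] to colorStart[c+1]-1; names are illustrative. Within a color no two bars share a node, so threads never update the same entries of q.

#include <omp.h>

// Colored element-by-element SpMV: colors run sequentially, bars of one
// color run in parallel without write conflicts on q.
void ebeSpMV2DColored(int Ncolors, const int* colorStart,
                      const int* bi, const int* bj,
                      const double* tx, const double* ty, const double* Dipm,
                      const double* p, double* q)
{
    for (int c = 0; c < Ncolors; ++c) {
        #pragma omp parallel for
        for (int r = colorStart[c]; r < colorStart[c + 1]; ++r) {
            int i = bi[r], j = bj[r];
            double t = Dipm[r] * (tx[r] * (p[2*i]   - p[2*j]) +
                                  ty[r] * (p[2*i+1] - p[2*j+1]));
            q[2*i]   += tx[r] * t;  q[2*i+1] += ty[r] * t;
            q[2*j]   -= tx[r] * t;  q[2*j+1] -= ty[r] * t;
        }
    }
}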

Figure 6.7: Bar indices before and after reordering.

GPU implementations

The number of mathematical operations is enormous for large meshes, so GPU solvers were implemented to reduce the solution time. The PCG algorithm was ported to the GPU using CUDA, and two versions were developed. The entire PCG algorithm runs on the GPU; the two variants differ only in the SpMV operation, discussed as follows.

EbeGPU. In the Element-by-Element PCG on GPU (EbeGPU), the bars are colored and reordered; then, to prevent access to invalid memory addresses, the bar vectors i, j, t_x, t_y (and t_z in 3D problems) are padded. Padding consists of appending dummy values so that the vector length becomes a multiple of the GPU block size. In the case of colored bars, each bar vector is padded with 0 or -1 to ensure that every group of the same color has a number of elements that is a multiple of a constant PCGTHREADS, as shown in Figure 6.8. The constant PCGTHREADS is set to 2048, and the kernel is launched with the parameters in Listing 6.1.

Listing 6.1: Kernel launching parameters
1 int threads = 128;
2 dim3 numthreads = dim3(threads, 1, 1);
3 dim3 numblocks = dim3(PCGTHREADS / threads, 1, 1);

The matrix-vector product kernel is shown in Listing 6.2 for a 2D truss. The kernel is launched once for every color, with NsColor the number of segments of size PCGTHREADS for the color being processed (in Figure 6.8, NsColor would take the values 1, 1, and 2). It is used in the for loop so that NsColor x PCGTHREADS bars are processed each time the kernel is called; the for iterations are shown in Figure 6.8(b). Note that there are 4 coalesced accesses in lines 10-13, but 8 random accesses in lines 18, 19, 26, 27, 31 and 32. The latter could touch invalid memory addresses for padded bars, so invalid-bar checks are performed in lines 16, 24 and 29. Random accesses degrade the effective bandwidth of the GPU, so the next version attempts to solve this issue.

Figure 6.8: Padding and data in GPU. (a) Padding for a colored bar. (b) Data access pattern in the GPU.

Listing 6.2: SpMV Kernel
 1 __global__ void kernelgpuspmv(double *d_cos, double *d_sin,
 2     double *d_dipm, int *d_i, int *d_j, double *d_p, double *d_q,
 3     int NsColor, int begincolor)
 4 {
 5     // select a bar index
 6     unsigned int r = begincolor + blockIdx.x * blockDim.x +
 7         threadIdx.x;
 8     for (int ns = 0; ns < NsColor; ns++)
 9     {
10         int i = d_i[r];              // i node for bar r
11         int j = d_j[r];              // j node for bar r
12         double c = d_cos[r];         // cos = tx
13         double s = d_sin[r];         // sin = ty
14         double p13 = 0;              // temp variable
15         double p24 = 0;              // temp variable
16         if ((i != -1) && (j != -1))  // check invalid bar
17         {
18             p13 = d_p[2*i] - d_p[2*j];
19             p24 = d_p[2*i+1] - d_p[2*j+1];
20         }
21         double tt = d_dipm[r] * (c * p13 + s * p24);
22         double q0 = c * tt;          // partial output
23         double q1 = s * tt;          // partial output
24         if (i != -1)                 // check invalid bar
25         {
26             d_q[2*i] += q0;          // update output
27             d_q[2*i+1] += q1;        // update output
28         }
29         if (j != -1)                 // check invalid bar
30         {
31             d_q[2*j] -= q0;          // update output
32             d_q[2*j+1] -= q1;        // update output
33         }
34         r += gridDim.x * blockDim.x; // update r
35     }
36 }
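For completeness, a hypothetical host-side driver for the kernel of Listing 6.2, launching it once per color; numthreads and numblocks are those of Listing 6.1, and the per-color padded bar counts are an assumed bookkeeping array.

// Fragment; declarations as in the surrounding host code.
// paddedBars[c] is the padded bar count of color c (a multiple of PCGTHREADS).
int begincolor = 0;
for (int c = 0; c < Ncolors; ++c) {
    int NsColor = paddedBars[c] / PCGTHREADS;   // segments of PCGTHREADS bars
    kernelgpuspmv<<<numblocks, numthreads>>>(d_cos, d_sin, d_dipm,
                                             d_i, d_j, d_p, d_q,
                                             NsColor, begincolor);
    begincolor += paddedBars[c];
}
// Launches on the same stream execute in order, so colors never overlap;
// synchronize only when q is needed on the host.
cudaDeviceSynchronize();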

BdiaGPU. The Diagonal format (DIA) [49] is an efficient way to represent a matrix that is sparse with its elements grouped in diagonals. Figure 6.9 shows the DIA matrices data and offsets for a matrix A. The diagonals of A are stored in data, with padded elements marked by *. The vector offsets stores the diagonal offsets: the main diagonal has offset zero, the diagonals below it have negative offsets, and the upper diagonals have positive offsets. This format is efficient not only for storing a matrix whose values are grouped in diagonals, but also for accessing the elements in memory; the latter is essential for GPU programming.

Figure 6.9: DIA format representation of matrix A.

The DIA format has an evident drawback: if the matrix has elements off the diagonals, the amount of padding can be excessive, so it is not a general-purpose format. In this application it proved useful, mainly when the mesh is structured. The DIA format stores all the elements of the matrix, but to reduce the memory footprint only the director cosines and $D_{\mathrm{ipm}}$ are stored. Rather than a single data matrix, we have datatx, dataty (and datatz for 3D) for the director cosines, and datadipm. Each triplet datatx, dataty, datadipm (quadruplet for 3D) is the basic information needed to generate the 16 (36 in 3D) elements of Figure 6.2. We call this format Blocked DIA (BDIA). In practice, the SpMV is performed without forming $M_r$, and the multiplication is coded in a similar fashion to the element-by-element SpMV of EbeGPU.

Figure 6.10 shows the structure of $M$ for the truss of Figure 6.6; each square represents 4 elements in 2D or 9 in 3D. To use the diagonal format it is necessary to pad the elements missing in the diagonals, as shown in Figure 6.11(a). Also, due to symmetry, only the upper triangular part and the main diagonal are stored. Each small rectangle in q and p represents 2 elements (3 for 3D). Figure 6.11(b) shows the parallel data access pattern for the diagonals; some padding may be required so that PCGTHREADS bars are processed each time the multiplication kernel is called. The first diagonal is multiplied by p and the result accumulated into q, then the second, and so on.

Figure 6.10: M structure.

To produce the product on the GPU, the values of the BDIA matrices datatx, dataty, datadipm are read for a diagonal and the multiplication by p is performed; the process is repeated for every diagonal and the result accumulated into q. A key difference with EbeGPU (element by element) is that BDIA does not require the i, j indices and has no random memory accesses. Rather, almost all accesses are coalesced, and accesses to p and q are strided. The number of strided accesses is low compared with the coalesced ones, so there is almost no penalty on memory bandwidth. Hence, BDIA is expected to outperform the element-by-element approach. However, BDIA produces excessive padding for non-structured meshes, so its applicability is restricted to structured ones.

Figure 6.11: Parallel data access patterns. (a) Padding and SpMV in BDIA. (b) Data access pattern for the numbered diagonals.

Applying Boundary Conditions

Although we are not using a stiffness matrix, it is necessary to apply boundary conditions to $M$. We employ the big number technique: the mean of the diagonal of $M$ (mean) is found and multiplied by a large constant to obtain the big number,

$BN = C \cdot \mathrm{mean}$,    (6-6)

where $C$ is a large constant. For the solvers that explicitly assemble $M$, namely UMFPACK, PARDISO, and PCG, the value $BN$ is placed in the diagonal element corresponding to the degree of freedom being restricted. For the element-by-element solvers there is no $M$ matrix in which to place $BN$, so the boundary conditions are applied to the output vector q. For a given constrained degree of freedom bc, the solver stores the corresponding diagonal element of $M$, diag(bc). Once the SpMV is completed, it subtracts the corresponding product,

$q(bc) = q(bc) - \mathrm{diag}(bc)\, p(bc)$,    (6-7)

which is equivalent to clearing the element on the diagonal of $M$. Then the big number is placed:

$q(bc) = q(bc) + BN \cdot p(bc)$.    (6-8)

This process is repeated for all the constrained degrees of freedom.
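Equations (6-7) and (6-8) amount to a short loop executed after every SpMV of the element-by-element solvers. A minimal sketch with illustrative names:

// Big-number boundary conditions, Equations (6-7) and (6-8): for every
// constrained DOF, cancel the stored diagonal contribution and apply BN.
void applyBigNumberBCs(int Nbc, const int* bc, const double* diag,
                       double BN, const double* p, double* q)
{
    for (int k = 0; k < Nbc; ++k) {
        int dof = bc[k];
        q[dof] -= diag[k] * p[dof]; // Eq. (6-7): clear the diagonal contribution
        q[dof] += BN * p[dof];      // Eq. (6-8): place the big number
    }
}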
