
6 Implementation

6.1 Outline

This chapter presents the implementation details for optimizing the truss obtained with the ground structure approach, according to the formulation of the previous chapters. The IPM was coded in C++ and CUDA in its two versions, long-step and predictor-corrector, for comparison purposes. The predictor-corrector version requires fewer iterations due to its adaptive $\sigma$, but it must solve two linear systems per iteration. As discussed before, this is not a problem if a direct solver is used, because $M = B D_{\mathrm{ipm}} B^T$ can be factorized once and reused in two triangular (back) substitutions. If an iterative solver is used, however, two entirely different linear systems must be solved. Hence, the long-step version was used for most of the tests. The general diagram is given in Figure 6.1.

Figure 6.1: Truss topology optimization diagram.

To solve the optimization problem, a software solution called GRAND++ was developed. It uses the mesh generator of GRAND; the mesh is saved in a text file and read into C++. There are some differences between GRAND, which is MATLAB based, GRAND++, which is written in C++, and the CUDA version. When boundary conditions are applied in GRAND, the rows and columns of the arrays for every restricted degree of freedom are removed after assembly. In GRAND++ the size of the arrays does not change, as will be explained below. The file from GRAND contains the data shown in Table 6.1.

Table 6.1: Input mesh data.

Variable   Size (2D)   Size (3D)   Description
NODE       N_n x 2     N_n x 3     Each row has the x, y (2D) or x, y, z (3D) coordinates of a given node
BARS       N_b x 2     N_b x 2     Each row contains the pair of connected nodes i, j
SUPP       N_f x 3     N_f x 4     Each row consists of a node number, fixity in x, fixity in y and, in 3D, fixity in z; the total number of specified fixities is N_sup
LOAD       N_l x 3     N_l x 4     Each row consists of a node number, load in x, load in y and, in 3D, load in z

The director cosines $t_r$ of a bar connecting nodes $i$ and $j$ are obtained from the NODE matrix:

$d_r = \mathrm{NODE}_j - \mathrm{NODE}_i, \qquad t_r = d_r / \lVert d_r \rVert$.    (6-1)

6.2 Simplified Views

As noted in Chapter 3, the matrix $M = B D_{\mathrm{ipm}} B^T$ of the IPM can be formed either as a matrix product or as a sum of per-bar matrices. If a direct solver is employed, the full $M$ matrix is usually assembled. Since $M$ is quite sparse, a sparse format can be used to store it, such as COO (Coordinate List, or Triplets) or CSR (Compressed Sparse Row), among others. It is useful to know in advance the memory required to store $M$; one approach is to find the space required to store $BB^T$, which has the same size as $M$. A common alternative is to provide an approximate value and, if insufficient, double it, but this wastes space, which is a scarce resource for large meshes. Figure 6.2 shows the local $m_r$ with the matrix indices its entries will have when expanded into the global version $M_r$. These $i$ and $j$ indices are the node indices of a bar $(i, j)$.
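Equation (6-1) translates directly into a few lines of C++. The following is a minimal sketch, assuming NODE and BARS were read into flat row-major arrays; the function name and layout are illustrative, not the actual GRAND++ code.

#include <cmath>

// Director cosines for every bar of a 2D ground structure (Equation 6-1).
// NODE is flattened row-major (x0, y0, x1, y1, ...); BARS likewise (i0, j0, i1, j1, ...).
void directorCosines2D(int Nb, const double* NODE, const int* BARS,
                       double* tx, double* ty)
{
    for (int r = 0; r < Nb; ++r) {
        int i = BARS[2*r], j = BARS[2*r + 1];
        double dx = NODE[2*j]     - NODE[2*i];   // d_r = NODE_j - NODE_i
        double dy = NODE[2*j + 1] - NODE[2*i + 1];
        double len = std::sqrt(dx*dx + dy*dy);   // ||d_r||
        tx[r] = dx / len;                        // t_r = d_r / ||d_r||
        ty[r] = dy / len;
    }
}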

Figure 6.2: m_r and its indices.

The indices have the effect of dispersing the elements of $m_r$. This dispersion occurs in four blocks, denoted by the dotted lines. When the global $M_r$ is formed, these blocks are placed as shown in Figure 6.3, where each square represents a 2x2 or 3x3 sub-matrix of $m_r$.

Figure 6.3: M_r and its sub-matrices.

Figure 6.4 shows the simplified view of $M_r$, denoted $M_r^{sv}$. While $M_r$ has size $\mathrm{DIM}\,N_n \times \mathrm{DIM}\,N_n$, the simplified view has size $N_n \times N_n$, and its elements sit at the $i$ and $j$ indices. The simplified view makes it easier to find the memory requirements of $M$. We want the number of non-zeros of $M$, $nnz(M) = \mathrm{DIM}^2\, nnz(M^{sv})$. To find $nnz(M^{sv})$, note that no bar has equal indices, that $i$ is lower than $j$, and that two of the four elements each bar contributes to $M^{sv}$ lie on the diagonal. Hence, $nnz(M^{sv})$ is the difference between the total number of elements contributed by all the bars (each bar contributes 4 elements to $M^{sv}$) and the number of elements that fall in an already occupied position, $N_{rep}$:

$nnz(M^{sv}) = 4N_b - N_{rep}$.    (6-2)
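To make the dispersion of Figures 6.2 and 6.3 concrete: for a 2D bar $(i, j)$, the four 2x2 blocks of $m_r$ land at the global row/column indices 2i, 2i+1, 2j, 2j+1. A minimal sketch of this scatter into COO triplets, with illustrative names, is:

#include <vector>

struct Triplet { int row, col; double val; };

// Scatter the local 4x4 matrix m_r of a 2D bar (i, j) into global COO
// triplets, following the block placement of Figure 6.3.
void scatterBar2D(int i, int j, const double m[4][4],
                  std::vector<Triplet>& coo)
{
    const int g[4] = { 2*i, 2*i + 1, 2*j, 2*j + 1 }; // global DOF indices
    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            coo.push_back({ g[a], g[b], m[a][b] });
}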

Figure 6.4: Simplified view $M_r^{sv}$ and its indices.

If two elements are in the same position, they account for one repetition. The number of repetitions is:

$N_{rep} = 2N_b - N_n$.    (6-3)

Then $nnz(M^{sv}) = 2N_b + N_n$, and the number of non-zeros of the original $M$ is:

$nnz(M) = \mathrm{DIM}^2 (2N_b + N_n)$.    (6-4)

This equation was used in C++ to create the vector that stores $M$ in CSR format.

6.3 Linear system solvers

As discussed above, every optimizer iteration must solve a linear system $M \Delta y = r_y$. The matrix $M$ has some key features:

1. It is square, large, sparse, and symmetric positive definite;
2. It changes at every iteration;
3. It tends to become ill-conditioned as we approach the final iterations;
4. Due to (1), we can use a Cholesky factorization or an iterative solver such as PCG;
5. It is similar to the stiffness matrix, differing only in the diagonal values of $D_a$.

The next sections detail the solvers used. Some parts of the GRAND++ code are based on the work of Duarte et al. [44], who devised PolyTop++, a C++ implementation of the PolyTop MATLAB code presented by Talischi et al. [17]. Table 6.2 summarizes the features of all the solvers used in this work.
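Returning to Equation (6-4): it allows the CSR arrays of $M$ to be sized exactly before assembly. A minimal sketch, with illustrative names for the standard CSR arrays, is:

#include <cstddef>
#include <vector>

// Size the CSR storage of M exactly, using nnz(M) = DIM^2 (2 N_b + N_n)
// from Equation (6-4). DIM is 2 or 3.
struct CsrMatrix {
    std::vector<std::ptrdiff_t> rowPtr;  // size DIM*N_n + 1
    std::vector<int>            colInd;  // size nnz
    std::vector<double>         val;     // size nnz
};

CsrMatrix allocateM(int DIM, int Nb, int Nn)
{
    std::size_t nnz = std::size_t(DIM) * DIM * (2 * std::size_t(Nb) + Nn);
    CsrMatrix M;
    M.rowPtr.resize(std::size_t(DIM) * Nn + 1);
    M.colInd.resize(nnz);
    M.val.resize(nnz);
    return M;
}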

Table 6.2: Features of the employed linear solvers.

Solver    Type       Version           Platform   Dimension
UMFPACK   Direct     Serial            CPU        2D
PARDISO   Direct     Parallel          CPU        2D/3D
PCG       Iterative  Serial            CPU        2D
EbeCPU    Iterative  Serial/Parallel   CPU        2D/3D
EbeGPU    Iterative  Parallel          GPU        2D/3D

UMFPACK

The Unsymmetric MultiFrontal Package (UMFPACK) [45] is a set of routines for solving sparse linear systems of the form $Ax = b$ using the unsymmetric multifrontal method ($A$ is not required to be symmetric). It is written in ANSI/ISO C and interfaces with MATLAB; we use its C interface in GRAND++. The matrix $M$ is obtained by means of the product $B D_{\mathrm{ipm}} B^T$ and is stored in the triplet format. UMFPACK requires $M$ in the compressed sparse column (CSC) format, so we first convert it using the umfpack_dl_triplet_to_col function provided by UMFPACK. The result is the vector $\Delta y$, which is used to find $\Delta x$ and $\Delta z$ in the second and third normal equations.

PARDISO

The Parallel Direct Sparse Solver Interface (PARDISO) [46] is a high-performance, robust, memory-efficient, and easy-to-use software package for solving large sparse symmetric and non-symmetric linear systems of equations on shared-memory multiprocessors. As a parallel solver it can use all of the CPU cores and is highly efficient. PARDISO needs the matrix in CSR format, so umfpack_dl_triplet_to_col is also employed here (for the symmetric $M$, the CSC and CSR representations coincide). In this thesis PARDISO is used with all the available threads of the CPU.

PCG

We use the Preconditioned Conjugate Gradient (PCG) method [47] as an iterative solver for symmetric positive definite systems. This solver has the advantage of requiring less memory than the direct solvers above, but it may need more time to converge with a bad preconditioner, or none. The Jacobi preconditioner was used for all the iterative solvers due to its simplicity.
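With the Jacobi choice, $P$ is simply the diagonal of the system matrix, so the preconditioner solve in the algorithm below reduces to a pointwise division. A minimal sketch (illustrative names, not the GRAND++ routine):

// Jacobi preconditioner: P = diag(A), so solving P*w = r is a pointwise
// division. For the element-by-element solvers, the diagonal of M can be
// accumulated bar by bar without assembling M.
void jacobiApply(int n, const double* diagA, const double* r, double* w)
{
    for (int k = 0; k < n; ++k)
        w[k] = r[k] / diagA[k];      // w = P^{-1} r
}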

The algorithm is shown in Algorithm 3 for a generic system $Ax = b$ with preconditioner $P$.

Algorithm 3: Preconditioned CG
 1: Initialization
 2: Given x_0 and r_0 = b - A x_0
 3: for i = 1, 2, ... do
 4:     Solve P w_{i-1} = r_{i-1}
 5:     ρ_{i-1} = r_{i-1}^T w_{i-1}
 6:     if i = 1 then
 7:         p_i = w_{i-1}
 8:     else
 9:         β_{i-1} = ρ_{i-1} / ρ_{i-2}
10:         p_i = w_{i-1} + β_{i-1} p_{i-1}
11:     end if
12:     q_i = A p_i
13:     α_i = ρ_{i-1} / (p_i^T q_i)
14:     x_i = x_{i-1} + α_i p_i
15:     r_i = r_{i-1} - α_i q_i
16:     if x_i is accurate enough then
17:         quit
18:     end if
19: end for

The most complex and time-consuming operation is the matrix-vector product $q_i = A p_i$. Usually the matrix is sparse and the vector is dense, so this operation is called Sparse Matrix-Vector Multiplication (SpMV).

EbeCPU

All three solvers above have to assemble $M$, which makes them unsuitable for large meshes because of limited PC memory; they also require significant numerical processing for large problems. Element-by-Element PCG on CPU (EbeCPU) is a solver created to address these issues and to increase the number of bars the solver can handle. EbeCPU has two key features:

1. To reduce memory consumption, the full $M$ is not assembled. Rather, the matrix-vector product of step 12 in Algorithm 3 is produced as a sum of products per bar (a serial sketch is given below):

$q = \sum_{r=1}^{N_b} M_r\, p$;    (6-5)

2. To reduce the time required to solve the system, parallel computing is used to perform the matrix-vector products.
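The following is a minimal serial sketch of the per-bar accumulation of Equation (6-5) for a 2D truss, using only the director cosines and the $D_{\mathrm{ipm}}$ entries; names are illustrative.

// Equation (6-5): q = sum_r M_r * p, accumulated bar by bar without ever
// assembling M. Each bar needs only its director cosines (tx, ty) and its
// Dipm entry, since M_r = Dipm_r * b_r b_r^T with b_r = (c, s, -c, -s).
void ebeSpMV2D(int Nb, const int* bi, const int* bj,
               const double* tx, const double* ty, const double* Dipm,
               const double* p, double* q, int Ndof)
{
    for (int k = 0; k < Ndof; ++k) q[k] = 0.0;
    for (int r = 0; r < Nb; ++r) {
        int i = bi[r], j = bj[r];
        double c = tx[r], s = ty[r];
        // relative displacement of the bar end nodes
        double p13 = p[2*i]     - p[2*j];
        double p24 = p[2*i + 1] - p[2*j + 1];
        double t = Dipm[r] * (c * p13 + s * p24);
        q[2*i]     += c * t;  q[2*i + 1] += s * t;
        q[2*j]     -= c * t;  q[2*j + 1] -= s * t;
    }
}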

The SpMV operation is performed using Equation (6-5), as shown in Figure 6.5. For the 2D problem this requires:

1. 2 elements (i, j) of BARS to define the nodes;
2. 4 elements of p (6 for a 3D problem);
3. 16 elements of $M_r$ (36 for a 3D problem). To reduce the memory footprint, only the 2 director cosines (3 for 3D) and the $D_{\mathrm{ipm}}$ entry are used, because that is the minimal information needed to create the matrix of Figure 6.2;
4. 4 elements of q (6 for a 3D problem).

Figure 6.5: 2D SpMV for a bar and the elements affected.

Race condition. If two bars share a node, they have some non-zeros in the same positions of $M$, so their partial matrix-vector products have non-zeros in the same positions. This causes a problem in the parallel implementation of EbeCPU: two threads may attempt to update the same memory positions in the result vector q, with undefined behavior. This situation is called a race condition and must be avoided to prevent numerical errors. One way to overcome the problem is coloring [48]. The coloring algorithm creates groups of elements that do not share any node, allowing threads to process all the bars of the same color at the same time; once a color is completed, the next color is processed, and so on. Coloring applied to a 2D truss is shown in Figure 6.6. The algorithm and code for a 3D truss are the same, because only node connectivity is required. Three versions of EbeCPU were implemented, differing only in the way the matrix-vector product is performed.
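Before detailing those versions, here is a minimal greedy coloring sketch in the spirit of [48] (an assumption: the exact algorithm of the reference is not reproduced here). Each bar receives the smallest color not already used by a bar sharing one of its nodes.

#include <algorithm>
#include <vector>

// Greedy coloring of the bar conflict graph: bars conflict iff they share a
// node, so bars of equal color can be processed concurrently.
std::vector<int> colorBars(int Nb, int Nn, const int* bi, const int* bj)
{
    std::vector<std::vector<int>> nodeColors(Nn); // colors already touching a node
    std::vector<int> color(Nb);
    for (int r = 0; r < Nb; ++r) {
        auto used = [&](int c) {
            const auto& a = nodeColors[bi[r]];
            const auto& b = nodeColors[bj[r]];
            return std::find(a.begin(), a.end(), c) != a.end() ||
                   std::find(b.begin(), b.end(), c) != b.end();
        };
        int c = 0;
        while (used(c)) ++c;                 // smallest color free at both nodes
        color[r] = c;
        nodeColors[bi[r]].push_back(c);
        nodeColors[bj[r]].push_back(c);
    }
    return color;
}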

Figure 6.6: Bar coloring and groups by color.

EbeCPU Serial. The serial version was created and tested to provide a baseline for comparison. It sequentially performs the SpMV and accumulates into the output q.

EbeCPU Parallel 1. Bars of the same color are chosen, and the SpMV is performed in parallel: every available CPU thread computes an element product $q_r = M_r p$, and OpenMP is used for the thread distribution. Note that when the bars' director cosines are accessed according to their colors, the code can touch distant memory positions, causing cache misses. The next version attempts to solve this.

EbeCPU Parallel 2. To enhance data locality, the bars are reordered so that the indices i, j of bars of the same color are placed together. This reordering is executed prior to the calculation of the director cosines, so they are ordered as well, as illustrated in Figure 6.7. Reordering also affects the order of the force and area vector elements; when the optimization completes, the area vector is reordered back to its original order for plotting purposes.
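A sketch of the per-color parallel loop after reordering (EbeCPU Parallel 2), assuming the bars of color c occupy the contiguous range colorStart[c] to colorStart[c+1]-1; names are illustrative. Within a color no two bars share a node, so threads never update the same entries of q.

#include <omp.h>

// Colored element-by-element SpMV: colors run sequentially, bars of one
// color run in parallel without write conflicts on q.
void ebeSpMV2DColored(int Ncolors, const int* colorStart,
                      const int* bi, const int* bj,
                      const double* tx, const double* ty, const double* Dipm,
                      const double* p, double* q)
{
    for (int c = 0; c < Ncolors; ++c) {
        #pragma omp parallel for
        for (int r = colorStart[c]; r < colorStart[c + 1]; ++r) {
            int i = bi[r], j = bj[r];
            double t = Dipm[r] * (tx[r] * (p[2*i]   - p[2*j]) +
                                  ty[r] * (p[2*i+1] - p[2*j+1]));
            q[2*i]   += tx[r] * t;  q[2*i+1] += ty[r] * t;
            q[2*j]   -= tx[r] * t;  q[2*j+1] -= ty[r] * t;
        }
    }
}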

Figure 6.7: Bar indices before and after reordering.

GPU implementations

The number of mathematical operations is enormous for large meshes, so GPU solvers were implemented to reduce the solution time. The PCG algorithm was ported to the GPU using CUDA, and two versions were developed. The entire PCG algorithm runs on the GPU; the two variants differ only in the SpMV operation, discussed as follows.

EbeGPU. In the Element-by-Element PCG on GPU (EbeGPU), the bars are colored and reordered; then, to prevent access to invalid memory addresses, the bar vectors i, j, t_x, t_y (and t_z in 3D problems) are padded. Padding consists of appending dummy values so that the vector length becomes a multiple of the GPU block size. In the case of colored bars, each bar vector is padded with 0 or -1 to ensure that every group of the same color has a number of elements that is a multiple of a constant PCGTHREADS, as shown in Figure 6.8. The constant PCGTHREADS is set to 2048, and the kernel is launched with the parameters in Listing 6.1.

Listing 6.1: Kernel launching parameters
1 int threads = 128;
2 dim3 numthreads = dim3(threads, 1, 1);
3 dim3 numblocks = dim3(PCGTHREADS / threads, 1, 1);

The matrix-vector product kernel is shown in Listing 6.2 for a 2D truss. The kernel is launched once for every color, with NsColor the number of segments of size PCGTHREADS for the color being processed (in Figure 6.8, NsColor would take the values 1, 1, and 2). It is used in the for loop so that NsColor x PCGTHREADS bars are processed each time the kernel is called; the for iterations are shown in Figure 6.8(b). Note that there are 4 coalesced accesses in lines 10-13, but 8 random accesses in lines 18, 19, 26, 27, 31 and 32. The latter could touch invalid memory addresses for padded bars, so invalid-bar checks are performed in lines 16, 24 and 29. Random accesses degrade the effective bandwidth of the GPU, so the next version attempts to solve this issue.

Figure 6.8: Padding and data in GPU. (a) Padding for a colored bar. (b) Data access pattern in the GPU.

Listing 6.2: SpMV Kernel
 1 __global__ void kernelgpuspmv(double *d_cos, double *d_sin,
 2     double *d_dipm, int *d_i, int *d_j, double *d_p, double *d_q,
 3     int NsColor, int begincolor)
 4 {
 5     // select a bar index
 6     unsigned int r = begincolor + blockIdx.x * blockDim.x +
 7         threadIdx.x;
 8     for (int ns = 0; ns < NsColor; ns++)
 9     {
10         int i = d_i[r];              // i node for bar r
11         int j = d_j[r];              // j node for bar r
12         double c = d_cos[r];         // cos = tx
13         double s = d_sin[r];         // sin = ty
14         double p13 = 0;              // temp variable
15         double p24 = 0;              // temp variable
16         if ((i != -1) && (j != -1))  // check invalid bar
17         {
18             p13 = d_p[2*i] - d_p[2*j];
19             p24 = d_p[2*i+1] - d_p[2*j+1];
20         }
21         double tt = d_dipm[r] * (c * p13 + s * p24);
22         double q0 = c * tt;          // partial output
23         double q1 = s * tt;          // partial output
24         if (i != -1)                 // check invalid bar
25         {
26             d_q[2*i] += q0;          // update output
27             d_q[2*i+1] += q1;        // update output
28         }
29         if (j != -1)                 // check invalid bar
30         {
31             d_q[2*j] -= q0;          // update output
32             d_q[2*j+1] -= q1;        // update output
33         }
34         r += gridDim.x * blockDim.x; // update r
35     }
36 }
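For completeness, a hypothetical host-side driver for the kernel of Listing 6.2, launching it once per color; numthreads and numblocks are those of Listing 6.1, and the per-color padded bar counts are an assumed bookkeeping array.

// Fragment; declarations as in the surrounding host code.
// paddedBars[c] is the padded bar count of color c (a multiple of PCGTHREADS).
int begincolor = 0;
for (int c = 0; c < Ncolors; ++c) {
    int NsColor = paddedBars[c] / PCGTHREADS;   // segments of PCGTHREADS bars
    kernelgpuspmv<<<numblocks, numthreads>>>(d_cos, d_sin, d_dipm,
                                             d_i, d_j, d_p, d_q,
                                             NsColor, begincolor);
    begincolor += paddedBars[c];
}
// Launches on the same stream execute in order, so colors never overlap;
// synchronize only when q is needed on the host.
cudaDeviceSynchronize();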

BdiaGPU. The Diagonal format (DIA) [49] is an efficient way to represent a matrix that is sparse with its elements grouped in diagonals. Figure 6.9 shows the DIA matrices data and offsets for a matrix A. The diagonals of A are stored in data, with padded elements marked by *. The vector offsets stores the diagonal offsets: the main diagonal has offset zero, the diagonals below it have negative offsets, and the upper diagonals have positive offsets. This format is efficient not only for storing a matrix whose values are grouped in diagonals, but also for accessing the elements in memory; the latter is essential for GPU programming.

Figure 6.9: DIA format representation of matrix A.

The DIA format has an evident drawback: if the matrix has elements off the diagonals, the amount of padding can be excessive, so it is not a general-purpose format. In this application it proved useful, mainly when the mesh is structured. The DIA format stores all the elements of the matrix, but to reduce the memory footprint only the director cosines and $D_{\mathrm{ipm}}$ are stored. Rather than a single data matrix, we have datatx, dataty (and datatz for 3D) for the director cosines, and datadipm. Each triplet datatx, dataty, datadipm (quadruplet for 3D) is the basic information needed to generate the 16 (36 in 3D) elements of Figure 6.2. We call this format Blocked DIA (BDIA). In practice, the SpMV is performed without forming $M_r$, and the multiplication is coded in a similar fashion to the element-by-element SpMV of EbeGPU.

Figure 6.10 shows the structure of $M$ for the truss of Figure 6.6; each square represents 4 elements in 2D or 9 in 3D. To use the diagonal format it is necessary to pad the elements missing in the diagonals, as shown in Figure 6.11(a). Also, due to symmetry, only the upper triangular part and the main diagonal are stored. Each small rectangle in q and p represents 2 elements (3 for 3D). Figure 6.11(b) shows the parallel data access pattern for the diagonals; some padding may be required so that PCGTHREADS bars are processed each time the multiplication kernel is called. The first diagonal is multiplied by p and the result accumulated into q, then the second, and so on.

Figure 6.10: M structure.

To produce the product on the GPU, the values of the BDIA matrices datatx, dataty, datadipm are read for a diagonal and the multiplication by p is performed; the process is repeated for every diagonal and the result accumulated into q. A key difference with EbeGPU (element by element) is that BDIA does not require the i, j indices and has no random memory accesses. Rather, almost all accesses are coalesced, and accesses to p and q are strided. The number of strided accesses is low compared with the coalesced ones, so there is almost no penalty on memory bandwidth. Hence, BDIA is expected to outperform the element-by-element approach. However, BDIA produces excessive padding for non-structured meshes, so its applicability is restricted to structured ones.

Figure 6.11: Parallel data access patterns. (a) Padding and SpMV in BDIA. (b) Data access pattern for the numbered diagonals.

Applying Boundary Conditions

Although we are not using a stiffness matrix, it is necessary to apply boundary conditions to $M$. We employ the big number technique: the mean of the diagonal of $M$ (mean) is found and multiplied by a large constant to obtain the big number,

$BN = C \cdot \mathrm{mean}$,    (6-6)

where $C$ is a large constant. For the solvers that explicitly assemble $M$, namely UMFPACK, PARDISO, and PCG, the value $BN$ is placed in the diagonal element corresponding to the degree of freedom being restricted. For the element-by-element solvers there is no $M$ matrix in which to place $BN$, so the boundary conditions are applied to the output vector q. For a given constrained degree of freedom bc, the solver stores the corresponding diagonal element of $M$, diag(bc). Once the SpMV is completed, it subtracts the corresponding product,

$q(bc) = q(bc) - \mathrm{diag}(bc)\, p(bc)$,    (6-7)

which is equivalent to clearing the element on the diagonal of $M$. Then the big number is placed:

$q(bc) = q(bc) + BN \cdot p(bc)$.    (6-8)

This process is repeated for all the constrained degrees of freedom.
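Equations (6-7) and (6-8) amount to a short loop executed after every SpMV of the element-by-element solvers. A minimal sketch with illustrative names:

// Big-number boundary conditions, Equations (6-7) and (6-8): for every
// constrained DOF, cancel the stored diagonal contribution and apply BN.
void applyBigNumberBCs(int Nbc, const int* bc, const double* diag,
                       double BN, const double* p, double* q)
{
    for (int k = 0; k < Nbc; ++k) {
        int dof = bc[k];
        q[dof] -= diag[k] * p[dof]; // Eq. (6-7): clear the diagonal contribution
        q[dof] += BN * p[dof];      // Eq. (6-8): place the big number
    }
}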
