USING HYBRID PARALLEL PROGRAMMING TECHNIQUES FOR THE COMPUTATION, ASSEMBLY AND SOLUTION STAGES IN FINITE ELEMENT CODES


Latin American Applied Research 41 (2011)

USING HYBRID PARALLEL PROGRAMMING TECHNIQUES FOR THE COMPUTATION, ASSEMBLY AND SOLUTION STAGES IN FINITE ELEMENT CODES

R.R. PAZ, M.A. STORTI, H.G. CASTRO and L.D. DALCÍN

Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC), Instituto de Desarrollo Tecnológico para la Industria Química (INTEC), CONICET, Universidad Nacional del Litoral (UNL). Santa Fe, Argentina. {rodrigop mstorti
Grupo de Investigación en Mecánica de Fluidos, Universidad Tecnológica Nacional, Facultad Regional Resistencia, Chaco, Argentina.

Abstract— The so-called hybrid parallelism paradigm, which combines programming techniques for architectures with distributed and shared memory using the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) standards, is currently adopted to exploit the growing use of multi-core computers, thus improving the efficiency of codes on such architectures (several multi-core nodes or clustered symmetric multi-processors (SMP) connected by a fast network for exhaustive computations). In this paper a parallel hybrid finite element code is developed and its performance is evaluated, using MPI for communication between cluster nodes and OpenMP for parallelism within the SMP nodes. An efficient thread-safe matrix library for computing element/cell residuals (or right-hand sides) and Jacobians (or matrices) in FEM-like codes is introduced and fully described. The cluster on which the code was tested is CIMEC's Coyote cluster, which consists of eight-core computing nodes connected through Gigabit Ethernet.

Keywords— Finite Elements, MPI, OpenMP, PETSc, hybrid programming, Matrix Library.

I. INTRODUCTION

A variety of engineering applications and scientific problems in the Computational Mechanics (CM) area, and particularly in the Computational Fluid Dynamics (CFD) field, demand high computational resources (Sonzogni et al., 2002). A great effort has been made over the years to obtain high quality solutions (Paz et al., 2006) for large-scale problems in realistic times (Behara and Mittal, 2009) using many different computing architectures (e.g., vector processors, distributed and shared memory machines, or general-purpose graphics processing units, GPGPUs).

Symmetric multi-processors (SMP) involve a hardware architecture with two or more identical processors connected to a single shared main memory. Other recent computing systems might use Non-Uniform Memory Access (NUMA). NUMA dedicates different memory banks to different processors, so that processors may access local memory quickly. Despite the differences with NUMA architectures, this work will use the term SMP in a broader sense, to refer to a general many-processor, many-core, single-memory computing machine. Since SMPs have spread widely in conjunction with high-speed network hardware, using SMP clusters has become attractive for high-performance computing. To exploit such computing systems the tendency is to use the so-called hybrid parallelism paradigm, which combines programming techniques for architectures with distributed and shared memory, often using the MPI and OpenMP standards (Jost and Jin, 2003).

The hybrid MPI/OpenMP programming technique is based on using message passing for coarse-grained parallelism and multi-threading for fine-grained parallelism. The MPI programming paradigm defines a high-level abstraction for fast and portable inter-process communication and assumes a local/private address space for each process. Applications can run on clusters of (possibly heterogeneous) workstations or dedicated nodes, on SMP machines, or even on a mixture of both.
MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability without sacrificing performance (see Section V.B). Although message passing is the natural way to communicate between nodes, it may not be an efficient mechanism within an SMP node. In shared memory architectures, parallelization strategies based on the OpenMP standard may provide better performance and efficiency in parallel applications. A combination of both paradigms within an application running on hybrid clusters may provide a more efficient parallelization strategy than applications that exploit the features of pure MPI.

This paper focuses on a mixed MPI and OpenMP implementation of a finite element code for scalar PDEs and discusses the benefits of developing mixed mode MPI/OpenMP codes running on Beowulf clusters of SMPs. To address these objectives, the remainder of this paper is organized as follows. Section II provides a short description and comparison of different characteristics of the OpenMP and MPI paradigms. Section III introduces and describes an efficient thread-safe matrix library called FastMat for computing element residuals and Jacobians in the context of multi-threaded finite element codes. Section IV discusses the implementation of mixed (hybrid) mode applications and describes a number of situations where mixed mode programming is potentially beneficial. Section V presents the implementation of a hybrid application, namely an advective-

diffusive partial differential equation. Several tests are performed on a cluster of SMPs and on single SMP workstations (Intel Xeon E54xx-series and i7 architectures), comparing and contrasting the performance of the FEM code. The results demonstrate that this style of programming may increase the code performance. This improvement can be achieved by taking into account a few rules of thumb depending on the application at hand. Concluding remarks are given in Section VI.

II. AN OVERVIEW OF MPI AND OPENMP

A. MPI

MPI, the Message Passing Interface (MPI, 2010), is a standardized and portable message-passing system designed to function on a wide variety of parallel computers. The standard defines the syntax and semantics of library routines (MPI is not a programming language extension) and allows users to write portable programs in the main scientific programming languages (Fortran, C and C++).

The MPI programming model is a distributed memory model with explicit control of parallelism. MPI is portable to both distributed and shared memory architectures. The explicit parallelism often provides better performance, and a number of optimized collective communication routines are available for optimal efficiency. MPI defines a high-level abstraction for fast and portable inter-process communication (Snir et al., 1998; Gropp et al., 1998). MPI applications can run on clusters of (possibly heterogeneous) workstations or dedicated nodes, on (symmetric) multi-processor machines, or even on a mixture of both. MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability without sacrificing performance. The MPI specification is nowadays the leading standard for message passing libraries in the world of parallel computers. At the time of this writing, clarifications to MPI-2 are being actively discussed and new working groups are being established for generating a future MPI-3 specification. MPI provides the following functionality:

Communication Domains and Process Groups: MPI communication operations occur within a specific communication domain through an abstraction called a communicator. Communicators are built from groups of participating processes and provide a communication context for the members of those groups. Process groups enable parallel applications to assign processing resources to sets of cooperating processes in order to perform independent work.

Point-to-Point Communication: This fundamental mechanism enables the transmission of data between a pair of processes, one side sending, the other receiving.

Collective Communications: They allow the transmission of data among multiple processes of a group simultaneously.

Dynamic Process Management: MPI provides mechanisms to create or connect a set of processes and establish communication between them and the existing MPI application.

One-Sided Operations: One-sided communications supplement the traditional two-sided, cooperative send/receive MPI communication model with one-sided, remote put/get operations on specified regions of process memory that have been made available for read and write operations.

Parallel Input/Output.
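As a minimal illustration of the point-to-point and collective primitives listed above, consider the following sketch (illustrative only; it is not part of the FEM code discussed in this paper):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Point-to-point: process 0 sends one value to process 1.
    if (rank == 0 && size > 1) {
      double val = 3.14;
      MPI_Send(&val, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      double val;
      MPI_Recv(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Collective: sum a per-process contribution over all processes.
    double local = rank + 1.0, total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %g\n", total);

    MPI_Finalize();
    return 0;
  }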
B. OpenMP

OpenMP, the Open specifications for Multi-Processing (OpenMP, 2010), defines an application program interface that supports concurrent programming employing a shared memory model. OpenMP is available for several platforms and languages; there are extensions for the best known languages, like Fortran (77, 90 and 95) and C/C++. OpenMP is the result of joint work between companies and educational institutions involved in the research and development of hardware and software.

OpenMP is based on the fork-join model (see Fig. 1), a paradigm implemented early on UNIX systems, where a task is divided into several processes (fork) with less weight than the initial task, whose results are collected at the end and merged into a single result (join).

Figure 1: The master thread creates a team of parallel threads.

Using OpenMP in existing codes implies the insertion of special compiler directives (beginning with #pragma omp in C/C++) and runtime routine calls. Additionally, environment variables can be defined in order to control some functionality at execution time.

A code parallelized using OpenMP directives initially runs as a single process, the master thread or main process. Upon entering a region to be parallelized, the main process creates a set of parallel processes or parallel threads. Including an OpenMP directive implies a mandatory synchronization across the parallel block. That is, the code block is marked as parallel and threads are launched according to the characteristics of the directive. At the end of the parallel block, the working threads are synchronized (unless an additional directive is inserted to remove this implicit synchronization). By default, in the parallelized region all variables except those used in loops are shared by the threads. This can be changed by specifying the variable type (private or shared) before entering the parallel region. This model is also applicable to nested parallelism.
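A minimal fork-join example of the kind just described is sketched next (illustrative only). The team of threads is created at the parallel directive, the loop iterations are divided among the threads, and the implicit barrier at the end of the block performs the join:

  #include <omp.h>
  #include <cstdio>

  int main() {
    const int n = 1000000;
    double sum = 0.0;
    // Fork: the master thread spawns a team and the iterations are split
    // among threads. "sum" is a reduction variable; the loop index is
    // private by default, everything else is shared.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
      sum += 1.0 / ((i + 1.0) * (i + 1.0));
    // Join: the implicit barrier at the end of the parallel region
    // synchronizes the team before the master thread continues.
    printf("sum = %.12f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
  }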

C. PETSc

PETSc, the Portable, Extensible Toolkit for Scientific Computation (Balay et al., 2010a,b), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the MPI standard for all message-passing communication. Being written in C and based on MPI, PETSc is a highly portable software library. PETSc-based applications can run in almost all modern parallel environments, ranging from distributed memory architectures (Balay et al., 1997) (with standard networks as well as specialized communication hardware) to multi-processor (and multi-core) shared memory machines.

Figure 2: Hierarchical structure of the PETSc library (taken from the PETSc Manual, Balay et al., 2010a).

The PETSc library provides its users a platform to develop applications fully exploiting parallelism, and the flexibility to experiment with many different models and with solution methods for large linear and nonlinear systems, avoiding explicit calls to the MPI library. It is a freely available library usable from C/C++, Fortran 77/90 and Python (Dalcin, 2010). An overview of some of the components of PETSc can be seen in Fig. 2. An important feature of the package is the possibility to write applications at a high level and work the way down in level of abstraction (including explicit calls to MPI).

As PETSc employs the distributed memory model, each process has its own address space, and data is communicated using MPI when required. For instance, in a linear (or nonlinear) system solution stage (a common case in FEM applications) each process will own a contiguous subset of rows of the system matrix (in the C implementation) and will primarily work on this subset, sending (or receiving) information to (or from) other processes. The PETSc interface allows users an agile development of parallel applications. PETSc provides sequential/distributed matrix and vector data structures and efficient parallel matrix/vector assembly operations, using an object oriented style. Also, several iterative methods for linear/nonlinear solvers are designed in the same way.
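As an illustration of this style of use (a sketch, not code from this paper; it assumes a recent PETSc, where e.g. KSPSetOperators takes three arguments), the following program assembles a distributed tridiagonal matrix by rows and solves a linear system with a run-time configurable Krylov solver:

  #include <petscksp.h>

  int main(int argc, char **argv) {
    PetscInitialize(&argc, &argv, NULL, NULL);
    const PetscInt n = 100;
    Mat A; Vec x, b; KSP ksp;

    // Distributed sparse matrix: each process owns a contiguous block of rows.
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A); MatSetUp(A);
    PetscInt rstart, rend;
    MatGetOwnershipRange(A, &rstart, &rend);
    for (PetscInt i = rstart; i < rend; i++) {
      if (i > 0)   MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);
      if (i < n-1) MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);
      MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);
    VecDuplicate(x, &b);
    VecSet(b, 1.0);

    // Krylov solver configured at run time (-ksp_type bcgs -pc_type jacobi, ...)
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
  }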
III. THE FastMat MATRIX CLASS

A. Preliminaries

Finite element codes usually have two levels of programming. In the outer level a large vector describes the state of the physical system. Usually the size of this vector is the number of nodes times the number of fields minus the number of constraints (e.g. Dirichlet boundary conditions), so that the state vector size is N_nod n_dof − n_constr. This vector can be computed at once by assembling the right hand side (RHS) and the stiffness matrix in a linear problem, iterated in a non-linear problem, or updated at each time step through the solution of a linear or non-linear system. The point is that at this outer level all global assembly operations, which build the residual vector and matrices, are performed. At the inner level, one performs a loop over all the elements in the mesh, computes the RHS vector and matrix contributions of each element and assembles them into the global vector/matrix. From one application to another, the strategy at the outer level (linear/non-linear, steady/time dependent, etc.) and the physics of the problem that defines the FEM matrices and vectors may vary.

The FastMat matrix class has been designed to perform matrix computations efficiently at the element level. One of the key points in the design of the matrix library is that it is thread-safe, so that it can be used in an SMP environment within OpenMP parallel blocks. In view of efficiency there is an operation caching mechanism, which will be described later. Caching is also thread-safe, provided that independent cache contexts are used in each thread.

It is assumed that the code has an outer loop (usually the loop over elements) that is executed many times, and at each execution of the loop a series of operations is performed with a rather reduced set of local (or element) vectors and matrices. In many cases, FEM-like algorithms need to operate on sub-matrices, i.e., columns, rows or sets of them. In general, performance is degraded for such operations because there is a certain amount of work needed to extract or set the sub-matrix. Alternatively, a copy of the row or column can be made in an intermediate object, but some overhead is expected due to the copy operations. The particularity of FastMat is that at the first execution of the loop the addresses of the elements used in the operation are cached in an internal object, so that in the second and subsequent executions of the loop the addresses are retrieved from the cache. The library is in the public domain and can be accessed from Storti et al. (2010).

A.1 Example

Consider the following simple example: a 2D finite element mesh composed of triangles, i.e. an array xnod of 2 N_nod doubles with the node coordinates and an array icone with 3 n_elem entries with the node connectivities. For each element 0 ≤ j < n_elem its nodes are stored at icone[3*j+k] for 0 ≤ k < 3. It is required, for instance, to compute the maximum and minimum value of the area of the triangles. This is a computation quite similar to those found in FEM analysis. For each element in the mesh two basic operations are needed: i) loading the node coordinates in local vectors x1, x2 and x3; ii) computing the vectors along the sides of the element, a = x2 − x1 and b = x3 − x1. The area of the element is then half the determinant of the 2×2 matrix J formed by putting a and b as rows. The FastMat code for the proposed computations is shown in Listing 1.
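Listing 1 below is a sketch reconstructed from the operations described in this section; the cache-context plumbing and the auxiliary names (nelem, xnod, icone) follow the text, while the exact constructor signatures are assumptions. Line numbers are shown because the following subsections refer to them.

Listing 1: Simple FEM-like code

   1: FastMat::CacheCtx ctx;
   2: FastMat::CacheCtx::Branch b1;
   3: FastMat x(2,3,2), a(1,2), b(1,2), J(2,2,2);
   4: double min_area = 1e30, max_area = 0.0;
   5: for (int ie = 0; ie < nelem; ie++) {
   6:   ctx.jump(b1);
   7:   // load the three node coordinates into the rows of x
   8:   for (int k = 1; k <= 3; k++)
   9:     x.ir(1,k).set(&xnod[2*icone[3*ie+k-1]]);  // 0-based node numbers assumed
  10:   x.rs();
  11:   // side vectors a = x2 - x1 and b = x3 - x1
  12:   a.set(x.ir(1,2)).rest(x.ir(1,1));
  13:   b.set(x.ir(1,3)).rest(x.ir(1,1));
  14:   x.rs();
  15:   J.ir(1,1).set(a).rs().ir(1,2).set(b);
  16:   double area = 0.5*J.rs().det();
  17:   if (area < min_area) min_area = area;
  18:   if (area > max_area) max_area = area;
  19: }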

Calls to the FastMat::CacheCtx ctx object are related to cache manipulation and will be discussed later. Matrices are dimensioned in line 3; the first argument is the rank of the matrix, and then follow the dimensions for each index, i.e., the shape. For instance FastMat x(2,3,2) defines a matrix of rank 2 and shape (3,2), i.e., with 2 indices ranging from 1 to 3 and from 1 to 2, respectively. The rows of this matrix will store the coordinates of the nodes local to the element. FastMat matrices may have any number of indices, i.e., any rank. They can also have zero rank, which stands for scalars.

A.2 Current matrix views (the so-called masks)

In lines 7 to 10 of Listing 1 the coordinates of the nodes are loaded into matrix x. The underlying philosophy of FastMat is that views (or masks) of the matrix can be made without making any copies of the underlying values. For instance the operation x.ir(1,k) (for index restriction) sets a view of x so that index 1 is restricted to take the value k, reducing the rank of the matrix by one. As x has two indices, the operation x.ir(1,k) gives a matrix of rank one consisting of the k-th row of x. A call without arguments like x.ir() cancels the restriction. Also, the function rs() (for reset) cancels the current view. Please refer to Appendix A for a synopsis of the methods/operations available in the FastMat class.

A.3 Set operations

The operation a.set(x.ir(1,2)) copies the contents of the argument x.ir(1,2) into a. Also, x.set(xp) can be used, xp being an array of doubles (double *xp).

A.4 Dimension matching

The x.set(y) operation, where y is another FastMat object, requires that x and y have the same masked dimensions. As the .ir(1,2) operation restricts index 1 to the value 2, x.ir(1,2) is seen as a row vector of size 2 and can then be copied to a. If the masked dimensions do not fit, an error is issued.

A.5 Automatic dimensioning

In the example, a has been dimensioned at line 3, but most operations perform the dimensioning if the matrix has not already been dimensioned. For instance, if at line 3 a FastMat a is declared without specifying dimensions, then at line 12 the matrix is dimensioned taking the dimensions from the argument. The same applies to set(FastMat &) but not to set(double *), since in this last case the argument (double *) does not give information about its dimensions. Other operations that define dimensions are products and contraction operations.

A.6 Concatenation of operations

Many operations return a reference to the matrix (return value FastMat &) so that operations may be concatenated, as in A.ir(1,k).ir(2,j).

A.7 Underlying implementation with BLAS/LAPACK

Some functions are implemented at the low level using BLAS (BLAS, 2010) and LAPACK (LAPACK, 2010). Notably, prod() uses BLAS's dgemm, so that the amortized cost of a prod() call is the same as for dgemm(). As a matter of fact, a profiling study of FastMat efficiency in a typical FEM code has determined that the largest CPU consumption in the residual/Jacobian computation stage corresponds to prod() calls. Another notable case is eig(), which uses LAPACK's dgeev. The eig() method is not commonly used, but when it is, its cost may be significant, so that the fast implementation proposed here with dgeev is mandatory.

A.8 The FastMat operation cache concept

The idea behind caches is that they are objects (class FastMatCache) that store the addresses and any other information that can be computed in advance for the current operation.
In the first pass through the body of the loop (i.e., ie=0 in the example of Listing 1) a cache object is created for each of the operations and stored in a list. This list is basically a doubly linked list (list< >) of cache objects. When the body of the loop is executed the second time (i.e., ie≥1 in the example) and thereafter, the addresses of the matrix elements need not be recomputed; they are read from the cache instead.

The use of the cache is rather automatic and requires little intervention from the user, but in some cases the position in the cache list can get out of synchronization with respect to the execution of the operations, and severe errors may occur. The basic use of caching is to create the cache structure FastMat::CacheCtx ctx and keep the position in the cache structure synchronized with the position in the code. The process is very simple when the code consists of a linear sequence of FastMat operations that are always executed in the same order. In this case the CacheCtx object stores a list of the cache objects (one for each FastMat operation). As the operations are executed, the internal FastMat code is in charge of advancing the cache position in the cache list automatically.

A linear sequence of cache operations that are always executed in the same order is called a branch. Looking at the previous code, it has one branch, starting at the x.ir(1,k).set(...) line and running through the J.rs().det() line. This sequence is repeated many times (once for each element), so it is interesting to reuse the cache list. For this, a branch object b1 (class FastMat::CacheCtx::Branch) is created, and a jump to this branch is executed each time the loop body starts. In the first loop iteration the cache list is created and stored in the first position of the cache structure. In the next and subsequent executions of the loop the cache is reused, avoiding the recomputation of much administrative work related to the matrices.

The problem arises when the sequence of operations is not always the same. In that case several jump() commands must be issued, each one to the start of a sequence of FastMat operations. Consider for instance the following code (the first part, without the else-if block, of the listing shown after subsection B.3 below): a vector x of size 3 is randomly generated in a loop (the line x.fun(rnd);). Then its length is computed, and if it is shorter than 1.0 it is scaled by 1.0/len, so that its final length is one. In this case two branches are defined and two jumps are executed; branch b1: operations x.fun() and x.norm_p_all(); branch b2: operation x.scale().

B. Caching the addresses used in the operations

If caching is not used the performance of the library is poor, while the cached version is very fast, in the sense that almost all the CPU time is spent in performing multiplications and additions, and negligible CPU time is spent in auxiliary operations.

B.1 Branching is not always needed

However, branching is needed only if the instruction sequence changes during the same execution of the code. For instance, if a method flag is determined at the moment of reading the data and is then left unchanged for the whole execution of the code, it is not necessary to jump(), since the instruction sequence will always be the same.

B.2 Cache mismatch

The cache process may fail if a cache mismatch is produced. Consider the following variation of the previous code (the full listing after subsection B.3): there is an additional block in the conditional; if the length of the vector is greater than 1.1, the vector is set to the null vector. Every time a branch is opened in a program block, a ctx.jump() must be called, using different arguments for the different branches (i.e., b1, b2, etc.). In this code there are three branches. The code shown is correct, but assume that the user forgets the jump() calls at lines 10 and 13 (sentences ctx.jump(b2) and ctx.jump(b3)); then, when reaching the x.set(0.0) operation in line 14, the corresponding cache would be the cache of the x.scale() operation (line 11), and an incorrect computation would occur. Each time the retrieved cache does not match the operation to be computed, or when it does not exist, a cache mismatch exception is produced.

B.3 Causes for a cache mismatch error

Basically, the information stored in the cache (and then retrieved from the objects that were passed at the moment of creating the cache) must be the same as that needed for performing the current FastMat operation, that is:

The FastMat matrices involved must be the same (i.e., their pointers must be the same).

The indices passed in the operation must coincide (for instance for the prod(), ctr() and sum() operations).

The masks (see Section III.A.2) applied to each of the matrix arguments must be the same.
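The listing referred to in the two preceding subsections is sketched below (a reconstruction under assumptions: the names N and rnd, and the exact signature of norm_p_all(), are not confirmed by the text). Line numbers are shown because the discussion in B.2 refers to them; without the else-if block (lines 12 to 15) it reduces to the two-branch version discussed first:

   1: FastMat::CacheCtx ctx;
   2: FastMat::CacheCtx::Branch b1, b2, b3;
   3: FastMat x(1,3);
   4: for (int j = 0; j < N; j++) {
   5:   ctx.jump(b1);
   6:   x.fun(rnd);                    // fill x with random values
   7:   double len = x.norm_p_all(2);  // Euclidean length (signature assumed)
   8:   if (len < 1.0) {
   9:     // scale x so that its final length is one
  10:     ctx.jump(b2);
  11:     x.scale(1.0/len);
  12:   } else if (len > 1.1) {
  13:     ctx.jump(b3);
  14:     x.set(0.0);
  15:   }
  16: }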
B.4 Multi-threading and reentrancy

If caching is not enabled, FastMat is thread-safe. If caching is enabled, then it is thread-safe in the following sense: a context ctx must be created for each thread, and the matrices used in each thread must be associated with the context of that thread. If creating the cache structures each time is too costly, then the context and the matrices may be used in a parallel region, stored in variables, and reused in a subsequent parallel region.
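A sketch of this usage pattern in an OpenMP parallel loop follows (how matrices are associated with a given context is an implementation detail assumed here; the element computation itself is elided):

  #pragma omp parallel
  {
    // one cache context and one set of element matrices per thread
    FastMat::CacheCtx ctx;
    FastMat::CacheCtx::Branch b1;
    FastMat x(2,3,2), a(1,2), b(1,2), J(2,2,2);
    #pragma omp for
    for (int ie = 0; ie < nelem; ie++) {
      ctx.jump(b1);
      // ... element computations as in Listing 1 ...
    }
  }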

Figure 3: Efficiency comparison on an Intel Core 2 Duo T5450 (1.66 GHz) processor.

C. Efficiency

This benchmark computes a=b*c in a loop for a large number N of distinct square matrices a, b, c of varying size n. As mentioned before, the amortized cost is the same as for the underlying dgemm() call. The processing rate in Gflops is computed on the basis of an operation count of 2n^3 per matrix product, i.e.

rate [Gflops] = (2 n^3 N / elapsed time [secs]) × 10^{-9}.   (1)

The number of loop executions needed to reach a 50% amortization (n_{1/2}) is on the order of 15 to 30. With respect to the dgemm() implementation for the matrix product, several options were tested on an Intel Core 2 Duo T5450 (1.66 GHz) processor using one core (see Fig. 3). The options tested were: i) the ATLAS (Whaley et al., 2001) version, both with the default setup and self-configured; ii) the Intel Math Kernel Library (MKL) with the GNU/GCC g++ compiler; and iii) MKL with the Intel icc compiler. A significant improvement is obtained if the library is linked with MKL, combined with either the icc or g++ compiler. These combinations peak at 5 to 6 Gflops for 100 < n < 200. With operation caching activated the overhead of the mask computation is avoided, and the amortized cost is similar to using dgemm() on plain C matrices.

D. FastMat multi-product operation

A common operation in FEM codes (see Section V) and many other applications is the product of several matrices or tensors. In addition, this kind of operation usually consumes the largest part of the CPU time in a typical FEM computation of residuals and Jacobians. The number of operations (and consequently the CPU time) can be largely reduced by choosing the order in which the products are performed. For instance, consider the following operation

A_{ij} = B_{ik} C_{kl} D_{lj}   (2)

(Einstein's convention on repeated indices is assumed). The same operation in matrix notation is

A = BCD,   (3)

where A, B, C, D are rectangular (rank 2) matrices of shape (m_1,m_4), (m_1,m_2), (m_2,m_3), (m_3,m_4), respectively. As the matrix product is associative, the order in which the computations are performed can be chosen at will, so that the operation can be performed in the following two ways:

A = (BC)D   (computation tree CT1),
A = B(CD)   (computation tree CT2).   (4)

The order in which the computations are performed can be represented by a complete binary tree; in this paper the order will be described using parentheses. The number of operations (op. count) for the different trees, and in consequence the CPU time, can be very different. The cost of performing the first product BC in the first row of (4) (and of a product of two rectangular matrices in general) is

op. count = 2 m_1 m_2 m_3,   (5)

which can be put as

op. count = 2 (m_1 m_3) m_2
          = 2 (prod. of dims for B and C free indices) × (prod. of dims for contracted indices),   (6)

or alternatively,

op. count = 2 (m_1 m_2)(m_2 m_3)/m_2
          = 2 (prod. of B dims)(prod. of C dims)/(prod. of dims for contracted indices).   (7)

The cost of the second product, (BC)D, is 2 m_1 m_3 m_4, so that the total cost for the first computation tree CT1 is 2 m_1 m_3 (m_2 + m_4). If the second computation tree CT2 is used, then the number of operations is 2 m_2 m_4 (m_1 + m_3). These numbers may be very different, for instance when B and C are square matrices and D is a vector, i.e., m_1 = m_2 = m_3 = m > 1, m_4 = 1. In this case the operation count is 2m^2(m+1) = O(m^3) for CT1 and 4m^2 = O(m^2) for CT2, so that CT2 is much more convenient.

D.1 Algorithms for the determination of the computation tree

There is a simple algorithm that exploits this heuristic rule in a general case (Aho et al., 1983).
If the multi-product is

R = A_1 A_2 ... A_n,   (8)

with A_k of shape (m_k, m_{k+1}), then the operation count c_k for each of the possible products A_k A_{k+1} is computed, namely c_k = m_k m_{k+1} m_{k+2} for k = 1 to n−1. Let c_{k*} be the minimum operation count; then the corresponding product A_{k*} A_{k*+1} is performed, and the pair A_{k*}, A_{k*+1} is replaced by this

product. Then, the list of matrices in the multi-product is shortened by one. The algorithm proceeds recursively until the number of matrices is reduced to only one. The cost of this algorithm is O(n^2) (note that this refers to the number of operations needed to determine the computation tree, not to actually computing the matrix product). For a small number of matrices the optimal tree may be found by performing an exhaustive search over all possible orders; the cost in this case is O(n!). In Fig. 4 the computing times of the exhaustive (optimal) and heuristic algorithms are shown for up to 8 matrices. Of course the exhaustive approach is prohibitive for a large number of matrices, but it can be afforded for up to 6 or 7 matrices, which is by far the most common case.

Figure 4: Cost of the determination of the optimal order for computing the product of matrices with the heuristic and exhaustive (optimal) algorithms.

The situation is basically the same, but more complex, in the full tensorial case, as implemented in the FastMat library. First consider a product of two tensors like

C_{il} = A_{ijk} B_{klj},   (9)

where tensors A, B, C have shape (m_1,m_2,m_3), (m_3,m_4,m_2), (m_1,m_4), respectively; i and l are free indices, while j and k are contracted indices. The cost of this product is (cf. Eqs. 6 and 7)

op. count = 2 m_1 m_2 m_3 m_4.   (10)

On one hand, the modification with respect to the case of rectangular (rank 2) matrices is that every tensor can now be contracted with any other in the list, so that the heuristic algorithm must check all pairs of distinct tensors, which is ñ(ñ−1)/2, where 1 ≤ ñ ≤ n is the number of tensors remaining in the list. This must be added over ñ, so that the algorithm is O(n^3). On the other hand, regarding the optimal order, it turns out that the complexity of its computation is

O( prod_{ñ=2..n} ñ(ñ−1)/2 ) = O( n! (n−1)! / 2^{n−1} ).   (11)

In the FastMat library the strategy is to use the exhaustive approach for n ≤ n_max and the heuristic one otherwise, with n_max = 5 by default but dynamically configurable by the user.

D.2 Example: Computation of the SUPG stabilization term. Element residual and Jacobian

As an example, consider the computation of the stabilization term (see Section V) for general advective-diffusive systems. The following product must be computed:

R^{SUPG,e}_{pμ} = ω_{p,k} A_{kμν} τ_{να} R^{SUPG,gp}_{α},   (12)

where R^{SUPG,e}_{pμ} (shape (n_el, n_dof), identifier res) is the SUPG residual contribution from element e (see Section V and Eq. 15), ω_{p,k} (shape (n_dim, n_el), identifier gN) are the spatial gradients of the interpolation functions ω_p, A_{kμν} = ∂F_{kμ}/∂U_ν (shape (n_dim, n_dof, n_dof), identifier A) are the Jacobians of the advective fluxes F_{kμ} with respect to the state variables U_ν, τ_{να} (shape (n_dof, n_dof), identifier tau) is the matrix of intrinsic times, and R^{SUPG,gp}_{α} (shape (n_dof), identifier R) is the vector of residuals per field at the Gauss point. These tensor products arise, for instance, in the context of FEM-Galerkin SUPG stabilized methods (see Donea and Huerta, 2003; Tezduyar and Osawa, 2000). This multi-product is just an example of the typical computations performed in a FEM based CFD code. The operation can be computed with a FastMat call like this:

res.prod(gN, A, tau, R, -1, 1, -1, 2, -2, -2, -3, -3);

where, if the j-th integer argument is positive, it represents the position of the index in the resulting matrix; otherwise, if the j-th argument is negative, a contraction operation is performed over all the indices sharing that negative value (see Appendix A.C).
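For reference, the greedy heuristic of Section III.D.1, specialized to rank-2 chains, can be sketched as follows (an illustrative stand-alone version; the FastMat implementation also handles the general tensorial case):

  #include <vector>
  #include <utility>

  // Greedy ordering of the chain product A1*A2*...*An, where Ak has shape
  // (dims[k-1], dims[k]). At each step the cheapest adjacent product is
  // chosen, with cost c_k = m_k*m_{k+1}*m_{k+2}; O(n^2) overall.
  std::vector<std::pair<int,int> > chain_order(std::vector<long> dims) {
    std::vector<std::pair<int,int> > order;
    while (dims.size() > 2) {
      int kbest = 0;
      long cbest = dims[0]*dims[1]*dims[2];
      for (int k = 1; k + 2 < (int)dims.size(); k++) {
        long c = dims[k]*dims[k+1]*dims[k+2];
        if (c < cbest) { cbest = c; kbest = k; }
      }
      // schedule the contraction of the pair at positions (kbest, kbest+1)
      // in the *current* list, then collapse the pair into one matrix
      order.push_back(std::make_pair(kbest, kbest+1));
      dims.erase(dims.begin() + kbest + 1);
    }
    return order;
  }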
The FastMat::prod() method then implements several possibilities for the computing tree:

natural: the products are performed in the order the user entered them, i.e., (((A_1 A_2) A_3) A_4);

heuristic: uses the heuristic algorithm described in Section III.D.1;

optimal: an exhaustive brute-force approach is applied in order to determine the computation tree with the lowest operation count.

In Table 1 the operation counts for these three strategies are reported. The first three columns show the relevant dimension parameters. n_dim may be 1, 2 or 3, and n_el may be 2 (segments), 3 (triangles), 4 (quads in 2D, tetras in 3D) or 8 (hexahedra). The values explored for n_dof are: n_dof = 1 (scalar advection-diffusion), n_dof = n_dim + 2 (compressible flow), and n_dof = 10 (advection-diffusion for 10 species).

Operation counts for the computation of element residuals. The costs of the tensor operation defined in Eq. (12) (where the involved tensors are as described above) are evaluated in terms of the gains (%) (see Table 1). The gains are relative to the cost of the products performed in the natural order, and the computation trees are:

CT1: ((gN*R)*tau)*A
CT2: (gN*(tau*R))*A
CT3: gN*((A*tau)*R)
CT4: gN*(A*(tau*R))

Operation counts for the computation of element Jacobians. This product is similar to the one described above, but now the Jacobian of the residual term is computed, so that the last tensor is not a vector but a rank 3 tensor:

J^{SUPG,e}_{pμqν} = ω_{p,k} A_{kμα} τ_{αβ} J^{SUPG,gp}_{βqν}.   (13)

This is computed with the call

J.prod(gN, A, tau, JR, -1, 1, -1, 2, -2, -2, -3, -3, 3, 4);

where J^{SUPG,gp}_{βqν} (shape (n_dof, n_el, n_dof), identifier JR) is the Jacobian of the residuals per field at the Gauss point. The possible orders (or computation trees) are:

CT5: ((gN*A)*tau)*JR
CT6: (gN*(A*tau))*JR
CT7: gN*((A*tau)*JR)

Discussion of the influence of the computation tree. From the previous examples it is noticed that:

In many cases the use of the CT determined with the heuristic or optimal orders yields a significant gain in operation count. The gain may be 90% or even higher in a realistic case like the computations of Eqs. (12) and (13).

In the presented cases the heuristic algorithm always yielded a reduction in operation count, though in general this is not guaranteed. In some cases the heuristic approach yielded the optimal CT; in others it gave a gain, but far from the optimal one.

It is very interesting that neither the heuristic nor the optimal computation trees are the same for all combinations of the parameters (n_dim, n_el, n_dof). For instance, in the computation of the Jacobian the optimal order is (gN*(A*tau))*JR in some cases and gN*((A*tau)*JR) in others (designated as CT6 and CT7 in the tables). This means that it would be impossible for the user to choose an order that is optimal for all cases; it must be computed automatically at run-time, as proposed here in the FastMat library.

In some cases the optimal or heuristically determined orders involve contractions that are not in the natural order; for instance, the CT (gN*(tau*R))*A (designated as CT2) is obtained with the heuristic algorithm for the computation of the element residual for some sets of parameters. Note that the second scheduled contraction involves the gN and tau*R tensors, even though they do not share any contracted indices.

Note that, in order to compute an efficient CT for the multi-product (with either the heuristic or the optimal algorithm), the library must implement the operation in functional form, as the FastMat library does. Operator overloading is not friendly to the implementation of an algorithm like this, because there is no way to capture the whole set of matrices and the contraction indices. The computation of the optimal or heuristic order is done only once and stored in the cache; in fact, this is a very good example of the utility of using caches.

Using more elaborate estimations of computing time. In the present work it is assumed that the computing time is directly proportional to the number of operations. This may not be true, but note that the computation tree could be determined with a more direct approach, for instance by benchmarking the products and then determining the CT that results in the lowest computing time, not in the lowest operation count.

IV. COMBINING MPI WITH OPENMP

To exploit the benefits of both parallel approaches, the so-called hybrid programming paradigm came into use when multi-core processors became available and Massively Parallel Processing (MPP) systems were abandoned in favor of clusters of SMP nodes. In the hybrid model each SMP node executes one multi-threaded MPI process (here this is done by means of OpenMP directives), while in pure MPI programming each processor executes a single-threaded MPI process.

The hybrid model approach does not always provide many benefits. For example, Henty (2000) observed that although in some cases OpenMP was more efficient than MPI on a single SMP node (flat MPI), hybrid parallelism did not outperform pure message-passing on an SMP cluster.
Also, according to Smith and Bull (2001), this style of programming cannot be regarded as the ideal programming model for all codes. Nevertheless, a significant benefit may be obtained if the parallel MPI code suffers from poor scaling due to load imbalance or too fine grained tasks. On one hand, if the code scales poorly with an increasing number of SMP nodes while a multi-threaded implementation of a specific part of it scales well, an improvement in performance may be expected in a mixed code, because intra-node communication disappears. On the other hand, if an MPI code scales poorly due to load imbalance, a hybrid programming version will create a coarser grained problem, thus improving the code performance inasmuch as MPI will only be used for communication between nodes.

Table 1: Operation count for the stabilization term in the SUPG formulation of advection-diffusion for n_dof fields. Other relevant dimensions are the space dimension n_dim and the number of nodes per element n_el. Columns: n_dim, n_el, n_dof, #ops (natural); heuristic #ops, gain (%); optimal #ops, gain (%); selected CT.

Table 2: Operation count for the Jacobian of the SUPG term for n_dof fields. Other relevant dimensions are the space dimension n_dim and the number of nodes per element n_el. Columns as in Table 1.

Typically, a mixed mode code involves a hierarchical model where MPI parallelization occurs at the top level (coarse grain) and multi-threading parallelization below (fine grain). This implies that MPI routines are called outside parallel regions, while all threads but the master are sleeping. This approach provides a portable parallelization and is the scheme used in this work. Thread-safe objects, like the FastMat ones, are also needed in this technique.
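A minimal skeleton of this hierarchical scheme is sketched below (illustrative only: the element computation is replaced by a placeholder and the mesh partition is a simple block split). MPI_THREAD_FUNNELED requests that only the master thread perform MPI calls, which matches the model just described:

  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Coarse grain: each MPI process owns a contiguous chunk of elements.
    int nelem = 1000000;                  // global element count (example value)
    int n0 = rank*nelem/size, n1 = (rank+1)*nelem/size;
    double local = 0.0;

    // Fine grain: OpenMP threads split the owned chunk; no MPI calls here.
    #pragma omp parallel for reduction(+:local)
    for (int ie = n0; ie < n1; ie++)
      local += 1.0;                       // element computation placeholder

    // Back to single-threaded execution: only the master communicates.
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
  }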

Hybrid MPI/OpenMP parallelization

In this work hybrid parallelism is evaluated with a C++ code for the solution of the scalar advection-diffusion partial differential equation by means of a stabilized finite element method. The parallel code was implemented using MPI, PETSc and OpenMP, and has four main parts:

i. Problem partition across SMP nodes: each node stores a portion of the mesh elements and the associated degrees of freedom. Besides, a parallel matrix and work vectors are created (MatCreateMPIAIJ, VecCreateMPI).

ii. Element FEM matrices and RHS computation: each SMP node performs a loop over the owned partition of the elements in the mesh and computes element matrices and right hand side contributions using OpenMP threads. Each thread performs the computations for its assigned group of elements according to a selected OpenMP scheduling (static, dynamic, guided).

iii. Matrix/vector values assembly: vector and matrix element contributions are stored in PETSc matrix and vector objects (VecSetValue, MatSetValue) on each SMP node. After that, {Vec|Mat}AssemblyBegin and {Vec|Mat}AssemblyEnd are called in order to communicate the off-process vector/matrix contributions and to prepare the internal matrix data structures for subsequent operations like the parallel matrix-vector product.

iv. Linear system solving: a suitable iterative Krylov space-based linear solver (KSP objects) is used. The matrix-vector product is a highly demanding operation in most iterative Krylov methods. The MatMult operation, which is employed in all PETSc Krylov solvers, is also parallelized using OpenMP threads inside an SMP node. For this purpose, OpenMP directives are introduced at the row loop of the sparse matrix-vector product (a sketch is given below).
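The threaded row loop of part iv might look as follows for a CSR-format matrix (a sketch; PETSc's actual MatMult kernel differs in detail):

  // Thread-parallel sparse matrix-vector product y = A*x for a CSR matrix,
  // with the OpenMP directive at the row loop.
  void spmv_csr(int nrows, const int *rowptr, const int *colidx,
                const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
      double sum = 0.0;
      for (int j = rowptr[i]; j < rowptr[i+1]; j++)
        sum += val[j]*x[colidx[j]];
      y[i] = sum;
    }
  }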
V. TEST CASE: ADVECTION-DIFFUSION SKEW TO THE MESH

The scalar advection-diffusion equation to be solved is

a · ∇u − ∇ · (ν ∇u) = s   in Ω,
u = u_D   on Γ_D,   (14)
ν ∇u · n = q   on Γ_N,

where u is the unknown scalar, a is the advection velocity vector, ν > 0 is the diffusion coefficient and s(x) is a volumetric source term. For the sake of simplicity only Dirichlet and Neumann boundary conditions are considered. The prescribed values for the Dirichlet and Neumann conditions are u_D and q, and n is the unit exterior normal to the boundary Γ_N. The discrete variational formulation of (14) with the added SUPG stabilizing term (Tezduyar and Osawa, 2000) is written as follows: find u^h ∈ S^h such that, for all v^h ∈ V^h,

∫_Ω v^h (a · ∇u^h) dΩ + ∫_Ω ∇v^h · (ν ∇u^h) dΩ
+ Σ_{e=1}^{n_el} ∫_{Ω^e} τ_supg (a · ∇v^h)(a · ∇u^h − ∇ · (ν ∇u^h) − s) dΩ
= ∫_Ω v^h s dΩ + ∫_{Γ_N} v^h q dΓ,   (15)

with the function spaces

S^h = { u^h : u^h ∈ H^1(Ω), u^h|_{Ω^e} ∈ P_m(Ω^e) for all e, u^h = u_D on Γ_D },
V^h = { v^h : v^h ∈ H^1(Ω), v^h|_{Ω^e} ∈ P_m(Ω^e) for all e, v^h = 0 on Γ_D },   (16)

where P_m is the finite element interpolating space. According to Tezduyar and Osawa (2000) the stabilization parameter τ_supg is computed as

τ_supg = h_supg / (2 ||a||),   (17)

with

h_supg = 2 ( Σ_{i=1}^{n_en} | s · ∇ω_i | )^{-1},   (18)

where s is a unit vector pointing in the streamline direction. The problem statement is depicted in Fig. 5, where the computational domain is a unit square, i.e., Ω = {(x,y) in [0,1] × [0,1]}.

This 2D test case has been widely used to illustrate the effectiveness of stabilized finite element methods in the modeling of advection-dominated flows (Donea and Huerta, 2003). The flow is unidirectional and constant, |a| = 1, but the advection velocity is skew to the mesh, with an angle of 30°. The diffusion coefficient is 10^{-4}, thus obtaining an advection-dominated system. As shown in Fig. 5, Dirichlet boundary conditions are imposed at the inlet walls (the x=0 and y=0 planes), while at the outlet walls (the x=1 and y=1 planes) homogeneous Neumann conditions are considered. A solution obtained on a coarse mesh is plotted in Fig. 6.

A. Results

The equipment used to perform the tests was the Coyote cluster at the CIMEC laboratory. This is a Beowulf-class cluster, consisting of a server and 6 nodes, each with two quad-core Intel Xeon E5420 2.5 GHz processors (1333 MHz Front Side Bus, FSB) and 8 GB RAM per node (see Fig. 11a), interconnected via a Gigabit Ethernet network.

In order to assess the performance of the proposed algorithm a mesh of 4.5 million linear triangles (with linear refinement toward the walls) is used. As the system of linear equations arising from (15) is non-symmetric, the stabilized bi-conjugate gradient (BiCGStab) method with diagonal preconditioning is employed as solver (Saad, 2000). The results for the consumed wall-clock time while sweeping the number of SMP nodes (n_p) and the number of threads (n_c) are shown in Fig. 7 for the RHS and matrix computation, and in Fig. 8 for the MatMult operation.

Let T* be the execution time needed to solve the problem on 1 node and 1 thread (n_p = 1 and n_c = 1) and let T denote the execution time of the parallel algorithm when there are n_p MPI processes, each running n_c OpenMP threads. Then, the speedup S and the efficiency E are defined as

S = T*/T,   (19)
E = S/(n_p n_c) = T*/(T n_p n_c),   (20)

respectively. The speedup obtained for the RHS and matrix computation is illustrated in Fig. 9, and for the MatMult operation in Fig. 10.

As shown in Figs. 7 to 9, the matrix and RHS computation stage scales linearly for the whole range of SMP nodes and numbers of cores/threads considered, as expected. This stage is highly parallelizable (i.e., it does not require inter-node communication) and its performance is only slightly degraded. The memory access pattern of this stage consists of loading a few element data from memory (like nodal coordinates and a few physical parameters) and computing several quantities with them (element area, interpolation functions, gradients, element contributions to the global FEM matrix, etc.). Thus, the ratio between floating point operations and memory accesses is high and, as expected, this stage is not appreciably influenced by the memory bandwidth.

Figure 5: Advection of discontinuous inlet data skew to the mesh: problem statement.
Figure 6: SUPG solution for the 2D convection-diffusion problem with downwind natural conditions on a coarse mesh.
Figure 7: Elapsed time for the RHS and matrix computation.
Figure 8: Elapsed time for the MatMult solver operation.
Figure 9: Speedup for the RHS and matrix computation.

On one hand, when considering the left plot of Fig. 9, it can be seen from the blue line labeled "1 thread" that the pure MPI runs scale linearly. On the other hand, if the axis line n_p = 1 is considered (or the blue line labeled "1 node" in the right plot of the same figure), the OpenMP threaded runs also show linearity. Beyond that, all hybrid runs exhibit the same behavior, with a maximum loss in efficiency of about 10% (for the case of n_p = 6 and n_c = 8 threads).

The situation is clearly different for the MatMult operation (Figs. 8 and 10). In this case, although the pure MPI/PETSc runs, represented by the blue line labeled "1 thread" in the left part of Fig. 10, show linear scaling when varying the number of SMP processors (or nodes), the hybrid runs show an appreciable degradation of the speedup. A speedup of 10 is obtained for 6 nodes with 8 cores, whereas the theoretical value is 48. This loss in efficiency can be better appreciated in the right plot of the figure, where the speedup stagnates beyond four threads. In sparse

matrix-vector products the ratio between floating point operations and memory accesses is low. Thus, the overall performance of this stage is mainly controlled by the memory bandwidth.

B. OpenMP and flat MPI on one multi-processor machine

A speedup comparison between pure OpenMP and pure MPI versions of the FEM code running on one multi-processor machine was made using a mesh of 2 million linear triangles. Two different architectures were used separately: a quad-core Intel i7-950 (3.07 GHz) machine and a dual quad-core Intel Xeon E5420 (2.5 GHz, each processor being dual-die dual-core) with a 1333 MHz Front Side Bus (FSB).

Common features of all Nehalem-based processors (and hence of the i7 processor) include an integrated DDR3 memory controller as well as the QuickPath Interconnect (QPI) on the processor, replacing the Front Side Bus used in earlier Core processors like the Xeon architecture. Besides, Nehalem processors have 256 KB of on-die L2 cache per core, plus up to 12 MB of shared on-die L3 cache. QPI is much faster than the FSB and hence improves the overall performance. The Core i7 has an on-die memory controller, which means that it can access memory much faster than the dual quad-core Xeon processors (which have an external memory controller), therefore improving the overall performance.

The hierarchical topology, including memory nodes, sockets, shared caches and cores of the processors involved in this test, is sketched in Fig. 11. It was obtained using hwloc (hwloc, 2010).

Figure 10: Speedup for the MatMult iterative solver operation.
Figure 11: Dual Xeon and i7 architectures.
Figure 12: Speedup for the RHS and matrix computation (left) and MatMult operation (right) stages on the Xeon architecture.
Figure 13: Speedup for the RHS and matrix computation (left) and MatMult operation (right) stages on the i7 architecture.

Figures 12 and 13 show the speedups obtained on the Xeon and i7 architectures, respectively, for the matrix and RHS computation and for the iterative system solution (MatMult) stages. The performance when computing the FEM matrix and the RHS is comparable to that presented in the previous tests. Regarding the linear system solution stage, as shown in the right plots of Figs. 12 and 13, the i7 processor performs slightly better than the Xeon processor for runs with a varying number of threads (OpenMP) or processes (flat MPI) from one to

four. Beyond that, when running the test using five to eight threads/processes on the Xeon processor (recall that the i7 processor has only 4 cores while the Xeon has 8 cores), the memory bandwidth saturates and there is no improvement in the performance of the MatMult operation.

VI. CONCLUSIONS

The computation, assembly and solution stages in typical finite element codes have been discussed and studied in order to exploit the advantages provided by new hybrid architectures, like clusters of multi-core processors. For this purpose an efficient thread-safe tensor library was presented and rigorously evaluated. Bearing in mind that for large scale meshes all FEM computations inside the loop over elements must be evaluated millions of times, the process of caching all FEM operations at the element level (provided by the FastMat library) is a key point in the performance of the computing/assembling stages. The most important features of the FastMat tensor library can be summarized as follows:

Highly repeated tensorial operations at the element level are cached. Hence, element routines are very fast, in the sense that almost all the CPU time is spent in performing multiplications and additions, and negligible CPU time is spent in auxiliary operations. This feature is the key point when dealing with large/fine FEM meshes.

For the very common multi-product tensor operation, the order in which the successive tensor contractions are computed is always chosen at the minimum operation count cost, depending on the number of tensors/matrices in the FEM terms (exhaustive or heuristic approaches).

The FastMat library is fully tensorial, in the sense that contractions are implemented in a general way. The number of contracted tensors and the range of their indices can be variable. Also, a complete set of classical tensorial functions is available in the library.

The FastMat library is thread-safe, i.e., element residuals and Jacobians can be computed in the context of multi-threaded FEM codes.

Also, in this article a stabilized finite element formulation of the linear scalar advection-diffusion equation has been implemented. This formulation was used to analyze the performance of computing the element contributions and of the assembly process using the FastMat library, and of the multi-threaded version of PETSc's linear equation solver, on a cluster of SMP nodes (hybrid architecture).

Extending pure MPI FEM codes to be used on hybrid clusters with OpenMP is not a difficult task; it can be based solely on the different levels of parallelism of each part of the code. Implementing MPI/OpenMP hybrid codes starting from threaded codes generally requires much more effort, especially when considering the linear and non-linear solution stages. FEM code developers and researchers can take advantage of the spreading hybrid architectures by combining the MPI and OpenMP standards. Nonetheless, as shown in the tests, there seems to be a limitation on the performance of architectures implementing the FSB model, which is not appreciably improved on i7 architectures. Thus, numerical algorithms with low ratios between floating point operations and memory accesses will perform poorly on hybrid environments.
ACKNOWLEDGMENTS

This work has received financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina, grant PIP 5271/05), Universidad Nacional del Litoral (UNL, Argentina, grants CAI+D /334), Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT, Argentina, grants PICT 01141/2007, PICT Jóvenes Investigadores, PICT-1506/2006) and Secretaría de Ciencia y Tecnología de la Universidad Tecnológica Nacional, Facultad Regional Resistencia (Chaco).

APPENDIX A: SYNOPSIS OF FASTMAT OPERATIONS

A. One-to-one operations

These operations go from one element of A to the corresponding element of *this. The one-to-one operations implemented so far are:

FastMat& set(const FastMat &A): Copy matrix.
FastMat& add(const FastMat &A): Add matrix.
FastMat& rest(const FastMat &A): Subtract a matrix.
FastMat& mult(const FastMat &A): Multiply (element by element, like Matlab's .*).
FastMat& div(const FastMat &A): Divide matrix (element by element, like Matlab's ./).
FastMat& axpy(const FastMat &A, const double alpha): Axpy operation (element by element): (*this) += alpha * A.

B. In-place operations

These operations perform an action on all the elements of a matrix.

FastMat& set(const double val=0.): Sets all the elements of a matrix to a constant value.
FastMat& scale(const double val): Scale by a constant value.
FastMat& add(const double val): Adds the constant val.
FastMat& fun(double (*f)(double)): Apply a function to all elements.

C. Generic sum operations (sum over indices)

These methods perform some associative reduction operation over all the values of a given index, resulting in a matrix of lower rank. It is a generalization of the sum/max/min operations in Matlab, which return the specified operation per column, resulting in a row vector (one element per column). Here a number of integer arguments is specified, in such a way that if the j-th integer argument is positive it represents the position of the index in the resulting matrix; otherwise, if the j-th argument is -1, the specified operation (sum/max/min, etc.) is performed over all the values of that index. For instance, if a FastMat A(4,2,2,3,3) is declared, then B.sum(A,-1,2,1,-1) means

B_{ij} = Σ_{k=1..2} Σ_{l=1..3} A_{kjil},  for i = 1..3, j = 1..2.   (21)

These operations can be extended to any binary associative operation. So far the following operations are implemented:

FastMat& sum(const FastMat &A, const int m=0, ...): Sum over all selected indices.


More information

More on Functions and Their Graphs

More on Functions and Their Graphs More on Functions and Teir Graps Difference Quotient ( + ) ( ) f a f a is known as te difference quotient and is used exclusively wit functions. Te objective to keep in mind is to factor te appearing in

More information

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method Engineering, 2013, 5, 522-526 ttp://dx.doi.org/10.4236/eng.2013.510b107 Publised Online October 2013 (ttp://www.scirp.org/journal/eng) Two Modifications of Weigt Calculation of te Non-Local Means Denoising

More information

19.2 Surface Area of Prisms and Cylinders

19.2 Surface Area of Prisms and Cylinders Name Class Date 19 Surface Area of Prisms and Cylinders Essential Question: How can you find te surface area of a prism or cylinder? Resource Locker Explore Developing a Surface Area Formula Surface area

More information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information Unsupervised Learning for Hierarcical Clustering Using Statistical Information Masaru Okamoto, Nan Bu, and Tosio Tsuji Department of Artificial Complex System Engineering Hirosima University Kagamiyama

More information

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm Sensors & Transducers 2013 by IFSA ttp://www.sensorsportal.com CESILA: Communication Circle External Square Intersection-Based WSN Localization Algoritm Sun Hongyu, Fang Ziyi, Qu Guannan College of Computer

More information

1.4 RATIONAL EXPRESSIONS

1.4 RATIONAL EXPRESSIONS 6 CHAPTER Fundamentals.4 RATIONAL EXPRESSIONS Te Domain of an Algebraic Epression Simplifying Rational Epressions Multiplying and Dividing Rational Epressions Adding and Subtracting Rational Epressions

More information

Tuning MAX MIN Ant System with off-line and on-line methods

Tuning MAX MIN Ant System with off-line and on-line methods Université Libre de Bruxelles Institut de Recerces Interdisciplinaires et de Développements en Intelligence Artificielle Tuning MAX MIN Ant System wit off-line and on-line metods Paola Pellegrini, Tomas

More information

Multi-Stack Boundary Labeling Problems

Multi-Stack Boundary Labeling Problems Multi-Stack Boundary Labeling Problems Micael A. Bekos 1, Micael Kaufmann 2, Katerina Potika 1 Antonios Symvonis 1 1 National Tecnical University of Atens, Scool of Applied Matematical & Pysical Sciences,

More information

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION Martin Kraus Computer Grapics and Visualization Group, Tecnisce Universität Müncen, Germany krausma@in.tum.de Magnus Strengert Visualization and Interactive

More information

Asynchronous Power Flow on Graphic Processing Units

Asynchronous Power Flow on Graphic Processing Units Asyncronous Power Flow on Grapic Processing Units Manuel Marin, David Defour, Federico Milano To cite tis version: Manuel Marin, David Defour, Federico Milano. Asyncronous Power Flow on Grapic Processing

More information

ICES REPORT Isogeometric Analysis of Boundary Integral Equations

ICES REPORT Isogeometric Analysis of Boundary Integral Equations ICES REPORT 5-2 April 205 Isogeometric Analysis of Boundary Integral Equations by Mattias Taus, Gregory J. Rodin and Tomas J. R. Huges Te Institute for Computational Engineering and Sciences Te University

More information

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows Journal of Computational Pysics 176, 40 69 (2002) doi:10.1006/jcp.2001.6967, available online at ttp://www.idealibrary.com on Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid

More information

MAPI Computer Vision

MAPI Computer Vision MAPI Computer Vision Multiple View Geometry In tis module we intend to present several tecniques in te domain of te 3D vision Manuel Joao University of Mino Dep Industrial Electronics - Applications -

More information

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method Te Open Plasma Pysics Journal, 2010, 3, 29-35 29 Open Access Alternating Direction Implicit Metods for FDTD Using te Dey-Mittra Embedded Boundary Metod T.M. Austin *, J.R. Cary, D.N. Smite C. Nieter Tec-X

More information

4.1 Tangent Lines. y 2 y 1 = y 2 y 1

4.1 Tangent Lines. y 2 y 1 = y 2 y 1 41 Tangent Lines Introduction Recall tat te slope of a line tells us ow fast te line rises or falls Given distinct points (x 1, y 1 ) and (x 2, y 2 ), te slope of te line troug tese two points is cange

More information

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks An Algoritm for Loopless Deflection in Potonic Packet-Switced Networks Jason P. Jue Center for Advanced Telecommunications Systems and Services Te University of Texas at Dallas Ricardson, TX 75083-0688

More information

Intra- and Inter-Session Network Coding in Wireless Networks

Intra- and Inter-Session Network Coding in Wireless Networks Intra- and Inter-Session Network Coding in Wireless Networks Hulya Seferoglu, Member, IEEE, Atina Markopoulou, Member, IEEE, K K Ramakrisnan, Fellow, IEEE arxiv:857v [csni] 3 Feb Abstract In tis paper,

More information

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes Coarticulation: An Approac for Generating Concurrent Plans in Markov Decision Processes Kasayar Roanimanes kas@cs.umass.edu Sridar Maadevan maadeva@cs.umass.edu Department of Computer Science, University

More information

12.2 Techniques for Evaluating Limits

12.2 Techniques for Evaluating Limits 335_qd /4/5 :5 PM Page 863 Section Tecniques for Evaluating Limits 863 Tecniques for Evaluating Limits Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing

More information

CS 234. Module 6. October 16, CS 234 Module 6 ADT Dictionary 1 / 33

CS 234. Module 6. October 16, CS 234 Module 6 ADT Dictionary 1 / 33 CS 234 Module 6 October 16, 2018 CS 234 Module 6 ADT Dictionary 1 / 33 Idea for an ADT Te ADT Dictionary stores pairs (key, element), were keys are distinct and elements can be any data. Notes: Tis is

More information

Redundancy Awareness in SQL Queries

Redundancy Awareness in SQL Queries Redundancy Awareness in QL Queries Bin ao and Antonio Badia omputer Engineering and omputer cience Department University of Louisville bin.cao,abadia @louisville.edu Abstract In tis paper, we study QL

More information

An Anchor Chain Scheme for IP Mobility Management

An Anchor Chain Scheme for IP Mobility Management An Ancor Cain Sceme for IP Mobility Management Yigal Bejerano and Israel Cidon Department of Electrical Engineering Tecnion - Israel Institute of Tecnology Haifa 32000, Israel E-mail: bej@tx.tecnion.ac.il.

More information

Density Estimation Over Data Stream

Density Estimation Over Data Stream Density Estimation Over Data Stream Aoying Zou Dept. of Computer Science, Fudan University 22 Handan Rd. Sangai, 2433, P.R. Cina ayzou@fudan.edu.cn Ziyuan Cai Dept. of Computer Science, Fudan University

More information

4.2 The Derivative. f(x + h) f(x) lim

4.2 The Derivative. f(x + h) f(x) lim 4.2 Te Derivative Introduction In te previous section, it was sown tat if a function f as a nonvertical tangent line at a point (x, f(x)), ten its slope is given by te it f(x + ) f(x). (*) Tis is potentially

More information

Investigating an automated method for the sensitivity analysis of functions

Investigating an automated method for the sensitivity analysis of functions Investigating an automated metod for te sensitivity analysis of functions Sibel EKER s.eker@student.tudelft.nl Jill SLINGER j..slinger@tudelft.nl Delft University of Tecnology 2628 BX, Delft, te Neterlands

More information

THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM

THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM SAMUEL OLU OLAGUNJU Adeyemi College of Education NIGERIA Email: lagsam04@aceondo.edu.ng ABSTRACT

More information

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method International Journal of Statistics and Applications 0, (): -0 DOI: 0.9/j.statistics.000.0 Comparison of te Efficiency of te Various Algoritms in Stratified Sampling wen te Initial Solutions are Determined

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE Data Structures and Algoritms Capter 4: Trees (AVL Trees) Text: Read Weiss, 4.4 Izmir University of Economics AVL Trees An AVL (Adelson-Velskii and Landis) tree is a binary searc tree wit a balance

More information

You Try: A. Dilate the following figure using a scale factor of 2 with center of dilation at the origin.

You Try: A. Dilate the following figure using a scale factor of 2 with center of dilation at the origin. 1 G.SRT.1-Some Tings To Know Dilations affect te size of te pre-image. Te pre-image will enlarge or reduce by te ratio given by te scale factor. A dilation wit a scale factor of 1> x >1enlarges it. A dilation

More information

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks Downloaded from vbn.aau.dk on: April 7, 09 Aalborg Universitet etwork Coding to Enance Standard Routing Protocols in Wireless Mes etworks Palevani, Peyman; Roetter, Daniel Enrique Lucani; Fitzek, Frank;

More information

Fault Localization Using Tarantula

Fault Localization Using Tarantula Class 20 Fault localization (cont d) Test-data generation Exam review: Nov 3, after class to :30 Responsible for all material up troug Nov 3 (troug test-data generation) Send questions beforeand so all

More information

15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes

15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes 15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes William Lovas (wlovas@cs) Karl Naden Out: Tuesday, Friday, June 10, 2011 Due: Monday, June 13, 2011 (Written

More information

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher!

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher! Announcements 2 "Organizing is wat you do efore you do someting, so tat wen you do it, it is not all mixed up." ~ A. A. Milne SORTING Lecture 11 CS2110 Fall 2017 is a program wit a teac anyting, learn

More information

UUV DEPTH MEASUREMENT USING CAMERA IMAGES

UUV DEPTH MEASUREMENT USING CAMERA IMAGES ABCM Symposium Series in Mecatronics - Vol. 3 - pp.292-299 Copyrigt c 2008 by ABCM UUV DEPTH MEASUREMENT USING CAMERA IMAGES Rogerio Yugo Takimoto Graduate Scool of Engineering Yokoama National University

More information

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 Note: Tere will be a very sort online reading quiz (WebWork) on eac reading assignment due one our before class on its due date. Due dates can be found

More information

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation H-Adaptive Multiscale Scemes for te Compressible Navier-Stokes Equations Polyedral Discretization, Data Compression and Mes Generation F. Bramkamp 1, B. Gottsclic-Müller 2, M. Hesse 1, P. Lamby 2, S. Müller

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall Has-Based Indexes Capter 11 Comp 521 Files and Databases Fall 2012 1 Introduction Hasing maps a searc key directly to te pid of te containing page/page-overflow cain Doesn t require intermediate page fetces

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring Has-Based Indexes Capter 11 Comp 521 Files and Databases Spring 2010 1 Introduction As for any index, 3 alternatives for data entries k*: Data record wit key value k

More information

Efficient Content-Based Indexing of Large Image Databases

Efficient Content-Based Indexing of Large Image Databases Efficient Content-Based Indexing of Large Image Databases ESSAM A. EL-KWAE University of Nort Carolina at Carlotte and MANSUR R. KABUKA University of Miami Large image databases ave emerged in various

More information

Energy efficient temporal load aware resource allocation in cloud computing datacenters

Energy efficient temporal load aware resource allocation in cloud computing datacenters Vakilinia Journal of Cloud Computing: Advances, Systems and Applications (2018) 7:2 DOI 10.1186/s13677-017-0103-2 Journal of Cloud Computing: Advances, Systems and Applications RESEARCH Energy efficient

More information

Image Registration via Particle Movement

Image Registration via Particle Movement Image Registration via Particle Movement Zao Yi and Justin Wan Abstract Toug fluid model offers a good approac to nonrigid registration wit large deformations, it suffers from te blurring artifacts introduced

More information

Section 2.3: Calculating Limits using the Limit Laws

Section 2.3: Calculating Limits using the Limit Laws Section 2.3: Calculating Limits using te Limit Laws In previous sections, we used graps and numerics to approimate te value of a it if it eists. Te problem wit tis owever is tat it does not always give

More information

ANTENNA SPHERICAL COORDINATE SYSTEMS AND THEIR APPLICATION IN COMBINING RESULTS FROM DIFFERENT ANTENNA ORIENTATIONS

ANTENNA SPHERICAL COORDINATE SYSTEMS AND THEIR APPLICATION IN COMBINING RESULTS FROM DIFFERENT ANTENNA ORIENTATIONS NTNN SPHRICL COORDINT SSTMS ND THIR PPLICTION IN COMBINING RSULTS FROM DIFFRNT NTNN ORINTTIONS llen C. Newell, Greg Hindman Nearfield Systems Incorporated 133. 223 rd St. Bldg. 524 Carson, C 9745 US BSTRCT

More information

Truncated Newton-based multigrid algorithm for centroidal Voronoi diagram calculation

Truncated Newton-based multigrid algorithm for centroidal Voronoi diagram calculation NUMERICAL MATHEMATICS: Teory, Metods and Applications Numer. Mat. Teor. Met. Appl., Vol. xx, No. x, pp. 1-18 (200x) Truncated Newton-based multigrid algoritm for centroidal Voronoi diagram calculation

More information

Excel based finite difference modeling of ground water flow

Excel based finite difference modeling of ground water flow Journal of Himalaan Eart Sciences 39(006) 49-53 Ecel based finite difference modeling of ground water flow M. Gulraiz Akter 1, Zulfiqar Amad 1 and Kalid Amin Kan 1 Department of Eart Sciences, Quaid-i-Azam

More information

Proceedings. Seventh ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2013) Palm Spring, CA

Proceedings. Seventh ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2013) Palm Spring, CA Proceedings Of te Sevent ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC ) Palm Spring, CA October 9 November st Parameter-Unaware Autocalibration for Occupancy Mapping David Van

More information

The Euler and trapezoidal stencils to solve d d x y x = f x, y x

The Euler and trapezoidal stencils to solve d d x y x = f x, y x restart; Te Euler and trapezoidal stencils to solve d d x y x = y x Te purpose of tis workseet is to derive te tree simplest numerical stencils to solve te first order d equation y x d x = y x, and study

More information

Some Handwritten Signature Parameters in Biometric Recognition Process

Some Handwritten Signature Parameters in Biometric Recognition Process Some Handwritten Signature Parameters in Biometric Recognition Process Piotr Porwik Institute of Informatics, Silesian Uniersity, Bdziska 39, 41- Sosnowiec, Poland porwik@us.edu.pl Tomasz Para Institute

More information

Vector Processing Contours

Vector Processing Contours Vector Processing Contours Andrey Kirsanov Department of Automation and Control Processes MAMI Moscow State Tecnical University Moscow, Russia AndKirsanov@yandex.ru A.Vavilin and K-H. Jo Department of

More information

MAC-CPTM Situations Project

MAC-CPTM Situations Project raft o not use witout permission -P ituations Project ituation 20: rea of Plane Figures Prompt teacer in a geometry class introduces formulas for te areas of parallelograms, trapezoids, and romi. e removes

More information

Integrating Multimedia Applications in Hard Real-Time Systems

Integrating Multimedia Applications in Hard Real-Time Systems Integrating Multimedia Applications in Hard Real-Time Systems Luca Abeni and Giorgio Buttazzo Scuola Superiore S. Anna, Pisa luca@arti.sssup.it, giorgio@sssup.it Abstract Tis paper focuses on te problem

More information

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng An ffective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Ciming Huang and ei-heng Ceng 5 De c e mbe r0 International Journal of Advanced Information Tecnologies (IJAIT),

More information

Minimizing Memory Access By Improving Register Usage Through High-level Transformations

Minimizing Memory Access By Improving Register Usage Through High-level Transformations Minimizing Memory Access By Improving Register Usage Troug Hig-level Transformations San Li Scool of Computer Engineering anyang Tecnological University anyang Avenue, SIGAPORE 639798 Email: p144102711@ntu.edu.sg

More information

12.2 TECHNIQUES FOR EVALUATING LIMITS

12.2 TECHNIQUES FOR EVALUATING LIMITS Section Tecniques for Evaluating Limits 86 TECHNIQUES FOR EVALUATING LIMITS Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing tecnique to evaluate its of

More information

RECONSTRUCTING OF A GIVEN PIXEL S THREE- DIMENSIONAL COORDINATES GIVEN BY A PERSPECTIVE DIGITAL AERIAL PHOTOS BY APPLYING DIGITAL TERRAIN MODEL

RECONSTRUCTING OF A GIVEN PIXEL S THREE- DIMENSIONAL COORDINATES GIVEN BY A PERSPECTIVE DIGITAL AERIAL PHOTOS BY APPLYING DIGITAL TERRAIN MODEL IV. Évfolyam 3. szám - 2009. szeptember Horvát Zoltán orvat.zoltan@zmne.u REONSTRUTING OF GIVEN PIXEL S THREE- DIMENSIONL OORDINTES GIVEN Y PERSPETIVE DIGITL ERIL PHOTOS Y PPLYING DIGITL TERRIN MODEL bsztrakt/bstract

More information

Computing geodesic paths on manifolds

Computing geodesic paths on manifolds Proc. Natl. Acad. Sci. USA Vol. 95, pp. 8431 8435, July 1998 Applied Matematics Computing geodesic pats on manifolds R. Kimmel* and J. A. Setian Department of Matematics and Lawrence Berkeley National

More information

, 1 1, A complex fraction is a quotient of rational expressions (including their sums) that result

, 1 1, A complex fraction is a quotient of rational expressions (including their sums) that result RT. Complex Fractions Wen working wit algebraic expressions, sometimes we come across needing to simplify expressions like tese: xx 9 xx +, xx + xx + xx, yy xx + xx + +, aa Simplifying Complex Fractions

More information

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Peyman Faizian, Md Safayat Raman, Md Atiqul Molla, Xin Yuan Department of Computer Science Florida State University

More information

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING Raffaele Gaetano, Giuseppe Scarpa, Giovanni Poggi, and Josiane Zerubia Dip. Ing. Elettronica e Telecomunicazioni,

More information

PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT

PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT IMEKO 21 TC3, TC5 and TC22 Conferences Metrology in Modern Context November 22 25, 21, Pattaya, Conburi, Tailand PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT Tassanai Sanponpute 1, Cokcai Wattong

More information

Section 3. Imaging With A Thin Lens

Section 3. Imaging With A Thin Lens Section 3 Imaging Wit A Tin Lens 3- at Ininity An object at ininity produces a set o collimated set o rays entering te optical system. Consider te rays rom a inite object located on te axis. Wen te object

More information

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor A Novel QC-LDPC Code wit Flexile Construction and Low Error Floor Hanxin WANG,2, Saoping CHEN,2,CuitaoZHU,2 and Kaiyou SU Department of Electronics and Information Engineering, Sout-Central University

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

On the Use of Radio Resource Tests in Wireless ad hoc Networks

On the Use of Radio Resource Tests in Wireless ad hoc Networks Tecnical Report RT/29/2009 On te Use of Radio Resource Tests in Wireless ad oc Networks Diogo Mónica diogo.monica@gsd.inesc-id.pt João Leitão jleitao@gsd.inesc-id.pt Luis Rodrigues ler@ist.utl.pt Carlos

More information

A Finite Element Scheme for Calculating Inverse Dynamics of Link Mechanisms

A Finite Element Scheme for Calculating Inverse Dynamics of Link Mechanisms WCCM V Fift World Congress on Computational Mecanics July -1,, Vienna, Austria Eds.: H.A. Mang, F.G. Rammerstorfer, J. Eberardsteiner A Finite Element Sceme for Calculating Inverse Dynamics of Link Mecanisms

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Lehrstuhl für Informatik 10 (Systemsimulation)

Lehrstuhl für Informatik 10 (Systemsimulation) FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lerstul für Informatik 10 (Systemsimulation) PDE based Video Compression in Real

More information

AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic

AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic 1 AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic AVL Trees 2 Binary Searc Trees better tan linear dictionaries; owever, te worst case performance

More information

PLK-B SERIES Technical Manual (USA Version) CLICK HERE FOR CONTENTS

PLK-B SERIES Technical Manual (USA Version) CLICK HERE FOR CONTENTS PLK-B SERIES Technical Manual (USA Version) CLICK ERE FOR CONTENTS CONTROL BOX PANEL MOST COMMONLY USED FUNCTIONS INITIAL READING OF SYSTEM SOFTWARE/PAGES 1-2 RE-INSTALLATION OF TE SYSTEM SOFTWARE/PAGES

More information

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art Multi-Objective Particle Swarm Optimizers: A Survey of te State-of-te-Art Margarita Reyes-Sierra and Carlos A. Coello Coello CINVESTAV-IPN (Evolutionary Computation Group) Electrical Engineering Department,

More information

Distributed and Optimal Rate Allocation in Application-Layer Multicast

Distributed and Optimal Rate Allocation in Application-Layer Multicast Distributed and Optimal Rate Allocation in Application-Layer Multicast Jinyao Yan, Martin May, Bernard Plattner, Wolfgang Mülbauer Computer Engineering and Networks Laboratory, ETH Zuric, CH-8092, Switzerland

More information

Overcomplete Steerable Pyramid Filters and Rotation Invariance

Overcomplete Steerable Pyramid Filters and Rotation Invariance vercomplete Steerable Pyramid Filters and Rotation Invariance H. Greenspan, S. Belongie R. Goodman and P. Perona S. Raksit and C. H. Anderson Department of Electrical Engineering Department of Anatomy

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Okan Yilmaz and Ing-Ray Cen Computer Science Department Virginia Tec {oyilmaz, ircen}@vt.edu

More information

TREES. General Binary Trees The Search Tree ADT Binary Search Trees AVL Trees Threaded trees Splay Trees B-Trees. UNIT -II

TREES. General Binary Trees The Search Tree ADT Binary Search Trees AVL Trees Threaded trees Splay Trees B-Trees. UNIT -II UNIT -II TREES General Binary Trees Te Searc Tree DT Binary Searc Trees VL Trees Treaded trees Splay Trees B-Trees. 2MRKS Q& 1. Define Tree tree is a data structure, wic represents ierarcical relationsip

More information

Real-Time Wireless Routing for Industrial Internet of Things

Real-Time Wireless Routing for Industrial Internet of Things Real-Time Wireless Routing for Industrial Internet of Tings Cengjie Wu, Dolvara Gunatilaka, Mo Sa, Cenyang Lu Cyber-Pysical Systems Laboratory, Wasington University in St. Louis Department of Computer

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganographic Schemes

Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganographic Schemes Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganograpic Scemes Jessica Fridric Dept. of Electrical Engineering, SUNY Bingamton, Bingamton, NY 3902-6000, USA fridric@bingamton.edu

More information

Introduction to Computer Graphics 5. Clipping

Introduction to Computer Graphics 5. Clipping Introduction to Computer Grapics 5. Clipping I-Cen Lin, Assistant Professor National Ciao Tung Univ., Taiwan Textbook: E.Angel, Interactive Computer Grapics, 5 t Ed., Addison Wesley Ref:Hearn and Baker,

More information

A UPnP-based Decentralized Service Discovery Improved Algorithm

A UPnP-based Decentralized Service Discovery Improved Algorithm Indonesian Journal of Electrical Engineering and Informatics (IJEEI) Vol.1, No.1, Marc 2013, pp. 21~26 ISSN: 2089-3272 21 A UPnP-based Decentralized Service Discovery Improved Algoritm Yu Si-cai*, Wu Yan-zi,

More information