Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors


Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors

P. Zaspel

Departement Mathematik und Informatik, Universität Basel, CH-4051 Basel

Preprint No., Fachbereich Mathematik, June 2017

Journal of Scientific Computing manuscript No. (will be inserted by the editor)

Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors

Peter Zaspel

Received: date / Accepted: date

Abstract The Ruge-Stüben algebraic multigrid method (AMG) is an optimal-complexity black-box approach to solve linear systems arising in discretizations of e.g. elliptic PDEs. Recently, there has been a growing interest in parallelizing this method on many-core hardware, especially graphics processing units (GPUs). This type of hardware delivers high performance for highly parallel algorithms. In this work, we analyse convergence properties of recent AMG developments for many-core processors and propose to use more classical choices of AMG components for higher robustness. Based on these choices, we introduce many-core parallelization strategies for a robust hybrid many-core AMG. The strategies can be understood and applied without deep knowledge of a given many-core architecture. We use them to propose a new hybrid GPU implementation. The implementation is tested in an in-depth performance analysis, which outlines its good convergence properties and high performance in the solve phase.

Keywords Ruge-Stüben algebraic multigrid (AMG) · Graphics processing units (GPUs) · Parallelization · Many-core · Iterative linear solvers · Graph traversal

Mathematics Subject Classification (2000) 65N55 · 65F10 · 65Y05 · 68W10 · 65F50 · 65N22

1 Introduction

The solution of sparse linear systems from discretized elliptic partial differential equations (PDEs) is the time-dominant component of many numerical simulations. Iterative solvers based on multigrid methods solve such systems in optimal (linear) complexity. Standard geometric multigrid often struggles with discretizations of PDEs in complex geometries. However, algebraic multigrid (AMG) is able

P. Zaspel
Departement für Mathematik und Informatik, Universität Basel
Spiegelgasse, Basel, Switzerland
E-mail: peter.zaspel@unibas.ch

to handle this special case without further adaptations. AMG builds the multigrid hierarchy by a purely algebraic construction involving only the entries of the underlying system matrix. Many-core processors achieve high performance by executing a very high number of concurrent computational threads. However, they are often limited by a relatively low per-thread performance and other simplifications. One example of many-core processors are graphics processing units (GPUs). Another example are Xeon Phi processors. Nowadays, a series of high performance computing (HPC) cluster systems is equipped with many-core processors as accelerators. That is, many-core processors can be used in addition to the existing multi-core processors in a given compute node. To achieve high performance on many-core processors, it is necessary to develop numerical algorithms that use the provided rather extreme amount of parallelism. Here, it is often not enough to reformulate an existing algorithm in a smart way. Instead, the applied numerical method has to be changed to achieve high (parallel) performance. However, by changing the numerical method, the solution properties are affected. As a rule of thumb, we note that a higher degree of parallelism in a numerical method is often connected to weaker solution properties. Hence, we have to make a trade-off between (parallel) performance and strong solution properties. This article analyses this kind of trade-off in recent work, e.g. [23], on the design and implementation of Ruge-Stüben algebraic multigrid [31, Appendix A] on many-core processors. We consider Ruge-Stüben AMG as the main object of study, since it is well-known for its good convergence and robustness. Based on our analysis, we propose a different balance of algorithmic components and parallel performance than used, for many-core processors, before. For a practical realization, we propose many-core parallelization strategies for major parts of the method.
This leads to a hybrid GPU-based Ruge-Stüben AMG implementation, for which we analyse the performance.

1.1 Related work

Whenever we discuss algebraic multigrid in this article, we actually refer to Ruge-Stüben AMG. For completeness, we note however that quite some work on non-Ruge-Stüben AMG types on many-core processors has been conducted in e.g. [4, 32, 1, 9, 6, 21, 18] for (smoothed) aggregation AMG and in [34] for auxiliary grid AMG. Algebraic multigrid has a setup phase with the multigrid hierarchy construction and a solve phase with the iterative application of recursive multigrid cycles. The solve phase is almost identical in geometric and algebraic multigrid. Hence, the core of AMG lies in the setup phase. Traditionally, the execution of the setup phase has been considered to be purely sequential [31, Appendix A]. With the need of parallelization of AMG on multi-core clusters, a parallelization of the setup phase has been mainly achieved by applying the purely sequential parts of the setup phase, i.e. the coarsening or C/F splitting, onto partitions of the full system matrix and by adding some corrections at the interface [15, 35]. Unfortunately, this approach does not scale to the degree of parallelism provided by many-core processors. In fact, a single many-core processor roughly requires the parallelism that is normally distributed over a full cluster. This leads to a bad domain-to-interface ratio when applying the parallelization strategy for multi-core processors to many-

core processors. One very parallel coarsening approach, which shows acceptable performance even on many-core processors, is the PMIS (parallel maximal independent set) classification [35]. However, PMIS is not always robust. This is often overcome by some stronger interpolation techniques at higher computational costs [8]. The choice of suitable algorithmic components in the setup phase is therefore one of the crucial trade-offs that has to be made in the construction of AMG on many-core hardware. These trade-offs will be discussed in this work. Maybe the first work on (Ruge-Stüben) AMG for many-core processors has been performed in [19, 14]. In that work, the setup phase is based on matrix decomposition with boundary treatment. In [16, 24], only the solve phase of AMG was parallelized on GPU and also used in a multi-GPU setting. This approach might not yet use the full performance of GPUs. More recently, AMG has become available in the open-source libraries ViennaCL [27] and Paralution [17]. In [33, 28], performance results of ViennaCL have been discussed for AMG; however, Ruge-Stüben AMG has still only been considered either with a multi-core CPU setup or with a single-threaded GPU setup. To the author's knowledge, details on the implementation of Paralution are not published. Two current state-of-the-art commercial many-core (GPU) implementations are GAMPACK [11] and AmgX [23]. They provide a solve and setup phase that runs fully on GPUs by using PMIS. In their works [11] and [23], the authors provide an overview of their implementations and some excellent performance results when comparing CPU- and GPU-based AMG. To keep their competitive advantage, they can, however, not uncover the full details of their parallelization approaches. Moreover, their choice of test cases for convergence results is, understandably enough, targeted mainly at their customers' needs.

1.2 Contribution of this work

The objective of this work is two-fold.
First, the aim is to complement the above works by a balanced discussion of the impact of the choice of the numerical components of AMG on the robustness and solution properties. To start this discussion, we introduce the components of AMG. We then discuss choices of AMG components that have been made for many-core AMG by authors in the past. Here, we especially focus on the commercial GPU implementations [11, 23] as gold standard. Within these implementations we closely follow [23] and (numerically) study the convergence and robustness properties of methods used therein. Our numerical study uses modified versions of the AMG2013 benchmark, which is a stripped-down version of BoomerAMG [15] in hypre [12]. Within that study, we propose a set of parameters and components for AMG that we consider to be more robust than existing choices in many-core AMG. Second, the aim is to propose a series of parallelization strategies that allow to run the major part of the AMG setup phase on many-core processors. This discussion shall give starters in the field of many-core programming an insight into the algorithmic challenges that have to be faced when dealing with AMG on many-core hardware. In order to understand the properties of the proposed algorithms, an implementation has been performed based on CUDA and a sparse linear algebra library for GPUs (CUSP [5]). This implementation is hybrid in the sense that the full solve phase and almost the full setup phase are done on GPU. The only part which is left on CPU is a subset of the operations in the C/F

splitting. The new AMG solver runs on a single GPU together with one CPU core. As we will see, the resulting AMG implementation has strong convergence and robustness properties. A performance study will outline that the solve phase of our AMG runs clearly faster than a parallel version of AMG2013. However, the hybrid GPU setup phase struggles in competing with AMG2013. Those difficulties will be analyzed and explained individually. Finally, a real-world application with linear systems from a discretized PDE in a complex geometry will be shown.

This work is structured as follows. In Section 2, we review Ruge-Stüben AMG. Section 3 analyses the convergence and robustness properties of typical AMG constructions in GPU-based AMG. Section 4 introduces parallelization strategies for a robust hybrid many-core AMG and briefly discusses our concrete GPU implementation. Numerical results and performance measurements of our implementation are outlined in Section 5. Section 6 summarizes the outcome of this work.

2 Overview of the components of Ruge-Stüben algebraic multigrid

We will closely follow [31, Appendix A] to give a short overview of Ruge-Stüben AMG. Notation from [31] will be partially adapted. We want to solve linear systems

$Ax = b$    (1)

with $A \in \mathbb{R}^{N \times N}$ a sparse symmetric M-matrix, and $x, b \in \mathbb{R}^N$. These linear systems arise e.g. in the approximate solution of elliptic PDEs by finite differences. In the following, we will first discuss the two-level idea of multigrid and then the multigrid algorithm. This is followed by a summary of all algorithmically important components of AMG, namely C/F splitting, interpolation and smoothers.

2.1 Two-level idea

The two-level or two-grid algorithm contains all important properties of multigrid. To explain it, we start by identifying the original linear system as a system on a fine level or fine grid, noting that we will use the terms grid and level equivalently. On the fine grid, we thus have

$A_f x_f = b_f \iff \sum_{j \in \Omega_f} a^f_{ij} x^f_j = b^f_i \quad \forall i \in \Omega_f,$

with $N_f := N$, $A_f := A$, $x_f := x$, $b_f := b$ and $\Omega_f := \{1, \dots, N_f\}$. Next, we introduce the smoother $S_f \in \mathbb{R}^{N_f \times N_f}$.
Starting from an approximation $P_f \approx A_f^{-1}$ to the inverse of the system matrix, it is defined as $S_f := (I_f - P_f A_f)$, where $I_f$ is the identity matrix. One step of the smoother generates from $x_f$ a smoothed vector $\bar{x}_f$ with

$\bar{x}_f := S_f x_f + (I_f - S_f) A_f^{-1} b_f.$    (2)

The smoother has a structure which avoids inverting the system matrix. It shall damp the highly oscillatory modes in the error $e_f = x_f - A_f^{-1} b_f$. Examples for smoothers, i.e. concrete choices for $P_f$, will be given in Section 2.5.

Having damped highly oscillatory error modes on the fine level by the pre-smoother, the idea of the two-level method is to construct a coarse level or coarse grid on which the remaining low error modes are damped. The coarse level is constructed by introducing a splitting $\Omega_f = C \cup F$, the C/F splitting, into coarse and fine grid variables. Coarse grid variables will be kept on the next level, while the information of all variables will be transferred between coarse and fine grid by transfer operators. The linear system that is solved on the coarse grid is given as

$A_c x_c = b_c \iff \sum_{j \in \Omega_c} a^c_{ij} x^c_j = b^c_i \quad \forall i \in \Omega_c,$    (3)

with the obvious choice of $\Omega_c := C$ such that $A_c \in \mathbb{R}^{N_c \times N_c}$, $x_c, b_c \in \mathbb{R}^{N_c}$, $N_c := |C|$. The above equation is coupled to the fine grid problem by restricting the residual $r_f = b_f - A_f \bar{x}_f$ on the fine level to the coarse grid and by using it as right-hand side. To transfer solutions from fine to coarse level, a restriction operator $I_f^c \in \mathbb{R}^{N_c \times N_f}$ is introduced. The opposite mapping is done by an interpolation or prolongation operator $I_c^f \in \mathbb{R}^{N_f \times N_c}$. Together, they define the coarse grid system matrix $A_c := I_f^c A_f I_c^f$, which we call the Galerkin operator. We usually set the interpolation matrix as the transposed restriction matrix, $I_c^f := (I_f^c)^\top$. In a pure two-level method, the coarse-level problem is solved directly. Its solution is transferred back to the fine grid by the coarse grid correction $x_f = \bar{x}_f + I_c^f x_c$ and another post-smoothing step is applied. Iterating this approach leads to an iterative linear two-level solver. In AMG, we call the construction of the C/F splitting, the construction of the prolongation and restriction, the construction of the Galerkin operator and the construction of the smoother the setup phase. The remaining part is the solve phase. It will turn out that the setup phase is actually the algorithmically more challenging part.
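The Galerkin construction described above can be illustrated with a minimal pure-Python sketch on a 1D Poisson model problem. All names (A_f, P, R, A_c) and the choice of linear interpolation from every second grid point are illustrative assumptions, not code from the paper.

```python
# Two-level Galerkin operator A_c = R * A_f * P on a tiny 1D Poisson problem.
# Hypothetical sketch: linear interpolation from the coarse points {1, 3, 5}.

def matmul(X, Y):
    """Dense matrix-matrix product for lists of lists."""
    n, k, m = len(X), len(Y), len(Y[0])
    return [[sum(X[i][l] * Y[l][j] for l in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Fine-level 1D Poisson matrix (N_f = 7, stencil [-1, 2, -1]).
N_f = 7
A_f = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
        for j in range(N_f)] for i in range(N_f)]

# Interpolation from the N_c = 3 coarse points {1, 3, 5} (0-based indices).
coarse = [1, 3, 5]
P = [[0.0] * len(coarse) for _ in range(N_f)]
for c, i in enumerate(coarse):
    P[i][c] = 1.0                       # a coarse point keeps its value
for i in range(N_f):
    if i not in coarse:                 # a fine point averages coarse neighbours
        for c, j in enumerate(coarse):
            if abs(i - j) == 1:
                P[i][c] = 0.5

R = transpose(P)                        # restriction = transposed interpolation
A_c = matmul(R, matmul(A_f, P))         # Galerkin operator A_c = R A_f P
```

For this choice, A_c is again a symmetric tridiagonal Poisson-like matrix (diagonal 1, off-diagonal -0.5), which mirrors the statement that the Galerkin operator reproduces a coarse version of the fine-level problem.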
2.2 Multigrid method

By carefully choosing all numerical ingredients it is possible to have linear complexity in the degrees of freedom for all operations on the fine grid during one iteration of the solve phase, cf. [31, Appendix A] for more details. However, directly solving the coarse problem (3) still has a computational complexity of $O(N_c^3)$, leading to a non-optimal method. To overcome this issue, the algorithmic idea of pre-smoothing, coarse grid correction and post-smoothing can be extended such that the direct solve on the coarse grid is replaced by a recursive application of pre-smoothing, coarse grid correction and post-smoothing on the coarse level. Extending this to several levels leads to the so-called V-cycle in a multigrid method. Algorithm 1 summarizes the V-cycle. Here, we have introduced levels $\ell = 0, 1, \dots, \ell_{max}$ with nested sets of variables $\Omega_0 \subset \Omega_1 \subset \dots \subset \Omega_{\ell_{max}}$. Now, level $\ell_{max}$ is the finest level $f$. Analogously we introduce per-level coarse/fine grid sets $C_\ell / F_\ell$, matrices $A_\ell$, vectors $x_\ell, b_\ell$, smoothers $S_\ell$, interpolation operators $I_{\ell-1}^{\ell}$ and restriction operators $I_\ell^{\ell-1}$. In lines 3/4 and 9/10, the smoother is applied $\nu$ times. The construction of coarser levels is usually stopped as soon as the number of variables per level goes below a certain threshold and the coarsest grid is solved directly. Algorithm 2 introduces AMG as an iterative solver. Lines 2-7 correspond to the setup phase. Lines 8-12 describe a simple iterative solver where the function cycle

Algorithm 1 V-cycle in AMG
Require: complete multigrid hierarchy already set up
1: function cycle(ℓ, x_ℓ, b_ℓ)
2:   if ℓ > 0 then                                ▷ finer levels
3:     for s = 1, ..., ν do
4:       x_ℓ = S_ℓ(x_ℓ, b_ℓ)                      ▷ pre-smoothing
5:     b_{ℓ-1} = I_ℓ^{ℓ-1}(b_ℓ - A_ℓ x_ℓ)         ▷ residual restriction
6:     x_{ℓ-1} = 0
7:     cycle(ℓ-1, x_{ℓ-1}, b_{ℓ-1})               ▷ solve on coarse grid
8:     x_ℓ = x_ℓ + I_{ℓ-1}^{ℓ} x_{ℓ-1}            ▷ update solution
9:     for s = 1, ..., ν do
10:      x_ℓ = S_ℓ(x_ℓ, b_ℓ)                      ▷ post-smoothing
11:    return x_ℓ
12:  else                                         ▷ coarsest level
13:    x_ℓ = A_ℓ^{-1} b_ℓ                         ▷ direct solve
14:    return x_ℓ

Algorithm 2 Iterative algebraic multigrid solver
Require: A ∈ R^{N×N} symmetric M-matrix
1: function AMGsolver(A, b, res)
2:   A_{ℓmax} := A, Ω_{ℓmax} := {1, ..., N}
3:   for ℓ = ℓmax, ..., 1 do                      ▷ multigrid hierarchy setup
4:     C_ℓ ∪ F_ℓ = Ω_ℓ, Ω_{ℓ-1} := C_ℓ            ▷ C/F splitting
5:     construct I_{ℓ-1}^{ℓ}, I_ℓ^{ℓ-1} := (I_{ℓ-1}^{ℓ})^T  ▷ transfer operators
6:     A_{ℓ-1} := I_ℓ^{ℓ-1} A_ℓ I_{ℓ-1}^{ℓ}       ▷ Galerkin operator
7:     construct S_ℓ                              ▷ smoother
8:   x^0 := 0, n := 0
9:   repeat                                       ▷ iterative solution
10:    x^{n+1} = cycle(ℓmax, x^n, b)              ▷ multigrid cycle
11:    r^{n+1} = b - A x^{n+1}
12:  until ||r^{n+1}||_2 ≤ res
13:  return x^{n+1}

corresponds to the launch of the V-cycle in Algorithm 1. Instead of applying the V-cycle in an iterative scheme, it could also be launched as a preconditioner in a Krylov subspace solver, e.g. a conjugate gradient (CG) method. This second approach is often preferred due to faster convergence.

2.3 C/F splitting

The core idea of the algebraic multigrid method is to introduce the dual view of the linear system as a graph. Thereby, the C/F splitting or coarsening can be done on a graph instead of a geometry. The mapping between matrix and graph is done by considering the system matrix as adjacency matrix for a graph in which the variables are the nodes or points. Hence, weighted connections or edges between nodes exist if there is a non-zero non-diagonal entry in the system matrix A relating one variable to the other.

2.3.1 Sequential coarsening

The crucial sequential part of the Ruge-Stüben AMG method is the choice of coarse and fine grid points (C-/F-points), i.e. the C/F splitting. It is based on the above discussed graph representation.
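The control flow of Algorithm 1 can be rendered as a short recursive Python function. This is an illustrative sketch only: the hierarchy layout and the helper names (smooth, restrict, prolong, direct_solve) are assumptions, and the demo below wires up a deliberately tiny two-level 1D Poisson problem rather than anything from the paper.

```python
# Hypothetical rendition of the V-cycle control flow in Algorithm 1.
def vcycle(level, x, b, hierarchy, nu=2):
    """One V-cycle; `hierarchy` is a list of per-level operator dicts,
    index 0 being the coarsest level."""
    H = hierarchy[level]
    if level == 0:                                  # coarsest level: direct solve
        return H["direct_solve"](b)
    for _ in range(nu):                             # pre-smoothing
        x = H["smooth"](x, b)
    r = [bi - ai for bi, ai in zip(b, H["matvec"](x))]
    b_c = H["restrict"](r)                          # residual restriction
    x_c = vcycle(level - 1, [0.0] * len(b_c), b_c, hierarchy, nu)
    e = H["prolong"](x_c)                           # coarse grid correction
    x = [xi + ei for xi, ei in zip(x, e)]
    for _ in range(nu):                             # post-smoothing
        x = H["smooth"](x, b)
    return x

def two_level_demo(cycles=30, omega=0.8):
    """Tiny two-level demo: 1D Poisson, 3 fine points, 1 coarse point.
    Returns the max-norm of the residual after `cycles` V-cycles."""
    A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]
    def smooth(x, b):                               # relaxed Jacobi sweep
        r = [bi - ai for bi, ai in zip(b, matvec(x))]
        return [x[i] + omega * r[i] / A[i][i] for i in range(3)]
    p = [0.5, 1.0, 0.5]                             # interpolation column
    a_c = sum(p[i] * sum(A[i][j] * p[j] for j in range(3)) for i in range(3))
    hierarchy = [
        {"direct_solve": lambda bc: [bc[0] / a_c]},
        {"matvec": matvec, "smooth": smooth,
         "restrict": lambda r: [sum(pi * ri for pi, ri in zip(p, r))],
         "prolong": lambda xc: [pi * xc[0] for pi in p]},
    ]
    x, b = [0.0] * 3, [1.0, 1.0, 1.0]
    for _ in range(cycles):
        x = vcycle(1, x, b, hierarchy)
    return max(abs(bi - ai) for bi, ai in zip(b, matvec(x)))
```

Running two_level_demo() drives the residual down by roughly a constant factor per cycle, which is the level-independent contraction behavior multigrid is built to deliver.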

Algorithm 3 Standard coarsening algorithm
Require: level ℓ
1: function AMGstandardCoarsening
2:   F := ∅, C := ∅, U := Ω_ℓ
3:   for i ∈ U do
4:     λ_i := |S_i^T ∩ U| + 2|S_i^T ∩ F|
5:   while ∃ i s.th. λ_i ≠ 0 do
6:     find i_max := argmax_i λ_i
7:     C := C ∪ {i_max}
8:     U := U \ {i_max}
9:     for j ∈ (S_{i_max}^T ∩ U) do
10:      F := F ∪ {j}
11:      U := U \ {j}
12:    for i ∈ U do
13:      λ_i := |S_i^T ∩ U| + 2|S_i^T ∩ F|
14:  return C, F

The neighborhood of a point/variable i is given by

$N_i := \{ j \mid j \neq i,\ a_{ij} \neq 0 \}.$

We say that a variable i is strongly negatively coupled to a variable j if we have, for a fixed $0 < \epsilon_{str} < 1$, the relation

$-a_{ij} \geq \epsilon_{str} \max_{a_{ik} < 0} |a_{ik}|.$

All strong negative couplings of a variable i can be denoted by the set

$S_i := \{ j \in N_i \mid i \text{ strongly negatively coupled to } j \}.$

We also need the set of all variables j which are strongly coupled to i. It is

$S_i^T := \{ j \mid i \in S_j \}.$

The rough idea of the C/F splitting is to create a rather uniform distribution of coarse and fine grid variables with fine variables being surrounded by coarse variables (in the graph). Good convergence is achieved if fine grid points are strongly coupled to coarse grid points [31, Section A.7.1.1]. A first algorithmic idea [26, 31] to achieve this is as follows. One repeats the choice of coarse grid points until all points are classified as coarse or fine grid points. Whenever a coarse grid point has been chosen, all neighboring points which are strongly coupled to the coarse point are defined as fine grid points. In real applications, this algorithmic idea is replaced by the standard coarsening algorithm [31, Section A.7.1.1], which is stated in Algorithm 3. This method also introduces sets of undecided variables U and importance measures λ_i. It results in a rather evenly distributed set of coarse grid points due to the indirectly imposed ordering by the weights λ_i.
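The greedy structure of Algorithm 3 can be sketched compactly in Python. This is a simplified, hypothetical rendition: `S` maps each node to its strong couplings S_i, `St` is the transposed relation S_i^T, and the convention of moving leftover undecided nodes to F at the end is an assumption of this sketch.

```python
# Hypothetical sketch of the standard coarsening (Algorithm 3).
def standard_coarsening(nodes, S):
    """Return a C/F splitting of `nodes` from strong couplings S[i]."""
    St = {i: set() for i in nodes}          # S_i^T: who strongly depends on i
    for i in nodes:
        for j in S[i]:
            St[j].add(i)
    C, F, U = set(), set(), set(nodes)
    lam = {i: len(St[i] & U) + 2 * len(St[i] & F) for i in U}
    while any(lam.get(i, 0) != 0 for i in U):
        i_max = max(U, key=lambda i: lam[i])    # most "important" node
        C.add(i_max)
        U.discard(i_max)
        for j in St[i_max] & U:                 # its strong neighbours -> fine
            F.add(j)
            U.discard(j)
        # recompute the importance measures for the remaining undecided nodes
        lam = {i: len(St[i] & U) + 2 * len(St[i] & F) for i in U}
    F |= U          # assumption: isolated leftovers become F-points
    return C, F
```

On a path graph (the 1D Poisson connectivity) this produces the expected evenly spread coarse points with every fine point strongly coupled to a coarse neighbour.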

2.3.2 Parallel coarsening

As sketched in Section 1, the standard idea to parallelize the setup phase, and thus also the coarsening, on multi-core processors is to apply a domain-decomposition idea. That is, the system matrix / graph is divided into subparts on which the standard sequential coarsening is applied. Afterwards, the resulting splittings are connected by some boundary treatment, cf. [35, 13]. However, it is well-known that the quality of the splitting gets affected if the ratio between internal points and points treated by the boundary correction becomes smaller. Applying multi-core coarsening on many-core processors is somewhat the extreme case of this situation. To be able to have a high utilization of the compute resources of a many-core processor, a large amount of threads has to be executed in parallel. Thereby, almost all points become boundary points. Consequently, it is not advisable to apply multi-core coarsening techniques on many-core hardware. One exception to this rule is the parallel maximal independent set (PMIS) [35] coarsening. This algorithm can be parallelized on many-core processors without a degradation of the generated C/F splitting. However, PMIS, parallelized or not, often does not produce optimal splittings. To keep a relatively robust AMG implementation, PMIS usually is combined with strong, thus computationally expensive, interpolation [8] methods. Interpolation will be discussed in the next section.

2.4 Interpolation

Interpolation defines the transfer between information on different levels in AMG. As previously stated, the interpolation operator $I_{\ell-1}^{\ell}$ transfers solutions on a coarse level to the next finer level. Since we use the transpose of the interpolation operator matrix as restriction matrix, it suffices to describe the interpolation operation and its construction.
We start by introducing some additional notation with

$C_i := C \cap N_i, \quad F_i := F \cap N_i, \quad C_i^s := C \cap S_i, \quad F_i^s := F \cap S_i.$

2.4.1 Direct interpolation

Direct interpolation uses the strongly coupled coarse grid points to interpolate to a given fine grid point. Thus, for each $i \in F$ we use the set of so-called interpolatory variables $P_i = C_i^s$ and interpolate a given fine grid variable $e_i$ by

$e_i = \sum_{k \in P_i} w_{ik} e_k, \qquad w_{ik} = -\alpha_i \frac{a_{ik}}{a_{ii}}, \qquad \alpha_i = \frac{\sum_{j \in N_i} a_{ij}}{\sum_{k \in P_i} a_{ik}}.$

We still assume here that the system matrix is an M-matrix. Therefore, we can neglect the handling of positive non-diagonal entries.

2.4.2 Standard interpolation

A much better convergence can be achieved by standard interpolation. It not only considers strongly connected coarse grid nodes but also includes strong connections between fine grid nodes. In order to describe this process, we have a look at a fine
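The direct interpolation formulas above can be checked on a single matrix row. The following sketch is illustrative (the function name and the dict-based row format are assumptions); it computes the weights w_ik = -alpha_i a_ik / a_ii for one fine-grid point.

```python
# Hypothetical sketch of direct interpolation weights for one fine-grid row.
def direct_interpolation_weights(row, i, P_i):
    """row: dict j -> a_ij for one matrix row; i: the fine point;
    P_i: the interpolatory coarse points C_i^s."""
    a_ii = row[i]
    neigh_sum = sum(a for j, a in row.items() if j != i)   # sum over N_i
    interp_sum = sum(row[k] for k in P_i)                  # sum over P_i
    alpha = neigh_sum / interp_sum
    return {k: -alpha * row[k] / a_ii for k in P_i}

# 1D Poisson row [-1, 2, -1] at fine point i = 1 with coarse neighbours 0 and 2:
w = direct_interpolation_weights({0: -1.0, 1: 2.0, 2: -1.0}, 1, [0, 2])
```

For this row both weights come out as 0.5 and sum to one, i.e. a fine point between two coarse points is interpolated by simple averaging, as expected for the constant-preserving interpolation of a zero-row-sum M-matrix row.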

grid point $i \in F$. The application of its corresponding matrix row to a vector $e$ reads as

$a_{ii} e_i + \sum_{j \in N_i} a_{ij} e_j.$

To apply standard interpolation, we introduce a modified system matrix. There, we replace in those rows that are associated with a fine grid point $i \in F$, as above, the variables $e_j$ with $j \in F_i^s$, thus the strongly coupled fine grid points, as

$e_j \to -\sum_{k \in N_j} a_{jk} e_k / a_{jj}.$

The newly generated matrix $\hat{A}$ has entries $\hat{a}_{ij}$ and possesses new neighborhood sets $\hat{N}_i$. By further setting for all $i \in F$ that

$\hat{P}_i = C_i^s \cup \Bigl( \bigcup_{j \in F_i^s} C_j^s \Bigr),$

we can define standard interpolation analogously to direct interpolation as

$e_i = \sum_{k \in \hat{P}_i} \hat{w}_{ik} e_k, \qquad \hat{w}_{ik} = -\hat{\alpha}_i \frac{\hat{a}_{ik}}{\hat{a}_{ii}}, \qquad \hat{\alpha}_i = \frac{\sum_{j \in \hat{N}_i} \hat{a}_{ij}}{\sum_{k \in \hat{P}_i} \hat{a}_{ik}}.$

We thus extend direct interpolation to include the neighborhood of the strongly connected fine grid points.

2.4.3 Truncation

The more connections between variables are taken into account when constructing the interpolation operator, the better the interpolation. However, too many connections lead to rather densely populated interpolation operator matrices and thus to denser system matrices on coarser levels. This in turn might heavily affect the overall complexity of the multigrid method. Therefore, it is often necessary to introduce a truncation of the interpolation. One thus drops all entries related to connections with a strength smaller than the largest entry scaled by a threshold $\epsilon_{tr}$. The resulting entries finally have to be rescaled accordingly.

2.5 Standard smoothers

We already introduced smoothers $S_\ell \in \mathbb{R}^{N_\ell \times N_\ell}$ to be an important numerical ingredient on a given level $\ell$ of the algebraic multigrid method. For $P_\ell := D_\ell^{-1}$, $D_\ell = \operatorname{diag}(A_\ell)$ we get the Jacobi smoother $S_\ell^J := I_\ell - D_\ell^{-1} A_\ell$. For a given point $i$ its application reads as

$x_i = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij} x_j \Bigr).$

Due to its trivial nature it can be very easily applied in parallel, even on many-core hardware. Furthermore, we can introduce a relaxed Jacobi iteration with a parameter $\omega \in \mathbb{R}$ and $P_\ell := \omega D_\ell^{-1}$, thus $S_\ell^{J,\omega} := I_\ell - \omega D_\ell^{-1} A_\ell$, resulting in a similar iteration rule.
By a proper choice of the relaxation parameter, the smoothing can be improved a lot.
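The smoothing property of relaxed Jacobi can be demonstrated with a small textbook experiment (this is an illustrative sketch, not code from the paper): on a 1D Poisson matrix, a highly oscillatory error mode is damped far faster than a smooth one.

```python
# Relaxed Jacobi damps oscillatory error much faster than smooth error.
import math

def jacobi_sweep(x, b, n, omega=0.8):
    """One relaxed Jacobi sweep for the 1D Poisson stencil [-1, 2, -1]
    with homogeneous Dirichlet boundaries."""
    ax = [2.0 * x[i]
          - (x[i - 1] if i > 0 else 0.0)
          - (x[i + 1] if i < n - 1 else 0.0)
          for i in range(n)]
    return [x[i] + omega * (b[i] - ax[i]) / 2.0 for i in range(n)]

n = 63
# With b = 0 the iterate itself is the error; start from two sine modes.
smooth_err = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]        # mode 1
rough_err = [math.sin(math.pi * n * (i + 1) / (n + 1)) for i in range(n)]     # mode n
b = [0.0] * n
for _ in range(10):
    smooth_err = jacobi_sweep(smooth_err, b, n)
    rough_err = jacobi_sweep(rough_err, b, n)
```

After ten sweeps the oscillatory mode has almost vanished while the smooth mode is barely reduced; this is exactly why the smoother is paired with a coarse grid correction that handles the smooth modes.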

A stronger smoother than Jacobi is the Gauss-Seidel smoother with $P_\ell := L_\ell^{-1}$ and $L_\ell$ the lower triangular part of $A_\ell$ including all diagonal entries. It results in the smoother $S_\ell^{GS} = I_\ell - L_\ell^{-1} A_\ell$, which has the iteration rule

$x_i = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j - \sum_{j=i+1}^{N} a_{ij} x_j \Bigr).$    (4)

By iteration rule (4) we can already see that this smoother is purely sequential. A way to overcome this is to introduce a coloring [2, Section 3.6.3] for the variables of the linear system, decoupling subsets of the variables which can then be treated in a parallel way. For more details on coloring for iterative linear solvers see e.g. [29, Section 12.4]. Note however that colored Gauss-Seidel versions are not considered here.

3 Convergence and robustness in existing many-core AMG

In this section, we review and analyse the convergence and robustness impact of typical combinations of methods and parameters in many-core AMG. In addition, we propose a set of parameter and method choices which still cover some of the many-core performance considerations while achieving better convergence and robustness results. The analysis will be based on three model problems, i.e. discretizations of a Poisson problem in two and three dimensions as well as discretizations of an elliptic problem in two dimensions with an anisotropic Laplace operator. As discussed in the previous section, the choice of coarsening, interpolation and smoother can strongly impact the quality of the numerical results. Choosing the best combination for a given application is a challenge by its own. Clearly, reviewing the current state of the art in this field for arbitrary applications and all available techniques is out of scope. However, we here aim at discussing choices that have been made recently in the context of GPU-based AMG. In this context, we focus on the two commercial GPU AMG implementations AmgX and GAMPACK. Since both implementations use similar techniques, we have chosen to discuss results with respect to AmgX, more specifically results based on [23].
In [23], the authors introduce AmgX and compare it to the AMG2013 benchmark [1]. The latter is a stripped-down version of BoomerAMG in hypre 2.9.0b. AMG2013 was introduced to perform cluster scalability studies with complex workloads. The benchmark uses a pre-designed application setup and a fixed set of coarsening, interpolation and smoother choices. While these choices seemingly are favorable for a benchmark study, they do not necessarily represent the best choices for convergence and robustness. In fact, BoomerAMG might converge much better for other parameter choices. This should be kept in mind when assessing the impact of our and other results. Moreover, note that, at the time of writing this article, the current version of hypre is no longer 2.9.0b. A comparison of a recent hypre version against AmgX has been made in [25], where performance improvements for hypre were reported.

3.1 Benchmark applications

We report on three benchmarks which can be performed with the AMG2013 benchmark application. The benchmarks solve linear systems from discretized elliptic partial differential equations. The first benchmark is a Poisson problem in three dimensions with homogeneous Dirichlet boundary conditions,

$-\Delta u = f \ \text{in } D, \qquad u = 0 \ \text{on } \partial D,$    (5)

which corresponds for $D = (0,1)^3$ exactly to one important benchmark in [23]. The linear systems that we consider are obtained by discretizing the above problem by a standard second-order seven-point finite difference stencil on a uniform grid with mesh width $h := \frac{1}{N_D + 1}$, where $N_D$ is the number of (inner) grid points in each space dimension. The right-hand side is set to f. We abbreviate this benchmark by the name Poisson3D. It is well known that the condition number of the resulting linear system matrix in Poisson3D scales like $O(h^{-2})$. Hence, it only depends on the mesh width in one dimension. This also means that, for a given amount of unknowns, the condition number of the system matrix for a discretization of a two-dimensional Poisson problem becomes much larger than the condition number in the context of the three-dimensional problem. Consequently, for a given number of unknowns, considering a two-dimensional problem should be more challenging than a three-dimensional problem. Therefore, we also include the two-dimensional test case Poisson2D based on (5) with $D = (0,1)^2$ and a standard five-point stencil discretization into our considerations. Moreover, we are interested in the robustness of the linear solver with respect to anisotropies. Therefore we consider a third benchmark, Anisotropic2D, in which we discretize the elliptic PDE

$-c_1 \frac{\partial^2 u}{\partial x_1^2} - \frac{\partial^2 u}{\partial x_2^2} = f \ \text{in } D, \qquad u = 0 \ \text{on } \partial D,$

with anisotropy coefficient $c_1 \in \mathbb{R}$ on a two-dimensional domain $D = (0,1)^2$. We again use a corresponding second-order five-point finite difference stencil and the same right-hand side.
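A five-point stencil system of the Poisson2D type can be assembled in a few lines. This sketch is illustrative (function name and dict-of-dicts sparse format are assumptions of this example, not the AMG2013 code): N_D inner points per dimension, homogeneous Dirichlet boundaries, entries scaled by 1/h^2 with h = 1/(N_D + 1).

```python
# Hypothetical assembly of a Poisson2D-type matrix (five-point stencil).
def poisson2d(n_d):
    """Return the system matrix as dict-of-dicts: A[row][col] = value."""
    h2_inv = float((n_d + 1) ** 2)        # 1/h^2 with mesh width h = 1/(N_D + 1)
    def idx(i, j):
        return i * n_d + j                # lexicographic numbering of inner points
    A = {}
    for i in range(n_d):
        for j in range(n_d):
            row = {idx(i, j): 4.0 * h2_inv}
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_d and 0 <= nj < n_d:   # Dirichlet: drop boundary
                    row[idx(ni, nj)] = -h2_inv
            A[idx(i, j)] = row
    return A

A = poisson2d(4)
```

The resulting matrix is the symmetric M-matrix assumed throughout Section 2: diagonal 4/h^2, off-diagonals -1/h^2, with interior rows holding five entries and boundary-adjacent rows fewer.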
3.2 Solver test cases

In our convergence studies, we mimic the AMG2013 benchmark and the examples picked in [23]. Therefore we solve the given linear systems $Ax = b$ for an initial solution guess $x_0 = 0$ and let the solver run until the relative residual $\|r_i\| / \|b\|$ of a given iterate $x_i$ drops below $10^{-6}$. AMG is used as a preconditioner in a conjugate gradient solver. Throughout our convergence studies, we stop coarsening when the number of unknowns on the current level drops below a fixed threshold. Moreover, we apply a truncation with $\epsilon_{tr} = 0.2$. Whenever applicable, we further set the strength parameter to $\epsilon_{str} = 0.25$. We use three different sets of parameters for algebraic multigrid. These parameters are summarized in Table 1. The first set of parameters is identical to

parameter                 AMG2013 param.     AmgX param.         parameter proposal
coarsening                PMIS               PMIS                standard coarsening
aggr. coarsening (1st level)  yes            no                  no
interpolation             extended           standard            standard
smoother                  L1-Gauss-Seidel    Jacobi (ω = 0.8)    Jacobi (ω = 0.8)

Table 1 We compare convergence and robustness properties of three sets of parameters for AMG (AMG2013 benchmark, AmgX parameters in [23], our own parameter set proposal).

the settings considered in the AMG2013 benchmark: The coarsening is done using PMIS. Aggressive coarsening, cf. [36], with appropriate interpolation is used on the first level. On all other levels, extended interpolation, cf. [8], is applied. The pre- and post-smoother is always an L1-Gauss-Seidel smoother. Our second set of parameters shall reflect the settings chosen in [23] for AmgX. Unfortunately, [23] does not give all the details about their applied parameters. Therefore, we tried our best to reverse-engineer these parameters using the same software, hardware (Titan cluster), a similar build environment, etc. The parameters reported in Table 1 seem to fit best the results in [23]. That is, we use PMIS as coarsening without aggressive coarsening on the first level. Moreover, we apply standard interpolation and use a (twice applied) relaxed Jacobi smoother with relaxation parameter ω = 0.8. The final set of parameters is our own proposal of parameters, which is closer to the classical way to build AMG. That is, we choose standard coarsening, no aggressive coarsening on the first level, standard interpolation and a relaxed Jacobi smoother with relaxation parameter ω = 0.8. As we will see in our results, the convergence properties for this approach clearly outperform the AmgX parameters and are at least as good as the AMG2013 parameters. We will argue that these parameters show good convergence properties and robustness with respect to anisotropies. Meanwhile, the Jacobi smoother still allows for a very high parallelism in the solve phase. The crucial difference to AMG2013 will be the choice of a mainly sequential AMG setup.
However, as we will see, there is still a lot of room for a many-core parallelization of major pieces of the setup phase while keeping the only truly sequential part of the standard coarsening on CPU. Note that we perform all convergence studies in this chapter with hand-modified versions of the AMG2013 benchmark. Hence, we use a reliable, well tested CPU-based code to do the comparison of the different choices of parameter sets. We even use the same hardware as in [23], i.e. we run our studies on the compute nodes of the cluster Titan at Oak Ridge National Lab. The Titan cluster is a Cray XK7 system. Each node is equipped with a 16-core 2.2 GHz AMD Opteron 6274 processor and 32 GB of RAM. We used the GCC compiler and MPICH and compiled AMG2013 and the hand-modified versions with compiler flag -O3. Within our study, similar to [23], we also compare the convergence properties between benchmarks running sequentially on one node and benchmarks being distributed to 8 cores by the MPI-parallelization within AMG2013 or hypre, respectively. As we will see, the use of a parallelized setup phase will have an impact on the setup and solve phase.

[Fig. 1: plots of iteration counts vs. N_D for Poisson3D (left, sequential) and Poisson2D / Anisotropic2D with c_1 = 1 (right, sequential), comparing the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 1 In some cases, we observe significant differences in the convergence behavior for the three solver parameter settings in the Poisson3D application problem (left) and the Poisson2D application problem (right).

3.3 Numerical results

In the following, we discuss the convergence results that we obtain with the above given parameters. We always report the number of iterations of the solver with respect to the parameter N_D, i.e. the discretization parameter used above for defining the finite difference discretization. This parameter is not identical to the number of unknowns, which is $N_D^3$ and $N_D^2$ in Poisson3D and Poisson2D/Anisotropic2D, respectively. On the left-hand side of Fig. 1, we show the results for the Poisson3D benchmark running sequentially for the different parameter sets defined before. This benchmark is shown in [23]. We observe a growing number of iterations for growing problem size. Hence, all benchmarks seem not to be in an asymptotic regime, where we would expect to see a constant or at least very slowly growing number of iterations. The general tendency is that our parameters outperform the parameters of AMG2013. The worst results are seen for the AmgX parameters. In contrast, the results in Fig. 1, on the right-hand side, are in the asymptotic regime. For the AMG2013 parameters and our parameter proposal, we observe almost constant iteration counts when going to finer discretizations with an asymptotically high condition number. In contrast, the AmgX parameters do not result in the AMG-type optimal convergence. The more expensive extended interpolation in connection with the Gauss-Seidel-type smoother seems to recover the optimal convergence of AMG in presence of the less strong PMIS coarsening for AMG2013.
The results on the right-hand side of Fig. 1 together with the two results in Fig. 2 form the robustness study Anisotropic2D for growing anisotropies c1 = 1, 10, 100. The tendency of the results fits together with the results obtained in the previous benchmarks. The parameters from AMG2013 and our parameter proposal show similar convergence results. Both parameter sets are almost perfectly

[Figure 2: panels "Anisotropic2D, c1 = 10, sequential" (left) and "Anisotropic2D, c1 = 100, sequential" (right); #iterations over N_D for the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 2 The (potential) parameters of AmgX in [23] lead to a strong dependence on the anisotropy parameter, comparing c1 = 1 (Fig. 1, right), c1 = 10 (left) and c1 = 100 (right). That is, AMG is not robust with respect to anisotropies in this case.

[Figure 3: panels "Poisson3D, sequential" (left) and "Poisson3D, parallel" (right); #iterations over N_D for the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 3 When running AMG2013 / hypre in parallel, the coarsening in our parameters gets degraded, leading to a worse solver performance.

robust with respect to the anisotropy. On the other hand, the AmgX parameters are not robust with respect to anisotropy. Finally, we also compare, similar to [23], convergence results for the sequential use of AMG2013 / hypre (as before) with results obtained when distributing the work over eight MPI processes. Fig. 3 highlights these results, comparing the sequential results for Poisson3D on the left with the parallel results for Poisson3D on the right. It becomes obvious that the PMIS-based parameter sets converge almost identically in the sequential and in the parallel case. In contrast, the parallelization of the standard coarsening requires corrections at the interface of the decomposed domains. Thereby performance results degrade in the parallel case, still achieving similar convergence as with the AMG2013 parameters.

3.4 Discussion

As expected, our results show that the best possible choice of parameters is not perfectly clear, even in our minimalistic benchmark. The standard parameters in AMG2013 always produce almost problem-size independent convergence, are robust with respect to anisotropies and seem to be robust with respect to an increase in the number of processors for the parallel setup phase. The good parallel properties might be the reason why the authors used these parameters for the parallel benchmark AMG2013, even though the classical parameters in BoomerAMG/hypre are much more similar to our proposed parameter set. However, the AMG2013 parameters use the sequential Gauss-Seidel-type smoother and the rather expensive extended interpolation. The AmgX parameters, keeping in mind that we don't know the exact parameters and just reverse-engineered them, seem to be not robust to anisotropies and seem to show a growing number of iterations for larger problem sizes. However, since a Jacobi smoother is used, the solve phase will run very efficiently on many-core hardware. The setup phase will be efficient due to PMIS. This is also reported in [23], where the authors show impressive performance results comparing AmgX to AMG2013. That is, with the parameters of AmgX used in [23], the convergence and robustness properties of the method are lowered at the gain of a pre-asymptotically fast solution behavior. This is a typical balancing done for numerical methods for many-core processors. Our proposed parameters shall introduce a third flavor. That is, we aim for the best (asymptotic) convergence rates, still seeking a method that will have a very parallel and efficient solve phase. However, achieving a good many-core performance in the setup phase, when using standard coarsening and standard interpolation, is challenging. In fact, parallelizing some part of the coarsening is impossible.
However, as we will show, it is still possible to parallelize a major part of the setup phase even if the coarsening is done on the CPU.

4 Hybrid many-core parallelization strategies

In this section, we discuss algorithms and parallelization strategies for AMG on many-core hardware. These algorithms are generic in terms of the target many-core hardware. That is, their efficient implementation shall not be restricted to GPUs but should also be possible for, e.g., Xeon Phi processors. To this end, the applied amount of hardware-specific functionalities is reduced to a minimum. We do this with an educative objective: We want to lower the bar for beginners in many-core computing to be able to write scientific codes and to focus more on the core idea in many-core parallelization, i.e. exposing a high degree of parallelism. We first give an overview of the details of our general strategy for a many-core parallelization. Thereafter, we explain the general methodologies of sparse matrix construction and local graph traversal. This general consideration allows to simplify the discussion of parallelization details for the setup phase with the (hybrid) C/F splitting and the interpolation operator construction, afterwards. Moreover, we address the parallelization of the V-cycle and the smoothers. Finally, details of the concrete implementation on GPU are discussed.

4.1 General many-core parallelization strategy

Our general base strategy for many-core parallelization is to move a big portion of the parallelization complexity to libraries. In the context of the given application (AMG) this means that we use libraries implementing a standard set of routines for (sparse) linear algebra (sparse matrix formats, matrix-matrix products, matrix-vector products, scalar products, ...). We claim that such libraries exist, with the (incomplete) list of examples containing CUSP, CUSPARSE, LAMA, ViennaCL, PARALUTION, GHOST, MAGMA, MKL. Similarly, we use libraries which implement STL vector algorithm type methods in parallel, such as sum, exclusiveScan or maximum. Again, several libraries allow this, including but not being limited to Thrust, ArrayFire and Boost.Compute. Only if we are lacking appropriate library support, we actually propose to implement new many-core parallel functions. We call such functions kernels. In addition to the function parameters, we pass the amount of parallel threads on which the kernel executes its code. Within the kernel code, the thread index is fetched first. Afterwards, computations are done relative to that thread index, cf. Algorithm 5 for an example. Writes into the many-core processor's memory are not considered to be consistent / thread-safe during the kernel execution. That is, while executing a kernel, two different threads shall not write into the same memory location. The only exception are atomic operations. They serialize conflicting parallel writes, if necessary. However, this might impact performance. A global synchronization over all threads is done at the end of the kernel execution. Note again that these assumptions are strong simplifications of the usually much more elaborate memory and thread execution models used in recent many-core processors. However, they make it much easier to introduce many-core algorithms.

4.2 CSR matrix construction

We store matrices in the compressed sparse row (CSR) matrix format.
It encodes a sparse matrix A ∈ R^{N×N} with N_nonzero non-zero entries by an array row_offsets of indices (of length N + 1) describing the offset to take in column_indices and values to find the entries of a given matrix row, an array column_indices of indices (of length N_nonzero) describing the column index of a given row entry, and an array values of (double precision) floating point values (of length N_nonzero) describing the respective matrix entries. Throughout the construction of the interpolation operators, we have to assemble sparse CSR matrices. This can be done rather easily in parallel, when using one parallel thread per row of the matrix that shall be filled. Algorithm 4 shows the necessary steps. First, a generic function countEntries mimics the work that would be done for generating the matrix entries. This is done in a completely parallel way. While doing this, instead of writing the data into the new matrix, the number of generated non-zero entries per row is counted and stored in the array num_entries_per_row. Then, a parallel sum and a parallel exclusive scan operation allow to compute the total number of non-zeros and the offsets for the CSR matrix data structure. Finally, another generic function fillCSRMatrix does the actual work and fills the system matrix as simulated in countEntries before.

Algorithm 4 Generic method to construct a CSR matrix in parallel
1: function buildCSRMatrix(A, N, ...)
2: allocate num_entries_per_row[N]
3: setValue_N(num_entries_per_row, 0)
4: countEntries_N(num_entries_per_row, ...) ▷ count number of entries per row
5: total_entries := sum_N(num_entries_per_row) ▷ compute number of non-zeros
6: allocate row_offsets[N+1], column_indices[total_entries], values[total_entries]
7: exclusiveScan_N(num_entries_per_row, row_offsets) ▷ compute offsets
8: fillCSRMatrix_N(row_offsets, column_indices, values, ...) ▷ fill matrix

Fig. 4 In the local graph traversal, the next neighbours of each node are visited up to a fixed depth (indicated by the different ellipse sizes). The many-core parallelization makes use of the independence of each traversal (indicated by the different line types).

Even though the above approach actually requires to compute the sparse matrix twice, it is still a very effective way to construct a CSR matrix in parallel.

4.3 Local graph traversal up to a fixed depth

Several algorithms within the AMG setup phase require to iterate over neighbours and/or neighbours of neighbours of a node in the graph representation of the linear system. This can be translated to a local graph traversal up to a depth of one/two starting from each node in the graph, cf. Fig. 4. In the following, we formulate a generic many-core parallel local traversal up to depth two for a graph that is given by an adjacency matrix stored in the CSR format. The easiest way to parallelize this embarrassingly parallel local graph traversal is by parallelizing it over the starting nodes for each of the traversal steps. Algorithm 5 formalizes this idea. It is launched in parallel with N threads and a sparse matrix A ∈ R^{N×N} given as parameter. In each thread, first the current thread index idx is retrieved. Then, each neighbour of node idx is visited by iterating over all non-zero off-diagonal elements in the idx-th matrix row. The same idea is repeated to iterate over all neighbours of a visited neighbour node.
Clearly, Algorithm 5 is not necessarily the most efficient parallelization approach for a specific many-core architecture. To give an example, an implementation of the algorithm in the CUDA programming model, running on GPUs of Nvidia, might not achieve the fastest possible performance due to thread divergence. Here, e.g. [22] gives an excellent overview of optimization techniques. However, Algorithm 5 is very generic and delivers a high degree of parallelism for a lot of many-core architectures. Still, optimized graph traversal algorithms are considered future work.

Algorithm 5 Generic kernel for local graph traversal up to depth two
1: function traverse_N(A, N, ...)
2: idx := getThreadIndex()
3: for jj ∈ [A.row_offsets[idx], A.row_offsets[idx+1] − 1] do
4: j := A.column_indices[jj] ▷ get column index of current matrix entry
5: if (idx ≠ j) AND ... then
6: ... do something with weight on edge (idx, j)
7: for kk ∈ [A.row_offsets[j], A.row_offsets[j+1] − 1] do
8: k := A.column_indices[kk]
9: if (j ≠ k) AND ... then
10: ... do something with weight of edge (j, k)

4.4 Hybrid C/F splitting

We use the C/F splitting following Algorithm 3. The coarsening algorithm is the only part of the code which is only partially parallelized on many-core hardware. Steps three and four of Algorithm 3 compute the strong influence weights λ_i for the current level. This operation can be parallelized in a many-core fashion. Here, the first necessary step is to pre-compute the maximum (negative) non-diagonal entry for each matrix row, thus max_{a_ik < 0} (−a_ik). The parallel graph traversal idea of Section 4.3 is applied for this. Using these maxima, it is possible to identify strongly coupled nodes in an algorithm for evaluating the λ_i. The weight evaluation algorithm again traverses the graph up to a depth of one. Whenever a strongly influencing node is found, the appropriate weight λ_i is increased by an atomic operation. The remaining part of Algorithm 3 is implemented sequentially, including the necessary copy operations between the sequential processor (CPU) and the many-core processor. To achieve an optimal computational complexity, the lookup of the largest weight λ_i in step 6 of the algorithm is performed with a priority queue data structure [7, Section 6.5]. Since e.g. the STL-based implementation of a priority queue does not allow a constant-complexity maximum-find and -removal operation (in fact, that implementation has a logarithmic complexity in that case), the data structure is hand-implemented.
The proposed implementation uses bucket sort [7, Section 8.4] as base algorithm and allows to achieve the designated constant-complexity removal operation with a linear-complexity data structure setup time. Overall, this keeps a linear complexity for the standard coarsening algorithm. Finally, while skipped in Algorithm 3, it is necessary to identify the mapping between the newly found coarse grid points and their position on the next coarser grid level. This can be implemented by means of a many-core parallel STL vector algorithm-type operation. Fig. 5 shows an example for the approach. A given array coarse contains per node the flag whether the node will become a coarse grid node. By applying an exclusiveScan operation to that array, an enumeration of the coarse grid nodes is generated. As long as one accesses only those entries in fine_to_coarse that correspond to a new coarse grid node, the correct mapping will be returned.

Fig. 5 An exclusive scan operation allows to compute the mapping between a coarse node on the current and the next multigrid level. Only entries with light background will be accessed.

[Figure 6: sparsity pattern of A with boolean indicator values t/f in place of the non-zero entries.]

Fig. 6 In the standard interpolation operator construction, we reuse the existing sparsity pattern of A to store the set of interpolatory points P_i = C_i via a boolean indicator (t/f).

4.5 Interpolation operator construction

Direct interpolation

The many-core parallel implementation of direct interpolation starts by pre-computing the intermediate coefficients

α_i = (Σ_{j∈N_i} a_ij) / (Σ_{k∈P_i} a_ik).

To build the two sums, a parallel local graph traversal up to a depth of one becomes necessary. The core idea for this has been discussed in Section 4.3. The weights w_ik are computed while building the interpolation matrix following the ideas in Section 4.2. The positions of the interpolation coefficients in the finally constructed interpolation matrix are determined by the lookup table created at the end of the C/F splitting as outlined in Section 4.4. Overall, the direct interpolation matrix is constructed fully in parallel.

Standard interpolation

Implementing standard interpolation for many-core processors is more involved. Here, the algorithm starts by finding the set of interpolatory variables P_i = C_i for direct interpolation. Since this is necessary for each row, a sparse matrix is constructed to store this information. We reuse the existing data structure of the sparse system matrix and replace the array of non-zeros by boolean indicator values, as shown in Fig. 6. The boolean values (t/f) store whether a given neighbour node is part of the interpolatory set or not. Filling this indicator matrix can be done by parallelization over the number of rows of the system matrix, similar to the local graph traversal in Section 4.3. Next, the original system matrix is modified such that strongly coupled fine grid points are expanded as e_j ← −Σ_{k∈N_j} a_jk e_k / a_jj.
Note that this construction is rather compute-intensive and might be replaced by a more sophisticated method in the future. It requires e.g. to implement a sparse

matrix row addition. The new set P̂_i = C_i ∪ (∪_{j∈F_i} C_j) of interpolatory variables is computed as well. Finally, the algorithm reproduces the direct interpolation implementation, now using the expanded system matrix and P̂_i. Truncation in the interpolation matrix can also be done in a many-core parallel way. This is done within the construction algorithm for interpolation matrices, to avoid building new sparse matrices. However, this algorithm is also available as a stand-alone method.

4.6 V-cycle and smoothers

Due to the (assumed to be) available linear algebra primitives, the implementation of a V-cycle on many-core hardware is straight-forward, since the V-cycle only contains standard matrix-matrix, matrix-vector and vector-vector operations, cf. Algorithm 1. Within the V-cycle of Algorithm 1, we replace the direct solver on the coarsest level by a Jacobi-preconditioned CG solver. For now, we only use a relaxed Jacobi iteration as smoother. The construction and application of the Jacobi smoother can be easily realized by the available linear algebra primitives.

4.7 Implementation on GPU

Our concrete implementation of algebraic multigrid for a single GPU is based on the sparse linear algebra library CUSP [5] in version 0.4.0 and the STL vector algorithm library Thrust, which comes with the CUDA Toolkit 7.5. The first library provides a set of linear algebra primitives (matrix-matrix products, matrix-vector products, ...), iterative solvers (CG, GMRES, ...) and preconditioners (Jacobi, ...) for sparse matrices of different sparse matrix formats, including CSR. Thrust provides operations such as sum or exclusive scan. The implemented code integrates with the CUSP framework, thus the new Ruge-Stüben classical AMG can be used as preconditioner for the standard CUSP solvers. It can furthermore be applied as a stand-alone solver. Compute kernels are implemented as CUDA kernels based on the CUDA Toolkit 7.5. Note that we do not use the latest CUSP version due to issues with the matrix-matrix product implementation.
Moreover, we did not consider using the more recent CUSPARSE library instead of CUSP, since, at the time of starting this software project, CUSPARSE was still at a very early development stage. Finally, CUDA Toolkit version 8.0, the latest version at the time of writing this article, is currently not supported on our target benchmark system.

5 Performance of the hybrid GPU AMG

In the following, we discuss convergence and performance results of our hybrid GPU AMG. First, we double-check the convergence and robustness of our code by repeating the convergence studies of Section 3 with our specific implementation. The remaining part of the section is a performance benchmark. To this end, we first introduce our benchmark setup in terms of hardware and software. Afterwards, a general performance comparison between our implementation and AMG2013 (following [23]) is given. As we will see, the solve phase is fast, however our expectations in terms of speedup for the setup phase on GPU are not met.

[Figure 7: panels "Poisson3D" (left) and "Poisson2D / Anisotropic2D, c1 = 1" (right); #iterations over N_D for direct and standard interpolation.]

Fig. 7 Our hybrid GPU implementation of AMG attains the predicted convergence behavior for our proposed parameter set when using standard interpolation. This is shown here for the Poisson3D (left) and the Poisson2D (right) test case. Direct interpolation shows slightly weaker convergence results.

Therefore, we further give a detailed performance analysis and discussion for the setup phase. Finally, we show results for a real-world test problem on a complex geometry.

5.1 Convergence and robustness check

We repeat the convergence studies performed in Section 3, i.e. Poisson3D, Poisson2D and Anisotropic2D, for the proposed hybrid GPU AMG implementation. Hence, we show results for the parameter set that was proposed in Section 3.2, but now for our own implementation. The objective of this study is to double-check that our implementation is correct and that it is not affected by any parallelization issue. As discussed in Section 3.2, we use standard coarsening, standard interpolation and a twice-applied relaxed Jacobi pre- and post-smoother with relaxation parameter ω = 0.8. Truncation is used with threshold 0.2 and the strength parameter is 0.25. The recursive construction of coarser levels is stopped as soon as a given level has less than 10 unknowns. The coarse grid solver is a Jacobi-preconditioned CG method converged to an absolute residual of 10^{-2}. The V-cycle is used as preconditioner within a CG method. In addition to these parameters, we also discuss convergence results for direct interpolation, since, as we will see, the GPU-parallel version of direct interpolation is much faster than the one for standard interpolation. Convergence results are again reported for the benchmark problems defined in Section 3.1 with the stopping criterion of the outer CG solver set to a relative residual norm of 10^{-6}. In Fig. 7 we report the convergence results for the benchmark problems Poisson3D and Poisson2D on the left-hand side and right-hand side, respectively. In both cases, the convergence when using standard interpolation matches the convergence results predicted in Section 3: The three-dimensional problem still has a very slight increase in iteration count for a grid size of up to N_D = 128, i.e. N = 128^3 unknowns. In contrast, the two-dimensional problem is already in its asymptotic convergence behavior with a constant number of iterations for growing problem size. In the two-dimensional problem, direct interpolation only slightly increases the number of iterations until convergence. This behavior is worse for the Poisson3D problem, where there is a stronger, but still acceptable, increase in the number of iterations when using direct interpolation. The impact of the weaker direct interpolation becomes clearly visible when considering the Anisotropic2D test case in Fig. 8. Direct interpolation, together with our other parameters, is not robust to anisotropies. With growing anisotropy, the number of iterations until convergence becomes much larger. In contrast, standard interpolation shows only a very slight increase in the number of iterations, i.e. it is almost perfectly robust. Again, this matches the convergence behavior predicted in Section 3.

[Figure 8: panels "Anisotropic2D, c1 = 10" (left) and "Anisotropic2D, c1 = 100" (right); #iterations over N_D for direct and standard interpolation.]

Fig. 8 Direct interpolation is not robust to the anisotropy in the Anisotropic2D benchmark. Meanwhile, our proposed parameter set applied in our hybrid GPU implementation achieves the predicted almost perfect robustness.

5.2 Performance benchmark setup

Our performance benchmark shall be comparable with the results presented in [23]. To this end, we use the same hardware and in general the same software environment as in this study. Our benchmark platform is a Cray XK7 system, namely the GPU cluster Titan at Oak Ridge National Lab. Our performance comparisons are limited to a single node of this cluster. Each node contains a 16-core 2.2 GHz AMD Opteron 6274 processor and 32 GB of RAM. The GPU is an Nvidia Tesla K20X.
We use the standard GNU compiler framework on that cluster, i.e. GCC with MPICH. The Nvidia CUDA Programming Toolkit is used in version 7.5. As stated before, we further use the library Thrust, as it comes with the CUDA Toolkit, as well as CUSP in version 0.4.0. All software is compiled with compiler flag -O3 for optimization. The GPU code is compiled for the specific compute capabilities of the Tesla K20X. That

is, we use the additional compiler flag -arch=sm_35. To comply with [23], the AMG2013 benchmark is run either sequentially or in parallel using MPI on eight cores of a single node of Titan. The parallel MPI processes are distributed evenly over the physical cores of the Opteron processor. Reported AMG2013 timings correspond to the timings reported as wall clock time by the AMG2013 benchmark. Timings reported for our hybrid AMG GPU implementation were measured with the gettimeofday command and thereby also represent wall clock times. Our GPU code fully runs at double precision. The GPU AMG implementation is not intended to be an accelerated part of an existing CPU code but a solver within a full GPU code. This is why, in the hybrid GPU case, it is expected that the system matrix, right-hand side and initial guess are stored in GPU memory before the benchmark starts. Nevertheless, all remaining transfer times between CPU and GPU memory, as well as the full CPU and GPU compute time, are included for the hybrid GPU setup/solver test runs.

5.3 Performance comparison against AMG2013

In the following, we report the performance of our hybrid GPU AMG implementation, comparing it to the performance of the AMG2013 benchmark for the benchmark applications defined in Section 3.1. It shall be noted here again that the specific set of AMG parameters used in AMG2013 might not reflect the best possible choice of parameters for a CPU-based AMG method. However, for the sake of comparable results with [23], we keep these parameters even though BoomerAMG might be much faster for other parameters (and more recent versions of hypre).

5.3.1 Setup phase

In Figures 9 and 10, we show the different runtimes for the setup phase of the Poisson3D, Poisson2D and Anisotropic2D benchmarks using AMG2013 sequentially or in parallel on CPU and using our hybrid GPU AMG implementation on GPU. It turns out that the setup phase of our GPU AMG implementation is surprisingly slow, especially in comparison to the parallel AMG2013 test case.
For direct interpolation, we often outperform the sequential AMG2013 benchmark. However, the proposed use of standard interpolation on GPU is always the slowest method. To study the reasons for this rather pessimistic performance result, we do a detailed performance analysis of the setup phase for the Poisson3D test case for N_D = 128. This study is presented and discussed in Section 5.4.

5.3.2 Solve phase

The performance results for the GPU solve phase are much better than the performance results for the setup phase. In Fig. 11, we give on the left-hand side the performance results of the solve phase for the Poisson3D test case and on the right-hand side the results for the Poisson2D test case. In the three-dimensional test case, the GPU-parallel solve phase (for standard interpolation) is by a factor of roughly 3.3 faster than the AMG2013 benchmark parallelized on eight cores. Moreover, the GPU AMG is faster by roughly a factor of 7.1 in the two-dimensional test case and the same interpolation. Note again that we use here only eight of the available 16 cores of a Titan node, as done in [23]. Also, there exists a newer version of hypre, which is a superset of the AMG2013 benchmark with more recent performance optimizations [25].

[Figure 9: runtime of the setup phase (in sec.) over N_D for Poisson3D (left) and Poisson2D / Anisotropic2D, c1 = 1 (right); curves: hybr. GPU (direct interp.), hybr. GPU (std. interp.), AMG2013 sequential, AMG2013 parallel.]

Fig. 9 We compare the runtime of the setup phase for our hybrid GPU implementation (using direct and standard interpolation) with the runtime of the setup phase of the AMG2013 benchmark (left: Poisson3D benchmark, right: Poisson2D benchmark).

[Figure 10: runtime of the setup phase (in sec.) over N_D for Anisotropic2D with c1 = 10 (left) and c1 = 100 (right); same curves as in Fig. 9.]

Fig. 10 The Anisotropic2D benchmark for c1 = 10 (left) and c1 = 100 (right) outlines a similarly low performance for the setup phase of our hybrid GPU AMG compared to the AMG2013 benchmark.

It has to be made clear at this point that our positive performance results for the solve phase could only be achieved since the chosen parameters in the setup phase lead to an iterative method with strong convergence and robustness, as discussed before. To underline this, we show the performance of the solve phase for the Anisotropic2D test case in Fig. 12. Using the weak direct interpolation, we only get a low (c1 = 10) or no speedup (c1 = 100) compared to the parallel AMG2013 benchmark. However, the robust standard interpolation leads to speedups similar to the Poisson2D test case.

[Figure 11: runtime of the solve phase (in sec.) over N_D for Poisson3D (left) and Poisson2D / Anisotropic2D, c1 = 1 (right); curves: hybrid GPU (direct interp.), hybrid GPU (std. interp.), AMG2013 parallel.]

Fig. 11 In the solve phase, the hybrid GPU AMG implementation clearly outperforms the parallel AMG2013 benchmark, both for the Poisson3D (left) and the Poisson2D (right) test case.

[Figure 12: runtime of the solve phase (in sec.) over N_D for Anisotropic2D with c1 = 10 (left) and c1 = 100 (right); same curves as in Fig. 11.]

Fig. 12 The hybrid GPU AMG solve phase also clearly outperforms the parallel AMG2013 benchmark when using standard interpolation in the Anisotropic2D test case. However, the performance advantage of the GPU gets lost if the chosen set of parameters for AMG does not lead to a robust method, as shown here for direct interpolation.

In Fig. 13, we summarize the performance results for the setup and the solve phase for the largest test cases of the Poisson3D and the Poisson2D problem. The parallel version of the AMG2013 benchmark is faster than the hybrid GPU AMG in the setup phase, while the GPU code outperforms AMG2013 in the solve phase. Very often, it is possible to reuse an existing multigrid hierarchy from one setup phase for several solve phases, e.g. if a single problem is solved for many right-hand sides. In this case, relatively speaking, the GPU-based AMG could become much faster. Meanwhile, we only show in Fig. 13 the total compute time for a single setup and a single solve phase. Here, AMG2013 outperforms our implementation.

[Figure 13: bar charts of time in seconds for setup phase, solve phase and total, comparing hybrid GPU (std. interp.), AMG2013 sequential and AMG2013 parallel; left: Poisson3D performance, N_D = 128; right: Poisson2D performance, N_D = 2048.]

Fig. 13 In summary, the parallel version of the AMG2013 benchmark is faster for the setup phase, while the hybrid GPU AMG is faster for the solve phase.

5.4 Detailed performance analysis of the setup phase

As seen in Section 5.3.1, the performance of the hybrid GPU AMG does not meet our performance expectations when comparing it to the AMG2013 benchmark on CPU. Therefore, we make an in-depth analysis of the performance of all components of the setup phase. To abbreviate the discussion, we focus on a single test case, the Poisson3D problem with N_D = 128. Moreover, we only discuss the performance of the setup phase on the finest grid level. In our study, we would like to analyse the impact of our parallelization of the setup phase. To do this, we compare our many-core parallel implementation with a sequential base CPU implementation that was initially done by the author. Note again that comparing a single-core CPU implementation to a many-core implementation is of course unfair. However, if we do not see significant speedups within this comparison, we have no chance to get a fast many-core implementation anyway. In Table 2, we report the results of our performance comparison against the reference CPU implementation, both for direct and standard interpolation. We include direct interpolation, since we want to find the reason for the strong performance differences between direct and standard interpolation.

5.4.1 C/F splitting

The first two timings in Table 2 correspond to the initial part of the C/F splitting, which is parallelized on many-core hardware.
We compare the CPU and GPU timings for computing the maximum non-diagonal entry for each matrix row and the strong influence weights λ_i, as discussed in Sections 2.3 and 4.4. Based on the speedups, we conclude that the parallelization of this part of the C/F splitting is very successful. Actually, speedups in a range of 10 would correspond to a speedup of 6-10 when comparing equally priced hardware. The next block of timings corresponds to the part of the setup phase which is done on CPU. In the many-core implementation, we first have to copy the data

Table 2 The detailed benchmark analysis for the setup phase (Poisson3D, N_D = 128, finest grid) compares runtimes of all components of the proposed hybrid GPU implementation with those of a baseline CPU implementation running on one CPU core. For direct and standard interpolation, it reports CPU time [s], GPU time [s] and the speedup factor for the components of the C/F splitting (max. neg. nondiag., strong influence, GPU to CPU transfer, CPU data structure, CPU coarsening, CPU to GPU transfer, coarse node count, fine to coarse map), of the interpolation (coeffs. for interp., direct interp. matrix, interp. point set, modified matrix, max. neg. nondiag., coeffs. for interp., direct interp. matrix) as well as for the restriction operator, the Galerkin operator and the Jacobi smoother.

back to CPU memory. Afterwards, the CPU part of the coarsening is applied and the results are copied back to the GPU. We can make several observations here. First, the CPU-based coarsening in the GPU implementation is the most expensive part of the setup phase. Here, it has to be noted that the standard single-core performance of the AMD Opteron 6274 processor is relatively low and Turbo CORE, i.e. the AMD technology to overclock a single core, seems not to be switched on on Titan. Second, our CPU coarsening implementation seems to be clearly slower than coarsening implementations in e.g. AMG2013. This is not surprising, knowing that the underlying library hypre has been developed for many years. Third, it would be worthwhile to overlap the CPU data structure construction with major parts of the GPU to CPU data transfer. However, this is not possible with CUSP in version 0.4.0, at least to the author's knowledge, since access to CUDA streams is not implemented in this version yet. In the final many-core parallel part of the C/F splitting, we compute the number of coarse nodes and get the fine to coarse node mapping as discussed in Section 4.4.
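These two steps map to standard data-parallel primitives: given a 0/1 coarse-node marker per node, a reduction yields the coarse-node count and an exclusive scan yields the fine-to-coarse index map. A NumPy stand-in for the corresponding `thrust::reduce` / `thrust::exclusive_scan` calls:

```python
import numpy as np

def coarse_count_and_map(is_coarse):
    """is_coarse: 0/1 marker per node from the C/F splitting.
    Returns the number of coarse nodes (a reduction) and, for each node,
    the coarse-grid index of that node if it is coarse (an exclusive scan)."""
    is_coarse = np.asarray(is_coarse, dtype=np.int64)
    n_coarse = int(is_coarse.sum())                    # reduction
    fine_to_coarse = np.cumsum(is_coarse) - is_coarse  # exclusive scan
    return n_coarse, fine_to_coarse
```

For example, the marker [1, 0, 1, 1, 0] yields three coarse nodes with coarse indices 0, 1 and 2 at fine positions 0, 2 and 3.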
The reduction operation to compute the number of coarse nodes is (surprisingly) slower on GPU than on CPU, but in general negligible due to its low overall performance impact. The exclusive scan operation for the node mapping is faster on GPU by a factor of nine. This is the (acceptable) speedup delivered by the Thrust library and corresponds to an almost equal performance in a hypothetical equally-priced-hardware comparison.

5.4.2 Interpolation

Next, we consider the different interpolation strategies. In direct interpolation, we first pre-compute the coefficients by the (parallel) graph traversal strategy.

This is fast on GPU with a speedup of beyond 20, i.e. even a decent speedup in a multi-core to many-core comparison. The actual construction is by a factor of 25 faster on GPU than on CPU. This is in the range of a speedup of two when comparing equally priced hardware. This is a good result. The overall time spent in the standard interpolation is much higher. Our parallelization strategy to compute the set of interpolatory points is successful, with a speedup in the range of 30. However, seemingly, our CPU baseline implementation to construct the modified system matrix, cf. Section 4.5, is too slow. Its parallelized version achieves a speedup of roughly 20, which is acceptable. However, the general approach to compute the modified system matrix has to be rethought. Afterwards, the maxima of the negative non-diagonal weights have to be updated. As in the C/F splitting, this is very fast on GPU. The computation of the coefficients for the modified system matrix is much slower than in the direct interpolation case. First, we use a slightly different version of the compute kernel here, since we also include truncation. Second, the modified system matrix has a (significantly) larger number of non-zeros per row. Thereby, each compute kernel has much more sequential work to do, presumably leading to higher thread divergence. Consequently, only a moderate speedup of above 6 is achieved. Finally, the kernel to compute direct interpolation is applied to the modified system matrix. This kernel is again fast.

5.4.3 Restriction operator, Galerkin product, smoother

The final three steps of the setup phase are the construction of the restriction operator by transposing the interpolation matrix, the triple matrix product to construct the Galerkin operator and the construction of the Jacobi smoother. Transposing on GPU is more than a factor of 6 faster than on CPU. This speedup is delivered by the CUSP library and is acceptable. The triple product is again done using CUSP and is by a factor of 4 faster.
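These three steps amount to a sparse transpose, a sparse triple product and a diagonal extraction. A scipy.sparse stand-in for the corresponding CUSP calls (the weighted-Jacobi damping factor omega = 2/3 is an assumption for illustration, not a value taken from the implementation):

```python
import numpy as np
import scipy.sparse as sp

def setup_coarse_level(A, P, omega=2.0 / 3.0):
    """Given the fine-level operator A and the interpolation P, build
    the restriction R = P^T, the Galerkin operator A_c = R A P and the
    scaled inverse diagonal used by the weighted Jacobi smoother."""
    R = P.T.tocsr()                  # restriction by transposing P
    A_c = (R @ A @ P).tocsr()        # Galerkin triple matrix product
    inv_diag = omega / A.diagonal()  # Jacobi update: x += omega * D^{-1} * r
    return R, A_c, inv_diag
```

For a 3-point 1D Laplacian with a single coarse point and linear interpolation P = (0.5, 1, 0.5)^T, this produces the expected 1x1 Galerkin operator.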
For the triple product, we would have hoped to see higher performance. The CUSP-based setup of the Jacobi smoother, in any case, is fast on GPU.

5.4.4 Summary and discussion

As it turns out, our general many-core parallelization strategies (parallel local graph traversal, matrix construction, ...) are fast. Whenever we use hand-implemented kernels, we get a decent performance improvement. The speedup obtained using the CUSP library is sometimes low. Note, however, that we do not use the latest version of CUSP, as discussed before. Nvidia has more recently introduced cuSPARSE, a commercialized version of CUSP, which might deliver higher performance. Using this library is future work. Our way of constructing the standard interpolation via the modified system matrix seems to be a major performance bottleneck. This approach has to be rethought. Finally, the major performance limitation of the hybrid GPU AMG setup phase is the sequential CPU part. We still argue that it is more important to have a robust numerical method; therefore, using the sequential C/F splitting is favorable. Nevertheless, research on truly parallel and robust versions of the C/F splitting should be undertaken. In addition, we argue that a strong GPU or many-core processor should, in the future, be coupled with a few very fast CPU cores. This would reduce the negative impact of tightly coupled hybridization approaches such as the one discussed here.

Fig. 14 The finite element discretization of the Poisson problem (6), solved on a gear geometry [2], is used as real-world application test case.

5.5 Real-world application example

We finish this section with a more realistic application example. It is a Poisson problem solved on the complex gear geometry shown in Fig. 14. The exact mathematical problem is

    -Delta u = -6                  in D,        (6)
           u = 1 + x_1^2 + 2 x_2^2 on dD,

with D a subset of R^3 the inner part of the gear geometry. The exact solution of this problem is u = 1 + x_1^2 + 2 x_2^2 on the full domain D and thereby constant with respect to the third coordinate direction. Note that it would be rather involved to solve such a complex geometry problem with a geometric multigrid method. However, AMG can solve such problems out of the box. The mesh that defines the geometry, i.e. dD, is available at [2]. An initial meshing of D is done using tetgen beta1 [3]. It generates meshes of tetrahedrons. To generate meshes at different resolutions, we use the -a switch of tetgen and impose maximum cell volumes of 2^(-2l), with l in {0, ..., 5}. Each mesh is imported into FEniCS [3], a finite element framework, where it is used in a standard linear finite element discretization of the PDE (6). FEniCS allows to assemble a system matrix and right-hand side for the discretization using the assemble_system command. An export of matrix and right-hand side is done using the linear algebra backend Eigen. The resulting system matrices and right-hand sides are finally used in our hybrid GPU AMG. Fig. 15 summarizes the convergence results for the gear application problem on the left-hand side. We use the same parameters as in the performance and convergence studies before. These lead to an almost perfect convergence for growing

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm A Comparison of a Second-Order versus a Fourth- Order Lapacian Operator in the Mutigrid Agorithm Kaushik Datta (kdatta@cs.berkeey.edu Math Project May 9, 003 Abstract In this paper, the mutigrid agorithm

More information

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae

More information

Mobile App Recommendation: Maximize the Total App Downloads

Mobile App Recommendation: Maximize the Total App Downloads Mobie App Recommendation: Maximize the Tota App Downoads Zhuohua Chen Schoo of Economics and Management Tsinghua University chenzhh3.12@sem.tsinghua.edu.cn Yinghui (Catherine) Yang Graduate Schoo of Management

More information

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY R.D. FALGOUT, T.A. MANTEUFFEL, B. O NEILL, AND J.B. SCHRODER Abstract. The need for paraeism in the time dimension is being driven

More information

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining Data Mining Cassification: Basic Concepts, Decision Trees, and Mode Evauation Lecture Notes for Chapter 4 Part III Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

arxiv: v2 [math.na] 7 Mar 2017

arxiv: v2 [math.na] 7 Mar 2017 An agebraic mutigrid method for Q 2 Q 1 mixed discretizations of the Navier-Stokes equations Andrey Prokopenko and Raymond S. Tuminaro arxiv:1607.02489v2 [math.na] 7 Mar 2017 Abstract Agebraic mutigrid

More information

A Petrel Plugin for Surface Modeling

A Petrel Plugin for Surface Modeling A Petre Pugin for Surface Modeing R. M. Hassanpour, S. H. Derakhshan and C. V. Deutsch Structure and thickness uncertainty are important components of any uncertainty study. The exact ocations of the geoogica

More information

Distance Weighted Discrimination and Second Order Cone Programming

Distance Weighted Discrimination and Second Order Cone Programming Distance Weighted Discrimination and Second Order Cone Programming Hanwen Huang, Xiaosun Lu, Yufeng Liu, J. S. Marron, Perry Haaand Apri 3, 2012 1 Introduction This vignette demonstrates the utiity and

More information

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion. Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence

More information

Space-Time Trade-offs.

Space-Time Trade-offs. Space-Time Trade-offs. Chethan Kamath 03.07.2017 1 Motivation An important question in the study of computation is how to best use the registers in a CPU. In most cases, the amount of registers avaiabe

More information

A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS

A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS BENJAMIN PEHERSTORFER, STEFAN ZIMMER, CHRISTOPH ZENGER, AND HANS-JOACHIM BUNGARTZ Preprint December 17, 2014 Abstract. Sparse grids have become an important

More information

Hiding secrete data in compressed images using histogram analysis

Hiding secrete data in compressed images using histogram analysis University of Woongong Research Onine University of Woongong in Dubai - Papers University of Woongong in Dubai 2 iding secrete data in compressed images using histogram anaysis Farhad Keissarian University

More information

Response Surface Model Updating for Nonlinear Structures

Response Surface Model Updating for Nonlinear Structures Response Surface Mode Updating for Noninear Structures Gonaz Shahidi a, Shamim Pakzad b a PhD Student, Department of Civi and Environmenta Engineering, Lehigh University, ATLSS Engineering Research Center,

More information

A Memory Grouping Method for Sharing Memory BIST Logic

A Memory Grouping Method for Sharing Memory BIST Logic A Memory Grouping Method for Sharing Memory BIST Logic Masahide Miyazai, Tomoazu Yoneda, and Hideo Fuiwara Graduate Schoo of Information Science, Nara Institute of Science and Technoogy (NAIST), 8916-5

More information

ETNA Kent State University

ETNA Kent State University Eectronic Transactions on Numerica Anaysis. Voume 6, pp. 162-181, December 1997. Copyright 1997,. ISSN 1068-9613. ETNA WAVE-RAY MULTIGRID METHOD FOR STANDING WAVE EQUATIONS A. BRANDT AND I. LIVSHITS Abstract.

More information

As Michi Henning and Steve Vinoski showed 1, calling a remote

As Michi Henning and Steve Vinoski showed 1, calling a remote Reducing CORBA Ca Latency by Caching and Prefetching Bernd Brügge and Christoph Vismeier Technische Universität München Method ca atency is a major probem in approaches based on object-oriented middeware

More information

Nearest Neighbor Learning

Nearest Neighbor Learning Nearest Neighbor Learning Cassify based on oca simiarity Ranges from simpe nearest neighbor to case-based and anaogica reasoning Use oca information near the current query instance to decide the cassification

More information

MCSE Training Guide: Windows Architecture and Memory

MCSE Training Guide: Windows Architecture and Memory MCSE Training Guide: Windows 95 -- Ch 2 -- Architecture and Memory Page 1 of 13 MCSE Training Guide: Windows 95-2 - Architecture and Memory This chapter wi hep you prepare for the exam by covering the

More information

Load Balancing by MPLS in Differentiated Services Networks

Load Balancing by MPLS in Differentiated Services Networks Load Baancing by MPLS in Differentiated Services Networks Riikka Susitaiva, Jorma Virtamo, and Samui Aato Networking Laboratory, Hesinki University of Technoogy P.O.Box 3000, FIN-02015 HUT, Finand {riikka.susitaiva,

More information

A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS. A. C. Finch, K. J. Mackenzie, G. J. Balsdon, G. Symonds

A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS. A. C. Finch, K. J. Mackenzie, G. J. Balsdon, G. Symonds A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS A C Finch K J Mackenzie G J Basdon G Symonds Raca-Redac Ltd Newtown Tewkesbury Gos Engand ABSTRACT The introduction of fine-ine technoogies to printed

More information

Chapter Multidimensional Direct Search Method

Chapter Multidimensional Direct Search Method Chapter 09.03 Mutidimensiona Direct Search Method After reading this chapter, you shoud be abe to:. Understand the fundamentas of the mutidimensiona direct search methods. Understand how the coordinate

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

Solutions to the Final Exam

Solutions to the Final Exam CS/Math 24: Intro to Discrete Math 5//2 Instructor: Dieter van Mekebeek Soutions to the Fina Exam Probem Let D be the set of a peope. From the definition of R we see that (x, y) R if and ony if x is a

More information

ECEn 528 Prof. Archibald Lab: Dynamic Scheduling Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018

ECEn 528 Prof. Archibald Lab: Dynamic Scheduling Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018 ECEn 528 Prof. Archibad Lab: Dynamic Scheduing Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018 Overview This ab's purpose is to expore issues invoved in the design of out-of-order issue processors.

More information

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART 13 AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART Eva Vona University of Ostrava, 30th dubna st. 22, Ostrava, Czech Repubic e-mai: Eva.Vona@osu.cz Abstract: This artice presents the use of

More information

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory 0 th Word Congress on Structura and Mutidiscipinary Optimization May 9 -, 03, Orando, Forida, USA A Design Method for Optima Truss Structures with Certain Redundancy Based on Combinatoria Rigidity Theory

More information

Neural Network Enhancement of the Los Alamos Force Deployment Estimator

Neural Network Enhancement of the Los Alamos Force Deployment Estimator Missouri University of Science and Technoogy Schoars' Mine Eectrica and Computer Engineering Facuty Research & Creative Works Eectrica and Computer Engineering 1-1-1994 Neura Network Enhancement of the

More information

On-Chip CNN Accelerator for Image Super-Resolution

On-Chip CNN Accelerator for Image Super-Resolution On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, sjkang}@sogang.ac.kr ABSTRACT To impement

More information

Efficient method to design RF pulses for parallel excitation MRI using gridding and conjugate gradient

Efficient method to design RF pulses for parallel excitation MRI using gridding and conjugate gradient Origina rtice Efficient method to design RF puses for parae excitation MRI using gridding and conjugate gradient Shuo Feng, Jim Ji Department of Eectrica & Computer Engineering, Texas & M University, Texas,

More information

A Fast Block Matching Algorithm Based on the Winner-Update Strategy

A Fast Block Matching Algorithm Based on the Winner-Update Strategy In Proceedings of the Fourth Asian Conference on Computer Vision, Taipei, Taiwan, Jan. 000, Voume, pages 977 98 A Fast Bock Matching Agorithm Based on the Winner-Update Strategy Yong-Sheng Chenyz Yi-Ping

More information

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING Binbin Dai and Wei Yu Ya-Feng Liu Department of Eectrica and Computer Engineering University of Toronto, Toronto ON, Canada M5S 3G4 Emais:

More information

A Multigrid Method for Adaptive Sparse Grids

A Multigrid Method for Adaptive Sparse Grids A Mutigrid Method for Adaptive Sparse Grids The MIT Facuty has made this artice openy avaiabe. Pease share how this access benefits you. Your story matters. Citation As Pubished Pubisher Peherstorfer,

More information

Language Identification for Texts Written in Transliteration

Language Identification for Texts Written in Transliteration Language Identification for Texts Written in Transiteration Andrey Chepovskiy, Sergey Gusev, Margarita Kurbatova Higher Schoo of Economics, Data Anaysis and Artificia Inteigence Department, Pokrovskiy

More information

parms: A Package for Solving General Sparse Linear Systems on Parallel Computers

parms: A Package for Solving General Sparse Linear Systems on Parallel Computers parms: A Package for Soving Genera Sparse Linear Systems on Parae Computers Y. Saad 1 and M. Sosonkina 2 1 University of Minnesota, Minneapois, MN 55455, USA, saad@cs.umn.edu, http://www.cs.umn.edu/ saad

More information

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y.

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y. FORTH-ICS / TR-157 December 1995 Joint disparity and motion ed estimation in stereoscopic image sequences Ioannis Patras, Nikos Avertos and Georgios Tziritas y Abstract This work aims at determining four

More information

An Exponential Time 2-Approximation Algorithm for Bandwidth

An Exponential Time 2-Approximation Algorithm for Bandwidth An Exponentia Time 2-Approximation Agorithm for Bandwidth Martin Fürer 1, Serge Gaspers 2, Shiva Prasad Kasiviswanathan 3 1 Computer Science and Engineering, Pennsyvania State University, furer@cse.psu.edu

More information

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01 Page 1 of 15 Chapter 9 Chapter 9: Deveoping the Logica Data Mode The information requirements and business rues provide the information to produce the entities, attributes, and reationships in ogica mode.

More information

A Fast Multigrid Algorithm for Mesh Deformation

A Fast Multigrid Algorithm for Mesh Deformation A Fast Mutigrid Agorithm for Mesh Deformation Lin Shi Yizhou Yu Nathan Be Wei-Wen Feng University of Iinois at Urbana-Champaign Figure : The ide CAMEL becomes a boxer with the hep of MOCAP data and our

More information

Alpha labelings of straight simple polyominal caterpillars

Alpha labelings of straight simple polyominal caterpillars Apha abeings of straight simpe poyomina caterpiars Daibor Froncek, O Nei Kingston, Kye Vezina Department of Mathematics and Statistics University of Minnesota Duuth University Drive Duuth, MN 82-3, U.S.A.

More information

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming The First Internationa Symposium on Optimization and Systems Bioogy (OSB 07) Beijing, China, August 8 10, 2007 Copyright 2007 ORSC & APORC pp. 267 279 Soving Large Doube Digestion Probems for DNA Restriction

More information

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS Pave Tchesmedjiev, Peter Vassiev Centre for Biomedica Engineering,

More information

Sensitivity Analysis of Hopfield Neural Network in Classifying Natural RGB Color Space

Sensitivity Analysis of Hopfield Neural Network in Classifying Natural RGB Color Space Sensitivity Anaysis of Hopfied Neura Network in Cassifying Natura RGB Coor Space Department of Computer Science University of Sharjah UAE rsammouda@sharjah.ac.ae Abstract: - This paper presents a study

More information

Further Concepts in Geometry

Further Concepts in Geometry ppendix F Further oncepts in Geometry F. Exporing ongruence and Simiarity Identifying ongruent Figures Identifying Simiar Figures Reading and Using Definitions ongruent Trianges assifying Trianges Identifying

More information

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program?

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program? Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Spring 2018 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program a set of

More information

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Register Aocation Consider the foowing assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Assume that two registers are avaiabe. Starting from the eft a compier woud generate

More information

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code Further Optimization of the Decoding Method for Shortened Binary Cycic Fire Code Ch. Nanda Kishore Heosoft (India) Private Limited 8-2-703, Road No-12 Banjara His, Hyderabad, INDIA Phone: +91-040-3378222

More information

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges Specia Edition Using Microsoft Exce 2000 - Lesson 3 - Seecting and Naming Ces and.. Page 1 of 8 [Figures are not incuded in this sampe chapter] Specia Edition Using Microsoft Exce 2000-3 - Seecting and

More information

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System GPU Impementation of Parae SVM as Appied to Intrusion Detection System Sudarshan Hiray Research Schoar, Department of Computer Engineering, Vishwakarma Institute of Technoogy, Pune, India sdhiray7@gmai.com

More information

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162 oward Efficient Spatia Variation Decomposition via Sparse Regression Wangyang Zhang, Karthik Baakrishnan, Xin Li, Duane Boning and Rob Rutenbar 3 Carnegie Meon University, Pittsburgh, PA 53, wangyan@ece.cmu.edu,

More information

Merging Jacobi and Gauss-Seidel Methods for Solving Markov Chains on Computer Clusters

Merging Jacobi and Gauss-Seidel Methods for Solving Markov Chains on Computer Clusters Proceedings of the Internationa Muticonference on Computer Science and Information Technoogy pp. 263 268 ISBN 978-83-60810-14-9 ISSN 1896-7094 Merging Jacobi and Gauss-Seide Methods for Soving Markov Chains

More information

A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irregular LDPC Codes in the IEEE 802.

A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irregular LDPC Codes in the IEEE 802. A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irreguar LDPC Codes in the IEEE 82.16e Standards Yeong-Luh Ueng and Chung-Chao Cheng Dept. of Eectrica Engineering,

More information

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby

More information

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion A Loca Optima Method on DSA Guiding Tempate Assignment with Redundant/Dummy Via Insertion Xingquan Li 1, Bei Yu 2, Jiani Chen 1, Wenxing Zhu 1, 24th Asia and South Pacific Design T h e p i c Automation

More information

Fastest-Path Computation

Fastest-Path Computation Fastest-Path Computation DONGHUI ZHANG Coege of Computer & Information Science Northeastern University Synonyms fastest route; driving direction Definition In the United states, ony 9.% of the househods

More information

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions 2006 Internationa Joint Conference on Neura Networks Sheraton Vancouver Wa Centre Hote, Vancouver, BC, Canada Juy 16-21, 2006 A New Supervised Custering Agorithm Based on Min-Max Moduar Network with Gaussian-Zero-Crossing

More information

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7 Functions Unit 6 Gaddis: 6.1-5,7-10,13,15-16 and 7.7 CS 1428 Spring 2018 Ji Seaman 6.1 Moduar Programming Moduar programming: breaking a program up into smaer, manageabe components (modues) Function: a

More information

Design of IP Networks with End-to. to- End Performance Guarantees

Design of IP Networks with End-to. to- End Performance Guarantees Design of IP Networks with End-to to- End Performance Guarantees Irena Atov and Richard J. Harris* ( Swinburne University of Technoogy & *Massey University) Presentation Outine Introduction Mutiservice

More information

(0,l) (0,0) (l,0) (l,l)

(0,l) (0,0) (l,0) (l,l) Parae Domain Decomposition and Load Baancing Using Space-Fiing Curves Srinivas Auru Fatih E. Sevigen Dept. of CS Schoo of EECS New Mexico State University Syracuse University Las Cruces, NM 88003-8001

More information

Shape Analysis with Structural Invariant Checkers

Shape Analysis with Structural Invariant Checkers Shape Anaysis with Structura Invariant Checkers Bor-Yuh Evan Chang Xavier Riva George C. Necua University of Caifornia, Berkeey SAS 2007 Exampe: Typestate with shape anaysis Concrete Exampe Abstraction

More information

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs Repication of Virtua Network Functions: Optimizing Link Utiization and Resource Costs Francisco Carpio, Wogang Bziuk and Admea Jukan Technische Universität Braunschweig, Germany Emai:{f.carpio, w.bziuk,

More information

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints *

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 6, 333-346 (010) Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * HSIN-WEN WEI, WAN-CHEN LU, PEI-CHI HUANG, WEI-KUAN SHIH AND MING-YANG

More information

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM JOINT IMAGE REGISTRATION AND AMPLE-BASED SUPER-RESOLUTION ALGORITHM Hyo-Song Kim, Jeyong Shin, and Rae-Hong Park Department of Eectronic Engineering, Schoo of Engineering, Sogang University 35 Baekbeom-ro,

More information

Endoscopic Motion Compensation of High Speed Videoendoscopy

Endoscopic Motion Compensation of High Speed Videoendoscopy Endoscopic Motion Compensation of High Speed Videoendoscopy Bharath avuri Department of Computer Science and Engineering, University of South Caroina, Coumbia, SC - 901. ravuri@cse.sc.edu Abstract. High

More information

Binarized support vector machines

Binarized support vector machines Universidad Caros III de Madrid Repositorio instituciona e-archivo Departamento de Estadística http://e-archivo.uc3m.es DES - Working Papers. Statistics and Econometrics. WS 2007-11 Binarized support vector

More information

Genetic Algorithms for Parallel Code Optimization

Genetic Algorithms for Parallel Code Optimization Genetic Agorithms for Parae Code Optimization Ender Özcan Dept. of Computer Engineering Yeditepe University Kayışdağı, İstanbu, Turkey Emai: eozcan@cse.yeditepe.edu.tr Abstract- Determining the optimum

More information

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem Extended Node-Arc Formuation for the K-Edge-Disjoint Hop-Constrained Network Design Probem Quentin Botton Université cathoique de Louvain, Louvain Schoo of Management, (Begique) botton@poms.uc.ac.be Bernard


Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed

Nattawut Thepayasuwan, Member, IEEE and Alex Doboli, Member, IEEE Abstract


Self-Control Cyclic Access with Time Division - A MAC Proposal for The HFC System

S.M. Jiang, Danny H.K. Tsang, Samuel T. Chanson Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong


AMR++: Object-Oriented Parallel Adaptive Mesh Refinement

UCRL-ID-137373 AMR++: Object-Oriented Parallel Adaptive Mesh Refinement B. Philip, D. Quinlan February 2, 2000 U.S. Department of Energy Approved for public release; further dissemination unlimited DISCLAIMER This


Searching, Sorting & Analysis

Searching, Sorting & Analysis Unit 2 Chapter 8 CS 2308 Fall 2018 Jill Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange
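The search definition quoted in this snippet (return the index of the item, or -1 if not found) can be sketched as a minimal Python function; the function name and test values are illustrative, not taken from the course slides:

```python
def linear_search(items, target):
    """Return the index of target in items, or -1 if not found."""
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

print(linear_search([8, 3, 5], 5))   # -> 2
print(linear_search([8, 3, 5], 7))   # -> -1
```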


Priority Queueing for Packets with Two Characteristics

Pavel Chuprikov, Sergey I. Nikolenko, Alex Davydow, Kirill Kogan Abstract Modern network elements are increasingly required to deal with heterogeneous traffic.


Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David

Concurrent programming: From theory to practice Concurrent Algorithms 2016 Tudor David From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to practice Theoretical


A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals

Jiacheng Shang, Jie Wu Center for Networked Computing Dept. of Computer and Info. Sciences Temple University Motivation


Windows NT, Terminal Server and Citrix MetaFrame Terminal Server Architecture

Windows NT, Terminal Server and Citrix MetaFrame - CH 3 - Terminal Server Architect.. Page 1 of 13 [Figures are not included in this sample chapter] Windows NT, Terminal Server and Citrix MetaFrame - 3 - Terminal


ACTIVE LEARNING ON WEIGHTED GRAPHS USING ADAPTIVE AND NON-ADAPTIVE APPROACHES. Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega

Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega Department of Electrical Engineering University of Southern


A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER

International Journal on Technical and Physical Problems of Engineering (IJTPE) Published by International Organization of IOTPE ISSN 077-358 IJTPE Journal www.iotpe.com ijtpe@iotpe.com September 014 Issue 0 Volume


Automatic Hidden Web Database Classification

Zhiguo Gong, Jingbai Zhang, and Qian Liu Faculty of Science and Technology University of Macau Macao, PRC {fstzgg,ma46597,ma46620}@umac.mo Abstract. In this paper,


Fuzzy Controller for a Dynamic Window in Elliptic Curve Cryptography Wireless Networks for Scalar Multiplication

6th Asia-Pacific Conference on Communications Fuzzy Controller for a Dynamic Window in Elliptic Curve Cryptography Wireless Networks for Scalar Multiplication Xu Huang Faculty of Information Sciences and Engineering


An Introduction to Design Patterns

An Introduction to Design Patterns 1 Definitions A pattern is a recurring solution to a standard problem, in a context. Christopher Alexander, a professor of architecture Why would what a prof of architecture


RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*

RDF Objects 1 Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL-2002-315 November 27th, 2002* E-mail: Andy_Seaborne@hpl.hp.com RDF, semantic web, ontology, object-oriented datastructures


Joint Optimization of Intra- and Inter-Autonomous System Traffic Engineering

Kin-Hon Ho, Michael Howarth, Ning Wang, George Pavlou and Stylianos Georgoulas Centre for Communication Systems Research, University


Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Y.LI AND B.BHANU CAMERA ASSIGNMENT: A GAME-THEORETIC


Optimized Base-Station Cache Allocation for Cloud Radio Access Network with Multicast Backhaul

Binbin Dai, Student Member, IEEE, Ya-Feng Liu, Member, IEEE, and Wei Yu, Fellow, IEEE arXiv:804.0730v [cs.IT] 28


M. Badent 1, E. Di Giacomo 2, G. Liotta 2

DIEI Dipartimento di Ingegneria Elettronica e dell'informazione RT 005-06 Drawing Colored Graphs on Colored Points M. Badent 1, E. Di Giacomo 2, G. Liotta 2 1 University of Konstanz 2 Università di Perugia


On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models

Sumit Kunnumkal September 30, 2014 Abstract The assortment optimization problem under the mixture of multinomial logit


On Finding the Best Partial Multicast Protection Tree under Dual-Homing Architecture

Mei Yang, Jianping Wang, Xiangtong Qi, Yingtao Jiang Department of Electrical and Computer Engineering, University of Nevada Las


Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism

G. Siamantas and S. Manesis Electrical & Computer Engineering Dept., University of Patras, Patras, Greece gsiama@upatras.gr;stam.manesis@ece.upatras.gr


CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads

CSE120 Principles of Operating Systems Prof Yuanyuan (YY) Zhou Lecture 4: Threads Announcement Project 0 Due Project 1 out Homework 1 due on Thursday Submit it to Gradescope online 2 Processes Recall that


An Optimizing Compiler

The big difference between interpreters and compilers is that compilers have the ability to think about how to translate a source program into target code in the most effective way. Usually
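As a small sketch of the kind of decision-making this snippet attributes to an optimizing compiler, here is a minimal constant-folding pass over Python's own AST; the pass name and its narrow scope (folding only integer addition) are illustrative, not taken from the book excerpt:

```python
import ast

# A tiny optimization pass: fold constant additions (e.g. 2 + 3 -> 5)
# before any target code would be emitted.
class FoldConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse("x = 2 + 3")
tree = ast.fix_missing_locations(FoldConstants().visit(tree))
print(ast.unparse(tree))  # -> x = 5
```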


Multi-level Shape Recognition based on Wavelet-Transform. Modulus Maxima

Faouzi Alaya Cheikh, Azhar Quddus and Moncef Gabbouj Tampere University of Technology (TUT), Signal Processing Laboratory, P.O. Box 553, FIN-33101


Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications

Daniel P. Palomar and Mung Chiang Electrical Engineering Department, Princeton University, NJ 08544, USA


Development of a National Portal for Tuvalu. Business Case. SPREP Pacific iclim

Development of a National Portal for Tuvalu Business Case SPREP Pacific iClim April 2018 Table of Contents 1. Introduction... 3 1.1 Report Purpose... 3 1.2 Background & Context... 3 1.3 Other IKM Activities


understood as processors that match AST patterns of the source language and translate them into patterns in the target language.

A Basic Compiler At a fundamental level compilers can be understood as processors that match AST patterns of the source language and translate them into patterns in the target language. Here we will look at a basic
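The snippet's view of a compiler as an AST pattern matcher can be sketched in a few lines of Python; the tuple-based AST shape (("num", n) / ("add", l, r)) and the stack-machine target are invented for illustration, not taken from the course notes:

```python
# Minimal sketch: translate a tiny arithmetic AST into stack-machine code
# by matching each node pattern and emitting the corresponding target pattern.
def compile_expr(node):
    kind = node[0]
    if kind == "num":                       # leaf pattern: push a constant
        return [("PUSH", node[1])]
    if kind == "add":                       # 'add' pattern: code for both
        return (compile_expr(node[1])       # operands, then an ADD
                + compile_expr(node[2])
                + [("ADD",)])
    raise ValueError(f"unknown node kind: {kind}")

print(compile_expr(("add", ("num", 1), ("num", 2))))
# -> [('PUSH', 1), ('PUSH', 2), ('ADD',)]
```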


1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER 2014 Backward Fuzzy Rule Interpolation Shangzhu Jin, Ren Diao, Chai Quek, Senior Member, IEEE, and Qiang Shen Abstract Fuzzy rule interpolation


Reference trajectory tracking for a multi-dof robot arm

Archives of Control Sciences Volume 5LXI, 5 No. 4, pages 53 57 Reference trajectory tracking for a multi-dof robot arm RÓBERT KRASŇANSKÝ, PETER VALACH, DÁVID SOÓS, JAVAD ZARBAKHSH This paper presents the


Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Hardware Components Illustrated

Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Fall 2017 Jill Seaman 1.1 Why Program? Computer: programmable machine designed to follow instructions Program: instructions


Resource Optimization to Provision a Virtual Private Network Using the Hose Model

Monia Ghobadi, Sudhakar Ganti, Gholamali C. Shoja University of Victoria, Victoria BC, Canada V8W 3P6 e-mail: {monia, sganti,


Automatic Grouping for Social Networks CS229 Project Report

Xiaoying Tian Ya Le Yangru Fang Abstract Social networking sites allow users to manually categorize their friends, but it is laborious to construct


APPLICATION OF THE FINITE ELEMENT METHOD TO THE ANALYSIS OF AUTOMOBILE TIRES

H.-J. PAYER, G. MESCHKE AND H.A. MANG Institute for Strength of Materials Vienna University of Technology Karlsplatz 13/E202, A-1040


InnerSpec: Technical Report

Fabrizio Guerrini, Alessandro Gnutti, Riccardo Leonardi Department of Information Engineering, University of Brescia Via Branze 38, 25123 Brescia, Italy {fabrizio.guerrini, a.gnutti006,
