Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors


Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors

P. Zaspel

Departement Mathematik und Informatik, Universität Basel, CH-4051 Basel

Preprint No., Fachbereich Mathematik, June 2017

Journal of Scientific Computing manuscript No. (will be inserted by the editor)

Analysis and parallelization strategies for Ruge-Stüben AMG on many-core processors

Peter Zaspel

Received: date / Accepted: date

Abstract The Ruge-Stüben algebraic multigrid method (AMG) is an optimal-complexity black-box approach to solve linear systems arising in discretizations of e.g. elliptic PDEs. Recently, there has been a growing interest in parallelizing this method on many-core hardware, especially graphics processing units (GPUs). This type of hardware delivers high performance for highly parallel algorithms. In this work, we analyse convergence properties of recent AMG developments for many-core processors and propose to use more classical choices of AMG components for higher robustness. Based on these choices, we introduce many-core parallelization strategies for a robust hybrid many-core AMG. The strategies can be understood and applied without deep knowledge of a given many-core architecture. We use them to propose a new hybrid GPU implementation. The implementation is tested in an in-depth performance analysis, which outlines its good convergence properties and high performance in the solve phase.

Keywords Ruge-Stüben algebraic multigrid (AMG) · Graphics processing units (GPUs) · Parallelization · Many-core · Iterative linear solvers · Graph traversal

Mathematics Subject Classification (2000) 65N55 · 65F10 · 65Y05 · 68W10 · 65F50 · 65N22

1 Introduction

The solution of sparse linear systems from discretized elliptic partial differential equations (PDEs) is the time-dominant component of many numerical simulations. Iterative solvers based on multigrid methods solve such systems in optimal (linear) complexity. Standard geometric multigrid often struggles with discretizations of PDEs in complex geometries. However, algebraic multigrid (AMG) is able

P. Zaspel
Departement für Mathematik und Informatik, Universität Basel
Spiegelgasse, Basel, Switzerland
E-mail: peter.zaspel@unibas.ch

to handle this special case without further adaptations. AMG builds the multigrid hierarchy by a purely algebraic construction involving only the entries of the underlying system matrix. Many-core processors achieve high performance by executing a very high number of concurrent computational threads. However, they are often limited by a relatively low per-thread performance and other simplifications. One example of many-core processors are graphics processing units (GPUs). Another example are Xeon Phi processors. Nowadays, a series of high performance computing (HPC) cluster systems is equipped with many-core processors as accelerators. That is, many-core processors can be used in addition to the existing multi-core processors in a given compute node. To achieve high performance on many-core processors, it is necessary to develop numerical algorithms that use the provided rather extreme amount of parallelism. Here, it is often not enough to reformulate an existing algorithm in a smart way. Instead, the applied numerical method has to be changed to achieve high (parallel) performance. However, by changing the numerical method, the solution properties are affected. As a rule of thumb, we note that a higher degree of parallelism in a numerical method is often connected to weaker solution properties. Hence, we have to make a trade-off between (parallel) performance and strong solution properties. This article analyses this kind of trade-off in recent work, e.g. [23], on the design and implementation of Ruge-Stüben algebraic multigrid [31, Appendix A] on many-core processors. We consider Ruge-Stüben AMG as the main object of study, since it is well-known for its good convergence and robustness. Based on our analysis, we propose a different balance of algorithmic components and parallel performance than used, for many-core processors, before. For a practical realization, we propose many-core parallelization strategies for major parts of the method.
This leads to a hybrid GPU-based Ruge-Stüben AMG implementation, for which we analyse the performance.

1.1 Related work

Whenever we discuss algebraic multigrid in this article, we actually refer to Ruge-Stüben AMG. For completeness, we note however that quite some work on non-Ruge-Stüben AMG types on many-core processors has been conducted in e.g. [4, 32, 1, 9, 6, 21, 18] for (smoothed) aggregation AMG and in [34] for auxiliary grid AMG. Algebraic multigrid has a setup phase with the multigrid hierarchy construction and a solve phase with the iterative application of recursive multigrid cycles. The solve phase is almost identical in geometric and algebraic multigrid. Hence, the core of AMG lies in the setup phase. Traditionally, the execution of the setup phase has been considered to be purely sequential [31, Appendix A]. With the need of parallelization of AMG on multi-core clusters, a parallelization of the setup phase has been mainly achieved by applying the purely sequential parts of the setup phase, i.e. the coarsening or C/F splitting, onto partitions of the full system matrix and by adding some corrections at the interface [15, 35]. Unfortunately, this approach does not scale to the degree of parallelism provided by many-core processors. In fact, a single many-core processor roughly requires the parallelism that is normally distributed over a full cluster. This leads to a bad domain-to-interface ratio when applying the parallelization strategy for multi-core processors to many-

core processors. One very parallel coarsening approach, which shows acceptable performance even on many-core processors, is the PMIS (parallel maximal independent set) classification [35]. However, PMIS is not always robust. This is often overcome by some stronger interpolation techniques at higher computational costs [8]. The choice of suitable algorithmic components in the setup phase is therefore one of the crucial trade-offs that has to be made in the construction of AMG on many-core hardware. These trade-offs will be discussed in this work. Maybe the first work on (Ruge-Stüben) AMG for many-core processors has been performed in [19, 14]. In that work, the setup phase is based on matrix decomposition with boundary treatment. In [16, 24], only the solve phase of AMG was parallelized on GPU and also used in a multi-GPU setting. This approach might not yet use the full performance of GPUs. More recently, AMG has become available in the open-source libraries ViennaCL [27] and Paralution [17]. In [33, 28], performance results of ViennaCL have been discussed for AMG; however, Ruge-Stüben AMG has still only been considered either with a multi-core CPU setup or with a single-threaded GPU setup. To the author's knowledge, details on the implementation of Paralution are not published. Two current state-of-the-art commercial many-core (GPU) implementations are GAMPACK [11] and AmgX [23]. They provide a solve and setup phase that runs fully on GPUs by using PMIS. In their works [11] and [23], the authors provide an overview of their implementations and some excellent performance results when comparing CPU- and GPU-based AMG. To keep their competitive advantage, they can, however, not uncover the full details of their parallelization approaches. Moreover, their choice of test cases for convergence results is, understandably enough, targeted mainly at their customers' needs.

1.2 Contribution of this work

The objective of this work is two-fold.
First, the aim is to complement the above works by a balanced discussion of the impact of the choice of the numerical components of AMG on the robustness and solution properties. To start this discussion, we introduce the components of AMG. We then discuss choices of AMG components that have been made for many-core AMG by authors in the past. Here, we especially focus on the commercial GPU implementations [11, 23] as gold standard. Within these implementations we closely follow [23] and (numerically) study the convergence and robustness properties of methods used therein. Our numerical study uses modified versions of the AMG2013 benchmark, which is a stripped-down version of BoomerAMG [15] in hypre [12]. Within that study, we propose a set of parameters and components for AMG that we consider to be more robust than existing choices in many-core AMG. Second, the aim is to propose a series of parallelization strategies that allow to run the major part of the AMG setup phase on many-core processors. This discussion shall give starters in the field of many-core programming an insight into the algorithmic challenges that have to be faced when dealing with AMG on many-core hardware. In order to understand the properties of the proposed algorithms, an implementation has been performed based on CUDA and a sparse linear algebra library for GPUs (CUSP [5]). This implementation is hybrid in the sense that the full solve phase and almost the full setup phase are done on GPU. The only part which is left on CPU is a subset of the operations in the C/F

splitting. The new AMG solver runs on a single GPU together with one CPU core. As we will see, the resulting AMG implementation has strong convergence and robustness properties. A performance study will outline that the solve phase of our AMG runs clearly faster than a parallel version of AMG2013. However, the hybrid GPU setup phase struggles in competing with AMG2013. Those difficulties will be analyzed and explained individually. Finally, a real-world application with linear systems from a discretized PDE in a complex geometry will be shown.

This work is structured as follows. In Section 2, we review Ruge-Stüben AMG. Section 3 analyses the convergence and robustness properties of typical AMG constructions in GPU-based AMG. Section 4 introduces parallelization strategies for a robust hybrid many-core AMG and briefly discusses our concrete GPU implementation. Numerical results and performance measurements of our implementation are outlined in Section 5. Section 6 summarizes the outcome of this work.

2 Overview of the components of Ruge-Stüben algebraic multigrid

We will closely follow [31, Appendix A] to give a short overview of Ruge-Stüben AMG. Notation from [31] will be partially adapted. We want to solve linear systems

$Ax = b$    (1)

with $A \in \mathbb{R}^{N \times N}$ a sparse symmetric M-matrix, and $x, b \in \mathbb{R}^N$. These linear systems arise e.g. in the approximate solution of elliptic PDEs by finite differences. In the following, we will first discuss the two-level idea of multigrid and then the multigrid algorithm. This is followed by a summary of all algorithmically important components of AMG, namely C/F splitting, interpolation and smoothers.

2.1 Two-level idea

The two-level or two-grid algorithm contains all important properties of multigrid. To explain it, we start by identifying the original linear system as a system on a fine level or fine grid, noting that we will use the terms grid and level equivalently. On the fine grid, we thus have

$A_f x_f = b_f \iff \sum_{j \in \Omega_f} a^f_{ij} x^f_j = b^f_i \quad \forall i \in \Omega_f,$

with $N_f := N$, $A_f := A$, $x_f := x$, $b_f := b$ and $\Omega_f := \{1, \dots, N_f\}$. Next, we introduce the smoother $S_f \in \mathbb{R}^{N_f \times N_f}$.
Starting from an approximation $P_f \approx A_f^{-1}$ to the inverse of the system matrix, it is defined as $S_f := (I_f - P_f A_f)$, where $I_f$ is the identity matrix. One step of the smoother generates from $x_f$ a smoothed vector $\bar{x}_f$ with

$\bar{x}_f := S_f x_f + (I_f - S_f) A_f^{-1} b_f.$    (2)

The smoother has a structure which avoids inverting the system matrix. It shall damp the highly oscillatory modes in the error $e_f = x_f - A_f^{-1} b_f$. Examples for smoothers, i.e. concrete choices for $P_f$, will be given in Section 2.5.

Having damped highly oscillatory error modes on the fine level by the pre-smoother, the idea of the two-level method is to construct a coarse level or coarse grid on which the remaining low error modes are damped. The coarse level is constructed by introducing a splitting $\Omega_f = C \cup F$, the C/F splitting, into coarse and fine grid variables. Coarse grid variables will be kept on the next level, while the information of all variables will be transferred between coarse and fine grid by transfer operators. The linear system that is solved on the coarse grid is given as

$A_c x_c = b_c \iff \sum_{j \in \Omega_c} a^c_{ij} x^c_j = b^c_i \quad \forall i \in \Omega_c,$    (3)

with the obvious choice of $\Omega_c := C$ such that $A_c \in \mathbb{R}^{N_c \times N_c}$, $x_c, b_c \in \mathbb{R}^{N_c}$, $N_c := |C|$. The above equation is coupled to the fine grid problem by restricting the residual $r_f = b_f - A_f \bar{x}_f$ on the fine level to the coarse grid and by using it as right-hand side. To transfer solutions from fine to coarse level, a restriction operator $I_f^c \in \mathbb{R}^{N_c \times N_f}$ is introduced. The opposite mapping is done by an interpolation or prolongation operator $I_c^f \in \mathbb{R}^{N_f \times N_c}$. Together, they define the coarse grid system matrix $A_c := I_f^c A_f I_c^f$, which we call the Galerkin operator. We usually set the interpolation matrix as the transposed restriction matrix, $I_c^f := (I_f^c)^\top$. In a pure two-level method, the coarse-level problem is solved directly. Its solution is transferred back to the fine grid by the coarse grid correction $x_f = \bar{x}_f + I_c^f x_c$ and another post-smoothing step is applied. Iterating this approach leads to an iterative linear two-level solver. In AMG, we call the construction of the C/F splitting, the construction of the prolongation and restriction, the construction of the Galerkin operator and the construction of the smoother the setup phase. The remaining part is the solve phase. It will turn out that the setup phase is actually the algorithmically more challenging part.
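The Galerkin construction described above can be illustrated with a minimal pure-Python sketch on a 1D Poisson model problem. All names (A_f, P, R, A_c) and the choice of linear interpolation from every second grid point are illustrative assumptions, not code from the paper.

```python
# Two-level Galerkin operator A_c = R * A_f * P on a tiny 1D Poisson problem.
# Hypothetical sketch: linear interpolation from the coarse points {1, 3, 5}.

def matmul(X, Y):
    """Dense matrix-matrix product for lists of lists."""
    n, k, m = len(X), len(Y), len(Y[0])
    return [[sum(X[i][l] * Y[l][j] for l in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Fine-level 1D Poisson matrix (N_f = 7, stencil [-1, 2, -1]).
N_f = 7
A_f = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
        for j in range(N_f)] for i in range(N_f)]

# Interpolation from the N_c = 3 coarse points {1, 3, 5} (0-based indices).
coarse = [1, 3, 5]
P = [[0.0] * len(coarse) for _ in range(N_f)]
for c, i in enumerate(coarse):
    P[i][c] = 1.0                       # a coarse point keeps its value
for i in range(N_f):
    if i not in coarse:                 # a fine point averages coarse neighbours
        for c, j in enumerate(coarse):
            if abs(i - j) == 1:
                P[i][c] = 0.5

R = transpose(P)                        # restriction = transposed interpolation
A_c = matmul(R, matmul(A_f, P))         # Galerkin operator A_c = R A_f P
```

For this choice, A_c is again a symmetric tridiagonal Poisson-like matrix (diagonal 1, off-diagonal -0.5), which mirrors the statement that the Galerkin operator reproduces a coarse version of the fine-level problem.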
2.2 Multigrid method

By carefully choosing all numerical ingredients it is possible to have linear complexity in the degrees of freedom for all operations on the fine grid during one iteration of the solve phase, cf. [31, Appendix A] for more details. However, directly solving the coarse problem (3) still has a computational complexity of $O(N_c^3)$, leading to a non-optimal method. To overcome this issue, the algorithmic idea of pre-smoothing, coarse grid correction and post-smoothing can be extended such that the direct solve on the coarse grid is replaced by a recursive application of pre-smoothing, coarse grid correction and post-smoothing on the coarse level. Extending this to several levels leads to the so-called V-cycle in a multigrid method. Algorithm 1 summarizes the V-cycle. Here, we have introduced levels $\ell = 0, 1, \dots, \ell_{max}$ with nested sets of variables $\Omega_0 \subset \Omega_1 \subset \dots \subset \Omega_{\ell_{max}}$. Now, level $\ell_{max}$ is the finest level $f$. Analogously we introduce per-level coarse/fine grid sets $C_\ell / F_\ell$, matrices $A_\ell$, vectors $x_\ell, b_\ell$, smoothers $S_\ell$, interpolation operators $I_{\ell-1}^{\ell}$ and restriction operators $I_\ell^{\ell-1}$. In lines 3/4 and 9/10, the smoother is applied $\nu$ times. The construction of coarser levels is usually stopped as soon as the number of variables per level goes below a certain threshold and the coarsest grid is solved directly. Algorithm 2 introduces AMG as an iterative solver. Lines 2-7 correspond to the setup phase. Lines 8-12 describe a simple iterative solver where the function cycle

Algorithm 1 V-cycle in AMG
Require: complete multigrid hierarchy already set up
1: function cycle(ℓ, x_ℓ, b_ℓ)
2:   if ℓ > 0 then                                ▷ finer levels
3:     for s = 1, ..., ν do
4:       x_ℓ = S_ℓ(x_ℓ, b_ℓ)                      ▷ pre-smoothing
5:     b_{ℓ-1} = I_ℓ^{ℓ-1}(b_ℓ - A_ℓ x_ℓ)         ▷ residual restriction
6:     x_{ℓ-1} = 0
7:     cycle(ℓ-1, x_{ℓ-1}, b_{ℓ-1})               ▷ solve on coarse grid
8:     x_ℓ = x_ℓ + I_{ℓ-1}^{ℓ} x_{ℓ-1}            ▷ update solution
9:     for s = 1, ..., ν do
10:      x_ℓ = S_ℓ(x_ℓ, b_ℓ)                      ▷ post-smoothing
11:    return x_ℓ
12:  else                                         ▷ coarsest level
13:    x_ℓ = A_ℓ^{-1} b_ℓ                         ▷ direct solve
14:    return x_ℓ

Algorithm 2 Iterative algebraic multigrid solver
Require: A ∈ R^{N×N} symmetric M-matrix
1: function AMGsolver(A, b, res)
2:   A_{ℓmax} := A, Ω_{ℓmax} := {1, ..., N}
3:   for ℓ = ℓmax, ..., 1 do                      ▷ multigrid hierarchy setup
4:     C_ℓ ∪ F_ℓ = Ω_ℓ, Ω_{ℓ-1} := C_ℓ            ▷ C/F splitting
5:     construct I_{ℓ-1}^{ℓ}, I_ℓ^{ℓ-1} := (I_{ℓ-1}^{ℓ})^T  ▷ transfer operators
6:     A_{ℓ-1} := I_ℓ^{ℓ-1} A_ℓ I_{ℓ-1}^{ℓ}       ▷ Galerkin operator
7:     construct S_ℓ                              ▷ smoother
8:   x^0 := 0, n := 0
9:   repeat                                       ▷ iterative solution
10:    x^{n+1} = cycle(ℓmax, x^n, b)              ▷ multigrid cycle
11:    r^{n+1} = b - A x^{n+1}
12:  until ||r^{n+1}||_2 ≤ res
13:  return x^{n+1}

corresponds to the launch of the V-cycle in Algorithm 1. Instead of applying the V-cycle in an iterative scheme, it could also be launched as a preconditioner in a Krylov subspace solver, e.g. a conjugate gradient (CG) method. This second approach is often preferred due to faster convergence.

2.3 C/F splitting

The core idea of the algebraic multigrid method is to introduce the dual view of the linear system as a graph. Thereby, the C/F splitting or coarsening can be done on a graph instead of a geometry. The mapping between matrix and graph is done by considering the system matrix as adjacency matrix for a graph in which the variables are the nodes or points. Hence, weighted connections or edges between nodes exist if there is a non-zero non-diagonal entry in the system matrix A relating one variable to the other.

2.3.1 Sequential coarsening

The crucial sequential part of the Ruge-Stüben AMG method is the choice of coarse and fine grid points (C-/F-points), i.e. the C/F splitting. It is based on the above discussed graph representation.
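The control flow of Algorithm 1 can be rendered as a short recursive Python function. This is an illustrative sketch only: the hierarchy layout and the helper names (smooth, restrict, prolong, direct_solve) are assumptions, and the demo below wires up a deliberately tiny two-level 1D Poisson problem rather than anything from the paper.

```python
# Hypothetical rendition of the V-cycle control flow in Algorithm 1.
def vcycle(level, x, b, hierarchy, nu=2):
    """One V-cycle; `hierarchy` is a list of per-level operator dicts,
    index 0 being the coarsest level."""
    H = hierarchy[level]
    if level == 0:                                  # coarsest level: direct solve
        return H["direct_solve"](b)
    for _ in range(nu):                             # pre-smoothing
        x = H["smooth"](x, b)
    r = [bi - ai for bi, ai in zip(b, H["matvec"](x))]
    b_c = H["restrict"](r)                          # residual restriction
    x_c = vcycle(level - 1, [0.0] * len(b_c), b_c, hierarchy, nu)
    e = H["prolong"](x_c)                           # coarse grid correction
    x = [xi + ei for xi, ei in zip(x, e)]
    for _ in range(nu):                             # post-smoothing
        x = H["smooth"](x, b)
    return x

def two_level_demo(cycles=30, omega=0.8):
    """Tiny two-level demo: 1D Poisson, 3 fine points, 1 coarse point.
    Returns the max-norm of the residual after `cycles` V-cycles."""
    A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]
    def smooth(x, b):                               # relaxed Jacobi sweep
        r = [bi - ai for bi, ai in zip(b, matvec(x))]
        return [x[i] + omega * r[i] / A[i][i] for i in range(3)]
    p = [0.5, 1.0, 0.5]                             # interpolation column
    a_c = sum(p[i] * sum(A[i][j] * p[j] for j in range(3)) for i in range(3))
    hierarchy = [
        {"direct_solve": lambda bc: [bc[0] / a_c]},
        {"matvec": matvec, "smooth": smooth,
         "restrict": lambda r: [sum(pi * ri for pi, ri in zip(p, r))],
         "prolong": lambda xc: [pi * xc[0] for pi in p]},
    ]
    x, b = [0.0] * 3, [1.0, 1.0, 1.0]
    for _ in range(cycles):
        x = vcycle(1, x, b, hierarchy)
    return max(abs(bi - ai) for bi, ai in zip(b, matvec(x)))
```

Running two_level_demo() drives the residual down by roughly a constant factor per cycle, which is the level-independent contraction behavior multigrid is built to deliver.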

Algorithm 3 Standard coarsening algorithm
Require: level ℓ
1: function AMGstandardCoarsening
2:   F := ∅, C := ∅, U := Ω_ℓ
3:   for i ∈ U do
4:     λ_i := |S_i^T ∩ U| + 2|S_i^T ∩ F|
5:   while ∃ i s.th. λ_i ≠ 0 do
6:     find i_max := argmax_i λ_i
7:     C := C ∪ {i_max}
8:     U := U \ {i_max}
9:     for j ∈ (S_{i_max}^T ∩ U) do
10:      F := F ∪ {j}
11:      U := U \ {j}
12:    for i ∈ U do
13:      λ_i := |S_i^T ∩ U| + 2|S_i^T ∩ F|
14:  return C, F

The neighborhood of a point/variable i is given by

$N_i := \{ j \mid j \neq i,\ a_{ij} \neq 0 \}.$

We say that a variable i is strongly negatively coupled to a variable j if we have, for a fixed $0 < \epsilon_{str} < 1$, the relation

$-a_{ij} \geq \epsilon_{str} \max_{a_{ik} < 0} |a_{ik}|.$

All strong negative couplings of a variable i can be denoted by the set

$S_i := \{ j \in N_i \mid i \text{ strongly negatively coupled to } j \}.$

We also need the set of all variables j which are strongly coupled to i. It is

$S_i^T := \{ j \mid i \in S_j \}.$

The rough idea of the C/F splitting is to create a rather uniform distribution of coarse and fine grid variables with fine variables being surrounded by coarse variables (in the graph). Good convergence is achieved if fine grid points are strongly coupled to coarse grid points [31, Section A.7.1.1]. A first algorithmic idea [26, 31] to achieve this is as follows. One repeats the choice of coarse grid points until all points are classified as coarse or fine grid points. Whenever a coarse grid point has been chosen, all neighboring points which are strongly coupled to the coarse point are defined as fine grid points. In real applications, this algorithmic idea is replaced by the standard coarsening algorithm [31, Section A.7.1.1], which is stated in Algorithm 3. This method also introduces sets of undecided variables U and importance measures λ_i. It results in a rather evenly distributed set of coarse grid points due to the indirectly imposed ordering by the weights λ_i.
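The greedy structure of Algorithm 3 can be sketched compactly in Python. This is a simplified, hypothetical rendition: `S` maps each node to its strong couplings S_i, `St` is the transposed relation S_i^T, and the convention of moving leftover undecided nodes to F at the end is an assumption of this sketch.

```python
# Hypothetical sketch of the standard coarsening (Algorithm 3).
def standard_coarsening(nodes, S):
    """Return a C/F splitting of `nodes` from strong couplings S[i]."""
    St = {i: set() for i in nodes}          # S_i^T: who strongly depends on i
    for i in nodes:
        for j in S[i]:
            St[j].add(i)
    C, F, U = set(), set(), set(nodes)
    lam = {i: len(St[i] & U) + 2 * len(St[i] & F) for i in U}
    while any(lam.get(i, 0) != 0 for i in U):
        i_max = max(U, key=lambda i: lam[i])    # most "important" node
        C.add(i_max)
        U.discard(i_max)
        for j in St[i_max] & U:                 # its strong neighbours -> fine
            F.add(j)
            U.discard(j)
        # recompute the importance measures for the remaining undecided nodes
        lam = {i: len(St[i] & U) + 2 * len(St[i] & F) for i in U}
    F |= U          # assumption: isolated leftovers become F-points
    return C, F
```

On a path graph (the 1D Poisson connectivity) this produces the expected evenly spread coarse points with every fine point strongly coupled to a coarse neighbour.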

2.3.2 Parallel coarsening

As sketched in Section 1, the standard idea to parallelize the setup phase, and thus also the coarsening, on multi-core processors is to apply a domain-decomposition idea. That is, the system matrix / graph is divided into subparts on which the standard sequential coarsening is applied. Afterwards, the resulting splittings are connected by some boundary treatment, cf. [35, 13]. However, it is well-known that the quality of the splitting gets affected if the ratio between internal points and points treated by the boundary correction becomes smaller. Applying multi-core coarsening on many-core processors is somewhat the extreme case of this situation. To be able to have a high utilization of the compute resources of a many-core processor, a large amount of threads has to be executed in parallel. Thereby, almost all points become boundary points. Consequently, it is not advisable to apply multi-core coarsening techniques on many-core hardware. One exception to this rule is the parallel maximal independent set (PMIS) [35] coarsening. This algorithm can be parallelized on many-core processors without a degradation of the generated C/F splitting. However, PMIS, parallelized or not, often does not produce optimal splittings. To keep a relatively robust AMG implementation, PMIS usually is combined with strong, thus computationally expensive, interpolation [8] methods. Interpolation will be discussed in the next section.

2.4 Interpolation

Interpolation defines the transfer between information on different levels in AMG. As previously stated, the interpolation operator $I_{\ell-1}^{\ell}$ transfers solutions on a coarse level to the next finer level. Since we use the transpose of the interpolation operator matrix as restriction matrix, it suffices to describe the interpolation operation and its construction.
We start by introducing some additional notation with

$C_i := C \cap N_i, \quad F_i := F \cap N_i, \quad C_i^s := C \cap S_i, \quad F_i^s := F \cap S_i.$

2.4.1 Direct interpolation

Direct interpolation uses the strongly coupled coarse grid points to interpolate to a given fine grid point. Thus, for each $i \in F$ we use the set of so-called interpolatory variables $P_i = C_i^s$ and interpolate a given fine grid variable $e_i$ by

$e_i = \sum_{k \in P_i} w_{ik} e_k, \qquad w_{ik} = -\alpha_i \frac{a_{ik}}{a_{ii}}, \qquad \alpha_i = \frac{\sum_{j \in N_i} a_{ij}}{\sum_{k \in P_i} a_{ik}}.$

We still assume here that the system matrix is an M-matrix. Therefore, we can neglect the handling of positive non-diagonal entries.

2.4.2 Standard interpolation

A much better convergence can be achieved by standard interpolation. It not only considers strongly connected coarse grid nodes but also includes strong connections between fine grid nodes. In order to describe this process, we have a look at a fine
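The direct interpolation formulas above can be checked on a single matrix row. The following sketch is illustrative (the function name and the dict-based row format are assumptions); it computes the weights w_ik = -alpha_i a_ik / a_ii for one fine-grid point.

```python
# Hypothetical sketch of direct interpolation weights for one fine-grid row.
def direct_interpolation_weights(row, i, P_i):
    """row: dict j -> a_ij for one matrix row; i: the fine point;
    P_i: the interpolatory coarse points C_i^s."""
    a_ii = row[i]
    neigh_sum = sum(a for j, a in row.items() if j != i)   # sum over N_i
    interp_sum = sum(row[k] for k in P_i)                  # sum over P_i
    alpha = neigh_sum / interp_sum
    return {k: -alpha * row[k] / a_ii for k in P_i}

# 1D Poisson row [-1, 2, -1] at fine point i = 1 with coarse neighbours 0 and 2:
w = direct_interpolation_weights({0: -1.0, 1: 2.0, 2: -1.0}, 1, [0, 2])
```

For this row both weights come out as 0.5 and sum to one, i.e. a fine point between two coarse points is interpolated by simple averaging, as expected for the constant-preserving interpolation of a zero-row-sum M-matrix row.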

grid point $i \in F$. The application of its corresponding matrix row to a vector $e$ reads as

$a_{ii} e_i + \sum_{j \in N_i} a_{ij} e_j.$

To apply standard interpolation, we introduce a modified system matrix. There, we replace in those rows that are associated with a fine grid point $i \in F$, as above, the variables $e_j$ with $j \in F_i^s$, thus the strongly coupled fine grid points, as

$e_j \to -\sum_{k \in N_j} a_{jk} e_k / a_{jj}.$

The newly generated matrix $\hat{A}$ has entries $\hat{a}_{ij}$ and possesses new neighborhood sets $\hat{N}_i$. By further setting for all $i \in F$ that

$\hat{P}_i = C_i^s \cup \Bigl( \bigcup_{j \in F_i^s} C_j^s \Bigr),$

we can define standard interpolation analogously to direct interpolation as

$e_i = \sum_{k \in \hat{P}_i} \hat{w}_{ik} e_k, \qquad \hat{w}_{ik} = -\hat{\alpha}_i \frac{\hat{a}_{ik}}{\hat{a}_{ii}}, \qquad \hat{\alpha}_i = \frac{\sum_{j \in \hat{N}_i} \hat{a}_{ij}}{\sum_{k \in \hat{P}_i} \hat{a}_{ik}}.$

We thus extend direct interpolation to include the neighborhood of the strongly connected fine grid points.

2.4.3 Truncation

The more connections between variables are taken into account when constructing the interpolation operator, the better the interpolation. However, too many connections lead to rather densely populated interpolation operator matrices and thus to denser system matrices on coarser levels. This in turn might heavily affect the overall complexity of the multigrid method. Therefore, it is often necessary to introduce a truncation of the interpolation. One thus drops all entries related to connections with a strength smaller than the largest entry scaled by a threshold $\epsilon_{tr}$. The resulting entries finally have to be rescaled accordingly.

2.5 Standard smoothers

We already introduced smoothers $S_\ell \in \mathbb{R}^{N_\ell \times N_\ell}$ to be an important numerical ingredient on a given level $\ell$ of the algebraic multigrid method. For $P_\ell := D_\ell^{-1}$, $D_\ell = \operatorname{diag}(A_\ell)$ we get the Jacobi smoother $S_\ell^J := I_\ell - D_\ell^{-1} A_\ell$. For a given point $i$ its application reads as

$x_i = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij} x_j \Bigr).$

Due to its trivial nature it can be very easily applied in parallel, even on many-core hardware. Furthermore, we can introduce a relaxed Jacobi iteration with a parameter $\omega \in \mathbb{R}$ and $P_\ell := \omega D_\ell^{-1}$, thus $S_\ell^{J,\omega} := I_\ell - \omega D_\ell^{-1} A_\ell$, resulting in a similar iteration rule.
By a proper choice of the relaxation parameter, the smoothing can be improved a lot.
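The smoothing property of relaxed Jacobi can be demonstrated with a small textbook experiment (this is an illustrative sketch, not code from the paper): on a 1D Poisson matrix, a highly oscillatory error mode is damped far faster than a smooth one.

```python
# Relaxed Jacobi damps oscillatory error much faster than smooth error.
import math

def jacobi_sweep(x, b, n, omega=0.8):
    """One relaxed Jacobi sweep for the 1D Poisson stencil [-1, 2, -1]
    with homogeneous Dirichlet boundaries."""
    ax = [2.0 * x[i]
          - (x[i - 1] if i > 0 else 0.0)
          - (x[i + 1] if i < n - 1 else 0.0)
          for i in range(n)]
    return [x[i] + omega * (b[i] - ax[i]) / 2.0 for i in range(n)]

n = 63
# With b = 0 the iterate itself is the error; start from two sine modes.
smooth_err = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]        # mode 1
rough_err = [math.sin(math.pi * n * (i + 1) / (n + 1)) for i in range(n)]     # mode n
b = [0.0] * n
for _ in range(10):
    smooth_err = jacobi_sweep(smooth_err, b, n)
    rough_err = jacobi_sweep(rough_err, b, n)
```

After ten sweeps the oscillatory mode has almost vanished while the smooth mode is barely reduced; this is exactly why the smoother is paired with a coarse grid correction that handles the smooth modes.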

A stronger smoother than Jacobi is the Gauss-Seidel smoother with $P_\ell := L_\ell^{-1}$ and $L_\ell$ the lower triangular part of $A_\ell$ including all diagonal entries. It results in the smoother $S_\ell^{GS} = I_\ell - L_\ell^{-1} A_\ell$, which has the iteration rule

$x_i = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j - \sum_{j=i+1}^{N} a_{ij} x_j \Bigr).$    (4)

By iteration rule (4) we can already see that this smoother is purely sequential. A way to overcome this is to introduce a coloring [2, Section 3.6.3] for the variables of the linear system, decoupling subsets of the variables which can then be treated in a parallel way. For more details on coloring for iterative linear solvers see e.g. [29, Section 12.4]. Note however that colored Gauss-Seidel versions are not considered here.

3 Convergence and robustness in existing many-core AMG

In this section, we review and analyse the convergence and robustness impact of typical combinations of methods and parameters in many-core AMG. In addition, we propose a set of parameter and method choices which still cover some of the many-core performance considerations while achieving better convergence and robustness results. The analysis will be based on three model problems, i.e. discretizations of a Poisson problem in two and three dimensions as well as discretizations of an elliptic problem in two dimensions with an anisotropic Laplace operator. As discussed in the previous section, the choice of coarsening, interpolation and smoother can strongly impact the quality of the numerical results. Choosing the best combination for a given application is a challenge by its own. Clearly, reviewing the current state of the art in this field for arbitrary applications and all available techniques is out of scope. However, we here aim at discussing choices that have been made recently in the context of GPU-based AMG. In this context, we focus on the two commercial GPU AMG implementations AmgX and GAMPACK. Since both implementations use similar techniques, we have chosen to discuss results with respect to AmgX, more specifically results based on [23].
In [23], the authors introduce AmgX and compare it to the AMG2013 benchmark [1]. The latter is a stripped-down version of BoomerAMG in hypre 2.9.0b. AMG2013 was introduced to perform cluster scalability studies with complex workloads. The benchmark uses a pre-designed application setup and a fixed set of coarsening, interpolation and smoother choices. While these choices seemingly are favorable for a benchmark study, they do not necessarily represent the best choices for convergence and robustness. In fact, BoomerAMG might converge much better for other parameter choices. This should be kept in mind when assessing the impact of our and other results. Moreover, note that, at the time of writing this article, the current version of hypre is no longer 2.9.0b. A comparison of a recent hypre version against AmgX has been made in [25], where performance improvements for hypre were reported.

3.1 Benchmark applications

We report on three benchmarks which can be performed with the AMG2013 benchmark application. The benchmarks solve linear systems from discretized elliptic partial differential equations. The first benchmark is a Poisson problem in three dimensions with homogeneous Dirichlet boundary conditions,

$-\Delta u = f \ \text{in } D, \qquad u = 0 \ \text{on } \partial D,$    (5)

which corresponds for $D = (0,1)^3$ exactly to one important benchmark in [23]. The linear systems that we consider are obtained by discretizing the above problem by a standard second-order seven-point finite difference stencil on a uniform grid with mesh width $h := \frac{1}{N_D + 1}$, where $N_D$ is the number of (inner) grid points in each space dimension. The right-hand side is set to f. We abbreviate this benchmark by the name Poisson3D. It is well known that the condition number of the resulting linear system matrix in Poisson3D scales like $O(h^{-2})$. Hence, it only depends on the mesh width in one dimension. This also means that, for a given amount of unknowns, the condition number of the system matrix for a discretization of a two-dimensional Poisson problem becomes much larger than the condition number in the context of the three-dimensional problem. Consequently, for a given number of unknowns, considering a two-dimensional problem should be more challenging than a three-dimensional problem. Therefore, we also include the two-dimensional test case Poisson2D based on (5) with $D = (0,1)^2$ and a standard five-point stencil discretization into our considerations. Moreover, we are interested in the robustness of the linear solver with respect to anisotropies. Therefore we consider a third benchmark, Anisotropic2D, in which we discretize the elliptic PDE

$-c_1 \frac{\partial^2 u}{\partial x_1^2} - \frac{\partial^2 u}{\partial x_2^2} = f \ \text{in } D, \qquad u = 0 \ \text{on } \partial D,$

with anisotropy coefficient $c_1 \in \mathbb{R}$ on a two-dimensional domain $D = (0,1)^2$. We again use a corresponding second-order five-point finite difference stencil and the same right-hand side.
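A five-point stencil system of the Poisson2D type can be assembled in a few lines. This sketch is illustrative (function name and dict-of-dicts sparse format are assumptions of this example, not the AMG2013 code): N_D inner points per dimension, homogeneous Dirichlet boundaries, entries scaled by 1/h^2 with h = 1/(N_D + 1).

```python
# Hypothetical assembly of a Poisson2D-type matrix (five-point stencil).
def poisson2d(n_d):
    """Return the system matrix as dict-of-dicts: A[row][col] = value."""
    h2_inv = float((n_d + 1) ** 2)        # 1/h^2 with mesh width h = 1/(N_D + 1)
    def idx(i, j):
        return i * n_d + j                # lexicographic numbering of inner points
    A = {}
    for i in range(n_d):
        for j in range(n_d):
            row = {idx(i, j): 4.0 * h2_inv}
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_d and 0 <= nj < n_d:   # Dirichlet: drop boundary
                    row[idx(ni, nj)] = -h2_inv
            A[idx(i, j)] = row
    return A

A = poisson2d(4)
```

The resulting matrix is the symmetric M-matrix assumed throughout Section 2: diagonal 4/h^2, off-diagonals -1/h^2, with interior rows holding five entries and boundary-adjacent rows fewer.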
3.2 Solver test cases

In our convergence studies, we mimic the AMG2013 benchmark and the examples picked in [23]. Therefore we solve the given linear systems $Ax = b$ for an initial solution guess $x_0 = 0$ and let the solver run until the relative residual $\|r_i\| / \|b\|$ of a given iterate $x_i$ drops below $10^{-6}$. AMG is used as a preconditioner in a conjugate gradient solver. Throughout our convergence studies, we stop coarsening when the number of unknowns on the current level drops below a fixed threshold. Moreover, we apply a truncation with $\epsilon_{tr} = 0.2$. Whenever applicable, we further set the strength parameter to $\epsilon_{str} = 0.25$. We use three different sets of parameters for algebraic multigrid. These parameters are summarized in Table 1. The first set of parameters is identical to

parameter                 AMG2013 param.     AmgX param.         parameter proposal
coarsening                PMIS               PMIS                standard coarsening
aggr. coarsening (1st level)  yes            no                  no
interpolation             extended           standard            standard
smoother                  L1-Gauss-Seidel    Jacobi (ω = 0.8)    Jacobi (ω = 0.8)

Table 1 We compare convergence and robustness properties of three sets of parameters for AMG (AMG2013 benchmark, AmgX parameters in [23], our own parameter set proposal).

the settings considered in the AMG2013 benchmark: The coarsening is done using PMIS. Aggressive coarsening, cf. [36], with appropriate interpolation is used on the first level. On all other levels, extended interpolation, cf. [8], is applied. The pre- and post-smoother is always an L1-Gauss-Seidel smoother. Our second set of parameters shall reflect the settings chosen in [23] for AmgX. Unfortunately, [23] does not give all the details about their applied parameters. Therefore, we tried our best to reverse-engineer these parameters using the same software, hardware (Titan cluster), a similar build environment, etc. The parameters reported in Table 1 seem to fit best the results in [23]. That is, we use PMIS as coarsening without aggressive coarsening on the first level. Moreover, we apply standard interpolation and use a (twice applied) relaxed Jacobi smoother with relaxation parameter ω = 0.8. The final set of parameters is our own proposal of parameters, which is closer to the classical way to build AMG. That is, we choose standard coarsening, no aggressive coarsening on the first level, standard interpolation and a relaxed Jacobi smoother with relaxation parameter ω = 0.8. As we will see in our results, the convergence properties for this approach clearly outperform the AmgX parameters and are at least as good as the AMG2013 parameters. We will argue that these parameters show good convergence properties and robustness with respect to anisotropies. Meanwhile, the Jacobi smoother still allows for a very high parallelism in the solve phase. The crucial difference to AMG2013 will be the choice of a mainly sequential AMG setup.
However, as we will see, there is still a lot of room for a many-core parallelization of major pieces of the setup phase while keeping the only truly sequential part of the standard coarsening on CPU. Note that we perform all convergence studies in this chapter with hand-modified versions of the AMG2013 benchmark. Hence, we use a reliable, well tested CPU-based code to do the comparison of the different choices of parameter sets. We even use the same hardware as in [23], i.e. we run our studies on the compute nodes of the cluster Titan at Oak Ridge National Lab. The Titan cluster is a Cray XK7 system. Each node is equipped with a 16-core 2.2 GHz AMD Opteron 6274 processor and 32 GB of RAM. We used the GCC compiler and MPICH and compiled AMG2013 and the hand-modified versions with compiler flag -O3. Within our study, similar to [23], we also compare the convergence properties between benchmarks running sequentially on one node and benchmarks being distributed to 8 cores by the MPI-parallelization within AMG2013 or hypre, respectively. As we will see, the use of a parallelized setup phase will have an impact on the setup and solve phase.

[Fig. 1: plots of iteration counts vs. N_D for Poisson3D (left, sequential) and Poisson2D / Anisotropic2D with c_1 = 1 (right, sequential), comparing the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 1 In some cases, we observe significant differences in the convergence behavior for the three solver parameter settings in the Poisson3D application problem (left) and the Poisson2D application problem (right).

3.3 Numerical results

In the following, we discuss the convergence results that we obtain with the above given parameters. We always report the number of iterations of the solver with respect to the parameter N_D, i.e. the discretization parameter used above for defining the finite difference discretization. This parameter is not identical to the number of unknowns, which is $N_D^3$ and $N_D^2$ in Poisson3D and Poisson2D/Anisotropic2D, respectively. On the left-hand side of Fig. 1, we show the results for the Poisson3D benchmark running sequentially for the different parameter sets defined before. This benchmark is shown in [23]. We observe a growing number of iterations for growing problem size. Hence, all benchmarks seem not to be in an asymptotic regime, where we would expect to see a constant or at least very slowly growing number of iterations. The general tendency is that our parameters outperform the parameters of AMG2013. The worst results are seen for the AmgX parameters. In contrast, the results in Fig. 1, on the right-hand side, are in the asymptotic regime. For the AMG2013 parameters and our parameter proposal, we observe almost constant iteration counts when going to finer discretizations with an asymptotically high condition number. In contrast, the AmgX parameters do not result in the AMG-type optimal convergence. The more expensive extended interpolation in connection with the Gauss-Seidel-type smoother seems to recover the optimal convergence of AMG in presence of the less strong PMIS coarsening for AMG2013.
The results on the right-hand side of Fig. 1 together with the two results in Fig. 2 form the robustness study Anisotropic2D for growing anisotropies c1 = 1, 10, 100. The tendency of the results fits together with the results obtained in the previous benchmarks. The parameters from AMG2013 and our parameter proposal show similar convergence results. Both parameter sets are almost perfectly

[Figure 2: panels "Anisotropic2D, c1 = 10, sequential" (left) and "Anisotropic2D, c1 = 100, sequential" (right); #iterations over N_D for the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 2 The (potential) parameters of AmgX in [23] lead to a strong dependence on the anisotropy parameter, comparing c1 = 1 (Fig. 1, right), c1 = 10 (left) and c1 = 100 (right). That is, AMG is not robust with respect to anisotropies in this case.

[Figure 3: panels "Poisson3D, sequential" (left) and "Poisson3D, parallel" (right); #iterations over N_D for the AMG2013 parameters, the AmgX parameters and our parameter proposal.]

Fig. 3 When running AMG2013 / hypre in parallel, the coarsening in our parameters gets degraded, leading to a worse solver performance.

robust with respect to the anisotropy. On the other hand, the AmgX parameters are not robust with respect to anisotropy. Finally, we also compare, similar to [23], convergence results for the sequential use of AMG2013 / hypre (as before) with results obtained when distributing the work over eight MPI processes. Fig. 3 highlights these results, comparing the sequential results for Poisson3D on the left with the parallel results for Poisson3D on the right. It becomes obvious that the PMIS-based parameter sets converge almost identically in the sequential and in the parallel case. In contrast, the parallelization of the standard coarsening requires corrections at the interface of the decomposed domains. Thereby performance results degrade in the parallel case, still achieving similar convergence as with the AMG2013 parameters.

3.4 Discussion

As expected, our results show that the best possible choice of parameters is not perfectly clear, even in our minimalistic benchmark. The standard parameters in AMG2013 always produce almost problem-size independent convergence, are robust with respect to anisotropies and seem to be robust with respect to an increase in the number of processors for the parallel setup phase. The good parallel properties might be the reason why the authors used these parameters for the parallel benchmark AMG2013, even though the classical parameters in BoomerAMG/hypre are much more similar to our proposed parameter set. However, the AMG2013 parameters use the sequential Gauss-Seidel-type smoother and the rather expensive extended interpolation. The AmgX parameters, keeping in mind that we don't know the exact parameters and just reverse-engineered them, seem to be not robust to anisotropies and seem to show a growing number of iterations for larger problem sizes. However, since a Jacobi smoother is used, the solve phase will run very efficiently on many-core hardware. The setup phase will be efficient due to PMIS. This is also reported in [23], where the authors show impressive performance results comparing AmgX to AMG2013. That is, with the parameters of AmgX used in [23], the convergence and robustness properties of the method are lowered at the gain of a pre-asymptotically fast solution behavior. This is a typical balancing done for numerical methods for many-core processors. Our proposed parameters shall introduce a third flavor. That is, we aim for the best (asymptotic) convergence rates, still seeking a method that will have a very parallel and efficient solve phase. However, achieving a good many-core performance in the setup phase, when using standard coarsening and standard interpolation, is challenging. In fact, parallelizing some part of the coarsening is impossible.
However, as we will show, it is still possible to parallelize a major part of the setup phase even if the coarsening is done on the CPU.

4 Hybrid many-core parallelization strategies

In this section, we discuss algorithms and parallelization strategies for AMG on many-core hardware. These algorithms are generic in terms of the target many-core hardware. That is, their efficient implementation shall not be restricted to GPUs but should also be possible for, e.g., Xeon Phi processors. To this end, the applied amount of hardware-specific functionalities is reduced to a minimum. We do this with an educative objective: We want to lower the bar for beginners in many-core computing to be able to write scientific codes and to focus more on the core idea in many-core parallelization, i.e. exposing a high degree of parallelism. We first give an overview of the details of our general strategy for a many-core parallelization. Thereafter, we explain the general methodologies of sparse matrix construction and local graph traversal. This general consideration allows to simplify the discussion of parallelization details for the setup phase with the (hybrid) C/F splitting and the interpolation operator construction, afterwards. Moreover, we address the parallelization of the V-cycle and the smoothers. Finally, details of the concrete implementation on GPU are discussed.

4.1 General many-core parallelization strategy

Our general base strategy for many-core parallelization is to move a big portion of the parallelization complexity to libraries. In the context of the given application (AMG) this means that we use libraries implementing a standard set of routines for (sparse) linear algebra (sparse matrix formats, matrix-matrix products, matrix-vector products, scalar products, ...). We claim that such libraries exist, with the (incomplete) list of examples containing CUSP, CUSPARSE, LAMA, ViennaCL, PARALUTION, GHOST, MAGMA, MKL. Similarly, we use libraries which implement STL vector algorithm type methods in parallel, such as sum, exclusiveScan or maximum. Again, several libraries allow this, including but not being limited to Thrust, ArrayFire and Boost.Compute. Only if we are lacking appropriate library support, we actually propose to implement new many-core parallel functions. We call such functions kernels. In addition to the function parameters, we pass the amount of parallel threads on which the kernel executes its code. Within the kernel code, the thread index is fetched first. Afterwards, computations are done relative to that thread index, cf. Algorithm 5 for an example. Writes into the many-core processor's memory are not considered to be consistent / thread-safe during the kernel execution. That is, while executing a kernel, two different threads shall not write into the same memory location. The only exception are atomic operations. They serialize conflicting parallel writes, if necessary. However, this might impact performance. A global synchronization over all threads is done at the end of the kernel execution. Note again that these assumptions are strong simplifications of the usually much more elaborate memory and thread execution models used in recent many-core processors. However, they make it much easier to introduce many-core algorithms.

4.2 CSR matrix construction

We store matrices in the compressed sparse row (CSR) matrix format.
It encodes a sparse matrix A ∈ R^{N×N} with N_nonzero non-zero entries by an array row_offsets of indices (of length N + 1) describing the offset to take in column_indices and values to find the entries of a given matrix row, an array column_indices of indices (of length N_nonzero) describing the column index of a given row entry, and an array values of (double precision) floating point values (of length N_nonzero) describing the respective matrix entries. Throughout the construction of the interpolation operators, we have to assemble sparse CSR matrices. This can be done rather easily in parallel, when using one parallel thread per row of the matrix that shall be filled. Algorithm 4 shows the necessary steps. First, a generic function countEntries mimics the work that would be done for generating the matrix entries. This is done in a completely parallel way. While doing this, instead of writing the data into the new matrix, the number of generated non-zero entries per row is counted and stored in the array num_entries_per_row. Then, a parallel sum and a parallel exclusive scan operation allow to compute the total number of non-zeros and the offsets for the CSR matrix data structure. Finally, another generic function fillCSRMatrix does the actual work and fills the system matrix as simulated in countEntries before.

Algorithm 4 Generic method to construct a CSR matrix in parallel
1: function buildCSRMatrix(A, N, ...)
2: allocate num_entries_per_row[N]
3: setValue_N(num_entries_per_row, 0)
4: countEntries_N(num_entries_per_row, ...) ▷ count number of entries per row
5: total_entries := sum_N(num_entries_per_row) ▷ compute number of non-zeros
6: allocate row_offsets[N+1], column_indices[total_entries], values[total_entries]
7: exclusiveScan_N(num_entries_per_row, row_offsets) ▷ compute offsets
8: fillCSRMatrix_N(row_offsets, column_indices, values, ...) ▷ fill matrix

Fig. 4 In the local graph traversal, the next neighbours of each node are visited up to a fixed depth (indicated by the different ellipse sizes). The many-core parallelization makes use of the independence of each traversal (indicated by the different line types).

Even though the above approach actually requires to compute the sparse matrix twice, it is still a very effective way to construct a CSR matrix in parallel.

4.3 Local graph traversal up to a fixed depth

Several algorithms within the AMG setup phase require to iterate over neighbours and/or neighbours of neighbours of a node in the graph representation of the linear system. This can be translated to a local graph traversal up to a depth of one/two starting from each node in the graph, cf. Fig. 4. In the following, we formulate a generic many-core parallel local traversal up to depth two for a graph that is given by an adjacency matrix stored in the CSR format. The easiest way to parallelize this embarrassingly parallel local graph traversal is by parallelizing it over the starting nodes for each of the traversal steps. Algorithm 5 formalizes this idea. It is launched in parallel with N threads and a sparse matrix A ∈ R^{N×N} given as parameter. In each thread, first the current thread index idx is retrieved. Then, each neighbour of node idx is visited by iterating over all non-zero off-diagonal elements in the idx-th matrix row. The same idea is repeated to iterate over all neighbours of a visited neighbour node.
Clearly, Algorithm 5 is not necessarily the most efficient parallelization approach for a specific many-core architecture. To give an example, an implementation of the algorithm in the CUDA programming model, running on GPUs of Nvidia, might not achieve the fastest possible performance due to thread divergence. Here, e.g. [22] gives an excellent overview of optimization techniques. However, Algorithm 5 is very generic and delivers a high degree of parallelism for a lot of many-core architectures. Still, optimized graph traversal algorithms are considered future work.

Algorithm 5 Generic kernel for local graph traversal up to depth two
1: function traverse_N(A, N, ...)
2: idx := getThreadIndex()
3: for jj ∈ [A.row_offsets[idx], A.row_offsets[idx+1] − 1] do
4: j := A.column_indices[jj] ▷ get column index of current matrix entry
5: if (idx ≠ j) AND ... then
6: ... do something with weight on edge (idx, j)
7: for kk ∈ [A.row_offsets[j], A.row_offsets[j+1] − 1] do
8: k := A.column_indices[kk]
9: if (j ≠ k) AND ... then
10: ... do something with weight of edge (j, k)

4.4 Hybrid C/F splitting

We use the C/F splitting following Algorithm 3. The coarsening algorithm is the only part of the code which is only partially parallelized on many-core hardware. Steps three and four of Algorithm 3 compute the strong influence weights λ_i for the current level. This operation can be parallelized in a many-core fashion. Here, the first necessary step is to pre-compute the maximum (negative) non-diagonal entry for each matrix row, thus max_{a_ik < 0} (−a_ik). The parallel graph traversal idea of Section 4.3 is applied for this. Using these maxima, it is possible to identify strongly coupled nodes in an algorithm for evaluating the λ_i. The weight evaluation algorithm again traverses the graph up to a depth of one. Whenever a strongly influencing node is found, the appropriate weight λ_i is increased by an atomic operation. The remaining part of Algorithm 3 is implemented sequentially, including the necessary copy operations between the sequential processor (CPU) and the many-core processor. To achieve an optimal computational complexity, the lookup of the largest weight λ_i in step 6 of the algorithm is performed with a priority queue data structure [7, Section 6.5]. Since e.g. the STL-based implementation of a priority queue does not allow a constant-complexity maximum-find and -removal operation (in fact, that implementation has a logarithmic complexity in that case), the data structure is hand-implemented.
The proposed implementation uses bucket sort [7, Section 8.4] as base algorithm and allows to achieve the designated constant-complexity removal operation with a linear-complexity data structure setup time. Overall, this keeps a linear complexity for the standard coarsening algorithm. Finally, while skipped in Algorithm 3, it is necessary to identify the mapping between the newly found coarse grid points and their position on the next coarser grid level. This can be implemented by means of a many-core parallel STL vector algorithm-type operation. Fig. 5 shows an example for the approach. A given array coarse contains per node the flag whether the node will become a coarse grid node. By applying an exclusiveScan operation to that array, an enumeration of the coarse grid nodes is generated. As long as one accesses only those entries in fine_to_coarse that correspond to a new coarse grid node, the correct mapping will be returned.

Fig. 5 An exclusive scan operation allows to compute the mapping between a coarse node on the current and the next multigrid level. Only entries with light background will be accessed.

[Figure 6: sparsity pattern of A with boolean indicator values t/f in place of the non-zero entries.]

Fig. 6 In the standard interpolation operator construction, we reuse the existing sparsity pattern of A to store the set of interpolatory points P_i = C_i via a boolean indicator (t/f).

4.5 Interpolation operator construction

Direct interpolation

The many-core parallel implementation of direct interpolation starts by pre-computing the intermediate coefficients

α_i = (Σ_{j∈N_i} a_ij) / (Σ_{k∈P_i} a_ik).

To build the two sums, a parallel local graph traversal up to a depth of one becomes necessary. The core idea for this has been discussed in Section 4.3. The weights w_ik are computed while building the interpolation matrix following the ideas in Section 4.2. The positions of the interpolation coefficients in the finally constructed interpolation matrix are determined by the lookup table created at the end of the C/F splitting as outlined in Section 4.4. Overall, the direct interpolation matrix is constructed fully in parallel.

Standard interpolation

Implementing standard interpolation for many-core processors is more involved. Here, the algorithm starts by finding the set of interpolatory variables P_i = C_i for direct interpolation. Since this is necessary for each row, a sparse matrix is constructed to store this information. We reuse the existing data structure of the sparse system matrix and replace the array of non-zeros by boolean indicator values, as shown in Fig. 6. The boolean values (t/f) store whether a given neighbour node is part of the interpolatory set or not. Filling this indicator matrix can be done by parallelization over the number of rows of the system matrix, similar to the local graph traversal in Section 4.3. Next, the original system matrix is modified such that strongly coupled fine grid points are expanded as e_j ← −Σ_{k∈N_j} a_jk e_k / a_jj.
Note that this construction is rather compute-intensive and might be replaced by a more sophisticated method in the future. It requires e.g. to implement a sparse

matrix row addition. The new set P̂_i = C_i ∪ (∪_{j∈F_i} C_j) of interpolatory variables is computed as well. Finally, the algorithm reproduces the direct interpolation implementation, now using the expanded system matrix and P̂_i. Truncation in the interpolation matrix can also be done in a many-core parallel way. This is done within the construction algorithm for interpolation matrices, to avoid building new sparse matrices. However, this algorithm is also available as a stand-alone method.

4.6 V-cycle and smoothers

Due to the (assumed to be) available linear algebra primitives, the implementation of a V-cycle on many-core hardware is straight-forward, since the V-cycle only contains standard matrix-matrix, matrix-vector and vector-vector operations, cf. Algorithm 1. Within the V-cycle of Algorithm 1, we replace the direct solver on the coarsest level by a Jacobi-preconditioned CG solver. For now, we only use a relaxed Jacobi iteration as smoother. The construction and application of the Jacobi smoother can be easily realized by the available linear algebra primitives.

4.7 Implementation on GPU

Our concrete implementation of algebraic multigrid for a single GPU is based on the sparse linear algebra library CUSP [5] in version 0.4.0 and the STL vector algorithm library Thrust, which comes with the CUDA Toolkit 7.5. The first library provides a set of linear algebra primitives (matrix-matrix products, matrix-vector products, ...), iterative solvers (CG, GMRES, ...) and preconditioners (Jacobi, ...) for sparse matrices of different sparse matrix formats, including CSR. Thrust provides operations such as sum or exclusive scan. The implemented code integrates with the CUSP framework, thus the new Ruge-Stüben classical AMG can be used as preconditioner for the standard CUSP solvers. It can furthermore be applied as a stand-alone solver. Compute kernels are implemented as CUDA kernels based on the CUDA Toolkit 7.5. Note that we do not use the latest CUSP version due to issues with the matrix-matrix product implementation.
Moreover, we did not consider using the more recent CUSPARSE library instead of CUSP, since, at the time of starting this software project, CUSPARSE was still at a very early development stage. Finally, CUDA Toolkit version 8.0, the latest version at the time of writing this article, is currently not supported on our target benchmark system.

5 Performance of the hybrid GPU AMG

In the following, we discuss convergence and performance results of our hybrid GPU AMG. First, we double-check the convergence and robustness of our code by repeating the convergence studies of Section 3 with our specific implementation. The remaining part of the section is a performance benchmark. To this end, we first introduce our benchmark setup in terms of hardware and software. Afterwards, a general performance comparison between our implementation and AMG2013 (following [23]) is given. As we will see, the solve phase is fast, however our expectations in terms of speedup for the setup phase on GPU are not met.

[Figure 7: panels "Poisson3D" (left) and "Poisson2D / Anisotropic2D, c1 = 1" (right); #iterations over N_D for direct and standard interpolation.]

Fig. 7 Our hybrid GPU implementation of AMG attains the predicted convergence behavior for our proposed parameter set when using standard interpolation. This is shown here for the Poisson3D (left) and the Poisson2D (right) test case. Direct interpolation shows slightly weaker convergence results.

Therefore, we further give a detailed performance analysis and discussion for the setup phase. Finally, we show results for a real-world test problem on a complex geometry.

5.1 Convergence and robustness check

We repeat the convergence studies performed in Section 3, i.e. Poisson3D, Poisson2D and Anisotropic2D, for the proposed hybrid GPU AMG implementation. Hence, we show results for the parameter set that was proposed in Section 3.2, but now for our own implementation. The objective of this study is to double-check that our implementation is correct and that it is not affected by any parallelization issue. As discussed in Section 3.2, we use standard coarsening, standard interpolation and a twice-applied relaxed Jacobi pre- and post-smoother with relaxation parameter ω = 0.8. Truncation is used with threshold 0.2 and the strength parameter is 0.25. The recursive construction of coarser levels is stopped as soon as a given level has less than 10 unknowns. The coarse grid solver is a Jacobi-preconditioned CG method converged to an absolute residual of 10^{-2}. The V-cycle is used as preconditioner within a CG method. In addition to these parameters, we also discuss convergence results for direct interpolation, since, as we will see, the GPU-parallel version of direct interpolation is much faster than the one for standard interpolation. Convergence results are again reported for the benchmark problems defined in Section 3.1 with the stopping criterion of the outer CG solver set to a relative residual norm of 10^{-6}. In Fig. 7 we report the convergence results for the benchmark problems Poisson3D and Poisson2D on the left-hand side and right-hand side, respectively. In both cases, the convergence when using standard interpolation matches the convergence results predicted in Section 3: The three-dimensional problem still has a very slight increase in iteration count for a grid size of up to N_D = 128, i.e. N = 128^3 unknowns. In contrast, the two-dimensional problem is already in its asymptotic convergence behavior with a constant number of iterations for growing problem size. In the two-dimensional problem, direct interpolation only slightly increases the number of iterations until convergence. This behavior is worse for the Poisson3D problem, where there is a stronger, but still acceptable, increase in the number of iterations when using direct interpolation. The impact of the weaker direct interpolation becomes clearly visible when considering the Anisotropic2D test case in Fig. 8. Direct interpolation, together with our other parameters, is not robust to anisotropies. With growing anisotropy, the number of iterations until convergence becomes much larger. In contrast, standard interpolation shows only a very slight increase in the number of iterations, i.e. it is almost perfectly robust. Again, this matches the convergence behavior predicted in Section 3.

[Figure 8: panels "Anisotropic2D, c1 = 10" (left) and "Anisotropic2D, c1 = 100" (right); #iterations over N_D for direct and standard interpolation.]

Fig. 8 Direct interpolation is not robust to the anisotropy in the Anisotropic2D benchmark. Meanwhile, our proposed parameter set applied in our hybrid GPU implementation achieves the predicted almost perfect robustness.

5.2 Performance benchmark setup

Our performance benchmark shall be comparable with the results presented in [23]. To this end, we use the same hardware and in general the same software environment as in this study. Our benchmark platform is a Cray XK7 system, namely the GPU cluster Titan at Oak Ridge National Lab. Our performance comparisons are limited to a single node of this cluster. Each node contains a 16-core 2.2 GHz AMD Opteron 6274 processor and 32 GB of RAM. The GPU is an Nvidia Tesla K20X.
We use the standard GNU compiler framework on that cluster, i.e. GCC with MPICH. The Nvidia CUDA Programming Toolkit is used in version 7.5. As stated before, we further use the library Thrust, as it comes with the CUDA Toolkit, as well as CUSP in version 0.4.0. All software is compiled with compiler flag -O3 for optimization. The GPU code is compiled for the specific compute capabilities of the Tesla K20X. That

is, we use the additional compiler flag -arch=sm_35. To comply with [23], the AMG2013 benchmark is run either sequentially or in parallel using MPI on eight cores of a single node of Titan. The parallel MPI processes are distributed evenly over the physical cores of the Opteron processor. Reported AMG2013 timings correspond to the timings reported as wall clock time by the AMG2013 benchmark. Timings reported for our hybrid AMG GPU implementation were measured with the gettimeofday command and thereby also represent wall clock times. Our GPU code fully runs at double precision. The GPU AMG implementation is not intended to be an accelerated part of an existing CPU code but a solver within a full GPU code. This is why, in the hybrid GPU case, it is expected that the system matrix, right-hand side and initial guess are stored in GPU memory before the benchmark starts. Nevertheless, all remaining transfer times between CPU and GPU memory, as well as the full CPU and GPU compute time, are included for the hybrid GPU setup/solver test runs.

5.3 Performance comparison against AMG2013

In the following, we report the performance of our hybrid GPU AMG implementation, comparing it to the performance of the AMG2013 benchmark for the benchmark applications defined in Section 3.1. It shall be noted here again that the specific set of AMG parameters used in AMG2013 might not reflect the best possible choice of parameters for a CPU-based AMG method. However, for the sake of comparable results with [23], we keep these parameters even though BoomerAMG might be much faster for other parameters (and more recent versions of hypre).

5.3.1 Setup phase

In Figures 9 and 10, we show the different runtimes for the setup phase of the Poisson3D, Poisson2D and Anisotropic2D benchmarks using AMG2013 sequentially or in parallel on CPU and using our hybrid GPU AMG implementation on GPU. It turns out that the setup phase of our GPU AMG implementation is surprisingly slow, especially in comparison to the parallel AMG2013 test case.
For direct interpolation, we often outperform the sequential AMG2013 benchmark. However, the proposed use of standard interpolation on GPU is always the slowest method. To study the reasons for this rather pessimistic performance result, we do a detailed performance analysis of the setup phase for the Poisson3D test case for N_D = 128. This study is presented and discussed in Section 5.4.

5.3.2 Solve phase

The performance results for the GPU solve phase are much better than the performance results for the setup phase. In Fig. 11, we give on the left-hand side the performance results of the solve phase for the Poisson3D test case and on the right-hand side the results for the Poisson2D test case. In the three-dimensional test case, the GPU-parallel solve phase (for standard interpolation) is by a factor of roughly 3.3 faster than the AMG2013 benchmark parallelized on eight cores. Moreover, the GPU AMG is faster by roughly a factor of 7.1 in the two-dimensional test case and the same interpolation. Note again that we use here only eight of the available 16 cores of a Titan node, as done in [23]. Also, there exists a newer version of hypre, which is a superset of the AMG2013 benchmark with more recent performance optimizations [25].

[Figure 9: runtime of the setup phase (in sec.) over N_D for Poisson3D (left) and Poisson2D / Anisotropic2D, c1 = 1 (right); curves: hybr. GPU (direct interp.), hybr. GPU (std. interp.), AMG2013 sequential, AMG2013 parallel.]

Fig. 9 We compare the runtime of the setup phase for our hybrid GPU implementation (using direct and standard interpolation) with the runtime of the setup phase of the AMG2013 benchmark (left: Poisson3D benchmark, right: Poisson2D benchmark).

[Figure 10: runtime of the setup phase (in sec.) over N_D for Anisotropic2D with c1 = 10 (left) and c1 = 100 (right); same curves as in Fig. 9.]

Fig. 10 The Anisotropic2D benchmark for c1 = 10 (left) and c1 = 100 (right) outlines a similarly low performance for the setup phase of our hybrid GPU AMG compared to the AMG2013 benchmark.

It has to be made clear at this point that our positive performance results for the solve phase could only be achieved since the chosen parameters in the setup phase lead to an iterative method with strong convergence and robustness, as discussed before. To underline this, we show the performance of the solve phase for the Anisotropic2D test case in Fig. 12. Using the weak direct interpolation, we only get a low (c1 = 10) or no speedup (c1 = 100) compared to the parallel AMG2013 benchmark. However, the robust standard interpolation leads to speedups similar to the Poisson2D test case.

[Figure 11: runtime of the solve phase (in sec.) over N_D for Poisson3D (left) and Poisson2D / Anisotropic2D, c1 = 1 (right); curves: hybrid GPU (direct interp.), hybrid GPU (std. interp.), AMG2013 parallel.]

Fig. 11 In the solve phase, the hybrid GPU AMG implementation clearly outperforms the parallel AMG2013 benchmark, both for the Poisson3D (left) and the Poisson2D (right) test case.

[Figure 12: runtime of the solve phase (in sec.) over N_D for Anisotropic2D with c1 = 10 (left) and c1 = 100 (right); same curves as in Fig. 11.]

Fig. 12 The hybrid GPU AMG solve phase also clearly outperforms the parallel AMG2013 benchmark when using standard interpolation in the Anisotropic2D test case. However, the performance advantage of the GPU gets lost if the chosen set of parameters for AMG does not lead to a robust method, as shown here for direct interpolation.

In Fig. 13, we summarize the performance results for the setup and the solve phase for the largest test cases of the Poisson3D and the Poisson2D problem. The parallel version of the AMG2013 benchmark is faster than the hybrid GPU AMG in the setup phase, while the GPU code outperforms AMG2013 in the solve phase. Very often, it is possible to reuse an existing multigrid hierarchy from one setup phase for several solve phases, e.g. if a single problem is solved for many right-hand sides. In this case, relatively speaking, the GPU-based AMG could become much faster. Meanwhile, we only show in Fig. 13 the total compute time for a single setup and a single solve phase. Here, AMG2013 outperforms our implementation.

[Figure 13: bar charts of time in seconds for setup phase, solve phase and total, comparing hybrid GPU (std. interp.), AMG2013 sequential and AMG2013 parallel; left: Poisson3D performance, N_D = 128; right: Poisson2D performance, N_D = 2048.]

Fig. 13 In summary, the parallel version of the AMG2013 benchmark is faster for the setup phase, while the hybrid GPU AMG is faster for the solve phase.

5.4 Detailed performance analysis of the setup phase

As seen in Section 5.3.1, the performance of the hybrid GPU AMG does not meet our performance expectations when comparing it to the AMG2013 benchmark on CPU. Therefore, we make an in-depth analysis of the performance of all components of the setup phase. To abbreviate the discussion, we focus on a single test case, the Poisson3D problem with N_D = 128. Moreover, we only discuss the performance of the setup phase on the finest grid level. In our study, we would like to analyse the impact of our parallelization of the setup phase. To do this, we compare our many-core parallel implementation with a sequential base CPU implementation that was initially done by the author. Note again that comparing a single-core CPU implementation to a many-core implementation is of course unfair. However, if we do not see significant speedups within this comparison, we have no chance to get a fast many-core implementation anyway. In Table 2, we report the results of our performance comparison against the reference CPU implementation, both for direct and standard interpolation. We include direct interpolation, since we want to find the reason for the strong performance differences between direct and standard interpolation.

5.4.1 C/F splitting

The first two timings in Table 2 correspond to the initial part of the C/F splitting, which is parallelized on many-core hardware.
We compare the CPU and GPU timings for computing the maximum non-diagonal entry for each matrix row and the strong influence weights λ_i, as discussed in Sections 2.3 and 4.4. Based on the speedups, we conclude that the parallelization of this part of the C/F splitting is very successful. Actually, speedups in a range of 10 would correspond to a speedup of 6-10 when comparing equally priced hardware. The next block of timings corresponds to the part of the setup phase which is done on CPU. In the many-core implementation, we first have to copy the data

Table 2 The detailed benchmark analysis for the setup phase (Poisson3D, N_D = 128, finest grid) compares runtimes of all components of the proposed hybrid GPU implementation with those of a baseline CPU implementation running on one CPU core. For direct and standard interpolation, it reports CPU time [s], GPU time [s] and the speedup factor for the components of the C/F splitting (max. neg. nondiag., strong influence, GPU to CPU transfer, CPU data structure, CPU coarsening, CPU to GPU transfer, coarse node count, fine to coarse map), of the interpolation (coeffs. for interp., direct interp. matrix, interp. point set, modified matrix, max. neg. nondiag., coeffs. for interp., direct interp. matrix) as well as for the restriction operator, the Galerkin operator and the Jacobi smoother.

back to CPU memory. Afterwards, the CPU part of the coarsening is applied and the results are copied back to the GPU. We can make several observations here. First, the CPU-based coarsening in the GPU implementation is the most expensive part of the setup phase. Here, it has to be noted that the standard single-core performance of the AMD Opteron 6274 processor is relatively low and Turbo CORE, i.e. the AMD technology to overclock a single core, seems not to be switched on on Titan. Second, our CPU coarsening implementation seems to be clearly slower than coarsening implementations in e.g. AMG2013. This is not surprising, knowing that the underlying library hypre has been developed for many years. Third, it would be worthwhile to overlap the CPU data structure construction with major parts of the GPU to CPU data transfer. However, this is not possible with CUSP in version 0.4.0, at least to the author's knowledge, since access to CUDA streams is not implemented in this version yet. In the final many-core parallel part of the C/F splitting, we compute the number of coarse nodes and get the fine to coarse node mapping as discussed in Section 4.4.
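These two steps map to standard data-parallel primitives: given a 0/1 coarse-node marker per node, a reduction yields the coarse-node count and an exclusive scan yields the fine-to-coarse index map. A NumPy stand-in for the corresponding `thrust::reduce` / `thrust::exclusive_scan` calls:

```python
import numpy as np

def coarse_count_and_map(is_coarse):
    """is_coarse: 0/1 marker per node from the C/F splitting.
    Returns the number of coarse nodes (a reduction) and, for each node,
    the coarse-grid index of that node if it is coarse (an exclusive scan)."""
    is_coarse = np.asarray(is_coarse, dtype=np.int64)
    n_coarse = int(is_coarse.sum())                    # reduction
    fine_to_coarse = np.cumsum(is_coarse) - is_coarse  # exclusive scan
    return n_coarse, fine_to_coarse
```

For example, the marker [1, 0, 1, 1, 0] yields three coarse nodes with coarse indices 0, 1 and 2 at fine positions 0, 2 and 3.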
The reduction operation to compute the number of coarse nodes is (surprisingly) slower on GPU than on CPU, but in general negligible due to its low overall performance impact. The exclusive scan operation for the node mapping is faster on GPU by a factor of nine. This is the (acceptable) speedup delivered by the Thrust library and corresponds to an almost equal performance in a hypothetical equally-priced-hardware comparison.

5.4.2 Interpolation

Next, we consider the different interpolation strategies. In direct interpolation, we first pre-compute the coefficients by the (parallel) graph traversal strategy.

This is fast on GPU with a speedup of beyond 20, i.e. even a decent speedup in a multi-core to many-core comparison. The actual construction is by a factor of 25 faster on GPU than on CPU. This is in the range of a speedup of two when comparing equally priced hardware. This is a good result. The overall time spent in the standard interpolation is much higher. Our parallelization strategy to compute the set of interpolatory points is successful, with a speedup in the range of 30. However, seemingly, our CPU baseline implementation to construct the modified system matrix, cf. Section 4.5, is too slow. Its parallelized version achieves a speedup of roughly 20, which is acceptable. However, the general approach to compute the modified system matrix has to be rethought. Afterwards, the maxima of the negative non-diagonal weights have to be updated. As in the C/F splitting, this is very fast on GPU. The computation of the coefficients for the modified system matrix is much slower than in the direct interpolation case. First, we use a slightly different version of the compute kernel here, since we also include truncation. Second, the modified system matrix has a (significantly) larger number of non-zeros per row. Thereby, each compute kernel has much more sequential work to do, presumably leading to higher thread divergence. Consequently, only a moderate speedup of above 6 is achieved. Finally, the kernel to compute direct interpolation is applied to the modified system matrix. This kernel is again fast.

5.4.3 Restriction operator, Galerkin product, smoother

The final three steps of the setup phase are the construction of the restriction operator by transposing the interpolation matrix, the triple matrix product to construct the Galerkin operator and the construction of the Jacobi smoother. Transposing on GPU is more than a factor of 6 faster than on CPU. This speedup is delivered by the CUSP library and is acceptable. The triple product is again done using CUSP and is by a factor of 4 faster.
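These three steps amount to a sparse transpose, a sparse triple product and a diagonal extraction. A scipy.sparse stand-in for the corresponding CUSP calls (the weighted-Jacobi damping factor omega = 2/3 is an assumption for illustration, not a value taken from the implementation):

```python
import numpy as np
import scipy.sparse as sp

def setup_coarse_level(A, P, omega=2.0 / 3.0):
    """Given the fine-level operator A and the interpolation P, build
    the restriction R = P^T, the Galerkin operator A_c = R A P and the
    scaled inverse diagonal used by the weighted Jacobi smoother."""
    R = P.T.tocsr()                  # restriction by transposing P
    A_c = (R @ A @ P).tocsr()        # Galerkin triple matrix product
    inv_diag = omega / A.diagonal()  # Jacobi update: x += omega * D^{-1} * r
    return R, A_c, inv_diag
```

For a 3-point 1D Laplacian with a single coarse point and linear interpolation P = (0.5, 1, 0.5)^T, this produces the expected 1x1 Galerkin operator.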
For the triple product, we would have hoped to see higher performance. The CUSP-based setup of the Jacobi smoother, in any case, is fast on GPU.

5.4.4 Summary and discussion

As it turns out, our general many-core parallelization strategies (parallel local graph traversal, matrix construction, ...) are fast. Whenever we use hand-implemented kernels, we get a decent performance improvement. The speedup obtained using the CUSP library is sometimes low. Note, however, that we do not use the latest version of CUSP, as discussed before. Nvidia has more recently introduced cuSPARSE, a commercialized version of CUSP, which might deliver higher performance. Using this library is future work. Our way of constructing the standard interpolation via the modified system matrix seems to be a major performance bottleneck. This approach has to be rethought. Finally, the major performance limitation of the hybrid GPU AMG setup phase is the sequential CPU part. We still argue that it is more important to have a robust numerical method; therefore, using the sequential C/F splitting is favorable. Nevertheless, research on truly parallel and robust versions of the C/F splitting should be undertaken. In addition, we argue that a strong GPU or many-core processor should, in the future, be coupled with a few very fast CPU cores. This would reduce the negative impact of tightly coupled hybridization approaches such as the one discussed here.

Fig. 14 The finite element discretization of the Poisson problem (6), solved on a gear geometry [2], is used as real-world application test case.

5.5 Real-world application example

We finish this section with a more realistic application example. It is a Poisson problem solved on the complex gear geometry shown in Fig. 14. The exact mathematical problem is

    -Delta u = -6                  in D,        (6)
           u = 1 + x_1^2 + 2 x_2^2 on dD,

with D a subset of R^3 the inner part of the gear geometry. The exact solution of this problem is u = 1 + x_1^2 + 2 x_2^2 on the full domain D and thereby constant with respect to the third coordinate direction. Note that it would be rather involved to solve such a complex geometry problem with a geometric multigrid method. However, AMG can solve such problems out of the box. The mesh that defines the geometry, i.e. dD, is available at [2]. An initial meshing of D is done using tetgen beta1 [3]. It generates meshes of tetrahedrons. To generate meshes at different resolutions, we use the -a switch of tetgen and impose maximum cell volumes of 2^(-2l), with l in {0, ..., 5}. Each mesh is imported into FEniCS [3], a finite element framework, where it is used in a standard linear finite element discretization of the PDE (6). FEniCS allows to assemble a system matrix and right-hand side for the discretization using the assemble_system command. An export of matrix and right-hand side is done using the linear algebra backend Eigen. The resulting system matrices and right-hand sides are finally used in our hybrid GPU AMG. Fig. 15 summarizes the convergence results for the gear application problem on the left-hand side. We use the same parameters as in the performance and convergence studies before. These lead to an almost perfect convergence for growing

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm A Comparison of a Second-Order versus a Fourth- Order Lapacian Operator in the Mutigrid Agorithm Kaushik Datta (kdatta@cs.berkeey.edu Math Project May 9, 003 Abstract In this paper, the mutigrid agorithm

More information

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae

More information

Mobile App Recommendation: Maximize the Total App Downloads

Mobile App Recommendation: Maximize the Total App Downloads Mobie App Recommendation: Maximize the Tota App Downoads Zhuohua Chen Schoo of Economics and Management Tsinghua University chenzhh3.12@sem.tsinghua.edu.cn Yinghui (Catherine) Yang Graduate Schoo of Management

More information

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY R.D. FALGOUT, T.A. MANTEUFFEL, B. O NEILL, AND J.B. SCHRODER Abstract. The need for paraeism in the time dimension is being driven

More information

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining Data Mining Cassification: Basic Concepts, Decision Trees, and Mode Evauation Lecture Notes for Chapter 4 Part III Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

arxiv: v2 [math.na] 7 Mar 2017

arxiv: v2 [math.na] 7 Mar 2017 An agebraic mutigrid method for Q 2 Q 1 mixed discretizations of the Navier-Stokes equations Andrey Prokopenko and Raymond S. Tuminaro arxiv:1607.02489v2 [math.na] 7 Mar 2017 Abstract Agebraic mutigrid

More information

A Petrel Plugin for Surface Modeling

A Petrel Plugin for Surface Modeling A Petre Pugin for Surface Modeing R. M. Hassanpour, S. H. Derakhshan and C. V. Deutsch Structure and thickness uncertainty are important components of any uncertainty study. The exact ocations of the geoogica

More information

Distance Weighted Discrimination and Second Order Cone Programming

Distance Weighted Discrimination and Second Order Cone Programming Distance Weighted Discrimination and Second Order Cone Programming Hanwen Huang, Xiaosun Lu, Yufeng Liu, J. S. Marron, Perry Haaand Apri 3, 2012 1 Introduction This vignette demonstrates the utiity and

More information

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion. Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence

More information

Space-Time Trade-offs.

Space-Time Trade-offs. Space-Time Trade-offs. Chethan Kamath 03.07.2017 1 Motivation An important question in the study of computation is how to best use the registers in a CPU. In most cases, the amount of registers avaiabe

More information

A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS

A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS A MULTIGRID METHOD FOR ADAPTIVE SPARSE GRIDS BENJAMIN PEHERSTORFER, STEFAN ZIMMER, CHRISTOPH ZENGER, AND HANS-JOACHIM BUNGARTZ Preprint December 17, 2014 Abstract. Sparse grids have become an important

More information

Hiding secrete data in compressed images using histogram analysis

Hiding secrete data in compressed images using histogram analysis University of Woongong Research Onine University of Woongong in Dubai - Papers University of Woongong in Dubai 2 iding secrete data in compressed images using histogram anaysis Farhad Keissarian University

More information

Response Surface Model Updating for Nonlinear Structures

Response Surface Model Updating for Nonlinear Structures Response Surface Mode Updating for Noninear Structures Gonaz Shahidi a, Shamim Pakzad b a PhD Student, Department of Civi and Environmenta Engineering, Lehigh University, ATLSS Engineering Research Center,

More information

A Memory Grouping Method for Sharing Memory BIST Logic

A Memory Grouping Method for Sharing Memory BIST Logic A Memory Grouping Method for Sharing Memory BIST Logic Masahide Miyazai, Tomoazu Yoneda, and Hideo Fuiwara Graduate Schoo of Information Science, Nara Institute of Science and Technoogy (NAIST), 8916-5

More information

ETNA Kent State University

ETNA Kent State University Eectronic Transactions on Numerica Anaysis. Voume 6, pp. 162-181, December 1997. Copyright 1997,. ISSN 1068-9613. ETNA WAVE-RAY MULTIGRID METHOD FOR STANDING WAVE EQUATIONS A. BRANDT AND I. LIVSHITS Abstract.

More information

As Michi Henning and Steve Vinoski showed 1, calling a remote

As Michi Henning and Steve Vinoski showed 1, calling a remote Reducing CORBA Ca Latency by Caching and Prefetching Bernd Brügge and Christoph Vismeier Technische Universität München Method ca atency is a major probem in approaches based on object-oriented middeware

More information

Nearest Neighbor Learning

Nearest Neighbor Learning Nearest Neighbor Learning Cassify based on oca simiarity Ranges from simpe nearest neighbor to case-based and anaogica reasoning Use oca information near the current query instance to decide the cassification

More information

MCSE Training Guide: Windows Architecture and Memory

MCSE Training Guide: Windows Architecture and Memory MCSE Training Guide: Windows 95 -- Ch 2 -- Architecture and Memory Page 1 of 13 MCSE Training Guide: Windows 95-2 - Architecture and Memory This chapter wi hep you prepare for the exam by covering the

More information

Load Balancing by MPLS in Differentiated Services Networks

Load Balancing by MPLS in Differentiated Services Networks Load Baancing by MPLS in Differentiated Services Networks Riikka Susitaiva, Jorma Virtamo, and Samui Aato Networking Laboratory, Hesinki University of Technoogy P.O.Box 3000, FIN-02015 HUT, Finand {riikka.susitaiva,

More information

A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS. A. C. Finch, K. J. Mackenzie, G. J. Balsdon, G. Symonds

A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS. A. C. Finch, K. J. Mackenzie, G. J. Balsdon, G. Symonds A METHOD FOR GRIDLESS ROUTING OF PRINTED CIRCUIT BOARDS A C Finch K J Mackenzie G J Basdon G Symonds Raca-Redac Ltd Newtown Tewkesbury Gos Engand ABSTRACT The introduction of fine-ine technoogies to printed

More information

Chapter Multidimensional Direct Search Method

Chapter Multidimensional Direct Search Method Chapter 09.03 Mutidimensiona Direct Search Method After reading this chapter, you shoud be abe to:. Understand the fundamentas of the mutidimensiona direct search methods. Understand how the coordinate

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

Solutions to the Final Exam

Solutions to the Final Exam CS/Math 24: Intro to Discrete Math 5//2 Instructor: Dieter van Mekebeek Soutions to the Fina Exam Probem Let D be the set of a peope. From the definition of R we see that (x, y) R if and ony if x is a

More information

ECEn 528 Prof. Archibald Lab: Dynamic Scheduling Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018

ECEn 528 Prof. Archibald Lab: Dynamic Scheduling Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018 ECEn 528 Prof. Archibad Lab: Dynamic Scheduing Part A: due Nov. 6, 2018 Part B: due Nov. 13, 2018 Overview This ab's purpose is to expore issues invoved in the design of out-of-order issue processors.

More information

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART 13 AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART Eva Vona University of Ostrava, 30th dubna st. 22, Ostrava, Czech Repubic e-mai: Eva.Vona@osu.cz Abstract: This artice presents the use of

More information

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory 0 th Word Congress on Structura and Mutidiscipinary Optimization May 9 -, 03, Orando, Forida, USA A Design Method for Optima Truss Structures with Certain Redundancy Based on Combinatoria Rigidity Theory

More information

Neural Network Enhancement of the Los Alamos Force Deployment Estimator

Neural Network Enhancement of the Los Alamos Force Deployment Estimator Missouri University of Science and Technoogy Schoars' Mine Eectrica and Computer Engineering Facuty Research & Creative Works Eectrica and Computer Engineering 1-1-1994 Neura Network Enhancement of the

More information

On-Chip CNN Accelerator for Image Super-Resolution

On-Chip CNN Accelerator for Image Super-Resolution On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, sjkang}@sogang.ac.kr ABSTRACT To impement

More information

Efficient method to design RF pulses for parallel excitation MRI using gridding and conjugate gradient

Efficient method to design RF pulses for parallel excitation MRI using gridding and conjugate gradient Origina rtice Efficient method to design RF puses for parae excitation MRI using gridding and conjugate gradient Shuo Feng, Jim Ji Department of Eectrica & Computer Engineering, Texas & M University, Texas,

More information

A Fast Block Matching Algorithm Based on the Winner-Update Strategy

A Fast Block Matching Algorithm Based on the Winner-Update Strategy In Proceedings of the Fourth Asian Conference on Computer Vision, Taipei, Taiwan, Jan. 000, Voume, pages 977 98 A Fast Bock Matching Agorithm Based on the Winner-Update Strategy Yong-Sheng Chenyz Yi-Ping

More information

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING Binbin Dai and Wei Yu Ya-Feng Liu Department of Eectrica and Computer Engineering University of Toronto, Toronto ON, Canada M5S 3G4 Emais:

More information

A Multigrid Method for Adaptive Sparse Grids

A Multigrid Method for Adaptive Sparse Grids A Mutigrid Method for Adaptive Sparse Grids The MIT Facuty has made this artice openy avaiabe. Pease share how this access benefits you. Your story matters. Citation As Pubished Pubisher Peherstorfer,

More information

Language Identification for Texts Written in Transliteration

Language Identification for Texts Written in Transliteration Language Identification for Texts Written in Transiteration Andrey Chepovskiy, Sergey Gusev, Margarita Kurbatova Higher Schoo of Economics, Data Anaysis and Artificia Inteigence Department, Pokrovskiy

More information

parms: A Package for Solving General Sparse Linear Systems on Parallel Computers

parms: A Package for Solving General Sparse Linear Systems on Parallel Computers parms: A Package for Soving Genera Sparse Linear Systems on Parae Computers Y. Saad 1 and M. Sosonkina 2 1 University of Minnesota, Minneapois, MN 55455, USA, saad@cs.umn.edu, http://www.cs.umn.edu/ saad

More information

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y.

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y. FORTH-ICS / TR-157 December 1995 Joint disparity and motion ed estimation in stereoscopic image sequences Ioannis Patras, Nikos Avertos and Georgios Tziritas y Abstract This work aims at determining four

More information

An Exponential Time 2-Approximation Algorithm for Bandwidth

An Exponential Time 2-Approximation Algorithm for Bandwidth An Exponentia Time 2-Approximation Agorithm for Bandwidth Martin Fürer 1, Serge Gaspers 2, Shiva Prasad Kasiviswanathan 3 1 Computer Science and Engineering, Pennsyvania State University, furer@cse.psu.edu

More information

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01 Page 1 of 15 Chapter 9 Chapter 9: Deveoping the Logica Data Mode The information requirements and business rues provide the information to produce the entities, attributes, and reationships in ogica mode.

More information

A Fast Multigrid Algorithm for Mesh Deformation

A Fast Multigrid Algorithm for Mesh Deformation A Fast Mutigrid Agorithm for Mesh Deformation Lin Shi Yizhou Yu Nathan Be Wei-Wen Feng University of Iinois at Urbana-Champaign Figure : The ide CAMEL becomes a boxer with the hep of MOCAP data and our

More information

Alpha labelings of straight simple polyominal caterpillars

Alpha labelings of straight simple polyominal caterpillars Apha abeings of straight simpe poyomina caterpiars Daibor Froncek, O Nei Kingston, Kye Vezina Department of Mathematics and Statistics University of Minnesota Duuth University Drive Duuth, MN 82-3, U.S.A.

More information

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming The First Internationa Symposium on Optimization and Systems Bioogy (OSB 07) Beijing, China, August 8 10, 2007 Copyright 2007 ORSC & APORC pp. 267 279 Soving Large Doube Digestion Probems for DNA Restriction

More information

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS

DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS DETERMINING INTUITIONISTIC FUZZY DEGREE OF OVERLAPPING OF COMPUTATION AND COMMUNICATION IN PARALLEL APPLICATIONS USING GENERALIZED NETS Pave Tchesmedjiev, Peter Vassiev Centre for Biomedica Engineering,

More information

Sensitivity Analysis of Hopfield Neural Network in Classifying Natural RGB Color Space

Sensitivity Analysis of Hopfield Neural Network in Classifying Natural RGB Color Space Sensitivity Anaysis of Hopfied Neura Network in Cassifying Natura RGB Coor Space Department of Computer Science University of Sharjah UAE rsammouda@sharjah.ac.ae Abstract: - This paper presents a study

More information

Further Concepts in Geometry

Further Concepts in Geometry ppendix F Further oncepts in Geometry F. Exporing ongruence and Simiarity Identifying ongruent Figures Identifying Simiar Figures Reading and Using Definitions ongruent Trianges assifying Trianges Identifying

More information

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program?

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program? Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Spring 2018 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program a set of

More information

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Register Aocation Consider the foowing assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Assume that two registers are avaiabe. Starting from the eft a compier woud generate

More information

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code Further Optimization of the Decoding Method for Shortened Binary Cycic Fire Code Ch. Nanda Kishore Heosoft (India) Private Limited 8-2-703, Road No-12 Banjara His, Hyderabad, INDIA Phone: +91-040-3378222

More information

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges Specia Edition Using Microsoft Exce 2000 - Lesson 3 - Seecting and Naming Ces and.. Page 1 of 8 [Figures are not incuded in this sampe chapter] Specia Edition Using Microsoft Exce 2000-3 - Seecting and

More information

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System GPU Impementation of Parae SVM as Appied to Intrusion Detection System Sudarshan Hiray Research Schoar, Department of Computer Engineering, Vishwakarma Institute of Technoogy, Pune, India sdhiray7@gmai.com

More information

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162 oward Efficient Spatia Variation Decomposition via Sparse Regression Wangyang Zhang, Karthik Baakrishnan, Xin Li, Duane Boning and Rob Rutenbar 3 Carnegie Meon University, Pittsburgh, PA 53, wangyan@ece.cmu.edu,

More information

Merging Jacobi and Gauss-Seidel Methods for Solving Markov Chains on Computer Clusters

Merging Jacobi and Gauss-Seidel Methods for Solving Markov Chains on Computer Clusters Proceedings of the Internationa Muticonference on Computer Science and Information Technoogy pp. 263 268 ISBN 978-83-60810-14-9 ISSN 1896-7094 Merging Jacobi and Gauss-Seide Methods for Soving Markov Chains

More information

A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irregular LDPC Codes in the IEEE 802.

A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irregular LDPC Codes in the IEEE 802. A Fast-Convergence Decoding Method and Memory-Efficient VLSI Decoder Architecture for Irreguar LDPC Codes in the IEEE 82.16e Standards Yeong-Luh Ueng and Chung-Chao Cheng Dept. of Eectrica Engineering,

More information

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby

More information

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion A Loca Optima Method on DSA Guiding Tempate Assignment with Redundant/Dummy Via Insertion Xingquan Li 1, Bei Yu 2, Jiani Chen 1, Wenxing Zhu 1, 24th Asia and South Pacific Design T h e p i c Automation

More information

Fastest-Path Computation

Fastest-Path Computation Fastest-Path Computation DONGHUI ZHANG Coege of Computer & Information Science Northeastern University Synonyms fastest route; driving direction Definition In the United states, ony 9.% of the househods

More information

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions 2006 Internationa Joint Conference on Neura Networks Sheraton Vancouver Wa Centre Hote, Vancouver, BC, Canada Juy 16-21, 2006 A New Supervised Custering Agorithm Based on Min-Max Moduar Network with Gaussian-Zero-Crossing

More information

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7 Functions Unit 6 Gaddis: 6.1-5,7-10,13,15-16 and 7.7 CS 1428 Spring 2018 Ji Seaman 6.1 Moduar Programming Moduar programming: breaking a program up into smaer, manageabe components (modues) Function: a

More information

Design of IP Networks with End-to. to- End Performance Guarantees

Design of IP Networks with End-to. to- End Performance Guarantees Design of IP Networks with End-to to- End Performance Guarantees Irena Atov and Richard J. Harris* ( Swinburne University of Technoogy & *Massey University) Presentation Outine Introduction Mutiservice

More information

(0,l) (0,0) (l,0) (l,l)

(0,l) (0,0) (l,0) (l,l) Parae Domain Decomposition and Load Baancing Using Space-Fiing Curves Srinivas Auru Fatih E. Sevigen Dept. of CS Schoo of EECS New Mexico State University Syracuse University Las Cruces, NM 88003-8001

More information

Shape Analysis with Structural Invariant Checkers

Shape Analysis with Structural Invariant Checkers Shape Anaysis with Structura Invariant Checkers Bor-Yuh Evan Chang Xavier Riva George C. Necua University of Caifornia, Berkeey SAS 2007 Exampe: Typestate with shape anaysis Concrete Exampe Abstraction

More information

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs Repication of Virtua Network Functions: Optimizing Link Utiization and Resource Costs Francisco Carpio, Wogang Bziuk and Admea Jukan Technische Universität Braunschweig, Germany Emai:{f.carpio, w.bziuk,

More information

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints *

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 6, 333-346 (010) Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * HSIN-WEN WEI, WAN-CHEN LU, PEI-CHI HUANG, WEI-KUAN SHIH AND MING-YANG

More information

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM JOINT IMAGE REGISTRATION AND AMPLE-BASED SUPER-RESOLUTION ALGORITHM Hyo-Song Kim, Jeyong Shin, and Rae-Hong Park Department of Eectronic Engineering, Schoo of Engineering, Sogang University 35 Baekbeom-ro,

More information

Endoscopic Motion Compensation of High Speed Videoendoscopy

Endoscopic Motion Compensation of High Speed Videoendoscopy Endoscopic Motion Compensation of High Speed Videoendoscopy Bharath avuri Department of Computer Science and Engineering, University of South Caroina, Coumbia, SC - 901. ravuri@cse.sc.edu Abstract. High

More information

Binarized support vector machines

Binarized support vector machines Universidad Caros III de Madrid Repositorio instituciona e-archivo Departamento de Estadística http://e-archivo.uc3m.es DES - Working Papers. Statistics and Econometrics. WS 2007-11 Binarized support vector

More information

Genetic Algorithms for Parallel Code Optimization

Genetic Algorithms for Parallel Code Optimization Genetic Agorithms for Parae Code Optimization Ender Özcan Dept. of Computer Engineering Yeditepe University Kayışdağı, İstanbu, Turkey Emai: eozcan@cse.yeditepe.edu.tr Abstract- Determining the optimum

More information

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem Extended Node-Arc Formuation for the K-Edge-Disjoint Hop-Constrained Network Design Probem Quentin Botton Université cathoique de Louvain, Louvain Schoo of Management, (Begique) botton@poms.uc.ac.be Bernard


Layout Conscious Approach and Bus Architecture Synthesis for Hardware-Software Co-Design of Systems on Chip Optimized for Speed

Nattawut Thepayasuwan, Member, IEEE and Alex Doboli, Member, IEEE Abstract


Self-Control Cyclic Access with Time Division - A MAC Proposal for The HFC System

S.M. Jiang, Danny H.K. Tsang, Samuel T. Chanson Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong


AMR++: Object-Oriented Parallel Adaptive Mesh Refinement

UCRL-ID-137373 AMR++: Object-Oriented Parallel Adaptive Mesh Refinement B. Philip, D. Quinlan February 2, 2000 U.S. Department of Energy Approved for public release; further dissemination unlimited DISCLAIMER This


Searching, Sorting & Analysis

Searching, Sorting & Analysis Unit 2 Chapter 8 CS 2308 Fall 2018 Jill Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange
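The search definition quoted in this snippet (return the index of the item, or -1 if not found) can be sketched as a minimal Python function; the function name and test values are illustrative, not taken from the course slides:

```python
def linear_search(items, target):
    """Return the index of target in items, or -1 if not found."""
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

print(linear_search([8, 3, 5], 5))   # -> 2
print(linear_search([8, 3, 5], 7))   # -> -1
```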


Priority Queueing for Packets with Two Characteristics

Pavel Chuprikov, Sergey I. Nikolenko, Alex Davydow, Kirill Kogan Abstract Modern network elements are increasingly required to deal with heterogeneous traffic.


Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David

Concurrent programming: From theory to practice Concurrent Algorithms 2016 Tudor David From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to practice Theoretical


A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals

Jiacheng Shang, Jie Wu Center for Networked Computing Dept. of Computer and Info. Sciences Temple University Motivation


Windows NT, Terminal Server and Citrix MetaFrame Terminal Server Architecture

Windows NT, Terminal Server and Citrix MetaFrame - CH 3 - Terminal Server Architect.. Page 1 of 13 [Figures are not included in this sample chapter] Windows NT, Terminal Server and Citrix MetaFrame - 3 - Terminal


ACTIVE LEARNING ON WEIGHTED GRAPHS USING ADAPTIVE AND NON-ADAPTIVE APPROACHES. Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega

Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega Department of Electrical Engineering University of Southern


A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER

International Journal on Technical and Physical Problems of Engineering (IJTPE) Published by International Organization of IOTPE ISSN 077-358 IJTPE Journal www.iotpe.com ijtpe@iotpe.com September 014 Issue 0 Volume


Automatic Hidden Web Database Classification

Zhiguo Gong, Jingbai Zhang, and Qian Liu Faculty of Science and Technology University of Macau Macao, PRC {fstzgg,ma46597,ma46620}@umac.mo Abstract. In this paper,


Fuzzy Controller for a Dynamic Window in Elliptic Curve Cryptography Wireless Networks for Scalar Multiplication

6th Asia-Pacific Conference on Communications Fuzzy Controller for a Dynamic Window in Elliptic Curve Cryptography Wireless Networks for Scalar Multiplication Xu Huang Faculty of Information Sciences and Engineering


An Introduction to Design Patterns

An Introduction to Design Patterns 1 Definitions A pattern is a recurring solution to a standard problem, in a context. Christopher Alexander, a professor of architecture Why would what a prof of architecture


RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*

RDF Objects 1 Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL-2002-315 November 27th, 2002* E-mail: Andy_Seaborne@hpl.hp.com RDF, semantic web, ontology, object-oriented datastructures


Joint Optimization of Intra- and Inter-Autonomous System Traffic Engineering

Kin-Hon Ho, Michael Howarth, Ning Wang, George Pavlou and Stylianos Georgoulas Centre for Communication Systems Research, University


Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Y.LI AND B.BHANU CAMERA ASSIGNMENT: A GAME-THEORETIC


Optimized Base-Station Cache Allocation for Cloud Radio Access Network with Multicast Backhaul

Binbin Dai, Student Member, IEEE, Ya-Feng Liu, Member, IEEE, and Wei Yu, Fellow, IEEE arXiv:804.0730v [cs.IT] 28


M. Badent 1, E. Di Giacomo 2, G. Liotta 2

DIEI Dipartimento di Ingegneria Elettronica e dell'informazione RT 005-06 Drawing Colored Graphs on Colored Points M. Badent 1, E. Di Giacomo 2, G. Liotta 2 1 University of Konstanz 2 Università di Perugia


On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models

Sumit Kunnumkal September 30, 2014 Abstract The assortment optimization problem under the mixture of multinomial logit


On Finding the Best Partial Multicast Protection Tree under Dual-Homing Architecture

Mei Yang, Jianping Wang, Xiangtong Qi, Yingtao Jiang Department of Electrical and Computer Engineering, University of Nevada Las


Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism

G. Siamantas and S. Manesis Electrical & Computer Engineering Dept., University of Patras, Patras, Greece gsiama@upatras.gr;stam.manesis@ece.upatras.gr


CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads

CSE120 Principles of Operating Systems Prof Yuanyuan (YY) Zhou Lecture 4: Threads Announcement Project 0 Due Project 1 out Homework 1 due on Thursday Submit it to Gradescope online 2 Processes Recall that


An Optimizing Compiler

The big difference between interpreters and compilers is that compilers have the ability to think about how to translate a source program into target code in the most effective way. Usually
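As a small sketch of the kind of decision-making this snippet attributes to an optimizing compiler, here is a minimal constant-folding pass over Python's own AST; the pass name and its narrow scope (folding only integer addition) are illustrative, not taken from the book excerpt:

```python
import ast

# A tiny optimization pass: fold constant additions (e.g. 2 + 3 -> 5)
# before any target code would be emitted.
class FoldConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse("x = 2 + 3")
tree = ast.fix_missing_locations(FoldConstants().visit(tree))
print(ast.unparse(tree))  # -> x = 5
```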


Multi-level Shape Recognition based on Wavelet-Transform. Modulus Maxima

Faouzi Alaya Cheikh, Azhar Quddus and Moncef Gabbouj Tampere University of Technology (TUT), Signal Processing Laboratory, P.O. Box 553, FIN-33101


Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications

Daniel P. Palomar and Mung Chiang Electrical Engineering Department, Princeton University, NJ 08544, USA


Development of a National Portal for Tuvalu. Business Case. SPREP Pacific iclim

Development of a National Portal for Tuvalu Business Case SPREP Pacific iClim April 2018 Table of Contents 1. Introduction... 3 1.1 Report Purpose... 3 1.2 Background & Context... 3 1.3 Other IKM Activities


understood as processors that match AST patterns of the source language and translate them into patterns in the target language.

A Basic Compiler At a fundamental level compilers can be understood as processors that match AST patterns of the source language and translate them into patterns in the target language. Here we will look at a basic
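The snippet's view of a compiler as an AST pattern matcher can be sketched in a few lines of Python; the tuple-based AST shape (("num", n) / ("add", l, r)) and the stack-machine target are invented for illustration, not taken from the course notes:

```python
# Minimal sketch: translate a tiny arithmetic AST into stack-machine code
# by matching each node pattern and emitting the corresponding target pattern.
def compile_expr(node):
    kind = node[0]
    if kind == "num":                       # leaf pattern: push a constant
        return [("PUSH", node[1])]
    if kind == "add":                       # 'add' pattern: code for both
        return (compile_expr(node[1])       # operands, then an ADD
                + compile_expr(node[2])
                + [("ADD",)])
    raise ValueError(f"unknown node kind: {kind}")

print(compile_expr(("add", ("num", 1), ("num", 2))))
# -> [('PUSH', 1), ('PUSH', 2), ('ADD',)]
```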


1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER 2014 Backward Fuzzy Rule Interpolation Shangzhu Jin, Ren Diao, Chai Quek, Senior Member, IEEE, and Qiang Shen Abstract Fuzzy rule interpolation


Reference trajectory tracking for a multi-dof robot arm

Archives of Control Sciences Volume 5LXI, 5 No. 4, pages 53 57 Reference trajectory tracking for a multi-dof robot arm RÓBERT KRASŇANSKÝ, PETER VALACH, DÁVID SOÓS, JAVAD ZARBAKHSH This paper presents the


Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Hardware Components Illustrated

Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Fall 2017 Jill Seaman 1.1 Why Program? Computer: programmable machine designed to follow instructions Program: instructions


Resource Optimization to Provision a Virtual Private Network Using the Hose Model

Monia Ghobadi, Sudhakar Ganti, Gholamali C. Shoja University of Victoria, Victoria BC, Canada V8W 3P6 e-mail: {monia, sganti,


Automatic Grouping for Social Networks CS229 Project Report

Xiaoying Tian Ya Le Yangru Fang Abstract Social networking sites allow users to manually categorize their friends, but it is laborious to construct


APPLICATION OF THE FINITE ELEMENT METHOD TO THE ANALYSIS OF AUTOMOBILE TIRES

H.-J. PAYER, G. MESCHKE AND H.A. MANG Institute for Strength of Materials Vienna University of Technology Karlsplatz 13/E202, A-1040


InnerSpec: Technical Report

Fabrizio Guerrini, Alessandro Gnutti, Riccardo Leonardi Department of Information Engineering, University of Brescia Via Branze 38, 25123 Brescia, Italy {fabrizio.guerrini, a.gnutti006,
