(recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generaliza

Size: px
Start display at page:

Download "(recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generaliza"

Transcription

1 SOME BASIC CONCEPTS OF FEAST M. Altieri, Chr. Becker, S. Kilian, H. Oswald, S. Turek, J. Wallis Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, Heidelberg, Germany SUMMARY This paper deals with the basic principles of the new FEM software package FEAST. Based on an initial discussion of available software tools with respect to their application area, i.e., Education, Research or (industrial) Application, it illustrates the specic demands for such PDE software which is aimed to solve 'real life' problems. For the FEAST software, which is principally designed for high{performance simulations, we explain the basic principles of the underlying numerical, algorithmic and implementation concepts. Computational examples illustrate the (expected) eciency of this new software package, particularly in relation to existing approaches. INTRODUCTION Current trends in software development for Partial Dierential Equations (PDE's), and here in particular for Finite Element (FEM) approaches, go clearly towards objectoriented techniques and adaptive methods in any sense. Hereby the employed data and solver structures, and especially the `matrix structures', are often in contradiction to modern hardware platforms. As a result, the observed computational eciency is far away from expected Peak rates of almost 1 GFLOP nowadays, and the 'real life' gap will even further increase (see recent papers of Rude). Since high performance calculations may be only reached by explicitly exploiting 'caching in' and 'pipelining' in combination with sequentially stored arrays (using machine-optimized libraries as BLAS, ESSL or PERFLIB, for instance), the corresponding realization seems to be `easier' for simple Finite Dierence approaches. So, the question arises how to perform similar techniques for much more sophisticated Finite Element codes? These discrepancies between complex mathematical approaches and highly structured computational demands often lead to unreasonable calculation times for `real world' problems, e.g. Computational Fluid Dynamics (CFD) calculations in 3D, as can be seen from recent benchmarks [SRT] for commercial as well as research codes. Hence, strategies for eciency enhancement are necessary, not only from the mathematical (algorithms, discretizations) but also from the software point of view. To realize some of these necessary improvements, our new Finite Element package (project name: FEAST { Finite Element Analysis & Solution Tools) is under development. This package is based on the following concepts: 1

2 (recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generalization of multigrid and domain decomposition techniques frequent use of machine-optimized Linear Algebra routines all typical Finite Element facilities included The result is going to be a exible software package with special emphasis on: (closer to) peak performance on modern and future processors typical multigrid behaviour w.r.t. eciency and robustness parallelization tools directly included on low level open for dierent adaptivity concepts low storage requirements application to many `real life' problems possible In contrast to many other tool boxes, which often aim to develop preferably software for research or education topics, our approach clearly is designed for high performance applications with industrial background, especially in CFD. Consequently, our main emphasis lies on the aspects `eciency' and `robustness' and less on topics as `easy implementable' or `most modern programming environment'. As programming language FORTRAN (77 and 90) is used: This makes it possible to adopt many reliable parts of the predecessor packages FEAT2D, FEAT3D and FEAT- FLOW [TU2]. Further, on high performance computers, very ecient FORTRAN compilers are available and transparent access to the data structures is possible. The preand postprocessing, which will be based on (commercial) professional tools, is handled by JAVA-based program parts. Conguring a high performance computer as a FEAST server, the user shall be able to perform the remote calculation by a FEAST client. In the following, we give examples for `real' computational eciency results of typical numerical tools which help to motivate our hierarchical data, solver and matrix structures. To understand these better, we illustrate shortly the corresponding solution technique ScaRC ("Scalable Recursive Clustering") in combination with the overall `Divide and Conquer' philosophy which is essential for FEAST. We discuss how typical multigrid rates, even for complex congurations, can be achieved, on parallel as well as sequential computers with a very high computational eciency. CLASSIFICATION OF PDE SOFTWARE "Most of the available PDE software can be applied, in principal, to (almost) all problems. The practical functionality is mostly restricted by computer requirements (CPU, RAM) only! " 2

3 This statement appears to be quite `simple', but it nevertheless describes well the stateof-the-art of available software packages for the numerical solution of PDE's. Further, it can be used to classify many existing software tools and to illustrate the dierences in the underlying concepts and realizations of codes. Typical examples for critical applications which demonstrate the problems of many software packages are in the eld of CFD. Especially the recent DFG ow benchmark set "Channel ow around a cylinder" [SRT], which has been carried out recently under the German Priority Research Programme "Flow Simulation on High Performance Computers", has shown quite surprising results. Many codes with various numerical basics and on very dierent computer platforms have participated so far. The results in [SRT] clearly show dierences in total eciency (concerning elapsed CPU times with respect to obtained accuracy!) of several orders of magnitude, and in fact many codes have not been able to give satisfying results, and this for a laminar incompressible ow in the range of Reynolds number Re = 20, resp., Re = 100. As an (ocial) result one could state the following conclusions, all on the basis of such benchmark calculations: 1. It is often not sucient to take any `multi{purpose' package as basic tool, and then to implement a (more or less) clever numerical approach in a straightforward manner. In most cases, the eciency of the underlying basic package does not allow to solve accurately "hard" problems as the incompressible Navier{Stokes equations. 2. Often the chosen `numerical ingredients', as mesh design, discretization spaces, time{stepping schemes, solvers for discrete linear and nonlinear systems, are `good' schemes as standalones but do not t together if realized in a common code. 3. The typical approach of implementing well-known, but `old' solution schemes in a straightforward manner on vector or parallel super-computers does not lead to satisfying results, without essentially improving the numerical and computational background at the same time. Applications from Computational Fluid Dynamics clearly indicate that specic attempts for optimizing the applied numerical schemes (discretization), algorithmic components (solver) as well as software aspects (implementation w.r.t. underlying hardware) must be performed. Only then, similar problems with a certain `real life' character may be tackled successfully. Based on such comparisons and resulting experiences we propose the following classi- cation of PDE software packages, namely if they are principally designed for the use in Education, Research or (Industrial) Application. Education. Corresponding software tools are mainly designed for students to `play{around' with mathematical tools. Their most important features include `easy' user interfaces, and the code should be based on simple, but very robust algorithms. Due to the typically low complexity of the problems to be examined by the students, the code eciency is often `independent of implementation' and typically requires only few seconds execution time. Therefore, C++- and especially JAVA-based implementations with graphical interfaces and platform-independent execution are typical candidates for this kind of software. 3

4 Research. This type of software is representative for most of the available tools in the mathematical community. The software is designed to be open for numerical and algorithmic changes, for instance to examine new concepts for adaptive error control or to test the convergence behaviour of new multilevel-type solvers. Consequently, programming languages which allow very exible and robust data structures are favourized, and particularly modern object-oriented environments as C++ are wide-spread in this eld. In contrast, user interfaces or eciency aspects play a minor part in this mathematical basic research. (Industrial) Application. This software is specialized to apply `well-understood' and `optimized' numerical and algorithmic tools to `real life' congurations with an industrial background. If the full potential of such sophisticated approaches is to be eciently exploited, an `optimal play{together' of Mathematics and implementation in all components has to be guaranteed, to be able to compete with present production codes. Consequently, the demand for robustness and eciency of the developed software is of major interest. Coming back to the previous `Application' examples from CFD, we want to be more specic to illustrate the practical demands on software which will be successfully applied in this special area. Typical problems associated with `real life' ows are the following: complex domains and anisotropic meshes in space and time 10 3? 10 5 time steps and 10 4? 10 8 unknowns by accuracy reasons optimal control of specic physical quantities as drag coecients or ux distributions Therefore, concrete `realistic' simulations often require the use of several millions of unknowns for the numerical solution of (mostly) nonlinear and nonstationary PDE's, and this many times. On the other hand, often parallel computers with many processors and each with (almost) 1 GFLOP Peak performance are available, and today's Mathematics oers higher order discretizations with adaptivity concepts and accompanying multigrid/multilevel-like solvers. However, to marry such modern numerical tools with recent and future hardware aspects and to create an adequate code, the following four components have to be respected with the same weight: Numerics, Algorithms, Implementation and Hardware platforms. Only if all components optimally play-together, the (potential) high{performance of each aspect can be incorporated in the resulting code such that the expected substantially improved simulation tool gets really available. MAIN PRINCIPLES IN FEAST We give a short explanation of some of the main principles in FEAST - at least with respect to the current status in April These are mainly concepts and ideas which are recently planned to be done and/or which are actually under work. Hence, we often omit a rigorous mathematical or algorithmic formulation and present instead the philosophy and background of these concepts only. More details about the practical realization and the numerical and algorithmic sources can be found in [KI], [BE] and [TU1]. 4

5 I) PROGRAMMING LANGUAGE As explained in the INTRODUCTION, we decided to implement most of the numerical kernels of FEAST in FORTRAN 90 (F90). Beside the modern functionality of this language, promising compilers on most recent hardware platforms are available. And additionally, what is also very important, many numerical parts of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2] can be directly included, such as for instance the routines for matrix assembly, the Finite Element basis function library, multigrid ingredients and many more. However, preliminary tests with recent F90 compilers on several platforms have shown, that the use of FORTRAN 77 (F77) compilers is still much more favourable if absolute high{performance is required. Especially for the Numerical Linear Algebra part, i.e., matrix-vector multiplications, vector modications, tridiagonal solvers and other tools which are typical in the context of iterative solution schemes (see [ABT1] and [ABT2]), the use of machine{optimized F77 software libraries or self{developed FORTRAN 77 code still leads to much better performance ratings. Nevertheless, it is planned to switch to F90 completely if F90 has achieved the performance of F77, and the complete package is developed under the restriction that the corresponding modules can be easily exchanged. Finally, it is planned to "hide" the numerical kernel routines from the "typical user" who only wants to `perform applications'. Recently, we are developing a higher-level user{ interface written in JAVA to be machine- and platform-independent as far as possible. Additionally, graphical user interfaces (GUI's) or "meta{language" approaches are under work to leave behind the "useless" discussion about the programming language used in the numerical kernels. II) HIERARCHICAL DATA, SOLVER AND MATRIX STRUCTURES One of the most important principles in FEAST is to apply consequently a (Recursive) Divide and Conquer strategy. The solution of the complete "global" problem is recursively split into smaller "independent" subproblems on `patches' as part of the complete set of unknowns. Thus, the two major aims in this splitting procedure which can be performed by hand or via self{adaptive strategies are: Find locally structured parts Find locally anisotropic parts Based on "small" structured subdomains on the lowest level (in fact, even one single or a small number of elements only is allowed), the "higher-level" substructures are generated via clustering of "lower-level" parts such that algebraic or geometric irregularities are hidden inside the new "higher-level" patch. More background for this strategy is given in the following Sections `HIGH{PERFORMANCE LINEAR ALGEBRA' and `REFERENCE ELEMENT SOLVERS', and particularly in `GENERALIZED SOLVER STRATEGY ScaRC' which describes the corresponding solvers related to each stage. The following Figures illustrate exemplarily the employed data structure for a (coarse) triangulation of a given domain and its recursive partitioning into several kinds of substructures. 5

6 SB PB PB According to this decomposition, a corresponding `data tree' { the skeleton of the partitioning strategy { describes the hierarchical decomposition process. It consists of a specic collection of Elements, Macros (`Mxxx'), Matrix Blocks (`'), Parallel Blocks (`PB'), Subdomain Blocks (`SB'), etc. SB PB PB M1 M2 M3 M4 M5 M6 M7 M8 The atomic units in our decomposition are the `macros' which may be of type `structured' (as n n collection of quadrilaterals (in 2D) with local Finite-Dierence data structures) or `unstructured' (any collection of elements, for instance in the case of fully adaptive local grid renement). These `macros' (one or several) can be clustered to build a `matrix block' which contains the "local matrix parts": only here are the complete matrix informations stored! Higher-level constructs are `parallel blocks' (for the parallel distribution) and `subdomain blocks' (with special conformity rules with respect to grid renement and applied discretization spaces). They all together build the complete domain, resp., the complete set of unknowns. It is important to realize that each stage in this hierarchical tree can act as independent `father' in relation to its `child' substructures while it is a 'child' at the same time in another phase of the solution process (inside of the ScaRC solver, see later). III) GENERALIZED SOLVER STRATEGY ScaRC In short form, our long-time experience with the numerical and computational runtime behaviour of typical multigrid (MG) and Domain Decomposition (DD) solvers can be concluded as follows: a) Some observations from standard multigrid approaches: While in fact the numerical convergence behaviour of (optimized) multigrid is very satisfying with respect to robustness and eciency requirements, there still remain some `open' problems: Often the parallelization of powerful `recursive' smoothers (as SOR or ILU) leads to performance degradations since they can be realized only in a `blockwise' sense. Thus, it is often not clear how the nice numerical behaviour in sequential codes for 6

7 complicated geometric structures or local anisotropies, can be reached in parallel computations. And additionally, the communication overhead especially on coarser grid levels dominates the total CPU time. Even more important is the `computational observation' that the realized performance on modern platforms is often far beyond (sometimes less than 1 %) the expected Peak performance. Many codes often reach much less than 10 MFLOP, and this on computers which are said (by the vendors) to run with up to 1 GFLOP Peak. The reason is simply that the single components in multigrid (smoother, defect calculation, grid transfer) perform too few arithmetic work with respect to each data exchange, such that the facilities of modern superscalar architectures are poorly exploitable. In contrast, we will show that in fact 30 { 70 % can be realistic with appropriate techniques. b) Some observations from standard Domain Decomposition approaches: In contrast to standard multigrid, the parallel eciency is much higher, at least as long as no large overlap region between processors must be exchanged. While overlapping DD methods do not require additional coarse grid problems (however the implementation in 3D for complicated domains or for complex Finite Element spaces is a hard job!), non-overlapping DD approaches require certain coarse grid problems, as the BPS preconditioner for instance which may lead again to severe numerical and computational problems, depending on the geometrical structure or the used discretization spaces. However, the most important dierence between Domain Decomposition and multigrid are the (often) much worse convergence rates of DD, although at the same time more arithmetic work is done on each processor. As a conclusion, improvements are enforced by the facts that the convergence behaviour is often quite sensitive with respect to (local) geometric/algebraic anisotropies (in `real life' congurations!), and that the performed arithmetic work (which allows the high-performance) is often restricted by (un)necessary data exchanges. An additional observation which is strongly related to the previous data structure in combination with the specic hierarchical ScaRC solver is illustrated in the following Figure. We show the resulting "optimal" mesh from a numerical simulation of R.Becker/R.Rannacher for `Flow around the cylinder' which was adaptively rened via rigorous a-posteriori error control mechanisms specied for the required drag coecient (see [RB]). As can be seen, the adaptive grid renement techniques are needed only locally, near the boundaries, while mostly regular substructures (up to 90 %) can be (and should be!) used in the interior of the domain. This is a quite typical result and shows that even for (more or less) complex ow simulations (here as a prototypical example) locally blockwise `Finite Dierence' techniques can be applied: these regions can be detected and exploited by the given hierarchical strategies. 7

8 We omit here a detailed description of the numerical and algorithmic properties of ScaRC and refer to the papers [KT] and particularly to [KI]. Here, we restrict to repeat the main philosophy behind this generalized MG/DD approach, which is strongly coupled with the hierarchical data and matrix structures as explained before. ScaRC stands for: Scalable (w.r.t. "quality and number of local solution steps" at each stage) Recursive ("independently" for each stage in the hierarchy of partitioning) Clustering (for building patches via "xed or adaptive blocking strategies") and its "advantageous" numerical and computational behaviour can be characterized through following observations (look at [KT] and [TU1] for numerical examples): "Block{Jacobi/Gau{Seidel schemes perform well for locally hidden anisotropies. More arithmetic operations can be performed locally whereby additional work may be unproportionately small (in terms of CPU) due to the local high{performance facilities." IV) HIGH{PERFORMANCE LINEAR ALGEBRA One of the main ideas behind the described (Recursive) Divide and Conquer approach in combination with the ScaRC solver technology is to detect `locally structured parts'. In these `local subdomains', we apply consequently `highly structured tools' as typical for Finite Dierence approaches: line- or rowwise numbering of unknowns and storing of matrices as sparse bands (however the matrix entries are calculated via the Finite Element modules!). As a result, we have `optimal' data structures on each of these patches (which often correspond to the former introduced `matrix blocks') and we can perform very powerful Linear Algebra tools which explicitely exploit the high-performance of specic machine{optimized libraries (i.e., BLAS, LAPACK, ESSL, PERFLIB). The following Table shows typical results on some selected hardware platforms, for dierent tasks and techniques in Numerical Linear Algebra. While Gaussian Elimination (GE) is presented only to demonstrate the (potentially) available performance of the given processors (often several hundreds of MFLOP which are really measured!), we are much more interested in the realistic run-time behaviour of several matrix{vector multiplication (MV) techniques. Since these are probably the most important (since most time{ consuming) components in typical iterative solution schemes as Krylov-space methods or multigrid solvers, they are - beside the vector-modication routines (as DAXPY for the linear combination of two vectors) - excellent representants to exemplarily demonstrate the `real life' eciency of many simulation tools in combination with specic hardware platforms. The measured MFLOP for the Gaussian Elimination are for a dense matrix (analogously to the standard linpack test!) while for the dierent MV techniques the matrix is a typical 9{point stencil ("discretized Poisson operator"). We perform tests for two dierent vector lengths N and give the measured MFLOP rates which are all calculated via 20 N=time (for MV), resp., 2 N=time (for DAXPY). In all cases, we attempted to use "optimal" compiler options and machine-optimized libraries as the BLAS, ESSL or PERFLIB. Only in the case of the PENTIUM II we had to perform the Gaussian Elimination with the FORTRAN-sources exclusively which might explain the worse rates. 8

9 Corresponding results for other iterative components which are essential in the context of multigrid solvers can be found at which also contains our complete measurements on many modern processors (see also [ABT1] and [ABT2]). Computer GE N DAXPY sparse MV banded MV blocked MV IBM RS K (166 Mhz) 256K SUN U K (250 Mhz) 256K PC PII 45 4K (233 Mhz) 65K The `sparse MV' technique is the standard technique in Finite Element codes (and others), also well known as `compact storage' technique or similar: the matrix (plus index arrays or lists) is stored as long array containing the "nonzero elements" only. While this approach can be applied for arbitrary meshes and numberings of the unknowns, no explicit advantage of the linewise numbering can be exploited. The result is that through the indexed access, the performance degradates dramatically in comparison to the (almost) Peak-rates from the Gaussian Elimination (down to 5 %!). In fact, these results are even `quasi-optimal' since the best-available F77 compiler options have been applied. Moreover, it should be a necessary test for everyone, particularly for those who work in F90, C++ or even JAVA: to measure the corresponding rates of the used MV routines provides a rst impression of the own eciency! The most `natural' way to improve these results is to exploit the fact that the matrix is a sparse banded matrix with 9 bands only. Hence, the matrix{vector multiplication is rewritten such that now "band after band" are applied. In fact, each "band multiplication" is performed analogously to the DAXPY operation (modulo the "variable" multiplication factors!), and leads consequently to similar results as for DAXPY. The obvious advantage of this `banded MV' approach is that these tasks can be performed on the basis of BLAS1 routines which may exploit the vectorization facilities of many processors (particularly on vector computers!). Indeed, the measured results show improvements. However, for `long' vector lengths (256 K) the improvements are absolutely disappointing: It is obvious, that for this kind of (workstation/pc) chip technology the processor cache dominates the resulting eciency! The nal step towards highly ecient components is to rearrange the matrix{vector multiplication in a "blockwise" sense (`blocked MV'): for a certain set of unknowns, a corresponding part of the matrix is treated such that cache-optimized and fully vectorized operations can be performed. This procedure is called "BLAS 2+"-style since in fact certain techniques for dense matrices which are based on routines from the BLAS2, resp., BLAS3 library, have now been developed for such sparse banded matrices. The exact procedure has to be carefully developed in dependence of the underlying FEM discretization, and a more detailed description can be found in [ABT2]. In fact, we expect even better performance ratings in future by more careful implementations which should reach at least the DAXPY measurements! 9

10 V) REFERENCE ELEMENT SOLVERS The results in the previous Section have shown how important the use of (locally) highly{structured meshes for the resulting performance is. However, it is also obvious that techniques of the described `BLAS 2+' style are absolutely necessary to achieve a high percentage of the several hundreds of MFLOP's on modern processors. The full FEM functionality including complex geometries and adapted meshes is administrated via the hierarchical 'data tree' partitioning which in combination with the ScaRC solvers is responsible for the `global' convergence behaviour. In contrast, the resulting eciency is mainly directed by the performance rates with respect to convergence rates and computational eciency on these highly structured "subdomains". Hence, a very important step is to measure and to understand the characteristic runtime behaviour of modern processors and computer architectures. While these measurements provide the "processor speed" only, we additionally have to understand the typical convergence behaviour of certain multigrid components on such local patches, depending on the discretization spaces, the dierential operators and the uniformity of the mesh: Each `reference element' is assumed to be a (logically equivalent) tensor-product mesh, but it may contain deformations and large aspect ratios. The important property is the "sparse banded" structure of the (local) matrices in the `matrix blocks'! Those convergence rates together with the measured processor speed determine the `total numerical eciency' which simply gives the "CPU time to gain 1 digit per unknown". Therefore, we can perform these measurements completely a-priori, for dierent prototypes of meshes, dierential operators and discretization spaces, and we can store these results in a kind of machine{dependent data basis: our expert system for the `reference element solvers'! Then, during the solution process and independently for each stage in the hierarchical tree structure, ScaRC "selects" automatically the "optimal" conguration, i.e., the smoothing operator and the number of smoothing steps, via this "a-priori expert system". Within the FEAST approach, we have therefore the chance to incorporate all the knowledge from the `Finite Dierence world' and from `unit-square solver experts' into the higher functionality of FEM codes. And additionally, the facilities of actual and future hardware platforms can be exploited inside of a software product which at the same time is designed to realize the modern mathematical FEM methodology. VI) SEVERAL ADAPTIVITY CONCEPTS As typical for modern FEM packages, we directly incorporate certain tools for grid generation which allow an easy handling of local and global renement or coarsening strategies: adaptive mesh moving, macro adaptivity and fully local adaptivity. Adaptive strategies for moving mesh points, along boundaries or inner structures, allow the same logic structure in each `macro block', and hence the shown performance rates can be preserved. Additionally, we work with adaptivity concepts related to each `macro block', resp., `matrix block'. Allowing `blind' or `slave macro nodes' preserves the high{ performance facilities in each `matrix block', and is a good compromise between fully local adaptivity and optimal eciency through structured data. Only in that case, that these concepts do not lead to satisfying results, certain macros will loose their `highly structured' features through the (local) use of fully adaptive techniques. 
On these (hopefully) few patches, the standard `sparse' techniques for unstructured meshes have to be applied. 10

11 VII) DIRECT INTEGRATION OF PARALLELISM Most software packages are designed for sequential algorithms to solve a given PDE problem, and the subsequent parallelization of certain methods takes often unproportionately long. On the other hand, the following typical `naive' statement shows that this extra work is often neglected (by those who never perform the parallelization!): "1 IBM is more expensive than several SUN's or PC's! Why spend so much money for such a highly{tuned single processor machine? `Simply' parallelize your code..." In fact that is easy to say, but hard to realize with most software packages. Therefore, we directly include tools as MPI or PVM, or other standardized communication routines concerning our hierarchical tree structure, already on low level. However the more important step, which makes parallelization much more easy, is the design of the ScaRC solver according to the hierarchical decomposition in dierent stages. Indeed, from an algorithmic point of view, our sequential and parallel versions dier only as analogously Jacobi- and Gau{Seidel-like schemes work dierently. Hence, all parallel executions can be identically simulated on single processors which however can additionally improve their numerical behaviour with respect to eciency and robustness through Gau{Seidel-like mechanisms. Again, it is absolutely important to realize - see Section `GENERALIZED SOLVER STRATEGY ScaRC' - that (see also [KT] or [TU1]) "Block{Jacobi/Gau{Seidel schemes perform well for locally hidden anisotropies." Hence, we only provide in FEAST the `software' tools for including parallelism on low level, while the `numerical parallelism' is incorporated via our ScaRC solver and the hierarchical `tree structure'. However, what will be `non-standard' is our concept of (adaptive) parallel loadbalancing which is oriented in `total numerical eciency' (that means, "how much processor work is spent to achieve a certain accuracy, depending on the local conguration"!) in contrast to the `classical' criterion of equilibrating the number of local unknowns (see [BE] for detailed information and examples in FEAST). VIII) FULL FINITE ELEMENT FUNCTIONALITY We plan to include (at least) all facilities from the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2], as for instance the routines for matrix assembly, the Finite Element basis function library, certain multigrid ingredients and many more. However, in addition, also mechanisms for a-posteriori error control via `residual techniques' and `dual solution' will be provided and complemented with several concepts of adaptivity. IX) PROFESSIONAL PRE- AND POSTPROCESSING Candidates are our JAVA{based tool DeViSor for graphical pre- and postprocessing and certain AVS/Express modules for which we agreed with AVS to include in our software package (for free!). However, we still look for a competent partner for professional geometry and mesh generators which shall be included as `macro mesh generators' and for CAD-like descriptions of the domain. This step will be one of the most important towards the numerical solution of `real life' problems. 11

12 X) OPTIMIZED APPLICATION TOOLS We plan (at least) to publish a new version of our FEATFLOW2.0 which contains most of our methodology derived in [TU1], but then based on the FEAST package. We hope to improve signicantly the quality of the recent FEATFLOW1.1 [TU2] through all the addressed mathematical, algorithmic and implementation aspects. CONCLUSIONS AND OUTLOOK We expect the rst version of FEAST for end of 1998, but most of the `numerical' and `computational' ingredients have already been successfully realized in several test implementations (see the papers in the REFERENCES). The actual status of the FEAST project and further information can always be obtained from our Web page: Nevertheless, help is always welcome: for instance in implementing and testing many auxiliary components, pre- and postprocessing, `unit square' experts and `computers for performance measurements', etc. REFERENCES [ABT 1] ALTIERI, M., BECKER, CHR., TUREK, S.: "Konsequenzen eines numerischen `Elch Tests' fur Prozessor{Architektur und Computersimulation", to appear. [ABT 2] ALTIERI, M., BECKER, CHR., TUREK, S.: "On the realistic performance of components in iterative solvers", Proc. FORTWIHR Conference, Munich, March 1998, LNCSE, Springer-Verlag, to appear. [BE] BECKER, CHR.: "FEAST - The realization of Finite Element software for high-performance applications", Thesis, to appear. [KI] KILIAN, S.: "Ecient parallel iterative solvers of ScaRC-type and their application to the incompressible Navier-Stokes equations", Thesis, [KT ] KILIAN, S., TUREK, S.: "An example for parallel ScaRC and its application to the incompressible Navier-Stokes equations", Proc. ENUMATH-97, Heidelberg, October [RB] RANNACHER, R., BECKER, R.: "A Feed-Back Approach to Error Control in Finite Element Methods: Basic Analysis and Examples", Preprint 96{52, University of Heidelberg, SFB 359, [SRT ] SCH AFER, M., RANNACHER, R., TUREK, S.: "Evaluation of a CFD Benchmark for Laminar Flows", Proc. ENUMATH-97, Heidelberg, October [T U 1] TUREK, S.: "Ecient solvers for incompressible ow problems: An algorithmic approach in view of computational aspects", LNCSE 2, Springer-Verlag, [T U 2] TUREK, S.: "FEATFLOW. Finite element software for the incompressible Navier-Stokes equations: User Manual, Release 1.1, 1998 (see the WWW-address above). 12

Two main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s

Two main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s . Trends in processor technology and their impact on Numerics for PDE's S. Turek Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, 69120 Heidelberg, Germany http://gaia.iwr.uni-heidelberg.de/~ture

More information

High Performance Computing for PDE Towards Petascale Computing

High Performance Computing for PDE Towards Petascale Computing High Performance Computing for PDE Towards Petascale Computing S. Turek, D. Göddeke with support by: Chr. Becker, S. Buijssen, M. Grajewski, H. Wobker Institut für Angewandte Mathematik, Univ. Dortmund

More information

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction FOR 493 - P3: A monolithic multigrid FEM solver for fluid structure interaction Stefan Turek 1 Jaroslav Hron 1,2 Hilmar Wobker 1 Mudassar Razzaq 1 1 Institute of Applied Mathematics, TU Dortmund, Germany

More information

High Performance Computing for PDE Some numerical aspects of Petascale Computing

High Performance Computing for PDE Some numerical aspects of Petascale Computing High Performance Computing for PDE Some numerical aspects of Petascale Computing S. Turek, D. Göddeke with support by: Chr. Becker, S. Buijssen, M. Grajewski, H. Wobker Institut für Angewandte Mathematik,

More information

Hardware-Oriented Numerics - High Performance FEM Simulation of PDEs

Hardware-Oriented Numerics - High Performance FEM Simulation of PDEs Hardware-Oriented umerics - High Performance FEM Simulation of PDEs Stefan Turek Institut für Angewandte Mathematik, Univ. Dortmund http://www.mathematik.uni-dortmund.de/ls3 http://www.featflow.de Performance

More information

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dominik Göddeke Universität Dortmund dominik.goeddeke@math.uni-dortmund.de Christian Becker christian.becker@math.uni-dortmund.de

More information

GPU Cluster Computing for FEM

GPU Cluster Computing for FEM GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing

More information

Resilient geometric finite-element multigrid algorithms using minimised checkpointing

Resilient geometric finite-element multigrid algorithms using minimised checkpointing Resilient geometric finite-element multigrid algorithms using minimised checkpointing Dominik Göddeke, Mirco Altenbernd, Dirk Ribbrock Institut für Angewandte Mathematik (LS3) Fakultät für Mathematik TU

More information

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dipl.-Inform. Dominik Göddeke (dominik.goeddeke@math.uni-dortmund.de) Mathematics III: Applied Mathematics and Numerics

More information

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen

More information

GPU Acceleration of Unmodified CSM and CFD Solvers

GPU Acceleration of Unmodified CSM and CFD Solvers GPU Acceleration of Unmodified CSM and CFD Solvers Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de

More information

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations Robert Strzodka, Stanford University Dominik Göddeke, Universität Dortmund Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Number of slices

More information

Performance. Computing (UCHPC)) for Finite Element Simulations

Performance. Computing (UCHPC)) for Finite Element Simulations technische universität dortmund Universität Dortmund fakultät für mathematik LS III (IAM) UnConventional High Performance Computing (UCHPC)) for Finite Element Simulations S. Turek, Chr. Becker, S. Buijssen,

More information

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Dominik Göddeke Angewandte Mathematik und Numerik, Universität Dortmund Acknowledgments Joint

More information

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

GPU Cluster Computing for Finite Element Applications

GPU Cluster Computing for Finite Element Applications GPU Cluster Computing for Finite Element Applications Dominik Göddeke, Hilmar Wobker, Sven H.M. Buijssen and Stefan Turek Applied Mathematics TU Dortmund dominik.goeddeke@math.tu-dortmund.de http://www.mathematik.tu-dortmund.de/~goeddeke

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM Dominik Göddeke and Robert Strzodka Institut für Angewandte Mathematik (LS3), TU Dortmund Max Planck Institut

More information

PROGRAMMING OF MULTIGRID METHODS

PROGRAMMING OF MULTIGRID METHODS PROGRAMMING OF MULTIGRID METHODS LONG CHEN In this note, we explain the implementation detail of multigrid methods. We will use the approach by space decomposition and subspace correction method; see Chapter:

More information

Introduction to Multigrid and its Parallelization

Introduction to Multigrid and its Parallelization Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are

More information

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke. The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26

More information

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

Reconstruction of Trees from Laser Scan Data and further Simulation Topics Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs In Proceedings of ASIM 2005-18th Symposium on Simulation Technique, Sept. 2005. Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke dominik.goeddeke@math.uni-dortmund.de Universität

More information

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck.

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck. To be published in: Notes on Numerical Fluid Mechanics, Vieweg 1994 Flow simulation with FEM on massively parallel systems Frank Lohmeyer, Oliver Vornberger Department of Mathematics and Computer Science

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus)

Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus) 1-day workshop, TU Eindhoven, April 17, 2012 Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus) :10.25-10.30: Opening and word of welcome 10.30-11.15: Michele

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Free-Form Shape Optimization using CAD Models

Free-Form Shape Optimization using CAD Models Free-Form Shape Optimization using CAD Models D. Baumgärtner 1, M. Breitenberger 1, K.-U. Bletzinger 1 1 Lehrstuhl für Statik, Technische Universität München (TUM), Arcisstraße 21, D-80333 München 1 Motivation

More information

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential

More information

Exploring unstructured Poisson solvers for FDS

Exploring unstructured Poisson solvers for FDS Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests

More information

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example

More information

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation lgebraic Multigrid (MG) for Ground Water Flow and Oil Reservoir Simulation Klaus Stüben, Patrick Delaney 2, Serguei Chmakov 3 Fraunhofer Institute SCI, Klaus.Stueben@scai.fhg.de, St. ugustin, Germany 2

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves

Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract

More information

Higher order nite element methods and multigrid solvers in a benchmark problem for the 3D Navier Stokes equations

Higher order nite element methods and multigrid solvers in a benchmark problem for the 3D Navier Stokes equations INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS Int. J. Numer. Meth. Fluids 2002; 40:775 798 (DOI: 10.1002/d.377) Higher order nite element methods and multigrid solvers in a benchmark problem for

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Efficiency Aspects for Advanced Fluid Finite Element Formulations

Efficiency Aspects for Advanced Fluid Finite Element Formulations Proceedings of the 5 th International Conference on Computation of Shell and Spatial Structures June 1-4, 2005 Salzburg, Austria E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.) www.iassiacm2005.de

More information

on the CM-200. Claus Bendtsen Abstract to implement LAPACK-style routines already developed for other architectures,

on the CM-200. Claus Bendtsen Abstract to implement LAPACK-style routines already developed for other architectures, \Quick" Implementation of Block LU Algorithms on the CM-200. Claus Bendtsen Abstract The CMSSL library only includes a limited amount of mathematical algorithms. Hence, when writing code for the Connection

More information

Computing on GPU Clusters

Computing on GPU Clusters Computing on GPU Clusters Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Case study: GPU acceleration of parallel multigrid solvers
