SOME BASIC CONCEPTS OF FEAST

M. Altieri, Chr. Becker, S. Kilian, H. Oswald, S. Turek, J. Wallis
Institut für Angewandte Mathematik, Universität Heidelberg
Im Neuenheimer Feld 294, Heidelberg, Germany

SUMMARY This paper deals with the basic principles of the new FEM software package FEAST. Based on an initial discussion of available software tools with respect to their application area, i.e., Education, Research or (industrial) Application, it illustrates the specific demands on PDE software which is aimed at solving 'real life' problems. For the FEAST software, which is principally designed for high-performance simulations, we explain the basic principles of the underlying numerical, algorithmic and implementation concepts. Computational examples illustrate the (expected) efficiency of this new software package, particularly in relation to existing approaches.

INTRODUCTION Current trends in software development for Partial Differential Equations (PDE's), and here in particular for Finite Element (FEM) approaches, go clearly towards object-oriented techniques and adaptive methods in every sense. Hereby the employed data and solver structures, and especially the 'matrix structures', often stand in contradiction to modern hardware platforms. As a result, the observed computational efficiency is far away from the expected Peak rates of almost 1 GFLOP nowadays, and this 'real life' gap will increase even further (see recent papers of Rüde). Since high performance may only be reached by explicitly exploiting 'caching in' and 'pipelining' in combination with sequentially stored arrays (using machine-optimized libraries such as BLAS, ESSL or PERFLIB, for instance), the corresponding realization seems to be 'easier' for simple Finite Difference approaches. So, the question arises how to realize similar techniques in much more sophisticated Finite Element codes.
These discrepancies between complex mathematical approaches and highly structured computational demands often lead to unreasonable calculation times for 'real world' problems, e.g. Computational Fluid Dynamics (CFD) calculations in 3D, as can be seen from recent benchmarks [SRT] for commercial as well as research codes. Hence, strategies for efficiency enhancement are necessary, not only from the mathematical (algorithms, discretizations) but also from the software point of view. To realize some of these necessary improvements, our new Finite Element package (project name FEAST: Finite Element Analysis & Solution Tools) is under development. This package is based on the following concepts:
- (recursive) 'Divide and Conquer' strategies
- hierarchical data and solver structures, but also hierarchical (!) 'matrix structures'
- ScaRC as a generalization of multigrid and domain decomposition techniques
- frequent use of machine-optimized Linear Algebra routines
- all typical Finite Element facilities included

The result is going to be a flexible software package with special emphasis on:

- (closer to) peak performance on modern and future processors
- typical multigrid behaviour w.r.t. efficiency and robustness
- parallelization tools directly included on a low level
- openness for different adaptivity concepts
- low storage requirements
- applicability to many 'real life' problems

In contrast to many other tool boxes, which often aim to develop software preferably for research or education purposes, our approach is clearly designed for high performance applications with an industrial background, especially in CFD. Consequently, our main emphasis lies on the aspects 'efficiency' and 'robustness', and less on topics such as 'easy to implement' or 'most modern programming environment'. As programming language, FORTRAN (77 and 90) is used: this makes it possible to adopt many reliable parts of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2]. Further, on high performance computers, very efficient FORTRAN compilers are available, and transparent access to the data structures is possible. The pre- and postprocessing, which will be based on (commercial) professional tools, is handled by JAVA-based program parts. Configuring a high performance computer as a FEAST server, the user shall be able to perform the remote calculation via a FEAST client. In the following, we give examples of 'real' computational efficiency results for typical numerical tools which help to motivate our hierarchical data, solver and matrix structures.
To understand these better, we shortly illustrate the corresponding solution technique ScaRC ("Scalable Recursive Clustering") in combination with the overall 'Divide and Conquer' philosophy which is essential for FEAST. We discuss how typical multigrid rates can be achieved, even for complex configurations, on parallel as well as sequential computers with a very high computational efficiency.

CLASSIFICATION OF PDE SOFTWARE

"Most of the available PDE software can be applied, in principle, to (almost) all problems. The practical functionality is mostly restricted by computer requirements (CPU, RAM) only!"
This statement appears to be quite 'simple', but it nevertheless describes well the state-of-the-art of available software packages for the numerical solution of PDE's. Further, it can be used to classify many existing software tools and to illustrate the differences in the underlying concepts and realizations of codes. Typical examples of critical applications which demonstrate the problems of many software packages are in the field of CFD. Especially the recent DFG flow benchmark "Channel flow around a cylinder" [SRT], which has been carried out under the German Priority Research Programme "Flow Simulation on High Performance Computers", has shown quite surprising results. Many codes with various numerical foundations and on very different computer platforms have participated so far. The results in [SRT] clearly show differences in total efficiency (concerning elapsed CPU times with respect to obtained accuracy!) of several orders of magnitude; in fact many codes have not been able to give satisfying results, and this for a laminar incompressible flow in the range of Reynolds number Re = 20, resp., Re = 100. As an (official) result, one could state the following conclusions, all on the basis of such benchmark calculations:

1. It is often not sufficient to take any 'multi-purpose' package as a basic tool and then to implement a (more or less) clever numerical approach in a straightforward manner. In most cases, the efficiency of the underlying basic package does not allow one to solve "hard" problems, such as the incompressible Navier-Stokes equations, accurately.

2. Often the chosen 'numerical ingredients', such as mesh design, discretization spaces, time-stepping schemes, and solvers for discrete linear and nonlinear systems, are 'good' schemes as standalones but do not fit together when realized in a common code.

3.
The typical approach of implementing well-known but 'old' solution schemes in a straightforward manner on vector or parallel supercomputers does not lead to satisfying results without essentially improving the numerical and computational background at the same time. Applications from Computational Fluid Dynamics clearly indicate that specific attempts at optimizing the applied numerical schemes (discretization), algorithmic components (solver) as well as software aspects (implementation w.r.t. the underlying hardware) must be made. Only then may similar problems with a certain 'real life' character be tackled successfully.

Based on such comparisons and the resulting experiences, we propose the following classification of PDE software packages, namely according to whether they are principally designed for use in Education, Research or (Industrial) Application.

Education. Corresponding software tools are mainly designed for students to 'play around' with mathematical tools. Their most important features include 'easy' user interfaces, and the code should be based on simple but very robust algorithms. Due to the typically low complexity of the problems to be examined by the students, the code efficiency is often 'independent of implementation' and typically requires only a few seconds of execution time. Therefore, C++- and especially JAVA-based implementations with graphical interfaces and platform-independent execution are typical candidates for this kind of software.
Research. This type of software is representative of most of the available tools in the mathematical community. The software is designed to be open for numerical and algorithmic changes, for instance to examine new concepts for adaptive error control or to test the convergence behaviour of new multilevel-type solvers. Consequently, programming languages which allow very flexible and robust data structures are favoured, and particularly modern object-oriented environments such as C++ are widespread in this field. In contrast, user interfaces or efficiency aspects play a minor part in this mathematical basic research.

(Industrial) Application. This software is specialized to apply 'well-understood' and 'optimized' numerical and algorithmic tools to 'real life' configurations with an industrial background. If the full potential of such sophisticated approaches is to be efficiently exploited, an 'optimal play-together' of Mathematics and implementation in all components has to be guaranteed, in order to be able to compete with present production codes. Consequently, the demand for robustness and efficiency of the developed software is of major interest.

Coming back to the previous 'Application' examples from CFD, we want to be more specific in order to illustrate the practical demands on software which is to be successfully applied in this special area. Typical problems associated with 'real life' flows are the following:

- complex domains and anisotropic meshes in space and time
- 10^3 to 10^5 time steps and 10^4 to 10^8 unknowns, for accuracy reasons
- optimal control of specific physical quantities such as drag coefficients or flux distributions

Therefore, concrete 'realistic' simulations often require the use of several millions of unknowns for the numerical solution of (mostly) nonlinear and nonstationary PDE's, and this many times over.
On the other hand, parallel computers with many processors, each with (almost) 1 GFLOP Peak performance, are often available, and today's Mathematics offers higher order discretizations with adaptivity concepts and accompanying multigrid/multilevel-type solvers. However, to marry such modern numerical tools with recent and future hardware aspects and to create an adequate code, the following four components have to be respected with the same weight: Numerics, Algorithms, Implementation and Hardware platforms. Only if all components play together optimally can the (potential) high performance of each aspect be incorporated into the resulting code such that the expected substantially improved simulation tool really becomes available.

MAIN PRINCIPLES IN FEAST

We give a short explanation of some of the main principles in FEAST, at least with respect to the current status as of April. These are mainly concepts and ideas which are currently planned and/or actually under development. Hence, we often omit a rigorous mathematical or algorithmic formulation and present instead only the philosophy and background of these concepts. More details about the practical realization and the numerical and algorithmic sources can be found in [KI], [BE] and [TU1].
I) PROGRAMMING LANGUAGE

As explained in the INTRODUCTION, we decided to implement most of the numerical kernels of FEAST in FORTRAN 90 (F90). Besides the modern functionality of this language, promising compilers are available on most recent hardware platforms. Additionally, and this is also very important, many numerical parts of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2] can be directly included, such as the routines for matrix assembly, the Finite Element basis function library, multigrid ingredients and many more. However, preliminary tests with recent F90 compilers on several platforms have shown that the use of FORTRAN 77 (F77) compilers is still much more favourable if absolute high performance is required. Especially for the Numerical Linear Algebra part, i.e., matrix-vector multiplications, vector modifications, tridiagonal solvers and other tools which are typical in the context of iterative solution schemes (see [ABT1] and [ABT2]), the use of machine-optimized F77 software libraries or self-developed FORTRAN 77 code still leads to much better performance ratings. Nevertheless, it is planned to switch completely to F90 once F90 has achieved the performance of F77, and the complete package is developed under the restriction that the corresponding modules can be easily exchanged.

Finally, it is planned to "hide" the numerical kernel routines from the "typical user" who only wants to 'perform applications'. We are currently developing a higher-level user interface written in JAVA so as to be machine- and platform-independent as far as possible. Additionally, graphical user interfaces (GUI's) and "meta-language" approaches are under development to leave behind the "useless" discussion about the programming language used in the numerical kernels.

II) HIERARCHICAL DATA, SOLVER AND MATRIX STRUCTURES

One of the most important principles in FEAST is the consequent application of a (Recursive) Divide and Conquer strategy.
The solution of the complete "global" problem is recursively split into smaller "independent" subproblems on 'patches', i.e., parts of the complete set of unknowns. The two major aims in this splitting procedure, which can be performed by hand or via self-adaptive strategies, are:

- find locally structured parts
- find locally anisotropic parts

Based on "small" structured subdomains on the lowest level (in fact, even one single element or a small number of elements only is allowed), the "higher-level" substructures are generated via clustering of "lower-level" parts such that algebraic or geometric irregularities are hidden inside the new "higher-level" patch. More background on this strategy is given in the following Sections 'HIGH-PERFORMANCE LINEAR ALGEBRA' and 'REFERENCE ELEMENT SOLVERS', and particularly in 'GENERALIZED SOLVER STRATEGY ScaRC', which describes the corresponding solvers related to each stage. The following Figures exemplarily illustrate the employed data structure for a (coarse) triangulation of a given domain and its recursive partitioning into several kinds of substructures.
[Figure: recursive partitioning of a coarse triangulation into Subdomain Blocks (SB) and Parallel Blocks (PB)]

According to this decomposition, a corresponding 'data tree', the skeleton of the partitioning strategy, describes the hierarchical decomposition process. It consists of a specific collection of Elements, Macros ('Mxxx'), Matrix Blocks, Parallel Blocks ('PB'), Subdomain Blocks ('SB'), etc.

[Figure: data tree with a Subdomain Block, Parallel Blocks and macros M1 to M8]

The atomic units in our decomposition are the 'macros', which may be of type 'structured' (an n x n collection of quadrilaterals (in 2D) with local Finite Difference data structures) or 'unstructured' (any collection of elements, for instance in the case of fully adaptive local grid refinement). These 'macros' (one or several) can be clustered to build a 'matrix block' which contains the "local matrix parts": only here is the complete matrix information stored! Higher-level constructs are 'parallel blocks' (for the parallel distribution) and 'subdomain blocks' (with special conformity rules with respect to grid refinement and the applied discretization spaces). All together they build the complete domain, resp., the complete set of unknowns. It is important to realize that each stage in this hierarchical tree can act as an independent 'father' in relation to its 'child' substructures, while being a 'child' itself in another phase of the solution process (inside the ScaRC solver, see later).

III) GENERALIZED SOLVER STRATEGY ScaRC

In short form, our long-time experience with the numerical and computational runtime behaviour of typical multigrid (MG) and Domain Decomposition (DD) solvers can be summarized as follows:

a) Some observations from standard multigrid approaches: While the numerical convergence behaviour of (optimized) multigrid is in fact very satisfying with respect to robustness and efficiency requirements, there still remain some 'open' problems: Often the parallelization of powerful 'recursive' smoothers (such as SOR or ILU) leads to performance degradations, since they can be realized only in a 'blockwise' sense.
Thus, it is often not clear how the nice numerical behaviour of sequential codes for complicated geometric structures or local anisotropies can be reached in parallel computations. Additionally, the communication overhead, especially on coarser grid levels, dominates the total CPU time. Even more important is the 'computational observation' that the realized performance on modern platforms is often far below (sometimes less than 1 % of) the expected Peak performance. Many codes often reach much less than 10 MFLOP, and this on computers which are said (by the vendors) to run at up to 1 GFLOP Peak. The reason is simply that the single components in multigrid (smoother, defect calculation, grid transfer) perform too little arithmetic work per data exchange, such that the facilities of modern superscalar architectures are poorly exploitable. In contrast, we will show that in fact 30 to 70 % can be realistic with appropriate techniques.

b) Some observations from standard Domain Decomposition approaches: In contrast to standard multigrid, the parallel efficiency is much higher, at least as long as no large overlap region between processors must be exchanged. While overlapping DD methods do not require additional coarse grid problems (however, the implementation in 3D for complicated domains or for complex Finite Element spaces is a hard job!), non-overlapping DD approaches require certain coarse grid problems, such as the BPS preconditioner for instance, which may again lead to severe numerical and computational problems, depending on the geometric structure or the discretization spaces used. However, the most important difference between Domain Decomposition and multigrid are the (often) much worse convergence rates of DD, although at the same time more arithmetic work is done on each processor.
In conclusion, improvements are enforced by the facts that the convergence behaviour is often quite sensitive with respect to (local) geometric/algebraic anisotropies (in 'real life' configurations!), and that the performed arithmetic work (which allows the high performance) is often restricted by (un)necessary data exchanges.

An additional observation, which is strongly related to the previous data structure in combination with the specific hierarchical ScaRC solver, is illustrated in the following Figure. We show the resulting "optimal" mesh from a numerical simulation by R. Becker/R. Rannacher for 'Flow around the cylinder', which was adaptively refined via rigorous a-posteriori error control mechanisms specified for the required drag coefficient (see [RB]). As can be seen, the adaptive grid refinement techniques are needed only locally, near the boundaries, while mostly regular substructures (up to 90 %) can be (and should be!) used in the interior of the domain. This is a quite typical result and shows that even for (more or less) complex flow simulations (here as a prototypical example), locally blockwise 'Finite Difference' techniques can be applied: these regions can be detected and exploited by the given hierarchical strategies.
We omit here a detailed description of the numerical and algorithmic properties of ScaRC and refer to the papers [KT] and particularly [KI]. Here, we restrict ourselves to repeating the main philosophy behind this generalized MG/DD approach, which is strongly coupled with the hierarchical data and matrix structures explained before. ScaRC stands for:

- Scalable (w.r.t. "quality and number of local solution steps" at each stage)
- Recursive ("independently" for each stage in the hierarchy of partitioning)
- Clustering (for building patches via "fixed or adaptive blocking strategies")

and its "advantageous" numerical and computational behaviour can be characterized through the following observations (see [KT] and [TU1] for numerical examples): "Block-Jacobi/Gauß-Seidel schemes perform well for locally hidden anisotropies. More arithmetic operations can be performed locally, whereby the additional work may be disproportionately small (in terms of CPU) due to the local high-performance facilities."

IV) HIGH-PERFORMANCE LINEAR ALGEBRA

One of the main ideas behind the described (Recursive) Divide and Conquer approach in combination with the ScaRC solver technology is to detect 'locally structured parts'. In these 'local subdomains', we consequently apply 'highly structured tools' as typical for Finite Difference approaches: line- or rowwise numbering of unknowns and storage of the matrices as sparse bands (however, the matrix entries are calculated via the Finite Element modules!). As a result, we have 'optimal' data structures on each of these patches (which often correspond to the previously introduced 'matrix blocks'), and we can employ very powerful Linear Algebra tools which explicitly exploit the high performance of specific machine-optimized libraries (i.e., BLAS, LAPACK, ESSL, PERFLIB). The following Table shows typical results on some selected hardware platforms, for different tasks and techniques in Numerical Linear Algebra.
While Gaussian Elimination (GE) is presented only to demonstrate the (potentially) available performance of the given processors (often several hundreds of MFLOP, which are really measured!), we are much more interested in the realistic run-time behaviour of several matrix-vector multiplication (MV) techniques. Since these are probably the most important (because most time-consuming) components in typical iterative solution schemes such as Krylov-space methods or multigrid solvers, they are, besides the vector-modification routines (such as DAXPY for the linear combination of two vectors), excellent representatives to exemplarily demonstrate the 'real life' efficiency of many simulation tools in combination with specific hardware platforms. The measured MFLOP rates for the Gaussian Elimination are for a dense matrix (analogously to the standard LINPACK test!), while for the different MV techniques the matrix is a typical 9-point stencil ("discretized Poisson operator"). We perform tests for two different vector lengths N and give the measured MFLOP rates, which are all calculated via 20 N/time (for MV), resp., 2 N/time (for DAXPY). In all cases, we attempted to use "optimal" compiler options and machine-optimized libraries such as BLAS, ESSL or PERFLIB. Only in the case of the PENTIUM II did we have to perform the Gaussian Elimination with the FORTRAN sources exclusively, which might explain the worse rates.
Corresponding results for other iterative components which are essential in the context of multigrid solvers can be found on our Web page, which also contains our complete measurements on many modern processors (see also [ABT1] and [ABT2]).

[Table: measured MFLOP rates for GE, DAXPY, sparse MV, banded MV and blocked MV at two vector lengths N, on an IBM RS6000 (166 MHz), a SUN ULTRA (250 MHz) and a PC PENTIUM II (233 MHz)]

The 'sparse MV' technique is the standard technique in Finite Element codes (and others), also well known as the 'compact storage' technique or similar: the matrix (plus index arrays or lists) is stored as a long array containing the "nonzero elements" only. While this approach can be applied for arbitrary meshes and numberings of the unknowns, no explicit advantage of the linewise numbering can be exploited. The result is that, through the indexed access, the performance degrades dramatically in comparison to the (almost) Peak rates of the Gaussian Elimination (down to 5 %!). In fact, these results are even 'quasi-optimal' since the best available F77 compiler options have been applied. Moreover, it should be a mandatory test for everyone, particularly for those who work in F90, C++ or even JAVA, to measure the corresponding rates of the MV routines they use: this provides a first impression of one's own efficiency!

The most 'natural' way to improve these results is to exploit the fact that the matrix is a sparse banded matrix with 9 bands only. Hence, the matrix-vector multiplication is rewritten such that it now proceeds "band after band". In fact, each "band multiplication" is performed analogously to the DAXPY operation (modulo the "variable" multiplication factors!) and consequently leads to similar results as for DAXPY. The obvious advantage of this 'banded MV' approach is that these tasks can be performed on the basis of BLAS1 routines, which may exploit the vectorization facilities of many processors (particularly on vector computers!). Indeed, the measured results show improvements.
However, for 'long' vector lengths (256 K) the improvements are absolutely disappointing: it is obvious that for this kind of (workstation/PC) chip technology the processor cache dominates the resulting efficiency! The final step towards highly efficient components is to rearrange the matrix-vector multiplication in a "blockwise" sense ('blocked MV'): for a certain set of unknowns, a corresponding part of the matrix is treated such that cache-optimized and fully vectorized operations can be performed. This procedure is called "BLAS 2+"-style since, in fact, certain techniques for dense matrices which are based on routines from the BLAS2, resp., BLAS3 library have now been developed for such sparse banded matrices. The exact procedure has to be carefully developed in dependence on the underlying FEM discretization, and a more detailed description can be found in [ABT2]. In fact, we expect even better performance ratings in the future from more careful implementations, which should reach at least the DAXPY measurements!
V) REFERENCE ELEMENT SOLVERS

The results in the previous Section have shown how important the use of (locally) highly structured meshes is for the resulting performance. However, it is also obvious that techniques in the described 'BLAS 2+' style are absolutely necessary to achieve a high percentage of the several hundreds of MFLOP's on modern processors. The full FEM functionality, including complex geometries and adapted meshes, is administrated via the hierarchical 'data tree' partitioning, which in combination with the ScaRC solvers is responsible for the 'global' convergence behaviour. In contrast, the resulting efficiency is mainly determined by the convergence rates and the computational efficiency on these highly structured "subdomains". Hence, a very important step is to measure and to understand the characteristic runtime behaviour of modern processors and computer architectures. While these measurements provide the "processor speed" only, we additionally have to understand the typical convergence behaviour of certain multigrid components on such local patches, depending on the discretization spaces, the differential operators and the uniformity of the mesh: each 'reference element' is assumed to be a (logically equivalent) tensor-product mesh, but it may contain deformations and large aspect ratios. The important property is the "sparse banded" structure of the (local) matrices in the 'matrix blocks'! These convergence rates, together with the measured processor speed, determine the 'total numerical efficiency', which simply gives the "CPU time to gain 1 digit per unknown". Therefore, we can perform these measurements completely a priori, for different prototypes of meshes, differential operators and discretization spaces, and we can store the results in a kind of machine-dependent database: our expert system for the 'reference element solvers'!
Then, during the solution process and independently for each stage in the hierarchical tree structure, ScaRC automatically "selects" the "optimal" configuration, i.e., the smoothing operator and the number of smoothing steps, via this "a-priori expert system". Within the FEAST approach, we therefore have the chance to incorporate all the knowledge from the 'Finite Difference world' and from 'unit-square solver experts' into the higher functionality of FEM codes. Additionally, the facilities of current and future hardware platforms can be exploited inside a software product which at the same time is designed to realize the modern mathematical FEM methodology.

VI) SEVERAL ADAPTIVITY CONCEPTS

As typical for modern FEM packages, we directly incorporate certain tools for grid generation which allow an easy handling of local and global refinement or coarsening strategies: adaptive mesh moving, macro adaptivity and fully local adaptivity. Adaptive strategies for moving mesh points, along boundaries or inner structures, preserve the same logical structure in each 'macro block', and hence the shown performance rates can be preserved. Additionally, we work with adaptivity concepts related to each 'macro block', resp., 'matrix block'. Allowing 'blind' or 'slave macro nodes' preserves the high-performance facilities in each 'matrix block' and is a good compromise between fully local adaptivity and optimal efficiency through structured data. Only in the case that these concepts do not lead to satisfying results will certain macros lose their 'highly structured' features through the (local) use of fully adaptive techniques. On these (hopefully few) patches, the standard 'sparse' techniques for unstructured meshes have to be applied.
VII) DIRECT INTEGRATION OF PARALLELISM

Most software packages are designed for sequential algorithms to solve a given PDE problem, and the subsequent parallelization of certain methods often takes disproportionately long. On the other hand, the following typical 'naive' statement shows that this extra work is often neglected (by those who never perform the parallelization!): "1 IBM is more expensive than several SUN's or PC's! Why spend so much money on such a highly tuned single-processor machine? 'Simply' parallelize your code..." In fact, that is easy to say but hard to realize with most software packages. Therefore, we directly include tools such as MPI or PVM, or other standardized communication routines concerning our hierarchical tree structure, already on a low level. However, the more important step, which makes parallelization much easier, is the design of the ScaRC solver according to the hierarchical decomposition into different stages. Indeed, from an algorithmic point of view, our sequential and parallel versions differ only in the way that Jacobi- and Gauß-Seidel-like schemes work differently. Hence, all parallel executions can be identically simulated on single processors, which however can additionally improve their numerical behaviour with respect to efficiency and robustness through Gauß-Seidel-like mechanisms. Again, it is absolutely important to realize (see Section 'GENERALIZED SOLVER STRATEGY ScaRC', and also [KT] or [TU1]) that "Block-Jacobi/Gauß-Seidel schemes perform well for locally hidden anisotropies." Hence, in FEAST we only provide the 'software' tools for including parallelism on a low level, while the 'numerical parallelism' is incorporated via our ScaRC solver and the hierarchical 'tree structure'. However, what will be 'non-standard' is our concept of (adaptive) parallel load balancing, which is oriented towards 'total numerical efficiency' (that means, "how much processor work is spent to achieve a certain accuracy, depending on the local configuration"!)
in contrast to the 'classical' criterion of equilibrating the number of local unknowns (see [BE] for detailed information and examples in FEAST).

VIII) FULL FINITE ELEMENT FUNCTIONALITY

We plan to include (at least) all facilities of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2], for instance the routines for matrix assembly, the Finite Element basis function library, certain multigrid ingredients and many more. In addition, mechanisms for a-posteriori error control via 'residual techniques' and 'dual solutions' will also be provided and complemented with several concepts of adaptivity.

IX) PROFESSIONAL PRE- AND POSTPROCESSING

Candidates are our JAVA-based tool DeViSor for graphical pre- and postprocessing and certain AVS/Express modules which we have agreed with AVS to include in our software package (for free!). However, we are still looking for a competent partner for professional geometry and mesh generators, which shall be included as 'macro mesh generators' and for CAD-like descriptions of the domain. This step will be one of the most important towards the numerical solution of 'real life' problems.
X) OPTIMIZED APPLICATION TOOLS

We plan (at least) to publish a new version of our FEATFLOW2.0 which contains most of the methodology derived in [TU1], but based on the FEAST package. We hope to improve the quality of the recent FEATFLOW1.1 [TU2] significantly through all the addressed mathematical, algorithmic and implementation aspects.

CONCLUSIONS AND OUTLOOK

We expect the first version of FEAST by the end of 1998, but most of the `numerical' and `computational' ingredients have already been realized successfully in several test implementations (see the papers in the REFERENCES). The current status of the FEAST project and further information can always be obtained from our Web page. Nevertheless, help is always welcome: for instance in implementing and testing many auxiliary components, pre- and postprocessing, `unit square' experts and `computers for performance measurements', etc.

REFERENCES

[ABT1] ALTIERI, M., BECKER, CHR., TUREK, S.: "Konsequenzen eines numerischen `Elch Tests' für Prozessor-Architektur und Computersimulation", to appear.

[ABT2] ALTIERI, M., BECKER, CHR., TUREK, S.: "On the realistic performance of components in iterative solvers", Proc. FORTWIHR Conference, Munich, March 1998, LNCSE, Springer-Verlag, to appear.

[BE] BECKER, CHR.: "FEAST - The realization of Finite Element software for high-performance applications", Thesis, to appear.

[KI] KILIAN, S.: "Efficient parallel iterative solvers of ScaRC-type and their application to the incompressible Navier-Stokes equations", Thesis.

[KT] KILIAN, S., TUREK, S.: "An example for parallel ScaRC and its application to the incompressible Navier-Stokes equations", Proc. ENUMATH-97, Heidelberg, October 1997.

[RB] RANNACHER, R., BECKER, R.: "A Feed-Back Approach to Error Control in Finite Element Methods: Basic Analysis and Examples", Preprint 96-52, University of Heidelberg, SFB 359, 1996.

[SRT] SCHÄFER, M., RANNACHER, R., TUREK, S.: "Evaluation of a CFD Benchmark for Laminar Flows", Proc.
ENUMATH-97, Heidelberg, October 1997.

[TU1] TUREK, S.: "Efficient solvers for incompressible flow problems: An algorithmic approach in view of computational aspects", LNCSE 2, Springer-Verlag.

[TU2] TUREK, S.: "FEATFLOW. Finite element software for the incompressible Navier-Stokes equations: User Manual, Release 1.1", 1998 (see the WWW address above).
More information