(recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generaliza

Size: px
Start display at page:

Download "(recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generaliza"

Transcription

1 SOME BASIC CONCEPTS OF FEAST M. Altieri, Chr. Becker, S. Kilian, H. Oswald, S. Turek, J. Wallis Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, Heidelberg, Germany SUMMARY This paper deals with the basic principles of the new FEM software package FEAST. Based on an initial discussion of available software tools with respect to their application area, i.e., Education, Research or (industrial) Application, it illustrates the specic demands for such PDE software which is aimed to solve 'real life' problems. For the FEAST software, which is principally designed for high{performance simulations, we explain the basic principles of the underlying numerical, algorithmic and implementation concepts. Computational examples illustrate the (expected) eciency of this new software package, particularly in relation to existing approaches. INTRODUCTION Current trends in software development for Partial Dierential Equations (PDE's), and here in particular for Finite Element (FEM) approaches, go clearly towards objectoriented techniques and adaptive methods in any sense. Hereby the employed data and solver structures, and especially the `matrix structures', are often in contradiction to modern hardware platforms. As a result, the observed computational eciency is far away from expected Peak rates of almost 1 GFLOP nowadays, and the 'real life' gap will even further increase (see recent papers of Rude). Since high performance calculations may be only reached by explicitly exploiting 'caching in' and 'pipelining' in combination with sequentially stored arrays (using machine-optimized libraries as BLAS, ESSL or PERFLIB, for instance), the corresponding realization seems to be `easier' for simple Finite Dierence approaches. So, the question arises how to perform similar techniques for much more sophisticated Finite Element codes? These discrepancies between complex mathematical approaches and highly structured computational demands often lead to unreasonable calculation times for `real world' problems, e.g. Computational Fluid Dynamics (CFD) calculations in 3D, as can be seen from recent benchmarks [SRT] for commercial as well as research codes. Hence, strategies for eciency enhancement are necessary, not only from the mathematical (algorithms, discretizations) but also from the software point of view. To realize some of these necessary improvements, our new Finite Element package (project name: FEAST { Finite Element Analysis & Solution Tools) is under development. This package is based on the following concepts: 1

2 (recursive) `Divide and Conquer' strategies hierarchical data and solver structures, but also hierarchical (!) `matrix structures' ScaRC as generalization of multigrid and domain decomposition techniques frequent use of machine-optimized Linear Algebra routines all typical Finite Element facilities included The result is going to be a exible software package with special emphasis on: (closer to) peak performance on modern and future processors typical multigrid behaviour w.r.t. eciency and robustness parallelization tools directly included on low level open for dierent adaptivity concepts low storage requirements application to many `real life' problems possible In contrast to many other tool boxes, which often aim to develop preferably software for research or education topics, our approach clearly is designed for high performance applications with industrial background, especially in CFD. Consequently, our main emphasis lies on the aspects `eciency' and `robustness' and less on topics as `easy implementable' or `most modern programming environment'. As programming language FORTRAN (77 and 90) is used: This makes it possible to adopt many reliable parts of the predecessor packages FEAT2D, FEAT3D and FEAT- FLOW [TU2]. Further, on high performance computers, very ecient FORTRAN compilers are available and transparent access to the data structures is possible. The preand postprocessing, which will be based on (commercial) professional tools, is handled by JAVA-based program parts. Conguring a high performance computer as a FEAST server, the user shall be able to perform the remote calculation by a FEAST client. In the following, we give examples for `real' computational eciency results of typical numerical tools which help to motivate our hierarchical data, solver and matrix structures. To understand these better, we illustrate shortly the corresponding solution technique ScaRC ("Scalable Recursive Clustering") in combination with the overall `Divide and Conquer' philosophy which is essential for FEAST. We discuss how typical multigrid rates, even for complex congurations, can be achieved, on parallel as well as sequential computers with a very high computational eciency. CLASSIFICATION OF PDE SOFTWARE "Most of the available PDE software can be applied, in principal, to (almost) all problems. The practical functionality is mostly restricted by computer requirements (CPU, RAM) only! " 2

3 This statement appears to be quite `simple', but it nevertheless describes well the stateof-the-art of available software packages for the numerical solution of PDE's. Further, it can be used to classify many existing software tools and to illustrate the dierences in the underlying concepts and realizations of codes. Typical examples for critical applications which demonstrate the problems of many software packages are in the eld of CFD. Especially the recent DFG ow benchmark set "Channel ow around a cylinder" [SRT], which has been carried out recently under the German Priority Research Programme "Flow Simulation on High Performance Computers", has shown quite surprising results. Many codes with various numerical basics and on very dierent computer platforms have participated so far. The results in [SRT] clearly show dierences in total eciency (concerning elapsed CPU times with respect to obtained accuracy!) of several orders of magnitude, and in fact many codes have not been able to give satisfying results, and this for a laminar incompressible ow in the range of Reynolds number Re = 20, resp., Re = 100. As an (ocial) result one could state the following conclusions, all on the basis of such benchmark calculations: 1. It is often not sucient to take any `multi{purpose' package as basic tool, and then to implement a (more or less) clever numerical approach in a straightforward manner. In most cases, the eciency of the underlying basic package does not allow to solve accurately "hard" problems as the incompressible Navier{Stokes equations. 2. Often the chosen `numerical ingredients', as mesh design, discretization spaces, time{stepping schemes, solvers for discrete linear and nonlinear systems, are `good' schemes as standalones but do not t together if realized in a common code. 3. The typical approach of implementing well-known, but `old' solution schemes in a straightforward manner on vector or parallel super-computers does not lead to satisfying results, without essentially improving the numerical and computational background at the same time. Applications from Computational Fluid Dynamics clearly indicate that specic attempts for optimizing the applied numerical schemes (discretization), algorithmic components (solver) as well as software aspects (implementation w.r.t. underlying hardware) must be performed. Only then, similar problems with a certain `real life' character may be tackled successfully. Based on such comparisons and resulting experiences we propose the following classi- cation of PDE software packages, namely if they are principally designed for the use in Education, Research or (Industrial) Application. Education. Corresponding software tools are mainly designed for students to `play{around' with mathematical tools. Their most important features include `easy' user interfaces, and the code should be based on simple, but very robust algorithms. Due to the typically low complexity of the problems to be examined by the students, the code eciency is often `independent of implementation' and typically requires only few seconds execution time. Therefore, C++- and especially JAVA-based implementations with graphical interfaces and platform-independent execution are typical candidates for this kind of software. 3

4 Research. This type of software is representative for most of the available tools in the mathematical community. The software is designed to be open for numerical and algorithmic changes, for instance to examine new concepts for adaptive error control or to test the convergence behaviour of new multilevel-type solvers. Consequently, programming languages which allow very exible and robust data structures are favourized, and particularly modern object-oriented environments as C++ are wide-spread in this eld. In contrast, user interfaces or eciency aspects play a minor part in this mathematical basic research. (Industrial) Application. This software is specialized to apply `well-understood' and `optimized' numerical and algorithmic tools to `real life' congurations with an industrial background. If the full potential of such sophisticated approaches is to be eciently exploited, an `optimal play{together' of Mathematics and implementation in all components has to be guaranteed, to be able to compete with present production codes. Consequently, the demand for robustness and eciency of the developed software is of major interest. Coming back to the previous `Application' examples from CFD, we want to be more specic to illustrate the practical demands on software which will be successfully applied in this special area. Typical problems associated with `real life' ows are the following: complex domains and anisotropic meshes in space and time 10 3? 10 5 time steps and 10 4? 10 8 unknowns by accuracy reasons optimal control of specic physical quantities as drag coecients or ux distributions Therefore, concrete `realistic' simulations often require the use of several millions of unknowns for the numerical solution of (mostly) nonlinear and nonstationary PDE's, and this many times. On the other hand, often parallel computers with many processors and each with (almost) 1 GFLOP Peak performance are available, and today's Mathematics oers higher order discretizations with adaptivity concepts and accompanying multigrid/multilevel-like solvers. However, to marry such modern numerical tools with recent and future hardware aspects and to create an adequate code, the following four components have to be respected with the same weight: Numerics, Algorithms, Implementation and Hardware platforms. Only if all components optimally play-together, the (potential) high{performance of each aspect can be incorporated in the resulting code such that the expected substantially improved simulation tool gets really available. MAIN PRINCIPLES IN FEAST We give a short explanation of some of the main principles in FEAST - at least with respect to the current status in April These are mainly concepts and ideas which are recently planned to be done and/or which are actually under work. Hence, we often omit a rigorous mathematical or algorithmic formulation and present instead the philosophy and background of these concepts only. More details about the practical realization and the numerical and algorithmic sources can be found in [KI], [BE] and [TU1]. 4

5 I) PROGRAMMING LANGUAGE As explained in the INTRODUCTION, we decided to implement most of the numerical kernels of FEAST in FORTRAN 90 (F90). Beside the modern functionality of this language, promising compilers on most recent hardware platforms are available. And additionally, what is also very important, many numerical parts of the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2] can be directly included, such as for instance the routines for matrix assembly, the Finite Element basis function library, multigrid ingredients and many more. However, preliminary tests with recent F90 compilers on several platforms have shown, that the use of FORTRAN 77 (F77) compilers is still much more favourable if absolute high{performance is required. Especially for the Numerical Linear Algebra part, i.e., matrix-vector multiplications, vector modications, tridiagonal solvers and other tools which are typical in the context of iterative solution schemes (see [ABT1] and [ABT2]), the use of machine{optimized F77 software libraries or self{developed FORTRAN 77 code still leads to much better performance ratings. Nevertheless, it is planned to switch to F90 completely if F90 has achieved the performance of F77, and the complete package is developed under the restriction that the corresponding modules can be easily exchanged. Finally, it is planned to "hide" the numerical kernel routines from the "typical user" who only wants to `perform applications'. Recently, we are developing a higher-level user{ interface written in JAVA to be machine- and platform-independent as far as possible. Additionally, graphical user interfaces (GUI's) or "meta{language" approaches are under work to leave behind the "useless" discussion about the programming language used in the numerical kernels. II) HIERARCHICAL DATA, SOLVER AND MATRIX STRUCTURES One of the most important principles in FEAST is to apply consequently a (Recursive) Divide and Conquer strategy. The solution of the complete "global" problem is recursively split into smaller "independent" subproblems on `patches' as part of the complete set of unknowns. Thus, the two major aims in this splitting procedure which can be performed by hand or via self{adaptive strategies are: Find locally structured parts Find locally anisotropic parts Based on "small" structured subdomains on the lowest level (in fact, even one single or a small number of elements only is allowed), the "higher-level" substructures are generated via clustering of "lower-level" parts such that algebraic or geometric irregularities are hidden inside the new "higher-level" patch. More background for this strategy is given in the following Sections `HIGH{PERFORMANCE LINEAR ALGEBRA' and `REFERENCE ELEMENT SOLVERS', and particularly in `GENERALIZED SOLVER STRATEGY ScaRC' which describes the corresponding solvers related to each stage. The following Figures illustrate exemplarily the employed data structure for a (coarse) triangulation of a given domain and its recursive partitioning into several kinds of substructures. 5

6 SB PB PB According to this decomposition, a corresponding `data tree' { the skeleton of the partitioning strategy { describes the hierarchical decomposition process. It consists of a specic collection of Elements, Macros (`Mxxx'), Matrix Blocks (`'), Parallel Blocks (`PB'), Subdomain Blocks (`SB'), etc. SB PB PB M1 M2 M3 M4 M5 M6 M7 M8 The atomic units in our decomposition are the `macros' which may be of type `structured' (as n n collection of quadrilaterals (in 2D) with local Finite-Dierence data structures) or `unstructured' (any collection of elements, for instance in the case of fully adaptive local grid renement). These `macros' (one or several) can be clustered to build a `matrix block' which contains the "local matrix parts": only here are the complete matrix informations stored! Higher-level constructs are `parallel blocks' (for the parallel distribution) and `subdomain blocks' (with special conformity rules with respect to grid renement and applied discretization spaces). They all together build the complete domain, resp., the complete set of unknowns. It is important to realize that each stage in this hierarchical tree can act as independent `father' in relation to its `child' substructures while it is a 'child' at the same time in another phase of the solution process (inside of the ScaRC solver, see later). III) GENERALIZED SOLVER STRATEGY ScaRC In short form, our long-time experience with the numerical and computational runtime behaviour of typical multigrid (MG) and Domain Decomposition (DD) solvers can be concluded as follows: a) Some observations from standard multigrid approaches: While in fact the numerical convergence behaviour of (optimized) multigrid is very satisfying with respect to robustness and eciency requirements, there still remain some `open' problems: Often the parallelization of powerful `recursive' smoothers (as SOR or ILU) leads to performance degradations since they can be realized only in a `blockwise' sense. Thus, it is often not clear how the nice numerical behaviour in sequential codes for 6

7 complicated geometric structures or local anisotropies, can be reached in parallel computations. And additionally, the communication overhead especially on coarser grid levels dominates the total CPU time. Even more important is the `computational observation' that the realized performance on modern platforms is often far beyond (sometimes less than 1 %) the expected Peak performance. Many codes often reach much less than 10 MFLOP, and this on computers which are said (by the vendors) to run with up to 1 GFLOP Peak. The reason is simply that the single components in multigrid (smoother, defect calculation, grid transfer) perform too few arithmetic work with respect to each data exchange, such that the facilities of modern superscalar architectures are poorly exploitable. In contrast, we will show that in fact 30 { 70 % can be realistic with appropriate techniques. b) Some observations from standard Domain Decomposition approaches: In contrast to standard multigrid, the parallel eciency is much higher, at least as long as no large overlap region between processors must be exchanged. While overlapping DD methods do not require additional coarse grid problems (however the implementation in 3D for complicated domains or for complex Finite Element spaces is a hard job!), non-overlapping DD approaches require certain coarse grid problems, as the BPS preconditioner for instance which may lead again to severe numerical and computational problems, depending on the geometrical structure or the used discretization spaces. However, the most important dierence between Domain Decomposition and multigrid are the (often) much worse convergence rates of DD, although at the same time more arithmetic work is done on each processor. As a conclusion, improvements are enforced by the facts that the convergence behaviour is often quite sensitive with respect to (local) geometric/algebraic anisotropies (in `real life' congurations!), and that the performed arithmetic work (which allows the high-performance) is often restricted by (un)necessary data exchanges. An additional observation which is strongly related to the previous data structure in combination with the specic hierarchical ScaRC solver is illustrated in the following Figure. We show the resulting "optimal" mesh from a numerical simulation of R.Becker/R.Rannacher for `Flow around the cylinder' which was adaptively rened via rigorous a-posteriori error control mechanisms specied for the required drag coecient (see [RB]). As can be seen, the adaptive grid renement techniques are needed only locally, near the boundaries, while mostly regular substructures (up to 90 %) can be (and should be!) used in the interior of the domain. This is a quite typical result and shows that even for (more or less) complex ow simulations (here as a prototypical example) locally blockwise `Finite Dierence' techniques can be applied: these regions can be detected and exploited by the given hierarchical strategies. 7

8 We omit here a detailed description of the numerical and algorithmic properties of ScaRC and refer to the papers [KT] and particularly to [KI]. Here, we restrict to repeat the main philosophy behind this generalized MG/DD approach, which is strongly coupled with the hierarchical data and matrix structures as explained before. ScaRC stands for: Scalable (w.r.t. "quality and number of local solution steps" at each stage) Recursive ("independently" for each stage in the hierarchy of partitioning) Clustering (for building patches via "xed or adaptive blocking strategies") and its "advantageous" numerical and computational behaviour can be characterized through following observations (look at [KT] and [TU1] for numerical examples): "Block{Jacobi/Gau{Seidel schemes perform well for locally hidden anisotropies. More arithmetic operations can be performed locally whereby additional work may be unproportionately small (in terms of CPU) due to the local high{performance facilities." IV) HIGH{PERFORMANCE LINEAR ALGEBRA One of the main ideas behind the described (Recursive) Divide and Conquer approach in combination with the ScaRC solver technology is to detect `locally structured parts'. In these `local subdomains', we apply consequently `highly structured tools' as typical for Finite Dierence approaches: line- or rowwise numbering of unknowns and storing of matrices as sparse bands (however the matrix entries are calculated via the Finite Element modules!). As a result, we have `optimal' data structures on each of these patches (which often correspond to the former introduced `matrix blocks') and we can perform very powerful Linear Algebra tools which explicitely exploit the high-performance of specic machine{optimized libraries (i.e., BLAS, LAPACK, ESSL, PERFLIB). The following Table shows typical results on some selected hardware platforms, for dierent tasks and techniques in Numerical Linear Algebra. While Gaussian Elimination (GE) is presented only to demonstrate the (potentially) available performance of the given processors (often several hundreds of MFLOP which are really measured!), we are much more interested in the realistic run-time behaviour of several matrix{vector multiplication (MV) techniques. Since these are probably the most important (since most time{ consuming) components in typical iterative solution schemes as Krylov-space methods or multigrid solvers, they are - beside the vector-modication routines (as DAXPY for the linear combination of two vectors) - excellent representants to exemplarily demonstrate the `real life' eciency of many simulation tools in combination with specic hardware platforms. The measured MFLOP for the Gaussian Elimination are for a dense matrix (analogously to the standard linpack test!) while for the dierent MV techniques the matrix is a typical 9{point stencil ("discretized Poisson operator"). We perform tests for two dierent vector lengths N and give the measured MFLOP rates which are all calculated via 20 N=time (for MV), resp., 2 N=time (for DAXPY). In all cases, we attempted to use "optimal" compiler options and machine-optimized libraries as the BLAS, ESSL or PERFLIB. Only in the case of the PENTIUM II we had to perform the Gaussian Elimination with the FORTRAN-sources exclusively which might explain the worse rates. 8

9 Corresponding results for other iterative components which are essential in the context of multigrid solvers can be found at which also contains our complete measurements on many modern processors (see also [ABT1] and [ABT2]). Computer GE N DAXPY sparse MV banded MV blocked MV IBM RS K (166 Mhz) 256K SUN U K (250 Mhz) 256K PC PII 45 4K (233 Mhz) 65K The `sparse MV' technique is the standard technique in Finite Element codes (and others), also well known as `compact storage' technique or similar: the matrix (plus index arrays or lists) is stored as long array containing the "nonzero elements" only. While this approach can be applied for arbitrary meshes and numberings of the unknowns, no explicit advantage of the linewise numbering can be exploited. The result is that through the indexed access, the performance degradates dramatically in comparison to the (almost) Peak-rates from the Gaussian Elimination (down to 5 %!). In fact, these results are even `quasi-optimal' since the best-available F77 compiler options have been applied. Moreover, it should be a necessary test for everyone, particularly for those who work in F90, C++ or even JAVA: to measure the corresponding rates of the used MV routines provides a rst impression of the own eciency! The most `natural' way to improve these results is to exploit the fact that the matrix is a sparse banded matrix with 9 bands only. Hence, the matrix{vector multiplication is rewritten such that now "band after band" are applied. In fact, each "band multiplication" is performed analogously to the DAXPY operation (modulo the "variable" multiplication factors!), and leads consequently to similar results as for DAXPY. The obvious advantage of this `banded MV' approach is that these tasks can be performed on the basis of BLAS1 routines which may exploit the vectorization facilities of many processors (particularly on vector computers!). Indeed, the measured results show improvements. However, for `long' vector lengths (256 K) the improvements are absolutely disappointing: It is obvious, that for this kind of (workstation/pc) chip technology the processor cache dominates the resulting eciency! The nal step towards highly ecient components is to rearrange the matrix{vector multiplication in a "blockwise" sense (`blocked MV'): for a certain set of unknowns, a corresponding part of the matrix is treated such that cache-optimized and fully vectorized operations can be performed. This procedure is called "BLAS 2+"-style since in fact certain techniques for dense matrices which are based on routines from the BLAS2, resp., BLAS3 library, have now been developed for such sparse banded matrices. The exact procedure has to be carefully developed in dependence of the underlying FEM discretization, and a more detailed description can be found in [ABT2]. In fact, we expect even better performance ratings in future by more careful implementations which should reach at least the DAXPY measurements! 9

10 V) REFERENCE ELEMENT SOLVERS The results in the previous Section have shown how important the use of (locally) highly{structured meshes for the resulting performance is. However, it is also obvious that techniques of the described `BLAS 2+' style are absolutely necessary to achieve a high percentage of the several hundreds of MFLOP's on modern processors. The full FEM functionality including complex geometries and adapted meshes is administrated via the hierarchical 'data tree' partitioning which in combination with the ScaRC solvers is responsible for the `global' convergence behaviour. In contrast, the resulting eciency is mainly directed by the performance rates with respect to convergence rates and computational eciency on these highly structured "subdomains". Hence, a very important step is to measure and to understand the characteristic runtime behaviour of modern processors and computer architectures. While these measurements provide the "processor speed" only, we additionally have to understand the typical convergence behaviour of certain multigrid components on such local patches, depending on the discretization spaces, the dierential operators and the uniformity of the mesh: Each `reference element' is assumed to be a (logically equivalent) tensor-product mesh, but it may contain deformations and large aspect ratios. The important property is the "sparse banded" structure of the (local) matrices in the `matrix blocks'! Those convergence rates together with the measured processor speed determine the `total numerical eciency' which simply gives the "CPU time to gain 1 digit per unknown". Therefore, we can perform these measurements completely a-priori, for dierent prototypes of meshes, dierential operators and discretization spaces, and we can store these results in a kind of machine{dependent data basis: our expert system for the `reference element solvers'! Then, during the solution process and independently for each stage in the hierarchical tree structure, ScaRC "selects" automatically the "optimal" conguration, i.e., the smoothing operator and the number of smoothing steps, via this "a-priori expert system". Within the FEAST approach, we have therefore the chance to incorporate all the knowledge from the `Finite Dierence world' and from `unit-square solver experts' into the higher functionality of FEM codes. And additionally, the facilities of actual and future hardware platforms can be exploited inside of a software product which at the same time is designed to realize the modern mathematical FEM methodology. VI) SEVERAL ADAPTIVITY CONCEPTS As typical for modern FEM packages, we directly incorporate certain tools for grid generation which allow an easy handling of local and global renement or coarsening strategies: adaptive mesh moving, macro adaptivity and fully local adaptivity. Adaptive strategies for moving mesh points, along boundaries or inner structures, allow the same logic structure in each `macro block', and hence the shown performance rates can be preserved. Additionally, we work with adaptivity concepts related to each `macro block', resp., `matrix block'. Allowing `blind' or `slave macro nodes' preserves the high{ performance facilities in each `matrix block', and is a good compromise between fully local adaptivity and optimal eciency through structured data. Only in that case, that these concepts do not lead to satisfying results, certain macros will loose their `highly structured' features through the (local) use of fully adaptive techniques. 
On these (hopefully) few patches, the standard `sparse' techniques for unstructured meshes have to be applied. 10

11 VII) DIRECT INTEGRATION OF PARALLELISM Most software packages are designed for sequential algorithms to solve a given PDE problem, and the subsequent parallelization of certain methods takes often unproportionately long. On the other hand, the following typical `naive' statement shows that this extra work is often neglected (by those who never perform the parallelization!): "1 IBM is more expensive than several SUN's or PC's! Why spend so much money for such a highly{tuned single processor machine? `Simply' parallelize your code..." In fact that is easy to say, but hard to realize with most software packages. Therefore, we directly include tools as MPI or PVM, or other standardized communication routines concerning our hierarchical tree structure, already on low level. However the more important step, which makes parallelization much more easy, is the design of the ScaRC solver according to the hierarchical decomposition in dierent stages. Indeed, from an algorithmic point of view, our sequential and parallel versions dier only as analogously Jacobi- and Gau{Seidel-like schemes work dierently. Hence, all parallel executions can be identically simulated on single processors which however can additionally improve their numerical behaviour with respect to eciency and robustness through Gau{Seidel-like mechanisms. Again, it is absolutely important to realize - see Section `GENERALIZED SOLVER STRATEGY ScaRC' - that (see also [KT] or [TU1]) "Block{Jacobi/Gau{Seidel schemes perform well for locally hidden anisotropies." Hence, we only provide in FEAST the `software' tools for including parallelism on low level, while the `numerical parallelism' is incorporated via our ScaRC solver and the hierarchical `tree structure'. However, what will be `non-standard' is our concept of (adaptive) parallel loadbalancing which is oriented in `total numerical eciency' (that means, "how much processor work is spent to achieve a certain accuracy, depending on the local conguration"!) in contrast to the `classical' criterion of equilibrating the number of local unknowns (see [BE] for detailed information and examples in FEAST). VIII) FULL FINITE ELEMENT FUNCTIONALITY We plan to include (at least) all facilities from the predecessor packages FEAT2D, FEAT3D and FEATFLOW [TU2], as for instance the routines for matrix assembly, the Finite Element basis function library, certain multigrid ingredients and many more. However, in addition, also mechanisms for a-posteriori error control via `residual techniques' and `dual solution' will be provided and complemented with several concepts of adaptivity. IX) PROFESSIONAL PRE- AND POSTPROCESSING Candidates are our JAVA{based tool DeViSor for graphical pre- and postprocessing and certain AVS/Express modules for which we agreed with AVS to include in our software package (for free!). However, we still look for a competent partner for professional geometry and mesh generators which shall be included as `macro mesh generators' and for CAD-like descriptions of the domain. This step will be one of the most important towards the numerical solution of `real life' problems. 11

12 X) OPTIMIZED APPLICATION TOOLS We plan (at least) to publish a new version of our FEATFLOW2.0 which contains most of our methodology derived in [TU1], but then based on the FEAST package. We hope to improve signicantly the quality of the recent FEATFLOW1.1 [TU2] through all the addressed mathematical, algorithmic and implementation aspects. CONCLUSIONS AND OUTLOOK We expect the rst version of FEAST for end of 1998, but most of the `numerical' and `computational' ingredients have already been successfully realized in several test implementations (see the papers in the REFERENCES). The actual status of the FEAST project and further information can always be obtained from our Web page: Nevertheless, help is always welcome: for instance in implementing and testing many auxiliary components, pre- and postprocessing, `unit square' experts and `computers for performance measurements', etc. REFERENCES [ABT 1] ALTIERI, M., BECKER, CHR., TUREK, S.: "Konsequenzen eines numerischen `Elch Tests' fur Prozessor{Architektur und Computersimulation", to appear. [ABT 2] ALTIERI, M., BECKER, CHR., TUREK, S.: "On the realistic performance of components in iterative solvers", Proc. FORTWIHR Conference, Munich, March 1998, LNCSE, Springer-Verlag, to appear. [BE] BECKER, CHR.: "FEAST - The realization of Finite Element software for high-performance applications", Thesis, to appear. [KI] KILIAN, S.: "Ecient parallel iterative solvers of ScaRC-type and their application to the incompressible Navier-Stokes equations", Thesis, [KT ] KILIAN, S., TUREK, S.: "An example for parallel ScaRC and its application to the incompressible Navier-Stokes equations", Proc. ENUMATH-97, Heidelberg, October [RB] RANNACHER, R., BECKER, R.: "A Feed-Back Approach to Error Control in Finite Element Methods: Basic Analysis and Examples", Preprint 96{52, University of Heidelberg, SFB 359, [SRT ] SCH AFER, M., RANNACHER, R., TUREK, S.: "Evaluation of a CFD Benchmark for Laminar Flows", Proc. ENUMATH-97, Heidelberg, October [T U 1] TUREK, S.: "Ecient solvers for incompressible ow problems: An algorithmic approach in view of computational aspects", LNCSE 2, Springer-Verlag, [T U 2] TUREK, S.: "FEATFLOW. Finite element software for the incompressible Navier-Stokes equations: User Manual, Release 1.1, 1998 (see the WWW-address above). 12

Two main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s

Two main topics: `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies' `(Iterative) Solution strategies for huge s . Trends in processor technology and their impact on Numerics for PDE's S. Turek Institut fur Angewandte Mathematik, Universitat Heidelberg Im Neuenheimer Feld 294, 69120 Heidelberg, Germany http://gaia.iwr.uni-heidelberg.de/~ture

More information

High Performance Computing for PDE Towards Petascale Computing

High Performance Computing for PDE Towards Petascale Computing High Performance Computing for PDE Towards Petascale Computing S. Turek, D. Göddeke with support by: Chr. Becker, S. Buijssen, M. Grajewski, H. Wobker Institut für Angewandte Mathematik, Univ. Dortmund

More information

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction FOR 493 - P3: A monolithic multigrid FEM solver for fluid structure interaction Stefan Turek 1 Jaroslav Hron 1,2 Hilmar Wobker 1 Mudassar Razzaq 1 1 Institute of Applied Mathematics, TU Dortmund, Germany

More information

High Performance Computing for PDE Some numerical aspects of Petascale Computing

High Performance Computing for PDE Some numerical aspects of Petascale Computing High Performance Computing for PDE Some numerical aspects of Petascale Computing S. Turek, D. Göddeke with support by: Chr. Becker, S. Buijssen, M. Grajewski, H. Wobker Institut für Angewandte Mathematik,

More information

Hardware-Oriented Numerics - High Performance FEM Simulation of PDEs

Hardware-Oriented Numerics - High Performance FEM Simulation of PDEs Hardware-Oriented umerics - High Performance FEM Simulation of PDEs Stefan Turek Institut für Angewandte Mathematik, Univ. Dortmund http://www.mathematik.uni-dortmund.de/ls3 http://www.featflow.de Performance

More information

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dominik Göddeke Universität Dortmund dominik.goeddeke@math.uni-dortmund.de Christian Becker christian.becker@math.uni-dortmund.de

More information

GPU Cluster Computing for FEM

GPU Cluster Computing for FEM GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing

More information

Resilient geometric finite-element multigrid algorithms using minimised checkpointing

Resilient geometric finite-element multigrid algorithms using minimised checkpointing Resilient geometric finite-element multigrid algorithms using minimised checkpointing Dominik Göddeke, Mirco Altenbernd, Dirk Ribbrock Institut für Angewandte Mathematik (LS3) Fakultät für Mathematik TU

More information

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST

Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Integrating GPUs as fast co-processors into the existing parallel FE package FEAST Dipl.-Inform. Dominik Göddeke (dominik.goeddeke@math.uni-dortmund.de) Mathematics III: Applied Mathematics and Numerics

More information

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those

LINUX. Benchmark problems have been calculated with dierent cluster con- gurations. The results obtained from these experiments are compared to those Parallel Computing on PC Clusters - An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen

More information

GPU Acceleration of Unmodified CSM and CFD Solvers

GPU Acceleration of Unmodified CSM and CFD Solvers GPU Acceleration of Unmodified CSM and CFD Solvers Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de

More information

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations Robert Strzodka, Stanford University Dominik Göddeke, Universität Dortmund Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Number of slices

More information

Performance. Computing (UCHPC)) for Finite Element Simulations

Performance. Computing (UCHPC)) for Finite Element Simulations technische universität dortmund Universität Dortmund fakultät für mathematik LS III (IAM) UnConventional High Performance Computing (UCHPC)) for Finite Element Simulations S. Turek, Chr. Becker, S. Buijssen,

More information

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Dominik Göddeke Angewandte Mathematik und Numerik, Universität Dortmund Acknowledgments Joint

More information

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

Adaptive-Mesh-Refinement Pattern

Adaptive-Mesh-Refinement Pattern Adaptive-Mesh-Refinement Pattern I. Problem Data-parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

GPU Cluster Computing for Finite Element Applications

GPU Cluster Computing for Finite Element Applications GPU Cluster Computing for Finite Element Applications Dominik Göddeke, Hilmar Wobker, Sven H.M. Buijssen and Stefan Turek Applied Mathematics TU Dortmund dominik.goeddeke@math.tu-dortmund.de http://www.mathematik.tu-dortmund.de/~goeddeke

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM

Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM Dominik Göddeke and Robert Strzodka Institut für Angewandte Mathematik (LS3), TU Dortmund Max Planck Institut

More information

PROGRAMMING OF MULTIGRID METHODS

PROGRAMMING OF MULTIGRID METHODS PROGRAMMING OF MULTIGRID METHODS LONG CHEN In this note, we explain the implementation detail of multigrid methods. We will use the approach by space decomposition and subspace correction method; see Chapter:

More information

Introduction to Multigrid and its Parallelization

Introduction to Multigrid and its Parallelization Introduction to Multigrid and its Parallelization! Thomas D. Economon Lecture 14a May 28, 2014 Announcements 2 HW 1 & 2 have been returned. Any questions? Final projects are due June 11, 5 pm. If you are

More information

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke. The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26

More information

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

Reconstruction of Trees from Laser Scan Data and further Simulation Topics Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs In Proceedings of ASIM 2005-18th Symposium on Simulation Technique, Sept. 2005. Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke dominik.goeddeke@math.uni-dortmund.de Universität

More information

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck.

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck. To be published in: Notes on Numerical Fluid Mechanics, Vieweg 1994 Flow simulation with FEM on massively parallel systems Frank Lohmeyer, Oliver Vornberger Department of Mathematics and Computer Science

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus)

Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus) 1-day workshop, TU Eindhoven, April 17, 2012 Recent developments in the solution of indefinite systems Location: De Zwarte Doos (TU/e campus) :10.25-10.30: Opening and word of welcome 10.30-11.15: Michele

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Free-Form Shape Optimization using CAD Models

Free-Form Shape Optimization using CAD Models Free-Form Shape Optimization using CAD Models D. Baumgärtner 1, M. Breitenberger 1, K.-U. Bletzinger 1 1 Lehrstuhl für Statik, Technische Universität München (TUM), Arcisstraße 21, D-80333 München 1 Motivation

More information

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential

More information

Exploring unstructured Poisson solvers for FDS

Exploring unstructured Poisson solvers for FDS Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests

More information

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example

More information

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation

Algebraic Multigrid (AMG) for Ground Water Flow and Oil Reservoir Simulation lgebraic Multigrid (MG) for Ground Water Flow and Oil Reservoir Simulation Klaus Stüben, Patrick Delaney 2, Serguei Chmakov 3 Fraunhofer Institute SCI, Klaus.Stueben@scai.fhg.de, St. ugustin, Germany 2

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves

Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Memory Efficient Adaptive Mesh Generation and Implementation of Multigrid Algorithms Using Sierpinski Curves Michael Bader TU München Stefanie Schraufstetter TU München Jörn Behrens AWI Bremerhaven Abstract

More information

Higher order nite element methods and multigrid solvers in a benchmark problem for the 3D Navier Stokes equations

Higher order nite element methods and multigrid solvers in a benchmark problem for the 3D Navier Stokes equations INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS Int. J. Numer. Meth. Fluids 2002; 40:775 798 (DOI: 10.1002/d.377) Higher order nite element methods and multigrid solvers in a benchmark problem for

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Efficiency Aspects for Advanced Fluid Finite Element Formulations

Efficiency Aspects for Advanced Fluid Finite Element Formulations Proceedings of the 5 th International Conference on Computation of Shell and Spatial Structures June 1-4, 2005 Salzburg, Austria E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.) www.iassiacm2005.de

More information

on the CM-200. Claus Bendtsen Abstract to implement LAPACK-style routines already developed for other architectures,

on the CM-200. Claus Bendtsen Abstract to implement LAPACK-style routines already developed for other architectures, \Quick" Implementation of Block LU Algorithms on the CM-200. Claus Bendtsen Abstract The CMSSL library only includes a limited amount of mathematical algorithms. Hence, when writing code for the Connection

More information

Computing on GPU Clusters

Computing on GPU Clusters Computing on GPU Clusters Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Case study: GPU acceleration of parallel multigrid solvers
