
Trends in processor technology and their impact on Numerics for PDE's
S. Turek
Institut für Angewandte Mathematik, Universität Heidelberg
Im Neuenheimer Feld 294, 69120 Heidelberg, Germany
http://gaia.iwr.uni-heidelberg.de/~ture
http://www.iwr.uni-heidelberg.de/~featflow
October 1999: Universität Dortmund

Two main topics:
- `A posteriori (error) control of FEM/FV discretizations with adaptive meshing strategies'
- `(Iterative) Solution strategies for huge systems of equations'

Is error control with adaptive meshing necessary??! YES!!!
Does error control lead to more efficient simulation tools??!?? ⇒ `real life' applications (CFD???)
Do adaptive meshes lead to more efficient simulations??!? ⇒ RAM????!??? ⇒ CPU???
⇒ Robust and efficient (complete) solvers! Modern processor technology!!!

Main components in iterative schemes: Matrix-Vector applications (MV)
- Krylov-space methods, Multigrid, etc.: defect calculations, smoothing, step-length control, etc.
- Sometimes consuming (at least) 60-90% of CPU time

MV techniques (Storage/Application):
1) Sparse MV
   - Standard technique in most FEM or FV codes
   - Storage of `non-zero' matrix elements only (CSR, CSC, ...)
   - Access via index vectors, linked lists, pointers, etc.
2) (Sparse) Banded MV
   - Typical for FD codes on `tensorproduct meshes'
   - Storage of `non-zero' elements in bands/diagonals
   - MV multiplication `bandwise' (and `window-oriented') → FEAST!!!
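To make variant 1 concrete, a minimal C sketch of a CSR matrix-vector multiplication follows (illustrative only: FEATFLOW itself is Fortran 77 with its own storage layout, and the names row_ptr, col_idx, val are assumptions, not FEAST identifiers):

    /* Sparse MV, variant 1 (CSR): y = A*x.
     * Only the nonzero entries are stored; the accesses to x go through
     * an index vector, i.e. they are indirect. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                s += val[k] * x[col_idx[k]];   /* indirect access to x */
            y[i] = s;
        }
    }

The indirect access x[col_idx[k]] is exactly the `access via index vectors' mentioned above; it is what makes the kernel sensitive to the numbering of the unknowns.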

Results on general unstructured meshes
Sparse MV multiplication in FEATFLOW (F77!!!), MFLOP/s:

  Computer           #Unknowns    CM    TL   STO
  SUN E450              13,688    22    20    19
  (≈ 250 MFLOP/s)       54,256    17    15    13
                       216,032    16    14     6
                       862,144    16    15     4

STAR-CD, 500,000 hexahedra (by DaimlerChrysler): SGI Origin2000 (6 processors), 6.5 h (CPU)
⇒ `Optimal' multigrid: 0.1 sec/subproblem???

  Quantity       Experiment   Simulation   Difference
  Drag (`cw')       0.165        0.317        92 %
  Lift (`ca')      -0.083       -0.127        53 %

First conclusions for sparse MV:
Different numbering strategies can lead to:
- identical numerical results and work (arithm. operations, memory accesses)
- huge differences in elapsed CPU time!
Sparse MV techniques are `slow' and depend on:
- problem size
- `amount' and `kind' (?) of data access
Sparse MV techniques are the basis for:
- most available commercial codes
- most recent research software projects!
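The first conclusion can be reproduced with a small experiment: run the same CSR-style kernel once with `local' column indices (a band-like numbering) and once with randomly scattered indices. The arithmetic work and the number of memory accesses are identical; only the locality of the accesses to x changes. A hedged, self-contained C sketch (problem size, stencil width and repetition count are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N   200000   /* number of unknowns                           */
    #define NNZ 7        /* nonzeros per row (like a 3D 7-point stencil) */

    /* CSR-style MV, y = A*x, repeated a few times and timed */
    static double time_mv(const int *row_ptr, const int *col_idx,
                          const double *val, const double *x, double *y)
    {
        clock_t t0 = clock();
        for (int rep = 0; rep < 50; ++rep)
            for (int i = 0; i < N; ++i) {
                double s = 0.0;
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                    s += val[k] * x[col_idx[k]];
                y[i] = s;
            }
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        int    *row_ptr = malloc((N + 1) * sizeof *row_ptr);
        int    *col_loc = malloc((size_t)N * NNZ * sizeof *col_loc);
        int    *col_rnd = malloc((size_t)N * NNZ * sizeof *col_rnd);
        double *val     = malloc((size_t)N * NNZ * sizeof *val);
        double *x       = malloc(N * sizeof *x);
        double *y       = malloc(N * sizeof *y);

        for (int i = 0; i <= N; ++i) row_ptr[i] = i * NNZ;
        for (int i = 0; i < N; ++i) {
            x[i] = 1.0;
            for (int k = 0; k < NNZ; ++k) {
                val[i * NNZ + k]     = 1.0;
                col_loc[i * NNZ + k] = (i + k) % N;  /* neighbouring unknowns */
                col_rnd[i * NNZ + k] = rand() % N;   /* scattered unknowns    */
            }
        }

        /* identical FLOPs and loads in both runs -- only the access
           pattern to x differs, i.e. the effect of the numbering */
        printf("local numbering : %.2f s\n", time_mv(row_ptr, col_loc, val, x, y));
        printf("random numbering: %.2f s\n", time_mv(row_ptr, col_rnd, val, x, y));
        return 0;
    }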

Results on locally structured meshes, 3D case (MFLOP/s):

  Computer              N      STO   LINE   SBB-V   SBB-C   MG-V   MG-C
  DEC 21264            17³     150    164     446     765    342    500
  (500 MHz) `DS20'     33³      54     64     240     768    233    474
                       65³      24     72     249     713    196    447
  IBM RS6000/597       17³      81     86     179     480    171    368
  (160 MHz) `SP2'      33³      16     81     170     393    152    300
                       65³       8     81     178     393    150    276
  INTEL PII            17³      28     29      56     183     48    136
  (400 MHz) `ALDI'     33³      24     29      53     139     47    116
                       65³      19     29      54     125     45    101

Sparse MV techniques (STO, LINE):
- MFLOP/s rates far away from `Peak Performance'
- Depending on problem size and numbering
- `Old' (IBM PWR2) partially faster than `new' (IBM P2SC)
- PC partially faster than processors in `supercomputers'!!!
FEAST MV techniques (SBB):
- `Supercomputing' power gets visible (up to 800 MFLOP/s)
- Warning: Hard work!!! (→ Sparse Banded BLAS)
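For comparison, variant 2 (the SBB columns above) applies the matrix band by band. A minimal C sketch for a 5-point stencil on a 2D tensorproduct mesh follows; this only illustrates the idea, it is not FEAST's Sparse Banded BLAS, and band entries that would fall outside the mesh are assumed to be stored as zeros:

    /* (Sparse) banded MV, y = A*x, for a 5-point stencil on an m-by-m
     * tensorproduct mesh with n = m*m unknowns.  The five bands (main
     * diagonal and offsets -1, +1, -m, +m) are stored as arrays of
     * length n.  Every loop runs with stride-1 access to x and y, so it
     * pipelines/vectorises well -- no index vectors are needed. */
    void spmv_banded5(int m,
                      const double *dd,   /* main diagonal       */
                      const double *dl,   /* offset -1 ("west")  */
                      const double *dr,   /* offset +1 ("east")  */
                      const double *ds,   /* offset -m ("south") */
                      const double *dn,   /* offset +m ("north") */
                      const double *x, double *y)
    {
        const int n = m * m;
        for (int i = 0; i < n; ++i)     y[i]  = dd[i] * x[i];
        for (int i = 1; i < n; ++i)     y[i] += dl[i] * x[i - 1];
        for (int i = 0; i < n - 1; ++i) y[i] += dr[i] * x[i + 1];
        for (int i = m; i < n; ++i)     y[i] += ds[i] * x[i - m];
        for (int i = 0; i < n - m; ++i) y[i] += dn[i] * x[i + m];
    }

This regular, index-free access pattern is what lifts the SBB-V/SBB-C rates in the table far above the STO/LINE rates on the same machines.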

Further conclusions:
- `Most adaptive codes should run on (PENTIUM) PC's'
- `Processors are `sensible' Parallel-Vector supercomputers w.r.t. caching-in/pipelining'
Evaluation of Soft/Hardware components???
Adaptivity concepts???
Realization of complete FEM/FV approaches???
⇒ FEAST project

Expected hardware development (1997 National Technology Roadmap for Semiconductors):

  Year                1997   1999   2003   2006   2009   2012
  Transistors/chip     11M    21M    76M   200M   520M   1.4B

1. 1 Billion Transistors (× 100!)
2. 10 GHz clock rate (× 20!)
1 PC `faster' than complete CRAY T3E today!
Memory access??? Data locality and internal parallelism??? Vectorization???

FEAST project: Numerics and implementation techniques adapted to hardware!!!
Precise knowledge of processor characteristics: (Collection of) FEAST INDICES
⇒ Results for the different tasks in MG/Krylov-space solvers for various FEM/FV discretizations and mesh sizes
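In practice such an index is simply a timed kernel whose MFLOP/s rate is recorded per mesh level. A hedged C sketch (the kernel is a plain vector update with 2 FLOPs per entry; the level range and repetition counts are arbitrary choices, not the actual FEAST test set):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Report MFLOP/s of one solver component (here: y <- y + a*x) for a
     * range of 2D mesh levels, n = (2^level + 1)^2 unknowns per level.
     * Rates typically drop once the vectors no longer fit into cache. */
    int main(void)
    {
        for (int level = 5; level <= 10; ++level) {
            long m = (1L << level) + 1, n = m * m;
            double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
            for (long i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

            long reps = 100000000L / n + 1;       /* comparable total work */
            clock_t t0 = clock();
            for (long r = 0; r < reps; ++r)
                for (long i = 0; i < n; ++i)
                    y[i] += 0.5 * x[i];           /* 2 FLOPs per entry     */
            double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("level %2d  n = %8ld  %8.1f MFLOP/s  (y[0] = %g)\n",
                   level, n, 2.0 * (double)n * reps / sec / 1e6, y[0]);
            free(x); free(y);
        }
        return 0;
    }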

MV-MULT with identical matrix!!!
[Three bar charts: MFLOP/s rates for the MV/C, MV/V and STO variants (Level 3, 3D) across a range of processors, from 21264, PWR2/PWR3, Origin and CRAY T3E systems down to Pentium II and PentiumPro PCs. Note the very different scales: up to ~800 MFLOP/s for MV/C, ~250 for MV/V, and only ~25 for STO.]

Example: Concepts for adaptive meshing
1) macro-oriented adaptivity
   - `blind' (1-irregular) macro-nodes
   - conforming completion
2) patchwise adaptivity
   - many (local) logically equivalent tensorproduct meshes
   - moving mesh points
3) fully local adaptivity
   - (local) unstructured meshes
`Exploit the locally nice structures!'

Conclusions:
- Everybody can/must (?) test the computational performance!! FORTRAN 90, C, JAVA???
- Most adaptive FEM codes are NOT slower on PC's than on `High Performance' workstations or even `supercomputers'!! However: there is `supercomputing power' available!!!
- `Amount' and `kind' of memory accesses determine significantly the computational efficiency!! Modern processors are `sensible' supercomputers!!!
- User-defined memory management and optimized implementations `w.r.t. hardware' are absolutely necessary!! Massive performance losses due to Cache/Heap organization!!!
- Modern Numerics has to consider recent and future hardware trends!!
  - math. theory for rigorous macro-oriented adaptivity???
  - improved MG/DD solvers (⇒ ScaRC!!!)
  - algorithmic design: Balance of FLOPs and data access
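A back-of-the-envelope calculation makes the last point, the balance of FLOPs and data access, quantitative for the sparse MV kernel. A hedged sketch (the 500 MB/s memory bandwidth is an assumed, illustrative figure for a late-1990s memory system, not a measured value):

    #include <stdio.h>

    /* Bandwidth-bound estimate for CSR sparse MV: each nonzero costs
     * 2 FLOPs (multiply + add) but moves at least the 8-byte value, a
     * 4-byte column index and an 8-byte entry of x from memory. */
    int main(void)
    {
        double flops_per_nz = 2.0;
        double bytes_per_nz = 8.0 + 4.0 + 8.0;
        double bandwidth    = 500e6;          /* assumed: 500 MB/s */

        double bound = bandwidth / bytes_per_nz * flops_per_nz;
        printf("bandwidth-limited sparse MV rate: %.0f MFLOP/s\n", bound / 1e6);
        /* ~50 MFLOP/s under these assumptions -- far below processor
         * peak, which is why the STO/LINE rates in the tables stay low,
         * while the banded (SBB) variants, with fewer bytes per FLOP and
         * stride-1 access, get much closer to peak. */
        return 0;
    }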