Model Reduction for High Dimensional Micro-FE Models


Rudnyi E. B., Korvink J. G.
University of Freiburg, IMTEK - Department of Microsystems Engineering
{rudnyi,korvink}@imtek.uni-freiburg.de

van Rietbergen B.
Eindhoven University of Technology, Biomedical Engineering, Bone & Orthopaedic Biomechanics
B.v.Rietbergen@tue.nl

Abstract

Three-dimensional serial reconstruction techniques allow us to construct very detailed micro-finite element (micro-FE) models of bones that represent the porous bone micro-architecture. Solving these models, however, means solving very large systems of equations (on the order of millions of degrees of freedom), which inhibits dynamic analyses. It was previously shown, for bone models of up to 1 million degrees of freedom, that formal model order reduction allows harmonic response simulation at a computational cost comparable with that of a static solution. The main requirement is that a direct solver is used and that there is enough memory to keep the factor. On common serial computers with 4 Gb of RAM, such analyses are therefore limited to bone models with about 1 million degrees of freedom. The goal of this work was to extend the approach to larger models by employing a parallel sparse direct solver on larger parallel systems. Four bone models have been generated, ranging from 1 to 12 million degrees of freedom, and used to benchmark the multifrontal massively parallel sparse direct solver MUMPS on an SGI Origin 3800 (teras at SARA). The study showed that MUMPS is a good choice for implementing model reduction of high dimensional micro-FE bone models, provided there is enough memory on the high performance computing system.

Introduction

Three-dimensional serial reconstruction techniques allow the construction of very detailed micro-finite element (micro-FE) models of bones that represent the porous bone micro-architecture [1]. Such models can be used, for example, to study differences in bone tissue loading between healthy and osteoporotic human bones during quasi-static loading [2]. The main disadvantage of this approach is its huge computational requirements, caused by the high dimensionality of the models; as a result, such computational analyses have so far been limited to static solutions. There is evidence, however, that high-frequency oscillating mechanical loads (up to approximately 50 Hz) in particular might play an important role in the bone adaptive response [3]. Hence, there is a need to extend these static analyses to dynamic ones.

Model order reduction is a relatively new area of mathematics that aims at finding a low-dimensional approximation to a high-dimensional system of ordinary differential equations. It has been used successfully over the last few years in different engineering disciplines such as electrical engineering, structural mechanics, heat transfer, acoustics and electromagnetics [4]. Implicit moment matching based on the Arnoldi algorithm [5] is a very competitive approach to finding the low-dimensional subspace that accurately approximates the behavior of a high-dimensional state vector. At IMTEK, the software MOR for ANSYS [6] has been developed to apply model reduction directly to finite element models made in ANSYS. Its computational kernel has been employed for two bone models of dimensions 130 000 and 900 000, respectively, and it has been demonstrated that model reduction can speed up harmonic simulation by orders of magnitude on a computer with 4 Gb of RAM [7].

The bottleneck of model reduction based on the Arnoldi algorithm is the solution of a system of linear equations required for each Arnoldi vector [6]. As a result, the use of a direct solver is preferred. However, 4 Gb of RAM is only enough to keep the factor of a bone model of up to 900 000 degrees of freedom [7]. The treatment of models of higher dimensions requires parallel computing. The goal of the current work is to investigate what dimension of bone model one can afford with the multifrontal massively parallel sparse direct solver MUMPS [8][9]. We first introduce the bone models employed as benchmarks. After that, we present the results obtained on the SGI Origin 3800 (teras at SARA, http://www.sara.nl/userinfo/teras/description/index.html) and draw conclusions.
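The scheme sketched above, one expensive factorization followed by a cheap solve per Arnoldi vector, can be illustrated with a small SciPy example. This is a generic sketch, not the actual MOR for ANSYS kernel: the function name `arnoldi_mor`, the toy 1D Laplacian and the subspace size are all illustrative choices.

```python
# Sketch of Arnoldi-based model reduction in which the sparse matrix is
# factorized once and the factor is reused for every Arnoldi vector.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def arnoldi_mor(K, b, m):
    """Orthonormal basis V of the Krylov subspace span{K^-1 b, K^-2 b, ...},
    built with a single LU factorization of K."""
    lu = spla.splu(K.tocsc())          # factorize once (the expensive step)
    V = np.zeros((K.shape[0], m))
    v = lu.solve(b)                    # only back substitution per vector
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(1, m):
        w = lu.solve(V[:, j - 1])      # reuse the factor
        for i in range(j):             # modified Gram-Schmidt orthogonalization
            w -= (V[:, i] @ w) * V[:, i]
        V[:, j] = w / np.linalg.norm(w)
    return V

# Toy stand-in for a stiffness matrix: a 1D Laplacian
n = 200
K = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)
V = arnoldi_mor(K, b, 10)
K_r = V.T @ (K @ V)                    # projected (reduced) 10x10 system
print(K_r.shape)
```

Projecting the system matrices onto `V` yields the low-dimensional model on which harmonic response simulation becomes cheap.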

Benchmarks of Micro-FE Bone Models

The bone models are presented in Table 1. The matrices are symmetric and positive definite. The first two models, BS01 and BS10, are taken from [7]; they have been used for debugging the code. The new bone models have been obtained from another bone sample that was scanned at a higher resolution. The models B010 to B120 represent increasing subsets of this bone sample.

                    BS01      BS10      B010      B025       B050       B120
elements           20098    192539    278259    606253    1378782    3387547
nodes              42508    305066    329001    719987    1644848    3989996
DoFs              127224    914898    986703   2159661    4934244   11969688
nnz in half K    3421188  28191660  36326514  79292769  180663963  441785526

Table 1: Bone micro-finite element models.

The matrices, as well as the driver to call MUMPS, are available in the Oberwolfach Model Reduction Benchmark Collection [10] (http://www.imtek.uni-freiburg.de/simulation/benchmark/).

Performance of MUMPS on teras

The numerical experiments have been performed on an SGI Origin 3800 (teras at SARA). MUMPS version 4.6.1 has been compiled as a 64-bit application with the MIPSpro compiler suite version 7.3.1.3m with full optimization (-O3). The host processor has been used exclusively for matrix assembly (PAR=0) and did not take part in the factorization; as a result, we plot against the number of processors reduced by one. On the SGI Origin 3800 each processor is recommended to use about 1 Gb of memory, and hence the minimal number of processors depends on the problem size.

The memory requirement depends on the number of nonzeros in the factor. Fig 1 displays the ratio of nonzeros in the factor to nonzeros in the matrix as a function of the matrix dimension after METIS [11] and PORD [12] reordering. It shows that the first two matrices treated in [7] are much easier to deal with, as the number of nonzeros in their factors is only about 10 times larger than in the matrix. The connectivity in the four new matrices is much higher, the number of nonzeros in the factor being about 40 times larger than in the matrix. Note that the number of nonzeros in the matrix is proportional to the matrix dimension (see Table 1).

METIS reordering [11] gives about 14% fewer nonzeros in the factor than PORD reordering [12], but unfortunately METIS is not yet fully 64-bit compliant [13] and could not be employed for the matrices of higher dimensions. Figs 2 to 4 show the wall time, the total memory (INFOG(19)) and the memory on the most memory-consuming processor (INFOG(18)) for matrix B010 as a function of the number of processors (not counting the host processor). The behavior is similar for the other matrices.

Fig 1: The ratio of nonzero elements in the factor to nonzero elements in the matrix.

Fig 2: The wall time in s to factorize matrix B010 as a function of the number of processors.
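The influence of the fill-reducing ordering on factor size can be reproduced in miniature with SciPy's SuperLU interface. SciPy does not ship METIS or PORD, so the built-in `permc_spec` orderings stand in for them here, and the 2D Laplacian is only a small stand-in for a micro-FE stiffness matrix.

```python
# Illustration of how the ordering changes the number of nonzeros in the
# factor; scipy's built-in orderings stand in for METIS/PORD.
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 2D Laplacian on a 30x30 grid (stand-in for a micro-FE stiffness matrix)
n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsc()

ratios = {}
for ordering in ["NATURAL", "COLAMD", "MMD_AT_PLUS_A"]:
    lu = spla.splu(A, permc_spec=ordering)
    ratios[ordering] = (lu.L.nnz + lu.U.nnz) / A.nnz   # fill-in ratio
    print(f"{ordering:15s} fill-in ratio {ratios[ordering]:.1f}")
```

A good ordering gives a markedly smaller fill-in ratio, which is exactly the quantity plotted in Fig 1 for the bone matrices.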

Fig 3: Total memory in Gb (INFOG(19)) to factorize matrix B010 as a function of the number of processors.

Fig 4: The memory in Gb on the most memory-consuming processor (INFOG(18)) to factorize matrix B010 as a function of the number of processors.

Our observations can be summarized as follows:
- The total memory required by MUMPS to factorize a matrix is not constant and grows gradually as the number of processors is increased.
- There is an optimal number of processors to employ, which depends on the matrix dimension.
- METIS reordering enables more efficient distributed computing in MUMPS than PORD.
- Provided there is enough memory to keep the factor, MUMPS allows an efficient implementation of model reduction: the matrix is factorized once, and back substitution is then used throughout the Arnoldi process.
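As a cross-check on the memory figures, a back-of-the-envelope lower bound for the B120 factor follows from Table 1 and the roughly 40x fill-in ratio of Fig 1. This counts only double-precision factor entries; MUMPS working storage (integer indices, stack, communication buffers) comes on top, which is why the memory observed in practice is considerably larger.

```python
# Lower bound on the memory for the factor of B120, from Table 1 and the
# ~40x fill-in ratio of Fig 1. Real-world MUMPS memory use is several
# times this figure because of working storage.
nnz_half_K = 441785526       # nonzeros in half of K for B120 (Table 1)
fill_ratio = 40              # nnz(factor) / nnz(matrix), from Fig 1
bytes_per_entry = 8          # double precision real

factor_bytes = fill_ratio * nnz_half_K * bytes_per_entry
print(f"factor alone: {factor_bytes / 2**30:.0f} GiB")  # → factor alone: 132 GiB
```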

Fig 5 shows the memory required by MUMPS per stored number (INFOG(19) divided by the sum of the nonzeros in the factor and in the matrix) as a function of the number of processors for the four new matrices in the case of PORD reordering.

Fig 5: Memory required by MUMPS per stored number as a function of the number of processors.

The matrix B120 requires about 1 Tb of RAM, and if we want to treat bone models similar to the new benchmarks up to dimensions of 100 million, we may need up to 10 Tb of RAM.

Conclusion

The study showed that MUMPS is a good choice for implementing model reduction of high dimensional micro-FE bone models, provided there is enough memory on the high performance computing system. In the future, the memory requirements can be reduced once METIS becomes 64-bit compliant. Trilinos [14] provides an interface to MUMPS and, in our view, this is the best option for integrating MUMPS [8][9] with MOR for ANSYS [6].

Acknowledgment

The work has been performed under the Project HPC-EUROPA (RII3-CT-2003-506079), with the support of the European Community - Research Infrastructure Action under the FP6 "Structuring the European Research Area" Programme.

References

[1] B. van Rietbergen, H. Weinans, R. Huiskes, A. Odgaard, A new method to determine trabecular bone elastic properties and loading using micromechanical finite-element models. Journal of Biomechanics, v. 28, N 1, p. 69-81, 1995.

[2] B. van Rietbergen, R. Huiskes, F. Eckstein, P. Rueegsegger, Trabecular bone tissue strains in the healthy and osteoporotic human femur. Journal of Bone and Mineral Research, v. 18, N 10, p. 1781-1787, 2003.

[3] L. E. Lanyon, C. T. Rubin, Static versus dynamic loads as an influence on bone remodelling. Journal of Biomechanics, v. 17, p. 897-906, 1984.

[4] E. B. Rudnyi, J. G. Korvink, Review: Automatic model reduction for transient simulation of MEMS-based devices. Sensors Update, v. 11, p. 3-33, 2002.

[5] R. W. Freund, Krylov-subspace methods for reduced-order modeling in circuit simulation. Journal of Computational and Applied Mathematics, v. 123, p. 395-421, 2000.

[6] E. B. Rudnyi, J. G. Korvink, Model order reduction for large scale engineering models developed in ANSYS. Lecture Notes in Computer Science, v. 3732, p. 349-356, 2005. http://www.imtek.uni-freiburg.de/simulation/mor4ansys/

[7] E. B. Rudnyi, B. van Rietbergen, J. G. Korvink, Efficient harmonic simulation of a trabecular bone finite element model by means of model reduction. 12th Workshop "The Finite Element Method in Biomedical Engineering, Biomechanics and Related Fields", University of Ulm, 20-21 July 2005. Preprint at http://www.imtek.uni-freiburg.de/simulation/mor4ansys/pdf/rudnyi05ulm.pdf

[8] P. R. Amestoy, I. S. Duff, J. Koster, J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, v. 23, N 1, p. 15-41, 2001.

[9] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, S. Pralet, Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, v. 32, N 2, p. 136-156, 2006. http://graal.ens-lyon.fr/mumps/

[10] J. G. Korvink, E. B. Rudnyi, Oberwolfach Benchmark Collection. In: Benner, P., Mehrmann, V., Sorensen, D. (eds), Dimension Reduction of Large-Scale Systems, Lecture Notes in Computational Science and Engineering, v. 45, p. 311-315. Springer-Verlag, Berlin/Heidelberg, 2005. http://www.imtek.uni-freiburg.de/simulation/benchmark/

[11] G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, v. 20, N 1, p. 359-392, 1998. http://glaros.dtc.umn.edu/gkhome/views/metis/metis/

[12] J. Schulze, Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods. BIT, v. 41, N 4, p. 800-841, 2001.

[13] G. Karypis, Private communication, 2006.

[14] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, K. S. Stanley, An overview of the Trilinos project. ACM Transactions on Mathematical Software, v. 31, N 3, p. 397-423, 2005. http://software.sandia.gov/trilinos/