Reconstruction of Trees from Laser Scan Data and further Simulation Topics


Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de

Overview 1. Introduction of the Chair 2. Current Research Topics 3. Tree Reconstruction Project

Introduction of the Chair

Professors: Head of the Chair, Professor for HPC

Working Groups: High-Performance Computing Group, Algorithms for Simulation Group, Complex Flows Group, Computational Optics Group.

Teaching - Lectures on: Simulation and Scientific Computing, Numerical Simulation of Fluids, Advanced Programming Techniques, Multigrid Methods, Functional Analysis. - Seminars on: Playstation Programming, Advanced C++ Programming, Simulation Claim and Risks. - Hosting: the Elite Master Program (by the Bavarian Graduate School of Computational Engineering within the Elite Network of Bavaria) together with TU Munich; a Double Master program together with KTH Stockholm; the ERASMUS Mundus program Computer Simulations for Science and Engineering (COSSE).

walberla (Complex Flows Group): widely applicable Lattice Boltzmann from Erlangen. Joint project of four Ph.D. students. Besides the LBM fluid solver, walberla can model the following phenomena: free-surface flows with floating objects, porous media flows, blood clotting, particulate flows, particle-laden flows.

What is the Lattice Boltzmann Method? See the presentation by Iglberger, K., Thürey, N., Schmid, H.J., Feichtinger, C.: Lattice Boltzmann Simulation bewegter Partikel (Lattice Boltzmann simulation of moving particles).

walberla Design goals for version 1.0 were: Understandability and usability: easy integration of new simulation scenarios and numerical methods, even by users who are not expert programmers. Portability: portable to various HPC supercomputer architectures and operating system environments. Maintainability and expandability: integration of new functionality without major restructuring of the code or modification of core parts of the framework. Efficiency: possibility to integrate optimized kernels to enable efficient, hardware-adapted simulations. Scalability: support of massively parallel simulations.

walberla Patch concept: The whole simulation domain is divided into patches, so that complex operations can be avoided where they are not needed and optimized kernels can be executed in each patch individually.

Multigrid methods V-Cycle Multigrid methods are asymptotically optimal solvers for sparse linear systems of equations (O(N) time complexity). Iterative solvers (e.g. Jacobi or Gauss-Seidel smoothers) quickly remove only the local, high-frequency components of the error. Given the linear system A u = f: 1. Smooth a few times to remove the local error. 2. Compute the residual (r = f - A u) and coarsen the error equation (A e = r) (restriction). 3. Solve the error equation on the coarse grid. 4. Interpolate (prolongate) the correction term to the fine grid and apply the correction. 5. Smooth again. Ideally, step 3 is done by applying the whole scheme recursively, until the system is small enough to be solved directly.
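The steps above can be sketched as a short recursive V-cycle. The following is a minimal 1D Poisson example with a weighted-Jacobi smoother — an illustrative sketch with hypothetical helper names, not code from walberla or HHG:

```python
import numpy as np

def smooth(u, f, h, iters=3, omega=2.0 / 3.0):
    """Weighted-Jacobi smoothing for -u'' = f (Dirichlet boundaries)."""
    for _ in range(iters):
        u[1:-1] += omega * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def residual(u, f, h):
    """r = f - A u for the standard 3-point stencil."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting restriction onto the next coarser grid."""
    rc = np.zeros((len(r) + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def interpolate(ec, n_fine):
    """Linear interpolation of the coarse correction back to the fine grid."""
    e = np.zeros(n_fine)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) == 3:                    # coarsest grid: solve the single equation directly
        u[1] = 0.5 * (h * h * f[1] + u[0] + u[2])
        return u
    u = smooth(u, f, h)                            # 1. pre-smoothing removes local error
    rc = restrict(residual(u, f, h))               # 2. residual, restricted to coarse grid
    ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h)   # 3. coarse error equation, recursively
    u += interpolate(ec, len(u))                   # 4. prolongate and apply the correction
    return smooth(u, f, h)                         # 5. post-smoothing

# Solve -u'' = pi^2 sin(pi x) on [0, 1]; the exact solution is u = sin(pi x).
N = 65
x = np.linspace(0.0, 1.0, N)
h = x[1] - x[0]
f = np.pi ** 2 * np.sin(np.pi * x)
u = np.zeros(N)
for _ in range(10):
    u = v_cycle(u, f, h)
```

A few V-cycles suffice to reduce the algebraic error below the discretization error, which is the practical meaning of "asymptotically optimal" here.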

Multigrid methods V-Cycle

Multigrid methods Restriction and interpolation operators: - If we have PDEs on a physical domain, we can simply use this geometric information: pick every second unknown, or apply a weighting. For interpolation: linear or higher-order interpolation methods. - There also exist algebraic multigrid methods that use properties of the system matrix A to construct the restriction and prolongation operators. - A good introduction: Briggs, W. L., Henson, V. E., McCormick, S. F., 2000. A Multigrid Tutorial, 2nd Edition, SIAM.
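For the 1D case the two geometric transfer operators mentioned above can be written down explicitly. The sketch below (illustrative helper names, not library code) builds full-weighting restriction and linear interpolation as small dense matrices and checks that, in the interior, interpolation is the transpose of restriction up to a factor of 2:

```python
import numpy as np

def full_weighting(n_fine):
    """1D full-weighting restriction matrix, stencil [1/4, 1/2, 1/4]."""
    n_coarse = (n_fine + 1) // 2
    R = np.zeros((n_coarse, n_fine))
    for i in range(1, n_coarse - 1):
        R[i, 2 * i - 1: 2 * i + 2] = [0.25, 0.5, 0.25]
    R[0, 0] = R[-1, -1] = 1.0          # boundary values are taken over directly
    return R

def linear_interpolation(n_fine):
    """1D linear interpolation (prolongation) matrix, stencil [1/2, 1, 1/2]."""
    n_coarse = (n_fine + 1) // 2
    P = np.zeros((n_fine, n_coarse))
    for i in range(n_coarse):
        P[2 * i, i] = 1.0              # coarse-grid points are copied
    for i in range(n_coarse - 1):
        P[2 * i + 1, i] = P[2 * i + 1, i + 1] = 0.5   # fine midpoints are averaged
    return P

R = full_weighting(9)
P = linear_interpolation(9)
```

The transpose relation P = 2 R^T (in the interior) is exactly the variational property used by algebraic multigrid to derive one operator from the other.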

Multigrid methods - Parallelization - The smoother, restriction and interpolation kernels are local operators, so their parallelization is straightforward. - One problem is that the communication overhead grows on the coarser levels (only a few unknowns per process, but the same number of processes; an alternative is to gather all unknowns on one process, solve the system there, and redistribute). - We have successfully implemented (massively) parallel multigrid methods on different clusters, GPUs and the Cell Broadband Engine, and hold one of the world records in solving linear systems with our framework Hierarchical Hybrid Grids (HHG) (300 billion unknowns on 9,170 nodes).

Cell Broadband Engine (CBE) The CBE consists of - 1 PPE (Power Processing Element, a PowerPC core), - 8 SPEs (Synergistic Processing Elements). The SPEs can only access data in their 256 KB local store; data from main memory has to be fetched via the Memory Flow Controller (MFC). Peak performance: 200 GFLOPS (single precision), 56 GB/s memory bandwidth.

Cell Broadband Engine - Because of its heterogeneous architecture, different code has to be written for the PPE and the SPEs. - While the PPE executes standard programs, the reduced instruction set of the SPEs imposes special restrictions: - The SPEs can execute only single-precision floating-point operations at acceptable speed (this limitation was removed in the PowerXCell 8i). - The SPEs have only SIMD registers, so operations on scalars can be more expensive than on SIMD vectors. - Data transfer to and from main memory is subject to strict alignment restrictions. - Efficient parallelization can only be implemented at the pthreads level (low-level coding); there is no efficient OpenMP or MPI.

Cell Broadband Engine Example: MG A multigrid solver was implemented for solving Poisson's equation with open boundary conditions. Because of the infinite domain size, a hierarchical grid coarsening was applied: Ritter, D., Stürmer, M., Rüde, U., 2010: A fast-adaptive composite grid algorithm for solving the free-space Poisson problem on the Cell Broadband Engine. Numerical Linear Algebra with Applications, 17(2-3), pp. 291-305.

CBE MG: Implementation Details - Decomposition: We split the 3D domain into slices of roughly the same size and process the data line by line. - To hide the time needed for data transfers to and from the local store (LS) of the SPE, we use double buffering and background transfers. - The domain is traversed line by line: per traversal we have to load 3 lines of unknowns and 1 line of the right-hand side, and store one line of unknowns. - If-statements in all computational kernels were eliminated. - All kernels were SIMD-vectorized. - Synchronization of all threads is done after each smoother step and after the restriction.

CBE MG: Scaling Results The algorithm is memory-bound, but only about 50% of the theoretical peak performance is reached. Why?

CBE MG: Alignment Data must be 128-byte aligned both in main memory and in the local store for optimal transfer speed. (Figures: transfer behavior for unaligned vs. aligned data.)
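The usual remedy for such alignment requirements is to over-allocate a buffer and offset its start to the next aligned address. The following Python sketch only illustrates that idea (it is not CBE code; `aligned_empty` is a hypothetical helper):

```python
import numpy as np

def aligned_empty(n, dtype=np.float32, alignment=128):
    """Return an n-element array whose data pointer is alignment-byte aligned,
    by over-allocating a raw byte buffer and slicing it at the right offset."""
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment      # bytes until the next aligned address
    return buf[offset: offset + n * itemsize].view(dtype)

a = aligned_empty(1024)
print(a.ctypes.data % 128)   # 0: the array start is 128-byte aligned
```

On the CBE the same trick would be applied to the DMA source and target addresses so that every transfer starts on a 128-byte boundary.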

CBE MG: Scaling with proper alignment Now we reach 90% of peak performance. => Optimizing code on the CBE is tedious, and it is not trivial to understand all the issues.

Tree Reconstruction Bachelor Thesis Conducted by Janakan Sivagnanasundaram http://www10.informatik.uni-erlangen.de/~sijasiva Task: Reconstruct the tree topology from 3D laser scanner data (scattered surface points), following an approach from Hu, H., Gossett, N. and Chen, B. 2007. Knowledge and heuristic-based modeling of laser-scanned trees. ACM Trans. Graph. 26. We are using the Boost Graph Library (www.boost.org) for handling the graphs.

Tree Reconstruction - Algorithm Input: a cloud of points. Test data: a tree with 188,000 points. 1. Construct a strongly connected neighborhood graph from all point pairs whose distance is below a certain threshold (20 cm). (Graph with 20,000,000 edges.) 2. Compute the shortest paths from the root to all points. (Graph with 188,000 edges.) 3. Classify all points according to their distance from the root (class length 50 cm). 4. Build subclasses based on the connection information within each class. 5. Compute the centroid of each subclass. (Graph with 1,000 nodes and edges.) 6. Connect the centroids to obtain the tree structure. 7. Identify and connect branches that are not yet connected to the tree. 8. Fit cylinders using a least-squares method.
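Steps 1-3 can be sketched on a toy point cloud as follows. This is a simplified illustration with hypothetical helper names; the actual thesis code uses the Boost Graph Library and a much larger data set, where the O(n²) graph construction below would need a spatial data structure:

```python
import heapq
import numpy as np

def neighborhood_graph(points, threshold):
    """Step 1: connect all point pairs closer than `threshold`
    (O(n^2) for this toy example; large clouds need a spatial index)."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(points[i] - points[j]))
            if d < threshold:
                adj[i].append((j, d))
                adj[j].append((i, d))
    return adj

def shortest_path_lengths(adj, root=0):
    """Step 2: Dijkstra's algorithm from the root over the neighborhood graph."""
    dist = [float("inf")] * len(adj)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                    # stale heap entry
        for w, weight in adj[v]:
            if d + weight < dist[w]:
                dist[w] = d + weight
                heapq.heappush(heap, (d + weight, w))
    return dist

def classify(dist, class_length):
    """Step 3: bin points by their graph distance from the root."""
    return [int(d // class_length) for d in dist]

# Toy 'trunk': 20 points along a vertical line, 10 cm apart, root at the bottom.
points = np.array([[0.0, 0.0, 0.1 * k] for k in range(20)])
adj = neighborhood_graph(points, threshold=0.15)
dist = shortest_path_lengths(adj)
classes = classify(dist, class_length=0.5)
```

Steps 4-6 would then split each distance class into connected subclasses, replace each subclass by its centroid, and connect centroids of adjacent classes.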

Tree Reconstruction (Figures: local neighborhood graph, shortest-path graph, clustered points, connected centroids, main skeleton. Figures from Hu, H., Gossett, N. and Chen, B. 2007. Knowledge and heuristic-based modeling of laser-scanned trees. ACM Trans. Graph. 26.)

Tree Reconstruction Skeleton Extension - Perform a breadth-first search. - Project a cone of a certain opening angle along the direction from the parent node to the current node. - When the cone intersects a connected subgraph G: compute the intersection point P; if P is within a certain range of the current node, connect G to the main skeleton.
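The cone test can be expressed as a simple angle-and-range check on candidate points. This is a hedged sketch with hypothetical names (`in_cone`, default angle and range are made up); the thesis may use a different formulation:

```python
import numpy as np

def in_cone(parent, current, candidate, half_angle_deg=30.0, max_range=1.0):
    """True if `candidate` lies inside the search cone projected from `current`
    along the direction parent -> current, and within `max_range` of it."""
    axis = current - parent
    axis = axis / np.linalg.norm(axis)           # unit vector of the growth direction
    offset = candidate - current
    dist = float(np.linalg.norm(offset))
    if dist == 0.0 or dist > max_range:
        return False
    cos_angle = float(np.dot(axis, offset)) / dist
    return cos_angle >= np.cos(np.radians(half_angle_deg))

parent = np.array([0.0, 0.0, 0.0])
current = np.array([0.0, 0.0, 1.0])
near_axis = bool(in_cone(parent, current, np.array([0.1, 0.0, 1.5])))  # nearly on-axis
off_axis = bool(in_cone(parent, current, np.array([1.0, 0.0, 1.0])))   # 90 degrees off-axis
```

A disconnected subgraph whose nearest point passes this test would then be attached to the main skeleton at the current node.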

Tree Reconstruction Status - Programming is almost done; we have everything up to the connected-centroids graph. - Missing: skeleton extension, cylinder fitting, parameter studies. - The program runs within a few minutes for our test data. - It produces good results for a winter tree (without leaves). - Possible improvements: use L-system fitting in the crown of trees with leaves; parallelization of the code.

Thank you for your attention! Visit our homepage: www10.informatik.uni-erlangen.de