Reconstruction of Trees from Laser Scan Data and further Simulation Topics


Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de

Overview 1. Introduction of the Chair 2. Current Research Topics 3. Tree Reconstruction Project

Introduction of the Chair

Professors: Head of the Chair, Professor for HPC

Working Groups: High-Performance Computing Group, Algorithms for Simulation Group, Complex Flows Group, Computational Optics Group.

Teaching - Lectures on: Simulation and Scientific Computing, Numerical Simulation of Fluids, Advanced Programming Techniques, Multigrid Methods, Functional Analysis. - Seminars on: Playstation Programming, Advanced C++ Programming, Simulation Claim and Risks. - Hosting: the Elite Master Program (by the Bavarian Graduate School of Computational Engineering within the Elite Network of Bavaria) together with TU Munich; a Double Master program together with KTH Stockholm; the ERASMUS Mundus program Computer Simulations for Science and Engineering (COSSE).

walberla (Complex Flows Group): widely applicable Lattice Boltzmann from Erlangen. Joint project of four Ph.D. students. Besides the LBM fluid solver, walberla can model the following phenomena: free-surface flows with floating objects, porous media flows, blood clotting, particulate flows, particle-laden flows.

What is the Lattice Boltzmann Method? See the presentation by Iglberger, K., Thürey, N., Schmid, H.J., Feichtinger, C.: Lattice Boltzmann Simulation bewegter Partikel (Lattice Boltzmann simulation of moving particles).

walberla Design goals for version 1.0 were: Understandability and usability: easy integration of new simulation scenarios and numerical methods, even by users who are not expert programmers. Portability: portable to various HPC supercomputer architectures and operating system environments. Maintainability and expandability: integration of new functionality without major restructuring of the code or modification of core parts of the framework. Efficiency: possibility to integrate optimized kernels to enable efficient, hardware-adapted simulations. Scalability: support of massively parallel simulations.

walberla Patch concept: The whole simulation domain is divided into patches, so that complex operations can be avoided where they are not needed and optimized kernels can be executed in each patch individually.

Multigrid methods V-Cycle Multigrid methods are asymptotically optimal solvers for sparse linear systems of equations (O(N) time complexity). Iterative solvers (e.g. Jacobi or Gauss-Seidel smoothers) quickly remove only the local, high-frequency components of the error. Given the linear system A u = f: 1. Smooth a few times to remove the local error. 2. Compute the residual (r = f - A u) and coarsen the error equation (A e = r) (restriction). 3. Solve the error equation on the coarse grid. 4. Interpolate (prolongate) the correction term to the fine grid and apply the correction. 5. Smooth again. Ideally, step 3 is done by applying the whole scheme recursively, until the system is small enough to be solved directly.
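The steps above can be sketched as a short recursive V-cycle. The following is a minimal 1D Poisson example with a weighted-Jacobi smoother — an illustrative sketch with hypothetical helper names, not code from walberla or HHG:

```python
import numpy as np

def smooth(u, f, h, iters=3, omega=2.0 / 3.0):
    """Weighted-Jacobi smoothing for -u'' = f (Dirichlet boundaries)."""
    for _ in range(iters):
        u[1:-1] += omega * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]) - u[1:-1])
    return u

def residual(u, f, h):
    """r = f - A u for the standard 3-point stencil."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting restriction onto the next coarser grid."""
    rc = np.zeros((len(r) + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def interpolate(ec, n_fine):
    """Linear interpolation of the coarse correction back to the fine grid."""
    e = np.zeros(n_fine)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) == 3:                    # coarsest grid: solve the single equation directly
        u[1] = 0.5 * (h * h * f[1] + u[0] + u[2])
        return u
    u = smooth(u, f, h)                            # 1. pre-smoothing removes local error
    rc = restrict(residual(u, f, h))               # 2. residual, restricted to coarse grid
    ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h)   # 3. coarse error equation, recursively
    u += interpolate(ec, len(u))                   # 4. prolongate and apply the correction
    return smooth(u, f, h)                         # 5. post-smoothing

# Solve -u'' = pi^2 sin(pi x) on [0, 1]; the exact solution is u = sin(pi x).
N = 65
x = np.linspace(0.0, 1.0, N)
h = x[1] - x[0]
f = np.pi ** 2 * np.sin(np.pi * x)
u = np.zeros(N)
for _ in range(10):
    u = v_cycle(u, f, h)
```

A few V-cycles suffice to reduce the algebraic error below the discretization error, which is the practical meaning of "asymptotically optimal" here.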

Multigrid methods V-Cycle

Multigrid methods Restriction and interpolation operators: - If we have PDEs on a physical domain, we can simply use this geometric information: pick every second unknown, or apply a weighting. For interpolation: linear or higher-order interpolation methods. - There also exist algebraic multigrid methods that use properties of the system matrix A to construct the restriction and prolongation operators. - A good introduction: Briggs, W. L., Henson, V. E., McCormick, S. F., 2000. A Multigrid Tutorial, 2nd Edition, SIAM.
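For the 1D case the two geometric transfer operators mentioned above can be written down explicitly. The sketch below (illustrative helper names, not library code) builds full-weighting restriction and linear interpolation as small dense matrices and checks that, in the interior, interpolation is the transpose of restriction up to a factor of 2:

```python
import numpy as np

def full_weighting(n_fine):
    """1D full-weighting restriction matrix, stencil [1/4, 1/2, 1/4]."""
    n_coarse = (n_fine + 1) // 2
    R = np.zeros((n_coarse, n_fine))
    for i in range(1, n_coarse - 1):
        R[i, 2 * i - 1: 2 * i + 2] = [0.25, 0.5, 0.25]
    R[0, 0] = R[-1, -1] = 1.0          # boundary values are taken over directly
    return R

def linear_interpolation(n_fine):
    """1D linear interpolation (prolongation) matrix, stencil [1/2, 1, 1/2]."""
    n_coarse = (n_fine + 1) // 2
    P = np.zeros((n_fine, n_coarse))
    for i in range(n_coarse):
        P[2 * i, i] = 1.0              # coarse-grid points are copied
    for i in range(n_coarse - 1):
        P[2 * i + 1, i] = P[2 * i + 1, i + 1] = 0.5   # fine midpoints are averaged
    return P

R = full_weighting(9)
P = linear_interpolation(9)
```

The transpose relation P = 2 R^T (in the interior) is exactly the variational property used by algebraic multigrid to derive one operator from the other.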

Multigrid methods - Parallelization - The smoother, restriction and interpolation kernels are local operators, so their parallelization is straightforward. - One problem is that the communication overhead grows on the coarser levels (only a few unknowns per process, but the same number of processes; an alternative is to gather all unknowns on one process, solve the system there, and redistribute). - We have successfully implemented (massively) parallel multigrid methods on different clusters, GPUs and the Cell Broadband Engine, and hold one of the world records in solving linear systems with our framework Hierarchical Hybrid Grids (HHG) (300 billion unknowns on 9,170 nodes).

Cell Broadband Engine (CBE) The CBE consists of - 1 PPE (Power Processing Element, a PowerPC core), - 8 SPEs (Synergistic Processing Elements). The SPEs can only access data in their 256 KB local store; data from main memory has to be fetched via the Memory Flow Controller (MFC). Peak performance: 200 GFLOPS (single precision), 56 GB/s memory bandwidth.

Cell Broadband Engine - Because of its heterogeneous architecture, different code has to be written for the PPE and the SPEs. - While the PPE executes standard programs, the reduced instruction set of the SPEs imposes special restrictions: - The SPEs can execute only single-precision floating-point operations at acceptable speed (this limitation was removed in the PowerXCell 8i). - The SPEs have only SIMD registers, so operations on scalars can be more expensive than on SIMD vectors. - Data transfer to and from main memory is subject to strict alignment restrictions. - Efficient parallelization can only be implemented at the pthreads level (low-level coding); there is no efficient OpenMP or MPI.

Cell Broadband Engine Example: MG A multigrid solver was implemented for solving Poisson's equation with open boundary conditions. Because of the infinite domain size, a hierarchical grid coarsening was applied: Ritter, D., Stürmer, M., Rüde, U., 2010: A fast-adaptive composite grid algorithm for solving the free-space Poisson problem on the Cell Broadband Engine. Numerical Linear Algebra with Applications, 17(2-3), pp. 291-305.

CBE MG: Implementation Details - Decomposition: We split the 3D domain into slices of roughly the same size and process the data line by line. - To hide the time needed for data transfers to and from the local store (LS) of the SPE, we use double buffering and background transfers. - The domain is traversed line by line: per traversal we have to load 3 lines of unknowns and 1 line of the right-hand side, and store one line of unknowns. - If-statements in all computational kernels were eliminated. - All kernels were SIMD-vectorized. - Synchronization of all threads is done after each smoother step and after the restriction.

CBE MG: Scaling Results The algorithm is memory-bound, but only about 50% of the theoretical peak performance is reached. Why?

CBE MG: Alignment Data must be 128-byte aligned both in main memory and in the local store for optimal transfer speed. (Figures: transfer behavior for unaligned vs. aligned data.)
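The usual remedy for such alignment requirements is to over-allocate a buffer and offset its start to the next aligned address. The following Python sketch only illustrates that idea (it is not CBE code; `aligned_empty` is a hypothetical helper):

```python
import numpy as np

def aligned_empty(n, dtype=np.float32, alignment=128):
    """Return an n-element array whose data pointer is alignment-byte aligned,
    by over-allocating a raw byte buffer and slicing it at the right offset."""
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment      # bytes until the next aligned address
    return buf[offset: offset + n * itemsize].view(dtype)

a = aligned_empty(1024)
print(a.ctypes.data % 128)   # 0: the array start is 128-byte aligned
```

On the CBE the same trick would be applied to the DMA source and target addresses so that every transfer starts on a 128-byte boundary.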

CBE MG: Scaling with proper alignment Now we reach 90% of peak performance. => Optimizing code on the CBE is tedious, and it is not trivial to understand all the issues.

Tree Reconstruction Bachelor Thesis Conducted by Janakan Sivagnanasundaram http://www10.informatik.uni-erlangen.de/~sijasiva Task: Reconstruct the tree topology from 3D laser scanner data (scattered surface points), following an approach from Hu, H., Gossett, N. and Chen, B. 2007. Knowledge and heuristic-based modeling of laser-scanned trees. ACM Trans. Graph. 26. We are using the Boost Graph Library (www.boost.org) for handling the graphs.

Tree Reconstruction - Algorithm Input: a cloud of points. Test data: a tree with 188,000 points. 1. Construct a strongly connected neighborhood graph from all point pairs whose distance is below a certain threshold (20 cm). (Graph with 20,000,000 edges.) 2. Compute the shortest paths from the root to all points. (Graph with 188,000 edges.) 3. Classify all points according to their distance from the root (class length 50 cm). 4. Build subclasses based on the connection information within each class. 5. Compute the centroid of each subclass. (Graph with 1,000 nodes and edges.) 6. Connect the centroids to obtain the tree structure. 7. Identify and connect branches that are not yet connected to the tree. 8. Fit cylinders using a least-squares method.
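Steps 1-3 can be sketched on a toy point cloud as follows. This is a simplified illustration with hypothetical helper names; the actual thesis code uses the Boost Graph Library and a much larger data set, where the O(n²) graph construction below would need a spatial data structure:

```python
import heapq
import numpy as np

def neighborhood_graph(points, threshold):
    """Step 1: connect all point pairs closer than `threshold`
    (O(n^2) for this toy example; large clouds need a spatial index)."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(points[i] - points[j]))
            if d < threshold:
                adj[i].append((j, d))
                adj[j].append((i, d))
    return adj

def shortest_path_lengths(adj, root=0):
    """Step 2: Dijkstra's algorithm from the root over the neighborhood graph."""
    dist = [float("inf")] * len(adj)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                    # stale heap entry
        for w, weight in adj[v]:
            if d + weight < dist[w]:
                dist[w] = d + weight
                heapq.heappush(heap, (d + weight, w))
    return dist

def classify(dist, class_length):
    """Step 3: bin points by their graph distance from the root."""
    return [int(d // class_length) for d in dist]

# Toy 'trunk': 20 points along a vertical line, 10 cm apart, root at the bottom.
points = np.array([[0.0, 0.0, 0.1 * k] for k in range(20)])
adj = neighborhood_graph(points, threshold=0.15)
dist = shortest_path_lengths(adj)
classes = classify(dist, class_length=0.5)
```

Steps 4-6 would then split each distance class into connected subclasses, replace each subclass by its centroid, and connect centroids of adjacent classes.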

Tree Reconstruction (Figures: local neighborhood graph, shortest-path graph, clustered points, connected centroids, main skeleton. Figures from Hu, H., Gossett, N. and Chen, B. 2007. Knowledge and heuristic-based modeling of laser-scanned trees. ACM Trans. Graph. 26.)

Tree Reconstruction Skeleton Extension - Perform a breadth-first search. - Project a cone of a certain opening angle along the direction from the parent node to the current node. - When the cone intersects a connected subgraph G: compute the intersection point P; if P is within a certain range of the current node, connect G to the main skeleton.
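The cone test can be expressed as a simple angle-and-range check on candidate points. This is a hedged sketch with hypothetical names (`in_cone`, default angle and range are made up); the thesis may use a different formulation:

```python
import numpy as np

def in_cone(parent, current, candidate, half_angle_deg=30.0, max_range=1.0):
    """True if `candidate` lies inside the search cone projected from `current`
    along the direction parent -> current, and within `max_range` of it."""
    axis = current - parent
    axis = axis / np.linalg.norm(axis)           # unit vector of the growth direction
    offset = candidate - current
    dist = float(np.linalg.norm(offset))
    if dist == 0.0 or dist > max_range:
        return False
    cos_angle = float(np.dot(axis, offset)) / dist
    return cos_angle >= np.cos(np.radians(half_angle_deg))

parent = np.array([0.0, 0.0, 0.0])
current = np.array([0.0, 0.0, 1.0])
near_axis = bool(in_cone(parent, current, np.array([0.1, 0.0, 1.5])))  # nearly on-axis
off_axis = bool(in_cone(parent, current, np.array([1.0, 0.0, 1.0])))   # 90 degrees off-axis
```

A disconnected subgraph whose nearest point passes this test would then be attached to the main skeleton at the current node.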

Tree Reconstruction Status - Programming is almost done; we have everything up to the connected-centroids graph. - Missing: skeleton extension, cylinder fitting, parameter studies. - The program runs within a few minutes for our test data. - It produces good results for a winter tree (without leaves). - Possible improvements: use L-system fitting in the crown of trees with leaves; parallelization of the code.

Thank you for your attention! Visit our homepage: www10.informatik.uni-erlangen.de