Dense matching GPU implementation


Author: Hailong Fu. Supervisors: Prof. Dr.-Ing. Norbert Haala, Dipl.-Ing. Mathias Rothermel. Universität Stuttgart

1. Introduction

The correspondence problem, i.e. finding corresponding pixels across images, is a central issue in computer vision and photogrammetry; solving it is called image matching. Dense matching aims at a dense and homogeneous match for every pixel and is an essential part of 3D point cloud generation. Semi-Global Matching (SGM) (Hirschmueller, 2008) is one of the best dense matching algorithms with respect to both quality and speed. SGM is simpler than global methods, but still computationally intensive compared with local matching methods, which challenges its use for real-time applications and for fast aerial image processing. However, SGM is highly parallel, which opens the door to GPU implementations. CUDA (Compute Unified Device Architecture) (NVIDIA, 2012a) provides a GPU multi-threading model based on grids, blocks and threads that hides many hardware details and is easy to learn. This thesis targets a CUDA solution of SGM.

2. CUDA

Figure 1: GPU chip architecture

All CUDA instructions are carried out on the chip of the graphics card. One chip has several Streaming Multiprocessors (SMs), and one SM has several Streaming Processors (SPs). The concepts grid, block and thread are mapped onto these hardware components. A grid is a grid of blocks; one grid is executed on one GPU chip, and the blocks in the grid can be arranged in 1, 2 or 3 dimensions. A block is a block of threads; one block is executed on one Streaming Multiprocessor, and the threads can likewise be arranged in 1, 2 or 3 dimensions. A good choice for the block size is 16x16. Threads are executed on Streaming Processors; one block may contain many more threads than there are SPs, and the maximum number of threads per SM is 1536.

A kernel is a C function defined with the __global__ declaration specifier. Once a kernel is launched, it is executed by many threads, and each thread has a unique ID. The number of blocks per grid and threads per block must be specified at launch time (see the sketch at the end of this section). In CUDA the warp is the smallest execution unit. It contains 32 threads, filled along the block x direction first, so a block is composed of N warps. The threads within one warp execute in lockstep, so CUDA prefers code with few branches; otherwise threads sit idle.

Figure 2: Multi-threading model and memory types

There are different types of memory in CUDA, and their usage should be combined and tested. Global memory is off-chip, has a volume of 1-4 GB and can be accessed by all threads. It has very high latency, and its access pattern is essential for speed. The most important pattern is coalesced access: adjacent threads in one warp should access adjacent addresses in global memory. Shared memory is on-chip, with 48 KB per SM. It has very low latency and high bandwidth; if the same global memory address will be read more than once, shared memory should be used. Shared memory is shared by the threads of one block, but not across blocks. Registers belong to each thread and store local variables. CUDA can also execute multiple kernels concurrently to exploit the full power of the device's multiprocessors; kernels that use much shared memory and many registers are less likely to run concurrently.
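As an illustration of these concepts, the sketch below (not taken from the implementation described in this report; the kernel, its name and the image layout are placeholder assumptions) shows a 2D kernel with a unique thread ID per pixel, a 16x16 launch configuration and a coalesced global memory access pattern:

```cuda
// Minimal illustrative sketch: a 2D kernel over an image with 16x16 blocks.
// Adjacent threads along x touch adjacent addresses -> coalesced global memory access.
__global__ void invertKernel(const unsigned char* in, unsigned char* out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column (unique per thread)
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        out[y * width + x] = 255 - in[y * width + x];
}

// Host side: blocks per grid and threads per block are specified at launch time.
void launchInvert(const unsigned char* d_in, unsigned char* d_out, int width, int height)
{
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to cover the image
              (height + block.y - 1) / block.y);
    invertKernel<<<grid, block>>>(d_in, d_out, width, height);
}
```

The rounded-up grid size together with the bounds check inside the kernel allows image sizes that are not multiples of the block size.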

3. CUDA implementation of SGM

Here we assume that both the base and the match image are epipolar rectified, so the matching problem is reduced to 1D and the search can be done within a predefined search range. The matching result is called the disparity image, in which each disparity is the column difference between corresponding pixels. The disparity image can be used for triangulation (computation of 3D coordinates).

Figure 3: Flow chart of SGM

We perform the matching in image pyramids, with a full disparity range search on the highest level and a dynamic disparity range search on the lower levels.

3.1 Census transform

On each level we first apply a census transform to the base and match images to increase robustness. A 9x7 window visits each pixel of the image, every pixel in the window is compared with the central pixel, and the result is a 63-bit string of 0s and 1s. A 2D parallel kernel can be launched for this task. The comparisons can be performed directly on global memory (where the images are stored), but shared memory is preferred since every pixel is visited 63 times.
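As a rough sketch of how such a kernel could look (assuming a window of 9 columns x 7 rows, the 63 comparison bits packed into one 64-bit word, border pixels skipped, and plain global memory reads instead of the shared memory staging preferred above):

```cuda
// Illustrative 9x7 census transform, one thread per pixel.
// Each window pixel contributes one bit (1 if darker than the centre), 63 bits in total.
__global__ void censusTransform9x7(const unsigned char* img, unsigned long long* census,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 4 || x >= width - 4 || y < 3 || y >= height - 3)
        return;                                        // skip the window border

    unsigned char centre = img[y * width + x];
    unsigned long long bits = 0ULL;
    for (int dy = -3; dy <= 3; ++dy)                   // 7 rows
        for (int dx = -4; dx <= 4; ++dx) {             // 9 columns
            bits <<= 1;
            if (img[(y + dy) * width + (x + dx)] < centre)
                bits |= 1ULL;
        }
    census[y * width + x] = bits;                      // 63-bit census string
}
```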

3.2 Cost cube fill

Figure 4: Constant/dynamic cubes (Rothermel, Wenzel, Fritsch, & Haala, 2012)

Afterwards, for each pixel of the base census image we calculate a cost value (a dissimilarity, e.g. the Hamming distance) for each disparity. Every pixel thus gets a string of costs, and all cost strings together form the local cost cube. Since the disparity range of each pixel is dynamic, the cost cube is dynamic as well. This step can be solved by a 3D CUDA kernel with the block x direction pointing along the disparity range (to coalesce the cost cube access, because the cube is stored disparity-first in a linear memory space), the block y direction pointing along the columns and the block z direction pointing along the rows.

Figure 5: Dynamic disparity range (example)

The choice of the block x dimension depends on the dynamic range. The maximum range in Figure 5 is 64, while most pixels only have 16, so we chose an x dimension of 16 rather than 64 to avoid idle threads. One thread may then process (disparity range / x dimension) steps for each pixel.
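A simplified sketch of this kernel is given below. It assumes a constant disparity range, a disparity-fastest cube layout and one thread per cost entry, whereas the implementation described above uses dynamic per-pixel ranges and lets one thread process several disparity steps; the Hamming distance between two census strings is computed with XOR and the population count intrinsic __popcll:

```cuda
// Illustrative local cost computation: Hamming distance between census strings.
// Block x maps to the disparity (fastest-varying index -> coalesced cube writes),
// block y to columns, block z to rows.
__global__ void fillCostCube(const unsigned long long* censusBase,
                             const unsigned long long* censusMatch,
                             unsigned char* costCube,          // size width*height*dispRange
                             int width, int height, int dispRange)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;     // disparity
    int x = blockIdx.y * blockDim.y + threadIdx.y;     // column
    int y = blockIdx.z * blockDim.z + threadIdx.z;     // row
    if (d >= dispRange || x >= width || y >= height)
        return;

    unsigned long long b = censusBase[y * width + x];
    int xm = x - d;                                    // corresponding column in the match image
    unsigned long long m = (xm >= 0) ? censusMatch[y * width + xm] : 0ULL;
    costCube[(y * width + x) * dispRange + d] =
        (unsigned char)__popcll(b ^ m);                // Hamming distance, 0..63
}
```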

3.3 Cost aggregation

Cost aggregation smooths the disparities while preserving disparity discontinuities near edges. It aggregates the local cost cube along 8 or 16 paths.

Figure 6: Diagram of 16 aggregation paths

L_r(p, d) = C(p, d) + \min\{ L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \} - \min_k L_r(p - r, k)

For one path r, this formula shows how, for every base pixel p and each disparity d, the aggregated path cost L_r(p, d) is calculated from the local cost C(p, d), the path costs of the previous pixel p - r on the path and the previous minimum path cost; P_1 and P_2 are the usual SGM penalties for small and large disparity changes. The aggregated cost cube (S cube) is the sum over all aggregated path costs and has the same structure as the local cost cube; the optimal disparity corresponds to the minimum S cost. The aggregation is recursive, because later path costs depend on earlier ones. It is nevertheless 2D parallel: for one path, the pixels of one row/column can be processed in parallel, and the path costs over the disparities of one pixel can also be processed in parallel. We therefore define aggregation walls that are processed sequentially.

Figure 7: Aggregation walls of horizontal / vertical / diagonal aggregation paths (example)

An aggregation wall is a 2D array of path costs whose height equals the maximum disparity range and whose width equals the number of columns/rows. One aggregation wall can be split and processed by many 2D CUDA blocks: the x direction points along the disparity range, the y direction along the columns. We chose a block x dimension of 16, so one thread may process (disparity range / x dimension) path costs. Each block reads a part of the previous wall and generates a part of the following wall, whose width equals the y dimension of the block. Two walls are allocated per path; their roles are swapped after each aggregation step.

Figure 8: CUDA blocks for vertical aggregation path (example)
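As a sketch, the recurrence for a single disparity along one path might be written as the device function below (the wall layout, the cost types and the way the penalties are passed in are illustrative assumptions, not the actual implementation):

```cuda
// Illustrative update of one path cost L_r(p, d) from the previous pixel on the path.
// prevWall holds L_r(p - r, .) for all disparities of the previous pixel, prevMin its minimum;
// P1 penalises disparity changes of 1, P2 penalises larger jumps.
__device__ unsigned short aggregateOne(const unsigned short* prevWall, int prevMin,
                                       int localCost, int d, int dispRange, int P1, int P2)
{
    int best = prevWall[d];                                   // same disparity, no penalty
    if (d > 0)             best = min(best, prevWall[d - 1] + P1);
    if (d < dispRange - 1) best = min(best, prevWall[d + 1] + P1);
    best = min(best, prevMin + P2);                           // arbitrary disparity jump
    return (unsigned short)(localCost + best - prevMin);      // subtracting prevMin bounds the values
}
```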

To achieve the best speed, we split the 16 paths into three groups.

Group one has 6 paths (two vertical and four diagonal), since their aggregation walls look the same. The 6 path aggregations are performed in 2 passes of 3 paths each (Zhu, Butenuth, & d'Angelo, 2010), which saves local cost cube and S cube accesses. We store the walls in global memory rather than shared memory; otherwise one block could not access the walls of another block diagonally, because shared memory is private to each block. Furthermore, synchronization between all blocks of a row is needed due to the diagonal access of the walls, so we launch one kernel for one row and another kernel for the next row. When the previous wall is read into the kernel, it can be staged in shared memory first, because each path cost is used 3 times.

Figure 9: Concurrency of 6 aggregation paths

Group two has the 2 horizontal paths. Here the previous and the following wall can always reside and swap in shared memory, because they are perfectly aligned. Group three contains the remaining 8 inclined diagonal paths, which are handled in the same way as group one. Groups one and three can additionally be improved by concurrent kernel execution to exploit the peak performance of the SMs, since the top-down and bottom-up passes are independent.

Figure 10: Concurrency of additional 8 paths

The minimum path cost during aggregation can be obtained by parallel reduction (Harris, 2007).
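A minimal sketch of such a block-level minimum reduction in shared memory, in the spirit of Harris (2007), is shown below; the kernel and its buffers are illustrative, and the actual disparity selection additionally has to carry the index of the minimum along with its value:

```cuda
// Illustrative tree reduction: each block reduces blockDim.x cost values to one minimum.
// blockDim.x must be a power of two; launch with dynamic shared memory, e.g.
//   minReduce<<<numBlocks, threads, threads * sizeof(unsigned short)>>>(costs, blockMins, n);
__global__ void minReduce(const unsigned short* costs, unsigned short* blockMins, int n)
{
    extern __shared__ unsigned short smem[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    smem[tid] = (i < n) ? costs[i] : 0xFFFF;          // pad with "infinity"
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // halve the active threads each step
        if (tid < s)
            smem[tid] = min(smem[tid], smem[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        blockMins[blockIdx.x] = smem[0];              // one partial minimum per block
}
```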

3.4 Minimum cost disparity selection

The parallel reduction also works well for selecting the disparity with the minimum S cost.

3.5 Left-right image consistency check

In our program, both the forward and the backward matching are computed from scratch. If a base disparity differs from its corresponding disparity by more than 1, the pixel is marked as invalid. This is solved by launching a 2D CUDA kernel twice, in which one thread checks one pixel.

3.6 New range determination and upscaling

The new disparity range determination is based on the disparity image of the previous pyramid level. A 2D CUDA kernel visits each pixel, finds the maximum and minimum disparities inside a window, and uses their difference as the new disparity range. Pixels marked as invalid are assigned the maximum disparity range. The upscaling of the disparity image and the disparity range image is just a data copy with doubled size and values; again a 2D kernel can be used.
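A possible sketch of the range determination kernel is shown below; the invalid marker, the window radius, the data types and the clamping at the image border are assumptions for illustration, and a real implementation would typically also add a safety margin to the range:

```cuda
#define INVALID_DISP (-1)   // assumed marker for pixels rejected by the consistency check

// Illustrative new-range determination: one thread per pixel scans a window in the
// previous-level disparity image and uses (max - min) as the new disparity range.
__global__ void newDispRange(const short* prevDisp, unsigned char* range,
                             int width, int height, int radius, int maxRange)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    if (prevDisp[y * width + x] == INVALID_DISP) {     // invalid pixel: keep the full range
        range[y * width + x] = (unsigned char)maxRange;
        return;
    }
    int dMin = 1 << 30, dMax = -(1 << 30);
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int xx = min(max(x + dx, 0), width - 1);   // clamp to the image border
            int yy = min(max(y + dy, 0), height - 1);
            int d = prevDisp[yy * width + xx];
            if (d == INVALID_DISP) continue;           // ignore invalid neighbours
            dMin = min(dMin, d);
            dMax = max(dMax, d);
        }
    range[y * width + x] = (unsigned char)min(dMax - dMin, maxRange);  // assumes maxRange <= 255
}
```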

4. Results

Table 1: Utilized hardware devices

CPU:
  Intel notebook i5-2430M @ 2.40 GHz, 2 cores
  Intel desktop i3-2100 @ 3.10 GHz, 2 cores
GPU:
  NVIDIA notebook GeForce GT 540M @ 1.34 GHz, 96 CUDA cores (paired with the i5-2430M)
  NVIDIA desktop GeForce GTX 550 Ti @ 1.8 GHz, 192 CUDA cores (paired with the i3-2100)
  NVIDIA desktop GeForce GTX 580 @ 1.54 GHz, 512 CUDA cores (paired with the i3-2100)

Figure 11: CPU vs GPU run time of the different SGM steps, Fountain image pair 3113x2111, maximum disparity range 64. Run times in seconds:

Device                     Cost cube fill   2 horizontal paths   6 paths (2 vert., 4 diag.)   8 inclined diagonal paths   Disparity selection
i5-2430M (notebook CPU)    0.939            1.408                4.592                        13.950                      0.112
i3-2100 (desktop CPU)      0.723            1.072                3.408                        11.693                      0.087
GT 540M (notebook GPU)     0.115            0.266                0.843                        1.214                       0.149
GTX 550 Ti (desktop GPU)   0.060            0.167                0.549                        0.750                       0.057
GTX 580 (desktop GPU)      0.015            0.036                0.108                        0.148                       0.016

Figure 12: Total SGM run time (with 6+2 paths) for different image sizes on the GTX 580 (desktop GPU), maximum range 64. Run times in seconds:

Tsukuba 384x288             0.059
Cones 450x375               0.084
Construction site 640x480   0.119
Aloe 1282x1110              0.447
Fountain 3113x2111          1.727
Clive 3711x4751             3.525

Figure 13: CPUs vs GPUs total SGM run time (with 6+2 paths) for the Munich aerial photos, 14336 x 15765, maximum range 64. Times in seconds:

                     i5-2430M   i3-2100   GT 540M   GTX 550 Ti   GTX 580
GPU used time        0.0        0.0       207.3     98.1         22.1
Transfer time        0.0        0.0       14.6      8.5          9.1
CPU used time        780.0      539.0     75.6      42.1         38.7
Total (approx.)      780        539       298       149          70

Figure 14: Vaihingen point cloud from the CPU solution / from the GPU solution

Figure 15: Fountain point cloud from the CPU solution / from the GPU solution

5. Conclusion

For the speed comparison between CPU and GPU, we used an Intel i3-2100 and an NVIDIA GTX 580. In pure path aggregation the GTX 580 is more than 30x faster than the i3-2100. For comparison with related work, we also processed small frame images. Our program is not designed for real-time application, since we want to preserve all interfaces for aerial photos, which require much GPU-CPU data transfer. Using the GTX 580 and a maximum search range of 64, Tsukuba (384x288) reaches 17 FPS, Cones (450x375) 12 FPS, the construction site (640x480) 8.4 FPS and Aloe (1280x1110) 2.2 FPS. Neglecting data transfer and speckle filtering, Tsukuba can reach 32 FPS, Cones 22 FPS, the construction site 18 FPS and Aloe 4.9 FPS; this should be sufficient for real-time applications. We also processed exemplary aerial images of size 16000x14000. The matching time for one pair is only 70 seconds, and the remaining serial CPU code has become the bottleneck. The point clouds derived from the CPU and CUDA solutions for two data sets show very similar quality, which indicates that the CUDA results are reliable. Overall, our GPU solution achieves results of similar quality to the CPU solution at a much higher speed and is promising for both photogrammetric and computer vision use.

Bibliography

Harris, M. (2007). Optimizing Parallel Reduction in CUDA. NVIDIA Developer Technology. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

Hirschmueller, H. (2008). Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 328-341.

NVIDIA. (2012a). CUDA C Programming Guide. NVIDIA Developer Zone. http://docs.nvidia.com/cuda/cuda-c-programming-guide/

Rothermel, M., Wenzel, K., Fritsch, D., & Haala, N. (2012). SURE: Photogrammetric Surface Reconstruction from Imagery. LC3D Workshop, Berlin.

Zhu, K., Butenuth, M., & d'Angelo, P. (2010). Computational Optimized 3D Reconstruction System for Airborne Image Sequences. ISPRS WG I/4: Geometric/Radiometric Modelling of Optical Spaceborne Sensors, Session III.