Connected Component Labelling, an embarrassingly sequential algorithm

Platform Parallel Netherlands GPGPU-day, 20 June 2013
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
Van de Loosdrecht Machine Vision BV
Limerick Institute of Technology

Overview
- Introduction and background
- Connected Component Labelling
- Sequential
- Few-core
- Many-core
- Kalentev et al. approach
- Suggestions for extending
- Suggestions for optimizing
- Summary and conclusions
- Future work on CCL
- References
- Future of intelligent cameras
- Questions

Introduction
- Manager, NHL Centre of Expertise in Computer Vision
  - University of Applied Sciences, Leeuwarden
  - 4 FTE
  - Since 1996: 80 industrial projects
- Managing director, Van de Loosdrecht Machine Vision BV
  - VisionLab: development environment for computer vision with pattern matching, neural networks and genetic algorithms
  - Portable library (ANSI C++)
  - Windows, Linux and Android
  - x86, x64, ARM and PowerPC
- Student, Limerick Institute of Technology (Ireland)
  - Research master project, September 2011 to September 2013

Research master project
- Accelerating sequential computer vision algorithms using commodity parallel hardware
- Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures
- Distinctive: investigate how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms
  - Generic library, 100,000 lines of ANSI C++
  - Portability and vendor independence
  - OpenMP for CPU, OpenCL for GPU
  - Variance in execution times
  - Run-time prediction of whether parallelization is beneficial

Computer vision algorithms and parallelization
- Classification of image operators
  - Low level image operators
    - Point operators
    - Local neighbour operators
    - Global operators
    - Connectivity based operators
  - High level image operators
    - Often built on the low level operators
  - Specials
    - Pattern matcher, neural network, genetic algorithm, etc.
- Idea: start with the low level image operators; design and implement skeletons for parallelizing representatives of each class
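As an illustration of the simplest class, the point operator, here is a minimal C++ sketch of a threshold parallelized with OpenMP. The function name `ThresholdPointOperator` and the flat `std::vector` image are assumptions made for this example; they are not VisionLab's API or the skeletons developed in the project.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical point operator: every output pixel depends only on the input
// pixel at the same position, so all loop iterations are independent.
std::vector<std::uint8_t> ThresholdPointOperator(
        const std::vector<std::uint8_t>& src,
        std::uint8_t low, std::uint8_t high) {
    assert(low <= high);
    std::vector<std::uint8_t> dst(src.size());
    const long long n = static_cast<long long>(src.size());
    // If OpenMP is not enabled the pragma is ignored and the loop
    // simply runs sequentially.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        dst[i] = (src[i] >= low && src[i] <= high) ? 1 : 0;
    }
    return dst;
}
```

Connectivity based operators such as connected component labelling do not decompose this easily, because the label of one pixel depends on pixels arbitrarily far away.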

Demonstration Label Blobs
- Open image cells.jl, show image contents
- ThresholdIsoData, show image contents
  - Explain background/objects, white/black and 0/1
- LabelBlobs, show image contents
  - Explain 3 used colours
- BlobAnalyse
  - Explain table
  - Explain successive label numbering

Screen shot demo

Label blobs iterative algorithm
- Classical sequential approach: Haralick and Shapiro (1992)
- Binary image: give each object pixel a unique positive value
[Figure: example binary image in which each object pixel carries a unique number]

Label blobs iterative algorithm
- Repeat until no changes
  - Down pass (top left to bottom right): give each object pixel the minimum value found among its 8 neighbours (background excluded)
  - Up pass (bottom right to top left): give each object pixel the minimum value found among its 8 neighbours (background excluded)
[Figure: example label image after the passes]
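The iterative algorithm can be sketched in a few lines of C++. This is a minimal sketch under assumptions: the `GridImage` struct, the function names and the convention that 0 means background are chosen for this example and are not taken from Haralick and Shapiro or from VisionLab.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major image; 0 = background, any other value = object pixel.
struct GridImage {
    int width;
    int height;
    std::vector<int> pix;
};

// Replace the object pixel at (x, y) by the smallest non-background value
// in its 3x3 window. Returns true if the pixel changed.
static bool TakeMinimumOfNeighbours(GridImage& im, int x, int y) {
    int& p = im.pix[y * im.width + x];
    if (p == 0) return false;
    int best = p;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= im.width || ny >= im.height) continue;
            const int v = im.pix[ny * im.width + nx];
            if (v != 0) best = std::min(best, v);
        }
    }
    const bool changed = (best != p);
    p = best;
    return changed;
}

// Labels the image in place and returns the number of down/up iterations.
int LabelBlobsIterative(GridImage& im) {
    assert(im.pix.size() == static_cast<std::size_t>(im.width) * im.height);
    int next = 1;  // give each object pixel a unique positive value
    for (int& p : im.pix) if (p != 0) p = next++;
    int iterations = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++iterations;
        for (int y = 0; y < im.height; ++y)           // down pass
            for (int x = 0; x < im.width; ++x)
                changed |= TakeMinimumOfNeighbours(im, x, y);
        for (int y = im.height - 1; y >= 0; --y)      // up pass
            for (int x = im.width - 1; x >= 0; --x)
                changed |= TakeMinimumOfNeighbours(im, x, y);
    }
    return iterations;
}
```

Note that the resulting labels are unique per blob but not successive: a blob keeps the smallest initial value of its pixels.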

Sequential version
- He, Chao, and Suzuki (2008): the two-pass approach gives the best performance
  - Pass 1: equivalent labels are stored in an equivalence table (neighbourhood search)
  - Resolving equivalences with a search algorithm
  - Pass 2: assign label to pixel (lookup table)
- Analysis of execution time (VisionLab) in μs on a Core i7-2640M for the typical image cells.jl

  Image size   Pass 1 (μs)   Resolving equivalences (μs)   Pass 2 (μs)   Total (μs)   Pass 1/Total
  256x256              134                             1            43          178           0.75
  512x512              405                             2           159          566           0.71
  1024x1024           1358                             3           629         1990           0.68
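The two-pass structure can be sketched as follows. This is a simplified pixel-based sketch with 8-connectivity, not the run-based algorithm of He, Chao, and Suzuki, and it resolves equivalences with a union-find table rather than the search algorithm mentioned on the slide. All names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Union-find lookup with path halving.
static int FindRoot(std::vector<int>& parent, int a) {
    while (parent[a] != a) {
        parent[a] = parent[parent[a]];
        a = parent[a];
    }
    return a;
}

// Record that labels a and b belong to the same blob (smaller root wins).
static void Unite(std::vector<int>& parent, int a, int b) {
    a = FindRoot(parent, a);
    b = FindRoot(parent, b);
    if (a < b) parent[b] = a; else parent[a] = b;
}

// img: row-major, 0 = background, non-zero = object pixel.
// Writes successive labels 1..n into img and returns n.
int LabelBlobsTwoPass(std::vector<int>& img, int width, int height) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> parent(1, 0);           // entry 0 is the background
    const int dx[4] = {-1, -1, 0, 1};        // already visited neighbours:
    const int dy[4] = {0, -1, -1, -1};       // W, NW, N, NE
    // Pass 1: provisional labels plus equivalence table.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int& p = img[y * width + x];
            if (p == 0) continue;
            int label = 0;
            for (int k = 0; k < 4; ++k) {
                const int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= width) continue;
                const int q = img[ny * width + nx];
                if (q == 0) continue;
                if (label == 0) label = q; else Unite(parent, label, q);
            }
            if (label == 0) {                // no labelled neighbour: new label
                label = static_cast<int>(parent.size());
                parent.push_back(label);
            }
            p = label;
        }
    }
    // Resolve equivalences into a lookup table with successive labels.
    std::vector<int> lut(parent.size(), 0);
    int count = 0;
    for (int l = 1; l < static_cast<int>(parent.size()); ++l) {
        const int r = FindRoot(parent, l);
        if (lut[r] == 0) lut[r] = ++count;
        lut[l] = lut[r];
    }
    // Pass 2: assign the final label to each pixel.
    for (int& p : img) p = lut[p];
    return count;
}
```

Pass 1 does the neighbourhood search and is the expensive part, which is consistent with the Pass 1/Total ratios in the table; Pass 2 is a plain table lookup per pixel.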

Parallel version
- Rosenfeld and Pfaltz (1966): CCL cannot be implemented with parallel local operations
- Hawick, Leist and Playne (2010): Label Equivalence gives the best performance
- Kalentev, Rai, Kemnitz, and Schneider (2011): alternative Label Equivalence approach
  - Equivalence table is stored in the image
  - No atomic operations
  - Claimed to be efficient in the number of iterations needed: on average 5 iterations on their test set
- Algorithm
  - Initial pass
  - Multiple iterations
    - Link pass (neighbourhood search)
    - Label equalize pass (neighbourhood search)
  - Final pass

Kalentev et al. approach
- It is expected that
  - Both passes of an iteration have a complexity similar to Pass 1
  - The initial and final passes have a complexity similar to Pass 2
- Analysis: on average the Kalentev et al. approach needs 5 iterations
  - One simple initial pass
  - 10 neighbourhood search passes
  - One simple final pass
  - Extra post-processing step with two simple passes
- Estimation
  - Sequential version: 1 unit of execution time
  - Kalentev et al.: 8.2 units of (sequential) execution time
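The estimate of 8.2 units can be reproduced with a short calculation. It rests on an assumption read from the sequential timing ratios rather than stated on the slide: Pass 1 takes roughly 0.7 and Pass 2 roughly 0.3 of the sequential execution time, with the resolving of equivalences neglected.

```latex
\begin{align*}
T_{\text{Kalentev}}
  &\approx \underbrace{5 \cdot 2}_{\text{neighbourhood search passes}} \cdot\, T_{\text{Pass 1}}
   \;+\; \underbrace{(1 + 1 + 2)}_{\text{initial, final, post-processing}} \cdot\, T_{\text{Pass 2}} \\
  &= 10 \cdot 0.7 + 4 \cdot 0.3 \\
  &= 8.2 \quad \text{units of sequential execution time}
\end{align*}
```

Under this estimate the parallel version has to gain more than a factor of 8.2 from parallel execution before it beats the sequential two-pass version.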

Kalentev et al. approach
- Different approaches are needed for few-core CPUs and many-core GPUs
- The GPU approach will suffer from branch divergence

Few-core approach on Core i7-2600 CPU @ 3.4 GHz (quad-core)

Host code framework suggested by Kalentev et al.

    WriteBuffer(image);
    int notDone = 1;
    RunKernel(InitLabels, image);
    WriteBuffer(notDone);
    while (notDone == 1) {
        notDone = 0;                      // kernels set it back to 1 on any change
        WriteBuffer(notDone);
        RunKernel(Link, image, notDone);
        RunKernel(LabelEqualize, image);
        ReadBuffer(notDone);
    } // while notDone
    ReadBuffer(image);
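To make the control flow concrete, the following C++ sketch emulates the host loop sequentially on the CPU. It is only loosely modelled on the label equivalence idea of storing the equivalence table in the image: the bodies of `InitLabelsCpu`, `LinkCpu` and `LabelEqualizeCpu` are assumptions written for this example, not the kernels of Kalentev et al. or of the presenter. In particular, the slides describe the label equalize pass as containing a neighbourhood search, which this sketch does not do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Labels live in the image itself: 0 = background, otherwise the 1-based
// index of a pixel of the same blob (a "reference").
void InitLabelsCpu(std::vector<int>& im) {
    for (std::size_t i = 0; i < im.size(); ++i)
        if (im[i] != 0) im[i] = static_cast<int>(i) + 1;
}

// Hook the pixel that a label refers to onto the smallest label found in
// the 8-neighbourhood. Returns true if anything changed.
bool LinkCpu(std::vector<int>& im, int width, int height) {
    bool changed = false;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int label = im[y * width + x];
            if (label == 0) continue;
            int minLabel = label;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    const int v = im[ny * width + nx];
                    if (v != 0) minLabel = std::min(minLabel, v);
                }
            if (minLabel < label) {
                int& ref = im[label - 1];
                ref = std::min(ref, minLabel);
                changed = true;
            }
        }
    }
    return changed;
}

// Follow references until a pixel that refers to itself is reached.
void LabelEqualizeCpu(std::vector<int>& im) {
    for (std::size_t i = 0; i < im.size(); ++i) {
        int label = im[i];
        if (label == 0) continue;
        while (im[label - 1] != label) label = im[label - 1];
        im[i] = label;
    }
}

// Sequential stand-in for the host loop; returns the number of iterations.
int LabelBlobsLabelEquivalence(std::vector<int>& im, int width, int height) {
    assert(im.size() == static_cast<std::size_t>(width) * height);
    InitLabelsCpu(im);
    int iterations = 0;
    bool notDone = true;
    while (notDone) {
        notDone = LinkCpu(im, width, height);
        LabelEqualizeCpu(im);
        ++iterations;
    }
    return iterations;
}
```

In this sketch each blob ends up labelled with the 1-based index of its first pixel in scan order, so the labels are unique but not successive.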

Suggestions for extending Kalentev et al. approach
- The InitLabel kernel is extended to set the border pixels of the image to the background value
- Link kernels are implemented for both four and eight connectivity
- A post-processing step with two passes is added to make the labelling of the blobs successive
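A two-pass post-processing step of this kind could look like the sketch below. It assumes, for the sake of the example, that after labelling every blob carries the 1-based index of one of its own pixels and that 0 is background; the function name `MakeLabelsSuccessive` and this input convention are assumptions, not the presenter's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Renumbers the labels in place to 1..n and returns n.
int MakeLabelsSuccessive(std::vector<int>& im) {
    std::vector<int> lut(im.size() + 1, 0);   // lut[0] = 0 keeps the background
    int count = 0;
    // Pass 1: a pixel whose label equals its own 1-based index represents
    // a blob; number these representatives in scan order.
    for (std::size_t i = 0; i < im.size(); ++i) {
        if (im[i] == static_cast<int>(i) + 1) lut[i + 1] = ++count;
    }
    // Pass 2: replace every label by its successive number.
    for (int& p : im) {
        assert(p >= 0 && static_cast<std::size_t>(p) < lut.size());
        p = lut[p];
    }
    return count;
}
```

The counter in the first pass makes that pass order dependent in this sketch; a parallel version would need a different way to hand out the successive numbers.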

Suggestions for optimizing Kalentev et al. approach
- Each iteration has a Link pass and a LabelEqualize pass; for the last iteration the LabelEqualize pass is redundant
- Many of the kernel execute, read buffer and write buffer commands can be started asynchronously and synchronized using events
- The write to the IsNotDone buffer can be done in parallel with the LabelEqualize pass
- Except for the second pass of the post-processing step, all kernels can be vectorized
  - InitLabel kernel: straightforward
  - Other kernels: a quick test whether all pixels in the vector are background pixels
    - Beneficial for processing background pixels
    - Little extra overhead for object pixels
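The quick background test can be illustrated on the CPU with plain C++. The sketch below handles pixels in groups of four, mimicking a four-wide vector type, and skips a group when all four labels are zero. The function name `ForEachObjectPixelVec4` and the callback interface are assumptions for this example; the actual OpenCL kernels are not shown on the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Calls perPixel(index) for every object pixel (label != 0) and returns the
// number of four-pixel groups that were skipped as pure background.
template <typename PerPixel>
std::size_t ForEachObjectPixelVec4(const std::vector<int>& im, PerPixel perPixel) {
    std::size_t skipped = 0;
    std::size_t i = 0;
    for (; i + 4 <= im.size(); i += 4) {
        // Quick test: the bitwise OR is zero only if all four labels are zero.
        if ((im[i] | im[i + 1] | im[i + 2] | im[i + 3]) == 0) {
            ++skipped;
            continue;
        }
        for (std::size_t k = 0; k < 4; ++k)
            if (im[i + k] != 0) perPixel(i + k);
    }
    for (; i < im.size(); ++i)                // remaining pixels
        if (im[i] != 0) perPixel(i);
    return skipped;
}
```

One OR and one comparison per group is the only extra work for groups that contain object pixels, while background groups avoid the per-pixel work entirely.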

Core i7-2600 with GTX 560 Ti (OEM)

Summary and conclusions
- Connected component labelling
  - Few-core and many-core platforms need different approaches
  - Few-core approach: reasonable speedups on CPUs
  - Many-core approach: reasonable speedups on GPUs
- Suggestions for extending Kalentev et al. approach
- Suggestions for optimizing Kalentev et al. approach

Future work on Connected Component Labelling
- Parallelize the few-core label repair step
- Implement and benchmark an OpenCL implementation of the few-core approach
- Research the break-even point between the few-core and many-core approaches
- Implement and benchmark the approach suggested by Stava and Benes (2011), which only supports image heights and widths that are powers of 2

References
- Van de Loosdrecht, J., 2013. Accelerating sequential computer vision algorithms using commodity parallel hardware. Research master project at Limerick Institute of Technology. Expected to be published in autumn 2013 at www.vdlmv.nl/thesis.
- Haralick, R.M. and Shapiro, L.G., 1992. Computer and Robot Vision. Volume I and Volume II. Reading: Addison-Wesley Publishing Company.
- He, L., Chao, Y. and Suzuki, K., 2008. A Run-Based Two-Scan Labeling Algorithm. IEEE Transactions on Image Processing, 17(5), pp.749-56.
- Rosenfeld, A. and Pfaltz, J.L., 1966. Sequential Operations in Digital Picture Processing. Journal of the ACM, 13(4), pp.471-94.
- Hawick, K.A., Leist, A. and Playne, D.P., 2010. Parallel graph component labeling with GPUs and CUDA. Parallel Computing, 36(12), pp.655-78.
- Kalentev, O., Rai, A., Kemnitz, S. and Schneider, S., 2011. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4), pp.615-20.
- Stava, O. and Benes, B., 2011. Connected Component Labeling in CUDA. In: Hwu, W.W. ed. 2011. GPU Computing Gems, Emerald Edition. Burlington: Morgan Kaufmann. Ch.35.

Future: intelligent camera with heterogeneous computing
XIMEA Currera G
- AMD T-56N
  - Dual-core x64, 1.6 GHz
  - 80-core GPU, 500 MHz
- 2 GB DDR3
- 32 GB SSD
- 4 USB-3, USB-2
- HDMI
- PoE Gigabit Ethernet
- Micro PLC
- 8 digital I/Os
- Many image sensors <= 5M pixel

Prototype XIMEA Currera G

Questions?
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision: j.van.de.loosdrecht@nhl.nl, www.nhl.nl/computervision
Van de Loosdrecht Machine Vision BV: jaap@vdlmv.nl, www.vdlmv.nl