Connected Component Labelling, an embarrassingly sequential algorithm

Platform Parallel Netherlands GPGPU-day, 20 June 2013
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
Van de Loosdrecht Machine Vision BV
Limerick Institute of Technology

Overview
- Introduction and background
- Connected Component Labelling
  - Sequential
  - Few-core
  - Many-core
- Kalentev et al. approach
- Suggestions for extending
- Suggestions for optimizing
- Summary and conclusions
- Future work on CCL
- References
- Future of intelligent cameras
- Questions

Introduction
- Manager, NHL Centre of Expertise in Computer Vision
  - University of Applied Sciences, Leeuwarden
  - 4 FTE
  - Since 1996: 80 industrial projects
- Managing director, Van de Loosdrecht Machine Vision BV
  - VisionLab: development environment for computer vision with pattern matching, neural networks and genetic algorithms
  - Portable library (ANSI C++)
  - Windows, Linux and Android
  - x86, x64, ARM and PowerPC
- Student, Limerick Institute of Technology (Ireland)
  - Research master project, September 2011 to September 2013

Research master project
- Accelerating sequential computer vision algorithms using commodity parallel hardware
- Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures
- Distinctive: investigate how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms
  - Generic library, 100,000 lines of ANSI C++
  - Portability and vendor independence
  - OpenMP for CPU, OpenCL for GPU
  - Variance in execution times
  - Run-time prediction of whether parallelization is beneficial

Computer vision algorithms and parallelization
- Classification of image operators
  - Low-level image operators
    - Point operators
    - Local neighbour operators
    - Global operators
    - Connectivity-based operators
  - High-level image operators
    - Often built on the low-level operators
  - Specials
    - Pattern matcher, neural network, genetic algorithm, etc.
- Idea: start with the low-level image operators, then design and implement skeletons for parallelizing a representative of each class

Demonstration: Label Blobs
- Open image cells.jl, show image contents
- ThresholdIsoData, show image contents
  - Explain background/objects, white/black and 0/1
- LabelBlobs, show image contents
  - Explain 3 used colours
- BlobAnalyse
  - Explain table
  - Explain successive label numbering

Screen shot demo

Label blobs iterative algorithm
- Classical sequential approach, Haralick and Shapiro (1992)
- Binary image: give each object pixel a unique positive value
[Figure: example image with the object pixels numbered 1 to 13]

Label blobs iterative algorithm
- Repeat until no changes:
  - Down pass (top left to bottom right): give each pixel the minimum value of its 8 neighbours
  - Up pass (bottom right to top left): give each pixel the minimum value of its 8 neighbours
[Figure: example labelled image]
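The iterative scheme on the two slides above can be sketched in a few lines of Python. This is only an illustration written for this transcript, not VisionLab code: the function name `label_blobs_iterative`, the list-of-rows image layout and the choice of `y * w + x + 1` as the unique start value are all assumptions.

```python
def label_blobs_iterative(image):
    """Label 8-connected blobs; image is a list of rows, 0 = background."""
    h, w = len(image), len(image[0])
    # Give each object pixel a unique positive value (assumed: its index + 1).
    labels = [[y * w + x + 1 if image[y][x] else 0 for x in range(w)]
              for y in range(h)]

    def min_of_neighbourhood(y, x):
        # Minimum label over the pixel itself and its 8 object neighbours.
        best = labels[y][x]
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx]:
                    best = min(best, labels[ny][nx])
        return best

    def sweep(ys, xs):
        # One pass over the image in the given scan order.
        changed = False
        for y in ys:
            for x in xs:
                if labels[y][x]:
                    m = min_of_neighbourhood(y, x)
                    if m != labels[y][x]:
                        labels[y][x] = m
                        changed = True
        return changed

    changed = True
    while changed:  # repeat until no changes
        down = sweep(range(h), range(w))                          # down pass
        up = sweep(range(h - 1, -1, -1), range(w - 1, -1, -1))    # up pass
        changed = down or up
    return labels
```

Each blob ends up carrying the smallest start value found inside it, so the resulting labels are unique per blob but not successive.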

Sequential version
- He, Chao, and Suzuki (2008): the two-pass approach gives the best performance
  - Pass1: equivalent labels are stored in an equivalence table (neighbourhood search)
  - Resolving equivalences with a search algorithm
  - Pass2: assign a label to each pixel (lookup table)
- Analysis of execution time (VisionLab) in µs on a Core i7-2640M for the typical image cells.jl

Image size   Pass1 (µs)   Resolving equivalences (µs)   Pass2 (µs)   Total (µs)   Pass1/Total
256x256      134          1                             43           178          0.75
512x512      405          2                             159          566          0.71
1024x1024    1358         3                             629          1990         0.68
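To make the three stages in the table concrete, here is a minimal two-pass sketch in Python. It is a classical pixel-based variant with a union-find equivalence table, not the run-based algorithm of He, Chao and Suzuki and not the VisionLab implementation; the name `label_blobs_two_pass` and the data layout are assumptions.

```python
def label_blobs_two_pass(image):
    """Two-pass 8-connected labelling; image is a list of rows, 0 = background."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # equivalence table: parent[k] for provisional label k

    def find(k):
        # Follow the table to the representative label (with path halving).
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Pass1: neighbourhood search over the already visited neighbours.
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            roots = []
            for dy, dx in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx]:
                    roots.append(find(labels[ny][nx]))
            if not roots:
                parent.append(len(parent))      # new provisional label
                labels[y][x] = len(parent) - 1
            else:
                smallest = min(roots)
                labels[y][x] = smallest
                for r in roots:                 # record the equivalences
                    parent[r] = smallest

    # Resolving equivalences: map every provisional label to a successive one.
    lookup = [0] * len(parent)
    next_label = 0
    for k in range(1, len(parent)):
        root = find(k)
        if lookup[root] == 0:
            next_label += 1
            lookup[root] = next_label
        lookup[k] = lookup[root]

    # Pass2: assign the final label to each pixel via the lookup table.
    return [[lookup[v] for v in row] for row in labels]
```

Pass1 does the neighbourhood search for every object pixel, while Pass2 is a plain table lookup, which fits the measurement above that Pass1 takes roughly 70% of the total time.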

Parallel version
- Rosenfeld and Pfaltz (1966): CCL cannot be implemented with parallel local operations
- Hawick, Leist and Playne (2010): Label Equivalence gives the best performance
- Kalentev, Rai, Kemnitz, and Schneider (2011): alternative Label Equivalence approach
  - Store the equivalence table in the image
  - No atomic operations
  - Claimed to be efficient in terms of the number of iterations needed: on average 5 iterations on their test set
- Algorithm
  - Initial pass
  - Multiple iterations
    - Link pass (neighbourhood search)
    - Label equalize pass (neighbourhood search)
  - Final pass

Kalentev et al. approach
- It is expected that
  - Both passes of an iteration have a complexity similar to Pass1
  - The initial and final passes have a complexity similar to Pass2
- Analysis
  - On average the Kalentev et al. approach needs 5 iterations
  - One simple initial pass
  - 10 neighbourhood search passes
  - One simple final pass
  - Extra post-processing step with two simple passes
- Estimation
  - Sequential version: 1 unit of execution time
  - Kalentev et al.: 8.2 units of (sequential) execution time
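The 8.2 figure can be reproduced with a short calculation. The weights below are an assumption read off the Pass1/Total column of the sequential measurements (Pass1 about 0.7 of a sequential run, Pass2 about 0.3, resolving equivalences neglected); the slide itself does not spell out how the estimate was derived.

```python
# Assumed shares of one sequential run (1 unit of execution time).
pass1_share = 0.7   # neighbourhood search pass, from Pass1/Total of about 0.7
pass2_share = 0.3   # simple pass, the remainder

iterations = 5                          # average reported by Kalentev et al.
neighbourhood_passes = 2 * iterations   # Link + LabelEqualize per iteration
simple_passes = 1 + 1 + 2               # initial, final, two post-processing passes

estimate = (neighbourhood_passes * pass1_share
            + simple_passes * pass2_share)
print(round(estimate, 1))  # prints 8.2
```

Under these assumptions the parallel algorithm does about eight times the work of the sequential one, so it only pays off when enough cores are available to absorb that overhead.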

Kalentev et al. approach
- Different approaches are needed for the few-core CPU and the many-core GPU
- The GPU approach will suffer from branch divergence

Few-core approach on Core i7-2600 CPU @ 3.4 GHz (quad-core)

Host code framework suggested by Kalentev et al.

    WriteBuffer(image);
    int notDone = 1;
    RunKernel(InitLabels, image);        // initial pass
    WriteBuffer(notDone);
    while (notDone == 1) {
        notDone = 0;
        WriteBuffer(notDone);
        RunKernel(Link, image, notDone); // sets notDone when a label changes
        RunKernel(LabelEqualize, image);
        ReadBuffer(notDone);
    } // while notDone
    ReadBuffer(image);
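The host loop above can be simulated sequentially to see how the equivalence table lives inside the label image. The sketch below is a plain Python approximation: the kernel bodies are assumptions reconstructed from the slide descriptions (equivalences stored in the image, Link followed by LabelEqualize), not the published CUDA kernels, and the functions `init_labels`, `link`, `label_equalize` and `run_host_loop` are hypothetical names.

```python
def init_labels(image):
    """InitLabels: object pixel i gets label i + 1, background gets 0."""
    return [i + 1 if v else 0 for i, v in enumerate(image)]


def link(labels, w, h):
    """Link: store the smallest neighbouring label at the pixel's root entry."""
    not_done = False
    for i, label in enumerate(labels):
        if label == 0:
            continue
        y, x = divmod(i, w)
        smallest = label
        for dy in (-1, 0, 1):           # 8-connectivity neighbourhood search
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny * w + nx]:
                    smallest = min(smallest, labels[ny * w + nx])
        if smallest < label:
            root = label - 1            # the pixel that owns this label
            labels[root] = min(labels[root], smallest)
            not_done = True
    return not_done


def label_equalize(labels):
    """LabelEqualize: follow the references until a root label is reached."""
    for i, label in enumerate(labels):
        if label == 0:
            continue
        while labels[label - 1] != label:
            label = labels[label - 1]
        labels[i] = label


def run_host_loop(image, w, h):
    """Mirror of the host code: iterate until Link reports no change."""
    labels = init_labels(image)
    not_done = True
    iterations = 0
    while not_done:
        not_done = link(labels, w, h)
        label_equalize(labels)
        iterations += 1
    return labels, iterations
```

Because labels only ever decrease and every label always points at a pixel of the same blob, the loop terminates with one label per blob. On a GPU each loop body would run as one work-item per pixel; the sequential simulation only shows the data flow, not the timing.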

Suggestions for extending the Kalentev et al. approach
- The InitLabel kernel is extended to set the border pixels of the image to the background value
- Link kernels are implemented for both four- and eight-connectivity
- A post-processing step with two passes is added to make the labelling of the blobs successive
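The slide does not describe the two post-processing passes in detail. One plausible reading, sketched here purely as an assumption, is that the first pass numbers the root pixels (those whose label still equals their own index + 1) and the second pass rewrites every pixel through that table. The function name `make_labels_successive` is hypothetical.

```python
def make_labels_successive(labels):
    """Renumber converged labels to 1, 2, 3, ... in scan order of the roots.

    Assumes labels is a flat list in which every object pixel holds
    (index of its blob's root pixel) + 1 and background pixels hold 0.
    """
    # Pass 1: every root pixel claims the next successive label.
    lookup = {}
    for i, label in enumerate(labels):
        if label == i + 1:
            lookup[label] = len(lookup) + 1
    # Pass 2: rewrite each pixel through the lookup table.
    return [lookup[label] if label else 0 for label in labels]
```

For example, a label image containing the root labels 1, 8 and 13 is rewritten to use 1, 2 and 3, which is the successive numbering that a blob analysis table expects.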

Suggestions for optimizing the Kalentev et al. approach
- Each iteration has a Link pass and a LabelEqualize pass; for the last iteration the LabelEqualize pass is redundant
- Many of the kernel execute, read buffer and write buffer commands can be started asynchronously and synchronized using events
- The write to the IsNotDone buffer can be done in parallel with the LabelEqualize pass
- Except for the second pass of the post-processing step, all kernels can be vectorized
  - InitLabel kernel: straightforward
  - Other kernels: a quick test whether all pixels in the vector are background pixels
    - Beneficial for processing background pixels
    - Little extra overhead for object pixels

Core i7-2600 with GTX 560 Ti (OEM)

Summary and conclusions
- Connected component labelling
  - Different approaches are needed for few-core and many-core platforms
  - Few-core approach: reasonable speedups on CPUs
  - Many-core approach: reasonable speedups on GPUs
- Suggestions for extending the Kalentev et al. approach
- Suggestions for optimizing the Kalentev et al. approach

Future work on Connected Component Labelling
- Parallelize the few-core label repair step
- Implement and benchmark an OpenCL implementation of the few-core approach
- Research into finding the break-even point between the few-core and the many-core approach
- Implement and benchmark the approach suggested by Stava and Benes (2011), only H/W ^2

References
- Van de Loosdrecht, J., 2013. Accelerating sequential computer vision algorithms using commodity parallel hardware. Research master project at Limerick Institute of Technology. Expected to be published in autumn 2013 at www.vdlmv.nl/thesis.
- Haralick, R.M. and Shapiro, L.G., 1992. Computer and Robot Vision. Volume I and Volume II. Reading: Addison-Wesley Publishing Company.
- He, L., Chao, Y. and Suzuki, K., 2008. A Run-Based Two-Scan Labeling Algorithm. IEEE Transactions on Image Processing, 17(5), pp.749-56.
- Rosenfeld, A. and Pfaltz, J.L., 1966. Sequential Operations in Digital Picture Processing. Journal of the ACM, 13(4), pp.471-94.
- Hawick, K.A., Leist, A. and Playne, D.P., 2010. Parallel graph component labeling with GPUs and CUDA. Parallel Computing, 36(12), pp.655-78.
- Kalentev, O., Rai, A., Kemnitz, S. and Schneider, S., 2011. Connected component labeling on a 2D grid using CUDA. Journal of Parallel and Distributed Computing, 71(4), pp.615-20.
- Stava, O. and Benes, B., 2011. Connected Component Labeling in CUDA. In: Hwu, W.W. ed. 2011. GPU Computing Gems, Emerald Edition. Burlington: Morgan Kaufmann. Ch.35.

Future: intelligent camera with heterogeneous computing
- XIMEA Currera G
  - AMD T-56N
    - Dual-core x64, 1.6 GHz
    - 80-core GPU, 500 MHz
  - 2 GB DDR3
  - 32 GB SSD
  - 4 USB-3, USB-2
  - HDMI
  - PoE Gigabit Ethernet
  - Micro PLC
  - 8 digital I/Os
  - Many image sensors <= 5M pixel

Prototype XIMEA Currera G

Questions?
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision: j.van.de.loosdrecht@nhl.nl, www.nhl.nl/computervision
Van de Loosdrecht Machine Vision BV: jaap@vdlmv.nl, www.vdlmv.nl