Accelerating sequential computer vision algorithms using commodity parallel hardware

Platform Parallel Netherlands GPGPU-day, 28 June 2012
Jaap van de Loosdrecht
  NHL Centre of Expertise in Computer Vision
  Van de Loosdrecht Machine Vision BV
  Limerick Institute of Technology

Overview
  Introduction
  Computer vision algorithms and parallelization
  Benchmarking
  Run-time prediction of whether parallelization is beneficial
  Progress
  Future work
  Summary and preliminary conclusions
  Questions

Introduction
  Manager, NHL Centre of Expertise in Computer Vision
    University of professional education, Leeuwarden
    4.5 FTE
    Since 1996: 160 industrial projects
  Managing director, Van de Loosdrecht Machine Vision BV
    VisionLab: development environment for computer vision with pattern matching, neural networks and genetic algorithms
    Portable library, > 100,000 lines of ANSI C++
    Windows, Linux and Android
    x86, x64, ARM and PowerPC
  Student, Limerick Institute of Technology (Ireland)
    Research master project, 1 September 2011 to 1 July 2013

VisionLab: development environment for Computer Vision

Motivation
  Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures

Aims and objectives
  Compare existing programming languages and environments for parallel computing
  Choose one standard for
    Multi-core CPU programming
    GPU programming
  Re-implement a number of standard and well-known algorithms
  Compare performance to the existing sequential implementation of VisionLab
  Evaluate test results, benefits and costs of parallel approaches to the implementation of computer vision algorithms

Requirements
  Primary target system
    Conventional PC or intelligent camera
    Windows or Linux, on x86 or x64
  Important option: easy porting (Android, ARM, PowerPC)
  Existing scripts and applications should not have to be modified in order to benefit from parallelization
  Run-time prediction of whether parallelization is beneficial
  Chosen standards must be
    An industry standard
    Vendor independent
    For CPU: ANSI C++ based
  Efficient parallelization for the majority of the code

Related research
  Other research projects
    Compare best sequential with best parallel algorithm
    Often specific domain and hardware
  Framework for auto-parallelization
    In research, not yet generically applicable
  Special points of interest in my project
    Generic library
    Portability and vendor independence
    Run-time prediction of whether parallelization is beneficial
    Variance in execution times
    100,000 lines of ANSI C++

Choice of standard for multi-core CPU programming (1 October 2011)

Standard                  | Industry standard | Maturity   | Acceptance by market | Future developments | Vendor independence | Portability | Scalable to ccNUMA (optional) | Vector capabilities (optional) | Effort for conversion
--------------------------+-------------------+------------+----------------------+---------------------+---------------------+-------------+-------------------------------+--------------------------------+----------------------
Array Building Blocks     | No                | Beta       | New, not ranked      | Good                | Poor                | Poor        | No                            | Yes                            | Huge
C++11 Threads             | Yes               | Partly new | New, not ranked      | Good                | Good                | Good        | No                            | No                             | Huge
Cilk Plus                 | No                | Good       | Rank 6               | Good                | Reasonable, no MSVC | Reasonable  | No                            | Yes                            | Low
MCAPI                     | No                | Poor       | Not ranked           | Poor                | Poor                | Poor        | Yes                           | No                             | Huge
MPI                       | Yes               | Excellent  | Rank 7               | Good                | Good                | Good        | Yes                           | No                             | Huge
OpenMP                    | Yes               | Excellent  | Rank 1               | Good                | Good                | Good        | Yes, only GNU                 | No                             | Low
Parallel Patterns Library | No                | Reasonable | New, not ranked      | Good                | Poor, only MSVC     | Poor        | No                            | No                             | Huge
Posix Threads             | Yes               | Excellent  | Not ranked           | Poor                | Good                | Good        | No                            | No                             | Huge
Thread Building Blocks    | No                | Good       | Rank 3               | Good                | Reasonable          | Reasonable  | No                            | No                             | Huge

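Choice of standard for GPU programming (1 October 2011)

Standard        | Industry standard | Maturity   | Acceptance by market | Future developments    | Expected familiarization time | Hardware vendor independence | Software vendor independence | Portability | Heterogeneous
----------------+-------------------+------------+----------------------+------------------------+-------------------------------+------------------------------+------------------------------+-------------+--------------
Accelerator     | No                | Good       | Not ranked           | Bad                    | Medium                        | Bad                          | Bad                          | Poor        | No
CUDA            | No                | Excellent  | Rank 5               | Good                   | High                          | Bad                          | Bad                          | Bad         | No
Direct Compute  | No                | Poor       | Not ranked           | Unknown                | High                          | Bad                          | Bad                          | Bad         | No
HMPP            | No                | Poor       | Not ranked           | Plan for open standard | Medium                        | Reasonable                   | Bad                          | Good        | Yes
OpenCL          | Yes               | Good       | Rank 2               | Good                   | High                          | Excellent                    | Good                         | Good        | Yes
PGI Accelerator | No                | Reasonable | Not ranked           | Unknown                | Medium                        | Bad                          | Bad                          | Bad         | No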

Computer vision algorithms and parallelization
  Classification of image operators
    Low level image operators
      Point operators
      Local neighbour operators
      Global operators
      Connectivity based operators
    High level image operators
      Often built on the low level operators
    Specials
      Pattern matcher, neural network, genetic algorithm, etc.
  Idea: design and implement skeletons for parallelizing representatives of each class
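To make the skeleton idea concrete for the simplest class, the point operators, here is a minimal sketch in C++ with OpenMP. It is not VisionLab code: the names `pointOperator` and `threshold`, and the use of `std::vector` as the image container, are assumptions made for illustration only. The sketch relies on the defining property of a point operator, namely that each output pixel depends only on the input pixel at the same position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical skeleton for the "point operator" class: every iteration is
// independent, so the loop can be divided over the available threads.
template <typename Pixel, typename Op>
void pointOperator(const std::vector<Pixel>& src, std::vector<Pixel>& dest, Op op)
{
    assert(src.size() == dest.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(src.size());
    // Without OpenMP support the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        dest[i] = op(src[i]);
    }
}

// Example representative built on the skeleton: pixels inside [low, high]
// become 1, all other pixels become 0.
inline void threshold(const std::vector<std::uint8_t>& src,
                      std::vector<std::uint8_t>& dest,
                      std::uint8_t low, std::uint8_t high)
{
    pointOperator(src, dest, [low, high](std::uint8_t p) {
        return static_cast<std::uint8_t>((p >= low && p <= high) ? 1 : 0);
    });
}
```

With a skeleton of this kind, a new point operator only supplies the per-pixel function; the parallel loop is written once.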

Benchmarking
  Benchmark protocol
  Data analysis with R
    Speedup graphs
    Speedup tables
    Median of execution time tables
    Best work-group size tables
    Violin plots
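The two summary figures named above, median execution time and speedup, can be sketched as follows. The slides do not define how the speedup is computed; this sketch assumes it is the ratio of the median sequential time to the median parallel time. The function names `median` and `speedup` are hypothetical, and the analysis in the talk was done in R, not C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of repeated timing measurements (any time unit).
// The vector is taken by value because sorting reorders it.
double median(std::vector<double> times)
{
    assert(!times.empty());
    std::sort(times.begin(), times.end());
    const std::size_t mid = times.size() / 2;
    return (times.size() % 2 == 1) ? times[mid]
                                   : (times[mid - 1] + times[mid]) / 2.0;
}

// Assumed definition: median sequential time divided by median parallel time.
double speedup(const std::vector<double>& sequential,
               const std::vector<double>& parallel)
{
    return median(sequential) / median(parallel);
}
```

The median is less sensitive than the mean to the occasional slow run, which matters when execution times vary between repetitions.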

OpenMP Progress
  Script commands added
  Framework for benchmarking
  More than 160 operators parallelized
  Run-time prediction of whether parallelization is beneficial
  Calibration procedure

Example speedup graph (i7 2600)

Run-time prediction of whether parallelization is beneficial
  The speedup depends on
    Size of image
    Pixel type
    Content of the image
    Parameters like size of neighbourhood
    Etc.
  Calibration procedure for OpenMP
    Simple and fast procedure for global optimization
    Complex and slow procedure for finer optimization of each (sub)operator
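One way such a prediction could be wired into an operator is OpenMP's standard `if` clause, which runs the loop on a single thread when its condition is false. The sketch below is an assumption, not the VisionLab implementation: the `Calibration` struct, its field `minPixelsForParallel` and the function names are hypothetical, and the decision uses image size only, whereas the slide lists further factors (pixel type, image content, operator parameters).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical calibration result: the smallest pixel count for which the
// parallel version was measured to be faster on this machine.
struct Calibration {
    std::ptrdiff_t minPixelsForParallel;
};

// Decision rule (assumption): parallelize only when the image is large enough.
inline bool parallelIsBeneficial(std::ptrdiff_t nrPixels, const Calibration& cal)
{
    return nrPixels >= cal.minPixelsForParallel;
}

// Point operator that adds a constant to every pixel. The OpenMP `if` clause
// keeps the loop on one thread when the prediction says parallel is slower.
void addConstant(std::vector<int>& image, int value, const Calibration& cal)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(image.size());
    const bool useParallel = parallelIsBeneficial(n, cal);
    #pragma omp parallel for if(useParallel)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        image[i] += value;
    }
}
```

Because the decision is taken inside the operator, calling scripts and applications need no changes, which matches the requirement stated earlier in the talk.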

Variation in execution times, violin plot

OpenCL Progress
  Toolbox for using OpenCL kernels from scripts and C++
  Script commands added for host API
  Framework for benchmarking
  Implementation of first kernels

OpenCL development in VisionLab

Optimal number of local histograms per work-group (GTX 560 Ti)

Speedup graph of Histogram, 16 local histograms per work-group (GTX 560 Ti)

Violin plot for Histogram, 16 local histograms/wg (GTX 560 Ti)
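The slides do not show the histogram kernel itself. As a rough CPU analogue of the "local histograms" idea, the C++/OpenMP sketch below lets several private partial histograms each count a slice of the image and then sums them. The function name `histogram`, the parameter `nrLocal` and the slicing scheme are assumptions for illustration; on a GPU the partial histograms would typically live in work-group local memory, and the number of them is the tuning parameter the figures above refer to.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<std::uint32_t, 256>;

// Histogram of an 8-bit image computed with `nrLocal` partial histograms.
// Each loop iteration owns one partial histogram, so no atomic updates
// are needed while counting.
Histogram histogram(const std::vector<std::uint8_t>& image, int nrLocal)
{
    assert(nrLocal > 0);
    std::vector<Histogram> local(static_cast<std::size_t>(nrLocal), Histogram{});
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(image.size());

    #pragma omp parallel for
    for (int h = 0; h < nrLocal; ++h) {
        // Each partial histogram handles a contiguous slice of the pixels.
        const std::ptrdiff_t begin = n * h / nrLocal;
        const std::ptrdiff_t end = n * (h + 1) / nrLocal;
        for (std::ptrdiff_t i = begin; i < end; ++i) {
            ++local[static_cast<std::size_t>(h)][image[static_cast<std::size_t>(i)]];
        }
    }

    // Reduction step: sum the partial histograms into the final result.
    Histogram result{};
    for (const Histogram& part : local) {
        for (std::size_t bin = 0; bin < result.size(); ++bin) {
            result[bin] += part[bin];
        }
    }
    return result;
}
```

More partial histograms mean less contention while counting but a more expensive reduction, which is why the best count has to be found by measurement.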

X-files: OpenCL versus OpenMP

Future work
  OpenCL
    Near future
      Memory transfers
        Pinned
        Zero copy
      APU
      Implementing more vision operators
    More distant future
      Intelligent buffer management?
      Automatic tuning of parameters?
      Run-time prediction of whether parallelization is beneficial?
      Heterogeneous computing?
  OpenMP 4.0, OpenACC, C++ AMP?

The future: XIMEA Currera G (APU based)

Summary and preliminary conclusions
  Choice made for standards: OpenMP and OpenCL
  Integration of OpenMP and OpenCL in VisionLab
  Benchmark environment
  OpenMP
    Embarrassingly parallel algorithms are easy to convert
    More than 160 operators parallelized
    Run-time prediction implemented
  OpenCL
    Scripting host-side code accelerates development
    Portable functionality
    Portable performance is not easy
  Still have to learn a lot about GPUs
  Searching for sparring partners

Questions?
  Jaap van de Loosdrecht
  NHL Centre of Expertise in Computer Vision
    j.van.de.loosdrecht@tech.nhl.nl
    www.nhl.nl/computervision
  Van de Loosdrecht Machine Vision BV
    jaap@vdlmv.nl
    www.vdlmv.nl