Accelerating sequential computer vision algorithms using commodity parallel hardware

Similar documents
Connected Component Labelling, an embarrassingly sequential algorithm

Accelerating Sequential Computer Vision Algorithms Using Commodity Parallel Hardware

Multi Core Processing in VisionLab

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

Parallel Systems. Project topics

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Parallel Programming Libraries and implementations

Computer vision. 3D Stereo camera Bumblebee. 10 April 2018

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Computer Vision. License Plate Recognition Klaas Dijkstra - Jaap van de Loosdrecht

Multi/Many Core Programming Strategies

GPUs and Emerging Architectures

Accelerating Financial Applications on the GPU

Cuda C Programming Guide Appendix C Table C-

Portable GPU-Based Artificial Neural Networks For Data-Driven Modeling

An innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ.

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

Auto-tuning a High-Level Language Targeted to GPU Codes. By Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, John Cavazos

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

Parallel Programming. Libraries and Implementations

Trends in HPC (hardware complexity and software challenges)

Computer Vision & Deep Learning

Energy Efficiency Tuning: READEX. Madhura Kumaraswamy Technische Universität München

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

GPU Accelerated Machine Learning for Bond Price Prediction

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

Addressing Heterogeneity in Manycore Applications

Experiences with GPGPUs at HLRS

FPGA-based Supercomputing: New Opportunities and Challenges

CPU-GPU Heterogeneous Computing

Predicting GPU Performance from CPU Runs Using Machine Learning

Experiences with Achieving Portability across Heterogeneous Architectures

How to Write Code that Will Survive the Many-Core Revolution

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

CUDA. Matthew Joyner, Jeremy Williams

Architecture, Programming and Performance of MIC Phi Coprocessor

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

An Alternative to GPU Acceleration For Mobile Platforms

Overview of Project's Achievements

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

Fast Interactive Sand Simulation for Gesture Tracking systems Shrenik Lad

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

Facial Recognition Using Neural Networks over GPGPU

GPGPU. Peter Laurens 1st-year PhD Student, NSC

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

The Benefits of GPU Compute on ARM Mali GPUs

AMBER 11 Performance Benchmark and Profiling. July 2011

An Introduction to OpenACC

Modern Processor Architectures. L25: Modern Compiler Design

Code Auto-Tuning with the Periscope Tuning Framework

Optimising the Mantevo benchmark suite for multi- and many-core architectures

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

Simulation of Conditional Value-at-Risk via Parallelism on Graphic Processing Units

From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS

Accelerating image registration on GPUs

Porting the parallel Nek5000 application to GPU accelerators with OpenMP4.5. Alistair Hart (Cray UK Ltd.)

MatCL - OpenCL MATLAB Interface

An Introduction to the SPEC High Performance Group and their Benchmark Suites

Overview of research activities Toward portability of performance

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

EXTENDING THE GFN PRIME SEARCH BEYOND 1M DIGITS USING GPUS

Parallelising Pipelined Wavefront Computations on the GPU

A Detailed GPU Cache Model Based on Reuse Distance Theory

SHOC: The Scalable HeterOgeneous Computing Benchmark Suite

CME 213 S PRING Eric Darve

Computer Vision. Fourier Transform. 20 January Copyright by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All rights reserved

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Technology for a better society. hetcomp.com

Evaluation Of The Performance Of GPU Global Memory Coalescing

Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs. Lihua Zhang, Ph.D. MulticoreWare Inc.

Parallella: A $99 Open Hardware Parallel Computing Platform

GPU-Powered WRF in the Cloud for Research and Operational Applications

Parallel Programming. Libraries and implementations

Score-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Heterogeneous Computing and OpenCL

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

Intel Parallel Studio XE 2015

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Profiling High Level Heterogeneous Programs

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

HPC with GPU and its applications from Inspur. Haibo Xie, Ph.D

OpenACC (Open Accelerators - Introduced in 2012)

Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

GPGPU/CUDA/C Workshop 2012

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

Programming Support for Heterogeneous Parallel Systems

Scalability of Processing on GPUs

Transcription:

Accelerating sequential computer vision algorithms using commodity parallel hardware Platform Parallel Netherlands GPGPU-day, 28 June 2012 Jaap van de Loosdrecht NHL Centre of Expertise in Computer Vision Van de Loosdrecht Machine Vision BV Limerick Institute of Technology

Overview Introduction Computer vision algorithms and parallelization Benchmarking Run time prediction if parallelization is beneficial Progress Future work Summary and preliminary conclusions Questions

Introduction Manager NHL Centre of Expertise in Computer Vision University of professional education, Leeuwarden 4,5 FTE Since 1996: 160 industrial projects Managing director Van de Loosdrecht Machine Vision BV VisionLab: development environment for Computer Vision with Pattern matching, Neural networks and Genetic algorithms Portable library, > 100.000 lines of ANSI C++ Windows, Linux and Android x86, x64, ARM and PowerPC Student Limerick Institute of Technology (Ireland) Research master project,1 September 2011 1 July 2013

VisionLab: development environment for Computer Vision

Introduction Manager NHL Centre of Expertise in Computer Vision University of professional education, Leeuwarden 4,5 FTE Since 1996: 160 industrial projects Managing director Van de Loosdrecht Machine Vision BV VisionLab: development environment for Computer Vision with Pattern matching, Neural networks and Genetic algorithms Portable library, > 100.000 lines of ANSI C++ Windows, Linux and Android x86, x64, ARM and PowerPC Student Limerick Institute of Technology (Ireland) Research master project,1 September 2011 1 July 2013

Motivation Apply parallel programming techniques to meet the challenges posed in computer vision by the limits of sequential architectures

Aims and objectives Compare existing programming languages and environments for parallel computing Choose one standard for Multi-core CPU programming GPU programming Re-implement a number of standard and well-known algorithms Compare performance to existing sequential implementation of VisionLab Evaluate test results, benefits and costs of parallel approaches to implementation of computer vision algorithms

Requirements Primary target system Conventional PC or intelligent camera Windows or Linux, on a x86 or x64 Important option: easy porting (Android, ARM, PowerPC) Existing scripts and applications should not have to be modified in order to benefit from parallelization Run time prediction if parallelization is beneficial Chosen standards must be An industry standard Vendor independent For CPU ANSI C++ based Efficient parallelization for majority of code

Related research Other research projects Compare best sequential with best parallel algorithm Often specific domain and hardware Framework for auto parallelisation In research, not yet generic applicable Special points of interest in my project Generic library Portability and vendor independency Run time prediction if parallelization is beneficial Variance in execution times 100.000 lines of ANSI C++

Choice of standard for multi-core CPU programming (1 oct 2011) Requirement ---------------- Standard Industry standard Maturity Acceptance by market Future developments Vendor independence Portability Scalable to ccnuma (optional) Vector capabilities (optional) Effort for conversion Array Building Blocks No Beta New, not ranked Good Poor Poor No Yes Huge C++11 Threads Yes Partly new New, not ranked Good Good Good No No Huge Cilk Plus No Good Rank 6 Good Reasonable No MSVC Reasonable No Yes Low MCAPI No Poor Not ranked Poor Poor Poor Yes No Huge MPI Yes Excellent Rank 7 Good Good Good Yes No Huge OpenMP Yes Excellent Rank 1 Good Good Good Yes, No Low only GNU Parallel Patterns Library No Reasonable New, not ranked Good Poor Only MSVC Poor No No Huge Posix Threads Yes Excellent Not ranked Poor Good Good No No Huge Thread Building Blocks No Good Rank 3 Good Reasonable Reasonable No No Huge

Choice of standard for GPU programming (1 oct 2011) Requirement --------------- Standard Industry standard Maturity Acceptance by market Future developments Expected familiarization time Hardware vendor independence Software vendor independence Portability Heterogeneous Accelerator No Good Not ranked Bad Medium Bad Bad Poor No CUDA No Excellent Rank 5 Good High Bad Bad Bad No Direct Compute No Poor Not ranked Unknown High Bad Bad Bad No HMPP No Poor Not ranked Plan for open standard Medium Reasonable Bad Good Yes OpenCL Yes Good Rank 2 Good High Excellent Good Good Yes PGI Accelerator No Reasonable Not ranked Unknown Medium Bad Bad Bad No

Computer vision algorithms and parallelization Classification image operators Low level image operators Point operators Local neighbour operators Global operators Connectivity based operators High level image operators Often built on the low level operators Specials Pattern matcher, neural network, genetic algorithm, etc Idea: design and implement skeletons for parallelizing representatives in classes

Benchmarking Benchmark protocol Data analyse with R Speedup graphs Speedup tables Median of execution time tables Best work-group size tables Violin plots

OpenMP Progress Script commands added Frame work for benchmarking > 160 operators are parallelized Run time prediction if parallelization is beneficial Calibration procedure

Example speedup graph (i7 2600)

Run time prediction if parallelization is beneficial The speed-up depends on Size of image Pixel type Content of the image Parameters like size of neighbourhood Etc. Calibration procedure for OpenMP Simple and fast procedure for global optimization Complex and slow procedure for more optimal optimization for each (sub) operator

Variations in executing times, violin plot

OpenCL Progress Toolbox for using OpenCL kernels from scripts and C++ Script commands added for host API Frame work for benchmarking Implementation first kernels

OpenCL development in VisionLab

OpenCL development in VisionLab

Optimal number of local histograms per work-group (GTX 560 Ti)

Speedup graph of Histogram, 16 local histogram per work-group (GTX 560 Ti)

Violin plot for Histogram, 16 local histograms/wg (GTX 560 Ti)

X-files: OpenCL versus OpenMP

Future work OpenCL Near future Memory transfers Pinned Zero copy APU Implementing more vision operators More distant future Intelligent buffer management? Automatic tuning of parameters? Run time prediction if parallelization is beneficial? Heterogeneous computing? OpenMP 4.0, OpenACC, C++ AMP?

The future: XIMEA Currera G (APU based)

Summary and preliminary conclusions Choice made for standards OpenMP and OpenCL Integration OpenMP and OpenCL in VisionLab Benchmark environment OpenMP Embarrassingly parallel algorithms are easy to convert More than 160 operators parallelized Run time prediction implemented OpenCL Scripting host side code accelerates development time Portable functionality Portable performance is not easy Still have to learn a lot about GPUs Searching for sparing partners

Questions? Jaap van de Loosdrecht NHL Centre of Expertise in Computer Vision j.van.de.loosdrecht@tech.nhl.nl www.nhl.nl/computervision Van de Loosdrecht Machine Vision BV jaap@vdlmv.nl www.vdlmv.nl