Hardware Acceleration of Feature Detection and Description Algorithms on Low Power Embedded Platforms


Hardware Acceleration of Feature Detection and Description Algorithms on Low-Power Embedded Platforms
Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar
School of Engineering, Brown University

Image Processing in Mobile Systems
- Image processing is everywhere!
- Input data has changed from words/numbers to images
- Sensors have improved dramatically
- Image processing is a major driving factor in technological advancement
- Autonomization relies on image processing
Mobile/embedded platforms?
- Real-time computing
- Limited data bandwidth
- Prefer local computing to offloading to the cloud
BUT image processing can be very computationally intensive and power hungry
Image sources: www.guardiantv.com, www.google.com

Accelerating Image Processing on Low Power Embedded Platforms
- Meeting real-time image processing requirements for many of these applications requires HW-assisted acceleration
- Which algorithms do we accelerate?
- Feature detection and feature description are key building blocks for image retrieval, biometric identification, visual odometry, etc.
- Computationally efficient detection and analysis of image features is critical for performance and energy efficiency
Image sources: http://www.sybernautix.com/, https://vision.in.tum.de, https://blog.pivotal.io

Hardware Acceleration for Energy Constrained Image Processing
Low power embedded platforms:
- Field Programmable Gate Arrays (FPGAs)
- Graphics Processing Units (GPUs)
- Low power general processors (CPUs)

Platform                                   Power
Xilinx Virtex 6 FPGA                       15 W
Xilinx Zynq 7020                           <5 W
NVIDIA GeForce GTX 680 GPU (1536 cores)    195 W
NVIDIA Jetson TK1 (192 cores)              <12 W

Our Contributions
- Comparative study of feature detection and description algorithms
  - What are their computation kernel characteristics?
- Comparative study of platforms for embedded applications
  - Advantages/disadvantages of each platform?
- Accelerating algorithms on different platforms
  - How can algorithms be modified to better exploit the available hardware of each platform?
  - How does performance compare in terms of run time and energy consumption?

Feature Detection
What is a feature?
- An interesting part of an image that can be used to identify objects
- Examples: edges, corners, ridges, blobs
Figure labels:
- Flat region: no change within block
- Edge: no change along the edge
- Corner: change in all directions
Slide adapted from Darya Frolova, Denis Simakov, Weizmann Institute

Feature Description
- Given the features, uniquely describe them so they can be matched in other images
- Descriptors summarize characteristics of the features, e.g., intensity, orientation
- Descriptors should be distinctive and insensitive to local image deformations
Images from: R. Szeliski, Computer Vision: Algorithms and Applications

Accuracy and Runtime Comparisons
[Figure: chart comparing SIFT, SURF, BRIEF, and BRISK. Axes: accuracy percentage (0 to 80) and CPU runtime in ms (log scale, 1 to 1000). Series: precision (% of detected features that are correct), recall (% of features detected), runtime.]
HoG (Histogram of Gradient) based descriptors:
- SIFT: Scale-Invariant Feature Transform
- SURF: Speeded Up Robust Features
Binary feature descriptors:
- BRIEF: Binary Robust Independent Elementary Features
- BRISK: Binary Robust Invariant Scalable Keypoints

FAST: Features from Accelerated Segment Test
Rosten and Drummond, ECCV '06
- A sliding window applies a Bresenham circle test around each candidate pixel
- 12-pixel continuity? If so, then feature
- Precompare pixels 1, 5, 9, and 13 to determine possibility for continuity
- On average, 98.5% of the comparisons fail the continuity test at the precompare stage
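
The segment test and its precompare shortcut can be sketched in a few lines of Python. This is a minimal illustration, not the FPGA or OpenCV implementation: the function name `is_fast_corner`, the `threshold` value, and the list-of-lists image format are assumptions made for the example. The precompare works because a run of 12 contiguous pixels on a 16-pixel circle excludes only 4 neighbouring pixels, so it must contain at least 3 of the 4 pixels numbered 1, 5, 9, and 13.

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3,
# listed clockwise starting from the top (pixel 1 on the slide is index 0).
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, n=12):
    """Return True if n contiguous circle pixels are all brighter or all darker
    than the center by more than `threshold` (an illustrative value)."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [v > center + threshold for v in ring]
    darker = [v < center - threshold for v in ring]

    # Precompare stage: pixels 1, 5, 9, 13 are indices 0, 4, 8, 12.
    # Fewer than 3 hits in both polarities rules out a 12-pixel run.
    compass = (0, 4, 8, 12)
    if (sum(brighter[i] for i in compass) < 3
            and sum(darker[i] for i in compass) < 3):
        return False

    # Full continuity test, wrapping around the circle.
    for flags in (brighter, darker):
        run = 0
        for hit in flags + flags[:n - 1]:
            run = run + 1 if hit else 0
            if run >= n:
                return True
    return False
```

On a uniform 7x7 patch the function returns False at the precompare stage; brightening 12 contiguous circle pixels makes it return True.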

BRIEF: Binary Robust Independent Elementary Features
- Compares intensities of pairs of points; the resulting binary descriptors are matched using Hamming distance
- BRIEF sampling pattern: 512 sampling pairs
- Sampling pairs are generated from a 31×31 region around the center pixel
- For each pair, X_i is at (0,0) and Y_i takes all possible values from a coarse polar grid
- The chosen sampling pattern results in a 512-bit characterization array
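
A minimal Python sketch of this descriptor follows. It assumes a hypothetical polar grid of 16 radii by 32 angles (which happens to give 512 pairs inside a 31×31 window); the slide does not give the actual grid, and the smoothing step that BRIEF normally applies before sampling is omitted. The names `polar_pairs`, `brief_descriptor`, and `hamming` are illustrative.

```python
import math


def polar_pairs(n_radii=16, n_angles=32, max_radius=15):
    """Hypothetical pattern: X_i fixed at (0, 0), Y_i on a coarse polar grid."""
    pairs = []
    for k in range(1, n_radii + 1):
        r = max_radius * k / n_radii
        for a in range(n_angles):
            t = 2 * math.pi * a / n_angles
            pairs.append(((0, 0), (round(r * math.cos(t)), round(r * math.sin(t)))))
    return pairs


def brief_descriptor(img, x, y, pairs):
    """Pack one bit per pair: bit i is 1 when I(X_i) > I(Y_i), else 0."""
    bits = 0
    for i, ((ax, ay), (bx, by)) in enumerate(pairs):
        if img[y + ay][x + ax] > img[y + by][x + bx]:
            bits |= 1 << i
    return bits


def hamming(d1, d2):
    """Number of differing bits between two descriptors."""
    return bin(d1 ^ d2).count("1")
```

Matching then reduces to an XOR and a bit count per descriptor pair, which is why binary descriptors map well onto simple hardware.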

BRISK: Binary Robust Invariant Scalable Keypoints
- BRISK uses a custom sampling pattern
- 512 sampling pairs generated from a 31×31 region (like BRIEF)
- Distinguishes between short and long pairs
  - Short pairs are used, similar to BRIEF, to generate descriptor vectors based on intensity comparisons
  - Long pairs are used for orientation computation by rotating the sampling pattern
Figure: BRISK sampling pattern. Red circles represent the standard deviation of the Gaussian smoothing.
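
The short/long split and the orientation estimate can be sketched as below. This is a hedged illustration rather than the presenters' hardware design: the distance thresholds `short_max` and `long_min` are hypothetical parameters, and the orientation is taken as the angle of the summed local gradients over the long pairs, which is the usual formulation of BRISK but is not spelled out on the slide.

```python
import math


def split_pairs(points, short_max, long_min):
    """Classify every point pair by distance (thresholds are illustrative)."""
    short, long_ = [], []
    for i in range(len(points)):
        for j in range(i):
            d = math.dist(points[i], points[j])
            if d < short_max:
                short.append((i, j))
            elif d > long_min:
                long_.append((i, j))
    return short, long_


def orientation(points, values, long_pairs):
    """Angle of the summed gradient estimate over the long pairs, in radians.

    values[k] is the (smoothed) intensity sampled at points[k].
    """
    gx = gy = 0.0
    for i, j in long_pairs:
        dx = points[j][0] - points[i][0]
        dy = points[j][1] - points[i][1]
        w = (values[j] - values[i]) / (dx * dx + dy * dy)
        gx += w * dx
        gy += w * dy
    return math.atan2(gy, gx)
```

The sampling pattern would then be rotated by the returned angle before the short-pair comparisons produce the descriptor bits.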

Algorithm Flowchart: FAST + BRIEF
[Flowchart: FAST feature detection followed by BRIEF feature description]
1. Start: read input frame
2. Feature detection: for each pixel p, apply the 7x7 filter of the Bresenham circle
3. Is p a corner? If false, p is not described
4. Feature description: generate N sampling pairs, X_i and Y_i, around p
5. For each pair, test X_i > Y_i: if true, D_i = 1; if false, D_i = 0
6. Stop
Note: obtaining the sampling window for feature description requires an irregular access pattern

Algorithm Flowchart: FAST + BRISK
[Flowchart: FAST feature detection followed by BRISK feature description]
1. Start: read input frame
2. Feature detection: for each pixel p, apply the 7x7 filter of the Bresenham circle
3. Is p a corner? If false, p is not described
4. Feature description: generate N sampling pairs, X_i and Y_i, around p
5. For each pair, test X_i > Y_i: if true, D_i = 1; if false, D_i = 0
6. Stop
Additional box in the description stage: perform orientation compensation for BRISK
Notes:
- BRISK requires an extra step for orientation compensation
- This step needs a significant amount of extra hardware resources

Experimental Embedded Platforms
FPGA: MicroZed development board
- 28nm Zynq 7020 SoC
- Artix-7 FPGA
- 1GB DDR3
- Dual-core ARM Cortex A9 CPU (for debug and init. only)
GPU & CPU: Jetson TK1 development kit
- 28nm Tegra K1 SoC
- Kepler GPU with 192 CUDA cores @ 950MHz
- Quad-core ARM Cortex A15 CPU @ 2.5GHz (single core activated)
- 2GB memory
- Running OpenCV versions of FAST, BRIEF, BRISK

Feature Detection & Description: Block Diagram
[Figure: block diagram of the Zynq implementation, shown over several slides that highlight FAST feature detection, the BRIEF descriptor, and the BRISK descriptor in turn]
Processing System (PS):
- DDR3, ARM Cortex A9 CPU, central interconnect, memory interface, 32b GP AXI master port
AXI interconnect signals: control, enable & sync signals, image data
Programmable Logic (PL):
- Data control logic for detection and description: 10-line word buffer, buffer address generator and register array, zigzag-traversing line buffers
- Feature detection logic: precompute units, equality mask, size register array, 12-circle comparator, is_corner output
- Descriptor logic: smoothing & region generation, N-wide comparator, orientation compensation
Outputs: x & y coordinates, descriptor, orientation
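
The role of the line buffers in the diagram can be illustrated with a small software model. This is only a generic raster-order sketch under stated assumptions: it does not reproduce the zigzag traversal or the 10-line word buffer named on the slide, and `stream_windows` is a hypothetical name. It shows how keeping the previous k-1 rows lets a k×k window (7x7 for the FAST circle) be formed as each pixel arrives, without storing the whole frame.

```python
from collections import deque


def stream_windows(pixels, width, k=7):
    """Yield (center_x, center_y, window) for each full k x k window.

    `pixels` is an iterable of intensities in raster order; only the last
    k - 1 complete rows plus the row being received are kept in memory.
    """
    prev_rows = deque(maxlen=k - 1)  # line buffers for the k - 1 previous rows
    current = []                     # row currently being received
    y = 0
    for p in pixels:
        current.append(p)
        x = len(current) - 1
        if len(prev_rows) == k - 1 and x >= k - 1:
            lo = x - k + 1
            window = [row[lo:x + 1] for row in prev_rows] + [current[lo:x + 1]]
            yield x - k // 2, y - k // 2, window
        if len(current) == width:
            prev_rows.append(current)  # oldest row drops out automatically
            current = []
            y += 1
```

For an 8x8 frame the generator produces four windows, centered at (3,3), (4,3), (3,4), and (4,4).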

Results: Runtime
[Figure: bar chart of runtime (ms, axis 0 to 120) for FAST, FAST-BRIEF, and FAST-BRISK on four platforms: Intel i7 CPU, ARM on Jetson, Tegra GPU on Jetson, Zynq FPGA]
Bar labels (ms): 4.4, 36.4, 25.8, 13.8, 18.2, 27.6, 13.7, 13.7, 13.7, 50.8, 103.7, 111.7

Results: Power & Energy
[Figure, two bar charts for FAST (detection), FAST-BRIEF and FAST-BRISK (detection + description) on four platforms: Intel i7 CPU, Embedded CPU, Tegra GPU, Zynq FPGA]
Power (W, axis 0 to 25), bar labels: 2.20, 2.27, 2.31, 4.3, 6.5, 6.2, 6.3, 6.3, 6.3, 21.5, 22.2, 22.4
Energy (mJ, axis 0 to 1200), bar labels: 94.0, 166.8, 59.3, 114.1, 174.3, 30.00, 30.96, 31.58, 809.4, 1137.7, 696.4, 705.1
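
Energy per frame follows from power and runtime: watts times milliseconds gives millijoules. The sketch below is a worked check under an assumption: it pairs the 2.20 W power label with the 13.7 ms runtime label from the runtime slide, a pairing that the flattened charts do not confirm. The product is about 30.1 mJ, close to the 30.00 mJ energy label. The helper names are illustrative.

```python
def energy_mj(power_w, runtime_ms):
    """Energy per frame in millijoules: W * ms = mJ."""
    return power_w * runtime_ms


def frames_per_second(runtime_ms):
    """Frame rate implied by a per-frame runtime in milliseconds."""
    return 1000.0 / runtime_ms


# Assumed pairing of chart labels, for illustration only.
print(round(energy_mj(2.20, 13.7), 2))     # 30.14
print(round(frames_per_second(27.6), 1))   # 36.2
```

The second line shows the same kind of conversion for frame rate: a 27.6 ms runtime corresponds to roughly 36 frames per second.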

Results: Profiling
[Figure, two panels: "Stall Reasons during GPU Computation" and "Instruction Distribution". Category labels: Not Selected, Memory Throttle, Pipe Busy, Execution Dependency; Load/Store, FP/Integer. Series labels: Detection (FAST), Description (BRIEF); FPGA: FAST, FPGA: FAST BRIEF, GPU: FAST, GPU: FAST BRIEF. Axes: 0% to 100% and 0.0% to 45.0%.]
- Feature description stalled due to memory throttle
  - Needs better data management
- Feature description for GPU implementation has a bump in load/store ops
  - Almost 10X more than just FAST

Results: FPGA Resource Utilization

Resource utilization:
Design        Lookup Tables   Flip Flops   Block RAMs
FAST          4564            1551         8
FAST BRIEF    14398           2093         11
FAST BRISK    25575           7115         11

Distribution of resources (share of each resource across the three designs):
Design        Lookup Tables   Flip Flops   Block RAMs
FAST          10%             14%          27%
FAST BRIEF    32%             19%          37%
FAST BRISK    57%             66%          37%

- BRISK requires a significant amount of extra resources for smoothing and orienting
- Extra resources do not translate to much extra power
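
The percentages in the distribution chart can be reproduced from the utilization counts, which also clarifies what they mean: each is one design's share of the total for that resource summed over the three designs, not a fraction of the device capacity. A short sketch (the `shares` helper and dictionary layout are illustrative):

```python
# Resource counts as reported on the slide.
USAGE = {
    "FAST":       {"LUT": 4564,  "FF": 1551, "BRAM": 8},
    "FAST BRIEF": {"LUT": 14398, "FF": 2093, "BRAM": 11},
    "FAST BRISK": {"LUT": 25575, "FF": 7115, "BRAM": 11},
}


def shares(usage, resource):
    """Percentage of `resource` used by each design, relative to the sum over designs."""
    total = sum(counts[resource] for counts in usage.values())
    return {name: round(100 * counts[resource] / total)
            for name, counts in usage.items()}


for res in ("LUT", "FF", "BRAM"):
    print(res, shares(USAGE, res))
```

Because of rounding, a column does not always add up to exactly 100%.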

Conclusions
- FPGA outperforms CPUs and GPUs in terms of power & performance
  - FAST BRISK: 36 fps vs. 147 fps
- FPGA amenable to various HW optimizations: deep pipelining, optimized memory access, precomputation
- FPGA implementations better for handling multiple kernels
  - For GPUs, multiple kernels are highly bounded by the kernel scheduler and memory bottlenecks
  - FPGA customization on layers is better for tackling operations on multiple kernels
- Use profiling on the GPU implementation as a first step to FPGA optimization
  - Identify the nature of bottlenecks
  - Customized FPGA HW can often better manage certain types of bottlenecks