High Quality Real Time Image Processing Framework on Mobile Platforms using Tegra K1. Eyal Hirsch

Similar documents
Computer Vision on Tegra K1. Chen Sagiv SagivTech Ltd.

Maximizing GPU Power for Vision and Depth Sensor Processing. From NVIDIA's Tegra K1 to GPUs on the Cloud. Chen Sagiv Eri Rubin SagivTech Ltd.

SceneNet: 3D Reconstruction of Videos Taken by the Crowd on GPU. Chen Sagiv SagivTech Ltd. GTC 2015 San Jose

Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs. Lihua Zhang, Ph.D. MulticoreWare Inc.

Lecture 4.1 Feature descriptors. Trym Vegard Haavardsholm

Hardware Acceleration of Feature Detection and Description Algorithms on Low Power Embedded Platforms

A System of Image Matching and 3D Reconstruction

Image Features: Detection, Description, and Matching and their Applications

Technology for a better society. hetcomp.com

A Systems View of Large- Scale 3D Reconstruction

A Keypoint Descriptor Inspired by Retinal Computation

To 3D or not to 3D? Why GPUs Are Critical for 3D Mass Spectrometry Imaging Eri Rubin SagivTech Ltd.

Exploiting Depth Camera for 3D Spatial Relationship Interpretation

EECS150 - Digital Design Lecture 14 FIFO 2 and SIFT. Recap and Outline

The OpenVX Computer Vision and Neural Network Inference

Next Generation Visual Computing

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Panoramic Image Stitching

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

From Structure-from-Motion Point Clouds to Fast Location Recognition

Cluster-based 3D Reconstruction of Aerial Video

Advanced CUDA Optimization 1. Introduction

Midterm Examination CS 534: Computational Photography

High-Performance Data Loading and Augmentation for Deep Neural Network Training

Yudistira Pictures; Universitas Brawijaya

SIFT Descriptor Extraction on the GPU for Large-Scale Video Analysis. Hannes Fassold, Jakub Rosner

Parallel SIFT-detector implementation for images matching

Embedded real-time stereo estimation via Semi-Global Matching on the GPU

Real-Time Scene Reconstruction. Remington Gong Benjamin Harris Iuri Prilepov

B. Tech. Project Second Stage Report on

NumbaPro CUDA Python. Square matrix multiplication

GETTING STARTED WITH CUDA SDK SAMPLES

A Hybrid Feature Extractor using Fast Hessian Detector and SIFT

Introduction to CUDA

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

Outline 7/2/201011/6/

Navigating the Vision API Jungle: Which API Should You Use and Why? Embedded Vision Summit, May 2015

School of Computing University of Utah

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

FPGA-Based Feature Detection

SIGGRAPH Briefing August 2014

MultiAR Project Michael Pekel, Ofir Elmakias [GIP] [234329]

Lecture 6: Input Compaction and Further Studies

Tesla GPU Computing A Revolution in High Performance Computing

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

The Benefits of GPU Compute on ARM Mali GPUs

SIMD Exploitation in (JIT) Compilers

High Performance Computing on GPUs using NVIDIA CUDA

high performance medical reconstruction using stream programming paradigms

PROGRAMMING NVIDIA GPUS WITH CUDANATIVE.JL

Thrust ++ : Portable, Abstract Library for Medical Imaging Applications

Photo Tourism: Exploring Photo Collections in 3D

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors

CS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28

Interactive Non-Linear Image Operations on Gigapixel Images

and Parallel Algorithms Programming with CUDA, WS09 Waqar Saleem, Jens Müller

Parallel Tracking. Henry Spang Ethan Peters

Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

HISTOGRAMS OF ORIENTATIO N GRADIENTS

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

3D Reconstruction From Multiple Views Based on Scale-Invariant Feature Transform. Wenqi Zhu

An Evaluation of Unified Memory Technology on NVIDIA GPUs

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

rcuda: an approach to provide remote access to GPU computational power

A Method to Eliminate Wrongly Matched Points for Image Matching

Performance Analysis of Sobel Edge Detection Filter on GPU using CUDA & OpenGL

TUNING CUDA APPLICATIONS FOR MAXWELL

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory

HPC with Multicore and GPUs

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.

Scale Invariant Feature Transform

GTC Interaction Simplified. Gesture Recognition Everywhere: Gesture Solutions on Tegra

Local features and image matching. Prof. Xin Yang HUST

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

CS 223B Computer Vision Problem Set 3

CS 378: Autonomous Intelligent Robotics. Instructor: Jivko Sinapov

Martin Dubois, ing. Contents

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

Practical Introduction to CUDA and GPU

Local invariant features

TUNING CUDA APPLICATIONS FOR MAXWELL

April 4-7, 2016 Silicon Valley VISIONWORKS A CUDA ACCELERATED COMPUTER VISION LIBRARY S6783. Elif Albuz, April 4, 2016

Anno accademico 2006/2007. Davide Migliore

Pop Quiz 1 [10 mins]

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

Cuda Compilation Utilizing the NVIDIA GPU Oct 19, 2017

Transcription:

High Quality Real Time Image Processing Framework on Mobile Platforms using Tegra K1 Eyal Hirsch

Established in 2009 and headquartered in Israel SagivTech Snapshot Core domain expertise: GPU Computing and Computer Vision What we do: - Technology - Solutions - Projects - EU Research - Training GPU expertise: - Hard core optimizations - Efficient streaming for single or multiple GPU systems - Mobile GPUs

The new era of mobile Mobile is everywhere

As mobile devices get smarter In the beginning: I can talk from anywhere! A bit later: My phone can take pictures! Now: Advanced camera More compute power Fast device cloud communication What can be done with those advancements?

Project Tango Mission: Running a depth sensing technology on a mobile platform Challenge: First time on NVIDIA s Tegra K1 Extreme optimizations on a CPU-GPU platform to allow the device to handle other tasks in parallel Expertise: Mantis Vision the algorithms NVIDIA the Tegra K1 platform SagivTech the GPU computing expertise Bottom line: Depth sensing running in real time in parallel to other compute intensive applications!

Project Tango Credits: http://techaeris.com

Mobile Crowdsourcing Video Scene Reconstruction If you ve been to a concert recently, you ve probably seen how many people take videos of the event with mobile phone cameras Each user has only one video taken from one angle and location and of only moderate quality

The Idea behind SceneNet Leverage the power of multiple mobile phone cameras Create a high-quality 3D video experience that is sharable via social networks

Creation of the 3D Video Sequence TIME The scene is photographed by several people using their cell phone camera The video data is transmitted via the cellular network to a High Performance Computing server. Following time synchronization, resolution normalization and spatial registration, the several videos are merged into a 3-D video cube.

Algorithms implemented on the TK1 Enabling the 3D reconstruction for SceneNet required various algorithms to run on the TK1 GPU FREAK: Fast Retina Key point BRISK: Binary Robust Invariant Scalable Key points DoG: Difference of Gaussians Algorithms had to run in real-time Algorithms are image processing building blocks for various image processing tasks

DoG: Input: 480 x 640 RGB Image Output: ~32K key points Freak: Freak &DoG performance on the TK1 Input: ~32K key points, Image Output: Descriptor per key point Majority of the code on the GPU Off loading to the GPU allows for real time processing, not possible on the CPU

DoG performance on the TK1 DoG flow: Gaussian DiffImage Find Key points Total: 10.83 ms Kernel Misc Gaussian: Conv2D Gaussian: DownSampleBilinear DiffImage FindKeyPoints Total DoG Avg. time (ms) 0.3 4.8 0.6 1.7 3.43 10.83

FREAK performance on the TK1 FREAK flow: IntegralImage Extract descriptors Total: 2.4 ms Kernel IntegralImage GetDescriptors Total FREAK Avg. time (ms) 1.5 0.9 2.4 Total DoG + FREAK: 13.23 ms

Freak &DoG performance on the TK1 13 ms means real time processing on Ardbeg development board!!! Room for more tasks to run in the background Opens up possibilities for many mobile applications Having real time performance is not enough Need to evaluate power consumption as well

Performance is also GFlops/WATT

CUDA NVIDIA Programming the TK1 GPU OpenCL Khronos RenderScript Developed by Google

Programming the TK1 - CUDA Most rules and methods that apply to discrete cards, apply to the TK1 GPU Code and libraries (such as cufft, cublas, cusparse, CUB, Thrust, etc) should work out of the box for the TK1 Develop on Windows/Linux with discrete card and then migrate to the TK1 Use the profiler

Programming the TK1 - OpenCL Most of the tips for CUDA applies to OpenCL Runs nicely and shows nice performance Migrated the in-house Bilateral filter from CUDA to OpenCL in less than a day 2D separable convolution yield nice performance gains (compared to an optimized Neon implementation)

2D separable convolution on the TK1 Used 4 tests configuration to evaluate performance Highly optimized reference library utilizing the NEON (CPU) SagivTech s in-house Neon implementation (CPU) SagivTech s in-house OpenCL implementation (GPU) Test configuration Reference library ST single core NEON ST 4 cores NEON ST OpenCL 1K x 1K 22 23.5 10.8 4 2K x 2K 97 99 48 9

Programming the TK1 RenderScript - 1 Google s way of doing Compute on a mobile platform Quick CUDA to RenderScript acronym translation: User manages allocations (a.k.a buffers) User manages data transfer/copies to/from allocations User sets runtime parameters (a.k.a kernel params) User launches kernels much like OpenCL/CUDA Code ran on the GPU and yielded impressive performance boost (still lags behind CUDA) CUDA to RS migration fairly easy

Programming the TK1 RenderScript - 2 Google does NOT mandate which SoC component will run the RS code Developer has no control where RS code will run Depends on specific hardware, vendors, code, etc To test RS on TK1, locked GPU clocks in different configurations and run RS sparse matrix vector multiplication benchmark Performance of the RS code under different clocks, would reveal which component ran RS code

Programming the TK1 RenderScript - 3 Sparse matrix vector multiplication using Render script Used 3 test configurations Naive C++ CPU code SagivTech RS NVIDIA s cusparse RS running on GPU RS shows nice performance 45 40 35 30 25 20 15 10 5 0 GPU: Full clocks Chart Title GPU: Half clocks GPU: Quarter clocks Naive C++ SagivTech RS NVIDIA cusparse

Programming the TK1 Optimization tips Only one SMX We ve seen cases where different optimizations behave differently on the TK1 than on equivalent discrete card (such as ldg etc) Try various optimizations, in some cases we got better performance when using atomics rather than shared memory Always optimize on the TK1 and not on discrete used for the development phase

The future Real time image processing of even complex algorithms is achievable on the TK1 Easy migration from mature discrete GPU code to new and exiting field of mobile compute Maxwell is already planed for next mobile generation, bringing more power efficiency and performance It works!!

T h a n k Yo u F o r m o r e i n f o r m a t i o n p l e a s e c o n t a c t E y a l H i r s c h e y a l @ s a g i v t e c h. c o m

Programming the TK1 General tips TK1 hardware CC is 3.2 Tools and compilation chain is quite different. Need some time to get started Strive to do the CUDA and managing app/code in Windows/Linux using a discrete card and then migrate to Android Always have a reference code in naive, single thread C++ to compare the results of the parallel algorithm

Computational Photography: examples Background subtitution

FREAK Fast Retina Keypoint Binary feature descriptor Hamming distance matcher Sampling pattern Overlapping receptive fields Exponential change in size Rotation invariant

BRISK Binary Robust Invariant Scalable Key points Binary feature descriptor Hamming distance matcher Sampling pattern Equally spaced in circles Gaussian kernel size relative to distance from feature

DoG Difference of Gaussians Feature detector Local minima/maxima of the image convolved with difference of gaussians Acts as a 2D band-pass filter over the image Enhances corners