OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

1 OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

2 Agenda
- Why Zynq SoCs for Traditional Computer Vision
- Automated Flow for OpenCV HW Acceleration
- Case Study

3 OpenCV Needs Acceleration in Embedded
Typical requirement: > 30 FPS
Typical ARM Cortex-A53: Harris Corner 24 FPS, Stereo Depth Map 21 FPS, Dense Optical Flow 0.1 FPS
Source: Embedded Vision Alliance, Embedded Vision Developer Survey, January 2017

4 Zynq Offers the Most Efficient CV Acceleration

5 Zynq Offers Superior Performance and Latency
Callouts: <10 ms latency (real-time applications); 42x frames/sec/watt

CV::StereoLBM @1080p     Xilinx ZU9   Xilinx ZU5   eGPU*
Frames/s                 700          296          43
Power (W)                4.8          3.3          7.9
Frames/s/watt            145.8        89.7         5.4

CV::LK Dense Optical Flow   Xilinx ZU9   Xilinx ZU5   eGPU*
Power (W)                   4.8          3.3          7.9
Frames/s/watt               35.4         22.1         0.9

Xilinx benchmark. *eGPU = NVIDIA Tegra X1 using VisionWorks for StereoLBM and OpenCV4Tegra for optical flow. All benchmarks use as many resources as possible on the GPU (~99%) and the programmable logic (~70%).

6 Why So Good? Efficient Window-based Streaming
[Diagram: a typical SoC (image sensor, DSP/GPU running optical flow, SfM and stereo depth, CPUs, DDR) compared with Zynq (image sensor, programmable logic running optical flow, stereo depth and SfM, CPUs, DDR)]

7 [Stack diagram: Frameworks; DNN/CNN (GoogLeNet, SSD, FCN); Libraries and Tools; Development Kits]

8 Debunking "Zynq SoC is Hard to Program"
Computer Vision: C/C++/OpenCL Creation; Profiling to Identify Bottlenecks; System Optimizing Compiler; Optimized Accelerators & Data Motion Network
Machine Learning: DNN/CNN (GoogLeNet, SSD, FCN); prototxt & Trained Weights; Scheduling of Pre-Optimized Neural Network Layers

9 OpenCV Support with Automatic HW Acceleration
- Cross-compile the OpenCV application to Zynq (ARM A9/A53)
- Profile and identify bottleneck functions (here: stereorectify, stereolbm)
- Make minimal changes to the code and set functions to hardware
- Compile using SDSoC
- Run on a Zynq board

Before:
    main(){
      cv::imread(a);
      cv::stereorectify(a,b,c,d);
      cv::stereolbm(c,d,out);
      cv::imshow(out);
    }

After:
    main(){
      cv::imread(a);
      xf::stereorectify<line>(a,b,c,d);
      xf::stereolbm<win,n_disp>(c,d,out);
      cv::imshow(out);
    }

10 xfopencv: HW Accelerated OpenCV Functions
Functions (grouped on the slide as Level 1, Level 2 and Level 3):
Absolute difference, Accumulate, Accumulate squared, Accumulate weighted, Arithmetic addition, Arithmetic subtraction, Bitwise: AND, OR, XOR, NOT, Pixel-wise multiplication
Channel combine, Channel extract, Color convert, Convert bit depth, Table lookup, Histogram, Integral image
Box, Gaussian, Median, Sobel, Custom convolution, Bilateral
Scale/Resize, StereoRectify, Warp Affine, Warp Perspective, Fast corner, Harris corner, Remap, Image pyramid
Gradient Phase, Gradient Magnitude, Dilate, Erode, Min/Max Location, Mean & Standard Deviation, Thresholding, Equalize Histogram
Histogram of Oriented Gradients (HOG), SVM (binary), OTSU Thresholding, Mean Shift Tracking (MST), LK Dense Optical Flow, Canny edge detection, Color Detection, StereoLBM

11 Custom CV Function / Library Creation Flow
- Cross-compile to Zynq (ARM A9/A53)
- Write the custom CV function in C, C++ or OpenCL
- Optimize for hardware using HLS
- Assign functions to hardware
- Compile using SDSoC
- Run on a Zynq board

12 Example: 4K60 LK Dense Optical Flow
Xilinx ZU9: Frames/s 60; Power (W) 4.8; Latency (ms) 16.7; Utilization 15%

    main(){
      imread(a);
      imread(b);
      denseopticalflowpyrltr(a,b,out);
      imshow(out);
    }

[Diagram: SDSoC Generated Platform. SW: Application, Libraries, Linux, Drivers. HW: MIPI, denseopticalflowpyrltr, DMA, AXI, AXI-S, HDMI]
NVIDIA number uses CUDA OpenCV. Both Xilinx and NVIDIA benchmarks exclude the camera inputs and HDMI/DP outputs. LK dense optical flow, non-pyramidal, non-iterative, window size 53x53.

13 Example: Stereo Depth Map
Xilinx ZU9: Frames/s 140; Power (W) 4.8; Latency (ms) 7.1; Utilization 14%

    main(){
      imread(a);
      imread(b);
      stereorectify(a,b,c,d);
      stereolbm(c,d,out);
      imshow(out);
    }

[Diagram: SDSoC Generated Platform. SW: Application, Libraries, Linux, Drivers. HW: USB3, StereoRectify, StereoLBM, DMA, AXI, AXI-S, HDMI]
NVIDIA number uses CUDA OpenCV. SAD-based stereo local block matching. Both Xilinx and NVIDIA benchmarks exclude the camera inputs and HDMI/DP outputs.

14 Step 1: Port Desktop OpenCV Application to Zynq
- Simply import the C/C++ projects with OpenCV APIs into SDSoC
- All necessary OpenCV compile and link environments for ARM are provided
- Ready to compile!

15 Step 2: Assign Functions to Hardware Acceleration
- Minor modifications are needed to use the OpenCV libraries for hardware acceleration
- Namespace change: cv:: to xf::
- Add template parameters for optimized hardware generation
- Simply assign critical functions to hardware

16 Step 3: Estimate Performance and Build
- Fast estimation, in minutes, of system-level performance and HW utilization
- Build the full system with the click of a button
- Outputs: ARM executable, HW bitstream, Linux kernel, rootfs and boot files

17 Step 4: Run on a Board and Collect Traces

18 Summary
- Zynq SoCs offer superior performance and lower latency compared to other SoC offerings
- The reVISION stack on SDSoC introduces a familiar software environment with pre-optimized OpenCV libraries
- Available NOW
- Visit the reVISION developer zone

19 Design Examples on Xilinx.com/reVISION

20 Resources
- INT8 Whitepaper
- Machine Learning Whitepaper
- reVISION Backgrounder
- Additional Papers & Tutorials
- Xilinx Embedded Vision Videos
- Forums
For all this and more, visit Xilinx.com/reVISION

21 Empowering Product Creators to Harness Embedded Vision
- The Embedded Vision Alliance (www.embedded-vision.com) is a partnership of 60+ leading embedded vision technology and services suppliers
- Mission: Inspire and empower product creators to incorporate visual intelligence into their products
- The Alliance provides low-cost, high-quality technical educational resources for product developers
- Register for updates at www.embedded-vision.com
- The Alliance enables vision technology providers to grow their businesses through leads, ecosystem partnerships, and insights
- For membership, email us: membership@embedded-vision.com
© 2016 Embedded Vision Alliance

22 Learn How to Develop Deep Learning Applications for Computer Vision in TensorFlow
Topics:
- Introduction to TensorFlow
- TensorBoard Visualization Tools
- Open Source CNN Models
- Neural Networks in TensorFlow
- Object Recognition in TensorFlow
- Using TensorFlow in Embedded Systems
Dates:
- July 13, 2017: Hyatt Regency Santa Clara, Santa Clara, California
- September 7, 2017: Steigenberger Hotel, Hamburg, Germany
Embedded Vision Alliance

23 Q&A
