A Lightweight YOLOv2:

Similar documents
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms

Lecture 5: Object Detection

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Yiqi Yan. May 10, 2017

Object Detection Based on Deep Learning

Xilinx ML Suite Overview

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Spatial Localization and Detection. Lecture 8-1

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

XPU A Programmable FPGA Accelerator for Diverse Workloads

THE NVIDIA DEEP LEARNING ACCELERATOR

Unified, real-time object detection

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Utilizing SDSoC to Port Convolutional Neural Network to a Space-grade FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Revolutionizing the Datacenter

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Object Detection on Self-Driving Cars in China. Lingyun Li

Object detection with CNNs

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation

ROS-COMPLIANT FPGA COMPONENT TECHNOLOGY INSTALLATION OF FPGA INTO ROS

Object Detection. Part1. Presenter: Dae-Yong

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

Convolutional Neural Network Layer Reordering for Acceleration

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

Hardware Acceleration of Feature Detection and Description Algorithms on Low Power Embedded Platforms

An introduction to Machine Learning silicon

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

Small is the New Big: Data Analytics on the Edge

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques

YOLO9000: Better, Faster, Stronger

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

FPGA 加速机器学习应用. 罗霖 2017 年 6 月 20 日

Deploying Deep Neural Networks in the Embedded Space

Final Report: Smart Trash Net: Waste Localization and Classification

Brainchip OCTOBER

Versal: AI Engine & Programming Environment

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

On-chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

RON: Reverse Connection with Objectness Prior Networks for Object Detection

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Hello Edge: Keyword Spotting on Microcontrollers

efpga for Neural Network based Image Recognition

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Regionlet Object Detector with Hand-crafted and CNN Feature

Embedded Linux Conference San Diego 2016

Creating Affordable and Reliable Autonomous Vehicle Systems

YOLO 9000 TAEWAN KIM

HEAD HardwarE Accelerated Deduplication

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Semantic Segmentation

Abandoned Luggage Detection

YOLO: You Only Look Once Unified Real-Time Object Detection. Presenter: Liyang Zhong Quan Zou

A Reconfigurable Multiclass Support Vector Machine Architecture for Real-Time Embedded Systems Classification

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

Embedded Binarized Neural Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

CNN optimization. Rassadin A

CNN-based Human Body Orientation Estimation for Robotic Attendant

Deep Learning Accelerators

Rich feature hierarchies for accurate object detection and semantic segmentation

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

Accelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

SOFTWARE HARDWARE CODESIGN ACCELERATION FOR EFFICIENT NEURAL NETWORK. ...Deep learning and neural

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

arxiv: v1 [cs.cv] 31 Mar 2016

DeepLearning on FPGAs

arxiv: v5 [cs.cv] 11 Dec 2018

Joint Object Detection and Viewpoint Estimation using CNN features

Zynq-7000 All Programmable SoC Product Overview

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

Xilinx ML Suite Overview

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

direct hardware mapping of cnns on fpga-based smart cameras

Transcription:

FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology, Japan

Outline Background Convolutional Neural Network (CNN) Mixed precision CNN for a Lightweight YOLOv2 Binary precision CNN Half precision support vector regression (SVR) FPGA Implementation Experimental Results Conclusion 2

Deep Learning is Everywhere 3

Convolutional Neural Network (CNN) Convolutional + Fully connected + Pooling State of the art performance in an image recognition task Widely applicable Source: https://www.mathworks.com/discovery/convolutional neural network.html 4

Image Recognition Tasks Easy Classification Answer category of the object in an image Baby (44%) Son (23%) Daughter (33%) Object Detection Classification + localization Son Baby Daughter Hard 5

Applications Robotics, autonomous driving, security, drones 6

Demo Available at https://www.youtube.com/watch?v=_imboyu8iwc 7

Requirements in Embedded System Cloud Embedded Many classes (1000s) Few classes (<10) Large workloads Frame rates (15 30 FPS) High efficiency (Performance/W) Server form factor Low cost & low power (1W 5W) Custom form factor J. Freeman (Intel), FPGA Acceleration in the era of high level design, HEART2017 8

Deep Learning Inference Device Flexibility: R&D costs for keeping on evolving algorithms Power performance efficiency FPGA has flexibility&better performance CPU (Raspberry Pi3) GPU (Jetson TX2) FPGA (UltraZed) ASIC (Movidius) Flexibility Power Performance Efficiency 9

Outline Background Convolutional Neural Network (CNN) Mixed precision CNN for a Lightweight YOLOv2 Binary precision CNN Half precision support vector regression (SVR) FPGA Implementation Experimental Results Conclusion 10

Object Detection Problem Detecting and classifying multiple objects at the same time Evaluation criteria (from Pascal VOC): Ground truth annotation Detection results: >50% overlap of bounding box(bbox) with ground truth One BBox for each object Confidence value for each object Person (50%) #. #. #. # Average Precision (AP): 1 11,,.,, 11

YOLOv2 (You Only Look Once version 2) Single CNN (One shot) object detector Both a classification and a BBox estimation for each grid Feature maps Class score Input Image (Frame) CONV+Pooling CONV+Pooling CNN Detection Bounding Box J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger," arxiv preprint arxiv:1612.08242, 2016. 12

2D Convolutional Operation Computational intensive part of the YOLOv2 Input feature map Kernel (Binary) Output feature map X 0,0 x W 0,0 X 0,1 x W 0,1 X 0,2 x W 0,2 X 1,0 x W 1,0 X 1,1 x W 1,1 X 1,2 x W 1,2 X 2,0 x W 2,0 X 2,1 x W 2,1 +) X 2,2 x W 2,2 y 13

Binarized CNN w 0 (Bias) w 1 x 1 x1 x2 Y 1 1 1 1 +1 1 +1 1 1 +1 +1 1 x1 x2 Y 0 0 1 0 1 0 1 0 0 1 1 1 w 2 Y f sgn (Y) z x 2... w n x n M. Courbariaux, I. Hubara, D. Soudry, R.E.Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or 1," Computer Research Repository (CoRR), Mar., 2016, http://arxiv.org/pdf/1602.02830v3.pdf 14

Improvements by Binarization w 0 (Bias) w 1 x 1 x1 x2 Y x1 x2 Y 1 1 1 0 0 1 1 +1 1 0 1 0 +1 1 1 1 0 0 +1 +1 1 1 1 1 w 2 Y f sgn (Y) z x 2... w n x n EXNORs Many MACs Binary Precision On chip Memory 15

Near Memory Realization by Binarization High bandwidth (Left) Less power consumption (Right) On-chip Memory J. Dean, Numbers everyone should know Source: https://gist.github.com/2841832 E. Joel et al., Tutorial on Hardware Architectures for Deep Neural Networks, MICRO 49, 2016. 16

Typical CNN for Classification Input image Feature maps 5 CONV+Pooling CONV+Pooling... Feature extraction layers 2 0 6 4 7 1 5 3 8 9 Classification layers 17

Hypothesis Does binarized feature map has a location? Yes Input image Feature maps Regression CONV+Pooling CONV+Pooling... 5 2 0 6 4 7 1 5 3 8 9 Feature extraction layers Classification 18

Problem Low precision NN is hard to regress a function Example: sin(x) regression using a NN (3 layers) Sin(x) Sin(x) Float32NN (a) Float 32 bit for activation and weight (b) Float32 for activation and binary weight Miss localization (c) All binarized BinNN 19

Proposed YOLOv2 Feature extraction layer: Binary precision Localization and classification layer: Half precision class CNN (Feature extraction) F. maps Flatten SVR Parallel SVR x yh w conf Feature Extraction (+Location) Localization + Classification 20

Support Vector Regression (SVR) Regression version of the Support Vector Machine (SVM) *1 Passive aggressive (On line) training is supported *2 Model decompression (sparse like) can be applied *3 : weight, : th input, and : bias. *1 H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola and V. N. Vapnik, Support Vector Regression Machines," Neural Information Processing Systems, No. 9, NIPS 1996, pp. 155 161, 1997. *2 M. Martin, On Line Support Vector Machine Regression," ECML, pp.282 294, 2002. *3 T. Downs, K. E. Gates and A. Masters, Exact simplication of support vector solutions," Journal of Machine Learning Research, Vol. 2, 2001, pp. 293 297. 21

Outline Background Convolutional Neural Network (CNN) Mixed precision CNN for a Lightweight YOLOv2 Binary precision CNN Half precision support vector regression (SVR) FPGA Implementation Experimental Results Conclusion 22

Pipelined Conv2D Circuit Read M F.Maps at a time Shift Register (2L+K bits) x 00 x 01 x 02 x 03 x 04 x 22 x 21 x 20 x 14 x 13 x 12 x 11 x 10 x 04 x 03 x 02 x 01 x 00 x 10 x 11 x 12 x 13 x 14 x 20 x 21 x 22 x 23 x 24 x 30 x 31 x 32 x 33 x 34 x 40 x 41 x 42 x 43 x 44 Binarized Feature Maps (L=5, K=3) 9 Binarized Weight Mem. Binarized MACs (EXNORs + Adder Tree) Integer Bias Mem. + Sign bit Write Ctrl. Logic Counter B. Bosi, G. Bois and Y. Savaria, Reconfigurable pipelined 2 D convolvers for fast digital signal processing," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 3, pp. 299 308, 1999.

Parallel Binarized CNN Circuit Update Ni+1 bits Read Ni bits v v v Ni F. maps 2L+K bits Shift Register v Binarized MACs v + Sign Write Ctrl. Logic B. Weight Mem. Int. Bias Mem. On Chip Memory Counter 24

Parallel SVR Circuit Index Weight cache bias From Binarized F. Map memory -1 w w 0 1 + Reg Clear + SVR for class_20 Circuit for a SVR SVR for class_1 Feature Extracted CNN Binarized Feature map memory SVR for conf SVR for w SVR for h SVR for y SVR for x Localization and Classification Parallel SVR 25

Overall Architecture Host ARM Processor AXI4 BUS W.C. SVR for class_20 v v v F. map memory Streamingv Binarized v 2D Conv. Binarized Circuit Weight Cache W.C. SVR for conf. W.C. SVR for x Index Weight Cache (W.C.) b Binarized Weight Cache 1 w w 0 1 + Reg Clear + 26

Outline Background Convolutional Neural Network (CNN) Mixed precision CNN for a Lightweight YOLOv2 Binary precision CNN Half precision support vector regression (SVR) FPGA Implementation Experimental Results Conclusion 27

Training Result Environment CNN: Proposed YOLOv2 Binarized Darknet19+SVR Dataset: Pascal VOC2007 21 class, 224x224 image size Framework Binary precision CNN: GUINNES *1 Half precision SVR: Pegasos *2 Accuracy (map) 67.6% *1 H. Nakahara et. al, A demonstration of the GUINNESS: A GUI based neural NEtwork SyntheSizer for an FPGA, FPL, 2017, page 1. https://github.com/hirokinakahara/guinness *2 S. S. Shwartz et. al, Pegasos: primal estimated sub gradient solver for SVM, Mathematical Programming, Vol. 127, No. 1, 2011, pp. 3 30. https://github.com/ejlb/pegasos 28

Implementation Setup Board: Xilinx Inc. Zynq UltraScale+ MPSoC zcu102 evaluation board Zynq UltraScale+ MPSoC FPGA Design tool: SDSoC 2017.4 Timing constraint: 299.97MHz Module #LUTs #FFs #18Kb BRAMs #DSP48Es Binary CNN (2D bin. Conv.) 108,138 (103,924) 358,868 (313,839) 1680 (0) 135 (0) Parallel SVR 27,243 11,431 26 242 Total (%) 135,381 (49.3) 370,299 (67.5) 1,706 (93.5) 377 (14.9) 29

Comparison Platform Embedded CPU Embedded GPU FPGA Device Quad core ARM Cortex A57 256 core Pascal GPU Zynq UltraScale+ MPSoC Clock Freq. [GHz] 1.9 1.3 0.3 Memory 32 GB emmc Flash 8GB LPDDR4 32.1 Mb BRAM Time [msec] (FPS) [sec 1 ] NVidia Jetson TX2 4210.0 (0.23) 715.9 (1.48*) Xilinx ZCU102 24.5 (40.81) Dynamic Power [W] 4.0 7.0 4.5 Efficiency [FPS/W] 0.057 0.211 9.060 * Chainer (version 1.24.0), source code: https://github.com/leetenki/yolov2 30

Conclusion Lightweight YOLOv2 for a real time object detector Mixed precision CNN Binary precision CNN: Feature extraction Half precision SVR: Classification and localization FPGA Implementation Outperforms an embedded GPU and a CPU Future Work: Applied to CNN based applications Single shot object detector (SSD, PVANet) Semantic segmentation (FCN, U Net) Pose estimation (OpenPose) CNN SLAM 31

Thank you! Contact: Hiroki Nakahara, Tokyo Institute of Technology nakahara@ict.e.titech.ac.jp 32