Implementing Deep Learning for Video Analytics on Tegra X1.

Similar documents
NVIDIA FOR DEEP LEARNING. Bill Veenhuis

Deploying Deep Learning Networks to Embedded GPUs and CPUs

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

Deep Learning for Computer Vision II

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

Deep Learning with Tensorflow AlexNet

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017

Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Deep learning in MATLAB From Concept to CUDA Code

POINT CLOUD DEEP LEARNING

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Deep Face Recognition. Nathan Sun

Accelerating Convolutional Neural Nets. Yunming Zhang

Convolutional Neural Networks

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Deep Residual Learning

High Performance Computing

Deep Learning for Computer Vision with MATLAB By Jon Cherrie

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Yiqi Yan. May 10, 2017

Spatial Localization and Detection. Lecture 8-1

A performance comparison of Deep Learning frameworks on KNL

Wu Zhiwen.

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

High-Performance Data Loading and Augmentation for Deep Neural Network Training

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor

CNN optimization. Rassadin A

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

Fuzzy Set Theory in Computer Vision: Example 3

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

Deep Learning and Its Applications

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

arxiv: v1 [cs.cv] 20 Dec 2016

Research Faculty Summit Systems Fueling future disruptions

Deep Learning Based Real-time Object Recognition System with Image Web Crawler

Getting started with Caffe. Jon Barker, Solutions Architect

Flow-Based Video Recognition

Como funciona o Deep Learning

Autonomous Driving Solutions

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

World s most advanced data center accelerator for PCIe-based servers

CNN Basics. Chongruo Wu

Neural Networks for unsupervised learning From Principal Components Analysis to Autoencoders to semantic hashing

Weighted Convolutional Neural Network. Ensemble.

Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications

Todo before next class

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Fei-Fei Li & Justin Johnson & Serena Yeung

Object Detection. Part1. Presenter: Dae-Yong

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

Improving Face Recognition by Exploring Local Features with Visual Attention

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

Channel Locality Block: A Variant of Squeeze-and-Excitation

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Joint Object Detection and Viewpoint Estimation using CNN features

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

Face Recognition A Deep Learning Approach

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

Deep Learning for Vision

Graphics Processor Acceleration and YOU

arxiv: v1 [cs.cv] 11 Feb 2018

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

Perceptron: This is convolution!

NVIDIA PLATFORM FOR AI

Face Detection and Tracking Control with Omni Car

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

OBJECT DETECTION HYUNG IL KOO

Scaling Deep Learning. Bryan

Inception Network Overview. David White CS793

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Computing the Stereo Matching Cost with CNN

DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA

Binary Convolutional Neural Network on RRAM

Object detection with CNNs

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Real-time convolutional networks for sonar image classification in low-power embedded systems

An Exploration of Computer Vision Techniques for Bird Species Classification

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Scalable Distributed Training with Parameter Hub: a whirlwind tour

All You Want To Know About CNNs. Yukun Zhu

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Transcription:

Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com

Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning recipes + deployment Multi-DNN performance on TX1 Latency per architecture / layer

Who we are Facial recognition company GPU-powered solutions for security and marketing Offices in Barcelona, Madrid, London, Los Angeles

What we do Real-time, unconstrained facial analysis on HD streams. Video surveillance Facial recognition in crowds, live and forensic (up to 400 fps). Facial video analytics Gender, age, ethnicity, facial expression, counting... YOU Bar We know what you re drinking HERTA booth 727 Live demos (discrete + embedded)

Video analytics pipeline Response Video decoding Detection and preprocessing DNN inference Fully local processing Enables targeted response (e.g., digital signage) Send aggregated statistics to remote DB

Video decoding Tegra X1 features on-die hardware video decoder t=0 ms t=33 ms (30 FPS) Time CPU Video Decoding GPU face detection GPU DNN GPU VD GPU face detection GPU DNN Stages of video acquisition: RTSP parsing Video demuxing Video decoding

Face detection DNN-based face detection (sliding-window): Still slow, even on discrete GPUs! Soft cascade of CNN2 (2015) [1] 640x480 15 FPS Tesla K20 HOG + DNN (2015) [2] 640x480 0.7 FPS GTX 760 Not yet feasible for 4K on power-constrained TX1. Instead, dedicated cascade using custom CUDA kernels Our boosting approach 3840x2160 up to 30 FPS Tegra X1 Our boosting approach 3840x2160 up to 150 FPS Quadro K2200 [1] Angelova et al, Real-time pedestrian detection with deep cascades. BMVC 15 [2] Luo et al, Switchable deep network for pedestrian detection. CVPR 15

Preprocessing (I) Preprocessing: Independent tasks for detection, normalization, alignment 38 PREPROCESS DNN White Detect Localize Align Frontalize Extract & classify

Effects of custom preprocessing Feeding the network: eye-aligned vs. frontalized faces Higher accuracy >3X faster convergence

DNN deployment Deployment: very fast, models are already trained. Basically: sequence of matrix multiplications, (GEMV / GEMM) And a bunch of extremely fast CUDA kernels. (pooling, LRN, ReLU ) Not a big deal for TK1 / TX1, right? But how deep is deep? AlexNet 8 layers VGG 19 layers Inception 22-75 layers (+most with parallel subnetworks) ResNet 152 layers plus 4K video processing, multi-net inference All real-time.

DNN deployment (II) Alternatives: busy DNN, or independent custom tasks A good compromise: efficient preprocessing, shallower guided DNNs. Busy DNN: Localization + classification Detection and preprocessing Guided DNN (8-15 layers)

DNN recipes (I) DATA ARCHITECTURE n K obj Θ = l(y i, y i ) + Ω f k i LOSS Softmax Euclidean Contrastive Info gain k=1 REGULARIZERS Weight decay Dropout Batch normalization Stochastic pooling MaxOut Multi-net averaging

DNN recipes (II) LOSSES SoftMax: great for binary / multiclass classification Information gain: for unbalanced datasets Contrastive + Information gain: for Siamese nets Euclidean : for regression, much better after tuning relaxed on elders tougher on children

DNN recipes (III) REGULARIZERS Dropout: always great to generalize small datasets. Batch normalize: for us, better after PReLU + dropout! Stochastic pooling: mixed results. MaxOut: basically, partial multi-net averaging. Multi-net averaging: good to generalize, but expensive. ARCHITECTURE Weight initialization: Xavier / Fan-out vs Average / He Activation functions: ReLU / PReLU / ELU ( ELU need deeper nets to shine )

log (weight decay) DNN recipes (IV) ARCHITECTURE Convolutional autoencoders: just for initialization. Common arch + task-oriented heads: dedicated is better. Common baseline Gender Ethnicity Hyperparam cross-validation: random walk / Monte-Carlo. log (learning rate)

Implementation summary Demographic estimation on TX1: H.264/HVEC video decoding with NVIDIA s GStreamer plugins Custom CUDA kernels for detection + preprocessing cudnn v4 DNN layers: Convolution, MaxPool, SoftMax Custom DNN features + layers (not yet in cudnn): He, PReLU, MaxOut, StdPool Custom DNN inference engine on TX1. OpenGL for display

Multi-DNN performance on TX1 Simultaneous estimation of Gender + Ethnicity + Age in HD streams. Latency per architecture Latency per layer Intel i7-4770 4-core @3.5GHz lrn 9% prelu 2% stdpool 3% maxout softmax Tegra X1 2x speedup GTX 770-Ti 20x speedup conv 42% fc 44% 0 10 20 30 40 50 60 Latency (ms) Next: bottleneck convs, FP16, better kernels

We acknowledge NVIDIA for their support, through the UPC/BSC GPU Center of Excellence. Nacho Navarro (1958 2016) Expert in OS, enjoyed the internals of UNIX, Mach, Chorus, Exokernels Associate Professor at DAC (UPC) Leader of the group Accelerators for HPC (BSC) Leader of the BSC/UPC NVIDIA GPU CoE Pioneered research in GPU systems Automatic CPU + GPU memory management (adopted by industry in CUDA 4/6) Transparent multi-gpu programming Architectural support for making GPUs first-class OS citizens Established and led the NVIDIA BSC GPU Center of Excellence (since 2011) Teaching GPU programming, introducing CUDA to undergraduate courses @ UPC PUMPs summer school with Wen-mei Hwu and David Kirk (since 2009) PRACE introduction to CUDA programming (since 2012) Managing GPU development machines supporting 150+ people