EVA²: Exploiting Temporal Redundancy In Live Computer Vision. Mark Buckler, Philip Bedoukian, Suren Jayasuriya, Adrian Sampson. International Symposium on Computer Architecture (ISCA), Tuesday June 5, 2018

Convolutional Neural Networks (CNNs) 2

Convolutional Neural Networks (CNNs) 3

FPGA Research: Suda et al., Zhang et al., Qiu et al., Farabet et al., many more. ASIC Research: ShiDianNao, Eyeriss, EIE, many more. Embedded Vision Accelerators: Industry Adoption 4

Temporal Redundancy Frame: 0 1 2 3 Input Change: High Low Low Low 5

Temporal Redundancy Frame: 0 1 2 3 Input Change: High Low Low Low Cost to Process: High High High High 6

Temporal Redundancy Frame: 0 1 2 3 Input Change: High Low Low Low Cost to Process: High, High→Low, High→Low, High→Low 7

Talk Overview Background Algorithm Hardware Evaluation Conclusion 8

Talk Overview Background Algorithm Hardware Evaluation Conclusion 9

Common Structure in CNNs Image Classification Object Detection Semantic Segmentation Image Captioning 10

Common Structure in CNNs Intermediate Activations Frame 0 Prefix High energy Suffix Low energy Frame 1 Prefix High energy Suffix Low energy #MakeRyanGoslingTheNewLenna 11

Common Structure in CNNs Intermediate Activations Key Frame Motion Prefix High energy Motion Suffix Low energy Predicted Frame Prefix High energy Suffix Low energy #MakeRyanGoslingTheNewLenna 12

Common Structure in CNNs Intermediate Activations Key Frame Motion Prefix High energy Motion Suffix Low energy Predicted Frame Prefix Suffix Low energy #MakeRyanGoslingTheNewLenna 13

Talk Overview Background Algorithm Hardware Evaluation Conclusion 14

Activation Motion Compensation (AMC) Time Input Frame Vision Computation Vision Result Stored Activations Key Frame t Prefix Suffix Predicted Frame t+k Motion Estimation Motion Compensation Suffix Motion Vector Field Predicted Activations 15

Activation Motion Compensation (AMC) Time Input Frame Vision Computation Vision Result Stored Activations Key Frame t Prefix Suffix ~10^11 MACs Predicted Frame t+k Motion Estimation Motion Compensation Suffix ~10^7 Adds Motion Vector Field Predicted Activations 16

AMC Design Decisions How to perform motion estimation? How to perform motion compensation? Which frames are key frames? 17

Motion Estimation We need to estimate the motion of activations by using pixels Prefix Suffix Motion Estimation Performed on Pixels Motion Compensation Performed on Activations Suffix 22

Pixels to Activations Input Image 3x3 Conv 64 Intermediate Activations 3x3 Conv 64 Intermediate Activations 23

Pixels to Activations: Receptive Fields C=3 C=64 C=64 w=h=8 Input Image 3x3 Conv 64 Intermediate Activations 3x3 Conv 64 Intermediate Activations 24

Pixels to Activations: Receptive Fields C=3 C=64 C=64 w=h=8 5x5 Receptive Field Input Image 3x3 Conv 64 Intermediate Activations 3x3 Conv 64 Intermediate Activations Estimate motion of activations by estimating motion of receptive fields 25
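The 5x5 receptive field above follows mechanically from stacking two stride-1 3x3 convolutions. A minimal sketch of the standard receptive-field recurrence (illustrative, not code from the paper):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # field size, and input-pixel spacing between outputs
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # striding spreads later taps further apart
    return rf

# Two stride-1 3x3 convolutions, as on the slide:
print(receptive_field([(3, 1), (3, 1)]))  # -> 5
```

This is why motion estimated for a 5x5 pixel block can stand in for motion of the corresponding activation.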

Receptive Field Block Motion Estimation (RFBME) Key Frame Predicted Frame 26

Receptive Field Block Motion Estimation (RFBME) 0 1 2 3 0 1 2 3 Key Frame Predicted Frame 27

Receptive Field Block Motion Estimation (RFBME) 0 1 2 3 0 1 2 3 Key Frame Predicted Frame 28
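As an illustrative sketch of the idea behind RFBME (a naive exhaustive search; the paper's version re-uses computation across overlapping tiles), block matching over receptive-field-sized tiles could look like:

```python
import numpy as np

def rfbme(key, pred, block=5, radius=3):
    """Toy receptive-field block motion estimation (a sketch, not the
    paper's optimized algorithm).

    For each block-sized tile of the predicted frame, exhaustively search
    a +/-radius window in the key frame for the best match (minimum sum of
    absolute differences). Returns one (dy, dx) vector per tile origin,
    plus the total matching error across all tiles.
    """
    H, W = pred.shape
    vectors, total_err = {}, 0.0
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            tile = pred[y:y+block, x:x+block]
            best = (np.inf, (0, 0))
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy <= H - block and 0 <= sx <= W - block:
                        ref = key[sy:sy+block, sx:sx+block]
                        sad = np.abs(tile - ref).sum()
                        if sad < best[0]:
                            best = (sad, (dy, dx))
            vectors[(y, x)] = best[1]
            total_err += best[0]
    return vectors, total_err
```

The aggregate error is what the key-frame decision (later in the talk) compares against a threshold.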

AMC Design Decisions How to perform motion estimation? How to perform motion compensation? Which frames are key frames? 29

Motion Compensation C=64 C=64 Vector: X = 2.5 Y = 2.5 Stored Activations Predicted Activations Subtract the vector to index into the stored activations Interpolate when necessary 30
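A minimal dense sketch of this warping step, assuming a single activation channel and one motion vector for the whole channel (the real EVA² design applies vectors per receptive field and uses sparse techniques):

```python
import numpy as np

def warp_activations(stored, vx, vy):
    """Motion-compensate one channel of stored key-frame activations.

    Each output location (y, x) reads stored[y - vy, x - vx], bilinearly
    interpolating when the vector is fractional (e.g. X = 2.5, Y = 2.5).
    Out-of-bounds reads contribute zero.
    """
    H, W = stored.shape
    out = np.zeros_like(stored, dtype=float)
    for y in range(H):
        for x in range(W):
            sy, sx = y - vy, x - vx              # subtract the vector
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            fy, fx = sy - y0, sx - x0            # fractional parts
            acc = 0.0
            for dy, wy in ((0, 1 - fy), (1, fy)):
                for dx, wx in ((0, 1 - fx), (1, fx)):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        acc += wy * wx * stored[yy, xx]
            out[y, x] = acc
    return out
```

With an integer vector the interpolation weights collapse to a plain shifted copy; only fractional vectors blend four neighbors.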

AMC Design Decisions How to perform motion estimation? How to perform motion compensation? Which frames are key frames? 31

When to Compute Key Frame? System needs a new key frame when motion estimation fails: De-occlusion New objects Rotation/scaling Lighting changes 32

When to Compute Key Frame? System needs a new key frame when motion estimation fails: De-occlusion New objects Rotation/scaling Lighting changes So, compute key frame when RFBME error exceeds set threshold Yes Prefix Input Frame Motion Estimation Error > Thresh? Suffix No Key Frame Motion Compensation Vision Result 33
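The decision flow on this slide can be sketched as a control loop; every parameter here (prefix, suffix, estimate_motion, warp, thresh) is a placeholder for the corresponding component, not an API from the paper:

```python
def amc_pipeline(frames, prefix, suffix, estimate_motion, warp, thresh):
    """Sketch of the AMC control loop.

    Run the expensive prefix only on key frames; for other frames,
    estimate motion against the last key frame and, when the matching
    error stays under the threshold, warp the stored activations
    instead of recomputing them.
    """
    key_pixels, key_acts = None, None
    results = []
    for frame in frames:
        if key_pixels is not None:
            vectors, error = estimate_motion(key_pixels, frame)
            if error <= thresh:
                acts = warp(key_acts, vectors)  # cheap: ~10^7 adds
                results.append(suffix(acts))
                continue
        key_pixels = frame                      # declare a new key frame
        key_acts = prefix(frame)                # expensive: ~10^11 MACs
        results.append(suffix(key_acts))
    return results
```

A high error (de-occlusion, new objects, rotation/scaling, lighting changes) falls through to the key-frame path automatically.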

Talk Overview Background Algorithm Hardware Evaluation Conclusion 34

Embedded Vision Accelerator Global Buffer Eyeriss (Conv) EIE (Full Connect) Prefix Suffix Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, EIE: Efficient inference engine on compressed deep neural network, 35

Embedded Vision Accelerator Accelerator (EVA²) Global Buffer EVA² Eyeriss (Conv) EIE (Full Connect) Motion Estimation Motion Compensation Prefix Suffix Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, EIE: Efficient inference engine on compressed deep neural network, 36

Embedded Vision Accelerator Accelerator (EVA²) Frame 0 37

Embedded Vision Accelerator Accelerator (EVA²) Frame 0: Key frame 38

Embedded Vision Accelerator Accelerator (EVA²) Frame 1 Motion Estimation 39

Embedded Vision Accelerator Accelerator (EVA²) Frame 1: Predicted frame Motion Estimation Motion Compensation EVA² leverages sparse techniques to save 80-87% storage and computation 40

Talk Overview Background Algorithm Hardware Evaluation Conclusion 41

Evaluation Details
Train/Validation Dataset: YouTube Bounding Box (object detection & classification)
Evaluated Networks: AlexNet; Faster R-CNN with VGGM and VGG16
Hardware Baseline: Eyeriss & EIE performance scaled from papers
EVA² Implementation: written in RTL, synthesized with 65nm TSMC 42

EVA² Area Overhead Total 65nm area: 74 mm². EVA² takes up only 3.3% 43

EVA² Energy Savings [bar charts, slides 44-46: normalized energy for original (orig), predicted (pred), and average (avg) frames on AlexNet, Faster16, and FasterM, broken down into Eyeriss (prefix), EIE (suffix), and EVA² (motion estimation/compensation) components] 44-46

High Level EVA² Results

Network             Vision Task     Keyframe %  Accuracy Degradation  Avg Latency Savings  Avg Energy Savings
AlexNet             Classification  11%         0.8% top-1            86.9%                87.5%
Faster R-CNN VGG16  Detection       36%         0.7% mAP              61.7%                61.9%
Faster R-CNN VGGM   Detection       37%         0.6% mAP              54.1%                54.7%

EVA² enables 54-87% savings while incurring <1% accuracy degradation. The adaptive key frame choice metric can be adjusted 47

Talk Overview Background Algorithm Hardware Evaluation Conclusion 48

Conclusion Temporal redundancy is an entirely new dimension for optimization. AMC & EVA² improve efficiency and are highly general: applicable to many different applications (classification, detection, segmentation, etc.), hardware architectures (CPU, GPU, ASIC, etc.), and motion estimation/compensation algorithms 49


Backup Slides 51

Why not use vectors from the video codec/ISP? We've demonstrated that the ISP can be skipped (Buckler et al. 2017). No need to compress video which is instantly thrown away. Can save energy by power gating the ISP. Opportunity to set our own key frame schedule. However, it's a great idea for pre-stored video! 52

Why Not Simply Subsample? If a lower frame rate is needed, simply apply AMC at that frame rate: AMC still provides warping and adaptive key frame choice 53

Different Motion Estimation Methods Faster16 FasterM 54

Difference from Deep Feature Flow? Deep Feature Flow also exploits temporal redundancy, but:

                               AMC and EVA²                 Deep Feature Flow
Adaptive key frame rate?       Yes                          No
On-chip activation cache?      Yes                          No
Learned motion estimation?     No                           Yes
Motion estimation granularity  Per receptive field          Per pixel (excess granularity)
Motion compensation            Sparse (four-way zero skip)  Dense
Activation storage             Sparse (run length)          Dense

55

Difference from Euphrates? Euphrates has a strong focus on SoC integration. Motion estimation comes from the ISP (we may want to skip the ISP to save energy & create a more optimal key frame schedule). Motion compensation operates on bounding boxes: it skips the entire network, but is only applicable to object detection 56

Re-use Tiles in RFBME 57

Changing Error Threshold 58

Different Adaptive Key Frame Metrics 59

Normalized Latency & Energy 60

How about Re-Training? 61

Where to cut the network? 62

#MakeRyanGoslingTheNewLenna Lenna dates back to 1973. We need a new test image for image processing!