Deep Learning Processing Technologies for Embedded Systems. October 2018

Virginia Delilah Brooks
1 Deep Learning Processing Technologies for Embedded Systems. October 2018

2 Neural Networks Architecture
  Single Neuron
  DNN
  Multi Task NN
  Multi-Task Vehicle Detection With Region-of-Interest Voting

3 Popular Computer Vision Tasks
  Object Classification (example label: CAT)
  Object Detection
  Semantic Segmentation

4 Case Study: Rear-View Camera
  Use Case I: Rear-Collision Warning (RCW)
  Use Case II: Lane-Change Assist (LCA)
  Use Case III: Parking Assist (PA)

5 Use case Example: Rear-Collision Warning
Scenario
  Environment: High-speed cruising
  Hovering speed (rear car): 120 km/h
  Max closing speed: 80 km/h
  Typical acceleration: 3 m/s²
  Viewing angle: 120 deg
System Requirements
  Policy: Always-on
  Strategy: Match speed using ACC
  Safety margin: 2 sec
[Diagram labels: 120 km/h, 80 km/h]

6 Use case Example: Rear-Collision Warning
[Diagram labels: W, H, 120 km/h, 80 km/h]
  ΔV = 11 m/sec
  D = 50 m
  W car = ~8 pixels
  W = 174 m ~ 640 pixels
Perception Subsystem Requirements
  Resolution: 640x480 pixels
  2 positive detections: 0.65 sec
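The figures on slides 5 and 6 can be roughly cross-checked with basic geometry and kinematics. The sketch below is a reconstruction, not the presenter's stated derivation. It assumes the 120 deg viewing angle is the horizontal field of view of an ideal pinhole camera, a hypothetical car width of 2.0 m, and a distance budget made of three parts at the closing speed: the 0.65 sec detection time, the distance closed while the relative speed is brought to zero at 3 m/s², and the 2 sec safety margin. All function names are illustrative.

```python
import math

KMH_TO_MS = 1000.0 / 3600.0  # km/h to m/s


def scene_width_m(distance_m, fov_deg):
    """Width of the scene covered at a given distance (ideal pinhole camera)."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)


def object_width_px(object_width_m, distance_m, fov_deg, image_width_px):
    """Approximate on-image width of an object centered in the view."""
    return object_width_m / scene_width_m(distance_m, fov_deg) * image_width_px


def detection_distance_m(closing_ms, accel_ms2, detect_time_s, margin_s):
    """Assumed budget: detection time + speed matching + safety margin."""
    detect = closing_ms * detect_time_s          # closed while detecting
    match = closing_ms ** 2 / (2.0 * accel_ms2)  # closed while matching speed
    margin = closing_ms * margin_s               # safety margin
    return detect + match + margin


closing = (120 - 80) * KMH_TO_MS              # about 11.1 m/s
width = scene_width_m(50.0, 120.0)            # about 173 m
car_px = object_width_px(2.0, 50.0, 120.0, 640)  # assumed 2.0 m wide car
distance = detection_distance_m(closing, 3.0, 0.65, 2.0)

print(f"closing speed: {closing:.1f} m/s")
print(f"scene width at 50 m: {width:.1f} m")
print(f"car width on image: {car_px:.1f} px")
print(f"required detection distance: {distance:.1f} m")
```

Under these assumptions the scene width comes out near the slide's 174 m, the car spans between 7 and 8 pixels, and the distance budget lands close to the slide's 50 m. Other decompositions of the 50 m figure are possible.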

7 Use case Example: Rear-Collision Warning
Neural Network Model Definition
  Task: Object Detection
  Model: SSD detector
  Feature extractor: VGG-16
  Compute per frame: ~106 GMACs
  mAP (VOC2007): 75.1%
  Frames for 1 detection (5σ): 8
  Compute at 20 fps: 2.1 TMAC/s
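The compute figure on this slide is the per-frame cost multiplied by the frame rate. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def required_throughput_tmacs(gmacs_per_frame, fps):
    """Sustained compute needed, in tera-MACs per second."""
    return gmacs_per_frame * fps / 1000.0


throughput_tmacs = required_throughput_tmacs(106, 20)
print(f"{throughput_tmacs:.2f} TMAC/s")  # prints "2.12 TMAC/s"
```

This matches the slide's rounded value of 2.1.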

8 Theoretical Efficiency
[MAC diagram: Input (8b), Weight (8b), Psum (16b), Sample (16b)]
  Energy: ~0.1 pJ / cycle
  Theoretical Efficiency: ~10 TMACs/W
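The efficiency figure follows from the energy figure if "~0.1 pJ / cycle" is read as the energy of one multiply-accumulate, which is an assumption about the slide's intent. The sketch below converts energy per MAC into TMACs/W and then, as an idealized lower bound only, estimates the power needed for the 2.1 TMAC/s workload from slide 7. Function names are illustrative.

```python
def tmacs_per_watt(energy_per_mac_pj):
    """MACs per joule, expressed in tera-MACs per second per watt."""
    macs_per_joule = 1.0 / (energy_per_mac_pj * 1e-12)
    return macs_per_joule / 1e12


def ideal_power_w(workload_tmacs, efficiency_tmacs_per_w):
    """Power at a given efficiency, ignoring memory, control and I/O."""
    return workload_tmacs / efficiency_tmacs_per_w


efficiency = tmacs_per_watt(0.1)        # about 10 TMACs/W
power = ideal_power_w(2.1, efficiency)  # about 0.21 W

print(f"theoretical efficiency: {efficiency:.1f} TMACs/W")
print(f"ideal power for 2.1 TMAC/s: {power:.2f} W")
```

Real processors fall short of this bound because data movement and control also consume energy, which is the point of the comparison on the next slide.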

9 Real-world Processors Efficiency for Deep Learning Image Processing
[Chart: effective TMACs/W on an image classification inference task, based on vendors' public benchmarks; bar values not preserved]
  Embedded: Intel Movidius Myriad 2, Nvidia Parker, Huawei Kirin970 - Cambricon (*), Qualcomm Snapdragon 835 (*)
  Datacenter: Graphcore IPU (**, ***), Google TPU (***), Nvidia Pascal P4 (***), Nvidia Volta V100 (***)
  (*) Excluding host and memory
  (**) Estimated
  (***) Using batch-processing

10 Neural Networks - Observations
Control
  Flexibility during compile time
  Fully deterministic in runtime
Memory
  Parameters and partial sums are localized
  Layer outputs move around (but not too far)
Compute
  Recurring operations (MACs >> Activations)
  Mostly low precision
[Diagram labels: Interconnect, Control, Memory, Compute]
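The "MACs >> Activations" observation can be made concrete for a single convolution layer: each output value needs kernel x kernel x input-channel MACs but only one activation-function evaluation. The layer shape below is hypothetical and chosen only for illustration.

```python
def conv_layer_counts(out_h, out_w, in_ch, out_ch, kernel):
    """Return (MAC count, activation count) for one convolution layer."""
    activations = out_h * out_w * out_ch          # one per output value
    macs = activations * kernel * kernel * in_ch  # MACs behind each output
    return macs, activations


# Hypothetical layer: 56x56 output, 64 -> 64 channels, 3x3 kernel.
macs, activations = conv_layer_counts(56, 56, 64, 64, 3)
macs_per_activation = macs // activations

print(f"MACs: {macs:,}")
print(f"activations: {activations:,}")
print(f"MACs per activation: {macs_per_activation}")  # 3 * 3 * 64 = 576
```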

11 Neural Networks - Resource Balance
[Two plots along the network direction for GoogLeNet and ResNet-50: Cumulative Memory [KB] and Memory to compute ratio; axis tick values not preserved]
Memory, Control and Compute balance changes dramatically along the network's layers
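One way to see why the balance shifts along a network: for a convolution layer, the weight memory divided by the MAC count reduces to 1 / (output height x output width), so the ratio grows as feature maps shrink in deeper layers. The sketch below counts weight memory only (8-bit weights assumed) and uses hypothetical layer shapes, not the actual GoogLeNet or ResNet-50 layers plotted on the slide.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel):
    """MACs for one convolution layer."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch


def conv_weight_bytes(in_ch, out_ch, kernel, bytes_per_weight=1):
    """Weight storage for one convolution layer (8-bit weights by default)."""
    return kernel * kernel * in_ch * out_ch * bytes_per_weight


# Hypothetical layers: (name, out_h, out_w, in_ch, out_ch, kernel)
HYPOTHETICAL_LAYERS = [
    ("early", 112, 112, 64, 64, 3),
    ("middle", 28, 28, 256, 256, 3),
    ("late", 7, 7, 512, 512, 3),
]

ratios = {}
for name, out_h, out_w, in_ch, out_ch, kernel in HYPOTHETICAL_LAYERS:
    mem = conv_weight_bytes(in_ch, out_ch, kernel)
    macs = conv_macs(out_h, out_w, in_ch, out_ch, kernel)
    ratios[name] = mem / macs  # bytes of weights per MAC
    print(f"{name:>6}: {mem / 1024:8.0f} KB weights, {ratios[name]:.6f} bytes/MAC")
```

With these shapes the ratio rises by more than two orders of magnitude from the early to the late layer, which is consistent with the slide's claim that a fixed compute-to-memory ratio cannot fit every layer well.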

12 Approaches To Deep Learning Processing
Fixed Function Accelerator
  Theoretically optimal at a specific workload
  Minimal flexibility
Von-Neumann Architecture
  Temporal resource allocation
  Common memory space
  Classical flow control programming model
Symmetric Dataflow Architecture
  Spatial resource allocation
  Segregated memory spaces
  Balanced graph oriented

13 Von-Neumann Architecture
[Block diagram labels: Control, Compute, System Bus, Memory]
  Fixed Control to Compute Ratio
  Cycle by Cycle Flexibility
  Fixed Compute to Memory BW Ratio
  Full Addressability, Narrow Access

14 Symmetric Dataflow Architecture
[Block diagram labels: Control, Compute, I-Mem, D-Mem, Inter-element Bus]
  Element size: Utilization vs. Control and Comms overhead
  Fixed Compute to Memory BW and Size Ratio

15 Real-world Processors Efficiency for Deep Learning Image Processing
[Same chart as slide 9]

16 Use case Example Summary: Rear-Collision Warning
  Detection Distance: 50 m
  Camera Resolution: 0.3 MP
  Frame Rate: 20 fps
Hailo Based Solution
  Compute utilization: 20% (SOTA detection)
  Power consumption: <2 W (including sensor)

17 Thank You!
Icons from icons8.com
