Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks

Size: px

Start display at page:

Download "Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks"

Toby Bishop
5 years ago
Views:

1 Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks Forrest Iandola, Ben Landen, Kyle Bertin, Kurt Keutzer and the DeepScale Team

2 OFFLINE MAPS CAMERA SENSORS ULTRASONIC PATH PLANNING & ACTUATION RADAR LIDAR REAL-TIME PERCEPTION THE FLOW IMPLEMENTING AUTONOMOUS DRIVING

3 What does a car need to see? What does a car need to see? Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

4 What does a car need to see? Object Detection Cyclist (99%) Cyclist (99%) Pedestrian (99%) Pedestrian (99%) Vehicle (99%) Vehicle (98%) Vehicle (100%) Vehicle (100%) Vehicle (99%) Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

5 What does a car need to see? Distance Vehicle (99%) 15m Vehicle (100%) 10m Vehicle (98%) 20m Vehicle (100%) 14m Vehicle (99%) 18m Cyclist (99%) 16m Cyclist (99%) 14m Pedestrian (99%) 7m Pedestrian (99%) 7m Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

6 What does a car need to see? Object Tracking Vehicle (99%) 15m ID: 5 (95 frames) Vehicle (100%) 10m ID: 1 (135 frames) Vehicle (98%) 20m ID: 4 (140 frames) Vehicle (100%) 14m ID: 2 (140 frames) Vehicle (99%) 18m ID: 3 (140 frames) Cyclist (99%) 16m ID: 6 (90 frames) Cyclist (99%) 14m ID: 7 (95 frames) Pedestrian (99%) 7m ID: 8 (60 frames) Pedestrian (99%) 7m ID: 9 (60 frames) Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

7 What does a car need to see? Free Space & Driveable Area Vehicle (99%) 15m ID: 5 (95 frames) Vehicle (100%) 10m ID: 1 (135 frames) Vehicle (98%) 20m ID: 4 (140 frames) Vehicle (100%) 14m ID: 2 (140 frames) Vehicle (99%) 18m ID: 3 (140 frames) Cyclist (99%) 16m ID: 6 (90 frames) Cyclist (99%) 14m ID: 7 (95 frames) Pedestrian (99%) 7m ID: 8 (60 frames) Pedestrian (99%) 7m ID: 9 (60 frames) Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

8 What does a car need to see? Lane Recognition Vehicle (99%) 15m ID: 5 (95 frames) Vehicle (100%) 10m ID: 1 (135 frames) Vehicle (98%) 20m ID: 4 (140 frames) Vehicle (100%) 14m ID: 2 (140 frames) Vehicle (99%) 18m ID: 3 (140 frames) Cyclist (99%) 16m ID: 6 (90 frames) Cyclist (99%) 14m ID: 7 (95 frames) Pedestrian (99%) 7m ID: 8 (60 frames) Pedestrian (99%) 7m ID: 9 (60 frames) Note: above visuals are an artist s rendering created to help convey concepts. They should not be judged for accuracy.

com/man-vs-machine-my-rematchagainst-audis-new-self-driving-rs-7-21415540/ BMW + Intel https://newsroom.intel.

9 Today's autonomous cars require a lot of computing hardware! and perception is the most computationally-intensive part of the software stack Audi BMW + Intel Ford cted-cars-long-road-autonomousvehicles

10 Big computers = expensive cars

11 As a workaround, companies want people to share autonomous vehicles to amortize hardware costs

12 As a workaround, companies want people to share autonomous vehicles to amortize hardware costs

13 Shared autonomous vehicles will likely have some of the downsides as public transportation

14 Will better computer chips make autonomous cars affordable?

15 Will better computer chips make autonomous cars affordable?

16 Deep Learning Processors have arrived! THE SERVER SIDE Platform Computation (GFLOPS/s) NVIDIA K [1] (32-bit float) NVIDIA V100 [2] (16-bit float) Memory Bandwidth (GB/s) 208 (GDDR5) 900 (HBM2) Computationto-bandwidth ratio Power (TDP Watts) Year (yikes!) Uh-oh Processors are improving much faster than Memory. [1] [2] (PCIe version)

17 Deep Learning Processors have arrived! MOBILE PLATFORMS Device Cores Computation (GFLOPS/s) Samsung Galaxy Note 3 Huawei P20 NVIDIA Jetson Xavier [3,4] Arm Mali 120 T-628 GPU [1] (32-bit float) Kirin NPU [2] (16-bit float) NVIDIA Tensor Cores (8 32 int) Memory Bandwidth (GB/s) 12.8 (LPDDR3) 30 (LPDDR4X) Computationto-bandwidth ratio System Power (TDP Watts) 9.3 ~ (ouch!) (yikes!) ~ to 30 (multiple modes) Year 2018 [1] [2] [3] [4]

18 What will the next generation Deep Learning servers look like?

19 What will the next generation Deep Learning servers look like? 20 TOP/W COMPUTATION Platform Efficiency (TOP/s/W) Computation (TOP/s) NVIDIA K20 [1] (32-bit float) NVIDIA V100 [2] (16-bit float) Memory Bandwidth (TB/s) (GDDR5) (HBM2) Next-gen: * TOP/W (HBM3) [3] Computationto-bandwidth ratio Power (TDP Watts) Year (oh no!) (est.) [1] [2] (PCIe version) [3] * Assuming half the power is spent on computation, and the other half is spent on memory and other devices. 20 TOP/s/W * 20W * 0.5 = 2500 TOP/s

20 Small Neural Nets to the rescue

21 squeeze (verb): to make an AI system use less resources using whatever means necessary

22 Memory Footprint and Bandwidth Computational Operations Power and Energy Time squeeze (verb): to make an AI system use less resources using whatever means necessary

23 Memory Footprint and Bandwidth Computational Operations Power and Energy Time squeeze (verb): to make an AI system use less resources using whatever means necessary New DNN Models Applicationspecific Quantization and Pruning Superior Implementations Differentiated Data and Training Strategies

24 Most CV Applications Rely on Only a Few Core CV Capabilities Image Classification Object Detection Semantic Segmentation And the best accuracy for each of these capabilities is given by Convolutional Neural Nets

But We Need a Very Different Kind of DNN VGG16[1]

Watts 2-4 GB IOT Devices 100's MHz <1Watt <1 GB [1]

25 But We Need a Very Different Kind of DNN VGG16[1] model: - Parameter size: 552 MB - Memory: 93 MB/image - Computation: 15.8 GFLOPs/image TitanX: 11 TFLOPS, 223 Watts, 12 GB Memory DGX-1, 170 TFLOPS, 3.2 KWatts, 128 GB Memory Smartphones 100's GFLOPs 3 Watts 2-4 GB IOT Devices 100's MHz <1Watt <1 GB [1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arxiv: , 2014.

Speed more Related to Memory Accesses than

Size 32 KB 2 MB 4 GB Read Latency 4 cycles 22

26 Speed more Related to Memory Accesses than Operations Samsung Exynos M1 Access Times Galaxy S7 L1 Cache/TLB L2 Cache L1 D-Cache (per core) L2 Cache (shared) Off-chip DRAM Size 32 KB 2 MB 4 GB Read Latency 4 cycles 22 cycles ~200 cycles Read Bandwidth 20.8 GB/s GB/s 28.7 GB/s

27 Energy More Related to Memory Accesses than operations (45nm 0.9V) 8b INT Mult Energy (pj) b FP Mult 32b FP Mult 5.5x 18.5x 64b Cache Read (32KB) 100x 64b Cache Read (1MB) 500x DRAM 10,000x Mark Horowitz, Computing s Energy Problem (and what we can do about it), ISSCC 2014

8MB compresses to 500KB [1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton.

28 10,000 DNN Architectural Configurations Later: SqueezeNet (2016) AlexNet [1] SqueezeNet [2] CNN Top-5 Accuracy ImageNet Model Parameters Model Size AlexNet[1] 80.3% 60M 243MB SqueezeNet[2] 80.3% 1.2M 4.8MB compresses to 500KB [1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS2012 [2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 1MB model size." arxiv: (2016). (February 2016)

29 SqueezeNet: Immediate Success in Embedded Vision Apple CoreML NXP Embedded Vision Summit Qualcomm Facebook F8 Enabled embedded processor vendors (ARM, NXP, Qualcomm) to demo CNNs Quickly ported to all the major Deep Learning Frameworks

SqueezeDet for Object Detection (2017) Conv Det Filter feature map Input

4 Joules Frame Bounding boxes Final detections Best Paper Award: Bichen

SqueezeDet: Unified, small, low power fully convolutional neural

30 SqueezeDet for Object Detection (2017) Conv Det Filter feature map Input image ~2M model parameters 57 FPS 1.4 Joules Frame Bounding boxes Final detections Best Paper Award: Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer SqueezeDet: Unified, small, low power fully convolutional neural networks for realtime object detection for autonomous driving. In Proceedings, CVPR Embedded Computer Vision Workshop, July 2017.

"Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud.

31 SqueezeSeg: Semantic Segmentation for LIDAR (2018) LIDAR point cloud segmentation SqueezeSegV2: Higher accuracy: v1[1]: 64.6% -> v2[2]: 73.2% (+8.6%) Better Sim2Real performance: v1[1]: 30% -> v2[2]: 57.4% (+27.4%) Outperforms v1 trained on real data w/o intensity [1] Wu, Bichen, et al. "Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud." ICRA18 [2] Wu, Bichen, et al. "SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud." arxiv: (2018).

32 Squeeze Family SqueezeNet DiracDeltaNet SqueezeNext SqueezeDet SqueezeSeg-{v1, v2} DNASNet ShiftNet Image Classification Object Detection Semantic Segmentation

Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models

Family of pareto optimal models to target needs of the user.

V2 introduces Inverted Residuals and Linear Bottlenecks.

33 Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models Designed for efficiency on mobile phones. Family of pareto optimal models to target needs of the user. V1 based on Depthwise Separable Convolutions. V2 introduces Inverted Residuals and Linear Bottlenecks. Supports Classification, Detection, Segmentation and more. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications MobileNet V2: Inverted Residuals and Linear Bottlenecks

34 Model Compression DNN Architecture Search 10X 50X Slide Credit: Prof. Warren Gross (McGill Univ.)

35 Anatomy of a convolution layer x3 conv

36 Filters: Kernel Reduction x3 conv x reduction in model parameters 1

37 Filters/Channel Reduction x3 conv x3 conv x reduction in model parameters 128

38 Model Distillation/Compression Model Distillation Li, et al. Mimicking Very Efficient Network for Object Detection. CVPR, 2017.

39 Examples of what's on a DNN Architect's Palette Spatial Convolution e.g. 3x3 Pointwise Convolution 1x1 Depthwise Convolution Channel Shuffle Shift

The Art of Small Model Design Small Neural Nets Are Beautiful ESWeek 2017 The palette of an adept mobile/embedded DNN designer has grown very rich!

40 The Art of Small Model Design Small Neural Nets Are Beautiful ESWeek 2017 The palette of an adept mobile/embedded DNN designer has grown very rich! Overall architecture: economize on layers while retaining accuracy Layer types Kernel reduction: 5 x 5 3 x 3 1 x 1 Channel reduction: e.g. FireLayer Experiment with novel layer types that consume no FLOPS Shuffle Shift Model distillation: let big models teach smaller ones Apply pruning Tailor bit precision (aka quantization) to target processor Iandola, Forrest, and Kurt Keutzer. "Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures." In Proceedings of the Twelfth International Conference on Hardware/Software Codesign and System Synthesis Companion, p. 1. ACM, (ESWeek 2017). Also, (arxiv: )

41 Artistic/Engineering Process of Designing a Deep Neural Net Manual design: Each iteration to evaluate a point in the design space is very expensive Exploration limited by human imagination

42 Can we automate this? Manual design: Each iteration to evaluate a point in the design space is very expensive Exploration limited by human imagination

DNAS: Differentiable Neural Architecture Search Differentiable Neural Architecture Search: Extremely fast: 8 GPUs, 24 hours Can search for different

43 DNAS: Differentiable Neural Architecture Search Differentiable Neural Architecture Search: Extremely fast: 8 GPUs, 24 hours Can search for different conditions case-by-case Optimize for actual latency Bichen Wu, Kurt Keutzer, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia

44 DNAS in context (FLOPs to normalize comparison) ImageNet top-1 Accuracy -- Good DNASNet: (ours) Acc: 74.2%, FLOPs: 295M Search Cost: 216 GPU-hrs MnasNet: [6] Acc: 74.0, FLOPs: 317M Search Cost*: 91,000 GPU-hrs MobileNetV2: [4] Acc: 71.8%, FLOPs: 300M AMC: [5] Acc: 70.8%, FLOPs: 150M * Estimated from the paper description PNAS: [2] Acc: 74.2%, FLOPs: 588M Search cost*: 6,000 GPU-hrs NAS: [1] Acc: 74.0%, FLOPs: 564M Search cost: 48,000 GPU-hrs DARTS: [3] Acc: 73.1%, FLOPs: 595M Search cost: 288 GPU-hrs X-axis: FLOPs Y-axis: accuracy Mark size: search cost Circles: search cost unknown More FLOPs - BAD [1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arxiv preprint arxiv: (2017). [2] Liu, Chenxi, et al. "Progressive neural architecture search." arxiv preprint arxiv: (2017). [3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search." arxiv preprint arxiv: (2018) [4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR18 [5] He, Yihui, et al. "Amc: Automl for model compression and acceleration on mobile devices." Proceedings of the European Conference on Computer Vision (ECCV) [6] Tan, Mingxing, et al. "Mnasnet: Platform-aware neural architecture search for mobile." arxiv preprint arxiv: (2018).

45 DNAS for device-aware search For different targeted devices, both DNASNets achieve similar accuracy. However, per target DNN optimization was required NET Latency on iphonex Latency on Samsung S8 DNAS-iPhoneX ms ms (20% slower) DNAS-S ms (25% slower) Top-1 acc 73.20% ms 73.27%

The Future: Breaking down the wall between DNN Design & Hardware Design DNN Designers Unaware of arithmetic intensity Floating point vs fixed point

46 The Future: Breaking down the wall between DNN Design & Hardware Design DNN Designers Unaware of arithmetic intensity Floating point vs fixed point costs Memory hierarchy and latency NN HW Accelerator architects Using outdated models: - AlexNet - VGG16 Using irrelevant datasets: - MNIST - CIFAR

47 Key Takeaways Autonomous vehicles currently need thousands (or even hundreds of thousands) of dollars of computing hardware Processing is on a trajectory of rapid improvement (in operations-per-watt) but other aspects of the system (e.g. memory) are improving much more slowly today's neural networks will be choked by slow memory on tomorrow's DNN accelerators (this is already happening and will get worse) Designing new (smaller) neural networks helps with all of the following making full use of next-generation computing platforms reducing the hardware costs in autonomous vehicles enabling lower-cost, larger-scale rollouts of autonomous vehicles

Progressive Neural Architecture Search

Progressive Neural Architecture Search Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy 09/10/2018 @ECCV 1 Outline Introduction