Research Faculty Summit Systems Fueling future disruptions

1 Research Faculty Summit 2018 Systems Fueling future disruptions

2 Efficient Edge Computing for Deep Neural Networks and Beyond
Vivienne Sze
In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac Karaman, Luca Carlone, Amr Suleiman, Zhengdong Zhang
Massachusetts Institute of Technology
Contact Info Website:

3 Outline
Limitations of Existing Efficient DNN Approaches
Looking Beyond the DNN Accelerator for Acceleration
Looking Beyond DNNs: Other forms of inference at the edge
Slide 2

4 Limitations of Existing Efficient DNN Approaches
Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML 2018.
Slide 3

5 Energy-Efficient Processing of DNNs
A significant amount of algorithm and hardware research on energy-efficient processing of DNNs: eyeriss.mit.edu/tutorial.html
V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec
We identified various limitations to existing approaches.
Slide 4

6 Design of Efficient DNN Algorithms
Popular efficient DNN algorithm approaches: network pruning; compact network architectures (examples: SqueezeNet, MobileNet); also reduced precision.
[Figure: filters with dimensions R, S, C]
Focus on reducing the number of MACs and weights. Does it translate to energy savings?
Slide 5
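
To make "reducing number of MACs" concrete, here is a minimal Python sketch that counts MACs for a standard convolution and for a depthwise-separable one (the building block used by MobileNet). The layer sizes are hypothetical, and the symbols out_h, out_w, and m (number of filters) are assumptions added for illustration; only R, S, and C appear on the slide.

```python
def conv_macs(out_h, out_w, r, s, c, m):
    """MACs of a standard convolution.

    Each of the out_h * out_w * m output values needs r * s * c MACs
    (filter height r, filter width s, input channels c, m filters).
    """
    return out_h * out_w * m * r * s * c


def depthwise_separable_macs(out_h, out_w, r, s, c, m):
    """MACs of a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = out_h * out_w * c * r * s   # one r x s filter per input channel
    pointwise = out_h * out_w * c * m       # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 56x56 output, 3x3 filters, 128 input channels, 128 filters.
standard = conv_macs(56, 56, 3, 3, 128, 128)
compact = depthwise_separable_macs(56, 56, 3, 3, 128, 128)
reduction = standard / compact

print(f"standard: {standard} MACs, compact: {compact} MACs, {reduction:.1f}x fewer")
```

A lower MAC count says nothing yet about where the data comes from, which is why the slide asks whether the reduction translates to energy savings.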

7 Data Movement is Expensive
[Figure: memory hierarchy of DRAM, Global Buffer, PE array connected by a NoC, register file (RF), and ALU; data is fetched to run a MAC at the ALU]
Normalized energy cost* by data source: ALU 1x (reference); RF 1x; another PE over the NoC 2x; Global Buffer 6x; DRAM 200x.
The energy of a weight depends on the memory hierarchy and dataflow.
* measured from a commercial 65nm process
Slide 6

8 Energy-Evaluation Methodology
Inputs: DNN shape configuration (# of channels, # of filters, etc.); DNN weights and input data [0.3, 0, -0.4, 0.7, 0, 0, 0.1, ]; hardware energy costs of each MAC and memory access.
Memory accesses optimization: # acc. at mem. level 1, # acc. at mem. level 2, ..., # acc. at mem. level n, giving E_data.
# of MACs calculation: # of MACs, giving E_comp.
Output: DNN energy consumption per layer (L1, L2, L3, ...).
Energy estimation tool available at eyeriss.mit.edu
[Yang et al., CVPR 2017]
Slide 7
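
The methodology above can be sketched in a few lines of Python: energy is the MAC count times the cost of a MAC (E_comp) plus, for every memory level, the access count times that level's cost (E_data). This is only an illustration, not the tool at eyeriss.mit.edu. The per-level costs assume the normalized figures from the "Data Movement is Expensive" slide map to RF, NoC, Global Buffer, and DRAM as shown, and the access counts are hypothetical.

```python
# Normalized energy per access relative to one MAC.
# Assumption: 1x, 2x, 6x, 200x correspond to these levels.
ENERGY_PER_ACCESS = {"RF": 1.0, "NoC": 2.0, "Buffer": 6.0, "DRAM": 200.0}
ENERGY_PER_MAC = 1.0


def estimate_energy(num_macs, accesses):
    """Return (E_comp, E_data) in units of one MAC's energy.

    accesses maps a memory level name to the number of accesses at that level.
    """
    e_comp = num_macs * ENERGY_PER_MAC
    e_data = sum(ENERGY_PER_ACCESS[level] * count for level, count in accesses.items())
    return e_comp, e_data


# Hypothetical access counts for one layer.
e_comp, e_data = estimate_energy(
    num_macs=1_000_000,
    accesses={"RF": 3_000_000, "NoC": 500_000, "Buffer": 100_000, "DRAM": 10_000},
)
print(f"E_comp = {e_comp:.0f}, E_data = {e_data:.0f}, total = {e_comp + e_data:.0f}")
```

In this made-up example the 10,000 DRAM accesses cost twice as much as the one million MACs, which is the kind of imbalance the methodology is meant to expose.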

9 Key Observations
Number of weights alone is not a good metric for energy.
All data types should be considered.
Energy consumption of GoogLeNet: computation 10%, input feature map 25%, weights 22%, output feature map 43%.
[Yang et al., CVPR 2017]
Slide 8

10 Energy-Aware Pruning
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings.
Sort layers based on energy and prune layers that consume most energy first.
EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x.
[Figure: Normalized Energy (AlexNet) for Ori., DC (Magnitude Based Pruning), and EAP (Energy Aware Pruning); EAP is labeled 3.7x]
[Yang et al., CVPR 2017]
Symposia on VLSI Technology and Circuits
Slide 9
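
The layer-ordering step ("prune layers that consume most energy first") can be sketched as a sort over per-layer energy estimates. The layer names and energy values below are hypothetical, and the pruning itself is not shown.

```python
def pruning_order(layer_energy):
    """Return layer names sorted from highest to lowest estimated energy."""
    return sorted(layer_energy, key=layer_energy.get, reverse=True)


# Hypothetical per-layer energy estimates (arbitrary units).
layer_energy = {"conv1": 12.0, "conv2": 30.5, "conv3": 18.2, "fc6": 9.1}
order = pruning_order(layer_energy)
print(order)
```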

11 NetAdapt: Platform-Aware DNN Adaptation
Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget.
Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture).
[Figure: NetAdapt takes a pretrained network and a metric budget (latency 3.8, energy 10.5), generates network proposals A, B, C, D, ..., Z, collects empirical latency and energy measurements of each proposal on the platform, and outputs the adapted network]
In collaboration with Google's Mobile Vision Team
[Yang et al., ECCV 2018]
Slide 10
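
A rough Python sketch of the loop suggested by the figure: tighten the budget a little, build one proposal per layer that meets it, measure each proposal, and keep the most accurate one. This is a simplified illustration, not the published algorithm; fine-tuning between iterations is omitted. The function names (netadapt, measure, evaluate), the per-filter costs, and the accuracy proxy are all hypothetical stand-ins for real on-device measurements and validation accuracy.

```python
def netadapt(network, measure, evaluate, target, step):
    """Shrink a network (dict: layer -> filter count) until measure(network) <= target."""
    current = dict(network)
    while measure(current) > target:
        # Tighten the budget by one step, but never below the final target.
        budget = max(target, measure(current) - step)
        proposals = []
        for layer in current:
            proposal = dict(current)
            # Remove filters from this layer only, until the measurement fits the budget.
            while proposal[layer] > 1 and measure(proposal) > budget:
                proposal[layer] -= 1
            if measure(proposal) <= budget:
                proposals.append(proposal)
        if not proposals:
            break  # no single-layer change can meet this budget
        # Keep the proposal with the best accuracy.
        current = max(proposals, key=evaluate)
    return current


# Hypothetical stand-ins for empirical latency measurement and accuracy evaluation.
COST_PER_FILTER = {"conv1": 1, "conv2": 2, "conv3": 4}
VALUE_PER_FILTER = {"conv1": 3, "conv2": 1, "conv3": 5}


def measure(net):
    return sum(COST_PER_FILTER[layer] * n for layer, n in net.items())


def evaluate(net):
    return sum(VALUE_PER_FILTER[layer] * n for layer, n in net.items())


pretrained = {"conv1": 8, "conv2": 8, "conv3": 8}
adapted = netadapt(pretrained, measure, evaluate, target=40, step=8)
print(adapted, measure(adapted))
```

Because the loop only ever calls measure, the platform stays a black box, which matches the slide's point about avoiding a model of the tool chain or architecture.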

12 Improved Latency vs. Accuracy Tradeoff
NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy.
[Figure annotations: +0.3% accuracy, 1.7x faster; +0.3% accuracy, 1.6x faster]
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
Slide 11

13 Many Efficient DNN Design Approaches
Network pruning; compact network architectures; reduced precision (32-bit float, 8-bit fixed, binary).
No guarantee that the DNN algorithm designer will use a given approach. Need flexible hardware!
Slide 12

14 Existing DNN Architectures
Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency.
Example: reduce memory access by amortizing across the MAC array.
[Figure: activation memory and weight memory feeding a MAC array, with weight reuse and activation reuse]
Slide 13

15 Limitation of Existing DNN Architectures
Example: reuse and array utilization depend on the # of channels and the feature map/batch size.
Not efficient across all network architectures (e.g., compact DNNs).
Less efficient as the array scales up in size.
Can be challenging to exploit sparsity.
[Figure: two MAC arrays, one with spatial accumulation and one with temporal accumulation; axes labeled number of input channels, feature map or batch size, and number of filters (output channels)]
Slide 14
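
A small Python sketch of why utilization depends on layer shape. It assumes a simple mapping in which array rows hold input channels and array columns hold filters; that mapping, the array size, and the layer shapes are assumptions for illustration, not a model of any particular accelerator.

```python
import math


def array_utilization(rows, cols, work_rows, work_cols):
    """Fraction of MAC slots doing useful work when a work_rows x work_cols
    workload is tiled onto a rows x cols array."""
    passes = math.ceil(work_rows / rows) * math.ceil(work_cols / cols)
    return (work_rows * work_cols) / (passes * rows * cols)


# Hypothetical 16x16 array: rows <- input channels, columns <- filters.
large = array_utilization(16, 16, work_rows=64, work_cols=128)
# Depth-wise style layer: each filter sees a single input channel.
compact = array_utilization(16, 16, work_rows=1, work_cols=128)

print(f"large layer: {large:.0%}, compact layer: {compact:.2%}")
```

Under this mapping the compact layer leaves 15 of every 16 rows idle, and a larger array would only make the fraction worse.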

16 Eyeriss v2: Balancing Flexibility and Efficiency
Efficiently supports:
Wide range of filter shapes (large and compact)
Different layers (CONV, FC, depth-wise, etc.)
Wide range of sparsity (dense and sparse)
Scalable architecture
Over an order of magnitude faster and more energy efficient than Eyeriss v1
[Chen et al., arXiv 2018] eyeriss.mit.edu
Slide 15

17 Eyeriss v2: Balancing Flexibility and Efficiency
Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes.
[Figure: PE array mapping under Row Stationary vs. Row Stationary Plus, marking active and idle PEs; labels: F1, S1, G1, output fmap width*, filter width*, # channel groups* (*tiling parameters)]
[Chen et al., arXiv 2018] eyeriss.mit.edu
Slide 16

18 Eyeriss v2: Balancing Flexibility and Efficiency
Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes.
Flexible NoC to support RS+ that can operate in different modes for different requirements:
Utilizes multicast to exploit spatial data reuse.
Utilizes unicast for high BW for weights for FC and weights & activations for compact network architectures.
Processes data in both compressed and raw format to minimize data movement for both CONV and FC layers.
Exploits sparsity in both weights and activations.
[Chen et al., arXiv 2018] eyeriss.mit.edu
Slide 17

19 Looking Beyond the DNN Accelerator for Acceleration
Z. Zhang, V. Sze, "FAST: A Framework to Accelerate Super-Resolution Processing on Compressed Videos," CVPRW 2017
Slide 18

20 Super-Resolution on Mobile Devices
[Figure: low-resolution streaming, high-resolution playback]
Transmit low resolution for lower bandwidth.
Screens are getting larger.
Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth).
Slide 19

21 Complexity of Super Resolution Algorithms
SRCNN (Dong et al., ECCV '14): 8032 MACs/pixel → ~500 GMAC/s for 30 fps.
State-of-the-art super-resolution algorithms use DNNs → computationally expensive, especially at high resolutions (HD or 4K).
Slide 20
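
The ~500 GMAC/s figure can be checked with a short calculation. The slide does not state the output resolution; the sketch below assumes 1920x1080 (HD), which reproduces the number.

```python
MACS_PER_PIXEL = 8032          # SRCNN, from the slide
WIDTH, HEIGHT = 1920, 1080     # assumption: HD output resolution
FPS = 30                       # from the slide

macs_per_frame = MACS_PER_PIXEL * WIDTH * HEIGHT
gmacs_per_second = macs_per_frame * FPS / 1e9

print(f"{gmacs_per_second:.1f} GMAC/s")
```

At 4K (four times the pixels of HD) the same per-pixel cost would scale to roughly four times this rate.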

22 FAST: A Framework to Accelerate Super Resolution
[Figure: compressed video through an SR algorithm vs. compressed video through FAST + SR, 15x faster, real-time]
A framework that accelerates any SR algorithm by up to 15x when running on compressed videos.
[Zhang et al., CVPRW 2017]
Slide 21

23 Free Information in Compressed Videos
[Figure: decoding a compressed video yields pixels, block structure, and motion compensation; video as a stack of pixels vs. representation from compressed video]
This representation can help accelerate super-resolution.
Slide 22

24 Transfer is Lightweight
[Figure: SR applied to every frame of a low-res video vs. SR applied to one frame and transferred to the others, both producing a high-res video; labels: Fractional Interpolation, Bicubic Interpolation, Skip Flag]
Transfer allows SR to run on only a subset of frames.
The complexity of the transfer is comparable to bicubic interpolation.
Transfer N frames, accelerate by N.
Slide 23
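
One way to read "Transfer N frames, accelerate by N" is as an amortization argument: if SR runs once per group of N frames and the result is transferred to the rest, the speedup approaches N as the transfer cost becomes negligible. The Python sketch below assumes that reading; the grouping and the cost values are hypothetical, not measurements from the paper.

```python
def fast_speedup(n_frames, sr_cost, transfer_cost):
    """Speedup from running SR on one frame of a group and transferring to the rest."""
    baseline = n_frames * sr_cost                            # SR on every frame
    accelerated = sr_cost + (n_frames - 1) * transfer_cost   # SR once, transfer elsewhere
    return baseline / accelerated


# Hypothetical per-frame costs in arbitrary units.
ideal = fast_speedup(16, sr_cost=100.0, transfer_cost=0.0)
realistic = fast_speedup(16, sr_cost=100.0, transfer_cost=1.0)

print(f"ideal: {ideal:.1f}x, with transfer overhead: {realistic:.1f}x")
```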

25 Evaluation: Accelerating SRCNN
Examples of videos in the test set (20 videos for HEVC development): PartyScene, RaceHorse, BasketballPass.
4x acceleration with NO PSNR LOSS.
16x acceleration with 0.2 dB loss of PSNR.
Slide 24

26 Visual Evaluation
[Figure: visual comparison of SRCNN, FAST + SRCNN, and Bicubic]
Code released at
Slide 25

27 Beyond Deep Neural Networks
A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, "Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones," Symposium on VLSI 2018
Slide 26

28 Energy-Efficient Autonomous Navigation of NanoDrones
Navion Chip: Localization and Mapping at < 30mW (full integration on-chip)
In collaboration with Sertac Karaman (AeroAstro) and Luca Carlone (AeroAstro)
[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]
Slide 27

29 Localization and Mapping Using VIO*
[Figure: an image sequence and IMU (Inertial Measurement Unit) data feed Visual-Inertial Odometry (VIO), which produces localization and mapping]
*Subset of SLAM algorithm (Simultaneous Localization And Mapping)
Slide 28

30 VIO: Backend uses Factor Graph to Infer State of Drone
Non-linear least squares factor graph optimization.
[Figure: the camera feeds the Vision Frontend (VFE), which outputs feature tracks; the IMU feeds the IMU Frontend (IFE); the Backend (BE) holds a factor graph of IMU factors, vision factors, and other factors over the estimated states, and outputs updated states (x_i) and a sparse 3D map]
Exploit sparsity for 5.4x memory reduction and 7.2x speed up.
Slide 29
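
For readers unfamiliar with the term, "non-linear least squares factor graph optimization" is commonly written in the form below. This is the textbook formulation, not an equation taken from the slides; the symbols X, r_k, and \Sigma_k are assumptions introduced here.

```latex
% X = \{x_i\}: the states being estimated (the x_i named on the slide)
% r_k: residual of factor k (an IMU, vision, or other factor)
% \Sigma_k: covariance of factor k, used as a Mahalanobis weighting
\hat{X} = \arg\min_{X} \sum_{k} \left\| r_k(X) \right\|^{2}_{\Sigma_k}
```

Each factor touches only a few states, so the linear systems solved at every iteration are sparse, which is the structure the slide refers to exploiting.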

31 Summary
Design considerations for deep learning at the edge:
Incorporate direct metrics into algorithm design for improved efficiency.
Use a flexible dataflow and NoC to exploit data reuse for energy efficiency and increase PE utilization for speed.
Accelerate deep learning by looking beyond the accelerator:
Exploit data representation for FAST Super-Resolution.
Other forms of inference at the edge beyond deep learning:
Graphical models for localization and mapping in nanodrones.
For more info:
Slide 30

32 Thank you!
