Research Faculty Summit Systems Fueling future disruptions
Transcription
1 Research Faculty Summit 2018 Systems Fueling future disruptions
2 Efficient Edge Computing for Deep Neural Networks and Beyond. Vivienne Sze. In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac Karaman, Luca Carlone, Amr Suleiman, Zhengdong Zhang. Massachusetts Institute of Technology.
3 Outline: (1) Limitations of Existing Efficient DNN Approaches; (2) Looking Beyond the DNN Accelerator for Acceleration; (3) Looking Beyond DNNs: Other forms of inference at the edge. Slide 2
4 Limitations of Existing Efficient DNN Approaches. Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML 2018. Slide 3
5 Energy-Efficient Processing of DNNs. A significant amount of algorithm and hardware research addresses energy-efficient processing of DNNs (eyeriss.mit.edu/tutorial.html). We identified various limitations to existing approaches. [V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec] Slide 4
6 Design of Efficient DNN Algorithms. Popular efficient DNN algorithm approaches: network pruning and compact network architectures (examples: SqueezeNet, MobileNet), as well as reduced precision. These focus on reducing the number of MACs and weights. Does that translate to energy savings? Slide 5
7 Data Movement is Expensive. [Figure: memory hierarchy (DRAM, global buffer, NoC between PEs, RF, ALU) with the normalized energy cost* of fetching data to run a MAC at the ALU; the ALU is the reference (1), and the other values shown are 1, 2, 6 and 200.] Energy of a weight depends on the memory hierarchy and dataflow. *Measured from a commercial 65nm process. Slide 6
8 Energy-Evaluation Methodology. Inputs: the DNN shape configuration (# of channels, # of filters, etc.), the DNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, ...]), and the hardware energy costs of each MAC and memory access. A memory-access optimization step yields the # of accesses at memory levels 1 to n, which gives E_data; a # of MACs calculation gives E_comp. Output: DNN energy consumption (plotted as energy for L1, L2, L3). Energy estimation tool available at eyeriss.mit.edu. [Yang et al., CVPR 2017] Slide 7
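The methodology on this slide combines a computation term (E_comp) and a data-movement term (E_data). A minimal Python sketch of that sum follows. The function name `estimate_energy`, the access counts, and the assignment of the costs 1, 2, 6 and 200 to specific memory levels are all illustrative assumptions; this is not the tool available at eyeriss.mit.edu.

```python
def estimate_energy(num_macs, accesses_per_level, mac_cost, access_cost_per_level):
    """Return (E_comp, E_data, E_total) in normalized energy units.

    accesses_per_level[i] is the number of accesses at memory level i and
    access_cost_per_level[i] is the assumed energy of one access there.
    """
    if len(accesses_per_level) != len(access_cost_per_level):
        raise ValueError("need one energy cost per memory level")
    e_comp = num_macs * mac_cost
    e_data = sum(n * c for n, c in zip(accesses_per_level, access_cost_per_level))
    return e_comp, e_data, e_comp + e_data


# Hypothetical layer; levels are ordered RF, NoC, global buffer, DRAM.
e_comp, e_data, e_total = estimate_energy(
    num_macs=1_000_000,
    accesses_per_level=[3_000_000, 500_000, 100_000, 10_000],
    mac_cost=1.0,
    access_cost_per_level=[1.0, 2.0, 6.0, 200.0],  # assumed mapping of costs to levels
)
print(e_comp, e_data, e_total)
```

With these made-up counts, data movement accounts for most of the total even though DRAM is accessed rarely, which is the effect the slides emphasize.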
9 Key Observations. Number of weights alone is not a good metric for energy; all data types should be considered. Energy consumption of GoogLeNet: computation 10%, input feature map 25%, weights 22%, output feature map 43%. [Yang et al., CVPR 2017] Slide 8
10 Energy-Aware Pruning. Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings. Sort layers based on energy and prune the layers that consume the most energy first. EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x. [Figure: normalized energy (AlexNet) for Ori., magnitude-based pruning (DC), and energy-aware pruning (EAP).] [Yang et al., CVPR 2017] Symposia on VLSI Technology and Circuits Slide 9
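The layer ordering described on this slide (most energy-hungry layers first) can be sketched in a few lines of Python. The per-layer energy values and the name `pruning_order` are hypothetical; the sketch covers only the ordering step, not the pruning or fine-tuning applied to each layer.

```python
def pruning_order(layer_energy):
    """Return layer names sorted from highest to lowest estimated energy."""
    return sorted(layer_energy, key=layer_energy.get, reverse=True)


# Hypothetical normalized energy per layer (not measured values).
layer_energy = {"conv1": 4.0, "conv2": 9.5, "fc6": 2.5, "conv3": 6.0}
order = pruning_order(layer_energy)
print(order)
```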
11 NetAdapt: Platform-Aware DNN Adaptation. Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget. Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture). [Figure: a pretrained network and a metric budget (example: latency 3.8, energy 10.5) go into NetAdapt; network proposals A, B, C, D, ..., Z are measured on the platform; the empirical latency and energy measurements of each proposal are fed back; the output is the adapted network.] In collaboration with Google's Mobile Vision Team. [Yang et al., ECCV 2018] Slide 10
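The loop on this slide (generate proposals, measure them, keep one, repeat until the budget is met) can be sketched as a greedy search. This is a simplified illustration under stated assumptions, not the published algorithm: the selection rule, the `step` schedule, and the toy `measure`, `evaluate` and `propose` stand-ins are all hypothetical, and fine-tuning is omitted.

```python
def adapt(network, measure, evaluate, propose, budget, step):
    """Greedily shrink `network` until measure(network) <= budget.

    Each iteration tightens the constraint by `step`, keeps the proposals
    that meet it, and picks the one with the best evaluate() score.
    """
    current = network
    while measure(current) > budget:
        constraint = max(budget, measure(current) - step)
        candidates = [p for p in propose(current) if measure(p) <= constraint]
        if not candidates:
            raise RuntimeError("no proposal meets the constraint")
        current = max(candidates, key=evaluate)
    return current


# Toy stand-ins (all hypothetical): a "network" is a dict of filter counts.
IMPORTANCE = {"conv1": 3, "conv2": 1, "conv3": 2}

def measure(net):
    """Stand-in for an empirical latency measurement on the platform."""
    return sum(net.values())

def evaluate(net):
    """Stand-in for accuracy: weighted sum of remaining filters."""
    return sum(IMPORTANCE[name] * n for name, n in net.items())

def propose(net, cut=8):
    """One proposal per layer, each removing `cut` filters from that layer."""
    proposals = []
    for name, n in net.items():
        if n > cut:
            candidate = dict(net)
            candidate[name] = n - cut
            proposals.append(candidate)
    return proposals

pretrained = {"conv1": 32, "conv2": 64, "conv3": 32}
adapted = adapt(pretrained, measure, evaluate, propose, budget=104, step=8)
print(adapted, measure(adapted))
```

In this toy setup the loop repeatedly trims `conv2`, the layer with the lowest assumed importance, because that costs the least "accuracy" for the same measured saving.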
12 Improved Latency vs. Accuracy Tradeoff. NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy. Plot annotations: +0.3% accuracy, 1.7x faster; +0.3% accuracy, 1.6x faster. *Tested on the ImageNet dataset and a Google Pixel 1 CPU. Slide 11
13 Many Efficient DNN Design Approaches: network pruning, compact network architectures, reduced precision (32-bit float, fixed-point, binary). There is no guarantee that a DNN algorithm designer will use a given approach. Need flexible hardware! Slide 12
14 Existing DNN Architectures. Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency. Example: reduce memory access by amortizing across the MAC array. [Figure: activation memory and weight memory feeding a MAC array, with weight reuse and activation reuse.] Slide 13
15 Limitation of Existing DNN Architectures. Example: reuse and array utilization depend on the # of channels and the feature map/batch size. Not efficient across all network architectures (e.g., compact DNNs). Less efficient as the array scales up in size. Can be challenging to exploit sparsity. [Figure: two MAC arrays. Spatial accumulation: number of input channels by number of filters (output channels). Temporal accumulation: feature map or batch size by number of filters (output channels).] Slide 14
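To illustrate why utilization depends on layer shape, here is a simplified Python model of a spatial-accumulation MAC array. It assumes input channels map to rows and filters map to columns, with tiling when a dimension exceeds the array; the function name, the 16x16 array size and the layer shapes are hypothetical, and the feature map/batch dimension is ignored.

```python
import math


def array_utilization(num_channels, num_filters, rows, cols):
    """Fraction of MAC slots doing useful work for one layer.

    Assumes channels map to rows and filters to columns, and that the
    layer is tiled into ceil(C / rows) * ceil(M / cols) passes.
    """
    row_passes = math.ceil(num_channels / rows)
    col_passes = math.ceil(num_filters / cols)
    used = num_channels * num_filters
    provisioned = row_passes * rows * col_passes * cols
    return used / provisioned


large_layer = array_utilization(64, 64, rows=16, cols=16)  # many channels and filters
first_layer = array_utilization(3, 64, rows=16, cols=16)   # only 3 input channels
depthwise = array_utilization(1, 1, rows=16, cols=16)      # one channel, one filter per group
print(large_layer, first_layer, depthwise)
```

Under this model a layer with few channels leaves most rows idle, and the gap widens as the array grows, which matches the limitation the slide describes for compact DNNs.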
16 Eyeriss v2: Balancing Flexibility and Efficiency. Efficiently supports: a wide range of filter shapes (large and compact); different layers (CONV, FC, depth-wise, etc.); a wide range of sparsity (dense and sparse); a scalable architecture. Over an order of magnitude faster and more energy efficient than Eyeriss v1. [Chen et al., arXiv 2018] eyeriss.mit.edu Slide 15
17 Eyeriss v2: Balancing Flexibility and Efficiency. Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes. [Figure: PE arrays under Row Stationary and Row Stationary Plus, marking active and idle PEs; labels: F1, S1, G1, output fmap width*, filter width*, # channel groups* (*tiling parameters).] [Chen et al., arXiv 2018] eyeriss.mit.edu Slide 16
18 Eyeriss v2: Balancing Flexibility and Efficiency. Flexible dataflow, called Row-Stationary Plus (RS+), that enables the spatial mapping of data from all dimensions for high PE array utilization and data reuse for various layer shapes and sizes. Flexible NoC to support RS+ that can operate in different modes for different requirements: it utilizes multicast to exploit spatial data reuse, and unicast for high bandwidth for weights in FC layers and for weights and activations in compact network architectures. Processes data in both compressed and raw format to minimize data movement for both CONV and FC layers. Exploits sparsity in both weights and activations. [Chen et al., arXiv 2018] eyeriss.mit.edu Slide 17
19 Looking Beyond the DNN Accelerator for Acceleration. Z. Zhang, V. Sze, "FAST: A Framework to Accelerate Super-Resolution Processing on Compressed Videos," CVPRW 2017. Slide 18
20 Super-Resolution on Mobile Devices. Low-resolution streaming, high-resolution playback. Transmit low resolution for lower bandwidth. Screens are getting larger. Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth). Slide 19
21 Complexity of Super-Resolution Algorithms. SRCNN (Dong et al., ECCV '14): 8032 MACs/pixel → ~500 GMAC/s for 30 fps. State-of-the-art super-resolution algorithms use DNNs → computationally expensive, especially at high resolutions (HD or 4K). Slide 20
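The ~500 GMAC/s figure can be checked with a short calculation. The sketch below assumes the 8032 MACs/pixel cost applies per output pixel and that the frame is 1920x1080; the slide only says "HD or 4K", so the exact resolutions are assumptions.

```python
MACS_PER_PIXEL = 8032  # SRCNN cost quoted on the slide


def macs_per_second(width, height, fps, macs_per_pixel=MACS_PER_PIXEL):
    """MACs per second needed to process every pixel of every frame."""
    return width * height * fps * macs_per_pixel


hd = macs_per_second(1920, 1080, 30)   # assumed HD resolution
uhd = macs_per_second(3840, 2160, 30)  # assumed 4K resolution
hd_gmacs = hd / 1e9
print(f"HD: {hd_gmacs:.1f} GMAC/s, 4K: {uhd / 1e9:.1f} GMAC/s")
```

Under these assumptions HD comes to just under 500 GMAC/s, and 4K needs four times as much.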
22 FAST: A Framework to Accelerate Super-Resolution. A framework that accelerates any SR algorithm by up to 15x when running on compressed videos. [Figure: compressed video goes through FAST plus the SR algorithm for real-time SR, 15x faster.] [Zhang et al., CVPRW 2017] Slide 21
23 Free Information in Compressed Videos. Decoding to pixels treats video as a stack of pixels. The representation from compressed video also provides block structure and motion compensation. This representation can help accelerate super-resolution. Slide 22
24 Transfer is Lightweight. [Figure: SR applied to every frame of the low-res video vs. SR on one frame followed by transfer to the other frames of the high-res video.] Transfer allows SR to run on only a subset of frames. Transfer components: fractional interpolation, bicubic interpolation, skip flag. The complexity of the transfer is comparable to bicubic interpolation. Transfer N frames, accelerate by N. Slide 23
25 Evaluation: Accelerating SRCNN. Examples of videos in the test set (20 videos for HEVC development): PartyScene, RaceHorse, BasketballPass. 4x acceleration with NO PSNR LOSS. 16x acceleration with 0.2 dB loss of PSNR. Slide 24
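The "transfer N frames, accelerate by N" claim can be made concrete with a simple cost model. The sketch below assumes one reading of the slide: within a group of N frames, SR runs on one frame and the result is transferred to the other N - 1. The function name and the per-frame costs are hypothetical.

```python
def fast_speedup(group_size, sr_cost, transfer_cost):
    """Speedup of SR-plus-transfer over running SR on every frame.

    Baseline: SR on all `group_size` frames.
    Accelerated: SR on one frame, transfer on the remaining ones.
    """
    baseline = group_size * sr_cost
    accelerated = sr_cost + (group_size - 1) * transfer_cost
    return baseline / accelerated


ideal = fast_speedup(16, sr_cost=8032.0, transfer_cost=0.0)       # free transfer
realistic = fast_speedup(16, sr_cost=8032.0, transfer_cost=32.0)  # assumed small cost
print(ideal, realistic)
```

The speedup approaches the group size only when the transfer cost is negligible next to the SR cost, which is why the slide stresses that transfer is about as cheap as bicubic interpolation.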
26 Visual Evaluation SRCNN FAST + SRCNN Bicubic Code released at Slide 25
27 Beyond Deep Neural Networks. A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, "Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones," Symposium on VLSI 2018. Slide 26
28 Energy-Efficient Autonomous Navigation of Nano Drones. Navion chip: localization and mapping at < 30 mW (full integration on-chip). In collaboration with Sertac Karaman (AeroAstro) and Luca Carlone (AeroAstro). [Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018] Slide 27
29 Localization and Mapping Using VIO*. [Figure: an image sequence and IMU (Inertial Measurement Unit) data feed Visual-Inertial Odometry (VIO), which produces localization and mapping.] *Subset of SLAM algorithm (Simultaneous Localization And Mapping). Slide 28
30 VIO: Backend Uses Factor Graph to Infer State of Drone. Non-linear least squares factor graph optimization. [Figure blocks: camera, Vision Frontend (VFE), feature tracks, IMU, IMU Frontend (IFE), Backend (BE) with a factor graph built from IMU factors, vision factors and other factors, estimated states, updated states (x_i) and sparse 3D map.] Exploit sparsity for 5.4x memory reduction and 7.2x speed up. Slide 29
31 Summary. Design considerations for deep learning at the edge: incorporate direct metrics into algorithm design for improved efficiency; use a flexible dataflow and NoC to exploit data reuse for energy efficiency and increase PE utilization for speed. Accelerate deep learning by looking beyond the accelerator: exploit data representation for FAST super-resolution. Other forms of inference at the edge beyond deep learning: graphical models for localization and mapping in nano drones. Slide 30
32 Thank you!
More informationRotate Intra Block Copy for Still Image Coding
Rotate Intra Block Copy for Still Image Coding The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Zhang,
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationComprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications
Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications Helena Zheng ML Group, Arm Arm Technical Symposia 2017, Taipei Machine Learning is a Subset of Artificial