ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
1 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA. Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong Yang 2,3 and Bill Dally 1,4. Stanford University 1, DeePhi 2, Tsinghua University 3, NVIDIA 4. Feb 23, 2017, FPGA '17, Monterey, CA
2 Recurrent Neural Networks and LSTM: speech recognition, image captioning, machine translation, visual question answering. Compression, Acceleration, Regularization
3 Speech Recognition
4 Machine Translation
5 Visual Storytelling / Image Captioning (Huang et al.)
6 VQA: Visual Question Answering. Which country's flag is this? What is behind him? What is the color of his hair?
7 Recurrent Neural Network. MLP, image captioning, sentiment analysis, machine translation, speech recognition (Stanford CS231n lecture notes)
8 Comparing CNN / LSTM. CNN: weights shared in space. RNN/LSTM: weights shared in time, which produces complicated data dependencies and makes parallelization difficult.
9 LSTM Structure: Input → LSTM → LSTM → FC → Softmax → Output
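The LSTM blocks in this structure can be written out as a plain cell update. Below is a generic numpy sketch of the standard LSTM equations; the stacked weight layout and the tiny sizes are illustrative assumptions, not ESE's configuration.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step computed from the concatenated [x, h_prev].

    W has shape (4*H, D+H) and b shape (4*H,), stacking the input (i),
    forget (f), candidate (g) and output (o) gate parameters.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = 1 / (1 + np.exp(-z[0:H]))        # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))      # forget gate
    g = np.tanh(z[2*H:3*H])              # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:4*H]))    # output gate
    c = f * c_prev + i * g               # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# Tiny example: input size 2, hidden size 3
rng = np.random.default_rng(0)
D, H = 2, 3
W = rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

The four matrix-vector products hidden in `W @ [x, h]` are exactly the operations ESE turns into sparse matrix-vector multiplies after pruning.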
10 Models are Getting Larger. SPEECH RECOGNITION: Deep Speech (2014): 80 GFLOP of training ops, 7,000 hrs of data, ~8% error. Deep Speech 2: 465 GFLOP, 12,000 hrs of data, ~5% error. 10X the training ops.
11 We Need More Computation. But Moore's Law is no longer providing more compute.
12 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
13 Conventional Paradigm: Training → Inference
15 Proposed Paradigm. Conventional: Training → Inference (slow, power-hungry). Proposed: Training (Han et al., ICLR'17) → Model Compression (Han et al., NIPS'15; Han et al., ICLR'16, best paper award) → Accelerated Inference (Han et al., ISCA'16; Han et al., FPGA'17, best paper award): fast, power-efficient.
16 Agenda. Compression: Load-Balance-Aware Pruning, Quantization. Scheduling: Overlap Computation and Memory Reference. Accelerated Inference: Efficient Architecture for Sparse LSTM. Results.
18 Pruning Review (Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS'15)
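The pruning being reviewed is magnitude-based: zero out the smallest weights, then retrain to recover accuracy. A simplified sketch of the pruning step (the retraining loop of the NIPS'15 procedure is omitted):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude entries of W until roughly
    `sparsity` fraction of entries are zero. In the full pipeline the
    network is retrained after pruning to recover accuracy."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    mask = np.abs(W) > threshold          # keep only weights above threshold
    return W * mask

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
Wp = magnitude_prune(W, sparsity=0.9)     # ~90% of entries become zero
```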
19 Pruning Leads to Load Imbalance. [Figure: after pruning, the remaining non-zero weights (w0,0, w0,1, w0,3, w1,2, w2,1, w2,3, w4,2, w4,3, w5,0, w6,0, w6,3, w7,1) are spread unevenly over the four PEs.] Unbalanced workload: 5, 2, 4 and 1 cycles on the four PEs. Overall: 5 cycles, set by the slowest PE.
23 Load-Balance-Aware Pruning. [Figure: the same matrix pruned with an equal non-zero budget per PE, keeping e.g. w0,0, w0,3, w1,2, w2,1, w2,3, w3,2, w4,2, w5,0, w5,3, w6,0, w7,1, w7,3.] Unbalanced: 5, 2, 4, 1 cycles, overall 5 cycles. Balanced: 3 cycles on every PE, overall 3 cycles.
26 Accuracy vs. Sparsity. [Figure: load-balance-aware pruning matches the accuracy of unconstrained pruning at the same sparsity.]
28 Weight Quantization. Table 4: WER before and after compression.
Network                               WER
32-bit floating, original network     20.3%
32-bit floating, pruned network       20.7%
16-bit fixed, pruned network          20.7%
12-bit fixed, pruned network          20.7%
8-bit fixed, pruned network           84.5%
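The fixed-point rows come from quantizing the pruned weights. A generic linear fixed-point quantizer is sketched below; the fractional-bit splits chosen here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def quantize_fixed(W, bits, frac_bits):
    """Linear fixed-point quantization sketch: round to `bits`-bit
    signed values with `frac_bits` fractional bits, saturating at the
    representable range, then map back to floats for simulation."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(W * scale), lo, hi)   # integer codes
    return q / scale                           # dequantized values

W = np.array([0.30, -1.27, 0.001])
W12 = quantize_fixed(W, bits=12, frac_bits=10)  # fine steps: WER preserved
W8 = quantize_fixed(W, bits=8, frac_bits=6)     # coarse steps: accuracy at risk
```

The rounding error per weight is bounded by half a quantization step (2^-11 at 12 bits with 10 fractional bits), which is why 16- and 12-bit fixed point preserve WER while 8 bits does not leave enough precision.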
29 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
30 FSM for LSTM
31 Scheduling. [Figure: the timeline overlaps memory fetches with the SpMM and element-wise (Elt-wise) computation of consecutive LSTM operations, so the next weight fetch hides behind the current compute.]
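The effect of the schedule can be shown with a toy timing model: if the weight fetch for the next operation overlaps the SpMM and element-wise work of the current one, the makespan drops from the sum of all stages toward the longer of the two paths. All cycle counts below are made up for illustration.

```python
# Toy timing model (hypothetical cycle counts): each of four LSTM
# matrix operations needs a weight fetch from DRAM (mem), a sparse
# matrix-vector multiply (spmm) and element-wise work (elt).
stages = [(40, 30, 10)] * 4              # (mem, spmm, elt) per operation

# Sequential: every stage waits for the previous one to finish.
sequential = sum(m + s + e for m, s, e in stages)

# Overlapped (double buffering): the fetch for operation i runs
# concurrently with the compute (spmm + elt) of operation i - 1.
mem_times = [m for m, _, _ in stages]
comp_times = [s + e for _, s, e in stages]
overlapped = mem_times[0]                # first fetch cannot be hidden
for i in range(1, len(stages)):
    overlapped += max(mem_times[i], comp_times[i - 1])
overlapped += comp_times[-1]             # drain the last compute
```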
45 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
46 Challenges. On-line decompression while computing → special-purpose logic. Computation becomes irregular → caused by sparsity. Parallelization becomes challenging → requires load balance.
47 Dealing with Sparsity. [Figure 8: the computation pattern. The non-zero weights in each column of the sparse matrix are assigned to different PEs; when activation a3 arrives, each PE multiplies it by its non-zero weights from column 3 (w0,3, w2,3, w4,3, w7,3) and accumulates the products into its output buffer.]
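The computation pattern in the figure is column-wise sparse matrix-vector multiplication over a compressed (CSC-style) format. A minimal single-threaded sketch; in the accelerator the row-indexed accumulations for one column run in parallel on different PEs.

```python
import numpy as np

def sparse_mv_by_column(values, row_idx, col_ptr, a, n_rows):
    """CSC sparse matrix-vector product: for each non-zero activation
    a[j], walk the non-zero weights of column j via col_ptr and
    accumulate w[i, j] * a[j] into output row i."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:                    # zero activations are skipped
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * aj
    return y

# A 4x3 matrix with 4 non-zeros, stored column by column
values  = np.array([1.0, 2.0, 3.0, 4.0])
row_idx = np.array([0, 2, 1, 3])
col_ptr = np.array([0, 2, 3, 4])         # nnz ranges of columns 0, 1, 2
a = np.array([1.0, 0.0, 2.0])
y = sparse_mv_by_column(values, row_idx, col_ptr, a, n_rows=4)
```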
48 Hardware Architecture. [Figure: (a) system overview. A CPU running the software program, with its own memory, talks over PCIE to the FPGA; the FPGA holds the PCIE controller, memory controller, input/output buffers on the data bus, external memory, and the ESE accelerator, organized as N channels (Channel 0 ... Channel N-1) under an ESE controller, each channel containing multiple PEs. (b) one channel. An ActQueue feeds activations x/y(t-1); PtrRead units with pointer buffers and SpmatRead units with weight buffers walk the sparse matrix for the SpMM; activation buffers and an assemble unit produce y(t); element-wise units (adder tree, sigmoid/tanh, ElemMul) combine Wx(t)/Wy(t-1) and the cell state c(t-1) into c(t), writing the hidden state into the H(t) buffer.]
59 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results
60 Speedup vs. Sparsity. [Figure: at the same accuracy, load-balance-aware pruning delivers an extra 12% speedup over unconstrained pruning, a free lunch.]
62 Speedup and Energy Efficiency.
                 ESE       CPU (dense)  CPU (sparse)  GPU (dense)  GPU (sparse)
Latency          82.7 µs   6017 µs      3569 µs       240 µs       287 µs
Power            41 W      111 W        38 W          202 W        136 W
Compression      20×       1×           1×            1×           1×
ESE runs the compressed sparse LSTM 43× faster than the Core i7 CPU and 3× faster than the Pascal Titan X GPU, at 40× and 11.5× better energy efficiency; against the GPU on the dense model, the energy-efficiency gap is 14.3×.
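A quick cross-check of the results: energy per inference is latency times power, so, taking the ESE entry (82.7 µs, 41 W) and the GPU dense entry (assumed here to read 240 µs and 202 W), the 14.3× energy-efficiency figure falls out directly.

```python
# Energy per inference = latency x power (joules).
ese_energy = 82.7e-6 * 41               # ESE: ~3.4 mJ per inference
gpu_dense_energy = 240e-6 * 202         # GPU, dense model: ~48.5 mJ
ratio = gpu_dense_energy / ese_energy   # energy-efficiency gap, ~14.3x
```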
66 Demo
67 Thank You! Conventional: Training → Inference. Proposed: Training (Han et al., ICLR'17) → Compression: pruning, quantization (Han et al., NIPS'15; Han et al., ICLR'16, best paper award) → Accelerated Inference (Han et al., ISCA'16; Han et al., FPGA'17, best paper award).
Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal arxiv:1705.10748v3 [cs.cv] 10 Apr 2018 Chih-Ting Liu, Yi-Heng Wu, Yu-Sheng Lin, and Shao-Yi Chien Media
More informationDeep Neural Network Acceleration Framework Under Hardware Uncertainty
Deep Neural Network Acceleration Framework Under Hardware Uncertainty Mohsen Imani, Pushen Wang, and Tajana Rosing Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA {moimani, puw001,
More informationMachine Learning on FPGAs
Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications
More informationMask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma
Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left
More informationPERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices
PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices Chunhua Deng + City University of New York chunhua.deng@rutgers.edu Keshab K. Parhi University of Minnesota, Twin Cities parhi@umn.edu
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationModel Compression. Girish Varma IIIT Hyderabad
Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationConvolutional Neural Network Layer Reordering for Acceleration
R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationDeploying Deep Neural Networks in the Embedded Space
Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationCapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse
Accepted for publication at Design, Automation and Test in Europe (DATE 2019). Florence, Italy CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Reuse Alberto Marchisio, Muhammad Abdullah
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationMIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius
MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationDeep learning for object detection. Slides from Svetlana Lazebnik and many others
Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep
More informationOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationLeveraging MLC STT-RAM for Energy-efficient CNN Training
Memory Demand (GB) Access Bandwidth(GB/s) Normalized Accesses Leveraging for Energy-efficient CNN Training Hengyu Zhao and Jishen Zhao University of California, San Diego {h6zhao, jzhao}@eng.ucsd.edu ABSTRACT
More informationA Lightweight YOLOv2:
FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology,
More informationS8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer
S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer 2 100 倍以上速く 本当に可能ですか? 2 DOUGLAS ADAMS BABEL FISH Neural Machine Translation Unit 3 4 OVER 100X FASTER, IS IT REALLY POSSIBLE?
More informationMachine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,
Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image
More information