ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA


1 ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong Yang 2,3 and Bill Dally 1,4 Stanford University 1, DeePhi 2, Tsinghua University 3, NVIDIA 4 Feb 23, 2017 FPGA 17, Monterey, CA

2 Recurrent Neural Networks and LSTM speech recognition image caption machine translation visual question answering Compression Acceleration Regularization

3 Speech Recognition

4 Machine Translation

5 Huang et al. Visual Storytelling Image Caption

6 VQA: Visual Question Answering which country is the flag of? what is behind him? what is the color of his hair?

7 Recurrent Neural Network MLP image caption sentiment analysis machine translation speech recognition Stanford cs231n lecture notes

8 Comparing CNN / LSTM CNN: weights shared in space. RNN/LSTM: weights shared in time => produces complicated data dependencies => makes parallelization difficult

9 LSTM Structure Input LSTM LSTM FC Softmax Output
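
Each LSTM layer in this stack updates a cell state and a hidden state at every time step. As a reference for what the accelerator has to compute per step, here is a minimal numpy sketch of one LSTM cell; the stacked-gate weight layout and the sizes are illustrative, not taken from the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the input/forget/candidate/output gate
    weights: shape (4*hidden, input+hidden); b has shape (4*hidden,)."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.size
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # new cell state
    h_t = sigmoid(o) * np.tanh(c_t)                      # new hidden state
    return h_t, c_t
```

The dominant cost is the matrix-vector product `W @ [x_t; h_prev]`, which is exactly what the sparse SpMM units later in the talk accelerate; the element-wise multiplies and sigmoid/tanh form the second, smaller stage.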

10 Models are Getting Larger SPEECH RECOGNITION Deep Speech (2014): 8 GFLOP of training ops, 7,000 hrs of data, ~8% error. Deep Speech 2: 465 GFLOP, 12,000 hrs of data, ~5% error

11 We Need More Computation But Moore's law is no longer providing more compute

12 Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design

13 Conventional Paradigm Training Inference

14 Conventional Paradigm Training Inference

15 Proposed Paradigm Conventional Training Inference Slow Power Hungry Proposed Training Han et al ICLR 17 Model Compression Han et al NIPS 15 Han et al ICLR 16 (best paper award) Accelerated Inference Han et al ISCA 16 Han et al FPGA 17 (best paper award) Fast Power Efficient

16 Agenda Compression Load Balance-Aware Pruning Quantization Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results

17 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results

18 Pruning Review Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 15
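
The pruning reviewed here removes small-magnitude connections from the weight matrices. A minimal sketch of magnitude pruning, with threshold selection simplified to a single global sort (the function name is ours):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of all weights."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)
```

In the original work pruning is followed by retraining to recover accuracy; this sketch shows only the connection-removal step.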

19 Pruning Leads to Load Imbalance [Figure: the non-zero weights of a pruned matrix (w0,0, w0,1, w0,3, w1,2, w2,1, w2,3, w4,2, w4,3, w5,0, w6,0, w6,3, w7,1), with rows interleaved across four PEs]

22 Pruning Leads to Load Imbalance Unbalanced: the four PEs need 5 cycles, 2 cycles, 4 cycles and 1 cycle. Overall: 5 cycles
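
The imbalance on this slide can be reproduced numerically. Assuming rows are interleaved across four PEs and each non-zero weight costs one multiply-accumulate cycle (the coordinates below are our reconstruction of the slide's figure), the busiest PE sets the overall time:

```python
import numpy as np

def pe_cycles(sparse_matrix, n_pes):
    """Rows are interleaved across PEs; each non-zero costs one cycle.
    Overall time is set by the busiest PE."""
    per_pe = [int(np.count_nonzero(sparse_matrix[pe::n_pes]))
              for pe in range(n_pes)]
    return per_pe, max(per_pe)

# Non-zero positions from the slide's example (8x4 pruned matrix).
coords = [(0, 0), (0, 1), (0, 3), (1, 2), (2, 1), (2, 3),
          (4, 2), (4, 3), (5, 0), (6, 0), (6, 3), (7, 1)]
M = np.zeros((8, 4))
for r, c in coords:
    M[r, c] = 1.0
per_pe, overall = pe_cycles(M, 4)  # per_pe == [5, 2, 4, 1], overall == 5
```

The twelve non-zeros give per-PE loads of 5, 2, 4 and 1 cycles, so the overall time is 5 cycles even though the average load is only 3.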

23 Load Balance Aware Pruning [Figure: left, conventional pruning leaves the non-zeros unevenly distributed across PEs; right, load-balance-aware pruning keeps the same number of non-zeros per PE]

25 Load Balance Aware Pruning Unbalanced: 5 cycles, 2 cycles, 4 cycles, 1 cycle. Overall: 5 cycles. Balanced: 3 cycles, 3 cycles, 3 cycles, 3 cycles. Overall: 3 cycles
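
Load-balance-aware pruning avoids the straggler by pruning each PE's submatrix independently to the same number of non-zeros. A sketch under the same row-interleaved assignment (function and parameter names are ours):

```python
import numpy as np

def balance_aware_prune(weights, n_pes, nnz_per_pe):
    """Prune each PE's row-interleaved submatrix so that every PE keeps
    exactly `nnz_per_pe` largest-magnitude weights."""
    pruned = np.zeros_like(weights)
    for pe in range(n_pes):
        sub = weights[pe::n_pes]
        flat = np.abs(sub).ravel()
        keep = np.argsort(flat)[-nnz_per_pe:]   # indices of largest weights
        mask = np.zeros(flat.size, dtype=bool)
        mask[keep] = True
        pruned[pe::n_pes] = np.where(mask.reshape(sub.shape), sub, 0.0)
    return pruned
```

Because every PE now holds the same non-zero count, all PEs finish in the same number of cycles; the talk's example drops from 5 cycles to 3.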

26 Accuracy vs Sparsity

27 Accuracy vs Sparsity

28 Weight Quantization Table 4: WER Before and After Compression.
Networks: WER
32-bit floating, original network: 20.3%
32-bit floating, pruned network: 20.7%
16-bit fixed, pruned network: 20.7%
12-bit fixed, pruned network: 20.7%
8-bit fixed, pruned network: 84.5%
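
The table shows the pruned network tolerating 12-bit fixed-point weights but collapsing at 8 bits. A generic fixed-point quantizer illustrating the rounding and saturation involved (the split between integer and fractional bits is an assumption for illustration, not the paper's configuration):

```python
import numpy as np

def quantize_fixed(weights, total_bits, frac_bits):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits,
    saturating at the representable range."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1)) / scale          # most negative code
    hi = (2 ** (total_bits - 1) - 1) / scale       # most positive code
    return np.clip(np.round(weights * scale) / scale, lo, hi)
```

Fewer total bits shrink both the resolution (`1/scale`) and the saturation range, which is why accuracy degrades sharply once the word length drops below what the weight distribution needs.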

29 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results

30 FSM for LSTM

31 Scheduling [Figure: pipeline schedule across the Memory, SpMM and Elt-wise units: while the SpMM units compute on the current chunk of weights, the memory controller fetches the next chunk, and the element-wise units consume SpMM results as they become available, so memory references overlap with computation]
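
The benefit of overlapping memory references with computation can be seen in a simple two-stage timing model with double buffering (the formula and the numbers are illustrative, not measurements from the talk):

```python
def serial_time(n_chunks, t_fetch, t_compute):
    """No overlap: each chunk is fetched, then computed on."""
    return n_chunks * (t_fetch + t_compute)

def pipelined_time(n_chunks, t_fetch, t_compute):
    """Double buffering: chunk k+1 is fetched while chunk k is computed,
    so after the pipeline fills, the slower stage sets the rhythm."""
    return t_fetch + t_compute + (n_chunks - 1) * max(t_fetch, t_compute)
```

For example, with 4 chunks, a 2-unit fetch and a 3-unit compute, the serial schedule takes 20 time units while the overlapped schedule takes 14; the memory time is almost entirely hidden behind compute.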

45 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results

46 Challenges Online de-compression while computing requires special-purpose logic. Sparsity makes the computation irregular. Parallelization becomes challenging and requires load balance

47 Deal with Sparsity [Figure 8: the computation pattern: the non-zero weights in a column are assigned to PEs; when activation a3 arrives, each PE multiplies it by its local non-zeros in column 3 (w0,3, w2,3, w4,3, w7,3), and zero entries are skipped]
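
This computation pattern is a sparse matrix-vector product in compressed column form: each arriving activation is multiplied only against the non-zero weights of its column, and zero activations are skipped entirely. A plain-Python sketch (the storage layout and names are the standard CSC ones, not the paper's exact encoding):

```python
import numpy as np

def csc_spmv(n_rows, values, row_idx, col_ptr, x):
    """y = W @ x with W in compressed sparse column (CSC) form.
    Column j's non-zeros are values[col_ptr[j]:col_ptr[j+1]], with row
    positions row_idx[col_ptr[j]:col_ptr[j+1]]."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue  # zero activation: skip the whole column
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y
```

In the accelerator, the inner loop over a column is what gets split across PEs: each PE owns a subset of the rows, so one broadcast activation drives all PEs in parallel.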

48 Hardware Architecture [Figure: (a) system overview: a CPU running the software program with external memory, connected over PCIe to the FPGA, whose MEM controller, data bus and input/output buffers feed the ESE controller and the ESE accelerator's channels (Channel 0 to Channel N); (b) one channel with multiple PEs: an ActQueue feeds PtrRead units with pointer buffers and SpmatRead units with weight buffers for the SpMM stage; activation buffers and an Assemble unit produce y_t; the element-wise stage (adder tree, ElemMul units and sigmoid/tanh units) computes c_t and h_t from Wx_t and Wy_t-1, buffered in the Ht buffer]

59 Agenda Compression Load Balance-Aware Pruning Scheduling Overlap Computation and Memory Reference Accelerated Inference Efficient Architecture for Sparse LSTM Results

60 Speedup vs Sparsity

61 Speedup vs Sparsity 12% free lunch

62 Speedup and Energy Efficiency
Platform: ESE / CPU Dense / CPU Sparse / GPU Dense / GPU Sparse
Latency: 82.7us / 6017us / 3569us / 240us / 287us
Power: 41W / 111W / 38W / 202W / 136W
ESE is 43x faster and 40x more energy efficient than the CPU, and 3x faster and 11.5x more energy efficient than the GPU
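
The speedup and energy-efficiency figures follow directly from the latency and power rows, since energy per inference is power times latency. A quick check using the sparse-baseline numbers from the table (CPU sparse 3569 us at 38 W, GPU sparse 287 us at 136 W, ESE 82.7 us at 41 W):

```python
def speedup(lat_baseline_us, lat_us):
    """How many times faster than the baseline."""
    return lat_baseline_us / lat_us

def energy_efficiency_gain(lat_base_us, power_base_w, lat_us, power_w):
    """Energy = power * latency; the gain is the ratio of energies."""
    return (lat_base_us * power_base_w) / (lat_us * power_w)

s_cpu = speedup(3569, 82.7)                        # ~43x vs CPU sparse
e_cpu = energy_efficiency_gain(3569, 38, 82.7, 41)  # ~40x vs CPU sparse
s_gpu = speedup(287, 82.7)                          # ~3.5x vs GPU sparse
e_gpu = energy_efficiency_gain(287, 136, 82.7, 41)  # ~11.5x vs GPU sparse
```

Note that ESE wins on energy despite its modest 41 W budget mainly through latency: the compressed model is small enough to keep weights on-chip, so each inference does far less work.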

66 Demo

67 Thank you! Conventional Training Inference Proposed Training Compression Pruning Quantization Accelerated Inference Han et al ICLR 17 Han et al NIPS 15 Han et al ICLR 16 (best paper award) Han et al ISCA 16 Han et al FPGA 17 (best paper award)


More information

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

SPARSE PERSISTENT RNNS: SQUEEZING LARGE RECURRENT NETWORKS ON- CHIP

SPARSE PERSISTENT RNNS: SQUEEZING LARGE RECURRENT NETWORKS ON- CHIP SPARSE PERSISTENT RNNS: SQUEEZING LARGE RECURRENT NETWORKS ON- CHIP Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard & Fung Xie NVIDIA {mzhu,jpool,mandersch,jappleyard,ftse}@nvidia.com ABSTRACT

More information

THE NVIDIA DEEP LEARNING ACCELERATOR

THE NVIDIA DEEP LEARNING ACCELERATOR THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural

More information

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely

More information

In-Place Associative Computing

In-Place Associative Computing In-Place Associative Computing All Images are Public in the Web Avidan Akerib Ph.D. Vice President Associative Computing BU aakerib@gsitechnology.com Agenda Introduction to associative computing Use case

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

VITERBI-BASED PRUNING FOR SPARSE MATRIX WITH FIXED AND HIGH INDEX COMPRESSION RATIO

VITERBI-BASED PRUNING FOR SPARSE MATRIX WITH FIXED AND HIGH INDEX COMPRESSION RATIO Published as a conference paper at ICLR 8 VITERBI-BASED PRUNING FOR SPARSE MATRIX WITH FIXED AND HIGH INDEX COMPRESSION RATIO Dongsoo Lee,, Daehyun Ahn, Taesu Kim, Pierce I. Chuang, Jae-Joon Kim POSTECH,

More information

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution Bidirectional Recurrent Convolutional Networks for Video Super-Resolution Qi Zhang & Yan Huang Center for Research on Intelligent Perception and Computing (CRIPAC) National Laboratory of Pattern Recognition

More information

IN-MEMORY ASSOCIATIVE COMPUTING

IN-MEMORY ASSOCIATIVE COMPUTING IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY AAKERIB@GSITECHNOLOGY.COM AGENDA The AI computational challenge Introduction to associative computing Examples An NLP use case What s next?

More information

Institute of Comp. Science (ICS), Foundation of Research and Technology (FORTH), Irakleio, Greece

Institute of Comp. Science (ICS), Foundation of Research and Technology (FORTH), Irakleio, Greece An Experimental Analysis of the Opportunities to Use Field Programmable Gate Array Multiprocessors for On-board Satellite Deep Learning Classification of Spectroscopic Observations from Future ESA Space

More information

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Flexible Batched Sparse Matrix-Vector Product on GPUs

Flexible Batched Sparse Matrix-Vector Product on GPUs ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,

More information

Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal

Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal arxiv:1705.10748v3 [cs.cv] 10 Apr 2018 Chih-Ting Liu, Yi-Heng Wu, Yu-Sheng Lin, and Shao-Yi Chien Media

More information

Deep Neural Network Acceleration Framework Under Hardware Uncertainty

Deep Neural Network Acceleration Framework Under Hardware Uncertainty Deep Neural Network Acceleration Framework Under Hardware Uncertainty Mohsen Imani, Pushen Wang, and Tajana Rosing Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA {moimani, puw001,

More information

Machine Learning on FPGAs

Machine Learning on FPGAs Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications

More information

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left

More information

PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices

PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices Chunhua Deng + City University of New York chunhua.deng@rutgers.edu Keshab K. Parhi University of Minnesota, Twin Cities parhi@umn.edu

More information

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction

More information

Model Compression. Girish Varma IIIT Hyderabad

Model Compression. Girish Varma IIIT Hyderabad Model Compression Girish Varma IIIT Hyderabad http://bit.ly/2tpy1wu Big Huge Neural Network! AlexNet - 60 Million Parameters = 240 MB & the Humble Mobile Phone 1 GB RAM 1/2 Billion FLOPs NOT SO BAD! But

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

Convolutional Neural Network Layer Reordering for Acceleration

Convolutional Neural Network Layer Reordering for Acceleration R1-15 SASIMI 2016 Proceedings Convolutional Neural Network Layer Reordering for Acceleration Vijay Daultani Subhajit Chaudhury Kazuhisa Ishizaka System Platform Labs Value Co-creation Center System Platform

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Deploying Deep Neural Networks in the Embedded Space

Deploying Deep Neural Networks in the Embedded Space Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM

More information

CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse

CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse Accepted for publication at Design, Automation and Test in Europe (DATE 2019). Florence, Italy CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Reuse Alberto Marchisio, Muhammad Abdullah

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Deep learning for object detection. Slides from Svetlana Lazebnik and many others Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep

More information

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint

More information

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance

More information

Leveraging MLC STT-RAM for Energy-efficient CNN Training

Leveraging MLC STT-RAM for Energy-efficient CNN Training Memory Demand (GB) Access Bandwidth(GB/s) Normalized Accesses Leveraging for Energy-efficient CNN Training Hengyu Zhao and Jishen Zhao University of California, San Diego {h6zhao, jzhao}@eng.ucsd.edu ABSTRACT

More information

A Lightweight YOLOv2:

A Lightweight YOLOv2: FPGA2018 @Monterey A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA Hiroki Nakahara, Haruyoshi Yonekawa, Tomoya Fujii, Shimpei Sato Tokyo Institute of Technology,

More information

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer 2 100 倍以上速く 本当に可能ですか? 2 DOUGLAS ADAMS BABEL FISH Neural Machine Translation Unit 3 4 OVER 100X FASTER, IS IT REALLY POSSIBLE?

More information

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU, Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image

More information