ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI


1 ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI. Bert Moons, Roel Uytterhoeven, Wim Dehaene, Marian Verhelst. ESAT/MICAS, KU Leuven.

2 Embedded Neural Networks: Augmented Reality, Face Recognition, Artificial Intelligence. [Figure: raw data is sent to a cloud GPU, which returns the extracted information.]

3 Embedded Neural Networks: Augmented Reality, Face Recognition, Artificial Intelligence. [Figure: the same applications served by local processing instead of the cloud.]

4 Embedded Neural Networks: 1-to-10 TOPS/W CNN processing is crucial for always-on embedded operation with local processing.

5 Always-on Neural Networks: large-scale, highly accurate CNNs are too expensive for embedded always-on operation. Example: VGG-16 recognition on LFW* (5760 classes, 92.5% accuracy) costs 15.4 GMACs per frame at a 15 MB model size; at 1 TOPS/W that is ~30 mJ/frame, draining a 1200 mAh, 1.5 V AAA battery in 2 h at ~30 fps. [*] Labeled Faces in the Wild dataset.
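
The battery claim checks out with back-of-the-envelope arithmetic (a sketch; 1 MAC is counted as 2 ops, and the ~30 fps rate is inferred so the slide's numbers agree):

```latex
\begin{aligned}
E_{\text{frame}} &\approx \frac{2 \times 15.4 \cdot 10^{9}\ \text{ops}}{10^{12}\ \text{ops/J}} \approx 31\ \text{mJ} \\
E_{\text{batt}} &= 1.2\ \text{Ah} \times 1.5\ \text{V} \approx 6480\ \text{J} \\
t &\approx \frac{6480\ \text{J}}{0.031\ \text{J/frame} \times 30\ \text{fps}} \approx 7000\ \text{s} \approx 2\ \text{h}
\end{aligned}
```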

6 Presentation Outline. A: 1. Hierarchical Recognition; 2. DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. B: 1. Hardware Implementation; 2. Results.

7 Hierarchical Recognition: hierarchical processing enables always-on CNN-based visual recognition.

8 Hierarchical Face Recognition: hierarchical processing enables always-on compute. A small "Face detected?" network (6 MMACs) gates the large-scale recognition network (15.4 GMACs).

9 Hierarchical Face Recognition: hierarchical processing enables always-on compute. "Face detected?" (6 MMACs) gates "Owner detected?" (12 MMACs), which gates large-scale recognition (15.4 GMACs); a "no" (N) at any stage exits early, a "yes" (Y) wakes the next stage.

10 Hierarchical Face Recognition: hierarchical processing enables always-on compute. "Face detected?" (6 MMACs) gates "Owner detected?" (12 MMACs), then "Friend detected?" (500 MMACs), then large-scale recognition (15.4 GMACs).

11 Hierarchical Face Recognition: hierarchical processing enables always-on compute, as sketched below. CONV-1 "Face detected?": 6 MMACs, 22 kB, 5-44% zeros, 2-4b ops, 94% acc., always-on. CONV-2 "Owner detected?": 12 MMACs, 42 kB, 8-45% zeros, 3-4b ops, 96% acc., ~1% on. CONV-3 "Friend detected?": 500 MMACs, 742 kB, 8-47% zeros, 4-6b ops, 94% acc., ~0.1% on. CONV-4, large-scale recognition: 15 GMACs, 15 MB, 5-82% zeros, 4-6b ops, ~0.01% on. Number of classes, network size, fixed-point precision and energy per frame all increase down the hierarchy.
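
Such a cascade is straightforward to express in software; a minimal sketch (function names, stage ordering and cost bookkeeping are illustrative, not Envision's firmware):

```python
# Wake-up cascade: cheap, low-precision stages gate expensive,
# high-precision ones, so the 15.4 GMAC network runs rarely.

def hierarchical_recognition(frame, stages):
    """stages: (run_network, macs) pairs, ordered cheapest-first.
    run_network(frame) -> (detected, label); a miss powers down
    all later stages (early exit)."""
    label, spent_macs = None, 0
    for run_network, macs in stages:
        spent_macs += macs             # cost bookkeeping only
        detected, label = run_network(frame)
        if not detected:               # e.g. no face: stay in low power
            return None, spent_macs
    return label, spent_macs
```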

12 DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling, a run-time energy-vs-computational-precision trade-off.

13 Precision Scaling, DVAS: Dynamic-Voltage-Accuracy-Scaling. [Figure: a standard 4b multiplier (inputs x3..x0, y3..y0, outputs z3..z0) next to a DVAS multiplier in which the operand LSBs (x1, x0, y1, y0) are gated to zero at reduced precision, cutting switching activity.] As in [4] Moons, VLSI 2016; Moons, JSSC.

14 Precision Scaling, DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. [Figure: a subword-parallel multiplier; at reduced precision the same datapath computes several independent subword products (x11*y11, x00*y00, ...) per cycle instead of gating its LSBs.]

15 Precision Scaling, DVAFS: DVAFS is a dynamic precision technique that lowers all run-time adaptable parameters: activity, frequency and supply voltage. [Figure: the subword-parallel multiplier of slide 14.]
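
A bit-level software model of the two techniques (my own sketch on unsigned operands, not the RTL): DVAS gates operand LSBs so the full-width multiplier toggles less, while DVAFS instead reuses the idle LSB hardware as N independent subword multipliers.

```python
def dvas_multiply(x, y, n_bits, full=16):
    """DVAS: force the (full - n_bits) LSBs of each operand to zero,
    then run one low-activity multiply on the full-width multiplier."""
    mask = ((1 << n_bits) - 1) << (full - n_bits)
    return (x & mask) * (y & mask)

def dvafs_multiply(xs, ys, n):
    """DVAFS: the same multiplier produces n independent (16/n)-bit
    products per cycle, n in {1, 2, 4} (1x16b, 2x8b, 4x4b)."""
    width = 16 // n
    mask = (1 << width) - 1
    return [(x & mask) * (y & mask) for x, y in zip(xs, ys)]
```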

16 Precision Scaling, System Level: DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision. [Bar chart: energy per word under DVAS at high and low precision, split into memory, control & transfer, compute and compute overhead; only the compute term shrinks with precision, so the overheads dominate.]

17 Precision Scaling, System Level: under DVAFS the memory and control & transfer energy per word shrinks along with compute, since N words share each fetch and each cycle. [Bar chart: DVAFS energy-per-word breakdown at high and low precision.]

18 Precision Scaling, System Level. [Plot: relative energy per operation vs precision (bits) at constant throughput (*T = 76 GOPS): 8x energy scaling in DVAS, 20x in DVAFS.]
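
To first order (a rough model of my own, not the paper's equation), the DVAFS energy per operation at subword factor N combines three savings that DVAS alone cannot: lower activity alpha(N), N results per multiply, and the lower clock and supply that constant throughput T permits:

```latex
E_{\text{op}}(N) \;\propto\; \frac{\alpha(N)\, C_{\text{eff}}\, V_{DD}^{2}(N)}{N},
\qquad
f(N) \;=\; \frac{T}{N \cdot P}
```

with P the number of physical multipliers; the N-fold drop in required f is what allows V_DD(N) to scale down.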

19 Precision Scaling, Body Biasing in FDSOI: DVAFS modulates the leakage-vs-dynamic balance, and body-bias tuning minimizes energy. High precision (dynamic dominant): reduce V_T at constant (V_DD - V_T) and f. Low precision (leakage dominant): increase V_T at constant (V_DD - V_T) and f. [Bar chart: dynamic and leakage energy at nominal vs optimal body bias for both cases.]
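
The trade-off can be sketched as re-balancing two energy terms (first-order; the back-gate coefficient gamma is an assumed model parameter of FDSOI body biasing):

```latex
E \;\approx\; \underbrace{\alpha\, C\, V_{DD}^{2}}_{\text{dynamic}}
\;+\; \underbrace{V_{DD}\, I_{\text{leak}}\, t_{\text{op}}}_{\text{leakage}},
\qquad
I_{\text{leak}} \propto e^{-V_T / (n\, v_T)},
\qquad
V_T = V_{T0} - \gamma\, V_{BB}
```

Forward bias lowers V_T so V_DD can drop at constant speed (good when dynamic energy dominates); reverse bias raises V_T to cut leakage (good in slow, low-precision modes), which is why the optimal bias flips between modes.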

20 Processor Architecture exploits: A. parallelism and data reuse; B. network sparsity; C. varying precision through DVAFS.

21 Optimization, CNN Characteristics (A): convolution operators are highly parallel and the algorithm allows inherent data reuse. Three types of reuse are supported in Envision: convolutional reuse (one image, one filter), image reuse (one image, multiple filters) and filter reuse (multiple images, one filter). [3] Chen, ISSCC 2016.
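
The three reuse patterns map onto loop invariance in the convolution loop nest; a plain sketch (naming mine):

```python
import numpy as np

def conv_layer(images, filters):
    """images: (B, H, W); filters: (F, K, K).
    - convolutional reuse: one weight meets many (y, x) positions
    - image reuse: one image patch feeds all F filters
    - filter reuse: one filter sweeps all B images"""
    B, H, W = images.shape
    F, K, _ = filters.shape
    out = np.zeros((B, F, H - K + 1, W - K + 1))
    for b in range(B):                    # filter reuse across b
        for f in range(F):                # image reuse across f
            for y in range(H - K + 1):    # convolutional reuse
                for x in range(W - K + 1):
                    patch = images[b, y:y + K, x:x + K]
                    out[b, f, y, x] = np.sum(patch * filters[f])
    return out
```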

22 Optimization, CNN Characteristics (B, C): CNN weights and activations are sparse (ReLU activations), and precision varies between apps, networks and layers (non-uniform quantization at 99%* relative benchmark accuracy). Sparsity: LeNet-5 ...%, AlexNet 5-90%, VGG 5-82%. Precision: LeNet-5 1-5 bits, AlexNet 4-9 bits, VGG (*95%) 4-6 bits.

23 A 2D-SIMD DVAFS Architecture. [Block diagram: IO encoder/decoder, DMA, RISC controller with ALU, a 2D-SIMD MAC array, a 1D-SIMD processor (ReLU, max-pool, MAC), four data memories (DM A-D) with GRD guard memories, and input processing.]

24 A 2D-SIMD DVAFS Architecture. [The same block diagram as slide 23.]

25 A 2D-SIMD DVAFS Architecture. [Figure: 2D convolution, Filter * Image = Partial Sum.]

26 A 2D-SIMD DVAFS Architecture: no reuse in a scalar solution; 1 feature * 1 weight per cycle, with one 1x16b fetch each from the Feature SRAM and the Filter SRAM.

27 A 2D-SIMD DVAFS Architecture: convolutional reuse in 1D-SIMD; 16 features * 1 weight per cycle, with 16x16b / 1x16b feature fetches and a 1x16b filter fetch.

28 A 2D-SIMD DVAFS Architecture: convolutional reuse in 1D-SIMD, with a FIFO recirculating fetched features across overlapping convolution windows.

29 A 2D-SIMD DVAFS Architecture: convolutional + image reuse in 2D-SIMD; 16 features * 16 weights per cycle, with 16x16b filter fetches.

30 A 2D-SIMD DVAFS Architecture: convolutional + image + filter reuse in 2D-SIMD DVAFS; 16N features * 16N weights per cycle, with 16x(Nx16b/N) / 1x(Nx16b/N) feature fetches and 16x(Nx16b/N) filter fetches (shown for N=2).

31 A 2D-SIMD DVAFS Architecture, N = 1 (1x16b): 256 MAC units; each multiplies a 16b feature with a 16b filter word and accumulates into a 48b register (SR*). *Status Register.

32 A 2D-SIMD DVAFS Architecture, N = 2 (2x8b): 512 effective MAC units; each multiplier computes two 8b x 8b subword products (the cross partial-product regions are unused) and accumulates into 2x 24b.

33 A 2D-SIMD DVAFS Architecture, N = 4 (4x4b): 1024 effective MAC units; four 4b x 4b subword products per multiplier (cross partial products unused), accumulating into 4x 12b.
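
The three configurations follow one scaling rule: operand width, accumulator width and effective MAC count all scale with N. A small bookkeeping sketch (widths taken from the slides above):

```python
def dvafs_mode(n):
    """Array configuration for subword factor n (1, 2 or 4)."""
    assert n in (1, 2, 4)
    return {
        "subwords_per_multiplier": n,
        "operand_bits": 16 // n,      # 16b -> 8b -> 4b
        "accumulator_bits": 48 // n,  # 48b -> 2x24b -> 4x12b
        "effective_macs": 256 * n,    # 256 -> 512 -> 1024
    }

for n in (1, 2, 4):
    print(dvafs_mode(n))
```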

34 A 2D-SIMD DVAFS Architecture: guard the SRAMs and the 2D array from sparse operands; per-word GRD flags (0/1), stored in GRD SRAMs alongside the Feature and Filter SRAMs, gate both fetches and MAC activity for zero values. As in [4] Moons, VLSI 2016.
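
In a software model, guarding amounts to a one-bit flag per word that suppresses both the SRAM fetch and the multiplier toggling for zero operands (a hedged sketch of the mechanism, not the RTL):

```python
def guarded_mac(acc, features, weights, feat_flags, wgt_flags):
    """GRD flags mark nonzero words. A zero word is neither fetched
    nor multiplied; in hardware the SRAM read and the MAC inputs are
    gated, modeled here as a plain skip."""
    for f, w, ff, wf in zip(features, weights, feat_flags, wgt_flags):
        if ff and wf:          # both operands nonzero
            acc += f * w
    return acc
```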

35 Flexible Memory / IO Compression. [The same block diagram as slide 23.]

36 Flexible Memory / IO Compression: C-programmable (16b instructions); Huffman-based IO compression, up to 5.8x on AlexNet; 16 kB program memory; 128 kB of data memory in four banks (DM A-D) with 3-wise parallel access; 4 kB GRD SRAM holding the sparsity flags. As in [4] Moons, VLSI 2016.
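
Huffman coding pays off here because quantized, ReLU-sparse tensors have a heavily skewed value histogram (mostly zeros). A generic encoder sketch with the standard library; the deck does not disclose Envision's actual code tables, so this is illustrative only:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code: frequent symbols (e.g. 0) get short codes."""
    freq = Counter(symbols)
    # Heap entries: (count, tiebreak, {symbol: codeword}).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        c0, _, code0 = heapq.heappop(heap)
        c1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (c0 + c1, i, merged))
        i += 1
    return heap[0][2]

weights = [0, 0, 0, 3, 0, -1, 0, 0, 2, 0]   # toy quantized weights
code = huffman_code(weights)
stream = "".join(code[w] for w in weights)  # compressed bitstream
```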

37 Physical Implementation: efficiency and scalability through granular power and body-bias domains.

38 Physical Implementation, 28nm FDSOI. [Floorplan with granular supply and body-bias domains: V_MEM with grounded body bias (BB_GND) for the memories, V_2D with BB1 for the 2D-SIMD MAC array, and V_CTRL with BB2 for the controller.]

39 Physical Implementation, 28nm FDSOI. [Die photo: 1.29 mm x 1.45 mm = 1.87 mm2; 2D-SIMD MAC array, RISC, DMA, memories.]

40 Measurement Results: efficiencies from 0.25-to-10 TOPS/W depending on precision and network sparsity.

41 Measurement Results, 1x16b at nominal body bias: 0.25 TOPS/W at 1.05 V. [Plot: efficiency (TOPS/W) and supply voltage (V) vs throughput (GOPS).]

42 Measurement Results, adding 2x8b: 1 TOPS/W at 0.8 V.

43 Measurement Results, adding 4x4b: 4 TOPS/W at 0.67 V.

44 Measurement Results, adding 30-60% sparse 4x3-4b: 8.2 TOPS/W at 0.61 V.

45 Measurement Results, 1x16b at reduced frequency and throughput, nominal body bias (BB_nom = +/-0.6 V): 0.33 TOPS/W at V = 0.85 V.

46 Measurement Results, 1x16b with optimized body bias (BB_opt = +/-1.2 V): the supply drops to V = 0.70 V and efficiency rises from 0.33 to 0.53 TOPS/W, a 1.6x gain.

47 Measurement Results, 30-60% sparse 4x3-4b at reduced frequency and throughput, nominal body bias (BB_nom = +/-0.6 V): 8.2 TOPS/W at V = 0.61 V.

48 Measurement Results, 30-60% sparse 4x3-4b with optimized body bias (BB_opt = +/-0.2 V): 10 TOPS/W at V = 0.63 V, a 1.2x gain.

49 Measurement Results: overall, a 40x energy-efficiency scaling range, from 0.25 TOPS/W (1x16b, nominal) to 10 TOPS/W (sparse 4x3-4b at BB_opt).

50 Hierarchical Face Recognition Revisited: hierarchical processing enables always-on compute. 2-4b CONV at 4.2 TOPS/W, 3 µJ/frame (always-on); CONV at 4 TOPS/W, 6 µJ/frame (~1% on); CONV at 1.8 TOPS/W, 500 µJ/frame (~0.1% on); 4-6b large-scale CONV at 1.3 TOPS/W (~0.01% on).

51 Hierarchical Face Recognition Revisited: this functionality runs always-on at a 6 µJ/frame average CONV-layer energy consumption.
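
That average follows from weighting each stage's energy by its duty cycle; the fourth stage's per-frame energy is not legible in the transcription, so ~24 mJ is my estimate from 2 x 15.4 GMACs at 1.3 TOPS/W:

```latex
\bar{E} \;\approx\; 1 \cdot 3\,\mu\text{J}
\;+\; 0.01 \cdot 6\,\mu\text{J}
\;+\; 0.001 \cdot 500\,\mu\text{J}
\;+\; 0.0001 \cdot 24\,\text{mJ}
\;\approx\; 6\,\mu\text{J/frame}
```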

52 Comparison: A. highest scalability of energy vs computational precision (40x); B. efficiencies up to 10 TOPS/W.

53 Comparison with SotA (elided values marked "..."):

                  Eyeriss [3] ISSCC'16   Moons [4] VLSI'16   This work (N = 1, 2 or 4)
Technology        65nm LP                40nm LP             28nm FDSOI
f_nom / supply    200 MHz, 1 V           200 MHz, 1.1 V      200 MHz, 1 V
AlexNet CONV      ... mW @ 35 fps        ... fps             N x ... fps
VGG CONV          -                      -                   1.7 fps
Nominal GOPS      ... GOPS               ... GOPS            ... GOPS
Min. Eff.         0.17 TOPS/W            0.27 TOPS/W         0.25 TOPS/W
Max. Eff.         0.25 TOPS/W            2.60 TOPS/W         10.0 TOPS/W

54 Comparison with SotA: homes.esat.kuleuven.be/~mverhels/dlicsurvey.html. [Scatter plot: energy efficiency (TOPS/W) vs throughput (GOPS), grouped by precision (8-bit, 16-bit, ...): this work vs Moons [4], Chen (Eyeriss) and ISSCC'17 designs ID14.2, ID14.6 and ID14.7.]

55 Summary. Envision: a 0.25-to-10 TOPS/W CNN processor, trading energy vs computational precision.

56 Summary. Always-on operation through hierarchical computing. An energy-efficient CNN architecture: 1. a 2D-SIMD baseline; 2. DVAFS compatibility; 3. operator guarding and IO compression. Envision: 0.25-to-10 TOPS/W, with throughput (GOPS) varying with the required network precision. Acknowledgement: this work was partly funded by FWO and Intel Corporation. We thank Synopsys for tool support and STMicroelectronics for silicon donation.
