ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI
1 ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI. Bert Moons, Roel Uytterhoeven, Wim Dehaene, Marian Verhelst. ESAT/MICAS, KU Leuven. (Slide 1 of 56)
2 Embedded Neural Networks. Augmented Reality, Face Recognition, Artificial Intelligence: Raw Data -> CLOUD GPU -> Information.
3 Embedded Neural Networks. Augmented Reality, Face Recognition, Artificial Intelligence: Local Processing.
4 Embedded Neural Networks. 1-to-10 TOPS/W CNN processing is crucial for always-on embedded operation with local processing.
5 Always-on Neural Networks. Large-scale, highly accurate CNNs are too expensive for embedded always-on operation. VGG-16 recognition on LFW*: 5760 classes, 92.5% accuracy; complexity 15.4 GMACs, model size 15 MB; processing energy at 1 TOPS/W ~30 mJ/frame, draining a 1200 mAh, 1.5 V AAA battery in 2h. [*] Labeled Faces in the Wild data set
6 Presentation Outline. A: 1. Hierarchical Recognition; 2. DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. B: 1. Hardware Implementation; 2. Results.
7 Hierarchical Recognition. Hierarchical processing enables always-on CNN-based visual recognition.
8 Hierarchical Face Recognition. Hierarchical processing enables always-on compute: Face Detected? (6 MMACs) -> Large-Scale Recognition (15.4 GMACs).
9 Hierarchical Face Recognition. Hierarchical processing enables always-on compute: Face Detected? (6 MMACs) -> Owner Detected? (12 MMACs) -> Large-Scale Recognition (15.4 GMACs).
10 Hierarchical Face Recognition. Hierarchical processing enables always-on compute: Face Detected? (6 MMACs) -> Owner Detected? (12 MMACs) -> Friend Detected? (500 MMACs) -> Large-Scale Recognition (15.4 GMACs).
11 Hierarchical Face Recognition. Hierarchical processing enables always-on compute.
CONV-1, Face Detected?:          6 MMACs,    22 kB,  5-44% zeros, 2-4b ops, 94% acc., always-on
CONV-2, Owner Detected?:         12 MMACs,   42 kB,  8-45% zeros, 3-4b ops, 96% acc., ~1% on
CONV-3, Friend Detected?:        500 MMACs,  742 kB, 8-47% zeros, 4-6b ops, 94% acc., ~0.1% on
CONV-4, Large-Scale Recognition: 15 GMACs,   15 MB,  5-82% zeros, 4-6b ops,           ~0.01% on
Increasing # classes / network size / FP precision / energy per frame at each stage.
12 DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling. A run-time energy-vs-computational-precision trade-off.
13 Precision Scaling - DVAS. Dynamic-Voltage-Accuracy-Scaling: a standard multiplier whose input LSBs are gated at reduced precision. As in [4] Moons, VLSI 2016; Moons, JSSC.
14 Precision Scaling - DVAFS. Dynamic-Voltage-Accuracy-Frequency-Scaling: a subword-parallel multiplier computes several reduced-precision products in parallel from one full-width operand pair.
15 Precision Scaling - DVAFS. DVAFS is a dynamic precision technique built on a subword-parallel multiplier, lowering all run-time adaptable parameters: activity, frequency and supply voltage.
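The subword mode can be modeled in a few lines. A minimal sketch (illustrative only, not the Envision datapath): at mode N, a 16-bit operand pair is treated as N packed subwords, so one pass through the multiplier yields N independent reduced-precision products.

```python
def dvafs_multiply(x, y, n):
    """Model a subword-parallel multiplier: split two 16-bit operands
    into n subwords (n in {1, 2, 4} -> 1x16b, 2x8b, 4x4b) and return
    the n independent subword products computed in one pass."""
    assert n in (1, 2, 4)
    width = 16 // n                      # subword width: 16, 8 or 4 bits
    mask = (1 << width) - 1
    products = []
    for i in range(n):
        xi = (x >> (i * width)) & mask   # i-th subword of each operand
        yi = (y >> (i * width)) & mask
        products.append(xi * yi)
    return products
```

At N=2, `dvafs_multiply(0x0302, 0x0504, 2)` returns the two 8-bit products [8, 15]; activity per useful product drops, which is what allows supply voltage and frequency to scale down as well.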
16 Precision Scaling System Level. DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision (DVAS energy/word at high vs low precision: memory, CTRL & transfer, compute overhead, compute).
17 Precision Scaling System Level. DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision (DVAFS energy/word at high vs low precision: memory, CTRL & transfer, compute overhead, compute).
18 Precision Scaling System Level. Relative energy per operation vs precision, at T = 76 GOPS: 8x scaling in DVAS, 20x in DVAFS.
19 Precision Scaling BB in FDSOI. DVAFS modulates the leakage-vs-dynamic balance; body-bias tuning allows minimizing energy. High precision (dynamic energy dominant): reduce V_T at constant (V_DD - V_T) and f. Low precision (leakage dominant): increase V_T at constant (V_DD - V_T) and f.
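The leakage-vs-dynamic trade-off behind body-bias tuning can be captured in a first-order model. All constants below are illustrative assumptions, not silicon data; the point is only the trend: when activity and frequency drop at low precision, the leakage term (exponential in V_T) dominates, so raising V_T via body biasing lowers total energy.

```python
import math

def energy_per_op(vdd, vt, activity, freq):
    """First-order energy-per-operation model (illustrative constants):
    dynamic energy ~ activity * C * Vdd^2, and leakage energy per op
    ~ Vdd * I_leak / freq, with subthreshold leakage falling
    exponentially in the threshold voltage vt."""
    C = 1e-9    # switched capacitance [F] (assumed)
    S = 0.08    # subthreshold slope parameter [V] (assumed)
    I0 = 1e-3   # leakage current prefactor [A] (assumed)
    e_dynamic = activity * C * vdd ** 2
    e_leakage = vdd * I0 * math.exp(-vt / S) / freq
    return e_dynamic + e_leakage
```

With low activity and frequency (the low-precision regime), a higher vt gives lower total energy; at high activity the dynamic term dominates and a low vt costs little.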
20 Processor Architecture. Exploits: A. Parallelism and data reuse; B. Network sparsity; C. Varying precision through DVAFS.
21 Optimization: CNN Characteristics (A). Convolution operators are highly parallel and the algorithm allows inherent data reuse. Three types of reuse are supported in Envision: convolutional reuse, image reuse (one image, multiple filters) and filter reuse (multiple images, one filter). [3] Chen, ISSCC 2016
22 Optimization: CNN Characteristics (B,C). CNN weights and activations are sparse, and precision varies between applications, networks and layers (non-uniform, at 99% relative benchmark accuracy; *95% for VGG).
Sparsity (ReLU activations): LeNet-5 %, AlexNet 5-90%, VGG 5-82%.
Precision: LeNet-5 1-5 bits, AlexNet 4-9 bits, VGG (*95%) 4-6 bits.
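The per-network bit widths above come from quantizing each network to the fewest bits that preserve the target relative accuracy. A minimal uniform-quantization sketch (the slide does not specify the quantizer; symmetric fixed-point in [-1, 1) is an assumption):

```python
def quantize(x, bits):
    """Uniform symmetric quantization of x in [-1, 1) to `bits` bits:
    round to the nearest of 2^bits levels, clamping at the range edges."""
    levels = 1 << (bits - 1)            # positive levels, e.g. 8 for 4 bits
    q = round(x * levels)
    q = max(-levels, min(levels - 1, q))
    return q / levels
```

`quantize(0.3, 4)` gives 0.25: at 4 bits the representable step is 1/8, which is the granularity the 4x4b DVAFS mode operates at.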
23 A 2D-SIMD DVAFS Architecture. Blocks: IO en/decoder, DMA, RISC CTRL with ALU, 1D-SIMD (ReLU, Max-pool, MAC), 2D-SIMD MAC-array, input processing, data memories DM A-D, program memory PM, and guard (GRD) memories.
25 A 2D-SIMD DVAFS Architecture. Convolution: Filter * Image = Partial Sum.
26 A 2D-SIMD DVAFS Architecture. No reuse in a scalar solution: 1 feature (1x16b from Feature SRAM) * 1 weight (1x16b from Filter SRAM) per cycle.
27 A 2D-SIMD DVAFS Architecture. Convolutional reuse in 1D-SIMD: 16 features (Feature SRAM, 16x16b / 1x16b) * 1 weight (Filter SRAM, 1x16b).
28 A 2D-SIMD DVAFS Architecture. Convolutional reuse in 1D-SIMD, with a FIFO recirculating feature words: 16 features (16x16b / 1x16b) * 1 weight (1x16b).
29 A 2D-SIMD DVAFS Architecture. Convolutional + image reuse in 2D-SIMD: 16 features (16x16b / 1x16b) * 16 weights (16x16b).
30 A 2D-SIMD DVAFS Architecture. Convolutional + image + filter reuse in 2D-SIMD DVAFS: 16N features (16x(Nx16b/N) / 1x(Nx16b/N)) * 16N weights (16x(Nx16b/N)); shown for N=2.
31 A 2D-SIMD DVAFS Architecture. N = 1 (1x16b): 256 MAC units; 16b feature * 16b filter weight, 48b accumulation. SR = status register.
32 A 2D-SIMD DVAFS Architecture. N = 2 (2x8b): 512 MAC units; 8b * 8b subword products into 2x 24b accumulators (remaining bits unused). SR = status register.
33 A 2D-SIMD DVAFS Architecture. N = 4 (4x4b): 1024 MAC units; 4b * 4b subword products into 4x 12b accumulators (remaining bits unused). SR = status register.
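The 256/512/1024 MAC counts imply a peak throughput that scales linearly with N. A small sketch of that arithmetic (the clock frequency is a free parameter here, and a MAC is counted as 2 ops, multiply plus accumulate):

```python
def peak_throughput_gops(freq_mhz, n):
    """Peak throughput [GOPS] of a 16x16 2D-SIMD array in subword mode n:
    each physical MAC produces n subword products per cycle, giving
    256, 512 or 1024 effective MAC units for n = 1, 2, 4."""
    effective_macs = 16 * 16 * n
    return 2 * effective_macs * freq_mhz / 1000.0   # 2 ops per MAC
```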
34 A 2D-SIMD DVAFS Architecture. Guard the SRAMs and 2D-array from sparse operators: GRD flags mark zero operands, gating the Feature/Filter SRAM reads and muxing in 0 instead. [4] Moons, VLSI 2016
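Functionally, guarding amounts to consulting a per-operand sparsity flag before touching memory or the multiplier. A behavioral sketch (illustrative; in hardware the saving comes from gated SRAM reads and frozen multiplier inputs, not skipped loop iterations):

```python
def guarded_mac(features, weights, guard_flags):
    """Behavioral model of operand guarding: when the GRD flag marks an
    operand pair as zero, the SRAM read and multiply are skipped and 0
    is muxed into the accumulation. Returns (sum, number of skips)."""
    acc = 0
    skipped = 0
    for f, w, nonzero in zip(features, weights, guard_flags):
        if not nonzero:
            skipped += 1     # guarded: no memory access, no multiply
            continue
        acc += f * w
    return acc, skipped
```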
35 Flexible Memory / IO compression (architecture overview: IO en/decoder, DMA, RISC CTRL, 1D-SIMD, 2D-SIMD MAC-array, data/program/GRD memories).
36 Flexible Memory / IO compression. C-programmable (16b instructions). Huffman-based IO compression, up to 5.8x on AlexNet. Memories: 16 kB PM, 128 kB DM (DM A-D, 3-wise parallel access), 4 kB GRD SRAM holding sparsity flags. As in [4] Moons, VLSI 2016
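Huffman coding compresses sparse feature maps well because the zero symbol dominates and receives the shortest code. A sketch of the idea (code-length computation only; the on-chip en/decoder format is not specified in the slides, so the 4-bit raw word width below is an assumption):

```python
import heapq
from collections import Counter

def huffman_code_lengths(stream):
    """Per-symbol Huffman code lengths (in bits) for a symbol stream."""
    freq = Counter(stream)
    if len(freq) == 1:                  # degenerate single-symbol stream
        return {s: 1 for s in freq}
    # heap entries: (count, tiebreak, {symbol: depth-so-far})
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, i, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, i, merged))
    return heap[0][2]

def compression_ratio(stream, raw_bits=4):
    """Raw size / Huffman-coded size for a fixed raw word width."""
    lengths = huffman_code_lengths(stream)
    freq = Counter(stream)
    coded_bits = sum(freq[s] * lengths[s] for s in freq)
    return len(stream) * raw_bits / coded_bits
```

On a 90%-zero stream the zero symbol gets a 1-bit code and the ratio approaches 3.5x at 4-bit raw words, in the same ballpark as the up-to-5.8x reported for AlexNet.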
37 Physical Implementation. Efficiency and scalability through granular power and body-bias domains.
38 Physical Implementation (28nm FDSOI). Granular supply and body-bias domains across the blocks: V_MEM with BBGND, V_2D with BB1 for the 2D-SIMD MAC array, V_CTRL with BB2 for control.
39 Physical Implementation (28nm FDSOI). Die: 1.29 mm x 1.45 mm = 1.87 mm2, containing the 2D-SIMD MAC array and the RISC, DMA, MEM blocks.
40 Measurement Results. Efficiencies from 0.25-to-10 TOPS/W depending on precision and network sparsity.
41 Measurement Results (efficiency vs supply voltage and throughput). 1x16b, BB nom: 0.25 TOPS/W at 1.05V.
42 Measurement Results. 2x8b, BB nom: 1 TOPS/W at 0.8V.
43 Measurement Results. 4x4b, BB nom: 4 TOPS/W at 0.67V.
44 Measurement Results. 30-60% sparse 4x3-4b, BB nom: 8.2 TOPS/W at 0.61V.
45 Measurement Results (at constant f, T). 1x16b, BB nom = +/-0.6V, V_DD = 0.85V: 0.33 TOPS/W.
46 Measurement Results (at constant f, T). 1x16b, BB opt = +/-1.2V lowers V_DD from 0.85V to 0.70V: 0.53 TOPS/W, a 1.6x gain over BB nom (0.33 TOPS/W).
47 Measurement Results (at constant f, T). 30-60% sparse 4x3-4b, BB nom = +/-0.6V, V_DD = 0.61V: 8.2 TOPS/W.
48 Measurement Results (at constant f, T). 30-60% sparse 4x3-4b, BB opt = +/-0.2V, V_DD = 0.63V: 10 TOPS/W at 300 GOPS, a 1.2x gain over BB nom (8.2 TOPS/W).
49 Measurement Results. Overall, a 40x energy-efficiency range: from 0.25 TOPS/W (1x16b, BB nom) to 10 TOPS/W (30-60% sparse 4x3-4b, BB opt, 300 GOPS).
50 Hierarchical Face Recognition Revisited. Hierarchical processing enables always-on compute: CONV-1 (2-4b) 3 uJ/frame at 4.2 TOPS/W, always-on; CONV-2 6 uJ/frame at 4 TOPS/W, ~1% on; CONV-3 500 uJ/frame at 1.8 TOPS/W, ~0.1% on; CONV-4 (4-6b) at 1.3 TOPS/W, ~0.01% on.
51 Hierarchical Face Recognition Revisited. This functionality runs always-on at 6 uJ/frame average CONV-layer energy consumption.
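The ~6 uJ/frame average follows from weighting each stage's energy by its duty cycle. A sketch of that arithmetic; the CONV-4 energy is not legible on the slide, so it is estimated here from 15.4 GMACs (30.8 GOPs) at 1.3 TOPS/W, i.e. about 23.7 mJ/frame (an assumption, not a slide value):

```python
def average_energy_uj(stages):
    """Expected per-frame energy of a detection cascade: each stage's
    energy [uJ] is weighted by the fraction of frames reaching it."""
    return sum(energy * duty for energy, duty in stages)

cascade = [
    (3.0,     1.0),     # CONV-1: always-on
    (6.0,     0.01),    # CONV-2: ~1% of frames
    (500.0,   0.001),   # CONV-3: ~0.1% of frames
    (23700.0, 0.0001),  # CONV-4: ~0.01% of frames (energy estimated)
]
# average_energy_uj(cascade) -> about 5.9 uJ/frame, matching the ~6 uJ claim
```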
52 Comparison. A. Highest scalability of energy vs computational precision (40x); B. Efficiencies up to 10 TOPS/W.
53 Comparison with SotA (this work: N = 1, 2 or 4).
             Eyeriss [3] ISSCC'16 | Moons [4] VLSI'16 | This work
Technology:  65nm LP              | 40nm LP           | 28nm FDSOI
f_nom:       200 MHz @ 1V         | 200 MHz @ 1.1V    | 200 MHz @ 1V
Min. Eff.:   0.17 TOPS/W          | 0.27 TOPS/W       | 0.25 TOPS/W
Max. Eff.:   0.25 TOPS/W          | 2.60 TOPS/W       | 10.0 TOPS/W
54 Comparison with SotA. Survey plot of energy efficiency [TOPS/W] vs throughput [GOPS] for 4-bit, 8-bit and 16-bit designs, comparing this work with Moons [4], Chen [3] and ISSCC'17 designs ID14.2 and ID14.6. See homes.esat.kuleuven.be/~mverhels/dlicsurvey.html
55 Summary. Envision: a 0.25-to-10 TOPS/W CNN processor, trading energy vs computational precision.
56 Summary. Always-on through hierarchical computing. An energy-efficient CNN architecture: 1. 2D-SIMD baseline; 2. DVAFS-compatible; 3. Operator guarding and IO-compression. Envision: a 0.25-to-10 TOPS/W CNN processor, with throughput in GOPS varying with the required network precision. Acknowledgement: This work was partly funded by FWO and Intel Corporation. We thank Synopsys for tool support and STMicroelectronics for silicon donation.
THEME ARTICLE: Hardware Acceleration A Communication-Centric Approach for Designing Flexible DNN Accelerators Hyoukjun Kwon, High computational demands of deep neural networks Ananda Samajdar, and (DNNs)
More informationC-Brain: A Deep Learning Accelerator
C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-level Parallelization Lili Song, Ying Wang, Yinhe Han, Xin Zhao, Bosheng Liu, Xiaowei Li State Key Laboratory
More informationScaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationConvolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech
Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:
More informationDEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017
DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X
More informationBandwidth-Efficient Deep Learning
1 Bandwidth-Efficient Deep Learning from Compression to Acceleration Song Han Assistant Professor, EECS Massachusetts Institute of Technology 2 AI is Changing Our Lives Self-Driving Car Machine Translation
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationFace Recognition A Deep Learning Approach
Face Recognition A Deep Learning Approach Lihi Shiloh Tal Perl Deep Learning Seminar 2 Outline What about Cat recognition? Classical face recognition Modern face recognition DeepFace FaceNet Comparison
More informationHyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mw IoT End-Nodes
Hyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mw IoT End-Nodes Renzo Andri, Lukas Cavigelli, Davide Rossi, Luca Benini Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland
More informationMachine Learning. MGS Lecture 3: Deep Learning
Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer
More informationHigh performance, power-efficient DSPs based on the TI C64x
High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research
More informationIN-MEMORY ASSOCIATIVE COMPUTING
IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY AAKERIB@GSITECHNOLOGY.COM AGENDA The AI computational challenge Introduction to associative computing Examples An NLP use case What s next?
More informationBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationarxiv: v1 [cs.cv] 26 Aug 2016
Scalable Compression of Deep Neural Networks Xing Wang Simon Fraser University, BC, Canada AltumView Systems Inc., BC, Canada xingw@sfu.ca Jie Liang Simon Fraser University, BC, Canada AltumView Systems
More informationSOFTWARE HARDWARE CODESIGN ACCELERATION FOR EFFICIENT NEURAL NETWORK. ...Deep learning and neural
... SOFTWARE HARDWARE CODESIGN FOR EFFICIENT NEURAL NETWORK ACCELERATION... Kaiyuan Guo Tsinghua University and DeePhi Song Han Stanford University and DeePhi Song Yao DeePhi Yu Wang Tsinghua University
More informationA 19.4 nj/decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array. Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
A 19.4 nj/decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array Mingu Kang, Sujan Gonugondla, Naresh Shanbhag University of Illinois at Urbana Champaign Machine Learning under Resource
More informationMachine Learning on VMware vsphere with NVIDIA GPUs
Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology
More informationSmartShuttle: Optimizing Off-Chip Memory Accesses for Deep Learning Accelerators
SmartShuttle: Optimizing Off-Chip emory Accesses for Deep Learning Accelerators Jiajun Li, Guihai Yan, Wenyan Lu, Shuhao Jiang, Shijun Gong, Jingya Wu, Xiaowei Li State Key Laboratory of Computer Architecture,
More informationHyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, Yiran Chen Duke University, University of Southern California {linghao.song,
More informationThe Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016
The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision
More informationDeep Neural Network Acceleration Framework Under Hardware Uncertainty
Deep Neural Network Acceleration Framework Under Hardware Uncertainty Mohsen Imani, Pushen Wang, and Tajana Rosing Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA {moimani, puw001,
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationChain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks
Chain-NN: An Energy-Efficient D Chain Architecture for Accelerating Deep Convolutional Neural Networks Shihao Wang, Dajiang Zhou, Xushen Han, Takeshi Yoshimura Graduate School of Information, Production
More information