ComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator
1 ComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator
ICS 2018, June 2018
Dongwoo Lee, Sungbum Kang, Kiyoung Choi
Neural Processing Research Center (NPRC)
2 Outline
- Motivation
- Early Negative Detection (END)
- Computation Pruning thru END (ComPEND)
- Evaluation
- Conclusion
3-5 Motivation
Perceptron: for input activations $A_i$ and weights $W_i$, $x = \sum_{i=1}^{N} A_i W_i$, and the output activation is $A^{l} = f(x)$.
The rectified linear unit (ReLU, $f(x) = \max(0, x)$) is widely used as an activation function for DNNs.
If we know a priori that $x \le 0$, we can skip unnecessary computations and simply set the ReLU output to zero.
6 Motivation
Distribution of negative inputs to ReLU functions in VGG-16
[Figure: share of negative ReLU inputs, annotated "More than 6%"]
7 Early Negative Detection (END)
Two's complement number representation (4 bits)
- Negative: 1111 = -8+7 = -1; 1110 = -8+6 = -2; 1101 = -8+5 = -3; 1100 = -8+4 = -4
- Positive: 0001 = -0+1 = +1; 0000 = -0+0 = +0
For a B-bit number $W$ with bits $(w_{B-1}\, w_{B-2}\, w_{B-3} \cdots w_1\, w_0)$:
$W = w_{B-1}\,(-2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(+2^{k})$
8 Early Negative Detection (END)
Inverted two's complement number representation (4 bits)
- Positive: 1111 = +8-7 = +1; 1110 = +8-6 = +2; 1101 = +8-5 = +3; 1100 = +8-4 = +4
- Negative: 0001 = +0-1 = -1; 0000 = +0-0 = -0
For a B-bit number $W$ with bits $(w_{B-1}\, w_{B-2}\, w_{B-3} \cdots w_1\, w_0)$:
$W = w_{B-1}\,(+2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(-2^{k})$
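The two formulas can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the encoder relies on the observation (which follows from the two formulas, but is not stated on the slides) that the inverted encoding of a value is the two's complement encoding of its negation.

```python
def decode_twos(bits):
    """Value of an MSB-first bit list in ordinary two's complement."""
    width = len(bits)
    low = sum(b << (width - 2 - k) for k, b in enumerate(bits[1:]))
    return -(bits[0] << (width - 1)) + low


def decode_inverted(bits):
    """Value of an MSB-first bit list in inverted two's complement:
    the MSB weighs +2^(B-1) and every other bit weighs -2^k."""
    width = len(bits)
    low = sum(b << (width - 2 - k) for k, b in enumerate(bits[1:]))
    return (bits[0] << (width - 1)) - low


def encode_inverted(value, width):
    """Inverted two's complement bits of `value`, MSB first.
    Representable range: -(2^(B-1) - 1) .. +2^(B-1)."""
    if not -(2 ** (width - 1) - 1) <= value <= 2 ** (width - 1):
        raise ValueError("value out of range for inverted two's complement")
    raw = (-value) & ((1 << width) - 1)  # two's complement pattern of -value
    return [(raw >> (width - 1 - k)) & 1 for k in range(width)]


print(decode_twos([1, 1, 1, 1]))      # -8 + 7
print(decode_inverted([1, 1, 1, 1]))  # +8 - 7
print(encode_inverted(-3, 4))
```

Note that the representable range is mirrored relative to two's complement: +8 fits in 4 bits, while -8 does not.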
9-12 Early Negative Detection (END)
Inverted two's complement representation for negative detection
[Figure: worked example with activation 5 and a negative weight, shown in decimal, two's complement, and inverted two's complement. In the inverted representation the remaining steps are skipped ("Skipped!") once the partial result is negative, and the ReLU output is zero.]
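13 Early Negative Detection (END)
Two's complement representation: $W = w_{B-1}\,(-2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(+2^{k})$
Processed from the MSB, the partial value starts at or below zero and can only grow, so a positive sum and a negative sum cannot be told apart until the last steps.
Inverted two's complement representation: $W = w_{B-1}\,(+2^{B-1}) + \sum_{k=0}^{B-2} w_k\,(-2^{k})$
Processed from the MSB, the partial value starts at or above zero and can only shrink, so as soon as it becomes negative the final result is known to be negative ("Stop here!").
[Figure: partial value versus steps for a positive sum and a negative sum under each representation]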
14-16 Early Negative Detection (END)
For multiple inputs, with each weight written in inverted two's complement:
$x = \sum_{i=1}^{N} A_i W_i$
$\;\;= A_1\,[\,w_{1,B-1}\,2^{B-1} - w_{1,B-2}\,2^{B-2} - w_{1,B-3}\,2^{B-3} - \cdots\,]$
$\;\;+ A_2\,[\,w_{2,B-1}\,2^{B-1} - w_{2,B-2}\,2^{B-2} - w_{2,B-3}\,2^{B-3} - \cdots\,]$
$\;\;+ \cdots$
$\;\;+ A_N\,[\,w_{N,B-1}\,2^{B-1} - w_{N,B-2}\,2^{B-2} - w_{N,B-3}\,2^{B-3} - \cdots\,]$
17 Computation Pruning thru END (ComPEND)
Bit-serial sum of products
- Takes multiple steps, but the area of a bit-serial unit is much smaller
- Can integrate more units, giving higher performance
- Similar to Stripes (P. Judd et al., MICRO 2016)
[Figure: conventional sum of products (full-width weights $W_1 \ldots W_N$ and activations $A_1 \ldots A_N$ summed into $S$) versus bit-serial sum of products (weight bits fed from MSB to LSB over B steps, with the partial sum shifted and accumulated into $S$)]
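A minimal software sketch of how a bit-serial sum of products can be combined with early negative detection is given below. It is not the accelerator's implementation: the function names are hypothetical, all weight bit planes are held in Python lists, and activations are assumed to be non-negative (as they are after a preceding ReLU), which is what makes the partial sum non-increasing after the MSB step.

```python
def inverted_bits(weight, width):
    """Inverted two's complement bits of `weight`, MSB first."""
    if not -(2 ** (width - 1) - 1) <= weight <= 2 ** (width - 1):
        raise ValueError("weight out of range")
    raw = (-weight) & ((1 << width) - 1)
    return [(raw >> (width - 1 - k)) & 1 for k in range(width)]


def bit_serial_relu(activations, weights, width=16):
    """Return (ReLU(sum of products), number of bit steps executed)."""
    if any(a < 0 for a in activations):
        raise ValueError("early negative detection assumes activations >= 0")
    planes = [inverted_bits(w, width) for w in weights]
    partial = 0
    for step in range(width):
        pos = width - 1 - step
        # Add up the activations whose weight has a 1 in this bit position.
        plane_sum = sum(a for a, p in zip(activations, planes) if p[step])
        if step == 0:
            partial += plane_sum << pos   # MSB plane weighs +2^(B-1)
        else:
            partial -= plane_sum << pos   # every lower plane weighs -2^pos
        if partial < 0:
            # Later planes can only subtract more, so the sum stays negative.
            return 0, step + 1
    return partial, width


print(bit_serial_relu([5], [-3], width=4))        # (0, 3)
print(bit_serial_relu([5, 1], [3, -2], width=4))  # (13, 4)
```

In the first call the partial sum becomes negative at the third of four steps, so the last step is skipped; in the second call the sum is positive and all four steps run.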
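18 Computation Pruning thru END (ComPEND)
Overall architecture of ComPEND
[Block diagram: DRAM, STT-RAM, weight buffers (WBs), activation buffers (ABs), memory controller, provider network, global controller, and the array of processing units (PUs) that combines the input activations with the weights $W^{l}$ to produce $A^{l}$]
- 9x16 array of PUs
- 32 16-bit inputs per PU
- 9x16x32 inputs at a time (3x3x512 filter)
- 16 + additional PUs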
19 Computation Pruning thru END (ComPEND)
DATA packing
- Input activation block: 32 activations at the same X, Y position, 16 bits each (one 512-bit block)
- Weight bits block: 512 bits of weights in the same bit position, giving one 1-bit x 512 block for the MSB, one for MSB-1, and so on down to the LSB
- $F_z = I_z$; the figure shows the case $F_z = 512$
[Figure: input volume ($I_x$, $I_y$, $I_z$), filter volume ($F_x$, $F_y$, $F_z$), and output volume ($O_x$, $O_y$, $O_z$)]
20 Computation Pruning thru END (ComPEND)
Processing unit
- 32 16-bit input activation registers
- 32-bit weight bits register
- 16-bit adder tree over the input activations
21 Computation Pruning thru END (ComPEND)
Memory controller: manages all kinds of memory-involved data transfers
- Weight blocks: off-chip memory -> STT-RAM; STT-RAM -> weight buffers (WBs); WBs -> weight registers in PUs
- Activation blocks: off-chip memory -> activation buffers (ABs); ABs -> registers in PUs; off-chip memory -> registers in PUs (FC layers: activation blocks are moved directly from off-chip memory to registers)
- Output activation blocks: global controller -> off-chip memory
22 Computation Pruning thru END (ComPEND)
Provider network
- Inputs: 32 x 9 x 16 bits; outputs: 32 x 9 x 16 bits
- Activation reuse in PUs during 2D convolution with 3x3 filters
- Reconfiguration with 9 types of connections for shuffling weights
[Figure: a 3x3 sliding window moving over the activations, with weights $W_{1,1} \ldots W_{3,3}$ paired to activations, shown for connection type 1 and connection type 2]
23 Computation Pruning thru END (ComPEND)
Global controller
- Pipeline list: id (filter ID), pos (bit position in 16-bit weights), head (current output of the adder tree)
- Entry board: id (filter ID), last pos (last position in the pipeline), DATA (partial sum)
- Decision unit (16 decision units): decides the final sum of products, which is zero if DATA is negative and DATA if the last position is the LSB
[Figure: pipeline list, entry board, MUX, shift and subtract logic, and decision units]
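The decision rule stated for the decision unit can be written as a small Python sketch. The `Entry` class, its field names, and the convention that bit position 0 is the LSB are assumptions made for illustration; only the two rules (zero if DATA is negative, DATA if the last position is the LSB) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

LSB = 0  # assumed numbering: position 0 is the least significant bit


@dataclass
class Entry:
    filter_id: int  # "id" on the entry board
    last_pos: int   # "last pos": bit position of the most recent step
    data: int       # "DATA": partial sum so far


def decide(entry: Entry) -> Optional[int]:
    """Final sum of products for an entry, or None if it is still pending."""
    if entry.data < 0:
        return 0           # negative partial sum: ReLU output is zero
    if entry.last_pos == LSB:
        return entry.data  # all bits processed: the partial sum is final
    return None            # more bit steps are needed


print(decide(Entry(filter_id=3, last_pos=9, data=-40)))
print(decide(Entry(filter_id=3, last_pos=0, data=17)))
print(decide(Entry(filter_id=3, last_pos=5, data=17)))
```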
24 Computation Pruning thru END (ComPEND)
Global controller: filling up the pipeline
- P1: the next bit in the bit-serial computation
- P2: a new sum of products that has not yet been entered into the pipeline
- P3: the next step of a sum of products whose prior step is still in the pipeline
Filter bit sequences: $F_p: (w_{i,B-1}\, w_{i,B-2}\, w_{i,B-3} \cdots w_{i,1}\, w_{i,0})$ and $F_q: (w_{j,B-1}\, w_{j,B-2}\, w_{j,B-3} \cdots w_{j,1}\, w_{j,0})$
[Figure: pipeline list, entry board, MUX, and decision unit, with a completed entry and the candidates P1, P2, and P3 marked]
25 Computation Pruning thru END (ComPEND)
Operation pipeline
- (1) Weight buffers -> (2) Provider network
- (2) Provider network -> (3) Processing unit array
- (3) Processing unit array -> (4) Global controller
[Block diagram: DRAM, STT-RAM, WBs, ABs, memory controller, provider network, and global controller, annotated with stages (1) to (4)]
26 Evaluation
- Pre-trained weights of the VGG-16 network and images from ImageNet ILSVRC-2012
- In-house cycle-accurate timing simulator written in C++, with DRAMSim2 for off-chip memory
- CACTI 6.5 to model SRAM
- NVSim for on-chip STT-RAM
- Synopsys Design Compiler with a TSMC 45nm technology library at 0.9 V to obtain timing, power, and area parameters for the PUs and the provider network
27 Evaluation
VGG-16 network
- We use 15 layers of the VGG-16 network as workloads, excluding layer F1. F1 is excluded because the total size of its input activations is too big.
- Inputs to C1 are raw data that can be negative, so the pruning scheme cannot be applied. C1 is implemented without ComPEND.
28 Evaluation
[Tables: configuration and area]
Peak throughput: 32 inputs x 16 PUs in a row x 9 rows x 1 GHz = 4.6 TOPS
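The peak-throughput figure can be reproduced with a few lines of Python. The sketch assumes 16 PUs per row and a 1 GHz clock, values inferred here because they are the ones that give 4.6 TOPS with 32 inputs and 9 rows, and it counts one operation per PU input per cycle; the function name is hypothetical.

```python
def peak_ops_per_second(inputs_per_pu, pus_per_row, rows, clock_hz):
    """Peak operations per second, counting one operation per PU input per cycle."""
    return inputs_per_pu * pus_per_row * rows * clock_hz


peak = peak_ops_per_second(32, 16, 9, 1_000_000_000)
print(f"{peak / 1e12:.1f} TOPS")  # prints "4.6 TOPS"
```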
29 Evaluation
Runtime: reduced by 16.62% on average for the 15 layers, compared to the runtime without ComPEND
[Figure: runtime for VGG-16 layers; left bars: without ComPEND, right bars: with ComPEND]
- MEM_STT: reads/writes between off-chip memory and STT-RAM
- STT_WB: reads/writes between STT-RAM and WB
- MEM_WB: reads/writes between off-chip memory and WB
- MEM_AB: reads/writes between off-chip memory and AB
- AB_PU: reads/writes between AB and registers in PUs
- RUN_PU: computation in PUs
30 Evaluation
Energy (dynamic and static) consumption: reduced by 23.5% on average for the 15 layers
[Figure: energy for VGG-16 layers; left bars: without ComPEND, right bars: with ComPEND]
- D/S_CTRL: global controller
- D/S_NET: provider network
- D/S_STT: STT-RAM
- D/S_AB: activation buffers
- D/S_WB: weight buffers
- D/S_PU: processing units
31 Evaluation
Power consumption, average over the 15 layers
- Without ComPEND: .2 Watt
- With ComPEND: .3 Watt
[Figure: power for VGG-16 layers]
32 Evaluation
Energy-delay product: ComPEND reduces EDP and ED²P by 36.2% and 46.8%, respectively, for the execution of the 15 layers
[Figure: EDP and ED²P for VGG-16 layers]
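As a consistency check, the EDP and ED²P reductions can be derived from the energy and runtime reductions. The sketch below assumes an energy reduction of 23.5% and a runtime reduction of 16.62% (the value that reproduces the reported EDP figures), and it assumes the products are formed from the overall energy and delay rather than averaged per layer; the function name is hypothetical.

```python
def combined_reduction(energy_reduction, runtime_reduction, delay_power=1):
    """Reduction in energy * delay**delay_power, given the separate reductions."""
    remaining = (1 - energy_reduction) * (1 - runtime_reduction) ** delay_power
    return 1 - remaining


edp = combined_reduction(0.235, 0.1662)
ed2p = combined_reduction(0.235, 0.1662, delay_power=2)
print(f"EDP reduced by {edp:.1%}, ED2P reduced by {ed2p:.1%}")
# prints "EDP reduced by 36.2%, ED2P reduced by 46.8%"
```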
33 Conclusion
- Proposed the concept of END (early negative detection) based on inverted two's complement
- Proposed an architecture that implements ComPEND
- Achieved 16.62% lower runtime and 23.5% lower energy consumption for inference
Future work
- Combining with other zero-skipping approaches
- Handling layers (say, F1 in VGG-16) that exceed the capacity of the architecture
34 THANK YOU
DeepLearning on FPGAs Introduction to FPGAs Sebastian Buschjäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 24, 2017 1 Recap: Convolution Observation 1 Even smaller images
More informationRAPIDNN: In-Memory Deep Neural Network Acceleration Framework
RAPIDNN: In-Memory Deep Neural Network Acceleration Framework Mohsen Imani, Mohammad Samragh, Yeseong Kim, Saransh Gupta, Farinaz Koushanfar and Tajana Rosing Computer Science and Engineering Department,
More informationKeras: Handwritten Digit Recognition using MNIST Dataset
Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30
More informationCache/Memory Optimization. - Krishna Parthaje
Cache/Memory Optimization - Krishna Parthaje Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology Suji Lee, Jongpil Jung, and Chong-Min Kyung Department of Electrical Engineering,KAIST
More informationDeep Learning Processing Technologies for Embedded Systems. October 2018
Deep Learning Processing Technologies for Embedded Systems October 2018 1 Neural Networks Architecture Single Neuron DNN Multi Task NN Multi-Task Vehicle Detection With Region-of-Interest Voting Popular
More informationXilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs
ilinx DNN Proceor An Inference Engine, Network Compiler Runtime for ilinx FPGA Rahul Nimaiyar, Brian Sun, Victor Wu, Thoma Branca, Yi Wang, Jutin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle,
More informationEND-TERM EXAMINATION
(Please Write your Exam Roll No. immediately) END-TERM EXAMINATION DECEMBER 2006 Exam. Roll No... Exam Series code: 100919DEC06200963 Paper Code: MCA-103 Subject: Digital Electronics Time: 3 Hours Maximum
More informationValue-driven Synthesis for Neural Network ASICs
Value-driven Synthesis for Neural Network ASICs Zhiyuan Yang University of Maryland, College Park zyyang@umd.edu ABSTRACT In order to enable low power and high performance evaluation of neural network
More informationCombinational Logic Use the Boolean Algebra and the minimization techniques to design useful circuits No feedback, no memory Just n inputs, m outputs
Combinational Logic Use the Boolean Algebra and the minimization techniques to design useful circuits No feedback, no memory Just n inputs, m outputs and an arbitrary truth table Analysis Procedure We
More informationAccelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh and Debbie Marr Accelerator Architecture Lab,
More informationmrna: Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator
mrna: Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, and Tushar Krishna Shanghai Jiao Tong
More informationHigh-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks. Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge
High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge International Workshop on Highly Efficient Neural Processing
More informationMemory Devices. Future?
Memory evices Small: Register file (group of numbered registers) Medium: SRAM (Static Random Access Memory) Large: RAM (ynamic Random Access Memory) Future? 1 Processor: ata Path Components 2 1 3 Instruction
More informationMethod for hardware implementation of a convolutional turbo code interleaver and a sub-block interleaver
Method for hardware implementation of a convolutional turbo code interleaver and a sub-block interleaver isclosed is a method for hardware implementation of a convolutional turbo code interleaver and a
More information