TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
|
|
- Clement Anthony
- 6 years ago
- Views:
Transcription
1 TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017
2 Deep Neural Networks (NNs) Unprecedented accuracy for challenging applications System perspective: compute and memory intensive o Challenging to execute efficiently on conventional systems Many efforts for optimized NN hardware Classification Recognition Control Prediction Optimization Multicores GPUs FPGAs ASICs Clusters 2
3 Deep Neural Networks (NNs) ifmaps filters ofmaps No Dog Ni * = No foreach b in batch Nb foreach ifmap u in Ni foreach ofmap v in No // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) Ni Convolutional layer (CONV) o Convolutions between fmaps Fully-Connected layer (FC) o Matrix multiplication O = I W foreach b in batch Nb foreach neuron x in Nx foreach neuron y in Ny // Matrix multiply O(y,b) += I(x,b) x W(x,y) + B(v) 3
4 Domain-Specific NN Accelerators Spatial architectures of PEs High computational efficiency o 100x performance and energy efficiency improvement o Low-precision arithmetic, dynamic pruning, static compression, Memory system? foreach b in batch Nb foreach ifmap u in Ni foreach ofmap v in No // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) foreach b in batch Nb foreach neuron x in Nx foreach neuron y in Ny // Matrix multiply O(y,b) += I(x,b) x W(x,y) + B(v) Main Memory Global Buffer PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Reg File ALU Processing Element 4
5 Memory Challenges for Large NNs Large memory footprints and high bandwidth requirements o Deep networks, large layers, complex neuron structures o More efficient computing requires higher data bandwidth Limit the performance scalability for future NNs Common solutions o Large on-chip buffers less area available for compute elements o Multiple DRAM channels high dynamic power, significant leakage 5
6 Power (W) Bandwidth (GBps) Memory Challenges for Large NNs Example: a state-of-the-art accelerator with 400 PEs needs o 1.5 MB SRAM buffer, > 70% of accelerator area o Four LPDDR3 x32 chips with > 25 GBps o 45% power in SRAM and DRAM even with low-power technologies Number of PEs Series1 Series2 Series3 Series4 Series
7 3D Memory + NN Acceleration Opportunities o High bandwidth at low access energy o Abundant parallelism (vaults, banks) Bank TSVs Key questions o What hardware architecture best exploits the high bandwidth and low energy? o How to best schedule and partition NN computations in software? Schedule: dataflow across memory hierarchy Partition: parallelism across vaults Vault (Channel) DRAM Die Logic Die Micron s Hybrid Memory Cube 7
8 TETRIS NN acceleration with 3D memory o Improves performance scalability by 4.1x over 2D o Improves energy efficiency by 1.5x over 2D Hardware architecture o Rebalance resources between PEs and buffers o In-memory accumulation High performance & low energy Alleviate bandwidth pressure Software techniques o Analytical dataflow scheduling o Hybrid partitioning Optimize buffer use Efficient parallel processing 8
9 TETRIS Hardware Architecture
10 TETRIS Architecture Associate one NN engine with each vault o PE array, local register files, and a shared global buffer NoC + routers for accesses to remote vaults All vaults can process NN computations in parallel PE PE PE PE To local vault To remote vault Router Memory Controller Global Buffer PE PE PE PE PE PE PE PE Vault DRAM Die PE PE PE PE Logic Die 10
11 Normalized Energy Normalized Runtime Resource Balancing Larger PE arrays with smaller SRAM buffers o Limited area with 3D integration o High data bandwidth needs more PEs o Low access energy & sequential access pattern PEs with 133 kb buffer (area 1:1) better performance and 6 energy # PEs / buffer capacity 0 Series1 Series2 Series3 Series4 Series5 11
12 In-Memory Accumulation Small buffer more swap to memory higher bandwidth Move simple accumulation logic close to DRAM banks o 2x bandwidth reduction for output data o Carefully choose logic location to avoid impact on DRAM array layout Memory Memory + W X Y Y W X ΔY PE array PE array Y += W * X ΔY = W * X 12
13 Scheduling and Partitioning for TETRIS
14 Dataflow Scheduling Critical for maximizing on-chip data reuse to save energy Ordering: loop blocking and reordering foreach b in batch Nb foreach ifmap u in Ni Locality in global buffer foreach ofmap v in No Non-convex, exhaustive search // 2D conv O(v,b) += I(u,b) * W(u,v) + B(v) Mapping: execute 2D conv on PE array Regfiles and array interconnect Row stationary [Chen et al., ISCA 16] 14
15 TETRIS Bypass Ordering Few data reuse opportunities in TETRIS small buffers IW bypass, OW bypass, IO bypass o Use buffer only for one stream for maximum benefit o Bypass buffer for the other two to sacrifice their reuse OW bypass ordering ifmaps ofmaps filters Chunk 0 Chunk 1 Off-chip Chunk 2 On-chip Global buffer Reg files 1. Read 1 ifmap chunk into gbuf 2. Stream ofmaps and filters to regf 3. Move ifmaps from gbuf to regf 4. Convolve 15
16 TETRIS Bypass Ordering Near-optimal schedules o With 2% difference from schedules derived from exhaustive search Analytically derived o Closed-form solution o No need for time-consuming exhaustive search NN Runtime Gap (w.r.t. optimal) Energy Gap (w.r.t. optimal) AlexNet 1.48 % 1.86 % ZFNet 1.55 % 1.83 % VGG % 0.20 % VGG % 0.16 % ResNet 2.91 % 0.78 % min A DRAM = 2 N b N o S o t i + N b N i S i + N o N i S w t b N b N i S s.t. t b t i S buf i 1 t b N b, 1 t i N i 16
17 NN Partitioning Process NN computations in parallel in all vaults Fmap partitioning o Divide a fmap into tiles o Each vault processes one tile Vault 0 Vault 1 Layer i Layer i+1 Minimum remote accesses o Prior work exclusively used fmap partitioning Vault 2 Vault 3 17
18 NN Partitioning Process NN computations in parallel in all vaults Output partitioning o Partition all ofmaps into groups o Each vault processes one group Vault 2 Layer i Layer i+1 Vault 3 Better filter weight reuse Fewer total memory accesses Vault 0 Vault 1 18
19 TETRIS Hybrid Partitioning Combine fmap partitioning and output partitioning Total energy = NoC energy + DRAM energy o Balance between minimizing remote accesses and total DRAM accesses Difficulties o Design space exponential to # layers Greedy algorithm reduces to be linear to # layers o Involved dataflow scheduling to determine total DRAM accesses Bypass ordering to quickly estimate total DRAM accesses 19
20 TETRIS Evaluation
21 Methodology Five state-of-the-art NNs o AlexNet, ZFNet, VGG16, VGG19, ResNet o MB total memory footprint for each NN o Up to 152 layers in ResNet, one of the largest NNs today 2D and 3D accelerators with one or more NN engines o 2D engine: 16 x 16 PEs, 576 kb buffer, one LPDDR3 channel 8.5 mm 2, 51.2 Gops/sec Bandwidth-constrained o 3D engine: 14 x 14 PEs, 133 kb buffer, one HMC vault 3.5 mm 2, 39.2 Gops/sec Area-constrained 21
22 Normalized Runtime Single-Engine Comparison Up to 37% performance improvement with TETRIS o Due to higher bandwidth despite smaller PE array 1.2 Large NNs benefit more! AlexNet ZFNet VGG16 VGG19 ResNet 22
23 Normalized Energy Single-Engine Comparison 35 40% energy reduction with TETRIS o Smaller on-chip buffer, better scheduling 1.2 From SRAM & DRAM, and static AlexNet ZFNet VGG16 VGG19 ResNet Series5 Series4 Series3 Series2 Series1 23
24 Normalized Runtime Multi-Engine Comparison 4 2D engines: 34 mm 2, pin constrained (4 LPDDR3 channels) 16 3D engines: 56 mm 2, area constrained (16 HMC vaults) 4.1x performance gain 2x compute density AlexNet ZFNet VGG16 VGG19 ResNet 24
25 Normalized Energy Multi-Engine Comparison 1.5x lower energy o 1.2x from better scheduling and partitioning 4x computation only costs 2.7x power Series5 Series4 Series3 Series2 Series AlexNet ZFNet VGG16 VGG19 ResNet 25
26 TETRIS Summary A scalable and efficient NN accelerator using 3D memory o 4.1x performance and 1.5x energy benefits over 2D baseline Hardware features o PE/buffer area rebalancing o In-memory accumulation Software features o Analytical dataflow scheduling o Hybrid partitioning 26
27 Thanks! Questions?
Scaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationEyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen 1, Joel Emer 1, 2, Vivienne Sze 1 1 MIT 2 NVIDIA 1 Contributions of This Work A novel energy-efficient
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationDNN Accelerator Architectures
DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)
More informationUSING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS
... USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS... Yu-Hsin Chen Massachusetts Institute of Technology Joel Emer Nvidia and Massachusetts Institute of Technology Vivienne
More informationA Method to Estimate the Energy Consumption of Deep Neural Networks
A Method to Estimate the Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze Massachusetts Institute of Technology, Cambridge, MA, USA {tjy, yhchen, jsemer, sze}@mit.edu
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Efficient Edge Computing for Deep Neural Networks and Beyond Vivienne Sze In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationSort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot
Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationDNN Dataflow Choice Is Overrated
DNN Dataflow Choice Is Overrated Xuan Yang *, Mingyu Gao *, Jing Pu *, Ankita Nayak *, Qiaoyi Liu *, Steven Emberton Bell *, Jeff Ou Setter *, Kaidi Cao, Heonjae Ha *, Christos Kozyrakis * and Mark Horowitz
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationEyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks
Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks Yu-Hsin Chen, Joel Emer and Vivienne Sze EECS, MIT Cambridge, MA 239 NVIDIA Research, NVIDIA Westford, MA 886 {yhchen,
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationHow to Estimate the Energy Consumption of Deep Neural Networks
How to Estimate the Energy Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze MIT 1 Problem of DNNs Recognition Smart Drone AI Computation DNN 15k 300k OP/Px DPM 0.1k
More informationParana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory
JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang,
More informationHyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, Yiran Chen Duke University, University of Southern California {linghao.song,
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationIndex. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
More informationFive Emerging DRAM Interfaces You Should Know for Your Next Design
Five Emerging DRAM Interfaces You Should Know for Your Next Design By Gopal Raghavan, Cadence Design Systems Producing DRAM chips in commodity volumes and prices to meet the demands of the mobile market
More informationOvercoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics
Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing
More informationDeep ST, Ultra Low Power Artificial Neural Network SOC in 28 FD-SOI. Nitin Chawla,
Deep learning @ ST, Ultra Low Power Artificial Neural Network SOC in 28 FD-SOI Nitin Chawla, Senior Principal Engineer and Senior Member of Technical Staff at STMicroelectronics Outline Introduction Chip
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationMODELING AND ANALYZING DEEP LEARNING ACCELERATOR DATAFLOWS WITH MAESTRO
MODELING AND ANALYZING DEEP LEARNING ACCELERATOR DATAFLOWS WITH MAESTRO Michael Pellauer*, Hyoukjun Kwon** and Tushar Krishna** *Architecture Research Group, NVIDIA **Georgia Institute of Technology ACKNOWLEDGMENTS
More informationIn Live Computer Vision
EVA 2 : Exploiting Temporal Redundancy In Live Computer Vision Mark Buckler, Philip Bedoukian, Suren Jayasuriya, Adrian Sampson International Symposium on Computer Architecture (ISCA) Tuesday June 5, 2018
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationMonolithic 3D IC Design for Deep Neural Networks
Monolithic 3D IC Design for Deep Neural Networks 1 with Application on Low-power Speech Recognition Kyungwook Chang 1, Deepak Kadetotad 2, Yu (Kevin) Cao 2, Jae-sun Seo 2, and Sung Kyu Lim 1 1 School of
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationArm s First-Generation Machine Learning Processor
Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive
More informationHybrid Memory Cube (HMC)
23 Hybrid Memory Cube (HMC) J. Thomas Pawlowski, Fellow Chief Technologist, Architecture Development Group, Micron jpawlowski@micron.com 2011 Micron Technology, I nc. All rights reserved. Products are
More informationDNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,
More informationMulticore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.
Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationPRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong
More informationDeep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction
More informationSTLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip
STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,
More informationBrainchip OCTOBER
Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationFPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep
More informationHPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads. Natalia Vassilieva, Sergey Serebryakov
HPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads Natalia Vassilieva, Sergey Serebryakov Deep learning ecosystem today Software Hardware 2 HPE s portfolio for deep learning Government,
More informationarxiv: v2 [cs.cv] 3 May 2016
EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han Xingyu Liu Huizi Mao Jing Pu Ardavan Pedram Mark A. Horowitz William J. Dally Stanford University, NVIDIA {songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu
More informationHello Edge: Keyword Spotting on Microcontrollers
Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of
More informationA Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models
A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,
More informationM.Tech Student, Department of ECE, S.V. College of Engineering, Tirupati, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 High Performance Scalable Deep Learning Accelerator
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationarxiv: v1 [cs.cv] 11 Feb 2018
arxiv:8.8v [cs.cv] Feb 8 - Partitioning of Deep Neural Networks with Feature Space Encoding for Resource-Constrained Internet-of-Things Platforms ABSTRACT Jong Hwan Ko, Taesik Na, Mohammad Faisal Amir,
More informationEfficient Methods for Deep Learning
Efficient Methods for Deep Learning Song Han Stanford University Sep 2016 Background: Deep Learning for Everything Source: Brody Huval et al., An Empirical Evaluation, arxiv:1504.01716 Source: leon A.
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationMorph: Flexible Acceleration for 3D CNN-based Video Understanding
Morph: Flexible Acceleration for 3D CNN-based Video Understanding Kartik Hegde, Rohit Agrawal, Yulun Yao, Christopher W. Fletcher University of Illinois at Urbana-Champaign {kvhegde2, rohita2, yuluny2,
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationDeep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm
Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional
More informationdirect hardware mapping of cnns on fpga-based smart cameras
direct hardware mapping of cnns on fpga-based smart cameras Workshop on Architecture of Smart Cameras Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON Cordoba, June
More informationUsing a Scalable Parallel 2D FFT for Image Enhancement
Introduction Using a Scalable Parallel 2D FFT for Image Enhancement Yaniv Sapir Adapteva, Inc. Email: yaniv@adapteva.com Frequency domain operations on spatial or time data are often used as a means for
More informationSoftware Defined Hardware
Software Defined Hardware For data intensive computation Wade Shen DARPA I2O September 19, 2017 1 Goal Statement Build runtime reconfigurable hardware and software that enables near ASIC performance (within
More informationDEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA
DEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA J.Jayalakshmi 1, S.Ali Asgar 2, V.Thrimurthulu 3 1 M.tech Student, Department of ECE, Chadalawada Ramanamma Engineering College, Tirupati Email
More informationNeural Computer Architectures
Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations
More informationSCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks Angshuman Parashar Minsoo Rhu Anurag Mukkara Antonio Puglielli Rangharajan Venkatesan Brucek Khailany Joel Emer Stephen W. Keckler
More informationSwitched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network
Switched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E., Tsinghua
More informationTraining Deep Neural Networks (in parallel)
Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors as
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationMemory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9
Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional
More informationScaling Deep Learning on Multiple In-Memory Processors
Scaling Deep Learning on Multiple In-Memory Processors Lifan Xu, Dong Ping Zhang, and Nuwan Jayasena AMD Research, Advanced Micro Devices, Inc. {lifan.xu, dongping.zhang, nuwan.jayasena}@amd.com ABSTRACT
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationPoseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters Hao Zhang Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jianliang Wei, Pengtao Xie,
More informationA Communication-Centric Approach for Designing Flexible DNN Accelerators
THEME ARTICLE: Hardware Acceleration A Communication-Centric Approach for Designing Flexible DNN Accelerators Hyoukjun Kwon, High computational demands of deep neural networks Ananda Samajdar, and (DNNs)
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationDesign Space Exploration of FPGA-Based Deep Convolutional Neural Networks
Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationGoogle Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationHardware for Deep Learning
Hardware for Deep Learning Bill Dally Stanford and NVIDIA Stanford Platform Lab Retreat June 3, 2016 HARDWARE AND DATA ENABLE DNNS 2 THE NEED FOR SPEED Larger data sets and models lead to better accuracy
More informationEmergence of the Memory Centric Architectures
Emergence of the Memory Centric Architectures Balint Fleischer Chief Scientist AI is Everywhere Business Consumer Advising the CEO External Sensing: Market trends, Competitive environment, Customer sentiment,
More informationCollaborative Execution of Deep Neural Networks on Internet of Things Devices
Collaborative Execution of Deep Neural Networks on Internet of Things Devices Ramyad Hadidi Jiashen Cao Micheal S. Ryoo Hyesoon Kim We target IoT devices, the number of which have already outnumbered the
More informationPACE: Power-Aware Computing Engines
PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationCNN optimization. Rassadin A
CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationDeep Learning Hardware Acceleration
* Deep Learning Hardware Acceleration Jorge Albericio + Alberto Delmas Lascorz Patrick Judd Sayeh Sharify Tayler Hetherington* Natalie Enright Jerger Tor Aamodt* + now at NVIDIA Andreas Moshovos Disclaimer
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationBanshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!
Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationChain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks
Chain-NN: An Energy-Efficient D Chain Architecture for Accelerating Deep Convolutional Neural Networks Shihao Wang, Dajiang Zhou, Xushen Han, Takeshi Yoshimura Graduate School of Information, Production
More informationUsing Machine Learning for Classification of Cancer Cells
Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.
More informationFPGA & Hybrid Systems in the Enterprise Drivers, Exemplars and Challenges
Bob Blainey IBM Software Group 27 Feb 2011 FPGA & Hybrid Systems in the Enterprise Drivers, Exemplars and Challenges Workshop on The Role of FPGAs in a Converged Future with Heterogeneous Programmable
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More information40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011
40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011 Allan Cantle President & Founder www.nallatech.com Company Overview ISI + Nallatech + Innovative
More informationTowards Performance Modeling of 3D Memory Integrated FPGA Architectures
Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,
More information