Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network
|
|
- May Little
- 6 years ago
- Views:
Transcription
1 Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint Research Institution 3 University of California, Los Angeles 1
2 Convolutional Neural Network Automotive Safety Image Stanford u Image Classification & Object Detection State-of-the-art Algorithm Highest Correctness in Large Scale Visual Recognition Challenge 2012 & 2013 & 2014 & 2015 Widely used in industry & academy 2
3 Convolutional Neural Network input image feature maps feature maps feature maps feature maps output category Convolution layer Max-pooling layer Convolution layer Max-pooling layer Classifier Max-pooling is optional Input feature map Output feature map 3
4 Convolutional Neural Network input image feature maps feature maps feature maps feature maps output category Convolutional layers account for over 90% computation Convolution Max-pooling Convolution Max-pooling Classifier layer layer layer layer [1] A. Krizhevsky, etc. Imagenet classification with deep convolutional neural networks. NIPS [2] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. ICANN 2014 Max-pooling is optional Input feature map Output feature map 4
5 Related Work 1. CNP: An FPGA-based processor for convolutional networks. FPL 2009; 2. A massively parallel coprocessor for convolutional neural networks. ASAP 2009; 3. A programmable parallel accelerator for learning and classification. PACT 2010; 4. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. ISCA 2010; 5. Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM 2011; 6. A massively parallel digital learning processor. NIPS 2008; 7. NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision. CVPRW 2011; 8. Accelerating deep neural networks on mobile processor with embedded programmable logic. (Poster) NIPS 2013; 9. Memory-Centric Accelerator Design for Convolutional Neural Networks. ICCD
6 Hardware Computing on FPGA Computation Resource... PE-1 PE-2 PE-n Challenge: Resrc. Util. Solution: Unroll & Pipeline Interconnect On-chip memory buffer1 buffer2 Challenge: BRAM Limits Solution: Loop Tiling Off-chip Bus External Memory Challenge: BW Limits Solution: Data reuse Challenge: Co-optimization, -- match computation & communication Solution: Roofline Model Main Contribution 6
7 FPGA design in Roofline Model Computational Perf. Computation Resource... PE-1 PE-2 PE-n GFLOP/S Interconnect On-chip memory buffer1 Off-chip Bus buffer2 Computation To Communication Ratio External Memory FLOP / (Mem. Byte Access) Bandwidth Gbyte/S 7
8 Computational Performance FPGA design in Roofline Model GFLOP/S Design Bandwidth FLOP/(Memory byte access) Computation To Communication (CTC) Ratio 8
9 Computational Attainable Performance Attainable Performance GFLOP/S Computation Bandwidth Roof Computational Communication Roof Computation 1 B 2 3 Communication A Compt. Perf. (2 & 3 < 1) Attain. Perf. (2 & 3 > 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 9
10 Attainable Performance Attainable Performance GFLOPS Computational Roof 1 Bandwidth Roof 2 3 Attain. Perf. (2 & 3 > 1) Attain. Perf. (2 & 3 < 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 10
11 Attainable Performance Attainable Performance GFLOPS Computational Roof 1 Bandwidth Roof Optimization Target: 2 3 Find a Design with Maximized Attainable Performance Attain. Perf. (2 & 3 > 1) Attain. Perf. (2 & 3 < 1) FLOP/(DRAM byte access) Computation To Communication (CTC) Ratio 11
12 Loop tiling 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { 5 for(trr=row; trr<min(row+tr, R); trr++) { 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { 7 for(too=to; too<min(to+tn, M); too++) { 1 for(row=0; row<r; row++) { 8 for(tii=ti; tii<(ti+tn, N); tii++) { 2 for(col=0; col<c; col++) { 9 3 for(i=0; for(to=0; i<k; to<m; i++) to++) { { for(j=0; for(ti=0; j<k; ti<n; j++) ti++) { { output_fm[to][row][col] for(i=0; i<k; i++) { += }}}} Off-chip Data Transfer: KMemory Access Optimization }}}}}} On-chip Data: Computation Optimization 6 for(j=0; j<k; j++) { }}}}}} Input feature map Output feature map weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; w ij 1l (Tile loop) (Tile loop) (Tile loop) (Tile loop) R, C, M, N, K, S are all configuration parameters of the convolutional layer 12
13 Loop tiling 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { 5 for(trr=row; trr<min(row+tr, R); trr++) { 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { 7 for(too=to; too<min(to+tn, M); too++) { 8 for(tii=ti; tii<(ti+tn, N); tii++) { 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} }}}} Off-chip Data Transfer: Memory Access Optimization On-chip Data: Computation Optimization (Tile loop) (Tile loop) (Tile loop) (Tile loop) 13
14 Computation Optimization 5 for(trr=row; trr<min(row+tr, R); trr++) 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) 7 for(too=to; too<min(to+tn, M); too++) 8 for(tii=ti; tii<(ti+tn, N); tii++) 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += }}}}}} 5 for(i=0; i<k; i++) { 6 for(j=0; j<k; j++) { 7 for(trr=row; trr<min(row+tr, R); trr++) { 8 for(tcc=col; tcc<min(tcc+tc, C); tcc++) { #pragma HLS pipeline 9 for(too=to; too<min(to+tm, M); too++) { #pragma HLS UNROLL 10 for(tii=ti; tii<(ti+tn, N); tii++) { #pragma HLS UNROLL output_fm[to][row][col] += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} (Tile loops) weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; 14
15 Computation Optimization Tn 5 for(i=0; i<k; i++) { (Tile loops) 5 for(trr=row; 6 for(j=0; trr<min(row+tr, j<k; j++) { R); trr++) 6 for(tcc=col; 7 for(trr=row; tcc<min(tcc+tc, trr<min(row+tr, C); tcc++) R); trr++) { 7 8 for(too=to; for(tcc=col; too<min(to+tn, tcc<min(tcc+tc, M); too++) C); tcc++) { 8 #pragma for(tii=ti; HLS pipeline tii<(ti+tn, N); tii++) 9 for(too=to; too<min(to+tm, M); too++) { 9 for(i=0; i<k; i++) { #pragma Tm HLS UNROLL for(j=0; for(tii=ti; j<k; tii<(ti+tn, j++) { N); tii++) { #pragma HLS output_fm[to][row][col] UNROLL += weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; output_fm[to][row][col] += }}}}}} weights[to][ti][i][j]*input_fm[ti][s*row+i][s*col+j]; }}}}}} 15
16 Computational Performance Tn Tm Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn) DSPs
17 Computational Performance u Consider Memory Access! How? Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn) DSPs
18 Memory Access Optimization 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { } }}} load output feature map S: foo(output_fm(to, row, col)); store output feature map 5 for(trr=row; trr<min(row+tr, R); trr++) 6 for(tcc=col; tcc<min(tcc+tc, C); tcc++) 7 for(too=to; too<min(to+tn, M); too++) 8 for(tii=ti; tii<(ti+tn, N); tii++) 9 for(i=0; i<k; i++) { 10 for(j=0; j<k; j++) { output_fm[to][row][col] += weights[to][ti][i][j]* input_fm[ti][s*row+i][s*col+j]; }}}}}} 18
19 Memory Access Optimization 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { } }}} load output feature map S: foo(output_fm(to, row, col)); store output feature map 1 for(row=0; row<r; row+=tr) { 2 for(col=0; col<c; col+=tc) { 3 for(to=0; to<m; to+=tm) { 4 for(ti=0; ti<n; ti+=tn) { }}} } load output feature map S: foo(output_fm(to, row, col)); store output feature map 19
20 Computational Performance x x x x Computation To Communication Ratio x x GFLOP/S x x Computation To Communication (CTC) Ratio FLOP/(DRAM byte access) Design 1 Design 2 Design 3 Design 4 Output (Tm) Input (Tn)
21 Design Space Legal Solutions of Tm & Tn: N=128 N= # of PE = 450 Communication: # of memory access methods Legal Solutions: 3 Total Legal Solutions:
22 Attainable performance (GFLOPS) Design Space Exploration 120 Bandwidth roof 100 Computational roof A D C 80 A CTC Ratio Ratio of (FLOP)/(DRAM byte access) 22
23 FIFO FIFO System Overview u AXI4Lite Timer AXI4 Interrupt controller Accelerator Estimated On-board run Difference Layer ms 7.67 ms ~5% Layer ms 5.35 ms ~5% Data Transfer Engine0 Data Transfer Engine1 Layer 3 MicroBlaze 3.42 ms 3.79 ms ~9% UART Layer ms 2.88 ms ~10% Layer ms 1.93 ms ~10% Overall ms ms ~7% VC707 Vivado Vivado HLS Memory Interface Controller DDR3 interface On-chip Off-chip Main Memory 23
24 FIFO FIFO Accelerator Interrupt controller Accelerator Timer UART MicroBlaze Data Transfer Engine0 Data Transfer Engine1 AXI4Lite AXI4 Memory Interface Controller DDR3 interface On-chip Off-chip Main Memory 24
25 Experimental Results: vs. CPU Energy 90x 20x 1 thread -O3 16 threads -O3 FPGA x 5x Speedup 1 thread -O3 16 threads -O3 FPGA CPU Xeon E (32nm) 16 cores 2.2 GHz FPGA Virtex7-485t (28nm) 448 PEs 100MHz gcc O3 OpenMP 3.0 Vivado Vivado HLS
26 Experimental Results: vs. Other FPGAs FPL2009 ASAP2009 PACT2010 ISCA2010 ICCD2013 Our Impl. Virtex 4 Virtex 5 Virtex 5 Virtex 5 Virtex 6 Virtex 7 125MHz 115MHz 125MHz 200MHz 150MHz 100MHz 5.25 GOPS 6.74 GOPS 7 GOPS 16 GOPS 17 GOPS 61.6 GOPS 26
27 Experimental Results: vs. Other FPGAs GFLOPS/Slice Normalized Performance ~80% ASAP2009 FPL2009 PACT2010 ISCA2010 ICCD2013 Our Impl. 27
28 Attainable performance (GFLOPS) Experimental Results A Bandwidth roof Computation Communication Comm. Time Covered C A Computation Communication Comm. Time NOT Covered CTC Ratio Ratio of (FLOP)/(DRAM byte access) 28
29 Conclusions An accelerator for convolutional neural network Contribution: Accurate Analytical model for computation & communication Find the best solution with roofline model Result: On-board run implementation ~3.5x better performance over other FPGA implementations ~80% performance/area improvement 29
30 Thank You Q & A 30
Introduction to Neural Networks
ECE 5775 (Fall 17) High-Level Digital Design Automation Introduction to Neural Networks Ritchie Zhao, Zhiru Zhang School of Electrical and Computer Engineering Rise of the Machines Neural networks have
More informationMachine Learning on FPGAs
Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications
More informationDesign Exploration of FPGA-based Accelerators for Deep Neural Networks
Design Exploration of FPGA-based Accelerators for Deep Neural Networks Guangyu Sun CECA@Peking University gsun@pku.edu.cn 1 Applications of Deep Neural Networks Unmanned Vehicle Speech & Audio Text & Language
More informationOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks Chen Zhang 1 chen.ceca@pku.edu.cn Yijin Guan 1 guanyijin@pku.edu.cn Peng Li 2 pengli@cs.ucla.edu Bingjun Xiao 2 xiao@cs.ucla.edu
More informationDesign Space Exploration of FPGA-Based Deep Convolutional Neural Networks
Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such
More informationDesign Space Exploration of FPGA-Based Deep Convolutional Neural Networks
Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Mohammad Motamedi, Philipp Gysel, Venkatesh Akella and Soheil Ghiasi Electrical and Computer Engineering Department, University
More informationECE 5775 High-Level Digital Design Automation Fall DNN Acceleration on FPGAs
ECE 5775 High-Level Digital Design Automation Fall 2018 DNN Acceleration on FPGAs Announcements HW 2 due Friday at 11:59pm First student-led seminar is on Tuesday 10/16 Format: 18-min talk plus 2-min Q&A
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationEnergy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster
Energy-Efficient CNN Implementation on a Deeply Pipelined Cluster Chen Zhang 1, Di Wu, Jiayu Sun 1, Guangyu Sun 1,3, Guojie Luo 1,3, and Jason Cong 1,,3 1 Center for Energy-Efficient Computing and Applications,
More informationExploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs Qingcheng Xiao *1, Yun Liang 1, Liqiang Lu 1, Shengen Yan,3 and Yu-Wing Tai 3 1 Center for Energy-efficient
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationSODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou
SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou University of California, Los Angeles 1 What is stencil computation? 2 What is Stencil Computation? A sliding
More informationA Survey of FPGA-based Accelerators for Convolutional Neural Networks
1 A Survey of FPGA-based Accelerators for Convolutional Neural Networks Sparsh Mittal Abstract Deep convolutional neural networks (CNNs) have recently shown very high accuracy in a wide range of cognitive
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationMinimizing Computation in Convolutional Neural Networks
Minimizing Computation in Convolutional Neural Networks Jason Cong and Bingjun Xiao Computer Science Department, University of California, Los Angeles, CA 90095, USA {cong,xiao}@cs.ucla.edu Abstract. Convolutional
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationEyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen 1, Joel Emer 1, 2, Vivienne Sze 1 1 MIT 2 NVIDIA 1 Contributions of This Work A novel energy-efficient
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationdirect hardware mapping of cnns on fpga-based smart cameras
direct hardware mapping of cnns on fpga-based smart cameras Workshop on Architecture of Smart Cameras Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON Cordoba, June
More informationFlexible On-chip Memory Architecture for DCNN Accelerators
Flexible On-chip Memory Architecture for DCNN Accelerators Arash Azizimazreah Lizhong Chen Oregon State University, OR, 97331, USA {azizimaa, chenliz}@oregonstate.edu ABSTRACT Recent studies show that
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationCaffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks Chen Zhang 1,2,3, Zhenman Fang 2, Peipei Zhou 2, Peichen Pan 3, and Jason Cong 1,2,3 1 Center for Energy-Efficient
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationarxiv: v1 [cs.dc] 1 Dec 2018
DeCoILFNet: Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator Akanksha Baranwal 1,, Ishan Bansal 1,, Roopal Nahar 1, K.Madhava Krishna 1 arxiv:1901.02774v1 [cs.dc] 1 Dec 2018 Abstract
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationOpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch
OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationCaffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. X, NO. X, DEC 2016 1 Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationDEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA
DEEP LEARNING ACCELERATOR UNIT WITH HIGH EFFICIENCY ON FPGA J.Jayalakshmi 1, S.Ali Asgar 2, V.Thrimurthulu 3 1 M.tech Student, Department of ECE, Chadalawada Ramanamma Engineering College, Tirupati Email
More informationM.Tech Student, Department of ECE, S.V. College of Engineering, Tirupati, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 High Performance Scalable Deep Learning Accelerator
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationInstruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA
Instruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA Abstract In recent years, Convolutional Neural Network (CNN) has been widely applied in computer vision tasks. FPGAs
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More informationTed N. Booth. DesignLinx Hardware Solutions
Ted N. Booth DesignLinx Hardware Solutions September 2015 Using Vivado HLS for Video Algorithm Implementation for Demonstration and Validation Agenda Project Description HLS Lessons Learned Summary Project
More informationFPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep
More informationPolyhedral-Based Data Reuse Optimization for Configurable Computing
Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationDNN Accelerator Architectures
DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationTETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural
More informationFPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks
FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks Yijin Guan 1, Zhihang Yuan 1, Guangyu Sun 1,3, Jason Cong 2,3,1 1 Center for Energy-Efficient Computing and Applications,Peking
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationPRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong
More informationUSING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS
... USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS... Yu-Hsin Chen Massachusetts Institute of Technology Joel Emer Nvidia and Massachusetts Institute of Technology Vivienne
More informationExploring OpenCL Memory Throughput on the Zynq
Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationHigh-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication
High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication Erik H. D Hollander Electronics and Information Systems Department Ghent University, Ghent, Belgium Erik.DHollander@ugent.be
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationarxiv: v2 [cs.ar] 15 May 2018
[DL] A Survey of FPGA Based Neural Network Accelerator arxiv:1712.08934v2 [cs.ar] 15 May 2018 KAIYUAN GUO, SHULIN ZENG, JINCHENG YU, YU WANG AND HUAZHONG YANG, Tsinghua University, China Recent researches
More informationDeploying Deep Neural Networks in the Embedded Space
Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationMidterm Exam. Solutions
Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a complex design in software Software vs. Hardware Trade-offs Improve Performance Improve Energy Efficiency
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationMemory-Centric Accelerator Design for Convolutional Neural Networks
Memory-Centric Accelerator Design for Convolutional Neural Networks Maurice Peemen, Arnaud A. A. Setio, Bart Mesman and Henk Corporaal Department of Electrical Engineering, Eindhoven University of Technology,
More informationRAID prototype system introduction. SATA-IP RAID prototype system for Xilinx FPGA. 3 September 2018 Design Gateway Page 1.
RAID prototype system introduction Ver1.3E - RAID prototype system for Xilinx FPGA 3 September 2018 Design Gateway Page 1 System Outline RAID prototype for the latest Xilinx FPGA Use RAID adapter board
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationSwitched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network
Switched by Input: Power Efficient Structure for RRAMbased Convolutional Neural Network Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E., Tsinghua
More informationNeural Computer Architectures
Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations
More informationImproving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network Jialiang Zhang and Jing Li Department of Electrical and Computer Engineering University of Wisconsin-Madison
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationPorting Performance across GPUs and FPGAs
Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE
More informationSmartShuttle: Optimizing Off-Chip Memory Accesses for Deep Learning Accelerators
SmartShuttle: Optimizing Off-Chip emory Accesses for Deep Learning Accelerators Jiajun Li, Guihai Yan, Wenyan Lu, Shuhao Jiang, Shijun Gong, Jingya Wu, Xiaowei Li State Key Laboratory of Computer Architecture,
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationScale-Out Acceleration for Machine Learning
Scale-Out Acceleration for Machine Learning Jongse Park Hardik Sharma Divya Mahajan Joon Kyung Kim Preston Olds Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab Georgia Institute of Technology
More informationFPGA Entering the Era of the All Programmable SoC
FPGA Entering the Era of the All Programmable SoC Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates on Cost Page 3 Design Cost Estimated Chip
More informationGables: A Roofline Model for Mobile SoCs
Gables: A Roofline Model for Mobile SoCs Mark D. Hill, Wisconsin & Former Google Intern Vijay Janapa Reddi, Harvard & Former Google Intern HPCA, Feb 2019 Outline Motivation Gables Model Example Balanced
More informationApplication Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles
Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles Muhsen Owaida, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich {firstname.lastname}@inf.ethz.ch
More informationAccelerator-Rich CMPs: From Concept to Real Hardware
Accelerator-Rich CMPs: From Concept to Real Hardware Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou Computer Science Department University of California,
More informationReconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation
Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation Hongxiang Fan, Ho-Cheung Ng, Shuanglong Liu, Zhiqiang Que, Xinyu Niu, Wayne Luk Dept. of Computing,
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationMassive Parallel QCD Computing on FPGA Accelerator with Data-Flow Programming
Massive Parallel QCD Computing on FPGA Accelerator with Data-Flow Programming Thomas Janson and Udo Kebschull Infrastructure and Computer Systems in Data Processing (IRI) Goethe University Frankfurt Germany
More informationImplementation of Deep Convolutional Neural Net on a Digital Signal Processor
Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm
More informationA 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationExploring Automatically Generated Platforms in High Performance FPGAs
Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical
More informationEND-TO-END CHINESE TEXT RECOGNITION
END-TO-END CHINESE TEXT RECOGNITION Jie Hu 1, Tszhang Guo 1, Ji Cao 2, Changshui Zhang 1 1 Department of Automation, Tsinghua University 2 Beijing SinoVoice Technology November 15, 2017 Presentation at
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationVivado HLx Design Entry. June 2016
Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationA Framework for Generating High Throughput CNN Implementations on FPGAs
A Framework for Generating High Throughput CNN Implementations on FPGAs Hanqing Zeng University of Southern California Ming Hsieh Department of Electrical Engineering zengh@usc.edu ABSTRACT Chi Zhang University
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationTowards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks (CNNs) on FPGAs
Preprint: Accepted at 22nd IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE 19) Towards Design Space Exploration and Optimization of Fast Algorithms for Convolutional Neural Networks
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationA flexible memory shuffling unit for image processing accelerators
Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationAn Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA
An Adaptable Deep Learning Accelerator Unit (DLAU) for FPGA N. Sireesha 1 & P.Malleswari 2 1PG Scholar, Dept of ECE, Narsaraopeta Institute of Technology, Yellamanda, Narsaraopeta, Guntur district, Andhra
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More information