ECE 5775 High-Level Digital Design Automation, Fall 2018: DNN Acceleration on FPGAs


1 ECE 5775 High-Level Digital Design Automation Fall 2018 DNN Acceleration on FPGAs

2 Announcements HW 2 due Friday at 11:59pm. First student-led seminar is on Tuesday 10/16. Format: 18-min talk plus 2-min Q&A (Q = Questions & Quiz). The group winning the most audience votes gets a bonus point. In-class midterm next Thursday, 10/18: 75 mins, open book, open notes, closed Internet.

3 Outline Midterm coverage CNN acceleration on FPGAs 2

4 Midterm (20%) Topics covered Specialized computing FPGAs C-based synthesis Control flow graph and SSA Scheduling Resource sharing Pipelining 3

5 Key Topics (1) Algorithm basics Time complexity, esp. big-O notation Graphs Trees, DAGs, topological sort BDDs, timing analysis FPGAs LUTs and LUT mapping C-based synthesis Arbitrary precision and fixed-point types Basic C-to-RTL mapping Key HLS optimizations to improve design performance

6 Key Topics (2) Compiler analysis Control data flow graph Dominance relation SSA Scheduling TCS & RCS algorithms: ILP, list scheduling, SDC Operation chaining, frequency/latency/resource constraints Ability to devise a simple scheduling algorithm to optimize certain design metric 5

7 Key Topics (3) Resource sharing Conflict and compatibility graphs Left edge algorithm Ability to determine minimum resource usage in # of functional units and/or registers, given a fixed schedule Pipelining Dependence types Ability to determine minimum II given a code snippet Modulo scheduling concepts: MII, RecMII, ResMII 6

8 Example: What's the MII?
[Figure: a small data flow graph of add and subtract operations, with one loop-carried dependence edge of distance 1]
Three AddSub units available; two multipliers available. Assume 1-cycle operations, no chaining.
What is the minimum II (MII) for pipelining the above data flow graph? Write down both ResMII and RecMII.
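Recall the standard modulo-scheduling definitions (general formulas, not the answer to this exercise):

MII = max(ResMII, RecMII)
ResMII = max over resource types r of ceil( # operations using r / # available units of r )
RecMII = max over dependence cycles c of ceil( total latency around c / total dependence distance around c )

As a purely illustrative instance: four add/sub operations sharing three AddSub units would give ResMII = ceil(4/3) = 2, and a cycle of two 1-cycle operations closed by a distance-1 loop-carried edge would give RecMII = ceil(2/1) = 2.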

9 Modern DNNs are Computationally Expensive
[Figure a: number of operations (x 10^9) vs. top-5 error for DNNs in academia with and without structural optimization, including AlexNet, OverFeat, VGG-16/19, GoogLeNet, Inception, ResNet, ResNeXt, DPN, Xception, NASNet, PolyNet, MobileNet, and ShuffleNet]
[Figure b: performance density (GOP/s per mm^2) of GPUs, FPGAs, and ASICs (e.g., DianNao, DaDianNao, ShiDianNao, Cambricon, Eyeriss, EIE, TPU, Myriad 2, Stratix/Arria FPGAs, and GTX/Tesla GPUs) across process nodes from 65 nm to 12 nm, compared against the Moore's law trend]
X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for Edge Inference of Neural Networks," Nature Electronics, vol. 1, Apr 2018.
DNNs require an enormous amount of compute. For example, ResNet-50 performs roughly 7.7 billion operations to classify one image.

10 DNNs are Moving to Dedicated Hardware
Blue chips: Apple, Google, Intel, Microsoft
Startups: Cambricon, Cerebras, DeePhi (acquired by Xilinx), Graphcore
Academia: EIE/ESE [Han ISCA'16, FPGA'17], Eyeriss [Chen ISCA'16], FINN [Umuroglu FPGA'17], Minerva [Reagen ISCA'16]

11 A Hyper-Active Body of Research
Recent publications on "Neural Networks on Silicon" in premier computer hardware conferences (>150 papers since 2015). Data source:
This lecture focuses on FPGA-based inference accelerators for convolutional neural networks (CNNs)

12 Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang 1, Peng Li 2, Guangyu Sun 1,3, Yijin Guan 1, Bingjun Xiao 2, Jason Cong 2,3,1
1 Center for Energy-Efficient Computing and Applications, Peking University; 2 Computer Science Department, University of California, Los Angeles; 3 PKU/UCLA Joint Research Institute in Science and Engineering
FPGA'15, Feb 2015

13 Main Contributions 1. Analysis of the different sources of parallelism in the convolution kernel of a CNN 2. Quantitative performance modeling of the hardware design space using the Roofline method 3. Design and implementation of a CNN accelerator for FPGA using Vivado HLS, evaluated on AlexNet 12

14 Convolutional Layer
[Figure: I input feature maps are convolved with K x K filters to produce O output feature maps of size R x C]
An output pixel is connected to its neighboring region on each input feature map.
All pixels on an output feature map use the same filter weights.

15-18 Parallelism in the Convolutional Layer
[Figure: same as slide 14]
Four main sources of parallelism:
1. Across input feature maps
2. Across output feature maps
3. Across different output pixels (i.e., filter positions)
4. Across filter pixels

19 Parallelism in the Code
[Figure: same as slide 14]
1 for (row=0; row<R; row++) {
2  for (col=0; col<C; col++) {          // loop over output pixels
3   for (to=0; to<O; to++) {            // loop over output feature maps
4    for (ti=0; ti<I; ti++) {           // loop over input feature maps
5     for (ki=0; ki<K; ki++) {
6      for (kj=0; kj<K; kj++) {         // loop over filter pixels
        output_fm[to][row][col] +=
          weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
}}}}}}
S is the stride
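For reference, a minimal self-contained C version of this loop nest is sketched below; the dimensions, stride, data type, and test values are illustrative assumptions, not the accelerator's configuration.

#include <stdio.h>

#define R 8   /* output rows (illustrative) */
#define C 8   /* output cols (illustrative) */
#define O 4   /* output feature maps */
#define I 3   /* input feature maps */
#define K 3   /* filter size */
#define S 1   /* stride */

/* Naive convolutional layer: one multiply-accumulate per
   (output pixel, output map, input map, filter pixel) tuple.
   Assumes output_fm is zero-initialized by the caller. */
void conv_layer(float output_fm[O][R][C],
                float weights[O][I][K][K],
                float input_fm[I][S*(R-1)+K][S*(C-1)+K])
{
  for (int row = 0; row < R; row++)
    for (int col = 0; col < C; col++)        /* loop over output pixels       */
      for (int to = 0; to < O; to++)         /* loop over output feature maps */
        for (int ti = 0; ti < I; ti++)       /* loop over input feature maps  */
          for (int ki = 0; ki < K; ki++)
            for (int kj = 0; kj < K; kj++)   /* loop over filter pixels       */
              output_fm[to][row][col] +=
                weights[to][ti][ki][kj] *
                input_fm[ti][S*row + ki][S*col + kj];
}

int main(void)
{
  static float in[I][S*(R-1)+K][S*(C-1)+K], w[O][I][K][K], out[O][R][C];
  in[0][0][0] = 1.0f;
  w[0][0][0][0] = 2.0f;                          /* trivial test data */
  conv_layer(out, w, in);
  printf("out[0][0][0] = %f\n", out[0][0][0]);   /* expect 2.0 */
  return 0;
}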

20 Challenges for FPGA
We can't just unroll all the loops due to limited FPGA resources; we must choose the right code transformations to exploit the parallelism in a resource-efficient way.
[Figure: accelerator template with an array of PEs (PE-1 ... PE-n) and interconnect, on-chip memory buffers, and an off-chip bus to external memory]
Limited compute and routing resources → techniques: loop unrolling, pipelining, interchange
Limited on-chip storage → technique: loop tiling
Limited off-chip bandwidth → techniques: data reuse, compute-memory balance

21 Loop Tiling
[Figure: same as slide 14]
1 for (row=0; row<R; row++) {
2  for (col=0; col<C; col++) {          // loop over pixels in an output map
3   for (to=0; to<O; to++) {
4    for (ti=0; ti<I; ti++) {
5     for (ki=0; ki<K; ki++) {
6      for (kj=0; kj<K; kj++) {
        output_fm[to][row][col] +=
          weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
}}}}}}

22 Loop Tiling
[Figure: the output feature maps are divided into Tr x Tc tiles]
1 for (row=0; row<R; row+=Tr) {
2  for (col=0; col<C; col+=Tc) {        // loop over different tiles
3   for (to=0; to<O; to++) {
4    for (ti=0; ti<I; ti++) {
5     for (trr=row; trr<min(row+Tr, R); trr++) {
6      for (tcc=col; tcc<min(col+Tc, C); tcc++) {   // loops over pixels in each tile
7       for (ki=0; ki<K; ki++) {
8        for (kj=0; kj<K; kj++) {
          output_fm[to][trr][tcc] +=
            weights[to][ti][ki][kj] * input_fm[ti][S*trr+ki][S*tcc+kj];
}}}}}}}}
Offloading just the inner loops requires only a small portion of the data to be stored on the FPGA chip

23 Code with Loop Tiling
1  for (row=0; row<R; row+=Tr) {
2   for (col=0; col<C; col+=Tc) {
     // CPU portion (loops 1-4): maximize memory reuse
3    for (to=0; to<O; to+=To) {
4     for (ti=0; ti<I; ti+=Ti) {
       // software: write input feature map + filters to on-chip buffers
       // FPGA portion (loops 5-10): maximize computational performance
5      for (trr=row; trr<min(row+Tr, R); trr++) {
6       for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7        for (too=to; too<min(to+To, O); too++) {
8         for (tii=ti; tii<min(ti+Ti, I); tii++) {
9          for (ki=0; ki<K; ki++) {
10          for (kj=0; kj<K; kj++) {
             output_fm[too][trr][tcc] +=
               weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
      }}}}}}
     }
     // software: read output feature map from the on-chip buffer
   }}}

24 Optimizing for On-Chip Performance
5 for (trr=row; trr<min(row+Tr, R); trr++) {
6  for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7   for (too=to; too<min(to+To, O); too++) {
8    for (tii=ti; tii<min(ti+Ti, I); tii++) {
9     for (ki=0; ki<K; ki++) {
10     for (kj=0; kj<K; kj++) {
        output_fm[too][trr][tcc] +=
          weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

25 Optimizing for On-Chip Performance
Reorder the two filter loops (9 and 10) to the top level:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
7      for (too=to; too<min(to+To, O); too++) {
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

26 Optimizing for On-Chip Performance
Pipeline the loop nest by adding a pipeline directive inside the tcc loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

27 Optimizing for On-Chip Performance
Unroll the too loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
#pragma HLS unroll
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}

28 Optimizing for On-Chip Performance
Also unroll the tii loop:
9  for (ki=0; ki<K; ki++) {
10  for (kj=0; kj<K; kj++) {
5    for (trr=row; trr<min(row+Tr, R); trr++) {
6     for (tcc=col; tcc<min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7      for (too=to; too<min(to+To, O); too++) {
#pragma HLS unroll
8       for (tii=ti; tii<min(ti+Ti, I); tii++) {
#pragma HLS unroll
          output_fm[too][trr][tcc] +=
            weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
}}}}}}
Number of cycles to execute the above loop nest = K × K × Tr × Tc + L ≈ Tr × Tc × K², where L is the pipeline depth (# of pipeline stages; II = 1)
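As a quick sanity check with illustrative numbers (not taken from the paper): for K = 3, Tr = Tc = 16, and a pipeline depth of L = 10, the loop nest takes about 3 × 3 × 16 × 16 + 10 = 2314 cycles, so the pipeline fill time L is negligible next to the K² × Tr × Tc term.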

29 Optimizing for On-Chip Performance
[Figure: the generated hardware, an array of multiply-accumulate processing engines fed from on-chip buffers; the loop nest is the same as on slide 28]
Performance and size of each PE are determined by the tile factors Ti and To.
Number of data transfers is determined by Tr and Tc.

30 Design Space Complexity Challenge: the available optimizations create a huge space of possible designs. What is the optimal loop order? What tile size to use for each loop? Implementing and testing each design by hand will be slow and error-prone, and some designs will exceed the on-chip compute/memory capacity. Solution: performance modeling + automated design space exploration

31 Performance Modeling We calculate the following design metrics: total number of operations (FLOP), which depends on the CNN model parameters; total external memory access (bytes), which depends on the CNN weight and activation sizes; and total execution time (sec), which depends on the hardware architecture (e.g., tile factors To and Ti). Resource constraints are ignored for now.

32 Performance Modeling
Total operations (FLOP) = 2 × O × I × R × C × K²
Execution time = number of cycles × clock period
Number of cycles ≈ (O/To) × (I/Ti) × (R/Tr) × (C/Tc) × (Tr × Tc × K²) = (O/To) × (I/Ti) × R × C × K²
External memory accesses = a_i × B_i + a_w × B_w + a_o × B_o
Size of input fmap buffer: B_i = Ti × (Tr + K - 1) × (Tc + K - 1), with stride = 1
Size of output fmap buffer: B_o = To × Tr × Tc
Size of weight buffer: B_w = To × Ti × K²
External access (trip) counts: a_o = (R/Tr) × (C/Tc) × (O/To); a_i = a_w = (I/Ti) × a_o

33 Performance Modeling
[Figure: same accelerator template as slide 20, with PEs, on-chip buffers, and an off-chip bus to external memory]
Computational throughput (GFLOP/s) = total number of operations / total execution time
Computation-to-communication (CTC) ratio (FLOP per byte accessed) = total number of operations / total external memory access
Required bandwidth (GB/s) = computational throughput / CTC ratio
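To make the model concrete, the C sketch below evaluates these formulas for candidate tile sizes and brute-forces the design space; the layer shape, clock frequency, and resource budgets are illustrative assumptions, and pipeline fill time is ignored.

#include <stdio.h>
#include <math.h>

/* Illustrative layer shape and hardware budgets (assumptions, not the paper's numbers). */
static const double R = 224, C = 224, O = 64, I = 3, K = 3;   /* conv layer        */
static const double FREQ_GHZ = 0.1;          /* 100 MHz clock (assumed)             */
static const double MAX_MACS = 448;          /* parallel multiply-add budget        */
static const double MAX_BUF_WORDS = 1.0e6;   /* on-chip buffer budget (words)       */

int main(void) {
  const double total_ops = 2 * O * I * R * C * K * K;          /* FLOP              */
  double best_gflops = 0, best_ctc = 0;
  int bTo = 0, bTi = 0, bTr = 0, bTc = 0;

  for (int To = 1; To <= O; To++)
    for (int Ti = 1; Ti <= I; Ti++)
      for (int Tr = 4; Tr <= R; Tr *= 2)
        for (int Tc = 4; Tc <= C; Tc *= 2) {
          /* On-chip buffer sizes in words (stride 1). */
          double Bi = Ti * (Tr + K - 1) * (Tc + K - 1);
          double Bo = To * Tr * Tc;
          double Bw = To * Ti * K * K;
          if (To * Ti > MAX_MACS || 2 * (Bi + Bo + Bw) > MAX_BUF_WORDS)
            continue;                       /* over budget (buffers double-buffered) */
          /* Off-chip transfer trip counts. */
          double ao = ceil(R/Tr) * ceil(C/Tc) * ceil(O/To);
          double ai = ao * ceil(I/Ti), aw = ai;
          double ext_bytes = 4 * (ai*Bi + aw*Bw + ao*Bo);      /* 32-bit words      */
          /* Compute cycles: number of tiles x (Tr*Tc*K*K) per tile. */
          double cycles = ceil(O/To) * ceil(I/Ti) * ceil(R/Tr) * ceil(C/Tc)
                          * (Tr * Tc * K * K);
          double gflops = total_ops / cycles * FREQ_GHZ;       /* GFLOP/s           */
          double ctc    = total_ops / ext_bytes;               /* FLOP per byte     */
          if (gflops > best_gflops || (gflops == best_gflops && ctc > best_ctc)) {
            best_gflops = gflops; best_ctc = ctc;
            bTo = To; bTi = Ti; bTr = Tr; bTc = Tc;
          }
        }

  printf("best tile <To,Ti,Tr,Tc> = <%d,%d,%d,%d>\n", bTo, bTi, bTr, bTc);
  printf("throughput = %.1f GFLOP/s, CTC = %.1f FLOP/byte, required BW = %.2f GB/s\n",
         best_gflops, best_ctc, best_gflops / best_ctc);
  return 0;
}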

34 Roofline Method [1]
[Figure: a design plotted as a point, with computational throughput (GFLOP/s) on the y-axis and computation-to-communication (CTC) ratio (FLOP/byte) on the x-axis; the slope of the line from the origin to the design point is its required bandwidth (GB/s)]
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM, 2009.

35 Roofline Method [1]
[Figure: the roofline plot; the horizontal line is the computational roof (max device throughput) and the sloped line through the origin is the bandwidth roof (device memory bandwidth); designs above the roofline either exceed on-chip resources or are bandwidth-bottlenecked, so the attainable throughput lies on or below the roofline]
[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM, 2009.
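Stated as the usual roofline relation (a hedged summary, not a quote from the slide): attainable throughput = min(computational roof, CTC ratio × bandwidth roof). For example, under an assumed 4.5 GB/s bandwidth roof, a design with a CTC ratio of 10 FLOP/byte can sustain at most 45 GFLOP/s, no matter how many PEs it instantiates.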

36 Design Space Exploration with Roofline
[Figure: the explored design points plotted under the bandwidth roof and the computational roof, with two candidate designs A and B highlighted]

37 Hardware Implementation All values in floating-point Only handles Conv layers, not dense or pooling Virtex7-485t 100MHz 61.6 GOPS 18.6 W power 36

38 Experimental Results
[Figure: energy and speedup of a 1-thread CPU (-O3), a 16-thread CPU (-O3), and the FPGA accelerator]
CPU: Intel Xeon (32 nm), 16 cores, 2.2 GHz, gcc -O3, OpenMP 3.0
FPGA: Virtex7-485t (28 nm), 448 PEs, 100 MHz, Vivado and Vivado HLS

39 An OpenCL Deep Learning Accelerator on Arria 10
Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu
Intel Corporation (formerly Altera), Toronto, Canada
FPGA'17, Feb 2017

40 Main Contributions Reducing external memory bandwidth usage by: 1. Storing all intermediate feature maps in on-chip buffers 2. Image batching for dense layers Optimizing the convolution arithmetic using the Winograd Transformation A compute-bound implementation on Arria 10 whose energy efficiency matches the Titan X GPU 39

41 OpenCL Programming

Conventional C++:
void sum(float* a, float* b, float* c) {
  for (int i = 0; i < 100; i++)
    c[i] = a[i] * b[i];
}

OpenCL:
kernel void sum(global float* a, global float* b, global float* c) {
  int gid = get_global_id(0);
  c[gid] = a[gid] * b[gid];
}

The FPGA'15 paper used C++ with Xilinx Vivado HLS to generate RTL; this work uses OpenCL and the Intel FPGA SDK to generate RTL.
C++ is a sequential programming model using loops, where inter-loop-iteration parallelism is implicit (requires unrolling).
OpenCL is a parallel programming model using multithreaded kernels, where inter-iteration parallelism is explicit: each thread obtains an independent ID.

42 Data Placement
[Figure, top: layer data in off-chip storage is streamed to a PE through the FPGA's on-chip storage; bottom: layer data is held entirely in the FPGA's on-chip storage]
Previous papers stored layer data off-chip: there is insufficient on-chip storage to hold all the data for a layer (input + output).
This work uses an Arria 10 FPGA device, which has enough storage to keep the data on-chip (for the conv layers in AlexNet); double-buffering is used to store input + output.

43 Arithmetic Optimizations
On FPGAs, DSP blocks (used for fixed-point multiplies) are typically the bottleneck resource.
Consider a 1-dimensional convolution with output length 2 and filter length 3, denoted F(2,3):
  o0 = i0·f0 + i1·f1 + i2·f2
  o1 = i1·f0 + i2·f1 + i3·f2
Computed directly, F(2,3) takes 6 multiplies.
In the 70s, Shmuel Winograd proved that F(m,r) can be computed with a lower bound of only m + r - 1 multiplies [1].
[1] S. Winograd, Arithmetic Complexity of Computations, SIAM, 1980.

44 Winograd Transform

Naive approach (6 unique multiplies):
  o0 = i0·f0 + i1·f1 + i2·f2
  o1 = i1·f0 + i2·f1 + i3·f2

Winograd approach (4 unique multiplies):
  o0 = y0 + y1 + y2
  o1 = y1 - y2 - y3
  where each yi = di·gi (1 multiply per yi), with
  d0 = i0 - i2        g0 = f0
  d1 = i1 + i2        g1 = (f0 + f1 + f2) / 2
  d2 = i2 - i1        g2 = (f0 - f1 + f2) / 2
  d3 = i1 - i3        g3 = f2
  The di and gi are linearly mapped from the inputs ii and the filter weights fi.
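A small C program can check the transform above against the naive convolution; the input and filter values are arbitrary.

#include <stdio.h>

/* Check F(2,3): the Winograd form uses 4 multiplies instead of 6. */
int main(void) {
  double i[4] = {1.0, 2.0, 3.0, 4.0};      /* arbitrary input window */
  double f[3] = {0.5, -1.0, 2.0};          /* arbitrary filter       */

  /* Naive 1-D convolution: 6 multiplies. */
  double o0 = i[0]*f[0] + i[1]*f[1] + i[2]*f[2];
  double o1 = i[1]*f[0] + i[2]*f[1] + i[3]*f[2];

  /* Winograd: transform data (d) and filter (g), then 4 multiplies. */
  double d0 = i[0] - i[2], d1 = i[1] + i[2];
  double d2 = i[2] - i[1], d3 = i[1] - i[3];
  double g0 = f[0],                     g1 = (f[0] + f[1] + f[2]) / 2;
  double g2 = (f[0] - f[1] + f[2]) / 2, g3 = f[2];
  double y0 = d0*g0, y1 = d1*g1, y2 = d2*g2, y3 = d3*g3;
  double w0 = y0 + y1 + y2;
  double w1 = y1 - y2 - y3;

  printf("naive:    %.3f %.3f\n", o0, o1);   /* 4.500 6.000 */
  printf("winograd: %.3f %.3f\n", w0, w1);   /* 4.500 6.000 */
  return 0;
}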

45 Comparison to Previous Papers
[Table comparing Zhang 2015 (Virtex7 VX485t, 32-bit float), Qiu 2016 (Zynq XC7Z045, 16-bit fixed), and this paper (Arria 10, 16-bit fixed) on clock (MHz), performance (GOP/s), power (W), and energy efficiency (GOP/J); the Zhang 2015 design is the implementation from slide 37: 100 MHz, 61.6 GOP/s, 18.6 W]
Massive increase in performance due to the Winograd transform and storing all feature maps on-chip.
Among the first to break the TeraOP/s barrier on FPGA.

46 Experimental Evaluation
[Table comparing the Arria 10 DLA (20 nm), Nvidia Titan X (28 nm), and Nvidia M4 (28 nm) on throughput (img/s), power (W), and energy efficiency (img/s/W); the benchmark app is AlexNet]
Results show FPGAs can compete with GPUs in energy efficiency.
The Titan X numbers ignore communication overhead and use random data instead of real images (highly optimistic).

47 Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru Zhang 1
1 Electrical and Computer Engineering, Cornell University; 2 Electronics Engineering and Computer Science, Peking University; 3 Electrical Engineering, University of California, Los Angeles; 4 Computer Science and Engineering, University of California San Diego
FPGA'17, Feb 2017

48 CNNs with Reduced Numerical Precision Hardware architects widely apply fixed-point optimization for CNN acceleration. Motivation: both neural nets and image/video apps naturally tolerate small amounts of noise. Approach: take a trained floating-point model and apply quantization; 16- or 8-bit fixed-point has been shown to be practical. Can we go even lower by training a reduced-numerical-precision CNN from the ground up?
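As a minimal sketch of the "quantize a trained floating-point model" step, here is one simple symmetric fixed-point quantizer; the bit widths, rounding, and saturation policy are illustrative assumptions, not the scheme of any paper cited here.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Quantize x to signed fixed-point with total_bits total width and
   frac_bits fractional bits, with rounding and saturation. */
static int32_t to_fixed(float x, int total_bits, int frac_bits) {
  int32_t max_val =  (1 << (total_bits - 1)) - 1;
  int32_t min_val = -(1 << (total_bits - 1));
  long v = lroundf(x * (float)(1 << frac_bits));
  if (v > max_val) v = max_val;
  if (v < min_val) v = min_val;
  return (int32_t)v;
}

int main(void) {
  /* 8 total bits, 6 fractional bits (roughly an ap_fixed<8,2>-style format) */
  float w = 0.73f;
  int32_t q = to_fixed(w, 8, 6);
  printf("%f -> %d (dequantized %f)\n", w, (int)q, q / 64.0 /* 2^frac_bits */);
  return 0;
}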

49 Aggressively Quantized CNNs ML research papers: BinaryConnect [NIPS] Dec 2015, BNN [arXiv] Mar 2016, Ternary-Net [arXiv] May 2016, XNOR-Net [ECCV] Oct 2016, HWGQ [CVPR] Jul 2017, LR-Net [arXiv] Oct 2018, and many more! Near state-of-the-art on MNIST, CIFAR-10, and SVHN at the time of publication. [1] Matthieu Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint, Feb 2016.

50 CNN vs. BNN
CNN: Input Map * Weights = Output Map
BNN: Input Map (binary) * Weights (binary) = x_i (integer) → batch normalization → binarization → Output Map (binary)
Key differences:
1. Inputs are binarized (-1 or +1)
2. Weights are binarized (-1 or +1)
3. Results are binarized after batch normalization
Batch normalization: y_i = ((x_i - μ) / sqrt(σ² + ε)) · γ + β
Binarization: z_i = +1 if y_i ≥ 0, and -1 otherwise
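A minimal C sketch of the per-element output path described above (batch normalization followed by binarization); the parameter values in the test are illustrative.

#include <math.h>
#include <stdio.h>

/* Batch-normalize an integer accumulator value x and binarize it to +/-1.
   mu, sigma2, gamma, beta, eps are the layer's batch-norm parameters. */
static int bnn_activation(int x, float mu, float sigma2,
                          float gamma, float beta, float eps) {
  float y = ((float)x - mu) / sqrtf(sigma2 + eps) * gamma + beta;
  return (y >= 0.0f) ? +1 : -1;
}

int main(void) {
  /* x = 5 with mu=0, sigma2=1, gamma=1, beta=0 -> positive -> +1 */
  printf("%+d\n", bnn_activation(5, 0.f, 1.f, 1.f, 0.f, 1e-5f));
  return 0;
}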

51 Advantages of BNN
1. Floating point ops replaced with binary logic ops
   b1   b2   b1 · b2        b1   b2   b1 XOR b2
   +1   +1     +1            0    0       0
   +1   -1     -1            0    1       1
   -1   +1     -1            1    0       1
   -1   -1     +1            1    1       0
   Encode {+1, -1} as {0, 1} → multiplies become XORs
   Conv/dense layers do dot products → XOR and popcount
   Operations can map to the LUT fabric as opposed to DSPs
2. Binarized weights may reduce total model size
   But note that fewer bits per weight may be offset by having more weights
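A small C example of the XOR-and-popcount dot product, using the slide's encoding of +1 as 0 and -1 as 1; the bit packing and the GCC/Clang popcount builtin are implementation choices, not from the paper.

#include <stdint.h>
#include <stdio.h>

/* Dot product of two length-n vectors with elements in {+1,-1}, packed one bit
   per element (+1 -> 0, -1 -> 1). The element-wise product maps to XOR, so
   dot = n - 2 * popcount(a XOR b). */
static int bnn_dot(uint64_t a, uint64_t b, int n) {
  uint64_t mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
  return n - 2 * __builtin_popcountll((a ^ b) & mask);
}

int main(void) {
  uint64_t x = 0x4;   /* bits 0..3 = 0,0,1,0  ->  {+1, +1, -1, +1} */
  uint64_t w = 0x9;   /* bits 0..3 = 1,0,0,1  ->  {-1, +1, +1, -1} */
  /* expected: (+1)(-1) + (+1)(+1) + (-1)(+1) + (+1)(-1) = -2 */
  printf("dot = %d\n", bnn_dot(x, w, 4));
  return 0;
}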

52 BNN CIFAR-10 Architecture [2]
[Figure: feature map dimensions shrink from 32x32 to 16x16, 8x8, and 4x4 while the number of feature maps grows]
6 conv layers, 3 dense layers, 3 max pooling layers
All conv filters are 3x3
First conv layer takes in floating-point input
13.4 Mbits total model size (after hardware optimizations)
[2] M. Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv preprint, Feb 2016.

53 BNN Accelerator Design Goals Target low-power embedded applications Design must be resource efficient to fit a small device Execute layers sequentially on a single module Optimize for batch size of 1 Store all feature maps on-chip Binarization makes feature maps smaller Weights are streamed from off-chip storage Synthesize RTL from C++ source 52

54 BNN Accelerator Architecture
[Figure: datapath in which input words pass through BitSel units and a variable-width line buffer into an array of convolvers; f_in input and f_out output streams are accumulated in an integer buffer and then pooled, batch-normalized, and binarized into output words; conv weights are streamed in]
Challenges:
Many diverse sources of parallelism (across/within images, feature maps, subword)
Design must handle various layer types with different-sized feature maps
Slow interface between accelerator and general-purpose memory system
Our solution:
Highly parallel and pipelined architecture with a parameterized number of convolvers
Novel variable-width line buffer ensures the pipeline is fully utilized
Careful memory layout and BitSel unit enable word-by-word data processing instead of pixel-by-pixel

55 BNN HLS Design
User writes and tests in C++; the CPU-FPGA interface is automatically synthesized (by Xilinx SDSoC).
Significant reduction in verification time: the BNN RTL takes days to simulate.

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (i = 0; i < n_input_words; i++) {
#pragma HLS pipeline
  // read input word, update linebuffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);

  // update the weights each time we
  // begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];

  // perform conv across linebuffer
  for (c = 0; c < LINE_BUF_COLS; c++) {
#pragma HLS unroll
    outbuf[i % words_per_fmap][c] +=
      conv(c, linebuf, wts);
  }
}

HLS code for part of the convolver unit

56 FPGA Implementation
Misc. HW optimizations: 1. Quantized the input image and batch norm params 2. Removed additive biases 3. Simplified the batch norm computation
BNN model test error: claimed in paper [2]: 11.40%; Python out-of-the-box [2]: 11.58%; C++ optimized model: 11.19%; accelerator: 11.19%
Platforms: FPGA: ZedBoard with Xilinx Zynq-7000; mGPU: Jetson TK1 embedded GPU board; CPU: Intel Xeon multicore processor; GPU: NVIDIA Tesla K40
[Table: runtime per image (ms), speedup (mGPU 1x, CPU 6x, GPU 120x, FPGA 15x), power (W), and images/sec/Watt for the four platforms]
R. Zhao et al., "Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs," International Symposium on Field-Programmable Gate Arrays, Feb 2017.

57 Additional Useful Resources Recent papers on neural networks on silicon Tutorial on hardware architectures for DNNs Landscape of neural network inference accelerators Most cited deep learning papers (since 2012) 56

58 Acknowledgements This tutorial contains/adapts materials developed by Ritchie Zhao (PhD student at Cornell) Authors of the following papers Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks (FPGA 15, PKU-UCLA) An OpenCL Deep Learning Accelerator on Arria 10 (FPGA 17, Intel) 57
